Deep Learning for Vulnerability and Attack Detection on Web Applications: A Systematic Literature Review

Web applications are the best Internet-based solution to provide online web services, but they also bring serious security challenges. Thus, enhancing web applications security against hacking attempts is of paramount importance. Traditional Web Application Firewalls based on manual rules and traditional Machine Learning need a lot of domain expertise and human intervention and have limited detection results faced with the increasing number of unknown web attacks. To this end, more research work has recently been devoted to employing Deep Learning (DL) approaches for web attacks detection. We performed a Systematic Literature Review (SLR) and quality analysis of 63 Primary Studies (PS) on DL-based web applications security published between 2010 and September 2021. We investigated the PS from different perspectives and synthesized the results of the analyses. To the best of our knowledge, this study is the first of its kind on SLR in this field. The key findings of our study include the following. (i) It is fundamental to generate standard real-world web attacks datasets to encourage effective contribution in this field and to reduce the gap between research and industry. (ii) It is interesting to explore some advanced DL models, such as Generative Adversarial Networks and variants of Encoders–Decoders, in the context of web attacks detection as they have been successful in similar domains such as networks intrusion detection. (iii) It is fundamental to bridge expertise in web applications security and expertise in Machine Learning to build theoretical Machine Learning models tailored for web attacks detection. (iv) It is important to create a corpus for web attacks detection in order to take full advantage of text mining in DL-based web attacks detection models construction. (v) It is essential to define a common framework for developing and comparing DL-based web attacks detection models.
This SLR is intended to improve research work in the domain of DL-based web attacks detection, as it covers a significant number of research papers and identifies the key points that need to be addressed in this research field. Such a contribution is helpful as it allows researchers to compare existing approaches and to exploit the proposed future work opportunities.


Introduction and Background
Due to the extensive use of websites and web applications, web vulnerabilities are continuously growing. A survey conducted in 2019 [1] found that nine out of ten web applications are vulnerable, that sensitive data breaches are possible on 68% of web applications, and that network intrusion was caused by unauthorized access to web servers in 8% of cases. The vast and varied use of the Internet makes it a prime target for hackers. The main goal of web vulnerabilities detection techniques is to protect websites and web applications from cyber-attacks such as Cross-Site Scripting (XSS) and SQL injection. The subsections below present the necessary background for understanding the remainder of this paper, including the general architecture of web-based applications, types of web vulnerabilities, and different web vulnerabilities prevention and detection methods.

Web Applications Architecture
Web-based applications are the essential network-based solution to offer standard web services. The development of these applications is based on client-side and server-side development. The server side involves a web server, a web application, and a database server; it utilizes backend technologies including .NET, PHP, and JEE (Jakarta Enterprise Edition). The client side runs in the user's web browser with front-end languages, including CSS/HTML, JavaScript, etc. The two sides are usually interconnected via the HTTP protocol. Figure 1 presents the server-side and client-side web application architecture. Web applications have become an integral part of individuals' daily life because of their accessibility and convenience. However, this increased popularity is a double-edged sword. Indeed, web-based applications are the main highway for attackers to jeopardize critical services in vital sectors such as healthcare, education, banking, and e-commerce.

Web Vulnerabilities
A vulnerability is a weakness, glitch, or loophole in web application development. An exploit occurs when the vulnerability is exploited and the attack succeeds. Based on the research work [2], web application vulnerabilities can be classified into three categories (Figure 2). First, improper input validation refers to an incorrect validation and sanitization of user input. SQL injection and Cross-Site Scripting (XSS) are examples of web attacks caused by improper input validation. Second, improper session management refers to a web session that is not secured correctly, so that the application cannot tell whether web requests are malicious once they are linked with a valid session identifier. Cross-Site Request Forgery (CSRF) and session hijacking are examples of web attacks caused by improper session management. Finally, improper authorization and authentication implies a logic flaw in the enforcement of access control policies or in the authentication functions. Broken access control is one of the web attacks that could happen if the web application does not correctly manage authentication and authorization procedures.

Web Vulnerabilities Countermeasures
Numerous researchers have proposed different methodologies to counter web vulnerabilities. Figure 3 presents the main web applications security approaches. These methods can be used jointly at different stages of the web application development life-cycle, and they can be implemented and placed at either the client side or the server side of web applications.

Secure Programming
It is a set of rules and good practices enabling the development of secure web applications. The secure coding standards include query parametrization (i.e., query parameters are replaced with placeholders and parameter values are supplied at execution time), input validation (i.e., checking whether the input meets a set of criteria, such as a string containing no standalone single quotation marks), and sanitization of user input (i.e., modifying the input to ensure that it is valid). The OWASP project (Open Web Application Security Project) has proposed different standards to allow developers to follow secure practices when they are coding web applications (ASVS (Application Security Verification Standard) [3], ESAPI (Enterprise Security API), SAMM (OWASP Software Assurance Maturity Model) [4]). Although secure programming can help to prevent web vulnerabilities, it imposes a time overhead and is not enough because of the complexity of web applications and the diversity of technologies and external libraries involved in their development.
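As a minimal sketch of query parametrization, the snippet below contrasts string concatenation with a placeholder-based query. It assumes Python's standard `sqlite3` module and a hypothetical `users` table; neither comes from the reviewed literature, and the function names are illustrative only.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `users` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def find_user_unsafe(conn, name):
    # Vulnerable: user input is concatenated into the SQL text,
    # so quotes in the input can change the structure of the query.
    query = "SELECT name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Parametrized: the `?` placeholder is bound at execution time,
    # so the input is treated as data and never as SQL syntax.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()


conn = build_demo_db()
payload = "x' OR '1'='1"          # classic SQL injection string
unsafe_rows = find_user_unsafe(conn, payload)
safe_rows = find_user_safe(conn, payload)
```

With the injection string, the concatenated query matches every row of the table, whereas the parametrized query matches none, because no user is literally named `x' OR '1'='1`.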

Static Analysis
It aims at finding web vulnerabilities by inspecting source or binary code without running it. There exist several research works on web vulnerabilities detection using static analysis. For instance, Ref. [5] proposed a contextual, inter-function, and data-flow analysis to discover taint-style vulnerabilities such as SQL injection, command injection, and XSS attacks. Ref. [6] combined source code taint analysis (i.e., tracking user inputs to verify whether they reach a sensitive sink, a function that can be exploited) to find web vulnerabilities with data mining to reduce false positives. Ref. [7] described a static analysis method that automatically detects access control vulnerabilities in web applications. Ref. [8] proposed a Machine Learning-based static analysis method to discover web vulnerabilities. Specifically, they used a Hidden Markov Model and annotated code slices to train a model to discover vulnerabilities in source code. Ref. [9] presented a methodology and tool based on symbolic code execution (i.e., instead of running the program with concrete inputs, symbolic execution runs it with symbolic ones and finds vulnerabilities along with the inputs that will trigger them) to identify vulnerabilities in web-based applications.
Overall, static analysis-based tools are a solution to find web vulnerabilities, but they tend to generate false positives, they are time-consuming, and they may never converge for large code bases.

Dynamic Analysis
It runs the web application and tries to identify security violations by using techniques such as code instrumentation (i.e., inserting checks into the program) and fuzzing (i.e., feeding random test data to a target program in order to explore all possible paths).
Ref. [10] improved the detection of XSS attacks in web applications by using dynamic analysis and a fuzzy engine. They extracted Application Entry Points (AEP) using a web crawler and then used a fuzzy engine that generates invalid strings for each AEP until the Web Application Firewall is defeated, in which case its signatures database is updated. Ref. [11] presented a method for the detection of DOM-XSS attacks: they used dynamic taint analysis of JavaScript code (i.e., taint traces are obtained while parsing web pages) and fuzzing to automatically derive attack vectors based on those taint traces, and then they verified the DOM-XSS vulnerability by rendering HTTP responses in the browser. Ref. [12] used code instrumentation to generate models that describe how and with whom client-side components interact, which allows protecting JavaScript-based web applications against client-side validation attacks. Refs. [13,14] presented a concolic (concrete + symbolic) execution-based approach for the detection of XSS attacks, and of SQL injection and OS command injection, respectively.
Dynamic analysis-based approaches incur no false positives but cannot achieve high code coverage.

Black-Box Fuzzing
It sends random malicious data to a web application without regard to its logic and identifies whether the application is vulnerable based on its responses. Black-box fuzzers are generally composed of (i) a crawler which identifies all possible web pages and entry points in the analyzed web application, (ii) a test data generator that generates random data for each application entry point, and (iii) a monitor that detects errors in the application runtime behavior.
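The three components listed above can be sketched as follows. This is a toy illustration under strong assumptions: the "application" is a hypothetical in-process dictionary of handlers (`TOY_APP`) rather than a real web server, the seed payloads are hand-picked, and the monitor only checks whether a payload is echoed back verbatim. It is not the design of any fuzzer cited in this section.

```python
import random


# Hypothetical target: each entry point maps to a handler that takes one
# parameter value and returns a response body. A real fuzzer would send
# HTTP requests instead of calling functions.
def search_page(q):
    return "<p>Results for " + q + "</p>"      # reflects input unescaped


def profile_page(name):
    safe = name.replace("<", "&lt;").replace(">", "&gt;")
    return "<p>Hello " + safe + "</p>"         # escapes angle brackets


TOY_APP = {"/search": search_page, "/profile": profile_page}

ATTACK_SEEDS = ["<script>alert(1)</script>", "<img src=x onerror=alert(1)>"]


def crawl(app):
    """(i) Crawler: enumerate the entry points of the application."""
    return sorted(app)


def generate_inputs(rng, count):
    """(ii) Test data generator: seed payloads with a random suffix."""
    return [
        rng.choice(ATTACK_SEEDS) + str(rng.randint(0, 9999))
        for _ in range(count)
    ]


def is_suspicious(payload, response):
    """(iii) Monitor: flag responses that echo the payload verbatim."""
    return payload in response


def fuzz(app, seed=0, inputs_per_entry=5):
    """Run the crawl / generate / monitor loop and return flagged entries."""
    rng = random.Random(seed)
    findings = set()
    for entry in crawl(app):
        for payload in generate_inputs(rng, inputs_per_entry):
            if is_suspicious(payload, app[entry](payload)):
                findings.add(entry)
    return findings


flagged = fuzz(TOY_APP)
```

Only the entry point that reflects its input unescaped is flagged. Because the fuzzer sees nothing but responses, its findings depend entirely on what the crawler reaches and what the generator produces.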
Ref. [15] proposed KameleonFuzz, a black-box XSS fuzzer for web applications that can generate malicious inputs to exploit XSS vulnerabilities and also detect how close it is to revealing a vulnerability. They used a genetic algorithm guided by an attack grammar for malicious inputs generation and evolution. Ref. [16] described a black-box fuzzing approach to detect XQuery injection and parameter tampering vulnerabilities in web applications driven by XML databases. The proposed approach takes place in two phases. (1) The training phase, in which the application behavior is learned, involves (i) a crawler that identifies injection points, (ii) a model constructor that constructs legitimate query models, and (iii) an HTML/JavaScript analyzer that extracts constraints on HTTP parameters. (2) The testing phase involves three components: (i) an attack generator that generates attack strings related to XQuery injection and parameter tampering vulnerabilities, (ii) a model constructor that constructs illegitimate query models resulting from the execution of attack queries, and (iii) a detector that identifies the type of vulnerability exploited by comparing the query model generated during the testing phase against the appropriate query model generated during the training phase.
Ref. [17] proposed a black-box fuzzing technique to detect logic vulnerabilities in web applications. They first collected HTTP traces in which users interact with a certain application. Then, they built a navigation graph model that synthesizes web resources (i.e., data types, URLs, forms, HTTP parameters, JSON objects, etc.). Afterward, they extracted behavioral patterns that model actions usually done by users and actions allowed by the navigation graph model. Then, they generated test cases that would break those behavioral patterns. Finally, they called a test oracle that collects from the executed test a partially ordered set of events and verifies whether all sequences satisfy the provided LTL (Linear Temporal Logic) formula. The test oracle returns true if a certain predefined logic property is violated and false otherwise. Likewise, Ref. [18] proposed a method for the detection of XSS vulnerabilities in which they used a state automaton to obtain knowledge about the application behavior and a genetic algorithm to automatically generate inputs with better fitness values toward triggering an instance of the given vulnerability.
Web Vulnerability Scanning Tools can also be considered in the category of black-box fuzzers. They scan web applications from the outside to look for security vulnerabilities. They are frequently used by security analysts and hackers to find web application vulnerabilities. A large number of both commercial and open-source tools of this type are available (e.g., Burp Suite, Nessus, Nikto, etc.).
Black-box fuzzing does not require the availability of application source code, but it can incur a high rate of false negatives and false positives depending on the efficiency of the crawler and the attack data generator, respectively.

Intrusion Detection Systems (IDS)
They are defined as systems built to monitor host systems (Host Intrusion Detection System (HIDS)), network communications (Network Intrusion Detection System (NIDS)), or web applications (Web Application Firewall (WAF)). HIDSs are usually employed for malware detection (i.e., malicious software that infects computers). NIDSs usually detect network attacks such as DoS (Denial of Service) and MITM (Man-In-the-Middle) attacks, as well as some types of web attacks. WAFs help protect web applications by filtering and monitoring HTTP traffic between a web application and the Internet. They detect client-side (e.g., DOM-XSS attack) and server-side (e.g., SQL injection attack) web attacks. We distinguish between traditional WAFs and WAFs based on Machine Learning (ML) ([19–21]) or Deep Learning algorithms (Figure 4 summarizes the main Deep Learning models that might be used for web attacks detection).
Traditional WAFs (e.g., ModSecurity) use static pattern rules matching to detect attacks. Thus, they cannot detect new, unknown attacks. In addition, updating rules can be a tedious task if the attack pattern is complicated. However, they generate few false positives. As for ML/DL-based WAFs, there exist three main approaches: (1) the anomaly-based approach, in which models are trained using only normal data instances and unsupervised or hybrid ML algorithms, (2) the signature-based approach (the most used), in which models are trained with normal and abnormal data instances and offline supervised ML algorithms, and (3) the hybrid approach, which combines the signature-based and anomaly-based detection approaches. The first approach can detect zero-day attacks but suffers from a high rate of false positives because it is hard to define with certainty what normality is in a given field. The second approach cannot detect zero-day attacks but can detect known attacks accurately and with fewer false positives than the first approach. The third approach can make the best of both worlds: it can achieve high accuracy in detecting known attacks while generating few false positives in detecting new unknown attacks. Furthermore, the IDS signature database can be improved by adding the signatures of newly detected attacks. In this work, we are particularly interested in systematically exploring existing studies on web attacks detection using Deep Learning. Thus, researchers and practitioners willing to use Deep Learning for web attacks detection will hopefully find valuable ground on which to build new and efficient DL-based web attacks detection models.
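The interplay between signature matching and anomaly detection described above can be illustrated with a deliberately simplified sketch. Everything in it is an assumption made for illustration: the two regular expressions stand in for a signature component (a learned supervised classifier in the ML setting), the anomaly model uses a single hand-crafted feature (the share of unusual characters) fitted on benign requests only, and the sample requests are invented.

```python
import re
import statistics

# Hypothetical signature rules; real WAF rule sets are far larger.
SIGNATURES = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"union\s+select", r"<script\b")
]


def special_ratio(request):
    """Share of characters that are neither alphanumeric nor URL syntax."""
    if not request:
        return 0.0
    odd = sum(1 for c in request if not (c.isalnum() or c in "=&/?"))
    return odd / len(request)


class AnomalyModel:
    """Learns what 'normal' looks like from benign requests only."""

    def fit(self, benign_requests, k=3.0):
        ratios = [special_ratio(r) for r in benign_requests]
        # Threshold: mean plus k standard deviations of the benign feature.
        self.threshold = statistics.mean(ratios) + k * statistics.pstdev(ratios)
        return self

    def is_anomalous(self, request):
        return special_ratio(request) > self.threshold


def matches_signature(request):
    return any(sig.search(request) for sig in SIGNATURES)


def hybrid_detect(model, request):
    """Hybrid approach: known signatures first, anomaly check as fallback."""
    if matches_signature(request):
        return "known attack"
    if model.is_anomalous(request):
        return "anomaly"
    return "normal"


benign = ["/item?id=12", "/search?q=red-shoes", "/user?name=alice", "/cart?add=7"]
model = AnomalyModel().fit(benign)
```

In this toy setting, `hybrid_detect(model, "/item?id=1 UNION SELECT password")` is caught by a signature, while `"/item?id=1';--"` matches no rule and is only caught by the anomaly check. A benign request with unusual punctuation could be flagged in the same way, which is the false-positive risk attributed to the anomaly-based approach.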
To the best of our knowledge, this study is the first SLR of its kind in this field. We studied 63 DL-based web attacks detection papers published between 2010 and September 2021. Additionally, we categorized the papers according to several perspectives and came up with some interesting research opportunities. Our main contributions are summarized below:

• Identifying the Primary Studies (PS) related to DL-based web attacks detection and getting different insights from the studies.
• Performing a quality analysis on the PS.
• Presenting the results of the investigation including publication information, datasets, detection models, detection performance, research focus, and limitations.
• Summarizing the findings and identifying some interesting opportunities for future work in the domain of DL-based web attacks detection.
The remainder of this article is organized as follows. Section 2 describes surveys and systematic literature reviews related to DL-based web attacks detection. Section 3 presents the methodology followed in conducting this SLR. Section 4 presents the results and analysis. Section 5 describes the limitations of this study. Finally, Section 6 concludes this study.

Related Work
As far as we know, this is the first study to systematically investigate Deep Learning for vulnerability and attack detection on Web applications.However, there are also other interesting partial surveys in the area.
Ref. [2] conducted a survey on the detection and prevention of web vulnerabilities. They explained in detail web vulnerabilities and the different methods used to counter them. However, they only reviewed traditional Machine Learning-based web attacks detection research works. Related surveys, such as [22,23], have described Machine Learning or Deep Learning applications to cyber-security problems but without paying particular attention to web applications security. Refs. [24,25] are two surveys about web vulnerabilities classification and countermeasures. Neither focuses on Deep Learning-based approaches for web vulnerabilities detection, and neither follows the SLR protocol. Ref. [26] is an SLR on web services attacks and security. Ref. [27] is a recently published survey presenting the latest Machine Learning and Deep Learning-based approaches used for detecting XSS attacks.

Review Methodology
The systematic review in this study was conducted by adapting the strategy proposed by [28,29]. It consists of three main phases: (i) planning, (ii) conducting, and (iii) reporting the review results. The details of these phases are summarized in Figure 5. In the planning phase, the first step determines whether there is a need to conduct a systematic review. The second step develops a review protocol, including (i) identifying research questions, (ii) creating a search strategy, (iii) defining the study selection criteria, (iv) developing quality assessment rules, (v) determining the data extraction strategies that will be used, and (vi) defining the methods that will be used to synthesize the extracted data. We provide details about the proposed protocol in the following subsections. The conducting phase covers the steps needed to carry out the review. In its first two steps (steps 3 and 4), we select the PS by applying the selection criteria and quality assessment rules defined in the planning phase, and then we describe their contents. In the next step (step 5), we extract from the selected PS the data that will help answer the research questions. In step 6, we use different methods to synthesize the extracted data to facilitate answering the research questions. In step 7, we answer the research questions based on the synthesized data. In the reporting phase, we discuss the review results, and we state the limitations of the selected PS.

Research Questions
The main focus of our study is to analyze scientific literature on web attacks detection using Deep Learning techniques available from 2010 to September 2021 inclusive. Based on that, we specify the research questions (RQ1-RQ10), which are detailed in Table 1.
[Table 1: research questions and their motivations. Only a fragment of the last row survives: "Identify the limitations of the studies as stated by their authors."]

Search Strategy
We performed exhaustive searches on different online libraries. The following search string yielded the most appropriate results:

• ("deep learning" OR "neural networks") AND ("web attacks" OR "web security" OR "web application security" OR "web vulnerabilities")

We adapted the search string to suit each database according to its specific requirements. Then, we queried each database by title, abstract, and keywords. The digital libraries utilized in this study include Scopus, Web Of Science, ScienceDirect, IEEE, ACM, and Springer. The first part of Figure 6 (blue) indicates the steps followed in conducting the search strategy.

The study selection strategy consisted of the following steps:

1. Getting rid of duplicate PS.
2. Applying inclusion and exclusion criteria to determine the relevant PS.
3. Performing a quality assessment of selected PS.

We defined inclusion and exclusion criteria; the exclusion criteria included studies written in a language other than English.
After the phase mentioned above, we found 63 journal and conference articles fulfilling our selection strategy.The second part of Figure 6 (green) indicates the steps followed in conducting the study selection strategy.

Quality Assessment Criteria
To assess the quality of the selected PS, we used a quality analysis questionnaire. This quality assessment aims at giving a quality score to each PS and is not intended to eliminate any PS selected at the previous phase of the SLR. Table 2 lists the eight quality assessment questions. Each question was scored as follows: "fully answered" = 1, "partly answered" = 0.5, "not answered" = 0. Each PS is assigned a quality score between 0 and 8 by summing the individual question scores. Table 3 shows the selected PS with their QA scores. The quality analysis of the selected studies shows that the DL-based web attacks detection research domain is yet to be explored properly. Few studies obtained good scores; most of the studies obtained average scores.
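The scoring rule above amounts to a simple sum, sketched below. The answer labels (`"fully"`, `"partly"`, `"not"`) and the function name are illustrative choices, and the sample answers are invented rather than taken from Table 3.

```python
# Score attached to each possible answer to a quality assessment question.
ANSWER_SCORES = {"fully": 1.0, "partly": 0.5, "not": 0.0}

NUM_QUESTIONS = 8  # the questionnaire has eight questions


def quality_score(answers):
    """Sum the per-question scores of one primary study (range 0 to 8)."""
    if len(answers) != NUM_QUESTIONS:
        raise ValueError("expected one answer per quality assessment question")
    return sum(ANSWER_SCORES[answer] for answer in answers)


# Hypothetical study: five questions fully answered, two partly, one not.
example_answers = ["fully"] * 5 + ["partly"] * 2 + ["not"]
example_score = quality_score(example_answers)
```

For the invented study above, the score is 5 × 1 + 2 × 0.5 + 0 = 6.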

Data Extraction
In this step, the main focus is to extract from each PS the information that helps answer the research questions and to store the extracted data in spreadsheets for later use in the data synthesis process. Table 4 provides the data extraction form used in this SLR study.

Data Synthesis
Data synthesis aims at using various methods to synthesize the data extracted from the selected PS to answer the research questions. Therefore, we considered different synthesis methods, including narrative synthesis, tables, and visualization tools such as bar charts, pie charts, and line graphs.

Results and Discussion
In the following subsections, the findings of this SLR study will be presented and discussed for each Research Question (RQ).

RQ1: What Are the Trend and Types of Studies on DL-Based Web Attacks Detection?
This question aims at reviewing bibliometric studies in the DL-based web attacks detection domain; the answers reflect publication information of Primary Studies.
RQ1-1: What is the annual number of studies on DL-based web attacks detection?
Figure 7 presents the selected studies year by year. The review period runs from 2010 to September 2021, and the figure shows ten years of data for the 63 articles. The papers are unevenly distributed over the years. In 2012, we found the first, and only, research article on DL-based web attacks detection for that year. Since then, several research articles have been published on the topic. The number of published papers reached its maximum in 2019 and then decreased in 2020. We finished the SLR in September 2021; thus, the 12 papers published in 2021 are not representative of all the papers published in that year. These observations suggest that published research papers are not evenly distributed over time and that web vulnerability detection using Deep Learning will gain more attention in the coming years.

RQ1-2: What is the percentage of studies published in journals and conferences?
Figure 8 shows that 57% of PS are published in conferences while 43% are published in journals. The RQ1 answer indicates that interest in detecting web vulnerabilities using Deep Learning models is very recent; since 2019, the number of articles published in this research field has increased significantly. In addition, the number of journal articles is less than the number of conference papers.

RQ2: What Datasets Are Used to Evaluate the Proposed Approaches for DL-Based Web Attacks Detection?
In Machine Learning approaches, the choice of the dataset is a key point in the evaluation of detection models' performance. Thus, this question aims at reviewing and discussing the limitations of datasets commonly used in web attacks detection.
According to the dataset type, we grouped the studies into the following two classes: (i) public datasets (free and open access) and (ii) private datasets (not open access).
We found that some studies combine more than one public dataset or even use private and public datasets at the same time to conduct the experiments. Figure 9 gives an overview of the percentage of datasets used in the articles reviewed in this study. We found that 37 studies used private datasets, 29 studies used the CSIC-2010 dataset, three studies used the ECML/PKDD 2007 dataset, two studies used KDD-Cup99, six studies used the CICIDS 2017 dataset, and six studies used publicly available datasets not very commonly used in research works (e.g., xssed.com, Apache 2006/2017, HttpParams). We detail below the public datasets that have been used in the majority of the reviewed studies:
The development of DL-based web attacks detection models is limited by problems with the datasets available for evaluating detection models' performance. In fact, existing public datasets are outdated and simplistic, meaning that they do not include newly discovered web attacks and they do not reflect the complexity of real-world web applications. Moreover, comparing research works in this field is almost impossible because most researchers use private datasets, and even when they adopt public datasets, they use different portions and apply different pre-processing techniques, which results in different versions of the same dataset. Additionally, public datasets need thorough pre-processing; otherwise, evaluation results cannot reflect the models' real performance. For instance, a dataset with many duplicated instances results in biased accuracy. Moreover, a poor feature selection/extraction strategy can produce over-fitted or under-fitted models.
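One way duplicated instances bias an evaluation is by ending up on both sides of a train/test split, so that the model is tested on requests it has already seen. The sketch below illustrates this with invented samples; the helper names and the split routine are assumptions made for illustration and do not reproduce the pre-processing of any reviewed study.

```python
import random


def deduplicate(samples):
    """Drop exact duplicate (request, label) pairs, keeping first occurrences."""
    seen = set()
    unique = []
    for sample in samples:
        if sample not in seen:
            seen.add(sample)
            unique.append(sample)
    return unique


def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle with a fixed seed and cut the list into train and test parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def leaked(train, test):
    """Number of distinct samples that appear in both train and test."""
    return len(set(train) & set(test))


# Hypothetical raw dataset: label 0 = benign, label 1 = attack.
raw = (
    [("/a?id=1", 0)] * 6
    + [("/a?id=1 OR 1=1", 1)] * 6
    + [("/b", 0), ("/c?x=<script>", 1)]
)

leak_before = leaked(*train_test_split(raw))
leak_after = leaked(*train_test_split(deduplicate(raw)))
```

On the raw data, at least one request appears in both parts of the split, so test accuracy partly measures memorization. After deduplication, the overlap is zero.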

RQ3: What Frameworks and Platforms Are Used to Implement the Proposed Solutions for DL-Based Web Attacks Detection?
The main objective behind this research question is (i) to give an overview of software and platforms commonly used in DL-based web attacks detection models development and (ii) to gauge the interest of researchers to give technical implementation details.
Figure 10 summarizes how frequently each framework and platform is used for developing DL-based web attacks detection models. It shows that the most used frameworks and platforms are Keras and TensorFlow. Other frameworks, such as PyTorch, Theano, Scikit-learn, Weka, and MATLAB, are used by few studies. However, 23 studies did not provide implementation details.
The objective of this research question is to present and discuss the performance metrics that are commonly used to evaluate DL-based web attacks detection models.
Using DL models for web attacks detection amounts to developing a classification model that can identify whether a web application is vulnerable or not (i.e., binary classification), or whether it is vulnerable to a specific web attack (e.g., vulnerable or not to XSS attacks), or determine to which web attack it is vulnerable (multi-classification problem).
Different performance metrics have been used in the literature to evaluate DL-based web attacks detection models. We detail below the most widely used metrics, as reported in Figure 11.

• Accuracy: measures the ratio of the number of samples classified correctly over the total number of samples. Accuracy is not useful when the classes are unbalanced (i.e., there are a significantly larger number of examples from one class than from another). However, it does provide valuable insight when the classes are balanced. Usually, it is recommended to use recall and precision along with accuracy.
• Recall or Sensitivity or True Positive Rate or Detection Rate: measures the proportion of actual positives that are correctly identified. A higher sensitivity means more true positives and fewer false negatives; a lower sensitivity means fewer true positives and more false negatives.
• Precision: measures the proportion of positive class predictions that actually belong to the positive class. Precision does not quantify how many real positive examples were predicted as belonging to the negative class, which is why it is advisable to compute the True Negative Rate (TNR) metric.
• F1-score: weighted harmonic mean of precision (P) and recall (R) measures. It is recommended to use F1-score rather than accuracy if we need to seek a balance between precision and recall and there is an uneven class distribution: $F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$, where $\alpha \in [0, 1]$ sets the relative weight given to precision and recall. If α is 1/2, then precision and recall are given equal importance, and the measure reduces to $F_1 = \frac{2PR}{P + R}$. The choice of α, and thus the trade-off between precision and recall, depends on the classification problem.

• False Positive Rate (FPR): the ratio of benign samples incorrectly classified as malicious to all benign samples, $\mathrm{FPR} = \frac{FP}{FP + TN}$. It is used to plot the ROC curve. In intrusion detection systems, it is important to have a low FPR; otherwise, the detection system is considered unreliable.
• Area Under the Curve (AUC): measures the two-dimensional area under the ROC curve, ranging from 0 to 1, indicating a model's ability to distinguish between classes. The ROC (Receiver Operating Characteristic) curve plots the TPR (y-axis) against the FPR (x-axis). Models should have a high AUC value; such models are said to have good skill.
• True Negative Rate or Specificity: measures the proportion of actual negatives correctly identified. A higher specificity means more true negatives and a lower false positive rate; a lower specificity means fewer true negatives and more false positives.
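The threshold-based metrics listed above can all be derived from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions and the equal-weight F1-score; the function name and the example counts are invented for illustration, and AUC is omitted because it requires ranked scores rather than counts.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute common detection metrics from confusion-matrix counts.

    tp: attacks flagged as attacks      fp: benign flagged as attacks
    tn: benign passed as benign         fn: attacks passed as benign
    """
    def ratio(numerator, denominator):
        # Guard against empty classes instead of dividing by zero.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)            # TPR, sensitivity, detection rate
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "fpr": ratio(fp, fp + tn),
        "tnr": ratio(tn, tn + fp),         # specificity = 1 - FPR
    }


# Hypothetical unbalanced test set: 100 attacks among 1000 requests.
metrics = binary_metrics(tp=80, fp=10, tn=890, fn=20)
```

With these invented counts, accuracy is 0.97 although one attack in five goes undetected (recall 0.80), which illustrates why accuracy alone is a weak indicator on unbalanced data.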
For multi-classification problems, it is straightforward to calculate accuracy; however, metrics such as precision, recall, FPR, F1-Score, and AUC cannot be calculated easily because TP and TN do not exist for such problems.These metrics can only be determined for three or plus class problems by collapsing the problem into a two-class problem (i.e., all classes versus one class), where the metrics are calculated for each class.Usually, for multi-class problems, only accuracy is used.As seen in Figure 12, extrinsic and intrinsic feature selection and extraction is the most used approach in reviewed studies, which is followed by intrinsic feature selection and extraction, and then extrinsic feature selection and extraction.As for external feature extraction techniques, we can see from Figure 13 that the most used methods are word-level embedding, followed by manual feature extraction, and then character-level embedding and Encoder-Decoder models.Feature selection came in the fourth position after manual feature extraction, but only one study among the seven studies has identified the technique used.We observed that most reviewed studies do not discuss feature engineering in detail.Indeed, even if Deep Learning can extract abstract and complex features in the course of model training, it is important to identify whether the input to the DL model is a result of external feature extraction and selection or of a simple conversion from textual to numerical format and that feature selection and extraction is performed by the classification model during training.Yet, the studies that pay more attention to feature engineering are those using traditional ML models for classification and DL models for feature selection and extraction.The main objective of this research question is to identify DL models used for the classification of web attacks.
In this section, we extract from the selected PS the classification models used for detecting web attacks. As seen in Figure 14, the reviewed studies use a variety of Deep Learning models, with the exception of a few models such as Generative Adversarial Networks and variants of Encoder-Decoder models. A few studies use Deep Learning models as feature extractors and Machine Learning (ML) algorithms as web attacks classifiers, which is why they were included in this study. Finally, we noticed that CNN, LSTM, and DFFN are the most used Deep Learning models in the reviewed studies. Since Generative Adversarial Network and Encoder-Decoder models have shown promising results in networks intrusion detection (e.g., [92][93][94][95]), exploring these models in web attacks detection is an important area for improvement.

RQ7: What Types of Web Attacks Do the Proposed Approaches Detect?
In this research question, we attempt to identify the types of web attacks that reviewed studies try to detect using Deep Learning models.
As seen in Figure 15, most studies develop Deep Learning models for detecting web attacks without targeting a specific type of web attack. Still, some studies focus on detecting specific web attacks, namely query injection attacks, XSS attacks, and file and path injection attacks. Moreover, 46 studies develop binary classification models, while only 15 studies develop multi-classification models (Figure 16). Since most if not all studies do not mention whether the proposed DL models were trained on a dataset that contains an even number of instances for each type of web attack, it is important to evaluate how well these DL models will perform in detecting specific types of web attacks. This can be achieved by giving more attention to the development of multi-classification models.

In this research question, we report the experimentation details of DL-based web attacks detection models proposed in the selected PS.
Table 5 summarizes the experiments conducted in the reviewed studies: targeted web attacks, classification models, datasets, performance metrics, as well as limitations of studies as stated by their authors. If a given reviewed study performed many experiments, we report the experiment that yielded the best results. Due to different performance measures and datasets, it is hard to compare and rank studies. However, we observed that most if not all reviewed studies achieved high performance metrics, but few of them discussed the threats to validity of their experiments (i.e., the factors that may invalidate the results of the experiments).

4.9. RQ9 and RQ10: What Is the Research Focus and Limitations According to the PS?
In this section, we report the research focus as well as the limitations of the studies as stated by their authors. This part will help researchers and practitioners to have an idea of what is already done in previous research works and to develop more effective and improved detection models. We organize this section according to the Deep Learning classification models used in the reviewed studies. In the end, we discuss papers that used Deep Learning feature extraction methods for traditional Machine Learning algorithms.

CNN or CNN Combined with LSTM or GRU
Ref. [30] proposed a CNN-based method for detecting SQLI. They showed that the proposed model outperforms ModSecurity, a rule-matching-based firewall for detecting web attacks.
Ref. [32] introduced a method for detecting malicious HTTP GET requests using a new architecture of CNNs for classification, and they used NLP-based analysis and Auto-Encoders for URL representation and extraction. However, the authors state that the proposed model cannot be updated easily when new training data are available and can be defeated by adversarial attacks.
Ref. [38] described a method for detecting web attacks injected in web HTTP requests using word embedding and CNNs. The proposed method cannot detect web attacks hidden in parts of the HTTP request message other than the URL.
Ref. [42] proposed a method for detecting web attacks using CNN and GRU along with word-level embedding-based features augmented with manually extracted features.
Ref. [44] provided a method for detecting XSS attacks using CNN, LSTM, and word-level embedding. They identify the problem of scarcity of datasets in the field of web security as a limitation of their study.
Ref. [48] worked on the detection of SQL injection attacks in PHP code. They compared different classification algorithms and feature representation techniques. They reported that the best algorithms are CNN and Multi-Layer Perceptron (MLP) applied to manually extracted features and a TF-IDF bag-of-words model. As future work, they propose to collect a dataset for SQL injection attack detection and to build a Word2vec model for PHP source code as well as to develop an SQL injection attack detection model using CFG (Control Flow Graph) attributes.
Ref. [50] detected web attacks hidden in web HTTP requests using Bag of Words and CNN models. They plan to consider multi-class classification instead of binary classification in future works.
Ref. [56] introduced AI-IDS, which is a Deep Learning model for detecting three types of web attacks: password guessing and authentication bypass, SQL injection, and application vulnerability attack. The proposed model works in parallel with a signature-based NIDS to correct or improve its detection rules. Thus, after repeated manual revalidation and daily retraining, the model can be used as a standalone tool when it reaches an acceptable rate of false positives. The study limitations include the need to re-validate and retrain Deep Learning-based misuse intrusion detection tools due to the low tolerance for a high rate of false alarms. Moreover, their study shows that misuse detection tools can be used in parallel with signature-based tools to improve the detection rules of the latter and the detection quality of the former by checking the malicious events that are detected by one and not by the other.
Ref. [70] introduced an anomaly detection method of web attacks using character-level embedding and CNN followed by LSTM. They trained the model on a dataset that contains only two out of three attacks and then tested the model on the attacks that did not belong to the training set in order to enhance the model's capability to detect unknown attacks. One significant issue of this work is that more web attacks need to be included in the dataset. In addition, scenario-based attacks through several correlated requests need to be considered, and the deployment of the model in a practical web service is to be tested.
Ref. [74] evaluated different Deep Learning models based on LSTM, CNN, and CNN combined with LSTM, for the detection of web attacks hidden in HTTP GET and HTTP POST requests. Moreover, they analyzed some false positives and found that they occur either because the HTTP requests contain strings that never appear in the training set or because of the way the request is split into a sequence of words. Therefore, they should inspect all the fields of the HTTP request and use more sophisticated models to convert request strings to numerical vectors.
Ref. [78] used character and keyword level embedding along with CNN without a pooling layer followed by GRU to detect web attacks through the classification of URLs into different categories: normal or type of attack. To improve their proposed model, they should consider reduced memory consumption and online update of trained models.
Ref. [39] implemented and compared three Deep Learning models for the detection of web attacks. First, they empirically showed that the classification performance of CNN combined with LSTM is better than that of LSTM or CNN. In addition, they showed how the dropout rate, the number and width of filters, the number of hidden units, and the size of local max-pooling influence the performance of CNN, LSTM, and CNN combined with LSTM models, respectively.
Ref. [41] used character-level embedding with CNN to extract relevant features from URLs, registry keys, and file paths. Then, they fed the output features to three fully connected layers to classify URLs, registry keys, or file paths as normal or malicious. They showed that compared with other feature extraction methods, the proposed method has better classification performance but incurs a computational overhead when training on long strings. They faced difficulties in collecting and labeling registry keys and file paths datasets, which resulted in poor classification performance in comparison with results obtained for URLs classification.
Ref. [51] proposed a CNN-based system for detecting web attacks. The system comprises two networks that are trained separately; the first one locates suspicious payloads in the HTTP request by identifying their start and end positions. They took the top three suspicious payloads returned by the first network and passed them to the second network, which identifies the attack type in each payload. If the three payloads are benign, the request is classified normal; otherwise, it is anomalous. Although the proposed system can only detect SQLI and XSS attacks, it reduces the computational cost by 82.6% and increases the detection accuracy by 22.3% compared with character-level CNN.
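The two-stage decision flow described for [51] can be sketched in Python as follows. This is only an illustration of the control flow: `toy_locate` and `toy_identify` are hypothetical rule-based stand-ins for the two trained CNNs, and the scoring rule is an assumption made here for the example.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, float]  # (start, end, suspicion score)


def classify_request(
    request: str,
    locate: Callable[[str], List[Span]],
    identify: Callable[[str], str],
    top_k: int = 3,
) -> Tuple[str, List[str]]:
    """Stage 1 locates payload spans; stage 2 labels the top_k most suspicious."""
    spans = sorted(locate(request), key=lambda s: s[2], reverse=True)[:top_k]
    labels = [identify(request[start:end]) for start, end, _ in spans]
    attacks = [label for label in labels if label != "benign"]
    # The request is normal only if every inspected payload is benign.
    return ("anomalous", attacks) if attacks else ("normal", [])


def toy_locate(request: str) -> List[Span]:
    """Hypothetical stand-in: score each parameter value by its share of symbols."""
    spans = []
    for m in re.finditer(r"=([^&]*)", request):
        value = m.group(1)
        symbols = sum(not ch.isalnum() for ch in value)
        spans.append((m.start(1), m.end(1), symbols / max(len(value), 1)))
    return spans


def toy_identify(payload: str) -> str:
    """Hypothetical stand-in for the attack-type network (SQLI and XSS only)."""
    lowered = payload.lower()
    if "' or " in lowered:
        return "sqli"
    if "<script" in lowered:
        return "xss"
    return "benign"
```

For example, `classify_request("/item?id=1' OR '1'='1&name=bob", toy_locate, toy_identify)` flags the request as anomalous because one of the located payloads is labeled as SQLI.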
Ref. [55] proposed a multi-classification system for detecting malicious HTTP requests by identifying malicious HTTP parameters. The system is composed of a character-level embedding layer and a convolutional-pooling layer. Compared with SVM and Random Forest, the proposed system is better with respect to different performance metrics. In addition, the proposed model can be updated by retraining the model with new or rectified instances of HTTP requests. Finally, they should consider testing the performance of the model in more practical applications.
Ref. [63] proposed a novel CNN-based model for the detection of SQL injection attacks. The novelty of the approach consists of modifying the pooling layers so as to retain as much information as possible about the SQL query string. However, they plan to implement a multi-classification model that is not limited to identifying SQL injection attacks.
Ref. [69] proposed a Deep Learning-based approach for the detection of web attacks. They concatenated four models with the same architecture: a character-level embedding layer, two convolutional-pooling layers, and a dense layer. Then, they fed the four outputs to a dense layer. They demonstrated that the proposed model is more accurate and takes less time and memory resources than if only one model is used.
Ref. [77] proposed an IDS for detecting web attacks, DDoS, Infiltration, and Brute Force, based on the analysis of IoT networks traffic. The IDS consists of a trained Tree-CNN model that uses the Soft-Root-Sign (SRS) activation function. They justified this choice by the fact that Tree-CNN shows better performance in other fields such as image classification and that the SRS activation function allows faster training and detection time. They compared the proposed model with other ML algorithms, Deep Belief Network, as well as Tree-CNN that uses other activation functions (ReLU, Softmax). The results showed that the proposed model outperforms them. However, the evaluation of proposed detection models is limited by the scarcity of IoT-based datasets.
Ref. [79] developed a Deep Learning-based model for the detection of malicious HTTP requests. They used ASCII codes to convert the HTTP requests to a two-dimensional matrix and then fed it to a CNN network. They showed that the proposed model is more accurate than other state-of-the-art models. However, compared with word-level and character-level embedding, the ASCII code-based conversion causes a time overhead in the training and testing phases.
Ref. [81] aimed at finding the best hyper-parameters and the embedding approaches that should be advised for building CNN models that help to improve web attacks detection accuracy and to reduce detection time overhead. To this end, they built different CNN models using character-level and word-level embedding, different values of hyper-parameters (activation functions, kernel sizes, optimizers, number of layers, etc.), and validation methods. The comparison showed that word-level embedding, the ReLU function, the Adam optimizer, two fully connected layers, 128 filters for each kernel size (2, 3, 4), and the 10-fold cross-validation method bring the best detection accuracy with less time overhead. They plan to investigate new embedding approaches that could outperform word-level and character-level embedding.
Ref. [87] proposed a new method for representing HTTP web requests, which consists of substituting each character or symbol in the HTTP web request with its corresponding ASCII code. Then, they fed the resulting integer vector into a CNN network to classify HTTP web requests into benign and malicious. They compared the proposed method with word-level and character-level embedding and showed that it produces better classification results. They raise the problem of time overhead incurred by the new method at the training and testing phases.
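The ASCII-based representation described for [79] and [87] can be illustrated with a short Python sketch. The fixed length, the zero padding value, the mapping of non-ASCII characters to the padding value, and the matrix width are all assumptions made here for illustration; the reviewed studies do not specify them in this section.

```python
from typing import List


def ascii_encode(request: str, length: int, pad: int = 0) -> List[int]:
    """Replace each character by its ASCII code, then truncate or pad to `length`."""
    # Assumption: characters outside the ASCII range are mapped to the pad value.
    codes = [ord(ch) if ord(ch) < 128 else pad for ch in request[:length]]
    return codes + [pad] * (length - len(codes))


def to_matrix(codes: List[int], width: int) -> List[List[int]]:
    """Reshape the integer vector into rows of `width` values (2-D CNN input)."""
    if len(codes) % width != 0:
        raise ValueError("vector length must be a multiple of the matrix width")
    return [codes[i:i + width] for i in range(0, len(codes), width)]
```

For instance, `ascii_encode("GET /", 8)` yields `[71, 69, 84, 32, 47, 0, 0, 0]`, which `to_matrix(..., 4)` reshapes into two rows of four values.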
In [37,57], the authors introduced model uncertainty to web attacks detection with the aim of finding annotation errors in web log datasets: they deliberately mislabeled some web logs of the dataset (normal web logs are tagged attacked and attacked web logs are tagged safe), and they included these web logs in both the training and testing sets. Then, they trained a CNN network followed by two fully connected layers on the resulting dataset to classify web logs as attacked or safe. Afterwards, they computed a Bayesian variance of classified web logs, which represents the model uncertainty, that is, how confident the model is about its predictions. They showed that web logs with high variance are most likely to be wrongly labeled, which can help security experts to correct the dataset and retrain the model on cleaner data. They compared the proposed model with Softmax (i.e., if the softmax output is 0.5, it means the model is not sure about the prediction), and they showed that in most cases, the model uncertainty finds more annotation errors because mislabeled inputs have the highest variances. As for the prediction time, the model uncertainty takes a long time compared with Softmax, but the time overhead is nearly imperceptible for the user. The authors plan to exploit model uncertainty in other security scenarios such as locating adversarial web request samples and to combine the softmax output and the model uncertainty as a unified standard to evaluate the prediction confidence.
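The idea of ranking web logs by predictive variance can be sketched as follows. This is a hedged illustration, not the method of [37,57]: it assumes the model can produce several stochastic predictions per web log (for example, by keeping dropout active at inference time, which the text above does not state), and the function names are hypothetical.

```python
from statistics import mean, pvariance
from typing import Dict, List, Tuple


def predictive_stats(samples: List[float]) -> Tuple[float, float]:
    """Mean attack probability and its variance over stochastic forward passes."""
    return mean(samples), pvariance(samples)


def rank_by_uncertainty(passes: Dict[str, List[float]]) -> List[str]:
    """Order web log identifiers from most to least uncertain prediction."""
    return sorted(passes, key=lambda log_id: pvariance(passes[log_id]), reverse=True)
```

Logs at the top of the ranking would be the first candidates for manual re-labeling by a security expert.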
In [50], the authors presented a method for classifying malicious HTTP requests based on the URL and payload. They used the Bag of Words technique to convert the textual input to a two-dimensional numerical matrix, which they input to a two-layer CNN. They evaluated the model using the CSIC2010 dataset, and they compared it with state-of-the-art methods. The proposed approach achieved the best performance results, but it only handles binary classification.
In [90], the authors propose a novel approach for the detection of malicious HTTP requests in the particular case where the datasets available for the target system are too limited in size and diversity to build performant DL models. The approach consists of three phases. (i) In model initialization, a large public dataset is used to build a DL model that detects malicious HTTP requests. They used word2vec for the vectorization of HTTP payloads and TextCNN for their classification. (ii) In data augmentation, noise is added to the original samples of the targeted system-based dataset while keeping the keywords unchanged. This way, the samples get diversified but keep their semantic meaning. They used TF-IDF to define keywords. (iii) In Transfer Semi-Supervised Learning, they froze the first n layers of the TextCNN model built in the first phase, and they trained the rest on the original and generated labeled and unlabeled targeted system-based datasets. This way, the obtained model retains the knowledge learned by the initial model and is adapted to the new target system without the risk of over-fitting due to the small size of the dataset. The results of the conducted experiments show that the proposed approach achieved better performance in comparison with other baseline models and methods such as Bi-LSTM, SVM, AE, character-level embedding, and N-gram. However, as future work, they plan to (i) include more complex web attacks, (ii) handle encrypted malicious HTTP requests, (iii) consider non-textual elements of the HTTP web requests, and (iv) build anomaly-based detection models.

Recurrent Neural Networks
Ref. [60] introduced an anomaly detection method that used (i) word-level embedding to represent URLs, (ii) two separate GRU or LSTM networks to predict the next token in the URL path or query parameter given a set of previous tokens, and finally, (iii) an MLP to predict if a URL is normal or anomalous based on the probability vectors returned by the GRU or LSTM networks. However, the proposed model cannot handle some kinds of long URLs properly and cannot dynamically trade off true positives against false positives.
Ref. [62] proposed an anomaly detection method of web attacks using the LSTM network and lexical and statistical features extracted manually from HTTP web requests. They showed that compared with other methods where selecting a subset of extracted features was necessary to increase the classification model accuracy, the LSTM model was able to achieve high accuracy without feature selection.
Ref. [64] suggested APPMINE, an anomaly-based method for web vulnerabilities detection which uses an LSTM network and web application system calls as features. However, APPMINE could miss certain classes of attacks not manifested in the system call sequences generated by the web application. Moreover, APPMINE has other issues, including privacy leakage: monitoring a system to collect application system calls could impact users' privacy. In addition, the training of APPMINE should be done for each web application to detect anomalies specific to that web application. Finally, APPMINE is also prone to adversarial attacks.
Ref. [68] presented an anomaly detection method of web attacks using a word-level embedding for feature extraction and bi-directional LSTM for the classification of malicious URLs.
Ref. [72] proposed a multi-classification model that can detect six categories of web attacks hidden in URLs using one-hot encoding and GRU networks. Furthermore, they showed that GRU networks outperform Random Forest in terms of accuracy even when small training sets are used.
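One-hot encoding of a URL, as used in [72], could look like the following sketch. It assumes a character-level vocabulary (Python's `string.printable`) and a fixed maximum length; both choices are assumptions made here, since the summary above does not say at which granularity the encoding is applied.

```python
import string
from typing import List

# Assumption: the vocabulary is the set of printable ASCII characters.
VOCAB = string.printable
INDEX = {ch: i for i, ch in enumerate(VOCAB)}


def one_hot_url(url: str, max_len: int) -> List[List[int]]:
    """Return a max_len x len(VOCAB) matrix with one 1 per known character."""
    rows = []
    for pos in range(max_len):
        row = [0] * len(VOCAB)
        # Positions past the end of the URL and unknown characters stay all-zero.
        if pos < len(url) and url[pos] in INDEX:
            row[INDEX[url[pos]]] = 1
        rows.append(row)
    return rows
```

The resulting matrix has one row per character position and could be fed step by step to a recurrent network such as a GRU.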
Ref. [76] proposed an anomaly detection method to detect XSS attacks using LSTM and word-level embedding. However, they need more XSS-oriented datasets to assess the model.
Ref. [33] implemented and compared two Deep Learning models, namely MLP and LSTM, for the detection of SQL injection attacks. They used statistical features of URLs as input to the MLP model, whereas they used lexical features as input to the LSTM. They found that MLP is better than LSTM both in classification performance and time. They used an external dataset different from the training and test set to evaluate the capability of the models to detect unknown attacks.
Ref. [43] filtered out known attacks using a signature-based intrusion detection system. Afterward, they used a trained two-layer LSTM model as an anomaly-based intrusion detection system to detect attacks that the misuse detection system could not identify. Then, a signature generation module is called to update the signature repository with new signatures extracted from the new attacks detected by the anomaly-based IDS. They compared the proposed model with other traditional Machine Learning algorithms, and they showed that it is better in every performance metric.
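The hybrid flow described for [43] (signature filter, anomaly detector, signature update) can be sketched as below. The class and parameter names are hypothetical, substring matching stands in for the signature engine, and the anomaly detector and signature generator are injected callables rather than the trained LSTM and the generation module of the study.

```python
from typing import Callable, Iterable, Set


class HybridIDS:
    """Signature filter first, anomaly detector second, then signature update."""

    def __init__(
        self,
        signatures: Iterable[str],
        is_anomalous: Callable[[str], bool],
        make_signature: Callable[[str], str],
    ) -> None:
        self.signatures: Set[str] = set(signatures)
        self.is_anomalous = is_anomalous      # stand-in for the trained model
        self.make_signature = make_signature  # stand-in for signature generation

    def inspect(self, request: str) -> str:
        # Step 1: known attacks are caught by plain substring signatures.
        if any(sig in request for sig in self.signatures):
            return "known attack"
        # Step 2: remaining traffic goes to the anomaly-based detector.
        if self.is_anomalous(request):
            # Step 3: store a new signature so the attack is known next time.
            self.signatures.add(self.make_signature(request))
            return "new attack"
        return "normal"
```

With this design, an attack that is first caught by the anomaly detector is handled by the cheaper signature check on later occurrences.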
Ref. [53] proposed a Deep Learning-based approach for the detection of malicious JavaScript programs. The pre-processing phase is based essentially on a static analysis of JavaScript code that consists of six main steps: (i) JavaScript de-obfuscation, (ii) Abstract Syntax Tree (AST) generation, (iii) Program Dependency Graph (PDG) generation, (iv) program slices generation, (v) tokenization and generalization of program slices, and (vi) vectorization. It outputs an 80-dimensional vector. Then, they applied a two-layer Bi-LSTM on the output vector to classify JavaScript codes as malicious or benign. Compared with other Machine Learning algorithms and signature-based open-source antivirus tools, the proposed model achieves the best detection performance provided that the JavaScript code is de-obfuscated. The limitation of this study stems from the limitation of static analysis, which cannot detect malicious JavaScript code that is generated dynamically. In future work, they will explore other types of Neural Networks such as tree and graph structure neural networks.
Ref. [59] proposed a modular neural network for the detection of XPath injection attacks. The model comprises two LSTM networks; the first one classifies login attempts as valid or malicious. The second network classifies user input as valid, invalid, or malicious. The final decision about the user request is made based on the classification output of the two LSTM models. If the user request is classified as malicious, fake data are returned in place of real data. If the user request is classified as invalid, an error message is returned; otherwise, the user request is processed by the web server.
Ref. [67] proposed a Deep Learning approach for the detection of SQLI vulnerability in PHP codes. They used a tool that captures the opcode of PHP code before it is executed. Then, they converted the opcode slice to an integer vector that they pass to an embedding layer to obtain a matrix that they fed to an LSTM layer followed by two dense layers. The issue in that study is that the training and testing sets are generated from the same dataset. Thus, it is possible that the selected dataset may not reflect real PHP applications.

Encoder-Decoder Models
Ref. [66] proposed an anomaly detection method of web attacks based on a GRU-Encoder-Decoder model with an attention mechanism. They justify the use of the attention mechanism with the fact that malicious strings may occur in different locations of the HTTP request payload. Therefore, it is essential to consider all hidden states instead of the last hidden state only. In addition, the attention mechanism allows verifying if the model is well-trained thanks to the visualization techniques that can be built using the attention weights.
Ref. [84] proposed an anomaly-based detection method of web attacks. They collected traces of normal behavior of Java web applications and trained a Stacked Denoising Auto-Encoder to distinguish normal requests from malicious requests based on execution traces. In the test phase, the trained model takes as input a trace vector and tries to reconstruct it; if the reconstruction error is greater than a certain threshold, then the web request is classified as abnormal; otherwise, it is normal. The threshold is defined during the training phase so that the F1-score remains maximal. However, they should investigate more complex networks, such as LSTM Auto-Encoder or CNN Auto-Encoder, and evaluate the performance of the proposed model on zero-day attacks. Moreover, they should consider updating trained models using online data from actual world usage without incorporating attack data into the normal behavior dataset.
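The thresholding step described for [84] can be illustrated with a small Python sketch that picks the reconstruction-error threshold maximizing the F1-score. It assumes a labeled validation set of reconstruction errors is available (1 for abnormal, 0 for normal) and that the candidate thresholds are the observed error values; both are assumptions for illustration, not details from the study.

```python
from typing import List


def f1_at(threshold: float, errors: List[float], labels: List[int]) -> float:
    """F1-score when every error strictly above the threshold is flagged abnormal."""
    tp = sum(1 for e, y in zip(errors, labels) if e > threshold and y == 1)
    fp = sum(1 for e, y in zip(errors, labels) if e > threshold and y == 0)
    fn = sum(1 for e, y in zip(errors, labels) if e <= threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(errors: List[float], labels: List[int]) -> float:
    """Try each observed error as a threshold and keep the one with the best F1."""
    candidates = sorted(set(errors))
    return max(candidates, key=lambda t: f1_at(t, errors, labels))


def is_abnormal(reconstruction_error: float, threshold: float) -> bool:
    return reconstruction_error > threshold
```

At test time, only `is_abnormal` is needed: the request is flagged when its reconstruction error exceeds the stored threshold.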
Ref. [31] trained a Stacked Auto-Encoder with Logistic Regression classifier to distinguish DDOS attacks from normal web requests based on eight features representative of DDOS attacks. The Stacked Auto-Encoder is intended to extract an abstract representation of features, while Logistic Regression is used for binary classification.
Ref. [61] implemented two Deep Learning models to detect web attacks hidden in URLs. The first one is the Stacked Auto-Encoder model, and the second is an RNN. In addition, they should consider identifying other types of web attacks that appear in URLs and user-agent strings and cookies.
Ref. [71] proposed a zero-wall system for the detection of web attacks. It operates behind a signature-based Web Application Firewall (WAF). The WAF drops known attacks, and the zero-wall system intercepts allowed web requests and classifies them as benign or malicious. If a web request is classified as malicious, it is analyzed by security experts to ensure it is not a false positive. If it is a false positive, then a rule set is added to a white list so it is not rejected again by the zero-wall system. If it is a true positive, then a rule set is added to the WAF signature database. The zero-wall system uses an LSTM-based Encoder-Decoder model to distinguish between benign and malicious web requests. Indeed, if it fails to reconstruct the input token sequence corresponding to a web request, then the latter is considered malicious. To speed up detection, they used hash tables to avoid processing web requests that have already been seen and classified by the system. However, the system has a few issues, including a class imbalance problem (too small a volume of training data) and poisoning attacks (an attacker may inject many malicious samples in the normal traffic hoping the system would learn from the wrong dataset).
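The hash-table shortcut mentioned for [71] can be sketched as a thin cache around any detector. The class name, the SHA-256 key, and the injected `detect` callable are assumptions made for this illustration; the study's actual data structure is not detailed here.

```python
import hashlib
from typing import Callable, Dict


class CachedDetector:
    """Skip the expensive model for requests that were already classified."""

    def __init__(self, detect: Callable[[str], str]) -> None:
        self._detect = detect            # stand-in for the Encoder-Decoder model
        self._seen: Dict[str, str] = {}  # request hash -> stored verdict
        self.model_calls = 0

    def classify(self, request: str) -> str:
        key = hashlib.sha256(request.encode("utf-8")).hexdigest()
        if key not in self._seen:
            self.model_calls += 1
            self._seen[key] = self._detect(request)
        return self._seen[key]
```

Repeated identical requests then cost one dictionary lookup instead of a full model inference.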

Deep Belief Networks
Ref. [73] used a deep belief network along with word2vec and statistical features to detect SQL injection attacks hidden in GET or POST web requests.
Ref. [85] developed a Deep Belief Network-based IDS for the detection of different types of attacks including web attacks in IoT systems. They showed that the proposed model outperforms different Machine Learning algorithms such as RNN and SVM. They plan to improve their work by detecting other attacks against IoTs and by evaluating the proposed model using other datasets.

Ensemble Classification Models
Ref. [40] focused on detecting web attacks in HTTP requests sent by edge devices to the cloud using a weighted average ensemble model composed of two Residual Networks (ResNet) that use different feature representations of URLs. As with previous studies, the online updating of detection models and adversarial attacks are two problems that are not addressed by this work.
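A weighted average ensemble such as the one used in [40] combines the outputs of its sub-models as in the following sketch. The weights, the 0.5 decision threshold, and the function names are illustrative assumptions; the study's actual values are not given in this section.

```python
from typing import Sequence


def weighted_average(probabilities: Sequence[float], weights: Sequence[float]) -> float:
    """Combine per-model attack probabilities into one score."""
    if len(probabilities) != len(weights):
        raise ValueError("one weight is required per sub-model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(probabilities, weights)) / total


def is_malicious(probabilities: Sequence[float], weights: Sequence[float],
                 threshold: float = 0.5) -> bool:
    """Flag the request when the combined score reaches the threshold."""
    return weighted_average(probabilities, weights) >= threshold
```

For example, two sub-models reporting 0.9 and 0.3 with weights 0.75 and 0.25 give a combined score of 0.75, which would be flagged at a 0.5 threshold.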
Ref. [54] implemented a detection method of web attacks in heterogeneous and adversarial environments using an adaptive IDS, which is an ensemble classification model that aims at tracking the best classification model for each data instance. As future work, they plan to apply the proposed method to malware and network intrusion detection and compare it with other ensemble classification algorithms.
Ref. [75] proposed a web attacks detection system for IoT networks, which consists of an ensemble classifier that identifies malicious URLs based on the predictions obtained from three sub-models (CNN, LSTM, and ResNet) and an update module that fine-tunes each sub-model in order to detect novel web attacks. Moreover, they showed that the proposed system outperforms each sub-model as well as other state-of-the-art classification models. However, they plan to explore other alternative Deep Learning models to improve the performance of the overall system as well as to detect web attacks other than XSS attacks and SQL injection attacks.
In [89], the authors proposed a data-augmentation method based on self-adapting noise adding (DA-SANA), which consists of adding noisy data to web attacks datasets in proportion to the web HTTP request length. The method aims at overcoming the problem of classification in unbalanced datasets. They evaluated the method by implementing different state-of-the-art DL models and comparing their classification performance with and without data augmentation on public datasets as well as on a synthetic dataset for both binary classification and multi-classification problems. The results showed that the Bi-LSTM model with the DA-SANA method has the best classification performance values. The method's limitations are the time cost and computational resources required to train Bi-LSTM models. In addition, the DA-SANA method does not consider the files uploaded as part of the web HTTP request length.
In [91], the authors propose a detection, mitigation, and attacker profiling system for securing web applications against web attacks. The system is composed of a Cookie Analysis Engine (CAE) and a DL classifier. First, the HTTP request is checked for forged cookies. Then, it is forwarded to the DL classifier. The decision of whether the request is normal or malicious is based on the combined outcome of the CAE and the DL classifier. Furthermore, if a user is profiled as suspicious or as an attacker, the subsequent requests will be blocked without calling either the CAE or the DL classifier. Thus, the attacker profiling option aims at reducing the processing time, which makes the proposed system suitable for real-time environments. A state-of-the-art comparison showed that the proposed system has the best performance results on the private dataset as well as on the CSIC2010 dataset and in the real-time environment. They also mentioned that most web attacks are injected in the HTTP payloads rather than in the cookies. They plan to join the proposed system with a deception mechanism that is intended to analyze the characteristics of the attacks and the tactics of attackers.

4.9.6. Deep Feed Forward Networks
Ref. [36] worked on a method for the detection of malicious HTML pages using DFFNs and a custom representation of HTML pages. One limitation of the study is that the proposed model may not be able to detect malicious web content that experts mislabeled as normal in the training set.
Ref. [52] detected two types of web attacks (SQL injection and distributed denial of service) in network packets using statistical methods for labeling web packets and supervised Machine Learning algorithms for building predictive models. However, the study has a few limitations, including variable size network and memory and time constraints, which constitute an obstacle for real-time web intrusion detection.
Ref. [82] implemented and compared different Machine Learning algorithms (Random Forest, Decision Tree, AdaBoost, Deep NN, etc.) for the detection of SQL injection attacks. They used manually extracted features that are relevant to the domain of SQL injection attacks. They empirically showed that Random Forest outperforms all the other algorithms.
Ref. [35] trained a simple MLP to detect SQLI attacks by converting each URL to a binary vector of length 13, which is the number of the most popular SQLI attack keywords and patterns; the vector can be extended to include other SQLI keywords.
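The keyword-presence encoding described for [35] can be sketched as follows. The 13 keywords and patterns listed here are hypothetical placeholders chosen for illustration, since the actual list used in the study is not reproduced in this review; URL decoding before matching is also an assumption.

```python
from typing import List
from urllib.parse import unquote_plus

# Hypothetical list of 13 SQLI keywords and patterns (not the list from [35]).
KEYWORDS = [
    "select", "union", "insert", "update", "delete", "drop",
    "or 1=1", "--", "'", ";", "sleep(", "/*", "exec",
]


def url_to_vector(url: str) -> List[int]:
    """Return a binary vector: 1 if the keyword occurs in the decoded URL, else 0."""
    text = unquote_plus(url).lower()
    return [1 if keyword in text else 0 for keyword in KEYWORDS]
```

Extending the representation to more SQLI keywords only requires appending entries to the list, which lengthens the vector accordingly.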
Ref. [55] proposed a Neural Network-based method for the detection of XSS attacks that can be used on either the client side or the server side. They extracted 41 numerical features including URL, JavaScript, and HTML-based aspects that characterize XSS attacks, and they fed them to a two-layer neural network. They compared the proposed method with other Machine Learning algorithms and found that it performs better with respect to different evaluation measurements. They also highlighted that traditional Machine Learning algorithms have good classification performance as well, which they explained by the quality and method of collecting the training dataset and the strategy used for feature extraction.
Ref. [59] proposed a multi-classification model that classifies each URL as benign or as an SQLI attack, in which case it specifies its type. First, they extracted 32 keywords that characterize different types of SQLI attacks. Then, they represented each URL with a binary vector that indicates the presence (1) or absence (0) of each keyword. Finally, each URL is assigned one of eight classes (benign, tautologies, illegal/logically incorrect queries, piggy-backed query, union queries, stored procedures, inference SQLI attack, and alternate encoding) by using a trained MLP network.
Ref. [83] developed a three-layer Deep Neural Network model to identify functions in JavaScript code that are vulnerable to DOM-XSS attack. They used the classification model as a pre-filter to taint tracking, which is a dynamic analysis method for detecting DOM-XSS vulnerabilities. They experimentally showed that the combination of these two approaches improves precision and recall while decreasing time overhead. They identify two limitations to their study: the dataset used to evaluate the proposed model may not apply to other browsers and may contain mislabeled instances (i.e., noisy ground truth).
Ref. [86] developed an artificial neural network for identifying web attacks that were not detected at the stage of signature-based analysis. They defined custom features to identify user behavior and to decide whether it is normal or not. They plan to extend the proposed system to take into account every parameter in web pages in order to achieve more coverage of the user's behavior.
Ref. [88] proposed a DL approach for detecting XSS attacks in PHP and JavaScript source code. They used two representation techniques to transform source code files into numerical vectors: word2vec and code2vec [96], which extract features from the source code Abstract Syntax Tree. They implemented a DFFN with an attention mechanism to classify the source code as vulnerable or not to XSS attacks. The proposed approach outperforms existing static analysis tools in every classification performance metric. However, it also has several limitations that undermine its applicability to real-world scenarios. Indeed, the model is evaluated using a synthetic dataset. In addition, the code2vec method does not scale to large source code files, and neither representation technique takes into account invocations between different source code files.

4.9.7. Deep Learning-Based Feature Extraction
Ref. [34] presented a static analysis method for detecting malicious JS code by using a Stacked Denoising Auto-Encoder for feature extraction and logistic regression for classification. However, the limitations of the study include the need for minification and obfuscation of JS code and the long time it takes to train the proposed model.
Ref. [58] used n-gram and Stacked Auto-Encoder for feature construction and extraction, as well as Isolation Forest for classification to build an anomaly web attacks detection method.
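As a rough illustration of the n-gram feature construction step only, the sketch below counts overlapping character n-grams and maps a request onto a fixed vocabulary. The n-gram order, the vocabulary, and the normalization are assumptions made for this example; the configuration used in Ref. [58] is not specified here, and the Stacked Auto-Encoder and Isolation Forest stages are omitted.

```python
from collections import Counter

def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams in a request string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def to_feature_vector(text: str, vocabulary: list[str], n: int = 2) -> list[float]:
    """Map a request to relative n-gram frequencies over a fixed vocabulary."""
    counts = char_ngrams(text, n)
    total = sum(counts.values()) or 1  # avoid division by zero on short inputs
    return [counts[gram] / total for gram in vocabulary]

# Hypothetical vocabulary and request, chosen only for illustration.
vocab = ["id", "d=", "--", "' "]
features = to_feature_vector("id=1' --", vocab)
```

In a pipeline of the kind described, vectors like `features` would be the input that a feature extractor compresses before an anomaly detector scores them.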
Ref. [80] used word-level embedding to convert URLs to numerical matrices. Then, a Convolutional-Pooling layer is used to extract features from the embedded matrices. Afterward, the extracted features are merged with statistical features before they are fed into an SVM classifier. They built a function that takes as input the output of the max-pooling layer and outputs the corresponding words in the input URL, which may help researchers validate the trained model and understand which words contribute to a specific URL classification. Moreover, they showed the impact of testing a model on a dataset containing duplicated data: the proposed model accuracy goes from 99.33% to 100% when tested on the CSIC 2010 dataset (which has a repetition rate of 81%) without deduplication.
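The effect of duplicated test data on reported accuracy can be reproduced with a small worked example. The requests, labels, and predictions below are made up for illustration and have no relation to the CSIC 2010 data or to the model of Ref. [80].

```python
def accuracy(pairs):
    """pairs: list of (request, true_label, predicted_label) tuples."""
    correct = sum(1 for _, truth, pred in pairs if truth == pred)
    return correct / len(pairs)

def deduplicate(pairs):
    """Keep the first occurrence of each distinct request string."""
    seen, unique = set(), []
    for request, truth, pred in pairs:
        if request not in seen:
            seen.add(request)
            unique.append((request, truth, pred))
    return unique

# Toy data: one easy request repeated 8 times, plus two harder unique requests.
results = [("/index?id=1", 0, 0)] * 8 + [
    ("/search?q=a'--", 1, 1),
    ("/login?u=admin", 0, 1),  # misclassified
]
with_duplicates = accuracy(results)
without_duplicates = accuracy(deduplicate(results))
```

Here the accuracy is 9/10 on the raw set but only 2/3 once repeated requests are removed, because the repeated easy request is counted as correct eight times. This is the mechanism by which a high repetition rate can inflate a reported score.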
Ref. [47] used genetic algorithms to distinguish normal network traffic from abnormal network traffic. Then, they used a shallow Neural Network to classify attack types: either web attacks, infiltration attacks, PortScan, BruteForce, DoS, or Botnet attacks. Finally, they showed that the proposed model is better than other Machine Learning techniques in terms of accuracy, detection rate, recall, and precision.
Ref. [49] exploited different Deep Learning models, namely Stacked Auto-Encoder and Deep Belief Network, for feature extraction and used one-class learning algorithms, namely Isolation Forest, one-class SVM, and Elliptic Envelope, for web requests classification.

Limitations
This study aims at conducting an SLR of DL-based web attacks detection research works. The lack of comparison between selected studies is the primary limitation of this study. Although we reported the results of selected studies, we could not compare their performance because they used different types of datasets and performance measures. In the searching phase, we collected studies from the six most extensively used digital libraries (Scopus, Web Of Science, ScienceDirect, IEEE, ACM, and Springer). Still, some resources might have been left out. Thus, we cannot claim that our search strategy covered all relevant studies. Moreover, our selection was conducted in two phases: first, we selected studies based on title and abstract, and second, we selected studies based on the full text of each study. However, there is still a possibility that a suitable study was excluded by mistake. We would also like to mention that we faithfully reported the experimental results of the reviewed studies, but we did not run their models and we did not reproduce their experiments.

Conclusions
Web applications are prone to many security threats. Therefore, many techniques have been proposed to detect and prevent web attacks. This study presents an SLR of the existing DL-based web attacks detection research papers. Initially, we searched for and selected journal and conference papers focused on DL-based detection of web vulnerabilities that were published from 2010 until September 2021. Afterward, we selected relevant studies based on the title, abstract, and content. After the selection phase, we obtained 63 Primary Studies (PS), on which we applied a non-eliminatory quality analysis. We studied the selected PS from different perspectives and synthesized the results of their studies using different synthesis and visualization techniques.
Based on this analysis, we have learned different lessons from which we have inferred interesting research opportunities for future work in the DL-based web attacks detection domain:
• Generate standard public real-world datasets: it is important to generate web attacks detection datasets to resolve the current datasets issues. Indeed, most researchers used private datasets. Additionally, public datasets do not reflect the complexity of real-world web applications and do not include newly discovered web attacks. Moreover, when the same public dataset is adopted, different portions are used for training and testing, which complicates the comparison between approaches. Therefore, there is a need for standard realistic public datasets to allow the research communities to contribute efficiently in this field, to facilitate comparative analyses between research works, and to make the proposed models applicable in real-world web applications.
• Consider the standard classification performance metrics: some studies used a single performance metric (e.g., Accuracy) to evaluate their detection model, which is not sufficient in the case of imbalanced datasets. Moreover, most reviewed studies do not consider computational overhead as an evaluation metric, although it is a very critical issue in real-world web applications. Moreover, few researchers provided a detailed report of false alarms (FPR), which is an important point because it helps in model retraining while saving analysts time. In addition, most studies reported their results, but none of them outlined the threats to the validity of the experiments. Therefore, it is important to identify standard performance metrics that researchers should consider in order to ensure that the proposed models are accurate, cost-effective, reliable, and reproducible.
• Explore advanced DL models in the field of web attacks detection: the existing DL-based web attacks detection literature lacked some advanced DL models. In particular, applying Generative Adversarial Networks (GAN) and Encoders-Decoders to web attacks detection is interesting because they have been successfully exploited in a similar domain, namely Networks Intrusion Detection (e.g., Refs. [92-95]).
• Support online learning: almost all reviewed studies adopted offline learning in building their detection models. In both offline and online learning, the model is trained using batch algorithms (i.e., the cost function is computed over a group of instances). However, while in offline learning the model is tested and validated using batch algorithms, in online learning the model is tested using real-time data; thereby, the cost function is re-evaluated over a single data instance at a time. In industry, models trained online are preferred over models trained offline because the latter are generally evaluated against outdated and simplistic datasets. Therefore, it is important that researchers give more consideration to online learning in order to reduce the gap between research and industry in the DL-based web attacks detection area.
• Support adaptive incremental learning: according to a recent study conducted in 2020 [97], 42% of attacks are zero-day attacks (i.e., new or unknown attacks), while 58% are based on known vulnerabilities. As with traditional detection systems, Machine Learning-based approaches have a difficult time detecting zero-day attacks because they rely on past and known attacks. Thus, it is fundamental to constantly retrain ML models to account for these attacks. However, retraining models from scratch is time-consuming and computationally intensive. Thus, incremental learning is essential, as it will allow updating trained models as new data is generated.
• Generate a corpus for web attacks detection: although DL-based web attacks detection problems are similar to Natural Language Processing (NLP) problems, there exists no corpus yet that can help to develop NLP-like models for web attacks detection problems.
• Define a standard and transparent research methodology: it is important to define a transparent research protocol that researchers should follow when proposing a DL-based method for the detection of web attacks. Such a methodology will improve the quality of research works and facilitate their comparison.
• Develop a common framework for comparing DL-based web attacks detection models: there is a large diversity in performance measures, datasets, and platforms used in reviewed studies, which makes a comparative analysis between research works difficult if not impossible. Therefore, it is fundamental to provide a standardization of datasets, performance metrics, environments, as well as a transparent research methodology that allows comparing the different approaches and evaluating the models' suitability for real-world web applications.
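To illustrate why a single metric such as Accuracy is not sufficient on imbalanced datasets, the following minimal sketch computes several standard metrics from a confusion matrix. The class proportions and the trivial "always benign" predictor are assumptions made for this example and do not come from any reviewed study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 = attack and 0 = benign."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def safe_div(num, den):
    """Division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def report(y_true, y_pred):
    """Compute accuracy, precision, recall, F1, and false positive rate."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "fpr": safe_div(fp, fp + tn),
    }

# Toy imbalanced test set: 990 benign requests and 10 attacks.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000  # a model that never raises an alert
scores = report(y_true, y_pred)
```

On this toy set the accuracy is 0.99 even though the recall is 0, i.e., not a single attack is detected. Reporting recall, precision, F1, and FPR alongside accuracy makes such a failure visible.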
Author Contributions: R.L.A. and E.H.N. contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Figure 1. Overview of web architecture.
LDAP: Lightweight Directory Access Protocol; OS: Operating Systems; RFI: Remote File Inclusion; LFI: Local File Inclusion; DT: Directory Traversal.
LSTM: Long Short-Term Memory; GRU: Gated Recurrent Unit; RNN: Recurrent Neural Networks; CNN: Convolutional Neural Networks.

Figure 4. Classification of Deep Learning techniques.

Figure 6. Search and study selection process.

Figure 6 explains our search strategy, which yielded an initial set of 290 PS. Then, we apply additional filtering that aims at:
1. Getting rid of duplicate PS.
2. Applying inclusion and exclusion criteria to determine the relevant PS.
3. Performing a quality assessment of selected PS.

Figure 8. Percentage of studies published in journals and conferences.

Figure 9. Histogram of datasets used in selected PS.

• KDD-Cup99: The dataset contains 41 features. It is available in the following three versions: (i) complete training set, (ii) 10% of the training set, and (iii) testing set. It is mainly used for building networks intrusion detection models.
• UNSW-NB15: The dataset combines actual modern normal activities and synthetic contemporary attack behaviors. It has nine types of attacks, which are mostly related to network intrusion. The training dataset includes 175,341 instances, whereas the testing set includes 82,332 instances. It is also mostly used in networks intrusion detection.
• CICIDS-2017: The Canadian Institute for Cybersecurity created this dataset. It has 2,830,540 distinct instances and 83 features containing 15 class labels (1 normal + 14 attack labels). The dataset contains only 2180 web attacks instances, which means it is insufficient for evaluating a web attacks detection model.
• CSIC-2010: The dataset contains the generated traffic targeted to an e-commerce web application. It is an automatically generated dataset that contains 36,000 normal requests and more than 25,000 anomalous requests (i.e., web attacks).
• ECML/PKDD 2007: The dataset is part of the ECML and PKDD conferences on Machine Learning. The dataset contains 35,006 normal and 15,110 malicious web requests. The dataset was developed by collecting real traffic, which was then processed to mask parameter names and values by replacing them with random values.

Figure 10. Histogram of frameworks and platforms used in selected PS.

4.4. RQ4: What Performance Metrics Are Used in DL-Based Web Attacks Detection Literature?

Figure 11. Histogram of performance metrics used in selected PS.

4.5. RQ5: What Are the Feature Selection and Extraction Approaches Used in DL-Based Web Attacks Detection Literature?
Although Deep Learning performs automatic feature selection and extraction during model training, in the context of web attacks detection the model input is textual in most cases, which means that the input must be processed prior to model training. In this research question, we shed light on the feature selection and extraction approaches that are used to process the input to DL-based web attacks detection models. Feature engineering is an essential step in the construction of Machine Learning models. It includes feature selection and feature extraction. Feature selection starts from a set of attributes and retains the most relevant ones. Feature extraction starts from a set of attributes and derives attributes intended to be informative and non-redundant. According to the reviewed papers, we can group feature selection and extraction approaches into three categories.
(1) In intrinsic feature selection and extraction, features are selected and extracted in the course of model training; thereby, the feature selection and extraction are performed by the DL classification model. In this category, the model input is either a set of numerical features likely selected using a feature selection method, or a result of a simple conversion from textual to numerical format.
(2) Extrinsic and intrinsic feature selection and extraction consists of applying external feature selection and extraction methods on the input; then, the resulting features are fed to the DL classification model, which extracts and selects more abstract and complex features during training. In this category, the external feature extraction methods are more sophisticated; they are either based on techniques that are employed in Natural Language Processing (NLP) problems, namely word-level and character-level embedding, or on manual feature extraction (i.e., features can be extracted using automatic tools, but these tools are built according to experts' instructions).
(3) In extrinsic feature selection and extraction, feature selection and extraction are external to the classification model construction. It can involve text classification techniques, manual feature extraction, and/or DL models. In this category, a traditional Machine Learning model is used for classification.
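The "simple conversion from textual to numerical format" mentioned in category (1), which is also the usual first step before a character-level embedding in category (2), can be sketched as follows. The alphabet, the reserved indices, and the fixed length are assumptions chosen for this illustration rather than settings taken from any reviewed study.

```python
import string

# Assumed alphabet: lowercase letters, digits, and ASCII punctuation.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation
PAD, UNKNOWN = 0, 1  # reserved indices for padding and out-of-alphabet characters
CHAR_TO_INDEX = {ch: i + 2 for i, ch in enumerate(ALPHABET)}

def encode_request(text: str, max_len: int = 16) -> list[int]:
    """Convert a request string to a fixed-length sequence of character indices."""
    indices = [CHAR_TO_INDEX.get(ch, UNKNOWN) for ch in text.lower()[:max_len]]
    return indices + [PAD] * (max_len - len(indices))

encoded = encode_request("/a?id=1")
```

The resulting integer sequence has a fixed length, so it can be fed directly to a DL model or passed through an embedding layer that maps each index to a dense vector. Requests longer than `max_len` are truncated, which is one of several possible design choices.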

Figure 12. Histogram of feature selection and extraction categories used in selected PS.

Figure 13. Histogram of feature selection and extraction methods used in selected PS.

4.6. RQ6: What Classification Models Are Used to Detect Web Vulnerabilities?

Figure 14. Histogram of classification models used in selected PS.

Figure 15. Histogram of web attacks types used in the selected PS.

Figure 16. Percentage of studies proposing binary or multi-classification models.

4.8. RQ8: What Is the Performance of DL-Based Web Attacks Detection Models?

Table 2. Quality assessment questions.

Table 3. Quality assurance scores of selected PS.

Table 5. A summary of experiments conducted in selected PS.
• Bridge the expertise in web application security and expertise in Machine Learning in order to build theoretical Machine Learning models tailored for web attacks detection: most existing DL models were designed with other applications in mind. For instance, Convolutional Neural Networks and Recurrent Neural Networks were originally developed to answer the specific requirements of image processing and Natural Language Processing problems, respectively. Additionally, because each web application has its own business logic, it is interesting to have a theoretical Deep Learning model for detecting web attacks without a need to learn each web application separately.
• Support secure learning: Machine Learning models are prone to adversarial attacks. In such attacks, attackers evade intrusion detection systems by exploiting the underlying Machine Learning model. For instance, an adversarial attack can disrupt the model training by contaminating training data with malicious data, thereby tricking the detection model into misclassifying malicious web requests as benign.