PF-TL: Payload Feature-Based Transfer Learning for Dealing with the Lack of Training Data

The number of studies on applying machine learning to cyber security has increased over the past few years. These studies, however, are facing difficulties with making themselves usable in the real world, mainly due to the lack of training data and reusability of a created model. While transfer learning seems like a solution to these problems, the number of studies in the field of intrusion detection is still insufficient. Therefore, this study proposes payload feature-based transfer learning as a solution to the lack of training data when applying machine learning to intrusion detection by using the knowledge from an already known domain. Firstly, it expands the extracting range of information from header to payload to accurately deliver the information by using an effective hybrid feature extraction method. Secondly, this study provides an improved optimization method for the extracted features to create a labeled dataset for a target domain. This proposal was validated on publicly available datasets, using three distinctive scenarios, and the results confirmed its usability in practice by increasing the accuracy of the training data created from the transfer learning by 30%, compared to that of the non-transfer learning method. In addition, we showed that this approach can help in identifying previously unknown attacks and reusing models from different domains.


Introduction
With the advance of information technology, cyberattacks are becoming more intelligent and mass-produced, overwhelming the detection, analysis, and response abilities of traditional security approaches [1,2]. Thus, the number of studies applying artificial intelligence technology to the cybersecurity field is increasing [3][4][5]. Among these studies, intrusion detection is one of the particular fields where machine learning is showing higher detection rates and fewer false positive cases than the conventional signature-based detection methods [5][6][7][8].
In order to understand how machine learning works better than signature-based methods, one must first thoroughly understand the structure of the data of a common intrusion detection event. These data consist of two main fields (header and payload), shown in Figure 1. The header contains network information and the flow of the source IP and destination IP, while the payload contains the server and user information as well as data on requests and responses [9]. As cyberattacks continue to evolve, the number of cases where potential threats are hidden within payloads is increasing.
As demonstrated above, the traditional signature-based detection system uses an attack pattern, such as "admin or 1=1", to detect SQL injection, despite the user information in the user-agent section implying that it is, indeed, an attack via an automated attack tool (sqlmap) [10]. This is the epitome of a situation where the decision regarding intrusion should be made based on the attack information, such as attack pattern and related characters as well as domain knowledge, such as URL information, body, user-agent, referrer and so on. Recent studies have expanded the detection area from the header to payload to maximize accuracy and minimize false positives [8,11,12].
Torrano-Gimenez et al. used n-gram to extract features from the payload [13], and Betarte et al. extracted text-based features from the payload based on professional opinions [14], while Min et al. extracted statistical, vectorized features from the payload. These studies demonstrate the expanding the detection area from the header with limited information to the payload with rich information can prove to be successful not only in the conventional signature-based intrusion detection system, but also in the field of machine learning for the same field. However, the application of such supervised machine learning requires sufficient training data before being deployed into production [15]. Obtaining these training data requires a lot of resources and professional manpower, especially when one tries to collect data that contain the payload. Even after collecting enough labeled data and, thus, building a model with good performance, it is still difficult to reuse the model in other domains [3,6,16].
One of the machine-learning technologies to solve this problem is transfer learning, which allows information learned from an existing domain to be transferred to another domain [17][18][19][20]. Transfer learning is widely applied to natural language processing and visual recognition, but studies using transfer learning in intrusion detection are lacking.
Recently, studies have attempted to use transfer learning in malware classification [21,22] and network intrusion detection [23][24][25]. Wu et al. used deep learning for feature extractionbased transfer learning, while Zhao et al. proposed the HeTL approach to detect unknown attacks, using already-known detection techniques. However, their applicability in the real environment is insufficient due to the use of structured datasets [5], such as KDD99 [26] and NSL-KDD [22], or the limited experiments through hypotheses within the same dataset [27].
Therefore, we propose payload feature-based transfer learning (PF-TL) to solve the shortage of training data in the field of intrusion detection. For our proposal, the following points were considered.
Firstly, by limiting the domain subject of this study to an intrusion detection event for webservices [14,[28][29][30][31][32][33], transfer learning could be applied since intrusion detection events share similar characteristics with a variation of already-known web attacks. Secondly, the method to select and extract the features of information were carefully chosen. Well-known signatures were used to extract the characteristics of the attacks, while text vectorization was used to extract unstructured domain knowledge, such as user and application information [34,35]. We call this the hybrid feature extraction method. With this new method, we aim to transfer the detection methodology of the source domain to the target domain [36][37][38]. In addition, the transfer learning algorithm, based on heterogeneous features seeking optimized space between the two domains, is applied to enhance accuracy [18][19][20][21][22][23][24][25].
This study makes the following contributions. First, the proposed transfer learning approach makes it easier to obtain optimized label data for the target domain, which saves the time and manpower needed in creating training data (Section 5).
Second, an effective feature extraction method for transfer learning between intrusion detection domains is presented. Moreover, feature extraction was extended to the payload, including headers, the signature-based feature in the payload, and text vectorization-based feature methods (Section 3).
Third, an improvement over the optimization method and the manual configuration of HeTL suggested in another study [25] is presented. This leads to optimized transfer learning for the web application intrusion detection field (Section 4).
Fourth, a benchmark against the HeTL using the training datasets from the same domain, which in turn proved the necessity of PF-TL, is presented and validated using two publicly available datasets and one real-world dataset (Section 5).
Fifth, we showed that it is possible to use the training data generated by the proposed approach to identify the new types of attack detection that do not exist in the training data, and therefore demonstrated that the model could be reused in other domains (Section 5).

Preliminaries
In this section, we will first describe transfer learning and then explain the intrusion detection system.

Transfer Learning
Traditional machine learning assumes that training and test data are imported from the same domain so that the feature spaces and the distributions of two datasets are similar [17][18][19][20]. Training data are often difficult to obtain and time-consuming and expensive to collect [39]. Therefore, when applying machine learning to other domains, training with data that can be obtained more easily from a particular domain and using the knowledge gained from it is desirable. This methodology is called transfer learning [18][19][20]40].
When implementing transfer learning, the following points should be considered. First, what source domain information is useful? Second, what is the best way to send this information to the target domain? Third, what can be done to avoid sending unnecessary information to the desired results? 2.1.1. Categorization According to Similarity between Domains and Feature Space Transfer learning can be categorized into homogenous and heterogeneous transfer learning, depending on the similarities between source and target domains and the feature space in between [20].
Homogenous transfer learning [20] is used in the same kind of transfer learning for which the feature space of the data in the source and target domain is represented by the same feature (X s = X t ) and label (Y s = Y t ), and the space itself comprises the same dimension (ds = dt). Therefore, this method focuses on closing the gap in the distribution of data between domains, which has already been experienced in the cross-domain transition.
As for heterogeneous transfer learning [20], the source and target domain may not share features (X s = X t ) and labels (Y s = Y t ), and the size of the feature space may also vary. Therefore, this method requires the transformation of feature and label spaces to bridge the gaps between knowledge areas and to deal with differences in data distribution between domains. This method is even more difficult because little expressive common points are present between domains. That is, knowledge in the source domain is available, but it can be displayed in a different way than the target domain, and the most important thing is its extraction method. Based on the transition content, transfer learning can be divided into four categories (instance-, feature-, parameter-, and relational-based) [20].
Instance-based transfer learning assumes that a particular part of the data in the source domain can be re-weighted and re-used in the target domain for learning. In this situation, instance weight and importance can be sampled.
Feature-based transfer learning can be applied to both homogenous and heterogenetic transfer learning. It aims to reduce the feature distribution gap between the feature spaces of the source and target domain through a "good" feature.
Parameter-based transfer learning is applied by extracting each shared parameter and priority between the source and target domain models.
Relational-based transfer learning forms transfer learning for the relational area. The underlying assumption for this learning is that some relations between the source and target domain data are similar. Thus, the knowledge to be transmitted is the relation between the data. The recent statistical relation learning technique is similar to this one [20].

Intrusion Detection
Intrusion detection aims to identify various types of malicious activities on a network and computer. How the system responds to the identified malicious activities differs in IDS and IPS; IDS responds by passively monitoring the activities, while IPS actively restricts the traffic of the activities. These intrusion detection techniques are continuously developed and can be largely divided into the signature-based intrusion detection system (SIDS) [5] and anomaly-based intrusion detection system (AIDS) [5].

Signature-Based Intrusion Detection
SIDS uses pattern matching technology to find known attacks. This is also known as knowledge-based detection or misuse detection [5,41]. The main concept is to build an intrusion signature database, compare it with the data in the packets, and raise an alert if a match is found. SIDS generally shows excellent accuracy in detecting previously known intrusions, but it shows a poor detection rate when a new attacks occurs until the signature database is updated [42].

Anomaly-Based Intrusion Detection
AIDS uses machine learning as well as statistics-and knowledge-based methods to overcome the limitations of SIDS. The main concept is that malicious and benign behaviors are different. Therefore, this detection comprises two phases that define normal behavior: a learning phase and a test phase. The advantage of AIDS is that it can identify zero-day attacks because the recognition of abnormal user activities does not rely on the signature database [42].

Overall Process of Proposed Approach
In this section, we propose payload feature-based transfer learning (PF-TL) as described in Figure 2 in which the intrusion detection knowledge is transferred from a source domain to a target domain in order to train an unlabeled dataset.

Optimized Latent Target Domain
Target Domain

Target Domain
Step 1 Step 2 Step 3

Projective Space T'
Step 1 : A hybrid Feature Extraction for identifying attack and domain characteristics in payload Step 2 : Optimizing latent subspace for both source and target domain data Step 3 : source classification model is constructs Labelded target Domain dataset The source domain contains labeled intrusion detection data that can be used as training data. The target domain contains unlabeled intrusion detection data that can be categorized as test data. The source and target domain data comprise attack and normal data and we assume that the source domain is labeled by an attack type and the target domain is unlabeled.
This proposal is comprised of the following steps.
Step 1: Features are extracted from the header and payload data of the source domain and target domain, using both attack feature extraction and text vectorization, in order to effectively transfer the attack knowledge.
Step 2: An optimized latent subspace between the source domain and target domain's feature space is created, using PF-TL on the features from the extracted payload. These optimized latent subspaces are then used to create an optimized source domain and optimized target domain datasets.
Step 3: The optimized source classification model is created through the optimized source domain dataset created in Step 2. It uses the Random Forest, SVM, MLP, and KNN algorithms [2] as the model classification and selects a titration algorithm. The subsequently generated model (optimized source classification model) can then be used to obtain the target domain (labeled) that is formed by adding a label to the target domain (unlabeled). This allows labeled data to be generated, even in the target domain where training data are limited, and machine learning can be applied through the subsequently created model.

Hybrid Feature Extraction for Identifying Attack and Domain Characteristics
In this section, we propose a payload-based feature extraction method, which enhances the knowledge transfer between two domains for transfer learning.
Our proposal includes not only the statistical feature extraction of header-the common approach used in many research-but also a hybridized feature extraction method that extracts both attack features and domain characteristics from the payload as shown in the Figure 4. The attack features are extracted from the payload based on the keyword or special characters commonly used in the attacks [14,37,43]. The domain characteristics are extracted using latent semantic analysis (LSA) on the vectorized payloads.
In order to do so, the collected source data must include a payload of the intrusion detection event, which contains a variety of information. As shown in the Figure 3, the Electronics 2021, 10, 1148 6 of 21 header of the source data includes the IP, port, packet size, etc., of a specific network connection. The payload is comprised of text data that contain the signature used in a detection system as well as related domain information, such as application, server and users [9].  First, the payload can be distinguished according to the characteristics of the service. In particular, in the case of web service, it can be divided into the URI, query, body, etc., within the payload. URI refers to the area up to the previous part of the string "?" that contains a specific location for resources on a particular server, including domains. Query refers to the area after "?" that contains the remaining area from the URI area. Body means the area containing head information, excluding the URI and Query areas. Payloads divided, such as those in Figure 3 (URI, Query, and Body), are more representative of the characteristics of each domain (Step 2).
Therefore, for each payload area, the feature required for transfer learning can be extracted through the following tasks (Step 3).
First, as proposed by Latha et al. [44], to extract the characteristics of each type of attack from the payload, the keywords and characters used for each type of attack, identified by security experts, were extracted to create the feature, shown in Table 1. Table 1 is similar to the character extraction method used by experts to determine attacks when using a signature. It differs from signature-based detection, as it does not simply consider one pattern but considers various signature combinations for related types, such as that in Table 2, extracting them to values such as frequency and length.  [44].  Such extracted features show that even within the same dataset, the distribution of features can differ by event, and even in the same type of attack, the distribution of features differ depending on the domain. In addition, even for features that characterize the type of attack, feature distribution varies depending on the characteristics of the classified fields.

Attack Class
The second is the feature extraction method for domain characteristics.
In order to apply LSA on the payload as a feature extractor, the payload data must be vectorized first. In this paper, we used term frequency-inverse document frequency (TF-IDF) [45] to achieve that. Nevertheless, since the payload often consists of seemingly unmeaningful combination of characters instead of general words and sentences, we obtained the vectorized values, using a hash of a limited size of characters instead of words.
Furthermore, it removes noise by applying a truncated SVD for LSA and extracts only the characteristics.
Using this method, 593 and 300 features have been extracted herein based on the characteristics of the domain information. Moreover, some additional common information was used to extract the method, version, and host related information for HTTP as eight binary values, and the information about the character format for URI and Query areas was also extracted as a feature.
The extracted features, using the methods suggested in this study, seemed to characterize the intrusion detection events better than the existing methods.

Optimizing Latent Subspace for Both Source and Target Domain Data
This chapter proposes feature-based transfer learning [25] to optimize the space and distribution of features between source and target domains based on the payloadbased extracted features in the previous chapter. This helps in obtaining optimized latent subspaces S and T, as they are similar to each other. In addition, we can create a model using optimized latent subspace S and make predictions with optimized latent subspace T to obtain labeled data of target domain T [46].
The optimization methods suggested in this chapter are shown in Figure 4. We apply principal component analysis (PCA) to extract characteristics only for given S and T and to obtain S and T with a projection of the same size. Optimization is performed through similarity parameters β and distance measures D( * , * ) for the two S and T to obtain V s and V t , respectively, which are similar to each other.  To summarize, the issue of optimizing feature-based similarity for the two domains that have been projected by PCA can be considered as the well-known Equation (1). However, here, l( * , * ) is the distance measure for the two given matrixes.
By introducing the distance measures l( * , * ) and D( * , * ), used in Zhao et al. [41], Equation (1) can be expressed as Equation (2). However, here, P s is a projection matrix that converts V s to the S coordinate system, and P t is a projection matrix that converts V s to the T coordinate system.
Therefore, the optimal value is obtained by applying AdaDelta [47], which is quantified for Equation (2) to obtain the optimization goals of similar source domain V s and target domain V t .
First, the partial differential coefficient for each application of Equation (2) is as follows: Here, if the variables V s and V t , which we have final interest in, are fixed, the optimal values of Equations (5) and (6) are zero, and accordingly, P s and P t are arranged into the following: Therefore, it is possible to obtain V s and V t , by repeating optimization through AdaDelta, along with Equation (7). However, since existing AdaDelta performs optimization for each individual element, it does not consider the relation between matrix elements for V s and V t . Thus, to consider the relation between matrix elements for V s and V t , AdaDelta is quantified and applied in a matrix-approximation-based optimization method by introducing the following.

1.
All operations are calculated based on matrix operations.

2.
Initialize with random number N (0, 1) for the initial value of V s and V t . 3.
Estimates of changes (g s , g t , z s , z t ) are estimated using Frobenius Norm.

4.
Then, select the relation that is orthogonal to each other through QR decomposition as the result value for V s and V t , either at the initial or calculated value.
The finally proposed feature-based transfer learning algorithm can be presented as shown in Algorithm 1 below. Algorithm 1: Optimization performance algorithm.

Experiments and Evaluation
In this chapter, experiments were conducted and evaluated through three scenarios, as shown in Table 3 below, for transfer learning, which is proposed to solve the shortage of training data through a knowledge transfer between the two domains. In addition, in the case when transfer learning is not used, the performance comparison between No-TL and heterogeneous-based transfer learning, proposed by Zhao et al. [24], evaluates the superiority of the transfer learning proposed in this study. Thus, the accuracy and F1-score are compared for the labeled data generated from each experiment. In the first scenario, experiments are performed when labels are produced for different types of attacks, using the labels granted for a particular type of attack, provided that the distributions between the features differ due to the transfer learning according to the type of attack that occurred in the same dataset.
The second scenario is the same model, but it assumes that the distribution between each feature is different for different equipment due to the transfer learning that occurred in the different datasets. The labeled, well-known source domain of the existing equipment is used to apply to the other target domain of the same model.
In the third scenario, experiments are performed on the dataset in the real environment. In addition, in Section 4.5, a performance comparison and parameter sensitivity analysis of various algorithms are performed.

Dataset
The selection of the dataset used in this experiment considered the following conditions to implement the three scenarios presented in Table 3: the dataset's intrusion detection area, presence of labels, multi class, payload, etc. Two of the public datasets in various fields of intrusion detection, such as those in Table 4, and the web application firewall dataset in the real environment were selected as datasets to satisfy this requirement. The selected datasets are PKDD2007 (ECML PKDD2007 Discovery Challenge dataset) [48], CSIC2012 (HTTP CSIC Torpeda 2012 dataset) [49], and WAF2020 (web application firewall real dataset) [50].

PKDD2007
The PKDD2007 dataset was created in 2007 through Challenger for web traffic analysis at the 18th European Machinery Conference (ECML) and the 11th European Conference (PKDD) on knowledge discovery principles and practices in databases. As part of Challenger, the dataset was provided with normal traffic and seven types of attack traffic [48,51].
The dataset included 35,006 requests as normal traffic and 15,110 requests as attacks. The PKDD2007 dataset recorded and generated traffic and processed some information, including replacing all URL, parameter, and values with randomly strings. The seven types of attacks were Normal, SQL Injection, Path Traversal, XSS, Command Extraction, XPATH Injection, LDAP Injection, and SSI Attack. The dataset is in the XML file format, comprising reqContext, class, and request; request is divided into Method, Protocol, URI, Query, and Headers. The CSIC2012 dataset was presented in the TORPEDA Framework in RECSI2012 in 2012. The TORPEDA framework is used for generating labeled web traffic for the evaluation and testing of web application firewalls (WAF) [49][50][51][52][53]. The data comprises 10 classes, including 8363 requests classified as normal, 16,456 requests classified as anomalous, and 40,948 requests classified as attacks. The 10 types of attacks were Normal, SQLi, Format String, Buffer Overflow, XPath, XSS, CRLFi, SSI, LDAP, and Anomalous. The dataset is in the XML file format, comprising label and request. Request is divided into Method, Protocol, Path, Headers, and Body.

WAF2020
The WAF2020 dataset is the data collected over a year from the web application firewall (WAF) established and operated by Company A's SOC [50]. The dataset was detected in the WAF based on users that accessed the portal-based web application environment. The dataset was divided into 13 types of attacks, among which the false-detection event that detected a normal user as an attack was selected as Normal. The data comprised 10,000 requests as Normal and 57,407 requests as attack. The thirteen types of attacks were Normal, Default Page Access, HTTP Method Restrictions, Directory Access, URL Extension Access, Command Injection, XSS, SQL Injection, Header Vulnerability, Application Vulnerability, SSI Injection, LDAP Injection, and Directory Listing. The dataset is in csv file format and is divided into Method, Protocol, URI, URI_Query, Body, etc.
In particular, the attack type of each dataset was defined with a different name and similar attack types were matched and sorted out for experimentation. Selected attack types were Normal, XSS, SQLi (defined as SQL Injection in PKDD2007), LDAPi (defined as LDAP Injection in PKDD2007), SSI (defined as SSI Attack in PKDD2007), OS Commander.

Data Preprocessing
At this stage, the data are structured and preprocessed so that the collected datasets (PKDD2007, CSIC2012, and WAF2020) can be applied to machine learning.
Pre-processing steps are performed in the following order: Normalization, Field Selection, Feature Extractor and Selection, and Sampling, as shown in Figure 5.

Normalization
Among the datasets collected, PKDD2007 and CSIC2012 are in the form of unstructured XML, shown in Sample 1 [54] below, and WAF2020 is composed of structured data. To start with, we equally perform normalization with Method, Version, URI, Query, and Body for the applicable dataset. In particular, values that represent user information, etc., are included in the Body field. Then, the letter "\n" will be removed and URI decoding will be applied for URI, Query, and Body.

Field Selection
For each dataset, we select the field that is going to be used for the experiment. In PKDD2007 and CSIC2012, the category type can be divided into Class, Method, and Version, and the text type selects URI, Query, and Body. At this time, missing values occur if a Query does not exist in the generated data of CSIC2012. Therefore, the missing values are batch-processed with "?" for the field.

Feature Extractor and Selection
Feature extraction is performed with signature-based feature extraction and textvectorization-based feature extraction according to the selected field. Signature-based feature extraction first performs numericalization for Method and Version. Method is expressed as three features with numerical values of "0" and "1" depending on the existence of GET, POST, and PUT, and Version is expressed as two features with numerical values of "0" or "1" depending on the existence of HTTP/1.0 and HTTP/1.1. Second, characterbased features by attack type are extracted as numerical features in URI, Query, and Body fields. Hence, eight common-related features, 196 URI-related features, 196 Queryrelated features, and 196 Body-related features are created. Text-vectorization-based feature extraction performs text vectorization for URI, Query, and Body characters using LSA, which compresses the hashing vector, comprising 10,000 features, into 100 features again. As a result, URI, Query, and Body strings are converted into features with a size of 100 each. This merges the extracted features into an index basis and will eventually produce a dataset with 901 features.

Sampling
After preprocessing, each dataset is randomly sampled so that it contains 1000 samples from the attacker's label and 4000 samples from the normal label for experiments by scenario.

Evaluation Environment & Metrics
This experimental environment was implemented using Python in Ubuntu 18.04.2 LTS. The classical machine learning library Scikit-learn 0.20.4 was used. Hardware specifications include NVidia GeForce RTX 2060 * 2 for GPU, 128 GB RAM, an 8 TB hard disk, and an AMD Ryzen Threadripper 1900X 8-core processor environment.
Accuracy and F1-score for the predicted target data after transfer learning were selected as the evaluation method for this experiment. The three scenario approaches (No-TL, HeTL, and PF-TL) presented in Table 3 were evaluated using Random Forest, SVM, MLP, and KNN as classification algorithms. To calculate the selected evaluation method, confusion matrix metrics were used, which are commonly used in machine learning. As shown in Table 5, the confusion matrix [55] comprises four information points. Based on these four information points, four scales can be evaluated, as shown in Table 6 [55]. First, accuracy is a measure of evaluating how accurately a model classifies. Second, precision is a measure of evaluating how reliable the results that were predicted through the model are. Third, recall is a measure compared to precision, which can indicate how well the predicted results reflect the actual results. In other words, it is a measure related to the practicality of the model. Fourth, F1-score is a number that divides the product of precision and recall by the sum of the two and then multiplies by two. Therefore, in this study, we are going to measure performance through the accuracy and F1-score of the model.

Experimental Result
As presented in Table 3, the experiment was conducted through three scenarios, and results were obtained.

Scenario 1: Transfer Learning Comparison for Different Attack Types on the Same Equipment
The first scenario is based on "Can we accurately detect when a new type of unknown attack occurs?" in a security system. In other words, the dataset being targeted is the same, the feature space between domains is the same, and the feature distribution is different.
To test this, the PKDD2007 dataset was used, the web application intrusion detection dataset was set to be identical, and the source and target domains were set differently for each type of labeled attack. The feature space used the 901 features presented earlier. Feature distribution is differently distributed for each detection event and for HTTP header and request values.
For the types of attacks used in the experiment, SQLi, LDAPi, XSS, and SSI in the PKDD2007 dataset were selected as four types of attacks that are uniformly distributed, and source and target domains were set up and experimented, as shown in Table 7. In addition, for more accurate results, the experiments were performed by changing the target domain type (LDAPi, XSS, and SSI) to the same type of source domain (SQLi), as shown in Table 8.  As seen in Tables 7 and 8, the experimental results show that the predictive model performance for different types of attacks in the same dataset is improved, as No-TL < HeTL(2017) < PF-TL in accuracy and F1-score. In particular, the accuracy and F1-score vary widely by type when transfer learning is not used. However, the application of transfer learning shows that the deviation is not significant. This enables transfer learning to increase the training dataset for the deficient attack types when the variation in the amount of training data by attack type is significant. This can also be used by transfer learning to increase detection accuracy when events that are not included in the training data occur.

Scenario 2: Transfer Learning Comparison for Attack Types from other Equipment
The second scenario is the view on whether a well-trained model can perform equally well in other network environments. In other words, the target dataset is different, the feature space between domains is the same, and the feature distribution is different. To test this, the PKDD2007 dataset, which is a web application intrusion detection dataset, was selected as the source domain, the CSIC2012 dataset was selected as the target domain, and 901 features were used in common, as shown in Scenario 1 for feature space.
The results of Table 9 show that the accuracy and F1-score performed better when transfer learning was applied compared to when it was not applied. The results show that accuracy is high, but the F1-score is low when transfer learning takes place from XSS to SSI. This suggests that the performance was very poor in recall or precision. This also shows that the use of transfer learning reduces bias against data, which increases model performance, such as the F1-score. Moreover, while the previous model does not perform well when applied to other security equipment, it can be used for other security equipment if the training data are optimized for the target security equipment through transfer learning. The third scenario is the view on whether what was done in Scenarios 1 and 2 can be well applied in the real environment. To this end, we applied the web application firewall dataset WAF2020 collected in the real environment to Scenarios 1 and 2.
First, Scenario 1 was applied to the WAF2020 Dataset. As shown in Table 10, the use of PF-TL in the real environment shows high accuracy and F1-score. In particular, if the domain applies the model without using transfer learning from OS Commander to SQLi, it shows a low accuracy of 0.21; however, when transfer learning is applied, high accuracy of 0.99 or higher is obtained. Second, transfer learning was tested with WAF2020 for models created in PKDD2007+ CSIC2012, as shown in Scenario 2. Table 11 contains a new dataset called PKCS, created by merging existing common datasets (PKDD2007 and CSIC2012) and applying the model obtained from this to the WAF2020 dataset. As a result, an accuracy of lower than 0.53 was obtained when transfer learning was not applied, whereas a 0.99 or higher accuracy was obtained when transfer learning was applied. This shows that transfer learning can be applicable to data in the real environment by applying it to the existing public dataset.

Limitations and Discussion
In this study, three different scenarios for domain knowledge transfer were used to create optimized training data for target domain, using payload feature-based transfer learning. In addition, by comparing with No-TL and HeTL, we showed that the proposed transfer learning could create the dataset that might effectively be used to improve model accuracy. Considering how the machine learning applications in intrusion detection have been dependent on the training data, the research result is very promising. Through the aforementioned three scenarios, our proposal can significantly enhance the detection accuracy of unknown attacks based on the same dataset and guarantee understandable performance on a dataset of different security devices when using the transfer learning.
Nevertheless, in order to achieve the desired outcomes, the following limitations regarding the labels of source and target domains, feature spaces and distributions, were put into the consideration [56].
First, the source and target domains must have the same feature spaces. To solve this, PCA was performed to adjust the number of features.
Second, the source and target domains must have the same number of records. Thus, the number between domains was identically adjusted through sampling.
Third, the similarity between the source and target domain is high and must be entered in pairs when being input. Hence, similarity parameter β was used and matched and was set as β = 1 for this experiment.
In this experiment, we will look into the proposed algorithm and the similarity parameter β.

Algorithm Selection
To select the best algorithm, the PKDD2007 dataset and the CSIC2012 dataset were selected as the target and source domains, respectively.
The dataset configuration used the entire datasets without distinction of attack types. The data used were applied with Random Forest, Linear SVM, MLP, and KNN algorithms to compare the transfer learning results. As shown in Table 12, the comparison shows that Random Forest has the best performance. In particular, it has the highest accuracy and a high improvement rate of 95.4% when transfer learning is applied. In addition, the performance of the KNN algorithm was excellent when transfer learning was not used, but its performance improvement was not high when transfer learning was applied. Therefore, in this paper, Random Forest was applied in the experiment.

Parameter Sensitivity
For inter-domain optimization, two hyperparameters must be set: similarity parameter β [46] and the size of new feature space k. Therefore, we decided on the sensitivity of parameters through an experiment. First, we analyzed the sensitivity to similarity parameter β.
To analyze the sensitivity to similarity parameter β, the size of the new feature space k was fixed with the same value of 100 as the control variable.
As seen in Figure 6, as the similarity parameter β increases, the similarity between the target and the source domains increases. In particular, the source and target domains are almost identical when β is greater than 0.8. In this experiment, the β value was set to 1 for more obvious results.

Comparison of Methods of Creating Training Data
While there are multiple ways to create training data, including Generative Adversarial Net (GAN) and SVM Feature selection, many studies suggest creating training data using the same datasets that are labeled together. These methods focus on creating new training datasets based on the same dataset [57][58][59][60][61]. Nonetheless, our proposal focuses on transferring data from both the same and different datasets using three scenarios.

Payload Based Intrusion Detection
Torrano-Gimenez et al. [13] proposed a method for extracting feature from the payload using n-gram to increase the accuracy of web intrusion detection. Generic feature selection (GFS) measurements were applied to decrease the number of features extracted by n-gram, and the accuracy tended to improve through classification algorithms (Random Forest, Random Tree, CART, and C4.5). Betarte et al. [14] extracted character-based features with the help of an expert's analysis in the payload, based on which classification algorithms (RF, KNN, and SVM) suggested better performance than conventional signature-based detection technology. Min et al. [62] proposed statistical methods that are important in network flows for efficient feature extraction in packet headers and payloads, as well as methods using the Random Forest algorithm after word vector followed by feature extraction via Text-CNN.
However, in these studies, a large number of training data is needed to achieve good results and learned models cannot be reused in other domains.

Transfer Learning for Intrusion Detection
Transfer learning has commonly been applied to natural language processing and visual recognition; however, its application in the cybersecurity field is insufficiently studied. Intel Labs and Microsoft researchers Chen and Parich [21] proposed a malware classification method using deep-learning transfer learning technology that has low false detection rates and high accuracy through a static malware as the image network (STAMINA). Microsoft's REA+L dataset was used as an applicable dataset and showed improvements over previous studies [22]. Wu et al. [23] proposed the TL-ConvNet model in transfer learning for network intrusion detection. Furthermore, the deep learning algorithm Con-vNet was used for feature extraction. Source domain (KDDTest+) was learned through the NSD-KDD dataset and transfer learning was implemented through the prediction of the target domain (KDDTest-21). Moreover, Zhao et al. [24,25] proposed a feature-based HeTL approach to detect unknown attacks based on known detection techniques. NSD-KDD was used as a dataset, and the source and target domains were selected and tested for each type of attack within the same dataset. Subsequent studies [25] also suggest transfer learning that uses clustering to improve the optimization of the source and target domains. However, studies of transfer learning in the field of network intrusion detection have so far been conducted using formalized datasets, such as KDD99 and NSL-KDD, and have been experimented with different types and distinctions of attacks within the same dataset. Therefore, a limitation exists in the application of transfer learning in a real network environment [63,64].

Conclusions and Future Work
In this paper, we proposed payload feature-based transfer learning as a solution to the shortage of training data in the field of intrusion detection, which extracts features from the payload of the intrusion detection event. Our proposal confirmed that transfer learning can create optimized training data for a domain that has insufficient labeled data or no label data at all. In addition, the proposal demonstrated three distinctive scenarios in which we proved the proposal's capability of detecting new types of attack events and reusing a well-trained model in another network environment by utilizing a newly created training dataset from PF-TL. Furthermore, by validating real-life datasets, we showed that our proposal can indeed be used in practice. In the future, we will cover methods to process imbalanced training data and to create big data specifically optimized for the intrusion detection field to increase the utility of the training data created from transfer learning.