A Payload Based Malicious HTTP Trafﬁc Detection Method Using Transfer Semi-Supervised Learning

cn; Tel.: +86-1366-6600-887 Abstract: Malicious HTTP trafﬁc detection plays an important role in web application security. Most existing work applies machine learning and deep learning techniques to build the malicious HTTP trafﬁc detection model. However, they still suffer from the problems of huge training data collection cost and low cross-dataset generalization ability. Aiming at these problems, this paper proposes DeepPTSD, a deep learning method for payload based malicious HTTP trafﬁc detection. First, it treats the malicious HTTP trafﬁc detection as a text classiﬁcation problem and trains the initial detection model using TextCNN on a public dataset, and then adapts the initial detection model to the target dataset based on a transfer learning algorithm. Second, in the transfer learning procedure, it uses a semi-supervised learning algorithm to accomplish the model adaptation task. The semi-supervised learning algorithm enhances the target dataset based on a HTTP payload data augmentation mechanism to exploit both the labeled and unlabeled data. We evaluate DeepPTSD on two real HTTP trafﬁc datasets. The results show that DeepPTSD has competitive performance under the small data condition.

before they can be appropriately extracted, and thus they are usually used for representing network traffic based on protocols such as TCP and UDP, and not suitable for malicious network traffic detection from single HTTP requests. On the other hand, payload based methods [8][9][10][11][12][13][14][15] detects malicious network traffic by analyzing the content of the packets (e.g., URL, file path, request body, etc.), and thus they can be used for single HTTP request. Therefore, these two types of detection methods can be used to complement each other. In this paper, we focus on the payload based malicious HTTP traffic detection method.
Most state-of-the-art payload based methods adopt learning techniques for malicious network traffic detection, including traditional machine learning techniques [8][9][10][11] and deep learning techniques [12][13][14][15][16]. They train detection models by exploiting a corpus of correctly labeled samples (including both normal and various types of malicious network traffic samples), and then utilize the detection models for real-time malicious network traffic detection. Although learning based methods could achieve high detection accuracy, they still suffer from the following problem: the learning techniques (especially the deep learning techniques) require a large number of labeled samples to train the models. However, it is very difficult to collect adequate labeled samples in practice, especially the malicious network traffic samples. Although this problem could be alleviated by training the detection models on a public dataset with a relatively large number of labeled samples and then applying the trained models to the target systems, there is a significant pattern shift between the samples collected from different systems (called the train-test gap problem), due to the unique characteristic of each individual system [17]. Applying the models trained on the public dataset to a new system usually results in performance degradation.
To address the above problems, we propose DeepPTSD, a payload based malicious HTTP traffic detection method based on deep transfer semi-supervised learning technique. The main contributions of this paper are summarized as follows.
(1) We propose a deep learning framework for malicious HTTP traffic detection. To ensure high generalization ability under the condition that the target system has very limited number of labeled samples, we leverage the labeled samples of a public dataset to train the initial detection model, and then fine-tune the initial detection model to adapt to the target system based on a transfer semi-supervised learning model. To the best of our knowledge, this is the first solution of combining deep learning with transfer learning and semi-supervised learning for by augmenting the payload dataset malicious HTTP traffic detection.
(2) We propose a HTTP payload data augmentation method based on a novel keyword avoidance perturbation mechanism. To increase the diversity of the training samples, we generate pseudo labeled samples by imposing noises to the original labeled HTTP traffic samples while keeping the key portion unchanged. The pseudo labeled and original labeled samples are then utilized collaboratively in the semi-supervised learning algorithm.
(3) We conduct cross-dataset validation on two real datasets to evaluate DeepPTSD. The results show that DeepPTSD has competitive performance under the small data condition.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 details the proposed method. Section 4 reports the experiment results. Section 5 concludes the paper.

Related work 2.1. Malicious Network Traffic Detection
The existing malicious network traffic detection methods could be divided into two categories according to the exploited information, i.e., statistic based methods and payload based methods. Statistic based methods extract flow based statistical features (e.g., duration of flow, inter arrival time, number of packets, etc.) and apply rule based or machine learning based techniques to detect malicious network traffic [1][2][3][4][5]. Although statistic based methods could adapt to data privacy and encryption, they cannot detect malicious network traffic that does not exhibit significant differences in flow patterns (e.g., SQL injection). In addition, they require a relative long period of monitoring time to extract statistical features, so they are usually designed for network traffic based on protocols such as TCP and UDP, and not suitable for malicious network traffic detection from single HTTP requests. Payload based methods detect malicious network traffic by exploiting the content of the packets (e.g., URL, file path, request body, etc.). Most state-of-the-art payload based methods adopt traditional machine learning techniques and deep learning techniques for malicious network traffic detection. Given a set of network traffic samples as training data, learning based methods train a detection model, which could classify a new network traffic sample as malicious or benign. The key factor for the learning based methods is to extract good features to represent the network traffic samples. Traditional machine learning techniques explore features from various aspects, including lexical features (e.g., Ngram features, bag-of-word features, etc.) [8,9,18,19] and statistical semantic features (e.g., number of characters, number of parameters, count of sensitive keywords, etc.) [10,11,20]. However, these features should be manually designed based on domain knowledge and require constant adjustment to accommodate changes in malicious network traffic patterns.
In recent years, thanks to the superior capability of automatically extracting representative features from raw data (especially the unstructured data), deep learning techniques have been exploited for malicious network traffic detection [16]. For example, Saxe et al. [21] proposed eXpose, which is a malicious payload detection model based on character-level embedding and CNN (convolutional neural network). Park et al. [12] proposed a malicious HTTP message detection model by using an autoencoder with character-level binary image transformation. Peng et al. [13] proposed a joint approach of CNN, LSTM, and attention mechanism to detect malicious HTTP traffic by exploiting character-level lexical features and host features. However, character-level features cannot capture the latent semantics of payloads, resulting in low generalization ability. Aiming at this problem, more recent researches design detection model based on word-level features. For example, Yu et al. [22] modeled a HTTP traffic sample as a natural language sequence under predefined vocabulary using BiLSTM (bidirectional long short-term memory) with attention mechanism. Le et al. [14] proposed URLNet, an end-to-end deep learning framework to learn a nonlinear URL embedding by applying CNN to both characters and words. Yang et al. [15] designed a convolutional GRU (gated recurrent unit) for malicious URL detection based on a predefined keyword library. However, deep learning techniques require a large number of labeled samples to train the models, while it is almost impossible to collect adequate labeled samples for each individual system, especially the malicious samples.

Small Data Learning
Small data learning is referred to the techniques designed to carry out learning on the condition of lacking sufficient training samples [23], and the main strategies of small data learning include data augmentation and knowledge transfer.
The data augmentation strategy attempts to compensate the training data by adding slightly modified copies of the existing data or newly created synthetic data from the existing data, with the purpose of reducing overfitting when training a learner [24][25][26]. On the other hand, knowledge transfer strategy tries to train a learner for the target domain by leveraging knowledge from other domains, and could be divided into three categories according to the types of transferred knowledge, i.e., representation transfer, model transfer, and cognition knowledge transfer. Representation transfer attempts to transfer the knowledge of representing the same object from different domains [27]. Model transfer uses the learner trained from other domains to fit the small training data in the target domain through fine-tuning its parameters [28,29]. Cognition knowledge transfer leverages high-level knowledge such as common sense, domain knowledge, and other concepts [30,31].
Small data learning is widely used in research areas such as computer vision and natural language processing. In contrast, the application of small data learning in malicious network traffic detection has been much more limited, especially when combined with deep learning. In our work, we propose a novel data augmentation scheme for payload data, and design a small data learning framework based on the data augmentation scheme.

Preliminary
First, we elaborate the data used for payload based malicious HTTP traffic detection. Malicious information of web attacks is usually contained in the content of the request URL (if a GET request is sent) or the request body (if a POST request is sent). If a HTTP traffic sample is a GET request, its payload is the path of the URL. If a HTTP traffic sample is a POST request, its payload is the path of the URL and the body of the request (i.e., the request parameters). To this end, a HTTP traffic sample s is defined as a tuple s = (x, y) where x denotes the payload extracted from s (in the form of a string), and y denotes the label of s (i.e., normal or the malicious type). Note that y is unknown if s is a testing sample. We give an example of HTTP traffic sample in Example 1.

Example 1.
Given a GET request with the URL as: http://www.test.com/scripts/index.php?id=1' union select 1,2,3 from admin#, the obtained HTTP traffic sample is s = (x, y), where x is the path of the URL (i.e., /scripts/index.php?id=1' union select 1,2,3 from admin#) and y is SQL injection (i.e., a type of malicious HTTP traffic).
Second, we define the problem in this paper as follows. Given a large public labeled dataset D PU and a small dataset from the target system D TS (including both labeled and unlabeled samples), the malicious HTTP traffic detection problem can be considered as learning a mapping f : X → Y using D PU and D TS . For any unseen sample from the target system s TS = x TS , y TS , f x TS and y TS should be as similar as possible. Here, X represents the input space and Y represents the set of labels.
Third, Figure 1 gives the architecture of DeepPTSD, which consists of three major steps, i.e., model initialization, data augmentation, and transfer semi-supervised learning. First, the model initialization step trains the initial model, with DPU as training data and parameters being randomly initialized. Here, we treat malicious HTTP traffic detection as a special text classification problem and utilize TextCNN as the underlying deep neural network (detailed in Section 3.2). Second, the data augmentation step enhances the dataset of the target system by generating a diverse set of samples from D TS . Here, we propose a HTTP payload data augmentation method based on a keyword avoidance perturbation mechanism for this task (detailed in Section 3.3). Third, the transfer semi-supervised learning step adapts the initial model to the target system by fine-tuning it on the augmented dataset of D TS . In this paper, we use the same deep neural network (i.e., TextCNN) for both the initial model and the model of the target system (i.e., target model). Here, we design a learning framework by integrating a semi-supervised learning framework (i.e., the UDA framework detailed in Section 3.4) into a transfer learning structure, to leverage the augmented samples.
In practice, the malicious HTTP payload detection model can be deployed as follows. A user consults a web application and sends a request to the webserver. Before the request arrives at the server, it will be processed by a WAF (Web Application Firewall), which takes as input all the HTTP requests and processes them using the proposed malicious HTTP payload detection model. If the request is malicious then it will be rejected automatically, else it will be sent to the webserver.

Model Initialization
The purpose of the model initialization step is to train the initial malicious HTTP traffic detection model based on the public labeled dataset D PU .Since we focus on the payload of single HTTP requests, and payload analysis task is similar with NLP (Natural Language Processing) task [32], we treat malicious HTTP traffic detection as a special text classification problem.
First, given a HTTP traffic sample s = (x, y), we have to segment x into discrete tokens before inputting it into a text classifier. Unlike English sentences that can be naturally segmented into words by space and punctuation, the payload of a HTTP traffic sample is an integrated string. Although some studies used single characters [12,13] or N-gram algorithm [8,19] to segment an integrated string, it would result in a large number of tokens without semantic meanings. For instance, if we set N = 3, the N-gram algorithm would segment the payload in Example 1 into many meaningless tokens (e.g., htt, est, cri), which make it difficult for the text classifiers to learn real semantic patterns. Fortunately, the payload of a HTTP traffic sample usually involves a variety of special characters (e.g., ?, :, /). Therefore, we use the 26 special characters in HTTP traffic (i.e., ', ", <, >, +, -, _, *, =, {, }, (, ), [, ], ∼, /, \, #, :, ;, ?, !, -, and, @) as the basis for segmentation. We give an example of payload segmentation in Example 2.
Example 2. The payload in Example 1 is segmented into http www test com scripts index php id 1 union select 1 2 3 from admin. It can be seen that tokens that have strong semantics related to malicious behaviors (e.g., select, union) are correctly segmented.
Second, we apply TextCNN proposed in [33] to train the initial detection model, since it shows a good balance between classification accuracy and training efficiency for the text classification tasks. As shown in Figure 2, TextCNN is composed of four sequentially stacked layers, i.e., an embedding layer, a convolutional layer, a pooling layer, and an output layer. (1) The embedding layer is used to transform each discrete token into a low-dimensional semantic vector, so that tokens with similar semantics would be close in vector space. Here, we utilize word2vec [34] as the embedding technique. Given a HTTP traffic sample s = (x, y), let the length of x be l and the dimension of the semantic vectors be d, x can then be represented as an l × d matrix. Here, we denote the semantic vector corresponding to the ith token in x as v i . Note that the payload would be padded or truncated if it was shorter or longer than l. (2) The convolutional layer uses multiple filters with different window sizes to extract feature maps. Specifically, a filter with window size h × d is applied to each possible contiguous h tokens in x (i.e., the concatenation of the semantic vectors where W is the weight matrix and b is the bias term. Thus, each filter would generate a feature vector c = [c 1 , c 2 , . . . , c l−h+1 ]. (3) The pooling layer applies a max-over-time pooling operation over each feature vector, i.e., it takes the maximum value of the feature vector as the feature value corresponding to the particular filter. Then, by concatenating the pooling results of all filters, we can obtain a final feature vector with uniform length for different payloads (i.e., the length equals to the number of filters). (4) The output layer accepts the final feature vector and output the probability distribution over all the HTTP traffic types based on a fully connected operation.

Data Augmentation
Since the dataset of the target system DTS is usually limited in size and of low degree of diversity, retraining or fine-tuning the initial detection model on such dataset would easily lead to overfitting. Data augmentation is a process of generating more data with higher degree of diversity by using the available limited data, so as to address the overfitting issue [35].
Since payloads are in the form of textual data, it is intuitive to apply textual data augmentation methods for the payload data augmentation task. The existing textual data augmentation methods include back translation [36,37], word replacement [38,39], and sentence generation [35,40] Back translation is a procedure of translating into another language and then back to the original language. It can generate diverse paraphrases while preserving the semantics of the original sentence. However, payloads are not human language, and thus cannot be translated. Word replacement generates new sentences by replacing selected words in the original sentence with new words, and the most intuitive strategy is replacing with synonym. However, a large portion of tokens in payloads have particular meaning. For example, if we replace "select" in the payload of Example 1 with "choose", the new payload /scripts/index.php?id=1'union choose 1,2,3 from admin# would lose the ability of web attack, because "select" is a particular keyword for querying the database. Sentence generation uses generative model such as VAE (variational autoencoder) and GAN (generative adversarial network) to generate sentences from a latent space. However, it is difficult to learn a conditional token distribution from the payloads, since they do not have strong structure and grammar patterns as compared with the natural human language.
Based on the above analysis, the existing textual data augmentation methods cannot be directly used for payload data augmentation task. Aiming at this problem, we propose a HTTP payload data augmentation method based on a keyword avoidance perturbation mechanism. The idea is to add noises to the original payloads while keeping the particular keywords unchanged. The proposed HTTP payload data augmentation method consists of two steps, i.e., keyword mining and payload perturbation.

Keyword mining:
A simple approach is to manually build a keyword library for HTTP payloads [12,14]. However, this approach has the following limitations. First, manual building is tedious and costly, and it usually has a very low recall rate. Second, attackers could use obfuscation operation to change the keywords in a payload into meaningless substrings, in order to bypass the malicious keyword detection system. As shown in Example 3, it is extremely difficult to manually collect these obfuscated keywords. Example 3. The keywords "select" and "union" could be changed into "/*!50000%53elect*/" and "uni%0bon", respectively. Then, the payload in Example 1 can be transformed into: http://www.test.com/scripts/index.php?id=1' uni%0bon /*!50000%53elect*/ 1,2,3 from admin#.
Therefore, we use a TF-IDF based method for keyword mining. First, given a malicious HTTP payload dataset MD and a benign HTTP payload dataset BD, where each payload has been segmented based on the method in Section 3.2, we calculate the TF-IDF score tis(w) for each token w based on Equation (1), where n MD (w) is the number of w in MD, n MD is the number of all tokens in MD, and d j is the j th payload in BD. According to Equation (1), a token that frequently appears in malicious payloads but seldom or never appears in benign payloads would be given a high TF-IDF score. Second, for each token w, if its TF-IDF score tis(w) is larger than a predefined threshold σ, it will be defined as a keyword. Here, we give a partial of the mined keywords as follows: "script", "cgi", "bin", "path", "cookie", "cross", "meta", "select", "where", "from", etc.
Payload perturbation: Once the keyword library is constructed, we generate new payloads from the original payloads by retaining keywords and adding noises to the uninformative tokens based on the rules in Table 1. Note that we adopt character-level randomization rather than word-level replacement in Rule 4, because most tokens in the payloads do not have natural language semantics (e.g., www, index, admin in the payload of Example 1). We give an example of payload perturbation in Example 4.

Rule 1
Keep all keywords in the keyword library unchanged.

Rule 3
Randomize each number in the range of [0,9]. Rule 4 Change each letter into a random letter.
The dataset of the target system D TS includes a labeled dataset D TSL and an unlabeled dataset D TSU . In practice, D TSL is usually of a very small size and D TSU is usually of a relatively larger size. We perform payload data augmentation on both D TSL and D TSU . We denote the generated labeled dataset and the generated unlabeled dataset as GD TSL and GD TSU , respectively.

Transfer Semi-Supervised Learning
We transfer the initial detection model to the target system by fine-tuning it on D TSL ∪ D TSU ∪ GD TSL ∪ GD TSU . Here, fine-tuning is known as a process that copies the parameters of the initial model to initialize the corresponding part of the target model and then the target model is retrained through backpropagation with target training samples [28]. The fine-tuning process is performed based on the proposed transfer semi-supervised learn-ing framework, which is designed by integrating a semi-supervised learning framework into a transfer learning structure. We explain the details as follows.
First, the transfer learning structure is designed based on the idea proposed in [28]. The parameters of the initial model are copied to the target model. Then, the first n layers of the target model are frozen, and then the parameters of the rest layers are retrained based on the dataset of the target system. A frozen layer means that it does not change during retraining. We explain the idea as follows. (1) We do not retrain all the layers, since this would cause catastrophic forgetting, eliminating the knowledge of the initial model. Especially, if the dataset of the target system is small, retraining all layers would easily result in overfitting [41]. (2) We leave the last a few layers unfrozen, since they contain the most specific knowledge that should be learnt from the target system [28]. For example, if the public dataset and the dataset of target system have different types of malicious HTTP traffic, the last layer of TextCNN (i.e., the output layer) is mandatory to be retrained.
Second, to better exploit the labeled and unlabeled samples from the target system, we conduct the fine-tuning process in a semi-supervised learning style. In all the deep semi-supervised learning styles, the smoothness enforcing method tries to regularize the model's prediction to be less sensitive to small perturbations [41][42][43]. The smoothness enforcing method is the most suitable deep semi-supervised learning style for our problem, since it is able to exploit all the labeled, unlabeled, and augmented samples. Specifically, given an original sample so, the smoothness enforcing method creates a perturbed version of s o (denoted as s p , and then trains the model to output similar predictions on s o and s p . In our problem, the payload data augmentation operation corresponds to the perturbation operation of the smoothness enforcing method. To this end, we leverage and refine the UDA framework proposed in [43] to perform the smoothness enforcing based semi-supervised learning. Specifically, we fine-tune the initial model to minimize the loss function in Equation (2), where LL D TSL ∪ GD TSL and UL D TSU ∪ GD TSU denote the loss on labeled datasets (i.e., D TSL and GD TSL ) and unlabeled datasets (i.e., D TSU and GD TSU , λ is a trade-off weight, p θ (x) denotes the predicted label distribution of sample x based on the current model parameters θ, δ(x) is the real label distribution of sample x (i.e., a one-shot vector where only the entry corresponding to the real label is 1), x * ∼ A PL (x) represents the augmented sample generated from sample x, and D KL (p θ (x) p e (x * )) represents KL divergence between p θ (x) and p θ (x * ).
Finally, Figure 3 shows the structure of the proposed transfer semi-supervised learning framework, and we summarize DeepPTSD in Algorithm 1. To conduct the experiments, we collected two raw datasets. One was an open dataset from FSecurify [44], which contained a large number of HTTP requests to a WAF (web application firewall). Only two types of HTTP requests were labeled (i.e., benign and malicious), and thus it did not discriminate among different types of malicious HTTP requests. The other one was collected from a honeypot server deployed by our lab during March, 2019. It contained 6465 HTTP traffic samples, including 879 benign samples, 614 SQL injection samples, 4093 XSS samples, and 879 directory traversal samples. To balance the data of different types, we resampled the above two raw datasets and form the following two datasets.
(1) Dataset A: It contained 80,000 samples (including 40,000 benign samples and 40,000 malicious samples), which were randomly selected from the open dataset.
(2) Dataset B: It contained 800 samples (including 200 benign samples, 200 SQL injection samples, 200 XSS samples, and 200 directory traversal samples), which were randomly selected from the honeypot dataset.

Evaluation Strategy
First, the malicious HTTP traffic detection is essentially a classification problem. Thus, we measured the performance based on three metrics, i.e., Precision, Recall, and F1-Measure.
Second, this paper focuses on cross-system malicious HTTP traffic detection problem. Therefore, we treated Dataset A as the large public labeled dataset and Dataset B as the small dataset of the target system. In addition, Dataset B was further divided into two subsets, i.e., the labeled subset (called Subset BL) and the unlabeled subset (called Subset BU). Subset BL contained 200 samples (i.e., 50 samples for each type) and Subset BU contained 600 samples (i.e., 150 samples for each type). To cope with different evaluation scenarios, we designed the following two evaluation strategies.
(1) Strategy A: It corresponded to the standard training and testing strategy. Specifically, it trained the target model on Subset BL and tested it on Subset BU.
(2) Strategy B: It corresponded to the training and testing strategy for the small data condition. Specifically, it trained the target model on Subset BL ∪ Subset BU (corresponding to the semi-supervised learning scenario), Subset BL ∪ Dataset A (corresponding to the transfer learning scenario), or Subset BL ∪ Subset BU ∪ Dataset A (corresponding to the transfer semi-supervised learning scenario). Then, it tested the target model on Subset BU.
For both strategy A and strategy B, we adopted a four-fold-cross-validation procedure, i.e., we randomly generated Subset BL and Subset BU four times and reported the average performance.

Parameter Settings
Through a series of grid search experiment, we set the parameters of the proposed method as follows. For the TextCNN in Section 3.2, we set the length of the input payload l = 64 and the dimension of the word embeddings d = 32. We defined three window sizes (i.e., h × d, h = 2, 3, and 4) for the convolutional filters. For the keyword mining algorithm in Section 3.3, we set the TF-IDF score threshold σ = 0.5 based on a grid search experiment. For the transfer semi-supervised learning framework in Section 3.4, we set the number of frozen layers n = 2 and the trade-off weight of the loss function λ = 1 based on the previous experiences.

Experiment 1: The Evaluation of Different Payload Segmentation Strategies
In this experiment, we tried to evaluate the effectiveness of the proposed payload segmentation approach (abbreviated as SegAsWord) by comparing it with the following two payload segmentation approaches, which were utilized in previous work.
(2) SegAsNGram: It corresponded to the approach that segments payloads based on the character-level N-gram algorithm [9]. Here, we set N = 3.
Here, strategy A was used for evaluation and TextCNN is used as the classifier. The experiment results are shown in Table 2. First, SegAsNGram outperforms SegAsChar. It was intuitive that N-gram fragments could better capture the payload semantics than single characters did. For example, if there was a 3-gram fragment "sel", it was probably a SQL injection payload with the keyword "select", while it was almost impossible to draw a reasonable conclusion if only a single character "s" was observed. However, although single characters could not capture the payload semantics, the performance of SegAsChar was not that bad. It is because that the convolution operation of TextCNN processed the adjacent h characters in a filter, which could achieve a similar effect with character-level N-gram. Second, it was also intuitive that SegAsWord outperformed SegAsNGram, because the segmented keywords could capture the payload semantics in an explicit manner.

Experiment 2: Ablation Experiment
In the first experiment, we tried to evaluate the data augmentation module of DeepPTSD. We compared the proposed payload data augmentation mechanism (abbreviated as Aug-ByKeyword) with the following three variants of payload data augmentation mechanisms.
(1) NoAug: It corresponded to the variant that no payload data augmentation was performed.
(2) RandomAug: It corresponded to the variant that the noises were added to all the tokens without retraining the keywords. Note that the special characters (e.g., %, @, #) were retained. (3) AugByDic: It corresponded to the variant that the keywords needed to be retrained were predefined in a dictionary instead of being extracted from the dataset.
Here, strategy A was used for evaluation and TextCNN was used as the classifier. The experiment results are shown in Table 3. First, RandomAug had the worst performance and it performed even worse than NoAug. It showed that adding randomized noises without considering the keywords would not increase the diversity of the training samples but even damaged the semantic integrity of the payloads. It verified the effectiveness of the keyword avoidance perturbation mechanism. Second, AugByDic and AugByKeyword outperformed NoAug. It demonstrated the necessity of the payload data augmentation module. Third, AugByKeyword had only a sight advantage over AugByDic. However, AugByDic required manual efforts for defining the dictionary, and thus AugByKeyword had stronger scalability. In the second experiment, we tried to evaluate the transfer learning module and the semi-supervised learning module. Given the five datasets D PU , D TSL , D TSU , GD TSL , and GD TSU (specified in Algorithm 1), we compared the performance of the following three variants of DeepPTSD. Here, strategy A was used for evaluating DeepPTSD_None, while strategy B was used for evaluating DeepPTSD, DeepPTSD_T, and DeepPTSD_S. All the methods were tested on D TSU . The experiment results are shown in Table 4 to control sequence. First, DeepPTSD_T outperformed DeepPTSD_None and DeepPTSD_S outperformed AugByKeyword and DeepPTSD_None. It demonstrated that both the transfer learning module and the semi-supervised learning module contributed to the detection model. As a result, DeepPTSD had the best performance. Second, AugByKeyword and DeepPTSD_A outperformed DeepPTSD_T. It indicated that the data augmentation and semi-supervised learning modules were more effective for the detection model than the transfer learning module. This result might be affected by the experiment dataset. Transfer learning usually requires a very large source domain dataset to gain its power (e.g., the pre-training language model such as BERT and GPT). When the source domain dataset is limited, leveraging the unlabeled data and performing data augmentation directly on target domain dataset could achieve better results.

Experiment 3: Comparison Experiment
To evaluate the competitive performance of DeepPTSD, we compared it with the following five baselines.
(1) SVM_C: It corresponded to the character-level feature based method. Specifically, it first segmented each HTTP traffic sample into characters, and built a feature vector to reflect the character distribution based on VSM (vector space model). Then, it trained a SVM (support vector machine) classifier based on the character feature vectors. (2) SVM_W: It corresponded to the word-level feature based method [18]. Specifically, it first segmented each HTTP traffic sample into words based on the approach proposed in Section 3.2, and built a feature vector to reflect the word distribution based on VSM. Then, it trained a SVM classifier based on the word feature vectors. (3) CNN: It trained the detection model based on the TextCNN model [14]. Here, a HTTP traffic sample was segmented based on the approach proposed in Section 3.2. (4) RNN: It trained the detection model based on a BiLSTM model [22]. Here, a HTTP traffic sample was also segmented based on the approach proposed in Section 3.2. (5) AE: It corresponded to the autoencoder based method [12]. Specifically, it first trained an autoencoder on D PU ∪ D TSL ∪ D TSU in an unsupervised manner to map each HTTP traffic sample into a unified latent feature space. Then, it trained a MLP (multi-layer perceptron) classifier based on the latent feature vectors.
Here, strategy A was used for evaluating SVM_C, SVM_W, CNN, and RNN, while strategy B was used for evaluating AE and DeepPTSD. AE was trained on D PU , D TSL , and D TSU , while DeepPTSD is trained on D PU , D TSL , D TSU , GD TSL , and GD TSU . AE can be viewed as an alternative way to implement the transfer semi-supervised learning. The experiment results are shown in Table 5 and the following tendencies can be discerned. First, SVM_C had a far lower performance than SegAsChar and SVM_W. It indicated that single characters could not capture the payload semantics. Although SegAsChar also accepted single characters as input, the convolution operation of TextCNN could process multiple adjacent characters in a filter. Second, CNN and RNN outperformed SVM_W. It showed that deep learning based methods could extract more advanced features to capture the payload semantics. Third, AE outperformed CNN and RNN. It verified the effectiveness of the utilization of source domain data and unlabeled target domain data, which could provide richer knowledge to improve the generalization ability of the detection model when the labeled target domain data are insufficient. Finally, DeepPTSD had the best performance. It showed that it could achieve a better performance by integrating transfer learning, data augmentation, and semi-supervised learning.

Conclusions
In this paper, aiming at the huge training data collection cost and low cross-system detection ability of the existing malicious HTTP traffic detection methods, we propose DeepPTSD, a payload based malicious HTTP traffic detection method based on deep transfer semi-supervised learning technique. It improves the generalization ability of the detection model under the situation of insufficient training data by using two strategies. The first is a transfer learning strategy, which learns generalized knowledge from a public dataset and adapts the knowledge to the target system. The second is a semi-supervised learning strategy, which performs data augmentation by using a keyword avoidance perturbation mechanism and trains the detection model in a smoothness enforcing style by exploiting all the labeled, unlabeled, and augmented target domain data. Through a series of experiments based on real datasets, we demonstrate that DeepPTSD outperforms the existing baselines when the training dataset is highly limited.
In the future, we will extend our work from the following directions. First, the dataset used in the experiments only contains simple types of web attacks (e.g., SQL injection, XSS). We will extend our method to detect advanced web attacks (e.g., webshell). Second, we will fuse our method with statistic based method to detect encrypted malicious HTTP traffic. Third, the malicious HTTP traffic detection is treated as a text classification task, and only the payload data in the form of string are considered in the detection model. In the future, we will extend our method to accommodate other elements of the HTTP protocol (e.g., request header, response header, etc.). Fourth, we will study the anomaly detection based method for the malicious HTTP traffic detection problem, in case no malicious samples are available.