Sensors
  • Article
  • Open Access

9 December 2021

CBD: A Deep-Learning-Based Scheme for Encrypted Traffic Classification with a General Pre-Training Method

1 State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450001, China
2 Henan Key Laboratory of Network Cryptography Technology, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Intelligent Solutions for Cybersecurity

Abstract

With the rapid increase in encrypted traffic in the network environment and the growing proportion it represents, the study of encrypted traffic classification has become increasingly important as part of traffic analysis. At present, the classification of encrypted traffic has been studied thoroughly in closed environments, but the resulting classification models usually require labeled data and are difficult to apply in real environments. To solve these problems, we propose a transferable model called CBD with generalization ability for encrypted traffic classification in real environments. The overall structure of CBD can be described as a combination of a one-dimensional CNN and the encoder of the Transformer. The model can be pre-trained with unlabeled data to learn the basic characteristics of encrypted traffic data, and then transferred to other datasets to complete the classification of encrypted traffic at the packet level and the flow level. The performance of the proposed model was evaluated on a public dataset. The results showed that the CBD model outperformed the baseline methods and that the pre-training method improved the classification ability of the model.

1. Introduction

In recent years, with the rapid development of network technology and people’s growing awareness of data privacy, a variety of encryption technologies have been widely used in network communications, resulting in a rapid increase in encrypted network traffic. At the same time, encrypted traffic is also used as a tool to hide activities, which provides an opportunity for malicious network attackers to conceal their command-and-control activities. Therefore, encrypted traffic classification can help monitor abnormal conditions in the network and detect network attack behaviors in time. It also contributes to the improvement of network service performance and the creation of a good network environment, and has therefore gained widespread attention from researchers.
Network traffic classification refers to using algorithms to construct a classification model for network traffic, with three levels of task granularity: sequence-based, packet-based, and flow-based classification. The granularity can be determined according to the practical scenario, such as application classification, protocol identification, or service analysis. Network traffic classification plays a significant role in network management, traffic control, and security detection. Meanwhile, the efficient and accurate classification of network traffic is an important foundation for maintaining network security.
However, most of the existing research results are only implemented in a closed environment. The experimental data are often labeled data, and training data and test data often have high similarities. Moreover, some results tend to require more computing resources and time. In a real environment, the data are often unlabeled, which can greatly decrease the performance of many classification models. To solve this problem, researchers have tried to develop a model with a strong generalization ability.
In this work, we propose a model called CBD (based on a convolutional neural network, bidirectional encoder representation from transformers and dense network) that supports unlabeled data and has a certain generalization ability. It can directly learn from unlabeled data and use this knowledge for model training. The model can also be transferred across datasets, requiring only parameter fine-tuning to complete the goal of encrypted traffic classification. The contributions of this paper are summarized as follows:
  • A novel encrypted traffic classification model called CBD is designed. It combines a Convolutional Neural Network (CNN) and Bidirectional Encoder Representation from Transformers (BERT) to automatically learn the features of traffic data from the packet level and flow level to achieve the encrypted traffic classification for applications.
  • A general pre-training method suitable for the field of encrypted traffic analysis is proposed. For unlabeled data, this method proposes two tasks of identifying ciphertext packets and identifying continuous flows, to deepen the model’s understanding and learning of encrypted traffic data.
  • The CBD model achieves good results in encrypted traffic classification. In addition, the performance of the CBD model has obvious advantages compared with other methods.
The rest of the paper is organized as follows. Section 2 summarizes the related work of network traffic analysis. In Section 3, we provide a detailed description of the overall model structure. In Section 4, we introduce the specific details of the experiment, and we evaluate and compare the experimental results. Finally, we conclude this paper in Section 5.

3. Model Structure

Encrypted traffic has many characteristics, such as high entropy, weak statistical regularities, and a weak correlation between adjacent bytes. Its features are therefore difficult to extract manually and hard to represent well, and it may be difficult to directly apply traditional or classic methods from other fields, such as NLP and CV, to classify encrypted traffic. Therefore, we designed the CBD model for encrypted traffic; the overall structure is shown in Figure 1 and includes three modules: the CNN module, the BERT module, and the Dense module. The entire model involves a pre-training process and a fine-tuning process.
Figure 1. The overall framework of the CBD model.
The left side of Figure 1 represents the pre-training process, the right side represents the fine-tuning process, and the dotted arrow represents the transfer of the module.

3.1. Data Preprocess

For the raw traffic data, the data must be preprocessed before they can be input into the neural network model. The preprocessing mainly includes traffic segmentation, traffic cleaning, traffic conversion, and time interval integration. Algorithm 1 presents the process.
Algorithm 1 Preprocessing Algorithm.
Input: Raw network traffic dataset D; number of traffic classes c;
Output: Packet stream set P;
1:  for each i ∈ [1, c] do
2:      Randomly select n segments of 10 consecutive packets in D;
3:      for each j ∈ [1, 10 × n] do
4:          Trim packet j to a uniform length of 256 bytes;
5:          Generate the packet stream from the payload of packet j, obtaining p_j;
6:          Count the time interval between p_j and p_{j+1};
7:          if the time interval between p_j and p_{j+1} < 1 second then
8:              Continue
9:          else
10:             Insert p_0 between p_j and p_{j+1}, where p_0 = (1, ..., 1) of dimension 256;
11:         end if
12:     end for
13:     Generate packet stream set P_i = {p_1, ..., p_0, ..., p_0, ..., p_{10×n}};
14: end for
15: return Packet stream set P = {P_1, P_2, ..., P_c};
Line 2 is traffic segmentation. Several flow segments with a window of 10 are randomly intercepted, that is, each flow segment contains 10 consecutive packets.
Line 4 is traffic cleaning. The payload part of each packet is read and its length is unified: the first 256 bytes of each packet are kept, and packets shorter than 256 bytes are zero-padded, yielding the raw stream p = (b_1, b_2, ..., b_{8×256}).
Line 5 is traffic conversion. The raw stream is converted to decimal, with each byte taking a value of 0–255, so a 256-dimensional sequence p is obtained.
Lines 6–9 are time interval integration. According to the statistical results of the time interval between two adjacent packets of different classes [7], a blank packet is inserted if the packet interval is more than 1 s, and intervals within 1 s are ignored. The blank packet uses a 256-dimensional stream of all 1s as its payload, which prevents the parameters of each neuron in the neural network from being invalidated by multiplication by 0 when a blank packet is encountered.
Finally, the preprocessed packet stream set P is obtained and can be input into the model in the next step.
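As an illustration, the following Python sketch implements this preprocessing for a single flow segment; the function name, the timestamp representation, and the use of NumPy are assumptions for illustration rather than the authors' implementation.

```python
import numpy as np

BLANK_PACKET = np.ones(256, dtype=np.uint8)   # blank packet: a 256-dimensional stream of all 1s

def preprocess_segment(payloads, timestamps):
    """Sketch of Algorithm 1 for one flow segment of 10 consecutive packets.

    payloads: list of raw payload bytes objects; timestamps: arrival times in seconds.
    """
    stream = []
    for j, raw in enumerate(payloads):
        p = np.frombuffer(raw[:256], dtype=np.uint8)   # keep at most the first 256 bytes
        p = np.pad(p, (0, 256 - len(p)))               # zero-pad packets shorter than 256 bytes
        stream.append(p)
        # Insert a blank packet when the gap to the next packet is 1 s or more
        if j + 1 < len(payloads) and timestamps[j + 1] - timestamps[j] >= 1.0:
            stream.append(BLANK_PACKET)
    return stream

# Example: two tiny packets 2 s apart -> three entries (a blank packet is inserted between them)
segment = preprocess_segment([b"\x16\x03\x01", b"\x17\x03\x03"], [0.0, 2.0])
print(len(segment), segment[0][:4])
```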

3.2. CNN Module

The structure of the CNN module is shown in Figure 2. It consists of a 1D-CNN model, including four convolutional layers and three pooling layers.
Figure 2. The framework of the CNN module.
The input is a 256-dimensional vector p. The kernel size of the first convolutional layer is 3, and the number of output channels is 10; therefore, the output is a 10 × 254-dimensional matrix. The kernel size of the second convolutional layer is also 3. After this convolutional layer, a max pooling layer with a pooling size of 3 and a stride of 1 is applied, and the number of output channels is 20; therefore, the output is a 20 × 250-dimensional matrix. The third and fourth layers follow the same scheme as the second layer, with 10 and 1 output channels respectively, resulting in a 242-dimensional vector as output. Finally, a Dense module is connected for dimensionality reduction to facilitate the subsequent operation of the BERT module.
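For concreteness, the following is a minimal PyTorch sketch of this CNN module. It reproduces the dimensions described above; the activation function, the input scaling, and the target size of Dense module 1 (bert_dim) are assumptions not stated in the text.

```python
import torch
import torch.nn as nn

class CNNModule(nn.Module):
    """1D-CNN sketch following the dimensions in Section 3.2 (ReLU activations assumed)."""
    def __init__(self, bert_dim: int = 128):          # bert_dim is a hypothetical embedding size
        super().__init__()
        self.conv1 = nn.Conv1d(1, 10, kernel_size=3)   # 256 -> 254, 10 channels
        self.conv2 = nn.Conv1d(10, 20, kernel_size=3)  # 254 -> 252, 20 channels
        self.conv3 = nn.Conv1d(20, 10, kernel_size=3)  # 250 -> 248, 10 channels
        self.conv4 = nn.Conv1d(10, 1, kernel_size=3)   # 246 -> 244, 1 channel
        self.pool = nn.MaxPool1d(kernel_size=3, stride=1)  # shrinks length by 2
        self.act = nn.ReLU()
        self.dense = nn.Linear(242, bert_dim)          # "Dense module 1": 242 -> BERT input dim

    def forward(self, x):                        # x: (batch, 1, 256), byte values (scaling assumed)
        x = self.act(self.conv1(x))              # (batch, 10, 254)
        x = self.pool(self.act(self.conv2(x)))   # (batch, 20, 250)
        x = self.pool(self.act(self.conv3(x)))   # (batch, 10, 246)
        x = self.pool(self.act(self.conv4(x)))   # (batch, 1, 242)
        return self.dense(x.squeeze(1))          # (batch, bert_dim)

# Quick shape check
print(CNNModule()(torch.rand(4, 1, 256)).shape)  # torch.Size([4, 128])
```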

3.3. BERT Module

The structure of the BERT module is shown in Figure 3, based on the BERT model in [25].
Figure 3. The framework of the BERT module.
The BERT model [25] is mainly composed of the Encoder of the Transformer model [19]. The encoder has six layers, and each layer contains two sub-layers: a multi-head attention mechanism and a fully connected feedforward network. A sentence X, whose dimension is [batch_size, seq_len], is input into the Encoder, X ∈ R^(batch_size × seq_len).
Initially, position embedding is performed, obtaining X_embed ∈ R^(batch_size × seq_len × embed_dim) with dimension [batch_size, seq_len, embed_dim]:
X_embed = EmbeddingLookup(X) + PositionEncoding.
For the multi-head attention mechanism sub-layer, to learn multiple representation subspaces, X_embed is linearly mapped. Three weight matrices W_Q, W_K, W_V ∈ R^(embed_dim × embed_dim) are assigned, and three matrices Q, K, V are formed after the linear mapping, whose dimensions are consistent with those before the linear transformation:
Q = Linear(X_embed) = X_embed W_Q,
K = Linear(X_embed) = X_embed W_K,
V = Linear(X_embed) = X_embed W_V.
The number of heads is defined as h, with head_size = embed_dim / h. After splitting according to head_size, the dimensions of Q, K, V are [batch_size, seq_len, h, embed_dim/h]; after transposition they are [batch_size, h, seq_len, embed_dim/h].
For the i-th head, the dimensions of Q_i, K_i, V_i are all [batch_size, seq_len, embed_dim/h]; then, the output of the i-th head is
head_i = Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / sqrt(d_k)) V_i,
where d_k = embed_dim / h is the dimension of K_i.
For the multi-head attention mechanism sub-layer, the outputs of all heads are concatenated to obtain X_hidden with dimension [batch_size, seq_len, embed_dim]:
X_hidden = MultiHead(Q, K, V) = Concat(head_1, ..., head_h).
Then, residual connection and normalization are performed. Since the dimensions of X_embed and X_hidden are the same, the elements can be added directly to form the residual connection, which is then normalized to the standard normal distribution, obtaining LayerNorm(X_embed + X_hidden).
After each sub-layer, a residual connection and normalization are added, so the output of each sub-layer is
SubLayer_output = LayerNorm(X + SubLayer(X)).
The BERT model [25] contains two tasks. The first task, Masked Language Model (MLM), is a token-level task that addresses the problem that, in a bidirectional model, the word to be predicted could otherwise indirectly see itself in the given sequence. A proportion of the tokens is randomly masked, and the model predicts and restores the parts that are covered or replaced. The second task, Next Sentence Prediction (NSP), is a sentence-level task. Since many NLP downstream tasks are based on the relationship between sentences, it is necessary to determine whether two adjacent sentences are contextual.
In the BERT module we designed, we set the number of Transformer Encoder layers to 4, 8, and 12, respectively. Each flow segment contains 10 consecutive packets, and each packet becomes a packet stream after preprocessing. After adding the time-interval blank packets, a flow segment contains up to 15 packets. Therefore, to allow an entire flow segment to be input into the BERT module at once, a flow is assumed to contain 15 packets; if there are fewer than 15, blank packets are inserted to pad it. That is, the BERT module gathers 15 outputs of the previous module as its input at a time. A Dense module is also connected after the BERT module for the final classification.
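Below is a minimal sketch of such an encoder built with PyTorch's nn.TransformerEncoder; the number of attention heads, the feed-forward size, the learned position embedding, and the mean-pooling before the classification layer are assumptions not specified in the paper.

```python
import torch
import torch.nn as nn

class BERTModule(nn.Module):
    """Transformer-encoder sketch for Section 3.3: 15 packet vectors per flow segment."""
    def __init__(self, embed_dim: int = 128, num_layers: int = 8,
                 num_heads: int = 8, num_classes: int = 8):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, 15, embed_dim))  # learned positions (assumed)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.dense = nn.Linear(embed_dim, num_classes)   # corresponds to "Dense module 2"

    def forward(self, packet_vecs):          # packet_vecs: (batch, 15, embed_dim) from the CNN module
        h = self.encoder(packet_vecs + self.pos_embed)   # (batch, 15, embed_dim)
        return self.dense(h.mean(dim=1))     # pool over the 15 packets, then classify (pooling assumed)

# Quick shape check
print(BERTModule()(torch.rand(4, 15, 128)).shape)  # torch.Size([4, 8])
```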

3.4. Dense Module

The structure of the Dense module is shown in Figure 4, which is mainly composed of a fully connected layer.
Figure 4. The framework of the Dense module.
After the CNN module and the BERT module, there is a Dense module in each case. Dense module 1, after the CNN module, contains a fully connected layer that changes the dimension of the CNN output to one suitable as input to the BERT module. Dense module 2, after the BERT module, is also a fully connected layer, whose parameters are determined by the number of final classification classes.

3.5. Pre-Training of Unlabeled Data

The pre-training proposed in this paper contains two stages, which correspond to the two tasks of the BERT model pre-training.
The first stage is based on the packet level, which corresponds to the token-based MLM task in the BERT model. This stage mainly trains the model’s understanding of encrypted packets. By calculating the entropy value of the payload, each packet is divided into a plaintext packet and a ciphertext packet.
For a packet, the payload part is extracted to obtain p = {x_1, x_2, ..., x_n}, and the entropy of each packet is calculated as
H = -Σ_{i=1}^{n} P(x_i) log2 P(x_i), 1 ≤ i ≤ n.
Entropy is a measure of the degree of disorder in a system: the larger the entropy, the more disordered the system and the less obvious the information it carries. The same principle applies to plaintext and ciphertext payloads. The entropy of a plaintext payload should be much smaller than that of a ciphertext payload, because encrypted data have high randomness and do not contain obvious information. By calculating and comparing the experimental data, we set the entropy threshold to H_0 = 4:
H(p) < H_0 ⟹ p ∈ P_plain,   H(p) ≥ H_0 ⟹ p ∈ P_cipher.
When H ≥ H_0, the packet is considered a ciphertext packet; when H < H_0, the packet is considered a plaintext packet.
The labeled plaintext and ciphertext packets are used to train the model to identify encrypted packets. After completing the first stage of pre-training, the second stage begins.
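A minimal sketch of how the first-stage pre-training labels could be derived from byte-level Shannon entropy is given below; the function names and example payloads are illustrative.

```python
import math
from collections import Counter

H0 = 4.0   # entropy threshold from the text

def payload_entropy(payload: bytes) -> float:
    """Byte-level Shannon entropy H = -sum_i P(x_i) * log2 P(x_i)."""
    counts = Counter(payload)
    n = len(payload)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def pretrain_label(payload: bytes) -> int:
    """1 = ciphertext packet (H >= H0), 0 = plaintext packet (H < H0)."""
    return int(payload_entropy(payload) >= H0)

print(pretrain_label(b"hello hello hello hello"))  # repetitive, low entropy -> 0 (plaintext)
print(pretrain_label(bytes(range(256))))           # maximal byte diversity -> 1 (ciphertext)
```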
The second stage is based on the flow level, which corresponds to the sentence-based NSP task in the BERT model. This stage mainly trains the model’s understanding of a flow, that is, the understanding of the relationship between packets. Initially, positive and negative sample sets are constructed.
The positive sample set S^+ contains n positive samples,
S^+ = {s_1^+, s_2^+, ..., s_n^+}.
A positive sample s^+ is defined as a continuous flow F, where each flow contains 10 consecutive packets, denoted as
s_i^+ = F_i = {p_1^i, p_2^i, ..., p_10^i}, 1 ≤ i ≤ n.
The negative sample set S^- contains the same number of samples as the positive sample set S^+,
S^- = {s_1^-, s_2^-, ..., s_n^-}.
A negative sample s^- is defined as a discontinuous flow F̄, obtained by transforming a positive sample: each packet in the positive sample is replaced with another packet with a certain probability, and the resulting sample is called a negative sample,
s_i^- = F̄_i = {f(p_1^i), f(p_2^i), ..., f(p_10^i)}, 1 ≤ i ≤ n,
where f(p_j^i) = p_j^i with probability P = 0.7, or p_{j'}^{i'} with probability P = 0.3, (i', j') ≠ (i, j), 1 ≤ j ≤ 10.
The labeled positive and negative sample sets are utilized to train the model so that it can distinguish whether a flow is continuous. After completing the pre-training, the CBD model will be further fine-tuned according to the downstream task—encrypted traffic classification—to achieve the final goal.
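The sketch below illustrates this negative-sample construction, assuming each packet is kept with probability 0.7 and otherwise replaced by a packet drawn from a different (flow, position) pair; the function name and data layout are illustrative.

```python
import random

def make_negative(positive_flows, i):
    """Build one negative sample from positive flow i (a list of packet arrays)."""
    flow = positive_flows[i]
    negative = []
    for j, pkt in enumerate(flow):
        if random.random() < 0.7:
            negative.append(pkt)            # keep the original packet with probability 0.7
        else:
            while True:                     # draw a replacement with (i', j') != (i, j)
                i2 = random.randrange(len(positive_flows))
                j2 = random.randrange(len(positive_flows[i2]))
                if (i2, j2) != (i, j):
                    break
            negative.append(positive_flows[i2][j2])
    return negative

# Example: two toy flows of 10 "packets" each (here just integers)
flows = [list(range(10)), list(range(100, 110))]
print(make_negative(flows, 0))
```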

3.6. Model Transfer and Fine-Tune

In the entire CBD model, some modules will perform the transfer based on supervised learning. In this transfer process, the structure of the module is fixed, but the parameters will change according to the tasks after the transfer. This transfer method is also called parameter fine-tuning under supervised learning in the deep model.
It should be noted that, in addition to the three modules in the fine-tuning phase, we will also transfer the CNN module and the Dense module in the pre-training phase.
After the CBD model is fine-tuned, it will output the predicted classification results in the test phase. We will evaluate the classification results in the next section.
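As an illustration of this transfer, the sketch below (reusing the CNNModule and BERTModule sketches from Sections 3.2 and 3.3) loads hypothetical pre-trained weights, re-initializes the classification head, and fine-tunes all parameters; the checkpoint file names, optimizer settings, and training-step structure are assumptions.

```python
import torch

cnn = CNNModule()
bert = BERTModule(num_classes=8)            # Dense module 2 re-initialized for the 8-class task

cnn.load_state_dict(torch.load("cnn_pretrained.pt"))                 # assumed checkpoint names
bert_state = torch.load("bert_pretrained.pt")
bert_state = {k: v for k, v in bert_state.items() if not k.startswith("dense")}  # drop pre-training head
bert.load_state_dict(bert_state, strict=False)

optimizer = torch.optim.Adam(list(cnn.parameters()) + list(bert.parameters()), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

def fine_tune_step(flow_batch, labels):
    """flow_batch: (batch, 15, 1, 256) preprocessed packets; labels: (batch,) class indices."""
    packet_vecs = torch.stack([cnn(flow_batch[:, k]) for k in range(15)], dim=1)  # (batch, 15, embed_dim)
    loss = loss_fn(bert(packet_vecs), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```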

4. Experiment

This section mainly describes the experimental settings, experimental evaluation metrics, and specific experimental results to verify the effectiveness of the CBD model proposed in this paper.

4.1. Experimental Settings

This paper used the public dataset ISCXVPN2016 [8], published by the Canadian Institute for Cybersecurity at the University of New Brunswick, for the downstream task of encrypted traffic classification. We chose the traffic data of two social networks, Facebook and Skype, as the experimental data. Facebook traffic included two applications, chat and audio, and Skype traffic included two applications, chat and file transfer. The traffic of each specific application can be encapsulated by a VPN protocol (VPN) or be ordinary network traffic (nonVPN). The experiment used eight classes of data, a total of 8000 samples: for each class, 1000 samples were randomly selected from ISCXVPN2016, and each sample is a flow segment, i.e., it contains 10 consecutive packets. The specific data classes are shown in Table 1.
Table 1. Data classes of encrypted traffic.
In the pre-training process, the plaintext packets of the first stage consist of 256-byte plaintext, and the ciphertext packets consist of randomly selected data other than the eight classes mentioned above, regardless of class and data volume. In the second stage, 5000 samples are randomly selected to generate the positive sample set, again ignoring class. The negative sample set is obtained by transforming the positive sample set.

4.2. Evaluation Metrics

When evaluating the performance of a model, the class of interest is usually regarded as the positive class and the other classes as the negative class. The evaluation metrics are formulated from four basic cases: True Positive (TP), a positive-class sample predicted as positive; False Positive (FP), a negative-class sample predicted as positive; True Negative (TN), a negative-class sample predicted as negative; and False Negative (FN), a positive-class sample predicted as negative. We use five commonly used evaluation metrics as the basis for evaluating the performance of the model: Accuracy, F1-score, Precision, Recall, and Area Under Curve (AUC).
Accuracy is the ratio of the number of correctly classified samples to the total number of samples for a given data.
Accuracy = (TP + TN) / (TP + TN + FP + FN) = n_correct / n,
where n_correct represents the number of correctly predicted samples and n represents the total number of samples.
F1-score is the harmonic mean of Precision and Recall. Precision is the proportion of samples predicted as positive that are truly positive, and Recall is the proportion of all positive samples that are predicted as positive.
2 / F1-score = 1 / Precision + 1 / Recall,
F1-score = 2 × Precision × Recall / (Precision + Recall),
Precision = TP / (TP + FP),
Recall = TP / (TP + FN).
In multi-class problems, we calculate the macro-F1-score, which computes the F1-score for each class separately and then takes the unweighted average; macro-Precision and macro-Recall are computed analogously:
macro-F1-score = (1 / C) Σ_{i=1}^{C} F1-score_i,
macro-Precision = (1 / C) Σ_{i=1}^{C} Precision_i,
macro-Recall = (1 / C) Σ_{i=1}^{C} Recall_i,
where C represents the number of classes.
The ROC curve (receiver operating characteristic curve) is obtained by using the False Positive Rate (FPR) as the x-axis and the True Positive Rate (TPR) as the y-axis; the larger the area under the curve (AUC), the better the classification effect. FPR is the probability that negative samples are mistakenly classified as positive, and TPR is the probability that positive samples are correctly classified as positive:
TPR = TP / (TP + FN),
FPR = FP / (FP + TN).
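For reference, the macro-averaged metrics can be computed with scikit-learn as in the toy example below; the example labels are illustrative, not experimental data.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy three-class example; in the paper the metrics are macro-averaged over eight classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

print("Accuracy       :", accuracy_score(y_true, y_pred))
print("Macro-Precision:", precision_score(y_true, y_pred, average="macro"))
print("Macro-Recall   :", recall_score(y_true, y_pred, average="macro"))
print("Macro-F1       :", f1_score(y_true, y_pred, average="macro"))
```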

4.3. Experimental Results

The CBD model selects 4-layer, 8-layer, and 12-layer BERT for experiments. The experiment is an eight-class experiment; that is, the random classification accuracy rate is 12.5%. In the experiment, we found that an eight-layer BERT can achieve the best results.
In order to demonstrate the performance of our proposed model, we perform experiments on a small-scale dataset, as shown below:
1. Sample Size = 20% or 0.2 (200 samples are randomly selected from each class of data to form a new dataset);
2. Sample Size = 40% or 0.4 (400 samples are randomly selected from each class of data to form a new dataset);
3. Sample Size = 60% or 0.6 (600 samples are randomly selected from each class of data to form a new dataset);
4. Sample Size = 80% or 0.8 (800 samples are randomly selected from each class of data to form a new dataset).
Additionally, we perform eight-class experiments on four different sample sizes to compare three types of BERT. The results are shown in Figure 5.
Figure 5. The three models with different BERT layers and eight-class accuracy on four sample sizes. The abscissa is the BERT model with three different layers, the ordinate is the eight-class accuracy, and the four colors represent four different sample sizes.
It can be seen from Figure 5 that the classification accuracy of all three models increases as the sample size increases. Overall, the eight-layer BERT model has the best effect. It can achieve a classification accuracy of 70% in a sample size of 0.2, and it also performs best in a sample size of 0.4, reaching a classification accuracy of approx. 84%. As the sample size increases, the advantages of the eight-layer BERT model compared to the other two are no longer obvious.
Table 2 shows the detailed comparison results under the four metrics.
Table 2. Classification performance of models with different BERT layers (sample size = 0.4, pre-training epoch = 200).
It can be seen from Table 2 that, when the number of layers is too low, the model's learning ability is poor, the improvement is slow, and the final performance is not high enough. When the number of layers is too high, the model is too complex, and when the sample size provided by the target task is limited, the achievable performance improvement is limited. Therefore, in the experiment, the CBD model with an eight-layer BERT performs best.
The general pre-training method is an important part of the CBD model. To verify the effectiveness of the pre-training method, we compare the CBD model with the no pre-training CBD model, and the results are shown in Figure 6.
Figure 6. ROC curve of eight-layer BERT (best) at 40% sample size. The picture on the left (a) is the model without pre-training, and the picture on the right (b) is the model with pre-training. The abscissa is the FPR, and the ordinate is the TPR. The eight colors represent the eight classes of data in the downstream tasks.
It can be seen from Figure 6 that the model without pre-training performed worse than the original CBD model. For a complex model with many BERT layers and many neuron parameters, the accuracy without pre-training is not good, owing to insufficient data and insufficient model learning. The pre-trained model, however, has learned sufficiently, and the higher the number of layers, the stronger the learning ability; such a model has learned the hidden features in the data, so the accuracy is high. However, after reaching a certain level, the performance improvement brought by increasing the number of layers is very limited. In addition, we found that, without pre-training, the smaller the sample size, the more often training failed.
In order to verify whether the number of pre-training epochs has an effect on the final classification result, we set different pre-training epochs and compared the experimental results. The results are shown in Table 3.
Table 3. The classification performance of models with different pre-training epochs when the number of BERT layers is eight and sample size is 0.4.
We also observed the changes in the model accuracy during the training process, and the results are shown in Figure 7.
Figure 7. The impact of different pre-training epochs on accuracy when sample size is 0.4. The abscissa is the number of epochs during training in the downstream task, and the ordinate is the classification accuracy of the test. The four colors represent the performance of the eight-layer BERT in different epochs during pre-training.
It can be seen from Table 3 and Figure 7 that the higher the number of pre-training epochs, the more adequate the pre-training and the better the performance of the model. However, beyond 200 epochs, additional pre-training brings little performance improvement. Therefore, we chose 200 as the optimal number of pre-training epochs.
To verify the indispensability of each module in the CBD model, we compare the CBD model with other models. First, the necessity of the BERT module is verified. tCLD-Net is similar to the CBD model in that it combines deep learning and transfer learning; the difference is that tCLD-Net uses an LSTM module instead of a BERT module. We compared the two models on the same dataset, the target-domain dataset mentioned in [14]. The comparison results are shown in Figure 8.
Figure 8. Test results on the three-class target domain. The abscissa represents the sample size, and the ordinate represents the accuracy of classification. The four colors represent CBD Model, tCLD-Net, CBD Model without pre-training and CLD-Net without transfer.
It can be seen from Figure 8 that the CBD Model performs better than tCLD-Net on the three-class dataset. However, it is worth noting that the pre-training of tCLD-Net (also called source domain training) is based on labeled data, and a large amount of manually labeled data needs to be provided. For the CBD Model, the performance is similar when the target task training set is the same, and the pre-training data do not need to be manually labeled. In addition, the performance of the CBD Model without pre-training is worse than that of CLD-Net without transfer when the amount of data is insufficient, which is caused by the high complexity of the model and the insufficient amount of data.
In order to verify the necessity of the CNN module, we removed the CNN module from the CBD model. An embedding layer was applied directly to the ciphertext packet, and then the BERT module and the Dense module were connected to form an Embedding-BERT-Dense (EBD) model. The results of a comparison between the performance of the EBD model and the CBD model are shown in Table 4.
Table 4. Evaluation metrics of CBD model and EBD model when BERT layers = 8, sample size = 0.4, and pre-training epoch = 200.
It can be seen from Table 4 that the classification effect of the EBD model is poor, and its time efficiency in the experiment is very low. BERT works well when applied directly to data with close contextual correlation, but the contextual correlation of encrypted data is deliberately obscured and therefore weak, so the effect of the EBD model is poor. Another reason for the unsatisfactory effect of the EBD model may be that each sample in the experiment consists of 10 consecutive packets randomly selected from a flow, so the sample data are likely to be ciphertext packets. If the experiment selected the first 10 packets of a flow as a sample instead of random packets, it would likely obtain plaintext or handshake packets, and the performance of the EBD model might then be better than the current results. Therefore, the CNN module has an irreplaceable role.
In addition, if there is no CNN module and no Embedding layer, directly using the BERT module to learn the packet payload is equivalent to encoding the payload, with every two bytes mapped to a number in 0–65535; in this encoding, every two bytes correspond to a word in NLP. In this case, after 60 training epochs, the eight-class classification accuracy fluctuates around 10%, with a maximum of 10.25%. Therefore, every module in the CBD model is indispensable.
We also selected five models from existing research to compare with the CBD model: the one-dimensional CNN (1D-CNN) and two-dimensional CNN (2D-CNN) mentioned in [5], the Stacked Autoencoder (SAE) mentioned in [2], the combination of CNN and LSTM (CNN-LSTM) mentioned in [4], and CLD-Net mentioned in [7]. All five models were run on the dataset mentioned in this paper. The comparison results are shown in Figure 9.
Figure 9. Accuracy comparison results of several models when the sample size is 0.4.
The comparative experiments are implemented on a three-class dataset. It can be seen from Figure 9 that the accuracy of the CBD model is the highest, reaching more than 91%. CLD-Net also achieves good results, which may be related to the traffic recombination strategy mentioned in [4]. The accuracies of the 1D-CNN, 2D-CNN, SAE, and CNN-LSTM models are all less than 80%. This may be due to the small sample size and the simple structure of these models, which makes them unable to learn and understand the hidden features of encrypted traffic. At the same time, each sample in our experiment is a random sequence of 10 consecutive packets without header information or handshake-phase packets, whereas some existing traffic classification models may not select packets randomly in their own experiments, so their data contain not only the payload but also other information. However, data in real situations often contain only payload information, so the results of these existing models are not ideal here.

5. Conclusions

In this paper, we proposed an encrypted traffic classification model with a general pre-training method. Compared with other traffic classification methods that combine deep learning and transfer learning, this model can directly learn the basic characteristics of traffic data from unlabeled data and uses CNN and BERT to automatically learn the features of traffic from the perspective of packets and flows. The experiments were performed on a public dataset, and a class-balanced dataset was constructed for the eight-class task. When the sample size was only 0.4 and the number of BERT layers was eight, the CBD model achieved a classification accuracy of more than 91% in the three-class task; in the eight-class task, the classification accuracy reached about 84%, nearly 8% higher than that of the model without pre-training. The eight-class task refers to the encrypted traffic classification task performed on the balanced dataset proposed in this paper, and the three-class task refers to the classification task performed on the target-domain dataset mentioned in [14].
In future work, we will consider the classification of unknown classes of encrypted traffic to better fit the actual network environment. At the same time, the classification of real-time network traffic will be a research goal of the next stage. Another research direction is designing a model that uses less time and fewer computational resources, can detect and classify unknown traffic, and supports real-time updates.

Author Contributions

Conceptualization, C.G.; methodology, X.H. and Y.C.; software, X.H. and Y.C.; validation, X.H.; formal analysis, X.H.; investigation, X.H.; resources, C.G. and F.W.; data curation, X.H. and Y.C.; writing—original draft preparation, X.H.; writing—review and editing, X.H.; visualization, X.H. and Y.C.; supervision, C.G. and F.W.; project administration, C.G. and F.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China grant number 61772548.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://www.unb.ca/cic/datasets/vpn.html (accessed on 18 June 2021).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wang, Z. The Applications of Deep Learning on Traffic Identification. BlackHat USA 2015, 24, 1–10.
  2. Höchst, J.; Baumgärtner, L.; Hollick, M.; Freisleben, B. Unsupervised Traffic Flow Classification Using a Neural Autoencoder. In Proceedings of the 2017 IEEE 42nd Conference on Local Computer Networks (LCN), Singapore, 9–12 October 2017; pp. 523–526.
  3. Li, R.; Xiao, X.; Ni, S.; Zheng, H.; Xia, S. Byte Segment Neural Network for Network Traffic Classification. In Proceedings of the 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), Banff, AB, Canada, 4–6 June 2018; pp. 1–10.
  4. Marín, G.; Casas, P.; Capdehourat, G. Deep in the Dark—Deep Learning-Based Malware Traffic Detection without Expert Knowledge. In Proceedings of the 2019 IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 23 May 2019; pp. 34–42.
  5. Pacheco, F.; Exposito, E.; Gineste, M. A framework to classify heterogeneous Internet traffic with Machine Learning and Deep Learning techniques for satellite communications. Comput. Netw. 2020, 173, 107213.
  6. Aceto, G.; Ciuonzo, D.; Montieri, A.; Pescapé, A. Toward effective mobile encrypted traffic classification through deep learning. Neurocomputing 2020, 409, 306–315.
  7. Hu, X.; Gu, C.; Wei, F. CLD-Net: A Network Combining CNN and LSTM for Internet Encrypted Traffic Classification. Secur. Commun. Netw. 2021, 2021, 5518460.
  8. Lashkari, A.; Draper-Gil, G.; Mamun, M.; Ghorbani, A. Characterization of Encrypted and VPN Traffic Using Time-Related Features. In Proceedings of the International Conference on Information Systems Security and Privacy (ICISSP), Rome, Italy, 19–21 February 2016.
  9. Pan, S.; Yang, Q. A Survey on Transfer Learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 297–315.
  10. Taheri, S.; Salem, M.; Yuan, J. Leveraging Image Representation of Network Traffic Data and Transfer Learning in Botnet Detection. Big Data Cogn. Comput. 2018, 2, 37.
  11. Garcia, S.; Grill, M.; Stiborek, J.; Zunino, A. An empirical comparison of botnet detection methods. Comput. Secur. J. 2014, 45, 100–123.
  12. Sun, G.; Liang, L.; Chen, T.; Xiao, F.; Lang, F. Network traffic classification based on transfer learning. Comput. Electr. Eng. 2018, 69, 920–927.
  13. Liu, X.; You, J.; Wu, Y.; Li, T.; Li, L.; Zhang, Z.; Ge, J. Attention-based bidirectional GRU networks for efficient HTTPS traffic classification. J. Inf. Sci. Eng. 2020, 541, 297–315.
  14. Hu, X.; Gu, C.; Chen, Y.; Wei, F. tCLD-Net: A Transfer Learning Internet Encrypted Traffic Classification Scheme Based on Convolution Neural Network and Long Short-Term Memory Network. In Proceedings of the 2021 International Conference on Communications, Computing, Cybersecurity, and Informatics (CCCI), Beijing, China, 15–17 October 2021; pp. 1–5.
  15. Wang, C.; Dani, J.; Li, X.; Jia, X.; Wang, B. Adaptive Fingerprinting: Website Fingerprinting over Few Encrypted Traffic. In Proceedings of the 11th ACM Conference on Data and Application Security and Privacy, Virtual Event, 26–28 April 2021.
  16. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; Lempitsky, V. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 2016, 17, 2030–2096.
  17. Tzeng, E.; Hoffman, J.; Saenko, K.; Darrell, T. Adversarial Discriminative Domain Adaptation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2962–2971.
  18. Sirinam, P.; Mathews, N.; Rahman, M.; Wright, M. Triplet Fingerprinting: More Practical and Portable Website Fingerprinting with N-shot Learning. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, London, UK, 11–15 November 2019.
  19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 5998–6008.
  20. Bikmukhamedo, R.; Nadeev, A. Generative transformer framework for network traffic generation and classification. T-Comm-Telecommun. Transp. 2020, 14, 11.
  21. Wang, H.; Li, W. DDosTC: A Transformer-Based Network Attack Detection Hybrid Mechanism in SDN. Sensors 2021, 21, 5047.
  22. Sharafaldin, I.; Lashkari, A.; Hakak, S.; Ghorbani, A. Developing Realistic Distributed Denial of Service (DDoS) Attack Dataset and Taxonomy. In Proceedings of the IEEE 53rd International Carnahan Conference on Security Technology, Chennai, India, 1–3 October 2019.
  23. Kozik, R.; Pawlicki, M.; Choraś, M. A new method of hybrid time window embedding with transformer-based traffic data classification in IoT-networked environment. Pattern Anal. Appl. 2021, 24, 1441–1449.
  24. Garcia, S.; Parmisano, A.; Erquiaga, M. IoT-23: A labeled dataset with malicious and benign IoT network traffic (Version 1.0.0) [Data set]. Zenodo 2020.
  25. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
  26. He, H.; Yang, Z.; Chen, X. PERT: Payload Encoding Representation from Transformer for Encrypted Traffic Classification. In Proceedings of the 2020 ITU Kaleidoscope: Industry-Driven Digital Transformation (ITU K), Online, 7–11 December 2020; pp. 1–8.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
