ICLSTM: Encrypted Traffic Service Identification Based on Inception-LSTM Neural Network

Abstract: The wide application of encryption technology has gradually made traffic classification a major challenge in the field of network security. Traditional machine learning methods, which rely heavily on feature engineering, can no longer fully meet the needs of encrypted traffic classification. Therefore, we propose an Inception-LSTM (ICLSTM) traffic classification method in this paper to achieve encrypted traffic service identification. This method converts traffic data into gray images and then uses the constructed ICLSTM neural network to extract key features and classify the traffic. To alleviate the problem of category imbalance, a separate weight is set for each category in the training phase, which makes the treatment of different categories of encrypted traffic more symmetrical and the identification results more balanced. The method is validated on the public ISCX 2016 dataset, and the results of five classification experiments show that the accuracy of the method exceeds 98% for both regular encrypted traffic service identification and VPN encrypted traffic service identification. At the same time, this deep learning-based classification method greatly simplifies traffic feature extraction.


Introduction
In recent years, traffic encryption has been widely used on the Internet due to advanced encryption technology. A large number of services and applications use encryption algorithms as the primary method for protecting information. Gartner estimated that more than 80% of enterprise network traffic was encrypted by 2019, and 94% of Google network traffic was encrypted by May 2019. This encryption technology not only protects the freedom, privacy, and anonymity of network users, but also enables them to circumvent firewall detection and surveillance systems [1]. However, encryption has also been exploited by unscrupulous individuals to gain illegal benefits. For example, in 2020, more than 70% of malware campaigns used some kind of encryption to hide malware delivery, commands and data leakage. Therefore, the identification and classification of encrypted traffic have received a lot of attention from academia and industry [2]. The development of encryption technology makes the data packets change from plaintext to ciphertext after passing through the encryption algorithm (for example, symmetric cryptography or asymmetric cryptographic algorithm, etc.). A lot of information becomes no longer visible, which also brings great difficulties to the encrypted traffic classification. Practical scenarios often require the identification of specific protocols or application types. For encrypted application traffic, this also makes the traffic classification difficult because there are more application types and there is little difference between different types.
With the good results achieved by deep learning in image classification, speech recognition, natural language, and so forth, deep learning methods are also gradually applied to the field of network security because of their advantages of automatic feature extraction. To learn spatial features, researchers have used convolutional neural networks (CNNs), including one-dimensional convolutional neural networks (1dCNN) and two-dimensional convolutional neural networks (2dCNN) [3] to obtain better classification results. Javaid et al. [4] proposed a network intrusion detection scheme using sparse autoencoders (SAEs). Reference [5] used Long Short Term Memory (LSTM) to extract time series features between traffic groupings. However, the commonly used CNN, LSTM, and other methods improve the classification results while the problem of network computational complexity cannot be ignored.
Deep learning has emerged as a highly desirable end-to-end approach for traffic classification. It is able to learn the nonlinear relationship between the original input and the corresponding output without decomposing the problem into subproblems of feature selection and classification [6,7]. One advantage of deep learning is its higher learning capability compared with traditional ML methods [8]. Another advantage is that it selects features automatically through training, so it needs neither domain experts to select features nor complex feature engineering. In addition, because applications differ in popularity, class imbalance often arises in traffic samples when traffic datasets are constructed [9]. Based on the above analysis, we propose a new deep learning model for encrypted traffic service identification: a neural network structure that places an Inception module [10] in parallel with an LSTM module. The model uses a convolutional neural network with Inception modules to extract the local spatial features of packets and an LSTM module to extract packet time series features. The accuracy and effectiveness of the model are evaluated on the public ISCX 2016 traffic dataset, and weights are assigned to the different categories of the imbalanced dataset during the experiments. The main contributions of this paper are summarized as follows:
• A new encrypted traffic identification method, ICLSTM, is proposed. It extracts traffic features automatically using neural networks and requires no complex feature engineering.
• A model architecture containing two neural networks is proposed for feature extraction of encrypted traffic. A one-dimensional convolutional neural network with embedded Inception modules is used to extract the local features of the traffic, and the LSTM model is used to extract the temporal features of the packets within a session. The extracted features are then fused, which extends the feature information and enhances the characterization of packet features. Experiments show that our method achieves better results in encrypted traffic service identification.
• A processing scheme for imbalanced datasets is proposed. It enhances the symmetry of the data by assigning weights to the different categories, which effectively alleviates the data imbalance problem.
Section 2 presents the work related to the classification of encrypted traffic. Section 3 presents the proposed method in this paper. Section 4 focuses on the experimental results and evaluation. Section 5 gives the conclusion and future work.

Related Work
The rapid growth of encrypted network traffic makes traffic classification more difficult. The existing solutions to traffic classification are mainly divided into three categories: port-based, deep packet inspection (short for DPI) [11,12] based and machine learning approaches [13,14]. Classical port-based approaches perform well for applications with specific port numbers, but do not classify all protocols due to the dynamic port assignment used by many applications in the current state [15]. DPI can analyze the entire packet data to identify its network protocol and application, but this inspection is difficult to properly analyze encrypted traffic because the packet payload is encrypted into a pseudo random format, containing fewer common features used for traffic classification. Current research on encrypted traffic classification mainly focus on automatic classification of applications using machine learning algorithms, classification of well-known applications such as HTTP, SMTP, FTP, Skype, and so forth. Among them, flow features (duration, bytes per second) and packet features (packet size, inter-packet duration) are the most commonly used features. Michael J. De Lucia et al. used support vector machine (SVM) malicious communication detection mechanism and achieved good results and low false positive rate (FPR) [16]. Gil et al. used time-dependent features, such as flow duration, bytes of traffic per second, forward inter arrival time and backward inter arrival time, and so forth, to describe network traffic using K nearest neighbors (KNN) and C4.5 decision tree algorithm [17]. They used the C4.5 algorithm to describe six major categories of traffic such as web browsing, email, chat, and VoIP, achieving a recall rate of about 92%. On the same dataset over VPN tunnels, they achieved a recall of about 88% using the C4.5 algorithm. 
Reference [18] proposed a multi-level P2P traffic classification technique that uses C4.5 decision trees and statistical flow features, and that is also applicable to encrypted traffic. Similarly, refs. [19,20] used machine learning (KNN, SVM) for fine-grained classification of encrypted traffic. However, machine learning-based methods rely heavily on effective feature extraction and selection, which consumes considerable resources. Moreover, traditional threat detection that decrypts encrypted traffic is not feasible, as it not only consumes a lot of computational resources but also compromises privacy and data integrity [21].
The deep learning approach can effectively solve these problems because deep learning methods can read information directly from the raw data and can achieve a high accuracy rate. Not much work has been done to classify network traffic using deep learning methods. Wang [22] proposed an end-to-end encrypted traffic classification method based on one-dimensional convolutional neural networks. The method integrated feature extraction, feature selection, and the classifier into a unified end-to-end framework designed to automatically learn the nonlinear relationship between the original input and the desired output, and it obtained a high accuracy rate. The author illustrated the effectiveness of convolutional neural networks for encrypted traffic classification, but neither tuned the model nor considered whether the dataset was balanced. Reference [23] proposed a deep packet framework built on stacked autoencoders and convolutional neural networks that handles both traffic characterization and application identification for encrypted traffic, achieving 98% recall in application identification and 94% in traffic classification. Following this line of research, Zou et al. [24] proposed a new deep neural network combining convolutional and recurrent neural networks to improve classification accuracy, and also illustrated the effectiveness of hidden temporal features among traffic packets for the classification of encrypted traffic. Other works similarly apply stacked autoencoders (SAEs), CNNs, and LSTM networks [25][26][27]. Deep networks require a large number of hidden layers and a large number of neurons to improve their accuracy. Inspired by this, the deep neural network model combining an Inception network and an LSTM network in our work learns data features directly from the raw traffic and classifies the application traffic for the purpose of encrypted traffic service identification.
For the class imbalance problem, we consider assigning different weights to different classes separately to reduce the impact of class imbalance on the classification effect.

Methodology
In this work, we design an Inception-LSTM architecture, which consists of two deep learning components, namely the Inception module and the LSTM module, for application service identification. Before training the model, we need to convert the prepared traffic data into the format required by the model so that it can be fed into the model correctly. For this purpose, we preprocess the dataset and use the preprocessed data to train a neural network model that predicts the category to which the packets belong. The preprocessing stage, the neural network architecture, and the method design are described in detail below. Figure 1 shows the architecture of the ICLSTM.

Dataset
The data source used in this paper is "ISCX 2016" [17], which consists of captured traffic from different applications. In this dataset, the captured packets are classified into different categories based on the application that generated the packet (e.g., Skype, Hangouts, etc.) and the specific activity that the application engaged in during the captured session (e.g., voice call, chat, file transfer, or video call). The published ISCX 2016 dataset includes seven types of regular encrypted traffic and seven types of VPN tunneling traffic. In this paper, a total of 5.17 GB of raw traffic covering six types of regular encrypted traffic and six types of VPN-tunneled encrypted traffic is selected as the sample data, and the sample datasets are in the PCAP file format. Table 1 shows the details of the sample dataset after processing. There are five granularities of network traffic segmentation: TCP connection, flow, session, service, and host. Currently, the segmentation methods based on flow and session are the most widely used in research. A flow consists of all packets with the same five-tuple (source IP address, source port, destination IP address, destination port, and transport layer protocol). A session consists of all packets of a bidirectional flow, that is, the source and destination IP/port of the five-tuple are interchangeable.
Raw flow: the set P of all packets, where each packet carries five-tuple information. Flow: the set P is divided into multiple subsets according to the five-tuple information, and the packets in each subset are arranged in chronological order within a certain time window, forming a flow f. Session: unlike a flow, the source and destination IP/port of its five-tuple are interchangeable, so it is also called a bidirectional flow. Since most current research is based on sessions, we also chose sessions as the traffic unit.
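The distinction between a flow and a session can be illustrated with a short sketch. The `Packet` record and its field names below are hypothetical and only serve the illustration; the point is that a session key sorts the two endpoints so that both directions map to the same group.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical minimal packet record; the field names are assumptions.
    ts: float
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str
    payload: bytes


def flow_key(p):
    """Unidirectional five-tuple: direction matters."""
    return (p.src_ip, p.src_port, p.dst_ip, p.dst_port, p.proto)


def session_key(p):
    """Direction-independent key: the two endpoints are sorted, so packets
    from A to B and from B to A fall into the same session."""
    a = (p.src_ip, p.src_port)
    b = (p.dst_ip, p.dst_port)
    return (min(a, b), max(a, b), p.proto)


def group(packets, key_fn):
    """Group packets by key, keeping chronological order inside each group."""
    groups = defaultdict(list)
    for p in sorted(packets, key=lambda pkt: pkt.ts):
        groups[key_fn(p)].append(p)
    return dict(groups)


packets = [
    Packet(0.0, "10.0.0.1", 5000, "10.0.0.2", 443, "TCP", b"hello"),
    Packet(0.1, "10.0.0.2", 443, "10.0.0.1", 5000, "TCP", b"world"),
    Packet(0.2, "10.0.0.1", 5000, "10.0.0.2", 443, "TCP", b"!"),
]
flows = group(packets, flow_key)        # two unidirectional flows
sessions = group(packets, session_key)  # one bidirectional session
```

In this toy capture the request and response directions form two flows but a single session, which is why a session carries more interaction information than either flow alone.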
In addition, analyzed from the protocol layer perspective, although the traffic feature is mainly reflected in the application layer, it is also reflected in different forms in each protocol layer. Therefore, to avoid missing key features, all protocol layer session flow data are selected in this paper for data pre-processing.

Data Preprocessing
To reduce the redundant information in the raw traffic and adapt it to an input form suitable for the deep learning model, we used the USTC-TK2016 toolset [22] to process the raw traffic in the following four steps: traffic splitting, traffic cleaning, unifying the input size and generating gray images, and converting to IDX files.
(1) Pcap-session segmentation: continuous raw traffic is divided into multiple discrete traffic units according to a certain granularity [28].
(2) Traffic cleaning: packet files without an application layer generate bin files without actual content, and packet files with the same content for sessions or flows generate duplicate files. It is therefore necessary to clean up the split traffic data and retain only the needed traffic data.
(3) Unifying the input size and generating gray images: training a neural network requires a fixed input size, so we unify the session segments from the above steps to 784 bytes. On the one hand, related papers have shown that 784 bytes are effective for classification; on the other hand, 784 bytes is more lightweight than the 1500 bytes used in some of the literature. If a segment is larger than 784 bytes, it is trimmed to 784 bytes. If a segment is smaller than 784 bytes, 0x00 bytes are appended at the end to pad it to 784 bytes. The result is converted to a gray image of size 28 × 28.
(4) Conversion to IDX: IDX is a common file format in the field of deep learning. We converted the generated traffic gray images to the IDX file format, which is commonly used by neural network models.
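Step (3) above can be sketched in a few lines of plain Python. This is a minimal illustration of the trim-or-pad rule and the 28 × 28 reshaping, not the USTC-TK2016 implementation; the function names are hypothetical, and the scaling to [0, 1] mirrors the normalization applied later when the images are read by the model.

```python
def session_to_image(data: bytes, side: int = 28):
    """Trim or zero-pad a session's bytes to side*side (784) bytes and
    reshape them into a side x side gray image (one byte per pixel)."""
    size = side * side
    fixed = data[:size].ljust(size, b"\x00")  # trim if longer, pad with 0x00 if shorter
    return [list(fixed[r * side:(r + 1) * side]) for r in range(side)]


def normalize(image):
    """Scale pixel values from [0, 255] to [0, 1]."""
    return [[px / 255.0 for px in row] for row in image]


# A 3-byte session is padded; a 1024-byte session is trimmed.
short = session_to_image(bytes([255, 1, 2]))
long_ = session_to_image(bytes(range(256)) * 4)
short_norm = normalize(short)
```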
We randomly select four preprocessed generated images from each service category in both regular encrypted traffic and VPN encrypted traffic. The results of the data preprocessing are analyzed using visualization techniques, as shown in Figure 2. Through the visualized image, it is obvious that there is a great distinction between different categories of encrypted traffic, and the data of the same category are similar. Therefore, we believe that these unstructured abstract data can be explored by neural networks to classify the encrypted traffic using richer features.

Class Imbalance
From Table 1, we can clearly see that the number of traffic samples differs significantly between classes. In a classification task, if the number of samples differs significantly between classes, the resulting model is generally better at predicting classes with more samples and worse at predicting classes with fewer samples, so we need a more balanced training set. To solve this problem, we automatically assign a weight to each category of data in the training set [29]: categories that make up a large share of the data are assigned smaller weights, and categories with a small share are assigned larger weights. This makes the loss function pay more attention to the data with insufficient sample size. In this way, the problem of data imbalance is alleviated and the generalization ability of the model is enhanced. This is calculated in Equations (1) and (2).

$$\mathrm{score} = \log\left(\mu \times \frac{\mathrm{total}}{\mathrm{label\_nums}_{key}}\right) \qquad (1)$$

where µ = 0.15; the log smooths the weights of the imbalanced classes; total is the number of training samples; and label_nums_key is the number of training samples labeled as key. This weighting is used only in the model training phase.
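As a rough illustration of Equation (1), the sketch below computes a score per class from label counts. The class names and counts are made up for the example. Because the raw score becomes small or negative for dominant classes, the sketch clamps it at a floor of 1.0; that floor is an assumption of this sketch and is not stated in the text above.

```python
import math
from collections import Counter


def class_weights(labels, mu=0.15, floor=1.0):
    """Per-class weight from Equation (1): score = log(mu * total / count).

    The `floor` clamp is an assumption of this sketch: it keeps the weight
    of dominant classes from dropping to zero or below.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    weights = {}
    for key, n in counts.items():
        score = math.log(mu * total / n)
        weights[key] = max(score, floor)
    return weights


# Illustrative, made-up label distribution (not taken from Table 1).
labels = ["chat"] * 9000 + ["email"] * 900 + ["p2p"] * 100
weights = class_weights(labels)
```

With this distribution the rare class receives a weight of log(15), while the two larger classes stay at the floor, so the loss emphasizes the under-represented class.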

Background on Neural Networks
Neural networks (NNs) are networks consisting of a large number of interconnected artificial neurons. These networks are usually composed of a large number of building blocks called neurons, which also represent a specific output function, called the activation function. The neurons are connected to each other by a number of linkages. Each connection between two neurons represents a weighted value, called a weight, for the signal passing through that connection. It is equivalent to the memory of an artificial neural network. During training, a large number of data samples are used to train the neural network to achieve the desired output of the neural network. Deep learning framework can be considered as a special kind of neural network with many hidden layers. Nowadays, with the rapid growth of computing power and the availability of graphics processing units (GPUs), deep neural network training has become more feasible [23]. Therefore, the current deep learning framework has attracted the attention of researchers in various fields. In the following, we will briefly review the two most important deep neural networks used in our proposed network traffic classification scheme, named the Inception module and the LSTM module.

Inception Module
CNNs are a class of feedforward neural networks that include convolutional computation and have a deep structure with representational learning capability to classify input information in a translation invariant manner according to its hierarchical structure, and are one of the core algorithms for image classification problems [30,31]. Its input layer can handle multidimensional data and data with local relevance, for example, the common one-dimensional convolutional neural network can receive one or two-dimensional arrays. It is good at mining the internal features of hidden images and is well suited for processing sequence data or language-based data. Some convolutional neural networks can also recognize words in images and also can be combined with RNNs to extract character features and sequence labeling from images, which also provides the possibility of successful applications of convolutional neural networks on network traffic [32]. For example, ref. [33] used MLP, SAE and CNN to classify over 200,000 samples of encrypted data from 15 applications. The core of convolutional neural networks are: local perceptual weight sharing and down-sampling. However, for the traditional CNN architecture, the mainstream network structure tends to increase the number of layers and neurons, which also greatly increases the disadvantages of the network: more parameters, easier overfitting; the larger the network, the greater the complexity and difficult to apply; the deeper the network, the easier the gradient disappears. Inspired by the GoogLeNet network [10], we introduce two Inception modules on the basis of convolutional neural network to reduce the parameters while increasing the depth and width of the network. The structure of the Inception module is shown in Figure 3. The Inception module is a change to the convolutional layers to increase the network width, enhance the adaptability of the network to the scale, and improve the network performance. 
In each Inception module, convolutional kernels of different sizes, which can be interpreted as different receptive fields, are applied in parallel and their outputs are then connected to enrich the information in each layer. The Inception module uses convolution kernels with different scales (1*1, 3*3, and 5*5) to extract features from the input. Features with different scales can thus be extracted, and the output features are no longer uniformly distributed: strongly correlated features are clustered together, while non-critical features are weakened. In addition, the module introduces 1*1 convolution kernels to reduce feature dimensionality, the number of parameters, and computational complexity. The output therefore contains less "redundant" information, and because this feature set is passed on layer by layer and used as the input to backpropagation, convergence is faster. The output part of the module has a concatenate layer, whose purpose is to merge four sets of features of different types but the same size into a new feature map, which represents the local spatial features inside the data packet.
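The parameter saving from the 1*1 convolution can be checked with simple arithmetic. The sketch below compares a direct 5*5 convolution with a 1*1 reduction followed by a 5*5 convolution; the channel widths (192, 16, 32) are assumed values chosen for illustration and are not the settings of the ICLSTM model.

```python
def conv_params(kernel, c_in, c_out, bias=True):
    """Number of trainable parameters of a kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


# Assumed channel widths, for illustration only.
C_IN, C_MID, C_OUT = 192, 16, 32

# 5*5 convolution applied directly to the input channels.
direct = conv_params(5, C_IN, C_OUT)

# 1*1 convolution first reduces the channels, then the 5*5 convolution runs
# on the reduced representation.
bottleneck = conv_params(1, C_IN, C_MID) + conv_params(5, C_MID, C_OUT)

reduction = 1 - bottleneck / direct  # fraction of parameters saved
```

Under these assumed widths the bottleneck branch needs roughly a tenth of the parameters of the direct branch, which is the effect the 1*1 kernels are introduced for.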

Long Short-Term Memory Module
Nowadays, it is common to treat network traffic as a continuous one-dimensional byte stream with a hierarchical structure. Its overall structure (bytes, packets, sessions, communication streams, etc.) is similar to the structure of characters, words, and sentences in natural language, and it likewise exhibits sequential correlation. LSTM [34], as a special recurrent neural network, overcomes the vanishing and exploding gradient problems of the traditional RNN. Moreover, it is better at extracting long-term dependencies in sequential data and is more suitable for dealing with network traffic data with temporal characteristics. The LSTM is a chain structure with four internal network layers; the structure is shown in Figure 4. LSTM is able to remove information from or add information to the cell state through a structure called a gate, which is a combination of a sigmoid layer and an element-wise multiplication that selectively decides which information can pass. LSTM contains three gates to control the cell state, namely a forget gate, an input gate, and an output gate.
(I) Forget gate. The first step of LSTM is to decide what information needs to be discarded from the cell state. This is done by the sigmoid unit of the forget gate, which decides what information is removed from the LSTM memory (0: completely discarded, 1: completely retained). It is calculated as in Equation (3),
where $b_f$ is a constant, called the bias value. (II) Input gate. The second step of the LSTM is to use the input gate to decide what information to add to the cell state, that is, to the LSTM memory. The output is calculated as in Equations (4)-(6),
where $i_t$ indicates whether the value needs to be updated, $\tilde{C}_t$ denotes a new vector of candidate values, $C_t$ is the cell state, and $f_t$ is the output of the forget gate with a value between 0 and 1 (0: completely discarded, 1: completely retained).
(III) Output gate. After updating the cell state, it is necessary to determine which characteristics of the cell state are output, based on the inputs $h_{t-1}$ and $x_t$. The calculation is given in Equations (7) and (8),
where $O_t$ is the output gate value and $h_t$ is a value between −1 and 1. This step determines which part of the cell state will be output to the next neural network layer or the next time step.
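For reference, the three gates described above match the standard LSTM formulation. Assuming the model follows that standard form, Equations (3)-(8) can be written as below. The weight matrices $W_f$, $W_i$, $W_C$, $W_O$ and the biases $b_i$, $b_C$, $b_O$ are notation introduced here for illustration; $\sigma$ is the sigmoid function, $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input, and $\odot$ is element-wise multiplication.

```latex
\begin{align}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{3}\\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{4}\\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \tag{5}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{6}\\
O_t &= \sigma\left(W_O \cdot [h_{t-1}, x_t] + b_O\right) \tag{7}\\
h_t &= O_t \odot \tanh\left(C_t\right) \tag{8}
\end{align}
```

Since $O_t$ lies in (0, 1) and $\tanh(C_t)$ lies in (−1, 1), the hidden state $h_t$ stays between −1 and 1, consistent with the description of the output gate.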
The backpropagation through time algorithm of LSTM starts from the last value of the time series, calculates the gradient of each parameter step by step in reverse, and finally updates the network parameters with the gradients from each time step. First, the partial derivatives with respect to the output of the memory cell are calculated, and then the partial derivatives of the output gate are calculated. The partial derivatives with respect to the memory cell state, the forget gate, and the input gate are then calculated separately. Finally, the model connection weights are updated using gradient descent.

Model Architectures
In recent years, both CNN and LSTM have been successfully used in NLP, such as sentiment analysis and text classification [35][36][37]. We are inspired by these studies to design an Inception-LSTM framework that includes two deep learning methods, namely Inception module and LSTM module for encrypted traffic classification, and compare the performance with the latest model.
The input of the model is the processed encrypted traffic session data and the output is the object labels to be estimated. The model is executed by two parallel neural networks. The proposed Inception component is based on a one-dimensional CNN convolutional layer which introduces two identical Inception modules and a normalization layer [38] to enhance the computational as well as generalization capabilities of the network. The LSTM module extracts the temporal features of the packets within a session. The model receives processed network traffic data, which is considered as essentially hierarchically structured byte flow data.
The Inception architecture consists of two identical Inception modules that are stacked on top of each other, and the structure of each Inception module is shown in Section 3.4.1, Figure 3. Firstly, the Inception architecture reads the 28*28*1 traffic images from the IDX file and uses the Inception module to extract the local features inside the packets as much as possible. These image pixels are normalized from [0, 255] to [0, 1]. Each Inception module is superimposed by multiple convolutions to extract richer features, so the inputs of this model are processed by multiple parallel convolutional branches and then the outputs of these branches are combined into a single tensor. In the structure of the system proposed in this paper, the outputs of the two Inception modules are passed through a flatten layer and the outputs are converted to 1*1240 one-dimensional vectors to be fused with the outputs of the LSTM modules.
Similarly, the LSTM module extracts the backward and forward dependencies as the temporal relationships inside the packets. The module first reads a 28*28*1 traffic image and is then followed by a fully connected layer. In addition, dropout is applied after each layer to avoid overfitting, and the LSTM module outputs a 1*200 one-dimensional vector. Finally, the local spatial features extracted by the Inception module and the inter-packet timing features extracted by the LSTM module are fused to extend the packet feature information and enhance the characterization of the internal features of the packet. We feed the resulting 1*1440 one-dimensional vector into two fully connected layers and use softmax for the traffic classification.
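The fusion step is a plain concatenation, which is why the fused vector has 1240 + 200 = 1440 elements. The sketch below checks this dimension and shows a numerically stable softmax over six class scores. The feature values and logits are placeholders, and the two fully connected layers between fusion and softmax are omitted.

```python
import math


def fuse(inception_features, lstm_features):
    """Concatenate the two branch outputs into one feature vector."""
    return list(inception_features) + list(lstm_features)


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Placeholder branch outputs with the sizes given in the text.
inception_out = [0.0] * 1240
lstm_out = [0.0] * 200
fused = fuse(inception_out, lstm_out)

# Placeholder class scores for a six-class task.
probs = softmax([2.0, 1.0, 0.1, 0.0, -1.0, -2.0])
predicted_class = probs.index(max(probs))
```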

Experimental Configuration
The experiments are implemented in Python 3.7. The experimental environment is a 64-bit Windows 10 operating system with an Intel(R) Core(TM) i5-3470 CPU @ 3.20 GHz and 20 GB of memory.
The experiments use five-fold cross-validation: the preprocessed data is randomly divided into five parts, and in turn four of them are used as training data and one as test data. The main parameters and hyperparameters of the model are as follows: the batch size is 256, the number of epochs is 200, and Adam (lr = 0.0005, beta_1 = 0.95, beta_2 = 0.999, epsilon = $10^{-8}$) is chosen as the optimizer for parameter learning.
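The five-fold protocol can be sketched as follows. This is a minimal, library-free illustration of the splitting logic only; the function name, the seed, and the sample count are assumptions made for the example.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists: the shuffled data is cut into k parts,
    and each part serves once as the test set while the rest is used for training."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 103 samples chosen deliberately so the folds are not all the same size.
splits = list(k_fold_indices(103, k=5))
```

Every sample appears in exactly one test fold, so the five test results together cover the whole dataset.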
Reference [22] demonstrated that sessions are more suitable as the type of traffic representation used for encrypted traffic classification. Because sessions contain bidirectional flows, which carry more interaction information than unidirectional flows, an end-to-end method can learn more features from sessions than from flows. From the protocol layer perspective, in order to avoid missing protocol layer information, five experiments are conducted in this paper using traffic samples from all layers of the session. Our experiments are listed in Table 2 below.

Table 2
Experiment | Description | Number of classes
1 | Protocol encapsulation identification | 2
2 | Mixed encrypted traffic service identification | 6
3 | Regular encrypted traffic service identification | 6
4 | VPN encrypted traffic service identification | 6
5 | Encrypted traffic service identification | 12

Experiment 1 is protocol encapsulation identification, which distinguishes VPN-encapsulated encrypted traffic from regular encrypted traffic. Experiment 2 is mixed encrypted traffic service identification: it ignores the encryption method and encapsulation type and considers only the service type, so regular encrypted traffic and VPN encrypted traffic with the same type of service are grouped together, giving six categories in total. Experiments 3 and 4 are regular encrypted traffic service identification and VPN encrypted traffic service identification, with six service types each. Experiment 5 is traffic service identification that takes the encapsulation method into account: it identifies the service types under regular encrypted traffic (six service types) and VPN encrypted traffic (six service types), for a total of twelve categories.
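The five experiments differ only in how the twelve fine-grained labels are mapped to targets. The sketch below makes this mapping explicit; it assumes that VPN samples are labeled with a "VPN-" prefix followed by the service name, and the helper names are hypothetical.

```python
# The six service types, with VPN traffic assumed to carry a "VPN-" prefix.
SERVICES = ["Chat", "Email", "File Transfer", "P2P", "Streaming", "VoIP"]
ALL_LABELS = SERVICES + ["VPN-" + s for s in SERVICES]


def split_label(label):
    """Return (is_vpn, service) for a fine-grained label."""
    is_vpn = label.startswith("VPN-")
    return is_vpn, (label[4:] if is_vpn else label)


def experiment_label(label, experiment):
    """Map a fine-grained label to the target of the given experiment.
    Returns None when the sample is not part of that experiment."""
    is_vpn, service = split_label(label)
    if experiment == 1:   # encapsulation: VPN vs. regular
        return "VPN" if is_vpn else "Regular"
    if experiment == 2:   # service type, encapsulation ignored
        return service
    if experiment == 3:   # regular encrypted traffic only
        return None if is_vpn else service
    if experiment == 4:   # VPN encrypted traffic only
        return service if is_vpn else None
    if experiment == 5:   # service type and encapsulation together
        return label
    raise ValueError("experiment must be in 1..5")


# Number of distinct target classes per experiment.
class_counts = {
    e: len({experiment_label(lbl, e) for lbl in ALL_LABELS} - {None})
    for e in range(1, 6)
}
```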

Evaluation Indicators
In this paper, we use four evaluation metrics: Accuracy, Precision, Recall, and F1-score. Accuracy is used to evaluate the overall performance of the classifier and is calculated as in Equation (9). Precision (Equation (10)), Recall (Equation (11)), and F1-score (Equation (12)) are often used to evaluate the performance on a certain class of traffic. TP is the number of instances correctly classified as X, TN is the number of instances correctly classified as non-X, FP is the number of instances incorrectly classified as X, and FN is the number of instances incorrectly classified as non-X. The indicators are calculated as follows.
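The four indicators can be computed directly from the confusion counts. The sketch below uses the standard definitions, which Equations (9)-(12) are assumed to follow; the counts in the example are made up.

```python
def class_metrics(tp, tn, fp, fn):
    """Standard definitions for one class X:
    accuracy  = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    f1        = 2 * precision * recall / (precision + recall)
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up confusion counts for a single class.
m = class_metrics(tp=90, tn=880, fp=10, fn=20)
```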

Evaluation
We use the ISCX 2016 dataset and a five-fold cross-validation approach for the experiments, and the experimental results report the per-category metrics after cross-validation on each classification task. Figure 5 shows the variation of the five-fold cross-validation training accuracy for the five experiments. From Figure 5a-e, we can see that the overall average results are stable, and the training accuracy gradually reaches a peak and levels off as the number of training batches increases. In particular, the best training results are obtained in protocol encapsulation identification (Experiment 1), regular encrypted traffic service identification (Experiment 3), and VPN encrypted traffic service identification (Experiment 4), whose test accuracy reaches nearly 100%, 98.7%, and 98.9%, respectively. In the mixed encrypted traffic service identification (Experiment 2) and encrypted traffic service identification (Experiment 5), the overall training results are satisfactory and meet expectations, except for a few folds in which the results are not good enough. However, the mixed encrypted traffic service identification suffers from insufficient training stability. We believe this is because VPN encrypted traffic and regular encrypted traffic use different encryption protocols, which causes the service identification performance to vary greatly between applications.

Impact of Setting Category Weights
To alleviate the category imbalance, we assign weights to the different categories in the training phase. Taking encrypted traffic service identification as an example, Figure 6 shows the effect of setting weights on the F1-score, precision and recall of encrypted traffic service identification (Experiment 5).
Combined with the sample size analysis of the dataset in Table 1, Figure 6a-c show that the F1-score, precision and recall of most small-sample traffic types improved in encrypted traffic service identification after weights were assigned. For example, the F1-score of some small-sample traffic categories (P2P, Streaming, VPN-Email, etc.) improved by up to 4%. The precision of traffic categories with smaller samples (VPN-File Transfer, VPN-P2P, VPN-Streaming, VPN-Email, etc.) improved by 1% to 8%. Similarly, the recall of the small-sample categories improved by up to 4%. The same pattern appears in the other identification tasks. Meanwhile, our experiments also show that the F1-score, precision and recall of the large-sample traffic categories (Chat, File Transfer and VoIP) decreased by 0.3% to 5%. A slight decrease for large-sample traffic types is expected, because the category weights shift the focus toward small-sample traffic types.
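To make the weighting idea concrete, the sketch below derives one weight per category and applies it in a weighted cross-entropy loss. The exact weighting formula is not given in this section, so the inverse-frequency rule w_c = N / (K * n_c) is an assumption, as are the function names and the sample counts.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def inverse_frequency_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Assumed rule: w_c = N / (K * n_c), so rare classes get larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


def weighted_cross_entropy(probs: List[Dict[str, float]],
                           labels: Sequence[str],
                           weights: Dict[str, float]) -> float:
    """Weighted mean of -w_y * log p(y) over the samples."""
    loss_sum, weight_sum = 0.0, 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        loss_sum += -w * math.log(max(p[y], 1e-12))  # clamp to avoid log(0)
        weight_sum += w
    return loss_sum / weight_sum


# Hypothetical imbalanced training set: 8 "chat" flows and 2 "p2p" flows.
train_labels = ["chat"] * 8 + ["p2p"] * 2
class_weights = inverse_frequency_weights(train_labels)

# A misclassified "p2p" sample now costs four times as much as a "chat" one.
uniform = [{"chat": 0.5, "p2p": 0.5}, {"chat": 0.5, "p2p": 0.5}]
loss = weighted_cross_entropy(uniform, ["chat", "p2p"], class_weights)
```

With such weights, errors on small-sample categories contribute more to the loss, which is consistent with the observed trade-off: small-sample categories improve while large-sample categories lose a little.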

Analysis of Results
Figures 5a and 7a show that the accuracy of the proposed method reaches close to 1 on average in protocol encapsulation identification, and that the precision, recall and F1-score are also very high. Figure 7b-e show the results of the proposed method (ICLSTM) for traffic service identification on the test set. The overall classification performance of the model is stable, and the precision, recall and F1-score reach high values in most of the traffic identification experiments. For example, Figure 7c,d show the regular encrypted traffic service identification and the VPN encrypted traffic service identification, respectively (Experiments 3 and 4). The accuracy reaches 98.7% and 99%, the precision 98.8% and 97.6%, and the recall 99% and 97.5%, respectively. In the mixed encrypted traffic service identification experiments, we ignore the protocol encapsulation type and group the VPN encrypted traffic and regular encrypted traffic of the same application into one category for identification; the results are shown in Figures 5b and 7b and Table 3 (Experiment 2). The accuracy reaches 98.2% on average, and the test precision, recall and F1-score are 98%, 98.4% and 98.2%, respectively. Figure 7e shows the performance of the encrypted traffic service identification, which achieves an average accuracy of 98.1% on the test set, with precision, recall and F1-score of 98%, 98.2% and 98.1%, respectively.
As Table 3 shows, our ICLSTM model achieves high accuracy, precision, recall and F1-score for each traffic service identification task on the test set. This shows that the model can extract features autonomously and successfully differentiate each application service, which illustrates the effectiveness of the method proposed in this paper.

Model Comparison
To better analyze the performance of ICLSTM on encrypted traffic service identification, we compare it with the results of other methods proposed in recent studies, such as 1DCNN, Text Convolution [39] and SAE [19]. The results are shown in the following tables. Since the F1-score is not reported in some papers, some comparisons are made only in terms of precision and recall. The precision, recall and F1-score in the tables are the average values of the corresponding per-category identification results. Table 4 shows that the proposed method, ICLSTM, achieves almost the same performance as 1DCNN in protocol encapsulation identification, and that both the accuracy and the recall are substantially improved relative to the method proposed in [17]. Table 5 shows that the precision and recall are significantly improved in both regular encrypted traffic service identification and VPN encrypted traffic service identification, with the largest improvement, 13.6%, in regular encrypted traffic service identification.
For the mixed encrypted traffic service identification, no comparison was made because other papers did not report relevant experiments. For the encrypted traffic service identification (Experiment 5), Tables 3 and 6 show that the average accuracy of ICLSTM reaches 98.1% and that the regular encrypted traffic service identification achieves better performance: the precision, recall and F1-score are 96.6%, 97.6% and 97%, respectively, an improvement of nearly 10%. VPN encrypted traffic identification also achieves good recognition results, but they are not significantly better than those of [23], which deserves further research. Table 7 shows the performance of each category in encrypted traffic service identification, and it can be seen that our model has the highest overall precision, recall and F1-score. Last but not least, most papers do not consider the category imbalance, whereas we address this issue by assigning weights to different categories. The above experiments show that the proposed method obtains better identification and classification results in encrypted traffic service identification, which also verifies its effectiveness.

Conclusions and Future Work
In this paper, we propose the ICLSTM architecture for encrypted traffic service identification. We combine Inception and LSTM, which can simultaneously process the local spatial information of packets and the inter-packet timing information, and thereby enhance the characterization ability of traffic features. First, the extracted feature values are fused into a one-dimensional feature vector, which is fed into the fully connected layer, and the probability distribution over the labels is output through the softmax layer for classification. Second, the method addresses the problem of data imbalance by introducing weight parameters that assign weights to different categories, so that the model pays more attention to small-sample categories. In addition, we compare our method with other traffic classification methods. The experimental results show that the accuracy of ICLSTM exceeds 98% for each identification task. The precision, recall and F1-score of mixed encrypted traffic service identification, regular encrypted traffic service identification and VPN encrypted traffic service identification all reach 97%, which is better than other methods. Our method does not rely on feature engineering and has better identification capability in encrypted traffic service identification. The experimental results illustrate that deep learning methods are superior to traditional methods in traffic classification, and they will be the core of our future work. In future work, we plan to propose new methods to solve the problem of data imbalance, as well as more lightweight methods that make encrypted traffic classification easier and make our approach more suitable for real-time encrypted traffic service identification in practical scenarios.

Acknowledgments: The authors would like to thank the anonymous reviewers for their contribution to this paper.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: