An LSTM-Based Deep Learning Approach for Classifying Malicious Trafﬁc at the Packet Level

: Recently, deep learning has been successfully applied to network security assessments and intrusion detection systems (IDSs) with various breakthroughs such as using Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) to classify malicious trafﬁc. However, these state-of-the-art systems also face tremendous challenges to satisfy real-time analysis requirements due to the major delay of the ﬂow-based data preprocessing, i.e., requiring time for accumulating the packets into particular ﬂows and then extracting features. If detecting malicious trafﬁc can be done at the packet level, detecting time will be signiﬁcantly reduced, which makes the online real-time malicious trafﬁc detection based on deep learning technologies become very promising. With the goal of accelerating the whole detection process by considering a packet level classiﬁcation, which has not been studied in the literature, in this research, we propose a novel approach in building the malicious classiﬁcation system with the primary support of word embedding and the LSTM model. Speciﬁcally, we propose a novel word embedding mechanism to extract packet semantic meanings and adopt LSTM to learn the temporal relation among ﬁelds in the packet header and for further classifying whether an incoming packet is normal or a part of malicious trafﬁc. The evaluation results on ISCX2012, USTC-TFC2016, IoT dataset from Robert Gordon University and IoT dataset collected on our Mirai Botnet show that our approach is competitive to the prior literature which detects malicious trafﬁc at the ﬂow level. While the network trafﬁc is booming year by year, our ﬁrst attempt can inspire the research community to exploit the advantages of deep learning to build effective IDSs without suffering signiﬁcant detection delay.


Introduction
In a gigantic connected world like the Internet-of-Things (IoT), protecting the network from network attacks may require a comprehensive approach to defeat while under the limitation of the existing resources, e.g., the capacity of the gateway or edge server.The task is getting much harder with the rapid increase of the volume of attacks such as the distributed denial of service (DDoS) from IoT devices.For example, a Mirai-based botnet launches one of the biggest DDoS attacks in history directed at Dyn in October of 2016.The attack surpassed 1 Tbps at peak and created disruption for many major sites, including AirBnB, Netflix, PayPal, Visa, Amazon, the New York Times and so on.Currently, the variants of Mirai malware such as Satori and Miori are still storming to the network with the new records of traffic volume towards the victim, including the systems of the enterprise companies.In practice, the attacker often handles a large number of the botnets of injected IoT devices to launch such an attack.As a result, detecting the malware traffic in the early period of distributing the malicious code can significantly help to prevent the malware from becoming widespread, and mitigate the attack magnitude.
Over the decades, the malicious traffic detection system is the top topic and considered the most in the intrusion detection system (IDS) field.Thus far, the most popular approach of such IDSs is to build a classification mechanism to distinguish the incoming traffic into benign and malicious classes.An IDS can use the signature-based or rule-based approach to identify abnormal traffic.Apparently, the major drawbacks of these traditional IDSs include: (1) requiring many efforts, e.g., labor, to maintain the signature database and updates; (2) becoming ineffective for maintaining such database if the new traffic types or attack variances increasingly spike in a short time.In addition, many adversaries still have certain advantages because exploiting vulnerabilities in the systems of millions of codes, including the zero-day ones available on the darknet, may not require much effort.When the ecosystem of Internet-connected systems expands and the number of IoT devices inside increased rapidly, the attack surface also increases.Consequently, that increase can cause a greater risk of attack and easily overwhelm the update requests.Note that the traffic promises to explode quickly when massive quantities of IoT applications are predicted to flood the market in the next few years, particularly with the readiness of 5G networks.In this case, deep learning promises to be a game changer to ignite new detection approaches.The most benefits of the deep learning approaches are to build a thorough pattern that can highly represent the characteristics of a specific object through auto-learning a large volume of data and species.
Although applying deep learning (DL) in network security has just emerged recently, the topic has received a lot of attention from the research community [1][2][3][4][5][6][7] due to the robust auto-learning ability of DL.In addition, the evolution and the increased availability of GPU-processors significantly help to accelerate matrix computations and massive mathematical calculations, thus directly supporting the feasibility of the DL-based approaches.Deep learning for malicious traffic detection is generally categorized into many classes that are primarily based on the built-in network model, e.g., Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) or unsupervised learning such as an auto-encoder.However, existing literature still suffers several critical drawbacks.First, such systems are built on the flow-based approach (i.e., collecting packets of the same flow for a certain period of time and then classifying packets into flows) and evaluated offline.Since the system needs to allocate significant time for accumulating the traffic flow, it is not applicable to an online real-time system.This packet collection time can be extremely high if the flow is likely the part of a long session data.Furthermore, extracting the features for the flow-based detection system also requires a certain period of time.Apparently, this whole traffic profiling process naturally increases the overall time of the detection.Secondly, the flow-based classification can require a lot of resources, e.g., memory and storage, to store and process the accumulated traffic, including the deep checking on the assembled data.Some recent works, as well as our previous work [8], have shown that it is not necessary to examine all packets in a flow to determine whether it is a malicious flow; instead, a few first packets of the flow are enough for the detection purposes.
Based the above observation, in this research, we propose a novel approach to build a packet-based malicious traffic classification framework.Instead of wasting time on waiting for a complete flow of packets and then launching the classification, we use word embedding to learn the similarity in semantics and syntax of header fields a packet and leverage LSTM to learn temporal features of the fields.The goal of the method is to identify the attack only by directly monitoring a packet.
Due to considering the traffic at the packet level, our system has significant advantages to speed up the detection process, e.g., ignore checking a large number of packets in a session.The experimental results show that our method is competitive and prominent better in the term of accuracy, precision, recall rate, and F1-measure as compared to the prior work.
In summary, the main contributions of this study are as follows: (1) We propose an LSTM-based deep learning approach for the packet-level classification in IDSs.Specifically, our approach can distinguish semantic meanings of malicious traffic by packet, and are thus highly competitive with the flow-based DL approaches while reducing significant time on flow processing.
(2) We propose a novel word-embedding and LSTM model to classify the incoming packet into the normal and abnormal state.The evaluation results of the proposed model on ISCX2012, USTC-TFC2016, IoT data set from Robert Gordon University and IoT data set collected on our Mirai bonet show that the proposed framework can achieve near 100% accuracy and precision in detecting malicious packets.
(3) Our discussion on packet-based deep learning classification and detection promises to provide valuable information and inspire the research community to overcome the remaining challenges, particularly for the target of speeding up the detection process in DL-based IDSs.
The remainder of this paper is organized as follows.Section 2 lists several relevant fundamental concepts and state-of-the-art work of traffic classification and deep learning detection approaches.Section 3 presents our methodology and the detail architecture of the packet classification module in our classification system, along with the data set for evaluation.We show the experimental design and results in Section 4. The conclusions and future work of this paper are summarized in Section 5.

Related Work
Deep learning for malicious traffic detection has gained several notable achievements with various network models.For example, the authors in [2] proposed a novel network-based anomaly detection method which extracts behavior snapshots of the network and uses deep autoencoders to detect anomalous network traffic emanating from compromised IoT devices.The method is evaluated on commercial IoT devices infected by authentic botnets such as Mirai and BASHLITE.However, the performance of the work primarily relies on several self-generated synthetic data sets, which may lack the diversity of data exchange.In another research work [3], the authors proposed a malware traffic classification method using CNN by considering traffic data as images.The work is one of the first attempts to apply a representation learning approach for malware traffic classification from raw traffic.Unfortunately, the lack of detecting the unknown attacks and high detection time limits its prospect to deploy in practice.In addition, Li et al. [7] and Yin et al. [6] proposed using an RNN model or a combination with Restricted Boltzmann Machines (RBM) for extracting micro-flow features based on a small set of packets and then directly training with the raw packets.However, they mostly target the detection accuracy and ignore the evaluation of the detection time.Thus far, there are several other state-of-the-art relevant studies such as [9,10].Generally, the key points of such work are that they all rely on the flow or session-based approach, i.e., indexing the traffic by the flow.Thus far, we have found no attempt to detect the attack traffic at the packet level.Our statistics on the relevant research and our research position are summarized in Table 1.

Research
Year Method Feature Shortcomings M. S. Kim [11] 2004 Formulization flow-based For offline detection R. Fu [12] 2016 LSTM, GRU flow-based For offline detection B J. Radford [9] 2018 LSTM flow-based For offline detection C. Li [7] 2018 RNN, RBM process packet, flow-based For offline detection Y. Chen [4] 2018 TSDNN, QDBP flow-based, N class For offline classification X. Yuan [5] 2017 LSTM, GRU process packet, flow-based For offline detection J. Cui [10] 2018 WEDL-NIDS process packet, flow-based For offline detection Ours 2019 LSTM, Word-embedding raw data, packet-based Target for online detection Recently, word2vector [13,14] and LSTM-based learning models [10] provide a very powerful tool to give packet semantic meanings.This means we now can find a way to provide a thorough pattern for the malicious traffic by deep learning of a large number of raw packets.This particularly helps to reinforce our approach to consider the malicious traffic classification at the packet level.Compared to the prior studies, there are two major differences in our work: (1) the system can identify the malicious traffic by classifying the individual packet, instead of checking the whole packets of a traffic flow; (2) our system can work directly with raw packets, i.e., reading and making detection decisions.To the best of our knowledge, our study is the first work on packet-based malicious traffic detection.In addition, despite the high detection accuracy achieved by prior work for the same attack types, a novel research direction on packet-based classification (i.e., ours) sheds a light toward research on online real-time deep learning based IDSs.

Dataset
In deep learning, a lack of a variety of shareable traces dataset can be seen as one of the most obvious obstacles to obtain realistic progress on traffic classification.Many researchers may like creating synthetic datasets through their testbed.The shortcomings of this approach are that their generated traffic may not represent well the real traffic on the Internet, thus the performance evaluation results of the proposed solutions can face the problems of the credibility.Over the decades, the researchers have been recommended to use public famous traffic datasets such as KDD CUP1999 and NSL-KDD to test.Each dataset explicitly provides useful statistics on labeled features and the number of benign and malicious flows as well.However, these datasets do not provide information at the raw traffic level, which is required in our approach.In addition, while the credible dataset from Microsoft Malware Classification Challenge [15] can provide a metadata manifest and hexadecimal representation of the malware's binary content, the missing of packet information in the data, unfortunately, makes it hard to be used in this research.
For the ones matching the requirements, USTC-TFC2016 [3] is one of the most prominent datasets.Table 2 summarizes the statistics of the benign and malware traffic in the dataset.As their statement, there are a total of ten types of malware traffic from public websites which were collected from a real network environment from 2011 to 2015.Along with such malicious traffic, the benign part contains ten types of normal traffic which were collected using IXIA BPS, a professional network traffic simulation equipment.The size of USTC-TFC2016 dataset is 3.71 GB, and the format is pcap.For the Mirai-based DDoS traffic, we use the dataset from Robert Gordon University [16], denoted by Mirai-RGU.This data set contains Mirai botnet traffic such as Scan, Infect, Control, Attack traffic and normal IoT IP Camera traffic.It contains ten types of malicious traffic, include HTTP flood, UDP flood, DNS flood, Mirai traffic, VSE flood, GREIP flood, GREETH flood, TCP ACK flood, TCP SYN flood, and UDPPLAIN flood.The dataset includes features such as Time, Source, Destination, Protocol, Length, and overall payload information.In addition, we also collected a dataset from the Mirai botnet we have built in the campus of National Chung Cheng University(CCU) (denoted by Mirai-CCU), as shown in Figure 1, to generate four types of attack traffic with much bigger attack magnitude: TCP SYN (41 GB), TCP ACK (2.4 GB), HTTP POST (103 GB), UDP (127.06GB) in the total of 667 GB attack data.To enrich the dataset for our experiments, we select ISCX2012 [17], which also contains both malicious and benign traffic.ISCX2012 consists of packets collected for seven days.Packets collected in the first and sixth day are normal traffic.In the second and third day, both normal packets and attack packets are collected.In the fourth, fifth, and seventh days, besides the normal traffic, HTTP DoS, DDoS and IRC Botnet, and Brute Force SSH packets are collected, respectively.

Word Embedding and Data Preprocessing
Our goal is to classify incoming packets into benign or malicious classes without pre-processing the packets into a specific flow.To achieve this target, instead of considering the whole flow (e.g., as a document), we consider each packet (e.g., as a paragraph) and construct the key sentence from each packet, in which each word is a field in the packet header.After that, we apply word embedding [18] to extract semantics and syntax features from this sentence.We choose to consider meanings on the sentence rather than the whole paragraph since the meaning of a paragraph usually can be captured by the key sentence.Here, the order of the fields in each packet (fixed for each packet type) plays the role of resembling some grammar rules which are decisive in building sentence patterns for malicious traffic (signature-based detection) or benign traffic (anomaly detection).Notably, this packet-to-sentence-based model can significantly accelerate the traffic classification, since the behavior and characteristics of one or several first packets can entirely reveal whether their flow is a malicious one.
In general, a field in each packet could be a byte of the packet header, a field of the packet header, or a block of the packet payload.As the initial trial, we consider a field in the packet header as a word and trim a packet to a fixed length of n = 54 bytes.Depending on the field length in the packet, the word size can vary.The strict order of the fields in the packet structure constructs a potential grammar rule for the built sentence.Note that the extracting field stage can be done along with the packet reading/decoding (i.e., data pre-processing), thus the resource consumption is quite efficient.In addition, if the length of the packet is less than n bytes, it will be padded with zeros (as shown in Figure 2).The rationale for examining 54 bytes is that most TCP packets have a 14-byte MAC header, a 20-byte IP header, and a 20-byte TCP header.UDP header (4) extended to 12 fields udp.sport, udp.dpot, udp.len, udp.chksum, 0, 0, 0, 0, 0, 0, 0, 0 After applying the word-embedding technique to fields in the packet header, each header field is embedded, (i.e., to integer number format based on their index in a dictionary of all words) and reshaped, (i.e., dimension) and put to the LSTM-based training model for performing the classification task.In order to understand each attack type and remain consistent with the order of the fields in the packet header, it is important to maintain the sequence order of fields and a consistent word dictionary.Finally, the word size (hyperparameter) selection and further the word-embedding strategy are adjustable and can be specified by the deployment environment and IoT applications under the system's protection umbrella.

Classification
The classification mechanism is the next important part of the system.As illustrated in Figures 3-5, our packet-based traffic classification framework consists of three modules: packet pre-processing module, word embedding module, and LSTM module.However, to perform training in the LSTM module, we first need to label the benign/malicious mark for the packets in the dataset, e.g., ISCX2012 [17], USTC-TFC2016 [3] and Mirai-RGU [16].Based on the label of the flows (well-labeled in ISCX2012 and USTC-TFC2016 dataset) and the packets (well-labeled in Mirai-RGU dataset), we mark the label of the corresponding packet (malicious/benign) according to the flow's label that the packet belongs to.
For the LSTM module, we build standard LSTM cells, in which each cell consists of input gate, output gate, and forget gate.The sigmoid function is used as the activation function (α).Our proposed framework, as shown in Figure 4, consists of three LSTM layers with dropout.Total parameters of the first layer (dimension = 128, without dropout) exceed 4 million (4,194,304).The pseudo-code of our algorithm and the processing flow is illustrated in Algorithm 1.The preprocessing phase adjusts features to ensure data representation is suitable for the used algorithms, i.e., parsing the packet and converting it to the translated word.The new dataset of translated Words is in the format of the dataset of integer number.The translated dataset is then split into two parts: Training Data and Testing Data, respectively.Training stage starts on the Training Data and running the LSTM three-layer model.Note that the dimension from input data are required to reshape, e.g., 64, before performing the training and testing.The dropout rate is flexibly set and can be tuned up to 0.5.The loss function is based on binary cross entropy and RMSProp optimizer is selected for the learning rate adjustment method.A dense output layer with softmax is added to the model.The model is then compiled with the binary cross-entropy as the loss function and the Adam optimiser over total of 200 iterations.Finally, the model is testing on the Testing Data to determine the effectiveness of the model in terms of accuracy, precision, recall, f1-score, and loss (defined later in the next section).Parse each byte of the extracted data as a word; words defines the words extracted from the packet; 10: if Word in dictionary, dictionary(words[i]) then 12: index = dictionary(words[i]) % Pulling index of the words 13: else 14: dictionary[] = words[i]; %Update the dictionary with the new word; 15: index = dictionary(words[i]) The recall (Equation ( 3)) measures the number of correct classifications that is considered with the number of missed entries: The false alarm or fair (Equation ( 4)) measures the proportion of benign packets incorrectly classified as malicious: The F 1 -score (Equation ( 5)) measures the harmonic mean of precision and recall, which serves as a derived effectiveness measurement: In our experiments, based on the flow statistics in the datasets and number of labeled packets in data pre-processing, we can calculate the number of packets (including benign packets) for training and number of real attack packets in testing, as summarized in Tables 5-8 below.Due to a large amount of input data, we set the mini-batch size value at 100 and the data training with less than 200 epochs.
The performance results of our proposal for each kind of malicious traffic with 10-fold testing and 10-fold validation are shown in Tables 9 and 10, respectively.From these two tables, we can observe that our method can perform the classification with nearly 100% accuracy and precision in major cases of the security attacks, which outperforms compared to the prior works, e.g., [6,19,20].However, since the characteristics of traffic types and attacks are unique, the performance measurements are also different in each scenario and heavily relied on the training dataset.From our experimental experience, we notice that, besides tuning the learning model and its parameters (e.g., learning rate), word-embedding and attack representation samples in datasets play a critical role in the improvement of the performance.One might question the aforementioned results of our proposed DL-based framework.That is, intuitively, the proposed framework can classify malicious packets because it obtains the features of malicious packets from the training dataset.Thus, we perform a validation test in which the Mirai-RGU dataset is used for training and packets from Mirai-CCU dataset are used for validation.
Our experimental results show that, even though the training and validation stages use different datasets, the proposed framework is still able to detect malicious packets with very high accuracy.The performance results are shown in Table 11.

Time Efficiency
Time efficiency is also an important metric measurement in our experiment.The time efficiency involves two cases: training time and detection time.The training time is defined by the total time to complete the training on the selected dataset.This time depends on the data-preprocessing, the training model (number of layers and dims) and server capacity.Therefore, it would not be fair to draw comparisons within this respect, due to differences in the hardware and software used.Through our evaluation, due to considering at the deeper level (the packet level), the amount of data (after transferring from packet information) used for training in our method is expected to be higher than the flow-based approaches.As a result, the time required for training will be higher.The time for offline training, i.e., ours, could be up to 17 h with USTC-TFC2016 at 200 epochs, in order to obtain the detection performance results as mentioned above.The bigger dataset it is, the more training time it costs.However, our method has advantages of detection time.The detection time in this study is defined as the total time (elapsed time) to perform the classification on a given testing sample, i.e., the testing dataset.Given a testing file 108 MB and our server capacity (mentioned above), our system only takes fewer than 2 s to complete the classification, no matter what type of traffic is.This achievement emboldens our approach for online monitoring since we believe that this speed can satisfy most on-demand applications and in medium networks.For the core networks, we may need more efforts to integrate this system for potential deployment, since the network speed at such nodes can be up to hundreds of gigabits per second.Note that the system at the core networks can be equipped with the much more powerful servers.

Discussion
There is a trade-off between the word size and the classification performance, i.e., accuracy and detection time.Reducing the word size can potentially increase the detection time, and training time as well, due to more words involved, longer sentence created and dimension extended.For accuracy, the performance can vary and depend on the considered attack type.Apparently, we can't scale the size too small.In our evaluation, we set the word size to two bytes as aforementioned.However, there is promising research to propose an algorithm to auto-calculate this value properly on the deployed environment, even that for specific attacks.
In addition, the approach to consider the classification at the packet level can open the door to accelerating the detection since we can schedule the packets for parallel processing.However, the explicit shortcoming is to increase the training time and resource usage (e.g., memory) due to a large number of parameters and data size putting to the training model.Balancing the factor of acceptable training time but still gaining the best classification performance is a non-trivial task and a potential research direction.
Finally, the DL-based classification approach is highly susceptible to the data poisoning attack due to its dependence on the training data.Thus far, we have found few attack models targeting the evasion of the deep learning-based malicious classification systems, including ours.However, this can be soon changed when the popularity of deep learning will attract more attackers to exploit its vulnerabilities for hacking or monetization.Generating/preventing adversary models against deep learning is thus one of the most interesting and promising security research topics.

Conclusions
In this paper, we have proposed a novel packet-based malicious traffic classification framework by using the word embedding and Long Short-Term Memory (LSTM) network model.Our evaluation results show that the performance of ours is competitive with the prior work on classifying flows into benign or malicious, but with much lower flow pre-processing time.In other words, the major advantage of the proposed framework is that it does not require pre-process packets into flows, thus boosting the detection acceleration.We believe that our first attempt on using a packet-based malicious classification approach can inspire the research community to consider the optimization methods and exploit the advantages of deep learning to build effective IDSs without suffering significant detection delay.

Figure 1 .
Figure 1.The testbed of our Mirai-based DDoS dataset in the campus of National Chung Cheng University (CCU).

Figure 2 .
Figure 2. The illustration of the packet-word-transfer mechanism.

Figure 3 .
Figure 3.The illustration of the packet-word-transfer and classification module in the proposal.Packet-work-transfer module is at the input data layer.

Figure 4 .Figure 5 .
Figure 4.The illustration of the full network model.

Algorithm 1 :
Algorithm for packet-based traffic classification Data: Sequence of raw packets from network Result: Accuracy, precision, recall, f1-score, FAR, loss 1: dictionary = array() % Create the mapping word to the index array 2: translatedWords = null; %variable to store the values of words to converted to integer format; 3: while true do

Table 1 .
Summary of several state-of-the-art relevant research works and our research location.(Notations: GRU: Gated Recurrent Unit; TSDNN: Shaped Deep Neural Network; QDBP: Quantity Dependent Backpropagation; WEDL: Word Embedding-based Deep Learning; NIDS: Network Intrusion Detection System).
Table 3 lists the header fields and their length of TCP and UDP packets.

Table 6 .
Train/Test/Validate traffic proportion in the Mirai-RGU dataset.

Table 8 .
Train/Test/Validate traffic proportion in our dataset.

Table 9 .
Performance evaluation results of the proposed solution on the four datasets with 10-fold testing.

Table 10 .
Performance evaluation results of the proposed solution on the four datasets with 10-fold validation.

Table 11 .
Performance evaluation results of the Mirai-CCU dataset validation on the Mirai-RGU training model.