Novel Feature Extraction Method for Detecting Malicious MQTT Traffic Using Seq2Seq

Abstract: Owing to their wide application, Internet of Things systems have been the target of malicious attacks. These attacks included DoS, flood, SlowITe, malformed, and brute-force attacks. A dataset that includes these attacks was recently released. However, the attack detection accuracy reported in previous studies has not been satisfactory because the studies used too many features that are not important in detecting malicious message queue telemetry transport (MQTT) traffic. Therefore, this study aims to analyze these attacks. Herein, a novel feature extraction method is proposed that includes the source port index, TCP length, MQTT message type, keep alive, and connection acknowledgment. The attacks were classified using the Seq2Seq model. During the experiment, the accuracy of the proposed method was 99.97%, which is 7.33 percentage points higher than that of previously reported methods.


Introduction
The widespread use of the Internet has resulted in the widespread use of IoT. Korea has implemented advanced metering infrastructure (AMI) for electricity, gas, and water [1][2][3]. IoT is used in several fields and is convenient for people. In addition, IoT systems transmit sensitive and important data, necessitating ongoing maintenance. Therefore, IoT security must be ensured.
However, IoT faces numerous security risks. The number of IoT attacks has increased rapidly [4]. DoS, flooding, SlowITe [5], and malformed attacks have been used to disrupt IoT systems, and brute-force attacks have been used to obtain authentication passwords. A dataset of malicious message queue telemetry transport (MQTT) traffic generated by these attacks, MQTTset, has been released [6]. In earlier studies, attacks were detected using machine and deep learning.
However, these methods did not achieve satisfactory accuracy because they relied on basic machine learning models and used too many features that are not important for detecting malicious MQTT traffic. Furthermore, the authors of these studies did not explain why these features should be used to detect attacks.
To solve this problem, we propose a novel feature extraction method and deep learning model for detecting malicious MQTT traffic. The overall structure of the proposed system is shown in Figure 1. We intend to identify DoS, flood, SlowITe, malformed, and brute-force attacks. First, we obtain packet information using Tshark [7] before extracting the source port index, TCP length, MQTT message type, keep alive, and connection acknowledgment (ACK) from the packet information. Second, we intend to classify malicious MQTT traffic using the Seq2Seq model [8]. The accuracy of the proposed method was 99.97%, which is 7.33 percentage points higher than that obtained using the machine learning methods (92.64%) [6].
The contributions of this study are as follows. First, we analyze DoS, flood, SlowITe, malformed, and brute-force attacks. Second, a novel feature extraction method for detecting attacks from malicious MQTT traffic is proposed. Third, we show that the accuracy of our Seq2Seq model is 99.97%.
The remainder of this paper is organized as follows. Section 2 presents related studies. Section 3 introduces the MQTT system, packet types, and packet structures. Section 4 analyzes MQTT attacks. Section 5 proposes a novel feature extraction method. Section 6 presents the Seq2Seq model for attack classification. Section 7 presents experimental results. Section 8 presents a discussion and conclusions.

Related Studies
Cyber attacks are caused by malicious files or network traffic. Antivirus software [9] is used to prevent malicious files, whereas intrusion detection systems [10] are used to prevent malicious network traffic. In addition, deep learning technologies have been used to prevent cyber attacks due to malicious files and network traffic.
Several methods have used deep learning to detect malicious files. The first method involves detecting malicious files by converting them into images and using convolutional neural networks (CNNs) [11]. The second method involves detecting malicious files by extracting opcodes from them and using recurrent neural networks [12]. The third method involves detecting malicious document files using deep learning [13]. The fourth involves detecting malicious scripts, such as PowerShell, using deep learning [14].
In addition, several methods have used deep learning to detect malicious network traffic. One such method involves detecting malicious network traffic using convolutional neural networks, autoencoders, and recurrent neural networks. One study used the NSL-KDD dataset to simulate the detection of malicious network traffic [15]. Another method involves detecting malicious network traffic using a directed acyclic graph (DAG) and belief rule base (BRB) [16]. The third method involves detecting malicious network traffic by providing a hierarchical temporal- and spatial-feature-based detection system [17]. In this method, spatial and temporal features are learned using a convolutional neural network and long short-term memory (LSTM), respectively.
Because IoT is widely used, several methods for detecting malicious IoT traffic have been proposed. A SlowITe attack was proposed to disrupt MQTT systems [5]. Additionally, MQTT traffic datasets, such as those for DoS, flood, SlowITe, malformed, and brute-force attacks, have been released [6]. To identify attacks, 33 feature types were used; however, the accuracy of this method was unsatisfactory. This study proposes a novel feature extraction method using only five features and the Seq2Seq model. The accuracy of our method was 7.33 percentage points higher than that of previous methods.
It is worth noting that Nagarajan et al. proposed a framework to detect anomalies in cyber physical systems [18]. They proposed a CNN with a Kalman filter-based Gaussian mixture model to detect inference, DoS, fuzzy, replay, and false data injection attacks. The attacks detected in that study are different from those investigated in our study. Furthermore, Gopal et al. proposed a method for deploying evidence-based detection by using a trust authority to detect selfish nodes that do not participate in forwarding messages [19]. In addition, Nagarajan et al. proposed cryptographic algorithms to enhance data integrity in IoT systems [20]. However, their research goals were different from ours.

Message Queue Telemetry Transport (MQTT)
The MQTT protocol [21] was developed for IoT environments. As shown in Figure 2, it comprises publishers such as sensors, brokers, and subscribers. The publishers include temperature, light intensity, humidity, gas, and motion sensors. The sensors publish their measured data to the broker. The broker then forwards the data to the subscribers. When subscribers subscribe to a broker, they provide the broker with the topics in which they are interested. The broker forwards the data to the subscribers according to the topic. A total of 16 MQTT packet types are defined, as shown in Table 1.

Table 1. MQTT packet types (columns: Type, Value).
This section introduces the four important MQTT packet structures. The first is the connect packet. Its structure is shown in Figure 3; it includes a message type, keep alive, client ID, username, and password. The second is the connect ACK packet. Its structure is shown in Figure 4; it has a message type and a return code. The return code indicates whether a connection request has been accepted. The third is the subscribe request packet. Its structure is shown in Figure 5; it includes a message type and topic. The fourth is the publish packet. Its structure is shown in Figure 6; it includes a message type, topic, and message.
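The fields described above can be read directly from the raw packet bytes. The following is a minimal sketch, assuming the MQTT 3.1.1 wire format (message type in the upper four bits of the first byte, a variable-length remaining-length field, and a two-byte keep alive after the protocol name, level, and connect flags). The function names are illustrative and are not part of the proposed system, which obtains these fields using Tshark.

```python
import struct


def message_type(packet: bytes) -> int:
    """Return the MQTT message type: the upper four bits of the first byte."""
    return packet[0] >> 4


def decode_remaining_length(packet: bytes):
    """Decode the variable-length 'remaining length' field.

    Returns (remaining_length, offset of the first byte after the field).
    """
    value, multiplier, offset = 0, 1, 1
    while True:
        byte = packet[offset]
        value += (byte & 0x7F) * multiplier
        offset += 1
        if byte & 0x80 == 0:  # continuation bit clear: last length byte
            return value, offset
        multiplier *= 128


def connect_keep_alive(packet: bytes) -> int:
    """Read the keep alive value (in seconds) from a connect packet."""
    _, offset = decode_remaining_length(packet)
    (name_len,) = struct.unpack_from("!H", packet, offset)
    # Skip the protocol name (2-byte length + name), protocol level, and flags.
    offset += 2 + name_len + 1 + 1
    (keep_alive,) = struct.unpack_from("!H", packet, offset)
    return keep_alive


def connack_return_code(packet: bytes) -> int:
    """Read the return code from a connect ACK packet (0 means accepted)."""
    _, offset = decode_remaining_length(packet)
    return packet[offset + 1]  # byte 0: acknowledge flags, byte 1: return code


# Hand-built example packets (illustrative values only).
connect = (bytes([0x10, 14]) + b"\x00\x04MQTT" + bytes([4, 2])
           + struct.pack("!H", 65535) + b"\x00\x02c1")
connack = b"\x20\x02\x00\x05"
```

Here `connect` is a connect packet with a keep alive of 65,535 s and client ID `c1`, and `connack` is a connect ACK packet whose return code is five (connection refused).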

MQTT Attack Analysis
Various attacks, such as DoS, flood, SlowITe, malformed, and brute-force attacks, can attack the MQTT protocol [6]. This section analyzes these attacks to extract features that can be used to identify malicious MQTT traffic.

DoS Attack
As shown in Figure 7, a client has many connections with a broker in a DoS attack. The broker performs poorly because the client has many connections and sends many packets through these connections. The DoS attack traffic was generated using the MQTTmalaria tool [22].

Flood Attack
The flood and DoS attacks are similar. As shown in Figure 8, although the client has many connections with the broker during a DoS attack, it has only one connection with the broker during a flood attack and sends many packets through the connection. The client sends many large packets to the broker, which negatively affects the broker's performance. Flood attack traffic was generated using the IoT-Flock tool [23].

SlowITe Attack
SlowITe attacks [5] are a type of slow DoS attack. As shown in Figure 9, when a client sends a connect packet to a broker, the client sets the keep alive time, which indicates how long the broker should wait for a packet from the client. Typically, the client sends a publish packet when it receives an ACK packet from a broker. However, the client does not send a publish packet to the broker during a SlowITe attack. Therefore, the broker waits for the publish packet for the duration of the keep alive time. Thus, the broker performs poorly because it must hold many such connections with the client during a SlowITe attack.

Malformed Attack
During a malformed attack, the client sends incorrect packets to the broker, causing exceptions. As shown in Figure 10, the client sends a subscription request to the broker. Subsequently, the client receives the published packets from the broker. However, the broker receives a publish packet from the client, which causes an exception. Malformed attack traffic was generated using the MQTTSA tool [24].

Brute-Force Attack
In a brute-force attack, the client sends several connect packets to the broker to obtain authentication passwords. As shown in Figure 11, a client sends a connect packet that includes a user ID and a password to the broker. If the user ID and password are incorrect, then the broker sends a connection-refused packet. Otherwise, a connection-accepted packet is sent. Malicious brute-force attack traffic was generated using the MQTTSA tool [24].

Feature Extraction
This section proposes a novel feature extraction method for classifying malicious MQTT traffic. The features are the source port index, TCP length, MQTT message type, keep alive, and connection ACK.

Source Port Index
To detect malicious MQTT traffic, we first extracted the source port index. As discussed in Section 4.1, in a DoS attack, a client has many connections with a broker through various ports. Therefore, each client connects to a broker by using several ports. The original source port number is not used, and each source port is provided with an index number for each client. For example, if a client has two source ports, a source port index of 1001 is assigned to the first source port, and a source port index of 1002 is assigned to the second source port. We do not use the destination port because it is a broker port, which is port 1883.
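The indexing scheme described above can be sketched as follows. This is a minimal illustration that assumes clients are identified by an arbitrary key (for example, an IP address) and that indices are assigned in order of first appearance starting at 1001; the class name `SourcePortIndexer` is hypothetical.

```python
from collections import defaultdict


class SourcePortIndexer:
    """Map each client's source ports to 1001, 1002, ... in order of first use."""

    def __init__(self, first_index=1001):
        self.first_index = first_index
        self._ports = defaultdict(dict)  # client -> {source port: index}

    def index(self, client, source_port):
        ports = self._ports[client]
        if source_port not in ports:
            # The next unused index for this client.
            ports[source_port] = self.first_index + len(ports)
        return ports[source_port]


indexer = SourcePortIndexer()
first = indexer.index("10.0.0.5", 49152)    # first port of this client
second = indexer.index("10.0.0.5", 49153)   # second port of this client
repeat = indexer.index("10.0.0.5", 49152)   # a port seen before keeps its index
other = indexer.index("10.0.0.9", 50000)    # indices restart for each client
```

Because the index depends only on how many distinct ports a client has used, a client that opens many connections, as in a DoS attack, produces a steadily growing index regardless of the actual port numbers.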

TCP Length
Next, we extracted the TCP length. In a flood attack, a client sends large packets to a broker, as described in Section 4.2. When the packet size exceeded 10,000 bytes, we set the value to 10,000. Otherwise, we set the value to −1.

MQTT Message Type
Third, we extracted the MQTT message type. As discussed in Section 4.2, a client sends many publish packets to the broker during a flood attack. As shown in Table 1, the message-type value of a publish packet is three.
Additionally, as shown in Section 4.4, in a malformed attack, a client sends a subscribe request, followed by a publish packet, to the broker. Therefore, the client attempts to create an exception for the broker. We extracted the MQTT message types to detect flood and malformed attacks.

Keep Alive
Fourth, we extracted keep alive. As shown in Section 4.3, a client sends connection packets whose keep alive values are 65,535 in the SlowITe attack. Therefore, the broker should wait for 65,535 s. To detect SlowITe attacks, we extracted keep alive.

Con Ack
Fifth, we extracted the connection ACK. As shown in Section 4.5, in a brute-force attack, a client sends connection packets including usernames and passwords. When the username and password are incorrect, the broker sends a connection refusal packet. Therefore, to detect a brute-force attack, we extracted the connection ACK.
The relationship between MQTT attacks and their features is presented in Table 2.
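The five features can be combined into one tuple per packet. The sketch below is illustrative only: the −1 placeholder and the thresholds (10,000 bytes for the TCP length and 1000 for keep alive) follow the values mentioned in the text, whereas keeping a large keep alive value unchanged, and using −1 when a packet carries no keep alive or connection ACK, are assumptions. The names `Features` and `encode_packet` are hypothetical.

```python
from typing import NamedTuple, Optional

MISSING = -1  # placeholder for a feature that is absent or not of interest


class Features(NamedTuple):
    source_port_index: int
    tcp_length: int
    message_type: int
    keep_alive: int
    connection_ack: int


def encode_packet(source_port_index: int,
                  tcp_length: int,
                  message_type: int,
                  keep_alive: Optional[int] = None,
                  connection_ack: Optional[int] = None,
                  tcp_threshold: int = 10_000,
                  keep_alive_threshold: int = 1_000) -> Features:
    """Encode one packet as (port index, TCP length, type, keep alive, ACK)."""
    # Only unusually large packets are kept, clipped to the threshold.
    tcp_feature = tcp_threshold if tcp_length > tcp_threshold else MISSING
    # Assumption: large keep alive values are kept as they are.
    if keep_alive is None or keep_alive < keep_alive_threshold:
        keep_alive_feature = MISSING
    else:
        keep_alive_feature = keep_alive
    ack_feature = MISSING if connection_ack is None else connection_ack
    return Features(source_port_index, tcp_feature, message_type,
                    keep_alive_feature, ack_feature)


normal_connect = encode_packet(1001, 60, 1, keep_alive=60)
accepted_ack = encode_packet(1001, 4, 2, connection_ack=0)
slowite_connect = encode_packet(1001, 60, 1, keep_alive=65535)
flood_publish = encode_packet(1001, 14600, 3)
```

With these assumptions, an ordinary connect packet becomes (1001, −1, 1, −1, −1), whereas a SlowITe connect packet and a large flood publish packet each stand out in exactly one position of the tuple.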

Preprocessing
To detect malicious MQTT traffic, we used k consecutive packets per client. Each client sends several packets to its broker. We then extracted the source port index, TCP length, message type, keep alive, and connection ACK from each packet. In the first example sequence, the first packet was encoded as follows. First, because the client used its first source port, the source port index was 1001. Second, because the TCP packet size was less than 1000 bytes, we set the value to −1. Third, because it was a connection request packet, the message-type value was one. Fourth, because the keep alive value was less than 1000, we set the value to −1.
In the second packet, the source port index was 1001. The TCP packet length was set to −1. Because the message type was a connection ACK, the message-type value was set to two. The keep alive value was set to −1. As it was a connection-accepted packet, the connection ACK value was set to zero.
Because the third packet was a publish packet, the message-type value was set to three. The message-type value was set to 12 because the fourth packet was a ping request packet. In addition, because the fifth packet was a ping-ACK packet, the message-type value was set to 13.
Second, the sequence data of the DoS attack were as follows: { (1001, -1, 1, -1, -1), (1002, -1, 1, -1, -1), (1003, -1, 1, -1, -1),..., (1001, -1, 2, -1, 0), (1002, -1, 2, -1, 0),...}. Because the client used the first source port in the first packet, the source port index was 1001. Because the TCP length was less than 1000 bytes, it was set to −1, and because it was a connect packet, the message type was one. In the second packet, the source port index was set to 1002 because the client used the second source port. In the third packet, the source port index was set to 1003 because the client used the third source port. In the fourth packet, the connection request was accepted because the message type was two and the connection ACK was zero.
Sixth, in the sequence data of the brute-force attack, the client sent a connect packet in the first packet. In the second packet, the broker sent a connection-refused packet, that is, a connection ACK packet (message type two) whose connection ACK value was five. This exchange is repeated throughout a brute-force attack.
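Grouping packets into sequences of k consecutive packets per client can be sketched as follows. Whether the windows overlap is not stated; this sketch assumes non-overlapping windows and drops an incomplete final window. The function name `windows_per_client` is hypothetical.

```python
from collections import defaultdict


def windows_per_client(packets, k):
    """Split each client's packets into windows of k consecutive packets.

    packets: iterable of (client, features) pairs in capture order.
    Returns a list of (client, [features, ...]) with exactly k items each.
    """
    by_client = defaultdict(list)
    for client, features in packets:
        by_client[client].append(features)

    windows = []
    for client, sequence in by_client.items():
        # Assumption: non-overlapping windows; a short tail is discarded.
        for start in range(0, len(sequence) - k + 1, k):
            windows.append((client, sequence[start:start + k]))
    return windows


# Interleaved packets from two clients (features shortened to integers here).
capture = [("a", 1), ("b", 10), ("a", 2), ("a", 3),
           ("b", 20), ("a", 4), ("a", 5)]
windows = windows_per_client(capture, k=2)
```

Grouping by client before windowing keeps each sequence free of packets from other clients, so that a pattern such as a rising source port index is visible within a single window.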

Seq2Seq Model for Detection of Malicious MQTT Traffic
To detect malicious MQTT traffic, we used the Seq2Seq model [8], as shown in Figure 12. The input data length was set to 50, the embedding size was set to 128, and the number of LSTM nodes was set to 128. The dropout rate was 0.2, and the Softmax function was used. The batch size was 32, and the maximum epoch was 10. Using the Seq2Seq model, we determined whether the input data were normal or indicative of DoS, flood, SlowITe, malformed, or brute-force attacks.

Setup
For the experiments, we used a computer with a 3.7 GHz i7 CPU, 16 GB of memory, a Nvidia 1080 GPU, and a Windows 10 Pro operating system. The structure of the proposed system is shown in Figure 1. First, we extracted the packet information from the MQTT pcap files using Tshark. Second, we preprocessed the packet information to extract sequence data. Third, we classified MQTT attacks using the Seq2Seq model.
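The first step can be sketched as follows. The exact Tshark options used in this study are not given; the display filter and the field names below (`ip.src`, `tcp.srcport`, `tcp.len`, `mqtt.msgtype`, `mqtt.kalive`, and `mqtt.conack.val`) are assumptions based on Wireshark's TCP and MQTT dissectors, and the helper names are hypothetical. The sketch only builds the command line and parses one tab-separated output line, so it can be inspected without Tshark installed.

```python
# Assumed field names from Wireshark's dissectors (not listed in the paper).
FIELDS = ["ip.src", "tcp.srcport", "tcp.len",
          "mqtt.msgtype", "mqtt.kalive", "mqtt.conack.val"]


def tshark_command(pcap_path):
    """Build a Tshark command that prints one tab-separated line per packet."""
    command = ["tshark", "-r", pcap_path, "-Y", "mqtt", "-T", "fields"]
    for field in FIELDS:
        command += ["-e", field]
    return command


def to_int(text):
    """Convert a Tshark field to int; an empty field becomes None."""
    # A TCP segment may carry several MQTT messages, printed as "3,3,...".
    # Assumption: only the first message of the segment is used.
    first = text.split(",")[0]
    return int(first, 0) if first else None


def parse_line(line):
    """Parse one output line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(FIELDS, values):
        record[name] = value if name == "ip.src" else to_int(value)
    return record


command = tshark_command("capture.pcap")
record = parse_line("192.168.0.10\t51234\t33\t1\t60\t\n")
```

In practice, the command could be executed with `subprocess.run(command, capture_output=True, text=True)` and each line of its standard output passed to `parse_line`.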
In the second step, we extracted the sequence data from the packet information by considering k consecutive packets. We used the source port index, TCP length, MQTT message type, keep alive, and connection ACK. In the third step, Keras [25] was used to implement the Seq2Seq model to detect malicious MQTT traffic. We measured the accuracy of our method using five-fold cross-validation. A confusion matrix was used because there were six types of MQTT traffic.
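The evaluation procedure can be sketched in plain Python as follows. The class labels follow the column headers of the confusion matrices in this paper; the fold assignment (no shuffling or stratification) is an assumption because the splitting procedure is not described, and the function names are hypothetical.

```python
LABELS = ["Normal", "BF", "DoS", "Flood", "SlowITe", "Malformed"]


def confusion_matrix(actual, predicted, labels=LABELS):
    """Count samples per (actual class, predicted class); rows are actual."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def k_fold_indices(n_samples, n_folds=5):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    # Assumption: samples are dealt to folds in round-robin order.
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


actual = ["Normal", "DoS", "DoS", "Flood"]
predicted = ["Normal", "DoS", "BF", "Flood"]
matrix = confusion_matrix(actual, predicted)
splits = list(k_fold_indices(10, n_folds=5))
```

In this toy example, one DoS sample is predicted as a brute-force attack, so it appears off the diagonal in the DoS row and the accuracy is 0.75.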

Accuracy
We measured the accuracy of our method using feature extraction and the Seq2Seq model. Additionally, we measured the accuracy of the following models: decision tree (DT), naïve Bayes (NB), neural network (NN), multilayer perceptron (MLP), random forest (RF), and gradient boost (GB) [26]. Among these models, the decision tree and random forest methods showed the highest accuracy. In our experiment, the accuracy of the decision tree and random forest models was 92.64%, whereas that of our method was 99.97%, as shown in Figure 13. When we used the feature extraction method and Seq2Seq model, the accuracy increased by 7.33 percentage points. The confusion matrix for the proposed method is presented in Table 4, and most attacks were classified correctly.

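Table 4. Confusion matrix for the proposed method (rows: actual class; columns: predicted class, in the order Normal, BF, DoS, Flood, SlowITe, Malformed).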

Analysis
In this section, we analyze the effects of not using each feature. First, when the source port index was not used, the accuracy of detection was reduced by 0.2% to 99.77%, as shown in Figure 14. As shown in Table 5, DoS and brute-force attacks were not effectively detected. Therefore, we demonstrated the significance of a source port index in detecting DoS and brute-force attacks because clients use many source ports.
Table 5. Confusion matrix when a source port index is not used (rows: actual class; columns: predicted class, in the order Normal, BF, DoS, Flood, SlowITe, Malformed).
Second, when the TCP length was not used, the accuracy of detection was reduced by 0.24% to 99.73%, as shown in Figure 15. In addition, flood attacks were not detected, as shown in Table 6. We demonstrated the significance of the TCP length in detecting flood attacks.
Figure 15. Accuracy when the TCP length is not used.

Table 6. Confusion matrix when the TCP length is not used (rows: actual class; columns: predicted class, in the order Normal, BF, DoS, Flood, SlowITe, Malformed).
Third, when the message type was not used, the accuracy of detection was reduced by 0.04% to 99.93%, as shown in Figure 16. As shown in Table 7, 13 malformed attacks were classified as DoS attacks. Therefore, we demonstrated the significance of the message type in detecting malformed attacks.

Table 7. Confusion matrix when the message type is not used (rows: actual class; columns: predicted class, in the order Normal, BF, DoS, Flood, SlowITe, Malformed).
Fourth, when keep alive was not used, the accuracy of detection was reduced by 0.01% to 99.96%, as shown in Figure 17. As shown in Table 8, DoS attacks were classified as SlowITe attacks, and malformed attacks were classified as SlowITe attacks. We demonstrated that keep alive is important for detecting SlowITe attacks because it is used in SlowITe attacks.
Fifth, when a connection ACK was not used, the accuracy remained at 99.97%, as shown in Figure 18. However, as shown in Table 9, DoS and malformed attacks were classified as brute-force attacks. Therefore, we demonstrated the significance of a connection ACK in detecting brute-force attacks.
Figure 18. Accuracy when a connection ACK is not used.
Table 9. Confusion matrix when a connection ACK is not used (rows: actual class; columns: predicted class, in the order Normal, BF, DoS, Flood, SlowITe, Malformed).
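The accuracy reductions reported in this section can be checked with simple arithmetic. The sketch below uses only the accuracies stated in the text; the variable and function names are illustrative.

```python
FULL_ACCURACY = 99.97  # accuracy (%) with all five features

# Accuracy (%) when one feature is removed, as reported in this section.
ABLATION_ACCURACY = {
    "source port index": 99.77,
    "TCP length": 99.73,
    "MQTT message type": 99.93,
    "keep alive": 99.96,
    "connection ACK": 99.97,
}


def accuracy_drops(full, ablated):
    """Return the drop in percentage points for each removed feature."""
    return {name: round(full - value, 2) for name, value in ablated.items()}


drops = accuracy_drops(FULL_ACCURACY, ABLATION_ACCURACY)
# Features ordered from the largest to the smallest drop in accuracy.
ranked = sorted(drops, key=drops.get, reverse=True)
```

The overall accuracy changes by at most 0.24 percentage points in every case, which is why the per-class confusion matrices, rather than the overall accuracy alone, are used to show which attack each feature helps to detect.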

Discussion
In this study, we used pcap files provided in a previous study to analyze attacks, extract features, and detect attacks using the Seq2Seq model. Future studies will involve setting up IoT systems and verifying our ability to detect attacks in real time.
In this study, we used sequence data, including k consecutive packets per client; the arrival time of each packet was not considered. However, if there is a significant time gap between two consecutive packets, it may be difficult to detect attacks. Therefore, we plan to investigate how the time gap between two packets affects attack detection.
Recently, several studies have investigated the ability of generative adversarial networks to avoid detection [27]. In the future, we intend to investigate the possibility of evading malicious MQTT traffic detection using a generative adversarial network and propose a method to prevent these adversarial attacks.

Conclusions
In this study, we first analyzed DoS, flood, SlowITe, malformed, and brute-force attacks in malicious MQTT traffic. Second, we proposed a novel feature extraction method using the source port index, TCP length, MQTT message type, keep alive, and connection ACK. Third, we proposed a Seq2Seq model to classify the five attacks; the model achieved an accuracy of 99.97%, which is 7.33 percentage points higher than that of the previous methods. Therefore, we anticipate that our feature extraction method and the Seq2Seq model can prevent malicious MQTT traffic.

Conflicts of Interest:
The authors declare no conflict of interest.