Novel Feature Extraction Method for Detecting Malicious MQTT Traffic Using Seq2Seq

Choi, Sunoh; Cho, Jaehyuk

doi:10.3390/app122312306

by Sunoh Choi and Jaehyuk Cho *

Department of Software Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea

* Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(23), 12306; https://doi.org/10.3390/app122312306

Submission received: 14 November 2022 / Revised: 28 November 2022 / Accepted: 30 November 2022 / Published: 1 December 2022

(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Owing to their wide application, Internet of Things (IoT) systems have become targets of malicious attacks, including DoS, flood, SlowITe, malformed, and brute-force attacks. A dataset that includes these attacks was recently released. However, the attack detection accuracy reported in previous studies has not been satisfactory because those studies used too many features that are not important for detecting malicious message queue telemetry transport (MQTT) traffic. Therefore, this study analyzes these attacks and proposes a novel feature extraction method that uses the source port index, TCP length, MQTT message type, keep alive, and connection acknowledgment. The attacks were classified using the Seq2Seq model. In the experiments, the accuracy of the proposed method was 99.97%, which is 7.33 percentage points higher than that of previously reported methods.

Keywords:

MQTT; feature extraction; Seq2Seq

1. Introduction

The widespread use of the Internet has led to the widespread adoption of the Internet of Things (IoT). For example, Korea has implemented advanced metering infrastructure (AMI) for electricity, gas, and water [1,2,3]. IoT is used in many fields and offers convenience to people. In addition, IoT systems transmit sensitive and important data and require ongoing maintenance. Therefore, IoT security must be ensured.

However, IoT faces numerous security risks, and the number of IoT attacks has increased rapidly [4]. DoS, flood, SlowITe [5], and malformed attacks are performed to disrupt IoT systems, and brute-force attacks are performed to obtain authentication passwords. A dataset of malicious message queue telemetry transport (MQTT) traffic containing these attacks, MQTTset, has been released [6]. In earlier studies, these attacks were detected using machine learning and deep learning.

However, these methods did not achieve satisfactory accuracy because they relied on basic machine learning methods and used too many features that are not important for detecting malicious MQTT traffic. Furthermore, the authors of those studies did not explain why these features should be used to detect attacks.

To solve this problem, we propose a novel feature extraction method and deep learning model for detecting malicious MQTT traffic. The overall structure of the proposed system is shown in Figure 1. The system is designed to identify DoS, flood, SlowITe, malformed, and brute-force attacks. First, we obtain packet information using Tshark [7] and extract the source port index, TCP length, MQTT message type, keep alive, and connection acknowledgment (ACK) from it. Second, we classify malicious MQTT traffic using the Seq2Seq model [8]. The accuracy of the proposed method was 99.97%, which is 7.33 percentage points higher than that obtained using the machine learning methods (92.64%) [6].

The contributions of this study are as follows. First, we analyze DoS, flood, SlowITe, malformed, and brute-force attacks. Second, a novel feature extraction method for detecting attacks from malicious MQTT traffic is proposed. Third, we show that the accuracy of our Seq2Seq model is 99.97%.

The remainder of this paper is organized as follows. Section 2 presents related studies. Section 3 introduces the MQTT system, packet types, and packet structures. Section 4 analyzes MQTT attacks. Section 5 proposes a novel feature extraction method. Section 6 presents the Seq2Seq model for attack classification. Section 7 presents experimental results. Section 8 presents a discussion and conclusions.

2. Related Studies

Cyber attacks are caused by malicious files or network traffic. Antivirus software [9] is used to prevent malicious files, whereas intrusion detection systems [10] are used to prevent malicious network traffic. In addition, deep learning technologies have been used to prevent cyber attacks due to malicious files and network traffic.

Several methods have used deep learning to detect malicious files. The first method converts files into images and classifies them using convolutional neural networks (CNNs) [11]. The second method extracts opcodes from files and classifies them using recurrent neural networks [12]. The third method detects malicious document files using deep learning [13]. The fourth method detects malicious scripts, such as PowerShell scripts, using deep learning [14].

In addition, several methods have used deep learning to detect malicious network traffic. The first method detects malicious network traffic using convolutional neural networks, autoencoders, and recurrent neural networks, and was evaluated on the NSL-KDD dataset [15]. The second method detects malicious network traffic using a directed acyclic graph (DAG) and belief rule base (BRB) [16]. The third method detects malicious network traffic using a hierarchical temporal- and spatial-feature-based detection system [17]. In this method, spatial and temporal features are learned using a convolutional neural network and long short-term memory (LSTM), respectively.

Because IoT is widely used, several methods for detecting malicious IoT traffic have been proposed. A SlowITe attack was proposed to disrupt MQTT systems [5]. Additionally, MQTT traffic datasets, such as those for DoS, flood, SlowITe, malformed, and brute-force attacks, have been released [6]. To identify attacks, the following 33 feature types were used:

{tcp.flags, tcp.time_delta, tcp.len, mqtt.conack.flags, mqtt.conack.flags.reserved, mqtt.conack.flags.sp, mqtt.conack.val, mqtt.conflag.cleansess, mqtt.conflag.passwd, mqtt.conflag.qos, mqtt.conflag.reserved, mqtt.conflag.retain, mqtt.conflag.uname, mqtt.conflag.willflag, mqtt.conflags, mqtt.dupflag, mqtt.hdrflags, mqtt.kalive, mqtt.len, mqtt.msg, mqtt.msgid, mqtt.msgtype, mqtt.proto_len, mqtt.protoname, mqtt.qos, mqtt.retain, mqtt.sub.qos, mqtt.suback.qos, mqtt.ver, mqtt.willmsg, mqtt.willmsg_len, mqtt.willtopic, mqtt.willtopic_len}

However, the accuracy of this method was unsatisfactory. This study proposes a novel feature extraction method that uses only five features together with the Seq2Seq model. The accuracy of our method was 7.33 percentage points higher than that of the previous method.

It is worth noting that Nagarajan et al. proposed a framework to detect anomalies in cyber physical systems [18]. They proposed a CNN with a Kalman filter-based Gaussian mixture model to detect inference, DoS, fuzzy, replay, and false data injection attacks. The attacks detected in that study are different from those investigated in our study. Furthermore, Gopal et al. proposed a method for deploying evidence-based detection by using a trust authority to detect selfish nodes that do not participate in forwarding messages [19]. In addition, Nagarajan et al. proposed cryptographic algorithms to enhance data integrity in IoT systems [20]. However, their research goals were different from ours.

3. Message Queue Telemetry Transport (MQTT)

The MQTT protocol [21] was developed for IoT environments. As shown in Figure 2, an MQTT system comprises publishers (such as sensors), a broker, and subscribers. The publishers include temperature, light intensity, humidity, gas, and motion sensors. The sensors publish their measured data to the broker, and the broker forwards the data to the subscribers. When subscribers subscribe to a broker, they tell the broker which topics they are interested in. The broker then forwards data to the subscribers according to topic.
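The topic-based forwarding described above can be illustrated with a minimal in-memory sketch. It is not an MQTT implementation: the class name `ToyBroker`, its methods, and the topic strings are hypothetical, and a real broker also handles topic wildcards, QoS levels, and network transport.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple

Callback = Callable[[str, str], None]


class ToyBroker:
    """Hypothetical in-memory stand-in for a broker (exact-match topics only)."""

    def __init__(self) -> None:
        # Maps each topic to the callbacks of its subscribers.
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callback) -> None:
        """Register a subscriber's interest in a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, payload: str) -> int:
        """Forward the payload to every subscriber of the topic.

        Returns the number of subscribers that received the payload.
        """
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, payload)
        return len(callbacks)


received: List[Tuple[str, str]] = []
broker = ToyBroker()
broker.subscribe("home/temperature", lambda t, p: received.append((t, p)))

delivered_temperature = broker.publish("home/temperature", "21.5")
delivered_humidity = broker.publish("home/humidity", "40")  # nobody subscribed
```

In this sketch only the temperature reading reaches the subscriber, because the subscriber registered interest in that topic alone.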

MQTT defines 16 packet type values, as shown in Table 1.

This section introduces four important MQTT packet structures. First, a connect packet is provided. The structure of the connect packet is shown in Figure 3, and it includes a message type, keep alive, client ID, username, and password.

Second, a connect ACK packet is provided. The structure of the connect ACK packet is shown in Figure 4; it has a message type and return code. The return code determines whether a connection request has been accepted.

Third, a subscribe request packet is provided. The structure of the subscribe request packet is shown in Figure 5; it includes a message type and topic.

Fourth, a publish packet is provided. The structure of a publish packet is shown in Figure 6, and it includes a message, message type, and topic.
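As a small illustration of how the message type carried in these packets can be read, the sketch below decodes the first byte of an MQTT fixed header, whose upper four bits hold the packet type value listed in Table 1. The function and dictionary names are hypothetical. The spelling CONNACK and the meanings of the connect ACK return codes follow the MQTT 3.1.1 specification rather than this article.

```python
# Packet type values as listed in Table 1 (0 and 15 are reserved).
MQTT_PACKET_TYPES = {
    1: "CONNECT", 2: "CONNACK", 3: "PUBLISH", 4: "PUBACK",
    5: "PUBREC", 6: "PUBREL", 7: "PUBCOMP", 8: "SUBSCRIBE",
    9: "SUBACK", 10: "UNSUBSCRIBE", 11: "UNSUBACK",
    12: "PINGREQ", 13: "PINGRESP", 14: "DISCONNECT",
}

# Connect ACK return codes in MQTT 3.1.1; only 0 means "accepted".
CONNACK_RETURN_CODES = {
    0: "accepted",
    1: "refused: unacceptable protocol version",
    2: "refused: identifier rejected",
    3: "refused: server unavailable",
    4: "refused: bad user name or password",
    5: "refused: not authorized",
}


def decode_message_type(first_byte: int) -> str:
    """Return the packet type name encoded in the upper 4 bits of the byte."""
    type_value = (first_byte >> 4) & 0x0F
    return MQTT_PACKET_TYPES.get(type_value, "RESERVED")


def is_connection_accepted(return_code: int) -> bool:
    """A connect ACK accepts the connection only when the return code is 0."""
    return return_code == 0
```

For example, a first byte of `0x10` decodes to CONNECT and `0x30` to PUBLISH, and a connect ACK with return code 5 indicates a refused connection.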

4. MQTT Attack Analysis

Various attacks, such as DoS, flood, SlowITe, malformed, and brute-force attacks, can target the MQTT protocol [6]. This section analyzes these attacks to extract features that can be used to identify malicious MQTT traffic.

4.1. DoS Attack

As shown in Figure 7, a client has many connections with a broker in a DoS attack. The broker performs poorly because the client has many connections and sends many packets through these connections. The DoS attack traffic was generated using the MQTT-malaria tool [22].

4.2. Flood Attack

The flood and DoS attacks are similar. As shown in Figure 8, although the client has many connections with the broker during a DoS attack, it has only one connection with the broker during a flood attack and sends many packets through the connection. The client sends many large packets to the broker, which negatively affects the broker’s performance. Flood attack traffic was generated using the IoT-Flock tool [23].

4.3. SlowITe Attack

SlowITe attacks [5] are a type of slow DoS attack. As shown in Figure 9, when a client sends a connect packet to a broker, the client sets the keep alive time, which indicates how long the broker should wait for a packet from the client. Typically, the client sends a publish packet when it receives an ACK packet from a broker. However, the client does not send a publish packet to the broker during a SlowITe attack. Therefore, the broker waits for a publish packet for the duration of the keep alive time. Thus, the broker performs poorly because it holds many such connections with the client during a SlowITe attack.

4.4. Malformed Attack

During a malformed attack, the client sends malformed packets to the broker to cause exceptions. As shown in Figure 10, the client sends a subscribe request to the broker and subsequently receives publish packets from the broker. However, the client then sends a publish packet to the broker, which causes an exception. Malformed attack traffic was generated using the MQTTSA tool [24].

4.5. Brute-Force Attack

In a brute-force attack, the client sends many connect packets to the broker to obtain authentication passwords. As shown in Figure 11, a client sends a connect packet that includes a username and a password to the broker. If the username and password are incorrect, the broker sends a connection-refused packet. Otherwise, it sends a connection-accepted packet. Malicious brute-force attack traffic was generated using the MQTTSA tool [24].

5. Feature Extraction

This section proposes a novel feature extraction method for classifying malicious MQTT traffic. The features are the source port index, TCP length, MQTT message type, keep alive, and connection ACK.

5.1. Source Port Index

To detect malicious MQTT traffic, we first extracted the source port index. As discussed in Section 4.1, in a DoS attack, a client opens many connections with a broker through various source ports. The original source port number is not used; instead, each source port is assigned an index number per client. For example, if a client has two source ports, a source port index of 1001 is assigned to the first source port, and a source port index of 1002 is assigned to the second source port. We do not use the destination port because it is always the broker port, 1883.
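A minimal sketch of this indexing step is given below. It assumes that a client is identified by its IP address and that indices are assigned in order of first appearance; neither detail is stated explicitly in the text, and the function name, constant name, and sample addresses are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

BASE_INDEX = 1001  # first index value used in the examples of this article


def assign_source_port_indices(packets: Iterable[Tuple[str, int]]) -> List[int]:
    """Replace each (client_ip, source_port) pair with a per-client index.

    The first distinct port seen for a client gets 1001, the second 1002,
    and so on (assumption: order of first appearance).
    """
    ports_per_client: Dict[str, Dict[int, int]] = {}
    indices: List[int] = []
    for client_ip, source_port in packets:
        known_ports = ports_per_client.setdefault(client_ip, {})
        if source_port not in known_ports:
            known_ports[source_port] = BASE_INDEX + len(known_ports)
        indices.append(known_ports[source_port])
    return indices


# Hypothetical packets: (client IP, original TCP source port).
example_packets = [
    ("10.0.0.5", 50412),
    ("10.0.0.5", 50413),
    ("10.0.0.5", 50412),
    ("10.0.0.9", 40000),
]
example_indices = assign_source_port_indices(example_packets)
print(example_indices)  # [1001, 1002, 1001, 1001]
```

Because the index restarts at 1001 for every client, a client that opens many connections produces a long run of distinct indices, whereas a client with a single connection always yields 1001.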

5.2. TCP Length

Next, we extracted the TCP length. In a flood attack, a client sends large packets to a broker, as described in Section 4.2. When the TCP length exceeded 1000 bytes, we set the value to 1000. Otherwise, we set the value to −1.

5.3. MQTT Message Type

Third, we extracted the MQTT message type. As discussed in Section 4.2, a client sends many publish packets to the broker during a flood attack. As shown in Table 1, the message-type value of a publish packet is three.

Additionally, as shown in Section 4.4, in a malformed attack, a client sends a subscribe request, followed by a publish packet, to the broker. In this way, the client attempts to cause an exception in the broker. We extracted the MQTT message types to detect flood and malformed attacks.

5.4. Keep Alive

Fourth, we extracted keep alive. As shown in Section 4.3, a client sends connection packets whose keep alive values are 65,535 in the SlowITe attack. Therefore, the broker should wait for 65,535 s. To detect SlowITe attacks, we extracted keep alive.

5.5. Connection ACK

Fifth, we extracted the connection ACK. As shown in Section 4.5, in a brute-force attack, a client sends connection packets including usernames and passwords. When the username and password are incorrect, the broker sends a connection refusal packet. Therefore, to detect a brute-force attack, we extracted the connection ACK.

The relationship between MQTT attacks and their features is presented in Table 2.
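The five features can be combined into one tuple per packet, as sketched below. The helper names are hypothetical. The thresholds are exposed as parameters because they are assumptions: the defaults (1000 for the TCP length and 10,000 for keep alive) are taken from the worked examples in Section 5.6, and −1 marks a feature that is absent or below its threshold.

```python
from typing import Optional, Tuple

MISSING = -1  # marker used in the article's examples


def clip_feature(value: Optional[int], threshold: int) -> int:
    """Return the threshold if the value exceeds it, otherwise -1."""
    if value is not None and value > threshold:
        return threshold
    return MISSING


def encode_packet(
    source_port_index: int,
    tcp_length: Optional[int],
    message_type: Optional[int],
    keep_alive: Optional[int],
    connection_ack: Optional[int],
    tcp_threshold: int = 1000,          # assumption, see lead-in
    keep_alive_threshold: int = 10000,  # assumption, see lead-in
) -> Tuple[int, int, int, int, int]:
    """Build (port index, TCP length, message type, keep alive, connection ACK)."""
    return (
        source_port_index,
        clip_feature(tcp_length, tcp_threshold),
        message_type if message_type is not None else MISSING,
        clip_feature(keep_alive, keep_alive_threshold),
        connection_ack if connection_ack is not None else MISSING,
    )


# Hypothetical raw values for four kinds of packets.
normal_connect = encode_packet(1001, 120, 1, 60, None)
accepted_ack = encode_packet(1001, 4, 2, None, 0)
flood_publish = encode_packet(1001, 1460, 3, None, None)
slowite_connect = encode_packet(1001, 80, 1, 65535, None)
```

With these inputs, the flood publish packet is the only one with a TCP length of 1000, and the SlowITe connect packet is the only one with a keep alive of 10000.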

5.6. Preprocessing

To detect malicious MQTT traffic, we used k consecutive packets p_{i, j} per client c_{i}. Each client sends several packets to its broker.

c_{i} = \{p_{i, 1}, p_{i, 2}, \dots, p_{i, n}\}

We then extracted the source port index, TCP length, message type, keep alive, and connection ACK from each packet p_{i, j} as follows:

p_{i, j} = \{source port index, tcp length, message type, keep alive, connection ack\}

Finally, the sequence data which we used are as follows:

s_{i, j} = \{p_{i, j}, p_{i, j + 1}, \dots, p_{i, j + k - 1}\}

For example, first, the sequence data of normal traffic were as follows:

{(1001, -1, 1, -1, -1), (1001, -1, 2, -1, 0), (1001, -1, 3, -1, -1), (1001, -1, 12, -1, -1), (1001, -1, 13, -1, -1), (1001, -1, 3, -1, -1),...}

Because the client used its first source port in the first packet, the source port index was 1001. Second, because the TCP packet size was less than 1000 bytes, we set the value to −1. Third, because it was a connection request packet, the message-type value was one. Fourth, because the keep alive value was less than 1000, we set the value to −1. Fifth, because it was not a connection ACK packet, the connection ACK value was −1.

In the second packet, the source port index was 1001. The TCP packet length was set to −1. Because the message type was a connection ACK, the message-type value was set to two. The keep alive value was set to −1. As it was a connection-accepted packet, the connection ACK value was set to zero.

Because the third packet was a publish packet, the message-type value was set to three. The message-type value was set to 12 because the fourth packet was a ping request packet. In addition, because the fifth packet was a ping-ACK packet, the message-type value was set to 13.

Second, the sequence data of the DoS attack were as follows:

{(1001, -1, 1, -1, -1), (1002, -1, 1, -1, -1), (1003, -1, 1, -1, -1),..., (1001, -1, 2, -1, 0), (1002, -1, 2, -1, 0),...}

Because the client used the first source port in the first packet, the source port index was 1001. Because the TCP length was less than 1000, it was set to −1, and because it was a connect packet, the message type was one. In the second packet, the source port index was set to 1002 because the client used the second source port. In the third packet, the source port index was set to 1003 because the client used the third source port. In the first connection ACK packet shown, the message type was two and the connection ACK was zero, indicating that the connection request was accepted.

Third, the sequence data of the flood attack were as follows:

{(1001, -1, 1, -1, -1), (1001, -1, 2, -1, 0), (1001, 1000, 3, -1, -1), (1001, 1000, 3, -1, -1), (1001, 1000, 3, -1, -1),...}

In the third packet, because the client sent a publish packet whose message type was 3 and whose size was larger than 1000 bytes, the TCP length value was set to 1000. Subsequently, the client sent many publish packets whose TCP length was greater than 1000 bytes.

Fourth, the sequence data of the SlowITe attack were as follows:

{(1001, -1, 1, 10000, -1), (1002, -1, 1, 10000, -1), (1001, -1, 2, -1, 0), (1002, -1, 2, -1, 0), (1003, -1, 1, 10000, -1), (1003, -1, 2, -1, 0),...}

In the first packet, the client sent a connection packet with a source port index of 1001. Because the keep alive value was greater than 10,000, we set the keep alive value to 10,000. In the second packet, the client sent a connection packet with a source port index of 1002, and the keep alive value was set to 10,000.

Fifth, the sequence data of a malformed attack were as follows:

{(1001, -1, 1, -1, -1), (1001, -1, 2, -1, 0), (1001, -1, 8, -1, -1), (1001, -1, 9, -1, -1),..., (1001, -1, 3, -1, -1), (1001, -1, 3, -1, -1),...}

In the third packet, the client sent a subscribe request packet with message type eight. In the fifth packet shown, the client sent a publish packet with message type three. This is a malformed attack that causes an exception in the broker.

Sixth, the sequence data of the brute-force attack were as follows:

{(1001, -1, 1, -1, -1), (1001, -1, 2, -1, 5), (1002, -1, 1, -1, -1), (1002, -1, 2, -1, 5), (1003, -1, 1, -1, -1), (1003, -1, 2, -1, 5),...}

In the first packet, the client sent a connect packet. In the second packet, the broker sent a connection-refused packet with message type two and a connection ACK value of five. This process is repeated in a brute-force attack.
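The construction of the sequence data s_{i, j} from the packets of one client can be sketched as a sliding window. The stride of one packet between consecutive windows is an assumption, since the text only states that k consecutive packets are used; the function name is hypothetical.

```python
from typing import List, Sequence, Tuple

Packet = Tuple[int, int, int, int, int]


def sliding_windows(client_packets: Sequence[Packet], k: int) -> List[List[Packet]]:
    """Return s_{i,j} = {p_{i,j}, ..., p_{i,j+k-1}} for every valid start j.

    Assumption: windows advance by one packet. A client with fewer than
    k packets yields no window.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    last_start = len(client_packets) - k
    return [list(client_packets[j:j + k]) for j in range(last_start + 1)]


# The normal-traffic example from this section, with k = 3 for brevity.
normal_packets: List[Packet] = [
    (1001, -1, 1, -1, -1),
    (1001, -1, 2, -1, 0),
    (1001, -1, 3, -1, -1),
    (1001, -1, 12, -1, -1),
    (1001, -1, 13, -1, -1),
    (1001, -1, 3, -1, -1),
]
windows = sliding_windows(normal_packets, k=3)
print(len(windows))  # 4
```

Each window would then be labeled with the traffic type of the capture file it came from before being passed to the classifier.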

6. Seq2Seq Model for Detection of Malicious MQTT Traffic

To detect malicious MQTT traffic, we used the Seq2Seq model [8], as shown in Figure 12. The input data length was set to 50, the embedding size was set to 128, and the number of LSTM nodes was set to 128. The dropout rate was 0.2, and the Softmax function was used. The batch size was 32, and the maximum number of epochs was 10. Using the Seq2Seq model, we determined whether the input data were normal or indicative of DoS, flood, SlowITe, malformed, or brute-force attacks.

7. Experimental Results

7.1. Setup

For the experiments, we used a computer with a 3.7 GHz i7 CPU, 16 GB of memory, an NVIDIA 1080 GPU, and the Windows 10 Pro operating system. The structure of the proposed system is shown in Figure 1. First, we extracted the packet information from the MQTT pcap files using Tshark. Second, we preprocessed the packet information to extract sequence data. Third, we classified MQTT attacks using the Seq2Seq model.

We used six pcap files, as listed in Table 3 [6]. These include normal MQTT traffic, DoS, flood, SlowITe, malformed, and brute-force attacks. The file names and numbers of packets in each file are listed in Table 3.

First, we extracted packet information using Tshark as follows:

tshark -r capture_1w_mqtt.pcap -T fields -E separator=, -e frame.time_relative -e ip.src -e tcp.srcport -e ip.dst -e tcp.dstport -e mqtt.msgtype -e mqtt.kalive -e mqtt.conack.val > capture_1w_mqtt.csv
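The comma-separated output of this command can be loaded with a few lines of Python, as sketched below. The column names are hypothetical labels that follow the order of the -e options, the sample rows are invented for illustration, and numeric fields are assumed to be printed in decimal. Lines whose field count differs from the expected eight (which may happen when a single field holds several comma-joined values) are simply skipped here.

```python
import csv
import io
from typing import Dict, List, Optional

# Hypothetical column labels, in the order of the -e options above.
COLUMNS = [
    "time_relative", "ip_src", "tcp_srcport", "ip_dst", "tcp_dstport",
    "mqtt_msgtype", "mqtt_kalive", "mqtt_conack_val",
]
INT_COLUMNS = ["tcp_srcport", "tcp_dstport", "mqtt_msgtype",
               "mqtt_kalive", "mqtt_conack_val"]


def to_int(field: str) -> Optional[int]:
    """Convert a field to int; an empty field (feature absent) becomes None."""
    field = field.strip()
    return int(field) if field else None


def parse_tshark_csv(text: str) -> List[Dict[str, object]]:
    """Parse the CSV text into one dictionary per packet."""
    rows: List[Dict[str, object]] = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) != len(COLUMNS):
            continue  # skip blank or unexpected lines in this sketch
        row: Dict[str, object] = dict(zip(COLUMNS, record))
        row["time_relative"] = float(record[0])
        for name in INT_COLUMNS:
            row[name] = to_int(record[COLUMNS.index(name)])
        rows.append(row)
    return rows


# Invented sample: a connect packet, its connect ACK, and a non-MQTT TCP packet.
sample_csv = (
    "0.000000,10.0.0.5,50412,10.0.0.1,1883,1,60,\n"
    "0.001200,10.0.0.1,1883,10.0.0.5,50412,2,,0\n"
    "0.002000,10.0.0.5,50412,10.0.0.1,1883,,,\n"
)
parsed_rows = parse_tshark_csv(sample_csv)
```

In practice the text would be read from the generated .csv file, and rows whose `mqtt_msgtype` is `None` could be filtered out or kept, depending on whether non-MQTT packets are meant to be part of the sequences.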

Second, we extracted the sequence data from the packet information by considering k consecutive packets. We used the source port index, TCP length, MQTT message type, keep alive, and connect ACK.

s_{i, j} = \{p_{i, j}, p_{i, j + 1}, \dots, p_{i, j + k - 1}\}

Third, Keras [25] was used to implement the Seq2Seq model to detect malicious MQTT traffic. We measured the accuracy of our method using five-fold cross-validation. A confusion matrix was used because there were six types of MQTT traffic.
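Five-fold cross-validation can be sketched in plain Python as follows. This is only an illustration of the splitting scheme: the function name is hypothetical, the folds are contiguous, and shuffling the samples beforehand (which is usually advisable) is left to the caller because the text does not describe how the folds were formed.

```python
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k contiguous folds.

    Returns a list of (train_indices, test_indices) pairs; each sample
    appears in exactly one test fold.
    """
    if k <= 1 or k > n_samples:
        raise ValueError("k must satisfy 1 < k <= n_samples")
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds: List[Tuple[List[int], List[int]]] = []
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(11, k=5)
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)  # [3, 2, 2, 2, 2]
```

The model would be trained on each `train` split and evaluated on the matching `test` split, and the per-fold results would then be aggregated.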

7.2. Accuracy

We measured the accuracy of our method using feature extraction and the Seq2Seq model. Additionally, we measured the accuracy of the following models: decision tree (DT), naïve Bayes (NB), neural network (NN), multilayer perceptron (MLP), random forest (RF), and gradient boosting (GB) [26]. Among these, the decision tree and random forest methods showed the highest accuracy. In our experiment, the accuracy of the decision tree and random forest models was 92.64%, whereas that of our method was 99.97%, as shown in Figure 13. Thus, using the feature extraction method and Seq2Seq model increased the accuracy by 7.33 percentage points.

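The confusion matrix for the proposed method is presented in Table 4. Almost all sequences were classified correctly; the only errors were five malformed-attack sequences, one of which was classified as normal and four as DoS.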
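To make the confusion matrix easier to read, the per-class recall (the fraction of each actual class that was predicted correctly) can be computed from the counts in Table 4. The sketch below uses hypothetical names and copies the counts from Table 4, with "-" entries taken as zero.

```python
from typing import Dict, List

LABELS = ["Normal", "BF", "DoS", "Flood", "SlowITe", "Malformed"]

# Rows are actual classes and columns are predicted classes (Table 4).
TABLE_4 = [
    [4419, 0, 0, 0, 0, 0],
    [0, 606, 0, 0, 0, 0],
    [0, 0, 18653, 0, 0, 0],
    [0, 0, 0, 62, 0, 0],
    [0, 0, 0, 0, 408, 0],
    [1, 0, 4, 0, 0, 735],
]


def per_class_recall(matrix: List[List[int]]) -> Dict[str, float]:
    """Return diagonal / row sum for each actual class."""
    recalls: Dict[str, float] = {}
    for i, (label, row) in enumerate(zip(LABELS, matrix)):
        total = sum(row)
        recalls[label] = row[i] / total if total else 0.0
    return recalls


recalls = per_class_recall(TABLE_4)
for label in LABELS:
    print(f"{label}: {recalls[label]:.4f}")
```

Only the Malformed class has a recall below 1.0 in this table (735 of 740 sequences); the same function can be applied to the ablation matrices in Tables 5 to 9.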

7.3. Analysis

In this section, we analyze the effects of not using each feature. First, when the source port index was not used, the accuracy of detection was reduced by 0.2% to 99.77%, as shown in Figure 14.

As shown in Table 5, DoS and brute-force attacks were not effectively detected. Therefore, we demonstrated the significance of a source port index in detecting DoS and brute-force attacks because clients use many source ports.

Second, when the TCP length was not used, the accuracy of detection was reduced by 0.24% to 99.73%, as shown in Figure 15. In addition, flood attacks were not detected, as shown in Table 6. We demonstrated the significance of the TCP length in detecting flood attacks.

Third, when the message type was not used, the accuracy of detection was reduced by 0.04% to 99.93%, as shown in Figure 16. As shown in Table 7, 13 malformed attacks were classified as DoS attacks. Therefore, we demonstrated the significance of the message type in detecting malformed attacks.

Fourth, when keep alive was not used, the accuracy of detection was reduced by 0.01% to 99.96%, as shown in Figure 17. As shown in Table 8, DoS attacks were classified as SlowITe attacks, and malformed attacks were classified as SlowITe attacks. We demonstrated that keep alive is important for detecting SlowITe attacks because it is used in SlowITe attacks.

Fifth, when a connection ACK was not used, the accuracy remained at 99.97%, as shown in Figure 18. However, as shown in Table 9, DoS and malformed attacks were classified as brute-force attacks. Therefore, we demonstrated the significance of a connection ACK in detecting brute-force attacks.

8. Discussion

In this study, we used pcap files provided in a previous study to analyze attacks, extract features, and detect attacks using the Seq2Seq model. Future studies will involve setting up IoT systems and verifying our ability to detect attacks in real time.

In this study, we used sequence data, including k consecutive packets per client; the arrival time of each packet was not considered. However, if there is a significant time gap between two consecutive packets, it may be difficult to detect attacks. Therefore, we plan to investigate how the time gap between two packets affects attack detection.

Recently, several studies have investigated the ability of generative adversarial networks to avoid detection [27]. In the future, we intend to investigate the possibility of evading malicious MQTT traffic detection using a generative adversarial network and propose a method to prevent these adversarial attacks.

9. Conclusions

In this study, we first analyzed DoS, flood, SlowITe, malformed, and brute-force attacks in malicious MQTT traffic. Second, we proposed a novel feature extraction method using the source port index, TCP length, MQTT message type, keep alive, and connection ACK. Third, we proposed a Seq2Seq model to classify the five attacks; it achieved an accuracy of 99.97%, which is 7.33 percentage points higher than that of the previous methods. Therefore, we anticipate that our feature extraction method and the Seq2Seq model can help detect malicious MQTT traffic.

Author Contributions

Conceptualization, S.C.; Methodology, S.C.; Software, S.C.; Validation, S.C.; Writing, S.C.; Supervision, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This study was partially supported by the Institute of Information and Communications Technology Planning and Evaluation (IITP) funded by the Korean Government, Ministry of Science and ICT (MSIT) (Implementation of Verification Platform for ICT Based Environmental Monitoring Sensor), under Grant 2019-0-00135 and partially supported by the Institute of Information and Communications Technology Planning and Evaluation (IITP) funded by the Korea Government, Ministry of Science and ICT (MSIT) (Building a Digital Open Lab as open innovation platform) under Grant 2021-0-00546.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Electricity AMI. Available online: http://www.aitimes.com/news/articleView.html?idxno=141421 (accessed on 8 November 2022).
2. Gas AMI. Available online: http://www.gasnews.com/news/articleView.html?idxno=104555 (accessed on 8 November 2022).
3. Water AMI. Available online: https://www.boannews.com/media/view.asp?idx=85538 (accessed on 8 November 2022).
4. IoT Malware Statistics. Available online: https://blog.sonicwall.com/en-us/2019/10/sonicwall-encrypted-attacks-iot-malware-surge-as-global-malware-volume-dips/ (accessed on 8 November 2022).
5. Vaccari, I.; Aiello, M.; Cambiaso, E. SlowITe, a Novel Denial of Service Attack Affecting MQTT. Sensors 2020, 20, 2932.
6. Vaccari, I.; Chiola, G.; Aiello, M.; Mongelli, M.; Cambiaso, E. MQTTset, a New Dataset for Machine Learning Techniques on MQTT. Sensors 2020, 20, 6578.
7. Tshark. Available online: https://tshark.dev (accessed on 8 November 2022).
8. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 2014.
9. Antivirus Software. Available online: https://en.wikipedia.org/wiki/Antivirus_software (accessed on 8 November 2022).
10. Intrusion Detection System. Available online: https://en.wikipedia.org/wiki/Intrusion_detection_system (accessed on 9 November 2022).
11. Gibert, D. Convolutional Neural Networks for Malware Classification. Master’s Thesis, Universitat de Barcelona, Barcelona, Spain, 2016.
12. Choi, S.; Bae, J.; Lee, C.; Kim, Y.; Kim, J. Attention-Based Automated Feature Extraction for Malware Analysis. Sensors 2020, 20, 2893.
13. Šrndic, N.; Laskov, P. Detection of Malicious PDF files Based on Hierarchical Document Structure. In Proceedings of the NDSS, San Diego, CA, USA, 26 February–1 March 2017.
14. Choi, S. Malicious Powershell Detection Using Graph Convolution Network. Appl. Sci. 2021, 11, 6429.
15. Naseer, S.; Saleem, Y.; Khalid, S.; Bashir, M.K.; Han, J.; Iqbal, M.M.; Han, K. Enhanced Network Anomaly Detection Based on Deep Neural Networks. IEEE Access 2018, 6, 48231–48246.
16. Zhang, B.-C.; Hu, G.-Y.; Zhou, Z.-J.; Zhang, Y.-M.; Qiao, P.-L.; Chang, L.-L. Network Intrusion Detection Based on Directed Acyclic Graph and Belief Rule Base. ETRI J. 2017, 39, 592–604.
17. Wang, W.; Sheng, Y.; Wang, J.; Zeng, X.; Ye, X.; Huang, Y.; Zhu, M. HAST-IDS: Learning Hierarchical Spatial-Temporal Features Using Deep Neural Networks to Improve Intrusion Detection. IEEE Access 2017, 6, 1792–1806.
18. Nagarajan, S.M.; Deverajan, G.G.; Bashir, A.K.; Mahapatra, R.P.; Al-Numay, M.S. IADF-CPS: Intelligent Anomaly Detection Framework towards Cyber Physical Systems. Comput. Commun. 2022, 188, 81–89.
19. Gopal, D.G.; Saravanan, R. Selfish node detection based on evidence by trust authority and selfish replica allocation in DANET. Int. J. Inf. Commun. Technol. 2016, 9, 473–491.
20. Nagarajan, S.M.; Deverajan, G.G.; Kumaran, U.; Thirunavukkarasan, M.; Alshehri, M.D.; Alkhalaf, S. Secure Data Transmission in Internet of Medical Things Using RES-256 Algorithm. IEEE Trans. Ind. Inform. 2021, 18, 8876–8884.
21. MQTT. Available online: https://mqtt.org (accessed on 8 November 2022).
22. Koroniotis, N.; Moustafa, N.; Sitnikova, E.; Turnbull, B. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst. 2019, 100, 779–796.
23. Ghazanfar, S.; Hussain, F.; Rehman, A.U.; Fayyaz, U.U.; Shahzad, F.; Shah, G.A. IoT-Flock: An Open-source Framework for IoT Traffic Generation. In Proceedings of the International Conference on Emerging Trends in Smart Technologies, Karachi, Pakistan, 26–27 March 2020.
24. Palmieri, A.; Prem, P.; Ranise, S.; Morelli, U.; Ahmad, T. MQTTSA: A Tool for Automatically Assisting the Secure Deployments of MQTT Brokers. IEEE World Congr. Serv. 2019, 2642, 47–53.
25. Keras. Available online: https://keras.io (accessed on 8 November 2022).
26. Random Forest. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html (accessed on 8 November 2022).
27. Choi, S. Malicious PowerShell Detection Using Attention against Adversarial Attacks. Electronics 2020, 9, 1817.

Figure 1. Malicious MQTT Traffic Detection System Structure.

Figure 2. MQTT system structure.

Figure 3. Connect packet structure.

Figure 4. Connect ACK packet structure.

Figure 5. Subscribe request packet structure.

Figure 6. Publish packet structure.

Figure 7. DoS attack.

Figure 8. Flood attack.

Figure 9. SlowITe attack.

Figure 10. Malformed attack.

Figure 11. Brute-force attack.

Figure 12. Seq2Seq model to detect malicious MQTT traffic.

Figure 13. Accuracy comparison with other ML methods.

Figure 14. Accuracy when a source port index is not used.

Figure 15. Accuracy when the TCP length is not used.

Figure 16. Accuracy when the message type is not used.

Figure 17. Accuracy when keep alive is not used.

Figure 18. Accuracy when a connection ACK is not used.

Table 1. MQTT packet types.

Type	Value	Direction	Description
Reserved	0	-	-
CONNECT	1	C -> S	Connection Req
CONACK	2	S -> C	Ack for Req
PUBLISH	3	C <-> S	Publish
PUBACK	4	C <-> S	ACK (QoS 1)
PUBREC	5	C <-> S	QoS 2
PUBREL	6	C <-> S	QoS 2
PUBCOMP	7	C <-> S	QoS 2
SUBSCRIBE	8	C -> S	Subscribe
SUBACK	9	S -> C	Ack for Sub
UNSUBSCRIBE	10	C -> S	Unsubscribe
UNSUBACK	11	S -> C	Ack for Unsub
PINGREQ	12	C -> S	Ping Req
PINGRESP	13	S -> C	Ack for Ping
DISCONNECT	14	C -> S	Disconnect
Reserved	15	-	-

Table 2. Relation between MQTT attacks and features.

	Source Port Index	TCP Length	Message Type	Keep Alive	Connection Ack
DoS Attack	O	-	-	-	-
Flood Attack	-	O	O	-	-
SlowITe Attack	-	-	-	O	-
Malformed Attack	-	-	O	-	-
Brute-force Attack	O	-	-	-	O

Table 3. MQTT pcap files.

Attack	File Name	Number of Packets
Normal	capture_1w_mqtt_1800s.pcap	22,360
DoS	capture_malariaDoS_mqtt.pcap	93,317
Flood	capture_flood_mqtt.pcap	303
SlowITe	slowite_mqtt.pcap	3048
Malformed	malformed_mqtt.pcap	3656
Brute force	Brute-force_mqtt.pcap	2921

Table 4. Confusion matrix of our method.

		Predicted
		Normal	BF	DoS	Flood	SlowITe	Malformed
Actual	Normal	4419	-	-	-	-	-
	BF	-	606	-	-	-	-
	DoS	-	-	18,653	-	-	-
	Flood	-	-	-	62	-	-
	SlowITe	-	-	-	-	408	-
	Malformed	1	-	4	-	-	735

Table 5. Confusion matrix when a source port index is not used.

		Predicted
		Normal	BF	DoS	Flood	SlowITe	Malformed
Actual	Normal	4419	-	-	-	-	-
	BF	10	588	-	-	-	-
	DoS	12	3	18,637	-	-	1
	Flood	-	-	9	53	-	-
	SlowITe	-	-	-	-	408	-
	Malformed	7	-	7	-	-	726

Table 6. Confusion matrix when the TCP length is not used.

		Predicted
		Normal	BF	DoS	Flood	SlowITe	Malformed
Actual	Normal	4419	-	-	-	-	-
	BF	-	606	-	-	-	-
	DoS	-	-	18,653	-	-	-
	Flood	62	-	-	-	-	-
	SlowITe	-	-	-	-	408	-
	Malformed	2	-	2	-	1	735

Table 7. Confusion matrix when the message type is not used.

		Predicted
		Normal	BF	DoS	Flood	SlowITe	Malformed
Actual	Normal	4419	-	-	-	-	-
	BF	-	606	-	-	-	-
	DoS	-	-	18,652	-	-	1
	Flood	-	-	-	62	-	-
	SlowITe	-	-	-	-	408	-
	Malformed	1	-	13	-	2	724

Table 8. Confusion matrix when keep alive is not used.

		Predicted
		Normal	BF	DoS	Flood	SlowITe	Malformed
Actual	Normal	4419	-	-	-	-	-
	BF	-	606	-	-	-	-
	DoS	-	-	18,650	-	3	-
	Flood	-	-	-	62	-	-
	SlowITe	-	-	-	-	408	-
	Malformed	1	-	1	-	2	736

Table 9. Confusion matrix when a connection ACK is not used.

		Predicted
		Normal	BF	DoS	Flood	SlowITe	Malformed
Actual	Normal	4419	-	-	-	-	-
	BF	-	606	-	-	-	-
	DoS	-	1	18,652	-	-	-
	Flood	1	-	-	61	-	-
	SlowITe	-	-	-	-	407	1
	Malformed	1	1	-	-	-	738

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
