Article

Enhanced MQTT Protocol for Securing Big Data/Hadoop Data Management

by
Ferdaous Kamoun-Abid
* and
Amel Meddeb-Makhlouf
New Technologies and Communication Systems Laboratory, National School of Electronics and Telecommunications, Sfax 3018, Tunisia
*
Author to whom correspondence should be addressed.
J. Sens. Actuator Netw. 2026, 15(1), 22; https://doi.org/10.3390/jsan15010022
Submission received: 29 November 2025 / Revised: 10 February 2026 / Accepted: 12 February 2026 / Published: 16 February 2026
(This article belongs to the Section Network Security and Privacy)

Abstract

Big data has significantly transformed data processing and analytics across many domains, yet ensuring security and data confidentiality in distributed platforms such as Hadoop remains a challenging task. Distributed environments face major security issues, particularly in the management and protection of large-scale data. In this article, we focus on the cost of secure information transmission, implementation complexity, and scalability. We also address the confidentiality of information stored in Hadoop by analyzing different AES encryption modes and examining their potential to enhance Hadoop security. At the application layer, we operate within our Hadoop environment using an extended, secure version of the widely used MQTT protocol for large-scale data communication. This approach implements MQTT over TLS and, before connection establishment, adds a hash-based verification of DataNode identities together with transmission of a JWT. MQTT relies on TCP at the transport layer for underlying transmission; TCP's reliability and small header size make it particularly suitable for big data environments. This work proposes a triple-layer protection framework. The first layer assesses the performance of existing AES encryption modes (CTR, CBC, and GCM) with different key sizes to optimize data confidentiality and processing efficiency in large-scale Hadoop deployments. The second layer evaluates the integrity of DataNodes through a novel verification mechanism that employs SHA-3-256 hashing to authenticate nodes and prevent unauthorized access during cluster initialization. At the third tier, the integrity of data blocks within Hadoop is likewise ensured using SHA-3-256. Through extensive performance testing and security validation, we demonstrate the effectiveness of this integration.

1. Introduction

Nowadays, data increases exponentially every minute, as enterprises and professionals use new technologies for storage. The evolution of data mining and storage technologies poses significant threats to the security of individuals’ information. Existing research presents robust and scalable models for securing big data in cloud environments, providing methods to ensure the reliability of metadata, as well as the security of data blocks and transmissions.
Over the past two decades, the volume of information processed by search engines has increased exponentially. In 2000, Google handled approximately 33.8 million search queries per day. By 2018, this figure had surged to 5.6 billion daily searches [1]. More recent estimates suggest that, as of 2024, Google processes between 8.3 and 22 billion searches per day, depending on the data source, with projections indicating this number could reach approximately 13.7 billion daily queries by 2025 [2]. This sustained and rapid growth highlights the critical importance of big data technologies in managing, processing, and analyzing massive volumes of fast-moving information streams effectively. The rapid advancement of cloud computing offers support for storing and processing large-scale data.
On the other hand, the most widely used protocol for transmitting IoT information is MQTT (Message Queuing Telemetry Transport). MQTT is an application-layer protocol in the OSI model for big data ecosystems [3]. It uses a service-oriented architecture in which service authentication is based on obtaining registered tokens (JSON Web Tokens: JWT) [4]. Furthermore, MQTT organizes a set of users around a central broker, whose role is to distribute messages among them. The protocol follows the publish/subscribe model: clients subscribe to well-defined topics, which act as channels through which subscribers send or receive information. The MQTT broker monitors all publishers and subscribers and manages all end-to-end information transmission operations between users [3].
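Topics in MQTT are hierarchical, and subscribers may match them with wildcards. As an illustration of the routing logic described above, here is a minimal, self-contained sketch of MQTT topic-filter matching, following the MQTT convention that `+` matches exactly one topic level and `#` matches all remaining levels; the function name is ours, not taken from a specific library:

```python
def topic_matches(filter_: str, topic: str) -> bool:
    """Return True if an MQTT topic filter matches a concrete topic.

    '+' matches exactly one topic level; '#' (valid only as the last
    level of the filter) matches any number of remaining levels.
    """
    f_levels = filter_.split("/")
    t_levels = topic.split("/")
    for i, f in enumerate(f_levels):
        if f == "#":
            return True          # matches the rest of the topic
        if i >= len(t_levels):
            return False         # filter is longer than the topic
        if f != "+" and f != t_levels[i]:
            return False         # literal level mismatch
    return len(f_levels) == len(t_levels)

print(topic_matches("sensors/+/temperature", "sensors/dn1/temperature"))  # True
print(topic_matches("sensors/#", "sensors/dn1/humidity"))                 # True
print(topic_matches("sensors/+", "sensors/dn1/humidity"))                 # False
```

A broker applies exactly this kind of matching when deciding which subscribers receive a published message.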
Recent work has focused on integrating MQTT to improve IoT communication in big data environments. One example is the work presented by Youcheng et al. in [5], which offers bidirectional, lightweight, and reliable transmission. Furthermore, it minimizes bandwidth consumption and latency and facilitates information management. The results in this article confirm that the proposed architecture demonstrates the solution’s performance in the big data environment. In another work, researchers A. Shvaika et al. in [6] proposed the ThingsBoard Message Queue (TBMQ) approach, which presents a set of brokers cooperating in clusters and connected to a system that guarantees fault tolerance through distributed replication. Furthermore, this proposed architecture features a horizontal distribution of the MQTT broker, which can meet the requirements of big data environments handling enormous volumes of information, thereby improving system reliability.
MQTT and big data systems using Hadoop have several security vulnerabilities, posing cybersecurity challenges, such as:
  • Client interaction: Exchanges between the client and the resource manager or data nodes can be compromised, with an infected client being able to introduce malicious content or compromise connections.
  • Data fragmentation: The use of big data clusters implies redundancy of data distributed across multiple nodes. This fragmentation complicates security in the absence of an appropriate protection model.
  • Data integrity: the integrity and redundancy of the information stored in each block must be verified, which is difficult without a dedicated mechanism.
  • Data access control: Hadoop only offers access control at the data schema level, without finer granularity to precisely define user responsibilities and rights.
To address these challenges, frameworks such as MQTT and Hadoop have emerged as effective solutions for distributed transmission, storage, and parallel processing of large-scale datasets. Indeed, Hadoop, an open-source big data framework, is modeled on Google's cloud computing system. At its core, Hadoop integrates HDFS (Hadoop Distributed File System) for distributed storage and the MapReduce framework for distributed processing, providing a transparent, distributed infrastructure [7]. The principle of this framework is to combine multiple servers or ordinary computers into a big data cluster. Owing to its exceptional compatibility, high computational power, and cost-effectiveness [8,9], Hadoop has gained widespread adoption across industries including finance, healthcare, and e-commerce. Unfortunately, authentication and encryption are generally not enabled by default in HDFS, which poses serious security risks.
There are several methods to secure information stored in HDFS, including ECC (Elliptic Curve Cryptography) and other encryption algorithms; ECC-based schemes have been reported to improve storage and encryption efficiency by 27.6% over traditional methods [10]. Despite the advancements offered by existing solutions, several limitations persist. Notably, Hadoop's centralized NameNode architecture introduces a single point of failure for metadata management, increasing the risk of system downtime. Additionally, many encryption techniques are either computationally intensive or inadequately integrated into the Hadoop storage layer. Furthermore, only a limited number of approaches support secure computation directly over encrypted data. To address these challenges, S. Guan et al. [10] proposed a secure and scalable Hadoop-based storage architecture that employs a dual-channel metadata management mechanism. This approach leverages HDFS federation alongside Zookeeper to enhance high availability. The system also incorporates a hybrid encryption strategy: lightweight Elliptic Curve Cryptography (ECC) for general data protection and Paillier homomorphic encryption for sensitive data, enabling secure computations without requiring decryption. This model effectively balances data security, computational efficiency, and the real-time processing capabilities of Hadoop's distributed framework.
For MQTT-based transmission in this architecture, the goal is to establish a connection between the client and the broker, following the steps below. First, the publisher collects information from various sources, such as wearable devices and sensors deployed on machines, and publishes it on a well-defined topic. The second entity is the subscriber, who subscribes to the specific topic on which the publisher publishes information. The third entity is the broker (an intermediate server), which gathers the information from publishers, transmits it to subscribers, and manages multiple topics simultaneously [11].
In this article, we present a three-layered security enhancement framework for Hadoop-based big data environments. The three layers are as follows.
(1) The first layer focuses on encrypting information stored in Hadoop to ensure data confidentiality. This layer involves a comparative performance analysis of several AES encryption modes (CTR, CBC, and GCM) to select the most suitable solution. (2) The second layer introduces a DataNode authentication mechanism based on SHA-3-256 hashing, enabling the recognition and blocking of unauthorized access when a user tries to log into the cluster. Finally, (3) the third layer proposes a secure extension of the MQTT protocol that combines TLS, JWT, and hash verification functions to strengthen the security, authenticity, and double integrity of the data transmission layer (the integrity MQTT provides by default, plus the integrity of the information and of DataNode access).
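The JWT step of layer (3) can be illustrated with a minimal HS256 token sketch built from the Python standard library. This is an illustration under stated assumptions, not the authors' implementation: the shared secret, claim names, and helper functions are hypothetical, and a production deployment would use a maintained JWT library:

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_jwt(payload: dict, secret: bytes) -> str:
    """Build a compact HS256 JWT: header.payload.signature."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    signing_input = f"{header}.{body}".encode()
    sig = b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"

def verify_jwt(token: str, secret: bytes) -> bool:
    """Recompute the HMAC over header.payload and compare signatures."""
    header, body, sig = token.split(".")
    expected = b64url(
        hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest()
    )
    return hmac.compare_digest(sig, expected)

secret = b"cluster-shared-secret"                      # hypothetical key
token = make_jwt({"sub": "datanode-1", "role": "DN"}, secret)
print(verify_jwt(token, secret))        # True
print(verify_jwt(token + "x", secret))  # False (tampered signature)
```

In the proposed extension, such a token would be presented during the MQTT connection phase, over TLS, before the hash verification of DataNode identities.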
The remainder of this paper is organized as follows: Section 2, “Background”, reviews related work. Section 3, “Materials and Methods”, describes the proposed solution in detail. Section 4, “Results”, presents the experimental setup and evaluates the performance of the proposed system. Section 5 is the “Discussion”. Section 6, “Limitations and Future Work”, clarifies the limitations of our work and discusses directions for future research. Finally, Section 7 concludes the study.

2. Background

2.1. Related Work

Enhancing the security of MQTT data ingestion into big data/Hadoop for analytics has been proposed in the literature. For example, Martin Barton et al. [12] proposed a method for implementing smart factories in small- and medium-sized enterprises (SMEs), following the principles of Industry 4.0. Their work focuses on integrating IoT equipment for data collection and utilizes Hadoop as the storage and processing environment. This approach enables the fusion of data and processes within the data warehouse. The work also employs the MQTT transmission protocol, which offers low bandwidth and is well-suited to real-time processes.
Additionally, S. Sarkar et al. [13] presented an emergency evacuation system for urban environments. This solution collects real-time data via IoT sensors, such as temperature, pressure, and humidity. The researchers use an MQTT broker for communication, Apache Spark 4.0.2 Streaming for dynamic risk score calculation at each exit, and Apache Kafka for data stream management. The system is hosted on the AWS cloud, and Hadoop (HDFS) is used for data storage. Mechanisms for securing Hadoop and MQTT have also been proposed. In the following, we introduce recent works in these fields.

2.1.1. Security of Big Data/Hadoop

Big data involves real-time processing, owing to the speed at which data is produced and transformed to meet needs [14,15]. Big data can also be highly variable in the nature of the data; it may require different processing tools depending on the specific data and the objectives of the analysis [15]. Since Hadoop provides an exceptionally efficient implementation of the MapReduce algorithm and is freely available software, it enables distributed data processing. It has become the most widely used technology for cloud storage because it allows companies to manage and store enormous amounts of data at high speed.
Cloud computing has emerged as an essential solution for processing large volumes of data, offering users on-demand, reliable, flexible, and cost-effective services. Therefore, big data security management has become a major challenge. The solutions proposed by researchers aim to optimize the encryption and decryption of files by integrating AES (Advanced Encryption Standard), RSA (Rivest–Shamir–Adleman), and other algorithms within the Hadoop environment. Nevertheless, previous research that used only a single encryption algorithm, such as AES, reported a 50% increase in the size of encrypted files in a big data context. For this reason, recent works propose combining algorithms to minimize storage size [16,17,18].
Indeed, securing the large amounts of data stored in servers requires encryption techniques. Unfortunately, traditional algorithms are not up to the task in the case of big data. For this reason, Y. Filaly et al. in [16] proposed a hybrid encryption method for large data, using a combination of AES, RSA, and CP-ABE (Ciphertext-Policy Attribute-Based Encryption). They evaluate their model against traditional encryption algorithms such as Blowfish, 3DES (Triple DES), and DES. The results show that their work is more efficient in terms of throughput, encryption time, and decryption time. In the same way, A. Fashakh et al. in [17] proposed a method of encrypting and decrypting HDFS files using the AES and OTP (One-Time Password) algorithms. Their method aims to reduce the size of the information file and accelerate download and encryption in big data systems.
In the field of big data using Hadoop, intruders can easily access the system, and they can steal the stored data, which leads to the implementation of methods to secure it. In this context, Saritha Gattoju et al. in [18] implemented an AES encryption mechanism in HDFS to ensure confidentiality, where the approach maintains data security by introducing authentication, ACLs (Access Control Lists), and authorization techniques. At the HDFS level, users’ access is controlled using ACLs.
In the field of cloud big data, several researchers are working on security. Scientists Uma et al. in [19] proposed an architecture entitled “Secure Authentication and Data Sharing in Cloud” (SADS-Cloud). This work focuses on confidentiality, integrity, and processing efficiency in the cloud big data environment. It also addresses secure authentication via SHA-3 hashing, encryption using SALSA20 applied at the MapReduce level, and efficient data management through DBSCAN clustering. Their evaluation focuses on reducing encryption/decryption time and minimizing information loss.
In several contexts, such as health, economy, and others, big data refers to a complex problem of digital data collected from many sources of different types (text, image, video, etc.), which leads to the difficulty of management with traditional technologies. For these purposes, researchers use new technologies such as the cloud and big data to guarantee the security of the stored and transmitted information.
In this context, the following recent works are based on encryption, hashing at the cloud level, and using the blockchain as a security technology [10,20].
In the healthcare domain, it is necessary to pay attention to the security of shared patient medical data, so researchers A. Alabdulatif et al. [20] aimed at protecting big medical data, derived from sources such as IoT devices, electronic medical records, and social media. However, data management faces challenges, including data storage, cleaning, and security. They proposed a hybrid cloud-based access control system combining public and private clouds. The public cloud stores encrypted medical data, while the private cloud manages sensitive encryption keys and metadata. In addition, they used the Role-Based Access Control (RBAC) model with hierarchical role structures to ensure that user permissions are managed strictly. In terms of validating their work, scientists tested it on Microsoft Azure with tools such as Visual Studio and ASP.NET.
To address the security and performance challenges associated with storing and managing big data in cloud environments, scientists use Hadoop as a big data environment, but single-point failures in the NameNode lead to metadata reliability issues. Existing encryption methods in Hadoop are either inefficient or insufficient for large-scale and diverse data types. Furthermore, data integrity and security risks arise from unauthorized access or malicious operations in cloud environments. To resolve this issue, S. Guan et al. [10] proposed a secure storage solution leveraging HDFS with advanced encryption techniques to improve data security and efficiency, where dual-channel distributed storage was introduced. In this solution, the NameNode service is distributed across multiple nodes, which ensures fault tolerance by using Zookeeper for synchronization. In addition, it uses the enhanced ECC (Elliptic Curve Cryptography) encryption algorithm for lightweight and efficient encryption of general data and Paillier homomorphic encryption for sensitive data that requires secure computation without decryption. Indeed, the encryption algorithms are tightly coupled to the storage architecture, supporting secure storage and retrieval of data in real time. We notice that centralized server dependencies lead to trust issues that make data vulnerable to security threats.
To ensure confidentiality, several works propose data encryption but do not take authentication and authorization into account [1,21]; others base their studies on Kerberos to ensure authentication [22,23].
For improved security and data protection, the Privacy-Preserving Data Mining (PPDM) technique was employed by R. Josphineleela et al. in [21]. They focus on reducing privacy and security risks by introducing process-level controls in data processing, information collection, and data publication. This work addresses general issues related to security and data access by dividing collected data into several segments and applying PPDM privacy preservation methods.
In the context of big data, the generation of large datasets requires specific processing models. The paper by S. Nayini and A.R. Kandlakunta in [22] examines the scalability and efficiency of processing large data in an atomic and distributed manner using Hadoop MapReduce. The main principle of this paper is to decompose raw data into sections for analysis in clustered environments, thereby addressing issues related to large datasets. Furthermore, it highlights the importance of analyzing the threats and weaknesses of Hadoop MapReduce based on case studies, such as those in smart cities. In the same field, and to reduce security gaps in Hadoop, S. Palit and Ch. S. Roy in [23] proposed the use of Rhino and Sentry projects to improve Hadoop security. The Rhino project secures access control at the table, cell, or column level, while the Sentry project is used for information authorization. This authorization is based on roles, creating various groups to achieve different access levels to different sets of data.
To enhance existing solutions, researchers base their studies on machine learning (ML), which improves decision-making, powers recommendation systems, and strengthens cybersecurity through real-time threat detection. In this context, the work [24] focused on analyzing big data using ML, where M. Dandekar et al. examine the uses, processes, principles, and challenges. The analysis techniques employed include predictive, diagnostic, descriptive, and prescriptive analysis. They use artificial intelligence (AI) methods for Spark and Hadoop to support the analysis of big data.
As shown in Table 1, integrity in the cloud environment is ensured using hashing techniques; however, these approaches do not extend to the big data layer (Hadoop). This creates a potential security gap at the Hadoop level, specifically concerning the modification or replication of DataNode (DN) identities (ID-DN). Such issues are not addressed during the initialization of YARN and HDFS, leaving the system vulnerable to identity-based attacks within the cluster.
In addition, most approaches for securing big data environments—whether on the cloud, blockchain, or the Hadoop platform—rely on proven encryption mechanisms, among which AES occupies a prominent place. This algorithm is favored for its cryptographic robustness and efficiency in processing large volumes of data [25]. In this context, we also chose to use AES to secure exchanges within the Hadoop environment. However, to address the limitations of existing work, we propose an in-depth study of AES operating mode variants (CTR, CBC, and GCM), evaluating them through a comparative analysis of the influence of mode and key size on execution time. The experiment measures the impact of these parameters on the time required to send encrypted information to the HDFS distributed system, aiming to identify the optimal balance between security and performance. Furthermore, our approach enhances the overall security level, which is not present in other benchmarks, by integrating a two-level integrity verification mechanism: first at the NameNode level, to guarantee metadata consistency, and second at the data block level, to ensure the reliability and authenticity of the information stored in HDFS.
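A comparative study of this kind can be scripted as a simple timing harness. The sketch below assumes the third-party `cryptography` package and an illustrative 1 MiB payload; it times one encryption pass for each AES mode (CTR, CBC, GCM) and key size (128/192/256 bits). It is a measurement skeleton under those assumptions, not the benchmark used in this paper:

```python
import os
import time
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_time(mode_name: str, key: bytes, data: bytes) -> float:
    """Time a single AES encryption pass of `data` under the given mode."""
    if mode_name == "GCM":
        mode = modes.GCM(os.urandom(12))   # 96-bit nonce, recommended for GCM
    elif mode_name == "CTR":
        mode = modes.CTR(os.urandom(16))   # full-block counter/nonce
    else:
        mode = modes.CBC(os.urandom(16))   # CBC requires block-aligned input
    enc = Cipher(algorithms.AES(key), mode).encryptor()
    start = time.perf_counter()
    enc.update(data)
    enc.finalize()
    return time.perf_counter() - start

data = os.urandom(1 << 20)                 # 1 MiB, a multiple of 16 bytes
for key_bits in (128, 192, 256):
    key = os.urandom(key_bits // 8)
    for mode_name in ("CTR", "CBC", "GCM"):
        t = encrypt_time(mode_name, key, data)
        print(f"AES-{key_bits}-{mode_name}: {t * 1e3:.3f} ms")
```

In a Hadoop setting, the same harness would wrap the full path of encrypting a file and writing it to HDFS, so that the measured time reflects both the cipher mode and the storage layer.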

2.1.2. Security of the MQTT Protocol

For improvements in application-level security, there are several research studies based on the use of the MQTT protocol in different environments [3,4,11,26,27,28].
Recent work focuses on secure authentication mechanisms based on attributes, tokens, and identities. Researchers Patel and Doshi, in [11], proposed a three-phase architecture: User registration, device registration, and authentication to secure MQTT communications.
However, some architectures are static by design; this makes it difficult to add or remove brokers, as well as to handle variations in the number of users connected to each broker during operation. Researchers Kawaguchi et al. [26] presented an MQTT protocol architecture for edge computing. The content of this article focuses on the broker’s operation. It improves availability while reducing the amount of data broadcast.
Nonetheless, security and communication protocols have been extensively explored for Internet of Things (IoT) environments. Researchers generally proposed a multi-layered security approach, ensuring protection at the network, device, and application levels through mechanisms such as intrusion detection, encryption, and authentication. Manjushree Nayak et al. in [27] introduced a unified security model combining identity verification, cryptographic protection, and secure messaging protocols. Furthermore, they utilized widely used communication protocols, including MQTT, TLS/SSL, CoAP, and DTLS, which facilitate securing data exchange in IoT systems.
Nevertheless, the work presented by Muhammad Nadeem et al. in [28] analyzes some security threats at layer three of IoT systems (the network layer). It evaluates the effectiveness of the MQTT protocol when combined with TLS. They also demonstrate that TLS significantly improves data integrity and confidentiality in a simulated IoT environment.
Furthermore, several studies highlight the importance of using big data in healthcare [29,30]. The use of big data in the medical field is based on the rapid processing of enormous amounts of information. Moreover, this application focuses on optimizing clinical monitoring, improving the quality of care, and enhancing patient satisfaction.
Scientists Akanksha Sharma et al. [29] present the key characteristics of big data applied to healthcare through the 10 Vs (veracity, value, variety, visualization, virality, viscosity, variability, velocity, validity, and volume), as well as the most commonly used platforms, such as Hadoop and Cassandra, for addressing diseases such as diabetes, heart disease, and others. This work finally highlights the essential role of analytics in various clinical uses, including medical imaging, signal processing, genomics, and clinical informatics.
The work presented by Mumtaz Karatas et al. [30] studied the cooperation between industry, the healthcare sector, and big data. This work focuses on the management of healthcare services through emerging technologies such as medical cyber-physical systems, the Internet of Health Things, and machine learning. Big data, as demonstrated in this study, plays a crucial role in this transformation. This is achieved through improvements in data collection, processing, and sharing. This article highlights the role of big data in Industry 4.0, which is working within the medical field to improve healthcare systems.

2.1.3. Security Big Data/Hadoop MQTT

The work [31] by Hung-Yu Chien et al. uses an approach based on integrity alone (without encryption) at the client–broker channel when the content is already encrypted end-to-end, together with full compatibility with MQTT 5.0 and its strong authentication mechanisms. These researchers also conducted a formal security analysis, validated using the AVISPA tool. This work satisfies the objectives of confidentiality, authentication, and integrity.
The article by Saud Alharbi et al. [32] proposes HECS4MQTT, a multi-layered security framework for a healthcare IoT system using the MQTT protocol. It addresses the need to ensure data protection while respecting the significant resource constraints of IoT devices. The authors emphasize that lightweight encryption alone becomes insufficient when data travels to the cloud; they therefore adopt a layered encryption approach combined with a multi-broker MQTT architecture. At the edge, IoT devices encrypt data using lightweight algorithms (notably Salsa20, with Blake2b for integrity).
To summarize our analysis of the related works, Table 1 presents the comparative results regarding the use of big data in cloud/blockchain environments, the use of big data with the Hadoop platform, and the use of the Hadoop platform with MQTT.
Table 1. Comparative table of related work.
Columns: Year, Ref; Authentication; Authorization; Confidentiality; Integrity ID-DN; Monitoring IDS; Availability; User Access Control; Network Access Control; Simulation Method; Results.

Cloud/blockchain environments using big data
  • 2025 [7] — Authentication: hybrid cryptographic key management between DataNodes and NameNode. Authorization: blockchain-based access control and smart contracts. Confidentiality: AES-GCM encryption for data-at-rest and RSA for key exchange. Integrity ID-DN: digital signatures and hash-based validation of DataNode IDs. Monitoring IDS: integrated IDS module monitoring Hadoop logs and node behavior. Availability: distributed replication across multiple DataNodes. User access control: blockchain credentials. Network access control: secure communication channels (TLS + IDS alerts). Simulation method: Hadoop cluster (2–3 DataNodes) on VirtualBox. Results: encryption time vs. file size: 2.03 s for 100 MB; 10.75 s for 500 MB; 21.10 s for 1 GB.
  • 2024 [10] — Authentication: Kerberos. Authorization: no. Confidentiality: dual-thread ECC encryption (DH-ECC) and Paillier homomorphic encryption. Integrity ID-DN: no. Monitoring IDS: no. Availability: improved storage efficiency. User access control: no. Network access control: no. Simulation method: Hadoop-based encrypted storage scheme. Results: encryption time for a 4 GB file: AES = 130 s; DES = 140 s; ECC = 150 s; DH-ECC = 50 s.
  • 2023 [20] — Authentication: two-factor authentication (2FA) with OTP. Authorization: access control mechanisms. Confidentiality: encryption (RSA, AES, RC4). Integrity ID-DN: SHA-512, SHA-256, MD5. Monitoring IDS: yes, in the cloud. Availability: security and availability ensured through encryption. User access control: role-based and attribute-based access control. Network access control: firewalls in the cloud. Simulation method: Microsoft Azure, C#/ASP.NET. Results: encryption = 3500 ms, decryption = 700 ms (times in milliseconds, sizes in bytes).
  • 2021 [33] — Authentication: mutual authentication with lightweight protocols in IoT edge computing. Authorization: blockchain-based access control using Hyperledger Fabric (HLF). Confidentiality: blockchain-based provenance mechanism ensuring data privacy. Integrity ID-DN: cryptographic hash functions and blockchain ledger. Monitoring IDS: no. Availability: decentralized storage on Hadoop. User access control: blockchain-based authentication for IoT devices. Network access control: no. Simulation method: Hadoop and Hyperledger Fabric with Raspberry Pi devices. Results: 600 transactions per minute, 500 ms average response time, improved data traceability and security.

Big data with the Hadoop platform
  • 2024 [24] — Authentication, authorization, confidentiality, integrity ID-DN: no. Monitoring IDS: discusses big data analytics in cybersecurity for real-time threat detection and mitigation. Availability: highlights scalability and performance. User and network access control: no. Simulation method: Hadoop, Spark, and machine learning algorithms. Results: analytics results.
  • 2024 [1] — Authentication, authorization: no. Confidentiality: RSA encryption. Integrity ID-DN, monitoring IDS: no. Availability: HDFS replication and MapReduce parallel processing. User and network access control: no. Simulation method: single-node Hadoop cluster. Results: 1 MB: 16/15/14 s with 1/2/3 mappers; 10 MB: 13/14/13 s; 100 MB: 15/14/14 s.
  • 2023 [16] — Authentication: CP-ABE (Ciphertext-Policy Attribute-Based Encryption), RSA, and AES. Authorization: CP-ABE. Confidentiality: AES, RSA, and CP-ABE. Integrity ID-DN, monitoring IDS: no. Availability: HDFS high availability and fault tolerance. User access control: attribute-based. Network access control: no. Simulation method: Hadoop on an Intel Core i5 (8 processors, 16 GB memory, CentOS 7.5, 1 slave). Results: 1 GB encryption: proposed = 5 min; DES = 13 min; 3DES = 12.5 min; Blowfish = 11.8 min.
  • 2023 [21] — Authentication, authorization: no. Confidentiality: Privacy-Preserving Data Mining (PPDM) through data decentralization. Integrity ID-DN, monitoring IDS: no. Availability: security via decentralization to prevent unauthorized access. User access control: decentralized data control. Network access control: no. Simulation method: NA. Results: improved data privacy by reducing the security risks of centralized storage.
  • 2023 [17] — Authentication: AES and OTP (One-Time Password). Authorization: NA. Confidentiality: AES and OTP. Integrity ID-DN, monitoring IDS: no. Availability: HDFS high availability. User access control: based on user attributes. Network access control: no. Simulation method: NA. Results: faster encryption/decryption and encrypted files 20% smaller than with traditional methods.
  • 2021 [18] — Authentication: Kerberos (between users and Hadoop services). Authorization: UNIX-style permissions and ACLs. Confidentiality: AES. Integrity ID-DN, monitoring IDS: no. Availability: HDFS data replication. User access control: Kerberos and delegation tokens. Network access control: no. Simulation method: Hadoop on a VirtualBox Ubuntu virtual machine. Results: AES encryption (128-bit blocks, 128/192/256-bit keys): 150/300/700 s.

Hadoop with MQTT
  • 2024 [12] — Most criteria: NA. Availability: uses Hadoop. Network access control: local network. Simulation method: NA. Results: summary in the form of a BPMN diagram.
  • 2024 [13] — Most criteria: NA. Availability: combination of Kafka (for streaming) and Hadoop/HDFS (for storage). Simulation method: physical test bed with 6 IoT beacons. Results: computes risk scores during a simulated evacuation (test bed).

Our work
  • 2026 — Authentication: AES (Advanced Encryption Standard) and JWT (JSON Web Tokens). Authorization: access control mechanisms. Confidentiality: AES-GCM. Integrity ID-DN: SHA-3-256, MD5 (as used by Hadoop), and the integrity offered by AES-GCM encryption. Monitoring IDS: monitoring module (verification function). Availability: high availability through distributed replication across multiple DataNodes. User access control: verified via a hash verification function. Network access control: secure communication channels (TLS + IDS alerts). Simulation method: Hadoop cluster (3 DataNodes) on VirtualBox. Results: encryption time vs. file size: 2 s for 100 MB; 7.69 s for 500 MB; 19.97 s for 1 GB.

NA: not available.
Many existing solutions rely on isolated security aspects and do not introduce integrated mechanisms that simultaneously guarantee data integrity, confidentiality, and trust between the different distributed components.
In light of this analysis, we present in the following section our proposal to secure MQTT and Hadoop based on two main aspects: (1) the use of hashing to ensure integrity and the use of the AES algorithm with different operating modes for encryption to ensure data confidentiality; and (2) the extension of the MQTT protocol to secure big data transmission between the devices in our environment. This solution will subsequently be evaluated experimentally to measure its effectiveness and compare the obtained performances, as detailed in the experimental results section.

3. Materials and Methods

3.1. System Architecture

When Hadoop starts up, the system saves the ID-DN and ID-Block as reference identifiers. After this one-time step, MQTT comes into play.
As shown in Figure 1, the MQTT process starts when (1) the Hadoop NameNode sends the ID-DN and ID-Block to the MQTT broker. Then, (2) the broker forwards them to the Hadoop client (in our case, the DN). At this point, the core of our approach is executed: (3) the identifiers are exchanged to perform the hash verification functions, which are defined in detail in Section 3.2. This adds another hashing step for greater integrity and improved security. If the hash verification test succeeds, (4) the Hadoop client subscribes to the topic.

3.2. Proposed Three-Phase Integrity Defense for Hadoop

Encrypting data in transit is essential to protect exchanges between Hadoop and the front-end. In the Hadoop ecosystem, encryption of data in transit relies on the SASL (Simple Authentication and Security Layer) authentication mechanism configured in Hadoop, which secures exchanges between clients and servers. This mechanism helps prevent man-in-the-middle attacks by ensuring the integrity and confidentiality of communications. Additionally, Hadoop supports multiple encrypted channels, including RPC (Remote Procedure Call), HTTP, and the Data Transfer Protocol (DTP), to secure data transmission in motion.
Our proposal tackles the secure management of large-scale data in distributed environments, with a focus on the Hadoop ecosystem. It consists of three complementary components. The first ensures data confidentiality by encrypting HDFS data blocks at the time of storage, protecting sensitive information in big data contexts. The second component verifies the integrity and consistency of the Hadoop cluster during startup. This is achieved through a hashing mechanism: each DN generates a unique identifier that is hashed and stored upon creation. During cluster restart, these identifiers are rehashed and compared to the stored values to validate the authenticity of each DN. Any mismatch triggers a security alert, preventing unauthorized modifications or replication. Before the third phase, a hash of the pair (ID-DN/ID-Block) is computed and stored as the reference hash. The third component verifies the integrity of the identifier pair (ID-DN, ID-Block) after startup, again through a hashing mechanism. The robustness of our proposal lies in the generation of a unique identifier for each block. This identifier is hashed and stored upon creation.
The process is started upon cluster restart, at which point the Block and DN identifiers are recalculated and compared to the stored values to validate authenticity. Therefore, to prevent any unauthorized action, the detection of an anomaly results in the sending of a security alert.
The proposed methodology comprises three key phases aimed at enhancing the security and integrity of the Hadoop Distributed File System. As shown in Figure 2, the first phase encrypts client-submitted data blocks, which are then segmented and distributed across the cluster's DNs. For each block, a hashed fingerprint of the hosting DN's identifier is computed and stored in the NameNode, creating a secure association between data and its storage location. The second phase occurs during cluster startup: the system rehashes the identifiers of active DNs and compares them with the stored hashes. A successful match permits normal startup, while any discrepancy triggers a security alert, indicating possible unauthorized access or tampering. The third phase, starting from the second startup, performs an integrity check of the ID pair (ID-DN, ID-Block). Any detected discrepancy immediately activates a security alert.

3.2.1. Hadoop-Based Approach

The concept of “Big Data” encompasses a wide range of services, applications, technologies, and architectural models. However, existing data-processing solutions are increasingly inadequate for managing the ever-growing volume of data being generated. As illustrated in Figure 3, the Hadoop big data environment handles information through distinct stages of recording. During both the writing and reading phases, the NameNode plays a critical role by segmenting and replicating the data across the system to ensure reliability and accessibility. Algorithms 1 and 2 present the main steps.
Algorithm 1. Secure Write Phase in the Enhanced Hadoop/MQTT Framework
Input: File to be written F
Output: File F stored in replicated blocks across the DataNodes
1. HADOOP CLIENT: Divide file F into blocks {B1, B2, …, Bi}
2. HADOOP CLIENT → NAMENODE: Notify the NameNode to write the file
3. NAMENODE: Identify the appropriate DataNodes for each block
4. HADOOP CLIENT → DATA NODE (DN): Send each block to the assigned DN
5. REPEAT for each block Bk:
         5.1 DN: Save block Bk
         5.2 DN: Replicate the block to other DataNodes according to the replication policy (e.g., 2 copies)
  UNTIL all blocks are written
Algorithm 2. Read Phase in the Enhanced Hadoop/MQTT Framework
Input: Name of the file F to read
Output: Reconstructed file F
1. NAMENODE → HADOOP CLIENT: Divide the file into i blocks {B1, B2, …, Bi} and specify the DataNodes that store each block
2. For each block Bk:
           2.1 DN: Identify the location of the blocks (e.g., B1: DN1, DN3, B2: DN2, DNj, …)
3. HADOOP CLIENT: To retrieve block Bk:
IF an available DN contains block Bk THEN
           3.1 Retrieve the block from this DN
ELSE
           3.2 Retrieve the block from another DN containing this block
// Repeat for all blocks {B1, B2, …, Bi} to reconstruct file F
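The write and read phases in Algorithms 1 and 2 can be sketched as a minimal in-memory simulation. The class and method names below are illustrative only; real HDFS block placement and replication policies are considerably more involved:

```python
# Minimal in-memory sketch of Algorithms 1 and 2 (hypothetical names,
# not the real HDFS client API).
BLOCK_SIZE = 4          # bytes per block (tiny, for illustration)
REPLICATION = 2         # copies per block, as in step 5.2 of Algorithm 1

class MiniHDFS:
    def __init__(self, datanodes):
        self.datanodes = {dn: {} for dn in datanodes}   # DN -> {block_id: data}
        self.namenode = {}                              # file -> [(block_id, [DNs])]

    def write(self, name, data):
        # Step 1: the client divides file F into blocks {B1, ..., Bi}.
        blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        self.namenode[name] = []
        dns = list(self.datanodes)
        for k, blk in enumerate(blocks):
            # Step 3: the NameNode picks DNs; step 5.2: replicate the block.
            targets = [dns[(k + r) % len(dns)] for r in range(REPLICATION)]
            for dn in targets:
                self.datanodes[dn][(name, k)] = blk      # step 5.1: save block
            self.namenode[name].append(((name, k), targets))

    def read(self, name):
        # Algorithm 2, step 3: fetch each block from any available DN holding it.
        out = b""
        for block_id, targets in self.namenode[name]:
            dn = next(d for d in targets if block_id in self.datanodes[d])
            out += self.datanodes[dn][block_id]
        return out

fs = MiniHDFS(["DN1", "DN2", "DN3"])
fs.write("f.txt", b"hello hadoop")
print(fs.read("f.txt"))   # b'hello hadoop'
```

Reading succeeds as long as at least one replica of every block is reachable, which is the fallback captured by the ELSE branch of Algorithm 2.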

3.2.2. Confidentiality Check

In this study, we integrate the AES algorithm within the Hadoop ecosystem to enhance data confidentiality. Specifically, AES encryption is applied at the point of file ingestion into the Hadoop Distributed File System (HDFS). By encrypting data prior to its distribution across DN, this approach substantially strengthens the system’s security posture, ensuring that only authorized users equipped with the appropriate key can access or manipulate the stored information.
The encryption process is listed below:
  • The distributed file system (HDFS) transmits a query to the master node, which contains the information for creating a new file.
  • In this step, the NameNode analyzes the availability of space.
  • The file is encrypted in the NameNode when the characteristics are specified.
  • The AES key is generated and applied to the file to encrypt it to protect it.
  • In this step, the output data is encrypted, followed by a writing process to store the information in a particular DN.
  • The NameNode stores information about the current NameNode and the secondary NameNode (replication node) throughout the replication operation.
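The encryption step above can be sketched with AES-256-GCM using the third-party `cryptography` package. The key handling shown is illustrative and not the paper's key-management scheme; binding the block identifier as associated data (AAD) is our suggestion for tying a ciphertext to its block:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_block(key: bytes, plaintext: bytes, block_id: bytes) -> bytes:
    """Encrypt one HDFS block with AES-256-GCM before it is written to a DN.

    The block identifier is bound as associated data (AAD), so a ciphertext
    moved to a different block position fails authentication on decryption.
    """
    nonce = os.urandom(12)                    # 96-bit nonce, unique per block
    return nonce + AESGCM(key).encrypt(nonce, plaintext, block_id)

def decrypt_block(key: bytes, blob: bytes, block_id: bytes) -> bytes:
    nonce, ct = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ct, block_id)

key = AESGCM.generate_key(bit_length=256)     # 256-bit AES key
blob = encrypt_block(key, b"sensitive block contents", b"ID-Block-0001")
assert decrypt_block(key, blob, b"ID-Block-0001") == b"sensitive block contents"
```

Because GCM is an authenticated mode, decryption with the wrong key or a mismatched block identifier raises an authentication error rather than silently returning garbage.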

3.2.3. Integrity Check

Data integrity in Hadoop is essential to ensure that DataNodes are neither corrupted nor altered. Hadoop implements several mechanisms to guarantee this integrity, particularly during data storage, transfer, and processing. However, it does not check the uniqueness of DNs to prevent duplication by unauthorized nodes. By default, Hadoop uses checksums to detect any alteration of data blocks stored in HDFS. Furthermore, it guarantees data redundancy through block replication across multiple DNs. Nevertheless, the problem is that Hadoop does not verify DN IDs during the access control process, and it also lacks verification of (ID-DN/ID-Block) pairs. To overcome these limits, we propose a two-level integrity verification.
In what follows, Table 2 contains the abbreviations used in the rest of our article: the Section Integrity DataNode Level and the Section Integrity Blocks Level.
Integrity DataNode Level
Hadoop ensures the security of exchanges between nodes to prevent any modification of data during transmission. However, Hadoop does not verify that the identity of a DN already created at startup (the reference ID) matches the DN ID presented at the current access (the current ID). To overcome this limit, we propose to guarantee DN integrity as follows. We started by implementing the Hadoop platform for our big data environment. The process begins with starting the Hadoop environment: the NameNode, then the DNs. At this point, each DN has a specific identity. According to the process illustrated in the diagram of Figure 4, the verification steps are executed after starting HDFS to test whether this is an implementation phase or a platform access. This phase ends with hashing the ID-DN and saving it as the reference hash H_reference.
  • Retrieve the ID-DN.
  • Calculate the new ID hash value H_current.
  • Retrieve the ID-DN reference hash H_reference.
  • Test whether the ID-DN has changed via hash verification (H_reference =? H_current):
    4.1 If the verification test is false: stop YARN.
    4.2 Else, start YARN; if there is another DN, repeat the process from the beginning to verify the uniqueness of its ID.
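The DataNode-level check can be sketched with Python's standard hashlib, which provides SHA3-256. The identifiers, the reference-hash store, and the function shape below are illustrative; the stop-YARN/start-YARN decision is the paper's:

```python
import hashlib

def sha3_hex(identifier: str) -> str:
    """SHA-3-256 fingerprint of a DataNode identifier."""
    return hashlib.sha3_256(identifier.encode()).hexdigest()

def verify_datanode(current_id: str, reference_hashes: set) -> bool:
    """Return True (start YARN) iff H_current matches a stored H_reference."""
    return sha3_hex(current_id) in reference_hashes

# First startup: hash each ID-DN and save it as the reference hash H_reference.
reference = {sha3_hex(dn) for dn in ["DN-uuid-1", "DN-uuid-2", "DN-uuid-3"]}

# Later access: rehash the current ID and compare (step 4 above).
assert verify_datanode("DN-uuid-2", reference)          # step 4.2: start YARN
assert not verify_datanode("DN-uuid-rogue", reference)  # step 4.1: stop YARN, alert
```

Because SHA-3-256 is collision-resistant, a rogue node cannot forge an identifier whose hash matches a stored reference.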
Integrity Blocks Level
To overcome the block integrity control limitation, we propose guaranteeing the integrity of the (ID-DN/ID-Block) pairs, as presented in the following steps. This verification is executed after a proper startup, once the first verification has succeeded. At this point, each DN has a specific identity. According to the architecture illustrated in Figure 2, the following steps are implemented:
  • Retrieve the ID-DN and the ID-Blocks.
  • Calculate the new hash value H_current at the DN and block level.
  • Retrieve the reference hash of (ID-DN/ID-Blocks), H_reference.
  • Verify the modification of (ID-DN/ID-Blocks) by checking H_reference =? H_current. If the verification test is false, raise an alert; otherwise, proceed to another replica of the block in another DN to verify the uniqueness of its identifier.
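The block-level check extends the same idea to (ID-DN, ID-Block) pairs. The separator used when hashing the pair below is an implementation choice of this sketch, not prescribed by the paper:

```python
import hashlib

def pair_hash(id_dn: str, id_block: str) -> str:
    # Hash the (ID-DN, ID-Block) pair; '|' keeps the two fields unambiguous.
    return hashlib.sha3_256(f"{id_dn}|{id_block}".encode()).hexdigest()

reference = pair_hash("DN1", "blk_1001")          # stored at block creation
assert pair_hash("DN1", "blk_1001") == reference  # unchanged pair: OK
assert pair_hash("DN2", "blk_1001") != reference  # block moved/cloned: raise alert
```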

3.3. Extended MQTT for Security

To ensure secure communication, it is necessary to implement confidentiality and integrity between the Hadoop components, based on the TLS transport-layer protocol and the verification functions H1 and H2. For these reasons, we use TLS, a secure network protocol built on TCP, combined with the verification functions to provide a double level of hashing. For even greater security, we propose using the MQTT protocol with an extension covering confidentiality, integrity, and access control.
Adequate data dissemination and discovery mechanisms are necessary for MQTT brokers to be addressable. To solve this problem, we propose a suitable architecture for implementing them within Hadoop as a big data environment. In our case, the working environment is static by design. The initiation of this protocol requires client intervention: when the Hadoop client requests to view the stored information, the steps in this process begin. Furthermore, the user is responsible for the broker discovery phase, providing the necessary data, and then connecting to that broker. We present, in the following, the sequence of steps in our process, including the execution scenario, discovery mechanism, and authentication process.

3.3.1. Execution Scenario

Our scenario consists of two brokers, 1 and 2, and a “head broker”. A user of our Hadoop architecture (DN1) wants to publish a resource R1 on topic T via broker 1. Broker 1 then sends it to the “head broker” for proper routing. Meanwhile, broker 2 manages a subscriber to the same topic T. Therefore, broker 1 publishes on topic T and broker 2 is subscribed to it. This example is illustrated in Figure 5.

3.3.2. Authentication Module

The authentication module sends the JSON Web Tokens (JWT) to other devices that are authenticating. This token works with a unique public key, known to all other devices, which allows them to validate the received tokens. The process begins when a new broker, in a new discovery mechanism phase, requests communication with the authentication service to obtain an authentication token, which is shown in Figure 6.
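A minimal JWT issue/verify cycle can be sketched with only the standard library. Note the hedge: the paper's design uses a public key known to all devices (an asymmetric scheme); the HMAC-SHA256 shared secret below is a stand-in so the sketch stays self-contained, and production deployments would use a vetted library such as PyJWT:

```python
import base64, hashlib, hmac, json

def _b64(data: bytes) -> bytes:
    # base64url without padding, as used in JWTs (RFC 7519)
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def issue_jwt(claims: dict, secret: bytes) -> str:
    """Authentication service: issue a signed token for a broker/device."""
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64(json.dumps(claims).encode())
    sig = _b64(hmac.new(secret, header + b"." + payload, hashlib.sha256).digest())
    return (header + b"." + payload + b"." + sig).decode()

def verify_jwt(token: str, secret: bytes) -> dict:
    """Receiving device: validate the signature, then return the claims."""
    header, payload, sig = token.encode().split(b".")
    expected = _b64(hmac.new(secret, header + b"." + payload, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("invalid token signature")
    padded = payload + b"=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))

secret = b"auth-service-secret"   # illustrative; the paper uses a key pair
token = issue_jwt({"sub": "broker-2", "role": "subscriber"}, secret)
assert verify_jwt(token, secret)["sub"] == "broker-2"
```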

3.3.3. Discovery Mechanism

Before starting this discovery process, the equipment in our environment has to be started. This is done by executing our proposed “Three-Phase Integrity Defense for Hadoop” process. If the hash verification functions return a positive response and Hadoop starts correctly, H(ID-DN1) is sent to DN1, and the discovery process starts. The discovery mechanism begins with a request (resource R2) from the client (DN1) to its broker regarding the location of a specific resource on the network. The broker then locates this resource using a dedicated discovery entity, the “head broker.” Once the resource is located, the broker redirects the user to the broker that holds it. After the final connection phase, we added a verification step using the hash H1 of the ID-DN and a JWT for authentication. Finally, the client connects to this broker as a user to access the resource. This architecture is illustrated in Figure 7.

4. Results

Our work focuses on enhancing data security in a big data environment, specifically within Hadoop ecosystems. The simulation was performed on a computing platform featuring an Intel® Core i7 processor operating at 3.00 GHz with eight logical cores, 16 GB of RAM, a 128 GB SSD, and running the Windows 10 Professional operating system. Within the cluster configuration, one node serves as the master node, while the remaining nodes act as DNs. The Hadoop framework is employed in a pseudo-distributed mode, allowing comprehensive control over system components and facilitating detailed performance monitoring.
The primary objective of this experiment is to assess the performance of various encryption algorithms, specifically the different operational modes of the AES encryption algorithm, applied to data files of varying sizes. The evaluation focuses on key performance indicators, including the upload time (T_upload), defined as the duration required to transfer the encrypted file into HDFS. This performance metric is formally defined by Formula (1):
T_upload = T_end^HDFS − T_start^HDFS
where T_end^HDFS and T_start^HDFS refer to the timestamps for the completion and start of the HDFS upload.
Multiple test scenarios are conducted using files of varying sizes (e.g., 100 MB, 500 MB, and 1 GB). The encryption algorithms are benchmarked against the aforementioned performance metrics to evaluate their effectiveness and suitability within a big data environment.
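Formula (1) can be captured with a simple timer around the upload call. The callable indirection below is our sketch for testability; the `hdfs dfs -put` command mentioned in the comment is the standard Hadoop CLI, and the file paths are illustrative:

```python
import time

def measure_upload(upload) -> float:
    """Return T_upload = T_end^HDFS - T_start^HDFS (Formula (1)).

    `upload` is any callable performing the transfer, e.g.
    lambda: subprocess.run(["hdfs", "dfs", "-put", "enc.bin", "/data/"], check=True)
    """
    t_start = time.perf_counter()   # T_start^HDFS
    upload()
    t_end = time.perf_counter()     # T_end^HDFS
    return t_end - t_start

# Stand-in workload instead of a real HDFS transfer:
t_upload = measure_upload(lambda: sum(range(1_000_000)))
print(f"T_upload = {t_upload:.4f} s")
```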
Our study is structured into two main components:
  • Hadoop Environment Security Reinforcement: We propose an integrity verification mechanism for DNs to prevent unauthorized modifications or malicious node impersonation. This is achieved through SHA-3-256 hashing to validate DN identity, ensuring uniqueness, preventing hash collisions, and providing trustworthiness within the cluster.
  • Data Encryption for Confidentiality and Integrity: We evaluate the performance of AES encryption modes (CTR, CBC, and GCM) with varying key sizes (128, 192, and 256 bits) to ensure secure data storage and transmission. The results (transfer times, computational overhead) are analyzed to determine the optimal encryption approach for large-scale datasets.
We describe, in the following, the configuration of MQTT to test the security of communication with an MQTT broker, specifically through tests of secure message exchange. The goal is to evaluate the behavior of the broker and clients after enabling SSL/TLS encryption, in order to guarantee secure communication between MQTT clients.
In the following, Table 3 contains the abbreviations used in our article, in Section 4.1 and Section 4.2.
The execution time of our proposed process is the sum of the times of the steps presented in Figure 8 (Equation (2)):
T_Total = T_En(B)&H(ID) + T_newMQTT + T_V(H1) + T_V(H2)
where
T_En(B)&H(ID): sum of the block encryption time and the block identity hash time;
T_newMQTT: sum of the execution times of the “extended secure MQTT” steps;
T_V(H1): execution time of the first hash verification function;
T_V(H2): execution time of the second hash verification function.
We measured the execution time of our proposed process, which varies between 11.3016 s and 12.5816 s.

4.1. DataNode Integrity Verification

Our proposed solution implements a proactive security mechanism specifically targeting DN in a Hadoop cluster to guarantee their authenticity and integrity prior to integration into the operational environment. This mechanism addresses critical security risks such as unauthorized node injection, tampering, or metadata corruption, which could otherwise compromise the cluster’s reliability and data integrity.
The core of the protocol relies on cryptographic verification of each DN’s unique identifier (UUID) by applying a SHA-3-256 hash function. Upon cluster initialization—triggered by the standard Hadoop startup script (start-all.sh)—an auxiliary automated script (verify_all_datanode_hashes.sh) is executed. This script interacts with the NameNode’s REST API endpoint (http://NameNode:9870/fsck (accessed 2 October 2025)), which provides real-time metadata, including the UUIDs of all active DNs in the cluster.
The script proceeds to generate SHA-3-256 hashes for each retrieved UUID, thereby creating cryptographic fingerprints that serve as immutable identifiers for the DN. These newly generated hashes are systematically compared against a trusted baseline, maintained in a secure reference file (hash.ref1.datanode.txt). This file contains pre-validated SHA-3-256 hashes corresponding to authorized DNs.
In the event that any hash generated during verification deviates from the stored reference—indicating a potential compromise such as an unauthorized or altered node—the protocol enforces a security fail-safe by preventing the launch of the YARN service. Blocking YARN startup effectively halts resource scheduling and job execution, thus safeguarding the cluster from operating in an untrusted state.
This security check is critical, as YARN is responsible for resource management and job scheduling across the cluster, and its compromise could enable unauthorized data access or corruption. The entire verification workflow and its fail-safe response are depicted in Figure 9, highlighting the system’s capability to detect and mitigate security threats before they impact the cluster.
Our approach complements existing encryption mechanisms, such as AES in CTR, CBC, and GCM modes, by introducing an additional layer of integrity verification that is crucial for sensitive big data environments. While encryption protects the confidentiality of the data, our protocol ensures the authenticity and integrity of the DataNodes themselves, mitigating risks related to unauthorized node insertion or metadata tampering. Experimental results demonstrate that the computational overhead introduced by the verification process is minimal and does not significantly impact system performance. This negligible cost is vastly outweighed by the enhanced security guarantees provided, thereby validating the protocol’s suitability and practical value in critical big data deployments.

4.2. Analysis of AES Modes and Key Sizes Based on Transfer Time to HDFS

In this simulation section, we evaluate the transfer times to HDFS of encrypted files using the AES algorithms (CTR, CBC, and GCM), varying both file and key sizes. This approach aims to analyze the behavior of each algorithm in real-life situations and identify the one that best fits our working environment. The comparison between AES-CTR, AES-CBC, and AES-GCM highlights distinct trade-offs in terms of performance and security depending on the usage context. AES-CTR (Counter mode) is distinguished by its high throughput and inherent parallelizability, making it particularly suitable for applications that demand rapid encryption of large data volumes or real-time processing scenarios such as streaming. However, its security critically depends on the correct management of unique nonces, as any nonce reuse can expose the system to significant vulnerabilities [34]. In contrast, AES-CBC provides moderate performance and acceptable security levels, contingent upon secure handling and non-reuse of the initialization vector (IV). Nonetheless, CBC mode is more susceptible to key reuse issues and padding oracle attacks, and its performance tends to degrade with increasing data size, which restricts its applicability primarily to the secure storage of small- to medium-sized static files [35]. AES-GCM (Galois/Counter Mode) integrates authenticated encryption with high efficiency, delivering both data confidentiality and integrity within a single cryptographic operation. While AES-GCM may introduce marginally higher computational overhead for very large files, its strong resistance to forgery and tampering renders it an optimal choice for contemporary secure communication systems [36]. Overall, although each AES mode is tailored to distinct operational contexts, AES-GCM offers the most robust and balanced solution for distributed, high-risk environments where stringent security assurances are paramount.
In the following section, we present a comparative analysis of AES-CTR, AES-CBC, and AES-GCM to elucidate their respective trade-offs and inform the selection of the most appropriate encryption mode for our specific use case.
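A hedged sketch of such a benchmark is shown below, using the third-party `cryptography` package. The 1 MiB buffer stands in for the paper's 100 MB-1.4 GB files, and only encryption time is measured; the paper's figures additionally include the HDFS transfer:

```python
import os, time
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_time(mode_name: str, key: bytes, data: bytes) -> float:
    """Time one encryption pass for the given AES mode (CTR, CBC, or GCM)."""
    t0 = time.perf_counter()
    if mode_name == "GCM":
        AESGCM(key).encrypt(os.urandom(12), data, None)   # authenticated mode
    else:
        iv = os.urandom(16)
        if mode_name == "CBC":
            data = data + b"\x00" * (-len(data) % 16)     # CBC needs full 16-byte blocks
            enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
        else:                                             # CTR: stream-like, parallelizable
            enc = Cipher(algorithms.AES(key), modes.CTR(iv)).encryptor()
        enc.update(data) + enc.finalize()
    return time.perf_counter() - t0

data = os.urandom(1 << 20)            # 1 MiB stand-in for the benchmark files
for bits in (128, 192, 256):
    key = os.urandom(bits // 8)
    row = {m: encrypt_time(m, key, data) for m in ("CTR", "CBC", "GCM")}
    print(bits, {m: f"{t:.4f}s" for m, t in row.items()})
```

A fresh random nonce/IV is drawn per call, reflecting the nonce-uniqueness requirement that the CTR discussion above emphasizes.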

4.2.1. AES-CTR Mode (Counter Mode)

AES-CTR exhibits high efficiency in encrypting large files, with transfer times increasing nearly linearly as file size grows. Performance remains largely consistent across key sizes of 128, 192, and 256 bits, though a slight increase in transfer time is observed for the 256-bit key when encrypting the largest file tested (1.4 GB). Despite this minimal overhead, the 256-bit key remains the preferred choice due to its superior security strength. The minimal performance variation related to key size underscores AES-CTR’s suitability for applications demanding both robust encryption and high throughput, even for substantial data volumes. The tested file sizes include approximately 101 MB, 233 MB, 636 MB, and 1.4 GB. Transfer times (in seconds) for uploading encrypted files to HDFS illustrate only minor differences among key sizes, with a slight upward trend as file size increases. These results are visually represented in the first section of Figure 10.

4.2.2. AES-CBC (Cipher Block Chaining)

AES-CBC mode demonstrates transfer times that increase substantially with file size, particularly when using 128-bit keys on large files (1.4 GB). For larger datasets, 192- and 256-bit keys offer marginal performance improvements compared to the 128-bit key. Unlike AES-CTR, CBC exhibits greater variability in processing times, indicating increased computational overhead. Consequently, AES-CBC generally underperforms relative to CTR mode for large files and shows heightened sensitivity to both key length and data size, making it less suitable for encrypting very large datasets without further optimization.
The following shows the transfer times (in seconds) to HDFS for files encrypted with AES-CBC using 128-, 192-, and 256-bit keys. For smaller files (101 MB to 233 MB), the 128-bit key consistently yields the fastest transfer times (e.g., 3.54 s for 101 MB), whereas the 256-bit key incurs the highest overhead (6.16 s for 101 MB), indicating a notable performance penalty associated with longer keys on smaller datasets. For medium-sized files (636 MB), the 192-bit key provides the best performance (8.415 s), followed by the 128-bit key (10.61 s), with the 256-bit key remaining the slowest (12.949 s), further underscoring the impact of key length on processing efficiency. Interestingly, for the largest file tested (1.4 GB), the 192- and 256-bit keys outperform the 128-bit key (approximately 22 s versus 29.02 s), suggesting that stronger keys may confer a performance advantage at very large scales. These findings are illustrated in the second section of Figure 10.

4.2.3. AES-GCM (Galois/Counter Mode)

AES-GCM demonstrates consistent and efficient performance for small- and medium-sized files, particularly those between 101 and 636 MB. In this evaluation, encryption and subsequent transfer times to systems like HDFS remain relatively stable regardless of key size. However, for larger files, such as those around 1.4 GB, a noticeable increase in transfer time is observed, particularly when using 192- and 256-bit keys. This performance degradation is due to the increased computational load associated with larger key sizes. Notably, AES-128-GCM outperforms its 192- and 256-bit counterparts for processing very large files, delivering faster processing times while maintaining a robust security profile. Providing both encryption and authentication, AES-GCM guarantees data confidentiality and integrity, making it a preferred solution in scenarios where security is paramount. This is further confirmed by the following detailed measurements. For smaller files (101 MB and 233 MB), the transfer times remain low across all key sizes, with minor variations. As file sizes increase, the impact of the key length becomes more evident. For example, with a 1.4 GB file, the transfer time using a 128-bit key (18.39 s) is significantly lower compared to using a 192-bit key (25.84 s) or a 256-bit key (28.16 s). This clearly illustrates that while AES-GCM ensures high security at all key sizes, using a 128-bit key is the most efficient choice when handling very large files in terms of transfer speed. These observations are further detailed in the final section of Figure 10.
This study, illustrated in Figure 10, evaluates the performance of three AES encryption modes (CTR, CBC, and GCM) across different key sizes (128, 192, and 256 bits) by measuring the time required to transfer encrypted files to the Hadoop Distributed File System (HDFS).

5. Discussion

5.1. Evaluation of AES Modes and Key Impact

From a broader perspective, the comparative evaluation of AES encryption modes—AES-CTR, AES-CBC, and AES-GCM—highlights critical trade-offs among performance, security, and sensitivity to key size. AES-CTR delivers the highest processing speeds, making it especially well-suited for large datasets and real-time streaming applications. However, its security is heavily dependent on the strict management of unique nonces for each encryption operation to prevent vulnerabilities. AES-CBC provides a moderate balance between performance and security but demands careful handling of the initialization vector (IV). Its effectiveness diminishes with increasing file sizes, primarily due to the overhead associated with larger key lengths.
In contrast, AES-GCM offers robust security guarantees by integrating encryption with authentication and data integrity verification. While maintaining efficient performance for small- to medium-sized files, AES-GCM incurs a noticeable increase in processing time for very large files (exceeding 1 GB). Nonetheless, this modest performance penalty is outweighed by its superior security properties, making AES-GCM particularly suitable for environments where both confidentiality and data integrity are critical.
Figure 11 presents a comparative analysis of transfer times to HDFS for the three AES modes using a 256-bit key across varying file sizes. Although AES-GCM exhibits slightly longer processing times than AES-CTR for larger files, this trade-off is justified by the added benefits of authenticated encryption. Unlike AES-CBC and AES-CTR, which do not inherently provide integrity checking, AES-GCM ensures both confidentiality and authenticity—key requirements for secure distributed file systems such as HDFS. Considering these results and the critical need to preserve both data confidentiality and integrity within our system, AES-GCM was selected as the encryption algorithm for this study.

5.2. Comparative Results

To evaluate the performance of our approach compared to existing methods, we performed a comparative analysis with the algorithms proposed by F. Youness et al. [16] and Filaly Youness et al. [7]. The objective of [16] is to improve the security of data stored in distributed HDFS while optimizing key performance indicators, such as encryption and decryption times and throughput. The work in [7] aims to strengthen security in cloud environments through authentication, confidentiality, and dynamic access control techniques.
Figure 12 presents a comparative study of transfer times to HDFS for different file sizes (ranging from 128 MB to 1 GB) between our proposed AES-GCM-based encryption approach (shown in red), the hybrid encryption method introduced by F. Youness et al. [16] (shown in green), and the hybrid encryption method introduced in [7] (shown in blue). To ensure a fair comparison, the RSA encryption time of the AES key (2.6 s, determined by simulation) is subtracted from the transfer times reported for the baseline method. Despite this adjustment, our algorithm demonstrates superior transfer efficiency, particularly as file sizes increase. This performance advantage highlights the lightweight yet effective integration of confidentiality and integrity provided by AES-GCM encryption.
In contrast, the referenced hybrid approach, which combines multiple cryptographic algorithms, incurs substantially higher computational overhead. Consequently, even excluding the RSA key encryption time, our method exhibits better scalability and throughput for secure data transfers within distributed storage environments like HDFS.
While hybrid encryption schemes are generally acknowledged for enhancing security, they typically impose increased computational costs and longer execution times. Our approach, leveraging AES-GCM, achieves comparable data security without these performance penalties. Its high efficiency and built-in authentication mechanisms ensure both data confidentiality and integrity while maintaining low encryption and decryption latencies. These characteristics make AES-GCM a compelling alternative to complex hybrid systems in scenarios where both security and performance are paramount.

5.3. Evaluation of Extended MQTT

In terms of secure communication, we rely on SSL/TLS encryption (on port 8883). In addition, MQTT guarantees the confidentiality and integrity of data by encrypting messages during transmission. This approach protects communications against man-in-the-middle attacks and data interception.
The configured parameters are:
(*) Port 8883 is the default port for secure MQTT connections.
(*) --cafile specifies the certificate authority certificate used to establish a trusted connection.
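The certificate check behind these parameters can be sketched with Python's standard `ssl` module; the CA file path is an illustrative assumption (in our tests the CA certificate is supplied to the client via --cafile):

```python
import ssl

MQTT_TLS_PORT = 8883  # default port for MQTT over TLS

def make_client_context(cafile=None):
    """Build a TLS context that authenticates the broker's certificate."""
    # With Purpose.SERVER_AUTH the context is configured for a client that
    # verifies the server (broker); cafile adds a private CA to the trust store.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=cafile)
    ctx.verify_mode = ssl.CERT_REQUIRED   # reject brokers without a valid chain
    ctx.check_hostname = True             # broker cert must match the hostname
    return ctx

# Pass cafile="ca.crt" (hypothetical path) to trust a self-managed CA.
ctx = make_client_context()
assert ctx.verify_mode == ssl.CERT_REQUIRED
```

A failed chain or hostname check aborts the handshake before any MQTT message is exchanged, which is why no certificate warnings appear in a correctly configured run.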
We observe that the clients successfully established a secure connection with the broker. As Figure 13 shows, messages are exchanged and SSL/TLS encryption ensures the confidentiality of the data during transmission; no errors or warnings related to certificate verification are observed. The transition to a secure communication channel therefore succeeds in guaranteeing the confidentiality and integrity of the information. To meet our security requirements, we use the modified MQTT protocol with TLS, which provides encryption, authentication, and integrity at the transport layer, and we add a second integrity check on the identities of the Hadoop entities (DataNodes and blocks).
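The JWT sent alongside the hashed DataNode identity before connection can be built from the standard library alone. The following is a minimal sketch of HS256 signing and verification; the shared secret, claim names, and identifier values are illustrative assumptions, not the deployed configuration:

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> bytes:
    """Base64url without padding, as used in compact JWT serialization."""
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def make_jwt(claims: dict, secret: bytes) -> bytes:
    """Compact HS256 JWT: header.payload.signature."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signing_input = header + b"." + payload
    sig = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return signing_input + b"." + b64url(sig)

def verify_jwt(token: bytes, secret: bytes) -> bool:
    """Recompute the HMAC over header.payload and compare in constant time."""
    signing_input, _, sig = token.rpartition(b".")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return hmac.compare_digest(b64url(expected), sig)

secret = b"cluster-shared-secret"                      # illustrative secret
dn_hash = hashlib.sha3_256(b"uuid-aaaa").hexdigest()   # H(ID-DNi), sample UUID
token = make_jwt({"sub": "dn-01", "dn_hash": dn_hash}, secret)

assert verify_jwt(token, secret)            # accepted by the broker-side check
assert not verify_jwt(token, b"wrong-key")  # forged or replayed secret rejected
```

Carrying H(ID-DNi) inside a signed claim ties the transport-layer credential to the node-identity check, so a valid TLS session alone is not enough to join the cluster.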

6. Limitations and Future Work

While advanced MQTT variants enhance IoT communication security through encryption and robust authentication, they still have significant shortcomings, particularly in preserving metadata privacy and mitigating denial-of-service (DoS) attacks. Even with payload encryption, topics and identifiers often remain exposed, allowing attackers to obtain sensitive system information, and these protocols provide insufficient protection against connection or message floods, leaving MQTT brokers vulnerable to malicious overloads. To address these weaknesses, we propose a new extended secure MQTT architecture that systematically hashes block and DataNode identities using SHA-3-256. This approach enhances metadata confidentiality by masking otherwise exposed identifiers and increases resilience against DoS attacks through integrity and authenticity verification before any connection is accepted. Our solution preserves the lightweight nature of MQTT while strengthening its security at critical points overlooked by existing improvements. It may, however, require slightly longer computation times because it combines three security levels; this additional workload is offset by a significantly higher level of security than existing methods achieve. For future work, we plan to integrate machine learning for intrusion detection, to analyze in depth the confidential information collected by sensor-based systems with the aim of processing it in real time, and to explore quantum communication technologies as a promising research direction.

7. Conclusions

In this study, we present a three-step strategy to enhance the security of the Hadoop distributed data-processing framework, and we safeguard communication between Hadoop components by employing an extended version of the MQTT protocol. Our approach adds an extra layer of trust by verifying the identities of DataNodes (DNs) during cluster initialization. This verification ensures that only authenticated and trusted DNs participate in the system, thereby protecting the runtime environment from unauthorized node insertion or tampering.
The first step implements a lightweight process that hashes and stores the unique identifiers of the DNs upon their creation. Each time the Hadoop cluster restarts, these hashed identifiers are recalculated and compared against the stored credentials to verify the nodes’ authenticity. Any detected inconsistency triggers an alert and interrupts the startup process to prevent potential security threats.
The second step is executed once the first step is validated. This step involves verifying a hashed identity pair (ID-DN/ID-Block). Each time the Hadoop cluster restarts, the hashed identifiers are regenerated and compared with the stored reference values to confirm the block’s authenticity. Any detected inconsistency triggers an alert.
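The two verification steps above can be sketched with Python's standard `hashlib`. The identifier values and the in-memory reference dictionaries are illustrative; the deployed version persists the reference hashes and raises an alert (interrupting startup) instead of returning a boolean:

```python
import hashlib

def h(value: str) -> str:
    """SHA-3-256 hex digest of an identifier string."""
    return hashlib.sha3_256(value.encode("utf-8")).hexdigest()

# Step 1: reference hashes recorded when the DNs were first created.
reference_dn = {"dn-01": h("uuid-aaaa"), "dn-02": h("uuid-bbbb")}

# Step 2: reference hashes of the (ID-DN / ID-Block) pairs.
reference_pair = {("dn-01", "blk-1073741825"): h("uuid-aaaa/blk-1073741825")}

def verify_datanode(dn_name: str, current_uuid: str) -> bool:
    """Recompute the DN hash at restart and compare with the stored value."""
    return reference_dn.get(dn_name) == h(current_uuid)

def verify_block(dn_name: str, dn_uuid: str, block_id: str) -> bool:
    """Recompute the pair hash H(ID-DN/ID-Block) and compare."""
    return reference_pair.get((dn_name, block_id)) == h(f"{dn_uuid}/{block_id}")

# A matching DN passes; a tampered UUID would trigger the alert path.
assert verify_datanode("dn-01", "uuid-aaaa")
assert not verify_datanode("dn-01", "uuid-evil")
assert verify_block("dn-01", "uuid-aaaa", "blk-1073741825")
```

Because only digests are stored and compared, the check adds one hash computation per node (and per block) at startup while never exposing the raw identifiers.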
The third step concerns the protection of data stored in HDFS. We conducted a comparative analysis of various AES encryption modes (CBC, CTR, and GCM) and key sizes (128, 192, and 256 bits), evaluating their impact on encryption time and HDFS upload latency. Based on this evaluation, AES-GCM was selected for its optimal balance of security and performance, ensuring data confidentiality and integrity with minimal overhead. Although AES-GCM incurs a slightly higher transfer time than CTR and CBC, it offers enhanced security through authentication and integrity, making it a robust and compliant choice for distributed storage systems.
The MQTT protocol is implemented by integrating the hashed identity H(ID-DNi) and a JWT during connection. The results indicate that the transmission time to HDFS is shorter than that of the existing approaches of Filaly et al. (2025) [7] and Filaly et al. (2023) [16]. This improvement is attributed to the simplicity and efficiency of the proposed framework for processing large datasets. Future work will focus on an in-depth analysis of confidential data collected from sensor-based systems for real-time processing, as well as on the exploration of quantum communication technologies.

Author Contributions

Conceptualization: F.K.-A. and A.M.-M.; Methodology: F.K.-A.; Software: F.K.-A.; Validation: F.K.-A. and A.M.-M.; Formal Analysis: F.K.-A.; Investigation: F.K.-A.; Resources: F.K.-A.; Data Management: F.K.-A.; Drafting—First Version: F.K.-A.; Drafting—Revision and Correction: F.K.-A. and A.M.-M.; Visualization: F.K.-A. and A.M.-M.; Supervision: A.M.-M.; Project Administration: A.M.-M.; Funding secured: F.K.-A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data used in this study were generated solely for testing and validation purposes of the proposed system. These data are not publicly available, as they were produced within a controlled experimental environment and do not constitute publicly archived or reusable datasets.

Acknowledgments

The authors thank the Information Security Research Group for their constructive discussions throughout this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kala, K.; Makhloga, K.; Khan, A.; Pandey, A.; Mittal, S. A Framework for Big Data Security Using MapReduce in IoT Enabled Computing. In Proceedings of the 11th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, 15–16 March 2024; pp. 570–573. [Google Scholar]
  2. DemandSage. Google Search Statistics: How Many Google Searches Per Day? 2024. Available online: https://www.demandsage.com/google-search-statistics (accessed on 10 September 2025).
  3. Yassein, M.B.; Shatnawi, M.Q.; Aljwarneh, S.; Al-Hatmi, R. Internet of Things: Survey and open issues of MQTT protocol. In Proceedings of the 2017 International Conference on Engineering & MIS (ICEMIS), Monastir, Tunisia, 8–10 May 2017; IEEE: New York, NY, USA, 2017; pp. 1–6. [Google Scholar]
  4. Azzedin, F.; Alhazmi, T. Secure data distribution architecture in IoT using MQTT. Appl. Sci. 2023, 13, 2515. [Google Scholar] [CrossRef]
  5. Shan, Y.; Su, Y.; Lin, J.; Shan, T. IoT Communication Based on MQTT and OneNET Cloud Platform in Big Data Environment. Preprints 2024, 2024011250. [Google Scholar]
  6. Shvaika, A.; Shvaika, D.; Landiak, D.; Artemchuk, V. A distributed architecture for MQTT messaging: The case of TBMQ. J. Big Data 2025, 12, 224. [Google Scholar] [CrossRef]
  7. Filaly, Y.; Berros, N.; El Bekkali, M.; Younes El Bouzekri, E.L. A Comprehensive Survey on Big Data Privacy and Hadoop Security: Insights into Encryption Mechanisms and Emerging Trends. Results Eng. 2025, 27, 106203. [Google Scholar] [CrossRef]
  8. Wen, C.; Yang, J.; Gan, L.; Pan, Y. Big data driven internet of things for credit evaluation and early warning in finance. Future Generat. Comput. Syst. 2021, 124, 295–307. [Google Scholar] [CrossRef]
  9. Huang, X.; Yi, W.; Wang, J.; Xu, Z. Hadoop-based medical image storage and access method for examination series. Math. Probl. Eng. 2021, 2021, 5525009. [Google Scholar] [CrossRef]
  10. Guan, S.; Zhang, C.; Wang, Y.; Liu, W. Hadoop-based secure storage solution for big data in cloud computing environment. Digit. Commun. Netw. 2024, 10, 227–236. [Google Scholar] [CrossRef]
  11. Patel, C.; Doshi, N. A novel MQTT security framework in generic IoT model. Procedia Comput. Sci. 2020, 171, 1399–1408. [Google Scholar] [CrossRef]
  12. Barton, M.; Budjac, R.; Tanuska, P.; Sladek, I.; Nemeth, M. Advancing small and medium-sized enterprise manufacturing: Framework for IoT-based data collection in Industry 4.0 concept. Electronics 2024, 13, 2485. [Google Scholar] [CrossRef]
  13. Sarkar, S.; Kumar, S.S.; Giri, A.; Dammur, A. Advancing Urban Evacuation Management: A Real-Time, Adaptive Model Leveraging Cloud-Enabled Big Data and IoT Surveillance. In Proceedings of the 2023 4th International Conference on Intelligent Technologies (CONIT), Bangalore, India, 21–23 June 2024; IEEE: New York, NY, USA, 2024; pp. 1–6. [Google Scholar]
  14. Kaur, P.; Sharma, M.; Mittal, M. Big Data and Machine Learning Based Secure Healthcare Framework. Procedia Comput. Sci. 2018, 132, 1049–1059. [Google Scholar] [CrossRef]
  15. Ahmed, S.S.; Abd Al-Nabi, D.L. Using Hadoop to analyze big data for multiple purposes: An applied study according to the map-reduce model. Int. J. Nonlinear Anal. Appl. 2023, 14, 47–62. [Google Scholar]
  16. Filaly, Y.; El Mendili, F.; Berros, N.; El Bouzekri El Idrissi, Y. Hybrid encryption algorithm for information security in Hadoop. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 1297–1302. [Google Scholar] [CrossRef]
  17. Fashakh, A.; Ibrahim, A.A. A proposed secure and efficient Big Data (Hadoop) security mechanism using encryption algorithm. In Proceedings of the 2023 International Conference on Electrical, Computer and Energy Technologies (ICECET), Cape Town, South Africa, 8–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–6. [Google Scholar]
  18. Saritha, G.; Nagalakshmi, V. An efficient approach for BigData security based on Hadoop system using cryptographic techniques. Indian J. Comput. Sci. Eng. 2021, 12, 1027–1037. [Google Scholar] [CrossRef]
  19. Narayanan, U.; Paul, V.; Joseph, S. A novel system architecture for secure authentication and data sharing in cloud enabled Big Data Environment. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 3121–3135. [Google Scholar] [CrossRef]
  20. Alabdulatif, A.; Thilakarathne, N.N.; Kalinaki, K. A Novel Cloud Enabled Access Control Model for Preserving the Security and Privacy of Medical Big Data. Electronics 2023, 12, 2646. [Google Scholar] [CrossRef]
  21. Josphineleela, R.; Kaliappan, S.; Natrayan, L.; Garg, A. Big Data Security through Privacy—Preserving Data Mining (PPDM): A Decentralization Approach. In Proceedings of the Second International Conference on Electronics and Renewable Systems (ICEARS-2023), Tuticorin, India, 2–4 March 2023; IEEE: New York, NY, USA, 2023; pp. 718–721. [Google Scholar]
  22. Nayini, S.; Kandlakunta, A.R. Big Data Hadoop: Security, Privacy, Performance and Scalability. 2024. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5071036 (accessed on 11 February 2026).
  23. Palit, S.; Roy, C.S. Securing Big Data with Hadoop in enterprise information systems. In Proceedings of the International Conference on Research in Education and Science (ICRES 2024), ISTES, Antalya, Turkey, 27–30 April 2024; ISTES: San Antonio, TX, USA, 2024; pp. 1402–1416. [Google Scholar]
  24. Dandekar, M.; Lote, S.; Dandekar, P. Implementing the power of Big Data analytics. In Proceedings of the 2024 IEEE 3rd International Conference on Electrical Power and Energy Systems (ICEPES), MANIT Bhopal, India, 21–22 June 2024; IEEE: New York, NY, USA, 2024. [Google Scholar]
  25. Singla, R.; Kaur, N.; Koundal, D.; Bharadwaj, A. Challenges and developments in secure routing protocols for healthcare in WBAN: A comparative analysis. Wirel. Pers. Commun. 2022, 122, 1767–1806. [Google Scholar] [CrossRef]
  26. Kawaguchi, R.; Bandai, M. Edge based MQTT broker architecture for geographical IoT applications. In Proceedings of the 2020 International Conference on Information Networking (ICOIN), Barcelona, Spain, 7–10 January 2020; IEEE: New York, NY, USA, 2020; pp. 232–235. [Google Scholar]
  27. Nayak, M.; Patro, P.; Awotunde, J.B.; Gupta, S.K. IoT Security Architectures and Protocols. In Security Paradigms in 6G Smart Cities and IoT Ecosystems; CRC Press: Boca Raton, FL, USA, 2025; pp. 52–71. [Google Scholar]
  28. Nadeem, M.; Mustafa, R.; Abi-Char, P.E.; Tucker, R.S. A Study of Security Threats in IoT Network Layer using MQTT and TLS. In Proceedings of the 2025 12th International Conference on Information Technology (ICIT), Amman, Jordan, 27–30 May 2025; IEEE: New York, NY, USA, 2025; pp. 161–166. [Google Scholar]
  29. Sharma, A.; Malviya, R.; Gupta, R. Big data analytics in healthcare. In Cognitive Intelligence and Big Data in Healthcare; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2022; pp. 257–301. [Google Scholar] [CrossRef]
  30. Karatas, M.; Eriskin, L.; Deveci, M.; Pamucar, D.; Garg, H. Big Data for Healthcare Industry 4.0: Applications, challenges and future perspectives. Expert Syst. Appl. 2022, 200, 116912. [Google Scholar] [CrossRef]
  31. Chien, H.Y.; Shih, A.T.; Huang, Y.M. Exploring MQTT Broker-Based, End-to-End Models for Security and Efficiency. Sensors 2025, 25, 5308. [Google Scholar] [CrossRef]
  32. Alharbi, S.; Awad, W.; Bell, D. HECS4MQTT: A Multi-Layer Security Framework for Lightweight and Robust Encryption in Healthcare IoT Communications. Future Internet 2025, 17, 298. [Google Scholar] [CrossRef]
  33. Pajooh, H.H.; Rashid, M.A.; Alam, F.; Demidenko, S. IoT Big Data provenance scheme using blockchain on Hadoop ecosystem. J. Big Data 2021, 8, 114. [Google Scholar] [CrossRef]
  34. Dworkin, M.J. Recommendation for Block Cipher Modes of Operation: The CBC, CFB, OFB, CTR, and XTS Modes. NIST 2001. [Google Scholar] [CrossRef]
  35. Vidhya, S. Enhancing Cloud Security for Structured Data: An AES-GCM Based Format-Preserving Encryption Approach. In Artificial Intelligence Based Smart and Secured Applications, Proceedings of the International Conference on Advancements in Smart Computing and Information Security, Rajkot, India, 16–18 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 196–205. [Google Scholar]
  36. Dworkin, M.J. Recommendation for Block Cipher Modes of Operation: Galois/Counter Mode (GCM) and GMAC. NIST 2007. [Google Scholar] [CrossRef]
Figure 1. Architecture of the enhanced MQTT in the Hadoop platform.
Figure 2. Three-phase defense for Hadoop.
Figure 3. Hadoop architecture.
Figure 4. Overall process to guarantee integrity.
Figure 5. Execution scenario of the extended MQTT.
Figure 6. Sending the token for authentication.
Figure 7. Discovery mechanism for the extended MQTT.
Figure 8. Execution process of the enhanced MQTT.
Figure 9. Data integrity check in the DataNode and alerting in Hadoop.
Figure 10. Impact of key size on HDFS transfer time with the AES-CTR, AES-CBC, and AES-GCM algorithms.
Figure 11. Transfer times to HDFS with AES modes (256-bit keys) across file sizes.
Figure 12. Comparison of transfer times to HDFS: our work vs. other works [7,16].
Figure 13. Secure communication test after using the extended MQTT.
Table 2. Explanation of the used abbreviations.

Symbol Used | Description
ID-reference | DN identifier created at the first startup (reference identity)
Current-ID | DN identifier created at a new startup (current identity)
ID-DN | DN identifier
H_reference | Reference hash of the ID-DN
H_current | New hash of the ID-DN
ID-Blocks | Identifier of blocks
H_current (pair) | Hash of the pair (ID-DN/ID-Block) computed at the first startup
H_reference (pair) | Reference hash of the pair (ID-DN/ID-Blocks)
H_actual (pair) | New hash of the pair (ID-DN/ID-Blocks)

Table 3. Description of the used abbreviations.

Symbol Used | Description
UUID | Universally unique identifier assigned to a DN
start-all.sh | Shell script that starts Hadoop
verify_all_datanode_hashes.sh | Shell script that verifies all DN hashes
AES in CTR, CBC, and GCM modes | AES encryption modes: CTR (Counter), CBC (Cipher Block Chaining), GCM (Galois/Counter Mode)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

