1. Introduction
The Internet of Things (IoT) comprises a vast network of heterogeneous devices equipped with sensors, actuators, software, and network connections [1]. In recent years, the proliferation of IoT devices has accelerated: in 2020, the number of operational IoT devices worldwide exceeded 31 billion, with forecasts projecting an increase to 75 billion by 2025 [2]. As these numbers grow, IoT systems continue to transform multiple sectors through devices such as smart thermostats, wearable fitness trackers, business sensors, and autonomous vehicles [1]. However, this expanding architecture introduces significant security and privacy concerns, as malicious actors can exploit device vulnerabilities to compromise sensitive data [3]. Consequently, robust security measures are necessary to protect IoT devices and the data they transmit.
To address these vulnerabilities, machine learning (ML) techniques have been increasingly adopted to enhance security in IoT contexts. An essential component of this defense ecosystem is the intrusion detection system (IDS) [
4], which monitors network traffic and device activity to detect malicious behavior or policy violations in real time. IDSs are generally classified into two main analytical approaches: signature-based and anomaly-based detection. The former identifies attacks by matching observed patterns against a database of known threat signatures, providing high accuracy for previously recorded attacks but limited adaptability to emerging threats. In contrast, anomaly-based detection involves constructing a model of normal device or network behavior and flagging any significant deviations that may indicate intrusions, cyberattacks, or system malfunctions [
4]. Due to their ability to detect novel and unknown attacks, anomaly-based detection approaches have become increasingly crucial in IoT environments. However, traditional ML-based anomaly detection still faces challenges such as unauthorized data exposure and reliance on centralized architectures, potentially introducing single points of failure [
5]. Consequently, recent research has focused on the use of decentralized learning paradigms, such as federated learning (FL), to address these limitations and improve scalability, privacy, and resilience in IoT-based intrusion detection systems.
Federated learning (FL) has emerged as a promising paradigm for addressing the challenges of ML-based anomaly detection. This approach enables collaborative model training across distributed devices without sharing raw data; instead, only model updates are sent to a central aggregator, which combines them. FL therefore preserves privacy and can accommodate the heterogeneous data held by different devices. However, frequent model aggregation and communication with a central server can increase latency and bandwidth consumption, limiting the scalability of FL in large IoT deployments and negatively impacting overall performance [6]. These constraints have motivated the development of more efficient federated architectures.
Hierarchical federated learning (HFL) addresses these challenges by introducing an intermediate fog layer, which represents a distributed computing tier positioned between IoT devices and the cloud [
7]. It consists of multiple fog servers that provide localized storage, computation, and aggregation services closer to data sources [
7]. This layer locally aggregates updates from closely related IoT devices before final cloud aggregation [
8,
9]. This structure improves scalability and reduces latency by distributing the computations and reducing network traffic. However, the use of fog servers introduces new security and trust concerns; for example, if a fog server is compromised or acts maliciously, it can tamper with model updates and undermine the integrity of the global model [
10]. Thus, ensuring the security and authenticity of the global model is critical to maintaining system reliability.
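The two-stage aggregation described above can be illustrated with a minimal sketch. It assumes sample-weighted averaging (FedAvg-style) at both the fog and global levels, which this section does not specify; the function names and toy weight vectors are hypothetical.

```python
import numpy as np

def weighted_average(updates, sizes):
    """Sample-weighted average of a list of flat weight vectors."""
    sizes = np.asarray(sizes, dtype=float)
    stacked = np.stack(updates)                      # shape: (n_updates, n_params)
    return (stacked * sizes[:, None]).sum(axis=0) / sizes.sum()

def hierarchical_aggregate(fog_groups):
    """fog_groups: one (client_updates, client_sizes) pair per fog server."""
    fog_models, fog_sizes = [], []
    for updates, sizes in fog_groups:
        fog_models.append(weighted_average(updates, sizes))  # fog-level aggregation
        fog_sizes.append(sum(sizes))                         # samples behind each fog model
    return weighted_average(fog_models, fog_sizes)           # final (global) aggregation

# Two hypothetical fog servers, each serving two clients
fog_a = ([np.array([1.0, 1.0]), np.array([3.0, 3.0])], [100, 300])
fog_b = ([np.array([5.0, 5.0]), np.array([7.0, 7.0])], [200, 200])
global_model = hierarchical_aggregate([fog_a, fog_b])
```

Under this weighting, the hierarchical result equals the flat average over all four clients, so the fog layer changes where the work and traffic occur rather than what is computed.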
To strengthen trust in HFL environments, blockchain technology has gained attention as a complementary solution. A blockchain operates as a decentralized ledger that records transactions across a peer-to-peer network using cryptographic hashes, ensuring transparency, immutability, and tamper-resistance [
11]. When integrated with FL, a blockchain ensures model provenance and guarantees that the recorded updates cannot be altered or deleted, thus preserving model integrity. At the same time, only authorized fog servers can store the aggregation results [
12].
This study proposes Block-HFL, a blockchain-enabled hierarchical federated learning framework developed to enhance the security, scalability, and trustworthiness of anomaly detection in IoT environments. This framework is designed for deployment in real-world IoT environments that require distributed intelligence and strong data confidentiality, with typical application domains including smart healthcare (e.g., wearable sensors that monitor patient health) and industrial IoT (IIoT; e.g., anomaly detection in production systems or factory networks). These domains involve large numbers of heterogeneous, resource-constrained devices that cannot rely on a single centralized server due to privacy risks and scalability limitations. The Block-HFL framework addresses these challenges by distributing the computations among fog servers and employing blockchain-based accountability to ensure integrity and trust. The contributions of this work are as follows:
We propose a blockchain-enabled hierarchical federated learning (Block-HFL) framework in which fog servers perform partial aggregations and the elected leader conducts the final aggregation. This architecture minimizes the communication overhead and improves scalability.
We incorporate blockchain-based authentication into the model to authorize participating servers and immutably record global model updates using smart contracts and the InterPlanetary File System (IPFS).
We propose an accuracy-based leader election mechanism in the HFL context, where fog servers locally aggregate client updates, and the fog server with the highest validation accuracy is dynamically selected as the leader for global aggregation. This adaptive strategy enhances model convergence and fairness.
We validated the framework using the accuracy, precision, recall, and F1-score, and compared these values with those reported in relevant works. The results demonstrated a reduced training time and the maintenance of a high accuracy across 4, 8, and 16 clients, highlighting the robustness and scalability of the proposed approach.
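As a rough illustration of the accuracy-based leader election named in the contributions, the sketch below selects the fog server with the highest reported validation accuracy. The tie-breaking rule (smallest identifier wins) and the server identifiers are assumptions added for determinism, not details taken from the framework.

```python
def elect_leader(validation_accuracy):
    """Return the ID of the fog server with the highest validation accuracy.

    Ties are broken by the lexicographically smallest ID (an assumption).
    """
    if not validation_accuracy:
        raise ValueError("no fog servers reported an accuracy")
    return min(validation_accuracy, key=lambda fid: (-validation_accuracy[fid], fid))

# Hypothetical accuracies reported by three fog servers in one round
round_scores = {"fog-0": 0.9412, "fog-1": 0.9587, "fog-2": 0.9587}
leader = elect_leader(round_scores)
```

Because the election is repeated each round, the leader role can move between fog servers as their locally aggregated models improve.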
The remainder of this article is structured as follows:
Section 2 reviews recent advancements in FL for IoT and intrusion detection.
Section 3 details the proposed methodology.
Section 4 presents and analyzes the experimental data.
Section 5 discusses the results and key findings. Finally,
Section 6 concludes the article and outlines potential future research directions.
4. Experiments
This section presents the experimental methodology for the proposed model.
4.1. Experimental Setup
The experiments were carried out on a Windows 10 machine equipped with an Intel Core i7 CPU (2.80 GHz) and 16 GB of RAM. Visual Studio Code (v1.106.2) served as the development environment, and all components were implemented in Python 3.10. Although a single machine was used, the distributed IoT–fog–blockchain architecture was fully emulated by running each entity as an independent Python process communicating over local gRPC channels. This setup enabled parallel execution of multiple federated learning clients and fog servers, and supported local blockchain transactions via Ganache. Global model weights were stored off-chain using an IPFS node accessed via the IPFS HTTP Client (v0.7.0). Model construction and training were performed using TensorFlow 2.19.0 and Keras 2.19.0. NumPy 2.2.5 was used for numerical computations, and Pandas 2.2.3 for data preprocessing, including loading and organizing the dataset shards. The gRPC framework (v1.71.0; Google LLC) facilitated remote procedure calls between IoT devices and fog servers. For blockchain integration, the Web3.py library (v7.11.1) [38] was used to deploy and interact with the smart contract implemented in Solidity. The smart contract recorded the content identifiers (CIDs) of the global models stored on IPFS.
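The idea of keeping model weights off-chain and recording only their identifier can be sketched without a blockchain node. The example below is an in-memory stand-in, not the Solidity contract or the Web3.py calls used in the experiments: a SHA-256 digest takes the place of a real IPFS CID, and the class and method names are hypothetical.

```python
import hashlib
import json

def content_id(weights_bytes):
    """SHA-256 digest used here in place of a real IPFS CID (assumption)."""
    return hashlib.sha256(weights_bytes).hexdigest()

class ModelRegistry:
    """In-memory stand-in for the smart contract: append-only, authorized writers only."""

    def __init__(self, authorized):
        self._authorized = set(authorized)
        self._records = []

    @staticmethod
    def _digest(body):
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def record(self, sender, round_id, cid):
        if sender not in self._authorized:
            raise PermissionError(f"{sender} is not an authorized fog server")
        prev = self._records[-1]["entry_hash"] if self._records else "0" * 64
        body = {"round": round_id, "cid": cid, "sender": sender, "prev": prev}
        self._records.append({**body, "entry_hash": self._digest(body)})
        return self._records[-1]["entry_hash"]

    def verify(self):
        """Recompute every hash link; any edited record breaks the chain."""
        prev = "0" * 64
        for rec in self._records:
            body = {k: rec[k] for k in ("round", "cid", "sender", "prev")}
            if rec["prev"] != prev or rec["entry_hash"] != self._digest(body):
                return False
            prev = rec["entry_hash"]
        return True

registry = ModelRegistry(authorized={"fog-1"})
registry.record("fog-1", 1, content_id(b"round-1 weights"))
registry.record("fog-1", 2, content_id(b"round-2 weights"))
ok_before = registry.verify()
registry._records[0]["cid"] = "tampered"   # simulate tampering with a stored record
ok_after = registry.verify()
```

The hash links make later modification of a recorded identifier detectable, which is the property the on-chain record is meant to provide.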
4.2. Dataset and Data Distribution
The Edge-IIoTset dataset [13] is a recent cybersecurity dataset designed explicitly for evaluating IDSs in IoT environments. This dataset was selected for its comprehensive representation of modern IIoT infrastructures that integrate emerging technologies. These technologies collectively emulate realistic IIoT communication scenarios, making Edge-IIoTset well-suited for training and evaluating machine learning-based IDS models, particularly within FL frameworks. The dataset provides 61 highly correlated features selected from an original set of 1176 extracted features. Additional details about the dataset can be found in the Data Availability statement below.
The dataset was generated from a diverse range of IoT devices, including temperature and humidity, ultrasonic, water level, pH, soil moisture, heart rate, and flame sensors. It contains normal traffic and 14 attack types categorized into five main groups: (1) distributed denial of service (DDoS) attacks, (2) scanning attacks, (3) man-in-the-middle (MITM) attacks, (4) injection attacks, and (5) malware attacks. To be representative of real-world FL scenarios, the data distribution across clients should be non-IID and imbalanced. For experimental purposes, the Edge-IIoTset dataset was therefore partitioned into several local datasets to facilitate training in accordance with these requirements.
Before distribution to clients, the dataset underwent a systematic data cleaning and pre-processing pipeline to ensure consistency and reduce noise. Duplicate entries and missing values, such as ‘NaN’ and ‘INF’, were removed. The attack type column was converted to the int32 data type, meeting TensorFlow’s label format requirements for training. Continuous features were normalized using Min–Max normalization, which scales all values into the interval [0, 1], thus ensuring uniform feature contributions during training.
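The cleaning steps above can be sketched with Pandas as follows. This is a minimal illustration rather than the study's pipeline: the column names, the toy data, and the handling of constant columns are assumptions.

```python
import numpy as np
import pandas as pd

def preprocess(df, label_col, feature_cols):
    """Drop INF/NaN rows and duplicates, cast labels to int32, Min-Max scale features."""
    df = df.replace([np.inf, -np.inf], np.nan)       # treat INF like a missing value
    df = df.dropna().drop_duplicates().reset_index(drop=True)
    df[label_col] = df[label_col].astype("int32")
    lo = df[feature_cols].min()
    hi = df[feature_cols].max()
    span = (hi - lo).replace(0, 1)                   # constant columns: avoid dividing by zero
    df[feature_cols] = (df[feature_cols] - lo) / span
    return df

# Hypothetical toy frame: one duplicate row and one row containing INF
raw = pd.DataFrame({
    "f1": [0.0, 5.0, 10.0, 10.0, np.inf],
    "f2": [1.0, 2.0, 3.0, 3.0, 4.0],
    "label": [0, 1, 2, 2, 1],
})
clean = preprocess(raw, label_col="label", feature_cols=["f1", "f2"])
```

In a federated setting, the minimum and maximum used for scaling would need to be agreed on before the data are split; here they are computed on the whole frame because normalization precedes distribution to clients.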
After preparing the dataset, 10% of the samples were selected via stratified sampling to ensure that all classes were represented. This subset was used exclusively as an IID evaluation shard: each fog server evaluated its locally aggregated model on this subset to determine the leader for the current round, and the selected leader subsequently used the same subset to evaluate the final global model before storing it on the blockchain.
The remaining 90% was divided among the clients. Each client received a unique data shard generated in a non-IID manner, reflecting the heterogeneity of typical IIoT networks.
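A stratified hold-out of this kind can be sketched with NumPy. The example assumes that stratification draws roughly the same fraction from every class; the function name, seed, and toy labels are hypothetical.

```python
import numpy as np

def stratified_holdout(labels, fraction=0.10, seed=0):
    """Return (eval_idx, train_idx); each class contributes about `fraction` of its samples."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    eval_parts = []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_eval = max(1, int(round(fraction * len(idx))))   # keep at least one sample per class
        eval_parts.append(idx[:n_eval])
    eval_idx = np.sort(np.concatenate(eval_parts))
    train_idx = np.setdiff1d(np.arange(len(labels)), eval_idx)
    return eval_idx, train_idx

# Hypothetical labels: 80 normal samples (0) and 20 attack samples (1)
toy_labels = np.array([0] * 80 + [1] * 20)
eval_idx, train_idx = stratified_holdout(toy_labels)
```

The indices in `train_idx` correspond to the portion that would then be divided among the clients.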
Table 4 displays the partitioning of the dataset into training and test sets. To assess the scalability and communication efficiency, the experiments used 4, 8, or 16 clients, with each number representing a different federated configuration.
The experiments were conducted on the Edge-IIoTset dataset, which is widely used in research on anomaly and intrusion detection in IoT and IIoT environments. It was divided among four federated clients in the first experiment. To simulate a non-IID environment, a Dirichlet distribution with concentration parameter α was used to obtain unequal data splits (lower α values create greater heterogeneity and more unbalanced data) [39]. This method was utilized as it produces diverse data distributions across clients, better reflecting real IoT network conditions.
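One common way to realize a Dirichlet-based non-IID split is to draw, for each class, a vector of client shares and cut that class's samples accordingly. The sketch below follows this label-wise scheme as an assumption; the concentration value 0.5 is only a placeholder and is not the value used in the experiments.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices among clients using per-class Dirichlet shares."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        shares = rng.dirichlet(alpha * np.ones(n_clients))      # shares sum to 1
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)  # split points for this class
        for client, part in enumerate(np.split(idx, cuts)):
            client_idx[client].extend(part.tolist())
    return [np.array(sorted(ix), dtype=int) for ix in client_idx]

# Hypothetical labels for three classes; alpha=0.5 is a placeholder value
toy_labels = np.repeat([0, 1, 2], [600, 300, 100])
parts = dirichlet_partition(toy_labels, n_clients=4, alpha=0.5)
sizes = [len(p) for p in parts]
```

Every sample is assigned to exactly one client, while both the shard sizes and the per-client class mix vary; smaller concentration values make the shares more uneven.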
Figure 6 shows the data distributions for the four clients. The left chart shows the total number of samples, while the right chart shows the distribution by attack type. The datasets were highly imbalanced, as evidenced by Client 1 receiving the most samples (approximately 848,000) and Client 3 receiving the fewest (about 47,000). The class distributions also differed substantially among clients. Specifically, Clients 0 and 1 had a higher proportion of normal traffic than Clients 2 and 3, which had notably more scanning and injection attack samples. These disparities enabled a thorough assessment of the federated learning model’s robustness and generalization under highly heterogeneous data conditions. The dataset partitions remained fixed throughout all training rounds; no resampling or redistribution was performed, ensuring that each client preserved the same non-IID distribution across the entire experiment.
4.3. Evaluation Metrics
The performance of the proposed Block-HFL framework was evaluated using several standard metrics in order to quantify both the effectiveness and computational efficiency of the developed models.
True Positives (TPs): The number of attack instances that were correctly classified into their respective attack categories (e.g., DDoS, MITM, malware) by the model.
True Negatives (TNs): The number of normal instances that were correctly classified as normal.
False Positives (FPs): The number of normal instances that were incorrectly classified as an attack category.
False Negatives (FNs): The number of attack instances that were incorrectly classified either as normal or as another wrong attack category.
Accuracy is defined as the proportion of correctly classified instances and is calculated as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision measures the proportion of correctly identified attack instances among all the instances predicted as attacks, and is calculated as follows:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall quantifies the proportion of actual attack instances that are correctly identified by the model:
$$\text{Recall} = \frac{TP}{TP + FN}$$
The F1-score is the harmonic mean of precision and recall, balancing the two, and is calculated as follows:
$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
The training time measures the duration required for each IoT client to complete local model training on its own dataset. It is computed as
$$T_{\text{train}} = t_{\text{end}} - t_{\text{start}},$$
where $t_{\text{start}}$ and $t_{\text{end}}$ denote the timestamps recorded at the beginning and end of the local training process on each client device.
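The TP, TN, FP, and FN definitions given above can be turned into a short computation. The sketch below applies the standard accuracy, precision, recall, and F1 formulas to those counts; treating label 0 as the normal class and the toy predictions are assumptions made for illustration.

```python
import numpy as np

def detection_metrics(y_true, y_pred, normal_label=0):
    """Compute metrics from TP/TN/FP/FN as defined in this section."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    is_attack = y_true != normal_label
    correct = y_true == y_pred
    tp = int(np.sum(is_attack & correct))     # attacks assigned to the right attack class
    tn = int(np.sum(~is_attack & correct))    # normal traffic classified as normal
    fp = int(np.sum(~is_attack & ~correct))   # normal traffic flagged as an attack
    fn = int(np.sum(is_attack & ~correct))    # attacks seen as normal or as the wrong attack
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 0 = normal, 1 and 2 = two attack classes
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 0, 2, 1, 2]
metrics = detection_metrics(y_true, y_pred)
```

For the toy labels, the counts are TP = 3, TN = 2, FP = 1, and FN = 2. Local training time could be measured in the same spirit by taking the difference of two `time.perf_counter()` readings around the training call.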
4.4. Model Configuration
Table 5 summarizes the main parameters and the CNN architecture used in the proposed federated anomaly detection model. A single CNN architecture was employed across all IoT devices. The chosen configuration—a lightweight 1D CNN with one convolutional layer (64 filters, kernel size of 3), followed by a Dense layer with 32 hidden nodes—was selected to balance detection performance with computational efficiency, making it suitable for resource-constrained IoT environments.
5. Results and Discussion
The experimental evaluation was conducted across three federated settings with 4, 8, and 16 clients to examine the behavior of the proposed Block-HFL framework at increasing scales and levels of heterogeneity. The 4-client configuration was used for illustrative analyses, including the visualization of non-IID data distributions, as it provides a clear and interpretable baseline. Additionally, the impacts of hierarchical aggregation, blockchain integration, and varying clients (4, 8, and 16) on scalability, efficiency, and security are examined to demonstrate the overall effectiveness of the proposed system.
5.1. Model Performance Analysis
In this section, the performance of the individual local models trained by each client before global aggregation is evaluated. Each client operated on a distinct, non-IID subset of the Edge-IIoTset dataset, as described in
Section 4.2.
Figure 7 illustrates the evolution of the accuracy, precision, recall, and F1-score across ten federated training rounds for each client in the four-client setup.
Figure 7a shows that the accuracy increased substantially during the initial rounds and stabilized after the third round. Clients 0, 1, and 3 demonstrated consistently high accuracy, exceeding 97% throughout, whereas Client 2 exhibited slower convergence, with accuracy rising from 82% to approximately 85%. This disparity can be explained by the heterogeneity of the local datasets; specifically, that of Client 2 presented reduced size and limited diversity. The trends in the precision, recall, and F1-score shown in
Figure 7b–d mirror those observed for the accuracy. Clients 0, 1, and 3 consistently achieved precision, recall, and F1-score values above 97%, indicating strong and stable classification performance. Conversely, Client 2’s scores—while lower—demonstrated gradual improvement over the rounds. These findings underscore the substantial effect of data heterogeneity, thereby reaffirming the pivotal role of data distribution in determining performance and convergence behaviors in federated learning scenarios.
Figure 8 shows the global model’s confusion matrix under the four-client setup after the 10th round and details how each traffic type was classified. The model classified normal and MITM traffic perfectly, with 100% of the samples in each class correctly identified. Injection and DDoS traffic were recognized at rates of 96.46% and 93.03%, respectively, and scanning attacks at 75.61%. Malware was the weakest class, identified correctly only 49.75% of the time, which indicates that some malicious patterns overlap with other attacks. Overall, the model distinguished most attack types reliably despite some confusion between overlapping feature spaces.
5.2. Block-HFL Evaluation for Different Numbers of Clients
We deployed our model for FL experiments using different numbers of clients; namely, n = 4, n = 8, and n = 16. The non-IID data distributions for n = 8 and n = 16 used for the experiments are provided in the Supplementary Materials (Figures S1–S2), and detailed confusion matrices for the global models are presented in Figures S3–S4.
Table 6 compares the accuracy outcomes of the global models, the best clients, and the worst clients after the first and tenth FL rounds. At the beginning of training (first round), the 4-client configuration achieved a global accuracy of 92.08%. This increased to 95.64% by the tenth round, indicating steady convergence. Similarly, that of the 8-client setup improved from 93.44% to 94.96%, while the 16-client setup improved from 94.81% to 95.92%, representing the highest overall global accuracy among all the configurations. In terms of the local performance, the best local accuracies (B) remained consistently high across all sets, ranging from 98.93% to 99.34%. In contrast, the worst local accuracies (W) exhibited noticeable variation. The lowest value, 82.12%, was observed for the 4-client setup in the first round. Overall, the 16-client configuration in the tenth round achieved the highest accuracy levels across all metrics. These results collectively confirm that both the global performance and learning consistency are maintained while increasing the number of clients.
Figure 9 presents the average global model accuracy obtained under different client configurations.
Figure 10 shows the accuracy progression across ten federated rounds for each setting. The results demonstrate that the proposed Block-HFL framework maintained stable convergence and a consistent performance as the number of clients increased. The average global accuracy remained high, ranging from 93.97% with 4 clients to 94.16% with 8 clients and 95.46% with 16 clients. This convergence indicates that hierarchical aggregation enables the model to preserve effectiveness and generalization, even as more clients participate. The consistent performance across configurations validates the scalability and robustness of the proposed Block-HFL architecture under non-IID data conditions.
Figure 11 compares the average training time for a centralized aggregation approach and the proposed Block-HFL framework across 4, 8, and 16 clients, with each value averaged over ten federated training rounds. Both approaches became slower as the number of clients grew: centralized aggregation rose from 320.9 s to 677.1 s (an increase of 356.2 s), while Block-HFL rose from 116.7 s to 379.7 s (an increase of 263.0 s). Block-HFL was therefore faster at both ends of the range, and its absolute increase was smaller, although its relative increase was larger because it started from a much lower baseline. The shorter training times highlight the robustness of Block-HFL in distributed environments, where hierarchical aggregation at the fog layer mitigates the computational and communication burden on a single central server. Decentralized coordination and parallel local processing significantly reduce the latency, enabling scalable and real-time anomaly detection in industrial IoT scenarios.
5.3. Blockchain Integration
The integration of a blockchain into the Block-HFL framework was evaluated to determine its transaction time, gas consumption, and overall system latency. Blockchain transaction time was measured as the duration between submitting the global model transaction and receiving its mined receipt, while gas consumption was returned by Ganache. System latency was computed as the end-to-end delay from initiating the global model to the successful retrieval of the corresponding CID from the blockchain.
Figure 12 presents the blockchain transaction time across ten federated rounds, which ranged from 0.04 to 0.12 s. These low values indicate that recording the global model on-chain introduces a negligible delay to the training process. In this framework, the global model is stored off-chain using IPFS, while only the corresponding IPFS hash is recorded on the blockchain. This approach enables secure and verifiable model tracking without incurring significant on-chain storage requirements. Similarly,
Figure 13 displays the gas usage per round, which remained nearly constant at approximately 175,000 to 205,000 gas units, indicating the cost stability of blockchain operations. This stability resulted from each transaction executing the same smart contract function with a fixed computational complexity.
Figure 14 compares the total round latency to that associated with the blockchain submission through to client retrieval, indicating that the blockchain operations account for less than 3% of the total communication time. These results demonstrate that the blockchain layer provides secure, transparent, and lightweight global model storage with minimal performance overhead, thus supporting the practicality and efficiency of the proposed Block-HFL system.
5.4. Comparison with Related Works
Table 7 provides a comparative overview of the proposed model and related FL-based IDS approaches. The comparison encompasses multiple dimensions, including the publication year, dataset, data distribution (non-IID), classifier architecture, blockchain used, number of participating clients, and each study’s primary contributions. The number of clients varied across different methods. In the proposed Block-HFL system, client counts of
n = 4, 8, and 16 were selected to achieve balanced hierarchical aggregation. This configuration ensured that the clients were evenly distributed among the fog servers, thereby supporting synchronized updates and equitable leader election.
Figure 15 shows the final global model accuracy for three sets of federated clients, comparing the Block-HFL model with the results of Ferrag et al. [
13] and Rashid et al. [
23]. The Block-HFL framework consistently reached a higher global accuracy in all cases, showing better scalability and stability when the data are not evenly distributed. Ferrag et al. reported accuracies of 86.86%, 91.10%, and 92.95% with 3, 9, and 15 clients, respectively, while Rashid et al. achieved 90.58%, 90.73%, and 90.18% with the same numbers of clients. In comparison, the Block-HFL model achieved 95.64%, 94.96%, and 95.92% with 4, 8, and 16 clients, respectively, indicating the highest stability and accuracy. These results suggest that integrating hierarchical aggregation with a blockchain improves both performance and fairness.
Figure 16 compares the average training time of the proposed Block-HFL framework with that achieved by Rashid et al. [
23] across three client sets. The study by Ferrag et al. was excluded from this comparison because their performance metrics did not report training time, which precluded a direct quantitative comparison. The results indicate that the training time increased with the number of participating clients in both approaches due to the larger volume of local updates per round. The proposed Block-HFL framework had a slightly higher training time. This was primarily due to the larger and more comprehensive dataset used in this study. Rashid et al. used a smaller subset of the Edge-IIoTset, with about 0.17 million samples. In contrast, the proposed model operated on a larger part of the Edge-IIoTset dataset, which contains roughly 1.59 million samples. This greater data volume increased the amount processed in each federated round, which, in turn, prolonged the local training duration per client. However, using a larger dataset enhanced the model’s generalization and ensured that the resulting global model better represented diverse IoT traffic behaviors.
6. Conclusions and Future Work
The Block-HFL framework was proposed for anomaly detection in IoT environments. This approach combines hierarchical aggregation with blockchain-based authentication and storage to provide high detection accuracy and enhanced scalability, while using an accuracy-based leader election mechanism to ensure reliable model updates. Experimental evaluation on the Edge-IIoTset dataset demonstrated the robust performance of the model under non-IID data conditions; it maintained a global accuracy above 94% while significantly reducing the communication overhead and latency. The integration of a blockchain ensures the security, transparency, and resistance to tampering of the final global model with only a minor increase in the computational overhead.
Although this study demonstrated the proposed framework’s effectiveness, it was conducted in a simulated environment using a private Ethereum network, which may differ from real-world blockchain deployments. Future research should focus on implementing the framework in actual IoT infrastructure and addressing practical challenges such as the computational load, memory limitations, and battery consumption. Integrating advanced privacy-preserving techniques, such as differential privacy or homomorphic encryption, could further enhance data confidentiality. Expanding the model to incorporate diverse deep learning architectures, such as RNNs or transformers, and evaluating the performance across a broader range of IoT datasets are also recommended for future work.