Federated Learning for IoT Intrusion Detection

The number of Internet of Things (IoT) devices has increased considerably in the past few years, resulting in a large growth of cyber attacks on IoT infrastructure. As part of a defense in depth approach to cybersecurity, intrusion detection systems (IDSs) have acquired a key role in attempting to detect malicious activities efficiently. Most modern approaches to IDS in IoT are based on machine learning (ML) techniques. The majority of these are centralized, which implies the sharing of data from source devices to a central server for classification. This presents potentially crucial issues related to the privacy of user data as well as challenges in data transfers due to their volumes. In this article, we evaluate the use of federated learning (FL) as a method to implement intrusion detection in IoT environments. FL is an alternative, distributed method to centralized ML models, which has recently seen a surge of interest in IoT intrusion detection. In our implementation, we evaluate FL using a shallow artificial neural network (ANN) as the shared model and federated averaging (FedAvg) as the aggregation algorithm. The experiments are completed on the ToN_IoT and CICIDS2017 datasets in binary and multiclass classification. Classification is performed by the distributed devices using their own data. No sharing of data occurs among participants, maintaining data privacy. When compared against a centralized approach, results show that a collaborative FL IDS can be an efficient alternative, in terms of accuracy, precision, recall and F1-score, making it a viable option as an IoT IDS. Additionally, with these results as a baseline, we have evaluated alternative aggregation algorithms, namely FedAvgM, FedAdam and FedAdagrad, in the same setting using the Flower FL framework. The results from this evaluation show that, in our scenario, FedAvg and FedAvgM tend to perform better than the two adaptive algorithms, FedAdam and FedAdagrad.


Introduction
The Internet of Things (IoT) is a network of interconnected smart devices that contribute towards generating and gathering enormous amounts of data [1]. IoT devices are now used in every area of our daily life. Examples span from our homes, with devices such as smart appliances and entertainment systems, to the smart city and its invisible infrastructure, such as pedestrian and road sensors. The smart grid, autonomous automobile systems, smart medical devices, industrial control systems and robotics are just some of the areas in which IoT devices are being used on a daily basis. Data gathered by all of these devices requires storage and analysis. To obtain insights into this data and enable intelligent applications, techniques such as machine learning (ML) have been widely deployed [2]. Detection of cyber attacks is part of these intelligent applications. Given the continuous increase in the number of cyber attacks on IoT infrastructure [3], monitoring this high volume of data for the detection of cyber attacks is critical. However, it can be achieved only through the use of automated methods based on ML and deep learning (DL) [4]. DL is a branch of ML that has become widely popular in many fields, including science, finance, medicine and engineering [5]. DL for intrusion detection has also become increasingly popular, as it allows for a more sophisticated analysis of network traffic and more precise detection of anomalies compared to traditional ML methods [6]. DL models tend to achieve better performance and accuracy than ML models in highly complex environments, where large volumes of data exist [7], in exchange for more computational power. While different, ML and DL methods share a similar procedure. They both require the data and the model to be available at a central location. In other words, these systems are centralized. Data is captured in remote locations and transferred to a central repository, where it is processed in preparation for classification. This works well in environments limited to a single organization, with sites located in the same geographic area. In contrast, in organizations with multiple sites, or in the event of collaboration between organizations, data transfers could compromise privacy [8]. Moreover, data volumes represent another challenge. Captures from IDS sensors tend to be quite large. Transferring these large volumes of data to a central location for classification could represent a serious bottleneck for the network [9]. Furthermore, given that some IoT applications are latency critical [10,11], transfers of data to a centralized location could compromise their correct functionality.
Federated learning (FL) is one of the latest paradigms in the area of ML that can be used to address these challenges. FL was introduced by Google in 2017 [12] with the aim of addressing issues related to data privacy. It uses a distributed environment where participating clients complete the analysis of their own data with no need for transfers. Instead, clients share a model used for training on their data. Only parameter updates are exchanged with a server that takes the role of the aggregator. The server coordinates clients until training is completed and performs aggregation of their weights and results through the use of an averaging algorithm such as federated averaging (FedAvg) [12].
Given that IoT networks are fundamentally distributed, FL can be applied to address the limitations of a centralized approach in IoT intrusion detection, as it can analyze traffic and identify attacks as close to the source of data as possible. However, FL approaches to IoT IDS are still in their infancy and require further evaluation before they can be deployed in real-world scenarios [13].
In this article, we evaluate how FL performs in a scenario where four distributed clients collaborate to classify attacks in the ToN_IoT [14] and the CICIDS2017 [15] datasets. The objective is to evaluate FL as an alternative to a typical DL method where data is stored and analyzed at a single location. The federated system created here uses horizontal data partitioning, where each client participating in the process owns different data samples but with the same dimensional space as every other client. Data from the two datasets is randomly divided so that each client has access to its own portion. No data sharing occurs between clients. On the other hand, the model, which is a shallow artificial neural network (ANN), is shared amongst clients, with parameter updates exchanged with the server for aggregation, as previously explained. Results are then compared against a centralized approach, using the same ANN model.
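The horizontal partitioning just described can be sketched as follows. The helper function and the toy data are illustrative, not the actual pre-processing code used in the experiments: every client receives a disjoint random subset of the samples, while all shards keep the same feature dimensionality.

```python
import numpy as np

def partition_horizontally(X, y, n_clients=4, seed=42):
    """Randomly split samples among clients; every shard keeps the
    same feature dimensionality (horizontal data partitioning)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    shards = np.array_split(idx, n_clients)
    return [(X[s], y[s]) for s in shards]

# Toy data standing in for a pre-processed dataset: 1000 samples, 43 features.
X = np.random.rand(1000, 43)
y = np.random.randint(0, 2, size=1000)
clients = partition_horizontally(X, y, n_clients=4)
```

Because the shards are built from a single permutation of the indices, no sample appears in more than one client's portion.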
Key contributions of this article can be summarized as follows:

• We propose a method for the detection of attacks in IoT network environments based on a FL framework that uses FedAvg as the aggregation function. The distributed framework is composed of four clients, sharing a shallow ANN, and a server acting as the aggregator. The primary objective is the evaluation of FL as an approach to the detection and classification of attacks in an IoT network environment.

• We evaluate the framework on two open-source datasets, namely ToN_IoT and CICIDS2017, on both binary and multiclass classification. Our method offers a high level of accuracy with a low false positive (FP) rate in both types of classification for both datasets.

• We compare results from our experiments against a centralized approach based on the same model, showing that the performance of our FL framework is comparable to its centralized counterpart.

• In this scenario, we evaluate three alternative aggregation methods, namely FedAvgM, FedAdam and FedAdagrad, and compare their performance against FedAvg.
The remainder of this article is organized as follows: Section 2 provides a review of the related work in the area of IDS in the IoT environment using FL. Section 3 proposes the FL method for intrusion detection in IoT environments. Section 4 describes the datasets and performance metrics used. Results are discussed in Section 5. The conclusion is drawn in Section 6.

Literature Review
The term federated learning (FL) was first introduced in 2017 in a paper published at Google [12]. In their work, the authors developed a decentralized approach, namely federated learning, with the objective of ensuring the data privacy of participating clients. In FL, a server or aggregator takes the role of coordinating several clients in analyzing data using a shared model. The data owned by the clients remains with the data owner and is never transferred between devices. The model is shared amongst clients and only parameter updates are exchanged. As explained by [12], when implementing FL, several key constraints need to be taken into consideration in order to optimize a solution to the problem. These constraints exist as FL must be able to train on data with the following characteristics [12,16]:

• Non-IID-Data stored locally on a device is not representative of the entire population distribution.

• Unbalanced-Local data has a large variation in size. In other words, some devices will train on larger datasets compared to others.

• Massively distributed-A large number of clients participate.

• Limited communication-Communication amongst clients is not guaranteed, as some may be offline. Training may be completed with a smaller number of devices or asynchronously.
Typical federated learning data is not identically distributed. For instance, in IoT environments, devices acquiring data for analysis capture it in formats and types that differ from device to device. Therefore, the local data cannot be used as an example of the entire data distribution. Similarly, data size can vary drastically between devices, depending on a given scenario. The large number of devices involved and issues with communication between clients and server must also be taken into consideration when developing FL applications.

Federated Learning in IoT Intrusion Detection
Being a technology that often requires devices to connect to a central location remotely, the IoT fits well with the FL paradigm. Similarly, IDSs are normally structured as a distributed environment, making the use of FL in IoT intrusion detection even more appropriate. In fact, FL for IoT IDSs has seen a surge of interest in recent years. Research work in this area covers many aspects of this new technology. In this section, we discuss the current literature in the area of IoT intrusion detection using FL.
Sarhan et al. [17] presented a Cyber Threat Intelligence sharing scheme based on federated learning. The idea presented in their work is to create a framework that allows independent organizations to share their knowledge of cyber threats. Each organization can use a common global model, provided by an orchestrating server, to analyze its data. A federated averaging algorithm is used to aggregate results and update parameters, allowing the model to adapt continuously and achieve better performance in the detection of threats. The framework requires each organization to maintain its local data using a common logging format and feature set. They compared results obtained from their framework with a centralized model, where data from all organizations is stored at a single location, and a localized model, where each organization completes its own analysis. The FL model achieved results generally on par with the other models, demonstrating that FL can be used efficiently in a collaborative intrusion detection system. However, the assumption of a common feature set amongst different organizations could be a limitation of their work, particularly in relation to IoT environments, where devices produce a large amount of heterogeneous traffic, which can create difficulties in building a commonly structured dataset.
Another work based on FL is given in [18]. They used long short-term memory (LSTM) as the basis for their FL model, which was tested against a modified dataset of system calls created by AT&T several years ago. Results were compared against standard models based on LSTM and convolutional neural networks (CNNs). While the results presented were positive, their value is undermined by the use of an old dataset, which may not represent the type of command set commonly used in devices nowadays.
An FL method based on LSTM is given by [19] to identify false data injection attacks in solar farms. They used a traffic generator to build their own dataset to test their method. Results show that their model offers efficient attack detection, improving over a standard centralized method based on the same LSTM model.
FL is also used in [20] as a method to detect attacks in IoT environments. Their work combines FL with ensemble learning. Each node uses a gated recurrent unit (GRU)-based model to perform the classification. Weights are updated globally using federated averaging. The outcome of classification by each node is used as input to an ensemble model based on a random forest algorithm. Using a dataset based on Modbus traffic, the authors achieved promising results.
Zhang et al. [21] developed a platform named FeDIoT that uses FL on real devices to detect anomalies in IoT traffic. Using the N-BaIoT [22] dataset and the LANDER dataset [23], they employed an autoencoder-based model run by the clients in their FL network. Results from their experiments demonstrate that their method can be an efficient technique for detecting attacks.
An interesting approach is proposed by [24]. They built a model using FL and an ensemble stacking approach during the aggregation of results from clients. Their idea is to collect parameters from participating clients and concatenate them into a matrix. This is subsequently used with some test data by the aggregator to obtain a final result. They named their aggregation method FedStacking and tested it with multilayer perceptron (MLP) models running on several clients.
The authors in [25] proposed a FL framework in support of fog-based, resource-constrained IoT devices. They named their approach Fog-FL and took the interesting approach of using local fog nodes as aggregators for the FL network. Rather than having a central aggregator communicating directly with distributed nodes, they added an additional layer of aggregators selected based on geospatial location. The approach selects one of these nodes as the global aggregator at each FL round. According to the authors, this process considerably increases the efficiency of the system in terms of power consumption and communication delays.
An interesting work is given by Chen et al. in [26]. Their work proposes a novel method for intrusion detection in wireless edge networks, named Federated Learning-based Attention Gated Recurrent Unit (FedAGRU). Their method demonstrated improved communication efficiency in exchanging model updates compared to the FedAvg aggregation model. At the same time, they achieved superior accuracy in the detection of attacks when compared to a centralized CNN model.
An anomaly-based IDS using FL in Industrial IoT (IIoT) networks is proposed by Zhang et al. [27]. They adopted an instance-based transfer learning approach using ensemble techniques and proposed a novel aggregation algorithm based on a weighted voting approach. Their method achieved superior detection performance when compared with a centralized model in multiclass classification.
Campos et al. [13] proposed an evaluation of an FL-enabled IDS approach, where they used three different settings with the ToN_IoT dataset. Using the IBMFL library, they also tested different aggregation functions in the same scenarios with excellent results.
Several other relevant works have been presented in the area of IoT intrusion detection, including [28][29][30][31][32]. Table 1 presents a summary of work applying FL for IoT intrusion detection.

Averaging Algorithms
FedAvg is an algorithm based on a federated version of stochastic gradient descent (SGD), namely FedSGD, which was also proposed in the original FL paper from Google [12] as a baseline for FedAvg. FedSGD uses a randomly selected client to complete a single batch gradient calculation for every round of communication. The average gradient on its local data is sent back to the server which, in turn, aggregates the gradients and applies the update to the model. FedAvg is a generalization of FedSGD, where the client updates the weights, rather than the gradient, multiple times before they are sent to the server for aggregation. FedAvg makes it possible for a network of clients to train ML and DL models collectively while still using their local data. This is the basis for a successful FL network, as it removes the need for clients to upload data to a centralized server, hence allowing the main requirement of privacy to be met. The pseudocode of FedAvg is given in Algorithm 1.
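Conceptually, the server-side aggregation step of FedAvg is a sample-weighted average of the clients' weights. A minimal NumPy sketch of that step (illustrative only, not the Flower implementation used in the experiments):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Aggregate client model weights, weighting each client by the
    number of local samples it trained on (FedAvg, McMahan et al.)."""
    total = sum(client_sizes)
    # Each client's weights are a list of arrays (one array per model layer).
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(len(client_weights[0]))
    ]

# Two toy clients sharing a single-layer "model".
w1 = [np.array([1.0, 1.0])]
w2 = [np.array([3.0, 3.0])]
global_w = fedavg([w1, w2], client_sizes=[100, 300])
```

With client sizes 100 and 300, the second client's weights contribute three times as much, so the aggregated layer is [2.5, 2.5].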
FedAvg offers good performance on non-heterogeneous data. However, it is now established that the more heterogeneous the data, the longer FedAvg takes to converge [33][34][35]. As a consequence, research has been carried out to offer alternative solutions to FedAvg, to improve on it, or to address specific scenarios. Several alternative methods have been proposed in the literature [34][35][36][37][38][39][40] to address the limitations of FedAvg.
FedAdam and FedAdagrad were proposed together in [40] as server-side methods to improve on FedAvg in situations where the noise distribution is high. The pseudocode for both algorithms is presented in Algorithm 2. In Lines 15 and 16 of the pseudocode, either FedAdagrad or FedAdam is to be selected, as the rest of the algorithm is the same for both.
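The server-side updates of the two adaptive algorithms can be sketched as below, treating the averaged client update as a pseudo-gradient; the hyperparameter names and default values (lr, beta1, beta2, tau) are illustrative, not the settings used in this article:

```python
import numpy as np

def adaptive_server_step(x, delta, m, v, lr=0.1, beta1=0.9, beta2=0.99,
                         tau=1e-3, method="fedadam"):
    """One server-side step in the spirit of the adaptive federated
    optimizers of [40]; `delta` is the round's averaged client update."""
    m = beta1 * m + (1 - beta1) * delta          # first moment estimate
    if method == "fedadagrad":
        v = v + delta ** 2                       # accumulate squared updates
    else:                                        # FedAdam
        v = beta2 * v + (1 - beta2) * delta ** 2
    x = x + lr * m / (np.sqrt(v) + tau)          # adaptive, per-parameter step
    return x, m, v

# One toy round on a 3-parameter model.
x, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
x, m, v = adaptive_server_step(x, np.ones(3), m, v, method="fedadagrad")
```

The two variants differ only in how the second-moment accumulator v is maintained, which matches the single-line difference noted for Algorithm 2.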
Another alternative algorithm is federated averaging with momentum, or FedAvgM [35]. The pseudocode for this is presented in Algorithm 3. Notice that the algorithm is practically the same as FedAvg at the server side. However, it does change how the clients calculate the weights. Using Nesterov accelerated gradient [41], a momentum term is added to improve the calculation of weights when the data contains too much noise. Momentum is an improvement to standard SGD that accelerates the process of finding the best minimum when calculating the gradient [42]. Given that SGD can stagnate in flat areas of noisy environments, momentum can be used to accelerate the search for the minimum without getting stuck. Nesterov accelerated gradient is a further improvement of standard momentum, as it updates parameters according to the previous momentum and then corrects the gradient to achieve the parameter update [43].
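In the original formulation by Hsu et al., the momentum buffer lives on the server and is applied to the averaged client update. Under that reading, one FedAvgM round can be sketched as follows (a NumPy illustration, not the article's Algorithm 3):

```python
import numpy as np

def fedavgm_step(weights, avg_update, buf, beta=0.9, lr=1.0):
    """Momentum-accelerated aggregation in the spirit of FedAvgM:
    the averaged client update acts as a pseudo-gradient and the
    buffer carries velocity across rounds."""
    buf = beta * buf + avg_update
    return weights + lr * buf, buf

w, buf = np.zeros(2), np.zeros(2)
w, buf = fedavgm_step(w, np.ones(2), buf)   # round 1: buf = 1.0
w, buf = fedavgm_step(w, np.ones(2), buf)   # round 2: momentum kicks in
```

When the per-round updates keep pointing in the same direction, the buffer grows (here to 1.9 after two rounds), which is exactly how momentum accelerates progress through flat, noisy regions.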

Federated Learning Frameworks
While the use of FL is quite recent, several Python libraries exist for the development of FL applications. For instance, as a part of TensorFlow, Google created TensorFlow Federated, or TFF [44], which is an open-source framework for ML methods applied to decentralized data. According to its website, TFF was created to facilitate open research and experimentation using FL. Another popular library for FL is PySyft [45], recently renamed Syft. This was created by OpenMined and is an open-source stack that focuses on providing FL with secure and private communication. IBM also created their own FL framework [46], which they named IBM Federated Learning. This is a library designed to support an easy implementation of ML in a federated environment. Flower, the library of choice for this work, is a stable, high-level library for Python. Flower helps developers transition rapidly from existing ML implementations to a FL setup. This allows a quick way to evaluate existing models in a federated environment [47], and this was the main reason behind its choice.

Proposed Model
The experiments were carried out using a workstation with an Intel® Core™ i7-5960X CPU and 32 GB of RAM, running Linux Mint 20.3 Cinnamon as the main operating system (OS). The testbed for the experiments was created using the Python library Flower, at version 1.0.0.

Overall Architecture
The proposed model is composed of four virtual clients and one server acting as the aggregator. Figure 1 illustrates the topology and the steps taken by the FL model at each round. Before training can start, clients connect to the server. The training process begins when the server sends initial parameters to the clients. Upon receiving these, clients undertake training on their own data by updating weights locally. At the end of their training, each client sends its updates to the server. Using FedAvg, as described in Algorithm 1, the server aggregates these updates into a global update, which is then sent back to the clients for a new training round. This process is repeated until all rounds have completed.
Overall, a training round follows the steps described above, beginning with the server starting and accepting connections from a number of clients based on the specific scenario. A flowchart of the overall FL process using the Flower library is shown in Figure 2. High-level pseudocode for the FL process with Flower is given in Algorithm 4.
Hyperparameters, as shown in Table 2, were set at the server side, using a method defined by Flower as a strategy, and sent to the clients at the start of the process. These are: learning rate (LR), set to 0.01; number of rounds, set to 5; and epochs, set to 5 for the first three rounds and then to 8 for the last two rounds. All of these settings were chosen following an empirical evaluation in which different values of LR, epochs and FL rounds were tested; the values selected were those offering the best outcome in this scenario. Flower uses the concept of strategies as a way to configure several options, including the type of averaging algorithm used to aggregate parameters during training. In fact, a strategy can be used to define several other customizable settings, for instance, the minimum number of total clients in the FL system, the minimum number of clients required to be available for training and the minimum number of clients required for validation.
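Assuming the Flower 1.0 API, a strategy of this kind might be configured as below. The hyperparameter values mirror Table 2, while the fit_config callback name and the server address are our own illustrative choices, not taken from the article:

```python
import flwr as fl

def fit_config(server_round: int):
    # Per-round settings sent to clients: epochs rise from 5 to 8
    # for the last two of the five rounds, LR stays at 0.01.
    return {"lr": 0.01, "epochs": 5 if server_round <= 3 else 8}

strategy = fl.server.strategy.FedAvg(
    min_fit_clients=4,        # clients required for training
    min_evaluate_clients=4,   # clients required for validation
    min_available_clients=4,  # total clients that must be connected
    on_fit_config_fn=fit_config,
)

fl.server.start_server(
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(num_rounds=5),
    strategy=strategy,
)
```

Swapping FedAvg for FedAvgM, FedAdam or FedAdagrad is a one-line change of the strategy class, which is what makes comparing aggregation algorithms in the same setting straightforward.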

Shared Model
The model used for classification is a shallow ANN with a dense input layer formed by 24 neurons, a dense hidden layer formed by 16 neurons and an output layer. The optimizer is Adam. The activation function is ReLU for the input and hidden layers, while sigmoid or softmax was used for the output layer, depending on whether the classification was binary or multiclass. The choice of the model and its activation and loss functions was made to ensure the shared model was fast to train on the data, so that focus could be given to selecting the best options for the FL framework and its aggregation methods. Figure 3 illustrates the layers of the ANN model, showing an example of multiclass output with 10 outcomes. The number 43 in the input layer indicates the number of data points, or features, fed into the input layer. This configuration is used on all clients. With this model, each client performs a classification on its portion of data and sends back weights to the server for aggregation and update. On the server side, the shared ANN model is also used, with the same configuration, at the beginning of the training process with a small portion of local data. A round of training is completed by the server to provide clients with initial weights that are somewhat meaningful for the type of data used. This avoids using completely random weights as the initial weights for the global model.
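As an illustration of the shared model's structure, the forward pass of a 43-24-16-10 network can be sketched in plain NumPy. The weights here are random, untrained stand-ins; in the actual system each client trains this shape of model on its own shard:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

# Layer sizes follow the article: 43 input features -> 24 -> 16 -> 10 classes.
W1, b1 = rng.normal(size=(43, 24)) * 0.1, np.zeros(24)
W2, b2 = rng.normal(size=(24, 16)) * 0.1, np.zeros(16)
W3, b3 = rng.normal(size=(16, 10)) * 0.1, np.zeros(10)

def forward(X):
    h1 = relu(X @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    return softmax(h2 @ W3 + b3)   # swap softmax for sigmoid in the binary case

probs = forward(rng.random((5, 43)))  # class probabilities for 5 samples
```

The lists of arrays [W1, b1, W2, b2, W3, b3] are exactly the parameters that clients exchange with the server for aggregation.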

Comparison of Averaging Algorithms
As part of the experiments, several aggregation functions were tested in this scenario. The results from the experiments above using the FedAvg algorithm were used as a baseline for evaluating the other aggregation methods, namely FedAvgM, FedAdam and FedAdagrad. All parameters and the shared model remain the same for each scenario.

Datasets
The experiments were carried out using two open-source datasets: ToN_IoT and CICIDS2017. The first contains data obtained from a large IoT network, while the other is purely based on a typical network environment. Both datasets are widely used in intrusion detection and present different characteristics, which can be of value for testing the proposed model.

ToN_IoT Dataset
The ToN_IoT dataset [14] was collected using a large-scale network created by the University of New South Wales (UNSW) at the Australian Defence Force Academy (ADFA). This network included physical systems, virtual devices, cloud platforms and IoT sensors, offering a large number of heterogeneous sources. The data includes several captures from devices with different perspectives of the network: IoT/IIoT, Network, Linux and Windows. For this set of experiments, the network data was used for model training. Preference was given to the train_test_network data, as it provides a sample of the network data, in a single file in CSV format, specifically created with the intent of evaluating the efficiency of ML applications. The data contains 43 features in total and includes a large sample of normal traffic plus nine different types of attacks. These are listed in Table 3. The Numerical ID represents the value used by the algorithm to classify the samples during multiclass classification. This dataset represents actual IoT data, making it one of the most relevant for this work among those publicly available.

CICIDS2017 Dataset
The CICIDS2017 dataset [15] was created by the Canadian Institute for Cybersecurity and was specifically designed to help develop solutions for anomaly detection. The dataset contains traffic generated from a network captured over several days and includes a diverse range of attack scenarios. This is a larger dataset compared to the ToN_IoT in numbers of samples, features and classes. The diversity of data is one of the reasons behind its choice, as it offers a more complex environment for network traffic analysis. In total, the dataset contains 79 features, with each data sample labeled as either normal or as a specific attack type. A list of all types of attacks is presented in Table 4.

Data Pre-Processing
In order to simulate a realistic FL environment, each client has to obtain its own portion of the data. Therefore, both the ToN_IoT and the CICIDS2017 datasets were pre-divided into several parts randomly. However, to ensure horizontal FL could be achieved, each portion of the data maintained the same dimensionality. The same distribution of classes in the labels was also maintained for all clients to ensure consistency during training. Before training, each client pre-processed its own data. This was achieved using the Scikit-learn library. Firstly, the data was checked for null values. The rows containing these were removed, as they represented an insignificant portion of data samples in both datasets. Categorical objects were also identified and encoded into numerical form to ensure the data could be input into the DL models. The next step was the normalization of the data (i.e., scaling values between 0 and 1). This is an important step to ensure all features lie on a comparable scale, so that features with large values do not bias the outcome of the model training. Again, the Scikit-learn library, with its MinMaxScaler class, was used to complete this task. Mathematically, normalization was carried out by Equation (1): x_norm = (x − x_min)/(x_max − x_min).
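The min-max normalization of Equation (1) can be sketched in a few lines of NumPy; the helper below is an illustrative equivalent of Scikit-learn's MinMaxScaler, with an added guard for constant columns:

```python
import numpy as np

def min_max_scale(X):
    """Min-max normalization: rescale every feature (column) to [0, 1],
    as in Equation (1): x_norm = (x - x_min) / (x_max - x_min)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # avoid divide-by-zero
    return (X - x_min) / span

# Two features on very different scales.
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_scaled = min_max_scale(X)
```

After scaling, both columns span [0, 1], so neither feature dominates training simply because of its magnitude.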
To conclude the pre-processing of the data, at each client, the dataset was divided into train and test data using a 70:30 ratio, where 30% of the data was kept aside for testing the model with previously unseen data. This is a standard process for ML, as it allows validating results obtained from training on previously unseen data. This step helps ensure that ML models used in operational environments, with live data, can achieve results similar to their performance during training.
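The per-client hold-out split can be sketched as follows; this is a plain NumPy stand-in for Scikit-learn's train_test_split, shown here to make the 70:30 partition explicit:

```python
import numpy as np

def train_test_split_70_30(X, y, seed=0):
    """Shuffled 70:30 hold-out split, as performed at each client."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))       # shuffle before splitting
    cut = int(0.7 * len(X))
    tr, te = idx[:cut], idx[cut:]
    return X[tr], X[te], y[tr], y[te]

# Toy data: 100 samples with unique labels, to make disjointness visible.
X = np.arange(100).reshape(100, 1).astype(float)
y = np.arange(100)
X_tr, X_te, y_tr, y_te = train_test_split_70_30(X, y)
```

Shuffling before the cut matters: without it, a class-ordered file would send entire classes into only one of the two partitions.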

Performance Metrics
Evaluation of ML and DL models for classification problems such as the one presented in this work is mostly based on metrics obtained from a confusion matrix (CM). This is a cross table that reports how often a model is capable of correctly classifying a data sample with its real label. The model attempts to discover the correct type of data sample. This prediction is recorded and compared against the real type. The CM is used to calculate the number of occurrences in which the model correctly or incorrectly classifies data. In the context of anomaly or intrusion detection, a CM can be used to verify the rate at which a model manages to: correctly detect an attack (true positive, TP); correctly identify normal traffic (true negative, TN); wrongly flag normal traffic as an attack (false positive, FP); or miss an attack by labeling it as normal (false negative, FN). A CM is often displayed in a tabular format similar to Figure 4. On the right-hand side, the numbers indicate the matching color code (e.g., dark blue indicates numbers in the order of 80 K, in this case, but this value changes according to the number of samples classified).
An ideal model would identify all TP and TN correctly and never confuse one class of traffic for the other. Of course, this is not realistically achievable. However, the rates of FP and FN should be kept to a minimum. A CM allows certain important metrics to be calculated. These are:

• Accuracy-This is the ratio of correctly classified instances among the total number, as shown in Equation (2): Accuracy = (TP + TN)/(TP + TN + FP + FN).
• Precision-This provides the rate of elements that have been classified as positive and that are actually positive. It is obtained by dividing correctly classified anomalies (TP) by the total number of instances classified as positive (TP + FP), as shown in Equation (3): Precision = TP/(TP + FP).
• Recall-Also defined as sensitivity or true positive rate (TPR), it is obtained from the correctly classified attacks (TP) divided by the total number of attacks (TP + FN), i.e., Recall = TP/(TP + FN), and measures the model's ability to identify all positive instances (i.e., attacks) in the data.
• F1-score-This uses both precision and recall to calculate their harmonic mean, as shown in Equation (5): F1-score = 2 × (Precision × Recall)/(Precision + Recall). The higher the score, the better the model.
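The four metrics above follow directly from the confusion-matrix counts; a minimal sketch, with hypothetical counts for illustration:

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix
    counts, following Equations (2)-(5)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Hypothetical counts: 100 attacks (90 caught), 100 normal flows (85 clean).
acc, prec, rec, f1 = binary_metrics(tp=90, tn=85, fp=15, fn=10)
```

Note how precision and recall tell different stories with these counts: the model catches 90% of attacks (recall) but only about 86% of its alarms are genuine (precision), and F1 balances the two.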
In multiclass classification, an averaging technique is used to obtain an overall score for each metric. Several exist, as explained by [48]. In this case, a weighted averaging score is used, where class imbalance is considered according to the number of samples of each class in the data.
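Weighted averaging of this kind can be sketched as follows; the per-class scores and class sizes below are hypothetical, chosen only to show how a dominant class drives the overall score:

```python
import numpy as np

def weighted_average(per_class_scores, support):
    """Weighted averaging: each class contributes in proportion to its
    number of samples (its support), so imbalance is accounted for."""
    support = np.asarray(support, dtype=float)
    return float(np.dot(per_class_scores, support) / support.sum())

# Hypothetical per-class F1-scores for a 3-class problem with 1000 samples.
score = weighted_average([0.95, 0.60, 0.80], support=[800, 150, 50])
```

Here the majority class (800 samples, F1 = 0.95) dominates, giving an overall weighted F1 of 0.89 even though one minority class scores only 0.60; an unweighted (macro) average would instead report about 0.78.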

Results and Discussion
In this section, we present results from the experiments on the ToN_IoT and CICIDS2017 datasets using the Flower FL environment. To evaluate the performance of the FL model, experiments were completed on both datasets for both binary and multiclass classification. Each is presented and discussed as a standalone scenario. Results from a traditional centralized approach, using the same ANN model on both datasets, are used as a baseline for the evaluation of the FL model. Accuracy, precision, recall and F1-score are used as the metrics to compare performance between the FL model and its centralized counterpart. FedAvg is used as the averaging algorithm for the aggregation of parameters. Results are also given for further experiments completed to evaluate FedAvgM, FedAdam and FedAdagrad as alternative methods to FedAvg.

Binary Classification
Binary classification aims at identifying anomalies in the given data. Table 5 and Figure 5 show results from the binary classification on the ToN_IoT dataset. The table presents results from each participating client, the actual aggregated results from the server and the centralized model. Results from each client are purely informational. In typical FL models, the number of clients can be quite large; therefore, keeping track of scores from each client would be impractical and unnecessary. In this case, given that only four clients are available, seeing how they perform on their own data and comparing their results with the aggregated model can be useful, particularly to identify possible substantial differences in results. From the results, it is evident that the federated model offers scores that are slightly lower than the centralized model. However, the difference is not substantial, indicating that the horizontal FL system can perform well in binary classification using FedAvg as the aggregating algorithm. As to the performance of the alternative averaging algorithms, results from the binary classification on the ToN_IoT dataset are given in Table 6.
The results show that, in this context, the FedAvgM algorithm offers the best metrics: superior accuracy, precision, recall and F1-score compared to any of the other methods. Only FedAvg performs comparably. FedAdam and FedAdagrad perform poorly in comparison, with FedAdagrad in particular being the least reliable.
To validate the classification results obtained on the ToN_IoT dataset, the CICIDS2017 dataset was used for a similar experiment. The results of binary classification on the CICIDS2017 dataset are presented in Table 7 and Figure 6. They are very similar to the previous scenario, indicating consistency. Moreover, in this scenario too, the centralized model performs better than the FL system, confirming the results of binary classification on ToN_IoT. The alternative averaging algorithms, however, gave somewhat different results in binary classification on CICIDS2017. With this larger and more heterogeneous dataset, FedAvg performed better than the rest. FedAvgM is very close in most metrics and even achieved a better recall. Again, FedAdagrad and FedAdam gave the poorest results, with FedAdam this time performing worse than FedAdagrad; both methods remain well behind the other two algorithms. Results are presented in Table 8. The performance of the alternative averaging algorithms in binary classification on the ToN_IoT and CICIDS2017 datasets is further illustrated in Figure 7.

Multiclass Classification
In multiclass classification the objective is to identify the actual attack from the target labels. The ToN_IoT dataset contains ten classes: nine are attacks, while the remaining class is normal traffic. Table 9 and Figure 8 show the results obtained in multiclass classification. In this scenario, the centralized model clearly outperforms its federated counterpart in all metrics. However, the scores of the FL system are still quite high, demonstrating the soundness of the approach in multiclass classification as well. The alternative averaging algorithms returned results similar to those of binary classification: FedAvg and FedAvgM achieved better scores than the other two methods. FedAdagrad in particular performed quite poorly in this scenario, with very low scores in all metrics; FedAdam achieved better scores, but still well below FedAvg and FedAvgM. Table 10 provides the results for all algorithms. Multiclass classification on CICIDS2017 confirms the soundness of the results from the federated system. Moreover, on this more complex dataset the discrepancy between the centralized model and the federated version is smaller: the centralized model performed better once more, but the FL system came closer. The results are presented in Table 11 and Figure 9. Among the alternative averaging algorithms in multiclass classification on CICIDS2017, the improved performance of FedAdagrad is noticeable, as is the very poor performance of FedAdam; FedAvg and FedAvgM performed well in all metrics. All results are available in tabular format in Table 12. The performance of the alternative averaging algorithms in multiclass classification on the ToN_IoT and CICIDS2017 datasets is further illustrated in Figure 10.

Confusion Matrices
The CMs of the FL model for binary classification on the ToN_IoT dataset and the CICIDS2017 dataset are illustrated for reference in Figure 11.

Conclusions and Future Works
User data privacy has become of paramount importance for organizations in recent years, particularly since the European General Data Protection Regulation (GDPR) came into force in May 2018, intensifying the pressure on organizations to ensure that the privacy of users' data is maintained at all times; high fines await those who do not comply. Storing large amounts of data centrally for ML analysis can itself create privacy issues: in IoT environments, data needs to be transferred from devices to a central location for analysis, and this transfer has the potential to expose user data. FL can be used as a method to preserve privacy while training ML models. In this article we have performed several experiments to evaluate FL as an alternative to a centralized model for detecting attacks in IoT environments while maintaining data privacy. The results of the experiments demonstrate that a collaborative federated system using horizontal data partitioning can achieve performance close to that of a centralized model. The FL model was built from four clients and one server. Data analysis was performed at the client side, each client using its own portion of the dataset, and no data sharing between participants occurred. The role of the server was to coordinate the overall process. FedAvg was used as the algorithm for parameter aggregation. The overall process was completed over five rounds; at each round, the clients trained the DL model on their local data. The DL model used in this set of experiments is a shallow ANN with three layers. The results show that an FL system can provide an excellent alternative to a centralized model, as it achieves comparable accuracy, precision, recall and F1-score in both binary and multiclass classification.
FedAvg is known to work well in networks where the data distribution is balanced. However, it has been found to have convergence issues in situations where the data distribution is highly variable, which is a common situation in federated networks where each client uses its own data. Algorithms such as FedAdam, FedAdagrad and FedAvgM have been proposed to deal with these issues, and in this article we have presented an evaluation of them. Using the same testbed as the other experiments, each algorithm was evaluated in both binary and multiclass classification. The results show that FedAvg and FedAvgM tend to perform better in both scenarios than the two adaptive algorithms, FedAdam and FedAdagrad. In particular, FedAvg achieved a better score in binary classification on the CICIDS2017 dataset and in multiclass classification on ToN_IoT. In contrast, the results from the experiments with FedAdam and FedAdagrad were generally negative. Only in multiclass classification on CICIDS2017 was FedAdagrad able to achieve scores in the order of 90% in all metrics; otherwise, the performance of both algorithms in this context was fairly poor. It is difficult to establish the exact reasons for this behavior. However, given that these algorithms were designed to improve on the performance of FedAvg in a cross-device scenario [40], where a large number of clients is assumed, an empirical evaluation of these methods in more complex scenarios should be considered as part of future work. Moreover, to improve both binary and multiclass classification, the use of a more complex shared model should be considered. For instance, a possible approach worth exploring is an ensemble of diverse models applied to an FL infrastructure where results are aggregated at a central location.
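The difference between the plain and momentum-based server updates compared above can be sketched in a few lines of NumPy. This is an illustrative sketch of the server-side rules only, with hypothetical parameter vectors, client sizes and momentum factor β; it is not our experimental code, which uses the strategies built into the Flower framework:

```python
import numpy as np

def fedavg_step(global_w, client_ws, client_sizes):
    """Plain FedAvg: average of client parameters, weighted by sample count."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_ws, client_sizes))

def fedavgm_step(global_w, client_ws, client_sizes, v, beta=0.9):
    """FedAvgM: apply server momentum to the round's pseudo-gradient."""
    avg_w = fedavg_step(global_w, client_ws, client_sizes)
    delta = global_w - avg_w   # pseudo-gradient of this round
    v = beta * v + delta       # momentum buffer carried across rounds
    return global_w - v, v

w = np.zeros(3)
clients = [np.array([1.0, 1.0, 1.0]), np.array([3.0, 3.0, 3.0])]
sizes = [100, 100]
print(fedavg_step(w, clients, sizes))  # -> [2. 2. 2.]
w_m, v = fedavgm_step(w, clients, sizes, v=np.zeros(3))
```

With a zero momentum buffer, the first FedAvgM step coincides with FedAvg; the momentum term only changes the trajectory in later rounds, which is where its smoothing effect on heterogeneous client updates appears.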

- The server sends the initial parameters of the global model to the clients.
- Each client completes training on its local data, computes its local parameters and sends an update to the server.
- The server aggregates the client updates and updates the parameters of the global model.
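The three steps above can be sketched as a single simulated round in plain NumPy. The toy model, the fake gradients and the partition sizes are assumptions made purely for illustration; in our experiments the Flower framework drives this loop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: the server sends the initial global parameters to each client.
global_w = np.zeros(4)

def client_update(w, n_samples, lr=0.1, epochs=1):
    """Step 2: a client trains locally (here: one made-up gradient step
    per epoch, standing in for real SGD) and returns its parameters."""
    local_w = w.copy()
    for _ in range(epochs):
        fake_grad = rng.normal(size=local_w.shape)
        local_w -= lr * fake_grad
    return local_w, n_samples

# Four clients, as in our testbed, each with its own partition size.
updates = [client_update(global_w, n) for n in (120, 80, 100, 100)]

# Step 3: the server aggregates the updates; FedAvg weights each client's
# parameters by its share of the total sample count.
total = sum(n for _, n in updates)
global_w = sum(w * (n / total) for w, n in updates)
print(global_w)
```

In the real system this round repeats five times, with the aggregated `global_w` redistributed to the clients at the start of each round.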

Figure 2 .
Figure 2. Flowchart of FL process using Flower.

Figure 3 .
Figure 3. Representation of the shared model.

Figure 5 .
Figure 5. Comparison of FL model vs. centralized model in binary classification on ToN_IoT.

Figure 7 .
Figure 7. Comparison of different aggregation algorithms for binary classification on ToN_IoT and CICIDS2017.

Figure 8 .
Figure 8. Comparison of FL model vs. centralized model in multiclass classification on ToN_IoT.

Figure 9 .
Figure 9. Comparison of FL model vs. centralized model in multiclass classification on CICIDS2017.

Figure 10 .
Figure 10. Comparison of different aggregation algorithms for multiclass classification on ToN_IoT and CICIDS2017.

Figure 11 .
Figure 11. Confusion matrices obtained from the FL binary classification. The CMs of the FL model for multiclass classification on the ToN_IoT dataset and the CICIDS2017 dataset are illustrated for reference in Figure 12.
(a) Results for ToN_IoT in multiclass classification; (b) results for CICIDS2017 in multiclass classification.

Figure 12 .
Figure 12. Confusion matrices obtained from the FL multiclass classification.

Table 1 .
Summary of FL applied to IoT intrusion detection.

Algorithm 1
The FedAvg algorithm. The K clients are indexed by k; B is the local minibatch size, E is the number of local epochs and η is the learning rate.

ClientUpdate(k, w): // run on client k
  B ← (split P_k into batches of size B)
  for each local epoch i from 1 to E do
    for batch b ∈ B do
      w ← w − η∇ℓ(w; b)
  return w to server
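A minimal NumPy sketch of the ClientUpdate routine in Algorithm 1, using a toy linear model with a squared-error loss so the gradient ∇ℓ(w; b) has a closed form. The model, loss and data are assumptions made for illustration; the shared model in our experiments is a shallow ANN:

```python
import numpy as np

def client_update(w, X, y, batch_size, epochs, lr):
    """ClientUpdate(k, w): run E local epochs of minibatch SGD on the
    client's partition (X, y), here with a squared-error loss."""
    w = w.copy()
    n = len(X)
    for _ in range(epochs):                    # for each local epoch
        order = np.random.permutation(n)       # shuffle, then split into batches
        for start in range(0, n, batch_size):  # for each batch b in B
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
            w -= lr * grad                     # w <- w - eta * grad l(w; b)
    return w                                   # returned to the server

# Toy usage: the local update should reduce the client's training loss.
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w0 = np.zeros(3)
w1 = client_update(w0, X, y, batch_size=16, epochs=5, lr=0.05)
loss = lambda w: float(np.mean((X @ w - y) ** 2))
print(loss(w0), ">", loss(w1))
```

The server then combines the returned `w` vectors from all clients into the new global model, weighting each by the client's sample count.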

Algorithm 3
The FedAvgM algorithm. The K clients are indexed by k; B is the local minibatch size, E is the number of local epochs, η is the learning rate and β is the server momentum factor.

ClientUpdate(k, w): // run on client k
  B ← (split P_k into batches of size B)
  for each local epoch i from 1 to E do
    for batch b ∈ B do
      w ← w − η∇ℓ(w; b)
  return w to server
// server-side momentum update on the aggregated pseudo-gradient ∆w:
v ← βv + ∆w
w ← w − v

Table 5 .
Performance of FL in binary classification on ToN_IoT dataset.

Table 6 .
Performance of averaging algorithms in binary classification on ToN_IoT dataset.

Table 7 .
Performance of FL in binary classification on CICIDS2017 dataset.

Figure 6 .
Figure 6. Comparison of FL model vs. centralized model in binary classification on CICIDS2017.

Table 8 .
Performance of averaging algorithms in binary classification on CICIDS2017 dataset.

Table 9 .
Performance of FL in multiclass classification on ToN_IoT dataset.

Table 10 .
Performance of averaging algorithms in multiclass classification on ToN_IoT dataset.

Table 11 .
Performance of FL in multiclass classification on CICIDS2017 dataset.

Table 12 .
Performance of averaging algorithms in multiclass classification on CICIDS2017 dataset.