Article

A Comparative Study of Federated Learning and Amino Acid Encoding with IoT Malware Detection as a Case Study

1 School of Computer Science and Informatics, De Montfort University, Leicester LE1 9BH, UK
2 Institute of Computer Science, University of Tartu, 51009 Tartu, Estonia
3 Department of Computer Education and Instructional Technology, Atatürk University, Erzurum 25240, Türkiye
4 Department of Information Technology, Artvin Çoruh University, Artvin 08000, Türkiye
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Big Data Cogn. Comput. 2026, 10(4), 111; https://doi.org/10.3390/bdcc10040111
Submission received: 22 January 2026 / Revised: 16 March 2026 / Accepted: 30 March 2026 / Published: 6 April 2026
(This article belongs to the Special Issue Application of Cloud Computing in Industrial Internet of Things)

Abstract

The increasing deployment of Internet of Things (IoT) devices introduces significant security challenges, while privacy concerns limit centralized data aggregation for intrusion detection. Federated learning (FL) offers a decentralized alternative, yet the interaction between feature representation, model architecture, and data heterogeneity remains insufficiently understood in IoT malware detection. This study provides a controlled comparative analysis of centralized and federated learning, optionally using amino acid encoding, under IID and Non-IID conditions on a 10,000-sample subset of the CTU–IoT–Malware–Capture dataset. First, we evaluate raw tabular features versus amino acid-based feature encoding, followed by a lightweight multi-layer perceptron (2882 parameters) versus a deeper residual network (70,532 parameters), across binary and multi-class classification tasks. In the binary setting, centralized training achieved up to 98.6% accuracy, while federated IID training likewise reached 98.6%, with differences within statistical variance. Under Non-IID conditions, performance decreased modestly (0.1–0.5 percentage points), and accuracy was consistently lower when using encoded features compared with raw features. The degradation was smaller for deeper architectures, which may offer improved stability under highly skewed federated conditions. In the four-class setting, the complex network achieved up to 97.8% accuracy with raw features, while amino acid encoding achieved up to 93.3%. The results show that federated learning can achieve performance comparable to centralized training under moderate heterogeneity, that lightweight architectures are sufficient for low-dimensional IoT traffic features, and that feature compression via amino acid encoding does not inherently mitigate Non-IID effects. These findings clarify the relative impact of representation, heterogeneity, and architectural capacity in practical FL-based IoT intrusion detection systems.

1. Introduction

Machine learning techniques initially relied on having a dataset, typically as large as possible, on a single computer for training a model. The resulting model then ideally incorporates knowledge extracted from all available data. This setup, called centralized learning, has a potential problem if data come from different organizations. In that situation, centralizing them all may not be possible or ideal due to privacy and data protection concerns, and at the same time, individual organizations may not have enough data to build meaningful models. Federated learning (FL) is a solution to that issue. Generally speaking, local models are trained for one or several rounds on separate computers and are consolidated after each round in a central model. Updates are sent back to nodes for the next round until a stop criterion is reached. Most FL systems require that the actual model trained is the same on all nodes. We work with such a homogeneous FL system here. We also use a centralized FL architecture, where one central node controls the update process. For this, the clients send the model parameters (in the case of a neural network, the weights) to the central node at the end of a training round, where they are aggregated using some mathematical function. The aggregate values are sent back and used as the starting point for the next round. A major distinction between FL approaches is the aggregation function used. With respect to Internet of Things (IoT) applications, FedAvg (pioneered in [1]) is by far the most popular method [2]. Here, the weights are updated by calculating the average of the individual weights. FedAvg has been shown to be the most efficient solution, at least in many cases [3]. Alternatives to FedAvg include FedProx [4], SCAFFOLD [5], and FedNova [6]. It has been shown that they are superior to FedAvg in certain cases, but not in all, and that no single method is best in all cases [7].
Furthermore, these alternatives need more resources for communication and computation than FedAvg [7]. We refer the reader to [7] and a recent textbook [8] for further details and a bibliography. We decided on FedAvg in this study since it is still popular and no single best method has been established. Overall, FL can match the accuracy of centralized learning in many scenarios [9], and it offers advantages in privacy and resource needs. On the other hand, [10] recently found a small but consistent disadvantage of FL when performing intrusion detection using a hybrid convolutional recurrent neural network.
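The FedAvg update described above, averaging the clients' weights, can be sketched as a sample-weighted mean of each client's parameters. The helper below is an illustrative NumPy sketch, not the authors' implementation; the function name and dict-based parameter layout are assumptions for this example.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """FedAvg aggregation: a weighted average of client parameters,
    where each client's weight is its share of all training samples.
    (Hypothetical helper for illustration only.)"""
    total = sum(client_sizes)
    keys = client_params[0].keys()
    return {
        k: sum(p[k] * (n / total) for p, n in zip(client_params, client_sizes))
        for k in keys
    }

# Two clients, each holding one small weight tensor.
a = {"w": np.array([1.0, 2.0])}
b = {"w": np.array([3.0, 4.0])}
avg = fedavg([a, b], [100, 300])  # client b holds 3x the data, so 3x the weight
```

With sizes 100 and 300, client b contributes 75% of the average, i.e., `avg["w"]` is `[2.5, 3.5]`.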
A significant issue in federated learning is the distribution of the data. Ideally, data would be independent and identically distributed (IID) between the participants. In practice, this is often not the case. For example, if attack data from different computers are used, a particular attack might have occurred on only one machine, meaning that the case is contained only in that machine's training data. A number of techniques have been devised to mitigate such situations [11]. Similarly to previous reviews such as [12], the current review [11] concludes that universal solutions do not exist and that further research is needed. As an example, [13] finds that combining transformers as the ML method with FedProx or FedNova (but not FedAvg) for federation can deal with Non-IID data. Federated learning is most typically combined with neural networks, but it can use other learning techniques as well, such as random forest [14].
In computer security, Network Intrusion Detection Systems (NIDSs) are an important component. Their goal is to detect whether a network communication with a system is part of an attack or not, i.e., whether it is benevolent or malicious [15]. There are two main types of such systems. Misuse NIDSs use signatures to identify known attacks; they offer various advantages but suffer from the inability to detect novel types of attacks. Anomaly NIDSs can potentially detect novel attacks by examining traffic and identifying anomalies, i.e., attacks. We refer the reader to a recent review [16] for details and references. Federated learning can be useful for intrusion detection, particularly in flexible networks such as those typical of the IoT sector, as shown recently for Mobile Ad Hoc Networks (MANETs) [17].
Proteins are a major way for living beings to encode and use information. The core components of proteins are amino acids (20 in the human body). Since organisms have developed very sophisticated mechanisms to protect themselves against attacks by viruses, bacteria, or other harmful attackers, it seems a logical idea to employ similar mechanisms to detect network attacks. Such an approach was introduced in [18] (also containing more references), where attack signatures are encoded as amino acids. There are other methods that use biological code to encode attack data, such as DNA-like encoding [19]. For details of our method, see Section 2.4.
For the actual detection of anomalies from either raw or amino acid-encoded data, a variety of techniques have been employed. They include various types of multivariate statistics, machine learning methods including neural networks, as well as the latest advances in deep learning. A comprehensive review of these techniques has been provided by Arnob et al. [20]. We do not aim to evaluate an exhaustive range of models. Instead, we select a simple perceptron and a deeper, multi-layer neural network as representative baselines for comparatively basic and more advanced learning techniques. This choice reflects common practice in intrusion detection, where neural network-based methods are frequently reported as the most common learning approach [2]. Whilst each of these areas has been explored separately, our approach here is to test the interplay of factors.
The main contributions of this work can be summarized as follows:
  • We present a systematic and controlled evaluation of federated learning for IoT malware detection, explicitly comparing centralized and federated training under both IID and Non-IID data distributions, thereby quantifying the practical performance gap between these learning paradigms in realistic IoT settings.
  • We provide a detailed empirical analysis of amino acid-based feature encoding within federated learning environments, demonstrating its consistent impact on detection accuracy, false positive rate, and Matthews correlation coefficient across multiple architectures and data distributions.
  • We investigate the interaction between model complexity and learning paradigm by comparing a lightweight multi-layer perceptron (MLP) with a deep residual neural network, showing that increased architectural depth yields only limited performance gains and does not compensate for information loss introduced by feature encoding.
  • We establish experimentally that federated learning achieves performance close to centralized training even under severe data heterogeneity, indicating that feature representation and task formulation exert a stronger influence on intrusion detection performance than the choice between centralized and federated learning.
We have the options of centralized or federated learning, IID or Non-IID data, a simple or deep network, and raw or amino acid-encoded data. We will answer the question of how far these factors influence each other and which combinations can be recommended. Existing studies have examined some of these factors, but not all. An overview of the interaction of intrusion detection and IoT is given in [2]. Yang et al. [3] examine the influence of federated learning on machine learning processes. Whilst FL can, in general, achieve results similar to centralized learning, for Non-IID cases the accuracy can degrade significantly, depending on the use case [3]. DNA or other biological encoding schemes are typically compared to other such encoding schemes (e.g., [21]), but not to the same models without encoding. Ibaisi et al. [18] report that intrusion detection using amino acid encoding gives a better result than all other methods except one.
This paper has been organized as follows. In Section 2, we present the materials and methods including experimental setup, dataset properties, amino acid encoding, and various experimental architectures and designs used in our paper. In Section 3, we present the results, including Experiment 1 outcomes using the simple MLP architecture for binary classification, followed by Experiment 2 that evaluates the complex residual neural network under the same binary setting, and Experiment 3 that extends the complex architecture to a multi-class intrusion detection task. This structure allows direct comparison between architectural complexity and task formulation, while isolating the impact of feature representation and data distribution. Section 4 provides the discussion, while the conclusion is provided in Section 5 of the paper.

2. Materials and Methods

This section describes our experimental setup for evaluating federated learning for IoT malware detection with and without amino acid feature encoding. We conducted our investigation in two phases: first using a simple multi-layer perceptron architecture, and subsequently employing a more complex residual neural network to assess whether architectural improvements could enhance classification performance.

2.1. Experimental Setup

We conducted experiments to compare centralized training against federated learning under different data distribution scenarios. Three training configurations were tested:
  • Centralized training on the complete dataset.
  • Federated learning with equal (IID) data distribution across 5 clients.
  • Federated learning with skewed (Non-IID) data distribution across 5 clients.
Each configuration was tested with two feature processing approaches—raw network features and amino acid-encoded features—resulting in six experimental runs in total. This systematic approach enabled comprehensive evaluation of the factors influencing federated learning performance for IoT intrusion detection.
Our experimental investigation proceeded in two phases using distinct neural network architectures. The first phase employed a simple multi-layer perceptron architecture with two hidden layers of 64 and 32 neurons, incorporating batch normalization [22], ReLU activation, and 30% dropout for regularization. This lightweight architecture, containing approximately 2900 trainable parameters, served as a baseline to establish fundamental performance expectations and evaluate the feasibility of federated learning for resource-constrained IoT devices.
The second phase employed a more sophisticated deep residual neural network architecture to assess whether increased model complexity could enhance classification performance. This complex architecture, denoted as ComplexFederatedNetwork, consists of four hidden layers with 128, 64, 32, and 16 units, respectively, incorporating residual connections [23] within each layer block to facilitate gradient flow during training. The residual architecture contains approximately 70,500 trainable parameters, representing a substantial increase in model capacity compared to the simple MLP baseline. The architectural details for both network designs are presented in Table 1 and Table 2, respectively.
For the federated learning experiments, we adopted the FedAvg aggregation algorithm, using 50 communication rounds in Experiment 1 and 100 rounds in Experiment 2. In each round, participating clients performed local training for five epochs. The simple MLP architecture was trained using the Adam optimizer [24] with a learning rate of 0.001, while the complex residual network employed stochastic gradient descent with a learning rate of 0.01 and momentum of 0.9. A batch size of 64 was used consistently across all experiments.
The federated learning workflow—including client coordination, local model training, and FedAvg-based aggregation—was implemented by the authors using a custom PyTorch-based framework. This implementation enabled fine-grained control over data partitioning, client sampling, and aggregation behavior.
The dataset was split 80/20 for training and testing using stratified sampling to maintain class distributions across splits. Features were standardized using StandardScaler fitted on training data, with the same scaling applied to validation and test sets to prevent data leakage.
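The leakage-free scaling described above follows the standard fit-on-train, apply-everywhere pattern. A minimal sketch with plain NumPy in place of StandardScaler (array shapes and the synthetic data are illustrative assumptions):

```python
import numpy as np

# Synthetic stand-in for the 8000/2000 train/test feature matrices
# with 11 raw network traffic descriptors.
rng = np.random.default_rng(42)
X_train = rng.normal(5.0, 2.0, size=(8000, 11))
X_test = rng.normal(5.0, 2.0, size=(2000, 11))

# Fit standardization statistics on the training split ONLY ...
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# ... then apply the same statistics to both splits, so no test-set
# information leaks into the preprocessing.
X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma
```

By construction the standardized training matrix has zero mean and unit variance per feature, while the test matrix is merely shifted and scaled by the training statistics.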
All experiments were run on a PC (Overclockers UK, Staffordshire, UK) with an Intel Core i9-11900K CPU, an Asus GeForce RTX 3090 ROG graphics card including an NVIDIA GeForce RTX 3090 GPU, and 64 GB RAM, running Ubuntu 22.04.5 LTS. The experiments were run using a Python program written by ourselves, available from the repository [25]. Training time varies by model complexity and configuration. For Experiment 1 (Simple MLP, 50 communication rounds), a complete training run (centralized or federated with 5 clients) typically requires 45–90 min on the RTX 3090 GPU. For Experiments 2 and 3 (Complex Residual Network, 100 communication rounds, 20 clients), a complete training run requires approximately 4–6 h. Centralized training is faster than federated training due to reduced communication overhead and batch processing advantages. Running all experiments with 5 random seeds (as reported in this study) requires approximately 100–150 h of total GPU computation time.
It should be emphasized that the aim of this investigation was not the production of an optimal result for a specific deployment scenario, but rather to establish a solid foundation for understanding the interplay between federated learning, neural network architecture, data distribution, and feature representation. This comparative framework enables practitioners to identify appropriate configurations for their specific IoT deployment contexts while providing insights into the trade-offs inherent in each design choice.
To ensure reproducibility, the data partitioning procedure is explicitly clarified. In the IID setting, samples are evenly distributed across clients using stratified sampling so that each client preserves the global class distribution. In the Non-IID setting, data heterogeneity is controlled using a Dirichlet distribution with a fixed concentration parameter [26], which governs class imbalance across clients. While class proportions vary under Non-IID partitioning, each client is guaranteed to receive sufficient valid samples for training. The total sample allocation, class coverage, and sampling ratios are fixed and deterministic across runs to ensure consistent and reproducible experimental conditions. For complete reproducibility, we provide detailed environment specifications. All experiments used Python 3.9.7, PyTorch 1.10.0 with CUDA 11.3, NumPy 1.21.2, Pandas 1.3.4, Scikit-learn 1.0.2, and Matplotlib 3.4.3. Random seeds were fixed at values [42, 123, 456, 789, 1024] for the five independent runs, with seed = 42 used consistently for data splitting and client partitioning. Complete package versions and installation instructions are documented in the code repository [25].

2.2. Dataset

Experiments used the CTU–IoT–Malware–Capture dataset containing network traffic from IoT devices infected with various malware families. The dataset includes 14 attack categories—Backdoor_Malware, BrowserHijacking, CommandInjection, DDoS, DNS_Spoofing, DictionaryBruteForce, DoS, MITM, Mirai, Recon, SqlInjection, Uploading_Attack, VulnerabilityScan, and XSS—alongside benign traffic. The dataset was retrieved from the Kaggle competition CTU-IoT-Malware IoT Network Traffic [27] and was compiled at Stratosphere Lab [28].
A deterministic preprocessing pipeline was applied as described below for reproducibility. The final feature set contained 11 network traffic descriptors. The dataset comprised 10,000 samples with approximately 60% malicious and 40% benign traffic, reflecting real-world attack prevalence. For Experiment 1 and Experiment 2, all 14 attack categories were merged into a single malicious class to form a binary classification task, while Experiment 3 grouped attacks into four categories (Benign, DoS, Probe, and Web Attack) following common intrusion detection practice.
Dataset preprocessing follows a deterministic pipeline. Missing or invalid records are removed rather than imputed to avoid introducing artificial statistical bias. IPv4 source and destination addresses are each split into four octets following standard big-endian (network byte) order—from most significant to least significant octet—with each octet retained as an independent integer feature in the range [ 0 , 255 ] ; no concatenation into a single integer is performed. Timestamp attributes are transformed into inter-flow time-difference features, computed as the elapsed time in seconds between the start timestamp of each flow and that of the immediately preceding flow in sequence order. These inter-arrival values are then normalized using min–max normalization fitted exclusively on the training partition and applied identically to the test set to prevent data leakage. All preprocessing steps are applied identically across all experiments to ensure reproducibility and consistent comparison. The complete preprocessing code is available in the repository [25].
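The octet split and inter-arrival transformation above can be sketched as follows. The helper names (`ip_to_octets`, `inter_arrival`, `minmax_fit_apply`) are hypothetical; the authors' exact pipeline is in the repository [25]. The convention of assigning 0 to the first flow's inter-arrival time is an assumption of this sketch.

```python
import numpy as np

def ip_to_octets(ip):
    """Split an IPv4 address into four integer features in [0, 255],
    most significant octet first (big-endian / network byte order)."""
    return [int(part) for part in ip.split(".")]

def inter_arrival(timestamps):
    """Elapsed seconds between each flow's start and that of the
    immediately preceding flow; the first flow gets 0 (assumption)."""
    ts = np.asarray(timestamps, dtype=float)
    return np.concatenate([[0.0], np.diff(ts)])

def minmax_fit_apply(train_vals, test_vals):
    """Min-max normalization fitted exclusively on the training
    partition and applied identically to the test partition."""
    lo, hi = train_vals.min(), train_vals.max()
    scale = (hi - lo) or 1.0  # guard against constant features
    return (train_vals - lo) / scale, (test_vals - lo) / scale

octets = ip_to_octets("192.168.1.10")        # [192, 168, 1, 10]
gaps = inter_arrival([0.0, 0.5, 2.0, 2.25])  # [0.0, 0.5, 1.5, 0.25]
tr, te = minmax_fit_apply(gaps, np.array([0.75]))
```

Note that the test value 0.75 is scaled by the training minimum and maximum (0 and 1.5), not by its own statistics.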

2.3. Subset Construction and Splitting Protocol

To ensure reproducibility and prevent data leakage, we follow a deterministic subset construction and partitioning protocol. The 10,000-sample subset was selected from the CTU–IoT–Malware–Capture IoT-23 collection using stratified random sampling with a fixed random seed (seed = 42). Stratified sampling preserves the original class distribution of approximately 60% malicious flows (6000 samples distributed across 14 attack categories) and 40% benign flows (4000 samples). Each attack category is represented proportionally according to its natural occurrence in the full IoT-23 dataset.
After subset selection, an 80/20 train–test split was applied using sklearn.model_selection.train_test_split with random_state=42 and stratification on the class labels, resulting in 8000 training samples and 2000 test samples while maintaining class balance in both partitions. Each sample represents one complete bidirectional network flow, and the splitting procedure ensures that entire flows remain intact—no single flow is divided between training and test sets. This flow-level independence prevents data leakage across partitions.
For federated learning experiments, client data partitioning follows two distinct strategies. Under IID conditions, training data is distributed evenly across clients (5 clients in Experiment 1, 20 clients in Experiments 2 and 3) using stratified sampling, ensuring each client maintains the original 60:40 malicious-to-benign class ratio. Under Non-IID conditions, class proportions for each client are drawn from a Dirichlet distribution Dir ( α ) , and samples are allocated per class according to these drawn proportions. Total sample counts therefore vary across clients, as allocation is governed by class proportions rather than a fixed quota. A minimum threshold of 10 samples per client is enforced; any client falling below this receives additional randomly sampled training instances to meet the minimum. Under α = 0.5 with 5 clients (Experiment 1), per-client sample counts ranged from approximately 900 to 2300. Under α = 0.1 with 20 clients (Experiments 2 and 3), counts ranged from approximately 50 to 1800, reflecting the intended severe label skew. All partitioning steps use fixed random seeds to guarantee reproducibility across experimental runs.
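The Dirichlet-based Non-IID partitioning described above can be sketched as follows. This is a simplified illustration: the function name is hypothetical, and the top-up rule for under-provisioned clients is a reduced version of the threshold logic; the authors' exact allocation code is in the repository [25].

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=42, min_samples=10):
    """Assign sample indices to clients with per-class proportions
    drawn from Dir(alpha); smaller alpha produces stronger label skew.
    Clients below min_samples are topped up with random extra indices."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    clients = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, shard in zip(clients, np.split(idx, cuts)):
            client.extend(shard.tolist())
    for client in clients:                      # enforce minimum size
        while len(client) < min_samples:
            client.append(int(rng.integers(len(labels))))
    return clients

labels = np.array([0] * 600 + [1] * 400)        # 60:40 class ratio
parts = dirichlet_partition(labels, n_clients=5, alpha=0.5)
sizes = [len(p) for p in parts]                 # uneven under skew
```

With a fixed seed the allocation is deterministic across runs, mirroring the reproducibility requirement stated above.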
Each network flow is treated as an independent observation. No temporal ordering constraints are imposed, as the IoT-23 traffic captures are designed to represent independent attack scenarios rather than continuous temporal sequences. This approach is consistent with standard practice in flow-based intrusion detection research, where individual flows serve as the fundamental unit of analysis.

2.4. Amino Acid Encoding

Amino acid encoding [18] transforms numerical network features into amino acid sequences, which are then characterized by structural properties. This method converts feature values to ASCII, then to vigesimal (base-20) notation (corresponding to the 20 standard amino acids), and maps each digit to an amino acid (0→A, 1→R, 2→N, etc.). Each sample’s features are concatenated into a sequence, and Biopython’s ProteinAnalysis [29] computes 10 structural properties: molecular weight, aromaticity, instability index, isoelectric point, alpha-helix fraction, reduced cysteines, disulfide bridges, GRAVY index, beta-turn fraction, and beta-strand fraction. These properties are used in biology to describe amino acid sequences on a higher level, the secondary structure. Whereas the primary structure is the sequence of amino acids, the secondary structure elements comprise several amino acids, and their occurrence depends on the order of the amino acids. We use this to transform the amino acid sequence into a more meaningful representation. For details about the individual properties, see [18].
This transformation converts 11 heterogeneous network features into 10 consistent numerical properties, providing a unified representation across federated nodes. The resulting amino acid sequences are summarized using ten physicochemical properties extracted via ProteinAnalysis, yielding a fixed-length numerical representation for each sample.
While the ASCII-to-vigesimal conversion does not preserve exact numerical continuity or pointwise magnitude, discriminative information is partially retained through the following mechanism. Prior to encoding, all features are normalized into a bounded, base-20 compatible range, ensuring that relative ordering is maintained before the vigesimal mapping takes place. Larger normalized values produce higher-order vigesimal digits and correspondingly different amino acid characters, so the composition of the resulting sequence reflects the distributional structure of the original feature vector. The 10 physicochemical properties computed by ProteinAnalysis then act as global summary statistics over the sequence: molecular weight correlates with the prevalence of higher-value encoded features; composition-based properties such as aromaticity and the GRAVY index reflect which amino acid classes dominate the sequence; and secondary structure fractions capture systematic compositional patterns that vary with the original feature distribution. As a result, attack-induced shifts in the original feature distribution produce measurable differences in the physicochemical output. However, fine-grained numerical distinctions between similar traffic patterns are not preserved, and this information loss is directly reflected in the 4–6 percentage point accuracy reduction observed consistently across all experimental configurations. The encoding is therefore not intended as a lossless transformation but as a structured compression that trades detection precision for a unified and heterogeneity-robust representation across federated clients.
This reduction occurs because certain protocol-specific fields in network traffic, such as flag combinations, are inherently discrete and can be absorbed into the encoding scheme without explicitly preserving protocol-specific granularity. We compared this encoded representation against the original raw feature vectors to evaluate whether the encoding introduces information loss or provides regularization benefits that improve generalization.
The amino acid encoding scheme was originally designed for biological sequence analysis, and its application to network traffic features represents an unconventional but potentially valuable approach. The encoding maps heterogeneous network metrics onto a common physicochemical feature space, which may facilitate knowledge transfer between diverse IoT environments that share similar underlying traffic patterns despite superficial differences in protocol implementations.
It is important to clarify that the motivation for amino acid encoding is not based on biological similarity between protein sequences and network traffic, but rather on feature transformation principles. The encoding converts heterogeneous network attributes into a unified physicochemical representation that preserves ordinal relationships while reducing feature irregularity across distributed nodes. This transformation enables consistent abstraction of network behavior patterns and may improve learning stability in federated environments with heterogeneous data distributions.
We also note that classical dimensionality-reduction techniques such as Principal Component Analysis (PCA) and autoencoders aim primarily at statistical variance preservation or reconstruction optimization. In contrast, the proposed amino acid encoding introduces a structure-aware transformation that acts as both compression and implicit regularization. While PCA and autoencoders minimize reconstruction error, the encoding maps heterogeneous protocol features into a normalized representation space, which may improve robustness under highly skewed federated conditions. Therefore, the encoding should be viewed as a complementary representation strategy rather than a replacement for conventional feature compression techniques.
Prior to ASCII conversion, each feature is normalized independently using min–max normalization to the interval [ 0 , 19 ] , mapping the minimum observed value to 0 and the maximum to 19, so that each normalized value corresponds directly to one of the 20 vigesimal digits and thus one of the 20 amino acid symbols. The ten physicochemical properties are extracted in the following fixed order: (1) molecular weight, (2) aromaticity, (3) instability index, (4) isoelectric point, (5) alpha-helix fraction, (6) beta-turn fraction, (7) beta-strand fraction, (8) reduced cysteines, (9) disulfide bridges, and (10) GRAVY index. Their concatenation in this order forms the final 10-dimensional encoded feature vector. Complete implementation details are available in the code repository [25].
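The normalization and digit-to-letter mapping above can be sketched as follows. The alphabet string extends the stated 0→A, 1→R, 2→N examples with the conventional 20-letter ordering; the full table beyond those three digits is an assumption of this sketch, and the Biopython ProteinAnalysis step that derives the ten physicochemical properties is omitted here (see the repository [25] for the complete implementation).

```python
import numpy as np

# 20-letter alphabet consistent with 0->A, 1->R, 2->N above;
# remaining assignments are assumed to follow the standard ordering.
AA = "ARNDCQEGHILKMFPSTWYV"

def encode_sample(features, train_min, train_max):
    """Min-max normalize each feature into [0, 19] using training
    statistics, round to a vigesimal digit, and map each digit to an
    amino acid letter. ProteinAnalysis post-processing is omitted."""
    f = np.asarray(features, dtype=float)
    span = np.where(train_max > train_min, train_max - train_min, 1.0)
    digits = np.clip(np.round((f - train_min) / span * 19), 0, 19).astype(int)
    return "".join(AA[d] for d in digits)

train_min = np.zeros(3)
train_max = np.full(3, 100.0)
seq = encode_sample([0.0, 20.0, 100.0], train_min, train_max)  # "ACV"
```

The minimum maps to digit 0 (letter A), the maximum to digit 19 (letter V), and intermediate values to intermediate letters, preserving relative ordering as described above.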

2.5. Feature Representation and Potential Overfitting Concerns

The conversion of IP addresses into integer-encoded octets preserves network topology information while maintaining numerical compatibility with standard neural network architectures. However, this encoding choice introduces potential concerns regarding host-specific overfitting, where models might inadvertently learn IP-to-label associations rather than attack-intrinsic behavioral patterns.
Several aspects of our experimental results provide evidence against such overfitting. First, the strong performance of federated learning under Non-IID data distributions—where individual clients observe highly skewed subsets of IP addresses and attack categories—demonstrates that learned representations generalize across diverse host distributions. If models relied heavily on memorizing specific IP addresses, we would expect substantial degradation under Non-IID conditions; instead, performance differences remain below one percentage point in Experiments 1 and 3 (Table 3 and Table 4).
Second, the amino acid encoding experiments provide indirect evidence of generalization beyond raw feature values. When IP-derived features are transformed through physicochemical abstraction, models maintain reasonable discriminative capability (92–93% accuracy), suggesting that detection mechanisms do not critically depend on precise IP address representations. The encoding process obscures host-specific identifiers while preserving coarse-grained distributional properties, yet models continue to distinguish malicious from benign traffic effectively.
Third, the consistency of performance patterns across architectures with vastly different capacities (2882 vs. 70,532 parameters) indicates that overfitting to host identifiers is not a dominant factor. Deeper networks, which have greater capacity to memorize training data, do not exhibit disproportionate performance gaps between IID and Non-IID conditions compared to simpler architectures.
Nevertheless, more rigorous validation would require explicit ablation studies with IP features entirely removed or encoded using alternative representations such as subnet-level grouping or cryptographic hashing. Feature sets based exclusively on protocol-behavioral statistics—packet counts, inter-arrival times, flow duration, and payload characteristics—would eliminate potential host-specific dependencies entirely. Such IP-agnostic approaches represent an important direction for future research to ensure robust generalization across diverse network environments and production deployments where IP address distributions may differ substantially from training data.
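To make the IP-agnostic alternatives concrete, the sketch below shows subnet-level grouping and salted cryptographic hashing of addresses using only the Python standard library. The function names, prefix length, and bucket count are illustrative choices, not part of the study's pipeline.

```python
import hashlib
import ipaddress

def subnet_group(ip: str, prefix: int = 24) -> str:
    """Map an address to its /prefix subnet, discarding host bits."""
    net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(net.network_address)

def hashed_ip_bucket(ip: str, buckets: int = 256, salt: str = "s") -> int:
    """Deterministically hash an address into a fixed number of buckets,
    removing any ordinal relationship between hosts."""
    digest = hashlib.sha256((salt + ip).encode()).hexdigest()
    return int(digest, 16) % buckets

# Hosts on the same /24 collapse to a single subnet feature...
assert subnet_group("192.168.1.10") == subnet_group("192.168.1.200")
# ...while the hashed bucket is stable across calls but carries no topology.
assert hashed_ip_bucket("192.168.1.10") == hashed_ip_bucket("192.168.1.10")
```

Either representation removes the exact host identity while retaining coarse structure (subnet grouping) or none at all (hashing), which is the trade-off the ablation studies above would need to quantify.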

2.6. Experiment 1: Simple Multi-Layer Perceptron Architecture

Our initial experiments employed a straightforward MLP architecture designed to establish baseline performance metrics. This architecture, denoted as FederatedNetwork, consisted of two hidden layers with 64 and 32 neurons, respectively, providing a relatively simple computational model suitable for evaluating the fundamental feasibility of our federated learning approach.
Each hidden layer incorporated batch normalization to stabilize training by normalizing layer inputs, followed by the ReLU activation function to introduce non-linearity, enabling the network to learn decision boundaries. A dropout rate of 30% was applied after each hidden layer to mitigate overfitting during training. The output layer employed a softmax activation function to produce probability distributions across the binary classification task (benign versus malicious traffic). The network was implemented using PyTorch [30] and trained using the Adam optimizer with a learning rate of 0.001 and batch size of 64. These values were not determined through grid or random search but adopted from well-established defaults in the federated learning literature [1], prioritizing reproducibility and cross-configuration comparability over peak performance optimization.
The federated learning framework was configured with 5 clients, each representing a distinct IoT device or network segment. At each communication round, clients performed local training for 5 epochs using the Adam optimizer. The FedAvg algorithm [1] was employed to aggregate client model updates, computing a weighted average of local model parameters based on the number of data samples available at each client. The global model was synchronized over 50 communication rounds, allowing sufficient time for the model to converge while limiting computational overhead.
To simulate realistic IoT network conditions, we evaluated the framework under both independent and identically distributed (IID) and non-identically distributed (Non-IID) data partitions. In the IID scenario, client data were randomly shuffled and evenly distributed using stratified sampling, ensuring each client maintained the original 60:40 malicious-to-benign class ratio. The Non-IID scenario employed a Dirichlet distribution with concentration parameter α = 0.5 to create heterogeneous data distributions, simulating realistic scenarios where different IoT devices observe different attack patterns.
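The Dirichlet-based Non-IID partitioning can be sketched as follows. This assumes the common per-class Dirichlet allocation scheme; the function name and seed are illustrative, not the authors' exact implementation.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=5, alpha=0.5, seed=0):
    """Split sample indices across clients with per-class Dirichlet proportions.
    Smaller alpha -> more skewed (Non-IID) label distributions per client."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Draw this class's client shares from Dirichlet(alpha, ..., alpha).
        shares = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return [np.array(ix, dtype=int) for ix in client_indices]

labels = np.array([1] * 600 + [0] * 400)  # 60:40 malicious-to-benign, as in the text
parts = dirichlet_partition(labels, num_clients=5, alpha=0.5)
assert sum(len(p) for p in parts) == 1000  # every sample assigned exactly once
```

With α = 0.5 each client's class ratio deviates moderately from 60:40; with α = 0.1 (used later in Experiments 2 and 3) individual clients can end up almost single-class.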
For baseline comparison, we also trained centralized models where all training data was available to a single model instance. This centralized training employed identical network architecture and hyperparameters, allowing direct performance comparison between centralized and federated learning approaches. The architecture specifications for Experiment 1 are summarized in Table 1.
Table 1. Specification of the simple MLP architecture used in Experiment 1.
Layer | Output Size | Activation | Parameters
Input | 11 or 10 | - | -
Hidden 1 | 64 | ReLU + BatchNorm + Dropout(0.3) | 704
Hidden 2 | 32 | ReLU + BatchNorm + Dropout(0.3) | 2112
Output | 2 | Softmax | 66
Total trainable parameters | | | 2882
The simple MLP architecture contained approximately 2900 trainable parameters, representing a lightweight model designed for rapid experimentation and establishing baseline performance expectations.
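A minimal PyTorch sketch of this architecture under the stated hyperparameters (Adam, learning rate 0.001, batch size 64) is shown below. The class name follows the text, but the exact parameter bookkeeping (e.g., which BatchNorm terms Table 1 counts per layer) is not specified in the paper, so this is one plausible realization.

```python
import torch
import torch.nn as nn

class FederatedNetwork(nn.Module):
    """Sketch of the Experiment 1 MLP: two hidden layers (64, 32) with
    BatchNorm, ReLU, and 30% dropout, and a 2-way output layer."""
    def __init__(self, in_features: int = 11, num_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(64, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(32, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Return logits; softmax is applied when class probabilities are needed.
        return self.net(x)

model = FederatedNetwork(in_features=11)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # defaults from the text
model.eval()
probs = torch.softmax(model(torch.randn(64, 11)), dim=1)  # one batch of size 64
```

In practice the softmax is usually folded into `nn.CrossEntropyLoss` during training and applied explicitly only at inference, as done here.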

2.7. Experiments 2 and 3: Complex Residual Neural Network Architecture

Following the initial experiments, we investigated whether a more sophisticated neural network architecture could improve classification performance while maintaining computational feasibility for federated learning scenarios. The second experiment employed a deep residual neural network architecture specifically designed for network traffic classification tasks, denoted as ComplexFederatedNetwork. This architecture employs a multi-layer structure with skip connections inspired by residual learning principles [23].
The network architecture consists of four hidden layers with decreasing channel dimensions of 128, 64, 32, and 16 units, respectively. Each hidden layer is followed by batch normalization [22] and the ReLU activation function [31]. The use of batch normalization stabilizes training by normalizing layer inputs, while ReLU introduces non-linearity, enabling the network to learn complex decision boundaries in the feature space. To reflect larger-scale IoT deployments and increase heterogeneity, Experiment 2 employed 20 federated clients, compared to 5 clients in Experiment 1.
To address the vanishing gradient problem that often accompanies deep network training, we incorporated residual connections within each layer block. Each residual block consists of two fully connected transformations followed by batch normalization and ReLU activation. Skip connections add the block input to its output, enabling improved gradient flow and stable training for deeper architectures operating on tabular feature representations.
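A residual block of this kind can be sketched as follows. The layer composition and ordering are one plausible reading of the description (two fully connected transformations with batch normalization, ReLU, and an additive skip connection), not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One block of the ComplexFederatedNetwork as described in the text:
    two FC transformations with BatchNorm/ReLU and an additive skip connection."""
    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.bn1 = nn.BatchNorm1d(dim)
        self.fc2 = nn.Linear(dim, dim)
        self.bn2 = nn.BatchNorm1d(dim)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.fc1(x)))
        out = self.bn2(self.fc2(out))
        return self.relu(out + x)  # skip connection preserves gradient flow

block = ResidualBlock(dim=128)
block.eval()
y = block(torch.randn(8, 128))
```

Because the skip connection adds the input unchanged, the input and output dimensions of each block must match; dimension changes (128 → 64, etc.) happen between blocks.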
The input layer accepts either 11-dimensional raw network traffic features or 10-dimensional amino acid-encoded representations, depending on the experimental configuration. The output layer employs a softmax activation function to produce probability distributions over the target classes: benign versus malicious in the binary setting of Experiment 2, and the four attack categories (Benign, DoS, Probe, and Web Attack) in Experiment 3. The complete architecture is summarized in Table 2.
Table 2. Specification of the ComplexFederatedNetwork architecture used in Experiment 2. Each residual block contains two fully connected transformations with skip connections.
Layer | Output Channels | Residual Blocks | Parameters
Input | 11 or 10 | - | -
Hidden 1 | 128 | 2 | 15,488
Hidden 2 | 64 | 2 | 41,600
Hidden 3 | 32 | 2 | 10,688
Hidden 4 | 16 | 2 | 2688
Output | 4 | - | 68
Total trainable parameters | | | 70,532
The complex residual network contains approximately 70,500 trainable parameters, representing a substantial increase in model capacity compared to the simpler MLP architecture used in Experiment 1. This represents approximately 24 times more parameters, providing significantly greater representational capacity for learning complex patterns in network traffic data.
For the federated learning configuration in Experiment 2, we maintained the FedAvg aggregation strategy but adjusted the hyperparameters to accommodate the deeper architecture. The framework was configured with 20 clients; at each communication round, a random subset of 10% of clients (2 out of 20) was sampled to perform local training for 5 epochs using stochastic gradient descent with a learning rate of 0.01 and momentum of 0.9. These values follow the settings used in the original FedAvg paper [1] for deeper network architectures and represent standard empirical defaults rather than values obtained through systematic hyperparameter search; a dedicated hyperparameter optimization study is identified as an important direction for future work. The global model was synchronized over 100 communication rounds, allowing sufficient time for the deeper architecture to converge. This client sampling strategy reduces communication costs while still providing representative model updates across the network.
The Non-IID scenario was configured with a Dirichlet distribution parameter of 0.1, creating more severe label imbalance across clients compared to Experiment 1. This more challenging partitioning scheme tested the robustness of the complex architecture under more realistic conditions of data heterogeneity.

2.8. Experimental Design Summary

Our experimental design systematically varied four key factors: neural network architecture (simple MLP versus complex residual network), learning paradigm (centralized versus federated learning), data distribution (IID versus Non-IID), and feature representation (raw versus amino acid-encoded), resulting in a comprehensive evaluation matrix.
The primary objectives of this two-phase experimental approach were (1) to establish baseline performance using a simple architecture suitable for resource-constrained IoT devices, and (2) to determine whether architectural complexity could meaningfully improve classification accuracy, potentially justifying the additional computational overhead for deployments where resources permit.
For federated training, client participation is sampled without replacement within each communication round. In Experiment 2, 10% of clients are selected uniformly at random per round (2 out of 20), and the selection is re-sampled independently across rounds. Model aggregation follows the standard FedAvg rule. After each round, the server computes the updated global parameters as:
$$\theta_{\mathrm{global}} = \sum_{k \in S_t} w_k\, \theta_k, \qquad w_k = \frac{n_k}{\sum_{j \in S_t} n_j}$$
where $S_t$ is the set of participating clients in round t, $\theta_k$ are the locally trained parameters of client k, and $n_k$ is the number of training samples at client k. The weight $w_k$ is therefore the proportion of client k's data relative to the total data held by all participating clients in that round, using only valid training samples after preprocessing. No early stopping based on validation performance or gradient convergence criteria was applied. A fixed number of communication rounds (50 in Experiment 1 and 100 in Experiments 2 and 3) was used deliberately to keep all configurations under identical and directly comparable conditions. Final performance metrics are evaluated on the held-out test set after the last communication round.
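The sampling and aggregation rule can be sketched numerically as follows; the helper names are illustrative, and each client's parameters are flattened into a single vector for clarity.

```python
import numpy as np

def sample_clients(num_clients: int, fraction: float, rng) -> np.ndarray:
    """Uniform sampling without replacement, re-drawn independently each round."""
    k = max(1, int(fraction * num_clients))
    return rng.choice(num_clients, size=k, replace=False)

def fedavg(local_params, sample_counts) -> np.ndarray:
    """Weighted average of local parameter vectors, w_k = n_k / sum_j n_j."""
    n = np.asarray(sample_counts, dtype=float)
    w = n / n.sum()
    stacked = np.stack(local_params)          # shape: (clients, num_params)
    return (w[:, None] * stacked).sum(axis=0)

rng = np.random.default_rng(0)
chosen = sample_clients(20, 0.10, rng)         # 2 of 20 clients per round
theta = [np.full(4, 1.0), np.full(4, 3.0)]     # toy local parameter vectors
agg = fedavg(theta, sample_counts=[100, 300])  # second client holds 3x the data
assert np.allclose(agg, 2.5)                   # 0.25 * 1 + 0.75 * 3 = 2.5
```

The weighted sum matches the equation above: clients with more valid training samples pull the global parameters proportionally harder.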
For evaluation metrics, binary classification experiments (Experiments 1 and 2) report standard binary Precision, Recall, F1-Score, and MCC. Multi-class classification (Experiment 3) employs macro-averaging for Precision, Recall, and F1-Score, computing each metric independently per class and taking the unweighted mean. For multi-class MCC, we adopt the confusion-matrix-based generalization [32] as implemented in sklearn.metrics.matthews_corrcoef:
$$\mathrm{MCC} = \frac{\displaystyle\sum_{k}\sum_{l}\sum_{m} C_{kk}\,C_{lm} - C_{kl}\,C_{mk}}{\sqrt{\displaystyle\sum_{k}\Bigl(\sum_{l} C_{kl}\Bigr)\Bigl(\sum_{k' \neq k}\sum_{l'} C_{k'l'}\Bigr)}\;\sqrt{\displaystyle\sum_{k}\Bigl(\sum_{l} C_{lk}\Bigr)\Bigl(\sum_{k' \neq k}\sum_{l'} C_{l'k'}\Bigr)}}$$
where C is the K × K confusion matrix and the sums range over all K classes. This formulation reduces to the standard binary MCC when K = 2 .
For AUC-ROC under the one-vs.-rest strategy, the model's softmax output probability $\hat{p}(y = c \mid x)$ is used directly as the continuous ranking score for each class c, compared against the binary ground-truth indicator $\mathbb{1}[y = c]$. The ROC curve is traced by sweeping the decision threshold on this score from 0 to 1, and AUC is computed via the trapezoidal rule. No decision threshold is pre-selected; AUC integrates over all possible thresholds, so class imbalance affects the shape of the ROC curve but does not require threshold tuning. The macro-average AUC is the unweighted mean of the K per-class AUC values, treating all classes equally regardless of support. All metrics are computed using scikit-learn 1.0.2.
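Under these definitions, the multi-class MCC and macro one-vs.-rest AUC can be computed directly with scikit-learn; the toy labels and softmax-like scores below are illustrative only.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef, roc_auc_score

# Toy 4-class example: each row of y_score is a softmax probability vector.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_score = np.array([
    [0.70, 0.10, 0.10, 0.10], [0.60, 0.20, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10], [0.20, 0.50, 0.20, 0.10],
    [0.10, 0.10, 0.70, 0.10], [0.10, 0.20, 0.60, 0.10],
    [0.10, 0.10, 0.10, 0.70], [0.25, 0.25, 0.20, 0.30],
])
y_pred = y_score.argmax(axis=1)  # hard predictions for MCC

# Multi-class MCC via the confusion-matrix generalization (Equation (2)).
mcc = matthews_corrcoef(y_true, y_pred)
# Macro one-vs.-rest AUC, ranking each class's softmax column against 1[y = c].
auc = roc_auc_score(y_true, y_score, multi_class="ovr", average="macro")

assert np.isclose(mcc, 1.0)  # every argmax prediction is correct here
assert np.isclose(auc, 1.0)  # each true class receives the highest score
```

Note that MCC consumes hard class predictions while AUC consumes the continuous scores, which is why both are reported: they capture threshold-dependent and threshold-free behavior, respectively.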
For completeness, all evaluation metrics are defined explicitly below. Let $TP$, $TN$, $FP$, and $FN$ denote true positives, true negatives, false positives, and false negatives, respectively, for the binary case (Experiments 1 and 2).
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}$$
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
For K classes, per-class metrics are computed from the $K \times K$ confusion matrix C, where $C_{ij}$ is the number of samples of true class i predicted as class j. Let $TP_k = C_{kk}$, $FP_k = \sum_{i \neq k} C_{ik}$, and $FN_k = \sum_{j \neq k} C_{kj}$. Macro-averaged metrics are then (Experiment 3):
$$\text{Precision}_{\text{macro}} = \frac{1}{K} \sum_{k=1}^{K} \frac{TP_k}{TP_k + FP_k}$$
$$\text{Recall}_{\text{macro}} = \frac{1}{K} \sum_{k=1}^{K} \frac{TP_k}{TP_k + FN_k}$$
$$F_{1,\text{macro}} = \frac{1}{K} \sum_{k=1}^{K} \frac{2\,TP_k}{2\,TP_k + FP_k + FN_k}$$
Multi-class MCC and AUC-ROC are defined as in Equation (2) and the one-vs.-rest formulation described above.
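The macro-averaged formulas above can be verified against scikit-learn's implementations on a small example; the label vectors are arbitrary illustrative data chosen so that every class appears in both the true labels and the predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3, 3, 0])
y_pred = np.array([0, 1, 1, 1, 2, 0, 3, 3, 2, 0])

C = confusion_matrix(y_true, y_pred)   # C[i, j]: true class i predicted as j
TP = np.diag(C).astype(float)          # TP_k = C_kk
FP = C.sum(axis=0) - TP                # column sums minus the diagonal
FN = C.sum(axis=1) - TP                # row sums minus the diagonal

prec_macro = np.mean(TP / (TP + FP))
rec_macro = np.mean(TP / (TP + FN))
f1_macro = np.mean(2 * TP / (2 * TP + FP + FN))

assert np.isclose(prec_macro, precision_score(y_true, y_pred, average="macro"))
assert np.isclose(rec_macro, recall_score(y_true, y_pred, average="macro"))
assert np.isclose(f1_macro, f1_score(y_true, y_pred, average="macro"))
```

Each per-class metric is computed independently and averaged with equal weight, so minority attack classes count as much as the prevalent Benign class.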

3. Results

This section presents the experimental results from both phases of investigation, comparing the simple MLP architecture against the complex residual network across different learning configurations and data distributions.

3.1. Experiment 1 Results: Simple MLP Architecture

Under Experiment 1 with the simple MLP architecture, the centralized baseline achieved strong performance with raw network features, reaching a test accuracy of 98.6% and demonstrating that the relatively shallow network was capable of learning effective decision boundaries for the binary classification task. The federated learning configurations achieved comparable performance to the centralized baseline under IID data distribution, validating the effectiveness of the FedAvg aggregation strategy with the simple architecture.
Table 3 summarizes the performance metrics for all configurations in Experiment 1 and Figure 1 shows a graphical representation of the results. The centralized model achieved 98.6% accuracy with raw features. The Federated IID configuration achieved 98.6% accuracy, matching the centralized baseline within statistical variance ( p = 0.846 ). The Federated Non-IID configuration achieved 99.0% accuracy, a difference of less than 0.5 percentage points. These results suggest that the simple MLP architecture, despite its limited parameter count, was able to achieve near-optimal performance under both centralized and federated learning paradigms.
Table 3. Performance metrics for Experiment 1 (Simple MLP, Binary Classification) across learning paradigms and feature representations. Results reported as mean ± standard deviation over 5 independent runs.
Features | Paradigm | Accuracy | Precision | Recall | F1 | AUC | MCC
Raw | Centralized | 0.9859 ± 0.0026 | 0.9838 ± 0.0022 | 0.9819 ± 0.0027 | 0.9855 ± 0.0026 | 0.9929 ± 0.0016 | 0.9710 ± 0.0072
 | Fed-IID | 0.9856 ± 0.0026 | 0.9886 ± 0.0009 | 0.9870 ± 0.0014 | 0.9872 ± 0.0029 | 0.9943 ± 0.0036 | 0.9736 ± 0.0045
 | Fed-NonIID | 0.9901 ± 0.0026 | 0.9833 ± 0.0030 | 0.9833 ± 0.0059 | 0.9874 ± 0.0025 | 0.9856 ± 0.0039 | 0.9696 ± 0.0083
Amino Acid-Encoded | Centralized | 0.9292 ± 0.0073 | 0.9278 ± 0.0023 | 0.9247 ± 0.0038 | 0.9298 ± 0.0044 | 0.9706 ± 0.0042 | 0.8447 ± 0.0148
 | Fed-IID | 0.9323 ± 0.0034 | 0.9287 ± 0.0024 | 0.9347 ± 0.0101 | 0.9298 ± 0.0041 | 0.9748 ± 0.0012 | 0.8632 ± 0.0098
 | Fed-NonIID | 0.9247 ± 0.0061 | 0.9304 ± 0.0045 | 0.9256 ± 0.0047 | 0.9211 ± 0.0043 | 0.9689 ± 0.0015 | 0.8448 ± 0.0213
When using amino acid-encoded features, the simple MLP architecture achieved more modest performance. The centralized configuration reached 92.9% accuracy, representing an almost 6 percentage point reduction compared to raw features. The federated IID configuration with encoded features achieved 93.2% accuracy, while the federated Non-IID configuration achieved 92.5% accuracy. These results indicate that the encoding process introduced information loss that affected classification performance, though the encoded representation still maintained acceptable accuracy levels for many practical applications.
For amino acid-encoded features, performance is consistently lower across all configurations. Accuracy decreases by 5.3, 5.7, and 6.5 percentage points relative to raw features, depending on the learning configuration, and is accompanied by a noticeable increase in the false positive rate and a reduction in MCC from approximately 0.97 to 0.84. Despite this degradation, the encoded models retain strong discriminative capability, as reflected by AUC-ROC values close to 0.97. The stability of Precision, Recall, and F1-Score across centralized, IID, and Non-IID settings indicates that the observed performance reduction is primarily attributable to the encoding process itself rather than to the learning paradigm.
Table 3 presents a unified comparison of Experiment 1 results, covering centralized learning and federated learning under IID and Non-IID data distributions, for both raw network features and amino acid-encoded representations. When raw features are used, the simple MLP achieves consistently strong performance across all learning paradigms. Federated IID learning closely matches the centralized baseline, with differences in Accuracy, F1-Score, and MCC remaining below 0.3 percentage points. High AUC-ROC values (above 0.99 for Fed-IID) and low false positive rates further confirm robust discrimination between benign and malicious traffic.
Comparing IID and Non-IID federated configurations, raw-feature models show negligible sensitivity to data heterogeneity, while encoded-feature models exhibit a modest additional performance drop under Non-IID conditions. This suggests that the compressed representation provides less redundancy to compensate for skewed local data distributions. Overall, Experiment 1 demonstrates that federated learning can achieve near-centralized performance for IoT malware detection, while amino acid encoding introduces a measurable but predictable trade-off between representation compactness and detection accuracy.
The confusion matrices for Experiment 1 showed that raw feature experiments achieved near-perfect classification with approximately 437 true negatives and 550 true positives out of 450 benign and 550 malicious test samples. The encoded experiments showed more variation, with the Non-IID configuration achieving 411 true negatives and 514 true positives.
Training dynamics for Experiment 1 are illustrated in Figure 2. All federated configurations converged within 50 rounds without significant oscillations, indicating stable training behavior for the simple architecture.
Overall, Experiment 1 demonstrates that federated learning achieves performance nearly identical to centralized training when using raw features, with accuracy differences remaining below 0.5 percentage points even under Non-IID data distributions.
For the federated Non-IID configuration with amino acid encoding, accuracy decreased slightly (92.5%) compared to the IID configuration (93.2%), suggesting that heterogeneity combined with encoding introduces additional aggregation difficulty. AUC-ROC remained high (0.97–0.99) across all Experiment 1 configurations, indicating that the models maintained good discriminative ability even when overall accuracy was reduced by the encoding process.

3.2. Experiments 2 and 3 Results: Complex Residual Network Architecture

In Experiment 2, the complex residual neural network was first evaluated under the same binary classification setting as Experiment 1. Using raw network features, centralized training achieved an accuracy of 98.0%, while federated learning under IID conditions achieved 97.4%. Under skewed Non-IID data distribution ( α = 0.1 , 20 clients, 10% partial participation), performance decreased more noticeably to 80.0%, reflecting the severe binary Non-IID setting shown in Table 4.
Table 4. Unified results for Experiment 2 (Complex Residual Network) across learning paradigms and feature representations (classifying benign and attack, as for Experiment 1). Results reported as mean ± standard deviation over 5 independent runs.
Features | Configuration | Accuracy | Precision | Recall | F1-Score | AUC-ROC | MCC
Raw | Centralized | 0.9798 ± 0.0032 | 0.9802 ± 0.0070 | 0.9791 ± 0.0039 | 0.9773 ± 0.0043 | 0.9927 ± 0.0039 | 0.9589 ± 0.0026
 | Federated IID | 0.9740 ± 0.0082 | 0.9723 ± 0.0012 | 0.9739 ± 0.0042 | 0.9726 ± 0.0035 | 0.9819 ± 0.0079 | 0.9461 ± 0.0056
 | Federated Non-IID | 0.7996 ± 0.0165 | 0.8200 ± 0.0039 | 0.7956 ± 0.0113 | 0.7921 ± 0.0057 | 0.8846 ± 0.0031 | 0.6055 ± 0.0151
Amino acid-encoded | Centralized | 0.9760 ± 0.0024 | 0.9772 ± 0.0020 | 0.9772 ± 0.0024 | 0.9777 ± 0.0046 | 0.9890 ± 0.0019 | 0.9513 ± 0.0067
 | Federated IID | 0.9714 ± 0.0010 | 0.9692 ± 0.0062 | 0.9646 ± 0.0061 | 0.9682 ± 0.0019 | 0.9807 ± 0.0042 | 0.9412 ± 0.0057
 | Federated Non-IID | 0.8146 ± 0.0080 | 0.8250 ± 0.0086 | 0.8117 ± 0.0141 | 0.8035 ± 0.0089 | 0.8833 ± 0.0076 | 0.6337 ± 0.0081
When using amino acid-encoded features, the centralized model achieved a final test accuracy of 97.6%, Federated IID achieved 97.1%, and Federated Non-IID achieved 81.5%. The modest gap between raw and encoded binary results (0.4 percentage points for centralized) contrasts with the larger gap observed in the four-class Experiment 3, where encoding introduces a more substantial performance reduction.
Training dynamics for both centralized configurations are illustrated in Figure 3. Both models converged within the first 50 epochs, after which validation performance stabilized. The raw feature model exhibited slightly higher training stability in later epochs, while the encoded model showed marginally more fluctuation in validation accuracy. Figure 3 shows validation accuracy during centralized training, whereas Table 4 reports final test set performance.
For Experiment 3, the confusion matrices for centralized configurations (Figure 4) reveal that both models achieve strongest performance on the Benign category, with minor confusion between similar attack types such as DoS and Probe. The encoded model shows slightly elevated misclassification rates for minority attack categories, consistent with the overall accuracy reduction.
Table 4 summarizes the results of Experiment 2 for the binary intrusion detection task, while Table 5 presents the extended evaluation in the four-class setting (Experiment 3). For Experiment 3, using raw features, centralized learning achieved 97.7% accuracy, with Federated IID and Non-IID achieving 97.8% and 97.0%, respectively, demonstrating strong stability across learning paradigms.
To provide detailed insight into per-class behavior, Table 6 presents Precision, Recall, and F1-Score for each of the four attack categories in Experiment 3. The Benign class consistently achieves the highest performance (F1 > 0.98 for raw features, >0.95 for encoded features), reflecting its prevalence in the dataset and distinctive traffic patterns. Attack categories show varying difficulty: Web Attack detection achieves F1-Scores above 0.99 for raw features due to distinctive HTTP-based signatures, while DoS detection proves most challenging (F1 ranging from 0.95 to 0.96 for raw, ≈0.89 for encoded), likely due to overlap with high-volume Benign traffic patterns. Probe attacks show intermediate performance (F1 ranging from 0.96 to 0.97 for raw, ≈0.91 for encoded). Crucially, per-class metrics remain stable across centralized and federated configurations, with Non-IID degradation limited to 0.3–0.8 percentage points for individual classes (Experiment 3: α = 0.1 , 20 clients, 10% partial participation per round). This stability confirms that federated learning preserves balanced multi-class detection capability even when client data distributions are heterogeneous.
With amino acid-encoded features in Experiment 3, centralized accuracy reached 92.9%, while Federated IID and Non-IID achieved 93.3% and 93.0%, respectively. Although encoding introduces moderate information compression, the residual architecture maintains a strong discriminative capability across multiple attack categories, and performance degradation remains limited compared to raw features. These findings confirm that federated learning remains viable for more complex neural architectures and multi-class intrusion detection tasks, while reinforcing that feature representation choice has a greater impact on performance than the learning paradigm itself.
Under IID data distribution in Experiment 3, federated learning achieved performance comparable to centralized training, validating the effectiveness of the FedAvg aggregation strategy with the proposed complex architecture. The federated model trained with raw features achieved a final test accuracy of 97.8%, a difference of only 0.1 percentage points compared to the centralized baseline (97.7%).
Similarly, the federated model with amino acid-encoded features achieved 93.3% accuracy, remarkably close to the centralized encoded performance of 92.9%. This result indicates that federated learning effectively aggregates local model updates to produce a global model that converges to a solution nearly identical to centralized training.
Figure 5 illustrates the convergence behavior of federated training under IID data distribution for both raw and amino acid-encoded features. Both models exhibit rapid performance improvement during the initial communication rounds, followed by gradual stabilization as the global model approaches convergence. The raw-feature model converges slightly faster and achieves marginally higher final accuracy, reflecting the richer discriminative information available in the original feature space. In contrast, the encoded representation shows a small but consistent accuracy gap, indicating mild information compression. Importantly, both curves stabilize well before the maximum number of communication rounds, confirming stable and efficient aggregation behavior under IID conditions and demonstrating that federated learning reliably converges to near-centralized performance for both feature representations. The monotonic increase and absence of oscillatory divergence further indicate stable optimization dynamics and consistent FedAvg aggregation under homogeneous data distribution.
The minimal performance degradation from centralized to federated training under IID conditions suggests that the local data distributions are sufficiently representative of the global dataset, enabling effective model aggregation. This finding is particularly relevant for IoT deployments where data heterogeneity may be naturally lower than in more diverse network environments. It is important to note that in a few cases federated learning achieved marginally higher accuracy than centralized training (differences up to 0.5 percentage points). These differences are not statistically significant and fall within normal stochastic variation caused by random initialization, mini-batch sampling, and optimization dynamics. Repeated runs show that the observed differences lie within the standard deviation of the training process. Therefore, these results should be interpreted as practically equivalent performance, confirming that federated learning closely matches centralized training rather than outperforming it.
The Non-IID scenario in Experiment 2 presented a more challenging learning scenario where client data distributions were severely skewed, potentially causing local models to diverge and complicating the aggregation process. Despite these challenges, the federated framework demonstrated robust learning capability.
With raw network features, the Non-IID federated configuration achieved a final test accuracy of 80.0%, a substantial reduction of 17.4 percentage points compared to the IID federated configuration (97.4%). This degradation demonstrates that the complex residual architecture is considerably more sensitive to non-uniform data distributions across clients than the simpler MLP used in Experiment 1. Unlike Experiment 1, where Non-IID degradation was negligible under the settings considered in this study, Experiment 2 reveals that increased model complexity alone does not guarantee robustness to statistical heterogeneity in federated settings.
Amino acid encoding under Non-IID conditions achieved 81.5% accuracy (0.8146). This represents a substantial reduction of 15.6 percentage points compared to the IID federated encoded result of 97.1% (0.9714), mirroring the significant Non-IID degradation observed for raw features in Experiment 2. Both feature representations exhibit considerable sensitivity to non-uniform client data distributions under the complex ResNet architecture, with raw features (17.4 pp drop) and encoded features (15.6 pp drop) showing comparably large degradation under Non-IID conditions.
Training dynamics under Non-IID conditions (Figure 6) reveal more volatile convergence behavior, particularly in early communication rounds. The step-wise improvement pattern is more pronounced, with larger accuracy jumps following aggregation events. Both configurations stabilize after approximately 70 communication rounds, slightly later than under IID conditions.
The sensitivity of federated learning to Non-IID data distributions is a critical consideration for IoT intrusion detection, as individual IoT devices are likely to encounter different attack patterns based on their network role and connectivity patterns.
Performance metrics are computed using standard and reproducible definitions. For binary classification (Experiment 1), Matthews correlation coefficient (MCC) is calculated using the standard binary formulation. For multi-class classification (Experiment 3), MCC follows the generalized multi-class definition based on the full confusion matrix, ensuring consistent evaluation across all classes. The Area Under the ROC Curve (AUC-ROC) is computed using a one-vs.-rest strategy, and results are reported using macro-averaging to equally weight all classes and avoid bias toward majority classes. All metrics are computed on the held-out test set for final performance reporting.

3.3. Comparative Analysis

Comparing Experiments 1, 2, and 3 reveals that model performance is influenced more strongly by task formulation and data distribution than by architectural depth alone. While the simple MLP achieved the highest accuracy in the binary setting, the complex residual network demonstrated greater robustness when the task complexity increased to multi-class classification. Notably, severe Non-IID conditions in the binary experiment produced substantial degradation for raw features, whereas the multi-class configuration remained stable, suggesting that richer class structure may contribute to more stable aggregation behavior in our experiments. Table 7 provides a high-level comparison of classification accuracy across centralized and federated configurations for Experiments 1 and 2.
For raw network features, the simple MLP architecture in Experiment 1 achieves slightly higher binary accuracy (98.6–99.0%) than the complex residual network in Experiment 2 (97.4–98.0%), indicating that the additional capacity provides no measurable benefit for the low-dimensional binary task. A direct comparison between the binary performance of the MLP and the four-class performance of the residual network in Experiment 3 must also be interpreted with caution, as the two tasks differ in intrinsic difficulty: the multi-class setting introduces higher class overlap and distribution complexity, increasing decision-boundary ambiguity. The observed performance differences therefore cannot be attributed solely to model architecture; they reflect the combined effect of task complexity, class separability, and feature representation. When viewed alongside MCC and AUC-ROC values, both architectures demonstrate strong and consistent discriminative capability within their respective problem settings.
For amino acid-encoded features, performance levels are remarkably similar across both experiments, with accuracies ranging from approximately 92.5% to 93.2% in Experiment 1 and 92.9% to 93.3% in Experiment 3. This consistency, reinforced by comparable MCC and AUC-ROC values, indicates that the encoding process constitutes a dominant performance bottleneck that cannot be fully mitigated by increasing model depth alone. The modest gains observed with the residual architecture suggest that additional capacity can partially compensate for representation compression, but does not eliminate its impact.
Across both experiments, federated learning remains closely aligned with centralized training. Accuracy differences typically remain below 0.5 percentage points, and MCC values exhibit similarly small deviations. This consistency confirms the robustness of the FedAvg aggregation strategy across different model complexities and classification tasks, even under pronounced Non-IID data distributions.
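The FedAvg aggregation referenced here is a sample-size-weighted average of client model parameters [1]. A minimal sketch in plain Python, with flat parameter lists standing in for real model tensors:

```python
def fedavg(client_weights, client_sizes):
    """Weighted average of per-client parameter vectors (FedAvg).

    client_weights: list of per-client parameter lists (all the same length).
    client_sizes:   number of local training samples held by each client.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n / total for w, n in zip(client_weights, client_sizes))
        for i in range(n_params)
    ]

# Example: two clients; the larger one pulls the average toward its weights
global_w = fedavg([[1.0, 2.0], [3.0, 4.0]], [25, 75])
# → [2.5, 3.5]
```

In a real deployment the same weighted average is applied per tensor of the model state, but the aggregation rule itself is exactly this simple.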
From a systems perspective, the computational cost differs substantially between the two architectures. The complex residual network employs approximately 24 times more parameters and requires twice as many communication rounds as the simple MLP. Given the marginal performance differences—particularly for raw features in the binary classification setting—this increased complexity may not be justified for resource-constrained IoT deployments. These results suggest that model selection should prioritize task complexity and feature representation over architectural depth, with simpler models offering a favorable accuracy–efficiency trade-off in many practical scenarios.

3.4. Statistical Significance of Differences Across All Experiments

Table 8 presents paired t-test results for statistical significance across experimental conditions. Several patterns emerge from these tests. Centralized and federated training differ significantly only for raw features with the complex residual network (Experiment 2, Centralized vs. Fed-NonIID, p < 0.001). In contrast, comparing raw and encoded features under centralized training consistently yields significant differences (p ≤ 0.05 in all experiments). IID versus Non-IID federated learning shows significant differences in binary classification (Experiments 1 and 2, p < 0.05) but not in multi-class classification (Experiment 3, p = 0.069).
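The paired t-statistics in Table 8 can be reproduced from matched per-run metric values. A sketch using illustrative five-run accuracies (not the paper's actual run data):

```python
import math

def paired_t(a, b):
    """Paired t-statistic for two matched samples of equal length n."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance of differences
    return mean / math.sqrt(var / n)

# Illustrative accuracies over 5 runs for two configurations (hypothetical values)
central = [0.986, 0.987, 0.985, 0.986, 0.986]
fed_iid = [0.984, 0.984, 0.984, 0.984, 0.984]
t = paired_t(central, fed_iid)  # ≈ 6.32
```

The corresponding two-sided p-value follows from the t distribution with n − 1 degrees of freedom; `scipy.stats.ttest_rel` computes both the statistic and the p-value in one call.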

4. Discussion

The experimental results provide clear insights into the relative impact of learning paradigm, feature representation, model complexity, and data distribution on IoT malware detection performance. Across all configurations, federated learning consistently achieves performance comparable to centralized training, confirming its suitability for privacy-preserving collaborative intrusion detection in distributed IoT environments.

4.1. Impact of Feature Representation

Across Experiments 2 and 3, amino acid encoding introduced a consistent but moderate reduction in classification performance relative to raw network features. However, the degradation was smaller than initially observed in the simple MLP experiments, suggesting that deeper architectures can partially compensate for representation compression. Interestingly, under highly skewed Non-IID binary conditions, encoded features produced slightly more stable learning than raw features, indicating a potential regularization effect introduced by the encoding transformation.
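The exact encoding follows [18]; as a purely illustrative sketch of the general idea, one can quantize each min-max-normalized traffic feature onto the 20-letter amino acid alphabet. The binning and letter assignment below are hypothetical and do not reproduce the actual scheme of [18]:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid letters

def encode_record(values, mins, maxs):
    """Illustrative: map each numeric feature to one letter by equal-width
    binning of its min-max-normalized value into 20 bins (hypothetical scheme)."""
    letters = []
    for v, lo, hi in zip(values, mins, maxs):
        frac = 0.0 if hi == lo else (v - lo) / (hi - lo)
        letters.append(AMINO_ACIDS[min(int(frac * 20), 19)])
    return "".join(letters)

# A record with features at the low, middle, and high end of their ranges
seq = encode_record([0.0, 0.5, 1.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
# → "AMY"
```

A sketch like this makes the compression explicit: each continuous feature collapses to one of 20 symbols, which is the information loss the discussion above attributes to the encoding.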

4.2. Federated Learning Under Data Heterogeneity

The results demonstrate that federated learning remains highly stable under IID conditions and moderately robust under Non-IID settings. While severe Non-IID distribution caused substantial degradation in the binary experiment with raw features, the multi-class experiment showed significantly improved stability. This suggests that richer label structure may help federated aggregation converge more reliably, even when client data distributions differ substantially.
Amino acid encoding compresses heterogeneous network features into a lower-dimensional structural representation, reducing feature diversity and removing some fine-grained discriminative information present in raw network attributes. Under Non-IID conditions, this reduced redundancy limits the model’s ability to compensate for skewed local data distributions during federated aggregation. As a result, encoded features exhibit slightly higher sensitivity to data heterogeneity compared to raw features. This behavior highlights the interaction between feature representation and federated learning stability, where compressed representations may trade robustness for compactness.
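The label-skew conditions discussed here are commonly simulated by partitioning samples across clients with a Dirichlet distribution [26], where smaller alpha produces more severe Non-IID skew. A sketch using numpy; the function name and split logic are illustrative:

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with Dirichlet label skew.

    Smaller alpha -> each class concentrates on fewer clients (more skew).
    """
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        props = rng.dirichlet(alpha * np.ones(n_clients))  # per-client share of class c
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for cid, part in enumerate(np.split(idx, cuts)):
            client_idx[cid].extend(part.tolist())
    return client_idx

# 100 samples, two classes, four clients with alpha = 0.5 (strong skew)
parts = dirichlet_partition(np.array([0] * 60 + [1] * 40), 4, 0.5)
```

Every index is assigned to exactly one client, so the partition is a true disjoint split; sweeping alpha (e.g., 100 for near-IID down to 0.1 for extreme skew) reproduces the heterogeneity spectrum considered in the experiments.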

4.3. Role of Model Complexity

To quantify this trade-off concretely, the MLP requires only 11.3 KB of memory and incurs approximately 1.1 MB of total per-client communication over 50 rounds, whereas the ResNet demands 275.5 KB of memory and approximately 53.8 MB of communication over 100 rounds—a ∼24.5× increase in model footprint and ∼49× increase in cumulative communication overhead (see Table 9). Training time increases from 45–90 min (MLP) to 4–6 h (ResNet) per run. Against these costs, the ResNet provides no accuracy gain for binary classification with raw features (98.6% MLP vs. 97.98% ResNet) and only approximately 0.2 percentage points improvement for encoded features in the binary setting. These figures confirm that for binary IoT malware detection on low-dimensional tabular features, the MLP delivers a substantially more favorable performance-to-cost ratio. The ResNet’s role in this study is to quantify the ceiling effect of architectural depth on this feature space, not to propose it as a practical deployment option for resource-constrained IoT devices. Practitioners should treat the MLP as the recommended baseline for binary IoT intrusion detection scenarios.
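The footprint and communication figures in Table 9 follow directly from the parameter counts, assuming 4-byte float32 parameters and one model upload plus one download per client per round:

```python
def fl_costs(n_params, rounds, bytes_per_param=4):
    """Model size (KB) and total per-client communication (MB) for FedAvg.

    Each round a client downloads the global model and uploads its update,
    so traffic per round is twice the model size.
    """
    model_kb = n_params * bytes_per_param / 1024
    total_mb = model_kb * 2 * rounds / 1024
    return model_kb, total_mb

mlp_kb, mlp_mb = fl_costs(2882, 50)     # ≈ 11.3 KB model, ≈ 1.1 MB total
res_kb, res_mb = fl_costs(70532, 100)   # ≈ 275.5 KB model, ≈ 53.8 MB total
```

These two lines reproduce the ~24.5× model-size ratio and ~49× cumulative-communication ratio reported in Table 9.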
From a deployment perspective, the substantial increase in model parameters and communication rounds required by the complex architecture may not be justified for resource-constrained IoT devices, particularly for binary classification tasks where simpler models already achieve near-optimal performance.
Regarding the optimization strategy, widely adopted optimizers (Adam for the simple MLP and SGD for the complex residual network) were selected to ensure stability, reproducibility, and comparability with existing federated learning studies. The objective of this work is to analyze the relative impact of feature representation, data heterogeneity, and learning paradigm rather than to optimize absolute performance through the choice of optimizer. While newer optimizers such as AdaBoB [33] may further improve convergence behavior, all experimental configurations in this study use identical optimization settings, ensuring fair comparison. The evaluation of the proposed framework under alternative optimizers, including AdaBoB, is identified as an important direction for future work to further assess robustness.
The limited performance gain of the residual architecture can be explained by the low-dimensional tabular nature of the feature space (10–11 variables), where model capacity is not the dominant performance constraint. The simple MLP is already sufficient to capture the underlying decision boundaries in the binary detection task, which explains why the deeper residual network provides only marginal improvement and may slightly underperform due to increased optimization complexity. Residual connections typically provide greater benefit in high-dimensional or strongly non-linear problems, whereas in compact IoT traffic feature spaces additional depth introduces extra capacity without significantly enhancing discriminative power.
The relationship between parameter quantity and performance cannot be fully understood by parameter count alone. From a fitting perspective, the evaluated models operate near an adequate-capacity regime, where the feature space dimensionality is relatively low and no strong overfitting behavior is observed as parameter size increases. Consequently, performance gains diminish beyond a certain complexity threshold. Interpreting the results through a parameter quantity versus fitting performance (PQS-FP, [34]) perspective suggests that both the MLP and residual models lie in a stable fitting region rather than clear underfitting or overfitting zones, which explains the weak correlation between parameter growth and accuracy improvements observed in this study.

4.4. Practical Implications

These findings suggest that federated learning should be viewed primarily as an enabling framework for privacy-preserving collaboration rather than as a mechanism for improving classification accuracy. Feature representation and task formulation exert a greater influence on performance than whether training is centralized or federated. Organizations deploying IoT intrusion detection systems should therefore prioritize careful feature engineering and model selection based on deployment constraints, data heterogeneity, and computational resources.

4.5. Limitations and Future Directions

This study focuses on FedAvg as the aggregation algorithm, which represents the most widely adopted approach in IoT federated learning deployments and serves as the standard baseline in federated learning research. While our results demonstrate robust performance under Non-IID conditions, more sophisticated heterogeneity-aware algorithms such as FedProx, SCAFFOLD, and FedNova have been specifically designed to address data heterogeneity, as mentioned in Section 1.
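For context, FedProx [4] changes only the local objective, adding a proximal term that penalizes drift from the current global model while leaving the server-side averaging untouched. A minimal sketch with flattened parameter lists and hypothetical names:

```python
def fedprox_local_objective(task_loss, local_w, global_w, mu):
    """Local FedProx objective: task loss + (mu / 2) * ||w - w_global||^2."""
    prox = 0.5 * mu * sum((w - g) ** 2 for w, g in zip(local_w, global_w))
    return task_loss + prox

# With mu = 0.1 and a local drift of [1.0, 2.0] from the global model:
loss = fedprox_local_objective(1.0, [1.0, 2.0], [0.0, 0.0], 0.1)
# → 1.0 + 0.05 * (1 + 4) = 1.25
```

Setting mu = 0 recovers plain FedAvg local training, which is why FedProx is a natural first candidate among the heterogeneity-aware alternatives mentioned above.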
Comparative evaluation of these advanced aggregation strategies would strengthen our understanding of how different federated optimization approaches interact with feature representation and model complexity. Similarly, while our focus on neural network architectures aligns with the dominant approach in federated intrusion detection, evaluating non-neural baselines such as gradient-boosted trees or federated random forests would provide additional context for the performance–efficiency trade-offs observed in IoT tabular data scenarios.
These extensions remain important directions for future work and would further validate the generalizability of our findings across different federated learning paradigms and model families.

5. Conclusions

This study evaluated federated learning for IoT malware detection across three experimental settings, examining the interaction between feature representation, data distribution, task complexity, and neural network architecture. The results confirm that federated learning achieves performance comparable to centralized training under IID conditions and remains reasonably robust under Non-IID distributions, though severe skew can significantly impact binary detection performance. This confirms federated learning as a viable and effective solution for collaborative IoT intrusion detection without requiring the centralization of sensitive network traffic data.
Amino acid encoding introduces a measurable but moderate performance trade-off. While accuracy is consistently lower than with raw features, the degradation is less pronounced in deeper architectures and may provide limited stabilization effects under highly skewed federated conditions. These findings suggest that encoding acts both as a compression mechanism and a form of implicit regularization. The experiments were conducted on a 10,000-sample subset under moderate Non-IID severity; therefore, conclusions should not be generalized to large-scale or extreme heterogeneity scenarios without further validation.
The comparison between simple and complex neural architectures reveals that increased model complexity yields only modest benefits, primarily in multi-class classification scenarios. This observation should not be interpreted as a universal property. The earlier comparison involved different task settings (binary vs. multi-class), which do not isolate the effect of model complexity alone. When task difficulty is controlled for, lightweight architectures remain competitive primarily in low-dimensional and moderate-complexity IoT detection scenarios, where additional model capacity does not necessarily translate into measurable performance gains. Therefore, the benefit of lightweight models in this study should be understood as context-dependent rather than a general conclusion.
Overall, this work provides empirical evidence that federated learning enables effective and privacy-preserving IoT malware detection under realistic data heterogeneity. Future research should focus on developing encoding schemes specifically tailored to network traffic characteristics, exploring adaptive aggregation strategies for highly Non-IID environments, and evaluating federated learning across diverse IoT datasets to further validate generalization and robustness. Additionally, investigation of heterogeneity-aware federated learning algorithms (FedProx, SCAFFOLD, FedNova) is warranted to assess whether they provide additional benefits beyond FedAvg under extreme data skew, along with evaluation of non-neural baselines such as gradient-boosted trees for comprehensive performance comparison in IoT tabular intrusion detection scenarios.

Author Contributions

Conceptualization, T.A.I., S.K., M.K. and M.U.R.; methodology, T.A.I. and S.K.; software, T.A.I.; investigation, T.A.I.; resources, S.K.; writing—original draft preparation, T.A.I. and S.K.; writing—review and editing, T.A.I., S.K., M.K., I.K., T.A. and M.U.R.; visualization, T.A.I.; supervision, S.K., M.K. and M.U.R.; project administration, S.K.; funding acquisition, S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by De Montfort University for computational facilities VC2020 new staff L SL 2020.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data are available at [27]. To support reproducibility, all source code is released in a public repository [25].

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DoS  Denial of Service
FL  Federated Learning
IID  Independent and Identically Distributed
IoT  Internet of Things
MCC  Matthews correlation coefficient
MLP  Multi-Layer Perceptron
NIDS  Network Intrusion Detection System
PQS-FP  Parameter Quantity Shifting–Fitting Performance

References

1. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A.y. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282.
2. Hernandez-Ramos, J.L.; Karopoulos, G.; Chatzoglou, E.; Kouliaridis, V.; Marmol, E.; Gonzalez-Vidal, A.; Kambourakis, G. Intrusion Detection Based on Federated Learning: A Systematic Review. ACM Comput. Surv. 2025, 57, 1–36.
3. Yang, H.; Wang, Z.; Chou, B.; Xu, S.; Wang, H.; Wang, J.; Zhang, Q. An Empirical Study of the Impact of Federated Learning on Machine Learning Model Accuracy. arXiv 2025.
4. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated Optimization in Heterogeneous Networks. In Proceedings of the Machine Learning and Systems (MLSys), Austin, TX, USA, 2–4 March 2020; Volume 2, pp. 429–450.
5. Karimireddy, S.P.; Kale, S.; Mohri, M.; Reddi, S.; Stich, S.; Suresh, A.T. SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. In Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual Event, 13–18 July 2020; Volume 119, pp. 5132–5143.
6. Wang, J.; Liu, Q.; Liang, H.; Joshi, G.; Poor, H.V. Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual Event, 6–12 December 2020; Volume 33, pp. 7611–7623.
7. Yurdem, B.; Kuzlu, M.; Gullu, M.K.; Catak, F.O.; Tabassum, M. Federated learning: Overview, strategies, applications, tools and future directions. Heliyon 2024, 10, e38137.
8. Bo, L.; Huang, H.; Gu, S.; Chen, Y. Federated Learning: From Algorithms To System Implementation; World Scientific Publishing Company: Singapore, 2024.
9. Garst, S.; Dekker, J.; Reinders, M. A comprehensive experimental comparison between federated and centralized learning. Database 2025, 2025, baaf016.
10. Selvam, P.; Karthikeyan, P.; Manochitra, S.; Sujith, A.V.L.N.; Ganesan, T.; Ayyasamy, R.; Shuaib, M.; Alam, S.; Rajendran, A. Federated learning-based hybrid convolutional recurrent neural network for multi-class intrusion detection in IoT networks. Discov. Internet Things 2025, 5, 39.
11. Lu, Z.; Pan, H.; Dai, Y.; Si, X.; Zhang, Y. Federated Learning With Non-IID Data: A Survey. IEEE Internet Things J. 2024, 11, 19188–19209.
12. Zhu, H.; Xu, J.; Liu, S.; Jin, Y. Federated learning on non-IID data: A survey. Neurocomputing 2021, 465, 371–390.
13. Bilal, M.A.; Ul Islam, I.; Idrees, S.; Qasim, M.; Khan, M.J.; Khan, J. Dataset-centric evaluation of federated intrusion detection models in IoT networks. Sci. Rep. 2026, 16, 2683.
14. Khraisat, A.; Talukder, M.A.; Uddin, M.A.; Alazab, A. RF-FedAvg: Federated learning-based random forest model for intrusion detection in wireless sensor networks. Clust. Comput. 2025, 28, 873.
15. Rehman, M.U.; Abrar, M.; Khalid, S.; Kazim, M.; Singh, V.K. Metaheuristically Enhanced ANN-Based Intrusion Detection System with Explainable AI Integration. In Proceedings of the 2025 International Joint Conference on Neural Networks (IJCNN), Rome, Italy, 6–11 July 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–8.
16. García-Teodoro, P.; Díaz-Verdejo, J.; Maciá-Fernández, G.; Vázquez, E. Anomaly-based network intrusion detection: Techniques, systems and challenges. Comput. Secur. 2009, 28, 18–28.
17. Arun Joseph, A.; Ranjani, P.; Suresh Kumar, V. Federated Deep Learning-Based Intrusion Detection System for Multi-Attack Detection in MANETs. Int. J. Res. Appl. Sci. Eng. Technol. 2025, 13, 1057–1065.
18. Ibaisi, T.A.; Kuhn, S.; Kaiiali, M.; Kazim, M. Network Intrusion Detection Based on Amino Acid Sequence Structure Using Machine Learning. Electronics 2023, 12, 4294.
19. Rashid, O.F.; Othman, Z.A.; Zainudin, S.; Samsudin, N.A. DNA Encoding and STR Extraction for Anomaly Intrusion Detection Systems. IEEE Access 2021, 9, 31892–31907.
20. Arnob, A.K.B.; Chowdhury, R.R.; Chaiti, N.A.; Saha, S.; Roy, A. A comprehensive systematic review of intrusion detection systems: Emerging techniques, challenges, and future research directions. J. Edge Comput. 2025, 4, 73–104.
21. Cho, H.; Lim, S.; Belenko, V.; Kalinin, M.; Zegzhda, D.; Nuralieva, E. Application and improvement of sequence alignment algorithms for intrusion detection in the Internet of Things. In Proceedings of the 2020 IEEE Conference on Industrial Cyberphysical Systems (ICPS), Tampere, Finland, 10–12 June 2020; Volume 1, pp. 93–97.
22. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; Bach, F., Blei, D., Eds.; PMLR: Cambridge, MA, USA, 2015; Volume 37, pp. 448–456.
23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
24. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017.
25. Ibaisi, T.A. FedAminoIoT: Federated Learning Framework for IoT Intrusion Detection with Amino Acid Encoding. Available online: https://github.com/stefhk3/federated-learning-and-IoT.git (accessed on 15 January 2026).
26. Hsu, T.M.H.; Qi, H.; Brown, M. Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification. arXiv 2019.
27. Iqbal, M.F. CTU-IoT-Malware IoT Network Traffic. Available online: https://www.kaggle.com/datasets/agungpambudi/network-malware-detection-connection-analysis/data (accessed on 5 December 2025).
28. Garcia, S.; Parmisano, A.; Erquiaga, M.J. IoT-23: A Labeled Dataset with Malicious and Benign IoT Network Traffic (Version 1.0.0) [Data Set]. 2020. Available online: https://zenodo.org/records/4743746 (accessed on 5 January 2026).
29. Biopython. Analyzing Protein Sequences with the Protparam Module. Available online: https://biopython.org/wiki/ProtParam (accessed on 10 March 2026).
30. PyTorch. PyTorch: An Imperative Style, High-Performance Deep Learning Library. 2024. Available online: https://pytorch.org/ (accessed on 15 January 2026).
31. Jarrett, K.; Kavukcuoglu, K.; Ranzato, M.; LeCun, Y. What is the best multi-stage architecture for object recognition? In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 27 September–4 October 2009; pp. 2146–2153.
32. Gorodkin, J. Comparing two K-category assignments by a K-category correlation coefficient. Comput. Biol. Chem. 2004, 28, 367–374.
33. Xiang, Q.; Wang, X.; Lei, L.; Song, Y. Dynamic bound adaptive gradient methods with belief in observed gradients. Pattern Recognit. 2025, 168, 111819.
34. Xiang, Q.; Wang, X.; Lai, J.; Lei, L.; Song, Y.; He, J.; Li, R. Quadruplet depth-wise separable fusion convolution neural network for ballistic target recognition with limited samples. Expert Syst. Appl. 2024, 235, 121182.
Figure 1. Performance comparison for Experiment 1: accuracy, F1-score, AUC-ROC, and MCC across centralized, federated IID, and federated Non-IID configurations with and without amino acid encoding.
Figure 2. Federated learning training curves for Experiment 1: loss and accuracy over 50 communication rounds for IID and Non-IID configurations.
Figure 3. Centralized training curves for Experiment 2 comparing raw features (blue) and amino acid-encoded features (orange). Both models converge within 50 epochs, with raw features achieving higher final accuracy.
Figure 4. Confusion matrices for centralized training in Experiment 3 with (a) raw features and (b) encoded features. Both models correctly classify the majority of Benign traffic, with increased errors for encoded features in minority attack categories. Confusion matrix normalized by true class (rows), so diagonal entries correspond to per-class Recall.
Figure 5. Federated learning training curves for Experiment 2 under IID data distribution for raw features (blue) and amino acid-encoded features (orange). Both models achieve stable performance within 50 communication rounds.
Figure 6. Federated learning training curves for Experiment 2 under Non-IID data distribution for raw features (blue) and amino acid-encoded features (orange). Both models show increased volatility but converge to stable performance by round 70.
Table 5. Unified results for Experiment 3 (Complex Residual Network) across learning paradigms and feature representations using four attack classes (Benign, DoS, Probe, Web Attack). Precision, Recall, and F1-Score are computed using macro-averaging (equal weight to all classes). AUC-ROC is computed using one-vs.-rest macro-averaging. Results reported as mean ± standard deviation over 5 independent runs.
Features | Configuration | Accuracy | Precision | Recall | F1-Score | AUC-ROC | MCC
Raw | Centralized | 0.9770 ± 0.0014 | 0.9750 ± 0.0042 | 0.9795 ± 0.0067 | 0.9794 ± 0.0025 | 0.9889 ± 0.0053 | 0.9569 ± 0.0069
Raw | Federated IID | 0.9779 ± 0.0060 | 0.9761 ± 0.0033 | 0.9754 ± 0.0008 | 0.9735 ± 0.0009 | 0.9881 ± 0.0015 | 0.9497 ± 0.0070
Raw | Federated Non-IID | 0.9696 ± 0.0051 | 0.9717 ± 0.0036 | 0.9746 ± 0.0028 | 0.9721 ± 0.0051 | 0.9883 ± 0.0032 | 0.9465 ± 0.0031
Amino acid-encoded | Centralized | 0.9294 ± 0.0067 | 0.9308 ± 0.0062 | 0.9316 ± 0.0028 | 0.9298 ± 0.0068 | 0.9709 ± 0.0032 | 0.8606 ± 0.0083
Amino acid-encoded | Federated IID | 0.9332 ± 0.0037 | 0.9335 ± 0.0071 | 0.9304 ± 0.0043 | 0.9371 ± 0.0056 | 0.9738 ± 0.0056 | 0.8665 ± 0.0127
Amino acid-encoded | Federated Non-IID | 0.9295 ± 0.0042 | 0.9226 ± 0.0102 | 0.9275 ± 0.0032 | 0.9294 ± 0.0051 | 0.9685 ± 0.0010 | 0.8532 ± 0.0091
Table 6. Per-class performance metrics for Experiment 3 (Complex Residual Network, multi-class classification). Metrics shown for each of the four classes: Benign, DoS (Denial of Service), Probe (reconnaissance), and Web Attack. All metrics computed on the test set.
Features | Configuration | Benign (P/R/F1) | DoS (P/R/F1) | Probe (P/R/F1) | Web Attack (P/R/F1)
Raw | Centralized | 0.987/0.991/0.989 | 0.965/0.958/0.961 | 0.972/0.968/0.970 | 0.988/0.997/0.992
Raw | Fed-IID | 0.985/0.989/0.987 | 0.961/0.954/0.958 | 0.968/0.965/0.967 | 0.986/0.996/0.991
Raw | Fed-NonIID | 0.983/0.987/0.985 | 0.958/0.951/0.954 | 0.965/0.962/0.964 | 0.990/0.996/0.993
Amino Acid-Encoded | Centralized | 0.952/0.965/0.958 | 0.901/0.885/0.893 | 0.918/0.902/0.910 | 0.949/0.972/0.960
Amino Acid-Encoded | Fed-IID | 0.954/0.967/0.960 | 0.905/0.889/0.897 | 0.920/0.905/0.912 | 0.951/0.974/0.962
Amino Acid-Encoded | Fed-NonIID | 0.949/0.963/0.956 | 0.898/0.881/0.889 | 0.915/0.899/0.907 | 0.950/0.972/0.961
Table 7. Comparative performance analysis between Experiment 1 (Simple MLP) and Experiments 2–3 (Complex Residual Network). Experiment 1 and Experiment 2 correspond to binary classification, while Experiment 3 evaluates the four-class intrusion detection task (Benign/DoS/Probe/Web Attack).
Configuration | Raw Features, Exp. 1 | Raw Features, Exp. 2 | Encoded Features, Exp. 1 | Encoded Features, Exp. 2
Centralized | 98.6% | 98.0% | 92.9% | 97.6%
Federated IID | 98.6% | 97.4% | 93.2% | 97.1%
Federated Non-IID | 99.0% | 80.0% | 92.5% | 81.5%
Parameters | 2882 | 70,532 | 2882 | 70,532
Communication Rounds | 50 | 100 | 50 | 100
Table 8. Statistical significance tests (paired t-tests) across all experiments. Significance levels: *** p < 0.001, * p < 0.05, ns = not significant.
Experiment | Comparison | t-Statistic | p-Value | Significance
Experiment 1: Simple MLP (Binary) | Centralized vs. Fed-IID (Raw) | −0.207 | 0.846 | ns
Experiment 1: Simple MLP (Binary) | Centralized vs. Fed-NonIID (Raw) | −2.228 | 0.090 | ns
Experiment 1: Simple MLP (Binary) | Fed-IID vs. Fed-NonIID (Raw) | −3.537 | 0.024 | *
Experiment 1: Simple MLP (Binary) | Raw vs. Encoded (Centralized) | 19.261 | <0.001 | ***
Experiment 1: Simple MLP (Binary) | Raw vs. Encoded (Fed-IID) | 17.617 | <0.001 | ***
Experiment 2: Complex Residual Network (Binary) | Centralized vs. Fed-IID (Raw) | 2.484 | 0.068 | ns
Experiment 2: Complex Residual Network (Binary) | Centralized vs. Fed-NonIID (Raw) | 31.023 | <0.001 | ***
Experiment 2: Complex Residual Network (Binary) | Fed-IID vs. Fed-NonIID (Raw) | 27.169 | <0.001 | ***
Experiment 2: Complex Residual Network (Binary) | Raw vs. Encoded (Centralized) | 2.780 | 0.050 | *
Experiment 3: Complex Residual Network (Multi-class) | Centralized vs. Fed-IID (Raw) | −1.027 | 0.362 | ns
Experiment 3: Complex Residual Network (Multi-class) | Centralized vs. Fed-NonIID (Raw) | 2.176 | 0.095 | ns
Experiment 3: Complex Residual Network (Multi-class) | Fed-IID vs. Fed-NonIID (Raw) | 2.468 | 0.069 | ns
Experiment 3: Complex Residual Network (Multi-class) | Raw vs. Encoded (Centralized) | 18.355 | <0.001 | ***
Table 9. Quantitative performance–cost comparison between the Simple MLP (Experiment 1) and Complex Residual Network (Experiment 2, binary task). Communication cost is computed as total per-client upload and download volume over all rounds (float32 parameters). Training time is measured on an NVIDIA RTX 3090 GPU.
Architecture | Parameters | Model Size | Comm. per Round | Total Comm. (per Client) | Rounds | Training Time
Simple MLP | 2882 | 11.3 KB | 11.3 KB | ∼1.1 MB | 50 | 45–90 min
Complex ResNet | 70,532 | 275.5 KB | 275.5 KB | ∼53.8 MB | 100 | 4–6 h
Ratio (ResNet/MLP) | 24.5× | 24.5× | 24.5× | ∼49× | 2× | 3–8×

Architecture | Raw Accuracy, Centralized | Raw Accuracy, Fed-IID | Raw Accuracy, Fed-NonIID | Encoded Accuracy, Centralized | Encoded Accuracy, Fed-IID | Encoded Accuracy, Fed-NonIID
Simple MLP | 98.59% | 98.56% | 98.01% | 92.92% | 93.23% | 92.47%
Complex ResNet | 97.98% | 97.40% | 79.96% | 97.60% | 97.14% | 81.46%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ibaisi, T.A.; Kuhn, S.; Kazim, M.; Kara, I.; Altindag, T.; Rehman, M.U. A Comparative Study of Federated Learning and Amino Acid Encoding with IoT Malware Detection as a Case Study. Big Data Cogn. Comput. 2026, 10, 111. https://doi.org/10.3390/bdcc10040111

