Enhancing Internet of Things Network Security Using Hybrid CNN and XGBoost Model Tuned via Modified Reptile Search Algorithm

Abstract: This paper addresses the critical security challenges in the internet of things (IoT) landscape by implementing an innovative solution that combines convolutional neural networks (CNNs) for feature extraction and the XGBoost model for intrusion detection. By customizing the reptile search algorithm for hyperparameter optimization, the methodology provides a resilient defense against emerging threats in IoT security. By applying the introduced algorithm to hyperparameter optimization, better-performing models are constructed, capable of efficiently handling intrusion detection. Two experiments are carried out to evaluate the introduced technique. The first experiment tackles detection through binary classification. The second experiment handles the task by specifically identifying the type of intrusion through multi-class classification. A publicly accessible real-world dataset has been utilized for experimentation, and several contemporary algorithms have been subjected to a comparative analysis. The introduced algorithm constructed models with the best performance in both cases. The outcomes have been meticulously statistically evaluated, and the best-performing model has been analyzed using Shapley additive explanations to determine feature importance for model decisions.


Introduction
The internet of things (IoT) has ushered in a new era of connectivity, transforming various industries by enabling faster sensor and data access. This enhanced networking capability has been pivotal in facilitating real-time monitoring, which is indispensable for further process optimization across sectors. Real-time data acquisition has significantly improved healthcare by enabling timely patient monitoring and emergency notification systems [1]. In healthcare, IoT devices range from blood pressure and heart rate monitors to advanced devices capable of monitoring specialized implants. The internet of medical things (IoMT) has even led to the creation of automated systems used to analyze health statistics [2]. The industrial application of the IoT includes asset management, predictive maintenance, and manufacturing process control. Overall, the IoT's integration into various domains is revolutionizing the way we live and work, providing more efficient, cost-effective, and intelligent solutions [3].
In the manufacturing sector, the IoT has streamlined operations, ensuring efficient production and quality control. The integration of IoT devices has revolutionized the way machines and systems communicate and interact with each other, forming the backbone of smart factories. Through IoT, manufacturers can monitor and control their production processes in real time, with sensors embedded in machines, equipment, and products collecting data on various parameters such as temperature, pressure, vibration, and performance metrics. This data can be transmitted and analyzed instantly, providing valuable insights into operations. IoT also helps manufacturers achieve greater visibility and transparency across the entire supply chain, optimizing inventory management, reducing stockouts, and improving order fulfillment. Technological advancements continue to benefit manufacturers by opening new revenue streams, improving industrial safety, and reducing operational costs, reshaping the way manufacturers operate in the digital revolution [4].
The widespread adoption of the IoT is not without challenges. Devices often grapple with limited battery lifetimes, the need to function in remote locations, and demanding transceiver operations. Among these challenges, security stands out as the most daunting. While a device running out of battery is an observable setback, a data breach, often clandestine, can wreak havoc. IoT devices must communicate with each other and central systems, requiring complex transceiver operations. This can lead to high energy consumption and potential interference with other devices, complicating the network [5]. Security is indeed one of the most critical challenges in the IoT ecosystem. The interconnected nature of these devices means that a breach in one device can potentially compromise an entire network. Issues such as weak authentication, lack of encryption, and insecure interfaces can lead to unauthorized access and data theft [6]. Radio frequency (RF) attacks have become a prevalent attack vector within the IoT ecosystem. Attackers can exploit vulnerabilities in wireless communication protocols to intercept, modify, or disrupt the RF signals between devices. This can lead to data leakage, device malfunction, or even taking control of the devices [7]. Message Queuing Telemetry Transport (MQTT) is one of the standard application layer protocols emerging in the IoT ecosystem [8]. As an emerging technology, it is a popular target for malicious actors seeking new vulnerabilities [9].
To counter these threats, several network security measures, such as blocklists and firewalls, have been implemented [10,11]. However, artificial intelligence (AI) algorithms aim to address security challenges without the constraints of predefined rules or continuous manual intervention [12]. A pivotal factor influencing AI's performance is the judicious selection of hyperparameters that steer the algorithm. With the burgeoning complexity of emerging algorithms, the conventional trial-and-error approach for hyperparameter tuning is becoming increasingly untenable. This optimization challenge can be equated to NP-hard problems, which are notoriously difficult to resolve using discrete methods [13]. A potential respite from this optimization conundrum lies in metaheuristic algorithms. While it is not always possible to pinpoint the absolute optimal solution, their iterative nature enhances the probability of identifying a near-optimal solution. Often, in practical scenarios, a "good enough" solution is more valuable than an elusive perfect one. Additionally, it is important to explore and address emerging challenges that accompany developments in the field.
In the scope of this research, a refined methodology designed to confront the security challenges inherent in the IoT is introduced. The introduced approach incorporates convolutional neural networks (CNNs) to effectively manage feature sizes within the IoT MQTT dataset and employs extreme gradient boosting (XGBoost) for intrusion identification and detection. A distinctive aspect of the introduced methodology lies in the integration of a modified version of a well-established algorithm, specifically tailored for hyperparameter optimization in the unique context of IoT security [14]. This integration represents a notable combination of concepts and techniques, embodying both typical combination novelty and incremental novelty.
The main scientific contributions of this work are:
• Proposing a CNN-centric approach for feature reduction in IoT datasets;
• Utilizing XGBoost for the classification of intrusion events;
• Introducing a modified algorithm specifically designed for optimization;
• Implementing the proposed methodology on real-world data, addressing a pressing real-world challenge.
This paper's organization has been designed to lead the reader through a logical development of concepts and advances. After this brief introduction, Section 2 provides an overview of the pertinent literature and presents a broad overview of related approaches. The suggested approaches are described in Section 3, going into depth on the unique methodologies and algorithms that are the foundation of our contribution. The experimental design is described in Section 4, together with the datasets, metaheuristics, parameters, and metrics used in our study. Our experimental findings are provided in Section 5, followed by analysis and conclusions drawn from them. Section 6 offers a thorough conclusion that summarizes the major findings and considers the wider ramifications of our work, followed by proposals for future work.

Related Works
The history of intrusion detection systems (IDSs) and intrusion prevention systems (IPSs) can be traced back to an academic paper written in 1986 [15]. The Stanford Research Institute developed the Intrusion Detection Expert System (IDES) using statistical anomaly detection, signatures, and profiles to detect malicious network behaviors. In the early 2000s, IDSs became a security best practice, with few organizations adopting IPSs due to concerns about blocking harmless traffic. The focus was on detecting exploits rather than vulnerabilities. The latter part of 2005 saw the growth of IPS adoption, with vendors creating signatures for vulnerabilities rather than individual exploits [16]. The capacity of IPSs increased, allowing for more network monitoring.
Next-generation intrusion prevention systems (NGIPSs), which include capabilities like application and user control, were developed during this time period, marking a significant turning point. Sandboxing and emulation features were added to fulfill the requirement for defense against zero-day malware. By 2016, most businesses had deployed next-generation firewalls (NGFWs), which contain IDS/IPS functionality. High-fidelity machine learning is the current focus for tackling threat detection and file analysis [17].
The groundbreaking academic publication "An Intrusion-Detection Model" by Dorothy E. Denning, which inspired the creation of IDES, is one example of earlier studies addressing intrusion detection in networks. To identify hostile network behaviors, the Stanford Research Institute used statistical anomaly detection, signatures, and profiles. Significant turning points in the development of IPS technology, such as the switch to NGIPSs and NGFWs, have been reached [18].

Convolutional Neural Networks
CNNs are a specialized subclass of artificial neural networks (ANNs) that are particularly well-suited for analyzing visual data. CNNs are designed to automatically and adaptively learn spatial hierarchies of features. This is particularly beneficial for tasks like image recognition, object detection, and even medical image analysis. The concept of residual learning, as introduced by Kaiming He et al., further enhances the capabilities of CNNs by allowing them to benefit from deeper architectures without the risk of overfitting or vanishing gradients [19].
As opposed to ANNs, CNNs employ local connectivity by linking each neuron to a localized region of the input space. This is in stark contrast to traditional ANNs, where each neuron is connected to all neurons in the preceding and following layers. Yann LeCun's paper emphasizes that this local connectivity is crucial for the efficient recognition of localized features in images [20]. Furthermore, they use shared parameters across different regions of the input, which significantly reduces the number of trainable parameters. This is in contrast to traditional ANNs, where each weight is unique, leading to a much larger number of parameters and higher computational costs. CNNs are inherently designed to recognize the same feature regardless of its location in the input space. This is a crucial advantage over traditional ANNs, which lack this form of spatial invariance. Notably, they often employ deeper architectures, which are made computationally feasible through techniques like residual learning, as discussed in Kaiming He et al.'s paper.
The Inception architecture, introduced by Christian Szegedy et al., is another example of a deep yet computationally efficient network [21]. CNNs are designed to be computationally efficient, particularly when dealing with high-dimensional data. The architecture leverages local connectivity and parameter sharing to reduce computational requirements. The concept of residual learning, as discussed in the paper by Kaiming He et al., allows CNNs to be trained more efficiently, even when the network is very deep.
Notably, several unique architectural elements are associated with CNNs, and these include filters, kernels, and pooling layers. Filters and kernels use learnable weight matrices that are crucial for feature extraction. They slide or convolve across the input image to produce feature maps. Yann LeCun's paper highlights the effectiveness of gradient-based learning techniques in training these filters [22]. Pooling layers serve to reduce the spatial dimensions of the input, thereby decreasing computational complexity and increasing the network's tolerance to variations in the input. They are particularly useful in making the network robust to overfitting.
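The contrast between full and local connectivity can be made concrete with a quick parameter count; the layer sizes below are illustrative assumptions, not taken from any cited architecture:

```python
# Local connectivity and parameter sharing, by the numbers (illustrative sizes):
# a fully connected layer mapping a 28x28 input to 32 units needs one weight per
# input-output pair, while a conv layer with 32 filters of size 3x3 reuses the
# same 9 weights at every spatial position (biases omitted for simplicity).
dense_params = 28 * 28 * 32   # one unique weight per connection
conv_params = 32 * 3 * 3      # 32 shared 3x3 kernels
print(dense_params, conv_params)  # 25088 288
```

The two-orders-of-magnitude gap is exactly the saving that makes the deeper architectures discussed above computationally feasible.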
CNNs can be effectively combined with other types of neural networks, like recurrent neural networks (RNNs), for sequential data processing tasks such as video analysis and natural language processing. Additionally, CNNs can be integrated with traditional machine learning algorithms, like support vector machines (SVMs), for tasks like classification, thereby creating a hybrid model that leverages the strengths of both methodologies. In summary, CNNs offer a robust, adaptable, and computationally efficient approach to a wide range of machine learning tasks. Their unique architecture, as validated by seminal research papers, makes them highly effective for tasks involving spatial hierarchies and structured grid data.

Extreme Gradient Boosting
XGBoost is an optimized distributed gradient boosting approach designed to be highly efficient and flexible. It has gained immense popularity in machine learning competitions and is widely regarded as the "go-to" algorithm for structured data. XGBoost has been optimized for both computational speed and model performance, making it highly desirable for real-world applications [23]. There are several advantages of decision-tree-based techniques [24].
One of the most significant advantages of decision trees is their ease of interpretation. They can be visualized, and the decision-making process can be easily understood, even by non-experts. Decision trees are computationally inexpensive to build, evaluate, and interpret compared to algorithms like support vector machines (SVMs) [25] or ANNs. Unlike other algorithms that require extensive pre-processing, decision trees can handle missing values without imputation, making them more robust. Decision trees can also capture complex non-linear relationships in the data, which linear models may not capture effectively. Further, this approach can be used for both classification and regression tasks, making it very versatile.
Gini impurity is a metric used to quantify the disorder or impurity of a set of items. It is crucial for the "criterion" parameter in the decision tree algorithm. Lower Gini impurity values indicate more "pure" nodes. The Gini impurity is used to decide the optimal feature to split on at each node in the tree.
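As an illustration, Gini impurity can be computed directly from the class counts of a node; this toy snippet (not part of the paper's implementation) shows the two boundary cases:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum over classes of p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# A pure node has impurity 0; an even two-class split has impurity 0.5.
print(gini_impurity(["dos", "dos", "dos"]))             # 0.0
print(gini_impurity(["dos", "legit", "dos", "legit"]))  # 0.5
```

A split candidate that lowers the weighted impurity of the child nodes the most is the one the tree selects.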
Further advantages of using XGBoost stem from ensemble learning [26]. Ensemble methods, particularly boosting algorithms like XGBoost, are less susceptible to the overfitting problem compared to single estimators due to their ability to iteratively optimize on the error. By combining several models, ensemble methods can average out biases and reduce the variance, thus minimizing the risk of overfitting. Ensemble methods often achieve higher predictive accuracy than individual models. XGBoost, in particular, has been shown to outperform deep learning models on certain types of data sets, especially when the data are tabular.
The objective function optimized by XGBoost includes both a loss term and a regularization term, making it adaptable to different problems:

Obj = Σ_i l(ŷ_i, y_i) + Σ_k Ω(f_k), with Ω(f) = γT + (1/2)λ‖w‖²,

where l is a differentiable loss measuring the discrepancy between the prediction ŷ_i and the target y_i, f_k denotes the k-th tree in the ensemble, T is the number of leaves in a tree, w its leaf weights, and γ and λ control the regularization strength.
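To make the roles of the two terms concrete, the following toy computation evaluates a logistic loss plus the tree-complexity penalty for hand-picked leaf weights; the function name and its arguments are illustrative assumptions rather than XGBoost's internal API:

```python
import math

def xgb_objective(y_true, y_pred, leaf_weights, gamma=0.1, lam=1.0):
    """Toy regularized objective: logistic loss + gamma*T + 0.5*lambda*||w||^2,
    where T is the number of leaves and w the leaf weights."""
    loss = sum(
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(y_true, y_pred)
    )
    penalty = gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)
    return loss + penalty

# One example with prediction 0.5 and a single leaf of weight 1.0:
print(xgb_objective([1], [0.5], [1.0]))  # ln(2) + 0.1 + 0.5
```

The gamma and lambda knobs here correspond to the same regularization hyperparameters that are later subjected to metaheuristic tuning.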

Metaheuristic Optimization
Metaheuristic optimization algorithms have gained significant attention in the field of computational intelligence for their ability to solve complex optimization problems that are often NP-hard. Traditional optimization algorithms, such as gradient-based methods, often get stuck in local optima and are not well suited for solving problems with large, complex search spaces. In contrast, metaheuristics offer several advantages [27].
Additionally, addressing the challenges of multi-objective optimization problems has been a focal point for many works, leading to the development of various multi-objective evolutionary algorithms [28]. However, a common hurdle in these algorithms is the delicate balance required between diversity and convergence. This balance critically impacts the quality of solutions derived from the algorithms [29].
Designed to explore the entire solution space, metaheuristics often find a near-optimal solution within reasonable time periods. They are problem-independent, meaning they can be applied to a wide range of optimization problems without requiring problem-specific modifications. Metaheuristics are highly scalable and can handle problems with a large number of variables and constraints. They are less sensitive to the initial conditions and can adapt to changes in the problem environment. Metaheuristics can find near-optimal solutions to NP-hard problems in polynomial time, which is a significant advantage over traditional methods that often fail to find feasible solutions within a reasonable time frame.
Algorithms often draw inspiration from various natural phenomena, social behaviors, and physical processes. Some notable examples include genetic algorithms (GAs) [30], inspired by the process of natural selection and genetics; particle swarm optimization (PSO) [31], based on the social behavior of birds flocking or fish schooling; the ant colony optimization (ACO) [32] algorithm, which mimics the foraging behavior of ants in finding the shortest path; and the firefly algorithm (FA) [33], which draws inspiration from the courting rituals of fireflies. Additional recent examples include the salp swarm algorithm (SSA) [34], the whale optimization algorithm (WOA) [35], and the COLSHADE [36] optimization algorithm.
Metaheuristics are a popular approach among researchers for improving hyperparameter selection. Many examples exist in the literature, with some interesting examples originating from medical applications [37]. Further applications include time-series forecasting [38,39] and computer security [40-42]. Hybridization techniques have also shown great promise when applied to metaheuristic algorithms, often producing algorithms that demonstrate performance improvements on given tasks [43].

Methods
The Methods section serves as the backbone of this research, offering a comprehensive and rigorous examination of the algorithms under study. Specifically, this section delves into the original reptile search algorithm (RSA) [44] and a proposed modified version. The objective is to elucidate the mathematical foundations, operational mechanics, and strategies that underpin these algorithms. Moreover, a critical evaluation of their strengths, weaknesses, and potential for further development is presented. The section aims to provide the reader with a deep understanding of the algorithms, thereby setting the stage for the experimental results and discussions that follow.

Original RSA
The RSA, like many optimization metaheuristics, employs global as well as local search to effectively locate promising areas within the search space. This algorithm draws inspiration from nature, mathematically modeling the hunting strategies of crocodiles. As it is a gradient-free population-based method, it can effectively tackle complex challenges.
During the initialization process of the RSA, a population of agents is generated based on stochastic techniques. The population is then evaluated and the best solution is considered near-optimal. The population P can be represented as a matrix of N candidate solutions x_i, each with dimensionality n matching the given challenge. The population is generated in accordance with:

x_{i,j} = rand × (B_upper − B_lower) + B_lower,

where B_lower and B_upper define the lower and upper bounds of the search space, and rand is an arbitrary value from [0, 1]. Once a population is established, the algorithm can proceed with optimization. The utilized strategy is highly dependent on the number of remaining iterations of the optimization. For the exploration mechanism, two behaviors are distinctly simulated. The first one simulates crocodile high walking; the second strategy simulates belly walking. These are mathematically modeled as:

x_{i,j}(t+1) = B_j(t) × (−η_{i,j}(t)) × β − R_{i,j}(t) × rand, for t ≤ T/4 (high walking),
x_{i,j}(t+1) = B_j(t) × x_{r1,j} × ES(t) × rand, for T/4 < t ≤ T/2 (belly walking),

where, in this context, B_j(t) symbolizes the j-th component of the best candidate attained so far. Randomness is introduced using the rand value selected from [0, 1]. The iterations are tracked using t and T, which denote the current and maximum iterations, and sensitivity is defined by β. The specialized values η, R, and ES are defined in accordance with the following:

η_{i,j} = B_j(t) × PD_{i,j},
R_{i,j} = (B_j(t) − x_{r2,j}) / (B_j(t) + ε),
ES(t) = 2 × r_3 × (1 − t/T),

where the parameter η defines the hunter operator. The role of R is to reduce the search space, while ES defines the evolutionary sense. Random values are denoted as r_2 and r_3, and PD describes the percentage difference between the current and best solution. A small value ε is also added to avoid division errors. Exploitation similarly employs two distinct hunting strategies, hunting coordination and cooperation, with the utilized technique again highly dependent on the number of remaining iterations:

x_{i,j}(t+1) = B_j(t) × PD_{i,j}(t) × rand, for T/2 < t ≤ 3T/4 (hunting coordination),
x_{i,j}(t+1) = B_j(t) − η_{i,j}(t) × ε − R_{i,j}(t) × rand, for t > 3T/4 (hunting cooperation),

where all symbols retain the meanings defined above.
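The phase-switched updates described above can be rendered compactly in code. The snippet below is a simplified NumPy sketch of the RSA's four behaviors on a toy sphere function; the α and β sensitivity constants and the quarter-iteration phase thresholds are commonly cited reference values and are assumptions here, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

def rsa_minimize(f, lb, ub, n_agents=20, t_max=100, alpha=0.1, beta=0.01):
    """Minimal sketch of the RSA's four phase-switched update rules."""
    n = len(lb)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    pop = rng.random((n_agents, n)) * (ub - lb) + lb     # stochastic initialization
    fit = np.apply_along_axis(f, 1, pop)
    best = pop[fit.argmin()].copy()                      # best-so-far solution B
    eps = 1e-10                                          # small value avoiding division errors
    for t in range(1, t_max + 1):
        es = 2.0 * rng.uniform(-1, 1) * (1 - t / t_max)  # evolutionary sense ES(t)
        for i in range(n_agents):
            for j in range(n):
                # PD-like percentage-difference term and hunter operator eta
                p = alpha + (pop[i, j] - pop[i].mean()) / (best[j] * (ub[j] - lb[j]) + eps)
                eta = best[j] * p
                r = (best[j] - pop[rng.integers(n_agents), j]) / (best[j] + eps)
                if t <= t_max / 4:                       # high walking (exploration)
                    pop[i, j] = best[j] * -eta * beta - r * rng.random()
                elif t <= t_max / 2:                     # belly walking (exploration)
                    pop[i, j] = best[j] * pop[rng.integers(n_agents), j] * es * rng.random()
                elif t <= 3 * t_max / 4:                 # hunting coordination (exploitation)
                    pop[i, j] = best[j] * p * rng.random()
                else:                                    # hunting cooperation (exploitation)
                    pop[i, j] = best[j] - eta * eps - r * rng.random()
        pop = np.clip(pop, lb, ub)
        fit = np.apply_along_axis(f, 1, pop)
        if fit.min() < f(best):
            best = pop[fit.argmin()].copy()
    return best

best = rsa_minimize(lambda x: float(np.sum(x ** 2)), [-5, -5], [5, 5])
```

In the paper's setting, f would be the objective built from XGBoost validation accuracy rather than this toy sphere function.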

Modified RSA
While the original RSA demonstrates admirable performance, certain deficiencies can be observed in evaluations on standard CEC [45] benchmark functions. As a relatively novel algorithm, the RSA has significant potential for improvement through hybridization. This work attempts to address some of the observed issues associated with the original algorithm through hybridization. Mechanisms inspired by the genetic algorithm (GA) [30] are introduced to formulate the genetically inspired RSA (GIRSA).
The introduced crossover mechanism is activated following each iteration. An arbitrary agent is selected and spliced with the best attained solution, resulting in a combined solution. The parameters are uniformly combined. The crossover is governed by the control parameter pc. Empirically, the optimal value for this parameter has been determined to be pc = 0.1.
An additional modification incorporates parameter mutation. Once a mutation is triggered, an arbitrary value is selected from within the given parameter's constraints. One-half of the selected value is subtracted from or added to said parameter. Whether addition or subtraction is used is determined by the mutation direction parameter md. Once again, the value of md is empirically determined as md = 0.1.
Once a new solution is generated, the worst-performing solution is removed from the population and replaced by the new agent. It should be emphasized that once a new agent is generated and mutated, it is not evaluated until the subsequent iteration of the optimization. Taking this approach keeps the computational complexity of the metaheuristic consistent with the original version of the algorithm. Finally, the pseudocode for the described algorithm is presented in Algorithm 1.
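Under one plausible reading of the crossover and mutation rules above, the genetically inspired step could be sketched as follows; the function name, the per-parameter use of pc, and the bound clipping are illustrative assumptions rather than the authors' exact formulation:

```python
import random

random.seed(7)

def girsa_update(population, fitness, lb, ub, pc=0.1, md=0.1):
    """Sketch of the GA-inspired additions: uniform crossover with the best
    agent plus half-range mutation, with the child replacing the worst agent.
    The child is deliberately left unevaluated until the next iteration."""
    best = population[min(range(len(population)), key=lambda i: fitness[i])]
    donor = random.choice(population)
    # Uniform crossover: each parameter taken from the best solution with probability pc.
    child = [b if random.random() < pc else d for b, d in zip(best, donor)]
    # Mutation: add or subtract half of a random in-range value; direction governed by md.
    j = random.randrange(len(child))
    step = 0.5 * random.uniform(lb[j], ub[j])
    child[j] += step if random.random() < md else -step
    child[j] = min(max(child[j], lb[j]), ub[j])       # clip back into bounds
    # Replace the worst agent; its fitness is computed in the following iteration.
    worst = max(range(len(population)), key=lambda i: fitness[i])
    population[worst] = child
    return population

population = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
population = girsa_update(population, [0.0, 1.0, 4.0], [-5.0, -5.0], [5.0, 5.0])
print(population)
```

Because no extra fitness evaluation happens inside the step, the per-iteration cost matches the base RSA, as stated above.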

Experimental Setup
To evaluate the potential of the introduced approach for both detecting and identifying intrusions within IoT networks, a public real-world IoT MQTT dataset [46] (https://www.kaggle.com/datasets/cnrieiit/mqttset, accessed on 24 October 2023) is utilized. However, due to the extensive computational demands of model optimization, a reduced version of this dataset is utilized. The dataset comprises a total of 6 classes: legitimate traffic, SlowITe, Bruteforce, Malformed data, Flooding, and DoS attacks. A total of 34 features are present in the dataset. Further details can be accessed in the original work that introduced the dataset [46]. An additional dataset is formulated from the existing dataset for the needs of threat detection. Classes are separated into legitimate data and anomalous activities.
Due to the large number of features present in the dataset, which can result in important features blending into noise, a CNN is utilized for reduction. The reduction network consists of a convolutional layer with 128 filters and a kernel size of 2, utilizing a rectified linear unit (ReLU) activation function. This is followed by an output layer of 6 fully connected neurons. The network is trained through 10 epochs to acceptable levels of accuracy (82.7%). Following training, the network output is used as an input for the XGBoost algorithm to further improve accuracy. A graphical representation of the introduced framework for feature reduction and classification can be seen in Figure 1. Once the features have been reduced and an intermediate dataset formulated, XGBoost hyperparameters are subjected to optimization with the use of metaheuristic algorithms. XGBoost parameters chosen for optimization and their associated [lower, upper] constraints include learning rate [0.1, 0.9], minimum child weight [1, 10], subsample [0, 1], colsample [1, 10], max depth [1, 10], and gamma [0.0, 0.8]. These parameters were chosen due to their high influence on XGBoost model performance, and their respective constraints have been empirically determined.
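To illustrate the shape of the reduction pipeline (34 input features condensed to 6 values passed to XGBoost), the following NumPy sketch applies an untrained Conv1D-style layer with kernel size 2 and 128 filters followed by a dense projection; in the actual methodology the network is trained for 10 epochs before its outputs are used, so the random weights here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_reduce(x, n_filters=128, kernel=2, n_out=6):
    """NumPy sketch of the reduction network: Conv1D(kernel=2) + ReLU -> Dense(6).
    Weights are random stand-ins; the paper trains this network first."""
    n_feat = x.shape[0]                               # e.g., the 34 MQTT features
    w_conv = rng.standard_normal((n_filters, kernel)) * 0.1
    conv = np.array([
        [w @ x[i:i + kernel] for i in range(n_feat - kernel + 1)]
        for w in w_conv
    ])                                                # shape: (128, 33)
    conv = np.maximum(conv, 0.0)                      # ReLU activation
    flat = conv.ravel()
    w_out = rng.standard_normal((n_out, flat.size)) * 0.01
    return w_out @ flat                               # 6 values fed to XGBoost

reduced = feature_reduce(rng.standard_normal(34))
print(reduced.shape)  # (6,)
```

The 6-dimensional output then forms the intermediate dataset on which the metaheuristics tune the XGBoost hyperparameters listed above.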
Several contemporary optimization metaheuristics have been included in a comparative analysis in order to determine an optimal approach. These include the original RSA [44] and the GA [30] as the sources of inspiration for the introduced GIRSA. Alongside these, established well-performing optimizers are also included, initialized with the parameter values suggested in the works that originally proposed them. The included algorithms are the SSA [34], FA [33], PSO [31], WOA [35], and the COLSHADE [36] algorithm.
To provide a comprehensive assessment of the constructed models in comparison to those constructed by other contemporary optimizers, a battery of standard classification metrics, including accuracy, precision, recall, and F1-score [47], is utilized, with accuracy being the objective function chosen to guide the optimization. Further metrics include Cohen's kappa [48], described in Equation (10), which gives a more complete assessment in cases when unbalanced datasets are utilized:
κ = (p_o − p_e) / (1 − p_e), (10)

in which p_o and p_e represent the observed and expected agreement values.
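As a sketch of how this metric behaves, a minimal pure-Python implementation of the kappa formula above (assuming list inputs; not the authors' code) is:

```python
def cohens_kappa(y_true, y_pred):
    """kappa = (p_o - p_e) / (1 - p_e): p_o is the observed agreement, p_e the
    agreement expected by chance from the marginal label frequencies."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)

# Perfect agreement yields kappa = 1; chance-level agreement yields kappa = 0.
print(cohens_kappa([0, 1, 1, 0], [0, 1, 1, 0]))  # 1.0
```

Unlike raw accuracy, kappa discounts agreement that would occur by chance, which is why it is the more informative indicator on the unbalanced MQTT classes.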
It is important to note that the exact execution times of experimentation can vary depending on the specific hardware on which the simulations are carried out. This work utilized a PC with an Intel i7 CPU and an Nvidia 4070 GPU with 32 GB of available RAM. Simulations are carried out using Python 3 with the supporting TensorFlow, Pandas, and NumPy libraries. Visualizations are handled using Matplotlib and Seaborn. Metaheuristics are independently implemented for the needs of this research.

Experimental Outcomes
The following section describes the outcomes of two independent experiments. The first experiment involves optimizing XGBoost models for anomalous traffic detection. The second experiment involves the exact classification of the type of malicious activity. Following the presentation of the results, the outcomes are meticulously statistically validated. Finally, the best-performing model is analyzed to determine feature importance, providing an advantageous starting point for future feature reduction research.

Binary Classification Experiment
Binary classification outcomes based on the objective function evaluations over 30 independent runs are shown in terms of best, worst, mean, and median values in Table 1. Alongside these outcomes, the standard deviation and variance are provided in order to assess algorithm stability.
Additional evaluations using the indicator function outcomes over 30 runs are provided in Table 2 with relative stability indicators. An observation can already be made from the presented results: the models struggle with the binary classification problem. Despite the limited performance shown by all models, the introduced algorithm attained the best outcome in comparison to the other optimization algorithms. Algorithm stability can be observed in Figure 2. An obvious advantage of the introduced algorithm is its overall better performance in comparison to other algorithms. However, the high stability of the original RSA algorithm and the FA needs to be noted, despite their inferior results in both indicator and objective evaluations. Convergence rates for each metaheuristic are tracked, and the plots can be observed in Figure 3. Improvements to the convergence rate can be noted, with the introduced algorithm no longer dwelling in less promising areas and displaying better exploration in comparison to the original, as well as competing, metaheuristics. A detailed comparison of all constructed models is presented in Table 3. As can be observed from the attained outcomes, the implemented models struggle with the challenging task of handling intrusion detection through binary classification within an MQTT IoT system. However, the introduced optimizer demonstrated the best outcomes when constructing an XGBoost binary classification model, with a clear potential for hyperparameter optimization. Differentiating legitimate and malicious data can be challenging with a reduced feature space and number of classes, especially with the minority classes observed in the dataset.
To facilitate experiment repeatability, the hyperparameters selected for the respective best-performing models are provided in Table 4.

Multi-Class Classification Experiment
Experimentation with multi-class classification is carried out under identical test conditions as the binary classification experiment. However, a total of six classes are present in the dataset. Objective function outcomes are shown in Table 5. Indicator function outcomes are demonstrated in Table 6. By observing the outcomes, a clear superiority of the introduced algorithm can be seen, with the GIRSA attaining the best results in both indicator and objective function metrics. It is also important to note the admirable stability of the WOA, despite this technique not demonstrating the optimal outcomes. Algorithm stability comparisons can also be observed in Figure 4, where all tested metaheuristics are graphically compared in terms of objective and indicator functions. A significant improvement in stability compared to the original RSA can be observed for the introduced GIRSA. Further improvements in algorithm convergence can be observed in Figure 5. As can be seen in the convergence graphs, the introduced algorithm once again demonstrates an improvement in the exploration of the search space, locating a better solution and avoiding local solutions in favor of a global solution with better outcomes.
Detailed comparisons of the best-performing models generated by each metaheuristic are shown in Table 7. Various models show differing quality of performance when facing different classification challenges. This is to be expected as per the no free lunch (NFL) theorem. Notable outcomes are shown by the GIRSA algorithm, which has demonstrated a clear dominance in terms of optimal outcomes. However, notable results are also shown by the FA and the SSA algorithm. The confusion matrix of the best-performing model for multi-class classification can be observed in Figure 6.
As can be observed, the algorithm struggles to identify slowITe and flood attacks, often confusing slowITe attacks for legitimate data and flood attacks for denial-of-service attacks. However, flood and denial-of-service attacks are fairly similar in practice, so detection is still within acceptable margins. Additionally, malformed data can often be classified as DoS attacks. It can be deduced that the introduced approach performs significantly better when tackling the problem of anomalous traffic detection as a multi-class classification challenge, rather than a simple binary challenge. This is likely due to the confusion of flood and slowITe attacks with legitimate data. Nevertheless, the introduced method shows great potential for real-world implementation. Furthermore, the introduced optimization metaheuristic demonstrates improvements over existing techniques as well as the original base algorithm. To encourage experimental repeatability, the hyperparameter selections made by each algorithm are shown in Table 8.

Outcome Statistical Validation
Modern optimization research demands that optimization results be meticulously statistically validated in order to establish the statistical significance of the demonstrated improvements. The preferred approach for validating outcomes is the use of parametric tests; however, the safe use of these tests must first be established. Three criteria need to be met: independence, normality, and homoscedasticity [49]. The first condition is fulfilled by utilizing an independent random seed for each execution. The normality condition is assessed using the outcomes of the Shapiro-Wilk test, shown in Table 9, as well as through visual observation of the objective function outcome distributions shown in Figure 7. The resulting p-values were all below 0.05, suggesting that the null hypothesis (H0) may be rejected. As a result, we may infer that the outcomes produced in all three simulations do not follow a normal distribution, which is further reinforced by Figure 7. Since the normality assumption was not fulfilled, parametric tests were not applicable, and the non-parametric Wilcoxon signed-rank test [50] was used in the following stage. This test is applied to data series consisting of the best values achieved in every run of each metaheuristic.
The introduced algorithm is used as the control algorithm in this test, and the Wilcoxon signed-rank test was run on the specified data series. The calculated p-values in all three observed cases were below 0.05. Considering the significance level of α = 0.1, these findings show that the introduced algorithm outperformed all competing approaches with statistical significance. Table 10 shows the overall results of the Wilcoxon signed-rank test.
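The two-step validation procedure described above can be sketched with SciPy: a Shapiro-Wilk test first checks normality, and since normality is rejected, a paired Wilcoxon signed-rank test compares the per-run best values of two algorithms. The objective-function values below are synthetic illustrations, not the paper's actual per-run results:

```python
# Sketch of the statistical validation pipeline using SciPy.
# The per-run objective values are synthetic illustrations.
from scipy.stats import shapiro, wilcoxon

girsa = [0.117, 0.118, 0.119, 0.120, 0.121, 0.122, 0.118, 0.119]
rival = [0.120, 0.123, 0.126, 0.129, 0.132, 0.135, 0.133, 0.136]

# Step 1: Shapiro-Wilk normality check on each result series.
_, p_norm = shapiro(girsa)
if p_norm < 0.05:
    print("normality rejected -> parametric tests unsafe")

# Step 2: non-parametric Wilcoxon signed-rank test, pairing the
# best value of each run across the two algorithms.
stat, p_val = wilcoxon(girsa, rival)
print(f"Wilcoxon p-value: {p_val:.4f}")
if p_val < 0.05:
    print("improvement is statistically significant")
```

With all eight paired differences favoring the control series, the exact signed-rank p-value falls well below 0.05, mirroring the decision logic used in the paper.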

Best Model SHAP Interpretation
The best-performing model has been subjected to analysis through Shapley additive explanations (SHAP) [51] to determine feature impacts on model decisions. The feature reduction CNN model as well as the best constructed XGBoost model have been interpreted using the kernel and tree explainer techniques. SHAP utilizes a game-theory-based approach to determine the impact each feature has on model decisions. The analysis outcomes are graphically presented in Figure 8 for the CNN model and in Figure 9 for the XGBoost model.
From the feature importance analysis of the CNN feature reduction model, it can be deduced that tcp.time_delta exerts the highest influence on model decisions, closely followed by mqtt.conack.flags.reserved. The third-highest importance is shown by the mqtt.msgtype feature. The remaining features with significant influence are mqtt.kalive and mqtt.retain. Beyond these features, a significant drop in importance can be observed. Accordingly, a set of six features has been retained as outputs of the CNN and inputs to the XGBoost model. The importance of these synthetic features is shown in Figure 9. Since the synthetic features have no direct interpretation in the real-world dataset, they are simply assigned numbers. Synthetic feature 3 has the highest impact on model decisions, followed by features 1 and 5. Finally, a small impact can be observed for feature 0. Features 4 and 2 do not notably influence model decisions, suggesting that further feature reduction may be performed, reducing computational complexity while maintaining performance.
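The game-theoretic idea underlying SHAP can be illustrated on a toy two-feature example: a feature's Shapley value is its marginal contribution to the model output, averaged over all feature orderings. The value function `v` below is a hypothetical stand-in for a model's expected output given a subset of features, not anything taken from the paper's models:

```python
# Toy illustration of the Shapley-value idea behind SHAP: average
# each feature's marginal contribution over all feature orderings.
from itertools import permutations

def v(subset):
    # Hypothetical "model output given this feature subset":
    # feature 0 adds 3.0, feature 1 adds 1.0, and both together
    # add an interaction bonus of 0.5.
    out = 0.0
    if 0 in subset:
        out += 3.0
    if 1 in subset:
        out += 1.0
    if 0 in subset and 1 in subset:
        out += 0.5
    return out

def shapley(feature, features):
    total, count = 0.0, 0
    for order in permutations(features):
        before = set()
        for f in order:
            if f == feature:
                total += v(before | {f}) - v(before)
                count += 1
                break
            before.add(f)
    return total / count

phi = {f: shapley(f, [0, 1]) for f in [0, 1]}
print(phi)  # feature 0 carries most of the attribution
```

Note the efficiency property: the attributions sum exactly to the full-coalition output, which is what makes SHAP values additive explanations of a single prediction. In practice, libraries approximate this computation efficiently (e.g., the tree explainer exploits tree structure) rather than enumerating orderings.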

Conclusions and Future Work
This work presents an approach for tackling the increasingly pressing challenge of security in IoT systems relying on the MQTT protocol. With the rising popularity of IoT networks, this challenge needs to be addressed in an effective and adaptive way. The potential of an AI-based approach is explored for anomalous activity detection as well as attack-type detection. Due to the large feature space of MQTT transactions, a CNN-based technique is applied to reduce the feature space and prevent important features from fading into feature noise. To improve detection, the outputs of the CNN reduction mechanism are combined with XGBoost for classification. However, due to the considerable dependence of classification performance on hyperparameter selection, metaheuristic algorithms are used to improve model performance through hyperparameter tuning. Additionally, a modified version of the relatively recently introduced RSA is presented to overcome some of the limitations of the original approach. The proposed algorithm draws inspiration from the GA and is therefore dubbed the GIRSA.
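The kind of bounded hyperparameter search described above can be sketched generically. The following is emphatically not the GIRSA itself; it is a minimal population-based stand-in in which a toy objective function replaces actual model training, and the hyperparameter names and bounds (`eta`, `max_depth`) are illustrative assumptions:

```python
# Minimal population-based hyperparameter search, sketched in the
# spirit of metaheuristic tuning. NOT the GIRSA algorithm itself;
# a toy objective stands in for "1 - validation accuracy".
import random

random.seed(42)

# Hypothetical XGBoost hyperparameter bounds (illustrative only).
bounds = {"eta": (0.01, 0.5), "max_depth": (2, 10)}

def objective(params):
    # Toy surrogate to minimise; its optimum is eta=0.3, depth=6.
    return (params["eta"] - 0.3) ** 2 + ((params["max_depth"] - 6) / 10) ** 2

def random_solution():
    return {"eta": random.uniform(*bounds["eta"]),
            "max_depth": random.randint(*bounds["max_depth"])}

# Initialise a population, then keep the best solution found while
# sampling new candidates within the bounds each iteration.
population = [random_solution() for _ in range(10)]
best = min(population, key=objective)
for _ in range(50):
    candidate = random_solution()
    if objective(candidate) < objective(best):
        best = candidate

print(best)
```

Real metaheuristics such as the RSA replace the blind resampling step with guided exploration and exploitation moves, and a GA-inspired modification like the one proposed here would additionally recombine promising solutions; the overall loop structure, however, is the same.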
The introduced approach is assessed on a real-world dataset. The feature space is reduced using a CNN, and this reduced feature space is used by the XGBoost algorithm to classify MQTT traffic. Two experiments are carried out: one handling simple anomaly detection, and the other identifying the specific type of attack. The introduced approach demonstrates potential: while it struggles with simple binary classification, the best-performing model attains a decent accuracy rate of 87.94% in multi-class classification. The observed improvements have been meticulously statistically validated, and the best-performing model has been subjected to SHAP analysis in order to determine feature importance.
The utilization of data-driven methods for detection and identification offers several advantages. Data-driven methods are capable of adapting to emerging threats without explicit programming or administrator interaction. Furthermore, reduced maintenance can often offset the costs associated with initial development and integration.
Reflecting on the validity of the proposed approach, it is essential to consider both internal and external validity. Internally, the use of a CNN for feature space reduction and the subsequent integration with XGBoost for classification raises questions about the potential impact of hyperparameter choices on model performance. We addressed this concern through the application of metaheuristic algorithms for hyperparameter tuning, enhancing the robustness and reliability of our classification results. Externally, the generalizability of our findings is a crucial consideration. While our experiments were conducted on a real-world dataset, the specific characteristics of the dataset and the nature of MQTT transactions may limit the broader applicability of our approach to diverse IoT environments. Future work should include a broader range of datasets, ensuring that the effectiveness of our algorithm extends to various IoT scenarios.
As with any study, some limitations exist in this work as well. Only a limited set of optimization algorithms is explored. Due to limited computational resources, smaller populations are used for optimization, and only a limited number of optimization iterations is conducted. In future work, we hope to expand population sizes and optimization periods to attain a better understanding of the full capabilities of each algorithm.
Future work will center on enhancing the proposed methodology by incorporating hybridization techniques and assessing alternative machine learning methodologies, including deep CNNs and emerging variants of recurrent networks. Furthermore, the applicability of the introduced metaheuristic will be investigated for diverse and critical optimization challenges across various research domains, including medicine, computer security, and waste management.

Figure 1. Flowchart of the introduced framework.

Figure 3. Objective and indicator function convergence plots.
(Panel title: Cohen's kappa violin plot diagram.)

Figure 4. Objective and indicator function distribution plots for multi-class classification.

Figure 5. Objective and indicator function convergence plots for multi-class classification.
Figure 6. Confusion matrix of the best-performing multi-class classification model (classes: legit, dos, malformed, bruteforce, slowite, flood).

Figure 7. Objective function KDE plots for binary and multi-class classification.

Figure 8. SHAP analysis outcomes for the best CNN model.
Pseudocode of the introduced GIRSA.

Table 1. Objective function overall outcomes.

Table 2. Indicator function overall outcomes.

Table 4. Control parameter selections made by each metaheuristic for the respective best-performing binary classification models.

Table 5. Objective function outcomes for multi-class classification.

Table 6. Indicator function outcomes for multi-class classification.

Table 7. Detailed comparison between best-performing models for multi-class classification.

Table 8. Control parameter selections made by each metaheuristic for the respective best-performing multi-class classification models.

Table 10. Wilcoxon signed-rank test results exhibiting p-values for all three experiments (GIRSA vs. others).