Feature Engineering and Model Optimization Based Classiﬁcation Method for Network Intrusion Detection

Abstract: In light of the escalating ubiquity of the Internet, the proliferation of cyber-attacks, coupled with their intricate and surreptitious nature, has significantly imperiled network security. Traditional machine learning methodologies inherently exhibit constraints in effectively detecting and classifying multifarious cyber threats. Specifically, the surge in high-dimensional network traffic data and the imbalanced distribution of classes exacerbate the difficulty of achieving ideal classification performance. Notably, the presence of redundant information within network traffic data undermines the accuracy of classifiers. To address these challenges, this study introduces a novel approach for intrusion detection classification which integrates advanced techniques of feature engineering and model optimization. The method employs a feature engineering approach that leverages mutual-information-based maximal relevance minimal redundancy (mRMR) feature selection and the synthetic minority oversampling technique (SMOTE) to process network data. This transformation of raw data into more meaningful features effectively addresses the complexity and diversity inherent in network data, enhancing classifier accuracy by reducing feature redundancy and mitigating issues related to class imbalance and the detection of rare attacks. Furthermore, to optimize classifier performance, the paper applies the Optuna method to fine-tune the hyperparameters of the CatBoost classifier, thereby determining the optimal model configuration. The study conducts binary and multi-classification experiments using publicly available datasets, including NSL-KDD, UNSW-NB15, and CICIDS-2017. Experimental results demonstrate that the proposed method outperforms traditional approaches regarding accuracy, recall, precision, and F1-score. These findings highlight the method's potential and performance in network intrusion detection.


Introduction
With network security risks increasing, implementing effective intrusion detection mechanisms has emerged as a critical strategy for safeguarding computer systems and network security [1][2][3][4]. Conventional intrusion detection techniques primarily depend on pre-established rules or recognized attack attributes [5], such as employing string pattern matching to identify attack signatures and ascertain the presence of an intrusion. Nonetheless, these traditional methods exhibit limitations in detecting novel and unidentified attacks, relying heavily on expert experience [6,7]. In contrast, machine learning techniques present notable benefits in intrusion detection. By acquiring knowledge from network data and attack samples, these methods enable the identification of unfamiliar malicious behaviors and signatures. Prior research [8,9] has employed intelligent intrusion detection systems founded on machine learning approaches, which can detect attacks without any prior information. Subsequent investigations [10][11][12][13] delve into the examination and comparison of various datasets frequently employed in network intrusion detection systems. Network data exhibits high dimensionality and class imbalance, posing challenges for traditional classification algorithms in extracting significant features. Moreover, the class imbalance further exacerbates the issue by causing low detection rates for certain categories.

Materials and Methods
This section provides an overview of the network intrusion detection classification method based on feature engineering and model optimization. The structural framework of the paper is divided into four main parts: data pre-processing, feature engineering, training and validation of the classification models, and evaluation and analysis of the experimental results. The framework is depicted in Figure 1. The main objective of this study is to enhance the performance of network intrusion detection models through the adoption of feature engineering and model optimization methods. Firstly, we conduct data preprocessing on the NSL-KDD, UNSW-NB15, and CICIDS-2017 datasets, including data cleansing, one-hot encoding, and min-max normalization, to improve data usability and robustness. Subsequently, the preprocessed data undergoes a dual strategy of feature selection based on mRMR and data equalization based on SMOTE. This approach enables the selection of relevant and non-redundant features and effectively addresses the issue of class imbalance, thereby bolstering the model's ability to handle the intricacies and diversities of network data. The feature-engineered data is then fed into the CatBoost model, which undergoes hyperparameter optimization using the Optuna framework, followed by training and classification. Evaluation metrics such as accuracy, recall, precision, and F1-score are employed to validate and assess the results. (1) The raw data is subjected to cleaning, normalization, and type conversion in the pre-processing stage to ensure data consistency and usability.
(2) The feature engineering process involves further processing the pre-processed dataset using the mRMR algorithm. This algorithm selects highly relevant features, reduces redundant information, and enhances the representation of the classification model. Additionally, the SMOTE algorithm is employed to generate new minority class samples, addressing the issue of class imbalance and improving the overall class distribution of the dataset. (3) The next step involves training and validating the classification model using the feature-engineered dataset. The dataset is divided into a training set and a test set with a ratio of 7:3. The CatBoost model is optimized using the Optuna framework, which selects the best hyperparameter configuration to enhance the model's generalization ability and performance. (4) Finally, the results are evaluated and analyzed. Performance metrics such as accuracy, precision, recall, and F1-score are calculated to assess the classification performance of the model on the test set. These metrics provide insight into the effectiveness of the proposed approach.

Data Preprocessing
Due to the presence of missing data, duplicate data, and character data in the dataset, it is essential to address these factors to mitigate their adverse impact and improve the overall data quality. This can be achieved through pre-processing techniques to reduce noise and anomalies, rendering the data more suitable for training and analysis with machine learning models. The following steps were implemented to accomplish this: (1) Data cleansing: Data cleansing is essential in the data preprocessing pipeline. It involves scrutinizing the raw data to identify and rectify errors, missing values, duplicates, and inconsistencies [26]. By eliminating low-quality data, this process ensures that the analysis results remain unaffected by irrelevant or flawed information. In this study, specific cleansing procedures were undertaken, as outlined in Table 1, which comprehensively captures the different aspects targeted during the data cleansing process. Table 1. Data to be cleaned.

NSL-KDD
- num_outbound_cmds: takes the value zero for every record in the dataset and therefore provides no information for the classification task.
- is_host_login: indicates whether the host login is successful; makes no contribution to the classification task.
- land: indicates whether the source IP address and the destination IP address are the same; has no effect on the classification task.

UNSW-NB15
- record identifier: the feature is only an identifier and has no practical meaning for the classification task.
- srcip, dstip: represent the source and destination IP addresses; for the classification task of intrusion detection, the IP addresses alone have no direct impact.

CICIDS2017
- Source IP, Destination IP: represent the source and destination IP addresses and have no impact on the classification task of intrusion detection.
- Timestamp: represents the timestamp of the flow and has no effect on the classification task.
- NaN and Inf values: dirty entries (distributed in columns 15 and 16) that need to be removed.

(2) Feature one-hot encoding: The purpose of feature one-hot encoding is to transform non-numeric categorical features into a numerical representation that can be processed by computer-based algorithms [27]. Machine learning algorithms rely on mathematical models and statistical methods, which typically operate on numeric data, whereas categorical features consist of discrete symbols or labels that lack direct numerical significance.

(3) Min-max normalization: The value ranges of different features can differ greatly. Without normalization, a feature with a larger range of values takes on more weight during training, so min-max normalization is introduced to map all the data to the interval [0, 1], thus speeding up the convergence of the model and improving the accuracy of the classification results, as follows:

X' = (X − Min) / (Max − Min)

where Min is the minimum value of the feature, Max is the maximum value of the feature, X is the original feature value, and X' is the normalized feature value.

The pre-processing steps mentioned above, including data cleansing, one-hot encoding, and data normalization, were applied separately to the three datasets used in this paper: NSL-KDD, UNSW-NB15, and CICIDS-2017. Each dataset underwent these preprocessing steps to ensure consistency and improve the data quality before further analysis and model training.
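The one-hot encoding and min-max normalization steps described above can be sketched in a few lines of plain Python (an illustrative sketch with made-up column values, not the preprocessing code used in the paper):

```python
def one_hot(values):
    """Map a list of categorical values to one-hot vectors."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]

def min_max(values):
    """Map numeric values to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical columns standing in for e.g. "protocol_type" and "duration".
protocols = ["tcp", "udp", "tcp", "icmp"]
durations = [0, 12, 3, 30]

encoded = one_hot(protocols)   # 3 categories -> 3-dimensional vectors
scaled = min_max(durations)    # all values now lie in [0, 1]
```

In the actual pipeline the encoded columns are appended to the dataset and the original categorical columns are dropped, which is what raises the NSL-KDD dimensionality from 41 features to the figures reported below.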

•
Numerical processing of the NSL-KDD dataset. The features in columns 2, 3, and 4 of the NSL-KDD dataset (corresponding to "protocol_type", "service", and "flag", respectively) were one-hot encoded in this study. Specifically, the "protocol_type" feature was mapped to a 3-dimensional feature, the "service" feature was mapped to a 70-dimensional feature, and the "flag" feature was encoded accordingly. These encoded features were appended to the original data, increasing its dimensionality from 41 to 118 after numerical processing. After the removal of the num_outbound_cmds, is_host_login, and land features, which are irrelevant to determining abnormal data, the processed dataset comprises a total of 115 dimensions. Table 2 illustrates the outcome of the numerical processing of the dataset. For the UNSW-NB15 dataset, the features in columns 3, 4, and 5 (corresponding to "proto", "service", and "state", respectively) were one-hot encoded: the "proto" feature was mapped to a 131-dimensional feature, the "service" feature to a 12-dimensional feature, and the "state" feature to a 12-dimensional feature. The "srcip" and "dstip" columns in the feature dataset represent IP addresses; since the judgment of data abnormality does not depend on IP addresses, they were excluded from the encoding process. Table 3 illustrates the outcome of the numerical processing of the dataset. The CICIDS2017 dataset contains a total of 84 features and 1 label. The features include information such as "Source IP", "Destination IP", and "Timestamp". Based on experience and domain knowledge, these three features are not relevant to the final detection target and were therefore removed from the dataset.
Table 4 illustrates the outcome of the numerical processing of the dataset. The pre-processed dataset encounters the challenge of high dimensionality in the feature space, which can significantly increase training and prediction time complexity and the susceptibility to overfitting [29]. Considering the intricate non-linear relationships in network traffic feature data and the diverse patterns exhibited by different types of intrusions across multiple features, the mRMR algorithm, based on mutual information, effectively captures these non-linear relationships and quantifies the correlation between features and target variables. By utilizing information-theoretic concepts such as information entropy, information gain, and mutual information, the mRMR algorithm assesses the informative value of features for classification purposes. It evaluates the classification performance by measuring the magnitude of the mRMR value between features and class labels [30]. According to the definition of mutual information, when variables X and Y are completely statistically independent, no information is shared between the two variables; conversely, the higher the degree of dependence between the two variables, the greater the value of the mutual information. The mutual information I(X, Y) is obtained according to the following equation.
I(X, Y) = ∑_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]

Maximum Relevance: The mRMR algorithm computes the level of correlation between each feature and the target variable. It assigns higher priority to features with stronger correlations with the target variable.
Minimum Redundancy: The mRMR algorithm evaluates the level of redundancy among features. It emphasizes selecting features that have minimal redundancy with the already selected features. This ensures that the selected feature subset exhibits a low correlation among its features and avoids including redundant information.
The final mRMR value is computed by considering both maximum relevance and minimum redundancy criteria. Features are ranked in descending order of their scores, and features with high scores are progressively added to the feature subset. This approach ensures that features with high relevance to the target variable and minimal redundancy with the existing feature set are included in the subset.
The relevance and redundancy criteria are defined as:

max D(S, c), D = (1/|S|) ∑_{f_i ∈ S} I(f_i, c)

min R(S), R = (1/|S|²) ∑_{f_i, f_j ∈ S} I(f_i, f_j)

mRMR = max [D(S, c) − R(S)]

where S is the feature set; f_i is a feature; c is the target category; I(f_i, c) is the mutual information between the feature and the target category c; and I(f_i, f_j) is the mutual information between features f_i and f_j. The mRMR value combines the correlation between features and the target variable with the degree of redundancy between features. Its purpose is to select a subset of features that are both highly informative and non-redundant. The correlation measure ranges from 0 to 1, where a higher absolute value indicates a stronger correlation between the two variables; the strength of the correlation is typically graded as shown in Table 5. The mRMR values of each feature were calculated by averaging the mRMR values obtained from multiple extractions of the dataset used in the experiments conducted for this paper. The results are presented in Table 6.
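To illustrate how the relevance and redundancy terms interact, the following sketch implements discrete mutual information and a greedy mRMR selection loop in plain Python (a simplified stand-in for the algorithm described above; the toy feature lists and labels are hypothetical):

```python
import math
from collections import Counter

def mutual_info(xs, ys):
    """I(X, Y) = sum over (x, y) of p(x, y) * log(p(x, y) / (p(x) p(y)))."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr_select(features, labels, k):
    """Greedy mRMR: maximize relevance I(f, c) minus mean redundancy I(f, g)."""
    selected = []
    remaining = list(range(len(features)))
    while remaining and len(selected) < k:
        def score(i):
            rel = mutual_info(features[i], labels)
            red = (sum(mutual_info(features[i], features[j]) for j in selected)
                   / len(selected)) if selected else 0.0
            return rel - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

labels = [0, 0, 0, 1, 1, 1]
f0 = [0, 0, 1, 1, 1, 1]   # informative
f1 = [0, 0, 1, 1, 1, 1]   # exact copy of f0: informative but fully redundant
f2 = [0, 0, 0, 1, 1, 0]   # equally informative, less redundant with f0
chosen = mrmr_select([f0, f1, f2], labels, k=2)
```

Note that the greedy loop skips the redundant copy f1 in favor of f2, even though f1 is individually just as relevant, which is exactly the behavior the minimum-redundancy term is meant to produce.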

Smote-Based Data Equalization
In the case of class imbalance, where the number of samples from the minority class is significantly smaller than that of the majority class, there is a risk of classifier bias towards the majority class, leading to performance issues [31]. To address this, minority class oversampling techniques such as SMOTE (Synthetic Minority Over-sampling Technique) can balance the class distribution and improve classifier performance. SMOTE operates by interpolating between samples from the minority class to generate new samples. The underlying principle of SMOTE is as follows: for each sample in the minority class, its K nearest neighbors (typically K = 5) are identified using a distance metric such as the Euclidean distance [32]. A new sample is then created by randomly selecting a point along the line connecting the minority class sample and one of its K nearest neighbors. The principle is shown in Figure 2. The formula used for generating the new sample is as follows:

X_new = X + rand(0, 1) × (X_near − X)

where X_new denotes the final synthetic sample, X denotes a minority class sample of the input, X_near denotes a nearest-neighbour sample of the selected X, and rand(0, 1) is a random number between 0 and 1.
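The interpolation rule above can be sketched directly (a minimal plain-Python illustration of the SMOTE formula, not the library implementation typically used in practice; the minority points are made up):

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples via X_new = X + rand(0,1) * (X_near - X)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x by Euclidean distance (excluding x itself)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        near = rng.choice(neighbours)
        t = rng.random()  # rand(0, 1)
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x, near)))
    return synthetic

# Hypothetical 2-D minority class samples.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_samples = smote(minority, n_new=3, k=2)
```

Because each synthetic point lies on a segment between two existing minority samples, oversampling stays inside the region already occupied by the minority class rather than duplicating points exactly.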

In SMOTE sampling, the selection of the number of nearest neighbor samples is crucial. If the number is too small, SMOTE may generate overly similar synthetic samples, impairing the classifier's generalization ability. On the other hand, selecting too many nearest neighbor samples may lead to the generation of overly complex synthetic samples, increasing the computational complexity of the classifier. Hence, in practical applications, an appropriate number of nearest neighbor samples must be chosen based on the specific circumstances.
This paper applies the SMOTE algorithm to the three datasets used in the experiments to balance their class distributions. Consequently, the proportions of each category in the datasets are altered, as depicted in Figures 3-5.


Catboost Model Based on Optimisation of Optuna Hyperparameters
CatBoost is an advanced machine learning algorithm based on Gradient Boosting Decision Trees (GBDT). It effectively addresses the challenges posed by large feature spaces and class imbalance through gradient optimization techniques. The algorithm utilizes a fully symmetric tree as its base learner, as illustrated in Figure 6 [33]. Unlike a typical decision tree, a fully symmetric tree ensures that internal nodes at the same depth employ identical features and feature thresholds for splitting. Consequently, a fully symmetric tree can be represented as a decision table with 2^d entries, where d represents the number of levels in the tree [34]. This balanced structure enhances processing speed and improves feature handling compared to a standard decision tree. In the domain of network intrusion detection, certain classes (intrusions) often suffer from underrepresentation. When confronted with large-scale datasets in network intrusion detection, employing the CatBoost algorithm as a classifier offers inherent advantages in handling high-dimensional data, addressing class imbalance, managing categorical features, facilitating automatic feature scaling, and ensuring computational efficiency. CatBoost excels in mitigating sample imbalance through the utilization of weighted loss functions and class weight adjustment, thereby enhancing classification accuracy for minority classes. It mitigates the risk of overfitting by introducing random permutations of the training data during model training, improving generalization performance. By adeptly tackling the challenges associated with network intrusion detection datasets and delivering efficient classification capabilities, CatBoost proves to be a valuable choice for this task.
CatBoost is built upon the foundation of a gradient-boosting tree. However, it deviates from the conventional approach by eschewing a greedy target statistics-based method for splitting nodes when handling category features. Instead, CatBoost considers the prior distribution term during the calculation of node gain. This approach effectively mitigates the impact of low-frequency features and noise in the category variables on the construction of decision trees.
x̂_{σ_p, k} = ( ∑_{j=1}^{p−1} [x_{σ_j, k} = x_{σ_p, k}] · Y_{σ_j} + a · P ) / ( ∑_{j=1}^{p−1} [x_{σ_j, k} = x_{σ_p, k}] + a )

where σ_j is the j-th data item in a random permutation of the training set; x_{i,k} denotes the k-th column of discrete features of the i-th row of data in the training set; [·] is the indicator function; Y_{σ_j} is the target value of the σ_j-th data item; a is a prior weight; and P is the prior distribution term.
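The prior-smoothed target statistic can be illustrated with a single-permutation sketch (a simplified stand-in for CatBoost's internal encoding, which averages over several random permutations; the category and label values below are hypothetical):

```python
def target_statistic(categories, targets, prior, a=1.0):
    """Encode each categorical value using only the rows that precede it,
    smoothed by the prior weight a and prior distribution term P (prior)."""
    encoded = []
    counts, sums = {}, {}
    for cat, y in zip(categories, targets):
        n = counts.get(cat, 0)          # occurrences of cat seen so far
        s = sums.get(cat, 0.0)          # sum of targets for cat seen so far
        encoded.append((s + a * prior) / (n + a))
        counts[cat] = n + 1
        sums[cat] = s + y
    return encoded

# Hypothetical protocol column with binary attack labels.
cats = ["tcp", "tcp", "udp", "tcp"]
ys = [1, 0, 1, 1]
enc = target_statistic(cats, ys, prior=0.5)
```

Because each row is encoded before its own label is counted, the statistic never leaks the row's target, and the prior term keeps encodings for low-frequency categories (such as "udp" here, seen only once) close to the global prior rather than to noise.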
Optuna is a framework for automated hyperparameter optimization, which efficiently explores the hyperparameter space to discover the optimal combination of hyperparameters [35]. When it comes to the CatBoost algorithm, there are several hyperparameters that can be fine-tuned, including learning rate, number of trees, and depth. Manually tuning these hyperparameters can be time-consuming and requires expertise in the field. By leveraging Optuna, the model's robustness can be enhanced through the optimization of hyperparameters, resulting in a selection of hyperparameter combinations that exhibit excellent performance across different environments [36]. The utilization of Optuna to optimize CatBoost facilitates improved model performance, generalization, and interpretability, ultimately yielding superior hyperparameters for intricate tasks like intrusion detection. This process unfolds through the following steps: 1.
Define hyperparameter space: Before employing Optuna for hyperparameter optimization, it is essential to establish the search space for the hyperparameters. This entails specifying a range or distribution of feasible values for each hyperparameter. The following Table 7 presents an overview of the hyperparameters in CatBoost and their respective details:

2.
Define the objective function: Define an objective function that inputs hyperparameters and outputs an evaluation metric. The output of the objective function will be utilized to assess the quality of the hyperparameter combination and conduct an optimization search. In this study, accuracy is employed as the objective function for parameter optimization.

3.
Hyperparametric Optimization Search: Optuna employs a data structure called 'Trials' to record each iteration's hyperparameter values and evaluation outcomes. During each iteration, a fresh hyperparameter combination is chosen for evaluation. For each selected combination, the objective function is invoked to assess its performance and store the evaluation results in the Trials data structure. The following hyperparameter combination attempt is determined based on the historical hyperparameter values and the evaluation outcomes of the objective function.

4.
Iterative process: Multiple iterations are conducted by systematically selecting new hyperparameter combinations, evaluating the objective function, and updating the model and strategy. This iterative process aims to identify the optimal combination of hyperparameters that minimizes or maximizes the value of the objective function. The schematic flow is shown in Figure 7.

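The four steps above can be sketched with a plain random-search loop (a stdlib stand-in for Optuna, which instead uses a more sophisticated sampler; the toy objective merely mimics "train CatBoost and return accuracy", and the assumed optimum at depth = 6, learning rate = 0.1 is invented for illustration):

```python
import random

def objective(params):
    """Stand-in objective: a toy accuracy surface peaking at depth=6, lr=0.1.
    In the real pipeline this would train CatBoost and return test accuracy."""
    return (1.0 - abs(params["depth"] - 6) * 0.05
                - abs(params["learning_rate"] - 0.1))

def optimize(n_trials, seed=0):
    rng = random.Random(seed)
    trials = []                           # analogue of the Trials record
    for _ in range(n_trials):
        params = {                        # step 1: sample the search space
            "depth": rng.randint(3, 10),
            "learning_rate": rng.uniform(0.01, 0.3),
        }
        score = objective(params)         # steps 2-3: evaluate and record
        trials.append((score, params))
    return max(trials, key=lambda t: t[0])  # step 4: best combination found

best_score, best_params = optimize(n_trials=50)
```

Unlike this uniform random search, Optuna's sampler conditions each new suggestion on the recorded trial history, so promising regions of the space are revisited more often; the surrounding loop structure, however, is the same.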

Experimental Environment
All experiments in this paper were conducted on a Windows 10 machine with an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80 GHz and 16.00 GB of RAM. The algorithms were implemented in Python, primarily using the scikit-learn library.

Introduction to the Data Set
When choosing datasets for intrusion detection classifier performance comparison experiments, it is essential to consider the following factors: Diversity of datasets: Select datasets from various types and sources to ensure the results are comprehensive and applicable to different scenarios.

Size of the dataset: Choose datasets with an adequate sample size to ensure the reliability and statistical significance of the experimental findings.
The authenticity of the dataset: Prefer real-world datasets whenever possible, as they reflect the complexities and characteristics of actual intrusion scenarios more accurately than synthetic or artificially generated datasets.
Based on these considerations, the following three datasets are recommended for intrusion detection classifier performance comparison experiments.
(1) NSL-KDD dataset: The NSL-KDD dataset is a revised version of the KDD-99 dataset, removing redundant and unnecessary records while retaining both normal and attack connections. The NSL-KDD dataset "https://www.unb.ca/cic/datasets/nsl.html (accessed on 1 May 2023)" [37] offers improved data quality and a more balanced class distribution. Consequently, this study selects the files KDDTrain+ and KDDTest+ from the NSL-KDD dataset as the training and testing sets for the classifier. Each record consists of 43 features, with the last column representing four different types of attack access (DoS, Probe, R2L, U2R) and one normal access. The specific composition and scale of the dataset are detailed in Table 8.
(2) UNSW-NB15 dataset: The UNSW-NB15 dataset is a network intrusion detection dataset developed by the University of New South Wales "https://research.unsw.edu.au/projects/unsw-nb15-dataset (accessed on 1 May 2023)" [38][39][40][41][42]. It comprises real network traffic records, encompassing normal connections and various types of attacks, thus offering a more precise reflection of the threats and attacks prevalent in modern network environments. Utilizing the UNSW-NB15 dataset facilitates the assessment of classifiers in highly realistic and intricate network scenarios. The dataset includes nine types of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. Each flow in the dataset is characterized by 48 features. The training set in the CSV file comprises 175,341 records, while the test set contains 82,332 records. The specific composition and scale of the dataset are outlined in Table 9.
(3) CICIDS2017 dataset: The CICIDS2017 dataset "https://www.unb.ca/cic/datasets/ids-2017.html (accessed on 1 May 2023)" [43] was developed by the Canadian Institute for Cybersecurity (CIC) for network intrusion detection. It comprises traffic records from real-world network environments, including normal connections and various types of attacks. The dataset offers a substantial number of samples and diverse attack scenarios, featuring five major categories of attack access and one category of normal access, with each data entry containing 78 features. It is therefore suitable for evaluating classifiers in large-scale and complex network environments. Note, however, that the CICIDS2017 data were collected over five days, from 3 July 2017 to 7 July 2017; because each day's abnormal traffic may differ, the dataset is stored and distributed by date. To address this, the current study merges the data from all five days and randomly splits them into a training set and a test set in a 7:3 ratio. After partitioning, the training set comprises 1,980,111 records, while the test set contains 850,632 records. The specific composition and scale of the dataset are provided in Table 10.

Model Performance Evaluation Indicators
For classification problems, all possible outcomes can be divided into the four cases shown in Table 11. TP denotes attack flows correctly detected as attacks; FN denotes attack flows incorrectly classified as normal; TN denotes normal flows correctly detected as normal; FP denotes normal flows incorrectly classified as attacks. FP and FN correspond to false positives and false negatives, respectively. Based on these four parameters, four metrics can be derived to measure the actual performance of a model.
Accuracy, the proportion of correctly predicted samples among all samples, is expressed as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision, the proportion of samples predicted as positive that are actually positive, is expressed as follows:
Precision = TP / (TP + FP)
Recall, the proportion of actual positive samples that are correctly predicted, is expressed as follows:
Recall = TP / (TP + FN)
F1 score. Precision and recall are conflicting metrics, and in training it is often necessary to find the balance point between them; this balance is represented by their harmonic mean, F1, expressed as follows:
F1 = 2 × Precision × Recall / (Precision + Recall)
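As a quick sanity check, the four metrics can be computed directly from the confusion-matrix counts; the counts used below are made-up illustrative numbers, not results from the paper.

```python
def metrics(tp, fn, tn, fp):
    # Accuracy: correct predictions over all samples
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Precision: true attacks among flows predicted as attacks
    precision = tp / (tp + fp)
    # Recall: detected attacks among all actual attacks
    recall = tp / (tp + fn)
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = metrics(tp=90, fn=10, tn=85, fp=15)
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f1, 4))
# → 0.875 0.8571 0.9 0.878
```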

Experiment 1 Accuracy of Feature Extraction within Each Threshold
The choice of features directly impacts the classification outcome, making feature selection a crucial aspect of intrusion detection. Feature selection is accomplished by calculating the mRMR coefficient for each feature individually. This study discarded features exhibiting no correlation or weak correlation, based on the feature correlation strength correspondence table. Subsequently, classification experiments were conducted with correlation-strength thresholds ranging from 0.0 to 1.0, as shown in Table 12. In the classification experiments, the CatBoost model was used to determine the threshold value. The results indicate that the recognition rate is higher when the coefficients fall within the range of 0.2 to 0.8 than for other values; effectiveness diminishes when the mRMR value is too small or too large. Consequently, this study adopts the threshold range of 0.2 to 0.8 for the classification experiments and proceeds with feature selection.
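A minimal sketch of the mRMR idea: score each candidate feature by its mutual-information relevance to the label minus its average redundancy with the already-selected features, and select greedily. This greedy loop on synthetic data illustrates the criterion only; it is not the authors' exact implementation or thresholding scheme.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

# relevance: mutual information between each feature and the class label
relevance = mutual_info_classif(X, y, random_state=0)

selected = [int(np.argmax(relevance))]  # start with the most relevant feature
while len(selected) < 5:
    best_score, best_j = -np.inf, None
    for j in range(X.shape[1]):
        if j in selected:
            continue
        # redundancy: mean mutual information with already-selected features
        redundancy = np.mean([
            mutual_info_regression(X[:, [j]], X[:, s], random_state=0)[0]
            for s in selected
        ])
        score = relevance[j] - redundancy  # mRMR criterion (MID form)
        if score > best_score:
            best_score, best_j = score, j
    selected.append(best_j)

print(selected)  # indices of the 5 features chosen greedily
```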

Experiment 2 Optuna Optimization Effects
In conjunction with the selected model, the optimal hyperparameters (learning rate, maximum tree depth, number of trees, number of samples per leaf node, and regularization parameters) were determined for each dataset using CatBoost with the Optuna search. The iterative results of this search process are presented in Figure 8: (a) shows the iterative optimization on the NSL-KDD dataset, (b) on the UNSW-NB15 dataset, and (c) on the CICIDS2017 dataset.
The optimal hyperparameters of CatBoost for each dataset were obtained after several iterations; the results are shown in Table 13.

Experiment 3 Comparison of Multiple Optimization Algorithms
The classical machine learning algorithms and the models developed in this study were utilized to perform binary and multiclassification experiments on three distinct datasets. Subsequently, a comprehensive comparison and analysis were conducted based on the results.

• Binary Classification
Using the NSL-KDD dataset to train the model, it can be seen from Table 14 that the accuracy, recall, precision and F1 of the mRMR-SMOTE-Optuna-CatBoost algorithm designed in this paper for the detection of the NSL-KDDTest+ dataset are 99.2623%, 99.6350%, 99.2219% and 99.4280%, respectively. All the metrics are higher than those of several machine learning models. Figure 9 is its confusion matrix. Using the UNSW-NB15 dataset to train the model, Table 15 shows that the algorithm designed in this paper is also superior to the comparison models in all indicators, outperforming most machine learning methods. Figure 10 is its confusion matrix.

For all attack types in the CICIDS2017 dataset, as shown in Table 16, the intrusion detection rate after feature engineering and model optimization has been significantly improved. The detection performance is better than traditional intrusion detection methods, and Figure 11 shows its confusion matrix. The integrated model proposed through the analysis of the results surpasses alternative strategies in the context of binary classification.
It demonstrates remarkable generalization capabilities, making it suitable for datasets of varying complexity. The model consistently outperforms others across multiple datasets, underscoring its versatility and adaptability. Notably, this work places greater emphasis on feature selection and hyperparameter optimization than previous studies. Our method was rigorously validated on various benchmark datasets, yielding averaged and reliable results for a comprehensive comparison.

• Multi-classification
Intrusion detection entails both the accurate detection of intrusion behaviors and the precise identification of specific attack types. While some existing intrusion detection algorithms exhibit strong performance in binary classification tasks, they encounter challenges in multi-classification tasks, particularly in accurately categorizing the minority classes within abnormal data, resulting in decreased accuracy. This study leveraged information from the three datasets to consolidate the smaller categories into larger ones, classifying the items into 5, 10, and 6 categories (including normal traffic), respectively. Tables 17-19 present a performance comparison between the intrusion detection model based on feature extraction and various machine learning algorithms for multi-classification tasks on the three datasets [58][59][60].
Table 17. The proposed model is compared with existing methods in multiple classifications on the NSL-KDD dataset.

Experiment 4 Optimization Algorithm Evaluation
In the preceding multiple sets of experiments, a comparative study was conducted between our proposed algorithm and traditional classical machine learning algorithms in binary and multiclass tasks. The results confirmed that our algorithm significantly outperforms the traditional methods in various evaluation metrics for binary and multiclass experiments. The analysis based on evaluation metrics demonstrates the algorithm's capability to effectively address the challenges posed by high dimensionality and class imbalance. Moreover, the optimized algorithm exhibits strong generalizability, as evidenced by its excellent performance across diverse datasets. To ensure the scientific rigor of the experimental results, a comprehensive performance comparison was further conducted between our algorithm and a series of state-of-the-art network intrusion detection methods. Given the deployment environment of practical Internet of Things (IoT) systems, the focus was placed on analyzing the comparative performance of multiple algorithms in the binary classification scenario, where the higher cost lies in correctly distinguishing between normal and intrusive traffic data rather than classifying various types of intrusions. The detailed comparison results are presented in Table 20.
Through comprehensive comparative analysis of the various indicators, the proposed algorithm demonstrated significant superiority on the NSL-KDD, UNSW-NB15, and CICIDS-2017 datasets. Given the high dimensionality and class imbalance of the datasets, we first selected the features that best represent the original data, reducing the spatial complexity and dimensionality of the dataset. Subsequently, by oversampling the minority-class data, we addressed the issue of class imbalance. After this feature engineering, we fed the reduced dataset into the subsequent classification algorithms, significantly reducing the spatial complexity and the number of features per data point, thereby shrinking the overall dataset size and alleviating the computational burden of the algorithm.
Considering the intrusion detection application environment, where most of the dataset consists of normal-class samples, we focused on the minority class representing the attack category. Hence, the model incorporates the SMOTE algorithm to increase the proportion of the minority class, thereby raising the algorithm's sensitivity to minority-class data and improving classification accuracy. Consequently, our algorithm exhibits significant superiority in binary classification tasks for network intrusion detection. However, it is essential to note that, despite demonstrating significant advantages in the experiments, our method still has limitations. For instance, its generalization performance and feature representation capabilities may be limited, as its performance could be influenced by the quality and quantity of features extracted from network data. Further research on feature engineering may therefore be necessary to improve its effectiveness. Additionally, the algorithm's performance is optimized for the specific datasets used in this paper, which may affect its generalization to other, untested data and poses certain challenges in practical applications. Addressing these limitations and further enhancing the algorithm's robustness and adaptability will be crucial to advancing its practical applicability.
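The SMOTE step mentioned above can be illustrated in a few lines of NumPy: each synthetic sample is an interpolation between a minority-class point and one of its k nearest minority-class neighbours. This is a bare-bones sketch of the technique on toy data, not the library implementation presumably used in the study's pipeline.

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    a randomly chosen point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.random.default_rng(1).normal(size=(20, 4))  # toy minority class
print(smote(X_min, n_new=30).shape)  # (30, 4)
```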

Actual Deployment Testing
In recent years, the Internet of Things (IoT) has developed rapidly, bringing convenience to people. However, it also harbors numerous hidden risks, and security issues cannot be overlooked. In the future development and application of IoT, intrusion detection is worthy of attention and exploration. Through the various experimental comparisons presented earlier, the theoretical superiority of our algorithm has been demonstrated. In the following sections, we apply this algorithm to real-world scenarios and analyze its strengths and limitations under resource constraints and specific network conditions.
Currently, the algorithm faces certain challenges when applied in different scenarios. Intrusion detection algorithms require high real-time capabilities in network environments, but traditional cloud-based intelligent controls fall short due to bandwidth limitations and delays, failing to meet these demands. Therefore, this study adopts an intelligent gateway controller based on edge computing to replace the traditional cloud-based framework, enabling the construction of an intrusion detection model. Edge computing provides services closer to smart devices at edge nodes, offering advantages such as rapidity, real-time processing, low power consumption, and low bandwidth costs. Deploying the intrusion detection model at the edge can meet the real-time security needs of smart homes.
In line with the "cloud-edge-device" model, a technological system combines edge computing, IoT, cloud computing, and big data technologies. This results in a three-dimensional, efficient intrusion detection and multi-device networking capability, forming an intelligent IoT system framework, as depicted in Figure 12. The framework enables real-time data aggregation and management, enhancing the security measures for IoT.
Figure 12. Smart IoT intrusion detection system framework.
In the above system framework, the research aims to enable the cloud server to receive instructions from administrators or users and relay these commands to the intelligent gateway of edge computing through the cloud server. Serving as the central hub of all information, the intelligent gateway controls commands from the cloud server. It receives sensor information from the device side or transmits control commands to the terminal controller. The intrusion detection algorithm is applied to identify and detect the status of the information, determining whether it is in a normal state.
In different application scenarios, the number of devices in the intelligent gateway may vary. For this study, we have chosen Allwinnertech H6 as the core SoC of the edge intelligent gateway platform. This chip has a quad-core Cortex-A53 CPU and Mali T720 GPU, supporting dual-channel DDR4 and EMMC5.0 high-speed flash memory. Such selection enables us to better test the algorithm's efficiency and versatility under limited computational resources.
Within the framework of edge computing, implementing an intrusion detection system heavily relies on the data capture module to acquire flow data and parse data packets. Simultaneously, the data processing module aids in preprocessing flow data to conform to the model's input requirements. Lastly, the intrusion detection module classifies and detects the flow, presenting the detection results. The specific deployment process design is illustrated in Figure 13.
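The three modules described above can be sketched as a simple pipeline. The field names, the toy model, and the size threshold below are purely illustrative assumptions, not the actual gateway implementation.

```python
def capture_packets(raw_stream):
    """Data capture module: parse raw packets into flow records
    (the single 'bytes' field is illustrative)."""
    return [{"bytes": len(pkt)} for pkt in raw_stream]

def preprocess(flows):
    """Data processing module: convert flow records into the
    fixed-length numeric vectors the model expects."""
    return [[flow["bytes"]] for flow in flows]

def detect(vectors, model):
    """Intrusion detection module: classify each flow as
    normal (0) or attack (1)."""
    return [model(vec) for vec in vectors]

# Toy stand-in for the trained classifier: flag unusually large flows.
model = lambda vec: 1 if vec[0] > 100 else 0
stream = [bytes(50), bytes(200)]
results = detect(preprocess(capture_packets(stream)), model)
print(results)  # [0, 1]
```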
After completing the deployment, we conducted real-world testing and a comprehensive evaluation of the deployed model using the multiple evaluation criteria described earlier. In the Internet of Things (IoT) network environment, Denial-of-Service (DoS) attacks are currently one of the most common and highly destructive attacks. These attacks overwhelm the target by flooding it with many data packets, causing excessive resource consumption and rendering the service unavailable. To assess the algorithm's ability to detect such attacks, we chose the XOIC tool and utilized UDP and ICMP messages as the two DoS attack methods for testing.
Within the IoT framework described in the document, end-users issue control commands through the cloud server, which then relays the information to the edge intelligent gateway device. At the gateway, the information is subjected to real-time detection. Considering practical constraints, each normal connection was repeated ten times, lasting for 60 s. This comprised seven connections between the cloud server and the edge intelligent gateway and three connections between terminal devices and the edge intelligent gateway. Subsequently, the two attack types were initiated, each lasting for 10 s. This process was repeated 100 times, and the average values were calculated based on the actual detection results.
Moreover, to minimize interference from other traffic passing through the gateway, we set up filtering rules in the data collection and analysis module. Specifically, we collected and analyzed attack traffic between the attacking machines and the target, and normal traffic between the intelligent gateway and the servers or terminal devices. Throughout the testing process, we observed the statistical information and logs from the local detection module to analyze the data. The actual intrusion detection results based on our proposed algorithm are presented in Table 21. In the edge intelligent gateway application with limited computing resources, we implemented an enhanced model performing feature engineering based on minimum redundancy maximum relevance (mRMR) and the synthetic minority over-sampling technique (SMOTE). This reduced the data space complexity and decreased the computational burden. Compared to the unprocessed algorithm, the improved algorithm demonstrated a decrease of 0.404 s in testing time under similar network conditions. This enhancement enables the algorithm to handle a greater volume of traffic data within constrained computing resources and time limitations, meeting the intrusion detection system controller's requirements for identifying network data flows.
Furthermore, the improved algorithm exhibited notable improvements in accuracy, precision, recall, and F-measure values, with increases of 4.15%, 2.80%, 3.68%, and 3.24%, respectively, leading to more accurate data classification.
Based on the aforementioned test results, deploying the enhanced intrusion detection model on edge nodes enables real-time monitoring of traffic data. It allows timely response to intrusion behavior, effectively safeguarding the security of the IoT platform. However, to further enhance the algorithm's robustness and adaptability, we should focus on providing