The datasets used in the experiments of this section are derived from the core subsets of the UNSW-NB15 and TON_IoT datasets. On these datasets, this paper conducts targeted modeling experiments to validate the performance differences among models and their statistical significance, and to further explore the models' generalizability. On the UNSW-NB15 dataset, one training set and ten test sets are constructed through random sampling, and ten groups of independent experiments are carried out to ensure the reliability of the verification results. The training set and each test set contain 1100 samples with a data imbalance ratio of 10:1. This setup is used to evaluate model performance in data-imbalanced scenarios. To verify the models' generalizability, extended experiments are conducted on the TON_IoT dataset with two data imbalance ratios (5:1 and 10:1) set via random sampling: at 5:1, the training and test sets each contain 600 samples; at 10:1, each contains 1100 samples.
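The sampling setup can be illustrated with a minimal sketch. It assumes that a 10:1 set of 1100 samples consists of 1000 majority-class and 100 minority-class records, and that a 5:1 set of 600 samples consists of 500 and 100. The function name `sample_imbalanced` and the placeholder record pools are hypothetical and are not taken from the paper.

```python
import random


def sample_imbalanced(majority, minority, ratio, n_minority, seed=0):
    """Draw a random subset with `ratio` majority samples per minority sample."""
    rng = random.Random(seed)
    n_majority = ratio * n_minority
    if n_majority > len(majority) or n_minority > len(minority):
        raise ValueError("not enough samples to draw without replacement")
    subset = rng.sample(majority, n_majority) + rng.sample(minority, n_minority)
    rng.shuffle(subset)
    return subset


# Hypothetical placeholder records: (record_id, label), label 1 = minority class.
majority_pool = [(i, 0) for i in range(5000)]
minority_pool = [(i, 1) for i in range(5000, 5600)]

# 10:1 -> 1000 + 100 = 1100 samples; 5:1 -> 500 + 100 = 600 samples.
train_10 = sample_imbalanced(majority_pool, minority_pool, ratio=10, n_minority=100, seed=1)
train_5 = sample_imbalanced(majority_pool, minority_pool, ratio=5, n_minority=100, seed=1)
```

Repeating the call with ten different seeds would give ten independent test sets of the same size and ratio.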
4.1. Optimal Feature Subset Selection
The raw network packets of the UNSW-NB15 dataset were created by the IXIA PerfectStorm tool in the Cyber Range Lab of UNSW Canberra to generate a hybrid of real modern normal activities and synthetic contemporary attack behaviors. The dataset has a total of 49 features categorized into 6 major types, many of which are irrelevant or redundant. An excessive number of features causes the rule explosion problem in the BRB and, to a certain extent, reduces the performance of the algorithm and the classifier.
Moustafa and Slay [36] used association rule-mining techniques in their study to select the best features for the UNSW-NB15 dataset. Then, in 2017, T. Janarthanan et al. [37] went further and applied a variety of feature selection methods to the UNSW-NB15 dataset, including the CfsSubsetEval method, GreedyStepwise method, InfoGainAttributeEval method and Ranker method. They evaluated the features recommended by these methods and ran machine learning algorithms such as random forests in Weka. The experimental results show that the five features shown in Figure 6 perform the best among the proposed feature subsets.
Among them, service (a nominal feature with values such as http, ftp, and dns; see Table 1 for numerical values), sbytes (source-to-destination bytes), and sttl (source-to-destination time to live) belong to the base feature class. The smean feature (the mean size of the flow packets transmitted by the source) belongs to the content feature class, and the ct_dst_sport_ltm feature (the number of connections with the same destination address and source port among the last one hundred connections, ordered by last time) belongs to the additional generated feature class.
The TON_IoT dataset contains heterogeneous data including telemetry, operating system logs, and network traffic, covering both normal behaviors and various attacks (e.g., DoS, DDoS, and ransomware) targeting IoT/IIoT services. It supports the training and performance evaluation of intrusion detection systems and security situation assessment models.
To screen out the feature subset with the highest degree of correlation with the target label, this study employs two classic feature selection methods: the Random Forest algorithm, based on an ensemble of decision trees, and the Pearson correlation coefficient method, based on statistical correlation. The former evaluates the importance of features by calculating the splitting gain of features in the tree model, while the latter quantifies the degree of linear correlation between features and the label. The two methods build the feature evaluation system from two perspectives (model-driven and statistical correlation, respectively). Integrating their evaluation results retains the features with statistical significance and also ensures that the selected features actually contribute to the model prediction. The result is a feature subset with a high degree of correlation with the target label and a low redundancy level, which helps improve the prediction performance and generalization ability of the model. The results of feature selection using the Random Forest algorithm and the Pearson coefficient are shown in Figure 7 and Figure 8, respectively.
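The statistical half of this selection procedure can be sketched as follows. The sketch ranks features by the absolute Pearson correlation with the label and then merges that ranking with externally supplied Random Forest importances. The merging rule (averaging the two max-normalized scores) is an assumption made for illustration, since the text does not specify how the two results are integrated. The toy columns `a`, `b`, `c` and the importance values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def rank_by_abs_correlation(columns, labels):
    """Return (feature, |r|) pairs sorted from strongest to weakest."""
    scores = {name: abs(pearson(values, labels)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def combine_scores(pearson_scores, forest_importances, top_k):
    """Assumed merging rule: average the two max-normalized scores."""
    def normalize(scores):
        peak = max(scores.values()) or 1.0
        return {name: value / peak for name, value in scores.items()}

    p, f = normalize(pearson_scores), normalize(forest_importances)
    merged = {name: (p[name] + f[name]) / 2 for name in p}
    return sorted(merged, key=merged.get, reverse=True)[:top_k]


# Toy data with hypothetical feature names.
labels = [0, 0, 1, 1]
columns = {"a": [1, 2, 3, 4], "b": [1, 1, 1, 1], "c": [0, 0, 1, 1]}
ranked = rank_by_abs_correlation(columns, labels)

# Hypothetical Random Forest importances, e.g. obtained from a fitted forest.
forest_importances = {"a": 0.3, "b": 0.1, "c": 0.6}
selected = combine_scores(dict(ranked), forest_importances, top_k=2)
```

Pearson correlation only captures linear dependence, which is one reason to pair it with a model-driven importance score.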
Through the comprehensive analysis of the Random Forest algorithm and the Pearson correlation coefficient, we selected the following four features that have the strongest correlation with the labels:
Process_Virtual_Bytes Peak: Represents the peak value of the virtual bytes of a process, reflecting the maximum amount of virtual memory used by the process during its operation;
Process_Thread Count: Represents the number of threads of a process, reflecting the concurrent processing capability of the process;
Process_Handle Count: Represents the number of handles of a process, reflecting the occupancy of system resources by the process;
Process_Pool_Paged Bytes: Represents the number of bytes in the paged pool of the process, reflecting the usage of paged memory by the process.
4.2. Problem Description
The purpose of this case study is to demonstrate the validity of the proposed classification method on the class-imbalance problem. The class with fewer samples is denoted as the positive class (positive samples), and the class with more samples is denoted as the negative class (negative samples).
- (1) Establishment of a cybersecurity data classification model for the Industrial Internet based on CG-BRB
Step 1: Setting the reference values for the BRB model:
There are 10 classes in this dataset: {Normal, Analysis, Backdoor, Dos, Exploits, Fuzzers, Reconnaissance, Generic, Shellcode, Worms}. According to the dataset labels and cybersecurity semantics, these ten classes are mapped into two security states: normal data and attack data. Specifically, Normal is defined as normal data, while {Analysis, Backdoor, Dos, Exploits, Fuzzers, Reconnaissance, Generic, Shellcode, Worms} are defined as attack data, so the model output takes two reference values: normal data and attack data.
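The ten-to-two class mapping can be written as a small lookup. This is only a sketch: the numeric codes 0 (normal) and 1 (attack) and the function name `to_security_state` are assumptions, and the comparison is made case-insensitive because capitalization of category names can vary between files.

```python
# Attack categories as listed in the text; 0/1 codes are an assumed encoding.
ATTACK_CLASSES = {
    "analysis", "backdoor", "dos", "exploits", "fuzzers",
    "reconnaissance", "generic", "shellcode", "worms",
}


def to_security_state(category):
    """Map one of the ten class names to 0 (normal data) or 1 (attack data)."""
    name = category.strip().casefold()
    if name == "normal":
        return 0
    if name in ATTACK_CLASSES:
        return 1
    raise ValueError(f"unknown category: {category!r}")


states = [to_security_state(c) for c in ["Normal", "Dos", "Worms", " Fuzzers "]]
```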
Step 2: Rebalancing of training-set data:
The CBO algorithm is employed to rebalance the training set. The specific implementation procedure is described as follows: First, the minority-class samples are partitioned into several clusters using K-means clustering, where the number of clusters is determined by the elbow method. Subsequently, based on the inherent class-imbalance ratio, synthetic data are generated at the centroid of each cluster to transition the dataset into a balanced state. The evaluation of synthetic data quality remains a complex challenge due to the lack of standardized, widely accepted criteria [38]. To comprehensively assess the quality of our generated synthetic data, we adopt both theoretical and experimental approaches. Theoretically, we use widely accepted metrics to evaluate the consistency between synthetic and original data. Experimentally, we conduct a controlled study using logistic regression: the majority-class samples and the test set remain unchanged, while the minority-class samples in the training set are replaced with either original or synthetic samples. The resulting classification performances are then compared to isolate the impact of synthetic data quality on model behavior. To verify the effectiveness of the CBO oversampling method in preserving the original data structure, we performed an adaptive clustering analysis (based on the elbow method) on the minority-class samples. As illustrated in Figure 9, the original 100 minority-class samples were adaptively partitioned into two clusters via the K-means algorithm, with the distribution as follows: Cluster 0 contains 47 samples (47.0%), and Cluster 1 contains 53 samples (53.0%).
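A cluster-based oversampling step of this kind can be sketched in plain Python. The sketch is not the authors' implementation: it takes the number of clusters `k` as an argument instead of running the elbow method, and it assumes that each synthetic point is placed on the segment between a cluster centroid and a randomly chosen member of that cluster. All function names are hypothetical.

```python
import random


def _dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = None
    for _ in range(iters):
        new_assign = [
            min(range(k), key=lambda c: _dist2(p, centroids[c])) for p in points
        ]
        if new_assign == assign:
            break  # assignments are stable, so the centroids are too
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = _mean(members)
    return centroids, assign


def cluster_based_oversample(minority, n_target, k, seed=0):
    """Generate synthetic minority samples until the class has n_target samples."""
    rng = random.Random(seed)
    centroids, assign = kmeans(minority, k, seed=seed)
    synthetic = []
    for _ in range(n_target - len(minority)):
        idx = rng.randrange(len(minority))  # clusters are hit in proportion to size
        centre, member = centroids[assign[idx]], minority[idx]
        t = rng.random()  # assumed rule: interpolate between centroid and member
        synthetic.append([c + t * (m - c) for c, m in zip(centre, member)])
    return synthetic


# Toy minority class with two well-separated groups.
minority = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
synthetic = cluster_based_oversample(minority, n_target=20, k=2, seed=3)
```

Because every synthetic point is a convex combination of a centroid and a cluster member, it stays inside the convex hull of its own cluster, which is the property that keeps clusters from blending into each other.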
Figure 9 illustrates the distribution of original minority-class samples and CBO-generated samples in the PCA dimension-reduced space. After reducing the five-dimensional features to two dimensions via principal component analysis, we visually observed the following key characteristics:
Cluster structure preservation: The distribution patterns of the original Cluster 0 and Cluster 1 are effectively retained in the synthetic samples. Synthetic samples are tightly clustered around the original clusters, with no significant structural deviation.
Intra-cluster compactness: Synthetic and original samples within each cluster exhibit high aggregation in the PCA space, indicating that the generation process preserves the compactness of the original clusters. The centroids of Cluster 0 and Cluster 1 remain relatively stable, with no substantial shifts induced by synthetic samples.
Inter-cluster separability: The two clusters maintain a distinct separation boundary in the PCA space, with no cross-cluster mixing of synthetic samples. This verifies that the CBO method can retain the discriminative features between original clusters, avoiding the blurring of class boundaries caused by oversampling.
Following the comprehensive evaluation framework for synthetic data introduced by Hernandez et al. [39], the quality of the generated synthetic data was assessed across multiple dimensions. The corresponding results are summarized in Table 2.
The silhouette coefficient ranges within [−1, 1]. A larger value indicates better intra-cluster compactness and inter-cluster separability of samples. Values in the range of 0.71–1.0 correspond to the excellent interval, and 0.51–0.70 is the good interval [40]. The result obtained in this paper is 0.5925, which is in the good interval. This shows that synthetic samples can reasonably preserve the clustering structure of the original data, exhibiting a stable and rational cluster distribution pattern.
Diversity measures the discrepancy among generated samples. A higher score represents better sample differentiation and effectively avoids simple replication of original data. The diversity score of 0.7055 is at a relatively high level, indicating that the CBO method can effectively prevent invalid sample duplication and reduce the risk of model overfitting.
The Hellinger distance ranges within [0, 1]. A smaller value indicates a closer probability distribution between synthetic samples and original data. The result of 0.1842 in this paper confirms that the overall distribution deviation between synthetic data and original data is extremely small and that the distribution fitting effect is ideal.
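For reference, the Hellinger distance between two discrete distributions can be computed as below. This is a sketch under the assumption that the original and synthetic samples are first binned on a shared grid for each feature; the binning helper `histogram` and its parameters are hypothetical, and the paper does not state how its distributions were estimated.

```python
import math


def histogram(values, lo, hi, bins):
    """Count values into equal-width bins on [lo, hi]; out-of-range values are clamped."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        counts[min(max(idx, 0), bins - 1)] += 1
    return counts


def hellinger(p, q):
    """Hellinger distance between two non-negative count or probability vectors."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    sp, sq = sum(p), sum(q)
    if sp <= 0 or sq <= 0:
        raise ValueError("distributions must have positive mass")
    total = sum((math.sqrt(a / sp) - math.sqrt(b / sq)) ** 2 for a, b in zip(p, q))
    return min(1.0, math.sqrt(total / 2))


# Toy comparison of one feature in original vs. synthetic samples.
original_counts = histogram([0.1, 0.2, 0.9], lo=0.0, hi=1.0, bins=2)
synthetic_counts = histogram([0.15, 0.3, 0.8], lo=0.0, hi=1.0, bins=2)
distance = hellinger(original_counts, synthetic_counts)
```

Identical histograms give a distance of 0, and histograms with no overlapping bins give 1, matching the stated range.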
The PCD index ranges within [0, 1]. A smaller value represents higher feature restoration accuracy of synthetic minority-class samples. The PCD result of this paper is 0.1162, which illustrates that the features of generated minority-class samples are consistent with real samples and that the sample generation accuracy is high.
AUC-ROC is adopted to measure the ability of classifiers to distinguish real samples from synthetic samples, with a value range of [0.5, 1]. The closer the value is to 0.5, the higher the similarity and the better the fusion effect between the two types of samples. The result of this paper is 0.5867, which proves that synthetic samples are highly similar to real samples in features and difficult to distinguish, with excellent data simulation performance and authenticity.
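The AUC-ROC used in this real-versus-synthetic check can be computed directly from its rank interpretation: the probability that a randomly chosen synthetic sample receives a higher discriminator score than a randomly chosen real one, with ties counted as one half. The sketch below assumes label 1 for synthetic and 0 for real samples; the discriminator that produces the scores is not specified in the text and is left out.

```python
def auc_roc(scores, labels):
    """AUC via pairwise comparison of positive (1) and negative (0) scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical discriminator scores: 1 = synthetic sample, 0 = real sample.
separable = auc_roc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
indistinguishable = auc_roc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])
```

A discriminator that scores below 0.5 can be inverted, which is why the metric is usually reported on [0.5, 1] in this setting.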
As shown in Table 3, compared with training on original minority-class data, training on synthetic data slightly increases recall (0.8700 → 0.8900) while marginally decreasing the F1 score (0.9255 → 0.8900). These fluctuations are minor, and minority-class detection even improves slightly. This indicates that the synthetic data preserve the key characteristics of the original data and do not substantially change model performance.
Step 3: Data Characterization:
The selected feature subsets are fused. Content and generated features are fused based on shared attributes. The two fusion results serve as BRB premise attributes 1 and 2, with the labeled feature as the model's output.
Five core attributes are selected for analysis: service, sbytes (source bytes), sttl (source time to live), smean (mean source packet size), and ct_dst_sport_ltm (count of recent connections sharing the same destination address and source port). First, continuous attributes (sbytes, sttl, and smean) are standardized to eliminate dimensional discrepancies. For the discrete “service” attribute, one-hot encoding is applied for numerical transformation. Based on data distribution characteristics and domain knowledge in network security, all attributes are then mapped to belief distributions that include five preset evaluation levels and global ignorance, with each distribution strictly satisfying the non-negativity and normalization constraints of belief measures.
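One common way to map a numeric attribute to a belief distribution over reference levels is linear interpolation between the two adjacent reference values. The sketch below assumes that rule; the paper does not state which transformation it uses, and the five reference values chosen here for a standardized attribute are hypothetical. Under this rule the beliefs always sum to one, so the global ignorance term is zero.

```python
def to_belief_distribution(x, reference_values):
    """Split belief between the two reference values that bracket x."""
    refs = list(reference_values)  # must be sorted in ascending order
    beliefs = [0.0] * len(refs)
    if x <= refs[0]:
        beliefs[0] = 1.0
        return beliefs
    if x >= refs[-1]:
        beliefs[-1] = 1.0
        return beliefs
    for k in range(len(refs) - 1):
        lo, hi = refs[k], refs[k + 1]
        if lo <= x <= hi:
            beliefs[k + 1] = (x - lo) / (hi - lo)
            beliefs[k] = 1.0 - beliefs[k + 1]
            break
    return beliefs


# Hypothetical reference values for a standardized attribute (five levels).
LEVELS = [-2.0, -1.0, 0.0, 1.0, 2.0]
belief = to_belief_distribution(0.5, LEVELS)
```

With these levels, a standardized value of 0.5 is shared equally between the third and fourth levels.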
Subsequently, the attributes are divided into two groups according to their network security semantic features: the traffic intensity group (sbytes, smean, and sttl) and the connection anomaly group (service and ct_dst_sport_ltm). The Entropy Weight Method (EWM) is employed to determine the weight of each attribute within its respective group, thereby quantifying the attribute’s information contribution—where lower information entropy indicates stronger discriminative power and a higher corresponding weight.
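The Entropy Weight Method can be sketched as follows, using its standard formulation: each column is normalized to a probability vector, its entropy is scaled by the logarithm of the number of samples, and the weight is proportional to one minus that entropy. The sketch assumes non-negative inputs (for example, min-max scaled attributes); the toy matrix is hypothetical.

```python
import math


def entropy_weights(rows):
    """Return one weight per column; lower entropy gives a higher weight."""
    n_samples, n_attrs = len(rows), len(rows[0])
    divergences = []
    for col in zip(*rows):
        total = sum(col)
        if total <= 0 or min(col) < 0:
            raise ValueError("columns must be non-negative with positive sum")
        probs = [v / total for v in col]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n_samples)
        divergences.append(max(0.0, 1.0 - entropy))  # clamp rounding noise
    spread = sum(divergences)
    if spread == 0:
        return [1.0 / n_attrs] * n_attrs  # all columns equally uninformative
    return [d / spread for d in divergences]


# Toy group of two attributes: the first is constant, the second varies.
weights = entropy_weights([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
```

A constant column has maximal entropy and therefore receives a weight of essentially zero, while the varying column takes nearly all of the weight.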
Based on the calculated weights, the belief distribution of each attribute in the group is discounted to obtain the Basic Probability Assignment (BPA) of the corresponding evidence. The BPAs of all evidence within the group are then fused iteratively. During the fusion process, a normalization conflict factor specific to the Evidential Reasoning (ER) algorithm is introduced to effectively address significant conflicts between pieces of evidence, ensuring the rationality of the fusion result. Ultimately, a comprehensive belief distribution is generated for each group, accurately characterizing the network security state with respect to the dimensions of traffic intensity and connection anomaly.
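The discount-then-fuse procedure can be illustrated with the recursive Evidential Reasoning algorithm in its commonly published form. This is a sketch rather than the paper's exact implementation: the function name `er_fuse` is hypothetical, the weights are assumed to lie strictly between 0 and 1 (for example, entropy weights that sum to one), and each belief distribution is a list over the same evaluation grades whose sum may be less than one when there is ignorance.

```python
def er_fuse(beliefs, weights):
    """Fuse belief distributions; returns (fused beliefs, remaining ignorance)."""
    n = len(beliefs[0])
    # Discount the first piece of evidence into basic probability masses.
    m = [weights[0] * b for b in beliefs[0]]
    m_bar = 1.0 - weights[0]                          # mass left by the weight
    m_tilde = weights[0] * (1.0 - sum(beliefs[0]))    # mass left by ignorance
    for b, w in zip(beliefs[1:], weights[1:]):
        m2 = [w * x for x in b]
        bar2 = 1.0 - w
        tilde2 = w * (1.0 - sum(b))
        h1, h2 = m_bar + m_tilde, bar2 + tilde2
        # Conflict: mass assigned to two different grades by the two sources.
        conflict = sum(m[t] * m2[j] for t in range(n) for j in range(n) if j != t)
        k = 1.0 / (1.0 - conflict)                    # normalization factor
        new_m = [k * (m[i] * m2[i] + h1 * m2[i] + m[i] * h2) for i in range(n)]
        new_tilde = k * (m_tilde * tilde2 + m_bar * tilde2 + m_tilde * bar2)
        new_bar = k * m_bar * bar2
        m, m_tilde, m_bar = new_m, new_tilde, new_bar
    denom = 1.0 - m_bar
    return [x / denom for x in m], m_tilde / denom


# Two agreeing sources, then two fully conflicting sources (toy values).
agree, agree_ignorance = er_fuse([[1.0, 0.0], [1.0, 0.0]], [0.5, 0.5])
clash, clash_ignorance = er_fuse([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
```

With equal weights, two agreeing sources keep full belief in the shared grade, and two fully conflicting sources split the belief evenly instead of producing an undefined result.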
This ER fusion process achieves effective dimensionality reduction by converting the five original attributes into two comprehensive security indicators. More importantly, it significantly mitigates the rule explosion problem inherent in the Belief Rule Base (BRB) by reducing the number of premise attributes. The number of rules in a BRB follows the formula $m^n$, where $n$ denotes the number of antecedent attributes and $m$ represents the number of reference values for each attribute. Taking the experimental data as an example, if each of the five original attributes contains five reference values, the corresponding number of BRB rules is $5^5 = 3125$; in contrast, the two reduced-dimension comprehensive attributes (each with five reference values) correspond to only $5^2 = 25$ rules. This dimensionality reduction scheme substantially lowers model complexity and improves inference efficiency while preserving key security information.
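The growth in rule count can be checked with a few lines of Python. The sketch assumes a complete rule base, in which every combination of reference values across the antecedent attributes forms one rule; the function names are hypothetical.

```python
from itertools import product


def rule_count(n_attributes, n_reference_values):
    """Number of rules in a complete rule base: m to the power n."""
    return n_reference_values ** n_attributes


def enumerate_antecedents(n_attributes, n_reference_values):
    """List every combination of reference-value indices, one per rule."""
    return list(product(range(n_reference_values), repeat=n_attributes))


rules_before = rule_count(n_attributes=5, n_reference_values=5)
rules_after = rule_count(n_attributes=2, n_reference_values=5)
antecedents_after = enumerate_antecedents(2, 5)
```

Going from five attributes to two fused indicators cuts the rule base by a factor of 125, which also shrinks the number of belief degrees and rule weights the optimizer has to tune.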
Step 4: BRB Construction:
The two ER fusion results obtained in Step 3 are used as the premise attributes to construct the BRB model.
Step 5: Parameter optimization:
After building the BRB model in Step 4, its parameters are optimized using the Circle-GWO algorithm, and the optimized BRB model is finally generated. An early stopping mechanism is introduced into the optimization process: the optimization is terminated if the loss value does not improve for five consecutive rounds, with the maximum number of optimization rounds set to 50. The variation of the loss value with the number of optimization rounds is illustrated in Figure 10.
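The early stopping rule can be sketched independently of the optimizer. In the sketch below, `step` is a hypothetical callable standing in for one Circle-GWO round that returns the current loss; the wolf-pack update itself is not shown. The defaults mirror the stated settings of a patience of five rounds and a cap of 50 rounds.

```python
def optimize_with_early_stopping(step, max_rounds=50, patience=5):
    """Run `step` until the loss fails to improve for `patience` rounds in a row."""
    best = float("inf")
    stale = 0
    history = []
    for round_idx in range(1, max_rounds + 1):
        loss = step(round_idx)  # one optimizer round; returns the current loss
        history.append(loss)
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best, history


# Toy loss curve: improves for three rounds, then plateaus.
toy_losses = [1.0, 0.8, 0.7] + [0.7] * 20
best_loss, loss_history = optimize_with_early_stopping(lambda r: toy_losses[r - 1])
```

On this toy curve the loop stops after round 8, that is, five rounds after the last improvement at round 3.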
The experimental results before and after optimization are presented in Table 4. The results show that the Circle-GWO algorithm significantly improves both the performance and convergence of the Belief Rule Base (BRB).
- (2) Numerical analysis
For UNSW-NB15, five attributes were selected and divided into 2 categories with imbalance ratios of 5:1 and 10:1. The number of BRB premise attributes was set to 2, and Circle-GWO optimized the model for 100 iterations per run. Given the data imbalance, accuracy and recall serve as evaluation metrics. The formulas for accuracy, recall, and precision follow:
$$\text{Accuracy} = \frac{TM - FN - FP}{TM}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$
where TM is the total number of samples, TP represents the number of positive-class samples correctly predicted as the positive class, FN represents the number of positive-class samples incorrectly predicted as the negative class, and FP represents the number of negative-class samples incorrectly predicted as the positive class.
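These metrics, together with the F1 score used later in the comparison, can be computed from the four counts as in the following sketch. Accuracy is written in terms of the total sample count, so that only TM, TP, FN, and FP are needed. The counts in the example are hypothetical and only mimic a 1100-sample test set with 100 positive samples.

```python
def classification_metrics(tp, fn, fp, tm):
    """Accuracy, recall, precision, and F1 from confusion counts."""
    accuracy = (tm - fn - fp) / tm  # every sample that is not an error
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Hypothetical counts: 100 positives (90 found, 10 missed) and 20 false alarms.
metrics = classification_metrics(tp=90, fn=10, fp=20, tm=1100)
```

In this example accuracy stays above 0.97 even though one in ten positive samples is missed, which illustrates why accuracy alone is a weak indicator on imbalanced data.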
4.3. Experimental Results and Comparative Studies
To verify the effectiveness of the proposed CG-BRB model, this paper selects four mainstream classifiers for comparative experiments. These experiments are conducted on 1100 samples under a data imbalance ratio of 10:1, with specific experimental results and comparative analysis presented in Table 5, Table 6, Table 7 and Table 8. The selected comparative classifiers include XGBoost, RF, PSO-SVM, and KNN. The mean values and error bars of all experimental results are illustrated in Figure 11 (error bars denote standard deviation).
In imbalanced classification tasks, accuracy is prone to artificial inflation by majority-class samples, failing to objectively reflect minority-class detection capability. While precision reveals the false-positive risk among minority-class predictions, it overlooks the false-negative risk on true minority samples. In contrast, recall directly quantifies minority-class coverage, serving as the core metric for critical sample identification. As the harmonic mean of precision and recall, the F1 score balances false positives and false negatives to provide a comprehensive measure of overall classification performance. Thus, this study adopts recall and the F1 score as the primary evaluation metrics.
In terms of the stability of recall and the F1 score, the proposed CG-BRB model exhibits performance comparable to that of XGBoost, and both significantly outperform the other comparative models. Specifically, CG-BRB’s recall standard deviation is 0.0129, which is only slightly higher than that of XGBoost (0.0125) and far lower than those of Random Forest (0.0487), KNN (0.0460), and PSO-SVM (0.0388). For the F1 score, the standard deviation of CG-BRB is 0.0231, showing a small gap relative to XGBoost (0.0155) and PSO-SVM (0.0264) while being obviously superior to Random Forest (0.0309) and KNN (0.0365).
To verify whether the CG-BRB model proposed in this paper has statistically significant advantages over other comparative models, this section conducts significance tests on two core evaluation metrics, namely recall and F1 score. The test procedure first adopts the Shapiro–Wilk test to analyze the normality of the data, on the basis of which the specific method for subsequent significance tests is determined. The results of the Shapiro–Wilk test are presented in Table 9 and Table 10.
The results of the aforementioned Shapiro–Wilk test indicate that the datasets corresponding to the recall and F1 score of each model all conform to the normal distribution. On this basis, this paper adopts the paired t-test method to conduct subsequent significance tests.
$H_0$: The indicator difference between the CG-BRB model and the comparative models is equal to 0 (i.e., there is no significant difference between them).
$H_1$: The indicator difference between the CG-BRB model and the comparative models is not equal to 0 (i.e., there is a significant difference between them).
Significance level $\alpha = 0.05$.
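The paired t-statistic behind this test can be computed with the standard library alone. The sketch below returns the statistic and the degrees of freedom and compares the statistic with the two-tailed critical value for 9 degrees of freedom at the 0.05 level (about 2.262); it does not compute an exact p-value. The per-test-set scores are hypothetical and are not the values reported in the tables.

```python
import math
from statistics import mean, stdev

# Two-tailed critical value of Student's t for df = 9 at alpha = 0.05.
T_CRIT_DF9 = 2.262


def paired_t_statistic(a, b):
    """Return (t, df) for paired samples a and b of equal length."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Hypothetical recall values on ten test sets for two models.
model_a = [0.95, 0.96, 0.94, 0.97, 0.95, 0.96, 0.95, 0.94, 0.96, 0.97]
model_b = [0.85, 0.84, 0.84, 0.85, 0.85, 0.84, 0.85, 0.82, 0.86, 0.85]
t_value, dof = paired_t_statistic(model_a, model_b)
reject_h0 = abs(t_value) > T_CRIT_DF9
```

The paired form is appropriate here because both models are evaluated on the same ten test sets, so the per-set differences, not the raw scores, carry the comparison.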
The superiority of the CG-BRB model is verified through statistical significance analysis based on 10 independent test datasets. The Shapiro–Wilk normality test confirms that the core performance metrics (recall and F1 score) of all comparative models follow a normal distribution. On this basis, a two-tailed paired t-test with a significance level of $\alpha = 0.05$ is adopted to conduct paired difference analysis. The results show that the CG-BRB model exhibits significant or extremely significant advantages over the RF, KNN, and PSO-SVM models in key performance metrics.
In terms of recall, compared with the RF, KNN, and PSO-SVM models, the two-tailed p-values of the CG-BRB model are all less than 0.001, with mean differences reaching 0.115, 0.209, and 0.097, respectively. Among these, the performance gap with the KNN model is the most prominent: the average recall is 0.209 (20.9 percentage points) higher, and the corresponding t-statistic is as high as 13.080, which demonstrates that the advantage of the CG-BRB model in recall has strong statistical reliability. In terms of the F1 score, which characterizes comprehensive performance, the two-tailed p-values of the CG-BRB model against the RF, PSO-SVM, and KNN models are all below the 0.05 significance level (less than 0.0001 for KNN), with mean differences of 0.049, 0.0516, and 0.1309, respectively. This further verifies that the model is significantly superior to the above three models in balancing precision and recall.
To assess the statistical significance of performance differences between CG-BRB and XGBoost, paired two-tailed t-tests were performed on the results from the 10 independent test sets (degrees of freedom, $df = 9$; significance level, $\alpha = 0.05$). As shown in Table 11 (recall) and Table 12 (F1 score), the two-tailed p-values for the comparisons are approximately 0.497 (recall) and 0.149 (F1 score), both of which exceed the pre-defined significance level of 0.05. Combined with the negligible mean performance differences, these results indicate that there are no statistically significant differences between the two models in core classification performance. Given that XGBoost has been widely shown to achieve excellent performance in imbalanced data classification tasks, this finding supports the conclusion that the CG-BRB model also achieves competitive performance on such tasks.
However, in Industrial Internet scenarios, operation and maintenance personnel not only require models to accurately identify whether an event is classified as an attack but also demand clear justifications for these decisions and traceable reasoning processes. Traditional machine learning models represented by XGBoost are essentially “black-box models” that cannot provide clear decision-making logic and interpretability for operation and maintenance staff, making it difficult for them to meet the stringent requirements of the Industrial Internet field. In contrast, the CG-BRB model offers inherently strong interpretability, which enables explicit and traceable reasoning processes. Therefore, it is more suitable for the practical application requirements of the Industrial Internet domain.
Overall, the analysis shows that the CG-BRB model is statistically superior to the RF, KNN, and PSO-SVM models and offers a unique interpretability advantage while maintaining performance comparable to that of the XGBoost model. It meets the dual requirements of model performance and operational security in the Industrial Internet field, making it a better solution for imbalanced data classification tasks in this field.