1. Introduction
As more devices and services become interconnected through the internet, the risk of cyberattacks is increasing rapidly [1]. Among these attacks, denial-of-service (DoS) is particularly dangerous because it overwhelms target systems or networks with excessive traffic, disrupting critical services and causing severe damage [2]. Therefore, network intrusion detection systems (NIDS) have become a crucial defense mechanism against such evolving threats [3]. By monitoring and analyzing network traffic, NIDS can effectively identify malicious activities. Numerous studies have applied machine learning to develop efficient NIDS frameworks [4].
Machine learning is well suited for intrusion detection because it can automatically learn data patterns and predict attack behaviors [5]. However, constructing a machine learning-based NIDS remains challenging due to the high class imbalance in real-world network data. Machine learning models typically assume evenly distributed classes, yet attack instances are far less frequent than normal traffic [6]. This imbalance results in biased models that tend to overfit normal traffic and fail to detect rare but critical DoS attacks. To address this issue, researchers have proposed various solutions, which can generally be categorized into three approaches: data-level, algorithm-level, and hybrid methods [7,8].
At the data-level, sampling techniques are commonly used to balance class distributions. Under-sampling methods reduce the majority class size by randomly removing samples, thus addressing imbalance without increasing data volume [9,10]. Conversely, over-sampling methods, such as the synthetic minority over-sampling technique (SMOTE) and its variants (e.g., Borderline-SMOTE), generate synthetic samples of minority classes to balance datasets [11]. These methods have demonstrated improved performance on class-imbalanced data; however, under-sampling may cause significant information loss [12], whereas over-sampling can lead to overfitting by generating redundant synthetic samples [13].
At the algorithm-level, cost-sensitive learning is often employed to mitigate class imbalance without altering the data distribution [14]. This approach assigns higher misclassification costs to minority class instances within the loss function, enabling the model to pay greater attention to them during training. Although cost-sensitive methods preserve the original data size, their effectiveness depends heavily on the proper selection of cost parameters. Consequently, they may lead to trade-offs between sensitivity and specificity, making the optimization process more unstable [15].
To overcome the limitations of both approaches, hybrid methods integrate data-level and model-level strategies [16]. A representative example is EasyEnsemble [17]. This technique applies random under-sampling multiple times to generate several balanced subsets of the training data. Each subset is then used to train an independent AdaBoost model, which increases the weights of misclassified samples so that subsequent models focus more on them. This design allows EasyEnsemble to combine data-level resampling with algorithm-level cost-sensitive weighting, effectively mitigating information loss while emphasizing minority class instances during learning [18]. Recent studies have also explored deep learning–based hybrid approaches that combine data-level sampling methods with algorithm-level deep learning architectures [19,20].
However, hybrid methods still have limitations. The final prediction of hybrid methods such as EasyEnsemble is typically obtained through a simple aggregation of multiple model outputs, which may dilute the unique patterns and characteristics learned by each individual model. Although deep learning–based hybrid approaches have shown improved performance, they rely heavily on synthetic data quality and often fail to explicitly preserve minority class characteristics. Thus, effectively improving minority class detection and ensuring stable prediction performance in these approaches remain open problems. To this end, we propose MCH-Ensemble (Minority Class Highlighting Ensemble), a novel hybrid framework. We aim to enhance DoS attack detection by emphasizing minority class features, improving minority class recognition and thereby achieving more accurate predictions.
In MCH-Ensemble, k balanced training subsets are generated from the original dataset using random under-sampling, where k is a hyperparameter that controls the number of subsets. Each subset is then used to train base models such as decision tree, extreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM). Subsequently, the attack class instances predicted by each model are compared with the actual attack class labels, and a constant weight of 1 is added to the feature values of correctly predicted instances. This process highlights well-predicted minority features and introduces a boosting-like effect. All highlighted subsets are then merged into a single dataset, which is used to train a random forest model as the final classifier.
MCH-Ensemble enhances detection performance by leveraging the bagging mechanism of random forest. In bagging, multiple decision trees are trained on different subsets of the data, and their results are aggregated. By highlighting features of attack classes, the method produces a more detailed dataset that distinguishes among normal traffic, attack traffic, and severe attacks. Each tree learns distinct patterns from these variations, and when combined, the ensemble can detect attacks more accurately by capturing richer and more diverse decision boundaries.
For evaluation, we employ network intrusion detection datasets, specifically focusing on DoS attack traffic. To determine the optimal configuration, decision tree, XGBoost, and LightGBM models are compared in combination with the random forest meta-model to identify the most effective setup for attack class detection.
The main contributions of this study are summarized as follows:
We propose MCH-Ensemble, a hybrid ensemble framework that effectively addresses class imbalance in DoS attack detection.
We introduce a novel highlighting mechanism that emphasizes attack class features within the data, thereby improving the model’s ability to classify subtle attack patterns.
We demonstrate that integrating these highlighted datasets with a random forest meta-model enhances prediction performance by leveraging diverse and fine-grained bagging samples.
The effectiveness of MCH-Ensemble is evaluated on the UNSW-NB15, CIC-IDS2017, and WSN-DS datasets to identify the optimal classifier combination for enhanced detection performance.
The remainder of this paper is organized as follows: Section 2 presents related work. Section 3 describes the proposed MCH-Ensemble method. Section 4 discusses the experimental results, and Section 5 concludes the paper and outlines directions for future research.
3. Methodology
This section introduces the methods used in this study, including machine learning-based classifiers, under-sampling ensemble, and minority class highlighting. These methods collectively form the proposed framework.
3.1. Machine Learning-Based Classifiers
3.1.1. Decision Tree
A decision tree is a tree-structured model that generates decision rules from training data to classify new instances [32]. It consists of nodes and branches, where each internal node represents a decision criterion, and branches connect nodes based on possible outcomes. The root node initiates the data division, while the leaf nodes represent the final output classes according to the learned decision rules [33]. Decision trees have the advantage of interpretability because their decision process can be visualized in a hierarchical structure [34]. However, since a single tree is trained on the entire dataset, it is prone to overfitting when the tree depth increases.
3.1.2. Random Forest
Random forest is an ensemble learning algorithm that combines multiple decision trees to improve prediction stability and accuracy [35]. It is based on bagging, which stands for “bootstrap aggregation” [36]. In this process, the model first performs sampling with replacement to create several bootstrap training sets. Each training set is used to train an independent decision tree. The predictions from all trees are then aggregated, typically through majority voting, to produce the final classification result. Because random forest aggregates multiple models, it provides robust predictions even for large-scale datasets. However, it has limitations in interpreting the relative importance of individual features and may require parameter tuning to achieve optimal performance [37].
3.1.3. Extreme Gradient Boosting (XGBoost)
Extreme gradient boosting (XGBoost) is an ensemble algorithm based on gradient boosting principles [38]. It builds successive models that correct the residual errors of previous ones, assigning feature-specific weights and combining the outputs to generate the final prediction. Initially, XGBoost trains a decision tree using the training data and computes the residual errors as the difference between actual and predicted values. These residuals are then used to generate weighted training samples for the next iteration, allowing subsequent trees to focus on previously misclassified instances. XGBoost improves upon traditional gradient boosting by employing GPU-accelerated parallel processing to reduce training time. It also mitigates overfitting through level-wise tree growth, maintaining balanced structures that enable stable and efficient training on large-scale datasets [39].
3.1.4. Light Gradient Boosting Machine (LightGBM)
LightGBM is a boosting-based ensemble model designed for high training efficiency using a leaf-wise growth strategy [40]. In contrast to XGBoost’s level-wise method, LightGBM splits the leaf node that yields the maximum loss reduction at each iteration, enabling faster convergence and improved accuracy. This leaf-wise splitting provides substantially higher training speed than level-wise approaches while maintaining competitive predictive performance [41].
3.2. Under-Sampling Ensemble
To address the class imbalance problem, we employ an under-sampling ensemble method that constructs multiple balanced training datasets through repeated sampling with replacement. This approach compensates for the potential information loss typically caused by standard under-sampling. By creating multiple balanced training datasets from diverse subsets of majority class instances [42], the method enhances generalization capability and reduces the likelihood of losing informative samples. Furthermore, this ensemble allows classifiers to learn from a wider range of feature representations.
Figure 1 illustrates the process of the under-sampling ensemble. Here, n_minority represents the number of minority class instances. Majority class instances are randomly sampled k times with replacement to match n_minority. The parameter k is a hyperparameter and is set to 5 in this study (Appendix A). The number of sampled majority class instances (n_sampled-majority) equals n_minority. These sampled majority class instances are then combined with the minority class instances to generate balanced datasets. All minority class instances remain constant across subsets. As a result, k balanced training datasets are produced.
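The subset construction described above can be sketched in a few lines of NumPy (an illustrative sketch; the function name, seed handling, and label convention are our own assumptions, not part of the original implementation):

```python
import numpy as np

def undersampling_ensemble(X, y, k=5, minority_label=1, seed=0):
    """Build k balanced training subsets: the majority class is sampled
    with replacement k times, each time down to n_minority instances,
    while every minority instance is kept in every subset."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    subsets = []
    for _ in range(k):
        sampled_maj = rng.choice(maj_idx, size=min_idx.size, replace=True)
        idx = np.concatenate([min_idx, sampled_maj])
        subsets.append((X[idx], y[idx]))
    return subsets
```

Each of the k returned subsets contains all minority instances plus an equal-sized random draw of majority instances, matching the process in Figure 1.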
3.3. Minority Class Highlighting
To emphasize the distinctive characteristics of the minority class, we apply a minority class highlighting (MCH) process. MCH increases the feature values of correctly predicted minority class instances, thereby refining data representation and sharpening the decision boundary between minority and majority classes. This process helps reduce misclassification, whether a minority instance is incorrectly classified as a majority instance or vice versa, especially for samples located near the classification boundary.
The MCH procedure is illustrated in Figure 2. Let X1, X2, X3, …, Xn denote independent variables; y denotes the actual class, and ŷ represents the predicted class produced by the classifier. (a) The classifier is first trained on a balanced dataset generated through sampling. (b) Predictions are obtained from the trained classifier. (c) The predicted minority class labels are compared with the actual minority class labels. (d) For instances that are classified as minority class in both the actual and predicted results, a constant value of 1 is added to their feature values. Finally, the highlighted training data is obtained through MCH. For example, consider an instance whose actual label is minority (y = 1) and that is predicted as minority (ŷ = 1). If its original feature values are (X1 = 0.0944, X2 = −0.0320, and X3 = −0.0511), then MCH increases each feature by 1 and yields (X1 = 1.0944, X2 = 0.9680, and X3 = 0.9489). This illustrates how MCH selectively amplifies correctly classified minority instances to reinforce their representation in the training data.
MCH is expressed in Equation (1) as follows:

X′_ij = X_ij + b, if y_i = ŷ_i = 1 (minority class); X′_ij = X_ij, otherwise, for j = 1, 2, …, n. (1)

The feature vector of instance i is denoted as X_i = (X_i1, X_i2, …, X_in); y_i and ŷ_i represent its actual and predicted labels, respectively. The highlighting coefficient is set to b = 1. MCH weights the feature values only when both the actual and predicted labels indicate the minority class. This rule ensures that only the instances correctly detected by the classifier are emphasized, thereby fine-graining minority-class characteristics.
MCH achieves finer segmentation of minority-class features via distinct weighting schemes. In machine learning, input feature transformation can be conceptually represented as linear operations of the form aX + b, where a and b denote the scaling factor and bias, respectively. Highlighting feature values via multiplication (×a) modifies gradients, potentially distorting the data distribution or excessively amplifying certain dimensions. In contrast, MCH employs an additive (+b) approach, enabling minority-class instances to be subtly shifted without altering the overall distribution. Therefore, MCH enhances the representation of minority-class features while preserving the intrinsic structure of the data; it thus enables the classifier to more effectively learn their underlying structural patterns.
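The additive highlighting rule of Equation (1) can be sketched as follows (the function name and array interface are illustrative assumptions; the constant b = 1 and the minority label follow the text above):

```python
import numpy as np

def minority_class_highlight(X, y_true, y_pred, b=1.0, minority_label=1):
    """Add the highlighting coefficient b to every feature of instances
    whose actual AND predicted labels are both the minority class;
    all other instances are left unchanged."""
    X_out = np.asarray(X, dtype=float).copy()
    hit = (y_true == minority_label) & (y_pred == minority_label)
    X_out[hit] += b
    return X_out
```

Applied to the worked example, an instance with features (0.0944, −0.0320, −0.0511), y = 1, and ŷ = 1 becomes (1.0944, 0.9680, 0.9489), while misclassified or majority instances are untouched.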
3.4. MCH-Ensemble: Minority Class Highlighting Ensemble
We propose MCH-Ensemble (Minority Class Highlighting Ensemble), a framework that enhances prediction accuracy by fine-graining the characteristics of the minority class. The overall architecture of the proposed method is illustrated in Figure 3. MCH-Ensemble consists of two main stages: (1) training base classifiers on balanced datasets (Figure 4) and (2) training a random forest on the highlighted data (Figure 5). In the first stage, an under-sampling ensemble is applied to the imbalanced dataset, producing k balanced training subsets. The hyperparameter k is set to 5 in this study. Each balanced subset is used to train a base classifier. The base classifiers include decision tree, extreme gradient boosting (XGBoost), and LightGBM. After training, predictions are obtained from each of the k classifiers. In the second stage, the MCH process is performed using the outputs of the base classifiers. This step produces k datasets that contain fine-grained representations of the minority class. These highlighted datasets are then combined and used to train a random forest model, which serves as the final meta-classifier. The random forest learns from the enriched feature representations and produces the final prediction results.
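Assuming a decision tree as the base learner and scikit-learn estimators, the two stages can be sketched end to end (a simplified illustration under default hyperparameters, not the authors' exact implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def mch_ensemble_fit(X, y, k=5, b=1.0, seed=0):
    """Sketch of the two-stage MCH-Ensemble.
    Stage 1: train one base classifier (here, a decision tree) on each
    of k balanced under-sampled subsets.
    Stage 2: add b to the features of correctly predicted minority
    instances in each subset, merge the k highlighted subsets, and
    train a random forest meta-model on the merged data."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    parts_X, parts_y = [], []
    for _ in range(k):
        idx = np.concatenate(
            [min_idx, rng.choice(maj_idx, size=min_idx.size, replace=True)])
        Xb, yb = np.asarray(X, dtype=float)[idx], y[idx]
        base = DecisionTreeClassifier(random_state=0).fit(Xb, yb)
        y_hat = base.predict(Xb)
        Xb = Xb.copy()
        Xb[(yb == 1) & (y_hat == 1)] += b  # minority class highlighting
        parts_X.append(Xb)
        parts_y.append(yb)
    meta = RandomForestClassifier(n_estimators=100, random_state=0)
    meta.fit(np.vstack(parts_X), np.concatenate(parts_y))
    return meta
```

Swapping `DecisionTreeClassifier` for an XGBoost or LightGBM estimator yields the other base-learner configurations evaluated in this study.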
4. Experimental Setup
This section describes the experimental setup, which includes four components: data collection, data preprocessing, modeling, and evaluation metrics.
4.1. Data Collection
Three different DoS network intrusion datasets are used to evaluate the proposed model. All three datasets are imbalanced, meaning the proportions of normal and attack instances differ significantly. The UNSW-NB15 dataset [43] was obtained from OpenDrive, the CIC-IDS2017 dataset [44,45] from the Canadian Institute for Cybersecurity, and the WSN-DS dataset [46,47] from the Kaggle platform. The details of these datasets are summarized in Table 1.
In this study, “Normal” instances represent the majority class, whereas “Attack” instances correspond to the minority class. The UNSW-NB15 dataset contains 109,353 instances with 45 features, including 93,000 normal and 16,353 attack samples, resulting in an imbalance ratio of 5.69. The CIC-IDS2017 dataset comprises 692,703 instances with 79 features, including 440,031 normal and 252,672 attack samples, giving an imbalance ratio of 1.74. The WSN-DS dataset exhibits the highest imbalance ratio of 9.83, consisting of 340,066 normal and 34,595 attack samples.
4.2. Data Preprocessing
Data preprocessing is a critical step before model training to ensure reliability and consistency. The preprocessing steps are as follows:
Removal of null values and standardization of numeric independent variables;
Selection of significant independent variables using the variance inflation factor (VIF);
Splitting the dataset into training and testing subsets.
Rows containing null values were removed because missing data can degrade model prediction accuracy [48]. Each dataset contained only numerical variables, so no additional categorical encoding was required. When features with different units or scales are used in a single model, the variables with larger magnitudes may disproportionately influence the results. Therefore, numerical independent variables with varying distributions were standardized to have a mean of 0 and a standard deviation of 1 [49].
We next evaluated multicollinearity using the VIF. Multicollinearity refers to the presence of strong correlations among independent variables, which makes it difficult to estimate the individual effect of each predictor on the dependent variable and reduces model reliability [50]. The VIF is a widely used indicator for detecting multicollinearity, where a VIF value of 10 or higher indicates a high correlation among variables [51]. In the UNSW-NB15 dataset, the variable tcprtt exhibited a VIF value of 9.5, which, although below the threshold of 10, was excluded because it showed relatively high correlation with other independent variables [52]. After this analysis, the significant independent variables selected for each dataset were as follows: 29 for UNSW-NB15, 16 for CIC-IDS2017, and 13 for WSN-DS. For reproducibility, the preprocessed datasets are summarized in detail in Appendix B.
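For illustration, the VIF screening can be computed with plain least squares (a sketch; the `vif` helper is our own and assumes standardized numeric features, equivalent in spirit to statsmodels' `variance_inflation_factor`):

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: regress each feature on
    all remaining features (with intercept) and return 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, target, rcond=None)
        resid = target - Z @ beta
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / max(1.0 - r2, 1e-12)
    return out
```

Features with VIF at or above the threshold of 10 (or borderline cases such as tcprtt at 9.5) are candidates for removal.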
For the experiments, 80% of the CIC-IDS2017 and WSN-DS datasets was used for training, and the remaining 20% was used for testing; both sets maintained the original class imbalance distribution. For the UNSW-NB15 dataset, the training data consisted of 56,000 “Normal” and 12,264 “Attack” instances, and the testing data consisted of 37,000 “Normal” and 4089 “Attack” instances, preserving the original imbalance ratio. In all cases, the training data were balanced using the under-sampling ensemble, while the testing data were balanced through a single round of under-sampling. The testing datasets were used to validate the proposed method.
However, real-world network intrusions are inherently imbalanced. To verify that the proposed method performs well under realistic conditions, additional results obtained on the original imbalanced testing data are reported in Appendix C.
4.3. Modeling: Minority Class Highlighting Ensemble (MCH-Ensemble)
After data preprocessing, the proposed model was developed using the MCH-Ensemble framework. Figure 6 presents a summary of the model’s construction and design. Machine learning-based classifiers—decision tree, XGBoost, and LightGBM—were employed as base learners. A random forest was used as the meta-model to combine the outputs from the base classifiers.
All four classifiers were tuned using Optuna, which automatically derived optimal hyperparameters for the three datasets [53]. These hyperparameters are listed in Table 2, Table 3 and Table 4, respectively.
4.4. Evaluation Metrics
To comprehensively evaluate the performance of the proposed method, five standard classification metrics are used: accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC) [54,55,56].
Accuracy represents the proportion of correctly predicted instances among all predictions and is defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN), (2)

where TP (true positive) denotes the number of actual minority class instances correctly predicted as minority class; TN (true negative) represents the number of actual majority class instances correctly predicted as majority class; FP (false positive) refers to majority class instances incorrectly predicted as minority class; and FN (false negative) represents minority class instances incorrectly predicted as majority class.
Precision measures the proportion of correctly predicted positive instances among all predicted positive instances:

Precision = TP / (TP + FP). (3)

Recall measures the proportion of actual positive instances correctly identified by the model:

Recall = TP / (TP + FN). (4)

F1-score is the harmonic mean of precision and recall, providing a balanced measure of both metrics:

F1-score = 2 × (Precision × Recall) / (Precision + Recall). (5)

AUC-ROC measures the overall discriminative ability of a classifier by quantifying the area under the ROC curve. The ROC curve represents the relationship between the true positive rate (TPR) and false positive rate (FPR) across all possible decision thresholds. The TPR and FPR are defined in Equations (6) and (7), respectively:

TPR = TP / (TP + FN), (6)
FPR = FP / (FP + TN). (7)

The AUC-ROC is calculated as the area under the curve of TPR plotted against FPR:

AUC-ROC = ∫₀¹ TPR d(FPR). (8)
The ROC curve was plotted to visually assess the performance of MCH-Ensemble [57]. The x-axis and y-axis represent the FPR and TPR, respectively. A curve that is closer to the upper-left corner indicates superior classification performance.
A model demonstrates better performance when accuracy approaches 100%, and precision, recall, F1-score, and AUC-ROC approach 1.
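The threshold-based metrics above can be verified with a small confusion-matrix helper (an illustrative sketch treating the attack class as the positive label 1; scikit-learn's `metrics` module provides equivalent functions):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1-score from binary
    predictions, with the attack (minority) class as positive label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```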
5. Experimental Results
This section presents the experimental results used to evaluate the performance of the proposed MCH-Ensemble across three datasets: UNSW-NB15, CIC-IDS2017, and WSN-DS.
The proposed framework incorporates two key innovations: (1) MCH, which refines the representation of minority class features, and (2) an ensemble of base classifiers and a random forest meta-model, which improves predictive performance by learning from diverse, fine-grained bagging samples. To validate the effectiveness of these components, we conducted a series of comparative experiments. We employed three heterogeneous ensemble configurations: decision tree + random forest (DT + RF), XGBoost + random forest (XGB + RF), and LightGBM + random forest (LGBM + RF). Each ensemble was evaluated both with and without MCH to assess the contribution of the highlighting process.
5.1. UNSW-NB15
Table 5 presents the performance of all ensemble models on the UNSW-NB15 testing set across five evaluation metrics. Overall, ensembles incorporating MCH consistently outperformed those without it. Notably, all ensembles without MCH exhibited identical performance. This is because the diverse outputs of the base classifiers are not reflected in the training sets, so the random forest meta-model is trained on the simple aggregation of the k balanced training datasets generated by the under-sampling ensemble. This confirms that the MCH step is crucial for fine-graining the training data and enhancing model diversity in the random forest. Moreover, all ensembles that employed MCH achieved a precision of 1.0000, indicating that emphasizing minority class features enabled the model to more effectively learn the distinguishing characteristics of attack instances.
Among all the tested configurations, the XGB + RF ensemble achieved the best overall performance on the testing set: an accuracy of 98.79%, a precision of 1.0000, a recall of 0.9758, an F1-score of 0.9877, and an AUC-ROC of 0.9963. The recall value was slightly lower than those of the models without MCH but the other performance metrics were higher, confirming the superiority of the proposed method. These results indicate that the XGB + RF ensemble is the most suitable configuration for the UNSW-NB15 dataset.
Figure 7 compares the ROC curves of ensemble models on the UNSW-NB15 testing set. These results demonstrate that the models reliably and consistently classify the classes. The XGB + RF ensemble achieved the highest AUC scores among all the compared ensemble models.
5.2. CIC-IDS2017
Table 6 shows the performance of all ensemble models on the CIC-IDS2017 testing set across five evaluation metrics. Overall, all models achieved excellent performance, indicating that the dataset itself contains features that make it straightforward to distinguish between attack and normal instances. In particular, all models with MCH achieved a precision and an AUC-ROC of 1.0000, indicating that they effectively and accurately identified attack instances while minimizing false positives.
Among all configurations, the XGBoost + random forest (XGB + RF) ensemble achieved the best overall results on the testing set, with an accuracy of 99.99%, precision of 1.0000, recall of 0.9999, F1-score of 0.9999, and AUC-ROC of 1.0000. These findings demonstrate that the XGB + RF ensemble is the most suitable configuration for the CIC-IDS2017 dataset.
Figure 8 compares the ROC curves of ensemble models on the CIC-IDS2017 testing set. The ensemble model without MCH achieved an excellent performance with an AUC of 0.9996, whereas those with MCH classified more accurately and achieved a higher AUC of 1.0000.
5.3. WSN-DS
Table 7 summarizes the performance of all ensemble models on the WSN-DS testing set across the five evaluation metrics. All ensembles without MCH performed identically on the WSN-DS testing set, similar to the UNSW-NB15 and CIC-IDS2017 datasets, because the random forest meta-model was trained on the same aggregated dataset. In general, ensembles incorporating MCH achieved better results than those without it and achieved a precision of 1.0000, as on the UNSW-NB15 dataset, because emphasizing minority class features enabled the model to more effectively learn the distinguishing characteristics of attack instances.
Among all the tested configurations, the DT + RF ensemble achieved the best overall performance on the testing set, with an accuracy of 99.39%, a precision of 1.0000, a recall of 0.9879, an F1-score of 0.9939, and an AUC-ROC of 0.9971. Although its AUC-ROC value was slightly lower than that of LGBM + RF, the other performance metrics were higher; this indicated that the proposed method achieved superior performance across all key metrics. The DT + RF was the simplest and most intuitive among the evaluated configurations and exhibited high performance with minimal model complexity. Given these results, decision tree + random forest (DT + RF) can be considered the most suitable method for the WSN-DS dataset.
Figure 9 compares the ROC curves of ensemble models on the WSN-DS testing set. All models classified the classes reliably and consistently, wherein the LGBM + RF achieved a slightly higher AUC of 0.9972 compared with other models.
5.4. Stepwise Performance Comparison of MCH-Ensemble
The overall performances of the MCH-Ensemble on the UNSW-NB15, CIC-IDS2017, and WSN-DS datasets were validated via experiments. However, the components that contribute to the observed performance gains must also be analyzed. The MCH-Ensemble comprises two main stages: (1) training base classifiers on balanced datasets and (2) training a random forest on the highlighted data. The contribution of each stage was analyzed via these comparative experiments: training the base classifier without the under-sampling ensemble; training the base classifier using the under-sampling ensemble; and training the MCH-Ensemble.
Table 8, Table 9 and Table 10 present the performance of each stage across five evaluation metrics for each dataset. The MCH-Ensemble demonstrated overall superior performance across all three datasets compared with the conventional single model and the model using the under-sampling ensemble. The single model exhibited reduced predictive capability for minority classes, resulting in relatively low recall compared with the other metrics. Although the under-sampling ensemble improved recall, its performance on some metrics deteriorated compared with the single model because it averaged the predictions of multiple classifiers trained on various balanced subsets of data. Both models showed noticeable variations in metric values depending on the dataset, indicating their limited performance stability. The MCH-Ensemble exhibited minimal performance fluctuation across the three datasets, indicating that it effectively addressed class imbalance and maintained robust and stable performance regardless of dataset characteristics.
The computational cost of MCH-Ensemble was also measured. Table 11 shows the training time and memory consumption of the proposed model on each dataset. The computational cost varied based on the dataset size and the degree of class imbalance; therefore, efficiency should be carefully considered in practical applications.
5.5. Performance Comparison with Previous Studies
The performance of MCH-Ensemble in improving DoS detection was confirmed by comparing it with existing hybrid methods on the same five metrics across the three datasets (Table 12).
MCH-Ensemble outperformed the existing hybrid methods in most evaluation metrics, exhibiting consistent performance. This indicates that MCH-Ensemble enhances the detection capability of ensemble-based intrusion detection models and provides a more robust and generalizable approach for identifying DoS attacks in diverse network environments.
Figure 10 compares the ROC curves of MCH-Ensemble and existing classification methods across the three datasets. In terms of AUC, MCH-Ensemble performed equal to or slightly below the other methods, but it exhibited higher values on the remaining evaluation metrics.
6. Discussion and Conclusions
DoS attack detection in NIDS is an emerging research topic in cybersecurity. Although machine learning techniques have been widely adopted to build effective NIDS models, real-world network data often suffer from severe class imbalance. Machine learning techniques are typically designed assuming evenly distributed classes; therefore, class imbalance leads to biased models or failure to detect rare but critical DoS attacks. This raises an important research question: how can minority class detection be effectively improved while ensuring stable prediction performance under class imbalance? To address this question, we proposed MCH-Ensemble, a method designed to address class imbalance and enhance prediction accuracy for DoS detection by emphasizing minority-class features to improve minority-class recognition.
The proposed framework comprises two main stages. In the first stage, k balanced training datasets were constructed using the under-sampling ensemble technique, where k is a hyperparameter set to 5 in this study. Machine learning-based classifiers were trained on each balanced dataset, and their predictions were obtained. In the second stage, MCH was applied to the minority class predictions of the k classifiers. This process generated k highlighted training datasets with enhanced representation of minority class features. Finally, a random forest model was trained on all highlighted datasets to produce the final predictions.
Experiments were conducted using three DoS datasets—UNSW-NB15, CIC-IDS2017, and WSN-DS—with varying imbalance ratios. Model performance was evaluated using five metrics: accuracy, precision, recall, F1-score, and AUC-ROC, and ROC curves were plotted for visual comparison. The results demonstrated that the XGBoost + random forest (XGB + RF) ensemble achieved the best performance on the UNSW-NB15 and CIC-IDS2017 datasets, whereas the decision tree + random forest (DT + RF) ensemble performed optimally on the WSN-DS dataset. These findings indicate that different dataset characteristics can influence the optimal model configuration. MCH-Ensemble also showed improved performance compared with existing models. On the UNSW-NB15 and CIC-IDS2017 datasets, it improved accuracy, precision, recall, F1-score, and AUC-ROC by approximately 1.2% and 0.61%, 9.8% and 0.77%, 0.7% and 0.56%, 5.3% and 0.66%, and 0.1% and 0.06%, respectively; on the WSN-DS dataset, the corresponding improvements were approximately 0.17%, 1.66%, 0.11%, 0.88%, and 0.06%. Overall, the proposed method exhibited robust and stable performance across datasets with diverse imbalance ratios.
The significance of MCH-Ensemble lies in its ability to extend beyond conventional data preprocessing methods. By highlighting minority class features, the framework captures attack characteristics more precisely, thereby improving detection performance. Moreover, the proposed architecture is model-agnostic, allowing it to be readily integrated with various machine learning or deep learning-based classification models.
Nevertheless, this study has certain limitations. All experiments were conducted exclusively on DoS-type attacks, which limited the evaluation scope of MCH-Ensemble. Although this study focused on binary DoS detection, MCH-Ensemble was designed to refine and emphasize minority-class feature representations rather than being restricted to a specific attack type. Future research will therefore apply this model to broader network intrusion categories and validate its capability in multiclass scenarios. Furthermore, the validation experiments were conducted using publicly available NIDS datasets; future research should include validation on datasets from other domains to assess the generalizability of the proposed approach. We plan to extend our experiments to additional domains to further verify the robustness of MCH-Ensemble. The advantages and limitations of MCH-Ensemble are summarized in Appendix D.