Article

Intrusion Detection Method Based on Preprocessing of Highly Correlated and Imbalanced Data

by Serhii Semenov 1,*, Magdalena Krupska-Klimczak 1,*, Roman Czapla 1, Beata Krzaczek 1, Svitlana Gavrylenko 2, Vadim Poltorazkiy 2 and Zozulia Vladislav 2

1 Institute of Security and Computer Science, University of National Education Commission, ul. Podchorążych 2, 30-084 Krakow, Poland
2 Department of “Computer Engineering and Programming”, National Technical University «Kharkiv Polytechnic Institute», 61000 Kharkiv, Ukraine
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(8), 4243; https://doi.org/10.3390/app15084243
Submission received: 11 March 2025 / Revised: 1 April 2025 / Accepted: 7 April 2025 / Published: 11 April 2025
(This article belongs to the Special Issue Intelligent Systems and Information Security)

Abstract:
This paper examines traditional machine learning algorithms, neural networks, and the benefits of utilizing ensemble models. Data preprocessing methods for improving the quality of classification models are considered. To balance the classes, Undersampling, Oversampling, and their combination (Over + Undersampling) algorithms are explored. A procedure for reducing feature correlation is proposed. Classification models based on SVM, KNN, Naive Bayes, and Perceptron, as well as ensemble models based on meta-algorithms such as Bagging, Random Forest, AdaBoost, and Gradient Boosting, have been thoroughly investigated. The settings of the base classifiers and meta-algorithm parameters have been optimized. The best result was obtained by using an ensemble classifier based on the Random Forest algorithm. Thus, an intrusion detection method based on the preprocessing of highly correlated and imbalanced data has been proposed. The scientific novelty of the obtained results lies in the integrated use of the developed procedure for reducing feature correlation, the application of the SMOTEENN data balancing method, the selection of an appropriate classifier, and the fine-tuning of its parameters. The integration of these procedures and methods resulted in a higher F1 score, reduced training time, and faster recognition speed for the model. This allows us to recommend this method for practical use to improve the quality of network intrusion detection.

1. Introduction

1.1. Motivation

The rapid development of information technologies has made computer systems an integral part of modern life. Their efficiency and stability are critical for security, quality of service, and even public safety.
According to Check Point Software’s 2023 Cyber Security Report [1], the number of registered criminal offenses related to computer usage continues to increase each year. This suggests the presence of a problem in ensuring the stable and uninterrupted operation of computer systems and networks, highlighting that modern approaches to assessing their state are insufficiently effective.

1.2. State of the Art

The need to enhance intrusion detection methods in computer systems and networks is driven by their increasing complexity, which makes them more vulnerable to attackers. Additionally, attackers continuously develop more sophisticated and complex methods, leading to data breaches, financial losses, or even disruptions of critical infrastructure. In this context, intrusion detection systems (IDSs) play a crucial role in network protection and system attack identification. IDSs are designed to detect malicious activity not only within the network but also at the system and file system levels.
The task of intrusion detection can be conventionally divided into two main directions: misuse detection and anomaly detection. Misuse detection is the process of identifying and preventing unauthorized or malicious activities in computer systems and networks. It involves the use of attack signatures. Systems that perform misuse detection analyze user actions and network traffic to detect previously known patterns of information attacks.
Anomaly detection is the process of identifying deviations from the “normal” or expected behavior of a system. An anomaly detection system generates a baseline template (profile) of normal behavior using training data. This template can include various activity characteristics, such as the use of system resources, the execution of potentially harmful actions, unauthorized data transmission events to external servers, and more. Any deviation from the profile is considered an attack. The main advantage of such systems is the ability to detect new attacks.
Successfully identifying the state of computer systems and networks requires effective detection technologies, continuous monitoring, adaptation to new threats, and ongoing improvements in data analysis methods. All these aspects are urgent tasks and emphasize the need for the development, implementation, or improvement of modern intrusion identification systems.
A large number of processes occurring in computer systems, including those caused by malware, are analyzed using complex mathematical algorithms based on machine learning techniques [2,3,4].
K-nearest neighbors (k-NN) is a widely used classification technique due to its straightforward implementation and adaptability to various data formats. It classifies new data points by examining the classes of their closest neighbors, determined by a distance metric. Despite its simplicity, k-NN is computationally expensive with large datasets, and its accuracy is heavily dependent on the chosen ‘k’ value [5,6].
Support Vector Machines (SVMs) offer an alternative approach by constructing a hyperplane that optimally separates data classes. SVM’s ability to handle both linear and non-linear data, coupled with strong generalization capabilities, makes it powerful. However, similar to k-NN, SVM encounters computational challenges with large datasets and requires careful parameter selection, particularly for the kernel function [7].
The naive Bayes classifier uses probabilistic methods for classifying objects, assuming the independence of features. The main advantage of the naive Bayes classifier is its simplicity of implementation and its ability to handle high-dimensional data. The limitations include the assumption of feature independence and potential unsuitability for data with complex relationships [8].
Logistic regression is one of the most widely used classification methods. It can work with continuous and categorical variables, allowing for the assessment of their importance. However, it can be unstable to data outliers and insensitive to non-linear relationships between input variables [9].
Decision trees are also widely used in intrusion detection. The main advantage of decision trees is their ability to handle both categorical and numerical features, along with their ease of interpretability. A limitation of decision trees is their tendency to overfit the training data [10,11].
The multilayer perceptron (MLP) is effective for modeling complex non-linear relationships in classification but requires large datasets, is susceptible to overfitting, and has longer training times than other methods [12].
Given the availability of large datasets and complex features, deep learning methods such as convolutional neural networks (CNNs) [13], recurrent neural networks (RNNs) [14], long short-term memory (LSTM) [15], and generative adversarial networks (GANs) are increasingly used. However, neural networks and deep learning methods require a large amount of data for training, can be prone to overfitting, and are sensitive to data outliers.
Ensemble classifiers are also a powerful machine learning technique that combines multiple weak learners (models) to create a stronger overall learner. They can capture a broader range of patterns in the data, leading to more reliable and accurate predictions. They also reduce variance by averaging out the errors of base models, resulting in predictions with lower variance and higher accuracy. On the other hand, training ensemble methods can be computationally expensive compared to training a single model. This can be a factor if there are limited computational resources or fast training times are required. Ensemble models based on meta-algorithms such as bagging [16], boosting [17], and stacking [18] have become particularly popular today. The choice of ensemble classifier depends on the specific data and task requirements.
Additionally, efficient model construction includes data preprocessing. This stage is crucial for ensuring the data are in a suitable format and quality for the model to learn from effectively, especially when faced with imbalanced and highly correlated data.
Thus, despite significant progress in intrusion detection using machine learning and deep learning, there is a need for the continuous improvement of methods and technologies used for monitoring and identifying the state of computer systems and networks.

1.3. Objectives and Contribution

The purpose of this paper is to develop an intrusion detection method based on the preprocessing of highly correlated and imbalanced data.
The main objectives of this study include the following:
  • Addressing classification challenges related to imbalanced data, where a small number of intrusion detection examples can lead to critical false negatives.
  • Developing techniques for reducing the feature space when dealing with highly correlated data.
  • Evaluating the effectiveness of different classifiers as base models and optimizing their parameters.
  • Proposing an intrusion detection method based on the preprocessing of highly correlated and imbalanced data.
This paper is organized as follows: Section 2 presents the research methodology. Section 3 describes the model construction, which involves data collection and preprocessing, feature engineering, model selection, training, and evaluation. To study the effectiveness of these methods, software models were developed in the Google Colab Python environment. Section 4 and Section 5 discuss the main results of the investigation and present conclusions.

2. Materials and Methods

Model construction involves several key stages: data collection and preprocessing, feature engineering, model selection, training, and evaluation.
Data collection is the process of gathering information from various sources for the purpose of analysis or modeling.
Data preprocessing, a critical step in reliable data analysis, typically deals with low-quality data. It is considered a non-trivial task, often accounting for up to 80% of the total effort in intelligent data analysis [19]. Without proper preprocessing, further analysis may be impossible, as analytical algorithms rely on clean and well-structured data to produce accurate results.
One of the important stages of preprocessing is data balancing. Imbalanced data are quite common in the context of network intrusion detection, as the amount of benign traffic significantly exceeds the amount of malicious traffic. Imbalanced classes create challenges for predictive modeling and lead to biased models with poor predictive performance, especially for the minority class: the model may ignore the minority class entirely and assign all objects the label of the majority class [20]. One approach to solving this problem is to use various class balancing strategies. The following data balancing approaches are distinguished: Undersampling, Oversampling, and their combination (Over + Undersampling) [21]. The Undersampling technique reduces the number of examples in the majority class, while the Oversampling technique increases the number of examples in the minority class. One of the best techniques is the SMOTEENN algorithm, which generates artificial observations that are similar to, but not duplicates of, the minority-class observations and then removes noisy data.
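As a minimal sketch, class balancing with SMOTEENN can be performed with the imbalanced-learn package; the synthetic data below merely stand in for the real feature matrix and labels:

```python
from collections import Counter

from imblearn.combine import SMOTEENN
from sklearn.datasets import make_classification

# Synthetic imbalanced data standing in for the network-traffic features;
# in the real pipeline, X and y would come from the preprocessed dataset.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=6,
                           n_classes=3, weights=[0.8, 0.15, 0.05],
                           random_state=42)

# SMOTE oversampling of minority classes followed by ENN cleaning of noisy points.
sampler = SMOTEENN(random_state=42)
X_res, y_res = sampler.fit_resample(X, y)

print("Before:", Counter(y))
print("After: ", Counter(y_res))
```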
An important component of preprocessing is identifying features that are correlated with each other. The presence of correlated features negatively affects the quality of the model and makes it less effective and less interpretable. In addition, it significantly increases the training time of the model.
As previously mentioned, a combination of various methods is used to enhance the effectiveness of intrusion detection. For example, network traffic analysis can be used to detect known attacks, while user behavior analysis can identify new or sophisticated attacks.
Previous research has shown that ensemble classifiers are the most effective [22,23,24]. The relevance of using ensemble classifiers lies in their ability to improve the accuracy and robustness of models by combining several base classifiers, which helps to reduce the impact of random errors and increase the overall accuracy of predictions [25].
The following approaches are used to collect ensembles: boosting, bagging, and stacking.
Stacking is an ensemble method that integrates various classification or regression models using a meta-learner. Initially, base-level models are trained on the full training dataset. Subsequently, a meta-model is trained using the outputs of these base-level models as its input features. This technique commonly utilizes diverse learning algorithms for the base level, resulting in a heterogeneous ensemble [26].
Boosting is an ensemble modeling technique that attempts to build a strong classifier from several weak classifiers and combines them in a series. The base classifiers are trained sequentially. Firstly, a model is built from the training data. Then, the second model is built, which tries to correct the errors present in the first model. This procedure continues, and new classifiers are added to the meta-model until either the complete training dataset is predicted correctly, or the maximum number of base models is added [27].
Bagging or Bootstrap Aggregation is a machine learning ensemble meta-algorithm that combines the base models in parallel [28].
For regression, the meta-model averages the outputs of the base models; for classification, it uses majority voting.
Bagging is effective because the base algorithms, trained on different subsamples, turn out to be quite different, so their errors are mutually compensated during voting, and because outliers may not fall into some of the training subsamples.
In this work, meta-algorithms based on Bagging, Random Forest [23], AdaBoost [29], and Gradient Boosting [30], which are among the most effective for detecting intrusions in computer systems and networks, are investigated.
To compare the results of the ensemble classification models, the following algorithms were used: SVM, KNN, Naive Bayes, and Perceptron.
The problem statement of classification can be described as follows:
Let the input data be a set of labeled pairs $(X, Y) = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, where each object $x_i$ is described by a set of features $F$, and $y_i$ is its class label.
There exists an unknown fitness function, a mapping $f: F \to Y$, whose values are known only for this finite set of training samples.
The task itself consists of forming a meta-algorithm $f$ that is able to classify an arbitrary object $x \in X$ and of adjusting the values of its parameters $w$ so as to bring the predicted value $\hat{y}$ closer to the actual value $y$:

$$F(f(w, x), \hat{y}) \to y.$$
The model’s performance was evaluated using accuracy (1), precision (2), recall (3), and F1 score (4).
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (2)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (3)$$

$$F1\ \mathrm{score} = \frac{2}{\frac{1}{\mathrm{Precision}} + \frac{1}{\mathrm{Recall}}} = \frac{TP}{TP + 0.5\,(FP + FN)} \quad (4)$$
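For illustration, metrics (1)-(4) can be computed with scikit-learn; the macro averaging below is our assumption for how per-class values are combined into the mean evaluation metrics reported later:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels standing in for real classifier output (0 = Normal, 1 = Exploits, ...).
y_true = [0, 0, 0, 1, 1, 2, 2, 3]
y_pred = [0, 0, 1, 1, 1, 2, 0, 3]

print("Accuracy :", accuracy_score(y_true, y_pred))                    # Equation (1)
print("Precision:", precision_score(y_true, y_pred, average="macro"))  # Equation (2), class-averaged
print("Recall   :", recall_score(y_true, y_pred, average="macro"))     # Equation (3)
print("F1 score :", f1_score(y_true, y_pred, average="macro"))         # Equation (4)
```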
In addition, an important criterion for the quality of the model is its training time and recognition time.

3. Results

3.1. Data Collection and Preprocessing

The modified UNSW-NB15 dataset was used as the source data. It was created at the ACCS Cyber Range Laboratory to simulate real-world network traffic, including normal operations and synthetic cyber-attacks; it contains 45 features and covers three main attack types (Exploits, Fuzzers, and DoS) generated by the IXIA tool.
To evaluate the effectiveness of the methods, software models were developed in Python (version 3.12.4) in Google Colab, including the data preprocessing procedure and the configuration of base classifiers and meta-algorithms. The data preprocessing stage includes tasks such as data analysis and filling in missing values, the removal of non-informative features, normalization, and data balancing.

3.2. Feature Engineering

Data Balancing Procedure

Firstly, the original data are imbalanced and have the following class distribution, as shown in Table 1.
To balance the classes, the Undersampling, Oversampling, and their combination (Over + Undersampling) algorithms were used. The results of this study are presented in Figure 1. The Random Forest algorithm was used as the base classifier. The SMOTEENN algorithm achieved the best result, as it combines the SMOTE (Synthetic Minority Oversampling Technique) and Edited Nearest Neighbor (ENN). This combination generates instances of the minority class while removing noisy data [31]. Using the SMOTEENN algorithm for class balancing improved the model’s performance (F1 score) by up to 31% compared to imbalanced data.

3.3. Construction of the Feature-Reducing Procedure SP_PCA

The analysis of the data revealed a significant number of correlated features, as shown in Figure 2.
This problem can be described as follows: Let the initial data consist of objects, each of which is described by a set of $p$ parameters $x^{(1)}, x^{(2)}, \ldots, x^{(p)}$, so the initial information is $n$ objects of $p$-dimensional data:

$$X_1 = (x_1^{(1)}, x_1^{(2)}, \ldots, x_1^{(p)})^T, \quad X_2 = (x_2^{(1)}, x_2^{(2)}, \ldots, x_2^{(p)})^T, \quad \ldots, \quad X_n = (x_n^{(1)}, x_n^{(2)}, \ldots, x_n^{(p)})^T.$$

The goal of dimensionality reduction is to construct for each observation $X_i$ a new feature vector $Y_i = (y_i^{(1)}, y_i^{(2)}, \ldots, y_i^{(m)})$, where $m$ is much smaller than $p$; such information compression should ensure its minimal loss.
To solve this problem, the authors in [32] proposed a special procedure, SP_PCA, for reducing the correlation of the initial data, which is based on the following algorithm:
Step 1. Convert all categorical features into numerical values using the factorize() method. Then, fill in any missing data.
Step 2. Analyze the features, and remove any uninformative ones from the dataset.
Step 3. Build the correlation matrices.
Step 4. If there are features that correlate beyond the specified threshold (e.g., 90%), process them using principal component analysis (PCA). To accomplish this, we create data frames with the two features that have the highest correlation and apply the PCA method. Each pair of features is transformed into a new feature. After forming the new features, we remove the old ones and incorporate the new features into the main dataset.
Step 5. Build the model, and evaluate its performance. If the model’s performance has not significantly changed and remains above the specified threshold, return to step 4; otherwise, proceed to step 6.
Step 6. If the model’s accuracy has significantly decreased, analyze the features that were processed with the principal component analysis in step 4, and decide on their restoration. Finish the process.
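A minimal sketch of the core of step 4 is given below, assuming a numeric pandas DataFrame of features; the function name and loop structure are illustrative rather than the authors' reference implementation, and the model-quality check of steps 5 and 6 is omitted:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def merge_correlated_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Repeatedly replace the most correlated feature pair with one PCA component."""
    df = df.copy()
    while True:
        corr = df.corr().abs()
        # Ignore self-correlation on the diagonal.
        corr = corr.mask(np.eye(len(corr), dtype=bool), 0.0)
        f1 = corr.max().idxmax()          # feature belonging to the strongest pair
        f2 = corr[f1].idxmax()            # its most correlated partner
        if corr.loc[f1, f2] < threshold:
            break                         # no pair above the threshold remains
        # Project the pair onto its first principal component.
        merged = PCA(n_components=1).fit_transform(df[[f1, f2]])[:, 0]
        df = df.drop(columns=[f1, f2])
        df[f"pca_{f1}_{f2}"] = merged
    return df
```

Each pass merges one highly correlated pair into a single principal component, so the column count shrinks by one per iteration and the loop is guaranteed to terminate once no pair exceeds the threshold.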
The proposed SP_PCA procedure allowed us to reduce the feature space from 42 features (after 3 of the original 45 were removed manually as uninformative) to 31.
In our research, the correlation threshold was set to 90%. Our goal was to reduce data redundancy, so we used a high threshold; it was selected experimentally, with candidate values ranging from 87% to 96%.
Next, we will evaluate the effectiveness of these algorithms.

3.4. Model Selection, Training, and Evaluation

3.4.1. Evaluation of the Effectiveness of Using the SMOTEENN Algorithm

First, we evaluated the effectiveness of using the SMOTEENN algorithm. To compare its impact, classification models based on Bagging, Random Forest, AdaBoost, Gradient Boosting, Perceptron, Naive Bayes, SVM, and KNN were created. While each of these classification algorithms has unique strengths, effective tuning is crucial for optimal performance on real-world datasets.
The following parameters were used:
  • SVM: Kernel selection—linear, regularization parameter (C)—1.0, gamma parameter for RBF kernel—‘scale’;
  • KNN: Number of neighbors (k)—5, distance metric—‘minkowski’;
  • Naive Bayes: Class priors—none, smoothing coefficient (alpha)—1 × 10⁻⁹;
  • Perceptron: Learning rate—‘constant’, activation function—linear, maximum number of epochs—1000.
Decision trees were used as the base classifiers for ensemble models, and their parameters were also tuned for optimization. The parameter settings included the following:
  • Criterion for measuring the quality of a split—‘gini’;
  • Minimum number of samples required to split an internal node—2;
  • Minimum number of samples required to be at a leaf node—1;
  • Maximum depth of the tree—none (unlimited);
  • No maximum feature parameter was specified.
The meta-algorithm was optimized by selecting the number of base classifiers and the procedure for forming training samples. The following parameters were used:
  • Bagging—80 base classifiers;
  • Random Forest—100 base classifiers;
  • Adaboost—50 base classifiers;
  • Gradient Boosting—100 base classifiers.
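For illustration, these settings correspond roughly to the following scikit-learn configuration. This is a sketch: the paper does not give exact constructor calls, and reading the Naive Bayes smoothing coefficient as GaussianNB's var_smoothing and the Perceptron settings as an MLPClassifier are our assumptions:

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Base decision tree with the listed split settings.
base_tree = DecisionTreeClassifier(criterion="gini", min_samples_split=2,
                                   min_samples_leaf=1, max_depth=None,
                                   max_features=None)

models = {
    "SVM": SVC(kernel="linear", C=1.0, gamma="scale"),
    "KNN": KNeighborsClassifier(n_neighbors=5, metric="minkowski"),
    # "alpha = 1e-9" is read here as GaussianNB's var_smoothing parameter.
    "Naive Bayes": GaussianNB(priors=None, var_smoothing=1e-9),
    # The listed learning-rate/activation/epoch settings match MLPClassifier.
    "Perceptron": MLPClassifier(activation="identity", learning_rate="constant",
                                max_iter=1000),
    # "estimator" is the scikit-learn >= 1.2 argument name.
    "Bagging": BaggingClassifier(estimator=base_tree, n_estimators=80),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    # Default base estimators are used here for brevity.
    "AdaBoost": AdaBoostClassifier(n_estimators=50),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100),
}
```

Each model can then be fitted on the training split and scored on the held-out data.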
A total of 80% of the input data were used for training, while 20% were reserved for testing. The research results are presented in Table 2.
As can be seen, applying data balancing, tuning the parameters of the base classifiers, and optimizing the meta-algorithm consistently improved classification quality compared to the imbalanced data. The best results were obtained using the Random Forest meta-algorithm. The Bagging meta-algorithm based on decision trees also demonstrated strong evaluation metrics.
As a result, applying the SMOTEENN algorithm in combination with the Random Forest meta-algorithm improved model quality by 31% in the task of detecting intrusions in computer systems and networks.
Next, we evaluate the effectiveness of the SP_PCA feature space reduction procedure.

3.4.2. Evaluation of the Effectiveness of the SP_PCA Feature Space Reduction Procedure

To compare the effectiveness of the SP_PCA reduction procedure, the feature space was also reduced using the VarianceThreshold and Random Forest methods.
VarianceThreshold removes features with low variance by specifying a deviation threshold, excluding columns in which a certain percentage of labels match. For example, if we want to remove features where more than 80% of the labels are similar, we can set the VarianceThreshold to 0.16 (calculated as 0.8 × (1 − 0.8)). This threshold ensures that features with low discriminatory power will be rejected.
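A short sketch of this selection step (the toy matrix is illustrative):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy binary feature matrix; in the first column, 80% of the values match.
X = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 1],
              [1, 1, 0]])

# For a Bernoulli feature, Var = p(1 - p); 0.8 * (1 - 0.8) = 0.16, so features
# in which at least 80% of values agree fall at or below this threshold.
selector = VarianceThreshold(threshold=0.16)
X_reduced = selector.fit_transform(X)
print(selector.get_support())  # [False  True  True]: the first column is dropped
print(X_reduced.shape)         # (5, 2)
```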
Random Forest considers the average contribution of each feature across all decision trees in the ensemble. The higher the importance, the more the feature affects predictions. Thus, Random Forests provide a simple way to select features based on importance scores.
After fitting the model to the training data, we can access the feature importance parameter and use it in the further construction of classification models.
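A brief sketch of importance-based selection follows; SelectFromModel with a mean-importance threshold is one convenient way to apply the scores, not necessarily the authors' exact choice, and the synthetic data stand in for the training set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the training data.
X_train, y_train = make_classification(n_samples=1000, n_features=20,
                                       n_informative=5, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(forest.feature_importances_)  # mean impurity-based contribution per feature

# Keep only the features whose importance exceeds the mean importance.
selector = SelectFromModel(forest, threshold="mean", prefit=True)
X_selected = selector.transform(X_train)
print(X_selected.shape)
```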
As shown in Figure 3, using SP_PCA to reduce correlation resulted in a 26% reduction in the number of features, VarianceThreshold reduced them by 14%, and Random Forest reduced them by 19%, compared to the original (raw) data.
Further studies concern the assessment of model quality after the feature space reduction stage. To compare the effectiveness of SP_PCA, VarianceThreshold, and Random Forest, classification models based on Decision Tree, KNN, Logistic Regression, SVM, Gradient Boosting, and Random Forest were created.
We evaluate the following key performance metrics: Average F1 score, Training Time, and Recognition Time. The results of the method assessment are shown in Table 3.
As can be seen from Table 3, reducing the feature space allowed for a reduction in the training time of almost all models or an improvement in their prediction time. The F1 scores of all the models increased as well. The best result was obtained using the ensemble classifier based on the Random Forest model.
As a result, the integrated use of the SMOTEENN data balancing method and the SP_PCA procedure improved the F1 score of the Random Forest-based model from 95% to 97% (an additional 2% after the data-balancing step). The training time decreased from 16.51 s to 15.58 s. Testing time was reduced by 29%.
Since the SP_PCA and VarianceThreshold algorithms are more effective for this dataset, we combined these two algorithms in further research to reduce the feature space.

3.4.3. Evaluation of the Effectiveness of the Integrated Use of the SP_PCA and VarianceThreshold Reduction Procedure

To reduce the feature space, two algorithms, SP_PCA and VarianceThreshold, were combined sequentially. The result, shown in Figure 4, demonstrates that the combination of SP_PCA and VarianceThreshold algorithms led to a reduction in the feature space from 42 (for raw data) to 28. To evaluate the combined use of the SP_PCA and VarianceThreshold methods, Decision Tree, KNN, Logistic Regression, SVM, Gradient Boosting, and Random Forest were used as base classifiers.
The proposed procedure reduced the model’s training time and improved recognition time. As shown in Figure 5, the training time for the Random Forest model was reduced by 16%, and the recognition time was improved by 32% (Figure 6). At the same time, the F1 score of the models hardly changed (Figure 7) compared to the results obtained by applying the feature reduction method SP_PCA alone (the F1 score of the Random Forest model remained unchanged at 97%).
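A compact sketch of this combined flow is shown below: VarianceThreshold chained with a Random Forest classifier in a Pipeline. The SP_PCA correlation-merging stage sketched earlier would run before this pipeline, and the synthetic data merely stand in for the balanced, reduced dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the balanced, SP_PCA-reduced feature matrix.
X, y = make_classification(n_samples=5000, n_features=31, n_informative=10,
                           n_classes=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

pipeline = Pipeline([
    # Second reduction stage; the 0.16 threshold is meaningful for near-binary features.
    ("variance", VarianceThreshold(threshold=0.16)),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipeline.fit(X_train, y_train)
print("Held-out accuracy:", pipeline.score(X_test, y_test))
```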

4. Discussion

A number of limitations in the use of existing computer system identification models were identified. Data preprocessing is a crucial step for building successful models and can improve data quality and model performance. It also helps identify and resolve underlying patterns in the data. Essentially, data preprocessing prepares the data to be a suitable training ground for our machine learning model. By providing clean, consistent, and well-formatted data, we give our model the best foundation for accurate training.
Each dataset requires unique data preprocessing steps. Although most preprocessing workflows use common methods, the specific application of these methods will depend on the characteristics of the individual dataset. This research led to the proposal of a procedure for balancing data and reducing correlation, which improved the quality of computer system identification, reduced output data, and shortened training time. The experiments evaluated the effectiveness of network intrusion detection, its practical significance, and the prospects for further research. We also tested the results obtained from pre-processing data on a more modern model based on Vision Transformer for Small-size Datasets (ViTSD) [33].
In our future work, we plan to extend our research using deep learning methods (LSTM, CNN, autoencoders). We will explore how these techniques could be integrated into modern IDS architectures and discuss their potential benefits and challenges.

5. Conclusions

Thus, this paper considers the task of improving the quality of network intrusion detection.
The modified UNSW-NB15 dataset was used as the source data, containing information about the normal functioning of a computer network and intrusion scenarios. The dataset contains information about flows between hosts, together with packet-level inspection data, and covers three attack types: Exploits, Fuzzers, and DoS.
To balance the classes, the Undersampling, Oversampling, and combined (Over + Undersampling) algorithms were investigated. The use of the SMOTEENN data balancing method improved the F1 score of the models by up to 31%.
To reduce the correlation of the initial data, the combination of the special procedure SP_PCA and the VarianceThreshold method is proposed.
To estimate the integrated use of the SMOTEENN, SP_PCA, and VarianceThreshold methods, Decision Tree, KNN, Logistic Regression, SVM, Gradient Boosting, and Random Forest methods were used as base classifiers. The developed methods were implemented using Python and the Google Colab cloud service with Jupyter Notebook (version 7.1). The parameters of the base classifiers and meta-algorithms were tuned. The best result was obtained using the ensemble classifier based on the Random Forest algorithm. The proposed procedure reduced the training time of the Random Forest-based model by 16% and improved the recognition time by 32%. Additionally, it improved the model's F1 score to 97%.
Thus, a method based on the preprocessing of highly correlated and imbalanced data is proposed. The scientific novelty of the obtained results lies in the integrated use of the developed procedure for reducing feature correlation, the SMOTEENN data balancing method, the selection of classifier type, and the tuning of its parameters.

Author Contributions

Conceptualization, S.S.; methodology, S.S.; software, R.C., B.K., S.G., V.P. and Z.V.; validation, R.C., B.K., S.G., V.P. and Z.V.; formal analysis, R.C., B.K., S.G., V.P. and Z.V.; investigation, S.S.; writing—original draft preparation, S.S. and M.K.-K.; writing—review and editing, S.S. and M.K.-K.; visualization, M.K.-K.; supervision, S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Check Point Software’s 2023 Cyber Security Report. Available online: https://gp.gov.ua/ua/posts/pro-zareyestrovani-kriminalni-pravoporushennya-ta-rezultati-yih-dosudovogo-rozsliduvannya-2 (accessed on 20 June 2024).
  2. Moskalenko, V.; Kharchenko, V.; Semenov, S. Model and Method for Providing Resilience to Resource-Constrained AI-System. Sensors 2024, 24, 5951. [Google Scholar] [CrossRef] [PubMed]
  3. Amarudin, R.; Ferdiana, A.; Widyawan. A Systematic Literature Review of Intrusion Detection System for Network Security: Research Trends, Datasets and Methods. In Proceedings of the 4th International Conference on Informatics and Computational Sciences (ICICoS), Semarang, Indonesia, 10–11 November 2020; pp. 1–6. [Google Scholar] [CrossRef]
  4. Tai, J.; Alsmadi, I.; Zhang, Y.; Qiao, F. Machine Learning Methods for Anomaly Detection in Industrial Control Systems. In Proceedings of the IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 2333–2339. [Google Scholar] [CrossRef]
  5. Bicego, M.; Rossetto, A.; Olivieri, M.; Londoño-Bonilla, J.; Orozco-Alzate, M. Advanced KNN Approaches for Explainable Seismic-Volcanic Signal Classification. Math. Geosci. 2022, 55, 59–80. [Google Scholar] [CrossRef]
  6. Malhotra, S.; Bali, V.; Paliwal, K.K. Genetic Programming and K-Nearest Neighbour Classifier Based Intrusion Detection Model. In Proceedings of the 7th International Conference on Cloud Computing, Data Science & Engineering—Confluence, Noida, India, 12–13 January 2017; pp. 42–46. [Google Scholar] [CrossRef]
  7. Khreich, W.; Khosravifar, B.; Hamou-Lhadj, A.; Talhi, C. An Anomaly Detection System Based on Variable N-Gram Features and One-Class SVM. Inf. Softw. Technol. 2017, 91, 186–197. [Google Scholar] [CrossRef]
  8. Salau, A.O.; Assegie, T.A.; Akindadelo, A.T.; Eneh, J.N. Evaluation of Bernoulli Naive Bayes Model for Detection of Distributed Denial of Service Attacks. Bull. Electr. Eng. Inform. 2023, 12, 1203–1208. [Google Scholar] [CrossRef]
  9. Kamarudin, M.H.; Maple, C.; Watson, T.; Sofian, H. Packet Header Intrusion Detection with Binary Logistic Regression Approach in Detecting R2L and U2R Attacks. In Proceedings of the Fourth International Conference on Cyber Security, Cyber Warfare, and Digital Forensic (CyberSec), Jakarta, Indonesia, 29–31 October 2015; pp. 101–106. [Google Scholar] [CrossRef]
  10. Meng, L.; Bai, B.; Zhang, W.; Liu, L.; Zhang, C. Research on a Decision Tree Classification Algorithm Based on Granular Matrices. Electronics 2023, 12, 4470. [Google Scholar] [CrossRef]
  11. Zhu, B.; Wang, J.; Zhang, X. Fuzzy Decision Tree Based on Fuzzy Rough Sets and Z-Number Rules. Axioms 2024, 13, 836. [Google Scholar] [CrossRef]
  12. Paul, S.; Kundu, R.K. A Bagging MLP-based Autoencoder for Detection of False Data Injection Attack in Smart Grid. In Proceedings of the IEEE Power & Energy Society Innovative Smart Grid Technologies Conference (ISGT), Singapore, 1–5 November 2022; pp. 1–5. [Google Scholar] [CrossRef]
  13. Wang, W.; Zhu, M.; Zeng, X.; Ye, X.; Sheng, Y. Malware Traffic Classification Using Convolutional Neural Network for Representation Learning. In Proceedings of the 2017 International Conference on Information Networking (ICOIN), Da Nang, Vietnam, 11–13 January 2017. [Google Scholar] [CrossRef]
  14. Ashfaq, M. HCRNNIDS: Hybrid Convolutional Recurrent Neural Network-Based Network Intrusion Detection System. Processes 2021, 9, 834. [Google Scholar] [CrossRef]
  15. Laghrissi, F.; Douzi, S.; Douzi, K. Intrusion Detection Systems Using Long Short-Term Memory (LSTM). J. Big Data 2021, 8, 65. [Google Scholar] [CrossRef]
  16. Zhang, Z.; Kong, S.; Xiao, T.; Yang, A. A Network Intrusion Detection Method Based on Bagging Ensemble. Symmetry 2024, 16, 850. [Google Scholar] [CrossRef]
  17. Bakhshipour, A. Cascading Feature Filtering and Boosting Algorithm for Plant Type Classification Based on Image Features. IEEE Access 2021, 9, 82021–82030. [Google Scholar] [CrossRef]
  18. Semenov, S.; Mozhaiev, O.; Kuchuk, N.; Mozhaiev, M.; Tiulieniev, S.; Gnusov, Y.; Yevstrat, D.; Chyrva, Y.; Kuchuk, H. Devising a Procedure for Defining the General Criteria of Abnormal Behavior of a Computer System Based on the Improved Criterion of Uniformity of Input Data Samples. East.-Eur. J. Enterp. Technol. 2022, 6, 40–49. [Google Scholar] [CrossRef]
  19. Cui, Z.G.; Cao, Y.; Wu, L.; Liu, H.N.; Qiu, Z.F.; Chen, C.W. Research on Preprocessing Technology of Building Energy Consumption Monitoring Data Based on Machine Learning Algorithm. Build. Sci. 2018, 34, 94–99. [Google Scholar] [CrossRef]
  20. Krawczyk, B. Learning from Imbalanced Data: Open Challenges and Future Directions. Prog. Artif. Intell. 2016, 5, 221–232. [Google Scholar] [CrossRef]
  21. Abdi, L.; Sattar, H. To Combat Multi-Class Imbalanced Problems by Means of Over-Sampling Techniques. IEEE Trans. Knowl. Data Eng. 2016, 28, 238–251. [Google Scholar] [CrossRef]
  22. Madhavi, M.; Nethravathi, N.P. Intrusion Detection in Networks Using Gradient Boosting. In Proceedings of the 2023 International Conference on Advances in Electronics, Communication, Computing and Intelligent Information Systems (ICAECIS), Bangalore, India, 19–21 April 2023; pp. 139–145. [Google Scholar] [CrossRef]
  23. Zhang, J.; Zulkernine, M.; Haque, A. Random-Forests-Based Network Intrusion Detection Systems. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2008, 38, 649–659. [Google Scholar] [CrossRef]
  24. Varma, C.; Babu, G.; Sree, P.; Sai, N.R. Usage of Classifier Ensemble for Security Enrichment in IDS. In Proceedings of the 2022 International Conference on Automation, Computing and Renewable Systems (ICACRS), Pudukkottai, India, 13–15 December 2022; pp. 420–425. [Google Scholar] [CrossRef]
  25. El Houda, Z.A.; Brik, B.; Khoukhi, L. Ensemble Learning for Intrusion Detection in SDN-Based Zero Touch Smart Grid Systems. In Proceedings of the 2022 IEEE 47th Conference on Local Computer Networks (LCN), Edmonton, AB, Canada, 26–29 September 2022; pp. 149–156. [Google Scholar] [CrossRef]
  26. Necati, D.; Dalkiliç, G. Modified Stacking Ensemble Approach to Detect Network Intrusion. Turk. J. Electr. Eng. Comput. Sci. 2018, 26, 35. [Google Scholar] [CrossRef]
  27. Zwane, S.; Tarwireyi, P.; Adigun, M. Ensemble Learning Approach for Flow-Based Intrusion Detection System. In Proceedings of the 2019 IEEE AFRICON, Accra, Ghana, 25–27 September 2019; pp. 1–8. [Google Scholar] [CrossRef]
  28. Gavrylenko, S.; Hornostal, O. Application of Heterogeneous Ensembles in Problems of Computer System State Identification. Adv. Inf. Syst. 2023, 7, 5–12. [Google Scholar] [CrossRef]
  29. Semenov, S.G.; Liqiang, Z.; Weiling, C.; Davydov, V. Development of a Mathematical Model for the Software Security Testing First Stage. East.-Eur. J. Enterp. Technol. 2021, 3, 24–34. [Google Scholar] [CrossRef]
  30. Mounika, K.; Rao, P.V. IDCSNet: Intrusion Detection and Classification System Using Unified Gradient-Boosted Decision Tree Classifier. In Proceedings of the 2022 International Conference on Automation, Computing and Renewable Systems (ICACRS), Pudukkottai, India, 13–15 December 2022; pp. 1159–1164. [Google Scholar] [CrossRef]
  31. Zhang, Y.; Deng, L.; Wei, B. Imbalanced Data Classification Based on Improved Random-SMOTE and Feature Standard Deviation. Mathematics 2024, 12, 1709. [Google Scholar] [CrossRef]
  32. Gavrylenko, S.; Poltoratskyi, V. Metod pidvyshchennia operatyvnosti klasyfikatsii danykh za rakhunok zmenshennia koreliatsii oznak, Systemy upravlinnia, navihatsii ta zviazku, Poltava: Natsionalnyi universytet. Poltav. Politekh. Im. Yuriia Kondratiuka 2023, 4, 71–75. [Google Scholar] [CrossRef]
  33. Gavrylenko, S.; Poltoratskyi, V.; Nechyporenko, A. Intrusion detection model based on improved transformer. Adv. Inf. Syst. 2024, 8, 94–99. [Google Scholar] [CrossRef]
Figure 1. Model performance evaluation after balancing data using the different resampling algorithms.
Figure 2. Correlation matrix.
Figure 3. Results of reducing the dataset.
Figure 4. Results of reducing the number of features.
Figure 5. Results of reducing training time.
Figure 6. Results of improving recognition time.
Figure 7. The assessment of the quality of the model after the feature space reduction stage. The obtained results are validated using GradientBoostingClassifier and the kddcup.data_10_percent_corrected and DoHBrw-2020 datasets.
Table 1. Information regarding the balance of data classes.

Class      Number of Records
Normal     37,000
Exploits   11,132
Fuzzers    6062
DoS        4089
Table 2. Results of applying the proposed method of balancing (mean evaluation metrics).

Type of Data   Classifier          Precision   Recall     F1 Score   Accuracy
Imbalanced     Bagging             0.632711    0.606747   0.619223   0.859308
               Random Forest       0.652189    0.605505   0.631143   0.864856
               AdaBoost            0.469637    0.571573   0.479471   0.552188
               Gradient Boosting   0.677363    0.622911   0.638217   0.862682
               Perceptron          0.448027    0.48171    0.375484   0.690192
               Naive Bayes         0.240122    0.508604   0.295897   0.450958
               SVM                 0.593755    0.34801    0.413315   0.766314
               KNN                 0.57165     0.483343   0.518767   0.805204
Balanced       Bagging             0.932608    0.934367   0.933333   0.954493
               Random Forest       0.946865    0.942628   0.943628   0.961703
               AdaBoost            0.609027    0.739865   0.623658   0.535241
               Gradient Boosting   0.864355    0.866288   0.863651   0.910774
               Perceptron          0.755742    0.735643   0.738974   0.802861
               Naive Bayes         0.577641    0.684165   0.611498   0.699735
               SVM                 0.757607    0.792579   0.772734   0.834756
               KNN                 0.886423    0.901735   0.893365   0.920925
Table 3. Results of applying feature-reducing methods (the last three columns correspond to the feature-reducing methods).

Model of Classifier        Performance Metrics    Balanced Data Without Feature Reduction   SP_PCA    VarianceThreshold   Random Forest
Decision Tree Classifier   Average F1 score       0.91                                      0.95      0.95                0.95
                           Train Time, s          1.579                                     1.412     1.919               1.481
                           Recognition Time, s    0.015                                     0.01      0.009               0.012
KNN                        Average F1 score       0.89                                      0.94      0.94                0.93
                           Train Time, s          0.024                                     0.021     0.02                0.019
                           Recognition Time, s    20.628                                    19.961    19.12               19.299
Logistic Regression        Average F1 score       0.83                                      0.83      0.84                0.83
                           Train Time, s          18.735                                    15.357    16.187              13.823
                           Recognition Time, s    0.013                                     0.01      0.011               0.017
SVM                        Average F1 score       0.77                                      0.84      0.85                0.85
                           Train Time, s          123.201                                   93.34     81.508              108.484
                           Recognition Time, s    32.33                                     32.708    23.8                25.256
Gradient Boosting          Average F1 score       0.86                                      0.92      0.92                0.92
                           Train Time, s          239.169                                   209.948   266.502             228.647
                           Recognition Time, s    0.455                                     0.306     0.33                0.37
Random Forest              Average F1 score       0.95                                      0.97      0.97                0.97
                           Train Time, s          16.507                                    15.575    20.359              16
                           Recognition Time, s    0.958                                     0.689     0.696               0.825
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
