Research on Multi-Stage Detection of APT Attacks: Feature Selection Based on LDR-RFECV and Hyperparameter Optimization via LWHO

Zeng, Lihong; Li, Honghui; Fu, Xueliang; Han, Daoqi; Zhou, Shuncheng; He, Xin

doi:10.3390/bdcc9080206

by Lihong Zeng, Honghui Li *, Xueliang Fu, Daoqi Han, Shuncheng Zhou and Xin He

College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China

* Author to whom correspondence should be addressed.

Big Data Cogn. Comput. 2025, 9(8), 206; https://doi.org/10.3390/bdcc9080206

Submission received: 19 May 2025 / Revised: 5 July 2025 / Accepted: 5 August 2025 / Published: 12 August 2025

Abstract

In the highly interconnected digital ecosystem, cyberspace has become the main battlefield for complex attacks such as Advanced Persistent Threats (APTs). The complexity and concealment of APT attacks are increasing, posing unprecedented challenges to network security. Current APT detection methods largely depend on general-purpose datasets, which makes it difficult to capture the stages and complexity of APT attacks. Moreover, existing detection methods often suffer from suboptimal accuracy, high false alarm rates, and a lack of real-time capabilities. In this paper, we introduce LDR-RFECV, a novel feature selection (FS) algorithm that uses LightGBM, Decision Trees (DTs), and Random Forest (RF) as integrated feature evaluators instead of the single evaluator used in recursive feature elimination algorithms. This approach helps select the optimal feature subset, thereby significantly enhancing detection efficiency. In addition, we propose a novel optimization algorithm called LWHO, which integrates the Levy flight mechanism with the Wild Horse Optimizer (WHO) to optimize the hyperparameters of the LightGBM model, ultimately enhancing performance in APT attack detection. More importantly, this optimization strategy significantly boosts the detection rate during the lateral movement phase of APT attacks, a pivotal stage in which attackers infiltrate key resources. Timely identification of this stage is essential for disrupting the attack chain and achieving precise defense. Experimental results demonstrate that the proposed method achieves 97.31% and 98.32% accuracy on two typical APT attack datasets, DAPT2020 and Unraveled, which is 2.86% and 4.02% higher than current research methods, respectively.

Keywords:

APT attack detection; feature selection; machine learning; optimization algorithm

1. Introduction

In recent years, APT attacks, with their high concealment, long latency, and continuously evolving attack methods, have become one of the most destructive threats in the field of cybersecurity. From data theft of critical infrastructure to infiltration of national intelligence systems, APT attacks have caused incalculable economic losses and social impacts [1]. According to the 2024 MDR report by Kaspersky, APT attacks accounted for 43% of high-risk security incidents, affecting 25% of enterprise organizations, with a year-on-year increase of 74% [2]. Traditional rule-based intrusion detection systems (IDSs) have obvious shortcomings when facing APT attacks; in particular, when attackers use 0-day vulnerabilities and attack variants, the false negative rate remains high, making it difficult to cope with the complexity and concealment of modern APT attacks [3].

To break through the limitations of traditional defense, researchers have turned to intelligent detection technologies. Ghafir et al. [4] confirmed that machine learning is effective for APT attack detection, but it still faces two major challenges. First, feature redundancy in high-dimensional data tends to cause model overfitting. Second, model parameters have a decisive impact on detection performance, while manual parameter tuning is both inefficient and unlikely to reach the optimal solution in complex attack scenarios.

Among feature selection algorithms, RFECV is a recursive feature elimination algorithm that uses a chosen estimator as its evaluator. It iterates by removing one feature at a time, so that the feature subset retained after each iteration is more representative than the previous one. However, the features selected by a single evaluator are often not robust. In terms of parameter optimization, traditional approaches have clear drawbacks: grid search consumes substantial resources on high-dimensional data, and random search lacks directionality, resulting in low convergence efficiency. The Wild Horse Optimizer (WHO) algorithm is a new type of swarm intelligence optimization method whose non-linear convergence mechanism performs well in high-dimensional parameter spaces, but it can still become trapped in local optima.

This paper proposes an intelligent detection model for APT attack detection. The model uses an improved RFECV algorithm for feature selection and then uses an improved Wild Horse Optimizer (WHO) algorithm to optimize the model's hyperparameters, aiming to improve the detection of APT attacks. Three feature selection algorithms are integrated into RFECV as a combined feature importance evaluator to select the most representative and most robust features, which reduces the feature dimension and thus the detection time. Subsequently, Levy flight is introduced into the WHO algorithm to mitigate its tendency to fall into local optima, increasing the likelihood of finding better solutions. Finally, the best parameters obtained are used to train the model, thereby improving the accuracy of APT attack detection.

The main contributions of this paper are summarized as follows:

A novel integrated feature selection method LDR-RFECV is proposed. It employs three feature selection algorithms as the evaluators of feature importance in the RFECV algorithm, aiming to filter out the optimal feature subset, reduce the feature dimension, and thereby decrease the detection time.
An improved WHO algorithm is proposed. For the first time, the Levy flight mechanism is combined with the WHO algorithm to optimize the parameters of LightGBM and enhance the detection effect of APT attacks.
Experiments were conducted on the DAPT2020 and Unraveled datasets to verify that the features selected by LDR-RFECV and the optimized model improve both detection time and accuracy across the APT attack stages.

The organization of this paper is as follows. Section 1 presents the research background. Section 2 reviews recent advances related to this study. Section 3 provides a detailed explanation of the proposed method. Section 4 presents the experimental setup and analyzes the results. Finally, Section 5 concludes the work and outlines potential directions for future research.

2. Related Work

In the field of Advanced Persistent Threat (APT) detection, traditional methods based on signature matching and anomaly detection have become increasingly inadequate due to the high stealthiness and multi-stage characteristics of APTs. Signature-based detection relies heavily on predefined attack signature databases, making it difficult to handle unknown attacks and prone to evasion through forged signatures. Anomaly-based detection establishes a baseline of normal behavior and raises an alert when behavior deviates from that baseline. This method can detect unknown attacks, but it produces a high false alarm rate. In response to the increasing complexity and concealment of APT attacks, researchers have proposed a variety of methods to improve the ability to identify APT attacks from different perspectives. According to the existing literature, APT detection techniques can be divided into methods based on deep learning and machine learning, methods based on graph structure modeling, and methods focusing on feature selection and model parameter optimization. The following subsections review and summarize representative studies to clarify the progress and shortcomings of current research.

2.1. Based on Deep Learning and the Machine Learning Model

To address the complexity and non-linear characteristics of APT attack behaviors, machine learning and deep learning models have gradually become mainstream approaches for APT detection. By modeling traffic features or behavioral sequences, these methods significantly enhance the ability to identify attack patterns.

Do et al. [5] proposed a hybrid deep learning model that combines MLP, LSTM, and CNN architectures. The method detects APT attacks in two stages, first extracting flow-based IP features and then classifying APT attack IPs, and achieves 93% to 98% accuracy. However, it may not be able to deal with IP spoofing attacks. El et al. [6] proposed a novel neural network-based framework that integrates Generative Adversarial Networks (GANs) and deep learning (DL) to detect APT threats in autonomous network systems. Using the TON IoT dataset, synthetic data samples were generated by the GAN and combined with the original data to train various DL models. The MLP + AE model achieved an accuracy of 90.17% and an F1-score of 88.06%. This approach emphasizes detection in the context of IoT security. Eke et al. [7] developed a novel system to detect and predict APT attacks. The method is based on an integrated DL multi-stage detection model and reports a detection capability of 86.36 across three different datasets, but the datasets used are not representative of APT. Panahnejad et al. [8] used the network kill chain model to analyze, identify, and prevent network attacks, and proposed an intelligent detection method called “APT-Dt-KC”, which uses the Pearson correlation coefficient to reduce the amount of data to be processed. The combination of a Bayesian classification algorithm and the Analytic Hierarchy Process (AHP) achieves 98% accuracy on the KDDCup’99 dataset, but this dataset is dated, so generalization to newer datasets may be weak. Joloudari et al. [9] employed the C5.0 decision tree, Bayesian networks, and deep learning models for early-stage APT attack detection on the NSL-KDD dataset. Using 10-fold cross-validation, the approach achieved a best accuracy of 98.85%. However, it only distinguishes between normal and abnormal traffic. He et al. [10] proposed a comprehensive detection method targeting the lateral movement phase of APT attacks. By combining active deception and passive scanning techniques, the method uses a neural network model to detect APT behaviors within internal networks, particularly in scenarios involving lateral movement through the SMB protocol. Nevertheless, this approach focuses solely on a single phase of APT attack detection. Dau et al. [11] employed various machine learning and deep learning models on the SCVIC-APT-2021 dataset, with LightGBM achieving a macro F1-score of 96.67%. Zha et al. [12] proposed SKT-IDS, an intrusion detection system based on Sigmoid Kernel Transformation and an encoder–decoder architecture, which identifies unknown attacks using a similarity threshold. On the NSL-KDD and CICIDS2018 datasets, it achieved recall rates of 65% and 69% for unknown attacks at a 1% false positive rate, but did not address the class imbalance issue.

2.2. Based on Graph Structure Modeling

To better characterize the multi-stage, context-dependent, and causal-chain features of APT attacks, recent studies have introduced graph-based modeling approaches. By constructing attack behavior graphs, knowledge graphs, or causal networks, these methods enable precise depiction of APT attack paths and stage identification.

Chen et al. [13] proposed APT-KGL, an APT detection system based on heterogeneous provenance graph learning. It uses meta-paths to extract semantic interactions between system entities and samples small subgraphs from the provenance graph to reconstruct and detect APT attack scenarios. However, it can only detect individual attack points and struggles to capture the complete APT attack chain. Weng et al. [14] introduced RT-APT, a real-time detection system that leverages the WL subtree kernel and FlexSketch algorithm to generate provenance graph feature sequences. It applies a K-means-based detection method to identify abnormal system states and outperforms existing approaches on multiple datasets, though it heavily relies on complete log data. Overall, graph-based methods are effective in capturing behavioral dependencies of APT attacks but often require complex data structures and significant computational resources.

2.3. Focus on Feature Selection and Model Parameter Optimization

Methods focusing on feature selection and model parameter optimization have demonstrated increasing importance in APT attack detection. On the one hand, feature selection effectively eliminates redundant and noisy features, reduces model complexity, shortens training and inference time, and enhances the real-time performance and deployability of detection systems. On the other hand, optimization techniques such as swarm intelligence algorithms enable fine-tuning of critical hyperparameters, thereby improving the model’s adaptability to complex APT patterns and enhancing its generalization ability—especially in identifying key stages of attacks. Kicska et al. [15] compared the performance of five swarm intelligence algorithms—Particle Swarm Optimization (PSO), Artificial Bee Colony (ABC), Invasive Weed Optimization (IWO), Bat Algorithm (BA), and Gray Wolf Optimizer (GWO)—in feature selection for machine learning. They found that these algorithms effectively reduced data dimensionality and improved the accuracy of the decision tree classifier, with GWO demonstrating the best performance.

Hallaji et al. [16] analyzed the importance of network traffic features in APT detection and found that feature selection significantly improved model performance and generalization. Al et al. [17] introduced a Genetic Programming-based Feature Evolution approach (FEGP) that generates new features by mathematically combining existing ones through various operators. By integrating both evolved and original features, the method achieved an average accuracy of 79.28% using a Decision Tree on the DAPT2020 dataset. When using only original features, it attained 83.14% accuracy on the Unraveled dataset. However, the generalizability of the constructed features to classifiers beyond tree-based models remains insufficiently evaluated. Kumari et al. [18] introduced HHOSSA-Hybrid, a novel APT detection method combining Harris Hawks Optimization (HHO) and Sparrow Search Algorithm (SSA), to enhance detection performance. This method conducts feature selection and hyperparameter tuning, and applies oversampling to address class imbalance, achieving 94.468% accuracy on the DAPT2020 dataset. Abdulsattar et al. [19] employed a hybrid Shark and Bear Smell Search Optimization Algorithm (HSBSOA) for feature extraction in detecting botnet attacks within Flying Ad Hoc Networks (FANETs). Mei et al. [20] used Particle Swarm Optimization to tune a multiclass Support Vector Machine (PSO-MSVM), capable of identifying APT campaigns behind complex attacks. Compared with six typical methods, it achieved 93.96% accuracy. However, it performs better on small datasets and suffers from high time complexity on larger ones. Bakhiet et al. [21] proposed an APT detection approach combining Cat Swarm Optimization (CSO) with a Convolutional Neural Network (CNN), where model parameter optimization led to significantly improved detection accuracy.

In summary, research on APT attack detection exhibits diversity in methodological approaches, data sources, and detection objectives. Most studies focus on applying machine learning and deep learning frameworks to detect APT attacks and have achieved promising results. Meanwhile, graph-based modeling techniques have been gradually introduced to better capture the complex dependencies and multi-stage characteristics of APTs. In addition, feature selection has been proven to be a key step in enhancing detection efficiency and reducing model complexity. A comparison of the main studies discussed above is presented in Table 1.

3. Proposed Multi-Stage Detection Method for APT Attacks

This section introduces the proposed multi-stage detection model for APT attacks, and the overall framework of the detection model is shown in Figure 1.

In this framework, two datasets from different sources are first inputted and preprocessed, including missing value handling, data normalization, and format unification. Subsequently, the labels are encoded to adapt to the training process of subsequent machine learning models. To address the common issue of data imbalance in APT detection, oversampling techniques are introduced to balance the training data, ensuring that samples from all classes participate equally in model training.

Subsequently, the proposed LDR-RFECV feature selection method is employed to perform recursive feature elimination, extracting the most discriminative features for APT attack detection. To further improve model effectiveness, the proposed LWHO algorithm is applied to adjust the hyperparameters of the LightGBM model. The optimized model is trained and utilized to perform multi-stage classification of APT attacks, with accuracy, precision, recall, and F1-score used as evaluation indicators.
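The four evaluation indicators named above can be illustrated for a multi-class setting with a short sketch. It assumes macro averaging (an unweighted mean over classes), which is an assumption made here for illustration and may differ from the averaging used in the experiments; the function name `macro_metrics` and the toy labels are hypothetical.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1).

    Macro averaging is an assumption of this sketch, not a statement
    about how the paper averages its per-class scores.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example with three integer-encoded stages (hypothetical labels).
acc, prec, rec, f1 = macro_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```

Macro averaging weights every stage equally, which is one way to keep rare stages such as data exfiltration from being hidden by abundant benign traffic.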

This method not only closes the loop over the whole process from data acquisition to detection output but also offers good real-time detection ability, making it suitable for rapid response and accurate identification of APT attacks in real network environments.

3.1. Dataset Introduction

To validate the effectiveness of the proposed multi-stage APT detection model, this study selects two representative APT attack detection datasets: DAPT2020 and Unraveled. Both datasets comprehensively cover the full lifecycle of typical APT attacks and provide clear delineations of attack stages along with corresponding behavioral labels, thereby offering strong support for the modeling and experimental evaluation of multi-stage APT detection tasks.

3.1.1. DAPT2020 Dataset

The DAPT2020 dataset is a network traffic dataset constructed specifically for APT detection [22] and covers four distinct stages of APT attacks. Data collection spanned five consecutive days and simulated the entire attack lifecycle from initial intrusion to data exfiltration. On Monday, all systems in the network operated under normal conditions, providing baseline traffic data. From Tuesday to Friday, each day simulated one stage of an APT attack. On Tuesday, the attackers conducted reconnaissance, using Burp Suite, Web Scarab, and other tools for network and application scanning to explore the network structure of the targets. On Wednesday, the attackers established a foothold through techniques including CSRF attacks, SQL injection, and trojan deployment, achieving remote communication between the victim and the command-and-control (C&C) server. On Thursday, the attackers moved deeper into the intranet through a series of complex actions: SSH lateral connections, access to the FTP service through a weak-authentication vulnerability, exploitation of CVE-2012-2122 to compromise the MySQL service, exploitation of an SMB vulnerability (CVE-2017-7494), privilege escalation, account forgery, and backdoor maintenance. Finally, on Friday, the attackers exfiltrated data by injecting commands to export the /etc/passwd file and used tools such as PyExfil to transmit the stolen sensitive information to a cloud server controlled by the APT actors.

3.1.2. Unraveled Dataset

The Unraveled dataset is a network security dataset based on attack and defense simulation scenarios [23]. It covers different types of attackers, including the behavioral characteristics of APT organizations, skilled hackers, and amateur attackers during multi-stage attacks. Each type of attacker has different attack motives and technical capabilities. In the simulation, the attacks by both the amateur attackers and the skilled hackers are blocked by the measures deployed by the defenders. The APT group, however, successfully breaches the defenses, making this dataset a representative example for studying APT attacks.

The data collection spanned six weeks; the first week recorded only normal user activity, while the remaining period included a combination of normal and malicious traffic. The amateur attackers first conducted port scanning and service identification on the production network and mistakenly entered the SSH honeypot system deployed by the defender. They performed brute-force attacks, successfully logged into the system, and attempted directory traversal, but were blocked by the honeypot’s policy, which cut off their IP access and stopped the attack. The skilled hackers gathered employee information and sent a phishing email containing a malicious payload to the finance department. Although the victim downloaded the attachment, the execution attempt was blocked by the security system deployed by the defenders. The APT group simulated a typical and sophisticated multi-stage attack process. Beginning with reconnaissance, the attackers conducted low-frequency network scanning to avoid detection; after being blocked, they changed their IP address and scanning strategy to continue gathering information about the target network. After collecting this information, the APT attackers launched a watering hole attack, which eventually infected the victim’s terminal. They then proceeded with privilege escalation and lateral movement across the network. Ultimately, they obtained database administrator privileges, exfiltrated sensitive database content, and transmitted the stolen data to the APT group’s command and control (C&C) server. Finally, the attackers deleted traces of their activities to cover their tracks.

3.1.3. Features of the Dataset

The DAPT2020 dataset contains a total of 76 features, along with seven additional features that represent communication flows between the two systems. In addition, it includes two fields—Activity and Stage—used to indicate the type of attack and the corresponding APT attack phase. The Unraveled dataset consists of 85 features, along with four additional fields: Activity, Stage, Defender Response, and Signature.

3.2. Data Preprocess

This section outlines the preprocessing procedures applied to the DAPT2020 and Unraveled datasets. The steps include the removal of irrelevant or biased features, elimination of duplicate entries and invalid values (such as nulls and infinities), integer encoding of label fields, handling class imbalance, and normalization of the feature values.
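As a minimal sketch of two of the steps listed above, the snippet below drops rows containing nulls or non-finite values, removes exact duplicates, and rescales each column. Min-max scaling to [0, 1] is an assumption of this sketch, since the text only says the features are normalized; the helper names `clean_rows` and `min_max_normalize` and the toy rows are hypothetical.

```python
import math


def clean_rows(rows):
    """Drop rows with None/NaN/inf values and exact duplicate rows.

    The first occurrence of each duplicated row is kept.
    """
    seen, cleaned = set(), []
    for row in rows:
        if any(v is None or not math.isfinite(v) for v in row):
            continue  # invalid value: null, NaN, or +/- infinity
        key = tuple(row)
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        cleaned.append(list(row))
    return cleaned


def min_max_normalize(rows):
    """Scale each column to [0, 1]; constant columns map to 0.0.

    Min-max scaling is an assumed choice of normalization.
    """
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


raw = [[1.0, 10.0], [1.0, 10.0], [3.0, float("inf")],
       [None, 5.0], [5.0, 10.0], [3.0, 10.0]]
features = min_max_normalize(clean_rows(raw))
```

In practice the scaling bounds would be computed on the training split only and reused for the test split, so that no test information leaks into training.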

3.2.1. Delete Irrelevant Columns

For the DAPT2020 dataset, the Stage field is used as the label because the task is to identify the stage to which an APT attack belongs, and the Activity field is removed. Additionally, seven features related to communication flows between systems (Flow ID, Src IP, Src Port, Dst IP, Dst Port, Protocol, and Timestamp) were removed. These fields are highly dependent on the underlying network topology and tend to vary significantly across different environments. IP addresses and ports serve as identifiers for network communication and may be useful in specific scenarios for distinguishing between attackers and victims. However, such features limit generalization, as IP spoofing and frequent changes in IP and port numbers are common tactics employed by attackers. Therefore, their inclusion would likely impair the model’s robustness across diverse settings. After preprocessing, 76 features and one label field were retained for experimentation.

For the Unraveled dataset, the features identifying the communication flow between systems are likewise removed, and the Stage field is retained as the label. This paper further removes eight fields related to network applications and device fingerprints. These fields often directly reflect the specific implementation of systems and services in a given network environment, but they generalize poorly across network environments. Attackers may use completely different agents, request headers, or service names in different environments, and these differences do not necessarily reflect the nature of the attack but only the differences between deployment environments. Sixty-three features and one label field were eventually retained for the experiment.

3.2.2. Data Balance

Table 2 presents the number of samples for normal traffic and each attack stage before and after data balancing. Both datasets exhibit significant disparities between benign traffic and the various attack stages, as well as imbalances among the stages themselves. Data imbalance leads to insufficient learning on minority classes and thus degrades overall performance. Proper data balancing can improve model performance, but over-balancing may introduce over-fitting or information loss, so the degree of balancing must be chosen carefully [24]. To address this, a moderate balancing strategy was adopted that takes into account the specific data distribution of each dataset. In the Unraveled dataset, the volume of benign traffic is drastically higher than that of the attack traffic, so random undersampling was applied to reduce the number of benign samples. The Data Exfiltration stage in the DAPT2020 dataset and the Cover up stage in the Unraveled dataset are so scarce that classifiers could overlook these critical phases. To mitigate this, the Synthetic Minority Over-sampling Technique (SMOTE) was employed to generate additional samples for these stages [25]. For the other stages, which have more samples, moderate oversampling is carried out. As Table 2 shows, the resulting class counts alleviate the extreme imbalance of the original datasets.
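The two resampling ideas described above can be sketched in a few lines. This is a simplified illustration rather than the implementation used in the experiments: `random_undersample` keeps a random subset of the majority class, and `smote_like` follows the core SMOTE idea of interpolating between a minority sample and one of its nearest minority-class neighbours. The function names, the neighbour count `k`, and the toy data are assumptions.

```python
import random


def random_undersample(samples, target_size, rng):
    """Keep a random subset of the majority class (without replacement)."""
    if target_size >= len(samples):
        return list(samples)
    return rng.sample(samples, target_size)


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic minority points by linear interpolation.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


rng = random.Random(0)
benign_kept = random_undersample(list(range(100)), 10, rng)
exfil = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
exfil_extra = smote_like(exfil, n_new=5, k=2, rng=rng)
```

Resampling should be applied to the training split only, so that evaluation still reflects the original class distribution.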

3.2.3. Label Encoding

This study performed integer encoding on the labels representing different APT attack stages in the dataset. Specifically, each APT attack stage was mapped to a unique integer value, converting the original categorical labels into a numerical format suitable for machine learning models. The encoding scheme used in this mapping is shown in Table 3.
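A minimal sketch of such an integer encoding is shown below. The stage names and their integer order are placeholders chosen for illustration and are not taken from Table 3; `build_label_encoder` is a hypothetical helper.

```python
def build_label_encoder(stage_names):
    """Map each distinct stage name to an integer, in order of first appearance."""
    mapping = {}
    for name in stage_names:
        if name not in mapping:
            mapping[name] = len(mapping)
    return mapping


# Placeholder stage names and ordering; the actual scheme is given in Table 3.
stages = ["Benign", "Reconnaissance", "Establish Foothold",
          "Lateral Movement", "Data Exfiltration"]
encoder = build_label_encoder(stages)
decoder = {code: name for name, code in encoder.items()}

labels = ["Benign", "Lateral Movement", "Benign", "Data Exfiltration"]
encoded = [encoder[name] for name in labels]
```

Keeping the inverse mapping (`decoder`) makes it straightforward to report per-stage results with the original stage names.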

3.3. The Proposed LDR-RFECV Feature Selection Algorithm

This section introduces the Recursive Feature Elimination with Cross-Validation (RFECV) algorithm and the LDR-RFECV algorithm proposed in this paper.

3.3.1. Recursive Feature Elimination with Cross-Validation

Recursive Feature Elimination with Cross-Validation (RFECV) is a widely adopted technique for feature subset selection [26]. Its core idea involves using a base estimator to recursively train the model and evaluate feature importance, ranking the features accordingly, and progressively eliminating the least important ones. This process retains only those features that contribute most significantly to the model’s predictive performance. By incorporating cross-validation, RFECV automatically determines the optimal number of features to retain. The iterative process terminates either when a predefined minimum number of features is reached or when the optimal evaluation score is achieved. Ultimately, the feature subset corresponding to the best cross-validation score is selected as the final set.
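The elimination loop described above can be sketched independently of any particular library. In the sketch below, `rank_fn` stands in for the base estimator's feature importances and `cv_score_fn` for the cross-validated score of a model trained on a given subset; both callables, the function name `rfecv`, and the toy numbers are assumptions made for illustration, and library implementations such as scikit-learn's `RFECV` differ in detail.

```python
def rfecv(features, rank_fn, cv_score_fn, min_features=1):
    """Recursive feature elimination with a cross-validated stopping choice.

    rank_fn(subset)     -> dict mapping feature -> importance (higher is better)
    cv_score_fn(subset) -> mean cross-validation score on that subset
    Returns (best_subset, best_score, history). On ties, the earlier
    (larger) subset is kept.
    """
    current = list(features)
    history = []
    best_subset, best_score = list(current), float("-inf")
    while True:
        score = cv_score_fn(current)
        history.append((list(current), score))
        if score > best_score:
            best_subset, best_score = list(current), score
        if len(current) <= min_features:
            break
        importance = rank_fn(current)
        weakest = min(current, key=lambda f: importance[f])
        current.remove(weakest)  # drop the least important feature
    return best_subset, best_score, history


# Toy stand-ins for an estimator and a cross-validation run (hypothetical values).
fixed_importance = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
score_by_size = {4: 0.90, 3: 0.93, 2: 0.91, 1: 0.80}
subset, score, history = rfecv(
    ["a", "b", "c", "d"],
    rank_fn=lambda feats: fixed_importance,
    cv_score_fn=lambda feats: score_by_size[len(feats)],
)
```

With these toy values, the loop visits subsets of size 4, 3, 2, and 1 and keeps the size-3 subset, which has the highest score.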

3.3.2. LDR-RFECV

In traditional RFECV, the evaluation of feature importance typically relies on a single model, such as logistic regression or random forest, which may introduce bias in the assessment of features. To enhance the accuracy and robustness of feature selection, this study proposes an improved method, LDR-RFECV, which integrates multiple structurally complementary tree-based models to evaluate feature importance. This ensemble approach enables a more stable and reliable recursive elimination strategy. The method combines three models to calculate the feature importance: LightGBM, Decision Tree (DT), and Random Forest (RF). LightGBM has the ability to deal with large-scale and high-dimensional data, and shows superior classification accuracy and dimensionality reduction effect on multiple datasets [27]. The structure of DT is intuitive and interpretable, which can clearly identify the key features that have the greatest impact on the classification results. By integrating multiple decision trees and combining feature importance and its confidence interval, RF has strong stability and anti-overfitting ability, and is suitable for high-dimensional feature screening [28].

Specifically, this paper first calculates the feature importance scores using LightGBM, DT, and RF separately and then normalizes the scores of each model so that the importance of all features is on the same scale. The normalized scores are then summed with equal weights to obtain a more robust and comprehensive ranking of feature importance in each iteration, after which recursive feature elimination is performed. This approach makes better use of the complementarity of different models when evaluating feature importance and improves the robustness of feature selection. Finally, the curve of accuracy against the number of eliminated features is plotted, and the optimal feature subset size is determined by analyzing this curve. This differs from traditional RFECV, which selects the feature subset with the highest cross-validation score. When the accuracy difference between a smaller and a larger feature subset is not significant, choosing fewer features simplifies the model, improves its generalization ability, and greatly reduces training and prediction time, thereby improving efficiency.
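One way to make the "fewer features when the accuracy difference is not significant" rule concrete is to pick the smallest subset whose accuracy lies within a tolerance of the best observed accuracy. The text describes the choice as made by analyzing the curve, so the explicit `tolerance` parameter, the function name, and the curve values below are illustrative assumptions only.

```python
def smallest_subset_within_tolerance(curve, tolerance):
    """Pick the smallest feature count whose accuracy is near the best.

    curve: list of (n_features, accuracy) pairs
    tolerance: largest accuracy drop from the best value that is still
               treated as "not significant" (an assumed threshold)
    """
    best = max(acc for _, acc in curve)
    eligible = [n for n, acc in curve if best - acc <= tolerance]
    return min(eligible)


# Hypothetical accuracy curve, not experimental data.
curve = [(10, 0.970), (8, 0.972), (6, 0.969), (4, 0.968), (2, 0.900)]
chosen = smallest_subset_within_tolerance(curve, tolerance=0.005)
```

Setting `tolerance=0` recovers the conventional RFECV choice of the single best-scoring subset size.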

LightGBM is an efficient tree model based on the gradient boosting framework [29]. Its feature importance is usually based on Information Gain or Split Count. In this paper, information gain is used as the measurement index, and the feature importance is calculated through Equation (1):

$$\mathrm{Importance}_{\mathrm{LGBM}}(j) = \sum_{s \in S_{j}} \Delta H_{s} \tag{1}$$

Here, $S_{j}$ represents the set of split nodes at which feature j is used, $\Delta H_{s}$ is the information gain brought by split s, and the overall importance of feature j is the total gain of the nodes that use it for splitting.

DT is a fundamental tree model that constructs the entire tree by selecting the optimal features at each node for partitioning to minimize a certain cost function (such as the Gini index). In this paper, the Gini index is used as the measurement index. When feature j is used as the splitting condition of node s, $\Delta G_{s}$ is its corresponding Gini gain. The overall importance of feature j can be calculated by Equation (2), which is the sum of its Gini gains on all splitting nodes using this feature:

$$\mathrm{Importance}_{\mathrm{DT}}(j) = \sum_{s \in S_{j}} \Delta G_{s} \tag{2}$$

Random Forest (RF) is a type of ensemble model that integrates several decision trees. Feature importance in RF is calculated by averaging the importance scores derived from each tree. Specifically, let the model consist of T trees; the importance of feature j can be computed using Equation (3), where $\mathrm{Importance}_{t}(j)$ denotes the total information gain contributed by feature j in the t-th tree:

$$\mathrm{Importance}_{\mathrm{RF}}(j) = \frac{1}{T} \sum_{t = 1}^{T} \mathrm{Importance}_{t}(j) \tag{3}$$

Figure 2 illustrates the process for computing the combined feature importance, which includes the following steps:

1. Load the dataset and initialize the full set of features as the starting feature subset.
2. Compute the importance score of each feature using Equations (1)–(3), corresponding to LightGBM, Decision Tree (DT), and Random Forest (RF), respectively.
3. Normalize the feature importance scores obtained from each model.
4. Use Equation (4) to perform a weighted summation of the feature importance scores obtained by each model, with the weight coefficients α, β, and γ set to equal values.
5. Rank the features according to their combined importance scores. Train the model using the current subset and calculate its cross-validation score.
6. Eliminate the least important feature from the current subset and repeat Steps 2–5 until only one feature remains.
7. Plot the curve showing model accuracy against the number of features removed to determine the optimal feature subset.

$$\mathrm{Importance}_{combined}(j) = \alpha\,\mathrm{Importance}_{LGBM}(j) + \beta\,\mathrm{Importance}_{DT}(j) + \gamma\,\mathrm{Importance}_{RF}(j) \tag{4}$$
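A single elimination round built on Equation (4) can be sketched as follows. The text does not state which normalization is used, so scaling each model's scores to sum to 1 is an assumption, as are the equal weights of 1/3, the helper names, and the importance values.

```python
def normalize(scores):
    """Scale non-negative importance scores so they sum to 1 (assumed scheme)."""
    total = sum(scores)
    if total == 0:
        return [0.0 for _ in scores]
    return [s / total for s in scores]


def combined_importance(imp_lgbm, imp_dt, imp_rf, alpha=1/3, beta=1/3, gamma=1/3):
    """Equation (4): weighted sum of the three normalized importance vectors."""
    lgbm, dt, rf = normalize(imp_lgbm), normalize(imp_dt), normalize(imp_rf)
    return [alpha * a + beta * b + gamma * c for a, b, c in zip(lgbm, dt, rf)]


def least_important(feature_names, combined):
    """Return the feature that one elimination round would drop."""
    return min(zip(combined, feature_names))[1]


# Made-up raw importances for three features (different scales per model).
names = ["f0", "f1", "f2"]
combined = combined_importance(
    imp_lgbm=[120.0, 30.0, 50.0],  # e.g. total split gain
    imp_dt=[0.5, 0.1, 0.4],        # e.g. Gini-based importance
    imp_rf=[0.4, 0.2, 0.4],
)
weakest = least_important(names, combined)
print(combined, weakest)
```

In a full run, the three importance vectors would be recomputed on the reduced feature set after every elimination, as in Steps 2 to 6.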

3.4. The Proposed LWHO Parameter Optimization Algorithm

This section introduces the Wild Horse Optimizer (WHO) algorithm, the Levy flight mechanism, and the LWHO algorithm proposed in this paper.

3.4.1. Wild Horse Optimizer

The Wild Horse Optimizer (WHO) is a swarm intelligence optimization algorithm proposed by Iraj Naruei et al. in 2021. It is inspired by the social behavior of wild horse populations and is characterized by strong evolutionary ability, fast search speed, and strong optimization ability [30]. The algorithm simulates the group behavior of horses, the grazing behavior of foals, the mating behavior of horses, the leadership of stallions, and the exchange and selection of leaders.

This algorithm initializes a random population $X = \{\vec{X}_1, \vec{X}_2, \dots, \vec{X}_N\}$. Assuming $N$ is the total population size and $PS$ is the percentage of stallions, the number of stallions $G$ is the product of the stallion proportion and the total population size, that is, $G = N \times PS$. Foals always graze around their stallion. This grazing behavior is simulated by Equation (5), in which the foals move at different radii with the stallion as the center:

$$\bar{X}_{i,G}^{j} = 2Z\cos(2\pi R Z) \times (\mathrm{Stallion}^{i} - X_{i,G}^{j}) + \mathrm{Stallion}^{i} \tag{5}$$

Specifically, let $i$ denote the $i$-th stallion and $j$ the $j$-th foal associated with that stallion. $X_{i,G}^{j}$ denotes the current position of the $j$-th foal, $\bar{X}_{i,G}^{j}$ denotes its new position, and $\mathrm{Stallion}^{i}$ represents the current position of the $i$-th stallion. The parameter $R \in [-2, 2]$, while $Z$ serves as an adaptation mechanism.
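The grazing update of Equation (5) can be sketched in a few lines. This sketch treats R and Z as scalars supplied by the caller, which is an assumption; how Z is computed is not described in this section, and the function name `graze` is hypothetical.

```python
import math


def graze(foal, stallion, r, z):
    """Equation (5): move a foal around its stallion.

    foal and stallion are position vectors (lists of floats); r is expected
    to lie in [-2, 2]; z is the adaptation factor, passed in as a given.
    """
    factor = 2.0 * z * math.cos(2.0 * math.pi * r * z)
    return [factor * (s - x) + s for x, s in zip(foal, stallion)]


foal = [0.0, 0.0]
stallion = [1.0, 2.0]

on_stallion = graze(foal, stallion, r=1.0, z=0.0)  # factor 0: lands on the stallion
far_side = graze(foal, stallion, r=1.0, z=1.0)     # factor 2: overshoots the stallion
print(on_stallion, far_side)
```

The new position is always expressed relative to the stallion, which is why the search stays in the stallion's neighborhood unless the factor is large.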

The foals gradually separate from the herd and, upon reaching maturity, mate with other foals that have also left the herd. This mating behavior is simulated using Equation (6):

$$X_{G,k}^{p} = \mathrm{Crossover}(X_{G,i}^{q}, X_{G,j}^{z}), \quad i \neq j \neq k, \quad p = q = \mathrm{end} \tag{6}$$

The stallion, as the leader of the group, must guide the foals to a suitable area, namely, a waterhole. Other groups also move toward the waterhole, and a competition arises among the leaders for dominance over this location. If a group currently holds dominance over the waterhole, it retains control; otherwise, it must leave. This behavior is simulated using Equation (7):

$$\overline{\mathrm{Stallion}}_{G_i} = \begin{cases} 2Z\cos(2\pi R Z) \times (WH - \mathrm{Stallion}_{G_i}) + WH & \text{if } R_3 > 0.5 \\ 2Z\cos(2\pi R Z) \times (WH - \mathrm{Stallion}_{G_i}) - WH & \text{if } R_3 \leq 0.5 \end{cases} \tag{7}$$

In this context, $WH$ represents the location of the waterhole, $\mathrm{Stallion}_{G_i}$ denotes the current position of the leader of group $i$, and $\overline{\mathrm{Stallion}}_{G_i}$ refers to the next position of the leader of group $i$.

The leader of a group should be determined by its best position. When the foal’s position is better than the current stallion’s position, their positions are swapped using Equation (8):

$$\mathrm{Stallion}_{G_i} = \begin{cases} X_{G,i} & \text{if } \mathrm{cost}(X_{G,i}) < \mathrm{cost}(\mathrm{Stallion}_{G_i}) \\ \mathrm{Stallion}_{G_i} & \text{if } \mathrm{cost}(X_{G,i}) > \mathrm{cost}(\mathrm{Stallion}_{G_i}) \end{cases} \tag{8}$$
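The leader exchange in Equation (8) amounts to a comparison and a swap. The sketch below assumes the cost is minimized and uses a sphere function as a stand-in objective; the helper names `exchange_leader` and `sphere` are hypothetical.

```python
def exchange_leader(stallion, foals, cost):
    """Equation (8): promote the best foal if it beats the current stallion.

    Returns the (possibly new) leader and the updated list of group members.
    """
    if not foals:
        return stallion, []
    best_foal = min(foals, key=cost)
    if cost(best_foal) < cost(stallion):
        others = [f for f in foals if f is not best_foal]
        return best_foal, others + [stallion]  # the old leader becomes a member
    return stallion, list(foals)


def sphere(position):
    """Toy cost used only for this illustration."""
    return sum(x * x for x in position)


leader, herd = exchange_leader([1.0, 1.0], [[2.0, 0.0], [0.5, 0.0]], sphere)
print(leader, herd)
```

When the fitness is a score to be maximized, such as the F1-score, one option is to minimize 1 minus that score so the same comparison applies.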

3.4.2. Levy Flight

Levy flight refers to a random walk process where the step lengths follow a heavy-tailed Levy distribution, enabling occasional large jumps that enhance global exploration, distinguishing it from traditional Gaussian or uniform distributions. Levy flights are commonly generated using the Mantegna algorithm, which computes Levy-distributed random step lengths based on Equation (9):

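$$\mathrm{Levy}(\beta) = \frac{\mu}{|\nu|^{1/\beta}} \tag{9}$$

where $\mu \sim N(0, \sigma_{\mu}^{2})$ and $\nu \sim N(0, \sigma_{\nu}^{2})$, with $\sigma_{\nu} = 1$ and $\beta = 1.5$. The standard deviation $\sigma_{\mu}$ is defined as follows:

$$\sigma_{\mu} = \left(\frac{\Gamma(1+\beta)\,\sin(\pi\beta/2)}{\Gamma((1+\beta)/2)\,\beta\,2^{(\beta-1)/2}}\right)^{1/\beta}$$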
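Equation (9) and the expression for the standard deviation of μ translate directly into code. The sketch below uses only the Python standard library; the function names and the redraw guard against a zero denominator are illustrative additions.

```python
import math
import random


def sigma_mu(beta=1.5):
    """Standard deviation of mu in the Mantegna algorithm."""
    numerator = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    denominator = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    return (numerator / denominator) ** (1 / beta)


def levy_step(beta=1.5, rng=random):
    """Draw one Levy-distributed step length following Equation (9)."""
    mu = rng.gauss(0.0, sigma_mu(beta))
    nu = rng.gauss(0.0, 1.0)  # sigma_nu = 1
    while nu == 0.0:          # guard against division by zero
        nu = rng.gauss(0.0, 1.0)
    return mu / abs(nu) ** (1 / beta)


rng = random.Random(0)
steps = [levy_step(rng=rng) for _ in range(1000)]
print(round(sigma_mu(1.5), 4), max(abs(s) for s in steps))
```

For β = 1.5 the standard deviation of μ evaluates to roughly 0.6966. Most draws are small, while a few are much larger, which is the heavy-tailed behavior the text describes.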

3.4.3. LWHO

In APT attack detection, effectively optimizing the classifier’s hyperparameters plays a crucial role in enhancing both detection performance and model generalization. LightGBM is an efficient gradient boosting decision tree algorithm, and the configuration of its hyperparameters, such as the learning rate, maximum depth, and number of weak classifiers, directly affects its detection performance. Because the hyperparameter search space is wide and complex, traditional methods such as grid search and random search are inefficient and prone to falling into local optima when dealing with such high-dimensional non-convex optimization problems. To address this challenge, this paper proposes a modified Wild Horse Optimizer (LWHO) to optimize the hyperparameters of LightGBM.

The LWHO algorithm is an improved metaheuristic optimization method built upon the traditional Wild Horse Optimizer (WHO) by introducing a Levy flight mechanism. Although the original WHO algorithm exhibits a certain degree of global search ability, like many swarm intelligence methods, it tends to fall into local optima. This is primarily due to its foal position update strategy, which relies on sinusoidal periodic perturbations, resulting in a search space that is largely confined to the vicinity of the stallion. Consequently, the algorithm’s exploration capability is limited. This issue is particularly pronounced in APT detection tasks, where attacks are characterized by complex behaviors and multi-stage processes. If the optimization converges prematurely to suboptimal hyperparameter configurations, it can significantly degrade the model’s detection accuracy and generalization capability.

Specifically, in the LWHO algorithm, each foal represents a candidate hyperparameter combination, with its “position” corresponding to a set of LightGBM hyperparameter values. The algorithm iteratively generates new hyperparameter configurations through the position updates of foals and stallions. To enhance exploration capability, a Levy flight mechanism is introduced to simulate the behavior of some foals playfully wandering away from the stallion and leaping into more distant regions of the search space in search of better solutions. Accordingly, a new position update formula (Equation (10)) is proposed to implement this jump-based search strategy. The F1-score of APT attack detection is used as the fitness function to evaluate the quality of each solution, thereby guiding the algorithm toward the optimal hyperparameter configuration. Figure 3 illustrates the overall workflow of the LWHO algorithm applied to LightGBM hyperparameter optimization. In the figure, i denotes the i-th stallion and j denotes the j-th foal associated with that stallion.

$$\bar{X}_{i,G}^{j} = \mathrm{Levy}(\beta) \times (\mathrm{Stallion}^{i} - X_{i,G}^{j}) + X_{i,G}^{j} \tag{10}$$

The advantages of the improved algorithm are as follows:

Enhanced Global Search Capability: After the introduction of Levy flight, foals can perform jumping updates in a wide range, that is, some foals can perform non-linear search behavior around themselves, which combines long-distance jumping and local fine-tuning to avoid falling into local optima.
Increased Population Diversity: Due to the heavy-tailed nature of the Levy distribution, some foals can move far away from the current stallion, which significantly enhances the spatial diversity of the population and broadens the search space.
Probabilistic Update Mechanism: When a randomly generated number exceeds the control threshold (PC = 0.13) and is also less than 0.5, the foal’s position is updated using Levy flight. Meanwhile, the original herding and mating behaviors are preserved, strengthening the local search performance.
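The effect of the step size in Equation (10) can be seen in a short sketch. It assumes one scalar Levy step is applied to every dimension of the position, which the text does not specify; `levy_update` is a hypothetical name, and the step is passed in so the example is deterministic.

```python
def levy_update(foal, stallion, step):
    """Equation (10): move a foal from its own position along the line to its stallion."""
    return [step * (s - x) + x for x, s in zip(foal, stallion)]


foal = [0.0, 4.0]
stallion = [2.0, 0.0]

stay = levy_update(foal, stallion, step=0.0)   # no movement
reach = levy_update(foal, stallion, step=1.0)  # lands exactly on the stallion
leap = levy_update(foal, stallion, step=3.0)   # rare large step: far beyond the stallion
print(stay, reach, leap)
```

Unlike the grazing update, the new position is measured from the foal itself, so a large Levy step can carry the foal well outside the stallion's neighborhood.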

Algorithm 1 presents the pseudocode of the LWHO algorithm, which consists of the following steps:

Input the dataset, default parameters of the WHO algorithm, and the hyperparameter search space of LightGBM. Initialize the positions of foals and stallions.
For each foal within a group, if a randomly generated number is less than PC (PC is a crossover operator, with a default value of 0.13), perform mating to generate a new position; if the random number is greater than PC but less than 0.5, update the foal’s position using Levy flight; otherwise, use the herding behavior to update the position.
Evaluate the fitness of the new foal positions using LightGBM.
For each stallion, a random number is generated to determine whether it gains dominance over the waterhole; if the value exceeds 0.5, the stallion claims the waterhole and updates its position accordingly; otherwise, it abandons its current location.
Evaluate the fitness of the new stallion positions using LightGBM.
Select new stallions based on the current population’s fitness.
If a foal’s fitness exceeds that of the current stallion, swap their positions.
Output the best fitness value, which corresponds to the optimal hyperparameter configuration for the LightGBM model.

Algorithm 1. LWHO Pseudo Code

    Input: DAPT2020 and Unraveled datasets, population size, MaxIter, PC = 0.13, LightGBM parameter search space
    Output: Best LightGBM parameter dictionary and its fitness
    Set LightGBM parameter search space and default parameters; initialize population of horses
    While current iter ≤ MaxIter
        For each Stallion
            For each Foal in the group
                If rand > PC then
                    If rand < 0.5 then
                        Calculate Levy flight step by Equation (9)
                        Update the position of the Foal by Equation (10)
                    Else
                        Update the position of the Foal by Equation (5)
                Else
                    Update the position of the Foal by Equation (6)
                Evaluate the fitness of the new position
            End For
            Update the position of the Stallion by Equation (7)
            Evaluate the fitness of the new position
            If fitness(Stallion) < fitness(Stallion_i) then
                Stallion = Stallion_i
            Select Foal with minimum fitness
            If fitness(Foal) < fitness(Stallion) then
                Exchange Foal and Stallion position by Equation (8)
        End For
        Update optimum
    End While
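The foal-update branching in Algorithm 1 can be sketched as a small dispatch function. This sketch assumes a single uniform random number r in [0, 1) drives both checks, as the step description suggests; Algorithm 1 could also be read as drawing two separate numbers, which would change the branch frequencies. The name `choose_operator` is hypothetical.

```python
import random


def choose_operator(r, pc=0.13):
    """Map one random draw to the update rule applied to a foal."""
    if r < pc:
        return "mating"   # Equation (6)
    if r < 0.5:           # r equal to pc falls here; the text leaves that case open
        return "levy"     # Equations (9) and (10)
    return "grazing"      # Equation (5)


rng = random.Random(42)
counts = {"mating": 0, "levy": 0, "grazing": 0}
for _ in range(10000):
    counts[choose_operator(rng.random())] += 1
print(counts)
```

Under this single-draw reading, the three operators would be selected with probabilities of about 0.13, 0.37, and 0.50.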

3.5. Evaluation Indicators

In order to comprehensively evaluate the classification ability of the proposed new FS method and the improved WHO algorithm in the task of APT attack detection, four commonly used but complementary evaluation indicators are used in this paper.

1. Accuracy: Accuracy measures the proportion of all samples that the model classifies correctly. It is defined by Equation (11), where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{11}$$

2. Precision: Precision represents the proportion of samples classified as positive by the model that are truly positive, as defined by Equation (12).

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{12}$$

3. Recall: Recall measures how many of all positive samples are successfully detected by the model, as defined by Equation (13).

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{13}$$

4. F1-Score: The F1-score is the harmonic mean of precision and recall, used to balance the importance of both metrics. It is defined by Equation (14). As a comprehensive performance indicator, the F1-score is particularly suitable for imbalanced classification problems, as it reflects the overall performance of the model in terms of both recall and precision.

$$\mathrm{F1\text{-}Score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{14}$$
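The four indicators can be computed from the confusion counts as in the sketch below. Recall uses the standard definition TP / (TP + FN). The counts are made up for illustration, and the function name is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only, not results from the paper.
metrics = classification_metrics(tp=90, tn=70, fp=10, fn=30)
print(metrics)
```

With these counts the accuracy is 0.80, the precision 0.90, the recall 0.75, and the F1-score about 0.818, which shows how F1 sits between precision and recall but closer to the lower of the two.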

4. Results and Analysis

This section presents the experimental results of the LDR-RFECV feature selection method and the LWHO algorithm for optimizing LightGBM hyperparameters. Four feature selection methods are selected for comparison, and nine machine learning models are used to compare classification and detection performance. The results are then analyzed in detail.

4.1. Experimental Environment

The experimental environment used in this paper is shown in Table 4. The Lenovo ThinkBook computer used in this experiment was operated in the Inner Mongolia Autonomous Region, China. Relevant experimental processes such as data processing and model training were completed relying on this device and its environment.

4.2. Feature Selection Results and Analysis

This section describes how the optimal feature subset is selected through the LDR-RFECV algorithm. The integrated feature importance evaluator proposed in this paper is used as the evaluator in RFECV and compared with single-evaluator RFECV. In addition, XGBoost, LightGBM, DT, and RF are used to select the same number of features for comparison, and the advantages and disadvantages of the proposed method and the other methods are analyzed.

After performing standard preprocessing operations such as normalization, cleaning, and encoding on the DAPT2020 dataset, a total of 76 features were retained. Figure 4 illustrates the trend of classification accuracy as the number of features decreases using the LDR-RFECV algorithm. It can be observed that as features are gradually removed—from 76 down to only 16—the model’s accuracy remains generally stable, with slight improvements in certain stages. However, once more than 60 features are removed, a notable decline in accuracy begins to emerge, and a sharp drop is observed when fewer than 5 features remain. Therefore, to achieve a better balance between accuracy and feature dimensionality, this study ultimately retains 16 key features and discards the remaining 60 redundant ones.

Similarly, after completing the initial preprocessing of the Unraveled dataset, a total of 63 features were retained. Figure 5 illustrates the variation in classification accuracy as the number of features decreases when applying the LDR-RFECV algorithm. As shown in the figure, the model performance remains relatively stable during the removal of the first 53 features, with slight improvements observed at certain stages. However, a marked decline in accuracy occurs once more than 53 features are eliminated, with the downward trend intensifying when 59 features are removed. This indicates that the excluded features in this stage play a crucial role in the classification task. Considering the trade-off between feature dimensionality and classification performance, this study ultimately retains 10 key features, thereby discarding 53 non-essential ones.

To verify the effectiveness of the proposed LDR-RFECV feature selection method, ablation experiments are conducted. Within the RFECV algorithm, the three single feature evaluators LightGBM, DT, and RF are each used separately. Additionally, LightGBM, DT, and RF are each used for feature selection on their own, without RFECV, and the gradient boosting model XGBoost is also included. With each method, 16 features are selected on the DAPT2020 dataset and 10 features are selected on the Unraveled dataset. Table 5 and Table 6 show the feature numbers selected by each method on the two datasets.

Although different methods select varying feature subsets across the two datasets, several features are commonly selected by all or most algorithms, highlighting their core importance in APT attack detection. For instance, in the DAPT2020 dataset, features with indices 14, 35, and 65 are consistently chosen by all methods. To illustrate the connection between the selected features and the attack stages on the two datasets, Table 7, referring to the MITRE ATT&CK framework, presents the connections between 10 selected features and the attack stages.

4.3. Performance Analysis After Feature Selection

To further validate the effectiveness of the proposed multi-classifier ensemble feature selection method (LDR-RFECV), nine machine learning models were employed for evaluation. These models include LightGBM, XGBoost, CatBoost, Random Forest (RF), Decision Tree (DT), K-Nearest Neighbors (KNN), Neural Network (NN), Logistic Regression (LR), and Naive Bayes (NB).

4.3.1. Comparison of Different Feature Selection Methods

Table 8 presents a comparison of classification performance metrics—Accuracy, Precision, Recall, and F1-Score—across various models under different feature selection methods on the DAPT2020 dataset. The experimental results demonstrate that when using features selected by the proposed LDR-RFECV method, the gradient boosting-based models LightGBM, XGBoost, and CatBoost achieved the highest accuracy rates of 95.3%, 93.31%, and 92.75%, respectively, along with the highest F1-scores of 94.63%, 93.08%, and 91.65%, demonstrating significantly superior overall classification performance compared to other feature selection methods.

Table 9 presents the comparison of classification performance metrics across multiple models using different feature selection methods on the Unraveled dataset. The experimental results demonstrate that the proposed LDR-RFECV method also exhibits superior generalization performance on this dataset. Specifically, the LightGBM and XGBoost models achieved the highest accuracy rates—97.85% and 97.5%, respectively—while also recording the highest F1-scores of 94.49% and 93.32%, respectively.

Overall, different feature selection methods exhibit their own advantages. However, it can be observed that among the nine machine learning models evaluated in this study, LightGBM consistently outperformed other models across all feature selection strategies. Moreover, when using the feature subset selected by the proposed method, it achieved the highest accuracy and F1-score, demonstrating the effectiveness of the integrated feature importance evaluator used in the LDR-RFECV approach. The reason why the integrated feature selection method outperforms a single algorithm is that it fully integrates the characteristic advantages of each algorithm, forming a multi-dimensional collaborative screening mechanism. LightGBM, as an efficient gradient boosting framework, with histogram algorithms and one-sided gradient sampling techniques, not only has high computational efficiency when dealing with high-dimensional data, but also can accurately capture the non-linear relationships in the data and effectively mine the deep features of the data; decision trees split features based on information gain or Gini index, have strong interpretability, can clearly outline the hierarchical structure of the data, and perform well in identifying the logical patterns of the data; random forests, by integrating multiple decision trees, significantly reduce model variance and enhance the generalization ability of the model, and can conduct robust assessment of feature importance. When combined, they greatly reduce the risk of misjudgment caused by abnormal data distribution or interference of features, thereby making the feature selection results more accurate and reliable.

4.3.2. Performance Comparison Before and After Feature Selection

Table 10 shows the comparison of classification performance of various models on the DAPT2020 and Unraveled datasets, both with all features and with features selected by the LDR-RFECV algorithm. Overall, strong classifier models such as LightGBM, XGBoost, CatBoost, RF, and DT maintained excellent performance after feature selection, with most models experiencing performance changes within ±1%. The F1-score of the LightGBM model decreased by 0.15% and 0.31% on the two datasets, respectively, yet it still maintained a stable classification effect and remained the best-performing model among all. The F1-score of the DT model increased by 1.8% on the DAPT2020 dataset and decreased by only 0.18% on the Unraveled dataset. The F1-score of the RF model increased by 0.4% and 0.55% on the two datasets, respectively, indicating that feature selection had a positive impact on its performance. In contrast, weaker models such as NN, LR, and NB showed greater performance differences after feature selection. For example, the F1-score of the LR model decreased by more than 30% on both datasets. Although the decline was significant, the model’s initial performance before feature selection was far below that of mainstream models (e.g., LightGBM 94.78% vs. LR 70.41%); therefore, it is not recommended. Its performance changes had limited impact on the overall conclusion. In summary, the feature selection method proposed in this paper can effectively reduce the feature dimension without significantly compromising the performance of mainstream models.

4.3.3. ROC Curve Analysis

According to the previous experimental results, LightGBM achieved the best classification performance on both datasets. Therefore, it is selected as the optimal model in this study, and the ROC curves of its classification performance after feature selection are plotted for both datasets. The ROC curves in Figure 6 and Figure 7 show that LightGBM exhibited extremely high classification performance on both datasets. However, for Class 3 (the lateral movement phase of APT), the performance was poorer than for other classes on both datasets, while the classification performance for other classes was almost perfect. In APT attack detection, the lateral movement phase is a critical stage where attackers spread within the network and escalate privileges. Detecting this phase is crucial for preventing further penetration by attackers [31]. During this phase, APT attackers use a variety of attack methods, such as exploiting vulnerabilities and using built-in system tools or legitimate administrative tools to access other system networks. These behaviors are difficult to distinguish from normal activities, resulting in lower detection effectiveness.

4.4. Analysis of Parameter Optimization Results

As demonstrated in the experiments in Section 4.3, LightGBM can be considered the preferred model for APT attack detection tasks. However, the performance of LightGBM is highly sensitive to the selection of hyperparameters, including the learning rate, maximum depth, and number of weak learners. Careful optimization of these parameters, rather than relying on default settings, can significantly enhance the model’s convergence speed and generalization capability, thereby improving detection accuracy and efficiency. Table 11 shows the hyperparameters of LightGBM that need to be optimized, the description of each parameter, and the range of parameter search space.
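One way to connect an optimizer position to the hyperparameters named above is to decode a vector into a parameter dictionary. The sketch below is illustrative only: the ranges are hypothetical placeholders and are not the values in Table 11, positions are assumed to live in the unit hypercube, and `decode` is a hypothetical helper. The keys follow the parameter names of LightGBM's scikit-learn interface.

```python
# Hypothetical search space: (lower bound, upper bound, type). NOT from Table 11.
SEARCH_SPACE = {
    "learning_rate": (0.01, 0.3, float),
    "max_depth": (3, 12, int),
    "n_estimators": (50, 500, int),
}


def decode(position, space=SEARCH_SPACE):
    """Map a position (one coordinate per hyperparameter) to a parameter dict."""
    params = {}
    for value, (name, (low, high, kind)) in zip(position, space.items()):
        clipped = min(max(value, 0.0), 1.0)  # pull out-of-range jumps back inside
        scaled = low + clipped * (high - low)
        params[name] = int(round(scaled)) if kind is int else scaled
    return params


params = decode([0.5, 0.0, 1.7])  # the last coordinate overshoots and is clipped
print(params)
```

Clipping is one simple way to handle positions that leave the search space after a large jump; other boundary rules, such as reflection or resampling, are equally possible.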

4.4.1. Performance Comparison of Different Optimization Algorithms

To evaluate how well different algorithms optimize the hyperparameters of LightGBM, this paper conducted ablation experiments comparing the LWHO and WHO algorithms, and additionally selected four mainstream optimization algorithms: Grey Wolf Optimizer (GWO), Particle Swarm Optimization (PSO), CMA-ES, and Bayesian optimization. All algorithms were run for 100 iterations with the same parameter search space. Table 12 shows the final results achieved by each algorithm. The LWHO algorithm achieved the best results on both datasets, indicating that the introduction of Levy flight enhanced the global search ability of the WHO algorithm. Foals can perform leap-like updates over a larger range, combining long-distance leaps with local fine-tuning, which helps the search avoid local optima and continue exploring better solutions.

Table 13 presents a comparative analysis of the overall performance of the LightGBM model on the two datasets before and after hyperparameter optimization. As observed, on the DAPT2020 dataset, all performance metrics improved following optimization, with an average increase of 2.18%. On the Unraveled dataset, a mean improvement of 0.89% was achieved. Although the model’s performance remains largely consistent before and after feature selection, the optimized model shows significantly reduced training and prediction time, demonstrating greater efficiency.

These results indicate that the proposed feature selection and parameter optimization methods can significantly reduce the computational cost while enhancing detection performance, demonstrating their practical value in APT attack detection tasks. However, the method still has some limitations. For instance, the chosen optimization algorithm is not guaranteed to suit every dataset, and the performance of the proposed method may degrade on datasets whose characteristics and features differ from those studied here.

4.4.2. Detection Performance by the APT Attack Stage

To evaluate more finely the impact of parameter optimization on the classification performance at each stage of an APT attack, Figure 8 and Figure 9 compare the accuracy and F1-score of each category on the two datasets before and after parameter optimization of the LightGBM model. After parameter optimization, the improvement for categories 0, 1, 4, and 5 is relatively small on both datasets, whereas the improvement for category 3 (the lateral movement stage of APT) is more significant. On the DAPT2020 dataset, the accuracy of Class 3 increased by 3.96%, and the F1-score increased by 5.09%. On the Unraveled dataset, the accuracy of Class 3 increased by 14.71%, and the F1-score increased by 8.39%. The lateral movement phase is a critical but highly concealed link in the APT attack chain. The significant improvement of the optimized model in this category further demonstrates that the feature selection method and hyperparameter optimization strategy proposed in this paper have effectively enhanced the model’s ability to perceive and discriminate complex attack behavior patterns, and improved the detection system’s identification performance for key attack phases.
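Per-stage scores of this kind are typically obtained by treating one class as positive and all others as negative. The sketch below assumes that one-vs-rest convention, which the text does not spell out; the labels are made up and `per_class_scores` is a hypothetical helper.

```python
def per_class_scores(y_true, y_pred, label):
    """One-vs-rest precision, recall, and F1-score for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up stage labels; class 3 stands for lateral movement as in the text.
y_true = [0, 0, 1, 3, 3, 3, 4, 5]
y_pred = [0, 0, 1, 3, 0, 3, 4, 3]

class3 = per_class_scores(y_true, y_pred, label=3)
class0 = per_class_scores(y_true, y_pred, label=0)
print(class3, class0)
```

In this toy example one lateral movement sample is mistaken for class 0 and one class 5 sample is mistaken for lateral movement, so both the recall and the precision of class 3 fall to 2/3.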

4.5. Analysis of Resource Consumption

Table 14 presents the running time and memory usage of the feature selection algorithm and the parameter optimization algorithm. It can be seen that the integrated feature selection algorithm proposed in this paper can complete the feature screening in less than one hour, and the parameter optimization algorithm can complete the parameter optimization of the model within two hours. The memory usage is within an acceptable range. Although the integrated feature selection algorithm and the improved WHO algorithm proposed in this paper use more resources than using these algorithms alone, once these two steps are completed, subsequent steps do not need to be repeated. Subsequently, only the selected features and the optimized parameters need to be used for training and prediction.

The training time of the NN model and the prediction time of the KNN model are significantly higher than those of the other models. Therefore, Figure 10 and Figure 11 present bar charts of the training and prediction times for the remaining seven models on the two datasets. After feature selection, both the training time and the prediction time of the models decreased, which is crucial for real-time detection of APT attacks. On the DAPT2020 dataset, LightGBM and Naïve Bayes (NB) exhibited significant reductions in training time (70.07% and 78.84%, respectively), while their prediction times decreased by 33% and 75.69%. Similarly, on the Unraveled dataset, training times dropped by 73.84% for LightGBM and 79.15% for NB, with corresponding reductions in prediction time of 17.48% and 83.75%. The only exception was the CatBoost model on the DAPT2020 dataset, which showed a slight increase in prediction time (from 0.0076 s to 0.009 s, an increment of just 0.0014 s), likely attributable to stochastic perturbations within the model or fluctuations in the testing environment. Overall, the results highlight the efficiency gains in computational cost brought by the proposed feature selection method.
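Percentage changes of this kind follow the usual relative-change formula. The sketch below applies it to the only pair of raw timings quoted in the text, the CatBoost prediction times on DAPT2020; the function name is hypothetical.

```python
def relative_change(before, after):
    """Relative change in percent; negative values are reductions."""
    return (after - before) / before * 100.0


# CatBoost prediction time on DAPT2020, in seconds, as quoted in the text.
absolute_increase = 0.009 - 0.0076
catboost_change = relative_change(0.0076, 0.009)
print(round(absolute_increase, 4), round(catboost_change, 2))
```

The absolute difference of 0.0014 s matches the figure quoted above and corresponds to a relative increase of about 18.4%.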

4.6. Comparison with Previous Studies

Table 15 presents a comparative analysis between the proposed approach and existing studies. Overall, the proposed method outperforms most existing methods on most performance indicators. The exceptions are the accuracy rates reported in [9,21] (98.4% and 98.85%), which are higher than those of our method (97.31% and 98.32%); however, those studies use significantly more features than the proposed method. The method in [32] achieves a higher accuracy on the Camflow-apt dataset than ours does on the DAPT2020 dataset, but because the datasets differ, a direct comparison is not possible. Further comparison reveals that most prior studies did not incorporate a systematic feature selection process and therefore did not evaluate or report model efficiency in terms of time cost. In contrast, the proposed method employs the fewest features on both datasets while maintaining a well-balanced performance in terms of accuracy, precision, recall, and F1-score.

5. Conclusions

This paper addresses the issues of feature redundancy and model efficiency in Advanced Persistent Threat (APT) attack detection by proposing an improved Recursive Feature Elimination method based on multi-classifier feature importance evaluation (LDR-RFECV). In addition, the key hyperparameters of the LightGBM model are optimized using the Wild Horse Optimizer (WHO) Algorithm enhanced with the Levy flight mechanism. Experiments conducted on two real-world APT attack datasets, DAPT2020 and Unraveled, demonstrate that the proposed method can significantly reduce feature dimensionality and detection time while maintaining or even improving detection performance. Notably, it achieves a marked improvement in identification accuracy during critical attack stages such as lateral movement.

The APT attack detection model proposed in this study provides a new technical path for improving network security protection capabilities. This model can be integrated into existing IDSs (Intrusion Detection Systems) to form a collaborative detection architecture of “rule base + intelligent engine”. Traditional IDS relies on rule bases for field matching of attack traffic, and will have serious false positives when facing zero-day attacks. Our detection model, with the ability of autonomous learning for complex attack patterns, can be used as an intelligent detection engine to make up for the shortcomings of traditional IDS, enabling accurate identification and timely response to APT attacks.

At the same time, there is still room for further exploration in the actual deployment of the model. Future research will focus on extending the generality of the method and on improving the model’s adaptability to different network environments and attack scenarios through verification and adaptation on more heterogeneous APT datasets.

Author Contributions

Conceptualization, L.Z., X.F. and H.L.; methodology, L.Z., X.F. and H.L.; software, L.Z.; validation, L.Z., D.H. and S.Z.; formal analysis, L.Z. and X.H.; investigation, L.Z.; resources, L.Z. and D.H.; data curation, L.Z.; writing—original draft preparation, L.Z., D.H. and S.Z.; writing—review and editing, L.Z. and X.H.; visualization, L.Z.; supervision, X.F. and H.L.; project administration, H.L.; funding acquisition, H.L. and X.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation Project (No. 62041211), the Inner Mongolia Major Science and Technology Project (No. 2021SZD0004), the Inner Mongolia Autonomous Region Science and Technology Plan Project (No. 2022YFHH0070), the Basic Research Expenses of Universities Directly under the Inner Mongolia Autonomous Region (No. BR22-14-05), the Inner Mongolia Natural Science Foundation Project (No. 2024MS06002), the Inner Mongolia Autonomous Region Universities Innovative Research Team Project (No. NMGIRT2313), and the Inner Mongolia Natural Science Foundation Project (No. 2025ZD012).

Data Availability Statement

The datasets used and analyzed during the current study are publicly available. The DAPT2020 dataset can be accessed at https://gitlab.com/asu22/dapt2020 (accessed on 18 May 2025), and the Unraveled dataset is available at https://gitlab.com/asu22/unraveled (accessed on 18 May 2025).

Acknowledgments

The authors gratefully acknowledge Honghui Li, Xueliang Fu, Daoqi Han, Shuncheng Zhou, and Xin He for their valuable input and significant contributions throughout the project.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Messaoud, B.I.; Guennoun, K.; Wahbi, M.; Sadik, M. Advanced persistent threat: New analysis driven by life cycle phases and their challenges. In Proceedings of the 2016 International Conference on Advanced Communication Systems and Information Security (ACOSIS), Marrakesh, Morocco, 17–19 October 2016; pp. 1–6.
2. Kaspersky. APT and Financial Attacks on Industrial Organizations in Q4 2024. Available online: https://ics-cert.kaspersky.com/publications/reports/2025/03/25/apt-and-financial-attacks-on-industrial-organizations-in-q4-2024/ (accessed on 7 May 2025).
3. Burita, L.; Le, D.T. Cyber security and APT groups. In Proceedings of the 2021 Communication and Information Technologies (KIT), Vysoke Tatry, Slovakia, 13–15 October 2021; pp. 1–7.
4. Ghafir, I.; Hammoudeh, M.; Prenosil, V.; Han, L.; Hegarty, R.; Rabie, K.; Aparicio-Navarro, F.J. Detection of advanced persistent threat using machine-learning correlation analysis. Future Gener. Comput. Syst. 2018, 89, 349–359.
5. Do Xuan, C.; Dao, M.H. A novel approach for APT attack detection based on combined deep learning model. Neural Comput. Appl. 2021, 33, 13251–13264.
6. El Alami, H.; Rawat, D.B. A Novel Neural Networks-based Framework for APT Detection in Networked Autonomous Systems. In Proceedings of the 2024 33rd International Conference on Computer Communications and Networks (ICCCN), Kailua-Kona, HI, USA, 29–31 July 2024; pp. 1–6.
7. Eke, H.N.; Petrovski, A. Advanced persistent threats detection based on deep learning approach. In Proceedings of the 2023 IEEE 6th International Conference on Industrial Cyber-Physical Systems (ICPS), Wuhan, China, 8–11 May 2023; pp. 1–10.
8. Panahnejad, M.; Mirabi, M. APT-Dt-KC: Advanced persistent threat detection based on kill-chain model. J. Supercomput. 2022, 78, 8644–8677.
9. Joloudari, J.H.; Haderbadi, M.; Mashmool, A.; Ghasemigol, M.; Band, S.S.; Mosavi, A. Early detection of the advanced persistent threat attack using performance analysis of deep learning. IEEE Access 2020, 8, 186125–186137.
10. He, D.; Gu, H.; Zhu, S.; Chan, S.; Guizani, M. A comprehensive detection method for the lateral movement stage of apt attacks. IEEE Internet Things J. 2023, 11, 8440–8447.
11. Dau, D.-D.; Lee, S.; Kim, H. A comprehensive comparison study of ML models for multistage APT detection: Focus on data preprocessing and resampling. J. Supercomput. 2024, 80, 14143–14179.
12. Zha, C.; Wang, Z.; Fan, Y.; Zhang, X.; Bai, B.; Zhang, Y.; Shi, S.; Zhang, R. SKT-IDS: Unknown attack detection method based on Sigmoid Kernel Transformation and encoder–decoder architecture. Comput. Secur. 2024, 146, 104056.
13. Chen, T.; Dong, C.; Lv, M.; Song, Q.; Liu, H.; Zhu, T.; Xu, K.; Chen, L.; Ji, S.; Fan, Y. APT-KGL: An intelligent apt detection system based on threat knowledge and heterogeneous provenance graph learning. IEEE Trans. Dependable Secur. Comput. 2022, 1–15.
14. Weng, Z.; Zhang, W.; Zhu, T.; Dou, Z.; Sun, H.; Ye, Z.; Tian, Y. RT-APT: A real-time APT anomaly detection method for large-scale provenance graph. J. Netw. Comput. Appl. 2025, 233, 104036.
15. Kicska, G.; Kiss, A. Comparing swarm intelligence algorithms for dimension reduction in machine learning. Big Data Cogn. Comput. 2021, 5, 36.
16. Hallaji, E.; Razavi-Far, R.; Saif, M. A Study on the Importance of Features in Detecting Advanced Persistent Threats Using Machine Learning. arXiv 2025, arXiv:2502.07207.
17. Al Mamun, A.; Al-Sahaf, H.; Welch, I.; Camtepe, S. Genetic programming for enhanced detection of Advanced Persistent Threats through feature construction. Comput. Secur. 2025, 149, 104185.
18. Kumari, I.; Lee, M. A prospective approach to detect advanced persistent threats: Utilizing hybrid optimization technique. Heliyon 2023, 9, e21377.
19. Abdulsattar, N.F.; Abedi, F.; Ghanimi, H.M.; Kumar, S.; Abbas, A.H.; Abosinnee, A.S.; Alkhayyat, A.; Hassan, M.H.; Abbas, F.H. Botnet detection employing a dilated convolutional autoencoder classifier with the aid of hybrid shark and bear smell optimization algorithm-based feature selection in FANETs. Big Data Cogn. Comput. 2022, 6, 112.
20. Mei, Y.; Han, W.; Li, S.; Lin, K.; Luo, C. A hybrid intelligent approach to attribute Advanced Persistent Threat Organization using PSO-MSVM Algorithm. IEEE Trans. Netw. Serv. Manag. 2022, 19, 4262–4272.
21. Bakhiet, A.M.; Aly, S.A. Hybridizing Base-Line 2D-CNN Model with Cat Swarm Optimization for Enhanced Advanced Persistent Threat Detection. In Proceedings of the 2024 International Telecommunications Conference (ITC-Egypt), Cairo, Egypt, 22–25 July 2024; pp. 596–601.
22. Myneni, S.; Chowdhary, A.; Sabur, A.; Sengupta, S.; Agrawal, G.; Huang, D.; Kang, M. DAPT 2020-constructing a benchmark dataset for advanced persistent threats. In Proceedings of the Deployable Machine Learning for Security Defense: First International Workshop, MLHat 2020, San Diego, CA, USA, 24 August 2020; pp. 138–163.
23. Myneni, S.; Jha, K.; Sabur, A.; Agrawal, G.; Deng, Y.; Chowdhary, A.; Huang, D. Unraveled—A semi-synthetic dataset for Advanced Persistent Threats. Comput. Netw. 2023, 227, 109688.
24. Stando, A.; Cavus, M.; Biecek, P. The effect of balancing methods on model behavior in imbalanced classification problems. In Proceedings of the Fifth International Workshop on Learning with Imbalanced Domains: Theory and Applications, Turin, Italy, 18 September 2023; pp. 16–30.
25. Efendi, R.; Wahyono, T.; Widiasari, I.R. DBSCAN SMOTE LSTM: Effective Strategies for Distributed Denial of Service Detection in Imbalanced Network Environments. Big Data Cogn. Comput. 2024, 8, 118.
26. Hackeling, G. Mastering Machine Learning with Scikit-Learn; Packt Publishing Ltd.: Birmingham, UK, 2017.
27. Li, Z.-s.; Yao, X.; Liu, Z.-g.; Zhang, J.-c. Feature selection algorithm based on LightGBM. J. Northeast. Univ. 2021, 42, 1688.
28. Nam, Y.; Han, S. Random Forest Variable Importance-based Selection Algorithm in Class Imbalance Problem. J. Classif. 2025, 1–14.
29. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. Lightgbm: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3149–3157.
30. Naruei, I.; Keynia, F. Wild horse optimizer: A new meta-heuristic algorithm for solving engineering optimization problems. Eng. Comput. 2022, 38, 3025–3056.
31. Alizadeh, F.; Khansari, M.; Arabsorkhi, A. Lateral movement detection through a heteregenous GNN model of kernel-level log. In Proceedings of the 2024 11th International Symposium on Telecommunications (IST), Tehran, Iran, 9–10 October 2024; pp. 38–43.
32. Wang, L.; Fang, L.; Hu, Y. A dynamic provenance graph-based detector for advanced persistent threats. Expert Syst. Appl. 2025, 265, 125877.
33. Al Mamun, A.; Al-Sahaf, H.; Welch, I.; Mansoori, M.; Camtepe, S. Detection of advanced persistent threat: A genetic programming approach. Appl. Soft Comput. 2024, 167, 112447.
34. Almazmomi, N.K. Advanced Persistent Threat Detection Using Optimized and Hybrid Deep Learning Approach. Secur. Priv. 2025, 8, e70011.
35. Mei, Y.; Han, W.; Li, S.; Lin, K.; Tian, Z.; Li, S. A Novel Network Forensic Framework for Advanced Persistent Threat Attack Attribution Through Deep Learning. IEEE Trans. Intell. Transp. Syst. 2024, 25, 12131–12140.

Figure 1. The overall framework diagram of the detection model.

Figure 2. Flowchart for calculating the importance of comprehensive features.

Figure 3. Flowchart of the LWHO algorithm. Here, i denotes the i-th stallion and j denotes the j-th foal associated with that stallion.

Figure 4. The curve of the accuracy rate decreasing with features on the DAPT2020 dataset.

Figure 5. The curve of the accuracy rate decreasing with features on the Unraveled dataset.

Figure 6. ROC curve of LightGBM on the DAPT2020 dataset for multi-stage detection.

Figure 7. ROC curve of LightGBM on the Unraveled dataset for multi-stage detection.

Figure 8. Multiclass classification performance of the optimized LightGBM model on the DAPT2020 dataset.

Figure 9. Multiclass classification performance of the optimized LightGBM model on the Unraveled dataset.

Figure 10. Bar chart of training time and prediction time on the DAPT2020 dataset.

Figure 11. Bar chart of training time and prediction time on the Unraveled dataset.

Table 1. Related work.

Ref.	Class Method	Advantage	Limitations
[5]	CNN + LSTM	Proposed a novel deep learning hybrid model for IP classification	May not handle IP spoofing attacks effectively
[6]	MLP + AE	Utilizes GAN to generate new data samples for training	Focuses primarily on IoT security detection
[8]	NB + AHP	Applies the cyber kill chain model for analyzing and preventing attacks	Dataset is not advanced
[10]	CNN	Detects the lateral movement stage in APT attacks	Limited to a single stage of APT
[12]	AE + SKT	Detects unknown attacks using an encoder–decoder structure	Suffers from data imbalance, affecting detection accuracy
[14]	Provenance Graph	Reconstructs attack scenarios using provenance graphs with context awareness	Depends on complete log data
[9]	C5.0 DT + NB	Achieves high accuracy in early APT detection	Only distinguishes normal and abnormal traffic without multi-stage detection
[20]	PSO-MSVM	Performs well on small datasets	High time complexity on large-scale datasets

Table 2. The differences in normal and attack traffic distributions before and after data balance.

Stage	Before Balance		After Balance
Stage	DAPT2020	Unraveled	DAPT2020	Unraveled
Benign	44,258	6,618,218	44,258	100,000
Reconnaissance	11,909	34,794	11,909	34,794
Establish Foothold	8604	26,260	10,000	26,260
Lateral Movement	2451	2112	10,000	10,000
Data Exfiltration	6	7168	500	20,000
Cover Up	/ *	362	/ *	5000

* DAPT2020 is not applicable to this category.
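The per-class targets in Table 2 imply that the majority class is reduced and the minority stages are enlarged. The following minimal sketch shows one way to reach given per-class counts with plain random resampling. This is an assumption made for illustration only: the balancing technique actually used in the paper is described in Section 3.2.2 and is not reproduced here, and the function name `resample_to_targets` and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def resample_to_targets(
    rows: Sequence[object],
    labels: Sequence[Hashable],
    targets: Dict[Hashable, int],
    seed: int = 0,
) -> Tuple[List[object], List[Hashable]]:
    """Randomly under- or over-sample each class to its target count.

    Classes without an entry in `targets` keep their original size.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    out_rows: List[object] = []
    out_labels: List[Hashable] = []
    for label, members in by_class.items():
        target = targets.get(label, len(members))
        if target <= len(members):
            # Undersample: draw without replacement.
            chosen = rng.sample(members, target)
        else:
            # Oversample: keep all originals and add duplicates drawn with replacement.
            chosen = members + rng.choices(members, k=target - len(members))
        out_rows.extend(chosen)
        out_labels.extend([label] * len(chosen))
    return out_rows, out_labels


# Toy data: 10 benign flows and 2 lateral-movement flows (hypothetical).
toy_rows = list(range(12))
toy_labels = ["Benign"] * 10 + ["Lateral Movement"] * 2
balanced_rows, balanced_labels = resample_to_targets(
    toy_rows, toy_labels, {"Benign": 5, "Lateral Movement": 4}
)
class_counts = Counter(balanced_labels)
```

Duplicating minority samples is the simplest possible choice; synthetic oversampling methods generate new points instead of copies and would replace the `rng.choices` call.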

Table 3. Integer encoding for APT attack stages.

APT Stage	Encoded Value
Benign	0
Reconnaissance	1
Establish Foothold	2
Lateral Movement	3
Data Exfiltration	4
Cover Up	5
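The encoding in Table 3 can be written as a simple lookup table. The sketch below is illustrative; the names `STAGE_TO_CODE`, `CODE_TO_STAGE`, and `encode_stages` are hypothetical, and the exact label strings in the raw datasets may differ from the stage names used in the table.

```python
from typing import Dict, List, Sequence

# Stage names and integer codes copied from Table 3.
STAGE_TO_CODE: Dict[str, int] = {
    "Benign": 0,
    "Reconnaissance": 1,
    "Establish Foothold": 2,
    "Lateral Movement": 3,
    "Data Exfiltration": 4,
    "Cover Up": 5,
}

# Reverse mapping, useful for turning model predictions back into stage names.
CODE_TO_STAGE: Dict[int, str] = {code: stage for stage, code in STAGE_TO_CODE.items()}


def encode_stages(labels: Sequence[str]) -> List[int]:
    """Map stage names to their integer codes; unknown names raise KeyError."""
    return [STAGE_TO_CODE[label] for label in labels]


encoded = encode_stages(["Benign", "Lateral Movement", "Cover Up"])
decoded = [CODE_TO_STAGE[code] for code in encoded]
```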

Table 4. Experimental environment parameters.

Software/Hardware Component	Specification or Version
Operating System	Windows 11 64-bit
Processor (CPU)	AMD Ryzen 7 8845H w/Radeon 780M Graphics 3.80 GHz
Memory (RAM)	32 GB
Python Environment	Python 3.11.6
Data Processing Libraries	NumPy 1.26.4, Pandas 2.2.3
Machine Learning Library	Scikit-learn 1.5.2
Visualization Tool	Matplotlib 3.9.2

Table 5. The feature index selected on the DAPT2020 dataset.

Feature Selection Method	Feature Index
LDR-RFECV	0, 5, 8, 14, 15, 17, 18, 23, 24, 28, 33, 34, 35, 36, 51, 65
LGB-RFECV	0, 5, 14, 15, 17, 18, 23, 24, 25, 28, 33, 34, 35, 36, 51, 65
DT-RFECV	0, 8, 14, 15, 18, 25, 27, 28, 33, 34, 35, 36, 45, 50, 65, 75
RF-RFECV	0, 11, 14, 15, 17, 18, 27, 28, 33, 34, 35, 36, 39, 51, 53, 65
XGBoost	7, 8, 10, 12, 14, 28, 34, 35, 36, 39, 42, 43, 44, 50, 65, 66
LightGBM	0, 5, 8, 14, 15, 16, 17, 18, 23, 24, 28, 33, 34, 35, 36, 65
RF	0, 14, 15, 16, 17, 18, 24, 25, 27, 28, 33, 34, 35, 36, 46, 65
DT	0, 8, 14, 15, 18, 24, 25, 27, 28, 30, 33, 35, 36, 50, 65, 75

Table 6. The feature index selected on the Unraveled dataset.

Feature Selection Method	Feature Index
LDR-RFECV	0, 2, 9, 10, 16, 20, 21, 30, 36, 37
LGB-RFECV	0, 1, 9, 10, 16, 20, 21, 26, 30, 36
DT-RFECV	0, 4, 5, 11, 12, 16, 17, 21, 22, 24
RF-RFECV	0, 1, 5, 6, 10, 11, 30, 37, 38, 45
XGBoost	2, 4, 15, 19, 22, 31, 38, 45, 52, 53
LightGBM	0, 1, 10, 12, 16, 20, 21, 30, 36, 37
RF	0, 1, 5, 6, 10, 11, 30, 37, 38, 45
DT	0, 4, 12, 16, 17, 21, 22, 24, 45, 52

Table 7. A mapping table of features and attacks.

Name	Explanation	ATT&CK * Tactics Mapping
Flow Duration	Represents long-connection attacks (e.g., remote control, persistent connections). Normal traffic usually has short connection durations, while attack traffic (e.g., botnet C2 communication, data theft) maintains longer connections to sustain control or data transmission.	Command and Control (TA0011), Exfiltration (TA0010)
Fwd Packet Length Max	Represents large data transfer attacks (e.g., file uploads, malicious payload delivery in exploitation). Attack traffic often sends large packets to deliver malware or stolen data, while normal traffic has more uniform packet sizes.	Execution (TA0002), Exfiltration (TA0010)
Flow Packets/s	Represents traffic flooding attacks (e.g., SYN Flood, UDP Flood in DDoS). Attacks send massive packets in a short time to occupy bandwidth/resources, while normal traffic has stable rates.	Impact (TA0040)
Flow IAT Mean	Represents scanning attacks (e.g., port scanning, service probing). Scanning attacks initiate frequent short connections, resulting in a much smaller inter-flow interval mean than normal traffic (normal intervals are more random).	Reconnaissance (TA0043)
Flow IAT Max	Represents intermittent communication attacks (e.g., dormant malware). Malware communicates periodically after long dormant periods, with much larger maximum inter-flow intervals than normal traffic.	Persistence (TA0003), Command and Control (TA0011)
Bwd IAT Total	Represents reverse connection attacks (e.g., reverse shells, remote control). Backward traffic (target to attacker) maintains long connections, with total intervals larger than normal responsive traffic.	Command and Control (TA0011)
Bwd IAT Min	Represents data exfiltration attacks (e.g., real-time stolen data upload). Backward traffic transfers stolen data rapidly, with minimal intervals smaller than normal responsive traffic.	Exfiltration (TA0010)
Bwd Header Length	Represents malicious reverse communication (e.g., C2 reverse responses). Controlled devices may hide commands in backward headers, with lengths differing from normal responses.	Command and Control (TA0011), Defense Evasion (TA0005)
Bwd Packets/s	Represents controlled device data return attacks (e.g., botnets reporting to C2). Infected devices send massive backward packets, with rates exceeding normal responsive traffic (e.g., web responses).	Command and Control (TA0011), Exfiltration (TA0010)
ACK Flag Count	Represents connection hijacking or session manipulation (e.g., TCP session hijacking). Attacks forge excessive ACK packets to maintain illegal connections; normal traffic has ACK counts matching data transfer.	Defense Evasion (TA0005), Lateral Movement (TA0008)

* ‘ATT&CK’ stands for Adversarial Tactics, Techniques, and Common Knowledge, a framework by MITRE describing adversary behaviors in cyber attacks, aiding threat analysis and defense.

Table 8. Classification performance indicators of different feature selection methods on the DAPT2020 dataset under multiple models.

Method	Model	Acc.	Prec.	Rec.	F1	Method	Model	Acc.	Prec.	Rec.	F1
LDR-RFECV	LightGBM	95.3	94.33	94.99	94.63	LGB-RFECV	LightGBM	95.02	94.1	94.55	94.32
	XGBoost	93.31	92.37	94.19	93.08		XGBoost	93	91.15	93.38	92.14
	CatBoost	92.75	90.4	93.25	91.65		CatBoost	92.57	90.54	92.31	91.3
	RF	93.75	92.4	94.48	93.29		RF	93.94	92.49	93.26	92.82
	DT	93.44	91.96	93.74	92.79		DT	93.45	92.41	93.44	92.88
	KNN	90.74	89.44	89.89	89.62		KNN	90.98	89.88	90.28	90.07
	NN	87.45	82.61	85.89	84.12		NN	89.06	87.53	76.04	77.91
	LR	64.53	41.36	35.17	35.97		LR	73.4	53.43	52.36	51.37
	NB	40.28	50.05	54.56	44.24		NB	32.85	51.37	61.16	38.91
DT-RFECV	LightGBM	95.24	94.25	94.89	94.56	RF-RFECV	LightGBM	95.05	94.13	94.69	94.4
	XGBoost	93.23	92.01	94.17	92.93		XGBoost	92.99	91.64	93.83	92.58
	CatBoost	92.61	90.14	93.47	91.65		CatBoost	92.27	89.64	92.79	91.09
	RF	93.73	92.91	93.77	93.31		RF	93.62	92.85	93.68	93.23
	DT	92.79	91.68	91.33	91.46		DT	93.09	92.35	92.11	92.17
	KNN	90.75	89.29	90.2	89.73		KNN	89.45	87.17	88.8	87.93
	NN	88.9	82.3	88.62	85.11		NN	89.02	82.13	90.34	85.6
	LR	61.93	56.79	36.05	38.91		LR	61.77	42.67	31.04	30.83
	NB	39.79	49.33	54.1	43.13		NB	38.95	44.17	54.61	39.65
XGBoost	LightGBM	94.91	94.04	94.61	94.32	RF	LightGBM	94.95	94.33	94.65	94.47
	XGBoost	92.95	91	93.66	92.2		XGBoost	92.76	91.89	93.64	92.59
	CatBoost	92.33	89.95	92.5	91.1		CatBoost	92.32	90.47	92.37	91.27
	RF	93.93	92.43	93.81	93.07		RF	93.73	92.71	93.99	93.28
	DT	93.35	92.12	93.54	92.78		DT	92.98	92.23	92.89	92.46
	KNN	90.94	89.77	90.34	90.05		KNN	90.58	89.48	90.71	90.08
	NN	89.25	83.91	88.92	86.13		NN	86.95	86.4	74.91	76.89
	LR	74.25	54.09	51.47	51.88		LR	61.37	35.88	33.47	33.57
	NB	32.71	43.31	56.61	34.21		NB	39.4	47.41	54.03	42.03
LightGBM	LightGBM	95.06	94.03	94.8	94.39	DT	LightGBM	95	94.24	94.71	94.44
	XGBoost	93.13	91.96	94.08	92.85		XGBoost	93.05	92.12	93.93	92.83
	CatBoost	92.3	90.19	92.98	91.45		CatBoost	92.33	90.2	92.72	91.33
	RF	93.56	92.46	94.5	93.36		RF	93.76	92.49	93.86	93.13
	DT	93.02	91.86	92.78	92.24		DT	92.96	91.61	92.58	92.06
	KNN	89.84	87.28	89.43	88.26		KNN	91.01	89.61	90.75	90.17
	NN	88.47	81.47	90.46	85.27		NN	88.18	82.24	87.39	84.42
	LR	64.47	50.88	35.8	37.67		LR	61.75	56.61	36.15	38.82
	NB	40.2	49.1	54.5	43.17		NB	40.22	50.25	54.39	44.56

Acc. stands for accuracy, Prec. stands for precision rate, while Rec. represents recall rate.
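For a multi-class task, precision, recall, and F1 must be aggregated over classes. The sketch below computes accuracy together with macro-averaged precision, recall, and F1 from true and predicted labels. Macro averaging is an assumption here, since the averaging scheme is defined in Section 3.5 rather than in this table. It would, however, be consistent with rows such as NB under LDR-RFECV in Table 8, where F1 (44.24) is lower than both precision (50.05) and recall (54.56), which cannot happen if F1 is the harmonic mean of the two reported values.

```python
from typing import Dict, Hashable, Sequence


def macro_metrics(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n_classes = len(labels)
    return {
        "accuracy": sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true),
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1s) / n_classes,  # mean of per-class F1, not F1 of the means
    }


# Toy example with three encoded stages (0, 1, 3); one stage-1 flow is misclassified as stage 3.
metrics = macro_metrics([0, 0, 1, 1, 3, 3], [0, 0, 1, 3, 3, 3])
```

Because every class contributes equally to a macro average, a rare stage such as lateral movement affects these scores as much as the benign class does.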

Table 9. Classification performance indicators of different feature selection methods on the Unraveled dataset under multiple models.

Method	Model	Acc.	Prec.	Rec.	F1	Method	Model	Acc.	Prec.	Rec.	F1
LDR-RFECV	LightGBM	97.85	97.03	93.42	94.49	LGB-RFECV	LightGBM	97.78	96.91	93.14	94.22
	XGBoost	97.5	96.65	92.21	93.32		XGBoost	97.43	96.34	92.05	93.12
	CatBoost	97.4	96.55	92.35	93.48		CatBoost	97.44	96.29	92.52	93.64
	RF	97.3	96.55	91.49	92.6		RF	97.28	96.33	91.46	92.54
	DT	96.99	95.54	91.05	92.06		DT	97.19	95.79	91.54	92.59
	KNN	96.15	91.01	90.74	90.81		KNN	96.14	91.04	90.62	90.74
	NN	94.81	91.27	84.37	82.59		NN	94.81	92.23	84.16	82.36
	LR	73.91	50.41	50.25	47.34		LR	70.35	49.17	46.96	45.68
	NB	49.01	39.88	45.11	31.83		NB	58.32	48.24	51.21	40.38
DT-RFECV	LightGBM	97.55	96.65	92.45	93.54	RF-RFECV	LightGBM	97.14	96.02	91.15	92.2
	XGBoost	97.26	96.29	91.54	92.6		XGBoost	96.94	95.88	90.43	91.44
	CatBoost	97.38	96.48	92.45	93.64		CatBoost	96.78	95.81	90.27	91.26
	RF	97.24	96.41	91.33	92.43		RF	96.86	96	90.03	90.99
	DT	97.07	95.38	91.33	92.31		DT	97.09	95.51	91.29	92.27
	KNN	95.62	89.68	89.51	89.5		KNN	95.99	90.69	90.29	90.41
	NN	94.59	89.9	83.53	81.37		NN	94.37	89.22	83.35	81.08
	LR	70.57	47.72	48.99	47.3		LR	72.88	50.4	57.13	51.01
	NB	57.87	44.67	48.88	38.23		NB	62.78	53.97	60.99	46.81
XGBoost	LightGBM	94.96	90.34	83.93	85.29	RF	LightGBM	96.39	95.31	88.84	89.52
	XGBoost	94.74	89.41	83.45	84.75		XGBoost	96.34	94.82	88.77	89.43
	CatBoost	94.57	89.66	82.72	84.22		CatBoost	96.13	94.31	88.58	89.2
	RF	95.17	91.12	84.6	85.93		RF	96.54	95.49	89.03	89.75
	DT	95.09	89.69	85.08	86.2		DT	96.38	93.89	89.06	89.73
	KNN	93.63	84.64	81.35	82.21		KNN	95.74	90.1	89.75	89.8
	NN	92.48	87.73	73.49	73.18		NN	94.03	87.27	82.31	79
	LR	72.66	46.15	45.62	39.9		LR	75.86	60.33	62.53	57.41
	NB	39.97	31.71	45.9	26.56		NB	64.23	59.51	64.95	52.04
LightGBM	LightGBM	97.74	96.83	93.08	94.15	DT	LightGBM	97.59	96.63	92.71	93.78
	XGBoost	97.41	96.5	91.92	93.02		XGBoost	97.4	96.42	92.11	93.21
	CatBoost	97.31	96.09	92.25	93.28		CatBoost	97.46	96.49	92.63	93.71
	RF	97.06	96.22	90.67	91.72		RF	97.36	96.55	91.71	92.82
	DT	96.75	94.92	90.21	91.12		DT	97.31	95.85	91.9	92.9
	KNN	96.12	90.92	90.78	90.8		KNN	95.87	90.25	89.88	89.94
	NN	94.79	91.45	84.33	82.52		NN	94.76	91.34	84.12	82.15
	LR	73.92	50.59	50.28	47.43		LR	83.71	66.33	72.59	68.92
	NB	57.86	42.94	49.05	36.77		NB	51.39	45.82	43.31	30.48

Acc. stands for accuracy, Prec. stands for precision rate, while Rec. represents recall rate.

Table 10. Comparison of classification performance before and after feature selection.

Dataset	All Features					Features Selected
Dataset	Model	Acc.	Prec.	Rec.	F1	Model	Acc.	Prec.	Rec.	F1
DAPT2020	LightGBM	95.37	94.48	95.11	94.78	LightGBM	95.3	94.33	94.99	94.63
	XGBoost	93.3	91.91	94.19	92.9	XGBoost	93.31	92.37	94.19	93.08
	CatBoost	92.66	90.47	93.54	91.87	CatBoost	92.75	90.4	93.25	91.65
	RF	93.22	91.62	94.41	92.89	RF	93.75	92.4	94.48	93.29
	DT	92.91	91.09	91.01	90.99	DT	93.44	91.96	93.74	92.79
	KNN	90.45	88.93	90.15	89.52	KNN	90.74	89.44	89.89	89.62
	NN	90.32	84.8	88.32	86.47	NN	87.45	82.61	85.89	84.12
	LR	82.37	81.49	67.75	70.41	LR	64.53	41.36	35.17	35.97
	NB	55.85	42.17	53.9	39.16	NB	40.28	50.05	54.56	44.24
Unraveled	LightGBM	97.95	97.26	93.73	94.81	LightGBM	97.85	97.03	93.42	94.49
	XGBoost	97.71	97.03	92.83	93.97	XGBoost	97.5	96.65	92.21	93.32
	CatBoost	97.35	96.69	92.13	93.27	CatBoost	97.4	96.55	92.35	93.48
	RF	97.17	96.49	90.97	92.05	RF	97.3	96.55	91.49	92.6
	DT	97.27	96.05	91.79	92.87	DT	96.99	95.54	91.05	92.06
	KNN	96.48	91.79	91.71	91.71	KNN	96.15	91.01	90.74	90.81
	NN	95.47	93.92	85.82	84.99	NN	94.81	91.27	84.37	82.59
	LR	92.46	88.84	81.4	78.34	LR	73.91	50.41	50.25	47.34
	NB	55.64	54.08	48.05	35.97	NB	49.01	39.88	45.11	31.83

Acc. stands for accuracy, Prec. stands for precision rate, while Rec. represents recall rate.

Table 11. The optimized list of hyperparameters.

Parameter Name	Description	Searching Space
max_depth	Maximum depth of the tree, which is used to limit tree layers and prevent overfitting.	[2, 10]
learning_rate	Learning rate that determines each tree’s contribution to the final result; lower values improve robustness but increase training time.	[0.01, 0.5]
num_leaves	Maximum number of leaves per tree; a key parameter controlling model complexity. Larger values may enhance fitting ability but risk overfitting.	[2, 50]
feature_fraction	The ratio of features sampled during each iteration, which helps reduce overfitting.	[0.01, 1]
bagging_fraction	Proportion of data used in each iteration for training.	[0.01, 1]
lambda_l2	L2 regularization term to regulate model complexity and mitigate overfitting.	[0.01, 1]
lambda_l1	L1 regularization term used for feature selection, enhancing sparsity and generalization.	[0.01, 1]
min_data_in_leaf	Minimum number of samples required in a leaf.	[1, 30]
n_estimators	Total count of decision trees (weak learners) built during training.	[10, 120]
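The search space in Table 11 can be expressed as bounds on a position vector that a swarm optimizer moves around. The sketch below perturbs a position with a Levy-flight step drawn by Mantegna's algorithm, clips it to the bounds, and decodes it into keyword arguments. The bounds are copied from Table 11; everything else (the `scale` factor, `beta = 1.5`, the function names, and the way the step is applied) is an illustrative assumption and not the paper's LWHO update rule.

```python
import math
import random
from typing import Dict, List, Tuple

# (low, high, is_integer) per hyperparameter, with bounds taken from Table 11.
SEARCH_SPACE: Dict[str, Tuple[float, float, bool]] = {
    "max_depth": (2, 10, True),
    "learning_rate": (0.01, 0.5, False),
    "num_leaves": (2, 50, True),
    "feature_fraction": (0.01, 1.0, False),
    "bagging_fraction": (0.01, 1.0, False),
    "lambda_l2": (0.01, 1.0, False),
    "lambda_l1": (0.01, 1.0, False),
    "min_data_in_leaf": (1, 30, True),
    "n_estimators": (10, 120, True),
}


def levy_step(rng: random.Random, beta: float = 1.5) -> float:
    """Draw one heavy-tailed step using Mantegna's algorithm."""
    sigma = (
        math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
        / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
    ) ** (1 / beta)
    u = rng.gauss(0.0, sigma)
    v = rng.gauss(0.0, 1.0) or 1e-12  # avoid division by zero
    return u / abs(v) ** (1 / beta)


def perturb(position: List[float], rng: random.Random, scale: float = 0.01) -> List[float]:
    """Move each coordinate by a Levy step scaled to its range, then clip to the bounds."""
    moved = []
    for value, (low, high, _) in zip(position, SEARCH_SPACE.values()):
        candidate = value + scale * (high - low) * levy_step(rng)
        moved.append(min(max(candidate, low), high))
    return moved


def decode(position: List[float]) -> Dict[str, float]:
    """Turn a position vector into hyperparameter keyword arguments."""
    params: Dict[str, float] = {}
    for value, (name, (_, _, is_int)) in zip(position, SEARCH_SPACE.items()):
        params[name] = int(round(value)) if is_int else value
    return params


rng = random.Random(0)
start = [(low + high) / 2 for low, high, _ in SEARCH_SPACE.values()]  # midpoint of each range
candidate = perturb(start, rng)
params = decode(candidate)
```

Most Levy steps are small, but occasional large jumps help a search escape local optima; clipping keeps such jumps inside the ranges of Table 11. The resulting `params` dictionary could then be passed to a LightGBM classifier and scored by cross-validation as the fitness value.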

Table 12. Comparison of the effects of various optimization algorithms.

Dataset	Algorithm	Acc.	Prec.	Rec.	F1
DAPT2020	LWHO	97.33	96.8	97.25	97.01
	WHO	97.23	96.68	97.14	96.9
	GWO	97.11	96.59	97.21	96.89
	PSO	96.47	95.81	96.29	96.02
	CMA-ES	96.79	96.18	96.63	96.39
	Bayesian	95.62	95.01	95.24	95.1
Unraveled	LWHO	98.32	97.67	94.87	95.85
	WHO	98.27	97.67	94.71	95.75
	GWO	98.21	97.54	94.55	95.56
	PSO	97.98	97.47	94.05	95.18
	CMA-ES	98.20	97.54	94.53	95.55
	Bayesian	97.82	97.33	93.63	94.81

Acc. stands for accuracy, Prec. stands for precision rate, while Rec. represents recall rate.

Table 13. The overall performance table of LightGBM after parameter optimization.

Number of Features	Metrics	Before Optimization		After Optimization
Number of Features	Metrics	DAPT2020	Unraveled	DAPT2020	Unraveled
After Feature Selection	Accuracy	95.3	97.85	97.33	98.32
	Precision	94.33	97.03	96.8	97.67
	Recall	94.99	93.42	97.25	94.87
	F1-Score	94.63	94.49	97.01	95.85
All Features	Accuracy	95.37	97.95	97.23	98.35
	Precision	94.48	97.26	96.76	97.59
	Recall	95.11	93.73	97.13	95.06
	F1-Score	94.78	94.81	96.93	95.96

Table 14. Time and memory consumption of the FS algorithms and the parameter optimization algorithms.

Algorithm	Total Running Time (min)		Total Memory Consumption (MB)
Algorithm	DAPT2020	Unraveled	DAPT2020	Unraveled
LDR-RFECV	23.10 min	59.47 min	77 MB	109 MB
LGB-RFECV	7.45 min	15.53 min	/	/
DT-RFECV	8.42 min	12.10 min	/	/
RF-RFECV	13.73 min	41.90 min	/	/
LWHO	95.75 min	137.57 min	303 MB	575 MB
WHO	78.68 min	106.73 min	281 MB	498 MB

“/” indicates that it is not applicable.

Table 15. Comparison of our method with existing studies.

Ref.	Method Used	Dataset	Features Num.	Acc.	Prec.	Rec.	F1	Predict (s)
[18]	Hybrid HHOSSA (HHO and SSA)	DAPT2020	/	94.468	/	/	/	/
[21]	CSO-2D-CNN	DAPT2020	75	98.4	96	98	97	/
[33]	GPC	DAPT2020	66	85.13	37.87	98.22	93.61	0.0003
[17]	Genetic Programming (FEGP)	DAPT2020	/	79.28	/	/	98.58	/
[17]	Genetic Programming (FEGP)	Unraveled	/	83.14	/	/	99.93	/
[34]	SMA-Optimized CNN-LSTM	Unraveled	/	94.3	92.8	93.5	93.1	480
[9]	Six-layer deep learning model	NSL KDD	41	98.85	/	/	95.84	/
[35]	MLP DNN	UNSW-NB15	25	94.56	95.58	94.56	94.94	/
[6]	MLP + AE	TON IoT	/	90.56	89.27	90.86	88.06	/
[32]	CGL-AD	Camflow-apt	/	97.47	98.15	96.8	97.45	/
[32]	CGL-AD	Shellshock	/	94.27	98.59	90.87	94.10	49
Our	LDR-RFECV + LWHO LightGBM	DAPT2020	16	97.33	96.8	97.25	97.01	0.0402
Our	LDR-RFECV + LWHO LightGBM	Unraveled	10	98.32	97.67	94.87	95.85	0.1567

Acc. stands for accuracy, Prec. stands for precision rate, while Rec. represents recall rate. “/” indicates that it is not applicable, or the author of the literature did not provide this information.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zeng, L.; Li, H.; Fu, X.; Han, D.; Zhou, S.; He, X. Research on Multi-Stage Detection of APT Attacks: Feature Selection Based on LDR-RFECV and Hyperparameter Optimization via LWHO. Big Data Cogn. Comput. 2025, 9, 206. https://doi.org/10.3390/bdcc9080206

