Article

Improving Remote Access Trojans Detection: A Comprehensive Approach Using Machine Learning and Hybrid Feature Engineering

by AlsharifHasan Mohamad Aburbeian 1, Manuel Fernández-Veiga 2 and Ahmad Hasasneh 3,*
1 Doctoral Program in Information and Communications Technologies (DocTIC), Universidade de Vigo, 36310 Vigo, Spain
2 AtlanTTic Research Center, Universidade de Vigo, 36310 Vigo, Spain
3 Faculty of Artificial Intelligence, Arab American University, Ramallah P600, Palestine
* Author to whom correspondence should be addressed.
AI 2025, 6(9), 237; https://doi.org/10.3390/ai6090237
Submission received: 12 August 2025 / Revised: 18 September 2025 / Accepted: 19 September 2025 / Published: 21 September 2025

Abstract

Remote Access Trojans (RATs) pose a serious cybersecurity risk due to their stealthy control over compromised systems. This study presents a detection framework that integrates host, network, and newly engineered behavioral features to enhance the identification of RATs. Two sets of experiments were performed: (i) using the original dataset only, and (ii) using an extended dataset with ten engineered features and importance analysis. The framework was evaluated on a public Kaggle dataset of RAT and benign traffic. Eight machine learning classifiers were tested, including three baseline methods, four ensemble approaches, and a neural network. Results show that the engineered hybrid feature set substantially improves detection performance. Among the tested algorithms, Random Forest and MLP achieved the strongest performance, with accuracies of 98% and 97%, respectively, while Gradient Boosting and LightGBM also performed competitively. Performance was assessed using multiple metrics, and to gain deeper insight into model learning behavior, learning curves and Precision–Recall curves were analyzed. The results demonstrate how hybrid feature modeling, neural networks, and ensemble machine learning techniques can improve RAT identification. In future work, explainable ML methods will be explored to further improve detection capabilities.

1. Introduction

1.1. Background

A Remote Access Trojan (RAT) is malware that enables intruders to remotely take control of systems and intercept data, with a significant negative impact on governments, businesses, and private citizens [1]. A RAT is generally made up of an infected side and a control side. To reach computers that can be infected, attackers can employ phishing, social engineering, or other techniques to deliver the RAT to the victim’s device [2]. To avoid appearing in the task list, RATs use concealment methods that embed them into other legitimate processes, so that their operations resemble those of legitimate programs and become harder to identify. RATs typically rely on a client/server design. The attacker controls the client device to either directly steal information from the target device or exploit the remote control capability of the target host by using the client to manage the server [3]. Figure 1 shows the runtime states of the RAT, which are connection setup, keep-alive, and command and control.
As shown in Figure 1, the connection establishment phase is the first and shortest stage in a RAT’s lifecycle, during which the attacker delivers the RAT and establishes the initial connection. After that, the system alternates between two recurring states: keep-alive and command-and-control [4]. The keep-alive state lasts the longest and involves minimal activity. During this phase, the RAT server sends small periodic signals (keep-alive requests) to check whether the client is still connected. These packets are typically tiny and contain little information [5]. From a detection standpoint, this quiet behavior can be identified by analyzing the network packets and monitoring the limited number of host operations. In contrast, the command-and-control state is more active and involves frequent communication between the client and server. Once connected, the attacker can issue commands to the client, which are executed by the server. This state often involves large data transfers, especially when the attacker is stealing sensitive information [6]. Because of these malicious capabilities, RATs have become a serious problem in cyberspace [7]. This research presents a methodology for detecting RATs by employing various machine learning algorithms to achieve the highest possible accuracy.
Figure 1. Different operational states of a Remote Access Trojan (RAT) [5].

1.2. Related Studies

Numerous studies in the literature have addressed RAT detection, and the methodologies used can be categorized into three groups: host-based, network-based, and hybrid-based. This section discusses each one separately.

1.2.1. Host-Based Studies

The host-based approach relies on evaluating and monitoring the behavior of RATs as they run on infected devices. To identify and detect RATs, these studies use various metrics based on execution data from the operating system, such as CPU and memory utilization [8], processor execution path [9], process ID, and API requests [10]. According to Moon et al. [11], the system behavior of typical software may be used to identify malware from an Advanced Persistent Threat (APT) attack. A technique for spotting APT malware that focuses on “changes” in host behavior was implemented by Chandran et al. [12]; it utilizes data such as CPU and memory utilization, system files, and open ports. Using communication and processing data, Liang et al. [13] suggested a method for logging suspicious communications from Trojans. The host-based approach often relies on static analysis techniques based on syntactic signatures or semantic properties. However, identifying a Trojan using only static analysis may not be effective [14], and malware that employs code obfuscation methods is challenging to examine in this way [15]. A deep learning method for RAT detection was introduced by Floroiu et al. [16]. Their method utilized an image classification model and converted executable files into grayscale texture images. The dataset was constructed using both benign samples and files infected by known RATs. Their best model achieved an accuracy of 83.26% among the deep learning models tested. However, the model’s ability to identify RATs is constrained by its dependence on session-level variables and low accuracy. Their findings also indicate that addressing the unbalanced dataset is necessary to improve identification.

1.2.2. Network-Based Studies

RATs are often detected with network-based recognition methods by examining the network traffic’s payload, the behavior of its users, or its statistical properties. To characterize RAT activity, Li et al. [17] utilized four IP-level characteristics (the quantities of incoming and outgoing packets, the length of the communication session, and the number of transport layer connections) together with two flow-level characteristics (packet delivery time and packet time interval). Their model’s false positive rate (FPR) is less than 3.2%, while the accuracy is over 91%. However, this may still allow sensitive data to be exposed before the threat is detected, due to delayed identification. Liang et al. [13] described the behavior of Trojans using five typical characteristics: the active connection percentage, the number of unique IPs, the ratio of transmitted to received traffic size, the uploaded connection percentage, and the number of connections. The FPR was 2.94%, with an accuracy of 97.05%. However, this approach can struggle to identify Trojans promptly, and the precise quantity of benign and Trojan cases was not reported in the study. Jiang et al. [18] showed that RAT traffic can be identified in the early stages of TCP communication; their study used 175 sessions for detection, of which only 10 were RAT sessions and 165 were benign. Their Random Forest classifier accuracy was 96%, and the FPR was 10%. An unbalanced distribution of Trojan and benign sessions was also used in other studies [19,20]; such imbalance can prevent the deployed classifier from correctly distinguishing between Trojan and benign traffic. Sebakara et al. [21] introduced a model for RAT identification based on metadata and behavioral signs at the network level, including persistent unidirectional flows, irregularities in TLS handshakes, and anomalies in packet timing. Several classifiers were trained and assessed; the Random Forest performed best, with 72.11% accuracy. However, their methodology was less dependable when detecting RATs because it is solely static and vulnerable to obfuscation strategies, and it requires further improvement for better detection.

1.2.3. Hybrid-Based Studies

In this approach, host and network features are combined to describe the behavior of RATs. For example, an approach to identifying RAT-Bots was introduced by Awad et al. [22]. They suggested a framework that involves two stages that work together to detect RAT bots early on. The first stage depends on host-based behavior, monitoring the host’s system activity and raising an alert in the event of any irregularities. The second stage relies on network-based behavior, monitoring network traffic and identifying any suspicious trends. This strategy depends significantly on the host agent, since the network agent does not start operating until the host agent raises the alarm. Guo et al. [5] proposed a system that combines both host and network characteristics to detect RATs. To increase the True Positive Rate (TPR) of their model, they trained two distinct recognition models, each tailored to a particular RAT operational stage. However, their data were highly skewed (benign: 28,730 records; Trojan: 251 records), which would lead to unreliable model accuracy due to the bias towards the benign category [23,24]. Another strategy was put forth to identify RATs in a LAN environment by employing a range of static and dynamic analysis techniques [25]. Both host-level activity analysis (using Process Explorer and static inspection tools like VirusTotal) and network traffic analysis (using Wireshark) are integrated in this work. The detection framework monitors the compromised host and examines the traffic it produces, spanning both ends of the RAT communication channel. However, their method lacks machine learning approaches and does not evaluate performance under network variability or latency, nor does it report detection accuracy.
Despite the significant advancements in host-based, network-based, and hybrid RAT detection methods, several limitations remain unaddressed in the current literature. Many host-based studies rely heavily on static analysis techniques, which are often ineffective against obfuscated or polymorphic malware and may lack the context of external communication patterns. Network-based approaches usually rely on limited session-level features and struggle with high false positive rates or delayed detection, particularly when encountering encrypted or stealthy communication. Moreover, several prior studies suffer from highly imbalanced datasets, which compromise classifier reliability by biasing detection toward benign instances. Hybrid approaches have shown promise, but existing models often lack rich behavioral feature sets or rely on sequential triggers between host and network agents, limiting their responsiveness and detection scope [5]. To overcome these issues, this study suggests a hybrid detection methodology that combines a balanced dataset and advanced feature engineering.
The main contributions of this study are as follows:
  • Hybrid detection approach: A novel framework is proposed that combines host-based and network-based features to enhance RAT detection.
  • Improved accuracy and lower false positives: By integrating various features, the approach enhances classification accuracy and reduces false alarms.
  • Broader threat coverage: The hybrid method captures both external network behaviors (e.g., command-and-control activity) and internal host indicators (e.g., unauthorized access attempts).
  • Highlighting the significance of feature engineering: The study introduces and evaluates 10 newly engineered behavioral features that significantly improve model performance.

2. Materials and Methods

2.1. Methodology

This study introduces a new machine learning method for detecting RATs by combining host and network features. Two models were created: Model A, which includes the original dataset features along with engineered behavioral features, and Model B, which uses only the original dataset features without the added engineered attributes. This setup enabled a systematic evaluation of how feature engineering impacts the performance of RAT detection. The methodology used in this research is illustrated in Figure 2.
As illustrated in Figure 2, the study approach contains six main stages, beginning with data acquisition and followed by feature engineering and data preprocessing. The final stages focus on enhancing and evaluating the performance using several metrics. To ensure robust performance, the study employed a diverse set of algorithms, including Logistic Regression (LR), Naive Bayes (NB), K-nearest neighbors (KNN), Random Forest (RF), Gradient Boosting (GB), AdaBoost, LightGBM, and Multilayer Perceptron (MLP).
Linear models such as Logistic Regression provide interpretability and computational efficiency, making them suitable for establishing baseline performance. Naive Bayes is effective in high-dimensional spaces under conditional independence assumptions. Instance-based methods like K-Nearest Neighbors leverage local similarity and lazy learning, enabling direct comparisons between flow instances. Ensemble tree-based methods further enhance predictive power: Random Forest captures complex feature interactions and reduces variance through bagging, Gradient Boosting sequentially corrects errors from prior trees to improve accuracy, AdaBoost iteratively reweights misclassified samples to focus on challenging instances, and LightGBM optimizes leaf-wise growth for the efficient handling of large, high-dimensional datasets. Finally, MLP, as a neural network, learns intricate nonlinear patterns and hierarchical representations in the data.
By combining these diverse classifiers, the study ensures that different aspects of RAT behavior are captured and that conclusions are not biased toward any single algorithmic assumption. This comprehensive evaluation enables the identification of the most effective models for RAT detection and provides robust insight into the benefits of hybrid feature engineering.

2.2. Data Acquisition

The dataset used in this study was generated by the “Canadian Institute for Cybersecurity (CIC)” and downloaded from the Kaggle website [26]. It includes 177,482 network traffic records, each representing a flow instance labeled as either benign or a RAT. The dataset distribution is shown in Figure 3.
As shown in Figure 3, the dataset is roughly balanced, with 86,799 benign instances and 90,683 RAT instances, offering a nearly even class distribution. This balance reduces the need for extensive rebalancing initially, leading to more reliable training and evaluation of classification models. Although this study used a balanced and publicly available Kaggle dataset for training and assessment, such a distribution might not accurately reflect real-world RAT traffic, where benign flows are much more common. While this balancing was necessary to reduce classifier bias, it also introduces a limitation in terms of realism.

2.3. Feature Engineering

Feature engineering is essential for improving performance. It entails generating, altering, or choosing critical characteristics in unprocessed data to enhance the machine learning model’s efficiency [27]. In security-related tasks such as malware and RAT detection, the raw features in the original dataset capture low-level traffic statistics. By deriving additional behavioral features, we can reveal deeper patterns of activity that better distinguish malicious from benign behavior. This study introduced 10 new features that capture time-based, interaction-based, and behavioral characteristics of network flows. These features help reveal insights such as flow intensity, temporal activity, frequency of communication, and the diversity of contacts between IP addresses. Table 1 below summarizes the engineered features, their source features, and their classification, and provides a brief description of each one.
As shown in Table 1, the newly introduced features include time-related attributes such as “Hour, DayOfWeek, IsWeekend, and SecondsSinceMidnight”, which capture temporal patterns that can reveal attack timing behavior. Behavioral network features, such as “TimeDiffFromLastFlow, SourceIP_FlowCount, UniqueDestinations, and UniqueSources”, offer critical insights into how a source or destination behaves across connections, helping to identify anomalies like port scanning, lateral movement, or unusual communication volume. The “AvgFlowDuration” captures how long a source typically communicates in a session, which may further distinguish malicious flows.
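To make the derivation of these attributes concrete, the following is a minimal pandas sketch of how such features could be computed from the raw flow records. The column names follow Table 1 and the CIC flow format, but the file name and the exact grouping logic are illustrative assumptions rather than the authors’ code.

import pandas as pd

# Hypothetical file name; the dataset is the CIC Trojan-detection CSV from Kaggle.
df = pd.read_csv("trojan_detection.csv")
df["Timestamp"] = pd.to_datetime(df["Timestamp"], errors="coerce")
df = df.dropna(subset=["Timestamp"]).sort_values("Timestamp")

# Time-based features
df["Hour"] = df["Timestamp"].dt.hour
df["DayOfWeek"] = df["Timestamp"].dt.dayofweek
df["IsWeekend"] = (df["DayOfWeek"] >= 5).astype(int)
df["SecondsSinceMidnight"] = (
    df["Hour"] * 3600 + df["Timestamp"].dt.minute * 60 + df["Timestamp"].dt.second
)

# Behavioral / interaction-based features (FlowDate, used only as an
# intermediate variable in the paper, is omitted here)
df["TimeDiffFromLastFlow"] = (
    df.groupby("Source IP")["Timestamp"].diff().dt.total_seconds().fillna(0)
)
df["SourceIP_FlowCount"] = df.groupby("Source IP")["Source IP"].transform("count")
df["UniqueDestinations"] = df.groupby("Source IP")["Destination IP"].transform("nunique")
df["UniqueSources"] = df.groupby("Destination IP")["Source IP"].transform("nunique")
df["AvgFlowDuration"] = df.groupby("Source IP")["Flow Duration"].transform("mean")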
To measure their impact, ablation experiments were performed: Model A (with the engineered features) and Model B (without the engineered features).

2.4. Data Cleaning and Preprocessing

To ensure data readiness, a comprehensive data preprocessing procedure was implemented. Initially, the preprocessing pipeline for Model A included removing non-informative features such as “FlowID” and the “Unnamed: 0” index column, a CSV-generated identifier without analytical value. The “Timestamp” and “FlowDate” columns were excluded to reduce redundancy and potential information leakage, as their identifying information had already been encoded into more meaningful derived features during the feature engineering phase. Categorical fields such as “Source IP”, “Destination IP”, and “Protocol” were transformed into numerical representations; specifically, the IP addresses were hashed into integers to preserve anonymity and maintain compatibility with machine learning algorithms, while the protocol values were encoded using categorical codes. The target variable “Class”, labeled initially with the string values “Benign” and “Trojan”, was mapped to binary numeric values (0 and 1, respectively) for classifier compatibility. To ensure data quality, all columns were assessed for missing values; any rows containing invalid timestamps had already been removed during earlier parsing steps. A final verification step confirmed that no missing values remained. Furthermore, all remaining features were verified to be numeric, aligning the dataset with the input requirements of the selected machine learning frameworks.
Model B followed the same preprocessing pipeline as Model A. Non-informative columns such as “FlowID” and “Unnamed: 0” were removed, categorical fields were transformed into numerical representations, IP addresses were hashed into integers, protocol values were encoded, and the target variable was mapped to binary numeric values (0 and 1). The only difference is that no derived behavioral features were added, and no feature importance or selection was performed.
These steps collectively ensured the dataset was clean, numerically consistent, and free from features that could hinder the learning process.
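A minimal sketch of these steps, continuing from the feature engineering sketch above, is given below. The hashing scheme is an assumption, since the paper only states that IP addresses were hashed into integers.

import hashlib

def hash_ip(ip):
    # Map an IP address string to a stable integer (illustrative choice of hash).
    return int(hashlib.md5(str(ip).encode()).hexdigest()[:8], 16)

# Drop non-informative / redundant columns (Model B drops only the first two).
drop_cols = ["FlowID", "Unnamed: 0", "Timestamp", "FlowDate"]
df = df.drop(columns=[c for c in drop_cols if c in df.columns])

df["Source IP"] = df["Source IP"].map(hash_ip)
df["Destination IP"] = df["Destination IP"].map(hash_ip)
df["Protocol"] = df["Protocol"].astype("category").cat.codes
df["Class"] = df["Class"].map({"Benign": 0, "Trojan": 1})

# Verify that no missing values remain.
assert df.isna().sum().sum() == 0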

2.5. Feature Selection

To improve performance and reduce computational load, this study employed a comprehensive feature selection and importance analysis process. Not all features contributed equally to classifying benign and RAT traffic. To identify the most critical features, a Random Forest classifier was trained on the preprocessed data. This model was chosen because of its ability to rank features based on their effectiveness in reducing impurity in decision trees. The feature importance was then assessed and ranked. A threshold of 0.001 was applied to exclude features that had minimal impact on the model’s decisions. Refer to Table 2, which lists the features and their descriptions after the feature importance analysis.
As shown in Table 2, the analysis revealed that several engineered behavioral features, such as “DayOfWeek, SecondsSinceMidnight, and IsWeekend” were among the most influential, highlighting the significance of temporal patterns in RAT detection. Network-based metrics such as SourceIP_FlowCount, UniqueDestinations, Flow Duration, and packet statistics like “Flow Bytes/s, Fwd IAT Max, and Packet Length Variance” also showed strong importance scores. In contrast, features like “RST Flag Count, Fwd URG Flags, ECE Flag Count”, and several bulk transfer indicators had an importance of zero and were thus removed.
Out of the ten engineered features, nine were retained in the final set of seventy after applying feature importance analysis. Features such as “FlowDate” were excluded to avoid redundancy; however, they were primarily used as intermediate variables to derive more meaningful features like “DayOfWeek” and “IsWeekend”, which were retained. Similarly, the identifiers feature “FlowID” and the “Timestamp” feature were removed, while their temporal information was already captured in other derived features. Thus, the final feature set preserves the essential behavioral signals introduced by our feature engineering, while avoiding redundant or less informative variables.
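A short sketch of this importance-based selection, under the assumption that a default-depth Random Forest with 200 trees was used for the ranking, is shown below; only features whose importance exceeds the 0.001 threshold are kept.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X, y = df.drop(columns=["Class"]), df["Class"]

# Rank features by impurity-based importance.
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
rf.fit(X, y)

importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
selected = importances[importances >= 0.001].index.tolist()
X = X[selected]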
To facilitate the reader’s understanding of the final number of features, Figure 4 shows the roadmap for the feature engineering process.
As shown in Figure 4, Model A starts with 86 features. Ten features were extracted through feature engineering, as illustrated previously in Table 1. Four features were dropped, as discussed in Section 2.4. Finally, 22 features were dropped according to the importance analysis. As a result, the final number of features in Model A is 70, including the label (Trojan, benign). The retained set included a mix of original network features, derived behavioral features, and key host indicators that were empirically validated to contribute significantly to the classification task. This selective pruning of the feature space not only improved training efficiency but also helped mitigate overfitting, as the models were now trained on the most informative subset of data. Model B also starts with 86 features; two features were dropped as discussed previously in Section 2.4. As a result, the final number of features in Model B is 84, including the label (Trojan, benign).

2.6. Data Splitting and Model Preparation

To prepare the dataset for training and evaluation, stratified sampling was used to split it into 80% for training and 20% for testing. The distribution was nearly balanced, with Trojan samples forming a slight majority. The training set maintained a class balance of approximately 49% benign and 51% Trojan, while the testing set preserved the same ratio. To address the slight imbalance in the training data and enhance the distinguishability of both classes, the Synthetic Minority Over-sampling Technique (SMOTE) was applied. To prevent data leakage, the dataset was first split into training and testing sets. Then, SMOTE was applied only to the training set to balance the classes before model fitting. The same data splitting process was used for both models. SMOTE generates synthetic data for the minority class, resulting in a perfectly balanced training set with 50% for each class. Table 3 summarizes the distribution of class labels across training, testing, and SMOTE-balanced training sets.
As Table 3 shows, the whole dataset consisted of 177,482 rows, which were divided into 141,985 rows (80%) for training and 35,497 rows (20%) for testing. The training set included 69,439 benign and 72,546 Trojan records, while the testing set contained 17,360 benign and 18,137 Trojan records. After applying SMOTE, the balanced training set consisted of 72,546 samples for each class, totaling 145,092 rows.
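A minimal sketch of this procedure, continuing from the earlier sketches, follows; the fixed seed matches the reproducibility settings in Section 3.1.2.

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Stratified 80/20 split, then SMOTE applied to the training portion only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)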
Finally, with the dataset preprocessed, features selected, and class distribution balanced, both models were ready for testing the machine learning algorithms. The following section presents the experimental results.

3. Results

3.1. Experiments

3.1.1. Hardware and Software Configuration

This study’s experiments were performed using the Python programming language within the Anaconda Jupyter Notebook environment (version 3). All code was executed on a standard personal computer equipped with an Intel® Core™ i5-10210U CPU @ 1.60 GHz (2.11 GHz) (Intel, Santa Clara, CA, USA) and 16 GB of RAM, running Windows 11 Pro.
All experiments were implemented in Python (version 3.9.12). The primary libraries included scikit-learn (version 1.0.2), LightGBM (version 4.6.0), NumPy (version 1.21.5), Pandas (version 1.4.2), Matplotlib (version 3.5.1), and Seaborn (version 0.11.2). This setup provided sufficient computational resources for model training, hyperparameter tuning, and visualization. The primary objective was to create a trustworthy and interpretable framework capable of accurately detecting and classifying RAT traffic.

3.1.2. Classifier Design and Tuning

To improve detection performance, specific enhancements were applied to each classifier for both models. The Logistic Regression model was trained on standardized features with L2 regularization (C = 1.0) and a maximum of 3000 iterations to ensure convergence. For Naive Bayes, the 25 most informative features were selected and scaled before training, with the var_smoothing parameter tuned from 10⁻⁹ to 10⁻⁵ for stable probability estimation. The K-Nearest Neighbors classifier was optimized by testing k values of 3, 5, and 7 with uniform and distance-based weights, while scaling the input features to ensure meaningful distance comparisons.
The ensemble and neural models were also configured with tailored enhancements. The Random Forest model was trained with 200 estimators, meaning it constructed 200 independent decision trees and aggregated their predictions through majority voting, which improves stability and reduces variance. The trees were allowed to grow fully (no maximum depth) to capture complex patterns, while parallel execution (n_jobs = −1) optimized training efficiency. The Gradient Boosting classifier also employed 200 estimators, but in contrast to RF, these were built sequentially, where each tree corrected the mistakes of the previous one. The learning rate of 0.05 reduced the weight of each tree’s contribution for smoother convergence, while the maximum depth of 3 limited tree complexity to prevent overfitting. Stochasticity was introduced with a subsample ratio of 0.8, where each tree was trained on 80% of the samples, improving generalization. The AdaBoost model used 100 estimators, representing 100 weak learners (shallow trees), each reweighted to emphasize previously misclassified samples, allowing the ensemble to iteratively improve its focus on difficult-to-detect Trojans. Similarly, the LightGBM model was configured with 100 estimators and a learning rate of 0.05 but optimized for efficiency by using a maximum depth of 5 and 31 leaves per tree, which constrained memory usage while preserving predictive power. Finally, the MLP neural network was designed with two hidden layers containing 128 and 64 neurons, respectively, allowing it to learn hierarchical representations of traffic patterns. It was trained with a maximum of 500 iterations, with early stopping to prevent overfitting, and regularization was applied through an L2 penalty (alpha = 0.0005).
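The sketch below collects these reported settings into scikit-learn and LightGBM constructors. Any parameter not named in the text is left at its library default (an assumption), and the Naive Bayes and KNN grid searches are omitted for brevity, so this is an illustration rather than the exact training script.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (
    RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier,
)
from sklearn.neural_network import MLPClassifier
from lightgbm import LGBMClassifier

models = {
    "LogisticRegression": LogisticRegression(C=1.0, max_iter=3000, random_state=42),
    "RandomForest": RandomForestClassifier(
        n_estimators=200, max_depth=None, n_jobs=-1, random_state=42
    ),
    "GradientBoosting": GradientBoostingClassifier(
        n_estimators=200, learning_rate=0.05, max_depth=3, subsample=0.8, random_state=42
    ),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=42),
    "LightGBM": LGBMClassifier(
        n_estimators=100, learning_rate=0.05, max_depth=5, num_leaves=31, random_state=42
    ),
    "MLP": MLPClassifier(
        hidden_layer_sizes=(128, 64), max_iter=500, early_stopping=True, alpha=0.0005
    ),
}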
To ensure reproducibility, a fixed random seed (random_state = 42) was applied during train–test splitting, SMOTE balancing, and in classifiers that support seed initialization (Random Forest, Gradient Boosting, AdaBoost, Logistic Regression, and LightGBM). For deterministic classifiers such as k-Nearest Neighbors and Naïve Bayes, reproducibility is inherent. For the MLPClassifier, results may vary slightly due to weight initialization when no random seed is specified.
Together, these models represent a diverse mix of simple, ensemble learning, and deep neural approaches, ensuring both interpretability and adaptability in detecting RAT activity.

3.1.3. Ablation Experiments

To rigorously evaluate the impact of feature engineering, two models were developed. Model B was trained using only the original dataset features, whereas Model A extended this feature space with ten newly engineered behavioral attributes. Both models were trained and tested under identical conditions to ensure a fair and unbiased comparison.
A consistent preprocessing pipeline was applied to both models, including data cleaning, normalization, the handling of missing values, and the use of SMOTE to balance the training set. Each model was then evaluated with the same set of eight classifiers: three simple algorithms, four ensemble methods, and one neural network. For each classifier, the relevant pipeline components (as described in Section 2.4) were integrated to ensure consistency and prevent bias across the experiments.

3.2. Metrics and Results

Demonstrating accuracy through testing alone is not enough to prove the algorithm’s reliability. Therefore, the results were evaluated using various metrics, including the confusion matrix, classification report, ROC curves, precision–recall curves, learning curves, and model training time. Each metric will be explained and discussed in separate sections. For Model A, all these metrics will be shown to provide a thorough assessment of its performance. Conversely, Model B results will be presented using only the classification report, as this highlights the effect of feature engineering without needing a complete comparative analysis.

3.2.1. Confusion Matrix

The confusion matrix is a method for classification problems that shows how well a model performs by breaking down correct and incorrect predictions for each class. It divides predictions into four categories: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). TP indicates that the model correctly predicted a positive outcome, TN indicates a correct negative prediction, FP refers to an incorrect positive prediction, and FN refers to an incorrect negative prediction [28]. This metric enables a detailed analysis of how effectively each class is identified. The results of the confusion matrix are shown in Figure 5.
As shown in Figure 5, the Logistic Regression correctly predicted 86% of benign flows and 74% of Trojan flows. The Naïve Bayes classifier achieved 87.17% accuracy for benign traffic and 64.17% for Trojan traffic. In contrast, the Random Forest classifier correctly predicted 98% of benign samples and 99% of Trojan samples. The Gradient Boosting, AdaBoost, LightGBM, and MLP models were able to predict the Trojan traffic with accuracies of 99.62%, 97.12%, 99.58%, and 98.54%, respectively.
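For reference, per-class rates like those quoted above can be read from a row-normalized confusion matrix; a short sketch (continuing from the earlier sketches, using Random Forest as an example) follows.

from sklearn.metrics import confusion_matrix

clf = models["RandomForest"].fit(X_train_bal, y_train_bal)
y_pred = clf.predict(X_test)

# Rows are the true classes; the diagonal gives the per-class detection rates.
cm = confusion_matrix(y_test, y_pred, normalize="true")
benign_rate, trojan_rate = cm[0, 0], cm[1, 1]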

3.2.2. The Classification Report

The classification report provides the accuracy, precision, recall, and F1 score measures. Accuracy is the fraction of correctly predicted results; it is computed by dividing the sum of TPs and TNs by the total number of predictions. Although accuracy is frequently the first criterion used to assess a classifier’s effectiveness, additional measures should be considered for a more thorough understanding of its behavior [29]. Precision is the proportion of predicted positives that are correct, recall is the proportion of actual positives that are correctly identified, and the F1 score is the harmonic mean of precision and recall [30]. Table 4 displays the results of the classification report for Model A, and Table 5 shows the results for Model B.
Table 4 reports that Logistic Regression, Naïve Bayes, and k-Nearest Neighbors algorithms achieved accuracies of 80%, 75%, and 94%, respectively. Both Random Forest and Gradient Boosting algorithms obtained 98% accuracy, while AdaBoost, LightGBM, and MLP classifiers achieved an accuracy of 97%.
Table 5 shows that Logistic Regression, Naïve Bayes, and k-Nearest Neighbors achieved accuracies of 50%, 52%, and 54%, respectively. The highest results were for the Random Forest classifier, with an accuracy of 81% and precision and recall values of 85% and 77% for the Trojan class. The classifiers Gradient Boosting, AdaBoost, LightGBM, and MLP achieved accuracies of 74%, 73%, 76%, and 75%, respectively.
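A minimal sketch of how such a report is produced for one classifier, continuing from the confusion-matrix example, is shown below.

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1, plus overall accuracy, as in Tables 4 and 5.
print(classification_report(y_test, y_pred, target_names=["Benign", "Trojan"], digits=2))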

3.2.3. ROC Curve

The Receiver Operating Characteristic (ROC) curve is a visualization used to assess a binary classifier’s effectiveness. It shows the relationship between the TP rate and the FP rate across various classification thresholds [31]. Figure 6 presents the ROC curve results.
As seen in Figure 6, the ensemble and MLP algorithms exhibit curves that closely approach the upper-left corner. In contrast, Logistic Regression and Naïve Bayes had ROC curves closer to the diagonal.

3.2.4. Precision–Recall Curves

The Precision–Recall (PR) curve is a diagnostic tool that focuses on the performance of a classifier concerning the positive class (Trojan traffic) [32]. The curve displays precision against recall at various threshold values. PR curves are exceptionally informative when dealing with imbalanced datasets, as they provide more meaningful insight than ROC curves when the positive class is rare. Figure 7 shows the PR curves for the same set of models evaluated in the ROC curve analysis.
According to Figure 7, the ensemble and MLP algorithms consistently achieved high precision and recall across thresholds, with higher average precision scores than Logistic Regression and Naïve Bayes.
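Both curve families can be computed from the predicted Trojan-class probabilities; a short sketch (continuing from the earlier examples) is given below.

from sklearn.metrics import (
    roc_curve, auc, precision_recall_curve, average_precision_score,
)

proba = clf.predict_proba(X_test)[:, 1]  # probability of the Trojan class

# ROC curve (Figure 6) and its area under the curve.
fpr, tpr, _ = roc_curve(y_test, proba)
roc_auc = auc(fpr, tpr)

# Precision-Recall curve (Figure 7) and average precision.
precision, recall, _ = precision_recall_curve(y_test, proba)
avg_precision = average_precision_score(y_test, proba)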

3.2.5. Learning Curve and Training Time Results

A learning curve illustrates how a model’s performance evolves with increasing training set size [33]. It provides details on how effectively the model generalizes and whether it suffers from underfitting or overfitting. In contrast, the training time bar chart in the bottom-right corner quantifies the computational cost required to fit each model to the whole training dataset. Figure 8 presents a comprehensive analysis of model behavior using both learning curves and a training time comparison.
As shown in Figure 8, Random Forest and Gradient Boosting demonstrated steadily increasing validation accuracy as training data increased, closely following their training accuracy. AdaBoost and LightGBM also showed consistent performance, with validation curves staying near their training curves across all data sizes. The MLPClassifier consistently had a small gap between training and validation accuracy. Logistic Regression reached early convergence, with nearly identical training and validation curves, while Naïve Bayes displayed more variation in validation accuracy. K-Nearest Neighbors maintained high training accuracy, with validation improving gradually as the data size expanded. The training times in seconds were LightGBM (1.79 s), Logistic Regression (8.71 s), Naïve Bayes (0.10 s), k-Nearest Neighbors (157.23 s), Random Forest (34.44 s), Gradient Boosting (181.34 s), AdaBoost (47.42 s), and MLP (50.08 s).
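A sketch of the learning-curve computation illustrated in Figure 8 follows; the cross-validation scheme and training-size grid are assumptions, not the authors’ exact setup.

import numpy as np
from sklearn.model_selection import learning_curve

# Training and validation accuracy over increasing training-set fractions.
train_sizes, train_scores, val_scores = learning_curve(
    models["RandomForest"], X_train_bal, y_train_bal,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy", n_jobs=-1,
)
train_mean, val_mean = train_scores.mean(axis=1), val_scores.mean(axis=1)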

3.2.6. Inference Performance

Inference performance measures how efficiently a trained model can make predictions using new data, considering factors such as latency, memory usage, and CPU utilization. Table 6 provides a comparison of Model A in terms of these metrics.
As shown in Table 6, Logistic Regression exhibited memory usage of ≈1616 MB and CPU utilization of 101.9%. K-Nearest Neighbors required higher memory (≈1695 MB) and CPU (157.1%), with an average latency per sample of 1.986 milliseconds (ms). Random Forest, Gradient Boosting, and AdaBoost consumed between 1690 and 1703 MB of memory, with CPU utilization ranging from 96.2% to 100.2%, and latency per sample from 0.0022 to 0.0154 ms. LightGBM used 1645.98 MB of memory with a CPU utilization of 180.3% and 0.0024 ms latency per sample. The MLP classifier showed the highest CPU usage at 380.6%, memory of 1703.42 MB, and latency of 0.0020 ms per sample.
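A sketch of how such figures could be measured for one classifier is shown below; the use of psutil and the specific instrumentation are assumptions, as the paper does not specify how these metrics were collected.

import os
import time
import psutil  # assumed to be available for memory/CPU sampling

proc = psutil.Process(os.getpid())

# Time a full pass over the test set and derive the per-sample latency.
start = time.perf_counter()
_ = clf.predict(X_test)
elapsed = time.perf_counter() - start

latency_ms = elapsed / len(X_test) * 1000         # average latency per sample (ms)
memory_mb = proc.memory_info().rss / (1024 ** 2)  # resident memory (MB)
cpu_percent = proc.cpu_percent(interval=1.0)      # >100% indicates multi-core usage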

3.2.7. Feature Contribution Analysis for Ensemble ML Methods

As discussed in Section 3.1.2, all classifiers were carefully tuned using optimal hyperparameters to achieve the best performance while minimizing the risk of overfitting. To gain deeper insights into the contribution of each feature in the predictive process, an importance analysis was conducted specifically for the ensemble algorithms, including Random Forest, Gradient Boosting, AdaBoost, and LightGBM. This analysis highlights which features most strongly influence the model’s decisions and provides an understanding of the relative weight of each attribute in detecting RAT behavior. Figure 9 shows the analysis results.
As shown in Figure 9, “DayOfWeek” and “SecondsSinceMidnight” were the most significant features across all classifiers. For Random Forest, the most important features were “IsWeekend, Hour, and Source IP”. For Gradient Boosting, the top features included “Source IP, SourceIP_FlowCount, and UniqueSources”. The AdaBoost model highlighted “Source IP, Destination IP, and SourceIP_FlowCount”, while LightGBM identified “Source IP, Destination IP, and Flow Packets” as the key contributors.

4. Discussion

The evaluation results show that ensemble and neural network methods are more effective for RAT detection than linear or probabilistic classifiers. For example, Random Forest and MLP have demonstrated the ability to achieve high detection rates for both benign and malicious categories. This advantage is due to their capacity to model complex, nonlinear relationships between features.
The performance gap between advanced and simpler models is also clear in the ROC and Precision–Recall analyses. Ensemble classifiers, especially Random Forest and Gradient Boosting, generated nearly perfect ROC curves while maintaining high precision even at higher recall levels. Likewise, MLP demonstrated a strong balance between precision and recall, highlighting its robustness against false positives. In contrast, Logistic Regression and Naïve Bayes performed poorly when recall increased, revealing their limitations in scenarios that require maximum detection coverage.
Learning curve patterns further validated these findings. Random Forest and Gradient Boosting showed consistent improvements with more training data, while MLP maintained solid generalization with only a small training–validation gap. AdaBoost and LightGBM also displayed strong consistency, thanks to their boosting frameworks. Conversely, K-Nearest Neighbors was sensitive to variance and improved more slowly, Logistic Regression plateaued early due to bias issues, and Naïve Bayes remained unstable because of its independence assumption.
Training time comparisons show a trade-off between predictive performance and computational efficiency. While Random Forest, Gradient Boosting, AdaBoost, and MLP take longer to train, their higher accuracy may justify the extra cost in high-security situations. On the other hand, Naïve Bayes and Logistic Regression run faster but have lower detection ability. The performance evaluation of the classifiers highlights notable differences in memory usage, CPU utilization, and inference latency. Most models, including Logistic Regression and Naive Bayes, have moderate memory footprints of around 1.6–1.7 GB, while Random Forest, Gradient Boosting, AdaBoost, and MLP slightly exceed 1.7 GB. K-Nearest Neighbors, LightGBM, and MLP exhibit higher memory demands, reflecting the need to store extensive data structures during inference. CPU utilization varies considerably: simpler models like Logistic Regression and Naive Bayes remain close to 100%, whereas K-Nearest Neighbors, LightGBM, and especially MLP exceed 150–380%, indicating parallel processing or multi-core usage during computation. Latency per sample is minimal for most models (<0.003 ms), except for K-Nearest Neighbors, which incurs a much higher cost (≈1.99 ms) due to its instance-based computation. Overall, these results suggest that while more complex models may provide higher predictive capacity, they also require significantly more computational resources, which should be considered when deploying them in resource-constrained environments.
Ablation experiments further underscored the importance of feature engineering. Using the new behavioral features, Model A consistently outperformed Model B across all classifiers. These features significantly boosted the neural network and ensemble classifiers, showing that behavioral patterns can distinguish RAT traffic. As detailed in Section 3.2.7, most of the top-ranked features were engineered attributes, confirming that the newly created behavioral features significantly contributed to the improved detection performance. To place these results in context, Table 7 presents the results of some related studies that used the same ML algorithms to address the RAT issue.
As shown in Table 7, some earlier studies reported higher KNN accuracy than ours [18,37]; however, our approach achieved better results with ensemble methods such as Random Forest, Gradient Boosting, AdaBoost, and LightGBM. The very high results reported for Random Forest (up to 99.7% [36] and 100% [39]) may suggest overfitting, particularly since these studies lack robustness checks. Additionally, most related works only present accuracy, without comprehensive evaluation metrics, and seldom include class distribution details, making it hard to determine whether their models are biased toward the majority class. In contrast, our study provides a complete set of performance metrics, explicitly analyzes both classes, and validates results across multiple classifiers. Our results do not conflict with previous malware detection research [25,39,40,41,42,43,44]. This study confirms that combining hybrid feature design with well-tuned machine learning models can significantly improve RAT detection efficiency.
False positives remain one of the toughest challenges when deploying RAT detection systems in real-world settings. In high-risk industries like healthcare, finance, and critical infrastructure, even a small FP rate can have serious consequences. For instance, in hospitals, unnecessary alarms might divert analysts’ attention from actual threats to electronic medical records; in financial companies, too many false alerts can interfere with fraud detection and delay responses to incidents; and in industrial control systems, harmless flows that are mistakenly identified as malicious can lead to costly service disruptions or shutdowns. These examples demonstrate that high precision is not only desirable but also essential for practical deployment.
From an operational standpoint, the study demonstrates that hybrid feature integration provides a practical and effective basis for real-world intrusion detection systems. The engineered behavioral features, particularly those that capture temporal patterns and connection diversity, directly relate to common RAT tactics, such as persistence, scheduled execution, and lateral movement. By increasing sensitivity to these behaviors, the framework enhances resilience against stealthy techniques that often evade traditional signature-based defenses. Furthermore, the scalability of tree-based ensemble models and the flexibility of MLP make them strong candidates for deployment in enterprise and critical infrastructure monitoring systems. Although these models demand more computational resources during training, their proven ability to achieve high detection accuracy while minimizing false positives underscores their feasibility and reliability in mission-critical security applications.
Despite the promising results, this study has several limitations. The dataset used was publicly available and balanced, which, although suitable for controlled experiments, might not entirely reflect real-world traffic patterns where benign flows usually prevail. As a result, the high accuracies reported may be influenced more by the dataset’s characteristics than by actual detection capabilities. Additionally, external validation on real-world RAT traffic was not conducted due to the lack of available datasets. Stronger model explainability methods, such as SHAP or LIME, were also not applied because the dataset was huge, and initial attempts to compute feature attributions using SHAP did not complete within a reasonable time on the available computational resources. These analyses are planned for future work when additional datasets and higher computational capacity become available.
Our main contribution is the development of a hybrid detection framework that combines host-based, network-based, and newly designed behavioral features. This design, validated through ablation tests and extended ensemble evaluations, demonstrates that the engineered features are the primary factors driving performance improvements, distinguishing our work from previous RAT detection studies.

5. Conclusions

This study introduces a new machine learning method for detecting RATs by merging host and network features. Two models were created: Model A, which uses the original dataset features along with engineered behavioral features, and Model B, which relies solely on the original dataset features without the added engineered attributes.
The process started with acquiring a dataset. Initial efforts focused on cleaning the data by removing irrelevant identifiers and timestamps, and ensuring data consistency. The class distribution was nearly balanced, so SMOTE was used to achieve an exact 50:50 ratio between benign and Trojan samples during training. After data preparation, a thorough feature engineering process was conducted. Multiple time-based and behavioral features were derived from temporal and flow patterns, combining domain knowledge with statistical attributes to enhance data understanding. These new features aim to capture temporal and communication patterns between hosts, improving the detection of stealthy RAT behavior. In total, ten new features were created and added to the dataset to provide a more informative view of flow behavior. A feature importance analysis was performed to refine the input space and help the learning algorithms focus on the most relevant variables. From the original 86 features, a final set of 70 features was selected based on importance thresholds and data cleaning. This subset included essential indicators from all three feature domains, reducing noise, redundancy, and computational costs while maintaining predictive power.
Eight classifiers (Logistic Regression, Naïve Bayes, k-Nearest Neighbors, Random Forest, Gradient Boosting, AdaBoost, LightGBM, and MLP) were used in both models. The results showed that adding engineered features significantly boosted performance across all classifiers. Model A consistently outperformed Model B, confirming the usefulness of the new behavioral attributes. Both Random Forest and Gradient Boosting achieved 98% accuracy, while AdaBoost, LightGBM, and MLP also performed well with 97% accuracy and dependable detection. In contrast, simpler models like Logistic Regression and Naïve Bayes fell behind, mainly when evaluated with ROC and Precision–Recall curves, which revealed their challenges in balancing precision and recall at high detection levels.
Unlike many previous studies that only reported accuracy, our evaluation used a wide range of metrics, including classification reports, confusion matrices, ROC curves, Precision–Recall curves, learning curves, and training times. This provides a complete view of how the model performs and its robustness. This thorough assessment builds confidence in the strength of our framework and reduces the chance of biased results caused by unbalanced datasets or single-metric assessments.
In conclusion, this research demonstrates that combining feature engineering with a proper setup for ensemble and neural network classifiers yields a robust and dependable approach for RAT detection. The method not only improves detection accuracy but also offers a reproducible benchmark for future studies. Going forward, testing the framework on real-world RAT traffic and incorporating explainable AI for better interpretability will further enhance its usefulness in cybersecurity defense.

Author Contributions

A.M.A. designed the methodology, built the model, processed the data, interpreted results, and wrote the first draft of the manuscript; M.F.-V. helped in the interpretation of the results, helped in the data visualization, revised and edited the manuscript, and provided supervision. A.H. helped with the experimental work, helped improve the results, and revised and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. El-Metwaly, A.E.S.; Abdelfattah, M.A.; Maher, N.M.; Hamed, M.; Tayel, E.M.; Al-Rifai, M.A.; Takieldeen, A.E. Remote Access Trojan (RAT) Attack: A Stealthy Cyber Threat Posing Severe Security Risks. In Proceedings of the International Telecommunications Conference (ITC), Cairo, Egypt, 22–25 July 2024. [Google Scholar] [CrossRef]
  2. Jiang, W.; Wu, X.; Cui, X.; Liu, C.A. Highly Efficient Remote Access Trojan Detection Method. Int. J. Digit. Crime Forensics 2019, 11, 1–13. [Google Scholar] [CrossRef]
  3. Sai, F.; Wang, X.; Yu, X.; Yan, P.; Ma, W. Recognition and Detection Technology for Abnormal Flow of Rebound Type Remote Control Trojan in Power Monitoring System. In Proceedings of the IEEE 6th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China, 24–26 February 2023. [Google Scholar] [CrossRef]
  4. Jiang, D.; Omote, K. An Approach to Detect Remote Access Trojan in the Early Stage of Communication. In Proceedings of the International Conference on Advanced Information Networking and Applications (AINA), Gwangju, Republic of Korea, 24–27 March 2015. [Google Scholar] [CrossRef]
  5. Guo, C.; Song, Z.; Ping, Y.; Shen, G.; Cui, Y.; Jiang, C. PRATD: A Phased Remote Access Trojan Detection Method with Double-Sided Features. Electronics 2020, 9, 1894. [Google Scholar] [CrossRef]
  6. Piet, J.; Anderson, B.; McGrew, D. An In-Depth Study of Open-Source Command and Control Frameworks. In Proceedings of the 13th International Conference on Malicious and Unwanted Software (MALWARE), Nantucket, MA, USA, 22–24 October 2018. [Google Scholar] [CrossRef]
  7. Valeros, V.; Garcia, S. Growth and Commoditization of Remote Access Trojans. In Proceedings of the 5th IEEE European Symposium on Security and Privacy Workshops (Euro S&PW), Genoa, Italy, 7–11 September 2020. [Google Scholar] [CrossRef]
  8. Bridges, R.; Hernandez Jimenez, J.; Nichols, J.; Goseva-Popstojanova, K.; Prowell, S. Towards Malware Detection via CPU Power Consumption: Data Collection Design and Analytics. In Proceedings of the 17th IEEE International Conference on Trust, Security And Privacy in Computing and Communications/12th IEEE International Conference On Big Data Science and Engineering (TrustCom/BigDataSE), New York, NY, USA, 1–3 August 2018. [Google Scholar] [CrossRef]
  9. Adachi, D.; Omote, K. A Host-Based Detection Method of Remote Access Trojan in the Early Stage. In Proceedings of the 12th International Conference on Information Security Practice and Experience (ISPEC), Zhangjiajie, China, 16–18 November 2016. [Google Scholar] [CrossRef]
  10. Zhang, H.; Zhang, W.; Lv, Z.; Sangaiah, A.K.; Huang, T.; Chilamkurti, N. MALDC: A Depth Detection Method for Malware Based on Behavior Chains. World Wide Web 2020, 23, 991–1010. [Google Scholar] [CrossRef]
  11. Moon, D.; Pan, S.B.; Kim, I. Host-Based Intrusion Detection System for Secure Human-Centric Computing. J. Supercomput. 2016, 72, 2520–2536. [Google Scholar] [CrossRef]
  12. Chandran, S.; Hrudya, P.; Poornachandran, P. An Efficient Classification Model for Detecting Advanced Persistent Threat. In Proceedings of the International Conference on Advances in Computing, Communications and Informatics (ICACCI), Kochi, India, 10–13 August 2015. [Google Scholar] [CrossRef]
  13. Liang, Y.; Peng, G.; Zhang, H.; Wang, Y. An Unknown Trojan Detection Method Based on Software Network Behavior. Wuhan Univ. J. Nat. Sci. 2013, 18, 369–376. [Google Scholar] [CrossRef]
14. Moser, A.; Kruegel, C.; Kirda, E. Limits of Static Analysis for Malware Detection. In Proceedings of the Annual Computer Security Applications Conference, ACSAC, Miami Beach, FL, USA, 10–14 December 2007; pp. 421–430.
15. Pendleton, M.; Garcia-Lebron, R.; Cho, J.H.; Xu, S. A Survey on Systems Security Metrics. ACM Comput. Surv. 2016, 49, 1–35.
16. Floroiu, I.; Floroiu, M.; Niga, A. Remote Access Trojans Detection Using Convolutional and Transformer-Based Deep Learning Techniques. Rom. Cyber Secur. J. 2024, 6, 47–58.
17. Li, S.; Yun, X.; Zhang, Y.; Xiao, J.; Wang, Y. A General Framework of Trojan Communication Detection Based on Network Traces. In Proceedings of the IEEE 7th International Conference on Networking, Architecture and Storage (NAS), Xiamen, China, 28–30 June 2012.
18. Jiang, D.; Omote, K. A RAT Detection Method Based on Network Behavior of the Communication’s Early Stage. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 2016, E99A, 145–153.
19. Jinlong, W.; Haidong, G.; Yixin, X. Closed-Loop Feedback Trojan Detection Technique Based on Hierarchical Model. In Proceedings of the 2015 Joint International Mechanical, Electronic and Information Technology Conference, Chongqing, China, 18–20 December 2015; pp. 240–243.
20. Yin, K.S.; Khine, M.A. Network Behavioral Features for Detecting Remote Access Trojans in the Early Stage. In Proceedings of the VI International Conference on Network, Communication and Computing (ICNCC), Kunming, China, 8–10 December 2017.
21. Sebakara, E.; Jonathan, K.N. Encrypted Remote Access Trojan Detection: A Machine Learning Approach with Real-World and Open Datasets. J. Inf. Technol. 2025, 5, 30–42.
22. Awad, A.A.; Sayed, S.G.; Salem, S.A. Collaborative Framework for Early Detection of RAT-Bots Attacks. IEEE Access 2019, 7, 71780–71790.
23. Aburbeian, A.M.; Ashqar, H.I. Credit Card Fraud Detection Using Enhanced Random Forest Classifier for Imbalanced Data. In Proceedings of the 2023 International Conference on Advances in Computing Research (ACR’23), Orlando, FL, USA, 8–10 May 2023.
24. Banerjee, R.; Bourla, G.; Chen, S.; Kashyap, M.; Purohit, S. Comparative Analysis of Machine Learning Algorithms through Credit Card Fraud Detection. In Proceedings of the IEEE MIT Undergraduate Research Technology Conference (URTC), Cambridge, MA, USA, 5–7 October 2018.
25. Rashid, S.J.; Baker, S.A.; Alsaif, O.I.; Ahmad, A.I. Detecting Remote Access Trojan (RAT) Attacks Based on Different LAN Analysis Methods. Eng. Technol. Appl. Sci. Res. 2024, 14, 17294–17301.
26. Cop, C. Trojan Detection. Available online: https://www.kaggle.com/datasets/subhajournal/trojan-detection/data (accessed on 6 July 2023).
27. Verdonck, T.; Baesens, B.; Óskarsdóttir, M.; vanden Broucke, S. Special Issue on Feature Engineering Editorial. Mach. Learn. 2024, 113, 3917–3928.
28. Caelen, O. A Bayesian Interpretation of the Confusion Matrix. Ann. Math. Artif. Intell. 2017, 81, 429–450.
29. Goutte, C.; Gaussier, E. A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation. In Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2005; Volume 3408, pp. 345–359.
30. Boyd, K.; Costa, V.S.; Davis, J.; Page, C.D. Unachievable Region in Precision-Recall Space and Its Effect on Empirical Evaluation. In Proceedings of the 2012 11th International Conference on Machine Learning and Applications, Washington, DC, USA, 12–15 December 2012; pp. 349–368. Available online: https://pmc.ncbi.nlm.nih.gov/articles/PMC3858955 (accessed on 5 June 2025).
31. Aburbeian, A.H.M.; Fernández-Veiga, M. Secure Internet Financial Transactions: A Framework Integrating Multi-Factor Authentication and Machine Learning. AI 2024, 5, 177–194.
32. Davis, J.; Goadrich, M. The Relationship between Precision-Recall and ROC Curves. In Proceedings of the 23rd International Conference on Machine Learning (ICML), Pittsburgh, PA, USA, 25–29 June 2006.
33. Mohr, F.; van Rijn, J.N. Learning Curves for Decision Making in Supervised Machine Learning: A Survey. Mach. Learn. 2024, 113, 8371–8425.
34. Awad, A.A.; Sayed, S.G.; Salem, S.A. A Host-Based Framework for RAT Bots Detection. In Proceedings of the International Conference on Computer and Applications (ICCA), Doha, Qatar, 6–7 September 2017.
35. Awad, A.A.; Sayed, S.G.; Salem, S.A. A Network-Based Framework for RAT-Bots Detection. In Proceedings of the 8th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, 3–5 October 2017.
36. Dehkordy, D.T.; Rasoolzadegan, A. DroidTKM: Detection of Trojan Families Using the KNN Classifier Based on Manhattan Distance Metric. In Proceedings of the 10th International Conference on Computer and Knowledge Engineering (ICCKE), Mashhad, Iran, 29–30 October 2020.
37. Kanaker, H.; Karim, A.; Awwad, S.A.B.; Ismail, N.H.A.; Zraqou, J.; Al Ali, A.M.F. Trojan Horse Infection Detection in Cloud Based Environment Using Machine Learning. Int. J. Interact. Mob. Technol. 2022, 16, 81–106.
38. Razak, M.F.A.; Jaya, M.I.; Ismail, Z.; Firdaus, A. Trojan Detection System Using Machine Learning Approach. Indones. J. Inf. Syst. 2022, 5, 38–47.
39. Pi, B.; Guo, C.; Cui, Y.; Shen, G.; Yang, J.; Ping, Y. Remote Access Trojan Traffic Early Detection Method Based on Markov Matrices and Deep Learning. Comput. Secur. 2024, 137, 103628.
40. Tran, G.; Hoang, A.; Bui, T.; Tong, V.; Tran, D. A Deep Learning Approach to Early Identification of Remote Access Trojans. In Proceedings of the International Symposium on Information and Communication Technology (SOICT), Danang, Vietnam, 13–15 December 2024.
41. Ritzkal; Hendrawan, A.H.; Kurniawan, R.; Aprian, A.J.; Primasari, D.; Subchan, M. Enhancing Cybersecurity Through Live Forensic Investigation of Remote Access Trojan Attacks Using FTK Imager Software. Int. J. Saf. Secur. Eng. 2024, 14, 217.
42. Safdar, H.; Seher, I.; Elgamal, E.; Prasad, P.W.C. A Review of Machine Learning-Based Trojan Detection Techniques for Securing IoT Edge Devices. In Proceedings of the 3rd International Conference on Intelligent Education and Intelligent Research (IEIR), Macau, China, 6–8 November 2024.
43. Khan, S.U.; Nabil, M.; Mahmoud, M.M.E.A.; AlSabaan, M.; Alshawi, T. Trojan Attack and Defense for Deep Learning Based Power Quality Disturbances Classification. IEEE Trans. Netw. Sci. Eng. 2025, 12, 3962–3974.
44. Jin, L.; Wen, X.; Jiang, W.; Zhan, J.; Zhou, X. Trojan Attacks and Countermeasures on Deep Neural Networks from Life-Cycle Perspective: A Review. ACM Comput. Surv. 2025, 57, 1–37.
Figure 2. Research methodology workflow for RAT detection.
Figure 3. Distribution of RAT and benign samples in the dataset.
Figure 4. Feature engineering and selection roadmap for Model A and Model B.
Figure 5. Confusion matrix of Model A for RAT and benign classification.
Figure 6. ROC curves of Model A classifiers for distinguishing RAT and benign samples.
Figure 7. Precision–Recall curves of Model A classifiers for RAT and benign detection.
Figure 8. Learning curves and training time comparison of Model A classifiers.
Figure 9. Feature contribution in ensemble classifiers for Model A.
Table 1. Feature engineering results.

Source Feature | Derived Feature | Classification | Description
Timestamp | FlowDate | Network-based | The calendar date of the flow. Used for grouping flows by day for aggregation.
Timestamp | Hour | Network-based | The hour of the day when the flow occurred (0–23). Helps identify time-of-day attack patterns.
Timestamp | DayOfWeek | Network-based | Day of the week (0 = Monday, 6 = Sunday). Used to detect weekday vs. weekend behavior variations.
Timestamp | SecondsSinceMidnight | Network-based | Total seconds elapsed since midnight. Provides more precise temporal behavior within a day.
Source IP | SourceIP_FlowCount | Network-behavioral | Total number of flows initiated by the source IP. Helps distinguish between active/inactive devices.
Source IP + Timestamp | TimeDiffFromLastFlow | Network-behavioral | Time difference (in seconds) between the current flow and the previous one from the same source IP. Indicates communication frequency.
Source IP + Destination IP | UniqueDestinations | Network-behavioral | Number of unique destination IPs contacted by a source IP. A higher number may indicate scanning or RAT behavior.
Source IP + Destination IP | UniqueSources | Network-behavioral | Number of distinct source IPs contacting a destination. Can highlight unusual popularity.
Source IP + Flow Duration | AvgFlowDuration | Network-behavioral | Average flow duration per source IP. Helps characterize the typical length of communications from a source.
DayOfWeek | IsWeekend | Network-based | Binary value indicating whether the flow occurred on a weekend (1) or a weekday (0).
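As an illustrative sketch only (not the authors' exact implementation), features of the kind listed in Table 1 can be derived with pandas group-by and transform operations. The DataFrame variable `flows`, the file name, and the raw column names `Timestamp`, `Source IP`, `Destination IP`, and `Flow Duration` are taken from the table or assumed for this example.

```python
import pandas as pd

# Hypothetical flow table; raw column names follow Table 1.
flows = pd.read_csv("trojan_detection.csv")
flows["Timestamp"] = pd.to_datetime(flows["Timestamp"])

# Time-based features derived from the timestamp.
flows["FlowDate"] = flows["Timestamp"].dt.date
flows["Hour"] = flows["Timestamp"].dt.hour
flows["DayOfWeek"] = flows["Timestamp"].dt.dayofweek  # 0 = Monday, 6 = Sunday
flows["SecondsSinceMidnight"] = (
    flows["Timestamp"] - flows["Timestamp"].dt.normalize()
).dt.total_seconds()
flows["IsWeekend"] = (flows["DayOfWeek"] >= 5).astype(int)

# Behavioral aggregates per source / destination IP.
flows["SourceIP_FlowCount"] = flows.groupby("Source IP")["Source IP"].transform("count")
flows["UniqueDestinations"] = flows.groupby("Source IP")["Destination IP"].transform("nunique")
flows["UniqueSources"] = flows.groupby("Destination IP")["Source IP"].transform("nunique")
flows["AvgFlowDuration"] = flows.groupby("Source IP")["Flow Duration"].transform("mean")

# Time since the previous flow from the same source IP.
flows = flows.sort_values("Timestamp")
flows["TimeDiffFromLastFlow"] = (
    flows.groupby("Source IP")["Timestamp"].diff().dt.total_seconds().fillna(0)
)
```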
Table 2. Final model features description [27].

No. | Feature | Description | Importance Score
1 | DayOfWeek | Day of the week the flow was captured (0 = Monday, 6 = Sunday) | 0.247309
2 | SecondsSinceMidnight | Time of the flow in seconds since midnight | 0.116366
3 | IsWeekend | Whether the flow occurred during the weekend (1) or not (0) | 0.108793
4 | Hour | Hour of the day when the flow occurred | 0.070363
5 | SourceIP_FlowCount | Number of flows originating from the same Source IP | 0.052595
6 | UniqueDestinations | Number of distinct Destination IPs contacted by a Source IP | 0.047872
7 | Source IP | IP address of the sender of the flow (hashed) | 0.043232
8 | AvgFlowDuration | Average duration of flows for a given Source IP | 0.038080
9 | UniqueSources | Number of unique Source IPs contacting a Destination IP | 0.018321
10 | Destination IP | IP address of the receiver of the flow (hashed) | 0.013639
11 | Source Port | Port number at the source device | 0.011500
12 | Flow Duration | Duration of the flow in microseconds | 0.009645
13 | Flow IAT Min | Minimum inter-arrival time between packets in the flow | 0.009489
14 | Flow IAT Max | Maximum inter-arrival time between packets in the flow | 0.009452
15 | Flow IAT Mean | Average inter-arrival time between packets in the flow | 0.009449
16 | Flow Packets/s | Rate of packets per second in the flow | 0.009331
17 | Fwd Packets/s | Rate of forward direction packets per second | 0.009276
18 | Init_Win_bytes_forward | Initial window size in bytes in the forward direction | 0.008862
19 | Destination Port | Port number at the destination device | 0.007974
20 | Bwd Packets/s | Rate of backward direction packets per second | 0.006739
21 | Fwd IAT Max | Maximum inter-arrival time in the forward direction | 0.006585
22 | Fwd IAT Total | Total inter-arrival time in the forward direction | 0.006416
23 | Init_Win_bytes_backward | Initial window size in bytes in the backward direction | 0.006255
24 | Fwd IAT Mean | Average inter-arrival time in the forward direction | 0.006102
25 | Fwd IAT Min | Minimum inter-arrival time in the forward direction | 0.005927
26 | TimeDiffFromLastFlow | Time since last flow from the same Source IP | 0.005879
27 | Flow Bytes/s | Rate of bytes per second in the flow | 0.004903
28 | Fwd IAT Std | Standard deviation of inter-arrival time in forward direction | 0.004480
29 | Flow IAT Std | Standard deviation of inter-arrival time in the flow | 0.003865
30 | Fwd Header Length | Header length of packets in the forward direction | 0.003789
31 | Packet Length Mean | Average length of packets in the flow | 0.003656
32 | Fwd Packet Length Mean | Average packet length in forward direction | 0.003577
33 | Packet Length Std | Standard deviation of packet lengths | 0.003453
34 | Fwd Header Length.1 | Duplicate of forward header length | 0.003444
35 | Avg Fwd Segment Size | Average segment size in forward direction | 0.003373
36 | Average Packet Size | Average size of packets in the flow | 0.003372
37 | Fwd Packet Length Max | Maximum packet length in forward direction | 0.003364
38 | Subflow Fwd Bytes | Total bytes sent in subflow forward direction | 0.003334
39 | Total Length of Fwd Packets | Sum of lengths of forward packets | 0.003305
40 | Packet Length Variance | Variance of packet lengths | 0.003305
41 | Bwd Packet Length Mean | Average packet length in backward direction | 0.003262
42 | Subflow Bwd Bytes | Total bytes sent in subflow backward direction | 0.002940
43 | Bwd Header Length | Header length of packets in the backward direction | 0.002908
44 | Avg Bwd Segment Size | Average segment size in backward direction | 0.002906
45 | Total Length of Bwd Packets | Sum of lengths of backward packets | 0.002833
46 | Bwd Packet Length Max | Maximum packet length in backward direction | 0.002690
47 | Bwd IAT Min | Minimum inter-arrival time in backward direction | 0.002667
48 | Max Packet Length | Maximum packet length in the flow | 0.002644
49 | min_seg_size_forward | Minimum segment size in forward direction | 0.002493
50 | Bwd IAT Total | Total inter-arrival time in backward direction | 0.002486
51 | Bwd IAT Max | Maximum inter-arrival time in backward direction | 0.002480
52 | Fwd Packet Length Std | Standard deviation of packet lengths (forward) | 0.002471
53 | Bwd IAT Mean | Average inter-arrival time in backward direction | 0.002301
54 | Bwd IAT Std | Standard deviation of inter-arrival times (backward) | 0.001968
55 | Bwd Packet Length Std | Standard deviation of packet lengths (backward) | 0.001885
56 | Min Packet Length | Minimum packet length in the flow | 0.001853
57 | Bwd Packet Length Min | Minimum packet length in backward direction | 0.001795
58 | Subflow Fwd Packets | Number of packets sent in subflow forward | 0.001663
59 | Total Fwd Packets | Total number of forward packets | 0.001584
60 | Total Backward Packets | Total number of backward packets | 0.001560
61 | Fwd Packet Length Min | Minimum packet length in forward direction | 0.001463
62 | Subflow Bwd Packets | Number of packets sent in subflow backward | 0.001425
63 | Idle Mean | Average idle time between flows | 0.001324
64 | Idle Min | Minimum idle time between flows | 0.001224
65 | Idle Max | Maximum idle time between flows | 0.001181
66 | Active Min | Minimum active time between packets | 0.001162
67 | Active Mean | Average active time between packets | 0.001141
68 | Active Max | Maximum active time between packets | 0.001110
69 | URG Flag Count | Number of packets with the URG flag set | 0.001096
70 | Class | Target label (0 = Benign, 1 = Trojan) | 1.000000
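As a hedged illustration of how importance scores like those in Table 2 can be produced (the paper's exact ranking procedure is not reproduced here), the sketch below ranks features using a fitted Random Forest. The variables `X_train` and `y_train` are the assumed engineered feature matrix and the Class label, respectively.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# X_train: engineered feature DataFrame, y_train: Class labels (0 = benign, 1 = Trojan); assumed to exist.
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

# Impurity-based importance scores, sorted from most to least informative.
importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
print(importances.head(10))
```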
Table 3. Data splitting details for both models.

Split | Class 0 (Benign) | Class 1 (Trojan) | Total Samples | Class Balance
Training Set | 69,439 | 72,546 | 141,985 | ~49/51
Testing Set | 17,360 | 18,137 | 35,497 | ~49/51
Balanced Training Set | 72,546 | 72,546 | 145,092 | 50/50
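The figures in Table 3 correspond to an 80/20 train/test partition, with the training set then balanced to 72,546 samples per class. The sketch below shows one plausible way to obtain such a split; the choice of scikit-learn and imbalanced-learn, the random seeds, and the variable names are assumptions, not the authors' stated procedure.

```python
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# X: feature matrix, y: Class label; assumed to be defined as in the earlier sketches.
# 80/20 stratified split (matches the ~142k / ~35k sample counts in Table 3).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample the minority (benign) class so both classes reach 72,546 training samples.
X_train_bal, y_train_bal = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)
```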
Table 4. Classification report results of Model A: benign (0); Trojan (1).

Algorithm | Class | Precision | Recall | F1-Score
Logistic Regression (LR) | 0 | 76% | 86% | 81%
Logistic Regression (LR) | 1 | 85% | 74% | 79%
Logistic Regression (LR) | Accuracy | – | – | 80%
Naïve Bayes (NB) | 0 | 70% | 86% | 77%
Naïve Bayes (NB) | 1 | 83% | 64% | 72%
Naïve Bayes (NB) | Accuracy | – | – | 75%
k-nearest neighbors (KNN) | 0 | 93% | 94% | 93%
k-nearest neighbors (KNN) | 1 | 94% | 93% | 94%
k-nearest neighbors (KNN) | Accuracy | – | – | 94%
Random Forest (RF) | 0 | 99% | 98% | 98%
Random Forest (RF) | 1 | 98% | 99% | 98%
Random Forest (RF) | Accuracy | – | – | 98%
Gradient Boosting (GB) | 0 | 100% | 96% | 98%
Gradient Boosting (GB) | 1 | 96% | 100% | 98%
Gradient Boosting (GB) | Accuracy | – | – | 98%
AdaBoost | 0 | 97% | 96% | 97%
AdaBoost | 1 | 96% | 98% | 97%
AdaBoost | Accuracy | – | – | 97%
LightGBM | 0 | 100% | 94% | 97%
LightGBM | 1 | 95% | 100% | 97%
LightGBM | Accuracy | – | – | 97%
Multilayer Perceptron (MLP) | 0 | 98% | 96% | 97%
Multilayer Perceptron (MLP) | 1 | 96% | 98% | 97%
Multilayer Perceptron (MLP) | Accuracy | – | – | 97%
Table 5. Classification report results of Model B: benign (0); Trojan (1).

Algorithm | Class | Precision | Recall | F1-Score
Logistic Regression (LR) | 0 | 49% | 51% | 50%
Logistic Regression (LR) | 1 | 51% | 50% | 51%
Logistic Regression (LR) | Accuracy | – | – | 50%
Naïve Bayes (NB) | 0 | 50% | 87% | 64%
Naïve Bayes (NB) | 1 | 59% | 17% | 27%
Naïve Bayes (NB) | Accuracy | – | – | 52%
k-nearest neighbors (KNN) | 0 | 53% | 53% | 53%
k-nearest neighbors (KNN) | 1 | 55% | 54% | 54%
k-nearest neighbors (KNN) | Accuracy | – | – | 54%
Random Forest (RF) | 0 | 78% | 86% | 82%
Random Forest (RF) | 1 | 85% | 77% | 81%
Random Forest (RF) | Accuracy | – | – | 81%
Gradient Boosting (GB) | 0 | 68% | 89% | 77%
Gradient Boosting (GB) | 1 | 86% | 60% | 70%
Gradient Boosting (GB) | Accuracy | – | – | 74%
AdaBoost | 0 | 68% | 85% | 76%
AdaBoost | 1 | 81% | 62% | 70%
AdaBoost | Accuracy | – | – | 73%
LightGBM | 0 | 70% | 90% | 79%
LightGBM | 1 | 87% | 63% | 73%
LightGBM | Accuracy | – | – | 76%
Multilayer Perceptron (MLP) | 0 | 70% | 84% | 77%
Multilayer Perceptron (MLP) | 1 | 81% | 66% | 73%
Multilayer Perceptron (MLP) | Accuracy | – | – | 75%
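The per-class precision, recall, and F1 values in Tables 4 and 5 follow the standard classification-report layout. A minimal sketch of how such a report can be generated for one classifier (Random Forest is used here as an example) is shown below; the variable names continue the assumptions of the previous sketches.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Train on the balanced training set and evaluate on the held-out test set.
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train_bal, y_train_bal)
y_pred = rf.predict(X_test)

# Per-class precision/recall/F1 plus overall accuracy, in the same format as Tables 4 and 5.
print(classification_report(y_test, y_pred, target_names=["Benign (0)", "Trojan (1)"]))
```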
Table 6. Inference performance comparison of Model A classifiers: LR: logistic regression; NB: naïve Bayes; KNN: k-nearest neighbors; RF: random forest; GB: gradient boosting; MLP: multilayer perceptron.

Classifier | Avg Latency (ms/sample) | Memory Usage (MB) | CPU Utilization (%)
LR | 0.000625 | 1616.863281 | 201.6
NB | 0.001455 | 1616.898438 | 101.9
KNN | 1.986421 | 1695.187500 | 157.1
RF | 0.015431 | 1690.187500 | 97.1
GB | 0.002177 | 1703.367188 | 100.2
AdaBoost | 0.011036 | 1703.367188 | 96.2
LightGBM | 0.002385 | 1645.976562 | 180.3
MLP | 0.002038 | 1703.417969 | 380.6
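Latency, memory, and CPU figures like those in Table 6 can be approximated with simple process-level instrumentation. The sketch below is one way to collect comparable numbers using `time` and `psutil`; the choice of tools and the measurement window are assumptions, not the authors' exact procedure, and `rf` and `X_test` are carried over from the previous sketch.

```python
import time
import psutil

proc = psutil.Process()  # current Python process

# Average inference latency per sample over the whole test set.
start = time.perf_counter()
_ = rf.predict(X_test)
elapsed = time.perf_counter() - start
print(f"Avg latency: {1000 * elapsed / len(X_test):.6f} ms/sample")

# Resident memory of the process after inference, in MB.
print(f"Memory usage: {proc.memory_info().rss / 2**20:.2f} MB")

# CPU utilization sampled over a 1-second window (can exceed 100% on multi-core machines).
print(f"CPU utilization: {proc.cpu_percent(interval=1.0):.1f} %")
```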
Table 7. Related studies’ results. RF: random forest; KNN: k-nearest neighbors; NB: naïve Bayes; MLP: multilayer perceptron.

Reference | Year | Classifier | Accuracy
[18] | 2016 | RF | 97.1%
[18] | 2016 | KNN | 91.9%
[18] | 2016 | NB | 43%
[34] | 2017 | RF | 95.2%
[35] | 2017 | RF | 99.7%
[20] | 2017 | NB | 96.5%
[22] | 2019 | RF | 99.5%
[36] | 2020 | KNN | 97.8%
[5] | 2020 | AdaBoost | 92%
[37] | 2022 | MLP | 95.8%
[37] | 2022 | RF | 95.6%
[38] | 2022 | NB | 88.2%
[38] | 2022 | RF | 100%
[21] | 2025 | RF | 74.8%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.