Sensors
  • Article
  • Open Access

1 January 2025

Obfuscated Malware Detection and Classification in Network Traffic Leveraging Hybrid Large Language Models and Synthetic Data

1 Computer and Software Engineering Department, College of Electrical and Mechanical Engineering, National University of Sciences and Technology (NUST), Islamabad 44080, Pakistan
2 Cybersecurity Center, Prince Mohammad Bin Fahd University, 617, Al Jawharah, Khobar, Dhahran 34754, Saudi Arabia
3 Computer Science Department, HITEC University, Taxila 47080, Pakistan
4 Department of Computer Science, College of Computer Sciences and Information Technology (CCSIT), King Faisal University, P.O. Box 400, Al-Ahsa 31982, Saudi Arabia
This article belongs to the Special Issue AI Technology for Cybersecurity and IoT Applications

Abstract

Android malware detection remains a critical issue for mobile security. Cybercriminals target Android because it is the most popular smartphone operating system (OS). Malware detection, analysis, and classification have become diverse research areas. This paper presents a smart sensing model based on large language models (LLMs) for detecting and classifying Android malware in network traffic. The network traffic that Android apps constantly exchange may carry harmful components that can damage these apps. However, one of the main challenges in developing smart sensing systems for malware analysis is the scarcity of traffic data due to privacy concerns. To overcome this, a two-step smart sensing model, Syn-detect, is proposed. The first step generates synthetic TCP malware traffic data with malicious content using GPT-2. These data are then preprocessed and used in the second step, which focuses on malware classification. This phase leverages a fine-tuned LLM, Bidirectional Encoder Representations from Transformers (BERT), with classification layers. BERT is responsible for tokenization, generating word embeddings, and classifying malware. The Syn-detect model was tested on two Android malware datasets: CIC-AndMal2017 and CIC-AAGM2017. The model achieved an accuracy of 99.8% on CIC-AndMal2017 and 99.3% on CIC-AAGM2017. The Matthews Correlation Coefficient (MCC) values for the predictions were 0.99 for CIC-AndMal2017 and 0.98 for CIC-AAGM2017. These results demonstrate the strong performance of the Syn-detect smart sensing model; compared with the latest research in Android malware classification, it outperformed other approaches, delivering promising results.

1. Introduction

Technology is progressing and becoming an essential part of everyday life. Cybersecurity threats to personal digital spaces are increasingly critical, as digital systems face new malware every day []. Traditional malware detection and classification systems are mostly signature-based and may not identify daily evolving patterns, so machine learning and AI-based models have emerged as new means to address the issue [,]. Malware is classified into groups such as adware, ransomware, trojans, and SMS malware. Classification is crucial because it indicates the kind of harm each malware type can cause and how a system should respond depending on the type of malicious attack []. Malware attacks can lead to anything from mild to catastrophic losses []. Among hand-held devices, Android is the most widely used mobile operating system []. Android apps are installed for purposes such as online payment, medical record-keeping, gaming, and education []. Because Android apps are so widely used, they are particularly exposed to malicious attacks [].

1.1. AI in Malware Detection and Classification Paradigm

Because malware patterns are constantly emerging, AI and machine learning have proved successful for malware detection and classification. Malware classification lends itself to machine learning, and classifiers such as Naïve Bayes, Logistic Regression, and K-Nearest Neighbors have been deployed successfully for the job []. Algorithms such as Convolutional Neural Networks, Random Forests, Hidden Markov Models, and Support Vector Machines have also been used for malware detection and classification []. However, as detection models advance, attackers adapt in turn, and machine learning algorithms show deteriorating performance. This is pushing the domain of AI-based malware detection toward explainable AI and making it more challenging [,].
Similarly, natural language processing (NLP) methods are making their way into the threat detection paradigm []. NLP methods are used to visualize the internal structure of malware patterns []. NLP extracts keywords from a textual representation and grasps the semantics of the text, so this approach has become a source of hope and interest for researchers seeking traces of malicious content in network traffic []. NLP techniques such as Bag of Words (BOW), Term Frequency-Inverse Document Frequency (TF-IDF) matrices, Word2vec, and FastText are promising for uncovering patterns in malware data. Because sequential information is of prime concern in malware detection models, Word2vec and FastText are competitive approaches [,]. NLP further extends these capabilities with larger deep learning models for text processing and classification, namely transformer models such as BERT, RoBERTa [], and DeBERTa []. Transformer models can process large and complex data, so they are becoming popular for complex problems such as malware detection datasets [].

1.2. Challenges with Malware Detection and Classification

A key challenge in collecting Android malware data is the scarcity of certain classes []. High-dimensional and imbalanced malware data degrade the performance of machine learning and deep learning models []. The dynamic, evolving nature of malware is another key challenge when these models are applied to malware detection systems []. The dataset plays a key role in the success of deep learning models, and the availability of large datasets is central to the success of NLP; in real scenarios, however, malware detection systems face imbalanced samples []. It is therefore important to understand how models behave in such challenging conditions []. One of the foremost pathways for harming an Android app is network connectivity: since Android apps mostly operate over the network, malicious content may reach them through network traffic. Malware detection and classification researchers are therefore always concerned with inspecting network traffic for traces of malware, and approaches that use network traffic data can discover malware patterns for accurate classification []. A robust, intelligent malware detection system with high precision and recall across multiple malware types is a necessity for today's digital society, where most daily tasks are performed through Android apps and malware patterns evolve exponentially [].

1.3. Contributions of the Current Study

Considering the importance of TCP flows in network traffic for mining traces of malware, we focus on developing an enhanced smart sensing malware detection and classification model based on transformer NLP. The main contributions of the paper are as follows:
  • A custom dataset is prepared by extracting the TCP data flow from network traffic and unpacking packet data from a benchmark dataset. The features are then analyzed and transformed into a labeled dataset for the transformer model.
  • Synthetic TCP data are created through level-2 sentence augmentation driven by contextual word embeddings. A transformer-based GPT-2 model is employed to address the scarcity of malware data.
  • An adapted BERT model is developed by incorporating classification layers for malware detection and classification. The features are extracted as word embeddings using a BERT Tokenizer.
A comprehensive literature review is conducted to evaluate the uses of transformers and LLMs in malware detection and classification. The review explains how LLMs are used to detect malware, and the challenges associated with these models are considered when defining the research gaps. The experiments are carried out in a Google Colaboratory Python notebook. The structure of the paper is as follows: Section 2 presents the related work. Section 3 explains the Syn-detect model's architecture for malware classification. Section 4 provides an insight into the experiments and results. Section 5 presents a comparison of the model's results, and Section 6 describes the conclusions and future work. Declarations, contributions, and references follow.

3. Methodology

The overall framework of the Syn-detect model is represented in Figure 1. Figure 1 illustrates the process of malware classification through the LLM-based hybrid model, starting with the extraction of TCP protocol segments from the network traffic of Android apps captured in PCAP files. The TCP data are fed into the synthetic data generator to produce synthetic samples of the minority class and balance the data. These samples are combined with the raw data to form a new dataset, which, after preprocessing, is supplied to the tokenizer. The tokenizer transforms the textual data into tokens, a representation the transformer accepts. The adapted BERT model is fine-tuned to classify the data into malware families. Evaluation of the results reveals how well the Syn-detect model works.
Figure 1. Syn-detect model of malware classification.

3.1. Extraction of TCP Traffic Data

Network traffic data are collected in the form of PCAP files. Network traffic is assembled in the form of 5-tuple data having a source IP, destination IP, source port, destination port, and protocol. Table 1 illustrates the significance of each attribute in the traffic data.
Table 1. Attributes of the network traffic.
We used Wireshark to read the PCAP data. The packet data are extracted through the export packet dissections operation in Wireshark []. Traffic flows from sender to receiver are determined by all of the attributes listed in Table 1. Among these attributes, we use the protocol to select the data. Communication protocols include UDP, TCP, and HTTP; protocols define the mechanism of communication and data transfer between sender and receiver. TCP is a vital protocol for network services and applications []. TCP allows in-depth analysis of packet sequences, which helps in examining malicious content during communication. For data collection, we therefore extracted the TCP data packets from the network traffic. Algorithm 1 represents the mechanism for extracting TCP packet data from the network traffic.
Algorithm 1 Extraction of TCP Traffic Data
Input: Network Traffic Data Pc
Output: TCP Traffic Data TCP_Pc
Initialization:
1:    TCP_Pc = {}
2: For each packet Pc_i in Pc do:
3:    Protocol = ExtractProtocol(Pc_i)
4:    If Protocol == ‘TCP’ then:
5:         Add Pc_i to TCP_Pc
6: Return TCP_Pc
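As a concrete companion to Algorithm 1, the following is a minimal Python sketch of the same filtering step. It assumes the scapy library rather than the Wireshark export operation described above, and the PCAP file name is hypothetical.

```python
from scapy.all import rdpcap
from scapy.layers.inet import TCP

def extract_tcp_packets(pcap_path):
    """Algorithm 1: keep only the TCP packets from a capture file."""
    packets = rdpcap(pcap_path)  # read every packet in the PCAP
    return [pkt for pkt in packets if pkt.haslayer(TCP)]

tcp_packets = extract_tcp_packets("android_traffic.pcap")  # hypothetical file name
print(f"Retained {len(tcp_packets)} TCP packets")
```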

3.2. Synthetic Sample Generator and Data Augmentation

The retrieved TCP packet dataset is evaluated to identify a candidate class for synthetic data generation. Classes that are underrepresented in the dataset tend to be misclassified; the lack of learning examples from these classes lowers precision and recall and thus reduces classifier efficiency. Classes with fewer than the average number of samples are treated as minority classes, and sentence augmentation is conducted on their samples. Sentence augmentation at level 2 generates two similar sentences []. GPT-2, accessed through the Hugging Face-backed NLPAug library (version 1.1.11), is used to generate the synthetic samples, keeping the generation level at 2 []; a Python sketch of this step follows Algorithm 2. Once the minority classes have been identified and new samples obtained, the new data are combined with the existing data to produce the final dataset. Algorithm 2 depicts the mechanism of the synthetic sample generator and the augmentation process for generating the custom dataset.
Algorithm 2 Synthetic Sample Generator and Data Augmentation
Input: TCP Traffic Data TCP_Pc
Output: Augmented Traffic Data Aug_Pc
Initialization:
1: Aug_Pc = TCP_Pc
2: For each Class C_Pc in TCP_Pc do:
3:    Packet_Pc = Count C_Pc
4:    If Packet_Pc < ‘Threshold’ then:
5:         AugPacket_Pc = Generate2Samples for Packet_Pc
6:         Aug_Pc = Aug_Pc ∪ AugPacket_Pc
7: Return Aug_Pc
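The augmentation step in Algorithm 2 can be sketched with the NLPAug library named above. This is a minimal sketch, not the authors' exact code; the sample flow text is hypothetical.

```python
import nlpaug.augmenter.sentence as nas

# GPT-2-backed sentence augmenter from NLPAug (version 1.1.11);
# model_path selects the Hugging Face checkpoint used for generation
aug = nas.ContextualWordEmbsForSentenceAug(model_path="gpt2")

sample = "192.168.1.4 49152 -> 10.0.0.8 443 tcp psh ack len 512"  # hypothetical minority-class flow text
synthetic = aug.augment(sample, n=2)  # level 2: two augmented variants per sample
```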
The quality of the synthetic data generated by GPT-2 was thoroughly validated by calculating the Kullback–Leibler (KL) divergence between the synthetic data and actual traffic data []. KL divergence provides a quantitative measure of how the probability distribution of the synthetic traffic data diverges from that of the original traffic data, which helps us to evaluate the realism and diversity of the generated samples. This analysis helps to prove the effectiveness of the synthetic traffic data in approximating real-world traffic patterns. To compute the KL divergence of traffic data, we vectorized the traffic data using TF-IDF. Once the two vectors are obtained as original traffic data and synthetic traffic data, the KL divergence between the original text data distribution P and the synthetic text data distribution Q, based on their TF-IDF values, is given by Equation (1):
$$ D_{KL}(P \| Q) = \sum_{i} P(t_i) \log \frac{P(t_i)}{Q(t_i)} \qquad (1) $$
where we have the following:
  • $P(t_i)$ is the TF-IDF value of term $t_i$ in the original text data;
  • $Q(t_i)$ is the TF-IDF value of term $t_i$ in the synthetic text data;
  • The sum is taken over all terms $t_i$ in the vocabulary.
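To make Equation (1) concrete, the following is a minimal sketch of the KL divergence computation over TF-IDF representations, assuming scikit-learn and SciPy. Summing the per-document TF-IDF vectors into a single term distribution and the smoothing constant eps are our assumptions, not details stated above.

```python
import numpy as np
from scipy.stats import entropy
from sklearn.feature_extraction.text import TfidfVectorizer

def kl_divergence_tfidf(original_docs, synthetic_docs, eps=1e-10):
    """D_KL(P || Q) between TF-IDF term distributions, as in Equation (1)."""
    vectorizer = TfidfVectorizer()
    vectorizer.fit(original_docs + synthetic_docs)  # shared vocabulary for both corpora
    p = np.asarray(vectorizer.transform(original_docs).sum(axis=0)).ravel() + eps
    q = np.asarray(vectorizer.transform(synthetic_docs).sum(axis=0)).ravel() + eps
    return entropy(p, q)  # SciPy normalizes p and q, then computes sum p*log(p/q)
```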

3.3. Ethical Considerations with Synthetic Data

The use of synthetic data provides an opportunity to accelerate research and model training. However, responsible use of synthetic data requires attention to the associated risks and challenges. Since synthetic data mimic the original data, it is vital to keep ethical issues in view. A report by [] guides how practitioners and innovators can responsibly use synthetic data. The synthetic malware traffic samples developed in this study are aimed at improving the performance of models used in cybersecurity research. The synthetic malware dataset will not be publicly released; it will only be used internally for research purposes or shared with trusted researchers involved in the project. We aim to prevent any potential harm that could arise from the creation of synthetic data.

3.4. Features Analysis

The augmented data generated in the previous step undergo preprocessing with the NLTK library to remove spaces and NULL values []. The preprocessed data are then fed into the tokenizer. Tokenization converts the textual data into a format understandable by the LLM. The BERT Tokenizer from the Transformers library (version 4.42.4) is used in this model []. The tokenizer splits the data into smaller tokens following the WordPiece policy, so each sentence is broken into subword units. Special tokens [CLS], [SEP], and [PAD] are added to the tokens; they support efficient handling of the text by maintaining a consistent representation for the model to perform the classification. The attention mask distinguishes the original tokens from the padded ones, and the tokenizer generates a token ID for each token. Ultimately, the BERT Tokenizer outputs the attention mask and token IDs, which, together with the encoded labels, form the input to the BERT model for classification. Figure 2 illustrates the tokenization process used in the Syn-detect model.
Figure 2. Features analysis process.
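A minimal sketch of this tokenization step, assuming the bert-base-uncased checkpoint from the Transformers library; the flow string is hypothetical.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer(
    ["10.42.0.7 52100 -> 172.16.3.2 443 tcp syn"],  # hypothetical TCP flow text
    padding="max_length",  # shorter sequences are padded with [PAD]
    truncation=True,       # longer sequences are truncated
    max_length=256,        # fixed input shape used by the model
    return_tensors="tf",
)
# [CLS] and [SEP] are inserted automatically; these two tensors feed the model
input_ids, attention_mask = encoded["input_ids"], encoded["attention_mask"]
```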

3.5. Classification and Evaluation

BERT is a model pre-trained to produce bidirectional representations of textual data [,]. Additional layers can be added to BERT to fine-tune it for a specific NLP task such as classification. BERT is a state-of-the-art model that has proved its computational and empirical power in NLP research, so it is fine-tuned here through hyperparameter adjustment and the inclusion of new layers to perform malware classification in the Syn-detect model. The BERT unit of the model's structure is explained below []:
  • The first layer of the model is the input layer. It receives the input as the attention mask and input IDs, which the tokenizer generates from the textual data. TensorFlow is used for the model representation in Python, so the input is accepted as tensors []. The input tensor shape is fixed at 256 tokens; shorter sequences are padded and longer sequences truncated to keep the input consistent.
  • The input is fed to the pre-trained BERT model, a BERT-base uncased model from the Transformers library (4.42.4). BERT processes the data and produces two major outputs: the sequence of hidden states and the pooled output. For classification, the representation produced by the pooled output layer is passed on to the new layers designed for classification.
  • An intermediate dense layer follows: a fully connected layer with the Rectified Linear Unit (ReLU) activation function to enhance the learning capability of the model. This layer has 512 units, so it reduces the dimensionality of the features generated by the BERT pooled output layer while keeping the information required for classification intact.
  • Finally, another dense layer, the output layer, performs the classification and generates the probability distribution over the classes. The SoftMax activation function is used to calculate the class probabilities.
The designed model utilizes the powerful embeddings generated by BERT and additional layers to perform the classification over multiclass data. The adapted BERT model for the proposed approach is shown in Figure 3. The model possesses 109,878,533 trainable parameters.
Figure 3. Adapted BERT model in Syn-detect.
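The adapted architecture can be sketched in TensorFlow/Keras as follows. This is a minimal reconstruction from the description above, not the authors' exact code; the helper name build_syn_detect is ours.

```python
import tensorflow as tf
from transformers import TFBertModel

def build_syn_detect(num_classes, max_len=256):
    """Adapted BERT classifier: BERT pooled output -> Dense(512, ReLU) -> SoftMax."""
    input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")

    bert = TFBertModel.from_pretrained("bert-base-uncased")  # pre-trained encoder
    pooled = bert(input_ids, attention_mask=attention_mask).pooler_output

    x = tf.keras.layers.Dense(512, activation="relu")(pooled)  # intermediate dense layer
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)  # class probabilities
    return tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)
```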
The model is compiled and trained with a learning rate of 0.00001, the Adam optimizer, a batch size of 16, and 50 epochs. Categorical cross-entropy loss is monitored during training and validation to evaluate the performance of the model; the other evaluation parameters are accuracy, precision, recall, and F1-score. Adam optimizes the learning of the model by exploiting historical gradient information for each parameter [], which enables efficient adaptation to the specific data. Adam achieves fast convergence by keeping a record of the gradients and applying a bias correction at each step. The gradient $G_t$ at each step t with the loss function $\text{Loss}(\theta_t)$ is computed using Equation (2).
$$ G_t = \nabla_{\theta} \, \text{Loss}(\theta_t) \qquad (2) $$
The mean of the gradients $MG_t$ maintained by Adam at each step is determined by Equation (3).
$$ MG_t = \beta_1 MG_{t-1} + (1 - \beta_1) G_t \qquad (3) $$
  • $MG_{t-1}$ represents the previous value of the mean of the gradients.
  • $G_t$, as computed in Equation (2), is the gradient value at time step t.
  • $\beta_1$ is the decay rate used in computing the mean of the gradients.
The uncentered variance $VC_t$ of the gradients is computed by Equation (4).
$$ VC_t = \beta_2 VC_{t-1} + (1 - \beta_2) G_t^2 \qquad (4) $$
  • $VC_{t-1}$ is the previous value of the uncentered variance.
  • $G_t^2$ represents the element-wise square of the gradient at each step.
  • $\beta_2$ is the decay rate used in computing the uncentered variance.
Bias correction is applied during training because $MG_t$ and $VC_t$ are biased toward 0 at initialization. Equations (5) and (6) give the bias-corrected estimates $\widehat{MG}_t$ and $\widehat{VC}_t$ of $MG_t$ and $VC_t$, respectively.
$$ \widehat{MG}_t = \frac{MG_t}{1 - \beta_1^t} \qquad (5) $$
$$ \widehat{VC}_t = \frac{VC_t}{1 - \beta_2^t} \qquad (6) $$
When dealing with a multiclass classification problem, the loss is computed using categorical cross-entropy. This loss function measures the difference between the predicted probability distribution (obtained using SoftMax) and the true encoded labels. The categorical cross-entropy loss for a single instance is computed using Equation (7).
$$ \text{Loss}(y, \hat{y}) = -\sum_{n} y_n \log(\hat{y}_n) \qquad (7) $$
The value of $y_n$ is 1 if the true class is n and 0 otherwise, according to the encoded labels; $\hat{y}_n$ is the probability of class n predicted by the model.
The evaluation parameters accuracy, recall, precision, and F1-score are computed to characterize the performance of the model. Equations (8)–(11) define these metrics: recall is the fraction of actual positives recovered, and precision is the fraction of positive predictions that are correct.
$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \qquad (8) $$
$$ \text{Recall} = \frac{TP}{TP + FN} \times 100 \qquad (9) $$
$$ \text{Precision} = \frac{TP}{TP + FP} \times 100 \qquad (10) $$
$$ \text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (11) $$
TP, TN, FP, and FN represent the true positive, true negative, false positive, and false negative values, respectively.
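A sketch of the training configuration with the stated hyperparameters (Adam at a 0.00001 learning rate, categorical cross-entropy, batch size 16, 50 epochs). The build_syn_detect helper is the sketch given above, and the train/validation tensors are placeholders.

```python
import tensorflow as tf

model = build_syn_detect(num_classes=5)  # e.g., 4 malware classes + benign for Dataset-1
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy",
             tf.keras.metrics.Precision(name="precision"),
             tf.keras.metrics.Recall(name="recall")],
)
# train_inputs: {"input_ids": ..., "attention_mask": ...}; labels are one-hot encoded
history = model.fit(train_inputs, train_labels,
                    validation_data=(val_inputs, val_labels),
                    epochs=50, batch_size=16)
```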

4. Results and Discussion

The model is evaluated and assessed on the samples of two publicly available Android malware datasets.

4.1. Dataset Acquisition

The first dataset is CIC-AndMal2017 []. This dataset contains samples collected from Android apps infected with different kinds of malware. The data are divided into four malware classes plus a collection of benign data. For each malware class, the samples include malware of the different families that fall under that category. Table 2 describes the composition of malware classes in the dataset.
Table 2. Composition of Dataset-1.
A customized dataset restricted to the TCP protocol, as per the problem statement, is extracted from this network traffic dataset. The original dataset comprehensively captures the state of the system at three stages: first, immediately after installation of the malware; second, 15 min before restarting the Android device; and, lastly, 15 min after the device restarts. The data are compiled as packet capture (PCAP) files. The customized dataset combines samples from each class, with deliberately fewer samples from the SMS Malware class so that synthetic data generation and the model's handling of mutant malware can be exercised. The distribution of samples of each class in the original data is represented in Figure 4 and that with synthetic samples in Figure 5.
Figure 4. Distribution of original TCP traffic data samples of Dataset-1.
Figure 5. Distribution of synthetic TCP traffic data samples of Dataset-1.
The second dataset is CIC-AAGM2017 []. This dataset is similar to the first and likewise captures network traffic data. It is composed of three classes; Table 3 illustrates its composition.
Table 3. Composition of Dataset-2.
This dataset is also a comprehensive collection of PCAP (Packet Capture) files and is likewise customized for the Syn-detect model. TCP protocol packets are extracted, and the Adware class is underrepresented in the customized dataset, which makes it the candidate for synthetic sample generation and augmentation. The distribution of samples of each class in the original data is represented in Figure 6 and that with synthetic samples in Figure 7.
Figure 6. Distribution of original TCP traffic data samples of Dataset-2.
Figure 7. Distribution of synthetic TCP traffic data samples of Dataset-2.

4.2. Deployment and Results

The quality of the synthetic data generated by GPT-2 is validated by the KL divergence. The results of the KL divergence between the original and synthetic samples of Dataset-1 are shown in Figure 8. An average KL divergence of 0.003 indicates that the synthetic data generated by GPT-2 are very similar to the original data. Still, variations can be observed in the plot, demonstrating the diversity of the synthetic data. Overall, the KL divergence indicates that the synthetic samples are both realistic and diverse in comparison with the actual traffic data.
Figure 8. KL divergence between original and synthetic TCP traffic data samples.
KL divergence can help to identify potential biases in synthetic data generation by comparing the probability distributions of the original and synthetic traffic data. A high KL divergence implies that the synthetic data do not closely reflect the characteristics of the original data, which could indicate that the generation model is biased or fails to capture important facets of the real data []. Conversely, a very low KL divergence, such as the 0.003 observed here, suggests that the synthetic data are very similar to the original data, implying little distributional bias. Options for synthetic generation of traffic data are limited, since the samples do not consist of meaningful English text; GPT-2 is nevertheless well suited to identifying and understanding such patterns and generating diversified synthetic data [].
The developed datasets are preprocessed and fed into the model for tokenization. The tokenizer generates the input IDs and attention masks that are fed into the adapted BERT model for classification. The data are split 80/20: 80% is used for model training and 20% for validation. This split was chosen to ensure a sufficient amount of training data while reserving enough validation data to evaluate the model's performance (a split sketch follows below). Synthetic data were included in both the training and test sets so that we could evaluate how well the model generalizes to both real and synthetically generated attack patterns. The model is trained and validated over 50 epochs with a batch size of 16. For each epoch, the precision, recall, and accuracy are calculated to track the model's learning. The accuracy and loss results are represented in Figure 9 and Figure 10 for Dataset-1 (CIC-AndMal2017) and in Figure 11 and Figure 12 for Dataset-2 (CIC-AAGM2017), respectively. For both datasets, accuracy increases and loss declines across the 50 epochs, illustrating that the model learns to classify the malware well.
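A minimal sketch of the 80/20 split, assuming scikit-learn; the stratification by class label and the fixed random seed are our assumptions, not details stated in the text.

```python
from sklearn.model_selection import train_test_split

# texts/labels hold both the original and the synthetic TCP flow samples
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels,
    test_size=0.20,    # 80/20 split as described above
    stratify=labels,   # assumption: keep every class present in both sets
    random_state=42,   # assumption: fixed seed for reproducibility
)
```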
Figure 9. Trend of accuracy of the Syn-detect model for Dataset-1.
Figure 10. Trend of loss of the Syn-detect model for Dataset-1.
Figure 11. Trend of accuracy of the Syn-detect model for Dataset-2.
Figure 12. Trend of loss of the Syn-detect model for Dataset-2.
In Figure 9, the training accuracy of the model for the CIC-AndMal2017 dataset is shown in blue and the validation accuracy in red. Accuracy improves from 83% to 99% as training proceeds and then levels off around the 15th epoch. The validation accuracy follows the same trend, becoming consistent with the training curve at the 15th epoch. Figure 10 shows the loss curve declining during training and validation and converging to a very low value.
Table 4 demonstrates the progress of the model step by step across 50 epochs towards convergence for Dataset-1.
Table 4. Model progress on epoch steps for Dataset-1.
In Figure 11, the same pattern is observed for the accuracy achieved during training and validation of the model on CIC-AAGM2017. Accuracy increases until the 33rd epoch and then remains consistent at around 98% through the 50th epoch. The loss declines during training and validation until it reaches a minute value by the 50th epoch, as depicted in Figure 12.
Table 5 demonstrates the progress of the model step by step across 50 epochs towards convergence for Dataset-2.
Table 5. Model progress on epoch steps for Dataset-2.
To trace the learning pattern throughout training and validation, the precision and recall of each class are also monitored across the 50 epochs. Figure 13 and Figure 14 illustrate the precision and recall trends for Dataset-1, and Figure 15 and Figure 16 illustrate the same for the second dataset. The precision values monitored during training and validation for Dataset-1 (CIC-AndMal2017) are represented in Figure 13, with each class shown under its assigned label ID 0-4. The precision of each class tends toward 1, reflecting the model's learning as the epochs progress toward 50. Figure 14 shows the recall values of each class as the model trains and validates; the pattern indicates that the model can classify most malware classes correctly as it approaches the 50th epoch. Figure 15 shows the precision trend of the model on Dataset-2 (CIC-AAGM2017). The model converges at the 30th epoch, after which a consistent curve illustrates its ability to identify most of the malware classes through the 50th epoch.
Figure 13. Trend of achieved precision of the Syn-detect model for Dataset-1.
Figure 14. Trend of achieved recall of the Syn-detect model for Dataset-1.
Figure 15. Trend of achieved precision of the Syn-detect model for Dataset-2.
Figure 16. Trend of achieved recall of the Syn-detect model for Dataset-2.
The individual precision, recall, and F1-score of each class in Dataset-1 are represented in Table 6. These values are the averages achieved across the 50 epochs, and the weighted values of precision, recall, and F1-score show the overall performance on Dataset-1.
Table 6. Evaluations of the Syn-detect model with Dataset-1.
The individual precision, recall, and F1-score of each class in Dataset-2 are represented in Table 7. These values are the averages achieved across the 50 epochs, and the weighted values of precision, recall, and F1-score show the overall performance on Dataset-2.
Table 7. Evaluations of the Syn-detect model with Dataset-2.
The model is evaluated on a test dataset for the predictions. The test data contain samples of original and synthetic TCP traffic data. Two sets of test data from Dataset-1 and Dataset-2 are created. Figure 17 illustrates the results of Dataset-1, and Figure 18 shows the results of Dataset-2 in the form of a confusion matrix.
Figure 17. Confusion matrix of predictions for Dataset-1.
Figure 18. Confusion matrix of predictions for Dataset-2.
On the prediction dataset, the model shows an accuracy of 99.8% for Dataset-1 and 99.3% for Dataset-2. The MCC values achieved over the predictions are 0.99 for Dataset-1 and 0.98 for Dataset-2; an MCC near 1 indicates good predictions. The Syn-detect model therefore demonstrates substantial performance. In the context of malware detection and classification, false positives and false negatives can have significant impacts even when the model achieves high accuracy. A false positive may result in unnecessary blocking and a loss of user trust in the system, and even a relatively small number of false positives could cause noteworthy trouble for users in real-time systems. A false negative, by contrast, means malware is misclassified as benign and goes undetected, putting systems at risk of a substantial security breach. Since the accuracies of 99.8% and 99.3% achieved by the Syn-detect model are very high, users are at lower risk.
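A sketch of the prediction-time evaluation, assuming scikit-learn metrics; test_inputs and test_labels are placeholders, and the labels are assumed to be one-hot encoded as during training.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, matthews_corrcoef

probs = model.predict(test_inputs)               # SoftMax class probabilities
y_pred = probs.argmax(axis=1)                    # predicted class indices
y_true = np.asarray(test_labels).argmax(axis=1)  # one-hot labels back to indices

print("Accuracy:", accuracy_score(y_true, y_pred))
print("MCC:     ", matthews_corrcoef(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))          # basis for plots like Figures 17 and 18
```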

5. Comparison of Results and Discussion

To further probe the performance of the model, an ablation study was conducted by running the model without synthetic traffic data. On Dataset-1, the model attained an accuracy of 98.2%. The results of the study are shown in Table 8 for Dataset-1 and Table 9 for Dataset-2. The findings support the effectiveness of the proposed model: a decline in performance, with lower precision, recall, and F1-score values, was observed when synthetic data were not used.
Table 8. Evaluations of the model without synthetic samples in Dataset-1.
Table 9. Evaluations of the model without synthetic samples in Dataset-2.
With Dataset-2, the model attained an accuracy of 97.8%. The attained precision, recall, and F1-score also declined, as illustrated in Table 9.
The Syn-detect model is compared with the previous studies that deal with the imbalanced dataset for the classification of the malware. Different techniques have been used for data balancing or extraction of the nominal features so that the impact of imbalanced data on the performance of the model can be reduced. Table 10 provides a comparison of the Syn-detect model with the previous studies, and the results signify the contribution made by the model to improve the classification of the malware.
Table 10. Comparison with recent studies.
Table 10 presents a comprehensive comparison with recent studies, highlighting the significance of the Syn-detect model. Each of these studies employs different methods and datasets for Android malware classification. The results of each study are summarized in the table, which shows that Syn-detect outperforms the other methods. This accentuates the innovative contribution of the current study to the field of cybersecurity.

6. Conclusions and Future Work

Android, being the most popular mobile operating system, is prone to cyberattacks. Detection and classification of malware have been a continuing focus of research: detection methods started with signature-based approaches, progressed to machine learning, and finally to AI-based methods. This research has always been challenging because of the dynamic nature of malware, which makes it difficult for models to identify mutant malware to a great extent. In this research, we targeted network traffic data with a synthetic sample generation strategy so that the model could also be trained on synthetic samples. The Syn-detect model generates synthetic samples with GPT-2 for TCP protocol-based samples and classifies the malware with an adapted BERT model. The model's performance is assessed through extensive experimentation on two datasets. CIC-AndMal2017 presented a weighted accuracy of 99.6% across 50 epochs and CIC-AAGM2017 a weighted accuracy of 95%. On the prediction dataset, the model showed an accuracy of 99.8% for CIC-AndMal2017 and 99.3% for CIC-AAGM2017. The MCC values achieved over the predictions are 0.99 for CIC-AndMal2017 and 0.98 for CIC-AAGM2017. These results illustrate that the Syn-detect model is a substantial addition to malware detection and classification research. In the future, we aim to develop further algorithms for malware detection and classification by unpacking code structure and performing dynamic analysis. The generalizability of the model to other types of malware or network traffic can also be explored, which would help evaluate its adaptability and flexibility. This research also opens avenues for creating a secure environment for IoT-based systems through other means, including edge computing and federated learning.

Author Contributions

Conceptualization, M.N. and F.U.; methodology, M.N. and A.A. (Amjad Alsirhani); software, S.I.; validation, M.N., F.U. and A.A. (Amjad Alsirhani); formal analysis, H.N.; investigation, G.N.A.; resources, A.A. (Abdullah Alomari); data curation, F.U.; writing—original draft preparation, M.N.; visualization, A.A. (Abdullah Alomari). All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Deanship of Graduate Studies and Scientific Research at Jouf University under grant No. (DGSSR-2024-02-02101).

Data Availability Statement

Datasets used in this research are available publicly at https://www.unb.ca/cic/datasets/index.html (accessed on 18 April 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sahay, S.K.; Sharma, A.; Rathore, H. Evolution of malware and its detection techniques. In Information and Communication Technology for Sustainable Development: Proceedings of ICT4SD 2018; Springer: Singapore, 2020; pp. 139–150. [Google Scholar]
  2. Udayakumar, N.; Saglani, V.J.; Cupta, A.V.; Subbulakshmi, T. Malware classification using machine learning algorithms. In Proceedings of the 2018 2nd International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, 11–12 May 2018; p. 1. [Google Scholar]
  3. Raff, E.; Nicholas, C. A survey of machine learning methods and challenges for windows malware classification. arXiv 2020, arXiv:2006.09271. [Google Scholar]
  4. Qamar, A.; Karim, A.; Chang, V. Mobile malware attacks: Review, taxonomy & future directions. Future Gener. Comput. Syst. 2019, 97, 887–909. [Google Scholar]
  5. Malik, M.I.; Ibrahim, A.; Hannay, P.; Sikos, L.F. Developing resilient cyber-physical systems: A review of state-of-the-art malware detection approaches, gaps, and future directions. Computers 2023, 12, 79. [Google Scholar] [CrossRef]
  6. Haris, M.; Jadoon, B.; Yousaf, M.; Khan, F.H. Evolution of android operating system: A review. Asia Pac. J. Contemp. Educ. Commun. Technol. 2018, 4, 178–188. [Google Scholar]
  7. Dhiman, D.B. Effects of Online News Applications for Android: A Critical Analysis. 2022. Available online: https://ssrn.com/abstract=4222791 (accessed on 21 June 2024).
  8. Selvaganapathy, S.; Sadasivam, S.; Ravi, V. A review on android malware: Attacks, countermeasures and challenges ahead. J. Cyber Secur. Mobil. 2021, 10, 177–230. [Google Scholar] [CrossRef]
  9. Almaleh, A.; Almushabb, R.; Ogran, R. Malware API calls detection using hybrid logistic regression and RNN model. Appl. Sci. 2023, 13, 5439. [Google Scholar] [CrossRef]
  10. Mehta, R.; Jurečková, O.; Stamp, M. A natural language processing approach to Malware classification. J. Comput. Virol. Hacking Tech. 2024, 20, 173–184. [Google Scholar] [CrossRef]
  11. Ambekar, N.G.; Devi, N.N.; Thokchom, S.; Yogita. TabLSTMNet: Enhancing android malware classification through integrated attention and explainable AI. Microsyst. Technol. 2024, 1–19. [Google Scholar] [CrossRef]
  12. Ullah, F.; Turab, A.; Ullah, S.; Cacciagrano, D.; Zhao, Y. Enhanced network intrusion detection system for internet of things security using multimodal big data representation with transfer learning and game theory. Sensors 2024, 24, 4152. [Google Scholar] [CrossRef]
  13. Zhang, N.; Xue, J.; Ma, Y.; Zhang, R.; Liang, T.; Tan, Y. Hybrid sequence-based Android malware detection using natural language processing. Int. J. Intell. Syst. 2021, 36, 5770–5784. [Google Scholar] [CrossRef]
  14. Alam, S. Applying natural language processing for detecting malicious patterns in Android applications. Forensic Sci. Int. Digit. Investig. 2021, 39, 301270. [Google Scholar] [CrossRef]
  15. Demmese, F.A.; Neupane, A.; Khorsandroo, S.; Wang, M.; Roy, K.; Fu, Y. Machine learning based fileless malware traffic classification using image visualization. Cybersecurity 2023, 6, 32. [Google Scholar] [CrossRef]
  16. Zhang, Y.; Jin, R.; Zhou, Z.H. Understanding bag-of-words model: A statistical framework. Int. J. Mach. Learn. Cybern. 2010, 1, 43–52. [Google Scholar] [CrossRef]
  17. Church, K.W. Word2Vec. Nat. Lang. Eng. 2017, 23, 155–162. [Google Scholar] [CrossRef]
  18. Liu, L.Z.; Wang, Y.; Kasai, J.; Hajishirzi, H.; Smith, N.A. Probing across time: What does RoBERTa know and when? arXiv 2021, arXiv:2104.07885. [Google Scholar]
  19. He, P.; Liu, X.; Gao, J.; Chen, W. Deberta: Decoding-enhanced bert with disentangled attention. arXiv 2020, arXiv:2006.03654. [Google Scholar]
  20. Karita, S.; Chen, N.; Hayashi, T.; Hori, T.; Inaguma, H.; Jiang, Z.; Someki, M.; Soplin, N.E.Y.; Yamamoto, R.; Wang, X.; et al. A comparative study on transformer vs rnn in speech applications. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 449–456. [Google Scholar]
  21. Botacin, M.; Ceschin, F.; Sun, R.; Oliveira, D.; Grégio, A. Challenges and pitfalls in malware research. Comput. Secur. 2021, 106, 102287. [Google Scholar] [CrossRef]
  22. Li, J.; He, J.; Li, W.; Fang, W.; Yang, G.; Li, T. SynDroid: An adaptive enhanced Android malware classification method based on CTGAN-SVM. Comput. Secur. 2024, 137, 103604. [Google Scholar] [CrossRef]
  23. Chen, J.; Zhao, Z.; Cai, S.; Chen, X.; Ahmad, B.; Song, L.; Wang, K. DCM-GIFT: An Android malware dynamic classification method based on gray-scale image and feature-selection tree. Inf. Softw. Technol. 2024, 176, 107560. [Google Scholar] [CrossRef]
  24. Aboaoja, F.A.; Zainal, A.; Ghaleb, F.A.; Al-Rimy, B.A.S.; Eisa, T.A.E.; Elnour, A.A.H. Malware detection issues, challenges, and future directions: A survey. Appl. Sci. 2022, 12, 8482. [Google Scholar] [CrossRef]
  25. Gao, C.; Huang, G.; Li, H.; Wu, B.; Wu, Y.; Yuan, W. A Comprehensive Study of Learning-based Android Malware Detectors under Challenging Environments. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, Lisbon, Portugal, 14–20 April 2024; pp. 1–13. [Google Scholar]
  26. Ullah, F.; Ullah, S.; Naeem, M.R.; Mostarda, L.; Rho, S.; Cheng, X. Cyber-threat detection system using a hybrid approach of transfer learning and multi-model image representation. Sensors 2022, 22, 5883. [Google Scholar] [CrossRef] [PubMed]
  27. Chen, Y.; Cui, M.; Wang, D.; Cao, Y.; Yang, P.; Jiang, B.; Lu, Z.; Liu, B. A survey of large language models for cyber threat detection. Comput. Secur. 2024, 145, 104016. [Google Scholar] [CrossRef]
  28. Koide, T.; Fukushi, N.; Nakano, H.; Chiba, D. Detecting phishing sites using chatgpt. arXiv 2023, arXiv:2306.05816. [Google Scholar]
  29. Xu, P. Android-coco: Android malware detection with graph neural network for byte-and native-code. arXiv 2021, arXiv:2112.10038. [Google Scholar]
  30. Rahali, A.; Akhloufi, M.A. Malbert: Malware detection using bidirectional encoder representations from transformers. In Proceedings of the 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Melbourne, Australia, 17–20 October 2021; pp. 3226–3231. [Google Scholar]
  31. Rahali, A.; Akhloufi, M.A. Malbertv2: Code aware bert-based model for malware identification. Big Data Cogn. Comput. 2023, 7, 60. [Google Scholar] [CrossRef]
  32. Demırcı, D.; Acarturk, C. Static malware detection using stacked bilstm and gpt-2. IEEE Access 2022, 10, 58488–58502. [Google Scholar] [CrossRef]
  33. Saracino, A.; Simoni, M. Graph-based android malware detection and categorization through bert transformer. In Proceedings of the 18th International Conference on Availability, Reliability and Security, Benevento, Italy, 29 August–1 September 2023; pp. 1–7. [Google Scholar]
  34. Ban, Y.; Yi, J.H.; Cho, H. Augmenting Android Malware Using Conditional Variational Autoencoder for the Malware Family Classification. Comput. Syst. Sci. Eng. 2023, 46, 2215–2230. [Google Scholar] [CrossRef]
  35. Mahmoudi, L.; Salem, M. BalBERT: A New Approach to Improving Dataset Balancing for Text Classification. Rev. d’Intell. Artif. 2023, 37, 425–431. [Google Scholar] [CrossRef]
  36. Habbat, N.; Nouri, H.; Anoun, H.; Hassouni, L. Sentiment analysis of imbalanced datasets using BERT and ensemble stacking for deep learning. Eng. Appl. Artif. Intell. 2023, 126, 106999. [Google Scholar] [CrossRef]
  37. Cheng, A. PAC-GAN: Packet generation of network traffic using generative adversarial networks. In Proceedings of the 2019 IEEE 10th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, 17–19 October 2019; pp. 0728–0734. [Google Scholar]
  38. Kholgh, D.K.; Kostakos, P. PAC-GPT: A novel approach to generating synthetic network traffic with GPT-3. IEEE Access 2023, 11, 114936–114951. [Google Scholar] [CrossRef]
  39. Lamping, U.; Warnicke, E. Wireshark user’s guide. Interface 2004, 4, 1. [Google Scholar]
  40. Kozierok, C.M. The TCP/IP Guide: A Comprehensive, Illustrated Internet Protocols Reference; No Starch Press: San Francisco, CA, USA, 2005. [Google Scholar]
  41. Feng, S.Y.; Gangal, V.; Wei, J.; Chandar, S.; Vosoughi, S.; Mitamura, T.; Hovy, E. A survey of data augmentation approaches for NLP. arXiv 2021, arXiv:2105.03075. [Google Scholar]
  42. Gujjar, J.P.; Kumar, H.P.; Prasad, M.G. Advanced NLP Framework for Text Processing. In Proceedings of the 2023 6th International Conference on Information Systems and Computer Networks (ISCON), Mathura, India, 3–4 March 2023; pp. 1–3. [Google Scholar]
  43. Pérez-Cruz, F. Kullback-Leibler divergence estimation of continuous distributions. In Proceedings of the 2008 IEEE International Symposium on Information Theory, Toronto, ON, Canada, 6–11 July 2008; pp. 1666–1670. [Google Scholar]
  44. Johansson, P.; Bright, J.; Krishna, S.; Fischer, C.; Leslie, D. Exploring responsible applications of Synthetic Data to advance Online Safety Research and Development. arXiv 2024, arXiv:2402.04910. [Google Scholar] [CrossRef]
  45. Hardeniya, N.; Perkins, J.; Chopra, D.; Joshi, N.; Mathur, I. Natural Language Processing: Python and NLTK; Packt Publishing Ltd.: Birmingham, UK, 2016. [Google Scholar]
  46. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45. [Google Scholar]
  47. Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  48. Nadi, F.; Naghavipour, H.; Mehmood, T.; Azman, A.B.; Nagantheran, J.A.; Ting, K.S.K.; Adnan, N.M.I.B.N.; Sivarajan, R.A.; Veerah, S.A.; Rahmat, R.F. Sentiment Analysis Using Large Language Models: A Case Study of GPT-3.5. In Proceedings of the International Conference on Data Science and Emerging Technologies, Virtual, 4–5 December 2023; Springer: Singapore, 2023; pp. 161–168. [Google Scholar]
  49. Nadeem, S.; Mehmood, T.; Yaqoob, M. A Generic Framework for Ransomware Prediction and Classification with Artificial Neural Networks. In Proceedings of the International Conference on Data Science and Emerging Technologies, Virtual, 4–5 December 2023; Springer: Singapore, 2023; pp. 137–148. [Google Scholar]
  50. Géron, A. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2022. [Google Scholar]
  51. Jais, I.K.M.; Ismail, A.R.; Nisa, S.Q. Adam optimization algorithm for wide and deep neural network. Knowl. Eng. Data Sci. 2019, 2, 41–46. [Google Scholar] [CrossRef]
  52. Lashkari, A.H.; Kadir, A.F.A.; Taheri, L.; Ghorbani, A.A. Toward developing a systematic approach to generate benchmark android malware datasets and classification. In Proceedings of the 2018 International Carnahan Conference on Security Technology (ICCST), Montreal, QC, Canada, 22–25 October 2018; pp. 1–7. [Google Scholar]
  53. Lashkari, A.H.; Kadir, A.F.A.; Gonzalez, H.; Mbah, K.F.; Ghorbani, A.A. Towards a network-based framework for android malware detection and characterization. In Proceedings of the 2017 15th Annual Conference on Privacy, Security and Trust (PST), Calgary, AB, Canada, 28–30 August 2017; pp. 233–23309. [Google Scholar]
  54. Chen, J.; Tam, D.; Raffel, C.; Bansal, M.; Yang, D. An empirical survey of data augmentation for limited data learning in nlp. Trans. Assoc. Comput. Linguist. 2023, 11, 191–211. [Google Scholar] [CrossRef]
  55. Dai, H.; Liu, Z.; Liao, W.; Huang, X.; Cao, Y.; Wu, Z.; Zhao, L.; Xu, S.; Liu, W.; Liu, N.; et al. Auggpt: Leveraging chatgpt for text data augmentation. arXiv 2023, arXiv:2302.13007. [Google Scholar]
  56. Zhou, F.; Wang, D.; Xiong, Y.; Sun, K.; Wang, W. FAMCF: A few-shot Android malware family classification framework. Comput. Secur. 2024, 146, 104027. [Google Scholar] [CrossRef]
  57. Ghourabi, A. An Attention-Based Approach to Enhance the Detection and Classification of Android Malware. Comput. Mater. Contin. 2024, 80, 2743–2760. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
