Article

Malware Detection and Classification System Based on CNN-BiLSTM

Haesoo Kim and Mihui Kim *
School of Computer Engineering & Applied Mathematics, Computer System Institute, Hankyong National University, Jungang-ro, Anseong-si 17579, Gyeonggi-do, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2024, 13(13), 2539; https://doi.org/10.3390/electronics13132539
Submission received: 10 May 2024 / Revised: 12 June 2024 / Accepted: 26 June 2024 / Published: 28 June 2024
(This article belongs to the Special Issue Applications of Artificial Intelligence in Computer Vision)

Abstract
Attackers hide malware in the software used by their victims for malicious purposes. New malware is continuously shared on the Internet, differing in both type and method of damage. When new malware is discovered, it is possible to check whether similar malware has appeared in the past and to use that older malware to counter the new one; however, it is difficult to check the maliciousness and similarity of all software. Deep learning technology can therefore be used to detect and classify malware efficiently. This study improves accuracy by converting static features, which are binary data, into images and by converting time-series data such as API call sequences, which are dynamic data of varying length, into data of a fixed length. We propose a system that combines AI-based malware detection and classification trained on both static and dynamic features. The experimental results showed a detection accuracy of 99.34%, a classification accuracy of 95.1%, and a prediction speed of approximately 0.1 s.

1. Introduction

With the proliferation of Internet-connected devices, the risk of infection by malicious code has increased. Once a system is infected, attackers employ persistence mechanisms to maintain control of the compromised system for extended periods. This makes it difficult for static analysis through straightforward code inspection to identify malicious code with persistence [1]. Instead, dynamic analysis, which executes the code and reports the resulting system changes, is required to determine the malicious behaviors carried out on a system. New malicious codes continue to be discovered, and their number is increasing rapidly, so it is difficult to analyze and classify all existing malicious codes using debugging and signatures. Devising a defensive technique whenever a new malicious code is discovered is too slow to keep pace with the rate at which malicious codes appear. To respond to new malicious code, we must find ways to analyze and classify it quickly. Malicious codes of the same type use similar libraries and APIs, and their programs therefore behave similarly. Thus, if we can detect new malicious codes and classify them into families of existing malicious codes, we can identify the type of the new malicious code and provide appropriate defensive techniques.
Azeez et al. [2] proposed a dense convolutional neural network (CNN) [3] ensemble classification-based technique for malicious code detection and classification. Vasan et al. [4] proposed an ensemble model for transfer learning and image-based malicious code family classification, and Kumar et al. [5] proposed an image-based malicious code classification model using pre-trained models. Naeem et al. [6] proposed a detection and classification system using image-based stacking ensemble models, and Yadav et al. [7] proposed an Android malicious code detection and classification technique using CNN and machine learning algorithms. Gomez et al. [8] proposed a detection and classification system using machine learning algorithms on the static analysis data of APK files.
Other studies have mainly used image-based pre-trained models or static data extracted from files; however, pre-trained models have many parameters, which creates a large modeling overhead. In addition, when only static data are used, the model may fail to detect a malicious code that has been obfuscated. To address these limitations, our previous study [9] proposed a malware detection system that utilized both static data, such as malicious code images, and dynamic data. Dynamic data reveal, through the APIs invoked while the software is running, malicious behaviors that cannot be identified from static data alone. This is effective for general detection and family classification and is also beneficial for detecting obfuscated malicious codes. However, when preprocessing the dynamic data in the system proposed in our previous research, information is lost because API calls beyond a certain length are removed from the API call sequence, which affects the system's performance. To improve on this, the present study applies a preprocessing technique tailored to the API call sequence to minimize information loss and improves accuracy by using CNN and long short-term memory (LSTM) [10] models instead of pre-trained models, reducing the overhead and allowing a choice between malicious code detection and malicious code family classification. By enabling both detection and classification of malicious codes, the proposed system can be used according to the requirements of a given situation (for example, when speed or accuracy needs to be prioritized).
The remainder of this paper is organized as follows. Section 2 explains the related research on malicious code detection and classification using artificial intelligence. Section 3 proposes an image- and time-series data-based malicious code detection and classification system using artificial intelligence. Section 4 analyzes the experimental results, and Section 5 concludes the paper.

2. Related Works

Azeez et al. [2] proposed reducing dimensionality through a principal component analysis (PCA) for 77 features extracted from PE files and used an ensemble classifier that combined the predictions of one-layer multi-layer perceptron (MLP), two-layer MLP, and 1D-CNN models with a meta-learner of 15 different machine learning algorithms to output detection results.
Vasan et al. [4] proposed an image-based malware family classification ensemble model to classify malware families by fine-tuning the VGG16 model pre-trained on ImageNet data and the ResNet-50 model using transfer learning on malware images and combining these with SVM models trained by extracting key features through a PCA.
Kumar et al. [5] proposed an image-based malware classification model using the VGG16 model fine-tuned with image-based datasets and pre-trained the VGG19, ResNet-50, and InceptionV3 models to extract the features of malware images and classify malware families using six machine learning classifiers.
Naeem et al. [6] proposed an image-based stacking ensemble detection and classification system for malware images by extracting the main features of the image using local binary pattern (LBP) and Grey-level spatial dependency matrices (GLCM), mapping high-dimensional features to low-dimensional features using a CNN ensemble model, and detecting and classifying malware using six machine learning classifiers.
Yadav et al. [7] proposed a stacking-ensemble-based CNN and machine learning technique for Android malware detection and classification that fine-tunes the pre-trained EfficientNetB0 model on malware images and produces the output through a logistic regression model combined with support vector machine (SVM) and random forest classifiers.
Gomez et al. [8] proposed a system that extracts features from static analysis data in APK files, such as API calls, permissions, and actions requested from other applications, and detects and classifies malware using six classifiers.
The studies above propose systems that either extract features from static data and malware images with CNN-based deep learning models or map high-dimensional features to low-dimensional representations and output detection or classification results through machine learning classifiers. Vasan et al., Kumar et al., and Yadav et al. proposed systems that extract features using pre-trained models and classify them using machine learning algorithms. However, pre-trained models [11] can be resource-intensive in terms of hardware because of their large number of parameters, and using only static data can fail to detect malware whose behavior is unknown until it is executed.
In this paper, we propose a CNN-BiLSTM detection and classification system based on static and dynamic data.

3. Proposed System

In this section, the proposed system is described. First, static and dynamic data are extracted from PE files [12] and preprocessed by the Data Preprocessing Module (DPM). The Malware Detection and Classification Module (MDCM) then determines whether a file is malicious and, when required, which malware family it belongs to. Figure 1 shows a schematic of the proposed system.

3.1. Data Preprocessing Module (DPM)

The Data Preprocessing Module (DPM) extracts static and dynamic data from the input PE file and preprocesses them to conform to the input requirements of the MDCM. In general, malicious software can be detected and classified based on static data, such as binary information, opcodes, and similar attributes. However, when the software is obfuscated, relying solely on static data for detection and classification becomes challenging because the functionality of the program cannot be discerned until it is executed. Furthermore, deep learning models learn the information in the training data to distinguish each class, and they can be trained on a diverse range of information, such as binary information from static data and API call sequences from dynamic data, to enhance performance. In the DPM, static data are transformed into images by extracting the binary content of the file in its hexadecimal form in one-byte increments and converting each byte into a decimal number. Each byte is then mapped to one pixel to form an image [13]. Once converted into images, the static data are resized to fixed dimensions to match the input size of the CNN model employed by the MDCM. Dynamic data are extracted from the API call sequences of the PE files using Cuckoo Sandbox [14]. The extracted data are difficult to use directly as training data for the LSTM model because their size varies from program to program; therefore, they must be converted to a fixed size. To achieve this, we preprocess the data using the term frequency–inverse document frequency (TF-IDF), which translates API calls from natural language into meaningful numerical values, and then apply a sliding window technique to convert the sequence to a fixed length [15]. Algorithm 1 presents the preprocessing process for dynamic data.
Algorithm 1: Proposed TF-IDF and sliding window calculation algorithm
Input: API call sequences A; categories C; target length for transformation N
Output: Preprocessed data S
/* Calculating TF-IDF */
1:  T_A ← TF-IDF of A
2:  T_C ← TF-IDF of C
3:  T_AC ← combine T_A and T_C
/* Chunking Sequence */
4:  S_n ← (length of T_AC) divided by N
5:  Type cast S_n to integer
6:  Initialize idx
7:  for i ← 0 to length of T_AC, in steps of S_n do
8:      if N − idx = (length of T_AC) − S_n × N then
9:          S_n ← S_n + 1
10:         break
11:     end
12:     Extract S_n elements from T_AC and add them to S_AC
13:     idx++
14: end
15: for i ← (S_n + 1) · idx to length of T_AC, in steps of S_n do
16:     Extract S_n elements from T_AC and add them to S_AC
17: end
/* Sliding Window */
18: S_avg ← average of each window in S_AC
19: S ← final values of S_AC computed using S_avg
20: return S
Lines 1–3 involve calculating the TF-IDF of the API call sequence and its categories. Lines 4–17 detail the process of partitioning the calculated sequence into fixed-size segments for the sliding window computation. In Lines 18–20, the final values for the conversion to a fixed length are determined. The first step in converting dynamic data to a fixed length after the TF-IDF calculation is dividing the data length by the target length and converting the result into an integer (Lines 4–5). For instance, if the data length is 14 and the target length is 6, dividing 14 by 6 and discarding the decimal yields 2, which becomes the window size. Grouping the data into pairs of 2 yields seven sub-datasets, exceeding the target length. Distributing one sub-dataset from the end of the API call sequence evenly across the six sub-datasets minimizes the loss of API call sequence information during the conversion to a fixed length. By incrementing the window size by one, based on the difference between the product of the sub-dataset size and the target length and the original data size, a number of sub-datasets equal to the target length is generated (Lines 8–9). For example, grouping [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] with a window size of 2 results in [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14]]. To resolve this, subtracting the product of 2 and 6 from 14 yields 2, and generating sub-datasets with a window size of 3 starting from the 4th position results in [0th: [1, 2], 1st: [3, 4], 2nd: [5, 6], 3rd: [7, 8, 9], 4th: [10, 11], 5th: [12, 13, 14]], creating a number of sub-datasets equal to the target length. Converting each sub-dataset to a single value by calculating a representative value yields data of the target length (Lines 18–19). The representative value is calculated as follows: if more values within the sub-dataset are smaller than its average, the smallest value becomes the representative value; if more values are larger, the largest value becomes the representative value; and when the counts are equal, the average becomes the representative value.
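To make the chunking and representative-value steps concrete, the following is a minimal NumPy sketch of Lines 4–19 of Algorithm 1; it assumes the TF-IDF weights have already been computed, the function and variable names are illustrative, and, for simplicity, the leftover elements are absorbed by the trailing windows rather than distributed exactly as in the worked example above.

import numpy as np

def to_fixed_length(tfidf_seq, target_len):
    # Convert a variable-length sequence of TF-IDF values into target_len values
    # by splitting it into target_len windows and taking a representative value
    # from each window (illustrative re-implementation of Algorithm 1, Lines 4-19).
    seq = np.asarray(tfidf_seq, dtype=float)
    base = len(seq) // target_len                  # initial window size (Lines 4-5)
    leftover = len(seq) - base * target_len        # elements that do not fit evenly

    # Build target_len windows; the last `leftover` windows take one extra element
    # so that no part of the API call sequence is simply truncated (Lines 7-17).
    windows, start = [], 0
    for w in range(target_len):
        size = base + 1 if w >= target_len - leftover else base
        windows.append(seq[start:start + size])
        start += size

    # Representative value per window (Lines 18-19): if more elements lie below the
    # window average, take the minimum; if more lie above, take the maximum;
    # otherwise take the average itself.
    out = []
    for win in windows:
        avg = win.mean()
        below, above = np.sum(win < avg), np.sum(win > avg)
        out.append(win.min() if below > above else win.max() if above > below else avg)
    return np.array(out)

print(to_fixed_length(range(1, 15), 6))   # example from the text: 14 values -> 6 values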
Figure 2 shows the structure of a PE file and an example. Figure 3 and Figure 4 show examples of the preprocessing of static and dynamic data.
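As an illustration of the static-data preprocessing shown in Figure 3, the following is a minimal sketch that reads a PE file's raw bytes, maps each byte to one grayscale pixel, and resizes the result to the 256 × 256 input expected by the CNN block; the fixed row width of 256, the use of Pillow for resizing, and the file name are assumptions made for the example.

import numpy as np
from PIL import Image   # Pillow is assumed here for resizing

def pe_to_image(path, size=(256, 256)):
    # Read the file as raw bytes and map each byte (0-255) to one grayscale pixel.
    data = np.fromfile(path, dtype=np.uint8)
    width = 256                                     # assumed fixed row width
    height = int(np.ceil(len(data) / width))
    padded = np.zeros(width * height, dtype=np.uint8)
    padded[:len(data)] = data
    img = Image.fromarray(padded.reshape(height, width), mode="L")
    return np.asarray(img.resize(size)) / 255.0     # normalized image for the CNN input

x_static = pe_to_image("sample.exe")[..., np.newaxis]   # shape (256, 256, 1)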

3.2. Malware Detection and Classification Module (MDCM)

In the MDCM, the static and dynamic data preprocessed by the DPM are input to the CNN and LSTM models, respectively, and the features extracted from the program's static and dynamic data are combined through a concatenate layer. The combined features are trained using a deep neural network (DNN) model to produce a single output. Table 1, Table 2 and Table 3 show the structure and parameters of the detection models used in the MDCM. Table 1 presents the structure and parameters of the CNN model used to train the static data. The fundamental architecture of a CNN consists of an input layer, convolutional layers, pooling layers, and a flatten layer, and multiple convolutional and pooling layers are stacked to learn the image features and patterns. The rationale behind stacking five layers is that, as the number of filters (which represent the feature maps of the images) increases, the output after the fifth pooling layer becomes half the size of the input image, and stacking any further would unnecessarily increase the overall parameter size of the model. Additionally, the hyperparameters of each layer were adjusted based on the metrics (loss and accuracy) on the validation data to prevent overfitting. Table 2 outlines the structure and parameters of the LSTM model used to train the dynamic data. The architecture comprises an input layer and bidirectional long short-term memory (BiLSTM) layers. Stacking multiple BiLSTM layers can potentially enhance the system's accuracy; however, excessive stacking may lead to overfitting, so the number of layers and the hyperparameters were adjusted based on the metrics on the validation data. Table 3 lists the structure and parameters of the layers that integrate the CNN and LSTM models to produce the output. A DNN fundamentally consists of an input layer, dense layers, and an output layer; in the proposed system, a concatenate layer serves as the input, followed by four dense layers, the last of which is the output layer. To address the issue of the data distribution changing from batch to batch during training, batch normalization was added to the proposed model. Furthermore, to reduce the dimensionality of the data features gradually, the number of units is halved in each successive layer. The number of units in Dense Layer 1 and the total number of dense layers were adjusted based on the metrics on the validation data to prevent overfitting. The Dense Layer 4 parameters in Table 3, class_num and activation, select between detection and classification.
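A condensed Keras sketch of the architecture summarized in Tables 1–3 is shown below: five convolution–pooling blocks for the image branch, two bidirectional LSTM layers for the sequence branch, and a concatenated dense head. The layer hyperparameters follow the tables, while the function name, the padding choice, and the example calls at the end are illustrative assumptions.

from tensorflow.keras import layers, models

def build_model(target_len, class_num=1, activation="sigmoid"):
    # CNN branch for the 256x256 grayscale malware image (Table 1).
    img_in = layers.Input(shape=(256, 256, 1))
    x = img_in
    for filters in (32, 64, 128, 256, 512):
        x = layers.Conv2D(filters, (3, 3), strides=1, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Dropout(0.2)(x)
    x = layers.Flatten()(x)                                    # 8*8*512 = 32,768 features

    # BiLSTM branch for the fixed-length API call sequence (Table 2).
    seq_in = layers.Input(shape=(1, target_len))
    y = layers.Bidirectional(layers.LSTM(120, return_sequences=True))(seq_in)
    y = layers.Bidirectional(layers.LSTM(120))(y)              # 240 features

    # Ensemble head: concatenation, batch normalization, and dense layers (Table 3).
    z = layers.Concatenate()([x, y])                           # 33,008 features
    for units in (512, 256, 128):
        z = layers.BatchNormalization()(z)
        z = layers.Dense(units, activation="relu")(z)
    out = layers.Dense(class_num, activation=activation)(z)    # Dense Layer 4
    return models.Model([img_in, seq_in], out)

detector = build_model(target_len=1200)                               # detection (sigmoid)
classifier = build_model(1200, class_num=8, activation="softmax")     # family classification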

3.2.1. Convolutional Neural Network (CNN)

A CNN is a deep learning model used to learn from images. Static data preprocessed into image form allow the unique patterns and features inherent to malware to be extracted and learned; therefore, a CNN was selected for this purpose. The convolutional layer processes the input data by multiplying them with the corresponding values in the filter as the filter slides over the input at regular intervals (the stride). It then sums these products and forwards the aggregated results to the subsequent layer. The values of each filter are its weights and represent the features extracted from the input image. Figure 5 illustrates the computational process in the convolutional layer. The pooling layer reduces the dimensionality of the output of each layer while preserving its characteristics, thereby mitigating the risk of overfitting. To input values into the ensemble and dense layers, a flatten layer is employed to transform the output from three dimensions to one dimension. This allows the model to train on malware image features. The purpose of the ensemble is to enhance detection and classification performance by integrating the weights of the models trained on the static and dynamic data.
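As a small numerical illustration of the filter computation described above (constructed for this explanation, not reproduced from Figure 5), the following sketch slides a 2 × 2 filter over a 3 × 3 input with a stride of 1 and sums the element-wise products at each position.

import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])     # input feature map
w = np.array([[1, 0],
              [0, 1]])        # 2x2 filter (weights)

out = np.zeros((2, 2))
for i in range(2):            # stride of 1 in both directions
    for j in range(2):
        # element-wise product of the filter and the current window, then the sum
        out[i, j] = np.sum(x[i:i + 2, j:j + 2] * w)

print(out)                    # [[ 6.  8.] [12. 14.]]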

3.2.2. Long Short-Term Memory (LSTM)

LSTM, a subtype of the recurrent neural network (RNN), is a deep learning model used to train on data containing time-series information. The dynamic data consist of API call sequences, so the order of the API calls is significant, which led to the selection of an RNN. Furthermore, given that the length of API call sequences varies between malware samples and that gradients can vanish even when sequences are shortened, LSTM networks, which are improved RNN models designed to address this issue, were chosen. The model utilized in the proposed system is a bidirectional LSTM. A conventional LSTM is trained using a memory cell containing forget, input, and output gates. The forget gate determines the proportion of information discarded as unimportant using a sigmoid function, the input gate adds new information to the cell state retained after the forget gate through a point-wise operation, and the output gate determines the values to be passed to the subsequent layer. Figure 6 illustrates the structure of the LSTM layer. A classical LSTM processes information in the forward direction only and thus learns from the API calls preceding the current call. A bidirectional LSTM [16], in contrast, learns from both the preceding and following API calls relative to the current call, captures the information within the malware's API call sequence better, and can potentially outperform a forward-only LSTM. Figure 7 shows the structure of the bidirectional LSTM.
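For reference, the forget, input, and output gates described above can be written in the standard LSTM formulation (standard notation assumed here, not reproduced from Figure 6), where \sigma is the sigmoid function, \odot denotes the point-wise product, x_t is the current input, and h_{t-1} is the previous hidden state:

\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh(c_t).
\end{aligned}

In the bidirectional case, a forward pass and a backward pass over the sequence each produce a hidden state, and the two are concatenated, h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}], which is why each bidirectional layer with 120 units in Table 2 outputs 240 values.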

3.2.3. Deep Neural Network (DNN)

A DNN is a neural network consisting of an input layer, an output layer, and one or more hidden layers. The hidden layers, also known as dense layers or a multi-layer perceptron, are trained to classify data as either malicious or benign. Each layer in the network receives as input the sum of the outputs of the previous layer, each multiplied by its respective weight. The features extracted by the CNN and LSTM models are trained using dense layers, and the final layer outputs the prediction results. The batch normalization layer normalizes the weights and input values in each layer to mitigate the internal covariate shift caused by each batch having a different data distribution. This normalization reduces the scale of the intermediate results, accelerates training, and moderates the weight adjustments, thereby aiding the generalization of the model. Figure 8 shows an example of the concatenate layer, in which the three-dimensional tensor from the CNN block is transformed into a one-dimensional vector by the flatten layer and concatenated with the output of the LSTM block. Figure 9 shows an example of a prediction result that depends on the parameters of the output layer of the DNN block. Malware is detected when class_num is 1 and the activation is sigmoid; it is classified into a malware family when class_num is the number of malware families and the activation is softmax.
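A minimal sketch of how the output layer (Dense Layer 4 in Table 3) can be switched between detection and family classification is shown below; the helper name and the loss pairings noted in the comments are assumptions, not taken from the paper.

from tensorflow.keras import layers

def output_head(features, class_num=1):
    # Binary detection head when class_num == 1, family-classification head otherwise.
    if class_num == 1:
        return layers.Dense(1, activation="sigmoid")(features)        # malicious vs. benign
    return layers.Dense(class_num, activation="softmax")(features)    # one of class_num families

# Assumed typical pairings: binary_crossentropy for the sigmoid head and
# categorical (or sparse categorical) crossentropy for the softmax head.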

4. Performance Evaluation

4.1. Experimental Environments

The following environment was used for the experiments. Table 4 lists the system specifications, and Table 5 lists the main libraries and their versions.

4.2. Experimental Dataset

The raw PE files were provided by Practical Security Analytics for security and AI research [17]. From the Practical Security Analytics dataset metadata, the ratio of the number of antivirus programs that identified a file as malicious to the total number of antivirus programs that scanned it was calculated. Files with a ratio of 0.9 or higher were classified by type using VirusTotal [18]. The dataset comprises 9756 benign and 13,796 malicious entries, totaling 23,552 data points. Dynamic data from both benign and malicious files were extracted using Cuckoo Sandbox. Table 6 presents the distribution of the malware types utilized in the experiments. The training, validation, and test data were split in a ratio of 6:2:2. The hash values of the software used to extract the data can be found in the GitHub repository [19].
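A short pandas/scikit-learn sketch of the labeling filter and the 6:2:2 split described above is given below; the metadata file name and its column names ('label', 'positives', 'total_scans') are hypothetical placeholders.

import pandas as pd
from sklearn.model_selection import train_test_split

meta = pd.read_csv("pe_metadata.csv")                          # hypothetical metadata file
meta["ratio"] = meta["positives"] / meta["total_scans"]        # detections / total scanners

# Keep benign files and malicious files flagged by at least 90% of the scanners.
data = meta[(meta["label"] == "benign") | (meta["ratio"] >= 0.9)]

# 6:2:2 split into training, validation, and test sets, stratified by label.
train, rest = train_test_split(data, test_size=0.4, stratify=data["label"], random_state=42)
val, test = train_test_split(rest, test_size=0.5, stratify=rest["label"], random_state=42)
print(len(train), len(val), len(test))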

4.3. Experimental Methods

The experiments involved malware detection and malware family classification using the proposed model. Based on the model's predictions, we counted the true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs) and calculated the recall, precision, accuracy, and F1 score. The accuracy, (TP + TN)/(TP + FP + TN + FN), is the percentage of correctly predicted samples among all predictions. The precision, TP/(TP + FP), is the percentage of samples predicted as malicious that are actually malicious. The recall, TP/(TP + FN), is the percentage of malicious samples that are predicted to be malicious. The F1 score, 2 × precision × recall/(precision + recall), is the harmonic mean of precision and recall, which are in a trade-off relationship, and addresses the shortcomings of accuracy caused by imbalanced class data [18].
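The metrics above can be computed directly from the prediction results; a short scikit-learn sketch with illustrative labels follows.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 1, 1, 0, 0, 1, 0, 1]        # 1 = malicious, 0 = benign (illustrative labels)
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("Accuracy :", accuracy_score(y_true, y_pred))     # (TP + TN) / (TP + FP + TN + FN)
print("Precision:", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))       # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))           # 2 * precision * recall / (precision + recall)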

4.4. Experimental Results

Table 7 lists the performance of the proposed models for each fixed length of the dynamic data. 'Detect–Classify' in the 'Methods' column is the result of performing detection and then classification with the proposed model, whereas 'Classify' is the result of classification only. 'Acc' is the accuracy of distinguishing benign from malicious files, and 'Classify-Acc' is the accuracy of the malware family classification. The highest accuracy of the Detect–Classify method was 99.44% when the length of the dynamic data was 1200, its highest classification accuracy was 95.4% at a length of 2000, and its highest F1 score was 0.9953 at a length of 1200. The false-negative rate is also important in the case of malware because detecting a malicious file as benign can harm the system; it is equal to one minus the recall. The lowest false-negative rate was 0.23% at a length of 2000. For Classify, the highest detection accuracy was 99.39% for all lengths except 2000, the highest classification accuracy was 95.28% at 2000, and the highest F1 score was 0.9967 for all lengths except 2000. The average detection accuracy across the four lengths was approximately 99.31% for Detect–Classify and 99.34% for Classify, with average classification accuracies of approximately 95.1% and 94.97%, respectively. The average F1 score was approximately 0.9941 for Detect–Classify and 0.9964 for Classify, and the average false-negative rate was approximately 0.49% for Detect–Classify and 0.34% for Classify. The length of the dynamic data used in the experiments varied between programs, resulting in different amounts of information being lost when applying the preprocessing technique. Nevertheless, the experiments show that the results generalize well across different degrees of information loss, indicating that the proposed system can be effectively applied to both detection and classification.
Table 8 presents the prediction times of the proposed model. It lists the longest ('Max Total Time'), shortest ('Min Total Time'), and average ('Avg Total Time') times for detection and classification over 25 randomly sampled data points. The fastest average time for Detect–Classify was 0.1311 s at a length of 400, and for Classify it was 0.0663 s at a length of 1200. Detect–Classify takes roughly twice as long because it performs both detection and classification. This shows that the proposed system can make predictions in about 0.1 s with accuracies of 99.34% and 95.1%.
Table 9 presents a performance comparison with other studies, using the highest performance figures reported in the previous studies and in this paper. The proposed model has a faster prediction speed than the systems based on pre-trained CNN models and achieves a comparable level of accuracy and F1 score.

5. Conclusions

In this study, a malicious code detection and classification system using static and dynamic data was proposed. The dynamic data were preprocessed using TF-IDF and a sliding window method to fit the input of the deep learning model, and the static data were preprocessed by converting the binary data into images. The preprocessed data were detected and classified using the CNN-BiLSTM model, which showed a high average accuracy and F1 score as well as a low average false-negative rate. We also measured the prediction time of the model and showed that it can determine whether a program is malicious with fewer parameters and a faster prediction speed than a pre-trained model. This allows malicious programs to be detected and classified before they affect a system, enabling a quick and accurate response. However, the classification accuracy of the proposed model was low compared with that obtained in previous studies. In future research, we intend to study how to improve the system's classification accuracy.

Author Contributions

H.K. and M.K. performed the experiments. H.K. developed and evaluated the proposed system. M.K. supervised the design and development of the technique proposed in this work and guided this work as a corresponding author. All authors have read and agreed to the published version of the manuscript.

Funding

This study received no external funding.

Data Availability Statement

The PE Malware Machine Learning dataset is available at https://practicalsecurityanalytics.com/pe-malware-machine-learning-dataset/ (accessed on 30 April 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. O’Kane, P.; Sezer, S.; McLaughlin, K. Obfuscation: The Hidden Malware. IEEE Secur. Priv. 2011, 9, 41–47. [Google Scholar] [CrossRef]
  2. Azeez, N.A.; Odufuwa, O.E.; Misra, S.; Oluranti, J.; Damaševičius, R. Windows PE Malware Detection Using Ensemble Learning. Informatics 2021, 8, 10. [Google Scholar] [CrossRef]
  3. O’Shea, K.; Nash, R. An Introduction to Convolutional Neural Networks. arXiv 2015, arXiv:1511.08458. [Google Scholar]
  4. Vasan, D.; Alazab, M.; Wassan, S.; Safaei, B.; Zheng, Q. Image-Based malware classification using ensemble of CNN architectures (IMCEC). Comput. Secur. 2020, 92, 101748. [Google Scholar] [CrossRef]
  5. Kumar, S.; Panda, K. SDIF-CNN: Stacking deep image features using fine-tuned convolution neural network models for real-world malware detection and classification. Appl. Soft Comput. 2023, 146, 110676. [Google Scholar] [CrossRef]
  6. Naeem, H.; Dong, S.; Falana, O.J.; Ullah, F. Development of a deep stacked ensemble with process based volatile memory forensics for platform independent malware detection and classification. Expert Syst. Appl. 2023, 223, 119952. [Google Scholar] [CrossRef]
  7. Yadav, P.; Menon, N.; Ravi, V.; Vishvanathan, S.; Pham, D.T. A two-stage deep learning framework for image-based Android malware detection and variant classification. Comput. Intell. 2022, 38, 1748–1771. [Google Scholar] [CrossRef]
  8. Gómez, A.; Muñoz, A. Deep Learning-Based Attack Detection and Classification in Android Devices. Electronics 2023, 12, 3253. [Google Scholar] [CrossRef]
  9. Kim, H.; Kim, M. Malware Detection System Based on Static-Dynamic Preprocessing Techniques Combined in an Ensemble Model. In Proceedings of the 15th International Conference on Computer Science and Its Applications, Nha Trang, Vietnam, 18–20 December 2023; not yet published. [Google Scholar]
  10. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  11. Han, X.; Zhang, Z.; Ding, N.; Gu, Y.; Liu, X.; Huo, Y.; Qiu, J.; Yao, Y.; Zhang, A.; Zhang, L.; et al. Pre-trained models: Past, present and future. AI Open 2021, 2, 225–250. [Google Scholar] [CrossRef]
  12. PE Format. Available online: https://learn.microsoft.com/en-us/windows/win32/debug/pe-format (accessed on 10 April 2024).
  13. Nataraj, L.; Karthikeyan, S.; Jacob, G.; Manjunath, B.S. Malware images: Visualization and automatic classification. In Proceedings of the 8th International Symposium on Visualization for Cyber Security, Pittsburgh, PA, USA, 20 June 2011; pp. 1–7. [Google Scholar]
  14. Cuckoo Sandbox—Automated Malware Analysis. Available online: https://cuckoo.readthedocs.io/en/latest/ (accessed on 10 April 2024).
  15. Kim, M.; Kim, H. A Dynamic Analysis Data Preprocessing Technique for Malicious Code Detection with TF-IDF and Sliding Windows. Electronics 2024, 13, 963. [Google Scholar] [CrossRef]
  16. Graves, A.; Schmidhuber, J. Framewise Phoneme Classification with Bidirectional LSTM Networks. In Proceedings of the International Joint Conference on Neural Networks, Montreal, Canada, 31 July–4 August 2005; pp. 2047–2052. [Google Scholar]
  17. PE Malware Machine Learning Dataset. Available online: https://practicalsecurityanalytics.com/pe-malware-machine-learning-dataset/ (accessed on 10 April 2024).
  18. VirusTotal. Available online: https://www.virustotal.com/gui/home/upload (accessed on 7 June 2024).
  19. GitHub Repository. Available online: https://github.com/haesookimDev/MalDetectIntegrantedSystem/tree/main/Data (accessed on 10 April 2024).
Figure 1. Structure of the proposed system.
Figure 2. Structure of a PE file.
Figure 3. Static data preprocessing example.
Figure 4. Dynamic data preprocessing example.
Figure 5. Example of calculation in the convolutional layer.
Figure 6. Structure of the LSTM layer.
Figure 7. Structure of the bidirectional LSTM layer.
Figure 8. Example of a concatenate layer.
Figure 9. Example of DNN output.
Table 1. CNN block of the detection model.
Layer | Parameters | Values | Output
Input Layer |  | 256, 256, 1 | 256, 256, 1
Convolutional Layer_1 | filter | 32 | 256, 256, 32
 | kernel_size | 3, 3 |
 | strides | 1 |
 | activation | Rectified Linear Unit (ReLU) |
Max Pooling Layer_1 | pool_size | 2, 2 | 128, 128, 32
Convolutional Layer_2 | filter | 64 | 128, 128, 64
 | kernel_size | 3, 3 |
 | strides | 1 |
 | activation | ReLU |
Max Pooling Layer_2 | pool_size | 2, 2 | 64, 64, 64
Convolutional Layer_3 | filter | 128 | 64, 64, 128
 | kernel_size | 3, 3 |
 | strides | 1 |
 | activation | ReLU |
Max Pooling Layer_3 | pool_size | 2, 2 | 32, 32, 128
Convolutional Layer_4 | filter | 256 | 32, 32, 256
 | kernel_size | 3, 3 |
 | strides | 1 |
 | activation | ReLU |
Max Pooling Layer_4 | pool_size | 2, 2 | 16, 16, 256
Convolutional Layer_5 | filter | 512 | 16, 16, 512
 | kernel_size | 3, 3 |
 | strides | 1 |
 | activation | ReLU |
Max Pooling Layer_5 | pool_size | 2, 2 | 8, 8, 512
Dropout Layer | rate | 0.2 | 8, 8, 512
Flatten Layer |  |  | 32,768
Table 2. LSTM block of the detection model.
Layer | Parameters | Values | Output
Input Layer |  | 1, Target length | 1, Target length
Bidirectional Layer 1 (LSTM) | units | 120 | 1, 240
 | return_sequences | True |
Bidirectional Layer 2 (LSTM) | units | 120 | 240
Table 3. Ensemble block and output of the detection model.
Layer | Parameters | Values | Output
Concatenate Layer |  | 32,768; 240 | 33,008
Batch Normalization Layer 1 |  |  | 33,008
Dense Layer 1 | units | 512 | 512
 | activation | ReLU |
Batch Normalization Layer 2 |  |  | 512
Dense Layer 2 | units | 256 | 256
 | activation | ReLU |
Batch Normalization Layer 3 |  |  | 256
Dense Layer 3 | units | 128 | 128
 | activation | ReLU |
Dense Layer 4 | units | class_num | class_num
 | activation | sigmoid or softmax |
Table 4. System specifications for experimentation.
Hardware | Specification
CPU | Intel Xeon(R) Silver 4215R 3.20 GHz
RAM | 256 GB DDR4
GPU | RTX QUADRO A6000
VRAM | 48 GB GDDR6
Table 5. Versions of the libraries used in the experiments.
Library | Version
Python | 3.7.13
TensorFlow | 2.7.0
Scikit-learn | 1.0.2
NumPy | 1.21.6
Pandas | 1.3.5
Table 6. Distribution of data by malware type.
Type | Count
Trojan.fareit | 3095
Trojan.vilsel | 2308
Virus.ramnit | 2113
Worm.allaple | 1524
Virus.virut | 1478
Trojan.crcf | 1248
Virus.sality | 1047
Virus.parite | 983
Table 7. Performance metrics for fixed lengths of dynamic data in the proposed models.
Methods | Lengths | Acc (%) | Classify-Acc (%) | Recall (%) | Precision (%) | F1 score
Detect–Classify | 400 | 99.27 | 94.93 | 99.34 | 99.42 | 0.9938
Classify | 400 | 99.39 | 95.21 | 99.6 | 99.74 | 0.9967
Detect–Classify | 800 | 99.17 | 95.08 | 99.45 | 99.13 | 0.9929
Classify | 800 | 99.39 | 94.87 | 99.74 | 99.6 | 0.9967
Detect–Classify | 1200 | 99.44 | 94.97 | 99.6 | 99.45 | 0.9953
Classify | 1200 | 99.39 | 94.49 | 99.46 | 99.89 | 0.9967
Detect–Classify | 2000 | 99.34 | 95.4 | 99.67 | 99.2 | 0.9944
Classify | 2000 | 99.18 | 95.28 | 99.78 | 99.34 | 0.9956
Table 8. Prediction times for fixed lengths of dynamic data in the proposed models.
Methods | Lengths | Max Total Time (s) | Min Total Time (s) | Avg Total Time (s)
Detect–Classify | 400 | 0.1515 | 0.1121 | 0.1311
Classify | 400 | 0.083 | 0.058 | 0.0697
Detect–Classify | 800 | 0.1548 | 0.1154 | 0.1351
Classify | 800 | 0.0824 | 0.0597 | 0.071
Detect–Classify | 1200 | 0.1611 | 0.1157 | 0.1336
Classify | 1200 | 0.0803 | 0.0585 | 0.0663
Detect–Classify | 2000 | 0.1581 | 0.1152 | 0.1399
Classify | 2000 | 0.0839 | 0.0577 | 0.0668
Table 9. Comparison with other literature studies.
Reference | Methods | Dataset | Detect Acc (%) | Classification Acc (%) | F1 score | Time (s)
Azeez et al. [2] | PCA + MLP, CNN | Benign: 5012; Malware: 14,599 | 1.0 | N/A | 1.0 | N/A
Vasan et al. [4] | Image + VGG16, ResNet-50; PCA + SVM | Malware: 9339 | N/A | 99.5 | N/A | 1.18
Kumar et al. [5] | Image + VGG16, VGG19, ResNet-50, InceptionV3 | Malware: 9339 | N/A | 98.55 | 0.99 | 0.471
Naeem et al. [6] | Image, LBP, GLCM + CNN | Benign: 608; Malware: 3686 | 99.1 | N/A | 0.99 | 1.82
Yadav et al. [7] | Image + EfficientNetB0, SVM, Random Forest | Benign: 4826; Malware: 2486 | 100 | 89 | 1.0 | N/A
Gomez et al. [8] | Static Feature + Machine Learning | Malware: 16,890 | 99.01 | N/A | 0.986 | N/A
This Paper | Image + CNN; API call sequence + LSTM | Benign: 9756; Malware: 13,796 | 99.44 | 95.4 | 0.9967 | 0.0663
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
