Malware Detection Using Deep Learning and Correlation-Based Feature Selection

Abstract: Malware is one of the most frequent cyberattacks, with its prevalence growing daily across the network. Malware traffic is always asymmetrical compared to benign traffic, which is always symmetrical. Fortunately, there are many artificial intelligence techniques that can be used to detect malware and distinguish it from normal activities. However, the problem of dealing with large and high-dimensional data has not been addressed enough. In this paper, a high-performance malware detection system using deep learning and feature selection methodologies is introduced. Two different malware datasets are used to detect malware and differentiate it from benign activities. The datasets are preprocessed, and then correlation-based feature selection is applied to produce different feature-selected datasets. Dense and LSTM-based deep learning models are then trained using these different versions of the feature-selected datasets. The trained models are evaluated using several performance metrics (accuracy, precision, recall, and F1-score). The results indicate that some feature-selected scenarios preserve almost the same performance as the original dataset. The different nature of the two datasets leads to different levels of performance change. For the first dataset, the feature reduction ratios range from 18.18% to 42.42%, with performance degradation of 0.07% to 5.84%, respectively. For the second dataset, the reduction rate is between 81.77% and 93.5%, with performance degradation of 3.79% and 9.44%, respectively.


Introduction
Malware has affected a large number of computing devices in the digital age. Malevolent software, or malware, is created with the intention of achieving the harmful goals of a malicious attacker. Malware can attack networks, damage vital infrastructure, compromise computers and smart devices, and steal sensitive data [1].
The modern idea of an information society has evolved thanks to the Internet of Things (IoT) and its applications. However, security issues are a significant barrier to realizing the advantages of this industrial development, as cybercriminals target specific PCs and networks in order to steal private information for financial gain and to disrupt systems [2]. Such attackers use malicious software, or "malware," to exploit system vulnerabilities and pose substantial hazards. Computer software designed to harm the operating system (OS) is known as malware [3]. Malware attacks have increased significantly as the development of mobile technology has transformed our daily interactions. Online learning, social networking, online banking, online shopping, and web browsing are a few examples of services offered by mobile devices while connected to the Internet. Mobile devices have therefore played a key role and have become a necessary part of daily life [4]. In total, 4.78 billion people worldwide were using mobile devices as of 2020 [5]. These mobile devices make life more convenient for consumers, but they are also vulnerable to virus invasion and attacks because of online social networks and services. Mobile malware is capable of disguising itself as ordinary code and then altering any intended program to corrupt and obstruct the operation of the system [5][6][7].
Google Play offers a permission-based approach as a security measure to prevent applications from obtaining private data. Based on the assets that the application accesses, this mechanism prompts users prior to installation. Before moving forward with the installation, users must expressly accept the agreement. Unfortunately, the Google Play method cannot fully safeguard users because they tend to accept the agreement without carefully reading the requested permissions [5,8]. Another threat can come from attackers profiting off successful Android apps, as seen in the more than 10-fold increase in Android malware detections between 2012 and 2018 [9]. Furthermore, over 12,000 brand-new Android malware samples were found every day in 2018. In addition to the rapid proliferation of malware, the recently revealed Android malware samples are more advanced than the ones that first surfaced a few years ago in terms of escaping anti-virus monitoring through coding and encryption [10,11].
Malware detection studies utilizing machine learning are growing in popularity because machine learning is a successful strategy that can produce a high level of detection accuracy [12]. Some previous studies utilized machine learning (ML) algorithms, which can make decisions after learning from the data templates. Machine learning is the concept of minimizing human intervention in computing systems [13]. Through the use of computer learning methodologies and experience or previous data, machine learning predicts decisions. To analyze the features and track the model, there are supervised and unsupervised learning methods [14,15]. In both cases, the machine learns to distinguish between malicious and benign activities. In supervised learning, the ML model is given the inputs and targets together and learns to match the actual malware patterns with their corresponding "malware" classes and the normal activities with the "normal" classes. The training process is repeated until the model learns to correctly predict all samples [5]. Many ML algorithms have been used to build malware detection systems, such as support vector machines (SVM) [16][17][18], K-nearest neighbor (KNN) [19,20], Bayesian estimation [21,22], and genetic algorithms [23]. Unsupervised learning methods provide the inputs without any targets, and the ML algorithm learns to distinguish between malware and benign samples on its own. Some studies have also fused the supervised and unsupervised learning methodologies [24].
Malware detection is an important security topic with strong associations with firms' legal, reputational, and economic concerns. Deep learning is a good way to build and improve detection mechanisms and to solve many malware detection problems. However, many difficult issues need to be considered when applying deep learning to detection mechanisms. Correlation-based feature selection, the dense layer model, and the LSTM model are presented as three challenging and symmetric ways to affect performance.
In the current research, two different datasets are used. One of them contains a large number of records, while the other consists of a large number of predictors (attributes). Feature selection is applied in many scenarios in order to define the best combination of attributes. The correlation with the target attribute "classification" is used as the feature selection methodology. In the training step, the Dense and LSTM models are used and compared, so that many training scenarios are configured depending on different feature selection criteria, different splitting criteria, and different dataset architectures. Our main contribution is using the efficiency of deep learning and feature selection methodologies in the malware detection field in order to build a robust, powerful, low-computational malware detection system.

Related Work
Some of the previous studies used machine learning (ML) approaches, while others applied deep learning (DL) techniques, including convolutional neural networks (CNN), recurrent neural networks (RNN), and long short-term memory (LSTM) networks [25][26][27]. Some of them used desktop-related malware datasets, but most focused on mobile-related malware datasets.
Vinayakumar et al. [3] applied many machine learning and deep learning models to malware detection. They used the Ember malware dataset, consisting of 70,140 benign and 69,869 malware records. Several ML and DL models were applied (KNN, SVM, Random Forests (RF), AdaBoost, Logistic Regression (LR), Naïve Bayes (NB), and Deep Neural Network (DNN)). They used the Adam optimization algorithm, and the models were trained for 200 epochs. The best result was obtained by the LSTM model with 98.9% accuracy.
A malware detection system based on DL was introduced by Jeon and Moon in 2020 [25]. They used a convolutional encoder to translate the opcode sequences extracted from Windows executable files. Recurrent neural networks (RNNs) were then used for the malware detection process. Their approach achieved 96% detection accuracy and a 95% true positive rate.
In another study by Yazdinejad et al. [26], opcodes for malware and benign activities were extracted from a dataset of 200 benign and 500 malware records. They applied the LSTM model to build a malware detection system using 10-fold cross-validation on the acquired dataset. Their study achieved a detection accuracy of 98%.
Opcodes and system calls were used in a study by Darabian et al. [27]. The collected dataset contains 1500 executable samples, and a CNN-LSTM model was trained using this dataset. The opcode-based records achieved a detection accuracy of 99%, while the system calls achieved only a 95% detection rate.
Hwang et al. [28] proposed a malware detection system based on a malware dataset consisting of 10,000 malware records and 10,000 benign files. The DNN was trained using this dataset (80% for training and 20% for testing). Their proposed system achieved 94% accuracy.
Ban et al. [29] used convolutional neural networks (CNN) for Android malware detection. They used a malware dataset consisting of 28,179 records of the most common malware activities that appeared from 2018 to 2020. The experiments showed that their approach achieved 98% accuracy and a 0.82 F1-score.
The "wrapping feature selection" (WFS) method was proposed in a study by Smmarwar et al. [30]. They used random forest, decision tree, and SVM classifiers. Those classifiers were trained using the optimal number of selected features from the CIC-InvesAndMal2019 malware dataset. The experiments showed that the SVM, RF, and DT models achieved 82.33%, 91.32%, and 91.8% accuracy, respectively.
Toan et al. [31] used the static opcode features of the MIPS ELF malware dataset, consisting of 4511 malware and 4393 benign activities. They used machine learning models on Internet of Things (IoT) platforms for malware detection. Their models achieved an accuracy of 99.8% using only 20 opcodes.
Our study uses the feature selection approach to minimize the number of features (columns), which reduces the subsequent computational time. In addition, the proposed correlation-based approach helps select the best features, which improves performance. The combination of deep learning, high performance, and feature selection results in a robust, low-computational, and powerful malware detection model that was not introduced by any of the previous studies.

Datasets
Two different datasets are used in the current study. The main characteristic of the first dataset is its large number of records, while the main characteristic of the second dataset is its high dimensionality (large number of attributes).
The first dataset was built from the network traffic of a virtual machine on a Unix/Linux-based platform. It covers benign and malware software activity for Android devices and consists of 35 attributes (features) and 100,000 records (50,000 malware records and 50,000 benign ones). The dataset was created for classification and malware detection purposes. Table 1 includes detailed information about the attributes of this dataset. The dataset is available on the Kaggle site [32].
The second dataset is also available on the Kaggle site [33]. It is made up of 215 distinct attributes gathered from over 15,000 Android applications (9476 benign and 5560 malicious) [34]. Figure 1 shows the distribution of the benign and malware classes in this dataset.

Proposed Methodology
In this study, several DL methods are proposed and used. Before the DL models can be trained on the two selected datasets, the datasets need a preprocessing step in which the classification (target) columns are encoded (numbered) and the special characters and missing values are processed. Since the two datasets differ in nature, the preprocessing steps are somewhat different.
After preprocessing, the datasets are split into training and test sets. In some training scenarios, feature selection is performed before training in order to reduce the data dimensionality (and therefore the computational time).
After that, the DL models are built and trained in many training scenarios, including different splitting criteria, different DL architectures, and training with or without feature selection. Figure 2 illustrates the proposed methodology for both datasets.

Correlation-Based Feature Selection
The goal of feature selection is to select the best features of the studied problem in order to reduce computational time. In our study, a correlation-based approach is proposed in order to reduce the high dimensionality, reduce the computational time, and select the best combinations of features so that the performance of the training and evaluation process is increased. Figure 3 illustrates the correlation-based feature selection approach.
For the first dataset, the correlations between all columns and the target column are computed using Equation (1).

$$\mathrm{Corr}_{xy} = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \, \sum_{i}(y_i - \bar{y})^2}} \qquad (1)$$

where $\mathrm{Corr}_{xy}$ is the correlation between feature $x$ and the target feature $y$, $x_i$ and $y_i$ are their values in record $i$, and $\bar{x}$ and $\bar{y}$ are the mean values of $x$ and $y$, respectively. Then a list of columns that can potentially be dropped is prepared. Different selection scenarios can be made since the magnitude of the correlation ranges between 0 and 1. The selection step is based on the number of desired columns, so we keep the K required features and drop the rest. For the second dataset, the same approach is applied except for the selection step. Specific correlation thresholds (T) are used to eliminate columns since the number of columns in the second dataset is 214. So, in the second dataset, the number of selected features depends on the chosen threshold and is not fixed in advance as in the first dataset.
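The top-K selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the Pearson form of the correlation, ranks features by the absolute value of their correlation with the target, and uses toy data with hypothetical column names (`f_strong`, `f_weak`, `f_const`).

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)

def top_k_features(columns, target, k):
    """Keep the k columns with the largest |correlation| to the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Toy data: hypothetical columns, not taken from either dataset.
target = [0, 0, 0, 1, 1, 1]
columns = {
    "f_strong": [1, 2, 1, 8, 9, 8],   # clearly separates the two classes
    "f_weak": [5, 1, 4, 2, 5, 3],     # unrelated to the class
    "f_const": [7, 7, 7, 7, 7, 7],    # constant column
}
selected, scores = top_k_features(columns, target, k=1)
print(selected)
```

Ranking by absolute value is a design choice made here so that strongly negative correlations are not discarded; the source does not state whether signed or absolute values were used.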

Dense Layers Model
In the current study, we suggest using a dense-based architecture with hidden layers of 50 neurons for the first dataset scenarios and 100 neurons for the second dataset scenarios (since the second dataset has 214 attributes while the first one has only 33). The first dense layer is the input layer, with an input size equal to the number of selected features (this number varies depending on each scenario). The activation function of the first layer is the "relu" non-linear function. The next five layers are the hidden dense layers, with 50 cells each and a "relu" activation function. The last dense layer is the output layer, with two outputs and a "softmax" activation function. The "softmax" function is necessary for the last layer since we need an activation function that produces probabilities for all possible outputs (malware and benign); the class with the highest probability is then chosen as the final prediction.
We chose five hidden layers after running many experiments to find the best number of hidden layers. Beyond five hidden layers, the performance stops improving, so we chose five as the number of hidden layers.
Our proposed model is kept very simple in order to minimize the number of learnable parameters. Unlike in previous studies, the number of neurons and the number of hidden layers are determined experimentally in order to find the best combination for the problem being studied.
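To illustrate the role of the "softmax" output layer described above, the following sketch converts two raw output scores into class probabilities and picks the most probable class. The raw scores are hypothetical values chosen for illustration; the class order (index 0 for benign, index 1 for malware) follows the label encoding used in the preprocessing step.

```python
import math

def softmax(logits):
    """Convert raw output-layer scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

CLASS_NAMES = ["benign", "malware"]  # index 0 = benign, index 1 = malware

def predict(logits):
    """Return the class with the highest softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASS_NAMES[best], probs

label, probs = predict([0.3, 2.1])  # hypothetical raw scores
print(label, [round(p, 3) for p in probs])
```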

LSTM Model
For the proposed LSTM model, the first dense layer is replaced by an LSTM layer with a "relu" activation function. The rest of the dense and output layers are the same as in the previous dense model. Replacing the dense layer with the LSTM layer considerably increases the number of learnable parameters. As a result, the training time increases.
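The growth in learnable parameters can be estimated with a short calculation. The sketch below is based on assumptions that the source does not state: the first layer has 50 units in both models, the LSTM is a standard cell (four gates, no peephole connections), and each record is fed to the LSTM as a single time step containing all selected features.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def lstm_params(n_in, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (n_in + units) + units)

def tail_params(units=50, hidden_layers=5, n_classes=2):
    """Hidden dense stack plus the two-output softmax layer."""
    return hidden_layers * dense_params(units, units) + dense_params(units, n_classes)

n_features = 33  # all features of the first dataset
dense_total = dense_params(n_features, 50) + tail_params()
lstm_total = lstm_params(n_features, 50) + tail_params()
print(dense_total, lstm_total)  # 14552 29652
```

Under these assumptions, swapping the first layer roughly doubles the total parameter count, which is consistent with the longer training time per epoch.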

Evaluation Criteria
The evaluation step is the last part, in which performance is assessed using several metrics. In this study, the validation accuracy, test accuracy, training time, precision, recall, and F1-score are used.
The validation accuracy is computed during the training process by testing the model on the validation set. The test accuracy, on the other hand, is calculated after training and measures the trained model's ability to tell the difference between new malware and benign samples.
Four statistics are used to compute the precision, recall, and F1-score: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). TP is the number of malware samples that are correctly classified as malware. FN is the counterpart of TP and represents the number of malware samples that are incorrectly predicted as benign. TN is the number of benign samples that are correctly predicted as benign. FP is the counterpart of TN and is the number of benign samples that are incorrectly classified as malware. The best performance is registered when TP and TN have the highest values or, equivalently, when FP and FN have the lowest values. Precision, recall, and F1-score are calculated as shown in Equations (2)-(4) [35].

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3)$$

$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (4)$$

Precision represents the positive predictive value of the trained model, while recall expresses its sensitivity. A high precision value means that most samples predicted as malware are actually malware (the number of benign samples incorrectly classified as malware is low). A high recall value means that the trained model detects most of the malware samples (the number of malware samples incorrectly predicted as benign is low). High precision and recall result in a high F1-score, which combines both into a single value. These statistics need to be calculated in order to judge a trained model and see how well it works.
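As a worked example of these metrics, the following sketch counts TP, TN, FP, and FN from a small set of labels and derives precision, recall, and F1-score. The label vectors are made up for illustration, with 1 denoting malware (the positive class) and 0 denoting benign.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, and FN for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up labels: 1 = malware, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, tn, fp, fn, precision, recall, f1)
```

Here one malware sample is missed (FN = 1) and one benign sample is flagged (FP = 1), so precision and recall are both 3/4.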

Dataset Preprocessing
The preprocessing step includes the following tasks:

• Handle the special characters by replacing them with "NaN" values.
• Check for the missing values and "NaN" values and replace them.
• Label the target class (classification column) using 0 for benign and 1 for malware.
• Drop column "hash" in the malware dataset.
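The four preprocessing tasks can be sketched in plain Python as shown below. Several details are assumptions because the source does not specify them: the set of special tokens (`"?"`, empty string, `None`), the fill value of 0.0 for missing entries, the textual class labels, and the feature names `feat_a` and `feat_b`, which are hypothetical.

```python
import math

# Assumed placeholder tokens; the actual special characters are dataset-specific.
SPECIAL_TOKENS = {"?", "", None}

def preprocess(rows, label_map, target="classification", drop=("hash",), fill=0.0):
    """Apply the four preprocessing steps to a list of row dictionaries."""
    cleaned = []
    for row in rows:
        out = {}
        for key, value in row.items():
            if key in drop:
                continue  # drop identifier columns such as "hash"
            if key == target:
                out[key] = label_map[value]  # 0 = benign, 1 = malware
                continue
            # Special characters become NaN first ...
            number = math.nan if value in SPECIAL_TOKENS else float(value)
            # ... then every NaN is replaced by the fill value (an assumption).
            out[key] = fill if math.isnan(number) else number
        cleaned.append(out)
    return cleaned

rows = [
    {"hash": "a1", "feat_a": "0", "feat_b": "?", "classification": "malware"},
    {"hash": "b2", "feat_a": "1", "feat_b": "3", "classification": "benign"},
]
clean = preprocess(rows, label_map={"benign": 0, "malware": 1})
print(clean)
```

In practice a dataframe library would typically be used for these steps; the dictionary-based version is only meant to make each step explicit.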

Dataset Split
Split the dataset into training and test sets using three splitting scenarios:
• 80% for training and 20% for testing.
• 75% for training and 25% for testing.
• 70% for training and 30% for testing.
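The three splitting scenarios can be reproduced with a simple shuffled split, sketched below. The fixed random seed and the use of 1000 placeholder records are assumptions made for illustration; the source does not state whether the split was shuffled or stratified.

```python
import random

def train_test_split(rows, test_fraction, seed=42):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(1000))  # placeholder records
sizes = {}
for test_fraction in (0.20, 0.25, 0.30):
    train, test = train_test_split(records, test_fraction)
    sizes[test_fraction] = (len(train), len(test))
    print(f"test={test_fraction:.0%}: {len(train)} training, {len(test)} test records")
```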

Feature Selection
For the first dataset, there are 35 attributes (including the target). The correlation between the target attribute "classification" and the other attributes is computed in order to define the degree of importance of these attributes in the final prediction. Table 2 includes the correlation results of the first malware dataset (in decreasing order).
Table 2 shows that many columns can be excluded since their correlations with the target column (the prediction column) are very weak.
For the first malware dataset, we run experiments based on four different scenarios. All of these scenarios are derived by dropping some columns (features). The dropping approach is based on the correlation between each column (feature) and the target column (class). Those correlations are listed in Table 2 for dataset 1 and Table 3 for dataset 2.
For the second "Android malware dataset," the correlation between the target column and the other attributes of the dataset was also computed. Because of the large number of attributes in the second dataset, we followed a feature selection technique based on various correlation thresholds [36].
Using a correlation threshold of 0.1, only 27 columns out of 214 are selected (after removing the classification column). Table 3 shows the 27 selected attributes (columns) with their corresponding correlations. With a correlation threshold of 0.2, the number of selected columns is 14. As shown in Table 3, almost half of the columns are dropped at a threshold of 0.2. Using a threshold of 0.5 results in only 39 features, a selection rate of 18.22%.
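Threshold-based selection for the second dataset can be sketched as follows. The feature names and correlation values are hypothetical and serve only to show the mechanism; the comparison on absolute values is also an assumption. The final lines recompute the selection rates for 39 and 14 features out of 214.

```python
def select_by_threshold(correlations, threshold):
    """Return the features whose |correlation| with the target is >= threshold."""
    return [name for name, corr in correlations.items() if abs(corr) >= threshold]

# Hypothetical correlation values for illustration only.
correlations = {"perm_a": 0.62, "perm_b": -0.31, "api_c": 0.15, "api_d": 0.04}

for t in (0.1, 0.2, 0.5):
    kept = select_by_threshold(correlations, t)
    rate = 100 * len(kept) / len(correlations)
    print(f"T={t}: kept {len(kept)} of {len(correlations)} features ({rate:.1f}%)")

# Selection rates for the second dataset (214 candidate features).
rate_39 = round(100 * 39 / 214, 2)
rate_14 = round(100 * 14 / 214, 2)
print(rate_39, rate_14)
```

With this rule, raising the threshold can only keep the same number of features or fewer, since every feature that passes a higher threshold also passes a lower one.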

Training Scenarios
In the training step, many training scenarios are defined based on several factors (with or without feature selection, with or without an LSTM layer, different feature selection thresholds, different splitting criteria, etc.). A total of 12 different training scenarios are run in order to identify the effects of using different datasets, different feature selection options, different splitting criteria, and different DL architectures.

Experimental Results
In this section, all training scenarios will be evaluated, and the results will be introduced and discussed.

Results of the First Six Training Scenarios of the First Malware Dataset
In these six scenarios, training is performed for 20 epochs with a batch size of 100 using the "Adam" optimization algorithm. The sparse categorical cross-entropy loss function is used, and the validation set is taken from the training set (20% of the training set is chosen as the validation set). This validation set is used throughout the training process to validate the model and ensure that training is progressing correctly. Table 4 includes the detailed results of the first six scenarios of the first malware dataset. Table 4 shows that removing some features barely affects performance. Reducing the features from 33 to 27 (a reduction rate of 18.18%) decreases the validation accuracy by 0.07% and the test accuracy by 0.45%, while the precision, recall, and F1-score remain the same. The training time is reduced by 0.1 s per epoch. With a reduction rate of 27.27%, the validation and training accuracy decrease by only 0.04% and 0.09%, respectively. Reducing the features further to 21 (a 36.36% reduction rate) lowers the validation accuracy by 0.24% and the test accuracy by 0.21%. The highest reduction rate is 42.42% (14 features removed, leaving 19), which reduces the validation accuracy and the test accuracy by 5.84% and 6.21%, respectively. The training time is also reduced by 0.1 s per epoch. Using LSTM as the first layer of the dense-based DL model improves the validation and test accuracies by 0.05% and 0.44%, respectively. However, the computational time also increases by 0.95 s per epoch. The training and validation accuracy and loss curves of the first six scenarios are shown in Figure 4.
In every scenario except the last one (14 of the 33 features removed), the accuracy curves reach 95% of their final values after the second epoch.
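The reduction rates quoted for the first dataset follow directly from the number of features kept out of the original 33, as the short calculation below shows. The count of 24 kept features for the 27.27% scenario is inferred from the rate itself rather than stated in the text.

```python
def reduction_rate(original, kept):
    """Percentage of features removed from the original feature set."""
    return 100 * (original - kept) / original

rates = {kept: round(reduction_rate(33, kept), 2) for kept in (27, 24, 21, 19)}
for kept, rate in rates.items():
    print(f"{kept} of 33 features kept -> {rate:.2f}% reduction")
```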

Results of the First Five Training Scenarios of the Second Malware Dataset
For the second dataset, the feature selection scenarios are performed with the same DL model and the same training parameters as for the first malware dataset. The results are given in Table 5. Table 5 shows that adding an extra LSTM layer does not change the performance, but the model takes more time to train.
Reducing the second malware dataset to only 39 features (18.22% of the original features, an 81.77% reduction rate) lowers the validation accuracy by only 3.79% and the test accuracy by 3.59%. Other metrics such as precision, recall, and F1-score decrease by 3%, 4.7%, and 4%, respectively. All metrics show that a large reduction of the feature space (dimension reduction) does not degrade performance at the same rate. This means that some features are not essential and can be dropped.
With aggressive feature reduction (keeping only 6.54% of the second dataset's features), the validation and test accuracies decrease by 9.44% and 9.98%, respectively (i.e., important features are dropped). Figure 5 shows the accuracy and loss curves of the training and validation sets for the five scenarios of the second dataset.
The first two curves (Figure 5A,B) show the best performance and correspond to the original dataset features and the LSTM scenario. The third curve corresponds to the 39-feature scenario and shows some degradation compared to the previous two curves. Figure 5D,E show further degradation in performance (these curves correspond to the final two scenarios, with 27 and 14 features selected out of 214, respectively).

Results of Using Many Split Criteria for Both Malware Datasets
In these scenarios, the ratio used to split each malware dataset into training and test sets is evaluated. For the first malware dataset, the "12 features-selected" scenario is used as the basis for three experiments in which the test set share is changed from 20% to 25% and then 30%; the results are shown in Table 6. Table 6 shows that increasing the test set percentage decreases the performance. The main problem with using more test samples appears in the recall of the "0-class" samples (the benign samples) and the precision of the "1-class" samples for the 25% split scenario, as shown in Figure 6a. The same conclusion holds for the 30% split scenario (as shown in Figure 6b). To conclude, the best splitting scenario uses 20% of the data as the test set, and any further increase in test samples affects either the acceptance rate or the rejection rate (recall and precision).
For the second dataset, the same splitting scenarios are used based on the "39 features selected" scenario. Table 7 illustrates the results of these splitting scenarios. Changing the splitting criteria of the second dataset shows less performance instability than in the first dataset. The first reason is the different nature of the two datasets: the first dataset has a large number of samples with a small number of features, while the second dataset has a small number of samples with a large number of features. The second reason is that the feature selection techniques have different effects on the two datasets.
Several feature selection approaches were used in previous studies. Gumaa [37], for example, divided their dataset into three categories based on graph-based feature selection, obtaining 110 features (51.4%) for permission-only types, 73 features (33.95%) for API calls, and 182 features (85%) for permissions and API calls. Their approach achieved recall values of 95%, 96%, and 97.3%, respectively. In our study, using the same dataset, we applied three different scenarios, obtaining 39, 27, and 14 features, respectively (18.22%, 12.61%, and 6.54% of the entire dataset). Our proposed methodology is more effective in selecting the appropriate features of the dataset. Moreover, the recall value of our models using the proposed feature selection approach on the second dataset was 94.1% (very close to the recall of 95% in [37], although their approach selected more features than ours did).
In the study of Smmarwar et al. [30], the wrapping feature selection (WFS) method was proposed. The study applied the proposed approach to the CIC-InvesAndMal2019 malware dataset. The SVM, RF, and DT models were trained using the selected features and achieved 82.33%, 91.32%, and 91.8% accuracy, respectively.
In a recent publication, Smmarwar et al. [38] used the Binary Grey Wolf Optimization (BGWO)-based meta-heuristic feature selection algorithm to select the best combination of features in a malware dataset. The approach in [38] is powerful, but the heuristic algorithm takes considerable computational time. Our methodology is simple and takes less than a second to compute the correlations. They achieved accuracies of 70.64%, 65.44%, 59.93%, and 83.49% on the feature-selected version of the malware dataset.
A detailed comparison between the current research and previous studies is given in Table 8, and another detailed comparison between our methodology and previous ML approaches that worked on the same dataset is given in Table 9. Table 8 shows that the current study's performance exceeds the results of most other related works. The dataset used is also larger than the datasets of most other studies. In addition, our study uses two datasets with different specifications under different feature selection scenarios.

Figure 1. The distribution of Benign (B) and Malware (S) records of the malware dataset.

The training scenarios of the first malware dataset are:
• Train a dense layer-based DL model using the original dataset and different selected groups of dataset features (there are four groups and the original dataset, which means five different scenarios).
• Train the modified DL model (with an added LSTM layer) using the first set of selected features.
• Train the DL model using three different splitting criteria (two new scenarios).
The training scenarios of the second malware dataset are:
• Train a dense layer-based DL model with the original dataset and the three groups of selected features (four different scenarios).
• Train the modified DL model (with an added LSTM layer) using the main dataset.
• Train the DL model using different splitting criteria (two new scenarios).

Figure 4. Training/validation accuracy and loss of the six scenarios of the first dataset: the six scenarios are labeled (A-F).

Figure 5. Training/validation accuracy and loss of the five scenarios of the second dataset: the five scenarios are labeled (A-E).

Table 1. First malware dataset description.
Second Dataset (Android Malware Dataset for Machine Learning)
Hash' columns are already removed. The dropped columns are chosen from the columns whose correlation with the target column is low.

Table 2. Correlation between target column (classification) and the closest 20 columns of the malware dataset.

Table 3. Correlation between second malware dataset columns and the target column using a correlation threshold of 0.1.

Table 4. The evaluation results of the first six scenarios of the first malware dataset.

Table 5. The evaluation results of the first five scenarios of the second malware dataset.

Table 6. The split evaluation results of the first six scenarios of the first malware test dataset.

Table 7. The evaluation results of the first six scenarios of the second malware test dataset.

Table 8. Comparison between the current research and related work.