Ensemble-Based Classification Using Neural Networks and Machine Learning Models for Windows PE Malware Detection

Abstract: The security of information is among the greatest challenges facing organizations and institutions. Cybercrime has risen in frequency and magnitude in recent years, with new ways to steal, change and destroy information or disable information systems appearing every day. Malware is among the means of penetrating information systems where confidential information is processed: an attacker injects malware into a computer system, after which he gains full or partial access to critical information in the information system. This paper proposes an ensemble classification-based methodology for malware detection. The first-stage classification is performed by a stacked ensemble of dense (fully connected) and convolutional neural networks (CNN), while the final-stage classification is performed by a meta-learner. For the meta-learner, we explore and compare 14 classifiers. For a baseline comparison, 13 machine learning methods are used: K-Nearest Neighbors, Linear Support Vector Machine (SVM), Radial Basis Function (RBF) SVM, Random Forest, AdaBoost, Decision Tree, ExtraTrees, Linear Discriminant Analysis, Logistic Regression, Neural Net, Passive Classifier, Ridge Classifier and Stochastic Gradient Descent classifier. We present the results of experiments performed on the Classification of Malware with PE headers (ClaMP) dataset. The best performance is achieved by an ensemble of five dense and CNN neural networks with the ExtraTrees classifier as a meta-learner.


Introduction
Many aspects of society have shifted online with the broad adoption of digital technology, from entertainment and social interactions to business, industry and, unfortunately, crime as well. Cybercrime has risen in frequency and magnitude in recent years, with a projection of reaching USD 6 trillion by 2021 (up from USD 3 trillion in 2015) [1], and is also overtaking conventional crime both in number and revenues [2]. Additionally, these new cyber-attacks have become more complex [3], generating elaborate multi-stage attacks. By the end of 2018, about 9599 malicious packages appeared per day [4]. Such attacks have also resulted in significant damage and major financial losses. Up to USD 1 billion was stolen from financial institutions around the world in two years due to malware [5]. In addition, Kingsoft estimated that between 2 and 5 million computers were attacked each day [6]. With cybercrime revenues reaching USD 1.5 trillion in 2018 [7] and cybercrime's global cost predicted to reach USD 6 trillion by 2021 [8], addressing cyber threats has become an urgent issue.
Moreover, the COVID-19 pandemic has delivered an extraordinary array of cybersecurity challenges, as most services have moved to online and remote modes, raising the danger of cyberattacks and malware [9,10]. In the healthcare sector especially, cyber-attacks can compromise sensitive personal patient data, while data tampering can lead to incorrect treatment, with irreparable damage to patients [11].
Today, computer programs and applications are developed at high speed. Malicious software (malware) has appeared in many formats, continues to grow and is becoming increasingly sophisticated. Computer criminals use it as a tool to infiltrate, steal or falsify information, causing huge damage to individuals and businesses and even threatening national security. Malware (malicious software) is the generic term used to describe all the various types of unauthorized software programs, including viruses, worms, Trojans, spyware [12], Android malicious apps [13], bots, rootkits [14] and ransomware [15]. Cybercriminals have weaponized malware to achieve their objectives. Malware has been used to conduct a wide variety of security threats, such as stealing confidential data, stealing cryptocurrency, sending spam, crippling servers, penetrating networks and overloading critical infrastructures. While large numbers of malware samples have been identified and blocked by cybersecurity service providers and antivirus software manufacturers, a significant number of malware samples have been created or mutated (e.g., "zero-day" malware [16]) and appear to evade conventional anti-virus scanning tools based on signatures. As these techniques are primarily based on modifications of signature-based models, this has caused the information security industry to reconsider its malware recognition techniques.
Malware detection methods can be classified into methods based on signatures and on behavior. Currently, signature-based malware detectors work effectively with previously known malware that has already been detected by some anti-malware vendors. However, they cannot detect polymorphic malware that can change its signatures, or new malware whose signatures have not yet been created. One solution to this problem is to use heuristic analysis in combination with machine learning techniques, which provide higher detection efficiency. As practice has shown, the traditional approach to malware detection, which is based on signature analysis [17], is not suitable for detecting unknown computer viruses. To maintain the proper level of protection, users are forced to update anti-virus databases constantly and in a timely fashion. However, the delay in the response from the anti-virus companies to the emergence of new malware (its detection and signature creation) can vary from several hours to several days. During this time, new malicious programs can cause irreparable damage.
To address this problem, heuristic analysis is used in addition to the signature approach. Here, a file can be considered "potentially dangerous" with some probability based on its behavior (dynamic approach) or the analysis of its structure (static approach). Static analysis generally consists of two main stages: the training stage and the stage of using the results (detection of virus programs). At the training stage, a sample of infected (virus) and "clean" (legitimate) files is formed. In the structure of the files, certain signs characterize each of them as viral or legitimate. As a result, a list of feature characteristics is compiled for each file. Next, the most significant (informative) features are selected, and redundant and irrelevant features are discarded. At the detection stage, feature characteristics are extracted from the scanned file. Heuristic algorithms developed specifically to detect unknown malware are characterized by a high error rate. Heuristic-based detection uses rules formulated by experts to distinguish between malicious and benign files. Additionally, behavior-based, model checking-based and cloud-based methods have performed effectively in malware detection [18].
Modern research in the area of information security is aimed at creating protection methods and algorithms that are able to detect and neutralize unknown malware, and thus not only increase computer security but also free the user from constant updates of antivirus software. The size of gray lists is constantly growing with the advancement of malware writing and production techniques. Intelligent methods for automatically detecting malware are, therefore, urgently required. As a result, several studies have been published on the development of smart malware recognition systems using artificial intelligence methods [19][20][21][22].
A prerequisite for creating effective anti-virus systems is the development of artificial neural network (ANN)-based technologies. The ability of such systems to learn and generalize results makes it possible to create smart information security systems. Artificial intelligence (AI) has several advantages when it comes to cybersecurity: AI can discover new, previously unknown attacks; AI can handle a high volume of data; and AI-based cybersecurity systems can learn over time to respond better to threats [23].
This study aims to implement an ensemble of neural networks for the detection of malware. The novel contributions of this paper are the following: (1) a detailed experimental analysis and verification of machine learning and deep learning methods for malware recognition, performed on the Classification of Malware with PE headers (ClaMP) dataset; (2) a novel ensemble learning-based hybrid classification framework for malware detection with a heterogeneous batch of convolutional neural networks (CNNs) as base classifiers and a machine learning algorithm as a final-stage classifier, which improves malware detection accuracy; (3) an extensive ablation study to select CNN model architectures and a machine learning algorithm for the best overall malware detection performance.
The other parts of this study are structured as follows. In Section 2, related works are discussed, including a critique of existing methods and approaches. Section 3 describes the methodology used in this paper. Section 4 discusses the implementation and the results obtained. Section 5 presents the conclusion of the study.

Related Works
Malware search algorithms are divided into two classes based on the method of collecting information: dynamic and static. In static analysis, suspicious objects are examined without running them, based on the assembly code and attributes of executable files [24]. Dynamic analysis algorithms work either with already running programs or run them themselves in an isolated environment, exposing the information that arises in the course of operation: they analyze the behavior of the program, sections of code and data, and monitor resource consumption [25]. According to the type of objects detected, malware search algorithms are divided into signature-based and anomaly-based ones. Signature-based programs aim to match the signatures of malware. Anomaly detection algorithms seek to describe legitimate programs and learn to look for deviations from the norm.
At the same time, machine learning is widely used as a powerful tool for security experts to identify malicious programs with high accuracy, as the number of malicious programs has grown large and their variants have become diverse. Among the main methods is Windows Portable Executable 32-bit (PE32) file header analysis [26]. For example, Nisa et al. [27] transformed malware code into images and applied segmentation-based fractal texture analysis for feature extraction. Deep neural networks (AlexNet and Inception-v3) were used for classification. Previously, the use of ensemble methods, such as random forest and extremely randomized trees, allowed the improvement of the performance of machine learning models in detecting malware in Internet of Things (IoT) environments [28] and Wireless Sensor Networks (WSN) [29].
To solve the above-mentioned problems, more general and robust methods are, therefore, required. Researchers are creating numerous ensemble classifiers [38][39][40][41][42] that are less susceptible to flaws in malware feature collection. Ensemble methods [43] are a class of techniques that incorporate several learning algorithms to enhance the precision of the overall prediction. To minimize the risk of overfitting to the training data, these ensemble classifiers integrate several classification models. In this way, the training dataset can be used more effectively, and generalization efficiency can be increased as a result. While several ensemble classification models have already been developed, there is still space for researchers to improve the accuracy of sample classification, which would be useful for improving malware detection.
Therefore, this paper proposes an ensemble learning-based approach that uses fully connected and convolutional neural networks as base learners for malware detection.

Materials and Methods
Malware developers are primarily focused on targeting computer networks and infrastructure to steal information, make financial demands or prove their potential. The standard approaches for detecting malware are effective in detecting known malware; via these approaches, however, new malware can never be blocked. The latest machine learning platforms [44] have significantly enhanced the identification capability of models used for malware detection. It is possible to detect malware using machine learning methods in two steps, namely, extracting features from the input data and choosing the important ones that best represent the data, and then classifying/clustering. The proposed technology is focused on machine learning that can learn to discern malicious and benign files, as well as make reliable predictions for new files that have not been seen before.
The phases involved in achieving the final solution are (1) data processing and feature selection and (2) model engineering, which includes the following steps: data selection and scaling, dimensionality reduction, ANN model exploration and meta-learner classifier selection, ensemble model development, model testing and performance evaluation. Figure 1 shows the flow of the stages involved in the system methodology up to the model evaluation stage, beginning with data selection, which is described in more depth in the following subsections.

Data Collection and Processing
For machine learning to be a success, the selection of a representative dataset is necessary. This is because it is important to train a machine learning algorithm on a dataset that correctly represents the conditions of the model's real-world applications.
For this model, the gathered dataset contains malicious and benign data from the Classification of Malware with PE headers (ClaMP) dataset, obtained from GitHub. We used the ClaMP_Integrated dataset, which has 2722 malware and 2488 benign instances. However, we used only 68 features (all numerical), because the remaining feature, "packer_type", is a string and was excluded. The numerical features were scaled using the standard scaling method. These features, along with the class label (0 for benign and 1 for malicious), were used to build the ensemble classification model.
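The scaling step described above can be sketched with scikit-learn's `StandardScaler`; the toy matrix below merely stands in for the 68 numeric ClaMP features and is not the actual dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the numeric ClaMP feature matrix (values are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(10, 4))   # 10 samples, 4 features
y = np.array([0, 1, 0, 1, 1, 0, 1, 0, 1, 0])       # 0 = benign, 1 = malicious

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each column: zero mean, unit variance

print(np.allclose(X_scaled.mean(axis=0), 0.0))   # True
print(np.allclose(X_scaled.std(axis=0), 1.0))    # True
```

The fitted scaler's mean and variance would be reused (via `scaler.transform`) on any held-out test fold, so test statistics never leak into training.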

Dimensionality Reduction
Machine learning methods are commonly used to solve a variety of estimation and classification problems. Poor machine learning performance can be caused by overfitting or underfitting. Removing unimportant characteristics guarantees the algorithms' optimal efficiency and improves speed. Principal Component Analysis (PCA) was applied to perform attribute dimensionality reduction. Based on previous studies, 40 features were chosen to be passed into the machine learning model (representing 95% of the total variability in the dataset), because these features are critical for the neural network to learn whether a file is malicious or benign.
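A minimal sketch of this reduction step with scikit-learn (random data stands in for the scaled ClaMP features; on the real dataset the 95%-variance criterion yields the 40 components cited above, while on random data the count will differ). Passing a float to `n_components` asks PCA for the smallest number of components retaining that fraction of variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 68))   # stand-in for 68 scaled ClaMP features

# Keep the smallest number of principal components that
# explains at least 95% of the total variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape[0])                             # 200 samples preserved
print(pca.explained_variance_ratio_.sum() >= 0.95)    # True by construction
```

Alternatively, `PCA(n_components=40)` fixes the component count directly, matching the paper's configuration.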

Deep Learning Models
As deep learning models, we considered fully connected (FC) multilayer perceptron (MLP) and one-dimensional convolutional neural networks (1D-CNN), which are discussed in detail below.

Multilayer Perceptron
As a baseline approach, we adopted a simple multilayer perceptron (MLP). Let the output of the MLP be $y_t$ at the input $x_t$, where $x_t$ is a vector with components $x_{t,1}, x_{t,2}, \ldots, x_{t,n}$, $t$ is the number of the sequence value and $t = \overline{1, T}$ ($T$ is predetermined).
The goal is to find model parameters $w_{ji}$, $b_j$, $v_j$ and $b_0$, $j = \overline{1, h}$, such that the model output $\hat{y}_t$ and the real output $y_t$ of the MLP are as close as possible. The relationship between the input and output of a two-layer perceptron is established by the following expression, which describes a perceptron with one hidden layer, able to approximate any continuous function defined on a bounded set:
$$\hat{y}_t = \varphi_0\!\left(\sum_{j=1}^{h} v_j\, \varphi\!\left(\sum_{i=1}^{n} w_{ji}\, x_{t,i} + b_j\right) + b_0\right),$$
where $\varphi$ and $\varphi_0$ are the hidden-layer and output activation functions, respectively.
Training of the MLP occurs by applying a gradient-descent algorithm (such as error backpropagation), similarly to a single-layer perceptron.
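The forward pass of such a one-hidden-layer perceptron can be sketched in NumPy (the weights here are random placeholders, and the tanh/sigmoid activation pair is an assumption chosen for illustration, not the paper's configuration):

```python
import numpy as np

def mlp_forward(x, W, b, v, b0):
    """One-hidden-layer perceptron: p = sigmoid(v . tanh(W x + b) + b0)."""
    h = np.tanh(W @ x + b)            # hidden-layer activations
    z = v @ h + b0                    # output pre-activation
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid output in (0, 1)

rng = np.random.default_rng(2)
x = rng.normal(size=40)                     # one sample of 40 PCA features
W = 0.1 * rng.normal(size=(35, 40))         # 35 hidden neurons, small weights
b = np.zeros(35)
v = 0.1 * rng.normal(size=35)
b0 = 0.0

p = mlp_forward(x, W, b, v, b0)
print(0.0 < p < 1.0)   # True: a valid class probability
```

Backpropagation would then adjust `W`, `b`, `v` and `b0` along the gradient of a loss comparing `p` to the true label.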

One-Dimensional Convolutional Neural Network (1D-CNN)
While CNN models were developed for image processing, where the model learns an internal representation of a two-dimensional (2D) input, the same mechanism can be used in a process known as feature learning on one-dimensional (1D) data sequences, such as in the case of malware detection. The model learns how to extract features from observational sequences and how to map the hidden representations to the different types of software (malware or benign).
The network implements a mapping $y = f(x)$, where $x = (x_1, \ldots, x_n) \in X$ indicates the input of the network and $y \in Y$ is the output. Therefore, the network learns a mapping $f: X \to Y$ from the input space $X$ to the output space $Y$.
The key building block of a convolutional network is the convolutional layer. The parameters of this layer are a group of trainable filters (scanning windows). Each filter operates on a small window of the input. During the forward propagation of the signal (from the first layer to the last), the scanning window sequentially traverses the whole input according to the tiling principle and computes the dot products of two vectors: the filter values and the outputs of the chosen neurons. Thus, after passing over all shifts in the width and height of the input field, an activation map is generated, which gives the effect of applying a particular filter at each spatial position. The network uses filters that are activated when an input signal of some kind is present. A series of filters is used in each convolutional layer, and each generates a different activation map.
The output of a convolutional layer is computed as
$$y_j = \varphi\!\left(\sum_{i=1}^{m} w_i\, x_{j+i-1} + b\right), \quad j = 1, \ldots, n - m + 1,$$
where $w$ is the convolution kernel, $m$ is the size of the kernel, $n$ is the number of inputs $x_j$, $b$ is the kernel bias and $\varphi$ is the neuron activation function; the inner sum represents the convolution operator.
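A direct NumPy rendering of a "valid"-mode 1D convolution of this form (identity activation and toy numbers chosen for readability):

```python
import numpy as np

def conv1d_valid(x, w, b, phi):
    """1D 'valid' convolution: y_j = phi(sum_i w_i * x_{j+i-1} + b)."""
    k = len(w)
    n = len(x)
    return np.array([phi(np.dot(w, x[j:j + k]) + b)
                     for j in range(n - k + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])          # input feature sequence
w = np.array([0.5, -0.5, 0.5])                   # kernel of size m = 3
y = conv1d_valid(x, w, b=0.0, phi=lambda z: z)   # identity activation

print(y)   # [1.  1.5 2. ]
```

With $n = 5$ inputs and a kernel of size $m = 3$, the output has $n - m + 1 = 3$ values, matching the index range in the equation above.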
The sub-sampling (pooling) layer is another feature of a convolutional neural network. It is usually positioned between successive convolution layers, so it may occur periodically. Its purpose is to gradually reduce the spatial size of the representation, reducing the number of network parameters and calculations as well as controlling overfitting. The pooling layer resizes the feature map, most frequently using the max operation. If the output of the previous layer is to be fed to a fully connected layer, it first needs to be flattened, which is done by the flattening layer. The Parametric Rectified Linear Unit (PReLU) layer is an activation function that complements the rectified unit with a slope for negative values.
The dropout layer is used to regularize the network; it also effectively thins the network, as neurons are randomly dropped during training. The practical importance of the dropout unit is to prevent overfitting [45]. Since we have two classes, the dropout layer is followed by a fully connected (dense) layer that reduces the final output vector to two classes, with the program's behavior expected to be either malicious or benign. The final activation function is SoftMax, which normalizes the two outputs into class probabilities.
The output of each convolutional layer in the 1D-CNN is also the input of the subsequent layer. It represents the features extracted by the convolution kernels learned from the training samples.
A unique and essential part of CNNs is the fully connected (FC) layer, which produces the final output. The output of the network's previous layers is reshaped (flattened) into a single vector. Each of its output values reflects the probability that a particular class label applies. The final probabilities for each label are supplied by the output of the FC layer.

Network Model Optimization
Optimization of neural network hyper-parameters, which govern how the network operates and determine its accuracy and validity, is still an unsolved problem. Optimizers adjust the parameters of neural networks, such as the weights and learning rate, to minimize loss. Known examples of neural network optimization algorithms are Stochastic Gradient Descent (SGD) [46], AdaGrad [47], RMSProp [48] and Adam [49], which usually show a tradeoff of optimization vs. generalization. This means that a higher training speed and higher accuracy during training may result in poorer accuracy on the testing dataset. Here, we adopted the Exponential Adaptive Gradients (EAG) optimization [50], which combines Adam and AdaBound [51]. During training, it exponentially sums the past gradients and adaptively adjusts the learning rate to address the poor generalization of the Adam optimizer.
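As a reference point for the adaptive optimizers listed above, here is a minimal NumPy implementation of a single Adam update (standard Adam [49], not the EAG variant; the quadratic objective is just a toy target):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m)
    and its square (v), with bias correction for early steps."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)                 # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)                 # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2 (gradient 2*theta) starting from theta = 1.0.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t)

print(abs(theta) < 0.1)   # True: converged near the minimum
```

Because the update divides by the running gradient magnitude, the effective step size stays close to `lr` regardless of gradient scale, which is what speeds up training but can hurt generalization, motivating variants such as AdaBound and EAG.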

Ensemble Classification
The basic principle of ensemble methods is that training datasets are rearranged in several ways (either by resampling or reweighting) and, by fitting a base classifier to each rearranged training dataset, an ensemble of base classifiers is built. After that, a new ensemble classifier is developed using the stacked ensemble method by combining the predictions of all those base classifiers, where a new model learns how to better integrate the predictions from multiple base models. We used the two-stage stacking technique [52]. First, several models are trained on a dataset. Then, the output of each of the models is processed to create a new dataset, in which each instance is related to the actual value it is supposed to approximate. Second, this dataset is used with the meta-learning algorithm to provide the final output.
In the design of a stacking model (Figure 2), the base models are often referred to as level-0 models, and a meta-learner (or generalizer) that integrates the base model predictions is referred to as a level-1 model. The base models are fit on the training data, and their predictions are compiled. The meta-learner (level-1 model) is a classification model trained to combine the predictions of the base models. A batch of previously unused data is used to obtain predictions from the trained base models, and these predicted outputs, paired with the corresponding true output values, are used to fit the meta-learner.
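The level-0/level-1 arrangement described above can be sketched with scikit-learn's `StackingClassifier`. Here, small MLPs stand in for the paper's Dense/CNN base learners, synthetic data replaces ClaMP, and ExtraTrees serves as the meta-learner as in the proposed framework:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic binary dataset with 40 features (mimicking the PCA output size).
X, y = make_classification(n_samples=300, n_features=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[  # level-0 base learners
        ("dense1", MLPClassifier(hidden_layer_sizes=(35,),
                                 max_iter=300, random_state=0)),
        ("dense2", MLPClassifier(hidden_layer_sizes=(40, 40),
                                 max_iter=300, random_state=0)),
    ],
    final_estimator=ExtraTreesClassifier(random_state=0),  # level-1 meta-learner
    cv=5,  # out-of-fold base predictions feed the meta-learner
)
stack.fit(X_tr, y_tr)
print(stack.score(X_te, y_te))   # typically well above chance on this toy data
```

The `cv=5` argument is the key design choice: the meta-learner is trained only on out-of-fold predictions, never on predictions for samples the base models saw during fitting, which is what keeps stacking from simply memorizing the base models' training error.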
The ensemble learner algorithm begins with setting up the ensemble: (a) select the base learners; (b) select a meta-learning algorithm. On the training dataset, stacking capitalizes on every single best learner. Usually, the greatest gains are made when the base classifiers used for stacking have high variability and uncorrelated predictions. As base models, we used the following neural networks: a fully connected MLP with one hidden layer (Dense-1), a fully connected MLP with two hidden layers (Dense-2) and a one-dimensional CNN (1D-CNN). The configurations of the neural networks are summarized in Table 1.
The examples of neural network architectures are presented in Figure 3. The role of the meta-learner is to find how best to aggregate the decisions of the base classifiers. As meta-learners, we explored K-Nearest Neighbors (KNN), Support Vector Machine (SVM) with a linear kernel, SVM with a radial basis function (RBF) kernel, Decision Tree (DT), Random Forest (RF), Multi-Layer Perceptron (MLP), AdaBoost Classifier, ExtraTrees (ET) classifier, Isolation Forest, Gaussian Naïve Bayes (GNB), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Logistic Regression (LR), Ridge Classifier (RC) and Stochastic Gradient Descent Classifier (SGDC). Here, KNN is a model that classifies unknown input data based on its greatest similarity (least distance) to known input data. SVM is a supervised learning method that constructs a higher-dimensional hyperplane to separate input data belonging to various classes while maximizing the distance of the input data to the hyperplane. The DT classifier creates a decision tree by splitting according to the feature with the highest information gain. RF fits many DT classifiers on different sub-samples of the dataset and uses averaging to improve the prediction accuracy. AdaBoost fits a classifier on the dataset and re-weights incorrectly classified instances to improve accuracy. Isolation Forest performs classification based on identified anomalies in the data. GNB performs classification based on the probability distributions of features and classes. The ET classifier [53] creates a meta-estimator that fits multiple decision trees on sub-samples of the training dataset and uses averaging to improve precision and manage overfitting. The goal of LDA is to find a linear combination of input characteristics that distinguishes two or more groups of input data. A quadratic decision surface is used by QDA to distinguish two or more groups of input data. LR is a linear regression-like statistical approach that predicts a result for a binary output variable from the input variables. RC converts the labels to (−1, 1) and solves the problem with the regression method; the greatest prediction value is taken as the target class. SGDC is a stochastic gradient descent learning algorithm that finds the decision boundary with a linear hinge loss.

Evaluation of Malware Detection Results
To measure the classification potential of the proposed ensemble learning model, its performance was evaluated using a 10-fold cross-validation method.
The true labels were compared against the predicted labels, and the true positive (TP), true negative (TN), false positive (FP) and false negative (FN) counts were calculated. The recall, precision, accuracy, error rate and F-score values were then computed (we assume a binary classification problem, where the positive class is labeled +1 and the negative class is labeled −1). The false positive rate (FPR) is
$$FPR = \frac{FP}{FP + TN},$$
the true positive rate (TPR) (also sensitivity or recall) is
$$TPR = \frac{TP}{TP + FN},$$
and the false negative rate (FNR) is
$$FNR = \frac{FN}{FN + TP}.$$
Precision is calculated as
$$Precision = \frac{TP}{TP + FP}.$$
To compute the F-score, the following equation is used:
$$F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}.$$
The Matthews Correlation Coefficient (MCC) is calculated as
$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.$$
Cohen's Kappa statistic (shortly, kappa) is
$$\kappa = \frac{p_o - p_e}{1 - p_e},$$
where $p_o$ represents the ratio of correct agreement in the test dataset, and $p_e$ is the ratio of agreement expected by random selection. In this study, performance was calculated using 10-fold cross-validation. We selected the best model according to the F1-score instead of checking the performance with accuracy alone. Accuracy can be a misleading metric on datasets with a major class imbalance: on a highly imbalanced sample, a model could guess the majority class for all predictions and achieve high apparent classification performance while making erroneous predictions on the minority class. The F1-score discourages this behavior by computing the metric for each label and taking the unweighted average. We also consider the area under the curve (AUC) as a measure of binary classification consistency, which is known as a balanced metric that can be used even when the classes in the dataset are of very different sizes. Furthermore, the performance of the proposed model on the binary dataset is represented using the confusion matrix.
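The metric definitions above can be checked numerically; the confusion counts below are illustrative only, not the paper's results:

```python
import numpy as np

def metrics(tp, tn, fp, fn):
    """Compute the standard confusion-matrix metrics."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                       # TPR / sensitivity
    fpr = fp / (fp + tn)                          # false positive rate
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return precision, recall, fpr, f1, accuracy, mcc

# Illustrative counts: 200 test files, 105 actually malicious.
p, r, fpr, f1, acc, mcc = metrics(tp=90, tn=85, fp=10, fn=15)
print(round(acc, 3))   # 0.875
print(round(f1, 3))    # 0.878
```

Note how accuracy and F1 diverge when the class counts are skewed; with these balanced counts they are close, but increasing TN at fixed TP inflates accuracy without changing precision or recall.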
We used the performance outcomes from each fold of the 10-fold cross-validation for statistical analysis. We adopted the non-parametric Friedman test followed by the post hoc Nemenyi test to compare the findings and measure their statistical significance. First, the methods were ranked based on the selected performance measures (we used accuracy, AUC and F1-score). Then, each method's mean rank was determined. If the difference between the mean ranks of two methods was less than the critical difference obtained from the Nemenyi test, the difference between the method outputs was assumed not to be significant.
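The Friedman step can be run with SciPy's `friedmanchisquare`; the per-fold accuracies below are synthetic stand-ins constructed so that one method consistently outranks the others (the Nemenyi post hoc test is not in SciPy and would need a package such as scikit-posthocs):

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Accuracy of three hypothetical meta-learners across 10 CV folds.
rng = np.random.default_rng(3)
base = rng.uniform(0.95, 0.99, size=10)
acc_et = base + 0.010    # consistently best on every fold
acc_ada = base + 0.002
acc_dt = base

stat, p_value = friedmanchisquare(acc_et, acc_ada, acc_dt)
print(p_value < 0.05)    # True: the per-fold ranks differ consistently
```

The Friedman test uses only the within-fold ranks, so it is robust to fold-to-fold difficulty differences (captured here by the shared `base` term).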

Experimental Settings
The machine learning models were trained on the features acquired from the dataset using Python's Scikit-learn libraries. All experiments were performed on a laptop computer with a 64-bit Windows 10 OS and an Intel® Core™ i5-8265U CPU @ 1.60 GHz (1.80 GHz boost) with 8 GB RAM (Intel, Santa Clara, CA, USA).

Results of Machine Learning Methods
The results from using the classical machine learning models are summarized in Table 2, while their confusion matrices are summarized in Figure 4. The best results were obtained by the ExtraTrees (ET) model, achieving an accuracy of 98.8%. As can be seen from Table 2 and Figure 4, the ET model generated very good results for the precision, recall, F1 and accuracy of the two classes. This agrees with the low FPR and FNR of 0.8% and 1.4%, respectively, obtained by the ET model.

Results of Neural Network Classifiers
To select the base classifiers, we first performed an ablation study to find the best representatives of the Dense-1, Dense-2 and 1D-CNN models in terms of their performance with respect to different values of the hyperparameters. The results are presented in Tables 3-5. Note that in all cases, we used the sparse categorical cross-entropy loss function and the Adam optimizer. For the training of the Dense-1 and Dense-2 models, we used 100 epochs, while for the training of the 1D-CNN models, we used 20 epochs. In all cases, 80% of the data were used for training and 20% for testing.

Results of Ensemble Learning
Based on the ablation study, we selected one Dense-1 (with 35 neurons) model, two Dense-2 (with (40, 40) and (40, 50) neurons) models and two 1D-CNN (with (25, 25) and (30, 35) neurons) models as base learners based on their kappa and F1-score performance. We performed classification with several different meta-learner classification algorithms. For KNN, the number of nearest neighbors was set to 3. For linear SVM, C was set to 0.025. For RBF SVM, the C parameter (which performs regularization by applying a penalty to reduce overfitting) was set to 1, and gamma was set to 2. For DT and RF, the max depth was set to 5. In all cases, 10-fold cross-validation was used, where each cross-validation fold was made by randomly selecting 80% of the samples, and the remaining 20% were used for testing. The results are presented in Table 6. The average performance results are visualized in Figures 5-7, whereas the results from the 10-fold cross-validation are shown as boxplots in Figures 8-10. The results demonstrate that the ExtraTrees meta-learner achieved the highest performance in terms of the accuracy, AUC and F1-score measures. Finally, we present the confusion matrix of the best ensemble model (with the ET classifier as the meta-learner) in Figure 11.

Statistical Analysis
To perform the statistical analysis of the experimental results, we adopted the Friedman test and the Nemenyi test. The results are presented as critical difference (CD) diagrams in Figures 12-14. If the difference between the mean ranks of the meta-learners is smaller than the CD, then it is not statistically significant. The results of the Nemenyi test again show that the ExtraTrees meta-learner allows us to achieve the best performance; however, the performance of the AdaBoost and Decision Tree meta-learners is not significantly different.

Ablation Study of the Ensemble
We also conducted an ablation study to evaluate the contribution of the individual parts of the proposed ensemble classification-based framework for malware recognition. We compared and analyzed the impact of the ensemble size on the classification results, analyzing ensembles consisting of a smaller number (four) of neural network models. The results are summarized and compared in Table 7. In all cases, the ExtraTrees classifier was used as the meta-learner. The Full Model here corresponds to the five-model ensemble with PCA scaling of the data. The results show that the best performance was achieved by the full five-model ensemble with data scaling using PCA and ExtraTrees as the meta-learner. Finally, we compare our results with some of the related work on classifying benign and malware files in Table 8 and explain them in more detail below. Note that the compared methods were evaluated on different malware datasets. Alzaylaee et al. [54] explored 2-, 3- and 4-layer fully connected neural networks on a dataset of 31,125 Android apps, with 420 static and dynamic features, while comparing the results to machine learning classifiers. The best results were achieved with a three-layer network with 200 neurons in each layer. Bakour and Ünver [55] suggested a visualization-based approach that converted software characteristics into grayscale images and then applied local and global image features as voters in an ensemble voting classifier. Cai et al. [56] used information gain for feature selection and weight mapping functions derived by machine learning methods, which were optimized by the differential evolution algorithm. Chen et al. [57] used an attention network architecture based on CNN to classify apps based on their Application Programming Interface (API) call sequences. Fang et al. [58] used the DeepDetectNet deep learning model for static PE malware detection and an adversarial generation network, RLAttackNet, based on reinforcement learning, which was trained to bypass DeepDetectNet. The generated adversarial samples were used to retrain DeepDetectNet, which allowed the improvement of malware recognition accuracy.
Imtiaz et al. [59] proposed a deep multi-layer fully connected Artificial Neural Network (ANN) with an input layer, a few hidden layers and an output layer. The approach was validated on the CICInvesAndMal2019 dataset of Android malware. Jeon and Moon [60] proposed a convolutional recurrent neural network (CRNN) that uses the opcode sequences of software as input. The front-end CNN performs opcode compression, and the back-end dynamic recurrent neural network (DRNN) detects malware from the compressed sequence.
Jha et al. [61] proposed an RNN with feature vectors obtained by the skip-gram variant of the Word2Vec embedding model for malware recognition. Namavar Jahromi et al. [62] proposed a modified Two-hidden-layered Extreme Learning Machine (TELM), which was tested on ransomware, Windows, Internet of Things (IoT) and other malware datasets.
Narayanan and Davuluru [63] suggested using CNNs and Long Short-Term Memory (LSTM) networks for feature extraction and SVM or logistic regression for the classification of malware based on machine language opcodes. The approach was validated on Microsoft's Malware Classification Challenge (BIG 2015) dataset with nine malware classes. Song et al. [64] proposed JavaScript malware detection based on a bidirectional LSTM neural network. Wang et al. [65] suggested CrowdNet, a radial basis function network, as a malware predictor. Yen and Sun [66] extracted instruction code and applied hashing to extract features; the features were then transformed into images and used to train a CNN.
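The "PCA scaling" used by the full five-model ensemble in the ablation study above can be sketched as a preprocessing pipeline, assuming it means per-feature standardization followed by a PCA projection; the feature matrix and the 95% variance threshold here are illustrative, not the paper's settings.

```python
# Sketch of PCA-based data scaling before the base learners, assuming
# standardization followed by PCA; the data and settings are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # placeholder for PE-header feature vectors

# Standardize each feature, then project onto the principal components
# that retain 95% of the variance before training the base learners.
pca_scaler = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pca_scaler.fit_transform(X)
```

The fitted `pca_scaler` would then be applied to both training and test data so that all base learners receive inputs in the same reduced space.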

Conclusions
There is an increasing demand for intelligent methods that detect new malware variants, because the existing methods are time-consuming and error-prone. This paper analyzed various machine learning algorithms and neural network models as intelligent approaches to malware detection. With neural networks as base learners, we proposed an ensemble learning-based architecture and explored 14 machine learning algorithms as meta-learners. Machine learning algorithms were also used as baseline models for comparison. We conducted our experiments on a dataset that included malicious and benign Windows Portable Executable (PE) files.
In this paper, we analyzed and experimentally validated the use of ensemble learning to combine the malware predictions of different machine learning and deep learning models, with the aim of improving the recognition of Windows PE malware. With ensemble methods, it is not necessary to select any single machine learning model; instead, the predictions of a combination of models are aggregated to create a learning procedure that achieves the best malware detection performance. We explored our proposed ensemble classification framework with lightweight fully connected and convolutional neural network architectures, combining deep learning and machine learning techniques to learn effective and efficient malware detection models. We conducted extensive experiments on various lightweight deep learning architectures and machine learning models within the ensemble learning framework under the same conditions for a fair comparison.
The results show that the malware detection ability of stacked ensembles exceeds that of the other machine learning methods, including individual neural networks. We showed that an ensemble learning framework based on lightweight deep models can successfully tackle the problem of malware detection and that ensemble learning methods can serve as intelligent techniques for malware identification. The classification system with the ExtraTrees algorithm as the meta-learner and an ensemble of dense ANN and 1-D CNN models obtained the best classification accuracy, outperforming the other machine learning classification methods. Our proposed framework can thus lead to highly accurate malware detection models suited to real-world Windows PE malware.
In future work, we will apply explainable artificial intelligence (XAI) [67] techniques to interpret the outcomes of deep learning models for malware detection and provide useful information for malware analysts. We also intend to explore further ensemble learning architectures and run additional experiments with larger malware databases. We aim to improve the classification ability and accuracy of the ensemble learning model by refining the model architecture and validating it on multiple malware datasets.
1. Train the base learners: a) Train each of the M base learners on the training dataset (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where n is the number of samples; b) Perform k-fold cross-validation on each of the base learners and record the cross-validated predictions z_1, z_2, ..., z_M; c) Combine the cross-validated predictions from the base learners to form a new feature matrix. 2. Train the meta-learner on the new data (features x predictions from the base-level classifiers), i.e., on (x_i, z_i1, z_i2, ..., z_iM, y_i) for i = 1, ..., n, and combine the base learning models with the meta-learner to generate more accurate predictions on unknown data. 3. Test on new data: a) Record the output decisions from the base learners; b) Send the base-level decisions to the meta-learner to make the ensemble decision.
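The stacking procedure above can be sketched with scikit-learn, assuming small MLPs as stand-ins for the paper's five dense/CNN base learners and a synthetic dataset; as a simplification, only the base-level probabilities (not the original features) are fed to the meta-learner.

```python
# Minimal sketch of the stacking procedure described above. The MLPs are
# stand-ins for the paper's dense/CNN base learners, and the data are
# synthetic; only base-level probabilities form the meta-features here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: base learners (stand-ins for the five neural networks).
base_learners = [
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=i)
    for i in range(3)
]

# Step 1b: k-fold cross-validated predictions z_1, ..., z_M on the
# training set, so the meta-learner never sees leaked fitted outputs.
Z_train = np.column_stack([
    cross_val_predict(clf, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for clf in base_learners
])

# Step 2: train the meta-learner on the base-level predictions,
# then refit each base learner on the full training set.
meta = ExtraTreesClassifier(n_estimators=100, random_state=0)
meta.fit(Z_train, y_train)
for clf in base_learners:
    clf.fit(X_train, y_train)

# Step 3: base-level decisions on new data feed the meta-learner.
Z_test = np.column_stack([clf.predict_proba(X_test)[:, 1]
                          for clf in base_learners])
accuracy = meta.score(Z_test, y_test)
```

scikit-learn's `StackingClassifier` automates steps 1-3 (including the cross-validated meta-features) when all base learners follow the estimator API; the manual version above mirrors the listing step by step.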

Figure 4. Confusion matrices of machine learning models.

Figure 5. Malware detection performance of the deep learning ensemble model by final-stage meta-learner classifier: accuracy.

Figure 6. Malware detection performance of the deep learning ensemble model by final-stage meta-learner classifier: F1-score.

Figure 7. Malware detection performance of the deep learning ensemble model by final-stage meta-learner classifier: AUC.

Figure 8. Malware detection performance of the deep learning ensemble model by final-stage meta-learner classifier: accuracy.

Figure 9. Malware detection performance of the deep learning ensemble model by final-stage meta-learner classifier: area under curve.

Figure 10. Malware detection performance of the deep learning ensemble model by final-stage meta-learner classifier: F1-score.

Figure 11. Confusion matrix of the best ensemble model (with the ET classifier as meta-learner).

Figure 12. Comparison of mean ranks of meta-learners based on their accuracy performance: results of the Nemenyi test.

Figure 13. Comparison of mean ranks of meta-learners based on their AUC performance: results of the Nemenyi test.

Figure 14. Comparison of mean ranks of meta-learners based on their F1-score performance: results of the Nemenyi test.

Table 1. Model configuration of neural networks with their parameters. FC: fully connected; Conv1D: one-dimensional convolution; PReLU: Parametric Rectified Linear Unit.

Table 3. Malware detection performance with different numbers of neurons in the hidden layer of the Dense-1 model. Best models are shown in bold.

Table 4. Malware detection performance with different numbers of neurons in the hidden layers of the Dense-2 model. Best models are shown in bold.

Table 5. Malware detection performance with different numbers of filters in the convolutional layers and neurons in the final fully connected layer of the 1D-CNN model. Best models are shown in bold.

Table 6. Ensemble learning results with different meta-learners: mean values from 10-fold cross-validation. Best values are shown in bold.

Table 7. Comparison of ensemble models. Best values are shown in bold.

Table 8. Comparison with other known deep learning approaches for malware recognition. n/a: data were not provided.