Windows PE Malware Detection Using Ensemble Learning

In this Internet age, there are increasingly many threats to the security and safety of users daily. One of such threats is malicious software otherwise known as malware (ransomware, Trojans, viruses, etc.). The effect of this threat can lead to loss or malicious replacement of important information (such as bank account details, etc.). Malware creators have been able to bypass traditional methods of malware detection, which can be time-consuming and unreliable for unknown malware. This motivates the need for intelligent ways to detect malware, especially new malware which have not been evaluated or studied before. Machine learning provides an intelligent way to detect malware and comprises two stages: feature extraction and classification. This study suggests an ensemble learning-based method for malware detection. The base stage classification is done by a stacked ensemble of fully-connected and one-dimensional convolutional neural networks (CNNs), whereas the end-stage classification is done by a machine learning algorithm. For a meta-learner, we analyzed and compared 15 machine learning classifiers. For comparison, five machine learning algorithms were used: naïve Bayes, decision tree, random forest, gradient boosting, and AdaBoosting. The results of experiments made on the Windows Portable Executable (PE) malware dataset are presented. The best results were obtained by an ensemble of seven neural networks and the ExtraTrees classifier as a final-stage classifier.


Introduction
The popularity of the Internet has been skyrocketing since its invention. A report from the International Telecommunication Union (ITU) indicates that 51.2% of the population of the world, or 3.9 billion people, were using the Internet as of the close of 2018 [1]. The more popular the Internet becomes, the more vulnerable Internet users are because of cybercriminals who employ various methods to attack or damage computers, servers, clients, or computer networks for their financial or political benefit. One of the methods employed by cybercriminals is the use of malicious software, otherwise known as malware, to exploit a system's vulnerabilities and to affect the user or device. Malware is any software purposely designed to inflict harm on a computer system (a server, a client, or a computer network) for personal benefit. Malware can be classified based on how it multiplies or its particular action. The various types of malware include viruses, worms, Trojan horses, adware, spyware, rootkits, bots, ransomware, etc. [2][3][4].
Malevolent cyberattacks may cause considerable losses, whether they arise from a single person, an organization, or a hostile state. As colluded attacks on computer networks become more common, the development of systems that can recognize illegal attempts to infiltrate a secure network, i.e., intrusion detection systems (IDS), has become more prominent. IDS commonly have a limitation expressed as an ineptitude to identify a cyberattack that is hidden in a sequence of legal network connections. Cybercrime and related illegal activities over the Internet have become a criminal business model. Threats such as phishing [5], spyware and malware, trojans, worms, and even intrusions need sophisticated instrumentation of a multitude of hacked machines, also known as botnets. The landscape of internet crimes has become automated, which has made the use of nonhuman agents such as botnets, hijacked Internet of Things (IoT) devices [6], and compromised wireless sensor networks (WSNs) [7] more common.
Malware falls in the category of the major threats facing today's Internet users (business owners, corporate organizations, hospitals, etc.), and as the deployment of new malware increases, so does the advancement of anti-malware techniques. When malware finds its way into a computer, it transforms into executable code, scripts, and active content. Malware can cause various forms of damage to a computer. It can cause a computer to slow down or even crash. One can also notice a decrease in disk space, increased Internet activity, unexpected annoying pop-up ads, unwanted browser extensions, toolbars, etc. Malware can be used to gain access to personal information from people (Internet users), such as bank information, passwords, etc. Ransomware is a type of malware that can cause users to be locked out of their systems and then force them to pay ransoms (commonly in the form of an untraceable cryptocurrency) before they can get them back. Systems that aim at fighting malware are continually being built to ensure that cyberspace is safer from malware attacks [8,9]. Malware detection is the process of ascertaining the presence of malware on a system or determining whether a program is malicious or harmless so that the system can be protected or recovered from any effects caused by the malicious code [10]. Various malware detection methods exist today. The signature-based detection method is typically adopted in many anti-malware tools to detect threats, but malware developers continuously create new methods to evade detection.
As the number of legitimate users of the Internet increases, so do the opportunities for cybercriminals to gain from manufacturing malware. Anti-malware software is used for the protection of Internet users from malware attacks, and it adopts the signature-based method to detect known threats. This method analyses strings from binary code. It is generally regarded as time-consuming, and it also does not respond to new malware threats because malware writers can evade this method. The heuristic-based method was adopted and became a very important method of malware detection. However, it is also time-consuming and error-prone. Malware creators have devised malicious codes that can bypass these traditional methods of malware analysis and detection, giving rise to the need for intelligent methods of malware analysis and automatic detection of malware [11][12][13][14].
In practice, for the detection of unknown computer viruses, the traditional approach to malware detection based on signature analysis [15] is not acceptable. Users are forced to update antivirus databases constantly and promptly to maintain the correct level of protection. Nevertheless, the delay in the response of antivirus companies to the advent of new malware can last several days. New malicious programs can produce irreparable damage during this time. Heuristic algorithms specifically developed to detect unknown malware are characterized by high first- and second-type error rates. Modern information technology research aims to develop certain methods and algorithms of defense that would be able to detect and neutralize unknown malware. This not only improves security but also keeps the user from frequent antivirus software upgrades. The creation of artificial neural network (ANN)-related technology [16,17] and hybrid methods [18][19][20][21] is a prerequisite for developing successful antivirus systems. The ability of such systems to learn and generalize allows intelligent information protection systems to be developed.
Malware recognition is an active area of research with many open research problems. To face these problems, it is important to propose novel deep-learning frameworks and validate them on new malware datasets. The use of ensemble methods such as random forests has previously facilitated the output of machine learning models to enhance malware detection in Internet of Things (IoT) environments [6]. The goal of this research was to deploy an ensemble of deep neural networks for malware detection and classification.
The main contribution of this article is as follows: (1). A hybrid ensemble learning framework consisting of fully connected and convolutional neural networks (CNNs) with the ExtraTrees classifier as a meta-learner for malware recognition. (2). A comprehensive study of the performance of classifiers for selecting the components of the framework.
This paper is organized as follows. Previous works, including an adequate criticism of existing methods and approaches, are discussed in Section 2. The methodology employed in this article is described in Section 3. Implementation of the methodology and the results achieved are discussed in Section 4. The conclusions of this article are given in Section 5.

Related Works
This section contains a review of articles, journals, and conference proceedings on different approaches, insights, and techniques used for detecting and classifying malware. The merits and demerits of these approaches are also mentioned during the discussion. In the work of Bazrafshan et al. [22], three methods for detecting malware were considered, namely, signature-based, behavior-based, and heuristic-based. The study did not consider machine learning methods or dynamic and hybrid methods for the detection of malware. Souri et al. [13] did a similar study in which two of the three methods employed in [22] were considered. The study also did not consider data mining or machine learning approaches used for detecting and classifying malware. Rathore et al. [23] worked on detecting malware using different machine learning algorithms and deep learning models. They also involved certain practices in building these models, such as solving the class imbalance issue, cross-validation, etc. They applied supervised learning and unsupervised learning for malware classification using machine instruction (opcode) frequency as a feature vector. Feature reduction methods such as a single layer auto-encoder and a 3-layer stacked auto-encoder were used for dimensionality reduction. Then, the recognition was performed using deep neural network (DNN) and random forest (RF). Based on their results, the RF algorithm did a lot better than the DNN models. These results indicated that deep learning may not perform well for malware detection [24].
Ye et al. [12] provided an overview of data mining techniques for detecting malware. They considered the process of detecting malware using intelligent methods from two perspectives: extraction of features and clustering or classification. The study concluded that data mining-based malware detection frameworks may be employed to achieve high accuracy in malware detection with a low number of false positives. Based on their findings, the performance of the malware detection approaches did not only depend on the classification algorithms used but also on the features extracted. They also suggested that a set of classifiers could enhance the accuracy of detection, as opposed to individual classifiers, and a balanced distribution of harmful and benign files for training is required. These data mining techniques have proven to be successful in the anti-malware industry, but they are not void of challenges. Some of the challenges are: the manual inspection of files that could be malicious can take a lot of time; because malware samples are created each day, new malware samples should be used in training sets to ensure that the classifier remains efficient; with this approach, malware attackers can implement ways to wrongly train the classifiers; and there is also not enough research on predicting malware prevalence [9].
Another type of malware detection is behavioral malware detection, which observes the way the malware behaves and can therefore also bypass obfuscation techniques. Obfuscation involves confusing the malware analyst by encrypting and decrypting the malicious code. Observation of the behavior of the malicious file requires an emulated environment to be set up.
Setting up an emulated environment can be time-consuming, and although it is safe, the malicious file might only be triggered by processes or events. Pluskal [25] used a dataset that contained behavioral features provided by AVG Company. According to the author, improvement of the binary classifier may be achieved by an efficient feature representation using support vector machines (SVM) for training the classifier on large-scale datasets. The author also successfully created a linear classifier that, on every operating point, had a better true positive rate than the current AVG linear classifier at the time. Cakir and Dogdu [26] used a deep learning-based feature extraction method (Word2Vec) to represent malware based on its opcodes, together with a gradient boosting machine classifier, and achieved 96% accuracy with limited sample data.
Deep learning is a kind of machine learning. Deep learning is generally time-consuming, but it has proven to be more efficient in malware detection. Known malware analysis methods based on deep learning include CNN [27], deep belief network (DBN) [28], graph convolutional network (GCN) [29], long short-term memory (LSTM), gated recurrent unit (GRU) [30], and VGG16 [31]. For example, Lee et al. [24] discussed how to use deep learning to analyze malware. For this process, data must be extracted, developed, and network models trained. Additionally, compared to existing classification and analysis, difficult and complex features can be automatically extracted from simple malware data characteristics. Deep learning models that can classify and detect harmful codes more accurately and efficiently are required [28]. Besides, depending on the dataset, the accuracy of the studies may depend on the amount of data [12]. Ren et al. [27] proposed two deep learning models, DexCNN and DexCRNN, to recognize benign and malicious Android application packages (APKs). The experiments showed that DexCNN and DexCRNN achieved 93.4% and 95.8% detection accuracy, respectively. Yuxin and Siyi [28] used DBNs as an autoencoder to extract features from executables. They compared the performance of DBNs with baseline malware detection models (SVM, decision trees, and the k-nearest neighbors algorithm) as classifiers, demonstrating that the DBN model achieved better performance in malware recognition. Pei et al. [29] proposed a deep learning framework to learn embedding representations for Android malware detection, which included graph convolutional networks (GCNs) to learn semantic and sequential patterns, and an independently recurrent neural network (IndRNN) to learn deep semantic information and extract context-dependent features for malware recognition.
Čeponis and Goranin [30] suggested using dual-flow deep learning methods, such as a long short-term memory fully convolutional network (LSTM-FCN) and a gated recurrent unit (GRU)-FCN, for malware recognition and performed experiments on the Windows OS calls traces dataset (AWSCTD), but achieved the best results with a conventional one-dimensional single-flow CNN.
However, the generalization capabilities of ANN-based models [32] cannot be assured. More generic and stable approaches are therefore required to solve these problems. Researchers are developing ensemble classifiers [33][34][35][36][37] that are less vulnerable to the limitations of malware datasets. Ensemble methods [38,39] combine multiple machine learning algorithms to improve final prediction accuracy while minimizing the risk of overfitting in the training outcomes so that the training dataset can be used more efficiently and, as a consequence, higher generalization can be attained. There is still room for researchers to improve the accuracy of classification, although several models of ensemble classification are already developed that would be useful for enhancing malware recognition.
Thus, this article suggests an ensemble learning-based framework that uses fully connected ANNs and CNNs as first-stage learners, combined with a final-stage machine learning method for malware recognition.

Materials and Methods
The focus of malware developers is to attack computer networks and systems to loot data, make financial demands, or just to prove their skill. The traditional methods for malware detection have been succeeding at detecting known malware. However, new malware cannot be deterred by these methods. The detection capabilities of models used for malware detection have been greatly improved by the current machine learning technology [40]. The detection of malware using machine learning methods can be achieved in two stages, namely, the extraction of features from the input data with the selection of the important ones (which represent the data better), and the classification. The proposed system is based on machine learning and deep learning methods, which can learn and differentiate malicious and benign files and also provide accurate predictions of new malware.
The stages involved in arriving at the final solution comprise the following: data collection, dimensionality reduction, model building, model testing, and model evaluation. Figure 1 represents the flow of the stages involved in the system methodology, from data collection to model evaluation; the stages are explained in more detail in the following subsections.

Dataset
The collection of a representative dataset is very important for machine learning to achieve success. This is because a machine learning model has to be trained on a dataset that accurately depicts the conditions for real-world applications of the model. For this model, we used a dataset that contained malicious and benign program data from Windows Portable Executable (PE) files, obtained from Kaggle. The dataset had 19,611 malicious samples obtained from various malware repositories including VirusShare, and benign samples. The dataset originally had 77 features, which included the following:
• NumberOfSections: this refers to the size of the section table, which directly succeeds the headers. This feature is different in both malware and non-malware files.
• MajorLinkerVersion: this is a field in the optional header, and it is the linker major version number.
• AddressOfEntryPoint: this is also a field in the optional header. It is the entry point address. This address is relative to the image base obtained as the Portable Executable (PE) file is loaded into memory. It is the starting address for program images, and it is the initialization function address for device drivers. For a dynamic-link library (DLL), an entry point is not required. The field is null when there is no entry point.
• ImageBase: this represents the address of the first byte of the image when it is loaded into memory. This is usually a multiple of 64K.
• MajorOperatingSystemVersion: a number used to identify the version of the operating system.
• MajorImageVersion: a number used to identify the version of the image. Many benign files have more versions and most malicious files have this feature with a value of zero.
• CheckSum: 90% of the time, when the CheckSum, MajorImageVersion, and DLLCharacteristics of a file are equal to zero, the file is found to be malicious.
• SizeOfImage: this refers to the image size as it is loaded in memory.
The discussed features, along with the class label (0 for benign and 1 for malicious), were used to create the classification model.

Dimensionality Reduction
Machine learning techniques are applied widely to address a range of prediction and classification problems. Poor performance in machine learning can be caused by overfitting or underfitting the data. Removing the unimportant features ensures the optimum performance of the algorithms and increases the speed. To perform feature dimensionality reduction, principal component analysis (PCA) was applied. Based on previous research, 55 features (representing 95% of variability) were selected to be passed into the machine learning model because the features were proven to be relevant in learning whether a file was malicious or benign.
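The selection of principal components by retained variance can be illustrated with a minimal sketch. It assumes NumPy is available and uses a synthetic matrix as a hypothetical stand-in for the 77-feature PE table; the function name `pca_reduce` and the synthetic data are illustrative assumptions, not the pipeline used in this study.

```python
import numpy as np

def pca_reduce(X, variance_to_keep=0.95):
    """Project X onto the fewest principal components that retain the requested variance."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values ** 2
    cumulative_ratio = np.cumsum(explained) / explained.sum()
    # Smallest k whose cumulative explained-variance ratio reaches the threshold.
    k = int(np.searchsorted(cumulative_ratio, variance_to_keep) + 1)
    return X_centered @ Vt[:k].T, k, cumulative_ratio

rng = np.random.default_rng(0)
# Hypothetical data: 10 latent factors spread over 77 observed features, plus small noise.
latent = rng.normal(size=(500, 10))
X = latent @ rng.normal(size=(10, 77)) + 0.05 * rng.normal(size=(500, 77))

X_reduced, k, ratio = pca_reduce(X)
print(X_reduced.shape, k, round(float(ratio[k - 1]), 3))
```

In scikit-learn, passing a fraction such as `n_components=0.95` to `PCA` selects the number of components in the same way. Features on very different scales would normally be standardized first; that step is omitted here for brevity.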

Baseline Machine Learning Models
The study employed five machine learning algorithms, namely, random forest, naïve Bayes, AdaBoost, decision tree, and gradient boosting. A brief description of the algorithms is presented below.
A Gaussian naïve Bayes (NB) model [41] is premised on probability and likelihood. The algorithm is stable, fast, and simple. NB is built based on Bayes' theorem, which is premised on the strong assumption of conditional independence. The assumption is that every feature in a particular class is independent of all other features in that same class. The model is useful when working with very large datasets, and it is easy to build. It can also perform better than other classification algorithms.
The NB algorithm performs well with categorical input variables but performs less well with numerical values and in multi-class classification/prediction. Additionally, the feature independence assumption upon which the algorithm is based may not always be true.
The decision tree (DT) algorithm [42] performs well for continuous as well as categorical variables. DT classifiers learn to make predictions on the test data by following a tree-like model (created using the training dataset) that resembles a flow chart, based on the features passed into it. Each of the tree's internal nodes correlates with an attribute, and every leaf node correlates with a class label. In a DT, the best feature of the dataset is positioned at the root of the tree, while the training dataset is divided into subsets. These two steps are then repeated on every subset until there are no further divisions possible. It is a simple algorithm that can work well with large datasets.
Random forest (RF) [43] is a classification algorithm that consists of multiple decision trees that make predictions based on the mean probabilistic prediction of each tree. It is similar to decision trees and it reduces the problem of overfitting, which is a problem associated with the DT algorithm. But it is not easy to interpret, unlike a decision tree. It uses randomness when constructing each DT to create a forest of different trees.
Boosting is a method of making strong learners out of weak learners by combining weak classifiers into one strong classifier. AdaBoost or adaptive boosting [44] is a machine learning classification algorithm that is based on the idea of iteratively making weak learners learn a bigger part of the examples in the training data that are difficult to classify by giving more weight (paying more attention) to examples that are often misclassified. The weak learners consist of DTs with one split, which are called decision stumps.
The gradient boosting (GB) algorithm [45] creates a model as a result of the combination of weaker models. The idea behind gradient boosting is to repeatedly minimize the loss function until the minimum test loss is reached. The steps involved in GB include the following:
i. Model the data with simple models and examine the data for errors.
ii. The errors connote data points that are not easy to fit by a simple model.
iii. For subsequent models, the focus is placed on improving the accuracy of classification on data that are hard to fit.
iv. Finally, all the predictors are combined by giving each predictor some weights.
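The steps above can be sketched as follows. This is a minimal illustration that assumes a squared-error loss, a single numeric feature, and decision stumps as the simple models; these choices, the helper names, and the toy data are assumptions made for the example and are not the configuration used in the experiments.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the threshold on a 1D feature that best fits the residuals with two constants."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the maximum so the right side is never empty
        left, right = residual[x <= threshold], residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    """Fit each new stump to the current errors and add it with a fixed weight."""
    prediction = np.full_like(y, y.mean(), dtype=float)
    stumps, losses = [], []
    for _ in range(n_rounds):
        residual = y - prediction  # errors of the current model (negative gradient of squared loss)
        stump = fit_stump(x, residual)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        losses.append(float(((y - prediction) ** 2).mean()))
    return stumps, losses

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = (x > 0.6).astype(float)  # toy binary target
stumps, losses = gradient_boost(x, y)
print(losses[0], losses[-1])
```

Each round focuses on the samples that the current combination still fits poorly, and the learning rate plays the role of the weight given to each predictor.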

Multilayer Perceptron
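Let the output of a simple multilayer perceptron (MLP) be $y$ at the input $X = (x_1, x_2, \ldots, x_n)$. The task is to find model parameters $W = (w_1, w_2, \ldots)$ and $V = (v_1, v_2, \ldots, v_h)$, $h \geq 1$, such that the model output $F(X, W, V)$ would match closely the real value of $y$. The relationship between the input and output of an MLP is established by:
$$y \approx F(X, W, V).$$
A perceptron with one hidden layer can approximate any continuous function defined on a bounded set as follows:
$$F(X, W, V) = \sum_{i=1}^{h} v_i \, f\left(w_i^{\top} X\right),$$
where $w_i$ is the weight vector of the $i$-th hidden neuron, $f$ is the activation function, and $h$ is the number of hidden neurons. The training of MLP is performed by applying a gradient descent algorithm (such as error backpropagation).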
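A one-hidden-layer perceptron and a single gradient descent update can be sketched as follows. The sigmoid activation, the bias vector `b`, the squared-error loss, the learning rate, and the random inputs are assumptions made for illustration; the layer sizes (55 inputs, 35 hidden units) mirror values mentioned elsewhere in this article but the code is not the trained model.

```python
import numpy as np

def forward(x, W, b, v):
    """Return the hidden activations and the scalar output of a one-hidden-layer MLP."""
    hidden = 1.0 / (1.0 + np.exp(-(W @ x + b)))  # sigmoid hidden units, shape (h,)
    return hidden, float(v @ hidden)

def gradient_step(x, y, W, b, v, learning_rate=0.01):
    """One step of gradient descent on the squared error 0.5 * (output - y) ** 2."""
    hidden, output = forward(x, W, b, v)
    error = output - y
    # Error backpropagated to the hidden pre-activations.
    grad_pre = error * v * hidden * (1.0 - hidden)
    W_new = W - learning_rate * np.outer(grad_pre, x)
    b_new = b - learning_rate * grad_pre
    v_new = v - learning_rate * error * hidden
    return W_new, b_new, v_new

rng = np.random.default_rng(0)
n_inputs, n_hidden = 55, 35
x = rng.normal(size=n_inputs)  # hypothetical feature vector
y = 1.0                        # hypothetical target
W = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
b = np.zeros(n_hidden)
v = rng.normal(scale=0.1, size=n_hidden)

_, before = forward(x, W, b, v)
W, b, v = gradient_step(x, y, W, b, v)
_, after = forward(x, W, b, v)
loss_before = 0.5 * (before - y) ** 2
loss_after = 0.5 * (after - y) ** 2
print(loss_before, loss_after)
```

Repeating the update over many samples and epochs is what error backpropagation training amounts to in practice.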

One-Dimensional (1D) CNN Model
Although the CNN models were primarily designed for image processing, where the model learns an internal representation of a two-dimensional input (2D), the same mechanism can be used for feature learning on 1D data series, such as in the case of malware recognition. The model learns how to derive features from observation sequences and how to map hidden layers to various software types (malicious or benign).
The convolutional layer is the main block of the CNN. The parameters of this layer are a set of trainable filters (scan windows). Each filter works over a window of small size. During forward propagation (from the first layer to the last), the scanning window sequentially traverses the entire image following the tiling principle and calculates dot products of two vectors: the filter values and the outputs of the selected neurons. After passing all the shifts in the width and height of the input volume, a 2D activation map is created, which gives the response of a specific filter in each spatial region. The network learns filters that are activated by some type of input signal. Each convolutional layer uses a set of filters, and each creates a separate activation map.
Another element of a CNN is the down-sampling or subsampling layer. Usually, it is placed between successive layers of convolution, so it occurs periodically. Its function is to gradually decrease the spatial size of the vector in order to decrease the number of computations in the network, as well as to control overfitting. The subsampling layer resizes the feature map, most often using the max pooling operation. The flattening layer is used when the output of the previous layer is to be transmitted to a fully connected (FC) layer and therefore needs to be flattened. The parametric rectified linear unit (PReLU) layer is a neuron activation function that supplements the rectified linear unit with a learnable slope for negative values.
To regularize the network, the dropout layer is used. During training, it randomly drops neurons, which effectively makes the network thinner. As there are two classes, this dropout layer is followed by a fully connected (dense) layer that reduces the output to two classes, and we expect to predict the behavior of the program as either malicious or benign. The final activation function is softmax, which converts the two outputs into class probabilities.
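The layer sequence described above (convolution, PReLU, max pooling, flattening, dense layer, softmax) can be traced with a minimal NumPy forward pass. The filter count, filter width, pooling size, fixed PReLU slope, and random weights are assumptions for illustration; in a trained network the filters, the slope, and the dense weights are learned, and dropout is only active during training, so it is omitted here.

```python
import numpy as np

def conv1d(signal, filters):
    """Valid 1D convolution (cross-correlation): one activation map per filter."""
    width = filters.shape[1]
    windows = np.stack([signal[i:i + width] for i in range(len(signal) - width + 1)])
    return windows @ filters.T  # shape (positions, n_filters)

def prelu(z, slope=0.25):
    """Rectified linear unit with a (here fixed) slope for negative values."""
    return np.where(z > 0, z, slope * z)

def max_pool1d(maps, size=2):
    """Keep the maximum of each non-overlapping window along the position axis."""
    usable = (maps.shape[0] // size) * size
    return maps[:usable].reshape(-1, size, maps.shape[1]).max(axis=1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=55)                    # hypothetical 55-feature sample
filters = rng.normal(size=(20, 3))         # 20 filters of width 3 (assumed sizes)

maps = prelu(conv1d(x, filters))           # (53, 20) activation maps
pooled = max_pool1d(maps, size=2)          # (26, 20) after down-sampling
flat = pooled.reshape(-1)                  # flattened for the dense layer
dense_weights = rng.normal(scale=0.05, size=(2, flat.size))
probabilities = softmax(dense_weights @ flat)  # two class probabilities
print(maps.shape, pooled.shape, probabilities)
```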

Ensemble Learning
The theory of ensemble methods is that training data are analyzed in multiple ways, and an ensemble of first-stage classifiers is constructed. After that, by integrating the decisions of all those first-stage classifiers, a new ensemble classifier is created using the stacked ensemble approach, where a final-stage model learns how to best combine the predictions from multiple first-stage models. We use a stacking approach [46] that has two stages (see Figure 2). First, multiple first-stage models are trained on a dataset. The prediction of each of the first-stage models is then stored to create a new dataset, in which each instance is connected to the actual value it is expected to estimate. Second, the new dataset is used with the meta-learning algorithm to derive the final prediction. A stacking model is thus made up of base models (also referred to as first-stage models) and a meta-learner (or final-stage classifier) that incorporates the base model predictions. Different first-stage models are trained on the training data. Next, the final-stage model is trained on the training dataset and the outputs of the first-stage models, using previously unused data, to learn how to combine the base model predictions.
The ensemble learner algorithm consists of three phases:
• Combine the decisions from the first-stage classifiers to form a feature matrix.
• Train the final-stage classifier on the new (features x predictions) data. Then the ensemble model combines the first-stage learning models and the final-stage model to get more accurate predictions on unknown data.
3. Test on new data.
• Store output predictions from the first-stage classifiers.
• Input first-stage classifier decisions into a final-stage classifier to make a final ensemble prediction.
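A two-stage stacking ensemble of this kind can be sketched with scikit-learn's `StackingClassifier`. This is an illustration under several assumptions: the data are synthetic, only fully connected first-stage networks are included (scikit-learn offers no 1D-CNN, so that part of the ensemble is left out), and the hyperparameters are placeholders rather than the settings used in the experiments.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the PCA-reduced PE features (55 columns, binary label).
X, y = make_classification(n_samples=600, n_features=55, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# First-stage learners: MLPs with one and two hidden layers.
first_stage = [
    ("dense1_35", MLPClassifier(hidden_layer_sizes=(35,), max_iter=300, random_state=0)),
    ("dense1_50", MLPClassifier(hidden_layer_sizes=(50,), max_iter=300, random_state=1)),
    ("dense2_40_40", MLPClassifier(hidden_layer_sizes=(40, 40), max_iter=300, random_state=2)),
]

# The final-stage learner is trained on out-of-fold predictions of the first stage.
stack = StackingClassifier(
    estimators=first_stage,
    final_estimator=ExtraTreesClassifier(n_estimators=100, random_state=0),
    stack_method="predict_proba",
    cv=5,
)
stack.fit(X_train, y_train)

meta_features = stack.transform(X_test)  # first-stage decisions as a feature matrix
accuracy = stack.score(X_test, y_test)   # final ensemble prediction on new data
print(meta_features.shape, accuracy)
```

The internal cross-validation (`cv=5`) ensures that the final-stage classifier sees first-stage predictions made on data those models were not fitted on, which is the "previously unused data" requirement of stacking.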
The algorithm of ensemble learning is summarized in Figure 3. Stacking improves over any single best learner on the training dataset. The largest gains in performance are typically made when the first-stage classifiers used for stacking have variable and uncorrelated outputs. As first-stage classifiers, we used fully connected one hidden layer MLPs (Dense-1), fully connected two hidden layer MLPs (Dense-2), and one-dimensional CNNs (1D-CNN). The configurations of neural networks are given in Table 1.
The candidate final-stage classifiers are briefly described below. K-nearest neighbors (KNN) classifies unseen input data based on the known input data that are most similar (close) to it. Support vector machine (SVM) is a supervised learning technique that creates a hyperplane in a higher dimension to separate input data belonging to different classes while maximizing the distance of input data to the hyperplane. The ExtraTrees classifier (ET) [47] constructs a meta estimator that fits several decision trees on sub-samples of the training dataset and employs averaging to increase accuracy and manage over-fitting. Linear discriminant analysis (LDA) aims to find a linear combination of input features that separates two or more classes of input data. Quadratic discriminant analysis (QDA) uses a quadratic decision surface to separate two or more classes of input data. Logistic regression (LR) is a statistical method similar to linear regression that predicts an outcome for a binary output variable from input variables. Passive-aggressive classifier (PAC) [48] is one of the incremental learning algorithms that adjusts its weight vector for each misclassified training sample it receives, trying to get it correct. Ridge classifier (RC) converts the label data into {−1, 1} and solves the problem with the regression method. The highest value in prediction is accepted as a target class. Stochastic gradient descent (SGD) classifier is an SGD learning algorithm that finds the decision boundary with hinge loss, similar to a linear SVM.

Evaluation
The performance of the proposed model was evaluated using leave-one-out cross-validation (LOOCV) with 10-fold cross-validation. The true labels were matched against the predicted labels and recall, precision, accuracy, error rate, F-score, and Matthews correlation coefficient (MCC) values were calculated as given in Table 2 (we assumed a binary classification problem). We chose the best model according to the F1-score instead of testing the model with accuracy alone, because accuracy can be a misleading metric in datasets where a significant class imbalance occurs. For example, a model that predicts the majority class for all samples achieves a high accuracy while misclassifying every sample of the minority class. This form of conduct is penalized by the F1-score, which measures the metric for each label and takes the unweighted average.
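The difference between accuracy and the unweighted (macro) average of per-class F1-scores can be seen in a small worked example. The label counts below (90 majority and 10 minority samples) are hypothetical and chosen only to make the effect visible.

```python
def per_class_f1(y_true, y_pred, label):
    """F1-score of one class, treating that class as the positive one."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted average of the per-class F1-scores."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, c) for c in labels) / len(labels)

# Hypothetical imbalanced test set and a model that always predicts the majority class.
y_true = [1] * 90 + [0] * 10
y_pred = [1] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = macro_f1(y_true, y_pred)
print(accuracy, round(f1, 4))  # 0.9 0.4737
```

The accuracy of 0.9 looks strong, while the macro F1-score of about 0.47 exposes that the minority class is never detected.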
We also considered Area Under Curve (AUC) as a measure of the quality of binary classification that is considered as a balanced metric that can be used for highly imbalanced datasets.
Cohen's kappa is calculated by:
$$\kappa = \frac{p_o - p_e}{1 - p_e},$$
where $p_o$ is the ratio of correct agreement, and $p_e$ is the ratio of agreement that is predicted by random choice.
Apart from this, the performance of the proposed model on a binary dataset is represented using the confusion matrix as follows:
$$M = \begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix}.$$
Here, $c_{ij}$ represents the number of elements belonging to the i-th class ($C_i$) but that are classified as members of the j-th class ($C_j$).
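The random guess classifier and the zero rule classifier were adopted as baseline classifiers. In the dataset, the zero rule classifier returned the majority class only. The accuracy of a random guess classifier is calculated as follows:
$$Acc_{RG} = \sum_{i} p_i^2, \quad p_i = \frac{n_i}{N},$$
where $p_i$ is the probability of the i-th class, and $n_i$ is the number of samples of class $i$.

Table 2. Performance measures and their calculation.
• False Positive Rate (FPR) (equal to 1 − specificity): $FPR = \frac{FP}{FP + TN}$
• True Positive Rate (TPR) (also sensitivity and recall): $TPR = \frac{TP}{TP + FN}$
• Accuracy: $Acc = \frac{N_c}{N}$, with $N_c = \sum_{i=1}^{N} [F(x_i) = y_i]$
• Matthews Correlation Coefficient (MCC): $MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
Here, $N_c$ is the sum of correctly classified data samples, $N$ is the total number of data samples, $F$ is the classifier with inputs $x_1, \ldots, x_N$, and $y_1, \ldots, y_N$ are the outputs. TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.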
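The measures and baselines of this section can be computed directly from confusion-matrix counts, as in the sketch below. The counts (TP = 90, FP = 5, FN = 10, TN = 95) are hypothetical. The class proportions 0.744 and 0.256 are an assumption inferred from the zero rule accuracy of 74.4% reported in the results; with them, the random guess accuracy evaluates to about 61.9%.

```python
from math import sqrt

def binary_measures(tp, fp, fn, tn):
    """Accuracy, TPR, FPR, MCC, and Cohen's kappa from binary confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    p_o = accuracy
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (p_o - p_e) / (1 - p_e)
    return {"accuracy": accuracy, "tpr": tpr, "fpr": fpr, "mcc": mcc, "kappa": kappa}

def random_guess_accuracy(class_probabilities):
    """Expected accuracy when labels are guessed according to the class proportions."""
    return sum(p * p for p in class_probabilities)

def zero_rule_accuracy(class_probabilities):
    """Accuracy of always predicting the majority class."""
    return max(class_probabilities)

measures = binary_measures(tp=90, fp=5, fn=10, tn=95)  # hypothetical counts
priors = [0.744, 0.256]                                # assumed class proportions
print(measures)
print(round(random_guess_accuracy(priors), 3), zero_rule_accuracy(priors))  # 0.619 0.744
```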
For statistical analysis, we used the performance results obtained from each fold of the 10-fold cross-validation. To compare the results and evaluate their statistical significance, we used the Friedman test and post hoc Nemenyi test. First, all methods were ranked based on some selected performance measures (we used accuracy, AUC, and F1-score). Then, the mean ranks of each method were calculated. The difference between method performance was considered as not significant if the difference between mean ranks of the methods was smaller than the critical difference derived from the Nemenyi test.
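The ranking step and the Nemenyi critical difference can be sketched in plain Python. The per-fold scores below are hypothetical, and the critical value 2.343 is the commonly tabulated Nemenyi value for three methods at a significance level of 0.05; both are assumptions for illustration. The critical difference is computed as $CD = q_\alpha \sqrt{k(k+1)/(6N)}$ for $k$ methods and $N$ folds.

```python
from math import sqrt

def mean_ranks(scores_per_fold):
    """scores_per_fold[i][j] is the score of method j on fold i; rank 1 is the highest score."""
    n_methods = len(scores_per_fold[0])
    totals = [0.0] * n_methods
    for fold in scores_per_fold:
        order = sorted(range(n_methods), key=lambda j: -fold[j])
        position = 0
        while position < n_methods:
            tie_end = position
            while tie_end + 1 < n_methods and fold[order[tie_end + 1]] == fold[order[position]]:
                tie_end += 1
            average_rank = (position + tie_end) / 2 + 1  # tied methods share the mean rank
            for j in order[position:tie_end + 1]:
                totals[j] += average_rank
            position = tie_end + 1
    return [total / len(scores_per_fold) for total in totals]

def nemenyi_critical_difference(n_methods, n_folds, q_alpha):
    return q_alpha * sqrt(n_methods * (n_methods + 1) / (6 * n_folds))

# Hypothetical accuracy of three methods on each of 10 folds.
scores = [[0.990 + 0.001 * (i % 3), 0.980 + 0.001 * (i % 2), 0.950] for i in range(10)]
ranks = mean_ranks(scores)
cd = nemenyi_critical_difference(n_methods=3, n_folds=10, q_alpha=2.343)

print(ranks, round(cd, 3))
# Pairs whose mean ranks differ by less than cd are treated as not significantly different.
for a in range(3):
    for b in range(a + 1, 3):
        print(a, b, abs(ranks[a] - ranks[b]) >= cd)
```

In this toy example the first and third methods differ significantly (rank difference 2.0), whereas adjacent methods do not (rank difference 1.0, below the critical difference of about 1.048).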

Settings of Experiments
The classifiers were trained on the features extracted from the dataset using Python's scikit-learn library. All experiments were executed on a laptop computer running 64-bit Windows 10 with an Intel Core i5-8265U CPU (1.80 GHz) and 8 GB of RAM.

Results of Machine Learning Methods
The results of the baseline machine learning methods are presented in Table 3, and their confusion matrices are given in Figure 5. The best results, false positive and false negative rates of 2.13% and 0.31%, respectively, were obtained by the RF model. Its accuracy of 99.24% and F1-score of 0.98 indicate that RF classified instances of each of the two classes quite well. For comparison, the accuracy of the random guess classifier on this dataset was 61.9%, whereas the accuracy of the zero rule classifier was 74.4%.
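The two baseline figures are mutually consistent, as the following short check suggests. It assumes that the random guess classifier draws labels according to the class frequencies (so its expected accuracy is the sum of squared class probabilities) and that the majority-class proportion equals the reported zero rule accuracy; the class proportions are inferred from that figure rather than read from the dataset.

```python
def zero_rule_accuracy(class_probs):
    """Accuracy of always predicting the most frequent class."""
    return max(class_probs)


def random_guess_accuracy(class_probs):
    """Expected accuracy when labels are guessed with the class frequencies."""
    return sum(p * p for p in class_probs)


# Class proportions inferred from the reported zero rule accuracy (assumption).
class_probs = [0.744, 0.256]
zero_rule = zero_rule_accuracy(class_probs)
random_guess = random_guess_accuracy(class_probs)
print(f"zero rule: {zero_rule:.3f}, random guess: {random_guess:.3f}")
```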

Results of Neural Network Models
To select the first-stage classifiers, we first performed an ablation study to find the best Dense-1, Dense-2, and 1D-CNN models across different settings of their hyperparameters. In all experiments, we used the sparse categorical cross-entropy loss and the Adam optimizer. Eighty percent of the data was used for training and 20% for testing. The results are shown in Tables 4-6. We trained the Dense-1 and Dense-2 models for 200 epochs, while the 1D-CNN models were trained for 50 epochs.

Results of Ensemble Classification
Based on the presented ablation study, we selected the best-performing models as first-stage classifiers: two Dense-1 models (with 35 and 50 neurons), two Dense-2 models (with (40;40) and (40;50) neurons), and three 1D-CNN models ((20;60), (60;40), and (60;60)). We performed classification with several last-stage classifiers. For KNN, the number of nearest neighbors was set to 3. For linear SVM, C was set to 0.025. For RBF SVM, C was set to 1 and gamma was set to 2. For DT and RF, the maximum depth was set to 5. The results are given in Table 7. In all experiments, 10-fold cross-validation was used, where the training fold was constructed by selecting 80% of samples, while 20% were used for the testing fold. The results are also illustrated in Figures 6-8. Note that ExtraTrees as the final-stage classifier achieved the best performance in terms of accuracy, F1-score, and AUC.
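The two-stage scheme can be sketched with scikit-learn as follows. This is a simplified illustration under several assumptions rather than the implementation used in the experiments: the data are synthetic, small `MLPClassifier` networks stand in for the Dense and 1D-CNN models, only three first-stage models are used, and the first-stage outputs passed to the ExtraTrees meta-learner are out-of-fold class probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data; the PE feature set of the study is not reproduced.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Stand-in first-stage models (assumption: MLPs instead of Dense/1D-CNN nets).
base_models = [
    MLPClassifier(hidden_layer_sizes=(35,), max_iter=500, random_state=0),
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=1),
    MLPClassifier(hidden_layer_sizes=(40, 40), max_iter=500, random_state=2),
]

# Out-of-fold probabilities of the positive class form the meta-features,
# so the meta-learner never sees predictions made on a model's training data.
meta_train = np.column_stack([
    cross_val_predict(model, X_train, y_train, cv=5,
                      method="predict_proba")[:, 1]
    for model in base_models
])

# Refit each first-stage model on the full training split for test-time use.
for model in base_models:
    model.fit(X_train, y_train)
meta_test = np.column_stack(
    [model.predict_proba(X_test)[:, 1] for model in base_models])

# Final-stage classifier trained on the first-stage outputs.
meta_learner = ExtraTreesClassifier(n_estimators=100, random_state=0)
meta_learner.fit(meta_train, y_train)
stacked_accuracy = meta_learner.score(meta_test, y_test)
print(f"stacked accuracy on held-out data: {stacked_accuracy:.3f}")
```

Any of the other last-stage classifiers listed above could be substituted for `ExtraTreesClassifier` without changing the rest of the sketch.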

Statistical Analysis
To analyze the results statistically, we adopted the nonparametric Friedman test and the post hoc Nemenyi test. The test results are shown as critical difference (CD) diagrams in Figures 9-11. If the difference between the mean ranks of two final-stage classifiers was smaller than the CD, then it was not statistically significant. The results of the Nemenyi test again show that the ExtraTrees final-stage classifier achieved the best performance; however, the DT and KNN classifiers showed statistically similar performance.

Comparison with Previous Work
We compare the results of our experiments with some of the previous work on classifying benign and malware files in Table 8. Note that the compared methods were applied to different malware datasets. On the same dataset, the previous best result, an accuracy of 98.62%, was achieved in [49] using XGBoost.

Conclusions
There is a rise in demand for intelligent methods that recognize new malware cases because the current methods are tedious and error-prone. This study explored various machine learning classifiers and neural network models, which are artificial intelligence methods that can be used for detecting malware. We proposed an ensemble learning-based framework with neural networks used as first-stage classifiers and explored 15 machine learning models as final-stage classifiers. Five different machine learning algorithms were used for comparison as baseline models. We performed our experiments on a dataset containing Windows Portable Executable (PE) malware and benign files. The results obtained indicate that the ensemble of fully connected dense ANN and 1D-CNN models with ExtraTrees as a final-stage classifier achieved the best accuracy, outperforming the other methods.
Most of the known malware recognition methods concentrate on feature engineering techniques to improve detection accuracy; the advantage of our deep learning-based approach is the end-to-end learning process, which achieves high malware recognition performance without the need for manual feature engineering. Thus, ensemble learning techniques can be adopted as intelligent techniques for malware detection and classification. However, the proposed framework is limited to supervised learning, which requires both benign and malicious samples to be identified and labeled by experts. In a real-world setting, some malicious code may not be identified, and thus the neural network cannot be trained to recognize it. This raises the need for developing unsupervised ensemble learning frameworks for malware recognition.
Future work will study explainable artificial intelligence (XAI) techniques for interpreting the results of deep learning models for malware recognition, in order to provide valuable insights for researchers in malware analysis. We also plan to conduct additional experiments with larger datasets to validate the proposed framework.