1. Introduction
In the age of interconnected computer networks, malware poses a significant threat to computer security [1,2,3]. To mitigate the risks posed by cyberattacks, the establishment of Computer Emergency Response Teams (CERTs) has been crucial. These CERTs play a vital role in safeguarding IT infrastructures by actively monitoring and analysing cyber-security events, as well as raising awareness about cyber security among various stakeholders [2].
As reported by Cybercrime Magazine [4], the global damage caused by ransomware alone, a type of malware, amounted to USD 20 billion in 2021 and is estimated to increase to USD 265 billion in 2031. This is a result of not only the attackers' innovative methods but also the rapid proliferation of malware, with an expected production of over 91 million new pieces in 2022 [5]. Thanks to its user-friendly interface and lightweight operation, the Windows operating system has experienced rapid growth in demand. However, this popularity also makes it a target for security threats [6]. Users who lack advanced computer skills may not be aware of these threats, which can take the form of illegal exploitation of the system, and malware is one of the most common methods used to carry out such attacks. For example, the Windows Portable Executable (PE) is a file format used by Windows operating systems to store executable code, data, and resources. Malware authors can create malicious programs in the form of Windows PE files, which can then be distributed and executed on Windows systems; these types of malware are referred to as Windows PE malware or simply PE malware [7]. Over the years, malware has undergone significant structural changes; however, there are still identifiable features that can help analysts detect it. When malware runs on a Windows operating system, it typically utilises some of the operating system's services: it accesses Windows DLLs to invoke various API functions, and the sets of DLLs accessed or API functions called produce the malicious behaviour. Additionally, analysing the information in the PE header and its various sections [8] can aid in detecting malware. Ultimately, if the malicious behaviour is thoroughly analysed, it is possible to detect malware.
Classification of malicious software is an ongoing challenge due to its constant development and improvement by malicious actors. Various techniques, such as signature-based, behaviour-based, and specification-based methods, are commonly employed for malware detection [3]. Signature-based techniques offer fast detection and low computational requirements but struggle with new or unknown malware. Behaviour-based methods can detect both known and unknown malware but require high computational resources [9]. To address these limitations, researchers have developed specification-based techniques, which build on behaviour-based approaches. Data mining and ML techniques have been successfully applied, achieving high accuracy in malware detection and classification [10,11,12]. These methods, particularly effective against metamorphic malware, focus on behavioural features rather than structural ones [13,14]. API call sequences, which offer valuable insights into malware behaviour, have been widely utilised by researchers [15,16,17,18]. The ability to classify malware based on API call behaviour enables faster attribution and a better understanding of the malware's impact, empowering administrators to devise appropriate mitigation strategies. Although existing approaches extract features from Windows PE executable files and employ ML algorithms for malware detection, there is a lack of comprehensive research investigating the correlation between feature extraction and scoring techniques for API calls and their impact on the accuracy and efficiency of various ML algorithms in malware detection.
Hence, this paper aims to answer the following three research questions:
RQ1: How does extracting the frequency of API calls from Windows PE files contribute to the detection and classification of malware (Section 4.3)?
RQ2: What are the most influential features identified by the feature extraction and scoring methods in distinguishing between malware and goodware (Section 4.4)?
RQ3: How does the performance of state-of-the-art (SOTA) ML algorithms vary in terms of accuracy and efficiency for malware detection when trained on the extracted features from Windows PE files (Section 4.5)?
By considering these research questions, the research contributions of this paper are as follows:
Malware analysis: Dynamic analysis of malware samples within a controlled Cuckoo Sandbox environment, which yielded critical insights into malware behaviour (Section 3.1.2).
Dataset generation: Curation of a new dataset consisting of API calls of both malware and goodware derived from Windows PE files (Section 3.1.3).
Malware classification: Classification of malware families by leveraging the VirusTotal API for accurate labelling (Section 4.2).
API categorisation: Detailed categorisation of APIs used within each PE file (Section 4.3).
Feature scoring: Identification of APIs that contribute to malicious activities utilising feature scoring methods (Section 4.4).
Performance evaluation: A comprehensive performance analysis of ML models based on performance metrics (Section 4.5).
Comparative analysis: A comparative analysis highlighting the distinguishing features of our approach compared to existing research in the field (Section 5).
The remainder of the paper is organised as follows: Section 2 provides an overview of the related work in the field. Section 3 presents the methodology employed in our research. Section 4 presents and discusses the results obtained from the classification algorithms used. Section 5 presents a comprehensive comparative analysis, highlighting the unique aspects of our work compared to existing studies. Section 6 focuses on the limitations of the research and outlines future research directions, particularly addressing the highlighted shortcomings. The paper concludes with Section 7, where a summary of the key findings and contributions of the research is presented.
2. Related Work
This section discusses the existing research that has used ML for malware analysis, with a focus on API calls and feature extraction. Dynamic feature-based malware detection techniques are widely used to detect and classify unseen malware. These techniques typically inspect activities such as resource consumption, triggered system calls [19], invoked API calls [20], function-based features [21], and others to differentiate between malware and benign software. Usually, behaviour patterns are collected while the input file is executed in an isolated environment known as a sandbox. Numerous dynamic feature-based malware detection approaches that employ sandbox technology have been proposed over the years [22,23]. Yazi et al. [13] used Cuckoo Sandbox to extract several API calls from several malware types. Qiao et al. [24] showed that, in the context of malware analysis, a sequence of API calls can be viewed as a set of transaction data, where each API call and its corresponding arguments form a collection of items that can potentially reveal patterns of behaviour indicative of malicious activity. Shudong Li et al. [25] computed category weights for each malware family and then performed feature selection based on those weights to ensure the accuracy of the classification task. Hansen et al. [26] employed API call sequences and frequencies to identify and classify malware using the Random Forest classifier. Daeef et al. [27] proposed a method to uncover the underlying patterns of malicious behaviour among different malware families by utilising the Jaccard index and visualisation techniques. J. Singh et al. [28] and Albishry et al. [29] explained how ML techniques have been widely utilised in the field of malware detection; these techniques involve training classification algorithms using features extracted from malware samples. Vadrevu et al. [30], Mills et al. [31], Uppal et al. [32], and Kwon et al. [33] employed Random Forest classifiers to detect malware based on PE file characteristics and network behaviours. Similarly, Mao et al. [34], Wüchner et al. [35], and Ahmadi et al. [36] developed Random Forest classifiers to detect malware using features such as system calls, file system activities, and Windows registry information. Amer and Zelinka [37] proposed an approach focused on extracting the most significant features from malware PE files to reduce data dimensionality. In addition, Dener et al. [38] and Azmee et al. [39] compared the performance of various ML algorithms for PE malware detection and found that XGBoost and logistic regression exhibited the best performance among the tested methods.
Due to the varying lengths of API call sequences, it can be challenging to identify robust features for malware detection. To address this issue, researchers have proposed deep learning models based on API call sequences. Recurrent neural networks (RNNs) are particularly effective in handling time-series sequences, especially in the field of natural language processing. Li et al. [40] utilised an RNN model to classify malware families, with long API call sequences used as the basis for categorising different types of malware. Eskandari et al. [41] utilised RNNs and features extracted from API requests to differentiate between malware and benign files. Oliveira et al. [42] proposed a method of converting API calls into a graph structure and using a deep graph convolutional neural network (DGCNN) to distinguish between malware and legitimate samples. Tang et al. [43] proposed a method to represent malware behaviour by converting API calls to images based on a colour-mapping criterion and classifying them with a convolutional neural network (CNN). The work by Fujino et al. [44] introduced the concept of API call topics as a way to identify similar malware samples based on their API call behaviour. David and Netanyahu [45] proposed an approach called DeepSign, which utilises deep learning to automatically generate malware signatures. Salehi et al. [46] proposed a dynamic malware detection approach named MAAR that analyses the behaviour of malware in the system during runtime; the method generates features based on API calls and their corresponding arguments to identify malicious activities.
Existing approaches often use Windows APIs for feature extraction and employ machine learning algorithms for malware detection. However, there is a research gap concerning the significance of API-call feature extraction, the role of scoring techniques, and their combined impact on various ML algorithms for efficient malware detection. This paper adopts established methodologies in malware detection, particularly concentrating on malware family classification and API categorisation, aligning with the existing studies. However, what distinctly differentiates this research is the strategic use of feature selection and scoring models, particularly Chi2 and Gini importance, to reduce the feature space and assess individual feature significance in identifying malicious activities. This dual-layered approach improves precision and effectiveness in malware detection by pinpointing key API calls indicative of malicious behaviour. The evaluation using ML models on a new dataset enhances our understanding of specific API calls’ contributions to identifying malicious activity, thereby offering a significant contribution to the ongoing efforts in malware detection and analysis.
3. Materials and Methods
This section, as illustrated in Figure 1, outlines in detail the five stages of the proposed methodology. Initially, a diverse set of PE files was gathered, comprising both malware and goodware samples. These files were subjected to dynamic analysis within the Cuckoo Sandbox environment, yielding comprehensive JSON reports. The next phase involved the extraction of API calls and their counts from these reports, leading to the generation of a new dataset. A subsequent step involved malware family classification and API categorisation according to their role in malicious activities. Afterwards, this dataset underwent feature extraction and scoring using Chi2 and Gini to assess the relevance and impact of different APIs. The final stage included an empirical evaluation using six state-of-the-art ML models.
3.1. Data Collection, Dynamic Analysis, and Dataset Generation
3.1.1. Data Collection
The first stage of the adopted methodology involved the meticulous selection of PE files, drawing from two distinct sources to ensure a balanced dataset for dynamic malware analysis. The 1500 Windows PE malware files were collected from the MalwareBazaar repository [47], specifically focusing on samples from the year 2023. To complement this, 1000 goodware files were gathered from a GitHub repository [48], aiming to create a balanced dataset that equally represented both malware and goodware. It is important to highlight that, to the best of our knowledge, the specific malware samples included in this study have not been dynamically analysed in previous research. This unique aspect not only contributes to the novelty of this research but also ensures that its findings and insights are based on fresh and unexplored data.
3.1.2. Cuckoo Sandbox Setup
The dynamic analysis component of this research is a critical aspect of the methodology, leveraging the capabilities of Cuckoo Sandbox [49], a renowned open-source tool for analysing malicious code. It provides an isolated environment using virtual machines to ensure that the malware does not affect the host system or network. Cuckoo Sandbox employs API hooks, which are instrumental in capturing the malware's behaviour, providing insights into how it interacts with the system. Multiple virtual machines, managed with Docker containers, were employed for this purpose. Each virtual machine ran a 64-bit Windows 10 system. Virtual machine snapshots were used to maintain the integrity of the analysis environment: after the execution of each PE file, the virtual machine was reverted to a clean state using these snapshots to ensure an uncontaminated environment. The execution of each Windows PE file resulted in the generation of a JSON file. These JSON files were comprehensive, detailing the behaviour of the sample during its execution. The study amassed a significant collection of these files, with 583 malware and 438 goodware JSON files. The selection criteria were based on the presence of imported functions, API calls, or both within these files. This process led to a curated dataset comprising 1021 JSON files to facilitate a comprehensive analysis.
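The paper does not detail how samples were queued for analysis; the following is a minimal sketch of one plausible workflow, assuming Cuckoo's bundled REST API service is running on its default port (8090) and using its documented /tasks/create/file and /tasks/report endpoints. The paths and polling interval are illustrative.

```python
import time

import requests

CUCKOO_API = "http://localhost:8090"  # assumed default address of Cuckoo's REST API

def submit_sample(path):
    """Submit a PE file for analysis and return the assigned task id."""
    with open(path, "rb") as sample:
        resp = requests.post(f"{CUCKOO_API}/tasks/create/file",
                             files={"file": (path, sample)})
    resp.raise_for_status()
    return resp.json()["task_id"]

def fetch_report(task_id, poll_seconds=30):
    """Poll until the JSON report is available, then return it."""
    while True:
        resp = requests.get(f"{CUCKOO_API}/tasks/report/{task_id}")
        if resp.status_code == 200:
            return resp.json()
        time.sleep(poll_seconds)  # analysis still running; try again later
```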
3.1.3. Dataset Generation
This phase involved filtering out irrelevant data from the Cuckoo JSON reports for dataset generation. To facilitate this, Python's json and os libraries (Table A1) were employed to sift through the JSON data, extracting API calls and their respective counts. API calls are crucial indicators of how malware interacts with a system, and their frequency can provide insights into the nature and intent of the malware. The next step was to transform these curated data into the Comma-Separated Values (CSV) format, which is more accessible and amenable to analysis. For this purpose, the Python pandas library (Table A1) was employed to convert the JSON data into CSV format. The resulting dataset, as shown in Table 1, was structured with rows representing individual samples and columns representing different API calls. Each cell within this table provides the frequency of triggers for the corresponding API call in a specific sample. This structure allows for a clear and comprehensive representation of the data, where each row offers a complete profile of a sample in terms of its API calls. Step 1 of Algorithm 1 describes the function ExtractAPIsFromJSON, which processes a collection of Cuckoo JSON reports, extracts API call counts from each report, assigns the label 1 for malware and 0 for goodware, and stores this information in a CSV file.
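A minimal sketch of this extraction step, assuming the standard Cuckoo 2.x report layout (a behavior section containing per-process calls, each with an api field); the directory names are illustrative, while api_counts.csv matches the file name used in Algorithm 1.

```python
import json
import os
from collections import Counter

import pandas as pd

def extract_apis_from_json(report_dir, label):
    """Count API calls in every Cuckoo JSON report found in report_dir."""
    rows = []
    for name in os.listdir(report_dir):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(report_dir, name)) as fh:
            report = json.load(fh)
        counts = Counter()
        # Cuckoo 2.x reports list, per monitored process, every hooked API call.
        for process in report.get("behavior", {}).get("processes", []):
            for call in process.get("calls", []):
                counts[call["api"]] += 1
        if counts:  # keep only reports that actually contain API calls
            rows.append({**counts, "label": label})
    return rows

rows = extract_apis_from_json("reports/malware", label=1)    # 1 = malware
rows += extract_apis_from_json("reports/goodware", label=0)  # 0 = goodware
pd.DataFrame(rows).fillna(0).astype(int).to_csv("api_counts.csv", index=False)
```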
3.2. Malware Family Classification
In the second phase, we leveraged the capabilities of VirusTotal, a widely recognised online service that provides malware analysis through an extensive network of over 70 antivirus engines [50]. The VirusTotal API [51] was utilised via the Python requests library (Table A1), which facilitates the submission of SHA256 hashes and responds with a JSON object containing the analysis results from the various antivirus engines. This response is pivotal in determining the family name of each malware sample. A voting mechanism was employed to ascertain the most commonly identified family name for each sample: the family names provided by the antivirus engines are aggregated, and the most frequently occurring name is assigned to the sample. The use of this voting mechanism ensures a more accurate and consensus-based malware family classification, thereby enhancing the reliability of our dataset for subsequent research. The results of this voting mechanism were compiled and stored in a CSV file. Step 2 of Algorithm 1 describes the function LabelMalware, which performs malware family classification on a set of samples by querying the VirusTotal API. For each sample, the responses are collected using the function QueryVirusTotal, and the classification outcome is determined through the VotingMechanism function. It ultimately returns a dictionary of labels representing the malware families.
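A condensed sketch of the lookup-and-vote logic, using the documented VirusTotal API v3 file endpoint. In practice, raw engine verdict strings usually need normalisation to family names before voting; that detail, the helper name, and the fallback value are illustrative assumptions.

```python
from collections import Counter

import requests

VT_URL = "https://www.virustotal.com/api/v3/files/{}"

def label_malware(sha256, api_key):
    """Return the majority family/verdict name reported for a sample hash."""
    resp = requests.get(VT_URL.format(sha256), headers={"x-apikey": api_key})
    resp.raise_for_status()
    results = resp.json()["data"]["attributes"]["last_analysis_results"]
    # Voting mechanism: tally the verdict string of every engine that flagged the file.
    votes = Counter(r["result"] for r in results.values() if r["result"])
    return votes.most_common(1)[0][0] if votes else "undetected"
```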
3.3. API Categorisation
At this stage, the focus shifted towards categorising APIs based on the specific families they belong to. This categorisation is essential for gaining granular insights into the roles these APIs play in malicious code execution. To facilitate it, a dedicated Python module was developed to rigorously analyse the APIs and assign them to their respective categories. By understanding which categories of APIs are most commonly used in malicious code, researchers and analysts can gain a deeper understanding of the techniques employed by malicious actors. Furthermore, it makes it possible to tailor security measures that are more responsive to these specific types of API calls. Step 3 of Algorithm 1 provides an overview of the function CategoriseAPIs, which categorises the API call data stored in a CSV file (api_counts.csv) based on a provided mapping of APIs to categories. It reads the CSV data, iterates through each row, identifies relevant APIs present in the mapping, and creates a categorised dataset. The resulting dataset is written to a new CSV file (categorised_apis.csv), and the function returns the path to this categorised dataset file.
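A minimal sketch of this categorisation step, consistent with the api_counts.csv and categorised_apis.csv file names used in Algorithm 1; the mapping entries shown are illustrative examples only (the study mapped 266 APIs into 23 categories, as reported in Section 4.3).

```python
import pandas as pd

def categorise_apis(csv_in, mapping, csv_out):
    """Aggregate per-API counts into per-category counts via an API -> category mapping."""
    df = pd.read_csv(csv_in)
    out = pd.DataFrame({"label": df["label"]})
    for api, category in mapping.items():
        if api in df.columns:  # only APIs that actually occur in the dataset
            out[category] = out.get(category, 0) + df[api]
    out.to_csv(csv_out, index=False)
    return csv_out

# Illustrative entries only; the real mapping covers 23 categories.
mapping = {"NtAllocateVirtualMemory": "kernel",
           "RegOpenKeyExA": "registry",
           "InternetOpenA": "internet"}
categorise_apis("api_counts.csv", mapping, "categorised_apis.csv")
```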
Algorithm 1: Algorithm for generating labelled and categorised malware dataset
3.4. Feature Selection and Scoring
In this section, the specifics of both the Chi-Square and Gini Index methodologies are discussed. The use of both Chi2 and the Gini Index serves a dual function of feature selection and scoring. This approach is crucial in pinpointing the key features that play a significant role in the detection of malicious activities.
In the initial stage, the Chi2 algorithm was applied to evaluate the independence between each feature and the target variable. This narrowed the feature set down from the initial 266 features to a reduced set of 105. Feature scoring was then utilised to pinpoint the most critical features for detecting malicious activity. The function ChiSquareFeatureReduction(X, Y) in Step 4 of Algorithm 2 provides an overview of the reduction of the feature set X, retaining the most informative features (ReducedFeatures), and computes Chi2Scores for the top 50 features, returning both for further analysis.
Equation (1) illustrates how the Chi2 statistic ($\chi^2$) measures the discrepancy between observed and expected frequencies in a feature set:

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \quad (1)$$

It calculates the squared differences between the observed ($O_i$) and expected ($E_i$) frequencies for each feature, where the expected frequencies are derived assuming independence between the variables. The sum runs over each feature, from $i = 1$ to $n$, and dividing each squared difference by the expected frequency $E_i$ yields a measure of the overall discrepancy from the expected distribution.
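A compact sketch of the Chi2 scoring step using scikit-learn's SelectKBest. The paper's intermediate reduction from 266 to 105 features presumably used a score threshold, which is not reproduced here; k = 50 mirrors the top-50 scoring described above.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

df = pd.read_csv("api_counts.csv")
X, y = df.drop(columns=["label"]), df["label"]

# Chi2 requires non-negative features; raw API call counts satisfy this.
selector = SelectKBest(chi2, k=50).fit(X, y)
chi2_scores = (pd.Series(selector.scores_, index=X.columns)
                 .sort_values(ascending=False))
print(chi2_scores.head(50))  # the top 50 scored features
```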
In the second stage, Gini Importance was applied to assess feature importance. Features were ranked according to their Gini Importance values, enabling the identification of a subset of the most influential features. Consequently, the feature count decreased from the original 266 to 50, streamlining the analysis for greater focus and efficiency. The function GiniImportance(X, Y) describes the further reduction of the feature set from 150 to the top 50 features using the Gini Importance criterion. It also calculates GiniScores for these features and returns them along with the TopFeatures set.
Equation (2) defines the Gini Importance (GI) metric for a feature $f$:

$$GI(f) = \sum_{t \in T} GG(f, t) \quad (2)$$

The GI is obtained by summing the Gini Gain (GG) across all trees $t$ in the ensemble $T$, where GG measures the improvement in Gini impurity achieved by splitting on $f$ in a particular tree. The dataset of extracted features was then divided into training and testing sets, with 80% of the data allocated for training the models and the remaining 20% reserved for evaluating their performance, using Python's Scikit-Learn library (Table A1). This division is performed by the function SplitData, which splits the input dataset into training and testing sets with the specified ratio and returns them for use in machine learning model training and evaluation.
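Continuing from the previous sketch, the following shows the Gini-based ranking and the 80/20 split, using the Gini (mean decrease in impurity) importance that scikit-learn's RandomForestClassifier exposes as feature_importances_; the forest's hyperparameters and the random seed are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# feature_importances_ is scikit-learn's Gini importance for tree ensembles.
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
gini_scores = pd.Series(forest.feature_importances_, index=X.columns)
top_features = gini_scores.sort_values(ascending=False).head(50).index

# 80/20 train/test split on the reduced feature set, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X[top_features], y, test_size=0.2, random_state=42)
```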
3.5. Implemented SOTA Baseline ML Models
In the evaluation of the newly curated dataset, six baseline ML models were utilised to assess its quality. These models were Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), Naive Bayes (NB), XGBoost (XGB), and K-Nearest Neighbour (KNN). Each of these models offers a unique approach to classification and prediction, making them suitable for a comprehensive analysis of the dataset's accuracy and reliability. The diversity of these models ensured a thorough evaluation, covering various aspects of ML, from simple linear approaches to more complex ensemble-based methods.
Logistic Regression was used to create a decision boundary separating malware and goodware based on the input features. Equation (3) illustrates the formula, where $\beta_0, \beta_1, \ldots, \beta_n$ are the model coefficients and $e$ is the base of the natural logarithm:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}} \quad (3)$$
Random Forest was used to build multiple decision trees and combine their predictions to classify the input (Equation (4)), where $N$ is the number of trees and $\hat{y}_i$ is the prediction of the $i$-th decision tree:

$$\hat{y} = \frac{1}{N} \sum_{i=1}^{N} \hat{y}_i \quad (4)$$
Support Vector Machine was used to create a hyperplane that classifies a given program as malicious or benign (Equation (5)), where $b$ is the bias and $w$ is the weight vector:

$$f(x) = \operatorname{sign}(w^\top x + b) \quad (5)$$
Naive Bayes was used to calculate the probability of each feature belonging to malware or goodware to predict a given input. Equation (6) calculates these probabilities using Bayes' theorem, where $P(c \mid x)$ is the posterior probability, $P(x \mid c)$ is the likelihood, $P(c)$ is the class prior probability, and $P(x)$ is the predictor prior probability:

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)} \quad (6)$$
XGBoost is a gradient-boosting algorithm used for classification and regression tasks, known for its high performance and scalability. In Equation (7), $L$ is the loss function, $y_i$ is the true label, and $\hat{y}_i$ is the predicted label:

$$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i) \quad (7)$$
K-Nearest Neighbour is a lazy learning algorithm used for classification and regression, where predictions are based on the majority vote of the nearest neighbours in the training data. Equation (8) represents the formula, where $k$ is the number of nearest neighbours and $y_1, \ldots, y_k$ are the labels of the $k$ nearest neighbours:

$$\hat{y} = \operatorname{mode}(y_1, y_2, \ldots, y_k) \quad (8)$$
3.6. Performance Evaluation
Accuracy (A) is a common evaluation metric used to measure the performance of a classification model. It represents the percentage of correctly classified instances out of the total number of instances in the dataset.
Precision (P) is used to measure how often the model is correct when it predicts the positive class.
Recall (R) is used to measure how well the model can detect positive samples and is useful for evaluating models that need to avoid false negatives.
F1-score (F1) is used to balance precision and recall, indicating the model’s overall performance. A high F1-score indicates that the model is good at avoiding both false positives and false negatives.
Area under the curve (AUC) is used to measure the ability of a binary classification model to distinguish between positive and negative samples.
Specificity (S) is an evaluation metric used to measure the ability of a classification model to correctly identify negative instances or the true-negative rate.
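In terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), these metrics reduce to the following standard formulas (the AUC is instead computed from the ROC curve):

$$A = \frac{TP + TN}{TP + TN + FP + FN}, \qquad P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

$$F1 = \frac{2PR}{P + R}, \qquad S = \frac{TN}{TN + FP}$$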
Step 5 of Algorithm 2 explains the function EvaluateModels, which conducts an empirical evaluation of the machine learning algorithms on the training and testing datasets. It trains and evaluates the six different models (LR, RF, SVM, NB, XGB, and KNN) using the provided training and testing data.
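A minimal sketch of such an evaluation loop, reusing the X_train/X_test split produced earlier; the hyperparameters and the Gaussian NB variant are assumptions, since the paper does not state them.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100),
    "SVM": SVC(),
    "NB": GaussianNB(),
    "XGB": XGBClassifier(),
    "KNN": KNeighborsClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: A={accuracy_score(y_test, pred):.2f} "
          f"P={precision_score(y_test, pred):.2f} "
          f"R={recall_score(y_test, pred):.2f} "
          f"F1={f1_score(y_test, pred):.2f}")
```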
Algorithm 2: Feature selection, scoring, and empirical evaluation of ML
4. Results and Discussion
This section systematically presents the findings of the conducted research, highlighting key results in malware classification, API categorisation, feature scoring, and the evaluation of the utilised ML models. The research outcomes are discussed in detail, providing a comprehensive overview of the significant discoveries.
4.1. Experimental Setup
Table 2 presents the system and software specifications (experimental setup) used for evaluation. The research environment also included VirtualBox 7.0 for virtualisation, Docker version 23.0.3 for containerisation, Cuckoo Sandbox version 2.0.7 for dynamic analysis, and the VirusTotal API (v3) for malware family classification.
4.2. Malware Family Classification
Malware classification encompassed 11 distinct families, each assigned labels and corresponding distribution percentages. This is crucial for understanding the diverse nature of malware and its prevalence. Each of these categories is represented in Table 3, providing an overview of the malware families as observed in this dataset.
The analysis of malware families provides crucial insights for tailoring cyber-security measures to combat specific threats effectively. The high prevalence of Trojans (27%) suggests a need for enhanced vigilance in software verification processes, as Trojans often masquerade as legitimate applications. The prevalence of ransomware (22%) emphasises the importance of robust backup systems and advanced encryption detection tools: regular backups can mitigate the impact of data encryption by ransomware, while specialised algorithms can help in the early detection and prevention of ransomware attacks. Downloaders and droppers, each at 11%, highlight the necessity for advanced network monitoring and endpoint security solutions. These types of malware often initiate the downloading and installation of additional malicious software, making it crucial to detect and block them at the earliest stages. The presence of generic malware, also at 11%, highlights the need for versatile and adaptive malware detection systems capable of identifying and responding to a wide range of malicious behaviours that may not fit into specific categories. Backdoors and stealers, each constituting 5%, call for enhanced network security and data protection measures: backdoors require robust system integrity checks to prevent unauthorised access, while stealers necessitate strong encryption and privacy safeguards to protect sensitive information. Spyware, at 4%, and adware, at 3%, highlight the importance of privacy protection and user consent mechanisms; effective adware-blocking algorithms can help mitigate the risks associated with these types of malware. Rootkits and worms, each making up 1% of the dataset, demand specialised countermeasures: rootkits call for detection algorithms that use deep system scans to uncover hidden malware, while worms necessitate network security measures to prevent their spread across systems and networks. These findings emphasise the need for a multi-faceted and proactive approach to cyber security, with tailored strategies to address the unique challenges posed by each malware family.
4.3. API Categorisation
This section discusses the results of the overall categorisation of 266 APIs and gives a detailed overview of the top 50 APIs deemed significant by the feature selection models, Chi2 and Gini. The analysis offers a comprehensive overview of these crucial APIs, determining their roles and importance in the context of malware behaviours and their operational patterns.
4.3.1. Overall API Categorisation
Table 4 provides details on the breakdown of 266 API calls in the context of malware analysis, categorising them into 23 distinct categories. This detailed categorising offers valuable insights into how various API categories are leveraged by malware. It not only allows the derivation of significant conclusions but also enhances the understanding of malware behaviour.
The significance of Windows APIs in detecting malicious activities is evident from their extensive use by malware, as highlighted in our findings. Approximately 22% of the analysed malware samples interact with Windows operating systems through these APIs, often for system manipulation and exploiting vulnerabilities. This high usage indicates that monitoring Windows API calls can be a critical strategy in identifying and mitigating malware threats. The prevalence of file system (12%) and registry manipulation (8.2%) APIs highlights the malware’s tendency to modify files and system registries. This behaviour is typically aimed at establishing persistence and concealing the malware’s presence, making these APIs crucial targets for monitoring. The security and identity category (10%) reveals that a significant portion of malware focuses on bypassing security measures. This finding suggests the need for enhanced security protocols and vigilant monitoring of these APIs to prevent unauthorised access and identity theft. The socket category (8%) highlights the malware’s focus on establishing network connections for communication or data transfer. It shows the importance of monitoring network activities to detect and thwart malware that relies on network sockets for malicious operations. The Internet category (8%) reflects the malware’s engagement in online activities, possibly for data exfiltration, command and control communication, or downloading additional payloads. Vigilant monitoring of these APIs can be instrumental in identifying and mitigating these threats. The kernel category (5%) is indicative of the malware’s attempts to access and manipulate core system functions. This level of interaction suggests a sophisticated approach to control system operations at a fundamental level, making these APIs critical points for monitoring to prevent deep system intrusions. The threading category (5%) indicates that malware often employs multithreading techniques to evade detection and enhance efficiency. This insight can guide the development of more sophisticated detection mechanisms that can identify and counteract such evasive tactics. The error handling category (3%) demonstrates the malware’s capability to manage errors effectively, indicating a level of sophistication in maintaining operational stability while avoiding detection. This aspect highlights the need for advanced analytical tools capable of discerning such subtle operational patterns. The data access and storage category (3%) reflects the malware’s operations related to data manipulation and storage, often for harmful purposes like data exfiltration or encryption. Monitoring these APIs can help in the early detection of data breaches and unauthorised encryption activities. The other categories, like menu and resources (3.3%), developer (2.2%), COM/OLE/DDE (2%), networking (1%), shell (1%), process (1%), input/output (1%), remote procedure call (0.4%), and DNS (0.4%), each contribute to the malware’s diverse functionalities. These include propagation, evasion, orchestrating distributed attacks, and maintaining persistence. This detailed analysis of Windows APIs and their categorisation offers valuable insights into malware behaviour. This knowledge can improve cyber-security measures by helping to develop more effective detection and mitigation strategies against evolving threats employed by malware.
4.3.2. Top 50 API Categories According to Chi2 and Gini
The analysis of Windows API preferences and prioritisation using the Chi2 and Gini methods, as depicted in Figure 2, offers significant insights into the detection of malicious activities and contributes to the understanding of cyber threats. The Chi2 method focuses on internal system aspects, particularly security and file system interactions, while giving no weight to thread management, networking, or component object model APIs; this emphasis indicates the perceived importance of the former in malware detection. The method's concentration on these specific areas suggests a strategic focus on the internal system vulnerabilities and manipulations commonly exploited by malware. On the other hand, the Gini method presents a more balanced and comprehensive perspective, encompassing a broader range of API categories. This method's allocation across system management, registry operations, kernel management, system information, thread management, networking, process management, and error handling indicates its recognition of the multifaceted nature of malware. By considering both internal system processes and external communications, the Gini method acknowledges the diverse tactics employed by malware, from system infiltration to network-based activities. The distinct approaches of these two methods in prioritising Windows APIs underscore the complexity of malware detection. The Chi2 method's focused approach is beneficial for honing in on specific system vulnerabilities, while the Gini method's comprehensive coverage aids in recognising a wider array of malicious behaviours. Together, these results contribute significantly to cyber-security efforts by providing a nuanced understanding of how different aspects of Windows APIs are exploited by malware, thereby informing more targeted and effective defensive strategies against evolving cyber threats.
4.4. Feature Scoring
Table 5 provides an insightful comparison of the top 10 features obtained by the Chi2 and Gini selection models and their relevance in malware detection. Each API, based on its scoring and categorisation, plays a unique role in identifying and classifying malware behaviour. In the comparison, NtAllocateVirtualMemory and NtProtectVirtualMemory emerge as key APIs, with Chi2 assigning them scores of 10.18% and 9.70% and Gini assigning 6.81% and 7.86%, respectively, indicating their significant role in kernel management. Involvement in memory allocation and protection is a common characteristic of advanced malware, which often manipulates memory to execute malicious code or evade detection. Their high relevance in both models shows their critical role in identifying malware that interacts with system memory. LdrGetProcedureAddress and NtFreeVirtualMemory, categorised under system and kernel management, respectively, show notable differences in their Chi2 and Gini scores, reflecting their varied impact in these specific categories. This disparity indicates their varying impact on malware operations, such as loading procedures and freeing memory resources, which are essential for understanding malware’s interaction with a system’s core processes. Other APIs, such as FindResourceExA, NtClose, and GetSystemTimeAsFileTime, despite having lower Gini scores, are still significant in their respective categories of resource, kernel, and system information management. They are involved in resource location, system resource management, and time-based operations, respectively. Their inclusion, even with lower scores, highlights their role in specific malware activities, such as resource exploitation and time manipulation. The lowest-scoring features, NtDelayExecution and NtDeviceIoControlFile, categorised under system and file system management, were still recognised within the Chi2’s and Gini’s top 50 rankings, highlighting their relevance in these specific API categories. NtDelayExecution is involved in delaying execution, a tactic used by malware to avoid detection during analysis, while NtDeviceIoControlFile deals with device I/O operations, which can be critical in understanding malware’s interaction with hardware components. The strategic selection and scoring of these APIs contribute significantly to the effectiveness of malware classification systems. By understanding the specific roles and impacts of these APIs, malware detection systems can be tailored to recognise and respond to a wide range of malicious activities.
4.5. Evaluation of Machine Learning Models
This section discusses the evaluation of the six ML models employed in this research, focusing on performance metrics, as illustrated in Table 6.
4.5.1. Performance Evaluation
Based on the evaluation presented in Table 6, the performance of the different ML models was assessed. The results of each implemented model are discussed as follows: The LR model exhibited a strong overall performance, although its accuracy of 0.88 and F1-score of 0.92 were lower than those of RF and XGB. It maintained an excellent balance between precision (0.92) and recall (0.92), with a respectable AUC of 0.93 and good specificity at 0.80. SVM showed decent accuracy at 0.72 but fell short in terms of its F1-score of 0.67, indicating challenges in balancing precision (0.98) and recall (0.51); its AUC was reasonable at 0.88, while its specificity was notably high at 0.98. NB performed reasonably well, with an accuracy of 0.84 and an F1-score of 0.83; it exhibited good precision of 0.96 but a slightly lower recall of 0.73, with an AUC of 0.84 and solid specificity of 0.96. RF outperformed the other implemented models with the highest accuracy of 0.96 and demonstrated impressive results across the other metrics, achieving a high F1-score of 0.96, indicating a well-balanced trade-off between precision (0.99) and recall (0.93); additionally, it boasted a substantial AUC (0.98) and maintained a noteworthy specificity of 0.96. XGB also demonstrated excellent performance in terms of classification accuracy (0.93), with an F1-score of 0.96, indicating its ability to balance precision and recall (0.96); furthermore, XGB showcased strong discriminatory power, as evidenced by its AUC of 0.97 and specificity of 0.95. KNN exhibited a decent accuracy of 0.92 and an F1-score of 0.91, providing a reasonable balance between a precision of 0.97 and a recall of 0.87, with an AUC of 0.93 and a specificity of 0.89.
4.5.2. Receiver Operating Characteristic (ROC)
The ROC curves (Figure 3), generated using the Python matplotlib library (Table A1), depict the performance of the various ML models in distinguishing between malware and goodware samples. It is important to note that the NB model follows a slightly different approach compared to the other models, as a predict_proba method was not available for it in the Scikit-Learn setup used (Table A1). To accommodate this distinction, the prediction method of the Naive Bayes classifier was employed instead. This resulted in two ROC curves, as shown in Figure 3: one dedicated to the Naive Bayes classifier and the other representing the remaining ML models. Examining the ROC curves reveals that the RF and XGB models performed exceptionally well, indicating their effectiveness in achieving high true-positive rates while maintaining low false-positive rates. The KNN model also demonstrated a favourable ROC curve, indicating strong discriminatory power. On the other hand, the LR, SVM, and NB models showed relatively lower ROC curves compared to RF, XGB, and KNN, implying a relatively lower ability to accurately classify instances from both classes.
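A matplotlib sketch of how such curves can be produced, falling back to hard predictions when a model exposes no probability estimates (the approach the text describes for NB); the models dictionary and X_test/y_test from the previous sketches are assumed.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

plt.figure()
for name, model in models.items():
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X_test)[:, 1]  # probability of the malware class
    else:
        scores = model.predict(X_test)  # hard labels when no probabilities are exposed
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--")  # chance diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```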
4.5.3. Confusion Matrix and Error Analysis
Table 7 elaborates on the performance of the ML models used in this study. These confusion matrices provide a comprehensive breakdown of how each utilised model performed in classifying instances correctly and incorrectly.
The error analysis focused on analysing misclassifications and identifying common patterns in malware classification. LR and SVM showed 3 false positives and 13 false negatives. They appeared to struggle more with the negative class, resulting in lower precision for the positive class. Nevertheless, both maintained a relatively high recall for the positive class, indicating their ability to capture true malware instances. NB showed four false positives and eight false negatives, indicating higher precision but lower recall for the positive class, as it tended to classify fewer instances as positive when they were actually positive. However, RF and XGB demonstrated strong performance. RF had only one false positive and seven false negatives, whereas XGB, with only two false positives and seven false negatives, displayed good precision and recall for both classes, indicating a balanced performance. Both models excelled in minimising false positives while effectively capturing true positives. KNN, on the other hand, had 4 false positives and 23 false negatives, which demonstrates reasonable precision and lower recall for both classes. KNN had a slightly higher false-negative rate, implying that it sometimes misses true malware instances. Among these models, XGB and RF demonstrated the highest overall performance. They achieved remarkable accuracy, precision, and recall for both malware and goodware samples, outperforming other models.
5. Comparative Analysis with Existing Approaches
This section provides a comparative analysis using related existing studies to highlight the distinguishing contribution of this research. The comparison is based on the utilisation of API calls, feature selections, scoring methods, and implemented ML techniques. An overview of these studies follows.
Feature Selection and Scoring: Table 8 shows that the studies by Yazi et al. [14], Pektacs et al. [20], Eskandari et al. [41], Catak et al. [52], and Daeef et al. [27] primarily focused on feature selection methods, such as feature vectors and N-grams, and did not emphasise feature scoring. Most of these studies also incorporated malware labelling, while Pektacs et al. [20] additionally integrated API categorisation. However, our research distinctly incorporates both Chi2 and Gini Importance for an effective reduction of the feature space and stands out for its inclusion of feature scoring. This approach, combined with the integration of API categorisation and malware labelling, makes its methodology more comprehensive than the others in malware detection and classification.
Enhanced F1-Score: As illustrated in Table 9, although Pektacs et al. [20] and Eskandari et al. [41] achieved impressive accuracy scores of 98%, our research went a step further by improving the F1-score by 2%. This improvement is significant, as it indicates a better balance between recall and precision in the implemented models. The enhanced F1-score demonstrates our model's ability to accurately classify malware without compromising either recall or precision, which is crucial for the reliability of malware detection systems.
Comprehensive Model Evaluation: Table 9 highlights that our study also considered both the area under the curve (AUC) and specificity metrics for each ML model used. These metrics are vital for a thorough evaluation of a model's performance, providing a more rounded assessment of its effectiveness in malware classification. The inclusion of AUC and specificity offers a deeper understanding of the model's true-positive versus false-positive rates, an aspect not covered in the other compared studies. The combination of detailed feature scoring, improved F1-scores, and the inclusion of comprehensive evaluation metrics like AUC and specificity positions our research as a more balanced and holistic approach to malware classification. By addressing the limitations found in previous studies and introducing these comprehensive methods, this paper enhances the reliability and depth of malware analysis.
7. Conclusions
In today's digital age, the threat of malware is ever-present and constantly on the rise. It can wreak havoc on computer systems, compromising personal and sensitive information. Therefore, it is crucial to take the necessary precautions to protect assets against such threats. This paper tackled the significant challenge of malware threats through the generation of a novel dataset from the MalwareBazaar repository. The curated dataset included 582 malware and 438 goodware samples from Windows PE files, which served as a foundation for our dynamic analysis and classification framework. For evaluation, a comprehensive five-stage approach was adopted: generating a tailored dataset, labelling malware samples, categorising APIs, extracting and scoring pivotal features using Chi2 and Gini Importance, and applying six cutting-edge ML models. The results highlighted that the RF model demonstrated superior performance, with an impressive precision rate of 99% and an accuracy of 96%. The RF model's AUC stood at 98%, and it achieved an F1-score of 96%, indicating a highly effective balance between precision and recall. These results were further supported by a TPR of 0.93 and an exceptionally low FPR of 0.0098, marking it as the most reliable among the evaluated models. The study revealed that Trojans and ransomware, constituting 27% and 22%, respectively, were predominant among the 11 analysed malware families, indicating that more sophisticated strategies tailored to these families are needed. In API categorisation, Windows APIs (22%) and APIs related to the file system (12%) and registry manipulation (8.2%) showcased their importance in detecting malicious activity. This high API usage points to the need for monitoring when developing targeted security solutions, since tracking unauthorised or suspicious changes can provide early warning signs of malware infection. Our research distinguished itself by adopting a dual approach to feature reduction and scoring, achieving an F1-score improved by 2%, and including AUC and specificity metrics, aspects not comprehensively addressed in previous studies. The curated dataset is now publicly available for further research and lays a solid groundwork for future studies on malware evolution prediction.