A Neural Network Approach to a Grayscale Image-Based Multi-File Type Malware Detection System

Abigail Copiaco; Leena El Neel; Tasnim Nazzal; Husameldin Mukhtar; Walid Obaid

doi:10.3390/app132312888


College of Engineering and Information Technology, University of Dubai, Dubai 14143, United Arab Emirates

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(23), 12888; https://doi.org/10.3390/app132312888

This article belongs to the Section Computing and Artificial Intelligence


Abstract

This study introduces an innovative all-in-one malware identification model that significantly enhances convenience and resource efficiency in classifying malware across diverse file types. Traditional malware identification methods involve the extraction of static and dynamic features, followed by comparisons with signature-based databases or machine learning-based classifiers. However, many malware detection applications that rely on transfer learning and image transformation suffer from excessive resource consumption. In recent years, transfer learning has emerged as a powerful tool for developing effective classifiers, leveraging pre-trained neural network models. In this research, we comprehensively explore various pre-trained network architectures, including compact and conventional networks, as well as series and directed acyclic graph configurations for malware classification. Our approach utilizes grayscale transform-based features as a standardized set of characteristics, streamlining malware classification across various file types. To ensure the robustness and generalization of our classification models, we integrate multiple datasets into the training process. Remarkably, we achieve an optimal model with 96% accuracy, while maintaining a modest 5 MB size using the SqueezeNet classifier. Overall, our model efficiently classifies malware across file types, reducing the computational load, which can be useful for cybersecurity professionals and organizations.

Keywords: neural network; transfer learning; malware detection; grayscale; portable executable; PDF; MS Word; artificial intelligence; deep learning

1. Introduction

The 21st century has witnessed an increasing number of reported malware incidents, causing massive information and financial losses at both personal and corporate scales [1]. The last few decades have seen an increase not only in the prevalence of malware but also in its functionality and impact; malware has evolved from the self-replicating, harmless software known as worms in the 1980s to ransomware, which is capable of disabling device operations, in the most recent decade. Malware developers keep evolving their techniques, creating complex malware that targets new technologies and avoids detection through obfuscation and encryption [2].

Malware analysis traditionally follows either a static or a dynamic approach, depending on the requirements and goals of the analysis. In static analysis, features are extracted without executing the binary code, using tools or open-source libraries. It examines malware using the original structural data of executable files, such as the byte sequence, which avoids the need to run the executable file. In turn, this also avoids potentially harming the operating system and exposing user data. Nevertheless, detecting unknown malware remains challenging because the typical static technique frequently relies on a sizable malware database. Additionally, it depends on intricate analyses of software code and skilled extraction techniques.

Data mining for malware detection was initially introduced by Schultz et al. [3]. For malware identification, they made use of static properties like strings, byte sequences, and portable executable (PE) files. Following that, Krügel et al. introduced a unique fingerprinting method for identifying polymorphic worms based on structural data from a binary code’s control flow graph (CFG) [4]. Malware detection methods were also examined and compared [5]. Finally, an improved version of CFG-based malware detection was proposed by Nguyen et al. [6]. The authors attempted to bridge the gap between formal approaches that characterize malware activities and deep learning technology.

Static analysis techniques have become increasingly popular in recent years for locating malware. To leverage machine learning techniques for identifying malicious applications, a combination of permission requests with application programming interface (API) requests was recommended [7]. This approach can be used to assess and categorize unidentified application packages. Visual analysis approaches have recently been put forth, which is a significant advancement in the detection of malware. Visual analysis offers greater detection accuracy and requires fewer features than the conventional static analysis method [8]. In addition to inheriting the benefits of conventional malware detection technologies, visualization-based techniques can extract high-dimensional inherent information from data samples. The idea of turning malware into grayscale images was first proposed in [9]: the bytes of a given malware file are read as integers, which can then be viewed as the pixels of an image.

Conducting a more thorough exploration of the subject, we explore state-of-the-art malware classification techniques, such as that proposed by Ni et al. [10]. Their approach targeted malware families obtained from images of malware codes with high accuracy. Another classification mechanism was developed by Qiao et al. [8], which combines malware records with byte information. Further, a design was created in [11]. The design used deep convolutional neural networks and achieved 98.47% precision. A system was proposed in [12] that combines a recurrent neural network (RNN) and a CNN. It achieved good results on a single dataset.

Analysis tools have also gained popularity over the past years. A malware analysis tool was introduced by [13]. The tool utilizes an environment to check malware samples and their corresponding system calls. A strategy investigated by [14] targeted malware behavior by focusing on system information exchange. Additionally, architecture proposed in [15] emphasized categorization and utilized normal and harmful software analysis. Additional malware recognition approaches such as [16] were investigated, which concentrated on the behavioral traits of malware. Similarly, a technique proposed by [17] utilized invariant modeling to obtain a representation of the graph.

In recent years, the incorporation of artificial intelligence and machine learning algorithms has gained popularity for malware detection applications. Smmarwar et al. have explored the efficacy of a hybrid approach utilizing convolutional neural networks (CNN) and long short-term memory (LSTM) for malware classification, demonstrating promising results [18]. Concurrently, neural network-based methodologies have gained prominence in the context of Android malware detection, as exemplified in the studies by Ullah et al. [19] and Mahindru and Sangal [20].

Given the aforementioned literature analysis, it is clear that using artificial intelligence and machine learning to detect malware has grown in popularity and demonstrated benefits. Positive effects have also been seen through visual depiction. Nonetheless, it is crucial to remember that various file formats still call for various properties in order to identify malware. Given this, the following is a summary of the work’s goals and objectives:

To identify a universal image transformation method that works across multiple file formats, facilitating the development of a unified model with simplified resource requirements for detecting malware across multiple formats.
To develop a feature extraction approach that captures vital underlying data information from files.
To conduct a comprehensive assessment of various neural network models, including conventional and compact networks, for effective transfer learning in the specified application.
To establish a foundation for future research expansion to encompass additional file formats, such as audio (MP3) and video, broadening the scope of the study.

Hence, it can be inferred that the proposed approach differs from existing methods by adopting a unified model update and retraining process, avoiding the need to retrain each file type separately. This approach streamlines the workflow, reduces redundancy, and conserves resources, resulting in more convincing results. Its suitability for malicious file identification stems from its ability to efficiently address diverse requirements, offer scalability, and provide cost-effective solutions.

Our study examines different pre-trained compact and regular networks, in addition to series and directed acyclic graph architectures, in malware classification. Our approach utilizes grayscale transform-based features as a standardized feature set to classify malware across PDFs, Microsoft Office Documents, and Windows Portable Executable files. We incorporate multiple datasets in our study, such as Zenodo, jonaslejon, Clean DOC files, Clean XLS files, Clean PDF files, Dike, and Malimg, in order to evaluate their accuracy.

2. Microservices and Grayscale-Based Image Transform Features

2.1. File Type Microservices

2.1.1. Portable Executable (PE) Files

The majority of Windows OS malware is distributed in the Portable Executable (PE) format, typically with the .exe extension [21]. Figure 1 illustrates the main components of the PE file format.

Figure 1. PE file format [2].

Extracting static features from PE files involves understanding the file structure. These features can include things like opcodes (instructions), n-grams (sequences of instructions), hashes, and information about the PE file itself. Malware often hides within Windows software, so analyzing PE files is crucial to cybersecurity. Static analysis involves disassembling the software, checking the PE header and sections, and extracting information like text strings and dynamic linked libraries (DLLs) [2,21]. Dynamic analysis, on the other hand, requires running the software in a controlled environment to observe its behavior. Researchers often use a combination of these approaches based on their needs.
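As a minimal illustration of checking the PE header, the sketch below tests whether a byte buffer carries the two magic values that mark a PE file: the `MZ` DOS magic at offset 0 and the `PE\0\0` signature at the offset stored in the 4-byte field at 0x3C. The function names `looks_like_pe` and `build_minimal_stub` are hypothetical and not taken from the paper; the stub is a synthetic buffer used only for demonstration, not a runnable program.

```python
import struct


def looks_like_pe(data: bytes) -> bool:
    """Return True if `data` has a DOS header that points to a PE signature."""
    # A PE file begins with the two-byte DOS magic "MZ".
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # The 4-byte little-endian field at offset 0x3C (e_lfanew) stores the
    # file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"


def build_minimal_stub() -> bytes:
    """Build a synthetic buffer with valid magic values (assumption: demo only)."""
    header = bytearray(0x40)
    header[:2] = b"MZ"
    struct.pack_into("<I", header, 0x3C, 0x40)  # signature directly after header
    return bytes(header) + b"PE\x00\x00"


if __name__ == "__main__":
    print(looks_like_pe(build_minimal_stub()))
    print(looks_like_pe(b"%PDF-1.7"))
```

A check of this kind only confirms the container format; it says nothing about whether the file is malicious.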

Static analysis mainly focuses on examining the assembly code of PE files, but it can be challenging to extract meaningful information, especially from obfuscated malware that intentionally tries to hide its true purpose [22]. To tackle these challenges, recent research has explored analyzing the binary representation of the code using techniques like image processing [9,23,24,25] and signal processing [26]. These methods convert raw executable binaries into grayscale images with different visual patterns and textures. Neural networks are then used to learn features from these images and classify them to detect malware and identify their specific families [9,23,24,25]. This approach enhances the malware detection process by making it more effective and resilient against obfuscation techniques.

2.1.2. Microsoft Office (MS) Documents

According to a study conducted by Cisco from January to September 2017, Microsoft Office documents accounted for 38% of the globally identified malicious file types [27]. These documents are commonly used by end users in both corporate and personal contexts, making them susceptible to malicious activities.

There are three distinct pathways by which malware can infect the victim’s system through Microsoft Office documents. First, the victim can be infected through the exploitation of features. When new features are introduced to users as the software becomes outdated, it can inadvertently create vulnerabilities in the software, which can provide opportunities for attackers to gain unauthorized access to the systems. Secondly, malware can infiltrate systems through macros. Macros are defined by a set of rules that describe how the input instructions are executed in order to perform a particular task. Although they have advantages with regards to automation, malicious code can be embedded within them, which allows malware to infect the system. Finally, attackers also target vulnerabilities arising from the absence of software updates. As software evolves, developers often release updates for security bugs. However, not all users promptly update their software, leaving their systems susceptible to known security flaws. Attackers take advantage of this by targeting systems that have not been updated, using known vulnerabilities to gain access and introduce malware [28].

2.1.3. Portable Document Format (PDF) Files

Portable Document Format (PDF) files serve as another common vector for malware. As reported by Cisco [27], PDF files were involved in 14% of malware attacks from January to September 2017, the third-highest share after Microsoft Office documents and archive files. Additionally, VirusTotal [29] documented receiving 380,453 submissions of PDF files between late August and early September 2018. This volume ranked among the highest across file type categories [28].

Understanding the structure of PDF files is key to grasping their potential vulnerabilities. The foundation starts with the header, which indicates the PDF version being used. The body of the PDF encompasses a variety of elements including images, security settings, animations, text content, and fonts, which collectively shape the document’s content. An essential component is the cross-reference table, which allows for updates to the document, enhancing manageability. The trailer then serves as a guide for document rendering in viewers, marked by tags [28].
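The layout described above can be probed with a few byte-level checks. The sketch below is an illustration only: it reads the version from the `%PDF-x.y` header and looks for the `xref`, `trailer`, and `%%EOF` keywords that appear in PDFs using a classic cross-reference table. The helper names `pdf_version` and `pdf_markers` are hypothetical, and files that store their cross-reference data in streams may not contain the `trailer` keyword at all.

```python
import re


def pdf_version(data: bytes):
    """Return the version string from a header such as b'%PDF-1.7', or None."""
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    return match.group(1).decode("ascii") if match else None


def pdf_markers(data: bytes) -> dict:
    """Report which structural keywords are present in the raw bytes."""
    return {
        "header": data.startswith(b"%PDF-"),
        "xref": b"xref" in data,        # cross-reference table (also matches startxref)
        "trailer": b"trailer" in data,  # trailer dictionary keyword
        "eof": b"%%EOF" in data,        # end-of-file marker
    }


if __name__ == "__main__":
    # Synthetic, heavily simplified document used only for demonstration.
    sample = (
        b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
        b"xref\n0 2\ntrailer\n<< /Size 2 >>\nstartxref\n9\n%%EOF\n"
    )
    print(pdf_version(sample), pdf_markers(sample))
```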

PDF files accommodate a range of content forms, such as booleans, strings, arrays, dictionaries, and many more. However, this complexity comes with risks. PDF documents can be exploited by injecting harmful scripts, often concealed as obfuscated strings, or dictionary items. This approach aims to deceive unsuspecting readers, enabling attackers to execute their malicious intent undetected [28]. Overall, the prevalence, structure intricacies, and vulnerabilities of PDF files highlight the need for vigilance in digital security in this category.

2.2. Grayscale-Based Transform Features

Researchers have recently used visual representations of malicious samples to build malware classifiers that extract image-based features from color or grayscale images. Texture analysis was used to determine malware families, based on the observation that samples showing blocks of similar texture or patterns can belong to the same malware family [9]. Additionally, grayscale images can be fed directly to neural networks for feature learning and classification rather than going through feature engineering processes that are difficult and time-consuming [2]. Static analysis based on grayscale images also has the advantage of analyzing the samples without the need to unpack them [24]. This means that the static image-based approach might offer an advantage in detecting obfuscated malware over other static analysis techniques. The difference in texture that is visualized in the grayscale image represents the different structure of the file’s header and contains valuable information about its contents. Researchers have exploited the file’s binary content by converting it into other formats so that other processing techniques can be applied to classify the file. Figure 2 shows how PE file sections appear in a grayscale image representation of malicious samples.

Figure 2. Representation of PE file sections in grayscale image of malicious samples [2].

The authors of [9] converted the executable binary into eight-bit vectors and then reshaped them into a 2D array, visualized as a grayscale image with a pixel value range of [0–255]. The “Gist” image feature, which relies on the texture of the image, was applied to the resulting grayscale image to extract a feature vector, which was then used to classify malicious samples into their respective malware families based on their similarities. Using “Gist” features with a KNN classifier gave an accuracy of 99.2%. Bensaoud et al. [23] also used grayscale image representations of malware generated from eight-bit vectors to detect twenty-five different malware kinds using six different deep learning models. Their experiment showed that the Inception-V3 model gives the highest overall accuracy in classifying samples according to their type. The authors of [25] used a binary representation of a PE file at the byte level and reported a high accuracy of 94.5% using a FractalNet DLNN. They argued, however, that this 2D representation of sequential code is subject to a fixed width, which affects the structure of the machine code by breaking it into different rows; hence, the width should be selected in a way that preserves the sequence of the code during the conversion process. Noever and Noever [30] followed a different approach to generating grayscale images by using an intermediate NetPBM text format and PGM to convert a raw PE section header saved in CSV format into raw ASCII images and then using the ImageMagick tool to convert these into viewable images in JPG format. A CNN classifier showed an overall accuracy of 80% in a multi-level classification of samples into seven different classes (one benign and six different malware types) with a train–test ratio of 85:15 [30].

In [24], raw binary bits of executable files were grouped together to represent a corresponding shade from a grayscale spectrum and were used with a CNN classifier, showing an accuracy of 96% in binary malware classification and 92.3% in multi-level malware family classification. Almomani [31] developed a vision-based framework in which samples were converted into grayscale images and the Adaboost algorithm was used to achieve 79% accuracy. Ghanei et al. [32] made use of the fact that grayscale images tend to produce highly accurate classifiers when used with CNNs to build a novel approach that relies on dynamic features. Malware samples were executed in a dynamic environment, and behavior was captured in different time frames ranging from 5 to 30 seconds, such that hardware-based features were collected at the time of their occurrence via a VTune tool. Those features were normalized and placed into a 1D vector, padded with zeros, and converted into a 2D vector, represented by a grayscale image of size 16 × 16, which was fed into a neural network. A voting network system based on CNN and LSTM gave an accuracy of 91.63%.

The majority of malware detection solutions are based on static analysis and dynamic analysis, and these methods exhibit an optimal rate of detecting known malware. However, these methods have limitations in detecting advanced malware, such as encrypted, obscured, or zero-day malware [33]. In [34], the author proposed a novel approach based on structural feature extraction methodology (SFEM). It is based on machine learning for XML-based document malware detection. The analysis was conducted on 830 malicious files and 16,180 benign files. The results showed that the SFEM can achieve a TPR of 97%. In [33], the author proposed a new approach, including data extraction and image conversion, to distinguish malicious MS-DOC files from benign ones using three CNN models. The experimental results showed a detection rate of 94.09% on the test samples.

In [35], the author proposed a technique for PDF malware detection using image processing. The PDF files were converted to grayscale images through image visualization methods. Features were then extracted for malicious and benign PDFs. Finally, learning algorithms were applied to classify the files as benign or malicious. Optimum results were achieved using a byte plot with a Gabor filter and RF, which resulted in a 99.48% F1 score. In [36], the author used a PDF malware detection technique based on malware visualization and image classification. VGG19 with CNN architecture was used to train the model, which resulted in an accuracy of 97.3% and an F1 score of 97.5%.

Based on the gaps identified in this review, the proposed work is based on a single CNN model, which will be trained to detect three file types: MS, PE, and PDF. This presents an advantage over existing works in terms of complexity and the range of file types detected. Aside from this, our detailed experimentation aims to improve the overall accuracy as well. In addition, we aim to detect encrypted malware in MS documents, which is considered a limitation of machine learning methods.

2.3. Pre-Trained Neural Network Models

The convolutional neural network (CNN) has been a popular neural network type for pre-trained networks in recent years, given that it utilizes multiple convolution stages for classification [37]. Instead of connecting all previous neurons to the subsequent ones, as in a traditional fully connected model, only a portion of the neurons in the preceding layer are linked to the following layer, which reduces both computational and memory demands. At the moment, there are several models available for classification tasks utilizing transfer learning, which involves repurposing the weights and structure of a previously trained model for a related but distinct task [38]. The utilization of transfer learning reduces the time and resources required for building and training the model [39]. Pre-trained CNN models can vary in terms of architecture, which could be either a directed acyclic graph (DAG) or a series. DAG networks are characterized by their complex architecture, which enables the output of one layer to be used as input for multiple other layers. Although this type of architecture is more complex and limits the overall flexibility of the neural network for future modifications, it offers advantages in reducing the overall model size upon export [40]. The series architecture, in contrast, involves a one-input-one-output mechanism. Due to the simpler connectivity, this type of architecture often results in a larger model. Nonetheless, it offers more flexibility, particularly for hyper-parameter variations and model fine-tuning applications [40].

Given the pros and cons of these network architectures, the sizes of pre-trained networks upon export could also be categorized as regular (mostly between 100 and 600 MB) or compact (between 1 and 99 MB). Regardless, these networks are all trained on vast databases, and their weights are saved for subsequent transfer learning purposes. A summary of the fundamental features of various pre-trained networks is presented in Table 1.

Table 1. Summary of the comparison between pre-trained CNN models.

3. Methodology

After a comprehensive evaluation of various feature extraction and neural network techniques, a unified methodology was formulated to classify malware of different file types. This section provides a detailed description of the datasets used, feature engineering, and training process. It is worth noting that this work is a continuation of our prior research, including [53], which focuses on the categorization of malware within PE files, and [54], which concentrates on the distinction of malicious images.

3.1. Datasets

Several datasets were utilized in this work to ensure that the developed model possesses good generalization properties. Consequently, we combined multiple datasets comprising malicious and benign files in MS Word documents, PDFs, and Windows PE files. The train-test split ratio that was adopted is 70 to 30. The used dataset in this work is summarized in Table 2. VirusTotal [29] was used to validate and add confidence levels to the obtained results.
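A 70 to 30 split of this kind could be drawn as in the sketch below. It is an illustration under assumptions the text does not state: the split is stratified per class (so that an imbalanced benign/malicious ratio is preserved in both partitions), the shuffle is seeded for reproducibility, and the function name `stratified_split` is hypothetical.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_indices, test_indices), split separately within each class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


if __name__ == "__main__":
    # Hypothetical imbalanced label list: 10 benign and 20 malicious samples.
    labels = ["benign"] * 10 + ["malicious"] * 20
    train_idx, test_idx = stratified_split(labels)
    print(len(train_idx), len(test_idx))
```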

Table 2. Summary of Datasets.

3.2. Feature Extraction Using Grayscale-Based Image Transforms

Expanding upon our prior research efforts documented in [53,54], it is evident that the utilization of image-based features bestows advantages in the detection and categorization of malware within image and PE file formats. Hence, this study extends the previous approach by incorporating image transforms to develop a comprehensive classifier that is capable of detecting malicious files across different file types, including PDFs, Microsoft Office documents, and PE files.

Feature extraction was conducted using a common approach found in the literature that converts the PE file’s features into a grayscale image. This involved converting raw binary codes into eight-bit unsigned vectors and reshaping them into a 2D array. The resulting grayscale image was then resized using bicubic interpolation to match the input size requirements of the pre-trained models. Additionally, we observed that the bit conversion approach is compatible with various file types, allowing the creation of a uniform processing pipeline for different formats. Figure 3 illustrates the adopted approach for converting files into grayscale images. The mathematical representation of the steps is also outlined below.

Figure 3. Feature extraction process of grayscale image for all file types.

Step 1: Binary Data Representation:

The binary data from the input file are represented as a 2D binary array, denoted as $B$, with dimensions $M \times N$, where $B(i, j)$ corresponds to the binary value at row $i$ and column $j$.

Step 2: Conversion to Numerical Array:

A mapping function $f(B(i, j))$ is applied to convert the binary array $B$ into a numerical array $A$ with the same dimensions ($M \times N$). This mapping may include translating values into the range 0 to 255, following the grayscale encoding scheme. The process can be mathematically expressed as:

$$A(i, j) = f(B(i, j)), \quad \text{for } i \in [1, M] \text{ and } j \in [1, N].$$

Step 3: Grayscale Image Generation:

The numerical array $A$ is used to generate grayscale images, denoted as $G$, with identical dimensions ($M \times N$). Each numerical value in $A$ is directly translated into a pixel intensity value for the corresponding location in $G$. The mapping is typically such that pixel intensities range from 0 (representing black) to 255 (representing white) and can be expressed as

$$G(i, j) = A(i, j), \quad \text{for } i \in [1, M] \text{ and } j \in [1, N].$$

We used an open-source Python code for the grayscale image generation process [59]. Grayscale image samples are shown in Figure 4 and Figure 5, displaying benign and malicious samples of Microsoft Word and PDF files, respectively.
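The conversion steps can be sketched in a few lines of standard-library Python. This is an illustrative sketch and not the open-source code cited in [59]: the fixed row width of 256, the zero-padding of the last row, and the function names `bytes_to_grayscale` and `to_pgm` are all assumptions made here. Each byte of the file is read as an unsigned eight-bit integer and used directly as a pixel intensity; the result is serialized as a binary PGM (NetPBM `P5`) image. Resizing to a network's input size is not covered.

```python
import math


def bytes_to_grayscale(data: bytes, width: int = 256):
    """Map raw bytes to a 2D list of pixel intensities in the range 0..255."""
    if width <= 0:
        raise ValueError("width must be positive")
    height = max(1, math.ceil(len(data) / width))
    # Assumption: the final row is padded with zeros (black pixels).
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


def to_pgm(image) -> bytes:
    """Serialize a 2D intensity list as a binary PGM (NetPBM P5) image."""
    height, width = len(image), len(image[0])
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(pixel for row in image for pixel in row)


if __name__ == "__main__":
    # Five arbitrary bytes stand in for the contents of a file.
    image = bytes_to_grayscale(bytes([0, 255, 16, 32, 64]), width=2)
    print(image)
    print(len(to_pgm(image)))
```

Because the function only looks at raw bytes, the same call works for PE, PDF, and Office files, which is the property the uniform pipeline relies on.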

Figure 4. Representation of benign and malicious MS Office files in grayscale image.

Figure 5. Representation of benign and malicious PDF files in grayscale image.

3.3. Model Training

Neural networks are recognized for their advantages in pattern recognition and data classification. Thus, our experiments are focused on comparing different neural network architectures using transfer learning for the specified application.

The utilization of transfer learning can be justified based on its ability to leverage pre-trained models. This approach allows the model to inherit valuable knowledge and features acquired from large and diverse datasets, which can significantly expedite the training process and enhance performance, especially when the dataset is limited. Further, both compact and regular networks are used and compared because of the benefits of these diverse architectures. Compact networks are particularly advantageous in resource-constrained environments, such as mobile devices and edge computing. They strike a balance between model size and performance, making them efficient choices for specific applications. Additionally, compact networks are often less prone to overfitting, which is a crucial advantage when working with smaller datasets.

Both series and directed acyclic graph (DAG) architectures are also considered and compared due to the flexibility that these structures offer. Series architectures are typically used when data flow sequentially, allowing for straightforward modeling of linear dependencies. In contrast, DAG architectures enable more complex and interconnected relationships between layers, which can capture intricate dependencies and patterns in the data. By considering both options, we are able to compare which architecture works best for the specific task’s requirements, ensuring that the model can effectively capture the dependencies inherent in the data.

Given the diversity of the pre-trained model architectures compared, we use accuracy as the basis for determining the top methodology. The inputs to the models are the extracted grayscale images, which are pre-processed before being fed to the network in order to align with the input requirements of the model. For example, pre-trained models have specific criteria related to image color scales and dimensions. Consequently, all generated feature images needed to be standardized by concatenating them into a three-dimensional format representing the RGB color scale. Further, these images have to be resized as per the model’s requirements, as previously specified in Table 1.
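The pre-processing described above can be illustrated with plain nested lists. Two points in this sketch are assumptions rather than the authors' procedure: the resize uses nearest-neighbor sampling as a simple stand-in for the bicubic interpolation used in the study, and the 4 × 4 target size is a placeholder for a network's actual input size. The function names are hypothetical.

```python
def resize_nearest(gray, out_height, out_width):
    """Resize a 2D intensity list by nearest-neighbor sampling (stand-in for bicubic)."""
    in_height, in_width = len(gray), len(gray[0])
    return [
        [gray[r * in_height // out_height][c * in_width // out_width]
         for c in range(out_width)]
        for r in range(out_height)
    ]


def gray_to_rgb(gray):
    """Replicate each intensity into three identical channels (height x width x 3)."""
    return [[[pixel, pixel, pixel] for pixel in row] for row in gray]


if __name__ == "__main__":
    gray = [[0, 100], [200, 255]]
    resized = resize_nearest(gray, 4, 4)  # placeholder target size
    rgb = gray_to_rgb(resized)
    print(len(rgb), len(rgb[0]), len(rgb[0][0]))
```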

3.4. Performance Evaluation Metrics

Throughout the comparisons, the system’s responses are then observed in the following ways:

Effects on the overall accuracy of regular versus compact neural networks.
Effects on performance when utilizing DAG or series architectures.
Effects on system accuracy when training models for a specific file type and general models for multiple file formats.

To assess the effectiveness of the suggested systems, the following aspects are analyzed:

Comparing the features extracted from single and multiple file types using different models at both the individual and overall level.
Examining the impact of enlarging and merging datasets, as well as how the system responds to imbalanced data.

The training for these image transforms was performed using MATLAB. The models were then exported in the Open Neural Network Exchange (ONNX) format to achieve compatibility with Python for deployment. It should be noted that, when testing the deployed model in Python, the OpenCV library reads images in BGR channel order instead of the usual RGB. Since the model is trained on RGB images, the channel order has to be swapped accordingly before inference.
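The channel reordering amounts to reversing the last axis of each pixel. The sketch below shows this on a nested-list image so that it runs without third-party packages; the function name `bgr_to_rgb` is hypothetical. With OpenCV itself, the same conversion is commonly written as `cv2.cvtColor(image, cv2.COLOR_BGR2RGB)`.

```python
def bgr_to_rgb(image):
    """Reverse the channel order of every pixel in a height x width x 3 image."""
    return [[pixel[::-1] for pixel in row] for row in image]


if __name__ == "__main__":
    # One row with two pixels, stored blue-green-red.
    bgr = [[[255, 0, 0], [0, 128, 64]]]
    print(bgr_to_rgb(bgr))
```

Applying the function twice returns the original image, which makes it easy to verify that the swap was applied exactly once in a deployment pipeline.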

Various measurement criteria were also used to further examine the performance. The evaluation criteria included recall, specificity, precision, Dice score coefficient, overlap between manual and automatic segmentation, accuracy, and the F1 score. It is important to note that a three-fold cross-validation was used consistently throughout the experiments, and the reported results contain the average of these folds.
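For a binary benign/malicious classifier, most of these criteria follow directly from the four cells of a confusion matrix. The sketch below uses the standard definitions and then averages per-fold values, as in a three-fold cross-validation. The counts and fold accuracies are hypothetical and are not results from this study; the function names are illustrative. For binary labels, the Dice coefficient coincides with the F1 score, so it is not computed separately.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard metrics from confusion-matrix counts."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    recall = ratio(tp, tp + fn)          # true positive rate
    specificity = ratio(tn, tn + fp)     # true negative rate
    precision = ratio(tp, tp + fp)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


def average_folds(fold_metrics):
    """Average each metric over a list of per-fold metric dictionaries."""
    keys = fold_metrics[0].keys()
    return {key: sum(m[key] for m in fold_metrics) / len(fold_metrics) for key in keys}


if __name__ == "__main__":
    # Hypothetical counts for a single fold.
    print(binary_metrics(tp=90, fp=10, tn=80, fn=20))
    # Hypothetical accuracies from three folds.
    print(average_folds([{"accuracy": 0.90}, {"accuracy": 0.96}, {"accuracy": 0.99}]))
```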

4. Results

4.1. Comparison of Single-File Type Models

As previously mentioned, this work focuses on the detection of malicious files in three main formats: PDFs, MS Documents, and PE files. Thus, in order to prepare a solid ground for comparison, the pre-trained model AlexNet was trained to detect benign and malicious data for individual file types using the grayscale-based image transform, as per Section 3.2. Accordingly, to represent compact neural network models, GoogleNet was also trained on these data. The findings for these are seen in Table 3.

Table 3. Results summary for individual file type malware detection model.

As presented, slightly higher accuracies are reported for the AlexNet model as opposed to GoogleNet. Nonetheless, both models return adequate results considering that the files contain imbalanced data. Figure 6 displays the relevant confusion matrices for the AlexNet model trainings, which justifies the model’s performance against imbalanced data.

Figure 6. Confusion matrices for AlexNet model trainings using individual file types (L to R): Office, PDF, PE.

It is also important to highlight that, since the datasets for each file format were generated from different sources, the amount of underlying data is different for each. Hence, the accuracies vary across different file types. Notably, the highest accuracy was achieved with PDF files, which also had the largest pool of training and testing files. The accuracy rankings for PE and MS documents also remained consistent with the amount of data available. This suggests a direct correlation between the number of files and enhanced model performance. However, due to the limited availability of data in other file type categories, efforts to equalize the amount of data across the file types were constrained.
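The ordering argument above can be checked directly against the AlexNet rows of Table 3. The snippet below is a small sketch that ranks the file types by dataset size and by accuracy and compares the two rankings; with only three file types it is an ordering check rather than a statistical test, and the variable names are illustrative.

```python
# (number of files, AlexNet accuracy in %) copied from Table 3.
table3_alexnet = {
    "PDF": (19889, 96.73),
    "PE": (9952, 92.60),
    "MS Documents": (5770, 87.44),
}

# Rank file types from largest to smallest dataset, and from highest to lowest accuracy.
by_files = sorted(table3_alexnet, key=lambda name: table3_alexnet[name][0], reverse=True)
by_accuracy = sorted(table3_alexnet, key=lambda name: table3_alexnet[name][1], reverse=True)

# True when more data coincides with higher accuracy for every file type.
same_ranking = by_files == by_accuracy
```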

4.2. Comparison of Multi-File Type Models

Based on the results of the previous experiments, both compact and regular neural network models are suitable for the selected application. Furthermore, the current dataset imbalance was not found to affect the model’s performance. In this section, we investigate the potential benefits of merging datasets so that a single model can identify malicious content in files regardless of format, improving the resilience and generalization of the combined model. The outcomes of these experiments are reported in Table 4, and the corresponding confusion matrices are presented in Figure 7. A total of 80% of the combined data is used for training, while the remaining 20% is used for testing.
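As a consistency check on the 80/20 split, the per-type counts reported in Table 3 can be summed and compared with the combined PDF, MS, PE row of Table 4. The sketch below only does this arithmetic; it does not reproduce how the files were assigned to each partition, which the text does not specify.

```python
# Per-file-type counts copied from Table 3.
table3_counts = {
    "PDF": {"total": 19889, "train": 15912, "test": 3977},
    "MS Documents": {"total": 5770, "train": 4617, "test": 1153},
    "PE": {"total": 9952, "train": 7962, "test": 1990},
}

# Sum each column across file types to obtain the combined dataset sizes.
combined = {
    column: sum(counts[column] for counts in table3_counts.values())
    for column in ("total", "train", "test")
}

# Fraction of the combined data that ends up in the training partition.
train_fraction = combined["train"] / combined["total"]
```

The sums reproduce the combined row of Table 4 (35,611 files, 28,491 for training, 7120 for testing), and the training fraction is 0.80 to two decimal places.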

Table 4. Result summary for multiple file type malware detection model.

Figure 7. Confusion matrices for the model trainings using multiple file types (L to R, T to B): AlexNet, GoogleNet, SqueezeNet, MobileNet-v2, VGG-16.

As presented in Table 4 and Figure 7, the pre-trained models provided adequate and consistent results throughout the experiments. It should also be noted that a three-fold cross-validation technique was conducted in order to verify the robustness of these results. From these findings, it can be deduced that a compact neural network model can be utilized as a preferred solution. This offers high accuracy, which is comparable to what regular-sized models can provide, while also providing advantages in terms of resource requirements. For deployment purposes, having a smaller-sized model allows for a more compact overall size of the application.
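The trade-off between model size and accuracy can be made concrete with the PDF, MS, PE rows of Table 4. The sketch below uses SqueezeNet as the reference and reports, for each model, how many times larger it is and how many percentage points of accuracy it gains. The sizes are the approximate values read from Table 4, and the variable names are illustrative.

```python
# (accuracy in %, approximate size in MB) for the PDF, MS, PE rows of Table 4.
table4 = {
    "AlexNet": (96.88, 227),
    "GoogleNet": (96.39, 24),
    "SqueezeNet": (96.69, 5),
    "MobileNet-v2": (96.74, 13),
    "VGG-16": (97.56, 515),
}

baseline_accuracy, baseline_size = table4["SqueezeNet"]

# For every model: size relative to SqueezeNet and accuracy difference in points.
comparison = {
    name: {
        "size_ratio": size / baseline_size,
        "accuracy_gain": round(accuracy - baseline_accuracy, 2),
    }
    for name, (accuracy, size) in table4.items()
}
```

On these figures, VGG-16 is roughly 100 times larger than SqueezeNet for a gain of 0.87 percentage points, and AlexNet is roughly 45 times larger for a gain of 0.19 points.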

It is worth mentioning that the multi-file type malware detection model has demonstrated an enhancement in overall accuracy. This improvement can be attributed to two key factors. First, the combination of all the data ensures a nearly equal distribution across the categories, as is evident in Table 4. This mitigates any bias toward larger datasets. Additionally, the preeminence of PDF files, which constitute the largest share of samples in the combined model, may contribute to the observed increase in overall accuracy.

Since the main aim of these experiments is to deploy the model in a usable application, we further examined the reliability and performance of the trained model. Beyond the held-out test sets, the trained model was also evaluated on data drawn from sources other than those used for training and testing, such as other related databases. This is detailed in the next sub-section.

4.3. Additional Validation Results

As discussed in Section 3.1, further testing was applied to all models in order to detect any overfitting and to ensure good generalization. A comparison was made between classic static analysis and the image-based approach proposed in this work. The testing datasets used include unseen samples from the Contagio and Dike datasets. Table 5 presents the validation results.

Table 5. Comparison of static analysis model and image-based model for classifying MS files correctly.

Lower levels of accuracy were noted, especially when dealing with PDF files. This may be attributed to the limited number of validation samples available. Additionally, it is worth highlighting that the PDF files used for validation exhibited significant variations in terms of length, number of pages, and content type when compared to the files used for training. Thus, enhancing the model’s robustness to accommodate such variations is a potential area for future research.

Nonetheless, it is important to note that, for the malicious MS documents in the Contagio dataset, the traditional static analysis technique failed to detect any of the 26 samples (Table 5). In contrast, the proposed image-based combined method achieved 88.46% accuracy on the same samples, demonstrating its capability to extract significant underlying data from the samples, which can be valuable for model training.
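Because several validation subsets are small, it helps to translate the percentages in Table 5 back into file counts. The sketch below does this for the image-based column by rounding accuracy times the number of files; the counts are therefore implied values derived from the table, not figures reported by the study, and the helper name `implied_correct` is hypothetical.

```python
# (dataset, class, number of files, image-based accuracy in %) copied from Table 5.
table5_image_based = [
    ("Contagio (MS)", "benign", 2210, 62.8),
    ("Dike (MS)", "benign", 100, 84.0),
    ("Contagio (PDF)", "benign", 87, 35.6),
    ("Contagio (MS)", "malware", 26, 88.46),
    ("Dike (MS)", "malware", 1871, 92.5),
    ("Contagio (PDF)", "malware", 124, 59.67),
]


def implied_correct(n_files, accuracy_percent):
    """Nearest whole number of correctly classified files for a given accuracy."""
    return round(n_files * accuracy_percent / 100)


# Map each (dataset, class) pair to its implied number of correct predictions.
correct = {
    (dataset, label): implied_correct(n_files, accuracy)
    for dataset, label, n_files, accuracy in table5_image_based
}
```

For the malicious Contagio MS documents, 88.46% of 26 files corresponds to 23 correct classifications, so a single file changes the reported accuracy by almost four percentage points.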

4.4. Multi-Level Classification and Future Directions

Given the positive results achieved in the preceding sub-sections, expansion to multi-level malware classification is one of the future directions that can be explored in this work. Furthermore, we could also assess the inclusion of other file formats, such as images, audio, and video. In this section, we present the preliminary research work conducted on the multi-level classification of Portable Executable (PE) files, which is achieved using the Malimg dataset [9]. The Malimg dataset consists of PE files organized into 25 different classes, which are named according to the detailed malware family name. However, due to the limited number of files per category, we conducted this experiment by combining the malware types based on their main malware category, as per Table 6. This approach organized the files into seven main categories: Backdoor, Dialer, PWS, Rogue, Trojan, Trojan Downloader, and Worm.

Table 6. Malimg dataset: main malware family categories [9].
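The seven merged categories are far from balanced, which is useful context when reading the multi-class accuracies. The sketch below computes each category's share of the files listed in Table 6. The majority-class share it reports is only a rough reference point, under the assumption that the evaluation data follow the same distribution.

```python
# Number of files per main malware category, copied from Table 6.
malimg_categories = {
    "Worm": 5854,
    "PWS": 679,
    "Trojan": 760,
    "Dialer": 733,
    "Trojan Downloader": 661,
    "Rogue": 381,
    "Backdoor": 274,
}

total_files = sum(malimg_categories.values())

# Fraction of all files that fall into each category.
shares = {name: count / total_files for name, count in malimg_categories.items()}

# The largest category; always predicting it would score shares[largest].
largest = max(shares, key=shares.get)
```

Worm accounts for about 63% of the files in Table 6, while Backdoor accounts for about 3%, so per-class results in the confusion matrices are more informative than overall accuracy alone.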

The training resulted in an accuracy of approximately 98% for both AlexNet and SqueezeNet pre-trained models, as seen in the confusion matrices presented in Figure 8. Specifically, this resulted in an accuracy of 98.55% for AlexNet and 98.18% for SqueezeNet. Hence, although we are currently limited by the availability of pre-categorized data for PDF and MS document files, these preliminary results show that the work can possibly be extended to these file types as well.

Figure 8. Confusion matrices for Malimg multi-class dataset (top: SqueezeNet, bottom: AlexNet).

5. Discussion

The results presented demonstrate that all pre-trained models consistently produced satisfactory outcomes across all experiments. It is worth mentioning that a three-fold cross-validation technique was also conducted in order to ensure the reliability of these findings.

Based on the results found, it can be inferred that employing a compact neural network model can be an advantageous approach. Such models offer a high level of accuracy that is comparable to regular-sized models while also providing benefits from lower resource requirements. The compact size of the produced model enables a more efficient deployment within the intended application. This factor holds particular significance, provided that the primary objective of these experiments is to implement the model in a practical and usable application.

To assess the reliability and accuracy of the trained model further, a detailed testing methodology was performed. In addition to the standard train and test sets, this methodology included a thorough examination of the model’s behavior when exposed to data that fell outside the categories covered by the training and test sets. This involved testing the model against other related databases, thereby simulating real-world scenarios where the model encounters unfamiliar data.

Through this comprehensive testing methodology, we were able to gain insights into the model’s adaptability and generalization capabilities. Evaluating its performance on data from different categories provides a measure of its reliability and potential for real-world deployment. This assessment allowed us to gauge the effectiveness of the proposed methodology beyond the specific contexts used during training and testing, and its suitability for a broader range of applications.

6. Conclusions

In conclusion, our proposed model presents a versatile and compact solution for malware detection and classification, offering compatibility with a range of file types, including PDFs, Windows’ Portable Executable (PE) files, and Microsoft Office documents. Its efficiency in reducing the computational requirements while producing high accuracy with a compact size makes it an appealing choice for those seeking both convenience and resource optimization. Further, the innovative use of image transforms to harness the underlying data from diverse file types constitutes a notable contribution.

There are several promising directions for future development. We can explore the incorporation of additional file formats, extending the model’s versatility. Moreover, ongoing research into novel neural network architectures and transfer learning techniques could enhance the model’s performance and adaptability. Continuous efforts to expand the model’s dataset to include evolving malware threats will also contribute to its robustness. Overall, our proposed compact model paves the way for a more efficient and adaptable approach to malware detection, with ample opportunities for future advancements and innovations.

Author Contributions

Conceptualization, A.C., L.E.N. and H.M.; Methodology, A.C. and L.E.N.; Software, A.C., L.E.N. and T.N.; Validation, L.E.N., T.N. and W.O.; Formal analysis, A.C., L.E.N., T.N., H.M. and W.O.; Investigation, A.C., L.E.N., T.N. and W.O.; Resources, H.M.; Data curation, T.N. and W.O.; Writing—original draft, A.C., L.E.N., T.N. and W.O.; Writing—review & editing, A.C., L.E.N., T.N., H.M. and W.O.; Visualization, A.C., L.E.N. and T.N.; Supervision, H.M.; Project administration, H.M.; Funding acquisition, H.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://doi.org/10.5281/zenodo.4559436 (accessed on 16 November 2023), https://github.com/kartik2309/Malicious_pdf_detection.git (accessed on 16 November 2023) and https://contagiodump.blogspot.com/2013/03/16800-clean-and-11960-malicious-files.html (accessed on 16 November 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Aslam, W.; Fraz, M.; Rizvi, S.; Saleem, S. Cross-validation of machine learning algorithms for malware detection using static features of Windows portable executables: A Comparative Study. In Proceedings of the IEEE 17th International Conference on Smart Communities: Improving Quality of Life Using ICT, IoT and AI (HONET), IEEE, Charlotte, NC, USA, 14–16 December 2020; pp. 73–76.
2. Gibert, D.; Mateu, C.; Planes, J. The rise of machine learning for detection and classification of malware: Research developments, trends and challenges. J. Netw. Comput. Appl. 2020, 153, 102526.
3. Schultz, M.G.; Eskin, E.; Zadok, E.; Stolfo, S. Data mining methods for detection of new malicious executables. In Proceedings of the IEEE Symposium on Security and Privacy, S and P 2001, Oakland, CA, USA, 14–16 May 2000; pp. 38–49.
4. Kruegel, C.; Kirda, E.; Mutz, D.; Robertson, W.; Vigna, G. Polymorphic worm detection using structural information of executables. In Proceedings of the Recent Advances in Intrusion Detection: 8th International Symposium, RAID 2005, Seattle, WA, USA, 7–9 September 2005; Revised Papers 8. Springer: Berlin/Heidelberg, Germany, 2006; pp. 207–226.
5. Roundy, K.A.; Miller, B.P. Hybrid analysis and control of malware. In Proceedings of the International Workshop on Recent Advances in Intrusion Detection, Ontario, OT, Canada, 15–17 September 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 317–338.
6. Nguyen, K.D.T.; Tuan, T.M.; Le, S.H.; Viet, A.P.; Ogawa, M.; Le Minh, N. Comparison of three deep learning-based approaches for IoT malware detection. In Proceedings of the 10th International Conference on Knowledge and Systems Engineering (KSE), IEEE, Ho Chi Minh City, Vietnam, 1–3 November 2018; pp. 382–388.
7. Peiravian, N.; Zhu, X. Machine learning for android malware detection using permission and API calls. In Proceedings of the IEEE 25th International Conference on Tools with Artificial Intelligence, Herndon, VA, USA, 4–6 November 2013; pp. 300–305.
8. Qiao, Y.; Jiang, Q.; Jiang, Z.; Gu, L. A multi-channel visualization method for malware classification based on deep learning. In Proceedings of the 18th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/13th IEEE International Conference on Big Data Science and Engineering (TrustCom/BigDataSE), Rotorua, New Zealand, 5–8 August 2019; pp. 757–762.
9. Nataraj, L.; Karthikeyan, S.; Jacob, G.; Manjunath, B.S. Malware images: Visualization and automatic classification. In Proceedings of the 8th International Symposium on Visualization for Cyber Security, Pittsburgh, PA, USA, 20 July 2011; pp. 1–7.
10. Ni, S.; Qian, Q.; Zhang, R. Malware identification using visualization images and deep learning. Comput. Secur. 2018, 77, 871–885.
11. Naeem, H.; Ullah, F.; Naeem, M.R.; Khalid, S.; Vasan, D.; Jabbar, S.; Saeed, S. Malware detection in industrial internet of things based on hybrid image visualization and deep learning model. Ad Hoc Netw. 2020, 105, 102154.
12. Venkatraman, S.; Alazab, M.; Vinayakumar, R. A hybrid deep learning image-based analysis for effective malware detection. J. Inf. Secur. Appl. 2019, 47, 377–389.
13. Willems, C.; Holz, T.; Freiling, F. Toward automated dynamic malware analysis using cwsandbox. IEEE Secur. Priv. 2007, 5, 32–39.
14. Kolbitsch, C.; Comparetti, P.M.; Kruegel, C.; Kirda, E.; Zhou, X.Y.; Wang, X. Effective and efficient malware detection at the end host. In Proceedings of the USENIX Security Symposium, Montreal, QC, Canada, 10–14 August 2009; Volume 4, pp. 351–366.
15. Huang, W.; Stokes, J.W. MtNet: A multi-task neural network for dynamic malware classification. In Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, San Sebastián, Spain, 7–8 July 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 399–418.
16. Ding, Y.; Xia, X.; Chen, S.; Li, Y. A malware detection method based on family behavior graph. Comput. Secur. 2018, 73, 73–86.
17. Wang, S.; Chen, Z.; Yu, X.; Li, D.; Ni, J.; Tang, L.A.; Gui, J.; Li, Z.; Chen, H.; Yu, P.S. Heterogeneous graph matching networks. arXiv 2019, arXiv:1910.08074.
18. Smmarwar, S.K.; Gupta, G.P.; Kumar, S. AI-empowered malware detection system for industrial internet of things. Comput. Electr. Eng. 2023, 108, 108731.
19. Ullah, F.; Ullah, S.; Srivastava, G.; Lin, J.C.W.; Zhao, Y. NMal-Droid: Network-based android malware detection system using transfer learning and CNN-BiGRU ensemble. Wirel. Netw. 2023, 1–22.
20. Mahindru, A.; Sangal, A.L. MLDroid—Framework for Android malware detection using machine learning techniques. Neural Comput. Appl. 2021, 33, 5183–5240.
21. Belaoued, M.; Mazouzi, S. A chi-square-based decision for real-time malware detection using PE-file features. J. Inf. Process. Syst. 2016, 12, 644–660.
22. Singh, J.; Singh, J. A survey on machine learning-based malware detection in executable files. J. Syst. Archit. 2021, 112, 101861.
23. Bensaoud, A.; Abudawaood, N.; Kalita, J. Classifying malware images with convolutional neural network models. Int. J. Netw. Secur. 2020, 22, 1022–1031.
24. Azab, A.; Khasawneh, M. Msic: Malware spectrogram image classification. IEEE Access 2020, 8, 102007–102021.
25. Lin, W.C.; Yeh, Y.R. Efficient Malware Classification by Binary Sequences with One-Dimensional Convolutional Neural Networks. Mathematics 2022, 10, 608.
26. Farrokhmanesh, M.; Hamzeh, A. Music classification as a new approach for malware detection. J. Comput. Virol. Hacking Tech. 2019, 15, 77–96.
27. Cisco. Annual Cybersecurity Report. 2018. Available online: https://www.cisco.com/c/dam/m/hu_hu/campaigns/security-hub/pdf/acr-2018.pdf (accessed on 16 November 2023).
28. Singh, P.; Tapaswi, S.; Gupta, S. Malware detection in pdf and office documents: A survey. Inf. Secur. J. Glob. Perspect. 2020, 29, 134–153.
29. VirusTotal. A Free Service That Analyzes Files and URLs for Viruses, Worms, Trojans and Other Kinds of Malicious Content. 2004. Available online: https://support.virustotal.com (accessed on 16 November 2023).
30. Noever, D.; Noever, S.E.M. Virus-MNIST: A benchmark malware dataset. arXiv 2021, arXiv:2103.00602.
31. Almomani, I.; Alkhayer, A.; El-Shafai, W. E2E-RDS: Efficient End to End Ransomware Detection System Based on Static-Based ML and Vision-Based DL Approaches. Sensors 2023, 23, 4467.
32. Ghanei, H.; Manavi, F.; Hamzeh, A. A novel method for malware detection based on hardware events using deep neural networks. J. Comput. Virol. Hacking Tech. 2021, 17, 319–331.
33. Yang, S.; Chen, W.; Li, S.; Xu, Q. Approach using transforming structural data into image for detection of malicious MS-DOC files based on deep learning models. In Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), IEEE, Lanzhou, China, 18–21 November 2019; pp. 28–32.
34. Cohen, A.; Nissim, N.; Rokach, L.; Elovici, Y. SFEM: Structural feature extraction methodology for the detection of malicious office documents using machine learning methods. Expert Syst. Appl. 2016, 63, 324–343.
35. Corum, A.; Jenkins, D.; Zheng, J. Robust PDF malware detection with image visualization and processing techniques. In Proceedings of the 2019 2nd International Conference on Data Intelligence and Security (ICDIS), IEEE, South Padre Island, TX, USA, 28–30 June 2019; pp. 108–114.
36. Liu, C.Y.; Chiu, M.Y.; Huang, Q.X.; Sun, H.M. PDF Malware Detection Using Visualization and Machine Learning. In Proceedings of the IFIP Annual Conference on Data and Applications Security and Privacy, Calgary, AB, Canada, 19–20 July 2021; pp. 209–220.
37. Phan, H.; Hertel, L.; Maass, M.; Koch, P.; Mazur, R.; Mertins, A. Improved Audio Scene Classification Based on Label-Tree Embeddings and Convolutional Neural Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 1278–1290.
38. Krishna, S.T.; Kalluri, H.K. Deep learning and transfer learning approaches for image classification. Int. J. Recent Technol. Eng. 2019, 7, 427–432.
39. Curry, B. An Introduction to Transfer Learning in Machine Learning; Medium: San Francisco, CA, USA, 2018.
40. Copiaco, A.; Ritz, C.; Abdulaziz, N.; Fasciani, S. A Study of Features and Deep Neural Network Architectures and Hyper-Parameters for Domestic Audio Classification. Appl. Sci. 2021, 11, 4880.
41. Wang, S.H.; Zhang, Y. DenseNet-201-Based Deep Neural Network with Composite Learning Factor and Precomputation for Multiple Sclerosis Classification. ACM Trans. Multimed. Comput. Commun. Appl. 2020, 16, 1–19.
42. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
43. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning Transferable Architectures for Scalable Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2017; pp. 8697–8710.
44. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856.
45. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
46. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
47. Iandola, F.; Moskewicz, M.; Ashraf, K.; Han, S.; Dally, W.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360.
48. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
49. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
50. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
51. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25 (NIPS 2012), Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Pereira, F., Burges, C., Bottou, L., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2012.
52. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
53. El Neel, L.; Copiaco, A.; Obaid, W.; Mukhtar, H. Comparison of Feature Extraction and Classification Techniques of PE Malware. In Proceedings of the 2022 5th International Conference on Signal Processing and Information Security (ICSPIS), Dubai, United Arab Emirates, 7–8 December 2022; pp. 26–31.
54. Copiaco, A.; Mukhtar, H.; Neel, L.E.; Nazzal, T. Identification of Robust Features for Classifying Spam and Ham Images using Transfer Learning. In Proceedings of the 2022 5th International Conference on Signal Processing and Information Security (ICSPIS), Dubai, United Arab Emirates, 7–8 December 2022; pp. 1–4.
55. Koutsokostas, V.; Lykousas, N.; Orazi, G.; Apostolopoulos, T.; Ghosal, A.; Casino, F.; Conti, M.; Patsakis, C. Malicious MS Office Documents Dataset. Zenodo 2021.
56. Rajeshwaran, K. Malicious PDF Detection. 2022. Available online: https://github.com/kartik2309/Malicious_pdf_detection.git (accessed on 16 November 2023).
57. Contagio. Contagio Malware Dump. 2013. Available online: https://contagiodump.blogspot.com/2013/03/16800-clean-and-11960-malicious-files.html (accessed on 16 November 2023).
58. Wei, C.; Li, Q.; Guo, D.; Meng, X. Toward identifying APT malware through API system calls. Secur. Commun. Netw. 2021, 2021, 8077220.
59. Chebbi, C. Mastering Machine Learning for Penetration Testing: Develop an Extensive Skill Set to Break Self-Learning Systems Using Python; Packt Publishing Ltd.: Birmingham, UK, 2018.

Figure 1. PE file format [2].

Figure 2. Representation of PE file sections in grayscale image of malicious samples [2].

Figure 3. Feature extraction process of grayscale image for all file types.

Figure 4. Representation of benign and malicious MS Office files in grayscale image.

Figure 5. Representation of benign and malicious PDF files in grayscale image.

Table 1. Summary of the comparison between pre-trained CNN models.

Model	Type	Year	Size (MB)	Input Size	Depth	Parameters
DenseNet [41]	C	2020	44	224 × 224	201	20 million
EfficientNet [41]	C	2020	20	224 × 224	82	5.3 million
MobileNet-v2 [42]	C	2019	13	224 × 224	53	3.5 million
NasNet [43]	R	2017	332	331 × 331	-	88.9 million
ShuffleNet [44]	C	2017	5.4	224 × 224	50	1.4 million
Inception-ResNet [45]	R	2017	209	299 × 299	164 *	55.9 million
Xception [46]	R	2016	85	299 × 299	71	22.9 million
SqueezeNet [47]	C	2016	5.2	227 × 227	18	1.25 million
ResNet [48]	R	2015	167	224 × 224	101 *	25 million
GoogleNet [49]	C	2014	27	224 × 224	22	4 million
VGGNet [50]	R	2014	515	224 × 224	41 *	138 million
AlexNet [51]	R	2012	227	227 × 227	8	62.3 million
LeNet [52]	R	1998	-	32 × 32	7	60,000

* Depth may vary as per the version utilized. Type: R—regular, C—compact.

Table 2. Summary of Datasets.

Citation	Dataset	File Types	No. of Benign Samples	No. of Malicious Samples
[55]	Zenodo	Microsoft Office documents of different formats	2735	15,105
[56]	jonaslejon	PDF	9006	10,980
[57]	Clean DOC files	DOC	100	0
[57]	Clean DOC files	DOC	1300	0
[57]	Clean XLS files	XLS	300	0
[57]	Clean XLS files	XLS	100	0
[57]	Clean PDF & XLS files	PDF	500	0
[58]	Dike dataset	doc, docx, docm, xls, xlsx, xlsm, ppt, pptx, and pptm	100	1871
[58]	Dike dataset	exe	982	8970
[9]	Malimg dataset	Grayscale image representation of malicious exe files	0	12,109

Table 3. Results summary for individual file type malware detection model.

File Format	No. of Files	Train	Test	Model	Accuracy	Size
PDF	19,889	15,912	3977	AlexNet	96.73%	~227 MB
MS Documents	5770	4617	1153	AlexNet	87.44%	~227 MB
PE	9952	7962	1990	AlexNet	92.60%	~227 MB
PDF	19,889	15,912	3977	GoogleNet	96.12%	~24 MB
MS Documents	5770	4617	1153	GoogleNet	86.35%	~24 MB
PE	9952	7962	1990	GoogleNet	90.78%	~24 MB

Table 4. Result summary for multiple file type malware detection model.

File Formats	No. of Files	Train	Test	Model	Accuracy	Size
PDF and MS	25,659	20,529	5130	AlexNet	93.76%	~227 MB
PDF, MS, PE	35,611	28,491	7120	AlexNet	96.88%	~227 MB
PDF, MS, PE	35,611	28,491	7120	GoogleNet	96.39%	~24 MB
PDF, MS, PE	35,611	28,491	7120	SqueezeNet	96.69%	~5 MB
PDF, MS, PE	35,611	28,491	7120	MobileNet-v2	96.74%	~13 MB
PDF, MS, PE	35,611	28,491	7120	VGG-16	97.56%	~515 MB

Table 5. Comparison of static analysis model and image-based model for classifying MS files correctly.

File Type	Dataset	Number of Files	Static Analysis Accuracy	Image-Based Analysis Accuracy
Benign	Contagio (MS)	2210	95.79%	62.8%
Benign	Dike (MS)	100	99%	84%
Benign	Contagio (PDF)	87	54%	35.6%
Malware	Contagio (MS)	26	0%	88.46%
Malware	Dike (MS)	1871	91.2%	92.5%
Malware	Contagio (PDF)	124	95.16%	59.67%

Table 6. Malimg dataset: main malware family categories [9].

Class	Main Family	Number of Files
1	Worm	5854
2	PWS	679
3	Trojan	760
4	Dialer	733
5	Trojan Downloader	661
6	Rogue	381
7	Backdoor	274

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
