Most research emphasizes the importance of behavior analysis techniques compared to traditional signature-based methods. This study divided the literature review on deep learning models used for malware classification into three sections, namely, CNN-based models, Transformer-based models, and CNN+Transformer-based hybrid models.
2.2.1. CNN-Based Models
The concept of visualizing malware as images was presented by Nataraj et al. [
10], who introduced the Malimg dataset by converting malware binaries into grayscale images (a minimal sketch of this conversion is given at the end of this paragraph). Their work demonstrated that image-based malware classification could achieve high accuracy using simple texture analysis techniques. Subsequently, numerous studies have explored this approach, with researchers developing various visualization techniques for malware binaries. Awan et al. [
11] proposed a deep learning framework, SACNN, which combines spatial attention with a CNN for image-based classification using the Malimg dataset. The model achieved high performance across multiple metrics, with precision, recall, and F1-scores all exceeding 97%, both with and without class balancing. While the model is relatively simple compared to more complex architectures, its strong accuracy demonstrates that lightweight designs can still be effective for malware detection, though broader generalization beyond Malimg remains to be explored. Singh et al. [
12] developed a hybrid malware classification model that combines Gated Recurrent Units (GRUs) for sequential feature extraction with a CNN-based feature refiner, followed by classification using a Cost-sensitive Bootstrapped Weighted Random Forest (CSBW-RF). The approach achieved 99.58% accuracy on the Malimg dataset and demonstrated strong generalizability on the Microsoft Big 2015 dataset, outperforming several existing models. While the method shows high robustness and adaptability across datasets, its complexity and reliance on multiple processing stages may pose challenges for real-time deployment in constrained environments. Kumar et al. [
13] introduced transfer learning and ensemble learning for malware classification, achieving an accuracy of 99.36% on the Malimg test dataset and 92.11% on a real-world malware dataset. Kalash et al. [
14] proposed a CNN-based malware classification model by converting malware binaries into grayscale images. They evaluated the proposed model on the Malimg and Microsoft malware datasets and achieved accuracies of 98.52% and 99.97%, respectively. Ravi and Alazab [
15] proposed an attention-based CNN method and achieved a 99% accuracy result on the Malimg dataset. Panda et al. [
16] introduced a stacked ensemble model (SE-AGM), combining an autoencoder, GRU, and MLP, trained on 25 extracted features from the Malimg dataset to classify malware families efficiently. Leveraging CNN-based transfer learning and data augmentation, their model achieved a high accuracy of 99.43%, outperforming several baseline approaches. Despite its strong performance, the model’s reliance on a limited feature set and evaluation on a single dataset may restrict its adaptability to broader malware environments. Alam et al. [
17] proposed an efficient layered feature extractor combined with a spatial CNN to design a streamlined architecture for malware classification, attaining 99.87% accuracy on Malimg, 99.81% on BIG2015, and 99.22% on MaleVis. Guan et al. [
18] presented a hybrid of ResNet50 and VGG16 that uses knowledge distillation and fused-layer optimization to compress deep models while retaining high accuracy, achieving 99.50% on Malimg and 97.52% on BIG2015. Abdulazeez et al. [
19] benchmarked DenseNet201 combined with KNN to evaluate the suitability of pretrained models for malware detection tasks, obtaining 96% accuracy on Malimg. Alnajim et al. [
20] explored CNN and DNN models to broaden deep learning-based approaches for malware analysis, reporting 98.14% accuracy on Malimg and 98.95% on BIG2015.
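As background for the image-based approaches surveyed above, the Nataraj-style binary-to-grayscale conversion can be sketched in a few lines of Python. The snippet below is only a minimal illustration assuming NumPy and Pillow; it uses a fixed row width for simplicity, whereas Nataraj et al. [10] select the width based on file size.

```python
import numpy as np
from PIL import Image

def malware_to_grayscale(path: str, width: int = 256) -> Image.Image:
    """Read a binary file and reshape its bytes into a 2-D grayscale image.

    Each byte (0-255) becomes one pixel; the row width is fixed here for
    simplicity, whereas Nataraj et al. vary it with the file size.
    """
    data = np.fromfile(path, dtype=np.uint8)           # raw bytes as unsigned ints
    height = int(np.ceil(len(data) / width))           # enough rows to hold all bytes
    padded = np.zeros(height * width, dtype=np.uint8)  # zero-pad the final row
    padded[: len(data)] = data
    return Image.fromarray(padded.reshape(height, width), mode="L")

# Example (hypothetical file name): convert one sample and resize it for a CNN input
# img = malware_to_grayscale("sample.exe").resize((224, 224))
```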
The MaleVis dataset [
21] contains malware samples generated using a diverse approach that emphasizes the structural and behavioral characteristics of the underlying binaries. Al-Khater and Al-Madeed [
22] addressed the challenge of detecting new malware by applying the fast and adaptive bidirectional empirical mode decomposition technique to improve the quality of the dataset and overcome class imbalance. Their study evaluated two 3D deep learning architectures, VGG-16 and ResNet-18, on the Malimg and MaleVis datasets, achieving up to 99.64% precision with ResNet-18. While the results are promising, the approach relies on computationally intensive 3D models and preprocessing steps, which may affect its scalability and real-time applicability in practical cybersecurity settings. Atitallah et al. [
23] introduced a vision-based IoT malware detection framework using deep transfer learning and ensemble methods to improve classification performance. Their approach combines ResNet18, MobileNetV2, and DenseNet161 through a random forest voting strategy; evaluated on the MaleVis dataset, the model achieved 98.68% accuracy. However, the method relies on RGB image transformation and ensemble complexity, which may limit its real-time deployment in resource-constrained IoT environments.
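To make the ensemble idea concrete, the sketch below fine-tunes three ImageNet-pretrained backbones of the kind used by Atitallah et al. [23] and combines their predictions. Note that the soft-voting (probability-averaging) step is a simplified stand-in for their random forest voting strategy, and the torchvision model choices and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 26  # illustrative: MaleVis is commonly described as 25 malware families plus one benign class

def build_backbone(name: str) -> nn.Module:
    """Load an ImageNet-pretrained backbone and replace its classifier head."""
    if name == "resnet18":
        net = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        net.fc = nn.Linear(net.fc.in_features, NUM_CLASSES)
    elif name == "mobilenet_v2":
        net = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
        net.classifier[1] = nn.Linear(net.classifier[1].in_features, NUM_CLASSES)
    else:
        net = models.densenet161(weights=models.DenseNet161_Weights.DEFAULT)
        net.classifier = nn.Linear(net.classifier.in_features, NUM_CLASSES)
    return net

@torch.no_grad()
def ensemble_predict(nets, images: torch.Tensor) -> torch.Tensor:
    """Soft voting: average the softmax outputs of all fine-tuned backbones."""
    probs = torch.stack([torch.softmax(net(images), dim=1) for net in nets])
    return probs.mean(dim=0).argmax(dim=1)

# nets = [build_backbone(n).eval() for n in ("resnet18", "mobilenet_v2", "densenet161")]
# preds = ensemble_predict(nets, batch_of_rgb_images)
```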
Noever and Miller introduced the Virus-MNIST dataset [
24], which includes more than 50,000 virus examples from nine malware families and benign files. They implemented a MobileNetV2 CNN model, achieving an accuracy of 80% in classifying malware samples. Their study highlighted the challenges of distinguishing between malware families with similar visual patterns. Habibi et al. [
25] addressed the limitations of conventional malware detection by employing CNN-based transfer learning models, MobileNetV2 and ResNet50, for robust classification of malware, including obfuscated variants. The proposed models were trained on Virus-MNIST and achieved 99% accuracy in general classification and 100% accuracy in obfuscated malware detection on the Malimg dataset. While the results demonstrate high effectiveness, the study primarily evaluated static visual patterns and may benefit from additional analysis on dynamic behavior or unseen zero-day threats. Zou et al. [
26] introduced FACILE, a capsule network optimized for malware classification that integrates dynamic convolution and balanced routing to reduce training complexity and improve feature representation. Tested on the Virus-MNIST, Malimg, and BIG2015 datasets, FACILE reduced error rates to 8.087%, 1.149%, and 2.797%, respectively. Dutta et al. [
27] proposed KOL-4-GEN, a suite of four deep learning models based on Kolmogorov–Arnold Networks with trainable activation functions, integrated with a GAN to mitigate data imbalance during malware image classification. Evaluated on the Malimg, MaleVis, and Virus-MNIST datasets, the models achieved validation accuracies of approximately 99.36%, 95.44%, and 92.12%, respectively.
Other research works have also been proposed for malware classification using deep learning models. For example, Yang et al. [
28] proposed a malware detection framework that combines binary and opcode features using a stacked convolutional network and a triangular attention mechanism. Their model employs cross-attention to align and fuse feature representations, achieving 99.54% accuracy on the Kaggle Malware Classification dataset and 95.44% on a real-world dataset. A visualized attention module further enhances interpretability by highlighting relevant opcode patterns. However, the approach is limited to sequential features and does not explore visual or Transformer-based architectures to capture spatial or global dependencies. Cui et al. [
29] proposed a grayscale-based visualization method that converts malicious code into images. They implemented a CNN to extract visual features and addressed the imbalanced dataset using the BAT algorithm. Abdullah et al. [
30] proposed a hybrid static classifier that combines CNN and BiLSTM to detect malware in IoT environments using 1D image representations of Byte and Assembly files. The model achieved average accuracies of 99.91% and 99.83% on the Microsoft Malware Classification and IoT Malware datasets, respectively. Although the method benefits from automatic feature extraction and strong accuracy, its reliance on 1D image conversion and dual-stage architecture may limit its adaptability to other data formats. Brosolo et al. [
31] evaluated visual malware analysis techniques and discussed key challenges such as the lack of interpretability of deep learning models. Karat et al. [
32] proposed a CNN-LSTM algorithm for zero-day malware detection. They used two API call sequence datasets to validate the proposed method and achieved a 96% validation accuracy. Several other studies have explored alternative datasets beyond the commonly used benchmarks. For example, Chaganti et al. [
33] applied CNNs combined with feature fusion techniques, achieving 97% accuracy on a malware dataset, with the goal of enhancing classification performance through multi-feature integration. Ahmed et al. [
34] utilized InceptionV3 to achieve 98.76% accuracy on BIG2015, aiming to compare machine learning and transfer learning approaches for malware detection. Ismail et al. [
35] employed contrastive learning and data augmentation, with pretraining performed on unlabeled ImageNet images, followed by fine-tuning on unlabeled malware samples. The system was evaluated through two downstream tasks: malware family classification and malware-versus-benign detection. Their experiments reported 98.4% accuracy on the Malimg dataset and 96.2% on the Maldeb dataset, surpassing the performance of existing self-supervised approaches. Puneeth et al. [
36] employed a CNN-based approach for malware classification and evaluated it on the Binary, Malimg, and Dumpware-10 datasets, achieving accuracies of 99.15%, 99.26%, and 98.19%, respectively.
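Several of the works above (e.g., [12,30,32]) follow a common hybrid pattern in which a convolutional stage extracts local features from a byte-level image representation and a recurrent stage models their sequential dependencies. The following PyTorch sketch illustrates this generic CNN-to-BiLSTM pattern; the layer sizes and sequence length are illustrative assumptions and do not reproduce any specific cited architecture.

```python
import torch
import torch.nn as nn

class CNNBiLSTMClassifier(nn.Module):
    """Generic CNN -> BiLSTM hybrid for 1-D byte-image inputs (illustrative sizes)."""

    def __init__(self, num_classes: int):
        super().__init__()
        # Convolutional stage: local byte-pattern features.
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(4),
        )
        # Recurrent stage: sequential dependencies across feature positions.
        self.lstm = nn.LSTM(input_size=64, hidden_size=128,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 128, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, seq_len) normalized byte sequence
        feats = self.conv(x)                 # (batch, 64, seq_len / 16)
        feats = feats.transpose(1, 2)        # (batch, steps, 64) for the LSTM
        out, _ = self.lstm(feats)
        return self.head(out[:, -1, :])      # classify from the last time step

# model = CNNBiLSTMClassifier(num_classes=9)
# logits = model(torch.rand(8, 1, 4096))
```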
2.2.2. Transformer-Based Models
Initially developed for natural language processing, Transformers have recently been repurposed for applications in computer vision. Dosovitskiy et al. [
37] introduced the Vision Transformer (ViT), demonstrating its competitive performance in image classification by conceptualizing images as sequences of patches (a minimal sketch of this patch-based formulation is given at the end of this paragraph). Following this work, numerous variants have been proposed to overcome the limitations inherent in the original ViT architecture [
38]. Liu et al. [
39] introduced the Swin Transformer, which implements shifted windows to improve computational efficiency and effectively capture local dependencies. This architecture has demonstrated strong performance across various computer vision tasks, including image classification, object detection, and semantic segmentation. Transfer learning-based Vision Transformers, such as the Butterfly Vision Transformer (B-ViT), have also been applied to malware classification. For example, Belal et al. [
40] implemented four B-ViT variants, namely B-ViT/B16, B-ViT/B32, B-ViT/L16, and B-ViT/L32. They evaluated the proposed methods on the Malimg, Microsoft BIG, and PE imports datasets and achieved accuracies of 99.32%, 99.49%, and 99.99%, respectively. Ashawa et al. [
41] implemented ResNet-152 and the Vision Transformer architecture and achieved a 99.62% accuracy result using 10-fold cross-validation for malware classification.
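The patch-based formulation underlying ViT [37] and its malware-oriented variants can be summarized compactly: the image is split into fixed-size patches, each patch is linearly embedded, a class token and positional embeddings are added, and a Transformer encoder processes the resulting sequence. The minimal PyTorch sketch below is an illustrative assumption (tiny dimensions, a plain nn.TransformerEncoder) rather than any specific cited model.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style classifier: patchify, embed, add a class token, encode."""

    def __init__(self, img_size=224, patch=16, dim=192, depth=4, heads=3, num_classes=25):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # A strided convolution performs patch splitting and linear embedding at once.
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, 224, 224) grayscale malware image
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (batch, 196, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        return self.head(self.encoder(tokens)[:, 0])             # classify from [CLS]

# model = TinyViT(num_classes=25)  # e.g., the 25 Malimg families
# logits = model(torch.rand(2, 1, 224, 224))
```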
Wang et al. [
42] employed a self-supervised Swin Transformer that reached accuracies of 97.85% on BIG2015 and 98.28% on Malimg, focusing on developing a lightweight yet effective framework for malware image classification. Zhao et al. [
43] integrated a Swin Transformer with deformable attention to enhance classification performance through improved attention mechanisms, achieving 99.35% accuracy on Malimg. Ahmed et al. [
44] proposed Gaussian Discriminant Analysis (GDA) and Segmentation-based Fractal Texture Analysis (SFTA) for feature extraction and Naive Bayes (NB) for classification, achieving 98% accuracy on MaleVis, with the aim of assessing the effectiveness of Vision Transformer models for binary and multi-class malware classification tasks. Ashwini et al. [
45] introduced a dual Vision Transformer with split attention to improve ransomware detection using attention-enhanced Transformer models, attaining 99.99% accuracy on a ransomware dataset.