A Neural Network Approach to a Grayscale Image-Based Multi-File Type Malware Detection System
Abstract
1. Introduction
- To identify a universal image transformation method that works across multiple file formats, so that a single unified model with reduced resource requirements can detect malware in all of them.
- To develop a feature extraction approach that captures the essential information in a file's underlying data.
- To conduct a comprehensive assessment of neural network models, both conventional and compact, for effective transfer learning in this application.
- To establish a foundation for future research expansion to encompass additional file formats, such as audio (MP3) and video, broadening the scope of the study.
2. Microservices and Grayscale-Based Image Transform Features
2.1. File Type Microservices
2.1.1. Portable Executable (PE) Files
2.1.2. Microsoft Office (MS) Documents
2.1.3. Portable Document Format (PDF) Files
2.2. Grayscale-Based Transform Features
2.3. Pre-Trained Neural Network Models
3. Methodology
3.1. Datasets
3.2. Feature Extraction Using Grayscale-Based Image Transforms
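As a rough illustration of what a grayscale-based image transform can look like, the sketch below maps a file's raw bytes to rows of 8-bit pixel intensities, in the spirit of the byte-plot visualization of Nataraj et al. [9]. The function name `bytes_to_grayscale`, the default row width of 256, and the zero padding of the last row are assumptions made for illustration; they are not taken from this work.

```python
import math


def bytes_to_grayscale(data: bytes, width: int = 256) -> list[list[int]]:
    """Map raw bytes to rows of pixel intensities in the range 0-255.

    Each byte becomes one pixel; the final row is zero-padded so that
    every row has exactly `width` pixels. (Illustrative assumption.)
    """
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


# Tiny example: three bytes laid out on a 2-pixel-wide canvas.
image = bytes_to_grayscale(b"\x00\xff\x10", width=2)
```

Because the mapping only looks at bytes, the same function applies unchanged to PE, MS Office, and PDF files. The resulting array would still need to be resized to the input size of the chosen network (for example 224 × 224) before training.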
3.3. Model Training
3.4. Performance Evaluation Metrics
- Effect on overall accuracy of using regular versus compact neural networks.
- Effect on performance of using directed acyclic graph (DAG) versus series architectures.
- Effect on system accuracy of training models for a specific file type versus general models for multiple file formats.
- Comparison of the features extracted from single and multiple file types using different models, at both the individual and overall level.
- Impact of enlarging and merging datasets, and the system's response to imbalanced data.
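Since the last point concerns imbalanced data, it helps to see why a single accuracy figure can hide weak performance on the smaller class. The sketch below computes overall, per-class, and balanced accuracy from a binary confusion matrix. The function name `evaluate` and the counts are hypothetical and chosen only for illustration; they are not results from this study, and the exact metric set used here is an assumption.

```python
def evaluate(tp: int, fn: int, tn: int, fp: int) -> dict[str, float]:
    """Summarise a binary malware/benign confusion matrix.

    tp/fn count malicious files detected/missed;
    tn/fp count benign files passed/flagged.
    Assumes both classes contain at least one sample.
    """
    malware_acc = tp / (tp + fn)  # accuracy on the malicious class (recall)
    benign_acc = tn / (tn + fp)   # accuracy on the benign class (specificity)
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "malware_accuracy": malware_acc,
        "benign_accuracy": benign_acc,
        "balanced_accuracy": (malware_acc + benign_acc) / 2,
    }


# Hypothetical, deliberately imbalanced counts (1000 malicious, 100 benign).
scores = evaluate(tp=950, fn=50, tn=60, fp=40)
```

With these made-up counts the overall accuracy is dominated by the majority class, while the per-class and balanced figures expose the weaker benign performance.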
4. Results
4.1. Comparison of Single-File Type Models
4.2. Comparison of Multi-File Type Models
4.3. Additional Validation Results
4.4. Multi-Level Classification and Future Directions
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- [1] Aslam, W.; Fraz, M.; Rizvi, S.; Saleem, S. Cross-validation of machine learning algorithms for malware detection using static features of Windows portable executables: A Comparative Study. In Proceedings of the IEEE 17th International Conference on Smart Communities: Improving Quality of Life Using ICT, IoT and AI (HONET), IEEE, Charlotte, NC, USA, 14–16 December 2020; pp. 73–76.
- [2] Gibert, D.; Mateu, C.; Planes, J. The rise of machine learning for detection and classification of malware: Research developments, trends and challenges. J. Netw. Comput. Appl. 2020, 153, 102526.
- [3] Schultz, M.G.; Eskin, E.; Zadok, E.; Stolfo, S. Data mining methods for detection of new malicious executables. In Proceedings of the IEEE Symposium on Security and Privacy, S&P 2001, Oakland, CA, USA, 14–16 May 2001; pp. 38–49.
- [4] Kruegel, C.; Kirda, E.; Mutz, D.; Robertson, W.; Vigna, G. Polymorphic worm detection using structural information of executables. In Proceedings of the Recent Advances in Intrusion Detection: 8th International Symposium, RAID 2005, Seattle, WA, USA, 7–9 September 2005; Revised Papers 8. Springer: Berlin/Heidelberg, Germany, 2006; pp. 207–226.
- [5] Roundy, K.A.; Miller, B.P. Hybrid analysis and control of malware. In Proceedings of the International Workshop on Recent Advances in Intrusion Detection, Ottawa, ON, Canada, 15–17 September 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 317–338.
- [6] Nguyen, K.D.T.; Tuan, T.M.; Le, S.H.; Viet, A.P.; Ogawa, M.; Le Minh, N. Comparison of three deep learning-based approaches for IoT malware detection. In Proceedings of the 10th International Conference on Knowledge and Systems Engineering (KSE), IEEE, Ho Chi Minh City, Vietnam, 1–3 November 2018; pp. 382–388.
- [7] Peiravian, N.; Zhu, X. Machine learning for Android malware detection using permission and API calls. In Proceedings of the IEEE 25th International Conference on Tools with Artificial Intelligence, Herndon, VA, USA, 4–6 November 2013; pp. 300–305.
- [8] Qiao, Y.; Jiang, Q.; Jiang, Z.; Gu, L. A multi-channel visualization method for malware classification based on deep learning. In Proceedings of the 18th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/13th IEEE International Conference on Big Data Science and Engineering (TrustCom/BigDataSE), Rotorua, New Zealand, 5–8 August 2019; pp. 757–762.
- [9] Nataraj, L.; Karthikeyan, S.; Jacob, G.; Manjunath, B.S. Malware images: Visualization and automatic classification. In Proceedings of the 8th International Symposium on Visualization for Cyber Security, Pittsburgh, PA, USA, 20 July 2011; pp. 1–7.
- [10] Ni, S.; Qian, Q.; Zhang, R. Malware identification using visualization images and deep learning. Comput. Secur. 2018, 77, 871–885.
- [11] Naeem, H.; Ullah, F.; Naeem, M.R.; Khalid, S.; Vasan, D.; Jabbar, S.; Saeed, S. Malware detection in industrial internet of things based on hybrid image visualization and deep learning model. Ad Hoc Netw. 2020, 105, 102154.
- [12] Venkatraman, S.; Alazab, M.; Vinayakumar, R. A hybrid deep learning image-based analysis for effective malware detection. J. Inf. Secur. Appl. 2019, 47, 377–389.
- [13] Willems, C.; Holz, T.; Freiling, F. Toward automated dynamic malware analysis using cwsandbox. IEEE Secur. Priv. 2007, 5, 32–39.
- [14] Kolbitsch, C.; Comparetti, P.M.; Kruegel, C.; Kirda, E.; Zhou, X.Y.; Wang, X. Effective and efficient malware detection at the end host. In Proceedings of the USENIX Security Symposium, Montreal, QC, Canada, 10–14 August 2009; Volume 4, pp. 351–366.
- [15] Huang, W.; Stokes, J.W. MtNet: A multi-task neural network for dynamic malware classification. In Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, San Sebastián, Spain, 7–8 July 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 399–418.
- [16] Ding, Y.; Xia, X.; Chen, S.; Li, Y. A malware detection method based on family behavior graph. Comput. Secur. 2018, 73, 73–86.
- [17] Wang, S.; Chen, Z.; Yu, X.; Li, D.; Ni, J.; Tang, L.A.; Gui, J.; Li, Z.; Chen, H.; Yu, P.S. Heterogeneous graph matching networks. arXiv 2019, arXiv:1910.08074.
- [18] Smmarwar, S.K.; Gupta, G.P.; Kumar, S. AI-empowered malware detection system for industrial internet of things. Comput. Electr. Eng. 2023, 108, 108731.
- [19] Ullah, F.; Ullah, S.; Srivastava, G.; Lin, J.C.W.; Zhao, Y. NMal-Droid: Network-based android malware detection system using transfer learning and CNN-BiGRU ensemble. Wirel. Netw. 2023, 1–22.
- [20] Mahindru, A.; Sangal, A.L. MLDroid—Framework for Android malware detection using machine learning techniques. Neural Comput. Appl. 2021, 33, 5183–5240.
- [21] Belaoued, M.; Mazouzi, S. A chi-square-based decision for real-time malware detection using PE-file features. J. Inf. Process. Syst. 2016, 12, 644–660.
- [22] Singh, J.; Singh, J. A survey on machine learning-based malware detection in executable files. J. Syst. Archit. 2021, 112, 101861.
- [23] Bensaoud, A.; Abudawaood, N.; Kalita, J. Classifying malware images with convolutional neural network models. Int. J. Netw. Secur. 2020, 22, 1022–1031.
- [24] Azab, A.; Khasawneh, M. Msic: Malware spectrogram image classification. IEEE Access 2020, 8, 102007–102021.
- [25] Lin, W.C.; Yeh, Y.R. Efficient Malware Classification by Binary Sequences with One-Dimensional Convolutional Neural Networks. Mathematics 2022, 10, 608.
- [26] Farrokhmanesh, M.; Hamzeh, A. Music classification as a new approach for malware detection. J. Comput. Virol. Hacking Tech. 2019, 15, 77–96.
- [27] Cisco. Annual Cybersecurity Report. 2018. Available online: https://www.cisco.com/c/dam/m/hu_hu/campaigns/security-hub/pdf/acr-2018.pdf (accessed on 16 November 2023).
- [28] Singh, P.; Tapaswi, S.; Gupta, S. Malware detection in pdf and office documents: A survey. Inf. Secur. J. Glob. Perspect. 2020, 29, 134–153.
- [29] VirusTotal. A Free Service That Analyzes Files and URLs for Viruses, Worms, Trojans and Other Kinds of Malicious Content. 2004. Available online: https://support.virustotal.com (accessed on 16 November 2023).
- [30] Noever, D.; Noever, S.E.M. Virus-MNIST: A benchmark malware dataset. arXiv 2021, arXiv:2103.00602.
- [31] Almomani, I.; Alkhayer, A.; El-Shafai, W. E2E-RDS: Efficient End to End Ransomware Detection System Based on Static-Based ML and Vision-Based DL Approaches. Sensors 2023, 23, 4467.
- [32] Ghanei, H.; Manavi, F.; Hamzeh, A. A novel method for malware detection based on hardware events using deep neural networks. J. Comput. Virol. Hacking Tech. 2021, 17, 319–331.
- [33] Yang, S.; Chen, W.; Li, S.; Xu, Q. Approach using transforming structural data into image for detection of malicious MS-DOC files based on deep learning models. In Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), IEEE, Lanzhou, China, 18–21 November 2019; pp. 28–32.
- [34] Cohen, A.; Nissim, N.; Rokach, L.; Elovici, Y. SFEM: Structural feature extraction methodology for the detection of malicious office documents using machine learning methods. Expert Syst. Appl. 2016, 63, 324–343.
- [35] Corum, A.; Jenkins, D.; Zheng, J. Robust PDF malware detection with image visualization and processing techniques. In Proceedings of the 2019 2nd International Conference on Data Intelligence and Security (ICDIS), IEEE, South Padre Island, TX, USA, 28–30 June 2019; pp. 108–114.
- [36] Liu, C.Y.; Chiu, M.Y.; Huang, Q.X.; Sun, H.M. PDF Malware Detection Using Visualization and Machine Learning. In Proceedings of the IFIP Annual Conference on Data and Applications Security and Privacy, Calgary, AB, Canada, 19–20 July 2021; pp. 209–220.
- [37] Phan, H.; Hertel, L.; Maass, M.; Koch, P.; Mazur, R.; Mertins, A. Improved Audio Scene Classification Based on Label-Tree Embeddings and Convolutional Neural Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 1278–1290.
- [38] Krishna, S.T.; Kalluri, H.K. Deep learning and transfer learning approaches for image classification. Int. J. Recent Technol. Eng. 2019, 7, 427–432.
- [39] Curry, B. An Introduction to Transfer Learning in Machine Learning; Medium: San Francisco, CA, USA, 2018.
- [40] Copiaco, A.; Ritz, C.; Abdulaziz, N.; Fasciani, S. A Study of Features and Deep Neural Network Architectures and Hyper-Parameters for Domestic Audio Classification. Appl. Sci. 2021, 11, 4880.
- [41] Wang, S.H.; Zhang, Y. DenseNet-201-Based Deep Neural Network with Composite Learning Factor and Precomputation for Multiple Sclerosis Classification. ACM Trans. Multimed. Comput. Commun. Appl. 2020, 16, 1–19.
- [42] Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
- [43] Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning Transferable Architectures for Scalable Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8697–8710.
- [44] Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856.
- [45] Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
- [46] Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
- [47] Iandola, F.; Moskewicz, M.; Ashraf, K.; Han, S.; Dally, W.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360.
- [48] He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- [49] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- [50] Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
- [51] Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25 (NIPS 2012), Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Pereira, F., Burges, C., Bottou, L., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2012.
- [52] Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
- [53] El Neel, L.; Copiaco, A.; Obaid, W.; Mukhtar, H. Comparison of Feature Extraction and Classification Techniques of PE Malware. In Proceedings of the 2022 5th International Conference on Signal Processing and Information Security (ICSPIS), Dubai, United Arab Emirates, 7–8 December 2022; pp. 26–31.
- [54] Copiaco, A.; Mukhtar, H.; Neel, L.E.; Nazzal, T. Identification of Robust Features for Classifying Spam and Ham Images using Transfer Learning. In Proceedings of the 2022 5th International Conference on Signal Processing and Information Security (ICSPIS), Dubai, United Arab Emirates, 7–8 December 2022; pp. 1–4.
- [55] Koutsokostas, V.; Lykousas, N.; Orazi, G.; Apostolopoulos, T.; Ghosal, A.; Casino, F.; Conti, M.; Patsakis, C. Malicious MS Office Documents Dataset. Zenodo 2021.
- [56] Rajeshwaran, K. Malicious PDF Detection. 2022. Available online: https://github.com/kartik2309/Malicious_pdf_detection.git (accessed on 16 November 2023).
- [57] Contagio. Contagio Malware Dump. 2013. Available online: https://contagiodump.blogspot.com/2013/03/16800-clean-and-11960-malicious-files.html (accessed on 16 November 2023).
- [58] Wei, C.; Li, Q.; Guo, D.; Meng, X. Toward identifying APT malware through API system calls. Secur. Commun. Netw. 2021, 2021, 8077220.
- [59] Chebbi, C. Mastering Machine Learning for Penetration Testing: Develop an Extensive Skill Set to Break Self-Learning Systems Using Python; Packt Publishing Ltd.: Birmingham, UK, 2018.
Model | Type | Year | Size (MB) | Input Size | Depth | Parameters |
---|---|---|---|---|---|---|
DenseNet [41] | C | 2020 | 44 | 224 × 224 | 201 | 20 million |
EfficientNet [41] | C | 2020 | 20 | 224 × 224 | 82 | 5.3 million |
MobileNet-v2 [42] | C | 2019 | 13 | 224 × 224 | 53 | 3.5 million |
NasNet [43] | R | 2017 | 332 | 331 × 331 | - | 88.9 million |
ShuffleNet [44] | C | 2017 | 5.4 | 224 × 224 | 50 | 1.4 million |
Inception-ResNet [45] | R | 2017 | 209 | 299 × 299 | 164 * | 55.9 million |
Xception [46] | R | 2016 | 85 | 299 × 299 | 71 | 22.9 million |
SqueezeNet [47] | C | 2016 | 5.2 | 227 × 227 | 18 | 1.25 million |
ResNet [48] | R | 2015 | 167 | 224 × 224 | 101 * | 25 million |
GoogleNet [49] | C | 2014 | 27 | 224 × 224 | 22 | 4 million |
VGGNet [50] | R | 2014 | 515 | 224 × 224 | 41 * | 138 million |
AlexNet [51] | R | 2012 | 227 | 227 × 227 | 8 | 62.3 million |
LeNet [52] | R | 1998 | - | 32 × 32 | 7 | 60,000 |
Citation | Dataset | File Types | No. of Benign Samples | No. of Malicious Samples |
---|---|---|---|---|
[55] | Zenodo | Microsoft Office documents of different formats | 2735 | 15,105 |
[56] | jonaslejon | PDF | 9006 | 10,980 |
[57] | Clean DOC files | DOC | 100 | 0 |
[57] | Clean DOC files | DOC | 1300 | 0 |
[57] | Clean XLS files | XLS | 300 | 0 |
[57] | Clean XLS files | XLS | 100 | 0 |
[57] | Clean PDF & XLS files | | 500 | 0 |
[58] | Dike dataset | doc, docx, docm, xls, xlsx, xlsm, ppt, pptx, and pptm | 100 | 1871 |
[58] | Dike dataset | exe | 982 | 8970 |
[9] | Malimg dataset | Grayscale image representation of malicious exe files | 0 | 12,109 |
File Format | No. of Files | Train | Test | Model | Accuracy | Size |
---|---|---|---|---|---|---|
PDF | 19,889 | 15,912 | 3977 | AlexNet | 96.73% | 27 MB |
MS Documents | 5770 | 4617 | 1153 | AlexNet | 87.44% | 27 MB |
PE | 9952 | 7962 | 1990 | AlexNet | 92.60% | 27 MB |
PDF | 19,889 | 15,912 | 3977 | GoogleNet | 96.12% | 4 MB |
MS Documents | 5770 | 4617 | 1153 | GoogleNet | 86.35% | 4 MB |
PE | 9952 | 7962 | 1990 | GoogleNet | 90.78% | 4 MB |
File Formats | No. of Files | Train | Test | Model | Accuracy | Size |
---|---|---|---|---|---|---|
PDF and MS | 25,659 | 20,529 | 5130 | AlexNet | 93.76% | 27 MB |
PDF, MS, PE | 35,611 | 28,491 | 7120 | AlexNet | 96.88% | 27 MB |
PDF, MS, PE | 35,611 | 28,491 | 7120 | GoogleNet | 96.39% | 4 MB |
PDF, MS, PE | 35,611 | 28,491 | 7120 | SqueezeNet | 96.69% | MB |
PDF, MS, PE | 35,611 | 28,491 | 7120 | MobileNet-v2 | 96.74% | 3 MB |
PDF, MS, PE | 35,611 | 28,491 | 7120 | VGG-16 | 97.56% | 15 MB |
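The file counts in the two results tables above can be cross-checked with a few lines of arithmetic. The sketch below confirms that every train/test pair sums to its file count, that each training share is close to 80% (the split procedure is not stated here, so the 80/20 reading is an inference from the numbers), and that the multi-format totals are cumulative sums of the single-format totals. Variable and function names are illustrative.

```python
# (total, train, test) triples copied from the two results tables.
single_format = [(19889, 15912, 3977), (5770, 4617, 1153), (9952, 7962, 1990)]
multi_format = [(25659, 20529, 5130), (35611, 28491, 7120)]


def train_fraction(total: int, train: int, test: int) -> float:
    """Return the training share after checking the split is exhaustive."""
    if train + test != total:
        raise ValueError(f"train + test != total for {total}")
    return train / total


rows = single_format + multi_format
fractions = [train_fraction(*row) for row in rows]

# The merged sets should be cumulative sums of the single-format counts.
totals = [row[0] for row in single_format]
merged_ok = (
    totals[0] + totals[1] == multi_format[0][0]
    and sum(totals) == multi_format[1][0]
)

for row, frac in zip(rows, fractions):
    print(f"{row[0]:>6} files: training share {frac:.4f}")
print("merged totals consistent:", merged_ok)
```

The training shares are all slightly above 0.8 rather than exactly 0.8, which would be consistent with rounding applied per class or per file type, although that is only a guess.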
File Type | Dataset | Number of Files | Static Analysis Accuracy | Image-Based Analysis Accuracy |
---|---|---|---|---|
Benign | Contagio (MS) | 2210 | 95.79% | 62.8% |
Benign | Dike (MS) | 100 | 99% | 84% |
Benign | Contagio (PDF) | 87 | 54% | 35.6% |
Malware | Contagio (MS) | 26 | 0% | 88.46% |
Malware | Dike (MS) | 1871 | 91.2% | 92.5% |
Malware | Contagio (PDF) | 124 | 95.16% | 59.67% |
Class | Main Family | Number of Files |
---|---|---|
1 | Worm | 5854 |
2 | PWS | 679 |
3 | Trojan | 760 |
4 | Dialer | 733 |
5 | Trojan Downloader | 661 |
6 | Rogue | 381 |
7 | Backdoor | 274 |
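The family counts above are strongly skewed toward the Worm class, which matters for any multi-level classifier trained on them. The sketch below quantifies that skew and derives inverse-frequency class weights, one common mitigation for imbalance. Whether any such weighting was applied in this study is not stated, so treat the weighting scheme and all names as assumptions.

```python
# Family counts copied from the table above.
family_counts = {
    "Worm": 5854,
    "PWS": 679,
    "Trojan": 760,
    "Dialer": 733,
    "Trojan Downloader": 661,
    "Rogue": 381,
    "Backdoor": 274,
}

total = sum(family_counts.values())
shares = {name: n / total for name, n in family_counts.items()}
imbalance_ratio = max(family_counts.values()) / min(family_counts.values())

# Inverse-frequency weights: total / (number_of_classes * class_count).
# Rare families receive weights above 1, frequent ones below 1.
class_weights = {
    name: total / (len(family_counts) * n) for name, n in family_counts.items()
}

for name in family_counts:
    print(f"{name:<18} share={shares[name]:.3f} weight={class_weights[name]:.3f}")
```

With this scheme every class contributes the same total weight, so a loss function that uses the weights no longer rewards a model for predicting the majority family by default.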
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Copiaco, A.; El Neel, L.; Nazzal, T.; Mukhtar, H.; Obaid, W. A Neural Network Approach to a Grayscale Image-Based Multi-File Type Malware Detection System. Appl. Sci. 2023, 13, 12888. https://doi.org/10.3390/app132312888