Efficient Malware Detection in HWP Byte Sequences Using Pooling-Based Model
Abstract
1. Introduction
2. Materials and Methods
2.1. Materials
2.2. Method
3. Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References



| Split | Benign (Normal) | Malicious | Total |
|---|---|---|---|
| Train | 57,515 | 28,857 | 86,372 |
| Test | 5,429 | 1,773 | 7,202 |
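The split sizes above can be cross-checked directly; note that the malicious share differs between the training split (about 33%) and the test split (about 25%). A minimal sketch, assuming only the counts in the table:

```python
# Sanity-check the dataset split: row totals and class balance.
train = {"benign": 57_515, "malicious": 28_857, "total": 86_372}
test = {"benign": 5_429, "malicious": 1_773, "total": 7_202}

for name, split in (("train", train), ("test", test)):
    # Each row total must equal the sum of the two classes.
    assert split["benign"] + split["malicious"] == split["total"]
    share = split["malicious"] / split["total"]
    print(f"{name}: malicious share = {share:.1%}")
```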
| Group size | Malware F1 (%) | Malware Precision (%) | Malware Recall (%) | Normal F1 (%) | Normal Precision (%) | Normal Recall (%) | Macro F1 (%) |
|---|---|---|---|---|---|---|---|
| G = 25 | 54.45 | 40.04 | 85.05 | 71.54 | 92.29 | 58.40 | 62.99 |
| G = 50 | 55.40 | 40.55 | 87.42 | 71.67 | 93.40 | 58.14 | 63.54 |
| G = 100 | 55.18 | 40.46 | 86.75 | 71.71 | 93.09 | 58.32 | 63.45 |
| G = 150 | 54.20 | 39.75 | 85.15 | 71.12 | 92.27 | 57.86 | 62.66 |
| Model | Malware F1 (%) | Malware Precision (%) | Malware Recall (%) | Normal F1 (%) | Normal Precision (%) | Normal Recall (%) | Macro F1 (%) |
|---|---|---|---|---|---|---|---|
| MalConv | 52.61 | 38.95 | 81.05 | 71.05 | 90.44 | 58.51 | 61.83 |
| SPAPConv (G = 50) | 54.87 | 39.49 | 89.90 | 69.46 | 94.36 | 54.98 | 62.16 |
| PoolModel (G = 50) | 55.40 | 40.55 | 87.42 | 71.67 | 93.40 | 58.14 | 63.54 |
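The reported scores are internally consistent: each per-class F1 is the harmonic mean of that class's precision and recall, and the macro F1 is the unweighted average of the two class F1 values. A quick check against the PoolModel (G = 50) row (the small macro-F1 gap is rounding in the published per-class numbers):

```python
# Reproduce the PoolModel (G = 50) row from the comparison table.
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_malware = f1(precision=40.55, recall=87.42)  # table: 55.40
f1_normal = f1(precision=93.40, recall=58.14)   # table: 71.67
macro_f1 = (f1_malware + f1_normal) / 2         # table: 63.54

assert round(f1_malware, 2) == 55.40
assert round(f1_normal, 2) == 71.67
# Matches the published macro F1 to within rounding of the per-class values.
assert abs(macro_f1 - 63.54) < 0.02
```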
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Kim, E.-J.; Jeong, Y.-S. Efficient Malware Detection in HWP Byte Sequences Using Pooling-Based Model. Appl. Sci. 2025, 15, 11525. https://doi.org/10.3390/app152111525

