Efficient Malware Detection in HWP Byte Sequences Using Pooling-Based Model

Kim, Eun-Jin; Jeong, Young-Seob

doi:10.3390/app152111525

Open Access Communication

by Eun-Jin Kim and Young-Seob Jeong *

Department of Computer Engineering, Chungbuk National University, Cheongju 28644, Republic of Korea

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(21), 11525; https://doi.org/10.3390/app152111525

Submission received: 6 October 2025 / Revised: 27 October 2025 / Accepted: 27 October 2025 / Published: 28 October 2025

(This article belongs to the Special Issue Application of Deep Learning for Cybersecurity)

Abstract

As data exchange over wired and wireless networks continues to increase, the damage caused by malicious activities hidden in the data is also rising. In particular, malicious actions embedded in document files (e.g., PDF, HWP) are difficult to detect, and users are often careless when opening such files, which leaves them highly vulnerable. This study proposes a novel deep learning model that directly analyzes byte sequences to detect malicious actions embedded in HWP documents. Most previously proposed detection models have relied on convolutional neural networks, whereas our model uses no convolutional layers and employs two pooling layers instead. For the experiments, we constructed a new dataset by sampling byte sequences from HWP files, and our model achieved a macro F1 score of 63.54%, which is higher than that of existing models. This result demonstrates that our model is not only efficient but also achieves higher malware detection performance, implying that it is more practical for real-world malware detection services, given the numerous document files encountered in everyday use.

Keywords: malware detection; byte sequence; pooling layer; HWP dataset

1. Introduction

As vast amounts of data are exchanged over the Internet, the importance of security over wired and wireless networks is growing every day. In particular, most Web users are exposed to the threat of malware embedded in document files that can be downloaded from various Web forums or blogs, as well as in documents attached to emails. Because users often open document files such as PDF, MS Office, and Hangul Word Processor (HWP) without much suspicion, the potential risk posed by these files can be significant.

Methods for detecting malicious actions embedded in document files can be divided into static and dynamic approaches. The dynamic approach, which involves running a file and observing its behavior, requires a separate sandbox environment. As a result, it suffers from relatively low reproducibility and high costs when applied to real-world services. For this reason, recent studies have proposed models that perform malware detection using a static approach. The static approach directly analyzes files without executing programs or opening non-executable files. In particular, deep learning models have been proposed to directly analyze the byte sequences contained in document files in order to detect the presence of malicious actions.

In this study, we propose a novel model that performs malware detection by directly analyzing the byte sequences contained in document files, with a particular focus on Hangul Word Processor (HWP) files. HWP is a document editing tool widely used by government agencies and public institutions in Korea, and it has been continuously targeted by various cyberattacks from nations and organizations such as North Korea. For this reason, ongoing research on detecting malware embedded in HWP documents is essential to safeguard national institutions and public assets in South Korea.

The malware detection task using byte sequences can be formulated as a binary classification problem. As artificial intelligence models have been adopted in many areas [1], a variety of machine learning (ML) models have been applied to classification tasks, including support vector machines (SVMs) [2,3], logistic regression (LR), decision trees (DTs), random forests (RFs) [4,5,6], and XGBoost [7,8]. Although these ML models have achieved strong classification performance in various domains, they share a common limitation: they require considerable effort to define features for optimal performance. Deep learning models address this limitation by automatically extracting features from data. For instance, recurrent neural networks (RNNs) [9,10] are designed to effectively capture sequential features in datasets, while convolutional neural networks (CNNs) [11,12] excel at analyzing local features.

Previous studies on malware detection using byte sequences of non-executable files (e.g., HWP) have mainly adopted convolutional neural network (CNN) architectures. CNNs require relatively few computational resources and are effective at capturing local patterns, making them well suited for detecting malicious actions. Jeong et al. [13] designed a CNN to predict potential malicious actions within byte sequences of PDF files and achieved F1 scores of 95–97%. Raff et al. [14] introduced MalConv, a shallow and wide CNN architecture designed for efficient training and inference. MalConv showed that a shallow and wide convolutional layer effectively captures local patterns in a long sequence. Jeong et al. [15] proposed SPAPConv, a CNN-based architecture with spatial pyramid pooling that processes byte sequences of arbitrary length and produces embedded vectors of a predefined dimension, achieving F1 scores of 92–95% for malware detection on HWP byte sequences. The spatial pooling layer is a key component of SPAPConv, as it enables the model to efficiently generate a fixed-dimensional representation from input sequences of arbitrary length. Jeong et al. [16] suggested an approach that aggregates multiple sequence-level prediction models, yielding improved file-level detection performance. They also demonstrated that a prediction model can perform well on byte sequences from different file formats. Luo et al. [17] proposed a combination of a GRU [18] and a convolutional block attention module (CBAM) for malware detection using byte sequences of executables (e.g., .exe, .dll). Although these previous studies using convolutional layers have achieved strong detection performance, they become impractical or inefficient for long byte sequences because the model size increases with the input length. Only limited efforts have been made to address this efficiency issue.

Recently, a few studies have attempted to build pre-trained models or employ the Transformer architecture [19] to improve performance on malware detection tasks. Pre-trained models learn general knowledge or patterns from large collections of samples, and this knowledge can be transferred to downstream tasks such as detecting malicious network attacks or identifying malware in files. For example, motivated by Bidirectional Encoder Representations from Transformers (BERT) [20], Rahali and Akhloufi [21] proposed a new Transformer encoder-based pre-trained model, MalBERT, for malware detection. Nichols et al. [22] leveraged pre-trained convolutional models for image-based malware detection, while Zhong and Zhang [23] applied pre-trained Generative Pre-trained Transformer (GPT) models [24,25,26] to malware detection in executables. Although these studies have demonstrated the potential of the Transformer architecture, they share a common limitation: high computational costs make them impractical for efficient malware detection. Moreover, because byte sequences can be much longer than other types of data, the computational cost of these models is an even greater concern. As a result, even though some recent studies have proposed alternative architectures (e.g., linear Transformer [27], state-space models [28]) for efficient construction of pre-trained models, no studies have yet constructed a pre-trained model for malware detection using byte sequences.

Figure 1 illustrates the malware detection task based on byte sequences. Byte streams containing one or more malicious actions are regarded as malicious, while those without any malicious actions are considered benign (normal). Detection models are trained on byte sequences sampled from these streams. Since a single non-executable file can contain as few as dozens and as many as thousands of byte sequences, it is essential to design an efficient detection model. There are several challenges to achieving efficient detection using byte sequences. First, byte sequences vary in length, making it difficult to design efficient detection models. A common approach to address this issue is to force the model to take input sequences of a fixed length, but this is neither efficient nor practical for real-world applications. Second, if a model employs a shallow architecture with a large number of parameters (as in the model proposed by [14]) to handle the first issue, it may still be impractical due to its inefficiency. In other words, large models can require significantly more time and memory to analyze byte sequences, whereas ordinary Web users are unlikely to tolerate high computational costs or long delays (e.g., several minutes) each time they open attached non-executable files.

The objective of this paper is to efficiently detect malicious actions in HWP byte sequences. This paper makes two main contributions. First, we propose a new efficient neural network architecture for malware detection in HWP byte sequences. Unlike previous studies that incorporate embedding, convolutional, pooling, and fully connected (FC) layers to enhance detection performance, our model consists only of an embedding layer, two pooling layers, and a fully connected layer, with no convolutional layers. This design enables our model to use far fewer parameters and achieve higher efficiency in both training and inference. Second, we construct and release a new dataset for malware detection using HWP byte sequences. Experimental results on this dataset demonstrate that our model is not only efficient but also achieves superior detection performance. An efficient malware detection model is advantageous for model improvement and deployment and can ultimately benefit the cybersecurity industry and public administration sectors.

The remaining sections are organized as follows. The Materials and Methods Section introduces the data and materials used in this study and provides detailed descriptions of the proposed method. The Results Section presents the experimental setup and findings, along with a brief analysis of the results. The Discussion Section addresses theoretical and practical implications and explains why the proposed method achieved the best performance. Finally, the Conclusions Section summarizes this work and discusses its limitations.

2. Materials and Methods

2.1. Materials

Only a few publicly available resources exist for malware detection using byte sequences. For example, Jeong et al. [13] released a dataset of PDF byte sequences, https://sites.google.com/view/datasets-for-public/ (accessed on 1 October 2025), and Jeong et al. [16] provided byte sequences of MS Office documents (e.g., MS Word). The only dataset for malware detection on HWP byte sequences was introduced by Jeong et al. [29], and it is available upon request. We did not use this HWP dataset because previous models had already achieved sufficient performance on it (e.g., an F1 score of 93%), which we believe was largely due to the relatively simple distribution of the dataset. For our experiments, we obtained the original HWP files from Jeong et al. [29] and created a new, larger set of byte sequences by following the sampling strategy suggested by Jeong et al. [29]. Compared with the previous dataset (13 K samples), our constructed dataset contains approximately 93 K samples, and each sample is a sequence of 1000 bytes. We split the dataset into training and test sets, and its statistics are summarized in Table 1. The machine used for our experiments has two NVIDIA GeForce RTX 4090 GPUs, 500 GB of RAM, and an AMD Ryzen Threadripper PRO 5955WX 16-core CPU.

2.2. Method

Figure 2 depicts the architecture of our proposed model. Our model consists of four layers: an embedding layer $L_{em}$, a spatial pooling layer $L_{sp}$, a global pooling layer $L_{gp}$, and a fully connected (FC) layer $L_{fc}$. For a given input byte sequence of length $S$ (see the bottom of Figure 2), the embedding layer maps each byte to a representation in the $E$-dimensional embedding space, which allows the model to encode semantic information for each byte. Every example (i.e., a byte sequence) in the dataset used in this paper has the same length, 1000, so the examples can be fed into the model without any pre-processing. Equation (1) describes this embedding step, where $x$ is the input sequence of length $S$ and $H_{1} \in \mathbb{R}^{E \times S}$ is the output matrix of the embedding layer. Each column vector of $H_{1}$ is an $E$-dimensional embedding representation of the corresponding byte in the input sequence $x$. Unlike the categorical byte values of the input sequence, the embedding representation allows the model to learn semantic patterns in byte sequences.

$$H_{1} = L_{em}(x) \quad (1)$$

The matrix $H_{1}$ is passed to the spatial pooling layer $L_{sp}$, which yields a matrix $H_{2} \in \mathbb{R}^{E \times G}$, where $G$ is the number of segments, as shown in Equation (2). $L_{sp}$ divides the sequence of embedding vectors into segments, where the number of segments $G$ is a hyper-parameter. For each segment, $L_{sp}$ creates an $E$-dimensional vector through a pooling mechanism. Such spatial pooling [15] always produces an $E \times G$ matrix regardless of the sequence length $S$. Theoretically, this allows the model to take input sequences of arbitrary length. The $E$-dimensional vector of each segment carries a summary of all embedding vectors within the segment. The spatial pooling layer dramatically reduces the model size (i.e., the number of parameters), but it may lose important clues in the embedding matrix if an unsuitable pooling function is chosen. Based on the findings of the previous study [15], we chose the average function for spatial pooling, which computes a mean vector for each segment.

$$H_{2} = L_{sp}(H_{1}) \quad (2)$$
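The segment-wise averaging described above can be illustrated with a minimal pure-Python sketch. The function name `spatial_avg_pool` and the rule used to place segment boundaries (a floor/ceil split, so that segments cover the whole sequence even when $S$ is not divisible by $G$) are assumptions of this sketch; the text does not specify how uneven segments are handled.

```python
def spatial_avg_pool(h1, g):
    """Average-pool S embedding vectors into g segment vectors.

    h1: list of S embedding vectors, each a list of E floats
        (the columns of the E x S matrix H_1).
    g:  number of segments G, with 1 <= g <= S.
    Returns a list of g mean vectors (the columns of H_2).
    """
    s = len(h1)
    if not 1 <= g <= s:
        raise ValueError("g must satisfy 1 <= g <= S")
    e = len(h1[0])
    h2 = []
    for i in range(g):
        # Assumed boundary rule: floor for the start, ceil for the end.
        start = (i * s) // g
        end = ((i + 1) * s + g - 1) // g
        segment = h1[start:end]
        h2.append([sum(v[d] for v in segment) / len(segment) for d in range(e)])
    return h2


# Four 2-dimensional embedding vectors pooled into two segments.
h1 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(spatial_avg_pool(h1, 2))  # [[2.0, 3.0], [6.0, 7.0]]
```

Whatever the input length, the output always has exactly `g` vectors, which is the property that lets the following layers stay independent of $S$.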

The matrix $H_{2}$ is passed to another pooling layer, $L_{gp}$, which extracts globally pooled representative values. The global pooling layer produces a matrix $H_{3} \in \mathbb{R}^{E \times 1}$, as shown in Equation (3). Note that $H_{3}$ is a single $E$-dimensional vector regardless of the number of segments $G$, which allows it to be connected to the fully connected layer. We use a max function in the global pooling layer, meaning that, for each embedding dimension, the maximum value across the $G$ segments is selected. While the average function in the previous layer, $L_{sp}$, summarizes patterns within each segment, the max function in $L_{gp}$ captures prominent patterns that may indicate malicious actions. Therefore, we expect $H_{3}$ to convey a summary of all prominent patterns of the given input sequence in the embedding space.

$$H_{3} = L_{gp}(H_{2}) \quad (3)$$
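Written element-wise, Equations (2) and (3) can be sketched as follows. This assumes, for simplicity, that $S$ is divisible by $G$ so that every segment holds $S/G$ consecutive bytes; the index names $e$, $g$, and $s$ are introduced here for illustration only.

```latex
H_{2}[e, g] = \frac{G}{S} \sum_{s = (g-1)S/G + 1}^{gS/G} H_{1}[e, s],
\qquad
H_{3}[e] = \max_{1 \le g \le G} H_{2}[e, g],
\qquad e = 1, \dots, E .
```

The first expression is the per-segment mean computed by $L_{sp}$, and the second is the per-dimension maximum over the $G$ segment vectors computed by $L_{gp}$.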

The output matrix $H_{3}$ is finally delivered to the fully connected (FC) layer $L_{fc} \in \mathbb{R}^{2 \times E}$, generating a 2-dimensional vector, as shown in Equation (4). The FC layer $L_{fc}$ is a $2 \times E$ matrix that converts the $E$-dimensional summary vector $H_{3}$ into 2-dimensional logits. Through this layer, the model classifies the input sequence $x$ as malware or normal (benign). Note that the model has no convolutional layers; it has only the embedding layer, two pooling layers, and an FC layer. This simple architecture gives it far fewer parameters than the previously dominant convolutional neural networks such as MalConv [14] and SPAPConv [15].

$$\hat{y} = L_{fc}(H_{3}) \quad (4)$$
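Equations (1) to (4) can be chained into a single forward pass. The following is a minimal pure-Python sketch, not the authors' implementation: the function name `forward`, the bias term `b` in the FC layer, the random initialization, and the segment-boundary rule are all assumptions made for illustration.

```python
import random


def forward(x, emb, w, b, g):
    """Embedding -> spatial average pooling -> global max pooling -> FC.

    x:   list of S byte values in [0, 255]
    emb: 256 x E embedding table (one row per byte value)
    w:   2 x E weight matrix of the FC layer
    b:   list of 2 biases (assumed; the text does not mention a bias)
    g:   number of segments G
    """
    s, e = len(x), len(emb[0])
    h1 = [emb[byte] for byte in x]                       # Eq. (1): S vectors of size E
    h2 = []
    for i in range(g):                                   # Eq. (2): mean per segment
        seg = h1[(i * s) // g:((i + 1) * s + g - 1) // g]
        h2.append([sum(v[d] for v in seg) / len(seg) for d in range(e)])
    h3 = [max(col[d] for col in h2) for d in range(e)]   # Eq. (3): max over segments
    return [sum(w[k][d] * h3[d] for d in range(e)) + b[k] for k in range(2)]  # Eq. (4)


rng = random.Random(0)
E, G = 8, 5  # toy sizes; the paper uses E = 512
emb = [[rng.gauss(0, 1) for _ in range(E)] for _ in range(256)]
w = [[rng.gauss(0, 1) for _ in range(E)] for _ in range(2)]
b = [0.0, 0.0]

# The same parameters handle inputs of different lengths.
short_seq = [rng.randrange(256) for _ in range(100)]
long_seq = [rng.randrange(256) for _ in range(1000)]
print(len(forward(short_seq, emb, w, b, G)), len(forward(long_seq, emb, w, b, G)))
```

Only `emb`, `w`, and `b` hold trainable values; the two pooling steps add no parameters, which is why the model size does not depend on the input length or on `g`.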

The model architecture is based on our assumption that two pooling layers are sufficient to capture malicious actions. That is, malicious actions can occur anywhere and may be spread across byte streams; therefore, consecutive pooling layers alone, without convolutional layers, might be better suited to capturing them. Specifically, the spatial pooling layer summarizes potential clues of malicious activity by averaging the embedding vectors, while the global pooling layer identifies the most significant patterns by selecting the maximum values. Furthermore, the consecutive pooling layers greatly reduce the embedding matrix, making the model more efficient than CNNs. Although the main objective of our model is better efficiency, the empirical results show that it is also effective in malware detection.
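A back-of-envelope parameter count makes the efficiency argument concrete. The sketch below uses the vocabulary size (256) and embedding dimension (512) given in the Results section; the helper name `pooling_model_params` and the presence of a bias term in the FC layer are assumptions.

```python
def pooling_model_params(vocab_size=256, emb_dim=512, num_classes=2, bias=True):
    """Count trainable parameters of an embedding + pooling + FC model.

    The two pooling layers contribute nothing, so the total is independent
    of the sequence length S and of the number of segments G.
    """
    embedding = vocab_size * emb_dim
    fc = emb_dim * num_classes + (num_classes if bias else 0)
    return embedding + fc


print(pooling_model_params())  # 132098
```

This estimate of roughly 132 K is in the same range as the approximately 134 K reported in the Results section; the remaining gap presumably reflects implementation details not stated in the text. Nearly all of the parameters sit in the embedding table.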

3. Results

We conducted experiments using our constructed dataset, with the training and test splits shown in Table 1. Since the input is a byte sequence, the vocabulary consists of 256 byte values. The baseline models are MalConv and SPAPConv. MalConv employs a wide and shallow CNN architecture, while SPAPConv uses spatial average pooling to reduce the model size and accept input sequences of arbitrary length. All models were optimized using the Adam optimizer [30] with a cross-entropy loss function, and a class weight strategy was applied to address class imbalance in the dataset. We used 10% of the training samples as a separate validation set, and the optimal number of epochs for MalConv and SPAPConv was approximately 42–44, determined based on validation loss. For all other hyperparameters, we followed the settings suggested in the respective papers. For our model, the initial learning rate was set to 0.001, the number of epochs was around 45, the embedding dimension was 512, and the mini-batch size was 512. The trainable parameters were initialized using the He initialization algorithm [31]. Figure 3 depicts the learning curve of our model. All experimental results are averages over three independent runs.
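The text does not say which class weight strategy was used. As one common possibility, the sketch below computes inverse-frequency ("balanced") weights and applies them to a per-sample cross-entropy loss. The function names, the label encoding (0 = normal, 1 = malware), and the toy label counts are all hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_cross_entropy(logits, label, weights):
    """Cross-entropy of one sample from raw logits, scaled by its class weight."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return weights[label] * (log_z - logits[label])


# Toy imbalanced label set: three samples of class 0, one of class 1.
weights = inverse_frequency_weights([0, 0, 0, 1])
print(weights)
print(weighted_cross_entropy([0.0, 0.0], 1, weights))
```

Under this scheme, errors on the rarer class contribute more to the loss, which counteracts the tendency of a classifier to favor the majority class.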

Table 2 shows the averaged per-class performance of our model. Because our model has a spatial average pooling layer, its performance may depend on the number of segments $G$. If $G$ is too large, the layer may fail to deliver informative patterns because each segment covers only a tiny fraction of the sequence. For example, when $G = S$, this layer simply passes the input to the output without considering adjacent bytes. On the other hand, when $G$ is too small, the layer may fail to summarize local patterns in the corresponding segment due to a bottleneck problem; that is, the $E$-dimensional vector may not be enough to represent the complicated patterns observable within the segment. For example, if $G = 1$, this layer is identical to a global pooling layer and will fail to convey the local patterns in a long sequence. To examine how much the number of segments $G$ affects performance, we varied $G$ from 25 to 150 and found that the model performs best when $G = 50$, meaning that the spatial average pooling layer worked well with segments of 20 bytes each (1000 bytes divided into 50 segments). The value of $G$ should be chosen with the input length and available computational resources in mind, to balance efficiency and effectiveness.
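The relationship between $G$ and the segment size is simple arithmetic, sketched below for the 1000-byte sequences used in this study. The helper name `segment_length` is hypothetical, and the listed values of $G$ are illustrative picks within the tested range of 25 to 150, not necessarily the exact grid used in the experiments.

```python
def segment_length(seq_len, num_segments):
    """Average number of bytes summarized by each segment."""
    return seq_len / num_segments


S = 1000  # sequence length of every sample in the dataset
for G in (25, 50, 100, 150):
    print(f"G = {G:>3}: about {segment_length(S, G):.1f} bytes per segment")
```

With $G = 50$, each segment vector summarizes 20 consecutive bytes; at the two extremes, $G = S$ gives one byte per segment and $G = 1$ gives a single segment spanning the whole sequence.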

We also compared our model with previous CNN models, namely MalConv and SPAPConv, and Table 3 summarizes the results. In general, all models give poor precision on the ‘malware’ class and poor recall on the ‘normal’ class. This indicates that the models tend to predict ‘malware’ more frequently overall, which can be attributed to the imbalance of the dataset. Note that both SPAPConv and our model utilize spatial average pooling; for a fair comparison, we set the number of segments $G = 50$ for both. SPAPConv exhibits better performance than MalConv, which is consistent with the previous study [15]. Our model, PoolModel, achieved the best F1 score on the normal class and the best macro F1 score, even though its size (i.e., the number of parameters) is the smallest. When G = 100, the number of trainable parameters of MalConv, SPAPConv, and our model is approximately 298 K, 278 K, and 134 K, respectively, indicating that our model requires fewer computational resources and less training time. This demonstrates that PoolModel is not only efficient but also effective in the malware detection task. The p-values of the macro F1 scores from the paired t-test are 0.0053 and 0.1751 for the comparisons with MalConv and SPAPConv, respectively.
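For reference, the macro F1 score used for comparison is the unweighted mean of the per-class F1 scores. The following is a minimal sketch with a hypothetical function name, a hypothetical label encoding (0 = normal, 1 = malware), and toy predictions that are not taken from the paper.

```python
def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example: three malware samples and one normal sample.
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
print(round(macro_f1(y_true, y_pred), 4))  # 0.7333
```

Because each class contributes equally regardless of its size, macro F1 penalizes a model that does well only on the majority class, which makes it a reasonable summary metric for an imbalanced dataset.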

4. Discussion

Even though the experimental results show that our model is more efficient and effective than previous models (e.g., MalConv and SPAPConv), several issues remain to be discussed. First, in real-world services, detecting as many malicious actions as possible may be more important than achieving high overall accuracy. From this viewpoint, as shown in Table 3, SPAPConv may be the best choice for deployment because it achieves the highest recall for the malware class, while PoolModel performs slightly worse. The superior recall for the malware class achieved by SPAPConv and PoolModel may stem from the spatial average pooling mechanism. Since PoolModel is much more efficient than SPAPConv and its macro F1 score is the best, the choice between them can be made based on the available computational resources. Second, as described in the previous section, the number of trainable parameters in PoolModel is far smaller than in the other models, but it remains unclear how much faster it is in practice. To investigate this, we measured the running time on the test set and observed that MalConv, SPAPConv, and PoolModel took around 3.5, 7.4, and 1.6 s, respectively. This shows that PoolModel runs much faster than the CNNs, as it does not involve any convolution operations. We believe that PoolModel can be easily deployed in a real-world malware detection service because it has fewer parameters (i.e., it is a smaller model) and can therefore be trained and deployed faster. As the model is trained on byte sequences and is not designed for generation, we do not expect its deployment to raise ethical issues. Third, the best macro F1 score achieved was only 63.54%, likely due to the highly complex patterns present in our dataset. By making our dataset publicly available, we hope to give subsequent studies an opportunity to advance malware detection using HWP byte sequences.

A question remains as to how our model could achieve better F1 scores than existing CNNs even though it does not include any convolutional operations. Previous studies have utilized CNNs based on the hypothesis that malicious actions can be captured as local patterns. In contrast, we assumed that malicious actions are distributed throughout the byte streams, and thus that only a small number of them can be effectively captured as local patterns by convolutional operations. Convolutional operations mainly focus on local rather than global or dispersed patterns. In addition, CNNs often become dependent on the input length; in other words, most CNNs are designed to process input sequences of a fixed length. Our model addresses these limitations by replacing convolutional operations with consecutive pooling layers. The first pooling layer summarizes the patterns within each segment, while the second pooling layer captures globally prominent suspicious patterns. This architecture not only enables effective detection of malicious actions but also allows the model to handle input sequences of arbitrary length. We plan to extend our model to operate at the file level following the approach of Jeong et al. [16].

This paper has a few limitations. First, as shown in the results of Table 2, the performance of the proposed model is affected by the hyperparameter G. Therefore, depending on the experimental settings or datasets, it is necessary to find the optimal value of G to achieve good performance. Second, it is necessary to measure how the performance of the proposed model varies with model size. Due to the complexity of the dataset introduced in this study, the overall performance of all models is relatively low; however, increasing the model size is expected to improve performance. Third, this paper does not consider any adversarial scenario. Future work should investigate whether the model is robust to slight modifications of byte sequences designed to evade detection.

5. Conclusions

In this paper, we proposed a new efficient neural network architecture that consists of an embedding layer, two pooling layers, and a fully connected layer. This design differs substantially from many previous studies (e.g., MalConv and SPAPConv), which have mostly employed convolutional neural networks. Convolutional layers have been adopted because of their superior ability to capture local patterns in the input sequence. However, malicious actions can occur anywhere and may be spread across byte streams, so we assumed that pooling layers alone might be better suited to capturing suspicious activities. Theoretically, a pooling layer without a convolutional layer may lose important local patterns, but we addressed this issue by using two different pooling layers. The first is a spatial pooling layer that extracts potential ‘local’ clues of malicious activity, while the second is a global pooling layer that captures the most suspicious patterns. This architecture of two consecutive pooling layers achieved the best macro F1 score. Furthermore, in practice, the consecutive pooling layers gave our model far fewer parameters than the other models. We constructed a new dataset by sampling byte sequences from HWP files, and empirical results on this dataset demonstrated that our model is not only efficient but also effective. We believe the main limitation of this paper is the small size of the model. Since there is still room for performance improvement, we plan to construct larger models based on the consecutive pooling layers and to explore additional strategies, including the development of large pre-trained models based on byte sequences. We also plan to extend this work to other file formats (e.g., PDF), so that it may help users worldwide who rely on different applications.

Author Contributions

Conceptualization, E.-J.K. and Y.-S.J.; methodology, Y.-S.J.; validation, E.-J.K. and Y.-S.J.; resources, Y.-S.J.; data curation, Y.-S.J.; writing—original draft preparation, E.-J.K. and Y.-S.J.; supervision, Y.-S.J.; funding acquisition, Y.-S.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (RS-2020-NR049604).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset in this paper is available at https://naver.me/5asK0pSb (accessed on 1 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Piersigilli, P.; Citroni, R.; Mangini, F.; Frezza, F. Electromagnetic Techniques Applied to Cultural Heritage Diagnosis: State of the Art and Future Prospective: A Comprehensive Review. Appl. Sci. 2025, 15, 6402.
2. Burges, C.J.C. A Tutorial on Support Vector Machines for Pattern Recognition. Data Min. Knowl. Discov. 1998, 2, 121–167.
3. Hollmann, N.; Müller, S.; Purucker, L.; Krishnakumar, A.; Körfer, M.; Hoo, S.B.; Schirrmeister, R.T.; Hutter, F. Accurate predictions on small data with a tabular foundation model. Nature 2025, 637, 319–326.
4. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
5. Wang, Y.; Wu, H.; Nettleton, D. Stability of Random Forests and Coverage of Random-Forest Prediction Intervals. In Proceedings of the 37th Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023.
6. Ferry, J.; Fukasawa, R.; Pascal, T.; Vidal, T. Trained random forests completely reveal your dataset. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024.
7. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
8. Bohacek, M.; Bravansky, M. When XGBoost Outperforms GPT-4 on Text Classification: A Case Study. In Proceedings of the 4th Workshop on Trustworthy Natural Language Processing, Mexico City, Mexico, 21 June 2024; pp. 51–60.
9. Medsker, L.; Jain, L.C. Recurrent Neural Networks: Design and Applications; CRC Press: Boca Raton, FL, USA, 1999.
10. Sunagar, P.; Sowmya, B.J.; Pruthviraja, D.; Supreeth, S.; Mathew, J.; Rohith, S.; Shruthi, G. Hybrid RNN Based Text Classification Model for Unstructured Data. SN Comput. Sci. 2024, 5, 726.
11. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324.
12. Taherkhani, A.; Cosma, G.; McGinnity, T.M. A Deep Convolutional Neural Network for Time Series Classification with Intermediate Targets. SN Comput. Sci. 2023, 4, 832.
13. Jeong, Y.S.; Woo, J.; Kang, A.R. Malware Detection on Byte Streams of PDF Files Using Convolutional Neural Networks. Secur. Commun. Netw. 2019, 2019, 8485365.
14. Raff, E.; Barker, J.; Sylvester, J.; Brandon, R.; Catanzaro, B.; Nicholas, C. Malware detection by eating a whole EXE. In Proceedings of the Workshops of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 268–276.
15. Jeong, Y.S.; Woo, J.; Lee, S.; Kang, A.R. Malware Detection of Hangul Word Processor Files Using Spatial Pyramid Average Pooling. Sensors 2020, 20, 5265.
16. Jeong, Y.S.; Mswahili, M.E.; Kang, A.R. File-level malware detection using byte streams. Sci. Rep. 2023, 13, 8925.
17. Luo, X.; Fan, H.; Yin, L.; Jia, S.; Zhao, K.; Yang, H. CAG-Malconv: A Byte-Level Malware Detection Method With CBAM and Attention-GRU. IEEE Trans. Netw. Serv. Manag. 2024, 21, 5859–5872.
18. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. In Proceedings of the NIPS 2014 Deep Learning and Representation Learning Workshop, Montreal, QC, Canada, 12–13 December 2014; pp. 1–9.
19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
20. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
21. Rahali, A.; Akhloufi, M.A. MalBERT: Using Transformers for Cybersecurity and Malicious Software Detection. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Melbourne, Australia, 17–20 October 2021; pp. 3226–3231.
22. Nichols, T.; Zemlanicky, J.; Luo, Z.; Li, Q.; Zheng, J. Image-based PDF Malware Detection Using Pre-trained Deep Neural Networks. In Proceedings of the 12th International Symposium on Digital Forensics and Security, San Antonio, TX, USA, 29–30 April 2024; pp. 1–5.
23. Zhong, W.; Zhang, X. Multi-Level Generative Pretrained Transformer for Improving Malware Detection Performance. In Proceedings of the 7th International Conference on Artificial Intelligence and Big Data, Chengdu, China, 24–27 May 2024; pp. 99–104.
24. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 1 October 2025).
25. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models are Unsupervised Multitask Learners. 2019. Available online: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (accessed on 1 October 2025).
26. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. 2020. Available online: https://arxiv.org/abs/2005.14165 (accessed on 1 October 2025).
27. Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. In Proceedings of the 37th International Conference on Machine Learning, Virtual Event, 13–18 July 2020; pp. 5156–5165.
28. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2024, arXiv:2312.00752.
29. Jeong, Y.S.; Woo, J.; Kang, A.R. Malware Detection on Byte Streams of Hangul Word Processor Files. Appl. Sci. 2019, 9, 5178.
30. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015; pp. 1–15.
He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar]

Figure 1. Malware detection for non-executables using byte sequences. The detection model takes a byte sequence as an input and gives a prediction result (malware or normal).

Figure 2. Proposed architecture, where E is the embedding dimension, S is the input sequence length, and G is the number of segments for spatial pooling. Note that this model does not have any convolutional layers.
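
As a rough illustration of what pooling over G segments can look like, the Python sketch below splits an embedded sequence of shape S × E into G contiguous segments and averages each one, giving a G × E summary without any convolution. It is a minimal sketch under stated assumptions: the helper name `segment_average_pool`, the use of average pooling, and the way segment boundaries are computed are hypothetical choices for illustration and are not taken from the authors' implementation.

```python
from typing import List

Vector = List[float]


def segment_average_pool(embedded: List[Vector], num_segments: int) -> List[Vector]:
    """Average-pool an S x E sequence into num_segments x E.

    Hypothetical helper: an illustration of segment-wise pooling,
    not the model described in the paper.
    """
    seq_len = len(embedded)
    if num_segments <= 0 or seq_len < num_segments:
        raise ValueError("need 1 <= num_segments <= sequence length")
    emb_dim = len(embedded[0])
    pooled = []
    for g in range(num_segments):
        # Integer boundaries spread any remainder across the segments,
        # so every segment holds at least one position.
        start = g * seq_len // num_segments
        end = (g + 1) * seq_len // num_segments
        segment = embedded[start:end]
        pooled.append([
            sum(vec[d] for vec in segment) / len(segment)
            for d in range(emb_dim)
        ])
    return pooled


# Toy input: S = 6 positions, E = 2 embedding dimensions, G = 3 segments.
toy = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0], [9.0, 8.0], [11.0, 10.0]]
pooled = segment_average_pool(toy, num_segments=3)
print(pooled)  # [[2.0, 1.0], [6.0, 5.0], [10.0, 9.0]]
```

The output length depends only on G and E, not on S, which is one reason segment-wise pooling is convenient for byte sequences of varying length.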

Figure 3. Learning curve of our model when G = 50, where the x-axis represents the number of steps, and the blue and red lines show the training loss and the smoothed validation loss, respectively. The optimal number of steps lies between 6000 and 7000.

Table 1. The number of byte sequences for training and test splits.

	Benign (Normal)	Malicious	Total
Train	57,515	28,857	86,372
Test	5429	1773	7202
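
The class balance implied by Table 1 can be checked with a few lines of Python. This is only a small arithmetic sketch over the counts in the table; the dictionary layout and variable names are illustrative assumptions.

```python
# Counts copied from Table 1.
counts = {
    "train": {"benign": 57515, "malicious": 28857},
    "test": {"benign": 5429, "malicious": 1773},
}

malicious_share = {}
for split, c in counts.items():
    total = c["benign"] + c["malicious"]
    # Percentage of byte sequences labeled malicious in this split.
    malicious_share[split] = 100 * c["malicious"] / total
    print(f"{split}: total={total}, malicious={malicious_share[split]:.1f}%")
```

Malicious sequences make up roughly a third of the training split and roughly a quarter of the test split, so the two classes are imbalanced and per-class metrics are more informative than accuracy alone.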

Table 2. Averaged per-class precision, recall, and F1 score of PoolModel (ours) for different numbers of segments G.

	Malware F1 (%)	Malware Precision (%)	Malware Recall (%)	Normal F1 (%)	Normal Precision (%)	Normal Recall (%)	Macro F1 (%)
G = 25	54.45	40.04	85.05	71.54	92.29	58.40	62.99
G = 50	55.40	40.55	87.42	71.67	93.40	58.14	63.54
G = 100	55.18	40.46	86.75	71.71	93.09	58.32	63.45
G = 150	54.20	39.75	85.15	71.12	92.27	57.86	62.66
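
To show how the columns relate, the sketch below recomputes the per-class F1 scores and the macro F1 for the G = 50 row using the standard definitions (F1 as the harmonic mean of precision and recall, macro F1 as the unweighted mean of the per-class F1 scores). Because the table reports values averaged over runs and rounded to two decimals, recomputing from the averaged precision and recall is expected to match the reported figures only approximately. The function name `f1_score` is an illustrative choice.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Averaged precision and recall for G = 50, copied from the table.
malware_f1 = f1_score(precision=40.55, recall=87.42)
normal_f1 = f1_score(precision=93.40, recall=58.14)

# Macro F1 weights both classes equally, regardless of class size.
macro_f1 = (malware_f1 + normal_f1) / 2

print(f"malware F1 = {malware_f1:.2f}")
print(f"normal F1  = {normal_f1:.2f}")
print(f"macro F1   = {macro_f1:.2f}")
```

The recomputed values agree with the reported 55.40, 71.67, and 63.54 to within rounding. The low malware precision combined with high malware recall indicates that the model flags most malicious sequences at the cost of many false alarms.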

Table 3. Averaged per-class precision, recall, and F1 score of MalConv, SPAPConv, and PoolModel (ours).

	Malware F1 (%)	Malware Precision (%)	Malware Recall (%)	Normal F1 (%)	Normal Precision (%)	Normal Recall (%)	Macro F1 (%)
MalConv	52.61	38.95	81.05	71.05	90.44	58.51	61.83
SPAPConv (G = 50)	54.87	39.49	89.90	69.46	94.36	54.98	62.16
PoolModel (G = 50)	55.40	40.55	87.42	71.67	93.40	58.14	63.54

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Kim, E.-J.; Jeong, Y.-S. Efficient Malware Detection in HWP Byte Sequences Using Pooling-Based Model. Appl. Sci. 2025, 15, 11525. https://doi.org/10.3390/app152111525

