1. Introduction
Malware [1,2,3,4,5,6,7,8,9,10] plays an important role in network attacks and has become one of the main threats to network security today. With the popularization of the Internet and the rapid development of information technology, network attacks have become increasingly diverse and harder to detect [11,12,13,14], and malware itself has grown more concealed and more varied [15,16,17,18]. In recent years, malware developers have adopted a growing number of adversarial techniques, such as code obfuscation and process hiding, to resist conventional analysis methods. Among them, packing has been widely used as a protective measure [19,20,21,22,23,24,25]. Packing makes malware analysis harder by encapsulating malicious code in encrypted or compressed data. Software that has been packed is called packed software, and identifying and unpacking it effectively is crucial. The protection that packing provides has therefore become a major challenge for network security analysts, and developing effective methods for identifying packed software is particularly important.
Although existing packed software recognition methods have achieved good results in some cases, most of them rely on existing training data to construct recognition models, using machine learning methods such as support vector machine (SVM) [26] and random forest (RF) [27], or deep network methods such as convolutional neural network (CNN) [28], long short-term memory network (LSTM), and VGG16 [29], and they focus on identifying known (seen) types of malicious software. Such strategies overlook the unknown (unseen) types of malware that may be encountered in practical applications, and therefore cannot address the opened-set problem. The opened-set problem refers to the inability of a system to identify unknown malware outside the training set, which creates security vulnerabilities. Identifying packed software effectively in an opened-set scenario has therefore become a key issue in improving the efficiency and accuracy of malware analysis.
To address this issue, this article clarifies the closed-set and opened-set recognition problems of packed software and designs a new evaluation metric, the seen/unseen (known/unknown) recall rate, to measure recognition ability on the two types. The metric accounts for both the efficient identification of seen malware and the ability to identify unseen types of malware, which traditional methods neglect in opened-set problems. This article also proposes a new multi-model recognition method that meets the recognition requirements of closed-set and opened-set packed software simultaneously. Experiments on a packed software dataset show that the proposed method has significant advantages in recognition accuracy and adaptability.
The contributions of this article are as follows.
- 1.
Through the sorting and analysis of previous methods for identifying packed software, combined with the existing environmental background of malware detection, an overlooked problem in the field of packed software identification, namely opened-set identification, was discovered. The closed-set and opened-set identification problems of packed software were compared and analyzed, and these two problems were clearly defined.
- 2.
Through detailed discussion of the opened-set problem, three solutions for opened-set identification of packed software were identified, and a comprehensive analysis showed that the multi-class model scheme is suitable for meeting the recognition requirements of both closed-set and opened-set packed software. A method for identifying unseen types of packed software based on the multi-class model scheme is proposed.
- 3.
To measure the ability of packed software recognition methods on the closed-set and opened-set recognition problems, an evaluation metric called the seen/unseen recall rate was designed. More macroscopic metrics were then derived from the seen and unseen recall.
- 4.
Through algorithm comparison experiments, parameter adjustment experiments, and scheme comparison experiments, the effectiveness of the proposed method was verified from three aspects.
Section structure:
Section 2 reviews related research.
Section 3 presents the motivation and the architecture of the proposed method.
Section 4 presents and analyzes the experimental results.
Section 5 discusses the limitations of the method.
Section 6 summarizes the work and looks ahead to future research directions.
Table 1 provides a list of abbreviations used throughout this paper.
Table 2.
Summary of related work.
| Method | Raw Data | Feature | Algorithm | Problem |
|---|---|---|---|---|
| Kim et al. [19] | Byte sequences | The first 15 bytes | SVM | PRC |
| Jung et al. [21] | Byte sequences | Byte entropy | GBoost | PRC |
| Li et al. [20] | CEGs | GNN | DNN | PRC |
| 2-SPIFF [22] | File attribute+FCGs | File+FCG | SVM+KNN | PRC+PRO |
| Mondon et al. [24] | Assembly codes+CFGs | Code+CFG | SMOTE SVM | PRC |
| PackHero [25] | CGs | Node+Signatures | GMN+Cluster | PRC |
2. Related Work
Packing is widely used by malware for anti-analysis, which seriously hinders traditional malware detection methods. Researchers have therefore proposed various techniques to improve the efficiency and accuracy of packed software recognition. Existing work falls mainly into two categories: recognition methods based on feature analysis and recognition methods based on graph analysis. The following reviews both categories and summarizes them.
2.1. Recognition Method Based on Feature Analysis
The recognition method based on feature analysis mainly extracts the byte sequence or other file features of malware and uses classification algorithms for recognition.
In 2019, Kim et al. [19] proposed a byte-based packed software recognition method, which uses the first 15 bytes of packed software as training data and builds an effective packed software recognition model with SVM.
In 2020, Jung et al. [21] proposed a packed software recognition method based on byte sequences, which analyzes the byte sequences of malicious software, extracts features from them, and classifies them. Unlike traditional signature detection methods, this method can recognize different packing tools, and experimental results show an accuracy of 91.6%. The method is particularly suitable for analyzing malware packed with different packing tools.
2.2. Recognition Method Based on Graph Analysis
The identification method based on graph analysis mainly analyzes and identifies malware by constructing graph structures such as control flow graphs (CFGs) or function call graphs (FCGs).
In 2019, Li et al. [20] proposed a method based on consistent executing graphs (CEGs) to identify the packing technique used by packed malware. This method maximizes semantic preservation and uses graph matching algorithms and graph kernel techniques for efficient recognition. The experimental results show that the method performs well on packed software with complex graph structures, but it requires a significant amount of computational resources.
In 2021, Liu et al. [22] proposed a two-stage packed software recognition method, 2-SPIFF, which combines FCGs and file attribute features. The first stage separates packed from non-packed files, and the second stage identifies the packing tool. In the experiments, the method achieved an accuracy of 99.80% for detecting packing and 98.49% for recognizing the packing technology.
In 2024, Mondon and Lemos [24] proposed a string obfuscation detection method based on CFGs and string encryption analysis. They designed an efficient detector that identifies string obfuscation in malware by combining assembly code features, control flow graphs, and directed graphs. The experimental results show that the method scored over 90% on all evaluation metrics, making it particularly suitable for analyzing malware with string obfuscation.
In 2025, Di Gennaro et al. [25] proposed PackHero, a scalable graph-based packed software recognition method. PackHero applies graph matching networks (GMNs) and clustering algorithms to the call graphs (CGs) of programs protected by different packing tools. The experimental results show that PackHero achieves a macro-average F1 score of 93.7% with only 10 samples, improving to 98.3% with 100 samples. PackHero is particularly adept at recognizing virtualization-based packing, outperforming existing signature detection tools.
2.3. Summary of Related Work
The analysis above reveals two main trends in existing packing identification methods: one analyzes byte sequences or file features, and the other builds a graph structure of the malware for identification. The former identifies packing types efficiently and accurately, but copes poorly with complex packing techniques. The latter models the complex behavior of malicious software and can identify more types of packing techniques, showing stronger adaptability, especially when facing new packing methods.
However, most existing research has focused on recognizing packing under closed-set conditions, and recognition in opened-set scenarios has not been fully explored. Only the 2-SPIFF method proposed by Liu et al. [22] has explored unseen packed software to some extent, but it has not yet solved the opened-set problem in depth. Future research therefore needs to pay more attention to improving the adaptability of models to unseen packing techniques, especially in opened-set environments. How to effectively identify newly emerging packing techniques remains an urgent open problem.
3. Motivation and Method
3.1. Problem Definition
Previous work defined packed software recognition as a supervised learning problem. This article extends that definition.
Packing, as a software protection technique, is not static; it is constantly updated and changed as needed. The same holds for packed malware, so models built by previous packed software detection methods, which target only the labels in the training dataset, are not suitable for real detection environments. Definitions 1 and 2 are therefore introduced so that the method architecture accounts for the real detection situation.
Definition 1. Packed software recognition on closed-set (PRC) problem. A packed software recognition problem in which the set of labels being tested is a subset of the labels in the training set; it is a supervised learning problem. The labels in the training set are called seen labels.
Definition 2. Packed software recognition on opened-set (PRO) problem. A packed software recognition problem in which the set of labels being tested is not a subset of the labels in the training set; it is an unsupervised learning problem. Labels that are not in the training set are called unseen (unknown) labels.
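The two definitions reduce to a subset test on label sets. The following minimal sketch illustrates that reading; the function name `problem_type` is hypothetical, and the packer names are taken from later sections purely as example labels.

```python
def problem_type(train_labels, test_labels):
    """Classify a train/test label pairing as PRC or PRO.

    PRC: every tested label also appears in training (subset).
    PRO: at least one tested label was never seen in training.
    """
    seen = set(train_labels)
    unseen = sorted(set(test_labels) - seen)
    kind = "PRC" if not unseen else "PRO"
    return kind, unseen


# Example labels only; any hashable label type would work.
train = ["Themida", "Enigma"]
print(problem_type(train, ["Themida"]))               # ('PRC', [])
print(problem_type(train, ["Themida", "eXPressor"]))  # ('PRO', ['eXPressor'])
```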
3.2. Solution Analysis
Previous packed software detection problems all belong to the PRC problem of Definition 1. The PRC problem assumes that the feature vector space is partitioned according to certain rules, and that packed software falling in the same partition belongs to a specified label. Figure 1 shows a schematic diagram of such a partition, where each number is a class number and each region shares a boundary with its neighbors without gaps. This is because supervised learning generally takes the label with the highest confidence as the predicted label, so every sample in the feature vector space lies either in exactly one region or on the boundary between several regions (for example, when two labels have equal confidence). Three solutions to the opened-set recognition problem for packed software are described below. Note that Figure 2, Figure 3, and Figure 4 show how each of the three schemes transforms the vector space of Figure 1; there is no direct causal relationship among them. Class 6 is not a special category but a relatively dense cluster in the high-dimensional space learned by the model, which makes it more prone to being enclosed by other classes.
3.2.1. Threshold Setting Solution
To detect unseen packed software, 2-SPIFF [22] first proposed a threshold-based unseen label recognition scheme, which adds gaps between the categories of Figure 1. When the highest confidence that the model assigns to a sample is below the threshold, the sample falls into a gap and is predicted as unseen packed software. Otherwise, it is judged as seen packed software of the class with the highest confidence. As shown in Figure 2, the light-colored areas represent unseen packed software, while the dark-colored areas correspond to the seen classes of packed software. The 2-SPIFF solution opened up ideas for solving the opened-set recognition problem for packed software. Two other solutions are introduced below.
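The threshold rule can be illustrated with a short sketch. It is not the 2-SPIFF implementation: the function name, the dictionary input format, and the threshold value 0.6 are all assumptions chosen for illustration.

```python
def predict_with_threshold(confidences, threshold):
    """Return the most confident seen label, or "unseen" if it is too weak.

    confidences: dict mapping each seen label to a confidence in [0, 1]
                 produced by some closed-set classifier (assumed input).
    threshold:   minimum confidence required to accept a seen label.
    """
    best_label = max(confidences, key=confidences.get)
    if confidences[best_label] < threshold:
        return "unseen"  # the sample falls into a gap between classes
    return best_label


print(predict_with_threshold({"class1": 0.90, "class2": 0.10}, 0.6))
print(predict_with_threshold({"class1": 0.40, "class2": 0.35, "class3": 0.25}, 0.6))
```

The first call accepts `class1`; the second rejects the sample as unseen because no class reaches the threshold.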
3.2.2. Single Model Solution
The advantage of the threshold setting solution is that it maintains high recognition accuracy for seen samples while providing recognition capability for unseen packed software. However, it is not an independent scheme designed for packed software of unseen classes, but an extension of the closed-set recognition scheme for packed software. The single model scheme treats all the packed software in the training set as one class and uses a single-class recognition model to recognize unseen packed software. As shown in Figure 3, the single model scheme does not rely on the assumption of a space partitioned into multiple classes; it only divides the feature space into a seen class space and an unseen class space based on the seen packed software.
3.2.3. Multi-Class Model Solution
Because classes differ in distribution and density, the single model scheme judges some edge samples of seen classes as unseen. The multi-class model scheme extends the single model scheme by applying it to each class independently. On the one hand, more packed software of seen classes is retained in the area covered by the models; on the other hand, compared with the single model scheme, the unseen packed software lying between classes is excluded from the seen regions. As shown in Figure 4, the multi-class model scheme increases the number of models to divide seen and unseen classes more accurately. Different colors represent the different training data sources of the models. The subsequent method framework therefore adopts the multi-class model solution.
3.3. Proposed Method
To solve the open-set recognition problem for packed software, features with good classification performance are combined with the multi-class model solution, as shown in Figure 5. The proposed packed software recognition method for the opened-set scenario is called POS. The POS framework is divided into four main steps:
- 1.
File preprocessing. In the initial stage of software testing, the software to be analyzed is first disassembled. By using powerful disassembly analysis tools such as IDA Pro, one can delve into the internal structure of the software and help extract key disassembly code. Specifically, by analyzing the function call graph structure in these codes, a directed graph can be constructed to describe the call relationships between software functions. In this diagram, each function name serves as a vertex, and the calling relationship between functions is represented by directed edges, with the starting point of the edge being the caller function and the ending point being the called function. This diagram can not only help identify the basic architecture of the software, but also reveal the complex interaction relationships between different functions during program execution. These pieces of information are crucial for subsequent feature extraction and pattern recognition work, providing valuable clues for analyzing software behavior and potential malicious code.
- 2.
Feature extraction. In the feature extraction stage, the goal is to extract vectors from disassembly code that can effectively represent software characteristics. Although this method framework itself does not impose strict restrictions on feature classes, in order to ensure consistency and comparability between different detection methods, the same feature extraction method is chosen. The extracted features mainly include two classes: segment class features and function call graph class features. The characteristics of segment classes involve the distribution and structure of different memory segments (such as code areas, data areas, etc.) in software, and the information of these segments can effectively reflect the overall organization and execution process of the program. The feature of function call graph class focuses on the characteristics of the relationships between functions in the function call graph, which can reveal the behavior pattern of the program from the frequency, hierarchical structure, and call path of function calls. These two classes of features are processed numerically and used as inputs for machine learning models, providing support for subsequent classification and recognition.
- 3.
Multi-model training. For packed software from different classes, a multi-class model training solution is adopted, in which each class of packed software is modeled by an independent single-class recognition model. These single-class models are trained separately for each seen class of packed software, so that each model can achieve maximum performance on its specific class. Table 3 shows all the extracted features. To train these models, the one-class support vector machine (One-Class SVM) algorithm was used, which can specifically recognize a certain class of features and has strong discriminative ability for abnormal data (i.e., unseen samples). During training, a large number of labeled seen samples are input so that the model learns the typical features of different software classes in the feature space. After training, each model can identify its corresponding class of packed software, providing a foundation for the final recognition.
- 4.
Packed software recognition. In the actual recognition process, a multi-class model trained in the early stage will classify and judge the software to be detected. Specifically, when the packed software features to be detected are input into these models, the models will classify them based on the extracted features. If at least one model determines that the software to be tested belongs to a seen packed software, then the software will be classified as that class. If all models fail to recognize the seen class, the software is judged as an unseen sample. In order to ensure the accuracy of recognition, the whole recognition process adopts the integrated learning method, that is, the judgment results of multiple single models will be comprehensively considered, so as to improve the robustness and accuracy of the final classification results. Through this method, accurate judgments can be quickly made in different classes of packed software, and seen and unseen classes can be effectively distinguished.
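Steps 3 and 4 can be sketched in a few lines. The sketch below is not the POS implementation: to stay dependency-free it replaces the one-class algorithm with a simple stand-in that accepts points within a radius of the class mean, and every name in it (`CentroidOneClass`, `train_models`, `recognize`) is hypothetical. Choosing the accepting model with the largest margin when several models accept a sample is also an assumption, since the text does not specify a tie-breaking rule.

```python
import math


class CentroidOneClass:
    """Stand-in one-class model: accept points close to the class mean."""

    def __init__(self, training_error=0.1):
        # Fraction of training samples allowed to fall outside the region.
        self.training_error = training_error

    def fit(self, samples):
        dim = len(samples[0])
        self.center = [sum(s[i] for s in samples) / len(samples) for i in range(dim)]
        dists = sorted(math.dist(s, self.center) for s in samples)
        # Keep the closest (1 - training_error) share of the samples.
        keep = max(1, math.ceil(len(dists) * (1 - self.training_error)))
        self.radius = dists[keep - 1]
        return self

    def score(self, x):
        """Non-negative when x lies inside the accepted region."""
        return self.radius - math.dist(x, self.center)


def train_models(data_by_class, training_error=0.1):
    """Train one independent single-class model per seen class."""
    return {label: CentroidOneClass(training_error).fit(samples)
            for label, samples in data_by_class.items()}


def recognize(models, x):
    """Return the accepting class with the largest margin, else "unseen"."""
    accepted = {label: m.score(x) for label, m in models.items() if m.score(x) >= 0}
    if not accepted:
        return "unseen"  # no single-class model claims the sample
    return max(accepted, key=accepted.get)


# Toy 2-D feature vectors for two seen classes (illustrative data only).
models = train_models({
    "A": [(0, 0), (0, 1), (1, 0), (1, 1)],
    "B": [(10, 10), (10, 11), (11, 10), (11, 11)],
})
print(recognize(models, (0.5, 0.6)))    # inside the region of class A
print(recognize(models, (10.4, 10.5)))  # inside the region of class B
print(recognize(models, (5, 5)))        # far from both regions
```

In a real pipeline the stand-in would be replaced by a proper one-class learner trained on the extracted feature vectors.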
4. Experiments and Discussions
This section describes and discusses the experiments on POS. Section 4.1 introduces the hardware environment, experimental dataset, and experimental metrics used for the validation experiments. The following subsections present experiments and analysis on the overall method, the algorithms used, and the adjusted parameters. Finally, POS is compared with baseline methods. Together, these experiments verify the effectiveness of the proposed method in detecting packed software under opened-set conditions.
4.1. Experimental Configuration
4.1.1. Experimental Equipment
The experiments were conducted on a laptop running Windows 11 Home, Version 25H2, equipped with an Intel Core i7-10710U CPU (6 cores, 12 threads, 1.10 GHz base, 4.70 GHz turbo, 384 KB L1 cache, 1.5 MB L2 cache, 12 MB L3 cache) and 16 GB RAM.
4.1.2. Dataset
To ensure accurate data annotation, all data were constructed with packing tools, taking software from a personal computer as input and producing packed software as output. The experiment uses 10 classes of packed software, whose numbers and quantities are shown in Table 4.
To evaluate recognition of unseen classes, 5 classes were selected as seen classes and the other 5 as unseen classes in each experimental group. A total of 21 groups were formed, and the information for each group is shown in Table 5.
4.1.3. Experimental Metrics
To measure how well a method places packed software into the seen or unseen category, recall-based metrics are proposed. As described in Section 2, the opened-set problem of packed software involves both seen and unseen samples, so several metrics are used to measure the ability of a method.
Seen recall (Equation (1)) measures the recognition ability of the method on seen samples. Equation (1) is defined in terms of the number of seen samples that were correctly predicted and the number of seen samples that were incorrectly predicted. Note that the incorrectly predicted count does not include seen samples assigned to a different seen class. For example, a class 1 sample predicted as class 2 is still counted as a seen sample.
Unseen recall (Equation (2)) measures the method’s ability to recognize unseen samples. Equation (2) is defined in terms of the number of correctly predicted unseen samples and the number of incorrectly predicted unseen samples.
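As an illustration, the two recalls can be computed from true and predicted labels as follows. This is a sketch under one reading of the note on misclassified seen samples, namely that a seen sample predicted as any seen class counts as correctly recognized as seen; the function name and the use of the string "unseen" as the rejection label are assumptions.

```python
def seen_unseen_recall(true_labels, predicted_labels, seen_classes):
    """Return (seen recall, unseen recall) as fractions in [0, 1]."""
    seen_classes = set(seen_classes)
    seen_hit = seen_total = unseen_hit = unseen_total = 0
    for truth, pred in zip(true_labels, predicted_labels):
        pred_is_seen = pred in seen_classes
        if truth in seen_classes:
            seen_total += 1
            # Assumption: landing in any seen class counts as a hit.
            seen_hit += pred_is_seen
        else:
            unseen_total += 1
            unseen_hit += not pred_is_seen
    seen_recall = seen_hit / seen_total if seen_total else float("nan")
    unseen_recall = unseen_hit / unseen_total if unseen_total else float("nan")
    return seen_recall, unseen_recall


# Classes 1 and 2 are seen; classes 7 and 8 are unseen (toy labels).
truth = [1, 1, 2, 7, 8]
pred = [1, 2, "unseen", "unseen", 1]
print(seen_unseen_recall(truth, pred, seen_classes={1, 2}))
```

Here two of the three seen samples are recognized as seen and one of the two unseen samples is rejected, giving recalls of 2/3 and 1/2.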
The average of seen recall (Equation (3)) evaluates the average performance on seen classes.
The average of unseen recall (Equation (4)) evaluates the average performance on unseen classes.
The average metric (Equation (5)) measures the overall average performance.
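Under the descriptions above, the five metrics can be written as follows. This is one plausible formulation rather than the authors' exact notation: only the symbol AR appears elsewhere in the text, so the symbols SR, UR, ASR, and AUR, the count names, the averaging over $G$ experimental groups, and the equal weighting in Equation (5) are all assumptions.

```latex
\begin{align}
\mathrm{SR}  &= \frac{N_{s}^{\mathrm{correct}}}{N_{s}^{\mathrm{correct}} + N_{s}^{\mathrm{wrong}}} \tag{1}\\
\mathrm{UR}  &= \frac{N_{u}^{\mathrm{correct}}}{N_{u}^{\mathrm{correct}} + N_{u}^{\mathrm{wrong}}} \tag{2}\\
\mathrm{ASR} &= \frac{1}{G}\sum_{g=1}^{G} \mathrm{SR}_{g} \tag{3}\\
\mathrm{AUR} &= \frac{1}{G}\sum_{g=1}^{G} \mathrm{UR}_{g} \tag{4}\\
\mathrm{AR}  &= \frac{\mathrm{ASR} + \mathrm{AUR}}{2} \tag{5}
\end{align}
```

Here $N_{s}^{\mathrm{correct}}$ and $N_{s}^{\mathrm{wrong}}$ are the numbers of correctly and incorrectly predicted seen samples, $N_{u}^{\mathrm{correct}}$ and $N_{u}^{\mathrm{wrong}}$ are the corresponding counts for unseen samples, and the subscript $g$ indexes an experimental group.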
4.2. Overall Analysis
Following the discussion in Section 3 and based on the multi-class model scheme in Section 3.2, the isolation forest (iForest) [30,31,32] algorithm was used to construct independent recognition models for the classes of packed software. The proposed method was then implemented and measured using the metrics in Section 4.1.3.
The results of the experiment are shown in
Table 6, as described in the dataset partitioning method in
Section 4.1.2. It can be seen that in all 21 experimental groups,
of POS can be maintained above 89%. In terms of
, POS can also maintain a score of over 87% in each group of experiments. In the results of macroscopic observations, POS also maintained a high level of
and
,
reaching 87.11%. These results demonstrate that POS performs well on both the PRC and PRO problems.
4.3. Algorithm Analysis
To evaluate the scheme comprehensively, comparative experiments were conducted on the algorithms. The local outlier factor (LOF) detection method and one-class SVM [33] play a role similar to that of isolation forest, so they are used as the algorithms for comparison. Table 7 shows the experimental results. LOF’s ability to recognize unseen classes is extremely poor: it focuses more on detecting samples within the trained class, which leads to overfitting. The metrics of one-class SVM are better than LOF’s but lower than those of POS.
4.4. Parameter Analysis
In a single class recognition model, there is an adjustable training parameter called training error (TE). TE is used to set the proportion of samples in the current training sample that will be discarded, because if all points are considered as being able to be classified into the current category, it will cause overfitting during training. Therefore, the TE value is always a decimal in the 0–1 open range.
In order to enhance the recognition ability of the proposed method for unseen classes, an analysis was conducted on the TE of each class. Specifically, a TE value range from 0 to 0.3 was used, and the best TE was gradually tested with a step size of 0.01.
Figure 6 shows the relationship between AR and TE for the 10 classes, where the coordinates of the maximum AR value in each class are marked with circles and data labels. Table 8 shows all the values and the corresponding TE for each class, with the maximum for each class in bold and underlined. Table 9 lists the TEs used in POS. Based on Figure 6 and Table 8, most classes achieve their peak AR within a particular TE range. For Class 6 (Themida), several identical maximum values occur in the interval 0.14 ≤ TE ≤ 0.18; consequently, the median value of this interval is adopted as the TE value for POS.
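The selection procedure, a sweep over TE with step 0.01 followed by a median tie-break, can be sketched as below. The function names and the toy AR curve are hypothetical; the curve is constructed only to reproduce a plateau between 0.14 and 0.18 and does not reflect measured data. The sweep starts at 0.01 rather than 0 because TE lies in the open interval between 0 and 1.

```python
from statistics import median


def select_te(ar_by_te):
    """Pick the TE with the highest AR; break ties by the median TE.

    ar_by_te maps TE in hundredths (14 means 0.14) to the measured AR.
    Integer keys avoid floating-point drift when stepping by 0.01.
    With an even number of tied values, the median is their midpoint.
    """
    best = max(ar_by_te.values())
    tied = sorted(te for te, ar in ar_by_te.items() if ar == best)
    return median(tied) / 100


def sweep(evaluate, start=1, stop=30):
    """Evaluate AR for TE = start/100 .. stop/100 and select the best TE."""
    return select_te({te: evaluate(te / 100) for te in range(start, stop + 1)})


def toy_ar(te):
    """Made-up AR curve with a flat maximum for 0.14 <= TE <= 0.18."""
    h = round(te * 100)
    if h < 14:
        return 0.80 + h / 100
    if h <= 18:
        return 0.95
    return 0.95 - (h - 18) / 100


print(sweep(toy_ar))  # 0.16, the median of the tied interval
```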
4.5. Comparative Analysis
To demonstrate the effectiveness of POS more intuitively, it was compared with other works. Since the methods of Kim et al. [19] and Jung et al. [21] assume the closed-set problem of packed software, the multi-class model solution in Section 3.2 is used to give them the ability to recognize opened-set samples. As 2-SPIFF [22] already uses a threshold setting solution, it is not adjusted. Table 10, Table 11, and Table 12 present the differences between the compared methods and POS from several perspectives.
In Table 10, it can be clearly observed that the method proposed by Jung et al. [21] has the weakest ability to recognize seen classes. This is because its features are not discriminative enough: entropy features ignore a large number of differences in byte order, so the recognition model judges most samples as packed software of unseen classes. The other three methods perform better because their features are discriminative and the algorithms and strategies they use meet the requirement of detecting seen and unseen classes simultaneously. Among them, the method proposed by Kim et al. [19], which uses the first bytes of the file, recognizes version-invariant packed software well; it is effectively label matching in disguise, which yields more completely correct recognition of seen classes.
Table 11 shows the recognition ability of each method for unseen classes. The method proposed by Kim et al. [19], being similar to label matching, overfits when recognizing unseen classes, with only the two values 0 and 100 appearing across the 21 experiments. Although the method proposed by Jung et al. [21] recognizes unseen classes well, it completely sacrifices the recognition ability for seen classes. POS is relatively balanced, neither overfitting nor sacrificing either recognition ability.
Table 12 shows the macroscopic results of the four methods: POS achieved the highest average recall rate, demonstrating that the method handles seen and unseen classes together effectively.
To investigate the poor performance of the POS method in certain groups, dimensionality-reduced data were analyzed. Principal Component Analysis (PCA) is employed to compress high-dimensional data and visualize its spatial distribution. As shown in Figure 7, most classes exhibit clear separability; however, some classes contain scattered points that overlap with others, which compromises the effectiveness of specific groupings.
4.6. Real Malware Analysis
To measure the ability of POS on real packed malware, 500 malware samples were collected from MALWAREBazaar (https://bazaar.abuse.ch/browse/). Using the method described in Section 4.1.2, the samples were packed with two different packers, Enigma (Malware 1) and eXPressor (Malware 2), so that POS was evaluated on two groups of real packed malware with clear labels. Table 13 shows the results on the malware data. POS retains a certain ability to identify real packed malware.
4.7. Efficiency Analysis
To evaluate the efficiency of the POS method in practical applications, the average time cost was measured for each stage: feature extraction (4.161249 s/sample), model training (0.24 s/model), and detection (0.0004 s/sample). Consequently, in real-world environments, the theoretical time required to detect a sample is approximately 4 s, dominated by feature extraction.
5. Limitations
This section injects several common binary obfuscation attacks into real-world data, following the descriptions of Lucas et al. [34,35]. Note that more effective attacks typically require access to the malware source code (or equivalent capabilities). Table 14 lists the attacks and their descriptions. Table 15 reports the effects of the changestr attack on three groups: the first group (unattacked baseline), the second group (attacked, with the attacked classes unseen during training), and the third group (attacked, with the attacked classes seen during training). The results show that POS is affected by the changestr attack, and applying the more sophisticated attacks from Table 14 would further degrade POS performance. Enhancing the robustness of POS against such obfuscation techniques is left for future work.
6. Conclusions
Through an in-depth analysis of previous packed software recognition techniques, two key challenges in practical applications were identified: the Closed-set Packed Software Recognition (PRC) problem and the Open-set Packed Software Recognition (PRO) problem. These challenges not only affect recognition accuracy but also impose higher demands on the generalization capability of recognition systems. Specifically, the PRC problem deals with accurate classification within a known sample set, while the PRO problem addresses the handling of unseen class samples, particularly in the context of complex and diverse packed software. Three approaches to the PRO problem were analyzed and compared. After evaluating the strengths and weaknesses of each, a multi-class model recognition solution was chosen, as it effectively distinguishes between packed software classes and addresses the challenges posed by the PRO problem.
To further evaluate the ability of this method to handle the PRO problem, a new performance metric called “unseen recall rate” was proposed. This metric measures the recognition ability of the model when facing unseen packed software, especially its accuracy in identifying unseen classes. Verification and testing in multiple experimental scenarios show that the proposed method, called POS, exhibits superior performance in opened-set recognition tasks, effectively identifying unseen samples while maintaining a high recall rate. The experimental results therefore demonstrate the effectiveness and feasibility of this method in practical applications.
In future work, existing methods will be further optimized, focusing on single-class recognition algorithms to improve . By refining model structure, algorithm strategies, and data preprocessing techniques, the goal is to enhance both robustness and accuracy, especially in handling unseen samples, thereby advancing the development of packed software recognition technology.
Author Contributions
Conceptualization, Z.Q.; Methodology, F.L.; Software, M.H. and H.L.; Validation, F.L.; Formal analysis, F.L.; Investigation, B.L. and H.L.; Resources, X.L. and H.L.; Data curation, C.Z. and H.L.; Writing—original draft, G.F. and H.L.; Writing—review & editing, G.F. and H.L.; Visualization, Y.H. and H.L.; Supervision, Z.Q. and F.L.; Project administration, Z.Q. and F.L.; Funding acquisition, Z.Q. and F.L. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by China Southern Power Grid’s major network-level scientific and technological project “Research and Application of Multi-dimensional Active Defense Technology for Digital Grid”, project number 037800KC24040002 (GDKJXM20240428).
Data Availability Statement
The personal software data used in this study cannot be made publicly available due to privacy concerns. However, the malware dataset can be accessed from the website (https://bazaar.abuse.ch/browse/).
Conflicts of Interest
Authors Z.Q., F.L., M.H., B.L., X.L., C.Z., G.F. and Y.H. were employed by the company Guangdong Power Grid Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
References
- Alshoulie, M.; Mehmood, A. Deep Learning Approaches for Malware Detection: A Comprehensive Review of Techniques, Challenges, and Future Directions. IEEE Access 2025, 13, 118652–118677.
- Or-Meir, O.; Nissim, N.; Elovici, Y.; Rokach, L. Dynamic malware analysis in the modern era—A state of the art survey. ACM Comput. Surv. (CSUR) 2019, 52, 88.
- Tahir, R. A study on malware and malware detection techniques. Int. J. Educ. Manag. Eng. 2018, 8, 20.
- Liu, H.; Tian, Z.; Qiu, J.; Liu, Y.; Fang, B. Survey on Few-shot for Malware Detection. J. Softw. 2024, 35, 3785–3808.
- Hafeth, A.A.; Abdullahi, A.I. An Efficient Malware Detection Method Using a Hybrid ResNet-Transformer Network and IGOA-Based Wrapper Feature Selection. Electronics 2025, 14, 2741.
- Sherazi, S.N.A.; Qureshi, A. Hybrid Analysis Model for Detecting Fileless Malware. Electronics 2025, 14, 3134.
- Kulkarni, S.S.; Di Troia, F. Robust Hashing for Improved CNN Performance in Image-Based Malware Detection. Electronics 2025, 14, 3915.
- Tong, Y.; Liang, H.; Ma, H.; Zhang, S.; Yang, X. A Survey on Reinforcement Learning-Driven Adversarial Sample Generation for PE Malware. Electronics 2025, 14, 2422.
- Miura, H.; Kimura, T.; Hirata, K. Modeling of Malware Propagation in Wireless Mobile Networks with Hotspots Considering the Movement of Mobile Clients Based on Cosine Similarity. Electronics 2025, 14, 3528.
- Roy, A.; Di Troia, F. Discriminative Regions and Adversarial Sensitivity in CNN-Based Malware Image Classification. Electronics 2025, 14, 3937.
- Liu, H.; Zhou, Y.; Fang, B.; Sun, Y.; Hu, N.; Tian, Z. PHCG: PLC Honeypoint Communication Generator for Industrial IoT. IEEE Trans. Mob. Comput. 2025, 24, 198–209.
- Chen, K.; Lu, H.; Yao, Y.; Fang, B.; Liu, Y.; Tian, Z. Enhancing Container Security through Phase-Based System Call Filtering. IEEE Trans. Cloud Comput. 2025, 13, 983–994.
- Ren, Y.; Xiao, Y.; Zhou, Y.; Zhang, Z.; Tian, Z. CSKG4APT: A Cybersecurity Knowledge Graph for Advanced Persistent Threat Organization Attribution. IEEE Trans. Knowl. Data Eng. 2023, 35, 5695–5709.
- Wang, Z.; Zhou, Y.; Liu, H.; Qiu, J.; Fang, B.; Tian, Z. ThreatInsight: Innovating Early Threat Detection Through Threat-Intelligence-Driven Analysis and Attribution. IEEE Trans. Knowl. Data Eng. 2023, 36, 9388–9402.
- Gubbi, K.I.; Saber Latibari, B.; Srikanth, A.; Sheaves, T.; Beheshti-Shirazi, S.A.; PD, S.M.; Rafatirad, S.; Sasan, A.; Homayoun, H.; Salehi, S. Hardware trojan detection using machine learning: A tutorial. ACM Trans. Embed. Comput. Syst. 2023, 22, 46.
- Xie, B.; Liu, M. Dynamics stability and optimal control of virus propagation based on the e-mail network. IEEE Access 2021, 9, 32449–32456.
- Bala, B.; Behal, S. AI techniques for IoT-based DDoS attack detection: Taxonomies, comprehensive review and research challenges. Comput. Sci. Rev. 2024, 52, 100631.
- Zheng, R.; Wang, Q.; Lin, Z.; Jiang, Z.; Fu, J.; Peng, G. Cryptocurrency malware detection in real-world environment: Based on multi-results stacking learning. Appl. Soft Comput. 2022, 124, 109044.
- Kim, Y.; Paik, J.Y.; Choi, S.; Cho, E.S. Efficient SVM based packer identification with binary diffing measures. In Proceedings of the 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), Milwaukee, WI, USA, 15–19 July 2019; Volume 1, pp. 795–800.
- Li, X.; Shan, Z.; Liu, F.; Chen, Y.; Hou, Y. A consistently-executing graph-based approach for malware packer identification. IEEE Access 2019, 7, 51620–51629.
- Jung, B.; Bae, S.I.; Choi, C.; Im, E.G. Packer identification method based on byte sequences. Concurr. Comput. Pract. Exp. 2020, 32, e5082.
- Liu, H.; Guo, C.; Cui, Y.; Shen, G.; Ping, Y. 2-SPIFF: A 2-Stage Packer Identification Method Based on Function Call Graph and File Attributes. Appl. Intell. 2021, 51, 9038–9053.
- Alkhateeb, E.; Ghorbani, A.; Habibi Lashkari, A. Identifying malware packers through multilayer feature engineering in static analysis. Information 2024, 15, 102.
- Mondon, P.; de Lemos, R. Detecting Cryptographic Functions for String Obfuscation. In Proceedings of the 2024 IEEE International Conference on Cyber Security and Resilience (CSR), London, UK, 2–4 September 2024; pp. 315–320.
- Di Gennaro, M.; D’Onghia, M.; Polino, M.; Zanero, S.; Carminati, M. PackHero: A Scalable Graph-based Approach for Efficient Packer Identification. In Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, Lausanne, Switzerland, 17–19 July 2024; Springer: Berlin/Heidelberg, Germany, 2025; pp. 253–274.
- Li, J.; He, J.; Li, W.; Fang, W.; Yang, G.; Li, T. SynDroid: An adaptive enhanced Android malware classification method based on CTGAN-SVM. Comput. Secur. 2024, 137, 103604.
- Lichy, A.; Bader, O.; Dubin, R.; Dvir, A.; Hajaj, C. When a RF beats a CNN and GRU, together—A comparison of deep learning and classical machine learning approaches for encrypted malware traffic classification. Comput. Secur. 2023, 124, 103000.
- Akhtar, M.S.; Feng, T. Detection of malware by deep learning as CNN-LSTM machine learning techniques in real time. Symmetry 2022, 14, 2308.
- Alzahrani, A.I.; Ayadi, M.; Asiri, M.M.; Al-Rasheed, A.; Ksibi, A. Detecting the presence of malware and identifying the type of cyber attack using deep learning and VGG-16 techniques. Electronics 2022, 11, 3665.
- Zhai, Y.; Liu, D.; Cheng, Z.; Fang, S. A Novel Prognostic Model of the Degradation Malfunction Combining a Dynamic Updated-ARIMA and Multivariate Isolation Forest: Application to Radar Transmitter. Electronics 2022, 11, 1921.
- Heigl, M.; Anand, K.A.; Urmann, A.; Fiala, D.; Schramm, M.; Hable, R. On the Improvement of the Isolation Forest Algorithm for Outlier Detection with Streaming Data. Electronics 2021, 10, 1534.
- Fang, N.; Fang, X.; Lu, K. Anomalous Behavior Detection Based on the Isolation Forest Model with Multiple Perspective Business Processes. Electronics 2022, 11, 3640.
- Zhao, Y.; Zhou, X.; Chen, L.; Mao, Y.; Yan, M. Research on Abnormal Radio Detection Method Combining Local Outlier Factor and One-Class Support Vector Machine. Electronics 2025, 14, 4055.
- Lucas, K.; Pai, S.; Lin, W.; Bauer, L.; Reiter, M.K.; Sharif, M. Adversarial Training for Raw-Binary Malware Classifiers. In Proceedings of the 32nd USENIX Security Symposium (USENIX Security 23), Anaheim, CA, USA, 9–11 August 2023; pp. 1163–1180.
- Lucas, K.; Lin, W.; Bauer, L.; Reiter, M.K.; Sharif, M. Training Robust ML-based Raw-Binary Malware Detectors in Hours, not Months. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, New York, NY, USA, 14–18 October 2024; CCS ’24. pp. 124–138.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).