A Survey on Reinforcement Learning-Driven Adversarial Sample Generation for PE Malware
Abstract
1. Introduction
- We summarize the advantages of RL in the context of adversarial malware generation and propose a foundational framework and an evaluation system for RL-driven malware generation techniques.
- Furthermore, we provide a comprehensive comparative analysis of existing methodologies across multiple critical dimensions, including the design of action spaces, representation of state spaces, construction of reward functions, and selection of RL architectures.
- Finally, drawing upon recent advancements in the field, we reflect on and discuss the major challenges and future research directions for PE malware generation technologies.
2. Methodology
2.1. Research Question
2.2. Search Strategy
3. Preliminaries
3.1. Portable Executable Format
3.2. Reinforcement Learning
- S denotes the set of all possible states;
- A represents the set of available actions, with the agent selecting an action $a$ based on its policy in state $s$;
- R is the reward function, which defines the immediate reward received after transitioning from state $s$ to state $s'$ as a result of taking action $a$;
- P is the state transition probability function, representing the likelihood of transitioning to state $s'$ given the current state $s$ and action $a$. This transition depends only on the current state and action and not on any previous states or actions, thereby satisfying the Markov property.
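The four components listed above can be written down as a small data structure. The following is a minimal sketch, not taken from any surveyed method: the class name `MDP`, the two-state toy problem, and all probabilities and rewards are hypothetical values chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MDP:
    states: tuple       # S: all possible states
    actions: tuple      # A: available actions
    transitions: dict   # P: (s, a) -> {s_next: probability}
    rewards: dict       # R: (s, a, s_next) -> immediate reward

    def step(self, state, action, rng):
        """Sample s' from P(. | s, a) and return (s', R(s, a, s')).

        Only the current state and action are used, which is exactly
        the Markov property described in the text.
        """
        dist = self.transitions[(state, action)]
        next_states = list(dist)
        weights = [dist[s] for s in next_states]
        next_state = rng.choices(next_states, weights=weights, k=1)[0]
        reward = self.rewards.get((state, action, next_state), 0.0)
        return next_state, reward


# Hypothetical two-state toy problem (all numbers are illustrative).
toy = MDP(
    states=("detected", "evaded"),
    actions=("modify", "stop"),
    transitions={
        ("detected", "modify"): {"detected": 0.7, "evaded": 0.3},
        ("detected", "stop"): {"detected": 1.0},
        ("evaded", "modify"): {"evaded": 1.0},
        ("evaded", "stop"): {"evaded": 1.0},
    },
    rewards={("detected", "modify", "evaded"): 1.0},
)
```

Calling `toy.step("detected", "modify", random.Random(0))` returns one sampled successor state together with its reward; repeated calls trace out a trajectory.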
3.3. Advantages of RL in Adversarial Malware Generation
- RL enables agents to learn optimal evasion strategies through limited interaction with real-world anti-malware engines, without requiring explicit knowledge of the underlying detection model architecture;
- RL enables the end-to-end modification of malware in the problem space, directly generating adversarial malware samples rather than manipulating features in the feature space;
- RL enables agents to perform a series of discrete operations on malware, modifying it without undermining its original functionality.
4. Framework for RL-Driven Adversarial Malware Generation
- The set of permissible operations that the agent can perform constitutes its action space;
- The environmental states perceived by the agent during interaction constitute the state space representation;
- The reward signal derived from the detector’s feedback guides the agent’s policy updates and forms the basis of the reward function.
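How the three components fit together can be shown with a purely abstract interaction loop. This is a sketch under stated assumptions: `ToyEnv`, its integer detector score, the per-action effects, and the reward values are all hypothetical, and no real file or detector is involved.

```python
from dataclasses import dataclass


@dataclass
class ToyEnv:
    effects: tuple = (0, 5, 20)   # action space: hypothetical score reduction per action
    threshold: int = 50           # the stub detector flags scores >= threshold
    max_steps: int = 5            # query budget per episode
    initial_score: int = 90
    score: int = 90
    steps: int = 0

    def reset(self):
        self.score, self.steps = self.initial_score, 0
        return (self.score, self.steps)           # state perceived by the agent

    def step(self, action):
        self.score = max(0, self.score - self.effects[action])
        self.steps += 1
        evaded = self.score < self.threshold
        reward = 10.0 if evaded else -0.1         # evasion bonus, small per-query cost
        done = evaded or self.steps >= self.max_steps
        return (self.score, self.steps), reward, done


def run_episode(env, policy):
    """Roll out one episode and return (total reward, final state)."""
    state, total, done = env.reset(), 0.0, False
    while not done:
        state, reward, done = env.step(policy(state))
        total += reward
    return total, state
```

For example, `run_episode(ToyEnv(), lambda state: 2)` always picks the strongest hypothetical action and ends as soon as the stub detector no longer flags the score. A learned policy would replace the lambda.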
5. Methods of RL-Driven Adversarial Malware Generation
5.1. Action Space Design
5.2. State Space Representation
5.3. Reward Function Construction
5.4. Selection of RL Architectures
6. Evaluation System for RL-Driven Adversarial Malware Generation
- Static Verification: The control flow graphs (CFGs) of the original and modified binaries are compared using tools such as IDA Pro [79], aiming to detect whether critical logic paths and function boundaries remain intact.
- Dynamic Verification: Samples are executed in controlled environments such as Cuckoo Sandbox [80] to monitor the activation of malicious behaviors.
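As a rough illustration of the static check, two control flow graphs can be compared as sets of edges. This sketch assumes the CFGs have already been exported as `(source_block, target_block)` pairs with block labels that match across both binaries; the export step and the function names are hypothetical and are not the API of IDA Pro or any other tool.

```python
def cfg_edge_similarity(original_edges, modified_edges):
    """Jaccard similarity between two CFG edge sets (1.0 means identical)."""
    a, b = set(original_edges), set(modified_edges)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def missing_edges(original_edges, modified_edges):
    """Edges of the original CFG that no longer appear after modification."""
    return set(original_edges) - set(modified_edges)


# Hypothetical example: the modification adds one edge and removes none.
original = {("entry", "check"), ("check", "payload"), ("check", "exit")}
modified = original | {("entry", "padding")}
```

A non-empty result from `missing_edges` suggests that a logic path of the original binary may have been broken, which is the situation the static verification step is meant to catch.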
7. Discussion
- Simplicity: While keeping each operation simple, the action space should allow a systematic exploration of how the modification magnitude and the determinism of a single operation affect the decision boundary in feature space.
- Validity: All operations must strictly follow the PE format specification and must not damage the sample's malicious functionality; otherwise, the modified sample fails. This has two implications: first, samples that do not adhere to the PE format specification may be unable to run properly on Windows systems; second, even if the samples execute successfully, their original malicious functionality may be compromised or eliminated during the modification process.
- Comprehensiveness: Avoid limiting modifications to a single type of transformation. Instead, the action space should incorporate diverse strategies across multiple dimensions, including code obfuscation techniques. This facilitates the comprehensive perturbation of the model’s decision boundary and enhances the robustness of adversarial samples against detection mechanisms.
- Sensitivity to Agent Actions: The observed state features should be responsive to the agent’s modifications, ensuring that changes in the adversarial sample are meaningfully reflected in the state transition dynamics.
- Comprehensiveness of State Features: The feature set must provide broad coverage of the malware sample’s structural and behavioral properties, ensuring that no critical information is lost during the adversarial perturbation process.
8. Open Issues and Future Directions
8.1. Emerging Trends in Adversarial Malware Generation Techniques
8.1.1. Toward Practical Evaluation of RL in Adversarial Malware Generation
8.1.2. Addressing the Gap Between Static and Dynamic Detection in Adversarial Malware Generation
8.2. Reflections on the Evolution of Malware Detection Techniques
9. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Cletus, A.; Opoku, A.A.; Weyori, B.A. An evaluation of current malware trends and defense techniques: A scoping review with empirical case studies. J. Adv. Inf. Technol. 2024, 15, 649–671.
- Hamza, A.A.; Abdel-Halim, I.T.; Sobh, M.A.; Bahaa-Eldin, A.M. A survey and taxonomy of program analysis for IoT platforms. Ain Shams Eng. J. 2021, 12, 3725–3736.
- Monnappa, K.A. Learning Malware Analysis: Explore the Concepts, Tools and the Techniques; Packt Publishing Ltd.: Birmingham, UK, 2018.
- CHECK POINT. Available online: https://itwire.com/images/articles/andrew-matler/2024_Security_Report.pdf (accessed on 29 April 2025).
- Sophos. Available online: https://news.sophos.com/enus/2024/03/12/2024-sophos-threat-report/ (accessed on 29 March 2025).
- AV-Test. Available online: https://portal.av-atlas.org/malware/statistics (accessed on 15 March 2025).
- Sathyanarayan, V.S.; Kohli, P.; Bruhadeshwar, B. Signature generation and detection of malware families. In Proceedings of the 13th Australasian Conference, Information Security and Privacy, Wollongong, Australia, 7–9 July 2008.
- Ucci, D.; Aniello, L.; Baldoni, R. Survey of machine learning techniques for malware analysis. Comput. Secur. 2019, 81, 123–147.
- Anderson, B.; McGrew, D. Machine learning for encrypted malware traffic classification: Accounting for Noisy Labels and Non-Stationarity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017.
- Cui, Z.; Xue, F.; Cai, X.; Cao, Y.; Wang, G.; Chen, J. Detection of Malicious Code Variants Based on Deep Learning. IEEE Trans. Ind. Inform. 2018, 14, 3187–3196.
- Botacin, M.; Domingues, F.D.; Ceschin, F.; Machnicki, R.; Alves, Z.A.M.; Geus, P.; Gregio, A. Antiviruses under the microscope: A hands-on perspective. Comput. Secur. 2022, 112, 102500.
- Qiang, W.; Yang, L.; Jin, H. Efficient and robust malware detection based on control flow traces using deep neural networks. Comput. Secur. 2022, 122, 102871.
- Jha, S.; Prashar, D.; Long, H.V.; Taniar, D. Recurrent neural network for detecting malware. Comput. Secur. 2020, 99, 102037.
- SL, S.D.; Jaidhar, C.D. Windows malware detector using convolutional neural network based on visualization images. IEEE Trans. Emerg. Top. Comput. 2019, 9, 1057–1069.
- Raff, E.; Barker, J.; Sylvester, J.; Brandon, R.; Catanzaro, B.; Nicholas, C.K. Malware Detection by Eating a Whole EXE. In Proceedings of the 32nd AAAI National Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
- Raff, E.; Fleshman, W.; Zak, R.; Anderson, H.S.; Filar, B.; McLean, M. Classifying sequences of extreme length with constant memory applied to malware detection. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021.
- Krcál, M.; Švec, O.; Bálek, M.; Jašek, O. Deep convolutional malware classifiers can learn from raw executables and labels only. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
- Kalash, M.; Rochan, M.; Mohammed, N.; Bruce, N.; Wang, Y.; Iqbal, F. Malware classification with deep convolutional neural networks. In Proceedings of the 2018 9th IFIP International Conference on New Technologies, Mobility and Security (NTMS), Paris, France, 26–28 February 2018.
- Vu, D.L.; Nguyen, T.K.; Nguyen, T.V.; Nguyen, T.N.; Massacci, F.; Phung, P.H. A convolutional transformation network for malware classification. In Proceedings of the 2019 6th NAFOSTED Conference on Information and Computer Science (NICS), Hanoi, Vietnam, 12–13 December 2019.
- Vasan, D.; Alazab, M.; Wassan, S.; Naeem, H.; Safaei, B.; Zheng, Q. IMCFN: Image-based malware classification using fine-tuned convolutional neural network architecture. Comput. Netw. 2020, 171, 107138.
- Saxe, J.; Berlin, K. Deep neural network based malware detection using two dimensional binary program features. In Proceedings of the 2015 10th International Conference on Malicious and Unwanted Software (MALWARE), Fajardo, PR, USA, 20–22 October 2015.
- Chen, X.; Hao, Z.; Li, L.; Cui, L.; Zhu, Y.; Ding, Z. Cruparamer: Learning on parameter-augmented api sequences for malware detection. IEEE Trans. Inf. Forensics Secur. 2022, 17, 788–803.
- Maniriho, P.; Mahmood, A.N.; Chowdhury, M.J.M. API-MalDetect: Automated malware detection framework for windows based on API calls and deep learning techniques. J. Netw. Comput. Appl. 2023, 218, 103704.
- Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. arXiv 2014, arXiv:1412.6572.
- Demetrio, L.; Biggio, B.; Lagorio, G.; Roli, F.; Armando, A. Functionality-preserving black-box optimization of adversarial windows malware. IEEE Trans. Inf. Forensics Secur. 2021, 16, 3469–3478.
- Kong, Z.; Xue, J.; Wang, Y.; Huang, L.; Niu, Z.; Li, F. A survey on adversarial attack in the age of artificial intelligence. Wirel. Commun. Mob. Comput. 2021, 2021, 4907754.
- Park, D.; Yener, B. A survey on practical adversarial examples for malware classifiers. In Proceedings of the 4th Reversing and Offensive-Oriented Trends Symposium, Vienna, Austria, 19 November 2020.
- Pierazzi, F.; Pendlebury, F.; Cortellazzi, J.; Cavallaro, L. Intriguing properties of adversarial ml attacks in the problem space. In Proceedings of the 41st IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 18–20 May 2020.
- Zhan, D.; Duan, Y.; Hu, Y.; Li, W.; Guo, S.; Pan, Z. MalPatch: Evading DNN-based malware detection with adversarial patches. IEEE Trans. Inf. Forensics Secur. 2023, 19, 1183–1198.
- Wang, S.; Fang, Y.; Xu, Y.; Wang, Y. Black-box adversarial windows malware generation via united puppet-based dropper and genetic algorithm. In Proceedings of the 24th IEEE International Conference on High Performance Computing & Communications, Hainan, China, 18–20 December 2022.
- Gibert, D.; Planes, J.; Le, Q.; Zizzo, G. A wolf in sheep’s clothing: Query-free evasion attacks against machine learning-based malware detectors with generative adversarial networks. In Proceedings of the 2023 IEEE European Symposium on Security and Privacy Workshops, Delft, The Netherlands, 3–7 July 2023.
- Zhong, F.; Cheng, X.; Yu, D.; Gong, B.; Song, S.; Yu, J. MalFox: Camouflaged adversarial malware example generation based on conv-GANs against black-box detectors. IEEE Trans. Comput. 2023, 73, 980–993.
- Hu, W.; Tan, Y. Generating adversarial malware examples for black-box attacks based on GAN. In Proceedings of the 7th International Conference on Data Mining and Big Data, Beijing, China, 11 November 2022.
- Anderson, H.S.; Kharkar, A.; Filar, B.; Evans, D.; Roth, P. Learning to evade static PE machine learning malware models via reinforcement learning. arXiv 2018, arXiv:1801.08917.
- Fang, Z.; Wang, J.; Li, B.; Wu, S.; Zhou, Y.; Huang, H. Evading anti-malware engines with deep reinforcement learning. IEEE Access 2019, 7, 48867–48879.
- Fang, Y.; Zeng, Y.; Li, B.; Liu, L.; Zhang, L. DeepDetectNet vs. RLAttackNet: An adversarial method to improve deep learning-based static malware detection model. PLoS ONE 2020, 15, e0231626.
- Geng, J.; Wang, J.; Fang, Z.; Zhou, Y.; Wu, D.; Ge, W. A survey of strategy-driven evasion methods for PE malware: Transformation, concealment, and attack. Comput. Secur. 2024, 137, 103595.
- Ling, X.; Wu, L.; Zhang, J.; Qu, Z.; Deng, W.; Chen, X.; Qiao, Y.; Wu, C.; Ji, S.; Luo, T. Adversarial attacks against Windows PE malware detection: A survey of the state-of-the-art. Comput. Secur. 2023, 128, 103134.
- Afianian, A.; Niksefat, S.; Sadeghiyan, B.; Baptiste, D. Malware dynamic analysis evasion techniques: A survey. ACM Comput. Surv. (CSUR) 2019, 52, 1–28.
- Bulazel, A.; Yener, B. A survey on automated dynamic malware analysis evasion and counter-evasion: Pc, mobile, and web. In Proceedings of the 1st Reversing and Offensive-Oriented Trends Symposium, Vienna, Austria, 16 November 2017.
- Jadhav, A.; Vidyarthi, D. Evolution of evasive malwares: A survey. In Proceedings of the 2016 International Conference on Computational Techniques in Information and Communication Technologies (ICCTICT), New Delhi, India, 11–13 March 2016.
- Mahmood, A.N.; Chowdhury, M.J.M.; Maniriho, P. A survey of recent advances in deep learning models for detecting malware in desktop and mobile platforms. ACM Comput. Surv. 2024, 56, 145.
- Wu, C.; Shi, J.; Yang, Y.; Li, W. Enhancing machine learning based malware detection model by reinforcement learning. In Proceedings of the 8th International Conference on Communication and Network Security, Qingdao, China, 2–4 November 2018.
- Anderson, H.S.; Roth, P. Ember: An open dataset for training static pe malware machine learning models. arXiv 2018, arXiv:1804.04637.
- Chen, J.; Jiang, J.; Li, R.; Dou, Y. Generating adversarial examples for static PE malware detector based on deep reinforcement learning. J. Phys. Conf. Ser. 2020, 1575, 012011.
- Ebrahimi, M.; Pacheco, J.; Li, W.; Hu, J.L.; Chen, H. Binary black-box attacks against static malware detectors with reinforcement learning in discrete action spaces. In Proceedings of the 2021 IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 27 May 2021.
- Labaca-Castro, R.; Franz, S.; Rodosek, G.D. AIMED-RL: Exploring adversarial malware examples with reinforcement learning. In Proceedings of the Machine Learning and Knowledge Discovery in Databases, Applied Data Science Track: European Conference, ECML PKDD 2021, Bilbao, Spain, 13–17 September 2021.
- Li, X.; Li, Q. An IRL-based malware adversarial generation method to evade anti-malware engines. Comput. Secur. 2021, 104, 102118.
- Gibert, D.; Fredrikson, M.; Mateu, C.; Planes, J.; Le, Q. Enhancing the insertion of NOP instructions to obfuscate malware via deep reinforcement learning. Comput. Secur. 2022, 113, 102543.
- Song, W.; Li, X.; Afroz, S.; Garg, D.; Kuznetsov, D.; Yin, H. MAB-Malware: A reinforcement learning framework for blackbox generation of adversarial malware. In Proceedings of the 2022 ACM on Asia Conference on Computer and Communications Security, Nagasaki, Japan, 30 May 2022.
- Quertier, T.; Marais, B.; Morucci, S.; Fournel, B. MERLIN-Malware Evasion with Reinforcement LearnINg. arXiv 2022, arXiv:2203.12980.
- Zhong, F.; Hu, P.; Zhang, G.; Li, H.; Cheng, X. Reinforcement learning based adversarial malware example generation against black-box detectors. Comput. Secur. 2022, 121, 102869.
- Virus Total. Available online: https://virustotal.com/ (accessed on 29 May 2025).
- Zhang, L.; Liu, P.; Choi, Y.H.; Chen, P. Semantics-preserving reinforcement learning attack against graph neural networks for malware detection. IEEE Trans. Dependable Secur. Comput. 2022, 20, 1390–1402.
- Zhan, D.; Bai, W.; Liu, X.; Hu, Y.; Zhang, L.; Guo, S.; Pan, Z. PSP-Mal: Evading malware detection via prioritized experience-based reinforcement learning with Shapley prior. In Proceedings of the 39th Annual Computer Security Applications Conference, Austin, TX, USA, 4–8 December 2023.
- Etter, B.; Hu, J.L.; Ebrahimi, M.; Li, W.; Li, X.; Chen, H. Evading Deep Learning-Based Malware Detectors via Obfuscation: A Deep Reinforcement Learning Approach. In Proceedings of the 2023 IEEE International Conference on Data Mining (ICDM), Shanghai, China, 1–4 December 2023.
- Zhan, D.; Zhang, Y.; Zhu, L.; Chen, J.; Xia, S.; Guo, S.; Pan, Z. Enhancing reinforcement learning based adversarial malware generation to evade static detection. Alex. Eng. J. 2024, 98, 32–43.
- Coull, S.E.; Gardner, C. Activation analysis of a byte-based deep neural network for malware classification. In Proceedings of the 2019 IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 23 May 2019.
- Tian, B.; Jiang, J.; He, Z.; Yuan, X.; Dong, L.; Sun, C. Functionality-Verification Attack Framework Based on Reinforcement Learning Against Static Malware Detectors. IEEE Trans. Inf. Forensics Secur. 2024, 19, 8500–8514.
- Zhan, D.; Liu, X.; Bai, W.; Li, W.; Guo, S.; Pan, Z. GAME-RL: Generating Adversarial Malware Examples against API Call Based Detection via Reinforcement Learning. IEEE Trans. Dependable Secur. Comput. (Early Access) 2025, 1–17.
- Microsoft, Inc. Available online: https://docs.microsoft.com/en-us/windows/win32/debug/pe-format (accessed on 9 June 2025).
- Paterson, T. An inside look at MS-DOS. Byte 1983, 8, 230.
- Szegedy, C.; Sutskever, I.; Goodfellow, I.; Zaremba, W.; Fergus, R.; Erhan, D. Intriguing properties of neural networks. In Proceedings of the 2nd International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014.
- Aryal, K.; Gupta, M.; Abdelsalam, M.; Saleh, M. Intra-section code cave injection for adversarial evasion attacks on windows pe malware file. arXiv 2024, arXiv:2403.06428.
- Demetrio, L.; Biggio, B.; Lagorio, G.; Roli, F.; Armando, A. Explaining vulnerabilities of deep learning to adversarial malware binaries. arXiv 2019, arXiv:1901.03583.
- Demetrio, L.; Coull, S.E.; Biggio, B.; Lagorio, G.; Armando, A.; Roli, F. Adversarial exemples: A survey and experimental evaluation of practical attacks on machine learning for windows malware detection. ACM Trans. Priv. Secur. (TOPS) 2021, 24, 27.
- Badhwar, R. Polymorphic and metamorphic malware. In The CISO’s Next Frontier: AI, Post-Quantum Cryptography and Advanced Security Paradigms; Springer International Publishing: Berlin/Heidelberg, Germany, 2021.
- Simple Malware Obfuscation Techniques. Available online: https://resources.infosecinstitute.com/topic/simple-malware-obfuscation-techniques/ (accessed on 29 March 2025).
- Upx. Available online: https://github.com/upx/upx (accessed on 29 May 2025).
- Darkarmour Computer Software. Available online: https://github.com/bats3c/darkarmour (accessed on 29 March 2025).
- Lief. Available online: https://github.com/lief-project/LIEF (accessed on 29 May 2025).
- pefile. Available online: https://github.com/erocarrera/pefile (accessed on 29 March 2025).
- Pathak, D.; Agrawal, P.; Efros, A.A.; Darrell, T. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 11 August 2017.
- Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016.
- Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.V.; Lanctot, M.; Freitas, N.D. Dueling network architectures for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016.
- Bellemare, M.G.; Dabney, W.; Munos, R. A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017.
- Fellows, M.; Mahajan, A.; Rudner, T.G.J.; Whiteson, S. Virel: A variational inference framework for reinforcement learning. Adv. Neural Inf. Process. Syst. 2019, 32, 1–15.
- Christodoulou, P. Soft actor-critic for discrete action settings. arXiv 2019, arXiv:1910.07207.
- IDA Pro. Available online: https://www.hex-rays.com/products/ida/ (accessed on 29 March 2025).
- Guarnieri, C. cuckoosandbox. Available online: https://cuckoosandbox.org/ (accessed on 29 March 2025).
- Angr. Available online: https://angr.io/ (accessed on 29 March 2025).
- Ding, Y.; Shao, M.; Nie, C.; Fu, K. An efficient method for generating adversarial malware samples. Electronics 2022, 11, 154.
- Chen, S.; Carlini, N.; Wagner, D. Stateful detection of black-box adversarial attacks. In Proceedings of the 1st ACM Workshop on Security and Privacy on Artificial Intelligence, Taiwan, China, 6 October 2020.
- Albulayhi, K.; Sheldon, F.T. An adaptive deep-ensemble anomaly-based intrusion detection system for the internet of things. In Proceedings of the 2021 IEEE World AI IoT Congress (AIIoT), Seattle, WA, USA, 13 May 2021.
- Chen, B.; Ren, Z.; Yu, C.; Hussain, I.; Liu, J. Adversarial examples for cnn-based malware detectors. IEEE Access 2019, 7, 54360–54371.
- Shaukat, K.; Luo, S.; Varadharajan, V. A novel method for improving the robustness of deep learning-based malware detectors against adversarial attacks. Eng. Appl. Artif. Intell. 2022, 116, 105461.
- Zhang, Y.; Li, H.; Zheng, Y.; Yao, S.; Jiang, J. Enhanced DNNs for malware classification with GAN-based adversarial training. J. Comput. Virol. Hacking Tech. 2021, 17, 153–163.
- Wang, X.; Miikkulainen, R. MDEA: Malware detection with evolutionary adversarial learning. In Proceedings of the 2020 IEEE Congress on Evolutionary Computation (CEC), Glasgow, UK, 19 July 2020.
- Albulayhi, K.; Abu Al-Haija, Q.; Alsuhibany, S.A.; Jillepalli, A.A.; Ashrafuzzaman, M.; Sheldon, F.T. IoT intrusion detection using machine learning with a novel high performing feature selection method. Appl. Sci. 2022, 12, 5015.
- Mishra, A.; Gupta, B.B.; Perakovic, D.; Yamaguchi, S.; Hsu, C.H. Entropy based defensive Mechanism against DDoS attack in SDN-Cloud enabled online social networks. In Proceedings of the 2021 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 10 January 2021.
Method | Year | Target Model | RL Method | State Space | Action Space | Reward | Validation | Content
---|---|---|---|---|---|---|---|---
gym-malware [34] | 2017 | GBDT | DQN | 2350d | 10 | evasion | - | random
gym-plus [43] | 2018 | LightGBM [44] | Double DQN, Sarsa | 2350d | 16 | - | - | random
DQEAF [35] | 2019 | GBDT | DDQN | 513d | 4 | evasion, query number | ✓ | random
gym-malware-mini [45] | 2020 | GBDT | DQN, A2C | 2350d | 10 | evasion | - | deterministic
RLAttackNet [36] | 2020 | DeepDetectNet | Double DQN | 2478d | 218 | evasion, query number | ✓ | deterministic
AMG-VAC [46] | 2021 | LightGBM, MalConv [15] | VAC | 2350d, raw bytes | 10 | - | ✓ | random
AIMED-RL [47] | 2021 | LightGBM | DiDDQN | 2351d | 10 | evasion, query number, similarity | ✓ | random
AMG-IRL [48] | 2021 | 360engine | IRL | - | 4 | IRL | ✓ | random
Gibert [49] | 2022 | CNN classifiers | Double DQN | - | 1 | evasion | - | deterministic
mab-malware [50] | 2022 | LightGBM, MalConv, 3 AVs | multi-armed bandit | 2350d, raw bytes | 8 | evasion | ✓ | random, benign
MERLIN [51] | 2022 | LightGBM, MalConv, AV | DQN, policy gradient | - | 15 | evasion | - | random
Malinfo [52] | 2022 | VirusTotal [53] | DP, TD | - | Obfusmal, Stealmal, Hollowmal | evasion | - | deterministic
SRL [54] | 2022 | GNN detector | CFG | - | injecting semantic NOPs into the CFG | evasion | - | deterministic
PSP-Mal [55] | 2023 | LightGBM | PER/TS | 2381d | 10 | evasion | ✓ | deterministic
OBFU-mal [56] | 2023 | LightGBM, MalConv, AV | DQN | 2350d, raw bytes | 12 | evasion, intrinsic reward | - | -
Enhancing RLAMG [57] | 2024 | LightGBM, MalConv, FireEyeNet [58] | PPO | 2350d, raw bytes | 5 | evasion, intrinsic reward | ✓ | deterministic
RLAEG [59] | 2024 | LightGBM, MalConv, MalConv-GCT [16] | SAC | 636d | 12 | evasion | ✓ | -
GAME-RL [60] | 2025 | API Call Sequence-Based Black-Box Model | - | - | insert API calls | evasion, query number | ✓ | -
Number | Name | Type | Description | Methods | Adoption Rate |
---|---|---|---|---|---|
1 | Overlay Append | Additive | Append bytes to the end of the file | [34,35,36,43,45,46,47,48,50,55,56,57,59] | 86.7% |
2 | Section Add | Additive | Add a new section | [34,43,45,47,50,55,56,57,59] | 60.0% |
3 | Section Slack | Additive | Append bytes to extra spaces at the ends of sections | [34,35,36,43,45,46,47,48,50,55,57,59] | 80.0% |
4 | Import Append | Additive | Add a function to the import address table | [34,35,36,43,45,46,47,48,55,56,57] | 73.3% |
5 | Remove Debug | Additive | Unlink the debug section from the header | [34,36,43,45,46,47,50,55,56] | 60.0% |
6 | Section Rename | Editing | Manipulate existing section names | [34,36,43,45,46,47,50,55,56,59] | 66.7% |
7 | Remove Certificate | Editing | Remove the signed certificate | [34,35,36,43,45,46,47,48,50,56] | 66.7% |
8 | Break Checksum | Editing | Modify the header checksum | [34,43,45,46,47,50,55,56] | 53.3% |
9 | Overlay Replace | Editing | Replace bytes at the end of the file | [43,59] | 13.3% |
10 | Dos Change | Editing | Reset the DOS header except the magic number and PE header offset | [43,57,59] | 20.0% |
11 | Header Disrupt | Editing | Reset portions of the PE header | [43,55,56,59] | 26.7% |
12 | Change Timestamp | Editing | Reset the timestamp | [43,46,55,56,59] | 33.3% |
13 | Extend Header | Extension | Increase the PE header offset to free up new space and fill it with strings | [43,59] | 13.3% |
14 | Extend Section | Extension | Increase the offset of each section to free up new space and fill it with bytes | [43,59] | 13.3% |
15 | Upx Pack | Obfuscation | Pack with UPX tools | [34,43,45,46,47,56,59] | 46.7% |
16 | Upx Unpack | Obfuscation | Unpack with UPX tools | [34,43,45,46,47,59] | 40.0% |
17 | Insert NOP | Obfuscation | Insert a NOP instruction | [49,54] | 13.3% |
18 | Add XOR | Obfuscation | Apply an XOR encryption loop | [56] | 6.7% |
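The adoption rates in the table above appear to be the number of citing methods divided by a fixed number of surveyed methods. The percentages are consistent with a denominator of 15, but that value is inferred from the figures rather than stated in the text, so the sketch below treats it as an assumption. The function name is hypothetical.

```python
def adoption_rate(n_citing_methods, n_surveyed=15):
    """Percentage of surveyed methods that use an action.

    n_surveyed=15 is an inferred assumption: it reproduces the
    percentages listed in the table but is not stated explicitly.
    """
    return round(100.0 * n_citing_methods / n_surveyed, 1)


# Counts taken from the length of the citation lists in the table.
overlay_append = adoption_rate(13)   # Overlay Append is cited by 13 methods
add_xor = adoption_rate(1)           # Add XOR is cited by 1 method
```

With these inputs, `overlay_append` evaluates to 86.7 and `add_xor` to 6.7, matching the corresponding table rows.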
Target Model | LightGBM | MalConv | AV |
---|---|---|---|
Knowledge of model | Yes | Yes | No |
Input data | Features extracted from PE files | Raw bytes extracted from PE files | PE files |
Input preprocessing | 2351d vectors | Fixed size (1M) | Unknown |
Feature | Position | Information | Dim
---|---|---|---
General file information (Parsed feature) | PE header | file_size, vsize, has_debug, exports, imports, has_relocations, has_resources, has_signature, has_tls, symbols | 10
Header information (Parsed feature) | PE header | timestamp, machine, characteristics, subsystem, dll_characteristics, magic, major_image_version, minor_image_version, major_linker_version, minor_linker_version, major_operating_system_version, minor_operating_system_version, major_subsystem_version, minor_subsystem_version, sizeof_code, sizeof_headers, sizeof_heap_commit | 62
Imported functions (Parsed feature) | Import address table | - | 1280
Exported functions (Parsed feature) | Export table | - | 128
Section information (Parsed feature) | Each section | entry, name, size, entropy, vsize, props | 255
Byte histogram (format-agnostic feature) | Counts of each byte value | - | 256
Byte entropy histogram (format-agnostic feature) | Byte entropy | - | 256
String information (format-agnostic feature) | Statistics about printable strings | numstrings, avlength, printabledist, printables, entropy, paths, urls, registry, MZ | 104
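As a quick consistency check, the per-group dimensions in the feature table above can be summed and compared with the 2351-dimensional vector listed as the LightGBM input. The dictionary keys below are hypothetical labels; the numbers are copied from the Dim column.

```python
# Dimensions per feature group, copied from the Dim column of the table.
FEATURE_DIMS = {
    "general_file_info": 10,
    "header_info": 62,
    "imported_functions": 1280,
    "exported_functions": 128,
    "section_info": 255,
    "byte_histogram": 256,
    "byte_entropy_histogram": 256,
    "string_info": 104,
}


def feature_slices(dims):
    """Map each group to its (start, end) index range in the flat vector."""
    slices, start = {}, 0
    for name, width in dims.items():
        slices[name] = (start, start + width)
        start += width
    return slices


TOTAL_DIM = sum(FEATURE_DIMS.values())
SLICES = feature_slices(FEATURE_DIMS)
```

The total comes to 2351, so the eight groups account for the full vector. The index ranges assume the groups are concatenated in the order of the table, which is an assumption of this sketch.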
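Category | Method | State Space | Action Space | Convergence Speed
---|---|---|---|---
Value-based | DQN | Continuous | Discrete | Low sample efficiency; requires a large number of environment interactions
Value-based | Double DQN | Continuous | Discrete | Low sample efficiency; requires a large number of environment interactions
Value-based | Dueling DQN | Continuous | Discrete | Low sample efficiency; requires a large number of environment interactions
Value-based | DiDDQN | Continuous | Discrete | Low sample efficiency; requires a large number of environment interactions
Actor–Critic | A3C | Continuous | Discrete | Optimizes the objective function directly via policy gradients; converges faster
Actor–Critic | SAC | Continuous | Continuous | Optimizes the objective function directly via policy gradients; converges faster
Actor–Critic | PPO | Continuous or Discrete | Continuous | Optimizes the objective function directly via policy gradients; converges faster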
Category | Metric | Description
---|---|---
Primary evaluation metric | Evasion rate | The ratio of the number of adversarial samples that maintain their malicious functionality and successfully evade detection to the total number of adversarial samples
Auxiliary evaluation metrics | Functionality preservation | Whether adversarial samples evade detection while fully retaining the original malicious functions
Auxiliary evaluation metrics | Transferability | Whether samples generated against one model can also evade other models
Auxiliary evaluation metrics | Injection rate | The ratio of the adversarial perturbation size to the original sample size
Auxiliary evaluation metrics | Interaction rounds | The number of interaction rounds between the agent and the detection model
Auxiliary evaluation metrics | Operation count | The cumulative number of modification steps applied to an adversarial sample
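Two of the metrics above reduce to simple ratios. The sketch below is illustrative only: the `SampleResult` record and its field names are hypothetical, and the injection rate is computed under the assumption that the perturbation is measured as the number of bytes added to the original sample.

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    evaded: bool          # the detector no longer flags the sample
    functional: bool      # the functionality check passed
    original_size: int    # bytes before modification
    modified_size: int    # bytes after modification


def evasion_rate(results):
    """Share of samples that both evade detection and remain functional."""
    if not results:
        return 0.0
    successes = sum(1 for r in results if r.evaded and r.functional)
    return successes / len(results)


def injection_rate(result):
    """Added bytes relative to the original size (assumed definition)."""
    return (result.modified_size - result.original_size) / result.original_size


# Hypothetical evaluation results.
results = [
    SampleResult(True, True, 1000, 1100),
    SampleResult(True, False, 2000, 2600),   # evades but is broken: not counted
    SampleResult(False, True, 1500, 1500),
    SampleResult(True, True, 4000, 4200),
]
```

Here `evasion_rate(results)` counts only the first and last samples, since the second one evades detection but fails the functionality check.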
Application Stage | Description | Methods |
---|---|---|
Data preprocessing | Eliminate possible adversarial samples | [83,84,85] |
Model training | Add a certain number of adversarial samples to retrain the model | [86,87,88] |
Model detection | Directly extract the features of the adversarial samples | [89,90] |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Tong, Y.; Liang, H.; Ma, H.; Zhang, S.; Yang, X. A Survey on Reinforcement Learning-Driven Adversarial Sample Generation for PE Malware. Electronics 2025, 14, 2422. https://doi.org/10.3390/electronics14122422