1. Introduction
A primary challenge in static malware analysis is the localization of malicious application programming interface (API) functionality in assembly instruction sequences [1]. Recent work derives API-level control flow graphs and delimits regions of interest (ROIs) by inspecting the surrounding opcodes to classify the API as benign or malicious [2]. This pipeline is demonstrably fragile under semantic-NOP insertion attacks, in which an adversary inserts one or more semantically redundant NOP instructions (semantic NOPs) into ROIs, perturbing the static analysis while preserving runtime semantics. In such cases, the opcode distribution inside the ROI shifts, which in turn displaces the feature vector used by both signature-based [3] and learning-based detectors [1], causing a measurable drop in the maliciousness score. More critically, the operability and stealth of automated semantic-NOP toolkits [4,5,6] make them attractive for integration into Malware-as-a-Service ecosystems [7], expanding the attack surface from isolated campaigns to industrial scale. Accurate and complete semantic-NOP removal therefore plays an important role in malware analysis.
Semantic NOPs are individual instructions or instruction sequences that preserve the original program semantics while altering the static representation of a function.
Table 1 lists several commonly used semantic NOPs identified in prior research. However, manual identification of semantic NOPs is impractical due to the vast volume of malware samples. Although rule-based approaches offer an automated alternative for detecting semantic NOPs, they struggle to adapt to the continuous evolution and variation of such instructions. Consequently, effective semantic-NOP discovery and removal demands an automated approach capable of identifying semantic-NOP patterns, even for previously unseen instruction sequences that have never appeared in any training corpus. This requirement naturally aligns with a sequential decision problem under uncertainty, a setting in which reinforcement learning (RL) has demonstrated state-of-the-art performance. In this scenario, an RL agent treats the choice of a semantic NOP as an action and the evasion success rate against a downstream detector as the reward. In this way, the agent can continually expand its repertoire beyond known patterns and learn to remove inserted semantic NOPs.
In this work, we present MalRefiner, a fully working reinforcement learning system that recovers the original opcode sequence of malware that has been adversarially injected with semantic NOPs, by jointly optimizing three interacting modules: (i) a representation network, (ii) a classification network, and (iii) a policy network. First, each malware sample is statically disassembled into an ordered opcode stream, integer-embedded, and fed to a stacked 1D causal convolutional representation network to yield a compact, context-aware state vector. This vector simultaneously supplies the Markov state for the policy network and the input to the classification network, which returns the posterior malware likelihood used as the reward. Within this environment, consisting of the representation and classification networks, the policy-gradient refiner performs per-opcode retain/remove actions, maximizing cumulative reward so that the final, minimal opcode sequence (a) preserves the original runtime semantics and (b) maximizes the confidence of any downstream static detector, without requiring retraining or architectural modification of the detector.
Experimental evaluation of semantic-NOP removal is conducted on two benchmark datasets, i.e., PE Malware Machine Learning (PEMML) and RawMal-TF. PEMML contains 25,426 Windows PE files (2017–2018) obtained by family-stratified sampling from 201k executables. RawMal-TF comprises 6000 samples (2023–2025) with an equal proportion of malware and benignware. We target four mainstream detectors (1D CNN, MalConv, TCN, and MALIGN) without retraining them. Adversarial perturbations are generated with the Semantics-preserving Reinforcement Learning (SRL) attack, which inserts semantic NOPs into at most 5% of basic blocks, yielding paired clean/adversarial samples. MalRefiner is trained on the training split and evaluated on the held-out test set, attaining a recovery rate exceeding 90% on both datasets.
In summary, the contributions of this paper are as follows:
- (1)
Semantic-NOP recovery pipeline: We present MalRefiner, an adversarial malware defense system that jointly learns to locate and remove adversarially inserted semantic NOPs.
- (2)
Causal-conv MDP backbone: We cast the task as a Markov Decision Process (MDP) and design a lightweight 1D causal convolutional environment that yields compact, context-aware states and delayed detection-likelihood rewards, enabling end-to-end training with REINFORCE.
- (3)
Extensive experimental evaluation: We perform extensive experiments to evaluate the effectiveness of MalRefiner on over 31k executables. Experimental results demonstrate that MalRefiner achieves a recovery rate above 90% with no modification of the downstream classifiers.
The remainder of this article is organized as follows.
Section 2 reviews the related adversarial-malware literature.
Section 3 formalizes the threat model and details the problem formulation, network architecture, and reward design.
Section 4 describes datasets, baselines, attack protocols, and metrics.
Section 5 presents quantitative and qualitative results and case analyses.
Section 6 concludes the paper and outlines future research directions.
2. Related Work
In this section, we first trace the evolution of semantics-preserving evasions, from PE byte injection to RL-driven NOP insertion, to illustrate the capabilities of current state-of-the-art methods. We then show that existing defenses primarily prioritize detection performance and leave the malicious semantics unrestored, motivating our shift from merely resisting semantic-NOP insertion toward learning to remove it.
2.1. Adversarial Malware Evasion Against ML Detectors
Since the migration from signature-based [8] to learning-based detectors, adversarial research has predominantly focused on feature-space manipulation of static and dynamic models. Early work appended bytes to PE headers [9] or injected benign API calls [10] to shift the decision boundary of data-driven classifiers. Subsequent studies regard the problem as a constrained optimization: maximize the misclassification probability while preserving file syntax and execution semantics [11]. With the advent of graph-level detectors, attackers moved from raw bytes to structural features, perturbing control-flow or call graphs via node injection. Among structure-aware attacks, semantics-preserving perturbations have become the preferred evasion method because they alter only non-functional instructions and achieve high evasion rates. Within this category, inserting semantic NOP instructions that do not alter program behavior, such as NOP, register-to-register swaps, or additions of zero, offers particularly high stealth.
Gibert et al. [4] leverage a Double-DQN agent to sequentially place NOPs into the assembly code to reduce the cross-entropy loss of a data-driven classifier. Their method achieves 100% evasion on three malware families while increasing the basic-block count by merely 5%. Zhang et al. [5] extended the idea to graph neural networks, proposing an RL policy that jointly selects a basic block and a semantic NOP to achieve stronger evasion. Their SRL attack attains 100% evasion against both basic GCN and DGCNN detectors with only a small number of additional instructions, and retains 85% success even after adversarial retraining. Ling et al. [6] present MalGuise, a black-box attack that refines semantics-preserving CFG manipulation through call-based redividing. Each call instruction splits a basic block into three sequential blocks, while semantic NOPs are injected into the middle blocks. Their evaluation on 210k in-the-wild samples reveals that MalGuise exceeds a 95% attack success rate while preserving original API-level semantics in over 91% of cases.
2.2. Adversarial Defenses on Evading Malware
With the rapid evolution of adversarial attacks against malware samples, defending machine learning detectors has recently become a priority. Adversarial training is now a standard robustness strategy: Lucas et al. [12] perform large-scale adversarial training on raw-byte models and demonstrate that lightweight, gradient-guided augmentations transfer robustness to stronger malicious variants while leaving the malicious byte sequence intact.
In recent years, researchers have also proposed robust architectures and dedicated threat indicators. Li et al. [13] present RAMDA, a robust Android malware detection framework that couples a feature-disentangled VAE (FD-VAE) with an MLP classifier. By learning a compact latent space that cleanly separates benign, malicious, and adversarial samples, RAMDA attains a defense success rate of over 90% against seven diverse adversarial attacks without requiring adversarial examples at training time, thereby generalizing well to unseen attacks. Rashid et al. [14] introduce MalProtect, a stateful defense that detects adversarial query attacks against ML-based malware detectors. By analyzing query sequences with multiple anomaly-based indicators, including query similarity, feature overlap, and auto-encoder reconstruction loss, MalProtect substantially reduces evasion rates on both Android and Windows datasets, outperforming prior stateful defenses. Although these approaches preserve detection accuracy, they do not attempt to recover the original malware semantics, which is an essential prerequisite for downstream malware analysis.
2.3. Reinforcement Learning for Malware Evasion and Defense
With the rapid evolution of adversarial attacks targeting machine learning-based malware detectors, reinforcement learning has emerged as a powerful framework for modeling sequential decision-making in both malware evasion and adversarial defense. Existing RL-based approaches predominantly treat the malware sample as an indivisible unit, abstracting the adversarial generation or defense process as a high-level policy optimization problem. Zhong et al. [15] exemplify this paradigm by treating the entire modified executable as the RL state. Their framework employs dynamic programming and temporal-difference learning to explore sequences of perturbations and uses VirusTotal as a black-box detector to provide reward signals. Similarly, Ravi et al. [16] adopt a holistic view by embedding the binary as an RGB image and using the full image as the RL state. A ResNet-18 surrogate model provides reward feedback, guiding the agent to insert semantic NOPs into the executable region of binary files. On the defensive side, Liu et al. [17] leverage inverse reinforcement learning to model malware behavior as a sequence of system-level interactions. Their approach constructs a dynamic heterogeneous graph to represent malware–system relationships and infers attack intent from observed behavior trajectories.
These methods concentrate on the whole malware sample and disregard the internal opcode sequence, which encodes the core logic and functionality of the program. This coarse-grained abstraction hinders their ability to perform fine-grained defenses, particularly when locating subtle instruction-level changes that can yield significant evasion gains.
To the best of our knowledge, no prior work has concentrated on the automated selective removal of adversarial semantic NOPs. We close this gap with a policy-gradient agent that learns to remove semantic NOPs, directly maximizing the confidence of the downstream ML detector.
3. Methodology
In this section, we first describe the defense model and formulate the problem. Then, following the notation introduced in the problem formulation, we present the environment and the agent. Finally, the rewards and the update strategy of the model parameters are described.
3.1. Defense Model
Following previous works [18,19], the defense model in this paper focuses on three key components: defender goals, knowledge, and capabilities.
Defender goals: The defender aims to remove semantic NOPs from adversarially altered malware samples. These instructions do not alter program behavior but can cause ML-based detectors to misclassify malicious binaries. This work focuses on removing such NOPs to restore the malware to its original, pre-obfuscation state.
Defender knowledge: We assume the defender operates the target ML-based malware detector and receives a batch of suspicious obfuscated samples. The defender observes the obfuscated samples and knows that their true labels are malware, but has no knowledge of how or where semantic NOPs were inserted. Adversaries release the crafted samples but keep the original samples and their generation pipeline confidential.
Defender capabilities: The defender is assumed to possess two core capabilities: (1) query access to the target ML-based malware detector, enabling the acquisition of a predicted label for any submitted sample; (2) the ability to reverse-engineer a binary executable and extract its ordered opcode sequence, which constitutes the input to our method. Since the removal of adversarially injected semantic NOPs is a non-trivial task, these prerequisites mirror the skill set routinely expected of operational malware analysts.
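As an illustration of the second capability, the following is a minimal sketch using the Capstone disassembler to obtain an ordered opcode (mnemonic) sequence; in practice the defender would first carve the executable sections out of the PE file, a step omitted here for brevity.

```python
import capstone

# Minimal sketch: extract an ordered opcode (mnemonic) sequence from raw
# machine code with Capstone. PE section carving is omitted for brevity.
def opcode_sequence(code_bytes, base_addr=0x1000, mode=capstone.CS_MODE_32):
    md = capstone.Cs(capstone.CS_ARCH_X86, mode)
    return [insn.mnemonic for insn in md.disasm(code_bytes, base_addr)]

# Example: 0x90 ("nop") followed by 0xC3 ("ret")
print(opcode_sequence(b"\x90\xc3"))  # ['nop', 'ret']
```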
3.2. Problem Formulation
Let $\mathcal{Z}$ be the set of binary executables and $\mathcal{X} \subseteq \mathbb{R}^{L}$ the L-dimensional feature space produced by static or dynamic analysis. Each feature vector $x \in \mathcal{X}$ is extracted from an original malware sample $z \in \mathcal{Z}$ through a deterministic feature map $\phi: \mathcal{Z} \rightarrow \mathcal{X}$, i.e., $x = \phi(z)$. Labels are drawn from $\mathcal{Y} = \{0, 1\}$, where 1 indicates malware and 0 indicates benignware. A data-driven malware classifier is then a mapping $F: \mathcal{X} \rightarrow \mathcal{Y}$ aiming to correctly classify all samples.
An adversarial obfuscation attacker applies a perturbation A to produce an obfuscated malware sample $x' = A(x)$ that evades the classifier F, i.e., $F(x') = 0$ although the ground-truth label remains 1. Conversely, the semantic-NOP removal task in this paper seeks a reverse operation R that recovers the original malware representation, i.e., $R(x') \approx x$ so that $F(R(x')) = 1$. The overall attack-and-defense workflow is presented in Figure 1. Thus, the core challenge lies in learning a removal function R that approximates the inverse of A without access to A or to the original sample x.
Since individual semantic NOPs may exhibit contextual dependencies, their NOP-like behavior emerges only when they form an inter-instruction sequence. Therefore, we cast the semantic-NOP removal task as a Markov Decision Process $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ is the action space, $\mathcal{P}$ denotes the transition kernel parameterized by a policy network, and $\mathcal{R}$ denotes the reward function. As illustrated in Figure 2, MalRefiner couples a convolutional environment with a policy-gradient agent. The environment embeds the opcode sequence in a causal 1D latent space via a representation network, and returns a delayed malware-likelihood reward via a classification network. The agent receives the hidden vector from the representation network and produces a retain/remove decision for each opcode to maximize detector confidence while preserving runtime semantics. Both modules are co-trained with REINFORCE to yield the recovered opcode sequence.
3.3. Environment
The environment comprises a representation network and a classification network. Since their parameters are updated jointly, the two networks form a single entity, termed the environment network. The representation network produces the hidden vector that serves as the state, and the classification network outputs the reward.
3.3.1. Representation Network
To balance representational capacity with training efficiency, we stack two lightweight 1D causal convolutional blocks instead of recurrent or self-attention layers. Causal convolution ensures that the state at position t is computed solely from opcodes $o_1, \dots, o_t$, satisfying the sequential decision constraint of the MDP while retaining parallelizability. Dilated kernels capture local instruction patterns without the quadratic complexity of self-attention or the sequential bottleneck of recurrent networks.
In the first convolutional layer, the kernel size is set to 1. This layer serves two purposes: (1) reducing the dimensionality of the input vectors for efficient representation; (2) establishing alignment between the input layer and the subsequent convolutional layer, which facilitates vector concatenation in the state representation. In the second convolutional layer, the kernel size is denoted as k, which is set to 3 in this paper. This choice is motivated by the observation that a semantic NOP typically consists of at most 3 NOP instructions. Therefore, neurons in this layer are computed based on 3 adjacent neurons from the previous layer, enabling the model to capture local instruction-level patterns.
In sequence modelling, the fundamental task is to predict an output sequence from an input sequence that is ordered chronologically. In such tasks, the goal is to predict the output at time step t using only the preceding input subsequence $x_1, \dots, x_t$. This constraint is known as the causal constraint, which ensures that no future information is used in the prediction. In the context of malicious sequence analysis, adhering to this constraint is essential to maintain the integrity of the temporal modeling process.
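To make the layer configuration concrete, the following is a minimal TensorFlow sketch of the representation network; the vocabulary size is an assumed placeholder, while the sequence length of 3000, embedding dimension of 300, and 128 filters follow the implementation settings reported in Section 4.

```python
import tensorflow as tf

def build_representation_network(vocab_size=1000, seq_len=3000,
                                 embed_dim=300, filters=128):
    """Two stacked 1D causal convolutional blocks over embedded opcodes."""
    opcodes = tf.keras.Input(shape=(seq_len,), dtype="int32", name="opcode_ids")
    x = tf.keras.layers.Embedding(vocab_size, embed_dim)(opcodes)
    # First block: kernel size 1 reduces dimensionality and aligns the input
    # layer with the subsequent convolutional layer.
    h1 = tf.keras.layers.Conv1D(filters, kernel_size=1, activation="relu")(x)
    # Second block: kernel size 3 with causal padding, so position t only
    # depends on opcodes up to t, matching the MDP's sequential constraint.
    h2 = tf.keras.layers.Conv1D(filters, kernel_size=3, padding="causal",
                                activation="relu")(h1)
    return tf.keras.Model(opcodes, [h1, h2], name="representation_network")
```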
3.3.2. State
To elucidate the state representation, Figure 3 presents examples of the state calculation under several conditions with different retain/remove masks.
Situation 1: If $a_t = 1$ and the preceding $k-1$ actions are all 1, the convolutional kernel slides across the hidden representation from the lower convolutional layer, and the hidden representation at the upper layer $h^{(2)}_t$ can be computed as
$$h^{(2)}_t = W \ast \big(h^{(1)}_{t-k+1}, \dots, h^{(1)}_{t}\big),$$
where $h^{(1)}$ denotes the representation from the first convolutional layer, W denotes the convolution kernel, ∗ denotes the convolutional operator, and k denotes the kernel size in the upper convolutional layer, which is 3 in this setting (an instance is shown in Figure 3).
Situation 2: If $a_t = 0$, the opcode is removed, and the corresponding representations $h^{(1)}_t$ and $h^{(2)}_t$ are both undefined. Following [20], we copy the previous hidden vector, i.e., $h^{(2)}_t = h^{(2)}_{t-1}$, as shown in Figure 3.
Situation 3: If $a_t = 1$ but some earlier actions were 0, the receptive field is shifted to the nearest valid positions, so the kernel is applied to the k most recently retained hidden vectors of the lower layer, as shown in Figure 3.
Collectively, the update of $h^{(2)}_t$ is summarized by Equation (6):
$$h^{(2)}_t = \begin{cases} W \ast \big(h^{(1)}_{j_{t,1}}, \dots, h^{(1)}_{j_{t,k}}\big), & a_t = 1,\\ h^{(2)}_{t-1}, & a_t = 0, \end{cases}$$
where $j_{t,1} < \dots < j_{t,k} = t$ index the k most recent retained positions.
In this work, considering that semantic NOPs span at most three consecutive opcodes, two consecutive hidden representations suffice to capture them. Therefore, the RL state is designed as
$$s_t = h^{(2)}_{t-1} \oplus h^{(2)}_{t},$$
where ⊕ denotes the concatenation operator and $h^{(2)}_{0}$ resorts to a zero vector because no preceding context exists.
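The following NumPy sketch illustrates this state construction under a given retain/remove mask; it implements only the copy-forward rule and the concatenation, taking the upper-layer hidden vectors as given rather than re-applying the convolution over retained positions.

```python
import numpy as np

def build_states(h2, mask):
    """h2: (T, d) upper-layer hidden vectors; mask: 1 = retain, 0 = remove."""
    T, d = h2.shape
    filled = np.zeros_like(h2)
    prev = np.zeros(d)                 # fallback before any valid position exists
    for t in range(T):
        if mask[t] == 1:
            prev = h2[t]
        filled[t] = prev               # removed opcode: copy the previous hidden vector
    states = np.zeros((T, 2 * d))
    for t in range(T):
        left = filled[t - 1] if t > 0 else np.zeros(d)   # zero vector at t = 0
        states[t] = np.concatenate([left, filled[t]])    # s_t = h_{t-1} (+) h_t
    return states
```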
3.3.3. Classification Network
The classification network is a lightweight multi-layer perceptron that maps the compact, causal representation produced by the representation network to a malware posterior. It consists of two fully connected layers, each followed by a ReLU non-linearity and a dropout layer. The first FC layer projects the 128-dimensional hidden vector to 64 units, and the second reduces the dimension to 2, yielding the logits for benign and malware. Finally, a softmax operator converts these logits into a probabilistic output.
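A minimal Keras sketch of this classification network is given below; the dropout rate is an assumption not specified in the text, and the 128-dimensional input corresponds to the hidden vector produced by the representation network.

```python
import tensorflow as tf

def build_classification_network(hidden_dim=128, dropout_rate=0.5):
    """Two FC layers (128 -> 64 -> 2) with ReLU, dropout, and a softmax output."""
    h = tf.keras.Input(shape=(hidden_dim,), name="hidden_vector")
    x = tf.keras.layers.Dense(64, activation="relu")(h)
    x = tf.keras.layers.Dropout(dropout_rate)(x)
    logits = tf.keras.layers.Dense(2)(x)                  # benign / malware logits
    probs = tf.keras.layers.Softmax(name="posterior")(logits)
    return tf.keras.Model(h, probs, name="classification_network")
```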
3.4. Agent
As described above, the representation and classification networks constitute the environment. The agent interacts with this environment by drawing actions from a policy network that decides, at each position t, whether to retain or remove the current opcode.
Action Space and Policy Network
The action set is binary, $\mathcal{A} = \{0, 1\}$, with 0 meaning the remove operation and 1 meaning the retain operation. At time step t, the agent feeds the state vector $s_t$ to a two-layer policy network:
$$\pi_\theta(a_t = 1 \mid s_t) = \sigma\big(W_2\,\mathrm{ReLU}(W_1 s_t + b_1) + b_2\big),$$
where $\theta = \{W_1, b_1, W_2, b_2\}$ denotes the trainable parameters, σ is the sigmoid function, and ReLU(·) denotes the rectified linear unit function.
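A minimal Keras sketch of the policy network follows; the 256-dimensional state corresponds to the concatenation of two 128-dimensional hidden vectors, and the hidden-layer size of 20 follows the implementation settings in Section 4.

```python
import tensorflow as tf

def build_policy_network(state_dim=256, hidden_dim=20):
    """Two-layer policy network: ReLU hidden layer, sigmoid retain probability."""
    s = tf.keras.Input(shape=(state_dim,), name="state")
    x = tf.keras.layers.Dense(hidden_dim, activation="relu")(s)
    p_retain = tf.keras.layers.Dense(1, activation="sigmoid", name="p_retain")(x)
    return tf.keras.Model(s, p_retain, name="policy_network")
```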
3.5. Reward
The reward is composed of two complementary terms: (i) a log-likelihood score that quantifies the increase in detector confidence after refinement and (ii) a bonus that encourages the removal of redundant opcodes. The first term ensures that the refined sequence recovers the malicious signal recognized by the downstream classifier, while the second term discourages the agent from retaining instructions whose only purpose is to increase the binary size without altering program behavior, thereby guiding the policy toward a concise yet semantically equivalent opcode sequence.
The first reward component r′, derived from the classification network, is defined as the log-likelihood gain achieved after refinement:
$$r' = \log P(y \mid x') - \log P(y \mid x),$$
where $P(\cdot \mid \cdot)$ represents the predicted probability from the environment network, y is the ground-truth label of the original input sequence x, and x′ is the refined sequence. This reward r′ encourages the refinement process to adjust the input so that samples whose original form was regarded as malware are reclassified as malware.
The second reward r″, designed to encourage the removal of semantic NOPs, is defined as
$$r'' = \frac{m}{L},$$
where m is the number of removed opcodes and L denotes the original sequence length. The reward is thus constrained to $[0, 1]$ to match the scale of the log-likelihood reward.
Since every action contributes to the final outcome, a single delayed reward is broadcast along the trajectory with
$$r = r' + \lambda\, r'',$$
where λ is a weight factor that balances the two reward terms.
Finally, the expected return over a sampled trajectory τ is then
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big] = \sum_{\tau} \Big(\prod_{t} \pi_\theta(a_t \mid s_t)\Big) R(\tau),$$
where $\pi_\theta(a_t \mid s_t)$ is the policy probability of choosing action $a_t$ in state $s_t$ and $R(\tau)$ denotes the cumulative broadcast reward of the trajectory.
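A minimal sketch of the delayed trajectory reward is shown below, assuming r′ is the log-likelihood gain of the ground-truth (malware) label and r″ is the fraction of removed opcodes weighted by λ; the default value of λ in the sketch is illustrative.

```python
import numpy as np

def trajectory_reward(p_before, p_after, num_removed, seq_len, lam=0.5):
    """p_before/p_after: malware-label probabilities before and after refinement."""
    r_prime = np.log(p_after + 1e-12) - np.log(p_before + 1e-12)   # log-likelihood gain
    r_double_prime = num_removed / seq_len                          # lies in [0, 1]
    return r_prime + lam * r_double_prime                           # broadcast to all steps
```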
3.6. Model Training
To optimize the parameters θ of the policy, we employ the REINFORCE algorithm [21] on the objective in Equation (12). Assuming N trajectories are sampled via the Monte Carlo method, the gradient of the expected reward with respect to θ is given by
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t} R(\tau_n)\, \nabla_\theta \log \pi_\theta\big(a^{(n)}_t \mid s^{(n)}_t\big).$$
The training involves a coupled update scheme between the RL refiner and the environment network: the refiner's updates depend on the environment network, while the environment network is trained on sequences revised by the refiner. To facilitate effective cooperation between the two components, we adopt a pre-training strategy. First, the environment network is trained independently, without involving the refiner. Then, with the environment network fixed, the RL refiner is trained to refine input sequences. Finally, both components are jointly fine-tuned. During joint training, the parameters of both the environment network and the policy network are updated using a stable strategy following [22], which combines old and new parameter values as follows:
$$\theta \leftarrow \rho\, \theta_{\text{new}} + (1 - \rho)\, \theta_{\text{old}},$$
where $\theta_{\text{new}}$ denotes the newly computed parameters in the current iteration, $\theta_{\text{old}}$ represents the old parameters, and ρ controls the update speed.
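The sketch below illustrates one REINFORCE update of the policy network combined with the old/new parameter mixing described above; the optimizer, learning rate, and the mixing-coefficient name tau_mix are illustrative assumptions.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)   # illustrative settings

def reinforce_step(policy_net, states, actions, reward, tau_mix=0.1):
    """One policy-gradient step followed by soft mixing of old and new weights."""
    old_weights = [w.numpy() for w in policy_net.weights]
    with tf.GradientTape() as tape:
        p_retain = policy_net(states)                        # (T, 1) retain probabilities
        a = tf.cast(tf.reshape(actions, (-1, 1)), tf.float32)
        log_prob = a * tf.math.log(p_retain + 1e-12) + \
                   (1.0 - a) * tf.math.log(1.0 - p_retain + 1e-12)
        loss = -reward * tf.reduce_sum(log_prob)             # maximize R(tau) * log pi
    grads = tape.gradient(loss, policy_net.trainable_weights)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_weights))
    # Combine old and new parameter values to stabilize joint training.
    for w, w_old in zip(policy_net.weights, old_weights):
        w.assign(tau_mix * w.numpy() + (1.0 - tau_mix) * w_old)
    return float(loss)
```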
In the RL training process, delayed rewards guide the direction and magnitude of parameter updates. These rewards are only available after the entire sequence has been sampled by the policy network. Once obtained, the rewards are used to update the policy network parameters. Since the same policy network—with identical structure and parameters—generates actions for all inputs in a sequence, the action selection process benefits from the parallelism of the convolutional architecture, thereby accelerating training. The overall training procedure is outlined in Algorithm 1.
| Algorithm 1 Training process |
1: Require:
2:   Training data D;
3:   Representation network and classification network with parameters θ_E;
4:   The policy network with parameters θ;
5:   Number of epochs e, batches per epoch b, number of input sequences per batch d.
6: Ensure:
7:   Optimized parameters θ_E and θ.
8: Initialize the network parameters θ_E and θ;
9: Pre-train the representation network and classification network, assuming all the inputs are retained;
10: Pre-train the RL structure by Equation (13);
11: Train these two structures jointly:
12: for each epoch in 1, …, e do
13:   for each batch in 1, …, b do
14:     for each sequence in 1, …, d do
15:       Sample a trajectory for the sequence with the policy π_θ;
16:       Compute the log-likelihood with the classification network;
17:       Compute the reward of the trajectory using Equation (12);
18:       for each step t in the trajectory do
19:         Spread the reward to the pair of state s_t and the corresponding action a_t and construct the tuple (s_t, a_t, r);
20:       end for
21:       Compute the gradients according to Equation (13);
22:     end for
23:     Update parameters θ_E and θ by Equation (14).
24:   end for
25: end for
|
4. Evaluation Settings
In this section, we detail the datasets, experimental environment, parameter settings, evaluation metrics, target models, and attack method used throughout the experiments.
4.1. Datasets and Experimental Setup
All experiments were conducted in TensorFlow 2.8.0 on a Windows Server with dual Intel® Xeon® E5-2620 v4 CPUs (16 physical cores, 32 threads), 128 GB DDR4 memory, and an NVIDIA GeForce RTX 4090D GPU with 24 GB memory.
4.1.1. Datasets
Contemporary data-driven malware research faces two principal constraints. First, raw executables are rarely distributed because of legal and ethical concerns, and hence extracted feature vectors are more common, which impedes studies that require direct assembly manipulation and semantic-NOP insertion. Second, most reference collections, including the widely used Microsoft BIG 2015 dataset [23], are more than a decade old. We therefore chose two open-access datasets that (i) supply the actual binaries and (ii) span distinct time windows, ensuring that the evaluation covers both legacy and modern malware.
PEMML (https://practicalsecurityanalytics.com/pe-malware-machine-learning-dataset/, accessed on 28 May 2025): The PEMML (PE Malware Machine Learning) dataset is a publicly accessible collection of Windows Portable Executable (PE) files, originally compiled to support machine learning-based malware detection research. It comprises 201,549 executables (114,737 malicious, 86,812 benign) with raw binaries supplied for both classes. In this work, we downsample the dataset by selecting samples from the top-10 most representative malware families and an equal number of benign samples to ensure a balanced and tractable evaluation setup.
RawMal-TF [24]: The RawMal-TF dataset provides more than 250,000 malware samples from 17 families. Since it contains only malicious executables, we supplement it with 3000 benign PE files from PortableApps (https://portableapps.com/, accessed on 28 May 2025). We then randomly draw 3000 malware instances from the original pool, yielding a balanced dataset for experimentation.
Since few public collections offer both benign and malicious binaries, we limit our evaluation to the two datasets described above. Each dataset is split into training, validation, and test sets with a ratio of 0.64:0.16:0.20, respectively. In the following experimental evaluation, this splitting procedure is repeated five times with independent random splits to ensure the reliability and robustness of the experimental results, mitigating the influence of random variation, while preserving the overall class distribution in each split.
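For reference, the following is a minimal scikit-learn sketch of this splitting procedure, i.e., stratified 0.64/0.16/0.20 splits repeated over five random seeds (the seed values themselves are illustrative):

```python
from sklearn.model_selection import train_test_split

def make_splits(X, y, seeds=(0, 1, 2, 3, 4)):
    """Five independent stratified train/val/test splits with ratio 0.64/0.16/0.20."""
    splits = []
    for seed in seeds:
        X_tmp, X_test, y_tmp, y_test = train_test_split(
            X, y, test_size=0.20, stratify=y, random_state=seed)
        # 0.20 of the remaining 80% gives a 16% validation share overall.
        X_train, X_val, y_train, y_val = train_test_split(
            X_tmp, y_tmp, test_size=0.20, stratify=y_tmp, random_state=seed)
        splits.append((X_train, y_train, X_val, y_val, X_test, y_test))
    return splits
```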
Table 2 summarizes the key characteristics of the resulting experimental datasets, including sample counts, family distributions, and temporal coverage.
4.1.2. Implementation
The maximal length of the input sequences is set to 3000. For the environment network, the learning rate is set to , the embedding dimension is 300, and the numbers of filters and neurons in the convolutional and fully connected layers are both 128. For the RL refiner, the learning rate is , the dimension of the hidden layer is 20, and the weight factor λ is set to for the PEMML dataset and for the RawMal-TF dataset. In addition, the coefficient that controls the update speed of the network parameters is set to . A learning-rate schedule is also applied during training: the learning rates decay to of their original values every 1000 steps on both datasets.
4.1.3. Evaluation Metrics
To comprehensively assess the model’s performance, we adopt a suite of standard classification metrics, i.e., Precision, Recall, F1-score (F1), Accuracy (ACC), False-Positive Rate (FPR), and False-Negative Rate (FNR). Furthermore, we gauge adversarial effectiveness and defensive robustness by reporting the Attack Success Rate (ASR) and the Recovered Rate (RR), respectively. Definitions are summarized in Table 3.
4.2. Target Models
Recently, a variety of data-driven malware classification methods have been proposed and evaluated on publicly available datasets. To benchmark the performance of our proposed approach, we compare it with several representative methods, including the following:
1D CNN: A lightweight convolutional baseline model that replaces recurrent layers with one-dimensional convolutions to reduce memory usage and training time. The architecture follows the design of the environment network described in Section 3, comprising 128 filters with kernel size 3, ReLU activation, global max-pooling, and a fully connected layer with 128 neurons.
MalConv [25]: An end-to-end byte-level convolutional neural network that processes the entire PE file (up to 2 MB) via an embedding layer, followed by a gated convolutional block and a single dense output layer. We adopt the original architectural settings, including an embedding dimension of 8, a stride of 500, and a gating width of 128.
TCN [26]: A temporal convolutional network that utilizes stacks of dilated causal convolutions with residual connections to capture long-range sequential dependencies without suffering from gradient degradation. In our implementation, we retain the original channel width of 128 and follow the dilation schedule of [1, 2, 4, 8, 16], as suggested in the original work.
MALIGN [27]: A biologically inspired, sequence alignment-based method for malware family classification. MALIGN converts raw byte sequences into nucleotide representations and applies multiple sequence alignment to identify conserved code regions across malware families. These conserved regions are then used to compute alignment scores, which serve as features for a logistic regression classifier. Unlike traditional deep learning models, MALIGN is designed to be interpretable and robust against adversarial perturbations.
For all baseline models, we apply consistent preprocessing steps, including data normalization and sequence truncation or padding. Additionally, we use the same train/validation/test split and early-stopping criteria to ensure a fair comparison.
4.3. Attack Methods
Adversarial semantic-NOP insertion has attracted considerable research interest in recent years. Representative approaches include SRL [5], ADVeRL-ELF [16], and the method proposed by Gibert et al. [4]. Gibert et al. restrict themselves to inserting literal NOP instructions, which can be trivially detected by scanning for standard NOP opcodes. ADVeRL-ELF, although more sophisticated, is tightly coupled to the ELF format and targets exclusively Linux environments. Consequently, we select SRL as the attack method for this study because it is format-agnostic, preserves functional semantics, and resists static signature detection.
Table 4 summarizes the performance of these malware detection models on the clean test sets drawn from the PEMML and RawMal-TF datasets. Judging from the detection performance on the two datasets, RawMal-TF is clearly the easier one, as all machine learning models generalize better on it. Across the models, MALIGN consistently delivers the highest accuracy and the lowest false-positive rate. On PEMML, it surpasses the widely used MalConv baseline by 6.35% in accuracy and lowers the FPR by 4.41%. On RawMal-TF, all four models exceed 98% accuracy.
For the target detection systems, we further measure their robustness to the SRL attack on the test samples. Following [5], we allow the adversary to insert semantic-NOP instructions into at most 5% of the basic blocks.
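The sketch below illustrates this perturbation budget at the opcode level; the insertion patterns are illustrative examples of the semantic-NOP categories discussed in this paper, whereas the actual SRL attack selects blocks and NOP types with a learned policy rather than uniformly at random.

```python
import random

# Illustrative semantic-NOP patterns, represented as opcode lists only.
SEMANTIC_NOPS = [["nop"], ["xchg"], ["push", "pop"], ["bswap", "bswap"], ["add", "sub"]]

def insert_semantic_nops(blocks, budget=0.05, rng=random.Random(0)):
    """Insert one pattern into at most `budget` of the basic blocks (opcode lists)."""
    n_perturb = max(1, int(len(blocks) * budget))
    for idx in rng.sample(range(len(blocks)), n_perturb):
        pattern = rng.choice(SEMANTIC_NOPS)
        pos = rng.randrange(len(blocks[idx]) + 1)
        blocks[idx] = blocks[idx][:pos] + pattern + blocks[idx][pos:]
    return blocks
```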
Table 5 reports the post-attack performance of the four models. On the PEMML set, the detection accuracy of MalConv drops from 80.47% to 46.16%, while MALIGN falls from 91.45% to 59.19%. On the RawMal-TF dataset, the attack attains a 100% evasion rate, as reported in [5], meaning that every correctly classified malware sample can be perturbed into a form that all four detectors misclassify.
5. Results and Discussion
This section presents the recovery performance of the proposed MalRefiner and provides a detailed analysis of the recovered inputs.
5.1. Performance Evaluation of the MalRefiner on Four Target Models
Table 6 and Table 7 report the quantitative recovery performance of MalRefiner on the PEMML and RawMal-TF datasets, respectively. All metric values represent the averages over five independent experimental runs. Figure 4 and Figure 5 present the corresponding boxplots of these five runs, highlighting the dispersion and stability of performance metrics across different trials. In contrast, Figure 6 and Figure 7 compare the recovered performance with the clean-set baseline, illustrating the extent to which MalRefiner can restore the detection capability.
The performance of MalRefiner is evaluated through the four target models because its effectiveness is measured indirectly via the recovered detection results rather than by direct task-specific indicators. From Table 6, refined inputs reach an accuracy of 77.25% on MalConv and 88.17% on MALIGN, corresponding to drops of 3.22% and 3.28% relative to clean-set performance, with FPR increases of 0.81% and 0.23%, respectively. TCN delivers the highest RR of 96.63% and achieves a relatively balanced Precision of 86.36%, Recall of 84.50%, and F1 of 85.41%. Figure 4 shows consistently narrow interquartile ranges and minimal outliers, suggesting that recovery is stable across different random splits. The clean-vs.-recovered comparison in Figure 6 reveals that Precision drops less sharply than Recall, implying that most unrecovered detection errors occur in originally well-classified malicious samples. Overall, the recovery performance, measured primarily by RR, remains above 90% for all models.
Table 7 shows MalConv restored to 94.14% accuracy and MALIGN to 97.73%, with small FPR increases of 0.04% and 0.17%, respectively. In addition, all four models exceed 92% RR, and TCN achieves 96.27% accuracy with an F1 of 96.20%. Figure 5 exhibits extremely tight metric distributions and very few outliers, indicating consistent recovery even across different data splits. Overall, RawMal-TF yields higher absolute accuracy and lower variance compared to PEMML. In Figure 7, all metrics show mild declines relative to clean-set baselines, and RR ranges from 92% to 96%, revealing strong restoration capability for this dataset.
The comparative analysis highlights that while PEMML poses greater challenges for byte-based models such as MalConv, due to longer and more diverse code sequences, the recovery process remains stable with RR above 91% in all cases. RawMal-TF shows slightly better reproducibility and absolute performance, which may be attributed to its simpler dataset structure. The boxplot results across both datasets show low dispersion, suggesting that MalRefiner's recovery effectiveness is not sensitive to the training–validation split and generalizes across malware datasets.
5.2. Quantitative and Qualitative Analysis of Removed Semantic NOPs
To better understand how MalRefiner restores model confidence, we first quantify which semantic NOPs are actually eliminated, and then inspect concrete opcode sequences to verify the semantic-preserving nature of the removals.
Quantitative statistics: Table 8 summarizes the five most frequently removed patterns on the PEMML test set. On average, SRL inserts 64 self-exchanging xchg instructions per sample, and MalRefiner removes 53 of them, yielding an 85.94% RP. The push edi; pop edi pair is deleted in about 66.66% of its occurrences, whereas nop opcodes are removed completely with an RP of 100%. Double bswap and canceled add-sub pairs exhibit RPs of 75% and 62.5%, respectively. These results indicate that the policy-gradient agent consistently targets instructions whose sole effect is to alter the static byte sequence while leaving runtime semantics intact.
Qualitative inspection: Table 9 presents concrete opcode slices extracted from the same executable before and after refinement. In the first fragment, the adversary adds an extra push-pop pair and a redundant xchg ebx, ebx between consecutive call instructions. After refinement, only one push-pop token remains, and the xchg is entirely eliminated, shortening the sub-sequence from 13 to 11 opcodes without disturbing the call-site logic. The second slice contains a duplicated xor and an inserted bswap eax; bswap eax inside an arithmetic block. MalRefiner merges the duplicate xor and strips both bswap instructions, restoring the original four-instruction pattern.
Together, these statistics and case studies demonstrate that MalRefiner successfully learns to remove the same categories of semantic NOPs that adversaries inject, thereby recovering a representation that is both more concise and more amenable to correct classification and analysis.
Table 9 thus provides concrete proof-of-concept cases demonstrating the recovery of input sequences after semantic-NOP removal. To further clarify the performance recovery of the detectors shown in Table 6 and Table 7, we provide an explanation from the perspective of posterior scores, highlighting how MalRefiner mitigates elevated false-negative rates (FNRs). Specifically, the two malicious samples reported in Table 9 were subjected to an SRL attack and misclassified as benign by the baseline detector due to heavy insertion of semantic NOPs. As indicated in Table 10, the posterior scores for the malicious label dropped from 0.87 and 0.94 to 0.36 and 0.28, respectively, both falling below the detection threshold. After refinement with MalRefiner, removal of the inserted NOPs restored the posterior scores to 0.78 and 0.80, respectively, thereby enabling correct classification. It is worth noting that the semantic-NOP attack is specifically designed to conceal malicious content, and benign software is rarely subjected to such obfuscation. Consequently, MalRefiner has minimal impact on benign samples in this setting, and both the false-positive rate (FPR) under the SRL attack and the FPR after refinement remain essentially unchanged.
5.3. Computation Time and Stability Evaluation of MalRefiner Training
To assess the practical viability of MalRefiner in real-world scenarios, with particular emphasis on computational efficiency and training stability, we provide detailed analyses of the average rewards and computation time of MalRefiner in this part.
As illustrated in Figure 8, the average reward curves of MalRefiner on both datasets exhibit a clear convergence trend, though the rate of convergence and the degree of fluctuation differ noticeably. For the PEMML dataset in Figure 8a, the average reward enters a stable phase after approximately 2500 training steps, with a relatively narrow local variance band, indicating that the policy updates have become stable in the later training stages. Nevertheless, the early-stage rewards display certain negative fluctuations, which can be attributed to the exploratory nature of the policy network before it establishes a consistent decision-making pattern. In contrast, on the RawMal-TF dataset in Figure 8b, the reward curve shows a marked improvement after approximately 600 steps and reaches stability after around 1000 steps. Moreover, the overall curve suggests faster convergence and reduced volatility. This observation aligns with the relatively simple distribution of RawMal-TF samples, which facilitates stronger generalization capability for the underlying representation and policy networks.
Table 11 reports the total computation time of MalRefiner and the four downstream classifiers on both datasets. The results show that training MalRefiner dominates the computation time of the whole pipeline. On the PEMML dataset, training MalRefiner requires approximately 235,296 s, nearly two orders of magnitude more than the slowest baseline, MALIGN. For the RawMal-TF dataset, although the overall training time is reduced to 105,043 s, it remains significantly higher than that of lightweight classifiers such as TCN. This demonstrates the substantial computational overhead inherent to the reinforcement learning framework. Encouragingly, the training process needs to be performed only once, after which the model can be deployed and continuously utilized over an extended period.
5.4. Threats to Validity and Limitations
In this work, we focus on the removal of inserted semantic NOPs, while the executability of the resulting samples is not validated. This issue lies beyond the primary scope of the current study. However, the inverse feature mapping operations can be implemented using the same techniques employed in existing semantic NOP insertion methods.
Moreover, the generalization capability of the proposed method across various semantic NOP insertion attacks has not been thoroughly evaluated. In addition to automatic semantic NOP insertion techniques, experiments involving manual obfuscation methods, which typically offer higher stealthiness, could better demonstrate the effectiveness of our approach. Nevertheless, constructing a reliable manually obfuscated dataset is time-consuming and is left as future work.
From the perspective of the model itself, the limitations of the proposed method are summarized as follows:
High Training Overhead: One of the most significant drawbacks is the extensive training time required. The reward evaluation of the reinforcement learning (RL) model is computationally intensive. Furthermore, the framework involves two networks—namely, the environment network and the policy network—that need to be jointly optimized. Although RL offers a promising paradigm for sequential decision-making and is more resource-efficient than manual analysis, it still incurs substantial time costs during training and parameter tuning.
Discrete Semantic NOPs Not Addressed: The proposed method does not address the case of discrete semantic NOPs. As discussed in this paper, current semantic NOP insertion techniques tend to place NOPs in consecutive positions to avoid execution errors. More sophisticated and stealthy strategies may emerge in the future, such as distributing semantic NOP operations across non-adjacent instructions.