Article

Model-Based Reinforcement Learning for Containing Malware Propagation in Wireless Radar Sensor Networks

1 School of Mechanical and Electrical Engineering, Guangzhou University, Guangzhou 510006, China
2 School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou 510006, China
3 Big Data Institute, Central South University, Changsha 410083, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Actuators 2025, 14(9), 434; https://doi.org/10.3390/act14090434
Submission received: 18 July 2025 / Revised: 21 August 2025 / Accepted: 29 August 2025 / Published: 2 September 2025
(This article belongs to the Special Issue Intelligent Sensing, Control and Actuation in Networked Systems)

Abstract

To address malware containment challenges in WRSNs, where traditional integer-order models neglect propagation memory effects and standard reinforcement learning (RL) suffers from slow trial-and-error learning, we propose the following: (1) a fractional-order VCISQ epidemic model that captures temporal dependencies for higher accuracy, and (2) a model-based Soft Actor–Critic (MBSAC) method, which integrates a learned transition model into an actor–critic architecture to predict future states from limited data, thereby accelerating learning. Experiments confirm that MBSAC outperforms RL baselines by reducing control overhead, accelerating convergence, and enhancing robustness. It alleviates the rigidity of traditional control methods and establishes a reward-driven safeguard for WRSNs.

1. Introduction

Wireless radar sensor networks (WRSNs) have powerful detection, monitoring, and information dissemination capabilities [1]. They play an essential role in intelligent transportation systems, target identification and tracking, and environmental monitoring [2,3]. Ensuring the long-term stability of sensor networks hinges on the security of the network environment [4]. However, as shown in Figure 1, the intrusion and spread of malware in WRSNs are significant and pressing issues [5]. This directly endangers the normal communication functions of internal network devices. To effectively address this threat, many studies have concluded that proactively deploying control mechanisms is the most promising way to optimize and curb the spread of malware [6,7]. However, it is important to note that excessive control measures may be ineffective and costly. Therefore, the key challenge lies in determining and applying the optimal level of control to mitigate the spread of malware in WRSNs.
In order to generate a control object, it is essential to establish a valuable mathematical model of the spread of malware. Currently, many scholars use integer-order differential equations to model the spread of malware [5,8,9,10,11]. However, the role of integer-order models is inherently limited. Their core mechanism is to treat the system evolution as a process in which the system instantaneously transitions from one state to the next at each time step [12]. Notably, the actual propagation of malware is driven by cumulative effects that build up over a prolonged period of time. These include, but are not limited to, the infection scale and the propagation rate, which means that integer-order systems cannot accurately model the spread of malware. In contrast, fractional-order differential equations provide a superior framework. Their core mechanism lies in the fact that they incorporate the cumulative effect of the entire evolutionary history of the system through weighted integration [13]. This non-local dependence on historical trajectories makes the model more effective in fitting the long-term dynamics of malware spread.
In-depth research by many scholars has led to the use of many excellent traditional control methods to overcome the spread of malware [14,15,16]. These methods include the use of game theory [14], fuzzy control [15], and optimal control [16] to implement intermittent control of malware. The method that is most widely recognized by the research community is the optimal control method [4,10,16,17]. This is because optimal control provides a rigorous mathematical framework through variational methods or dynamic programming [4,17], enabling it to optimize the system’s performance indicators in the time domain [10] while satisfying complex dynamic constraints. While these traditional methods have been proven effective in controlling the spread of malware, they are highly dependent on precise mathematical models for complex deductive reasoning. However, due to the uncertainty of sensor network communication channels and the dynamic changes in network topology, the spread of malware is usually not represented by a precise mathematical model [18,19]. As malware self-updates and evolves [20], these traditional methods will gradually become less effective in controlling its spread. Therefore, it is critical to find a control method that relies solely on environmental observation, control input, and control effect evaluation.
Fortunately, a recent breakthrough in this area has been achieved through an artificial intelligence method that relies solely on rewards. Many researchers have used reinforcement learning (RL) methods to solve a variety of control problems [21,22,23,24]. Several studies have also used RL methods to address the issue of malware propagation [4,25,26]. Previous RL algorithms used to control the spread of malware include the Soft Actor–Critic (SAC) [4], Proximal Policy Optimization (PPO) [5], and Deep Q-Network (DQN) [27]. These model-free RL (MFRL) methods do not require precise mathematical models to output effective control and successfully overcome the limitations of traditional methods. However, these MFRL algorithms all aim to build an effective intelligent agent network using a large amount of interaction experience, which means that they have poor sample efficiency. Unfortunately, the high-speed diffusion properties of malware in WRSNs are fundamentally incompatible with RL's long training period [28]. This results in delayed strategy updates, which can prevent the system from accurately controlling the spread of malware. Therefore, improving RL's learning efficiency is a key area for further research.
Recently, researchers have used model predictive control (MPC) to improve the timeliness of control [29,30,31]. MPC can simulate the control results before making control decisions, allowing for the selection of a more optimal control sequence, which can effectively reduce unnecessary sample trials in the actual operation [32]. At the same time, it optimizes each time according to the latest system state, making full use of the sample information to accelerate the process of exploring strategies. However, these existing MPC methods depend on precise mathematical models and do not have the advantage of RL, which relies solely on self-interaction. Therefore, using MPC to generate an MFRL-based neural network predictive model will effectively improve RL’s sample efficiency. To the best of our knowledge, this is the first study to use a fractional-order propagation model to control the spread of malware in WRSNs. The proposed model-based RL (MBRL) method demonstrates superior learning efficiency in controlling the spread of malware.
Specifically, the main contributions of this paper are as follows:
(1)
Based on previous research [5], this paper constructs a fractional-order VCISQ propagation model to more accurately describe the spread of malware in WRSNs. At the same time, this paper derives a fractional-order optimal control method for the fractional-order VCISQ propagation model. Through fractional-order optimal control, the theoretical optimal control effect under treatment and quarantine measures is obtained. At the same time, the final control cost generated becomes the standard by which the control effect of RL methods is judged.
(2)
This study proposes a method called the Model-Based Soft Actor–Critic (MBSAC) MBRL algorithm, which is based on the general RL method SAC and has an additional prediction network that can use existing interaction data to predict future interaction environments. MBSAC can also use interaction and predictive data to optimize the agent network simultaneously, reducing the number of interactions required for RL methods and improving their learning efficiency. Compared to the current SOTA general RL methods, MBSAC has lower control costs, faster convergence, and more stable control processes.
The other sections of the text are organized as follows: Section 2 introduces related work; Section 3 builds the fractional-order VCISQ model and derives the fractional-order optimal control method; Section 4 proposes the MBRL algorithm MBSAC; Section 5 carries out multi-dimensional verification of the model and method; and Section 6 summarizes the contributions of this paper and suggests future improvements. Figure 2 outlines each step of the research process.

2. Related Work

In many existing studies, scholars have used integer-order models to model the spread of malware [5,9,10,11]. In [5], the authors proposed an integer-order VCISQ epidemic model to describe malware propagation in WRSNs and combined it with control methods to effectively suppress malware propagation. In [9], an integer-order hybrid epidemic propagation model was proposed to study the dynamic behavior of malware spreading via external storage devices. In [10], an integer-order epidemic-based malware propagation model was established to precisely describe the propagation patterns and random interference characteristics of malware in unmanned aerial vehicle–wireless sensor network (UAV-WSN) systems. In [11], an integer-order epidemic model was constructed by scholars to analyze the propagation characteristics of malware in multi-layer networks, and a hybrid maintenance strategy was introduced to suppress malware propagation. These propagation models effectively simulate malware propagation to a certain extent. However, the non-memory characteristics of integer-order systems do not align with the actual propagation effects of malware [12,33]. Therefore, they cannot accurately fit the propagation of malware. It is worth noting that the global memory property of fractional-order differential equations partially overcomes this limitation, making the propagation model more aligned with reality [13].
Figure 2. Research flowchart.
Some excellent traditional control methods have been used by researchers to control malware [5,14,15,16]. In [5], researchers built a hybrid control strategy and proposed an integer-order optimal control method to suppress the spread of malware in WRSNs. In [14], scholars have proposed a patching strategy based on the differential game theory method to effectively suppress the spread of malware in the Industrial Internet of Things (IIoT). In [15], a self-adaptive integer-order SIR model is proposed, combining fuzzy control and gradient descent optimization to control the spread of malware in IIoT networks. In [16], researchers effectively reduce the spread of malware in IoT networks and enhance system security using an integer-order optimal control strategy. These traditional control methods have proven effective in controlling the spread of malware. However, they require a precise mathematical model of the malware propagation process, which is difficult to obtain due to the high degree of uncertainty in sensor network communication channels and network topology [18,19], thus reducing the adaptability of traditional control algorithms.
To overcome the limitations of traditional control methods in terms of adaptability, different deep learning (DL) methods have been developed to control the spread of malware [34,35,36,37]. In [34], a DL-based fine-grained hierarchical learning method was proposed, which effectively controlled the spread of malware in IoT by extracting its behavioral patterns. In [35], a novel network forensics framework based on multi-layer perceptrons was proposed to enhance the network’s ability to suppress the spread of malware. In [36], a DL method based on multi-layer convolutional neural networks was proposed to effectively prevent the spread of malware in IIoT. In [37], a self-supervised learning method based on a masked autoencoder framework was proposed, which identified malware network traffic to achieve effective control of malware. Due to the fact that DL methods only depend on the network model obtained during training, they are more adaptable than traditional methods. However, DL methods often require a large amount of high-quality manually labeled data for network training. In practice, obtaining large amounts of data for WRSNs is costly [5,10], which means that using DL methods alone to suppress the spread in WRSNs is not realistic.
In environments with a strong demand for adaptability, the study of controlling malware using MFRL methods has emerged [4,25,26,38].
In [4], the Multi-Agent SAC (MASAC) method is used to effectively control the spread of malware in the IoV. In [25], a framework called GAME-RL is proposed to generate adversarial malware samples to effectively prevent the spread of malware. In [26], a method combining federated learning and the Q-RL framework was proposed to effectively control the spread of malware on IoT devices. In [38], researchers proposed an RL-based functional verification attack framework that effectively increased the efficiency of generating adversarial examples and controlled the spread of malware. These MFRL methods largely overcome the need for precise models and high-quality data. However, an excellent policy network often requires a long training period to be established [13], and this long training period is in fundamental conflict with the fast spread of malware in WRSNs [28].
Fortunately, new work on MPC is constantly emerging [30,39,40]. In [30], an MPC method based on a feedback algorithm was proposed to solve the optimization problem of the cost function and constraints in MPC. In [39], scholars proposed a framework based on DL and MPC methods to control and mitigate the security threats posed by tampering attacks on the power grid. In [40], a self-adaptive control framework combining an unknown input observer and MPC methods was designed to effectively control the safety and stability of the power system. These studies all use the predictive optimization idea of MPC to determine the optimal control sequence in advance. However, MPC methods alone cannot do without precise mathematical models [41], and lack the self-optimizing ability of RL. Conversely, if MPC’s predictive control sequence can be used to assist RL in exploration, it will accelerate RL training. With this approach, MFRL will become MBRL and achieve higher sample efficiency [42,43,44]. Table 1 presents a comparison of the relevant works.

3. Fractional-Order Model and Optimal Control Method

3.1. Fractional-Order Malware Propagation Model

In previous work [5], a model for the propagation of the VCISQ malware was proposed. As shown in Figure 3, this model contains five state nodes, as follows: susceptible node V ( t ) , carrier node C ( t ) , infected node I ( t ) , secured node S ( t ) , and quarantined node Q ( t ) .
(1)
Susceptible node V ( t ) : V ( t ) is a sensor node with vulnerabilities but not yet infected with the malware. It may carry the malware and transform into the carrier node C ( t ) by interacting with the infected node I ( t ) . Alternatively, it may transform into the quarantined node Q ( t ) due to quarantine measures.
(2)
Carrier node C ( t ) : After τ i time, the malware is activated and the carrier node C ( t ) is transformed into the infected node I ( t ) .
(3)
Infected node I ( t ) : I ( t ) is a node that has activated and is spreading the malware. It will change to the secured node S ( t ) after τ p time, due to treatment measures such as installing a patch.
(4)
Secured node S ( t ) : S ( t ) is a node that has been patched and restored, and is temporarily immune to malware. However, the patch effect may weaken after H time, and it may revert to a susceptible node V ( t ) .
(5)
Quarantined node Q ( t ) : Q ( t ) is a node that has been quarantined and is no longer participating in network interactions. It may revert to susceptible node V ( t ) after h time if the quarantine fails. Alternatively, it may be transformed into the secured node S ( t ) if treatment measures such as installing a patch are implemented.
In the model, the WRSN system considers the delay factors in the real system. These include the delay τ i in installing malware, which describes the time it takes for the malware to be carried and activated. The delay τ p in installing patches describes the time it takes for the patch to be deployed and take effect. The delay τ s in patch expiration describes the time it takes for the patch to gradually expire on the secured node S ( t ) . The delay h in quarantine expiration describes the time it takes for the quarantine measures to gradually expire on the quarantined node Q ( t ) .
The VCISQ model can partially simulate the spread of malware in WRSNs by integrating multiple factors. However, integer-order models have inherent limitations. Their current state is only determined by the previous moment. This is unable to match the continuous influence of past states on current states in the actual spread of malware [33]. To effectively describe such phenomena with genetic characteristics and cumulative effects, Caputo fractional-order models have proven to be highly applicable [13]. Considering the need for a more accurate description of the effects of malware spread, this study proposes the VCISQ malware Caputo fractional-order spread model:
$$
\begin{cases}
{}^{C}_{t_0}D^{q}_{t_d}V(t) = -2\phi(t)I(t)\mu(t) + \Lambda S(t-\tau_s) - \alpha_2(t)V(t) + (1-k)\alpha_2(t-h)V(t-h)e^{-h\xi},\\[1mm]
{}^{C}_{t_0}D^{q}_{t_d}C(t) = 2\phi(t)I(t)\mu(t) - 2\Omega\phi(t-\tau_i)I(t-\tau_i)\mu(t-\tau_i),\\[1mm]
{}^{C}_{t_0}D^{q}_{t_d}I(t) = 2\Omega\phi(t-\tau_i)I(t-\tau_i)\mu(t-\tau_i) - \alpha_1(t-\tau_p)I(t-\tau_p),\\[1mm]
{}^{C}_{t_0}D^{q}_{t_d}S(t) = \alpha_1(t-\tau_p)I(t-\tau_p) - \Lambda S(t-\tau_s) + k\alpha_2(t-h)V(t-h)e^{-h\xi},\\[1mm]
{}^{C}_{t_0}D^{q}_{t_d}Q(t) = \alpha_2(t)V(t) - k\alpha_2(t-h)V(t-h)e^{-h\xi} - (1-k)\alpha_2(t-h)V(t-h)e^{-h\xi}.
\end{cases}
$$
Among them, ${}^{C}_{t_0}D^{q}_{t_d}$ denotes the $q$-order Caputo fractional derivative, where $t_0$ is the initial time and $t_d$ is the target time. Unlike integer-order differential equations, if $Y(t)$ requires a Caputo fractional derivative expansion, it follows the expression below [4]. It should be noted that the Gamma function is represented by $\Gamma(z)$, defined as $\Gamma(z) = \int_{0}^{\infty} e^{-s} s^{z-1}\,ds$.
$$
{}^{C}_{t_0}D^{q}_{t_d}Y(t) = \frac{1}{\Gamma(1-q)}\int_{t_0}^{t}\frac{Y'(s)}{(t-s)^{q}}\,ds, \quad 0 < q \le 1.
$$
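For readers who wish to simulate the model numerically, a Caputo initial-value problem $D^{q}y = f(t, y)$ is commonly rewritten in its equivalent Volterra integral form and discretized. The sketch below is a minimal fractional forward-Euler solver under that convention; it ignores the delay terms of model (1) (which additionally require storing the state history), and the function names, step counts, and the scalar test system are illustrative assumptions rather than part of the paper.

```python
import math
import numpy as np

def fractional_euler(f, y0, q, t_end, n_steps):
    """Explicit fractional Euler scheme for the Caputo IVP
    D^q y(t) = f(t, y(t)), y(0) = y0, 0 < q <= 1,
    obtained by discretizing the equivalent Volterra integral equation."""
    h = t_end / n_steps
    y = np.zeros((n_steps + 1, len(y0)))
    y[0] = np.asarray(y0, dtype=float)
    rhs = [f(0.0, y[0])]                       # history of f(t_j, y_j): the memory of the system
    coef = h ** q / math.gamma(q + 1.0)
    for n in range(n_steps):
        j = np.arange(n + 1)
        # convolution weights b_{j,n+1} = (n+1-j)^q - (n-j)^q encode the fading memory effect
        b = (n + 1 - j) ** q - (n - j) ** q
        y[n + 1] = y[0] + coef * np.einsum("j,jk->k", b, np.asarray(rhs))
        rhs.append(f((n + 1) * h, y[n + 1]))
    return y

if __name__ == "__main__":
    # toy check on D^q y = -y, whose exact solution is the Mittag-Leffler function E_q(-t^q)
    traj = fractional_euler(lambda t, y: -y, [1.0], q=0.9, t_end=5.0, n_steps=500)
    print(traj[-1])
```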
The initial value $G_l \in [0, \infty)$, $l = 1, 2, \ldots, 5$, of each state equation is non-negative and invariant. According to the above settings, the initial conditions of the VCISQ model (1) follow the expression below:
( V ( t ) , C ( t ) , I ( t ) , S ( t ) , Q ( t ) ) T = ( G 1 , G 2 , G 3 , G 4 , G 5 ) T
In (1), ϕ ( t ) represents the number of neighboring nodes of susceptible node V ( t ) in the VCISQ model. μ ( t ) represents the contact rate between neighboring nodes in the VCISQ model. In [5], ϕ ( t ) is established based on the interaction range of radar nodes, Rayleigh fading, path loss, SNR threshold, node spatial distribution, and the degree of difference between ideal and actual communication. In [5], μ ( t ) is established based on the time-varying effects of node distribution, contact opportunities, and device transmission power.
Specifically, ϕ ( t ) is given by the following expression:
ϕ ( t ) = 2 q c π χ 2 a α σ n 4 a γ t h 2 a a 2 Γ 2 a V ( t )
Additionally, μ ( t ) follows the following expression:
μ ( t ) = 1 e 1 2 η t , η = 2 ρ π χ 2 a α σ n 4 a γ t h 2 a Γ 2 a .
To optimize the VCISQ model (1) for the suppression of malware propagation, it is necessary to introduce treatment and quarantine control quantities $(\alpha_1(t), \alpha_2(t))$ to implement intermittent control with varying intensity. However, the inherent resource limitations of WRSNs restrict the attainable control intensities $(\alpha_1, \alpha_2)$. To balance control effectiveness and resource consumption within these constraints, we must perform a real-time evaluation of the system state at each time point $t$ and decide on the optimal control action $(\alpha_1(t), \alpha_2(t))$. We consider the allowable control function space $k$ on the time interval $T$, which is defined as the set of all Lebesgue measurable admissible control functions:
$$
k = \left\{ (\alpha_1(\cdot), \alpha_2(\cdot)) \in L^{\infty}(0, t_d) : 0 \le \alpha_1(\cdot), \alpha_2(\cdot) \le 1 \right\}.
$$
Among them, t d represents the termination control moment.
The core objective of implementing control for the VCISQ model (1) is to minimize control costs while suppressing the spread of malware. Susceptible nodes V ( t ) are at risk of being attacked by malware and gradually becoming unable to work. The number of infected nodes I ( t ) directly affects the security of the WRSN system. Excessive quarantined nodes Q ( t ) also prevent WRSN systems from operating efficiently. Therefore, reducing these nodes is an effective way to mitigate the spread of malware. Control measures α 1 ( t ) and α 2 ( t ) are introduced to quickly suppress the spread of malware, but their implementation consumes energy and bandwidth resources. Weighting coefficients B 1 , B 2 , B 3 , B 4 , and  B 5 are used to quantify the impact of different parts in the cost function. Specifically, they include the damage caused by different numbers of sensor nodes and the resource consumption of different control measures. The cost function is defined as follows:
$$
J\big(V(t), I(t), Q(t), \alpha_1(t), \alpha_2(t)\big) = \int_{0}^{T}\Big[ B_1 V(t) + \big(B_2 I(t)\big)^{\beta} + B_3 Q(t) + B_4\alpha_1^{2}(t) + B_5\alpha_2^{2}(t) \Big]\,dt.
$$
The parameters α 1 ( t ) and α 2 ( t ) in the equation represent control measures to suppress the spread of malware, while the weighting coefficients B 1 , B 2 , B 3 , B 4 , and  B 5 represent the damage caused by different numbers of sensor nodes and the resource consumption of different control measures, respectively.
All parameters involved in the VCISQ model (1) are defined in Table 2.
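As a concrete illustration, the cost functional (7) can be approximated from a sampled trajectory with a simple quadrature rule. The sketch below uses the trapezoidal rule; the default weights follow the values quoted in Section 5.1.1, while the exponent beta = 1 and the array layout are illustrative assumptions.

```python
import numpy as np

def control_cost(t, V, I, Q, a1, a2, B=(300.0, 900.0, 300.0, 12.0, 12.0), beta=1.0):
    """Trapezoidal approximation of the cost functional J in (7).
    t, V, I, Q, a1, a2 are 1-D arrays sampled on the same time grid."""
    B1, B2, B3, B4, B5 = B
    integrand = B1 * V + (B2 * I) ** beta + B3 * Q + B4 * a1 ** 2 + B5 * a2 ** 2
    return np.trapz(integrand, t)

# usage sketch: cost = control_cost(t_grid, V_traj, I_traj, Q_traj, alpha1_traj, alpha2_traj)
```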

3.2. Fractional Optimal Control Method

In this section, we will derive the fractional-order optimal control method for the fractional-order model in Section 3.1. For a known control model, the state vector Z ( t ) , the delay vector D ( t ) , and the control vector L ( t ) are defined as follows:
Z ( t ) = ( V ( t ) , C ( t ) , I ( t ) , S ( t ) , Q ( t ) ) , D ( t ) = ( Z ( t τ s ) , Z ( t τ i ) , Z ( t τ p ) ) , L ( t ) = ( α 1 ( t ) , α 2 ( t ) ) .
According to the formula, the fractional-order VCISQ model (1) can be converted into the following equation: t 0 C D t q Z ( t ) = f ( Z ( t ) , D ( t ) , L ( t ) ) . The corresponding Hamilton function is then as follows:
$$
\begin{aligned}
H\big(Z(t), D(t), L(t), \lambda(t)\big) ={}& B_1 V(t) + \big(B_2 I(t)\big)^{\beta} + B_3 Q(t) + B_4\alpha_1^{2}(t) + B_5\alpha_2^{2}(t) + \lambda(t)\,{}^{C}_{t_0}D^{q}_{t_d}Z(t)\\
={}& B_1 V(t) + \big(B_2 I(t)\big)^{\beta} + B_3 Q(t) + B_4\alpha_1^{2}(t) + B_5\alpha_2^{2}(t) + \lambda(t)\,f\big(Z(t), D(t), L(t)\big)\\
={}& B_1 V(t) + \big(B_2 I(t)\big)^{\beta} + B_3 Q(t) + B_4\alpha_1^{2}(t) + B_5\alpha_2^{2}(t)\\
&+ \lambda_1(t)\big[-2\phi(t)I(t)\mu(t) + \Lambda S(t-\tau_s) - \alpha_2(t)V(t) + (1-k)\alpha_2(t-h)V(t-h)e^{-h\xi}\big]\\
&+ \lambda_2(t)\big[2\phi(t)I(t)\mu(t) - 2\Omega\phi(t-\tau_i)I(t-\tau_i)\mu(t-\tau_i)\big]\\
&+ \lambda_3(t)\big[2\Omega\phi(t-\tau_i)I(t-\tau_i)\mu(t-\tau_i) - \alpha_1(t-\tau_p)I(t-\tau_p)\big]\\
&+ \lambda_4(t)\big[\alpha_1(t-\tau_p)I(t-\tau_p) - \Lambda S(t-\tau_s) + k\alpha_2(t-h)V(t-h)e^{-h\xi}\big]\\
&+ \lambda_5(t)\big[\alpha_2(t)V(t) - k\alpha_2(t-h)V(t-h)e^{-h\xi} - (1-k)\alpha_2(t-h)V(t-h)e^{-h\xi}\big].
\end{aligned}
$$
Among them, $\lambda(t) = (\lambda_1(t), \ldots, \lambda_5(t))$, $i = 1, 2, \ldots, 5$, denotes the costate (adjoint) variables associated with the state equations. They are obtained by taking the partial derivatives of the Hamiltonian with respect to the corresponding state variables.
Since $\alpha_1(t)$ and $\alpha_2(t)$ are the control variables acting on their respective state variables, and with the indicator functions $\chi$ defined below, each costate $\lambda(t)$ must satisfy the following conditions:
$$
\begin{cases}
{}^{C}_{t_0}D^{q}_{t_d}\lambda(t) = -\Big[\dfrac{\partial H}{\partial Z(t)} + \chi_{[0,\,t_d-\tau_s]}(t)\,\dfrac{\partial H}{\partial D(t+\tau_s)} + \chi_{[0,\,t_d-\tau_i]}(t)\,\dfrac{\partial H}{\partial D(t+\tau_i)} + \chi_{[0,\,t_d-\tau_p]}(t)\,\dfrac{\partial H}{\partial D(t+\tau_p)} + \chi_{[0,\,t_d-h]}(t)\,\dfrac{\partial H}{\partial L(t+h)}\Big],\\[2mm]
\lambda_i(t_d) = 0, \quad i = 1, 2, \ldots, 5,\\[2mm]
H\big(Z^*(t), D^*(t), L^*(t), \lambda^*(t)\big) = \min\limits_{0 \le L(t) \le 1} H\big(Z(t), D(t), L(t), \lambda^*(t)\big).
\end{cases}
$$
Here $\chi_{[0,\,t_d-\tau_s]}(t) = 1$ if $t \in [0, t_d-\tau_s]$ and $0$ otherwise; the indicator functions $\chi_{[0,\,t_d-\tau_i]}(t)$, $\chi_{[0,\,t_d-\tau_p]}(t)$, and $\chi_{[0,\,t_d-h]}(t)$ are defined analogously.
Theorem 1.
For the VCISQ model (1), the costate variables can be expressed as follows:
$$
\begin{cases}
{}^{C}_{t_0}D^{q}_{t_d}\lambda_1(t) = -\Big(B_1 + \big(\lambda_2(t)-\lambda_1(t)\big)\dfrac{\phi(t)}{V(t)}I(t)\mu(t) + \big(\lambda_5(t)-\lambda_1(t)\big)\alpha_2(t) + \big(\lambda_3(t+\tau_i)-\lambda_2(t+\tau_i)\big)\Omega\dfrac{\phi(t)}{V(t)}I(t)\mu(t)\\
\qquad\qquad\qquad\; + \lambda_1(t+h)(1-k)\alpha_2(t)e^{-h\xi} + \lambda_4(t+h)k\alpha_2(t)e^{-h\xi} - \lambda_5(t+h)\alpha_2(t)e^{-h\xi}\Big),\\[1mm]
{}^{C}_{t_0}D^{q}_{t_d}\lambda_2(t) = 0,\\[1mm]
{}^{C}_{t_0}D^{q}_{t_d}\lambda_3(t) = -\Big(\beta B_2\big(B_2 I(t)\big)^{\beta-1} + \big(\lambda_2(t)-\lambda_1(t)\big)\phi(t)I(t)\mu(t) + \big(\lambda_4(t+\tau_p)-\lambda_3(t+\tau_p)\big)\alpha_1(t)\\
\qquad\qquad\qquad\; + \big(\lambda_3(t+\tau_i)-\lambda_2(t+\tau_i)\big)\Omega\phi(t)I(t)\mu(t)\Big),\\[1mm]
{}^{C}_{t_0}D^{q}_{t_d}\lambda_4(t) = -\big(\lambda_1(t+\tau_s)-\lambda_4(t+\tau_s)\big)\Lambda,\\[1mm]
{}^{C}_{t_0}D^{q}_{t_d}\lambda_5(t) = -B_3.
\end{cases}
$$
The optimal control variables ( α 1 * ( t ) , α 2 * ( t ) ) can then be obtained:
$$
\begin{aligned}
\alpha_1^*(t) &= \max\left\{\min\left\{1, \frac{\big(\lambda_3(t+\tau_p)-\lambda_4(t+\tau_p)\big)I(t)}{2B_4}\right\}, 0\right\},\\
\alpha_2^*(t) &= \max\left\{\min\left\{1, \frac{\big(\lambda_1(t)-\lambda_5(t)\big)V(t)}{2B_5} - \frac{\big(\lambda_1(t+h)-\lambda_5(t+h)\big)V(t)e^{-h\xi}}{2B_5} - \frac{\big(\lambda_4(t+h)-\lambda_1(t+h)\big)kV(t)e^{-h\xi}}{2B_5}\right\}, 0\right\}.
\end{aligned}
$$
The optimal control model of VCISQ is obtained by substituting the optimal control variables ( α 1 * ( t ) , α 2 * ( t ) ) into the VCISQ model (1):
$$
\begin{cases}
{}^{C}_{t_0}D^{q}_{t_d}V^*(t) = -2\phi(t)I^*(t)\mu(t) + \Lambda S^*(t-\tau_s) - \alpha_2^*(t)V^*(t) + (1-k)\alpha_2^*(t-h)V^*(t-h)e^{-h\xi},\\[1mm]
{}^{C}_{t_0}D^{q}_{t_d}C^*(t) = 2\phi(t)I^*(t)\mu(t) - 2\Omega\phi(t-\tau_i)I^*(t-\tau_i)\mu(t-\tau_i),\\[1mm]
{}^{C}_{t_0}D^{q}_{t_d}I^*(t) = 2\Omega\phi(t-\tau_i)I^*(t-\tau_i)\mu(t-\tau_i) - \alpha_1^*(t-\tau_p)I^*(t-\tau_p),\\[1mm]
{}^{C}_{t_0}D^{q}_{t_d}S^*(t) = \alpha_1^*(t-\tau_p)I^*(t-\tau_p) - \Lambda S^*(t-\tau_s) + k\alpha_2^*(t-h)V^*(t-h)e^{-h\xi},\\[1mm]
{}^{C}_{t_0}D^{q}_{t_d}Q^*(t) = \alpha_2^*(t)V^*(t) - k\alpha_2^*(t-h)V^*(t-h)e^{-h\xi} - (1-k)\alpha_2^*(t-h)V^*(t-h)e^{-h\xi}.
\end{cases}
$$
Proof. 
The costate variables $\lambda(t)$ are determined by Pontryagin's maximum principle and (10). Specifically, they are obtained by computing the partial derivatives of the Hamiltonian $H$ (9) of the VCISQ system with respect to each state variable:
$$
\begin{cases}
{}^{C}_{t_0}D^{q}_{t_d}\lambda_1(t) = -\dfrac{\partial H}{\partial V(t)} = -\Big(B_1 + \big(\lambda_2(t)-\lambda_1(t)\big)\dfrac{\phi(t)}{V(t)}I(t)\mu(t) + \big(\lambda_5(t)-\lambda_1(t)\big)\alpha_2(t) + \big(\lambda_3(t+\tau_i)-\lambda_2(t+\tau_i)\big)\Omega\dfrac{\phi(t)}{V(t)}I(t)\mu(t)\\
\qquad\qquad\qquad\qquad\quad\; + \lambda_1(t+h)(1-k)\alpha_2(t)e^{-h\xi} + \lambda_4(t+h)k\alpha_2(t)e^{-h\xi} - \lambda_5(t+h)\alpha_2(t)e^{-h\xi}\Big),\\[1mm]
{}^{C}_{t_0}D^{q}_{t_d}\lambda_2(t) = -\dfrac{\partial H}{\partial C(t)} = 0,\\[1mm]
{}^{C}_{t_0}D^{q}_{t_d}\lambda_3(t) = -\dfrac{\partial H}{\partial I(t)} = -\Big(\beta B_2\big(B_2 I(t)\big)^{\beta-1} + \big(\lambda_2(t)-\lambda_1(t)\big)\phi(t)I(t)\mu(t) + \big(\lambda_4(t+\tau_p)-\lambda_3(t+\tau_p)\big)\alpha_1(t)\\
\qquad\qquad\qquad\qquad\quad\; + \big(\lambda_3(t+\tau_i)-\lambda_2(t+\tau_i)\big)\Omega\phi(t)I(t)\mu(t)\Big),\\[1mm]
{}^{C}_{t_0}D^{q}_{t_d}\lambda_4(t) = -\dfrac{\partial H}{\partial S(t)} = -\big(\lambda_1(t+\tau_s)-\lambda_4(t+\tau_s)\big)\Lambda,\\[1mm]
{}^{C}_{t_0}D^{q}_{t_d}\lambda_5(t) = -\dfrac{\partial H}{\partial Q(t)} = -B_3.
\end{cases}
$$
Then, the partial derivatives of the Hamiltonian $H$ with respect to the control variables $\alpha_1(t)$ and $\alpha_2(t)$ are computed. The first-order optimality conditions $\frac{\partial H}{\partial \alpha_k(t)} = 0$, $k = 1, 2$, combined with the admissible control bounds in (6), give the following:
$$
\begin{aligned}
\left.\frac{\partial H}{\partial \alpha_1(t)}\right|_{\alpha_1(t) = \alpha_1^*(t)} &= 2B_4\alpha_1(t) + \lambda_4(t+\tau_p)I(t) - \lambda_3(t+\tau_p)I(t) = 0,\\
\left.\frac{\partial H}{\partial \alpha_2(t)}\right|_{\alpha_2(t) = \alpha_2^*(t)} &= 2B_5\alpha_2(t) - \big(\lambda_1(t)-\lambda_5(t)\big)V(t) + \big(\lambda_1(t+h)-\lambda_5(t+h)\big)V(t)e^{-h\xi} + \big(\lambda_4(t+h)-\lambda_1(t+h)\big)kV(t)e^{-h\xi} = 0.
\end{aligned}
$$
Applying a transformation to (12) yields the desired optimal control pair ( α 1 ( t ) , α 2 ( t ) ):
$$
\begin{aligned}
\alpha_1^*(t) &= \max\left\{\min\left\{1, \frac{\big(\lambda_3(t+\tau_p)-\lambda_4(t+\tau_p)\big)I(t)}{2B_4}\right\}, 0\right\},\\
\alpha_2^*(t) &= \max\left\{\min\left\{1, \frac{\big(\lambda_1(t)-\lambda_5(t)\big)V(t)}{2B_5} - \frac{\big(\lambda_1(t+h)-\lambda_5(t+h)\big)V(t)e^{-h\xi}}{2B_5} - \frac{\big(\lambda_4(t+h)-\lambda_1(t+h)\big)kV(t)e^{-h\xi}}{2B_5}\right\}, 0\right\}.
\end{aligned}
$$
Thus, Theorem 1 has been proven.    □
Theorem 2.
The positivity and boundedness of the solution are prerequisites for the existence of an optimal control pair for the fractional-order dynamics model. For the VCISQ model in (1), the solution remains non-negative and bounded at all times t under the initial condition (5).
Proof. 
Based on the fact that $\mathbb{R}^5_+ = \{\phi \in \mathbb{R}^5 : \phi \ge 0\}$, Theorem 3 in [17] shows that the VCISQ model (1) has a unique solution on $(0, \infty)$. Based on the above conditions, the key to proving the theorem is to show that the positive domain $\mathbb{R}^5_+$ is positively invariant for the VCISQ model (1).
From (1) and (3), we can directly obtain $V(t-h) \in \mathbb{R}_+$, $I(t-\tau_i) \in \mathbb{R}_+$, $\mu(t-\tau_i) \in \mathbb{R}_+$, $\phi(t-\tau_i) \in \mathbb{R}_+$, $S(t-\tau_s) \in \mathbb{R}_+$, $\alpha_1(t-\tau_p) \in \mathbb{R}_+$, $\alpha_2(t-h) \in \mathbb{R}_+$. Let $\phi = (V(t), C(t), I(t), S(t), Q(t))^T = 0$. The state values of the VCISQ model (1) at this time are as follows:
$$
\begin{cases}
{}^{C}_{t_0}D^{q}_{t_d}V(t)\big|_{V(t)=0} = -2\phi(t)I(t)\mu(t) + \Lambda S(t-\tau_s) + (1-k)\alpha_2(t-h)V(t-h)e^{-h\xi} \ge 0,\\[1mm]
{}^{C}_{t_0}D^{q}_{t_d}C(t)\big|_{C(t)=0} = 2\phi(t)I(t)\mu(t) - 2\Omega\phi(t-\tau_i)I(t-\tau_i)\mu(t-\tau_i) \ge 0,\\[1mm]
{}^{C}_{t_0}D^{q}_{t_d}I(t)\big|_{I(t)=0} = 2\Omega\phi(t-\tau_i)I(t-\tau_i)\mu(t-\tau_i) - \alpha_1(t-\tau_p)I(t-\tau_p) \ge 0,\\[1mm]
{}^{C}_{t_0}D^{q}_{t_d}S(t)\big|_{S(t)=0} = \alpha_1(t-\tau_p)I(t-\tau_p) - \Lambda S(t-\tau_s) + k\alpha_2(t-h)V(t-h)e^{-h\xi} \ge 0,\\[1mm]
{}^{C}_{t_0}D^{q}_{t_d}Q(t)\big|_{Q(t)=0} = \alpha_2(t)V(t) - k\alpha_2(t-h)V(t-h)e^{-h\xi} - (1-k)\alpha_2(t-h)V(t-h)e^{-h\xi} \ge 0.
\end{cases}
$$
Let $t_0$ be the initial moment and define the vector $\phi = (V(t_0), C(t_0), I(t_0), S(t_0), Q(t_0))^T \in \mathbb{R}^5_+$. Then, based on (4), it follows that $\mathbb{R}^5_+$ constitutes a positively invariant set of the control system (1).
Define Z ( t ) = ( V ( t ) , C ( t ) , I ( t ) , S ( t ) , Q ( t ) ) , and according to the VCISQ model (1) and the initial conditions (5), it can be shown that:
$$
{}^{C}_{t_0}D^{q}_{t}Z(t) \le G_1 - (1-k)Z(t).
$$
The following inequality can be derived from this inequality:
$$
\limsup_{t\to\infty} Z(t) \le \frac{G_1}{1-k}.
$$
Thus, Theorem 2 has been proven. Therefore, the VCISQ model (1) proposed in this paper is non-negative and bounded, so it has an optimal control pair in the fractional-order optimal control method.    □
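Numerically, the optimality system of Theorem 1 is usually solved with a forward–backward sweep: integrate the state system (1) forward, integrate the costate system of Theorem 1 backward from $\lambda_i(t_d) = 0$, and then update the controls with the projected expressions of Theorem 1, repeating until convergence. The sketch below shows only the control-update (projection) step on a uniform time grid; the index handling near the boundary, the sign of the $e^{-h\xi}$ factor, and all argument names are assumptions made for illustration.

```python
import numpy as np

def project_controls(lmbda, V, I, tau_p_idx, h_idx, k, xi, h_delay, B4, B5):
    """Control-update step of a forward-backward sweep: given the costates lmbda
    (array of shape [5, N], rows ordered lambda_1..lambda_5) and the state
    trajectories V, I on the same grid, evaluate the expressions of Theorem 1
    and clip the result to the admissible set [0, 1]."""
    N = V.shape[0]
    a1, a2 = np.zeros(N), np.zeros(N)
    decay = np.exp(-h_delay * xi)
    for t in range(N):
        tp = min(t + tau_p_idx, N - 1)   # grid index of t + tau_p, held at the final time
        th = min(t + h_idx, N - 1)       # grid index of t + h
        a1[t] = np.clip((lmbda[2, tp] - lmbda[3, tp]) * I[t] / (2.0 * B4), 0.0, 1.0)
        a2[t] = np.clip(((lmbda[0, t] - lmbda[4, t]) * V[t]
                         - (lmbda[0, th] - lmbda[4, th]) * V[t] * decay
                         - (lmbda[3, th] - lmbda[0, th]) * k * V[t] * decay) / (2.0 * B5),
                        0.0, 1.0)
    return a1, a2
```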

3.3. Algorithm Complexity Analysis

Time Complexity
The overall time complexity of MBSAC is dominated by three core operations per training episode E:
  • Environment Interaction ( O 2 ): Requires O ( N × T × t e ) per episode, where N = 1 (single agent), T is episode length, and  t e is per-step interaction time.
  • Predictive Model Training ( O m o d e l ): Involves O ( E t × K 1 ) computations per episode for E t training iterations over replay buffer K 1 .
  • Network Updates ( O 3 ): Demands O ( t g · T × U p ) per episode, where t g is the gradient step interval and U p denotes policy/critic update time.
The aggregate time complexity across E episodes is as follows:
O t i m e = O ( E × [ T · t e + E t · K 1 + t g · T · U p ] )

4. Model-Based Soft Actor–Critic

4.1. Soft Actor–Critic

The implementation of MBSAC is based on the SAC algorithm, a general reinforcement learning algorithm that usually performs well for continuous state–action space Markov decision processes (MDPs) [4]. It optimizes its strategy to maximize expected returns. As shown in Figure 4, this optimization process involves iterative interaction with the fractional-order VCISQ model (1). The algorithm's implementation architecture comprises five multilayer perceptron (MLP) networks, each of which has an input layer, a hidden layer, and an output layer. The policy network $P_\varphi$, defined by parameters $\varphi$, is responsible for action sampling; the value function is estimated by two critic networks $C_{\theta_1}$ and $C_{\theta_2}$, defined by parameters $\theta_1$ and $\theta_2$, respectively; and two target critics $C_{\bar\theta_1}$ and $C_{\bar\theta_2}$, defined by parameters $\bar\theta_1$ and $\bar\theta_2$, are designed to generate reliable value-target estimates. Their core purpose is to overcome the instability caused by frequent updates to the network. Before interacting with the VCISQ model, the SAC agent observes its initial state $s_0$. The SAC agent then observes the state $s_t$ of the VCISQ model (1) at each moment. After the state $s_t$ is obtained, it is input into the policy network to obtain the control action $a_t$. The policy network constitutes the behavior strategy $\pi_\varphi(a_t|s_t)$, which determines the generation of $a_t$. Specifically, $\pi_\varphi(a_t|s_t)$ represents the probability that the agent will output action $a_t$ in state $s_t$. As shown in the following equation, $a_t$ is quantized by SAC as the control strengths applied against the VCISQ malware:
s t = { V ( t ) , C ( t ) , I ( t ) , S ( t ) , Q ( t ) } , a t = { α 1 ( t ) , α 2 ( t ) }
The mean μ t and standard deviation σ t are implicitly output as a probability distribution by the MLP policy network after processing the state s t . These two values together determine the action a t . In the VCISQ continuous action space, this sampling mechanism will cause loss of differentiability. However, subsequent gradient updates depend on differentiability. To overcome this defect, reparameterization techniques are necessary [4]:
$$
a_t = \delta_t \cdot e^{\log\sigma_t} + \mu_t.
$$
Among these, the external random source $\delta_t$ is drawn from a fixed distribution that does not depend on the network parameters, so the sampling step remains differentiable with respect to $\mu_t$ and $\sigma_t$ at any time $t$.
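A minimal PyTorch sketch of such a reparameterized Gaussian policy is given below. The hidden sizes, the log-standard-deviation clamp, and the sigmoid squashing that keeps the sampled actions inside the admissible set [0, 1] are assumptions made for illustration; the reparameterization formula above itself does not prescribe a squashing function.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """MLP policy that outputs the mean and log standard deviation of a
    Gaussian over the two control actions (alpha_1, alpha_2)."""
    def __init__(self, state_dim=5, action_dim=2, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, action_dim)
        self.log_std_head = nn.Linear(hidden, action_dim)

    def forward(self, state):
        z = self.body(state)
        return self.mu_head(z), self.log_std_head(z).clamp(-20, 2)

    def sample(self, state):
        mu, log_std = self(state)
        delta = torch.randn_like(mu)              # external noise source of the reparameterization
        raw = mu + delta * log_std.exp()          # reparameterized pre-squash action
        action = torch.sigmoid(raw)               # keep actions in the admissible set [0, 1]
        dist = torch.distributions.Normal(mu, log_std.exp())
        # change-of-variables correction for the sigmoid squashing
        log_prob = dist.log_prob(raw) - torch.log(action * (1.0 - action) + 1e-6)
        return action, log_prob.sum(dim=-1, keepdim=True)
```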
The reward signal $r_t$, which serves as the navigation benchmark for strategy optimization, is generated synchronously at each interaction. When the agent completes an interaction with the environment, the VCISQ model (1) executes the state transition to $s_{t+1}$ and, at the same time, releases the reward signal $r_t$. At each moment $t$, the reward is the negative of the cost integrand in (7):
$$
r_t = -\Big[ B_1 V(t) + \big(B_2 I(t)\big)^{\beta} + B_3 Q(t) + B_4\alpha_1^{2}(t) + B_5\alpha_2^{2}(t) \Big].
$$
In the interactive framework with total time $T$, the SAC optimization objective is to maximize the discounted cumulative reward $\mathbb{E}_{\pi_\varphi}\big[\sum_{t}^{T}\gamma^{t} r_t\big]$. The discount factor $\gamma$ quantifies the agent's trade-off between immediate rewards and future benefits. The entropy term $H(\pi_\varphi(\cdot|s_t))$ is defined as a measure of the strategy's randomness. It aims to proactively avoid local optimum traps by dynamically expanding the strategy search space to guide the solution towards the global optimum. Specifically, the entropy expression is as follows:
$$
H\big(\pi_\varphi(\cdot|s_t)\big) = -\int_{a_t}\pi_\varphi(a_t|s_t)\log\pi_\varphi(a_t|s_t)\,da_t = \mathbb{E}_{a_t\sim\pi_\varphi(\cdot|s_t)}\big[-\log\pi_\varphi(a_t|s_t)\big].
$$
Let $l$ be the entropy regularization coefficient, whose online calibration ensures continuous exploration capability. In each interaction, the agent receives a real trajectory, defined as $\tau_1 = (s_t, a_t, s_{t+1}, r_t)$, which carries the core features of the VCISQ model (1). All of these real trajectories are stored in the real replay buffer $R_1$ to be used for network updates. During the update stage, the real data provides the most important guidance. Throughout the entire interaction stage, the SAC algorithm learns the optimal policy $\pi_\varphi^*$ via a trial-and-error reinforcement learning paradigm:
$$
\pi_\varphi^* = \arg\max_{\pi_\varphi} J(\pi_\varphi) = \arg\max_{\pi_\varphi}\sum_{t}^{T}\mathbb{E}_{\pi_\varphi}\Big[\gamma^{t} r_t + l\,H\big(\pi_\varphi(\cdot|s_t)\big)\Big].
$$
Equation explanation:
(1) $\gamma^{t} r_t$: discounted immediate reward at time $t$, where $\gamma \in (0, 1)$ is the discount factor. This prioritizes short-term rewards because of malware's rapid propagation in WRSNs (Section 1).
(2) $H(\pi_\varphi(\cdot|s_t))$: policy entropy defined in (23), measuring the randomness of the action distribution.
(3) $l$: adaptively tuned entropy coefficient (Algorithm 1; updated via (35)) that balances exploration (high entropy) and exploitation (low entropy).
Algorithm 1 Model-Based Soft Actor–Critic
1:  Initialize the parameters of the policy network $\varphi$, the predictive model $p$, the critic networks $\theta_1$ and $\theta_2$, the target critic networks $\bar\theta_1$ and $\bar\theta_2$, the entropy regularization coefficient $l$, the real replay buffer $K_1$, and the predictive replay buffer $K_2$
2:  for episode = 1 to $E$ do
3:    Observe the initial state $s_0$
4:    for $t = 1$ to $T$ do
5:      Select the action $a_t \sim \pi_\varphi(\cdot|s_t)$ with the agent's policy network
6:      Apply $a_t$ to the VCISQ model (1), and obtain the next state $s_{t+1}$ and the reward $r_t$
7:      Store the real trajectory $\tau_1 = (s_t, a_t, s_{t+1}, r_t)$ in $K_1$
8:      $s_t \leftarrow s_{t+1}$
9:      if $t$ = truncation then
10:       Retrieve all trajectory data from $K_1$, and form the input vector $x_1$ and the target vector $y_1$ for the predictive model (28)
11:       Train the predictive model parameters $p$ with $x_1$ and $y_1$ (29)
12:       for 1 to $P_k$ do
13:         Randomly sample a state from $K_1$ as the model state $s_t^m$
14:         Output the model action $a_t^m \sim \pi_\varphi(\cdot|s_t^m)$ with the agent's policy network
15:         Form the predictive input vector $x_2$ from $s_t^m$ and $a_t^m$, and compute the next model state $s_{t+1}^m$ and the model reward $r_t^m$ (31)
16:         Store the predictive trajectory $\tau_2 = (s_t^m, a_t^m, s_{t+1}^m, r_t^m)$ in $K_2$
17:         $s_t^m \leftarrow s_{t+1}^m$
18:       end for
19:     end if
20:   end for
21:   for each gradient step do
22:     Randomly sample $\tau_1$ and $\tau_2$ from $K_1$ and $K_2$, respectively
23:     Combine the two batches in the specified $s_l$ ratio to form the update batch
24:     Update all network parameters:
          $\varphi \leftarrow \varphi - a_\varphi \nabla_\varphi L(\varphi)$ (32)
          $\theta_j \leftarrow \theta_j - a_\theta \nabla_{\theta_j} L_{critic}$, for $j \in \{1, 2\}$ (33)
          $\bar\theta_j \leftarrow \tau\theta_j + (1-\tau)\bar\theta_j$, for $j \in \{1, 2\}$ (34)
25:     Update the entropy regularization coefficient $l$ by minimizing $L(l)$ (35)
26:   end for
27: end for
Optimization Goal:
Maximizes the trade-off between cumulative rewards and policy stochasticity:
(1) The $\gamma^{t} r_t$ term ensures effective malware containment (Section 5.4 validates the reduced control cost).
(2) The $l\,H(\cdot)$ term prevents premature convergence (e.g., avoiding overly deterministic quarantine policies), which is crucial for dynamic WRSNs.
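To make the data flow of Algorithm 1 concrete, the following minimal sketch wires together stand-in components for the environment, agent, and predictive model. The stub classes, buffer handling, and all numeric settings are placeholders for illustration only; in the actual method they correspond to the VCISQ model (1), the SAC networks of Section 4.1, and the Gaussian predictive model of Section 4.2.

```python
import random

class StubEnv:
    """Placeholder for the VCISQ environment."""
    def reset(self): return [270.0, 0.0, 30.0, 0.0, 0.0]
    def step(self, a): return [random.random() for _ in range(5)], -1.0

class StubAgent:
    """Placeholder for the SAC policy/critic networks."""
    def act(self, s): return [random.random(), random.random()]
    def update(self, real_batch, model_batch, s_l=0.9): pass

class StubModel:
    """Placeholder for the Gaussian predictive model."""
    def fit(self, trajectories): pass
    def rollout(self, s, a): return [x + 0.01 for x in s], -1.0

env, agent, model = StubEnv(), StubAgent(), StubModel()
K1, K2 = [], []                                  # real and predictive replay buffers
E, T, truncation, P_k = 3, 50, 25, 5

for episode in range(E):
    s = env.reset()
    for t in range(1, T + 1):
        a = agent.act(s)
        s_next, r = env.step(a)
        K1.append((s, a, s_next, r))             # real trajectory tau_1
        s = s_next
        if t == truncation:
            model.fit(K1)                        # train the predictive model on real data
            for _ in range(P_k):                 # branched model rollouts
                sm, _, _, _ = random.choice(K1)
                am = agent.act(sm)
                sm_next, rm = model.rollout(sm, am)
                K2.append((sm, am, sm_next, rm)) # predictive trajectory tau_2
    for _ in range(T):                           # gradient steps with mixed data
        real = random.sample(K1, min(8, len(K1)))
        pred = random.sample(K2, min(8, len(K2)))
        agent.update(real, pred)
```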

4.2. Predictive Model

As shown in Figure 4, MBSAC quickly locates the optimal strategy for the VCISQ model (1), which depends on the explicit MLP predictive model G. This predictive model G uses the idea of the MPC method. First, it initializes an adjustable network parameter $p$ and then performs trajectory prediction to support the optimization process of the policy network. MBSAC first interacts with the VCISQ model (1) for a certain amount of time to obtain a certain amount of real data. MBSAC then truncates the real trajectory so that the predictive model can carry out its trajectory prediction. Specifically, the real data contains the environment's characteristics, and training on real data makes the model parameters reflect the real situation more accurately, thereby improving the accuracy of the predictive model. The core task of the predictive model is to accurately describe the dynamic evolution of the VCISQ model (1) under state–action pairs, and simultaneously, it should also fit the environment's instantaneous reward generation logic. When the pre-set truncation moment arrives, the model reads all the training data accumulated in the real replay buffer $R_1$ as a learning basis. At each moment $t$, $s_t$ and $a_t$ are concatenated to form the state–action vector $x_1$:
x 1 = ( s t , a t )
Additionally, the reward r t and the state difference s t + 1 s t are concatenated to form the reward-state difference vector y 1 :
y 1 = ( r t , s t + 1 s t )
The training data for the predictive model consists of the input $x_1$ and the corresponding target output $y_1$. To prevent order bias and improve distribution uniformity, the training data is shuffled randomly at the start of the model training process. The task of model G is to predict the state at the next moment and the instant reward obtained at that moment. To achieve this prediction, model G uses a Gaussian distribution in the output layer of the MLP to represent the uncertainty of the prediction. Specifically, the mean of the prediction, $\mu^{x_1}$, and the log-variance, $\log(\sigma_{x_1}^{2})$, are generated. The mean represents the core expectation of the predicted value, and the log-variance describes the dispersion of the prediction:
G = N ( μ x 1 , log ( σ x 1 2 ) )
Among them, the input dimension of the model equals the dimension of the state–action vector $x_1$, and the output dimension is twice the dimension of $y_1$, covering the mean and log-variance of the predicted reward and state difference.
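A minimal PyTorch sketch of such a predictive model is shown below, assuming the network outputs the mean and a bounded log-variance of $y_1 = (r_t, s_{t+1} - s_t)$; the hidden width and the log-variance bounds standing in for $V_{min}$/$V_{max}$ are illustrative values rather than the paper's settings.

```python
import torch
import torch.nn as nn

class GaussianDynamicsModel(nn.Module):
    """Predictive model G: maps (s_t, a_t) to the mean and bounded log-variance
    of the target vector y_1 = (r_t, s_{t+1} - s_t)."""
    def __init__(self, state_dim=5, action_dim=2, hidden=200,
                 logvar_min=-10.0, logvar_max=0.5):
        super().__init__()
        out_dim = state_dim + 1                     # one reward plus the state difference
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * out_dim))
        self.logvar_min, self.logvar_max = logvar_min, logvar_max

    def forward(self, state, action):
        mu, logvar = self.net(torch.cat([state, action], dim=-1)).chunk(2, dim=-1)
        logvar = torch.clamp(logvar, self.logvar_min, self.logvar_max)  # V_min / V_max truncation
        return mu, logvar
```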
The state–action vector $x_1$ (26) is divided into a training set $x_{t1}$ (with a proportion of $1 - K_R$) and a validation set $x_{v1}$ (with a proportion of $K_R$). The training set is used to perform $E_t$ model iterations. In each training round, MBSAC randomly samples a training data subset of size $B_c$. This batch of data is then input into the predictive model (28), which outputs its predicted training mean $\mu_t^{x_1}$ and training log-variance $\log(\sigma_t^{x_1 2})$. The log-variance $\log(\sigma_t^{x_1 2})$ is then truncated between the pre-set minimum $V_{min}$ and maximum $V_{max}$. In addition, the validation set is used to evaluate the performance of the predictive model G in each training iteration, with the aim of deciding whether to terminate the training early.
The validation set input vector $x_{v1}$ is fed into the predictive model (28) to obtain $\mu_v^{x_1}$ and $\log(\sigma_v^{x_1 2})$, which are then combined with $y_1$ to calculate the loss value.
Based on the above, the training loss function $L_t(p)$ and the validation loss function $L_v(p)$ are constructed to optimize the MLP parameters of the predictive model:
$$
L_t(p) = h_m V_{max} - h_m V_{min} + \frac{\big(\mu_t^{x_1} - y_1\big)^2}{\sigma_t^{x_1\,2}} + \log\big(\sigma_t^{x_1\,2}\big), \qquad L_v(p) = \big(\mu_v^{x_1} - y_1\big)^2.
$$
Among these, $h_m$ represents the model loss weight. The loss $L_v(p)$ is used only to monitor the improvement in model performance, and its gradient does not participate in the parameter update. If the improvement ratio $d_r$ falls below the pre-set target value $P_{target}$, the training process is terminated early to avoid overfitting of the model. With $L_v(p, k_v)$ defined as the loss value calculated on the validation set at the end of the $k_v$-th training iteration using (29), the improvement ratio is
$$
d_r = 1 - \frac{L_v(p, k_v + 1)}{L_v(p, k_v)}.
$$
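A compact training routine in this spirit is sketched below: Gaussian negative log-likelihood on the training split, plain squared error on the validation split, and early stopping once the improvement ratio of (30) drops below a target value. The split ratio, epoch count, batch size, and target value are assumptions, and the log-variance bound penalty of (29) is omitted for brevity.

```python
import torch

def train_model(model, x, y, epochs=50, lr=1e-3, val_ratio=0.2,
                p_target=0.01, batch_size=128, state_dim=5):
    """Fit the Gaussian predictive model on x = (s_t, a_t), y = (r_t, s_{t+1} - s_t)."""
    perm = torch.randperm(len(x))                        # shuffle to remove order bias
    x, y = x[perm], y[perm]
    n_val = int(len(x) * val_ratio)
    x_v, y_v, x_tr, y_tr = x[:n_val], y[:n_val], x[n_val:], y[n_val:]
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    prev_val = None
    for _ in range(epochs):
        idx = torch.randint(0, len(x_tr), (batch_size,))
        mu, logvar = model(x_tr[idx, :state_dim], x_tr[idx, state_dim:])
        nll = ((mu - y_tr[idx]) ** 2 / logvar.exp() + logvar).mean()   # Gaussian NLL
        opt.zero_grad(); nll.backward(); opt.step()
        with torch.no_grad():
            mu_v, _ = model(x_v[:, :state_dim], x_v[:, state_dim:])
            val = ((mu_v - y_v) ** 2).mean().item()                    # validation loss L_v(p)
        if prev_val is not None and (1.0 - val / prev_val) < p_target:
            break                                                      # improvement ratio below target
        prev_val = val
```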
Once the predictive model has completed its training, MBSAC uses it, following the MPC idea, to predict trajectories and generate prediction data. As shown in Figure 4, from the truncation moment onward, the agent samples $B_d$ batches of states from the real experience cache $R_1$ to become the model states $s_t^m$. The predictive model then uses the behavior strategy $\pi_\varphi(a_t^m|s_t^m)$ to obtain the model action $a_t^m$. Next, the prediction input vector $x_2 = (s_t^m, a_t^m)$ is constructed. $x_2$ is then input into the predictive model (28) to obtain the predicted mean $\mu^{x_2}$ and the predicted log-variance $\log(\sigma_{x_2}^{2})$. The random sample $\mathcal{N}(0, 1)$ denotes a draw from the standard normal distribution, and $c$ represents the state space dimension of the VCISQ model (1). Specifically, the predictive model uses the following formula to obtain the next model state $s_{t+1}^m$ and the model reward $r_t^m$:
$$
s_{t+1}^{m} = \Big[\mu^{x_2} + \mathcal{N}(0,1)\cdot\sigma^{x_2}\Big]_{1:c-1}, \qquad r_t^{m} = \Big[\mu^{x_2} + \mathcal{N}(0,1)\cdot\sigma^{x_2}\Big]_{c}.
$$
Among them, the subscript $1\!:\!c-1$ denotes the first $c-1$ elements of the sampled vector $\mu^{x_2} + \mathcal{N}(0,1)\cdot\sigma^{x_2}$, and the subscript $c$ denotes its $c$-th (last) element.
It should be noted that the prediction process is repeatable. MBSAC requires P k predictions to be completed after each truncation. Once a prediction is complete, s t + 1 m is used to generate the prediction input vector x 2 as the current state for the next prediction. Once the prediction process is complete, P k prediction trajectories are produced: τ 2 = ( s t m , a t m , s t + 1 m , r t m ) . These serve as virtual data and are saved in the virtual replay cache R 2 .
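The sketch below generates such short model rollouts by repeatedly querying the learned model from states sampled out of the real buffer. It assumes the model and policy interfaces from the earlier sketches, and it places the predicted reward in the first output component and the state difference in the remaining components, following the ordering of $y_1$ in (27); the additive state-difference parameterization is likewise an assumption.

```python
import torch

@torch.no_grad()
def model_rollout(model, policy, real_states, horizon=5):
    """Generate predictive trajectories tau_2 = (s_t^m, a_t^m, s_{t+1}^m, r_t^m)
    by rolling the learned Gaussian model forward under the current policy."""
    trajectories = []
    s = real_states                                   # batch of model states s_t^m
    for _ in range(horizon):                          # horizon plays the role of P_k
        a, _ = policy.sample(s)                       # model action from the behavior strategy
        mu, logvar = model(s, a)
        sample = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        r, ds = sample[:, :1], sample[:, 1:]          # predicted reward and state difference
        s_next = s + ds
        trajectories.append((s, a, s_next, r))
        s = s_next                                    # chain the prediction to the next step
    return trajectories
```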

4.3. Network Optimization

MBSAC improves learning efficiency and reduces dependence on large amounts of real-world interaction data. Its core relies on the synchronous updating of real and predictive data. In fact, the predictive mechanism of the predictive model effectively reduces the number of times RL algorithms interact directly with the real environment. The core value of real data lies in its precise depiction of environmental patterns, providing a benchmark for the learning direction of an intelligent agent. At the same time, real data provides a safe space for trial and error for the model-generated virtual data. However, overreliance on real data can hinder the intelligent agent’s ability to fully exploit the predictive potential of the model. At the same time, excessive reliance on virtual data can be misleading.
It is necessary to set an appropriate ratio to balance the use of the two types of data. This is key to the intelligent agent achieving the optimal strategy in a complex environment. If the utilization rate of the real data is s l , then the utilization rate of the virtual data is ( 1 s l ) . For the MBSAC intelligent agent, the policy network is updated using the following loss function L ( φ ) :
$$
\begin{aligned}
L(\varphi) ={}& s_l \cdot \mathbb{E}_{(a_t, s_t)\sim R_1}\!\left[ D_{\mathrm{KL}}\!\left( \pi_\varphi(a_t|s_t) \,\Big\|\, \frac{\exp\!\big(C_{\theta_j}(s_t, a_t)/l\big)}{Z(s_t)} \right)\right] + (1-s_l) \cdot \mathbb{E}_{(a_t^m, s_t^m)\sim R_2}\!\left[ D_{\mathrm{KL}}\!\left( \pi_\varphi(a_t^m|s_t^m) \,\Big\|\, \frac{\exp\!\big(C_{\theta_j}(s_t^m, a_t^m)/l\big)}{Z(s_t^m)} \right)\right]\\
={}& s_l \cdot \mathbb{E}_{(a_t, s_t)\sim R_1}\!\left[ -\,l\,H\big(\pi_\varphi(a_t|s_t)\big) - \min_{j=1,2} C_{\theta_j}(s_t, a_t) \right] + (1-s_l) \cdot \mathbb{E}_{(a_t^m, s_t^m)\sim R_2}\!\left[ -\,l\,H\big(\pi_\varphi(a_t^m|s_t^m)\big) - \min_{j=1,2} C_{\theta_j}(s_t^m, a_t^m) \right].
\end{aligned}
$$
This equation updates the policy network π φ by minimizing the Kullback–Leibler (KL) divergence between the current policy and a target distribution derived from the critic’s evaluation. Critically, it computes this loss using a weighted sum of transitions sampled from both the real replay buffer ( R 1 ) and the predictive replay buffer ( R 2 ). The hyperparameter s l balances this data usage, enabling the policy to learn from accurate real interactions while greatly improving sample efficiency with model-generated data.
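One common way to realize the s_l weighting in practice is to mix the two buffers at the batch level, as sketched below; an alternative, closer to how (32) is written, is to keep separate real and model batches and weight the two loss terms. The batch size and buffer representation are assumptions.

```python
import random

def sample_mixed_batch(real_buffer, model_buffer, batch_size=128, s_l=0.9):
    """Draw a training batch that mixes real and model-generated transitions
    in the ratio s_l : (1 - s_l)."""
    n_real = int(round(batch_size * s_l))
    n_model = batch_size - n_real
    batch = random.sample(real_buffer, min(n_real, len(real_buffer)))
    batch += random.sample(model_buffer, min(n_model, len(model_buffer)))
    random.shuffle(batch)
    return batch
```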
For any critic network C θ m , where m = 1 , 2 , the loss function L critic is given by the following:
$$
\begin{aligned}
L_{critic} ={}& s_l \cdot \mathbb{E}_{\tau_1}\left[\frac{1}{2}\Big(C_{\theta_m}(s_t, a_t) - \big(r_t + \gamma\big(\min_{j=1,2} C_{\bar\theta_j}(s_{t+1}, a_{t+1}) - l\log\pi_\varphi(a_{t+1}|s_{t+1})\big)\big)\Big)^2\right]\\
&+ (1-s_l) \cdot \mathbb{E}_{\tau_2}\left[\frac{1}{2}\Big(C_{\theta_m}(s_t^m, a_t^m) - \big(r_t^m + \gamma\big(\min_{j=1,2} C_{\bar\theta_j}(s_{t+1}^m, a_{t+1}^m) - l\log\pi_\varphi(a_{t+1}^m|s_{t+1}^m)\big)\big)\Big)^2\right].
\end{aligned}
$$
This equation updates the critic networks C θ by minimizing the mean squared error (MSE) of the temporal-difference (TD) error. It calculates the TD target using a soft value estimate from the target networks that includes an entropy bonus. Similar to Equation (32), the final loss is a weighted average of the TD error computed on batches from both the real ( R 1 ) and predictive R 2 buffers. This allows the critic to learn a more accurate and stable value function from a larger and more diverse set of experiences.
The convergence stability of the loss function (33) depends on the soft update mechanism of the target critic network. Specifically, MBSAC introduces parameters θ 1 ¯ and θ 2 ¯ that are updated using a sliding average based on the scaling factor τ :
$$
\bar\theta_1 = \tau\theta_1 + (1-\tau)\bar\theta_1, \qquad \bar\theta_2 = \tau\theta_2 + (1-\tau)\bar\theta_2.
$$
This is not a loss function but a stability mechanism. It slowly synchronizes the parameters of the target critic networks with the parameters of the online critics. This soft update, as opposed to a periodic hard copy, ensures the target values change gradually, which is crucial for preventing divergence and stabilizing the training of the critics in Equation (33).
To maintain the exploratory nature of MBSAC, the entropy value network has the following loss function for real-time adjustment of the regularization coefficient:
$$
L(l) = s_l \cdot \mathbb{E}_{(a_t, s_t)\sim R_1}\big[-\,l\cdot\log\pi_\varphi(a_t|s_t) - l\cdot\bar{H}\big] + (1-s_l)\cdot\mathbb{E}_{(a_t^m, s_t^m)\sim R_2}\big[-\,l\cdot\log\pi_\varphi(a_t^m|s_t^m) - l\cdot\bar{H}\big],
$$
where $\bar{H}$ denotes the target entropy.
Entropy coefficient loss function: This equation automates the adjustment of the entropy regularization coefficient. It performs a gradient descent step to minimize the difference between the policy’s current average entropy and a target entropy value. This dynamic adjustment is essential for maintaining a sustainable exploration-exploitation balance throughout the training process, allowing the agent to explore effectively without degrading final performance.
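The sketch below puts the updates (32)–(35) together for one gradient step over a mixed batch (already collated into tensors, e.g., by the sampling helper above). The optimizer dictionary, the callable critic interfaces, the log-parameterization of l, and the target-entropy handling are implementation assumptions in the spirit of standard SAC rather than details fixed by the paper.

```python
import torch

def mbsac_update(batch, policy, critics, target_critics, log_l, target_entropy,
                 optimizers, gamma=0.85, tau=4e-3):
    """One gradient step: critic TD regression (33), policy loss (32),
    entropy-coefficient loss (35), and soft target update (34)."""
    s, a, s_next, r = batch                        # tensors of shape [B, ...]
    l = log_l.exp()

    # critic update: soft TD target with entropy bonus
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)
        q_next = torch.min(target_critics[0](s_next, a_next),
                           target_critics[1](s_next, a_next))
        target = r + gamma * (q_next - l * logp_next)
    critic_loss = sum(0.5 * ((c(s, a) - target) ** 2).mean() for c in critics)
    optimizers["critic"].zero_grad(); critic_loss.backward(); optimizers["critic"].step()

    # policy update: minimize l * log_pi - min_j Q
    a_new, logp_new = policy.sample(s)
    q_new = torch.min(critics[0](s, a_new), critics[1](s, a_new))
    policy_loss = (l.detach() * logp_new - q_new).mean()
    optimizers["policy"].zero_grad(); policy_loss.backward(); optimizers["policy"].step()

    # entropy-coefficient update: keep policy entropy near the target entropy
    l_loss = -(log_l * (logp_new.detach() + target_entropy)).mean()
    optimizers["l"].zero_grad(); l_loss.backward(); optimizers["l"].step()

    # soft update of the target critics
    for c, c_bar in zip(critics, target_critics):
        for p, p_bar in zip(c.parameters(), c_bar.parameters()):
            p_bar.data.mul_(1.0 - tau).add_(tau * p.data)
```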
The Adam optimizer and the ReLU activation function are applied uniformly to the design of all RL algorithms’ neural networks. The details of the execution framework are fully presented in Algorithm 1. Additionally, the parameters and values of the MBSAC algorithm are provided in Table 3.

5. Experiment Validation

5.1. Experimental Setup

In order to validate the fractional-order optimal control method and the proposed MBSAC algorithm, experiments along several dimensions are designed in this paper. All experiments were conducted under PyCharm 2025.1 and Python 3.13.5. It should be noted that the fractional-order optimal control method provides the theoretically optimal solution for model control: the closer the final control cost of an RL method is to the optimal control cost produced by the fractional-order optimal control method, the better the RL performance. The RL results for each condition were averaged over 10 runs with different random seeds. The experimental parameters are configured according to Table 2 and Table 3 unless specified otherwise.
Specifically, this paper has the following comparison baseline:
(1)
DQN [27]: A generalized reinforcement learning algorithm based on value functions.
(2)
SAC [4]: A generalized maximum entropy-based reinforcement learning algorithm that performs well on problems with continuous state–action spaces.
(3)
PPO [5]: A generalized reinforcement learning algorithm based on importance sampling and trust domain.
(4)
MAPPO [5]: A multi-agent reinforcement learning algorithm based on PPO, in which agents can collaborate to optimize.
(5)
Fractional-order optimal control method: The control method proposed in this paper for the VCISQ model, which aims to output an optimal result.
(6)
MBSAC: A high sample efficiency MBRL algorithm proposed in this paper with higher convergence efficiency and learning performance.
To ensure the reproducibility and transparency of our experimental results, this subsection provides a detailed explanation of the configuration parameters, which are categorized into two groups, as follows: (1) the environment and model parameters that define the WRSNs and malware propagation dynamics, and (2) the algorithm hyperparameters that control the learning process of the reinforcement learning agents.

5.1.1. Environment and Model Parameters (Table 2)

The parameters for the fractional-order VCISQ model were carefully selected to simulate a realistic and challenging WRSNs scenario while ensuring the mathematical stability of the model.
Initial states $G_1$–$G_5$: The initial population was set to 300 nodes in total. We initialized with 30 infected nodes ($G_3 = 30$) and 270 susceptible nodes ($G_1 = 270$) to simulate an ongoing malware outbreak, a critical scenario for testing the containment policy. The other states were set to zero.
Cost weights $B_1$–$B_5$: The weights in the cost function (7) were set to prioritize the reduction of infected nodes ($B_2 = 900$), as they pose the most immediate threat. The weights for susceptible and quarantined nodes ($B_1 = B_3 = 300$) were set lower but remain significant to discourage excessive quarantine that cripples network functionality. The control weights ($B_4 = B_5 = 12$) were tuned to reflect the non-negligible resource cost of applying security measures, preventing the algorithm from finding a trivial solution that applies maximum control at all times.
Delay parameters τ i , τ p , τ s , h : These temporal parameters are crucial for realism. The malware installation delay τ i = 1 and patch delay τ p = 0.5 are shorter than the patch failure delay τ s = 2 and quarantine failure delay h = 1 , reflecting the fact that remediation efforts often take longer to complete than the time malware needs to activate.
Communication parameters q c , χ , α , σ n 2 , γ t h , ρ : These values were adopted from our previous work [5] and standard models for wireless sensor networks to simulate a realistic communication channel based on Rayleigh fading, path loss, and signal-to-noise ratio (SNR) constraints. The node density ρ = 0.6 models a moderately dense deployment.
Fractional order q = 0.9 : This value was chosen as the primary order after a sensitivity analysis (Section 5.3). It provides a strong memory effect while keeping the model computationally tractable. Values of q = 0.8 and q = 0.7 were also tested to study the impact of memory length.

5.1.2. Algorithm Hyperparameters (Table 3)

The hyperparameters for the MBSAC and baseline algorithms were tuned through an extensive grid search to ensure optimal and fair performance. The Adam optimizer was used for all neural network updates.
Learning rates $a_\varphi, a_\theta, a_p, a_l$: The policy network learning rate $a_\varphi = 2 \times 10^{-3}$ was set higher than the critic learning rate $a_\theta = 2 \times 10^{-4}$ to ensure stable policy updates relative to the value estimates, a common practice in actor–critic methods. The predictive model learning rate ($a_p = 1 \times 10^{-3}$) was chosen for rapid fitting to the recently collected data.
Discount factor and soft update ($\gamma = 0.85$, $\tau = 4 \times 10^{-3}$): A discount factor of $\gamma = 0.85$ was chosen to balance the importance of immediate rewards against future outcomes, which is suitable for a finite-horizon control problem. The target network update factor $\tau$ is very small to ensure slow and stable tracking of the critic networks, which is critical for convergence.
Buffer sizes and batch sampling ($K_1 = 8000$, $K_2 = 2000$, batch size = 128): The real replay buffer $K_1$ is larger to maintain the rich history of interactions. The predictive buffer $K_2$ is smaller as it contains fresher, model-generated data. A batch size of 128 provides a good compromise between stable gradient estimates and computational efficiency.
Predictive model parameters ($P_k = 5$, $s_l = 0.9$): The agent performs $P_k = 5$ prediction steps after each truncation to generate sufficient synthetic data without compounding excessive model error. The real data utilization ratio $s_l = 0.9$ indicates that the update is primarily guided by real data (90%), with 10% coming from the predictive model, ensuring learning stability while boosting sample efficiency.
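For convenience, the settings quoted in Sections 5.1.1 and 5.1.2 can be collected in a single configuration object, as sketched below; only values explicitly stated in the text are filled in, and the key names themselves are illustrative.

```python
# Environment / model parameters (Table 2) and algorithm hyperparameters (Table 3)
# as quoted in Sections 5.1.1-5.1.2; key names are illustrative.
CONFIG = {
    "initial_states": {"G1": 270, "G2": 0, "G3": 30, "G4": 0, "G5": 0},
    "cost_weights": {"B1": 300, "B2": 900, "B3": 300, "B4": 12, "B5": 12},
    "delays": {"tau_i": 1.0, "tau_p": 0.5, "tau_s": 2.0, "h": 1.0},
    "node_density": 0.6,
    "fractional_order": 0.9,        # q in {0.9, 0.8, 0.7} in the sensitivity study
    "lr_policy": 2e-3,              # a_phi
    "lr_critic": 2e-4,              # a_theta
    "lr_model": 1e-3,               # a_p
    "gamma": 0.85,
    "tau": 4e-3,
    "buffer_real": 8000,            # K_1
    "buffer_pred": 2000,            # K_2
    "batch_size": 128,
    "prediction_steps": 5,          # P_k
    "real_data_ratio": 0.9,         # s_l
}
```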

5.2. Validation of Control Measure Validity

In order to prove that the treatment and quarantine measures exert the necessary effects in the VCISQ model, an ablation experiment on the control inputs was conducted. In Figure 5a–e, the states of the VCISQ model are analyzed under only the treatment measure $\alpha_1$, only the quarantine measure $\alpha_2$, and the hybrid control. In the VCISQ model, the control objective is to minimize the amount of malware. This means that we expect a higher number of $V(t)$ and $S(t)$ nodes and a lower number of $I(t)$ and $C(t)$ nodes.
In Figure 5a–e, it can be seen that when only the treatment measure $\alpha_1$ is available, the infected nodes $I(t)$ are not quarantined in time, so the decline rate of $V(t)$ is not suppressed. Meanwhile, a large number of WRSN carrier nodes $C(t)$ still exist at the final moment of the system. When only the quarantine measure $\alpha_2$ is available, the infected nodes $I(t)$ are not treated in time, so the rising rate of $S(t)$ is not enhanced. At the same time, the secured nodes $S(t)$ are transformed into carrier nodes $C(t)$ by the high concentration of infected nodes $I(t)$. It is noteworthy that $V(t)$ and $S(t)$ are able to retain more WRSN nodes under the hybrid control. This is due to the fact that the two control strategies complement each other under the hybrid control measure. Timely quarantine of the infected nodes $I(t)$, along with timely treatment, effectively reduces the number of WRSN nodes carrying malware, $C(t)$ and $I(t)$. Table 4 shows that the final cost of the fractional-order optimal control method is 8634.91 under the treatment-only measure $\alpha_1$, 10,347.14 under the quarantine-only measure $\alpha_2$, and 7143.22 under the hybrid control measure. Therefore, the control cost under treatment only is 17.28% higher than that under the hybrid control, and the control cost under quarantine only is 44.85% higher than that under the hybrid control.
From Figure 5f,g, it can be seen that when only the treatment measure α1 is available, the controller outputs a higher treatment intensity than under the hybrid control. Without the assistance of the quarantine measure, the VCISQ model must rely more heavily on treatment to prevent a large number of infected nodes I(t) from being generated. Likewise, when only the quarantine measure is available, the controller outputs a higher quarantine intensity than under the hybrid control. In this case, the VCISQ model has no treatment measure to convert I(t) into S(t), so it can only apply stronger quarantine to transform I(t) into quarantined nodes Q(t) and prevent further malware propagation.

5.3. Impact on VCISQ Model with Different Fractional Orders q

The purpose of this experiment is to investigate how different fractional orders q affect the malware propagation dynamics in the VCISQ model. The fractional order q is chosen as the research variable because, in real WRSNs, malware propagation is complex, time-varying, and memory-dependent, and fractional-order differential equations more accurately capture this propagation phenomenon with hereditary characteristics and cumulative effects.
The curves in Figure 6a–e show the state transitions of the VCISQ model under different q values. For the susceptible node V(t) curve, as q decreases from 0.9 to 0.7, the rate of decline gradually decreases. This indicates that at lower q values, susceptible nodes convert to other states more slowly and the propagation rate of the malware decreases. At the same time, the rise and peak of the infected node I(t) curve also decrease as q decreases, showing that the infection intensity of the malware is effectively suppressed. In contrast, the carrier node C(t) curve shows a different trend: its rise and steady-state value increase as q decreases, meaning that at lower q values more nodes can return to a safe state. In addition, the peak of the quarantined node Q(t) curve decreases when q is small, reflecting a change in the frequency or intensity of quarantine measures; this may be because the slower spread reduces the need for quarantine. When q = 0.9, the C(t) and S(t) curves fluctuate significantly: the peak of C(t) is high and its decline is relatively slow, while S(t) rises relatively quickly. As q decreases to 0.8 and 0.7, the amplitude of these fluctuations decreases significantly; the peak of C(t) falls and its decline accelerates, the rise of S(t) slows down, and the system reaches a stable state more quickly. This indicates that a lower fractional order q gives the system a more stable dynamic response to malware propagation, reducing the instability caused by severe fluctuations and reaching a relatively balanced state more quickly. This improvement in stability is likely related to the fractional-order model's consideration of the cumulative effects of historical states, which enables the system to regulate the transitions between state nodes more effectively during propagation and thereby avoid excessive fluctuations and imbalances.
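The memory effect behind this behavior can be illustrated with a Grünwald–Letnikov (GL) discretization of a fractional derivative, in which every past state enters the current update through slowly decaying weights. The sketch below is purely illustrative: it integrates a scalar test equation D^q x(t) = −λx(t) rather than the full VCISQ system, and the values of λ, the step size h, and the horizon are assumptions, not parameters of the paper.

```python
import numpy as np

def gl_weights(q: float, n: int) -> np.ndarray:
    """Grunwald-Letnikov coefficients c_j = (-1)^j * binom(q, j), via the standard recurrence."""
    c = np.empty(n + 1)
    c[0] = 1.0
    for j in range(1, n + 1):
        c[j] = c[j - 1] * (1.0 - (q + 1.0) / j)
    return c

def fractional_decay(q: float, lam: float = 0.4, h: float = 0.1, steps: int = 300) -> np.ndarray:
    """Explicit GL scheme for D^q x(t) = -lam * x(t), x(0) = 1.
    The 'history' term sums over the entire past trajectory, which is the
    non-local memory that integer-order (q = 1) models discard."""
    c = gl_weights(q, steps)
    x = np.zeros(steps + 1)
    x[0] = 1.0
    for n in range(1, steps + 1):
        history = np.dot(c[1:n + 1], x[n - 1::-1])   # weighted sum of all past states
        x[n] = -history + (h ** q) * (-lam * x[n - 1])
    return x

for q in (0.9, 0.8, 0.7):
    print(f"q = {q}: x(T) = {fractional_decay(q)[-1]:.4f}")
```

Smaller q makes the GL weights decay more slowly with age, so older states retain more influence; this is the same memory mechanism through which the VCISQ trajectories in Figure 6 become smoother and settle faster at lower q.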
As shown in Table 4, when q is 0.9, 0.8, and 0.7, the fractional-order optimal control costs are 7143.22, 7265.18, and 7415.57, respectively. As the fractional order q decreases, the control cost gradually increases. This phenomenon is closely related to the curves in Figure 6f,g: the output intensity of the control measures differs significantly across q values. At lower q values, in order to maintain system stability and control effectiveness, the VCISQ model has to output a stronger control intensity, so the system pays a higher cost to handle the spread of malware. In contrast, at higher q values the control intensity is reduced, which leaves less resistance against a large number of infected nodes I(t), although the lower intensity also means reduced resource consumption. From a practical point of view, the VCISQ model performs best at q = 0.7: it not only leaves fewer infected nodes I(t) in the WRSN but also balances control effect and resource consumption to a certain extent. This balance makes the model more advantageous in practical applications, because it can effectively suppress the spread of malware under limited resources and thus improve the security and stability of the entire WRSN system.

5.4. Performance of Reinforcement Learning Under Different Hyperparameters

Hyperparameters are directly related to the performance and practical effectiveness of RL algorithms: they affect learning efficiency, convergence speed, and final control performance. To better understand and optimize the RL algorithm in this paper, we analyzed its behavior in detail under different hyperparameters.
Figure 7a–l shows the variation of the average reward of different reinforcement learning algorithms under various hyperparameters. The curves indicate that the MBSAC algorithm significantly outperforms the DQN, SAC, PPO, and MAPPO algorithms in learning efficiency and convergence speed. For the policy network learning rate a_φ, the average reward of MBSAC rises rapidly and stabilizes at a_φ = 1 × 10⁻³, while the rewards of DQN, SAC, PPO, and MAPPO rise slowly and fluctuate greatly. Even at a_φ = 2 × 10⁻³ or a_φ = 3 × 10⁻³, MBSAC maintains a stable upward trend, whereas the baseline algorithms still fluctuate noticeably and fail to converge effectively; MBSAC achieves its highest reward at a_φ = 2 × 10⁻³. For the critic network and target critic network learning rate a_θ, MBSAC exhibits fast convergence and high average rewards at all three settings of 1 × 10⁻⁴, 2 × 10⁻⁴, and 3 × 10⁻⁴. In contrast, the rewards of DQN, SAC, PPO, and MAPPO rise more slowly, fluctuate more, and struggle to reach the stable state of MBSAC; MBSAC performs best at a_θ = 2 × 10⁻⁴. MBSAC also performs well across batch sizes. As the batch size increases from 64 to 128 and 256, the average reward of MBSAC fluctuates only slightly, remains much higher than that of the other algorithms, and stabilizes faster at batch sizes of 128 and 256; in particular, the control cost of MBSAC at a batch size of 128 is lower than in the other two cases. By contrast, DQN, SAC, PPO, and MAPPO learn less effectively under all batch sizes and their reward values fluctuate more. The discount factor γ also affects performance. As γ changes from 0.95 to 0.85 and 0.75, the control cost of MBSAC varies only slightly; the variation is much smaller than that of the baseline algorithms, and MBSAC still converges stably, whereas DQN, SAC, PPO, and MAPPO fluctuate more and have difficulty converging. Notably, the final control cost of MBSAC is lower at γ = 0.85 than at γ = 0.95 or γ = 0.75. The related experimental results are shown in Table 4.
At a_φ = 1 × 10⁻³, the control cost of MBSAC is 7352.41 and the error from the optimal control is 2.92%; at a_φ = 2 × 10⁻³, the control cost is 7266.29 and the error is reduced to 1.72%; at a_φ = 3 × 10⁻³, the control cost is 7448.52 and the error is 4.27%.
At a_θ = 1 × 10⁻⁴, the control cost of MBSAC is 7511.36, with an error of 5.15%; at a_θ = 2 × 10⁻⁴, the control cost is 7266.29, with an error of 1.72%; at a_θ = 3 × 10⁻⁴, the control cost is 7373.21, with an error of 3.22%.
At a batch size of 64, the control cost of MBSAC is 7428.87, with an error of 3.99%; at a batch size of 128, the control cost is 7266.29, with an error of 1.72%; at a batch size of 256, the control cost is 7298.74, with an error of 2.17%.
At γ = 0.95, the control cost of MBSAC is 7374.52, with an error of 3.24%; at γ = 0.85, the control cost is 7266.29, with an error of 1.72%; at γ = 0.75, the control cost is 7435.96, with an error of 4.09%.
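All of these percentages are relative errors of the MBSAC control cost with respect to the fractional-order optimal control cost; Section 5.5 uses the same definition with the q-dependent optimal cost as the baseline. The short check below uses the Table 4 values; the dictionary keys are hypothetical labels added for readability.

```python
optimal_cost = 7143.22  # fractional-order optimal control cost at q = 0.9

# MBSAC control costs from Table 4 under different hyperparameter settings
mbsac_costs = {
    "a_phi=1e-3": 7352.41, "a_phi=2e-3": 7266.29, "a_phi=3e-3": 7448.52,
    "a_theta=1e-4": 7511.36, "a_theta=2e-4": 7266.29, "a_theta=3e-4": 7373.21,
    "batch=64": 7428.87, "batch=128": 7266.29, "batch=256": 7298.74,
    "gamma=0.95": 7374.52, "gamma=0.85": 7266.29, "gamma=0.75": 7435.96,
}

for setting, cost in mbsac_costs.items():
    rel_err = 100.0 * (cost - optimal_cost) / optimal_cost
    # Matches the percentages quoted above up to rounding (e.g. 1.72% for 7266.29)
    print(f"{setting}: cost = {cost:.2f}, relative error = {rel_err:.2f}%")
```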
This experiment did not involve redesigning the algorithms themselves; instead, we followed a common experimental procedure of testing and tuning the hyperparameters. By fine-tuning the hyperparameters and evaluating the resulting performance, the superior hyperparameter combination listed in Table 3 was determined. Using this combination, MBSAC effectively and accurately suppresses malware propagation in the VCISQ model and provides an efficient and reliable solution for malware prevention and control in WRSNs.

5.5. Investigation of the Relative Optimal s_l Under Different Fractional Orders q

The parameter s_l determines how strongly MBSAC relies on real versus predicted data during updates. This experiment investigates the control effect of the proposed MBSAC algorithm under different fractional orders q and different values of s_l, evaluating its performance against the optimal control method. The results are useful for further improving the control effect of MBSAC.
Figure 8a,d,g shows the average reward curves for various s_l values at different fractional orders. As the training episodes progress, the reward of the MBSAC algorithm gradually increases and stabilizes. When the fractional order q is set to 0.9, the convergence of MBSAC gradually slows as s_l increases from 0.7 to 0.9: the slope of the reward curve flattens, meaning the algorithm needs more training rounds to reach a steady state. When s_l = 0.7, the reward rises rapidly and stabilizes within fewer rounds, whereas when s_l = 0.9 the reward rises more slowly but eventually reaches the highest stabilized average value. This indicates that the algorithm obtains a better control strategy by fully utilizing the real data: a higher s_l makes the algorithm more dependent on real data and the agent more cautious during exploration, so the learning process is longer but the final performance is better. For q = 0.8, the results show the most moderate convergence rate at s_l = 0.8, which achieves a better performance balance: the algorithm balances real and predicted data, ensuring a faster convergence rate without significantly reducing the final reward. This suggests that, at this fractional order, s_l = 0.8 is the more suitable choice for effectively using both real and predicted data, improving the control effect while maintaining learning efficiency. For q = 0.7, the results show a similar trend. At s_l = 0.8, the algorithm converges faster and still reaches a relatively high average reward. In contrast, the higher value s_l = 0.9 leads to a significantly slower convergence rate, while the lower value s_l = 0.7 converges faster but ends with a lower average reward and slightly more volatile curves. At this order, s_l = 0.8 therefore provides a better balance between performance and convergence speed, avoiding the performance degradation or slow convergence caused by over-reliance on either real or predicted data.
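To make the role of s_l concrete, the following is a minimal sketch of how a training batch might be assembled from the two replay buffers; the function name and the list-based buffer representation are assumptions for illustration rather than the paper's exact sampling routine.

```python
import random

def sample_mixed_batch(real_buffer, model_buffer, batch_size=128, s_l=0.9):
    """Draw a batch in which roughly a fraction s_l of the samples come from real
    transitions and the remainder from model-generated (predicted) transitions."""
    n_real = min(int(round(batch_size * s_l)), len(real_buffer))
    n_model = min(batch_size - n_real, len(model_buffer))
    batch = random.sample(real_buffer, n_real) + random.sample(model_buffer, n_model)
    random.shuffle(batch)  # avoid ordering effects between real and predicted samples
    return batch
```

With s_l = 0.9 and a batch size of 128, roughly 115 samples per update come from real interactions and the remaining 13 from the predictive model.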
The running results of the optimal control and MBSAC under different conditions in this experiment are given in Table 4. Specifically, when the fractional order q = 0.9:
(1) s_l = 0.9: the control cost of MBSAC is 7266.29, compared with 7143.22 for the optimal control, an error of 1.72%.
(2) s_l = 0.8: the control cost of MBSAC is 7285.64, compared with 7143.22 for the optimal control, an error of 2.00%.
(3) s_l = 0.7: the control cost of MBSAC is 7377.86, compared with 7143.22 for the optimal control, an error of 3.28%.
When the fractional order q = 0.8:
(1) s_l = 0.9: the control cost of MBSAC is 7431.72, compared with 7265.18 for the optimal control, an error of about 2.30%.
(2) s_l = 0.8: the control cost of MBSAC is 7338.14, compared with 7265.18 for the optimal control, an error of about 1.01%.
(3) s_l = 0.7: the control cost of MBSAC is 7536.87, compared with 7265.18 for the optimal control, an error of about 3.74%.
When the fractional order q = 0.7:
(1) s_l = 0.9: the control cost of MBSAC is 7623.66, compared with 7415.57 for the optimal control, an error of about 2.81%.
(2) s_l = 0.8: the control cost of MBSAC is 7682.57, compared with 7415.57 for the optimal control, an error of about 3.85%.
(3) s_l = 0.7: the control cost of MBSAC is 7538.14, compared with 7415.57 for the optimal control, an error of about 1.46%.
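The errors above use the optimal control cost at the corresponding fractional order as the baseline. As a worked instance of the definition (not a new result), for q = 0.9 and s_l = 0.9:

```latex
\text{error} = \frac{C_{\mathrm{MBSAC}} - C_{\mathrm{opt}}(q)}{C_{\mathrm{opt}}(q)} \times 100\%
= \frac{7266.29 - 7143.22}{7143.22} \times 100\% \approx 1.72\%.
```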
Combining the control curves in Figure 8b,c, Figure 8e,f, and Figure 8h,i, the control curves of the algorithm are closer to the optimal control curves when the value of s_l is matched to the value of q, e.g., q = 0.9 with s_l = 0.9 or q = 0.8 with s_l = 0.8, indicating a better control strategy and a smaller error. For example, with q = 0.9 and s_l = 0.9, the control curve of MBSAC closely follows the fluctuation trend of the optimal control curve, showing that the algorithm adjusts the control strength in time according to changes in the system state and thus effectively suppresses the propagation of malware. Similarly, when q = 0.8 and s_l = 0.8, the control curve agrees closely with the optimal control curve, and the algorithm better balances control effect and resource consumption under this parameter combination. Conversely, when s_l does not match q, such as q = 0.7 with s_l = 0.8, the control curve deviates more from the optimal control curve: the control effort fluctuates strongly and is insufficient or excessive in certain periods. As a result, the spread of malware is not effectively controlled and the control cost increases. An unreasonable parameter combination therefore causes a mismatch between the algorithm's control strategy and the actual needs of the system, which degrades the control effect.
Further analysis reveals that as q increases, the system's dependence on its historical states weakens, and appropriately increasing s_l then improves the algorithm's use of real data and thus the control accuracy. For example, as q increases from 0.7 to 0.9, the value of s_l should be increased accordingly to maintain a good control effect: when q = 0.7, the error is minimized at s_l = 0.7, while when q increases to 0.8 and 0.9, the error is minimized at s_l = 0.8 and 0.9, respectively. This shows that at higher q values, increasing s_l better exploits the algorithm's performance and brings it closer to the optimal control.

6. Conclusions

We propose a fractional-order VCISQ model that accurately captures malware dynamics in WRSNs and derive its theoretical optimal control. To address the sample inefficiency of standard RL, we develop MBSAC, a model-based reinforcement learning algorithm. MBSAC leverages a predictive network to forecast environmental changes from existing data and co-optimizes the agent using both real and predicted data, significantly improving learning efficiency, reducing control costs, and accelerating convergence.
MBSAC’s effectiveness may decline in highly dynamic environments due to its reliance on stable environmental characteristics and prior knowledge. Future research will explore Meta-Reinforcement Learning (Meta-RL) to enable rapid adaptation to new dynamics using minimal data via meta-learning, enhancing the generalization capability and robustness of malware control strategies in WRSNs.

Author Contributions

Conceptualization, supervision, methodology, and writing—review and editing: H.L. and C.T.; conceptualization, supervision, and methodology: L.C.; conceptualization, supervision, and software: D.L.; formal analysis, software, and writing—review and editing: Y.W. and Y.H.; formal analysis and methodology: C.T. and Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the National Natural Science Foundation of China (No. 62503124, No. 62402532), in part by the Hunan Provincial Natural Science Foundation of China (No. 2024JJ6526), and in part by the Tertiary Education Scientific Research Project of Guangzhou Municipal Education Bureau (No. 2024312258). This work is also supported by the College Student Innovation Training Program of Guangzhou University (Grant No. 202411078028).

Data Availability Statement

The data, material, and code are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Wang, S.; Gong, Y.; Li, X.; Li, Q. Integrated Sensing, Communication, and Computation Over the Air: Beampattern Design for Wireless Sensor Networks. IEEE Internet Things J. 2024, 11, 9681–9692.
2. Zhang, G.; Yi, W.; Matthaiou, M.; Varshney, P.K. Direct Target Localization With Low-Bit Quantization in Wireless Sensor Networks. IEEE Trans. Signal Process. 2024, 72, 3059–3075.
3. Dou, Z.; Yao, Z.; Zhang, Z.; Lu, M. A Lidar-Assisted Self-Localization Technology for Indoor Wireless Sensor Networks. IEEE Internet Things J. 2023, 10, 17515–17529.
4. Liu, G.; Li, H.; Xiong, L.; Tan, Z.; Liang, Z.; Zhong, X. Fractional-Order Optimal Control and FIOV-MASAC Reinforcement Learning for Combating Malware Spread in Internet of Vehicles. IEEE Trans. Autom. Sci. Eng. 2025, 22, 10313–10332.
5. Liu, G.; Li, H.; Xiong, L.; Chen, Y.; Wang, A.; Shen, D. Reinforcement Learning for Mitigating Malware Propagation in Wireless Radar Sensor Networks with Channel Modeling. Mathematics 2025, 13, 1397.
6. Shen, Y.; Shepherd, C.; Ahmed, C.M.; Shen, S.; Yu, S. Integrating Deep Spiking Q-network into Hypergame-theoretic Deceptive Defense for Mitigating Malware Propagation in Edge Intelligence-enabled IoT Systems. IEEE Trans. Serv. Comput. 2025, 18, 1487–1499.
7. Shen, S.; Cai, C.; Shen, Y.; Wu, X.; Ke, W.; Yu, S. Joint Mean-Field Game and Multiagent Asynchronous Advantage Actor-Critic for Edge Intelligence-Based IoT Malware Propagation Defense. IEEE Trans. Dependable Secur. Comput. 2025, 22, 3824–3838.
8. Ahn, H.; Choi, J.; Kim, Y.H. A Mathematical Modeling of Stuxnet-Style Autonomous Vehicle Malware. IEEE Trans. Intell. Transp. Syst. 2023, 24, 673–683.
9. Essouifi, M.; Lachgar, A.; Vasudevan, M.; B’ayir, C.; Achahbar, A.; Elkhamkhami, J. Automated Hubs-Patching: Protection Against Malware Spread Through Reduced Scale-Free Networks and External Storage Devices. IEEE Trans. Netw. Sci. Eng. 2024, 11, 4758–4773.
10. Liu, G.; Zhang, J.; Zhong, X.; Hu, X.; Liang, Z. Hybrid Optimal Control for Malware Propagation in UAV-WSN System: A Stacking Ensemble Learning Control Algorithm. IEEE Internet Things J. 2024, 11, 36549–36568.
11. Peng, B.; Liu, J.; Zeng, J. Dynamic Analysis of Multiplex Networks With Hybrid Maintenance Strategies. IEEE Trans. Inf. Forensics Secur. 2024, 19, 555–570.
12. Chen, J.; Sun, S.; Xia, C.; Shi, D.; Chen, G. Modeling and Analyzing Malware Propagation Over Wireless Networks Based on Hypergraphs. IEEE Trans. Netw. Sci. Eng. 2023, 10, 3767–3778.
13. Li, H.; Liu, G.; Xiong, L.; Liang, Z.; Zhong, X. Meta-Reinforcement Learning for Controlling Malware Propagation in Internet of Underwater Things. IEEE Trans. Netw. Sci. Eng. 2025; early access.
14. Shen, S.; Xie, L.; Zhang, Y.; Wu, G.; Zhang, H.; Yu, S. Joint Differential Game and Double Deep Q-Networks for Suppressing Malware Spread in Industrial Internet of Things. IEEE Trans. Inf. Forensics Secur. 2023, 18, 5302–5315.
15. Zheng, Y.; Na, Z.; Ji, W.; Lu, Y. An Adaptive Fuzzy SIR Model for Real-Time Malware Spread Prediction in Industrial Internet of Things Networks. IEEE Internet Things J. 2025, 12, 22875–22888.
16. Jafar, M.T.; Yang, L.-X.; Li, G.; Zhu, Q.; Gan, C. Minimizing Malware Propagation in Internet of Things Networks: An Optimal Control Using Feedback Loop Approach. IEEE Trans. Inf. Forensics Secur. 2024, 19, 9682–9697.
17. Liu, G.; Tan, Z.; Liang, Z.; Chen, H.; Zhong, X. Fractional Optimal Control for Malware Propagation in Internet of Underwater Things. IEEE Internet Things J. 2024, 11, 11632–11651.
18. Heidari, A.; Jabraeil Jamali, M.A. Internet of Things intrusion detection systems: A comprehensive review and future directions. Clust. Comput. 2023, 26, 3753–3780.
19. Asadi, M.; Jabraeil Jamali, M.A.; Heidari, A.; Navimipour, N.J. Botnets Unveiled: A Comprehensive Survey on Evolving Threats and Defense Strategies. Trans. Emerg. Telecommun. Technol. 2024, 35, 1–39.
20. Ghimire, B.; Rawat, D.B. Recent Advances on Federated Learning for Cybersecurity and Cybersecurity for Federated Learning for Internet of Things. IEEE Internet Things J. 2022, 9, 8229–8249.
21. Hua, H.; Wang, Y.; Zhong, H.; Zhang, H.; Fang, Y. A Novel Guided Deep Reinforcement Learning Tracking Control Strategy for Multirotors. IEEE Trans. Autom. Sci. Eng. 2025, 22, 2062–2074.
22. Muduli, R.; Jena, D.; Moger, T. Application of Reinforcement Learning-Based Adaptive PID Controller for Automatic Generation Control of Multi-Area Power System. IEEE Trans. Autom. Sci. Eng. 2025, 22, 1057–1068.
23. Xu, T.; Pang, Y.; Zhu, Y.; Ji, W.; Jiang, R. Real-Time Driving Style Integration in Deep Reinforcement Learning for Traffic Signal Control. IEEE Trans. Intell. Transp. Syst. 2025, 26, 11879–11892.
24. Han, Z.; Chen, P.; Zhou, B.; Yu, G. Hybrid Path Tracking Control for Autonomous Trucks: Integrating Pure Pursuit and Deep Reinforcement Learning With Adaptive Look-Ahead Mechanism. IEEE Trans. Intell. Transp. Syst. 2025, 26, 7098–7112.
25. Zhan, D.; Liu, X.; Bai, W.; Li, W.; Guo, S.; Pan, Z. GAME-RL: Generating Adversarial Malware Examples against API Call Based Detection via Reinforcement Learning. IEEE Trans. Dependable Secur. Comput. 2025; early access.
26. Feng, C.; Celdrán, A.; Sánchez, P.; Kreischer, J.; Assen, J.; Bovet, G. CyberForce: A Federated Reinforcement Learning Framework for Malware Mitigation. IEEE Trans. Dependable Secur. Comput. 2025, 22, 4398–4411.
27. Shen, Y.; Shepherd, C.; Ahmed, C.M.; Yu, S.; Li, T. Comparative DQN-Improved Algorithms for Stochastic Games-Based Automated Edge Intelligence-Enabled IoT Malware Spread-Suppression Strategies. IEEE Internet Things J. 2024, 11, 22550–22561.
28. Tannirkulam Chandrasekaran, S.; Kuruvila, A.P.; Basu, K.; Sanyal, A. Real-Time Hardware-Based Malware and Micro-Architectural Attack Detection Utilizing CMOS Reservoir Computing. IEEE Trans. Circuits Syst. II Express Briefs 2022, 69, 349–353.
29. Saeednia, N.; Khayatian, A. Reset MPC-Based Control for Consensus of Multiagent Systems. IEEE Trans. Syst. Man Cybern. Syst. 2025, 55, 1611–1619.
30. Zuliani, R.; Balta, E.C.; Lygeros, J. BP-MPC: Optimizing the Closed-Loop Performance of MPC using BackPropagation. IEEE Trans. Autom. Control 2025, 70, 5690–5704.
31. Tang, H.; Chen, Y. Composite Observer based Resilient MPC for Heterogeneous UAV-UGV Systems Under Hybrid Cyber-Attacks. IEEE Trans. Aerosp. Electron. Syst. 2025, 61, 8277–8290.
32. Xu, J.-Z.; Liu, Z.-W.; Ge, M.-F.; Wang, Y.-W.; He, D.-X. Self-Triggered MPC for Teleoperation of Networked Mobile Robotic System via High-Order Estimation. IEEE Trans. Autom. Sci. Eng. 2025, 22, 6037–6049.
33. Wang, T.; Li, H.; Xia, C.; Zhang, H.; Zhang, P. From the Dialectical Perspective: Modeling and Exploiting of Hybrid Worm Propagation. IEEE Trans. Inf. Forensics Secur. 2023, 18, 1610–1624.
34. Abusnaina, A.; Abuhamad, M.; Alasmary, H.; Anwar, A.; Jang, R.; Salem, S. DL-FHMC: Deep Learning-Based Fine-Grained Hierarchical Learning Approach for Robust Malware Classification. IEEE Trans. Dependable Secur. Comput. 2022, 19, 3432–3447.
35. Mei, Y.; Han, W.; Li, S.; Lin, K.; Tian, Z.; Li, S. A Novel Network Forensic Framework for Advanced Persistent Threat Attack Attribution Through Deep Learning. IEEE Trans. Intell. Transp. Syst. 2024, 25, 12131–12140.
36. Ahmed, I.; Anisetti, M.; Ahmad, A.; Jeon, G. A Multilayer Deep Learning Approach for Malware Classification in 5G-Enabled IIoT. IEEE Trans. Ind. Inform. 2023, 19, 1495–1503.
37. Safari, A.; Hassanzadeh Yaghini, H.; Kharrati, H.; Rahimi, A.; Oshnoei, A. Voltage Controller Design for Offshore Wind Turbines: A Machine Learning-Based Fractional-Order Model Predictive Method. Fractal Fract. 2024, 8, 463.
38. Tian, B.; Jiang, J.; He, Z.; Yuan, X.; Dong, L.; Sun, C. Functionality-Verification Attack Framework Based on Reinforcement Learning Against Static Malware Detectors. IEEE Trans. Inf. Forensics Secur. 2024, 19, 8500–8514.
39. Abazari, A.; Soleymani, M.M.; Ghafouri, M.; Jafarigiv, D.; Atallah, R.; Assi, C. Deep Learning Detection and Robust MPC Mitigation for EV-Based Load-Altering Attacks on Wind-Integrated Power Grids. IEEE Trans. Ind. Cyber-Phys. Syst. 2024, 2, 244–263.
40. Amare, N.D.; Yang, S.J.; Son, Y.I. An Optimized Position Control via Reinforcement-Learning-Based Hybrid Structure Strategy. Actuators 2025, 14, 199.
41. Wang, Y.; Wei, M.; Dai, F.; Zou, D.; Lu, C.; Han, X.; Chen, Y.; Ji, C. Physics-Informed Fractional-Order Recurrent Neural Network for Fast Battery Degradation with Vehicle Charging Snippets. Fractal Fract. 2025, 9, 91.
42. Jafar, M.; Yang, L.; Li, G. An innovative practical roadmap for optimal control strategies in malware propagation through the integration of RL with MPC. Comput. Secur. 2025, 148, 104186.
43. Wu, L.; Braatz, R.D. A Direct Optimization Algorithm for Input-Constrained MPC. IEEE Trans. Autom. Control 2025, 70, 1366–1373.
44. Li, D.; Li, Q.; Ye, Y.; Xu, S. A Framework for Enhancing Deep Neural Networks Against Adversarial Malware. IEEE Trans. Netw. Sci. Eng. 2021, 8, 736–750.
Figure 1. WRSNs scenario diagram.
Figure 3. Fractional order model VCISQ state transition diagram.
Figure 4. Model-Based Soft Actor–Critic algorithm flowchart: Consists of four parts. The first part is the environment to be interacted with. The second part is a general-purpose underlying agent, in which the policy network outputs actions and the critic network evaluates actions. The third part, the predictive model, accelerates training. The fourth part stores interaction data.
Figure 5. Validation of control effectiveness. (a) Susceptible nodes V(t); (b) Carrier node C(t); (c) Infected node I(t); (d) Secured node S(t); (e) Quarantined node Q(t); (f) Treatment control α1(t); (g) Quarantine control α2(t).
Figure 6. The influence of different fractional orders q on the VCISQ model. (a) Susceptible nodes V(t); (b) Carrier node C(t); (c) Infected node I(t); (d) Secured node S(t); (e) Quarantined node Q(t); (f) Treatment control α1(t); (g) Quarantine control α2(t).
Figure 7. Performance of RL under different hyperparameters. (a) Policy network learning rate is 1 × 10⁻³; (b) Policy network learning rate is 2 × 10⁻³; (c) Policy network learning rate is 3 × 10⁻³; (d) Critic network learning rate is 1 × 10⁻⁴; (e) Critic network learning rate is 2 × 10⁻⁴; (f) Critic network learning rate is 3 × 10⁻⁴; (g) Batch size is 64; (h) Batch size is 128; (i) Batch size is 256; (j) Discount factor is 0.95; (k) Discount factor is 0.85; (l) Discount factor is 0.75.
Figure 8. Investigation of relative optimal s_l under different fractional orders q. (a) Average reward curves at q = 0.9; (b) Treatment control curves at q = 0.9; (c) Quarantine control curves at q = 0.9; (d) Average reward curves at q = 0.8; (e) Treatment control curves at q = 0.8; (f) Quarantine control curves at q = 0.8; (g) Average reward curves at q = 0.7; (h) Treatment control curves at q = 0.7; (i) Quarantine control curves at q = 0.7.
Table 1. Comparison of related works.
AdvantageFractional-Order ModelingTraditional Control Method or DL Control MethodRL Control MethodUtilizing MPC Thinking to Accelerate Policy Network Learning
Study
Essouifi et al. [9]×××
Peng et al. [11]×××
Zheng et al. [15]×××
Liu et al. [10]×××
Abusnaina et al. [34]×××
Mei et al. [35]×××
Ahmed et al. [36]×××
Tian et al. [38]×××
Liu et al. [4]××
Liu et al. [5]×××
Shen et al. [14]×××
Jafar et al. [16]×××
Zhan et al. [25]×××
Feng et al. [26]×××
Lin et al. (ours)×
Table 2. Model parameters.
Parameter | Value
Initial number of susceptible nodes G_1 | 270
Initial number of carrier nodes G_2 | 0
Initial number of infected nodes G_3 | 30
Initial number of secured nodes G_4 | 0
Initial number of quarantined nodes G_5 | 0
Cost weight of susceptible nodes B_1 | 300
Cost weight of infected nodes B_2 | 900
Cost weight of quarantined nodes B_3 | 300
Cost weight of treatment measure B_4 | 12
Cost weight of quarantine measure B_5 | 12
Attack severity of the malware β | 0.85
Malware installation delay τ_i | 1
Patch installation delay τ_p | 0.5
Patch failure delay τ_s | 2
Quarantine failure delay h | 1
Failure rate of treatment measures Λ | 0.05
The rate of quarantine k | 0.5
Successful installation rate of malware Ω | 0.8
The communication difference coefficient between ideal and reality q_c | 2 ln 2 / π
Transmit power coefficient χ | 5 × 10^10
Sliding length of the target square area a | 500 m
The exponent of path loss α | 4
Intensity of noise σ_n^2 | −60 dBm
Signal-to-noise threshold γ_th | 3 dB
The density of total nodes ρ | 0.6
Fractional order q | 0.9
Table 3. Reinforcement learning algorithm parameters.
Parameter | Description | Value
l | The target entropy | −2
E | The total number of running rounds | 100
T | The total interaction time | 30
a_φ | The learning rate of the policy network | 2 × 10⁻³
a_θ | The learning rate of the critic and the target critic network | 2 × 10⁻⁴
a_p | The learning rate of the predictive model | 1 × 10⁻³
a_l | The learning rate of entropy | 3 × 10⁻³
γ | The discount factor | 0.85
τ | The value of soft update | 4 × 10⁻³
h_m | The loss weight of the predictive model | 0.02
K_R | The proportion of the validation set for predictive model training | 0.5
P_k | The number of predictive rounds after each truncation | 5
V_max | The maximum of logarithmic variance | 0.5
V_min | The minimum of logarithmic variance | −10
K_1 | The size of the real replay buffer | 8000
K_2 | The size of the predictive replay buffer | 2000
Batch size | The number of samples taken by the agent during each training of the network | 128
E_t | The maximum number of iterations for training the predictive model | 15
s_l | Real data utilization rate of network updates | 0.9
Table 4. Performance comparison of control methods.
Parameter | Optimal Control | DQN | SAC | PPO | MAPPO | MBSAC
Validation of control effectiveness
Only treatment measure α1 | 8634.91 | | | | |
Only quarantine measure α2 | 10,347.14 | | | | |
Hybrid control measure | 7143.22 | | | | |
The influence of different fractional orders q on the VCISQ model
q = 0.9 | 7143.22 | | | | |
q = 0.8 | 7265.18 | | | | |
q = 0.7 | 7415.57 | | | | |
Performance of RL under different hyperparameters
a_φ = 1 × 10⁻³ | 7143.22 | 8845.16 | 8547.18 | 7965.44 | 7763.58 | 7352.41
a_φ = 2 × 10⁻³ | 7143.22 | 8753.24 | 8125.71 | 7847.51 | 7632.24 | 7266.29
a_φ = 3 × 10⁻³ | 7143.22 | 8949.33 | 8219.52 | 8088.93 | 7607.16 | 7448.52
a_θ = 1 × 10⁻⁴ | 7143.22 | 8964.21 | 8487.31 | 8004.55 | 7721.63 | 7511.36
a_θ = 2 × 10⁻⁴ | 7143.22 | 8753.24 | 8125.71 | 7847.51 | 7574.18 | 7266.29
a_θ = 3 × 10⁻⁴ | 7143.22 | 8865.63 | 8192.64 | 7935.82 | 7513.55 | 7373.21
Batch size = 64 | 7143.22 | 8931.36 | 8419.69 | 7948.06 | 7832.29 | 7428.87
Batch size = 128 | 7143.22 | 8753.24 | 8125.71 | 7847.51 | 7683.77 | 7266.29
Batch size = 256 | 7143.22 | 8799.22 | 8349.90 | 8056.99 | 7686.73 | 7298.74
γ = 0.95 | 7143.22 | 8931.37 | 8257.25 | 8189.18 | 7568.24 | 7374.52
γ = 0.85 | 7143.22 | 8753.24 | 8125.71 | 7847.51 | 7469.56 | 7266.29
γ = 0.75 | 7143.22 | 9023.86 | 8564.18 | 8031.89 | 7711.23 | 7435.96
Investigation of relative optimal s_l under different fractional orders q
q = 0.9, s_l = 0.9 | 7143.22 | | | | | 7266.29
q = 0.9, s_l = 0.8 | 7143.22 | | | | | 7285.64
q = 0.9, s_l = 0.7 | 7143.22 | | | | | 7377.86
q = 0.8, s_l = 0.9 | 7265.18 | | | | | 7431.72
q = 0.8, s_l = 0.8 | 7265.18 | | | | | 7338.14
q = 0.8, s_l = 0.7 | 7265.18 | | | | | 7536.87
q = 0.7, s_l = 0.9 | 7415.57 | | | | | 7623.66
q = 0.7, s_l = 0.8 | 7415.57 | | | | | 7682.57
q = 0.7, s_l = 0.7 | 7415.57 | | | | | 7538.14