Article

DRLMDS: A Deep Reinforcement Learning-Based Scheduling Algorithm for Mimic Defense Servers

1 Taizhou Power Supply Branch, State Grid Jiangsu Electric Power Co., Ltd., Taizhou 225309, China
2 School of Computer Science, Nanjing University of Posts and Telecommunications, Xianlin Campus, Nanjing 210003, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(11), 1960; https://doi.org/10.3390/sym17111960
Submission received: 8 September 2025 / Revised: 16 October 2025 / Accepted: 31 October 2025 / Published: 14 November 2025
(This article belongs to the Special Issue Symmetry and Asymmetry in Data Analysis)

Abstract

Mimic defense, as an emerging active defense architecture, enhances the resilience of critical systems against unknown attacks through diversified redundant executors and dynamic switching mechanisms. However, the structural heterogeneity and dynamic behaviors of such systems pose great challenges for efficient and secure task scheduling, which traditional algorithms fail to address effectively. To overcome these limitations, this paper proposes a deep reinforcement learning-based scheduling algorithm for mimic defense servers, termed DRLMDS, which integrates an improved particle swarm optimization strategy to construct an environment-adaptive scheduling model capable of perceiving system state changes and optimizing task-resource allocation among heterogeneous executors. The algorithm is validated on mimic defense server datasets containing multiple heterogeneous nodes, where symmetric resource distribution and adjudication mechanisms are explicitly modeled to ensure balanced load distribution and robustness. Experimental results demonstrate that DRLMDS not only effectively defends against malicious attacks but also achieves approximately 30% reduction in task response time, 25% improvement in resource utilization, and nearly 40% enhancement in system stability compared with traditional swarm intelligence algorithms. These findings confirm the superior efficiency, robustness, and security advantages of the proposed approach in complex edge computing environments. This study provides a novel approach for intelligent and adaptive task scheduling in mimic defense architectures, offering theoretical support for active defense research and practical guidance for secure system deployment.

1. Introduction

1.1. Research Background

In the era of ubiquitous computing, edge computing environments are increasingly exposed to sophisticated and evolving cyber threats. Traditional static defense mechanisms often fail to counteract unknown or adaptive attacks, leading to system instability and data compromise. As a novel computing paradigm, the mimic defense server has shown great promise in enhancing system resilience by leveraging redundant executors, dynamic switching, and heterogeneous configurations [1]. These features enable flexible task scheduling and provide a structural foundation for resisting diverse network attacks. However, with the rapid evolution of attack strategies, even mimic defense systems face new security vulnerabilities during task scheduling [2].

1.2. Research Problem

During the scheduling process of mimic defense servers, as illustrated in Figure 1, attackers may tamper with scheduling information, infer resource allocation patterns, or exploit predictable task mappings. Such behaviors can result in service disruption, resource abuse, and data leakage. Therefore, how to enhance the security and unpredictability of mimic defense scheduling while maintaining efficiency has become a pressing research problem. Specifically, the challenge lies in designing a scheduling mechanism that dynamically adapts to environmental changes while preserving system diversity and randomness.

1.3. Research Status

Existing research has made progress in optimizing task scheduling and enhancing system security. Classical optimization approaches, including Particle Swarm Optimization (PSO), Genetic Algorithms (GA), and Ant Colony Optimization (ACO), have been applied to edge computing due to their global search capabilities [3]. Meanwhile, Deep Reinforcement Learning (DRL) has recently attracted attention for its ability to learn adaptive scheduling strategies through environment interaction [4]. However, most existing DRL-based or swarm-based scheduling methods are designed for general distributed systems, and few have explicitly integrated mimic defense principles such as diversity and behavioral randomness into the scheduling process [5,6].

1.4. Problems and Gaps in Existing Research

Despite these advances, several limitations remain:
  • Most existing approaches rely on fixed scheduling objectives, lacking control over randomness and unpredictability, which makes them vulnerable to inference attacks.
  • Security enhancements such as encryption and authentication protect data but fail to introduce structural diversity at the scheduling layer.
  • Traditional scheduling algorithms often ignore heterogeneity and redundancy, resulting in weak adaptability to dynamic and adversarial environments [7,8].
  • As a result, mimic servers still suffer from structural vulnerabilities at the scheduling layer, which can be exploited by attackers to achieve persistent infiltration [9].

1.5. Research Objective and Scope

To overcome these limitations, this study proposes a Deep Reinforcement Learning-Based Mimic Defense Scheduling Algorithm (DRLMDS). The proposed algorithm integrates the adaptive learning capability of DRL with an improved PSO strategy to optimize resource allocation and scheduling under adversarial conditions. Specifically, DRLMDS introduces: (1) a dynamic scheduling mechanism that perceives environment changes and adjusts resource binding in real time; (2) policy perturbation and randomized control to generate diversified scheduling behaviors; and (3) heterogeneous resource scheduling that enhances system redundancy and tolerance to attacks [10,11]. The study focuses on mimic defense server environments in edge computing, aiming to enhance security, robustness, and efficiency simultaneously.
In summary, the main contributions of this paper are as follows:
  • DRL-Based Scheduling Framework for Mimic Systems: A dynamic scheduling objective model tailored to mimic defense servers is established. The proposed DRLMDS algorithm integrates DRL’s adaptive learning capacity with mimic diversity mechanisms to dynamically optimize task scheduling while maintaining system-level security and robustness.
  • Enhanced Redundancy and Heterogeneity via Improved PSO: A layered learning strategy and t-distribution-based perturbation are introduced into the PSO algorithm, strengthening the flexibility of multi-path scheduling and enhancing system redundancy against potential compromise of specific nodes.
  • Empirical Validation in Mimic Server Environments: Extensive experiments verify the algorithm’s superior performance under adversarial conditions, demonstrating substantial reductions in task latency and resource consumption, as well as notable improvements in the overall reliability and resilience of mimic defense systems.
Comprehensive experimental evaluations demonstrate that DRLMDS significantly improves scheduling performance and defensive capability under malicious attack conditions. Specifically, the algorithm effectively reduces task delay and resource waste while enhancing system robustness and efficiency in complex threat environments.

1.6. Paper Structure

The remainder of this paper is organized as follows. Section 2 reviews related work on mimic defense and intelligent scheduling. Section 3 presents the design of the proposed DRLMDS algorithm, including its DRL-based framework, improved PSO strategy, and mimic scheduling mechanism. Section 4 details the experimental setup and evaluation results, demonstrating the algorithm’s superiority in reducing task latency, improving resource utilization, and enhancing resilience against attacks. Section 5 concludes the study and outlines potential directions for future research.

2. Related Work

In recent years, with the rapid advancement of edge computing and mimic computing technologies, resource scheduling and system defense have become focal points of academic research. As a key approach to improving computational efficiency and system stability, resource scheduling has been extensively studied in various domains such as cloud computing and edge computing. However, mimic servers, as a novel computing architecture that integrates diversity, dynamism, and redundancy, present significantly different challenges in scheduling and defense compared to traditional systems.
For example, mimic systems emphasize the “equivalent polymorphism” of execution entities and their runtime dynamic switching. Consequently, scheduling strategies must not only ensure efficient resource utilization but also address system unpredictability, security, and dynamic defense capabilities—requirements that are not commonly encountered in cloud-edge environments. Thus, despite substantial progress in traditional resource scheduling, there remains a clear gap in optimizing scheduling for mimic servers.
The scheduling algorithm proposed in this paper is specifically designed around the characteristics of mimic architectures, aiming to balance performance optimization with attack resistance, thereby addressing a critical gap in existing research [12].
In the fields of cloud and edge computing, related studies have primarily focused on efficient resource allocation, task optimization, and reduction of latency and energy consumption [13]. For example, Masadeh et al. [14] proposed a cloud-computing task-scheduling approach based on the Sea Lion Optimization (SLnO) algorithm; by emulating the hunting behavior of sea lions and integrating a multi-objective optimization model, they reduced overall completion time, cost and energy consumption while improving resource utilization, thereby achieving superior scheduling performance. Wang et al. [12] proposed a distributed remote-calibration prototype system based on a cloud-edge-end architecture. By integrating a high-precision frequency-to-voltage conversion module, satellite timing signals, environmental monitoring, video surveillance, and OCR technology, they solved the problems of traceability transmission and on-site intelligent data recording/extraction in remote calibration, enabling high-precision and intelligent remote calibration of power equipment. Chen Y. et al. [15] proposed a flexible resource scheduling model combining Software-Defined Networking (SDN) with edge computing to address resource scheduling challenges in software-defined cloud manufacturing. Junchao Y. et al. [16] developed a parallel intelligent-driven resource scheduling scheme for Intelligent Vehicle Systems (IVS) to tackle task delays and network load imbalance caused by dual dependencies in time and data. However, these works did not adequately consider integration with mimic defense mechanisms.
Cleverson V.N. et al. [17] introduced an intent-aware reinforcement learning approach for radio resource scheduling in wireless access network slicing, aimed at fulfilling service quality intents of different slices, but the method lacks sufficient adaptability under attack scenarios. Dong Y. et al. [18] proposed a deep reinforcement learning-based dynamic resource scheduling algorithm to adapt to device heterogeneity and network dynamics, but their model failed to fully leverage system diversity to enhance robustness.
Most of these efforts overlook the unique scheduling demands of mimic systems, including runtime dynamic switching, execution path uncertainty, and heterogeneity-driven redundancy defense strategies—making their approaches difficult to directly apply to mimic server environments.
In the field of task scheduling algorithms, Particle Swarm Optimization (PSO) has been widely applied to resource scheduling problems. For example, one study proposed a global path planning method for Autonomous Underwater Vehicles (AUVs) based on an improved T-distribution Fireworks–PSO algorithm, in which a T-distribution perturbation mechanism was introduced to enhance the algorithm’s global search capability and its ability to escape local optima [19]. In addition, some works have combined PSO with reinforcement learning for UAV path planning—where reinforcement learning is used for real-time decision-making, and PSO is employed to optimize the search process of those decisions [20].
From a security perspective, mimic computing emphasizes enhancing system unpredictability through diversity, dynamism, and randomness [21], thus offering a theoretical foundation for defending against unknown attacks. Prior works such as Shao et al. [22] introduced a heterogeneous executor selection mechanism, and Wang et al. [23] designed a load balancing strategy, but most of these approaches failed to deeply explore how scheduling mechanisms themselves affect the effectiveness of mimic defense. In particular, they lack modeling of the dynamic and stochastic nature of scheduling behavior, making it difficult to maintain robustness under complex attack scenarios.
Currently, most mimic scheduling research still suffers from several notable limitations. First, existing methods often rely on fixed scheduling strategies or simplistic optimization goals, with limited adaptability to dynamic changes in system states [24]. This makes the system vulnerable to predictability and exploitation by attackers, leading to poor defensive performance [2]. Second, although some studies attempt to introduce randomness into task scheduling, they generally lack adaptive optimization strategies such as deep reinforcement learning, resulting in low scheduling flexibility in dynamic environments [25]. Moreover, current research does not sufficiently improve randomization and unpredictability in resource scheduling, limiting their effectiveness in resisting intelligent attacks [26].
To summarize, current mimic server scheduling approaches face the following architecture-specific challenges:
  • Weak dynamic defense capability: Most existing methods rely on static scheduling strategies and lack environment-aware adaptation mechanisms, making them ineffective in responding to changing attacker tactics in real-time [18,19].
  • Lack of unpredictability in scheduling outcomes: Task assignments often lack sufficient randomness and diversity, allowing attackers to infer patterns and launch targeted attacks [27,28].
  • Separation of defense mechanisms and scheduling logic: Many studies treat scheduling and defense as independent modules, rather than integrating them under the unified concept of “scheduling as defense” [29].
Although existing studies have made progress in optimizing resource allocation, most have failed to fully integrate mimic defense principles, resulting in insufficient robustness against malicious attacks. Traditional methods typically rely on fixed scheduling strategies or simplified optimization objectives, lacking the necessary adaptability and diversity, which makes them susceptible to being predicted and exploited by attackers. Moreover, current approaches still exhibit significant deficiencies in the randomization and unpredictability of resource scheduling, limiting their effectiveness in resisting sophisticated and adaptive attacks.
To address these limitations, this paper proposes an efficient and secure scheduling solution for mimic servers by integrating deep reinforcement learning (DRL) with the core philosophy of mimic defense. The proposed design centers around the three foundational characteristics of mimic computing—dynamism, diversity, and redundancy—to systematically enhance the scheduling mechanism. We introduce a Deep Reinforcement Learning-based Mimic Defense Server Scheduling algorithm (DRLMDS) that bridges the gap in existing research by improving resilience under complex attack scenarios. This method breaks through the limitations of traditional scheduling systems that struggle to balance performance and security, and it offers a feasible path for the design of scheduling mechanisms in future high-security computing architectures.

3. DRLMDS Design

The proposed DRLMDS aims to enhance the robustness and resource utilization efficiency of mimic defense servers when facing malicious attacks by integrating deep reinforcement learning (DRL) with mimic defense principles. The mimic defense server adopts a symmetric architecture, in which computing resources, tasks, and decision-making mechanisms are evenly distributed across multiple nodes. DRLMDS reduces bias and enhances generality by designing a scheduling strategy that treats similar tasks and nodes symmetrically. By incorporating a symmetric decision-making mechanism, it further improves the fairness and robustness of scheduling decisions. Compared with traditional methods, this symmetry-aware design contributes to more efficient resource utilization and better load balancing. The overall framework of the algorithm, as illustrated in Figure 2, consists of a deep reinforcement learning module, a mimic defense module, and a mechanism that integrates the two. By leveraging dynamic optimization and randomized strategies, DRLMDS significantly improves the system’s resistance to attacks and its efficiency in resource utilization.

3.1. Deep Reinforcement Learning Module

The deep reinforcement learning (DRL) module serves as the core component of the algorithm, responsible for dynamically adjusting resource allocation strategies based on the current system state [30]. This paper employs Deep Q-Network (DQN) as the implementation framework for DRL (as shown in Figure 3), where the agent interacts with the environment to learn the optimal scheduling policy [13]. The DRL module achieves dynamic behavior through the following mechanisms [5]:
  • Dynamic Resource Scheduling: The DRL module dynamically adjusts resource allocation strategies based on the current system state, such as task latency, server energy consumption, and task queue length [8]. Through continuous learning and optimization, the DRL module can respond to system changes in real-time to ensure optimal resource scheduling.
  • Dynamic Adaptation to Attack Scenarios: The DRL module adapts dynamically to various attack scenarios. By continuously adjusting the learning rate and exploration strategies, the system can respond to attack behaviors in real-time and modify resource scheduling policies to resist attacks.
The Deep Q-Network (DQN) algorithm serves as a foundational implementation framework for deep reinforcement learning (DRL), utilizing two neural networks: the evaluation network and the target network. The evaluation network, also called the action network, is the main network in the DQN algorithm. It is responsible for selecting actions by taking the current state as input and outputting the Q-value estimates for each action through forward propagation. The target network is an auxiliary network used to calculate the target Q-values. Its parameters remain unchanged for a fixed period (either a certain number of training steps or a time interval), as shown in Figure 4. This provides relatively stable target Q-values, reducing fluctuations during training and improving the stability and convergence of the algorithm.
The proposed DRLMDS leverages the structural symmetry of mimic defense servers to enhance scheduling fairness and load balancing. Specifically, symmetry is realized through:
  • Symmetric Resource Distribution: Computing resources, tasks, and arbitration mechanisms are evenly distributed across multiple heterogeneous nodes.
  • Symmetric Task-Node Mapping: Similar tasks are assigned to nodes with similar capabilities, ensuring that no single node is overloaded or underutilized.
  • Symmetric Decision-Making: The arbitration mechanism applies uniform criteria to evaluate scheduling strategies, avoiding bias toward any particular node or task type.
This symmetric design reduces scheduling bias, improves resource utilization, and enhances system robustness against targeted attacks.
Building on the DRL framework established by DQN, the Deep Deterministic Policy Gradient (DDPG) algorithm introduces an actor-critic architecture, as shown in Figure 5, to address limitations in handling continuous action spaces and enhance performance and efficiency. It combines the policy network (Actor) and the value function network (Critic): the Actor improves the policy by maximizing the Critic’s evaluation of the policy, while the Critic updates its parameters by comparing the actual rewards with the value function estimates. The Critic provides feedback on the policy to guide the Actor’s updates, and the Actor generates action strategies (especially in continuous spaces) to produce new experiences for the Critic to learn from.
In addition, DDPG retains key components from the DQN framework, including experience replay and target networks, as shown in the overall algorithm framework in Figure 6. Similar to DQN, DDPG’s target networks keep parameters fixed for a certain period, allowing the algorithm to focus more on long-term cumulative rewards, slowing down target value updates, and improving stability. Experience replay, another inherited component, breaks the correlation between consecutive samples, reducing the impact of correlated data on training. By randomly sampling data from the experience buffer, the algorithm better leverages data diversity and coverage, avoiding overfitting to specific patterns and further enhancing stability and convergence.

3.1.1. State Space

The state space describes the environment’s state. In this paper, task delay and system energy consumption are used as the state representation after the execution of an action. The state is defined as:
$S = \{ T_{tran}, T_{comp}, E_{tran}, E_j \}$
where $T_{tran}$ represents the total transmission delay of the current scheduling scheme, $T_{comp}$ is the computation delay of processing tasks by edge servers, $E_{tran}$ is the total transmission energy consumption of data transferred from edge devices to servers in the edge system, and $E_j$ is the total computation energy consumption generated by all edge servers during task processing.
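For concreteness, the following minimal Python sketch shows how such a state vector could be assembled; the function name and array layout are illustrative choices, not part of the original algorithm.

```python
import numpy as np

def build_state(T_tran, T_comp, E_tran, E_j):
    """Assemble the state observed after executing an action (Eq. 1):
    total transmission delay, computation delay of the edge servers,
    total transmission energy, and total server computation energy."""
    return np.array([T_tran, T_comp, E_tran, E_j], dtype=np.float32)
```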

3.1.2. Action Space

Each action in the action space represents a specific resource allocation scheme. Selecting an action corresponds to selecting a server, to which the task is transmitted for processing on the chosen edge server. For example, as shown in Figure 7, given a task set $T = \{t_1, t_2, t_3, t_4\}$ and a set of server resources $S = \{S_1, S_2, S_3, S_4, S_5\}$ at time $t$, the resource allocation matrix is shown in Figure 7. An element in the matrix with value 1 indicates that the corresponding task is assigned to a server resource.
At time $t$, task $t_1$ is allocated to server resource $S_1$, task $t_2$ is allocated to $S_2$, and tasks $t_3$ and $t_4$ are assigned to server resources $S_3$ and $S_1$, respectively. At this point, the system is in state $s_t$.
Based on the selected action $a_t$, which determines the resource allocation scheme, the allocations for $t_1$ and $t_3$ are maintained, while $t_2$ is reallocated to resource $S_2$ and $t_4$ is reallocated to $S_5$, leading to the next state $s_{t+1}$.
In terms of action selection, the algorithm follows the ε-greedy policy together with the reward function. With probability ε, the action with the highest estimated Q-value is selected, allowing the algorithm to move toward the optimal solution; with probability 1 − ε, a random action is chosen to broaden the exploration of the state and action spaces and avoid falling into local optima. Meanwhile, the resulting state-transition tuple $(s_t, a_t, r_t, s_{t+1})$ is stored in the experience replay buffer; randomly sampling from this buffer when training the DQN network helps reduce the correlation between samples.
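A minimal Python sketch of this selection-and-storage step is given below; the class name, default ε value, and buffer size are illustrative assumptions rather than settings from the paper.

```python
import random
from collections import deque

import numpy as np

class EpsilonGreedyScheduler:
    """Illustrative ε-greedy action selection plus experience storage."""

    def __init__(self, q_network, num_actions, epsilon=0.9, buffer_size=10000):
        self.q_network = q_network          # evaluation network: state -> Q-values
        self.num_actions = num_actions      # one action per candidate task-to-server mapping
        self.epsilon = epsilon              # probability of exploiting the best-known action
        self.replay_buffer = deque(maxlen=buffer_size)

    def select_action(self, state):
        # With probability ε, exploit: choose the action with the highest Q-value.
        if random.random() < self.epsilon:
            q_values = np.asarray(self.q_network(np.asarray(state, dtype=np.float32)[None, :]))[0]
            return int(np.argmax(q_values))
        # With probability 1 - ε, explore: choose a random allocation.
        return random.randrange(self.num_actions)

    def store_transition(self, s_t, a_t, r_t, s_next):
        # Stored transitions are later drawn in random minibatches,
        # which breaks the correlation between consecutive samples.
        self.replay_buffer.append((s_t, a_t, r_t, s_next))
```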

3.1.3. Reward Function

The objective of this study is to minimize task delay and system energy consumption; therefore, the reward function is designed around these two aspects. Since the objective is negatively correlated with the reward, the reward r is defined as shown in Equation (2), where $T_{total}$ is the total delay of all tasks in the edge system and $E_{total}$ is the total energy consumption in the edge environment.
$r = \frac{1}{T_{total} + E_{total}}$
During the training of the DQN network, the loss function is used to optimize the network parameters by minimizing the difference between the target Q-network and the evaluation Q-network. The target Q-value is calculated as:
$y_t = r + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \omega^-)$
where $\omega^-$ denotes the parameters of the target network and $Q(s_t, a_t; \omega)$ is the Q-value estimated by the evaluation network. Mean Squared Error (MSE) is adopted as the loss function for the DQN neural network, as shown in Equation (4).
$L(\omega) = \mathbb{E}\left[ \left( y_t - Q(s_t, a_t; \omega) \right)^2 \right]$
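As a hedged illustration, the target Q-values of Equation (3) and the MSE loss of Equation (4) could be computed for a sampled minibatch in TensorFlow as sketched below; the function and variable names are assumptions made for this example.

```python
import tensorflow as tf

def dqn_targets_and_loss(eval_net, target_net, batch, gamma=0.9):
    """Target Q-values (Eq. 3) and MSE loss (Eq. 4) for one sampled minibatch.
    `batch` holds (states, actions, rewards, next_states) drawn at random
    from the experience replay buffer."""
    states, actions, rewards, next_states = batch
    actions = tf.cast(actions, tf.int32)
    rewards = tf.cast(rewards, tf.float32)

    # Target network (frozen parameters) provides stable bootstrap targets.
    next_q = target_net(next_states)                        # shape: (batch, num_actions)
    y = rewards + gamma * tf.reduce_max(next_q, axis=1)

    # Evaluation network estimates Q(s_t, a_t; w) for the actions actually taken.
    q_all = eval_net(states)
    idx = tf.stack([tf.range(tf.shape(q_all)[0]), actions], axis=1)
    q_taken = tf.gather_nd(q_all, idx)

    # Mean squared error between target and evaluation Q-values.
    loss = tf.reduce_mean(tf.square(tf.stop_gradient(y) - q_taken))
    return y, loss
```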

3.1.4. Network Architecture

The network component of the DQN algorithm consists of an evaluation network and a target network. Both networks share the same architecture and utilize a Deep Neural Network (DNN). The DNN structure used in this algorithm is shown in Figure 8. It contains five connected layers, including one input layer, three hidden layers, and one output layer. The input layer receives state feature information, each hidden layer contains 32 nodes to learn the features among data, and the output corresponds to the Q-values for different actions. The activation function used is the ReLU function. Figure 9 illustrates the architecture of the DQN module, where the agent continuously optimizes the scheduling policy through interactions with the environment.
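For illustration, a Keras model matching the described topology (an input layer, three hidden layers of 32 ReLU units, and a linear output layer) could be built as follows; the concrete state dimension and action count in the comments are assumptions, not values from the paper.

```python
import tensorflow as tf

def build_q_network(state_dim, num_actions):
    """Five-layer DNN shared by the evaluation and target networks: an input
    layer, three hidden layers of 32 ReLU units, and a linear output layer
    producing one Q-value per scheduling action."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(num_actions, activation=None),
    ])

# The target network is a structural copy whose weights are held fixed for a
# number of steps and then synchronized with the evaluation network, e.g.:
# eval_net = build_q_network(state_dim=4, num_actions=20)   # 4 state features, 20 servers (assumed)
# target_net = build_q_network(state_dim=4, num_actions=20)
# target_net.set_weights(eval_net.get_weights())
```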

3.2. Scheduling Optimization Based on Improved Particle Swarm Algorithm

Resource scheduling problems typically require consideration of multiple constraints and objective functions. The global search capability of the particle swarm algorithm allows it to find the optimal or near-optimal solutions across the entire solution space [11]. This section presents improvements to the particle swarm algorithm by introducing hierarchical learning [19] and the concept of t-distribution [20]. Particles are classified into layers according to their fitness values. During the particle update process, lower-level particles learn from higher-level particles, thereby enhancing the overall quality of the swarm. To ensure particle diversity, mutations based on the t-distribution are applied to particles to expand the search space. Specifically, the improved particle swarm algorithm achieves heterogeneity and redundancy through the following methods:
  • Heterogeneity: The initial particle swarm is constructed based on initial scheduling schemes generated by the DQN algorithm, with each particle representing a resource scheduling scheme. These particles embody different scheduling strategies, reflecting the system’s heterogeneity.
  • Redundancy: By generating multiple particles, the system contains several redundant scheduling schemes that can perform the same or similar functions. When one scheduling scheme is attacked or fails, other schemes can take over the tasks, ensuring system stability and continuity.
In this chapter, the DQN algorithm is used to generate the initial solutions for the particle swarm, improving the quality of the initial values. The top K scheduling schemes are selected from the experience replay pool of the DQN algorithm as particles in the particle swarm algorithm, denoted as $X = \{x_1, x_2, \ldots, x_K\}$. Each particle $x_i$ represents a resource scheduling scheme.
The dimension of each particle equals the number of tasks in the edge environment, where each dimension corresponds to a task. The value at each dimension indicates the server resource ID allocated to that task. In the example shown in Figure 10, resource $S_0$ is assigned to tasks $\{t_3, t_4\}$, resource $S_1$ is assigned to task $\{t_1\}$, resource $S_2$ is assigned to task $\{t_6\}$, and resource $S_3$ is assigned to task $\{t_5\}$. A sketch of this encoding is given below.
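The encoding and DQN-based initialization can be sketched in Python as follows; the helper names are illustrative, and the value assumed for task $t_2$ (not specified in the text) is purely for the example.

```python
import numpy as np

# Each particle has one dimension per task; the value in dimension k is the ID
# of the server assigned to task t_k.  The vector below mirrors the example in
# the text (t_2's server is not given there and is set to S_2 only for illustration).
#                     t1  t2  t3  t4  t5  t6
particle = np.array([  1,  2,  0,  0,  3,  2])   # S1, S2, S0, S0, S3, S2

def init_swarm_from_dqn(replay_schemes, fitness_fn, swarm_size):
    """Seed the swarm with the top-K scheduling schemes from the DQN replay
    pool, ranked by fitness (lower is better)."""
    ranked = sorted(replay_schemes, key=fitness_fn)
    return [np.asarray(scheme, dtype=int) for scheme in ranked[:swarm_size]]
```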
In the particle swarm optimization algorithm, the fitness value of each particle reflects its quality in the solution space. The calculation of the fitness value is usually designed based on the objective function. The objective function in this chapter is to minimize system energy consumption and task latency. Therefore, a smaller fitness value indicates that the particle is closer to the optimal solution. When calculating the fitness value, this work considers constraint terms. For scheduling schemes where the latency exceeds the maximum allowed latency or the energy consumption exceeds the preset maximum energy, penalty values are assigned. The fitness value consists of three parts: task latency, system energy consumption, and the penalty value, as shown in Equations (5)–(7).
$F = \sum_{t_k \in T} \left( \frac{E_{t_k}^{total}}{E_{max}} + \phi_1 \right) + \sum_{t_k \in T} \left( \frac{T_{t_k}^{total}}{T_{max}} + \phi_2 \right)$
$\phi_1 = \begin{cases} \sum_{t_k \in T} \sum_{S_j} \left( E_{t_k}^{tran} + E_{t_k j} \right) - E_{max}, & \sum_{t_k \in T} \sum_{S_j} \left( E_{t_k}^{tran} + E_{t_k j} \right) \ge E_{max} \\ 0, & \text{otherwise} \end{cases}$
$\phi_2 = \begin{cases} T_{t_k}^{total} - T_{max}, & T_{t_k}^{total} \ge T_{max} \\ 0, & \text{otherwise} \end{cases}$
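A simplified, per-task reading of Equations (5)–(7) is sketched below; the helper functions energy_of and delay_of are hypothetical placeholders for the transmission-plus-computation cost models and are not defined in the paper.

```python
def fitness(particle, energy_of, delay_of, E_max, T_max):
    """Simplified per-task fitness following Eqs. (5)-(7): normalized energy
    and delay plus penalties whenever a budget is exceeded."""
    total = 0.0
    for task, server in enumerate(particle):
        e_k = energy_of(task, server)                 # transmission + computation energy
        t_k = delay_of(task, server)                  # transmission + computation delay
        phi1 = e_k - E_max if e_k > E_max else 0.0    # energy-budget penalty (phi_1)
        phi2 = t_k - T_max if t_k > T_max else 0.0    # delay-budget penalty (phi_2)
        total += e_k / E_max + phi1 + t_k / T_max + phi2
    return total   # smaller fitness = closer to the optimal schedule
```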
Each particle determines its personal best solution based on its own fitness value, which in turn guides the speed and direction of its position update. At the same time, particles with better fitness values attract other particles to move closer to them, thereby promoting the entire swarm to move toward the optimal solution through information exchange among particles. To facilitate particles moving toward the optimal solution, this section performs a sorting and layering operation on the particles according to their fitness values, as shown in Figure 11.
The particle set in the particle swarm is $X = \{x_1, x_2, \ldots, x_N\}$. The particles are sorted in ascending order according to their fitness values $F = \{f_1, f_2, \ldots, f_N\}$. After sorting, the particles are renumbered, and the swarm is divided into three layers $H = \{H_1, H_2, H_3\}$. The first and third layers initially contain the first $\frac{1}{8}$ and the last $\frac{1}{8}$ of the sorted swarm, respectively.
After each iteration of the algorithm, particles are resorted and re-layered according to their fitness values, as shown in Equations (8) and (9):
$H_1^{end} = \frac{1}{8}N + N \times \left( \frac{0.5\,t}{T} \right)^3$
$H_2^{end} = \frac{7}{8}N + N \times \left( \frac{0.5\,t}{T} \right)^3$
where $N$ is the total number of particles in the swarm, $t$ is the current iteration number, and $T$ is the maximum number of iterations set by the algorithm. The first-layer particle set is $H_1 = \{x_1, x_2, x_3, \ldots, x_{H_1^{end}}\}$, the second-layer particle set is $H_2 = \{x_{H_1^{end}+1}, x_{H_1^{end}+2}, x_{H_1^{end}+3}, \ldots, x_{H_2^{end}}\}$, and the third-layer particle set is $H_3 = \{x_{H_2^{end}+1}, x_{H_2^{end}+2}, x_{H_2^{end}+3}, \ldots, x_N\}$.
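The sorting and layering step, under the reading of Equations (8) and (9) adopted above, could look like the following sketch; the function name is illustrative.

```python
import numpy as np

def layer_swarm(particles, fitness_values, t, T):
    """Sort particles by ascending fitness and split them into three layers.
    Boundaries start at N/8 and 7N/8 and drift upward as iteration t
    approaches the maximum T (Eqs. 8-9 as reconstructed here)."""
    N = len(particles)
    order = np.argsort(fitness_values)               # best (smallest fitness) first
    ranked = [particles[i] for i in order]

    shift = int(N * (0.5 * t / T) ** 3)               # grows from 0 to N/8 over the run
    h1_end = N // 8 + shift
    h2_end = 7 * N // 8 + shift

    H1 = ranked[:h1_end]                              # elite layer: classic PSO update
    H2 = ranked[h1_end:h2_end]                        # middle layer: learns from H1
    H3 = ranked[h2_end:]                              # worst layer: t-distribution mutation
    return H1, H2, H3
```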
The particles in the first layer are closer to the global optimal solution. For updating particles in this layer, the traditional method is applied, where the velocity and position are updated based on the particle’s personal best and the global best solutions, as shown in Equation (10):
$v_i^{t+1} = \omega v_i^t + c_1 \gamma_1 \left( P_{i,best}^t - x_i^t \right) + c_2 \gamma_2 \left( G_{best}^t - x_i^t \right), \qquad x_i^{t+1} = x_i^t + v_i^{t+1}$
The particles in the second layer update their positions and velocities by learning from the particles in $H_1$. Two particles, $x_k \in H_1$ and $x_l \in H_1$, are randomly selected from the first layer as references for the update. The update rules are shown in Equation (11):
$v_i^{t+1} = \omega v_i^t + c_1 \gamma_1 \left( x_k^t - x_i^t \right) + c_2 \gamma_2 \left( x_l^t - x_i^t \right), \qquad x_i^{t+1} = x_i^t + v_i^{t+1}$
For the particles in the third layer with relatively high fitness values, a t-distribution mutation is applied to increase the randomness and diversity of the search. The t-distribution, also known as Student's t-distribution and denoted $Z \sim t(n)$, has the probability density function given by Equation (12), where $n$ is the degrees of freedom. When $n = 1$, the t-distribution reduces to the Cauchy distribution, and as $n \to \infty$ it approaches the Gaussian distribution. Therefore, the t-distribution combines characteristics of both the Cauchy and Gaussian distributions:
$f_Z(x) = \frac{\Gamma\left( \frac{n+1}{2} \right)}{\sqrt{n\pi}\,\Gamma\left( \frac{n}{2} \right)} \left( 1 + \frac{x^2}{n} \right)^{-\frac{n+1}{2}}$
The mutated position is given by Equation (13):
$x_i^{t+1} = x_i^t + x_i^t \times t(n)$
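The third-layer mutation of Equations (12) and (13) can be sketched with NumPy's Student-t sampler; rounding and clipping back to valid server IDs is an added assumption for the discrete scheduling encoding and is not stated in the paper.

```python
import numpy as np

def t_mutation(particle, dof, num_servers):
    """Third-layer mutation: x <- x + x * t(n).  Small degrees of freedom give
    heavy, Cauchy-like tails (wide exploration); large values approach
    Gaussian noise."""
    noise = np.random.standard_t(dof, size=particle.shape)
    mutated = particle + particle * noise
    # Map the continuous perturbation back to valid discrete server IDs.
    return np.clip(np.rint(mutated), 0, num_servers - 1).astype(int)
```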
The overall algorithm flow is shown in Figure 12. The initial particle swarm is constructed based on the initial scheduling schemes generated by the DQN algorithm. Particles are sorted and layered according to their fitness values. For particles in the first layer, their velocity and position are updated based on their individual best solutions and the global best solution of the current layer’s swarm. For particles in the second layer, two particles are randomly selected from the first layer, and their velocity and position are updated according to specific formulas. For particles in the third layer, mutations are applied using the t-distribution to change their positions without altering their velocities. Before the iteration count reaches the maximum number set by the algorithm, after each iteration and update of the particle swarm, the fitness of each particle is recalculated, and particles are re-layered accordingly, while the global best solution is recorded.

3.3. Mimetic Defense Module

Specifically, DRLMDS achieves mimetic defense through the following four characteristics:
  • Dynamicity: The module implements a dynamic resource scheduling mechanism via the Deep Reinforcement Learning (DRL) module. This module adjusts resource allocation strategies in real-time based on the current system state, including task queue length, task latency, and server energy consumption. The DRL module employs adaptive learning rate adjustment and exploration strategies to cope with different types of attack scenarios. It continuously learns and updates strategies during runtime to dynamically resist external attacks. This dynamic adaptability ensures efficient system operation under complex and changing attack environments.
  • Heterogeneity: The system deploys heterogeneous servers running different OS versions or possessing diverse hardware architectures to build the mimetic environment. At the start of scheduling, the system generates initial scheduling strategies using a Deep Q-Network (DQN) and maps them to multiple particles in the particle swarm, with each particle representing a resource scheduling scheme. Differences in the structural design of these particles naturally reflect system heterogeneity. This heterogeneity not only improves resource utilization efficiency but also significantly increases the uncertainty of attack paths for adversaries, thereby enhancing system security.
  • Redundancy: The system ensures redundancy by deploying multiple servers with similar functions but different configurations. During scheduling, the system maintains multiple candidate scheduling strategies as backup paths. When a server is attacked or a task fails, the system can promptly switch to other redundant servers to execute the corresponding tasks. Through the pool of multiple scheduling schemes maintained by particle swarm optimization (PSO), the system can quickly call alternative solutions when the original plan fails, guaranteeing service continuity and fault tolerance.
  • Decision-making Process: The mimetic defense module introduces a decision-making mechanism to select the optimal strategy among multiple scheduling schemes. This decision process can evaluate candidate schedules based on predefined rules or trained machine learning models (e.g., neural network evaluators), considering factors such as task completion rate, security, energy consumption, and system load comprehensively. The decision-making mechanism ensures that the final selected scheduling strategy effectively enhances resource utilization while possessing strong anti-attack capabilities.
The mimetic defense module significantly enhances system unpredictability by introducing diversity and randomness into scheduling strategies. Specifically, on top of the actions output by the DRL module, random perturbations are added, causing variations in task allocation paths and schemes, which increase the difficulty for attackers to predict system behavior. For example, each task can generate multiple possible execution paths during allocation. By introducing perturbation functions or random selection strategies, the diversity of these paths is increased, forming diversified scheduling behaviors.
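As a minimal illustration of such perturbation, the sketch below re-maps each task to an alternative server with a small probability; the probability value and function name are assumptions made for this example, not the paper's settings.

```python
import numpy as np

def perturb_schedule(schedule, num_servers, p_perturb=0.1, rng=None):
    """Add random perturbation on top of the DRL output: each task keeps its
    assigned server with probability 1 - p_perturb and is otherwise re-mapped
    to a random alternative, yielding diversified execution paths that are
    harder for an attacker to predict."""
    rng = rng or np.random.default_rng()
    perturbed = np.array(schedule, dtype=int, copy=True)
    for task, server in enumerate(perturbed):
        if rng.random() < p_perturb:
            alternatives = [s for s in range(num_servers) if s != server]
            perturbed[task] = rng.choice(alternatives)
    return perturbed
```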
Moreover, the system architecture of the mimetic environment is illustrated in Figure 13. This architecture deeply integrates the deep reinforcement learning module with the mimetic scheduling module. During the training phase, it learns resource scheduling strategies through extensive interactive data. In the operational phase, it dynamically collects state data for real-time scheduling and mimetic decision-making, thereby achieving a collaborative optimization of system security and scheduling performance.
Unlike prior approaches that treat scheduling and security as separate concerns, our method provides a holistic integration of mimic defense principles—dynamicity, heterogeneity, and redundancy—directly into the scheduling logic. This resolves the fundamental limitation of existing algorithms, which lack the inherent unpredictability, dynamic adaptation, and structural diversity required to defend against intelligent attacks in edge computing environments, thereby enabling a simultaneous optimization of both performance and security.
The mimetic defense module significantly enhances the system’s resilience against attacks by employing randomization and diversification strategies, making it difficult for attackers to predict the system’s scheduling patterns.

3.4. Deep Reinforcement Learning-Driven Mimetic Defense Mechanism

The core innovation of this work lies in combining deep reinforcement learning (DRL) with mimetic defense to achieve efficient and secure resource scheduling. The DRL module generates initial scheduling strategies, which are further optimized by the mimetic defense module to enhance system robustness. The specific integration mechanism is as follows:
  • The scheduling strategies produced by the DRL module serve as inputs, which the mimetic defense module optimizes through randomization and diversification. This dynamic optimization mechanism enables the system to adapt in real time to different attack scenarios.
  • The mimetic defense module increases system unpredictability by introducing diversity and randomness. Servers running different versions of operating systems introduce heterogeneity, further improving system security.
  • Additionally, the mimetic defense module generates multiple redundant scheduling strategies, ensuring system continuity even if some strategies are attacked or fail.
  • A decision-making (arbitration) mechanism within the mimetic defense module selects the optimal scheduling strategy. This mechanism evaluates and filters different strategies based on predefined rules or machine learning models, ensuring that the chosen strategy effectively resists attacks while improving resource utilization.
By integrating deep reinforcement learning with mimetic defense, the system not only dynamically optimizes resource allocation but also effectively counters malicious attacks, significantly improving overall system performance and security.
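A schematic of the arbitration step is sketched below; the metric names and weights are illustrative assumptions, since the paper leaves the exact evaluation criteria to predefined rules or learned models.

```python
def arbitrate(candidates, evaluate, weights=None):
    """Select the best scheduling strategy among redundant candidates.
    `evaluate(schedule)` is assumed to return a dict of normalized scores such
    as completion_rate, security, load_balance (higher is better) and energy
    (lower is better)."""
    weights = weights or {"completion_rate": 0.4, "security": 0.3,
                          "load_balance": 0.1, "energy": -0.2}
    def score(schedule):
        metrics = evaluate(schedule)
        return sum(w * metrics[name] for name, w in weights.items())
    return max(candidates, key=score)
```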

4. Experimental Design and Analysis

To verify the effectiveness of DRLMDS, this paper designs a resource scheduling experiment incorporating a mimic defense mechanism. The mimic defense environment is designed with the core concept of “diversified runtime environments and dynamic scheduling strategies,” aiming to enhance the system’s robustness and resource adaptability when facing various attacks. The experiment design covers the following four aspects:
Dynamic Validation: By simulating typical attack scenarios (DoS attacks and resource inference attacks), the system’s responsiveness and performance in dynamically adjusting resource scheduling strategies are verified. The DoS attack simulation logic involves introducing a continuous high-frequency stream of pseudo-tasks at the logical level, simulating an attacker causing the server resources to be busy for a prolonged period through network requests. The resource inference attack simulation logic assumes that attackers infer backend scheduling rules by analyzing task execution delays and resource allocation trajectories, thereby influencing subsequent task scheduling strategies. Under these attack scenarios, the system can identify abnormal task behaviors in real-time and dynamically adjust scheduling strategies (e.g., switching execution paths), effectively reducing task latency and energy consumption fluctuations.
Heterogeneity Validation: The experimental environment simulates multiple servers running different types of operating systems (such as Windows Server, Ubuntu, etc.) and various hardware specifications to construct a heterogeneous edge environment. The system can flexibly schedule tasks based on different server performances and load conditions, improving resource utilization. Specifically, under the heterogeneous server deployment environment, the system’s resource utilization improved by approximately 25% compared to a homogeneous environment, demonstrating strong heterogeneity adaptability.
Redundancy Validation: By simulating scenarios where some servers become unavailable due to attacks or failures, the stability of the system’s redundancy scheduling strategy is evaluated. Experimental results show that when the primary scheduling scheme fails, backup strategies can quickly take over tasks to ensure uninterrupted system services, improving system stability by about 40%.
Decision Process Validation: By comparing indicators such as energy consumption and task response time across different scheduling schemes, the effectiveness of the system’s decision module in selecting the optimal strategy is verified. The decision mechanism integrates rule-based models and empirical learning outcomes to ultimately select the scheduling scheme with the highest task completion rate and lowest energy consumption. The selected strategy outperforms baseline algorithms across multiple metrics. In experiments, the decision mechanism evaluates and filters different scheduling strategies based on predefined rules or machine learning models, ensuring that the chosen strategy effectively resists attacks and improves resource utilization efficiency. Specifically, the scheduling strategies selected by the decision mechanism demonstrate excellent performance in task response time, resource utilization, and system robustness.

4.1. Experimental Environment

The experimental environment simulates a typical edge computing scenario, consisting of multiple distributed edge nodes and terminal devices. To verify the effectiveness and robustness of DRLMDS, two types of attack scenarios are designed: DoS attacks and resource inference attacks, aiming to evaluate the system’s performance under different attack types. The DoS attack simulates high-frequency request attacks on edge servers to cause resource occupation; the resource inference attack periodically samples task processing times, builds time-series prediction models, and attempts to identify scheduling behavior patterns. By introducing dynamic attribute changes of nodes and attack logic simulation mechanisms, a typical mimic defense environment is reconstructed at the simulation level, thus validating the scheduling performance of DRLMDS under the “mimic environment.”

4.1.1. Experimental Setup

The experiments were conducted on Windows 10. The software stack comprised Python 3.8 and TensorFlow 2.7, running on hardware equipped with 16 GB RAM and an AMD Ryzen 7 5800H CPU.

4.1.2. Parameter Settings

To validate the practical effectiveness of the resource scheduling algorithm proposed in this chapter in reducing system energy consumption and shortening latency, experiments are conducted as follows. The algorithm parameters are set as shown in Table 1. The number of edge servers is set to 20. For the number of tasks to be processed, this paper sets it to [100, 200, 300, 400, 500] to facilitate performance evaluation of DRLMDS under different workloads.
Currently, there is a lack of standardized training and testing datasets in the field of resource scheduling for edge computing. Therefore, the parameter values used in the experiments are based on related experimental settings from similar research on edge computing resource scheduling models [1]. The parameter settings for the proposed resource scheduling model in this paper are referenced from the literature, with detailed values shown in Table 2.

4.1.3. Experimental Metrics

To evaluate the algorithm’s performance, the experiment compares and assesses four aspects: system energy consumption, average task delay, task response time, and resource utilization. Additionally, the Inverted Generational Distance (IGD) is used to evaluate the performance of the algorithms on multi-objective problems, serving as an indicator of the quality of the solution set generated.

4.2. Experimental Results and Discussion

In this phase, experiments were carried out to assess system performance under non-attack scenarios using conventional deep reinforcement learning (DRL)-based resource scheduling algorithms. Specifically, the experimental setup compared the proposed DRLMDS algorithm against two benchmark approaches: the Particle Swarm Optimization (PSO) algorithm and a hybrid method integrating Deep Q-Network (DQN) with a Genetic Algorithm (hereafter denoted as DQN-GA).

4.2.1. Evaluation of Energy-Saving Performance Under Variable Task and Server Quantities

To verify the energy-saving performance of DRLMDS, this section compares the system energy consumption of DRLMDS, PSO, and DQN-GA under different combinations of task numbers and server counts. The corresponding results are presented in Figure 14, and key observations are summarized as follows:
  • Overall Trend: The system energy consumption of all three algorithms increases to a certain extent as the number of tasks rises. Among them, the PSO algorithm exhibits the highest energy consumption, with a near-linear growth trend as the task quantity increases.
  • Performance Gap Under Low Task Loads: When the number of tasks is small, the energy consumption difference between DRLMDS and DQN-GA is relatively minor. For example, with 10 servers and 100 tasks, DRLMDS reduces energy consumption by 6.69% compared to DQN-GA.
  • Performance Advantage Under High Task Loads: As the number of tasks reaches 400 or more, the energy consumption gap between DRLMDS and DQN-GA widens. When the task count reaches 500, DRLMDS achieves a 7.82% reduction in energy consumption relative to DQN-GA. This result indicates that DRLMDS demonstrates superior performance when handling large-scale task loads.
Additionally, the impact of server quantity on energy consumption varies with the task load:
  • When the number of tasks is ≤200, increasing the number of servers does not reduce the total energy consumption for any of the three algorithms. This phenomenon is attributed to the fact that surplus idle servers still consume energy, leading to unnecessary resource waste.
  • When the number of tasks reaches 300, the system energy consumption of all three algorithms decreases as the number of servers increases.
To further verify the energy optimization effectiveness of DRLMDS, this paper analyzes the energy consumption of each algorithm when the number of tasks is fixed at 500, under three server configurations: 10, 20, and 30 servers. For each scenario, 10 independent runs are conducted per algorithm, and the Kruskal-Wallis H test (a non-parametric statistical test) is applied to confirm the robustness and reliability of the results, as sketched below.
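Such a test can be run with SciPy as in the following sketch; the function name and the normal-approximation confidence intervals are illustrative, not the paper's exact analysis pipeline.

```python
import numpy as np
from scipy import stats

def compare_energy(runs_by_algorithm, alpha=0.05):
    """Kruskal-Wallis H test over per-algorithm energy samples (e.g., 10
    independent runs each), plus an approximate 95% confidence interval
    for each algorithm's mean energy consumption."""
    h_stat, p_value = stats.kruskal(*runs_by_algorithm.values())
    intervals = {}
    for name, runs in runs_by_algorithm.items():
        runs = np.asarray(runs, dtype=float)
        half = 1.96 * runs.std(ddof=1) / np.sqrt(len(runs))
        intervals[name] = (runs.mean() - half, runs.mean() + half)
    return h_stat, p_value, p_value < alpha, intervals
```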

4.2.2. Results of Statistical Analysis

The statistical results indicate that under all three server configurations, the differences in energy consumption among the three algorithms are statistically significant (all p-values < 0.05 ). A representative example (with 10 servers) is detailed below:
  • The average energy consumption values of DRLMDS, DQN-GA, and PSO are 680, 740, and 810 (units consistent with experimental metrics) respectively.
  • Their 95% confidence intervals are [ 631.3 , 728.7 ] , [ 686.9 , 793.1 ] , and [ 751.9 , 868.1 ] , respectively, with a significance p-value of approximately 0.028 .
This result confirms that there is a significant difference in energy performance among the three algorithms, and DRLMDS outperforms the other two methods by a statistically significant margin. Similarly, under the server configurations of 20 and 30 servers, DRLMDS maintains its superior energy efficiency and retains a stable performance advantage.
Building on the preceding confidence interval and significance test results, a definitive conclusion can be drawn: DRLMDS consistently reduces overall energy consumption across different server configurations and outperforms comparative algorithms (i.e., DQN-GA and PSO) by a statistically significant margin.

4.2.3. Evaluation of Average Task Delay Performance

Following the energy consumption analysis, this section assesses the performance of the three algorithms in terms of average task delay—a critical metric for real-world resource scheduling systems. Notably, task sizes in practical scenarios are not fixed; to reflect this reality, the experiment compares the average task delay of the algorithms under three distinct task size ranges: [50–100 KB], [100 KB–1 MB], and [1–2 MB]. The corresponding results are presented in Figure 15.
Key Observations on Task Delay Trends
Across all task size categories, the average delay of all three algorithms increases as the number of tasks grows. However, two critical performance differences emerge:
  • Superiority over PSO: Both DRLMDS and DQN-GA achieve significantly lower average task delays compared to the PSO algorithm. This advantage stems from the DQN-based framework—algorithms rooted in DQN learn highly adaptive scheduling policies through continuous interaction with the system environment. Their neural network architectures excel at processing high-dimensional feature information in resource scheduling tasks, enabling real-time capture of key operational patterns. Consequently, hybrid approaches that combine swarm intelligence with deep reinforcement learning (e.g., DQN-GA and DRLMDS) outperform purely swarm intelligence-based methods (e.g., PSO) in solving resource allocation problems in edge computing environments.
  • Advantage of DRLMDS over DQN-GA: When the number of tasks reaches 500, DRLMDS further outperforms DQN-GA across all three task size ranges, achieving delay reductions of 10.08%, 11.11%, and 12.99% respectively. Moreover, as task size increases, DRLMDS maintains the lowest average task delay among the three algorithms and exhibits the slowest rate of delay growth with increasing task quantity.
Implications of Task Delay Results
These findings collectively indicate that DRLMDS not only delivers excellent energy efficiency, but also excels in handling large volumes of large-scale tasks—a critical requirement for real-world edge computing systems where both task size and quantity can fluctuate significantly.

4.2.4. Evaluation of Algorithm Solution Set Quality

In parallel with the aforementioned performance metrics, this section assesses the quality of solution sets generated by the algorithms—critical for validating their effectiveness in multi-objective optimization. The Inverted Generational Distance (IGD) metric is adopted for this purpose. IGD quantifies the average distance from each solution in the true Pareto front to its nearest counterpart in the algorithm-generated approximated front; a smaller IGD value indicates higher solution quality and superior optimization performance.
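A straightforward NumPy implementation of IGD, as defined above, is sketched below for reference.

```python
import numpy as np

def igd(true_front, approx_front):
    """Inverted Generational Distance: mean distance from each point of the
    true Pareto front to its nearest neighbour in the approximated front.
    Smaller values indicate a higher-quality solution set."""
    true_front = np.asarray(true_front, dtype=float)      # shape (M, num_objectives)
    approx_front = np.asarray(approx_front, dtype=float)  # shape (K, num_objectives)
    dists = np.linalg.norm(true_front[:, None, :] - approx_front[None, :, :], axis=2)
    return dists.min(axis=1).mean()
```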
To compare solution quality, the proposed DRLMDS is evaluated against PSO and DQN-GA using IGD values. Each algorithm was executed in 10 independent runs, and the average results are reported (see Figure 16). Key observations include:
  • As the number of tasks increases, problem complexity rises, leading to a general decline in solution quality across all algorithms.
  • DRLMDS consistently generates higher-quality solution sets than PSO and DQN-GA. This advantage is attributed to two core design features of DRLMDS: (1) DQN-based initialization of the particle swarm, which provides a better starting point for the search; and (2) a layered learning strategy for particle updates, which enhances the algorithm’s ability to explore and exploit the solution space.
To statistically validate the performance differences, a paired sample t-test was conducted on the IGD values from the 10 independent runs. The results confirm:
  • Compared to DQN-GA: DRLMDS achieves an average IGD improvement of 4.8, with a 95% confidence interval of [ 1.59 , 8.01 ] and a p-value of 0.0143 ( p < 0.05 ), indicating a statistically significant advantage.
  • Compared to PSO: DRLMDS delivers an average IGD improvement of 10.6, with a 95% confidence interval of [ 5.82 , 15.38 ] and a p-value of 0.0035 ( p < 0.01 ), demonstrating a highly significant performance edge.
These findings collectively confirm that DRLMDS offers superior solution quality and stability in resource scheduling optimization tasks.

4.2.5. System Performance Under Simulated Malicious Attack Scenarios

The second part of this section introduces a mimic defense mechanism to simulate malicious attack scenarios, evaluating three key system attributes: accuracy (in malicious behavior detection), latency, and malicious behavior suppression capability. Further, system performance is analyzed from three additional dimensions—latency, energy consumption, and cost—via simulation statistics.
Latency and Accuracy Under Attack Scenarios
Figure 17 presents the experimental results of system latency and detection accuracy under simulated attacks with the mimic defense mechanism enabled. Key insights are as follows:
  • Latency Fluctuations: System latency exhibits significant fluctuations during training, particularly in the middle phases (e.g., around normalized iteration steps 0.4–0.6). This suggests potential performance bottlenecks or resource constraints at these stages, which may stem from factors such as increased data-processing overhead, sub-optimal computation-resource allocation, network delays, or inherent algorithmic limitations. For example, certain iterations may involve more complex decision-making (e.g., handling multiple concurrent attacks) or intensive computation (e.g., model-parameter updates), leading to prolonged processing times.
  • Stable Detection Accuracy: In contrast to latency, the system’s accuracy in detecting malicious behaviors remains relatively stable, ranging between 0.59 and 0.64 . As training progresses and more attack-related data are accumulated, accuracy tends to stabilize around 0.6 .
  • Latency–Accuracy Relationship: While latency fluctuates within a relatively wide range (compared with accuracy variation), it does not converge toward a clear stable value. This instability may undermine system efficiency, especially in real-time application scenarios requiring consistent responsiveness. Notably, latency and accuracy appear to be influenced by distinct factors, with no direct correlation observed—latency does not determine accuracy, nor vice versa. Both metrics are independently affected by elements such as algorithmic optimizations, model-training complexity, or dataset characteristics.
To validate latency variations, the Kruskal–Wallis H test (non-parametric) was applied to latency data under different attack intensities. Results confirm significant differences in latency distribution across attack strengths ( p < 0.05 ). For accuracy stability, the 95% confidence interval was calculated as [ 0.59 , 0.64 ] , indicating high consistency within this range.
Trade-Offs Between Energy Consumption, Latency, and Cost
This section presents simulation results for system energy consumption, latency, and monetary cost; key relationships are visualized in Figure 18 (data points are highlighted by red circles to emphasize trade-offs). The latency fluctuations are caused by policy exploration, path switching, and model updates, and they stabilize as training progresses.
  • Power Consumption vs. Service Latency: Figure 18 reveals a negative correlation—as power consumption increases, service latency decreases. This indicates that the system can dynamically enhance performance (reduce latency) by increasing energy input, a critical capability for responding to dynamic crisis threats (e.g., sudden attack surges).
  • Power Consumption vs. Monetary Cost: A positive correlation is observed: higher power consumption leads to increased monetary cost. This is an expected outcome, as operational expenses (e.g., electricity bills) directly scale with energy usage in most systems.
  • Service Latency vs. Monetary Cost: A similar negative trend emerges—reduced service latency (i.e., improved performance) tends to correlate with lower monetary cost. This suggests that more efficient, low-latency systems may also be more cost-effective, potentially due to optimized resource utilization (e.g., reduced idle time for servers).
The data confirm a tripartite trade-off among power consumption, service latency, and monetary cost: systems with higher power consumption typically offer lower latency but incur higher costs. From an optimization perspective, system design must balance these three parameters to meet target performance requirements while controlling operational expenses. To quantify these correlations, a multiple regression analysis was performed on the collected data. The resulting R² value shows that power consumption and service latency together explain 78% of the variance in monetary cost, confirming a statistically significant relationship between these factors (p < 0.05).
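The reported R² can be obtained from an ordinary least-squares fit of monetary cost on power consumption and service latency. A minimal NumPy sketch, using placeholder records in place of the collected simulation data:

```python
import numpy as np

# Placeholder simulation records: power consumption (W), service latency (ms), monetary cost.
power   = np.array([10, 12, 15, 18, 20, 22, 25, 28])
latency = np.array([70, 64, 55, 47, 42, 39, 33, 30])
cost    = np.array([1.1, 1.3, 1.6, 1.9, 2.1, 2.3, 2.6, 2.9])

# Design matrix with an intercept column: cost ~ b0 + b1 * power + b2 * latency.
X = np.column_stack([np.ones_like(power, dtype=float), power, latency])
coeffs, *_ = np.linalg.lstsq(X, cost, rcond=None)

# Coefficient of determination: R^2 = 1 - SS_res / SS_tot.
predicted = X @ coeffs
ss_res = np.sum((cost - predicted) ** 2)
ss_tot = np.sum((cost - cost.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"coefficients = {coeffs}, R^2 = {r_squared:.2f}")
```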

4.2.6. DRLMDS Performance Advantages Under Malicious Attacks

Building on the preceding analysis of simulated-attack scenarios, experimental results confirm that DRLMDS exhibits significant performance improvements when confronting malicious attacks. Key advantages are demonstrated across three critical system metrics: task-response time, resource utilization, and system robustness.
Detailed Performance Improvements
  • Task-Response Time
    DRLMDS effectively mitigates the delay induced by attacks, enabling the system to process and respond to task requests more rapidly. Experimental data show an approximate 30% reduction in task-response time. This improvement directly enhances real-time responsiveness—a critical requirement for time-sensitive edge-computing applications.
  • Resource Utilization
    To optimize resource efficiency, DRLMDS adopts randomized and diversified scheduling strategies. These strategies minimize server idleness by dynamically matching resources to task demands, yielding an approximate 25% improvement in resource utilization. This outcome confirms DRLMDS’s ability to maximize the value of available computing resources even under attack-induced disruptions (a hypothetical sketch of such diversified selection follows this list).
  • System Robustness
    DRLMDS enhances system unpredictability through its adaptive scheduling mechanism, making it more difficult for attackers to identify vulnerabilities or disrupt operations. This design significantly boosts security and stability in edge-computing environments, with experimental results showing a 40% improvement in system stability under attack conditions.
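The randomized, diversified scheduling mentioned above can be illustrated with a small, hypothetical sketch. It is not the actual DRLMDS scheduler: the executor names, credibility weights, and selection rule are assumptions introduced here purely to show how credibility-weighted random selection keeps consecutive executor sets hard to predict.

```python
import random

# Hypothetical pool of heterogeneous executors; runtime diversity is what makes
# the resulting schedule hard for an attacker to predict.
executors = ["exec_linux_x86", "exec_linux_arm", "exec_bsd_x86", "exec_windows_x86"]

# Assumed per-executor credibility weights (e.g., derived from past adjudication results).
credibility = {"exec_linux_x86": 0.90, "exec_linux_arm": 0.80,
               "exec_bsd_x86": 0.85, "exec_windows_x86": 0.70}

def pick_redundant_executors(k=3):
    """Randomly draw k distinct executors, biased toward higher credibility,
    so consecutive scheduling rounds rarely reuse the same configuration."""
    pool = list(executors)
    chosen = []
    for _ in range(k):
        weights = [credibility[e] for e in pool]
        pick = random.choices(pool, weights=weights, k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    return chosen

# Each scheduling round draws a fresh, diversified executor set.
print(pick_redundant_executors())
```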
Statistical Validation of Improvements
To rigorously verify the significance of the observed performance gains, a comparative experiment was conducted against a baseline system without the mimic-defense mechanism. A paired-sample t-test was applied to evaluate the statistical significance of improvements in task-response time, resource utilization, and system robustness (see the sketch after this list). The results confirm the following:
  • Task-response time: Average reduction of 30%, with a p-value < 0.01, indicating a highly statistically significant improvement.
  • Resource utilization: Average increase of 25%, with the t-test confirming high statistical significance (p < 0.01).
  • System robustness: Approximately 40% improvement, also supported by a statistically significant result (p < 0.01).
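A minimal sketch of this paired comparison, assuming per-trial measurements of each metric are available for both the baseline and the DRLMDS-enabled system (the values below are placeholders):

```python
import numpy as np
from scipy import stats

# Placeholder per-trial measurements (baseline vs. DRLMDS under attack).
metrics = {
    "response_time_ms": (np.array([120, 131, 118, 127, 124, 129]),
                         np.array([ 84,  92,  81,  90,  86,  91])),
    "resource_util":    (np.array([0.55, 0.58, 0.52, 0.57, 0.54, 0.56]),
                         np.array([0.70, 0.72, 0.66, 0.71, 0.69, 0.70])),
    "stability_score":  (np.array([0.61, 0.64, 0.60, 0.63, 0.62, 0.60]),
                         np.array([0.86, 0.88, 0.83, 0.87, 0.85, 0.84])),
}

for name, (baseline, drlmds) in metrics.items():
    t_stat, p_value = stats.ttest_rel(baseline, drlmds)   # paired-sample t-test
    change = (drlmds.mean() - baseline.mean()) / baseline.mean() * 100
    print(f"{name}: mean change = {change:+.1f}%, p = {p_value:.4f}")
```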

4.2.7. Concluding Remarks

Rationale for Adopting the DQN Framework
This study selected the Deep Q-Network (DQN) framework as the core of its algorithm design for three key reasons (a minimal illustrative sketch of the underlying DQN update follows this list):
  • Its inherent capability to handle high-dimensional feature spaces, which is essential for capturing complex resource-scheduling dynamics in edge computing;
  • Its ability to learn adaptive scheduling policies through continuous interaction with the system environment, enabling real-time adjustment to changing task loads and attack scenarios;
  • Its strong compatibility with swarm-intelligence algorithms (e.g., GA, PSO), facilitating the development of hybrid approaches that combine the strengths of both paradigms.
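For context, the update rule these reasons refer to can be written compactly. The sketch below is a generic, minimal illustration of ε-greedy action selection and the temporal-difference target in standard DQN, instantiated with the hyperparameter values from Table 1; the state dimension, the toy linear Q-function, and the interpretation of ε are assumptions of this sketch, not the full scheduling agent.

```python
import numpy as np

rng = np.random.default_rng(0)

GAMMA, LEARNING_RATE, EPSILON = 0.8, 2e-4, 0.8   # values taken from Table 1
N_ACTIONS, STATE_DIM = 20, 8                     # assumed: 20 servers, 8 state features

# Toy linear stand-in for the deep Q-network: Q(s, .) = s @ weights.
weights = rng.normal(scale=0.01, size=(STATE_DIM, N_ACTIONS))

def select_action(state):
    """Epsilon-greedy selection; treating EPSILON as the exploitation probability
    (and omitting any decay schedule) is an assumption of this sketch."""
    if rng.random() > EPSILON:
        return int(rng.integers(N_ACTIONS))       # explore: random server
    return int(np.argmax(state @ weights))        # exploit: highest current Q-value

def td_update(state, action, reward, next_state):
    """One temporal-difference step toward the target r + gamma * max_a' Q(s', a')."""
    target = reward + GAMMA * np.max(next_state @ weights)
    td_error = target - (state @ weights)[action]
    weights[:, action] += LEARNING_RATE * td_error * state   # gradient step for the linear Q

# Single illustrative interaction on a made-up transition.
s, s_next = rng.random(STATE_DIM), rng.random(STATE_DIM)
a = select_action(s)
td_update(s, a, reward=1.0, next_state=s_next)
```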
Summary of Algorithmic Performance
Both experimental outcomes and statistical-significance analyses confirm that the DQN framework and its derivative algorithms (DQN-GA and DRLMDS) outperform traditional swarm-intelligence algorithms (e.g., PSO) in addressing edge-computing resource-scheduling challenges. Specifically, DQN-based methods demonstrate superior performance in:
  • Energy efficiency: reducing system energy consumption across varying server and task configurations;
  • Task-latency reduction: minimizing average task delay, particularly for large-scale and large-size tasks;
  • Solution quality: generating higher-quality Pareto-optimal solution sets (as validated by lower IGD values).
Significance of DRLMDS
The adoption of the DQN framework provides a solid technical foundation for the proposed DRLMDS algorithm. By integrating DQN-based initialization, layered learning for particle updates, and a mimic defense mechanism, DRLMDS achieves efficient and secure resource scheduling in complex edge computing environments—successfully balancing performance, resource efficiency, and resilience against malicious attacks. This makes DRLMDS a promising solution for real-world edge computing systems where dynamic task loads and security threats are prevalent.
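The adjudication step at the heart of the mimic defense component can be illustrated generically: redundant heterogeneous executors process the same input, and a result is accepted only when a majority of them agree. A minimal sketch (the function name and the string outputs are illustrative assumptions):

```python
from collections import Counter

def adjudicate(outputs):
    """Majority-vote adjudication over redundant executor outputs: a result is
    accepted only if more than half of the heterogeneous executors agree."""
    value, votes = Counter(outputs).most_common(1)[0]
    return value if votes > len(outputs) / 2 else None

# Hypothetical example: three heterogeneous executors, one of them compromised.
print(adjudicate(["schedule_A", "schedule_A", "schedule_B"]))   # -> schedule_A
print(adjudicate(["schedule_A", "schedule_B", "schedule_C"]))   # -> None (no consensus)
```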

5. Conclusions

To tackle the efficiency-security dilemma in edge computing resource scheduling, where traditional PSO algorithms fall short, this study pinpointed flaws in existing methods: poor adaptability, weak attack defense, and unbalanced objectives. We introduced the DRLMDS algorithm, combining Deep Reinforcement Learning (DRL) for adaptive optimization and mimic defense for attack resistance. Experimental comparisons with PSO and DQN-GA evaluated energy consumption, latency, solution quality, and robustness, validated by statistical tests. The results show that integrating DRL and mimic defense resolves the efficiency-security conflict: DRLMDS reduces task response time by 30%, boosts resource utilization by 25%, and improves system stability under attack by 40%.
Additionally, DQN-based frameworks outperform swarm intelligence in complex edge scenarios. While offering theoretical contributions, replicable methodologies, and practical value for industrial/IoT edge systems, the study has limitations, including simulated homogeneous environments and unmeasured overheads. Future research will focus on developing a lightweight DRLMDS, testing in real-world APT scenarios, improving scalability and fault tolerance, and integrating with federated learning or digital twins.

Author Contributions

Conceptualization, methodology, validation, and writing, X.L., S.Y. (Sen Yang), R.W. and L.O.; Conceptualization, Data curation, Formal analysis, X.H., S.Y. (Shengjie Yu) and J.M.; Supervision, funding acquisition, and review, S.L. and Y.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by a Science and Technology Project funded by State Grid Jiangsu Electric Power Company (Project No. J2024089).

Data Availability Statement

All datasets are available in the mentioned references.

Acknowledgments

The authors thank the above-mentioned project for its support.

Conflicts of Interest

Authors Xiaoyun Liao, Sen Yang, Lijian Ouyang and Rong Wu were employed by the company State Grid Jiangsu Electric Power Co., Ltd. The authors declare that this study received funding from the State Grid Jiangsu Electric Power Company Science and Technology Project (Project No. J2024089). The funder had the following involvement with the study: study design and the writing of this article.

References

  1. Carvalho, G.; Cabral, B.; Pereira, V.; Bernardino, J. Edge computing: Current trends, research challenges and future directions. Computing 2021, 103, 993–1023.
  2. Costa, B.; Bachiega, J., Jr.; de Carvalho, L.R.; Araujo, A.P.F. Orchestration in fog computing: A comprehensive survey. ACM Comput. Surv. 2022, 55, 1–34.
  3. Luo, Q.; Hu, S.; Li, C.; Li, G.; Shi, W. Resource scheduling in edge computing: A survey. IEEE Commun. Surv. Tutor. 2021, 23, 2131–2165.
  4. Chowdhury, A.; Raut, S.A.; Narman, H.S. DA-DRLS: Drift adaptive deep reinforcement learning based scheduling for IoT resource management. J. Netw. Comput. Appl. 2019, 138, 51–65.
  5. Wang, J.; Zhao, L.; Liu, J.; Kato, N. Smart resource allocation for mobile edge computing: A deep reinforcement learning approach. IEEE Trans. Emerg. Top. Comput. 2021, 9, 1529–1541.
  6. Hou, H.; Jawaddi, S.N.A.; Ismail, A. Energy efficient task scheduling based on deep reinforcement learning in cloud environment: A specialized review. Future Gener. Comput. Syst. 2024, 151, 214–231.
  7. Ning, Z.; Dong, P.; Wang, X.; Rodrigues, J.J.P.C. Deep reinforcement learning for vehicular edge computing: An intelligent offloading system. ACM Trans. Intell. Syst. Technol. 2019, 10, 1–24.
  8. Zhao, N.; Liang, Y.-C.; Niyato, D.; Pei, Y.; Wu, M.; Jiang, Y. Deep reinforcement learning for user association and resource allocation in heterogeneous cellular networks. IEEE Trans. Wirel. Commun. 2019, 18, 5141–5152.
  9. Zhu, J.; Ren, Z.; Zhang, G.; Zhang, Y. Research on electric power Internet of Things based on 5G edge computing. Electr. Technol. Econ. 2023, 32, 12–34.
  10. Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; Xu, L. Edge computing: Vision and challenges. IEEE Internet Things J. 2016, 3, 637–646.
  11. Bala, I.; Chauhan, D.; Mitchell, L. Orthogonally Initiated Particle Swarm Optimization with Advanced Mutation for Real-Parameter Optimization. In Proceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO ’24 Companion), Melbourne, VIC, Australia, 14–18 July 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 735–738.
  12. Shao, S.; Liu, S.; Li, K.; Zhou, S. LBA-EC: A load balancing algorithm based on weighted bipartite graph for edge computing. Chin. J. Electron. 2022, 31, 1–12.
  13. Wu, Y.-C.; Dinh, T.Q.; Fu, Y.; Dutkiewicz, E. A hybrid DQN and optimization approach for strategy and resource allocation in MEC networks. IEEE Trans. Wirel. Commun. 2021, 20, 4282–4295.
  14. Waqar, N.; Hassan, S.A.; Mahmood, A.; Abbas, H.; Shafiq, M. Computation offloading and resource allocation in MEC-enabled integrated aerial-terrestrial vehicular networks: A reinforcement learning approach. IEEE Trans. Intell. Transp. Syst. 2022, 23, 21478–21491.
  15. Yang, C.; Liao, F.; Lan, S.; Wang, L.; Shen, W.; Huang, G.Q. Flexible resource scheduling for software-defined cloud manufacturing with edge computing. Engineering 2023, 22, 60–70.
  16. Yang, J.; Lin, F.; Chakraborty, C.; Yu, K.; Guo, Z.; Nguyen, A.T.; Rodrigues, J.J.P.C. A parallel intelligence-driven resource scheduling scheme for digital twins-based intelligent vehicular systems. IEEE Trans. Intell. Veh. 2023, 8, 2770–2785.
  17. Nahum, C.V.; Lopes, V.H.L.; Dreifuerst, R.M.; Batista, P.; Correa, I.; Cardoso, K.V.; Klautau, A.; Heath, R.W. Intent-aware radio resource scheduling in a RAN slicing scenario using reinforcement learning. IEEE Trans. Wirel. Commun. 2024, 23, 2253–2267.
  18. Yang, D.; Zhang, W.; Ye, Q.; Zhang, C.; Zhang, N.; Huang, C.; Zhang, H.; Shen, X. DetFed: Dynamic resource scheduling for deterministic federated learning over time-sensitive networks. IEEE Trans. Mob. Comput. 2024, 23, 5162–5178.
  19. Guan, C.; Yuen, K.K.F.; Coenen, F. Particle swarm optimized density-based clustering and classification: Supervised and unsupervised learning approaches. Swarm Evol. Comput. 2019, 44, 876–896.
  20. Ding, P. On the conditional distribution of the multivariate t distribution. Am. Stat. 2016, 70, 293–295.
  21. Do-Duy, T.; Van Huynh, D.; Dobre, O.A.; Canberk, B.; Duong, T.Q. Digital twin-aided intelligent offloading with edge selection in mobile edge computing. IEEE Wirel. Commun. Lett. 2022, 11, 806–810.
  22. Shao, S.; Ji, Y.; Zhang, W.; Zhou, S. A DHR executor selection algorithm based on historical credibility and dissimilarity clustering. Sci. China Inf. Sci. 2023, 66, 212304.
  23. Wang, X.; Wang, S.; Liang, X.; Wang, X.; Song, X. Deep reinforcement learning: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 5064–5078.
  24. Panagant, N.; Pholdee, N.; Bureerat, S.; Yildiz, A.R.; Mirjalili, S. A comparative study of recent multi-objective metaheuristics for solving constrained truss optimisation problems. Arch. Comput. Methods Eng. 2021, 28, 1–17.
  25. Thabit, F.; Can, O.; Alhomdy, S.; Al-Gaphari, G.H.; Jagtap, S. A novel effective lightweight homomorphic cryptographic algorithm for data security in cloud computing. Int. J. Intell. Netw. 2022, 3, 16–30.
  26. Pallathadka, H.; Sajja, G.S.; Phasinam, K.; Ritonga, M.; Naved, M.; Bansal, R.; Quiñonez-Choquecota, J. An investigation of various applications and related challenges in cloud computing. Mater. Today Proc. 2022, 51, 2245–2248.
  27. Parcu, P.L.; Pisarkiewicz, A.R.; Carrozza, C.; Innocenti, N. The future of 5G and beyond: Leadership, deployment and European policies. Telecommun. Policy 2023, 47, 102622.
  28. Zhang, S.; Cui, G.; Long, Y.; Wang, W. Joint computing and communication resource allocation for satellite communication networks with edge computing. China Commun. 2021, 18, 236–252.
  29. She, C.; Sun, C.; Gu, Z.; Li, Y.; Yang, C.; Poor, H.V.; Vucetic, B. A tutorial on ultrareliable and low-latency communications in 6G: Integrating domain knowledge into deep learning. Proc. IEEE 2021, 109, 204–246.
  30. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
Figure 1. Limitations of Existing Methods in Attack Scenarios.
Figure 2. Overall Structure of DRLMDS.
Figure 3. DQN Algorithm Framework.
Figure 4. Network Parameter Update.
Figure 5. Actor-Critic Structure Diagram.
Figure 6. DDPG Algorithm Architecture.
Figure 7. State Transition.
Figure 8. Network Structure Diagram.
Figure 9. Algorithm Architecture Based on Deep Reinforcement Learning.
Figure 10. Particle Encoding.
Figure 11. Particle Sorting and Layering.
Figure 12. Scheduling Optimization Process Based on Improved Particle Swarm Algorithm.
Figure 13. Mimic Defense Module Architecture Diagram.
Figure 14. System Energy Consumption under Different Numbers of Servers for Various Algorithms. (a) Energy consumption (10 servers). (b) Energy consumption (20 servers). (c) Energy consumption (30 servers).
Figure 15. Average Task Latency under Different Task Scales for Various Algorithms. (a) Average latency for task sizes in the range of 50–100 KB. (b) Average latency for task sizes in the range of 100 KB–1 MB. (c) Average latency for task sizes in the range of 1–2 MB.
Figure 16. Solution quality of various algorithms with respect to the number of tasks.
Figure 17. Malicious Attack Scenario Experiment under the Mimic Defense Mechanism: Latency and Accuracy.
Figure 18. Relationship between System Energy Consumption, Delay, and Cost.
Table 1. Algorithm Parameter Settings.
Parameter Name | Symbol | Value
Number of Edge Servers | S | 20
Number of Tasks | T | [100, 200, 300, 400, 500]
Epsilon-Greedy Parameter | ε | 0.8
Discount Factor | γ | 0.8
Learning Rate | η | 0.0002
Number of Iterations | Iteration | [0, 300]
Learning Factor | c1 | 2
Learning Factor | c2 | 2
Inertia Weight | ω | 0.6
Population Size | N | [100, 200, 300, 400, 500]

Table 2. Model Parameter Settings.
Parameter Name | Symbol | Value
Channel Bandwidth | B | 10 MHz
Noise Power | σ² | −174 dBm/Hz
Data Transmission Power | P_i,j | 1 W
CPU Frequency | f | 10^9 cycles/s
Idle Power of Edge Server | P_L | 0.4 W
Maximum Allowable Delay | T_max | 4
Maximum Energy Consumption per Server | E_max | 20 J
