1. Introduction
Cloud-native environments and microservice architectures have transformed how modern services are built and operated [1,2]. By decomposing applications into independently deployable, loosely coupled services, organizations gain elasticity, rapid iteration, and fine-grained scaling [3]. This modularity, often managed by declarative orchestration and service meshes, provides robust mechanisms for service discovery and communication, enabling on-demand resource allocation and resilient recovery at scale [4,5]. The result is a platform that can adapt capacity to demand, isolate failures, and evolve services independently, improving developer velocity, operational efficiency, and overall service reliability.
However, the same flexibility introduces structural complexity and new attack surfaces [6]. Existing threats such as data breaches, insecure APIs, account hijacking, insider misuse, and Distributed Denial-of-Service (DDoS) attacks continue to endanger cloud-native environments, and they are intensified by the heterogeneous security configurations and architectural characteristics inherent to microservice-based systems [7]. Since microservices are interdependent along service call chains, localized congestion at a shared service node can propagate upstream and downstream, creating cascading slowdowns and widespread performance degradation [8,9]. Nodes dedicated to single, resource-intensive functions are particularly vulnerable to saturation. Adversaries can exploit these properties with low-rate, high-consumption requests that exhaust compute, memory, or I/O at bottleneck nodes or paths (as shown in Figure 1). This class of DDoS attacks, known as resource-exhaustion attacks [10,11], closely mimics legitimate traffic, is deeply coupled with microservice execution patterns, and exhibits strong stealthiness, detection difficulty, and adaptive strategies.
Traditional detection-based solutions, including signature-based [12] and anomaly-based [13] methods, attempt to distinguish legitimate users from attackers by learning behavioral patterns and traffic features. In resource-exhaustion attacks, however, malicious traffic closely mimics normal workloads, degrading detector generalization and causing both false positives and false negatives [14,15]. Moreover, conventional autoscaling is slow to react, may inflate costs, and provides insufficient coverage when attack traffic propagates along call chains: scaling a single hotspot leaves other services exposed and can simply shift bottlenecks to downstream or upstream nodes, leading to bottleneck migration rather than effective mitigation [16].
Proactive methods such as Moving Target Defense (MTD) [17,18] offer a promising direction by continually changing the attack surface (e.g., reconfiguring service instances, shifting routes, rotating resources, or diversifying placements). By introducing temporal and structural uncertainty into the attacker’s decision process, MTD techniques shift the defender-attacker asymmetry and raise the cost and difficulty of planning, probing, and sustaining a successful compromise [19]. However, existing MTD solutions in microservices typically lack principled, attack-aware policy guidance, incurring substantial overhead, resource waste, and degraded Quality of Service (QoS) [20]. This gap is exacerbated by sophisticated, spatiotemporally adaptive adversaries who rotate targets across service nodes and call chains, vary attack cadence and request links, and strategically manipulate system dynamics to keep the system in a persistent oscillation between waste and vulnerability [5,11].
To address these challenges, this paper proposes ScaleShield, an adaptive MTD framework that learns cost-aware defense policies for microservices under resource-exhaustion DDoS, where Deep Reinforcement Learning (DRL) is used to guide MTD methods by learning defense policies from environment-driven interactions and facilitating adaptation to novel adversarial attacks. The defense is formulated as a sequential decision process over a multi-dimensional discrete action space that captures practical MTD levers (e.g., scaling up and scaling down). The Multi-Dimensional Double Deep Q-Network (MD3QN) algorithm is utilized to efficiently learn coordinated action combinations that maximize service availability while minimizing defense cost and unnecessary churn. By shaping rewards to reflect availability, Service Level Agreement (SLA) compliance, and the operational cost of defenses, the proposed approach learns when to intervene and which MTD actions to deploy under dynamic, attack-aware conditions.
The main contributions of this paper are summarized as follows:
ScaleShield is presented as an MTD-based method tailored to resource-exhaustion DDoS in cloud-native microservices. It continuously reconfigures the system to dissipate bottlenecks and suppress effects along call chains under bounded overhead.
A collaborative scaling mechanism is developed that recasts container scaling as MTD actions and constructs per-service resource observation vectors, enabling distributed mitigation that absorbs attack pressure and prevents bottleneck propagation.
The attack–defense interaction is formulated as a Markov Decision Process (MDP) with constraints on security status and resource utilization. An MD3QN-based policy is employed to yield high-quality strategies in a large combinatorial action space.
An end-to-end experimental environment is built with realistic topologies and constraints. Results show that ScaleShield achieves near 100% defense success and maintains over 94% resource efficiency, significantly outperforming State-of-the-Art (SOTA) baselines. The implementation is available as open source at [21].
The remainder of this paper is organized as follows. Section 2 reviews the background and related work on DDoS defense, Moving Target Defense, and DRL-based security mechanisms. Section 3 presents the system and threat models and formulates the defense problem in cloud-native microservice environments. Section 4 describes the architecture and key components of ScaleShield, including its MTD mechanisms and integration with microservice observability; it also details the DRL-based policy learning framework and the design of the MD3QN algorithm for coordinating multi-dimensional defense actions. Section 5 reports on the experimental setup and evaluation results, comparing ScaleShield with state-of-the-art baselines. Finally, Section 6 discusses the proposed method and its limitations, and Section 7 concludes this paper.
2. Background and Related Work
This section first reviews DDoS detection and mitigation methods in cloud settings. Signature-based systems leverage protocol features and handcrafted fingerprints to flag known attacks [12], while anomaly-based approaches employ statistical modeling and machine learning to surface deviations from historical baselines [13,22]. For example, Abiramasundari et al. [23] developed a Principal Component Analysis (PCA)-enhanced framework that detects DDoS anomalies by modeling statistical deviations in network traffic, effectively improving classification performance across three benchmark datasets. However, resource-exhaustion attacks with low-rate requests closely mimic legitimate workloads, leading to poor generalization under workload shifts [24]. Moreover, recent studies have revealed the critical vulnerability of machine learning-based detection methods to evasion attacks [25], leading to substantial numbers of false negatives.
Mitigation techniques such as rate limiting [26], filtering [27], and traffic engineering [28] are widely deployed in clouds and Content Delivery Networks (CDNs); however, they are prone to rerouting pressure that causes bottleneck migration along service call chains in cloud-native systems [29]. Moreover, autoscaling [30,31,32] is often used as a first-line reactive control, but it reacts slowly under stealthy attacks and provides limited coverage when attack pressure spreads across multiple services and internal links [33,34]. Scaling a single hotspot may simply redirect load through call chains to unprotected components, re-manifesting bottlenecks downstream. These limitations motivate proactive and attack-aware orchestration that goes beyond detect-then-react at individual nodes.
To handle security threats in cloud-native networks, MTD seeks to disrupt attacker reconnaissance and exploitation by continuously reconfiguring the attack surface through IP/port rotation, routing diversification, configuration randomization, etc. [35,36]. In cloud-native systems, MTD can also be integrated with orchestrators and service meshes to adjust topology [37] and traffic links [38] at runtime. Empirical studies show that MTD can reduce dwell time and increase attacker uncertainty for volumetric or stationary threats [19].
However, heuristic or schedule-based actions in existing MTD works are frequently mistimed or overly frequent, incurring operational churn, resource waste, and QoS degradation [39]. Moreover, naive MTD can be outpaced or even exploited when defending against spatiotemporally adaptive adversaries that rotate targets across services and call-chain links (e.g., inducing sustained scale-out or re-attacking after scale-in). Existing approaches rarely optimize when and how to coordinate multiple defense levers under cost and stability constraints, leading to either insufficient protection or excessive overhead [40]. This gap points to the need for attack-aware, cost-sensitive MTD policies that jointly reason about cross-service coordination, timing, and action composition.
DRL has been explored for cyber defense tasks including intrusion response [41], traffic control [42], resource allocation [43], etc. Its sequential decision-making is well-suited to adversarial dynamics, enabling attack-aware policies that balance effectiveness and cost under partial observability and non-stationarity [44]. For example, Tellache et al. [45] proposed a multi-agent DRL-based Intrusion Detection System (IDS) that augments Deep Q-Network (DQN) with a weighted Mean Square Error (MSE) loss to cope with class imbalance and evolving attack patterns, demonstrating fine-grained detection with low false positives. Similarly, Shahab et al. [46] formulated cloud manufacturing as a DRL-driven service composition problem in stochastic settings, demonstrating superior capabilities in enhancing network resilience and addressing disruption scenarios.
However, most efforts assume low-dimensional action spaces or single-lever controls, limiting applicability to microservices where defenses span multiple discrete levers that must be coordinated jointly. These gaps motivate an approach that leverages multi-dimensional DRL to coordinate multiple MTD levers and incorporates cloud-native observability to mitigate resource-exhaustion attacks with distributed and cross-service policies.
3. System Model and Problem Formulation
This section defines the system and threat models considered in this study. First, the cloud-native microservice architecture is formalized; then resource-exhaustion attacks with different strategies are characterized. Finally, the defense task is formulated as a multi-objective optimization problem, and its characteristics are analyzed.
3.1. System Model
As a modern software paradigm, microservice architectures exhibit substantial structural complexity. The system model works along two dimensions: (i) the service-invocation topology, capturing inter-service call relationships, and (ii) the resource deployment model, characterizing capacities and placements.
3.1.1. Service-Invocation Topology
A cloud-native microservice system consists of a set of functionally independent services. Let $\mathcal{S} = \{s_1, s_2, \ldots, s_N\}$ denote the service set, where $N$ is the number of services. Over a given planning horizon, the composition of $\mathcal{S}$ is assumed fixed and changes only as business requirements evolve. Invocation dependencies among services are modeled as a directed edge set $\mathcal{E}$ as:
$$\mathcal{E} = \{(s_i, s_j) \mid s_i, s_j \in \mathcal{S},\ s_i \text{ invokes } s_j\},$$
where each invocation edge is an ordered pair of services. For example, an edge $(s_i, s_j) \in \mathcal{E}$ indicates that service $s_i$ invokes service $s_j$.
Services that provide Application Programming Interfaces (APIs) and directly process external user requests are termed entry nodes. A service path is then a complete invocation chain in the service-call graph from an entry node to a terminal node, implementing a distinct sub-function (e.g., an API call traversing an API service, business logic, and a cache or database). For notational convenience, this work labels each service path by its entry node, i.e., paths are identified by their corresponding entry services.
For an invocation edge $(s_i, s_j)$, $s_i$ is referred to as upstream of $s_j$ and $s_j$ as downstream of $s_i$. When request traffic enters the microservice cluster, it propagates along service-invocation links. For an upstream service node $s_i$, the volumes of incoming and outgoing traffic need not coincide. Depending on functional logic, $s_i$ may issue zero, one, or multiple downstream invocations, and the payloads of those invocations may differ from the incoming request. At an aggregate level, the expected proportion of traffic forwarded from $s_i$ to a downstream node $s_j$ is represented by the transition probability $p_{ij}$.
Let $P = [p_{ij}]_{N \times N}$ denote the traffic transition probability matrix aligned with the directed invocation graph, where $p_{ij} = 0$ whenever $(s_i, s_j) \notin \mathcal{E}$. Then the expected incoming traffic at a downstream node $s_j$, denoted $\lambda_j$, is obtained by aggregating contributions from all upstream nodes:
$$\lambda_j = \sum_{s_i \in U(s_j)} p_{ij}\,\lambda_i,$$
where $U(s_j)$ is the set of upstream nodes of $s_j$ and $\lambda_i$ is the traffic received at $s_i$.
The set of service nodes $\mathcal{S}$, the directed invocation edge set $\mathcal{E}$, and the transition probability matrix $P$ together define the topology of the microservice system. For clarity of exposition, this paper assumes that the microservice architecture follows the design principle of minimal functional decomposition, under which service-level dependencies do not form intrinsic cycles. Runtime behaviors such as retries and callbacks may induce cyclic patterns in the traffic flow, yet the functional dependencies between services remain inherently directional. Thus, the system is modeled as a Directed Acyclic Graph (DAG) such that $G = (\mathcal{S}, \mathcal{E}, P)$.
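As a concrete illustration, the traffic aggregation of Equation (2) can be computed in a single pass over the DAG in topological order. The topology, transition probabilities, and entry rate below are illustrative assumptions, not values from this paper; they show how fan-in lets a shared downstream node receive as much traffic as the entry itself.

```python
import numpy as np

# Toy call graph: s0 (entry) -> {s1, s2}; s1 -> s2 (s2 is a shared dependency).
# P[i][j] is the expected fraction of s_i's traffic forwarded to s_j.
P = np.array([
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
])

def propagate(P, entry_rate, topo_order):
    """Expected incoming traffic per node: lambda_j = sum_i p_ij * lambda_i."""
    lam = np.asarray(entry_rate, dtype=float).copy()
    for i in topo_order:                 # the DAG guarantees a topological order
        for j in range(len(lam)):
            lam[j] += P[i][j] * lam[i]   # forward s_i's accumulated traffic
    return lam

lam = propagate(P, entry_rate=[100.0, 0.0, 0.0], topo_order=[0, 1, 2])
# s2 receives 0.3*100 directly plus 1.0*70 via s1, i.e., the full entry rate.
```

A single topological sweep suffices precisely because the invocation graph is acyclic; with retries modeled as cycles, a fixed-point iteration would be needed instead.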
3.1.2. Resource Deployment Model
In a cloud-native microservice system, the fundamental functional unit is a service and the basic allocatable unit is a replica. A service may have multiple replicas, and each replica is an instantiated container (or pod) scheduled onto some host and provisioned with a fixed resource request vector. This model considers five allocatable resource types, including CPU, memory, network bandwidth, disk, and I/O, collectively denoted by $\mathcal{R}$.
Replicas of distinct services may request different resource quantities, whereas replicas of the same service are assumed homogeneous. The per-replica resource request vector of service $s_i$ is denoted by $\mathbf{c}_i = (c_i^r)_{r \in \mathcal{R}}$. The cluster consists of a finite host set $\mathcal{H} = \{h_1, \ldots, h_K\}$ with $K$ hosts. Similarly, each host $h_k$ exposes an available resource capacity vector $\mathbf{g}_k$. Since the attack-defense formulation focuses on aggregate resource availability at cluster scope, detailed replica placement policies are abstracted away.
During the actual operation of a microservice system, different service nodes consume resources with varying preferences due to their distinct functionalities. For instance, a computation-intensive service node consumes more CPU resources under similar traffic conditions compared to others, while a storage-intensive service consumes significant disk and I/O resources. Let $\mathbf{f}_i$ denote a vector of nonnegative consumption functions mapping traffic intensity to resource usage for service $s_i$:
$$\mathbf{f}_i(\lambda_i) = \big(f_i^r(\lambda_i)\big)_{r \in \mathcal{R}},$$
where $\lambda_i$ is the expected incoming request rate (cf. Equation (2)). Empirically, resource demand is positively correlated with $\lambda_i$, capturing higher utilization under increased request volume. Therefore, the instantaneous resource occupancy of service $s_i$ is denoted as $\mathbf{u}_i = \mathbf{f}_i(\lambda_i)$.
Finally, to monitor system health and detect emerging bottlenecks, the per-service observation metric set is defined as:
$$O_i = \{n_i, \mathbf{u}_i, d_i, v_i, e_i\},$$
where $n_i$ is the replica count, $\mathbf{u}_i$ is the resource occupancy vector, $d_i$ denotes the service latency, $v_i$ represents the transmission rate, and $e_i$ is the request failure rate. These observables form the state features consumed by the defense policy to assess saturation and impending bottlenecks across services and call-chain segments.
3.2. Threat Model
3.2.1. Resource-Exhaustion DDoS Attacks
In cloud-native microservices, many invocation links converge on shared dependencies. As requests propagate and fan in, traffic from one or more entry services can accumulate at a shared downstream node and exceed the original entry rate. This effect is referred to as traffic amplification along the call graph.
Services are more likely to saturate when they are inherently resource-intensive, highly sensitive in specific resource dimensions, or positioned as shared dependencies that experience pronounced amplification. Such services constitute bottleneck nodes and can give rise to bottleneck links when congestion propagates along adjacent links (Figure 1). For instance, a service $s_i$ implementing large file uploads may exhibit memory and network consumption components within the system-level mapping $\mathbf{f}_i$ (cf. Equation (3)), where the corresponding dimensions $f_i^{\mathrm{mem}}$ and $f_i^{\mathrm{net}}$ grow rapidly with request size or concurrency, increasing the probability of exhausting these resources and propagating congestion along downstream segments.
Adversaries targeting cloud-native microservices typically employ stealthy strategies with low-rate traffic, aiming to maximize system-wide disruption while remaining indistinguishable from benign workloads. Their objective is to degrade availability by exhausting critical resources (e.g., CPU, memory, I/O, bandwidth) or by inducing sustained congestion across multiple critical service paths. To amplify downstream impact, adversaries first conduct lightweight reconnaissance to infer salient aspects of the call topology and to locate bottleneck-prone services and links. They then inject low-rate, syntactically legitimate application requests along these links to trigger high-cost operations, effecting a resource-exhaustion attack. Therefore, a path containing bottleneck nodes under attack is termed a bottleneck path.
Attacks are modeled in discrete rounds of duration $\Delta t$, with time index set $\mathcal{T} = \{1, 2, \ldots, T\}$. Each attacked path is represented by its entry service. Let $\mathcal{B}(t)$ denote the set of compromised nodes in bottleneck paths at time $t$, and the attack traffic injected at entry service $s_e$ is $\lambda_e^{\mathrm{atk}}(t)$, where $\lambda_e^{\mathrm{atk}}(t) \geq 0$.
For a candidate bottleneck service $s_b$, consider all paths from an entry service $s_e$ to $s_b$; this set can be represented as $\Pi(s_e, s_b)$, and let $q_\pi = \prod_{(s_i, s_j) \in \pi} p_{ij}$ be the propagation probability of path $\pi$ induced by routing and fan-in along the call graph. It follows that:
$$\lambda_b^{\mathrm{atk}}(t) = \sum_{s_e} \sum_{\pi \in \Pi(s_e, s_b)} q_\pi\,\lambda_e^{\mathrm{atk}}(t), \qquad \mathbf{u}_b^{\mathrm{atk}}(t) = \mathbf{f}_b\big(\lambda_b^{\mathrm{atk}}(t)\big),$$
where $\lambda_b^{\mathrm{atk}}(t)$ is the cumulative attack traffic arriving at $s_b$, $\mathbf{u}_b^{\mathrm{atk}}(t)$ is its resource usage, and $\mathbf{f}_b$ is the service-specific consumption mapping as defined in Equation (3).
To model different attackers, this paper considers three kinds of attack strategies based on the variation in attack timing and target paths as follows:
Static Attack: The adversary selects a fixed bottleneck path and maintains a constant attack rate throughout the entire attack duration.
Periodic Attack: The adversary launches attacks at regular intervals while remaining inactive between attacks. This kind of attack is also known as a Yo-Yo attack.
Random Attack: The adversary randomly modifies attack targets while continuously issuing attacks to maximize disruption.
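The three adversary types above can be sketched as simple traffic schedules. The base rate, period, and target-rotation logic below are illustrative assumptions rather than parameters from this paper.

```python
import random

def attack_rate(strategy, t, base=50.0, period=10, rng=None):
    """Attack traffic injected at the chosen entry service in round t."""
    if strategy == "static":      # constant rate on a fixed bottleneck path
        return base
    if strategy == "periodic":    # Yo-Yo: bursts at regular intervals, idle between
        return base if t % period < period // 2 else 0.0
    if strategy == "random":      # continuous attack with randomized intensity
        rng = rng or random.Random(0)
        return base * rng.random()
    raise ValueError(f"unknown strategy: {strategy}")

def random_target(entries, rng):
    """The random attacker also rotates its target entry service each round."""
    return rng.choice(entries)
```

In a round-based simulation, a defender policy that reacts only to the current rate is easily whipsawed by the periodic schedule, which is exactly the scale-down-then-strike behavior the threat model describes.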
3.2.2. Threat Evaluation Metrics
To quantitatively assess the impact of resource-exhaustion attacks and guide the defense policy, a set of metrics is established to characterize the security posture and resource pressure of each service. These metrics allow the defender to gauge the severity of a threat and the remaining capacity of the system.
First, a resource threshold for any candidate bottleneck service $s_i$ at time $t$ is defined as $\boldsymbol{\theta}_i(t)$. This threshold represents the maximum permissible resource consumption before the service is considered to be at risk. It is formally defined as:
$$\boldsymbol{\theta}_i(t) = \alpha\, n_i(t)\, \mathbf{c}_i,$$
where $n_i(t)$ is the current replica count of the service, which is a defender-controlled variable, $\mathbf{c}_i$ is the per-replica resource request vector, representing the provisioned capacity of a single instance, and the parameter $\alpha \in (0, 1]$ is a discount factor.
Thus, whether a service is in a dangerous state can be formally determined using this threshold. A service $s_i$ is considered to be under threat if its instantaneous resource consumption vector, $\mathbf{u}_i(t)$, exceeds the threshold in any dimension. Then, a binary indicator function $D_i(t)$ can be captured as:
$$D_i(t) = \mathbb{1}\big[\exists\, r \in \mathcal{R}:\ u_i^r(t) > \theta_i^r(t)\big],$$
where $\mathbb{1}[\cdot]$ is the indicator function, and the inequality is interpreted element-wise. A value of $D_i(t) = 1$ signifies that at least one resource type has breached its limit, indicating an immediate threat.
While the binary danger state is useful for alerts, a more granular view of resource pressure is needed for fine-grained defensive actions. Therefore, for each resource type $r \in \mathcal{R}$, the idle rate is defined as:
$$\rho_i^r(t) = 1 - \frac{u_i^r(t)}{n_i(t)\, c_i^r}.$$
Based on this, an approximate count of idle replicas of a bottleneck service, $n_i^{\mathrm{idle}}(t)$, can be derived. It estimates how many replicas are effectively unburdened and ready to absorb additional load:
$$n_i^{\mathrm{idle}}(t) = \Big\lfloor n_i(t) \cdot \min_{r \in \mathcal{R}} \rho_i^r(t) \Big\rfloor,$$
where $n_i(t)$ is the current replica count of the service, and $\lfloor \cdot \rfloor$ denotes the floor function.
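Under one plausible reading of these metrics (a threshold equal to a discount factor times provisioned capacity, danger as a breach in any resource dimension, and idle replicas judged by the most utilized resource), the evaluation logic can be sketched as follows; all numbers are illustrative.

```python
import math
import numpy as np

def threshold(n_replicas, per_replica_cap, alpha=0.8):
    """Max permissible consumption: discount factor times provisioned capacity."""
    return alpha * n_replicas * np.asarray(per_replica_cap, dtype=float)

def in_danger(usage, theta):
    """Binary indicator: 1 if any resource dimension exceeds its threshold."""
    return int(np.any(np.asarray(usage, dtype=float) > theta))

def idle_replicas(n_replicas, usage, per_replica_cap):
    """Replicas effectively unburdened, limited by the scarcest resource."""
    total_cap = n_replicas * np.asarray(per_replica_cap, dtype=float)
    idle_rate = 1.0 - np.asarray(usage, dtype=float) / total_cap
    return math.floor(n_replicas * float(idle_rate.min()))

theta = threshold(n_replicas=4, per_replica_cap=[1.0, 2.0])  # [3.2, 6.4]
danger = in_danger([3.5, 4.0], theta)                        # first dim breached
spare = idle_replicas(4, [2.0, 4.0], [1.0, 2.0])             # 50% idle -> 2
```

Taking the minimum idle rate across resources keeps the estimate conservative: a replica counts as spare only if every resource dimension has headroom.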
3.3. Problem Formulation
Based on the above system model, two optimization problems concerning collaborative scaling under resource-exhaustion DDoS attacks are considered: (i) at time $t$, given per-service observations and thresholds $\boldsymbol{\theta}_i(t)$, the objective is to determine the optimal scaling action that minimizes the number of compromised nodes; (ii) over a horizon with adversarial load, the objective is to determine a scaling sequence that rapidly mitigates risk, preserves user experience, and minimizes resource waste.
Specifically, the defense planning is formulated as a multi-objective program as follows:
The constraints ensure that the scaling decisions are feasible and maintain system stability, with the following practical meanings: (i) Constraint C1 is the total resource capacity constraint, ensuring that the aggregated resource allocation for all services at any time $t$ does not exceed the total available physical resources or a predefined operational limit. (ii) Constraint C2 represents a QoS constraint, stipulating that the end-to-end latency for any critical service path must remain below a predefined threshold to guarantee user experience. (iii) Constraint C3 is another constraint for maintaining service reliability, requiring that the request failure rate for any service must be kept below an acceptable upper bound. (iv) Finally, constraint C4 simply defines the finite time horizon $\mathcal{T}$ for the optimization problem, ensuring that defense planning occurs within a specific operational window.
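A minimal feasibility check mirroring constraints C1-C3 might look as follows; the bound names and numeric values are assumptions for illustration, not values from the paper.

```python
def feasible(alloc, capacity, path_latency, d_max, fail_rate, e_max):
    """Check a candidate scaling decision against constraints C1-C3.

    alloc / capacity: aggregated per-resource allocation vs. cluster limit (C1)
    path_latency:     end-to-end latency of each critical service path (C2)
    fail_rate:        per-service request failure rate (C3)
    """
    c1 = all(a <= c for a, c in zip(alloc, capacity))
    c2 = all(d <= d_max for d in path_latency)
    c3 = all(e <= e_max for e in fail_rate)
    return c1 and c2 and c3

ok = feasible(alloc=[12.0, 20.0], capacity=[16.0, 32.0],
              path_latency=[180.0, 240.0], d_max=300.0,
              fail_rate=[0.01, 0.02], e_max=0.05)
```

In the DRL formulation that follows, such constraint violations are not enforced as hard filters but discouraged through the reward's penalty terms.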
4. Architectural Design of ScaleShield
4.1. Framework Overview
To effectively mitigate resource-exhaustion DDoS attacks, this paper proposes ScaleShield, an adaptive multi-resource collaborative scaling system based on MTD and MD3QN. As shown in Figure 2, the system includes the following modules, where stacked blocks (such as multiple services, configurations, or replicas) visually indicate collections of homogeneous instances or records rather than overlapped or hidden information.
Resource Monitoring. This module is responsible for real-time state perception. It continuously collects fine-grained metrics from the underlying cloud-native environment using tools like Prometheus and Jaeger. The collected data includes node-level as well as link-level metrics. These raw observations form the basis for constructing the state representation for the DRL agent.
State Modeling. To provide the DRL agent with a comprehensive and structured view of the system, this module aggregates the per-service observations into a global state matrix $\mathbf{S}_t$, computed by stacking the observation vectors of all services. It embeds both the current health status of individual services and their inter-dependencies, enabling the agent to reason about cascading effects and bottleneck propagation.
Adaptive Decision-Making. It models the sequential attack-defense interaction as an MDP, and employs an MD3QN algorithm to navigate the complex, multi-dimensional action space corresponding to coordinated scaling decisions across multiple services. The agent learns to select actions that minimize the long-term cost, effectively balancing the conflicting objectives of mitigating threats.
Scaling Configuration. This module translates the agent’s high-level strategic decision into a concrete MTD tactic. It interprets the multi-dimensional discrete action selected by the MD3QN agent and maps it to specific scaling parameters for each targeted service. This step ensures that the agent’s abstract intent to “change the attack surface” is transformed into an executable reconfiguration plan, forming the core of the adaptive defense maneuver.
Configuration Deployment. It takes the generated scaling configurations and enforces them through the cloud-native orchestration layer. This involves dynamically adjusting the number of replicas for specific services, thereby altering the attack surface and re-allocating resources to withstand the attack pressure.
4.2. Collaborative Multi-Resource Scaling Mechanism
In the context of microservice security, dynamic resource scaling can be conceptualized as an MTD mechanism. By adaptively adjusting the number of replicas for different services, it continuously reconfigures the system’s resource landscape, thereby altering the attack surface. This proactive reconfiguration introduces uncertainty for the adversary, disrupting their ability to sustain resource-exhaustion attacks on predictable bottlenecks.
However, traditional autoscalers (e.g., Kubernetes Horizontal Pod Autoscaler) are insufficient as an MTD strategy. They rely on static, per-service policies driven by fixed thresholds and are primarily tuned for benign demand surges. Under intelligent, low-rate DDoS adversaries, such reactive schemes are brittle and predictable. For example, sophisticated attackers can exploit periodic traffic peaks to induce the system to scale down before launching attacks, keeping it perpetually vulnerable. Furthermore, post-hoc scaling at an observed hotspot often induces bottleneck migration to upstream or downstream dependencies, merely shifting the problem rather than solving it and increasing mitigation latency.
To overcome these limitations, a collaborative and adaptive scaling mechanism is proposed to embody the principles of intelligent MTD. Instead of reacting to isolated triggers, the proposed mechanism leverages DRL to learn a proactive policy that jointly reasons over the call-graph structure, traffic propagation, and resource headroom. This allows it to preemptively reconfigure the attack surface by coordinating adjustments across multiple services, thereby dissipating emerging bottlenecks, suppressing cascade effects, and raising the cost for the attacker to maintain a successful compromise.
To capture the temporal evolution of service conditions, the observation metric set in Equation (4) is extended with an explicit time index $t$. For each service $s_i$ at time $t$, the observation vector is denoted by $\mathbf{o}_i(t)$ and is given by:
$$\mathbf{o}_i(t) = \big(n_i(t), \mathbf{u}_i(t), d_i(t), v_i(t), e_i(t)\big),$$
where $n_i(t)$ is the replica count of service $s_i$ at time $t$, $\mathbf{u}_i(t)$ is its resource utilization vector, $d_i(t)$ denotes the service latency, $v_i(t)$ represents the traffic rate, and $e_i(t)$ is the request failure rate.
For notational convenience, let $o_i^m(t)$ denote the $m$-th scalar component of $\mathbf{o}_i(t)$:
$$o_i^m(t) = \big[\mathbf{o}_i(t)\big]_m, \quad m = 1, \ldots, M,$$
where $m$ indexes the metric type (e.g., replica count, CPU utilization, latency). Each metric $o_i^m(t)$ is governed by an operational interval that defines its acceptable range, formally expressed as:
$$\beta_m^{\min} \leq o_i^m(t) \leq \beta_m^{\max},$$
where $\beta_m^{\min}$ and $\beta_m^{\max}$ are the predefined lower and upper bounds for the $m$-th metric type, respectively. The time-indexed observations $\mathbf{o}_i(t)$ are aggregated into the global system state and are subsequently used to drive coordinated scaling decisions in the following modules.
After determining the scaling boundary, it is necessary to independently decide for each service node whether to scale and by what factor. For each service node $s_i$, a scaling factor $k_i(t)$ is computed. If any metric component exceeds its upper threshold, a scale-up action is triggered, and the specific scale-up factor is calculated as the maximum ratio of the observed metrics to their thresholds, $\max_m o_i^m(t)/\beta_m^{\max}$, to accommodate the most constrained resource. Conversely, if all components fall below their lower thresholds, a scale-down action is taken, with the factor being the minimum ratio, $\min_m o_i^m(t)/\beta_m^{\min}$, to ensure a conservative reduction. Otherwise, no action is taken, and the scaling factor is simply 1.
Therefore, the aforementioned rules on scaling actions can be formally expressed as:
$$k_i(t) = \begin{cases} \max_m \dfrac{o_i^m(t)}{\beta_m^{\max}}, & \text{if } \exists\, m:\ o_i^m(t) > \beta_m^{\max}, \\[4pt] \min_m \dfrac{o_i^m(t)}{\beta_m^{\min}}, & \text{if } \forall\, m:\ o_i^m(t) < \beta_m^{\min}, \\[4pt] 1, & \text{otherwise.} \end{cases}$$
Based on the specific scaling action with its scaling factor, the new replica count for the next time step, $n_i(t+1)$, can be calculated as follows:
$$n_i(t+1) = \begin{cases} \big\lceil k_i(t)\, n_i(t) \big\rceil, & \text{if } k_i(t) > 1, \\[2pt] \max\big(1, \big\lfloor k_i(t)\, n_i(t) \big\rfloor\big), & \text{if } k_i(t) < 1, \\[2pt] n_i(t), & \text{otherwise,} \end{cases}$$
where the ceiling function ($\lceil \cdot \rceil$) is used for scaling up to ensure sufficient capacity is added, while the floor function ($\lfloor \cdot \rfloor$) is used for scaling down, with at least one replica kept active. This process yields the precise scaling configuration for each service node, which is then passed to the deployment module.
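The piecewise scaling rule and the replica update described above can be sketched directly; the metric bounds and observed values below are illustrative assumptions.

```python
import math

def scaling_factor(obs, lower, upper):
    """Max breach ratio if any metric exceeds its upper bound; min ratio if
    all metrics sit below their lower bounds; otherwise no scaling (1)."""
    if any(o > u for o, u in zip(obs, upper)):
        return max(o / u for o, u in zip(obs, upper))
    if all(o < l for o, l in zip(obs, lower)):
        return min(o / l for o, l in zip(obs, lower))
    return 1.0

def next_replicas(n, k):
    """Ceil on scale-up for headroom; floor (never below 1) on scale-down."""
    if k > 1.0:
        return math.ceil(k * n)
    if k < 1.0:
        return max(1, math.floor(k * n))
    return n

k_up = scaling_factor(obs=[0.9, 1.5], lower=[0.2, 0.3], upper=[1.0, 1.0])
k_dn = scaling_factor(obs=[0.1, 0.15], lower=[0.2, 0.3], upper=[1.0, 1.0])
```

Using the maximum breach ratio for scale-up sizes the replica set to the most constrained metric, while the minimum ratio for scale-down releases capacity only as far as the least idle metric allows.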
4.3. DRL-Based Adaptive MTD Decision-Making
Note that the multi-objective optimization problem (P1) is computationally intractable to solve using traditional optimization methods. This is because obtaining accurate, real-time knowledge of the complex system dynamics and the adversary’s adaptive behavior is impractical in real-world scenarios. Therefore, this study explores a DRL approach, leveraging the powerful function approximation capabilities of neural networks to learn an effective defense policy directly from interaction with the environment.
To formalize this sequential decision-making problem, this paper adopts an MDP framework. An MDP is particularly suitable for the microservice defense scenario because: (i) the system’s evolution depends on the current state and the defense action taken; (ii) the defender must make a sequence of coordinated decisions over time to counter an ongoing attack; and (iii) the objectives involve optimizing long-term rewards that balance immediate security with cumulative operational costs. This formulation can be completely described through its state space, action space, transition probabilities, and reward function as follows.
4.3.1. State Space
The state $s_t$ represents a snapshot of the entire system at time $t$. It is constructed as a matrix $\mathbf{S}_t \in \mathbb{R}^{N \times M}$, where $N$ is the number of services and $M$ is the number of features per service. Each row of the matrix corresponds to the observation vector of a single service, and the global state matrix $\mathbf{S}_t$ is then formed by stacking these individual observation vectors as:
$$\mathbf{S}_t = \big[\mathbf{o}_1(t);\ \mathbf{o}_2(t);\ \ldots;\ \mathbf{o}_N(t)\big],$$
where the matrix representation preserves the structural information of the system, allowing the DRL agent to process observations from all services simultaneously. Before being fed into the neural network, this matrix is typically flattened into a single vector.
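The state construction amounts to stacking per-service observation rows and flattening the result before it reaches the Q-network; the feature values below are illustrative assumptions.

```python
import numpy as np

# One row per service (N=3), with M=5 features per row:
# [replicas, utilization, latency(ms), traffic rate(req/s), failure rate].
S_t = np.array([
    [4, 0.55, 120.0, 300.0, 0.01],
    [2, 0.90, 450.0, 210.0, 0.08],   # a saturating shared dependency
    [6, 0.30,  80.0, 500.0, 0.00],
])
N, M = S_t.shape
state_vec = S_t.flatten()            # 1-D vector of length N*M for the network
```

Row order is fixed by the service index, so the flattened vector keeps a consistent feature layout across time steps, which the Q-network relies on.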
4.3.2. Action Space
To enable coordinated defense across multiple services, the action space is designed to be multi-discrete, where each dimension corresponds to a specific service in the system. At each time step $t$, the agent selects a joint action vector $\mathbf{a}_t$. This vector is composed of $N$ individual actions, one for each of the $N$ services:
$$\mathbf{a}_t = \big(a_1(t), a_2(t), \ldots, a_N(t)\big), \quad a_i(t) \in \{-1, 0, +1\},$$
where the component $a_i(t)$ represents the scaling decision for a single service $s_i$, which has been defined in Equation (15). For each service, the agent can choose one of three discrete actions: scale-up, scale-down, or no action. This design ensures that while the agent learns a single, unified policy, its decision is expressed as a set of concurrent actions across all services, thus resolving the challenge of coordinated defense.
4.3.3. Transition Probability
The state transition function $P(s_{t+1} \mid s_t, a_t)$ defines the probability of transitioning to state $s_{t+1}$ given the current state $s_t$ and the joint action $a_t$. In the simulated cloud-native environment, the system dynamics, including traffic propagation, resource contention, and the effects of scaling actions, are deterministic. This means that for any given state and action pair, the next state is uniquely determined by the environment's rules. Formally, $P(s_{t+1} \mid s_t, a_t) = 1$ for the uniquely determined successor state and $0$ for all other states.
For instance, if the defender takes no action (i.e., $a_t^i = 0$ for all $i$) and the attacker continues with the same malicious requests, the system's next state is a direct and predictable consequence of the ongoing resource consumption and traffic propagation from state $s_t$. Conversely, if the defender initiates a scale-up action for a specific service $i$ (i.e., $a_t^i = +1$), the replica count for that service will increase. This change, combined with the existing traffic, deterministically alters the resource utilization and latency metrics in the next state, resulting in a predictable state transition with $P(s' \mid s_t, a_t) = 1$ for a specific next state $s'$.
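Determinism means the simulator's step function is an ordinary function of state and action. The toy dynamics below (each replica absorbs an equal share of the load) are an assumption for illustration, not the paper's actual traffic model; the point is only that repeated calls with the same inputs yield the same next state.

```python
# Toy deterministic transition: apply scaling, then recompute utilization.
# The load-sharing rule is a made-up stand-in for the simulator's dynamics.
def step(replicas, load, action):
    """One transition: action entries are -1 (scale-down), 0, or +1 (scale-up)."""
    new_replicas = [max(1, r + a) for r, a in zip(replicas, action)]
    utilization = [l / r for l, r in zip(load, new_replicas)]
    return new_replicas, utilization

s1 = step([2, 2], [4.0, 6.0], [+1, 0])
s2 = step([2, 2], [4.0, 6.0], [+1, 0])
assert s1 == s2  # same (s, a) always yields the same s': P(s' | s, a) = 1
```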
4.3.4. Reward Function
The reward function $R(s_t, a_t)$ is designed to guide the DRL agent towards achieving the multi-objective goals defined in P1. It quantifies the quality of an action $a_t$ taken in state $s_t$ by balancing attack mitigation effectiveness against operational costs. The total reward at each time step $t$ is a weighted sum of several components:
$R_t = \omega_1 R_{\text{sec}} - \omega_2 P_{\text{dan}} - \omega_3 P_{\text{fail}} - \omega_4 P_{\text{idle}} - \omega_5 P_{\text{act}}$,
where $\omega_1, \ldots, \omega_5$ are non-negative weight coefficients calibrated through experiments.
The components are designed to provide a comprehensive feedback signal. The primary positive component is the security reward $R_{\text{sec}}$, which incentivizes maintaining system health and achieving the defense objective efficiently. It is composed of a per-step reward for keeping services safe and a terminal reward for successfully ending an episode: an indicator for the terminal step gates a bonus for successful defense together with a shaping reward that encourages faster mitigation, while the per-step component depends on $d_t^i$, the binary indicator for service $i$ being in a danger state, as defined in Equation (8). Conversely, several penalty terms are introduced to penalize undesirable outcomes. The danger penalty $P_{\text{dan}} = \sum_{i=1}^{N} d_t^i$ counts the services currently in a danger state. The failure penalty $P_{\text{fail}}$ reflects the impact on user experience by summing the failure rates at compromised services. To discourage resource waste, the idle resource penalty $P_{\text{idle}}$ is calculated by summing the idle replica counts across all services. Finally, the action penalty $P_{\text{act}} = \sum_{i=1}^{N} v_t^i$, where $v_t^i$ is a binary indicator that equals 1 if the action on service $i$ is invalid and 0 otherwise, encourages the agent to learn physically feasible and valid state-action mappings.
By combining these components, the reward function provides a comprehensive signal that incentivizes the agent to find a dynamic balance between robust security, efficient resource utilization, and a high-quality user experience.
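The weighted combination can be sketched as a plain function. The weight values and component magnitudes below are invented for the example; only the structure (one security reward minus four penalties, as described above) follows the text.

```python
# Sketch of the weighted reward. Weights here are arbitrary placeholders,
# not the calibrated coefficients from the paper's experiments.
def total_reward(r_sec, n_danger, fail_rate_sum, idle_replicas, n_invalid,
                 w=(1.0, 0.5, 0.5, 0.1, 0.2)):
    w_sec, w_dan, w_fail, w_idle, w_act = w
    return (w_sec * r_sec
            - w_dan * n_danger         # danger penalty: services in danger state
            - w_fail * fail_rate_sum   # failure penalty: summed failure rates
            - w_idle * idle_replicas   # idle-resource penalty: wasted replicas
            - w_act * n_invalid)       # action penalty: infeasible actions

r = total_reward(r_sec=10.0, n_danger=2, fail_rate_sum=0.3,
                 idle_replicas=4, n_invalid=1)  # 10 - 1 - 0.15 - 0.4 - 0.2
```

Because every penalty enters with its own weight, the trade-off between security, user experience, and resource cost can be tuned independently.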
4.3.5. MD3QN Algorithm for Policy Optimization
To solve the formulated MDP, this paper employs the MD3QN algorithm, which is specifically designed for environments with multi-dimensional discrete action spaces. MD3QN extends the principles of Double DQN by maintaining a separate Q-value estimate for each of the N action dimensions (i.e., for each service). This factorized approach allows the agent to learn a coordinated policy across all services simultaneously. The agent’s learning process is driven by several key mechanisms detailed below.
At its core, the MD3QN agent uses a set of $N$ parallel Q-networks $Q_i(s, a^i; \theta_i)$, known as the online networks, to approximate the action-value function for each service. To balance exploration of new actions with exploitation of known good actions, an $\epsilon$-greedy strategy is used for action selection. To stabilize the learning process and break temporal correlations in the observed experiences, the agent utilizes an experience replay buffer $\mathcal{D}$, which stores past transitions $(s_t, a_t, r_t, s_{t+1})$.
During training, mini-batches of transitions are sampled from $\mathcal{D}$ to update the network weights. To mitigate the overestimation bias inherent in standard Q-learning, MD3QN adopts the Double DQN update rule. This involves a second set of networks $Q_i(s, a^i; \theta_i^-)$, called the target networks. The core idea is to decouple action selection from action evaluation. For each sample in a mini-batch, the process is as follows:
First, the optimal next action for each dimension $i$ is selected using the online network. This step identifies which action the current policy believes to be best for the next state $s_{t+1}$:
$a_i^{*} = \arg\max_{a^i} Q_i(s_{t+1}, a^i; \theta_i)$.
Second, the value of this chosen action $a_i^{*}$ is evaluated using the more stable target network. This yields the estimated value of the next state:
$V_i(s_{t+1}) = Q_i(s_{t+1}, a_i^{*}; \theta_i^-)$.
Finally, this value is used to construct the complete Temporal-Difference (TD) target $y_i$ by combining the immediate reward $r_t$ with the discounted future reward. The future-value term is set to zero if the episode has terminated:
$y_i = r_t + \gamma \, V_i(s_{t+1}) \, (1 - \mathbb{1}_{\text{done}})$,
where $\gamma$ is the discount factor that balances immediate and future rewards.
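The select-then-evaluate decoupling can be sketched for a single action dimension. The Q-value arrays below are toy stand-ins for network outputs; the selection/evaluation split is the Double DQN rule itself.

```python
import numpy as np

# Toy stand-ins for the two networks' outputs on the next state s_{t+1}.
q_online_next = np.array([1.0, 3.0, 2.0])   # Q_i(s', . ; theta_i)
q_target_next = np.array([0.8, 2.5, 2.9])   # Q_i(s', . ; theta_i^-)
reward, gamma, done = 1.0, 0.99, False

# Double DQN: the ONLINE net selects the action ...
a_star = int(np.argmax(q_online_next))      # index 1 here

# ... and the TARGET net evaluates it (note: not the target net's own max).
next_value = q_target_next[a_star]

# TD target, with the bootstrap term zeroed at episode termination.
td_target = reward + (0.0 if done else gamma * next_value)
```

A plain DQN would instead use `q_target_next.max()` (2.9 here), illustrating the overestimation that the decoupling avoids.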
The online networks are then updated by minimizing the mean squared error (MSE) between the predicted Q-values and the computed TD targets. The total loss function is averaged across all $N$ dimensions and all $B$ samples in the mini-batch:
$\mathcal{L}(\theta) = \frac{1}{NB} \sum_{i=1}^{N} \sum_{b=1}^{B} \left( y_{i,b} - Q_i(s_b, a_b^i; \theta_i) \right)^2.$
The target network parameters are not trained directly via backpropagation but are updated slowly towards the online network parameters using a soft update rule, $\theta_i^- \leftarrow \tau \theta_i + (1 - \tau)\theta_i^-$, where $\tau$ is the update rate. This ensures that the TD targets remain stable over time, further improving learning stability.
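The soft update is a one-line exponential blend; parameters are represented here as plain arrays rather than network weights for illustration.

```python
import numpy as np

# Soft target update: theta^- <- tau * theta + (1 - tau) * theta^-.
# With small tau, the target drifts slowly toward the online parameters.
def soft_update(theta_target, theta_online, tau=0.01):
    return (1.0 - tau) * theta_target + tau * theta_online

theta_t = np.zeros(4)   # toy target parameters
theta_o = np.ones(4)    # toy online parameters
for _ in range(3):      # each step closes half the gap when tau = 0.5
    theta_t = soft_update(theta_t, theta_o, tau=0.5)
```

In practice $\tau$ is small (e.g., on the order of 0.01 or less), so the TD targets change slowly even as the online networks are updated every step.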
The entire training process, detailed in Algorithm 1, yields the optimized parameters $\theta^{*}$ for the online Q-networks. These parameters implicitly define the learned defense policy $\pi^{*}$. After training, this optimal policy is executed by acting greedily with respect to the learned Q-values. Specifically, for any given state $s_t$, the agent constructs the joint action $a_t$ by independently selecting, for each service $i$, the action that maximizes its corresponding Q-value $Q_i(s_t, a^i; \theta_i^{*})$.
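Greedy execution reduces to a per-row argmax over the per-service Q-heads. The Q-value table below is a made-up stand-in for the trained networks' outputs on a given state.

```python
import numpy as np

# Toy Q-values: one row per service, one column per discrete action.
# In deployment these rows would come from the N trained Q-heads.
q_values = np.array([
    [0.1, 0.9, 0.2],   # service 0 -> action 1 is best
    [0.7, 0.3, 0.1],   # service 1 -> action 0 is best
    [0.2, 0.1, 0.8],   # service 2 -> action 2 is best
])

# Independent per-service argmax yields the joint greedy action.
joint_action = q_values.argmax(axis=1)
```

Because each head is maximized independently, the joint action is assembled in $O(N)$ argmax operations rather than searching the $3^N$ combinatorial action space.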
Algorithm 1: MD3QN for Coordinated Scaling Policy Optimization
5. Performance Evaluation
This section presents a series of experiments to evaluate the performance of the proposed framework, ScaleShield. The experimental setup is first detailed, including the simulation environment, baseline methods, and evaluation metrics. Subsequently, a comprehensive analysis of the results is provided, demonstrating ScaleShield’s effectiveness in mitigating various resource-exhaustion DDoS attacks and its superiority over state-of-the-art approaches.
5.1. Experimental Setup
5.1.1. Simulation Environment
To emulate a realistic microservice environment, this paper constructs a heterogeneous testbed consisting of a physical host and five virtual machines (VMs). The physical host is equipped with a 24-core CPU, 128 GB memory, 2 TB storage, and an NVIDIA RTX 3090Ti GPU. Five homogeneous VMs, each configured with a 4-core CPU, 8 GB memory, and 40 GB storage, are deployed on the host. The physical host executes the DRL-based decision module, while the VM cluster runs the microservice system. Both components communicate through a dedicated network.
The software environment is built on Docker containers managed by Kubernetes and integrates the Bench open-source platform [47]. Prometheus and Jaeger are incorporated to support resource monitoring, service topology construction, and distributed tracing. The SocialNetwork workload from DeathStarBench [48] is employed as the microservice topology, comprising one gateway node and 25 service nodes. Six nodes are randomly assigned custom resource pressure functions to serve as critical bottlenecks and simulated attack targets.
The experiment simulates three distinct attack strategies as defined in Section 3.2.1, and the interactions operate in intervals of one minute. The attacker issues resource-exhaustion HTTP requests to the gateway node, while the defense system adjusts scaling actions in real time based on monitoring indicators. Each episode contains at most 50 time steps and terminates early if the system maintains a secure state for 10 consecutive steps. The reward weight coefficients $\omega_1, \omega_2, \omega_3, \omega_4$, and $\omega_5$ are determined by balancing security, efficiency, and energy-consumption objectives during experimentation; after normalizing each reward component, they remain stable across different system topologies. Hyperparameter configurations are summarized in Table 1.
5.1.2. Baseline Methods
To evaluate the performance of ScaleShield, this paper compares it against four representative baseline methods. These baselines were carefully selected to span the spectrum of existing autoscaling and defense strategies, ranging from the industry-standard reactive approach to state-of-the-art learning-based techniques.
HPA. This represents the standard Kubernetes Horizontal Pod Autoscaler (HPA), which reacts to CPU or memory utilization exceeding a predefined static threshold for individual services [
33].
Heuristic. A heuristic autoscaling strategy that combines a set of empirical thresholds with short-window load estimation to reduce false scaling decisions [
32].
HPA+. An enhanced version of HPA that incorporates LSTM-based workload prediction with traditional Kubernetes metrics to achieve more accurate autoscaling [
9].
DeepScaling. A state-of-the-art DRL-based autoscaling method that couples service-level resource estimation with a single-agent DQN-driven scaling policy [
34].
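The reactive baseline (HPA) follows the standard Kubernetes scaling rule, which the sketch below simplifies by omitting the tolerance band and stabilization window that the real controller also applies.

```python
import math

# Simplified Kubernetes HPA rule:
# desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric).
# Real HPA additionally applies a tolerance band and stabilization window,
# which are omitted here for clarity.
def hpa_desired_replicas(current_replicas: int,
                         current_util: float,
                         target_util: float) -> int:
    return math.ceil(current_replicas * current_util / target_util)

# 4 replicas at 90% utilization against a 50% target -> scale to 8.
n = hpa_desired_replicas(current_replicas=4, current_util=0.9, target_util=0.5)
```

Because the rule reacts only to the current per-service metric, it cannot anticipate attack waves or coordinate scaling across a call chain, which is the gap the learning-based methods target.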
5.1.3. Evaluation Metrics
This section evaluates the performance of all methods across five key dimensions to provide a holistic assessment of their effectiveness, efficiency, and cost.
First, it measures the Number of Compromised Nodes at each time step $t$, denoted as $C_t$. This metric reflects the immediate impact of an attack and is calculated by summing the danger-state indicators (cf. Equation (8)) across all services: $C_t = \sum_{i=1}^{N} d_t^i$. A lower value indicates more effective containment of the attack's impact.
Second, it assesses the Defense Success Rate (SR), which quantifies the overall reliability of a defense strategy. It is defined as the percentage of episodes that successfully terminate by maintaining a secure state for the required number of consecutive steps. Over a total of $K$ evaluation episodes, it is calculated as $SR = \frac{100\%}{K} \sum_{k=1}^{K} \sigma_k$, where $\sigma_k$ is the binary success indicator for episode $k$, consistent with the reward function definition.
Third, it evaluates the Convergence Steps ($T_c$), measuring the efficiency and speed of the mitigation response. For each episode, it is defined as the number of time steps from the onset of the attack at $t_0$ until the system is fully stabilized at $t_{\text{stable}}$: $T_c = t_{\text{stable}} - t_0$. This metric reports the average over all successful episodes.
Fourth, it analyzes the Service Latency ($L$), which assesses the impact on user-perceived performance. It is measured as the average end-to-end response time over all requests processed during an episode. This metric reports both the mean and standard deviation of $L$ across all evaluation episodes to evaluate performance and its stability.
Finally, it quantifies the Resource Efficiency ($\eta$) to assess how effectively a policy uses provisioned capacity, calculated as $\eta = 1 - \frac{1}{T} \sum_{t=1}^{T} \frac{C_t + I_t}{R_t}$. The term inside the summation measures the instantaneous inefficiency at each time step $t$ by penalizing the simultaneous existence of compromised nodes $C_t$ and idle replicas $I_t$ relative to the total allocated replicas $R_t$. A value of $\eta$ closer to 1 indicates better resource utilization with lower risk.
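The efficiency metric can be sketched directly from its verbal description: average the per-step inefficiency (compromised nodes plus idle replicas, relative to total allocated replicas) over the episode and subtract from 1. The exact normalization is an assumption based on that description, and the input series below are invented for the example.

```python
# Sketch of the resource-efficiency metric from its verbal definition.
# The (C_t + I_t) / R_t normalization is an assumption; inputs are toy data.
def resource_efficiency(compromised, idle, total):
    """compromised/idle/total: per-step counts over one episode."""
    steps = len(total)
    ineff = sum((c + i) / r for c, i, r in zip(compromised, idle, total)) / steps
    return 1.0 - ineff

# 3-step toy episode: one brief compromise, some initial idle replicas.
eta = resource_efficiency(compromised=[0, 1, 0], idle=[2, 0, 0],
                          total=[10, 10, 10])
```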
5.2. Convergence and Stability of ScaleShield
This section first analyzes the training convergence and stability of the ScaleShield agent by examining its cumulative reward curve over 2000 training episodes, as shown in
Figure 3. The figure plots the average reward obtained per episode throughout the training process, offering insight into the agent’s learning efficiency and policy stability under different adversarial conditions.
A clear trend of convergence is observed across all three attack scenarios. The agent’s reward consistently increases from the initial exploration phase and gradually stabilizes at a high value after approximately 1500 training episodes, indicating that the MD3QN agent successfully learns an effective and stable policy. The specific dynamics vary by attack type. Under the static attack, the agent demonstrates rapid learning, with the reward curve smoothly ascending and converging to a high positive value of approximately 6.33. This shows that ScaleShield can easily master an optimal policy to maintain system health and maximize rewards against a predictable adversary. When facing the periodic attack, the agent learns to counter the intermittent threat, with the reward stabilizing around −2.21. Although the reward is negative due to the recurring nature of the attack, its stable convergence proves that the agent has learned an effective mitigation pattern to minimize penalties and quickly restore the system state.
Finally, under the random attack, which represents the most challenging scenario, the reward curve shows a steady but more gradual increase, eventually converging to a stable value of approximately −8.61. This result highlights the robustness of the proposed approach; even against a highly unpredictable and adaptive adversary, ScaleShield learns a consistent and resilient defense strategy that effectively bounds the system’s losses.
5.3. Comparative Analysis of Defense Effectiveness
Next, this section compares the defense effectiveness of ScaleShield against the four baseline methods by analyzing the number of compromised nodes over time, as depicted in
Figure 4. The results unequivocally demonstrate ScaleShield’s superior performance across all three attack scenarios.
In the static attack scenario (as seen in
Figure 4a), the static threshold-based methods, HPA+ and Heuristic, demonstrate their strength against a predictable threat, achieving complete mitigation in just one step. The proposed method, ScaleShield, exhibits a similarly swift and decisive response, reducing the number of compromised nodes from 5 to 0 within 5 steps. This demonstrates its ability to quickly master an optimal policy against a fixed adversary, performing nearly as well as methods specialized for static conditions. In comparison, the other DRL-based method, DeepScaling, also successfully neutralizes the attack but requires a longer convergence time of 17 steps. In contrast, the standard reactive method, HPA, struggles significantly, taking approximately 40 steps to achieve only a slow, partial recovery where the number of compromised nodes stabilizes around 1.
Under the periodic attack (as seen in
Figure 4b), the proactive nature of ScaleShield becomes even more apparent. It effectively anticipates the recurring attack waves, showing only negligible, instantly suppressed spikes. This is a significant advantage over HPA and the Heuristic method, which are reactive, and even HPA+ with prediction still exhibits large oscillations that mirror the attack’s cadence, with the number of compromised nodes repeatedly spiking to 5. DeepScaling shows a better response but still allows higher and wider peaks compared to ScaleShield.
Finally, in the most challenging random attack scenario (as seen in
Figure 4c), ScaleShield’s robustness is highlighted. It again achieves complete mitigation, bringing the number of compromised nodes to zero within 5 steps. DeepScaling manages to suppress the attack to a low level but fails to eliminate it entirely. Meanwhile, HPA, HPA+, and the Heuristic method are largely ineffective, with the number of compromised nodes remaining high throughout the episode. These results confirm that ScaleShield’s learned, coordinated policy is highly effective at suppressing cascading failures and provides a much faster and more complete defense than both traditional reactive methods and other state-of-the-art DRL-based approaches.
The superior performance is further quantified by the defense success rate and mitigation efficiency, shown in
Figure 5 and
Figure 6, respectively. As illustrated in
Figure 5, ScaleShield achieves the highest success rate across all scenarios, consistently maintaining a reliability of nearly 100%. Even under the most challenging random attack, its success rate remains above 95%, which is dramatically higher than all baselines. For instance, in the same random scenario, the success rate of HPA+ plummets to approximately 23%, while HPA and DeepScaling only manage around 48% and 62%, respectively. This highlights ScaleShield’s exceptional robustness against unpredictable threats.
In terms of efficiency (as seen in
Figure 6), ScaleShield requires the fewest convergence steps to neutralize threats in all cases. It consistently stabilizes the system in approximately 11–15 steps. In stark contrast, reactive methods take considerably longer. HPA requires around 45 steps in the static scenario and fails to converge within the 50-step episode limit under periodic and random attacks. Similarly, Heuristic and HPA+ also fail to converge under the periodic attack. This demonstrates that ScaleShield’s efficiency does not stem from simple reaction but from a learned policy that executes precise, multi-service scaling actions, directly targeting the attack’s propagation path.
5.4. Analysis of Performance and Cost-Effectiveness
Finally, this section evaluates the impact of each defense strategy on user-perceived service latency and operational resource efficiency, with detailed results summarized in
Table 2 and
Table 3.
In terms of service latency (as shown in
Table 2), ScaleShield demonstrates superior performance, particularly under dynamic and unpredictable attacks. In the random attack scenario, ScaleShield achieves the lowest average latency by a significant margin, recording only 388.84 ms, whereas reactive methods like HPA suffer from extreme delays of 1382.27 ms. This corresponds to a relative latency reduction of approximately 72% under dynamic attack conditions. Similarly, under the periodic attack, ScaleShield maintains the lowest average latency of 415.79 ms and the best stability with lowest standard deviation of 25.85. This indicates that ScaleShield’s proactive and coordinated scaling effectively prevents the formation of bottlenecks that degrade user experience. While HPA+ shows the best average latency in the static scenario, ScaleShield achieves the best stability with the lowest standard deviation, highlighting its ability to provide a consistent quality of service.
From a cost-effectiveness perspective (as shown in Table 3), ScaleShield excels in maintaining high resource efficiency without compromising security. Under the dynamic periodic and random attacks, ScaleShield sustains more than 94% effective utilization, achieving the highest efficiency scores of 0.9767 and 0.9490, respectively, with near-perfect stability of zero or near-zero standard deviation. This demonstrates its ability to precisely allocate resources to counter threats without significant over-provisioning. In contrast, methods like HPA+ show a severe drop in efficiency under dynamic attacks (e.g., 0.4605 for periodic and a mere 0.0173 for random), indicating severe resource mismanagement. While the Heuristic method performs well in the static case, its efficiency declines under more complex attacks. This confirms that the agent learns a policy that is not only secure and responsive but also economically efficient, striking an optimal balance between performance and cost.
6. Discussion
This section discusses how the proposed hybrid framework addresses the challenges of MTD in cloud-native microservices, the rationale for adopting DRL, its potential value in real DDoS incidents, and the main limitations of the current study.
Traditional MTD strategies in microservices often lack attack-aware control and can be ineffective against spatiotemporally adaptive adversaries that vary targets, timing, and traffic patterns. The proposed hybrid framework combines an MTD-based collaborative scaling mechanism with an MD3QN-based DRL policy, and its effectiveness against adaptive threats is reflected in several empirical findings. First, under the periodic and random attack scenarios, which explicitly model time-varying and target-shifting adversarial behavior, ScaleShield maintains a defense success rate close to 100% and quickly converges. Second, ScaleShield consistently keeps the number of compromised nodes near zero and suppresses cascade effects along service call chains, indicating that the policy can neutralize evolving attack patterns rather than merely reacting to fixed signatures. Third, the reward design jointly penalizes danger states, SLA violations, idle resources, and invalid actions, enabling the policy to learn when MTD actions are genuinely beneficial.
Furthermore, the choice of DRL is motivated by the intrinsic characteristics of cloud-native microservice defense. First, the defense problem is inherently sequential in a non-stationary environment, where the effect of each scaling action propagates over time, making delayed and coupled impacts difficult to capture with static rules. Second, the joint action space is high-dimensional and combinatorial. Multiple services must be scaled concurrently, and the number of action combinations grows rapidly with system size. DRL, and in particular MD3QN, can handle such multi-dimensional discrete action spaces and learn coordinated policies that are difficult to design manually. Third, cloud-native systems continuously generate rich telemetry, which can be exploited by DRL to adapt to unseen workloads and attack strategies. This makes DRL a suitable tool not only for reacting to known patterns but also for identifying and mitigating novel or previously unobserved threat behaviors.
Historically, several real DDoS incidents have highlighted the difficulty of defending complex services, ranging from the Mirai-based attack on Dyn in 2016 and the 2018 memcached attack on GitHub that peaked at 1.3 Tbps, to recent hyper-volumetric HTTP floods and HTTP/2 Rapid Reset attacks reaching hundreds of millions of requests per second [
49]. In such cases, volumetric scrubbing and rate limiting can preserve core connectivity, but residual attack traffic still tends to concentrate on specific microservices, creating internal bottlenecks and cascading slowdowns. A defense solution like ScaleShield can complement perimeter protection by monitoring per-service health and adaptively scaling vulnerable components to absorb attack pressure without indiscriminate overprovisioning. For example, in application-layer floods, ScaleShield can complement DDoS mitigation by prioritizing scaling nodes along critical service paths and avoiding indiscriminate scaling elsewhere, thereby preventing internal resource exhaustion, mitigating service collapse, and containing cost under constrained capacity.
Despite its promising performance, the proposed approach has several limitations. First, the proposed architecture introduces additional system complexity, as ScaleShield depends on fine-grained telemetry collection, global state aggregation, and DRL-driven decision-making on top of the existing system. This complexity may expand when solving security defense configurations, especially in large-scale topologies [
50]. Second, there is an inherent trade-off between security and QoS. Frequent or aggressive MTD actions can degrade latency and throughput, and the experimental results already show non-negligible impact on user-perceived response time under certain attack scenarios. Moreover, these effects may be amplified in real deployments with stricter requirements. Third, the defense space is currently restricted to horizontal scaling. Other practical MTD mechanisms such as traffic rerouting, instance migration, and configuration diversification are not yet integrated into the action space, leaving complementary defense levers underexploited. Future work will focus on addressing these limitations through improved system design, more MTD actions, and more comprehensive evaluations.
7. Conclusions
Cloud-native microservice architectures are highly vulnerable to sophisticated resource-exhaustion DDoS attacks that exploit inter-service dependencies and propagate along service call chains. Existing detection-based and reactive mitigation mechanisms, as well as heuristic MTD strategies, often provide insufficient protection against stealthy and adaptive adversaries and may incur substantial overhead due to uncoordinated or mistimed actions.
To overcome these limitations, this paper introduced ScaleShield, a novel adaptive MTD framework powered by DRL techniques. By formulating the defense problem as an MDP, ScaleShield learns to orchestrate proactive scaling actions across multiple services. The core contribution lies in the use of an MD3QN algorithm to navigate the high-dimensional discrete action space, enabling the system to learn a coordinated defense policy. This policy dynamically reconfigures the attack surface to dissipate emerging bottlenecks and suppress attack propagation, guided by a comprehensive reward function that balances security, SLA compliance, and operational cost.
Through extensive simulation, this paper validated the effectiveness of ScaleShield against various attack strategies. The experimental results demonstrate that the proposed approach provides superior resilience and efficiency. Specifically, ScaleShield maintained a defense success rate near 100% and neutralized threats in 11–15 steps on average, whereas several baselines failed to converge under dynamic attacks. This confirms that by learning an intelligent, attack-aware policy, ScaleShield provides a robust and resource-efficient solution for ensuring the resilience of cloud-native applications.
There are several promising directions for future work. First, although the current evaluation uses a representative topology, assessing ScaleShield on a broader range of topologies with varying sizes, depths, cyclic patterns, and workloads would further validate its generalization. Second, the policy could be enhanced by incorporating graph-based state representations that explicitly encode invocation structure and resource couplings, potentially improving robustness while simplifying decision logic. Third, the MTD action space may be extended to include more mechanisms, with their joint optimization studied in real-world clusters and long-running traces.
Author Contributions
Conceptualization, Y.Z. and K.D.; methodology, Y.Z.; software, K.D.; validation, K.D.; formal analysis, Y.Z.; investigation, G.C.; resources, Y.Z.; data curation, Y.Z.; writing—original draft preparation, K.D.; writing—review and editing, Y.Z.; visualization, Y.Z.; supervision, G.C.; project administration, Y.Z.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded in part by the National Natural Science Foundation of China under Grant No. 62202097 and Grant No. 62072100, in part by the Frontier Technologies R & D Program of Jiangsu under Grant No. BF2025026, in part by China Postdoctoral Science Foundation under Grant No. 2024T170143 and Grant No. 2022M710677, and in part by Jiangsu Funding Program for Excellent Postdoctoral Talent under Grant No. 2022ZB137.
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| DDoS | Distributed Denial-of-Service |
| QoS | Quality of Service |
| MTD | Moving Target Defense |
| DRL | Deep Reinforcement Learning |
| MDP | Markov Decision Process |
| MD3QN | Multi-Dimensional Double Deep Q-Network |
| SLA | Service Level Agreement |
| SOTA | State of the Art |
| APT | Advanced Persistent Threat |
| PCA | Principal Component Analysis |
| CDN | Content Delivery Network |
| IDS | Intrusion Detection System |
| DQN | Deep Q-Network |
| MSE | Mean Squared Error |
| API | Application Programming Interface |
| DAG | Directed Acyclic Graph |
| I/O | Input/Output |
| VM | Virtual Machine |
| HPA | Horizontal Pod Autoscaler |
| LSTM | Long Short-Term Memory |
References
- Deng, S.; Zhao, H.; Huang, B.; Zhang, C.; Chen, F.; Deng, Y.; Yin, J.; Dustdar, S.; Zomaya, A.Y. Cloud-native computing: A survey from the perspective of services. Proc. IEEE 2024, 112, 12–46. [Google Scholar] [CrossRef]
- He, X.; Xu, H.; Xu, X.; Chen, Y.; Wang, Z. An efficient algorithm for microservice placement in cloud-edge collaborative computing environment. IEEE Trans. Serv. Comput. 2024, 17, 1983–1997. [Google Scholar] [CrossRef]
- Loconte, D.; Ieva, S.; Gramegna, F.; Bilenchi, I.; Fasciano, C.; Pinto, A.; Loseto, G.; Scioscia, F.; Ruta, M.; Di Sciascio, E. Serverless microservice architecture for cloud-edge intelligence in sensor networks. IEEE Sens. J. 2024, 25, 7875–7885. [Google Scholar] [CrossRef]
- Hu, Y.; Hou, L.; Hu, J.; Ren, M.; Hu, M.; Cai, C.; Peng, K. Time-Varying Microservice Orchestration with Routing for Dynamic Call Graphs via Multi-Scale Deep Reinforcement Learning. IEEE Trans. Serv. Comput. 2025, 18, 3276–3291. [Google Scholar] [CrossRef]
- Mahmud, R.; Jin, J.; Kua, J.; Afrin, M.; Mistry, S.; Krishna, A. Trusted Microservice Orchestration for Secure Edge Computing in Industrial Cyber-Physical Systems. IEEE Netw. 2025, 39, 70–78. [Google Scholar] [CrossRef]
- Haindl, P.; Kochberger, P.; Sveggen, M. A systematic literature review of inter-service security threats and mitigation strategies in microservice architectures. IEEE Access 2024, 12, 90252–90286. [Google Scholar] [CrossRef]
- Almanasir, R.; Al-solomon, D.; Indrawes, S.; Almaiah, M.; Islam, U.; Alshar’e, M. Classification of threats and countermeasures of cloud computing. J. Cyber Secur. Risk Audit. 2025, 2025, 27–42. [Google Scholar] [CrossRef]
- Xie, S.; Wang, J.; Li, B.; Zhang, Z.; Li, D.; Hung, P.C. PBScaler: A bottleneck-aware autoscaling framework for microservice-based applications. IEEE Trans. Serv. Comput. 2024, 17, 604–616. [Google Scholar] [CrossRef]
- Toka, L.; Dobreff, G.; Fodor, B.; Sonkoly, B. Machine learning-based scaling management for kubernetes edge clusters. IEEE Trans. Netw. Serv. Manag. 2021, 18, 958–972. [Google Scholar] [CrossRef]
- Jayalath, R.K.; Ahmad, H.; Goel, D.; Syed, M.S.; Ullah, F. Microservice vulnerability analysis: A literature review with empirical insights. IEEE Access 2024, 12, 155168–155204. [Google Scholar] [CrossRef]
- Bremler-Barr, A.; Czeizler, M.; Levy, H.; Tavori, J. Exploiting miscoordination of microservices in tandem for effective DDoS attacks. In Proceedings of the IEEE INFOCOM 2024-IEEE Conference on Computer Communications, Vancouver, BC, Canada, 20–23 May 2024; pp. 231–240. [Google Scholar]
- Garba, U.H.; Toosi, A.N.; Pasha, M.F.; Khan, S. SDN-based detection and mitigation of DDoS attacks on smart homes. Comput. Commun. 2024, 221, 29–41. [Google Scholar]
- Alenezi, M.N. Significance of Machine Learning-Driven Algorithms for Effective Discrimination of DDoS Traffic Within IoT Systems. Future Internet 2025, 17, 266. [Google Scholar]
- Pietrantuono, R.; Ficco, M.; Palmieri, F. Testing the resilience of MEC-based IoT applications against resource exhaustion attacks. IEEE Trans. Dependable Secur. Comput. 2023, 21, 804–818. [Google Scholar]
- Shen, W.; Wu, Y.; Yang, Y.; Liu, Q.; Yang, N.; Li, J.; Lu, K.; Ma, J. Towards Understanding and Defeating Abstract Resource Attacks for Container Platforms. IEEE Trans. Dependable Secur. Comput. 2024, 22, 474–490. [Google Scholar] [CrossRef]
- Jeong, B.; Jeon, J.; Jeong, Y.S. Proactive resource autoscaling scheme based on SCINet for high-performance cloud computing. IEEE Trans. Cloud Comput. 2023, 11, 3497–3509. [Google Scholar] [CrossRef]
- Tan, J.; Jin, H.; Zhang, H.; Zhang, Y.; Chang, D.; Liu, X.; Zhang, H. A survey: When moving target defense meets game theory. Comput. Sci. Rev. 2023, 48, 100544. [Google Scholar] [CrossRef]
- Zhang, T.; Kong, F.; Deng, D.; Tang, X.; Wu, X.; Xu, C.; Zhu, L.; Liu, J.; Ai, B.; Han, Z.; et al. Moving target defense meets artificial intelligence-driven network: A comprehensive survey. IEEE Internet Things J. 2025, 12, 13384–13397. [Google Scholar]
- Pagnotta, G.; De Gaspari, F.; Hitaj, D.; Andreolini, M.; Colajanni, M.; Mancini, L.V. DOLOS: A novel architecture for moving target defense. IEEE Trans. Inf. Forensics Secur. 2023, 18, 5890–5905. [Google Scholar] [CrossRef]
- Zhang, T.; Xu, C.; Shen, J.; Kuang, X.; Grieco, L.A. How to disturb network reconnaissance: A moving target defense approach based on deep reinforcement learning. IEEE Trans. Inf. Forensics Secur. 2023, 18, 5735–5748. [Google Scholar] [CrossRef]
- GitHub: SEU-ProactiveSecurity-Group/ScaleShield. Available online: https://github.com/SEU-ProactiveSecurity-Group/ScaleShield (accessed on 10 December 2025).
- Owusu, E.; Rahouti, M.; Jagatheesaperumal, S.K.; Xiong, K.; Xin, Y.; Lu, L.; Hsu, D.F. Online network DoS/DDoS detection: Sampling, change point detection, and machine learning methods. IEEE Commun. Surv. Tutor. 2024, 27, 2543–2580. [Google Scholar] [CrossRef]
- Abiramasundari, S.; Ramaswamy, V. Distributed denial-of-service (DDoS) attack detection using supervised machine learning algorithms. Sci. Rep. 2025, 15, 13098. [Google Scholar] [CrossRef]
- Koksal, S.; Catak, F.O.; Dalveren, Y. Flexible and Lightweight Mitigation Framework for Distributed Denial-of-Service Attacks in Container-Based Edge Networks Using Kubernetes. IEEE Access 2024, 12, 172980–172991. [Google Scholar] [CrossRef]
- Laila, D.A. Responsive Machine Learning Framework and Lightweight Utensil of Prevention of Evasion Attacks in the IoT-Based IDS. STAP J. Secur. Risk Manag. 2025, 2025, 59–70. [Google Scholar] [CrossRef]
- Najar, A.A.; Naik, S.M. Cyber-secure SDN: A CNN-based approach for efficient detection and mitigation of DDoS attacks. Comput. Secur. 2024, 139, 103716. [Google Scholar] [CrossRef]
- Zhao, Z.; Liu, Z.; Chen, H.; Zhang, F.; Song, Z.; Li, Z. Effective DDoS mitigation via ML-driven in-network traffic shaping. IEEE Trans. Dependable Secur. Comput. 2024, 21, 4271–4289. [Google Scholar]
- Kaur, A.; Krishna, C.R.; Patil, N.V. A comprehensive review on Software-Defined Networking (SDN) and DDoS attacks: Ecosystem, taxonomy, traffic engineering, challenges and research directions. Comput. Sci. Rev. 2025, 55, 100692. [Google Scholar] [CrossRef]
- Kumar, A.; Agarwal, M. Reducing Internal Collateral Damage from Distributed Denial of Service (DDoS) Attacks through Micro-Service Cloud Architecture. IEEE Trans. Inf. Forensics Secur. 2024, 20, 1081–1091. [Google Scholar]
- Qiu, H.; Mao, W.; Wang, C.; Franke, H.; Youssef, A.; Kalbarczyk, Z.T.; Başar, T.; Iyer, R.K. AWARE: Automate workload autoscaling with reinforcement learning in production cloud systems. In Proceedings of the 2023 USENIX Annual Technical Conference (USENIX ATC 23), Boston, MA, USA, 10–12 July 2023; pp. 387–402. [Google Scholar]
- Quattrocchi, G.; Incerto, E.; Pinciroli, R.; Trubiani, C.; Baresi, L. Autoscaling solutions for cloud applications under dynamic workloads. IEEE Trans. Serv. Comput. 2024, 17, 804–820. [Google Scholar] [CrossRef]
- Srirama, S.N.; Adhikari, M.; Paul, S. Application deployment using containers with auto-scaling for microservices in cloud environment. J. Netw. Comput. Appl. 2020, 160, 102629. [Google Scholar] [CrossRef]
- Nguyen, T.T.; Yeom, Y.J.; Kim, T.; Park, D.H.; Kim, S. Horizontal pod autoscaling in Kubernetes for elastic container orchestration. Sensors 2020, 20, 4621. [Google Scholar] [CrossRef]
- Wang, Z.; Zhu, S.; Li, J.; Jiang, W.; Ramakrishnan, K.; Yan, M.; Zhang, X.; Liu, A.X. DeepScaling: Autoscaling microservices with stable CPU utilization for large scale production cloud systems. IEEE/ACM Trans. Netw. 2024, 32, 3961–3976. [Google Scholar] [CrossRef]
- Zhou, Y.; Cheng, G.; Yu, S. An SDN-enabled proactive defense framework for DDoS mitigation in IoT networks. IEEE Trans. Inf. Forensics Secur. 2021, 16, 5366–5380. [Google Scholar] [CrossRef]
- Zhou, Y.; Cheng, G.; Zhao, Y.; Chen, Z.; Jiang, S. Toward proactive and efficient DDoS mitigation in IIoT systems: A moving target defense approach. IEEE Trans. Ind. Inform. 2022, 18, 2734–2744. [Google Scholar] [CrossRef]
- Zhang, Q.; Cho, J.H.; Moore, T.J.; Kim, D.D.; Lim, H.; Nelson, F. Evade: Efficient moving target defense for autonomous network topology shuffling using deep reinforcement learning. In Proceedings of the International Conference on Applied Cryptography and Network Security, Kyoto, Japan, 19–22 June 2023; Springer: Cham, Switzerland, 2023; pp. 555–582. [Google Scholar]
- Zhou, Y.; Cheng, G.; Ouyang, Z.; Chen, Z. Resource-efficient low-rate DDoS mitigation with moving target defense in edge clouds. IEEE Trans. Netw. Serv. Manag. 2024, 22, 168–186. [Google Scholar] [CrossRef]
- Yan, L.; Zhou, Z.; Yang, S.; Xu, C. Device-Cloud Collaborative DDoS Resistance for QoS-Sensitive Mobile Applications: A Seamlessly Shuffle-based Moving Target Defense Approach. In Proceedings of the GLOBECOM 2024—2024 IEEE Global Communications Conference, Cape Town, South Africa, 8–12 December 2024; pp. 1191–1196. [Google Scholar]
- Rehman, Z.; Gondal, I.; Ge, M.; Dong, H.; Gregory, M.; Tari, Z. Proactive defense mechanism: Enhancing IoT security through diversity-based moving target defense and cyber deception. Comput. Secur. 2024, 139, 103685. [Google Scholar] [CrossRef]
- Yoon, H.J.; Soon, D.; Moore, T.J.; Yoon, S.; Lim, H.; Kim, D.; Nelson, F.F.; Cho, J.H. Intrusion Response System for In-Vehicle Networks: Uncertainty-Aware Deep Reinforcement Learning-based Approach. In Proceedings of the MILCOM 2024—2024 IEEE Military Communications Conference (MILCOM), Washington, DC, USA, 28 October–1 November 2024; pp. 827–832. [Google Scholar]
- Srinivasa Rao, P.S.V.; Bhadula, S.; R, B.; Suryawanshi, P.B.; Kumar Yadav, A.; Vekariya, D. Adaptive Traffic Management in Smart Cities using Deep Reinforcement Learning for IoT Mobile Ad Hoc Networks. In Proceedings of the 2024 Second International Conference on Intelligent Cyber Physical Systems and Internet of Things (ICoICI), Coimbatore, India, 28–30 August 2024; pp. 362–367. [Google Scholar]
- Zhang, H.; Liang, H.; Hong, X.; Yao, Y.; Lin, B.; Zhao, D. DRL-Based resource allocation game with influence of review information for vehicular edge computing systems. IEEE Trans. Veh. Technol. 2024, 73, 9591–9603. [Google Scholar] [CrossRef]
- Núñez-Molina, C.; Mesejo, P.; Fernández-Olivares, J. A review of symbolic, subsymbolic and hybrid methods for sequential decision making. ACM Comput. Surv. 2024, 56, 1–36. [Google Scholar] [CrossRef]
- Tellache, A.; Mokhtari, A.; Korba, A.A.; Ghamri-Doudane, Y. Multi-agent reinforcement learning-based network intrusion detection system. In Proceedings of the NOMS 2024—2024 IEEE Network Operations and Management Symposium, Seoul, Republic of Korea, 6–10 May 2024; pp. 1–9. [Google Scholar]
- Shahab, E.; Taleb, M.; Gholian-Jouybari, F.; Hajiaghaei-Keshteli, M. Designing a resilient cloud network fulfilled by reinforcement learning. Expert Syst. Appl. 2024, 255, 124606. [Google Scholar] [CrossRef]
- Detti, A.; Funari, L.; Petrucci, L. μBench: An open-source factory of benchmark microservice applications. IEEE Trans. Parallel Distrib. Syst. 2023, 34, 968–980. [Google Scholar] [CrossRef]
- Gan, Y.; Zhang, Y.; Cheng, D.; Shetty, A.; Rathi, P.; Katarki, N.; Bruno, A.; Hu, J.; Ritchken, B.; Jackson, B.; et al. An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Providence, RI, USA, 13–19 April 2019; pp. 3–18. [Google Scholar]
- Famous DDoS Attacks|Biggest DDoS Attacks|Cloudflare. Available online: https://www.cloudflare.com/learning/ddos/famous-ddos-attacks (accessed on 10 December 2025).
- Li, Y.; Li, Y.; Wang, G.; Hu, H. An Adaptive Dynamic Defense Strategy for Microservices Based on Deep Reinforcement Learning. Electronics 2025, 14, 4096. [Google Scholar] [CrossRef]
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).