Article

ERA-MADDPG: An Elastic Routing Algorithm Based on Multi-Agent Deep Deterministic Policy Gradient in SDN

1 College of Software Engineering, Zhengzhou University of Light Industry, Zhengzhou 450007, China
2 College of Electronics & Communication Engineering, Shenzhen Polytechnic University, Shenzhen 518005, China
3 College of Information Technology, Zhengzhou Vocational College of Finance and Taxation, Zhengzhou 450048, China
* Author to whom correspondence should be addressed.
Future Internet 2025, 17(7), 291; https://doi.org/10.3390/fi17070291
Submission received: 31 May 2025 / Revised: 27 June 2025 / Accepted: 27 June 2025 / Published: 29 June 2025

Abstract

To address the impact of network topology changes on routing performance, this paper proposes an Elastic Routing Algorithm based on Multi-Agent Deep Deterministic Policy Gradient (ERA-MADDPG), implemented within the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) framework of deep reinforcement learning. The algorithm first builds a three-layer architecture based on Software-Defined Networking (SDN); the top-down layers are the multi-agent layer, the controller layer, and the data layer. The architecture's processing flow, including real-time data-layer information collection and dynamic policy generation, enables the ERA-MADDPG algorithm to exhibit strong elasticity by quickly adjusting routing decisions in response to topology changes. Implementing the routing algorithm with the actor-critic framework combined with Convolutional Neural Networks (CNN) effectively improves training efficiency, enhances learning stability, facilitates collaboration, and improves generalization and applicability. Finally, simulation experiments demonstrate that the ERA-MADDPG routing algorithm converges faster than the Multi-Agent Deep Q-Network (MADQN) algorithm and the Smart Routing based on Deep Reinforcement Learning (SR-DRL) algorithm, with initial-phase training speed improved by approximately 20.9% and 39.1%, respectively. The elasticity of ERA-MADDPG is quantified by re-convergence speed: under 5–15% topology node/link changes, its re-convergence is over 25% faster than that of MADQN and SR-DRL, demonstrating a superior capability to maintain routing efficiency in dynamic environments.

1. Introduction

In recent years, with the development and widespread application of machine learning technology, it has demonstrated powerful performance across various fields, bringing significant changes and opportunities to numerous industries. Deep Reinforcement Learning (DRL), as a subfield of machine learning, exhibits unique advantages in dynamic network environments. Unlike traditional rule-based routing or single-agent reinforcement learning, DRL leverages deep neural networks for high-dimensional state representation and employs multi-agent collaboration to achieve distributed decision-making. Moreover, it can adaptively optimize long-term cumulative rewards through end-to-end learning, thereby enabling more efficient routing selection and resource allocation in complex, time-varying network topologies. In the network environment, machine learning makes decisions based on real-time data, continuously learns, and adjusts models to adapt to changes in the network environment. This makes it widely used in the networking field to optimize network performance and improve network security [1]. Traditional machine learning methods often assume that the environment is static or determined by external factors. However, in practical network environments, system behavior is frequently influenced by multiple autonomous entities. Multi-Agent Systems (MAS) provide a powerful framework for addressing such challenges by enabling multiple agents to interact and collaborate in pursuit of shared or individual objectives. In the context of machine learning, the application of MAS allows algorithms to more effectively handle uncertainty and dynamic changes in distributed environments. Each agent can independently learn and adapt to the environment while simultaneously accounting for the behaviors and influences of other agents [2]. Reinforcement learning (RL) is a fundamental branch of machine learning that enables an agent to interact with its environment and learn an optimal policy by leveraging feedback in the form of reward signals. The agent interacts with the environment by taking a sequence of actions, upon which the environment provides corresponding rewards or penalties. The agent’s objective is to maximize the long-term cumulative reward. In network routing scenarios, the agent can be represented by a network node or a routing algorithm, while the environment encompasses network conditions such as the topology and link states. The reward function can be defined based on network performance metrics, such as throughput and latency. This reward-driven learning paradigm endows reinforcement learning with significant potential in network routing optimization, which enables it to adapt to dynamic network conditions and enhance overall network performance [3].
Machine learning brings revolutionary breakthroughs to network routing by enabling autonomous decision-making and dynamic adaptation. Unlike traditional rule-based algorithms, which rely on static protocols and struggle with real-time topology changes, machine learning algorithms—such as DRL—can continuously optimize routing policies through iterative interactions with network environments [4]. Intelligent routing employs machine learning algorithms to achieve load balancing and improve network performance and reliability. Machine learning demonstrates advantages such as intelligence, efficiency, and security in intelligent routing applications, thereby enhancing network performance and reducing network operation and maintenance costs. However, network topology often undergoes changes due to the physical and logical connection methods of each node to meet network requirements, enable fault recovery, optimize performance, and other factors. For example, equipment failure can cause changes in network paths [5]. After a network topology change, routing algorithms based on reinforcement learning may need to be retrained to avoid performance loss, as these algorithms typically make decisions based on the network status and topology. When the network topology changes, the original routing decisions may no longer be applicable to the new topology. For example, changes in link conditions, node failures, or the addition of new devices can cause the original routing decisions to degrade in performance or become inapplicable. At this point, the routing algorithm based on reinforcement learning needs to re-collect the current network status, retrain, and update routing decisions accordingly. This process allows the algorithm to adapt to the new network environment and maintain efficient routing decision-making capabilities, thereby avoiding performance loss [6].
Currently, researchers have conducted extensive research on routing algorithms based on reinforcement learning and proposed a series of intelligent routing solutions. Zhao et al. [7] proposed a multi-agent reinforcement learning method for inter-domain routing, which improved the overall performance of the network. Liu et al. [8] used a comprehensive reward function, an efficient learning algorithm, and a novel deep neural network structure to learn appropriate routing strategies for different types of traffic demands. Bhavanasi et al. [6] constructed a routing reinforcement learning strategy that can operate in dynamic network conditions without retraining and compared this method with other routing protocols on various quality-of-service indicators. Kaviani et al. [9] proposed DeepCQ+ routing, which integrates emerging multi-agent deep reinforcement learning techniques into the existing Q-learning-based routing protocol and achieves consistently higher performance across a wide range of network topologies, demonstrating a high degree of robustness and scalability. You et al. [10] proposed a packet routing framework based on multi-agent deep reinforcement learning, which performs training and decision-making in a fully distributed environment, effectively approximates the value function of Q-learning, and reduces packet transmission time. Ding et al. [11] used deep reinforcement learning to select routers in networks with heavy traffic, reducing network congestion and shortening data transmission paths; they formulated the routing problem as a Markov decision process and designed two novel Deep Q-Network (DQN)-based algorithms to reduce the probability of network congestion with shorter transmission paths. He et al. [12] proposed an effective solution to the routing optimization problem by integrating a Graph Neural Network (GNN) structure into DRL; the characteristics of the GNN are used to interact with the network topology environment, and through the message-passing process of information exchange between links in the topology, available knowledge is extracted to achieve network traffic load balancing and enhance network performance. Sun et al. [13] proposed a deep reinforcement learning-based routing optimization framework, SINET, which achieves intelligent optimization of network routing by controlling a subset of key nodes. Chen et al. [14] proposed a scalable inter-domain multi-link routing optimization mechanism based on multi-agent reinforcement learning, which dynamically adjusts inter-domain link routing by sensing traffic distribution in real time, maximizing throughput across autonomous systems. At present, most existing elastic routing algorithms (e.g., single-agent DRL-based methods) can improve network performance under a static topology, but they lack the ability to efficiently re-converge to optimal routing decisions when facing dynamic topology changes. In contrast, the ERA-MADDPG algorithm proposed in this paper introduces multi-agent collaborative learning to enhance the 'elasticity' of routing, defined as the capability to rapidly adapt to topology changes through real-time policy updates and inter-agent knowledge sharing.
Based on the advantages and current status of multi-agent intelligent routing, this paper combines SDN-based intelligent routing [15] with deep reinforcement learning. Modeling the routing process as a Markov decision process, an Elastic Routing Algorithm based on Multi-Agent Deep Deterministic Policy Gradient (ERA-MADDPG) is established; when confronted with topology changes, network routing can adapt more flexibly and intelligently through collaborative learning among the agents. The routing performance is evaluated through the state mapping, action mapping, and reward-value mapping to achieve elastic multi-agent routing. The main contributions of this paper are as follows:
  • Proposes a novel SDN-based three-layer architecture for resilient routing, comprising a multi-agent layer, a controller layer, and a data layer. The architecture enables dynamic collaboration among agents to process topology changes through a streamlined workflow (information collection → state input → policy generation → delivery), addressing the limitation of single-agent models in distributed network environments.
  • Develops the ERA-MADDPG algorithm by integrating the actor-critic framework with CNN networks. This innovation enhances training efficiency by 20.9–39.1% compared to traditional methods (MADQN, SR-DRL) through centralized training of multi-agent experiences and CNN-based high-dimensional feature extraction, while stabilizing learning through policy gradient optimization in continuous action spaces.
  • Demonstrates superior adaptability to topology dynamics: Through simulation experiments, the algorithm achieves re-convergence speeds over 25% faster than baselines under 5–15% node/link changes, validating its robustness in dynamic networks. The multi-agent collaborative mechanism reduces reliance on centralized retraining, providing a scalable solution for real-time routing optimization.
The remainder of this paper is organized as follows. Section 2 presents the system architecture of the proposed method, including the three-layer SDN-based architecture and the interaction mechanism between agents and the environment. Section 3 details the ERA-MADDPG intelligent routing solution, covering the algorithm and its interaction with the environment. Section 4 conducts experimental evaluations, comparing the performance of the ERA-MADDPG algorithm with other related algorithms in terms of convergence speed, adaptability to network changes, and performance metrics such as delay, throughput, and packet loss rate. Finally, Section 5 concludes the paper, summarizing the research achievements and highlighting the advantages of the proposed algorithm.

2. System Architecture

The algorithm is implemented based on the three-layer architecture of SDN. The three-layer architecture comprises the multi-agent layer, the controller layer, and the data layer from top to bottom. The multi-agent layer enables programmatic control of the network, including traffic monitoring, load balancing, and security policy implementation. Other functions are achieved through multi-agent control. Multi-agents accelerate learning by sharing knowledge and experience. The controller layer is the core of SDN architecture, which is responsible for centralized management of data layer network topology, interacts with the agent through the northbound interface to deliver the flow table, and interacts with the data layer through the southbound interface to collect data layer network topology information. The data layer includes network devices such as switches and routers, which are responsible for the actual data packet forwarding and processing. The processing flow of the architecture includes data layer information collection, status input, policy generation, policy delivery, and flow table delivery. The architecture is shown in Figure 1.
A single agent interacts with the environment at discrete time intervals and iteratively optimizes its policy based on reward feedback. The reinforcement learning process is modeled as a Markov Decision Process (MDP). When multiple agents in the architecture interact with the environment, the entire system becomes a multi-agent system. Agent-environment interaction follows the MDP framework: at each time step $t$, the agent observes the state $s_t$, selects an action $a_t$ via policy $\pi_t$, receives a reward $r_t$, and transitions to state $s_{t+1}$ according to $P(s_{t+1} \mid s_t, a_t)$. This iterative process aims to maximize the expected cumulative reward. Individual agents still follow the MDP, while multi-agent systems treat the other agents as influencing factors in the environment. In an MDP, actions are selected according to known transition probabilities, and the reward value is obtained in the next state. An MDP consists of the five elements in Equation (1).
$\langle S, A, P(s_{t+1} \mid s_t, a_t), \gamma, R \rangle$ (1)
where $S$ is the state space (set of states), with $|S|$ denoting its cardinality, $A$ represents the set of actions, $t$ denotes the time step index, and $P(s_{t+1} \mid s_t, a_t)$ represents the transition probability at time $t$, mapping the state-action pair at time $t$ to the state distribution at time $t+1$. The immediate reward function $r(s_t, a_t, s_{t+1})$ measures the short-term feedback for taking action $a_t$ in state $s_t$ to reach $s_{t+1}$. The discount factor $\gamma \in [0, 1]$ prioritizes immediate rewards over future rewards, balancing short-term and long-term objectives in the reinforcement learning process. The schematic diagram of the interaction between multiple agents and the environment is shown in Figure 2. Figure 2 illustrates the dynamic interaction between $n$ agents and the network environment in a multi-agent MDP. Each agent $i$ observes a local state $s_i^t$ at time $t$, selects an action $a_i^t$ based on its policy $\pi_i$, and collectively influences the environment's transition to the next state $s_{t+1}$. The shared reward $r_{t+1}$ reflects the global impact of joint actions (e.g., reduced end-to-end delay). This diagram highlights that agents treat each other's actions as part of the environment's dynamics, necessitating collaborative learning to maximize cumulative rewards.
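The agent-environment loop described above can be summarized in a short, hedged sketch. The NetworkEnv class, its observation contents, and the random placeholder policy are hypothetical stand-ins for the SDN data-layer environment; only the state, action, reward, next-state cycle of Equation (1) follows the paper.

```python
# Minimal sketch of the MDP interaction loop described above.
# NetworkEnv and random_policy are hypothetical placeholders; only the
# state -> action -> reward -> next-state cycle follows Equation (1).
import random

class NetworkEnv:
    """Toy stand-in for the SDN data-layer environment."""
    def reset(self):
        return {"link_delay": 10.0, "link_load": 0.2}      # initial state s_0

    def step(self, action):
        # A real environment would apply the flow table implied by `action`
        # and measure new delay/throughput; here we just perturb the state.
        next_state = {"link_delay": 10.0 + random.uniform(-1, 1),
                      "link_load": min(1.0, action["weight"] * 0.5)}
        reward = -next_state["link_delay"]                  # shorter delay -> higher reward
        return next_state, reward

def random_policy(state):
    return {"weight": random.random()}                      # placeholder pi(a|s)

env = NetworkEnv()
state = env.reset()
for t in range(5):                                          # a few MDP steps
    action = random_policy(state)
    next_state, reward = env.step(action)
    state = next_state
```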
In multi-agent reinforcement learning, all agents operate in a shared environment and simultaneously select actions $a_i \in A_i$ based on the current global state $s$. The joint action $a = (a_1, a_2, \ldots, a_N)$ collectively drives the environmental transition, leading to a new state $s'$ and reward feedback $r = (r_1, r_2, \ldots, r_N)$. Using the MDP framework defined in (1), the cumulative discounted reward for agent $i$ is modeled as Equation (2), which captures the expected long-term return starting from state $s$ under policies $\pi_i$ and $\pi_{-i}$, where $\pi_i$ denotes the strategy of agent $i$ and $\pi_{-i}$ denotes the joint strategy of all agents other than agent $i$.
$V^i_{\pi_i, \pi_{-i}}(s) = \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^t\, r_i(s_t, a_t, s_{t+1}) \,\middle|\, a^i_t \sim \pi_i(\cdot \mid s_t),\; s_0 = s\right]$ (2)
For a multi-agent system, multiple agents will reach an equilibrium point. For any agent, no other strategy can be adopted to obtain a higher cumulative reward, as shown in (3). This equilibrium ensures that no single agent can unilaterally improve its cumulative reward by changing its strategy, assuming other agents’ strategies remain optimal.
$V^i_{\pi_{i,*},\, \pi_{-i,*}}(s) \;\ge\; V^i_{\pi_i,\, \pi_{-i,*}}(s), \quad \forall i \in \text{Agents}, \; \forall s$ (3)
In a multi-agent environment, each individual agent still interacts with the environment. The goal of the agent is to maximize the expected return starting from any initial state $s_0$, as shown in (4), where $\mathbb{E}_\pi$ denotes the mathematical expectation of the cumulative reward under policy $\pi$.
$\pi^*_t = \arg\max_{\pi} \mathbb{E}_\pi\!\left[\sum_{m=0}^{\infty} \gamma^m r_{t+m} \,\middle|\, s_t = s\right]$ (4)
where $\mathbb{E}_\pi$ represents the expected value under the strategy, $m$ indexes subsequent time periods, and $r_{t+m}$ denotes the reward at time $t+m$ given the current state $s_t = s$. $\pi^*_t$ denotes the optimal policy at time $t$, and $V^t_\pi(s)$ and $Q^t_\pi(s, a)$ are the time-dependent value functions, explicitly linked to the state $s_t$ and the rewards $r_{t+m}$ from time $t$ onwards. $V^t_\pi(s)$ denotes the value of state $s$ under strategy $\pi$, i.e., the expected cumulative reward obtained by following $\pi$ from state $s$ onwards, and is defined in Equation (5).
$V^t_\pi(s) = \max_{\pi} \mathbb{E}_\pi\!\left[\sum_{m=0}^{\infty} \gamma^m r_{t+m} \,\middle|\, s_t = s\right]$ (5)
Similarly, the action-value $Q^t_\pi(s, a)$ of taking action $a$ in state $s$ under strategy $\pi$ represents the expected return of taking action $a$ from state $s$ and subsequently following strategy $\pi$, as formalized in Equation (6).
$Q^t_\pi(s, a) = \max_{\pi} \mathbb{E}_\pi\!\left[\sum_{m=0}^{\infty} \gamma^m r_{t+m} \,\middle|\, s_t = s,\; a_t = a\right]$ (6)
In the single-agent case, the state is assumed to contain all information about the environment; in practice, however, a stochastic process may expose only part of this information. Multi-agent reinforcement learning is more challenging: when training one agent, the other agents are also considered part of the environment, and the agent's observation of the environment does not include their action information. During training, the strategies of the other agents keep changing, making the observations more uncertain. Such a process, whose observations do not strictly satisfy the Markov property, is modeled as a Partially Observable Markov Decision Process (POMDP) [16]. POMDPs are crucial in multi-agent systems because each agent's observations are limited to local information, and the actions of other agents introduce uncertainty into the environment, violating the full observability assumption of standard MDPs.
Information collection includes topology information, bandwidth, resource utilization, and latency. The collected network information is defined as follows: $N$ is the set of network nodes $n$, $L$ is the set of links $l$, $C$ is the set of link capacities, and the network topology environment is $G = (N, L, C)$. The data flow is defined as shown in Equation (7).
$f = \langle d_i, h_i, p_i, q_i, td_i \rangle$ (7)
where $d_i$ represents the source node of the data flow, $h_i$ the destination node, $p_i$ the start time of data transmission, $q_i$ the end time of the data flow, and $td_i$ the traffic demand. $L$ is the set of links ($k \in L$), $P_{ij}$ denotes the path from node $i$ to node $j$, and the average delay and throughput are calculated over all node pairs $(i, j)$, where $i, j \in \{1, \ldots, N\}$ are the source and destination node indices and $k$ indexes the links within path $P_{ij}$. The terms $d_k$ and $\tau_k$ denote the delay and throughput of link $k$, respectively. The calculation equations for average end-to-end delay and average throughput are given in Equations (8) and (9).
$\text{Average Delay} = \frac{1}{N^2} \sum_{i,j=1}^{N} \sum_{k \in P_{ij}} d_k$ (8)
$\text{Average Throughput} = \frac{1}{N^2} \sum_{i,j=1}^{N} \sum_{k \in P_{ij}} \tau_k$ (9)
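As a concrete illustration of Equations (8) and (9), the following sketch computes the two averages from per-link measurements. The dictionaries paths, link_delay, and link_throughput are hypothetical inputs; only the summation over node pairs and path links follows the equations above.

```python
# Hedged sketch of Equations (8) and (9): average end-to-end delay and
# throughput over all node pairs. The example data below is illustrative only.
def average_delay_and_throughput(paths, link_delay, link_throughput, num_nodes):
    """paths[(i, j)] is the list of links on path P_ij;
    link_delay[k] / link_throughput[k] are per-link measurements d_k / tau_k."""
    total_delay = 0.0
    total_throughput = 0.0
    for (i, j), links in paths.items():
        total_delay += sum(link_delay[k] for k in links)            # inner sum over k in P_ij
        total_throughput += sum(link_throughput[k] for k in links)
    norm = 1.0 / (num_nodes ** 2)                                    # 1 / N^2 factor
    return norm * total_delay, norm * total_throughput

# Tiny usage example with made-up values.
paths = {(1, 2): ["l1"], (1, 3): ["l1", "l2"]}
avg_d, avg_t = average_delay_and_throughput(
    paths, link_delay={"l1": 5.0, "l2": 7.0},
    link_throughput={"l1": 80.0, "l2": 60.0}, num_nodes=3)
```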

3. ERA-MADDPG Intelligent Routing Solution

3.1. ERA-MADDPG Intelligent Routing Algorithm

This paper extends the Deep Deterministic Policy Gradient (DDPG) algorithm [17,18] in DRL [19,20] to the multi-agent setting and combines it with the actor-critic framework [21]. The actor network and critic network of the actor-critic framework are the two networks of each agent in the MADDPG algorithm. The critic network evaluates the actions computed by the actor network to improve the actor's performance, while the actor network computes the action to take based on the observed state. To reduce correlation in the training data and increase training stability, a certain number of training experiences with better reward values are stored in the experience replay memory and read randomly by the critic network during updates. Compared with the DDPG algorithm, the MADDPG algorithm adopts distributed execution and centralized training, which effectively improves training efficiency, enhances learning stability, promotes collaboration, and improves the versatility and applicability of the algorithm. During the training phase, the actor network only collects its own observed data, while the critic network also collects the actions and observations of the other agents. The ERA-MADDPG algorithm architecture diagram is shown in Figure 3.
MADDPG builds a centralized critic network for each agent; the critic builds a shared value-function network that can access global information, including the global state and the actions of all agents. This enables each agent to consider the impact of other agents' behaviors on the global environment and thereby evaluate the value of its own actions more accurately. The output of the centralized critic is the value function $Q^\pi_i(x, a_1, a_2, \ldots, a_N)$ corresponding to each agent. $\pi_i$ represents the stochastic policy of agent $i$, mapping states to action probabilities, while $\mu_i$ is a deterministic policy that directly maps states to actions. For the intelligent routing task with $N$ agents, the joint routing strategy is defined as shown in Equation (10).
$\pi = \{\pi_1, \ldots, \pi_i, \ldots, \pi_N\}$ (10)
where $\pi_i$ represents the stochastic strategy of agent $i$. The action taken by the $i$-th agent under the observed parameter information is given by the corresponding policy function $\pi_i$. The information of the entire network environment is represented by $x$, and the gradient of the stochastic policy $\pi_i$ can be expressed by (11), where $\log$ denotes the natural logarithm, used to compute the log-likelihood term of the policy gradient and stabilize gradient updates during training. $J(\theta_i)$ denotes the expected cumulative reward for agent $i$ under policy $\pi_i$ with parameters $\theta_i$.
$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{s \sim \rho^\pi,\, a_i \sim \pi_i}\!\left[\nabla_{\theta_i} \log \pi_i(a_i \mid s_i)\, Q^\pi_i(x, a_1, \ldots, a_N)\right]$ (11)
where $Q^\pi_i(x, a_1, a_2, \ldots, a_N)$ is the evaluation function of the $i$-th agent; its inputs are the observed network information $x$ and the actions $a_i$ taken by each agent, and its output is the Q-value of the agent. $B$ is the batch size (the number of samples drawn from the replay buffer, $B = 128$), $D = 5000$ is the buffer capacity, and $j \in \{1, \ldots, B\}$ indexes the samples in the minibatch. The critic network is updated by minimizing the loss function in (12), which computes the Mean Squared Error (MSE) between its predicted Q-value and the target value $y^j$. This loss function ensures that the critic's evaluation aligns with the agent's long-term reward objectives.
$L(\theta_i) = \frac{1}{B} \sum_{j} \left( Q^{\mu}_i(x^j, a^j_1, \ldots, a^j_N) - y^j \right)^2$ (12)
The actor is updated using the sampled policy gradient in (13). The gradient for the deterministic policy $\mu_i$ omits the log-likelihood term, since $\mu_i$ directly outputs actions rather than action probabilities.
$\nabla_{\theta_i} J(\mu_i) = \frac{1}{B} \sum_{j} \nabla_{\theta_i} \mu_i(o^j_i)\, \nabla_{a_i} Q^{\mu}_i(x^j, a^j_1, a^j_2, \ldots, a^j_N) \Big|_{a^j_i = \mu_i(o^j_i)}$ (13)
where $y$ is the target value computed by the agent's target critic. The calculation of $y$ is shown in Equation (14).
$y = r_i + \gamma\, Q'^{\,\mu'}_i(x', a'_1, \ldots, a'_N) \Big|_{a'_k = \mu'_k(o_k)}$ (14)
In Equations (12)–(14), $\mu'$ denotes the action function of the target network, where $\mu' = \{\mu_{\theta'_1}, \ldots, \mu_{\theta'_N}\}$, and $\gamma$ is the discount factor. The algorithm uses target networks to speed up the learning of the evaluation and action functions: $\theta'_i$ is the action-function parameter of the $i$-th target network, and $Q'_i$ is the target evaluation function of the $i$-th agent.
Each element in the experience replay buffer $D$ is a four-tuple $(s_t, a, r, s_{t+1})$ recording the multi-agent experience, where $a = (a_1, a_2, \ldots, a_N)$ and $r = (r_1, r_2, \ldots, r_N)$. The action-value function of each agent is updated through backpropagation, as shown in Equation (15).
$L(\theta_i) = \mathbb{E}_{x, a, r, x'}\!\left[\left(Q^{\mu}_i(x, a_1, a_2, \ldots, a_N) - y\right)^2\right]$ (15)
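To make the critic update in Equations (12)–(15) concrete, the sketch below computes the target value y and the MSE loss with PyTorch. The CentralCritic module, the tensor shapes, and the sampled batch are illustrative assumptions; only the target computation r + γQ' and the squared-error loss follow the equations.

```python
# Hedged PyTorch sketch of the critic target and MSE loss in Eqs. (12)-(15).
# CentralCritic, the tensor shapes, and the sampled batch are assumptions.
import torch
import torch.nn as nn

class CentralCritic(nn.Module):
    """Centralized critic: takes global state x and the joint action of all agents."""
    def __init__(self, state_dim, joint_action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + joint_action_dim, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, x, joint_action):
        return self.net(torch.cat([x, joint_action], dim=-1))   # Q_i(x, a_1..a_N)

gamma, B = 0.95, 128                        # batch size from the paper; gamma assumed
critic = CentralCritic(state_dim=32, joint_action_dim=8)
target_critic = CentralCritic(state_dim=32, joint_action_dim=8)
target_critic.load_state_dict(critic.state_dict())

# Illustrative minibatch sampled from the replay buffer D.
x, joint_a = torch.randn(B, 32), torch.randn(B, 8)
r, x_next, joint_a_next = torch.randn(B, 1), torch.randn(B, 32), torch.randn(B, 8)

with torch.no_grad():                       # Eq. (14): y = r_i + gamma * Q'_i(x', a'_1..a'_N)
    y = r + gamma * target_critic(x_next, joint_a_next)
loss = ((critic(x, joint_a) - y) ** 2).mean()    # Eq. (12)/(15): MSE between Q and y
loss.backward()
```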
Since the strategy of each agent in the environment is iteratively updated, the strategy of a single agent can easily overfit to the strategies of the other agents. To alleviate this overfitting problem, for a single agent $i$, the algorithm's strategy $\mu_i$ is represented as a set of sub-strategies $\mu^{(k)}_i$ ($k$ is the sub-strategy index, $k \in \{1, \ldots, K\}$). In each training cycle, a sub-policy $\mu^{(k)}_i$ is sampled, and its interactions with the environment generate transitions stored in $D^{(k)}_i$. This ensures each sub-strategy maintains a dedicated experience buffer for focused learning. During the learning process, the maximization objective is the expected return over all sub-strategies, as shown in (16).
$J_e(\mu_i) = \mathbb{E}_{k \sim \mathrm{unif}(1, K),\, s \sim p^{\mu},\, a \sim \mu^{(k)}_i}\!\left[ r_i(s, a) \right]$ (16)
To mitigate overfitting to static agent strategies, ERA-MADDPG decomposes each agent's policy $\mu_i$ into a set of sub-policies $\mu^{(k)}_i$. During training, a random sub-policy is selected for each episode, forcing the agent to learn robust strategies that generalize across diverse interaction scenarios. The update gradient of each sub-strategy $\mu^{(k)}_i$ is computed using its dedicated experience buffer $D^{(k)}_i$, as shown in (17). $D^{(k)}_i$ is the experience subset corresponding to sub-strategy $\mu^{(k)}_i$, containing the transitions $(s_t, a, r, s_{t+1})$ generated by $\mu^{(k)}_i$, and the expectation $\mathbb{E}_{x, a \sim D^{(k)}_i}$ denotes averaging over samples drawn from $D^{(k)}_i$.
$\nabla_{\theta^{(k)}_i} J_e(\mu_i) = \frac{1}{K}\, \mathbb{E}_{x, a \sim D^{(k)}_i}\!\left[\nabla_{\theta^{(k)}_i} \mu^{(k)}_i(a_i \mid o_i)\, \nabla_{a_i} Q^{\mu_i}(x, a_1, \ldots, a_N) \Big|_{a_i = \mu^{(k)}_i(o_i)}\right]$ (17)
In Equations (16) and (17), $\{\mu^{(k)}_i\}$ is the set of sub-strategies, $Q^{\mu_i}$ is the evaluation function of $\mu_i$ under sub-strategy $\mu^{(k)}_i$, and $\theta = \{\theta_1, \ldots, \theta_N\}$ represents the policy parameters. A random process $\mathcal{N}_t$ is added to the deterministic policy output $\mu_{\theta}(o_i)$ for exploration, but the final link weights are generated by smoothing the noisy actions through the actor network, ensuring that they reflect the underlying network state rather than the randomness. The ERA-MADDPG algorithm process is shown in Algorithm 1.
Algorithm 1 ERA-MADDPG Algorithm Process
Input: Network state information (bandwidth, latency, and collected topology information) and agent actions.
Output: Link weights for routing decisions.
(1)  for episode = 1 to $M$ do
(2)     Initialize a random process $\mathcal{N}_t$ for action exploration;
(3)     Receive initial state $s_0$;
(4)     for $t = 1$ to max-episode-length do
(5)        for each agent $i \in \{1, \ldots, N\}$ do
(6)           Select action $a_i = \mu_{\theta_i}(o_i) + \mathcal{N}_t$;
(7)        end for
(8)        Execute joint action $a = (a_1, a_2, \ldots, a_N)$ and observe the reward $r$ and new state $s_{t+1}$;
(9)        Store $(s_t, a, r, s_{t+1})$ in replay buffer $D$;
(10)       $s_t \leftarrow s_{t+1}$;
(11)       Sample a random minibatch of $B$ samples $(x^j, a^j, r^j, s'^j)$ from $D$;
(12)       for each sample $j$ in the minibatch do
(13)          Randomly select sub-policy index $k \sim \mathrm{unif}(1, K)$;
(14)          Compute target value $y^j = r_i + \gamma\, Q'^{\,\mu'}_i(x'^j, a'_1, \ldots, a'_N)\big|_{a'_k = \mu'^{(k)}(o^j_k)}$;
(15)       end for
(16)       Update critic by minimizing the loss $L(\theta_i) = \frac{1}{B} \sum_j \big( Q^{\mu}_i(x^j, a^j_1, \ldots, a^j_N) - y^j \big)^2$;
(17)       Update actor using the sampled policy gradient $\nabla_{\theta_i} J(\mu_i) = \frac{1}{B} \sum_j \nabla_{\theta_i} \mu_i(o^j_i)\, \nabla_{a_i} Q^{\mu}_i(x^j, a^j_1, \ldots, a^j_N)\big|_{a^j_i = \mu_i(o^j_i)}$;
(18)    end for
(19)    Update target network parameters: $\theta'_i \leftarrow \tau \theta_i + (1 - \tau)\, \theta'_i$;
(20) end for
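Two steps of Algorithm 1 not covered by the earlier critic sketch, action selection with exploration noise (line 6) and the soft target-network update (line 19), can be sketched as follows. The Actor module, its dimensions, and the noise and τ values are illustrative assumptions.

```python
# Hedged sketch of Algorithm 1, lines (6) and (19): noisy action selection
# and the soft target-network update. Actor, noise_std, and tau are assumptions.
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a local observation o_i to a link-weight action mu_theta(o_i)."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, act_dim), nn.Sigmoid())   # keep weights in (0, 1)

    def forward(self, obs):
        return self.net(obs)

def select_action(actor, obs, noise_std=0.1):
    # Line (6): a_i = mu_theta_i(o_i) + N_t (Gaussian exploration noise).
    with torch.no_grad():
        a = actor(obs)
        action = a + noise_std * torch.randn_like(a)
    return action.clamp(0.0, 1.0)

def soft_update(target, source, tau=0.01):
    # Line (19): theta' <- tau * theta + (1 - tau) * theta'.
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.copy_(tau * sp.data + (1.0 - tau) * tp.data)

actor, target_actor = Actor(16, 4), Actor(16, 4)
target_actor.load_state_dict(actor.state_dict())
a = select_action(actor, torch.randn(16))
soft_update(target_actor, actor)
```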

3.2. ERA-MADDPG Intelligent Routing Interacts with the Environment

The ERA-MADDPG architecture employs a CNN for MADDPG network training. The MADDPG algorithm usually requires high-dimensional observation data as input, which contains information about the data-layer environment state. When processing such high-dimensional data, the original MADDPG neural network often suffers from high computational complexity and low training efficiency. This paper therefore adopts a CNN for high-dimensional data processing, which effectively accelerates the convergence of the algorithm. Moreover, the convolutional layers of a CNN share weight parameters, so the network can reuse the same weights to process different regions of the input data, reducing the number of network parameters and the model complexity. A CNN can better process high-dimensional observation data, extract effective features, and pass them as input to the MADDPG network for decision-making and value-function estimation. This enhances the representation ability of the network, enabling agents to better understand the environment state and learn appropriate strategies and value functions. The interaction between ERA-MADDPG intelligent routing and the environment is illustrated in Figure 4.
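A minimal sketch of the kind of CNN feature extractor described above is shown below, assuming the network state is arranged as a 2-D matrix of per-node/link features. The input layout, channel counts, and output dimension are assumptions rather than the paper's exact configuration; only the role (shared-weight convolution producing a compact feature vector for the actor and critic) follows the text.

```python
# Hedged sketch of a CNN state-feature extractor feeding the MADDPG networks.
# The input layout (1 x nodes x features), channel counts, and output size
# are assumptions.
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    def __init__(self, num_nodes=23, num_features=4, out_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),   # shared weights
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)))                           # fixed-size summary
        self.fc = nn.Linear(16 * 4 * 4, out_dim)

    def forward(self, state_matrix):
        # state_matrix: (batch, 1, num_nodes, num_features), e.g. delay,
        # bandwidth, loss, and utilization per node/link row.
        return self.fc(self.conv(state_matrix).flatten(1))

encoder = StateEncoder()
features = encoder(torch.randn(2, 1, 23, 4))   # -> (2, 64) feature vectors
```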
The intelligent agents interact with the data-layer environment through the controller layer. The state is extracted by the CNN and input into the decision network; actions are jointly generated by multiple agents and executed through flow tables; reward values are dynamically calculated based on real-time performance indicators. In the ERA-MADDPG intelligent routing algorithm, a CNN is employed to construct the decision-making model of each agent. This model enables the agent to autonomously carry out path planning and decision-making by learning the mapping relationships between states, actions, and reward values, while the mutual influence among multiple agents enables collaborative learning and decision-making. The ERA-MADDPG algorithm uses a reinforcement learning model consisting of the agents and the environment: the model collects the environmental state $S$, makes a decision $A$ based on $S$, and obtains the reward value $R$ according to the quality of the decision. The mapping process is as follows:
(1) State Mapping: The state mapping in multi-agent routing is a real-time mapping of the current network state, mapping the network topology structure, link bandwidth, throughput, switch resource utilization, link delay, packet loss rate, and other state information of the traffic into the neural network. Through state mapping, multi-agent routing can divert, optimize, and secure traffic according to different conditions. In addition to information input such as network link bandwidth, link throughput, link delay, packet loss rate, and traffic, the algorithm in this paper also allows multiple agents to interact with each other. Agents can perceive the status of other agents and collaborate in training, effectively improving the efficiency of algorithm training.
(2) Action Mapping: Action mapping describes how, in different states, the agent selects actions and the values or probabilities associated with those actions. These actions are determined based on the input state information and the cumulative reward values. Under different network information or traffic requirements, each link is assigned a different weight. In multi-agent routing, each agent needs to choose an optimal action based on the current state to achieve overall routing optimization. By learning the action mapping relationship, the agent can find the best action to take in a specific state to maximize its expected reward value. The link weight $W_i$ is dynamically calculated from multiple network features, including link delay $d_{ij}$, available bandwidth $b_{ij}$, and packet loss rate $l_{ij}$; specifically, $W_i$ is defined as a weighted sum of these features. This design ensures that the link weight reflects the comprehensive cost of data transmission, guiding the routing algorithm to select optimal paths (a small illustrative sketch is given after this list). The link weight vector is defined as shown in Equation (18).
$w\_link = [W_1, W_2, \ldots, W_i, \ldots, W_n]$ (18)
where $W_i$ represents the link weight. The weighted shortest path from the source node to the destination node is calculated based on these path weights: by continuously updating the distance array and selecting the node with the shortest distance (as in Dijkstra's algorithm), the shortest paths from the source node to all other nodes can be computed efficiently.
(3) Reward Value Mapping: The reward value provides feedback on the quality of the current action and is calculated from network performance indicators. The reward is set to $R = M(\text{throughput}, \text{delay}, \text{loss})$, where the performance indicators include throughput, link delay, and packet loss rate. The corresponding reward weights are adjusted according to the actual traffic demand, and the reward calculation is defined in (19), where $R$ is a dimensionless metric obtained by normalizing and weighting throughput, delay, and loss. The weights $n_1$, $n_2$, and $n_3$ are adjusted according to traffic demand to balance the impact of the performance indicators on multi-agent learning. In sensitive network environments where reliability is prioritized, increasing $n_3$ can effectively reduce packet loss by penalizing it more heavily during training.
$R = n_1 \cdot \text{throughput} + n_2 \cdot \frac{1}{\text{delay}} + n_3 \cdot \frac{1}{\text{loss}}, \quad n_1, n_2, n_3 \in (0, 1)$ (19)
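The sketch below illustrates the action and reward mappings above: computing link weights as a weighted sum of per-link features, running Dijkstra over the weighted graph, and evaluating the reward of Equation (19). The feature coefficients, the toy topology, and the normalization choices are assumptions; only the weighted-sum, shortest-path, and reward structure follow the text.

```python
# Hedged sketch of the action/reward mappings: weighted link costs, Dijkstra
# shortest paths over them, and the reward of Eq. (19). Coefficients, the toy
# topology, and the normalization are illustrative assumptions.
import heapq

def link_weight(delay, bandwidth, loss, c=(0.5, 0.3, 0.2)):
    # W_i as a weighted sum of link features: higher delay/loss and lower
    # bandwidth mean a higher routing cost (coefficients are assumptions).
    return c[0] * delay + c[1] * (1.0 / max(bandwidth, 1e-6)) + c[2] * loss

def dijkstra(adj, src):
    # adj[u] = list of (v, weight); returns shortest distances from src,
    # repeatedly settling the closest node as described in the text.
    dist = {u: float("inf") for u in adj}
    dist[src] = 0.0
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

def reward(throughput, delay, loss, n=(0.4, 0.3, 0.3)):
    # Eq. (19): R = n1*throughput + n2*(1/delay) + n3*(1/loss); inputs are
    # assumed pre-normalized, and small epsilons guard against division by zero.
    return n[0] * throughput + n[1] / max(delay, 1e-6) + n[2] / max(loss, 1e-6)

# Toy 4-node topology whose edge costs come from link_weight().
adj = {
    1: [(2, link_weight(5, 100, 0.01)), (3, link_weight(8, 100, 0.02))],
    2: [(1, link_weight(5, 100, 0.01)), (4, link_weight(6, 100, 0.01))],
    3: [(1, link_weight(8, 100, 0.02)), (4, link_weight(4, 100, 0.03))],
    4: [(2, link_weight(6, 100, 0.01)), (3, link_weight(4, 100, 0.03))],
}
distances = dijkstra(adj, src=1)
r = reward(throughput=0.75, delay=0.136, loss=0.04)
```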

4. Experimental Evaluation

4.1. Experimental Environment and Parameter Configuration

To validate the ERA-MADDPG algorithm, this paper employs the Mininet network simulation platform [15] and Ryu SDN controller [22], which are widely recognized in the SDN research community. Mininet is an open-source tool for rapid prototyping of software-defined networks, enabling lightweight emulation of large-scale network topologies with real kernel forwarding and interactive Command-Line Interface (CLI) management. The Ryu controller, as a leading open-source SDN controller, provides a robust framework for implementing network control logic through its northbound Application Programming Interface (API) and supporting dynamic topology management. The data layer adopts the basic topology of the GEANT network [23], as shown in Figure 5, which contains 23 network nodes and 37 bidirectional links. The GEANT network topology is divided into multiple autonomous domains. Intra-domain links connect nodes within the same domain, while inter-domain links connect different domains. The bandwidth of the intra-domain and inter-domain links is set to 100 Mbps. The controller layer of the 5 domains is controlled by the Ryu controller simulation, and the multi-agent layer adopts the ERA-MADDPG algorithm.
The experimental software environment consists of Ubuntu 18.04.1, Python 3.6.5, Torch 1.4.0 for building neural network models and performing efficient numerical calculations, the Gym 0.17.1 task environment, the Numpy 1.18.2 numerical dependency library, and the Statsmodels 0.11.1 library. The hardware environment consists of an i5-13600K CPU, 32 GB of DDR5 memory, and an RTX 4060 Ti 16 GB GPU. The ERA-MADDPG algorithm introduces randomness through two primary mechanisms: (1) Gaussian exploration noise $\mathcal{N}_t$ added to the deterministic actions during training, and (2) random sampling of experience tuples from the replay buffer. To mitigate the impact of this randomness on the results, all experiments were repeated 30 times with different initial random seeds. The ERA-MADDPG training steps, learning rate, discount factor, experience replay pool size, and other parameters are set as shown in Table 1.
Traffic load intensity (80%) represents the ratio of actual network traffic to the maximum capacity, simulating a high-load scenario to test the algorithm’s robustness under heavy traffic conditions. The experience replay pool is updated every 200 iterations to balance training stability and convergence speed, reducing the correlation between consecutive training samples.
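For reproducibility, the experiment-level settings mentioned in the text can be collected into a single configuration object, as in the hedged sketch below. Values not stated in the paper (learning rate, discount factor, noise scale) are placeholders marked as assumptions, not the values of Table 1.

```python
# Hedged experiment configuration sketch. Fields marked "assumed" are
# placeholders, not the paper's Table 1 settings.
from dataclasses import dataclass

@dataclass
class ExperimentConfig:
    episodes: int = 100                # training cycles (from the text)
    steps_per_episode: int = 1000      # steps per cycle (from the text)
    batch_size: int = 128              # B, minibatch size (from the text)
    replay_capacity: int = 5000        # D, replay buffer capacity (from the text)
    replay_update_interval: int = 200  # replay pool updated every 200 iterations
    traffic_load: float = 0.8          # 80% traffic load intensity
    num_runs: int = 30                 # independent runs with different seeds
    learning_rate: float = 1e-3        # assumed placeholder
    gamma: float = 0.95                # assumed placeholder discount factor
    noise_std: float = 0.1             # assumed placeholder exploration noise

cfg = ExperimentConfig()
```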

4.2. Performance Evaluation

To verify the performance of ERA-MADDPG, this paper uses a single-agent intelligent routing strategy based on deep reinforcement learning (Smart Routing based on Deep Reinforcement Learning, SR-DRL) [24] and a multi-agent learning algorithm based on a deep Q-network (Multi-Agent Deep Q-Network, MADQN) [6] as comparison objects for convergence and performance evaluation. The experiments compare topology node changes and link changes after the algorithms converge: the topology node changes are set to dynamic variations of approximately 5%, 10%, and 15% of the network nodes, and the link changes are likewise set to approximately 5%, 10%, and 15% of the links. Full connectivity within and between domains is guaranteed whenever the topology or links change. For each experimental condition, results are reported as the mean ± 95% confidence interval (CI) calculated via Student's t-test, based on 30 independent runs. Confidence intervals are derived from the sample mean and standard deviation (SD), ensuring statistical significance at the α = 0.05 level.
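A hedged sketch of the mean ± 95% CI computation described above is given below, using SciPy's Student-t quantile. The sample array is illustrative; the formula (mean ± t·SD/√n) is the standard one implied by the text.

```python
# Hedged sketch of the mean +/- 95% confidence interval over 30 runs,
# using the standard Student-t formula implied by the text.
import numpy as np
from scipy import stats

def mean_ci(samples, confidence=0.95):
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    mean = samples.mean()
    sem = samples.std(ddof=1) / np.sqrt(n)              # SD / sqrt(n)
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return mean, t_crit * sem                            # mean and half-width

# Illustrative: delays (ms) from 30 hypothetical runs.
rng = np.random.default_rng(0)
delays = rng.normal(loc=136.0, scale=2.0, size=30)
m, half_width = mean_ci(delays)
print(f"{m:.1f} ms +/- {half_width:.1f} ms (95% CI)")
```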
Dynamic changes in network nodes simulate real-world scenarios such as device failures, new node deployments, or topology reconfiguration due to traffic demand shifts. These changes introduce uncertainty into the network environment, challenging routing algorithms to adapt quickly and maintain optimal performance. By evaluating the ERA-MADDPG algorithm under such dynamic conditions, we aim to quantify its resilience, defined as the ability to re-converge to near-optimal routing policies within minimal training cycles while minimizing performance degradation (e.g., delay, packet loss). This aligns with the core objective of resilient routing: ensuring continuous service quality despite unexpected topology disruptions.
(1)
Convergence comparison of the ERA-MADDPG algorithm, MADQN algorithm, and SR-DRL algorithm
The training processes of the ERA-MADDPG, MADQN, and SR-DRL algorithms are shown in Figure 6. The entire training process consists of 100 cycles, and each cycle is divided into 1000 steps. Training is considered converged when the cumulative reward fluctuates within ±5% of its moving average over 10 consecutive cycles and the standard deviation of the reward across episodes is less than 0.1% of the maximum reward value; convergence speed is quantified by the number of cycles needed to reach this stable state. As shown in Figure 6, the ERA-MADDPG algorithm first stabilizes around the 53rd cycle, the MADQN algorithm around the 67th cycle, and the SR-DRL algorithm around the 87th cycle. The convergence speeds of the algorithms are compared in Table 2. This translates to a 20.9% speed improvement over MADQN and a 39.1% improvement over SR-DRL in the initial training phase. The ERA-MADDPG algorithm outperforms MADQN and SR-DRL for two reasons. First, it builds on DDPG to optimize continuous action spaces via policy gradients, which is well suited to network traffic scheduling. Second, agents store experiences in a shared replay pool, which accelerates collaborative learning, and multi-agent collaborative training makes convergence faster. The advantage of the MADDPG and MADQN algorithms over the single-agent algorithm lies mainly in their ability to achieve collaborative learning, experience sharing, and coordinated cooperation among multiple agents, making them better suited to network traffic scheduling scenarios.
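The convergence criterion above (reward within ±5% of a 10-cycle moving average) can be checked programmatically, as in the hedged sketch below. The reward series and the window-handling details are illustrative.

```python
# Hedged sketch of the convergence check used above: the reward must stay
# within +/-5% of its moving average for 10 consecutive cycles.
import numpy as np

def converged_at(rewards, window=10, tol=0.05):
    rewards = np.asarray(rewards, dtype=float)
    for end in range(window, len(rewards) + 1):
        recent = rewards[end - window:end]
        moving_avg = recent.mean()
        if np.all(np.abs(recent - moving_avg) <= tol * np.abs(moving_avg)):
            return end          # cycle index at which training is stable
    return None                 # never converged within the given cycles

# Illustrative reward curve: noisy ramp that flattens out.
cycles = np.arange(100)
rewards = 1.0 - np.exp(-cycles / 20.0) + 0.01 * np.random.default_rng(1).standard_normal(100)
print("converged at cycle:", converged_at(rewards))
```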
(2)
Comparison of recovery training with topology node changes of 5%, 10%, and 15%
In order to verify the ability of the ERA-MADDPG algorithm to adapt to the new network environment and maintain efficient routing decision-making in the face of network changes, in this paper, the data layer network topology nodes are dynamically changed by 5%, 10%, and 15%, respectively, and then the ERA-MADDPG algorithm, MADQN algorithm, and SR-DRL algorithm are trained again. When the topological nodes change dynamically by 5%, 10%, and 15%, the training processes of the ERA-MADDPG algorithm, MADQN algorithm, and SR-DRL algorithm tend to be stable again, as shown in Figure 7a–c. When topology changes occur (e.g., 5% node failures), the algorithm initializes the retraining process with the optimal policy from the previous topology, rather than a random policy. This warm-start mechanism reduces the search space for new optimal routes, as shown in Figure 7. The comparison of convergence speed after topological node changes is shown in Table 3.
As shown in Figure 7, the reward values are dimensionless and normalized according to Equation (19), which integrates throughput, delay, and packet loss metrics; higher values indicate better routing performance. The ERA-MADDPG algorithm maintains low re-convergence times across varying node-change magnitudes: as node changes increase from 5% to 15%, its re-convergence rises from 10 cycles to around 21–22 cycles. Over the same range, the MADQN algorithm degrades from 22 to 35 cycles and SR-DRL from 38 to 60 cycles, so the absolute re-convergence time of ERA-MADDPG remains well below both baselines at every change level. This indicates that ERA-MADDPG maintains better scalability under severe topology perturbations.
The results in Figure 7a–c demonstrate that under dynamic topology changes of 5%, 10%, and 15% node variations, the ERA-MADDPG algorithm reconverges at approximately the 10th, 22nd, and 21st training cycles, respectively; the MADQN algorithm achieves reconvergence at around the 22nd, 33rd, and 35th cycles; the SR-DRL algorithm stabilizes at about the 38th, 59th, and 60th cycles for corresponding change levels. The ERA-MADDPG algorithm’s faster re-convergence under node changes (e.g., 10 cycles for 5% node loss) directly reflects its multi-agent collaborative mechanism. When nodes fail, agents dynamically reroute traffic by exchanging topology change information through the SDN controller layer. This distributed decision-making avoids centralized bottlenecks, unlike MADQN, which relies on a single critic network and exhibits delayed adaptation (22 cycles for 5% node changes).
The ERA-MADDPG algorithm has better convergence speed than the MADQN algorithm and the SR-DRL algorithm when facing dynamic changes of 5%, 10%, and 15% in network nodes. The main reason is that the MADDPG algorithm can better cope with the cooperation situation in the multi-agent system by collaboratively training multiple agents. Different agents collaborate to optimize the performance of the actor-critic network. The actor-critic structure is more stable during the training process, which helps to solve the instability problem in the multi-agent system. When faced with dynamic changes, the advantages of multi-agents are significantly greater than those of single-agents. This is mainly because multi-agent algorithms can achieve distributed decision-making, collaboration, and competition with a certain degree of fault tolerance when faced with dynamic changes in network nodes, thereby better adapting to changes in complex network environments.
(3)
Comparison of recovery training for link changes of 5%, 10%, and 15%
In dynamic networks, link changes may be caused by network topology adjustments, equipment failures, or excessive link loads. Facing link changes, it is verified again that the ERA-MADDPG algorithm can adapt to new network environments and maintain efficient routing decision-making. This paper sets the dynamic changes of the links between network topology nodes to 5%, 10%, and 15%, respectively, and retrains the ERA-MADDPG, MADQN, and SR-DRL algorithms under each level of link change. The results show that all three algorithms stabilize again after the link changes; the training processes are shown in Figure 8a–c, and the comparison of convergence speed after link changes is shown in Table 4.
Figure 8 demonstrates that the ERA-MADDPG algorithm adapts faster to link changes than node changes: re-convergence cycles increase from 12 (5% link changes) to 22 (15% link changes), an 83% increase, while MADQN shows a 113% increase (15 → 32 cycles). This difference may stem from the algorithm’s ability to leverage link-state adjacency information in the CNN-based state mapping, which is more directly applicable to link change scenarios. Figure 8 highlights the algorithm’s resilience to link dynamics, such as bandwidth fluctuations or partial outages. For 15% link changes, ERA-MADDPG adjusts link weights within 22 cycles by leveraging CNN-based state mapping to detect real-time link congestion. This contrasts with SR-DRL, a single-agent approach that struggles to balance global traffic distribution, resulting in longer re-convergence (54 cycles). The multi-agent framework’s ability to jointly optimize inter-domain and intra-domain routes is critical here.
The results in Figure 8a–c show that when the network links change dynamically by 5%, 10%, and 15%, the ERA-MADDPG algorithm reconverges around the 12th, 13th, and 22nd training cycles, respectively. The MADQN algorithm converges around the 15th, 23rd, and 32nd cycles under the same link changes, while the SR-DRL algorithm does so around the 35th, 42nd, and 54th cycles. The collaborative training of multiple agents in MADDPG makes it converge faster. At the same time, the CNN exploits its strength in high-dimensional data processing and effectively accelerates the convergence of the algorithm: the convolutional layers share weight parameters, allowing the network to reuse the same weights across different regions of the input data, which reduces the number of network parameters and the model complexity. The faster re-convergence in dynamic topology scenarios (Figures 7 and 8) stems from two key factors: (1) experience transfer, i.e., reusing valid routing decisions from the previous topology via the replay buffer, and (2) multi-agent coordination, in which agents collaboratively update policies by exchanging topology-change information, reducing redundant exploration compared to the single-agent SR-DRL. Comparing Figure 7 and Figure 8, the ERA-MADDPG algorithm achieves comparable or faster re-convergence in link-change scenarios (12–22 cycles) than in node-change scenarios (10–22 cycles), most notably at the 10% change level (13 vs. 22 cycles). This is likely because link changes preserve node connectivity, allowing the algorithm to reuse more of its prior routing policies than under node failures, which require more extensive path reconfiguration.
Across both Figure 7 and Figure 8, ERA-MADDPG demonstrates a 25–50% faster re-convergence rate compared to baselines as the topology change severity increases (5–15%). This scalability is rooted in its three-layer SDN architecture, which separates policy generation (multi-agent layer) from topology management (controller layer), enabling rapid adaptation without full retraining. In contrast, single-agent and centralized algorithms face exponential complexity growth with increasing changes, underscoring the necessity of multi-agent collaboration for resilient routing.
(4)
Comparison of the performance of the ERA-MADDPG algorithm in terms of delay, throughput, and packet loss rate
After the convergence of the algorithm is verified, this paper tests the algorithm’s performance. Testing includes statistical analysis of latency, throughput, and packet loss rate after algorithm convergence following 5%, 10%, and 15% changes in topology nodes and links. The average values are then calculated. The performance comparison is shown in Table 5.
As shown in Table 5, the ERA-MADDPG algorithm achieves 76 Mbps throughput after retraining under node changes, which is comparable to its initial training performance (75 Mbps under link changes), demonstrating that experience transfer maintains high efficiency without full random initialization. Simulation results, averaged over 30 independent runs with 95% confidence intervals, demonstrate that the ERA-MADDPG algorithm consistently outperforms baseline methods in convergence speed and routing stability under dynamic topology changes. The ERA-MADDPG algorithm outperforms the SR-DRL algorithm and the MADQN algorithm in performance while ensuring a better adaptation to dynamic changes in network topology and accelerating the retraining rate. The main reason is that the ERA-MADDPG algorithm can use other agents’ observations and strategies to assist each other in training. Learning other agents’ strategy models enables the ERA-MADDPG algorithm to converge better than other multi-agent learning algorithms, with faster convergence speed and better training reward value.

4.3. Performance Analysis

In the performance evaluation, the ERA-MADDPG algorithm demonstrates competitive results in terms of delay, throughput, and packet loss rate compared with MADQN and SR-DRL. Specifically, under 5–15% topological node or link changes (representing extreme scenarios such as simultaneous multi-node failures or large-scale link reconfiguration), the average delay of ERA-MADDPG ranges from 136 ms to 139 ms, while the packet loss rate remains below 6%. These change amplitudes simulate critical network disruptions that require rapid routing adaptation. To assess whether these values are tolerable in practical applications, we compare them with existing SDN-based systems. For example, in the study of Intrusion Detection Systems (IDS) based on SDN, the authors reported that the Ryu controller introduces a processing delay of approximately 100–150 ms in traffic monitoring scenarios [25]. Our proposed algorithm’s delay falls within this range, indicating that the additional latency introduced by multi-agent collaboration and CNN-based processing is comparable to typical SDN controller overheads and is thus acceptable for most real-time networking applications, such as non-critical enterprise networks or delay-tolerant IoT systems.
Notably, the packet loss rate of ERA-MADDPG is significantly lower than that of the single-agent SR-DRL algorithm (7% under node changes), which benefits from multi-agent collaborative learning that optimizes global routing decisions. Although higher packet loss rates (6% under link changes) may occur in extreme scenarios, they can be mitigated by adjusting reward function weights (e.g., increasing the penalty for packet loss in Equation (19)) to prioritize reliability in sensitive networks. Furthermore, the trade-off between convergence speed and delay is worth emphasizing. To further explore the impact of packet loss penalty, we increased the reward weight n 3 for packet loss in Equation (19) from its default value (0.3) to 0.6, while proportionally adjusting n 1 and n 2 to maintain the normalization. The result showed a notable reduction in average packet loss rate—from 4.0% to 2.3%—at the cost of a slight increase in average delay (from 136 ms to 140 ms). This validates that reward-weight tuning can serve as an effective mechanism to prioritize reliability in scenarios where low packet loss is critical. ERA-MADDPG achieves faster re-convergence (over 25% improvement) than comparative algorithms when facing topology changes, which is critical for maintaining network stability. In dynamic environments where rapid adaptation to failures or traffic shifts is prioritized, the algorithm’s delay and loss rates represent a reasonable compromise for enhanced resilience. In summary, the delay and packet loss introduced by ERA-MADDPG are within tolerable limits for SDN-based routing systems, especially considering its superior convergence efficiency and multi-agent collaboration advantages.
It is worth noting that a 15% topology change represents an extreme stress test scenario, equivalent to sudden failures of multiple core nodes or intentional network reconfiguration under high-traffic loads. While such scenarios are rare in routine operations, they are critical for verifying the algorithm’s resilience in emergency situations (e.g., natural disasters or cyber-attacks). For typical network dynamics (e.g., 1–5% node/link fluctuations), the algorithm demonstrates even better performance metrics, as validated in supplementary tests.

5. Conclusions

The ERA-MADDPG algorithm addresses the challenge of dynamic topology changes in SDN-based routing through a novel three-layer architecture and multi-agent deep reinforcement learning. Integrating the actor-critic framework with a CNN effectively improves training efficiency, enhances learning stability, promotes collaboration, and improves the versatility and applicability of the algorithm. The simulation experiments show that the ERA-MADDPG routing algorithm converges better than the MADQN and SR-DRL algorithms: its initial training speed is 20.9% faster than that of MADQN and 39.1% faster than that of SR-DRL, while its re-convergence speed under 5–15% topology changes is over 25% faster for both node and link variations. These metrics validate that the combination of the actor-critic framework and the CNN not only enhances training efficiency but also strengthens the algorithm's resilience to dynamic network environments.
In future work, we plan to explore the impact of whitening processing on the algorithm’s performance, especially in scenarios with drastic topology changes or multi-scale feature inputs, while addressing computational complexity and preserving feature semantics.

Author Contributions

Conceptualization by W.H., H.L. and Y.L.; methodology by W.H., H.L. and L.M.; software by H.L. and Y.L.; validation by W.H., H.L. and Y.L.; formal analysis by W.H. and H.L.; investigation by H.L. and Y.L.; data curation by W.H. and L.M.; writing—original draft preparation by W.H. and H.L.; writing—review and editing by W.H., H.L., Y.L. and L.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Project of Science and Technology in Henan Province (No.252102211085, No.252102211105), The Key Field Special Project of Guangdong Province (No.2021ZDZX1098), The China University Research Innovation Fund (No.2021FNB3001, No.2022IT020), Shenzhen Science and Technology Innovation Commission Stable Support Plan (No.20231128083944001), and Henan Provincial Colleges and Universities Key Scientific Research Project Plan (No. 24A520042).

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy restrictions.

Conflicts of Interest

The authors declare no competing interests relevant to the content of this article.

Figure 1. Three-layer architecture of SDN.
Figure 2. Schematic diagram of multi-agent interaction with the environment.
Figure 3. ERA-MADDPG algorithm architecture.
Figure 4. Mapping process.
Figure 5. GEANT network topology.
Figure 6. Algorithm convergence comparison plot.
Figure 7. Training comparison after dynamic changes of topological nodes.
Figure 8. Comparison of training on dynamic changes of network links.
Table 1. Simulation experiment parameter configuration.

| Experimental Parameter | Parameter Value |
|---|---|
| Training steps T of the algorithm | 100,000 |
| Actor-Critic learning rate | 0.0001 |
| Reward value discount factor | 0.9 |
| Experience replay pool size | 5000 |
| Traffic load intensity | 80% |
| Reward value weighting parameters n₁, n₂, and n₃ | 0–1 |
| Experience replay unit update iteration steps | 200 |
| Training batch size | 128 |
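For reference, the parameter configuration in Table 1 could be expressed as a simple training configuration. The dictionary below is a hedged sketch: the key names, and the default split of the reward weights, are chosen for illustration and are not taken from the authors' code.

```python
# Hypothetical training configuration mirroring Table 1; key names are illustrative.
CONFIG = {
    "training_steps": 100_000,          # total training steps T
    "actor_critic_lr": 1e-4,            # Actor-Critic learning rate
    "gamma": 0.9,                       # reward discount factor
    "replay_buffer_size": 5_000,        # experience replay pool size
    "traffic_load": 0.80,               # traffic load intensity
    "reward_weights": (0.4, 0.3, 0.3),  # n1, n2, n3 in [0, 1]; split is assumed
    "replay_update_interval": 200,      # experience replay unit update iteration steps
    "batch_size": 128,                  # training batch size
}
```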
Table 2. Convergence speed comparison of algorithms.

| Algorithm | First Stabilization Cycle (95% CI) | Corresponding Training Steps (95% CI) | Delay Cycles Compared to ERA-MADDPG |
|---|---|---|---|
| ERA-MADDPG | 53 ± 2.1 | 53,000 ± 1800 | - |
| MADQN | 67 ± 3.5 | 67,000 ± 2300 | +14 |
| SR-DRL | 87 ± 4.8 | 87,000 ± 3100 | +34 |
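The initial-training-speed improvements quoted in the text follow directly from the first-stabilization cycles in Table 2, assuming the improvement is measured as the relative reduction in stabilization cycles. The short check below reproduces the 20.9% and 39.1% figures.

```python
# Relative speed-up of ERA-MADDPG's first stabilization versus the baselines (Table 2).
era, madqn, srdrl = 53, 67, 87   # first stabilization cycles (mean values)

improvement_vs_madqn = (madqn - era) / madqn   # (67 - 53) / 67 ≈ 0.209
improvement_vs_srdrl = (srdrl - era) / srdrl   # (87 - 53) / 87 ≈ 0.391

print(f"{improvement_vs_madqn:.1%}, {improvement_vs_srdrl:.1%}")  # 20.9%, 39.1%
```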
Table 3. Re-convergence speed comparison after topological node changes.

| Change Magnitude | Algorithm | Re-Convergence Cycle (Training Cycles) | Delay Cycles Compared to ERA-MADDPG |
|---|---|---|---|
| 5% node change | ERA-MADDPG | 10 ± 0.8 | - |
| 5% node change | MADQN | 22 ± 1.5 | +12 |
| 5% node change | SR-DRL | 38 ± 2.3 | +28 |
| 10% node change | ERA-MADDPG | 22 ± 1.2 | - |
| 10% node change | MADQN | 33 ± 2.1 | +11 |
| 10% node change | SR-DRL | 59 ± 3.7 | +37 |
| 15% node change | ERA-MADDPG | 21 ± 1.3 | - |
| 15% node change | MADQN | 35 ± 2.4 | +14 |
| 15% node change | SR-DRL | 60 ± 4.2 | +39 |
Table 4. Re-convergence speed comparison after link changes.

| Change Magnitude | Algorithm | Re-Convergence Cycle (Training Cycles) | Delay Cycles Compared to ERA-MADDPG |
|---|---|---|---|
| 5% link change | ERA-MADDPG | 12 ± 0.9 | - |
| 5% link change | MADQN | 15 ± 1.3 | +3 |
| 5% link change | SR-DRL | 35 ± 2.7 | +23 |
| 10% link change | ERA-MADDPG | 13 ± 1.1 | - |
| 10% link change | MADQN | 23 ± 1.8 | +10 |
| 10% link change | SR-DRL | 42 ± 3.2 | +29 |
| 15% link change | ERA-MADDPG | 22 ± 1.6 | - |
| 15% link change | MADQN | 32 ± 2.5 | +10 |
| 15% link change | SR-DRL | 54 ± 4.1 | +32 |
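The "over 25% faster re-convergence" claim can likewise be checked against the mean cycles in Tables 3 and 4, again assuming the improvement is the relative reduction in re-convergence cycles. The sketch below computes that reduction for each change magnitude and baseline.

```python
# Mean re-convergence cycles from Tables 3 and 4: (ERA-MADDPG, MADQN, SR-DRL).
cases = {
    "5% node":  (10, 22, 38), "10% node": (22, 33, 59), "15% node": (21, 35, 60),
    "5% link":  (12, 15, 35), "10% link": (13, 23, 42), "15% link": (22, 32, 54),
}
for name, (era, madqn, srdrl) in cases.items():
    print(name,
          f"vs MADQN: {(madqn - era) / madqn:.0%}",
          f"vs SR-DRL: {(srdrl - era) / srdrl:.0%}")
# All individual entries except 5% link change vs. MADQN (20%) exceed 25%,
# and the average reduction against each baseline is well above 25%.
```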
Table 5. Performance comparison of the ERA-MADDPG algorithm, MADQN algorithm, and SR-DRL algorithm.

| Algorithm | Average Throughput (Mbps) ± SD | Average Delay (ms) ± SD | Average Packet Loss Rate (%) ± SD |
|---|---|---|---|
| ERA-MADDPG | 76 ± 2.1 | 136 ± 4.3 | 4.0 ± 0.8 |
| MADQN | 73 ± 3.5 | 141 ± 5.2 | 5.0 ± 1.1 |
| SR-DRL | 67 ± 4.8 | 163 ± 6.7 | 7.0 ± 1.5 |