Article

RS-MADDPG: Routing Strategy Based on Multi-Agent Deep Deterministic Policy Gradient for Differentiated QoS Services

1 Transmission Operation and Inspection Center, State Grid Zhengzhou Electric Power Supply Company, Zhengzhou 450007, China
2 DC Branch, State Grid Henan Electric Power Company, Zhengzhou 450052, China
3 College of Big Data and Artificial Intelligence, Zhengzhou University of Economics and Business, Zhengzhou 450099, China
4 College of Electronics and Communication Engineering, Shenzhen Polytechnic University, Shenzhen 518005, China
5 College of Software Engineering, Zhengzhou University of Light Industry, Zhengzhou 450007, China
* Author to whom correspondence should be addressed.
Future Internet 2025, 17(9), 393; https://doi.org/10.3390/fi17090393
Submission received: 31 July 2025 / Revised: 24 August 2025 / Accepted: 27 August 2025 / Published: 29 August 2025

Abstract

As network environments become increasingly dynamic and users’ Quality of Service (QoS) demands grow more diverse, efficient and adaptive routing strategies are urgently needed. However, traditional routing strategies suffer from limitations such as poor adaptability to fluctuating traffic, lack of differentiated service handling, and slow convergence in complex network scenarios. To this end, we propose a routing strategy based on multi-agent deep deterministic policy gradient for differentiated QoS services (RS-MADDPG) in a software-defined networking (SDN) environment. First, network state information is collected in real time and transmitted to the control layer for processing. Then, the processed information is forwarded to the intelligent layer. In this layer, multiple agents cooperate during training to learn routing policies that adapt to dynamic network conditions. Finally, the learned policies enable agents to perform adaptive routing decisions that explicitly address differentiated QoS requirements by incorporating a custom reward structure that dynamically balances throughput, delay, and packet loss according to traffic type. Simulation results demonstrate that RS-MADDPG achieves convergence approximately 30 training cycles earlier than baseline methods, while improving average throughput by 3%, reducing latency by 7%, and lowering packet loss rate by 2%.

1. Introduction

In recent years, the scale of electric power information networks has been expanding, and intelligent inspection devices have been widely adopted, resulting in the generation of massive inspection data and dynamic changes in traffic demands [1,2]. The expansion of network scale and the variability of traffic types impose higher requirements on network management and routing strategies. For instance, surveillance video or audio streams require minimal network latency to ensure smooth transmission and a good user experience, while frequent transmission of file data demands higher throughput and a lower packet loss rate to avoid interruptions or data loss [3]. Traditional routing algorithms struggle to meet the diverse needs of different types of traffic, such as high throughput, low latency, and low packet loss rates [4,5]. To further illustrate this limitation, take the widely used Open Shortest Path First (OSPF) algorithm as a typical example. As shown in Figure 1, assuming that all link weights in the network topology are equal to 1, the algorithm selects routes solely based on static link weights and prioritizes the shortest path between source and destination nodes, completely disregarding the heterogeneous QoS requirements of different traffic types. When latency-sensitive traffic from T1 and packet loss-sensitive traffic from T2 are simultaneously transmitted to T4, both are forced onto the shortest path through link S1–S3 due to OSPF's static decision-making. This leads to severe bandwidth competition on S1–S3, resulting in excessive latency (exceeding 200 ms in extreme cases) and packet loss, which directly degrades the quality of delay-sensitive services (e.g., surveillance video) and packet loss-sensitive services (e.g., file transfers). Such defects of traditional routing algorithms highlight the urgency of developing adaptive routing strategies that can dynamically respond to differentiated QoS demands.
To enhance Quality of Service (QoS) [6], intelligent routing algorithms based on Software-Defined Networking (SDN) [7] have been studied and applied [8]. In conventional algorithms, each router possesses autonomous decision-making capabilities and lacks centralized control over the data plane, making it challenging to meet strict requirements for latency, packet loss rates, and throughput from users. SDN technology addresses this limitation by separating the control plane from the data plane, enabling more flexible centralized control and programming of the network. In the SDN architecture, the controller collects network device status information, traffic information, and other relevant data centrally. After processing, this information is transmitted to the intelligent layer. In the intelligent layer, the system centrally controls the traffic flow in the data plane based on network information, and dynamically selects routing paths to enhance service quality and user experience.
Currently, intelligent routing algorithms for the SDN architecture can be classified into single-agent and multi-agent routing algorithms [9,10]. Single-agent routing based on SDN offers advantages such as centralized control and programmability. However, it still faces limitations in differentiated service capability and carries the risk of a single point of failure. For instance, if the central intelligent agent fails or loses its connection to the control layer, the entire network may become paralyzed. Additionally, the single-agent approach exhibits deficiencies in handling dynamic traffic [11]. To overcome these deficiencies, intelligent routing algorithms based on multi-agent systems have been researched and applied. Multi-agent routing decentralizes the routing decision-making process across multiple agents. Each agent can independently make routing decisions based on its own location and management scope, effectively mitigating the single-point-of-failure risk. Multi-agent routing can also rapidly identify user requirements and formulate traffic strategies according to different traffic demands [12]. For instance, in the case of sudden traffic surges or network topology changes, the exchange of status information, traffic demands, and routing strategies among the agents enables them to reach consensus through coordination. This provides a more efficient and flexible approach to network path selection [9]. However, existing multi-agent routing methods are typically optimized for a single type of network traffic and often overlook the heterogeneous QoS requirements associated with diverse traffic types [13]. In practical networking environments, services such as video streaming, voice communication, and file transfer often coexist. Specifically, video and voice communications are highly sensitive to latency and jitter, whereas file transfers require high bandwidth and minimal packet loss. Failure to accommodate such varied requirements can significantly degrade the quality of service for critical applications, thereby constraining the adaptability and robustness of routing strategies in complex network environments [14,15].
To address these issues, we propose a routing strategy based on multi-agent deep deterministic policy gradient (RS-MADDPG) under the SDN architecture. The RS-MADDPG system encompasses information collection, state input, multi-agent policy generation, multi-agent policy output, and flow table installation processes. It adopts a decentralized training method to effectively address differentiated service requirements for different traffic types. The main contributions of this paper are as follows:
  • We propose a routing strategy based on multi-agent deep deterministic policy gradient (RS-MADDPG), which optimizes the agent strategy by updating the actor network and the critic function network through multi-agent collaborative training, realizes network dynamic information collection and agent information interaction, and effectively finds the optimal solution to meet the dynamic needs of different sensitive flows.
  • We integrate multiple information such as topology and traffic to form a multi-dimensional state input to the neural network, and design action mapping based on dynamic link weights. Furthermore, we design a custom reward structure based on the characteristics of different QoS traffic flows (throughput-sensitive, latency-sensitive, and packet loss-sensitive) and the demands of complex network scenarios. By dynamically adjusting link weights and optimizing feedback, we enhance algorithm performance to meet the differentiated QoS requirements of various traffic types.
  • We design experimental comparisons under different service types of traffic. The results show that RS-MADDPG has better convergence and training reward values than the baseline method. In addition, RS-MADDPG improves the average throughput performance by about 3%, reduces the average delay by about 7%, and reduces the average packet loss rate by about 2%. Experiments on Fat-Tree, NSFNET topologies, and random traffic patterns demonstrate its strong generalization capability, with performance fluctuations within 5% across scenarios, validating its applicability to diverse network environments.
The rest of the paper is organized as follows. Section 2 reviews the state of the art in agent-based routing algorithms, Section 3 presents the RS-MADDPG system architecture, Section 4 designs the intelligent routing algorithm under multiple controllers, Section 5 reports the simulation experiments, and Section 6 concludes the paper.

2. Related Work

Existing SDN intelligent routing research can be categorized into two types: single-agent routing and multi-agent routing. The single-agent routing approach relies on a central controller to make global routing decisions, emphasizing centralized optimization and learning capabilities. In contrast, the multi-agent routing approach achieves network state awareness and dynamic routing through distributed collaboration among multiple agents, providing enhanced adaptability and robustness. Both approaches offer distinct advantages and are widely applied in network environments with varying scales and complexities.

2.1. Single-Agent Routing Algorithm

Traditional routing algorithms are predominantly based on static rules or limited dynamic update mechanisms, making them unsuitable for complex and dynamic network environments [1]. With the emergence of SDN, researchers have introduced single-agent intelligent routing methods to facilitate dynamic optimization and fine-grained management of network traffic via autonomous perception and learning enabled by the controller [16].
Pei et al. [17] proposed a reconfigurable traffic engineering (EfficientTE) method, which integrates SDN with Deep Reinforcement Learning (DRL) [18,19] techniques. EfficientTE identifies key congested links, introduces a dynamic virtual capacity adjustment mechanism, and utilizes the weighted K-shortest path algorithm to selectively reroute critical flows. Experimental results demonstrate that its load balancing performance across four network topologies improves by 6.13% to 22.81% compared to traditional solutions. However, the optimization is primarily global-average and does not explicitly model per-flow differentiated QoS objectives, and its responsiveness under rapid traffic phase shifts is limited. He et al. [20] introduced a message-passing deep reinforcement learning (MPDRL) method that leverages Graph Neural Networks (GNNs) [21] to interact with network topology and achieve effective traffic load balancing. Experiments confirm that MPDRL outperforms baseline algorithms in ISP network topologies. However, it does not deliver targeted QoS guarantees for specific flows and lacks mechanisms to quickly adapt to heterogeneous service demands.
Song et al. [22] proposed a routing optimization architecture based on SDN and the Deep Deterministic Policy Gradient (DDPG) algorithm. This method separates the SDN control and forwarding functions and constructs reward functions based on multi-dimensional QoS indicators such as delay and bandwidth, guiding the algorithm to learn optimal routing strategies. Experimental evaluations demonstrate that this approach significantly reduces network delay and enhances transmission efficiency compared to traditional shortest-path algorithms. Nevertheless, the reward design does not separate heterogeneous flow priorities, and the centralized learning approach may limit scalability and robustness under controller failures. Beyond routing optimization, DDPG has also demonstrated effectiveness in distributed signal processing scenarios. Xu et al. [23] addressed the challenges faced by RIS-assisted cell-free IoT networks, such as high hardware costs, significant backhaul overhead, and channel fading in high-frequency bands, by proposing a deep distributed alternating direction method of multipliers (D2-ADMM) based on algorithmic unfolding. This algorithm jointly optimizes base station precoding and RIS reflection coefficients to maximize the weighted sum rate. It constructs a learning framework by transforming the problem, utilizing distributed ADMM, and algorithmic unfolding, combining one-way information exchange and data-driven training, and integrating domain knowledge with end-to-end learning to enhance system performance.
Li et al. [24] presented a software-defined wireless networking (SDWN) method that employs a GCN-GRU framework for traffic prediction and constructs a data forwarding mechanism based on the Proximal Policy Optimization (PPO) algorithm. It uses traffic matrices and topology information as input to the DRL environment and constructs reward functions by integrating link status and penalty mechanisms. Results show that this algorithm surpasses traditional methods in terms of throughput, delay, and other performance metrics. Building upon this, Huang et al. [16] proposed an intelligent routing method that combines traffic prediction with the Dueling Deep Q-Network (DQN) [25] algorithm. By obtaining global link states through SDN and converting them into a traffic matrix, this method predicts future traffic trends and generates optimal routing decisions. Experimental results show that it improves throughput, reduces delay and packet loss rate, and achieves faster convergence compared to DDPG and PPO. However, these methods optimize average performance and lack explicit per-class QoS differentiation, making them vulnerable to prediction errors during sudden traffic surges.
İpek et al. [26] proposed AIRSDN, an AI-based routing framework in SDN designed for multimedia traffic transmission, which employs supervised learning (specifically Logistic Regression) to dynamically select optimal paths based on real-time traffic analysis, significantly reducing delay and packet loss while improving throughput for video streams. However, the method primarily optimizes overall QoS metrics (e.g., RTT, PSNR) without explicit mechanisms to prioritize distinct traffic classes (e.g., real-time vs. non-real-time), potentially limiting its adaptability to heterogeneous service requirements in mixed traffic scenarios. Sanchez et al. [27] developed MDQ, a QoS-Congestion aware deep reinforcement learning approach for multi-path routing in SDN. It incorporates a Live QoS Monitor, Dynamic Network Classifier, and congestion severity index to reduce delay and improve load balancing. Nevertheless, MDQ runs on a single-controller SDN (limiting large-network scalability) and does not explore energy efficiency or computational overhead.
Chen et al. [28] proposed a scalable deep reinforcement learning-based intelligent routing strategy (SR-DRL) for datacenter-scale traffic optimization. It adopts a two-level architecture: Peripheral Systems on end-hosts handle short flows with instant local decisions, while a Central System uses two DRL agents to optimize thresholds for short flows and adjust routing/priority for long flows, leveraging the long-tail distribution of datacenter traffic to achieve sub-10 ms response latency. However, SR-DRL focuses on general flow completion time minimization without targeted optimization for heterogeneous QoS demands, and its short-flow routing relies on ECMP without QoS-aware path selection, with long-flow policy updates depending on batch processing of finished flow data, leading to potential adaptation delays in dynamic QoS scenarios.
Compared with traditional routing algorithms, these single-agent approaches exhibit strong adaptability and learning capabilities. They can dynamically adapt routing strategies to respond to changes in network topology and traffic fluctuations, thereby enhancing network performance and resource utilization. However, in large-scale networks or complex dynamic environments, these methods often suffer from limited perception ranges and delayed decision-making, thereby hindering global optimization and efficient collaboration.

2.2. Multi-Agent Routing Algorithm

With the rapid advancement of machine learning, multi-agent approaches have been extensively applied in the development of intelligent routing algorithms. Each agent can independently or collaboratively sense network conditions, predict traffic dynamics, and adapt routing strategies accordingly, thereby facilitating distributed decision-making and achieving global optimization [29].
Okine et al. [30] proposed a distributed multi-agent deep reinforcement learning (MADRL) routing method designed for tactical mobile sensor networks, with multiple sink nodes to mitigate link-layer interference attacks. This method demonstrated superior performance compared to baseline algorithms in terms of packet delivery rate, end-to-end delay, and energy efficiency. However, its design is tailored to specific scenarios and lacks generalization to heterogeneous QoS in wired SDN environments. Building upon this work, Lyu et al. [31] introduced a constrained multi-agent dynamic routing (CMADR) algorithm, which formulates the routing task as a max–min optimization problem using the Lagrangian method. The algorithm efficiently balances objective improvement and constraint adherence through strategy updates and Lagrangian multiplier adjustments. Experimental results indicate that CMADR reduces delay by at least 21% and 15% compared to baseline methods. Nevertheless, the constraints are global rather than per-flow, and QoS differentiation is not explicitly addressed. Wang et al. [32] proposed a MADRL-based routing method for laser intersatellite links, modeling the routing optimization as a Markov decision process (MDP) [33]. By decomposing the state, action, and reward components, the MDP is reformulated within a MADRL framework. Experimental results show that this method reduces energy consumption by over 15% and decreases transmission delay by approximately two hops within a few iterations. However, this method targets specialized network structures and does not handle differentiated QoS scheduling for mixed services.
Guo et al. [11] developed a collaborative multi-agent reinforcement learning (CMRL) method, which decomposes large-scale routing problems into smaller, manageable subproblems. The method constructs a virtual training environment tailored to the limitations of legacy devices and SDN switches and introduces a differential reward mechanism to incentivize cooperative decision-making among agents. Real-world traffic simulations validate the performance advantages of CMRL in dynamic traffic engineering scenarios. Liu et al. [29] proposed a multi-agent reinforcement learning-based multi-path routing (RL-MR) algorithm. It generates routing decisions in a hop-by-hop manner, integrating delay, transmission rate, and packet loss metrics, and incorporates an auxiliary learning mechanism to enhance reliability and accelerate convergence. Experiments demonstrate that RL-MR surpasses baseline methods in throughput, bandwidth utilization, and delay performance. However, reward designs remain single-objective or weakly multi-objective, and class-aware QoS trade-offs are not enforced at decision time. Kołakowski et al. [34] proposed 3DQR, a hierarchical network engineering framework that combines multi-controller SDN with graph-based MADRL to optimize routing in three-dimensional ground-to-non-ground networks. It uses a dual verification mechanism for real-time QoS enforcement and integrates a QoS-aware reward function and GNNs for dynamic topology, achieving 13.5% lower traffic rejection rate and 50% better load distribution than traditional shortest path schemes. Though adaptive to mobile conditions via topology prediction, it faces challenges in resource fairness for fluctuating traffic and performance limits in sparse networks.
Huang et al. [35] proposed ERA-MADDPG, an Elastic Routing Algorithm based on Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [36] for SDN. ERA-MADDPG designs a three-layer SDN architecture and combines the Actor–Critic framework with CNN to enhance routing elasticity, enabling fast policy adjustments when network topology changes. It uses centralized multi-agent training and sub-policy decomposition to boost efficiency and avoid overfitting, but only focuses on general routing resilience under dynamic topology and lacks optimization for differentiated QoS of diverse traffic. In contrast, RS-MADDPG designs a decentralized training framework with a shared centralized critic network, and its key innovation lies in introducing differentiated Quality of Service (QoS) flows and a custom reward mechanism for differentiated QoS needs. It first classifies traffic into throughput-sensitive, delay-sensitive, and packet loss-sensitive flows, then customizes reward weights for each flow type. This framework dynamically balances throughput, delay, and packet loss rate, effectively addressing ERA-MADDPG’s limitation in handling heterogeneous traffic.
Yue et al. [37] proposed a contribution-based multi-agent routing algorithm (CBR), which introduces graph convolutional networks (GCN) to extract network topology features and models the entire topology through a point-edge swapping technique. This method addresses the limitation of shared global rewards in traditional multi-agent systems by designing a contribution-based advantage function: it calculates the individual contribution of each agent to the global reward by comparing the Q-value of the agent’s current action with the average Q-value of all possible actions under the current policy, avoiding incorrect optimization directions caused by undifferentiated reward allocation. However, CBR focuses on hop-by-hop routing decision-making and relies on pre-training of shortest path rules to accelerate convergence, which leads to two limitations: first, its hop-by-hop decision mechanism increases the cumulative delay of path generation in large-scale networks; second, the pre-training process is dependent on static shortest path rules, and its adaptability to sudden link failures or dynamic traffic surges needs to be further improved.
Compared to traditional centralized control methods, multi-agent routing approaches offer greater flexibility, scalability, and fault tolerance, thereby significantly improving network adaptability and QoS assurance. However, existing methods often focus on a single type of traffic service and fail to differentiate between the requirements of heterogeneous traffic types. Therefore, building on the strengths of multi-agent routing, this paper further considers the dynamic QoS demands of various flows that are sensitive to delay, throughput and packet loss rate.

3. RS-MADDPG System Architecture

The RS-MADDPG algorithm is designed based on the SDN architecture and leverages a multi-agent deep deterministic policy gradient approach to achieve intelligent routing. As shown in Figure 2, the system architecture is composed of three main components: the intelligent layer, the control layer, and the data layer. The training process of the RS-MADDPG algorithm involves several key stages, including network information collection, state representation, policy learning and generation, action execution, and flow table deployment to the data layer. The RS-MADDPG algorithm enables efficient collaboration among multiple agents through a lightweight communication mechanism: agents exchange only key local observations (e.g., link weights, traffic type labels) and policy gradients via the SDN controller’s northbound interface, avoiding transmission of redundant topology data. This design reduces communication overhead while accelerating convergence, effectively addressing challenges associated with information exchange in large-scale networks. The algorithm optimizes agent policies through the joint learning and updating of actor and critic networks across agents. By sharing a centralized critic network, RS-MADDPG facilitates global policy optimization, enhancing both fault tolerance and adaptability in dynamic network environments.
The processing flow of the RS-MADDPG system architecture consists of five main stages: information collection, state input, policy generation, policy output, and flow table deployment. Information collection provides the data foundation for learning and decision-making within the RS-MADDPG framework. Specifically, network status data is gathered at the data layer and transmitted to the controller via the southbound interface. Based on this information, multiple agents collaboratively perform learning and training to derive appropriate routing policies. The flow table deployment phase translates these decisions into concrete network operations. The overall processing flow of the RS-MADDPG algorithm is depicted in Figure 3.
(1) Information Collection: In the RS-MADDPG system, agents collect network information at the data layer, including network topology, link bandwidth, switch resource utilization, link delay, and packet loss rate.
(2) State Input: After the collected network information is processed and summarized, the agents extract and generate status information needed for the application, which is then input into the deep reinforcement learning neural network.
(3) Policy Generation: Multiple agents learn and optimize policies based on the network information. They collaborate with each other and design one or more routing policies, considering the starting node and state information, which are used for optimal path selection and traffic allocation.
(4) Policy Output: After the neural network training converges, intelligent routing policies are generated and deployed to the controller through the northbound interface. Based on the corresponding control commands, intelligent traffic control is implemented.
(5) Flow Table Deployment: The SDN controller disseminates the generated flow table entries to each switch/router through protocols. Upon receiving the flow table entries, the switches/routers perform matching operations and execute corresponding actions based on the matched flow table entries.
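As a concrete illustration of stage (5), the following minimal sketch shows how a Ryu-based controller could install a flow entry on a switch via OpenFlow 1.3. The match field, priority, and output port used here are illustrative assumptions rather than the exact entries generated by RS-MADDPG.

```python
# Minimal Ryu sketch (assumptions: OpenFlow 1.3, illustrative match fields and ports).
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import MAIN_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3


class FlowInstaller(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    def add_flow(self, datapath, priority, match, out_port):
        """Install one flow entry that forwards matching packets to out_port."""
        ofproto = datapath.ofproto
        parser = datapath.ofproto_parser
        actions = [parser.OFPActionOutput(out_port)]
        inst = [parser.OFPInstructionActions(ofproto.OFPIT_APPLY_ACTIONS, actions)]
        mod = parser.OFPFlowMod(datapath=datapath, priority=priority,
                                match=match, instructions=inst)
        datapath.send_msg(mod)

    @set_ev_cls(ofp_event.EventOFPPacketIn, MAIN_DISPATCHER)
    def packet_in_handler(self, ev):
        datapath = ev.msg.datapath
        parser = datapath.ofproto_parser
        # Illustrative rule: forward traffic for a given destination MAC to port 2.
        match = parser.OFPMatch(eth_dst="00:00:00:00:00:04")
        self.add_flow(datapath, priority=10, match=match, out_port=2)
```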
Information collection includes key network metrics, such as topology, bandwidth, resource utilization, and delay. The network is defined as a topology graph $G = (V, K, C)$, where $V$ is the set of nodes, $K$ is the set of links, and $C$ denotes link capacities. A data flow is defined as $f = (d_i, h_i, p_i, q_i, r_i)$, where $d_i$ and $h_i$ are the source and destination nodes, $d_{ij}$ is the delay between nodes $i$ and $j$, $p_i$ and $q_i$ are the start and end times, and $r_i$ is the traffic demand. The average end-to-end delay and average throughput are calculated as shown in Equations (1) and (2).
$$\mathrm{Delay} = \frac{1}{|K|} \sum_{i,j=1}^{|V|} \sum_{k \in K} d_{ij} \tag{1}$$
$$\mathrm{Throughput} = \frac{1}{|K|} \sum_{i,j=1}^{|V|} \sum_{k \in K} t_{ij} \tag{2}$$
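For clarity, the following minimal Python sketch computes the link-averaged quantities of Equations (1) and (2); the dictionary layout and the sample values are illustrative assumptions, not measurements from the system described above.

```python
# Minimal sketch (assumed data layout): per-link delay and throughput samples
# collected by the controller, averaged over the link set K as in Eqs. (1)-(2).
link_delay_ms = {("S1", "S3"): 12.4, ("S1", "S2"): 8.1, ("S2", "S4"): 9.7}
link_throughput_mbps = {("S1", "S3"): 61.0, ("S1", "S2"): 47.5, ("S2", "S4"): 52.3}

def average_metric(per_link: dict) -> float:
    """Average a per-link metric over all links in K."""
    return sum(per_link.values()) / len(per_link)

avg_delay = average_metric(link_delay_ms)               # Delay, Eq. (1)
avg_throughput = average_metric(link_throughput_mbps)   # Throughput, Eq. (2)
print(f"avg delay = {avg_delay:.2f} ms, avg throughput = {avg_throughput:.2f} Mbps")
```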

4. Intelligent Routing Under Multiple Controllers

4.1. Overview of RS-MADDPG Algorithm

RS-MADDPG employs the multi-agent deep deterministic policy gradient algorithm, an improved version of the MADDPG [36] method, combined with the actor–critic framework. The actor network is responsible for mapping the network information collected by the intelligent agents, such as link throughput and link bandwidth, to action selections based on traffic demands [38,39]. The critic function evaluates the decisions made by the agents under the current policy and assigns corresponding reward values. In the RS-MADDPG algorithm, each intelligent agent is treated as an actor and, during the intelligent routing training phase, each agent can perceive the states of other agents while all agents share a single critic network. During the training of the actor and critic networks, the algorithm interacts with the network environment to complete the training of the routing strategy. It maps the observed data from the multi-agent environment to the state selections of the agents, combines the state selections of other agents, and collects experience data through interactions with the environment and the other agents. The neural networks are updated through an experience replay buffer $D$, and the actor network and critic network are optimized by learning and updating each agent's policy. For example, when delay-sensitive traffic is present in the current network, the actor network, based on the collected network information $x = (s_1, s_2, \ldots, s_N)$, where $s_i$ denotes the network information observed by agent $i$, chooses a path with lower delay. The critic function then provides feedback on the current policy and evaluates its effectiveness. The framework of the RS-MADDPG algorithm is illustrated in Figure 4.
For the task of intelligent routing with $N$ intelligent agents, the RS-MADDPG algorithm maintains $N$ policy functions and $N$ evaluation functions. The intelligent routing strategies adopted by the $N$ agents are represented by $\pi = \{\pi_1, \ldots, \pi_i, \ldots, \pi_N\}$. The corresponding parameters encoding the network topology, traffic bandwidth, delay, throughput, and traffic sensitivity information are denoted as $\theta = \{\theta_1, \theta_2, \ldots, \theta_N\}$. The action taken by the $i$-th agent based on its observed information is drawn from $\pi_i(a_i \mid s_i)$. The complete network environment state is represented by $x$, and the policy gradient for the $i$-th agent is expressed in Equation (3).
$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{s \sim \rho^{\mu},\, a_i \sim \pi_i} \left[ \nabla_{\theta_i} \log \pi_i(a_i \mid s_i) \, Q_i^{\pi}(x, a_1, \ldots, a_N) \right] \tag{3}$$
Here, $Q_i^{\pi}(x, a_1, a_2, \ldots, a_N)$ represents the agent's evaluation function, mapping the acquired network state $x$ and the joint actions to a Q-value that quantifies the expected cumulative reward and guides policy optimization during training. The network environment information $x = (s_1, s_2, \ldots, s_N)$ can be the observation set of all agents, and additional information can be included depending on the current network environment or traffic requirements. The evaluation function $Q_i^{\pi}$ of each intelligent agent can be learned separately, and the reward value can also be designed separately.
Each element in the experience replay buffer $D$ is a quadruple $(x, a, r, x')$, which records the experiences of the agents. Here, $a = (a_1, a_2, \ldots, a_N)$ and $r = (r_1, r_2, \ldots, r_N)$ denote the joint actions and the corresponding rewards, respectively. The action-value function of each intelligent agent is updated via backpropagation, as described in Equation (4).
$$L(\theta_i) = \mathbb{E}_{x, a, r, x'} \left[ \left( Q_i^{\mu}(x, a_1, a_2, \ldots, a_N) - y \right)^2 \right] \tag{4}$$
Here, y denotes the target value computed by the agent’s critic network. The computation of y is detailed in Equation (5).
$$y = r_i + \gamma \, Q_i'\!\left(x', a_1', \ldots, a_N'\right) \Big|_{a_j' = \mu_j'(s_j)} \tag{5}$$
Here, $\mu' = \{\mu_{\theta_1'}, \ldots, \mu_{\theta_N'}\}$ represents the action functions of the target networks and $\gamma$ is the discount factor. The RS-MADDPG algorithm employs target networks to accelerate the learning of both the evaluation and action functions. Specifically, the parameters of the action function in the $i$-th target network are represented by $\theta_i'$, while the evaluation function of the $i$-th agent's target network is denoted as $Q_i'$.
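To make the actor–critic structure concrete, the sketch below shows a per-agent actor and the shared centralized critic in PyTorch, mirroring the mapping from local observations to link-weight actions and from the joint state–action pair to a Q-value. The layer sizes and activation choices are illustrative assumptions; the paper does not prescribe exact architectures here.

```python
# Minimal sketch of the per-agent actor and the shared centralized critic
# (assumed layer sizes; not the exact architectures used in the paper).
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Maps one agent's local observation s_i to a link-weight action a_i."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Sigmoid(),  # link weights scaled to (0, 1)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


class CentralizedCritic(nn.Module):
    """Maps the global state x and joint action (a_1, ..., a_N) to a Q-value."""
    def __init__(self, global_state_dim: int, joint_act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(global_state_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor, joint_action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, joint_action], dim=-1))
```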
The traffic scheduling algorithm designed for differentiated services, named RS-MADDPG, is presented in Algorithm 1:
Algorithm 1. RS-MADDPG traffic scheduling algorithm for differentiated services
(1) Input: network status information set $\theta = \{\theta_1, \theta_2, \ldots, \theta_N\}$, environment information set $x = (s_1, s_2, \ldots, s_N)$, agent action set $a = (a_1, a_2, \ldots, a_N)$, step size $T = 1000$, discount factor $\gamma$
(2) Output: $\pi$, $\theta_i'$, $Q_i$, $Q_i'$
(3) for episode = 1 to $M$ do
(4)    Initialize a random process $\mathcal{N}$ for action exploration;
(5)    Initialize the network routing environment state $x$;
(6)    for $t = 1$ to $T$ do
(7)      for each agent $i$, select action $a_i = \mu_{\theta_i}(o_i) + \mathcal{N}_t$;
(8)      Execute the joint action $a = (a_1, a_2, \ldots, a_N)$;
(9)      Observe the reward $r$ and the next state $x'$;
(10)     Store the transition quadruple $(x, a, r, x')$ in the experience replay buffer $D$;
(11)    end for
(12)    for agent $i = 1$ to $N$ do
(13)      Sample a minibatch of $S$ transitions $(x^j, a^j, r^j, x'^j)$ from $D$;
(14)      Compute the cumulative return (target value):
(15)       $y^j = r_i^j + \gamma\, Q_i'\big(x'^j, a_1', \ldots, a_N'\big)\big|_{a_k' = \mu_k'(s_k^j)}$;
(16)      Update the evaluation function by minimizing the loss:
(17)       $L(\theta_i) = \frac{1}{S} \sum_j \big( Q_i^{\mu}(x^j, a_1^j, \ldots, a_N^j) - y^j \big)^2$;
(18)      Update the action function via the policy gradient:
(19)       $\nabla_{\theta_i} J(\mu_i) = \frac{1}{S} \sum_j \nabla_{\theta_i} \mu_i(s_i^j)\, \nabla_{a_i} Q_i^{\mu}\big(x^j, a_1^j, \ldots, a_N^j\big)\big|_{a_i = \mu_i(s_i^j)}$;
(20)    end for
(21)    Each agent $i$ updates its target network parameters: $\theta_i' \leftarrow \tau \theta_i + (1 - \tau)\, \theta_i'$;
(22)    end for
(23) end for
The flow of the RS-MADDPG algorithm for differentiated service traffic scheduling is as follows: First, in lines (1)–(2), network status, environment, agent actions, and other parameters are initialized; key hyperparameters such as the number of steps and the discount factor are set, and output parameters are defined. Then, in lines (3)–(5), the main training loop is initiated; the action exploration process and the network routing environment state are initialized. Subsequently, in lines (6)–(11), within each round, agents perform actions based on their policies and exploration noise, observe the resulting rewards and next states, and store the transition tuples (state, action, reward, next state) in the experience replay buffer. In lines (12)–(20), each agent samples from the replay buffer, computes the cumulative return and target Q-values, updates the Q-network based on the loss function, and refines the actor network through gradient descent. In line (21), the target network parameters for each agent are updated via a soft update mechanism. Finally, in lines (22)–(23), the agent and round loops are terminated. After multiple rounds of iterative optimization, the RS-MADDPG model converges to an effective scheduling policy tailored for differentiated service traffic.
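A minimal PyTorch sketch of the per-agent update corresponding to lines (13)–(21) is given below. The batch layout, optimizer handling, and hyperparameter values are illustrative assumptions, and the surrounding episode and step loops are omitted.

```python
# Minimal sketch of one RS-MADDPG update for agent i. Assumed batch layout:
# x, x_next are global states; acts are joint actions from the replay buffer;
# next_acts are joint actions produced by the target actors for x_next;
# agent i's action is assumed to occupy the first slice of the joint vector.
import torch
import torch.nn.functional as F

def update_agent_i(batch, actor_i, critic_i, target_actor_i, target_critic_i,
                   actor_opt, critic_opt, gamma=0.95, tau=0.01):
    x, obs_i, acts, rew_i, x_next, next_acts = batch

    # Target value y^j = r_i + gamma * Q_i'(x', a_1', ..., a_N')  -- Eq. (5)
    with torch.no_grad():
        y = rew_i + gamma * target_critic_i(x_next, next_acts)

    # Critic update: minimize (Q_i(x, a_1, ..., a_N) - y)^2  -- Eq. (4)
    critic_loss = F.mse_loss(critic_i(x, acts), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: ascend Q_i w.r.t. agent i's own action, other agents'
    # actions taken from the sampled batch.
    new_ai = actor_i(obs_i)
    joint = torch.cat([new_ai, acts[:, new_ai.shape[1]:]], dim=1)
    actor_loss = -critic_i(x, joint).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of target networks: theta' <- tau*theta + (1 - tau)*theta'
    for target, online in ((target_actor_i, actor_i), (target_critic_i, critic_i)):
        for tp, p in zip(target.parameters(), online.parameters()):
            tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)
```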

4.2. Mapping of State, Action, and Reward

The RS-MADDPG algorithm is based on a deep reinforcement learning framework comprising intelligent agents interacting with the environment. The environment state S is observed, upon which agents make decisions A and receive rewards R reflecting the quality of their actions. In the RS-MADDPG algorithm, the policy function adopts a deep neural network (DNN), which can provide more efficient routing services by allowing intelligent routing based on real-time network information. Firstly, the network information is preprocessed and input into the neural network. The convolutional layer and pooling layer of the neural network are used to extract the features of the processed data. The extracted data is then trained through the neural network, and the network decision is made based on the network output, updating the routing forwarding table and making traffic forwarding decisions. Finally, the DNN network updates the policy based on the reward value and network status of this action to dynamically adjust the mapping process, as shown in Figure 5.
(1) State Mapping: The current network state is mapped in real time, including topology, link bandwidth, throughput, switch resource utilization, link delay, and packet loss rate. These metrics are selected to capture universal network dynamics (e.g., congestion trends, resource constraints) rather than topology-specific details, enabling the model to learn transferable strategies across different network structures. The data plane uploads network topology information to the SDN controller via the southbound interface, where it is aggregated into a normalized state representation (independent of node/link labels) and forwarded to the intelligent agents. These agents then utilize the state information to make routing decisions. In the RS-MADDPG algorithm, multiple metrics, such as node delay, packet loss rate, throughput, and link utilization, are incorporated as input states. The network node state at time t is represented by Equation (6):
$$S_{m \times n}^{t} = \begin{bmatrix} v_{11}^{t} & \cdots & v_{1n}^{t} \\ \vdots & \ddots & \vdots \\ v_{m1}^{t} & \cdots & v_{mn}^{t} \end{bmatrix} \tag{6}$$
where each matrix element $s = (d^{t}, l^{t}, t^{t}, u^{t}, e^{t})$, with $d^{t}$ representing node delay, $l^{t}$ the packet loss rate, $t^{t}$ the throughput, $u^{t}$ the link utilization, and $e^{t}$ other information. This study defines $r_i$ as the traffic demand, $r_i = (d_i, l_i, t_i)$, with $d_i$ representing the demand for delay, $l_i$ the demand for packet loss rate, and $t_i$ the demand for throughput. The state information is mapped into the neural network, which extracts its features and ultimately outputs routing strategies. In addition to utilizing network metrics such as link bandwidth, throughput, delay, packet loss rate, and traffic as input, the proposed algorithm enables interaction among multiple intelligent agents. These agents can perceive each other's states and train collaboratively, thereby significantly improving training efficiency.
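A minimal sketch of how the state matrix of Equation (6) could be assembled from per-node measurements is shown below; the metric ordering, sample values, and max-based normalization are illustrative assumptions.

```python
# Minimal sketch: build the m x n state matrix of Eq. (6), one row per node and
# one column per metric (delay, loss, throughput, utilization); the values and
# the normalization scheme are illustrative assumptions.
import numpy as np

# Assumed per-node measurements: (delay_ms, loss_rate, throughput_mbps, utilization)
node_metrics = {
    "S1": (12.0, 0.010, 55.0, 0.62),
    "S2": (8.5, 0.004, 70.0, 0.48),
    "S3": (15.3, 0.020, 40.0, 0.81),
}

def build_state(metrics: dict) -> np.ndarray:
    """Stack per-node metrics into S_{m x n}(t) and scale each column to [0, 1]."""
    raw = np.array(list(metrics.values()), dtype=np.float32)
    col_max = raw.max(axis=0)
    col_max[col_max == 0] = 1.0   # avoid division by zero on all-zero columns
    return raw / col_max          # label-independent, normalized state

state_t = build_state(node_metrics)   # shape (m, n), fed to the neural network
```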
(2) Action Mapping: Actions represent decisions derived from the input state information and accumulated reward values. Given varying network conditions or traffic demands, each link is assigned a corresponding weight. The set of link weights for multiple intelligent agents at time t is expressed in Equation (7).
$$a^{t} = \{W_1, W_2, \ldots, W_i, \ldots, W_n\} \tag{7}$$
Here, $W$ denotes the link weights, which are used to compute the paths. In the action mapping process of the RS-MADDPG algorithm, network information such as network state, link quality, and bandwidth utilization is first collected. Then, the router action space is defined, including selecting neighboring nodes, choosing paths, or adjusting traffic allocation. Finally, paths are determined based on the link weights, so the RS-MADDPG algorithm selects different paths for traffic with different demands. For example, a delay-sensitive flow and a packet loss-sensitive flow transmitted to the same destination node are often assigned different link weights and therefore follow different paths, as illustrated by the sketch below.
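The sketch referenced above illustrates one way the action vector of link weights could be converted into a forwarding path using a standard weighted shortest-path computation (here via networkx); the toy topology and weight values are illustrative assumptions.

```python
# Minimal sketch: interpret the agent's action as link weights and derive a path.
# The small topology and the weight values are illustrative assumptions.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("S1", "S2"), ("S2", "S4"), ("S1", "S3"), ("S3", "S4")])

# Action a_t = {W_1, ..., W_n}: one weight per link (e.g., output of the actor).
action = {("S1", "S2"): 0.2, ("S2", "S4"): 0.3, ("S1", "S3"): 0.9, ("S3", "S4"): 0.8}
nx.set_edge_attributes(G, action, name="weight")

# Lower weight = more preferred; a weighted shortest path realizes the decision.
path = nx.shortest_path(G, source="S1", target="S4", weight="weight")
print(path)  # e.g., ['S1', 'S2', 'S4'] when the S1-S3 link is de-prioritized
```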
(3) Reward Mapping: The reward value provides feedback on the quality of the agent's action and is computed based on network performance metrics. Specifically, the reward $R$ is defined as a function $M(\mathrm{throughput}, \mathrm{delay}, \mathrm{loss})$, where throughput, delay, and loss correspond to key indicators such as link throughput, transmission delay, and packet loss rate. In this study, the weights assigned to these metrics are adjusted dynamically according to actual traffic demands. The reward calculation is formalized in Equation (8):
$$R = n_1 \times \mathrm{throughput} + n_2 \times \frac{1}{\mathrm{delay}} + n_3 \times \frac{1}{\mathrm{loss}}, \qquad n_1, n_2, n_3 \in (0, 1) \tag{8}$$
where $n_1$, $n_2$, and $n_3$ are weight parameters adjusted according to traffic demands and actual conditions to balance the influence of throughput, delay, packet loss, and other indicators on multi-agent learning.
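A minimal sketch of Equation (8) with per-traffic-type weight profiles is given below; the specific values of $n_1$, $n_2$, and $n_3$ are illustrative assumptions chosen to emphasize each flow type's dominant metric, not the tuned values used in the experiments.

```python
# Minimal sketch of Eq. (8): R = n1*throughput + n2*(1/delay) + n3*(1/loss),
# with (n1, n2, n3) selected per traffic type. The weight values are assumptions.
WEIGHTS = {
    "throughput_sensitive": (0.8, 0.1, 0.1),   # Type I
    "delay_sensitive":      (0.1, 0.8, 0.1),   # Type II
    "loss_sensitive":       (0.1, 0.1, 0.8),   # Type III
}

def reward(traffic_type: str, throughput: float, delay: float, loss: float,
           eps: float = 1e-6) -> float:
    """Reward that rises with throughput and falls with delay and packet loss."""
    n1, n2, n3 = WEIGHTS[traffic_type]
    # eps guards against division by zero when delay or loss is measured as 0.
    return n1 * throughput + n2 * (1.0 / (delay + eps)) + n3 * (1.0 / (loss + eps))

r = reward("delay_sensitive", throughput=0.6, delay=0.05, loss=0.01)
```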

5. Simulation Experiments

5.1. Experimental Environment Setup

The performance of RS-MADDPG is evaluated using the Mininet network simulation platform with the Ryu controller to construct the network environment. The experimental setup runs on an Ubuntu 18.04.1 system with Python 3.6.5. Deep learning models are implemented using PyTorch 1.4.0 for efficient numerical computation. The reinforcement learning environment is provided by Gym 0.17.1, while NumPy 1.18.2 supports numerical operations and Statsmodels 0.11.1 is employed for statistical analysis. The hardware configuration includes an Intel i9-13900K CPU, 32 GB DDR5 RAM (16 × 2 GB), and an RTX 4080 Super 16GB GPU.
The parameters and traffic parameter settings used in the RS-MADDPG algorithm during the experiment are shown in Table 1.
The GEANT basic network topology, comprising 23 nodes and 37 bidirectional links [35,40], is adopted in this study. Each link is configured with a bandwidth of 100 Mbps. The network topology is illustrated in Figure 6.
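For reproducibility, the following minimal Mininet sketch shows how links capped at 100 Mbps could be configured under a remote Ryu controller; the three-switch topology is an illustrative stand-in for the 23-node GEANT topology, and the controller address is an assumption.

```python
# Minimal Mininet sketch (illustrative 3-switch stand-in for the GEANT topology):
# every link is capped at 100 Mbps via TCLink; the controller is a remote Ryu instance.
from mininet.net import Mininet
from mininet.node import RemoteController, OVSSwitch
from mininet.link import TCLink

def build_net():
    net = Mininet(controller=RemoteController, switch=OVSSwitch, link=TCLink)
    net.addController("c0", ip="127.0.0.1", port=6633)
    s1, s2, s3 = (net.addSwitch(name) for name in ("s1", "s2", "s3"))
    h1, h2 = net.addHost("h1"), net.addHost("h2")
    for a, b in ((h1, s1), (s1, s2), (s2, s3), (s1, s3), (s3, h2)):
        net.addLink(a, b, bw=100)   # 100 Mbps per link, as in the experiments
    return net

if __name__ == "__main__":
    net = build_net()
    net.start()
    net.pingAll()
    net.stop()
```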

5.2. Comparison Methods and Performance Metrics

The traffic was categorized into three types based on user requirements: delay-sensitive, throughput-sensitive, and packet loss-sensitive. The corresponding reward weight values were adjusted according to the actual traffic demands. RS-MADDPG was compared with traditional routing, single-agent intelligent routing, and multi-agent routing strategies on three metrics. For traditional link-state routing, the Shortest Path First (SPF) strategy used in OSPF was employed [41]. For single-agent routing, the smart routing strategy based on deep reinforcement learning (SR-DRL) was utilized [28]. Additionally, a multi-agent learning algorithm (MALA) was considered for comparison [37].

5.3. Performance Evaluation

(1) Convergence Comparison of RS-MADDPG Algorithm
The convergence of the RS-MADDPG and MALA algorithms is illustrated in Figure 7. The entire training process consists of 200 training epochs, with each epoch comprising a total of 1000 steps. From the graph, it can be observed that RS-MADDPG reaches stability at approximately 117 epochs, while MALA stabilizes around the 148th epoch. After about the 74th training epoch, the training rewards of RS-MADDPG become significantly higher than those of MALA. This can be attributed to the fact that the RS-MADDPG algorithm allows agents to observe and assist each other in training, learning from the policies of other intelligent agents. As a result, individual agents can effectively utilize the strategies of other models while learning their own models. This contributes to the superior convergence performance of the RS-MADDPG algorithm compared to other multi-agent learning algorithms, with a faster convergence rate and more optimal training rewards.
(2) Performance Evaluation for Different Sensitive Traffic Flows
To validate the differentiated service for different types of traffic, this study conducted tests using three different flow types based on user requirements. Type I represents throughput-sensitive traffic, Type II represents delay-sensitive traffic, and Type III represents packet loss-sensitive traffic. Each test involved only one type of traffic, and statistical analysis was performed on throughput, delay, and packet loss for all three types of traffic. During the testing process, the traffic load intensity was fixed at H = 0.8 . A comparative analysis was conducted between RS-MADDPG, SPF, SR-DRL, and MALA under different types of traffic. The performance comparison for different types of traffic is illustrated in Figure 8.
The results in Figure 8 indicate that the RS-MADDPG algorithm outperforms SPF, SR-DRL, and MALA routing algorithms in terms of throughput, delay, and packet loss under Type I, Type II, and Type III traffic. Specifically, Figure 8a shows that the throughput of RS-MADDPG under Type I traffic is significantly higher than under Type II and III, demonstrating its ability to accurately adapt to throughput-sensitive flows. In contrast, the throughput of SPF, SR-DRL, and MALA remains largely unchanged across all traffic types. This discrepancy arises because the SPF algorithm selects the shortest path based solely on the source and destination nodes, ignoring the differentiated demands of traffic. As a static routing method, it lacks adaptability to dynamic traffic characteristics. Similarly, the SR-DRL and MALA algorithms lack robust on-demand service differentiation and struggle to dynamically adjust resource allocation strategies, leading to undifferentiated throughput performance across traffic types. The distinctions are further evident in the latency comparison in Figure 8b and the packet loss rate comparison in Figure 8c. RS-MADDPG achieves significantly lower end-to-end latency under Type II traffic compared to Type I and III, indicating effective adaptation to delay-sensitive applications. Under Type III traffic, the algorithm maintains a consistently low packet loss rate, demonstrating its capacity for targeted QoS optimization. In contrast, SPF, SR-DRL, and MALA show minimal variation in latency and packet loss rate across traffic types. Notably, the SPF algorithm exhibits a higher packet loss rate under Type III traffic, highlighting its inability to meet differentiated service requirements. These findings suggest that most traditional and intelligent routing algorithms cannot provide differentiated services on demand. They also fail to dynamically adjust routing strategies according to traffic needs. Although intelligent routing methods have made substantial progress in overall performance, their ability to deliver differentiated QoS remains limited. In comparison, RS-MADDPG—leveraging policy learning and real-time network updates—is capable of capturing subtle, real-time variations in traffic demand and dynamically allocating network resources to achieve precise scheduling for throughput, delay, and packet loss sensitive flows.
(3) Performance Analysis under Various Traffic Conditions
In real-world scenarios, multiple types of traffic often coexist, with different types appearing in a periodic and alternating manner due to user behavior patterns. The experimental tests were conducted with three types of traffic, i.e., Type I, Type II, and Type III, which occur periodically. The throughput, delay, and packet loss rate were measured for all types of traffic. During the tests, the traffic load intensity was fixed at H = 0.8 . A comparative analysis was performed between RS-MADDPG and SPF, SR-DRL, and MALA algorithms. The performance comparison under different types of traffic is presented in Table 2.
As shown in Table 2, in a mixed multi-traffic scenario, the RS-MADDPG algorithm outperforms the SPF, SR-DRL, and MALA algorithms across key performance metrics, including average throughput, average delay, and average packet loss rate. The traditional SPF algorithm adopts a static shortest path strategy, relying solely on physical topological distance as the decision criterion. It lacks the capability to adapt to dynamic traffic patterns and differentiated QoS requirements. When multiple types of traffic coexist, this limitation often leads to link congestion, significantly degrading network transmission efficiency. The SR-DRL algorithm, based on a single-agent architecture, is constrained by the computing power and storage capacity of the central agent. In large-scale networks with massive state information and complex traffic patterns, it suffers from computational bottlenecks and state space explosion, hindering fast and accurate dynamic policy adjustment. MALA employs a multi-agent framework with GCN-based topology feature extraction and contribution-based reward allocation, which addresses the problem of undifferentiated reward in traditional multi-agent methods. However, it still relies on hop-by-hop routing decisions.
(4) Communication Overhead Analysis
To evaluate the scalability of RS-MADDPG in large networks, we measured communication overhead in a 40-node Fat-Tree topology and compared it with MALA. The key metrics include data transmitted per agent per training cycle (reflecting bandwidth consumption) and communication latency between the farthest agents (reflecting real-time interaction efficiency).
As shown in Table 3, results show that RS-MADDPG reduces data transmission by ~60% and latency by ~50% compared to MALA. This is attributed to the following: (1) transmitting only traffic-type-specific features (e.g., delay sensitivity label) instead of full network states; (2) asynchronous experience sharing via the replay buffer, reducing real-time communication frequency. The overhead remains stable even when scaling to 100 nodes, indicating suitability for large networks.
(5) Generalization Across Topologies and Traffic Patterns
To verify the algorithm’s adaptability to diverse network topologies and traffic patterns, additional experiments were conducted on two typical topologies (Fat-Tree with 40 nodes/80 links and NSFNET with 14 nodes/21 links). We also tested random traffic patterns, where the proportion of Type I/II/III flows changes dynamically (10–90%) every 100 training steps. Results are shown in Table 4.
Results in Table 4 show that RS-MADDPG maintains stable performance across topologies: compared to GEANT, average throughput varies by less than 4%, delay by less than 5%, and packet loss rate by less than 0.5%. Under random traffic patterns, the algorithm adjusts reward weights dynamically, with performance degradation within 3% relative to fixed-pattern traffic. These findings indicate that RS-MADDPG generalizes effectively to different network environments by leveraging multi-dimensional state input and adaptive reward mechanisms.
In contrast to these baselines, the RS-MADDPG algorithm achieves global network awareness and real-time strategy coordination through a distributed multi-agent collaborative framework. Built on an actor–critic architecture, it uses the critic network to evaluate joint state-action values across agents, incorporates experience replay, and applies a soft update mechanism for the target network, together enhancing the stability and convergence speed of policy learning. When handling various traffic types, including throughput-sensitive, delay-sensitive, and packet loss-sensitive flows, RS-MADDPG accurately identifies their respective QoS requirements, dynamically balances network resource allocation, and achieves efficient traffic scheduling. Experimental results show that, compared to MALA, RS-MADDPG increases average throughput by approximately 3%, reduces average delay by about 7%, and lowers average packet loss by around 2%. Its collaborative optimization mechanism offers an improved solution for traffic scheduling in complex network environments.

6. Conclusions

In response to the increasing complexity and dynamics of network demands from users, this paper proposes the RS-MADDPG algorithm, which combines differentiated QoS optimization with lightweight multi-agent communication. By prioritizing key state features and leveraging asynchronous experience sharing, RS-MADDPG reduces communication overhead by ~60% compared to existing multi-agent methods, making it suitable for large-scale networks while addressing diverse traffic requirements. Deployed within the SDN architecture, the RS-MADDPG algorithm is an improvement of the MADDPG method that incorporates actor–critic networks. During the training process of the actor and critic networks, multi-agent collaborative training is employed to update the actor network and optimize the critic function network. This enables the collection of dynamic network information, facilitates intelligent agent interaction, and effectively meets the dynamic demands of various sensitive flows, thereby enabling the search for globally optimal solutions. To address different service types of traffic, the traffic is categorized into throughput-sensitive flow, delay-sensitive flow, and packet loss-sensitive flow. Simulation experimental results demonstrate the convergence and training reward superiority of the RS-MADDPG algorithm compared to other multi-agent routing algorithms. Under various sensitive flows, the RS-MADDPG algorithm outperforms SPF, SR-DRL, and MALA routing algorithms. In comparison to the MALA algorithm, the RS-MADDPG algorithm exhibits an average improvement of approximately 3% in throughput, a reduction of approximately 7% in average delay, and a decrease of around 2% in average packet loss rate. Furthermore, experiments on Fat-Tree, NSFNET topologies, and random traffic patterns demonstrate its strong generalization capability, with performance fluctuations within 5% across scenarios, validating its applicability to diverse network environments.
In future work, we plan to enhance the algorithm’s robustness from multiple perspectives by conducting in-depth research on the characteristics and evolution of extreme network environments. Additionally, we will explore meta-reinforcement learning to further improve generalization across highly heterogeneous topologies (e.g., satellite networks, IoT mesh networks) and design a topology-agnostic state embedding mechanism. We also aim to construct a comprehensive testing environment that encompasses diverse abnormal scenarios, such as abrupt congestion and link disruptions, to systematically analyze the algorithm’s generalization capabilities.

Author Contributions

The authors confirm contribution to the paper as follows: Study conception and design: S.K., J.Z. and S.L. (Shilin Liang); data collection: S.K. and Y.L.; analysis and interpretation of results: S.K., J.Z. and W.H.; manuscript writing: S.L. (Shilin Liang) and S.L. (Siyuan Liang). All authors have read and agreed to the published version of the manuscript.

Funding

The financial support from the Science and Technology Project of State Grid Henan Electric Power Company (Grant No. 521710250007), the Science and Technology Project of Henan Province (Grant No. 232102210078), and the Stabilization Support Program of The Shenzhen Science and Technology Innovation Commission (Grant No.20231130110921001) is acknowledged.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy.

Conflicts of Interest

Shi Kuang was employed by the State Grid Zhengzhou Electric Power Supply Company and Jinyu Zheng was employed by the State Grid Henan Electric Power Company. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Gunavathie, M.A.; Umamaheswari, S. Traffic-Aware Optimal Routing in Software Defined Networks by Predicting Traffic Using Neural Network. Expert Syst. Appl. 2024, 239, 122415.
2. Ding, C.; Zhu, L.; Shen, L.; Li, Z.; Li, Y.; Liang, Q. The Intelligent Traffic Flow Control System Based on 6G and Optimized Genetic Algorithm. IEEE Trans. Intell. Transp. Syst. 2024, 25, 1–14.
3. Hu, J.; Shen, H.; Liu, X.; Wang, J. RDMA Transports in Datacenter Networks: Survey. IEEE Netw. 2024, 38, 380–387.
4. Wang, Z.; Gong, T.; Huang, S.H. Graph Neural Network With Soft Actor-Critic and Attention based Large Model for Intelligent Edge Routing in Consumer Internet of Things. IEEE Trans. Consum. Electron. 2025.
5. Li, D.; Zhang, Z.; Alizadeh, B.; Zhang, Z.; Duffield, N.; Meyer, M.A.; Thompson, C.M.; Gao, H.; Behzadan, A.H. A Reinforcement Learning-Based Routing Algorithm for Large Street Networks. Int. J. Geogr. Inf. Sci. 2023, 38, 183–215.
6. Babooram, L.; Fowdur, T.P. Performance Analysis of Collaborative Real-Time Video Quality of Service Prediction with Machine Learning Algorithms. Int. J. Data Sci. Anal. 2024, 40, 1513–1545.
7. Gadallah, W.G.; Ibrahim, H.M.; Omar, N.M. A Deep Learning Technique to Detect Distributed Denial of Service Attacks in Software-Defined Networks. Comput. Secur. 2024, 137, 103588.
8. Wang, S.; Song, R.; Zheng, X.; Huang, W.; Liu, H. A3C-R: A QoS-Oriented Energy-Saving Routing Algorithm for Software-Defined Networks. Future Internet 2025, 17, 158.
9. Xiao, Y.; Yang, Y.; Yu, H.; Liu, J. Scalable QoS-Aware Multipath Routing in Hybrid Knowledge-Defined Networking with Multiagent Deep Reinforcement Learning. IEEE Trans. Mob. Comput. 2024, 23, 10628–10646.
10. Hu, K.; Xu, K.; Xia, Q.; Li, M.; Song, Z.; Song, L.; Sun, N. An Overview: Attention Mechanisms in Multi-Agent Reinforcement Learning. Neurocomputing 2024, 598, 128015.
11. Guo, Y.; Lin, B.; Tang, Q.; Ma, Y.; Luo, H.; Tian, H.; Chen, K. Distributed Traffic Engineering in Hybrid Software Defined Networks: A Multi-Agent Reinforcement Learning Framework. IEEE Trans. Netw. Serv. Manag. 2024, 21, 6759–6769.
12. Luan, Z.; Li, Q.; Jiang, Y.; Duan, J.; Zheng, R.; Chen, D.; Liu, S. MATE: When Multi-Agent Deep Reinforcement Learning Meets Traffic Engineering in Multi-Domain Networks. Comput. Netw. 2024, 247, 110399.
13. Yang, T.; Sun, J.; Mohajer, A. Queue Stability and Dynamic Throughput Maximization in Multi-Agent Heterogeneous Wireless Networks. Wirel. Netw. 2024, 30, 3229–3255.
14. Priyadarshi, R. Exploring Machine Learning Solutions for Overcoming Challenges in IoT-Based Wireless Sensor Network Routing: A Comprehensive Review. Wirel. Netw. 2024, 30, 2647–2673.
15. Changazi, S.A.; Bakhshi, A.D.; Yousaf, M.; Mohsin, S.M.; Akber, S.M.A.; Abazeed, M.; Ali, M. Optimization of Network Topology Robustness in IoTs: A Systematic Review. Comput. Netw. 2024, 250, 110568.
16. Huang, L.; Ye, M.; Xue, X.; Wang, Y.; Qiu, H.; Deng, X. Intelligent Routing Method Based on Dueling DQN Reinforcement Learning and Network Traffic State Prediction in SDN. Wirel. Netw. 2024, 30, 4507–4525.
17. Pei, X.; Sun, P.; Hu, Y.; Li, D.; Chen, B.; Tian, L. Enabling Efficient Routing for Traffic Engineering in SDN with Deep Reinforcement Learning. Comput. Netw. 2024, 241, 110220.
18. Bai, J.; Sun, J.; Wang, Z.; Zhao, X.; Wen, A.; Zhang, C.; Zhang, J. An Adaptive Intelligent Routing Algorithm Based on Deep Reinforcement Learning. Comput. Commun. 2024, 216, 195–208.
19. Ding, M.; Guo, Y.; Huang, Z.; Lin, B.; Luo, H. GROM: A Generalized Routing Optimization Method with Graph Neural Network and Deep Reinforcement Learning. J. Netw. Comput. Appl. 2024, 229, 103927.
20. He, Q.; Wang, Y.; Wang, X.; Xu, W.; Li, F.; Yang, K.; Ma, L. Routing Optimization With Deep Reinforcement Learning in Knowledge Defined Networking. IEEE Trans. Mob. Comput. 2024, 23, 1444–1455.
21. Shi, Y.; Wang, W.; Zhu, X.; Zhu, H. Low Earth Orbit Satellite Network Routing Algorithm Based on Graph Neural Networks and Deep Q-Network. Appl. Sci. 2024, 14, 3840.
22. Song, Y.; Qian, X.; Zhang, N.; Wang, W.; Xiong, A. QoS Routing Optimization Based on Deep Reinforcement Learning in SDN. Comput. Mater. Contin. 2024, 79, 3007–3021.
23. Xu, W.; An, J.; Li, H.; Gan, L.; Yuen, C. Algorithm-Unrolling-Based Distributed Optimization for RIS-Assisted Cell-Free Networks. IEEE Internet Things J. 2024, 11, 944–957.
24. Li, J.; Ye, M.; Huang, L.; Deng, X.; Qiu, H.; Wang, Y.; Jiang, Q. An Intelligent SDWN Routing Algorithm Based on Network Situational Awareness and Deep Reinforcement Learning. IEEE Access 2023, 11, 83322–83342.
25. Zhu, Z.; Chen, M.; Zhu, C.; Zhu, Y. Effective Defense Strategies in Network Security Using Improved Double Dueling Deep Q-Network. Comput. Secur. 2024, 136, 103578.
26. İpek, A.D.; Cicioğlu, M.; Çalhan, A. AIRSDN: AI Based Routing in Software-Defined Networks for Multimedia Traffic Transmission. Comput. Commun. 2025, 240, 108222.
27. Sanchez, L.P.A.; Shen, Y.; Guo, M. MDQ: A QoS-Congestion Aware Deep Reinforcement Learning Approach for Multi-Path Routing in SDN. J. Netw. Comput. Appl. 2025, 235, 104082.
28. Chen, L.; Lingys, J.; Chen, K.; Liu, F. AuTO: Scaling Deep Reinforcement Learning for Datacenter-Scale Automatic Traffic Optimization. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication (SIGCOMM), Budapest, Hungary, 20–25 August 2018.
29. Liu, P.; Bai, X.; Cheng, H.; Gao, X.; Su, J. RL-MR: Multipath Routing Based on Multi-Agent Reinforcement Learning for SDN-Based Data Center Networks. In Proceedings of the 2024 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), Kaifeng, China, 30 October–2 November 2024.
30. Okine, A.A.; Adam, N.; Naeem, F.; Kaddoum, G. Multi-Agent Deep Reinforcement Learning for Packet Routing in Tactical Mobile Sensor Networks. IEEE Trans. Netw. Serv. Manag. 2024, 21, 2155–2169.
31. Lyu, Y.; Hu, H.; Fan, R.; Liu, Z.; An, J.; Mao, S. Dynamic Routing for Integrated Satellite-Terrestrial Networks: A Constrained Multi-Agent Reinforcement Learning Approach. IEEE J. Sel. Areas Commun. 2024, 42, 1204–1218.
32. Wang, G.; Yang, F.; Song, J.; Han, Z. Optimization for Dynamic Laser Inter-Satellite Link Scheduling With Routing: A Multi-Agent Deep Reinforcement Learning Approach. IEEE Trans. Commun. 2024, 72, 2762–2778.
33. Park, S.; Kim, G.S.; Jung, S.; Kim, J. Markov Decision Policies for Distributed Angular Routing in LEO Mobile Satellite Constellation Networks. IEEE Internet Things J. 2024, 11, 38744–38754.
34. Kołakowski, R.; Tomaszewski, L.; Tępiński, R.; Kukliński, S. Hierarchical Traffic Engineering in 3D Networks Using QoS-Aware Graph-Based Deep Reinforcement Learning. Electronics 2025, 14, 1045.
35. Huang, W.; Liu, H.; Li, Y.; Ma, L. ERA-MADDPG: An Elastic Routing Algorithm Based on Multi-Agent Deep Deterministic Policy Gradient in SDN. Future Internet 2025, 17, 291.
36. Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. Adv. Neural Inf. Process. Syst. 2017, 30, 6382–6393.
37. Yue, X.; Wu, L.; Duan, W. Multi-Agent Reinforcement Learning With Contribution-Based Assignment Online Routing In SDN. In Proceedings of the 2022 19th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), Chengdu, China, 16–18 December 2022.
38. Xiao, Y.; Yu, H.; Yang, Y.; Wang, Y.; Liu, J.; Ansari, N. Adaptive Joint Routing and Caching in Knowledge-Defined Networking: An Actor-Critic Deep Reinforcement Learning Approach. IEEE Trans. Mob. Comput. 2025, 24, 4118–4135.
39. Zhou, W.; Jiang, X.; Luo, Q.; Guo, B.; Sun, X.; Sun, F.; Meng, L. AQROM: A Quality of Service Aware Routing Optimization Mechanism Based on Asynchronous Advantage Actor-Critic in Software-Defined Networks. Digit. Commun. Netw. 2024, 10, 1405–1414.
40. Uhlig, S.; Quoitin, B.; Lepropre, J.; Balon, S. Providing Public Intradomain Traffic Matrices to the Research Community. SIGCOMM Comput. Commun. Rev. 2006, 36, 83–86.
41. Malik, S.U.; Srinivasan, S.K.; Khan, S.U. A Methodology for OSPF Routing Protocol Verification. In Proceedings of the 12th International Conference on Scalable Computing and Communications (ScalCom), Changzhou, China, 24–26 October 2012.
Figure 1. Schematic diagram of network topology.
Figure 2. Intelligent routing architecture of RS-MADDPG algorithm.
Figure 3. RS-MADDPG algorithm processing flow.
Figure 4. RS-MADDPG algorithm framework diagram.
Figure 5. Mapping process.
Figure 6. GEANT network topology diagram.
Figure 7. Schematic diagram of the training process.
Figure 8. Performance comparison under different types of traffic.
Table 1. Experimental parameter settings.

Parameter | Specific Value
Training steps T | 200,000
Learning rate lr | 0.0001
Reward discount factor γ | 0.9
Experience replay pool size D | 10,000
Traffic load intensity H | 0.8
Type I traffic | n1 = 0.6, n2 = 0.2, n3 = 0.2
Type II traffic | n1 = 0.2, n2 = 0.6, n3 = 0.2
Type III traffic | n1 = 0.2, n2 = 0.2, n3 = 0.6
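To make the role of the Table 1 settings concrete, the following minimal Python sketch shows one way the per-traffic-type weights could parameterize a QoS reward that balances throughput, delay, and packet loss. It is an illustrative assumption rather than the exact reward formulation of RS-MADDPG; in particular, the mapping of n1, n2, and n3 to throughput, delay, and packet loss, the normalization bounds, and all identifiers (TRAINING_CONFIG, TRAFFIC_WEIGHTS, qos_reward) are hypothetical.

# Minimal sketch: assumed names and normalization, not the authors' released code.
TRAINING_CONFIG = {
    "training_steps_T": 200_000,      # Table 1 values
    "learning_rate_lr": 1e-4,
    "reward_discount_gamma": 0.9,
    "replay_buffer_size_D": 10_000,
    "traffic_load_intensity_H": 0.8,
}

# Assumption: n1 weights throughput, n2 weights delay, n3 weights packet loss.
TRAFFIC_WEIGHTS = {
    "type_I":   {"n1": 0.6, "n2": 0.2, "n3": 0.2},  # throughput-sensitive
    "type_II":  {"n1": 0.2, "n2": 0.6, "n3": 0.2},  # delay-sensitive
    "type_III": {"n1": 0.2, "n2": 0.2, "n3": 0.6},  # loss-sensitive
}

def qos_reward(traffic_type, throughput_mbps, delay_ms, loss_rate,
               max_throughput_mbps=100.0, max_delay_ms=200.0):
    """Weighted QoS reward: rewards throughput, penalizes delay and loss.

    Throughput and delay are normalized by assumed upper bounds so the three
    terms are on comparable scales; loss_rate is a fraction in [0, 1].
    """
    w = TRAFFIC_WEIGHTS[traffic_type]
    return (w["n1"] * (throughput_mbps / max_throughput_mbps)
            - w["n2"] * (delay_ms / max_delay_ms)
            - w["n3"] * loss_rate)

# Example: a delay-sensitive (Type II) flow is rewarded mainly for low delay.
print(qos_reward("type_II", throughput_mbps=73.0, delay_ms=132.0, loss_rate=0.06))

Under such a weighting, the same link-state observation yields different rewards for different traffic types, which is the mechanism that lets the agents learn type-specific routing decisions.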
Table 2. Comparative Analysis of Network Performance under Various Traffic Conditions.

Algorithm | Average Throughput (Mbps), Type 1 / Type 2 / Type 3 | Average Delay (ms), Type 1 / Type 2 / Type 3 | Average Packet Loss Rate (%), Type 1 / Type 2 / Type 3
SPF | 60 / 61 / 58 | 180 / 185 / 183 | 11 / 10 / 12
SR-DRL | 69 / 67 / 67 | 166 / 161 / 167 | 9 / 9 / 8
MALA | 71 / 69 / 70 | 158 / 144 / 155 | 8 / 7 / 6
RS-MADDPG | 75 / 73 / 72 | 144 / 132 / 145 | 6 / 6 / 3
Table 3. Communication Overhead Comparison.

Algorithm | Data Transmitted (KB/Agent/Cycle) | Communication Latency (ms)
MALA | 12.8 ± 1.5 | 8.7 ± 0.6
RS-MADDPG | 5.2 ± 0.8 | 4.3 ± 0.4
Table 4. Generalization Performance Across Topologies.

Topology/Scenario | Average Throughput (Mbps) | Average Delay (ms) | Average Packet Loss Rate (%)
GEANT | 75 | 144 | 6
Fat-Tree | 72 ± 1.8 | 151 ± 5.2 | 6.2 ± 0.3
NSFNET | 73 ± 2.1 | 138 ± 4.7 | 5.8 ± 0.2
Random Traffic | 74 ± 2.3 | 147 ± 5.1 | 6.1 ± 0.4
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
