Article

Multi-Agent Reinforcement Learning with Two-Layer Control Plane for Traffic Engineering

Department of Computing Systems and Automation, Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, 119991 Moscow, Russia
*
Author to whom correspondence should be addressed.
Mathematics 2025, 13(19), 3180; https://doi.org/10.3390/math13193180
Submission received: 10 August 2025 / Revised: 26 September 2025 / Accepted: 30 September 2025 / Published: 3 October 2025

Abstract

The article presents a new method for multi-agent traffic flow balancing. It is based on the MAROH multi-agent optimization method. However, unlike MAROH, the agent’s control plane is built on the principles of human decision-making and consists of two layers. The first layer ensures autonomous decision-making by the agent based on accumulated experience—representatives of states the agent has encountered, together with the actions to take in them. The second layer enables the agent to make decisions for unfamiliar states. A state is considered familiar to the agent if it is close, in terms of a specific metric, to a state the agent has already encountered. The article explores variants of state proximity metrics and various ways to organize the agent’s memory. It has been shown experimentally that an agent with the proposed two-layer control plane (SAMAROH-2L) outperforms an agent with a single-layer control plane: it makes decisions faster, and the reduction in inter-agent communication ranges from 1% to 80% relative to the simultaneous-actions method SAMAROH and from 80% to 96% relative to MAROH, depending on the selected similarity threshold.

1. Introduction

The growing size of data communication networks and increasing channel capacities require continuous improvement of methods for balancing data flows in the network. The random nature of the load and its high volatility make classical optimization methods inapplicable [1]. Multicommodity flow formulations, which belong to the class of NP-complete problems, also fail to address the issue, making their use inefficient for data flow balancing [2,3,4]. Consequently, significant attention has been paid to the applicability of machine learning methods for balancing.
The paper [5] presented the Multi-Agent Routing using Hashing (MAROH) balancing method based on a combination of the following techniques: multi-agent optimization, reinforcement learning (RL), and consistent hashing (this method is described in detail in Section 3). Experiments showed that MAROH outperformed popular methods like the Equal-cost Multi-Path (ECMP) and Unequal-cost Multi-Path (UCMP) in terms of efficiency.
This method involved the mutual exchange of data about agent states, which burdened the network’s bandwidth. The more agents involved in the exchange process, the greater this burden became. The construction of a new agent state and its action also required certain computations. Besides this, MAROH was based on a sequential decision-making model: only one agent, selected in a specific way, could change the flow distribution. Such organization assumes that the entire exchange process will be repeated multiple times for flow distribution adjustments.
The new method presented in this article aimed to reduce the number of inter-agent communications and the time for an agent to make an optimal action, as well as to remove the sequential agent activation model, thereby accelerating balancing and reducing the network load.
The number of inter-agent communications was reduced in two ways. First, agents act independently of each other after exchanging information with their neighbors, which limits both the number of inter-agent communications and the duration of a balancing process. The second novelty is an approach inspired by Nobel laureate Daniel Kahneman’s research on human decision-making under uncertainty [6]. The common point of human decision-making and network agent decision-making is that both operate in a situation of uncertainty. Therefore, by analogy with the two-system model of human decision-making proposed in [6] (fast intuitive and slow analytical reactions), the agent’s control plane was divided into two layers: experience and decision-making. The experience layer is activated when the current state is familiar to the agent, i.e., it is close, in terms of the metric, to a state the agent has encountered before. In this case, the agent applies the action that was previously successful, without any communication with other agents. If the current state is unfamiliar, i.e., it is farther than a certain threshold from all familiar states, the agent activates the second layer, which is responsible for processing an unfamiliar state, decision-making, and experience updating. It is important to note that this is a loose conceptual analogy rather than a strict implementation.
Thus, the main contributions of the article are the ways to reduce the number of inter-agent interactions, as well as the time and computational costs for agent decision-making, by
  • Replacing the sequential activation scheme of agents with the new scheme of independent agent activation (experiments showed this achieves a flow distribution close to optimal);
  • Inventing algorithms for a two-layer agent control plane that allows one to reduce the number of inter-agent exchanges and accelerates agent decision-making.
The rest of the article is organized as follows. Section 2 reviews works that use the history of agent states. Section 3 describes the MAROH method [5] that serves as the basis for the proposed method. Section 4 presents the proposed solutions. Section 5 describes the experimental methodology. The experimental results are presented in Section 6, and Section 7 discusses the achievements.

2. Related Work

Recent advances in multi-agent reinforcement learning have focused on improving coordination while mitigating communication overhead. Among these, Communication-Enhanced Value Decomposition Multi-Agent Reinforcement Learning (CVDMARL) [7] reduces the communication overhead by integrating a gated recurrent unit (GRU) to mine the temporal features in the historical data and obtain the communication matrix. In contrast to such methods that still rely on communication, albeit reduced, our proposed method eliminates the need for inter-agent messaging entirely in familiar states.
BayesIntuit [8] introduces the Dynamic Memory Bank—given a query, it retrieves a memory from the same semantic cluster, approximating nearest-neighbor retrieval in the learned latent space. Cosine similarity between the retrieved memory and current representation is used to provide a reliability score. However, BayesIntuit is developed for a single-agent supervised learning algorithm.
An analysis of publications revealed only a few papers [9,10,11] that consider a two-layer approach in combination with reinforcement learning. The work [9] provides an overview of memory-based methods, but they are focused on single-agent reinforcement learning. The study [10] proposes an approach based on episodic memory in a multi-agent setting, but the memory is used in a centralized manner to improve the training process. The work [11] outlines preliminary ideas with the lack of a detailed description of the proposed two-layer approach and presents only a simplified experimental study.
Memory Augmented Multi-Agent Reinforcement Learning for Cooperative Environments [12] incorporates long short-term memory (LSTM) into actors and critics to enhance MARL performance. This augmentation facilitates the extraction of relevant information from history, which is then combined with the features extracted from the current observations. However, this method is not designed to reduce inter-agent exchanges.
In [13], the asynchronous operation of agents is considered for balancing data flows between servers in data centers, but no interaction between agents is assumed. In [14], a case is considered where agents can act simultaneously, but the number of agents acting in parallel is limited by a constant.

3. Background

As mentioned earlier, the developed method is based on MAROH. Here, a brief description of MAROH is provided, necessary for understanding the new method. In MAROH, each agent manages one channel and calculates the weight of its channel to achieve uniform channel loading (see Figure 1). Uniformity is ensured by the multi-agent optimization of the functional Φ, which is the value of the channel load deviation from the average network load:
$$\Phi = \frac{1}{|E|} \sum_{(u,v) \in E} \left( \frac{b_{u,v}}{c_{u,v}} - \mu \right)^2, \quad (1)$$
where $b_{u,v}$ is the occupied channel bandwidth, $c_{u,v}$ is the nominal channel bandwidth, and
$$\mu = \frac{1}{|E|} \sum_{(u,v) \in E} \frac{b_{u,v}}{c_{u,v}} \quad (2)$$
is the average channel load in the network.
The choice of the functional Φ directly corresponds to the core optimization target of our method—balancing channel utilization—and aligns with the metrics used in prior works (e.g., MAROH, ECMP) for fair comparison. Balanced load distribution lowers congestion, which leads to lower delays and packet loss, both important for traffic engineering methods. However, our main contribution is a communication-efficient learning architecture, so we did not conduct experiments with other quality metrics.
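As an illustration of formulas (1) and (2), the following minimal sketch computes Φ and μ from per-channel occupied and nominal bandwidths; the channel data are hypothetical and serve only to illustrate the calculation, not the released implementation.

```python
# Minimal sketch: computing the balancing objective Phi and the mean load mu.
# The channel data below are made up for illustration.

def mean_load(channels):
    """mu: average relative channel load over all channels (edges)."""
    return sum(b / c for b, c in channels) / len(channels)

def phi(channels):
    """Phi: mean squared deviation of each channel load from the average load."""
    mu = mean_load(channels)
    return sum((b / c - mu) ** 2 for b, c in channels) / len(channels)

# Each tuple is (occupied bandwidth b_uv, nominal bandwidth c_uv) for one channel.
channels = [(40.0, 100.0), (55.0, 100.0), (70.0, 100.0), (35.0, 100.0)]
print(f"mu  = {mean_load(channels):.3f}")   # average load
print(f"Phi = {phi(channels):.5f}")         # lower Phi means better balance
```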
Balancing consists of distributing data flows proportionally to channel weights. This makes the agent’s operation independent of the number of channels on a network device. The agent state is the pair of its channel weight and channel load (Figure 1a). Each agent, upon changing its state (called a hidden state), transmits its state to its neighboring agents (Figure 1b). Neighbors B(v) are agents on adjacent nodes in the network topology. Each agent processes the hidden states received from its neighbors using a graph neural network, namely a Message Passing Neural Network (MPNN). The collected hidden states are processed by the agent’s special neural network, referred to as the “update,” which calculates the agent’s new hidden state (Figure 1c). This process is repeated K times. The value of K determines the size of the agent neighborhood domain, i.e., how many states of other agents are involved in the calculation of the current hidden state. K is a method parameter. After K iterations, the resulting hidden state is fed into a special neural network, referred to as the “readout”. The outputs of the readout network are used to select an action for weight adjustment via the softmax function. Figure 1d shows the readout neural network operation, finishing with a new weight; the steps of action selection are omitted there. Possible actions for weight calculation include the following:
  • Addition: +1;
  • Null action: +0;
  • Multiplication: ×k (k is a tunable parameter);
  • Division: /k (k is a tunable parameter).
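The following minimal sketch illustrates how an agent could apply one of the listed actions to its channel weight, with the action sampled from a softmax over the readout outputs. The logit values, the multiplier k, and the helper names are assumptions made for illustration and do not reproduce the released implementation.

```python
import numpy as np

# Hedged sketch: selecting and applying a weight-adjustment action.
K_FACTOR = 2.0  # tunable multiplier/divisor parameter k

ACTIONS = {
    "add_one":  lambda w: w + 1.0,
    "null":     lambda w: w,
    "multiply": lambda w: w * K_FACTOR,
    "divide":   lambda w: w / K_FACTOR,
}

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# One illustrative logit per action; in MAROH they come from the readout network.
readout_logits = np.array([0.4, -0.1, 1.2, 0.3])
probs = softmax(readout_logits)
action_name = np.random.choice(list(ACTIONS), p=probs)  # stochastic policy
new_weight = ACTIONS[action_name](3.0)                   # apply to current weight 3.0
print(action_name, new_weight)
```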
MAROH agents update weights not upon the arrival of each new flow but only when the change in channel load exceeds a specified threshold. In this way, the feedback on channel load fluctuations is regulated, which minimizes the overhead of the agent’s operation. New flows are distributed according to the current set of weights without waiting for the agent’s decisions. However, outdated weights may lead to suboptimal channel loading, so it is crucial for agents in the proposed method to decide on weight adjustments as quickly as possible.
The network load-balancing problem addressed by the MAROH method is formalized as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP), defined by the tuple $\langle S, A, P, R, \Omega, O, \gamma \rangle$, where
  • $S$ is the global state of the entire network. This includes the weight and load of every channel in the network. This full state is not accessible to any single agent.
  • $A$ is the joint action space, defined as the Cartesian product of each agent’s action space: $A = A_1 \times A_2 \times \dots \times A_n$. As mentioned previously, the action space for each agent consists of weight modification actions.
  • $P(s' \mid s, a)$ is the state transition probability function. This function is complex and unknown, sampled from the real network or a simulation.
  • $R(s, a)$ is the global reward and is defined as the difference in values of the objective function Φ (1).
  • $\Omega$ is the set of joint observations. Each agent $n$ receives a private observation $o_n \in \Omega_n$ containing the weight and load of its assigned channel.
  • $O(o \mid s, a)$ is the deterministic observation function that maps the state $s$ and action $a$ to a joint observation $o$.
  • $\gamma$ is the discount factor $(0 \le \gamma < 1)$, which determines how much an agent values future rewards compared to immediate ones.

4. Proposed Methods

This section presents solutions that allow agents to operate independently and transform the agent control plane into a two-layer one.

4.1. Simultaneous Actions MAROH (SAMAROH)

Let us introduce some terms. The time interval between agent hidden state initialization and action selection is called a horizon. For an agent, a new horizon begins when the channel load changes beyond a specified threshold. An episode is a fixed number of horizons required for the agent to establish the optimal channel weight. The limitation (only one agent acting) in MAROH was introduced for the theoretical justification of its convergence. However, experiments showed that convergence to the optimal solution persists even with asynchronous agent actions.
Let us consider the number of horizons required to achieve the optimal weights by MAROH and by simultaneously acting agents. Let $\{a_1, a_2, \dots, a_k\}$ be the set of optimal weights. For simplicity, suppose the agents are only allowed to perform the action that adds one to the weight (the reasoning with other actions is similar). In this case, MAROH will take $\sum_{i=1}^{k} a_i$ actions (and thus horizons) to achieve the optimal weights. When agents operate independently, they only need $\max_i a_i$ horizons. Therefore, independent agent operation can significantly reduce the time to obtain the optimal weights and reduce the number of horizons. For example, for a rhombus topology with eight agents, about 100 horizons are needed, but 20 are enough in the case of independent agent operation.
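For instance, with hypothetical optimal weights $a = (3, 7, 5)$ (numbers chosen purely for illustration) and only the $+1$ action allowed, the sequential scheme needs $3 + 7 + 5 = 15$ horizons, whereas simultaneously acting agents need only $\max(3, 7, 5) = 7$.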
In our experimental research, we implemented an abstraction where the algorithm progressed through a sequence of discrete horizons. The length of an episode (i.e., the number of horizons it contains) was treated as a hyperparameter whose optimal value was discovered empirically and shown in Section 5.
In the new approach called SAMAROH (Simultaneous Actions MAROH), all agents can independently adjust weights in each of their horizons. To simplify convergence, following [14], the number of agent actions was reduced to
  • Multiplication: ×k (k is a tunable parameter);
  • Null action: +0.
Algorithm 1 outlines the pseudo-code for agent decision-making under simultaneous action execution. The key modification compared to MAROH is implemented in lines 11–13, where each agent now computes its local policy and independently selects an action.
Algorithm 1. SAMAROH operation
1: Agents initialize their states $s_v^0$ based on link bandwidth and initial weight
2: for t ← 0 to T do
3:     $h_v^0$ ← ($s_v^t$, 0, …, 0)
4:     for k ← 0 to K do
5:         Agents share their current hidden state $h_v^k$ with neighboring agents B(v)
6:         Agents process the received messages: $M_v^k \leftarrow a_{\theta_a}\big(\{ m_{\theta_m}(h_v^k, h_\mu^k) \}_{\mu \in B(v)}\big)$
7:         Agents update their hidden state $h_v^{k+1} \leftarrow u(h_v^k, M_v^k)$
8:     end for
9:     Agents compute their actions’ logits: $\{logit_v^a\}_{a \in A(v)} \leftarrow r_{\theta_r}(h_v^K)$
10:    Agents compute the local policy $\pi_v^\theta \leftarrow \mathrm{CategoricalDist}(\{logit_v^a\}_{a \in A(v)})$
11:    Agents select an action $a_v^t$ according to policy $\pi_v^\theta$
12:    All agents execute their corresponding actions $a_v^t$
13:    Agents update their states $s_v^{t+1}$
14:    Evaluation of Φ
15: end for
In MAROH, agents at each horizon selected a single agent and the action it would perform. This occurred as follows: each agent collected a vector of readout neural network outputs from all other agents and applied the softmax function to select the best action. In SAMAROH, the agent collects only a local vector with the readout output for each action, further reducing inter-agent exchanges. Unlike [14], where the number of acting agents was restricted by a constant, our approach allows all agents to act independently. Experimental results (see Section 6) show that this does not affect the method’s accuracy.
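A minimal sketch of the local action selection in SAMAROH is given below, assuming the readout outputs are already available as a logit vector. The function and variable names are illustrative assumptions and are not taken from the released code.

```python
import numpy as np

def select_local_action(readout_logits, rng=None):
    """SAMAROH-style local selection: each agent samples its own action
    from a categorical distribution over its readout logits, without
    collecting other agents' outputs."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(readout_logits, dtype=float)
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Illustrative logits for the reduced action set {multiply by k, null action}.
action_index = select_local_action([0.8, -0.2])
print("selected action index:", action_index)
```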

4.2. Two-Layer Control Plane MAROH (MAROH-2L)

The second innovation, based on [6], divides the control plane into two layers: experience and decision-making, detailed below.

4.2.1. Experience Layer

Each agent stores past hidden states and the corresponding actions. Their collection is called the agent’s memory. Denote the memory of the $i$-th agent as $M_i$:
$$M_i = \left\{ \left( h_{i,l}^{j},\ a_{i,K}^{j} \right) \right\},\quad j \in \{1, \dots, m\},\ l \in \{0, \dots, K-1\},\ i \in \{1, \dots, n\}, \quad (3)$$
where $h_{i,l}^{j}$ is the hidden state of agent $i$ before the $(l+1)$-th exchange iteration, $m$ is the memory size (number of stored states), $n$ is the number of agents in the system, $K$ is the maximum number of exchange iterations, and $a_{i,K}^{j}$ is the action taken by the agent after completing $K$ exchange iterations.
When assessing how close the current state is to familiar states, the agent takes into account all the past hidden states at all intermediate iterations of the exchange in the MPNN ($l \in \{1, \dots, K-1\}$). This allows familiar states to be identified during the progression toward the horizon.
The second component of a memory element is the agent action, taken after the final K -th exchange iteration. However, it is possible to store the input or output of the readout network and reconstruct the selected action. In this case, an ε-greedy strategy can be applied to the agent’s decision-making process to improve learning efficiency [15], and the readout network can be further trained.
Let us call a possible memory element a case. Denote by $A$ the set of all cases, i.e., possible memory elements that may arise for an agent. Let $\Delta \in \mathbb{R}$, $\Delta > 0$, be called the threshold. Denote by $\delta(x, y)$ the metric on $A$, where $x, y \in A$. Recall that $\delta(x, y)$ is symmetric, non-negative, and satisfies the triangle inequality. If $\delta(x, y) \le \Delta$, then $x$ and $y$ are called close. A case $y \in A$ is familiar to the $i$-th agent if $\exists (x, a) \in M_i,\ x \in A: \delta(x, y) \le \Delta$.
Hidden states are vectors resulting from neural network transformations. The components of the vector $h = (h_1, \dots, h_D)$ are numerical representations obtained through nonlinear neural network transformations and have no direct interpretable meaning. The dimensionality $D$ of the hidden state is not critical for the method itself. However, it directly affects the volume of messages exchanged between agents, as the size of each message depends linearly on $D$. Moreover, the choice of $D$ allows for balancing the trade-off between the objective function’s value and the network load. As shown in [5], increasing $D$ generally improves the balancing quality but increases the overhead.
To measure the distance between vectors on the experience layer, the following two metrics were used:
  • Euclidean metric (L2):
    $$\delta_{L2}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \quad (4)$$
  • Manhattan metric (L1):
    $$\delta_{L1}(x, y) = \sum_{i=1}^{n} |x_i - y_i| \quad (5)$$
These metrics were chosen due to the equal significance of hidden state vector components and low computational complexity. Their comparative analysis is detailed in Section 6. In addition to the aforementioned metrics, the cosine distance (Cos) was also applied to measure the distance between vectors on the experience layer:
$$\delta_{COS}(x, y) = 1 - \frac{x \cdot y}{\|x\|\,\|y\|} = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\ \sqrt{\sum_{i=1}^{n} y_i^2}} \quad (6)$$
The cosine distance is not formally a metric, as it does not satisfy the triangle inequality. Nevertheless, as shown in [16], the problem of determining a cosine similarity neighborhood can be transformed into the problem of determining the Euclidean distance, and the cosine distance is widely used in machine learning [8,17].
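A minimal NumPy-based sketch of the three distance functions used on the experience layer is shown below; the example vectors are illustrative and the function names are assumptions of this sketch.

```python
import numpy as np

def dist_l2(x, y):
    """Euclidean (L2) distance between hidden-state vectors, see (4)."""
    return float(np.sqrt(np.sum((x - y) ** 2)))

def dist_l1(x, y):
    """Manhattan (L1) distance between hidden-state vectors, see (5)."""
    return float(np.sum(np.abs(x - y)))

def dist_cos(x, y):
    """Cosine distance, see (6); not a true metric (no triangle inequality)."""
    return float(1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([0.1, 0.5, -0.3, 0.7])   # illustrative hidden states
y = np.array([0.2, 0.4, -0.1, 0.6])
print(dist_l2(x, y), dist_l1(x, y), dist_cos(x, y))
```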
Since the agent’s memory size is limited, it only includes hidden states called representatives. The choice of representatives is discussed in Section 4.2.2 and Section 4.2.3. Although Δ is treated as fixed in this paper, the two-layer method could be generalized to the case where Δ is specific to each representative. The value of Δ is the parameter balancing the trade-off between the number of transmitted messages (i.e., how many close states are detected) and the solution quality. Each metric requires individual threshold tuning. The schematic of the experience layer is shown in Figure 2.
Depending on the method for selecting representatives, a hidden state may have either a single representative (Section 4.2.3) or multiple representatives (Section 4.2.2). In the latter case, the agent selects the representative with the minimal distance δ.
There may be cases where the agent memory stores suboptimal actions, as agents are not fully trained or employ an ε-greedy strategy [15]. Therefore, in the two-layer method, when a representative is found, an additional parameter γ [ 0 , 1 ] is used to determine the probability of transition to the experience layer. That is, with a certain probability, the agent may start interacting with other agents to improve its action selection, even if a representative for the current state exists in memory.
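The following sketch illustrates the experience-layer lookup: the agent searches its memory for the closest representative and, if one lies within the threshold Δ, reuses the stored action with probability γ; otherwise it falls back to the decision-making layer. The memory layout and all names are assumptions of this sketch, not the exact implementation.

```python
import random
import numpy as np

def experience_layer_lookup(h, memory, delta, gamma, dist=None):
    """Return a stored action if a representative within delta is found and
    the gamma-gate permits reusing experience; otherwise return None, meaning
    the agent must use the decision-making layer (i.e., communicate).

    memory: list of (representative_hidden_state, action) pairs.
    """
    if dist is None:
        dist = lambda a, b: float(np.linalg.norm(a - b))  # L2 by default
    if not memory:
        return None
    rep, action = min(memory, key=lambda item: dist(item[0], h))
    if dist(rep, h) <= delta and random.random() <= gamma:
        return action          # familiar state: act without any messages
    return None                # unfamiliar state, or the gamma-gate failed

# Illustrative usage with made-up numbers.
memory = [(np.array([0.1, 0.5]), "multiply"), (np.array([0.9, -0.2]), "null")]
print(experience_layer_lookup(np.array([0.12, 0.48]), memory, delta=0.1, gamma=0.9))
```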

4.2.2. Decision-Making Layer

At the decision-making layer, the agent, by communicating with other agents, selects an action for a new, unfamiliar state, as in MAROH. At this layer, the agent also enriches its memory by adding a new representative. Recall that the agent’s memory does not store all familiar cases but only their representatives. Two ways to determine the representatives were explored: experimentally, based on clustering methods, and theoretically, based on constructing an ε-net [18] (an ε-net is a subset $Z$ of a metric space $X$ for $M \subseteq X$ such that $\forall x \in M\ \exists z \in Z$ that is no farther than ε from $x$; since we are working in a metric space and the set of agent states is finite, an ε-net exists for it). Let Δ be the threshold—which will also serve as ε for the ε-net—the metric value beyond which a state is considered unfamiliar.
Consider the process of selecting memory representatives based on clustering. It is clear that clustering is computationally more complex than the ε-net approach and does not guarantee a single representative per state.
Initially, when the memory is empty, all hidden states are stored until the memory is exhausted. Hidden states are added to the memory once the MPNN message-processing cycle completes and the actions are known.
Any state without a representative in memory is a new candidate for representation. To avoid heavy clustering for each new candidate, they are gathered in a special array M e x t r a . Once this array is full, the joint clustering of arrays M and M e x t r a into | M | clusters is performed. After clustering, candidates and existing representatives are merged into clusters by metric δ , leaving one random state per cluster as the memory representative. Thus, after clustering, selected states remain in M , and M e x t r a is cleared. Several clustering algorithms can be used to find representatives. In [11], Mini-Batch K-Means showed the best performance in terms of complexity and message volume. However, since this algorithm only works with δ L 2 (4), Agglomerative Clustering [19] was used for other metrics.
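A minimal sketch of the clustering-based memory refresh under the L2 metric is given below, using scikit-learn’s MiniBatchKMeans. The array handling is simplified, and all names are illustrative assumptions rather than the released implementation.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def refresh_memory(memory, extra, memory_size, rng=None):
    """Jointly cluster stored representatives and new candidates into |M|
    clusters and keep one random state per cluster as the new memory.

    memory, extra: lists of (hidden_state, action) pairs.
    """
    rng = rng or np.random.default_rng()
    items = memory + extra
    states = np.array([h for h, _ in items])
    n_clusters = min(memory_size, len(items))
    labels = MiniBatchKMeans(n_clusters=n_clusters).fit_predict(states)
    new_memory = []
    for c in range(n_clusters):
        members = [item for item, lab in zip(items, labels) if lab == c]
        if members:
            new_memory.append(members[rng.integers(len(members))])  # one random representative
    return new_memory  # the caller clears M_extra after the refresh
```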
Algorithm 2 outlines the pseudo-code for agent decision-making with two-layer control planes. The experience layer is implemented in lines 5–7, whereas the decision-making layer comprises the remaining lines. Additional logic for experience updates is introduced in lines 8 and 17. The proposed clustering approach is encapsulated within line 8. Line 17 is required since the action is undefined at line 8.
Algorithm 2. MAROH 2L operation
1: Agents initialize their states $s_v^0$ based on link bandwidth and initial weight
2: for t ← 0 to T do
3:     $h_v^0$ ← ($s_v^t$, 0, …, 0)
4:     for k ← 0 to K do
5:         if $\exists \langle s, a_v^s \rangle \in M : \delta(s, h_v^k) \le \Delta$ then
6:             $a^t \leftarrow a_v^s$; $v \leftarrow v^s$
7:             goto 17
8:         Update M with $h_v^k$
9:         Agents share their current hidden state $h_v^k$ with neighboring agents B(v)
10:        Agents process the received messages: $M_v^k \leftarrow a_{\theta_a}\big(\{ m_{\theta_m}(h_v^k, h_\mu^k) \}_{\mu \in B(v)}\big)$
11:        Agents update their hidden state $h_v^{k+1} \leftarrow u(h_v^k, M_v^k)$
12:    end for
13:    Agents compute their actions’ logits: $\{logit_v^a\}_{a \in A(v)} \leftarrow r_{\theta_r}(h_v^K)$
14:    Agents receive other agents’ logits and compute the global policy $\pi^\theta \leftarrow \mathrm{CategoricalDist}(\{\{logit_v^a\}_{a \in A(v)}\}_{v \in V})$
15:    Using the same random seed, agents select an action $a^t \in A_v$ for $v \in V$ according to policy $\pi^\theta$
16:    Update M with $a^t$
17:    Agent v executes action $a^t$
18:    Agents update their states $s_v^{t+1}$
19:    Evaluation of Φ
20: end for
The clustering algorithm with minimal complexity is Mini-Batch K-Means. Its complexity is $O(b \cdot |M| \cdot d \cdot I)$ [20], where $b$ is the batch size, $|M|$ is the number of clusters, $d$ is the data dimensionality, and $I$ is the number of iterations. Although the original source [20] does not explicitly include dimensionality, we introduce $d$ here to account for the cost per feature. The similar-state search complexity in line 5 is $O(|M| \cdot d)$. Thus, the price of the two-layer approach is a computational complexity per horizon step in the worst case of $O(b \cdot |M| \cdot d \cdot I)$. For a complete episode, this incurs an additional factor of $T / |M_{extra}|$, resulting in an overall episodic complexity of $O(b \cdot |M| \cdot d \cdot I \cdot T / |M_{extra}|)$, where $T$ denotes the episode length and $|M_{extra}|$ represents the size of the additional array $M_{extra}$.
Such complexity may be unacceptable for large-scale networks. Section 4.2.3 proposes an alternative method eliminating clustering, reducing the complexity to $O(|M| \cdot d)$ using ε-nets [18].

4.2.3. Decision-Making Layer: ε -Net-Based Method

This section describes the representative selection algorithm based on ε -nets [18] and its theoretical justification. It is simpler and avoids computationally intensive clustering.
Let $A$ be a finite set of cases. Randomly select $c_0 \in A$. For any $c \in A$ such that $\delta(c_0, c) \le \Delta$, $c_0$ is considered the representative of $c$. The representatives are denoted as $\hat{c}_i$ and their set as $\hat{A} \subseteq A$. To obtain a unique representative for any case, for all $\hat{c}, \hat{c}' \in \hat{A}$ the condition $\delta(\hat{c}, \hat{c}') \ge 2\Delta$ must hold.
Representative Selection Method: Let $\hat{A} \neq \emptyset$. For each new element $c_t$ from $A$, search in $\hat{A}$ for $\hat{c}_x$ such that $\delta(\hat{c}_x, c_t) \le \Delta - \epsilon$, where $\epsilon$ is arbitrarily small and known in advance. If such $\hat{c}_x$ exists, it is accepted as the representative for $c_t$.
If no such $\hat{c}_x$ exists in $\hat{A}$, find $\hat{c}_y \in \hat{A}$ minimizing $\delta(\hat{c}_y, c_t) - \Delta$. Let $\arg\min_{\hat{c}_y \in \hat{A}}\big(\delta(\hat{c}_y, c_t) - \Delta\big) = \hat{c}_{y^*}$.
If $\delta(\hat{c}_{y^*}, c_t) - \Delta < \Delta$, declare $c_t$ a temporary representative $\check{c}$ and add it to $\check{A} \subset A$, where $\check{A} \cap \hat{A} = \emptyset$.
Next, search $\check{A}$ for a temporary representative $\check{c}$ such that $\delta(\check{c}, c_t) < \Delta$ but $\min \delta(\hat{c}_{y^*}, \check{c}) - \Delta < \min \delta(\hat{c}_{y^*}, c_t) - \Delta < \Delta$, and replace $\check{c}$ with $c_t$ in $\check{A}$, declaring it the new temporary representative $\check{c}$. This process is repeated until $\min \delta(\hat{c}_{y^*}, \check{c}) - \Delta > \Delta$. Because we are working in a metric space and $A$, the set of agent states, is finite, the process will complete at some time $t$, and the ε-net [18] is obtained. If $\check{A}$ is non-empty, temporary representatives will have ε-net regions that overlap with those of $\hat{A}$ representatives. For these overlaps, assign cases to a single representative, e.g., only to $\hat{A}$ representatives with the same action as the new case.
If $\delta(\hat{c}_{y^*}, c_t) - \Delta \ge \Delta$, then declare $c_t$ a representative and remove from $\check{A}$ all temporary $\check{c}$ such that $\delta(\check{c}, c_t) < \Delta$.
This part of the representative selection procedure constitutes the essence of the decision-making layer and replaces lines 5–8 of Algorithm 2 with Algorithm 3. In practice, the choice between the clustering-based and ε-net-based approaches should consider factors such as the state-space geometry, the rate of environmental change, and the available computational and memory resources. A general-purpose criterion for this choice remains an open question for future work.
Algorithm 3. Representative selection using ε-net
Input: $c_t$—new hidden state, $\hat{A}$—set of representatives, $\check{A}$—set of temporary representatives
1: $\hat{c}_{y^*}$ ← an arbitrary element of $\hat{A}$
2: for each $\hat{c}_x$ in $\hat{A}$ do
3:     if $\delta(\hat{c}_x, c_t) \le \Delta - \epsilon$ then return $\hat{c}_x$
4:     if $\delta(\hat{c}_x, c_t) \le \delta(\hat{c}_{y^*}, c_t)$ then $\hat{c}_{y^*} \leftarrow \hat{c}_x$
5: end for
6: $c^* \leftarrow c_t$
7: if $\delta(\hat{c}_{y^*}, c_t) < 2\Delta$ then
8:     for each $\check{c}$ in $\check{A}$ do
9:         if $\delta(c_t, \check{c}) \ge \Delta$ or $\delta(\hat{c}_{y^*}, \check{c}) \ge 2\Delta$ then continue
10:        if $\delta(\hat{c}_{y^*}, \check{c}) \le \delta(\hat{c}_{y^*}, c_t)$ then
11:            replace $\check{c}$ in $\check{A}$ with $c_t$
12:        else
13:            $c^* \leftarrow \check{c}$
14:    end for
15:    if $c^* \notin \check{A}$ then add $c^*$ to $\check{A}$
16: else
17:    for each $\check{c}$ in $\check{A}$ do
18:        if $\delta(c_t, \check{c}) < \Delta$ then remove $\check{c}$ from $\check{A}$
19:    end for
20:    Add $c_t$ to $\hat{A}$
21: return $c^*$
Lemma 1. 
This procedure ensures, for $\delta_{L2}$ (4) and $\delta_{L1}$ (5), that $\forall \hat{c}_k, \hat{c}_l \in \hat{A}: \delta(\hat{c}_k, \hat{c}_l) \ge 2\Delta$.
Proof of Lemma 1. 
The algorithm maintains the invariant that all pairs of representatives in $\hat{A}$ are separated by at least 2Δ. This is ensured by lines 7, 16, and 20 of Algorithm 3, where each new hidden state added to $\hat{A}$ (line 20) must be at least 2Δ away from all other representatives in $\hat{A}$ (lines 7 and 16). □
Theorem 1. 
Let r be the representative of state h, obtained by Algorithm 3. Then, exactly one of the following conditions is satisfied for $\delta_{L2}$ (4) and $\delta_{L1}$ (5):
1. 
$\exists!\ \hat{c}_k \in \hat{A}: \hat{c}_k = r$ and $\delta(\hat{c}_k, h) < \Delta$;
2. 
$\exists\ \hat{c}_k \in \hat{A}: \hat{c}_k = r$ and $\delta(\hat{c}_k, h) = \Delta$;
3. 
$\exists\ \check{c} \in \check{A}: \check{c} = r$ and $\delta(\check{c}, h) < \Delta$.
Proof of Theorem 1. 
First, consider the case where $\exists \hat{c}_k \in \hat{A}: \hat{c}_k = r$ and $\delta(\hat{c}_k, h) < \Delta$. The uniqueness in Condition 1 is satisfied according to Lemma 1. In the case $\delta(r, h) = \Delta$, it is possible that $\exists \hat{c}_{k_1}, \hat{c}_{k_2} \in \hat{A}: \delta(\hat{c}_{k_1}, h) = \delta(\hat{c}_{k_2}, h) = \Delta$, and Condition 2 holds. If there is no representative in $\hat{A}$, the distance from the temporary representative will be less than Δ, according to lines 9 and 15 of Algorithm 3. Finally, no other condition can be true, because either there is a representative from $\hat{A}$ according to lines 2–5 and 16–21, or there is a representative from $\check{A}$ according to lines 6–15 and 21. □
The computational complexity of the ε-net-based method is $O(|\hat{A}| \cdot d)$ arithmetic operations for comparing a new hidden state (a $d$-dimensional vector) with each representative (lines 2–5) and $O(|\check{A}| \cdot d)$ for comparing it with the temporary representatives (lines 8–14 or 16–18). The total complexity is $O((|\check{A}| + |\hat{A}|) \cdot d)$, or $O(|M| \cdot d)$ in the case where all representatives are stored in $M$.
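A minimal Python sketch of the core of the ε-net-based representative selection is shown below. It is a simplified version of Algorithm 3: the step that iteratively replaces temporary representatives is omitted, and all set handling and names are assumptions made for illustration.

```python
import numpy as np

def l2(a, b):
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def epsilon_net_select(c_t, reps, temps, delta, eps=1e-9, dist=l2):
    """Simplified eps-net sketch.

    reps:  list of permanent representatives (kept pairwise >= 2*delta apart).
    temps: list of temporary representatives.
    Returns the representative assigned to c_t; reps/temps are updated in place.
    """
    if not reps:
        reps.append(c_t)                     # first case becomes a representative
        return c_t
    nearest = min(reps, key=lambda r: dist(r, c_t))
    if dist(nearest, c_t) <= delta - eps:    # an existing representative is close enough
        return nearest
    if dist(nearest, c_t) >= 2 * delta:      # far from every representative: promote c_t
        temps[:] = [t for t in temps if dist(t, c_t) >= delta]  # purge overlapping temporaries
        reps.append(c_t)
        return c_t
    temps.append(c_t)                        # otherwise keep c_t as a temporary representative
    return c_t
```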
The two-layer control plane applies to both SAMAROH and MAROH. Their modifications are denoted SAMAROH-2L (presented in Algorithm 4) and MAROH-2L, respectively.
Algorithm 4. SAMAROH 2L operation
1: Agents initialize their states $s_v^0$ based on link bandwidth and initial weight
2: for t ← 0 to T do
3:     $h_v^0$ ← ($s_v^t$, 0, …, 0)
4:     for k ← 0 to K do
5:         if $\exists \langle s, a_v^s \rangle \in M : \delta(s, h_v^k) \le \Delta$ then
6:             $a_v^t \leftarrow a_v^s$;
7:             goto 17
8:         Update M with $h_v^k$
9:         Agents share their current hidden state $h_v^k$ with neighboring agents B(v)
10:        Agents process the received messages: $M_v^k \leftarrow a_{\theta_a}\big(\{ m_{\theta_m}(h_v^k, h_\mu^k) \}_{\mu \in B(v)}\big)$
11:        Agents update their hidden state $h_v^{k+1} \leftarrow u(h_v^k, M_v^k)$
12:    end for
13:    Agents compute their actions’ logits: $\{logit_v^a\}_{a \in A(v)} \leftarrow r_{\theta_r}(h_v^K)$
14:    Agents compute the local policy $\pi_v^\theta \leftarrow \mathrm{CategoricalDist}(\{logit_v^a\}_{a \in A(v)})$
15:    Agents select an action $a_v^t$ according to policy $\pi_v^\theta$
16:    Update M with $a_v^t$
17:    All agents execute their corresponding actions $a_v^t$
18:    Agents update their states $s_v^{t+1}$
19:    Evaluation of Φ
20: end for

5. Materials and Methods

All algorithms of the proposed methods were implemented in Python 3.10 with the TensorFlow 2.16.1 framework and are publicly available on Zenodo [21]. The code is in the dte_stand directory. Input data for the experiments (results in Section 6) are in the data_examples directory. Each experiment input included the following:
  • Network topology;
  • Load as traffic matrix;
  • Balancing algorithm and its parameters (memory size, threshold Δ, metric, and clustering algorithm).
Given the computational complexity of multi-agent methods, experiments were restricted to four topologies. Two of them (Figure 3) were synthetic and symmetric, for transparency of demonstration. The symmetry was used for debugging purposes—under uniform load conditions, the method was expected to produce identical weights, while under uneven channel loads, the resulting weights were expected to balance the load distributions. The other topologies (Figure 4) were taken from TopologyZoo [22], selected for ≤100 agents (for computational feasibility with the constraint of no more than one week per experiment) and maximal alternative routes. All channels were bidirectional with equal capacity. The number of agents is twice the number of links, as each agent is responsible for monitoring its own unidirectional channel.
Network load was represented by a set of flows, each defined by a tuple (source, destination, rate, start/end time), together with the schedule of their beginnings. Flows were generated for average network loads of 40% or 60%.
The investigated algorithms and their parameters are specified in tabular form in Table 1 and in the following section.
Results were evaluated based on the objective function Φ (1) (also called solution quality) and the total number of agent exchanges.
To evaluate the optimality of the obtained solutions, the results of the experiments were compared with those of a centralized genetic algorithm. Although the genetic algorithm provides only an approximate, suboptimal solution to the NP-complete balancing problem—and, unlike the proposed method, relies on a centralized view of all channel states—the values obtained by the genetic algorithm served as our benchmark. The genetic algorithm was selected because it has been applied to various network load-balancing problems [23,24].
The genetic algorithm with the objective to minimize the target function Φ was organized as follows. The crossover operation between two solutions (i.e., two sets of agent weights) worked as follows: over several iterations, Solution 1 and Solution 2 exchanged the weights of all agents at a randomly selected network node. The mutation operation involved multiple iterations where the weights of all agents at a randomly chosen node were cyclically shifted, followed by updating a random subset of these weights with new random values. Solution selection was performed based on the value of the objective function Φ.
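A minimal sketch of the crossover and mutation operators described above is given below, assuming each solution is stored as a mapping from a node to the list of agent weights at that node. The data layout, parameter values, and helper names are assumptions made for illustration and do not reproduce the exact implementation.

```python
import random

def crossover(sol1, sol2, iterations=3, rng=None):
    """Swap the weights of all agents at a randomly chosen node between two solutions.
    A solution maps node -> list of agent weights at that node."""
    rng = rng or random.Random()
    child1 = {n: w[:] for n, w in sol1.items()}
    child2 = {n: w[:] for n, w in sol2.items()}
    for _ in range(iterations):
        node = rng.choice(list(child1))
        child1[node], child2[node] = child2[node], child1[node]
    return child1, child2

def mutate(sol, iterations=3, max_weight=10.0, rng=None):
    """Cyclically shift the weights at a random node, then re-randomize a random subset."""
    rng = rng or random.Random()
    mutant = {n: w[:] for n, w in sol.items()}
    for _ in range(iterations):
        node = rng.choice(list(mutant))
        weights = mutant[node]
        weights.insert(0, weights.pop())                      # cyclic shift
        subset = rng.sample(range(len(weights)), k=rng.randint(1, len(weights)))
        for i in subset:
            weights[i] = rng.uniform(1.0, max_weight)         # random refresh
    return mutant
```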
The experiments were conducted under the assumption of synchronous switching between the layers to ensure the correct computation of the hidden state by each agent. Although this assumption is a restriction, our experiments show that, even with it, SAMAROH-2L still yields significant gains in both decision-making speed and a reduction in inter-agent communications. Removing this assumption is a promising direction for future research.

6. Experimental Results

6.1. MAROH vs. SAMAROH

In the first series of experiments, the effectiveness of simultaneous weight adjustments by all agents was compared with sequential adjustments, where only a single agent performs actions. Figure 5, Figure 6, Figure 7 and Figure 8 exhibit the comparison between the MAROH and SAMAROH methods for the 4-node topology (Figure 3a) and the Abilene topology (Figure 4a) under different load conditions. In these figures, solid lines represent the average Φ values over intervals of 2000–3000 episodes, while vertical bars indicate the range from minimum to maximum Φ values across these intervals. Single points to the right of the line plots mark the minimum values of Φ averaged over 2000-episode intervals.
In Figure 5, Figure 6, Figure 7 and Figure 8, the green line with a dot marker representing SAMAROH’s objective function values consistently remains below the blue one with a cross marker corresponding to MAROH’s performance. This demonstrates that, despite all agents acting simultaneously, SAMAROH not only maintains almost the same quality but actually improves it significantly. This happens because the agents need to be trained to act in fewer horizons per episode, which simplifies the training.
These figures also show the results obtained with uniform weights, representing traditional load-balancing approaches like ECMP. As evident across all plots, the green line with the dot marker (SAMAROH) consistently shows closer alignment with the centralized algorithm’s performance (indicated by a dark green solid horizontal line) than the gray dashed line representing ECMP.
SAMAROH required five times fewer horizons than the original MAROH (20 vs. 100 for the 4-node topology and 50 vs. 250 for Abilene). This translates into a fivefold reduction in both inter-agent communications and decision-making time, as achieving comparable results now requires significantly fewer horizons.
Conditions under which SAMAROH does not achieve a solution quality as good as MAROH were also examined. Table 2 presents a comparison of the objective function values for the Abilene topology under a 40% load for different values of the trajectory length (a PPO parameter that defines the period of neural network weight updates) and the Adam optimizer step size (a parameter that controls the learning rate). Multiple values of the optimizer step size were chosen because different trajectory lengths may require different learning rates to achieve a solution of similar quality. The Φ values shown in Table 2 are derived as follows: for each experimental run, the average Φ value over the last 2000 episodes is calculated, and the average and standard deviation of these values from six independent runs (on different random seeds) are presented in the table. Optimizer step sizes are shown as a pair of values corresponding to the actor and critic networks, respectively. The conclusion from Table 2 is that SAMAROH achieves a worse solution quality than MAROH for a trajectory length of 1 episode and a better quality than MAROH for larger trajectory lengths, such as 5, 25, or 75 episodes (when choosing the optimizer step size that achieves the best solution quality among the examined values). For all other experiments presented in this work, the trajectory length was chosen as 75 episodes, and the Adam step sizes for experiments on the Abilene topology were chosen as 0.00015 for the actor network and 0.00085 for the critic network.

6.2. Research of Two-Layer Approach

The next series of experiments was dedicated to evaluating the efficiency of the two-layer approach. Figure 9, Figure 10, Figure 11 and Figure 12 present the comparison of this approach across all topologies, benchmarking the method against SAMAROH, the centralized genetic algorithm, and ECMP. In the legend, the two-layer method’s agent communication count (shown in parentheses) is expressed relative to SAMAROH’s baseline number of exchanges. In this series, the 4-node topology required a memory capacity of 512 states, while the other topologies required a memory capacity of 1024 states.
In Figure 9, Figure 10, Figure 11 and Figure 12, the objective function values for SAMAROH-2L (shown as blue and green lines with cross and dot markers, respectively, for different proximity thresholds, except for Figure 12, which has only one threshold) consistently match or exceed the performance of SAMAROH, shown as a yellow line with a down-arrow marker.
The general conclusion from these results is as follows: with a moderate reduction in message exchanges (up to 20% for the Abilene topology, Figure 11), the two-layer method achieves objective function values that slightly differ from the SAMAROH ones. With more substantial reductions in exchanges, the two-layer method either yields significantly worse solution quality (for the Abilene topology, Figure 11) or requires more training episodes to match SAMAROH’s performance (for the 4-node topology, Figure 9). Meanwhile, the actual reduction in inter-agent exchanges can be adjusted between 25 and 50% depending on the required solution accuracy.
Figure 11 demonstrates that the proximity threshold Δ has a significant impact on the objective function value as well as on the number of inter-agent exchanges. Furthermore, the two-layer method’s performance also depends on two key parameters: the memory size and the comparison metric employed. The corresponding series of experiments systematically evaluates the method’s efficacy across varying configurations of these parameters (see Table 3, Table 4 and Table 5).
Table 3 presents a comparison of the objective function values and exchange counts for the symmetric 4-node topology under a 40% load, while Table 4 and Table 5 show corresponding data for the Abilene topology at the same load level. Each reported value represents an average and standard deviation across six experimental runs on different random seeds. Due to computational constraints (with a single experiment involving 96 agents requiring over a week to complete), comprehensive parameter comparisons were not conducted for larger topologies. The Φ values shown in the tables correspond to the average over the last 2000 episodes of each experimental run. The amount of data exchanged per episode is calculated as an average by all the episodes, considering that, for each horizon, each agent exchanges from one to K messages with its neighbors four times (once for data collection and once for each training epoch). Each message occupies 64 bytes, since it represents the hidden state of an agent consisting of 16 real numbers. The number of exchanges between agents is given as a percentage relative to the number of exchanges in the corresponding single-layer method (that is, relative to MAROH for MAROH-2L and relative to SAMAROH for SAMAROH-2L).
For the 4-node topology, the memory size was fixed at 512 states based on the results from [11], representing the maximum capacity we could use to store representatives. As shown in Table 3, Φ values remain relatively stable at small threshold values, but the objective function begins to degrade as the threshold grows. This pattern corresponds to the one observed in the Abilene topology results (Table 4 and Table 5). It is explained by the fact that, with an increasing threshold, states that correspond to different actions become close; however, the agent will apply the same action in both states. In the conducted series of experiments, the threshold value was set manually, but this raises the question of an automatic method for selecting this parameter so that a compromise between the value of the objective function and the number of exchanges is maintained.
It must be kept in mind that the analyzed processes are non-stationary, and the observed values have very intricate interrelations. Under such conditions, the application of statistical tests would be unjustified from the viewpoint of the correctness of statistical inferences for the area under consideration. It is unreasonable to expect that the strong theoretical conditions required for such statistical tests to be adequate hold in this area. It is either impossible to strictly verify the conditions of the tests’ adequacy, or these conditions are obviously violated. For example, all sample elements (the values of the objective function) are positive and hence remain bounded from the left even after appropriate centering, whereas the normal distribution of the sample elements (assumed when the Student t-test is used) has the whole real line as its support (meaning that any non-empty interval must have positive probability under the normal distribution). Therefore, when discussing the consistency of the obtained results, we are forced to restrict ourselves to non-formal conclusions, for example, based on visual distinction or coincidence. Nevertheless, following the common practice in related studies (e.g., BayesIntuit [8]), we report the values of the corresponding statistical tests. For SAMAROH-2L on the Abilene topology (for all metrics with a memory size of 1024 and for the L1 metric with a memory size of 512) and for MAROH-2L on the 4-node topology, the increase in the objective function at higher thresholds was confirmed by the Student t-test or the Mann–Whitney U test with p-values less than 0.05 (ranging from 5.8 × 10−5 to 0.024 for each set of thresholds), with samples consisting of 12–36 independent experiments. A more comprehensive exploration of these parameters is an important direction for the future development of the proposed method.
The metric comparison across Table 3, Table 4 and Table 5 revealed that the L2 metric achieves a superior solution quality under the same number of exchanges.
For the Abilene topology, the memory size was set to 512 states (Table 4), consistent with the configuration in [11] for a comparable 4-node topology with a similar number of agents. Increasing the memory capacity to 1024 states (Table 5) while maintaining the same threshold led to a degradation of the objective function. This can be explained by the fact that a larger number of representatives will have intersecting Δ-neighborhoods that can correspond to different actions, which justifies the relevance of the method for selecting representatives without intersecting Δ-neighborhoods proposed in Section 4.2.3.
A comparison between MAROH-2L and SAMAROH-2L in Table 3 reveals that, while using identical memory parameters, SAMAROH-2L achieves a 30–70% reduction in message exchanges, whereas MAROH-2L only attains a 7–13% reduction. This performance gap likely stems from MAROH-2L’s fivefold greater number of horizons per episode, which proportionally generates more hidden states—consequently, the same memory size proves inadequate when compared to SAMAROH-2L’s requirements.
In Table 3, Table 4 and Table 5, the number of exchanges between agents for SAMAROH-2L is given as a percentage relative to the number of exchanges in SAMAROH. Since SAMAROH and SAMAROH-2L also have a reduction in the number of horizons in one episode relative to MAROH, the reduction in the number of exchanges in SAMAROH-2L relative to MAROH is much more significant. Thus, for Table 4 and Table 5, the number of horizons in one episode was reduced by five times; therefore, for example, the reduction in the number of communications to 80.06% relative to SAMAROH means that it was reduced to 0.8006 / 5 × 100% ≈ 16.01% relative to MAROH (this number ranges from 14.4% to 16.87% over different runs of the corresponding experiment). This reduction ratio can also be obtained by dividing values of exchanged data amounts for different algorithms in the tables.
In conclusion, the two-layer SAMAROH-2L method achieved significant improvements over the single-layer SAMAROH method: for the eight-agent topology, it reduced the objective function value from 0.0265 to 0.0236 while cutting communication exchanges by 40.92% (from 100% to 59.08% relative to the SAMAROH baseline), and for the Abilene topology, it reduced the objective function value from 0.0211 to 0.0208 while cutting communication exchanges by 19.94% (from 100% to 80.06% relative to the SAMAROH baseline).

7. Discussion

The efficiency of the proposed two-layer method was experimentally proven. This is especially clearly visible from the experiments with the Abilene topology, as the method maintained high-quality solutions (with the objective function reduced from 0.0211 to 0.0208) while achieving a 19.94% reduction in the number of communications relative to the corresponding single-layer method.
Promising directions for future research include
  • Investigating the effectiveness of the representative selection method proposed in Section 4.2.3;
  • Developing adaptive algorithms for state proximity threshold tuning;
  • Creating dynamic memory management techniques with the intelligent identification of obsolete states and optimal memory sizing based on current network conditions;
  • Studying channel quality indicators (e.g., delay, throughput, and packet loss) in large-scale topologies under high load when balancing with the proposed methods, as well as comparing them with modern approaches;
  • Optimizing computational complexity for large-scale network topologies.
A significant area of improvement for the method is the development of an adaptive algorithm for adjusting the state proximity threshold Δ. One possible approach is to calculate the maximum permissible number of representatives based on the available memory capacity and dynamically adjust the global threshold Δ. If the number of representatives exceeds the maximum allowable number, the threshold can be increased by merging representatives, provided that their associated actions are similar. A more sophisticated strategy could involve assigning individual thresholds to each representative based on the density of hidden states in the metric space. In dense regions, where similar states require different actions, a smaller threshold is beneficial. Conversely, in sparse regions, a larger threshold can be used to reduce memory consumption.
The current limitations of the method primarily involve extended training durations for 96-agent systems (necessitating distributed simulation frameworks) and the substantial number of required training episodes (4000–10,000). While accelerating multi-agent methods was not this study’s primary focus, it remains a crucial direction for future research.
While the proposed two-layer control architecture is developed and validated in the context of traffic engineering, its design principles—experience-based decision and collaborative problem-solving for novel states—might be possible to generalize to other cooperative multi-agent systems with similar characteristics (e.g., distributed control, partial observability, and communication constraints). However, further empirical validation in non-networking domains is needed to substantiate this potential.

8. Conclusions

A new multi-agent method for traffic flow balancing is presented, based on two fundamental innovations: abandoning the sequential model of agent decision-making and developing a two-layer, human-like agent control plane in multi-agent optimization. For the agent control plane, the concept of agent hidden-state representatives reflecting the agent’s experience was proposed. This concept has made it possible to significantly reduce the number of stored hidden states. Two methods of representative selection were proposed and investigated. It was experimentally shown that the proposed innovations significantly reduce the amount of inter-agent communication and agent decision-making time while maintaining the quality of flow balancing with respect to the original objective function.
The proposed method has significant potential for scalability, which is important for large-scale networks. Once a state representative is found in memory, action search is very efficient, requiring only a linear memory scan; this complexity can be further reduced to logarithmic time by using specialized data structures for similarity search, such as k-d trees. It should also be noted that when a representative for an agent’s state is not found, communication overhead remains limited, since the agent exchanges messages only within its k-hop neighborhood, without broadcasting them to the entire network. Moreover, since the method works by adjusting channel weights, data packets are never delayed for agent decisions and can continue to be forwarded using existing weights until new ones are calculated. A key question for future research is determining the optimal weight update frequency to minimize the duration of suboptimal routing while maintaining low communication overhead.

Author Contributions

E.S.: methodology, project administration, software, writing—original draft, and validation. R.S.: methodology, conceptualization, writing—review and editing, and supervision. I.G.: software, investigation, visualization, and formal analysis. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by The Ministry of Economic Development of the Russian Federation in accordance with the subsidy agreement (agreement identifier 000000C313925P4H0002; grant No 139-15-2025-012).

Data Availability Statement

The original data presented in the study are openly available in Zenodo at https://doi.org/10.5281/zenodo.17208706.

Acknowledgments

We acknowledge graduate student Ariy Okonishnikov for his contributions to the initial implementation and preliminary experimental validation of the method.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CVDMARL: Communication-Enhanced Value Decomposition Multi-Agent Reinforcement Learning
Dec-POMDP: Decentralized Partially Observable Markov Decision Process
ECMP: Equal-Cost Multi-Path
GRU: Gated Recurrent Unit
LSTM: Long Short-Term Memory
MARL: Multi-Agent Reinforcement Learning
MAROH: Multi-Agent Routing Using Hashing
MAROH-2L: MAROH with Two-Layer Control Plane
MLP: Multi-Layer Perceptron
MPNN: Message Passing Neural Network
NP: Non-deterministic Polynomial time
PPO: Proximal Policy Optimization
RL: Reinforcement Learning
SAMAROH: Simultaneous Actions MAROH
UCMP: Unequal-Cost Multi-Path

References

  1. Moiseev, N.N.; Ivanilov, Y.P.; Stolyarova, E.M. Optimization Methods; Nauka: Moscow, Russia, 1978; 352p.
  2. Wang, I.L. Multicommodity network flows: A survey, Part I: Applications and formulations. Int. J. Oper. Res. 2018, 15, 145–153.
  3. Wang, I.L. Multicommodity network flows: A survey, Part II: Solution methods. Int. J. Oper. Res. 2018, 15, 155–173.
  4. Even, S.; Itai, A.; Shamir, A. On the complexity of time table and multi-commodity flow problems. In Proceedings of the 16th Annual Symposium on Foundations of Computer Science (sfcs 1975), Washington, DC, USA, 13–15 October 1975.
  5. Stepanov, E.P.; Smeliansky, R.L.; Plakunov, A.V.; Borisov, A.V.; Zhu, X.; Pei, J.; Yao, Z. On fair traffic allocation and efficient utilization of network resources based on MARL. Comput. Netw. 2024, 250, 110540.
  6. Kahneman, D. Thinking, Fast and Slow; Macmillan: New York, NY, USA, 2011.
  7. Chang, A.; Ji, Y.; Wang, C.; Bie, Y. CVDMARL: A communication-enhanced value decomposition multi-agent reinforcement learning traffic signal control method. Sustainability 2024, 16, 2160.
  8. Bornacelly, M. BayesIntuit: A Neural Framework for Intuition-Based Reasoning. In North American Conference on Industrial Engineering and Operations Management-Computer Science Tracks; Springer Nature: Cham, Switzerland, 2025; pp. 117–132.
  9. Ramani, D. A short survey on memory based reinforcement learning. arXiv 2019, arXiv:1904.06736.
  10. Zheng, L.; Chen, J.; Wang, J.; He, J.; Hu, Y.; Chen, Y.; Fan, C.; Gao, Y.; Zhang, C. Episodic multi-agent reinforcement learning with curiosity-driven exploration. Adv. Neural Inf. Process. Syst. 2021, 34, 3757–3769.
  11. Okonishnikov, A.A.; Stepanov, E.P. Memory mechanism efficiency analysis in multi-agent reinforcement learning applied to traffic engineering. In Proceedings of the 2024 International Scientific and Technical Conference Modern Computer Network Technologies (MoNeTeC), Moscow, Russia, 29–31 October 2024.
  12. Kia, M.; Cramer, J.; Luczak, A. Memory Augmented Multi-agent Reinforcement Learning for Cooperative Environment. In International Conference on Artificial Intelligence and Soft Computing; Springer Nature: Cham, Switzerland, 2024; pp. 92–103.
  13. Yao, Z.; Ding, Z.; Clausen, T. Multi-agent reinforcement learning for network load balancing in data center. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–22 October 2022; pp. 3594–3603.
  14. Bernárdez, G.; Suárez-Varela, J.; López, A.; Shi, X.; Xiao, S.; Cheng, X.; Barlet-Ros, P.; Cabellos-Aparicio, A. MAGNNETO: A graph neural network-based multi-agent system for traffic engineering. IEEE Trans. Cogn. Commun. Netw. 2023, 9, 494–506.
  15. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, UK, 1998; Volume 1.
  16. Kryszkiewicz, M. The cosine similarity in terms of the euclidean distance. In Encyclopedia of Business Analytics and Optimization; IGI Global: Hershey, PA, USA, 2014; pp. 2498–2508.
  17. Xia, P.; Zhang, L.; Li, F. Learning similarity with cosine similarity ensemble. Inf. Sci. 2015, 307, 39–52.
  18. Yosida, K. Functional Analysis; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1995; Volume 123.
  19. Overview of Clustering Methods. Available online: https://scikit-learn.org/stable/modules/clustering.html#overviewof-clustering-methods (accessed on 5 August 2025).
  20. Peng, K.; Leung, V.C.M.; Huang, Q. Clustering approach based on mini batch kmeans for intrusion detection system over big data. IEEE Access 2018, 6, 11897–11906.
  21. Garkavy, I. Estepanov-Lvk/Maroh: MAROH-2L V1.0.2. Zenodo. 2025. Available online: https://zenodo.org/records/17208706 (accessed on 26 September 2025).
  22. Topology Zoo. Available online: https://github.com/sk2/topologyzoo (accessed on 31 July 2025).
  23. Kang, S.-B.; Kwon, G.-I. Load balancing of software-defined network controller using genetic algorithm. Contemp. Eng. Sci. 2016, 9, 881–888.
  24. Singh, A.R.; Devaraj, D.; Banu, R.N. Genetic algorithm-based optimisation of load-balanced routing for AMI with wireless mesh networks. Appl. Soft Comput. 2019, 74, 122–132.
Figure 1. Schematic of the MAROH method [5].
Figure 2. Schematic of the experience layer operation for the eighth agent.
Figure 3. Symmetric topologies for experiments: (a) 4 nodes with 8 agents; (b) 16 nodes with 64 agents.
Figure 4. Topologies from the TopologyZoo library: (a) Abilene—modified to reduce the number of transit vertices while preserving the asymmetry of the original topology. The final configuration contains 7 nodes and 20 agents; (b) Geant2009—with degree-1 vertices removed. The final configuration consists of 30 nodes and 96 agents.
Figure 5. Dependence of the objective function value on the episode number for MAROH and SAMAROH methods under 40% load for 4-node topology: (a) average values over 2000-episode intervals; (b) average values and min-max ranges over 2000-episode intervals.
Figure 6. Dependence of the objective function value on the episode number for MAROH and SAMAROH methods under 40% load for Abilene topology: (a) average values over 3000-episode intervals; (b) average values and min-max ranges over 3000-episode intervals.
Figure 7. Dependence of the objective function value on the episode number for MAROH and SAMAROH methods under 60% load for 4-node topology: (a) average values over 2000-episode intervals; (b) average values and min-max ranges over 2000-episode intervals.
Figure 8. Dependence of the objective function value on the episode number for MAROH and SAMAROH methods under 60% load for Abilene topology: (a) average values over 3000-episode intervals; (b) average values and min-max ranges over 3000-episode intervals.
Figure 9. Dependence of the objective function value on the episode number under 40% load for symmetric 4-node topology with 8 agents: (a) average values over 2000-episode intervals; (b) average values and min-max ranges over 2000-episode intervals.
Figure 10. Dependence of the objective function value on the episode number under 40% load for symmetric 16-node topology with 64 agents: (a) average values over 3000-episode intervals; (b) average values and min-max ranges over 3000-episode intervals.
Figure 11. Dependence of the objective function value on the episode number under 40% load for Abilene topology from TopologyZoo Library: (a) average values over 3000-episode intervals; (b) average values and min-max ranges over 3000-episode intervals.
Figure 12. Dependence of the objective function value on the episode number under 40% load for Geant2009 topology from TopologyZoo Library: (a) average values over 2700-episode intervals; (b) average values and min-max ranges over 2700-episode intervals.
Table 1. Algorithm hyperparameter values. Rows with a single value apply to all four topologies.

Hyperparameter | 4-Node Topology | 16-Node Topology | Abilene | Geant2009
K (number of message iterations) | 2 | 5 | 3 | 5
Link state size | 16 | 16 | 16 | 16
Agent actions | ×k, +0
k (multiplication action factor) | 1.25
Number of horizons in an episode (SAMAROH, SAMAROH-2L) | 20 | 100 | 50 | 150
Number of horizons in an episode (MAROH, MAROH-2L) | 100 | – | 250 | –
Trajectory length | 75 episodes | 75 episodes | 75 episodes | 75 episodes
Clip range | 0.25
Discount (γ) | 0.9
GAE parameter (λ) | 0.95
Entropy coefficient | 0.001
VF coefficient | 0.5
Max grad norm | 1.0
Minibatch size | 25
Number of epochs | 3
Policy optimization algorithm | Proximal policy optimization (PPO)
Optimizer | Adam
Adam step size (actor) | 1.5 × 10⁻⁴ | 1.5 × 10⁻⁴ | 1.5 × 10⁻⁴ | 3 × 10⁻⁵
Adam step size (critic) | 8.5 × 10⁻⁴ | 8.5 × 10⁻⁴ | 8.5 × 10⁻⁴ | 1.7 × 10⁻⁴
Adam β1 | 0.9
Adam ε | 0.00001
Epsilon-greedy action selection ε | 0.9
Actor and critic networks | Multi-layer perceptron (MLP)
Hidden layers (actor) | [128, 64]
Hidden layers (critic) | [128, 64]
Activation function (actor, critic) | tanh
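
For convenience when reproducing the runs, the values in Table 1 can be collected into a single configuration object. The snippet below is a minimal sketch in Python; the dictionary keys, the TOPOLOGY_OVERRIDES structure, and build_config are illustrative names of our own and are not taken from the released MAROH-2L code.

```python
# Minimal sketch: Table 1 hyperparameters gathered into a Python configuration.
# Key names and the merging helper are illustrative assumptions, not identifiers
# from the released MAROH-2L implementation.

COMMON_CONFIG = {
    "link_state_size": 16,
    "agent_actions": ["*k", "+0"],        # multiply the flow weight by k, or leave it unchanged
    "action_factor_k": 1.25,
    "trajectory_length_episodes": 75,
    "clip_range": 0.25,
    "discount_gamma": 0.9,
    "gae_lambda": 0.95,
    "entropy_coef": 0.001,
    "vf_coef": 0.5,
    "max_grad_norm": 1.0,
    "minibatch_size": 25,
    "num_epochs": 3,
    "policy_algorithm": "PPO",
    "optimizer": "Adam",
    "adam_beta1": 0.9,
    "adam_eps": 1e-5,
    "epsilon_greedy": 0.9,
    "actor_hidden_layers": [128, 64],
    "critic_hidden_layers": [128, 64],
    "activation": "tanh",
}

# Per-topology values from Table 1 (None marks configurations that were not run).
TOPOLOGY_OVERRIDES = {
    "4-node":    {"message_iterations": 2, "horizons_samaroh": 20,
                  "horizons_maroh": 100, "actor_lr": 1.5e-4, "critic_lr": 8.5e-4},
    "16-node":   {"message_iterations": 5, "horizons_samaroh": 100,
                  "horizons_maroh": None, "actor_lr": 1.5e-4, "critic_lr": 8.5e-4},
    "Abilene":   {"message_iterations": 3, "horizons_samaroh": 50,
                  "horizons_maroh": 250, "actor_lr": 1.5e-4, "critic_lr": 8.5e-4},
    "Geant2009": {"message_iterations": 5, "horizons_samaroh": 150,
                  "horizons_maroh": None, "actor_lr": 3e-5, "critic_lr": 1.7e-4},
}

def build_config(topology: str) -> dict:
    """Merge the common hyperparameters with the topology-specific ones."""
    return {**COMMON_CONFIG, **TOPOLOGY_OVERRIDES[topology]}

if __name__ == "__main__":
    print(build_config("Abilene"))
```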
Table 2. Experimental comparison results of objective function of the proposed methods for Abilene topology with different values of trajectory length and optimizer step size over 8000 episodes.

Trajectory length | Φ: SAMAROH, step sizes = [0.00003, 0.00017] | Φ: SAMAROH, step sizes = [0.00003 × 5, 0.00017 × 5] | Φ: MAROH, step sizes = [0.00003, 0.00017] | Φ: MAROH, step sizes = [0.00003 × 5, 0.00017 × 5]
Trajectory = 1 episode | 0.0280 ± 0.0021 | 0.0314 ± 0.0028 | 0.0271 ± 0.0020 | 0.0284 ± 0.0010
Trajectory = 5 ep. | 0.0242 ± 0.0013 | 0.0258 ± 0.0027 | 0.0250 ± 0.0013 | 0.0261 ± 0.0014
Trajectory = 25 ep. | 0.0280 ± 0.0031 | 0.0226 ± 0.0023 | 0.0255 ± 0.0011 | 0.0264 ± 0.0008
Trajectory = 75 ep. | 0.0267 ± 0.0030 | 0.0235 ± 0.0025 | 0.0262 ± 0.0011 | 0.0264 ± 0.0006
Table 3. Experimental comparison results of the proposed methods for the 4-node topology with 8 agents over 6000 episodes.

Algorithm | Clustering | Metric | Memory Size | Threshold | Φ (Mean ± Std) | Number of Exchanges (%) (Mean ± Std) | Data Exchanged Per Episode, Mbytes (Mean ± Std)
Genetic | – | – | – | – | 0.0205 | – | –
MAROH | – | – | 0 | – | 0.0293 ± 0.0042 | 100 ± 0% | 0.78 ± 0
SAMAROH | – | – | 0 | – | 0.0265 ± 0.0043 | 100 ± 0% | 0.16 ± 0
MAROH-2L | Mini-Batch K-Means | L2 | 512 | 0.007 | 0.0244 ± 0.0017 | 89.02 ± 8.37% | 0.70 ± 0.07
 | | | | 0.010 | 0.0234 ± 0.0008 | 92.89 ± 4.58% | 0.73 ± 0.04
 | | | | 0.015 | 0.0248 ± 0.0043 | 86.84 ± 9.92% | 0.68 ± 0.08
 | | | | 0.018 | 0.0238 ± 0.0011 | 90.92 ± 3.15% | 0.71 ± 0.02
 | | | | 0.025 | 0.0268 ± 0.0039 | 69.94 ± 20.31% | 0.55 ± 0.16
 | | | | 0.030 | 0.0235 ± 0.0009 | 75.72 ± 4.13% | 0.59 ± 0.03
 | | | | 0.040 | 0.0242 ± 0.0007 | 61.75 ± 12.34% | 0.48 ± 0.10
 | | | | 0.050 | 0.0262 ± 0.0019 | 57.51 ± 9.73% | 0.45 ± 0.08
 | | | | 0.060 | 0.0255 ± 0.0009 | 56.47 ± 5.35% | 0.44 ± 0.04
 | | | | 0.070 | 0.0291 ± 0.0031 | 36.59 ± 15.10% | 0.29 ± 0.12
SAMAROH-2L | Mini-Batch K-Means | L2 | 512 | 0.007 | 0.0260 ± 0.0025 | 67.24 ± 9.78% | 0.11 ± 0.02
 | | | | 0.010 | 0.0236 ± 0.0034 | 59.08 ± 9.40% | 0.09 ± 0.01
 | | | | 0.015 | 0.0265 ± 0.0037 | 47.62 ± 7.30% | 0.07 ± 0.01
 | | | | 0.018 | 0.0265 ± 0.0050 | 31.57 ± 12.00% | 0.05 ± 0.02
 | Agglomerative Clustering | L1 | 512 | 0.01 | 0.0251 ± 0.0053 | 88.35 ± 1.92% | 0.14 ± 0.00
 | | | | 0.03 | 0.0264 ± 0.0036 | 77.82 ± 7.15% | 0.12 ± 0.01
 | | | | 0.04 | 0.0279 ± 0.0055 | 51.84 ± 6.31% | 0.08 ± 0.01
 | | | | 0.05 | 0.0284 ± 0.0026 | 35.95 ± 10.63% | 0.06 ± 0.02
 | | Cos | 512 | 1 × 10⁻⁷ | 0.0273 ± 0.0045 | 80.98 ± 11.11% | 0.13 ± 0.02
 | | | | 3 × 10⁻⁷ | 0.0261 ± 0.0022 | 77.05 ± 10.10% | 0.12 ± 0.02
 | | | | 4 × 10⁻⁷ | 0.0258 ± 0.0029 | 58.12 ± 13.65% | 0.09 ± 0.02
 | | | | 5 × 10⁻⁷ | 0.0276 ± 0.0028 | 58.51 ± 11.04% | 0.09 ± 0.02
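
In Tables 3–5, the Metric and Threshold columns control when an agent treats a newly observed state as familiar: the state is compared with the representatives stored in its memory, and if it lies within the threshold under the chosen metric (L1, L2, or cosine dissimilarity), the experience layer acts without triggering an inter-agent exchange. The snippet below is only a minimal illustration of such a familiarity check, assuming states are NumPy vectors; the function names are our own and do not come from the released MAROH-2L code.

```python
import numpy as np

def dissimilarity(state, prototype, metric):
    """Distance between a new state and a stored representative.

    'L1' and 'L2' are the usual Minkowski distances; 'Cos' is the cosine
    dissimilarity 1 - cos(angle), so 0 means identical direction.
    """
    if metric == "L1":
        return float(np.abs(state - prototype).sum())
    if metric == "L2":
        return float(np.linalg.norm(state - prototype))
    if metric == "Cos":
        denom = np.linalg.norm(state) * np.linalg.norm(prototype)
        if denom == 0.0:
            return 1.0
        return float(1.0 - np.dot(state, prototype) / denom)
    raise ValueError(f"unknown metric: {metric}")

def is_familiar(state, memory, metric, threshold):
    """True if some stored representative is closer than the threshold,
    i.e. the agent can act from experience without an inter-agent exchange."""
    return any(dissimilarity(state, m, metric) <= threshold for m in memory)
```

The very small cosine thresholds in the tables (on the order of 10⁻⁷) are consistent with this kind of dissimilarity, which is close to zero for nearly collinear state vectors.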
Table 4. Experimental comparison results of the proposed methods for Abilene topology with 512 memory size over 15,000 episodes.

Algorithm | Clustering | Metric | Memory Size | Threshold | Φ (Mean ± Std) | Number of Exchanges (%) (Mean ± Std) | Data Exchanged Per Episode, Mbytes (Mean ± Std)
Genetic | – | – | – | – | 0.0179 | – | –
MAROH | – | – | 0 | – | 0.0250 ± 0.0005 | 100 ± 0% | 10.99 ± 0
SAMAROH | – | – | 0 | – | 0.0211 ± 0.0006 | 100 ± 0% | 2.20 ± 0
SAMAROH-2L | Mini-Batch K-Means | L2 | 512 | 0.035 | 0.0219 ± 0.0009 | 90.73 ± 2.48% | 1.99 ± 0.05
 | | | | 0.040 | 0.0223 ± 0.0025 | 87.64 ± 3.81% | 1.93 ± 0.08
 | | | | 0.050 | 0.0224 ± 0.0019 | 85.14 ± 3.73% | 1.87 ± 0.08
 | | | | 0.060 | 0.0214 ± 0.0023 | 79.68 ± 3.29% | 1.75 ± 0.07
 | Agglomerative Clustering | L1 | 512 | 0.052 | 0.0215 ± 0.0010 | 99.71 ± 0.34% | 2.19 ± 0.01
 | | | | 0.062 | 0.0209 ± 0.0016 | 99.45 ± 0.19% | 2.19 ± 0.00
 | | | | 0.067 | 0.0223 ± 0.0013 | 98.44 ± 1.00% | 2.16 ± 0.02
 | | | | 0.077 | 0.0219 ± 0.0014 | 96.57 ± 1.05% | 2.12 ± 0.02
 | | Cos | 512 | 1.5 × 10⁻⁷ | 0.0228 ± 0.0022 | 99.12 ± 0.67% | 2.18 ± 0.01
 | | | | 3.0 × 10⁻⁷ | 0.0212 ± 0.0017 | 98.69 ± 0.43% | 2.17 ± 0.01
 | | | | 4.5 × 10⁻⁷ | 0.0225 ± 0.0015 | 98.60 ± 0.70% | 2.17 ± 0.02
 | | | | 7.5 × 10⁻⁷ | 0.0211 ± 0.0010 | 97.27 ± 0.60% | 2.14 ± 0.01
Table 5. Experimental comparison results of the proposed methods for Abilene topology with 1024 memory size over 15,000 episodes.

Algorithm | Clustering | Metric | Memory Size | Threshold | Φ (Mean ± Std) | Number of Exchanges (%) (Mean ± Std) | Data Exchanged Per Episode, Mbytes (Mean ± Std)
Genetic | – | – | – | – | 0.0179 | – | –
MAROH | – | – | 0 | – | 0.0250 ± 0.0005 | 100 ± 0% | 10.99 ± 0
SAMAROH | – | – | 0 | – | 0.0211 ± 0.0006 | 100 ± 0% | 2.20 ± 0
SAMAROH-2L | Mini-Batch K-Means | L2 | 1024 | 0.035 | 0.0212 ± 0.0013 | 83.84 ± 2.60% | 1.84 ± 0.06
 | | | | 0.040 | 0.0208 ± 0.0015 | 80.06 ± 4.62% | 1.76 ± 0.10
 | | | | 0.050 | 0.0276 ± 0.0046 | 69.08 ± 3.76% | 1.52 ± 0.08
 | | | | 0.060 | 0.0315 ± 0.0010 | 54.59 ± 2.55% | 1.20 ± 0.06
 | Agglomerative Clustering | L1 | 1024 | 0.052 | 0.0214 ± 0.0013 | 96.08 ± 0.91% | 2.11 ± 0.02
 | | | | 0.062 | 0.0208 ± 0.0011 | 88.12 ± 1.87% | 1.94 ± 0.04
 | | | | 0.067 | 0.0209 ± 0.0007 | 80.74 ± 2.06% | 1.77 ± 0.05
 | | | | 0.077 | 0.0294 ± 0.0008 | 57.65 ± 1.15% | 1.27 ± 0.03
 | | Cos | 1024 | 1.5 × 10⁻⁷ | 0.0217 ± 0.0023 | 96.98 ± 1.64% | 2.13 ± 0.04
 | | | | 3.0 × 10⁻⁷ | 0.0212 ± 0.0005 | 90.16 ± 0.96% | 1.98 ± 0.02
 | | | | 4.5 × 10⁻⁷ | 0.0235 ± 0.0024 | 73.47 ± 1.68% | 1.61 ± 0.04
 | | | | 7.5 × 10⁻⁷ | 0.0309 ± 0.0010 | 35.99 ± 4.52% | 0.79 ± 0.10
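
A rough cross-check of the last two columns (our own observation, not a relation stated in the tables): the traffic per episode closely tracks the full-exchange baseline scaled by the reported percentage of exchanges. For example, in Table 4 the SAMAROH baseline of 2.20 Mbytes scaled by 90.73% gives 2.20 × 0.9073 ≈ 2.00 Mbytes, consistent with the reported 1.99 ± 0.05 Mbytes for Mini-Batch K-Means with threshold 0.035.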
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
