1. Introduction
Over recent decades, the integration of Artificial Intelligence (AI) and the Internet of Things (IoT) has significantly transformed process automation [1,2] toward Cyber–Physical Systems (CPS) [3], driven by applications such as self-learning [4], anomaly detection [5], and predictive maintenance [6].
To meet modern industrial demands, workflows must become more adaptive and flexible, a key to higher efficiency, fewer errors, and reliable scalability. Consequently, there is growing interest in distributed optimization strategies that rely on flexible, autonomous control approaches [7,8]. Addressing these requirements calls for a distributed Multi-Agent System (MAS) approach [9], capable of aligning local objectives with global optimization goals.
Typical machine learning-based distributed optimization algorithms are found in both game theory (GT) and reinforcement learning (RL). While RL is commonly applied in areas like robot navigation [10], job shop scheduling [11], and autonomous driving [12], GT is increasingly used in domains such as urban traffic control [13] and cloud security [14]. The use of multi-agent GT architectures with simple best response learning for optimization in large-scale production environments based on Potential Games (PG) has been demonstrated and further extended by state information in [15]. Ref. [16] built on this using multi-step model predictors in model-based learning to train State-based Potential Game (SbPG) players in simulated environments for self-optimizing manufacturing systems. Furthermore, Transfer Learning (TL) approaches have been introduced to infer similarities between players during training [17]. In parallel, other works combined SbPGs with Stackelberg strategies in modular systems, assigning roles to players to facilitate multi-objective optimization via simplified utility functions [18], and incorporating gradient-based learning methods [19].
Building on these principles, we extend these strategies and focus on the following challenges: First, RL reaches its limits in complex multi-agent systems, particularly due to high computing costs, slow convergence, and problems with generalization [20,21]. In GT, the assumption of rational players and the search for stable equilibria such as the Nash equilibrium complicate practical applicability in dynamic, real-world scenarios [22]. Second, the best response strategies employed so far rely on static grids, which are highly ineffective for dynamic systems [20]. This results in increased computational load, incorrect decisions, and limited generalizability [21]. Third, in dynamic and complex production environments with multiple products and machines, traditional optimization methods reach their limits [23]. High variance, machine dependencies, and constant rescheduling make it difficult to make efficient decisions and adapt quickly to new production conditions [24].
To tackle the first challenge, we propose a novel agent structure that combines GT-based learning with Evolutionary Algorithms (EAs). This enables more diverse strategies for optimizing manufacturing [25]. Combining EAs with best response strategies [15,26] can be expected to offer more efficient learning. To address the inherent limitations of static grid structures for best response learning, we propose a novel dynamic grid structure and integrate it into an EA-based learning strategy. Furthermore, we tackle the complexity of dynamic and complex environments by introducing a knowledge transfer framework which allows us to exchange gained knowledge between individual learning agents, further increasing the efficiency of the self-learning process.
The main contributions of this work can be summarized as follows:
We introduce a novel Evolutionary State-based Potential Game, called Evo-SbPG, to enable adaptive and scalable distributed optimization. We formally provide convergence guarantees for the novel game structure.
We propose a state–action representation, called dynamic grid structures (DGS), which allows for a more flexible representation of agent policies.
We propose a novel EA-embedded knowledge transfer scheme between agents to share existing knowledge and explore new strategies on the DGS level.
We evaluate the approaches developed in a production environment and compare the results of different alternative learning strategies, highlighting the effectiveness of the approach.
The remainder of this work is structured as follows: Section 2 gives an overview and evaluation of the current literature. Section 3 presents the problem statement, and Section 4 presents the framework and the definition of the Evo-SbPG. Section 5 explains the fundamental approach of the DGS, which is extended by a multi-agent knowledge transfer. The convergence analysis is provided in Section 6. An analysis of the results and an extensive discussion are provided in Section 7, with a final conclusion in Section 8.
3. Problem Description
We focus on distributed production systems as illustrated in Figure 1.
The system, divided into functional sub-areas, operates on two interconnected levels, the process level and the communication level, see Figure 2. At the process level, production transfer involves the movement of goods between sub-processes. Simultaneously, knowledge transfer enables continuous communication via interfaces such as Ethernet, field bus systems, or wireless links. Depending on system complexity, interactions can occur sequentially or in parallel while forming discrete, continuous, or hybrid process structures.
We model the structure of distributed manufacturing systems using graph theory. In particular, we formulate a directed graph with vertex set and edge set . Further, we consider a group of actuators with corresponding action space and the state space of the production system . For each actuator, we define its upstream state space and a downstream state space . To model the production objectives, we define fitness functions with and where denotes global states known to all agents.
In our study, we allow agent–agent knowledge transfer via the communication level. This leads to a transfer set that promotes knowledge transfer between EA agents i and j. Here, transfer is limited to agent pairs, defined as .
Consequently, the overall objective of distributed optimization can be stated as the maximization of a globally defined fitness by solely maximizing the local fitness functions of each module. The latter is achieved by combining ideas from SbPG with evolutionary processes. Such distributed production scenarios can be found in various areas of industrial production, such as food production, the automotive industry, or pharmaceutical production.
4. Framework and Evo-SbPG Definition
In this section, we introduce an EA-based multi-agent structure, as shown in Figure 3. To this end, we employ an EA-based agent i for each actuator with corresponding action set including discrete actions or continuous actions .
To allow for distributed optimization according to Equation (1), we propose to leverage multi-agent structures from game theory, namely SbPG, and define them in the context of EAs. This results in an EA agent structure defined as follows.
Definition 1. A game defines an Evo-SbPG if a global objective can be found that, for every state–action pair, conforms to the conditions in Equations (2) and (3). Thereby, the optimization uses an evolutionary population-based learning process. Equation (3) is standard in SbPGs and assures the contractivity of the potential function with respect to the state variable. The most crucial characteristic of standard SbPGs [26] is their convergence to equilibrium points given a best response strategy. Furthermore, various conditions have been derived to prove the existence of an SbPG for a given game setting [53]. These properties directly translate to Evo-SbPG. In what follows, we will present a novel best response learning strategy using evolutionary operations, bringing together convergent best response learning and evolutionary operations.
5. EA-Based Learning with Dynamic Grids
A central aspect is the interaction of SbPGs within the EA framework. In what follows, building on the research of [15,19] regarding best response learners, we start by extending the best response learners to use DGS. Then, we demonstrate the adaptation of evolutionary principles through population modelling and detail the integration of genetic operators by recombination, mutation, and selection, which is categorized into local and global scales. Lastly, we show the implementation of a convergent knowledge transfer procedure.
5.1. Characteristics of Best Response Learning
Existing approaches for optimizing SbPG are based on best response learning [15], later expanded to gradient-based learning in [19]. These works use static, fixed grids to map states into resulting actions. The grids are defined by support vectors whose assigned state value is fixed while the action values are continuously adjusted according to a best response strategy, see Figure 4a. The computation of an action given an actual state value is then computed as a distance-weighted sum of all support vectors with and
However, the static grids are inefficient for nonlinear optimization, while best response strategies often exhibit slow convergence. In the following, we extend the learning strategies using population-based learning within dynamic, adjustable grid structures as shown in Figure 4b. The process schematic of dynamic adjustment and coverage maintenance is illustrated in Figure 4c.
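The distance-weighted action computation described above can be sketched as follows. This is a hedged illustration assuming inverse-distance weighting over the support vectors; the exact weighting kernel used in [15,19] may differ, and all function and variable names here are ours.

```python
import numpy as np

def interpolate_action(state, support_states, support_actions, eps=1e-9):
    """Distance-weighted interpolation of an action from grid support vectors.

    Assumed form: normalized inverse-distance weighting over all support
    vectors, falling back to the stored action on an exact grid hit.
    """
    state = np.atleast_1d(state)
    dists = np.linalg.norm(support_states - state, axis=1)
    if np.any(dists < eps):                      # exact hit on a support vector
        return float(support_actions[np.argmin(dists)])
    weights = 1.0 / dists
    weights /= weights.sum()                     # normalize to a convex combination
    return float(weights @ support_actions)
```

By construction, the interpolated action always lies within the convex hull of the stored action values, which keeps the policy bounded even between support vectors.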
To this end, we first consider the support vectors as individuals of a population, each consisting of its state value, action value, and an associated fitness value. Second, we use recombination and mutation to encourage diversity in the action update. Third, we update the state values within the population based on the system dynamics. Fourth, we use a combination of local and global selection to ensure a constant population size in every step. We will now present these steps in detail.
5.2. Population of Support Vectors
We define, for each support vector, an individual within a population defined as with state support vector , corresponding action , and associated fitness . We employ a finite number of individuals in a population of agent i, resulting in
Note that the above population covers state–action space and fitness value, while the action has to be optimized. Hence, we propose to employ the regular evolutionary operations, i.e., recombination and mutation, solely to the actions, while updating the state vector via the regular system dynamics.
5.2.1. Recombination
To maintain genetic diversity and enhance robustness, individuals are recombined. This mechanism is addressed in [
37] by
where the different individual’s actions
and
are sampled from the population and recombined to form the action
,
is a random parameter, and
K is the number of recombined individuals.
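The recombination of K sampled parent actions can be sketched as an intermediate recombination, i.e., a random convex combination of the parents. This is an illustrative stand-in; the exact operator of [37] may weight parents differently.

```python
import numpy as np

def recombine(actions, K, rng=None):
    """Form an offspring action from K parent actions sampled from the
    population (illustrative intermediate recombination operator)."""
    if rng is None:
        rng = np.random.default_rng()
    parents = rng.choice(actions, size=K, replace=False)
    w = rng.random(K)
    w /= w.sum()                 # random convex weights summing to 1
    return float(w @ parents)
```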
5.2.2. Mutation
To explore novel solutions, random mutations are introduced to expand the search space. This process supports the exploration of new strategies and is described by
with random value
and number of mutated individuals
M. This genetic adaptation enables individuals to expand the search space.
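A minimal sketch of the mutation step, assuming a Gaussian perturbation clipped to the normalized action space; the mutation strength sigma is an illustrative choice, not a value from the paper.

```python
import numpy as np

def mutate(action, sigma=0.05, low=0.0, high=1.0, rng=None):
    """Gaussian mutation of an action value, clipped to [low, high]
    (illustrative operator and parameters)."""
    if rng is None:
        rng = np.random.default_rng()
    return float(np.clip(action + rng.normal(0.0, sigma), low, high))
```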
5.2.3. Update of Support Vectors
In addition to the action update, we have to consider the update of the support vector. In theory, we can also use recombination and mutation. However, this would ignore the system dynamics. Hence, we opt to simply set the new support vector to be equal to
, i.e., the state obtained after applying
to the system. Hence, we obtain the new individual
as
where
is the achieved fitness.
5.3. Local and Global Selection
The above updates result in an increased population, which has to be reduced to its original size. To this end, we have to address two goals, namely selecting individuals with the best fitness while at the same time preserving the grid coverage. We therefore propose a local selection for coverage preservation followed by a standard global selection.
5.3.1. Local Selection
The task of local selection is to preserve the grid's coverage of the state space. Let denote the convex hull of the grid, i.e., the smallest convex shape that encloses all points. Then, we define two distinct cases for the population update:
Note that the property can be easily determined by applying the QuickHull algorithm [54].
In case of add(), the individual is simply added to the population. The case of update() is more involved as the corresponding support vector is already present in the coverage of the state–action representation. To avoid an accumulation of support vectors within a small area, but make use of the obtained fitness update, we propose a local update step of the grid.
To this end, we evaluate whether adding the new individual improves the local fitness of the state–action representation at position by the fitness value . Since the comparable fitness at the point of interest may not correspond directly to an individual in the population , we estimate the fitness value using the Delaunay triangulation of the three nearest support vectors. A Delaunay triangulation divides a set of points into triangles such that no point lies within the circumcircle of any triangle [55]. This maximizes the smallest angles of the triangles and thus avoids very pointed, numerically unfavorable shapes. For a comprehensive derivation of the calculations, see Appendix A.
The calculated fitness of the state–action representation is used to decide on exploitation or rejection of the information with
Consequently, a new individual is only added to the population if its fitness is better than a virtual fitness calculated from neighboring support vectors. We denote this as local selection.
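The two-case local selection can be sketched with `scipy.spatial.Delaunay`. This is a hedged illustration: `find_simplex` doubles as the convex-hull membership test (a point outside the hull lies in no simplex), standing in for the QuickHull check, and the virtual fitness is obtained by barycentric interpolation over the enclosing triangle rather than the paper's exact three-nearest-neighbor construction. The 2-D `[state_x, state_y, action, fitness]` row layout is our assumption.

```python
import numpy as np
from scipy.spatial import Delaunay

def local_select(population, candidate):
    """Local selection sketch: add the candidate if it extends state-space
    coverage (outside the current convex hull); otherwise keep it only if
    its fitness beats a virtual fitness interpolated from the enclosing
    Delaunay triangle of existing support vectors."""
    states = population[:, :2]
    tri = Delaunay(states)                      # triangulation of support states
    s = candidate[:2]
    simplex = tri.find_simplex(s)
    if simplex == -1:                           # outside hull -> add() case
        return np.vstack([population, candidate])
    # update() case: barycentric interpolation of neighbor fitness
    verts = tri.simplices[simplex]
    T = tri.transform[simplex]
    bary = T[:2] @ (s - T[2])
    bary = np.append(bary, 1.0 - bary.sum())    # barycentric coordinates of s
    u_virtual = bary @ population[verts, 3]
    if candidate[3] > u_virtual:                # exploit only if fitness improves
        return np.vstack([population, candidate])
    return population                           # reject
```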
5.3.2. Global Selection
As the size of the population after local selection typically exceeds the maximum number of individuals, we use a standard selection approach in that we select individuals based on their fitness: where denotes the selection of the individuals with the best fitness.
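The global selection reduces to a truncation by fitness, sketched below under the assumption that fitness is stored in the last column of each individual's row.

```python
import numpy as np

def global_select(population, n_max):
    """Keep the n_max individuals with the best fitness
    (fitness assumed in the last column of each row)."""
    order = np.argsort(population[:, -1])[::-1]   # sort by fitness, descending
    return population[order[:n_max]]
```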
5.4. Agent-Based Knowledge Transfer
As stated before, the training process can be enhanced using knowledge transfer between agents, i.e., agents exhibiting similar behavior can share information. The aim is to enable faster learning with minimal training requirements by utilizing existing knowledge to improve performance, see Figure 5.
To this end, we have to solve two basic issues. First, we have to derive metrics, which determine the degree of similarity between the agent’s state–action representations. Second, we leverage our population-based learning approach to transfer knowledge between the agents.
In order to transfer knowledge, we use the individual's representation consisting of support vector, action, and fitness as a normalized feature vector on which we define a similarity metric as follows:
Hence, the knowledge transfer takes place between agents i and j with the most consistent match according to .
A reasonable knowledge transfer can only be achieved if integrating the individual improves the fitness value . This transfer mechanism is described by
Hence, only the individual I is selected for transfer if integrating it into the target state–action representation improves the fitness value . By requiring a strictly positive effect on target fitness, we ensure that source knowledge acts as a catalyst rather than a source of interference in the sensitive early stages of adaptation.
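The partner-selection step of the knowledge transfer can be sketched as follows. We use cosine similarity of normalized feature vectors as a plausible instance of the similarity metric; the paper's exact metric and feature encoding are not reproduced here, and the flattened (state, action, fitness) encoding is our assumption (it requires equally sized populations).

```python
import numpy as np

def feature_vector(population):
    """Normalized feature vector of an agent's population: flattened
    (state, action, fitness) rows scaled to unit norm (illustrative)."""
    v = population.flatten().astype(float)
    return v / (np.linalg.norm(v) + 1e-12)

def best_transfer_pair(populations):
    """Pick the agent pair (i, j), i != j, with the highest cosine
    similarity of their feature vectors as transfer partners."""
    feats = [feature_vector(p) for p in populations]
    best, pair = -np.inf, None
    for i in range(len(feats)):
        for j in range(len(feats)):
            if i == j:
                continue
            sim = float(feats[i] @ feats[j])   # cosine similarity (unit norms)
            if sim > best:
                best, pair = sim, (i, j)
    return pair, best
```

A transferred individual would then pass the same fitness-improvement test as in local selection before being integrated into the target representation.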
Technical implementations of the state–action representation update by DGS (Algorithm A1), the agent-based evolutionary process (Algorithm A2), and an excerpt of the multi-agent knowledge transfer (Algorithm A3) are provided as pseudocode in Appendix B.
7. Results and Discussion
In this section, the proposed approaches and developed strategies are thoroughly evaluated and critically analyzed on a distributed production environment. To this end, after introducing the testbed, we first compare the novel approach in different settings with other state-of-the-art methods. We then present the actor policies, followed by a discussion of the coverage of the action space.
7.1. Implementation in a Laboratory Test Field
The Bulk Good Laboratory Process (BGLP) is a flexible, intelligent production system that specializes in the continuous transport of bulk materials, shown in Figure 6.
The process takes place across three independent operating modules. Station 1 is a preparation unit where material is distributed via a conveyor belt. Station 2 uses a vibrating conveyor for processing. Station 3 uses a rotary valve for continuous dosing of the material. In addition, level sensors are used to monitor the process.
For the considered scenario, we assign an agent to each actuator, i.e., the conveyor, vibration conveyor, two vacuum pumps, and rotary feeder. All actuators have a normalized action space of except for the on–off vibration conveyor. The state space of each agent consists of the fill levels of the upstream and downstream buffers, respectively. We further define the local fitness function for each agent i using the overflow and emptiness of the upstream and downstream buffers and , the energy consumption , and the production targets as with weighting factors , and . The distributed production process iterates with a sequence T, where the fill levels of the individual silos and hoppers are covered by . The overflow and emptiness of the buffers are calculated by with upper limit and lower limit . The production demand is calculated with where the inflow and the outflow refer to the station.
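The composition of the local fitness from buffer, energy, and demand terms can be sketched as below. This is a hedged illustration only: the paper's exact expressions, weighting factors, and buffer limits are not reproduced, so all numeric defaults here are placeholders.

```python
def buffer_penalty(level, low, high):
    """Penalty for overflow above 'high' and emptiness below 'low' of a
    buffer (illustrative form, not the paper's exact expression)."""
    return max(0.0, level - high) + max(0.0, low - level)

def local_fitness(up_level, down_level, power, inflow, outflow,
                  low=0.2, high=0.8, w_b=1.0, w_p=0.5, w_d=1.0):
    """Hedged sketch of an agent's local fitness: weighted penalties for
    buffer violations, energy use, and demand mismatch. All weights and
    limits are placeholder values, not the paper's configuration."""
    penalty = w_b * (buffer_penalty(up_level, low, high)
                     + buffer_penalty(down_level, low, high))
    penalty += w_p * power                      # energy consumption term
    penalty += w_d * abs(inflow - outflow)      # demand mismatch term
    return -penalty                             # higher fitness is better
```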
7.2. Results at the BGLP
We evaluate the strategies of the Evo-SbPG using the BGLP process. To this end, we compare the results of DGS, EA DGS, and EA DGS Trans with those of the Vanilla SbPG [15] and the GB SbPG M1 and GB SbPG M2 approaches [19]. Both the Vanilla SbPG and GB SbPG approaches are based on fixed grid structures. However, the Vanilla SbPG approach uses random action generation, whereas the GB SbPG approaches use gradient optimization methods. GB SbPG M1 is based on a single leader–follower dynamic, and GB SbPG M2 on multiple leader–follower relationships. The described approaches were carried out with the parameters proposed in the literature. While the DGS approach only uses dynamic grid structures, the EA DGS approach combines dynamic representation with evolutionary strategies. The EA DGS Trans approach additionally combines both with knowledge transfer. An excerpt from the fitness values of the local agents and the resulting global objective is shown in Table 1.
The contributions of each agent to global fitness are presented in Figure 7.
Figure 7 shows that the different variants of our approach, i.e., DGS, EA DGS, and EA DGS Trans, outperform Vanilla SbPG and the gradient-based approaches, with EA DGS Trans performing best. Further, they yield strategies where the individual agents contribute more evenly to the overall objective. This indicates improved coordination between individual agent policies.
We further provide results for the individual objectives of transport, overflow, demand, and power consumption, as shown in Table 2.
Although all approaches enable control of the BGLP process, significant differences in their individual process characteristics become apparent. The best response strategies Vanilla SbPG, DGS, EA DGS, and EA DGS Trans exhibit higher transport rates, whereas GB SbPG M1 and GB SbPG M2 show lower performance in this regard while requiring lower energy consumption.
A key feature of the DGS, EA DGS, and EA DGS Trans approaches is highlighted in the time plots shown in Figure 8. All the established strategies, Vanilla SbPG, GB SbPG M1, and GB SbPG M2, are primarily focused on identifying and maintaining a single optimal control point, while the effort required to establish or maintain control points varies depending on the strategy.
However, these results do not reflect typical control behavior in such event-driven distributed control systems, which are typically characterized by on–off control behavior. As shown, the novel approaches are more aligned with such realistic plant control behavior, as they support dynamic adaptation to fluctuating system conditions. This adaptive behavior is particularly evident in the transport, overflow, and power profiles of the DGS, EA DGS, and EA DGS Trans strategies, as illustrated in Figure 8. We attribute this to the better exploration of the state–action space and improved coverage due to the DGS.
This interpretation is confirmed by an extended breakdown of the individual process objectives across the individual agents at the BGLP, see Table 3.
While the Vanilla SbPG, DGS, EA DGS, and EA DGS Trans strategies focus on self-optimizing state–action representations, GB SbPG M1 and GB SbPG M2 pursue goal-oriented optimization. This leads to higher transport values for the Vanilla SbPG and DGS methods, whereas the GB SbPG M1 and GB SbPG M2 approaches result in lower transport performance. Similar behavior is reflected in the material overflow characteristics. Vanilla SbPG and DGS tolerate minor overflow to prioritize overall optimization, whereas GB SbPG M1 and GB SbPG M2 aim to minimize excess. The Vanilla SbPG and DGS strategies show lower demand than GB SbPG M1 and GB SbPG M2. Power consumption is inversely proportional to demand, with the Vanilla SbPG and DGS strategies drawing more power.
7.3. Analysis and Interpretation
Based on the parameters recorded at the BGLP, we perform a Kernel Density Estimation (KDE), shown as a violin plot. For statistical classification, we also include the Interquartile Range (IQR) given by the quartiles and , the mean value , and the single and double standard deviations and in the representation, see Figure 9.
The recorded values from the BGLP process are treated as independent and identically distributed samples. To estimate the univariate probability density function at a given point x from a recorded dataset , we employ a KDE defined by where n denotes the number of data points and h represents the bandwidth. As kernel , we used a Gaussian kernel with
Other potential kernel functions include uniform, triangular, biweight, or triweight kernels. For our evaluation, we determine the bandwidth h by scaling Scott's Rule with a factor to minimize the mean squared error, defined as
While the bandwidth could alternatively be determined via Silverman's Rule or manual adjustment, we selected a scaling factor of . Mathematical smoothing of the KDE can attenuate or exaggerate outliers in the data points, creating a continuous distribution shape that goes beyond the individual data values. In addition to the and IQR quartiles shown in red, the upper and lower whisker limits are estimated and plotted using times the IQR.
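The KDE described above, a Gaussian kernel with a Scott's-Rule bandwidth, can be sketched in a few lines. The `scale` parameter stands in for the paper's bandwidth scaling factor, whose actual value is not reproduced here.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x, data, scale=1.0):
    """Univariate KDE with a Gaussian kernel. The bandwidth follows
    Scott's Rule, h = std * n^(-1/5), multiplied by a tuning factor
    'scale' (placeholder for the paper's scaling factor)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    h = scale * data.std(ddof=1) * n ** (-1.0 / 5.0)
    return gaussian_kernel((x - data) / h).sum() / (n * h)
```

As a sanity check, the estimated density integrates to approximately one and peaks near the bulk of the samples.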
7.4. Visualization of the State–Action Representation
In addition to the performance improvements, we are also interested in the coverage of the state–action representation of the DGS. To this end, we visualize the state–action representation by the DGS in Figure 10. Specifically, we show the bar chart of action values, a bird's-eye view of the state space DGS, the resulting action grid, and the fitness surface of each agent. Together, these representations support both qualitative and quantitative analysis.
First, the DGS covers the state space entirely for agents 2–4, indicating the strength of the EA-based optimization, which allows us to explore the state space with maximum coverage. The reduced coverage of agents 1 and 5 is due to the process conditions. Specifically, the rotary feeder is directly responsible for demand fulfilment, restricting the state space, while the conveyor belt is typically only operated for smaller input buffer values. Further, we observe comparably smooth fitness surfaces indicating a certain optimal region within the state space, which aligns with the action surface as well. Overall, our approach succeeds in covering the full state–action space according to the optimization objectives, resulting in the better usage of actuation as discussed before.
7.5. Tuning and Configuration Parameters
The reproducibility of the investigations is secured through strictly controlled study conditions, enabling direct comparability with reference approaches. The BGLP operates with an iteration cycle of s, allowing for the comprehensive capture of system dynamics via a discretized parameter sampling rate of s.
We set up the same episodic training procedure for each experiment. This procedure contains nine training episodes and one testing episode. During training, the DGS were continuously adapted to the observed behavior and adaptive learning strategies of the agents.
The fitness function, which defines the objective of agent training, was specified by weighting factors of the upstream and downstream buffer , the production demand target , and energy consumption .
The parameters and system settings presented and used are methodologically and numerically directly comparable with the configurations of the reference approaches Vanilla SbPG [15] and the GB SbPG M1 and GB SbPG M2 approaches [19].
7.6. Scalability, Transferability and Limitations
The results demonstrate that agents formed autonomously interact both cooperatively and competitively, learning from one another. The Evo-SbPG agent architecture, developed for this study, enables parallel operation and linear scaling by proportionally expanding comparable modules. Although tested in a specific setting, the approach’s underlying mechanisms are transferable to other distributed production systems. We refer to [
17] for a more detailed discussion. A limitation of the approach is its inherent reliance on the availability of the full state vector, which might be restrictive in real-world applications. Nevertheless, incorporating corresponding state observations or soft sensors might mitigate this problem. We leave an extension to partly observable process to future research. Furthermore, the results presented are based on simulation models in lab environments which might suffer from the sim2real gap. Approaches to account for the sim2real gap will be part of future research as well. Thanks to its stable core logic, this approach can also be applied to similar decentralized processes including dynamic supply chain management, traffic control, coordination of autonomous vehicles and voltage and frequency stabilization in smart grids.
8. Conclusions
In this paper, we proposed a novel approach for distributed optimization using population-based learning and DGS. To this end, we presented Evo-SbPG, a new game theory-based framework that integrates EA strategies into SbPG. Further, we addressed the complexity of dynamic environments and extended the approach by the DGS, EA DGS, and EA DGS Trans strategies for dynamic state–action representations. The strategies were evaluated on multi-objective optimization problems in distributed production systems. The results demonstrate improved performance in state–action representation and can be adapted for various applications. By leveraging dynamic grid structures, the approach ensures high computational efficiency and scalability, allowing it to handle increasingly complex strategy spaces without the exponential cost typically associated with fine-grained discretization.
Future research will exploit the insights gained from DGS transitions to enhance state–action representation in more intricate coupled systems. Additionally, we will further explore and extend the foundations of knowledge transfer in MAS and its representation through graph theory, developing more comprehensive approaches in the process. Specifically, we aim to investigate knowledge transfer mechanisms in multi-objective Evo-games, such as knowledge distillation, as well as the adaptability of DGS in non-stationary environments.