Article

Multi-Agent Transfer Learning Based on Evolutionary Algorithms and Dynamic Grid Structures for Industrial Applications

by
Marlon Löppenberg
*,
Steve Yuwono
and
Andreas Schwung
Department of Automation Technology and Learning Systems, South Westphalia University of Applied Sciences, 59494 Soest, Germany
*
Author to whom correspondence should be addressed.
Submission received: 22 December 2025 / Revised: 3 February 2026 / Accepted: 4 February 2026 / Published: 6 February 2026

Abstract

Distributed production systems increasingly have to balance economic goals such as energy efficiency and productivity with critical technical requirements such as flexibility, real-time capability, and reliability. This paper presents a novel approach for distributed optimization by means of Evolutionary State-based Potential Games with dynamic grid structures. In more detail, we leverage the combination of Potential Games, which provide rigorous convergence guarantees, with population-based optimization to improve the efficiency of the learning process. Specifically, we address challenges of previous approaches including inefficient best response strategies, insufficient coverage of the state–action space, and the lack of knowledge transfer among agents. The developed strategies are evaluated on an industrial system at laboratory scale. The results highlight advances in evolutionary state-based knowledge transfer and an improved coverage, resulting in efficient control policies. By leveraging dynamic grid structures, Evolutionary State-based Potential Games enable the maximization of weighted production targets while simultaneously eliminating process losses, resulting in improvements in the considered metrics compared to state-of-the-art methods.

1. Introduction

Over recent decades, the integration of Artificial Intelligence (AI) and the Internet of Things (IoT) has significantly transformed process automation [1,2] into Cyber–Physical Systems (CPS) [3], driven by applications such as self-learning [4], anomaly detection [5], and predictive maintenance [6].
In order to meet modern industrial demands, workflows have to become more adaptive and flexible as a key to enhanced efficiency, reduced errors and ensured scalability. Consequently, there is a growing interest in distributed optimization strategies that rely on flexible, autonomous control approaches [7,8]. Addressing these requirements calls for a distributed Multi-Agent System (MAS) approach [9], capable of aligning local objectives with global optimization goals.
Typical approaches for machine learning-based distributed optimization algorithms are found in both game theory (GT) and reinforcement learning (RL). While RL is commonly applied in areas like robot navigation [10], job shop scheduling [11], and autonomous driving [12], GT is increasingly used in domains such as urban traffic control [13] and cloud security [14]. Multi-agent GT architectures for production systems with simple best response learning, based on Potential Games (PG), have been demonstrated for optimization in large-scale production environments and further extended with state information in [15]. Ref. [16] built on this using multi-step model predictors in model-based learning to train State-based Potential Game (SbPG) players in simulated environments for self-optimizing manufacturing systems. Furthermore, Transfer Learning (TL) approaches have been introduced to infer similarities between players during training [17]. In parallel, other works combined SbPGs with Stackelberg strategies in modular systems, assigning roles to players to facilitate multi-objective optimization via simplified utility functions [18], and incorporating gradient-based learning methods [19].
Building on these principles, we expand these strategies and focus on the following challenges: First, RL reaches its limits in complex multi-agent systems, particularly due to high computing costs, slow convergence, and problems with generalization [20,21]. In GT, the assumption of rational players and the search for stable equilibria such as the Nash equilibrium complicate practical applicability in dynamic, real-world scenarios [22]. Second, the best response strategies employed before rely on static grids, which appear to be highly ineffective for dynamic systems [20]. This results in increased computational load and incorrect decisions, as well as limited generalizability [21]. Third, in dynamic and complex production environments with multiple products and machines, traditional optimization methods reach their limits [23]. High variance, machine dependencies, and constant rescheduling make it difficult to make efficient decisions and adapt quickly to new production conditions [24].
To tackle the first challenge, we propose a novel agent structure that combines GT-based learning with Evolutionary Algorithms (EAs). This enables more diverse strategies for optimizing manufacturing [25]. Combining EAs with best response strategies [15,26] can be expected to offer more efficient learning. To address the inherent limitations of static grid structures for best response learning, we propose a novel dynamic grid structure and integrate it into an EA-based learning strategy. Furthermore, we tackle the complexity of dynamic and complex environments by introducing a knowledge transfer framework which allows us to exchange gained knowledge between individual learning agents, further increasing the efficiency of the self-learning process.
The main contributions of this work can be summarized as follows:
  • We introduce a novel Evolutionary State-based Potential Game, called Evo-SbPG, to enable adaptive and scalable distributed optimization. We formally provide convergence guarantees for the novel game structure.
  • We propose a state–action representation, called dynamic grid structures (DGS), which allows for a more flexible representation of agent policies.
  • We propose a novel EA-embedded knowledge transfer scheme between agents to share existing knowledge and explore new strategies on the DGS level.
  • We evaluate the approaches developed in a production environment and compare the results of different alternative learning strategies, highlighting the effectiveness of the approach.
The remainder of this work is structured as follows: Section 2 provides an overview and evaluation of the current literature. In Section 3, the problem statement is presented, and Section 4 presents the framework and the definition of the Evo-SbPG. Section 5 explains the fundamental approach of the DGS, which is extended by a multi-agent knowledge transfer. The convergence analysis is provided in Section 6. An analysis of the results and an extensive discussion are provided in Section 7, with a final conclusion in Section 8.

2. Related Work

In this section, we discuss research on process optimization in distributed manufacturing, use of EAs for distributed optimization, and knowledge transfer in multi-agent systems.

2.1. Optimization in Distributed Manufacturing Processes

Distributed manufacturing describes the production of goods in a spatially separated arrangement of production facilities and locations. In contrast to traditional manufacturing processes, which are characterized by a central manufacturing and production process, this method of manufacturing has been enabled by IoT [2], adaptive manufacturing [27] and advanced automation technology. The benefits of distributed manufacturing processes are characterized by shorter lead times, flexibility to meet local market requirements and resilience to total downtime.
Distributed manufacturing processes play an increasing role in industrial applications [8,15,28]. Based on the prevailing interaction scheme between agent and environment, approaches from the fields of RL [15,29] and GT [16,19] have been employed. Particularly, specific game types like PGs and SbPGs [15] have been used for distributed optimization in production environments. This has been extended to model-based SbPGs by means of multi-step model predictors [16] to train SbPG players in simulated environments. We extend the current state of the art in several aspects. First, we propose an evolutionary SbPG approach allowing us to integrate EA techniques with convergence guarantees. Second, we employ dynamic grid structures, which prove useful for the more efficient use of best response strategies.

2.2. Optimization Using EAs

EAs are stochastic, meta-heuristic optimization methods inspired by natural processes [30,31]. Initially, individual solutions are generated and grouped into a population, whose evolution is driven by parameters such as selection pressure, mutation rate, and recombination rate [32,33,34]. Each solution is evaluated by a fitness function that determines its performance with respect to the optimization objective. A deeper understanding of the fitness landscape enables strategies to overcome local optima [35], while population diversity can be maintained by controlling convergence and introducing exit criteria. Combining EA with local optimization methods leads to hybrid approaches that balance global exploration and local exploitation, offering robust solutions to complex problems [36]. EA-based approaches, which are widely used in the optimization of industrial processes, have been recently applied to self-optimizing PLC control systems [37]. These studies demonstrate the potential of EA-driven MAS to optimize complex production environments through adaptive algorithms and dynamic action state mapping.
In the field of EA, various approaches and techniques have been developed to solve complex problems. Key methods include Genetic Algorithms (GAs) [38], Evolutionary Strategies (ES) [39] and other forms of stochastic optimization [40,41]. These methods provide robust and flexible solutions, but face challenges in scalability and parameter tuning. The inherent stochastic nature of these algorithms introduces variability in reproducibility, potentially leading to overfitting and reducing the generalizability of solutions [42,43]. In addition, convergence is typically not assured by EA-based methods. We take up the challenge posed by limited resources and use EAs in combination with SbPGs to provide solutions that can be used for complex systems with a resource-saving state–action representation.

2.3. Knowledge Transfer in Multi-Agent Systems

TL is a machine learning (ML) technique that uses existing knowledge structures from one task or domain to improve the performance of a related but different task [44]. Instead of training a model from scratch, it uses pre-trained models or knowledge structures, adapting them to the specific requirements of the new task [45]. This strategy reduces the need for large datasets and extensive computation, making it particularly effective for tasks with limited data or overlapping features with prior knowledge [45,46]. TL is used in applications such as multitask learning [47], transfer reinforcement learning [44,48], auxiliary learning [49] or unsupervised domain expansion [50]. There are also various strategies for transferring parameters, rewards or policies, each designed to improve learning efficiency and adaptability in novel task environments [44,48]. Collectively, these methods aim to improve the generalization and applicability of learned knowledge across tasks.
Within the field of agent-based TL, a variety of methodologies have been developed to address different challenges [44,48]. One subfield is domain adaptation [51], where existing knowledge structures from the source domain are adapted to differences in data distribution within target tasks. A key challenge in distributed knowledge sharing is the difference in how source and target tasks are handled [49,50]. This can lead to negative transfer of irrelevant or harmful information [52]. In the context of this study, we propose an approach to knowledge transfer between multiple EA agents that incorporates the entire DGS of each individual. This approach involves the exchange of domain-specific strategies and knowledge structures to improve convergence.

3. Problem Description

We focus on distributed production systems as illustrated in Figure 1.
The system, divided into functional sub-areas, operates on two interconnected levels, the process level and the communication level, see Figure 2. At the process level, production transfer involves the movement of goods between sub-processes. Simultaneously, knowledge transfer enables continuous communication via interfaces such as Ethernet, field bus systems, or wireless links. Depending on system complexity, interactions can occur sequentially or in parallel while forming discrete, continuous, or hybrid process structures.
We model the structure of distributed manufacturing systems using graph theory. In particular, we formulate a directed graph $G(V, E)$ with vertex set $V$ and edge set $E$. Further, we consider a group of actuators $\mathcal{N} = \{1, 2, \ldots, N\}$ with corresponding action spaces $A_i \subseteq \mathbb{R}^{c \times N_d}$ and the state space of the production system $S \subseteq \mathbb{R}^m$. For each actuator, we define its upstream state space $S_{prior}^{A_i} = \{ s_i \in S \mid e = (s_i, A_i) \in E \}$ and a downstream state space $S_{next}^{A_i} = \{ s_j \in S \mid e = (A_i, s_j) \in E \}$. To model the production objectives, we define fitness functions $\Psi_i(S^{A_i}, a)$ with $S^{A_i} = S_{prior}^{A_i} \cup S_{next}^{A_i} \cup S_g$ and $a_i \in A_i$, where $S_g$ denotes global states known to all agents.
In our study, we allow agent–agent knowledge transfer via the communication level. This leads to a transfer set $T_{i,j}$ that promotes knowledge transfer between EA agents $i$ and $j$. Here, transfer is limited to agent pairs, defined as $e = (A_i, A_j)$.
Consequently, the overall objective of distributed optimization can be stated as the maximization of a globally defined fitness
$$\max_{a \in A} \Psi_{global}(S^{A_i}, a)$$
by solely maximizing the local fitness functions $\Psi_i$ of each module. The latter is achieved by combining ideas from SbPGs with evolutionary processes. Such distributed production scenarios can be found in various areas of industrial production, such as food production, the automotive industry, or pharmaceutical production.
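To make the graph formulation concrete, the following minimal Python sketch (our own illustration, with hypothetical station and buffer names) derives the upstream and downstream state sets of an actuator directly from the edge list of the directed graph $G(V, E)$:

```python
# Illustrative sketch (not from the paper): a directed graph G(V, E) of a
# simple production line, with the upstream/downstream state sets of each
# actuator derived from the edge structure, as in the problem formulation.

def upstream_states(edges, actuator):
    """S_prior(A_i): states s with an edge (s, A_i) into the actuator."""
    return {s for s, t in edges if t == actuator}

def downstream_states(edges, actuator):
    """S_next(A_i): states s with an edge (A_i, s) out of the actuator."""
    return {t for s, t in edges if s == actuator}

# Hypothetical topology: buffers (states) interleaved with actuators.
edges = [
    ("buffer_0", "conveyor"), ("conveyor", "buffer_1"),
    ("buffer_1", "vibration_conveyor"), ("vibration_conveyor", "buffer_2"),
    ("buffer_2", "rotary_feeder"), ("rotary_feeder", "buffer_3"),
]

print(upstream_states(edges, "conveyor"))    # {'buffer_0'}
print(downstream_states(edges, "conveyor"))  # {'buffer_1'}
```

Each actuator's local fitness function then only needs the fill levels of these neighboring buffers plus any global states $S_g$.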

4. Framework and Evo-SbPG Definition

In this section, we introduce an EA-based multi-agent structure, as shown in Figure 3. To this end, we employ an EA-based agent $i$ for each actuator with corresponding action set $A_i$ including discrete actions $a_i \in \mathbb{N}$ or continuous actions $a_i \in \mathbb{R}$.
To allow for distributed optimization according to Equation (1), we propose to leverage multi-agent structures from game theory, namely SbPG, and define them in the context of EAs. This results in an EA agent structure defined as follows.
Definition 1.
A game $Evo\text{-}SbPG(\mathcal{N}, A, S, \{\Psi_i\}, \Psi_{global})$ defines an Evo-SbPG if a global objective $\Psi_{global} : A \times S \to \mathbb{R}$ can be found that for every state–action pair $[s, a] \in S \times A$ conforms to the conditions
$$\Psi_i(s, a_i', a_{-i}) - \Psi_i(s, a_i, a_{-i}) = \Psi_{global}(s, a_i', a_{-i}) - \Psi_{global}(s, a_i, a_{-i})$$
and
$$\Psi_{global}(s', a_i) \leq \Psi_{global}(s, a_i).$$
Thereby, the optimization uses an evolutionary population-based learning process.
Equation (3) is standard in SbPGs and assures the contractivity of the potential function with respect to the state variable. The most crucial characteristic of standard SbPGs [26] is their convergence properties to equilibrium points given a best response strategy. Furthermore, various conditions have been derived to prove the existence of an SbPG for a given game setting [53]. These properties directly translate to Evo-SbPG. In what follows, we will present a novel best response learning strategy using evolutionary operations, bringing together convergent best response learning and evolutionary operations.

5. EA-Based Learning with Dynamic Grids

A central aspect is the interaction of SbPGs within the EA framework. In what follows, building on the research of [15,19] regarding best response learners, we start with extending the best response learners to using DGS. Then, we demonstrate the adaptation of evolutionary principles through population modelling and detail the integration of genetic operators by recombination, mutation, and selection, which is categorized into local and global scales. Lastly, we show the implementation of a convergent knowledge transfer procedure.

5.1. Characteristics of Best Response Learning

Existing approaches for optimizing SbPGs are based on best response learning [15], later expanded to gradient-based learning in [19]. These works use static, fixed grids to map states to resulting actions. The grids are defined by support vectors $s_l^{A_i}, l = 1, 2, \ldots, L$, whose assigned state values are fixed while the action values are continuously adjusted according to a best response strategy, see Figure 4a. The action $a_{i,t+1}$ for an actual state value $s_{i,t}$ is then computed as a distance-weighted sum over all support vectors
$$a_{i,t+1} = \sum_l \frac{w_i^{l,t}}{\sum_m w_i^{m,t}} \, a_i^l$$
with
$$w_i^{l,t} = \frac{1}{(D_i^{l,t})^2 + \gamma}$$
and
$$D_i^{l,t} = \left\| s_t^{A_i} - s_l^{A_i} \right\|_2^2,$$
where $\gamma$ is a small regularization constant.
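The distance-weighted interpolation over support vectors can be sketched as follows. This is our own one-dimensional illustration; `gamma` stands in for the small regularization constant in the weight denominator:

```python
def interpolate_action(support, state, gamma=1e-6):
    """Distance-weighted action over a 1-D grid of support vectors.

    `support` is a list of (state_value, action_value) pairs; `gamma` is a
    small regularizer that avoids division by zero when the current state
    coincides exactly with a support vector.
    """
    # D_l: squared distance between the current state and support vector l
    dists = [(state - s_l) ** 2 for s_l, _ in support]
    weights = [1.0 / (d ** 2 + gamma) for d in dists]
    total = sum(weights)
    # Normalized distance-weighted sum of the support-vector actions
    return sum(w / total * a_l for w, (_, a_l) in zip(weights, support))

grid = [(0.0, 0.2), (0.5, 0.9), (1.0, 0.4)]
print(interpolate_action(grid, 0.5))  # close to 0.9: nearest support vector dominates
```

Because the weights fall off with the fourth power of the distance, the action is dominated by nearby support vectors while distant ones contribute only marginally.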
However, the static grids are inefficient for nonlinear optimization while best response strategies often exhibit slow convergence. In the following, we extend the learning strategies using a population-based learning within dynamic, adjustable grid structures as shown in Figure 4b). The process schematic of dynamic adjustment and coverage maintenance is illustrated in Figure 4c).
To this end, we first consider the support vectors as individuals of a population, each consisting of its state values, an action value, and an associated fitness value. Second, we use recombination and mutation to encourage diversity in the action update. Third, we update the state values within the population based on the system dynamics. Fourth, we use a combination of local and global selection to ensure a constant population size in every step. We will now present these steps in detail.

5.2. Population of Support Vectors

We define, for each support vector, an individual $I$ within a population $P$ as
$$I(s^{A_i}, a_i, \psi) = \begin{bmatrix} s^{A_i} & a_i & \psi \end{bmatrix}^T$$
with state support vector $s^{A_i} \in S^{A_i}$, corresponding action $a_i \in A_i$, and associated fitness $\psi$. We employ a finite number $n_\nu$ of individuals in a population $P_i^t$ of agent $i$, resulting in
$$P_i^t = \begin{bmatrix} I_1^t, I_2^t, \ldots, I_{n_\nu}^t \end{bmatrix}^T.$$
Note that the above population covers state–action space and fitness value, while the action has to be optimized. Hence, we propose to employ the regular evolutionary operations, i.e., recombination and mutation, solely to the actions, while updating the state vector via the regular system dynamics.
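As a minimal sketch of this representation (names and types are our own, not from the paper), an individual and a fixed-size population can be modeled as:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Individual:
    """One support vector: state value(s), action value, and its fitness."""
    state: Tuple[float, ...]  # s^{A_i}
    action: float             # a_i
    fitness: float            # psi

def make_population(triples, n_nu) -> List[Individual]:
    """Build a population P_i^t of at most n_nu individuals for one agent."""
    pop = [Individual(s, a, f) for s, a, f in triples]
    assert len(pop) <= n_nu, "population must not exceed its fixed size n_nu"
    return pop
```

Keeping the fitness inside each individual lets both the local and the global selection steps rank candidates without re-evaluating the plant.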

5.2.1. Recombination

To maintain genetic diversity and enhance robustness, individuals are recombined. This mechanism is addressed in [37] by
$$a_{l+k}(t+1) = \alpha_k \, a_i(t) + (1 - \alpha_k) \, a_j(t), \quad k = 1, \ldots, K,$$
where the actions $a_i(t)$ and $a_j(t)$ of two different individuals are sampled from the population and recombined to form the action $a_{l+k}(t+1)$, $\alpha_k$ is a random parameter, and $K$ is the number of recombined individuals.

5.2.2. Mutation

To explore novel solutions, random mutations are introduced to expand the search space. This process supports the exploration of new strategies and is described by
$$a_{l+K+m}(t+1) = a_i(t) + \epsilon(t), \quad m = 1, \ldots, M$$
with random value ϵ and number of mutated individuals M. This genetic adaptation enables individuals to expand the search space.
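A compact sketch of both operators, under the stated rules: arithmetic recombination with a random $\alpha_k$ and additive random mutation. The Gaussian distribution of $\epsilon$ and the clipping to a normalized action space are our assumptions:

```python
import random

def recombine(a_i, a_j, alpha):
    """Arithmetic recombination of two parent actions."""
    return alpha * a_i + (1.0 - alpha) * a_j

def mutate(a, sigma, low=0.0, high=1.0):
    """Additive mutation; we assume Gaussian noise, clipped to [low, high]."""
    return min(high, max(low, a + random.gauss(0.0, sigma)))

def offspring(actions, K, M, sigma):
    """Produce K recombined and M mutated candidate actions."""
    children = []
    for _ in range(K):
        a_i, a_j = random.sample(actions, 2)
        children.append(recombine(a_i, a_j, random.random()))
    for _ in range(M):
        children.append(mutate(random.choice(actions), sigma))
    return children
```

Note that only the actions are varied; the state values of the resulting individuals are taken from the observed system dynamics, as described next.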

5.2.3. Update of Support Vectors

In addition to the action update, we have to consider the update of the support vector. In theory, we can also use recombination and mutation. However, this would ignore the system dynamics. Hence, we opt to simply set the new support vector to be equal to s t + 1 A i , i.e., the state obtained after applying a i ( t ) to the system. Hence, we obtain the new individual I n e w as
$$I_{new}(s_{t+1}^{A_i}, a_i^{t+1}, \psi^{t+1}),$$
where $\psi^{t+1}$ is the achieved fitness.

5.3. Local and Global Selection

The above updates result in an enlarged population, which has to be reduced to its original size. To this end, we have to address two goals: selecting the individuals with the best fitness while at the same time preserving the grid coverage. We therefore propose a local selection for coverage preservation followed by a standard global selection.

5.3.1. Local Selection

The task of local selection is to preserve the grid’s coverage of the state space. Let
$$\operatorname{conv}(P_i^t) := \operatorname{conv}\left\{ s^{A_i} \mid I(s^{A_i}, a_i, \psi) \in P_i^t \right\}$$
denote the convex hull of the grid, i.e., the smallest convex shape that encloses all points. Then, we define two distinct cases for the population update:
$$P_i^t \cup I_{new} \rightarrow P_i^{t+1} = \begin{cases} \text{update}(s_{t+1}^{A_i}), & \text{if } s_{t+1}^{A_i} \in \operatorname{conv}(P_i^t) \\ \text{add}(s_{t+1}^{A_i}), & \text{if } s_{t+1}^{A_i} \notin \operatorname{conv}(P_i^t) \end{cases}$$
Note that the property $s_{t+1}^{A_i} \in \operatorname{conv}(P_i^t)$ can be easily determined by applying the QuickHull algorithm [54].
In case of add( s t + 1 A i ), the individual I n e w is simply added to the population. The case of update( s t + 1 A i ) is more involved as the corresponding support vector is already present in the coverage of the state–action representation. To avoid an accumulation of support vectors within a small area, but make use of the obtained fitness update, we propose a local update step of the grid.
To this end, we evaluate whether adding the new individual I n e w improves the local fitness of the state–action representation Ψ i at position s t + 1 A i by the fitness value ψ t + 1 . Since the comparable fitness ψ t at the point of interest may not correspond directly to an individual in the population P i t , we estimate the fitness value using Delaunay triangulation D T ( P i t ) of the three nearest support vectors. We use Delaunay triangulation to divide a set of points into triangles so that no point lies within the perimeter of a triangle [55]. This maximizes the smallest angles of the triangles and thus avoids very pointed, numerically unfavorable shapes. For a comprehensive derivation of the calculations, see Appendix A.
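The two geometric ingredients of local selection, the convex-hull membership test and the triangulation-based fitness estimate, can be illustrated in 2-D as follows. This is our own sketch: a cross-product containment test stands in for QuickHull, and barycentric interpolation over a single triangle stands in for the full Delaunay triangulation of the population:

```python
def in_convex_hull_2d(point, hull):
    """Check containment of `point` in a convex polygon `hull`, given in
    counter-clockwise order, via cross products (stands in for QuickHull)."""
    px, py = point
    for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]):
        # Negative cross product: point lies to the right of this edge.
        if (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1) < 0:
            return False
    return True

def barycentric_fitness(point, tri, fitness):
    """Estimate the fitness at `point` inside triangle `tri` by barycentric
    interpolation of the three corner fitness values -- the role the
    Delaunay triangulation plays in the local selection step."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    px, py = point
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    l1 = ((y2 - y3) * (px - x3) + (x3 - x2) * (py - y3)) / det
    l2 = ((y3 - y1) * (px - x3) + (x1 - x3) * (py - y3)) / det
    l3 = 1.0 - l1 - l2
    return l1 * fitness[0] + l2 * fitness[1] + l3 * fitness[2]
```

In a full implementation, one would select the Delaunay triangle containing the new state and interpolate the fitness of its three corner support vectors to obtain the reference value $\psi^t$.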
The calculated fitness of the state–action representation is used to decide on exploitation or a rejection of the information with
$$P_i^t \cup I_{new} \rightarrow P_i^{t+1} = \begin{cases} \text{add}(), & \text{if } \psi^t \leq \psi^{t+1} \\ \text{reject}(), & \text{else} \end{cases}$$
Consequently, a new individual is only added to the population if its fitness is better than a virtual fitness calculated from neighboring support vectors. We denote this as local selection.

5.3.2. Global Selection

As the size of the population after local selection typically exceeds the maximum number of individuals, we use a standard selection approach in which we select individuals based on their fitness:
$$\operatorname{Top}(n_\nu)(I_{A_i}^t, \Psi_i),$$
where $\operatorname{Top}(n_\nu)$ denotes the selection of the $n_\nu$ individuals with the best fitness.
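A minimal sketch of the global Top-$(n_\nu)$ selection (the tuple representation of individuals is our choice):

```python
def global_selection(population, n_nu):
    """Top-(n_nu) selection: keep the individuals with the best fitness.

    `population` holds (state, action, fitness) tuples; only the fitness
    (index 2) drives the ranking.
    """
    return sorted(population, key=lambda ind: ind[2], reverse=True)[:n_nu]

pop = [("s1", 0.1, 0.5), ("s2", 0.2, 0.9), ("s3", 0.3, 0.1)]
print(global_selection(pop, 2))  # the two fittest individuals, best first
```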

5.4. Agent-Based Knowledge Transfer

As stated before, the training process can be enhanced using knowledge transfer between agents, i.e., agents exhibiting similar behavior can share information. The aim is to enable faster learning with minimal training requirements by utilising existing knowledge to improve performance, see Figure 5.
To this end, we have to solve two basic issues. First, we have to derive metrics, which determine the degree of similarity between the agent’s state–action representations. Second, we leverage our population-based learning approach to transfer knowledge between the agents.
In order to transfer knowledge, we use the individual’s representation consisting of support vector, action and fitness as a normalized feature vector on which we define a similarity metric as follows:
$$d(I_i, I_j) = \sqrt{\sum_{k=1}^{d} \left( I_{i,k} - I_{j,k} \right)^2}.$$
Hence, the knowledge transfer takes place between agents $i$ and $j$ with the most consistent match according to $\min_{I_i \in P_i} d(I_i, I_j)$.
A reasonable knowledge transfer can only be achieved if integrating the individual I i improves the fitness value ψ j . This transfer mechanism T i , j : I i , I j R is described by
$$T_{i,j} = \left\{ I_i \in P_i \mid \psi_j(I_j) < \psi_j(I_j \cup \{I_i\}) \right\}.$$
Hence, an individual $I_i$ is only selected for transfer if integrating it into the target state–action representation improves the fitness value $\psi_j$. By requiring a strictly positive effect on target fitness, we ensure that source knowledge acts as a catalyst rather than a source of interference in the sensitive early stages of adaptation.
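The gated transfer mechanism, nearest-neighbor matching followed by a fitness-improvement check, can be sketched as follows (function names and the tuple representation are ours; `target_fitness_fn` is a hypothetical stand-in for evaluating $\psi_j$ with a candidate integrated):

```python
import math

def feature_distance(ind_a, ind_b):
    """Euclidean distance between two normalized feature vectors
    (state components, action, fitness) -- the similarity metric above."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ind_a, ind_b)))

def transfer(source_pop, target_ind, target_fitness_fn):
    """Transfer the most similar source individual only if adding it would
    improve the target's fitness; otherwise reject it (no negative transfer)."""
    candidate = min(source_pop, key=lambda ind: feature_distance(ind, target_ind))
    if target_fitness_fn(candidate) > target_fitness_fn(target_ind):
        return candidate
    return None
```

The rejection branch is what prevents negative transfer: dissimilar or harmful source knowledge never enters the target population.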
Technical implementations of the update state–action representation by DGS, see Algorithm A1, the agent-based evolutionary process, see Algorithm A2, and an excerpt of the multi-agent knowledge transfer, see Algorithm A3, are provided as pseudocode in Appendix B.

6. Convergence Analysis

In this section, we discuss the convergence properties of the proposed Evo-SbPG-based learning algorithms. To this end, we first recall results on the convergence of conventional SbPGs. In fact, it has been shown in [26] that SbPGs converge to a local equilibrium if the learning algorithm follows a best response or gradient-based learning scheme. This is due to the symmetric Hessian of the underlying game structure. Hence, if we can prove a best response behavior of our proposed learning scheme, this yields a convergence proof for the Evo-SbPG. To this end, we analyze the Evo-SbPG learning under the global selection scheme, for which we can state the following:
Theorem 1.
Let $\psi_{max}$ and $\psi_{min}$ denote the maximum and minimum fitness values across the population. Then, under the global selection scheme, we have
$$\psi_{max}(t+1) \geq \psi_{max}(t),$$
$$\psi_{min}(t+1) \geq \psi_{min}(t).$$
Proof. 
Recall that we select the new population $P^{t+1}$ of size $n_\nu$ out of the old population $P^t$ of size $n_\nu$ plus additional individuals collected in a population $P^{t*}$ of size $n_\nu^*$ obtained by recombination and mutation, local selection, and knowledge transfer. To this end, we employ TopK sampling, i.e., we select the best $n_\nu$ individuals.
Suppose we have $\max \{ \psi_I(t) \mid I \in P^{t*} \} \leq \psi_{min}(t)$; then none of the additional individuals are selected, yielding $\psi_{min}(t+1) = \psi_{min}(t)$. Otherwise, if $\max \{ \psi_I(t) \mid I \in P^{t*} \} > \psi_{min}(t)$, at least one of the additional individuals replaces a worse individual, yielding $\psi_{min}(t+1) > \psi_{min}(t)$. The same arguments hold for $\psi_{max}(t+1)$, concluding the proof.    □
The above theorem basically proves that the fitness only increases or remains constant during an update. This is essentially a best response learning scheme. This remains valid under the TL scheme as well as the local selection as both just add additional individuals. Consequently, we can state the following convergence behavior.
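The monotonicity argument of Theorem 1 can be checked numerically with a toy simulation (our own sketch, with uniformly random fitness values standing in for actual fitness evaluations):

```python
import random

def topk(fitness_values, k):
    """Global selection on raw fitness values: keep the k best."""
    return sorted(fitness_values, reverse=True)[:k]

# Toy check of Theorem 1: under Top-k selection, the best and worst retained
# fitness values never decrease, whatever the candidate offspring look like.
random.seed(42)
pop = [random.random() for _ in range(10)]
for _ in range(100):
    candidates = [random.random() for _ in range(5)]  # recombination/mutation/transfer
    new_pop = topk(pop + candidates, len(pop))
    assert max(new_pop) >= max(pop)  # psi_max is nondecreasing
    assert min(new_pop) >= min(pop)  # psi_min is nondecreasing
    pop = new_pop
print("monotonicity verified")
```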
Theorem 2.
Given an Evo-SbPG with correspondingly chosen fitness functions and the above EA-based learning scheme, the algorithm converges to a local Nash equilibrium.
Proof. 
This follows directly from the underlying Evo-SbPG structure and the fact [26] that the learning scheme represents a best response strategy.    □

7. Results and Discussion

In this section, the proposed approaches and developed strategies undergo thorough evaluation and critical analysis conducted on a distributed production environment. To this end, after introducing the testbed, we first compare the novel approach in different settings with other state-of-the-art methods. We then present the actor policies followed by a discussion of the coverage of the action space.

7.1. Implementation in a Laboratory Test Field

The Bulk Good Laboratory Process (BGLP) is a flexible, intelligent production system that specializes in the continuous transport of bulk materials shown in Figure 6.
The process takes place across three independent operating modules. Station 1 is a preparation unit where material is distributed via a conveyor belt. Station 2 uses a vibrating conveyor for processing. Station 3 uses a rotary valve for continuous dosing of the material. In addition, level sensors are used to monitor the process.
For the considered scenario, we assign an agent to each actuator, i.e., the conveyor belt, the vibration conveyor, two vacuum pumps, and the rotary feeder. All actuators have a normalized action space of $[0, 1]$ except for the on–off vibration conveyor. The state space of each agent consists of the fill levels of the upstream and downstream buffers, respectively. We further define the local fitness function $\Psi_i$ for each agent $i$ using the overflow and emptiness of the upstream and downstream buffers $L_p^i$ and $L_s^i$, the energy consumption $P_i$, and the production target $V_D$ as
$$\Psi_i = \frac{1}{1 + \alpha_L L_p^i} + \mathbb{1}_{i \neq N} \frac{1}{1 + \alpha_L L_s^i} + \mathbb{1}_{i = N} \frac{1}{1 - \alpha_D V_D} + \frac{1}{1 + \alpha_P P_i}$$
with weighting factors $\alpha_L$, $\alpha_D$, and $\alpha_P$. The distributed production process iterates over a sequence $T$, where the fill levels of the individual silos and hoppers are given by $h_N$. The overflow and emptiness of the buffers are calculated by
$$L_s^i = \int_0^T \mathbb{1}_{h_s^i > H_s^i}(h_s^i) \, dt, \quad L_p^i = \int_0^T \mathbb{1}_{h_p^i < H_p^i}(h_p^i) \, dt,$$
with upper limit $H_s^i$ and lower limit $H_p^i$. The production demand is calculated as
$$V_D = \int_0^T \dot{D}_t \, dt, \quad \text{with} \quad \dot{D}_t = \begin{cases} \dot{V}_{N,out} - \dot{V}_{N,in}, & \text{if } h_N = 0 \\ 0, & \text{otherwise} \end{cases}$$
where the inflow $\dot{V}_{N,in}$ and the outflow $\dot{V}_{N,out}$ refer to the final station.

7.2. Results at the BGLP

We evaluate the strategies of the Evo-SbPG using the BGLP process. To this end, we compare the results of DGS, EA DGS, and EA DGS Trans with those of the Vanilla SbPG [15] and the GB SbPG M1 and GB SbPG M2 approaches [19]. Both the Vanilla SbPG and the GB SbPG approaches are based on fixed grid structures. However, the Vanilla SbPG approach uses random action generation, whereas the GB SbPG approaches use gradient optimization methods. GB SbPG M1 is based on a single leader–follower dynamic, whereas GB SbPG M2 uses multiple leader–follower relationships. The described approaches were carried out with the parameters proposed in the literature. While the DGS approach only uses dynamic grid structures, the EA DGS approach combines the dynamic representation with evolutionary strategies. The EA DGS Trans approach additionally combines both with knowledge transfer. An excerpt from the fitness values of the local agents and the resulting global objective is shown in Table 1.
The contributions of each agent to global fitness are presented in Figure 7.
Figure 7 shows that the different variants of our approach, i.e., DGS, EA DGS, and EA DGS Trans, outperform Vanilla SbPG and the gradient-based approaches, with EA DGS Trans performing best. Further, they yield strategies in which the individual agents contribute more evenly to the overall objective. This indicates improved coordination between individual agent policies.
We further provide results for the individual objectives for transport, overflow, demand and power consumption as shown in Table 2.
Although all approaches enable control of the BGLP process, significant differences in their individual process characteristics become apparent. The best response strategies Vanilla SbPG, DGS, EA DGS, and EA DGS Trans exhibit higher transport rates, whereas GB SbPG M1 and GB SbPG M2 show lower performance in this regard while requiring lower energy consumption.
A key feature of the DGS, EA DGS, and EA DGS Trans approaches is highlighted in the time plots shown in Figure 8. All the established strategies Vanilla SbPG, GB SbPG M1, and GB SbPG M2 are primarily focused on identifying and maintaining a single optimal control point while the effort required to establish or maintain control points varies depending on the strategy.
However, these results do not reflect the typical control behavior of such event-driven distributed control systems, which are usually characterized by on–off control. As shown, the novel approaches are more aligned with such realistic plant control behavior, as they support dynamic adaptation to fluctuating system conditions. This adaptive behavior is particularly evident in the transport, overflow, and power profiles of the DGS, EA DGS, and EA DGS Trans strategies, as illustrated in Figure 8. We attribute this to the better exploration of the state–action space and the improved coverage achieved by the DGS.
This interpretation is confirmed by an extended breakdown of the individual process objectives across the individual agents of the BGLP, see Table 3.
While the Vanilla SbPG, DGS, EA DGS, and EA DGS Trans strategies focus on self-optimizing state–action representations, GB SbPG M1 and GB SbPG M2 pursue goal-oriented optimization. This leads to higher transport values for the Vanilla SbPG and DGS methods, whereas the GB SbPG M1 and GB SbPG M2 approaches result in lower transport performance. Similar behavior is reflected in the material overflow characteristics: Vanilla SbPG and DGS tolerate minor overflow to prioritize the overall optimization, whereas GB SbPG M1 and GB SbPG M2 aim to minimize excess. The Vanilla SbPG and DGS strategies further show lower demand than GB SbPG M1 and GB SbPG M2, while power consumption behaves inversely to demand, with the Vanilla SbPG and DGS strategies drawing a higher power supply.

7.3. Analysis and Interpretation

Based on the parameters recorded at the BGLP, we perform a Kernel Density Estimation (KDE) and present it as a violin plot. For statistical classification, we also include the Interquartile Range (IQR) given by the 25% and 75% quartiles, the mean value μ, and the single and double standard deviations 1σ and 2σ, see Figure 9.
The recorded values from the BGLP process are treated as independent and identically distributed samples. To estimate the univariate probability density function f̂ at a given point x from a recorded dataset {x_1, x_2, …, x_n}, we employ a KDE defined by
$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),$$
where n denotes the number of data points and h represents the bandwidth. As kernel K(·), we use a Gaussian kernel with
$$K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}u^2}.$$
Other potential kernel functions include uniform, triangular, biweight, or triweight kernels. For our evaluation, we determine the bandwidth h by scaling Scott's Rule with a factor b_w to minimize the mean squared error:
$$h = b_w \cdot h_{\text{Scott}} = b_w \cdot \sigma\, n^{-1/5}.$$
While the bandwidth could alternatively be determined via Silverman's Rule or manual adjustment, we selected a scaling factor of b_w = 0.3. The mathematical smoothing of the KDE can attenuate or exaggerate outliers in the data, creating a continuous distribution shape that extends beyond the individual data values. In addition to the 25% and 75% IQR quartiles shown in red, the upper and lower whisker limits are estimated and plotted at 1.5 times the IQR.
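The estimator and the summary statistics described above can be sketched compactly in NumPy. This is a minimal illustration under our own naming, not the paper's implementation; the synthetic input data in the usage example below are likewise assumptions.

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2 * pi)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kde(samples, x, bw_factor=0.3):
    """Univariate KDE with bandwidth h = bw_factor * sigma * n^(-1/5),
    i.e., Scott's Rule scaled by the factor b_w from the text."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    h = bw_factor * samples.std(ddof=1) * n ** (-1.0 / 5.0)
    # Evaluate the kernel for every (query point, sample) pair and average.
    u = (np.asarray(x, dtype=float)[..., None] - samples) / h
    return gaussian_kernel(u).sum(axis=-1) / (n * h)

def violin_stats(samples):
    """Summary statistics shown alongside the violin plots: quartiles,
    mean, standard deviation, and whiskers at 1.5 times the IQR."""
    q25, q75 = np.percentile(samples, [25, 75])
    iqr = q75 - q25
    return {
        "mean": float(np.mean(samples)),
        "std": float(np.std(samples, ddof=1)),
        "q25": float(q25),
        "q75": float(q75),
        "whisker_low": float(q25 - 1.5 * iqr),
        "whisker_high": float(q75 + 1.5 * iqr),
    }
```

Evaluating `kde` on a dense grid yields the density curve of the violin plot; by construction the estimate integrates to one regardless of the chosen bandwidth factor.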

7.4. Visualization of the State–Action Representation

In addition to the performance improvements, we are also interested in the coverage of the state–action representation by the DGS. To this end, we visualize the state–action representation of the DGS in Figure 10. Specifically, we show a bar chart of the action values, a bird's-eye view of the state-space DGS, the resulting action grid, and the fitness surface of each agent. Together, these representations support both qualitative and quantitative analysis.
First, the DGS covers the state space entirely for agents 2–4, indicating the strength of the EA-based optimization, which allows us to explore the state space with maximum coverage. The reduced coverage of agents 1 and 5 is due to the process conditions. Specifically, the rotary feeder is directly responsible for demand fulfilment, restricting the state space, while the conveyor belt is typically only operated for smaller input buffer values. Further, we observe comparably smooth fitness surfaces, indicating a certain optimal region within the state space that aligns with the action surface as well. Overall, our approach succeeds in covering the full state–action space according to the optimization objectives, resulting in the better usage of actuation discussed before.

7.5. Tuning and Configuration Parameters

The reproducibility of the investigations is ensured by strictly controlled study conditions, enabling direct comparability with the reference approaches. The BGLP operates with an iteration cycle of T_I = 10 s, allowing for the comprehensive capture of the system dynamics via a discretized parameter sampling rate of T_S = 0.5 s.
We set up the same episodic training procedure for each experiment. This procedure contains nine training episodes and one testing episode. During training, the DGS were continuously adapted to the observed behavior and adaptive learning strategies of the agents.
The fitness function, which defines the objective of agent training, was specified by weighting factors for the upstream and downstream buffers α_L = 1, the production demand target α_D = 4, and the energy consumption α_P = 0.001.
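As an illustrative sketch only, the weighting can be applied as below. The objective terms `buffer_term`, `demand_term`, and `power_term` are hypothetical placeholders (the paper's exact per-objective definitions and sign conventions are not reproduced here); only the weights correspond to the configuration above.

```python
# Illustrative sketch of a weighted multi-objective fitness. The objective
# terms are hypothetical placeholders; only the weights alpha_L, alpha_D,
# and alpha_P match the configuration reported in the text.
ALPHA_L = 1.0      # weight of the upstream/downstream buffer objective
ALPHA_D = 4.0      # weight of the production demand target
ALPHA_P = 0.001    # weight of the energy consumption objective

def weighted_fitness(buffer_term: float, demand_term: float, power_term: float) -> float:
    """Combine (assumed normalized) objective terms into a scalar fitness,
    rewarding the buffer and demand objectives and penalizing power."""
    return ALPHA_L * buffer_term + ALPHA_D * demand_term - ALPHA_P * power_term
```

The ratio of the weights makes demand fulfilment the dominant objective, while the small α_P keeps energy consumption a soft penalty rather than a hard constraint.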
The parameters and system settings presented and used are methodologically and numerically directly comparable with the configurations of the reference approaches Vanilla SbPG [15] and GB SbPG M1 and GB SbPG M2 approaches [19].

7.6. Scalability, Transferability and Limitations

The results demonstrate that the autonomously formed agents interact both cooperatively and competitively, learning from one another. The Evo-SbPG agent architecture developed for this study enables parallel operation and linear scaling by proportionally expanding comparable modules. Although tested in a specific setting, the underlying mechanisms of the approach are transferable to other distributed production systems; we refer to [17] for a more detailed discussion. A limitation of the approach is its inherent reliance on the availability of the full state vector, which might be restrictive in real-world applications. Nevertheless, incorporating corresponding state observations or soft sensors might mitigate this problem; we leave an extension to partially observable processes to future research. Furthermore, the presented results are based on simulation models in lab environments, which might suffer from the sim2real gap. Approaches to account for the sim2real gap will also be part of future research. Thanks to its stable core logic, the approach can additionally be applied to similar decentralized processes, including dynamic supply chain management, traffic control, the coordination of autonomous vehicles, and voltage and frequency stabilization in smart grids.

8. Conclusions

In this paper, we proposed a novel approach for distributed optimization using population-based learning and DGS. To this end, we presented Evo-SbPG, a new game-theory-based framework that integrates EA strategies into SbPG. Further, we addressed the complexity of dynamic environments and extended the approach by the DGS, EA DGS, and EA DGS Trans strategies to provide dynamic state–action representations. The strategies were evaluated on multi-objective optimization problems in distributed production systems. The results demonstrate improved performance in the state–action representation and can be adapted to various applications. By leveraging DGS, the approach ensures high computational efficiency and scalability, allowing it to handle increasingly complex strategy spaces without the exponential cost typically associated with fine-grained discretization.
Future research will exploit the insights gained from DGS transitions to enhance state–action representations in more intricate coupled systems. Additionally, we will further explore and extend the foundations of knowledge transfer in MAS and its representation through graph theory, developing more comprehensive approaches. Specifically, we aim to investigate knowledge transfer mechanisms in multi-objective Evo-games, such as knowledge distillation, as well as the adaptability of DGS in non-stationary environments.

Author Contributions

Conceptualization, M.L., S.Y. and A.S.; methodology, M.L., S.Y. and A.S.; software, M.L.; validation, M.L. and S.Y.; formal analysis, M.L., S.Y. and A.S.; investigation, A.S.; resources, A.S.; data curation, M.L. and S.Y.; writing—original draft preparation, M.L. and A.S.; writing—review and editing, M.L., S.Y. and A.S.; visualization, M.L.; supervision, M.L.; project administration, M.L. and A.S. All authors have read and agreed to the published version of the manuscript.

Funding

This article is funded by the Open Access Publication Fund of South Westphalia University of Applied Sciences.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Agent structures, scenarios, and environments (including the BGLP) introduced in this study are provided by the MLPro framework. Parameters, data structures, and comparative benchmarks for Vanilla SbPG [15] and GB SbPG M1 and GB SbPG M2 approaches [19] are included within these resources.

Acknowledgments

We would like to express our gratitude to our colleagues in the Department of Automation Technology and Learning Systems at the South Westphalia University of Applied Sciences for their invaluable support.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

    The following abbreviations are used in this manuscript:
AI    Artificial Intelligence
BGLP    Bulk Good Laboratory Process
CPS    Cyber–Physical Systems
DGS    Dynamic Grid Structures
EA    Evolutionary Algorithms
ES    Evolutionary Strategies
Evo-SbPG    Evolutionary State-based Potential Game
GA    Genetic Algorithms
GT    Game Theory
IoT    Internet of Things
IQR    Interquartile Range
KDE    Kernel Density Estimation
MAS    Multi-Agent System
ML    Machine Learning
PG    Potential Games
RL    Reinforcement Learning
SbPG    State-based Potential Games

Appendix A. Barycentric Coordinate Difference

The adjustment of the existing representation by the dynamic grid D_T(P_t) in time step t is a central aspect. When an adjustment is made at a point P, the surrounding triangle section ΔABC is taken into account. Using the Barycentric coordinates and the triangle points A, B and C, a linear combination can be established as follows:
$$P = \lambda_1 A + \lambda_2 B + \lambda_3 C.$$
The Barycentric coordinates λ1, λ2 and λ3 refer to the point P and satisfy the following conditions:
$$\lambda_1 + \lambda_2 + \lambda_3 = 1, \qquad \lambda_1, \lambda_2, \lambda_3 \geq 0.$$
By decomposing the triangulation points A = (x1, y1), B = (x2, y2) and C = (x3, y3) into coordinates of the state-space location, a system of equations can be established as follows:
$$x = \lambda_1 x_1 + \lambda_2 x_2 + \lambda_3 x_3, \quad y = \lambda_1 y_1 + \lambda_2 y_2 + \lambda_3 y_3, \quad \lambda_1 + \lambda_2 + \lambda_3 = 1.$$
By rearranging and substituting, the system can be reduced to two equations in two unknowns. To solve for the Barycentric coordinates, the linear system can be written in matrix form:
$$\begin{pmatrix} x - x_3 \\ y - y_3 \end{pmatrix} = \begin{pmatrix} x_1 - x_3 & x_2 - x_3 \\ y_1 - y_3 & y_2 - y_3 \end{pmatrix} \begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix}.$$
Collecting the matrix $\mathbf{A}$ and the vector $\mathbf{b}$ yields the linear system
$$\mathbf{A} \cdot \begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix} = \mathbf{b}.$$
Solving this system for λ1 and λ2, the missing Barycentric coordinate λ3 follows from
$$\lambda_3 = 1 - \lambda_1 - \lambda_2.$$
By interpolating the z values of the considered triangle points A, B and C, the value at the point P can be computed as
$$z = \lambda_1 z_1 + \lambda_2 z_2 + \lambda_3 z_3.$$
The interpolated point within the triangle section ΔABC is used as an additional target and for adjustment in the subsequent course of the evolutionary algorithm.
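The derivation above translates directly into code. The following is a minimal sketch of the computation with function names of our own choosing, assuming a non-degenerate triangle so that the 2×2 system is solvable.

```python
import numpy as np

def barycentric_coords(P, A, B, C):
    """Solve the 2x2 linear system for (lambda1, lambda2); lambda3
    follows from the partition-of-unity condition."""
    x, y = P
    (x1, y1), (x2, y2), (x3, y3) = A, B, C
    M = np.array([[x1 - x3, x2 - x3],
                  [y1 - y3, y2 - y3]], dtype=float)
    b = np.array([x - x3, y - y3], dtype=float)
    lam1, lam2 = np.linalg.solve(M, b)
    return lam1, lam2, 1.0 - lam1 - lam2

def interpolate_z(P, A, B, C, z):
    """Interpolate the z value at P from the z values at A, B and C."""
    lam = barycentric_coords(P, A, B, C)
    return lam[0] * z[0] + lam[1] * z[1] + lam[2] * z[2]
```

For a point inside the triangle, all three coordinates are non-negative; at the centroid they are all equal to 1/3, so the interpolated z value is the mean of the vertex values.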

Appendix B. Pseudocode

The following section presents the pseudocodes used for the implementation.
Algorithm A1: Update state–action representation by DGS
Algorithm A2: Agent-based evolutionary process
Algorithm A3: Excerpt of multi-agent knowledge transfer

References

  1. Jan, Z.; Ahamed, F.; Mayer, W.; Patel, N.; Grossmann, G.; Stumptner, M.; Kuusk, A. Artificial intelligence for industry 4.0: Systematic review of applications, challenges, and opportunities. Expert Syst. Appl. 2023, 216, 119456. [Google Scholar] [CrossRef]
  2. Soori, M.; Arezoo, B.; Dastres, R. Internet of things for smart factories in industry 4.0, a review. Internet Things Cyber-Phys. Syst. 2023, 3, 192–204. [Google Scholar] [CrossRef]
  3. Ryalat, M.; ElMoaqet, H.; AlFaouri, M. Design of a Smart Factory Based on Cyber-Physical Systems and Internet of Things towards Industry 4.0. Appl. Sci. 2023, 13, 2156. [Google Scholar] [CrossRef]
  4. Al-Sharman, M.; Dempster, R.; Daoud, M.A.; Nasr, M.; Rayside, D.; Melek, W. Self-Learned Autonomous Driving at Unsignalized Intersections: A Hierarchical Reinforced Learning Approach for Feasible Decision-Making. IEEE Trans. Intell. Transp. Syst. 2023, 24, 12345–12356. [Google Scholar] [CrossRef]
  5. Arshad, K.; Ali, R.F.; Muneer, A.; Aziz, I.A.; Naseer, S.; Khan, N.S.; Taib, S.M. Deep Reinforcement Learning for Anomaly Detection: A Systematic Review. IEEE Access 2022, 10, 124017–124035. [Google Scholar] [CrossRef]
  6. Nunes, P.; Santos, J.; Rocha, E. Challenges in predictive maintenance—A review. CIRP J. Manuf. Sci. Technol. 2023, 40, 53–67. [Google Scholar] [CrossRef]
  7. Javaid, M.; Haleem, A.; Singh, R.P.; Suman, R. Enabling flexible manufacturing system (FMS) through the applications of industry 4.0 technologies. Internet Things Cyber-Phys. Syst. 2022, 2, 49–62. [Google Scholar] [CrossRef]
  8. Duan, S.; Wang, D.; Ren, J.; Lyu, F.; Zhang, Y.; Wu, H.; Shen, X. Distributed Artificial Intelligence Empowered by End-Edge-Cloud Computing: A Survey. IEEE Commun. Surv. Tutor. 2023, 25, 591–624. [Google Scholar] [CrossRef]
  9. Amirkhani, A.; Barshooi, A.H. Consensus in multi-agent systems: A review. Artif. Intell. Rev. 2022, 55, 3897–3935. [Google Scholar] [CrossRef]
  10. Gu, S.; Grudzien Kuba, J.; Chen, Y.; Du, Y.; Yang, L.; Knoll, A.; Yang, Y. Safe multi-agent reinforcement learning for multi-robot control. Artif. Intell. 2023, 319, 103905. [Google Scholar] [CrossRef]
  11. Zhang, J.D.; He, Z.; Chan, W.H.; Chow, C.Y. DeepMAG: Deep reinforcement learning with multi-agent graphs for flexible job shop scheduling. Knowl.-Based Syst. 2023, 259, 110083. [Google Scholar] [CrossRef]
  12. Kiran, B.R.; Sobh, I.; Talpaert, V.; Mannion, P.; Sallab, A.A.A.; Yogamani, S.; Pérez, P. Deep Reinforcement Learning for Autonomous Driving: A Survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 4909–4926. [Google Scholar] [CrossRef]
  13. Ahmad, F.; Shah, Z.; Al-Fagih, L. Applications of evolutionary game theory in urban road transport network: A state of the art review. Sustain. Cities Soc. 2023, 98, 104791. [Google Scholar] [CrossRef]
  14. Gill, K.S.; Sharma, A.; Saxena, S. A Systematic Review on Game-Theoretic Models and Different Types of Security Requirements in Cloud Environment: Challenges and Opportunities. Arch. Comput. Methods Eng. 2024, 31, 3857–3890. [Google Scholar] [CrossRef]
  15. Schwung, D.; Schwung, A.; Ding, S.X. Distributed Self-Optimization of Modular Production Units: A State-Based Potential Game Approach. IEEE Trans. Cybern. 2022, 52, 2174–2185. [Google Scholar] [CrossRef]
  16. Yuwono, S.; Schwung, A. Model-based learning on state-based potential games for distributed self-optimization of manufacturing systems. J. Manuf. Syst. 2023, 71, 474–493. [Google Scholar] [CrossRef]
  17. Yuwono, S.; Schwung, D.; Schwung, A. Transfer learning of state-based potential games for process optimization in decentralized manufacturing systems. Comput. Ind. 2025, 173, 104376. [Google Scholar] [CrossRef]
  18. Yuwono, S.; Schwung, D.; Schwung, A. Distributed Stackelberg Strategies in State-Based Potential Games for Autonomous Decentralized Learning Manufacturing Systems. IEEE Trans. Syst. Man Cybern. Syst. 2025, 55, 8112–8125. [Google Scholar] [CrossRef]
  19. Yuwono, S.; Löppenberg, M.; Schwung, D.; Schwung, A. Gradient-based Learning in State-based Potential Games for Self-Learning Production Systems. In Proceedings of the IECON 2024—50th Annual Conference of the IEEE Industrial Electronics Society; IEEE: Piscataway, NJ, USA, 2024; pp. 1–7. [Google Scholar] [CrossRef]
  20. Ladosz, P.; Weng, L.; Kim, M.; Oh, H. Exploration in deep reinforcement learning: A survey. Inf. Fusion 2022, 85, 1–22. [Google Scholar] [CrossRef]
  21. Hao, J.; Yang, T.; Tang, H.; Bai, C.; Liu, J.; Meng, Z.; Liu, P.; Wang, Z. Exploration in Deep Reinforcement Learning: From Single-Agent to Multiagent Domain. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 8762–8782. [Google Scholar] [CrossRef]
  22. Leonardos, S.; Piliouras, G. Exploration-exploitation in multi-agent learning: Catastrophe theory meets game theory. Artif. Intell. 2022, 304, 103653. [Google Scholar] [CrossRef]
  23. Dogru, O.; Xie, J.; Prakash, O.; Chiplunkar, R.; Soesanto, J.; Chen, H.; Velswamy, K.; Ibrahim, F.; Huang, B. Reinforcement Learning in Process Industries: Review and Perspective. IEEE/CAA J. Autom. Sin. 2024, 11, 283–300. [Google Scholar] [CrossRef]
  24. Kaven, L.; Huke, P.; Göppert, A.; Schmitt, R.H. Multi agent reinforcement learning for online layout planning and scheduling in flexible assembly systems. J. Intell. Manuf. 2024, 35, 3917–3936. [Google Scholar] [CrossRef]
  25. Ma, X.W.; Huang, T.; Liu, W.L.; Gong, Y.J. Collision-Aware Evolutionary Algorithm for Multi-Agent Coverage Path Planning. In Proceedings of the 2024 11th International Conference on Machine Intelligence Theory and Applications (MiTA); IEEE: Piscataway, NJ, USA, 2024; pp. 1–8. [Google Scholar] [CrossRef]
  26. Marden, J.R. State based potential games. Automatica 2012, 48, 3075–3088. [Google Scholar] [CrossRef]
  27. Rani, S.; Jining, D.; Shoukat, K.; Shoukat, M.U.; Nawaz, S.A. A Human–Machine Interaction Mechanism: Additive Manufacturing for Industry 5.0—Design and Management. Sustainability 2024, 16, 4158. [Google Scholar] [CrossRef]
  28. Perez-Gonzalez, P.; Framinan, J.M. A review and classification on distributed permutation flowshop scheduling problems. Eur. J. Oper. Res. 2024, 312, 1–21. [Google Scholar] [CrossRef]
  29. Huang, J.P.; Gao, L.; Li, X.Y. A Hierarchical Multi-Action Deep Reinforcement Learning Method for Dynamic Distributed Job-Shop Scheduling Problem With Job Arrivals. IEEE Trans. Autom. Sci. Eng. 2024, 22, 2501–2513. [Google Scholar] [CrossRef]
  30. Maier, H.; Razavi, S.; Kapelan, Z.; Matott, L.; Kasprzyk, J.; Tolson, B. Introductory overview: Optimization using evolutionary algorithms and other metaheuristics. Environ. Model. Softw. 2019, 114, 195–213. [Google Scholar] [CrossRef]
  31. Li, J.; Soradi-Zeid, S.; Yousefpour, A.; Pan, D. Improved differential evolution algorithm based convolutional neural network for emotional analysis of music data. Appl. Soft Comput. 2024, 153, 111262. [Google Scholar] [CrossRef]
  32. Menczer, F.; Degeratu, M.; Street, W.N. Efficient and Scalable Pareto Optimization by Evolutionary Local Selection Algorithms. Evol. Comput. 2000, 8, 223–247. [Google Scholar] [CrossRef]
  33. Sato, H.; Aguirre, H.; Tanaka, K. On the locality of dominance and recombination in multiobjective evolutionary algorithms. In Proceedings of the 2005 IEEE Congress on Evolutionary Computation; IEEE: Piscataway, NJ, USA, 2005; Volume 1, pp. 451–458. [Google Scholar] [CrossRef]
  34. Liu, H.l.; Li, X.; Chen, Y. Multi-Objective Evolutionary Algorithm Based on Dynamical Crossover and Mutation. In Proceedings of the 2008 International Conference on Computational Intelligence and Security; IEEE: Piscataway, NJ, USA, 2008; Volume 1, pp. 150–155. [Google Scholar] [CrossRef]
  35. Plump, C.; Berger, B.J.; Drechsler, R. Using density of training data to improve evolutionary algorithms with approximative fitness functions. In Proceedings of the 2022 IEEE Congress on Evolutionary Computation (CEC); IEEE: Piscataway, NJ, USA, 2022; pp. 1–10. [Google Scholar] [CrossRef]
  36. Song, Y.; Cai, X.; Zhou, X.; Zhang, B.; Chen, H.; Li, Y.; Deng, W.; Deng, W. Dynamic hybrid mechanism-based differential evolution algorithm and its application. Expert Syst. Appl. 2023, 213, 118834. [Google Scholar] [CrossRef]
  37. Löppenberg, M.; Schwung, A. Structured Graph Generation by Evolutionary Algorithm for Program Code Development. In Proceedings of the IECON 2024—50th Annual Conference of the IEEE Industrial Electronics Society; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar] [CrossRef]
  38. Alhijawi, B.; Awajan, A. Genetic algorithms: Theory, genetic operators, solutions, and applications. Evol. Intell. 2024, 17, 1245–1256. [Google Scholar] [CrossRef]
  39. Yuan, S.; Song, K.; Chen, J.; Tan, X.; Li, D.; Yang, D. EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms. arXiv 2024, arXiv:2406.14228. [Google Scholar]
  40. Wang, X.; Guo, S.; Li, B. Minimum Risk Decision Making Problem Based on Single-Objective Stochastic Optimization Model and Genetic Algorithm. In Proceedings of the 2024 IEEE 2nd International Conference on Electrical, Automation and Computer Engineering (ICEACE); IEEE: Piscataway, NJ, USA, 2024; pp. 1706–1711. [Google Scholar] [CrossRef]
  41. Malik, S.; Devine, M.T.; Keane, A. Leader-Follower Dynamics in P2P Energy Markets: A Bilevel Stochastic Optimization Approach. In Proceedings of the 2024 IEEE PES Innovative Smart Grid Technologies Europe (ISGT EUROPE); IEEE: Piscataway, NJ, USA, 2024; pp. 1–5. [Google Scholar] [CrossRef]
  42. Sapora, S.; Swamy, G.; Lu, C.; Teh, Y.W.; Foerster, J.N. EvIL: Evolution strategies for generalisable imitation learning. In Proceedings of the ICML’24: 41st International Conference on Machine Learning; JMLR.org: Brookline, MA, USA, 2024; pp. 1–5. [Google Scholar]
  43. Bejani, M.M.; Ghatee, M. A systematic review on overfitting control in shallow and deep neural networks. Artif. Intell. Rev. 2021, 54, 6391–6438. [Google Scholar] [CrossRef]
  44. Zhu, Z.; Lin, K.; Jain, A.K.; Zhou, J. Transfer Learning in Deep Reinforcement Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13344–13362. [Google Scholar] [CrossRef]
  45. Zhao, Z.; Alzubaidi, L.; Zhang, J.; Duan, Y.; Gu, Y. A comparison review of transfer learning and self-supervised learning: Definitions, applications, advantages and limitations. Expert Syst. Appl. 2024, 242, 122807. [Google Scholar] [CrossRef]
  46. Iman, M.; Arabnia, H.R.; Rasheed, K. A Review of Deep Transfer Learning and Recent Advancements. Technologies 2023, 11, 40. [Google Scholar] [CrossRef]
  47. Fontana, M.; Spratling, M.; Shi, M. When Multitask Learning Meets Partial Supervision: A Computer Vision Review. Proc. IEEE 2024, 112, 516–543. [Google Scholar] [CrossRef]
  48. Wang, W.; Wang, X.; Li, R.; Jiang, H.; Liu, D.; Ping, X. Transfer Reinforcement Learning of Robotic Grasping Training using Neural Networks with Lateral Connections. In Proceedings of the 2023 IEEE 12th Data Driven Control and Learning Systems Conference (DDCLS); IEEE: Piscataway, NJ, USA, 2023; pp. 489–494. [Google Scholar] [CrossRef]
  49. Jiang, J.; Chen, B.; Pan, J.; Wang, X.; Liu, D.; Jiang, J.; Long, M. ForkMerge: Mitigating Negative Transfer in Auxiliary-Task Learning. In Proceedings of the Advances in Neural Information Processing Systems; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 30367–30389. [Google Scholar]
  50. Chen, Y.; Wang, D.; Zhu, D.; Xu, Z.; He, B. Unsupervised domain adaptation of dynamic extension networks based on class decision boundaries. Multimed. Syst. 2024, 30, 80. [Google Scholar] [CrossRef]
  51. Wang, B. Domain Adaptation in Reinforcement Learning: Approaches, Limitations, and Future Directions. J. Inst. Eng. (India) Ser. B 2024, 105, 1223–1240. [Google Scholar] [CrossRef]
  52. Zhang, W.; Deng, L.; Zhang, L.; Wu, D. A Survey on Negative Transfer. IEEE/CAA J. Autom. Sin. 2023, 10, 305–329. [Google Scholar] [CrossRef]
  53. Zazo, S.; Valcarcel Macua, S.; Sanchez-Fernandez, M.; Zazo, J. Dynamic Potential Games with Constraints: Fundamentals and Applications in Communications. IEEE Trans. Signal Process. 2016, 64, 3806–3821. [Google Scholar] [CrossRef]
  54. Barber, C.B.; Dobkin, D.P.; Huhdanpaa, H. The quickhull algorithm for convex hulls. ACM Trans. Math. Softw. 1996, 22, 469–483. [Google Scholar] [CrossRef]
  55. Lee, D.T.; Schachter, B.J. Two algorithms for constructing a Delaunay triangulation. Int. J. Comput. Inf. Sci. 1980, 9, 219–242. [Google Scholar] [CrossRef]
Figure 1. Consideration of distributed production systems for process measurement and control systems in a decentralized arrangement, divided into sub-systems, sub-processes and sub-applications.
Figure 2. Representation of knowledge and production transfer in distributed production and manufacturing systems at the process and communication levels in serial and parallel arrangements.
Figure 3. Structural design of the multi-agent system and its interaction with the production environment.
Figure 4. 2D top view of the state–action representation and process schematic. (a) Illustration shows the fixed grid structure from [15,16,19]. (b) Illustration shows the DGS extension being moved and placed freely, with coverage and adjustment. (c) Schematic process of dynamic adjustment and updating of individuals for coverage maintenance.
Figure 5. Agent-based knowledge transfer. Transfer Learning of specific knowledge from the state–action representation of an agent I i to another agent I j in a similar problem.
Figure 6. The bulk good process with its agent structure.
Figure 7. Global fitness distribution and contributions of each individual agent.
Figure 8. Time plots evaluated at the BGLP over 10,000 steps. The aim is to provide continuous production for transport, overflow, demand and power characteristics based on global fitness.
Figure 9. The BGLP parameters are analyzed and interpreted using a KDE plot, which includes additional quartiles, mean values and standard deviations.
Figure 10. State–action representation of the BGLP as 2D and 3D visualization.
Table 1. Average fitness rating of each agent.

| Agent | Vanilla SbPG | GB SbPG M1 | GB SbPG M2 | DGS | EA DGS | EA DGS Trans |
|---|---|---|---|---|---|---|
| Agent 1 | 2.152 | 2.610 | 2.703 | 2.394 | 2.228 | 2.404 |
| Agent 2 | 1.892 | 2.635 | 2.544 | 2.285 | 2.012 | 2.317 |
| Agent 3 | 1.944 | 2.207 | 2.109 | 2.304 | 2.199 | 2.324 |
| Agent 4 | 1.561 | 1.467 | 1.711 | 2.052 | 2.390 | 2.108 |
| Agent 5 | 1.127 | 1.171 | 1.319 | 1.422 | 1.746 | 1.476 |
| Total | 8.675 | 10.089 | 10.387 | 10.456 | 10.574 | 10.629 |
Table 2. Average process parameters per strategy.

| Total | Vanilla SbPG | GB SbPG M1 | GB SbPG M2 | DGS | EA DGS | EA DGS Trans |
|---|---|---|---|---|---|---|
| transport | 0.984 | 0.451 | 0.475 | 0.854 | 0.957 | 0.884 |
| overflow | 0.034 | 0.000 | 0.000 | 0.018 | 0.009 | 0.017 |
| demand | 127.130 | 129.850 | 130.793 | 118.448 | 115.027 | 116.623 |
| power | 0.726 | 0.440 | 0.446 | 0.652 | 0.700 | 0.669 |
Table 3. Average process parameters for each agent and production target at the BGLP.

| Agent | Objective | Vanilla SbPG | GB SbPG M1 | GB SbPG M2 | DGS | EA DGS | EA DGS Trans |
|---|---|---|---|---|---|---|---|
| Agent 1 | transport | 0.191 | 0.080 | 0.089 | 0.172 | 0.186 | 0.177 |
| | overflow | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| | demand | 21.008 | 21.019 | 21.195 | 21.095 | 21.024 | 21.040 |
| | power | 0.045 | 0.040 | 0.041 | 0.044 | 0.045 | 0.045 |
| Agent 2 | transport | 0.228 | 0.080 | 0.090 | 0.184 | 0.203 | 0.192 |
| | overflow | 0.001 | 0.000 | 0.000 | 0.001 | 0.000 | 0.001 |
| | demand | 30.094 | 22.540 | 24.081 | 27.213 | 30.303 | 27.994 |
| | power | 0.162 | 0.077 | 0.082 | 0.137 | 0.148 | 0.141 |
| Agent 3 | transport | 0.200 | 0.080 | 0.089 | 0.171 | 0.212 | 0.179 |
| | overflow | 0.007 | 0.000 | 0.000 | 0.004 | 0.000 | 0.003 |
| | demand | 19.686 | 23.030 | 24.953 | 17.282 | 25.087 | 17.039 |
| | power | 0.027 | 0.027 | 0.027 | 0.027 | 0.027 | 0.027 |
| Agent 4 | transport | 0.168 | 0.115 | 0.109 | 0.159 | 0.178 | 0.164 |
| | overflow | 0.025 | 0.000 | 0.000 | 0.012 | 0.009 | 0.012 |
| | demand | 24.444 | 32.021 | 31.285 | 24.895 | 23.655 | 24.864 |
| | power | 0.251 | 0.179 | 0.174 | 0.240 | 0.263 | 0.247 |
| Agent 5 | transport | 0.197 | 0.095 | 0.099 | 0.167 | 0.177 | 0.172 |
| | overflow | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| | demand | 31.898 | 31.240 | 29.279 | 27.963 | 14.960 | 25.686 |
| | power | 0.241 | 0.116 | 0.122 | 0.205 | 0.218 | 0.210 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Löppenberg, M.; Yuwono, S.; Schwung, A. Multi-Agent Transfer Learning Based on Evolutionary Algorithms and Dynamic Grid Structures for Industrial Applications. AI 2026, 7, 62. https://doi.org/10.3390/ai7020062

