1. Introduction
Over recent decades, the integration of Artificial Intelligence (AI) and the Internet of Things (IoT) has significantly transformed process automation [1,2] toward Cyber–Physical Systems (CPS) [3], driven by applications such as self-learning [4], anomaly detection [5], and predictive maintenance [6].
To meet modern industrial demands, workflows must become more adaptive and flexible, a key to higher efficiency, fewer errors, and reliable scalability. Consequently, there is growing interest in distributed optimization strategies that rely on flexible, autonomous control approaches [7,8]. Addressing these requirements calls for a distributed Multi-Agent System (MAS) approach [9], capable of aligning local objectives with global optimization goals.
Typical machine learning-based distributed optimization algorithms are found in both game theory (GT) and reinforcement learning (RL). While RL is commonly applied in areas like robot navigation [10], job shop scheduling [11], and autonomous driving [12], GT is increasingly used in domains such as urban traffic control [13] and cloud security [14]. The use of multi-agent GT architectures with simple best response learning for optimization in large-scale production environments based on Potential Games (PG) has been demonstrated and further extended by state information in [15]. Ref. [16] built on this using multi-step model predictors in model-based learning to train State-based Potential Game (SbPG) players in simulated environments for self-optimizing manufacturing systems. Furthermore, Transfer Learning (TL) approaches have been introduced to infer similarities between players during training [17]. In parallel, other works combined SbPGs with Stackelberg strategies in modular systems, assigning roles to players to facilitate multi-objective optimization via simplified utility functions [18], and incorporating gradient-based learning methods [19].
Building on these principles, we extend these strategies and focus on the following challenges: First, RL reaches its limits in complex multi-agent systems, particularly due to high computing costs, slow convergence, and problems with generalization [20,21]. In GT, the assumption of rational players and the search for stable equilibria such as the Nash equilibrium complicate practical applicability in dynamic, real-world scenarios [22]. Second, the best response strategies employed so far rely on static grids, which are highly ineffective for dynamic systems [20]. This results in increased computational load, incorrect decisions, and limited generalizability [21]. Third, in dynamic and complex production environments with multiple products and machines, traditional optimization methods reach their limits [23]. High variance, machine dependencies, and constant rescheduling make it difficult to make efficient decisions and adapt quickly to new production conditions [24].
To tackle the first challenge, we propose a novel agent structure that combines GT-based learning with Evolutionary Algorithms (EAs). This enables more diverse strategies for optimizing manufacturing [25]. Combining EAs with best response strategies [15,26] can be expected to offer more efficient learning. To address the inherent limitations of static grid structures for best response learning, we propose a novel dynamic grid structure and integrate it into an EA-based learning strategy. Furthermore, we tackle the complexity of dynamic and complex environments by introducing a knowledge transfer framework which allows us to exchange gained knowledge between individual learning agents, further increasing the efficiency of the self-learning process.
The main contributions of this work can be summarized as follows:
We introduce a novel Evolutionary State-based Potential Game, called Evo-SbPG, to enable adaptive and scalable distributed optimization. We formally provide convergence guarantees for the novel game structure.
We propose a state–action representation, called dynamic grid structures (DGS), which allows for a more flexible representation of agent policies.
We propose a novel EA-embedded knowledge transfer scheme between agents to share existing knowledge and explore new strategies on the DGS level.
We evaluate the approaches developed in a production environment and compare the results of different alternative learning strategies, highlighting the effectiveness of the approach.
The remainder of this work is structured as follows: Section 2 gives an overview and evaluation of the current literature. Section 3 presents the problem statement, and Section 4 presents the framework and the definition of the Evo-SbPG. Section 5 explains the fundamental approach of the DGS, which is extended by a multi-agent knowledge transfer. The convergence analysis is provided in Section 6. An analysis of the results and an extensive discussion are provided in Section 7, with a final conclusion in Section 8.
3. Problem Description
We focus on distributed production systems as illustrated in Figure 1.
The system, divided into functional sub-areas, operates on two interconnected levels, the process level and the communication level, see Figure 2. At the process level, production transfer involves the movement of goods between sub-processes. Simultaneously, knowledge transfer enables continuous communication via interfaces such as Ethernet, field bus systems, or wireless links. Depending on system complexity, interactions can occur sequentially or in parallel while forming discrete, continuous, or hybrid process structures.
We model the structure of distributed manufacturing systems using graph theory. In particular, we formulate a directed graph with vertex set and edge set . Further, we consider a group of actuators with corresponding action space and the state space of the production system . For each actuator, we define its upstream state space and a downstream state space . To model the production objectives, we define fitness functions with and where denotes global states known to all agents.
In our study, we allow agent–agent knowledge transfer via the communication level. This leads to a transfer set that promotes knowledge transfer between EA agents i and j. Here, transfer is limited to agent pairs, defined as .
Consequently, the overall objective of distributed optimization can be stated as the maximization of a globally defined fitness by solely maximizing the local fitness functions of each module. The latter is achieved by combining ideas from SbPG with evolutionary processes. Such distributed production scenarios can be found in various areas of industrial production, such as food production, the automotive industry, or pharmaceutical production.
4. Framework and Evo-SbPG Definition
In this section, we introduce an EA-based multi-agent structure, as shown in Figure 3. To this end, we employ an EA-based agent i for each actuator with corresponding action set including discrete actions or continuous actions .
To allow for distributed optimization according to Equation (1), we propose to leverage multi-agent structures from game theory, namely SbPG, and define them in the context of EAs. This results in an EA agent structure defined as follows.
Definition 1. A game defines an Evo-SbPG if a global objective can be found that, for every state–action pair, conforms to the conditions in Equations (2) and (3). Thereby, the optimization uses an evolutionary population-based learning process. Equation (3) is standard in SbPGs and assures the contractivity of the potential function with respect to the state variable. The most crucial characteristic of standard SbPGs [26] is their convergence to equilibrium points given a best response strategy. Furthermore, various conditions have been derived to prove the existence of an SbPG for a given game setting [53]. These properties directly translate to Evo-SbPG. In what follows, we will present a novel best response learning strategy using evolutionary operations, bringing together convergent best response learning and evolutionary operations.
5. EA-Based Learning with Dynamic Grids
A central aspect is the interaction of SbPGs within the EA framework. In what follows, building on the research of [15,19] regarding best response learners, we start by extending the best response learners to use DGS. Then, we demonstrate the adaptation of evolutionary principles through population modelling and detail the integration of genetic operators by recombination, mutation, and selection, which is categorized into local and global scales. Lastly, we show the implementation of a convergent knowledge transfer procedure.
5.1. Characteristics of Best Response Learning
Existing approaches for optimizing SbPG are based on best response learning [15], later expanded to gradient-based learning in [19]. These works use static, fixed grids to map states into resulting actions. The grids are defined by support vectors whose assigned state value is fixed while the action values are continuously adjusted according to a best response strategy, see Figure 4a. The computation of an action given an actual state value is then computed as a distance-weighted sum of all support vectors with and
However, the static grids are inefficient for nonlinear optimization, while best response strategies often exhibit slow convergence. In the following, we extend the learning strategies using population-based learning within dynamic, adjustable grid structures as shown in Figure 4b. The process schematic of dynamic adjustment and coverage maintenance is illustrated in Figure 4c.
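The distance-weighted action computation described above can be sketched as follows. This is a hedged illustration assuming inverse-distance weighting over the support vectors; the exact weighting kernel used in [15,19] may differ, and all function and variable names here are ours.

```python
import numpy as np

def interpolate_action(state, support_states, support_actions, eps=1e-9):
    """Distance-weighted interpolation of an action from grid support vectors.

    Assumed form: normalized inverse-distance weighting over all support
    vectors, falling back to the stored action on an exact grid hit.
    """
    state = np.atleast_1d(state)
    dists = np.linalg.norm(support_states - state, axis=1)
    if np.any(dists < eps):                      # exact hit on a support vector
        return float(support_actions[np.argmin(dists)])
    weights = 1.0 / dists
    weights /= weights.sum()                     # normalize to a convex combination
    return float(weights @ support_actions)
```

By construction, the interpolated action always lies within the convex hull of the stored action values, which keeps the policy bounded even between support vectors.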
To this end, we first consider the support vectors as individuals of a population, each consisting of its state value, action value, and an associated fitness value. Second, we use recombination and mutation to encourage diversity in the action update. Third, we update the state values within the population based on the system dynamics. Fourth, we use a combination of local and global selection to ensure a constant population size in every step. We will now present these steps in detail.
5.2. Population of Support Vectors
We define, for each support vector, an individual within a population defined as with state support vector , corresponding action , and associated fitness . We employ a finite number of individuals in a population of agent i, resulting in
Note that the above population covers state–action space and fitness value, while the action has to be optimized. Hence, we propose to employ the regular evolutionary operations, i.e., recombination and mutation, solely to the actions, while updating the state vector via the regular system dynamics.
5.2.1. Recombination
To maintain genetic diversity and enhance robustness, individuals are recombined. This mechanism is addressed in [
37] by
where the different individual’s actions
and
are sampled from the population and recombined to form the action
,
is a random parameter, and
K is the number of recombined individuals.
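The recombination of K sampled parent actions can be sketched as an intermediate recombination, i.e., a random convex combination of the parents. This is an illustrative stand-in; the exact operator of [37] may weight parents differently.

```python
import numpy as np

def recombine(actions, K, rng=None):
    """Form an offspring action from K parent actions sampled from the
    population (illustrative intermediate recombination operator)."""
    if rng is None:
        rng = np.random.default_rng()
    parents = rng.choice(actions, size=K, replace=False)
    w = rng.random(K)
    w /= w.sum()                 # random convex weights summing to 1
    return float(w @ parents)
```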
5.2.2. Mutation
To explore novel solutions, random mutations are introduced to expand the search space. This process supports the exploration of new strategies and is described by
with random value
and number of mutated individuals
M. This genetic adaptation enables individuals to expand the search space.
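A minimal sketch of the mutation step, assuming a Gaussian perturbation clipped to the normalized action space; the mutation strength sigma is an illustrative choice, not a value from the paper.

```python
import numpy as np

def mutate(action, sigma=0.05, low=0.0, high=1.0, rng=None):
    """Gaussian mutation of an action value, clipped to [low, high]
    (illustrative operator and parameters)."""
    if rng is None:
        rng = np.random.default_rng()
    return float(np.clip(action + rng.normal(0.0, sigma), low, high))
```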
5.2.3. Update of Support Vectors
In addition to the action update, we have to consider the update of the support vector. In theory, we can also use recombination and mutation. However, this would ignore the system dynamics. Hence, we opt to simply set the new support vector to be equal to
, i.e., the state obtained after applying
to the system. Hence, we obtain the new individual
as
where
is the achieved fitness.
5.3. Local and Global Selection
The above updates result in an increased population, which has to be reduced to its original size. To this end, we have to address two goals, namely selecting individuals with the best fitness while at the same time preserving the grid coverage. We therefore propose a local selection for coverage preservation followed by a standard global selection.
5.3.1. Local Selection
The task of local selection is to preserve the grid's coverage of the state space. Let denote the convex hull of the grid, i.e., the smallest convex shape that encloses all points. Then, we define two distinct cases for the population update:
Note that the property can be easily determined by applying the QuickHull algorithm [54].
In case of add(), the individual is simply added to the population. The case of update() is more involved as the corresponding support vector is already present in the coverage of the state–action representation. To avoid an accumulation of support vectors within a small area, but make use of the obtained fitness update, we propose a local update step of the grid.
To this end, we evaluate whether adding the new individual improves the local fitness of the state–action representation at position by the fitness value . Since the comparable fitness at the point of interest may not correspond directly to an individual in the population , we estimate the fitness value using the Delaunay triangulation of the three nearest support vectors. A Delaunay triangulation divides a set of points into triangles such that no point lies within the circumcircle of any triangle [55]. This maximizes the smallest angles of the triangles and thus avoids very pointed, numerically unfavorable shapes. For a comprehensive derivation of the calculations, see Appendix A.
The calculated fitness of the state–action representation is used to decide on exploitation or rejection of the information with
Consequently, a new individual is only added to the population if its fitness is better than a virtual fitness calculated from neighboring support vectors. We denote this as local selection.
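The two-case local selection can be sketched with `scipy.spatial.Delaunay`. This is a hedged illustration: `find_simplex` doubles as the convex-hull membership test (a point outside the hull lies in no simplex), standing in for the QuickHull check, and the virtual fitness is obtained by barycentric interpolation over the enclosing triangle rather than the paper's exact three-nearest-neighbor construction. The 2-D `[state_x, state_y, action, fitness]` row layout is our assumption.

```python
import numpy as np
from scipy.spatial import Delaunay

def local_select(population, candidate):
    """Local selection sketch: add the candidate if it extends state-space
    coverage (outside the current convex hull); otherwise keep it only if
    its fitness beats a virtual fitness interpolated from the enclosing
    Delaunay triangle of existing support vectors."""
    states = population[:, :2]
    tri = Delaunay(states)                      # triangulation of support states
    s = candidate[:2]
    simplex = tri.find_simplex(s)
    if simplex == -1:                           # outside hull -> add() case
        return np.vstack([population, candidate])
    # update() case: barycentric interpolation of neighbor fitness
    verts = tri.simplices[simplex]
    T = tri.transform[simplex]
    bary = T[:2] @ (s - T[2])
    bary = np.append(bary, 1.0 - bary.sum())    # barycentric coordinates of s
    u_virtual = bary @ population[verts, 3]
    if candidate[3] > u_virtual:                # exploit only if fitness improves
        return np.vstack([population, candidate])
    return population                           # reject
```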
5.3.2. Global Selection
As the size of the population after local selection typically exceeds the maximum number of individuals, we use a standard selection approach in that we select individuals based on their fitness: where denotes the selection of the individuals with the best fitness.
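The global selection reduces to a truncation by fitness, sketched below under the assumption that fitness is stored in the last column of each individual's row.

```python
import numpy as np

def global_select(population, n_max):
    """Keep the n_max individuals with the best fitness
    (fitness assumed in the last column of each row)."""
    order = np.argsort(population[:, -1])[::-1]   # sort by fitness, descending
    return population[order[:n_max]]
```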
5.4. Agent-Based Knowledge Transfer
As stated before, the training process can be enhanced using knowledge transfer between agents, i.e., agents exhibiting similar behavior can share information. The aim is to enable faster learning with minimal training requirements by utilizing existing knowledge to improve performance, see Figure 5.
To this end, we have to solve two basic issues. First, we have to derive metrics, which determine the degree of similarity between the agent’s state–action representations. Second, we leverage our population-based learning approach to transfer knowledge between the agents.
In order to transfer knowledge, we use the individual's representation consisting of support vector, action, and fitness as a normalized feature vector on which we define a similarity metric as follows:
Hence, the knowledge transfer takes place between agents i and j with the most consistent match according to .
A reasonable knowledge transfer can only be achieved if integrating the individual improves the fitness value . This transfer mechanism is described by
Hence, only the individual I is selected for transfer if integrating it into the target state–action representation improves the fitness value . By requiring a strictly positive effect on target fitness, we ensure that source knowledge acts as a catalyst rather than a source of interference in the sensitive early stages of adaptation.
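The partner-selection step of the knowledge transfer can be sketched as follows. We use cosine similarity of normalized feature vectors as a plausible instance of the similarity metric; the paper's exact metric and feature encoding are not reproduced here, and the flattened (state, action, fitness) encoding is our assumption (it requires equally sized populations).

```python
import numpy as np

def feature_vector(population):
    """Normalized feature vector of an agent's population: flattened
    (state, action, fitness) rows scaled to unit norm (illustrative)."""
    v = population.flatten().astype(float)
    return v / (np.linalg.norm(v) + 1e-12)

def best_transfer_pair(populations):
    """Pick the agent pair (i, j), i != j, with the highest cosine
    similarity of their feature vectors as transfer partners."""
    feats = [feature_vector(p) for p in populations]
    best, pair = -np.inf, None
    for i in range(len(feats)):
        for j in range(len(feats)):
            if i == j:
                continue
            sim = float(feats[i] @ feats[j])   # cosine similarity (unit norms)
            if sim > best:
                best, pair = sim, (i, j)
    return pair, best
```

A transferred individual would then pass the same fitness-improvement test as in local selection before being integrated into the target representation.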
Technical implementations of the state–action representation update by DGS (Algorithm A1), the agent-based evolutionary process (Algorithm A2), and an excerpt of the multi-agent knowledge transfer (Algorithm A3) are provided as pseudocode in Appendix B.
7. Results and Discussion
In this section, the proposed approaches and developed strategies are thoroughly evaluated and critically analyzed on a distributed production environment. To this end, after introducing the testbed, we first compare the novel approach in different settings with other state-of-the-art methods. We then present the actor policies, followed by a discussion of the coverage of the action space.
7.1. Implementation in a Laboratory Test Field
The Bulk Good Laboratory Process (BGLP) is a flexible, intelligent production system that specializes in the continuous transport of bulk materials, shown in Figure 6.
The process takes place across three independent operating modules. Station 1 is a preparation unit where material is distributed via a conveyor belt. Station 2 uses a vibrating conveyor for processing. Station 3 uses a rotary valve for continuous dosing of the material. In addition, level sensors are used to monitor the process.
For the considered scenario, we assign an agent to each actuator, i.e., the conveyor, vibration conveyor, two vacuum pumps, and rotary feeder. All actuators have a normalized action space of except for the on–off vibration conveyor. The state space of each agent consists of the fill levels of the upstream and downstream buffers, respectively. We further define the local fitness function for each agent i using the overflow and emptiness of the upstream and downstream buffers and , the energy consumption , and the production targets as with weighting factors , and . The distributed production process iterates with a sequence T, where the fill levels of the individual silos and hoppers are covered by . The overflow and emptiness of the buffers are calculated by with upper limit and lower limit . The production demand is calculated with where the inflow and the outflow refer to the station.
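The composition of the local fitness from buffer, energy, and demand terms can be sketched as below. This is a hedged illustration only: the paper's exact expressions, weighting factors, and buffer limits are not reproduced, so all numeric defaults here are placeholders.

```python
def buffer_penalty(level, low, high):
    """Penalty for overflow above 'high' and emptiness below 'low' of a
    buffer (illustrative form, not the paper's exact expression)."""
    return max(0.0, level - high) + max(0.0, low - level)

def local_fitness(up_level, down_level, power, inflow, outflow,
                  low=0.2, high=0.8, w_b=1.0, w_p=0.5, w_d=1.0):
    """Hedged sketch of an agent's local fitness: weighted penalties for
    buffer violations, energy use, and demand mismatch. All weights and
    limits are placeholder values, not the paper's configuration."""
    penalty = w_b * (buffer_penalty(up_level, low, high)
                     + buffer_penalty(down_level, low, high))
    penalty += w_p * power                      # energy consumption term
    penalty += w_d * abs(inflow - outflow)      # demand mismatch term
    return -penalty                             # higher fitness is better
```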
7.2. Results at the BGLP
We evaluate the strategies of the Evo-SbPG using the BGLP process. To this end, we compare the results of DGS, EA DGS, and EA DGS Trans with those of the Vanilla SbPG [15] and the GB SbPG M1 and GB SbPG M2 approaches [19]. Both the Vanilla SbPG and GB SbPG approaches are based on fixed grid structures. However, the Vanilla SbPG approach uses random action generation, whereas the GB SbPG approaches use gradient optimization methods. GB SbPG M1 is based on a single leader–follower dynamic, and GB SbPG M2 on multiple leader–follower relationships. The described approaches were carried out with the parameters proposed in the literature. While the DGS approach only uses dynamic grid structures, the EA DGS approach combines dynamic representation with evolutionary strategies. The EA DGS Trans approach additionally combines both with knowledge transfer. An excerpt from the fitness values of the local agents and the resulting global objective is shown in Table 1.
The contributions of each agent to global fitness are presented in Figure 7.
Figure 7 shows that the different variants of our approach, i.e., DGS, EA DGS, and EA DGS Trans, outperform Vanilla SbPG and the gradient-based approaches, with EA DGS Trans performing best. Further, they yield strategies where the individual agents contribute more evenly to the overall objective. This indicates improved coordination between individual agent policies.
We further provide results for the individual objectives of transport, overflow, demand, and power consumption, as shown in Table 2.
Although all approaches enable control of the BGLP process, significant differences in their individual process characteristics become apparent. The best response strategies Vanilla SbPG, DGS, EA DGS, and EA DGS Trans exhibit higher transport rates, whereas GB SbPG M1 and GB SbPG M2 show lower performance in this regard while requiring lower energy consumption.
A key feature of the DGS, EA DGS, and EA DGS Trans approaches is highlighted in the time plots shown in Figure 8. All the established strategies, Vanilla SbPG, GB SbPG M1, and GB SbPG M2, are primarily focused on identifying and maintaining a single optimal control point, while the effort required to establish or maintain control points varies depending on the strategy.
However, these results do not reflect typical control behavior in such event-driven distributed control systems, which are typically characterized by on–off control behavior. As shown, the novel approaches are more aligned with such realistic plant control behavior, as they support dynamic adaptation to fluctuating system conditions. This adaptive behavior is particularly evident in the transport, overflow, and power profiles of the DGS, EA DGS, and EA DGS Trans strategies, as illustrated in Figure 8. We attribute this to the better exploration of the state–action space and improved coverage due to the DGS.
This interpretation is confirmed by an extended breakdown of the individual process objectives across the individual agents at the BGLP, see Table 3.
While the Vanilla SbPG, DGS, EA DGS, and EA DGS Trans strategies focus on self-optimizing state–action representations, GB SbPG M1 and GB SbPG M2 pursue goal-oriented optimization. This leads to higher transport values for the Vanilla SbPG and DGS methods, whereas the GB SbPG M1 and GB SbPG M2 approaches result in lower transport performance. Similar behavior is reflected in the material overflow characteristics. Vanilla SbPG and DGS tolerate minor overflow to prioritize overall optimization, whereas GB SbPG M1 and GB SbPG M2 aim to minimize excess. The Vanilla SbPG and DGS strategies show lower demand than GB SbPG M1 and GB SbPG M2. Power consumption is inversely proportional to demand, with the Vanilla SbPG and DGS strategies drawing more power.
7.3. Analysis and Interpretation
Based on the parameters recorded at the BGLP, we perform a Kernel Density Estimation (KDE), shown as a violin plot. For statistical classification, we also include the Interquartile Range (IQR) given by the quartiles and , the mean value , and the single and double standard deviations and in the representation, see Figure 9.
The recorded values from the BGLP process are treated as independent and identically distributed samples. To estimate the univariate probability density function at a given point x from a recorded dataset , we employ a KDE defined by where n denotes the number of data points and h represents the bandwidth. As kernel , we used a Gaussian kernel with
Other potential kernel functions include uniform, triangular, biweight, or triweight kernels. For our evaluation, we determine the bandwidth h by scaling Scott's Rule with a factor to minimize the mean squared error, defined as
While the bandwidth could alternatively be determined via Silverman's Rule or manual adjustment, we selected a scaling factor of . Mathematical smoothing of the KDE can attenuate or exaggerate outliers in the data points, creating a continuous distribution shape that goes beyond the individual data values. In addition to the and IQR quartiles shown in red, the upper and lower whisker limits are estimated and plotted using times the IQR.
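The KDE described above, a Gaussian kernel with a Scott's-Rule bandwidth, can be sketched in a few lines. The `scale` parameter stands in for the paper's bandwidth scaling factor, whose actual value is not reproduced here.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x, data, scale=1.0):
    """Univariate KDE with a Gaussian kernel. The bandwidth follows
    Scott's Rule, h = std * n^(-1/5), multiplied by a tuning factor
    'scale' (placeholder for the paper's scaling factor)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    h = scale * data.std(ddof=1) * n ** (-1.0 / 5.0)
    return gaussian_kernel((x - data) / h).sum() / (n * h)
```

As a sanity check, the estimated density integrates to approximately one and peaks near the bulk of the samples.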
7.4. Visualization of the State–Action Representation
In addition to the performance improvements, we are also interested in the coverage of the state–action representation of the DGS. To this end, we visualize the state–action representation by the DGS in Figure 10. Specifically, we show the bar chart of action values, a bird's-eye view of the state space DGS, the resulting action grid, and the fitness surface of each agent. Together, these representations support both qualitative and quantitative analysis.
First, the DGS covers the state space entirely for agents 2–4, indicating the strength of the EA-based optimization, which allows us to explore the state space with maximum coverage. The reduced coverage of agents 1 and 5 is due to the process conditions. Specifically, the rotary feeder is directly responsible for demand fulfilment, restricting the state space, while the conveyor belt is typically only operated for smaller input buffer values. Further, we observe comparably smooth fitness surfaces indicating a certain optimal region within the state space, which aligns with the action surface as well. Overall, our approach succeeds in covering the full state–action space according to the optimization objectives, resulting in the better usage of actuation as discussed before.
7.5. Tuning and Configuration Parameters
The reproducibility of the investigations is secured through strictly controlled study conditions, enabling direct comparability with reference approaches. The BGLP operates with an iteration cycle of s, allowing for the comprehensive capture of system dynamics via a discretized parameter sampling rate of s.
We set up the same episodic training procedure for each experiment. This procedure contains nine training episodes and one testing episode. During training, the DGS were continuously adapted to the observed behavior and adaptive learning strategies of the agents.
The fitness function, which defines the objective of agent training, was specified by weighting factors of the upstream and downstream buffer , the production demand target , and energy consumption .
The parameters and system settings presented and used are methodologically and numerically directly comparable with the configurations of the reference approaches Vanilla SbPG [15] and the GB SbPG M1 and GB SbPG M2 approaches [19].
7.6. Scalability, Transferability and Limitations
The results demonstrate that agents formed autonomously interact both cooperatively and competitively, learning from one another. The Evo-SbPG agent architecture, developed for this study, enables parallel operation and linear scaling by proportionally expanding comparable modules. Although tested in a specific setting, the approach’s underlying mechanisms are transferable to other distributed production systems. We refer to [
17] for a more detailed discussion. A limitation of the approach is its inherent reliance on the availability of the full state vector, which might be restrictive in real-world applications. Nevertheless, incorporating corresponding state observations or soft sensors might mitigate this problem. We leave an extension to partly observable process to future research. Furthermore, the results presented are based on simulation models in lab environments which might suffer from the sim2real gap. Approaches to account for the sim2real gap will be part of future research as well. Thanks to its stable core logic, this approach can also be applied to similar decentralized processes including dynamic supply chain management, traffic control, coordination of autonomous vehicles and voltage and frequency stabilization in smart grids.
8. Conclusions
In this paper, we proposed a novel approach for distributed optimization using population-based learning and DGS. To this end, we presented Evo-SbPG, a new game theory-based framework that integrates EA strategies into SbPG. Further, we addressed the complexity of dynamic environments and extended the approach by the DGS, EA DGS, and EA DGS Trans strategies for dynamic state–action representations. The strategies were evaluated on multi-objective optimization problems in distributed production systems. The results demonstrate improved performance in state–action representation and can be adapted for various applications. By leveraging dynamic grid structures, the approach ensures high computational efficiency and scalability, allowing it to handle increasingly complex strategy spaces without the exponential cost typically associated with fine-grained discretization.
Future research will exploit the insights gained from DGS transitions to enhance state–action representation in more intricate coupled systems. Additionally, we will further explore and extend the foundations of knowledge transfer in MAS and its representation through graph theory, developing more comprehensive approaches in the process. Specifically, we aim to investigate knowledge transfer mechanisms in multi-objective Evo-games, such as knowledge distillation, as well as the adaptability of DGS in non-stationary environments.