A Hybrid Genetic Algorithm and Proximal Policy Optimization System for Efficient Multi-Agent Task Allocation

Zhu, Zimo; Yu, Chuanqiang; Wang, Junti

doi:10.3390/systems13060453

Open AccessArticle

A Hybrid Genetic Algorithm and Proximal Policy Optimization System for Efficient Multi-Agent Task Allocation

by

Zimo Zhu

,

Chuanqiang Yu

^*

and

Junti Wang

Department of Vehicle Engineering, Rocket Force University of Engineering, Xi’an 710025, China

^*

Author to whom correspondence should be addressed.

Systems 2025, 13(6), 453; https://doi.org/10.3390/systems13060453

Submission received: 18 February 2025 / Revised: 25 May 2025 / Accepted: 3 June 2025 / Published: 9 June 2025

(This article belongs to the Section Artificial Intelligence and Digital Systems Engineering)

Download

Browse Figures

Versions Notes

Abstract

Efficient task allocation remains a fundamental challenge in multi-agent systems, particularly under resource constraints and large-scale deployments. Classical methods, including market-based mechanisms, centralized optimization techniques, and game-theoretic strategies, have been widely applied to address the multi-agent task allocation problem. While effective in small-to-medium-sized settings, these approaches often encounter limitations in terms of scalability, adaptability to dynamic environments, and computational efficiency as the problem size increases. To address these limitations, this study introduces a proximal policy optimization system augmented with a genetic algorithm (GAPPO) that integrates evolutionary search with deep reinforcement learning. GAPPO enables agents to develop energy-efficient task allocation strategies by perceiving environmental states and optimizing their actions through iterative policy updates. The genetic component promotes broader policy exploration beyond local optima, while the proximal policy optimization ensures update stability and sample efficiency. To evaluate the proposed GAPPO algorithm, extensive simulations are conducted across four scenarios, with the largest involving 50 tasks and 500 agents. The results demonstrate that GAPPO achieves superior performance compared to baseline methods, particularly in reducing task completion time. These findings highlight the algorithm’s robustness and efficiency in handling large-scale and computationally intensive coordination tasks.

Keywords:

genetic algorithm; multi-agent systems; proximal policy optimization; reinforcement learning; task allocation

1. Introduction

Task allocation in multi-agent systems optimizes resource utilization and minimizes operational costs [1]. The challenge lies in balancing task requirements, agent capabilities, and environmental constraints, which can vary significantly [2]. Efficient approaches mitigate resource wastage by assigning tasks to appropriate agents [3]. As the number of agents and tasks increases, the computational complexity of coordination grows substantially—often at least quadratically—due to the combinatorial nature of agent–task interactions and inter-agent communication [4]. This trend underscores the necessity for adaptive and scalable solutions capable of maintaining performance in large-scale environments. Developing dynamic algorithms is therefore crucial for enhancing scalability and robustness across diverse applications [5]. Multi-agent systems have gained attention for their role in dynamic task allocation and rapid deployment. They have been applied in supply chain management [6], energy distribution [7], and autonomous vehicle coordination [8], among other fields [9].

Centralized optimization algorithms, such as mixed-integer programming and reinforcement learning algorithms, are common in agent-assisted networks, but they rely on global information from a central controller [10]. To address this, Jiang et al. developed distributed algorithms for large-scale semidefinite programming, enhancing efficiency with low-rank transformations [11]. Gao et al. combined potential games with MADDPG to optimize UAV trajectories, reducing delays and improving energy efficiency [12]. Hu et al. proposed MO-MIX, a CTDE-based framework for multi-objective reinforcement learning, achieving high-quality Pareto approximations with lower computational overhead [13]. However, centralized algorithms suffer from scalability issues and single points of failure in dynamic environments [14], highlighting the need for alternative solutions.

Distributed optimization decentralizes decision-making, allowing agents to allocate tasks based on local information with limited communication [15]. This improves scalability and adaptability and reduces failure risks. Ren et al. demonstrated its effectiveness in large-scale wireless sensor networks [16], while Zhang et al. applied it to multi-robot systems [17]. However, Liu et al. noted inefficiencies from incomplete information [18], and the lack of global coordination can lead to suboptimal allocation and slower convergence [19].

Reinforcement learning (RL) has emerged as a powerful approach for multi-agent task allocation, enabling agents to optimize task distribution through continuous interaction with dynamic environments [20]. Its adaptability enhances decision-making in complex scenarios, improving applicability to multi-agent systems [21]. Ke et al. combined combinatorial optimization with RL to improve ride-sourcing efficiency [22], while Gao et al. proposed an autonomous perception-based RL scheme for optimal consensus control [23]. Chang et al. leveraged RL for UAV trajectory and resource allocation, optimizing user association and power allocation [24]. Furthermore, Guo et al. integrated genetic algorithms with proximal policy optimization for differentiated traffic scheduling in TSN-5G industrial networks, achieving notable improvements in end-to-end delay and algorithm convergence [25]. The overview of optimization-based approaches in multi-agent systems can be seen in Table 1.

Motivated by the potential benefits of combining different algorithms, we propose a hybrid approach that integrates genetic algorithms (GAs) with proximal policy optimization (PPO) for multi-agent task allocation. GAs improve exploration through evolutionary operations, while PPO enables adaptive policy refinement. This combination enhances both adaptability and performance by uniting GAs’ exploration with PPO’s optimization.

The main contributions of this paper are as follows:

We propose a Markov game formulation for the multi-agent task allocation problem, in which agents independently refine their strategies through continuous interaction with the environment. By relying on local observations to guide each agent’s decision-making, the proposed approach effectively supports coordinated and efficient task distribution throughout the swarm, thereby addressing scalability and resource allocation challenges in multi-agent systems.
We propose the GAPPO algorithm, which integrates genetic algorithms with proximal policy optimization and incorporates an attention mechanism alongside an adaptive learning rate. This design enables agents to adjust to varying task requirements, enhancing learning stability and coordination efficiency.
Numerical results validate the efficiency and scalability of the proposed method, highlighting its faster convergence and better adaptability compared to the prevailing reinforcement learning algorithms.

The remainder of this paper is organized as follows. Section 2 describes the problem formulation. Section 3 presents the GAPPO algorithm and provides a detailed overview of its process. Section 4 discusses the performance of the algorithm in different environments. Section 5 presents a summary of the research and mentions future research directions.

2. Problem Formulation

In this paper, we address the multi-agent task allocation problem, where agents are assigned to distinct tasks and operate either independently or cooperatively to improve overall task efficiency and coverage. Each agent performs both task execution and allocation assessment, aiming to adapt autonomously to dynamic environments while optimizing resource utilization. To achieve this, agents iteratively estimate their local states, allocate tasks, and update their strategies, as illustrated in Figure 1.

State Estimation: At each time step t, the agent maintains a local state $S_{t i}$ , comprising its spatial coordinates, surrounding occupancy information, and relevant task identifiers. This state is continuously updated and evolves according to a transition function $f (S_{t i}, a_{i})$ . The Actor–Critic framework takes this state as input to support real-time motion planning, enabling effective navigation, obstacle circumvention, and target recognition.
Task Allocation: To prevent agent congestion and redundant task assignments, a distributed allocation scheme is employed. Agents evaluate candidate tasks based on factors such as distance, priority level, and current workload. Decisions are dynamically adjusted through localized communication, promoting balanced task execution and optimal use of agent capabilities.
Strategy: Agents refine their behavior through policy updates driven by local observations and rewards. In non-cooperative conditions, each agent adapts its policy independently. Under collaborative settings, shared policy gradients enhance collective learning. The GAPPO model incorporates attention mechanisms to facilitate both autonomous decision-making and inter-agent coordination.

2.1. System Model

Suppose a set of agents

N = {1, 2, \dots, n}

and a set of tasks

S = {1, 2, \dots, m}

are given. Each agent can participate in only one task at a time. Let

τ_{i} \in S

denote the task allocated to agent

i \in N

, and let

I_{j} \subseteq N

represent the set of agents assigned to task

j \in S

. Each agent is restricted to a single task at any given time, whereas a task may accommodate multiple agents simultaneously. Tasks are located at fixed positions

{pos}_{j} = (x_{j}, y_{j}, z_{j})

, and the position of agent i is assumed to coincide with the location of its allocated task, i.e.,

{pos}_{i} = {pos}_{τ_{i}}

. The Euclidean distance between agent i and task j is defined as

d_{i j} = ∥ {pos}_{i} - {pos}_{j} ∥ .

(1)

The above distance also constrains communication, limiting interactions to tasks and agents within range.

Each task j has a workload

h_{j} \geq 0

, while an agent’s work ability

ω_{i}

defines its execution capacity per unit time. The total work ability of the assigned agents must satisfy the following:

\sum_{i \in {I}_{j}} ω_{i} - h_{j} \geq 0 .

(2)

To ensure efficient allocation, the reward for task j is

R_{j} = - τ^{\sum_{i \in {I}_{j}} ω_{i} - h_{j}},

(3)

where

τ \in (0, 1)

controls the reward sensitivity to surplus work ability. A higher

τ

incentivizes faster completion, while a lower

τ

limits unnecessary agent participation.

The objective is to minimize the total completion time

T = max {T_{1}, T_{2}, \dots, T_{m}}

, where

T_{j} = \frac{h_{j}}{\sum_{i \in {I}_{j}} ω_{i}} .

(4)

Thus, the optimization problem is formulated as follows:

\begin{matrix} min_{{e_{i j}}} {T_{j}}, \\ s . t . \sum_{j = 1}^{m} e_{i j} \leq 1, \forall i \in N, \\ e_{i j} \in {0, 1}, \forall i \in N, \forall j \in S, \end{matrix}

(5)

where

e_{i j}

is a binary decision variable indicating whether agent i is assigned to task j.

2.2. State Action Model

In the agent swarm framework, the decision-making process of each agent is represented by the tuple

(s_{t}, a_{t}, r_{t}, s_{t + 1})

, where

s_{t}

denotes the agent’s state at time t, including its position

{pos}_{i} = (x_{i}, y_{i}, z_{i})

, task-related features, and the environmental map. The agent selects action

a_{t}

based on

s_{t}

by maximizing the action-value function

a_{t} = arg max_{j \in C (i)} Q (s_{t}, j),

(6)

where

Q (s_{t}, j)

estimates the cumulative reward of selecting task j. If no tasks are available, the agent selects an idle action to conserve resources.

The policy function

π (s_{t}, j)

maps

s_{t}

to a probability distribution over actions

π (s_{t}, j) = P (a_{t} = j ∣ s_{t}),

(7)

which is optimized using reinforcement learning. The Critic evaluates the expected reward starting from

s_{t}

using the value function

V (s_{t}) = E [\sum_{k = t}^{T} γ^{k - t} r_{k} ∣ s_{t}],

(8)

guiding the Actor to refine the policy.

The reward

r_{t}

reflects the agent’s performance, considering task importance

I_{j}

, completion effectiveness

E_{i}

, and collaborative efficiency

C_{i}

r_{t} = α \cdot I_{j} + β \cdot E_{i} + γ_{i} \cdot C_{i},

(9)

where

α, β, γ_{i}

are the respective weights for those components. The subsequent state

s_{t + 1}

reflects the updated environment after taking action

a_{t}

:

s_{t + 1} = f (s_{t}, a_{t}),

(10)

where

f (\cdot)

represents the environment transition function.

This decision-making process models the interaction between agents and the environment, optimizing actions to maximize cumulative rewards through the Actor–Critic framework and genetic algorithms.

In proximal policy optimization (PPO), the advantage function

A (s_{t}, a_{t})

measures the benefit of taking action

a_{t}

in state

s_{t}

and is computed using Generalized Advantage Estimation (GAE)

A (s_{t}, a_{t}) = {\hat{A}}_{t} = \sum_{i = t}^{T} {(γ λ)}^{i - t} δ_{i},

(11)

where

δ_{i}

is the temporal difference error, given by the following:

δ_{i} = r_{i} + γ V (s_{i + 1}; ω) - V (s_{i}; ω) .

(12)

The value function

V (s_{i}; ω)

is updated iteratively as follows:

V (s_{i}; ω) \leftarrow V (s_{i}; ω) + α_{1} δ_{i},

(13)

where

α_{1}

is the learning rate. This iterative update process allows agents to optimize their rewards, improving task execution and resource allocation.

3. Learning Algorithm

Various optimization algorithms, such as the leader–follower algorithm [26], virtual structure algorithm [27], and behavior-based control algorithm [28], have been proposed to address the multi-agent task allocation problem. Additionally, reinforcement learning has proven effective in this domain. This paper introduces a novel proximal policy optimization system augmented with a genetic algorithm (GAPPO) that combines the stability and sample efficiency of PPO with the global search capability of genetic algorithms to enhance decision-making in complex task environments. The pseudocode of GAPPO algorithm can be seen in Algorithm 1.

Algorithm 1 Proximal Policy Optimization Augmented With Genetic Algorithm.

1: Initialize: Population of Actor–Critic networks

P = {P_{1}, P_{2}, \dots, P_{N}}

, where each policy is encoded as a real-valued vector. Structural infeasibility, such as instability or divergence, is mitigated by bounding weight perturbations and incorporating penalty terms within the fitness evaluation to discourage infeasible policies;

2: Set hyperparameters: learning rate

l r

, discount factor

γ

, clip ratio

ϵ

, and KL divergence target

K L_{t a r g e t}

;

3: Initialize replay buffer with capacity

C = 10000

;

4: Set generation counter

g = 0

;

5: while training do

6: for each individual

P_{i} \in P

do

7: Initialize trajectory T and observe initial state

s_{0}

;

8: for each time step

t = 0

to

T_{m a x}

do

9: Select action

a_{t}

using policy

P_{i}

based on

s_{t}

;

10: Execute

a_{t}

, observe reward

r_{t}

and next state

s_{t + 1}

;

11: Append transition

(s_{t}, a_{t}, r_{t}, s_{t + 1})

to trajectory T;

12: if terminal state reached then

13: Break

14: end if

15: end for

16: Store trajectory T in replay buffer;

17: end for

18: if len(replay buffer) ≥ BATCH_SIZE then

19: Sample a batch of transitions from replay buffer;

20: for each transition in the batch do

21: Compute returns

R_{t}

using

γ

;

22: Compute advantages

{\hat{A}}_{t}

using Generalized Advantage Estimation (GAE);

23: Get action probabilities

π (a_{t} | s_{t})

and state values

V (s_{t})

;

24: Calculate surrogate loss

L (θ)

by (15);

25: Calculate value loss by (16);

26: Compute total loss

L = L (θ) + L_{V}

;

27: Update parameters using gradient descent by (17);

28: end for

29: end if

30: end while

31: Global Evolution via Genetic Algorithm

32: while g < max_generations do

33: for each individual

P_{i} \in P

do

34: Evaluate

P_{i}

to obtain fitness score

F_{i}

;

35: end for

36: Select individuals based on fitness;

37: Select top 10% individuals based on fitness and retain them as elites;

38: Generate offspring to fill the remaining 90% of the population via:

39: Selection (e.g., tournament or roulette),

40: Crossover (with rate 0.8), and

41: Mutation (with rate 0.05);

42: Form new population

P^{'} = Elites \cup Offspring

;

43: Update population

P \leftarrow P^{'}

and increment generation counter g;

44: end while

45: Periodically: Update target networks to stabilize training

To enhance replicability, all components of the genetic algorithm used in GAPPO are systematically detailed in Table 2. This includes the encoding scheme, selection and variation operators, fitness evaluation, constraint handling, and hyperparameter configurations.

While PPO is effective for local policy refinement, it is sensitive to initialization and may converge to suboptimal solutions in highly non-convex policy spaces. To mitigate this, GAPPO introduces a population-based genetic layer. Initially, policy networks are pretrained using proximal policy optimization (PPO), where agents interact with the environment and store state–action–reward tuples in the replay buffer. Once sufficient data are collected, Generalized Advantage Estimation (GAE) is applied to compute advantages

{\hat{A}}_{t}

, and policy parameters are updated via the surrogate objective

L (θ)

. This pretraining helps ensure a reasonable starting point for the evolutionary process. The resulting population serves as the initial generation for the genetic algorithm, which then refines policies by directly optimizing network parameters through selection, crossover, and mutation operations, allowing for the simultaneous evolution of multiple Actor–Critic networks. This dual approach helps prevent premature convergence to suboptimal solutions by promoting diversity and global search in the policy space.

The combination of PPO’s local optimization with GA’s global search enhances the overall adaptability and robustness of the system, especially in dynamic and complex environments.

The process begins by initializing a population P of Actor–Critic networks, each representing an optimized policy. Key hyperparameters are selected based on prior studies: the learning rate

l r

is set to

3 \times 10^{- 4}

for balanced convergence and stability; the discount factor

γ = 0.99

emphasizes long-term rewards; the clipping threshold

ϵ = 0.2

constrains policy updates to avoid abrupt changes; and the KL divergence target

K L_{target} = 0.01

prevents over-optimization and supports stable learning.

To further optimize these parameters, we employ automated hyperparameter tuning techniques, including grid search and random search. These tools explore a range of values, such as

l r \in [10^{- 5}, 10^{- 3}]

,

γ \in [0.9, 0.99]

, and

ϵ \in [0.1, 0.3]

. After conducting multiple experiments, the final optimal values were found to be

l r = 3 \times 10^{- 4}

,

γ = 0.98

, and

ϵ = 0.2

, which achieved the best performance in terms of task completion time and energy efficiency.

A replay buffer with a capacity

C = 10, 000

is used to store state–action–reward sequences for training, ensuring a sufficient amount of experience for the agents to learn effectively. A generation counter g tracks the evolution of the population, allowing for continuous improvement of the policies across generations. This process is repeated for a total of 500 generations to ensure that the policies converge to an optimal solution.

To enhance adaptability and prevent convergence to suboptimal solutions, a genetic algorithm is employed following PPO updates. Prior research has demonstrated the effectiveness of this hybrid paradigm in complex optimization tasks, as discussed in literature [25]. Although the referenced work targets TSN-5G traffic scheduling, the underlying challenge of navigating high-dimensional policy spaces is shared with multi-agent task allocation, justifying the applicability of the GA-PPO framework in this study.

In our implementation, each individual in the population encodes the weights of an Actor–Critic network as a real-valued vector. This direct encoding avoids structural infeasibility. During the evolutionary process, policy infeasibility (e.g., instability or divergence) is mitigated by bounding weight perturbations and applying soft penalties to the fitness function for high-variance or low-reward behaviors. Moreover, mutation operations are clipped within a fixed range to ensure parameter validity throughout generations. Each policy’s fitness is evaluated by

F_{i} = λ \cdot \frac{R_{i}}{R_{\max}} + μ \cdot \frac{H_{i}}{H_{\max}} - ν \cdot \frac{E_{i}}{E_{\max}},

(14)

where

R_{i}

is the cumulative reward,

H_{i}

is the policy entropy, and

E_{i}

is the computational cost. The parameters

λ

,

μ

, and

ν

balance those components.

The coefficients

λ

,

μ

, and

ν

are introduced to balance the trade-offs among reward maximization, policy exploration, and energy efficiency. Specifically,

λ

emphasizes the importance of task performance, guiding the selection toward high-reward policies.

μ

encourages exploration by favoring policies with higher entropy, thereby reducing the risk of premature convergence.

ν

penalizes computational cost to promote energy-aware behavior, which is critical in resource-constrained environments. Those weights are empirically tuned to ensure that the evolved policies achieve a desirable balance between task effectiveness and adaptability. PPO’s stability is ensured using the clipped objective

L (θ) = min (\frac{π_{θ} (a_{t} | s_{t})}{π_{θ_{old}} (a_{t} | s_{t})} {\hat{A}}_{t}, clip (\frac{π_{θ} (a_{t} | s_{t})}{π_{θ_{old}} (a_{t} | s_{t})}, 1 - ϵ, 1 + ϵ) {\hat{A}}_{t}),

(15)

where

ϵ

controls the clipping range, limiting policy updates for stability.

The total loss combines the surrogate and value loss

L_{V} = \frac{1}{2} {(R_{t} - V (s_{t}))}^{2},

(16)

and the total loss L is optimized using gradient descent.

After computing the total loss

L = L (θ) + L_{V}

, where

L (θ)

is the clipped surrogate objective and

L_{V}

is the value function loss, the policy and value network parameters are updated using gradient descent

θ \leftarrow θ - η \nabla_{θ} L,

(17)

where

η

is the learning rate. This update rule ensures that the parameters are optimized to improve policy performance while maintaining training stability.

Based on the computed fitness, a tournament selection strategy is used to identify parent policies. Crossover and mutation are then applied to generate a new population

P^{'}

, enabling broader exploration of the policy space beyond the local region optimized by PPO. This evolutionary step improves the diversity of candidate solutions and enhances the algorithm’s robustness to initial conditions and non-stationary environments.

Compared with standard PPO, GAPPO introduces a population-based optimization step that improves the policy search process in two aspects: (i) it enables the discovery of more globally optimal solutions through exploration beyond the gradient-based neighborhood, and (ii) it incorporates energy awareness by explicitly penalizing high energy consumption in the fitness evaluation.

Simulation results demonstrate that GAPPO achieves (i) reduced task completion time, resulting from more effective task distribution strategies; (ii) lower energy consumption, as evolved policies tend to favor energy-efficient behaviors; and (iii) enhanced robustness to initialization, attributed to the diversity maintained by the genetic population.

The optimization process iterates until the maximum number of generations is reached. The final policy is selected based on the highest average reward over multiple evaluation episodes.

4. Experimental Results

This section evaluates the model and the proposed GAPPO algorithm through numerical simulations across various configurations. Four experiments were conducted to assess GAPPO’s performance in task allocation with different agent–task scenarios. The experiments were conducted on a system equipped with an NVIDIA RTX 4090 GPU, using Python 3.11.4 with key libraries including PyTorch 2.1, NumPy. GAPPO integrates genetic algorithms with proximal policy optimization, enhanced by an attention mechanism and adaptive learning rate.

The genetic algorithm employs a population size of 100, a crossover rate of 0.8, and a mutation rate of 0.05, while PPO is trained with a learning rate of 0.0003. To balance the contributions of task importance, completion effectiveness, and collaborative efficiency in the reward function, we empirically set the weights as

α = 0.4

,

β = 0.3

, and

γ_{i} = 0.3

. This configuration reflects the design choice to prioritize timely and appropriate task execution (

I_{j}

) while also considering the agent’s execution quality (

E_{i}

) and the overall coordination level among agents (

C_{i}

). For the genetic evaluation function, we set

λ = 0.6

,

μ = 0.3

, and

ν = 0.1

to prioritize reward maximization while ensuring adequate exploration and minimizing computational cost. To manage infeasible allocations—such as agents exceeding task capacity constraints or violating spatial range limits—we incorporate a penalty term into the reward function. Specifically, if an agent selects an invalid task, a negative reward is applied. The penalty value was initially set to

- 5

based on empirical evaluation, considering that the immediate rewards for valid actions typically fall within the range of [0, 10]. While this fixed penalty has proven effective in discouraging invalid actions without significantly impeding exploration, its static nature may limit adaptability across varying scenarios. A potential direction for future enhancement is to adopt a dynamic penalty strategy, where the penalty magnitude is scaled relative to the current reward distribution. The modified reward function is expressed as

r_{t} = α \cdot I_{j} + β \cdot E_{i} + γ \cdot C_{i} + θ \cdot I_{invalid},

(18)

where

I_{invalid} = - 1

if the allocation is invalid and 0 otherwise, and

θ = 5

.

The GAPPO algorithm is compared with the following representative algorithms.

Advantage Actor–Critic (A2C) [29]: A2C integrates policy and value functions, with the advantage function $A (s_{t}, a_{t})$ defined as

$A (s_{t}, a_{t}) = Q (s_{t}, a_{t}) - V (s_{t}),$

(19)

where $Q (s_{t}, a_{t})$ is the expected return and $V (s_{t})$ is the baseline value, enhancing learning stability by reducing gradient variance.
Proximal Policy Optimization (PPO) [30]: PPO balances exploration and exploitation by constraining policy updates, which enhances stability and sample efficiency.
Deep Q-Network (DQN) [31]: The DQN approximates the optimal action-value function $Q (s, a)$ using deep neural networks, stabilizing learning through experience replay and a target network. The update rule is as follows:

$\begin{matrix} δ = r_{t} + γ max_{a^{'}} Q (s_{t + 1}, a^{'}) - Q (s_{t}, a_{t}), \\ Q (s_{t}, a_{t}) \leftarrow Q (s_{t}, a_{t}) + α δ . \end{matrix}$

(20)
Deep Deterministic Policy Gradient (DDPG) [32]: The DDPG handles continuous action spaces with an Actor–Critic framework. The policy gradient update is as follows:

$\nabla_{θ} J \approx E_{s_{t} \sim ρ^{π}} [\nabla_{a} Q (s_{t}, a | θ^{Q}) \nabla_{θ} μ (s_{t} | θ^{μ})] .$

(21)

To evaluate the performance of GAPPO, we examine four distinct scenarios: (i) 100 agents are allocated to 20 tasks, (ii) 300 agents are allocated to 30 tasks, (iii) 500 agents are allocated to 30 tasks, and (iv) 500 agents are allocated to 50 tasks.

For each scenario, results are obtained by averaging over more than 100 independent runs, each initialized with random states and actions. This approach simulates the inherent randomness in multi-agent task allocation scenarios, ensuring a robust evaluation of the algorithm’s adaptability and effectiveness under dynamic and unpredictable conditions. By averaging the results across multiple runs, we obtain performance metrics that reflect the consistency and generalization of the algorithm in real-world dynamic environments. In each simulation trial, agents are randomly allocated at the start. The maximum number of iterations is set to 500 for the first two scenarios and 800 for the last two. In each scenario, the agents’ work abilities and the tasks’ workloads are initialized with distinct values. The optimal completion time for each scenario is determined through exhaustive search, and performance is assessed using key indicators: task completion time (T), the proportion of optimal time (

P_{0}

), and the number of iterations needed to reach the minimum value (min). The difference ratio

δ

is expressed as

δ = \frac{T - T_{opt}}{T_{opt}},

(22)

and represents the deviation from the optimal time. The time unit

Δ t

is defined as 1 unit time, which corresponds to a fixed and consistent simulation step used across all scenarios. For simplicity, we do not map this unit to a real-world measure such as hours or seconds, as the focus is on relative performance across algorithms. Importantly, this abstraction ensures that the theoretical optimal value—representing the minimum task completion time under ideal allocation conditions—remains constant at 1 unit time in all experiments.

To assess the performance of the proposed approach, we conduct comparisons between GAPPO and the four aforementioned baseline algorithms under four progressively challenging scenarios. These scenarios are designed with balanced agent capacities and task demands, ensuring a consistent theoretical minimum completion time of one time unit. To facilitate a clear performance comparison, a red dotted line is included in the figures to represent this optimal solution. The optimal value is analytically derived based on the theoretical minimum task completion time under ideal allocation conditions—assuming perfect coordination among agents and optimal matching between task requirements and agent capabilities, without communication delay or resource contention. This value serves as a lower bound benchmark, allowing us to assess how closely the proposed algorithm approaches the theoretical optimum. The results demonstrate GAPPO’s superior performance, rapid convergence, and stability in large-scale task allocation.

In the first scenario (Table 3, Figure 2), GAPPO quickly minimizes the task completion time, stabilizing at a low value. In contrast, other algorithms exhibit slower convergence and greater fluctuations. Notably, PPO and DQN experience prolonged exploration phases, resulting in higher iteration counts and suboptimal time proportions.

In the second scenario (Table 4, Figure 3), where the complexity increases with a larger agent population, GAPPO achieves near-optimal performance with a minimal performance gap (

δ = 1.4 %

) and significantly fewer iterations compared to the alternatives. Other algorithms, such as A2C and PPO, demonstrate delayed convergence, while DDPG suffers from increased task completion time.

In the third scenario (Table 5, Figure 4), GAPPO maintains strong scalability, delivering competitive task completion times with minimal deviation from the optimal solution. On the other hand, other methods exhibit significant performance degradation, with some reaching over 50% in performance gap, indicating challenges in adapting to denser agent environments.

In the fourth scenario (Table 6, Figure 5), where the task-to-agent ratio increases, GAPPO continues to excel, demonstrating efficient task allocation with the fewest iterations. Meanwhile, DDPG and DQN display instability and fail to achieve a noticeable reduction in completion time.

In conclusion, GAPPO consistently outperforms competing algorithms in terms of scalability, convergence speed, and solution quality across all tested scenarios. It demonstrates minimal performance degradation even as the problem size grows, underscoring its suitability for real-world multi-agent systems requiring fast, stable, and energy-efficient task allocation in dynamic environments.

5. Conclusions and the Future Work

In conclusion, this paper introduces the GAPPO algorithm, which combines genetic algorithms with proximal policy optimization (PPO) for effective task allocation in multi-agent systems. The proposed method demonstrates superior performance in task completion time and energy consumption, particularly when the number of agents and tasks increases. Experimental results show that GAPPO outperforms traditional algorithms like PPO and DDPG, making it a promising approach for scalable and efficient task allocation in dynamic environments.

Future work will focus on further optimizing GAPPO’s performance, particularly in large-scale systems and real-world applications, to improve its adaptability and scalability. Additionally, integrating advanced optimization techniques could enhance its robustness in complex, dynamic environments.

Author Contributions

Conceptualization, Z.Z. and C.Y.; methodology, Z.Z.; software, J.W.; validation, Z.Z., C.Y. and J.W.; formal analysis, C.Y.; investigation, Z.Z.; resources, C.Y. and J.W; data curation, Z.Z.; writing—original draft preparation, Z.Z.; writing—review and editing, C.Y. and J.W.; visualization, J.W.; supervision, C.Y.; project administration, C.Y.; funding acquisition, C.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Zhang, Q.; Li, N.; Duan, J.; Qin, J.; Zhou, Y. Resource scheduling optimisation study considering both supply and demand sides of services under cloud manufacturing. Systems 2024, 12, 133. [Google Scholar] [CrossRef]
Hou, Y.; Ma, Z.; Pan, Z. Online multi-agent task assignment and path finding with kinematic constraint in the federated internet of things. IEEE Trans. Consum. Electron. 2024, 70, 2586–2595. [Google Scholar] [CrossRef]
Chen, Y.; Wang, H. Intelligentcrowd: Mobile crowdsensing via multi-agent reinforcement learning. IEEE Trans. Emerg. Top. Comput. Intell. 2021, 5, 840–845. [Google Scholar] [CrossRef]
Liu, W.; Liu, S.; Cao, J.; Wang, Q.; Lang, X.; Liu, Y. Learning communication for cooperation in dynamic agent-number environment. IEEE/ASME Trans. Mechatron. 2021, 26, 1846–1857. [Google Scholar] [CrossRef]
Xiao, T.; Chen, C.; Dong, M.; Ota, K.; Liu, L.; Dustdar, S. Multi-agent reinforcement learning-based trading decision-making in platooning-assisted vehicular networks. IEEE/ACM Trans. Netw. 2024, 32, 2143–2158. [Google Scholar] [CrossRef]
Dharmapriya, S.; Kiridena, S.; Shukla, N. Multiagent optimization approach to supply network configuration problems with varied product-market profiles. IEEE Trans. Eng. Manag. 2022, 69, 2707–2722. [Google Scholar] [CrossRef]
Gao, G.; Wen, Y.; Tao, D. Distributed energy trading and scheduling among microgrids via multiagent reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 10638–10652. [Google Scholar] [CrossRef]
Ramirez, C.A.; Agrawal, P.; Thompson, A.E. An approach integrating model-based systems engineering, iot, and digital twin for the design of electric unmanned autonomous vehicles. Systems 2025, 13, 73. [Google Scholar] [CrossRef]
Rehman, A.U.; Usmani, Y.S.; Mian, S.H.; Abidi, M.H.; Alkhalefah, H. Simulation and goal programming approach to improve public hospital emergency department resource allocation. Systems 2023, 11, 467. [Google Scholar] [CrossRef]
Gao, Y.; Yang, S.; Li, F.; Trajanovski, S.; Zhou, P.; Hui, P.; Fu, X. Video content placement at the network edge: Centralized and distributed algorithms. IEEE Trans. Mob. Comput. 2023, 22, 6843–6859. [Google Scholar] [CrossRef]
Jiang, X.; Zeng, X.; Sun, J.; Chen, J. Distributed synchronous and asynchronous algorithms for semidefinite programming with diagonal constraints. IEEE Trans. Autom. Control 2023, 68, 1007–1022. [Google Scholar] [CrossRef]
Gao, A.; Wang, Q.; Liang, W.; Ding, Z. Game combined multi-agent reinforcement learning approach for uav assisted offloading. IEEE Trans. Veh. Technol. 2021, 70, 12888–12901. [Google Scholar] [CrossRef]
Hu, T.; Luo, B.; Yang, C.; Huang, T. Mo-mix: Multi-objective multi-agent cooperative decision-making with deep reinforcement learning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12098–12112. [Google Scholar] [CrossRef] [PubMed]
Jing, G.; Bai, H.; George, J.; Chakrabortty, A.; Sharma, P.K. Distributed multiagent reinforcement learning based on graph-induced local value functions. IEEE Trans. Autom. Control 2024, 69, 6636–6651. [Google Scholar] [CrossRef]
Tian, L.; Ji, X.; Zhou, Y. Maximizing information dissemination in social network via a fast local search. Systems 2025, 13, 59. [Google Scholar] [CrossRef]
Ren, Y.; Wang, Q.; Duan, Z. Optimal distributed leader-following consensus of linear multi-agent systems: A dynamic average consensus-based approach. IEEE Trans. Circuits Syst. II Express Briefs 2022, 69, 1208–1212. [Google Scholar] [CrossRef]
Luo, Q.; Liu, S.; Wang, L.; Tian, E. Privacy-preserved distributed optimization for multi-agent systems with antagonistic interactions. IEEE Trans. Circuits Syst. I Regul. Pap. 2023, 70, 1350–1360. [Google Scholar] [CrossRef]
Zhang, M.; Pan, C. Hierarchical optimization scheduling algorithm for logistics transport vehicles based on multi-agent reinforcement learning. IEEE Trans. Intell. Transp. Syst. 2024, 25, 3108–3117. [Google Scholar] [CrossRef]
Zhou, J.; Lv, Y.; Wen, C.; Wen, G. Solving specified-time distributed optimization problem via sampled-data-based algorithm. IEEE Trans. Netw. Sci. Eng. 2022, 9, 2747–2758. [Google Scholar] [CrossRef]
Khdoudi, A.; Masrour, T.; Hassani, I.E.; Mazgualdi, C.E. A deep-reinforcement-learning-based digital twin for manufacturing process optimization. Systems 2024, 12, 38. [Google Scholar] [CrossRef]
Zhu, Q.; Wang, S.-M.; Ni, Y.-Q. Cooperative control of maglev levitation system via hamilton–jacobi–bellman multi-agent deep reinforcement learning. IEEE Trans. Veh. Technol. 2024, 73, 12747–12759. [Google Scholar] [CrossRef]
Ke, J.; Xiao, F.; Yang, H.; Ye, J. Learning to delay in ride-sourcing systems: A multi-agent deep reinforcement learning framework. IEEE Trans. Knowl. Data Eng. 2022, 34, 2280–2292. [Google Scholar] [CrossRef]
Gao, S.; Xu, C.; Dong, H. Deterministic reinforcement learning consensus control of nonlinear multi-agent systems via autonomous convergence perception. IEEE Trans. Circuits Syst. II Express Briefs 2024, 71, 2229–2233. [Google Scholar] [CrossRef]
Chang, Z.; Deng, H.; You, L.; Min, G.; Garg, S.; Kaddoum, G. Trajectory design and resource allocation for multi-uav networks: Deep reinforcement learning approaches. IEEE Trans. Netw. Sci. Eng. 2023, 10, 2940–2951. [Google Scholar] [CrossRef]
Guo, J.; Yao, H.; He, W.; Mai, T.; Ouyang, T.; Wang, F. Reinforcement learning-based genetic algorithm for differentiated traffic scheduling in industrial TSN-5G networks. In Proceedings of the 2024 International Wireless Communications and Mobile Computing (IWCMC), Ayia Napa, Cyprus, 27–31 May 2024; pp. 1283–1289. [Google Scholar]
Ren, W.; Beard, R. Consensus seeking in multiagent systems under dynamically changing interaction topologies. IEEE Trans. Autom. Control 2005, 50, 655–661. [Google Scholar] [CrossRef]
Low, C.B. A dynamic virtual structure formation control for fixed-wing uavs. In Proceedings of the 2011 9th IEEE International Conference on Control and Automation (ICCA), Santiago, Chile, 19–21 December 2011; pp. 627–632. [Google Scholar]
Balch, T.; Arkin, R. Behavior-based formation control for multirobot teams. IEEE Trans. Robot. Autom. 1998, 14, 926–939. [Google Scholar] [CrossRef]
Konda, V.; Tsitsiklis, J. Actor-critic algorithms. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 29 November–4 December 1999; Volume 12. [Google Scholar]
Zhang, H.; Jiang, M.; Liu, X.; Wen, X.; Wang, N.; Long, K. Ppo-based pdacb traffic control scheme for massive iov communications. IEEE Trans. Intell. Transp. Syst. 2023, 24, 1116–1125. [Google Scholar] [CrossRef]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
Zheng, K.; Jia, X.; Chi, K.; Liu, X. Ddpg-based joint time and energy management in ambient backscatter-assisted hybrid underlay crns. IEEE Trans. Commun. 2023, 71, 441–456. [Google Scholar] [CrossRef]

Figure 1. Adaptive task-oriented GAPPO for agent swarm algorithm.

Figure 2. Convergence performance of five algorithms in the environment with 100 agents and 20 tasks.

Figure 3. Convergence performance of five algorithms in the environment with 300 agents and 30 tasks.

Figure 4. Convergence performance of five algorithms in the environment with 500 agents and 30 tasks.

Figure 5. Convergence performance of five algorithms in the environment with 500 agents and 50 tasks.

Table 1. Overview of optimization-based approaches in multi-agent systems.

Technique	References	Year	Data Used
Distributed Optimization	[16]	2022	Wireless sensor networks
Mixed-Integer Programming	[11]	2023	Semidefinite programming
UAV Trajectory + Resource Allocation	[24]	2024	UAV data
MADDPG	[12]	2021	UAV trajectories
Combinatorial Optimization + RL	[22]	2022	Ride-sourcing data
MO-MIX (CTDE)	[13]	2023	Multi-objective optimization data
Autonomous Perception-based RL	[23]	2024	Autonomous perception data
GA + PPO for Traffic Scheduling	[25]	2024	TSN-5G traffic data

Table 2. Key components of the genetic algorithm framework in GAPPO.

Component	Design Choice and Rationale
Encoding Scheme	Real-valued direct encoding of Actor–Critic network weights to ensure structural feasibility and smooth parameter space exploration.
Population Initialization	The population size N is scaled proportionally with the number of agents and tasks, ranging from 100 to 300 individuals. Each individual represents a complete Actor–Critic policy, encoded as a real-valued parameter vector.
Selection Strategy	Fitness-proportionate selection with soft penalty terms integrated to discourage unstable or divergent policies.
Crossover Mechanism	Uniform crossover applied with a rate of 0.8, enabling information exchange between high-performing individuals.
Mutation Strategy	Gaussian mutation with a probability of 0.05 per gene. Perturbations are clipped to ensure parameter validity and prevent instability.
Fitness Evaluation Function	$F_{i} = λ \cdot \frac{R_{i}}{R_{\max}} + μ \cdot \frac{H_{i}}{H_{\max}} - ν \cdot \frac{E_{i}}{E_{\max}}$ , where $R_{i}$ : reward, $H_{i}$ : entropy, and $E_{i}$ : computational cost.
Fitness Weights	$λ = 0.6$ (performance), $μ = 0.3$ (exploration), and $ν = 0.1$ (efficiency). Tuned empirically for optimal balance.
Constraint Handling	Penalty of $- 5$ for infeasible allocations (capacity, range). Mutation range constraints enforce feasibility.
Generational Control	The population evolves for 500 generations. A generation counter tracks convergence trends.
Termination Criteria	Maximum of 500 generations or earlier if population fitness variance falls below a threshold, $ϵ = 10^{- 3}$ .
Reinsertion Strategy	Elitism is applied: the top 10% of individuals (based on fitness) are directly carried over to the next generation. The remaining 90% of the population is replaced by offspring generated via selection, crossover (rate 0.8), and mutation (rate 0.05).
Actor–Critic Network Structure	Actor: Two-layer MLP (256–128 units, ReLU), output via softmax. Critic: Two-layer MLP (256–128 units, ReLU), scalar output. Parameters are flattened for GA operations.

Table 3. Allocation outcomes for the scenario (

n = 100

,

m = 20