3.1. GAT Layer Attention Mechanism
The GAT (Graph Attention Network) is a neural network built upon graph structures; its algorithmic flow is outlined in Algorithm 2. By learning association weights between nodes adaptively, GAT captures the relationships between each node and its neighbors. It applies a linear transformation to node features, normalizes attention coefficients with the Softmax function, and performs a weighted aggregation of neighbor information. In addition, GAT strengthens the representation of key feature information through multi-head attention concatenation.
Algorithm 2: GAT Attention Mechanism Workflow (pseudocode not reproduced; its steps correspond to the per-head linear transformation of node features, the computation and Softmax normalization of attention coefficients, the weighted neighbor aggregation with an ELU activation, the multi-head concatenation, and the loss-based parameter update formalized below).
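Since the pseudocode of Algorithm 2 is not reproduced, the following minimal NumPy sketch illustrates the workflow it describes for a single attention head; the function and variable names are illustrative, and the LeakyReLU/ELU choices follow the standard GAT formulation rather than the authors' exact implementation.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def gat_head(H, adj, W, a):
    """Single GAT attention head (illustrative sketch).
    H:   (N, F)  initial node features
    adj: (N, N)  adjacency (1 where j is a neighbor of i; self-loops included)
    W:   (Fp, F) weight matrix of this head
    a:   (2*Fp,) attention coefficient vector of this head
    Returns the (N, Fp) aggregated node features."""
    Hp = H @ W.T                                  # linear transformation W h_i
    Fp = Hp.shape[1]
    src = Hp @ a[:Fp]                             # contribution of node i
    dst = Hp @ a[Fp:]                             # contribution of neighbor j
    e = leaky_relu(src[:, None] + dst[None, :])   # e_ij = LeakyReLU(a^T [W h_i || W h_j])
    e = np.where(adj > 0, e, -np.inf)             # restrict attention to the neighbor set
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)   # Softmax normalization
    return elu(alpha @ Hp)                        # weighted aggregation + ELU

# Multi-head concatenation: Z has dimension (N, K * Fp)
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))
adj = np.eye(5) + np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
heads = [gat_head(H, adj, rng.normal(size=(4, 8)), rng.normal(size=(8,))) for _ in range(3)]
Z = np.concatenate(heads, axis=1)
```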
The weight matrix $W^k$, the attention coefficient vector $a^k$, and the parameter vector of each attention head [28] are initialized as follows.
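The initialization expression itself is not reproduced in the source; a form consistent with the symbol definitions that follow (the $2F'$ dimension of $a^k$ is taken from the standard GAT formulation and is an assumption here) is:

$$W^k \in \mathbb{R}^{F' \times F}, \qquad a^k \in \mathbb{R}^{2F'}, \qquad k = 1, 2, \ldots, K.$$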
In the formula, $\mathbb{R}$ represents the real number field, $F'$ denotes the transformed feature dimension of the node, $F$ indicates the initial feature dimension of the node, and $F' \times F$ represents the dimension of the weight matrix $W^k$. $K$ represents the number of attention heads. For each attention head $k$, the initial feature $h_i$ is linearly transformed using the weight matrix $W^k$.
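A matrix form of this per-head linear transformation, reconstructed from the symbol descriptions in the next sentence (the names $H$, $H'^{(k)}$, and $B$ are assumed), is:

$$H'^{(k)} = H \, (W^k)^{\top}, \qquad H \in \mathbb{R}^{B \times F}, \quad H'^{(k)} \in \mathbb{R}^{B \times F'}.$$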
In the formula, $H$ represents the initial node feature matrix, $B$ denotes the number of nodes in the batch, and $H'^{(k)}$ indicates the node feature matrix after transformation. To capture the “UAV-obstacle” avoidance association [29] and obtain a better trajectory, the attention vector $a^k$ is combined with a dynamic calculation, and the feature correlation coefficient $e_{ij}^{k}$ [30,31] of the concatenated transformed features of neighboring nodes is computed as follows:
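Following the standard GAT attention formulation, which matches the symbol descriptions given next (the LeakyReLU choice is an assumption, since the text only names "the activation function"), the coefficient takes the form:

$$e_{ij}^{k} = \mathrm{LeakyReLU}\!\left( (a^k)^{\top} \left[\, W^k h_i \,\|\, W^k h_j \,\right] \right).$$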
In the formula, $a^k$ represents the attention vector of the $k$-th attention head, and $\mathrm{LeakyReLU}$ denotes the activation function. $W^k h_i$ and $W^k h_j$ represent the transformed features of node $i$ and of its neighbor $j$ under the $k$-th attention head, and $\|$ denotes the feature concatenation operation. To address the issue that “different neighbors have different effects on the current node,” the attention coefficients of each node $i$'s neighbors are normalized using the following Softmax function [32]:
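Reconstructed from the symbol descriptions in the next sentence, the Softmax normalization reads:

$$\alpha_{ij}^{k} = \mathrm{Softmax}_{j}\!\left(e_{ij}^{k}\right) = \frac{\exp\!\left(e_{ij}^{k}\right)}{\sum_{l \in \mathcal{N}_i} \exp\!\left(e_{il}^{k}\right)}.$$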
In the formula, $\mathcal{N}_i$ denotes the set of neighbors of node $i$, and $\alpha_{ij}^{k}$ represents the normalized attention weight that node $i$ assigns to its neighbor $j$ under the $k$-th attention head. By incorporating UAV characteristics such as the “nearest obstacle distance” feature and then aggregating the flight path features of the UAVs, adaptive aggregation of neighboring features is achieved as follows [33].
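A standard GAT aggregation step consistent with the symbol descriptions below is:

$$h_i^{k} = \mathrm{ELU}\!\left( \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} \, W^k h_j \right).$$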
In the formula, $\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^k h_j$ denotes the weighted summation through which node $i$ aggregates its neighbors' features under the $k$-th attention head; $h_i^{k}$ represents the output feature of node $i$ under the $k$-th attention head, and ELU refers to the exponential linear unit, which is primarily used to introduce nonlinearity into the feature representation.
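The exponential linear unit has the standard definition, which matches the constant $\alpha$ mentioned next:

$$\mathrm{ELU}(x) = \begin{cases} x, & x > 0, \\ \alpha\left(e^{x} - 1\right), & x \le 0. \end{cases}$$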
In the formula, $\alpha$ represents a constant. The different attention heads capture complementary patterns among nodes, and multi-head concatenation is used to output the node feature matrix $Z$:
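Using the symbols defined above and below, the multi-head concatenation can be written as:

$$Z = \mathrm{Concat}\!\left(h^{1}, h^{2}, \ldots, h^{K}\right) \in \mathbb{R}^{B \times K F'}.$$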
In the formula, $h^{k}$ denotes the output feature matrix of the $k$-th attention head, Concat refers to the feature concatenation operation, and $K \cdot F'$ represents the feature dimension after concatenation. Finally, the prediction error of the model is quantified and regularized to prevent overfitting, enabling GAT to progressively learn more feature information about dangerous obstacles:
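A loss and update rule consistent with the symbol descriptions in the next sentence (the exact form of the prediction loss, e.g., cross-entropy or mean squared error, is not stated in the source) is:

$$\mathcal{L}_{\mathrm{pred}} = \mathrm{Loss}\!\left(Z, \mathrm{target}\right), \qquad \mathcal{L} = \mathcal{L}_{\mathrm{pred}} + \lambda \lVert \theta \rVert^{2}, \qquad \theta \leftarrow \theta - \eta \nabla_{\theta} \mathcal{L}.$$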
In the formula, $\mathcal{L}_{\mathrm{pred}}$ denotes the prediction loss on the given batch; $Z$ signifies the node features output by the model, and target indicates the true labels of the nodes. $\lambda$ represents the regularization coefficient; $\theta$ denotes the set of model parameters, $\lVert \theta \rVert^{2}$ refers to the regularization term, and $\mathcal{L}$ represents the total loss. $\eta$ stands for the learning rate, and $\nabla_{\theta}\mathcal{L}$ represents the gradient of the loss function with respect to the parameters $\theta$, which characterizes the current trend of the loss with respect to the parameters.
3.2. Group Relative Policy Optimization
The GRPO (Group Relative Policy Optimization) algorithm is illustrated in Figure 2. Its design eliminates the Critic model, thereby avoiding the large-scale training cost inherent in traditional actor-critic reinforcement learning methods. The core idea of GRPO is to derive an advantage baseline by comparing multiple sampled outputs against one another, enabling policy optimization without a value network.
The GRPO algorithm process is outlined in Algorithm 3, starting from the initial policy $\pi_{\theta}$. In the nested loop, a batch $b$ is first sampled from the dataset $D$, and $G$ outputs are collected for this batch using the old policy $\pi_{\theta_{\mathrm{old}}}$ to compute their rewards. By constructing the objective function $\mathcal{J}_{\mathrm{GRPO}}(\theta)$ and combining it with the advantage function $\hat{A}_i$, the policy parameters $\theta$ are updated along the gradient (a sketch of this loop follows Algorithm 3).
Algorithm 3: GRPO Algorithm Workflow (pseudocode not reproduced; the loop samples a batch, generates a group of $G$ outputs with the old policy, computes group-normalized advantages, and updates the policy parameters with the clipped, KL-regularized objective described below).
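Since the pseudocode is not reproduced, the following minimal sketch illustrates the group-relative update described above; the clipping and KL terms are omitted for brevity, the policy/reward callables are toy stand-ins, and all hyperparameter names are assumptions rather than the authors' implementation.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantage: standardize rewards within one sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_step(theta, sample_outputs, reward_fn, grad_log_prob, G=8, lr=1e-2):
    """One simplified GRPO update (no clipping/KL), for illustration only.
    sample_outputs(theta, G) -> list of G outputs drawn with the old policy;
    reward_fn(o) -> scalar reward; grad_log_prob(theta, o) -> d log pi(o)/d theta."""
    outputs = sample_outputs(theta, G)                       # group of G candidate outputs
    adv = grpo_advantages([reward_fn(o) for o in outputs])   # intra-group advantages
    grad = sum(a * grad_log_prob(theta, o) for a, o in zip(adv, outputs)) / G
    return theta + lr * grad                                 # gradient ascent on the surrogate
```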
The GRPO objective function primarily comprises three components: the sampling ratio of the old and new policies, the policy clipping objective, and the KL divergence regularization term. In contrast to the classic PPO algorithm, GRPO does not require a value network and instead employs group sampling to achieve efficient advantage estimation. Continuous updating of the policy is realized through the reward-penalty mechanism embedded in the objective function [34], as follows:
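The formula referenced here is not reproduced in the source; a form consistent with the component descriptions that follow (and with the standard GRPO objective; the ratio symbol $\rho_i(\theta)$ is an assumed name) is:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \Big( \min\!\big( \rho_i(\theta)\,\hat{A}_i,\ \mathrm{clip}\big(\rho_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_i \big) - \beta\, D_{\mathrm{KL}}\!\big(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\big) \Big) \right].$$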
In the above formula, $\mathcal{J}_{\mathrm{GRPO}}(\theta)$ denotes the objective function value of the GRPO model; $\mathbb{E}$ represents the mathematical expectation over the sampled problems and the output groups generated by the old policy. $q \sim P(Q)$ signifies sampling an input according to the task distribution, and $\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)$ indicates generating $G$ candidate outputs from the old policy $\pi_{\theta_{\mathrm{old}}}$ for each sampled $q$. $\frac{1}{G}\sum_{i=1}^{G}$ refers to averaging over the $G$ candidate outputs within each group. $\min\!\big(\rho_i(\theta)\hat{A}_i,\ \mathrm{clip}(\rho_i(\theta), 1-\varepsilon, 1+\varepsilon)\hat{A}_i\big)$ and $\beta\, D_{\mathrm{KL}}(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}})$ represent the policy clipping objective and the KL divergence regularization term, respectively [35].
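For reference, the clipping operator restricts the ratio to the stated interval, and the KL term is written here in its standard definition (whether the authors use this form or a per-sample estimator is not stated, so this is an assumption):

$$\mathrm{clip}\!\big(\rho_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big) \in \left[\,1-\varepsilon,\ 1+\varepsilon\,\right], \qquad D_{\mathrm{KL}}\!\big(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\big) = \mathbb{E}_{o \sim \pi_{\theta}}\!\left[\log \frac{\pi_{\theta}(o \mid q)}{\pi_{\mathrm{ref}}(o \mid q)}\right].$$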
In the above formula, $\varepsilon$ denotes the clipping threshold, and $[1-\varepsilon,\ 1+\varepsilon]$ signifies the permitted range of the policy update amplitude. $\beta$ represents the weight of the regularization term, which mainly balances policy improvement against the reference constraint. $\pi_{\mathrm{ref}}$ refers to the reference policy, and $D_{\mathrm{KL}}(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}})$ indicates the KL divergence between the policy $\pi_{\theta}$ and the reference policy $\pi_{\mathrm{ref}}$. $\rho_i(\theta)$ and $\hat{A}_i$ represent the sampling ratio of the new and old policies and the estimated intra-group advantage, respectively:
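Reconstructed from the descriptions in the next paragraph (the symbols $\rho_i(\theta)$ and $r_i$ are assumed names for the ratio and the raw reward, respectively):

$$\rho_i(\theta) = \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}, \qquad \hat{A}_i = \frac{r_i - \mathrm{mean}\!\left(\{r_1, \ldots, r_G\}\right)}{\mathrm{std}\!\left(\{r_1, \ldots, r_G\}\right)}.$$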
In the above formula, $\pi_{\theta}(o_i \mid q)$ denotes the probability that the current policy generates $o_i$, and $\pi_{\theta_{\mathrm{old}}}(o_i \mid q)$ signifies the probability that the old policy generated $o_i$ before the update. When $\rho_i(\theta) > 1$, the new policy is more likely to generate $o_i$; when $\rho_i(\theta) < 1$, the new policy reduces the generation probability of $o_i$. $r_i$ represents the original reward of the $i$-th output, $\mathrm{mean}(\{r_1, \ldots, r_G\})$ denotes the average of all rewards within the group, and $\mathrm{std}(\{r_1, \ldots, r_G\})$ signifies their standard deviation. When $\hat{A}_i > 0$, the output performs better than the group average; when $\hat{A}_i < 0$, it performs worse than the group average.
3.3. Double-Layer GWOP Algorithm Design
The double-layer fusion GWOP algorithm integrates the combination of GWO and GRPO with the introduction of the GAT attention mechanism, as illustrated by the overall fusion framework in Figure 3. The three-dimensional raster-coded environment depicted in the upper left of the figure completes the representation of the digital spatial grid information [36]. By carefully recording the performance of each trajectory algorithm in 3D route planning, fast search and accurate evaluation of the optimal route-planning algorithm are achieved. The top right corner of the figure presents the fundamental framework of Gray Wolf Optimization (GWO). Building upon this, the GAT attention mechanism in the bottom right corner and the GRPO algorithm are fused to form a complete integration process of “GWO group search + GRPO strategy optimization + GAT graph structure perception.” Finally, the output results are applied to the attitude control of the UAV in the bottom right corner, providing an effective solution for real-time path planning of logistics UAVs in unknown environments.
The process of the GWOP fusion algorithm is outlined in Algorithm 4. First, the GRPO objective is integrated into the GWO fitness calculation as $\mathcal{J}_{\mathrm{GRPO}}(\theta_i)$. On this basis, the GAT attention mechanism is incorporated to establish associations among key feature information, serving as the fitness constraint term $\mathcal{L}_{\mathrm{GAT}}(G)$ for the GWO algorithm. The complete optimal strategy is obtained via bidirectional collaborative feedback between the GAT attention weights and the GWO strategies (a sketch of this double-layer loop follows Algorithm 4).
Algorithm 4: Double-Layer GWOP Fusion Algorithm Workflow (pseudocode not reproduced; the loop initializes the wolf pack of candidate strategies, evaluates each candidate's fitness with the GRPO objective and the GAT constraint term, updates the α, β, and δ leaders, and performs the GWO position update with GAT-adjusted coefficients until convergence).
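A minimal sketch of the double-layer loop described above, assuming a generic fitness callable that combines the GRPO evaluation with the GAT constraint term; all names and hyperparameters are illustrative rather than the authors' exact settings.

```python
import numpy as np

def gwop_search(fitness, dim, lb, ub, pack_size=20, max_iter=100, seed=0):
    """Outer GWO layer: each wolf is a candidate strategy parameter vector theta.
    fitness(theta) should return the minimization objective, e.g. -J_GRPO(theta)
    combined with the GAT constraint term."""
    rng = np.random.default_rng(seed)
    wolves = lb + rng.random((pack_size, dim)) * (ub - lb)      # initialization
    for t in range(max_iter):
        scores = np.array([fitness(w) for w in wolves])
        alpha, beta, delta = wolves[np.argsort(scores)[:3]]     # three leading wolves
        a = 2.0 * (1.0 - t / max_iter)                           # control parameter decays 2 -> 0
        for i in range(pack_size):
            cand = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A, C = 2 * a * r1 - a, 2 * r2
                D = np.abs(C * leader - wolves[i])               # encircling distance
                cand += (leader - A * D) / 3.0                   # average of X1, X2, X3
            cand = np.clip(cand, lb, ub)
            if fitness(cand) < scores[i]:                        # greedy update
                wolves[i] = cand
    return wolves[np.argmin([fitness(w) for w in wolves])]
```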
The GWO “wolf pack” corresponds to a set of candidate strategies for the logistics UAV, comprising multiple strategy parameter vectors $\theta_i$. Under any selected strategy parameter $\theta_i$, the logistics UAV executes dynamic trajectory planning and goods distribution. The trajectory performance under strategy $\theta_i$ is evaluated using the GRPO algorithm, and the optimal “α wolf” ultimately guides the evolution of the population. By integrating the two-tier architecture of GWO and GRPO, the GWOP algorithm realizes a high-quality global/local search strategy. The initialization of the GWOP model [37] is as follows:
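Reconstructed from the symbol descriptions in the next sentence, the initialization can be written as:

$$\theta \in \mathbb{R}^{\mathrm{dim}}, \qquad \theta_i = lb + \mathrm{rand} \odot (ub - lb), \qquad \left\{ W^k,\ a^k \right\}_{k=1}^{K}.$$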
In the formula, $\theta$ denotes the parameter vector of the reinforcement learning strategy, $\mathbb{R}$ represents the real number field, and $\mathrm{dim}$ signifies the dimension of the strategy parameters. $\theta_i$ represents the strategy parameters of the $i$-th individual in the population. $lb$ denotes the lower bound of the parameters, $ub$ denotes the upper bound of the parameters, and $\mathrm{rand}$ represents a random vector whose elements follow a uniform distribution on (0, 1). $\odot$ signifies element-wise multiplication. $W^k$ represents the weight matrix of the $k$-th attention head, $a^k$ represents the attention coefficient vector of the $k$-th attention head, $k$ denotes the index of the attention head, and $K$ represents the total number of attention heads. The reward value model in the fitness evaluation of GRPO [38] is as follows:
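Reconstructed from the descriptions in the next sentence (the symbols $r_i$ and $\tilde{r}_i$ are assumed names for the raw and standardized rewards):

$$\tilde{r}_i = \frac{r_i - \mathrm{mean}\!\left(\{r_1, \ldots, r_G\}\right)}{\mathrm{std}\!\left(\{r_1, \ldots, r_G\}\right)}.$$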
In the formula, $r_i$ denotes the original reward of the $i$-th sampling, and $\tilde{r}_i$ denotes the standardized reward of the $i$-th sampling. $G$ represents the number of samplings of the same type of action, $\mathrm{mean}(\{r_1, \ldots, r_G\})$ represents the arithmetic mean of the rewards over the $G$ samplings, and $\mathrm{std}(\{r_1, \ldots, r_G\})$ represents their standard deviation. The fitness $F_i$ of the $i$-th gray wolf individual under the modified GRPO objective function is [39,40,41]:
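The exact expression is not reproduced in the source; a structure consistent with the symbol descriptions that follow is sketched below. How the graph constraint term enters (its sign and weighting) and the use of $\tilde{r}_j$ as the intra-group advantage are assumptions:

$$F_i = \mathcal{J}(\theta_i) + \lambda\, \mathcal{L}_{\mathrm{GAT}}(G),$$
$$\mathcal{J}(\theta_i) = \mathbb{E}_{q \sim P(Q),\ \{o_j\}_{j=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}\!\left[ \frac{1}{G}\sum_{j=1}^{G} \min\!\big(\rho_j(\theta_i)\,\tilde{r}_j,\ \mathrm{clip}\big(\rho_j(\theta_i),\,1-\varepsilon,\,1+\varepsilon\big)\,\tilde{r}_j\big) - \beta\, D_{\mathrm{KL}}\!\left(\pi_{\theta_i}\,\|\,\pi_{\theta_{\mathrm{old}}}\right) \right].$$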
In the formula, $F_i$ denotes the overall task objective function, $\mathcal{J}(\theta_i)$ denotes the GRPO objective term, $\lambda$ denotes the balance coefficient, $\mathcal{L}_{\mathrm{GAT}}(G)$ denotes the graph structure constraint term, and $G$ denotes the graph structure data. $\mathbb{E}$ represents the expectation operation, $q$ represents the environmental state, and $P(Q)$ represents the state distribution; $o$ represents the actions performed by the agent in state $q$, and $\pi_{\theta_{\mathrm{old}}}(O \mid q)$ represents the action distribution of the old policy. $G$ also denotes the number of action samples in the same state, and $\sum_{j=1}^{G}$ denotes the summation over the $G$ action samples. $\beta$ represents the penalty coefficient of the KL divergence, and $D_{\mathrm{KL}}$ represents the KL divergence. $\theta_i$ denotes the parameters of the current candidate strategy, and $\theta_{\mathrm{old}}$ denotes the parameters of the old strategy; $\rho_j(\theta_i)$ represents the probability ratio between the new and old strategies for outputting action $o$ in state $q$, $\varepsilon$ represents the clipping coefficient, and $\mathrm{clip}(\cdot,\, 1-\varepsilon,\, 1+\varepsilon)$ indicates clipping the input value to the interval $[1-\varepsilon,\ 1+\varepsilon]$. The model for converting the maximization objective of GRPO into the minimization objective of GWO is:
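The natural conversion consistent with the descriptions in the next sentence (the sign flip is the standard way to turn a maximization objective into a minimization fitness, assumed here) is:

$$f_i = -\,\mathcal{J}_{\mathrm{GRPO}}(\theta_i).$$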
In the formula, $\mathcal{J}_{\mathrm{GRPO}}(\theta_i)$ denotes the performance evaluation value of GRPO for strategy $\theta_i$, and $f_i$ denotes the fitness function value of the $i$-th individual in GWO. The coefficients of the GWO update mechanism are [42,43]:
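In the standard GWO formulation, which matches the symbol descriptions below, these coefficients are:

$$a = 2\left(1 - \frac{t}{T_{\max}}\right), \qquad A = 2a\,r_1 - a, \qquad C = 2\,r_2.$$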
In the formula, $t$ denotes the current number of iterations, and $T_{\max}$ denotes the maximum number of iterations. $r_1$ and $r_2$ represent uniformly distributed random numbers within the range (0, 1). $A$ represents the bounding (encircling) coefficient, and $C$ represents the direction coefficient [44].
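A compact reconstruction of the GAT operations used inside GWOP, consistent with the symbol list in the next sentences (placing the attention vector $a^k$ inside the inner product is an assumption), is:

$$h_i' = W^k h_i, \quad i \in V; \qquad e_{ij} = \left\langle a^k,\ \left[\, W^k h_i \,\|\, W^k h_j \,\right] \right\rangle, \quad (i, j) \in E;$$
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{l \in \mathcal{N}_i} \exp(e_{il})}; \qquad h_i^{\mathrm{new}} = \sum_{j \in \mathcal{N}_i} \alpha_{ij}\, W^k h_j.$$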
In the formula, $h_i$ denotes the original feature vector of node $i$, $W^k$ denotes the weight matrix of the $k$-th attention head, $h_i'$ denotes the updated feature vector of node $i$, and $V$ represents the node set of the graph. $e_{ij}$ represents the original attention coefficient between nodes $i$ and $j$, $\langle \cdot, \cdot \rangle$ represents the inner product operation between nodes, $\|$ denotes the feature concatenation operation, and $E$ represents the edge set of the graph. $\alpha_{ij}$ represents the normalized attention weight of node $i$ toward neighbor $j$, $\exp(e_{ij})$ denotes the exponential transformation of the original attention coefficient, $\sum_{l \in \mathcal{N}_i} \exp(e_{il})$ represents the sum of the exponential attention coefficients over all neighbors of node $i$, and $\mathcal{N}_i$ denotes the neighbor set of node $i$. $h_i^{\mathrm{new}}$ represents the new feature vector after node $i$ aggregates the features of its neighbors, and $\sum_{j \in \mathcal{N}_i}$ denotes the weighted summation operation. $D$ represents the encircling (containment) coefficient, which measures the distance between an individual and the leading wolves and is used for the candidate strategy updates [45]:
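In the standard GWO position update, which matches the symbol descriptions in the next sentence, the candidate strategies are obtained as:

$$D_{\alpha} = \left| C_1 \odot X_{\alpha} - X \right|, \quad D_{\beta} = \left| C_2 \odot X_{\beta} - X \right|, \quad D_{\delta} = \left| C_3 \odot X_{\delta} - X \right|;$$
$$X_1 = X_{\alpha} - A_1 \odot D_{\alpha}, \quad X_2 = X_{\beta} - A_2 \odot D_{\beta}, \quad X_3 = X_{\delta} - A_3 \odot D_{\delta}; \qquad X(t+1) = \frac{X_1 + X_2 + X_3}{3}.$$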
In the formula, $X_{\alpha}$, $X_{\beta}$, $X_{\delta}$, and $X$ respectively denote the current optimal strategy, the suboptimal strategy, the third-best strategy, and the parameter vector of the current individual. $\odot$ represents element-wise multiplication. $D_{\alpha}$, $D_{\beta}$, and $D_{\delta}$ respectively denote the parameter difference vectors of the α, β, and δ wolves. $X_1$, $X_2$, and $X_3$ respectively denote the candidate parameter vectors guided by the α, β, and δ wolves. $X(t+1)$ denotes the final candidate strategy that integrates the three leader-guided strategies. The values of $A_1$, $A_2$, and $A_3$ are dynamically adjusted by the output of GAT [46]:
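The exact adjustment rule is not recoverable from the text; purely as an illustrative assumption consistent with the symbols described in the next sentence, the GAT attention weight could modulate the basic GWO coefficient as:

$$A = \left(2 r_1 - 1\right) a \left(1 + \lambda\, \alpha_{ij}\right),$$

so that larger attention weights on critical neighbors (e.g., nearby obstacles) enlarge the search step around the corresponding candidate strategy.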
In the formula, $A$ denotes the control coefficient for the position update of the gray wolf, $a$ denotes the basic control parameter of GWOP, $\lambda$ denotes the scaling factor, and $\alpha_{ij}$ represents the attention weight calculated by the GAT model, where $i$ is the current node and $j$ is a neighboring node. The greedy update and convergence model [47] is as follows:
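Reconstructed from the symbol descriptions in the next sentence, the greedy update can be written as:

$$\theta_{\mathrm{GAT}} \leftarrow \theta_{\mathrm{GAT}} - \eta\, \nabla_{\theta_{\mathrm{GAT}}} \mathcal{L}, \qquad \mathcal{L} = \mathcal{L}_{\mathrm{task}} + \lambda\, \mathcal{L}_{\mathrm{GWOP}};$$
$$X_i(t+1) = \begin{cases} X_i(t+1), & f\!\left(X_i(t+1)\right) < f\!\left(X_i(t)\right), \\ X_i(t), & \text{otherwise}. \end{cases}$$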
In the formula, $\theta_{\mathrm{GAT}}$ denotes the set of learning parameters of GAT, $\eta$ denotes the learning rate, and $\nabla_{\theta_{\mathrm{GAT}}} \mathcal{L}$ denotes the gradient of the loss function with respect to $\theta_{\mathrm{GAT}}$. $\mathcal{L}_{\mathrm{task}}$ represents the task-specific loss, $\lambda$ represents the balance coefficient, and $\mathcal{L}_{\mathrm{GWOP}}$ represents the GWOP guiding loss. $X_i(t+1)$ denotes the new position of the $i$-th gray wolf individual, and $f(X_i(t+1))$ denotes the fitness value corresponding to that new position; $X_i(t)$ represents the current position of the $i$-th gray wolf individual, and $f(X_i(t))$ represents the fitness value corresponding to its current position.