Article

Adaptive Refined Graph Convolutional Action Recognition Network with Enhanced Features for UAV Ground Crew Marshalling

1 School of Defence Science and Technology, Xi’an Technological University, Xi’an 710021, China
2 Unmanned System Research Institute, Northwestern Polytechnical University, Xi’an 710072, China
* Authors to whom correspondence should be addressed.
Drones 2025, 9(12), 819; https://doi.org/10.3390/drones9120819
Submission received: 1 October 2025 / Revised: 16 November 2025 / Accepted: 20 November 2025 / Published: 26 November 2025
(This article belongs to the Section Artificial Intelligence in Drones (AID))

Highlights

What are the main findings?
  • An adaptive refined graph convolutional network integrating multi-order features (joints/bones/angles) with static-dynamic domains is proposed, enhancing adaptability to inter-class similarity and intra-class variation through dynamic topology learning and ARFAM mechanism.
  • Joint-type semantics and frame-index semantics are introduced as spatio-temporal constraint modeling, improving the capability to capture temporal evolution patterns of actions and enhancing discriminability and logical consistency of complex long-term sequences.
What is the implication of the main finding?
  • The method’s effectiveness is validated on NTU-RGB+D and self-constructed ICAO ground crew datasets (90.71% real-time accuracy), addressing the technical challenge of recognizing highly similar gestures in UAV ground crew marshalling scenarios.
  • Technical support is provided for UAV-ground command system coordination, with robustness validated through edge device deployment, facilitating practical applications of smart airport ground operations in the low-altitude economy context.

Abstract

For unmanned aerial vehicle (UAV) ground crew marshalling tasks, the accuracy of skeleton-based action recognition is often limited by the high similarity of motion patterns across action categories as well as variations in individual performance. To address this issue, we propose an adaptive refined graph convolutional network with enhanced features for action recognition. First, a multi-order and motion feature modeling module is constructed, which integrates joint positions, skeletal structures, and angular encodings for multi-granularity representation. Static-domain and dynamic-domain features are then fused to enhance the diversity and expressiveness of the input representations. Second, a data-driven adaptive graph convolution module is designed, where inter-joint interactions are dynamically modeled through a learnable topology. Furthermore, an adaptive refinement feature activation mechanism is introduced to optimize information flow between nodes, enabling fine-grained modeling of skeletal spatial information. Finally, a frame-index semantic temporal modeling module is incorporated, where joint-type semantics and frame-index semantics are introduced in the spatial and temporal dimensions, respectively, to capture the temporal evolution of actions and comprehensively exploit spatio-temporal semantic correlations. On the NTU-RGB+D 60 benchmark, the proposed method achieves accuracies of 89.4% and 94.2% under the X-Sub and X-View settings, respectively; on NTU-RGB+D 120, it achieves 81.7% and 83.3% under the corresponding settings. On the self-constructed UAV Airfield Ground Crew Dataset, the proposed method attains accuracies of 90.71% and 96.09% under X-Sub and HO settings, respectively.
Environmental robustness experiments demonstrate that under complex environmental conditions including illumination variations, haze, rain, shadows, and occlusions, the adoption of the Test + Train strategy reduces the maximum performance degradation from 3.1 percentage points to within 1 percentage point. Real-time performance testing shows that the system achieves an end-to-end inference latency of 24.5 ms (40.8 FPS) on the edge device NVIDIA Jetson Xavier NX, meeting real-time processing requirements and validating the efficiency and practicality of the proposed method on edge computing platforms.

Graphical Abstract

1. Introduction

With the rapid development of the low-altitude economy, unmanned aerial vehicle (UAV) technology is gradually becoming an integral component of future transportation systems [1]. Emerging types of autonomous aerial platforms, represented by electric vertical takeoff and landing (eVTOL) aircraft and unmanned cargo drones, are driving profound transformations in urban air mobility, logistics, and emergency response [2]. These UAVs are expected not only to reshape traditional transportation modes but also to promote collaborative innovation and integration across the entire industry chain. However, compared with the rapid expansion of the low-altitude economy, the development of relevant laws and regulations, air traffic control systems, dedicated airports, and supporting infrastructure remains relatively lagging [3], thereby constraining the large-scale practical deployment of UAVs.
In current aviation systems, airport ground operations are critical to ensuring flight safety. Pilots rely on instructions conveyed by ground crew through standardized hand signals and wands to complete maneuvers such as taxiing and docking. The International Civil Aviation Organization (ICAO) has established a globally standardized marshalling gesture system to ensure uniformity and safety in ground operations [4]. In the context of the low-altitude economy, enabling UAVs to understand and respond to these standardized ground crew marshalling gestures, while effectively coordinating with traditional ground command systems, has become a core issue and a major challenge for the seamless integration of UAVs into existing airport operational frameworks.
Nevertheless, the automatic recognition of ground crew gestures still faces significant challenges in practical applications. Existing action recognition methods encounter multiple bottlenecks in the context of UAV marshalling tasks [5]. First, there exists substantial inter-class similarity: different gesture categories often exhibit highly overlapping motion trajectories or local motion patterns. For instance, both the “normal stop” and “emergency stop” commands involve raising and crossing the arms, resulting in blurred class boundaries. Second, considerable intra-class variability arises from differences in body shape, motion amplitude, execution habits, and situational factors (e.g., viewpoint, distance, clothing) [6], which weakens the stability of unified feature representations.
Third, mainstream skeleton-based modeling approaches primarily rely on joint coordinates and natural topological connections, yielding relatively static and low-dimensional descriptions [7]. Such representations are prone to feature overlap for similar actions and lack robustness when modeling highly variable gestures. Fourth, the natural connectivity of skeletal structures mainly reflects local physical adjacency, making it difficult to directly capture long-range joint dependencies and dynamic correlations, thereby limiting model discriminability and generalization under complex conditions. In addition, current research on gesture recognition in UAV marshalling remains insufficient: existing datasets are small in scale, cover only a limited range of action categories, and are often restricted to common routine commands, falling short of reflecting the full complexity of real-world marshalling. These challenges collectively hinder the practical value of gesture recognition systems in real-world UAV ground crew operations.
To address the aforementioned challenges, this paper proposes an adaptive refined graph convolutional action recognition network with enhanced features. First, a multi-order feature fusion strategy is employed, where representations are constructed at different granularities including joints, bones, and angles. Static-domain features (pose morphology) and dynamic-domain features (temporal motions) are jointly encoded to alleviate feature overlap among similar actions and to enhance intra-class consistency. Second, a data-driven adaptive graph convolution mechanism is introduced, which dynamically adjusts inter-joint interactions through a learnable topology. An adaptive refined feature activation scheme is further incorporated to optimize information flow and channel selection, effectively activating key high-dimensional motion patterns. However, these mechanisms face multiple technical challenges in implementation: multi-order feature fusion may lead to dimensional expansion and computational burden, fully data-driven adjacency matrix learning may induce overfitting, and adaptive channel refinement may produce redundancy and feature degradation. To address these issues, this paper employs a carefully designed angular encoding strategy to control feature complexity, utilizes prior topology regularization to constrain the adjacency matrix learning space, adopts local channel interaction mechanisms to avoid global parameter explosion, and leverages residual connections to preserve feature diversity. Meanwhile, a staged training strategy is employed to stabilize the optimization process, and dimensionality-reduction convolutions are introduced after semantic concatenation to control parameter growth, achieving a reasonable balance between performance improvement and computational efficiency.
On this basis, joint-type semantics in the spatial dimension and frame-index semantics in the temporal dimension are incorporated as spatio-temporal constraints, providing guiding signals that enhance sequence-level discriminability and robustness, thereby improving the distinctiveness and logical consistency of complex long-term actions.
To systematically validate the effectiveness and transferability of the proposed method, two types of experiments are conducted: (1) offline experiments on the NTU-RGB+D 60 and NTU-RGB+D 120 benchmark datasets [6] to evaluate model accuracy and robustness in general action recognition scenarios; and (2) application-oriented experiments in UAV ground crew marshalling tasks, where an ICAO-compliant dataset is constructed. In this scenario, raw video frames are converted into skeleton sequences via a pose estimation network, and the proposed recognition model is used to predict ground crew action categories. Experimental results demonstrate that the proposed method achieves superior performance on standard benchmarks and can effectively recognize UAV ground crew marshalling gestures, thereby providing technical support for efficient coordination between UAVs and ground command systems.
The main contributions of this work are summarized as follows:
(1) We propose a multi-order feature fusion approach that integrates joints, bones, and angular information, and combines static-domain and dynamic-domain features, thereby enhancing the model’s adaptability to inter-class similarity and intra-class variation.
(2) We design an adaptive refined graph convolutional network, incorporating dynamic topology learning and feature activation mechanisms to strengthen the expressive power and discriminability of the model in complex action recognition tasks.
(3) We introduce semantic constraints in both spatial and temporal dimensions by modeling joint-type semantics and frame-index semantics, which improves the ability to capture temporal evolution patterns of actions and enhances the distinctiveness and logical consistency of complex long-term sequences.
(4) We construct an ICAO-compliant UAV ground crew marshalling dataset and implement real-time recognition on edge devices, validating the robustness and adaptability of the proposed method in real-world airport ground operation scenarios.

2. Related Work

The recognition of UAV ground crew marshalling actions in this paper primarily focuses on identifying motion features derived from human skeletal poses. Human skeletons can be viewed as articulated systems composed of rigid segments connected by joints. Skeleton models represent human postures through joint locations and their relative positions, making them an important representation for action recognition. Unlike the pose estimation of a single rigid body, human posture estimation involves processing multiple interrelated joints and non-rigid motion. Skeleton representation not only reduces the dimensionality of the data but also preserves key structural information, making it widely used in action recognition and analysis. Current methods for obtaining skeleton data primarily include depth sensors (e.g., Kinect), inertial sensors (accelerometers, gyroscopes), and human pose estimation algorithms based on images or videos. Skeleton-based action recognition methods can be broadly classified into four categories: (1) handcrafted feature-based methods, (2) Recurrent Neural Network (RNN)-based methods, (3) Convolutional Neural Network (CNN)-based methods, and (4) Graph Convolutional Network (GCN)-based methods. Duan et al. [8] systematically reviewed the evolution of skeleton-based action recognition from traditional methods to deep learning approaches, providing a comprehensive evaluation of the technical characteristics, applicable scenarios, and performance of various methods, which offers important methodological references and developmental directions for subsequent research.
Early research primarily concentrated on traditional machine learning methods and handcrafted feature-based representations. Methods such as Random Forests, Bayesian Networks, Markov Models, and Support Vector Machines (SVM) achieved good performance in controlled environments with limited data, but relied on manually designed shallow features and lacked sufficient generalization capability.
RNN-based action recognition methods effectively capture temporal dependencies and dynamic features when processing skeleton sequences. Compared to handcrafted feature extraction methods, RNNs overcome the generalization limitations of traditional methods through end-to-end learning, demonstrating particular advantages in complex and dynamic scenarios. The Part-Aware LSTM enhances the modeling capability of spatiotemporal dependencies by functionally partitioning the skeleton and independently processing joint sequences from different body parts [6]. The large-scale NTU RGB+D dataset established in this research is one of the most influential and well-known datasets in the field of 3D pose-based action recognition. The Spatio-Temporal LSTM (ST-LSTM) incorporates a trust gate mechanism that helps mitigate the impact of noise on model performance [9]. These methods have continuously optimized spatiotemporal feature modeling, improving the accuracy and robustness of skeleton-based action recognition. Despite the advantages of RNNs in temporal modeling, they still exhibit limitations in capturing spatial dependencies.
CNN-based methods primarily transform skeleton data into image representations and then leverage the powerful feature extraction capabilities of convolutional neural networks for action recognition. CNNs can automatically learn hierarchical representations of data, demonstrating strong performance in complex pattern recognition tasks [10]. To enhance spatial relationship modeling, Ke et al. [11] generated image representations through cylindrical coordinates and joint chain structures, highlighting the spatial dependencies between joints. CNN-based methods extract local spatial features through convolution and can capture joint dependencies. However, since CNNs are designed for regular grids, transforming skeleton sequences into pseudo-images inevitably results in information loss, making it difficult to fully exploit the topological relationships of the skeletal structure.
In recent years, Graph Convolutional Networks (GCNs) have gained widespread attention due to their ability to directly model graph-structured data. Human skeletons naturally form graph structures composed of joint nodes and their connections. GCNs effectively capture high-order relationships and global dependencies between joint nodes by defining convolution operations on graphs, making them the mainstream approach in skeleton-based action recognition research. Yan et al. proposed Spatial–Temporal Graph Convolutional Networks (ST-GCN), which first integrated GCN with temporal convolution in a unified framework, achieving significant improvements in action recognition performance [7]. ST-GCN utilizes adjacency matrices to model spatial dependencies within a single frame while connecting cross-frame nodes of the same joint along the temporal dimension, achieving joint spatiotemporal modeling. This work is widely regarded as a foundational contribution to the field.
Building upon ST-GCN, extensive research has focused on optimizing graph structures and enhancing feature representation capabilities. Cheng et al. proposed Decoupling GCN [12], which allows different channels to adopt different topological structures, improving the fine-grained modeling of inter-joint dependencies while introducing DropGraph regularization to suppress overfitting. To reduce computational complexity, the Channel-wise Topology Refinement Graph Convolution (CTR-GC) method [13] proposed learning a shared topology first and then refining it by incorporating channel correlations, thereby balancing efficiency and accuracy. Cai et al. [14] introduced Joint-aligned Optical Flow Patches (JFP) to compensate for the limitations of skeleton sequences in capturing subtle motions, constructing a dual-stream GCN to fuse global skeletal information with local optical flow features. Lee et al. proposed the Hierarchically Decomposed Graph Convolutional Network (HD-GCN) [15], which decomposes the human skeleton into multiple hierarchical subgraph structures based on functional and physical connectivity, separately modeling local joint interactions and global body part dependencies. This approach effectively enhances the model’s hierarchical understanding of complex actions while improving recognition accuracy and maintaining computational efficiency. Wang et al. proposed a Temporal-Channel Aggregation (TCA) method for skeleton-based action recognition [16], which designs a joint aggregation mechanism across temporal and channel dimensions. Through cross-temporal channel interactions and cross-channel temporal modeling, this method significantly enhances the model’s capability to capture dynamic spatiotemporal patterns while maintaining computational efficiency, providing new insights for lightweight network design. To further enhance the feature learning capability of GCNs, researchers have explored various innovative strategies. Chi et al. proposed InfoGCN [17], which designs a learning objective function based on information bottleneck theory, introduces attention mechanisms to capture context-dependent topological structures, and leverages multimodal representations of joint relative positions to provide complementary spatial information. Li et al. proposed Symbiotic Graph Neural Networks [18], which jointly model action recognition and motion prediction tasks by designing multi-scale graph structures to separately capture action relationships and physical constraints, achieving mutual enhancement between tasks. Shi et al. proposed a Decoupled Spatial–Temporal Attention Network [19], which reduces computational complexity through independent spatial and temporal attention modules, demonstrating superior performance in action and gesture recognition tasks.
To address the scarcity of labeled data, researchers have proposed various semi-supervised and unsupervised learning methods. Shu et al. proposed Multi-granularity Anchor-based Contrastive learning (MAC-Learning) [20], which conducts contrastive pretraining by constructing anchors at three granularities: local, contextual, and global. By designing an anchor-based contrastive loss to avoid interference from noisy samples, this method outperforms existing approaches on multiple benchmark datasets. Lin et al. proposed action-unit-based contrastive learning [21], which constructs contrastive tasks by decomposing complete actions into semantic action units, effectively improving feature learning quality in unsupervised scenarios. Zhou et al. [22] designed a multi-task learning framework that combines self-supervised pretraining with supervised fine-tuning, introducing discriminative loss to enhance inter-class separability and improving generalization capability in few-shot scenarios.
In recent years, Transformer architectures have demonstrated tremendous potential in skeleton-based action recognition, effectively alleviating the limited receptive field problem of GCNs, which are constrained by the physical connectivity of joints. Do et al. proposed SkateFormer [23], which partitions joints and frames according to skeletal-temporal relationships through a partition-specific attention strategy and performs self-attention within each partition. This approach achieves action-adaptive selection of key joints and frames, attaining state-of-the-art performance on multiple benchmark datasets. Huu et al. proposed STEP CATFormer [24], which employs a spatial–temporal effective body-part cross-attention mechanism. By modeling cross-part interactions across different body parts, this method enhances the representation capability for complex actions.
Beyond Transformer architectures, generative methods and multimodal general-purpose models have also garnered significant attention. Xiang et al. [25] proposed a generative action description prompting method that leverages text descriptions generated by large language models to guide skeleton encoders in learning semantically discriminative features, effectively fusing linguistic and geometric information. Wang et al. proposed Hulk [26], which constructs the first multimodal human-centric general-purpose model. Through a unified modality conversion framework, it processes 2D/3D visual, skeletal, and vision-language tasks, achieving state-of-the-art performance across multiple benchmarks including skeleton-based action recognition. Qin et al. [27] systematically reviewed datasets and methods in this field and proposed the ANUBIS benchmark framework, promoting the standardization of evaluation systems.
In summary, skeleton-based action recognition methods have evolved from early handcrafted features to a new era dominated by deep learning. RNN-based methods effectively model temporal dependencies but exhibit limitations in capturing spatial relationships. CNN-based methods enhance feature extraction but struggle to fully exploit skeletal topology. GCN-based methods directly model graph structures and have become mainstream, while recent Transformer architectures further improve global modeling capabilities. Although Transformers possess advantages in modeling global dependencies, GCNs remain an important research direction due to their explicit modeling of skeletal topological structures, offering superior parameter efficiency and physical interpretability. However, existing GCN methods have limitations in the following aspects: on one hand, in graph structure learning, reliance on predefined topologies (e.g., ST-GCN [7]) makes it difficult to adaptively model task-relevant semantic associations, while global attention-based methods (e.g., SkateFormer [23]) may introduce redundant connections and increase computational complexity; on the other hand, in feature fusion, existing multi-stream strategies (e.g., Decoupling GCN [12]) mostly employ shallow operations and fail to fully exploit cross-modal complementary information. Furthermore, generalization capability in small-scale or domain-specific scenarios requires improvement. To address these issues, this paper proposes the Adaptive Refined Graph Convolutional Network (ARGCN), which resolves the aforementioned problems through adaptive graph learning, refined feature activation, and multi-order semantic fusion, while maintaining the topological modeling advantages of GCNs.

3. Method

To enhance the ability of UAVs to recognize complex ground crew marshalling commands, this paper proposes an adaptive refined graph convolutional action recognition network with enhanced features, which consists of three main components: a multi-order and motion feature modeling module, a data-driven adaptive refined graph convolution module, and a frame-index semantic temporal modeling module. The multi-order and motion feature modeling module captures motion patterns within action sequences, thereby enriching feature diversity and expressive capacity. The data-driven adaptive refined graph convolution module dynamically captures complex inter-joint interactions, enabling fine-grained modeling of the spatial information of ground crew skeletons, while the adaptive feature activation mechanism further strengthens information flow and interaction among joints. The frame-index semantic temporal modeling module focuses on capturing the dynamic characteristics of actions as they evolve over time. The overall network architecture and the detailed data flow between modules are illustrated in Figure 1. The input skeleton sequence is first processed by the feature modeling module, which captures diverse motion characteristics from multiple perspectives by decomposing them into five parallel feature streams: Joint static domain (Joint), Joint dynamic domain (Joint_vel), Bone static domain (Bone), Bone dynamic domain (Bone_vel), and Angle static domain (Angle). These five feature streams are subsequently fed into the fusion module F, which integrates multi-source motion features and incorporates semantic prior knowledge of spatial joint types. At this stage, the five feature streams are concatenated with one-hot encoded joint type semantic information. The concatenated features are then reduced in dimensionality through a 1 × 1 convolutional layer to control the number of parameters. 
The fused features next enter a three-layer residual adaptive refined graph convolutional network (ARGCN) module, which performs spatial relationship modeling through data-driven adaptive topology and adaptive refined feature activation mechanisms, dynamically capturing complex inter-joint interactions and enhancing discriminative feature representations. The spatially modeled features are subsequently sent to the temporal feature modeling module, which captures cross-frame temporal dynamics and motion evolution patterns by introducing one-hot encoded frame index semantic information. Finally, spatial max pooling (SMP) and temporal max pooling (TMP) are applied to aggregate multi-scale spatial and temporal features, followed by a fully connected layer that outputs the final action classification results.
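The fusion step described above (concatenating the five feature streams with one-hot joint-type semantics, then reducing channels with a 1 × 1 convolution) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the (N, C, T, V) layout, channel sizes, and the use of a random channel-mixing matrix in place of a learned 1 × 1 kernel are all assumptions for demonstration.

```python
import numpy as np

def fuse_streams(streams, out_channels=64, rng=np.random.default_rng(0)):
    """Sketch of the fusion module F: concatenate five feature streams of
    shape (N, C, T, V) with one-hot joint-type semantics, then apply a
    1x1 convolution (here: a channel-mixing matrix) to reduce dimensionality."""
    n, c, t, v = streams[0].shape
    # one-hot joint-type semantics, broadcast over batch and time: (N, V, T, V)
    joint_type = np.broadcast_to(np.eye(v).reshape(1, v, 1, v), (n, v, t, v))
    x = np.concatenate(streams + [joint_type], axis=1)   # (N, 5C + V, T, V)
    w = rng.standard_normal((out_channels, x.shape[1]))  # stand-in 1x1 kernel
    # a 1x1 conv is a per-position linear map over channels
    return np.einsum('oc,nctv->notv', w, x)              # (N, out_channels, T, V)

streams = [np.random.randn(2, 3, 32, 25) for _ in range(5)]
print(fuse_streams(streams).shape)  # (2, 64, 32, 25)
```

Because the one-hot joint-type channels are constant across time, they act purely as a spatial identity prior; the 1 × 1 convolution keeps the parameter count under control after concatenation, as the text notes.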

3.1. Multi-Order and Motion Feature Modeling

The multi-order and motion feature modeling module is designed to enhance the discriminative capability of skeleton-based action recognition models through richer feature encodings. Existing approaches primarily employ joint and bone representations to analyze the skeletal motion patterns of ground crew, but they remain insufficient when dealing with actions exhibiting high inter-class similarity and significant intra-class variation. In particular, when different marshalling gestures share similar joint motion trajectories, models relying on a single feature modality often fail to accurately distinguish between them. Moreover, variations in body shape and execution styles across individuals can lead to considerable discrepancies in joint coordinates, even within the same action category. Inspired by multi-stream architectures in the literature [28,29], the proposed method enriches the input data by incorporating joint positions (first-order), joint velocities (first-order), bone data (second-order), bone velocities (second-order), and angular encodings (third-order).

3.1.1. Angle Encoding

Angular information captures the relative motion of body parts in the skeleton and is obtained by measuring the angle formed at a target joint by two endpoint joints: given three joints $v$, $u_1$, and $u_2$, $v$ is the target joint and $u_1$, $u_2$ are endpoint joints in the skeleton. Let $b_t^{vu_i}$ denote the vector from the target joint $v$ to the endpoint joint $u_i$ ($i = 1, 2$) at time $t$, with $b_t^{vu_i} = (x_t^{u_i} - x_t^{v},\; y_t^{u_i} - y_t^{v},\; z_t^{u_i} - z_t^{v})$, where $J_t^{k} = (x_t^{k}, y_t^{k}, z_t^{k})$ denotes the coordinates of joint $k$ ($k = v, u_1, u_2$). Let $\alpha$ denote the angle formed by $b_t^{vu_1}$ and $b_t^{vu_2}$. The angle encoding for the target joint $v$ is defined as:

$$
e_t^{v} =
\begin{cases}
1 - \cos\alpha = 1 - \dfrac{b_t^{vu_1} \cdot b_t^{vu_2}}{\lVert b_t^{vu_1} \rVert \, \lVert b_t^{vu_2} \rVert}, & \text{if } v \neq u_1 \text{ and } v \neq u_2 \\[4pt]
0, & \text{if } v = u_1 \text{ or } v = u_2
\end{cases}
$$
As α varies from 0 to π radians, the feature value increases monotonically. Compared to first-order features representing joint coordinates and second-order features representing bone length and orientation, the angle encoding as a third-order feature focuses more on motion information and remains invariant to human body scale. The angle encoding velocity is obtained through frame-to-frame differentiation across consecutive frames:
$$
v_{t+1}^{u} = e_{t+1}^{u} - e_t^{u}
$$
where $v_{t+1}^{u}$ denotes the angle-encoding velocity of node $u$ at time step $t+1$. When the angles over all node triplets are organized into an angular feature group, the computational complexity reaches $O(V^3 T)$, where $V$ denotes the number of nodes and $T$ the number of time steps. Computed naively, this would incur excessive computational cost and significantly slow model training and inference. In this work, angle encodings with strong discriminative capability are employed to guide action discrimination while avoiding this drastic increase in computational cost. Three types of angular feature groups are presented in Figure 2.
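The angle encoding and its frame-differenced velocity can be sketched directly from the definitions above. This is a minimal NumPy illustration under assumed conventions: joints are stored as an array of shape (T, V, 3), and a small `eps` (our addition, not in the paper) guards against zero-length bone vectors.

```python
import numpy as np

def angle_encoding(J, v, u1, u2, eps=1e-8):
    """Third-order angle encoding e_t^v = 1 - cos(alpha), where alpha is the
    angle at target joint v between the vectors toward endpoints u1 and u2.
    J: joint coordinates of shape (T, V, 3). Returns an array of shape (T,)."""
    if v == u1 or v == u2:
        # by definition the encoding is zero when v coincides with an endpoint
        return np.zeros(J.shape[0])
    b1 = J[:, u1] - J[:, v]  # b_t^{v u1}, shape (T, 3)
    b2 = J[:, u2] - J[:, v]  # b_t^{v u2}, shape (T, 3)
    cos_a = (b1 * b2).sum(-1) / (
        np.linalg.norm(b1, axis=-1) * np.linalg.norm(b2, axis=-1) + eps)
    return 1.0 - cos_a  # increases monotonically as alpha goes from 0 to pi

J = np.random.randn(16, 25, 3)
e = angle_encoding(J, v=2, u1=4, u2=8)
vel = np.diff(e)            # angle-encoding velocity via frame differencing
print(e.shape, vel.shape)   # (16,) (15,)
```

Note that `e` is scale-invariant (doubling all coordinates leaves it unchanged) and bounded in [0, 2], matching the properties claimed for the third-order feature.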
The local angular features characterize relative motion by computing the angle between the target node and two of its adjacent nodes. When the target node has only one adjacent node, its angular feature is set to zero; when the node has two or more adjacent nodes, the pair with the most active motion is selected for computation. In this work, the “activity” of a node is defined as its average motion amplitude across the entire action sequence. Specifically, for node $j$, its activity $A_j$ is computed as follows:
$$
A_j = \frac{1}{T-1} \sum_{t=1}^{T-1} \left\lVert p_j^{t+1} - p_j^{t} \right\rVert_2
$$
where $p_j^{t}$ represents the 3D coordinates of node $j$ at time step $t$, $T$ is the total length of the sequence, and $\lVert \cdot \rVert_2$ denotes the Euclidean distance. The activity $A_j$ measures the average positional change of node $j$ over time. For example, the neck joint is typically associated with both shoulder joints, while the head and torso joints exhibit minimal movement and are thus not considered as primary calculation objects. Local angular features effectively capture the relative motion patterns between two bones, making them suitable for analyzing subtle movements in localized regions. These features allow for a more fine-grained description of inter-joint motion patterns and are particularly advantageous when analyzing the relative movements of body regions such as the shoulders and elbows in ground crew marshalling actions. Compared to using only joint or bone data, local angular features provide richer motion information, improving the model’s ability to distinguish subtle differences in complex actions.
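The activity measure and the selection of the most active neighbor pair can be sketched as below. The neighbor list and toy skeleton are hypothetical; the selection rule (take the two highest-activity neighbors) is one reasonable reading of "the pair with the most active motion".

```python
import numpy as np

def node_activity(P):
    """Activity A_j: mean frame-to-frame Euclidean displacement per node.
    P: joint coordinates of shape (T, V, 3). Returns an array of shape (V,)."""
    disp = np.linalg.norm(np.diff(P, axis=0), axis=-1)  # (T-1, V)
    return disp.mean(axis=0)

def most_active_pair(P, neighbors):
    """For a target node with two or more neighbors, pick the two neighbors
    with the highest activity to form the local angle (illustrative rule)."""
    A = node_activity(P)
    order = sorted(neighbors, key=lambda j: A[j], reverse=True)
    return order[0], order[1]

# toy sequence: node 3 moves 1 unit/frame, node 1 moves 0.1 unit/frame
P = np.zeros((10, 5, 3))
P[:, 3, 0] = np.arange(10)
P[:, 1, 0] = 0.1 * np.arange(10)
print(most_active_pair(P, neighbors=[1, 2, 3]))  # (3, 1)
```

Since `A` depends only on displacements, stationary joints such as the head and torso naturally receive low activity and are excluded from the angle computation, consistent with the example in the text.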
Central-directed angular features are used to measure the relative positional relationship between the target joint and the core body region. Two types of angles are constructed: the first is the non-fixed axis angle, defined by the neck, target joint, and pelvis; the second is the fixed axis angle, defined by the neck, pelvis, and target joint. For the neck and pelvis joints, which lie on the reference axis for angle calculation, the angle features are set to zero. This angular feature effectively describes the relative motion between the target joint and the body center, with the central-directed angle between the target arm joint and the core joints being more suitable for analyzing coordinated movements. The central-directed angle helps reveal the connection between the target joint and the overall body structure.
Pairwise joint angular features are used to describe the angle relationship between the target joint and four key endpoints (hands, elbows, knees, and feet), capturing the angle information related to movement within the skeleton structure. Similar to the treatment of local angular features, when the target joint belongs to one of these four key endpoints, its corresponding angular feature is set to zero. The selection of these four key endpoints is based on their high relevance during action execution, as they effectively reflect the dynamic characteristics of the overall motion trajectory and joint coordination.

3.1.2. Static and Dynamic Domain Modeling of Joints and Bones

The position data of the joints come from the original 3D joint coordinates provided in the dataset, where the coordinate representation of joint $v$ at time step $t$ is denoted as $J_t^{v} = (x_t^{v}, y_t^{v}, z_t^{v})$, encoding the basic structural and postural information of the human body. The velocity of the joints is computed through consecutive-frame differencing using a position-based differential approach:
$$V_t^{v} = J_{t+1}^{v} - J_t^{v} = \left( x_{t+1}^{v} - x_t^{v},\; y_{t+1}^{v} - y_t^{v},\; z_{t+1}^{v} - z_t^{v} \right)$$
where $V_t^{v}$ denotes the velocity of joint $v$ at time $t$, capturing the motion trend of the joint. The skeletal data are represented based on the relative positions between joints, characterizing the spatial connectivity of each body part. The bone vector points from the source joint $v$ to the target joint $u$, with its direction oriented away from the skeletal centroid:
$$B_t^{v} = J_t^{u} - J_t^{v} = \left( x_t^{u} - x_t^{v},\; y_t^{u} - y_t^{v},\; z_t^{u} - z_t^{v} \right)$$
This scale-invariant representation enables the normalization of features across individuals with different body sizes. The bone velocity is obtained through frame-wise differentiation of the bone vectors:
$$M_t^{v} = B_{t+1}^{v} - B_t^{v}$$
where $M_t^{v}$ represents the velocity of bone $v$ at time $t$.
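The four static and dynamic quantities can be sketched with simple frame differencing; the toy skeleton topology (`parents`) and zero-padding of the last frame are illustrative assumptions:

```python
import numpy as np

def joint_velocity(J):
    """V_t = J_{t+1} - J_t; the last frame is zero-padded to keep length T."""
    V = np.zeros_like(J)
    V[:-1] = J[1:] - J[:-1]
    return V

def bone_vectors(J, targets):
    """B_t^v = J_t^u - J_t^v for each source joint v and its target u.

    targets[v] gives the target joint u of source v (a toy convention
    here; the root points to itself, yielding a zero-length bone).
    """
    return J[:, targets, :] - J

T, Vn = 4, 3
rng = np.random.default_rng(0)
J = rng.standard_normal((T, Vn, 3))   # (T, V, 3) joint positions
targets = [0, 0, 1]                   # toy skeleton: joint 0 is the root
V = joint_velocity(J)                 # joint dynamic domain
B = bone_vectors(J, targets)          # bone static domain
M = joint_velocity(B)                 # bone dynamic domain (M_t = B_{t+1} - B_t)
```

The same differencing routine serves for both joint and bone velocities, mirroring the symmetry of the two dynamic-domain equations.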

3.1.3. Feature Construction and Fusion

Multiple orders of information are embedded into the same high-dimensional space. Taking the embedding of node positions as an example, two fully connected layers are employed to encode the node positions:
$$\tilde{J}_t^{v} = \sigma\left( W_2\, \sigma\left( W_1 J_t^{v} + b_1 \right) + b_2 \right)$$
where $W_1 \in \mathbb{R}^{C_1 \times C}$ and $W_2 \in \mathbb{R}^{C_2 \times C_1}$ are weight matrices, $b_1$ and $b_2$ are bias vectors, $\sigma$ denotes the ReLU activation function, $C$ represents the input feature dimensionality, $C_1$ the intermediate embedding dimension, and $C_2$ the embedding dimension of each feature stream. Through this formulation, the positional information of the joints is embedded into a high-dimensional representation space to capture the complex spatio-temporal relationships in the skeletal sequence. Similarly, the remaining features are embedded as $\tilde{V}_t^{v}$, $\tilde{B}_t^{v}$, $\tilde{M}_t^{v}$, and $\tilde{E}_t^{v}$. Finally, these embedded features are concatenated to obtain the fused data representation.
$$F = \sigma\left( \mathrm{BN}\left( W_3 \left[ \tilde{J}_t^{v}, \tilde{V}_t^{v}, \tilde{B}_t^{v}, \tilde{M}_t^{v}, \tilde{E}_t^{v} \right] \right) + b_3 \right)$$
where $[\cdot]$ denotes the feature concatenation operation, which concatenates the five types of embedded features along the channel dimension to form a $5C_2$-dimensional joint representation. $W_3 \in \mathbb{R}^{C_{\mathrm{out}} \times 5C_2}$ is the fusion transformation matrix, $b_3 \in \mathbb{R}^{C_{\mathrm{out}}}$ is the bias vector, and BN represents the batch normalization layer. The fused feature $F \in \mathbb{R}^{C_{\mathrm{out}} \times V \times T}$ (where $V$ denotes the number of joints and $T$ the number of temporal frames) is utilized for the subsequent action recognition task.
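The embed-then-fuse pipeline can be sketched for a single joint at one time step; the toy dimensions, random weights, and omission of batch normalization are assumptions of this sketch, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

def embed(x, W1, b1, W2, b2):
    """Two fully connected layers: relu(W2 @ relu(W1 @ x + b1) + b2)."""
    return relu(W2 @ relu(W1 @ x + b1) + b2)

C, C1, C2 = 3, 16, 32          # input / intermediate / embedding dims (toy values)
# one (W1, b1, W2, b2) set per feature stream: J, V, B, M, E
params = [(rng.standard_normal((C1, C)), np.zeros(C1),
           rng.standard_normal((C2, C1)), np.zeros(C2)) for _ in range(5)]

x = rng.standard_normal(C)             # one joint's raw feature at one time step
streams = [embed(x, *p) for p in params]
fused_in = np.concatenate(streams)     # 5*C2-dimensional joint representation
W3 = rng.standard_normal((64, 5 * C2)) # fusion matrix, C_out = 64
F = relu(W3 @ fused_in)                # BN omitted in this sketch
```

In practice all five streams share the same raw coordinates per joint and the embedding runs over the full $C \times V \times T$ tensor; the per-vector form above only isolates the channel arithmetic.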

3.2. Self-Adaptive Graph Convolutional Module Based on Enhanced Data-Driven Learning

3.2.1. Adaptive Topology Construction Driven by Data

The human skeleton can be regarded as a naturally connected graph structure. By constructing and learning such a topology, convolutional modeling can be performed. Based on natural connections, the normalized adjacency matrix can be expressed as $A = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, where $\tilde{A}$ is the adjacency matrix with self-connections and $\tilde{D}$ is the corresponding degree matrix.
As illustrated in Figure 3, the natural connections between joints are determined by the explicit skeletal topology, which only reflects static physical relationships. However, in real-world actions, joint interactions are far more complex and often involve long-range coordination. For example, during running, the legs, arms, and trunk must move in synchrony. In ST-GCN [7], the fixed adjacency matrix is limited to capturing only local relations, such as between the knee and hip or the elbow and shoulder, while neglecting critical long-range collaborations, such as those between the left leg and right hand. This localized receptive field constrains the network’s ability to model global motion patterns, causing long-range dependencies to gradually diminish and thereby limiting the overall understanding of actions.
Moreover, a fixed adjacency matrix struggles to adapt to the dynamic variations of joint relationships across different actions. For instance, when waving, the shoulder joint primarily interacts with the arm, whereas in jumping, it must coordinate with the legs to generate force. Thus, natural skeletal connections cannot fully represent the coupled relationships between joints during motion. To more accurately characterize this complex coupling, we construct joint connections using a data-driven adaptive topology with enhanced dynamic perception, as illustrated in the right part of Figure 3.
Mainstream approaches for obtaining dynamic adjacency matrices fall into three categories. The inner-product method directly computes connection weights based on feature similarity between nodes. The bilinear method introduces a linear mapping of feature vectors before computing the inner product. The adaptive data-driven method, by contrast, leverages neural network layers to learn the adjacency matrix. Since the inner-product method alone cannot fully capture latent joint relationships, this work adopts a data-driven adaptive strategy to learn dynamic topologies, thereby more precisely reflecting the complex interactions between joints during action execution.
$$A_t = \mathrm{softmax}\left( \theta(F_t)^{\mathsf{T}} \cdot \varphi(F_t) \right)$$
where $F_t$ denotes the input feature at time $t$, and $\theta(\cdot)$ and $\varphi(\cdot)$ are learnable linear transformations ($\theta, \varphi: \mathbb{R}^{C \times V} \to \mathbb{R}^{C \times V}$); the connection weight between joints is measured through their inner product. A softmax function is then applied to normalize the results such that the sum of all edge weights connected to a target node equals one. Mathematically, the inner product reflects the projection of one vector onto the direction of another, and its magnitude indicates the degree of correlation between the two vectors: a smaller angle yields a larger inner product, implying stronger relevance between the corresponding vectors.
This mechanism is employed for constructing the dynamic adjacency matrix, where the inner product is used to measure the correlation between any two joints. The normalized adjacency matrix at time t is denoted as A t . The model learns a distinct graph structure for each time step, as illustrated in Figure 4. Specifically, F t is first passed through two independent mapping modules to obtain node representations from different perspectives. These representations are then transposed and subjected to similarity computation to derive the correlation strength between nodes. Finally, a normalization operation generates the time-varying adjacency matrix A t , which characterizes the dynamic dependencies among joints.
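A minimal NumPy sketch of this data-driven adjacency construction is shown below; the linear maps are stand-ins for the paper's learnable $\theta$ and $\varphi$ modules, and the row-wise softmax convention is an assumption of the sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_adjacency(F_t, W_theta, W_phi):
    """A_t = softmax(theta(F_t)^T . phi(F_t)).

    F_t: (C, V) joint features at time t; W_theta, W_phi play the role
    of the learnable linear transformations. After the softmax, each
    row of A_t (the edge weights into one target node) sums to one.
    """
    theta = W_theta @ F_t            # (C, V)
    phi = W_phi @ F_t                # (C, V)
    return softmax(theta.T @ phi)    # (V, V) time-varying adjacency

rng = np.random.default_rng(1)
C, V = 8, 5
A_t = adaptive_adjacency(rng.standard_normal((C, V)),
                         rng.standard_normal((C, C)),
                         rng.standard_normal((C, C)))
```

Because $A_t$ is recomputed from $F_t$ at every time step, the graph topology tracks the evolving joint dependencies of the action rather than the fixed skeleton.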
To validate the effectiveness of the data-driven adaptive graph topology, we visualize the learned adjacency matrices corresponding to different ground crew gesture commands in the experimental section. Through comparative analysis, significant differences can be observed in the joint dependency patterns learned by the model for different action types, demonstrating that the adaptive graph convolution is capable of capturing action-specific joint interaction relationships.

3.2.2. Adaptive Refinement Feature Activation Mechanism

The adaptive graph convolution introduced above enables the acquisition of spatial relationships among nodes by modeling graph structures. However, it does not fully exploit the interdependencies among joint features. Inspired by Sun et al. [30], we introduce an Adaptive Refinement Feature Activation Mechanism (ARFAM) into the feature aggregation stage of adaptive graph convolution. This mechanism aims to more effectively leverage both global and local motion representations during network construction, thereby optimizing information flow among nodes and enhancing the allocation and activation of high-dimensional motion features. Although the squeeze-and-excitation (SE) channel attention mechanism has been widely adopted, its reliance on fully connected layers to capture global information limits effective interaction with local information, resulting in suboptimal feature weighting. To address this limitation, ARFAM incorporates an adaptive refinement channel regulation strategy into the GCN feature aggregation stage. This design refines the integration of global and local information, enabling a more reasonable distribution of feature weights and ultimately yielding more discriminative feature representations. As illustrated in Figure 5, the proposed ARFAM consists of two main components: Information Cross Mapping (ICM) and Dynamic Fusion with Learnable Factors ($\beta$-DF).
The workflow of ARFAM is as follows. First, for the input feature $X \in \mathbb{R}^{C \times T \times V}$ (where $C$ denotes the number of channels, $T$ the number of temporal frames, and $V$ the number of joints), global average pooling aggregates information across the spatio-temporal dimensions, yielding a channel descriptor of dimension $C \times 1 \times 1$. In the ICM module, this descriptor is fed into two parallel pathways: the first pathway employs a diagonal matrix $D \in \mathbb{R}^{C \times C}$ for transformation, capturing long-range dependencies among channels through channel-wise weighting to generate the global channel descriptor $U_{gc} \in \mathbb{R}^{C \times 1}$; the second pathway utilizes a band matrix $B \in \mathbb{R}^{C \times C}$ for transformation, capturing local interaction patterns between adjacent channels to generate the local channel descriptor $U_{lc} \in \mathbb{R}^{C \times 1}$. Subsequently, in the $\beta$-DF module, a $C \times C$ cross-correlation matrix is constructed via a matrix outer product. This module adopts dual-pathway parallel processing: the upper pathway performs row-wise summation on the cross-correlation matrix, reducing the $C \times C$ matrix to a $C \times 1$ vector, followed by Sigmoid activation and element-wise multiplication with $\sigma(\beta)$; the lower pathway performs the same summation on the transposed cross-correlation matrix, similarly reducing it to a $C \times 1$ vector followed by Sigmoid activation, and element-wise multiplication with $1 - \sigma(\beta)$, where $\beta$ is a learnable scalar parameter. Finally, the outputs from both pathways are fused through element-wise addition (⊕), followed by Sigmoid activation to obtain the final channel attention weights ($C \times 1 \times 1$), thereby achieving adaptive dynamic balancing between global and local features.
This entire process enables the network to dynamically adjust the contribution ratios of global and local features according to the characteristics of input data, thus generating more discriminative channel attention weights. The detailed mathematical formulation of ARFAM is presented below.
In the information cross-mapping component, the input feature is $X \in \mathbb{R}^{C \times V \times T}$, where $C$ denotes the number of channels, $V$ the number of joints, and $T$ the number of temporal frames. First, the feature containing global information (temporal dimension $T$ and joint dimension $V$) is transformed into a channel descriptor $U$ through global average pooling.
$$U_n = \frac{1}{T \times V} \sum_{t=1}^{T} \sum_{v=1}^{V} X_{n,t,v}$$
where $X_{n,t,v}$ denotes the feature value of the $n$-th channel at time step $t$ and node $v$, and $U$ represents the global feature representation. A band matrix $B = [b_1, b_2, \ldots, b_m]$ is employed for local channel interaction, where $m$ denotes the number of local channels. The local channel descriptor is computed as:
$$U_{lc} = \sum_{i=1}^{m} U \cdot b_i$$
where $U_{lc}$ denotes the local channel descriptor, and $m$ is the number of local channel interaction terms. To further enhance the representation of global information, a diagonal matrix $D = [d_1, d_2, \ldots, d_c]$ is used to capture dependencies among channels and extract refined global information. The corresponding descriptor is defined as
$$U_{gc} = \sum_{i=1}^{c} U \cdot d_i$$
where $U_{gc}$ denotes the global channel descriptor, and $c$ represents the number of global channels. Through cross-scale association operations, the adaptive refined feature mechanism captures deep correlations across different granularity levels:
$$M = U_{gc} \cdot U_{lc}^{\mathsf{T}}$$
where $M$ denotes the cross-correlation matrix. The dynamic fusion based on the learnable factor $\beta$ allocates features while reducing computational complexity, with the fusion optimized through learnable parameters:
$$\left( U_{gc}^{w} \right)_i = \sum_{j=1}^{c} M_{i,j}, \quad i \in \{1, 2, \ldots, c\}$$
$$\left( U_{lc}^{w} \right)_i = \sum_{j=1}^{c} \left( U_{lc} \cdot U_{gc}^{\mathsf{T}} \right)_{i,j} = \sum_{j=1}^{c} M^{\mathsf{T}}_{i,j}, \quad i \in \{1, 2, \ldots, c\}$$
$$U_w = \sigma\left( \sigma(\beta) \cdot \sigma\left( U_{gc}^{w} \right) + \left( 1 - \sigma(\beta) \right) \cdot \sigma\left( U_{lc}^{w} \right) \right)$$
where $U_{gc}^{w}$ and $U_{lc}^{w}$ represent the fused global and local channel weights, respectively, $M_{i,j}$ and $M^{\mathsf{T}}_{i,j}$ denote the $(i,j)$-th elements of the matrix $M$ and its transpose, $\beta$ is a learnable parameter, $\sigma$ represents the Sigmoid function, and $U_w$ denotes the final channel attention weights. Equation (16) achieves adaptive fusion of global and local channel weights. The learnable parameter $\beta$, after Sigmoid activation, is mapped to the interval $(0, 1)$ and serves as a dynamic gating factor: when $\sigma(\beta)$ approaches 1, the model emphasizes the long-range dependencies captured by the global descriptor $U_{gc}^{w}$; when $\sigma(\beta)$ approaches 0, it focuses on the fine-grained features of the local descriptor $U_{lc}^{w}$. Both $U_{gc}^{w}$ and $U_{lc}^{w}$ are normalized through Sigmoid activation to ensure gradient stability. The final output $U_w \in (0, 1)^C$ recalibrates the graph convolutional features through element-wise multiplication, enabling the network to automatically adjust the weighting ratio between global and local features according to the input data. Through information cross-mapping and dynamic fusion based on learnable factors, the adaptive graph convolutional module efficiently integrates global and local action representations across different granularities.
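The full ARFAM weight computation can be condensed into one function; the random matrices, the tridiagonal choice for the band matrix, and the toy tensor sizes are assumptions of this sketch rather than the trained parameters:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def arfam_weights(X, D, B, beta):
    """Channel attention weights following the ARFAM equations.

    X: (C, T, V) feature map; D: (C, C) diagonal (global) map;
    B: (C, C) band (local) map; beta: learnable scalar gate.
    """
    U = X.mean(axis=(1, 2))            # global average pooling -> (C,)
    U_gc = D @ U                       # global channel descriptor
    U_lc = B @ U                       # local channel descriptor
    M = np.outer(U_gc, U_lc)           # cross-correlation matrix (C, C)
    U_gc_w = M.sum(axis=1)             # row-wise sum of M     -> (C,)
    U_lc_w = M.T.sum(axis=1)           # row-wise sum of M^T   -> (C,)
    g = sigmoid(beta)                  # dynamic gating factor in (0, 1)
    return sigmoid(g * sigmoid(U_gc_w) + (1 - g) * sigmoid(U_lc_w))

C, T, V = 6, 4, 5
rng = np.random.default_rng(2)
X = rng.standard_normal((C, T, V))
D = np.diag(rng.standard_normal(C))
Bm = np.triu(np.tril(rng.standard_normal((C, C)), 1), -1)  # bandwidth-1 band matrix
w = arfam_weights(X, D, Bm, beta=0.0)  # beta=0 gives an even 0.5/0.5 split
```

Multiplying the graph-convolution output by `w` channel-wise then performs the recalibration described above.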

3.2.3. Joint Semantic Adaptive Graph Convolutional Spatial Modeling Module

In the scenario of UAV ground crew marshalling action recognition, many gestures exhibit highly similar joint trajectories in skeleton representations, particularly in terms of joint position changes and motion directions, which often overlap significantly. For example, the "Normal stop" command and the "Emergency stop" command both involve raising both arms and crossing them above the head; similarly, the "Identify gate" command (2) and the "Move upwards" command (22) may both exhibit an arm-raised spatial distribution. When relying solely on the spatial positions and motion trajectories of joints, it becomes challenging to clearly distinguish such similar actions. Semantic information can aid in understanding the contextual meaning of the gestures by more precisely localizing specific joints. Incorporating joint-type semantic information, which reflects their functional roles within the human body structure, can semantically enhance the model's understanding of how different joints contribute to the execution of actions. Spatial joint-type semantic information is encoded using a one-hot representation of node types. For the $k$-th joint, a one-hot vector $j_k \in \mathbb{R}^{d_j}$ is used, where the $k$-th dimension is set to 1 and all others are set to 0. The spatial joint-type semantic encoding is concatenated with the fused data and fed as input into different stages of the network. Since concatenation increases the number of parameters, a single convolutional layer is introduced after concatenation to control the parameter size, formulated as:
$$Z_{\mathrm{in}} = \sigma\left( \mathrm{BN}\left( W_1 \left[ F, \tilde{J} \right] \right) + b_1 \right)$$
where $W_1 \in \mathbb{R}^{C_{\mathrm{out}} \times (C_F + C_J)}$, and $C_F$ and $C_J$ denote the number of channels in $F$ and $\tilde{J}$, respectively. Equation (17) describes the integration process of semantic information. The one-hot encoding of joint types $\tilde{J}$ is concatenated with the fused features $F$ along the channel dimension to form $[F, \tilde{J}]$. Since concatenation increases the number of parameters, a $1 \times 1$ convolutional layer ($W_1$, $b_1$) is introduced for dimensionality reduction through projection. Batch normalization (BN) is employed to stabilize training and accelerate convergence, while the Sigmoid activation $\sigma$ introduces nonlinearity and constrains the output to the interval $(0, 1)$. The resulting $Z_{\mathrm{in}}$ integrates motion features with joint structural semantics, enabling the network to distinguish between gesture actions that exhibit similar spatial trajectories but differ in functionality. To realize effective information exchange among joints, we construct a three-layer residual Adaptive Graph Convolutional Network (AGCN). The network leverages the data-driven adaptive topology and incorporates the Adaptive Refinement Feature Activation Mechanism (based on information cross mapping and dynamic fusion with learnable factors) to enhance the discriminative power of key features. The structure of each layer is illustrated in Figure 6, and the core computations are expressed as:
$$Y_t = A_t Z_t W_y$$
$$Y'_t = U_w \otimes Y_t$$
$$Z'_t = Y'_t + Z_t W_z$$
At each time step, $Z_t$ denotes the input features and a unique adjacency matrix $A_t$ is learned in a data-driven fashion. The output $U_w$ of the adaptive refinement feature activation mechanism is combined with the graph convolutional output $Y_t$ via element-wise multiplication, denoted by $\otimes$. Here, $W_y$ and $W_z$ denote transformation matrices shared across different time steps, and the final output is denoted as $Z'_t$. By stacking three residual AGCN layers sequentially, the network facilitates deeper message passing among nodes with dynamically evolving connectivity structures.
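One residual layer of this per-time-step computation can be sketched as follows; the identity transforms and uniform adjacency are toy stand-ins for the learned $W_y$, $W_z$, and $A_t$:

```python
import numpy as np

def agcn_layer(Z_t, A_t, W_y, W_z, U_w):
    """One residual AGCN layer at a single time step.

    Z_t: (V, C) input features; A_t: (V, V) learned adjacency;
    W_y, W_z: (C, C') transforms shared across time steps;
    U_w: (C',) ARFAM channel weights (broadcast over joints).
    """
    Y_t = A_t @ Z_t @ W_y          # graph convolution with dynamic topology
    Y_t = U_w * Y_t                # ARFAM channel recalibration
    return Y_t + Z_t @ W_z         # residual connection -> Z'_t

rng = np.random.default_rng(3)
V, C = 5, 8
Z = rng.standard_normal((V, C))
A = np.full((V, V), 1.0 / V)       # toy row-normalized adjacency
out = agcn_layer(Z, A, np.eye(C), np.eye(C), np.ones(C))
```

Stacking three such layers, each with its own learned $A_t$, deepens the message passing exactly as described above.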

3.2.4. Frame-Index Semantic Temporal Feature Modeling Module

In the temporal dimension, frame indices can serve as semantic information to characterize the dynamic variations of joints over time, thereby explicitly indicating the sequential order of action execution and providing additional discriminative cues for temporal modeling. For instance, the “Chocks inserted” action (11) and the “Chocks removed” action (12) both exhibit a spatial trajectory of raising both arms followed by a thrusting motion. However, they differ markedly in directionality: the former involves both arms gradually moving inward until contact, whereas the latter is characterized by both arms moving outward from the center. By incorporating frame indices, the model is able to capture such directional differences in action evolution over time.
The temporal joint index semantics are encoded using a one-hot strategy, similar to the process of node-type encoding. For the $t$-th temporal feature, a one-hot vector $f_t \in \mathbb{R}^{d_t}$ is used, where the $t$-th dimension is set to 1 and the others are set to 0. After encoding, the semantic vector is concatenated with the fused feature data and used as network input. To prevent an excessive increase in the number of parameters caused by concatenation, a single convolutional layer is applied for dimensionality control.
The resulting representation after adaptive convolution is denoted as Z o u t . Finally, Z o u t is concatenated with the temporal index vector T, followed by a transformation into higher-dimensional space for subsequent processing.
$$I = \sigma\left( \mathrm{BN}\left( W_1 \left[ Z_{\mathrm{out}}, \tilde{T} \right] \right) + b_1 \right)$$
where $W_1 \in \mathbb{R}^{C_{\mathrm{out}} \times (C_{Z_{\mathrm{out}}} + C_{\tilde{T}})}$. To further capture the temporal evolution patterns of actions, a temporal convolutional module is designed, as illustrated in Figure 7. This module takes spatio-temporal features with dimensions $C \times T \times V$ as input, where $C$ denotes the number of channels, $T$ represents the number of time steps, and $V$ indicates the number of joints. The module employs a three-layer convolutional architecture to model the temporal dimension: the first and third layers utilize $1 \times 3$ convolutional kernels to capture short-term dependencies between adjacent frames, while the second layer applies a $1 \times 1$ convolutional kernel for channel-wise transformation to enhance feature representation capability.
After the temporal convolution process, a temporal pooling layer aggregates information across all time steps, reducing the temporal scale to 1. This operation is achieved by applying max pooling along the temporal dimension, yielding an output with dimensions C × 1 × V , which generates a comprehensive feature representation that summarizes information from the entire sequence. The aggregated feature vector is subsequently fed into a fully connected layer to map the representation into the classification space of various action categories.
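The temporal aggregation step reduces to a max over the time axis; the tensor sizes below (256 channels, 20 frames, 25 joints) mirror the configuration described in the experiments but are otherwise illustrative:

```python
import numpy as np

def temporal_max_pool(X):
    """Aggregate a (C, T, V) feature map over time via max pooling,
    yielding a (C, 1, V) summary of the whole sequence."""
    return X.max(axis=1, keepdims=True)

rng = np.random.default_rng(4)
X = rng.standard_normal((256, 20, 25))   # C x T x V
pooled = temporal_max_pool(X)            # C x 1 x V
```

Flattening `pooled` and feeding it to a fully connected layer with Softmax then yields the per-class scores.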

3.3. Computational Complexity Analysis

Although the multi-order feature modeling and adaptive graph convolutional modules introduce additional computational overhead, architectural optimizations keep the complexity controllable. For multi-order feature modeling, the five feature streams adopt parallel independent encoding followed by concatenation (Equation (8)), with a complexity of $O(5VT)$, avoiding the exponential growth of $O(V^3 T)$ associated with full combinatorial approaches. Angular encoding computes only three representative feature types, reducing the complexity from $O(V^3)$ to $O(3V)$. After feature concatenation, dimensional compression is performed through $1 \times 1$ convolutions (Equation (17)), with a parameter count of $(C_F + C_J) \times C_{\mathrm{out}}$, which is significantly smaller than the $V^2 T^2$ scale of fully connected layers.
For the adaptive graph convolutional module, data-driven adjacency matrix learning (Equation (9)) computes node similarity through inner products, with a complexity of $O(V^2 C)$. Although Transformer-style global self-attention has the same $O(V^2 C)$ asymptotic order, our method reduces the practical computational cost by approximately 50% through sparse graph structures. The ARFAM mechanism employs the diagonal matrix $D$ and band matrix $B$ for channel-wise transformation (Equations (11) and (12)), with parameter counts of $c$ and $m \cdot c$, avoiding fully connected parameters at the $c^2$ scale. Cross-correlation matrix operations (Equations (13)–(16)) are executed only in the channel dimension, with a complexity of $O(c^2)$, decoupled from the spatial dimension $V$ and the temporal dimension $T$.
Theoretically, the complexity of a single traditional graph convolution layer is $O(|E| C^2)$, while the total complexity of our three-layer AGCN is $O(3|E| C^2 + V^2 C)$. Due to the sparsity of the skeleton graph ($|E| \approx 2V$), the overhead of the data-driven adjacency matrix, $O(V^2 C)$, accounts for approximately 15%. Ablation experiments (Section 4.2.2) validate the effectiveness of the design: the multi-order feature configuration achieves a performance-efficiency ratio of 3.16 (+15.8% accuracy / 5.0× FLOPs), and the ARFAM module brings a 1.3% accuracy improvement with only a 4.7% FLOPs increase. Edge device deployment experiments (Section 5.2.4) demonstrate that our method achieves real-time processing at 40 FPS on an NVIDIA Jetson Xavier NX, verifying the engineering feasibility of the computational complexity. In summary, through strategies such as parallel encoding, sparse topology, and channel-level operations, our approach controls computational overhead within a reasonable range while introducing innovative mechanisms, achieving a balance between academic rigor and practical applicability.

4. Experiments

This work first evaluates the proposed action recognition method on two widely used public skeleton-based action recognition datasets (NTU-RGB+D 60 and NTU-RGB+D 120) to validate its effectiveness and generalizability on standard benchmarks.

4.1. Datasets and Experimental Settings

4.1.1. NTU-RGB+D 60/120 Datasets

We evaluate the performance of our model on two large-scale benchmark datasets for skeleton-based action recognition. The NTU-RGB+D 60 dataset contains 56,000 samples across 60 action categories, including daily activities, health-related actions, and interaction actions. The data are captured using the Microsoft Kinect V2 sensor, which records accurate 3D positions of 25 body joints. Compared to RGB image-based pose estimation methods, depth sensors directly measure joint spatial coordinates through infrared structured light, avoiding the effects of visual occlusion and illumination variations, thus achieving higher joint localization accuracy (average localization error < 2 cm). Three camera viewpoints are employed: a frontal view (0°) and two side views (±45°). NTU-RGB+D 120 is an extended version comprising 113,945 samples over 120 action categories. The dataset involves participants from 15 countries and recordings under 32 different camera setups, providing greater complexity and diversity in action representation.
For experimental settings, two standard evaluation protocols are adopted for NTU-RGB+D 60: (1) Cross-Subject (X-Sub), where 40 subjects are evenly divided into training and testing groups to generate training and testing samples, respectively, and (2) Cross-View (X-View), where data captured from the frontal view and the +45° view are used for training, while data from the −45° view are used for testing. For NTU-RGB+D 120, the Cross-Subject (X-Sub) and Cross-Setup (X-Set) protocols are employed. In X-Sub, 106 subjects are divided into two groups, while in X-Set, samples captured from even-numbered camera setups are used for training and those from odd-numbered setups are used for testing. Examples of data samples and skeletal representations are shown in Figure 8.

4.1.2. Experimental Settings

Regarding the experimental platform configuration, all experiments are conducted on a workstation equipped with an Intel Core i7-10700 CPU and an NVIDIA GeForce RTX 3070 GPU, running Ubuntu 20.04 LTS operating system. The deep learning framework employed is PyTorch 2.0.1, complemented by CUDA 11.8 and cuDNN 8.7. This hardware configuration is comparable to mainstream experimental platforms in the field of skeleton-based action recognition in recent years.
For the network architecture configuration, in the data fusion stage, the concatenated encoded features result in increased dimensionality, which is then reduced to 128 dimensions using a convolutional layer. The three-layer Adaptive Graph Convolutional Network (AGCN) adopts a data-driven graph structure, with feature dimensions set to 128, 256, and 256, respectively. In the temporal processing stage, the Temporal Convolutional Network (TCN) module further extracts features with dimensions of 256, 512, and 512. A temporal pooling layer then compresses the temporal dimension to 1. Finally, a fully connected layer followed by a Softmax function maps the representation to the number of action categories, generating the final classification results.
For the training configuration, the experiments are implemented in Python 3.8 using the PyTorch 2.0.1 deep learning framework. All models are trained with the same hyperparameters: a batch size of 64, determined by the trade-off between the 8 GB memory constraint of the RTX 3070 and training efficiency (at this batch size, a single forward pass occupies approximately 6.8 GB of memory); an initial learning rate of 0.001, following the common configuration for the Adam optimizer in graph convolutional networks; and the Adam optimizer itself. Training is performed for a total of 120 epochs, with the learning rate decayed by a factor of 10 at the 60th, 90th, and 110th epochs. This learning rate decay strategy is consistent with mainstream methods in the field of skeleton-based action recognition [31], facilitating rapid convergence in the early training phase and fine-tuning in the later stage. To mitigate overfitting, a Dropout regularization mechanism with a rate of 0.2 is applied. While the typical Dropout value in graph convolutional networks is 0.5 [32], we adopt a relatively conservative 0.2 to maintain sufficient feature representation capability in the five-feature fusion architecture and avoid excessive suppression of the spatial propagation of high-order features. The ReLU activation function is adopted to enhance the non-linear representation capacity, and cross-entropy loss is used to optimize the network.
For data preprocessing, we adopt the original depth skeleton coordinates provided by the dataset. Abnormal samples caused by sensor tracking failures are filtered out based on the following criteria: (1) all joint coordinates are zero; (2) joint coordinates exceed reasonable physical ranges; (3) abnormal joint displacements between adjacent frames, ensuring the reliability of input data. Incomplete data are removed and noise is reduced through a two-step denoising process based on frame length and diffusion. If a frame contains two persons, it is split into two separate frames so that each sequence corresponds to a single individual. Moreover, each skeleton sequence is randomly divided into 20 segments, and one frame is randomly selected from each segment to form a new sequence of 20 frames. This sampling strategy is determined based on the video frame rate of the NTU-RGB+D dataset (25 FPS), where 20 frames correspond to approximately 0.8 s of action duration, sufficient to cover the complete execution of most single actions (e.g., waving, sitting down, etc.) while unifying the dimensions of input sequences with different lengths.

4.2. Experimental Results and Analysis

4.2.1. Comparative Experiments

To evaluate the performance of the proposed enhanced feature-based adaptive refinement graph convolutional network for action recognition, we conduct experimental validation on the NTU-RGB+D 60 public dataset. Table 1 presents the comparison results with various mainstream graph convolutional network-based action recognition methods on the NTU-RGB+D 60 dataset, demonstrating that our method exhibits performance advantages under both X-Sub and X-View settings. Compared with the ST-GCN method, our method achieves Top-1 accuracy improvements of 7.9% and 5.9% under the X-Sub and X-View settings, respectively, validating its capability in modeling skeletal feature relationships and distinguishing complex action categories. Compared with the HCN method, our method achieves Top-1 accuracy improvements of 2.9% and 3.1% under the X-Sub and X-View settings, respectively, demonstrating its superiority in capturing action details and spatial modeling. Compared with the AS-GCN method, our method achieves a Top-1 accuracy improvement of 2.6% under the X-Sub setting, further confirming the effectiveness of our approach. Ultimately, our algorithm achieves Top-1 accuracies of 89.4% (X-Sub) and 94.2% (X-View) on the NTU-RGB+D 60 dataset, obtaining the best performance under the X-Sub setting, matching AS-GCN under the X-View setting, and only 0.1% lower than ST-TR.
Compared with current mainstream methods, our algorithm demonstrates lightweight and efficient characteristics while maintaining competitive accuracy. The visualization of model parameters and Top-1 accuracy under the X-Sub setting is shown in Figure 9. Our algorithm has 1.7 M parameters, more than an order of magnitude fewer than the AGC-LSTM method, and achieves 89.4% (X-Sub) accuracy with this compact parameter budget, realizing a favorable balance between performance and computational complexity. Compared with other methods of similar Top-1 accuracy, such as the ST-TR method (6.2 M parameters, 89.3% accuracy) and the SGN method (0.7 M parameters, 88.4% accuracy), our algorithm achieves a better trade-off between parameter count and accuracy, demonstrating the superiority of its architectural design and providing an effective solution for optimizing real-time performance and resource utilization.
The confusion matrices of our algorithm under different evaluation settings on the NTU-RGB+D 60 dataset are shown in Figure 10. In the confusion matrices, rows represent predicted categories, columns represent ground truth categories, and color intensity reflects the classification accuracy. The dark regions along the main diagonal indicate the proportion of accurate predictions for each category, while the colors at off-diagonal positions represent the proportion of misclassifications. The two confusion matrices corresponding to the NTU-RGB+D 60 dataset exhibit darker colors along the main diagonal, reflecting the excellent discriminative capability of our algorithm in handling actions with inter-class similarity and intra-class variability, thereby validating the robustness and adaptability of our algorithm.
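The per-class accuracies summarized by the confusion matrices can be reproduced from raw predictions with a few lines of NumPy. The following minimal sketch (the function name `confusion_matrix` and the integer label encoding are illustrative, not from the paper) follows the layout described above, with rows as predicted categories and columns as ground-truth categories:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Row = predicted category, column = ground-truth category,
    matching the layout described for Figure 10; columns are
    normalized so the diagonal holds per-class accuracy."""
    m = np.zeros((num_classes, num_classes), dtype=np.float64)
    for t, p in zip(y_true, y_pred):
        m[p, t] += 1.0
    col_sums = m.sum(axis=0, keepdims=True)
    return m / np.maximum(col_sums, 1.0)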

4.2.2. Ablation Studies

To validate the effectiveness of each component in the proposed method, we conduct systematic ablation studies from three perspectives: multi-level feature modeling, adaptive refined feature activation mechanism with semantic information, and network depth.
(1)
Effectiveness of Multi-Level Feature Modeling
Table 2 presents the performance and computational costs of different feature combinations on the NTU-RGB+D 60 and NTU-RGB+D 120 datasets. When using only the joint static domain (Joint) or bone static domain (Bone) information, the Top-1 accuracy is the lowest, achieving 73.6% and 65.3% respectively under the X-Sub setting of NTU-RGB+D 60, with FLOPs of 11.2 M and 11.3 M. This indicates that single static domain features are insufficient for comprehensively representing action characteristics. Incorporating dynamic domain information yields significant performance improvements: the Joint+Joint_vel combination achieves accuracies of 86.9% and 92.7% under X-Sub and X-View settings, respectively, with FLOPs increasing to 22.6 M; the Bone+Bone_vel combination improves to 86.5% and 89.6%, also requiring 22.6 M FLOPs. These results validate the positive impact of dynamic domain information on action recognition. The combination of static and dynamic domain features exhibits complementarity, with the Joint+Bone combination achieving 86.0% and 92.1% accuracy at 22.5 M FLOPs, demonstrating that the two-level features can mutually complement each other to improve action representation. The four-stream features (Joint, Joint_vel, Bone, Bone_vel) collectively achieve accuracies of 89.2% and 93.8% under X-Sub and X-View settings on NTU-RGB+D 60, and 80.9% and 82.2% under X-Sub and X-Set settings on NTU-RGB+D 120, with 45.1 M FLOPs, representing substantial improvements over single features. The experimental results demonstrate that integrating low-level static and dynamic domain features effectively enhances the model’s ability to discriminate between inter-class similar actions and improves recognition accuracy for intra-class varied actions.
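As a concrete illustration of the lower-order streams in Table 2, the bone static domain and the dynamic (velocity) domains can be derived from raw joint coordinates by simple differencing. This is a minimal NumPy sketch under assumed conventions (a toy 5-joint parent table and joints stored as a (T, V, C) array); the paper's actual preprocessing may differ in detail:

```python
import numpy as np

# Hypothetical parent table for a 5-joint chain; the real NTU-RGB+D
# skeleton has 25 joints with its own parent definition.
PARENTS = [0, 0, 1, 2, 3]

def bone_stream(joints, parents):
    """Bone static domain: vector from each joint's parent to the joint."""
    joints = np.asarray(joints, dtype=np.float64)  # shape (T, V, C)
    return joints - joints[:, parents, :]

def velocity_stream(x):
    """Dynamic domain: first-order temporal difference, zero-padded at
    frame 0 so the stream keeps the (T, V, C) shape of its source."""
    x = np.asarray(x, dtype=np.float64)
    v = np.zeros_like(x)
    v[1:] = x[1:] - x[:-1]
    return v
```

Applying `velocity_stream` to the output of `bone_stream` yields the Bone_vel stream, giving the four lower-order streams (Joint, Joint_vel, Bone, Bone_vel) evaluated above.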
Table 3 presents the performance and computational cost of different angle encoding types on the NTU-RGB+D 60 dataset. Three types of angle encodings, each built on the joint static domain (Joint) information, are evaluated: local angles (+Local), center-oriented angles (+Center), and pairwise joint angles (+Pair), together with their combination (+All). The introduction of angle features leads to significant performance improvements, with Top-1 accuracy gains ranging from 9.2% to 10.0% under the X-Sub protocol and from 6.3% to 6.5% under the X-View protocol.
In terms of computational cost, the baseline Joint feature requires 11.2 M FLOPs, while incorporating different angle encoding types increases FLOPs to approximately 22.5–23.0 M. Although this represents roughly a twofold increase in computational overhead, the performance gains are substantial. The combined angle encoding feature (+All) achieves the best performance, with Top-1 accuracies of 83.7% (X-Sub) and 89.9% (X-View) at 23.1 M FLOPs, outperforming any single angle feature used in isolation. This demonstrates that combined angle encodings can more comprehensively capture relative motion information in actions. By enhancing discriminative features for action recognition in scenarios with subtle action details and minimal joint position variations, the combined angle encodings effectively reduce the impact of inter-subject variability on model performance.
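A single helper suffices to sketch all three angle encodings, since each measures the angle at a joint spanned by a chosen pair of reference joints. The pairings noted in the docstring are our reading of the three encoding types; the paper's exact joint pairs are not specified here:

```python
import numpy as np

def joint_angle(joints, j, a, b, eps=1e-6):
    """Angle (radians) at joint j spanned by joints a and b, per frame.
    With (a, b) as the two skeletal neighbours of j this gives a local
    angle; with torso-centre joints it gives a centre-oriented angle;
    with an arbitrary joint pair it gives a pairwise joint angle.
    joints has shape (T, V, C)."""
    joints = np.asarray(joints, dtype=np.float64)
    u = joints[:, a] - joints[:, j]
    w = joints[:, b] - joints[:, j]
    cos = (u * w).sum(-1) / np.maximum(
        np.linalg.norm(u, axis=-1) * np.linalg.norm(w, axis=-1), eps)
    return np.arccos(np.clip(cos, -1.0, 1.0))
```

Because the encoding depends only on relative directions, it is insensitive to absolute joint positions, which is why it helps in the low-motion, subject-varied cases discussed next.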
A further analysis of the ten actions with the largest Top-1 accuracy improvements under the X-Sub protocol is presented in Table 4. For instance, the headache action (A44) exhibits the largest gain of 37.7%. Both the headache (A44) and neck pain (A47) actions involve interactions between the hands and the head/neck, with highly similar joint positions, making them difficult to distinguish using only joint coordinates. This aligns with daily observations: headache actions typically involve more pronounced angular changes in the elbow, shoulder, and hand, especially in local elbow flexion, compared to neck pain. Such differences can be effectively captured by center-oriented and pairwise joint angles.
For the pick up action (A6), accuracy increases from 55.6% to 81.5%, an improvement of 25.9%. Distinguishing between pick up (A6) and kick something (A24) relies on the different motion ranges and angular variations of the arms and legs, where center-oriented and pairwise joint angles provide critical cues. Similarly, in fine-grained actions with limited motion such as writing (A12) and reading (A11), angular features help discriminate subtle differences in hand and upper-limb movements. For instance, the accuracy of writing (A12) improves from 24.6% to 45.6%, an increase of 21.0%.
In distinguishing between pick up (A6) and put on shoe (A16), angular encodings again prove effective. Pick up (A6) typically involves larger angular variations in the upper limbs, particularly in the shoulders and elbows, along with noticeable wrist and finger movements when grasping an object. In contrast, put on shoe (A16) primarily involves the lower limbs, especially knee and ankle angles. Although both actions involve bending and extension, the critical joints and angular ranges differ. The inclusion of angular encodings provides higher-order information for capturing such distinctions, offering valuable guidance for recognizing actions with similar joint positions and overlapping movement patterns.
Building upon the four-stream lower-order features, we further investigate the synergistic effect between the static domain (Angle) and dynamic domain (Angle_vel) of the third-order angle encodings. As shown in Table 2, the introduction of angle static domain features yields improvements of 0.8% and 1.1% in Top-1 accuracy on the NTU-RGB+D 120 dataset under X-Sub and X-Set protocols, reaching 81.7% and 83.3%, respectively. The computational cost increases from 45.1 M to 56.5 M FLOPs, validating the effectiveness of angle encodings as a complementary feature to lower-order representations.
However, the angle dynamic domain feature (Angle_vel) does not contribute significant performance gains. Under the X-View protocol on NTU-RGB+D 60, it achieves the same accuracy as the static domain (94.2%), while exhibiting marginal performance degradation in other settings, with an additional computational overhead of approximately 11 M FLOPs. Considering the trade-off between performance and efficiency, we adopt a five-stream feature configuration comprising Joint, Joint_vel, Bone, Bone_vel, and Angle, which requires 56.5 M FLOPs and achieves optimal performance across both datasets.
(2)
Ablation Study on Adaptive Refined Feature Activation Mechanism and Semantic Information
To validate the applicability of the Adaptive Refined Feature Activation Mechanism (ARFAM) to graph convolutional modules, this work designs ablation experiments by introducing ARFAM at different depths. Specifically, GCNn(ARFAM) denotes the introduction of ARFAM only in the n-th stage, GCN(ARFAM) indicates the incorporation of ARFAM across all three stages, and w/o represents the baseline without any attention mechanism. Furthermore, this work employs joint type and frame index as semantic information to guide spatial–temporal structure representation. To verify the superiority of ARFAM over generic attention mechanisms, this work further compares it with single-head self-attention (SA) [8] and multi-head self-attention (MHSA) [23]. The experimental results are presented in Table 5.
Comparison with Other Attention Mechanisms. We compare ARFAM with two representative general-purpose attention mechanisms: SA [8] and MHSA [23]. On the NTU-RGB+D 60 dataset, SA achieves accuracies of 88.1% and 92.4%, representing decreases of 0.4% and 0.5% compared to the baseline (88.5% and 92.9%). Similarly, on NTU-RGB+D 120, SA underperforms the baseline (79.3% vs. 79.8% for X-Sub; 81.0% vs. 81.6% for X-Set). This observation aligns with the findings of Duan et al. [8], who demonstrated that the global dense computation inherent in vanilla self-attention is ill-suited for the sparse structure of skeleton sequences, thereby introducing redundant noise. MHSA partially mitigates this issue through multi-head parallel processing, achieving performance comparable to the baseline on NTU-RGB+D 60 (88.5% and 92.8%), albeit at a computational cost of 62.0 M FLOPs (+16.1%). In contrast, GCN(ARFAM) achieves 88.9% and 93.1% on NTU-RGB+D 60 (improvements of 0.4% and 0.2%), and 80.4% and 82.0% on NTU-RGB+D 120 (improvements of 0.6% and 0.4%), with only a 2.5 M FLOPs increase (+4.7%). In terms of performance-efficiency ratio, ARFAM (0.09) significantly outperforms MHSA (0.00) and SA (−0.03), validating the necessity of skeleton-specific attention mechanisms.
Effectiveness of ARFAM at Different Depths. Without semantic information, GCN(ARFAM) improves over the baseline by 0.4% and 0.2% on NTU-RGB+D 60, and by 0.6% and 0.4% on NTU-RGB+D 120. Depth-wise analysis reveals that GCN1(ARFAM) achieves 80.4% and 81.8% on NTU-RGB+D 120, performing comparably or slightly better than GCN2(ARFAM) and GCN3(ARFAM). With complete semantic information, GCN1(ARFAM) reaches 81.6% and 83.2% on X-Sub and X-Set, respectively, surpassing GCN3(ARFAM) by 0.2% and 0.3%. These results indicate that ARFAM exhibits more pronounced feature activation effects at shallower layers. The full configuration, with ARFAM applied across all three stages, synergistically combines shallow and deep activation mechanisms to achieve optimal performance on both datasets.
Impact of Semantic Information. Under the baseline configuration (w/o), incorporating complete semantic information (joint type + frame index) improves Top-1 accuracy on NTU-RGB+D 120 from 79.8% and 81.6% to 80.9% and 82.8% (gains of 1.1% and 1.2%), and from 88.5% and 92.9% to 88.9% and 93.8% on NTU-RGB+D 60 (gains of 0.4% and 0.9%). The contribution of semantic information becomes more pronounced when combined with the full GCN(ARFAM) configuration: on NTU-RGB+D 120, accuracy increases from 80.4% and 82.0% to 81.7% and 83.3% (gains of 1.3% and 1.7%), while on NTU-RGB+D 60, it improves from 88.9% and 93.1% to 89.4% and 94.2% (gains of 0.5% and 1.1%).
To disentangle the individual contributions of the two semantic cues, we conduct fine-grained ablations (last three rows in Table 5). On NTU-RGB+D 120, joint type alone yields improvements of 0.7% and 0.7%, while frame index contributes 0.5% and 0.5%; their combination achieves a 1.3% gain, demonstrating complementarity. On NTU-RGB+D 60, joint type exhibits slightly stronger effects (0.3% and 0.7% vs. 0.2% and 0.5%). Joint type provides spatial structural priors to distinguish actions with similar skeletal configurations, whereas frame index captures temporal ordering cues to differentiate actions with similar trajectories but opposite execution directions. Notably, the joint gain (1.3%) slightly exceeds the sum of individual contributions (1.2%), indicating positive synergy in spatiotemporal modeling.
Computational Efficiency. The computational overhead introduced by ARFAM and semantic information remains modest. The baseline model requires 53.4 M FLOPs. Applying ARFAM to a single stage increases this to 54.3 M (+1.7%), while applying it to all three stages results in 55.9 M FLOPs (+4.7%). Semantic information, designed with lightweight one-hot encoding followed by single-layer convolution, adds approximately 0.3 M FLOPs per modality (joint type or frame index), and 0.6 M (+1.1%) when both are used. The full configuration (GCN(ARFAM) + complete semantic information) totals 56.5 M FLOPs, representing only a 3.1 M increase (+5.8%) over the baseline. Despite this minimal overhead, the method achieves substantial accuracy improvements of 1.9% and 1.7% on NTU-RGB+D 120, validating its superior cost-effectiveness.
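The lightweight semantic branch described above (one-hot encoding followed by a single-layer convolution) can be sketched as follows. The tiling convention and the function name `semantic_codes` are illustrative assumptions; the codes produced here are intended as the channel-wise input to that convolution:

```python
import numpy as np

def semantic_codes(T, V):
    """One-hot joint-type codes (identical across frames) and one-hot
    frame-index codes (identical across joints), tiled to (T, V, .) so
    they can be concatenated with joint features channel-wise before a
    single-layer 1x1 convolution."""
    joint_type = np.tile(np.eye(V)[None, :, :], (T, 1, 1))    # (T, V, V)
    frame_index = np.tile(np.eye(T)[:, None, :], (1, V, 1))   # (T, V, T)
    return joint_type, frame_index
```

Because the codes are one-hot and the projection is a single layer, the added cost stays at roughly 0.3 M FLOPs per modality, consistent with the overhead figures quoted above.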
(3)
Network Depth Ablation Study
To investigate the impact of graph convolutional network depth on model performance, we conduct ablation experiments with varying numbers of layers (ranging from 1 to 5). All experiments are performed on the NTU-RGB+D 120 dataset with the Adaptive Refined Feature Activation Mechanism (ARFAM) and semantic information enabled. The experimental results are illustrated in Figure 11.
As illustrated in Figure 11, as the number of network layers increases from one to three, the model’s Top-1 accuracy improves steadily. Under the X-Sub and X-Set experimental settings, the one-layer graph convolutional network achieves Top-1 accuracies of 78.6% and 80.1%, respectively. The two-layer network improves these to 80.9% and 82.4%, while the three-layer network (adopted in this work) reaches 81.7% and 83.3%. These results demonstrate that moderately increasing network depth effectively expands the receptive field of nodes, enabling the model to capture long-range dependencies in the skeleton graph (e.g., coordinated movements between hands and feet), thereby enhancing action recognition accuracy.
However, when the network depth further increases to four and five layers, the model exhibits diminishing returns or even negative gains. The four-layer network achieves Top-1 accuracies of 81.9% (X-Sub) and 83.5% (X-Set). Although it outperforms the three-layer network by a marginal 0.2% on X-Sub, this minor improvement comes at the cost of a 20% increase in parameters (from 1.0 M to 1.2 M) and a 17% increase in inference time (from 4.6 ms to 5.4 ms). The five-layer network shows further performance degradation, with Top-1 accuracies dropping to 81.3% (X-Sub) and 82.9% (X-Set), falling short of the three-layer network’s performance.
This phenomenon can be attributed to the over-smoothing problem in graph neural networks. In the human skeleton graph, the average node degree is approximately 2.3, and the graph diameter (the farthest distance from head to feet) is approximately 6–8 hops. A three-layer graph convolutional network already enables each node to aggregate information within a 3-hop neighborhood, covering over 90% of long-range connections in the skeleton graph (e.g., coordinated relationships among head-torso-legs), which is sufficient to capture spatial dependencies of actions. When the network depth exceeds 4 layers, excessive information propagation causes node features from different body parts to become homogenized, undermining locally discriminative information (e.g., subtle angular differences between elbows and knees), thereby reducing the model’s capability to distinguish between inter-class similar actions.
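The receptive-field argument can be checked mechanically: breadth-first search gives hop distances on any skeleton graph, and the fraction of joint pairs within k hops quantifies what a k-layer network can aggregate. The sketch below is generic, and the 7-node chain in the test is a stand-in rather than the 25-joint NTU skeleton:

```python
from collections import deque

def hop_distances(adj, src):
    """BFS hop distances from src over an adjacency-list graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def khop_coverage(adj, k):
    """Fraction of ordered node pairs within k hops of each other,
    i.e. the share of long-range connections a k-layer GCN can reach."""
    n = len(adj)
    within = total = 0
    for s in range(n):
        d = hop_distances(adj, s)
        for t in range(n):
            if t != s:
                total += 1
                within += d.get(t, n + 1) <= k
    return within / total
```

Run on the actual skeleton adjacency list, `khop_coverage(adj, 3)` could be used to verify the 3-hop coverage claim made above.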
From a computational efficiency perspective, the three-layer network achieves an optimal balance between inference time (4.6 ms) and parameter count (1.0 M). Although the four-layer network shows a marginal accuracy improvement on X-Sub (+0.2%), its increased costs (parameters +20%, inference time +17%) are disproportionate to the performance gain, particularly in practical deployment scenarios on edge devices (e.g., Jetson Xavier NX), where such additional overhead is prohibitive. The performance degradation of the five-layer network (−0.4%) further validates the law of diminishing marginal returns for deeper networks.
In summary, the three-layer graph convolutional network achieves an optimal trade-off among receptive field coverage, feature discriminability, and computational efficiency. Therefore, this work adopts a three-layer residual adaptive graph convolutional network as the core architecture for the spatial modeling module.

5. Application Study: UAV Ground Crew Marshalling Action Recognition

After validating the effectiveness of the proposed method on general-purpose datasets, we further apply the model to the task of UAV ground crew marshalling action recognition. By constructing a dataset of UAV ground crew marshalling gestures in accordance with ICAO standards, the method is evaluated and analyzed in real-world scenarios to verify its applicability and reliability in practical applications.

5.1. Dataset Construction

5.1.1. Dataset Specifications and Development

UAV ground crew marshalling actions serve as a standardized form of non-verbal communication in aviation ground operations. By conveying information through specific gestures and the use of marshalling wands, they compensate for the limitations of verbal communication and ensure efficient interaction between pilots and ground personnel. These gestures are unified and standardized under ICAO regulations, providing universality and consistency. Figure 12 illustrates a subset of UAV ground crew marshalling actions along with their corresponding meanings as defined by ICAO standards.
To address the issue of data scarcity in UAV ground crew marshalling action recognition, we constructed a representative dataset of UAV ground crew marshalling gestures. Following the ICAO standards for UAV ground crew marshalling actions (as shown in Figure 12), we developed a dataset comprising 34 standardized gesture categories. Data collection involved five trained performers executing the standardized gestures. The recording was conducted in a spacious indoor environment with diverse background elements, including greenery, buildings, and runways. A fixed USB camera (1280 × 720, 25 FPS) was used, and the performers executed actions within a distance range of 2–10 m from the camera, facing different orientations. In total, 1883 video clips were recorded, covering all 34 gesture categories, with their distribution illustrated in Figure 13. To evaluate model performance, two experimental settings were adopted: (1) Cross-subject evaluation (X-Sub): The training and testing sets were divided based on different performers, simulating recognition across unseen subjects. (2) Hold-out evaluation (HO): The dataset was split into training and testing sets with an 8:2 ratio, repeated five times with random partitions, and the average results were reported to ensure evaluation reliability.
To enhance dataset diversity and evaluate the robustness of the proposed method under complex environmental conditions, we perform environmental augmentation on the original dataset. By employing image processing techniques to simulate typical interference factors that may be encountered in real airport environments, we construct an extended dataset encompassing diverse environmental conditions. Specifically, we generate five environmental variants for each original video sample: (1) Illumination: simulating illumination variations across different time periods (dawn, morning, noon, evening, dusk) through gamma correction and contrast adjustment; (2) Illumination+Haze: adding haze effects via atmospheric scattering models on top of illumination variations to simulate reduced visibility weather conditions; (3) Illumination+Rain: incorporating rain streak effects through motion blur and noise overlay techniques on top of illumination variations to simulate the impact of rainy weather on visual perception; (4) Shadow: adding shadows of other aircraft and buildings around the human body and on the ground through region darkening operations; (5) Occlusion: simulating partial occlusion scenarios in real-world scenes by randomly generating rectangular occluded regions, with occlusion areas accounting for 10–30% of the human detection bounding box and randomly distributed across upper limbs, lower limbs, or torso regions. Figure 14 illustrates sample examples of UAV ground crew command gestures captured or generated under different environmental conditions. These environmental variants comprehensively cover typical interference scenarios that may be encountered in actual airport operations.
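Two of the five variant families can be sketched compactly in NumPy: gamma-based illumination adjustment and random rectangular occlusion. This is a simplified stand-in for the actual augmentation pipeline (fixed seed, top-left-corner box convention, and black fill are assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def gamma_correct(img, gamma):
    """Illumination variant: gamma < 1 brightens, gamma > 1 darkens.
    img is a float array scaled to [0, 1]."""
    return np.clip(img, 0.0, 1.0) ** gamma

def random_occlusion(img, box, min_frac=0.10, max_frac=0.30):
    """Occlusion variant: black out a random rectangle covering
    10-30% of the person bounding box; box = (x, y, w, h) with
    (x, y) the top-left corner in this sketch."""
    x, y, w, h = box
    frac = rng.uniform(min_frac, max_frac)
    ow = max(1, int(w * np.sqrt(frac)))
    oh = max(1, int(h * np.sqrt(frac)))
    ox = x + int(rng.integers(0, max(1, w - ow + 1)))
    oy = y + int(rng.integers(0, max(1, h - oh + 1)))
    out = img.copy()
    out[oy:oy + oh, ox:ox + ow] = 0.0
    return out
```

The haze, rain, and shadow variants follow the same per-frame pattern with atmospheric-scattering, motion-blur, and region-darkening operators, respectively.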
Through environmental augmentation, the dataset expands from the original 1883 video samples to 37,660 samples (each original clip plus its 19 environmental variants). The augmented dataset is utilized to enhance model training and improve the model’s adaptability to complex environmental conditions. During the training phase, we mix original samples and environmentally augmented samples at a ratio of 8:2.
To support the recognition of action sequences with varying durations, this work employs a sliding window strategy to extract skeleton sequence samples from original videos. To balance training sample coverage and computational efficiency, the starting-frame interval between adjacent samples is set to Δ = 5 frames. Given a target sequence length of L frames, the i-th sample comprises the L consecutive frames S_i = {f_{(i−1)Δ+1+k} | k = 0, 1, …, L − 1}, where f_j denotes the j-th frame in the video. For example, if L = 20, the extracted sample sequences are S_1 = [f_1, f_2, …, f_20], S_2 = [f_6, f_7, …, f_25], S_3 = [f_11, f_12, …, f_30], and so forth. This sampling strategy ensures that: (1) temporal continuity is maintained within each sample, fully capturing the local temporal patterns of actions; and (2) moderate overlap exists between adjacent samples (L − Δ frames), which increases training sample diversity while avoiding the high redundancy caused by dense sampling.
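The sliding-window extraction itself reduces to a few lines; this sketch uses the 1-based frame indexing of the example above with defaults L = 20 and Δ = 5 (`sliding_windows` is an illustrative name):

```python
def sliding_windows(num_frames, L=20, delta=5):
    """1-based frame indices of every full L-frame window taken at a
    starting-frame interval of delta, matching the example
    S_1 = [f_1..f_20], S_2 = [f_6..f_25], and so on."""
    return [list(range(s, s + L))
            for s in range(1, num_frames - L + 2, delta)]
```

Adjacent windows share exactly L − delta frames, the moderate overlap described above.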

5.1.2. Annotation Software

To improve the efficiency of dataset construction and the accuracy of annotations, we designed and developed a dedicated annotation software for UAV ground crew marshalling gestures, as shown in Figure 15. The software is built on the PyQt5 framework for the interactive interface, with video loading and frame-by-frame display supported by OpenCV. Annotation results are exported as structured files using the Pandas library. The software supports multiple video formats and offers features such as frame-by-frame browsing, playback control, and quick navigation, facilitating efficient video review for annotators. To meet standardization requirements, the interface is pre-configured with the 34 UAV ground crew marshalling gestures and the “No Gesture” label, with annotation results displayed and managed in real time. Additionally, the software integrates annotation completeness checks and result export functions to ensure the process is standardized and accurate. By optimizing the interaction flow and incorporating automated checking mechanisms, this software significantly enhances annotation efficiency and data quality, providing strong support for the construction of the UAV ground crew marshalling gesture dataset.

5.1.3. Experimental Platform

The model training was conducted on an Intel Core i7-10700 CPU and NVIDIA GeForce RTX 3070 GPU environment, with experiments implemented using Python 3.8 and the PyTorch 2.0.1 deep learning framework. The model training adopts the same hyperparameter configuration as described in Section 4.1.2. To further validate the model’s applicability on embedded platforms, we tested it on the NVIDIA Jetson Xavier NX platform (as shown in Section 5.2.4). This device measures just 70 mm × 45 mm × 40 mm and offers 14 TOPS and 21 TOPS of computing performance at 10 W and 15 W power consumption, respectively, making it suitable for low-power, compact scenarios such as unmanned equipment. The hardware includes 384 CUDA cores, 48 Tensor Cores, and 2 NVDLA engines, supporting high-resolution visual data processing and efficient deep learning network operation through optimized acceleration libraries. The experimental deployment relied on the JetPack SDK, integrated with CUDA/cuDNN, and PyTorch was configured in an Ubuntu environment to ensure smooth execution of the model on the edge device.

5.2. Experimental Details and Results Analysis

5.2.1. Workflow of UAV Ground Crew Marshalling Gesture Recognition

Figure 16 illustrates the complete workflow of UAV ground crew marshalling gesture recognition. The raw input consists of a sequence of consecutive UAV ground crew marshalling gesture videos. Human body detection is performed on each video frame to locate the key regions of the human body (Regions of Interest, ROI). The detection results are output as a set of four-element tuples {[x_i, y_i, w_i, h_i]}, i = 1, …, n, where n denotes the number of detected targets, (x_i, y_i) represents the center coordinates of the i-th target, and w_i and h_i denote the width and height of its detection box, respectively. A confidence threshold mechanism is established to improve the reliability of detection results, retaining only detections with confidence greater than 0.5. Meanwhile, to ensure that the detection box completely covers the target personnel region, the ROI is expanded to 1.25 times the original region during detection. The cropped target region images are normalized to a size of 256 × 192 pixels to provide the data foundation for subsequent pose estimation. Pose estimation employs the AlphaPose algorithm to extract 2D coordinates and confidence scores of 17 key joints (following the COCO skeleton annotation protocol) from RGB images, with each joint represented as an (x, y, confidence) triplet. To ensure that inaccuracies in skeleton structure do not confound the evaluation of action recognition algorithms, we introduce a skeleton quality control mechanism: the average confidence score across all joints is computed for each frame, and only frames with an average joint confidence ≥ 0.6 are retained for subsequent recognition.
This threshold is determined based on AlphaPose’s performance characteristics on the COCO dataset (AP@0.5 = 80.3%), effectively filtering out samples with failed pose estimation (e.g., severe occlusions, extreme poses), thereby decoupling pose estimation errors from action recognition performance and focusing the evaluation on the effectiveness of the proposed adaptive refinement graph convolutional network. All keypoints together form a skeleton sequence S = {X_{tv} | t = 1, 2, …, T; v = 1, 2, …, V}, where T represents the total number of frames in the sequence, V denotes the number of keypoints within a frame, and X_{tv} represents the feature of keypoint v at time t. The skeleton sequence is then input into a spatiotemporal graph convolutional network to predict the marshalling action category of UAV ground crew personnel in the video.
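Two steps of this workflow, the 1.25x ROI expansion and the average-confidence frame filter, can be sketched directly. `expand_roi` and `keep_frame` are illustrative names of this sketch; the center-format box follows the tuple definition given in the workflow description:

```python
import numpy as np

def expand_roi(box, scale=1.25):
    """Expand a centre-format detection (x, y, w, h) about its centre,
    as in the 1.25x ROI expansion step; centre coordinates are kept."""
    x, y, w, h = box
    return (x, y, w * scale, h * scale)

def keep_frame(joints, conf_thresh=0.6):
    """Skeleton quality control: keep a frame only if the mean joint
    confidence (third element of each (x, y, confidence) triplet over
    the 17 COCO joints) reaches the threshold."""
    return float(np.asarray(joints, dtype=float)[:, 2].mean()) >= conf_thresh
```

Frames rejected by `keep_frame` are simply dropped from the skeleton sequence before it is passed to the recognition network.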

5.2.2. Application Experimental Results and Analysis

(1)
Comparison Experiments with Baseline Methods
To validate the effectiveness of the proposed method for ground crew command action recognition tasks in UAV scenarios, we conduct comparative experiments on the constructed original dataset (without multi-environment augmentation) using two evaluation protocols: X-Sub and HO (hold-out). The X-Sub protocol primarily evaluates the model’s generalization capability when encountering unseen subjects, while the HO protocol focuses on assessing the model’s performance stability under random data partitioning. Comparative methods include classical GCN approaches (ST-GCN [7], Shift-GCN [39], 2s-AGCN [37]), recent state-of-the-art methods (MS-G3D [40], CTR-GC [13], SGN [38], EfficientGCN [41]), and Transformer-based methods (GAP [25], STEP-CATFormer [24]).
Table 6 presents the experimental results on the UAV ground crew action dataset under the X-Sub protocol. In terms of overall performance, classical GCN methods (ST-GCN, Shift-GCN, 2s-AGCN) achieve accuracy in the range of 86.19–88.30% and Jaccard coefficients in the range of 80.15–82.88%. Recent state-of-the-art methods (MS-G3D, CTR-GC, SGN, EfficientGCN) demonstrate significant performance improvements, with accuracy ranging from 89.15% to 89.82% and Jaccard coefficients from 83.21% to 83.89%. Transformer-based methods (GAP, STEP-CATFormer) achieve accuracy in the range of 88.93–89.67%, with overall performance positioned between classical and recent state-of-the-art approaches. The proposed method achieves an accuracy of 90.71%, a Jaccard coefficient of 84.32%, and an F1-score of 90.13%, surpassing all comparative methods.
Further analysis reveals that classical GCN methods generally achieve Jaccard coefficients below 83%, indicating insufficient discriminative capability for similar action categories in cross-subject scenarios. Although recent state-of-the-art methods improve performance to over 89% through mechanisms such as multi-scale modeling, channel refinement, and semantic guidance, improvements in a single dimension struggle to simultaneously address multi-order motion feature extraction and dynamic topology learning. Transformer-based methods exhibit overfitting risks on small-scale datasets. The proposed method effectively addresses the challenges of discriminating between similar inter-class actions and handling intra-class variations by capturing action evolution patterns across different temporal scales through multi-order motion feature modeling and optimizing information flow between joints via data-driven adaptive topology learning, thereby demonstrating significant advantages in cross-subject generalization capability.
The experimental results under the HO setting are presented in Table 7, where the proposed method also achieves the best performance. Compared with 2s-AGCN, the accuracy (96.09%) and F1 score (96.22%) are improved by 1.97% and 1.84%, respectively, fully demonstrating the model’s accuracy under random data partitioning conditions. Compared with the X-Sub setting, all classical GCN methods exhibit substantial performance improvements (4–6 percentage points) under the HO setting, indicating that these methods maintain relatively stable generalization capability under random data partitioning conditions. Recent state-of-the-art methods demonstrate even more prominent performance under the HO setting, with MS-G3D, CTR-GC, SGN, and EfficientGCN achieving accuracies of 94.58%, 94.73%, 95.21%, and 94.89%, respectively, significantly outperforming classical methods. Among them, SGN exhibits particularly outstanding performance in inter-class action discrimination through its semantic guidance mechanism. The Transformer-based method GAP achieves an accuracy of 95.08% and an F1 score of 95.13% under the HO setting, demonstrating the advantages of global attention mechanisms in capturing long-range spatiotemporal dependencies, while STEP-CATFormer [24] achieves an accuracy of 94.35%, slightly lower than GAP. In UAV ground crew command action recognition tasks, significant inter-class similarity and intra-class variability pose considerable challenges to action recognition. For instance, the “normal parking” and “emergency parking” gestures exhibit similar dynamic features, while the “turn left” and “turn right” gestures display substantial variability due to differences in individual execution styles and posture angles. 
Through multi-order feature modeling and adaptive refinement mechanisms, the proposed method optimizes feature extraction and inter-node interaction relationships, effectively addressing the issues of inter-class similarity and intra-class variability in UAV ground crew command action recognition tasks. Compared with the second-best method SGN, the proposed method improves accuracy by 0.88 percentage points and F1 score by 0.75 percentage points, validating the necessity of the proposed architecture in handling complex action patterns. These improvements enhance the model’s performance in UAV ground crew command action recognition tasks, demonstrating the high accuracy and application potential of the proposed algorithm in dynamic environments.
Figure 17 illustrates partial action test results on the UAV ground crew command action dataset, where the action sequence is consistent with the ICAO standard action categories. For each group, the left side shows the ICAO standard action example, while the right side displays the video keyframe recognition results for the corresponding actions in the dataset. It should be noted that the proposed method performs action recognition based on complete video sequences, and the keyframes shown in the figure are only representative moments selected for visualization purposes. In each action keyframe, the pose estimation results are shown as blue dots within orange detection boxes, clearly identifying the key point locations of the commander. The top-left corner of each keyframe lists the prediction results for UAV ground crew command action categories, including the Top-10 categories and their prediction confidence scores. The experimental results demonstrate that the proposed method can accurately recognize all UAV ground crew command actions in the dataset. The model performs classification by analyzing the spatiotemporal features of the entire action video sequence, reflecting the reliability of the proposed model in UAV ground crew command action classification tasks.
(2)
Typical Action Recognition Results and Confusion Analysis
To deeply analyze the model’s discriminative ability for temporally similar but semantically different actions, this paper selects four groups of typical confusable action pairs for quantitative evaluation: (1) Normal Stop (7) and Emergency Stop (8)—both involve crossing both arms overhead, but differ significantly in execution speed; (2) Insert Chocks (11) and Remove Chocks (12)—similar action trajectories but opposite directions; (3) Identify Parking Position (2) and Normal Stop (7)—partially overlapping trajectories of raising both arms; (4) Guide Aircraft (28) and Do Not Touch Controls (29)—subtle differences in hand postures. The averaged confusion matrix statistics based on five random split experiments under the HO setting are shown in Figure 18. The experimental results demonstrate that the model exhibits good discriminative ability for speed differences, with Normal Stop being misclassified as Emergency Stop at an average probability of 1.35%, and Emergency Stop being misclassified as Normal Stop at 2.17%. For action pairs with opposite directions, the mutual confusion rates between Insert Chocks and Remove Chocks are 8.94% and 6.73%, respectively, showing relatively high confusion levels. The probability of Identify Parking Position being misclassified as Normal Stop is 3.67%, while Normal Stop being misclassified as Identify Parking Position is 2.83%. The mutual confusion rates between Guide Aircraft and Do Not Touch Controls are 12.67% and 8.15%, representing the highest confusion level among all action pairs, with Guide Aircraft being more susceptible to misclassification. This is primarily due to the subtle posture differences in the hand region between the two actions requiring more refined feature representations.
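The confusion rates above are row-normalized entries of the confusion matrix, i.e., the share of samples of one true class assigned to another class. A minimal sketch of this computation, using an invented toy matrix rather than the paper's actual counts:

```python
import numpy as np

def pairwise_confusion(cm, true_cls, pred_cls):
    """Row-normalized confusion rate: P(pred_cls | true_cls), in percent."""
    return 100.0 * cm[true_cls, pred_cls] / cm[true_cls].sum()

# Toy 3-class confusion matrix (rows = true labels, cols = predictions);
# the counts are illustrative, not taken from the experiments.
cm = np.array([[95, 3, 2],
               [5, 90, 5],
               [1, 4, 95]])

rate_01 = pairwise_confusion(cm, 0, 1)  # true class 0 predicted as class 1
rate_10 = pairwise_confusion(cm, 1, 0)  # true class 1 predicted as class 0
```

Averaging such rates over the five random HO splits yields the per-pair statistics reported in Figure 18.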
To verify the model’s applicability in more diverse scenarios, Figure 19 presents test results on samples with multiple viewpoints, multiple subjects, and varying distances. The experimental results demonstrate that the proposed method can accurately recognize marshaller gestures in the majority of test samples, with the prediction confidence of the correct category significantly higher than that of other categories, fully reflecting the discriminative capability and superiority of the algorithm. In a small number of test samples captured from medium-to-long distances or side viewpoints, the confidence of the Top-1 category decreases slightly, indicating that variations in viewpoint and distance have an adverse effect on the model’s discrimination. However, this reduction in confidence does not lead to misclassification of the primary category, demonstrating the model’s strong robustness.
(3)
Visualization Analysis of Adaptive Graph Topology
To gain deeper insights into the learning mechanism of data-driven adaptive graph convolution, we conducted a visualization analysis of the dynamic joint dependencies learned by the model. Figure 20 illustrates the normalized adjacency matrices for four representative ground crew marshalling gestures: Normal Stop (7), Emergency Stop (8), Insert Chocks (11), and Remove Chocks (12). We extracted the adjacency matrix $A \in \mathbb{R}^{17 \times 17}$ from the last layer of the adaptive graph convolution module and averaged it across the temporal dimension over all test samples of each action category. According to Equation (9), the dynamic topology is generated as $A_t = \mathrm{softmax}\big(\theta(F_t)^{T} \cdot \phi(F_t)\big)$, where $\theta$ and $\phi$ are data transformation functions that measure connection strength by computing inner products between joint features, followed by softmax normalization to ensure that the weights in each column sum to one. Subsequently, the learned dynamic adjacency matrix is weighted and fused with the fixed skeleton adjacency matrix, followed by GCN symmetric normalization: $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, where $\tilde{A} = A + I$ is the adjacency matrix with added self-connections and $\tilde{D}$ denotes the corresponding degree matrix. The matrix element $A_{ij}$ represents the influence weight of joint $i$ on target joint $j$, with darker colors in the heatmap indicating stronger connection strengths.
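The two-step construction described above can be sketched in a few lines of NumPy. This is an illustration only: the random embedding matrices stand in for the learned transformations θ and φ, and the shapes follow the 17-keypoint layout; it is not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_adjacency(F, W_theta, W_phi):
    # A_t = softmax(theta(F)^T . phi(F)); columns are softmax-normalized,
    # so each target joint's incoming weights sum to one.
    theta = W_theta.T @ F                  # (C', V) embedded joint features
    phi = W_phi.T @ F                      # (C', V)
    return softmax(theta.T @ phi, axis=0)  # (V, V) dynamic topology

def symmetric_normalize(A):
    # GCN normalization: A_hat = D~^{-1/2} (A + I) D~^{-1/2}
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

rng = np.random.default_rng(0)
V, C, Ce = 17, 64, 16                      # joints, channels, embedding dim
F = rng.standard_normal((C, V))            # one frame of joint features
A_t = dynamic_adjacency(F, rng.standard_normal((C, Ce)),
                        rng.standard_normal((C, Ce)))
A_hat = symmetric_normalize(A_t)
```

Degree normalization is what damps the diagonal self-connections of high-degree joints discussed below, letting long-range off-diagonal weights dominate the heatmaps.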
Comparing Figure 20a Normal Stop and Figure 20b Emergency Stop reveals significant differences. Although both gestures exhibit similar motion trajectories (both arms raised from the sides to cross overhead), the Emergency Stop displays stronger weights on key connections: the cross-body connections between hip and shoulder joints are notably enhanced, reflecting the need for stronger core stability during rapid force generation; the synchronization and coordination weights between left and right shoulder joints are significantly elevated, embodying the higher demand for precise coordination in “swift” movements; more importantly, long-range dependencies between hip joints and wrists emerge, which in traditional ST-GCN would require three-level propagation through “wrist → elbow → shoulder → hip”, whereas adaptive graph convolution directly captures this whole-body tension pattern through data-driven learning.
The comparison between Insert Chocks and Remove Chocks validates the model’s sensitivity to directional features. Figure 20c Insert Chocks shows higher connection weights between wrists and hip joints (inward-converging features), while wrist-shoulder connections are weaker; conversely, Figure 20d Remove Chocks exhibits the opposite pattern, with significantly enhanced wrist-shoulder connections (outward-diverging features) and weakened wrist-hip connections. Furthermore, there is a notable difference in the interaction weights between the two wrists: higher in the insertion action (contact point) and lower in the removal action (separation). This demonstrates that the model not only learns the spatial distribution of joints but also successfully extracts motion directionality, which is crucial for distinguishing paired operations with similar spatial trajectories (e.g., “pull/push”, “insert/extract”).
The connection patterns learned by adaptive graph convolution significantly transcend the anatomical constraints of fixed skeletons. Analysis reveals that: direct wrist-hip connections require three-level propagation in fixed graphs; direct bilateral shoulder coordination patterns in fixed graphs must be indirectly connected through the torso chain; long-range hip-wrist connections do not exist in natural topology. These data-driven long-range dependencies enable the model to capture complex multi-joint coordination patterns (e.g., diagonal synergies, cross-body balance). Notably, after GCN symmetric normalization, self-connections (diagonal elements) appear in lighter gray with weights lower than key strong connections. This occurs because self-connections of high-degree nodes (e.g., shoulder joints, hip joints) are suppressed by degree normalization, effectively preventing them from dominating information propagation and allowing long-range synergistic relationships to be fully expressed.

5.2.3. Robustness Analysis

(1)
Robustness Analysis of Sequence Length and Execution Speed
To comprehensively evaluate the model’s robustness to temporal variations, we conducted ablation experiments from two dimensions: (1) sequence length variation, assessing the model’s adaptability to action duration; (2) execution speed variation, evaluating the model’s tolerance to action tempo. All experiments adopted the sliding window sampling strategy described in Section 5.2.1 and were independently trained and tested under the HO setting.
Sequence Length Experiments: We conducted experiments with different frame lengths (15, 20, 25, 30 frames), with all configurations using the same frame interval Δ = 5 . Table 8 presents the recognition performance. The 20-frame configuration achieved optimal accuracy (96.09%), validating the effectiveness of the default setting. The 15-frame configuration exhibited a decrease in accuracy to 94.67% (a drop of 1.42%), primarily due to insufficient temporal information: some actions with longer execution cycles could only capture partial stages within the 0.6-s sampling window, resulting in incomplete temporal features. The 25-frame and 30-frame configurations achieved accuracies of 95.78% and 95.53%, respectively, slightly lower than the 20-frame configuration (decreases of 0.31% and 0.56%), attributed to redundant noise introduced by transitional or static frames in longer sequences, which dilutes the weight of discriminative features.
Execution Speed Experiments: We simulated different execution speeds by adjusting the frame sampling interval. For a fixed sequence length L = 20 frames, the sampling strategy is defined as $S_i = \{ f_{i+k\Delta} \}_{k=0}^{L-1}$, where Δ is the frame interval parameter. This design maintains a consistent number of input frames while varying only the temporal span, thereby isolating the effect of speed factors. The model was trained on original speed data ( Δ = 1 ) and tested under three speed conditions ( Δ ∈ { 1 , 2 , 3 } , corresponding to 1×, 1.5×, and 2× speeds, respectively). Table 9 presents the recognition performance. The original speed achieved optimal accuracy (96.09%). When the speed increased to 1.5 times, the accuracy slightly decreased to 95.47% (a drop of 0.62%), primarily due to frame skipping that weakened the temporal continuity of critical motion stages. When the speed further increased to 2 times, the accuracy dropped to 94.13% (a decline of 1.96%). At this point, substantial frame skipping caused joint motion trajectories to exhibit discrete jumps, introducing significant noise to multi-order motion features (especially acceleration features).
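The fixed-length, variable-span sampling strategy can be sketched directly (a minimal illustration; the integer frame indices stand in for decoded video frames):

```python
def sample_window(frames, start, L=20, delta=1):
    """S_i = {f_{i+k*delta}}, k = 0..L-1: fixed window length L,
    temporal span scaled by the frame interval delta."""
    idx = [start + k * delta for k in range(L)]
    if idx[-1] >= len(frames):
        raise ValueError("window exceeds sequence length")
    return [frames[j] for j in idx]

seq = list(range(100))                      # stand-in for 100 frames
s1 = sample_window(seq, 0, L=20, delta=1)   # original speed
s2 = sample_window(seq, 0, L=20, delta=2)   # simulated faster execution
```

Because both windows contain exactly L frames, only the covered temporal span changes, which is what isolates the speed factor in Table 9.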
Integrating the results from both experimental groups, the model maintains accuracy ≥ 94% under variations in sequence length (15–30 frames) and execution speed (1–2× speed), validating the temporal robustness of the proposed method. This can be primarily attributed to: (1) multi-order motion feature modeling provides complementary information across different temporal scales; (2) adaptive topology dynamically adjusts joint connections based on input characteristics; (3) the receptive field design of temporal convolution (kernel size = 9) effectively adapts to different sampling patterns. Considering both overall performance and computational efficiency, the recommended configuration is L = 20 frames with Δ = 1 (original speed). For resource-constrained scenarios, L = 15 frames or Δ = 2 (1.5× speed) can serve as alternative options, with performance degradation < 1.5%.
(2)
Distance Robustness Analysis
To validate the algorithm’s applicability in real-world UAV ground crew command scenarios, it is necessary to consider the relationship between camera field of view, target distance, and imaging scale. This study employs a Hikvision DS-E12 camera (horizontal field of view 80.3°, vertical field of view 50.8°, resolution 1280 × 720), which is suitable for small UAV close-range scenarios (3–10 m). For medium-sized (5–15 m) or large UAVs (10–30 m), it is recommended to use telephoto lenses with narrower fields of view.
In practical applications, excessive distance causes the human body to appear too small in the image, reducing keypoint detection accuracy and increasing the risk of misclassification, while increasing resolution sacrifices real-time performance. Therefore, we propose using the ratio of human body height to image height as an evaluation criterion: when the commander’s imaging height is ≥15% of the image height, the system can maintain effective recognition capability.
The current dataset was captured at distances of 3–6 m, corresponding to human body ratios of 34–68% (see Figure 21). To simulate extreme scenarios while considering actual human heights, we selected 20 images each from 3, 4, 5, and 6 m, and geometrically scaled them to three ratios: 25%, 20%, and 15%, where 15% corresponds to approximately 14 m distance (the theoretical limit under the current camera’s field of view).
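The mapping from distance to imaging ratio follows the pinhole relation between the camera's vertical field of view and the target distance. In the sketch below, the effective marshaller height of 2.0 m (a person with arms raised) is our assumption, chosen because it approximately reproduces the ratios reported above for the DS-E12's 50.8° vertical FOV:

```python
import math

def body_ratio(person_h, distance, vfov_deg=50.8):
    """Fraction of the image height occupied by a person at a given distance.
    Pinhole model: the full vertical FOV spans 2*d*tan(vfov/2) metres."""
    visible_h = 2.0 * distance * math.tan(math.radians(vfov_deg) / 2.0)
    return person_h / visible_h

# Effective height 2.0 m is an assumption (marshaller with raised arms).
r3 = body_ratio(2.0, 3.0)    # close range
r6 = body_ratio(2.0, 6.0)    # far end of the capture range
r14 = body_ratio(2.0, 14.0)  # theoretical limit noted in the text
```

Under these assumptions, the ratios come out near 68%, 34%, and 15%, consistent with the 3–6 m capture range and the ~14 m limit at the 15% threshold.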
The proposed algorithm is based on normalized motion feature modeling of keypoint coordinates, inherently providing robustness to the absolute scale of targets. As long as keypoint detection remains accurate, the algorithm can still determine action categories through normalized relative positional relationships even when the human body imaging is reduced. Therefore, the scaled test images were only added to the test set and did not participate in model training.
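The scale-invariance argument can be illustrated with a simple normalization sketch. This is our illustration only: the COCO-17 index convention and the hip-centred, torso-scaled normalization are assumptions, and the paper's exact feature normalization may differ.

```python
import numpy as np

def normalize_skeleton(kpts):
    # Centre on the hip midpoint and scale by torso length, so the
    # representation depends only on relative joint positions, not on
    # the person's absolute size in the image.
    # kpts: (17, 2) image-space keypoints in COCO order
    # (5/6 = shoulders, 11/12 = hips) -- the index convention is assumed.
    hips = (kpts[11] + kpts[12]) / 2.0
    shoulders = (kpts[5] + kpts[6]) / 2.0
    torso = np.linalg.norm(shoulders - hips)
    return (kpts - hips) / max(torso, 1e-6)

rng = np.random.default_rng(1)
k = rng.standard_normal((17, 2)) * 50 + 300  # a person at full resolution
k_small = k * 0.15                           # same pose scaled to 15%
```

The same pose at 15% imaging scale normalizes to an identical representation, which is why the scaled images only need to be tested, not trained on.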
The experimental results in Table 10 demonstrate that: at 25% and 20% ratios, the algorithm performance is essentially equivalent to the original dataset (accuracy ≥ 95%). At the 15% ratio (extreme scenario), performance shows a slight degradation (approximately 1 percentage point), but accuracy remains above 94% with an F1-score of 93.8%. Visualization results are shown in Figure 22, from top to bottom: scaled to 25%, 20%, and 15%, respectively. The blue skeleton represents the keypoint detection results. The action recognition results are satisfactory, and under the extreme condition of 15% ratio, the algorithm can still accurately recognize key gestures, validating the scale robustness of the proposed method.
(3)
Environmental Robustness Analysis
To evaluate the applicability of the proposed method in real-world UAV ground crew scenarios, we constructed augmented data representing five typical environmental conditions based on the original dataset: illumination variation (Illumination), illumination with haze (Illumination + Haze), illumination with rain (Illumination + Rain), shadow (Shadow), and occlusion (Occlusion). Two training strategies were employed for comparison: (1) Test: training only on the original data, with environmentally augmented data used exclusively for testing; (2) Test+Train: mixed training with original data and environmentally augmented data in a 7:3 ratio. Table 11 and Table 12 present the performance comparison of the two training strategies under different environmental conditions. Figure 23 illustrates typical recognition samples under various environmental conditions, with 4 representative samples selected for each environment, totaling 24 test cases, intuitively demonstrating the system’s recognition performance under complex environments.
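The 7:3 Test+Train mixing can be sketched as follows (illustrative only; the sample counts and per-sample tags are invented, and the paper does not specify the exact sampling procedure):

```python
import random

def mixed_training_set(original, augmented, aug_fraction=0.3, seed=0):
    """Build a training mix where `aug_fraction` of the final set comes
    from environmentally augmented data (7:3 original:augmented)."""
    rng = random.Random(seed)
    n_aug = round(len(original) * aug_fraction / (1.0 - aug_fraction))
    picked = rng.sample(augmented, min(n_aug, len(augmented)))
    mixed = original + picked
    rng.shuffle(mixed)
    return mixed

orig = [("orig", i) for i in range(70)]   # toy original samples
aug = [("aug", i) for i in range(200)]    # toy augmented samples
mix = mixed_training_set(orig, aug)
```

With 70 original samples, this draws 30 augmented samples, yielding the 7:3 ratio used in the Test+Train strategy.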
Under the Test mode, environmental factors exerted varying degrees of impact on system performance. Illumination changes resulted in approximately 0.9 percentage point decrease in X-Sub accuracy, primarily affecting human detection and keypoint localization precision. Due to the inherent insensitivity of skeleton-based methods to illumination variations, the performance degradation remained relatively moderate. When haze or rain was superimposed on illumination changes, the system performance remained stable or even slightly improved, indicating that uniformly distributed visual noise has minimal impact on skeleton extraction. Shadow interference led to accuracy drops of approximately 1.2–1.4 percentage points, attributed to non-uniform local low-illumination regions affecting bounding box detection and keypoint localization. Occlusion emerged as the most significant factor, causing accuracy decreases of approximately 2.9–3.1 percentage points.
For occlusion scenarios, the pose estimation module employs a keypoint prediction mechanism: when certain points among the 17 keypoints are occluded, estimation is performed based on visible keypoints and human kinematic constraints. As illustrated in the Occlusion column of Figure 23, green markers denote directly detected keypoints, while yellow markers indicate estimated keypoints. When simple body parts (such as torso or legs) are occluded, the estimated keypoint positions are relatively reliable. However, when complex critical regions (such as arms, hands, or batons) are occluded, estimated points may exhibit substantial deviations. For instance, in the bottom-right sample of Figure 23, arm occlusion causes positional displacement of the yellow estimated points, disrupting the upper limb skeleton structure, which significantly impacts action recognition accuracy.
The Test + Train mode significantly enhanced the model’s environmental robustness. Under original conditions, accuracy improved by 0.16–0.21 percentage points, validating the benefits of data diversity. Under illumination-related conditions, the performance degradation was reduced from 0.9 percentage points to 0.07–0.08 percentage points, indicating that the model learned illumination-invariant features. Under shadow conditions, the performance drop decreased from 1.2–1.4 percentage points to 0.4 percentage points, reflecting the model’s adaptation capability to local low-illumination areas. Under occlusion conditions, accuracy increased by 1.97–2.27 percentage points, with the performance degradation reduced from approximately 3 percentage points to around 1 percentage point. Augmented training enabled the model to recognize the uncertainty of estimated keypoints and compensate through visible keypoints and temporal information. However, when complex critical regions are severely occluded, estimation deviations still affect recognition performance, reflecting the inherent limitations of skeleton-based methods under extreme occlusion scenarios.
As illustrated in Figure 23, the model maintained high accuracy and stability across most environmental conditions. Under original and illumination variation conditions, the system achieved accurate recognition. Under haze and rain conditions, despite degraded image quality, the system remained capable of correct classification. Under shadow conditions, local low-illumination regions exerted certain impacts on keypoint localization, yet overall recognition performance remained stable. Under occlusion conditions, the distribution of green (directly detected) and yellow (estimated) keypoints is clearly observable. In the first three samples, the estimated point positions are reasonable, and the system provides correct predictions; however, in the bottom-right sample, occlusion of complex regions causes estimation deviations, leading to decreased model prediction confidence. These results demonstrate the applicability of the proposed method in real airport environments, while also identifying future improvement directions: incorporating multimodal fusion, refining keypoint estimation mechanisms, and implementing confidence-based adaptive feature weighting strategies to further enhance robustness under extreme conditions.

5.2.4. Edge Device Deployment and Real-Time Performance

To evaluate the real-time performance of the proposed method and recognition pipeline, we deployed and tested the object detection, pose estimation, and action recognition models on the edge device NVIDIA Jetson Xavier NX. This platform is widely adopted in UAV ground control systems and provides native support for the TensorRT deep learning inference optimization framework, which was employed during inference to optimize computational efficiency. The test scenario and example running results are shown in Figure 24. For cooperative target action recognition of marshalling personnel, the inference times of the object detection, pose estimation, and action recognition models are 12.1 ms, 7.8 ms, and 4.6 ms, respectively. Overall, the proposed method achieves low inference latency on edge devices, meeting the real-time processing requirements of UAV ground crew marshalling scenarios.
In practical deployment, the system employs a sliding window mechanism for continuous action recognition. For the first recognition sample, object detection and pose estimation must be performed on all frames of the complete sequence (total latency of (12.1 + 7.8) × 20 + 4.6 ≈ 402.6 ms for a 20-frame sequence, and (12.1 + 7.8) × 30 + 4.6 ≈ 601.6 ms for a 30-frame sequence). Subsequent samples require processing only the newly arrived single frame through an incremental update strategy. Regardless of whether the sequence length is 20 frames, 30 frames, or longer, the single-frame processing latency remains 12.1 + 7.8 + 4.6 ≈ 24.5 ms, corresponding to a frame rate of approximately 40 FPS, which meets the real-time processing standard (25 FPS) for UAV ground crew marshalling scenarios. This mechanism fully decouples latency from sequence length, ensuring stable real-time performance of the system under different window configurations.
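The latency accounting above reduces to a few lines, using the measured per-model times from the text:

```python
DET_MS, POSE_MS, ACT_MS = 12.1, 7.8, 4.6  # measured per-model latencies

def first_window_latency(L):
    """Cold start: detection + pose on all L frames, then one classification."""
    return (DET_MS + POSE_MS) * L + ACT_MS

def steady_state_latency():
    """Sliding window: only the newest frame needs detection and pose."""
    return DET_MS + POSE_MS + ACT_MS

fps = 1000.0 / steady_state_latency()  # ≈ 40.8 FPS, independent of L
```

Since `steady_state_latency()` does not depend on the window length L, the sustained frame rate is the same for 20-frame and 30-frame configurations.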
Although the current system has achieved real-time inference, there remains further optimization potential for the proposed Adaptive Refined Graph Convolutional Network. On one hand, knowledge distillation strategies can be employed to transfer the feature representation capability of complex models to lightweight architectures (such as SGN), maintaining recognition accuracy while reducing model complexity through teacher-student network co-training. On the other hand, structured pruning techniques can be introduced to compress the Adaptive Refined Feature Activation Mechanism (ARFAM), selecting discriminative feature dimensions based on channel importance evaluation metrics and removing redundant channels to reduce computational overhead. These optimization strategies provide feasible pathways for efficient deployment of the proposed method in resource-constrained scenarios such as UAV airborne platforms.

6. Conclusions

This work addresses the problem of skeleton-based action recognition in UAV airfield ground crew command scenarios by proposing an Enhanced Feature-based Adaptive Refined Graph Convolutional Network (EF-ARGC) for action recognition. The proposed method effectively tackles the core challenges of high similarity in motion patterns of ground crew command actions, large individual variations, and complex spatial–temporal dependencies through multi-scale feature enhancement, adaptive graph structure learning, and spatial–temporal semantic fusion.
In terms of methodological innovation, this work makes three main contributions: (1) A multi-level feature and motion feature modeling module is proposed, which integrates joint positions, bone information, and angle encoding to construct more discriminative multi-scale feature representations. (2) A data-driven adaptive graph convolutional module is designed to break through the limitations of fixed topological structures, dynamically learn semantic correlations between joints, and incorporate an Adaptive Refined Feature Activation Mechanism (ARFAM) to achieve fine-grained modeling of skeleton spatial information. (3) A frame index semantic temporal feature modeling module is constructed to explicitly encode temporal position information, enhancing the model’s perception capability for temporal evolution of actions.
For performance validation, the effectiveness of the proposed method was verified on the NTU-RGB+D 60 and NTU-RGB+D 120 benchmark datasets. On the NTU-RGB+D 60 dataset, under X-Sub and X-View settings, Top-1 accuracies of 89.4% and 94.2% were achieved, respectively; on the NTU-RGB+D 120 dataset, accuracies of 81.7% and 83.3% were obtained, respectively. On the self-constructed UAV Airfield Ground Crew Dataset (UAV-AGCD), the proposed method achieves accuracies of 90.71% and 96.09% under X-Sub and HO settings, respectively. Ablation experiments validate the contribution of each core module, particularly the significant role of adaptive graph convolution and multi-scale feature fusion in enhancing inter-class discriminability and handling intra-class variations.
For application validation, this work constructs the first skeleton-based action dataset for UAV airfield ground crew command scenarios, encompassing 34 categories of standard command actions specified by ICAO. The experiments employ pose estimation networks to convert videos into skeleton sequences as model inputs. Robustness analysis validates the stable performance of the model under different sequence lengths, shooting distances, and environmental conditions. Environmental robustness experiments demonstrate that, under complex environmental conditions, including illumination variations, haze, rain, shadows, and occlusions, the adoption of the Test + Train strategy reduces the maximum performance degradation from 3.1 percentage points to within 1 percentage point. Real-time performance testing shows that the system achieves an end-to-end inference latency of 24.5 ms (40.8 FPS) on the edge device NVIDIA Jetson Xavier NX, meeting real-time processing requirements and validating the efficiency and practicality of the proposed method on edge computing platforms.
The application value of this research extends beyond UAV airfield ground crew command, demonstrating potential in domains such as traffic police gesture recognition, industrial command gesture recognition, emergency rescue gesture command recognition, and abnormal behavior detection in public places, providing a general solution for intelligent transportation and public safety fields.
Despite achieving promising experimental results, there remains room for improvement. First, in extreme occlusion scenarios, the degradation of keypoint detection quality in front-end pose estimation affects recognition accuracy. Future work could introduce multi-modal fusion mechanisms combining RGB images with skeleton features and design confidence-based adaptive weighting strategies to reduce the impact of low-quality keypoints. Second, the current method targets single-person command scenarios, and performance in multi-person interaction scenarios requires optimization. Third, although robustness analysis validates performance under various environmental conditions, long-term stability under extreme weather conditions still requires large-scale real-world deployment verification. Finally, future research could explore lightweight model design and edge computing optimization to reduce computational costs while maintaining accuracy, supporting broader deployment scenarios.

Author Contributions

Conceptualization, Y.X. and F.X.; Methodology, Q.Z.; Validation, Q.Z.; Formal analysis, Z.Z. and Y.W.; Investigation, L.D.; Resources, Y.X.; Data curation, Y.W.; Writing—original draft, Q.Z.; Writing—review & editing, L.D. and Z.Z.; Visualization, L.D.; Supervision, F.X. and Y.W.; Project administration, Y.X.; Funding acquisition, Z.Z. and F.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Young Scientists Fund of the National Natural Science Foundation of China (Grant No. 52302506), the National Natural Science Foundation of China (Grant No. 62171361), and the Shaanxi Key Research and Development Program (Grant No. 2025GH-YBXM-022).

Data Availability Statement

Part of the data presented in this study are available in the NTU RGB+D Dataset repository at https://rose1.ntu.edu.sg/dataset/actionRecognition/ (accessed on 16 November 2025). These data were derived from the following resources available in the public domain: NTU-RGB+D 60 and NTU-RGB+D 120 datasets. The rest of the raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhou, Y. Unmanned aerial vehicles based low-altitude economy with lifecycle techno-economic-environmental analysis for sustainable and smart cities. J. Clean. Prod. 2025, 499, 145050. [Google Scholar] [CrossRef]
  2. Zhang, J.; Liu, Y.; Zheng, Y. Overall eVTOL aircraft design for urban air mobility. Green Energy Intell. Transp. 2024, 3, 100150. [Google Scholar] [CrossRef]
  3. Jin, Y. The Evolution and Challenges of Low-Altitude Economy: Insights from Experience in China. In Proceedings of the 1st International Conference on Modern Logistics and Supply Chain Management (MLSCM 2024), Foshan, China, 28–30 June 2024. [Google Scholar]
  4. Postorino, M.N.; Sarné, G.M. Reinventing mobility paradigms: Flying car scenarios and challenges for urban mobility. Sustainability 2020, 12, 3581. [Google Scholar] [CrossRef]
  5. Ren, B.; Liu, M.; Ding, R.; Liu, H. A survey on 3d skeleton-based action recognition using learning method. Cyborg Bionic Syst. 2024, 5, 0100. [Google Scholar] [CrossRef] [PubMed]
  6. Shahroudy, A.; Liu, J.; Ng, T.-T.; Wang, G. NTU RGB+D: A large scale dataset for 3d human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019. [Google Scholar]
  7. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  8. Duan, H.; Zhao, Y.; Chen, K.; Lin, D.; Dai, B. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2969–2978. [Google Scholar]
  9. Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-temporal lstm with trust gates for 3d human action recognition. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 816–833. [Google Scholar]
  10. Bono, F.M.; Radicioni, L.; Cinquemani, S.; Conese, C.; Tarabini, M. Development of soft sensors based on neural networks for detection of anomaly working condition in automated machinery. In Proceedings of the NDE 4.0, Predictive Maintenance, and Communication and Energy Systems in a Globally Networked World, Long Beach, CA, USA, 6 March–11 April 2022; pp. 56–70. [Google Scholar]
  11. Ke, Q.; Bennamoun, M.; An, S.; Sohel, F.; Boussaid, F. A new representation of skeleton sequences for 3d action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3288–3297. [Google Scholar]
  12. Cheng, K.; Zhang, Y.; Cao, C.; Shi, L.; Cheng, J.; Lu, H. Decoupling gcn with dropgraph module for skeleton-based action recognition. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 536–553. [Google Scholar]
  13. Chen, Y.; Zhang, Z.; Yuan, C.; Li, B.; Deng, Y.; Hu, W. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 13359–13368. [Google Scholar]
  14. Cai, J.; Jiang, N.; Han, X.; Jia, K.; Lu, J. JOLO-GCN: Mining joint-centered light-weight information for skeleton-based action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 2735–2744. [Google Scholar]
  15. Lee, J.; Lee, M.; Lee, D.; Lee, S. Hierarchically decomposed graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 10444–10453. [Google Scholar]
  16. Wang, S.; Zhang, Y.; Zhao, M.; Qi, H.; Wang, K.; Wei, F.; Jiang, Y. Skeleton-based action recognition via temporal-channel aggregation. arXiv 2022, arXiv:2205.15936. [Google Scholar]
  17. Chi, H.-G.; Ha, M.H.; Chi, S.; Lee, S.W.; Huang, Q.; Ramani, K. Infogcn: Representation learning for human skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20186–20196. [Google Scholar]
  18. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Symbiotic graph neural networks for 3d skeleton-based human action recognition and motion prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3316–3333. [Google Scholar] [CrossRef] [PubMed]
  19. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Decoupled spatial-temporal attention network for skeleton-based action-gesture recognition. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  20. Shu, X.; Xu, B.; Zhang, L.; Tang, J. Multi-granularity anchor-contrastive representation learning for semi-supervised skeleton-based action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 7559–7576. [Google Scholar] [CrossRef] [PubMed]
  21. Lin, L.; Zhang, J.; Liu, J. Actionlet-dependent contrastive learning for unsupervised skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2363–2372. [Google Scholar]
  22. Zhou, H.; Liu, Q.; Wang, Y. Learning discriminative representations for skeleton based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10608–10617. [Google Scholar]
  23. Do, J.; Kim, M. Skateformer: Skeletal-temporal transformer for human action recognition. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 401–420. [Google Scholar]
  24. Long, N.H.B. Step catformer: Spatial-temporal effective body-part cross attention transformer for skeleton-based action recognition. arXiv 2023, arXiv:2312.03288. [Google Scholar]
  25. Xiang, W.; Li, C.; Zhou, Y.; Wang, B.; Zhang, L. Generative action description prompts for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 10276–10285. [Google Scholar]
  26. Wang, Y.; Wu, Y.; He, W.; Guo, X.; Zhu, F.; Bai, L.; Zhao, R.; Wu, J.; He, T.; Ouyang, W. Hulk: A universal knowledge translator for human-centric tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 5672–5689. [Google Scholar] [CrossRef] [PubMed]
  27. Qin, Z.; Liu, Y.; Perera, M.; Gedeon, T.; Ji, P.; Kim, D.; Anwar, S. Anubis: Skeleton action recognition dataset, review, and benchmark. arXiv 2022, arXiv:2205.02071. [Google Scholar] [CrossRef]
  28. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-based action recognition with directed graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7912–7921. [Google Scholar]
  29. Qin, Z.; Liu, Y.; Ji, P.; Kim, D.; Wang, L.; McKay, R.I.; Anwar, S.; Gedeon, T. Fusing higher-order features in graph neural networks for skeleton-based action recognition. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 4783–4797. [Google Scholar] [CrossRef] [PubMed]
  30. Sun, H.; Wen, Y.; Feng, H.; Zheng, Y.; Mei, Q.; Ren, D.; Yu, M. Unsupervised bidirectional contrastive reconstruction and adaptive fine-grained channel attention networks for image dehazing. Neural Netw. 2024, 176, 106314. [Google Scholar] [CrossRef] [PubMed]
  31. Hu, K.; Shen, C.; Wang, T.; Shen, S.; Cai, C.; Huang, H.; Xia, M. Action Recognition Based on Multi-Level Topological Channel Attention of Human Skeleton. Sensors 2023, 23, 9738. [Google Scholar] [CrossRef] [PubMed]
  32. Liu, D.; Xu, H.; Wang, J.; Lu, Y.; Kong, J.; Qi, M. Adaptive attention memory graph convolutional networks for skeleton-based action recognition. Sensors 2021, 21, 6761. [Google Scholar] [CrossRef] [PubMed]
  33. Plizzari, C.; Cannici, M.; Matteucci, M. Skeleton-based action recognition via spatial and temporal transformer networks. Comput. Vis. Image Underst. 2021, 208, 103219. [Google Scholar] [CrossRef]
  34. Li, C.; Zhong, Q.; Xie, D.; Pu, S. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv 2018, arXiv:1804.06055. [Google Scholar] [CrossRef]
  35. Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1227–1236. [Google Scholar]
  36. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3595–3603. [Google Scholar]
  37. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12026–12035. [Google Scholar]
  38. Zhang, P.; Lan, C.; Zeng, W.; Xing, J.; Xue, J.; Zheng, N. Semantics-guided neural networks for efficient skeleton-based human action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1112–1121. [Google Scholar]
  39. Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 183–192. [Google Scholar]
  40. Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; Ouyang, W. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 143–152. [Google Scholar]
  41. Song, Y.-F.; Zhang, Z.; Shan, C.; Wang, L. Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1474–1488. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Action recognition network framework for UAV ground crew marshalling command recognition. The input skeleton sequence is decomposed into five parallel feature streams (Joint, Joint_vel, Bone, Bone_vel, Angle) by the feature modeling module. These feature streams are concatenated with one-hot encoded joint type information at the fusion module F, followed by spatial modeling through three ARGCN layers. Subsequently, the temporal modeling module captures temporal dynamics by incorporating one-hot encoded frame indices. Finally, action classification results are obtained after SMP and TMP pooling operations. Arrows in the diagram represent the direction of data flow through the network.
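The five parallel streams named in the Figure 1 caption (Joint, Joint_vel, Bone, Bone_vel, Angle) are standard skeleton-feature derivations. The sketch below is illustrative only, not the authors' implementation: the (T, V, C) array layout, the zero-padding of the final velocity frame, and the edge-list format are assumptions.

```python
import numpy as np

def feature_streams(x, bones):
    """Derive the velocity and bone streams from a joint-coordinate sequence.

    x     : (T, V, C) array -- T frames, V joints, C coordinate dims.
    bones : list of (child, parent) joint-index pairs defining the skeleton.
    """
    # joint velocity: frame-to-frame difference, zero-padded to keep length T
    joint_vel = np.concatenate([x[1:] - x[:-1], np.zeros_like(x[:1])], axis=0)
    # bone vectors: each child joint minus its parent joint
    bone = np.zeros_like(x)
    for child, parent in bones:
        bone[:, child] = x[:, child] - x[:, parent]
    bone_vel = np.concatenate([bone[1:] - bone[:-1], np.zeros_like(bone[:1])], axis=0)
    return {"joint": x, "joint_vel": joint_vel, "bone": bone, "bone_vel": bone_vel}
```

The angle stream (Figure 2) is derived separately from pairs of such bone vectors.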
Figure 2. Illustration of angular encoding. The yellow dashed arrows indicate the angles of different features.
Figure 3. Natural connections between joints (left) and adaptive topology (right).
Figure 4. Process of data-driven adaptive graph generation. The ⊗ symbol denotes element-wise multiplication.
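Figure 4 does not fully specify the generation module here, so the following is a hedged sketch in the spirit of adaptive graph convolutions such as 2s-AGCN [37]: the projections `W_theta`/`W_phi`, the learnable gate `M`, and the combination rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_graph(x, W_theta, W_phi, A_static, M):
    """Data-driven adjacency from joint-embedding similarity.

    x              : (V, C) pooled per-joint features.
    W_theta, W_phi : (C, d) embedding projections.
    A_static       : (V, V) natural-connection adjacency (Figure 3, left).
    M              : (V, V) learnable gate applied element-wise
                     (the circled-times operation in Figure 4).
    """
    theta, phi = x @ W_theta, x @ W_phi        # (V, d) joint embeddings
    A_data = softmax(theta @ phi.T, axis=-1)   # row-normalised similarity graph
    return A_static * M + A_data               # gated static + dynamic topology
```

Each row of `A_data` sums to one, so the data-driven term acts as a learned attention distribution over all joints, complementing the fixed skeletal connections.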
Figure 5. Adaptive Refined Feature Activation mechanism. The input feature U ∈ ℝ^{C×T×V} undergoes global average pooling to obtain a C×1×1 channel descriptor. The ICP module generates a global channel descriptor U_gc and a local channel descriptor U_lc (both of dimension C×1) through diagonal-matrix (diag) and band-matrix (band) operations, respectively. The β-DF module constructs a C×C cross-correlation matrix via the matrix outer product. Two parallel branches perform row-wise summation on this matrix and its transpose, followed by Sigmoid activation, and the results are adaptively fused through a learnable parameter β. Finally, a C×1×1 channel attention weight is output.
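Following only what the Figure 5 caption states, ARFAM's forward pass can be sketched as below. This is a minimal numpy sketch under assumptions: the band-matrix width, any normalisation, and how β is learned are not specified in the caption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def arfam(U, W_diag, W_band, beta):
    """Channel-attention sketch following the Figure 5 caption.

    U      : (C, T, V) input feature map.
    W_diag : (C,) weights of the diagonal matrix -> global descriptor U_gc.
    W_band : (C, C) band matrix (non-zeros near the diagonal) -> local U_lc.
    beta   : learnable scalar fusing the two branch outputs.
    Returns a (C,) channel attention weight vector.
    """
    u = U.mean(axis=(1, 2))              # global average pooling -> (C,)
    u_gc = W_diag * u                    # ICP, diag branch: per-channel scaling
    u_lc = W_band @ u                    # ICP, band branch: local channel mixing
    R = np.outer(u_gc, u_lc)             # beta-DF: (C, C) cross-correlation matrix
    a = sigmoid(R.sum(axis=1))           # branch 1: row-wise sum of R
    b = sigmoid(R.T.sum(axis=1))         # branch 2: row-wise sum of R transpose
    return beta * a + (1.0 - beta) * b   # adaptive fusion by beta
```

The returned weights would rescale the C channels of U, as is typical for channel-attention blocks.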
Figure 6. Architecture of the Adaptive Graph Convolutional Network (AGCN). The ⊗ symbol denotes element-wise multiplication.
Figure 7. Illustration of the temporal convolution module.
Figure 8. Examples from the NTU-RGB+D 60/120 datasets, including raw samples and skeleton representations. Green dots represent body keypoints and red lines represent skeletal connections between adjacent keypoints.
Figure 9. Visualization of the model parameter size and Top-1 accuracy comparison on the NTU-RGB+D 60 dataset (X-Sub setting).
Figure 10. Confusion matrices of our algorithm.
Figure 11. Impact of network depth on performance and efficiency on the NTU-RGB+D 120 dataset.
Figure 12. Examples of UAV ground crew marshalling gestures and their meanings defined by ICAO standards.
Figure 13. Category distribution of the UAV ground crew marshalling action dataset.
Figure 14. Sample examples of UAV ground crew command gestures under different environmental conditions.
Figure 15. Basic interface of the UAV ground crew marshalling gesture annotation software.
Figure 16. Workflow of UAV ground crew marshalling gesture recognition.
Figure 17. Recognition results of UAV ground crew marshalling gestures.
Figure 18. Average confusion matrices for four groups of typical confusable action pairs under the HO setting.
Figure 19. Examples of recognition results for UAV ground crew marshalling gestures under multiple viewpoints, multiple subjects, and varying distances.
Figure 20. Normalized adjacency matrices for representative ground crew marshalling gestures.
Figure 21. Examples of human body proportion in images at different distances.
Figure 22. Examples of recognition results under different human body proportions.
Figure 23. Recognition results under different environmental conditions. Green points represent directly detected keypoints, while yellow points indicate estimated keypoints for occluded body parts.
Figure 24. Test scenario and running results.
Table 1. Comparison of the proposed method with mainstream approaches on the NTU-RGB+D 60 dataset.
| Methods | X-Sub (%) | X-View (%) |
|---|---|---|
| ST-GCN [7] | 81.5 | 88.3 |
| ST-TR [33] | 89.3 | 94.3 |
| HCN [34] | 86.5 | 91.1 |
| AGC-LSTM [35] | 87.5 | 93.5 |
| AS-GCN [36] | 86.8 | 94.2 |
| 2S-AGCN [37] | 88.5 | 92.9 |
| SGN [38] | 88.4 | 93.8 |
| Ours | 89.4 | 94.2 |
Table 2. Ablation study on multi-order feature combinations.
| Joint | Joint_vel | Bone | Bone_vel | Angle | NTU 60 X-Sub (%) | NTU 60 X-View (%) | NTU 120 X-Sub (%) | NTU 120 X-Set (%) | FLOPs (M) |
|---|---|---|---|---|---|---|---|---|---|
| ✓ | – | – | – | – | 73.6 | 83.2 | 77.0 | 75.6 | 11.2 |
| ✓ | ✓ | – | – | – | 86.9 | 92.7 | 79.9 | 81.7 | 22.6 |
| – | – | ✓ | – | – | 65.3 | 69.9 | 71.1 | 73.5 | 11.3 |
| – | – | ✓ | ✓ | – | 86.5 | 89.6 | 74.8 | 79.5 | 22.6 |
| ✓ | – | ✓ | – | – | 86.0 | 92.1 | 79.2 | 81.0 | 22.5 |
| ✓ | ✓ | ✓ | – | – | 89.1 | 93.5 | 80.3 | 82.0 | 33.9 |
| ✓ | ✓ | ✓ | ✓ | – | 89.2 | 93.8 | 80.9 | 82.2 | 45.1 |
| ✓ | ✓ | ✓ | ✓ | ✓ | 89.4 | 94.2 | 81.7 | 83.3 | 56.5 |
Table 3. Comparison of different angular encodings on the NTU-RGB+D 60 dataset.
| Methods | X-Sub (%) | X-View (%) | FLOPs (M) |
|---|---|---|---|
| Joint | 73.6 | 83.2 | 11.2 |
| +Local | 83.1 | 89.7 | 22.5 |
| +Center | 82.8 | 89.6 | 22.7 |
| +Pair | 83.4 | 89.7 | 22.8 |
| +All | 83.7 | 89.9 | 23.1 |
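The Local, Center, and Pair encodings compared in Table 3 all reduce to measuring the angle at a joint with respect to two reference joints; only the choice of references differs. A hedged sketch (the reference-joint choices in the docstring are illustrative, not the paper's exact definitions):

```python
import numpy as np

def joint_angle(x, j, a, b, eps=1e-6):
    """Angle at joint j subtended by reference joints a and b.

    x : (T, V, 3) joint coordinates.  Picking (a, b) as j's skeletal
    neighbours yields a local angle; as central trunk joints, a
    center-style angle; as a fixed joint pair (e.g. the two hands),
    a pair-style angle.  Returns (T,) angles in radians.
    """
    u = x[:, a] - x[:, j]
    v = x[:, b] - x[:, j]
    cos = (u * v).sum(-1) / (
        np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1) + eps)
    # clip guards against floating-point values slightly outside [-1, 1]
    return np.arccos(np.clip(cos, -1.0, 1.0))
```

Because angles are invariant to translation and scale, such features are naturally robust to subject position and body size.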
Table 4. Performance improvement analysis of action recognition based on angle encoding and comparison with similar actions.
| Action | Joint (%) | +All (%) | Acc_Improve (%) | Similar Action |
|---|---|---|---|---|
| A44: headache | 37.7 | 75.4 | 37.7 | A47: neck pain |
| A41: sneeze/cough | 42.0 | 68.8 | 26.8 | A37: wipe face |
| A5: drop | 55.6 | 81.5 | 25.9 | A24: kicking something |
| A33: check time | 59.8 | 85.5 | 25.7 | A39: put palms together |
| A10: clapping | 44.3 | 67.8 | 23.5 | A34: rub two hands |
| A12: writing | 24.6 | 45.6 | 23.0 | A11: reading |
| A45: chest pain | 69.5 | 88.8 | 19.3 | A46: back pain |
| A42: staggering | 74.6 | 93.8 | 19.2 | A51: kicking |
| A13: tear up paper | 64.9 | 84.1 | 19.2 | A11: reading |
| A6: pick up | 78.6 | 97.1 | 18.5 | A16: put on a shoe |
Table 5. Ablation study on adaptive refined feature activation mechanism and semantic information.
| Attention Mechanism | Joint Type | Frame Index | FLOPs (M) | NTU 60 X-Sub (%) | NTU 60 X-View (%) | NTU 120 X-Sub (%) | NTU 120 X-Set (%) |
|---|---|---|---|---|---|---|---|
| w/o | – | – | 53.4 | 88.5 | 92.9 | 79.8 | 81.6 |
| GCN1(ARFAM) | – | – | 54.3 | 88.7 | 93.1 | 80.4 | 81.8 |
| GCN2(ARFAM) | – | – | 54.3 | 88.6 | 92.9 | 80.2 | 81.7 |
| GCN3(ARFAM) | – | – | 54.3 | 88.6 | 93.0 | 80.3 | 81.7 |
| GCN(ARFAM) | – | – | 55.9 | 88.9 | 93.1 | 80.4 | 82.0 |
| SA | – | – | 61.5 | 88.1 | 92.4 | 79.3 | 81.0 |
| MHSA | – | – | 62.0 | 88.5 | 92.8 | 79.9 | 81.6 |
| w/o | ✓ | ✓ | 54.0 | 88.9 | 93.8 | 80.9 | 82.8 |
| GCN1(ARFAM) | ✓ | ✓ | 54.9 | 89.3 | 94.2 | 81.6 | 83.2 |
| GCN2(ARFAM) | ✓ | ✓ | 54.9 | 89.0 | 93.9 | 81.6 | 83.0 |
| GCN3(ARFAM) | ✓ | ✓ | 54.9 | 89.1 | 94.1 | 81.4 | 82.9 |
| GCN(ARFAM) | ✓ | – | 56.2 | 89.2 | 93.8 | 81.1 | 82.7 |
| GCN(ARFAM) | – | ✓ | 56.2 | 89.1 | 93.6 | 80.9 | 82.5 |
| GCN(ARFAM) | ✓ | ✓ | 56.5 | 89.4 | 94.2 | 81.7 | 83.3 |
Table 6. Experimental results of UAV ground crew action dataset under X-Sub setting.
| Methods | Accuracy (%) | Jaccard (%) | F1-Score (%) |
|---|---|---|---|
| ST-GCN | 87.76 | 80.62 | 87.52 |
| Shift-GCN | 86.19 | 80.15 | 85.01 |
| 2s-AGCN | 88.30 | 82.88 | 89.29 |
| MS-G3D | 89.15 | 83.21 | 88.67 |
| CTR-GC | 89.47 | 83.58 | 89.04 |
| SGN | 89.82 | 83.89 | 89.35 |
| EfficientGCN | 89.56 | 83.74 | 89.18 |
| GAP | 89.67 | 82.17 | 88.92 |
| STEP-CATFormer | 88.93 | 81.45 | 88.26 |
| Ours | 90.71 | 84.32 | 90.13 |
Table 7. Experimental results of UAV ground crew action dataset under HO setting.
| Methods | Accuracy (%) | Jaccard (%) | F1-Score (%) |
|---|---|---|---|
| ST-GCN | 92.96 | 89.67 | 91.25 |
| Shift-GCN | 92.83 | 88.59 | 91.04 |
| 2s-AGCN | 94.12 | 91.40 | 94.38 |
| MS-G3D | 94.58 | 91.85 | 94.72 |
| CTR-GC | 94.73 | 92.08 | 94.91 |
| SGN | 95.21 | 92.73 | 95.47 |
| EfficientGCN | 94.89 | 92.35 | 94.68 |
| GAP | 95.08 | 92.21 | 95.13 |
| STEP-CATFormer | 94.35 | 91.58 | 94.52 |
| Ours | 96.09 | 93.62 | 96.22 |
Table 8. Recognition performance under different sequence lengths (HO setting).
| Sequence Length (Frames) | Accuracy (%) | Jaccard (%) | F1-Score (%) |
|---|---|---|---|
| 15 | 94.67 | 91.12 | 94.55 |
| 20 | 96.09 | 93.62 | 96.22 |
| 25 | 95.78 | 93.18 | 95.89 |
| 30 | 95.53 | 92.87 | 95.64 |
Table 9. Recognition performance under different execution speeds (HO setting).
| Speed Multiplier | Frame Interval (Δ) | Accuracy (%) | Jaccard (%) | F1-Score (%) |
|---|---|---|---|---|
| 1× | 1 | 96.09 | 93.62 | 96.22 |
| 1.5× | 2 | 95.47 | 92.85 | 95.61 |
| 2× | 3 | 94.13 | 91.38 | 94.32 |
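The speed variants in Table 9 amount to subsampling the input sequence with frame interval Δ. A trivial sketch, assuming frames are stored as an ordered Python sequence (the function name and interface are illustrative):

```python
def resample_speed(frames, delta):
    """Simulate a faster execution speed by keeping every delta-th frame.

    frames : ordered sequence of skeleton frames.
    delta  : frame interval (the Delta of Table 9); delta = 1 keeps
             the original 1x speed, delta = 2 roughly doubles it.
    """
    return frames[::delta]
```

Subsampling shortens the sequence in proportion to Δ, which is why larger intervals stress the temporal model and accuracy degrades gracefully from 96.09% to 94.13%.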
Table 10. Recognition performance under different human body proportions (HO setting).
| Body Proportion | Corresponding Distance | Accuracy (%) | Jaccard (%) | F1-Score (%) |
|---|---|---|---|---|
| Original dataset (34–68%) | 3–6 m | 96.09 | 93.62 | 96.22 |
| 25% | ~9 m | 95.83 | 93.21 | 95.94 |
| 20% | ~11 m | 95.41 | 92.78 | 95.53 |
| 15% | ~14 m | 94.25 | 91.36 | 93.81 |
Table 11. Performance comparison under different training strategies (X-Sub setting).
| Environmental Condition | Test Acc. (%) | Test Jac. (%) | Test F1 (%) | Test + Train Acc. (%) | Test + Train Jac. (%) | Test + Train F1 (%) |
|---|---|---|---|---|---|---|
| Original | 90.71 | 84.32 | 90.13 | 90.92 | 84.58 | 90.38 |
| Illumination | 89.78 | 83.15 | 89.05 | 90.85 | 84.49 | 90.29 |
| Illumination + Haze | 89.82 | 83.21 | 89.11 | 90.88 | 84.54 | 90.33 |
| Illumination + Rain | 89.95 | 83.38 | 89.26 | 90.91 | 84.57 | 90.36 |
| Shadow | 89.32 | 82.56 | 88.64 | 90.52 | 84.08 | 89.94 |
| Occlusion | 87.64 | 80.87 | 87.21 | 89.91 | 83.42 | 89.26 |
Table 12. Performance comparison under different training strategies (HO setting).
| Environmental Condition | Test Acc. (%) | Test Jac. (%) | Test F1 (%) | Test + Train Acc. (%) | Test + Train Jac. (%) | Test + Train F1 (%) |
|---|---|---|---|---|---|---|
| Original | 96.09 | 93.62 | 96.22 | 96.25 | 93.82 | 96.40 |
| Illumination | 95.15 | 92.48 | 95.28 | 96.17 | 93.71 | 96.31 |
| Illumination + Haze | 95.19 | 92.54 | 95.33 | 96.19 | 93.74 | 96.34 |
| Illumination + Rain | 95.26 | 92.65 | 95.41 | 96.22 | 93.79 | 96.38 |
| Shadow | 94.87 | 92.03 | 94.95 | 95.85 | 93.22 | 95.96 |
| Occlusion | 93.21 | 89.76 | 93.18 | 95.18 | 92.39 | 95.16 |

Share and Cite

MDPI and ACS Style

Zhou, Q.; Dong, L.; Zhang, Z.; Xu, Y.; Xiao, F.; Wang, Y. Adaptive Refined Graph Convolutional Action Recognition Network with Enhanced Features for UAV Ground Crew Marshalling. Drones 2025, 9, 819. https://doi.org/10.3390/drones9120819
