Article

MSG-GCN: Multi-Semantic Guided Graph Convolutional Network for Human Overboard Behavior Recognition in Maritime Drone Systems

School of Electronics and Information, Northwestern Polytechnical University, Xi’an 710072, China
* Authors to whom correspondence should be addressed.
Drones 2025, 9(11), 768; https://doi.org/10.3390/drones9110768
Submission received: 28 September 2025 / Revised: 2 November 2025 / Accepted: 4 November 2025 / Published: 6 November 2025

Highlights

What are the main findings?
  • Enhanced discrimination between similar actions through multi-semantic guidance and fine-grained modeling.
  • Development of a lightweight action recognition network via a concise and efficient hierarchical design.
What are the implications of the main findings?
  • This study provides an action recognition algorithm suitable for complex environments, offering high accuracy and robustness.
  • The lightweight design enables deployment on resource-constrained platforms such as unmanned aerial vehicles.

Abstract

Drones are increasingly being used in maritime engineering for ship maintenance, emergency rescue, and safety monitoring tasks. In these tasks, action recognition is important for human–drone interaction and for detecting abnormal situations such as falls or distress signals. However, the maritime environment is highly challenging, with illumination variations, water spray, and dynamic backgrounds often leading to ambiguity between similar actions. To address this issue, we propose MSG-GCN, a multi-semantic guided graph convolutional network for human action recognition. Specifically, MSG-GCN integrates structured prior semantic information and further introduces a textual–semantic alignment mechanism to improve the consistency and expressiveness of multimodal features. Benefiting from its lightweight hierarchical design, our model offers excellent deployment flexibility, making it well suited for resource-constrained UAV applications. Experimental results on large-scale benchmark datasets, including NTU60, NTU120, and UAV-Human, demonstrate that MSG-GCN surpasses state-of-the-art methods in both classification accuracy and computational efficiency.

1. Introduction

Recent advances at the intersection of unmanned aerial vehicles (UAVs) and deep learning have enabled a wide range of intelligent applications, including target tracking [1,2], coordinated swarm operations [3,4], and autonomous navigation [5]. In maritime engineering, these technologies play an increasingly important role, supporting tasks such as vessel inspection, port maintenance, and safety monitoring [6]. In these scenarios, accurate recognition of human actions is crucial for identifying abnormal events, such as falls or emergency gestures, and ensuring timely response. However, the harsh and dynamic marine environment, with challenges like unstable illumination and wave interference, often leads to confusion between visually similar actions. Recent advances in skeleton-based action recognition provide a promising solution, as pose-driven representations focus on motion dynamics rather than appearance, offering greater robustness under complex conditions.
In early studies, researchers attempted to describe human motion patterns by manually designing features [7] such as joint angles and limb length ratios, encompassing both static [8] and dynamic [9] characteristics. Although these approaches were effective to some extent, they heavily relied on expert knowledge and exhibited limited generalization capability. In recent years, with the advancement of deep learning techniques, neural network–based methods have been introduced, enabling the application of powerful automatic feature extraction tools to skeleton-based action recognition. Among them, methods based on recurrent neural networks (RNNs) [10,11,12,13,14] and convolutional neural networks (CNNs) [15,16,17,18] have achieved promising results in action recognition tasks. However, since human skeletal data is essentially non-Euclidean with an inherent topological structure, conventional grid-based processing methods struggle to effectively model the complex inter-joint relationships.
To address this issue, researchers have gradually introduced graph structures to represent the human skeleton in a more natural manner and proposed novel methods based on graph convolutional networks (GCNs) [19,20,21,22,23,24,25,26]. GCNs are capable of performing convolution directly on graph-structured data, effectively capturing spatial adjacency relationships among joints as well as motion coordination patterns. By constructing a human graph in which joints are represented as vertices and bones as edges, GCNs enable efficient feature propagation and aggregation, thereby improving both the accuracy and robustness of action recognition.
However, most existing methods rely primarily on global skeletal structural features, with limited attention to fine-grained variations across different body parts during action execution. Consequently, local motion patterns are often inadequately captured, leading to confusion when distinguishing between similar actions. To address this issue, some studies have introduced more complex graph structures to model inter-joint relationships across spatial and temporal dimensions. Nevertheless, these approaches emphasize structural-level optimization while overlooking local semantic variations and multi-scale feature interactions, which hinders consistent and stable performance gains in fine-grained action recognition tasks.
To address the aforementioned limitations, we propose MSG-GCN, a multi-semantic guided action recognition network that explicitly and implicitly introduces semantic information to compensate for the deficiencies of existing approaches in local feature modeling and semantic utilization. As illustrated in Figure 1, the overall architecture sequentially captures spatial and temporal features, resulting in a concise yet efficient action recognition network. In the spatial module, we integrate multi-dimensional motion cues and structural semantics at the input stage to enhance dynamic feature representations, thereby better modeling the structural and dynamic relationships among local joints. Moreover, we introduce a text semantic contrastive learning module to supervise the training of the spatial feature extraction module. This supervision is only applied during training and does not increase the computational cost at inference.
The main contributions of this work are summarized as follows:
  • We propose MSG-GCN, a novel framework that explicitly incorporates joint-type and frame-index semantics, and further integrates textual semantic supervision to achieve multi-level semantic alignment of skeletal features, thereby enhancing feature representation in both spatial and temporal dimensions.
  • We design a fine-grained modeling strategy that strengthens dynamic feature representations and leverages multi-scale spatial pooling to effectively fuse local patterns of different body parts with global action information, improving the discriminability and robustness of feature representations.
  • The proposed MSG-GCN is a lightweight graph convolutional network that, through multi-semantic guidance and multi-scale feature interaction, achieves state-of-the-art performance on four large-scale benchmark datasets while requiring fewer parameters.

2. Related Work

2.1. Human Action Recognition

In maritime emergency scenarios, such as man-overboard accidents, recognizing human behaviors accurately is crucial for effective search and rescue (SAR). Maritime drones provide an efficient observation platform, yet appearance-based action recognition methods often fail under challenging sea conditions due to low-resolution imagery and background interference. Skeleton-based action recognition offers a more robust alternative by representing human motions as structured joint data, but early methods relying on handcrafted features and traditional machine learning lacked robustness and generalization in complex open-sea environments.
Recurrent Neural Network based. Recurrent neural networks (RNNs), including LSTM [10] and GRU [11], have been widely adopted for temporal sequence modeling [27]. Du et al. [12] partitioned the human skeleton into five hierarchical components and applied a bidirectional RNN to each part before merging their outputs. To enhance temporal feature extraction, Lee et al. [13] introduced the Temporal Sliding LSTM (TS-LSTM), which incorporates multiple temporal window sizes to better capture time-varying dynamics in skeleton-based action recognition (SBAR). Wang [14] proposed a dual-stream design, where stacked RNNs are responsible for learning temporal patterns while hierarchical RNNs encode spatial configurations. Although these methods demonstrate viable solutions, RNN-based models still exhibit weak sensitivity to spatial topology, which limits their performance—especially in recognizing subtle drowning-related motions that depend on fine-grained coordination across body parts.
Convolutional Neural Network based. Originally developed for image-related tasks such as object detection [28], CNNs have been widely adopted in skeleton-based action recognition by converting skeleton sequences into pseudo-image formats to leverage convolutional architectures. Wang et al. [15] transformed 3D joint trajectories into three types of 2D images to encode spatiotemporal cues. Cao et al. [16] further examined multiple joint-ordering strategies for projecting frame-wise coordinates onto 2D pseudo-images. Li et al. [17] proposed a shape–motion representation to enhance the extraction of spatial and temporal dynamics. More recently, Duan et al. [18] introduced PoseC3D, a 3D CNN framework for skeleton-based action recognition (SBAR), which constructs 3D heatmap volumes by aggregating temporal sequences of 2D heatmaps to better capture joint–limb relationships. Despite these innovative designs, CNN-based pipelines inherently lose part of the temporal motion dependencies when skeleton data are reshaped into image-like forms. Such temporal loss is especially problematic in maritime search-and-rescue scenarios, where continuous drowning motions contain crucial discriminative patterns.
Graph convolutional network based. Graph convolutional networks naturally model the topological structure of the human skeleton, making them one of the most popular approaches for skeleton-based action recognition (SBAR) in recent years. Among the early studies using GCNs for SBAR, the most notable is ST-GCN by Yan et al. [19], which defines a spatial adjacency matrix for the human skeleton graph, enabling efficient modeling of human poses. Shi et al. [20] proposed the two-stream adaptive GCN (2s-AGCN), introducing an adaptive graph mechanism to optimize skeleton connection weights. MSG3D [21] addressed biased weighting issues by combining multi-scale graph convolutions with unified spatiotemporal convolutions. Methods such as CTR-GCN [22], KA-AGTN [23], and MT-GCN [24] further enhance performance by integrating attention mechanisms with multi-scale graph structures. Xu et al. [25] incorporated language prior knowledge to assist action recognition networks. Li et al. [26] introduced semantic information to construct multimodal recognition frameworks. In GCN-based methods, constructing skeleton feature structures with semantic significance is a key aspect of action recognition. This is particularly promising in maritime UAV applications, as GCNs can simultaneously capture global motion trends and local joint variations, which are essential for distinguishing between normal swimming and life-threatening drowning behaviors.

2.2. Prompt Learning

Prompt learning initially emerged in the field of natural language processing (NLP), aiming to guide pre-trained language models to better perform downstream tasks by designing or learning “prompts” within the input text, without requiring large-scale fine-tuning of model parameters.
In recent years, prompt learning has gradually expanded from purely textual modalities to vision–language joint modeling scenarios. For example, the CLIP model [29], pre-trained on large-scale image–text pairs, enables zero-shot image classification. However, its performance heavily depends on manually designed text prompt templates. To overcome this limitation, Zhou et al. [30] proposed CoOp, which represents prompts as learnable continuous vectors and significantly improves few-shot and zero-shot classification performance without modifying the CLIP backbone. Subsequently, methods such as Prompt Tuning [31] and Prefix Tuning [32] further explored the effectiveness of optimizing a small set of prompt vectors while keeping the backbone model parameters frozen.
In the field of multimodal action recognition, prompt learning has also begun to attract attention. Some studies have explored incorporating textual prompts into video understanding tasks to enhance semantic priors. For example, PromptCap [33] generates descriptive text using learnable prompts to assist video content comprehension, while SwinBERT [34] introduces dynamic prompts in vision–language pre-training to improve cross-modal alignment. Recent research has further investigated integrating prompt mechanisms with graph-structured data. For instance, GraphPrompt [35] introduces learnable node prompts within graph neural networks for graph classification tasks, and PHGNN [36] employs soft prompts to enhance GNN generalization on sparsely labeled graphs. Despite the significant progress of prompt learning in NLP and computer vision, its application in human action recognition remains in its early stages.

3. Method

We propose a multi-semantic guided action recognition network, and the overall end-to-end framework is illustrated in Figure 1. It primarily consists of a spatial module and a temporal module. Detailed descriptions of each component are provided in the following subsections.

3.1. Spatial Modules

Distinguishing similar actions fundamentally relies on identifying variations in the motion patterns of relevant joints. To this end, we first employ an enhanced dynamic representation module to perform fine-grained modeling of joint information.
Enhanced Dynamic Representation. We represent a skeleton sequence as a joint set $S = \{X_{t,k} \mid t = 1, 2, \ldots, T;\ k = 1, 2, \ldots, J\}$, where $X_{t,k}$ denotes the k-th joint at frame t. Here, T is the total number of frames in the sequence, and J is the number of human joints in each frame. For a joint $X_{t,k}$, its enhanced motion attributes are characterized by its 3D position $p_{t,k} = (x_{t,k}, y_{t,k}, z_{t,k})^{\mathsf{T}} \in \mathbb{R}^{3}$, velocity $v_{t,k} = p_{t,k} - p_{t-1,k}$, and fine-grained movement descriptors $m_{t,k} = p_{t,k} - r_{t,n}$ and $m'_{t,k} = v_{t,k} - r'_{t,n}$, where $r_{t,n}$ and $r'_{t,n}$ indicate the position and velocity, respectively, of the reference joint of the n-th body part at time t. As illustrated in Figure 2, the human body is partitioned into five parts, and each part is assigned a specific joint as its reference.
The position, velocity, and their fine-grained counterparts are projected into a shared high-dimensional feature space, yielding the embeddings $\tilde{p}_{t,k}$, $\tilde{v}_{t,k}$, $\tilde{m}_{t,k}$, and $\tilde{m}'_{t,k}$. These embeddings are then aggregated via element-wise summation:
$$z_{t,k} = \tilde{p}_{t,k} + \tilde{v}_{t,k} + \tilde{m}_{t,k} + \tilde{m}'_{t,k} \in \mathbb{R}^{C_1},$$
where $C_1$ indicates the dimensionality of the unified joint representation. Taking the velocity vector as an example, two fully connected (FC) layers are applied to obtain its embedding, which can be expressed as:
$$\tilde{v}_{t,k} = \sigma\big(W_2\,\sigma(W_1 v_{t,k} + b_1) + b_2\big),$$
where $W_1 \in \mathbb{R}^{C_1 \times 3}$ and $W_2 \in \mathbb{R}^{C_1 \times C_1}$ denote the weight matrices, $b_1$ and $b_2$ are the corresponding bias terms, and $\sigma$ represents the ReLU activation function.
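To make the enhanced dynamic representation concrete, the following PyTorch sketch builds the four motion descriptors and their summed embedding $z_{t,k}$. It is a minimal illustration rather than the authors' released code; the class names, the `ref_joints` argument, the embedding width, and the zero-velocity convention for the first frame are our assumptions.

```python
import torch
import torch.nn as nn

class JointEmbed(nn.Module):
    """Two fully connected layers with ReLU, matching the embedding equation above."""
    def __init__(self, in_dim=3, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim), nn.ReLU(),
        )

    def forward(self, x):                        # x: (B, T, J, in_dim)
        return self.net(x)

class EnhancedDynamicRepresentation(nn.Module):
    """Fuses position, velocity, and part-relative descriptors into z_{t,k}."""
    def __init__(self, ref_joints, out_dim=64):
        super().__init__()
        # ref_joints[k] = index of the reference joint of the body part containing joint k
        self.ref = list(ref_joints)
        self.embed_p  = JointEmbed(3, out_dim)   # position p_{t,k}
        self.embed_v  = JointEmbed(3, out_dim)   # velocity v_{t,k}
        self.embed_mp = JointEmbed(3, out_dim)   # position relative to the part reference
        self.embed_mv = JointEmbed(3, out_dim)   # velocity relative to the part reference

    def forward(self, p):                        # p: (B, T, J, 3) joint positions
        # frame-wise difference; the first frame gets zero velocity (an assumption)
        v = torch.cat([torch.zeros_like(p[:, :1]), p[:, 1:] - p[:, :-1]], dim=1)
        m_p = p - p[:, :, self.ref, :]           # fine-grained position descriptor
        m_v = v - v[:, :, self.ref, :]           # fine-grained velocity descriptor
        # element-wise summation of the four embeddings
        return self.embed_p(p) + self.embed_v(v) + self.embed_mp(m_p) + self.embed_mv(m_v)
```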
Multi-Head Attention Graph. To capture the correlations within skeleton data, we employ Graph Convolutional Network (GCN) layers for further feature extraction. Early studies typically constructed the adjacency matrix based on the natural connectivity of the human skeleton, which effectively leverages the inherent topological structure of the body but still suffers from certain limitations. Shi et al. [20] proposed an adjacency matrix computation method based on feature similarity, in which an adaptive adjacency matrix is constructed by exploiting the similarity of joint features in the embedding space, formulated as follows:
$$G(i, j) = \theta(z_{t,i})^{\mathsf{T}}\,\phi(z_{t,j}),$$
where $\theta(x) = W_3 x + b_3 \in \mathbb{R}^{C_2}$ and $\phi(x) = W_4 x + b_4 \in \mathbb{R}^{C_2}$ denote two transformations implemented by fully connected (FC) layers.
Inspired by the multi-head attention mechanism [37], we introduce a multi-head strategy into the adjacency matrix computation, where multiple independent attention heads are designed to enhance the model’s capability of representing spatial features. Specifically, each attention head computes its own similarity-based adjacency matrix $G_h$, and the head-wise matrices are fused into the final adjacency matrix as follows:
$$G = \sum_{h=1}^{H} \mu_h G_h,$$
where $H$ is the number of attention heads and $\mu_h$ is the weight associated with the h-th head.
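A minimal sketch of the multi-head adjacency construction is given below, assuming per-head linear embeddings for $\theta$ and $\phi$ and learnable head weights $\mu_h$ initialized uniformly; these choices, together with the class name, are illustrative assumptions rather than the paper's exact implementation. The final row-wise Softmax corresponds to the normalization described in the next paragraph.

```python
import torch
import torch.nn as nn

class MultiHeadAttentionGraph(nn.Module):
    """Builds a data-dependent adjacency matrix G as a weighted sum of H
    similarity-based graphs, one per attention head."""
    def __init__(self, in_dim, embed_dim=32, heads=4):
        super().__init__()
        self.heads = heads
        self.theta = nn.ModuleList([nn.Linear(in_dim, embed_dim) for _ in range(heads)])
        self.phi   = nn.ModuleList([nn.Linear(in_dim, embed_dim) for _ in range(heads)])
        self.mu    = nn.Parameter(torch.ones(heads) / heads)     # per-head weights mu_h

    def forward(self, z):                        # z: (B, T, J, in_dim)
        graphs = []
        for h in range(self.heads):
            q = self.theta[h](z)                 # (B, T, J, embed_dim)
            k = self.phi[h](z)
            graphs.append(torch.matmul(q, k.transpose(-1, -2)))  # (B, T, J, J)
        G = sum(self.mu[h] * graphs[h] for h in range(self.heads))
        return torch.softmax(G, dim=-1)          # row-normalized connection strengths
```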
GCNs. Before applying graph convolution, the semantic identity of each joint is integrated into its feature representation, allowing the model to exploit this prior knowledge during learning. In particular, a one-hot vector $j_k \in \mathbb{R}^{J}$ is used to indicate the type of the k-th joint, where the k-th entry is 1 and all remaining entries are 0. After mapping this vector into a high-dimensional embedding space, it is concatenated with the enhanced joint feature as
$$\hat{z}_{t,k} = \big[\,z_{t,k},\ \tilde{j}_k\,\big] \in \mathbb{R}^{2C_1}.$$
Consequently, the joint features within a single frame can be expressed as
$$Z_t = \big[\hat{z}_{t,1};\ \ldots;\ \hat{z}_{t,J}\big] \in \mathbb{R}^{J \times 2C_1}.$$
For the adjacency matrix at time t, $G_t \in \mathbb{R}^{J \times J}$, we apply a Softmax operation to convert the connection strengths between each node and all other nodes into a probability distribution, thereby ensuring weight normalization. We implement message passing among nodes using three graph convolutional layers. Inspired by the ResNet [38] architecture, residual connections are introduced between the GCN layers to stabilize the training process. A graph convolution layer with node-wise message passing is defined as
$$Y_t = G_t Z_t W_y, \qquad Z'_t = Y_t + Z_t W_x,$$
where $W_x$ and $W_y$ denote learnable transformation matrices shared across all time steps. By stacking three consecutive GCN layers, the joint-level features are progressively propagated and integrated.
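The sketch below illustrates how the joint-type embedding is concatenated once and then propagated through three residual GCN layers following $Y_t = G_t Z_t W_y$ and $Z'_t = Y_t + Z_t W_x$. Layer widths and class names are assumptions made for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One message-passing step with a residual branch: Y_t = G_t Z_t W_y, Z'_t = Y_t + Z_t W_x."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W_y = nn.Linear(in_dim, out_dim, bias=False)
        self.W_x = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, z, G):                     # z: (B, T, J, in_dim), G: (B, T, J, J)
        return torch.matmul(G, self.W_y(z)) + self.W_x(z)

class SpatialGCN(nn.Module):
    """Concatenates a joint-type embedding to each joint feature, then applies
    three stacked GCN layers that share the adjacency matrix across time steps."""
    def __init__(self, num_joints, feat_dim, hidden_dim):
        super().__init__()
        self.joint_embed = nn.Linear(num_joints, feat_dim)   # embeds one-hot joint types
        self.register_buffer("joint_onehot", torch.eye(num_joints))
        self.layers = nn.ModuleList([
            GCNLayer(2 * feat_dim, hidden_dim),
            GCNLayer(hidden_dim, hidden_dim),
            GCNLayer(hidden_dim, hidden_dim),
        ])

    def forward(self, z, G):                     # z: (B, T, J, feat_dim)
        j = self.joint_embed(self.joint_onehot)  # (J, feat_dim) joint-type semantics
        z = torch.cat([z, j.expand(*z.shape[:-1], -1)], dim=-1)
        for layer in self.layers:
            z = layer(z, G)
        return z
```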

3.2. Temporal Modules

After the spatial module, the one-hot encoding vector of the frame index $f_t \in \mathbb{R}^{T}$ is embedded as $\tilde{f}_t \in \mathbb{R}^{C_3}$ and added to the output of the spatial module:
$$Z'_t = Z_t + \tilde{f}_t \in \mathbb{R}^{C_3}.$$
Multi-scale spatial pooling is then applied to further aggregate joint information, and the training of the spatial module is supervised using a text-semantic contrastive learning approach.
Multi-Scale Spatial Pooling. As shown in Figure 1, we introduce a multi-scale spatial pooling module. For the input feature vectors, we first partition the features into five parts (left hand, right hand, left knee, right knee, and head & spine) following the segmentation shown in Figure 2. Max pooling is then performed along the skeleton dimension for each part. The resulting feature vectors of the five parts are concatenated and subsequently subjected to average pooling. Finally, this result is concatenated with the overall feature obtained by global max pooling and passed through a 1 × 1 convolutional layer (including batch normalization and ReLU) to produce the output.
Compared with single-scale pooling methods, this module better captures the local structural characteristics of actions while preserving global information, thereby enhancing the model’s ability to perceive complex human motion patterns. After the multi-scale pooling operation, three CNN layers are employed to model temporal dependencies, and the features are mapped to a higher-dimensional space to strengthen the model’s representational capacity. Finally, max pooling along the temporal dimension is applied to aggregate temporal information, and a fully connected layer is used to perform classification, yielding the action prediction results.
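One possible realization of the multi-scale spatial pooling step is sketched below: per-part max pooling, averaging of the part vectors, concatenation with a global max-pooled feature, and fusion through a 1 × 1 convolution with batch normalization and ReLU. The interpretation of the pooling order, the `part_indices` grouping (which follows Figure 2 only conceptually), and the class name are assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleSpatialPooling(nn.Module):
    """Pools joint features per body part (max), averages the part vectors,
    concatenates with a global max-pooled feature, and fuses with a 1x1 conv."""
    def __init__(self, channels, part_indices):
        super().__init__()
        # part_indices: five lists of joint indices (left hand, right hand,
        # left knee, right knee, head & spine); the exact grouping is assumed.
        self.parts = [list(p) for p in part_indices]
        self.fuse = nn.Sequential(
            nn.Conv1d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
        )

    def forward(self, z):                                   # z: (B, T, J, C)
        part_feats = [z[:, :, idx, :].amax(dim=2) for idx in self.parts]  # 5 x (B, T, C)
        local_feat = torch.stack(part_feats, dim=2).mean(dim=2)           # average over parts
        global_feat = z.amax(dim=2)                                       # global max pooling
        x = torch.cat([local_feat, global_feat], dim=-1)                  # (B, T, 2C)
        return self.fuse(x.transpose(1, 2)).transpose(1, 2)               # (B, T, C)
```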
Contrastive Learning Modules. In the proposed multi-semantic guided action recognition network, we design a text-assisted contrastive learning module. Specifically, for each action category, we employ the pretrained large language model Qwen3 [39] to generate detailed textual descriptions, including a paragraph-level description of the action label and additional descriptions of the involved body parts. Taking the action “standing” as an example, its generated description is illustrated in Figure 1. During training, we adopt the CLIP-ViT-B/32 [29] text encoder to convert all textual content into feature vectors, and both tokenization and preprocessing follow its official implementation.
Within the temporal modeling module, the multi-scale pooling block produces five feature vectors corresponding to different body parts. These vectors are projected into the same dimensional space as the text features through fully connected layers and are then paired with the semantic text features of the corresponding body parts for contrastive learning. The global skeleton feature is contrasted with the paragraph-level textual description. This design not only refines the learning of localized body-part features but also enables our model to develop a semantic understanding of the roles and interactions of different body parts in executing an action.
More concretely, for skeleton features s and text features t , we construct multiple positive and negative pairs within each training batch for joint optimization. Unlike CLIP, which follows a one-to-one image–text alignment, our setting does not enforce such a strict pairing. Therefore, we adopt the KL divergence loss for optimization:
$$\mathcal{L}_{con} = \frac{1}{2}\,\mathbb{E}_{(s,t) \sim D}\Big[\mathrm{KL}\big(p^{s2t}(s),\, y^{s2t}\big) + \mathrm{KL}\big(p^{t2s}(t),\, y^{t2s}\big)\Big],$$
where D denotes the entire dataset, $y^{s2t}$ and $y^{t2s}$ are the ground-truth similarity scores, and $p_i^{s2t}$ and $p_i^{t2s}$ are the predicted probabilities, which are calculated by the following equation:
$$p_i^{s2t}(s_i) = \frac{\exp\big(\mathrm{sim}(s_i, t_i)/\tau\big)}{\sum_{j=1}^{B}\exp\big(\mathrm{sim}(s_i, t_j)/\tau\big)}, \qquad p_i^{t2s}(t_i) = \frac{\exp\big(\mathrm{sim}(t_i, s_i)/\tau\big)}{\sum_{j=1}^{B}\exp\big(\mathrm{sim}(t_i, s_j)/\tau\big)},$$
where $\mathrm{sim}(t, s)$ represents the cosine similarity, $\tau$ is the temperature parameter, and B is the batch size. As described in Section 3.2, the human skeleton is divided into five body parts. We apply the contrastive loss function to both the features of different body parts and the global feature, thereby constructing a multi-part contrastive loss function as
$$\mathcal{L}_{con}^{multi} = \frac{1}{N+1}\sum_{n=1}^{N+1}\mathcal{L}_{con}^{(n)},$$
where N denotes the number of parts into which the human skeleton is divided.
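The skeleton–text contrastive objective can be sketched as follows. For simplicity we assume identity (in-batch) pairing as the ground-truth similarity target and implement each KL term as soft-target cross entropy, which differs from the KL divergence only by the constant entropy of the targets; the function names and the temperature default are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def kl_contrastive_loss(s, t, tau=0.07):
    """Symmetric skeleton-text contrastive loss for one body part (or the global feature).
    s, t: (B, D) skeleton / text features; targets are the in-batch identity pairing."""
    s, t = F.normalize(s, dim=-1), F.normalize(t, dim=-1)
    logits_s2t = s @ t.T / tau                     # cosine similarities scaled by temperature
    logits_t2s = t @ s.T / tau
    targets = torch.eye(s.size(0), device=s.device)
    loss_s2t = -(targets * F.log_softmax(logits_s2t, dim=1)).sum(1).mean()
    loss_t2s = -(targets * F.log_softmax(logits_t2s, dim=1)).sum(1).mean()
    return 0.5 * (loss_s2t + loss_t2s)

def multi_part_contrastive_loss(part_feats, part_texts, global_feat, global_text, tau=0.07):
    """Averages the contrastive loss over the N body-part pairs plus the global pair."""
    losses = [kl_contrastive_loss(ps, pt, tau) for ps, pt in zip(part_feats, part_texts)]
    losses.append(kl_contrastive_loss(global_feat, global_text, tau))
    return torch.stack(losses).mean()
```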

3.3. Loss Function

In skeleton-based action recognition, action sequences often exhibit both inter-frame similarities and differences along the temporal dimension. Traditional cross-entropy loss, built upon the independent and identically distributed assumption, only measures the alignment between the predictions and discrete labels, while neglecting the dynamic evolution and latent dependencies within the sequence. This limitation may cause the model to over-rely on a few highly discriminative frames, thereby weakening its overall temporal modeling capability. To address this issue, we adopt label smoothing loss as the loss function for the skeleton encoder:
$$\mathcal{L}_{lsc} = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\mathbb{1}(y_i = k)\left[(1-\alpha)\ln(P_{ik}) + \alpha\,\frac{1}{K}\sum_{j=1}^{K}\ln(P_{ij})\right],$$
where m denotes the number of samples, K is the number of classes, $y_i$ represents the ground-truth class of the i-th sample, $\mathbb{1}(\cdot)$ is the indicator function, $P_{ik}$ is the predicted probability that the i-th sample belongs to class k, and $\alpha$ is the smoothing coefficient, typically set to 0.1.
In summary, the overall loss function of this work can be expressed as follows:
$$\mathcal{L}_{total} = \mathcal{L}_{lsc} + \lambda\,\mathcal{L}_{con}^{multi},$$
where $\lambda$ is a tunable parameter used to balance the relative contributions of the skeleton encoder and the text encoder in the overall loss.
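Putting the two terms together, a minimal sketch of the overall objective is shown below. It reuses the `multi_part_contrastive_loss` helper from the earlier sketch and relies on PyTorch's built-in label-smoothing cross entropy as a standard realization of $\mathcal{L}_{lsc}$; the function signature is an assumption.

```python
from torch import nn

criterion_cls = nn.CrossEntropyLoss(label_smoothing=0.1)   # realizes L_lsc with alpha = 0.1

def total_loss(logits, labels, part_feats, part_texts, global_feat, global_text, lam=0.1):
    """L_total = L_lsc + lambda * L_con^multi; lambda = 0.1 performed best in the ablation."""
    l_lsc = criterion_cls(logits, labels)
    l_con = multi_part_contrastive_loss(part_feats, part_texts, global_feat, global_text)
    return l_lsc + lam * l_con
```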

4. Experiments

4.1. Datasets

NTU RGB+D (NTU60) [40]. This dataset is one of the largest multimodal 3D human motion capture datasets widely used in human action recognition research. It contains 56,880 action samples covering 60 different types of daily activities and motions performed by 40 distinct subjects. The actions range from simple gestures to complex interactive behaviors and were recorded using the Microsoft Kinect v2 sensor, providing multiple modalities including RGB videos, depth maps, skeletal joint positions, and infrared videos. Consequently, the dataset is not only suitable for vision-based action recognition methods but also supports research that leverages 3D skeletal information.
NTU120 RGB+D (NTU120) [41]. To further broaden the applicability and challenges of the NTU RGB+D dataset, researchers released its extended version, NTU RGB+D 120. This version significantly increases the diversity and complexity of action categories, containing 120 different types of activities with a total of 114,480 action samples. Similar to the original version, these samples include four modalities—RGB videos, depth maps, skeletal data, and infrared videos—but they were collected from a wider variety of scenes and participants, with greater variability and complexity across actions.
UAV-Human [42]. This dataset is a large benchmark for human behavior understanding with UAVs, which contains 67,428 skeleton sequences with 119 subjects, 155 action categories, and 17 joints. It is collected in a variety of urban and rural districts in both daytime and nighttime over three months, covering multiple subjects, backgrounds, illuminations, weathers, occlusions, camera motions, and UAV flying attitudes. We follow the two protocols defined in the original paper, cross-subject-v1 (CSv1) and cross-subject-v2 (CSv2), where the IDs for training and testing subjects differ for different cross-subject protocols. The testing subjects in CSv1 contain more noise than those in CSv2, which makes CSv1 a much more challenging protocol.
MOBDrone [43]. MOBDrone is a large-scale dataset designed for maritime man-overboard (MOB) detection and search-and-rescue tasks. It is constructed from 66 high-definition (1080P) video sequences captured by an unmanned aerial vehicle (UAV) flying at altitudes ranging from 10 m to 60 m above the sea surface. From these videos, a total of 126,170 static frames are extracted, with more than 180,000 manually annotated bounding boxes covering five object categories: wood, lifebuoy, surfboard, vessel, and human. The dataset spans diverse flight altitudes, camera viewing angles, and complex sea surface environments, providing rich visual variability and challenging conditions that closely reflect real-world maritime scenarios. Owing to these characteristics, MOBDrone serves as an ideal benchmark dataset for evaluating action recognition models deployed on UAV platforms operating in maritime environments.

4.2. Implementation Details

Data Processing. Similar to [44], in cases where a single frame contains two persons, the data are separated into two frames, each containing the action data of one individual. During training, following [45], each skeleton sequence is evenly divided into 20 segments, and one frame is randomly selected from each segment to form a new sequence of 20 frames. In addition, we apply rotational augmentation to the skeleton sequences during training to enhance the model’s robustness to viewpoint variations. For the CS (cross-subject) setting, we randomly rotate the skeletons within the range of $[-17°, 17°]$ along the X, Y, and Z axes. For the CV (cross-view) setting, where viewpoint variation is more significant, the rotation range is increased to $[-30°, 30°]$.
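The segment-based frame sampling and rotational augmentation described above can be sketched as follows (NumPy); the function names and the handling of sequences shorter than 20 frames are assumptions.

```python
import numpy as np

def sample_segments(seq, num_segments=20):
    """Evenly split a skeleton sequence (T, J, 3) into segments and pick one
    random frame per segment, yielding a fixed-length 20-frame sequence."""
    T = seq.shape[0]
    bounds = np.linspace(0, T, num_segments + 1).astype(int)
    idx = [np.random.randint(lo, max(lo + 1, hi)) for lo, hi in zip(bounds[:-1], bounds[1:])]
    return seq[idx]

def random_rotate(seq, max_deg=17.0):
    """Rotate all joints by random angles in [-max_deg, max_deg] about the X, Y,
    and Z axes (17 degrees for cross-subject, 30 degrees for cross-view)."""
    ax, ay, az = np.deg2rad(np.random.uniform(-max_deg, max_deg, size=3))
    Rx = np.array([[1, 0, 0], [0, np.cos(ax), -np.sin(ax)], [0, np.sin(ax), np.cos(ax)]])
    Ry = np.array([[np.cos(ay), 0, np.sin(ay)], [0, 1, 0], [-np.sin(ay), 0, np.cos(ay)]])
    Rz = np.array([[np.cos(az), -np.sin(az), 0], [np.sin(az), np.cos(az), 0], [0, 0, 1]])
    return seq @ (Rz @ Ry @ Rx).T
```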
Training. All experiments were conducted on the PyTorch 1.11.0 platform using a single NVIDIA GeForce RTX 4090 GPU. We employed the Adam optimizer with an initial learning rate of 0.001, which was decayed by a factor of 10 at epochs 60, 90, and 110. The maximum number of training epochs was set to 120, with a weight decay of 0.0001 and a batch size of 64. The smoothing factor for the label smoothing loss was set to 0.1.
To ensure experimental reliability, each experiment was independently repeated five times with different random seeds. The dataset was strictly divided into training, validation, and testing subsets to avoid any data leakage. The validation set was used solely for model selection and hyperparameter tuning, while the test set was reserved for final performance evaluation. All training hyperparameters were kept consistent across different datasets for fair comparison.
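For reference, a condensed sketch of the training schedule is given below, showing only the classification branch for brevity (the contrastive term of Section 3.3 would be added to the loss); `model` and `train_loader` are assumed to be the MSG-GCN network and a standard skeleton DataLoader.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def train_msg_gcn(model: nn.Module, train_loader: DataLoader, epochs: int = 120) -> None:
    """Adam (lr 1e-3, weight decay 1e-4), 10x decay at epochs 60/90/110, label smoothing 0.1."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 90, 110], gamma=0.1)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    for _ in range(epochs):
        for skeletons, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(skeletons), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
```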

4.3. Ablation Study

To validate the effectiveness of the proposed MSG-GCN, we conducted extensive experiments on the NTU RGB+D dataset. The effectiveness of each module was evaluated, where MS denotes the multi-semantic module, EDR stands for enhanced dynamic representation, MHAG represents the multi-head attention graph, and MSSP refers to the multi-scale spatial pooling.

4.3.1. Effectiveness of MSG-GCN

As shown in Table 1, the enhanced dynamic representation (EDR) improves accuracy from 86.8% to 87.8% with almost no increase in the number of parameters, demonstrating the effectiveness of fine-grained modeling of joint features. In contrast, introducing the MHAG module alone only raises the accuracy from 86.8% to 86.9%, indicating that this module contributes little in the absence of other auxiliary mechanisms. When the MSSP module is applied individually, the accuracy increases significantly to 87.8%, showing that the multi-scale pooling strategy effectively captures spatial information at different scales. With the inclusion of the MS module, accuracy further increases substantially to 89.4%, highlighting the significant advantage of the semantic guidance mechanism in enhancing feature representations. Finally, when all modules are combined, the accuracy reaches the highest value of 90.2%, fully validating the effectiveness of the proposed multi-semantic-guided action recognition algorithm.

4.3.2. Effectiveness of MS

To further evaluate the contribution of different semantic information within the multi-semantic (MS) module, we separately assessed the effects of joint type semantics (JT), frame index (FI), and text-assisted contrastive learning (Text) on action recognition performance. As shown in Table 2, incorporating either JT or FI leads to improvements in accuracy. When applied individually, the JT module improves accuracy by 1.9%, indicating that by modeling the physical meaning and motion characteristics of different joints, JT effectively enhances the spatial discriminability of features, enabling the model to more precisely distinguish different action patterns. The FI module alone improves accuracy by 1.0%, suggesting that by introducing temporal order priors, FI helps the model better capture the temporal evolution of actions, thereby enhancing the coherence of sequence modeling. When JT and FI are combined, the accuracy is further improved by 2.1%. With the introduction of the Text module, the accuracy reaches the highest value of 89.4%, achieving a 2.6% improvement over the baseline. The textual information provides additional high-level semantic guidance, enabling the model to leverage linguistic descriptions to form clearer decision boundaries between action categories during training, thus improving generalization ability.

4.3.3. Impact of Different MHAG Modes

To further investigate the impact of the number of heads in the multi-head attention graph (MHAG) module on model performance, we conducted a detailed experimental analysis. Table 3 presents the model accuracy under different multi-head configurations. The results show that recognition accuracy fluctuates as the number of heads increases. With 2 attention heads, the model achieved an accuracy of 90.0%. Increasing the number of heads to 4 improved the accuracy to 90.2%, indicating that additional heads capture richer local and global features and thereby enhance the model’s representational capacity. With 6 heads, the accuracy plateaued, suggesting diminishing returns: excessive attention heads introduce information redundancy and additional computational cost without providing complementary cues. Consistent with this, the accuracy dropped back to 90.0% with 8 heads.

4.3.4. Impact of Different MSSP Modes

To further investigate the influence of spatial pooling strategies on model performance, we evaluated the model under four distinct pooling configurations. As summarized in Table 4, the choice of pooling strategy had a significant impact on recognition accuracy. Without any pooling, the model achieved 89.2%, serving as a baseline for performance in the absence of an effective feature integration mechanism. Incorporating global max pooling improved accuracy to 89.7%, demonstrating that it effectively captures representative features across the entire spatial domain and enhances the model’s comprehension of overall patterns. Employing a fused local pooling strategy further increased accuracy to 89.8%. This approach enables the model to concentrate on the most informative cues within each sub-region, thereby strengthening local feature extraction and boosting overall recognition performance. The highest accuracy, 90.2%, was achieved by combining local and global pooling (Parts+Global). This hybrid strategy simultaneously reinforces critical local details via local pooling and preserves holistic structural information through global pooling, resulting in optimal performance.

4.3.5. Impact of Loss Function

To validate the impact of the loss function, we conducted comparative experiments. Replacing the label smoothing loss with the standard cross-entropy loss reduced the accuracy to 90.0%, a decrease of 0.2%, confirming the effectiveness of label smoothing in mitigating overfitting and enhancing generalization. Building on this, we further examined model accuracy under different settings of the loss-weight hyperparameter $\lambda$. We tested $\lambda \in \{0.1, 0.2, 0.4, 0.6, 0.8\}$, obtaining accuracies of 90.2%, 90.0%, 89.5%, 89.3%, and 89.2%, respectively. Figure 3 shows the sensitivity analysis of the weighting coefficient $\lambda$ and the learning curves during training. The results show that $\lambda = 0.1$ achieves the best performance.

4.3.6. Impact of Text-Encoder and Prompt

To validate the impact of the text-encoder and prompt, we compare three types of prompts with different levels of semantic richness, as shown in Table 5. The first prompt provides only a brief description of the action; the second introduces a more natural and contextualized description; and the third includes detailed body-part–level motion descriptions that closely align with the structure of the skeleton data. Across all settings, both BERT and CLIP encoders achieve stable performance, demonstrating that the proposed contrastive framework is relatively robust to prompt variations.

4.4. Visualization

Figure 4 presents the accuracy differences of several action categories between MSG-GCN and the baseline. It can be observed that our method better distinguishes similar actions such as “reading”, “clapping”, and “writing”. As shown in Figure 5, the proposed MSG-GCN demonstrates superiority in capturing the structure of dynamic adjacency matrices. In the “check time” action, our graph structure attends to the key interaction of the left hand reaching toward the right hand, and in the “clapping” action, it better captures the interaction between the left and right arms. In addition, we compared the feature representations of models trained with and without text semantic supervision using PCA, as shown in Figure 6. Compared to the case without the loss, the latent representations aligned with text semantics exhibit more clearly separated class-conditional distributions and more compact intra-class features. Furthermore, detailed confusion matrices of our model on multiple datasets are provided in Appendix A for a comprehensive evaluation.

4.5. Comparison with the State-of-the-Arts

To validate the effectiveness of the proposed model, we compared it against state-of-the-art approaches on four public benchmarks: NTU RGB+D, NTU RGB+D 120, UAV-Human, and MOBDrone.
As shown in Table 6, MSG-GCN achieves superior accuracy on the NTU-RGB+D dataset in both the cross-subject (CS) and cross-view (CV) settings. Specifically, it obtains 90.2% accuracy on CS and 95.6% accuracy on CV, outperforming most comparison methods. These results indicate that MSG-GCN can effectively capture critical spatiotemporal features of actions while maintaining strong generalization ability. Although some models such as MS-G3D, CTR-GCN, and LKA-GCN slightly outperform MSG-GCN in certain metrics, they rely on multi-branch ensemble strategies, which significantly increase computational complexity. In contrast, as a single-stream network, MSG-GCN surpasses the single-stream accuracies of other methods while avoiding complex model designs, thereby demonstrating greater application potential. In terms of parameter size, MSG-GCN requires only 1.88 M parameters, which is significantly lower than most comparison methods. This highlights that MSG-GCN achieves superior performance with extremely low parameter overhead, offering high model compactness and deployment friendliness.
As shown in Table 7, MSG-GCN also demonstrates outstanding performance on the more complex NTU RGB+D 120 dataset. Although this dataset contains a larger number of action categories and thus increases task difficulty, MSG-GCN still achieves 84.4% accuracy (CS) and 86.4% accuracy (CV). This result further verifies the effectiveness and robustness of MSG-GCN in handling large-scale action recognition tasks.

4.6. Validation of MSG-GCN for Maritime Search and Rescue Applications

To validate the effectiveness of the proposed MSG-GCN in real-world maritime search and rescue (SAR) scenarios, we conducted extensive experiments on the UAV-Human dataset, which provides large-scale skeleton-based action data captured from aerial perspectives. This dataset allows us to evaluate the robustness of our method under conditions that closely resemble UAV-based monitoring of maritime environments. As shown in Table 8, MSG-GCN achieves competitive performance compared with state-of-the-art skeleton-based action recognition models, while maintaining a lower parameter scale (1.88 M). In particular, our method reaches a Top-1 accuracy of 41.6%, outperforming classical baselines such as ST-GCN and 2s-AGCN, and showing comparable results to larger models such as MSSTNet.
Furthermore, we conducted extensive experiments on the MOBDrone dataset using the Jetson Orin platform to validate the effectiveness and robustness of the proposed MSG-GCN. Table 9 reports the accuracy and computational efficiency of MSG-GCN compared with existing compact baselines, where FLOPs/f denotes FLOPs per frame, Lat. represents inference latency, and Mem. refers to peak memory usage. Table 10 also presents the performance of MSG-GCN on classes related to overboard behaviors (as illustrated in Figure 7, including drowning suspected, swimming, and call for help). It can be observed that, benefiting from the efficient network design, MSG-GCN achieves high accuracy while maintaining low computational cost, demonstrating its suitability for deployment on resource-constrained platforms such as UAVs. In addition, as shown in Table 11, MSG-GCN maintains competitive accuracy even in scenarios with partially missing joints. Compared with other methods, MSG-GCN incorporates an enhanced dynamic representation module that fuses four complementary motion descriptors, enabling a more robust and reliable motion representation than using raw skeleton sequences alone.

5. Conclusions

In this paper, we have proposed MSG-GCN, a multi-semantic guided network for human overboard behavior recognition in maritime drone systems. By explicitly introducing semantic information of joint types and frame indices, and implicitly leveraging textual semantics for feature alignment, we constructed a concise and efficient action recognition framework. Through fine-grained feature modeling, MSG-GCN achieves higher accuracy in distinguishing similar actions, demonstrating its potential for application in complex marine environments. Experiments on four large-scale datasets effectively verify the superiority of the multi-semantic guided mechanism, providing a new paradigm for lightweight human action recognition.

Author Contributions

Conceptualization, R.H. and L.D.; methodology, R.H.; validation, R.H., L.D. and G.H.; resources, L.D.; data curation, R.H.; writing—original draft preparation, R.H.; writing—review and editing, L.D. and G.H.; visualization, R.H.; supervision, G.H.; project administration, G.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Aviation Science Foundation under Grant 2023Z073053008 and Grant D5120220246.

Data Availability Statement

The data used in this study are publicly available online at https://rose1.ntu.edu.sg/dataset/actionRecognition (accessed on 5 March 2025), https://github.com/SUTDCV/UAV-Human (accessed on 5 March 2025), and https://aimh.isti.cnr.it/dataset/mobdrone (accessed on 8 September 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Confusion Matrix

Figure A1. Confusion matrix and per-class F1-scores on the NTU60 dataset. Darker colors indicate higher values, while lighter colors indicate lower values.
Figure A2. Confusion matrix and per-class F1-scores on the UAV-Human dataset. Darker colors indicate higher values, while lighter colors indicate lower values.

References

  1. Li, B.; Yang, Z.P.; Chen, D.Q.; Liang, S.Y.; Ma, H. Maneuvering target tracking of UAV based on MN-DDPG and transfer learning. Def. Technol. 2021, 17, 457–466. [Google Scholar] [CrossRef]
  2. Zhang, H.; Li, B.; Huang, J.; Song, C.; He, P.; Neretin, E. A Parallel Multi-Demonstrations Generative Adversarial Imitation Learning Approach on UAV Target Tracking Decision. Chin. J. Electron. 2025, 34, 1185–1198. [Google Scholar] [CrossRef]
  3. Li, B.; Wang, J.; Song, C.; Yang, Z.; Wan, K.; Zhang, Q. Multi-UAV roundup strategy method based on deep reinforcement learning CEL-MADDPG algorithm. Expert Syst. Appl. 2024, 245, 123018. [Google Scholar] [CrossRef]
  4. Fu, X.; Sun, Y. A Combined Intrusion Strategy Based on Apollonius Circle for Multiple Mobile Robots in Attack-Defense Scenario. IEEE Robot. Autom. Lett. 2024, 10, 676–683. [Google Scholar] [CrossRef]
  5. Song, C.; Li, H.; Li, B.; Wang, J.; Tian, C. Distributed Robust Data-Driven Event-Triggered Control for QUAVs under Stochastic Disturbances. Def. Technol. 2025, in press. [Google Scholar] [CrossRef]
  6. Wang, J.; Zhou, K.; Xing, W.; Li, H.; Yang, Z. Applications, evolutions, and challenges of drones in maritime transport. J. Mar. Sci. Eng. 2023, 11, 2056. [Google Scholar] [CrossRef]
  7. Bobick, A.F.; Davis, J.W. The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 23, 257–267. [Google Scholar] [CrossRef]
  8. Guo, K.; Ishwar, P.; Konrad, J. Action recognition using sparse representation on covariance manifolds of optical flow. In Proceedings of the 2010 7th IEEE International Conference on Advanced Video and Signal Based Surveillance, Boston, MA, USA, 29 August–1 September 2010; pp. 188–195. [Google Scholar]
  9. Salmane, H.; Ruichek, Y.; Khoudour, L. Object tracking using Harris corner points based optical flow propagation and Kalman filter. In Proceedings of the 2011 14th International IEEE Conference on Intelligent Transportation Systems (ITSC), Washington, DC, USA, 5–7 October 2011; pp. 67–73. [Google Scholar]
  10. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  11. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar] [CrossRef]
  12. Du, Y.; Wang, W.; Wang, L. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1110–1118. [Google Scholar]
  13. Lee, I.; Kim, D.; Kang, S.; Lee, S. Ensemble deep learning for skeleton-based action recognition using temporal sliding lstm networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1012–1020. [Google Scholar]
  14. Wang, H.; Wang, L. Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 499–508. [Google Scholar]
  15. Wang, P.; Li, W.; Li, C.; Hou, Y. Action recognition based on joint trajectory maps with convolutional neural networks. Knowl.-Based Syst. 2018, 158, 43–53. [Google Scholar] [CrossRef]
  16. Cao, C.; Lan, C.; Zhang, Y.; Zeng, W.; Lu, H.; Zhang, Y. Skeleton-based action recognition with gated convolutional neural networks. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 3247–3257. [Google Scholar] [CrossRef]
  17. Li, Y.; Xia, R.; Liu, X.; Huang, Q. Learning shape-motion representations from geometric algebra spatio-temporal model for skeleton-based action recognition. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 1066–1071. [Google Scholar]
  18. Duan, H.; Zhao, Y.; Chen, K.; Lin, D.; Dai, B. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2969–2978. [Google Scholar]
  19. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  20. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12026–12035. [Google Scholar]
  21. Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; Ouyang, W. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 143–152. [Google Scholar]
  22. Chen, Y.; Zhang, Z.; Yuan, C.; Li, B.; Deng, Y.; Hu, W. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 13359–13368. [Google Scholar]
  23. Liu, Y.; Zhang, H.; Xu, D.; He, K. Graph transformer network with temporal kernel attention for skeleton-based action recognition. Knowl.-Based Syst. 2022, 240, 108146. [Google Scholar] [CrossRef]
  24. Wu, Z.; Zhan, M.; Zhang, H.; Luo, Q.; Tang, K. MTGCN: A multi-task approach for node classification and link prediction in graph data. Inf. Process. Manag. 2022, 59, 102902. [Google Scholar] [CrossRef]
  25. Xu, H.; Gao, Y.; Hui, Z.; Li, J.; Gao, X. Language knowledge-assisted representation learning for skeleton-based action recognition. IEEE Trans. Multimed. 2025, 27, 5784–5799. [Google Scholar] [CrossRef]
  26. Li, C.; Liang, W.; Yin, F.; Zhao, Y.; Zhang, Z. Semantic information guided multimodal skeleton-based action recognition. Inf. Fusion 2025, 123, 103289. [Google Scholar] [CrossRef]
  27. Feng, Q.; Li, B.; Liu, X.; Gao, X.; Wan, K. Low-high frequency network for spatial–temporal traffic flow forecasting. Eng. Appl. Artif. Intell. 2025, 158, 111304. [Google Scholar] [CrossRef]
  28. Zhang, Z.; Xu, Y.; Song, J.; Zhou, Q.; Rasol, J.; Ma, L. Planet Craters Detection Based on Unsupervised Domain Adaptation. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 7140–7152. [Google Scholar] [CrossRef]
  29. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  30. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 2022, 130, 2337–2348. [Google Scholar] [CrossRef]
  31. Lester, B.; Al-Rfou, R.; Constant, N. The power of scale for parameter-efficient prompt tuning. arXiv 2021, arXiv:2104.08691. [Google Scholar] [CrossRef]
  32. Li, X.L.; Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv 2021, arXiv:2101.00190. [Google Scholar] [CrossRef]
  33. Hu, Y.; Hua, H.; Yang, Z.; Shi, W.; Smith, N.A.; Luo, J. Promptcap: Prompt-guided image captioning for vqa with gpt-3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 2963–2975. [Google Scholar]
  34. Lin, K.; Li, L.; Lin, C.C.; Ahmed, F.; Gan, Z.; Liu, Z.; Lu, Y.; Wang, L. Swinbert: End-to-end transformers with sparse attention for video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17949–17958. [Google Scholar]
  35. Liu, Z.; Yu, X.; Fang, Y.; Zhang, X. Graphprompt: Unifying pre-training and downstream tasks for graph neural networks. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 417–428. [Google Scholar]
  36. Liu, C.; Rossi, L. PHGNN: A Novel Prompted Hypergraph Neural Network to Diagnose Alzheimer’s Disease. arXiv 2025, arXiv:2503.14577. [Google Scholar]
  37. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  39. Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen Technical Report. arXiv 2023, arXiv:2309.16609. [Google Scholar] [CrossRef]
  40. Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019. [Google Scholar]
  41. Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701. [Google Scholar] [CrossRef] [PubMed]
  42. Li, T.; Liu, J.; Zhang, W.; Ni, Y.; Wang, W.; Li, Z. UAV-Human: A large benchmark for human behavior understanding with unmanned aerial vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16266–16275. [Google Scholar]
  43. Cafarelli, D.; Ciampi, L.; Vadicamo, L.; Gennaro, C.; Berton, A.; Paterni, M.; Benvenuti, C.; Passera, M.; Falchi, F. MOBDrone: A drone video dataset for man overboard rescue. In Proceedings of the International Conference on Image Analysis and Processing; Springer: Cham, Switzerland, 2022; pp. 633–644. [Google Scholar]
  44. Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2117–2126. [Google Scholar]
  45. Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-temporal lstm with trust gates for 3d human action recognition. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 816–833. [Google Scholar]
  46. Li, C.; Zhong, Q.; Xie, D.; Pu, S. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv 2018, arXiv:1804.06055. [Google Scholar] [CrossRef]
  47. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3595–3603. [Google Scholar]
  48. Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1227–1236. [Google Scholar]
  49. Huang, L.; Huang, Y.; Ouyang, W.; Wang, L. Part-level graph convolutional network for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11045–11052. [Google Scholar]
  50. Alsarhan, T.; Ali, U.; Lu, H. Enhanced discriminative graph convolutional network with adaptive temporal modelling for skeleton-based action recognition. Comput. Vis. Image Underst. 2022, 216, 103348. [Google Scholar] [CrossRef]
  51. Liu, Y.; Zhang, H.; Li, Y.; He, K.; Xu, D. Skeleton-based human action recognition via large-kernel attention graph convolutional network. IEEE Trans. Vis. Comput. Graph. 2023, 29, 2575–2585. [Google Scholar] [CrossRef]
  52. Cheng, Q.; Cheng, J.; Ren, Z.; Zhang, Q.; Liu, J. Multi-scale spatial–temporal convolutional neural network for skeleton-based action recognition. Pattern Anal. Appl. 2023, 26, 1303–1315. [Google Scholar] [CrossRef]
  53. Li, T.; Geng, P.; Lu, X.; Li, W.; Lyu, L. Skeleton-based action recognition through attention guided heterogeneous graph neural network. Knowl.-Based Syst. 2025, 309, 112868. [Google Scholar] [CrossRef]
  54. Si, C.; Jing, Y.; Wang, W.; Wang, L.; Tan, T. Skeleton-based action recognition with spatial reasoning and temporal stack learning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 103–118. [Google Scholar]
  55. Zhang, X.; Huang, C.; Xu, Y.; Xia, L.; Dai, P.; Bo, L.; Zhang, J.; Zheng, Y. Traffic flow forecasting with spatial-temporal graph diffusion network. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 15008–15015. [Google Scholar]
  56. Plizzari, C.; Cannici, M.; Matteucci, M. Spatial temporal transformer network for skeleton-based action recognition. In Proceedings of the International Conference on Pattern Recognition, Virtual, 10–15 January 2021; Springer: Cham, Switzerland, 2021; pp. 694–701. [Google Scholar]
  57. Ghorbani, M.; Kazi, A.; Baghshah, M.S.; Rabiee, H.R.; Navab, N. RA-GCN: Graph convolutional network for disease prediction problems with imbalanced data. Med Image Anal. 2022, 75, 102272. [Google Scholar] [CrossRef]
  58. Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 183–192. [Google Scholar]
  59. Sun, Y.; Shen, Y.; Ma, L. MSST-RT: Multi-stream spatial-temporal relative transformer for skeleton-based action recognition. Sensors 2021, 21, 5339. [Google Scholar] [CrossRef]
  60. Xu, L.; Lan, C.; Zeng, W.; Lu, C. Skeleton-based mutually assisted interacted object localization and human action recognition. IEEE Trans. Multimed. 2022, 25, 4415–4425. [Google Scholar] [CrossRef]
Figure 1. The overall architecture of MSG-GCN, a multi-semantic guided network for action recognition. By explicitly introducing semantic information of joint types and frame indices, and implicitly leveraging textual semantics for feature alignment, we construct a concise and efficient action recognition framework.
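As a concrete illustration of the explicit semantic guidance shown in Figure 1, the sketch below adds learnable joint-type and frame-index embeddings to projected joint coordinates. The module name, feature dimensions, and the simple additive fusion are our assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SemanticEmbedding(nn.Module):
    """Illustrative sketch: fuse joint-type and frame-index semantics
    with skeleton features (names and sizes are assumptions)."""
    def __init__(self, num_joints=25, num_frames=64, feat_dim=64):
        super().__init__()
        # learnable embeddings for the two explicit semantic cues
        self.joint_type_emb = nn.Embedding(num_joints, feat_dim)
        self.frame_index_emb = nn.Embedding(num_frames, feat_dim)
        # project raw 3D joint coordinates into the same feature space
        self.coord_proj = nn.Linear(3, feat_dim)

    def forward(self, x):
        # x: (batch, frames, joints, 3) raw skeleton coordinates
        b, t, v, _ = x.shape
        feat = self.coord_proj(x)                                     # (b, t, v, d)
        jt = self.joint_type_emb(torch.arange(v, device=x.device))    # (v, d)
        fi = self.frame_index_emb(torch.arange(t, device=x.device))   # (t, d)
        # broadcast-add joint-type over joints and frame-index over frames
        return feat + jt.view(1, 1, v, -1) + fi.view(1, t, 1, -1)

if __name__ == "__main__":
    skel = torch.randn(2, 64, 25, 3)      # dummy skeleton batch
    print(SemanticEmbedding()(skel).shape)  # torch.Size([2, 64, 25, 64])
```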
Figure 2. Illustration of the segmentation of body parts (left hand, right hand, left knee, right knee, and head and spine) and the selection of reference joints.
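The five-part division in Figure 2 can be made concrete as joint-index groups. The indices below assume the standard 25-joint NTU RGB+D layout (0-based) and are illustrative only; the paper's exact grouping and its choice of reference joints may differ.

```python
# Illustrative 0-based joint groups following the standard NTU RGB+D
# 25-joint layout; the paper's exact indices may differ.
BODY_PARTS = {
    "left_hand":      [4, 5, 6, 7, 21, 22],    # left shoulder -> left thumb
    "right_hand":     [8, 9, 10, 11, 23, 24],  # right shoulder -> right thumb
    "left_knee":      [12, 13, 14, 15],        # left hip -> left foot
    "right_knee":     [16, 17, 18, 19],        # right hip -> right foot
    "head_and_spine": [0, 1, 2, 3, 20],        # spine base -> head
}

# sanity check: the five parts should partition all 25 joints
assert sorted(j for part in BODY_PARTS.values() for j in part) == list(range(25))
```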
Figure 3. Hyperparameter Sensitivity and Training Dynamics. (a) Sensitivity analysis of λ. (b) Learning curves during training.
Figure 4. Action classes with accuracy difference between our method and baseline on the NTU60 dataset.
Figure 5. Visualization of the topologies learned by MSG-GCN and baseline across two actions. Darker color indicates stronger correlation between corresponding joints.
Figure 6. Comparison of the PCA projections of latent representations with and without the text-semantic contrastive loss L_con^multi. The five visualized action categories were randomly selected from NTU60. (a) PCA projection without L_con^multi; (b) PCA projection with L_con^multi.
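One plausible form of the text-semantic contrastive loss L_con^multi visualized in Figure 6 is a CLIP-style InfoNCE objective that pulls each pooled skeleton feature toward the text embedding of its action class. The sketch below is a minimal illustration under that assumption; the paper's exact formulation (for example, the per-part terms suggested by "multi") may differ.

```python
import torch
import torch.nn.functional as F

def text_contrastive_loss(skel_feat, text_feat, labels, temperature=0.07):
    """Sketch of a CLIP-style contrastive alignment between pooled skeleton
    features (N, D) and per-class text embeddings (C, D); the exact form of
    the paper's L_con^multi may differ."""
    skel_feat = F.normalize(skel_feat, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    logits = skel_feat @ text_feat.t() / temperature   # (N, C) cosine similarities
    # each sample should be most similar to the text embedding of its class
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    n, c, d = 8, 60, 256                               # batch, classes, feature dim
    loss = text_contrastive_loss(torch.randn(n, d), torch.randn(c, d),
                                 torch.randint(0, c, (n,)))
    print(loss.item())
```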
Figure 7. (a) Drowning suspected; (b) swimming; (c) call for help.
Table 1. Comparison of model validation accuracy and parameter count under different configurations.
Method | Accuracy (%) | Params (M)
Baseline | 86.8 | 0.54
+EDR | 87.8 ± 0.07 | 0.55
+MHAG | 86.9 ± 0.12 | 0.89
+MSSP | 87.8 ± 0.07 | 1.68
+MS | 89.4 ± 0.08 | 0.69
MSG-GCN | 90.2 ± 0.11 | 1.88
Table 2. Comparison of model accuracy and parameter size under different types of semantic information, where JT denotes joint type, FI denotes frame index, and Text denotes textual semantics.
Method | Accuracy (%) | Params (M)
Baseline | 86.8 | 0.54
+FI | 87.8 ± 0.05 | 0.56
+JT | 88.7 ± 0.05 | 0.67
+JT+FI | 88.9 ± 0.06 | 0.69
MS (+JT+FI+Text) | 89.4 ± 0.08 | 0.69
Table 3. Comparison of model accuracy and parameter size under different numbers of heads in the multi-head attention graph module.
Method | Accuracy (%) | Params (M)
2-heads | 90.0 ± 0.12 | 1.75
4-heads | 90.2 ± 0.11 | 1.88
6-heads | 90.2 ± 0.11 | 2.02
8-heads | 90.1 ± 0.14 | 2.15
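Table 3 varies the number of heads in the multi-head attention graph module. A minimal sketch of such a module, in which each head infers its own joint-to-joint affinity matrix from the features, is given below; the layer names and the scaled dot-product form are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiHeadAttentionGraph(nn.Module):
    """Sketch of a multi-head attention graph layer: each head learns a
    data-dependent joint-joint affinity (an assumed form of the MHAG module)."""
    def __init__(self, in_dim, out_dim, heads=4):
        super().__init__()
        self.heads = heads
        self.q = nn.Linear(in_dim, heads * out_dim)
        self.k = nn.Linear(in_dim, heads * out_dim)
        self.v = nn.Linear(in_dim, heads * out_dim)
        self.out = nn.Linear(heads * out_dim, out_dim)

    def forward(self, x):
        # x: (batch, joints, in_dim) per-frame joint features
        b, v, _ = x.shape
        q = self.q(x).view(b, v, self.heads, -1).transpose(1, 2)   # (b, h, v, d)
        k = self.k(x).view(b, v, self.heads, -1).transpose(1, 2)
        val = self.v(x).view(b, v, self.heads, -1).transpose(1, 2)
        # per-head affinity matrices over joints, then feature aggregation
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        agg = (attn @ val).transpose(1, 2).reshape(b, v, -1)        # (b, v, h*d)
        return self.out(agg)

if __name__ == "__main__":
    print(MultiHeadAttentionGraph(64, 64, heads=4)(torch.randn(2, 25, 64)).shape)
```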
Table 4. Comparison of model accuracy and parameter size under different modes in the multi-scale spatial pooling module.
Method | Accuracy (%) | Params (M)
None | 89.2 ± 0.06 | 0.90
Global | 89.7 ± 0.10 | 0.90
Parts | 89.8 ± 0.13 | 1.23
MSSP (Parts+Global) | 90.2 ± 0.11 | 1.88
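The Parts+Global configuration in Table 4 can be read as concatenating per-part pooled features with a whole-body pooled feature. The function below is a minimal sketch under that reading; the pooling operator and feature shapes are assumptions.

```python
import torch

def multi_scale_spatial_pool(x, part_groups):
    """Sketch of the Parts+Global idea behind MSSP (assumed form):
    stack per-part mean-pooled features with a whole-body pooled one.
    x: (batch, frames, joints, dim); part_groups: list of joint-index lists."""
    part_feats = [x[:, :, idx, :].mean(dim=2) for idx in part_groups]  # each (b, t, d)
    global_feat = x.mean(dim=2)                                        # (b, t, d)
    return torch.stack(part_feats + [global_feat], dim=2)              # (b, t, parts+1, d)

if __name__ == "__main__":
    groups = [[4, 5, 6, 7, 21, 22], [8, 9, 10, 11, 23, 24],
              [12, 13, 14, 15], [16, 17, 18, 19], [0, 1, 2, 3, 20]]
    print(multi_scale_spatial_pool(torch.randn(2, 64, 25, 128), groups).shape)
    # torch.Size([2, 64, 6, 128]) -> five parts plus the global token
```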
Table 5. Comparison of prompt design and text encoder performance on action recognition.
Prompt | Text Encoder | Acc (%)
Describe [action] briefly; Describe motion of: left hand, right hand, left knee, right knee, head and spine. | BERT | 90.0 ± 0.12
(same prompt) | CLIP | 90.0 ± 0.08
Describe a person performing [action] in natural language; Then describe the movements of: left hand, right hand, left knee, right knee, head and spine. | BERT | 90.0 ± 0.08
(same prompt) | CLIP | 90.1 ± 0.09
Describe a person [action] in details; Describe following body parts actions when [action]: left hand, right hand, left knee, right knee, head and spine. | BERT | 90.1 ± 0.13
(same prompt) | CLIP | 90.2 ± 0.11
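For readers reproducing Table 5, the prompts can be encoded once, offline, with a frozen text encoder. The snippet below sketches this with the Hugging Face CLIP text encoder, filling the [action] placeholder with an example class name; the checkpoint, pooling, and prompt filling are our assumptions and not necessarily those used in the experiments.

```python
# Sketch of turning Table 5 prompts into fixed text embeddings with a CLIP
# text encoder (Hugging Face implementation assumed; checkpoint and pooling
# are illustrative choices).
import torch
from transformers import CLIPTokenizer, CLIPModel

prompts = [
    "Describe a person falling into water in details",
    "Describe following body parts actions when falling into water: "
    "left hand, right hand, left knee, right knee, head and spine",
]

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()

with torch.no_grad():
    inputs = tokenizer(prompts, padding=True, return_tensors="pt")
    text_emb = model.get_text_features(**inputs)              # (num_prompts, 512)
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)

print(text_emb.shape)
```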
Table 6. Comparisons of the Top-1 accuracy (%) against state-of-the-art methods on the NTU-RGB+D.
Method | Params (M) | CS (%) | CV (%)
HCN [46] | 1.13 | 86.5 | 91.1
ST-GCN [19] | 3.10 | 81.5 | 88.3
2s-AGCN [20] | 6.94 | 88.5 | 95.1
AS-GCN [47] | 6.99 | 86.8 | 94.2
AGC-LSTM [48] | 22.89 | 89.2 | 95.0
PL-GCN [49] | 20.70 | 89.2 | 95.0
MS-G3D (joint) [21] | 3.20 | 89.4 | 95.0
MS-G3D [21] | 6.44 | 91.5 | 96.2
CTR-GCN (joint) [22] | 1.46 | 89.9 | 95.0
CTR-GCN [22] | 2.92 | 92.4 | 96.9
ED-GCN (joint) [50] | - | 86.7 | 93.9
ED-GCN [50] | - | 88.7 | 95.2
LKA-GCN (joint) [51] | - | 90.0 | 95.0
LKA-GCN [51] | 3.78 | 90.7 | 96.1
MSSTNet [52] | 39.6 | 89.6 | 95.3
AG-HGNN (joint) [53] | - | 87.6 | 93.2
AG-HGNN [53] | 16.46 | 93.1 | 97.2
MSG-GCN (ours) | 1.88 | 90.2 | 95.6
Table 7. Comparisons of the Top-1 accuracy (%) against state-of-the-art methods on the NTU-RGB+D 120.
Method | Params (M) | CS (%) | CV (%)
ST-GCN [19] | 3.10 | 70.7 | 73.2
SR-TSL [54] | 19.07 | 74.1 | 79.9
2s-AGCN [20] | 6.94 | 82.5 | 84.2
AS-GCN [47] | 6.99 | 77.7 | 78.9
ST-GDN [55] | - | 80.8 | 82.3
ST-TR [56] | 12.10 | 82.7 | 84.7
RA-GCN [57] | 6.25 | 81.1 | 82.7
MSSTNet (joint & bone) [52] | 39.6 | 83.6 | 84.7
MSSTNet [52] | 39.6 | 85.3 | 86.0
MSG-GCN (ours) | 1.88 | 84.4 | 86.4
Table 8. Comparisons of the Top-1 accuracy (%) against state-of-the-art methods on the UAV-human.
Method | Params (M) | Accuracy (%)
ST-GCN [19] | 3.10 | 30.3
2s-AGCN [20] | 6.94 | 34.8
Shift-GCN [58] | 2.80 | 38.0
MSST-RT [59] | - | 41.2
IO-SGN [60] | - | 40.0
MSSTNet [52] | 39.6 | 43.0
MSG-GCN (ours) | 1.88 | 41.6
Table 9. Comparisons of efficiency metrics against state-of-the-art methods on the MOBDrone.
Method | Params (M) | GFLOPs | FLOPs/f (M) | Lat. (ms) | FPS | Mem. (MB) | Acc (%)
ST-GCN [19] | 3.10 | 16.30 | 34.42 | 4.1 | 244 | 445 | 91.0
2s-AGCN [20] | 6.94 | 37.32 | 66.74 | 4.6 | 217 | 766 | 94.5
MS-G3D (joint) [21] | 6.44 | 5.22 | 33.13 | 3.9 | 256 | 667 | 94.0
Shift-GCN (joint) [58] | 2.80 | 2.50 | 18.52 | 3.5 | 311 | 310 | 93.7
CTR-GCN (joint) [22] | 1.46 | 19.7 | 49.25 | 4.7 | 212 | 489 | 96.0
MSG-GCN (ours) | 1.88 | 0.31 | 13.65 | 2.6 | 390 | 218 | 96.0
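The latency and FPS columns in Table 9 depend on the measurement protocol. The sketch below shows one common way to time a skeleton-classification model on a single clip with warm-up; the hardware, input shape, and dummy model are placeholders, not the benchmarking setup used for the table.

```python
# Rough sketch of per-clip latency/FPS measurement; setup is illustrative.
import time
import torch

def benchmark(model, sample, warmup=20, runs=100):
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):                  # warm-up to stabilise timings
            model(sample)
        if sample.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(sample)
        if sample.is_cuda:
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / runs
    return latency_ms, 1000.0 / latency_ms       # per-clip latency and throughput

if __name__ == "__main__":
    dummy_model = torch.nn.Sequential(torch.nn.Flatten(),
                                      torch.nn.Linear(64 * 25 * 3, 60))
    lat, fps = benchmark(dummy_model, torch.randn(1, 64, 25, 3))
    print(f"latency {lat:.2f} ms, {fps:.0f} clips/s")
```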
Table 10. Per-class performance on critical maritime search-and-rescue actions.
Action | Precision (%) | Recall (%) | F-Score
Drowning suspected | 97.2 | 97.0 | 0.971
Swimming | 96.7 | 95.2 | 0.959
Call for help | 95.8 | 97.1 | 0.964
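The per-class metrics in Table 10 follow directly from the confusion between predictions and ground truth. A minimal scikit-learn sketch is given below; the label arrays are dummy placeholders, not experimental outputs.

```python
# Sketch of computing per-class precision/recall/F-score as in Table 10.
from sklearn.metrics import precision_recall_fscore_support

classes = ["drowning suspected", "swimming", "call for help"]
y_true = [0, 0, 1, 1, 2, 2, 0, 2]          # ground-truth class indices (dummy)
y_pred = [0, 0, 1, 2, 2, 2, 0, 1]          # model predictions (dummy)

prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1, 2], zero_division=0)
for name, p, r, f in zip(classes, prec, rec, f1):
    print(f"{name:>18s}: precision {p:.3f}, recall {r:.3f}, F-score {f:.3f}")
```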
Table 11. Performance under different occlusion levels on the MOBDrone dataset.
Accuracy at each occlusion level (%)
Method | None | 10% | 20% | 30%
ST-GCN [19] | 91.0 | 78.1 | 62.4 | 35.6
2s-AGCN [20] | 94.5 | 82.4 | 66.8 | 44.5
MS-G3D (joint) [21] | 94.0 | 83.9 | 70.4 | 57.7
Shift-GCN (joint) [58] | 93.7 | 84.0 | 69.3 | 58.9
CTR-GCN (joint) [22] | 96.0 | 86.1 | 74.7 | 60.7
MSG-GCN (ours) | 96.0 | 88.2 | 79.2 | 64.4
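Table 11 evaluates robustness to occlusion. One simple way to synthesize such occlusion is to zero out a fixed fraction of joints per sample, as sketched below; whether masking is applied per clip or per frame, and which joints are eligible, are assumptions, since the exact protocol is not reproduced here.

```python
# Sketch of synthetic joint occlusion for robustness tests like Table 11;
# the masking protocol (per-clip, uniform over joints) is an assumption.
import torch

def occlude_joints(x, ratio, seed=0):
    """x: (batch, frames, joints, channels); ratio: fraction of joints masked."""
    g = torch.Generator().manual_seed(seed)
    b, t, v, c = x.shape
    num_masked = int(round(ratio * v))
    out = x.clone()
    for i in range(b):
        masked = torch.randperm(v, generator=g)[:num_masked]  # joints to drop
        out[i, :, masked, :] = 0.0                             # zero their coordinates
    return out

if __name__ == "__main__":
    clip = torch.randn(2, 64, 25, 3)
    occluded = occlude_joints(clip, ratio=0.2)                 # 20% occlusion level
    # count fully zeroed joints per sample (about 5 of 25 at 20%)
    print((occluded == 0).all(dim=3).all(dim=1).sum(dim=1))
```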
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
