Article

GCN-Former: A Method for Action Recognition Using Graph Convolutional Networks and Transformer

School of Digital and Intelligence Industry, Inner Mongolia University of Science & Technology, Baotou 014010, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(8), 4511; https://doi.org/10.3390/app15084511
Submission received: 26 February 2025 / Revised: 11 April 2025 / Accepted: 18 April 2025 / Published: 19 April 2025

Abstract

Skeleton-based action recognition, which aims to classify human actions through the coordinates of body joints and their connectivity, is a significant research area in computer vision with broad application potential. Although Graph Convolutional Networks (GCNs) have made significant progress in processing skeleton data represented as graphs, their performance is constrained by local receptive fields and fixed joint connection patterns. Recently, researchers have introduced Transformer-based methods to overcome these limitations and better capture long-range dependencies. However, these methods face significant computational resource challenges when attempting to capture the correlations between all joints across all frames. This paper proposes an innovative Spatio-Temporal Graph Convolutional Network: GCN-Former, which aims to enhance model performance in skeleton-based action recognition tasks. The model integrates the Transformer architecture with traditional GCNs, leveraging the Transformer’s powerful capability for handling long-sequence data and the effective capture of spatial dependencies by GCNs. Specifically, this study designs a Transformer Block temporal encoder based on the self-attention mechanism to model long-sequence temporal actions. The temporal encoder can effectively capture long-range dependencies in action sequences while retaining global contextual information in the temporal dimension. In addition, in order to achieve a smooth transition from graph convolutional networks (GCNs) to Transformers, we further develop a contextual temporal attention (CTA) module. These components are aimed at enhancing the understanding of temporal and spatial information within action sequences. Experimental validation on multiple benchmark datasets demonstrates that our approach not only surpasses existing techniques in prediction accuracy, but also has significant performance advantages in handling action recognition tasks involving long time sequences and can more effectively capture and understand long-range dependencies in complex action patterns.

1. Introduction

In recent years, with the success of Graph Neural Networks (GNNs) in handling complex data structures, GNNs have demonstrated significant potential in various domains, including but not limited to social network analysis, recommendation systems, and bioinformatics. In particular, Graph Convolutional Networks (GCNs) [1] have significantly enhanced our understanding of graph-structured data by effectively integrating node features and their neighborhood information. However, when dealing with data that exhibit spatiotemporal characteristics, traditional GCN models often struggle to be directly applicable, as they primarily focus on static graph structures and overlook the influence of the temporal dimension. This issue is particularly prominent in skeleton-based action recognition. Early methods primarily relied on handcrafted features and traditional machine learning algorithms, such as Support Vector Machines (SVMs) [2] and Hidden Markov Models (HMMs) [3]. Although these methods achieved some success in simpler tasks, their performance is limited when handling complex action sequences, as they struggle to capture the temporal dependencies and spatial structures inherent in the actions.
In recent years, the development of deep learning techniques has brought revolutionary changes to skeleton-based action recognition. In particular, the successful application of Convolutional Neural Networks (CNNs) [4,5] and Recurrent Neural Networks (RNNs) [6,7,8] has enabled models to automatically learn high-level feature representations from raw skeleton data. However, these methods still have limitations when dealing with graph-structured data, as they are mainly designed for regular grid data. To address this issue, researchers have begun exploring methods to incorporate temporal information into GCNs. Yan et al. [9] proposed Spatiotemporal Graph Convolutional Networks (ST-GCNs) in 2018, which modeled human skeleton sequences in spatial and temporal dimensions for the first time, treating joints as graph vertices and constructing a spatiotemporal graph. However, the fixed graph topology in ST-GCNs does not account for non-physical connections. Wen et al. [10] introduced a directed graph and a weighted adjacency matrix, emphasizing spatial proximity between joints, enhancing the model’s ability to focus on relevant relationships. Li et al. [10] proposed an AS-GCN, incorporating two sub-networks for motion and structural connections, expanding the neighborhood and enriching joint-related cues. Shi et al. [11] proposed a 2s-AGCN, which allows the graph topology to be learned in an end-to-end fashion, improving flexibility and joint interaction learning. Shi et al. [12] further improved the ST-GCN and 2s-AGCN by representing human skeletons with directed acyclic graphs and introducing directed graph neural networks to extract features of joint interactions, skeleton information and their relationships. Zhang et al. [13] developed CA-GCN, which enables nodes to consider both local and distant joints, capturing long-range dependencies without multiple layers, simplifying the network. Cheng et al. [14] proposed a Shift-GCN, introducing a spatial shift operation to reduce complexity and improve performance for long time series. Chi et al. proposed InfoGCN [15] and InfoGCN++ [16], which optimized the graph convolution process using information theory, enhancing robustness and accuracy in complex scenarios. Lee et al. [17] introduced HD-GCN, employing hierarchical decomposition to optimally configure different GCN components and improve overall model performance. Chen et al. [18] developed CTRGCN, which fine-tuned the channel topology of GCNs to enhance adaptability for complex action recognition tasks. Zhou et al. [19] proposed BlockGCN, redefining the graph topology to boost generalization and adaptability, further advancing GCN performance in action recognition. These contributions collectively drive the evolution of GCNs in skeleton-based action recognition. Despite the significant advancements of GCNs in handling graph-structured data, there are still limitations when processing long temporal sequences.
Meanwhile, the Transformer architecture [20] has seen significant success in Natural Language Processing (NLP). Its core advantage lies in the self-attention mechanism, which efficiently handles long sequence data and excels at modeling long-range dependencies. Compared to traditional RNNs and CNNs, the Transformer not only avoids the vanishing and exploding gradient problem but also supports parallel computation, significantly enhancing training efficiency. This makes the Transformer a potential solution to address the limitations of GCNs and TCNs. In skeleton-based action recognition, some studies have begun to explore the application of the Transformer to this task. For example, Plizzari et al. [21] proposed the Spatial–Temporal Transformer network (ST-TR), a model that combines spatial and temporal self-attention mechanisms. ST-TR understands the interactions between different body parts within a frame through the spatial self-attention module (SSA) and models the association between frames through the temporal self-attention module (TSA). Zhang et al. [22] proposed the Spatial–Temporal Specialized Transformer (STST), a Transformer encoder specifically designed for skeleton-based action recognition. STST models the pose of each skeleton frame and the action of the entire time span, respectively, through spatial and temporal self-attention mechanisms. Gao et al. [23] proposed FG-STFormer, a model that combines focal and global spatiotemporal transformers. FG-STFormer models key joints and overall motion patterns through spatial transformers coupled with focal joints and global parts, as well as focal and global temporal transformers. Shi et al. [24] proposed STAR, a sparse Transformer for skeleton-based action recognition, incorporating a sparse attention module that reduces redundant computations by focusing on neighboring joints and leveraging logical relations within body parts. Zhou et al. [25] proposed Hyperformer, advancing Transformer-based models by incorporating skeletal structure through relative position embeddings and a Hypergraph Self-Attention (HyperSA) mechanism to capture high-order kinematic dependencies, outperforming graph-based models in accuracy and efficiency. Pang et al. [26] proposed the Interaction Graph Transformer (IGFormer), which uses a Graph Interaction Multi-Head Self-Attention (GI-MSA) module to model interactions between body parts of multiple subjects, along with a semantic partitioning module (SPM) for better handling of interactive actions. Duan et al. [27] introduced SkeleTR, a framework that uses short skeleton sequences to reduce association errors, with a hybrid pooling module for efficiency. Wen et al. [28] proposed the Interactive Spatiotemporal Token Attention Network (ISTA-Net), which uses interactive spatiotemporal tokens (ISTs) to capture spatial, temporal, and interactive relationships. These contributions highlight the ongoing advancements in integrating Transformers for skeleton-based action recognition, each focusing on optimizing spatiotemporal modeling, interaction capture, and computational efficiency.
Building upon this, the present work proposes an innovative spatiotemporal graph convolutional network by replacing the original TCN module with a Transformer, thus creating a more flexible and efficient model. Specifically, we designed a Transformer Block temporal encoder based on the self-attention mechanism to capture the dynamic features of key points over time while retaining the spatial aggregation capability of GCNs. This ensures that the model can comprehensively consider both temporal and spatial information. In addition, through experiments on multiple benchmark datasets, this paper aims to demonstrate that the proposed model not only outperforms existing methods in prediction accuracy but also exhibits higher efficiency and robustness in handling complex spatiotemporal data. The contributions of this paper are primarily as follows:
  • A spatiotemporal graph convolutional network framework is proposed, combining the Transformer with GCNs. Additionally, we introduce the Contextual Temporal Attention (CTA) mechanism to more efficiently leverage contextual information, enhancing the expressive power of the output aggregated feature map.
  • A Transformer Block-based time encoder with a self-attention mechanism is designed, effectively addressing the modeling challenge of long sequence data.
  • Extensive experiments are conducted on multiple datasets, validating the effectiveness and superiority of the proposed model.

2. Methods

2.1. GCN-Former Architecture Overview

The overall architecture of the proposed GCN-Former is shown in Figure 1. For each action, we first sample its corresponding skeleton sequence, ensuring that all samples consist of the same number of frames, denoted as $T$. This results in a data tensor $X \in \mathbb{R}^{N \times C_{in} \times T \times V \times M}$, where $N$ represents the batch size, $C_{in}$ refers to the data dimension of a single joint (typically the joint coordinate dimension, e.g., 3 for $X, Y, Z$ coordinates), $T$ denotes the number of time steps, $V$ is the number of joints per frame, and $M$ represents the number of individuals involved in the action.
Next, we transform the original tensor into a shape of $(NMT, V, C_{in})$ through rearrangement and embedding operations, expanding the $C_{in}$ dimension to 128 (denoted as $C$) via joint embedding. Furthermore, the skeleton sequence is normalized. For simplicity, we redefine $N$ as $NM$, resulting in the formulation $X \in \mathbb{R}^{N \times C_{in} \times T \times V}$ for the subsequent text.
The input data are then passed through $L$ layers of the GCN-Former Block. Each layer consists of a GCN convolution module, the improved Contextual Temporal Attention (CTA) mechanism, and the Transformer Block temporal encoder module, all stabilized by residual connections during training. Finally, the features processed by the $L$ GCN-Former Block layers are mapped to the class space via Global Average Pooling (GAP) followed by a Fully Connected (FC) layer to complete the action recognition task.
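To make the tensor reshaping and overall data flow concrete, the following PyTorch sketch traces an input of shape $(N, C_{in}, T, V, M)$ through joint embedding, the stacked GCN-Former Blocks, pooling, and classification. It is a minimal illustration under stated assumptions: the block stack is represented by `nn.Identity()` placeholders (a sketch of the block itself follows in Section 2.2), and the layer count and module names are ours, not the authors' released implementation.

```python
# Minimal PyTorch sketch of the GCN-Former data flow described above (illustrative,
# not the authors' released code). The GCN-Former Blocks are nn.Identity() stand-ins here.
import torch
import torch.nn as nn

class GCNFormer(nn.Module):
    def __init__(self, num_class, in_channels=3, embed_dim=128, num_layers=10):
        super().__init__()
        self.joint_embed = nn.Linear(in_channels, embed_dim)   # expand C_in to C = 128
        self.norm = nn.LayerNorm(embed_dim)                    # normalize the embedded sequence
        self.blocks = nn.ModuleList([nn.Identity() for _ in range(num_layers)])  # GCN-Former Blocks
        self.fc = nn.Linear(embed_dim, num_class)              # final classifier

    def forward(self, x):                                      # x: (N, C_in, T, V, M)
        N, C, T, V, M = x.shape
        x = x.permute(0, 4, 2, 3, 1).reshape(N * M * T, V, C)  # rearrange to (N*M*T, V, C_in)
        x = self.norm(self.joint_embed(x))                     # joint embedding
        x = x.reshape(N * M, T, V, -1).permute(0, 3, 1, 2)     # (N*M, C, T, V); N := N*M
        for blk in self.blocks:
            x = blk(x)                                         # GCN -> CTA -> Transformer Block
        x = x.mean(dim=(2, 3))                                 # GAP over time and joints
        x = x.reshape(N, M, -1).mean(dim=1)                    # average over persons
        return self.fc(x)                                      # class scores

# Example: logits = GCNFormer(num_class=60)(torch.randn(2, 3, 64, 25, 2))
```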

2.2. GCN-Former Block Architecture

Each GCN-Former Block consists of three core components: the GCN module, the Contextual Temporal Attention (CTA) mechanism, and the Transformer Block. Specifically, the GCN module aggregates information from neighboring nodes via graph convolution operations on the input data, generating new feature representations for each node. In action recognition, this process helps capture local spatial dependencies and dynamic patterns, such as the relative motion relationships between human joints. Additionally, GCNs preserve graph structure information, crucial for understanding key components of the action. Nevertheless, standard GCNs have limitations in capturing complex long-range dependencies between nodes. To address these limitations, we designed the CTA module to follow the GCN, enabling further exploration of contextual information within local neighborhoods, particularly when the nodes in the graph possess rich spatial or geometric structures. The CTA module enhances visual representation capability by utilizing contextual information between input keys to guide the learning of the self-attention matrix. It combines static context encoding with dynamic self-attention learning, effectively capturing dependencies between different positions and improving the global understanding of the entire graph. Additionally, the CTA module introduces a context-aware mechanism that enhances feature expressiveness without significantly increasing computational cost. This is particularly crucial for handling complex tasks, such as image recognition and object detection. The Transformer Block in this study excels at handling sequential data and modeling long-range dependencies through the multi-head self-attention mechanism.
However, transitioning directly from GCN to Transformer may result in the loss of important local details. To address this, we use the CTA as an intermediate layer to preserve local details. This ensures that the details are passed on in a form better suited for the Transformer, achieving more efficient feature transformation and smoother data flow. Meanwhile, the dynamic context representation in the CTA module is built based on the interaction between queries and context keys, making it an ideal choice for linking the two different architectural components. This facilitates information exchange and promotes collaboration between them. Finally, the features enhanced by the CTA module are fed into the Transformer Block, which models relationships between nodes over a broader range using the multi-head self-attention mechanism, including long-range dependencies.
Specifically, the Transformer Block is well suited for handling long-duration action sequences, as it can analyze the development trends and rhythmic changes in the action from a global perspective. This is crucial for accurately classifying different types of actions. These three modules employ residual connections, meaning that the output of each layer not only includes the new information generated by the current layer but also retains a certain proportion of the original input. This design helps alleviate the vanishing gradient problem that may arise during the training of deep networks and makes the model easier to optimize. Residual connections ensure the continuity and stability of information flow throughout the architecture, facilitating the effective transmission and fusion of information. This enhances the overall robustness and generalization capability of the system.
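The residual wiring of one block can be sketched as follows. This is a simplified reading of the description above: the spatial GCN is reduced to a fixed normalized adjacency with a 1 × 1 projection (the actual GCN may refine the topology), the CTA and Transformer Block modules (sketched in the following subsections) are passed in as sub-modules, and the exact placement of the residual connections is our assumption.

```python
# Hedged sketch of one GCN-Former Block: spatial GCN -> CTA -> Transformer Block,
# each wrapped in a residual connection. nn.Identity() defaults keep it runnable alone.
import torch
import torch.nn as nn

class GCNFormerBlock(nn.Module):
    def __init__(self, channels, A, cta=None, temporal=None):
        super().__init__()
        self.register_buffer("A", A)                 # (V, V) normalized skeleton adjacency (fixed here)
        self.gcn_proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        self.cta = cta if cta is not None else nn.Identity()                 # Contextual Temporal Attention
        self.temporal = temporal if temporal is not None else nn.Identity()  # Transformer Block

    def forward(self, x):                                # x: (N, C, T, V)
        agg = torch.einsum("nctv,vw->nctw", x, self.A)   # aggregate neighbouring joints
        x = self.relu(self.bn(self.gcn_proj(agg)) + x)   # spatial GCN with residual
        x = self.cta(x) + x                              # CTA with residual
        x = self.temporal(x) + x                         # temporal Transformer with residual
        return x

# Example: y = GCNFormerBlock(128, torch.eye(25))(torch.randn(2, 128, 64, 25))
```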

2.3. Contextual Temporal Attention

The Contextual Temporal Attention (CTA) module is designed to overcome the limitations of traditional self-attention mechanisms when processing 2D feature maps. Specifically, the attention matrix is calculated based solely on isolated query–key pairs, which ignores the rich contextual information between neighboring keys. This limitation hinders the ability of self-attention to learn effectively and reduces the efficiency of visual representation learning. The Contextual Transformer (CoT) [29] integrates contextual information mining with self-attention learning into a unified architecture, enhancing the representational power of the output aggregated feature map. However, this approach only enhances the representation through local neighborhood relationships and the combination of static and dynamic information. To further enhance the representation of contextual information, we propose adding an independent temporal feature enhancement path to the existing CoT module. This path extracts the temporal dynamics of the input features by applying simple temporal convolutions and combines them with the original features. This approach improves the model’s sensitivity to temporal information without significantly increasing computational overhead.
The detailed structure of the proposed CTA module is shown in Figure 2. For an input feature map $X \in \mathbb{R}^{T \times H \times W \times C}$, we define the following: keys ($K$) are taken directly from the input feature map $X$; queries ($Q$) are likewise taken directly from $X$; values ($V$) are obtained from $X$ via a linear transformation $W_v$. Unlike traditional self-attention mechanisms that use $1 \times 1 \times 1$ convolutions to encode each key, the CTA block utilizes $1 \times k \times k$ grouped convolutions to contextualize all local neighboring keys. This operation captures the local neighborhood relationships of each key within its $k \times k$ grid, generating a contextualized key $K_1 \in \mathbb{R}^{T \times H \times W \times C}$ that reflects the static contextual information between local neighboring keys. Next, the CTA block combines the contextualized key $K_1$ and the query $Q$ to generate the attention matrix $A$. Specifically, $K_1$ and $Q$ are concatenated and passed through two successive $1 \times 1 \times 1$ convolution layers $W_\theta$ and $W_\delta$, where $W_\theta$ is followed by a ReLU activation function and $W_\delta$ is not. The formulation is as follows:
$$A = [K_1, Q] W_\theta W_\delta.$$
Here, $A$ represents the local attention matrix for each head, which is learned for each spatial position based on the query features and the contextualized key features, rather than relying on isolated query–key pairs. This approach enhances self-attention learning and introduces additional guidance from the mined static context $K_1$. Based on the contextualized attention matrix $A$, the CTA module computes the attention-refined feature map $K_2$ by aggregating all the values $V$. The formula is as follows:
$$K_2 = V A.$$
Here, $K_2$ captures the dynamic feature interactions between the inputs and is therefore referred to as the dynamic contextual representation of the input.
Compared to the CoT module, the designed CTA module introduces a lightweight temporal convolution branch to extract temporal dynamic information. Specifically, this branch applies a $k \times 1 \times 1$ convolution along the temporal dimension of the input features to capture the dynamic variation patterns between different time steps. Finally, the output $Y$ of the CTA module is obtained by adding the result of the temporal convolution to the fusion of the static context $K_1$ and the dynamic context $K_2$ via residual connections. This enhances the model's ability to model temporal dependencies without disrupting the original feature space, ensuring that the final output contains both static contextual information and dynamic feature interaction information, thereby improving the model's representational power. The formula for the final output $Y$ of the CTA module is as follows:
$$T = \mathrm{Conv}_{\mathrm{Temp}}(X), \qquad Y = K_1 + K_2 + \mathrm{ReLU}(T).$$
In summary, the CTA module, through its unique design, effectively combines contextual information mining and self-attention learning, enhancing the visual representation capability of 2D feature maps. It improves the model’s sensitivity to temporal information while maintaining computational efficiency.
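A possible realization of the CTA module on skeleton features of shape (N, C, T, V) is sketched below, with the joint axis playing the role of the spatial neighborhood in the description above: a grouped (1, k) convolution produces the static context K1, two 1 × 1 convolutions turn [K1, Q] into the attention map, and a (k, 1) temporal convolution forms the extra branch. Kernel sizes, group counts, the reduction ratio, and the element-wise form of the V–A aggregation are illustrative assumptions rather than the paper's exact design.

```python
# Hedged sketch of the CTA module for skeleton features (N, C, T, V); hyperparameters
# and the simplified element-wise aggregation of V and A are assumptions.
import torch
import torch.nn as nn

class CTA(nn.Module):
    def __init__(self, channels, kernel_size=3, groups=8, reduction=4):
        super().__init__()
        pad = kernel_size // 2
        self.key_ctx = nn.Conv2d(channels, channels, (1, kernel_size), padding=(0, pad),
                                 groups=groups, bias=False)          # static context K1 over joints
        self.value = nn.Conv2d(channels, channels, 1, bias=False)    # W_v
        self.attn = nn.Sequential(                                   # W_theta (with ReLU), then W_delta
            nn.Conv2d(2 * channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))
        self.temp = nn.Conv2d(channels, channels, (kernel_size, 1), padding=(pad, 0),
                              bias=False)                            # temporal branch Conv_Temp

    def forward(self, x):                         # x: (N, C, T, V); Q = K = x
        k1 = self.key_ctx(x)                      # static context of neighbouring keys
        v = self.value(x)
        a = self.attn(torch.cat([k1, x], dim=1))  # A = [K1, Q] W_theta W_delta
        k2 = v * torch.softmax(a, dim=1)          # dynamic context K2 (simplified V-A aggregation)
        t = self.temp(x)                          # temporal dynamics T
        return k1 + k2 + torch.relu(t)            # Y = K1 + K2 + ReLU(T)
```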

2.4. Transformer Block Module

The Transformer Block consists of $N$ Encoder modules, each containing Multi-Head Attention, Add & Norm, Feed Forward, and another Add & Norm layer (as shown in Figure 3). Notably, the Multi-Head Attention here is composed of multiple Convolutional Attention [30] layers. Specifically, the input $X$ is passed through $h$ different Convolutional Attention layers, producing $h$ output matrices. These matrices are then concatenated (Concat) and passed through a Linear layer to form the final output matrix $Y$ of the Multi-Head Attention.
Convolutional Attention replaces the matrix multiplication in traditional self-attention mechanisms with pixel-wise convolution operations, avoiding the need for feature flattening and reshaping. It not only simplifies the computational process but also reduces inference time, which is crucial for real-time applications such as video segmentation and action recognition. Unlike traditional self-attention mechanisms, Convolutional Attention uses convolution operations to compute similarity, preserving the spatial coherence of the input data. This is particularly important for data with inherent spatial layouts, such as images and videos.
By extending learnable vectors into learnable kernels, Convolutional Attention can more precisely capture local context and retain more local spatial information. This approach helps better align the semantic information of the Transformer, thereby improving the model's expressiveness. Convolutional Attention is an innovative attention mechanism whose core idea is to use convolution operations to simulate the similarity calculation of self-attention while preserving the spatial structure of the input data. Unlike traditional methods that calculate the dot product between queries and keys, Convolutional Attention uses convolution kernels to compare the similarity between pixel blocks. This approach better captures local features and naturally incorporates positional information through the convolution operation. In addition, Convolutional Attention introduces a special normalization method, Grouped Dual Normalization ($\theta$), which ensures the effective allocation of attention weights. Specifically, given an input feature map $X$ and a set of learnable queries ($Q$) and keys ($K$), the Convolutional Attention operation consists of two main steps: calculating the similarity matrix and aggregating the values. The formula for calculating the similarity matrix is as follows:
$$A = \theta(X \circledast K).$$
Here, $A$ is the computed similarity matrix, which captures the local similarity between the input feature map $X$ and the keys $K$. $X \in \mathbb{R}^{C \times H \times W}$ is the input feature map, where $C$ denotes the number of channels, and $H$ and $W$ represent the height and width, respectively. $K \in \mathbb{R}^{N \times C \times k \times k}$ contains the learnable query and key values, where $N$ is the number of kernels and $k$ is the kernel size. $\theta$ represents Grouped Dual Normalization (GDN), which includes Softmax normalization along the $H \times W$ dimensions and group L2-norm normalization along the $N$ dimension, and $\circledast$ denotes the convolution operation. First, the similarity between the input feature map $X$ and the query/key values $K$ is computed through the convolution operation. The resulting similarity matrix $A$ reflects the correlation between different positions and is normalized using $\theta$ to ensure the effective allocation of attention weights. The formula for aggregating the values is as follows:
$$\tilde{X} = A \circledast K^{T}.$$
Here, $\tilde{X}$ is the final output feature map, which combines the input feature map $X$ and the similarity matrix $A$, and $K^{T} \in \mathbb{R}^{C \times N \times k \times k}$ is the transposed version of $K$. In this step, the convolution operation is again used to combine the transposed key values $K^{T}$ with the similarity scores, ultimately producing the normalized feature map $\tilde{X}$. This feature map integrates the original input information with the spatial contextual information enhanced by the attention mechanism.
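The two steps above can be sketched as follows for a generic feature map of shape (B, C, H, W) (for skeleton features, H × W corresponds to T × V). The number and size of the learnable kernels and the simplified form of Grouped Dual Normalization are assumptions on our part; the sketch follows the equations above, not the released SCTNet code.

```python
# Hedged sketch of Convolutional Attention: similarity by convolving X with learnable
# kernels, a simplified GDN (softmax over H*W, L2 over the kernel axis), then aggregation
# by a second convolution with the transposed kernels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAttention(nn.Module):
    def __init__(self, channels, num_kernels=8, kernel_size=3):
        super().__init__()
        self.pad = kernel_size // 2
        # Learnable query/key kernels K of shape (N, C, k, k).
        self.kernels = nn.Parameter(0.02 * torch.randn(num_kernels, channels, kernel_size, kernel_size))

    def gdn(self, a):                                                      # a: (B, N, H, W)
        b, n, h, w = a.shape
        a = F.softmax(a.reshape(b, n, h * w), dim=-1).reshape(b, n, h, w)  # softmax over H x W
        return a / (a.norm(dim=1, keepdim=True) + 1e-6)                    # L2 norm over the N axis

    def forward(self, x):                                                  # x: (B, C, H, W)
        a = self.gdn(F.conv2d(x, self.kernels, padding=self.pad))          # A = GDN(X conv K)
        k_t = self.kernels.transpose(0, 1).contiguous()                    # K^T of shape (C, N, k, k)
        return F.conv2d(a, k_t, padding=self.pad)                          # output = A conv K^T

# Example: y = ConvAttention(128)(torch.randn(2, 128, 64, 25))
```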
The Add & Norm layer refers to the residual connection followed by LayerNorm, as represented by the following formula:
A d d & N o r m : L a y e r N o r m X + S u b l a y e r X .
The Sublayer represents the transformation that the data undergoes. For example, in the first Add & Norm, the Sublayer represents the Multi-Head Attention. The Feed Forward layer refers to a fully connected layer, which is represented as follows:
$$\mathrm{FFN}(X) = \max(0, X W_1 + b_1) W_2 + b_2.$$
Therefore, after passing the input matrix X through an Encoder module, the output representation is as follows:
$$O' = \mathrm{LayerNorm}(X + \mathrm{MultiHeadAttention}(X)), \qquad O = \mathrm{LayerNorm}(O' + \mathrm{FFN}(O')).$$
After passing through a single Encoder module as described above, the input matrix $X \in \mathbb{R}^{n \times d}$ produces the output matrix $O \in \mathbb{R}^{n \times d}$. Stacking multiple Encoder modules forms the complete Transformer Block.
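Putting the pieces together, one Encoder unit of the Transformer Block can be sketched as below, reusing the ConvAttention sketch above. The head count, FFN width, and the channel-wise normalization used in place of LayerNorm are assumptions.

```python
# Hedged sketch of one Encoder of the Transformer Block: h Convolutional Attention heads,
# Concat + Linear merge, then the two Add & Norm / Feed Forward steps described above.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, channels, num_heads=4, ffn_ratio=4):
        super().__init__()
        self.heads = nn.ModuleList([ConvAttention(channels) for _ in range(num_heads)])
        self.merge = nn.Conv2d(num_heads * channels, channels, 1)   # Linear layer after Concat
        self.norm1 = nn.GroupNorm(1, channels)                      # channel-wise LayerNorm substitute
        self.norm2 = nn.GroupNorm(1, channels)
        self.ffn = nn.Sequential(                                   # FFN(X) = max(0, XW1 + b1)W2 + b2
            nn.Conv2d(channels, ffn_ratio * channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ffn_ratio * channels, channels, 1))

    def forward(self, x):                                                # x: (B, C, T, V)
        attn = self.merge(torch.cat([h(x) for h in self.heads], dim=1))  # Multi-Head Attention
        x = self.norm1(x + attn)                                         # first Add & Norm
        x = self.norm2(x + self.ffn(x))                                  # second Add & Norm
        return x

# Stacking several EncoderBlocks yields the full Transformer Block temporal encoder, e.g.:
# temporal_encoder = nn.Sequential(*[EncoderBlock(128) for _ in range(2)])
```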

3. Experiments

3.1. Datasets

NTU RGB+D [31] is a large-scale human action recognition dataset that contains 56,880 skeleton action sequences. The action samples were performed by 40 volunteers and are categorized into 60 classes, including activities like drinking water, eating, brushing teeth, and dropping objects. Each sample contains a single action, with at most two subjects. The sequences are captured simultaneously from different angles by three Microsoft Kinect v2 cameras. The authors recommend two benchmarks. One is Cross-subject (X-sub), where the training data come from 20 subjects, and the testing data come from the other 20 subjects. The other is Cross-view (X-view), where the training data come from camera views 2 and 3, and the testing data come from camera view 1.
NTU RGB+D 120 [32] is currently the largest dataset with 3D joint annotations for human action recognition. It extends the NTU RGB+D dataset, including actions like wearing headphones, shooting basketballs, and tossing ping pong balls. The dataset introduces 60 additional action classes, adding 57,367 skeleton sequences to the original dataset. In total, it includes 113,945 samples across 120 classes, performed by 106 volunteers and captured from three camera views. The dataset contains 32 different setups, each representing a specific location and background. The authors recommend two benchmarks. One is Cross-subject (X-sub), where the training data come from 53 subjects, and the testing data come from the other 53 subjects. The other is Cross-setup (X-setup), where the training data come from samples with even setup IDs, and the testing data come from samples with odd setup IDs.
UT-Kinect Dataset [33] is captured using a single fixed Kinect. It consists of 200 sequences divided into 10 classes, where each skeleton has 20 joints. The dataset is recorded in three channels: RGB, depth, and skeleton joint positions. However, we only use the 3D skeleton joint coordinates.

3.2. Experiment Details

All experiments were conducted on an RTX 3080 Ti GPU using the PyTorch 1.13.1 deep learning framework. Our model was trained using Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay of 0.0004 to prevent overfitting. The number of training epochs was set to 140, and a warm-up strategy [34] was applied during the first five epochs to stabilize the training process. The learning rate was set to 0.01 and decayed by a factor of 0.1 at epochs 110 and 120. For both NTU RGB+D and NTU RGB+D 120, the batch size was set to 64, with each sample containing 64 frames. Since the sequences in the UT-Kinect dataset contain very few frames, we designed two ways to generate the training set: sampling and interpolation. For longer sequences (i.e., sequences with more than 64 frames), we randomly chose 64 frames; for the other sequences, we calculated the mean of each pair of adjacent frames and inserted it into the sequence as a new frame, eventually forming a sequence of 54 frames. For all sequences, we repeated this operation two times to generate the training set.
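The two length-normalization strategies for UT-Kinect described above can be sketched as follows; the target length of 64 frames and the two interpolation passes follow the text, while the exact frame-selection rule is our assumption.

```python
# Hedged sketch of the sampling/interpolation used to normalize UT-Kinect sequence lengths;
# seq is a (T, V, 3) array of joint coordinates.
import numpy as np

def normalize_length(seq, target_len=64, interp_passes=2, rng=np.random):
    T = seq.shape[0]
    if T > target_len:
        idx = np.sort(rng.choice(T, target_len, replace=False))   # randomly keep 64 frames
        return seq[idx]
    for _ in range(interp_passes):                                 # insert means of adjacent frames
        mids = (seq[:-1] + seq[1:]) / 2.0
        merged = np.empty((seq.shape[0] + mids.shape[0],) + seq.shape[1:], dtype=seq.dtype)
        merged[0::2], merged[1::2] = seq, mids                     # interleave originals and means
        seq = merged
    return seq
```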
Data preprocessing followed the approach outlined in [35]. Sequence-level translation based on the first frame was performed to achieve invariance to the initial position. If a frame contained two people, it was split into two frames, each containing one human skeleton. During training, we evenly split the entire skeleton sequence into 20 clips and randomly selected one frame from each clip to obtain a new sequence of 20 frames. During testing, we randomly created five new sequences and used the average score to predict the category. During training, we also performed data augmentation by randomly rotating the 3D skeleton by a certain degree at the sequence level in order to be robust to view changes. For the NTU RGB+D (CS setting), NTU RGB+D 120, and UT-Kinect datasets, we randomly selected three angles (around the X, Y, and Z axes, respectively) in the range of [−17°, 17°]. Considering the large view variation in the NTU RGB+D (CV setting), we randomly selected three angles in the range of [−30°, 30°].
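The sequence-level rotation augmentation can be sketched as below: one random angle per axis, applied to all joints of all frames. The angle range is passed in (about ±17° for the CS, NTU 120, and UT-Kinect settings, ±30° for the NTU 60 CV setting).

```python
# Hedged sketch of the random 3D rotation augmentation applied at the sequence level.
import numpy as np

def random_rotate(seq, max_deg=17.0, rng=np.random):
    # seq: (T, V, 3) joint coordinates; one rotation is shared by the whole sequence.
    ax, ay, az = np.deg2rad(rng.uniform(-max_deg, max_deg, size=3))
    Rx = np.array([[1, 0, 0], [0, np.cos(ax), -np.sin(ax)], [0, np.sin(ax), np.cos(ax)]])
    Ry = np.array([[np.cos(ay), 0, np.sin(ay)], [0, 1, 0], [-np.sin(ay), 0, np.cos(ay)]])
    Rz = np.array([[np.cos(az), -np.sin(az), 0], [np.sin(az), np.cos(az), 0], [0, 0, 1]])
    return seq @ (Rz @ Ry @ Rx).T                 # rotate every joint of every frame
```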

3.3. Comparison with State of the Art

To establish a fair comparison, we adopted the widely accepted 4-Stream fusion method in our experiments. Specifically, we input the four different modalities: joint, skeleton, joint motion, and skeleton motion. The joint and skeleton modalities represent the raw skeleton coordinates and their derivatives relative to the skeleton connections, respectively, providing essential spatial and structural information. The joint motion and skeleton motion modalities calculate the temporal derivatives of the joint and skeleton modalities, capturing the temporal dynamics of the action. Subsequently, we combined the prediction scores from each stream to produce the final fused result.
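A minimal sketch of how the four modalities and the score-level fusion could be formed from a joint sequence of shape (C, T, V) is given below; the `bone_pairs` (child, parent) list is dataset-specific and assumed given, and equal fusion weights are an assumption, not the authors' exact pipeline.

```python
# Hedged sketch of 4-stream input construction (joint, skeleton, and their motions)
# and score-level fusion.
import numpy as np

def make_modalities(joint, bone_pairs):
    skeleton = np.zeros_like(joint)
    for child, parent in bone_pairs:                   # skeleton (bone): joint minus its parent joint
        skeleton[:, :, child] = joint[:, :, child] - joint[:, :, parent]
    joint_motion = np.zeros_like(joint)                # temporal differences of the joint stream
    joint_motion[:, 1:] = joint[:, 1:] - joint[:, :-1]
    skeleton_motion = np.zeros_like(skeleton)          # temporal differences of the skeleton stream
    skeleton_motion[:, 1:] = skeleton[:, 1:] - skeleton[:, :-1]
    return {"joint": joint, "skeleton": skeleton,
            "joint_motion": joint_motion, "skeleton_motion": skeleton_motion}

def fuse_scores(stream_scores, weights=None):
    # stream_scores: list of (num_class,) score vectors, one per stream.
    weights = weights or [1.0] * len(stream_scores)
    fused = sum(w * s for w, s in zip(weights, stream_scores))
    return int(np.argmax(fused))
```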
We compared our model with state-of-the-art methods on the NTU RGB+D and NTU RGB+D 120 datasets, and the results are presented in Table 1. On both datasets, our method outperforms all existing methods across all evaluation benchmarks. In terms of parameter count and computational cost, our model differs from BlockGCN (which has the lowest parameter count and computational cost) by only 0.5 M parameters and 0.29 G FLOPs, while remaining lower than many of the other methods; at the same time, it delivers clear accuracy gains over BlockGCN on every metric. In practical deployment scenarios, therefore, the additional parameters and FLOPs relative to BlockGCN have a negligible impact on the usability of the model. Notably, our method is the first to replace temporal convolutions with a self-attention mechanism module.
On the UT-Kinect dataset, we compared our model with the classic ST-GCN method and the latest BlockGCN method (the strongest baseline in Table 1), using AUC and G-mean to analyze in detail the performance differences among the three models on each action category. It is worth noting that AUC and G-mean, together with the balance between Precision and Recall, summarize the information provided by a confusion matrix in compact form and therefore constitute proper metrics for evaluating the classification ability of a prediction model [47,48]. The specific results are summarized in Table 2.
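For reference, the per-class AUC and G-mean reported in Table 2 can be computed as in the sketch below (one-vs-rest AUC from the class scores, G-mean as the geometric mean of sensitivity and specificity from the binarized confusion matrix); scikit-learn is assumed available, and this is our illustration rather than the authors' evaluation script.

```python
# Hedged sketch of per-class AUC and G-mean computation for a multi-class classifier.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

def per_class_auc_gmean(y_true, y_score, class_idx):
    # y_true: (N,) integer labels; y_score: (N, num_class) prediction scores.
    y_bin = (np.asarray(y_true) == class_idx).astype(int)
    auc = roc_auc_score(y_bin, y_score[:, class_idx])                 # one-vs-rest AUC
    y_pred = (np.argmax(y_score, axis=1) == class_idx).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_bin, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return auc, float(np.sqrt(sensitivity * specificity))             # AUC, G-mean
```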
The experimental results show that for simpler action categories such as "Walk", all three models achieved satisfactory recognition results; however, for easily confused action pairs, such as "Sit Down" vs. "Standup" and "Push" vs. "Pull", the recognition performance of all models decreased significantly. For the action types "Wave Hands" and "Clap Hands", which have high-frequency periodic characteristics, the Transformer-based temporal modeling proposed in this study showed superior recognition ability compared with the other two methods. On the AUC metric for "Wave Hands" recognition, the proposed method achieved a 0.95% improvement over the BlockGCN method. In particular, when processing the long-duration action "Carry", our model achieved a better result than the BlockGCN method, with an AUC improvement of 4.3% and a G-mean improvement of 4.18%. This shows that the proposed architecture has a clear performance advantage in action recognition tasks involving long time sequences and can more effectively capture and understand long-range dependencies in complex action patterns. Compared with the BlockGCN method, the proposed method also achieved improvements of 0.46% and 0.44% in average AUC and G-mean, respectively, further verifying its effectiveness in action recognition tasks.

3.4. Ablation Experiment

To demonstrate the effectiveness of the introduced and designed modules, ablation experiments were conducted on the NTU RGB+D dataset using a single joint modality on the X-Sub benchmark; the results are shown in Table 3. As the table shows, replacing the TCN module with the Transformer Block module led to a 1.6% performance improvement. Additionally, after incorporating the CTA attention mechanism, our method outperformed the GCN + TCN approach by 0.5%. This shows that using CTA as an intermediate layer helps preserve local details and transmit them in a form better suited for the Transformer, enabling more efficient feature transformation and smoother data flow.
Comparison of attention mechanisms: We conducted a detailed comparison of attention mechanisms widely applied in graph convolutional networks, with the results shown in Table 4. As shown in the data, methods incorporating attention mechanisms consistently outperform baseline methods, demonstrating the critical role of attention in enhancing the expression of regions of interest. Furthermore, experimental results indicate that our proposed CTA method outperforms other competing approaches. The CTA module, by combining contextual information mining with self-attention learning, effectively enhances the visual representation of 2D feature maps while improving the model’s sensitivity to temporal information. This method has demonstrated excellent efficiency and reliability in handling long-sequence data for action classification tasks.
Number of Convolutional Attention Heads in Multi-Head Attention: We investigated the effectiveness of the initial number of Convolutional Attention heads (h) in our model, with the results shown in Table 5. The findings show that the multi-head attention mechanism improves action classification performance.

4. Conclusions

The main contribution of this study is the development of a spatiotemporal graph convolutional network framework, GCN-Former, which combines the spatial aggregation capabilities of graph convolutional networks (GCNs) with the temporal modeling advantages of the Transformer architecture. We introduce an improved Contextual Temporal Attention (CTA) mechanism, which not only effectively captures the long-range dependencies in long-span action sequences but also retains key local detail features. The experimental results show that GCN-Former achieves state-of-the-art performance on multiple evaluation metrics on the widely used NTU RGB+D 60 and NTU RGB+D 120 datasets, demonstrating the effectiveness and superiority of the method. Experiments on the UT-Kinect dataset further show that GCN-Former has a clear performance advantage in action recognition tasks involving long time sequences and can more effectively capture and understand long-range dependencies in complex action patterns.
In addition, through ablation analysis, we verify the positive impact of each component of the model on the overall performance, further confirming the rationality of the model design. However, despite the significant progress made by GCN-Former, some limitations remain. For example, the performance of the model may be limited for extremely complex action sequences or highly dynamic action patterns. Future work will focus on optimizing the CTA module to better handle complex scenes and on exploring more effective strategies to reduce computational costs and improve the efficiency of real-time applications. We also plan to incorporate more diverse information sources into the model, such as audio or environmental perception data, to enhance its ability to understand human movements in different situations and to provide more comprehensive technical support and a stronger theoretical basis for skeleton-based action recognition. This would both broaden the scope of application of the current research and point the way for subsequent work. Finally, the current study evaluates algorithms by directly comparing performance indicators without an in-depth exploration of statistically significant differences between them. Although this approach provides an intuitive comparison, introducing more rigorous statistical tests (such as the non-parametric Friedman Aligned Ranking (FAR) and the post hoc Finner test) [52] would enhance the reliability and robustness of the conclusions. We plan to use these methods in future work to further verify the results and to analyze precisely the relative performance and statistical significance of each algorithm.

Author Contributions

X.C. designed and conducted this research, analyzed the results, and wrote the manuscript; J.Z. and W.Z. critically reviewed and provided valuable feedback on the manuscript; Y.H. and Z.W. conducted formal analysis and investigation work. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Inner Mongolia (2024LHMS06005370), Metallurgical Engineering First-Class Discipline Scientific Research Special Project of the Department of Education of Inner Mongolia Autonomous Region (YLXKZX-NKD-012) and Fundamental Research Funds for Inner Mongolia University of Science & Technology.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2017, arXiv:1609.02907. [Google Scholar]
  2. Schuldt, C.; Laptev, I.; Caputo, B. Recognizing Human Actions: A Local SVM Approach. In Proceedings of the 17th International Conference on Pattern Recognition, 2004, ICPR 2004, Cambridge, UK, 26 August 2004; IEEE: Piscataway, NJ, USA, 2004; Volume 3, pp. 32–36. [Google Scholar]
  3. Ahmad, M.; Lee, S.-W. HMM-Based Human Action Recognition Using Multiview Image Sequences. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; IEEE: Piscataway, NJ, USA, 2006; Volume 1, pp. 263–266. [Google Scholar]
  4. Ke, Q.; Bennamoun, M.; An, S.; Sohel, F.; Boussaid, F. A New Representation of Skeleton Sequences for 3d Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 3288–3297. [Google Scholar]
  5. Liu, M.; Liu, H.; Chen, C. Enhanced Skeleton Visualization for View Invariant Human Action Recognition. Pattern Recognit. 2017, 68, 346–362. [Google Scholar] [CrossRef]
  6. Du, Y.; Wang, W.; Wang, L. Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1110–1118. [Google Scholar]
  7. Song, S.; Lan, C.; Xing, J.; Zeng, W.; Liu, J. An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; AAAI: Palo Alto, CA, USA, 2017; Volume 31. [Google Scholar]
  8. Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View Adaptive Recurrent Neural Networks for High Performance Human Action Recognition from Skeleton Data. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2117–2126. [Google Scholar]
  9. Yan, S.; Xiong, Y.; Lin, D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; AAAI: Palo Alto, CA, USA, 2018; Volume 32. [Google Scholar]
  10. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 3595–3603. [Google Scholar]
  11. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 12026–12035. [Google Scholar]
  12. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-Based Action Recognition with Directed Graph Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 7912–7921. [Google Scholar]
  13. Zhang, X.; Xu, C.; Tao, D. Context Aware Graph Convolution for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 14333–14342. [Google Scholar]
  14. Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-Based Action Recognition with Shift Graph Convolutional Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 183–192. [Google Scholar]
  15. Chi, H.; Ha, M.H.; Chi, S.; Lee, S.W.; Huang, Q.; Ramani, K. Infogcn: Representation Learning for Human Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 20186–20196. [Google Scholar]
  16. Chi, S.; Chi, H.; Huang, Q.; Ramani, K. InfoGCN++: Learning Representation by Predicting the Future for Online Human Skeleton-Based Action Recognition. arXiv 2023, arXiv:2310.10547. [Google Scholar] [CrossRef] [PubMed]
  17. Lee, J.; Lee, M.; Lee, D.; Lee, S. Hierarchically Decomposed Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 10444–10453. [Google Scholar]
  18. Chen, Y.; Zhang, Z.; Yuan, C.; Li, B.; Deng, Y.; Hu, W. Channel-Wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 13359–13368. [Google Scholar]
  19. Zhou, Y.; Yan, X.; Cheng, Z.-Q.; Yan, Y.; Dai, Q.; Hua, X.-S. Blockgcn: Redefine Topology Awareness for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 2049–2058. [Google Scholar]
  20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  21. Plizzari, C.; Cannici, M.; Matteucci, M. Skeleton-Based Action Recognition via Spatial and Temporal Transformer Networks. Comput. Vis. Image Underst. 2021, 208, 103219. [Google Scholar] [CrossRef]
  22. Zhang, Y.; Wu, B.; Li, W.; Duan, L.; Gan, C. STST: Spatial-Temporal Specialized Transformer for Skeleton-Based Action Recognition. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, China, 20–24 October 2021; ACM: New York, NY, USA, 2021; pp. 3229–3237. [Google Scholar]
  23. Gao, Z.; Wang, P.; Lv, P.; Jiang, X.; Liu, Q.; Wang, P.; Xu, M.; Li, W. Focal and Global Spatial-Temporal Transformer for Skeleton-Based Action Recognition. In Proceedings of the Asian Conference on Computer Vision, ACCV, Macao, China, 4–8 December 2022; pp. 382–398. [Google Scholar]
  24. Shi, F.; Lee, C.; Qiu, L.; Zhao, Y.; Shen, T.; Muralidhar, S.; Han, T.; Zhu, S.-C.; Narayanan, V. STAR: Sparse Transformer-Based Action Recognition. arXiv 2021, arXiv:2107.07089. [Google Scholar]
  25. Zhou, Y.; Cheng, Z.-Q.; Li, C.; Fang, Y.; Geng, Y.; Xie, X.; Keuper, M. Hypergraph Transformer for Skeleton-Based Action Recognition. arXiv 2023, arXiv:2211.09590. [Google Scholar]
  26. Pang, Y.; Ke, Q.; Rahmani, H.; Bailey, J.; Liu, J. IGFormer: Interaction Graph Transformer for Skeleton-Based Human Interaction Recognition. In Computer Vision–ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2022; Volume 13685, pp. 605–622. ISBN 978-3-031-19805-2. [Google Scholar]
  27. Duan, H.; Xu, M.; Shuai, B.; Modolo, D.; Tu, Z.; Tighe, J.; Bergamo, A. Skeletr: Towards Skeleton-Based Action Recognition in the Wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 13634–13644. [Google Scholar]
  28. Wen, Y.; Tang, Z.; Pang, Y.; Ding, B.; Liu, M. Interactive Spatiotemporal Token Attention Network for Skeleton-Based General Interactive Action Recognition. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 7886–7892. [Google Scholar]
  29. Li, Y.; Yao, T.; Pan, Y.; Mei, T. Contextual Transformer Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1489–1500. [Google Scholar] [CrossRef] [PubMed]
  30. Xu, Z.; Wu, D.; Yu, C.; Chu, X.; Sang, N.; Gao, C. Sctnet: Single-Branch Cnn with Transformer Semantic Information for Real-Time Segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; AAAI: Washington, DC, USA, 2024; Volume 38, pp. 6378–6386. [Google Scholar]
  31. Shahroudy, A.; Liu, J.; Ng, T.-T.; Wang, G. Ntu Rgb+ d: A Large Scale Dataset for 3d Human Activity Analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1010–1019. [Google Scholar]
  32. Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.-Y.; Kot, A.C. Ntu Rgb+ d 120: A Large-Scale Benchmark for 3d Human Activity Understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701. [Google Scholar] [CrossRef] [PubMed]
  33. Xia, L.; Chen, C.-C.; Aggarwal, J.K. View Invariant Human Action Recognition Using Histograms of 3D Joints. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; IEEE: Piscataway, NJ, USA; pp. 20–27. [Google Scholar]
  34. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 770–778. [Google Scholar]
  35. Zhang, P.; Lan, C.; Zeng, W.; Xing, J.; Xue, J.; Zheng, N. Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1112–1121. [Google Scholar]
  36. Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; Ouyang, W. Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 143–152. [Google Scholar]
  37. Chen, Z.; Li, S.; Yang, B.; Li, Q.; Liu, H. Multi-Scale Spatial Temporal Graph Convolutional Network for Skeleton-Based Action Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Conference, 2–9 February 2021; AAAI: Palo Alto, CA, USA, 2021; Volume 35, pp. 1113–1122. [Google Scholar]
  38. Song, Y.-F.; Zhang, Z.; Shan, C.; Wang, L. Constructing Stronger and Faster Baselines for Skeleton-Based Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1474–1488. [Google Scholar] [CrossRef] [PubMed]
  39. Pan, L.; Lu, J.; Tang, X. Spatial-Temporal Graph Neural ODE Networks for Skeleton-Based Action Recognition. Sci. Rep. 2024, 14, 7629. [Google Scholar] [CrossRef] [PubMed]
  40. Kang, M.-S.; Kang, D.; Kim, H. Efficient Skeleton-Based Action Recognition via Joint-Mapping Strategies. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 3403–3412. [Google Scholar]
  41. Wu, L.; Zhang, C.; Zou, Y. SpatioTemporal Focus for Skeleton-Based Action Recognition. Pattern Recognit. 2023, 136, 109231. [Google Scholar] [CrossRef]
  42. Gedamu, K.; Ji, Y.; Gao, L.; Yang, Y.; Shen, H.T. Relation-Mining Self-Attention Network for Skeleton-Based Human Action Recognition. Pattern Recognit. 2023, 139, 109455. [Google Scholar] [CrossRef]
  43. Yang, W.; Zhang, J.; Cai, J.; Xu, Z. HybridNet: Integrating GCN and CNN for Skeleton-Based Action Recognition. Appl. Intell. 2023, 53, 574–585. [Google Scholar] [CrossRef]
  44. Bavil, A.F.; Damirchi, H.; Taghirad, H.D. Action Capsules: Human Skeleton Action Recognition. Comput. Vis. Image Underst. 2023, 233, 103722. [Google Scholar] [CrossRef]
  45. Lu, J.; Huang, T.; Zhao, B.; Chen, X.; Zhou, J.; Zhang, K. Dual-Excitation Spatial–Temporal Graph Convolution Network for Skeleton-Based Action Recognition. IEEE Sens. J. 2024, 24, 8184–8196. [Google Scholar] [CrossRef]
  46. Ke, L.; Peng, K.-C.; Lyu, S. Towards To-at Spatio-Temporal Focus for Skeleton-Based Action Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Conference, 22 February–1 March 2022; AAAI: Palo Alto, CA, USA, 2022; Volume 36, pp. 1131–1139. [Google Scholar]
  47. Livieris, I.E.; Karacapilidis, N.; Domalis, G.; Tsakalidis, D. An Advanced Explainable and Interpretable ML-Based Framework for Educational Data Mining. In Methodologies and Intelligent Systems for Technology Enhanced Learning, Workshops Proceedings of the 13th International Conference, Guimaraes, Portugal, 12–14 July 2023; Kubincová, Z., Caruso, F., Kim, T., Ivanova, M., Lancia, L., Pellegrino, M.A., Eds.; Springer Nature: Cham, Switzerland, 2023; pp. 87–96. [Google Scholar]
  48. Livieris, I.E. A Novel Forecasting Strategy for Improving the Performance of Deep Learning Models. Expert Syst. Appl. 2023, 230, 120632. [Google Scholar] [CrossRef]
  49. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 7132–7141. [Google Scholar]
  50. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  51. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 13713–13722. [Google Scholar]
  52. Kiriakidou, N.; Livieris, I.E.; Pintelas, P. Mutual Information-Based Neighbor Selection Method for Causal Effect Estimation. Neural Comput. Appl. 2024, 36, 9141–9155. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of GCN-Former.
Figure 2. Contextual Temporal Attention (CTA).
Figure 3. Structure of the Transformer Block (a) and details of the Convolutional Attention (b). GDN denotes Grouped Dual Normalization, $\circledast$ represents the convolution operation, $\oplus$ indicates addition, and k denotes the kernel size.
Table 1. Performance on the NTU RGB+D and NTU RGB+D 120 datasets.
| Methods | Parameters | FLOPs | NTU RGB+D 60 X-Sub (%) | NTU RGB+D 60 X-View (%) | NTU RGB+D 120 X-Sub (%) | NTU RGB+D 120 X-Set (%) |
|---|---|---|---|---|---|---|
| MS-G3D [36] | 2.8 M | 5.22 G | 91.5 | 96.2 | 86.9 | 88.4 |
| MST-GCN [37] | 12 M | - | 91.5 | 96.6 | 87.5 | 88.8 |
| CTR-GCN | 1.5 M | 1.97 G | 92.4 | 96.4 | 88.9 | 90.4 |
| EfficientGCN-B4 [38] | 2.0 M | 15.2 G | 91.7 | 95.7 | 88.3 | 89.1 |
| STG-NODE [39] | - | - | 84.0 | 91.1 | - | - |
| AM-GCN [40] | - | - | 90.3 | 95.2 | 86.9 | 88.2 |
| 4s-STF-Net [41] | 6.8 M | - | 91.1 | 96.5 | 86.5 | 88.2 |
| RSA-Net [42] | 3.5 M | 3.84 G | 91.8 | 96.8 | 88.4 | 89.7 |
| 4s-HybridNet [43] | - | - | 91.4 | 96.9 | 87.5 | 89.0 |
| Action Capsules [44] | - | 3.84 G | 90.0 | 96.3 | - | - |
| JPA-DESTGCN [45] | 4.41 M | 5.2 G | 91.6 | 96.9 | 87.5 | 88.5 |
| STF [46] | - | - | 92.5 | 96.9 | 88.9 | 89.9 |
| InfoGCN | 1.6 M | 1.84 G | 92.3 | 96.7 | 89.2 | 90.7 |
| HDGCN | 1.7 M | 1.77 G | 93.0 | 97.0 | 89.8 | 91.2 |
| BlockGCN | 1.3 M | 1.63 G | 93.1 | 97.0 | 90.3 | 91.5 |
| Ours | 1.8 M | 1.92 G | 93.5 | 97.3 | 90.7 | 91.9 |
Table 2. Performance on the UT-Kinect dataset.
| Action | ST-GCN AUC | ST-GCN G-Mean | BlockGCN AUC | BlockGCN G-Mean | Ours AUC | Ours G-Mean |
|---|---|---|---|---|---|---|
| Walk | 96.98 | 96.96 | 98.69 | 98.66 | 98.35 | 98.33 |
| Sit Down | 89.37 | 89.21 | 94.29 | 94.14 | 94.42 | 94.24 |
| Standup | 90.05 | 89.92 | 94.52 | 94.49 | 94.17 | 94.12 |
| Pickup | 90.29 | 90.24 | 92.46 | 92.28 | 92.61 | 92.57 |
| Carry | 89.34 | 89.17 | 91.18 | 90.77 | 95.54 | 94.95 |
| Throw | 92.24 | 92.24 | 95.25 | 95.23 | 94.75 | 94.69 |
| Push | 87.69 | 87.18 | 89.41 | 88.92 | 89.26 | 88.86 |
| Pull | 85.73 | 85.39 | 86.74 | 86.35 | 86.83 | 86.51 |
| Wave Hands | 92.56 | 92.44 | 95.23 | 95.16 | 96.18 | 96.09 |
| Clap Hands | 90.65 | 90.16 | 94.72 | 94.58 | 94.95 | 94.62 |
| Average | 90.49 | 90.29 | 93.25 | 93.06 | 93.71 | 93.50 |
Table 3. Ablation experiment.
| GCN | TCN | CTA | Transformer Block | Param | FLOPs | Acc (%) |
|---|---|---|---|---|---|---|
| ✓ | ✓ | | | 1.5 M | 1.97 G | 88.9 |
| ✓ | ✓ | ✓ | | 1.6 M | 1.99 G | 89.2 |
| ✓ | | | ✓ | 1.7 M | 1.88 G | 90.5 |
| ✓ | | ✓ | ✓ | 1.8 M | 1.92 G | 91.3 |
Table 4. Comparison of attention mechanisms.
| Settings | Param | FLOPs | Acc (%) |
|---|---|---|---|
| GCN-Former (Ours) | 1.7 M | 1.88 G | 90.5 |
| +SE [49] | 1.9 M | 1.89 G | 90.8 |
| +CBAM [50] | 1.8 M | 1.89 G | 90.8 |
| +CA [51] | 1.9 M | 1.95 G | 91.0 |
| +CoT | 1.7 M | 1.90 G | 91.0 |
| +CTA (Ours) | 1.8 M | 1.92 G | 91.3 |
Table 5. Number of convolutional attention heads.
| GCN | Convolutional Attention (h = 1) | Convolutional Attention (h = 2) | Convolutional Attention (h = 4) | Acc (%) |
|---|---|---|---|---|
| ✓ | ✓ | | | 90.4 |
| ✓ | | ✓ | | 90.9 |
| ✓ | | | ✓ | 91.3 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
