Article

ActionMamba: Action Spatial–Temporal Aggregation Network Based on Mamba and GCN for Skeleton-Based Action Recognition

School of Electrical and Control Engineering, North University of China, Taiyuan 030051, China
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(18), 3610; https://doi.org/10.3390/electronics14183610
Submission received: 1 July 2025 / Revised: 3 September 2025 / Accepted: 8 September 2025 / Published: 11 September 2025

Abstract

Skeleton-based action recognition networks have widely adopted Graph Convolutional Networks (GCNs) owing to their superior ability to model data topology, but several key issues still require further investigation. Firstly, graph convolutional networks extract action features by applying temporal convolution to each key point, which causes the model to ignore the temporal connections between different key points. Secondly, the local receptive field of graph convolutional networks limits their ability to capture correlations between non-adjacent joints. Motivated by the State Space Model (SSM), we propose an Action Spatio-temporal Aggregation Network, named ActionMamba. Specifically, we introduce a novel embedding module called the Action Characteristic Encoder (ACE), which enhances the coupling of temporal and spatial information in skeletal features by combining intrinsic spatio-temporal encoding with extrinsic space encoding. Additionally, we design an Action Perception Model (APM) based on Mamba and GCN. By effectively combining the excellent feature processing capabilities of GCN with the outstanding global information modeling capabilities of Mamba, APM is able to comprehend the hidden features between different joints and selectively filter information from various joints. Extensive experimental results demonstrate that ActionMamba achieves highly competitive performance on three challenging benchmark datasets: NTU-RGB+D 60, NTU-RGB+D 120, and UAV–Human.

1. Introduction

Action recognition is a fundamental task in video understanding. In earlier research efforts, a variety of modalities have been employed to represent features, including RGB frames [1], optical flow information [2,3], and human skeleton representations [4,5]. Among these different modalities, action recognition methods that rely on skeletal data have attracted growing attention, and this is largely due to their unique focus on the coordinate sequences of human joints. In the early stages of relevant research, skeleton-centered methodologies mainly converted skeletal data into sequences of joint vectors or pseudo-images. These converted representations were subsequently used to train Convolutional Neural Networks (CNNs) [6] or Recurrent Neural Networks (RNNs) [7]. Nevertheless, these vector-based sequences had limitations: they lacked the ability to model the structural connections between keypoints and failed to sufficiently capture the spatial dependencies that exist between adjacent joints. To address this problem, Graph Neural Networks (GNNs) were introduced into the field. Specifically, the research by [8] proposed a framework where human joints and their physical connections are defined as nodes and edges in a graph, respectively. On this basis, Graph Convolutional Networks (GCNs) are applied to extract spatio-temporal features from the pre-defined graph structure. Through the process of aggregating information from connected nodes, this method enables the direct modeling of joint adjacency and the interactions between joints. Furthermore, skeleton-based technical approaches show remarkable robustness when facing contextual variations such as background noise and changes in lighting conditions. This robustness stems from the compact and low-dimensional nature of skeletal data representations. These inherent advantages of skeleton-based methods lay a solid foundation for the advancement of subsequent research in the domain of skeleton-based action recognition.
For improved modeling of interactions among distant joints, AS-GCN was introduced by [9]. This model enhances classification results in action recognition by using fundamental blocks that combine action–structure graph convolution with temporal convolution for learning features across space and time. In an effort to uncover latent connections between joints, Ref. [10] utilized a neural architecture search method to refine the GCN model’s structure. Their method expands the search parameters by incorporating a range of dynamic graph substructures and higher-order connections, and it utilizes Chebyshev polynomial approximations to implicitly represent the associations between joints. The work from [11] successfully captures dependencies over extended ranges by integrating contextual information via feature fusion. With the aim of lowering the computational burden of GCNs, Shift-GCN was presented by [12], adapting the principle of shift convolution for graph-based networks. This architecture leverages shift-graph operations combined with lightweight point-wise convolutions to create adaptable receptive fields for spatial and temporal graph convolutions, thereby augmenting representational power while decreasing complexity. Furthermore, Ref. [13] put forward a multi-stream GCN design that merges input from joint positions, motion velocities, and bone characteristics in its initial layers. The framework utilizes separable convolution layers alongside composite scaling methods to diminish superfluous parameters and concurrently enhance the model’s overall capacity.
Subsequently, inspired by Transformers, Ref. [14] decomposed the data into spatial and temporal dimensions and proposed the Decoupled Spatial–Temporal Attention Network (DSTA-Net). This network encodes the two streams sequentially with attention modules, modeling the spatial and temporal dependencies between joints without involving positional or mutual connection information. Ref. [15] introduced a multimodal Transformer-based action recognition network that fuses the encoding results of spatial–temporal skeletal models and acceleration models. To leverage the advantages of both Transformers and GCNs simultaneously, Ref. [16] proposed a multi-order multimodal Transformer (3Mformer) that applies high-order Transformers to process skeletal data’s spatio-temporal features to better capture higher-order motion patterns between joints. SkeleTR [17] initially captures individual dynamic information with GCNs and then models human interaction with stacked Transformer encoders. Ref. [18] proposed a new triple attention module (TAM) to guide GCN in perceiving significant changes in local motion. BlockGCN [19] encodes bone connections by utilizing the feature of graph distance to describe physical topology, preserving important topological details that are often lost in traditional GCNs and improving the performance of the original GCN.
Despite the promising results achieved in previous research, existing technical methods still exhibit significant limitations, particularly in balancing efficiency and effectiveness when addressing behavior recognition tasks. GCN-based approaches face challenges in capturing global contextual information due to their inherently limited receptive fields—a constraint that hinders the development of high-quality skeletal features. In contrast, Transformer-based models and hybrid frameworks excel at global feature modeling but are constrained by considerable computational costs. This elevated cost is mainly caused by the self-attention mechanism, whose complexity exhibits a quadratic relationship with the number of tokens.
Fortunately, Mamba [21], an improved variant of the S4 model [20], offers a new solution to the aforementioned problems through its selection mechanism and parallel algorithm design. It enables a model to attain the long-range context modeling capability of Transformers while maintaining linear complexity. Recently, pioneering work has applied Mamba to tasks such as remote sensing [22,23], visual recognition [24,25], and medical imaging [26,27], but its potential in action recognition has not been fully explored, partly because it lacks designs such as cross-attention.
Therefore, we propose an Action Spatio-temporal Aggregation Network (ActionMamba) based on GCN and Mamba. It mainly consists of two parts: the Action Characteristic Encoder (ACE) and the Action Perception Model (APM). The Action Characteristic Encoder consists of Inner Spatial–Temporal Embedding and External Space Embedding. By introducing explicit coordinate features and emphasizing the importance of the intrinsic skeleton in the input features, this module significantly improves the spatial geometric modeling of the model and mitigates the model’s disregard for spatio-temporal connections between different key points. In the network’s backbone stage, we designed the Action Perception Model. To fully utilize GCN’s excellent early-stage processing power and Mamba’s efficiency in extracting remote features, we divided APM into two stages. First, the skeletal features are fed into the Shift Mamba-GCN module (SM-GCN), which expands the graph convolutional receptive field and extracts intrinsic skeletal features through shift graph convolution and Mamba block processing. This endows the model with the capability to capture implicit dependencies among different joints. Subsequently, we introduce the Spatio-Temporal Mamba block (ST-Mamba), which integrates residual connections with state space model (SSM) modules. This design is engineered to capture long-term relationships at both the intra-frame and inter-frame levels, enabling the acquisition of feature representations that are more precise, discriminative, and resilient. To assess the performance of our proposed model, comprehensive experiments were performed using three benchmark datasets for skeleton-based action recognition: NTU RGB+D 60, NTU RGB+D 120, and UAV–Human.
The primary contributions of this work are outlined as follows:
  • We propose ActionMamba, an action spatio-temporal aggregation network featuring a global receptive field and dynamic weighting mechanism. By integrating Mamba with a Graph Convolutional Network, ActionMamba adaptively captures global spatio-temporal properties of human skeletons with linear complexity, thereby improving model performance.
  • To boost the model’s comprehension of spatial relationships, we have devised an efficient action feature encoder. This module incorporates extrinsic skeletal characteristics and attention-based mechanisms, facilitating a more accurate representation of the human body’s structural configuration.
  • Furthermore, to discern implicit relationships among joints, we put forward a novel Action Perception Model. This new model synergizes the global feature extraction capacity of Mamba with the local feature aggregation proficiency of GCNs.
  • Extensive experiments conducted on multiple benchmark datasets demonstrate that our approach consistently outperforms existing state-of-the-art methods.
The structure for the remainder of this manuscript is as follows. In Section 2, we provide a review of the pertinent literature. Section 3 is dedicated to detailing our methodology, encompassing the SSM model and our proposed framework, which consists of the APM and ACE modules. The comprehensive evaluation and the resulting outcomes are presented in Section 4. Finally, Section 5 concludes the paper by summarizing our proposed contributions and suggesting avenues for future investigation.

2. Related Work

2.1. GCN-Based Action Recognition

Skeleton-based action recognition has garnered considerable attention in research due to the compactness of skeleton data and the strong correlations among features. Graph Convolutional Networks, particularly ST-GCN [8], have achieved remarkable success in this domain by exploiting the intrinsic connectivity and feature correlation of the human body. This model pioneered graph convolutions and temporal convolutions, which encode the connections among skeleton joints using adjacency matrices and extract the dynamic information of individual joints in the temporal dimension through 1D convolution. To enhance the adaptability of the skeleton graph, AGCN [28] introduced an adaptive adjacency matrix that not only emphasizes existing skeletal connections but also captures potential relationships between nodes. By focusing on contextually related intrinsic topology modeling, Dynamic GCN [29] integrates the contextual features of all joints into the learning of joint relationships. Channel-Topology Refined GCN (CTR-GCN) [5] focuses on embedding joint topology across different channels. In contrast, InfoGCN [30] introduces an attention-based graph convolution mechanism that captures contextually relevant topologies through potential representations learned via the information bottleneck principle. Spatial Graph Diffusion Convolution (S-GDC) networks [31] aim to learn dynamic graph structures through diffusion, enabling the model to capture both long-range dependencies within a single body and interactions between multiple bodies. In summary, while GCN-based approaches are effective at modeling joint interactions via topological structures, they often remain constrained to local spatial–temporal regions.

2.2. Transformers-Based Action Recognition

The Transformer architecture, originally developed for machine translation tasks in the natural language processing domain, subsequently had its range of applications broadened to include the field of computer vision. The Vision Transformer (ViT) [32] took the lead in this transition, as it adopts Transformer encoders to extract feature information from images. Compared with GCN-based methods, Transformer-centered approaches offer more efficient global topological modeling capabilities and can better emphasize the significance of joints that have no physical adjacency. To make full use of the self-attention mechanism, DSTA-Net [14] proposed a decoupled framework specifically designed for spatio-temporal self-attention. Following a comparable research direction, STAR [33] combines sparse attention applied in the spatial domain with segmented linear attention used in the temporal domain; this combination allows it to efficiently handle skeleton action sequences with variable lengths. Ref. [15] put forward a multimodal action recognition network based on Transformers, which integrates the encoded feature representations derived from both spatio-temporal skeleton models and acceleration-focused models. As a hybrid approach, SkeleTR [17] initially leverages GCNs to capture the dynamic information of individual skeletons, after which it utilizes stacked Transformer encoders to model the interaction relationships among humans. While Transformer-based methods are capable of capturing global topological information and emphasizing the relationships between non-adjacent joints, the self-attention mechanism they depend on introduces computational complexity that exhibits a quadratic relationship with the number of tokens. This intrinsic characteristic significantly restricts the scalability of these methods and their practical deployment in real-world application scenarios [34].

2.3. State Space Model

Benefiting from its excellent mathematical properties, the state space model (SSM) has gained significant attention in recent times. A key distinguishing feature of this model is its ability to effectively retain historical input information, while only relying on the latest input during both training and inference processes. When compared with recurrent neural networks (RNNs) and Transformers, architectures constructed based on SSMs have shown significantly better performance in tasks requiring long-context understanding—with the Long-Range Arena (LRA) benchmark [35] serving as a typical example. In recent developments, prominent models like Mamba [21] have demonstrated improved performance and higher computational efficiency compared to advanced Transformer models in long-context scenarios; its structure is shown in Figure 1. This achievement strongly confirms the great potential of SSMs to act as core architectural components. Consequently, the application scope of Mamba has expanded beyond its initial use in natural language processing to cover a range of other fields. For example, within the Vision Transformer framework, Vim [25] makes use of Mamba to develop a new, general-purpose vision backbone, which is built on bidirectional Mamba blocks. This particular backbone embeds positional data into the sequence of images and utilizes the bidirectional state space model for compressing visual feature representations. For the specific application of biomedical image segmentation, a hybrid architecture known as U-Mamba was introduced by [36]. This model combines convolutional networks with state space models and was explicitly created to manage long-range dependencies. For multimodal image fusion tasks, Ref. [37] incorporated Mamba to capture long-range features from single-modal images: it first extracts low-level features using CNNs and high-level features using Mamba blocks, and then utilizes a reconstruction module to produce the fused output. Within the domain of action recognition, Simba [38] presents a skeleton-based action recognition model built on the U-ShiftGCN framework. This method embeds Mamba blocks into the core part of the U-ShiftGCN architecture to process temporal features between skeletal connections.

3. Method

In Section 3.1, we introduce the basics of the SSM and outline the motivation behind designing ActionMamba. Next, in Section 3.2, we describe the Action Characteristic Encoder, and in Section 3.3, we present the architectural details of the Action Perception Model.

3.1. Preliminaries

State space models originated in classical control theory and were initially used in control and computational neuroscience to model dynamic systems. They were later discretized and reformulated as recurrent or convolutional computations for efficient processing. In recent years, SSMs have gained popularity among researchers due to their exceptional performance on long-sequence data. Specifically, an SSM defines a linear mapping from an input signal $x(t)$ (a function of time $t$) to an output signal $y(t)$ through a latent state $h(t)$. This relationship can be represented by Equation (1):
$$\dot{h}(t) = A h(t) + B x(t), \qquad y(t) = C h(t) + D x(t). \tag{1}$$
where $h(t), \dot{h}(t) \in \mathbb{R}^{d \times 1}$ represent the latent state and its derivative, $x(t), y(t) \in \mathbb{R}^{q}$ represent the input and output, respectively, and $t$ denotes a continuous time index. $A$, $B$, $C$, $D$ are complex-valued matrices of appropriate dimensions. Additionally, the SSM requires discretization of its parameters before being applied in deep learning algorithms. Specifically, the continuous parameters $A$, $B$, $C$, $D$ are transformed into their discrete counterparts $\bar{A}$, $\bar{B}$, $\bar{C}$, $\bar{D}$ using a time scale $\Delta \in \mathbb{R}$ and the zero-order hold method, as depicted in Equation (2):
$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B \approx \Delta B, \qquad \bar{C} = C, \qquad \bar{D} = D. \tag{2}$$
For ease of reference in the following, we denote this discretization process as Disc:
$$\bar{A}, \bar{B}, \bar{C}, \bar{D} = \mathrm{Disc}(x). \tag{3}$$
The discrete state space model reformulates Equation (1) as follows:
$$h_i = \bar{A} h_{i-1} + \bar{B} x_i, \qquad y_i = \bar{C} h_i + \bar{D} x_i. \tag{4}$$
where $\bar{A}$, $\bar{B}$, $\bar{C}$, and $\bar{D}$ are the discrete-time dynamics matrices obtained with the time step $\Delta \in \mathbb{R}$, and $i$ denotes a discrete time index. Additionally, Mamba introduces a modified parallel scan algorithm for training and inference, which accommodates the time-varying nature of the selective SSM and addresses the high computational cost encountered in practical applications.
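To make the discretization and recurrence above concrete, the following minimal sketch implements Equations (2) and (4) for a diagonal, single-input SSM with a scalar step size $\Delta$; the function names (`zoh_discretize`, `ssm_scan`) are ours for illustration, and Mamba replaces the Python loop with a hardware-aware parallel scan.

```python
import torch

def zoh_discretize(A, B, delta):
    # Zero-order hold discretization (Equation (2)) for a diagonal state matrix A.
    # A, B: (d,) tensors; delta: scalar step size.
    A_bar = torch.exp(delta * A)
    B_bar = (torch.exp(delta * A) - 1.0) / A * B  # (ΔA)^{-1}(exp(ΔA) - I)ΔB, diagonal case
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, D, x):
    # Sequential form of the discrete recurrence in Equation (4).
    # x: (L,) input sequence; returns y: (L,) output sequence.
    h = torch.zeros_like(A_bar)
    ys = []
    for x_i in x:
        h = A_bar * h + B_bar * x_i           # h_i = Ā h_{i-1} + B̄ x_i
        ys.append(torch.dot(C, h) + D * x_i)  # y_i = C̄ h_i + D̄ x_i
    return torch.stack(ys)
```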
Despite Mamba’s impressive efficiency in handling various visual task inputs, its introduction also presents significant challenges. Firstly, Mamba inherently operates in a causal manner, requiring images to be converted into shifted sequences and processed cyclically. This limits the receptive field of each token to its preceding sequence and introduces additional latency. Secondly, as noted by [39], Mamba’s state space model resembles a single-head linear attention mechanism and does not incorporate a multi-head design. Although this strategy increases efficiency, it may not be appropriate for non-causal data structures such as graphs.
We therefore redesign the Mamba structure. By combining global shift convolution with Mamba blocks, Mamba can be applied effectively to non-causal data structures, yielding omnidirectional spatio-temporal feature mixing. Furthermore, to compensate for the receptive-field reduction caused by Mamba’s shifted-sequence computation, we incorporate an action feature encoder. By encoding the fundamental properties of joints together with their intrinsic spatio-temporal connections, Mamba gains a local bias and positional information, which improves model performance.

3.2. Action Characteristic Encoder

In action recognition tasks, the spatial position and associated features of the skeleton play a crucial role due to the significant variations in data from human body joints across different actions. To capture and incorporate the dynamic spatial–temporal relationship inherent in human skeleton data, we have designed the Action Characteristic Encoder module. As illustrated in Figure 2a, we employ Inner Spatial–Temporal Embedding and External Space Embedding to embed the input human skeleton features.
Inner Spatial–Temporal Embedding. The shifting rhythm of temporal information is closely linked to skeletal features; hence, reliable spatial–temporal encoding is critical for action recognition tasks. To effectively integrate and process these spatial–temporal dynamics, we propose the Inner Spatial–Temporal Embedding module (ISTE). This module uses importance as a guiding factor to aggregate spatial–temporal information among skeletons. As shown in Figure 2a, the core idea of the intrinsic spatial–temporal connection unit is to compute attention scores between adjacent vertices using an attention mechanism, after which the vertex features are aggregated and updated. This operation prioritizes nodes that carry essential features and selectively establishes implicit dependencies by aggregating joint features over long distances. As a result, the aggregated representation $z_i$ of node $i$ can be expressed as:
$$z_i(t) = \mathrm{ReLU}\!\left( \alpha_{i,i} W x_i(t) + \sum_{j \in N(i)} \alpha_{i,j} W x_j(t) \right) \tag{5}$$
where $x_i(t) \in \mathbb{R}^{w}$ represents the input features of node $i$, and $N(i) = \{\, j \mid A_{ji} > 0 \,\}$ denotes the set of neighboring nodes of $i$ obtained from the learned adjacency matrix $A$. $W$ is initialized with a uniform initializer and trained as a model parameter. The calculation of the attention coefficients $\alpha_{i,j}$ is shown in Equation (6):
$$\pi(i,j) = \mathrm{LeakyReLU}\!\left( a^{\top} \left[\, W x_i(t) \oplus W x_j(t) \,\right] \right) \tag{6}$$
$$\alpha_{i,j} = \frac{\exp\!\left(\pi(i,j)\right)}{\sum_{k \in N(i) \cup \{i\}} \exp\!\left(\pi(i,k)\right)} \tag{7}$$
where ⊕ denotes the concatenation operation and $a$ is a trainable coefficient vector. We employ the LeakyReLU activation function to introduce non-linearity when computing the attention coefficients, which are subsequently normalized using the softmax function.
External Space Embedding. Following the ISTE module, we concatenate the C-dimensional features with the X, Y, and Z coordinates of the 3D skeletal data to help the model better understand the spatial positions of individual joints. Detailed specifics are illustrated in Figure 2a. This strategy enables the network to explicitly observe local geometric patterns of the human body, thereby facilitating the learning of its structural configuration across the entire network. The final integrated output feature $Z \in \mathbb{R}^{N \times (C+3) \times T}$ is expressed in Equation (8):
$$Z_i(t) = \mathrm{Concat}\!\left( z_i(t),\; z_i^{(3)} \right) \tag{8}$$
where Concat denotes the concatenation operation, and $z_i^{(3)} \in \mathbb{R}^{N \times 3 \times T}$ represents the X-Y-Z coordinates of the three-dimensional skeleton data.
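As a concrete reference, the sketch below implements the ACE computation of Equations (5)–(8) in PyTorch for a single skeleton sequence. The class name `ACESketch`, the learned-adjacency initialization, and the self-loop mask are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACESketch(nn.Module):
    # Illustrative sketch of the Action Characteristic Encoder:
    # attention-weighted aggregation over learned neighbours (Equations (5)-(7))
    # followed by concatenation of the raw X-Y-Z coordinates (Equation (8)).
    def __init__(self, in_dim, out_dim, num_joints):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Parameter(torch.randn(2 * out_dim))            # trainable coefficient vector a
        self.A = nn.Parameter(torch.rand(num_joints, num_joints))  # learned adjacency (assumption)

    def forward(self, x, coords):
        # x: (N, C, T) joint features; coords: (N, 3, T) raw 3D coordinates.
        N, C, T = x.shape
        h = self.W(x.permute(2, 0, 1))                     # (T, N, D) per-frame projections W x_i(t)
        hi = h.unsqueeze(2).expand(-1, -1, N, -1)          # (T, N, N, D)
        hj = h.unsqueeze(1).expand(-1, N, -1, -1)          # (T, N, N, D)
        pi = F.leaky_relu(torch.cat([hi, hj], dim=-1) @ self.a)        # π(i, j), Equation (6)
        mask = (self.A + torch.eye(N)) > 0                 # neighbours N(i) plus the node itself
        alpha = pi.masked_fill(~mask, float('-inf')).softmax(dim=-1)   # Equation (7)
        z = F.relu(alpha @ h)                              # aggregated features, Equation (5)
        return torch.cat([z.permute(1, 2, 0), coords], dim=1)          # (N, D + 3, T), Equation (8)
```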
The rationale behind our External Space Embedding can be understood from the work of [12], where Shift-GCN illustrates that augmenting skeletal data with spatial coordinates can substantially improve the model’s capacity for geometric representation. Our method adopts a similar principle. However, in contrast to the architecture proposed by [40], we confine the concatenation of the skeleton’s X-Y-Z coordinates with the input features to the entry point of the network’s first block, rather than applying this structure globally. Although this technique involves a less explicit encoding of external spatial position, the Mamba model’s exceptional proficiency in capturing long-range dependencies allows the network to more accurately discern geometric patterns within the human form. This assertion is empirically validated through the experiments detailed in Section 4.4.2.

3.3. Action Perception Model

To develop a Mamba architecture suitable for non-causal data, we integrated the Shift-GCN design with Mamba and proposed the Action Perception Model (APM). On one hand, the model leverages Shift-GCN’s flexible receptive field over spatial and temporal graphs, enabling it to adaptively capture implicit relationships among local joints. On the other hand, it benefits from Mamba’s selective scanning mechanism and dynamic adjustment capabilities, which help model long-range spatio-temporal dependencies within human skeletal structures. The overall structure of the APM is depicted in Figure 2. The APM comprises two main stages: Shift Mamba-GCN Block (SM-GCN) and Spatial–Temporal Mamba Block (ST-Mamba).
Shift Mamba-GCN Block. The Shift-GCN method utilizes non-local shift graph operations, enabling each node to collect information from all other nodes. However, this approach uniformly blends the features across different nodes and artificially equalizes the connection strength between them. Consequently, we have introduced a more effective module known as the Shift Mamba-GCN Block. This module establishes global joint dependency relations via a structural state space model and diminishes the influence of insignificant nodes at each iteration, leading to the heterogeneous mixture modeling of features.
As depicted in Figure 2b, the SM-GCN block represents a fusion of Shift-GCN and Mamba, incorporating Spatial-Mamba (S-Mamba) and Temporal-Mamba (T-Mamba) modules to improve the integration of spatio-temporal data. More precisely, for a given spatial skeleton feature map designated as $F \in \mathbb{R}^{C \times T \times N}$, where $N$ corresponds to the number of skeleton points and $C$ denotes the number of feature channels, the shift displacement for the $i$-th channel is calculated as $i \bmod N$. This feature map is then processed by an S-Shift block, where a Shift operation and a subsequent convolution transform it into a spiral feature map containing global information, as illustrated in Figure 3. Next, these spiral features are input to the S-Mamba module. Within this block, Mamba establishes long-range dependencies at both intra-frame and inter-frame levels. By merging local detailed features from various joints, this process ultimately yields an omnidirectional, mixed spatio-temporal feature. In contrast, the T-Mamba module is designed to aggregate temporal information across the skeleton by inverting the spatial and temporal selection scanning mechanisms.
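A minimal sketch of the non-local shift idea that the S-Shift block builds on (following Shift-GCN [12]) is given below: channel $c$ of the feature map is rolled by $c \bmod N$ along the joint axis, so every joint ends up carrying information from all others. The real block additionally applies a point-wise convolution to the shifted map; the function name is ours.

```python
import torch

def spatial_shift(feat):
    # Non-local spatial shift: roll channel c of feat by (c mod N) joints.
    # feat: (C, T, N) skeleton feature map.
    C, T, N = feat.shape
    out = torch.empty_like(feat)
    for c in range(C):
        out[c] = torch.roll(feat[c], shifts=c % N, dims=-1)  # roll along the joint axis
    return out
```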
The computational process of the S-Mamba block is depicted in Figure 4, and the pseudocode is presented in Algorithm 1. To effectively utilize Mamba’s scanning mechanism and the advantages of hardware-aware parallel algorithms, we reshape the input feature $F_{in} \in \mathbb{R}^{C \times T \times N}$ through a multilayer perceptron (MLP) into tensors $X$ and $Z$. This process is designed to align and concatenate slices from each time step $t$, as shown below:
$$X,\; Z = \mathrm{Conv}_x\!\left(\mathrm{Norm}(F_{in})\right),\; \mathrm{Conv}_z\!\left(\mathrm{Norm}(F_{in})\right). \tag{9}$$
where $\mathrm{Norm}(\cdot)$ denotes layer normalization, and $\mathrm{Conv}_x(\cdot)$ and $\mathrm{Conv}_z(\cdot)$ denote two convolutional layers. Afterwards, $X$ is flattened in three directions, producing $x_0$, $x_1$, and $x_2$. The dimensions of $x_0$ and $x_1$ are $CT \times N$, and those of $x_2$ are $CN \times T$. By reshaping the data along different dimensions, we obtain a new set of feature vectors that help the model process spatial and temporal information jointly, allowing it to capture complex patterns in the data more effectively. Subsequently, these 1D sequences are processed through 1D convolution and SSM blocks for feature extraction, resulting in three outputs: $y_0$, $y_1$, and $y_2$.
$$x_i = \mathrm{Flatten}_i(X), \qquad y_i = \mathrm{SSM}_i\!\left(\mathrm{Conv}_i(x_i)\right), \qquad i = 0, 1, 2. \tag{10}$$
where $\mathrm{Flatten}_i$ denotes the flattening operation along the $i$-th direction and $\mathrm{SSM}_i$ represents the $i$-th SSM block. In Section 3.1, we detailed the computation and discretization of the SSM blocks. Subsequently, we unflatten the outputs of the SSM blocks to obtain $Y_0$, $Y_1$, and $Y_2$. The features are then gated with $Z$ and aggregated to derive the fused output $F_{mid} \in \mathbb{R}^{C \times T \times N}$.
$$Y_i = \mathrm{UnFlatten}_i(y_i), \qquad F_{mid} = \sum_{i=0}^{2} \left( Y_i \cdot \mathrm{SiLU}(Z) \right), \qquad i = 0, 1, 2. \tag{11}$$
where $\mathrm{UnFlatten}_i$ represents the unfolding operation along the $i$-th direction and $\mathrm{SiLU}(\cdot)$ denotes the SiLU activation function. The final output of the module, $F_{out} \in \mathbb{R}^{C \times T \times N}$, can be described as:
$$F_{out} = \mathrm{Conv}_o\!\left(F_{mid}\right). \tag{12}$$
where $\mathrm{Conv}_o(\cdot)$ represents the convolutional layer. In contrast to adaptive time-shifted graph convolution [12] and dynamic offset graph convolution [40], SM-GCN emphasizes the potential features between joints and connections. The comparison with other traditional methods is shown in Figure 5. This allows the network to prioritize important joints after the feature shift transformation. Therefore, this method not only addresses the coarse kernel-space exploration problem caused by dynamic convolution but also enables the Mamba module to effectively process non-causal data. Ultimately, the model can selectively filter relevant information based on the input features of each joint, thereby enhancing overall performance.
Algorithm 1 S-Mamba Block.
Input: $F_{in}$: $(C, T, N)$
Output: $F_{out}$: $(C, T, N)$
  1: $X$: $(C, T, N)$ ← $\mathrm{Conv}_x(\mathrm{LayerNorm}(F_{in}))$
  2: $Z$: $(C, T, N)$ ← $\mathrm{Conv}_z(\mathrm{LayerNorm}(F_{in}))$
  3: for $i = 0$; $i < 3$; $i{+}{+}$ do
  4:    if $i < 2$ then
  5:      $x_i$: $(CT, N)$ ← $\mathrm{Flatten}_i(X)$
  6:      $y_i$: $(CT, N)$ ← $\mathrm{SSM}_i(\mathrm{Conv}_i(x_i))$
  7:    else
  8:      $x_i$: $(CN, T)$ ← $\mathrm{Flatten}_i(X)$
  9:      $y_i$: $(CN, T)$ ← $\mathrm{SSM}_i(\mathrm{Conv}_i(x_i))$
      /* SSM denotes Equation (4), implemented by the selective scan of [21] */
 10:    end if
 11: end for
 12: for $i = 0$; $i < 3$; $i{+}{+}$ do
 13:    $Y_i$: $(C, T, N)$ ← $\mathrm{UnFlatten}_i(y_i)$
 14: end for
 15: $F_{mid}$: $(C, T, N)$ ← $\sum_{i=0}^{2} \left( Y_i \cdot \mathrm{SiLU}(Z) \right)$
 16: $F_{out}$: $(C, T, N)$ ← $\mathrm{Conv}_o(F_{mid})$
return $F_{out}$
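For readers who prefer code, the following PyTorch sketch mirrors the data flow of Algorithm 1. The selective-scan SSM is replaced by an identity placeholder (`selective_scan`), $\mathrm{Conv}_x$, $\mathrm{Conv}_z$, and $\mathrm{Conv}_o$ are taken to be 1×1 convolutions, and we assume the two $(CT, N)$ directions are forward and reverse scans over joints; these are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def selective_scan(seq):
    # Placeholder for the selective-scan SSM of Mamba [21]; identity so the sketch runs.
    # seq: (B, D, L), where L is the scan direction.
    return seq

class SMambaSketch(nn.Module):
    # Rough sketch of the S-Mamba block data flow (Algorithm 1).
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.conv_x = nn.Conv2d(channels, channels, 1)
        self.conv_z = nn.Conv2d(channels, channels, 1)
        self.conv_o = nn.Conv2d(channels, channels, 1)

    def forward(self, f_in):                                   # f_in: (B, C, T, N)
        B, C, T, N = f_in.shape
        h = self.norm(f_in.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        X, Z = self.conv_x(h), self.conv_z(h)
        gate = F.silu(Z)
        # Three flattening directions: x0, x1 scan over joints (CT x N); x2 scans over time (CN x T).
        x0 = X.reshape(B, C * T, N)
        x1 = x0.flip(-1)                                       # assumed reverse joint-direction scan
        x2 = X.permute(0, 1, 3, 2).reshape(B, C * N, T)
        y0 = selective_scan(x0).reshape(B, C, T, N)
        y1 = selective_scan(x1).flip(-1).reshape(B, C, T, N)
        y2 = selective_scan(x2).reshape(B, C, N, T).permute(0, 1, 3, 2)
        f_mid = (y0 + y1 + y2) * gate                          # gated fusion, Equation (11)
        return self.conv_o(f_mid)                              # Equation (12)
```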
Spatial–Temporal Mamba Block. Through the design of the SM-GCN block, the model is able to perform a global feature overview at an early stage and obtain the intrinsic hidden features between different joints. However, due to the inherent characteristics of the GCN method, the SM-GCN network is still insufficient in modeling the spatio-temporal global context, making it difficult to generate high-quality and reliable features. Therefore, we propose a Mamba-based spatio-temporal fusion module, called the ST-Mamba block, which is placed at the end of the APM structure. It fully leverages the excellent processing capability of the GCN in the early stage of the task and the high efficiency of Mamba in extracting distant features. This ensures that the network has strong input adaptability and global information modeling capability while maintaining lower complexity.
As illustrated in Figure 2, the core idea of the ST-Mamba block is to effectively integrate different levels of cross-scale features, thereby capturing various layers of detail and contextual information to enhance the overall feature representation. Consequently, the ST-Mamba block splits the input $x \in \mathbb{R}^{C \times T \times N}$ into two branches after layer normalization.
$$x_1 = \mathrm{Norm}(x) \tag{13}$$
In the first branch, a multi-layer perceptron is used for projection, followed by a convolution and SiLU activation, yielding an intermediate output $X'$. This intermediate output $X'$ is then fed to the state space model (Equation (2)) to generate the output $f$. In the other branch, $x_1$ is propagated through an MLP layer and SiLU activation to obtain a gating factor $f'$, which gates the output $f$. Finally, via an MLP layer and a residual connection, the final output $y \in \mathbb{R}^{C \times T \times N}$ is obtained. The specific computation of the first branch is depicted in Equations (14)–(16):
$$X' = \mathrm{SiLU}\!\left(\mathrm{Conv}\!\left(\mathrm{MLP}_1(x_1)\right)\right) \tag{14}$$
$$\bar{A}, \bar{B}, \bar{C}, \bar{D} = \mathrm{Disc}\!\left(X'\right) \tag{15}$$
$$f = \mathrm{SSM}\!\left(\bar{A}, \bar{B}, C\right)\!\left(X'\right) \tag{16}$$
where LayerNorm denotes layer normalization, SiLU denotes the activation function, $\mathrm{MLP}_1$ stands for the multi-layer perceptron, and Conv denotes the convolution layer. Disc and SSM are as described in Section 3.1. The precise calculation of the second branch is illustrated in Equation (17):
$$f' = \mathrm{SiLU}\!\left(\mathrm{MLP}_2(x_1)\right) \tag{17}$$
where SiLU denotes the activation function, $\mathrm{MLP}_2$ represents the multi-layer perceptron, and $f'$ indicates the gating factor. The final output of the ST-Mamba block is given in Equation (18):
$$y = \mathrm{MLP}_3\!\left(f \odot f'\right) + x \tag{18}$$
where ⊙ represents element-wise multiplication and $\mathrm{MLP}_3$ denotes the multi-layer perceptron. Ultimately, this approach enables the model to capture a broader range of contextual information and strengthens its ability to focus on salient features, thereby enhancing overall performance.
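Analogously, a compact sketch of the ST-Mamba block (Equations (13)–(18)) is given below. The SSM is again an identity placeholder and the three MLPs are modeled as 1×1 convolutions over channels; treat it as an illustration of the two-branch gating structure under these assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ssm_placeholder(x):
    # Stand-in for the discretized selective SSM of Equations (15)-(16).
    return x

class STMambaSketch(nn.Module):
    # Two-branch ST-Mamba block: SSM branch gated by a SiLU-activated branch,
    # followed by an output MLP and a residual connection (Equation (18)).
    def __init__(self, channels, hidden=None):
        super().__init__()
        hidden = hidden or channels
        self.norm = nn.LayerNorm(channels)
        self.mlp1 = nn.Conv2d(channels, hidden, 1)
        self.mlp2 = nn.Conv2d(channels, hidden, 1)
        self.mlp3 = nn.Conv2d(hidden, channels, 1)
        self.conv = nn.Conv2d(hidden, hidden, 3, padding=1)

    def forward(self, x):                                            # x: (B, C, T, N)
        x1 = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)    # Equation (13)
        x_prime = F.silu(self.conv(self.mlp1(x1)))                   # Equation (14)
        f = ssm_placeholder(x_prime)                                 # Equations (15)-(16)
        f_gate = F.silu(self.mlp2(x1))                               # Equation (17)
        return self.mlp3(f * f_gate) + x                             # Equation (18)
```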

4. Experiments

In this section, we first introduce the datasets and experimental settings. Then, we compare our ActionMamba architecture with the current state-of-the-art skeleton-based human action recognition benchmarks, including some recent models based on Mamba, to demonstrate the superior performance of our model. Next, we conduct extensive ablation studies to verify the effectiveness and practicability of the proposed modules.

4.1. Dataset

NTU RGB+D. NTU RGB+D [41] is a prominent and extensively utilized benchmark for assessing skeleton-based action recognition systems. The dataset contains a total of 56,880 sequences of skeletal actions, which are grouped into 60 distinct categories and were performed by 40 different subjects. Each sample portrays a single action, involving a maximum of two individuals. All data was captured concurrently from three separate viewpoints using Microsoft Kinect v2 sensors. For standardized evaluation, the dataset’s creators have established two official protocols: (1) Cross-Subject (X-Sub), where the training set includes sequences from 20 subjects and the test set comprises data from the remaining 20 subjects; and (2) Cross-View (X-View), which uses data from camera perspectives 2 and 3 for training, while data from the first camera perspective is reserved for testing.
NTU-120 RGB+D. As the largest available dataset for human action recognition with three-dimensional joint annotations, NTU RGB+D 120 [42] encompasses 114,480 action samples distributed across 120 categories. Data for these samples were captured from 106 different subjects using three distinct camera viewpoints. The collection consists of 32 unique setups, each defined by a particular location and background environment. The creators have established two benchmark evaluation protocols: (1) Cross-Subject (X-Sub), for which the 106 participants are split equally into training and testing sets of 53 subjects each; and (2) Cross-Setup (X-Setup), where samples for training are selected from even-numbered setup identifiers, and samples for testing are sourced from odd-numbered ones.
UAV–Human. The UAV–Human [43] dataset is a comprehensive, large-scale collection created for analyzing human behavior from aerial viewpoints captured by drones. Its data was gathered over a three-month duration through drone flights in a variety of urban and rural settings, under both daytime and nighttime conditions. This collection features a wide spectrum of variability, including diverse objects, backgrounds, lighting scenarios, weather patterns, occlusions, camera movements, and drone flight orientations. UAV–Human is designed to support several research tasks, such as action recognition, pose estimation, re-identification, and attribute recognition. For our assessment, we adopt the evaluation methodology proposed in [44]. The dataset offers two distinct evaluation benchmarks, Cross-Subject v1 (CSv1) and Cross-Subject v2 (CSv2), both of which utilize a split of 89 subjects for training purposes and 30 subjects for testing.

4.2. Implementation

All experimental procedures were implemented in the PyTorch framework (CUDA 11.8) and executed on an NVIDIA GeForce RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA). For training, we used the Stochastic Gradient Descent (SGD) optimizer with a Nesterov momentum of 0.9 and a weight decay of 0.0002. Our architecture was configured with an 8-layer SM-GCN module and a 2-layer ST-Mamba module. Each skeleton sequence was normalized to a consistent length of 300 frames, with shorter sequences padded with zero frames. During training, a batch size of 16 was used, and the model was trained for a total of 140 epochs. The learning rate was initialized to 0.01 and underwent a linear warmup for the first 10 epochs; it was subsequently reduced by a factor of 10 at the 60th, 80th, and 100th epochs.
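For reproducibility, the optimizer and learning-rate schedule described above can be set up roughly as follows in PyTorch; `model` and the body of the training loop are placeholders.

```python
import torch

# SGD with Nesterov momentum 0.9 and weight decay 0.0002, as described above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                            nesterov=True, weight_decay=0.0002)

def lr_lambda(epoch):
    # Linear warmup over the first 10 epochs, then divide by 10 at epochs 60, 80, 100.
    if epoch < 10:
        return (epoch + 1) / 10
    return 0.1 ** sum(epoch >= m for m in (60, 80, 100))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(140):
    # ... one training epoch with batch size 16 over the skeleton data loader ...
    scheduler.step()
```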

4.3. Comparison with SOTA Methods

To validate the overall performance of our proposed ActionMamba method, we conducted experiments on three datasets: NTU RGB+D, NTU-120 RGB+D, and UAV–Human. Many advanced GCN methods have adopted a multi-stream fusion framework. We also use this framework for comparison.
NTU RGB+D. The experimental results on NTU RGB+D 60 are summarized in Table 1, where we compare ActionMamba with some of the most advanced models (including recent models based on Mamba). The results show that ActionMamba achieves superior performance in the two benchmark tests on the NTU-RGB+D dataset. It reached an accuracy of 91.83% on the X-Sub benchmark, achieving a 0.33% improvement compared to 2s MS-G3D. On the X-View benchmark, ActionMamba achieved an accuracy of 96.87%, a 0.47% improvement over the best-performing STF-Net. Notably, the 2s strategy adopted by ActionMamba garnered results comparable to those of the 4s strategy InfoGCN, fully demonstrating the effectiveness of our designed model. The accuracy results for each category in the NTU-RGB+D 60 X-Sub dataset are shown in Figure 6.
NTU RGB+D 120. ActionMamba’s accuracy on the NTU-RGB+D 120 dataset is equally impressive. As shown in Table 2, ActionMamba achieved accuracies of 86.48% and 88.1% on X-Sub and X-Setup, respectively. Compared to 2s MS-G3D, 2s ActionMamba improved by 0.28% and 0.15%, respectively. Additionally, even when compared to the 4-stream strategy CDGC algorithm, 2s ActionMamba maintained a competitive advantage, effectively validating our design concept.
UAV–Human. Test results on the UAV–Human dataset further substantiate the effectiveness of ActionMamba. As shown in Table 3, ActionMamba outperformed previous algorithms on the UAV–Human dataset. Compared with IDGAN, ActionMamba improves by 0.9% on the CSv1 benchmark and 1.2% on the CSv2 benchmark, effectively demonstrating the practicality and adaptability of our design.

4.4. Ablation Study

To evaluate the effectiveness of our proposed components, we conducted a thorough ablation study on the X-Sub subset of the NTU RGB+D 60 dataset. We used Shift-GCN as the baseline model and systematically modified the network modules to assess the contribution of each individual component.

4.4.1. Effectiveness of Different Components of ActionMamba

We utilized Shift-GCN as the baseline and sequentially integrated ACE, ST-Mamba, and SM-GCN to verify the effectiveness of the algorithm components. As shown in Table 4, ACE enhanced Shift-GCN by 0.18%, while ST-Mamba further improved it by 0.69%. To further demonstrate the performance of our proposed model, we excluded the ST-Mamba module and exclusively incorporated the SM-GCN module on top of ACE. This adjustment led to a 1.73% increase in accuracy for Shift-GCN, effectively confirming the potential of our designed SM-GCN module.

4.4.2. Effectiveness of Action Characteristic Encoder

To evaluate the impact of the Action Characteristic Encoder on overall performance, a series of encoding experiments was designed. After adding ISTE and ESE to Shift-GCN individually, the model achieved slight improvements of 0.11% and 0.07%, respectively, as shown in Table 5. Combining the two sub-modules and adding them to Shift-GCN improved performance by 0.21%. To verify the effectiveness of external space encoding and to provide a comparison for subsequent experiments, ACE modules (i.e., ISTE and ESE applied simultaneously) were added to each stage of Shift-GCN. The experimental results show that external space encoding provides additional human-skeleton information and brings a 0.51% performance improvement. Under the same conditions, adding ACE to our designed ActionMamba improved its performance by 0.57%. This effectively demonstrates that the Mamba structure can capture the encoded information through its state space and exploit it in subsequent feature expansion, confirming the effectiveness and practicality of our designed module.
In addition, it is worth noting that ACE modules were integrated at each stage of ActionMamba, leading to a further performance improvement of 0.01%. This enhancement, although marginal, underscores the cumulative benefit of refined architectural components. Compared to ST-GCN, this observation supports the efficacy of the proposed SM-GCN module in capturing and preserving the relative positional relationships of human skeletal joints across different stages of the feature extraction process. These results validate the effectiveness of our architectural design and highlight the advantages of incorporating Mamba mechanisms within the network.

4.4.3. Complexity Comparison

For a comprehensive evaluation of our method’s inference efficiency relative to established techniques, we conducted a detailed comparative analysis of model complexity, with the results consolidated in Table 6. All models were benchmarked on an NVIDIA GeForce RTX 3090 GPU using the ONNX inference framework. To ensure an equitable comparison, the inference batch size was set to one, and each model underwent 50 warm-up iterations before being timed over 200 inference runs on the same hardware. Our proposed approach exhibits a distinct advantage in both inference time and parameter efficiency, leading to a significant reduction in computational resource demands while preserving a competitive level of accuracy.
Notably, when contrasted with the leading baseline model, 3MFormer, our approach achieves a 75.48% improvement in inference speed while incurring only a minimal reduction in accuracy. Moreover, in a direct comparison with the Mamba-based Simba algorithm, our method is not only faster (18.753 versus 22.846) but also more accurate (91.83% compared to 91.3%). Collectively, these outcomes indicate that the proposed method attains an effective balance between model performance and computational demands, which can be attributed to its carefully crafted, fine-grained architecture.

4.4.4. The Impact of Different Levels of Numbers

To determine how network depth impacts performance, we systematically adjusted the number of SM-GCN and ST-Mamba layers, and evaluated our proposed model on the NTU RGB+D 60 dataset. As shown in Table 7, a larger number of SM-GCN layers typically corresponds to improved performance. This observation implies that deeper network architectures are more capable of capturing complex spatio-temporal relationships. Nevertheless, the gains in performance begin to weaken once the architecture exceeds a certain depth. Specifically, when the number of SM-GCN layers reaches nine, the model exhibits signs of overfitting or gradient vanishing—and this phenomenon results in a slight drop in accuracy. In a concurrent analysis, we carried out a systematic investigation of the ST-Mamba layers, and found that a two-layer setup yielded the best performance. Drawing on these results, we reached the conclusion that an architecture consisting of eight SM-GCN layers and two ST-Mamba layers achieves the optimal balance between the model’s expressive capacity and computational cost. For this reason, this configuration was set as the final architecture for all subsequent experiments.

4.4.5. Confusion Matrix

To conduct a more detailed evaluation of the proposed method’s recognition performance, we performed an in-depth study on 15 action categories known for being highly confusable with similar activities. The confusion matrices presented in Figure 7 illustrate the classification outcomes on the NTU-60 dataset for both Shift-GCN and our methodology. An examination of these matrices reveals that our approach obtains superior recognition accuracy across almost all categories. This enhancement is especially notable in several challenging action classes, highlighted in red, such as the differentiation between “put on shoes” and “take off shoes,” as well as “reading” and “writing”. Figure 8 provides illustrations of these difficult instances, where precise classification is challenging without leveraging temporal data. The critical importance of modeling spatio-temporal information for these particular actions is evident. By aggregating spatio-temporal information, our proposed technique boosts the model’s multimodal learning capacity. This showcases the Mamba algorithm’s advantages in extracting spatio-temporal representation features from video actions and proves the efficacy of the algorithm designed in this paper.

5. Conclusions and Future Work

We present an action spatio-temporal aggregation network termed ActionMamba, which combines the strengths of Graph Convolutional Networks (GCNs) with the Mamba architecture. Through the integration of Mamba into the Shift-GCN framework using shift convolution and a spatio-temporal scanning mechanism, we have developed an enhanced SM-GCN structure. This design enables the model to efficiently utilize spatio-temporal information while maintaining a global receptive field and reducing computational complexity. To further strengthen Mamba’s ability to model long-distance dependencies, we propose the Action Characteristic Encoder, which merges external spatial encoding with intrinsic spatio-temporal encoding. This component enhances the model’s understanding of the human body’s spatial arrangement and helps capture the underlying relationships between non-adjacent joints. Experimental results from assessments on the NTU-RGB+D 60, NTU-RGB+D 120, and UAV–Human benchmark datasets verify that ActionMamba achieves an outstanding balance between performance accuracy and inference speed, even outperforming current Mamba-based methods.
In our future research, we aim to explore the broader applicability of Mamba and other state space models in the field of action recognition. Specifically, we plan to tackle multi-person interaction recognition by employing dynamic feature encoders to model the spatio-temporal dynamics of such interactions, with the goal of achieving effective motion disentanglement in group scenarios. Regarding real-time implementation on low-power hardware, we will focus on model optimization by developing lightweight Mamba variants and adopting dynamic pruning strategies. These improvements are expected to facilitate deployment across a wider range of wearable sensing devices, thereby contributing to advancements in the field of human activity recognition.

Author Contributions

Conceptualization, J.W.; methodology, J.W.; software, J.W.; validation, J.W.; formal analysis, J.W.; investigation, J.W.; resources, J.W.; data curation, J.W.; writing—original draft preparation, J.W.; writing—review and editing, J.W.; visualization, J.W.; supervision, D.L.; project administration, B.Z.; funding acquisition, J.W. and D.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by The special fund for Science and Technology Innovation Teams of Shanxi Province, with the Grant/Award Number: 202304051001030, The Research Project Supported by Shanxi Scholarship Council of China, with the Grant/Award Number: 2022-144, The Shanxi Province Central Guiding Local Science and Technology Development Fund Project, with the Grant/Award Number: YDZJSX2025D039 and The Fundamental Research Program of Shanxi Province, with the Grant/Award Number: 202303021222119.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ding, Z.; Wang, P.; Ogunbona, P.O.; Li, W. Investigation of different skeleton features for cnn-based 3d action recognition. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, 10–14 July 2017; pp. 617–622. [Google Scholar]
  2. Song, J.; Wang, L.; Van Gool, L.; Hilliges, O. Thin-slicing network: A deep structured model for pose estimation in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4220–4229. [Google Scholar]
  3. Xiao, B.; Wu, H.; Wei, Y. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 466–481. [Google Scholar]
  4. Caetano, C.; Brémond, F.; Schwartz, W.R. Skeleton image representation for 3d action recognition based on tree structure and reference joints. In Proceedings of the 2019 32nd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Rio de Janeiro, Brazil, 28–31 October 2019; pp. 16–23. [Google Scholar]
  5. Chen, Y.; Zhang, Z.; Yuan, C.; Li, B.; Deng, Y.; Hu, W. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 13359–13368. [Google Scholar]
  6. Li, Y.; Xia, R.; Liu, X.; Huang, Q. Learning shape-motion representations from geometric algebra spatio-temporal model for skeleton-based action recognition. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 1066–1071. [Google Scholar]
  7. Zhang, P.; Xue, J.; Lan, C.; Zeng, W.; Gao, Z.; Zheng, N. EleAtt-RNN: Adding attentiveness to neurons in recurrent neural networks. IEEE Trans. Image Process. 2019, 29, 1061–1073. [Google Scholar] [CrossRef]
  8. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  9. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3595–3603. [Google Scholar]
  10. Peng, W.; Hong, X.; Chen, H.; Zhao, G. Learning graph convolutional network for skeleton-based human action recognition by neural searching. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 2669–2676. [Google Scholar]
  11. Zhang, X.; Xu, C.; Tao, D. Context aware graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14333–14342. [Google Scholar]
  12. Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 183–192. [Google Scholar]
  13. Song, Y.F.; Zhang, Z.; Shan, C.; Wang, L. Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1474–1488. [Google Scholar] [CrossRef]
  14. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Decoupled spatial-temporal attention network for skeleton-based action-gesture recognition. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  15. Ijaz, M.; Diaz, R.; Chen, C. Multimodal transformer for nursing activity recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2065–2074. [Google Scholar]
  16. Wang, L.; Koniusz, P. 3mformer: Multi-order multi-mode transformer for skeletal action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5620–5631. [Google Scholar]
  17. Duan, H.; Xu, M.; Shuai, B.; Modolo, D.; Tu, Z.; Tighe, J.; Bergamo, A. SkeleTR: Towards Skeleton-based Action Recognition in the Wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 13634–13644. [Google Scholar]
  18. Li, X.; Zhai, W.; Cao, Y. A tri-attention enhanced graph convolutional network for skeleton-based action recognition. IET Comput. Vis. 2021, 15, 110–121. [Google Scholar] [CrossRef]
  19. Zhou, Y.; Yan, X.; Cheng, Z.Q.; Yan, Y.; Dai, Q.; Hua, X.S. Blockgcn: Redefine topology awareness for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2049–2058. [Google Scholar]
  20. Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv 2021, arXiv:2111.00396. [Google Scholar]
  21. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
  22. Zeng, K.; Shi, H.; Lin, J.; Li, S.; Cheng, J.; Wang, K.; Li, Z.; Yang, K. MambaMOS: LiDAR-based 3D Moving Object Segmentation with Motion-aware State Space Model. arXiv 2024, arXiv:2404.12794. [Google Scholar]
  23. Chen, K.; Chen, B.; Liu, C.; Li, W.; Zou, Z.; Shi, Z. Rsmamba: Remote sensing image classification with state space model. IEEE Geosci. Remote Sens. Lett. 2024, 21, 8002605. [Google Scholar] [CrossRef]
  24. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. Vmamba: Visual state space model. arXiv 2024, arXiv:2401.10166. [Google Scholar] [PubMed]
  25. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
  26. Yang, Y.; Xing, Z.; Zhu, L. Vivim: A video vision mamba for medical video object segmentation. arXiv 2024, arXiv:2401.14168. [Google Scholar] [CrossRef]
  27. Xu, J. HC-Mamba: Vision MAMBA with Hybrid Convolutional Techniques for Medical Image Segmentation. arXiv 2024, arXiv:2405.05007. [Google Scholar] [CrossRef]
  28. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12026–12035. [Google Scholar]
  29. Ye, F.; Pu, S.; Zhong, Q.; Li, C.; Xie, D.; Tang, H. Dynamic gcn: Context-enriched topology learning for skeleton-based action recognition. In MM’20: The 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 55–63. [Google Scholar]
  30. Chi, H.g.; Ha, M.H.; Chi, S.; Lee, S.W.; Huang, Q.; Ramani, K. Infogcn: Representation learning for human skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20186–20196. [Google Scholar]
  31. Li, S.; He, X.; Song, W.; Hao, A.; Qin, H. Graph diffusion convolutional network for skeleton based semantic recognition of two-person actions. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 8477–8493. [Google Scholar] [CrossRef]
  32. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  33. Shi, F.; Lee, C.; Qiu, L.; Zhao, Y.; Shen, T.; Muralidhar, S.; Han, T.; Zhu, S.C.; Narayanan, V. Star: Sparse transformer-based action recognition. arXiv 2021, arXiv:2107.07089. [Google Scholar] [CrossRef]
  34. Zhou, L.; Meng, X.; Liu, Z.; Wu, M.; Gao, Z.; Wang, P. Human Pose-based Estimation, Tracking and Action Recognition with Deep Learning: A Survey. arXiv 2023, arXiv:2310.13039. [Google Scholar] [CrossRef]
  35. Tay, Y.; Dehghani, M.; Abnar, S.; Shen, Y.; Bahri, D.; Pham, P.; Rao, J.; Yang, L.; Ruder, S.; Metzler, D. Long range arena: A benchmark for efficient transformers. arXiv 2020, arXiv:2011.04006. [Google Scholar] [CrossRef]
  36. Ma, J.; Li, F.; Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv 2024, arXiv:2401.04722. [Google Scholar]
  37. Li, Z.; Pan, H.; Zhang, K.; Wang, Y.; Yu, F. Mambadfuse: A mamba-based dual-phase model for multi-modality image fusion. arXiv 2024, arXiv:2404.08406. [Google Scholar]
  38. Chaudhuri, S.; Bhattacharya, S. Simba: Mamba augmented U-ShiftGCN for Skeletal Action Recognition in Videos. arXiv 2024, arXiv:2404.07645. [Google Scholar]
  39. Han, D.; Wang, Z.; Xia, Z.; Han, Y.; Pu, Y.; Ge, C.; Song, J.; Song, S.; Zheng, B.; Huang, G. Demystify Mamba in Vision: A Linear Attention Perspective. arXiv 2024, arXiv:2405.16605. [Google Scholar] [CrossRef]
  40. Cheng, K.; Zhang, Y.; He, X.; Cheng, J.; Lu, H. Extremely lightweight skeleton-based action recognition with shiftgcn++. IEEE Trans. Image Process. 2021, 30, 7333–7348. [Google Scholar] [CrossRef]
  41. Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. Ntu rgb+d: A large scale dataset for 3d human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019. [Google Scholar]
  42. Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. Ntu rgb+d 120: A large-scale benchmark for 3d human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701. [Google Scholar] [CrossRef]
  43. Li, T.; Liu, J.; Zhang, W.; Ni, Y.; Wang, W.; Li, Z. Uav-human: A large benchmark for human behavior understanding with unmanned aerial vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16266–16275. [Google Scholar]
  44. Huo, J.; Cai, H.; Meng, Q. Independent Dual Graph Attention Convolutional Network for Skeleton-Based Action Recognition. Neurocomputing 2024, 583, 127496. [Google Scholar] [CrossRef]
  45. Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; Ouyang, W. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 143–152. [Google Scholar]
  46. Zhang, P.; Lan, C.; Zeng, W.; Xing, J.; Xue, J.; Zheng, N. Semantics-guided neural networks for efficient skeleton-based human action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1112–1121. [Google Scholar]
  47. Song, Y.F.; Zhang, Z.; Shan, C.; Wang, L. Stronger, faster and more explainable: A graph convolutional baseline for skeleton-based action recognition. In MM’20: The 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1625–1633. [Google Scholar]
  48. Plizzari, C.; Cannici, M.; Matteucci, M. Spatial temporal transformer network for skeleton-based action recognition. In Pattern Recognition. ICPR International Workshops and Challenges: Virtual Event, 10–15 January 2021; Part III; Springer: Cham, Switzerland, 2021; pp. 694–701. [Google Scholar]
  49. Heidari, N.; Iosifidis, A. Progressive spatio-temporal graph convolutional network for skeleton-based human action recognition. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 3220–3224. [Google Scholar]
  50. Zang, Y.; Yang, D.; Liu, T.; Li, H.; Zhao, S.; Liu, Q. SparseShift-GCN: High precision skeleton-based action recognition. Pattern Recognit. Lett. 2022, 153, 136–143. [Google Scholar] [CrossRef]
  51. Xing, Y.; Zhu, J.; Li, Y.; Huang, J.; Song, J. An improved spatial temporal graph convolutional network for robust skeleton-based action recognition. Appl. Intell. 2023, 53, 4592–4608. [Google Scholar] [CrossRef]
  52. Wu, L.; Zhang, C.; Zou, Y. SpatioTemporal focus for skeleton-based action recognition. Pattern Recognit. 2023, 136, 109231. [Google Scholar] [CrossRef]
  53. He, Z.; Lv, J.; Fang, S. Representation modeling learning with multi-domain decoupling for unsupervised skeleton-based action recognition. Neurocomputing 2024, 582, 127495. [Google Scholar] [CrossRef]
  54. Zhu, Q.; Deng, H. Spatial adaptive graph convolutional network for skeleton-based action recognition. Appl. Intell. 2023, 53, 17796–17808. [Google Scholar] [CrossRef]
  55. Gedamu, K.; Ji, Y.; Gao, L.; Yang, Y.; Shen, H.T. Relation-mining self-attention network for skeleton-based human action recognition. Pattern Recognit. 2023, 139, 109455. [Google Scholar] [CrossRef]
  56. Shao, Y.; Mao, L.; Ye, L.; Li, J.; Yang, P.; Ji, C.; Wu, Z. H2GCN: A hybrid hypergraph convolution network for skeleton-based action recognition. J. King Saud Univ.-Comput. Inf. Sci. 2024, 39, 102072. [Google Scholar] [CrossRef]
  57. Peng, W.; Hong, X.; Zhao, G. Tripool: Graph triplet pooling for 3D skeleton-based action recognition. Pattern Recognit. 2021, 115, 107921. [Google Scholar] [CrossRef]
  58. Miao, S.; Hou, Y.; Gao, Z.; Xu, M.; Li, W. A central difference graph convolutional operator for skeleton-based action recognition. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 4893–4899. [Google Scholar] [CrossRef]
  59. Yang, D.; Wang, Y.; Dantcheva, A.; Garattoni, L.; Francesca, G.; Brémond, F. Unik: A unified framework for real-world skeleton-based action recognition. arXiv 2021, arXiv:2107.08580. [Google Scholar]
  60. Li, T.; Liu, J.; Zhang, W.; Duan, L. Hard-net: Hardness-aware discrimination network for 3d early activity prediction. In Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part XI 16; Springer: Cham, Switzerland, 2020; pp. 420–436. [Google Scholar]
  61. Qiu, Z.; Qiu, K.; Fu, J.; Fu, D. Dgcn: Dynamic graph convolutional network for efficient multi-person pose estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11924–11931. [Google Scholar]
  62. Wang, K.; Deng, H. TFC-GCN: Lightweight Temporal Feature Cross-Extraction Graph Convolutional Network for Skeleton-Based Action Recognition. Sensors 2023, 23, 5593. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Illustration and structure of the selective state space model.
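For reference, the selective scan depicted in Figure 1 follows the standard formulation of [20,21]: a continuous state space model is discretized with a zero-order hold and unrolled as a linear recurrence, and in the selective (Mamba) variant the step size and projections are predicted from the input. The following restatement uses the cited papers' notation and is included only as a summary, not as a new derivation:

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B,$$
$$h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t,$$

where $\Delta$, $B$, and $C$ are input-dependent in the selective variant, allowing the model to retain or discard information token by token.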
Figure 2. An overview of the ActionMamba architecture. (a) presents the workflow for the Action Characteristic Encoder. (b) details the structural design of the SM-GCN Block. (c) visualizes the composition of the ST-Mamba Block.
Figure 3. The process of the Shift Mamba operation. For a skeleton sequence feature $F \in \mathbb{R}^{C \times T \times N}$, only the T and C dimensions are shown here.
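To make the shift idea concrete, the minimal sketch below applies a generic temporal channel shift to a skeleton feature of shape (C, T, N). The channel grouping ratio (`shift_div`) and the one-frame offsets are illustrative assumptions; they do not reproduce the exact shift pattern used by the Shift Mamba operation in the paper.

```python
import torch


def temporal_channel_shift(f: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """Generic temporal channel shift for a skeleton feature f of shape (C, T, N).

    A fraction of the channels is shifted one frame forward in time, another
    fraction one frame backward, and the remaining channels are left unchanged.
    The grouping ratio and single-frame offsets are illustrative assumptions.
    """
    c, t, n = f.shape
    fold = c // shift_div
    out = torch.zeros_like(f)
    # channels [0, fold): take the feature from the previous frame (shift forward in time)
    out[:fold, 1:, :] = f[:fold, :-1, :]
    # channels [fold, 2*fold): take the feature from the next frame (shift backward in time)
    out[fold:2 * fold, :-1, :] = f[fold:2 * fold, 1:, :]
    # remaining channels: identity
    out[2 * fold:, :, :] = f[2 * fold:, :, :]
    return out


# usage: C = 64 channels, T = 32 frames, N = 25 joints
feat = torch.randn(64, 32, 25)
print(temporal_channel_shift(feat).shape)  # torch.Size([64, 32, 25])
```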
Figure 4. This diagram illustrates the architectural design of the Spatial-Mamba block. Within this structure, X and Z signify the intermediate and the gated features, respectively. The block utilizes state space models that operate with three distinct scanning strategies: Spatial forward SSM, Spatial backward SSM, and Temporal SSM. The Temporal-Mamba block shares a very similar architecture with the Spatial-Mamba block; the key distinction is that its SSM scanning approach is modified to consist of temporal forward, temporal backward, and spatial scanning.
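The multi-directional scanning described above can be illustrated with the short sketch below: a (B, T, N, C) skeleton feature is flattened into spatial-forward, spatial-backward, and temporal sequences, each sequence is processed by its own sequence model, and the outputs are fused. An `nn.GRU` stands in for the selective SSM purely for illustration, and the summation-based fusion is an assumption rather than the paper's exact design.

```python
import torch
import torch.nn as nn


class MultiDirectionalScan(nn.Module):
    """Sketch of the three scanning orders used by a Spatial-Mamba-style block:
    spatial forward, spatial backward, and temporal. A GRU replaces the
    selective SSM for illustration only; shapes and summation fusion are assumptions.
    """

    def __init__(self, dim: int):
        super().__init__()
        # one sequence model per scanning direction (placeholder for a selective SSM)
        self.fwd = nn.GRU(dim, dim, batch_first=True)
        self.bwd = nn.GRU(dim, dim, batch_first=True)
        self.tmp = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) -- batch, frames, joints, channels
        b, t, n, c = x.shape
        # spatial forward: scan the joints within each frame
        s_fwd = x.reshape(b * t, n, c)
        y_fwd, _ = self.fwd(s_fwd)
        # spatial backward: reverse the joint order, scan, then restore the order
        s_bwd = torch.flip(s_fwd, dims=[1])
        y_bwd, _ = self.bwd(s_bwd)
        y_bwd = torch.flip(y_bwd, dims=[1])
        # temporal: scan the frames for each joint
        s_tmp = x.permute(0, 2, 1, 3).reshape(b * n, t, c)
        y_tmp, _ = self.tmp(s_tmp)
        # restore the original layout and fuse the three directions by summation
        y_fwd = y_fwd.reshape(b, t, n, c)
        y_bwd = y_bwd.reshape(b, t, n, c)
        y_tmp = y_tmp.reshape(b, n, t, c).permute(0, 2, 1, 3)
        return y_fwd + y_bwd + y_tmp


# usage: batch of 2 clips, 32 frames, 25 joints, 64 channels
x = torch.randn(2, 32, 25, 64)
print(MultiDirectionalScan(64)(x).shape)  # torch.Size([2, 32, 25, 64])
```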
Figure 5. Diagrams of the shift convolution in CNNs (a), Vision Mamba (b), and the shift convolution in Spatial GCNs (c). Our Shift Mamba-GCN is illustrated in (d).
Figure 6. Comparison of the accuracy of each category on the NTU-RGB+D 60 X-Sub dataset, with the horizontal axis representing the category and the vertical axis representing the accuracy.
Figure 7. Confusion matrix comparisons.
Figure 8. Examples of difficult action pairs.
Table 1. Comparison with the state-of-the-art methods on the NTU RGB+D 60 dataset. † represents the result we achieved using the officially released code, reproducing the environment as described in Section 4.2. The gray background indicates the method using Mamba structure.
| Model | Year | Conf. | Param (M) | FLOPs (G) | X-Sub (%) | X-View (%) |
|---|---|---|---|---|---|---|
| ST-GCN [8] | 2018 | AAAI | 3.10 | 16.3 | 81.5 | 88.3 |
| AS-GCN [9] | 2019 | CVPR | 6.99 | 35.5 | 86.8 | 94.2 |
| 2s-AGCN [28] | 2019 | CVPR | 6.94 | 37.3 | 88.5 | 95.1 |
| 2s NAS-GCN [10] | 2020 | AAAI | 13.1 | – | 89.4 | 95.7 |
| CA-GCN [11] | 2020 | CVPR | – | – | 83.5 | 91.4 |
| Js Shift-GCN [12] | 2020 | CVPR | 1.52 | – | 87.8 | 95.1 |
| 2s Shift-GCN [12] | 2020 | CVPR | 3.04 | – | 89.7 | 96.0 |
| Js MS-G3D [45] | 2020 | CVPR | 3.20 | 48.8 | 89.4 | 95.0 |
| 2s MS-G3D [45] | 2020 | CVPR | 6.44 | 98.0 | 91.5 | 96.2 |
| SGN [46] | 2020 | CVPR | 1.8 | 15.4 | 86.6 | 93.4 |
| PA-ResGCN [47] | 2020 | ACM MM | 3.6 | – | 90.9 | 96.0 |
| ST-TR [48] | 2021 | ICPR | 12.10 | 259.4 | 90.3 | 96.1 |
| 2s-PST-GCN [49] | 2021 | ICASSP | 1.84 | – | 88.6 | 95.1 |
| SparseShift-GCN [50] | 2021 | PRL | – | – | 90.9 | 96.6 |
| 2s EDGN [8] | 2022 | CVIU | – | – | 88.7 | 95.2 |
| 2s InfoGCN [30] | 2022 | CVPR | – | – | 90.8 | 95.2 |
| 4s InfoGCN [30] | 2022 | CVPR | – | – | 92.3 | 96.9 |
| IST-GCN [51] | 2023 | AI | 1.62 | 17.54 | 90.8 | 96.2 |
| 2s STF-Net [52] | 2023 | PR | 3.4 | – | 90.8 | 96.2 |
| RMMD [53] | 2023 | NC | – | – | 83.0 | 90.5 |
| SARGCN [54] | 2023 | AI | 1.09 | 5.37 | 88.9 | 94.8 |
| 3Mformer [16] | 2023 | CVPR | 4.37 | 58.45 | 94.8 | 98.7 |
| RSA-Net [55] | 2023 | PR | 3.5 | – | 90.9 | 95.3 |
| H2GCN [56] | 2024 | JKSU | – | – | 88.9 | 94.8 |
| Js Simba [38] | 2024 | – | – | – | 91.3 | 96.1 |
| Bs Simba [38] | 2024 | – | – | – | 88.48 | 93.41 |
| Js ActionMamba | Ours | – | 1.9 | 5.94 | 89.16 | 94.92 |
| Bs ActionMamba | Ours | – | 1.9 | 5.94 | 88.79 | 94.18 |
| 2s ActionMamba | Ours | – | 3.8 | 12.73 | 91.83 | 96.67 |
Table 2. Comparison with the state-of-the-art methods on the NTU RGB+D 120 dataset. † represents the result we achieved using the officially released code, reproducing the environment as described in Section 4.2. The gray background indicates the method using Mamba structure.
| Model | X-Sub (%) | X-Setup (%) |
|---|---|---|
| ST-GCN [8] | 70.7 | 73.2 |
| Js Tripool [57] | 80.1 | 82.8 |
| AS-GCN [9] | 77.9 | 78.5 |
| 2s AGCN [28] | 82.5 | 84.2 |
| 4s CDGC [58] | 86.3 | 87.8 |
| Shift-GCN [12] | 85.9 | 87.6 |
| 2s MS-G3D [45] | 86.2 | 88.0 |
| SGN [46] | 79.2 | 81.5 |
| 2s UNIK [59] | 80.8 | 86.5 |
| 2s ST-TR [48] | 85.1 | 87.1 |
| 2s STF-Net [52] | 84.9 | 87.7 |
| Js Simba [38] | 79.75 | 86.28 |
| 2s ActionMamba | 86.48 | 88.15 |
Table 3. Classification accuracy comparison against state-of-the-art methods on the UAV–Human dataset. The symbol † indicates that the results were obtained using the officially released codes for our implementation. The gray background indicates the method using Mamba structure.
| Model | CSv1 (%) | CSv2 (%) |
|---|---|---|
| ST-GCN [8] | 30.3 | 56.1 |
| 2s-AGCN [28] | 34.8 | 66.7 |
| HARD-Net [60] | 37.0 | – |
| Shift-GCN [12] | 38.0 | 67.0 |
| DGCN [61] | 29.9 | – |
| TFC-GCN [62] | 39.6 | 64.7 |
| IDGAN [44] | 43.4 | 68.3 |
| ActionMamba | 44.3 | 69.5 |
Table 4. Ablation analysis of several ActionMamba components on the NTU RGB+D 60 dataset. The ↑ represents the numerical increase in the experimental results relative to the baseline results.
| Model | X-Sub (%) |
|---|---|
| Shift-GCN | 89.70 |
| Shift-GCN+ACE | 89.88 (↑0.18) |
| Shift-GCN+ACE+ST-Mamba | 90.39 (↑0.69) |
| Shift-GCN+ACE+SM-GCN | 91.43 (↑1.73) |
| ActionMamba | 91.83 (↑2.13) |
Table 5. Effectiveness of the Action Characteristic Encoder. The symbol * signifies that the module is placed in each layer of the backbone. The ↑ represents the numerical increase in the experimental results relative to the baseline results.
| Model | X-Sub (%) |
|---|---|
| Shift-GCN | 89.70 |
| Shift-GCN+ISTE | 89.81 (↑0.11) |
| Shift-GCN+ESE | 89.77 (↑0.07) |
| Shift-GCN+ACE | 89.91 (↑0.21) |
| Shift-GCN+ACE * | 90.21 (↑0.51) |
| ActionMamba w/o ACE | 91.36 (↑1.66) |
| ActionMamba | 91.83 (↑2.13) |
| ActionMamba+ACE * | 91.84 (↑2.14) |
Table 6. Complexity comparison of the models.
| Model | FLOPs (G) | X-Sub (%) | Inference (ms) |
|---|---|---|---|
| ST-TR | 259.4 | 90.3 | 29.446 |
| 2s InfoGCN | – | 90.8 | 33.736 |
| 4s InfoGCN | – | 92.3 | 49.881 |
| IST-GCN | 17.54 | 90.8 | 23.825 |
| 3Mformer | 58.45 | 94.8 | 76.482 |
| Js Simba | – | 91.3 | 22.846 |
| Js ActionMamba | 5.94 | 89.70 | 16.224 |
| 2s ActionMamba | 12.73 | 91.83 | 18.753 |
Table 7. Impact of the number of SM-GCN/ST-Mamba layers on model performance.
| Method | SM-GCN | ST-Mamba | FLOPs (G) | X-Sub (%) |
|---|---|---|---|---|
| ActionMamba | 5 | 1 | 8.92 | 87.41 |
| ActionMamba | 6 | 1 | 10.18 | 89.93 |
| ActionMamba | 7 | 1 | 11.34 | 90.87 |
| ActionMamba | 8 | 1 | 12.41 | 91.61 |
| ActionMamba | 9 | 1 | 13.28 | 91.38 |
| ActionMamba | 8 | 2 | 12.73 | 91.83 |
| ActionMamba | 8 | 3 | 13.17 | 91.74 |
| ActionMamba | 8 | 4 | 13.54 | 90.91 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
