Bidirectional Mamba-Enhanced 3D Human Pose Estimation for Accurate Clinical Gait Analysis

Wang, Chengjun; Su, Wenhang; Li, Jiabao; Xu, Jiahang

doi:10.3390/fractalfract9090603

¹ School of Artificial Intelligence, Anhui University of Science and Technology, Huainan 232001, China

² School of Artificial Intelligence, Anhui Polytechnic University, Wuhu 241000, China

³ Anhui Artificial Intelligence Laboratory, Artificial Intelligence Research Institute, Hefei Comprehensive National Science Center, Hefei 230022, China

⁴ School of Mechatronics Engineering, Anhui University of Science and Technology, Huainan 232001, China

* Author to whom correspondence should be addressed.

Fractal Fract. 2025, 9(9), 603; https://doi.org/10.3390/fractalfract9090603

Submission received: 7 August 2025 / Revised: 1 September 2025 / Accepted: 12 September 2025 / Published: 17 September 2025

(This article belongs to the Special Issue Fractional and Fractal Methods in Biomedical Imaging and Time Series Learning)

Abstract

Three-dimensional human pose estimation from monocular video remains challenging for clinical gait analysis due to high computational cost and the need for temporal consistency. We present Pose3DM, a bidirectional Mamba-based state-space framework that models intra-frame joint relations and inter-frame dynamics with linear computational complexity. Replacing transformer self-attention with state-space modeling improves efficiency without sacrificing accuracy. We further incorporate fractional-order total-variation regularization to capture long-range dependencies and memory effects, enhancing temporal and spatial coherence in gait dynamics. On Human3.6M, Pose3DM-L achieves 37.9 mm MPJPE under Protocol 1 (P1) and 32.1 mm P-MPJPE under Protocol 2 (P2), with 127 M MACs per frame and 30.8 G MACs in total. Relative to MotionBERT, P1 and P2 errors decrease by 3.3% and 2.4%, respectively, with 82.5% fewer parameters and 82.3% fewer MACs per frame. Compared with MotionAGFormer-L, Pose3DM-L improves P1 by 0.5 mm and P2 by 0.4 mm while using 60.6% less computation: 30.8 G vs. 78.3 G total MACs and 127 M vs. 322 M per frame. On AUST-VisGait across six gait patterns, Pose3DM consistently yields lower MPJPE, standard error, and maximum error, enabling reliable extraction of key gait parameters from monocular video. These results highlight state-space models as a cost-effective route to real-time gait assessment using a single RGB camera.

Keywords: monocular camera; human pose estimation; state-space models; bidirectional Mamba; gait analysis

1. Introduction

Three-dimensional human pose estimation (3D HPE) from monocular cameras is a fundamental task in the field of computer vision with broad applications in various domains such as action recognition [1], virtual reality [2], and autonomous driving [3,4]. In recent years, significant advancements in deep learning have led to substantial improvements in the accuracy, real-time performance, and temporal consistency of 3D HPE models [5,6,7]. This 3D HPE technology is now being applied in healthcare, particularly in assessing joint movement and conducting pathological analyses [8,9,10].

Clinical gait analysis plays a central role in assessing joint function, diagnosing gait disorders, and monitoring rehabilitation progress. It quantifies human locomotion using spatiotemporal and kinematic/kinetic measures [8]. Traditional practice relies on high-precision systems—multi-camera optical motion capture with reflective markers [11,12] and wearable inertial sensors [13]. Although accurate, these systems are expensive, require complex setup and calibration, and have limited portability. Recent markerless methods estimate 3D human pose from monocular video, enabling scalable and cost-effective clinical assessments [8,9,10].

Despite the inherent advantages of markerless monocular 3D HPE, applying this technology to achieve accurate and reliable clinical gait analysis still presents significant challenges. Specifically, achieving the high accuracy required for precise gait parameter measurement from estimated 2D joint positions remains difficult. Small inaccuracies in 2D detection or in the subsequent 2D-to-3D lifting process can lead to significant errors in estimated 3D joint positions and downstream gait analysis. Furthermore, many video-based 3D HPE models using self-attention mechanisms [5,14,15] are computationally inefficient when processing long video sequences. They can also struggle to maintain sufficient temporal consistency, which often leads to unrealistic or jittery pose sequences. This lack of smooth and physiologically plausible motion estimation hinders their utility in analyzing dynamic movements. While some markerless methods have been explored for clinical applications [16], there is an urgent need for more efficient and robust models. These models should provide high accuracy in 3D joint localization and enhance temporal consistency for reliable clinical gait analysis.

The computational bottleneck hinders progress and highlights the need for better accuracy and temporal consistency in clinical gait analysis. These factors drive the search for more efficient sequence modeling architectures. Recent work on video-based 3D HPE highlights the promise of Transformers. Their self-attention captures spatiotemporal dependencies in pose sequences [5,14,15]. However, applying attention mechanisms to long video sequences in comprehensive gait assessments creates a substantial computational burden. This is due to the quadratic complexity of attention relative to sequence length. This limitation restricts the use of such models in resource-constrained clinical settings and for real-time processing of continuous gait data.

Recent advances in state-space models (SSMs) [17,18,19] offer a clear path forward. Among these, Mamba [19] is particularly promising. It achieves linear computational complexity for sequence processing, a significant improvement over the quadratic complexity of attention-based approaches. Mamba builds on a structured state-space formulation, and an input-dependent selection module further tailors the dynamics to the input, enabling effective modeling of complex sequence dependencies. Its dual strengths, computational efficiency and rich context modeling, help produce temporally coherent 3D pose estimates. Temporal coherence is a prerequisite for reliable clinical gait analysis.

In this paper, we propose a novel framework for gait analysis called Pose 3D Human Estimation from Mamba (Pose3DM). This framework aims to provide accurate and temporally coherent 3D human pose estimations from monocular video, specifically tailored for clinical gait analysis (Figure 1). We build on the Mamba state-space model and develop pose-specific bidirectional blocks for joint–time modeling. This task-adapted design captures complex spatiotemporal dependencies with linear complexity and improves computational efficiency over attention-based baselines. The proposed framework is structured into two main components: pose estimation and clinical parameter extraction. Specifically, the pose estimation component uses our pose-specific bidirectional Mamba blocks together with fractional-order total-variation (FTV) regularization to produce temporally coherent, high-fidelity 3D poses. These poses are analyzed to extract essential clinical gait parameters, facilitating reliable quantitative assessments of gait patterns in clinical settings.

In summary, the contributions of this paper are as follows:

Pose3DM uses pose-specific bidirectional Mamba blocks to model cross-joint dependencies and inter-frame dynamics with linear complexity. This reduces memory and latency compared with attention and enables longer temporal context and higher frame rates.
We add fractional-order total-variation (FTV) regularization to the loss. It improves temporal coherence and spatial smoothness, preserves physiologically plausible discontinuities, and is more robust under occlusion and noise.
We build a monocular, state-space (Mamba) clinical gait-analysis pipeline. It converts 3D poses into standard spatiotemporal and kinematic parameters and supports objective assessment across diverse gait patterns.

2. Related Work

2.1. Three-Dimensional Human Pose Estimation

Three-dimensional HPE methods can be classified by input modality into multiview [20,21] and monocular approaches [22,23], and categorized by estimation strategy into direct 3D [24] and 2D-to-3D lifting methods [25,26]. Multiview methods offer high accuracy but require complex multi-camera setups, increasing cost and complicating deployment. Monocular 2D-to-3D lifting has become the dominant paradigm; it leverages robust 2D detectors [27] and is widely used in clinical gait analysis.

Within 2D-to-3D lifting, architectures differ in how they capture spatiotemporal dependencies. Early work uses temporal convolutional networks (TCNs) [28,29] to model local temporal patterns and graph convolutional networks (GCNs) [30,31] to exploit the human-joint graph, but both approaches struggle with intricate, nonlocal correlations across time and joints [32]. More recently, transformer-based models [5,14,15,33,34,35] have achieved state-of-the-art accuracy by capturing long-range dependencies via self-attention, but self-attention scales quadratically with sequence length [36], constraining efficiency and memory in real-time or resource-constrained scenarios.

Table 1 qualitatively summarizes these method categories and the trade-off between modeling capacity and computational efficiency. An effective monocular 2D-to-3D lifting method must capture long-range spatiotemporal dependencies while operating with linear-time complexity; these requirements motivate the state-space approach described in Section 2.2.

2.2. State Space Models

State Space Models (SSMs) have recently re-emerged as a powerful paradigm for efficient sequence modeling, offering an alternative to dominant architectures like Transformers. Building on foundational work such as S4 [17,37] and S4D [18], these models leverage linear recurrent mechanisms for processing sequences with computational efficiency, particularly for long sequences. A significant development is Mamba [19], which incorporates content-aware selectivity and hardware-aware algorithms to achieve efficient long-range dependency modeling with linear complexity, positioning it as a strong and scalable alternative to Transformers.

Following Mamba’s success, explorations into its application in computer vision have emerged, such as Vision Mamba [38] and VMamba [39]. These adapt SSMs for grid-like image data, but their designs may not optimally leverage the structural properties of non-grid data like human pose sequences.

Initial investigations into applying SSM variants for 3D human pose estimation have also been explored [40,41,42]. Prior work ranges from simpler SSMs that trade some accuracy for efficiency [40] to Mamba-based or pose-specific SSMs [41,42]. These approaches represent early attempts to leverage SSM efficiency. However, they often fail to capture the complex, coupled, bidirectional spatiotemporal dynamics of human pose, due to simplifications, hybrid designs, or pose-agnostic adaptations.

To address these limitations and effectively model the intricate spatio-temporal dependencies in human pose sequences while maintaining computational efficiency, this paper introduces Pose3DM. Our framework features a novel Pose Block built upon a specially designed bidirectional Mamba architecture within its Spatial and Temporal Mamba Blocks, offering a tailored solution for efficient and effective 3D HPE.

2.3. Gait Analysis

Gait analysis, the systematic study of human locomotion, is of critical importance in clinical diagnostics, rehabilitation, and biomechanics [43,44]. Established quantitative gait analysis techniques, including marker-based optical motion capture [11,12] and wearable inertial sensors [13], provide accurate kinematic data. However, their operational costs are high, and they require specialized laboratory environments with complex setup procedures. These barriers hinder widespread clinical integration [45]. Early vision-based approaches using 2D image features or 2D joint tracking improved accessibility, but they were limited by viewpoint dependence and the inability to reconstruct full 3D kinematics, which restricted their clinical utility [46].

The advent of 3D HPE has significantly advanced the field of markerless gait analysis. Recent methods increasingly leverage 3D HPE to extract 3D joint coordinates directly from video footage [8,9,10]. A key characteristic of this modern paradigm is its ability to facilitate the derivation of comprehensive 3D kinematic parameters, including joint angles, step length, and stance/swing phase durations, which are critical for in-depth clinical assessment [9,10].

However, the practical efficacy and reliability of current 3D HPE-driven gait analysis are substantially dependent on the quality of the input 3D pose sequences. Key challenges are (i) 3D joint localization errors that inflate derived gait-parameter errors, and (ii) insufficient temporal coherence in pose estimates, which corrupts dynamic-movement analysis [44]. Our work, by proposing Pose3DM for enhanced 3D HPE, aims to address these limitations directly. By delivering more accurate and temporally stable 3D pose estimates, Pose3DM can provide a more robust foundation for downstream gait analysis, thereby enabling more reliable and precise clinical assessments from markerless video data.

3. Method

This section details our proposed Pose3DM model for 3D human pose estimation and its key components, followed by its application within our comprehensive gait analysis framework. As illustrated in Figure 1, the overall framework begins with synchronized data acquisition using a monocular RGB camera for 2D video and a motion capture system for ground-truth 3D pose. Subsequently, 2D human skeletons are extracted from the video frames using a pretrained ResNet-based 2D pose estimation model. The Pose3DM network then processes these extracted 2D human skeletons, transforming them into estimated 3D poses using its internal bidirectional Mamba architecture. Finally, these 3D poses are utilized for detailed gait analysis. We further compute clinically relevant joint angles (hip, knee, ankle) from the estimated 3D joint coordinates via a standard three-point angle, as schematically illustrated in the lower-left subpanel of Figure 1, and extract key gait parameters to facilitate the understanding of movement patterns.

3.1. Pose3DM Network Architecture

To effectively model the complex spatiotemporal dependencies inherent in 3D human pose sequences, we introduce Pose3DM, an encoder–decoder network (Figure 2a) that transforms a sequence of 2D joint detections into a sequence of 3D poses. This structure progressively refines feature representations to produce accurate 3D estimations while preserving temporal coherence.

Given a monocular video with $T$ frames and $J$ joints per frame, the input is $X_{2D} \in \mathbb{R}^{T \times J \times 2}$ and the target is $X_{3D} \in \mathbb{R}^{T \times J \times 3}$. Pose3DM learns a mapping $f_{\theta} : \mathbb{R}^{T \times J \times 2} \to \mathbb{R}^{T \times J \times 3}$ such that

$$\hat{X}_{3D} = f_{\theta}(X_{2D}), \tag{1}$$

where $\theta$ denotes all learnable parameters, $\hat{X}_{3D} \in \mathbb{R}^{T \times J \times 3}$ is the predicted 3D joint tensor (and $X_{3D}$ is the ground truth), and the slice $\hat{X}_{3D}(t, j, :) = (\hat{x}_{t,j}, \hat{y}_{t,j}, \hat{z}_{t,j})$ gives the coordinates of joint $j$ at frame $t$ (in mm), with $t \in \{1, \dots, T\}$ and $j \in \{1, \dots, J\}$.

The input 2D coordinates are projected to a $d_m$-dimensional feature space by a shared linear embedding applied at each $(t, j)$:

$$X^{(0)} = \phi(X_{2D} W_{\mathrm{emb}}) + P_t + P_j, \tag{2}$$

where $W_{\mathrm{emb}} \in \mathbb{R}^{2 \times d_m}$ is the embedding matrix, $\phi$ is a pointwise nonlinearity (SiLU), $P_t \in \mathbb{R}^{T \times 1 \times d_m}$ encodes temporal order, and $P_j \in \mathbb{R}^{1 \times J \times d_m}$ encodes joint identity. For continuity with prior text, we also denote $X_0 \equiv X^{(0)}$.
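The embedding in Equation (2) can be illustrated with a minimal NumPy sketch. The joint count `J = 17`, the width `d_m = 64`, and the random initial values are illustrative assumptions and are not taken from the paper; only the shapes and the broadcasting pattern follow the text.

```python
import numpy as np

def silu(x):
    """SiLU nonlinearity: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def embed(x2d, w_emb, p_t, p_j):
    """Eq. (2): shared linear embedding plus temporal and joint encodings."""
    # (T, J, 2) @ (2, d_m) -> (T, J, d_m); the same W_emb is applied at every (t, j).
    h = silu(x2d @ w_emb)
    # p_t is (T, 1, d_m) and p_j is (1, J, d_m); each broadcasts over its singleton axis.
    return h + p_t + p_j

rng = np.random.default_rng(0)
T, J, d_m = 243, 17, 64                      # J and d_m are assumed values
x2d = rng.standard_normal((T, J, 2))         # stand-in for detected 2D keypoints
w_emb = rng.standard_normal((2, d_m)) * 0.1
p_t = rng.standard_normal((T, 1, d_m)) * 0.02
p_j = rng.standard_normal((1, J, d_m)) * 0.02

x0 = embed(x2d, w_emb, p_t, p_j)
print(x0.shape)  # (243, 17, 64)
```

Because $P_t$ and $P_j$ each carry a singleton axis, the two encodings add only $(T + J)\,d_m$ parameters rather than $T \cdot J \cdot d_m$.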

Encoder–decoder structure: The embedded features $X^{(0)}$ are processed by a symmetric encoder–decoder with three stages on each side. Each encoder stage first applies a stack of $N_B$ Pose Blocks to enlarge the spatiotemporal receptive field, then uses a $1 \times 1$ channel projection to increase feature width while keeping the $(T, J)$ resolution:

$$Z^{(l)} = \mathrm{PB}^{(N_B)}\big(X^{(l-1)}\big), \quad l = 1, 2, 3, \tag{3}$$

$$X^{(l)} = \phi\big(Z^{(l)} W_{\mathrm{enc}}^{(l)}\big), \quad W_{\mathrm{enc}}^{(l)} \in \mathbb{R}^{d_m^{l} \times d_m^{l+1}}, \tag{4}$$

where $\mathrm{PB}^{(N_B)}(\cdot)$ denotes the sequential application of $N_B$ Pose Blocks, $Z^{(l)} \in \mathbb{R}^{T \times J \times d_m^{l}}$ is the Pose-Block output at stage $l$, $W_{\mathrm{enc}}^{(l)}$ is a learnable $1 \times 1$ channel projection that increases the feature width from $d_m^{l}$ to $d_m^{l+1}$, $X^{(l)} \in \mathbb{R}^{T \times J \times d_m^{l+1}}$ is the encoder output at stage $l$ after activation $\phi$, and $d_m^{l}$ denotes the channel width at stage $l$. The outputs $\{X^{(1)}, X^{(2)}, X^{(3)}\}$ are cached for skip connections.

Each decoder stage mirrors the encoder: a $1 \times 1$ projection reduces channels, the result is fused with the corresponding encoder feature via a skip connection, and a stack of $N_B$ Pose Blocks further refines the representation:

$$\tilde{Y}^{(r)} = \phi\big(Y^{(r-1)} W_{\mathrm{dec}}^{(r)}\big), \quad W_{\mathrm{dec}}^{(r)} \in \mathbb{R}^{d_m^{5-r} \times d_m^{4-r}}, \quad r = 1, 2, 3, \tag{5}$$

$$U^{(r)} = \big(\tilde{Y}^{(r)} \oplus X^{(4-r)}\big) W_{\mathrm{mrg}}^{(r)}, \quad W_{\mathrm{mrg}}^{(r)} \in \mathbb{R}^{(d_m^{4-r} + d_m^{5-r}) \times d_m^{4-r}}, \tag{6}$$

$$Y^{(r)} = \mathrm{PB}^{(N_B)}\big(U^{(r)}\big), \tag{7}$$

where $\oplus$ denotes channelwise concatenation and $Y^{(0)} = X^{(3)}$; $Y^{(r-1)} \in \mathbb{R}^{T \times J \times d_m^{5-r}}$ is the input to decoder stage $r$; $W_{\mathrm{dec}}^{(r)}$ is a learnable $1 \times 1$ projection reducing channels from $d_m^{5-r}$ to $d_m^{4-r}$; $\tilde{Y}^{(r)} \in \mathbb{R}^{T \times J \times d_m^{4-r}}$ is the projected tensor; $X^{(4-r)} \in \mathbb{R}^{T \times J \times d_m^{5-r}}$ is the skip feature from the mirror encoder stage; $W_{\mathrm{mrg}}^{(r)}$ fuses the concatenation so that $U^{(r)} \in \mathbb{R}^{T \times J \times d_m^{4-r}}$; and $\mathrm{PB}^{(N_B)}(\cdot)$ denotes the sequential application of $N_B$ Pose Blocks, yielding $Y^{(r)} \in \mathbb{R}^{T \times J \times d_m^{4-r}}$.

Output projection: The final decoder tensor $Y^{(3)} \in \mathbb{R}^{T \times J \times d_m^{1}}$ is mapped to 3D by a linear head $W_{\mathrm{out}} \in \mathbb{R}^{d_m^{1} \times 3}$:

$$\hat{X}_{3D}(t, j, :) = Y^{(3)}(t, j, :)\, W_{\mathrm{out}}. \tag{8}$$
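The channel bookkeeping of Equations (3)-(8) is easy to get wrong, so the following sketch traces tensor widths through the three encoder and three decoder stages. It is only a shape check under stated assumptions: the Pose Blocks are replaced by an identity placeholder, the weights are random, and the widths `[32, 64, 128, 256]` and `J = 17` are hypothetical values not given in the text.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def pose_blocks(x):
    """Placeholder for PB^(N_B): width-preserving, here simply the identity."""
    return x

def encoder_decoder(x0, widths, rng):
    """Trace Eqs. (3)-(8) with random weights; widths = [d^1, d^2, d^3, d^4]."""
    d = {l + 1: w for l, w in enumerate(widths)}
    x, skips = x0, {}
    for l in (1, 2, 3):
        z = pose_blocks(x)                                        # Eq. (3)
        w_enc = rng.standard_normal((d[l], d[l + 1])) * 0.1
        x = silu(z @ w_enc)                                       # Eq. (4)
        skips[l] = x                                              # cache X^(l)
    y = x                                                         # Y^(0) = X^(3)
    for r in (1, 2, 3):
        w_dec = rng.standard_normal((d[5 - r], d[4 - r])) * 0.1
        y_tilde = silu(y @ w_dec)                                 # Eq. (5)
        cat = np.concatenate([y_tilde, skips[4 - r]], axis=-1)    # channelwise concat
        w_mrg = rng.standard_normal((d[4 - r] + d[5 - r], d[4 - r])) * 0.1
        y = pose_blocks(cat @ w_mrg)                              # Eqs. (6)-(7)
    w_out = rng.standard_normal((d[1], 3)) * 0.1
    return y @ w_out                                              # Eq. (8)

rng = np.random.default_rng(0)
T, J = 243, 17                       # J is an assumed joint count
widths = [32, 64, 128, 256]          # hypothetical stage widths
x0 = rng.standard_normal((T, J, widths[0]))
x3d = encoder_decoder(x0, widths, rng)
print(x3d.shape)  # (243, 17, 3)
```

With the indexing as written, the first decoder stage concatenates its projected input with $X^{(3)}$, which is also $Y^{(0)}$, so the merge matrix at $r = 1$ has $d_m^{3} + d_m^{4}$ input channels.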

Residual formulation of Pose Blocks: Let $x \in \mathbb{R}^{T \times J \times d}$ be the input to a Pose Block and let $F(\cdot)$ denote the composite Spatial/Temporal Mamba operator; the block output is

$$y = F(x) + W_{\mathrm{skip}}\, x, \tag{9}$$

where $F$ preserves the $(T, J)$ resolution and operates per site in channel space, $W_{\mathrm{skip}} \in \mathbb{R}^{d \times d}$ is a per-site linear projection (implemented as a $1 \times 1$ convolution) used to match channels when needed, and $x, y \in \mathbb{R}^{T \times J \times d}$ share the same shape so that the Pose Block is width-preserving. In the encoder, $d = d_m^{l}$ and $Z^{(l)} = \mathrm{PB}(X^{(l-1)})$ in (3); in the decoder, $d = d_m^{4-r}$ and $Y^{(r)} = \mathrm{PB}(U^{(r)})$ in (7). This matches Figure 2a, where solid arrows denote the main path $F$ and dashed arrows denote the skip path.

3.2. Pose Block

The Pose Block, illustrated in Figure 2b, is the fundamental building unit of the Pose3DM network, designed to comprehensively capture both spatial relationships between joints within a frame and temporal dynamics across frames. Each Pose Block sequentially processes its input features through two specialized components: a Spatial Mamba Block and a Temporal Mamba Block. This sequential arrangement first models intra-frame joint correlations and then models how these spatial configurations evolve over time.

Mamba Block Structure: Both the Spatial Mamba Block and the Temporal Mamba Block share an identical internal structure. This common architecture processes input features $F_{\mathrm{in}}$ first through Layer Normalization, followed by our novel Bidirectional Mamba module. The original input $F_{\mathrm{in}}$ is then added to the module’s output via a residual connection. This sum undergoes another Layer Normalization and is subsequently processed by a Multilayer Perceptron (MLP), again followed by a residual connection. These operations are expressed as:

$$F_{\mathrm{bimamba}} = \mathrm{BidirectionalMamba}(\mathrm{Norm}(F_{\mathrm{in}})) + F_{\mathrm{in}}, \tag{10}$$

$$F_{\mathrm{out}} = \mathrm{MLP}(\mathrm{Norm}(F_{\mathrm{bimamba}})) + F_{\mathrm{bimamba}}, \tag{11}$$

where $F_{\mathrm{bimamba}} \in \mathbb{R}^{T \times J \times d}$ denotes the intermediate residual output of the Bidirectional Mamba sub-block (i.e., $\mathrm{BidirectionalMamba}(\mathrm{Norm}(F_{\mathrm{in}})) + F_{\mathrm{in}}$), and $F_{\mathrm{out}} \in \mathbb{R}^{T \times J \times d}$ is the final output of the Mamba Block. The Spatial and Temporal variants differ only in the scan dimension of the Bidirectional Mamba module: along joints $J$ for the Spatial block and along frames $T$ for the Temporal block.
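The pre-norm residual pattern of Equations (10) and (11), and the Spatial-then-Temporal ordering inside a Pose Block, can be sketched as follows. The `mean_mixer` and the two-layer `mlp` are hypothetical stand-ins chosen only so the example runs; they are not the paper's Bidirectional Mamba module or MLP, and the LayerNorm here omits learnable scale and shift.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm over the channel axis (learnable scale and shift omitted)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mamba_block(f_in, mixer, mlp, scan_axis):
    """Eqs. (10)-(11): pre-norm residual mixer, then pre-norm residual MLP."""
    f_bimamba = mixer(layer_norm(f_in), scan_axis) + f_in     # Eq. (10)
    return mlp(layer_norm(f_bimamba)) + f_bimamba             # Eq. (11)

def pose_block(x, mixer, mlp, w_skip):
    """Spatial block, then Temporal block, plus the skip path of Eq. (9)."""
    f = mamba_block(x, mixer, mlp, scan_axis=1)   # Spatial: mix along joints J
    f = mamba_block(f, mixer, mlp, scan_axis=0)   # Temporal: mix along frames T
    return f + x @ w_skip

def mean_mixer(h, axis):
    """Hypothetical stand-in for BidirectionalMamba: averages along the scan axis."""
    return np.broadcast_to(h.mean(axis=axis, keepdims=True), h.shape)

rng = np.random.default_rng(0)
T, J, d = 243, 17, 32                 # assumed sizes
w1 = rng.standard_normal((d, 2 * d)) * 0.1
w2 = rng.standard_normal((2 * d, d)) * 0.1
mlp = lambda h: np.tanh(h @ w1) @ w2  # stand-in MLP; the activation is an assumption

x = rng.standard_normal((T, J, d))
y = pose_block(x, mean_mixer, mlp, w_skip=np.eye(d))
print(y.shape)  # (243, 17, 32)
```

Passing the scan axis as an argument mirrors the statement that the Spatial and Temporal variants share one structure and differ only in the dimension being scanned.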

Bidirectional Mamba Module: The core of each Mamba Block is the Bidirectional Mamba module (Figure 2c). This module is engineered to leverage the efficiency and long-range dependency modeling capabilities of State Space Models (SSMs) while adapting them for non-causal data like spatial joint configurations and full temporal sequences relevant to pose estimation.

Traditional SSMs, representable by linear time-invariant systems

$$h'(t) = A h(t) + B x(t), \quad y(t) = C h(t) + D x(t), \tag{12}$$

are discretized for practical implementation:

$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \quad y_t = C h_t + D x_t, \tag{13}$$

where $\bar{A} = e^{\Delta A}$ and $\bar{B} = (\Delta A)^{-1}(e^{\Delta A} - I) \cdot \Delta B$ are derived from $A, B$ and a step size $\Delta$. Mamba [19] enhances this with a selective scanning (S6) mechanism, enabling input-dependent weights ($\bar{A}(x_t), \bar{B}(x_t), C(x_t)$) and achieving linear computational complexity. While the recursive S6 mechanism excels in causal sequence modeling, visual and pose data benefit from considering context from all directions.
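To make the discretization in Equations (12) and (13) concrete, the sketch below applies the zero-order-hold formulas to a diagonal $A$ (stored as a vector) and runs the recurrence over a scalar input sequence. It covers only the time-invariant case; the input-dependent selection of S6 is not modeled, and the parameter values are illustrative assumptions.

```python
import numpy as np

def discretize_zoh(a, b, delta):
    """Zero-order hold for diagonal A (vector a): returns (a_bar, b_bar)."""
    a_bar = np.exp(delta * a)
    # (Delta A)^{-1} (e^{Delta A} - I) Delta B reduces to (a_bar - 1) / a * b elementwise.
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def ssm_scan(x, a_bar, b_bar, c, d):
    """Eq. (13) for a scalar input sequence x; cost is linear in len(x)."""
    h = np.zeros_like(a_bar)
    y = np.empty_like(x)
    for t, x_t in enumerate(x):
        h = a_bar * h + b_bar * x_t
        y[t] = c @ h + d * x_t
    return y

a = np.array([-1.0, -2.0])   # stable diagonal dynamics (assumed values)
b = np.array([1.0, 1.0])
c = np.array([1.0, 1.0])
a_bar, b_bar = discretize_zoh(a, b, delta=0.1)
y = ssm_scan(np.ones(500), a_bar, b_bar, c, d=0.0)
print(round(float(y[-1]), 6))  # 1.5
```

For a constant input the discrete state settles at $-b/a$ per channel, the same steady state as the continuous system in Equation (12), which is why the output approaches $1 + 0.5 = 1.5$ here.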

To this end, our Bidirectional Mamba module (Figure 2c) processes input features through two parallel streams: one forward and one backward. Each stream involves Layer Normalization, a linear projection, a 1D convolution, and the selective SSM. The outputs from these two streams are then combined via summation, followed by a final linear projection. This dual-path design allows the module to capture dependencies from both past and future contexts or from joints in all directions. This is crucial for robust pose estimation, as the position of a joint is often influenced by its neighbors and its recent as well as upcoming movements. The Mamba architecture’s capacity for linear-complexity long-range modeling in both directions significantly enhances the temporal consistency and spatial coherence of the estimated 3D poses, leading to smoother and more physiologically plausible movements. The cascaded application of Spatial and Temporal Mamba Blocks within each Pose Block ensures that Pose3DM can effectively learn and integrate intricate spatio-temporal correlations from 2D pose sequences for accurate 3D pose lifting.
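The two-stream idea described above can be illustrated with a toy example. The causal recurrence below is a deliberately simple stand-in for the selective SSM, and the Layer Normalization, linear projections, and 1D convolution of each stream are omitted; only the forward scan, the reversed scan, and the summation follow the text.

```python
import numpy as np

def causal_scan(x, decay):
    """Toy causal recurrence h_t = decay * h_{t-1} + x_t along axis 0."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + x[t]
        out[t] = h
    return out

def bidirectional_scan(x, decay_fwd, decay_bwd):
    """Sum of a forward scan and a scan over the reversed sequence."""
    fwd = causal_scan(x, decay_fwd)
    # Reverse, scan causally, then reverse back so positions line up again.
    bwd = causal_scan(x[::-1], decay_bwd)[::-1]
    return fwd + bwd

x = np.zeros((5, 1))
x[2] = 1.0                                   # a single impulse in the middle
out = bidirectional_scan(x, 0.5, 0.5)
print(out[:, 0].tolist())                    # [0.25, 0.5, 2.0, 0.5, 0.25]
print(causal_scan(x, 0.5)[:, 0].tolist())    # [0.0, 0.0, 1.0, 0.5, 0.25]
```

The causal scan alone leaves positions before the impulse at zero, whereas the bidirectional output is symmetric around it: every position receives context from both sides. The same function applies whether axis 0 indexes frames (Temporal block) or joints (Spatial block).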

3.3. Loss Function Design with Fractional Calculus

The loss function employed during the training phase of the Pose3DM network incorporates fractional-order total variation (FTV) regularization to improve temporal smoothness (and thereby enhance overall coherence) in the predicted 3D poses. The final loss function $\mathcal{L}$ is

$$\mathcal{L} = \mathcal{L}_{3D} + \lambda_{t} \mathcal{L}_{t} + \lambda_{m} \mathcal{L}_{m} + \lambda_{2D} \mathcal{L}_{2D} + \lambda_{FTV} \mathcal{L}_{FTV}. \tag{14}$$

Here, $\mathcal{L}_{3D}$ is the MPJPE loss; $\mathcal{L}_{t}$ is a temporal consistency loss (TCLoss) that encourages smooth poses over time; $\mathcal{L}_{m}$ is the MPJVE loss, improving frame-to-frame coherence; $\mathcal{L}_{2D}$ is a 2D re-projection loss aligning estimated 3D poses with 2D detections; and $\mathcal{L}_{FTV}$ is the fractional-order total variation term that primarily promotes temporal smoothness.

We define the FTV penalty as an $\ell_{1}$ aggregation of per-joint, per-coordinate fractional differences along the temporal axis:

$$\mathcal{L}_{FTV} = \sum_{t=K+1}^{T} \sum_{j=1}^{J} \sum_{c \in \{x, y, z\}} \big| (D_{t}^{\alpha} X'_{c})(t, j) \big|. \tag{15}$$

The discrete Grünwald–Letnikov (GL) approximation of the fractional difference is

$$(D_{t}^{\alpha} X'_{c})(t, j) \approx \frac{1}{(\Delta t)^{\alpha}} \sum_{k=0}^{K} C_{k}\, X'_{c}(t - k, j), \quad t = K+1, \dots, T, \tag{16}$$

where $X'(t, j, :) \in \mathbb{R}^{3}$ denotes the (optionally normalized) predicted 3D coordinates of joint $j$ at time $t$, and $X'_{c}$ picks the $c$-th coordinate channel with $c \in \{x, y, z\}$. The frame interval is $\Delta t > 0$ ($\Delta t = 1/\mathrm{fps}$). Standard boundary handling (reflection or zero padding) is used for indices $t \leq K$.

The GL weights $\{C_{k}\}$ are given by

$$C_{k} = \frac{(-1)^{k}\, \Gamma(k + \alpha)}{\Gamma(k + 1)\, \Gamma(\alpha)}, \quad k = 0, 1, \dots, K, \tag{17}$$

where $\Gamma(\cdot)$ denotes the Euler Gamma function (for $n \in \mathbb{N}^{+}$, $\Gamma(n) = (n - 1)!$; for $\Re(z) > 0$, $\Gamma(z) = \int_{0}^{\infty} t^{z-1} e^{-t}\, dt$), $\alpha \in (0, 1]$ is the fractional order, and $K$ is the truncation length used in practice. Using the rising Pochhammer symbol $(\alpha)_{k} = \Gamma(k + \alpha) / \Gamma(\alpha)$, Equation (17) is equivalently $C_{k} = (-1)^{k} (\alpha)_{k} / k!$. For numerical stability, we compute the coefficients via the recurrence

$$C_{0} = 1, \quad C_{k} = -\frac{\alpha + k - 1}{k}\, C_{k-1} \quad (k \geq 1). \tag{18}$$
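The coefficient recurrence and the FTV penalty can be sketched in NumPy as follows. The code follows Equations (15)-(18) exactly as printed, checks the recurrence against the Gamma-function form, and sums only over frames $t = K+1, \dots, T$ so that no boundary padding is needed. The values of `alpha`, `K`, the frame rate, and the tensor sizes are illustrative assumptions.

```python
import numpy as np
from math import gamma

def gl_weights(alpha, K):
    """Coefficients C_0..C_K from the recurrence in Eq. (18)."""
    c = np.empty(K + 1)
    c[0] = 1.0
    for k in range(1, K + 1):
        c[k] = -(alpha + k - 1.0) / k * c[k - 1]
    return c

def gl_weights_closed_form(alpha, K):
    """The same coefficients from the Gamma-function form in Eq. (17)."""
    return np.array([(-1) ** k * gamma(k + alpha) / (gamma(k + 1) * gamma(alpha))
                     for k in range(K + 1)])

def ftv_penalty(x, alpha, K, dt):
    """Eqs. (15)-(16): l1 sum of truncated fractional differences along time.

    x has shape (T, J, 3); only frames with a full K-frame history contribute.
    """
    c = gl_weights(alpha, K)
    total = 0.0
    for t in range(K, x.shape[0]):            # 0-indexed t is frame t + 1 in the text
        window = x[t - K:t + 1][::-1]         # x[t], x[t-1], ..., x[t-K]
        diff = np.tensordot(c, window, axes=(0, 0)) / dt ** alpha
        total += np.abs(diff).sum()
    return total

alpha, K = 0.5, 5                             # assumed hyperparameters
print(np.allclose(gl_weights(alpha, K), gl_weights_closed_form(alpha, K)))  # True

rng = np.random.default_rng(0)
poses = rng.standard_normal((50, 17, 3))      # stand-in for predicted 3D poses
loss = ftv_penalty(poses, alpha, K, dt=1.0 / 60.0)
```

The recurrence avoids evaluating $\Gamma(k + \alpha)$ directly, which overflows in floating point for large $k$. Note that the GL weights commonly given in textbooks, $(-1)^{k}\binom{\alpha}{k}$, obey a recurrence with factor $(k - 1 - \alpha)/k$ and agree with the values above only for $k \leq 1$, so the sign convention is worth confirming against a reference implementation before reuse.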

In practice, the fractional operator enables smooth transitions between frames while capturing long-range temporal dependencies, which is crucial for human pose estimation where both short- and long-term dynamics matter. The FTV term suppresses high-frequency noise and penalizes abrupt, implausible changes in joint trajectories, while allowing natural variations. The hyperparameter $\lambda_{FTV}$ controls the strength of this regularization relative to other losses. The order $\alpha$ adjusts the amount of smoothing, balancing detail preservation and temporal coherence.

3.4. Gait Analysis from 3D Poses

As outlined in the overall framework (Figure 1), the estimated 3D human poses from the Pose3DM network serve as the fundamental input for our subsequent gait analysis. Accurate 3D joint coordinates are crucial for deriving meaningful gait parameters that can characterize human movement patterns.

Following the prediction of 3D joint positions $\hat{X}_{3D} \in \mathbb{R}^{T \times J \times 3}$ for each frame $t \in \{1, \dots, T\}$ and each joint $j \in \{1, \dots, J\}$, we proceed to calculate various gait parameters. A key step in gait analysis is the computation of angles between different body segments or joints. As illustrated in Figure 1, specific human key joint angles are calculated to quantify the range of motion and coordination during walking. For example, consider the angle between two segments defined by three joints, say $J_{1}$, $J_{2}$, and $J_{3}$, where $J_{2}$ is the vertex. Let $p_{1}, p_{2}, p_{3} \in \mathbb{R}^{3}$ be the 3D coordinates for joints $J_{1}$, $J_{2}$, and $J_{3}$, respectively. The two vectors forming the angle at $J_{2}$ are $\bar{V}_{1} = p_{1} - p_{2}$ (representing the segment $J_{2} J_{1}$) and $\bar{V}_{2} = p_{3} - p_{2}$ (representing the segment $J_{2} J_{3}$). The angle $\theta$ is then computed using the dot product:

$$\theta = \cos^{-1}\left(\frac{\bar{V}_{1} \cdot \bar{V}_{2}}{|\bar{V}_{1}|\, |\bar{V}_{2}|}\right). \tag{19}$$

This allows us to analyze angles such as knee flexion/extension, hip flexion/extension, ankle dorsiflexion/plantarflexion, etc., throughout the gait cycle. The precise calculation of these angles provides valuable insights into the kinematics of the gait.
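The three-point angle of Equation (19) is straightforward to implement; a minimal NumPy sketch is given below. The clipping step is an added numerical safeguard (rounding can push the cosine slightly outside $[-1, 1]$) and the joint coordinates in the example are made-up values for illustration.

```python
import numpy as np

def joint_angle(p1, p2, p3, degrees=True):
    """Angle at vertex p2 between segments p2->p1 and p2->p3, as in Eq. (19).

    Inputs may be single points of shape (3,) or sequences of shape (T, 3).
    """
    v1 = p1 - p2
    v2 = p3 - p2
    cos_theta = np.sum(v1 * v2, axis=-1) / (
        np.linalg.norm(v1, axis=-1) * np.linalg.norm(v2, axis=-1))
    cos_theta = np.clip(cos_theta, -1.0, 1.0)   # guard against rounding error
    theta = np.arccos(cos_theta)
    return np.degrees(theta) if degrees else theta

hip = np.array([0.0, 1.0, 0.0])
knee = np.array([0.0, 0.0, 0.0])
ankle_bent = np.array([1.0, 0.0, 0.0])
ankle_straight = np.array([0.0, -1.0, 0.0])

print(float(joint_angle(hip, knee, ankle_bent)))      # 90.0
print(float(joint_angle(hip, knee, ankle_straight)))  # 180.0
```

Because the function reduces over the last axis, passing arrays of shape $(T, 3)$ for each joint returns the angle trajectory over a whole clip in one call.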

Beyond joint angles, we extract a suite of important gait parameters to provide a comprehensive characterization of the subject’s walking pattern. These parameters, also depicted in Figure 3, are selected for their clinical relevance and ability to differentiate gait patterns, and typically include:

Gait Cycle (s): The duration of a complete gait cycle, from one heel strike to the next heel strike of the same foot.
Step Frequency (steps/min): The number of steps taken per minute.
Step Length (m): The horizontal distance covered in each step.
Pace of Step (m/s): The speed of walking, indicating how fast the subject is walking.
Stance Time (s): The duration during which the foot is in contact with the ground in each gait cycle.
Swing Time (s): The duration during which the foot is in the air in each gait cycle.

These parameters are derived from the temporal sequence of 3D joint positions and can be analyzed to identify deviations from normal gait patterns. By leveraging the accurate 3D pose estimates from Pose3DM, our framework enables objective and quantitative gait analysis.
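As a rough illustration of how the listed parameters relate to one another, the sketch below derives them from gait event times. It assumes that heel-strike and toe-off times for one foot have already been detected from the 3D pose sequence (the detection rule is not described here), that step length is measured separately, and that each gait cycle contains two steps. The function names and example values are hypothetical.

```python
import numpy as np

def temporal_gait_parameters(heel_strikes, toe_offs):
    """Mean gait cycle, stance time, and swing time (all in seconds) for one foot.

    heel_strikes: n + 1 successive heel-strike times of the same foot.
    toe_offs:     n toe-off times, one inside each cycle.
    """
    hs = np.asarray(heel_strikes, dtype=float)
    to = np.asarray(toe_offs, dtype=float)
    cycle = np.diff(hs)          # heel strike to next heel strike, same foot
    stance = to - hs[:-1]        # foot on the ground
    swing = hs[1:] - to          # foot in the air
    return cycle.mean(), stance.mean(), swing.mean()

def cadence_and_pace(gait_cycle_s, step_length_m):
    """Step frequency (steps/min) and pace (m/s), assuming two steps per cycle."""
    steps_per_min = 2.0 * 60.0 / gait_cycle_s
    pace = step_length_m * steps_per_min / 60.0
    return steps_per_min, pace

cycle, stance, swing = temporal_gait_parameters([0.0, 1.0, 2.0], [0.6, 1.6])
cadence, pace = cadence_and_pace(cycle, step_length_m=0.6)
print(cycle, cadence)  # 1.0 120.0
```

By construction stance and swing times sum to the gait cycle, which gives a quick consistency check on detected events. The two-steps-per-cycle assumption is a simplification that may not hold for strongly asymmetric gaits.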

4. Experiments

4.1. Datasets and Evaluation Metrics

We evaluate Pose3DM on a standard video-based 3D human pose benchmark (Human3.6M [47]) and on a clinical gait dataset (AUST-VisGait) to assess generalizability and clinical relevance. Human3.6M provides comparability to prior work under widely adopted protocols. AUST-VisGait evaluates performance in realistic, clinically oriented scenarios with diverse gait patterns.

4.1.1. Human3.6M (Standard Benchmark)

Human3.6M [47] contains 3.6M frames from 11 subjects in a controlled lab. The subjects perform 15 standard actions: Directions, Discussion, Eating, Greeting, Phoning, Posing, Purchases, Sitting, SittingDown, Smoking, TakingPhoto, Waiting, Walking, WalkingDog, and WalkTogether. We use the common split: S1 and S5–S8 for training, and S9/S11 for testing. Unless otherwise stated, we segment videos into fixed-length clips of $T = 243$ frames for both training and inference. We obtain 2D keypoints with a pretrained ResNet-based detector and lift them to 3D using our network. We apply root centering and pelvis alignment following standard practice so that Protocol 1/2 remain comparable to prior work. We use the official per-camera calibration released with Human3.6M (intrinsic matrix and extrinsic poses) for all camera-related computations and do not alter the dataset’s original optics.

4.1.2. AUST-VisGait (Clinical Dataset)

AUST-VisGait [10] targets clinical gait assessment with monocular videos. The cohort includes 80 healthy adults (age ≥ 20). Screening excluded conditions that could affect gait (e.g., prior lower-limb surgery, neurological disorders, heart disease). Ethical approval was granted by the Biomedical Research Ethics Committee of Anhui Science and Technology University. Written informed consent was obtained in accordance with the Declaration of Helsinki. Monocular RGB videos are recorded at 60 Hz with 1080p resolution. A motion-capture reference runs at 300 Hz with $\pm 0.1$ mm accuracy. L-/T-frame calibration is used for synchronization and spatial alignment.

The dataset includes six clinician-supervised walking conditions: Normal Walk (NW) and five simulated injury gaits, namely Ankle Injury Gait (AIG), Calf Muscle Injury Gait (CIG), Knee Injury Gait (KIG), Thigh Muscle Injury Gait (TIG), and Hip Injury Gait (HIG). Each video is split into an 80%/20% train/test partition while preserving the distribution of gait patterns. Unless specified, sequences are segmented into $T = 243$-frame clips with a fixed stride. We extract 2D keypoints using the same pretrained detector as in Human3.6M and normalize them with identical conventions to ensure cross-dataset consistency. Table 2 reports the number of video frames for each walking pattern.

On Human3.6M, we report Protocol 1 (MPJPE, mm) and Protocol 2 (P-MPJPE, mm). Protocol 1 computes the mean per-joint error after pelvis translation alignment. Protocol 2 computes the error after rigid Procrustes alignment. On AUST-VisGait, we report MPJPE (mm) for six lower-limb joints: left/right ankle, knee, and hip. We also provide aggregate statistics across gait patterns. Where applicable, we summarize clinical gait parameters in downstream analyses.
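
The two protocols can be illustrated with a minimal NumPy sketch. It is not taken from the released code: the function names, the choice of joint 0 as the pelvis, and the `with_scale` switch are assumptions made for illustration. Widely used P-MPJPE implementations also fit a uniform scale during Procrustes alignment, so the sketch does so by default; whether the reported numbers include scale is not stated in this section.

```python
import numpy as np

def mpjpe(pred, gt, root=0):
    """Protocol 1: mean per-joint error after root (pelvis) translation alignment.

    pred, gt: arrays of shape (frames, joints, 3), e.g. in millimetres.
    `root` is the pelvis index (assumed to be 0 here).
    """
    pred = pred - pred[:, root:root + 1]
    gt = gt - gt[:, root:root + 1]
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def procrustes_align(pred, gt, with_scale=True):
    """Align one predicted pose (joints, 3) to gt by rotation, translation, optional scale."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the cross-covariance matrix.
    u, s, vt = np.linalg.svd(p.T @ g)
    d = np.ones(3)
    d[-1] = np.sign(np.linalg.det(u @ vt))  # avoid reflections
    rot = u @ np.diag(d) @ vt
    scale = (s * d).sum() / (p ** 2).sum() if with_scale else 1.0
    return scale * p @ rot + mu_g

def p_mpjpe(pred, gt, with_scale=True):
    """Protocol 2: mean per-joint error after per-frame Procrustes alignment."""
    errors = [
        np.linalg.norm(procrustes_align(p, g, with_scale) - g, axis=-1).mean()
        for p, g in zip(pred, gt)
    ]
    return float(np.mean(errors))
```

A prediction that differs from the ground truth only by a global rotation, scale, and shift scores close to zero under `p_mpjpe` but not under `mpjpe`, which is why Protocol 2 values are lower than Protocol 1 values for the same model.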

4.2. Implementation Details

We implement our models in PyTorch 2.5.1 and train on a single NVIDIA RTX 3090 GPU. Unless otherwise specified, the input clip length is $T = 243$ frames. For each clip, 2D keypoints are extracted with a pretrained ResNet-based detector and then lifted to 3D by our network. Root centering and pelvis alignment follow the evaluation conventions described in Section 4.1 to ensure comparability across protocols. During training, we apply horizontal flip as data augmentation.

We optimize with AdamW for 120 epochs using a batch size of 4, weight decay of 0.01, and an initial learning rate of $2 \times 10^{-4}$ with a cosine annealing schedule. The training objective combines an L1 3D regression loss (MPJPE), temporal consistency and velocity terms, a 2D reprojection term, and fractional-order total variation (FTV) regularization to enhance temporal smoothness and spatial coherence.
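
For reference, the learning-rate trajectory implied by these settings can be sketched in plain Python. The closed form below is the standard cosine annealing formula (the same one documented for PyTorch's `CosineAnnealingLR`). The per-epoch update, the minimum learning rate of 0, and the absence of warm-up are assumptions, since the text does not specify them.

```python
import math

def cosine_lr(epoch, total_epochs=120, base_lr=2e-4, min_lr=0.0):
    """Learning rate at the start of `epoch` under cosine annealing.

    min_lr=0.0 and per-epoch stepping are assumptions, not values from the paper.
    """
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Print the schedule at a few checkpoints of the 120-epoch run.
for epoch in (0, 30, 60, 90, 120):
    print(f"epoch {epoch:3d}: lr = {cosine_lr(epoch):.2e}")
```

Under these assumptions the rate starts at the stated initial value, passes through half of it at epoch 60, and decays to the minimum at epoch 120.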

Pose3DM is instantiated in three capacities: Small, Base, and Large (S, B, L). Configuration details appear in Table 3. Pose3DM-B offers a strong accuracy–compute trade-off, Pose3DM-L targets the best accuracy at a moderate compute budget, and Pose3DM-S favors efficiency for time-sensitive or resource-constrained scenarios.

4.3. Experimental Settings

We evaluate under the dataset protocols defined in Section 4.1. On Human3.6M we follow the standard split (S1, S5–S8 for training; S9/S11 for testing) and report Protocol 1 (MPJPE, mm) and Protocol 2 (P-MPJPE, mm). On AUST-VisGait, we use the 80%/20% train/test split per video while preserving gait-pattern distribution, and we report per-joint MPJPE (mm) for the six lower-limb joints together with aggregate statistics across patterns.

At test time, each video is segmented into $T = 243$-frame clips with a fixed stride; predictions for overlapped frames are averaged. We apply horizontal flip as a test-time augmentation and keep the same 2D detector and normalization conventions as in training for consistency.
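
The clip-wise inference with overlap averaging can be sketched as follows. This is an illustrative NumPy version rather than the released implementation: `lift_fn` stands in for the trained network, the stride of 81 frames is a hypothetical value (the text only says the stride is fixed), and adding one extra clip anchored at the end of the video to cover the tail is an assumption.

```python
import numpy as np

def clip_starts(n_frames, clip_len=243, stride=81):
    """Start indices of fixed-length clips that together cover all frames."""
    if n_frames < clip_len:
        raise ValueError("sequence shorter than one clip; pad it first")
    starts = list(range(0, n_frames - clip_len + 1, stride))
    if starts[-1] + clip_len < n_frames:
        starts.append(n_frames - clip_len)  # extra clip so the tail is covered
    return starts

def predict_video(lift_fn, kpts_2d, clip_len=243, stride=81):
    """Run lift_fn on overlapping clips and average the predictions per frame.

    kpts_2d: (frames, joints, 2); lift_fn maps (clip_len, joints, 2)
    to (clip_len, joints, 3).
    """
    n_frames, n_joints = kpts_2d.shape[:2]
    total = np.zeros((n_frames, n_joints, 3))
    count = np.zeros((n_frames, 1, 1))
    for start in clip_starts(n_frames, clip_len, stride):
        total[start:start + clip_len] += lift_fn(kpts_2d[start:start + clip_len])
        count[start:start + clip_len] += 1
    return total / count
```

Flip augmentation would wrap `lift_fn`: the network is also run on the horizontally mirrored input (with left and right joints swapped), the output is mirrored back, and the two predictions are averaged. The left/right joint index lists depend on the skeleton definition and are not given in this section.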

For computational analysis, we report the number of parameters and multiply–accumulate operations (MACs) per frame and for a $T = 243$-frame clip under identical input settings. The model configurations are summarized in Table 3, and full quantitative results appear in the results tables. Random seeds are fixed, and code/configuration files will be released to facilitate reproduction. Throughput (frames per second) is measured on an RTX 3090 with batch size 1 and $T = 243$.

4.4. Comparison with State-of-the-Art

4.4.1. Performance and Computational Efficiency on Human3.6M

While Transformer-based models currently deliver state-of-the-art results on Human3.6M, their computational demands remain substantial. With bidirectional state-space modeling, Pose3DM achieves lower errors than existing Transformer methods while significantly reducing computational cost. Table 4 compares Pose3DM with state-of-the-art methods on Human3.6M under Protocols P1 and P2.

As detailed in Table 4, Pose3DM-L achieves a P1 error of 37.9 mm and a P2 error of 32.1 mm. Compared to MotionBERT [5], a previous SOTA Transformer-based method, Pose3DM-L is more accurate, reducing P1 error from 39.2 mm to 37.9 mm and P2 error from 32.9 mm to 32.1 mm, while requiring approximately 17.5% of MotionBERT’s parameters and roughly 17.7% of its MACs per frame.

Against MotionAGFormer-L [35], which achieves 38.4 mm in P1 and 32.5 mm in P2, Pose3DM-L delivers slightly better accuracy while substantially reducing computational cost. Specifically, Pose3DM-L requires 30.8 G MACs and 127 M MACs/Frame compared to MotionAGFormer-L’s 78.3 G MACs and 322 M MACs/Frame, respectively.
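
The relative figures quoted for these comparisons follow directly from the reported values. The short check below only redoes that arithmetic; the helper name `rel_drop` is ours. The per-frame MACs are the clip totals divided by the clip length of 243 frames.

```python
def rel_drop(baseline, ours):
    """Relative reduction of `ours` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - ours) / baseline

# Accuracy relative to MotionBERT (errors in mm).
print(f"P1 reduction: {rel_drop(39.2, 37.9):.1f}%")
print(f"P2 reduction: {rel_drop(32.9, 32.1):.1f}%")

# Compute relative to MotionAGFormer-L (MACs per frame, in millions).
print(f"MACs/frame reduction: {rel_drop(322, 127):.1f}%")

# Per-frame MACs recovered from the clip totals (T = 243 frames).
print(f"Pose3DM-L: {30.8e9 / 243 / 1e6:.0f} M MACs/frame")
print(f"MotionAGFormer-L: {78.3e9 / 243 / 1e6:.0f} M MACs/frame")
```

Because the inputs are rounded, the reduction computed from the clip totals (30.8 G vs. 78.3 G) can differ from the per-frame figure in the last decimal place.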

Pose3DM-S, with merely 0.5 M parameters and 2.1 G MACs, achieves competitive results of 42.1 mm in P1 and 35.0 mm in P2, demonstrating the efficiency of the Mamba architecture across different scales.

4.4.2. Distribution of Joint Angles Across Different Gait Patterns

To understand how different gait patterns manifest in joint kinematics and assess the discriminative power of these features for gait pattern analysis, the distribution of key lower limb joint angles for each distinct gait pattern in the clinical dataset was visualized.

Figure 4 demonstrates the distribution of human joint angles for each walking pattern via box plots. The plots are shown separately for the left and right hip, knee, and ankle joints. Different walking patterns exhibit distinct distributions of angles for these key joints. Notably, the box plots reveal that, in general, the left lower limb joint angles are lower than those of the right lower limb, except for the normal gait pattern. This difference arises because our simulated injured gaits maintain the normal position for the right lower limb while simulating abnormalities in the left lower limb. This distinction highlights the impact of simulated injuries on joint angles and emphasizes that joint angle features derived from the monocular 3D pose estimation framework have potential in characterizing and differentiating various human walking patterns in clinical settings.

The box plots for each joint illustrate the range, median, quartiles, and outliers of the respective joint angles across different gait patterns. The hip joint plots show that the distribution of angles for HIG appears notably different compared to NW or other simulated injury gaits. The median and range of hip angles in HIG show significant shifts compared to NW, suggesting that hip-related gait issues are reflected in specific hip joint angle patterns. Similar variations exist across other joints and gait types. These variations across different joints and gait types support the notion that these kinematic features are indicative of underlying gait characteristics and can serve as discriminative features for gait pattern analysis.
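
As an illustration of how such joint angles can be derived from estimated 3D joint positions, the sketch below computes the included angle at a joint from three points. This is a generic geometric definition and an assumption on our part: the angle convention used for Figure 4 (for example, flexion measured as 180° minus the included angle, or angles projected onto the sagittal plane) is not restated in this section.

```python
import numpy as np

def joint_angle_deg(a, b, c):
    """Included angle at joint b (degrees) between segments b->a and b->c.

    a, b, c: arrays of shape (..., 3), e.g. hip, knee, and ankle positions
    for one frame or for a whole sequence.
    """
    u, v = a - b, c - b
    cos = np.sum(u * v, axis=-1) / (
        np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1)
    )
    # Clip guards against values slightly outside [-1, 1] from rounding.
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def box_stats(angles):
    """Quartiles of an angle series, the quantities summarized by a box plot."""
    q1, median, q3 = np.percentile(angles, [25, 50, 75])
    return {"q1": float(q1), "median": float(median), "q3": float(q3)}
```

Applying `joint_angle_deg` frame by frame to the hip, knee, and ankle of each leg yields one angle series per joint and gait pattern, which `box_stats` then summarizes.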

4.4.3. Performance Comparison on Clinical Gait Patterns

To evaluate the performance of Pose3DM-L and compare it with state-of-the-art methods on complex human movements, experiments were conducted using the clinical dataset, which includes six distinct gait patterns: AIG, CIG, HIG, KIG, NW, and TIG. The performance of Pose3DM-L was compared with that of MixSTE, MotionBERT, MotionAGFormer-L, and PoseMamba-L in terms of MPJPE, Std Error, and Max Error.

Figure 5 illustrates the MPJPE, Standard Error, and Maximum Error for each gait pattern across different methods. As shown in Figure 5a–f, Pose3DM-L consistently yielded lower MPJPE, Std Error, and Max Error across all tested gait patterns, underscoring its superior overall performance relative to other methods. Table 5 provides detailed MPJPE values for individual lower limb joints across each gait pattern.

Figure 6 presents the frequency distribution of errors for the key lower limb joints for each method across all gait patterns. This visualization offers insights into the overall distribution of errors. For most joints, Pose3DM-L exhibits a distribution shifted towards lower error values compared to other methods, indicating a higher frequency of very small errors.

4.5. Validation of Monocular Camera-Based Gait Analysis

To evaluate the performance of our proposed Pose3DM system and examine gait characteristics across different walking conditions, we conducted a comprehensive analysis comparing monocular camera estimations with motion capture measurements across six groups: AIG, CIG, HIG, KIG, NW, and TIG.

Figure 7 presents the correlation between gait parameters derived from a monocular camera (Pose3DM) and motion capture systems for individual participants. The scatter plots reveal strong linear relationships across all measured parameters, with data points tightly clustered along the diagonal reference line. This high degree of correspondence validates the accuracy of our monocular camera-based approach in capturing essential gait characteristics across diverse walking patterns.
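
The agreement shown in such scatter plots can be quantified with a few summary statistics. The sketch below is a hypothetical helper, not part of the paper's pipeline: it takes paired per-participant values of one gait parameter from the camera and from motion capture and returns the Pearson correlation, the mean difference (bias), and the mean absolute difference. Which statistics accompany Figure 7 is not stated in this section.

```python
import numpy as np

def agreement(camera, mocap):
    """Summary statistics for paired measurements of one gait parameter."""
    camera = np.asarray(camera, dtype=float)
    mocap = np.asarray(mocap, dtype=float)
    diff = camera - mocap
    return {
        "pearson_r": float(np.corrcoef(camera, mocap)[0, 1]),
        "bias": float(diff.mean()),          # systematic over/under-estimation
        "mae": float(np.abs(diff).mean()),   # typical size of the disagreement
    }
```

Points lying on the diagonal reference line correspond to a correlation near 1 together with a bias near 0; a high correlation alone would not rule out a constant offset between the two systems.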

Table 6 summarizes the mean values and standard deviations of key gait parameters estimated using Pose3DM across all groups. The data reveal a clear distinction between normal walking (NW) and the five impaired gait patterns. Specifically, the NW group demonstrates substantially longer step length (0.62 ± 0.06 m) and higher walking speed (1.39 ± 0.19 m/s) compared to the impaired gait groups, which show relatively similar characteristics among themselves (step length: 0.19–0.23 m; walking speed: 0.44–0.53 m/s). These findings suggest that while different injury-simulated conditions may produce subtle variations in gait mechanics, they collectively result in comparable reductions in spatial parameters when compared to normal walking patterns.

The consistent performance of Pose3DM across all groups, as evidenced by the strong correlations in Figure 7, demonstrates its potential as a practical tool for gait assessment in both clinical and research applications, offering a cost-effective alternative to traditional motion capture systems while maintaining measurement accuracy.

4.6. Ablation Studies

To demonstrate the necessity of each component in our proposed Pose3DM model, we performed an ablation study. Table 7 shows the quantitative results with different model variants, trained with the same strategy on the H3.6M dataset.

The results in Table 7 show that removing any component of the full model leads to worse performance, indicating that each design choice contributes to the model’s effectiveness and improves the quality of the estimated 3D poses.

Specifically, the performance degradation when removing the Spatial Mamba highlights its importance in capturing spatial dependencies among joints within a frame. Similarly, the significant drop in performance without the Temporal Mamba confirms the critical role of modeling temporal dynamics. The comparison between bidirectional and unidirectional Mamba variants demonstrates the benefit of processing information in both forward and backward directions for improved context modeling. Removing the skip connections also results in decreased accuracy, emphasizing their function in facilitating information flow between the encoder and decoder.

4.7. Qualitative Analysis

Figure 8 and Figure 9 present qualitative comparisons between Pose3DM and strong baselines (MotionBERT and PoseMamba) on Human3.6M and AUST-VisGait, respectively. Across both datasets, Pose3DM renders more anatomically plausible 3D poses with improved temporal smoothness, consistent with the quantitative trends reported earlier. The green circles in the figures mark representative regions where differences are most visible.

On Human3.6M, we observe clear advantages for actions that rely on long-range context and are prone to partial self-occlusions. A notable example is Greeting: only Pose3DM reconstructs stable upper-limb trajectories and trunk sway. This action involves anticipatory and follow-through cues (arm raise, hand motion, subtle weight shifts) that unfold over an extended time; these cues are captured more reliably by Pose3DM’s bidirectional state-space modeling, which aggregates information from both past and future frames with linear complexity. In addition, the sequential application of Spatial and Temporal Mamba blocks preserves inter-joint relations during fast articulations, and the fractional-order total variation (FTV) regularization suppresses high-frequency jitter that otherwise degrades perceived motion realism. Similar improvements are seen in actions such as Phoning and Eating, where intermittent arm–torso occlusions and small-amplitude hand motions often challenge attention-based or generic SSM counterparts.
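
To make the jitter-suppression role of FTV concrete, the sketch below evaluates a fractional-order temporal difference penalty on a pose sequence using the Grünwald–Letnikov discretization. This is an assumption-laden illustration: the exact FTV formulation belongs to the method section and is not restated here, and the order `alpha`, the truncation length `n_terms`, and the function names are hypothetical. In training, such a term would be written with differentiable tensor operations; NumPy is used here only for clarity.

```python
import numpy as np

def gl_weights(alpha, n_terms):
    """Grünwald-Letnikov coefficients w_k = (-1)^k * binom(alpha, k)."""
    w = np.empty(n_terms)
    w[0] = 1.0
    for k in range(1, n_terms):
        w[k] = w[k - 1] * (1.0 - (alpha + 1.0) / k)
    return w

def ftv_penalty(poses, alpha=0.5, n_terms=8):
    """Mean absolute fractional-order temporal difference of a pose sequence.

    poses: array of shape (frames, joints, 3) with frames >= n_terms.
    Each output frame mixes the current pose with n_terms - 1 past poses.
    """
    w = gl_weights(alpha, n_terms)
    frames = poses.shape[0]
    diff = np.zeros_like(poses[n_terms - 1:], dtype=float)
    for k in range(n_terms):
        # Term k uses the pose k frames in the past: w_k * x[t - k].
        diff += w[k] * poses[n_terms - 1 - k: frames - k]
    return float(np.abs(diff).mean())
```

With `alpha=1.0` the weights reduce to (1, -1, 0, ...), so the penalty becomes ordinary first-difference total variation. For fractional orders the weights decay slowly, so each frame is compared against a longer history, which is the memory effect the regularizer is meant to exploit. A sequence with added frame-to-frame noise receives a larger penalty than its smooth counterpart.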

On AUST-VisGait, the clinical gait patterns introduce asymmetric lower-limb kinematics (AIG, CIG, KIG, TIG) and regular symmetric gait (NW), along with hip-dominant deviations (HIG). Pose3DM maintains more stable ankle and knee trajectories through stance–swing transitions and better preserves left–right consistency when appropriate (NW), while capturing asymmetry where present (injury-pattern gaits). We attribute these effects to the model’s ability to couple spatial joint dependencies with temporal dynamics over long horizons, again aided by FTV in reducing frame-to-frame variance that can corrupt derived gait parameters.

Figure 10 depicts the correspondence between 2D detections and lifted 3D poses. (a) Human3.6M: green dashed circles mark occluded or low-confidence 2D keypoints that align with larger distal-joint deviations in 3D, while proximal joints remain comparatively stable. Black arrows and solid circles indicate suppressed frame-to-frame jitter over frames 985–987 (orange dashed arrows denote the temporal order). (b) AUST-VisGait: the yellow circle and arrow highlight an ipsilateral deficit visible in both 2D and 3D. We consistently observe: (i) 2D uncertainty propagates to 3D, with distal joints being more sensitive; (ii) short-term 2D jitter induces 3D trajectory fluctuations; and (iii) left–right symmetry in NW and side-specific asymmetry in AIG/CIG/KIG/TIG/HIG are preserved in 3D trajectories and joint-angle profiles. To address (i) and (ii), Pose3DM aggregates long-range context via pose-specific bidirectional Mamba blocks and applies FTV regularization to enforce spatiotemporal consistency and suppress frame-to-frame jitter; for (iii), Pose3DM preserves the 2D symmetry/asymmetry in the resulting 3D trajectories and joint-angle curves.

We observe the largest improvements in three scenarios: (i) when resolving transient or occluded motions requires long-range temporal context; (ii) when rapid limb articulations coincide with self-occlusion; and (iii) when preserving precise inter-joint coordination is essential for downstream clinical parameters. Conversely, on short segments with minimal occlusion—where local evidence suffices—the performance gap narrows. These findings are consistent with the architectural choices in Section 3, namely pose-specific bidirectional Mamba blocks for joint–time modeling and FTV regularization for temporal and spatial coherence, and they support state-space models as a reliable solution for monocular 3D pose estimation in clinical settings.

5. Discussion

5.1. Summary of Findings

Across Human3.6M and AUST-VisGait, Pose3DM demonstrates that replacing attention with bidirectional state-space modeling can preserve long-range temporal reasoning while substantially reducing compute and memory. On Human3.6M, Pose3DM-L attains 37.9 mm (P1) and 32.1 mm (P2), outperforming prior transformer baselines while using markedly fewer MACs per frame; on the clinical dataset, Pose3DM consistently yields lower MPJPE, standard error, and maximum error across all six gait patterns (Table 4, Table 5 and Table 6; Figure 5, Figure 6 and Figure 7). These gains are accompanied by smoother, more physiologically plausible pose sequences, aided by FTV regularization.

5.2. Clinical Implications and Deployability

Two aspects are particularly relevant to clinical use: (i) temporal coherence on long sequences, which stabilizes downstream kinematic parameters and gait-phase segmentation; and (ii) linear-time complexity, which enables practical deployment on commodity GPUs/CPUs. The S/B/L variants allow clinicians or integrators to target different operating points: S for resource-limited or near-real-time screening, B for balanced performance, and L for maximal accuracy under moderate compute budgets.

5.3. Error Characteristics and Robustness

Box plots of lower-limb joint angles reveal systematic differences among gait patterns (Figure 4), and error histograms show that Pose3DM is skewed toward smaller joint-position errors relative to baselines (Figure 6). Notably, simulated injury gaits exhibit asymmetries between left and right limbs that our model captures consistently, which supports downstream discrimination of gait abnormalities. The ablation study confirms that both Spatial and Temporal Mamba blocks are necessary; dropping either component degrades MPJPE and P-MPJPE, underscoring the need to jointly model intra-frame structure and inter-frame dynamics.

5.4. Compute–Accuracy Trade-Offs and Throughput

Compared with attention-based models of similar accuracy, Pose3DM reduces MACs per frame and total MACs for $T = 243$ clips (Table 4). This improves the feasibility of processing long clinical videos without compromising accuracy. Throughput measured on an NVIDIA RTX 3090 GPU (batch size 1, $T = 243$) indicates that Pose3DM-S achieves the highest FPS, Pose3DM-B offers a strong balance, and Pose3DM-L trades some speed for best accuracy.

5.5. Limitations and Future Work

Our pipeline depends on the quality of the upstream 2D keypoint detector; severe occlusions or unusual camera viewpoints can still induce lifting errors. The clinical cohort focuses on simulated injury gaits in adults; broader validation on diverse pathologies and assistive-device use is warranted. Future work will explore uncertainty-aware inference, test-time adaptation to mitigate domain shift, multi-view/IMU fusion when available, and calibration-light protocols to further facilitate point-of-care deployment.

6. Conclusions

We presented Pose3DM, a bidirectional Mamba-based state-space framework for accurate and efficient 3D human pose estimation tailored to clinical gait analysis. By replacing attention with linear-time state-space modeling and coupling Spatial/Temporal Mamba blocks, Pose3DM enhances long-range temporal reasoning and coherence while substantially reducing compute.

Our main evidence-backed contributions are as follows: (i) Architecture and efficiency: Pose3DM employs bidirectional state-space modeling within pose-specific blocks, achieving linear-time complexity and strong throughput under fixed-length clips; compute and model capacities are reported as parameters and MACs for the Small, Base, and Large variants. (ii) Accuracy on Human3.6M: Pose3DM-L attains 37.9 mm (P1) and 32.1 mm (P2), outperforming prior transformer baselines with markedly fewer MACs per frame. (iii) Clinical reliability on AUST-VisGait: across six gait patterns, Pose3DM-L yields consistently lower MPJPE, standard error, and maximum error, enabling robust extraction of clinically meaningful gait parameters. (iv) Deployability: linear-time complexity and compact S/B/L variants provide favorable accuracy–compute trade-offs for clinical workflows using a single RGB camera; throughput is measured on an RTX 3090 with batch size 1 and $T = 243$.

Improved temporal coherence stabilizes downstream kinematic estimates and gait-phase segmentation, supporting reliable markerless assessment. Current limitations include dependence on 2D keypoint quality and validation on simulated impairments; future work will extend to broader pathologies and uncertainty-aware, domain-adaptive inference.

Author Contributions

C.W., funding acquisition, project administration, resources, supervision, and proofreading; W.S., data collection; J.L., methodology, experiments, visualization, and writing; J.X., data curation and writing. All authors have read and agreed to the published version of the manuscript.

Funding

The University Synergy Innovation Program of Anhui Province provided support for this research under grant agreement GXXT-2022-053.

Institutional Review Board Statement

The research protocol received ethical approval from the Institutional Review Board at Anhui University of Science and Technology (China), complying with both the Helsinki Declaration and local regulatory requirements. All experimental procedures were performed following established ethical standards and institutional protocols. Prior to participation, written consent forms were obtained from all study subjects.

Data Availability Statement

Data will be made available upon request, subject to approval from the laboratory. Code is available at: https://github.com/Reus3237/Pose3DM.

Acknowledgments

The authors thank the volunteers who contributed to the experiment.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Liu, J.; Mu, X.; Liu, Z.; Li, H. Human skeleton behavior recognition model based on multi-object pose estimation with spatiotemporal semantics. Mach. Vis. Appl. 2023, 34, 44.
2. Anvari, T.; Park, K.; Kim, G. Upper body pose estimation using deep learning for a virtual reality avatar. Appl. Sci. 2023, 13, 2460.
3. Xu, F.; Xu, F.; Xie, J.; Pun, C.M.; Lu, H.; Gao, H. Action recognition framework in traffic scene for autonomous driving system. IEEE Trans. Intell. Transp. Syst. 2021, 23, 22301–22311.
4. Huo, R.; Zhang, Y.; Guo, Y.; Ju, Z.; Gao, Q. Gtformer: 3D driver body pose estimation in video with graph convolution network and transformer. IEEE Trans. Intell. Veh. 2023. early access.
5. Zhu, W.; Ma, X.; Liu, Z.; Liu, L.; Wu, W.; Wang, Y. Motionbert: A unified perspective on learning human motion representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 15085–15099.
6. Zhao, D.; Li, H.; Yan, S. Spatial–Temporal Synchronous Transformer for Skeleton-Based Hand Gesture Recognition. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 1403–1412.
7. Liu, J.; Liu, M.; Liu, H.; Li, W. TCPFormer: Learning Temporal Correlation with Implicit Pose Proxy for 3D Human Pose Estimation. arXiv 2025, arXiv:2501.01770.
8. Liang, S.; Zhang, Y.; Diao, Y.; Li, G.; Zhao, G. The reliability and validity of gait analysis system using 3D markerless pose estimation algorithms. Front. Bioeng. Biotechnol. 2022, 10, 857975.
9. Martini, E.; Boldo, M.; Aldegheri, S.; Valè, N.; Filippetti, M.; Smania, N.; Bertucco, M.; Picelli, A.; Bombieri, N. Enabling gait analysis in the telemedicine practice through portable and accurate 3D human pose estimation. Comput. Methods Programs Biomed. 2022, 225, 107016.
10. Li, J.; Wang, Z.; Wang, C.; Su, W. GaitFormer: Leveraging dual-stream spatial–temporal Vision Transformer via a single low-cost RGB camera for clinical gait analysis. Knowl.-Based Syst. 2024, 295, 111810.
11. Ghorbani, S.; Mahdaviani, K.; Thaler, A.; Kording, K.; Cook, D.J.; Blohm, G.; Troje, N.F. MoVi: A large multi-purpose human motion and video dataset. PLoS ONE 2021, 16, e0253157.
12. Zhang, Q.; Jin, T.; Cai, J.; Xu, L.; He, T.; Wang, T.; Tian, Y.; Li, L.; Peng, Y.; Lee, C. Wearable triboelectric sensors enabled gait analysis and waist motion capture for IoT-based smart healthcare applications. Adv. Sci. 2022, 9, 2103694.
13. Hutabarat, Y.; Owaki, D.; Hayashibe, M. Recent advances in quantitative gait analysis using wearable sensors: A review. IEEE Sens. J. 2021, 21, 26470–26487.
14. Li, W.; Liu, H.; Tang, H.; Wang, P.; Van Gool, L. MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13147–13156.
15. Zhang, J.; Tu, Z.; Yang, J.; Chen, Y.; Yuan, J. MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13232–13242.
16. Wishaupt, K.; Schallig, W.; van Dorst, M.H.; Buizer, A.I.; van der Krogt, M.M. The applicability of markerless motion capture for clinical gait analysis in children with cerebral palsy. Sci. Rep. 2024, 14, 11910.
17. Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv 2021, arXiv:2111.00396.
18. Gu, A.; Goel, K.; Gupta, A.; Ré, C. On the parameterization and initialization of diagonal state space models. Adv. Neural Inf. Process. Syst. 2022, 35, 35971–35983.
19. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752.
20. Tu, H.; Wang, C.; Zeng, W. Voxelpose: Towards multi-camera 3D human pose estimation in wild environment. In Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part I 16; Springer: Cham, Switzerland, 2020; pp. 197–212.
21. Shuai, H.; Wu, L.; Liu, Q. Adaptive multi-view and temporal fusing transformer for 3D human pose estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4122–4135.
22. Xu, J.; Yu, Z.; Ni, B.; Yang, J.; Yang, X.; Zhang, W. Deep Kinematics Analysis for Monocular 3D Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–20 June 2020.
23. Baumgartner, T.; Klatt, S. Monocular 3D Human Pose Estimation for Sports Broadcasts Using Partial Sports Field Registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Vancouver, BC, Canada, 18–22 June 2023; pp. 5109–5118.
24. Sun, X.; Xiao, B.; Wei, F.; Liang, S.; Wei, Y. Integral Human Pose Regression. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
25. Yang, H.; Liu, H.; Zhang, Y.; Wu, X. Hierarchical parallel multi-scale graph network for 3D human pose estimation. Appl. Soft Comput. 2023, 140, 110267.
26. Nie, Q.; Liu, Z.; Liu, Y. Lifting 2d human pose to 3D with domain adapted 3D body concept. Int. J. Comput. Vis. 2023, 131, 1250–1268.
27. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 172–186.
28. Lea, C.; Flynn, M.D.; Vidal, R.; Reiter, A.; Hager, G.D. Temporal Convolutional Networks for Action Segmentation and Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
29. Andrade-Ambriz, Y.A.; Ledesma, S.; Ibarra-Manzano, M.A.; Oros-Flores, M.I.; Almanza-Ojeda, D.L. Human activity recognition using temporal convolutional neural network architecture. Expert Syst. Appl. 2022, 191, 116287.
30. Zou, Z.; Tang, W. Modulated Graph Convolutional Network for 3D Human Pose Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 11477–11487.
31. Yu, B.X.; Zhang, Z.; Liu, Y.; Zhong, S.h.; Liu, Y.; Chen, C.W. GLA-GCN: Global-local Adaptive Graph Convolutional Network for 3D Human Pose Estimation from Monocular Video. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; pp. 8818–8829.
32. Wang, L.; Guo, D.; Wu, H.; Li, K.; Yu, W. TC-GCN: Triple cross-attention and graph convolutional network for traffic forecasting. Inf. Fusion 2024, 105, 102229.
33. Tang, Z.; Qiu, Z.; Hao, Y.; Hong, R.; Yao, T. 3D Human Pose Estimation with Spatio-Temporal Criss-Cross Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 4790–4799.
34. Zhao, Q.; Zheng, C.; Liu, M.; Wang, P.; Chen, C. PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 8877–8886.
35. Mehraban, S.; Adeli, V.; Taati, B. Motionagformer: Enhancing 3D human pose estimation with a transformer-gcnformer network. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 6920–6930.
36. Fournier, Q.; Caron, G.M.; Aloise, D. A practical survey on faster and lighter transformers. ACM Comput. Surv. 2023, 55, 1–40.
37. Nguyen, E.; Goel, K.; Gu, A.; Downs, G.; Shah, P.; Dao, T.; Baccus, S.; Ré, C. S4nd: Modeling images and videos as multidimensional signals with state spaces. Adv. Neural Inf. Process. Syst. 2022, 35, 2846–2861.
38. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. arXiv 2024, arXiv:2401.09417.
39. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063.
40. Mondal, A.; Alletto, S.; Tome, D. HumMUSS: Human Motion Understanding using State Space Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 2318–2330.
41. Zhang, J.; Hou, J.; He, Q.; Yuan, Z.; Xue, H. MambaPose: A Human Pose Estimation Based on Gated Feedforward Network and Mamba. Sensors 2024, 24, 8158.
42. Yin, H.; You, S.; Han, J.; Chen, Z. Sequential Joint Dependency Aware Human Pose Estimation with State Space Model. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 9499–9507.
43. Balakrishnan, A.; Medikonda, J.; Namboothiri, P.K. Role of wearable sensors with machine learning approaches in gait analysis for Parkinson’s disease assessment: A review. Eng. Sci. 2022, 19, 5–19.
44. Sethi, D.; Bharti, S.; Prakash, C. A comprehensive survey on gait analysis: History, parameters, approaches, pose estimation, and future work. Artif. Intell. Med. 2022, 129, 102314.
45. Zheng, C.; Wu, W.; Chen, C.; Yang, T.; Zhu, S.; Shen, J.; Kehtarnavaz, N.; Shah, M. Deep learning-based human pose estimation: A survey. ACM Comput. Surv. 2023, 56, 1–37.
46. Needham, L.; Evans, M.; Cosker, D.P.; Wade, L.; McGuigan, P.M.; Bilzon, J.L.; Colyer, S.L. The accuracy of several pose estimation methods for 3D joint centre localisation. Sci. Rep. 2021, 11, 20673.
47. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 1325–1339.
48. Huang, Y.; Liu, J.; Xian, K.; Qiu, R.C. Posemamba: Monocular 3D human pose estimation with bidirectional global-local spatio-temporal state space model. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 3842–3850.

Figure 1. Overview of the proposed Pose3DM model and the gait analysis framework. The system combines monocular RGB video and motion capture data to estimate 3D human poses, which are then used for detailed gait analysis, including key joint angle measurements and gait parameter extraction.

Figure 2. Detailed architecture of the proposed Pose3DM model. (a) Overall encoder–decoder architecture of Pose3DM, employing cascaded Pose Blocks, downsampling (Down), and upsampling (UP) operations. It processes a 2D pose sequence to estimate a 3D pose sequence. (b) Structure of a Pose Block, which sequentially applies a Spatial Mamba Block and a Temporal Mamba Block. (c) Architecture of the Bidirectional Mamba module, the core of each Mamba Block, utilizing forward and backward convolutions and SSMs.
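
The forward and backward passes described in panel (c) can be illustrated with a deliberately simplified sketch. It is not the authors' module: it assumes a fixed-parameter linear recurrence (scalars `a`, `b`, `c` are hypothetical) in place of Mamba's input-dependent selective scan, omits the convolutions and gating, and assumes the two directions are merged by summation. It only shows why each output frame can see both past and future context at a cost linear in the sequence length.

```python
import numpy as np

def linear_scan(x, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c * h_t along axis 0."""
    h = np.zeros_like(x[0], dtype=float)
    y = np.empty_like(x, dtype=float)
    for t in range(x.shape[0]):  # one pass over the sequence: O(T)
        h = a * h + b * x[t]
        y[t] = c * h
    return y

def bidirectional_scan(x, a=0.9, b=1.0, c=1.0):
    """Sum a forward scan and a time-reversed scan (assumed merge rule)."""
    fwd = linear_scan(x, a, b, c)
    bwd = linear_scan(x[::-1], a, b, c)[::-1]  # scan the reversed sequence, then flip back
    return fwd + bwd

# A single impulse in the middle frame spreads symmetrically in both directions.
x = np.zeros((5, 1))
x[2] = 1.0
response = bidirectional_scan(x, a=0.5)[:, 0]  # [0.25, 0.5, 2.0, 0.5, 0.25]
```

A forward-only scan would leave the first two frames at zero, which is the intuition behind running the scan in both directions.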

Figure 3. Human gait cycle and key parameters.

Figure 4. Box plots illustrating the distribution of key lower limb joint angles across six distinct gait patterns. The figure shows the angle distributions for the left and right hip, knee, and ankle joints for AIG, CIG, HIG, KIG, NW, and TIG.

Figure 5. Performance comparison of different methods on various clinical gait patterns. The figure presents the Mean Per Joint Position Error (MPJPE), Standard Error (Std Error), and Maximum Error (Max Error) for each gait pattern: (a) AIG, (b) CIG, (c) HIG, (d) KIG, (e) NW, and (f) TIG. Asterisks (*, **, ***) denote statistical significance at the p < 0.05, p < 0.01, and p < 0.001 levels, respectively, comparing Pose3DM-L with other methods for MPJPE.
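
As a rough illustration of the three summaries named in this caption, the sketch below computes them from predicted and reference joint positions. Several choices are assumptions rather than details taken from the paper: the array layout `(frames, joints, 3)`, the reading of "Std Error" as the standard deviation of the per-joint errors, and the helper names `summarize` and `stars`. The star mapping simply mirrors the thresholds listed in the caption.

```python
import numpy as np

def joint_errors(pred, gt):
    """Euclidean distance per frame and joint; inputs have shape (T, J, 3)."""
    return np.linalg.norm(pred - gt, axis=-1)

def summarize(pred, gt):
    """Mean, spread, and worst case of the per-joint position errors."""
    err = joint_errors(pred, gt)
    return {
        "mpjpe": float(err.mean()),  # mean per joint position error
        "std": float(err.std()),     # assumed meaning of "Std Error"
        "max": float(err.max()),     # largest single-joint error
    }

def stars(p):
    """Map a p-value to the asterisk notation used in the caption."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

# Toy check: every joint is displaced by the vector (3, 4, 0), i.e. by 5 units.
gt = np.zeros((2, 3, 3))
pred = gt + np.array([3.0, 4.0, 0.0])
summary = summarize(pred, gt)  # mpjpe 5.0, std 0.0, max 5.0
```

The statistical test that produces the p-values is not specified in this caption, so it is left out of the sketch.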

Figure 6. Frequency distribution of errors for key lower limb joints across different methods and gait patterns. The figure shows the distribution of 3D joint position errors for the ankle, knee, and hip joints, aggregated over all six clinical gait patterns. Each subplot represents a different method, illustrating the overall distribution of errors for the specified joints.

Figure 7. Comparison of gait parameters between monocular camera estimation (Pose3DM) and motion capture camera measurements for individual participants across six groups. Each panel shows scatter plots for different gait parameters: Gait Cycle (s), Stance Time (s), Swing Time (s), Step Length (m), and Pace of Step (m/s). The diagonal line represents perfect agreement between the two methods.

Figure 8. Qualitative comparisons between Pose3DM, MotionBERT, and PoseMamba on Human3.6M. The green circles highlight regions where our method outperforms the baselines.

Figure 9. Qualitative comparisons between Pose3DM, MotionBERT, and PoseMamba on AUST-VisGait. The green circles highlight regions where our method outperforms the baselines.

Figure 10. Two-dimensional–three-dimensional consistency visualization. (a) Human3.6M: 2D inputs (top) and 3D reconstructions (bottom) for frames 985–987; orange dashed arrows denote temporal order. Green dashed circles mark occluded or low-confidence 2D keypoints; black arrows and solid circles indicate regions of frame-to-frame jitter observable in 3D. (b) AUST-VisGait: the yellow circle and arrow highlight an ipsilateral deficit visible in both 2D and 3D.

Table 1. Comparison of model attributes and qualitative advantages.

Method	Architecture	Temporal Modeling	Pose-Structure Adaptation	Clinical Deployability	Key Qualitative Advantage of Pose3DM
MixSTE [15]	Transformer	Bidirectional spatio-temporal (attention, quadratic)	Generic tokenization	Heavy on long clips	Linear-time SSM; better long-sequence scalability with stronger temporal coherence.
MHFormer [14]	Transformer	Multi-hypothesis temporal (attention, quadratic)	Partial pose priors	Quadratic cost persists	Linear complexity with bidirectional context; reduced computation without sacrificing accuracy.
PoseFormerV2 [34]	Transformer	Long-range via attention (quadratic)	Generic to pose	Costly for long sequences	Comparable long-range modeling with linear-time efficiency.
GLA-GCN [31]	GCN (+temporal)	Local-to-global aggregation (GCN, linear per step)	Strong graph prior	Moderate efficiency	Captures long-range spatio-temporal dynamics more effectively while remaining linear-time.
MotionBERT [5]	Transformer	Rich bidirectional motion context (attention, quadratic)	Generic to pose	High compute/memory	Maintains accuracy with far lower compute/params; improved temporal smoothness via FTV.
MotionAGFormer-L [35]	Transformer + GCN	Augmented temporal aggregation (attention, quadratic)	Hybrid priors	Higher MACs in practice	Better compute–accuracy trade-off (linear-time SSM) with bidirectional modeling.
PoseMamba-L [48]	SSM (Mamba)	Typically bidirectional (SSM, linear)	Generic SSM blocks	Efficient	Extends the SSM approach with explicit bidirectional modeling and FTV regularization for superior temporal coherence in gait analysis.

Table 2. Frame counts by gait pattern in the AUST-VisGait dataset.

Pattern	NW	AIG	KIG	HIG	CIG	TIG
Frames	33,600	65,680	64,640	63,920	58,240	59,280

Table 3. Architecture configurations of Pose3DM variants. $N_{B}$: Number of layers. $d_{m}^{†}$: Feature dimension at the final downsampling stage. T: Number of input frames.

Model	T	$N_{B}$	$d_{m}^{†}$	Params	MACs
Pose3DM-S	243	1	128	0.505 M	2.097 G
Pose3DM-B	243	1	256	1.979 M	8.201 G
Pose3DM-L	243	2	256	7.430 M	30.765 G
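
The totals in this table relate to the MACs/Frame column of Table 4 by a simple division by the number of input frames. The following sketch reproduces that conversion; the function name `macs_per_frame_m` is hypothetical, and rounding to the nearest million is an assumption about how the per-frame values were reported.

```python
T = 243  # input frames, identical for all three variants

# Total MACs in G (1e9 multiply-accumulates), copied from Table 3.
TOTAL_MACS_G = {
    "Pose3DM-S": 2.097,
    "Pose3DM-B": 8.201,
    "Pose3DM-L": 30.765,
}

def macs_per_frame_m(total_g, frames):
    """Convert total MACs in G to MACs per frame in M (1e6)."""
    return total_g * 1000.0 / frames

per_frame = {name: round(macs_per_frame_m(g, T)) for name, g in TOTAL_MACS_G.items()}
# {'Pose3DM-S': 9, 'Pose3DM-B': 34, 'Pose3DM-L': 127}
```

Under these assumptions the results agree with the 9 M, 34 M, and 127 M listed for the three variants in Table 4.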

Table 4. Quantitative comparisons on Human3.6M. T: Number of input frames. Seq2Seq: Estimating complete 3D pose sequences from input frames. P1: MPJPE error (mm). P2: P-MPJPE error (mm). The best results are highlighted in bold, while the second-best results are underlined.

Model	T	Seq2Seq	Params	MACs	MACs/Frame	P1 ↓	P2 ↓
MHFormer [14]	351	×	30.9 M	7.0 G	20 M	43.0	34.4
MixSTE [15]	243	✓	33.6 M	139.0 G	572 M	40.9	32.6
MotionBERT [5]	243	✓	42.3 M	174.8 G	719 M	39.2	32.9
STCFormer-L [33]	243	✓	18.9 M	78.2 G	321 M	40.5	31.8
PoseFormerV2 [34]	243	×	14.4 M	4.8 G	20 M	45.2	35.6
GLA-GCN [31]	243	×	1.3 M	1.5 G	6 M	44.4	34.8
MotionAGFormer-L [35]	243	✓	19.0 M	78.3 G	322 M	38.4	32.5
PoseMamba-S [48]	243	✓	0.9 M	3.6 G	15 M	41.8	35.0
PoseMamba-B [48]	243	✓	3.4 M	13.9 G	57 M	40.8	34.3
PoseMamba-L [48]	243	✓	6.7 M	27.9 G	115 M	38.1	32.5
Pose3DM-S	243	✓	0.5 M	2.1 G	9 M	42.1	35.0
Pose3DM-B	243	✓	2.0 M	8.2 G	34 M	40.3	33.9
Pose3DM-L	243	✓	7.4 M	30.8 G	127 M	37.9	32.1
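
The percentage figures quoted in the abstract can be recomputed from the rows above. The sketch below is only an arithmetic check; the helper name `relative_reduction` is hypothetical, and the inputs are the rounded values printed in the table, so the last digit may differ slightly from values computed with unrounded numbers.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# Pose3DM-L versus MotionBERT (values from Table 4).
p1_gain = relative_reduction(39.2, 37.9)        # MPJPE, mm
p2_gain = relative_reduction(32.9, 32.1)        # P-MPJPE, mm
param_saving = relative_reduction(42.3, 7.4)    # parameters, M
macs_saving = relative_reduction(719.0, 127.0)  # MACs per frame, M

# Pose3DM-L versus MotionAGFormer-L, MACs per frame in M.
agformer_saving = relative_reduction(322.0, 127.0)

for label, value in [
    ("P1 vs. MotionBERT", p1_gain),
    ("P2 vs. MotionBERT", p2_gain),
    ("Params vs. MotionBERT", param_saving),
    ("MACs/frame vs. MotionBERT", macs_saving),
    ("MACs/frame vs. MotionAGFormer-L", agformer_saving),
]:
    print(f"{label}: {value:.1f}%")
```

Rounded to one decimal place, these give 3.3%, 2.4%, 82.5%, 82.3%, and 60.6%, matching the abstract.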

Table 5. Quantitative evaluation of joint position estimation errors (in mm) for various methods across distinct gait patterns. The table presents results for the left and right ankle, knee, and hip joints under different walking modes: AIG, CIG, HIG, KIG, NW, and TIG. The best results are highlighted in bold, while the second-best results are underlined.

Mode	Method	Left Ankle	Left Knee	Left Hip	Right Ankle	Right Knee	Right Hip
AIG	MixSTE [15]	0.097	0.097	0.078	0.139	0.078	0.021
	MotionBERT [5]	0.052	0.046	0.026	0.031	0.033	0.018
	MotionAGFormer-L [35]	0.041	0.034	0.031	0.079	0.038	0.014
	PoseMamba-L [48]	0.081	0.040	0.019	0.055	0.045	0.018
	Pose3DM-L	0.045	0.026	0.019	0.048	0.041	0.013
CIG	MixSTE [15]	0.071	0.076	0.072	0.137	0.116	0.013
	MotionBERT [5]	0.076	0.032	0.006	0.017	0.032	0.005
	MotionAGFormer-L [35]	0.035	0.032	0.020	0.066	0.039	0.011
	PoseMamba-L [48]	0.079	0.066	0.022	0.061	0.086	0.012
	Pose3DM-L	0.025	0.021	0.018	0.041	0.068	0.009
HIG	MixSTE [15]	0.049	0.059	0.056	0.086	0.076	0.009
	MotionBERT [5]	0.025	0.016	0.013	0.014	0.016	0.003
	MotionAGFormer-L [35]	0.018	0.033	0.016	0.047	0.033	0.005
	PoseMamba-L [48]	0.023	0.009	0.020	0.051	0.081	0.013
	Pose3DM-L	0.015	0.008	0.019	0.018	0.028	0.005
KIG	MixSTE [15]	0.075	0.093	0.075	0.133	0.072	0.024
	MotionBERT [5]	0.040	0.023	0.018	0.026	0.033	0.010
	MotionAGFormer-L [35]	0.032	0.036	0.024	0.077	0.046	0.013
	PoseMamba-L [48]	0.060	0.044	0.020	0.079	0.073	0.011
	Pose3DM-L	0.024	0.021	0.019	0.045	0.044	0.010
NW	MixSTE [15]	0.055	0.056	0.055	0.100	0.096	0.008
	MotionBERT [5]	0.006	0.008	0.006	0.021	0.067	0.005
	MotionAGFormer-L [35]	0.009	0.020	0.009	0.025	0.058	0.006
	PoseMamba-L [48]	0.019	0.010	0.004	0.053	0.068	0.008
	Pose3DM-L	0.004	0.009	0.019	0.032	0.026	0.007
TIG	MixSTE [15]	0.090	0.101	0.079	0.133	0.080	0.023
	MotionBERT [5]	0.044	0.023	0.026	0.036	0.040	0.013
	MotionAGFormer-L [35]	0.039	0.036	0.026	0.091	0.051	0.019
	PoseMamba-L [48]	0.090	0.061	0.019	0.065	0.057	0.018
	Pose3DM-L	0.039	0.028	0.022	0.060	0.052	0.012

Table 6. Comparison of key gait parameters across AIG, CIG, HIG, KIG, NW, and TIG groups. Values are presented as mean ± standard deviation.

Parameters	AIG	CIG	HIG	KIG	NW	TIG
Gait Cycle (s)	0.82 ± 0.13	0.86 ± 0.11	0.84 ± 0.12	0.86 ± 0.13	0.91 ± 0.12	0.84 ± 0.13
Step Frequency (steps/min)	74.5 ± 10.5	70.9 ± 9.10	72.7 ± 9.90	71.1 ± 10.2	67.9 ± 9.10	73.1 ± 10.2
Step Length (m)	0.22 ± 0.04	0.23 ± 0.04	0.19 ± 0.04	0.23 ± 0.04	0.62 ± 0.06	0.22 ± 0.03
Pace of Step (m/s)	0.53 ± 0.04	0.52 ± 0.05	0.44 ± 0.05	0.53 ± 0.05	1.39 ± 0.19	0.52 ± 0.03
Stance Time (s)	0.45 ± 0.07	0.47 ± 0.06	0.46 ± 0.07	0.47 ± 0.07	0.56 ± 0.08	0.46 ± 0.07
Swing Time (s)	0.37 ± 0.06	0.39 ± 0.05	0.38 ± 0.05	0.39 ± 0.06	0.34 ± 0.05	0.38 ± 0.06
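
One quick sanity check on these temporal parameters is that stance time and swing time should add up to the gait cycle duration. The sketch below applies that check to the group means in the table and also derives the stance share of the cycle. The dictionary and function names are hypothetical, and because the inputs are rounded means, a residual of about 0.01 s is expected.

```python
# (gait cycle, stance time, swing time) group means in seconds, copied from Table 6.
GAIT_MEANS = {
    "AIG": (0.82, 0.45, 0.37),
    "CIG": (0.86, 0.47, 0.39),
    "HIG": (0.84, 0.46, 0.38),
    "KIG": (0.86, 0.47, 0.39),
    "NW": (0.91, 0.56, 0.34),
    "TIG": (0.84, 0.46, 0.38),
}

def timing_check(cycle, stance, swing):
    """Compare stance + swing with the reported cycle and compute the stance share."""
    total = stance + swing
    return {
        "residual_s": cycle - total,            # should be close to zero
        "stance_pct": 100.0 * stance / total,   # stance phase as a share of the cycle
    }

report = {group: timing_check(*values) for group, values in GAIT_MEANS.items()}

for group, row in report.items():
    print(f"{group}: residual {row['residual_s']:+.2f} s, stance {row['stance_pct']:.1f}%")
```

On these means the residual never exceeds about 0.01 s, and normal walking (NW) shows the largest stance share, roughly 62% compared with about 55% for the other five patterns.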

Table 7. Ablation study of each component in our proposed Pose3DM model.

Model	Params (M)	MACs (G)	MACs/Frame (M)	MPJPE (mm)	P-MPJPE (mm)
Full Model (Baseline)	1.979	8.201	34	40.3	33.9
w/o Spatial Mamba	1.039	4.303	18	42.0	35.1
w/o Temporal Mamba	1.039	4.303	18	43.2	35.4
w/o Bidirectional (Fwd.)	1.979	8.201	34	42.4	35.2
w/o Bidirectional (Bwd.)	1.979	8.201	34	42.7	35.3
w/o Skip Conn.	1.979	8.201	34	42.2	34.8
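
To make the ablation easier to read, the error increase of each variant relative to the full model can be tabulated directly. This sketch uses only the MPJPE and P-MPJPE columns above; the variable names are hypothetical and the row labels are copied as printed.

```python
FULL = (40.3, 33.9)  # (MPJPE, P-MPJPE) of the full model, mm

ABLATIONS = {
    "w/o Spatial Mamba": (42.0, 35.1),
    "w/o Temporal Mamba": (43.2, 35.4),
    "w/o Bidirectional (Fwd.)": (42.4, 35.2),
    "w/o Bidirectional (Bwd.)": (42.7, 35.3),
    "w/o Skip Conn.": (42.2, 34.8),
}

# Error increase in mm when a component is removed.
deltas = {
    name: (round(mpjpe - FULL[0], 1), round(pmpjpe - FULL[1], 1))
    for name, (mpjpe, pmpjpe) in ABLATIONS.items()
}

most_harmful = max(deltas, key=lambda name: deltas[name][0])

for name, (d_p1, d_p2) in deltas.items():
    print(f"{name}: +{d_p1} mm MPJPE, +{d_p2} mm P-MPJPE")
```

Every ablation raises MPJPE by between 1.7 mm and 2.9 mm, and removing the Temporal Mamba block is the most costly change on both metrics.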

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, C.; Su, W.; Li, J.; Xu, J. Bidirectional Mamba-Enhanced 3D Human Pose Estimation for Accurate Clinical Gait Analysis. Fractal Fract. 2025, 9, 603. https://doi.org/10.3390/fractalfract9090603
