Figure 1.
Overview of the proposed Pose3DM model and the gait analysis framework. The system combines monocular RGB video and motion capture data to estimate 3D human poses, which are then used for detailed gait analysis, including key joint angle measurements and gait parameter extraction.
Figure 2.
Detailed architecture of the proposed Pose3DM model. (a) Overall encoder–decoder architecture of Pose3DM, employing cascaded Pose Blocks, downsampling (Down), and upsampling (UP) operations. It processes a 2D pose sequence to estimate a 3D pose sequence. (b) Structure of a Pose Block, which sequentially applies a Spatial Mamba Block and a Temporal Mamba Block. (c) Architecture of the Bidirectional Mamba module, the core of each Mamba Block, utilizing forward and backward convolutions and SSMs.
Figure 3.
Human gait cycle and key parameters.
Figure 4.
Box plots illustrating the distribution of key lower limb joint angles across six distinct gait patterns. The figure shows the angle distributions for the left and right hip, knee, and ankle joints for AIG, CIG, HIG, KIG, NW, and TIG.
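As an illustration of how a joint angle like those in Figure 4 could be obtained from estimated 3D keypoints, the sketch below computes the angle at a joint from the two adjacent segment vectors. The source does not state its angle convention, so the function name `joint_angle_deg` and the definition used here (the included angle between the two segments, 180° for a fully extended limb) are assumptions, not the authors' method.

```python
import numpy as np

def joint_angle_deg(proximal, joint, distal):
    """Included angle (degrees) at `joint` between the segments to `proximal` and `distal`.

    Hypothetical helper: for the knee, pass the hip, knee, and ankle 3D positions.
    """
    v1 = np.asarray(proximal, dtype=float) - np.asarray(joint, dtype=float)
    v2 = np.asarray(distal, dtype=float) - np.asarray(joint, dtype=float)
    cos_angle = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Clip to guard against round-off pushing the cosine slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

# Made-up coordinates (metres) for a slightly flexed knee.
hip, knee, ankle = [0.0, 0.9, 0.0], [0.0, 0.5, 0.05], [0.0, 0.1, 0.0]
knee_angle = joint_angle_deg(hip, knee, ankle)
```

Applying such a function frame by frame to each hip, knee, and ankle triplet would give the per-joint angle samples that a box plot summarizes.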
Figure 5.
Performance comparison of different methods on various clinical gait patterns. The figure presents the Mean Per Joint Position Error (MPJPE), Standard Error (Std Error), and Maximum Error (Max Error) for each gait pattern: (a) AIG, (b) CIG, (c) HIG, (d) KIG, (e) NW, and (f) TIG. Asterisks (*, **, ***) denote increasing levels of statistical significance for MPJPE when comparing Pose3DM-L with each of the other methods.
Figure 6.
Frequency distribution of errors for key lower limb joints across different methods and gait patterns. The figure shows the distribution of 3D joint position errors for the ankle, knee, and hip joints, aggregated over all six clinical gait patterns. Each subplot represents a different method, illustrating the overall distribution of errors for the specified joints.
Figure 7.
Comparison of gait parameters between monocular camera estimation (Pose3DM) and motion capture camera measurements for individual participants across six groups. Each panel shows scatter plots for different gait parameters: Gait Cycle (s), Stance Time (s), Swing Time (s), Step Length (m), and Pace of Step (m/s). The diagonal line represents perfect agreement between the two methods.
Figure 8.
Qualitative comparisons between Pose3DM, MotionBERT, and PoseMamba on Human3.6M. The green circles highlight regions where our method outperforms the baselines.
Figure 9.
Qualitative comparisons between Pose3DM, MotionBERT, and PoseMamba on AUST-VisGait. The green circles highlight regions where our method outperforms the baselines.
Figure 10.
Two-dimensional–three-dimensional consistency visualization. (a) Human3.6M: 2D inputs (top) and 3D reconstructions (bottom) for frames 985–987; orange dashed arrows denote temporal order. Green dashed circles mark occluded or low-confidence 2D keypoints; black arrows and solid circles indicate regions of frame-to-frame jitter observable in 3D. (b) AUST-VisGait: the yellow circle and arrow highlight an ipsilateral deficit visible in both 2D and 3D.
Table 1.
Comparison of model attributes and qualitative advantages.
| Method | Architecture | Temporal Modeling | Pose-Structure Adaptation | Clinical Deployability | Key Qualitative Advantage of Pose3DM |
|---|---|---|---|---|---|
| MixSTE [15] | Transformer | Bidirectional spatio-temporal (attention, quadratic) | Generic tokenization | Heavy on long clips | Linear-time SSM; better long-sequence scalability with stronger temporal coherence. |
| MHFormer [14] | Transformer | Multi-hypothesis temporal (attention, quadratic) | Partial pose priors | Quadratic cost persists | Linear complexity with bidirectional context; reduced computation without sacrificing accuracy. |
| PoseFormerV2 [34] | Transformer | Long-range via attention (quadratic) | Generic to pose | Costly for long sequences | Comparable long-range modeling with linear-time efficiency. |
| GLA-GCN [31] | GCN (+temporal) | Local-to-global aggregation (GCN, linear per step) | Strong graph prior | Moderate efficiency | Captures long-range spatio-temporal dynamics more effectively while remaining linear-time. |
| MotionBERT [5] | Transformer | Rich bidirectional motion context (attention, quadratic) | Generic to pose | High compute/memory | Maintains accuracy with far lower compute/params; improved temporal smoothness via FTV. |
| MotionAGFormer-L [35] | Transformer + GCN | Augmented temporal aggregation (attention, quadratic) | Hybrid priors | Higher MACs in practice | Better compute–accuracy trade-off (linear-time SSM) with bidirectional modeling. |
| PoseMamba-L [48] | SSM (Mamba) | Typically bidirectional (SSM, linear) | Generic SSM blocks | Efficient | Extends the SSM approach with explicit bidirectional modeling and FTV regularization for superior temporal coherence in gait analysis. |
Table 2.
Frame counts by gait pattern in the AUST-VisGait dataset.
| Pattern | NW | AIG | KIG | HIG | CIG | TIG |
|---|---|---|---|---|---|---|
| Frames | 33,600 | 65,680 | 64,640 | 63,920 | 58,240 | 59,280 |
Table 3.
Architecture configurations of Pose3DM variants. T: number of input frames. Layers: number of layers. Feature dim.: feature dimension at the final downsampling stage.
| Model | T | Layers | Feature dim. | Params | MACs |
|---|---|---|---|---|---|
| Pose3DM-S | 243 | 1 | 128 | 0.505 M | 2.097 G |
| Pose3DM-B | 243 | 1 | 256 | 1.979 M | 8.201 G |
| Pose3DM-L | 243 | 2 | 256 | 7.430 M | 30.765 G |
Table 4.
Quantitative comparisons on Human3.6M. T: number of input frames. Seq2Seq: whether the model estimates the complete 3D pose sequence from the input frames. P1: MPJPE (mm). P2: P-MPJPE (mm). Best results are in bold; second-best results are underlined.
| Model | T | Seq2Seq | Params | MACs | MACs/Frame | P1 ↓ | P2 ↓ |
|---|---|---|---|---|---|---|---|
| MHFormer [14] | 351 | × | 30.9 M | 7.0 G | 20 M | 43.0 | 34.4 |
| MixSTE [15] | 243 | ✓ | 33.6 M | 139.0 G | 572 M | 40.9 | 32.6 |
| MotionBERT [5] | 243 | ✓ | 42.3 M | 174.8 G | 719 M | 39.2 | 32.9 |
| STCFormer-L [33] | 243 | ✓ | 18.9 M | 78.2 G | 321 M | 40.5 | 31.8 |
| PoseFormerV2 [34] | 243 | × | 14.4 M | 4.8 G | 20 M | 45.2 | 35.6 |
| GLA-GCN [31] | 243 | × | 1.3 M | 1.5 G | 6 M | 44.4 | 34.8 |
| MotionAGFormer-L [35] | 243 | ✓ | 19.0 M | 78.3 G | 322 M | 38.4 | 32.5 |
| PoseMamba-S [48] | 243 | ✓ | 0.9 M | 3.6 G | 15 M | 41.8 | 35.0 |
| PoseMamba-B [48] | 243 | ✓ | 3.4 M | 13.9 G | 57 M | 40.8 | 34.3 |
| PoseMamba-L [48] | 243 | ✓ | 6.7 M | 27.9 G | 115 M | 38.1 | 32.5 |
| Pose3DM-S | 243 | ✓ | 0.5 M | 2.1 G | 9 M | 42.1 | 35.0 |
| Pose3DM-B | 243 | ✓ | 2.0 M | 8.2 G | 34 M | 40.3 | 33.9 |
| Pose3DM-L | 243 | ✓ | 7.4 M | 30.8 G | 127 M | 37.9 | 32.1 |
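The two metrics in Table 4 can be sketched in a few lines. The version below follows the common definitions: MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and P-MPJPE is the same error after a per-frame similarity (Procrustes) alignment of the prediction to the ground truth. The function names and the (frames, joints, 3) array layout are assumptions made for illustration; the source does not give its evaluation code or state whether poses are root-aligned beforehand.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error for arrays of shape (frames, joints, 3)."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def procrustes_align(pred, gt):
    """Align one pose (joints, 3) to gt with the best scale, rotation, and translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # SVD of the cross-covariance gives the optimal rotation (Kabsch/Umeyama).
    u, s, vt = np.linalg.svd(p.T @ g)
    d = np.ones(3)
    d[-1] = np.sign(np.linalg.det(u @ vt))  # flip the last axis to exclude reflections
    rot = u @ np.diag(d) @ vt
    scale = (s * d).sum() / (p ** 2).sum()
    return scale * p @ rot + mu_g

def p_mpjpe(pred, gt):
    """MPJPE after aligning every predicted frame to its ground-truth frame."""
    aligned = np.stack([procrustes_align(p, g) for p, g in zip(pred, gt)])
    return mpjpe(aligned, gt)
```

Because the alignment removes global scale, rotation, and translation, P-MPJPE is never larger than MPJPE on the same prediction, which matches the P2 values being lower than the P1 values in every row.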
Table 5.
Quantitative evaluation of joint position estimation errors (in mm) for various methods across distinct gait patterns. The table presents results for the left and right ankle, knee, and hip joints under different walking modes: AIG, CIG, HIG, KIG, NW, and TIG. The best results are highlighted in bold, while the second-best results are underlined.
| Mode | Method | Left Ankle | Left Knee | Left Hip | Right Ankle | Right Knee | Right Hip |
|---|---|---|---|---|---|---|---|
| AIG | MixSTE [15] | 0.097 | 0.097 | 0.078 | 0.139 | 0.078 | 0.021 |
| | MotionBERT [5] | 0.052 | 0.046 | 0.026 | 0.031 | 0.033 | 0.018 |
| | MotionAGFormer-L [35] | 0.041 | 0.034 | 0.031 | 0.079 | 0.038 | 0.014 |
| | PoseMamba-L [48] | 0.081 | 0.040 | 0.019 | 0.055 | 0.045 | 0.018 |
| | Pose3DM-L | 0.045 | 0.026 | 0.019 | 0.048 | 0.041 | 0.013 |
| CIG | MixSTE [15] | 0.071 | 0.076 | 0.072 | 0.137 | 0.116 | 0.013 |
| | MotionBERT [5] | 0.076 | 0.032 | 0.006 | 0.017 | 0.032 | 0.005 |
| | MotionAGFormer-L [35] | 0.035 | 0.032 | 0.020 | 0.066 | 0.039 | 0.011 |
| | PoseMamba-L [48] | 0.079 | 0.066 | 0.022 | 0.061 | 0.086 | 0.012 |
| | Pose3DM-L | 0.025 | 0.021 | 0.018 | 0.041 | 0.068 | 0.009 |
| HIG | MixSTE [15] | 0.049 | 0.059 | 0.056 | 0.086 | 0.076 | 0.009 |
| | MotionBERT [5] | 0.025 | 0.016 | 0.013 | 0.014 | 0.016 | 0.003 |
| | MotionAGFormer-L [35] | 0.018 | 0.033 | 0.016 | 0.047 | 0.033 | 0.005 |
| | PoseMamba-L [48] | 0.023 | 0.009 | 0.020 | 0.051 | 0.081 | 0.013 |
| | Pose3DM-L | 0.015 | 0.008 | 0.019 | 0.018 | 0.028 | 0.005 |
| KIG | MixSTE [15] | 0.075 | 0.093 | 0.075 | 0.133 | 0.072 | 0.024 |
| | MotionBERT [5] | 0.040 | 0.023 | 0.018 | 0.026 | 0.033 | 0.010 |
| | MotionAGFormer-L [35] | 0.032 | 0.036 | 0.024 | 0.077 | 0.046 | 0.013 |
| | PoseMamba-L [48] | 0.060 | 0.044 | 0.020 | 0.079 | 0.073 | 0.011 |
| | Pose3DM-L | 0.024 | 0.021 | 0.019 | 0.045 | 0.044 | 0.010 |
| NW | MixSTE [15] | 0.055 | 0.056 | 0.055 | 0.100 | 0.096 | 0.008 |
| | MotionBERT [5] | 0.006 | 0.008 | 0.006 | 0.021 | 0.067 | 0.005 |
| | MotionAGFormer-L [35] | 0.009 | 0.020 | 0.009 | 0.025 | 0.058 | 0.006 |
| | PoseMamba-L [48] | 0.019 | 0.010 | 0.004 | 0.053 | 0.068 | 0.008 |
| | Pose3DM-L | 0.004 | 0.009 | 0.019 | 0.032 | 0.026 | 0.007 |
| TIG | MixSTE [15] | 0.090 | 0.101 | 0.079 | 0.133 | 0.080 | 0.023 |
| | MotionBERT [5] | 0.044 | 0.023 | 0.026 | 0.036 | 0.040 | 0.013 |
| | MotionAGFormer-L [35] | 0.039 | 0.036 | 0.026 | 0.091 | 0.051 | 0.019 |
| | PoseMamba-L [48] | 0.090 | 0.061 | 0.019 | 0.065 | 0.057 | 0.018 |
| | Pose3DM-L | 0.039 | 0.028 | 0.022 | 0.060 | 0.052 | 0.012 |
Table 6.
Comparison of key gait parameters across AIG, CIG, HIG, KIG, NW, and TIG groups. Values are presented as mean ± standard deviation.
| Parameters | AIG | CIG | HIG | KIG | NW | TIG |
|---|---|---|---|---|---|---|
| Gait Cycle (s) | 0.82 ± 0.13 | 0.86 ± 0.11 | 0.84 ± 0.12 | 0.86 ± 0.13 | 0.91 ± 0.12 | 0.84 ± 0.13 |
| Step Frequency (steps/min) | 74.5 ± 10.5 | 70.9 ± 9.1 | 72.7 ± 9.9 | 71.1 ± 10.2 | 67.9 ± 9.1 | 73.1 ± 10.2 |
| Step Length (m) | 0.22 ± 0.04 | 0.23 ± 0.04 | 0.19 ± 0.04 | 0.23 ± 0.04 | 0.62 ± 0.06 | 0.22 ± 0.03 |
| Pace of Step (m/s) | 0.53 ± 0.04 | 0.52 ± 0.05 | 0.44 ± 0.05 | 0.53 ± 0.05 | 1.39 ± 0.19 | 0.52 ± 0.03 |
| Stance Time (s) | 0.45 ± 0.07 | 0.47 ± 0.06 | 0.46 ± 0.07 | 0.47 ± 0.07 | 0.56 ± 0.08 | 0.46 ± 0.07 |
| Swing Time (s) | 0.37 ± 0.06 | 0.39 ± 0.05 | 0.38 ± 0.05 | 0.39 ± 0.06 | 0.34 ± 0.05 | 0.38 ± 0.06 |
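A gait cycle is conventionally split into a stance phase and a swing phase, so the stance and swing times in Table 6 should sum to roughly the gait cycle duration. The short sketch below checks this with the group means copied from the table and derives the stance fraction of the cycle. The dictionary and function names are illustrative only, and working from rounded group means (rather than per-trial data) is an assumption that limits precision to about 0.01 s.

```python
# Group means copied from Table 6, all in seconds.
gait_cycle = {"AIG": 0.82, "CIG": 0.86, "HIG": 0.84, "KIG": 0.86, "NW": 0.91, "TIG": 0.84}
stance = {"AIG": 0.45, "CIG": 0.47, "HIG": 0.46, "KIG": 0.47, "NW": 0.56, "TIG": 0.46}
swing = {"AIG": 0.37, "CIG": 0.39, "HIG": 0.38, "KIG": 0.39, "NW": 0.34, "TIG": 0.38}

def cycle_residual(group):
    """Gait cycle minus (stance + swing); should be near zero up to rounding."""
    return gait_cycle[group] - (stance[group] + swing[group])

def stance_fraction(group):
    """Share of the stance-plus-swing duration spent in stance."""
    return stance[group] / (stance[group] + swing[group])

for group in gait_cycle:
    print(f"{group}: residual = {cycle_residual(group):+.2f} s, "
          f"stance fraction = {stance_fraction(group):.2f}")
```

Read this way, the NW group spends a larger share of the cycle in stance than the five other groups do.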
Table 7.
Ablation study of each component in our proposed Pose3DM model.
| Model | Params (M) | MACs (G) | MACs/Frame (M) | MPJPE (mm) | P-MPJPE (mm) |
|---|---|---|---|---|---|
| Full Model (Baseline) | 1.979 | 8.201 | 34 | 40.3 | 33.9 |
| w/o Spatial Mamba | 1.039 | 4.303 | 18 | 42.0 | 35.1 |
| w/o Temporal Mamba | 1.039 | 4.303 | 18 | 43.2 | 35.4 |
| w/o Bidirectional (Fwd.) | 1.979 | 8.201 | 34 | 42.4 | 35.2 |
| w/o Bidirectional (Bwd.) | 1.979 | 8.201 | 34 | 42.7 | 35.3 |
| w/o Skip Conn. | 1.979 | 8.201 | 34 | 42.2 | 34.8 |
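The MACs/Frame column appears to be the total MACs divided by the number of input frames. This relation is inferred from the tabulated values rather than stated in the source. The sketch below makes the unit conversion explicit, assuming G = 10^9, M = 10^6, and T = 243 frames for the baseline configuration; the function and variable names are illustrative.

```python
T = 243  # input frames, as listed for the Pose3DM variants in Table 3

def macs_per_frame_m(total_macs_g, frames=T):
    """Convert total MACs given in G (1e9) to MACs per frame in M (1e6)."""
    return total_macs_g * 1e9 / frames / 1e6

# Total MACs (G) for three of the ablation rows in Table 7.
table7_macs_g = {
    "Full Model (Baseline)": 8.201,
    "w/o Spatial Mamba": 4.303,
    "w/o Temporal Mamba": 4.303,
}

per_frame = {name: macs_per_frame_m(g) for name, g in table7_macs_g.items()}
for name, value in per_frame.items():
    print(f"{name}: {value:.2f} M MACs/frame")
```

Removing either Mamba block roughly halves the per-frame cost, in line with the halved parameter count in the same rows.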