1. Introduction
Despite significant progress in deep learning-based 3D human pose estimation [
1], recovering stable, accurate, and computationally efficient pose sequences from monocular videos remains a formidable challenge. Mainstream approaches predominantly adopt the “2D-to-3D lifting” paradigm [
2], which takes 2D pose detections as input and employs temporal convolutional networks (TCNs) [
3] or recurrent neural networks (RNNs) [
4] to capture inter-frame dependencies. However, these early architectures are hindered by limited receptive fields and vanishing gradients, making it difficult to effectively model long-range temporal correlations. The introduction of Transformer models fundamentally changed how we handle long-range dependencies. PoseFormer [
5] tracks global context through self-attention, and PoseFormerV2 [
6] later improved upon this by applying frequency-domain compression via discrete cosine transforms (DCTs), which significantly boosted accuracy. Nonetheless, in real-world scenarios featuring diverse and complex human motions, achieving high accuracy while significantly reducing computational overhead remains an open problem.
Leading contemporary techniques, such as PoseFormerV2, primarily depend on rigid computational architectures that process all input sequences with identical settings, thereby neglecting the natural spatial–temporal diversity inherent in human movements. Specifically, human movements exhibit substantial dynamic disparities: quasi-static actions (e.g., standing) contain considerable spatiotemporal redundancy, whereas vigorous motions (e.g., dancing) are rich in high-frequency details. Existing approaches with fixed temporal windows and frequency truncation strategies [
6] are incapable of adapting to such disparities. This leads to substantial computational redundancy when processing low-frequency actions, and critical detail loss due to insufficient spectral bandwidth when handling high-frequency actions [
7]. Furthermore, standard Transformer encoders struggle to distinguish informative features from padded regions, resulting in wasteful computation on null-element interactions. Similar inefficiencies have also motivated dynamic token sparsification and pruning strategies in efficient Transformer models [
8]. This mismatch between computational resource allocation and input-specific demands prevents reductions in theoretical FLOPs from translating linearly into actual inference speed (FPS) gains, thereby limiting deployment on resource-constrained edge devices.
In response to these challenges, the research community has pursued diverse directions. In spatial structure modeling, graph convolutional networks (GCNs) have been employed to explicitly capture joint topologies, as exemplified by ST-GCN [
9]; however, their receptive fields are constrained by predefined physical connections, hindering the modeling of long-range dependencies between non-adjacent joints. Subsequent Transformer variants, such as MixSTE [
10], enhance feature interaction through alternating spatial and temporal attention modules, while MHFormer [
11] addresses depth ambiguity via multi-hypothesis prediction strategies. Nevertheless, the computational complexity of these models grows quadratically with sequence length, leaving efficiency bottlenecks unresolved. In pursuit of efficiency optimization, frequency-domain analysis has emerged as a pivotal direction. PoseFormerV2 [
6] compresses long-range motion information into low-frequency representations via DCT, capturing global motion trends at minimal cost—marking the advent of joint spatiotemporal-frequency modeling in this field. However, this framework adopts a “fixed coefficient truncation” strategy, implicitly assuming that all actions follow an identical spectral distribution and thus failing to adapt processing bandwidth according to motion characteristics. Notably, adaptive computation has been widely explored in general video understanding, such as action recognition. Pioneering works like AdaFrame [
12] and DynamicViT [
13] dynamically drop redundant frames or uninformative spatial tokens to accelerate heavy RGB-based models. However, these token-dropping paradigms are fundamentally unsuitable for 3D Human Pose Estimation (HPE). 3D HPE relies heavily on the continuous temporal kinematics of sparse 2D coordinates; discretely dropping temporal or spatial tokens destroys this physical continuity, severely hindering spatial–temporal lifting and frequency-domain transformations (e.g., DCT). In contrast to these semantic-driven general methods, CAAPoseFormer is uniquely tailored for geometric sequence modeling. It preserves kinematic continuity by adaptively scaling contiguous temporal bounds and frequency-domain coefficients, ensuring mathematically stable efficiency for pose estimation [
8,
14]. In 3D human pose estimation, although pre-trained models such as MotionBERT [
15] have learned generalized skeleton representations, their inference processes remain confined to static graphs, lacking the capacity to respond instantaneously to the complexity of individual input samples. A few recent studies have explored incorporating physical kinematics, such as joint angular velocity and trajectory priors, to enhance motion representations [
16], yet none have tightly coupled fine-grained, learnable complexity measures with the automatic adjustment of core model parameters (e.g., temporal window length, frequency-domain coefficients). Evidently, constructing an end-to-end trainable complexity-aware module that is deeply integrated with the core parameters of spatiotemporal-frequency modeling constitutes the critical pathway from “fixed-parameter” to “adaptive-intelligence” paradigms, and stands as the central focus of this work.
To address the above limitations, this paper proposes a Complexity-Aware Adaptive PoseFormer (CAAPoseFormer) for 3D human pose estimation. Breaking away from static processing paradigms, the proposed framework aims to establish a real-time mapping between computational resource allocation and the intrinsic dynamic characteristics of input sequences. Specifically, CAAPoseFormer first quantifies the spatial–temporal complexity of the input sequence through a lightweight module. Based on the estimated complexity, it adaptively prunes the temporal window length and the number of retained DCT coefficients in an end-to-end manner, thereby enabling demand-driven computation allocation. For the variable-length sequences produced after pruning, a mask-guided sparse interaction mechanism is further designed to eliminate operations on invalid padded regions at the operator level.
The main contributions of this work are summarized as follows:
- (1)
A spatial–temporal coupled complexity quantification module is proposed. This module fuses spatial skeleton dispersion and temporal motion variance through a learnable weighted fusion strategy, enabling fine-grained hierarchical quantification of the complexity of arbitrary action sequences. It provides an interpretable basis for subsequent dynamic resource scheduling.
- (2)
A time–frequency dual-domain adaptive pruning strategy is introduced. Based on the estimated complexity score, the model dynamically adjusts both the temporal window span and the number of retained DCT coefficients, thereby reducing redundant information in both the time and frequency domains and enabling differentiated computational allocation for heterogeneous actions.
- (3)
A mask-guided sparse interaction encoding mechanism is developed. To handle variable-length sequences after pruning, dynamic attention masks are employed to strictly confine feature interaction to valid regions, thereby eliminating invalid computations caused by zero padding and improving actual inference throughput.
Through these innovations, CAAPoseFormer maintains high-precision predictions while effectively mitigating computational redundancy and significantly enhancing inference efficiency, offering a viable pathway for real-time deployment of 3D human pose estimation on resource-constrained edge devices.
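To make the mask-guided interaction of contribution (3) concrete, the following minimal PyTorch sketch shows how a dynamic key-padding mask confines dense self-attention to valid frames in a zero-padded batch; the function and tensor names are illustrative and stand in for the operator-level implementation described in Section 2.

```python
import torch

def mask_guided_attention(x, valid_lengths):
    """Simplified single-head self-attention restricted to valid frames.

    x: (B, T_max, d) zero-padded feature batch (used here as q, k, and v).
    valid_lengths: (B,) number of valid frames per sample.
    """
    B, T, d = x.shape
    # True marks padded positions that must not participate in attention.
    pad = torch.arange(T, device=x.device)[None, :] >= valid_lengths[:, None]
    scores = x @ x.transpose(1, 2) / d ** 0.5                    # (B, T, T)
    scores = scores.masked_fill(pad[:, None, :], float("-inf"))  # block padded keys
    out = scores.softmax(dim=-1) @ x
    # Zero padded query rows so padding never leaks into later layers.
    return out.masked_fill(pad[:, :, None], 0.0)
```

In the full model this dense masking is lowered to sparse kernels (cf. Section 2.5.3), but the semantics are identical: padded positions contribute neither keys nor outputs.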
3. Results and Discussion
3.1. Datasets and Evaluation Metrics
3.1.1. Dataset Selection
This study conducts validation on Human3.6M [
31], one of the most representative benchmark datasets for 3D human pose estimation. The dataset was captured by four synchronized cameras at 50 Hz and contains 3.6 million high-quality human pose images. The experiments cover 15 daily indoor activities (e.g., walking, discussion, eating) performed by 11 subjects. Following the standard protocol in the literature, data from subjects S1, S5, S6, S7, and S8 are used for training, while subjects S9 and S11 are reserved for evaluation.
For input processing, this work uses 2D keypoint sequences extracted by a CPN (Cascaded Pyramid Network) detector [
17] fine-tuned on Human3.6M as the raw inputs, which are then lifted to the 3D space for accuracy evaluation. To accommodate the proposed variable-length Transformer, the input sequences are dynamically sampled within an adaptive temporal window according to motion complexity.
3.1.2. Definition of Evaluation Metrics
We quantify reconstruction accuracy through two complementary protocols. The primary metric, Mean Per-Joint Position Error (MPJPE), computes the mean Euclidean distance (in mm) between predicted and ground-truth joint positions. However, absolute coordinates can sometimes obscure local structural fidelity. Therefore, we additionally track the Procrustes-aligned MPJPE (P-MPJPE), which neutralizes global similarity transformations (rotation, translation, and scale) to isolate and evaluate the true anatomical alignment of the predicted skeleton.
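For concreteness, both protocols can be computed as in the following NumPy sketch, which uses a standard orthogonal-Procrustes (similarity) alignment; the helper names are illustrative rather than taken from our codebase.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error in mm; pred, gt: (J, 3) arrays."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def p_mpjpe(pred, gt):
    """MPJPE after similarity (Procrustes) alignment of pred onto gt."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the cross-covariance (Kabsch/Procrustes).
    U, S, Vt = np.linalg.svd(P.T @ G)
    if np.linalg.det(U @ Vt) < 0:     # repair an improper rotation (reflection)
        U[:, -1] *= -1
        S[-1] *= -1
    R = U @ Vt
    scale = S.sum() / (P ** 2).sum()  # optimal isotropic scale
    aligned = scale * P @ R + mu_g    # rigidly aligned prediction
    return mpjpe(aligned, gt)
```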
For efficiency and model characterization, a complexity-awareness indicator (CE) is used to qualitatively determine whether the model supports adaptive resource allocation. Quantitatively, the number of parameters (Params, M) is reported to reflect the static storage cost, while multiply–accumulate operations (MACs, G) are used to measure the overall inference complexity. In particular, to account for the variable-length input setting in this work, we further introduce per-frame computation (MACs/frame, M), defined as the total MACs divided by the number of effective frames. This metric provides a more objective estimate of the actual runtime efficiency and compute utility under adaptive execution paths.
3.2. Experimental Setup and Hyperparameter Configuration
The experiments are implemented using PyTorch 1.13.0 with CUDA 11.7 [
32], and all training and inference are conducted on a workstation equipped with a single NVIDIA RTX 3090 GPU. Crucially, to instantiate the hardware-level sparse acceleration discussed in
Section 2.5.3, the variable-length self-attention operators are integrated with the FlashAttention-2 backend, enabling exact attention over dynamic sequence lengths without the overhead of dense padding. To explicitly address the critical challenge of realizing actual latency reduction for unstructured variable-length sequences on static tensor computation graphs, we meticulously align our operator design with the underlying hardware architecture. The utilized NVIDIA RTX 3090 GPU, built on the Ampere architecture, natively supports asynchronous memory copying and hardware-efficient variable-length attention kernels (driven by cu_seqlens boundary indices). Furthermore, to strictly adhere to Tensor Core memory alignment specifications and maximize global memory throughput, the feature alignment dimension d is deliberately configured as a multiple of 64 (specifically, d = 256 in our base setting). This hardware-aware alignment ensures that the theoretical efficiency gains achieved by avoiding zero-padding directly bypass memory-bound bottlenecks, translating our mask-guided sparse operations into verifiable wall-clock speedups. To systematically evaluate performance and generalization under different temporal receptive fields, the experimental setup follows the PoseFormerV2 protocol by configuring the input sequence length T to 27, 81, and 243 frames for comparative studies.
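Assuming the public flash-attn (v2.x) package, the padding-free execution path can be sketched as follows; the packing layout and wrapper are illustrative, while flash_attn_varlen_func and its cu_seqlens arguments are the library's actual interface.

```python
import torch
from flash_attn import flash_attn_varlen_func  # public flash-attn v2.x API

def varlen_attention(qkv, seq_lens):
    """qkv: (total_frames, 3, H, h) fp16/bf16 tensor holding every sequence of
    the batch concatenated back-to-back, with no zero padding.
    seq_lens: (B,) valid frame counts per sequence."""
    q, k, v = qkv.unbind(dim=1)
    seq_lens = seq_lens.to(qkv.device, torch.int32)
    # cu_seqlens marks sequence boundaries in the packed buffer, e.g. [0, 27, 108].
    cu = torch.zeros(seq_lens.numel() + 1, device=qkv.device, dtype=torch.int32)
    cu[1:] = torch.cumsum(seq_lens, dim=0)
    m = int(seq_lens.max())
    return flash_attn_varlen_func(q, k, v, cu_seqlens_q=cu, cu_seqlens_k=cu,
                                  max_seqlen_q=m, max_seqlen_k=m)
```

With d = 256 and, for example, H = 8 heads, the per-head dimension of 32 satisfies the kernel's alignment requirements, so the packed layout incurs no padding at all.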
For adaptive parameterization, the model no longer relies on fixed truncation thresholds; instead, it dynamically adjusts the computation budget according to the spatiotemporal complexity score C. Under the standard T = 243 configuration, the adaptive policy yields an average effective temporal window length ranging from 11.6 to 80.3 frames across action categories (47.2 frames on average over the full set), and retains 6.1 to 34.7 DCT frequency coefficients on average (20.9 coefficients on average over the full set). Moreover, for fine-grained analysis of complexity characteristics, 313 representative sequences spanning all 15 action classes are selected from the test set for detailed evaluation.
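A minimal sketch of this complexity-to-budget mapping is given below, implemented as a clipped linear interpolation between the observed bounds; the exact policy learned by CAAPoseFormer may differ, so the bounds and helper name are illustrative.

```python
import torch

def allocate_budget(C, c_min=0.0, c_max=1.0,
                    t_bounds=(12, 81), k_bounds=(6, 35)):
    """Map a tensor of complexity scores C to temporal windows T and DCT
    coefficient counts K. Bounds loosely follow the observed ranges
    (11.6-80.3 frames, 6.1-34.7 coefficients)."""
    c = ((C - c_min) / (c_max - c_min)).clamp(0.0, 1.0)  # boundary clipping
    T = (t_bounds[0] + c * (t_bounds[1] - t_bounds[0])).round().long()
    K = (k_bounds[0] + c * (k_bounds[1] - k_bounds[0])).round().long()
    return T, K
```

The clamp on C corresponds to the Cmin/Cmax boundary mechanism revisited in Section 3.6, which keeps out-of-range complexity estimates within valid pruning bounds.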
3.3. Experimental Results
3.3.1. Complexity-Aware Adaptive Parameterization
We first break down the runtime behavior of the Complexity-Aware Module by plotting its spatial (C_s), temporal (C_t), and combined (C) tracking indices. As shown in Figure 5 and Figure 6, to compare the model’s response characteristics under different motion patterns, we examine two representative actions: “Discussion”, which contains persistent subtle motions, and “Sitting Down”, which exhibits pronounced transitions between static and dynamic states.
Our tracking data confirm that sequence complexity fluctuates independently across the spatial and temporal axes. Specifically, the spatial complexity metric (C_s) for the “Discussion” sequence oscillates primarily between 0.2 and 0.4. These quasi-periodic shifts correlate directly with gesture articulation during verbal interaction (e.g., raising and opening the arms), which drives repeated expansions and contractions of the subject’s pose topology and produces recurrent structural variations in the body configuration.
In the temporal domain, the temporal complexity curve (C_t) manifests as a series of intermittent spikes. These peaks align with abrupt velocity changes at moments of pose transition, thereby capturing the temporal discontinuities inherent to the motion dynamics.
Crucially, the fused complexity index (C, purple dashed line) avoids the saturation seen in extreme motions, settling near a stable baseline of 0.23 rather than approaching 1.0. This numerical convergence indicates that the model processes “Discussion” as locally active but macroscopically stationary. By isolating these low-amplitude variations from high-intensity activities, the system retains enough bandwidth for semantic details without triggering maximum resource allocation.
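A simplified reading of the underlying quantifier, using the learnable fusion weights α and β that appear in Figure 7, can be sketched as follows; the exact dispersion, variance, and normalization definitions in CAAPoseFormer may differ.

```python
import torch

def complexity_indices(seq, alpha, beta):
    """seq: (T, J, 2) 2D keypoint sequence; alpha, beta: learnable scalars.

    C_s: mean joint dispersion about the per-frame skeleton centroid.
    C_t: variance of inter-frame joint displacements.
    C:   bounded weighted fusion, as plotted in Figures 5 and 6.
    """
    centroid = seq.mean(dim=1, keepdim=True)          # (T, 1, 2)
    c_s = (seq - centroid).norm(dim=-1).mean()        # spatial dispersion
    disp = (seq[1:] - seq[:-1]).norm(dim=-1)          # per-joint motion magnitude
    c_t = disp.var()                                  # temporal motion variance
    c = (alpha * c_s + beta * c_t).clamp(0.0, 1.0)    # fused score in [0, 1]
    return c_s, c_t, c
```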
We observe a starkly different dynamic-to-static profile for “Sitting Down” (
Figure 6). Here, all tracking curves spike during the initial rapid descent and then immediately plunge toward zero once the subject is seated. This sharp drop-off demonstrates that the evaluation module acts as a strict filter, reliably cutting off redundant computation the moment an action transitions into a fully static state.
Within the CAAPoseFormer pipeline, these real-time complexity estimates strictly govern both the temporal window size and the retained DCT coefficients.
Table 1 aggregates the average complexity scores across all 15 action categories, listing the exact resource constraints (mean temporal window length T and mean coefficient count K) the model autonomously assigned to each class.
Figure 7 tracks the parameter updates and resource scaling for the “Photo” sequence across 80 epochs. In the initial phase, both the spatial (α) and temporal (β) weights climb sharply. This rapid surge indicates that the network quickly detects the heavy self-occlusion inherent in this specific pose, pushing the overall complexity score (C) past the activation threshold almost immediately from a cold start.
Between epochs 20 and 60, the curves fluctuate heavily. This variance reflects the optimizer actively balancing two competing goals: minimizing compute operations while retaining enough high-frequency data to prevent smoothing errors. As illustrated in Figure 7, the learnable weights α and β converge smoothly after approximately 50 epochs, and the “Complexity Score Convergence” and “Adaptive Resource Allocation” plots further demonstrate that the system reaches a steady state, with C converging at 0.794. At this plateau, the network settles on a stable budget of 66.1 frames and 28.8 DCT coefficients for the “Photo” action. This convergence behavior validates the mathematical stability of our learnable fusion mechanism, ensuring that it does not oscillate during inference and provides a robust, predictable computational budget for real-time deployment. The trajectory demonstrates that the adaptive mechanism autonomously locks onto an appropriate computational budget based on physical motion demands, eliminating the need for manual parameter tuning.
3.3.2. Per-Action Accuracy Comparison
Table 2 details the performance across all 15 Human3.6M action classes. CAAPoseFormer outpaces the PoseFormerV2 baseline specifically on sequences demanding extended temporal context. We observe clear error reductions on “Walking” (31.3 mm vs. 31.8 mm), “Walking Together” (30.2 mm vs. 32.2 mm), and “Walking Dog” (45.7 mm vs. 46.1 mm). The adaptive temporal window (T) drives these gains by scaling the receptive field to strictly frame complete motion cycles without absorbing surrounding noise. A different mechanism stabilizes endpoint-heavy actions such as “Photo” (48.9 mm vs. 51.6 mm) and “Phoning” (44.9 mm vs. 46.5 mm): for these sequences, the dual-domain pruning strategy (K) acts as an action-specific spectral filter, stripping away non-essential high-frequency jitter. However, a detailed per-action analysis reveals that our adaptive strategy does not yield uniform improvements across all categories. Specifically, performance degradation is observed in categories such as “Discussion” (45.6 mm), “Eating” (49.1 mm), and “Sitting Down” (62.7 mm) compared to PoseFormerV2 [6]. This degradation indicates that, for actions characterized by compact body configurations or subtle local motion, the complexity-aware pruning strategy may over-compress informative cues. In “Eating”, for instance, the subtle upper-limb articulations and hand-to-face interactions may be partially lost when the model allocates a restricted temporal window or reduced frequency bandwidth. Similarly, “Sitting Down” remains sensitive to pruning due to its inherent depth ambiguity and pose contraction. We accept these specific degradations as a necessary operational compromise to secure the network’s 64.8% reduction in total FLOPs.
In broader comparisons with mainstream architectures, the proposed method achieves a lower average error (44.2 mm) than conventional graph-based or fully convolutional approaches such as GraphSH (51.9 mm) and VPose (46.8 mm), corroborating the effectiveness of Transformer-based models in capturing long-range dependencies in skeletal sequences. Relative to MHFormer, which attains the lowest mean error in this comparison (43.0 mm) owing to its multi-hypothesis modeling, CAAPoseFormer exhibits a 1.2 mm higher average error. Nonetheless, under highly dynamic scenarios such as “Walking Together”, CAAPoseFormer achieves a lower error (30.2 mm) than MHFormer (30.6 mm). Therefore, CAAPoseFormer is best understood as a framework that prioritizes overall system efficiency and competitive mean accuracy, accepting marginal per-class trade-offs to enable real-time executability on resource-constrained platforms. Overall, the results suggest that MHFormer remains superior in absolute accuracy, whereas CAAPoseFormer substantially reduces computational demand via mask-driven sparse computation, offering a competitive lightweight solution for latency-sensitive applications.
3.3.3. Evaluation of Temporal Smoothness and Kinematic Plausibility
To evaluate the practical effectiveness of the adaptive mechanism, we measured the Adaptive Trigger Rate (ATR) across the test sets. Under the T = 243 configuration, 100% of the samples utilized a temporal window smaller than the full 243 frames. Specifically, the autonomously assigned windows ranged from 11.6 to 80.3 frames with a global average of 47.2 frames, ensuring that the theoretical efficiency gains translate into universal wall-clock speedups across diverse action patterns. Building upon this pervasive efficiency, we further verify that the model maintains high kinematic quality, since video-based applications require continuous temporal smoothness in addition to static spatial precision. As presented in
Table 3, despite utilizing this dynamic resource allocation, CAAPoseFormer achieves an MPVE of 2.9 mm, demonstrating competitive smoothness compared to fully static architectures like PoseFormerV2. This stability is primarily attributed to our frequency-domain adaptation strategy. Instead of discretely dropping intermediate frames that would disrupt physical continuity, our model truncates high-frequency DCT coefficients. This mechanism intrinsically acts as a temporal low-pass filter to smooth out high-frequency positional jitter while dynamically accelerating inference.
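This low-pass behavior is straightforward to reproduce: truncating the temporal DCT spectrum of a trajectory and inverting it yields a smoothed trajectory, as in the following SciPy sketch (K denotes the retained coefficient count; the helper is illustrative).

```python
import numpy as np
from scipy.fft import dct, idct

def lowpass_via_dct(traj, K):
    """traj: (T, J, 3) joint trajectories; keep the K lowest-frequency temporal
    DCT coefficients and reconstruct, removing high-frequency jitter."""
    spec = dct(traj, type=2, norm="ortho", axis=0)   # temporal spectrum
    spec[K:] = 0.0                                   # truncate high frequencies
    return idct(spec, type=2, norm="ortho", axis=0)  # smoothed trajectories
```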
3.4. Comprehensive Performance and Computational Overhead Analysis
Table 4 evaluates model capacity against recent baseline architectures to assess whether CAAPoseFormer is genuinely viable for real-world edge deployment. As can be observed, CAAPoseFormer contains only 2.1 M parameters, demonstrating a clear advantage in storage efficiency. Relative to the baseline PoseFormerV2, CAAPoseFormer reduces the parameter count by 12.2 M, corresponding to an 85.3% decrease, which substantially lowers the memory footprint on resource-constrained edge computing devices such as the Orange Pi 5 Pro (built around the Rockchip RK3588S with an integrated NPU). Compared with accuracy-oriented heavy models such as MotionBERT and MixSTE, the parameter count is further reduced by 95.1% and 93.8%, respectively. Despite its substantially smaller capacity, CAAPoseFormer maintains a competitive MPJPE of 44.2 mm (improving by 7.7 mm over GraphSH and by 2.6 mm over VPose). This suggests that, enabled by the proposed time–frequency dual-domain pruning strategy, the model can represent human pose topology effectively with minimal information redundancy, achieving a more favorable balance between lightweight deployment and representational power.
In terms of computational efficiency and inference overhead, CAAPoseFormer requires only 0.18 G MACs, indicating strong potential for real-time processing. Compared with PoseFormerV2, the computational cost is reduced by 64.0%, and the average per-frame computation decreases from 528 M to 160 M MACs, effectively mitigating the compute waste associated with full-budget inference. When compared with MHFormer, which is also Transformer-based, CAAPoseFormer reduces MACs by 97.4%, achieving an order-of-magnitude speedup at the cost of only a 1.2 mm increase in error. Although MotionBERT attains state-of-the-art accuracy (39.2 mm), it incurs 174.7 G MACs, which is approximately 970× that of CAAPoseFormer, making it difficult to deploy on compute-limited edge devices. In contrast, CAAPoseFormer leverages a mask-guided sparse interaction mechanism that physically blocks computation over invalid regions at the operator level, thereby maximizing compute utility and making it more suitable for latency-sensitive, real-time motion capture scenarios. As reported in
Table 4, HDFormer achieves a highly impressive MPJPE of 42.6 mm with only 6 M MACs/frame. This extreme theoretical efficiency is achieved through a High-Order Directed Transformer architecture that optimizes spatial joint-to-joint interactions. However, CAAPoseFormer follows a fundamentally different methodological trajectory; while HDFormer optimizes the static graph structure, our framework introduces a dynamic routing mechanism that adapts the sequence length and frequency bandwidth based on motion complexity. This sample-level adaptivity provides a unique perspective on efficiency, focusing on computational utility rather than fixed architectural priors.
3.4.1. Real-World Inference Latency and Desktop Benchmarking
While MACs provide a theoretical proxy for computational complexity, they often fail to capture the impact of operator-level optimizations and memory access patterns. To directly address the hardware deployment capability, we evaluated the actual end-to-end inference latency (in milliseconds), which is a more critical metric for latency-sensitive applications. As summarized in
Table 5, CAAPoseFormer successfully translates its theoretical sparsity into real-world acceleration by coupling adaptive sequence length and DCT mask mechanisms with hardware-optimized kernels.
On a desktop NVIDIA RTX 3090 platform, our model reduces end-to-end latency to a mere 1.95 ms. This represents a massive 23× speedup over the multi-hypothesis MHFormer (45.45 ms) and significantly outperforms the PoseFormerV2 baseline (8.40 ms). These results confirm that our mask-guided sparse interaction mechanism effectively bypasses invalid computation, enabling high-frequency (512 Hz) real-time processing.
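A standard CUDA-event protocol with warm-up and explicit synchronization, as sketched below, avoids the distortions introduced by asynchronous kernel launches; the exact harness used for Table 5 may differ in minor details.

```python
import torch

@torch.no_grad()
def gpu_latency_ms(model, x, warmup=50, iters=200):
    """Average forward-pass latency in milliseconds on a CUDA device."""
    model.eval()
    for _ in range(warmup):                      # warm up kernels and caches
        model(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        model(x)
    end.record()
    torch.cuda.synchronize()                     # wait for all queued work
    return start.elapsed_time(end) / iters
```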
3.4.2. Cross-Platform Verification and Edge Deployment
While the desktop GPU metrics indicate a profile highly suitable for deployment, real-world edge performance is subject to distinct hardware constraints, such as limited memory bandwidth and lower integer arithmetic throughput. To further support our deployment-oriented claims, we conducted preliminary on-device verification using an Orange Pi 5 Pro, focusing on the practical executability of the model under resource-constrained conditions.
As illustrated in
Figure 8, the complete inference pipeline was successfully executed on this representative mobile-grade platform. The photographs of the real-world deployment scenario showing the physical hardware setup, the monitor display, and the active runtime interface provide tangible evidence that CAAPoseFormer’s lightweight design is not merely theoretical but practically deployable on low-power hardware.
As summarized in
Table 6, CAAPoseFormer demonstrates strong adaptability across different input scales. Even on the Orange Pi 5 Pro’s architecture, the model maintains a peak memory footprint under 1.5 GB. At a reduced temporal resolution (T = 27), the model achieves 10.87 FPS, providing a viable path for real-time motion capture on edge devices. We emphasize that while the RTX 3090 remains the primary setting for controlled SOTA comparisons, these results verify the framework’s potential for latency-sensitive inference on resource-constrained edge devices.
3.4.3. Efficiency-Accuracy Pareto Analysis
Figure 9a visualizes the trade-off between computational efficiency and reconstruction accuracy. CAAPoseFormer (red pentagram) is clearly located in the advantageous upper-left region of the plot, indicating a favorable accuracy–efficiency trade-off under a low compute budget. Compared with the baseline PoseFormerV2, the proposed method achieves a substantial leftward shift along the x-axis (0.5 G → 0.18 G) while maintaining competitive accuracy, thereby avoiding the considerable computational redundancy of heavy networks such as MotionBERT (clustered in the upper-right). This distribution corroborates the effectiveness of the mask-guided sparse interaction mechanism, demonstrating that CAAPoseFormer attains reconstruction accuracy superior to lightweight models (e.g., DC-GCT) under an extremely low-power budget.
Figure 9b further highlights the superiority of CAAPoseFormer in terms of static storage efficiency. Most state-of-the-art Transformer-based methods (e.g., MixSTE, MotionBERT) are densely distributed in the high-parameter regime on the right side of the plot (>30 M), suggesting a strong dependence on model capacity. In contrast, CAAPoseFormer lies close to the y-axis and achieves performance comparable to PoseFormerV2 (14.3 M) with only 2.1 M parameters. This “small footprint, strong performance” profile indicates that the proposed time–frequency dual-domain pruning strategy enables efficient encoding of human topological features within a severely constrained parameter budget, outperforming methods of similar scale such as GLA-GCN, and thus being more suitable for lightweight deployment on edge devices.
CAAPoseFormer, in contrast, is designed as an efficiency-first framework tailored for latency-sensitive edge applications. As shown in the trade-off curve (
Figure 9), we consciously trade a marginal 1.2 mm accuracy drop for a dramatic 97.4% reduction in computational cost (MACs) compared to MHFormer. On resource-constrained edge devices where running heavy models like MHFormer is practically infeasible due to strict power and memory constraints, our method provides a deployable lightweight alternative, maximizing efficiency while maintaining competitive accuracy relative to PoseFormerV2.
Furthermore, to verify that this efficiency-oriented design does not compromise the modeling of difficult poses, we conducted a comparison by forcing maximum resource allocation on samples identified as “hard” by our complexity module. The results indicated that the performance gain on these samples was marginal, with less than a 2% improvement in MPJPE, while the computational cost increased by an order of magnitude. This confirms that the performance bottleneck for such poses is primarily due to inherent monocular ambiguities rather than insufficient model capacity. Consequently, our adaptive strategy prioritizes computational utility, focusing resources where they provide the most significant accuracy gains and ensuring a more efficient Pareto frontier for edge deployment.
3.4.4. Robustness to 2D Detection Noise
Since CAAPoseFormer relies on 2D input sequences, we investigate its robustness against upstream detection errors by adding zero-mean Gaussian noise of varying standard deviation σ to the 2D keypoints. As presented in
Table 7, the results highlight a distinct advantage of our adaptive mechanism. Under the simulated Gaussian noise setting, the complexity module tends to interpret unnatural coordinate jitter as increased temporal and spatial variance.
Consequently, as the noise level increases, the network autonomously allocates significantly more resources (average T increases toward 81, and K increases toward the maximum) to establish a broader receptive field for smoothing and error correction. This dynamic compensation suggests that, under coordinate-level Gaussian perturbations, the model can partially reduce the risk of overly aggressive pruning by allocating more computational resources to noisy inputs.
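The perturbation protocol amounts to coordinate-level jitter of the following form, where σ is the noise level varied in Table 7; the helper is a minimal illustrative sketch.

```python
import numpy as np

def perturb_keypoints(kp2d, sigma, seed=0):
    """Add i.i.d. zero-mean Gaussian noise (std sigma, in pixels) to a
    (T, J, 2) 2D keypoint sequence to simulate detector error."""
    rng = np.random.default_rng(seed)
    return kp2d + rng.normal(0.0, sigma, size=kp2d.shape)
```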
3.4.5. Failure Mode Analysis
While CAAPoseFormer provides an efficient accuracy-efficiency trade-off, its performance is bounded by the alignment between geometric proxies and reconstruction difficulty. The primary failure mode occurs in scenarios characterized by semantic–geometric decoupling, specifically in occlusion-heavy contracted poses like “Sitting Down” and “Eating”. As visualized in
Figure 6, once the subject transitions into a seated state, the comprehensive complexity score (C) plunges toward zero. This occurs because the limbs are compactly tucked near the joint centroid, yielding low spatial dispersion (C_s). The model interprets this physical contraction as “low complexity” and assigns a near-minimum computational budget. However, these poses involve severe depth ambiguities that actually require extensive temporal context to resolve, leading to the observed localized error of 62.7 mm for “Sitting Down”. A similar underestimation occurs during “Eating”, where subtle hand-to-face articulations are over-compressed by restricted spectral bandwidths.
Furthermore, the module is sensitive to the quality of upstream 2D input. While the Gaussian noise analysis suggests that the system tends to assign higher complexity to inputs with strong coordinate jitter and allocate more resources for smoothing, extreme occlusion-induced detector failure remains a bottleneck. We accept these localized degradations as a necessary engineering trade-off for real-time edge deployment; forcing a maximum computational budget on these “hard” samples yields less than a 2% accuracy improvement while increasing computational costs by an order of magnitude. This analysis provides a concrete anchor for future research into occlusion-aware complexity quantification and robust 3D lifting.
3.5. Ablation Studies
To verify the effectiveness and individual contributions of the key components in the CAAPoseFormer architecture, we conduct detailed ablation studies on Human3.6M, with results summarized in
Table 8. The baseline model (Model 1) adopts a fixed spatiotemporal processing strategy and achieves an MPJPE of 44.4 mm with 0.49 GFLOPs and an inference speed of 185 Hz. The results show that a single adaptive mechanism can substantially reduce computational load, yet it is insufficient to maintain high-fidelity reconstruction on its own. Introducing only the adaptive temporal window (Model 2) increases the throughput to 210 Hz, but the accuracy degrades markedly to 47.1 mm. Similarly, applying only the adaptive DCT strategy (Model 3) compresses the computation to 0.19 GFLOPs and attains 450 Hz, whereas its accuracy (44.5 mm) still fails to surpass the baseline.
We trace these accuracy drops directly to unchecked feature over-sparsification. When Model 2 applies aggressive temporal pruning without frequency-domain support, it breaks critical long-range dependencies, stripping the context needed for occlusions or fast movements. Model 3 forces efficiency through frequency-domain sparsity alone. Yet, missing a precise temporal focus, its global frequency cutoff blindly erases the high-frequency signals required to pinpoint distal joints, stalling any further improvements in precision.
In contrast, when the two adaptive mechanisms operate jointly in the full model (Model 4), a clear complementary synergy emerges. The complete architecture reduces computation to 0.16 GFLOPs (a 67.3% reduction over the baseline) and boosts inference speed to 512 Hz, while also improving accuracy beyond the baseline to achieve the best MPJPE of 44.2 mm. These findings confirm that the adaptive temporal window effectively removes temporal redundancy and provides cleaner inputs for spectral modeling, whereas the adaptive DCT module compensates for the information loss in the temporal domain through dynamic spectral filtering. Rather than a linear combination, their coupling performs joint denoising and feature reorganization, yielding an improved accuracy–efficiency trade-off under an extremely low compute budget.
3.6. Input-Length Sensitivity Under Zero-Shot Transfer and Boundary Robustness
To analyze the input-length sensitivity of CAAPoseFormer under a cross-dataset zero-shot setting, we evaluate the model trained solely on Human3.6M directly on the 3DPW dataset without target-domain fine-tuning. As shown in
Table 9, increasing the input sequence length consistently improves the PA-MPJPE performance. Specifically, when the input length increases from T = 27 to T = 81, the PA-MPJPE decreases from 62.5 mm to 58.4 mm. When the input length is further increased to T = 243, the PA-MPJPE is reduced to 55.2 mm. This trend indicates that longer temporal context is beneficial for CAAPoseFormer under the Human3.6M-to-3DPW zero-shot transfer setting.
Meanwhile, the computational cost increases only moderately with longer input sequences. The MACs increase from 160 M at T = 27 to 181 M at T = 81 and 186 M at T = 243, while the number of parameters remains nearly unchanged at approximately 2.1–2.2 M. This suggests that the proposed adaptive computation strategy can exploit longer temporal information while maintaining a relatively low computational budget.
It should be noted that
Table 9 is intended as an internal input-length sensitivity analysis of CAAPoseFormer under the specified zero-shot transfer setting. Therefore, the reported result of 55.2 mm at T = 243 should be interpreted as the performance of our model under this particular setting, rather than as a direct comparison with methods evaluated under different protocols. In addition, the results suggest that the fixed training-set boundary parameters Cmin and Cmax remain usable under this domain shift. When the estimated complexity of unseen samples falls outside the training-set range, the clipping mechanism constrains the adaptive resource allocation within predefined valid bounds, thereby helping maintain stable pruning behavior without online recalibration.
4. Conclusions
In this study, we tackle the inflexible nature of resource distribution prevalent in current video-driven 3D pose estimation networks by proposing CAAPoseFormer, a novel adaptive architecture driven by spatial–temporal complexity awareness. Instead of relying on rigid network parameters, this framework directly links the computational budget to the estimated spatial–temporal complexity of the input sequence.
We shift the computational paradigm through three core mechanisms. We first track sequence difficulty using a spatiotemporal complexity quantifier that merges spatial dispersion and temporal variance. This continuous metric directly drives a time–frequency dual-domain pruning strategy. Instead of restricting the network to fixed windows or preset DCT limits, this pruning dynamically drops redundant operations for simple poses while reserving more representational capacity for complex sequences. To process these newly unstructured features, we integrate a mask-guided sparse encoder. By explicitly blocking invalid zero-padded regions at the operator level, the encoder natively handles variable-length interactions without wasting standard hardware cycles.
Experiments on Human3.6M indicate that CAAPoseFormer mitigates the efficiency bottleneck of conventional fixed-computation strategies when facing diverse motion patterns. By coupling complexity-aware resource allocation with adaptive time–frequency pruning and mask-driven sparse interaction, the model aligns computational effort with motion complexity. Relative to the strong frequency-domain baseline PoseFormerV2, CAAPoseFormer achieves comparable accuracy (MPJPE: 44.2 mm) while reducing parameters by 85.3% and computational cost by 64.8%, yielding a 2.8× improvement in compute efficiency. These results suggest that the proposed approach attains a favorable accuracy–efficiency trade-off, alleviating both excessive computation on simple motions and under-representation of complex motions, and thus supports real-time 3D pose estimation in resource-constrained settings. While static architectures like HDFormer have pushed the boundaries of theoretical MAC reduction through spatial graph optimization, CAAPoseFormer explores a complementary path by prioritizing dynamic, complexity-aware resource allocation. Our framework demonstrates that sequence-level adaptivity can achieve competitive accuracy while maintaining the high parallelism required for real-time edge deployment.
While the primary experimental validation in this work was performed on a desktop RTX 3090 platform, CAAPoseFormer is explicitly designed for computation-efficient inference through complexity-aware adaptive pruning and mask-guided sparse interaction. Its exceptionally low parameter count, reduced MACs, and high measured throughput indicate strong potential for practical deployment on resource-constrained edge platforms. Notably, our preliminary Orange Pi 5 Pro device-side verification supports the real-world executability of the framework beyond standard desktop-only evaluations. While these initial benchmarks demonstrate the efficiency of CAAPoseFormer, we acknowledge that a direct memory footprint comparison with other lightweight baselines, such as GLA-GCN and HDFormer, requires extensive hardware-specific kernel tuning and framework conversion, such as RKNN/ONNX, to ensure fairness across different architectural paradigms. Future work will focus on establishing a standardized benchmarking protocol to evaluate the peak memory versus static buffer trade-offs of adaptive frameworks on various ARM-based edge platforms.
We do recognize specific operational limits in the current pipeline. Because the complexity module relies entirely on upstream 2D pose data, severe occlusions or raw detector noise may disturb the complexity scores and increase the risk of suboptimal pruning decisions. However, our Gaussian noise analysis suggests that the system tends to assign higher complexity to noisy coordinate inputs and mobilize more computational resources, thereby partially reducing the risk of overly aggressive pruning under this controlled perturbation setting. The dynamic nature of this pruning also creates inherent decision variability, which can occasionally cause minor frame-to-frame prediction jitter across runs. Structurally, forcing the quantification, pruning, and encoding stages into a strict sequential dependency complicates end-to-end optimization. This tight coupling narrows the margin for stable training and requires highly precise hyperparameter tuning.
Future research will focus on improving the robustness of the complexity-aware module under noise, anomalous motion patterns, severe occlusions, and domain shifts while maintaining the frozen global normalization priors used during inference. This design preserves the comparability of complexity scores across historical and current samples, avoids dependence on online test statistics, and prevents additional state-maintenance overhead on edge devices. While the current 3DPW results provide preliminary evidence of cross-dataset transferability, our future research will extend this paradigm to a broader range of benchmarks, including MPI-INF-3DHP and AGORA, to further investigate the transferability of complexity-aware routing across diverse motion distributions under more comprehensive and controlled evaluation settings. Furthermore, extending adaptivity from the temporal and frequency dimensions to the spatial dimension using emerging linear-complexity state space models, such as Mamba-based architectures [
43], could help realize a fully spatiotemporal adaptive framework with lower memory footprint. In addition, we aim to evaluate the transferability of the proposed adaptive computation paradigm to broader video analysis tasks, including action recognition and frequency-adaptive video understanding.
Although the semantic gap between geometric dispersion and actual reconstruction difficulty is partially alleviated by the fusion of temporal variance and the adaptive preservation of temporal–frequency information during dynamic transitions, highly contracted self-occluded poses remain challenging due to intrinsic monocular depth ambiguity. While our geometric dispersion metric ensures high efficiency for real-time deployment, we acknowledge that it may not perfectly capture the complexity of severe self-occlusions. Therefore, developing lightweight, occlusion-aware complexity metrics that can accurately assess pose difficulty without compromising real-time inference speed will be a central focus of our future research.