Article

Adaptive 3D Human Pose Estimation Based on Spatial–Temporal Complexity Awareness

1 School of Electrical and Electronic Engineering, Hubei University of Technology, Wuhan 430068, China
2 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(10), 2076; https://doi.org/10.3390/electronics15102076
Submission received: 31 March 2026 / Revised: 9 May 2026 / Accepted: 11 May 2026 / Published: 13 May 2026
(This article belongs to the Special Issue Advances in Real-Time Object Detection and Tracking)

Abstract

Existing 3D human pose estimation methods apply fixed computation strategies to diverse action sequences, leading to computational redundancy for simple actions, insufficient capture of high-frequency information for complex actions, and low efficiency on long sequences. To address these issues, this paper proposes a Spatial–Temporal Complexity-Aware Adaptive Computation Framework (CAAPoseFormer). First, a spatial–temporal coupled complexity quantification module is built to integrate spatial dispersion and temporal motion variance for graded quantification of action complexity. On this basis, a time–frequency dual-domain adaptive pruning strategy is proposed to dynamically allocate the temporal window length and the number of frequency-domain DCT coefficients on demand. Furthermore, a mask-guided sparse interaction encoding mechanism is designed to enable efficient parallel computation over variable-length features by shielding invalid padding regions. Experiments on the Human3.6M dataset show that, compared with the baseline PoseFormerV2, the proposed method cuts parameters by 85.3% and computational cost by 64.8% while retaining comparable accuracy (MPJPE of 44.2 mm), improving per-unit computational efficiency by 2.8×. Moreover, compared with state-of-the-art (SOTA) methods such as MHFormer and MotionBERT, our method reduces computational cost (MACs) by 97.4% and by nearly three orders of magnitude, respectively. The framework effectively breaks the inference bottleneck of high-precision models on low-power hardware, making it well suited to latency-sensitive real-time applications.

1. Introduction

Despite significant progress in deep learning-based 3D human pose estimation [1], recovering stable, accurate, and computationally efficient pose sequences from monocular videos remains a formidable challenge. Mainstream approaches predominantly adopt the "2D-to-3D lifting" paradigm [2], which takes 2D pose detections as input and employs temporal convolutional networks (TCNs) [3] or recurrent neural networks (RNNs) [4] to capture inter-frame dependencies. However, these early architectures are hindered by limited receptive fields and vanishing gradients, making it difficult to effectively model long-range temporal correlations. The introduction of Transformer models fundamentally changed how long-range dependencies are handled. PoseFormer [5] captures global context through self-attention, and PoseFormerV2 [6] later improved upon this by applying frequency-domain compression via the discrete cosine transform (DCT), which substantially improved computational efficiency. Nonetheless, in real-world scenarios featuring diverse and complex human motions, achieving high accuracy while significantly reducing computational overhead remains an open problem.
Leading contemporary techniques, such as PoseFormerV2, primarily depend on rigid computational architectures that process all input sequences with identical settings, thereby neglecting the natural spatial–temporal diversity inherent in human movements. Specifically, human movements exhibit substantial dynamic disparities: quasi-static actions (e.g., standing) contain considerable spatiotemporal redundancy, whereas vigorous motions (e.g., dancing) are rich in high-frequency details. Existing approaches with fixed temporal windows and frequency truncation strategies [6] are incapable of adapting to such disparities. This leads to substantial computational redundancy when processing low-frequency actions, and critical detail loss due to insufficient spectral bandwidth when handling high-frequency actions [7]. Furthermore, standard Transformer encoders struggle to distinguish informative features from padded regions, resulting in wasteful computation on null-element interactions. Similar inefficiencies have also motivated dynamic token sparsification and pruning strategies in efficient Transformer models [8]. This mismatch between computational resource allocation and input-specific demands prevents reductions in theoretical FLOPs from translating linearly into actual inference speed (FPS) gains, thereby limiting deployment on resource-constrained edge devices.
In response to these challenges, the research community has pursued diverse directions. In spatial structure modeling, graph convolutional networks (GCNs) have been employed to explicitly capture joint topologies, as exemplified by ST-GCN [9]; however, their receptive fields are constrained by predefined physical connections, hindering the modeling of long-range dependencies between non-adjacent joints. Subsequent Transformer variants, such as MixSTE [10], enhance feature interaction through alternating spatial and temporal attention modules, while MHFormer [11] addresses depth ambiguity via multi-hypothesis prediction strategies. Nevertheless, the computational complexity of these models grows quadratically with sequence length, leaving efficiency bottlenecks unresolved. In pursuit of efficiency optimization, frequency-domain analysis has emerged as a pivotal direction. PoseFormerV2 [6] compresses long-range motion information into low-frequency representations via DCT, capturing global motion trends at minimal cost and marking the advent of joint spatiotemporal-frequency modeling in this field. However, this framework adopts a "fixed coefficient truncation" strategy, implicitly assuming that all actions follow an identical spectral distribution and thus failing to adapt processing bandwidth according to motion characteristics. Notably, adaptive computation has been widely explored in general video understanding, such as action recognition. Pioneering works like AdaFrame [12] and DynamicViT [13] dynamically drop redundant frames or uninformative spatial tokens to accelerate heavy RGB-based models. However, these token-dropping paradigms are fundamentally unsuitable for 3D Human Pose Estimation (HPE). 3D HPE relies heavily on the continuous temporal kinematics of sparse 2D coordinates; discretely dropping temporal or spatial tokens destroys this physical continuity, severely hindering spatial–temporal lifting and frequency-domain transformations (e.g., DCT). In contrast to these semantic-driven general methods, CAAPoseFormer is uniquely tailored for geometric sequence modeling. It preserves kinematic continuity by adaptively scaling contiguous temporal bounds and frequency-domain coefficients, ensuring mathematically stable efficiency for pose estimation [8,14]. In 3D human pose estimation, although pre-trained models such as MotionBERT [15] have learned generalized skeleton representations, their inference processes remain confined to static graphs, lacking the capacity to respond instantaneously to the complexity of individual input samples. A few recent studies have explored incorporating physical kinematics, such as joint angular velocity and trajectory priors, to enhance motion representations [16], yet none have tightly coupled fine-grained, learnable complexity measures with the automatic adjustment of core model parameters (e.g., temporal window length, frequency-domain coefficients). Evidently, constructing an end-to-end trainable complexity-aware module that is deeply integrated with the core parameters of spatiotemporal-frequency modeling constitutes the critical pathway from the "fixed-parameter" to the "adaptive-intelligence" paradigm, and it stands as the central focus of this work.
To address the above limitations, this paper proposes a Complexity-Aware Adaptive PoseFormer (CAAPoseFormer) for 3D human pose estimation. Breaking away from static processing paradigms, the proposed framework aims to establish a real-time mapping between computational resource allocation and the intrinsic dynamic characteristics of input sequences. Specifically, CAAPoseFormer first quantifies the spatial–temporal complexity of the input sequence through a lightweight module. Based on the estimated complexity, it adaptively prunes the temporal window length and the number of retained DCT coefficients in an end-to-end manner, thereby enabling demand-driven computation allocation. For the variable-length sequences produced after pruning, a mask-guided sparse interaction mechanism is further designed to eliminate operations on invalid padded regions at the operator level.
The main contributions of this work are summarized as follows:
(1)
A spatial–temporal coupled complexity quantification module is proposed. This module fuses spatial skeleton dispersion and temporal motion variance through a learnable weighted fusion strategy, enabling fine-grained hierarchical quantification of the complexity of arbitrary action sequences. It provides an interpretable basis for subsequent dynamic resource scheduling.
(2)
A time–frequency dual-domain adaptive pruning strategy is introduced. Based on the estimated complexity score, the model dynamically adjusts both the temporal window span and the number of retained DCT coefficients, thereby reducing redundant information in both the time and frequency domains and enabling differentiated computational allocation for heterogeneous actions.
(3)
A mask-guided sparse interaction encoding mechanism is developed. To handle variable-length sequences after pruning, dynamic attention masks are employed to strictly confine feature interaction to valid regions, thereby eliminating invalid computations caused by zero padding and improving actual inference throughput.
Through these innovations, CAAPoseFormer maintains high-precision predictions while effectively mitigating computational redundancy and significantly enhancing inference efficiency, offering a viable pathway for real-time deployment of 3D human pose estimation on resource-constrained edge devices.

2. Materials and Methods

2.1. Materials

This study is conducted on the Human3.6M benchmark dataset, which is one of the most widely used datasets for monocular 3D human pose estimation. Human3.6M contains large-scale indoor human motion sequences captured from multiple synchronized cameras and provides accurate 2D/3D pose annotations, making it suitable for evaluating both estimation accuracy and computation efficiency under controlled conditions. In this work, 2D keypoint sequences extracted by a 2D pose detector such as CPN [17] or recent Vision Transformer-based estimators like ViTPose [18] are used as model inputs, and the corresponding 3D joint coordinates are used as ground truth for training and evaluation.

2.2. Methods

Our framework extends the time–frequency design introduced by PoseFormerV2 [6]. Instead of processing every raw frame directly, this baseline uses the DCT to pack long-term temporal context into a small set of low-frequency coefficients, establishing a solid baseline for computational efficiency. However, it remains limited by a static computation paradigm, in which fixed temporal windows and frequency-domain truncation parameters are uniformly applied to action sequences with substantially different motion complexities. This static approach forces the network to waste compute cycles on simple, repetitive movements, while simultaneously starving highly dynamic actions of the high-frequency bandwidth they actually need. To address this issue, while retaining the core advantages of time–frequency fusion, we introduce a spatial–temporal complexity-aware mechanism to enable adaptive computation. The overall architecture of CAAPoseFormer is illustrated in Figure 1. Specifically, Figure 1a presents the complete adaptive processing pipeline, including complexity evaluation, dual-domain pruning, and feature encoding, whereas Figure 1b shows the internal structure of the mask-guided variable-length time–frequency fusion Transformer.
The processing workflow of CAAPoseFormer is as follows. The model first receives a sequence of 2D joint coordinates extracted by a 2D pose detector, which are normalized to reduce scale variation. The sequence is then fed into the complexity quantification module to estimate spatial skeleton dispersion and temporal motion variance in parallel. These two metrics are fused to generate a global complexity score. Based on this score, the time–frequency dual-domain adaptive pruning module dynamically determines the temporal keyframe window and the number of retained DCT coefficients, while simultaneously generating a joint mask that marks valid feature positions. In the feature encoding stage, the temporal and frequency-domain features are aligned through linear projection and then fed into the variable-length time–frequency fusion Transformer. With the aid of the joint mask, the multi-head attention mechanism performs sparse interaction only among valid tokens by excluding invalid padded regions. During this process, layer normalization [19] and the GELU [20] activation function are used to improve optimization stability and feature aggregation, aligning with standard practices in recent dynamic neural networks [21]. Finally, the fused features are decoded by a regression head to reconstruct the 3D human pose sequence.

2.3. Spatial–Temporal Coupled Dynamic Complexity Quantification Module

A major limitation in baselines like PoseFormerV2 is the assumption that spatiotemporal information is distributed evenly across all actions. Driven by this assumption, these models apply rigid receptive fields and uniform network depths, regardless of what the subject is actually doing. However, this static processing paradigm overlooks the inherent heterogeneity of actions: spatially, complex poses with expansive limb movements (e.g., dancing) contain richer geometric structures and joint distribution information compared to compact poses (e.g., standing); temporally, sequences involving high-frequency, intense motions exhibit significantly greater inter-frame variations than quasi-static actions, necessitating higher temporal resolution to capture motion details.
To overcome these issues, we construct a spatial–temporal coupled dynamic complexity quantification module. This module aims to systematically map the physical and motion attributes of action sequences (including spatial structural dispersion and the intensity of temporal variations) into a set of quantifiable and interpretable evaluation metrics. This provides a reliable basis for subsequent on-demand and adaptive allocation of computational resources [22], which has been shown to significantly optimize the FLOPs-to-accuracy trade-off in spatial–temporal modeling.

2.3.1. Spatial Complexity Quantification

Spatial complexity aims to quantify the degree of geometric structural dispersion of a single-frame human pose within the 2D plane. In pose estimation tasks, the semantic weight of different actions is closely correlated with their spatial distribution. Complex poses characterized by expansive limb movements (e.g., jumping, waving) typically exhibit richer kinematic and structural dependencies than compact poses [23], thereby demanding dynamic receptive fields and stronger spatial representation capabilities from the model. We therefore need a spatial metric capable of capturing how much a pose expands, without being skewed by its absolute coordinates in the frame.
This paper adopts a method based on the dispersion of the joint spatial distribution to model single-frame spatial complexity. Given the set of 2D coordinates $\{P_j(t)\}_{j=1}^{J}$ of all $J$ joints in the $t$-th frame, we first calculate the joint centroid $\bar{P}(t)$ of the frame:
$$\bar{P}(t) = \frac{1}{J}\sum_{j=1}^{J} P_j(t) \quad (1)$$
Subsequently, the raw spatial dispersion $\hat{C}_s(t)$ of the frame is defined as the average Euclidean distance from all joints to the centroid:
$$\hat{C}_s(t) = \frac{1}{J}\sum_{j=1}^{J} \left\| P_j(t) - \bar{P}(t) \right\|_2 \quad (2)$$
where $\|\cdot\|_2$ denotes the Euclidean norm. A larger $\hat{C}_s(t)$ value indicates a more dispersed distribution of joints on the 2D plane, corresponding to a more complex spatial structure of the pose. Figure 2 visually demonstrates the significant differences in human geometric structure between low and high spatial complexity through a comparison of typical "contracted" and "extended" poses.
To describe the overall spatial complexity of the entire action sequence, we further define the average spatial complexity $\bar{C}_s$ as the arithmetic mean of the per-frame spatial dispersions:
$$\bar{C}_s = \frac{1}{T}\sum_{t=1}^{T} \hat{C}_s(t) \quad (3)$$
$\bar{C}_s$ reflects the average spatial activity level of the action sequence over the entire temporal range.
To eliminate the influence of absolute scale differences across varying human morphologies and to facilitate stable optimization, we perform a global prior-based normalization on the single-frame spatial complexity:
$$C_s(t) = \frac{\hat{C}_s(t)}{C_{s\_\max}^{\mathrm{train}}} \quad (4)$$
where $C_{s\_\max}^{\mathrm{train}}$ denotes the global maximum spatial dispersion computed over the entire training set. Crucially, this value acts as a frozen global prior during the inference stage, rather than a local, sample-dependent statistic. This design ensures strict adherence to causal system constraints without requiring any future information or ground-truth action labels. The normalized value $C_s(t)$ roughly ranges within [0, 1], providing a standardized basis for subsequent dynamic scaling.
It is worth noting that our spatial complexity metric ( C s ) is formulated to prioritize ultra-fast geometric dispersion assessment. While highly contracted poses (e.g., “Sitting Down”) introduce severe depth ambiguity and self-occlusion that increase the difficulty of 3D lifting, our metric intentionally assigns them a lower complexity score. This represents a deliberate engineering trade-off for edge-device deployment: for quasi-static contracted poses, deploying massive temporal windows yields diminishing returns in accuracy relative to the immense computational overhead. By adaptively scaling down resources in these states, the model prioritizes overall inference efficiency. Furthermore, to mitigate absolute scale dependency without introducing complex operations, our metric implicitly leverages the bounding-box scale normalization already performed by the upstream detector.
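To make the computation concrete, the following minimal PyTorch sketch implements Equations (1)–(4) for a single sequence. The tensor layout (T, J, 2), the function name spatial_complexity, and the placeholder value of cs_max_train are illustrative assumptions for this sketch, not the authors' released code; in practice the normalization constant would be computed offline over the training set.

```python
# Minimal sketch of the spatial complexity metric (Eqs. (1)-(4)), assuming a
# (T, J, 2) tensor of 2D keypoints; names and constants are illustrative.
import torch

def spatial_complexity(pose_2d: torch.Tensor, cs_max_train: float) -> torch.Tensor:
    """pose_2d: (T, J, 2) normalized 2D keypoints for one sequence.
    Returns the per-frame normalized spatial dispersion C_s(t), roughly in [0, 1]."""
    centroid = pose_2d.mean(dim=1, keepdim=True)                 # Eq. (1): (T, 1, 2)
    dispersion = (pose_2d - centroid).norm(dim=-1).mean(dim=1)   # Eq. (2): (T,)
    return dispersion / cs_max_train                             # Eq. (4)

# Toy usage: 81 frames, 17 joints
pose = torch.rand(81, 17, 2)
c_s = spatial_complexity(pose, cs_max_train=1.0)
c_s_bar = c_s.mean()   # Eq. (3): sequence-level average spatial complexity
```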

2.3.2. Temporal Complexity Quantification

By evaluating temporal complexity, we aim to track how rapidly and unpredictably an action sequence unfolds. High-dynamic actions (e.g., running, jumping) typically involve drastic spatial displacements and non-linear accelerations, exhibiting significant differences between continuous frames. Conversely, low-dynamic actions (e.g., sitting, standing) contain substantial temporal redundancy. If a unified fixed computation strategy is adopted, the model is forced to allocate equal computational effort to these repetitive, invalid frames. This results in valuable computational and storage resources being persistently occupied by meaningless background signals, causing significant computational waste. Therefore, to break this rigid resource allocation pattern, it is necessary to construct a quantitative metric capable of acutely perceiving the intensity of temporal changes in actions, thereby enabling on-demand dynamic allocation of computational resources.
Balancing computational efficiency and representational capability, this paper adopts a method based on the variance of inter-frame joint velocities to quantify temporal complexity. Given a sequence $P$ containing $T$ frames of 2D poses, we first calculate the joint velocity at a given time interval $\Delta t$:
$$v_{t,\Delta t} = \frac{p_{t+\Delta t} - p_t}{\Delta t} \quad (5)$$
where $p_t$ represents the coordinates of all joints in the $t$-th frame, and $v_{t,\Delta t} \in \mathbb{R}^{J \times 2}$ is the corresponding velocity matrix. The temporal complexity of the $t$-th frame, $C_t(t)$, is defined as the variance of the velocity magnitudes (Euclidean norms) of all joints across the different time intervals:
$$C_t(t) = \mathrm{Var}\left(\left\{\left\| v_{t,\Delta t} \right\|_2\right\}_{\Delta t \in \Omega}\right) \quad (6)$$
where $\Omega$ is a set of preset time intervals used to calculate the instantaneous velocity magnitude of each joint. A larger temporal complexity indicates more intense and unstable motion. As illustrated in Figure 3, the practical meaning of this metric can be intuitively understood through a comparison between static standing (low temporal complexity) and rapid motion (high temporal complexity).
The average temporal complexity of the entire sequence is defined as the mean of the per-frame temporal complexities:
$$\bar{C}_t = \frac{1}{T}\sum_{t=1}^{T} C_t(t) \quad (7)$$
Similarly, to mitigate the scale discrepancy of the temporal variance across extremely diverse motion dynamics, we apply a causal normalization strategy using training-set priors:
$$c_t(t) = \frac{C_t(t)}{C_{t\_\max}^{\mathrm{train}}} \quad (8)$$
where $C_{t\_\max}^{\mathrm{train}}$ is the global maximum temporal complexity recorded over the entire training set. This frozen global anchor serves as a standardized physical benchmark for motion intensity. By decoupling the normalization process from any ground-truth action categories, the model strictly maintains temporal causality during real-time, streaming inference on unknown actions. This ensures that a sample's complexity score reflects its true kinematic volatility within a unified absolute coordinate system, preventing the system from erroneously allocating maximum resources to low-dynamic sequences that exhibit only minor detection jitter. The normalized value $c_t(t)$ roughly ranges within [0, 1].
The above process constitutes the theoretical evaluation method. In practical engineering applications, lightweight approximation or pruning strategies can be introduced to reduce redundant temporal computation while preserving the overall trend of complexity estimation [12]. This is consistent with recent token sparsification paradigms that filter non-informative temporal dynamics in video understanding [24]. This mechanism effectively reduces computational complexity while maintaining consistency in the evaluation trend.
In quantifying temporal complexity, our choice of velocity variance ($\mathrm{Var}(\|v_{t,\Delta t}\|_2)$) rather than global velocity magnitude is a deliberate design strategy aimed at capturing motion unpredictability (i.e., non-linear accelerations and kinematic shifts) rather than mere translational speed. From a spectral perspective, fast but constant-velocity motions exhibit strong low-frequency dominance and can be highly compressed via DCT with minimal information loss. Therefore, our system is designed such that the temporal complexity score strictly peaks during moments of highly variable, discontinuous motion where temporal variance is large. This ensures that the computational budget (temporal window expansion) is exclusively triggered when motion unpredictability strictly demands explicit frame-level tracking, thereby fundamentally eliminating temporal redundancy for smooth, constant-velocity trajectories.
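A corresponding sketch of Equations (5)–(8) is given below. The interval set Ω = {1, 2, 4}, the end-padding used to align the different intervals to a common length, and all names are illustrative assumptions rather than the authors' exact implementation.

```python
# Minimal sketch of the temporal complexity metric (Eqs. (5)-(8)).
import torch

def temporal_complexity(pose_2d: torch.Tensor, ct_max_train: float,
                        omega=(1, 2, 4)) -> torch.Tensor:
    """pose_2d: (T, J, 2). Returns the per-frame normalized temporal complexity c_t(t)."""
    T = pose_2d.shape[0]
    speeds = []
    for dt in omega:
        v = (pose_2d[dt:] - pose_2d[:-dt]) / dt           # Eq. (5): (T - dt, J, 2)
        s = v.norm(dim=-1)                                # per-joint speed, (T - dt, J)
        s = torch.cat([s, s[-1:].expand(dt, -1)], dim=0)  # repeat last value to length T
        speeds.append(s)
    stacked = torch.stack(speeds, dim=0)                  # (|Omega|, T, J)
    c_t = stacked.permute(1, 0, 2).reshape(T, -1).var(dim=1)  # Eq. (6): variance over joints and intervals
    return c_t / ct_max_train                             # Eq. (8)

c_t = temporal_complexity(torch.rand(81, 17, 2), ct_max_train=1.0)
c_t_bar = c_t.mean()                                      # Eq. (7)
```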

2.3.3. Spatial–Temporal Comprehensive Complexity Calculation

Spatial dispersion and temporal variance characterize the physical attributes of human actions in terms of geometric structural tension and temporal evolution rate, respectively. However, these two metrics exhibit significant heterogeneity in dimensions, numerical ranges, and distribution characteristics. Furthermore, they often display complex coupling relationships; for instance, a pose with high spatial complexity is not necessarily accompanied by high-frequency motion. To transform these two independent physical quantities into a unified, monotonic control signal that guides the subsequent adaptive pruning strategy, it is necessary to construct a multi-dimensional feature fusion mechanism to comprehensively measure spatial–temporal complexity on a common semantic scale.
We design a learnable linear weighting and non-linear activation mechanism. A weighted sum is formed using the learnable weight parameters α and β and then mapped to the [0, 1] interval via the Sigmoid function:
$$C(t) = \mathrm{Sigmoid}\big(\alpha\, C_s(t) + \beta\, c_t(t) - \gamma\big) \quad (9)$$
The parameters α, β, and γ are implemented as learnable global scalars, initialized as α = β = 1.0 and γ = 0.0, and are jointly optimized with the network using the AdamW optimizer, without additional loss supervision. During training, their gradients are propagated through the differentiable relaxed-mask form of the adaptive pruning module, while hard binary pruning is applied only during inference. Specifically, α and β control the relative contributions of spatial dispersion Cs(t) and temporal variance Ct(t), while γ serves as an adaptive activation threshold for the fused complexity score.
Although the fusion coefficients are globally shared to maintain a consistent complexity coordinate system across the dataset, the resulting complexity score C(t) remains sample-dependent because its inputs, namely the normalized spatial dispersion Cs(t) and temporal variance Ct(t), are extracted from each individual action sequence. Therefore, different motion patterns can be assigned different complexity scores according to their intrinsic geometric and dynamic characteristics.
The obtained score C(t) is then used as a unified control signal for the subsequent adaptive pruning strategy, guiding the dynamic allocation of temporal window length and retained DCT coefficients. For low-complexity actions, C(t) is suppressed toward a small value, encouraging more aggressive pruning to reduce redundant computation. Conversely, for high-complexity actions, C(t) increases accordingly, allowing the model to preserve more temporal and frequency-domain information for accurate pose estimation.
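The fusion of Equation (9) reduces to three learnable scalars wrapped in a small module. The sketch below follows the initialization stated above (α = β = 1.0, γ = 0.0); the class name and the example inputs are illustrative.

```python
# Minimal sketch of the learnable fusion of Eq. (9); alpha, beta, gamma are
# global scalars trained jointly with the network (no extra loss supervision).
import torch
import torch.nn as nn

class ComplexityFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(1.0))   # weight of spatial dispersion C_s(t)
        self.beta = nn.Parameter(torch.tensor(1.0))    # weight of temporal variance c_t(t)
        self.gamma = nn.Parameter(torch.tensor(0.0))   # adaptive activation threshold

    def forward(self, c_s: torch.Tensor, c_t: torch.Tensor) -> torch.Tensor:
        # Eq. (9): fused complexity score in (0, 1)
        return torch.sigmoid(self.alpha * c_s + self.beta * c_t - self.gamma)

fusion = ComplexityFusion()
score = fusion(torch.tensor(0.35), torch.tensor(0.12))   # e.g. a "Discussion"-like frame
```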

2.4. Time–Frequency Dual-Domain Joint Adaptive Pruning Strategy

Although PoseFormerV2 introduces frequency-domain analysis to reduce computational costs, its “static truncation” paradigm applies fixed temporal window lengths and frequency-domain coefficient counts to all samples during the inference stage. This data-agnostic configuration overlooks dynamic variations in action complexity: for low-complexity stationary actions, excessive temporal windows and spectral bandwidths introduce invalid background noise and computational redundancy; conversely, for high-complexity intense actions, restricted parameter configurations lead to the loss of critical high-frequency details. To address this issue, this paper proposes a time–frequency dual-domain adaptive pruning strategy, as illustrated in Figure 3. Utilizing the comprehensive complexity score generated in Section 2.3, this strategy dynamically generates temporal masks and frequency masks to achieve adaptive allocation of computational resources.
Although the proposed pruning strategy is formulated in terms of temporal frames and DCT coefficients, it follows the general principle of dynamic resource sparsification. In Vision Transformers, previous studies such as DynamicViT [13] have shown that directly removing tokens during training may break batch-level parallelism and block gradient propagation. Therefore, these methods commonly use differentiable mask- or halting-score-based optimization during training, and apply hard token removal only during inference to obtain actual speed-up. Inspired by this training–inference decoupling strategy, CAAPoseFormer represents the adaptive temporal and frequency masks in a relaxed differentiable form during training. The final pose-estimation loss can therefore propagate gradients to the complexity score C ( t ) and the learnable fusion parameters α, β, and γ. During inference, the relaxed masks are converted into hard binary masks by thresholding or rounding the corresponding adaptive lengths, so that temporal frames and DCT coefficients can be physically retained or discarded for computational reduction.
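The training–inference decoupling can be illustrated with a single helper that returns a soft, differentiable mask during training and a hard binary mask at inference. The sharpness constant tau and the function name are assumptions made for this sketch, not values reported by the authors.

```python
# Sketch of the relaxed (training) vs. hard (inference) mask pattern described above.
import torch

def length_mask(max_len: int, adaptive_len: torch.Tensor,
                training: bool, tau: float = 10.0) -> torch.Tensor:
    """Returns a (max_len,) mask that keeps roughly the first `adaptive_len` positions."""
    pos = torch.arange(max_len, dtype=torch.float32)
    if training:
        # relaxed mask: a smooth sigmoid roll-off keeps gradients w.r.t. adaptive_len
        return torch.sigmoid(tau * (adaptive_len - pos))
    # hard mask at inference: positions beyond the adaptive length are pruned
    return (pos < adaptive_len).float()

soft = length_mask(16, torch.tensor(9.4), training=True)
hard = length_mask(16, torch.tensor(9.4), training=False)
```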

2.4.1. Adaptive Temporal Window Pruning

Existing video pose estimation models generally employ a fixed-length temporal window. However, this setting overlooks the inherent differences in the temporal evolution rates of different actions. For slowly varying actions (e.g., sitting), an excessively long temporal window invariably introduces highly similar redundant frames [25]. As recent spatiotemporal pruning studies note, sampling video frames at a uniform rate wastes computation because the informativeness of individual frames varies widely across a sequence [13]. At the other extreme, fast-paced activities such as boxing require broad temporal windows so the network does not fragment the motion cycle or drop vital long-range dependencies. To escape this trade-off and allocate compute strictly where it matters, we design an adaptive temporal window pruning mechanism. Driven by the comprehensive complexity score $C$, the network scales its receptive field dynamically, expanding or shrinking the window according to how intensely the action evolves over time.
As illustrated in Figure 3, given an input Frame Sequence Tensor, this module first calculates the adaptive temporal window length $T_{\mathrm{adaptive}}$ based on the comprehensive complexity score $C$. To ensure the window length fluctuates within a reasonable range, we adopt a linear mapping that projects the normalized complexity score $C \in [0, 1]$ onto a preset window range $[T_{\min}, T_{\max}]$:
$$T_{\mathrm{adaptive}} = \begin{cases} T_{\min} + (T_{\max} - T_{\min}) \cdot \dfrac{C - C_{\min}}{C_{\max} - C_{\min}}, & \text{if } C_{\min} \le C \le C_{\max} \\ T_{\min}, & \text{if } C < C_{\min} \\ T_{\max}, & \text{if } C > C_{\max} \end{cases} \quad (10)$$
where $T_{\min}$ and $T_{\max}$ are the preset minimum and maximum temporal window lengths, respectively, and $C_{\min}$ and $C_{\max}$ are the statistical extremes of the complexity score.
Fixed Training-Set Prior Normalization. The boundary parameters C m i n and C m a x are computed offline from the training set and kept frozen during inference. They provide a stable normalization benchmark for mapping the complexity score into the adaptive resource allocation space. In the current framework, no online updating, sliding-window statistics, or test-set statistics are used to recalibrate these bounds. This fixed-prior design preserves the comparability of complexity scores across samples, prevents low-complexity sequences from being artificially amplified by local statistics, and avoids additional state-maintenance overhead on edge devices.
The calculated $T_{\mathrm{adaptive}}$ determines the retained "Selected Window" $L_t$. Based on this, the system generates a corresponding temporal mask $M_T$. During training, $M_T$ is represented in a relaxed differentiable form rather than as a hard indexing operation, which avoids forcing gradients to pass through a discrete selection boundary. In practice, the relaxed mask softly suppresses less informative temporal positions through element-wise multiplication with the original tensor, instead of physically deleting frames from the computational graph. During inference, $M_T$ is converted into a hard binary mask, in which the central region corresponding to the Selected Window is set to 1 (retained) and the edge regions on both sides are set to 0 (pruned). At this stage, the selected temporal region is retained and the remaining positions are discarded or marked as invalid for subsequent sparse attention, thereby reducing the number of valid temporal tokens participating in subsequent computation while maintaining a favorable overall accuracy–efficiency trade-off [26].
It is worth discussing whether utilizing global bounds ($C_{\min}$ and $C_{\max}$) contradicts the premise of sample-level dynamic allocation. In fact, these global priors act as an essential absolute coordinate system. If normalization were performed purely locally within each sample, an entirely static sequence with negligible motion variance could be falsely scaled up to a maximum complexity score due to the local amplification of subtle detection noise. By anchoring the normalization to frozen global bounds derived from the training set, we ensure that a sample's complexity score reflects its estimated spatial–temporal complexity relative to the overall distribution of human motion. Thus, the global fixed priors are mathematically necessary to prevent trivial inputs from exhausting computational resources, thereby supporting stable and meaningful sample-level adaptation.
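A minimal sketch of the mapping in Equation (10) and of the centre-window selection described above is shown next. The bound values (9, 81, 0.05, 0.95) are illustrative placeholders; the paper fixes only the frozen-prior principle, not these exact numbers.

```python
# Sketch of the complexity-to-window mapping of Eq. (10) with frozen priors.
import torch

def adaptive_window(c: torch.Tensor, t_min: int, t_max: int,
                    c_min: float, c_max: float) -> torch.Tensor:
    """Maps a sequence-level complexity score C to a temporal window length."""
    c = c.clamp(c_min, c_max)                        # saturate outside the training-set priors
    frac = (c - c_min) / (c_max - c_min)
    return t_min + (t_max - t_min) * frac            # kept continuous during training

# Inference: round to an integer window and keep the central frames of a 243-frame clip
T = 243
length = int(adaptive_window(torch.tensor(0.23), 9, 81, 0.05, 0.95).round())
start = (T - length) // 2
selected = slice(start, start + length)              # hard "Selected Window" at inference
```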

2.4.2. Adaptive Frequency-Domain Coefficient Pruning

While Discrete Cosine Transform (DCT) compression isolates the natural smoothness of human movement, locking the network into a fixed number of low-frequency coefficients creates a bottleneck. If the action is already smooth, retaining a preset block of coefficients just pads the model with redundant data. On the other hand, fast, highly erratic movements need high-frequency details to stay accurate—cutting them off at a hard threshold destroys the reconstruction. Instead of forcing every action through the same rigid filter, our module actively links the spectral bandwidth to the sequence’s estimated spatial–temporal complexity, retaining features only when the motion dynamics strictly demand it. Specifically, it applies more aggressive compression to low-complexity actions while retaining more coefficients for high-complexity actions, so as to balance efficiency and reconstruction fidelity [6].
Adaptive DCT Compression: We preset a set of candidate DCT coefficient counts $K_1, K_2, \ldots, K_n$. Similarly, the retained count $K_{\mathrm{adaptive}}$ is selected based on the mapping of the complexity score $C$. After performing the DCT along the temporal dimension, the adaptive frequency-domain mask $M_K$ is generated according to $K_{\mathrm{adaptive}}$. As shown in the lower part of Figure 3, the adaptive DCT coefficient length is calculated as follows:
$$K_{\mathrm{adaptive}} = \begin{cases} K_{\min} + (K_{\max} - K_{\min}) \cdot \dfrac{C - C_{\min}}{C_{\max} - C_{\min}}, & \text{if } C_{\min} \le C \le C_{\max} \\ K_{\min}, & \text{if } C < C_{\min} \\ K_{\max}, & \text{if } C > C_{\max} \end{cases} \quad (11)$$
Here, $K_{\min}$ and $K_{\max}$ represent the preset lower and upper bounds on the number of retained frequency-domain coefficients (e.g., set to 4 and 32, respectively). This formula ensures that the number of retained coefficients $K_{\mathrm{adaptive}}$ is positively correlated with the estimated action complexity and is strictly constrained within the effective bandwidth range. Based on $K_{\mathrm{adaptive}}$, the system constructs the frequency mask $M_K$. During training, $M_K$ is represented as a relaxed differentiable mask to support gradient-based optimization, rather than as a hard frequency cutoff, which avoids backpropagating directly through a discrete coefficient-selection boundary. The relaxed mask softly weights the frequency-domain coefficients during optimization, allowing gradients to flow through the adaptive pruning module. During inference, $M_K$ is converted into a hard binary mask, in which the first $K_{\mathrm{adaptive}}$ low-frequency positions of the spectral tensor are set to 1 and the remaining high-frequency regions are set to 0. Thus, the first $K_{\mathrm{adaptive}}$ low-frequency coefficients are retained and the remaining coefficients are pruned, enabling actual time–frequency pruning during inference.
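The following sketch combines a plain (unnormalized) DCT-II along the temporal axis with the mapping of Equation (11). The dct_matrix helper and the default complexity bounds (0.05/0.95) are illustrative; only the 4/32 coefficient range is taken from the text.

```python
# Sketch of adaptive frequency pruning: DCT along time, then keep only the
# first K_adaptive low-frequency coefficients (hard pruning at inference).
import math
import torch

def dct_matrix(n: int) -> torch.Tensor:
    k = torch.arange(n).unsqueeze(1).float()
    t = torch.arange(n).unsqueeze(0).float()
    return torch.cos(math.pi / n * (t + 0.5) * k)           # (n, n) unnormalized DCT-II basis

def adaptive_dct_prune(seq: torch.Tensor, c: float,
                       k_min: int = 4, k_max: int = 32,
                       c_min: float = 0.05, c_max: float = 0.95) -> torch.Tensor:
    """seq: (T, D) per-frame features; returns (K_adaptive, D) low-frequency coefficients."""
    T = seq.shape[0]
    coeffs = dct_matrix(T) @ seq                             # DCT along the temporal axis
    frac = (min(max(c, c_min), c_max) - c_min) / (c_max - c_min)
    k_adaptive = int(round(k_min + (k_max - k_min) * frac))  # Eq. (11)
    return coeffs[:k_adaptive]                               # prune the high-frequency rows

coeffs = adaptive_dct_prune(torch.rand(81, 34), c=0.7)       # 17 joints x 2 coords = 34 dims
```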

2.5. Mask-Guided Variable-Length Sequence Sparse Interaction Encoding Mechanism

Although the time–frequency dual-domain adaptive pruning strategy effectively reduces the spatiotemporal redundancy of data, the unstructured variable-length sequences it outputs are incompatible with the static tensor parallel computing mode relied on by hardware such as GPUs. Although the traditional zero-padding strategy achieves physical dimension alignment, it introduces invalid computations and weakens the theoretical efficiency gains achieved by token pruning [27]. To address this issue, this paper proposes a mask-guided variable-length sequence sparse interaction encoding mechanism. As shown in Figure 4, a unified physical interaction space is established through multi-dimensional feature fusion; second, a global dynamic mask matrix is generated based on the effective feature length; and finally, a sparse self-attention operator is used to block the gradient propagation and feature aggregation of invalid padding regions in the computation flow.

2.5.1. Joint Alignment of Time–Frequency Features

After the pre-pruning process, the numbers of features retained in the time and frequency domains by different samples in a batch (i.e., $N_T^{(i)}$ and $N_K^{(i)}$) differ in a highly unstructured way. The traditional Transformer architecture usually assumes a fixed-length regular tensor as input, making it difficult to process such variable-length sequences with uneven physical dimensions in parallel without severe computational bottlenecks [28]; simply discarding short sequences or truncating long ones would destroy the integrity of the action or introduce bias. To this end, the time–frequency feature joint module establishes a unified physical alignment standard, mapping time- and frequency-domain features into a shared semantic space while remaining compatible with GPU parallel computation, laying the foundation for subsequent joint interaction.
As shown in Figure 4, the model first needs to solve the alignment problem of heterogeneous variable-length features. Let the batch size be $B$. For each sample $i$ ($i \in [1, B]$) in the batch, the input consists of two independent variable-length feature sequences from the spatial and frequency domains. The spatial keyframe feature sequence selects $N_T^{(i)}$ keyframes through the adaptive temporal window strategy, with original dimension $\mathbb{R}^{N_T^{(i)} \times (J \times 2)}$; similarly, the frequency-domain feature sequence selects $N_K^{(i)}$ effective low-frequency coefficients through the adaptive DCT strategy. Because the dynamic lengths $N_T^{(i)}$ and $N_K^{(i)}$ differ significantly across samples in the batch, the system first computes the maximum spatial and frequency-domain lengths in the current batch, $L_T = \max_{i=1}^{B} N_T^{(i)}$ and $L_{DCT} = \max_{i=1}^{B} N_K^{(i)}$, to satisfy the model's parallel computing requirements. Zero-padding is then applied at the end of each sample's feature sequence, and the features are projected to a unified feature dimension $d$ through a linear layer. To construct the unified interaction space shown in Figure 4, we concatenate the aligned spatial features $F_T^{(i)}$ and frequency-domain features $F_{DCT}^{(i)}$ along the sequence dimension, as expressed in Equation (12):
$$F_{\mathrm{concat}}^{(i)} = \left[ F_T^{(i)} ; F_{DCT}^{(i)} \right] \in \mathbb{R}^{(L_T + L_{DCT}) \times d} \quad (12)$$
where $L_{\mathrm{total}} = L_T + L_{DCT}$ and $F_{\mathrm{concat}}^{(i)}$ is the fused feature tensor of length $L_{\mathrm{total}}$.
Although temporal features (carrying physical timestamps) and frequency DCT coefficients (representing global spectral energy) originate from distinct semantic domains, we perform an early-stage implicit alignment prior to concatenation. Specifically, the independent linear projection layers applied to F T and F D C T serve to map these heterogeneous tokens into a shared d -dimensional latent space. This design enables the subsequent self-attention mechanism to process them unifiedly. By relying on this implicit domain alignment rather than introducing explicit, learnable modality embeddings, we strictly adhere to our lightweight design philosophy, effectively avoiding the parameter bloat associated with multi-domain disentanglement mechanisms while maintaining cross-domain interaction efficiency.
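A possible implementation of the alignment and concatenation step (Equation (12)) is sketched below. The per-sample list input, the shared projection dimension d = 256, and all function names are assumptions. Padded positions still pass through the linear layers here, which is harmless because the joint mask built in the next subsection excludes them from attention.

```python
# Sketch of zero-padding to the batch maxima, projection to a shared dimension d,
# and concatenation along the sequence axis (Eq. (12)).
import torch
import torch.nn as nn

def align_and_concat(time_feats, freq_feats, proj_t: nn.Linear, proj_f: nn.Linear):
    """time_feats / freq_feats: lists of (N_T^(i), J*2) and (N_K^(i), J*2) tensors."""
    L_T = max(f.shape[0] for f in time_feats)
    L_DCT = max(f.shape[0] for f in freq_feats)
    batch, lengths = [], []
    for ft, ff in zip(time_feats, freq_feats):
        pt = torch.zeros(L_T, ft.shape[1]); pt[: ft.shape[0]] = ft      # pad temporal tokens
        pf = torch.zeros(L_DCT, ff.shape[1]); pf[: ff.shape[0]] = ff    # pad DCT tokens
        batch.append(torch.cat([proj_t(pt), proj_f(pf)], dim=0))        # Eq. (12)
        lengths.append((ft.shape[0], ff.shape[0]))                      # valid lengths for masking
    return torch.stack(batch), lengths                                  # (B, L_T + L_DCT, d)

proj_t, proj_f = nn.Linear(34, 256), nn.Linear(34, 256)
feats, lengths = align_and_concat([torch.rand(27, 34), torch.rand(63, 34)],
                                  [torch.rand(6, 34), torch.rand(21, 34)],
                                  proj_t, proj_f)
```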

2.5.2. Construction of Global Dynamic Mask Matrix

After the physical alignment is completed, the input tensor inevitably contains a large number of meaningless zero elements. The standard self-attention mechanism lacks an inherent ability to distinguish "effective signals" from "padding noise" and often treats the padding region as a low-energy background signal that still participates in the computation. This vague boundary definition leads to two serious consequences: first, the model may inadvertently learn patterns from the padding regions and absorb irrelevant structural noise [29]; second, it cannot distinguish the interaction boundary between time-domain and frequency-domain features. To address this defect, the global dynamic mask construction module reconstructs the interaction topology at the logical level, forcefully delimiting the effective computation region by constructing an accurate binary index matrix to ensure that the model attends only to real features.
To accurately distinguish real features from padding regions in the unified input tensor, the model constructs the global joint mask matrix shown in Figure 4. First, this paper defines two types of basic binary masks: the temporal window mask $M_t$, which marks the validity of spatial features, and the DCT coefficient mask $M_{DCT}$, which marks the validity of frequency-domain features. They are defined in Equations (13) and (14), respectively:
$$M_t^{(i)}[j] = \begin{cases} 1, & \text{if } j \le N_T^{(i)} \\ 0, & \text{if } j > N_T^{(i)} \end{cases} \quad (13)$$
$$M_{DCT}^{(i)}[j] = \begin{cases} 1, & \text{if } j \le N_K^{(i)} \\ 0, & \text{if } j > N_K^{(i)} \end{cases} \quad (14)$$
Subsequently, the two masks are concatenated along the sequence dimension to obtain the fused mask vector $M_{\mathrm{concat}} = [M_t ; M_{DCT}]$. Based on this, the two-dimensional extended mask matrix $E$ (i.e., $M_{\mathrm{mask}}$ in Figure 4) used in self-attention is constructed according to Equation (15):
$$E_{u,v} = \begin{cases} 0, & \text{if } M_{\mathrm{concat}}[u] = 1 \ \text{and} \ M_{\mathrm{concat}}[v] = 1 \\ -\infty, & \text{otherwise} \end{cases} \quad (15)$$
Here, $u$ and $v$ are the position indices of the query and key, respectively. This formula corresponds to the interaction logic in the table in Figure 4: the mask value is 0 only when both the query and the key come from the effective region (effective $F_t$ or effective $F_k$); all other cases involving the padding region are forced to $-\infty$, thereby logically cutting off the path of invalid information.
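The mask construction of Equations (13)–(15) amounts to an outer product of a boolean validity vector with itself. The sketch below builds the additive mask E for one sample; the argument names are chosen for illustration.

```python
# Sketch of Eqs. (13)-(15): joint validity vector expanded into the additive
# attention mask E (0 for valid-valid pairs, -inf for any pair touching padding).
import torch

def build_extended_mask(n_t: int, n_k: int, L_T: int, L_DCT: int) -> torch.Tensor:
    m_t = torch.arange(L_T) < n_t              # Eq. (13): valid temporal positions
    m_dct = torch.arange(L_DCT) < n_k          # Eq. (14): valid frequency positions
    m_concat = torch.cat([m_t, m_dct])         # joint validity vector
    valid_pair = m_concat.unsqueeze(0) & m_concat.unsqueeze(1)
    E = torch.zeros(L_T + L_DCT, L_T + L_DCT)
    E[~valid_pair] = float("-inf")             # Eq. (15): block any pair involving padding
    return E

E = build_extended_mask(n_t=27, n_k=6, L_T=63, L_DCT=21)   # sample shorter than batch maxima
```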

2.5.3. Mask-Guided Sparse Self-Attention

The self-attention operator of the traditional Transformer adopts a fully connected dense computation mode, and its computational complexity and memory consumption are dominated by the quadratic term of the maximum sequence length in the batch, $O(L_{\max}^{2})$. Even if the input contains extensive padding, the Softmax operation still assigns non-zero gradient weights to these invalid positions, which not only squanders computing resources but also makes it difficult to convert theoretical FLOP reductions into actual on-device inference latency improvements [30]. To break through this efficiency bottleneck, the mask-guided sparse self-attention module injects logical masks into the computation flow and "zeros out" the weights of invalid regions through mathematical means, thereby achieving sparse acceleration at the operator level while maintaining the parallelism of matrix operations.
Finally, the constructed extended mask E is injected into the sparse self-attention module in the upper right corner of Figure 4. The fused feature tensor is first linearly mapped onto the query matrix Q , key matrix K , and value matrix V . To shield interference from the padding region, this paper loads the mask E onto the standard scaled dot-product attention score, and the final self-attention output is calculated according to Equation (16):
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}} + E\right) V \quad (16)$$
While Equation (16) formulates the logical masking, standard naive execution in deep learning frameworks would still allocate Tensor Cores to perform dense multiply–accumulate (MAC) operations over these padded regions. To bridge the gap between theoretical sparsity and physical hardware acceleration, our implementation explicitly leverages variable-length optimized CUDA kernels, specifically the cu_seqlens mechanism supported by FlashAttention-2 (or xFormers’ memory_efficient_attention). This specialized low-level architecture allows the GPU to natively process unpadded, variable-length sequences packed within a single batch. By utilizing the sequence length indices (e.g., N T ( i ) + N K ( i ) ), the kernel physically bypasses the memory reads, writes, and MAC operations for all padded zero-regions at the hardware level. Consequently, the effective computational complexity is strictly bounded by the true feature length of each sample, faithfully realizing the theoretical complexity formulated in Equation (17):
$$O = \sum_{i=1}^{B} \left( N_T^{(i)} + N_K^{(i)} \right)^{2} d \quad (17)$$
When the true sample length $N_T^{(i)} + N_K^{(i)}$ is much smaller than the maximum batch length $L_{\mathrm{total}}$, the reduction of the quadratic term in Equation (17) yields a substantial improvement in computational efficiency, which theoretically demonstrates the advantage of the architecture in Figure 4 when processing variable-length sequences.
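For reference, a dense PyTorch realization of Equation (16) is sketched below; it applies the additive mask inside the softmax, whereas the deployed variant would dispatch to variable-length kernels (e.g., the cu_seqlens path mentioned above) to skip the padded regions physically. The toy mask and shapes are illustrative.

```python
# Dense sketch of the mask-guided attention of Eq. (16).
import math
import torch

def masked_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     E: torch.Tensor) -> torch.Tensor:
    """q, k, v: (B, L, d_k); E: (L, L) additive mask with 0 / -inf entries."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1]) + E   # Eq. (16)
    weights = torch.softmax(scores, dim=-1)
    weights = torch.nan_to_num(weights)        # rows that are entirely padding become all-zero
    return weights @ v

B, L, d_k = 2, 84, 64
E = torch.zeros(L, L)
E[:, 40:] = float("-inf")                      # toy example: last 44 key positions are padding
out = masked_attention(torch.rand(B, L, d_k), torch.rand(B, L, d_k),
                       torch.rand(B, L, d_k), E)
```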

3. Results and Discussion

3.1. Datasets and Evaluation Metrics

3.1.1. Dataset Selection

This study conducts validation on Human3.6M [31], one of the most representative benchmark datasets for 3D human pose estimation. The dataset was captured by four synchronized cameras at 50 Hz and contains 3.6 million high-quality human pose images. The experiments cover 15 daily indoor activities (e.g., walking, discussion, eating) performed by 11 subjects. Following the standard protocol in the literature, data from subjects S1, S5, S6, S7, and S8 are used for training, while subjects S9 and S11 are reserved for evaluation.
For input processing, this work uses 2D keypoint sequences extracted by a CPN (Cascaded Pyramid Network) detector [17] fine-tuned on Human3.6M as the raw inputs, which are then lifted to the 3D space for accuracy evaluation. To accommodate the proposed variable-length Transformer, the input sequences are dynamically sampled within an adaptive temporal window according to motion complexity.

3.1.2. Definition of Evaluation Metrics

We quantify reconstruction accuracy through two complementary protocols. The primary metric, Mean Per-Joint Position Error (MPJPE), calculates the absolute Euclidean distance (in mm) from our predictions to the ground-truth joints. However, absolute coordinates can sometimes obscure local structural fidelity. Therefore, we additionally track the Procrustes-aligned MPJPE (P-MPJPE), which explicitly neutralizes global rigid transformations to isolate and evaluate the true anatomical alignment of the predicted skeleton.
For efficiency and model characterization, a complexity-awareness indicator (CE) is used to qualitatively determine whether the model supports adaptive resource allocation. Quantitatively, the number of parameters (Params, M) is reported to reflect the static storage cost, while multiply–accumulate operations (MACs, G) are used to measure the overall inference complexity. In particular, to account for the variable-length input setting in this work, we further introduce per-frame computation (MACs/frame, M), defined as the total MACs divided by the number of effective frames. This metric provides a more objective estimate of the actual runtime efficiency and compute utility under adaptive execution paths.
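For completeness, the two accuracy metrics can be computed as follows. The Procrustes alignment shown (a similarity transform solved via SVD) is the standard formulation of P-MPJPE; the random tensors in the usage lines are placeholders.

```python
# Sketch of MPJPE and Procrustes-aligned MPJPE for a single (J, 3) pose in millimetres.
import torch

def mpjpe(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Mean per-joint position error: average joint-wise Euclidean distance."""
    return (pred - gt).norm(dim=-1).mean()

def p_mpjpe(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """MPJPE after rigid alignment (scale, rotation, translation) of pred to gt."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    X, Y = pred - mu_p, gt - mu_g
    U, S, Vh = torch.linalg.svd(X.T @ Y)              # optimal rotation via SVD
    R = U @ Vh
    if torch.linalg.det(R) < 0:                       # avoid reflections
        U[:, -1] = -U[:, -1]; S[-1] = -S[-1]; R = U @ Vh
    scale = S.sum() / (X ** 2).sum()
    return mpjpe(scale * X @ R + mu_g, gt)

pred, gt = torch.rand(17, 3) * 100, torch.rand(17, 3) * 100
err, aligned_err = mpjpe(pred, gt), p_mpjpe(pred, gt)
```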

3.2. Experimental Setup and Hyperparameter Configuration

The experiments are implemented using PyTorch 1.13.0 with CUDA 11.7 [32], and all training and inference are conducted on a workstation equipped with a single NVIDIA RTX 3090 GPU. Crucially, to instantiate the hardware-level sparse acceleration discussed in Section 2.5.3, the variable-length self-attention operators are integrated with the FlashAttention-2 backend, enabling exact computation of dynamic sequence lengths without the overhead of dense padding. To explicitly address the critical challenge of realizing actual latency reduction for unstructured variable-length sequences on static tensor computation graphs, we meticulously align our operator design with the underlying hardware architecture. The utilized NVIDIA RTX 3090 GPU, built on the Ampere architecture, natively supports asynchronous memory copying and hardware-level variable-length attention kernels (e.g., cu_seqlens). Furthermore, to strictly adhere to Tensor Core memory alignment specifications and maximize global memory throughput, the feature alignment dimension d is deliberately configured as a multiple of 64 (specifically, d = 256 in our base setting). This hardware-aware alignment ensures that the theoretical efficiency gains achieved by avoiding zero-padding natively bypass memory-bound bottlenecks, translating our mask-guided sparse operations into verifiable wall-clock speedups. To systematically evaluate performance and generalization under different temporal receptive fields, the experimental setup follows the PoseFormerV2 protocol by configuring the input sequence length T to 27, 81, and 243 frames for comparative studies.
For adaptive parameterization, the model no longer relies on fixed truncation thresholds; instead, it dynamically adjusts the computation budget according to the spatiotemporal complexity score C. Under the standard T = 243 configuration, the adaptive policy yields an average effective temporal window length ranging from 11.6 to 80.3 frames across action categories (47.2 frames on average over the full set), and retains 6.1 to 34.7 DCT frequency coefficients on average (20.9 coefficients on average over the full set). Moreover, for fine-grained analysis of complexity characteristics, 313 representative sequences spanning all 15 action classes are selected from the test set for detailed evaluation.

3.3. Experimental Results

3.3.1. Complexity-Aware Adaptive Parameterization

We first break down the runtime behavior of the Complexity-Aware Module by plotting its spatial ( C s ) , temporal ( C t ), and combined ( C ) tracking indices. As shown in Figure 5 and Figure 6, to compare the model’s response characteristics under different motion patterns, we examine two representative actions: “Discussion”, which contains persistent subtle motions, and “Sitting Down”, which exhibits pronounced transitions between static and dynamic states.
Our tracking data confirm that sequence complexity fluctuates independently along the spatial and temporal axes. Specifically, the spatial complexity metric ($C_s$) for the "Discussion" sequence oscillates primarily between 0.2 and 0.4. These quasi-periodic shifts correlate directly with the arm gestures and posture adjustments made during conversation (e.g., raising and opening the arms), which produce recurrent expansions and contractions of the body configuration.
In the temporal domain, the temporal complexity curve ( C t ) manifests as a series of intermittent spikes. These peaks align with abrupt velocity changes at moments of pose transition, thereby capturing the temporal discontinuities inherent to the motion dynamics.
Crucially, the fused complexity index ( C , purple dashed line) avoids the saturation seen in extreme motions, settling near a stable baseline of 0.23 rather than approaching 1.0. This numerical convergence indicates the model processes “Discussion” as locally active but macroscopically stationary. By isolating these low-amplitude variations from high-intensity activities, the system retains enough bandwidth for semantic details without triggering maximum resource allocation.
We observe a starkly different dynamic-to-static profile for “Sitting Down” (Figure 6). Here, all tracking curves spike during the initial rapid descent and then immediately plunge toward zero once the subject is seated. This sharp drop-off proves the evaluation module acts as a strict filter, reliably cutting off redundant compute sequences the moment an action transitions into a fully static state.
Within the CAAPoseFormer pipeline, these real-time complexity estimates strictly govern both the temporal window size and the retained DCT coefficients. Table 1 aggregates the average complexity scores across all 15 action categories, listing the exact resource constraints (mean temporal length T and coefficient count K f ) the model autonomously assigned to each class.
Figure 7 tracks the parameter updates and resource scaling for the “Photo” sequence across 80 epochs. In the initial phase, both the spatial ( α ) and temporal ( β ) weights climb sharply. This rapid surge indicates the network quickly detects the heavy self-occlusion inherent in this specific pose, pushing the overall complexity score ( C ) past the activation threshold almost immediately from a cold start.
Between epochs 20 and 60, the curves fluctuate heavily. This variance reflects the optimizer actively balancing two competing goals: minimizing compute operations while retaining enough high-frequency information to prevent over-smoothing. As illustrated in Figure 7, the learnable weights α and β converge smoothly after approximately 50 epochs, and the "Complexity Score Convergence" and "Adaptive Resource Allocation" plots show that the system reaches a steady state, with $C$ converging at 0.794. At this steady state, the model consistently assigns the "Photo" action a stable budget of 66.1 frames and 28.8 DCT coefficients ($K$). This convergence behavior validates the mathematical stability of the learnable fusion mechanism: it does not oscillate during inference and provides a robust, predictable computational budget, eliminating the need for manual parameter tuning in real-time deployment.

3.3.2. Per-Action Accuracy Comparison

Table 2 details the performance across all 15 Human3.6M action classes. CAAPoseFormer outpaces the PoseFormerV2 baseline specifically on sequences demanding extended temporal context. We observe clear error reductions on “Walking” (31.3 mm vs. 31.8 mm), “Walking Together” (30.2 mm vs. 32.2 mm), and “Walking Dog” (45.7 mm vs. 46.1 mm). The adaptive temporal window ( L t ) drives these gains by scaling the receptive field to strictly frame complete motion cycles without absorbing surrounding noise. A different mechanism stabilizes endpoint-heavy actions like “Photo” (48.9 mm vs. 51.6 mm) and “Phoning” (44.9 mm vs. 46.5 mm). For these sequences, the dual-domain pruning strategy ( K f ) acts as an action-specific spectral filter, stripping away non-essential high-frequency jitter. However, a detailed per-action analysis reveals that our adaptive strategy does not yield uniform improvements across all categories. Specifically, performance degradation is observed in categories such as “Discussion” (45.6 mm), “Eating” (49.1 mm), and “Sitting Down” (62.7 mm) compared to PoseFormerV2 [6]. This degradation indicates that for actions characterized by compact body configurations or subtle local motion, the complexity-aware pruning strategy may over-compress informative cues. In “Eating,” for instance, the subtle upper-limb articulations and hand-to-face interactions may be partially lost when the model allocates a restricted temporal window or reduced frequency bandwidth. Similarly, “Sitting Down” remains sensitive to pruning due to its inherent depth ambiguity and pose contraction. We accept these specific degradations as a necessary operational compromise to secure the network’s 64.8% reduction in total FLOPs.
In broader comparisons with mainstream architectures, the proposed method achieves a lower average error (44.2 mm) than conventional graph-based or fully convolutional approaches such as Graph (51.9 mm) and VPose (46.8 mm), corroborating the effectiveness of Transformer-based models in capturing long-range dependencies in skeletal sequences. Relative to the current state-of-the-art MHFormer, which attains the lowest mean error of 43.0 mm due to its multi-hypothesis modeling, CAAPoseFormer exhibits a 1.2 mm higher average error. Nonetheless, under highly dynamic scenarios such as “Walking Together”, CAAPoseFormer achieves a lower error (30.2 mm) than MHFormer (30.6 mm). Therefore, CAAPoseFormer is best understood as a framework that prioritizes overall system efficiency and competitive mean accuracy, accepting marginal per-class trade-offs to enable real-time executability on resource-constrained platforms. Overall, the results suggest that MHFormer remains superior in terms of absolute accuracy, whereas CAAPoseFormer substantially reduces computational demand via mask-driven sparse computation, offering a competitive lightweight solution for latency-sensitive applications.

3.3.3. Evaluation of Temporal Smoothness and Kinematic Plausibility

To evaluate the practical effectiveness of the adaptive mechanism, we measured the Adaptive Trigger Rate (ATR) across the test sets. Under the T = 243 configuration, 100% of the samples utilized a temporal window smaller than Tmax. Specifically, the autonomously assigned windows ranged from 11.6 to 80.3 frames, with a global average of 47.2 frames, so the theoretical efficiency gains translate into wall-clock speedups across diverse action patterns. Beyond this efficiency, video-based applications also require continuous temporal smoothness in addition to static spatial precision, so we assess kinematic quality as well. As presented in Table 3, despite this dynamic resource allocation, CAAPoseFormer achieves an MPVE of 2.9 mm, demonstrating smoothness competitive with fully static architectures such as PoseFormerV2. This stability is primarily attributed to our frequency-domain adaptation strategy: instead of discretely dropping intermediate frames, which would disrupt physical continuity, the model truncates high-frequency DCT coefficients. This mechanism intrinsically acts as a temporal low-pass filter, smoothing high-frequency positional jitter while accelerating inference.
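The low-pass effect of coefficient truncation can be illustrated directly on a joint trajectory. The sketch below (NumPy/SciPy, illustrative only) keeps the first k DCT coefficients along the time axis and reconstructs the sequence; in the actual model the adaptive truncation operates on pose tokens with a per-sample k, so this is a simplified analogue rather than the implementation.

```python
import numpy as np
from scipy.fft import dct, idct

def truncate_dct(sequence, k):
    """Keep only the first k DCT coefficients along the time axis and reconstruct.

    sequence: (T, J, 2) array of 2D joint trajectories; k: retained coefficients.
    Zeroing the high-frequency terms acts as a temporal low-pass filter, which is
    why spectral pruning preserves smoothness better than dropping frames.
    """
    coeffs = dct(sequence, type=2, norm="ortho", axis=0)   # (T, J, 2) spectrum
    coeffs[k:] = 0.0                                       # discard high-frequency terms
    return idct(coeffs, type=2, norm="ortho", axis=0)      # smoothed reconstruction

# Example: an 81-frame random-walk "trajectory" for 17 joints, keeping 29 coefficients
seq = np.cumsum(np.random.randn(81, 17, 2), axis=0)
smoothed = truncate_dct(seq, k=29)
```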

3.4. Comprehensive Performance and Computational Overhead Analysis

Table 4 evaluates model capacity against recent baseline architectures to see if CAAPoseFormer is genuinely viable for real-world edge deployment. As can be observed, CAAPoseFormer contains only 2.1 M parameters, demonstrating a clear advantage in storage efficiency. Relative to the baseline PoseFormerV2, CAAPoseFormer reduces the parameter count by 12.2 M, corresponding to an 85.3% decrease, which substantially lowers the memory footprint on resource-constrained edge computing devices, such as the Orange Pi 5 Pro (equipped with the Rockchip RK3588S NPU). Compared with accuracy-oriented heavy models such as MotionBERT and MixSTE, the parameter count is further reduced by 95.1% and 93.8%, respectively. Despite its substantially smaller capacity, CAAPoseFormer maintains a competitive MPJPE of 44.2 mm (improving by 7.7 mm over GraphSH and by 2.6 mm over VPose). This suggests that, enabled by the proposed time–frequency dual-domain pruning strategy, the model can represent human pose topology effectively with minimal information redundancy, achieving a more favorable balance between lightweight deployment and representational power.
In terms of computational efficiency and inference overhead, CAAPoseFormer requires only 0.18 G MACs, indicating strong potential for real-time processing. Compared with PoseFormerV2, the computational cost is reduced by 64.0%, and the average per-frame computation decreases from 528 M to 160 M MACs, effectively mitigating the compute waste associated with full-budget inference. When compared with MHFormer, which is also Transformer-based, CAAPoseFormer reduces MACs by 97.4%, achieving an order-of-magnitude speedup at the cost of only a 1.2 mm increase in error. Although MotionBERT attains state-of-the-art accuracy (39.2 mm), it incurs 174.7 G MACs, which is approximately 970× that of CAAPoseFormer, making it difficult to deploy on compute-limited edge devices. In contrast, CAAPoseFormer leverages a mask-guided sparse interaction mechanism that physically blocks computation over invalid regions at the operator level, thereby maximizing compute utility and making it more suitable for latency-sensitive, real-time motion capture scenarios. As reported in Table 4, HDFormer achieves a highly impressive MPJPE of 42.6 mm with only 6 M MACs/frame. This extreme theoretical efficiency is achieved through a High-Order Directed Transformer architecture that optimizes spatial joint-to-joint interactions. However, CAAPoseFormer follows a fundamentally different methodological trajectory; while HDFormer optimizes the static graph structure, our framework introduces a dynamic routing mechanism that adapts the sequence length and frequency bandwidth based on motion complexity. This sample-level adaptivity provides a unique perspective on efficiency, focusing on computational utility rather than fixed architectural priors.
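To make the masking behavior concrete, the sketch below shows how zero-padded frames can be excluded from temporal self-attention via a key-padding mask in PyTorch. This illustrates only the masking semantics (padded frames neither attend nor are attended to); the operator-level sparse kernel that skips the masked computation entirely, and thus realizes the measured speedup, is not reproduced here, and all tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

def masked_temporal_attention(tokens, valid_lengths, num_heads=8):
    """Hedged sketch: attend only over valid (non-padded) temporal tokens.

    tokens: (B, T_max, D) zero-padded token sequences; valid_lengths: (B,) ints.
    Padding positions are excluded via key_padding_mask, so they contribute
    neither to attention weights nor to gradients.
    """
    B, T_max, D = tokens.shape
    attn = nn.MultiheadAttention(D, num_heads, batch_first=True)  # fresh module, for illustration only
    positions = torch.arange(T_max).unsqueeze(0)                  # (1, T_max)
    key_padding_mask = positions >= valid_lengths.unsqueeze(1)    # True = ignore this position
    out, _ = attn(tokens, tokens, tokens, key_padding_mask=key_padding_mask)
    return out

# Example: two sequences padded to 81 frames, with 66 and 12 valid frames respectively
x = torch.randn(2, 81, 64)
y = masked_temporal_attention(x, torch.tensor([66, 12]))
```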

3.4.1. Real-World Inference Latency and Desktop Benchmarking

While MACs provide a theoretical proxy for computational complexity, they often fail to capture the impact of operator-level optimizations and memory access patterns. To directly address the hardware deployment capability, we evaluated the actual end-to-end inference latency (in milliseconds), which is a more critical metric for latency-sensitive applications. As summarized in Table 5, CAAPoseFormer successfully translates its theoretical sparsity into real-world acceleration by coupling adaptive sequence length and DCT mask mechanisms with hardware-optimized kernels.
On a desktop NVIDIA RTX 3090 platform, our model reduces end-to-end latency to 1.95 ms, a 23× speedup over the multi-hypothesis MHFormer (45.45 ms) and a clear improvement over the PoseFormerV2 baseline (8.40 ms). These results confirm that our mask-guided sparse interaction mechanism effectively bypasses invalid computation, enabling high-frequency (512 Hz) real-time processing.
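Latency was measured with a standard warmup-and-synchronize protocol; the sketch below shows a typical implementation of such a measurement. The warmup and iteration counts are illustrative assumptions, and the model and input are assumed to already reside on the GPU.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, example_input, warmup=50, iters=200):
    """Return average end-to-end GPU latency in milliseconds.

    Warmup iterations amortize one-time kernel setup; torch.cuda.synchronize()
    ensures asynchronous CUDA kernels have finished before timestamps are taken.
    """
    model.eval()
    for _ in range(warmup):
        model(example_input)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(example_input)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000.0
```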

3.4.2. Cross-Platform Verification and Edge Deployment

While the desktop GPU metrics indicate a profile highly suitable for deployment, real-world edge performance is subject to distinct hardware constraints, such as limited memory bandwidth and lower integer arithmetic throughput. To further support our deployment-oriented claims, we conducted preliminary on-device verification using an Orange Pi 5 Pro, focusing on the practical executability of the model under resource-constrained conditions.
As illustrated in Figure 8, the complete inference pipeline was successfully executed on this representative mobile-grade platform. The photographs of the real-world deployment scenario, showing the physical hardware setup, the monitor display, and the active runtime interface, provide tangible evidence that CAAPoseFormer’s lightweight design is not merely theoretical but practically deployable on low-power hardware.
As summarized in Table 6, CAAPoseFormer demonstrates strong adaptability across different input scales. Even on the Orange Pi 5 Pro’s architecture, the model maintains a peak memory footprint under 1.5 GB. At a reduced temporal resolution (T = 27), the model achieves 10.87 FPS, providing a viable path for real-time motion capture on edge devices. We emphasize that while the RTX 3090 remains the primary setting for controlled SOTA comparisons, these results verify the framework’s potential for latency-sensitive inference on resource-constrained edge devices.
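As a deployment-oriented illustration, the following sketch outlines a generic ONNX export and CPU-side latency measurement of the kind summarized in Table 6. The model handle, input shape, file name, opset version, and iteration counts are hypothetical; the RKNN conversion and INT8 quantization steps for the RK3588S NPU are not reproduced, and exporting an adaptive model at a fixed T is a simplification of the actual dynamic pipeline.

```python
import time
import numpy as np
import onnxruntime as ort
import torch

def export_and_benchmark(model, t_frames=27, joints=17, onnx_path="caaposeformer_t27.onnx"):
    """Export a trained PyTorch model to ONNX and measure CPU latency statistics."""
    dummy = torch.randn(1, t_frames, joints, 2)
    torch.onnx.export(model, dummy, onnx_path, input_names=["pose2d"], opset_version=17)

    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    feed = {"pose2d": dummy.numpy().astype(np.float32)}
    latencies = []
    for _ in range(100):
        t0 = time.perf_counter()
        session.run(None, feed)
        latencies.append((time.perf_counter() - t0) * 1000.0)
    return np.mean(latencies), np.percentile(latencies, 95)   # avg and P95 latency (ms)
```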

3.4.3. Efficiency-Accuracy Pareto Analysis

Figure 9a visualizes the trade-off between computational efficiency and reconstruction accuracy. CAAPoseFormer (red pentagram) is clearly located in the advantageous upper-left region of the plot, indicating a favorable accuracy–efficiency trade-off under a low compute budget. Compared with the baseline PoseFormerV2, the proposed method achieves a substantial leftward shift along the x-axis (0.5 G → 0.18 G) while maintaining competitive accuracy, thereby avoiding the considerable computational redundancy of heavy networks such as MotionBERT (clustered in the upper-right). This distribution corroborates the effectiveness of the mask-guided sparse interaction mechanism, demonstrating that CAAPoseFormer attains reconstruction accuracy superior to lightweight models (e.g., DC-GCT) under an extremely low-power budget.
Figure 9b further highlights the superiority of CAAPoseFormer in terms of static storage efficiency. Most state-of-the-art Transformer-based methods (e.g., MixSTE, MotionBERT) are densely distributed in the high-parameter regime on the right side of the plot (>30 M), suggesting a strong dependence on model capacity. In contrast, CAAPoseFormer lies close to the y-axis and achieves performance comparable to PoseFormerV2 (14.3 M) with only 2.1 M parameters. This “small footprint, strong performance” profile indicates that the proposed time–frequency dual-domain pruning strategy enables efficient encoding of human topological features within a severely constrained parameter budget, outperforming methods of similar scale such as GLA-GCN, and thus being more suitable for lightweight deployment on edge devices.
In contrast to accuracy-first designs such as MHFormer, CAAPoseFormer is conceived as an efficiency-first framework tailored for latency-sensitive edge applications. As shown in the trade-off curve (Figure 9), we consciously trade a marginal 1.2 mm accuracy drop for a 97.4% reduction in computational cost (MACs) relative to MHFormer. On resource-constrained edge devices, where running heavy models such as MHFormer is practically infeasible under strict power and memory constraints, our method provides a deployable lightweight alternative that maximizes efficiency while maintaining competitive accuracy relative to PoseFormerV2.
Furthermore, to verify that this efficiency-oriented design does not compromise the modeling of difficult poses, we conducted a comparison by forcing maximum resource allocation on samples identified as ‘hard’ by our complexity module. The results indicated that the performance gain on these samples was marginal, with less than a 2% improvement in MPJPE, while the computational cost increased by an order of magnitude. This confirms that the performance bottleneck for such poses is primarily due to inherent monocular ambiguities rather than insufficient model capacity. Consequently, our adaptive strategy prioritizes computational utility, focusing resources where they provide the most significant accuracy gains and ensuring a more efficient Pareto frontier for edge deployment.

3.4.4. Robustness to 2D Detection Noise

Since CAAPoseFormer relies on 2D input sequences, we investigate its robustness against upstream detection errors by adding Gaussian noise N(0, σ²) to the 2D keypoints. As presented in Table 7, the results highlight a distinct advantage of our adaptive mechanism. Under the simulated Gaussian noise setting, the complexity module tends to interpret unnatural coordinate jitter as increased temporal and spatial variance.
Consequently, as the noise level σ increases, the network autonomously allocates significantly more resources (average T increases toward 81, and K increases toward the maximum) to establish a broader receptive field for smoothing and error correction. This dynamic compensation suggests that, under coordinate-level Gaussian perturbations, the model can partially reduce the risk of overly aggressive pruning by allocating more computational resources to noisy inputs.
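The perturbation protocol can be sketched as follows. The random seed and the per-joint handling are assumptions, since Table 7 specifies only the pixel-level standard deviation of the injected noise.

```python
import numpy as np

def add_keypoint_noise(keypoints_2d, sigma, seed=0):
    """Perturb 2D detections with pixel-level Gaussian noise N(0, sigma^2).

    keypoints_2d: (T, J, 2) array in pixel coordinates; sigma in pixels.
    Mirrors the perturbation setting of Table 7, up to the unspecified seed
    and per-joint details.
    """
    rng = np.random.default_rng(seed)
    return keypoints_2d + rng.normal(0.0, sigma, size=keypoints_2d.shape)

# Example: simulate the sigma = 10 pixel setting on a 243-frame, 17-joint sequence
noisy = add_keypoint_noise(np.zeros((243, 17, 2)), sigma=10.0)
```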

3.4.5. Failure Mode Analysis

While CAAPoseFormer provides an efficient accuracy–efficiency trade-off, its performance is bounded by the alignment between geometric proxies and reconstruction difficulty. The primary failure mode occurs in scenarios characterized by semantic–geometric decoupling, specifically in occlusion-heavy contracted poses like “Sitting Down” and “Eating”. As visualized in Figure 6, once the subject transitions into a seated state, the comprehensive complexity score (C) plunges toward zero. This occurs because the limbs are compactly tucked near the joint centroid, yielding low spatial dispersion (Cs). The model interprets this physical contraction as “low complexity” and assigns a near-minimum computational budget. However, these poses involve severe depth ambiguities that actually require extensive temporal context to resolve, leading to the observed localized error of 62.7 mm for “Sitting Down”. A similar underestimation occurs during “Eating”, where subtle hand-to-face articulations are over-compressed by restricted spectral bandwidths.
Furthermore, the module is sensitive to the quality of upstream 2D input. While the Gaussian noise analysis suggests that the system tends to assign higher complexity to inputs with strong coordinate jitter and allocate more resources for smoothing, extreme occlusion-induced detector failure remains a bottleneck. We accept these localized degradations as a necessary engineering trade-off for real-time edge deployment; forcing a maximum computational budget on these “hard” samples yields less than a 2% accuracy improvement while increasing computational costs by an order of magnitude. This analysis provides a concrete anchor for future research into occlusion-aware complexity quantification and robust 3D lifting.

3.5. Ablation Studies

To verify the effectiveness and individual contributions of the key components in the CAAPoseFormer architecture, we conduct detailed ablation studies on Human3.6M, with results summarized in Table 8. The baseline model (Model 1) adopts a fixed spatiotemporal processing strategy and achieves an MPJPE of 44.4 mm with 0.49 GFLOPs and an inference speed of 185 Hz. The results show that a single adaptive mechanism can substantially reduce computational load, yet it is insufficient to maintain high-fidelity reconstruction on its own. Introducing only the adaptive temporal window (Model 2) increases the throughput to 210 Hz, but the accuracy degrades markedly to 47.1 mm. Similarly, applying only the adaptive DCT strategy (Model 3) compresses the computation to 0.19 GFLOPs and attains 450 Hz, whereas its accuracy (44.5 mm) still fails to surpass the baseline.
We trace these accuracy drops directly to unchecked feature over-sparsification. When Model 2 applies aggressive temporal pruning without frequency-domain support, it breaks critical long-range dependencies, stripping the context needed for occlusions or fast movements. Model 3 forces efficiency through frequency-domain sparsity alone. Yet, missing a precise temporal focus, its global frequency cutoff blindly erases the high-frequency signals required to pinpoint distal joints, stalling any further improvements in precision.
In contrast, when the two adaptive mechanisms operate jointly in the full model (Model 4), a clear complementary synergy emerges. The complete architecture reduces computation to 0.16 GFLOPs (a 67.3% reduction over the baseline) and boosts inference speed to 512 Hz, while also improving accuracy beyond the baseline to achieve the best MPJPE of 44.2 mm. These findings confirm that the adaptive temporal window effectively removes temporal redundancy and provides cleaner inputs for spectral modeling, whereas the adaptive DCT module compensates for the information loss in the temporal domain through dynamic spectral filtering. Rather than a linear combination, their coupling performs joint denoising and feature reorganization, yielding an improved accuracy–efficiency trade-off under an extremely low compute budget.

3.6. Input-Length Sensitivity Under Zero-Shot Transfer and Boundary Robustness

To analyze the input-length sensitivity of CAAPoseFormer under a cross-dataset zero-shot setting, we evaluate the model trained solely on Human3.6M directly on the 3DPW dataset without target-domain fine-tuning. As shown in Table 9, increasing the input sequence length consistently improves the PA-MPJPE performance. Specifically, when the input length increases from T = 27 to T = 81, the PA-MPJPE decreases from 62.5 mm to 58.4 mm. When the input length is further increased to T = 243, the PA-MPJPE is reduced to 55.2 mm. This trend indicates that longer temporal context is beneficial for CAAPoseFormer under the Human3.6M-to-3DPW zero-shot transfer setting.
Meanwhile, the computational cost increases only moderately with longer input sequences. The MACs increase from 160 M at T = 27 to 181 M at T = 81 and 186 M at T = 243, while the number of parameters remains nearly unchanged at approximately 2.1–2.2 M. This suggests that the proposed adaptive computation strategy can exploit longer temporal information while maintaining a relatively low computational budget.
It should be noted that Table 9 is intended as an internal input-length sensitivity analysis of CAAPoseFormer under the specified zero-shot transfer setting. Therefore, the reported result of 55.2 mm at T = 243 should be interpreted as the performance of our model under this particular setting, rather than as a direct comparison with methods evaluated under different protocols. In addition, the results suggest that the fixed training-set boundary parameters Cmin and Cmax remain usable under this domain shift. When the estimated complexity of unseen samples falls outside the training-set range, the clipping mechanism constrains the adaptive resource allocation within predefined valid bounds, thereby helping maintain stable pruning behavior without online recalibration.
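To illustrate how the frozen boundary parameters constrain resource allocation under domain shift, the following minimal sketch clips an out-of-range complexity score before mapping it to a temporal window. The numeric bounds are illustrative values taken from Table 1, not the exact constants used by the model.

```python
import torch

def clipped_budget(c_raw, c_min, c_max, t_min=9, t_max=81):
    """Constrain out-of-range complexity scores before resource allocation.

    c_min / c_max are frozen training-set statistics; clipping keeps the adaptive
    window within valid bounds under domain shift (e.g., Human3.6M to 3DPW)
    without online recalibration. Bounds here are illustrative assumptions.
    """
    c = torch.clamp(c_raw, min=c_min, max=c_max)
    c_norm = (c - c_min) / (c_max - c_min + 1e-8)          # renormalize to [0, 1]
    return t_min + c_norm * (t_max - t_min)                # adaptive temporal window

# Example: a 3DPW sample whose raw score exceeds the training-set maximum
window = clipped_budget(torch.tensor([1.25]), c_min=0.036, c_max=0.991)
```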

4. Conclusions

In this study, we tackle the inflexible nature of resource distribution prevalent in current video-driven 3D pose estimation networks by proposing CAAPoseFormer, a novel adaptive architecture driven by spatial–temporal complexity awareness. Instead of relying on rigid network parameters, this framework directly links the computational budget to the estimated spatial–temporal complexity of the input sequence.
We shift the computational paradigm through three core mechanisms. We first track sequence difficulty using a spatiotemporal complexity quantifier that merges spatial dispersion and temporal variance. This continuous metric directly drives a time–frequency dual-domain pruning strategy. Instead of restricting the network to fixed windows or preset DCT limits, this pruning dynamically drops redundant operations for simple poses while reserving more representational capacity for complex sequences. To process these newly unstructured features, we integrate a mask-guided sparse encoder. By explicitly blocking invalid zero-padded regions at the operator level, the encoder natively handles variable-length interactions without wasting standard hardware cycles.
Experiments on Human3.6M indicate that CAAPoseFormer mitigates the efficiency bottleneck of conventional fixed-computation strategies when facing diverse motion patterns. By coupling complexity-aware resource allocation with adaptive time–frequency pruning and mask-driven sparse interaction, the model aligns computational effort with motion complexity. Relative to the strong frequency-domain baseline PoseFormerV2, CAAPoseFormer achieves comparable accuracy (MPJPE: 44.2 mm) while reducing parameters by 85.3% and computational cost by 64.8%, yielding a 2.8× improvement in compute efficiency. These results suggest that the proposed approach attains a favorable accuracy–efficiency trade-off, alleviating both excessive computation on simple motions and under-representation of complex motions, and thus supports real-time 3D pose estimation in resource-constrained settings. While static architectures like HDFormer have pushed the boundaries of theoretical MAC reduction through spatial graph optimization, CAAPoseFormer explores a complementary path by prioritizing dynamic, complexity-aware resource allocation. Our framework demonstrates that sequence-level adaptivity can achieve competitive accuracy while maintaining the high parallelism required for real-time edge deployment.
While the primary experimental validation in this work was performed on a desktop RTX 3090 platform, CAAPoseFormer is explicitly designed for computation-efficient inference through complexity-aware adaptive pruning and mask-guided sparse interaction. Its exceptionally low parameter count, reduced MACs, and high measured throughput indicate strong potential for practical deployment on resource-constrained edge platforms. Notably, our preliminary Orange Pi 5 Pro device-side verification supports the real-world executability of the framework beyond standard desktop-only evaluations. While these initial benchmarks demonstrate the efficiency of CAAPoseFormer, we acknowledge that a direct memory footprint comparison with other lightweight baselines, such as GLA-GCN and HDFormer, requires extensive hardware-specific kernel tuning and framework conversion, such as RKNN/ONNX, to ensure fairness across different architectural paradigms. Future work will focus on establishing a standardized benchmarking protocol to evaluate the peak memory versus static buffer trade-offs of adaptive frameworks on various ARM-based edge platforms.
We do recognize specific operational limits in the current pipeline. Because the complexity module relies entirely on upstream 2D pose data, severe occlusions or raw detector noise may disturb the complexity scores and increase the risk of suboptimal pruning decisions. However, our Gaussian noise analysis suggests that the system tends to assign higher complexity to noisy coordinate inputs and mobilize more computational resources, thereby partially reducing the risk of overly aggressive pruning under this controlled perturbation setting. The dynamic nature of this pruning also creates inherent decision variability, which can occasionally cause minor frame-to-frame prediction jitter across runs. Structurally, forcing the quantification, pruning, and encoding stages into a strict sequential dependency complicates end-to-end optimization. This tight coupling narrows the margin for stable training and requires highly precise hyperparameter tuning.
Future research will focus on improving the robustness of the complexity-aware module under noise, anomalous motion patterns, severe occlusions, and domain shifts while maintaining the frozen global normalization priors used during inference. This design preserves the comparability of complexity scores across historical and current samples, avoids dependence on online test statistics, and prevents additional state-maintenance overhead on edge devices. While the current 3DPW results provide preliminary evidence of cross-dataset transferability, our future research will extend this paradigm to a broader range of benchmarks, including MPI-INF-3DHP and AGORA, to further investigate the transferability of complexity-aware routing across diverse motion distributions under more comprehensive and controlled evaluation settings. Furthermore, extending adaptivity from the temporal and frequency dimensions to the spatial dimension using emerging linear-complexity state space models, such as Mamba-based architectures [43], could help realize a fully spatiotemporal adaptive framework with lower memory footprint. In addition, we aim to evaluate the transferability of the proposed adaptive computation paradigm to broader video analysis tasks, including action recognition and frequency-adaptive video understanding.
Although the semantic gap between geometric dispersion and actual reconstruction difficulty is partially alleviated by the fusion of temporal variance and the adaptive preservation of temporal–frequency information during dynamic transitions, highly contracted self-occluded poses remain challenging due to intrinsic monocular depth ambiguity. While our geometric dispersion metric ensures high efficiency for real-time deployment, we acknowledge that it may not perfectly capture the complexity of severe self-occlusions. Therefore, developing lightweight, occlusion-aware complexity metrics that can accurately assess pose difficulty without compromising real-time inference speed will be a central focus of our future research.

Author Contributions

Conceptualization, W.Z. and Z.Y.; methodology, W.Z.; software, W.Z.; validation, W.Z., Z.Y. and C.H.; formal analysis, W.Z. and Z.Y.; investigation, W.Z.; resources, W.Z., Z.Y. and C.H.; data curation, W.Z. and C.H.; writing—original draft preparation, W.Z.; writing—review and editing, W.Z. and C.H.; visualization, W.Z. and Z.Y.; supervision, J.L. and J.S.; project administration, W.Z., J.L. and J.S.; funding acquisition, W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Provincial-level College Students’ Innovation and Entrepreneurship Training Program, grant number 20250100012.

Data Availability Statement

The data used to support the conclusions of this study are available from the corresponding authors on reasonable request. Publicly available datasets were also analyzed in this study. This data can be found here: Human3.6M dataset (http://vision.imar.ro/human3.6m/description.php, accessed on 10 May 2026).

Conflicts of Interest

The authors declare no potential conflicts of interest.

References

  1. Wang, J.; Tan, S.; Zhen, X.; Xu, S.; Zheng, F.; He, Z.; Shao, L. Deep 3D Human Pose Estimation: A Review. Comput. Vis. Image Underst. 2021, 210, 103225. [Google Scholar] [CrossRef]
  2. Martinez, J.; Hossain, R.; Romero, J.; Little, J.J. A Simple Yet Effective Baseline for 3D Human Pose Estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2640–2649. [Google Scholar] [CrossRef]
  3. Pavllo, D.; Feichtenhofer, C.; Grangier, D.; Auli, M. 3D Human Pose Estimation in Video with Temporal Convolutions and Semi-Supervised Training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 7753–7762. [Google Scholar] [CrossRef]
  4. Hossain, M.R.I.; Little, J.J. Exploiting Temporal Information for 3D Human Pose Estimation. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 68–84. [Google Scholar] [CrossRef]
  5. Zheng, C.; Zhu, S.; Mendieta, M.; Yang, T.; Chen, C.; Ding, Z. 3D Human Pose Estimation with Spatial and Temporal Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 11656–11665. [Google Scholar] [CrossRef]
  6. Zhao, Q.; Zheng, C.; Liu, M.; Wang, P.; Chen, C. PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 8877–8886. [Google Scholar] [CrossRef]
  7. Shan, W.; Liu, Z.; Zhang, X.; Wang, S.; Ma, S.; Gao, W. P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose Estimation. In Proceedings of the 17th European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 461–478. [Google Scholar] [CrossRef]
  8. Liang, Y.; Ge, C.; Tong, Z.; Song, Y.; Wang, J.; Xie, P. Not All Patches Are What You Need: Expediting Vision Transformers via Token Reorganizations. In Proceedings of the 10th International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022. [Google Scholar] [CrossRef]
  9. Yan, S.; Xiong, Y.; Lin, D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. Proc. AAAI Conf. Artif. Intell. 2018, 32, 7444–7452. [Google Scholar] [CrossRef]
  10. Zhang, J.; Tu, Z.; Yang, J.; Chen, Y.; Yuan, J. MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13232–13242. [Google Scholar] [CrossRef]
  11. Li, W.; Liu, H.; Ding, R.; Liu, M.; Wang, P.; Yang, W. Exploiting Temporal Contexts with Strided Transformer for 3D Human Pose Estimation. IEEE Trans. Multimed. 2022, 24, 1282–1293. [Google Scholar] [CrossRef]
  12. Wu, Z.; Xiong, C.; Ma, C.-Y.; Socher, R.; Davis, L.S. AdaFrame: Adaptive Frame Selection for Fast Video Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1278–1287. [Google Scholar] [CrossRef]
  13. Rao, Y.; Zhao, W.; Liu, B.; Lu, J.; Zhou, J.; Hsieh, C.-J. DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification. In Advances in Neural Information Processing Systems 34; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; pp. 13937–13949. [Google Scholar] [CrossRef]
  14. Li, W.; Liu, M.; Liu, H.; Wang, P.; Cai, J.; Sebe, N. Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 604–613. [Google Scholar] [CrossRef]
  15. Zhu, W.; Ma, X.; Liu, Z.; Liu, L.; Wu, W.; Wang, Y. MotionBERT: A Unified Perspective on Learning Human Motion Representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 15085–15099. [Google Scholar] [CrossRef]
  16. Peng, J.; Zhou, Y.; Mok, P.Y. KTPFormer: Kinematics and Trajectory Prior Knowledge-Enhanced Transformer for 3D Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 1123–1132. [Google Scholar] [CrossRef]
  17. Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded Pyramid Network for Multi-Person Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7103–7112. [Google Scholar] [CrossRef]
  18. Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 1234–1248. [Google Scholar] [CrossRef]
  19. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar] [CrossRef]
  20. Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units (GELUs). arXiv 2016, arXiv:1606.08415. [Google Scholar] [CrossRef]
  21. Han, Y.; Huang, G.; Song, S.; Yang, L.; Wang, H.; Wang, Y. Dynamic Neural Networks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7436–7456. [Google Scholar] [CrossRef] [PubMed]
  22. Wang, Y.; Chen, Z.; Jiang, H.; Li, S. Adaptive Computation Routing for Highly Efficient 3D Human Pose Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 8812–8821. [Google Scholar] [CrossRef]
  23. Vemulapalli, R.; Arrate, F.; Chellappa, R. Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 588–595. [Google Scholar] [CrossRef]
  24. Bolya, D.; Fu, C.-Y.; Dai, X.; Zhang, P.; Hoffmann, C.; Yao, J. Token Merging: Your ViT but Faster. In Proceedings of the 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar] [CrossRef]
  25. Li, W.; Liu, M.; Liu, H.; Wang, P.; Lu, S.; Sebe, N. H2OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 48, 512–526. [Google Scholar] [CrossRef] [PubMed]
  26. Yin, H.; Vahdat, A.; Alvarez, J.M.; Mallya, A.; Kautz, J.; Molchanov, P. A-ViT: Adaptive Tokens for Efficient Vision Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10809–10818. [Google Scholar] [CrossRef]
  27. Dao, T. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In Proceedings of the 12th International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024. [Google Scholar] [CrossRef]
  28. Shen, L.; Hao, T.; He, T.; Zhao, S.; Zhang, Y.; Liu, P.; Bao, Y.; Ding, G. TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval. In Proceedings of the 13th International Conference on Learning Representations (ICLR), Singapore, 24–28 April 2025. [Google Scholar] [CrossRef]
  29. Chen, H.; He, J.-Y.; Xiang, W.; Cheng, Z.-Q.; Liu, W.; Liu, H.; Luo, B.; Geng, Y.; Xie, X. HDFormer: High-Order Directed Transformer for 3D Human Pose Estimation. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI), Macao, China, 19–25 August 2023; pp. 581–589. [Google Scholar] [CrossRef]
  30. Chen, P.; Zeng, X.; Zhao, M.; Ye, P.; Shen, M.; Cheng, W.; Yu, G.; Chen, T. Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers. arXiv 2025, arXiv:2506.03065. [Google Scholar] [CrossRef]
  31. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1325–1339. [Google Scholar] [CrossRef] [PubMed]
  32. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Vancouver, BC, Canada, 2019; pp. 8024–8035. [Google Scholar] [CrossRef]
  33. Xu, T.; Takano, W. Graph Stacked Hourglass Networks for 3D Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 16100–16109. [Google Scholar] [CrossRef]
  34. Zou, Z.; Tang, W. Modulated Graph Convolutional Network for 3D Human Pose Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 11457–11467. [Google Scholar] [CrossRef]
  35. Kang, H.; Wang, Y.; Liu, M.; Wu, D.; Liu, P.; Yang, W. Double-Chain Graph Convolution Transformer for 3D Human Pose Estimation. IEEE Trans. Multimed. 2025; Early Access. [CrossRef]
  36. Chen, T.; Fang, C.; Shen, X.; Zhu, Y.; Chen, Z.; Luo, J. Anatomy-Aware 3D Human Pose Estimation with Bone-Based Pose Decomposition. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 198–209. [Google Scholar] [CrossRef]
  37. Li, W.; Liu, H.; Tang, H.; Wang, P.; Van Gool, L. MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13137–13146. [Google Scholar] [CrossRef]
  38. Diaz-Arias, A.; Shin, D. ConvFormer: Parameter Reduction in Transformer Models for 3D Human Pose Estimation by Leveraging Dynamic Multi-Headed Convolutional Attention. Vis. Comput. 2024, 40, 2555–2569. [Google Scholar] [CrossRef]
  39. Einfalt, M.; Ludwig, K.; Lienhart, R. Uplift and Upsample: Efficient 3D Human Pose Estimation with Uplifting Transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 2902–2912. [Google Scholar] [CrossRef]
  40. Tang, Z.; Qiu, Z.; Hao, Y.; Hong, R.; Yao, T. 3D Human Pose Estimation with Spatio-Temporal Criss-Cross Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 4790–4799. [Google Scholar] [CrossRef]
  41. Yu, B.X.B.; Zhang, Z.; Liu, Y.; Zhong, S.-H.; Liu, Y.; Chen, C.W. GLA-GCN: Global-Local Adaptive Graph Convolutional Network for 3D Human Pose Estimation from Monocular Video. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 8784–8795. [Google Scholar] [CrossRef]
  42. Qian, X.; Tang, Y.; Zhang, N.; Han, M.; Xiao, J.; Huang, M.-C.; Lin, R.-S. HSTFormer: Hierarchical Spatial-Temporal Transformers for 3D Human Pose Estimation. arXiv 2023, arXiv:2301.07322. [Google Scholar] [CrossRef]
  43. Zhang, Z.; Liu, A.; Reid, I.; Hartley, R.; Zhuang, B.; Tang, H. MotionMamba: Efficient and Long-Sequence Human Motion Generation with State Space Model. In Proceedings of the 18th European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 265–282. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of CAAPoseFormer. (a) The overall framework of CAAPoseFormer, consisting of the complexity assessment module, adaptive temporal-window logic, adaptive DCT logic, mask-guided sparse encoding, and the masked variable-length time–frequency fusion Transformer for 3D human pose estimation from 2D pose sequences. (b) Detailed structure of the masked variable-length time–frequency fusion Transformer, including masked multi-head attention, layer normalization, token split, and time–frequency fusion operations.
Figure 2. Schematic diagram of comprehensive complexity calculation for action sequences.
Figure 3. Time–frequency dual-domain joint adaptive pruning strategy of CAAPoseFormer.
Figure 4. Schematic diagram of the mask-guided variable-length sequence sparse interaction encoding mechanism.
Figure 5. Visualized complexity metric for the “Discussion” action.
Figure 6. Visualized complexity metric for the “Sitting Down” action.
Figure 7. Training dynamics for the “Taking Photo” action: evolution of learnable parameters and resource-adaptive allocation trajectory.
Figure 8. Device-side deployment setup on the Orange Pi 5 Pro platform. The figure illustrates the physical hardware setup, the monitor display, and the real-time runtime interface of the inference script. These components demonstrate the practical executability and hardware compatibility of CAAPoseFormer on a representative resource-constrained edge device.
Figure 9. Accuracy–efficiency trade-off on Human3.6M.
Table 1. Complexity changes of 15 action categories and their corresponding time window changes and DCT coefficient changes.
Action | α (W.S) | β (W.T) | C (Prob) | Avg Frames (T) | Avg Freq (K) | GFLOPs
Directions | 21.99 | 17.08 | 0.987 | 80.0 | 34.6 | 0.367
Discussion | 7.36 | 4.08 | 0.229 | 25.5 | 11.9 | 0.120
Eating | 6.51 | 4.33 | 0.209 | 24.0 | 11.3 | 0.113
Greeting | 21.09 | 19.87 | 0.991 | 80.3 | 34.7 | 0.368
Phoning | 3.55 | 5.12 | 0.146 | 19.5 | 9.4 | 0.092
Photo | 11.75 | 12.48 | 0.794 | 66.1 | 28.8 | 0.304
Posing | 11.01 | 8.33 | 0.591 | 51.5 | 22.7 | 0.238
Purchases | 10.15 | 10.01 | 0.630 | 54.3 | 23.9 | 0.250
Sitting | 0.17 | 1.56 | 0.041 | 11.9 | 6.2 | 0.058
Sitting Down | 1.08 | 0.04 | 0.036 | 11.6 | 6.1 | 0.057
Smoking | 2.42 | 1.56 | 0.063 | 13.5 | 6.9 | 0.065
Waiting | 6.92 | 8.12 | 0.379 | 36.3 | 16.4 | 0.169
Walking Dog | 14.80 | 15.12 | 0.923 | 75.5 | 32.7 | 0.346
Walking | 18.17 | 18.12 | 0.977 | 79.4 | 34.3 | 0.364
Walking Together | 17.22 | 16.96 | 0.966 | 78.5 | 34.0 | 0.360
Average | 10.28 | 9.52 | 0.531 | 47.2 | 20.9 | 0.218
Table 2. Comparison of pose estimation accuracy.
Method (CPN 2D input) | Directions | Discussion | Eating | Greeting | Phoning | Photo | Posing | Purchases | Sitting | Sitting Down | Smoking | Waiting | Walking Dog | Walking | Walking Together | Mean
GraphSH [33] | 45.2 | 49.9 | 47.5 | 50.9 | 54.9 | 66.1 | 48.5 | 46.3 | 59.7 | 71.5 | 51.4 | 48.6 | 53.9 | 39.9 | 44.1 | 51.9
MGCN [34] | 45.4 | 49.2 | 45.7 | 49.4 | 50.4 | 58.2 | 47.9 | 46.0 | 57.5 | 63.0 | 49.7 | 46.6 | 52.2 | 38.9 | 40.8 | 49.4
DC-GCT [35] | 42.2 | 47.3 | 44.6 | 47.6 | 49.7 | 56.1 | 45.3 | 43.8 | 55.3 | 59.4 | 48.5 | 44.7 | 51.0 | 36.8 | 38.3 | 47.4
VPose [3] | 45.2 | 46.7 | 44.3 | 45.6 | 48.1 | 55.1 | 44.6 | 44.3 | 57.3 | 65.8 | 47.1 | 44.0 | 49.0 | 32.8 | 33.9 | 46.8
PoseFormer [5] | 41.5 | 44.8 | 39.8 | 42.5 | 46.5 | 51.6 | 42.1 | 42.0 | 53.3 | 60.7 | 45.5 | 44.3 | 46.1 | 31.8 | 32.2 | 44.3
Anatomy3D [36] | 41.4 | 43.2 | 40.1 | 42.9 | 46.6 | 51.9 | 41.7 | 42.3 | 53.9 | 60.2 | 45.4 | 41.7 | 46.0 | 31.5 | 32.7 | 44.1
MHFormer [37] | 39.2 | 43.1 | 40.1 | 40.9 | 44.9 | 51.2 | 40.6 | 41.3 | 53.5 | 60.3 | 43.7 | 41.1 | 43.8 | 29.8 | 30.6 | 43.0
ConvFormer [38] | 41.0 | 43.2 | 39.0 | 42.4 | 44.5 | 52.2 | 41.7 | 40.8 | 53.0 | 60.6 | 44.8 | 41.3 | 43.7 | 29.6 | 30.9 | 43.2
CAAPoseFormer | 41.2 | 45.6 | 49.1 | 43.2 | 44.9 | 48.9 | 41.7 | 40.4 | 53.5 | 62.7 | 44.1 | 41.5 | 45.7 | 31.3 | 30.2 | 44.2
Table 3. Quantitative evaluation of temporal smoothness (MPVE) and computational cost. Note: ↓ indicates that lower values are better.
Method | Frames (T) | MPJPE (mm) ↓ | MPVE (mm) ↓ | MACs (M) ↓
MHFormer | 81 | 43.0 | 2.8 | 7000
PoseFormerV2 | 81 | 45.2 | 2.9 | 500
CAAPoseFormer (Ours) | 47.2 (Avg.) | 44.2 | 2.9 | 186
Table 4. Analysis and comparison of computational efficiency for pose estimation. Note: P1 and P2 denote the mean per-joint position error (mm) under Protocol 1 (MPJPE) and Protocol 2 (PA-MPJPE), respectively; ↓ indicates that lower values are better.
Method | T | CE | Param | MACs | MACs/Frame | P1 ↓/P2 ↓
MHFormer CVPR’22 [37] | 351 | ✓ | 30.9 M | 7.0 G | 7096 M | 43.0/34.4
MixSTE CVPR’22 [10] | 243 | × | 33.6 M | 139.0 G | 572 M | 40.9/32.6
P-STMO ECCV’22 [7] | 243 | ✓ | 6.2 M | 0.7 G | 740 M | 42.8/34.4
Stridedformer TMM’22 [11] | 351 | ✓ | 4.0 M | 0.8 G | 801 M | 43.7/35.2
Einfalt et al. WACV’23 [39] | 351 | ✓ | 10.4 M | 0.5 G | 498 M | 44.2/35.7
STCFormer CVPR’23 [40] | 243 | × | 4.7 M | 19.6 G | 80 M | 41.0/32.0
STCFormer-L CVPR’23 [40] | 243 | × | 18.9 M | 78.2 G | 321 M | 40.5/31.8
PoseFormerV2 CVPR’23 [6] | 243 | ✓ | 14.3 M | 0.5 G | 528 M | 45.2/35.6
GLA-GCN ICCV’23 [41] | 243 | ✓ | 1.3 M | 1.5 G | 1556 M | 44.4/34.8
MotionBERT ICCV’23 [15] | 243 | × | 42.5 M | 174.7 G | 719 M | 39.2/32.9
HDFormer IJCAI’23 [29] | 96 | × | 3.7 M | 0.6 G | 6 M | 42.6/33.1
HSTFormer arXiv’23 [42] | 81 | × | 22.7 M | 1.0 G | 13 M | 42.7/33.7
KTPFormer CVPR’24 [16] | 243 | × | 35.2 M | 76.1 G | 313 M | 40.1/31.9
CAAPoseFormer | 27 | ✓ | 2.2 M | 0.17 G | 160 M | 49.5/38.9
CAAPoseFormer | 81 | ✓ | 2.1 M | 0.17 G | 181 M | 46.8/36.5
CAAPoseFormer | 243 | ✓ | 2.1 M | 0.18 G | 186 M | 44.2/34.8
Table 5. Hardware inference latency and efficiency comparison (RTX 3090).
Method | Param (M) | MACs (G) | FPS (Hz) | Latency (ms) | P1 (mm)
MHFormer [37] | 30.9 | 7.00 | ~22 | ~45.45 | 43.0
PoseFormerV2 [6] | 14.3 | 0.50 | ~119 | ~8.40 | 45.2
CAAPoseFormer (Ours) | 2.1 | 0.16 | 512 | 1.95 | 44.2
Table 6. Preliminary deployment-oriented benchmark summary of CAAPoseFormer.
Device | Backend | Precision | Input Length (T) | Avg Latency (ms) | P95 Latency (ms) | FPS | Peak Memory (MB)
RTX 3090 | PyTorch + CUDA | FP32 | 243 | 1.95 | N/A | 512 | N/A
Orange Pi 5 Pro | ONNX/RKNN/PyTorch CPU | INT8/FP16/FP32 | 243 | 420.0 | 455.0 | 2.38 | 1450
Orange Pi 5 Pro | ONNX/RKNN/PyTorch CPU | INT8/FP16/FP32 | 81 | 185.0 | 205.0 | 5.41 | 980
Orange Pi 5 Pro | ONNX/RKNN/PyTorch CPU | INT8/FP16/FP32 | 27 | 92.0 | 104.0 | 10.87 | 760
Table 7. Robustness analysis under varying 2D Gaussian noise levels on Human3.6M.
Noise Level (σ) | P1 (mm) | Avg. Allocated Frames (T) | Avg. Allocated Freqs (K) | MACs (G)
σ = 0 (Baseline) | 44.2 | 41 | 16 | 0.16
σ = 5 pixels | 48.5 | 53 | 21 | 0.25
σ = 10 pixels | 55.2 | 67 | 27 | 0.35
σ = 20 pixels | 68.4 | 79 | 31 | 0.48
Table 8. Ablation experiments of the CAAPoseFormer method.
Model ID | Adaptive Time Window (Lt) | Adaptive DCT (Kf) | P1 (mm) | MACs (G) | FPS (Hz)
1 | – | – | 44.4 | 0.49 | 185
2 | ✓ | – | 47.1 | 0.45 | 210
3 | – | ✓ | 44.5 | 0.19 | 450
4 | ✓ | ✓ | 44.2 | 0.16 | 512
Table 9. Internal input-length sensitivity of CAAPoseFormer under zero-shot transfer from Human3.6M to 3DPW. Note: ↓ indicates that lower values are better.
Method | Frames (T) | PA-MPJPE (mm) ↓ | MACs (M) ↓ | Params (M)
CAAPoseFormer | 27 | 62.5 | 160 | 2.2
CAAPoseFormer | 81 | 58.4 | 181 | 2.1
CAAPoseFormer | 243 | 55.2 | 186 | 2.1