4.1. Feature Modulation
The overall framework of the proposed method includes two branches, as depicted in
Figure 2. Within the skeleton branch, the input skeleton sequence undergoes data augmentation and topology mapping, passing through a stack of
N layers of GCN blocks, which incorporate the proposed frequency-domain spatiotemporal blocks, to encode and generate the skeleton feature representation. Similarly, in the lower branch, the text is processed through Byte-Pair Tokenization and then encoded into embeddings utilizing the CLIP text encoder.
Specifically, we employ the text encoder from the CLIP ViT-B/32 model to obtain sentence-level embeddings. Each action description or generated prompt sentence is first tokenized and then mapped into a 512-dimensional embedding space. These embeddings serve as semantic anchors for cross-modal alignment with skeleton features in the Feature Modulation module. During training, the CLIP encoder remains frozen to preserve its pretrained semantic structure; alignment is instead driven by minimizing the expected information gain between the two modalities.
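For concreteness, a minimal sketch of this text-encoding step is given below, using the public OpenAI CLIP reference implementation; the prompt strings and variable names are illustrative, not the exact prompts used in our pipeline.

import torch
import clip  # OpenAI CLIP reference implementation

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)   # frozen pretrained text encoder

texts = ["a person drinking water", "a person eating a meal"]  # illustrative prompts
tokens = clip.tokenize(texts).to(device)          # byte-pair tokenization -> (B, 77)

with torch.no_grad():                             # encoder stays frozen during training
    z_t = model.encode_text(tokens)               # sentence-level embeddings -> (B, 512)
z_t = z_t / z_t.norm(dim=-1, keepdim=True)        # L2-normalize for cross-modal alignment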
To elaborate further, we formulate the process of distillation as an optimization problem. The text embedding defines the true distribution $p$, generated by the pretrained text encoder, while $q$ is an approximate distribution, produced by a learnable graph convolutional network (GCN), that is fitted to $p$. The loss is defined as the expected information gain of $p$ with respect to $q$, which measures the difference between the two distributions.
The goal is to bring $q$ closer to $p$ by minimizing the discrepancy between these two distributions. This discrepancy is quantified by a loss function, often expressed in terms of the expected information gain (or Kullback–Leibler divergence) between $p$ and $q$. The KL divergence, denoted as $D_{\mathrm{KL}}(p \,\|\, q)$, measures how much information is lost when $q$ is used to approximate $p$. The schematic code is shown as Algorithm 1. The loss function can be formally expressed as:

$$\mathcal{L}_{\mathrm{KL}} = D_{\mathrm{KL}}(p \,\|\, q) = \sum_{i} p_i \log \frac{p_i}{q_i},$$

where, following Algorithm 1, $p = \mathrm{softmax}(z_t / \tau_t)$ is the teacher distribution derived from the text embedding and $q = \mathrm{softmax}(l_b / \tau_s)$ is the student distribution derived from the skeleton–text similarities.
Algorithm 1 Pseudocode of FM in a PyTorch-like style

# z_q, z_t: query/key embeddings and text embedding (B×C)
# queue_a: queue of N keys (C×N)
# tau_s, tau_t: temperatures for student/teacher (scalars)

noise_for_q = torch.randn_like(z_q) * noise_std   # Gaussian noise
noise_for_t = torch.randn_like(z_t) * noise_std

l_a = torch.mm(z_q + noise_for_q, queue_a)                  # similarities to queued keys (B×N)
l_b = torch.mm(z_q + noise_for_q, (z_t + noise_for_t).t())  # skeleton–text similarities (B×B)

loss_kl = loss_kld(l_b / tau_s, z_t / tau_t)

def loss_kld(inputs, targets):
    inputs = F.log_softmax(inputs, dim=1)
    targets = F.softmax(targets, dim=1)
    return F.kl_div(inputs, targets, reduction='batchmean')
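As a stand-alone sanity check of the KL term above, the snippet below exercises loss_kld with random tensors; the batch size and embedding dimension are hypothetical.

import torch
import torch.nn.functional as F

def loss_kld(inputs, targets):
    # KL(p || q) with q from student logits (inputs) and p from teacher logits (targets)
    inputs = F.log_softmax(inputs, dim=1)
    targets = F.softmax(targets, dim=1)
    return F.kl_div(inputs, targets, reduction='batchmean')

B, C = 8, 512                        # hypothetical batch size / embedding dimension
student = torch.randn(B, C) / 0.1    # e.g., skeleton-side logits scaled by tau_s
teacher = torch.randn(B, C) / 0.05   # e.g., text-side logits scaled by tau_t
print(loss_kld(student, teacher))    # scalar distillation loss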
The Feature Modulation (FM) module enhances action discrimination by adaptively emphasizing discriminative motion cues and suppressing redundant or highly correlated patterns across channels and temporal frequencies. By modulating feature responses conditioned on learned frequency–semantic representations, FM helps to separate subtle inter-class variations (e.g., ‘drinking’ vs. ‘eating’), which often share similar motion trajectories but differ in temporal dynamics or joint coordination. This selective recalibration strengthens the representation’s sensitivity to class-specific motion signatures, thereby improving the distinction between visually or kinematically similar actions.
4.2. FreST
Skeleton data usually consist of multiple temporal nodes reflecting the motion trajectories of different human joints over time. The frequency at which these nodes change can reveal key characteristics of certain actions (e.g., their speed, rhythm, and periodicity). Specifically, the input skeleton sequence passes through the Frequency Spatial Block (FreS) and the Frequency Temporal Block (FreT), respectively, with feature optimization via their built-in adaptive frequency-domain filtering. FreS and FreT do not change the dimension of the input sequence. To avoid spatial–temporal feature coupling interference [14,17,21], which occurs when spatial topology and temporal motion cues are entangled within a single representation, a corresponding spatial or temporal decoupling module is added before each frequency-domain module. Such coupling can blur discriminative temporal dynamics with static spatial patterns, thereby reducing filtering precision. By decoupling, the module first separates the skeleton sequence's spatial topological features (e.g., relative joint positions) from its temporal dynamic features (e.g., joint trajectories), then feeds the resulting single-dimensional representations into the subsequent frequency-domain modules, as sketched below. This separation allows the adaptive filters to focus more effectively on domain-specific variations, ultimately improving the accuracy of frequency-domain feature extraction.
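The decoupling can be pictured as reducing the (channel, frame, joint) tensor to a single sequence axis per branch. The sketch below uses plain averaging as a stand-in; the actual decoupling modules are learned, and the tensor shapes are assumptions.

import torch

def spatial_decouple(x):
    # x: (N, C, T, V). Collapse time so the joint axis V remains the sequence
    # that FreS filters over. Plain mean is a stand-in for the learned module.
    return x.mean(dim=2)              # -> (N, C, V)

def temporal_decouple(x):
    # Collapse joints so the frame axis T remains the sequence for FreT.
    return x.mean(dim=3)              # -> (N, C, T)

x = torch.randn(2, 64, 50, 25)        # hypothetical: 64 channels, 50 frames, 25 joints
print(spatial_decouple(x).shape, temporal_decouple(x).shape)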
As shown in Figure 3, the FreS module is introduced below; the working mechanism of the FreT module is the same. Specifically, after the input sequence is processed by the spatial decoupling module, it is reshaped into a single-dimensional (joint-wise) representation. In this context, the spatial domain refers to the coordinate space spanned by all skeletal joints, where each node $v$ corresponds to a specific physical joint location in the human body. The spatial relationships among these joints are defined by the adjacency matrix $\mathbf{A}$, which encodes the topological structure of the human skeleton. Subsequently, we convert the input $x(t) \in \mathbb{R}^{L}$, where $L = V$ for the spatial branch and $L = T$ for the temporal branch, into the frequency domain $X(f)$ by:

$$X(f) = \int_{-\infty}^{+\infty} x(t)\, e^{-j 2\pi f t}\, dt, \tag{7}$$

where $X(f)$ is the signal (spectrum) in the frequency domain and represents the component of the signal at frequency $f$, $t$ is a temporal variable, and $j$ denotes the imaginary unit. We then define the 1D FFT operation in Equation (7) as $\mathcal{F}(\cdot)$.
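In code, this transform is a one-line call; the sketch below assumes the decoupled features are laid out as (N, C, L) with the sequence axis last.

import torch

x = torch.randn(2, 64, 25)            # decoupled features (N, C, L), L = V or T
X = torch.fft.fft(x, dim=-1)          # 1D FFT along the sequence axis -> complex spectrum
# torch.fft.rfft(x, dim=-1) is the real-input variant often preferred in practice.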
$\hat{E}(f)$ is defined as the normalized energy, calculated as:

$$\hat{E}(f) = \frac{E(f)}{E_{\mathrm{med}} + \epsilon},$$

where $E(f) = |X(f)|^2$ denotes the energy of the individual frequency component at frequency $f$, $E_{\mathrm{med}}$ is the median value of all frequency component energies, and $\epsilon$ is a small constant introduced to avoid numerical instability caused by zero denominators. This normalization ensures that the energy values are scaled relative to the central tendency of the energy distribution, facilitating consistent thresholding across different input signals.
The adaptive filtering is then performed as:

$$X'(f) = X(f) \odot M(f), \qquad M(f) = \mathbb{1}\big[\hat{E}(f) > \theta\big],$$

where $\odot$ denotes element-wise multiplication, and $\mathbb{1}[\cdot]$ is the indicator function that generates a binary mask matrix: $M(f) = 1$ if the normalized energy $\hat{E}(f)$ exceeds the threshold $\theta$, and 0 otherwise. Through this operation, frequency components with normalized energy above $\theta$ are retained in $X'(f)$, while those below the threshold are filtered out. Notably, the threshold $\theta$ is dynamically adjusted based on the temporal characteristics of the specific action being processed (e.g., motion intensity, frequency bandwidth of key action features), ensuring that critical frequency information (e.g., discriminative motion patterns) is preserved while high-frequency noise and redundant components irrelevant to the action semantics are suppressed.
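A compact sketch of this energy normalization and thresholding is given below; the epsilon value and the fixed threshold are illustrative, since the actual threshold is adapted per action.

import torch

def adaptive_mask(X, theta, eps=1e-8):
    # X: complex spectrum (N, C, L); theta: threshold on normalized energy.
    E = X.abs() ** 2                                  # energy E(f) per component
    E_med = E.median(dim=-1, keepdim=True).values     # median energy per sequence
    E_hat = E / (E_med + eps)                         # normalized energy
    M = (E_hat > theta).float()                       # binary mask, 1 above threshold
    return X * M                                      # X'(f) = X(f) ⊙ M(f)

X = torch.fft.fft(torch.randn(2, 64, 25), dim=-1)
X_kept = adaptive_mask(X, theta=0.5)                  # theta = 0.5 is illustrative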
After applying adaptive filtering to the frequency-domain data, we introduce two types of learnable filters to further model frequency-domain characteristics. The global filter $K_g$ operates directly on the original frequency-domain data $X(f)$, enabling the model to capture global frequency correlations that may span the entire spectral range. In contrast, the local filter $K_l$ is applied to the adaptively filtered result $X'(f)$, focusing on learning discriminative patterns within the frequency components deemed important by the adaptive thresholding step. Both filters are parameterized to handle complex-valued frequency-domain data, with their mathematical formulations given by:

$$K_g = K_g^{\mathrm{re}} + j\, K_g^{\mathrm{im}}, \qquad K_l = K_l^{\mathrm{re}} + j\, K_l^{\mathrm{im}},$$

where $K^{\mathrm{re}}$ and $K^{\mathrm{im}}$ denote the real and imaginary parts of the complex-valued filters, respectively, and $j$ is the imaginary unit satisfying $j^2 = -1$. To initialize these filters in a stable manner, both $K^{\mathrm{re}}$ and $K^{\mathrm{im}}$ are sampled from a zero-mean Gaussian distribution with a small variance $\sigma^2$, ensuring that initial filter responses are moderate and avoid saturating subsequent computations.
The application of these filters to the frequency-domain data is defined as:

$$Y_g = X(f) \odot K_g, \qquad Y_l = X'(f) \odot K_l,$$

where $Y_g$ represents the globally filtered frequency features, capturing broad spectral patterns across the entire frequency domain, while $Y_l$ denotes the locally filtered features, which focus on the adaptively selected critical frequency components. Finally, to integrate both global context and local discriminative details, the output frequency features are computed as the sum of the two filtered results: $Y = Y_g + Y_l$. This integration strategy ensures that the model preserves both coarse-grained global frequency characteristics and fine-grained local details, enhancing the representation capacity for complex spatiotemporal patterns in action recognition tasks. A minimal parameterization of this dual-filter design is sketched below.
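In the sketch, the class name and shapes are ours, and each complex filter is stored as stacked real/imaginary parts initialized from a zero-mean Gaussian.

import torch
import torch.nn as nn

class DualFrequencyFilter(nn.Module):
    def __init__(self, C, L, sigma=0.02):
        super().__init__()
        # Real and imaginary parts drawn from a zero-mean Gaussian;
        # sigma is an illustrative small standard deviation.
        self.Kg = nn.Parameter(torch.randn(C, L, 2) * sigma)   # global filter
        self.Kl = nn.Parameter(torch.randn(C, L, 2) * sigma)   # local filter

    def forward(self, X, X_kept):
        Kg = torch.view_as_complex(self.Kg)    # Kg = Kg_re + j * Kg_im
        Kl = torch.view_as_complex(self.Kl)
        Yg = X * Kg                            # global: full spectrum
        Yl = X_kept * Kl                       # local: adaptively kept components
        return Yg + Yl                         # Y = Yg + Yl

filt = DualFrequencyFilter(C=64, L=25)
Y = filt(X, X_kept)                            # X, X_kept from the previous sketch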
Modality Harmonizer. After obtaining the enhanced frame-level skeleton feature and the enhanced word-level text feature, we apply cross-attention layers between them, with the frame-level feature serving as the query and the word-level feature as the key and value, to facilitate interaction and alignment between the modalities. The final aligned feature is then derived using the standard cross-attention mechanism, as sketched below.
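A sketch of this cross-modal attention with PyTorch's built-in multi-head attention follows; the model width and head count are assumptions.

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

F_s = torch.randn(2, 50, 512)    # enhanced frame-level skeleton features (query)
F_t = torch.randn(2, 20, 512)    # enhanced word-level text features (key/value)

aligned, _ = attn(query=F_s, key=F_t, value=F_t)   # aligned feature -> (2, 50, 512)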
Design Rationale of FreST. The Frequency-domain Spatial–Temporal (FreST) block leverages the intrinsic spectral sparsity of skeletal motion to selectively retain action-relevant components while suppressing unstable oscillations. Concretely, an adaptive mask is formed using a threshold computed from the median spectral energy of each input sequence, so that filtering automatically tightens for noise-dominated spectra and relaxes for clean, low-frequency motions. FreST uses two learnable complex-valued filters: a global filter that captures sequence-level rhythmic regularities and a local filter that focuses on joint-wise short-range variations. This dual-filter parameterization aligns coarse temporal rhythm with fine spatial–temporal details in a single representation. Compared with time-domain smoothing, which operates on strongly correlated samples and often blurs subtle class-discriminative dynamics, frequency-domain selection compacts signal energy, stabilizes optimization, and preserves fine-grained motion signatures, thereby strengthening separability for visually or kinematically similar actions.
Domain Inversion Module. The Domain Inversion module functions as a core bridge between the frequency and spatial/temporal domains. It first converts the skeleton features back from the frequency domain to the original spatial or temporal space, ensuring that frequency-enhanced information can be seamlessly integrated into subsequent processing. Meanwhile, it refines the filtered frequency components by suppressing noise and preserving valid motion frequencies, thus improving the quality and stability of the reconstructed skeleton features.
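Assuming the frequency features were produced by torch.fft.fft, the inversion step amounts to an inverse FFT; taking the real part discards small numerical residue in the imaginary channel.

import torch

Y = torch.fft.fft(torch.randn(2, 64, 25), dim=-1)   # stand-in for the filtered spectrum
y = torch.fft.ifft(Y, dim=-1).real                  # back to the spatial/temporal domain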
4.3. Skeleton Instance Contrastive Loss
We employ identical skeleton encoders to enable feature-level contrastive learning between the query and key skeleton streams. Specifically, given an original skeleton sequence $S$, we apply two different augmentations, $\mathcal{A}_q$ and $\mathcal{A}_k$, to generate the query and key samples, denoted as $x$ and $x'$, each of size $C \times T \times V$, where $C$, $T$, and $V$ represent the number of channels, frames, and nodes, respectively. A query encoder $f_q$ and a momentum-based key encoder $f_k$ are employed. Following this, global average pooling (GAP) is applied to derive the query embeddings $z$ and key embeddings $z'$. To optimize the encoder representations and enforce similarity between positive pairs while distinguishing negatives, we adopt the InfoNCE loss as our training objective:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp\!\big(z \cdot z' / \tau\big)}{\exp\!\big(z \cdot z' / \tau\big) + \sum_{i=1}^{K} \exp\!\big(z \cdot q_i / \tau\big)},$$
where $\cdot$ represents the dot product, calculating the similarity between two normalized embeddings, and $\tau$ is the temperature hyperparameter (set to 0.2 by default). $z \cdot q_i$ represents the similarity between the query embedding and the $i$-th sample stored in the memory queue $Q$, and $K$ represents the total number of samples stored in the queue $Q$. The parameters $\theta_q$ of the query encoder are updated by gradient backpropagation, while the parameters $\theta_k$ of the key encoder are updated as the moving average of the query encoder, which can be expressed as:

$$\theta_k \leftarrow m\, \theta_k + (1 - m)\, \theta_q,$$
where $m$ is a momentum coefficient, usually close to 1, to maintain consistency of the embeddings in the memory queue. Finally, the loss used to optimize the encoder can be formulated as:
$$\mathcal{L} = \mathcal{L}_{\mathrm{InfoNCE}} + \lambda\, \mathcal{L}_{\mathrm{KL}},$$

where $\lambda$ is a hyperparameter to balance the contributions of the different sample pairs. Additionally, we incorporate a Frequency-domain Signal Enhancement module (FreST) following the skeleton encoder. FreST enhances the model's ability to retain critical action information in the latent space by extracting sparse human joint information, which captures compressed signal energy. This feature extraction effectively boosts action recognition performance by emphasizing the most informative skeletal features.
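A compact MoCo-style sketch of the InfoNCE objective and the momentum update is given below; the queue layout and momentum value are conventional assumptions rather than exact settings from this work.

import torch
import torch.nn.functional as F

def info_nce(z, z_k, queue, tau=0.2):
    # z, z_k: (B, C) query/key embeddings; queue: (C, K) stored negatives.
    z, z_k = F.normalize(z, dim=1), F.normalize(z_k, dim=1)
    l_pos = (z * z_k).sum(dim=1, keepdim=True)         # positive similarity (B, 1)
    l_neg = z @ queue                                  # negative similarities (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(z.size(0), dtype=torch.long)  # positives sit at index 0
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1 - m)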