Part-Wise Graph Fourier Learning for Skeleton-Based Continuous Sign Language Recognition

Wei, Dong; Hu, Hongxiang; Ma, Gang-Feng

doi:10.3390/jimaging11080286


by Dong Wei *, Hongxiang Hu and Gang-Feng Ma

College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China

* Author to whom correspondence should be addressed.

J. Imaging 2025, 11(8), 286; https://doi.org/10.3390/jimaging11080286

Submission received: 3 July 2025 / Revised: 16 August 2025 / Accepted: 17 August 2025 / Published: 21 August 2025

(This article belongs to the Special Issue Advances in Machine Learning for Computer Vision Applications)


Abstract

Sign language is a visual language articulated through body movements. Existing approaches predominantly leverage RGB inputs, incurring substantial computational overhead and remaining susceptible to interference from foreground and background noise. A second fundamental challenge lies in accurately modeling the nonlinear temporal dynamics and inherent asynchrony across body parts that characterize sign language sequences. To address these challenges, we propose a novel part-wise graph Fourier learning method for skeleton-based continuous sign language recognition (PGF-SLR), which uniformly models the spatiotemporal relations of multiple body parts in a globally ordered yet locally unordered manner. Specifically, different parts within different time steps are treated as nodes, while the frequency domain attention between parts is treated as edges to construct a part-level Fourier fully connected graph. This enables the graph Fourier learning module to jointly capture spatiotemporal dependencies in the frequency domain, while our adaptive frequency enhancement method further amplifies discriminative action features in a lightweight and robust fashion. Finally, a dual-branch action learning module featuring an auxiliary action prediction branch to assist the recognition branch is designed to enhance the understanding of sign language. Our experimental results show that the proposed PGF-SLR achieved relative improvements of 3.31%/3.70% and 2.81%/7.33% compared to SOTA methods on the dev/test sets of the PHOENIX14 and PHOENIX14-T datasets. It also demonstrated highly competitive recognition performance on the CSL-Daily dataset, showcasing strong generalization while reducing computational costs in both offline and online settings.

Keywords: Fourier fully connected graph; frequency enhancement; part-wise action recognition; continuous sign language recognition

1. Introduction

Sign language, characterized by its unique grammar and lexicon, enables the communication of complex ideas and emotions through coordinated body movements, including hand gestures, facial expressions, and torso movements, thus serving as an effective communication medium for the deaf community [1]. However, the majority of hearing individuals lack proficiency in sign language, necessitating the development of sign language recognition (SLR) methods to facilitate effective communication between the deaf and hearing communities [2]. SLR tasks are commonly categorized into isolated sign language recognition (ISLR) [3] and continuous sign language recognition (CSLR). The former focuses on recognizing individual sign language words—that is, Gloss—while the latter aims to identify Gloss sequences and their contextual semantics [4]. Current CSLR tasks predominantly utilize video data as input, leveraging the rich multi-modal information, including RGB images [5], optical flow [6], and human skeletons [7]. Prior to the advent of deep learning, CSLR primarily relied on handcrafted features derived from RGB data, such as HOG or SIFT [8]. Then, researchers incorporated neural networks into their designs. For instance, Cheng et al. [9] replaced handcrafted features with fully convolutional networks, and they learned spatial and temporal features from only sentence-level annotations. Han et al. [10] utilized attention mechanisms to model spatial appearance and temporal evolution concurrently. With the widespread adoption of the CTC [11] loss, which introduces blank tokens to represent uncertainties in the output sequence and allows the merging of consecutive redundant frames into a single frame, end-to-end training of models by maximizing the probability of the output sequence became feasible [12]. 
Although image-based methods offer abundant visual information and the potential for multi-modal fusion, the foreground and background content in SLR videos often acts as interference. Moreover, these methods must process every pixel or patch, leading to a significant computational load.
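To make the CTC decoding rule mentioned above concrete, the following minimal sketch collapses a frame-level label sequence by merging consecutive repeats and then dropping blank tokens. The blank symbol `-` and the gloss labels are hypothetical and chosen only for illustration; this is not the training loss itself, only the collapsing rule that CTC relies on.

```python
from itertools import groupby

BLANK = "-"  # hypothetical blank symbol, assumed for this sketch


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then drop blank tokens."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# Hypothetical per-frame predictions for a nine-frame clip.
frames = ["-", "RAIN", "RAIN", "-", "-", "TOMORROW", "TOMORROW", "-", "RAIN"]
decoded = ctc_collapse(frames)
```

Note that a blank between two identical labels keeps them as separate Glosses, which is why the blank token is needed in the first place.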

Skeleton data contain only 2D [13] or 3D [14] human joint coordinates, making them more compact and abstract than other modalities. Because they focus on movement, skeleton-based action-recognition methods can effectively handle complex backgrounds and varying lighting conditions, demonstrating greater efficiency and robustness. GNN-based methods have dominated skeleton-based action recognition since the introduction of ST-GCN [13]. Subsequent works, such as BlockGCN [15], have enhanced spatial modeling via learnable graph topology refinement. APLE-GCN [16] obtains part-level features by adaptively fusing the joint-level features extracted from multiple data streams. For sign language recognition tasks, Tunga et al. [17] developed a GCN-Transformer hybrid, combining GCNs [18] for spatial dependencies and Transformers [19] for temporal dynamics. The CoSign-1s [7] model partitions body parts into specialized GCN streams with inter-sequence interaction, while CoSign-2s [7] fuses skeletal and motion features. In MSKA [20], different stream-level representations within the same time period share the same semantics, and multi-stream keypoint attention is employed for feature fusion. These methods either utilize skeleton data as auxiliary information or model skeleton sequences independently, focusing solely on shallow data relationships. They therefore fail to adequately consider multi-part coordination in actions, which hinders recognition performance.

Continuous SLR presents unique challenges due to the inherently dynamic and nonlinear nature of sign language, compounded by asynchronous body movements that complicate spatiotemporal representation. Models must account for inter-signer variability in execution speed and expressive style while accurately parsing continuous action sequences [21]. Notably, even intra-signer consistency proves challenging, as the same action may exhibit movement variations across repetitions. SignGraph [22], based on RGB data, represents sign language sequences as graphs and proposes Local Sign Graph and Temporal Sign Graph modules to learn correlations of cross-region features within frames and interactions of cross-region features between adjacent frames, respectively. CorrNet [23] proposes a correlation network architecture that dynamically computes spatiotemporal affinity matrices between the current frame and its adjacent temporal neighbors. Among skeleton-based methods, CoSign [7] first models group-level sequences separately and then uses a complementary mask for inter-sequence interactions to obtain the final action representation. C2SLR [24] further employs skeletal heatmaps as spatial attention guides to focus on information-rich regions. In these methods, the asynchrony among different body parts leads to poor compatibility between the spatial and temporal networks, which contradicts the unity of space and time in the real world and restricts the action representation capacity of skeleton-based methods.

To address the aforementioned challenges, we propose a part-wise graph Fourier learning method for skeleton-based continuous sign language recognition (PGF-SLR), as shown in Figure 1. Specifically, PGF-SLR unifies spatiotemporal feature modeling from a graph learning perspective, representing both intra- and inter-part correlations of part-level features as node–node dependencies in a Fourier fully connected graph. PGF-SLR then performs feature learning in the Fourier space via a graph Fourier learning module consisting of stacked Fourier graph operators (FGOs) and an adaptive frequency enhancement module. Furthermore, we design a dual-branch action learning structure in which an action prediction branch assists the sign recognition branch to enhance the contextual understanding of the model. Our extensive experiments on three widely used datasets demonstrate that PGF-SLR achieves highly competitive recognition performance using only skeletal data. The main contributions are as follows:

We propose a novel Fourier fully connected graph action representation structure, which uses the Fourier space features of part-level topological graphs as nodes, employs inter-node attention as edges, and constructs action sequences in a globally ordered yet locally unordered manner.
We propose a graph Fourier learning method that employs Fourier graph operators to learn representations from the Fourier fully connected graph, then applies the proposed MLP-based amplitude enhancement module to improve the sign language representation capability of the model.
We design a dual-branch action learning strategy that integrates an action prediction branch to assist the traditional recognition branch. The two branches collaboratively reinforce each other, thereby improving the model’s understanding of sign sequences.

The rest of this paper is organized as follows: Section 2 introduces related works, Section 3 introduces the proposed PGF-SLR method, Section 4 introduces our experimental results and analysis, and Section 5 is the conclusion.

2. Related Work

This section provides a concise overview of related works on skeleton-based, part-based, and GNN-based action recognition.

2.1. Skeleton-Based Action Recognition

Advancements in pose estimation algorithms [25] and depth cameras, such as Kinect [26], have enabled the rapid and accurate acquisition of human keypoints at a low cost. Skeleton-based action recognition has garnered increasing attention, due to its more compact representation compared to RGB-based methods. Additionally, skeleton-based representations exhibit greater robustness against illumination variations and background noise. Traditional skeleton-based approaches rely on handcrafted descriptors with fixed receptive fields to model joint-level motion patterns across entire temporal sequences, which are then fed into deep networks such as RNNs or CNNs for feature extraction [27]. While these descriptors are interpretable, they typically capture only shallow and simple features, making it difficult to discover significant deep features. Inspired by graph learning, ST-GCN [13] introduced GCN to skeleton-based action recognition, leveraging the topological structure of the human skeleton to simultaneously model spatial configurations and the temporal dynamics of skeletons. CTR-GC [28] further enhanced spatial modeling through learnable topological refinement. Unlike methods that transform skeleton data into images or graphs, Transformer-based methods directly use attention blocks to model dependencies among joints. SkateFormer [29] dynamically partitions skeletal joints and temporal frames, based on four fundamental skeletal–temporal relations, performing specialized skeletal–temporal self-attention computations within each part. Shi et al. [30] adopted a pure Transformer network to explore joint relationships.

In this work, we employ skeleton sequences extracted from RGB videos using the established MMPose toolbox [25]. By computing part-level frequency attention in the Fourier space, our method avoids the interference present in RGB data, achieving more robust recognition results.

2.2. Part-Based Action Recognition

The human body is a natural topological structure [31], and part-based action recognition methods aim to learn action representations from the coordination of different body parts. Du et al. [32] proposed a hierarchical recurrent neural network for hierarchically concatenating features from different body parts. Chen et al. [33] proposed a coarse-to-fine framework that first predicts the video-level action category and then localizes body parts for part-level action recognition. Song et al. [34] proposed a part attention mechanism to identify the most informative parts. The MGCF-Net [35] model first segments the input into clips and then divides these clips into several parts, based on the structure of the human body. It employs multi-head self-attention to capture contextual features at both the joint and part levels. MMP-ST [36] is a multimodal method incorporating both skeletal part features and textual features for action recognition. STF-Net [37] proposed a multi-grain contextual focus module that captures multi-scale features and a temporal discrimination focus module to capture the local sensitive points of the temporal dynamics. GAP [38] proposed a multimodal training scheme that employs a pre-trained large-scale language model as the knowledge base to generate prompts for body parts and supervise the action encoder. All these methods adopt complex strategies for individual information propagation or fusion of information from different parts.

Considering the nonlinearity of sign language and the asynchrony between parts, we partition sign language actions into parts and use part-level Fourier features as nodes in a Fourier fully connected graph to understand signs in a globally ordered yet locally unordered manner.

2.3. GNN-Based Action Recognition

The existing GNN-based action recognition methods heavily rely on predefined graph structures to establish spatial correlations. However, these methods fail to capture how spatial correlation patterns change over time. Recognizing the limitations of fixed topology based on natural connections, subsequent works have adopted learnable topologies for action recognition. Notably, CTR-GC [28] is a spatial-adaptive graph generation method based on non-local mechanisms that enhances the flexibility of the skeletal graph structure, thereby improving GNN performance. The GTC-Net [39] model is a complementary network integrating GCN and Transformer to capture both local and global joint correlations, enabling parallel information interaction between the two domains. STA-GCN [40] is a spatial-temporal adaptive GCN that learns an adaptive integration strategy for spatial and temporal features. DS-GCN [41] is a joint-type-aware and edge-type-aware adaptive topology that integrates these semantic modules with temporal convolution for skeleton-based action recognition. JDP-GCN [42] proposed a distributed spatiotemporal perception module for joint-wise distributed multi-location perception and an anchor pose-driven subaction encoding module that provides informative clues for subaction reasoning. However, these decoupled modeling approaches, which independently learn spatial and temporal correlations, contradict the unified spatiotemporal dependencies of real-world scenarios, ultimately limiting their predictive performance.

Recent research has explored combining complex numbers with graph structures, particularly through Complex-Valued Neural Networks [43] (CVNNs) that process complex inputs and parameters to transform signals between time–frequency domains, efficiently handling phase and amplitude components via complex matrix operations [44]. Graph extensions of time–frequency analysis [45] examine localized spectral features around vertices using window functions, while architectures like GrokFormer [46] integrate graph Fourier transforms with Kolmogorov–Arnold networks to learn adaptive spectral representations. However, the direct modification of complex numbers in traditional frequency domain methods may cause phase distortion.

In this paper, we propose a novel graph Fourier learning method that constructs multi-part action sequences from a pure graph perspective. Our approach uniformly models spatiotemporal dependencies through Fourier fully connected graphs and efficiently performs matrix multiplication in the frequency domain using FGOs.

3. Methodology

We propose a part-wise graph Fourier learning method for skeleton-based continuous sign language recognition. The method constructs a unified dependency representation by modeling part-level features as nodes and their frequency domain attention as edges, effectively capturing skeletal motion patterns to boost recognition accuracy.

3.1. Problem Definition

As shown in Figure 2, we employ a well-established skeleton-extraction method to extract human skeletal data from sign language videos. We focus on 77 nodes of the upper body and segment them into four parts: the torso, with 9 nodes; the left and right hands, with 21 nodes each; and the face, with 26 nodes. The corresponding part-level topological graphs are denoted as GB, GL, GR, and GF. Let $X \in \mathbb{R}^{T \times 77 \times 3}$ represent a skeletal sequence with $T$ frames and 3 channels. Our objective is to accurately model part-level feature interactions in Fourier space to obtain higher-quality action representations, thereby enhancing the model’s capability for sign language recognition.
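As a rough illustration of the part partition described here, the sketch below splits one 77-joint frame into the four parts. The joint ordering (torso, then left hand, right hand, face) is an assumption made only for this example; the actual index layout depends on the pose estimator output and is not specified in the text.

```python
# Part sizes from the text; the ordering of the parts inside a frame is a
# hypothetical layout assumed for this sketch.
PART_SIZES = {"GB": 9, "GL": 21, "GR": 21, "GF": 26}


def part_slices(part_sizes):
    """Return a slice per part plus the total joint count."""
    slices, start = {}, 0
    for name, size in part_sizes.items():
        slices[name] = slice(start, start + size)
        start += size
    return slices, start


def split_frame(frame, slices):
    """Split one frame (a list of (x, y, c) joints) into part-level lists."""
    return {name: frame[part] for name, part in slices.items()}


SLICES, NUM_JOINTS = part_slices(PART_SIZES)
frame = [(0.0, 0.0, 1.0)] * NUM_JOINTS  # placeholder joints (x, y, confidence)
parts = split_frame(frame, SLICES)
```

The four part sizes sum to 77, matching the second dimension of the input tensor.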

3.2. Model Framework

As shown in Figure 3, the PGF-SLR consists of three main modules:

Fourier fully connected graph construction module. This module first maps part-level topology graphs into Fourier space using the graph Fourier transform (GFT), and it then constructs a Fourier fully connected graph using the part-level Fourier representations as node features and their frequency domain attention as edges.
Graph Fourier learning module. The module employs stacked Fourier graph operators (FGOs) to learn the spatiotemporal relationships among nodes in the Fourier fully connected graph. An adaptive frequency enhancement module is attached to enhance the learned action features.
Dual-branch learning module. This module comprises an auxiliary action prediction branch to assist the sign language recognition branch to obtain higher-quality sign language action representations.

3.3. Fourier Fully Connected Graph Construction

Due to the nonlinear characteristics of sign language, asynchronous movements naturally occur during multi-part coordination. Unlike the classical two-stage approach that independently models spatial and temporal information [20], the Fourier fully connected graph learns action representations from the perspective of graph learning. It unifies the intra- and inter-sequence correlations of part-level features as node dependencies in a fully connected graph, eliminating the incompatibility of spatiotemporal modeling. This approach constructs adaptive spatiotemporal dependencies, facilitating the learning of asynchrony among actions.

The subgraphs in Figure 3b,c illustrate the Fourier fully connected graph construction process. Specifically, each topology graph is defined as $G_{p} = (X_{p}, A_{p})$, with $A_{p}$ representing the part-level anatomical adjacency matrix and nodes $X_{p} \in (x, y, c)$ representing skeletal joints, including $(x, y)$ coordinates and confidence value $c$. We first utilize the graph Fourier transform (GFT) to transform the $G_{p}$ of the four parts into Fourier space, obtaining four part-level Fourier features denoted as $f_{GB} \in \mathbb{R}^{d}$, $f_{GL} \in \mathbb{R}^{d}$, $f_{GR} \in \mathbb{R}^{d}$, and $f_{GF} \in \mathbb{R}^{d}$, where $d$ is the dimension of the Fourier features, as shown in Figure 3b. Given an input window $W_{in}$ at time step $t$, the input feature is denoted as $X_{t}^{W_{in}} \in \mathbb{R}^{P \times W_{in} \times d}$, where $P$ is the number of parts. We use the part-level Fourier features as nodes and the frequency domain attention among the nodes as edges, i.e., the values at the corresponding positions in the adjacency matrix $A_{t}^{W_{in}} \in \mathbb{R}^{(P \times W_{in}) \times (P \times W_{in})}$, to construct the corresponding Fourier fully connected graph $G_{t}^{W_{in}} = (X_{t}^{W_{in}}, A_{t}^{W_{in}})$, as shown in Figure 3c. Thus, we can formulate the part-wise action understanding task as the understanding of the fully connected graph and formalize it as

$$\hat{Y}_{t} = F_{\theta_{G}}(X_{t}^{G}, A_{t}^{G}) \tag{1}$$

where $F_{\theta_{G}}$ is the graph Fourier learning module and $\theta_{G}$ represents the parameters of the module.
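The construction above can be sketched as follows: every (time step, part) pair inside the window becomes one node, and a dense attention matrix over all nodes plays the role of the adjacency matrix. The text does not give the exact form of the frequency domain attention, so the softmax over scaled dot products used here is an assumption, and the real-valued toy features stand in for the part-level Fourier features.

```python
import math


def build_fully_connected_graph(window):
    """Flatten a window into nodes and compute a dense attention adjacency.

    window[t][p] is the d-dimensional feature of part p at step t,
    so the window holds W_in x P feature vectors.
    """
    nodes = [feat for step in window for feat in step]  # P * W_in nodes
    d = len(nodes[0])
    adjacency = []
    for query in nodes:
        # Assumed attention: scaled dot product followed by a row-wise softmax.
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
                  for key in nodes]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        adjacency.append([e / total for e in exps])
    return nodes, adjacency


# Toy window: W_in = 3 steps, P = 4 parts, d = 2 features per node.
window = [[[float(t + p), float(t - p)] for p in range(4)] for t in range(3)]
nodes, adjacency = build_fully_connected_graph(window)
```

With $P = 4$ and $W_{in} = 3$ this yields 12 nodes and a $12 \times 12$ adjacency matrix, so spatial and temporal relations share a single graph rather than two separate networks.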

3.4. Graph Fourier Learning

The graph Fourier learning module consists of a Fourier graph neural network and an adaptive frequency enhancement module.

3.4.1. Fourier Graph Neural Network

Representing action sequences as pure graphs can enhance spatiotemporal modeling. Given a Fourier fully connected graph $G^{W_{in}} = (X^{W_{in}}, A^{W_{in}})$, the size of the graph grows with the window size $W_{in}$, resulting in a quadratic increase in computational cost for classical graph networks and posing optimization challenges when obtaining precise hidden node representations [31]. To address these issues, we define a weight matrix $W \in \mathbb{R}^{d \times d}$ and introduce a learnable Fourier graph operator (FGO) [47] based on a tailored Green’s kernel $\kappa : [P \times W_{in}] \times [P \times W_{in}] \to \mathbb{R}^{d \times d}$, where $\kappa[m, n] = A_{mn}^{W_{in}} \odot W$ and $\kappa[m, n] = \kappa[m - n]$. The latter condition ensures translation invariance, thereby enabling the model to capture part-level relationships in the Fourier domain of fully connected graphs, as shown in Figure 4b. We then define the FGO as $S_{G} = F(\kappa) \in \mathbb{C}^{P \times W_{in} \times d \times d}$, where $F(\cdot)$ denotes the Discrete Fourier Transform (DFT).

According to the convolution theorem [48], the Fourier transform of the convolution of two signals is equivalent to the product of their Fourier transforms in the frequency domain. Thus, in the Fourier space the multiplication of $F(X^{W_{in}})$ with the FGO can be expressed as

$$\begin{aligned} F(X^{W_{in}}) F(\kappa) &= F((X^{W_{in}} * \kappa)[m]) = F\Big(\sum_{n=1}^{P \times W_{in}} X^{W_{in}}[n] \kappa[m - n]\Big) \\ &= F\Big(\sum_{n=1}^{P \times W_{in}} X^{W_{in}}[n] \kappa[m, n]\Big), \quad \forall m \in [0, 1, \dots, W_{in}] \end{aligned} \tag{2}$$

where $(X^{W_{in}} * \kappa)[m]$ represents the convolution of $X^{W_{in}}$ and $\kappa$. Based on the definition $\kappa[m, n] = A_{mn}^{W_{in}} \odot W$, the convolution in Equation (2) can be expressed as

$$\sum_{n=1}^{P \times W_{in}} X^{W_{in}}[n] \kappa[m, n] = \sum_{n=1}^{P \times W_{in}} A_{mn} X^{W_{in}}[n] W = A X^{W_{in}} W \tag{3}$$

Thus, we can derive

$$F(X^{W_{in}}) S_{G} = F(A X^{W_{in}} W) \tag{4}$$

indicating that multiplying $F(X^{W_{in}})$ by the Fourier-transformed Green’s kernel in the Fourier space corresponds to a graph convolution operation in the time domain. Moreover, multiplication in the Fourier space has significantly lower complexity than shifting operations in the time domain ($O(n^{2}) > O(n)$). Thus, we employ an $n$-invariant polynomial Fourier kernel $S \in \mathbb{C}^{d \times d}$ to reduce computational complexity. During the feature learning stage, by stacking multiple layers of FGOs we recursively multiply $X_{t}^{W_{in}}$ and the FGO operator $S_{0:k}$ in the Fourier space and represent the output as

$$Y_{t}^{G} = \sum_{k=0}^{K} \sigma(F(X^{W_{in}}) S_{0:k} + b_{k}), \quad S_{0:k} = \prod_{i=0}^{k} S_{i} \tag{5}$$

where $\sigma$ is a nonlinear activation function, which gives the model a nonlinear graph information diffusion capability during the summation process, $S_{k}$ is the FGO at the $k$-th layer, and $b_{k} \in \mathbb{C}^{d}$ is the bias. Since all the operations are performed in the Fourier space, all the parameters are complex numbers.
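The convolution theorem invoked for Equation (2) can be checked numerically. The sketch below is a simplified scalar illustration rather than the FGO itself: it uses a one-dimensional signal and a translation-invariant kernel $\kappa[m - n]$ with circular indexing, and a naive DFT written out explicitly. The signal and kernel values are hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]


def circular_convolve(x, kernel):
    """Time-domain convolution with a shift-invariant kernel kappa[m - n]."""
    n = len(x)
    return [sum(x[j] * kernel[(m - j) % n] for j in range(n)) for m in range(n)]


x = [1.0, 2.0, 0.0, -1.0]          # toy node signal
kernel = [0.5, 0.25, 0.0, 0.25]    # toy translation-invariant kernel

lhs = dft(circular_convolve(x, kernel))             # F(x * kappa)
rhs = [a * b for a, b in zip(dft(x), dft(kernel))]  # F(x) F(kappa)
max_gap = max(abs(a - b) for a, b in zip(lhs, rhs))
```

The time-domain convolution costs a double loop over nodes, whereas the frequency-domain side is a single element-wise product once the transforms are available, which is the saving the FGO exploits.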

3.4.2. Adaptive Frequency Enhancing Module

To effectively capture the detailed characteristics of motion sequences while avoiding the phase distortion caused by direct complex number modification in traditional frequency domain methods [43], we enhance the frequency domain features processed by FourierGNN through selective frequency band amplification with strict phase relationship preservation, as shown in Figure 4c.

Specifically, we first decompose the complex spectrum $Y_{t}^{G}$ as follows:

$$\mathrm{Amp}_{Y} = |Y_{t}^{G}| \tag{6}$$

$$\mathrm{Pha}_{Y} = \mathrm{angle}(Y_{t}^{G}) \tag{7}$$

where the amplitude spectrum $\mathrm{Amp}_{Y}$ is adaptively enhanced to amplify critical frequency components, while the phase spectrum $\mathrm{Pha}_{Y}$ is preserved to maintain structural information. For the amplitude spectrum, we perform a nonlinear transformation using a $1 \times 1$ convolution-based MLP to adaptively enhance the expressive ability of the key frequency components:

$$\mathrm{Amp}_{Y}^{'} = \sigma(W_{2} \cdot \mathrm{ReLU}(W_{1} \cdot \mathrm{Amp}_{Y} + b_{1}) + b_{2}) \tag{8}$$

where $W_{1} \in \mathbb{R}^{(P \times W_{in}) \times (2 \times P \times W_{in})}$ and $W_{2} \in \mathbb{R}^{(2 \times P \times W_{in}) \times (P \times W_{in})}$ are the learnable weight matrices, and where $b_{1} \in \mathbb{R}^{2 \times P \times W_{in}}$ and $b_{2} \in \mathbb{R}^{P \times W_{in}}$ are the corresponding biases of the two convolution layers. The enhanced amplitude spectrum $\mathrm{Amp}_{Y}^{'}$ is then recombined with the original phase spectrum $\mathrm{Pha}_{Y}$ to reconstruct the enhanced complex frequency representation $Y_{t}^{GE}$, as follows:

$$Y_{real} = \mathrm{Amp}_{Y}^{'} \cdot \cos(\mathrm{Pha}_{Y}) \tag{9}$$

$$Y_{imag} = \mathrm{Amp}_{Y}^{'} \cdot \sin(\mathrm{Pha}_{Y}) \tag{10}$$

where $Y_{real}$ and $Y_{imag}$ represent the real part and the imaginary coefficient of the adaptively enhanced frequency representation, and where $\cos(\cdot)$ and $\sin(\cdot)$ denote the cosine and sine functions. Overall, the pipeline of the graph Fourier learning module is shown in Algorithm 1.

Algorithm 1: Graph Fourier learning.
Input: Fourier fully connected graph $G^{W_{in}} = (X^{W_{in}}, A^{W_{in}})$
Output: Enhanced action representation $Y_{t}^{GE}$
$Y_{t}^{G} = 0$ ;	// Initialize the accumulator
for $k = 0$ to $K$ do
    $Y_{t}^{G} = Y_{t}^{G} + \sigma(F(X^{W_{in}}) S_{0:k} + b_{k})$ ;	// Stacked FGOs, Equation (5)
$Amp_{Y} = abs(Y_{t}^{G})$ ;	// Get amplitude
$Pha_{Y} = angle(Y_{t}^{G})$ ;	// Get phase
$Amp_{Y}^{'} = \sigma(W_{2} \cdot ReLU(W_{1} \cdot Amp_{Y} + b_{1}) + b_{2})$ ;	// $1 \times 1$ conv-based MLP
$Y_{real} = Amp_{Y}^{'} \cdot cos(Pha_{Y})$ ;	// Enhanced real part
$Y_{imag} = Amp_{Y}^{'} \cdot sin(Pha_{Y})$ ;	// Enhanced imaginary coefficient
$Y_{t}^{GE} = complex(Y_{real}, Y_{imag})$ ;	// Frequency-enhanced representation $Y_{t}^{GE}$
return $Y_{t}^{GE}$
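The amplitude enhancement steps of Algorithm 1 can be sketched on a toy spectrum as follows. This is a minimal illustration, not the authors' implementation: the outer activation $\sigma$ is assumed to be a sigmoid, the weights are fixed hypothetical constants instead of learned parameters, and the MLP acts on a flat vector of three node amplitudes.

```python
import cmath
import math


def matvec(vec, weights, bias):
    """Row vector (n) times matrix (n x m) plus bias (m)."""
    return [sum(v * row[j] for v, row in zip(vec, weights)) + bias[j]
            for j in range(len(bias))]


def enhance(spectrum, w1, b1, w2, b2):
    """Rescale amplitudes with a small MLP while keeping every phase fixed."""
    amp = [abs(z) for z in spectrum]                      # Equation (6)
    pha = [cmath.phase(z) for z in spectrum]              # Equation (7)
    hidden = [max(0.0, h) for h in matvec(amp, w1, b1)]   # ReLU layer
    amp_new = [1.0 / (1.0 + math.exp(-o))                 # sigmoid (assumed)
               for o in matvec(hidden, w2, b2)]           # Equation (8)
    # cmath.rect(r, phi) returns r * (cos(phi) + i sin(phi)), Equations (9)-(10).
    return [cmath.rect(a, p) for a, p in zip(amp_new, pha)]


spectrum = [1 + 1j, -2 + 0.5j, 0.3 - 0.4j]    # toy complex features, n = 3
w1 = [[0.1] * 6 for _ in range(3)]            # n x 2n, hypothetical weights
b1 = [0.0] * 6
w2 = [[0.2] * 3 for _ in range(6)]            # 2n x n, hypothetical weights
b2 = [0.0] * 3
enhanced = enhance(spectrum, w1, b1, w2, b2)
```

Because only the magnitudes pass through the MLP and the new amplitudes are positive, each output keeps the phase angle of its input, which is the property the module relies on to avoid phase distortion.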

3.5. Dual-Branch Action Learning Module

While humans can recognize sign language from just a few key motions, current sign language recognition methods still require processing massive amounts of video frames, resulting in significant redundancy. Previous studies [49] demonstrate that keyframes account for merely 45%, 47%, and 42% of the PHOENIX14, PHOENIX14-T, and CSL-Daily datasets, respectively, with adjacent frames showing exceptionally high similarity. Thus, in this study, we segment the input data into two keyframe subsequences. Specifically, adjacent frames in the input sequence $X \in \mathbb{R}^{T \times 77 \times 3}$ are grouped into pairs, with each frame randomly assigned to one of the two subsequences, resulting in $X_{1} \in \mathbb{R}^{T/2 \times 77 \times 3}$ and $X_{2} \in \mathbb{R}^{T/2 \times 77 \times 3}$. The outputs of these two keyframe subsequences, after undergoing Fourier fully connected graph learning, are denoted as $Y_{t1}^{G}$ and $Y_{t2}^{G}$, serving as inputs to the subsequent dual-branch action learning module. Specifically, $Y_{t1}^{G}$ is used for sign language context learning, while $Y_{t2}^{G}$ is utilized for action prediction. We then align them using the Kullback–Leibler (KL) divergence, yielding the following loss:

$$L_{KL} = \frac{1}{2}\big(\mathrm{KL}(Y_{t1}^{G}, Y_{t2}^{G}) + \mathrm{KL}(Y_{t2}^{G}, Y_{t1}^{G})\big) \tag{11}$$
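The pairing step and the symmetric KL term of Equation (11) can be sketched as below. Two points are assumptions made for this example: the two frames of each pair are sent to different subsequences (which is what the $T/2$ shapes suggest), and the branch outputs are turned into probability distributions with a softmax before the KL divergence is taken, since the text does not say how the features are normalized.

```python
import math
import random


def split_pairs(frames, rng):
    """Group adjacent frames into pairs and send one frame to each subsequence.

    A trailing unpaired frame (odd length) is dropped in this sketch.
    """
    first, second = [], []
    for i in range(0, len(frames) - 1, 2):
        pair = [frames[i], frames[i + 1]]
        rng.shuffle(pair)  # random assignment within the pair
        first.append(pair[0])
        second.append(pair[1])
    return first, second


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions with positive entries in q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def symmetric_kl(p, q):
    """Equation (11): average of the two directed KL divergences."""
    return 0.5 * (kl(p, q) + kl(q, p))


x1, x2 = split_pairs(list(range(10)), random.Random(0))
p = softmax([0.2, 1.0, -0.5])   # hypothetical scores from branch 1
q = softmax([0.1, 0.7, 0.0])    # hypothetical scores from branch 2
loss_kl = symmetric_kl(p, q)
```

Averaging both directions makes the alignment term symmetric, so neither subsequence is treated as the fixed target.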

3.5.1. Sign Language Recognition Branch

We first transform the input to the sign language recognition branch, $Y_{t1}^{G}$, to the temporal domain by using the inverse graph Fourier transform (IGFT), obtaining $Y_{t1}^{IG} \in \mathbb{R}^{P \times W_{in} \times d}$. We then use a two-layer MLP to transform and flatten it to $\hat{Y}_{t1} \in \mathbb{R}^{1 \times (W_{in} \times d)}$. Following models [4,50] that take VAC [2] as the backbone, we also attach a 1D-CNN-based sign language contextual learning module and a BiLSTM-based [51] long-term sequence learning module, and we use two CTC-based losses, $L_{CTC}$ and $L_{seq}$, to supervise the short- and long-term feature learning of the sign language recognition branch, as shown in Figure 5a.

3.5.2. Action Prediction Branch

The action prediction branch forecasts the values for the next $\tau$ time steps, $\hat{Y}_{t2} = [x_{t+1}, \dots, x_{t+\tau}]$, based on the feature values $X_{t2} = [x_{t-W_{in}+1}, \dots, x_{t}]$ within the current input window of $W_{in}$ time steps, where $x_{t} \in \mathbb{R}^{P \times d}$. As shown in Figure 5b, after obtaining the final graph representation $Y_{t2}^{G}$, we use the IGFT to transform it back to the time domain, which yields $Y_{t2}^{IG}$. We then project $Y_{t2}^{IG}$ through a feed-forward network (FFN) consisting of two linear layers with a ReLU activation to obtain the predictions for the next $\tau$ time steps, which can be formalized as

$$\hat{Y}_{t2} = \mathrm{ReLU}(Y_{t2}^{IG} W_{3} + b_{3}) W_{4} + b_{4} \tag{12}$$

where $W_{3} \in \mathbb{R}^{(P \times W_{in} \times d) \times d_{1}^{FFN}}$ and $W_{4} \in \mathbb{R}^{d_{1}^{FFN} \times d^{\tau}}$ are the weights and $b_{3} \in \mathbb{R}^{d_{1}^{FFN}}$, $b_{4} \in \mathbb{R}^{d^{\tau}}$ are the biases of the two layers, respectively, and where $d_{1}^{FFN}$ and $d^{\tau}$ represent the dimensions of the two layers. Using $Y_{t2}$ as ground truth, the loss of the prediction branch is

$$L_{pre} = \sum_{i=1}^{\tau} \| Y_{t2}^{i} - \hat{Y}_{t2}^{i} \|_{2}^{2} \tag{13}$$
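Equation (13) is a sum of squared L2 distances over the $\tau$ predicted steps. A minimal sketch, with each step represented as a flat list of feature values and with hypothetical toy numbers:

```python
def prediction_loss(targets, predictions):
    """Sum over the predicted steps of the squared L2 distance per step."""
    return sum(
        sum((y - y_hat) ** 2 for y, y_hat in zip(target, pred))
        for target, pred in zip(targets, predictions)
    )


# Two future steps (tau = 2) with two features each, toy values.
targets = [[1.0, 2.0], [3.0, 4.0]]
predictions = [[1.0, 1.0], [2.0, 4.0]]
loss_pre = prediction_loss(targets, predictions)
```

Here the first step contributes $0^2 + 1^2$ and the second $1^2 + 0^2$, so the loss is 2.0.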

3.6. Loss Function

To achieve better training results, we propose a new loss function:

$$L = L_{seq} + L_{CTC} + \alpha_{1} L_{KL} + \alpha_{2} L_{pre} \tag{14}$$

where $L_{seq}$ and $L_{CTC}$ are the traditional losses based on the VAC [2] model, $L_{KL}$ is the alignment loss for the two keyframe subsequences, $L_{pre}$ is the action prediction loss, and $\alpha_{1}$ and $\alpha_{2}$ are the weights for $L_{KL}$ and $L_{pre}$, respectively.
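As a small worked example of Equation (14), the sketch below combines the four terms. The default weights follow the values reported in the implementation details ($\alpha_{1} = 1.0$, $\alpha_{2} = 0.2$); the individual loss values are hypothetical numbers chosen only to show the arithmetic.

```python
def total_loss(l_seq, l_ctc, l_kl, l_pre, alpha_1=1.0, alpha_2=0.2):
    """Weighted sum of the recognition, alignment, and prediction losses."""
    return l_seq + l_ctc + alpha_1 * l_kl + alpha_2 * l_pre


# Hypothetical loss values for one training step.
loss = total_loss(l_seq=2.0, l_ctc=3.0, l_kl=0.5, l_pre=10.0)
```

With these numbers the prediction term contributes $0.2 \times 10.0 = 2.0$, giving a total of 7.5, so the auxiliary branch influences training without dominating the two CTC-based terms.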

4. Experiments

This section details our experiments, including the datasets, implementation details, baseline methods and evaluation metric. We benchmark our method against state-of-the-art approaches and provide comprehensive empirical analysis.

4.1. Datasets

We selected three of the most commonly used and representative datasets in the field of sign language recognition for our comparative experiments:

PHOENIX14 [52] is a dataset recorded by nine presenters, extracted from German weather forecasts with high-contrast backgrounds. It contains 6841 sentences with a total of 1295 Glosses. The dataset is divided into train/dev/test sets, comprising 5672/540/629 samples, respectively.
PHOENIX14-T [1] is another dataset extracted from German weather forecasts. It includes 1085 Glosses distributed across 8247 sentences. The distribution of samples in train/dev/test sets is 7096/519/642, respectively.
CSL-Daily [53] is a Chinese Sign Language dataset related to daily life, recorded indoors by 10 signers. Compared to the previous two datasets, it has a noisier background. The dataset consists of 20654 sentences, with the train/dev/test sets containing 18401/1077/1176 samples, respectively.

4.2. Implementation Details

For this paper, we used the well-established MMPose [25] toolbox to extract human skeletal information from the RGB-based datasets. Following VAC [2], the short-term temporal convolution in the sign language recognition branch of PGF-SLR consists of a convolution with a kernel size of 5, a max-pooling layer with a kernel size of 2, and a second convolution with a kernel size of 5, as shown in Figure 6. For long-term sequence modeling, a two-layer Bidirectional Long Short-Term Memory (BiLSTM) network with a hidden size of 1024 was employed. The weights for the loss terms $L_{KL}$ and $L_{pre}$ were set to $\alpha_{1} = 1.0$ and $\alpha_{2} = 0.2$, respectively. All experiments were conducted on a platform with an Intel(R) Xeon(R) Platinum 8352V CPU and a single NVIDIA A6000 GPU (48 GB), using a batch size of 1.
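The conv-pool-conv stack above shortens the frame sequence before it reaches the BiLSTM, which matters because CTC-style training needs at least as many output steps as target glosses. The sketch below estimates the output length for a given number of input frames. It assumes stride-1 convolutions without padding and a pooling stride of 2; these settings are not stated in this section, so treat them as illustrative.

```python
def conv_out_len(n, kernel, stride=1, padding=0):
    """Output length of a 1-D convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


def temporal_module_out_len(num_frames):
    """Sequence length after conv(k=5) -> max-pool(k=2) -> conv(k=5).

    Assumptions (not confirmed by the text): no padding, conv stride 1,
    pooling stride 2.
    """
    n = conv_out_len(num_frames, kernel=5)
    n = conv_out_len(n, kernel=2, stride=2)
    n = conv_out_len(n, kernel=5)
    return n


for frames in (50, 100, 200):
    print(frames, "->", temporal_module_out_len(frames))
```

Under these assumptions the module roughly halves the temporal resolution, so very short clips can leave few output steps for alignment.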

4.3. Baseline Methods

To evaluate the proposed method, we selected state-of-the-art and representative methods from three types of sign language recognition models, grouped by input modality, as baselines: (1) models using only RGB input data, including VAC [2], SMKD [54], TLP [55], AdaBrowse [56], Contrastive [57], SEN [58], HST-GNN [59], TCNet [60], and SignGraph [22]; (2) models using keypoints together with other modalities, including STMC [61], CoSign-2s [7], and C2SLR [24]; (3) models using only keypoint data, including CoSign-1s [7] and MSKA [20].

4.4. Evaluation Metrics

For this paper, we used the Word Error Rate (WER) [62] as the evaluation metric. WER is the minimum number of substitution ($\#sub$), insertion ($\#ins$), and deletion ($\#del$) operations required to transform the predicted sentence into the ground truth, normalized by the number of glosses in the ground-truth sentence ($\#reference$). A lower WER indicates higher accuracy:

$$\mathrm{WER} = \frac{\#sub + \#ins + \#del}{\#reference} \qquad (15)$$
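To make Equation (15) concrete, the following sketch computes WER with a standard edit-distance dynamic program over whitespace-separated glosses. The function name and the example gloss sequences are made up for illustration; this is not the evaluation script used in the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (sub + ins + del) / len(reference), via Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # reference token missing from hypothesis
                curr[j - 1] + 1,     # extra token in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up gloss sequences: one of four reference glosses is dropped.
print(word_error_rate("A B C D", "A C D"))
```

Because the numerator counts insertions as well, WER can exceed 1 when the hypothesis is much longer than the reference.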

4.5. Comparison with Baseline Methods

Table 1 and Table 2 compare the PGF-SLR with the baseline models. The PGF-SLR performed well across all three datasets. On the PHOENIX14 dataset, the PGF-SLR surpassed all the baseline models, lowering WER from 18.1% to 17.5% on dev and from 18.9% to 18.2% on test relative to the previous SOTA TCNet, i.e., relative gains of 3.31% and 3.70%. Similarly, on the PHOENIX14-T dataset the PGF-SLR achieved the lowest WER, outperforming SignGraph by relative improvements of 2.81% and 7.33% on the dev and test sets (17.8% to 17.3% and 19.1% to 17.7%), respectively. As can be seen in Table 2, although its WER (27.7%/28.3% on dev/test) was higher than that of SignGraph (26.4%/25.8%), our model still achieved competitive recognition accuracy on the more complex CSL-Daily dataset. Compared to the models that used only skeleton input, the PGF-SLR showed significant improvements in recognition capability and outperformed most of the models that used RGB or multimodal inputs. Overall, the PGF-SLR demonstrated strong recognition performance across all three datasets, highlighting the generalization ability of the part-wise graph Fourier learning method for skeleton-based continuous sign language recognition.
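The relative gains quoted above follow directly from the WER values in Table 1. A short sketch of the arithmetic is given below; the helper name `relative_improvement` is an illustrative choice, and the numbers are copied from the TCNet, SignGraph, and PGF-SLR rows.

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction in percent: (baseline - new) / baseline * 100."""
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# (baseline WER, PGF-SLR WER) pairs taken from Table 1.
comparisons = {
    "PHOENIX14 dev vs. TCNet": (18.1, 17.5),
    "PHOENIX14 test vs. TCNet": (18.9, 18.2),
    "PHOENIX14-T dev vs. SignGraph": (17.8, 17.3),
    "PHOENIX14-T test vs. SignGraph": (19.1, 17.7),
}
for name, (base, ours) in comparisons.items():
    print(f"{name}: {relative_improvement(base, ours):.2f}%")
```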

4.6. Ablation Study

To systematically evaluate the contributions of the different components in our PGF-SLR model, we conducted comprehensive ablation studies, using CoSign-1s as the baseline model. The experiments examined three key components: DI (dual-keyframe subsequences input), FGL (Fourier fully connected graph construction and graph Fourier learning), and PRE (action prediction branch). As can be seen in Table 3, in group $a_1$, training on keyframe subsequences alone yielded a modest improvement over the baseline. In $a_2$, applying the FGL module to the full sequence produced substantial gains, confirming its strong motion-learning capacity. In $a_3$, adding PRE in isolation yielded only limited improvement. In $a_4$ and $a_6$, we combined FGL with DI and PRE, respectively; both configurations matched or surpassed $a_2$ on every split. However, the DI–PRE pair without FGL in $a_5$ delivered only marginal gains, indicating that DI and PRE are most effective when integrated with FGL. When all three modules were employed, the PGF-SLR achieved its best performance, outperforming the baseline on the dev and test sets of all three datasets. These results demonstrate that each of the three proposed modules contributes positively to the model's recognition capability.

4.7. Hyperparameter Sensitivity Analysis

4.7.1. Weights of Loss Function

As shown in Figure 7, we analyzed the impact of $\alpha_{1}$ and $\alpha_{2}$ in the loss function. When analyzing $\alpha_{1}$, $\alpha_{2}$ was fixed at 0.2; conversely, $\alpha_{1}$ was fixed at 1 when analyzing $\alpha_{2}$. The trends of each parameter were consistent across the dev and test sets. Because of gradient accumulation in the recognition module and the relatively large value of $L_{pre}$, the model was more sensitive to changes in $\alpha_{2}$ and more robust to changes in $\alpha_{1}$. As shown in Figure 7a, the model performed best at $\alpha_{1} = 1$. For $\alpha_{2}$, larger values significantly affected recognition capability, with the best performance at $\alpha_{2} = 0.2$, as shown in Figure 7b. Therefore, we selected $\alpha_{1} = 1$ and $\alpha_{2} = 0.2$ as the weights for the loss function.

4.7.2. Input Window Size

We conducted six groups of experiments with input window sizes $W_{in} \in \{1, 2, 3, 4, 5, 6\}$ to analyze their impact on action representation and recognition performance, where $W_{in}$ denotes the input window size. As shown in Figure 8a, the smaller input windows were unable to effectively capture the relationships between actions. As the window size increased, the model was exposed to more features, which led to more effective representation and higher recognition accuracy. However, once the window size reached a certain threshold, further enlargement ($W_{in} > 4$) yielded no additional accuracy gains. Therefore, we ultimately selected an input window size of 4 for constructing the fully connected graph.
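To illustrate how the input window size drives graph size, the sketch below enumerates the part-level nodes inside one window and connects every pair. It assumes the four body parts of Figure 2, undirected edges, and no self-loops; the part labels and the function name `window_graph` are illustrative, and the frequency-domain attention that weights each edge is omitted.

```python
from itertools import combinations

# Four skeleton parts per Figure 2 (labels chosen here for illustration).
PARTS = ("right_hand", "left_hand", "face", "body")


def window_graph(start_frame, window_size=4, parts=PARTS):
    """Nodes are (frame index, part) pairs; every pair of nodes is connected."""
    nodes = [
        (t, p)
        for t in range(start_frame, start_frame + window_size)
        for p in parts
    ]
    edges = list(combinations(nodes, 2))  # undirected, no self-loops (assumed)
    return nodes, edges


for w in range(1, 7):
    nodes, edges = window_graph(0, window_size=w)
    print(f"W_in={w}: {len(nodes)} nodes, {len(edges)} edges")
```

Under these assumptions the edge count grows quadratically with the window size, which is one reason a moderate window such as 4 can be attractive.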

4.7.3. Prediction Window Size

We conducted experiments on the PHOENIX14 dataset with six prediction window sizes, $W_{pre} \in \{1, 2, 3, 4, 5, 6\}$, for the action prediction branch, where $W_{pre}$ denotes the prediction window size. The results in Figure 8b show that the window size significantly affected recognition accuracy: both excessively large and excessively small windows degraded performance. The model achieved its best recognition performance at $W_{pre} = 4$, striking an effective balance between sufficient temporal context and computational efficiency. We therefore adopted $W_{pre} = 4$ as the number of prediction steps.

4.8. Online Inference

Unlike the offline setting, real-world online applications require models to process video frames sequentially and perform real-time recognition using both current and historical temporal information. We designed online recognition experiments to test the effectiveness of our PGF-SLR model in practical applications, with AdaBrowse [56] as the baseline model. AdaBrowse is a lightweight RGB-based sign language recognition model that accelerates inference by pruning redundant visual data; among the state-of-the-art methods, it achieves the lowest GFLOPs and the highest throughput, making it well suited for real-time deployment. We directly used the skeleton sequences extracted from the PHOENIX14 dataset to simulate real-world scenarios. Three metrics were selected to analyze the online inference ability of the PGF-SLR: throughput (videos/second), GFLOPs (giga floating-point operations), and WER. As shown in Table 4, our PGF-SLR outperformed the baseline on all three metrics, delivering 62.37% higher throughput (25.72 vs. 15.84 videos/second), requiring far fewer GFLOPs (11.5 vs. 175.0), and yielding an 11.06% relative WER reduction (20.8% to 18.5%), demonstrating good recognition accuracy with skeleton data. These results validate the effectiveness of our PGF-SLR in practical applications.
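The sequential processing described above can be pictured as a rolling buffer that always exposes the current frame together with a bounded amount of history. The sketch below is a generic illustration of that idea, not the authors' online pipeline: the function name `stream_windows` and the reuse of a window of 4 frames are assumptions, and the frames here are placeholder integers.

```python
from collections import deque


def stream_windows(frames, window_size=4):
    """Yield, for each incoming frame, the most recent frames seen so far.

    Only past and current frames are visible at each step, mimicking an
    online setting in which future frames are not yet available.
    """
    buffer = deque(maxlen=window_size)
    for frame in frames:
        buffer.append(frame)
        yield tuple(buffer)


# Placeholder frame indices standing in for per-frame skeleton features.
for window in stream_windows(range(6)):
    print(window)
```

A recognizer would be called on each yielded window; a bounded buffer keeps memory constant regardless of video length.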

5. Conclusions

We propose a part-wise Fourier fully connected graph learning method for continuous sign language recognition, exploring skeleton-only sign language recognition in the Fourier space. First, we partition the skeleton data into multiple parts and obtain their Fourier-domain features via the graph Fourier transform. These features serve as node representations in a fully connected graph, with inter-part relationships modeled as edges. In the graph Fourier learning module, we sequentially apply the Fourier graph neural network and the adaptive frequency enhancement module to refine action representations by amplifying phase and amplitude information. Finally, a dual-branch action learning module, in which an action prediction branch assists the sign recognition branch, enables collaborative learning and a deeper understanding of skeleton-based sign sequences. Our experiments on three widely used sign language datasets confirmed that the PGF-SLR, using purely skeleton features, is both efficient and robust. Specifically, the PGF-SLR attained relative improvements of 3.31%/3.70% on the dev/test sets of PHOENIX14 and 2.81%/7.33% on PHOENIX14-T over the prior state-of-the-art methods, while delivering highly competitive accuracy on the CSL-Daily dataset. Moreover, the PGF-SLR achieved a 62.37% relative improvement in throughput over AdaBrowse in online inference. In the future, we will investigate multimodal continuous sign language recognition leveraging skeleton data alongside other modalities.

Author Contributions

Conceptualization, D.W. and G.-F.M.; methodology, D.W. and H.H.; software, D.W.; validation, D.W., H.H. and G.-F.M.; formal analysis, H.H. and G.-F.M.; investigation, D.W., H.H. and G.-F.M.; resources, D.W.; data curation, H.H.; writing—original draft preparation, D.W.; writing—review and editing, D.W., H.H. and G.-F.M.; visualization, D.W. and H.H.; supervision, G.-F.M.; project administration, D.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author, due to future publication purposes.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Camgoz, N.C.; Hadfield, S.; Koller, O.; Ney, H.; Bowden, R. Neural sign language translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7784–7793.
2. Min, Y.; Hao, A.; Chai, X.; Chen, X. Visual alignment constraint for continuous sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 11542–11551.
3. Laines, D.; Gonzalez-Mendoza, M.; Ochoa-Ruiz, G.; Bejarano, G. Isolated sign language recognition based on tree structure skeleton images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 276–284.
4. Rao, Q.; Sun, K.; Wang, X.; Wang, Q.; Zhang, B. Cross-sentence gloss consistency for continuous sign language recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 4650–4658.
5. Hu, H.; Zhao, W.; Zhou, W.; Li, H. Signbert+: Hand-model-aware self-supervised pre-training for sign language understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11221–11239.
6. Piergiovanni, A.; Ryoo, M.S. Representation flow for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 9945–9953.
7. Jiao, P.; Min, Y.; Li, Y.; Wang, X.; Lei, L.; Chen, X. Cosign: Exploring co-occurrence signals in skeleton-based continuous sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 20676–20686.
8. Koller, O.; Zargaran, S.; Ney, H. Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent cnn-hmms. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4297–4305.
9. Cheng, K.L.; Yang, Z.; Chen, Q.; Tai, Y.-W. Fully convolutional networks for continuous sign language recognition. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXIV 16. pp. 697–714.
10. Han, X.; Lu, F.; Yin, J.; Tian, G.; Liu, J. Sign language recognition based on r (2 + 1) d with spatial–temporal–channel attention. IEEE Trans. Hum.-Mach. Syst. 2022, 52, 687–698.
11. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376.
12. Zhang, H.; Guo, Z.; Yang, Y.; Liu, X.; Hu, D. C2st: Cross-modal contextualized sequence transduction for continuous sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 21053–21062.
13. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
14. Zhang, S.; Yin, J.; Dang, Y. A generically contrastive spatiotemporal representation enhancement for 3d skeleton action recognition. Pattern Recognit. 2025, 164, 111521.
15. Zhou, Y.; Yan, X.; Cheng, Z.-Q.; Yan, Y.; Dai, Q.; Hua, X.-S. Blockgcn: Redefine topology awareness for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2049–2058.
16. Liu, C.; Liu, S.; Qiu, H.; Li, Z. Adaptive part-level embedding GCN: Towards robust skeleton-based one-shot action recognition. IEEE Trans. Instrum. Meas. 2025, 74, 5024713.
17. Tunga, A.; Nuthalapati, S.V.; Wachs, J. Pose-based sign language recognition using gcn and bert. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Online, 5–9 January 2021; pp. 31–40.
18. Corso, G.; Stark, H.; Jegelka, S.; Jaakkola, T.; Barzilay, R. Graph neural networks. Nat. Rev. Methods Prim. 2024, 4, 17.
19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; NIPS: Islamabad, Pakistan, 2017; Volume 30.
20. Guan, M.; Wang, Y.; Ma, G.; Liu, J.; Sun, M. Mska: Multi-stream keypoint attention network for sign language recognition and translation. Pattern Recognit. 2025, 165, 111602.
21. Gunasekara, S.R.; Li, W.; Yang, J.; Ogunbona, P. Asynchronous joint-based temporal pooling for skeleton-based action recognition. IEEE Trans. Circuits Syst. Video Technol. 2024, 35, 357–366.
22. Gan, S.; Yin, Y.; Jiang, Z.; Wen, H.; Xie, L.; Lu, S. Signgraph: A sign sequence is worth graphs of nodes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 13470–13479.
23. Hu, L.; Gao, L.; Liu, Z.; Feng, W. Continuous sign language recognition with correlation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2529–2539.
24. Zuo, R.; Mak, B. C2slr: Consistency-enhanced continuous sign language recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 5131–5140.
25. MMPose Contributors. OpenMMLab Pose Estimation Toolbox and Benchmark. 2020. Available online: https://github.com/open-mmlab/mmpose (accessed on 16 August 2025).
26. Zhang, Z. Microsoft kinect sensor and its effect. IEEE Multimed. 2012, 19, 4–10.
27. Li, S.; Zheng, L.; Zhu, C.; Gao, Y. Bidirectional independently recurrent neural network for skeleton-based hand gesture recognition. In Proceedings of the 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Sevilla, Spain, 10–21 October 2020; pp. 1–5.
28. Chen, Y.; Zhang, Z.; Yuan, C.; Li, B.; Deng, Y.; Hu, W. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 13359–13368.
29. Do, J.; Kim, M. Skateformer: Skeletal-temporal transformer for human action recognition. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2024; pp. 401–420.
30. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Decoupled spatial-temporal attention network for skeleton-based action-gesture recognition. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020.
31. Wang, Q.; Shi, S.; He, J.; Peng, J.; Liu, T.; Weng, R. Iip-transformer: Intra-inter-part transformer for skeleton-based action recognition. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; pp. 936–945.
32. Du, Y.; Wang, W.; Wang, L. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1110–1118.
33. Chen, T.; Zhou, D.; Wang, J.; Wang, S.; He, Q.; Hu, C.; Ding, E.; Guan, Y.; He, X. Part-aware prototypical graph network for one-shot skeleton-based action recognition. In Proceedings of the 2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG), Waikoloa Beach, HI, USA, 5–8 January 2023; pp. 1–8.
34. Song, Y.-F.; Zhang, Z.; Shan, C.; Wang, L. Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1474–1488.
35. Qiu, H.; Hou, B. Multi-grained clip focus for skeleton-based action recognition. Pattern Recognit. 2024, 148, 110188.
36. Zhou, L.; Jiao, X. Multi-modal and multi-part with skeletons and texts for action recognition. Expert Syst. Appl. 2025, 272, 126646.
37. Wu, L.; Zhang, C.; Zou, Y. Spatiotemporal focus for skeleton-based action recognition. Pattern Recognit. 2023, 136, 109231.
38. Xiang, W.; Li, C.; Zhou, Y.; Wang, B.; Zhang, L. Generative action description prompts for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 10276–10285.
39. Xiang, X.; Li, X.; Liu, X.; Qiao, Y.; El Saddik, A. A gcn and transformer complementary network for skeleton-based action recognition. Comput. Vis. Image Underst. 2024, 249, 104213.
40. Hang, R.; Li, M. Spatial-temporal adaptive graph convolutional network for skeleton-based action recognition. In Proceedings of the Asian Conference on Computer Vision, Macao, China, 4–8 December 2022; pp. 1265–1281.
41. Xie, J.; Meng, Y.; Zhao, Y.; Nguyen, A.; Yang, X.; Zheng, Y. Dynamic semantic-based spatial graph convolution network for skeleton-based human action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 6225–6233.
42. Huang, Q.; Geng, Q.; Chen, Z.; Li, X.; Li, Y.; Li, X. Joint-wise distributed perception graph convolutional network for skeleton-based action recognition. In Proceedings of the ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5.
43. Lee, C.; Hasegawa, H.; Gao, S. Complex-valued neural networks: A comprehensive survey. IEEE/CAA J. Autom. Sin. 2022, 9, 1406–1426.
44. Yu, X.; Wu, L.; Lin, Y.; Diao, J.; Liu, J.; Hallmann, J.; Boesenberg, U.; Lu, W.; Möller, J.; Scholz, M.; et al. Ultrafast bragg coherent diffraction imaging of epitaxial thin films using deep complex-valued neural networks. npj Comput. Mater. 2024, 10, 24.
45. Shuman, D.I.; Ricaud, B.; Vandergheynst, P. Vertex-frequency analysis on graphs. Appl. Comput. Harmon. Anal. 2016, 40, 260–291.
46. Ai, G.; Pang, G.; Qiao, H.; Gao, Y.; Yan, H. Grokformer: Graph fourier kolmogorov-arnold transformers. In Proceedings of the 42nd International Conference on Machine Learning (ICML), Vancouver, BC, Canada, 13–19 July 2025.
47. Yi, K.; Zhang, Q.; Fan, W.; He, H.; Hu, L.; Wang, P.; An, N.; Cao, L.; Niu, Z. Fouriergnn: Rethinking multivariate time series forecasting from a pure graph perspective. Adv. Neural Inf. Process. Syst. 2023, 36, 69638–69660.
48. Katznelson, Y. An Introduction to Harmonic Analysis; Cambridge University Press: Cambridge, UK, 2004.
49. Wei, D.; Yang, X.-H.; Weng, Y.; Lin, X.; Hu, H.; Liu, S. Cross-modal adaptive prototype learning for continuous sign language recognition. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 7354–7367.
50. Guo, L.; Xue, W.; Guo, Q.; Liu, B.; Zhang, K.; Yuan, T.; Chen, S. Distilling cross-temporal contexts for continuous sign language recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10771–10780.
51. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
52. Koller, O.; Forster, J.; Ney, H. Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Comput. Vis. Image Underst. 2015, 141, 108–125.
53. Zhou, H.; Zhou, W.; Qi, W.; Pu, J.; Li, H. Improving sign language translation with monolingual data by sign back-translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 1316–1325.
54. Hao, A.; Min, Y.; Chen, X. Self-mutual distillation learning for continuous sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 11303–11312.
55. Hu, L.; Gao, L.; Liu, Z.; Feng, W. Temporal lift pooling for continuous sign language recognition. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 511–527.
56. Hu, L.; Gao, L.; Liu, Z.; Pun, C.-M.; Feng, W. Adabrowse: Adaptive video browser for efficient continuous sign language recognition. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 9 October–3 November 2023; pp. 709–718.
57. Gan, S.; Yin, Y.; Jiang, Z.; Xia, K.; Xie, L.; Lu, S. Contrastive learning for sign language recognition and translation. In Proceedings of the IJCAI, Macao, China, 19–25 August 2023; pp. 763–772.
58. Hu, L.; Gao, L.; Liu, Z. Self-emphasizing network for continuous sign language recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 8–12 October 2023; Volume 37, pp. 854–862.
59. Kan, J.; Hu, K.; Hagenbuchner, M.; Tsoi, A.C.; Bennamoun, M.; Wang, Z. Sign language translation with hierarchical spatio-temporal graph neural network. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 3367–3376.
60. Lu, H.; Salah, A.A.; Poppe, R. Tcnet: Continuous sign language recognition from trajectories and correlated regions. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 3891–3899.
61. Zhou, H.; Zhou, W.; Zhou, Y.; Li, H. Spatial-temporal multi-cue network for sign language recognition and translation. IEEE Trans. Multimed. 2021, 24, 768–779.
62. Klakow, D.; Peters, J. Testing the correlation of word error rate and perplexity. Speech Commun. 2002, 38, 19–28.
63. Chen, Y.; Zuo, R.; Wei, F.; Wu, Y.; Liu, S.; Mak, B. Two-stream network for sign language recognition and translation. Adv. Neural Inf. Process. Syst. 2022, 35, 17043–17056.

Figure 1. Schematic of PGF-SLR. Input frames are first processed by an off-the-shelf skeleton extractor to obtain skeletal data. Next, these keypoints are used to construct a Fourier fully connected graph and fed into the graph Fourier learning module for spatiotemporal representation. Finally, a dual-branch action learning module is used to learn the representations of sign language actions.

Figure 2. Skeleton extraction and partition: (a) skeleton extraction from sign language frames using MMPose; (b) keypoints are anatomically partitioned into four topological graphs: GR (right-hand with 21 nodes), GL (left-hand with 21 nodes), GF (face with 26 nodes), and GB (torso with 9 nodes).

Figure 3. Framework of PGF-SLR. (a) The PGF-SLR consists of a Fourier fully connected graph construction module, a graph Fourier learning module, and a dual-branch action learning module, which includes a recognition branch and an auxiliary prediction branch. The Fourier fully connected graph is constructed in two steps: (b) skeletons are segmented into four parts and transformed to the Fourier space by the graph Fourier transform (GFT), and (c) part-level features of the frames within an input window are treated as nodes and the attention between them is treated as edges.

Figure 4. Graph Fourier learning module. (a) The module consists of a FourierGNN module and an adaptive frequency enhancement (AFE) module. (b) The FourierGNN is composed of multiple stacked Fourier graph operators (FGOs), and the output $Y_{t}^{G}$ is obtained by summing the iteratively multiplied intermediate outputs. (c) The adaptive frequency enhancement module contains a phase branch and an amplitude branch. The amplitude branch is adaptively enhanced via a $1 \times 1$ convolution-based MLP and then recombined with the phase to reconstruct the complex number, yielding the frequency-enhanced feature $Y_{t}^{GE}$.

Figure 5. Dual-branch action learning module. The module consists of two parallel branches: (a) the recognition branch in the upper part, and (b) the prediction branch in the lower part.

Figure 6. Implementation of the contextual learning module and the long-term sequence modeling module. The contextual learning module consists of a 1-D convolution with kernel size 5, a 1-D max-pooling with kernel size 2, and a second 1-D convolution with kernel size 5, supervised by $L_{CTC}$. The long-term sequence modeling module comprises a two-layer BiLSTM with a hidden size of 1024, supervised by $L_{seq}$.

Figure 7. Hyperparameter sensitivity analysis of the loss-function weights on the PHOENIX14 dataset: (a) illustrates the impact of varying the weight $\alpha_{1}$ on model recognition performance, while (b) depicts the impact of adjusting $\alpha_{2}$.

Figure 8. Hyperparameter sensitivity analysis of the effects of different window sizes: (a) demonstrates the performance changes with varying input window sizes $W_{in}$, while (b) illustrates the impact of different prediction window sizes $W_{pre}$.

Table 1. Comparison of PGF-SLR with baseline models on the PHOENIX14 and PHOENIX14-T datasets.

Methods	Backbone	PHOENIX14 dev del/ins (%)	PHOENIX14 dev WER (%)	PHOENIX14 test del/ins (%)	PHOENIX14 test WER (%)	PHOENIX14-T dev WER (%)	PHOENIX14-T test WER (%)
VAC [2]	ResNet18	7.9/2.5	21.2	8.4/2.6	22.3	-	-
SMKD [54]	ResNet18	6.8/2.5	20.8	6.3/2.3	21.0	20.8	22.4
TLP [55]	ResNet18	6.3/2.8	19.7	6.1/2.9	20.8	19.4	21.2
AdaBrowse [56]	ResNet18	6.0/2.5	19.6	5.9/2.6	20.7	19.5	20.6
Contrastive [57]	ResNet18	5.8/2.6	19.6	5.1/2.7	19.8	19.3	20.7
SEN [58]	ResNet18	5.8/2.6	19.5	7.3/4.0	21.0	19.3	20.7
HST-GNN [59]	ST-GCN	-	19.5	-	19.8	19.5	19.8
SignGraph [22]	Custom (GCN)	6.0/2.2	18.2	5.7/2.2	19.1	17.8	19.1
TCNet [60]	ResNet18	5.5/2.4	18.1	5.4/2.0	18.9	18.3	19.4
STMC [61]	VGG11	7.7/3.4	21.1	7.4/2.6	20.7	19.6	21.0
C2SLR [24]	ResNet18	6.8/3.0	20.5	7.1/2.5	20.4	20.2	20.4
CoSign-2s [7]	ST-GCN	-	19.7	-	20.1	19.5	20.1
TwoStream-SLR [63]	ST-GCN	-	18.4	-	18.8	17.7	19.3
MSKA [20]	ST-GCN	-	21.7	-	22.1	20.1	20.5
CoSign-1s [7]	ST-GCN	-	20.9	-	21.2	20.4	20.6
PGF-SLR	FourierGNN	4.4/2.7	17.5	4.2/2.9	18.2	17.3	17.7

Table 2. Comparison of PGF-SLR with baseline models on the CSL-Daily dataset.

Methods	dev WER (%)	test WER (%)
SEN [58]	31.1	30.7
TCNet [60]	29.7	29.3
SignGraph [22]	26.4	25.8
CoSign-2s [7]	28.1	27.2
MSKA [20]	28.2	27.8
CoSign-1s [7]	29.5	29.1
PGF-SLR	27.7	28.3

Table 3. Comparison of component effectiveness across the three datasets.

Group	DI	FGL	PRE	PHOENIX14 dev (%)	PHOENIX14 test (%)	PHOENIX14-T dev (%)	PHOENIX14-T test (%)	CSL-Daily dev (%)	CSL-Daily test (%)
baseline				20.9	21.2	20.4	20.6	29.5	29.1
a_1	✓			20.7	21.1	20.1	20.4	29.3	29.0
a_2		✓		18.1	19.1	17.7	18.0	28.6	29.0
a_3			✓	20.9	21.0	20.5	20.6	29.3	29.1
a_4	✓	✓		17.8	18.7	17.5	18.0	28.3	28.8
a_5	✓		✓	20.7	20.9	20.4	21.2	29.2	28.9
a_6		✓	✓	17.7	18.4	17.5	18.0	28.1	28.6
PGF-SLR	✓	✓	✓	17.5	18.2	17.3	17.7	27.7	28.3

Table 4. Online inference on the PHOENIX14 dataset.

Methods	Throughput (videos/s)	GFLOPs	WER (%)
AdaBrowse [56] (baseline)	15.84	175.0	20.8
PGF-SLR	25.72	11.5	18.5

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.
