Article

Enhancing Continuous Sign Language Recognition via Spatio-Temporal Multi-Scale Deformable Correlation

College of Computer and Information Science, Chongqing Normal University, Hu Xi Street, Chongqing 401331, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(1), 124; https://doi.org/10.3390/app16010124
Submission received: 25 November 2025 / Revised: 16 December 2025 / Accepted: 18 December 2025 / Published: 22 December 2025

Abstract

Deep learning-based sign language recognition plays a pivotal role in facilitating communication for the deaf community. Current approaches, while effective, often introduce redundant information and incur excessive computational overhead through global feature interactions. To address these limitations, this paper introduces a Deformable Correlation Network (DCA) designed for efficient temporal modeling in continuous sign language recognition. The DCA integrates a Deformable Correlation (DC) module that leverages spatio-temporal driven offsets to adjust the sampling range adaptively, thereby minimizing interference. Additionally, a multi-scale local sampling strategy, guided by motion priors, enhances temporal modeling capability while reducing computational costs. Furthermore, an attention-based Correlation Matrix Filter (CMF) is proposed to suppress interference elements by accounting for feature motion patterns. A long-term temporal enhancement module, based on spatial aggregation, efficiently leverages global temporal information to model the performer’s holistic limb motion trajectories. Extensive experiments on three benchmark datasets demonstrate significant performance improvements, with a reduction in Word Error Rate (WER) of up to 7.0% on the CE-CSL dataset, showcasing the superiority and competitive advantage of the proposed DCA algorithm.

1. Introduction

Sign language plays a vital role in daily communication within the deaf and mute community. However, due to limited popularization and inherent complexity, a significant communication gap inevitably exists between the general public and individuals with hearing impairments. To address this issue, researchers have proposed numerous sign language translation and recognition solutions, among which video-based deep learning approaches have shown remarkable promise [1,2,3,4]. Drawing inspiration from natural language processing techniques, current video-based sign language recognition methods predict multiple glosses to form coherent sentences, enabling a more structured framework for modeling sign language recognition.
Existing deep learning-based sign language recognition methods can be broadly categorized into two groups: single-frame analysis approaches and implicit temporal modeling methods. Single-frame methods [5,6,7] employ shared-parameter convolutional neural networks (CNNs) to process each video frame independently, which inherently limits inter-frame correlation during feature extraction, leading to a spatio-temporal fragmentation dilemma in sign language recognition models. Unsurprisingly, such frame-independent approaches fail to capture semantic information from body motion trajectories, resulting in suboptimal recognition accuracy.
With advancements in Convolutional Neural Networks and feature fusion techniques, many researchers have shifted their focus toward implicit temporal modeling using 3D convolutions or feature fusion. Relevant studies leverage 3D convolutions [8] or 2D convolutions [9] combined with feature fusion to construct local spatiotemporal contexts, thereby enhancing the model’s semantic understanding of body movements in videos. Additionally, some works employ temporal shift modules [10] and temporal convolutions [11] to achieve short-term multi-frame associations. However, due to the large number of input frames required for sign language recognition, 3D convolutions and other temporal modeling methods struggle to perform global analysis across the entire temporal dimension. Furthermore, convolution-based spatio-temporal fusion approaches rely heavily on data-driven assumptions while lacking sufficient prior knowledge, which hinders the interpretation of semantic information derived from motion changes in videos.
To mitigate these limitations, as shown in Figure 1a, CorrNet [12] was proposed to establish explicit correlation sampling via cross-attention mechanisms. CorrNet computes visual correlations between frames via matrix multiplication, with a channel-wise correlation matrix that captures global inter-frame dependencies. This indirectly describes pixel-level motion between frames, effectively uncovering action-related changes in videos. Additionally, CorrNet employs dilated grouped convolutions to reduce parallel computation overhead, expand the temporal receptive field, and enhance global temporal modeling capabilities. Despite its improved spatio-temporal context modeling, CorrNet inevitably incurs computational costs due to global correlations and suffers from performance degradation caused by misleading interference features in motion trajectory descriptions. Furthermore, spatio-temporal dilation operations in 3D CNNs could also miss crucial time intervals across global temporal sequences.
To solve the above problems, inspired by learnable deformable sampling techniques [13,14,15], this paper proposes a Deformable Correlation Network (DCA) for continuous sign language recognition, as shown in Figure 1b, effectively mitigating computational overhead and interference feature-induced performance decline. The deformable strategy is advantageous primarily because it employs adaptive sampling offsets within a smaller sampling range. This design efficiently captures rapid limb movements while maintaining low computational costs—a critical factor for modeling spatio-temporal context in sign language recognition. Without such adaptive offsets, fixed local sampling would inevitably miss the temporal semantic features inherent in extensive limb movements. The details of our proposed scheme are presented below.
First, we integrate deformable sampling into the correlation computation process to construct a deformable correlation (DC) module, which leverages inter-frame fusion to generate spatiotemporal context-driven offsets. This enables adaptive adjustment of the correlation sampling ranges. By employing input-dependent dynamic range adaptation, DC achieves reliable visual correlation modeling with a reduced sampling scope. Our in-depth analysis reveals that the reduced sampling area prevents each element from attending globally to the other frame; consequently, the initial sampling point placement becomes crucial for model performance. To address this, we investigated two approaches: uniform placement based on dilation operations and motion-prior-based placement. Experimental comparisons demonstrate the superior performance of the motion-prior approach. Moreover, the motion-prior method enables a further reduction of the sampling radius. By integrating multi-scale sampling results, it achieves reliable local-to-global inter-frame correlation modeling.
Secondly, to further suppress irrelevant information in the correlation matrix, we propose a correlation matrix filter (CMF) based on visual similarity distribution to generate masks for filtering out interference features. We employ spatial attention to model independent-element motion at individual points, while using channel attention to accomplish motion modeling within local spatial contexts. This dual approach enhances the attention mask’s capacity to characterize both reliable and unreliable motions, thereby enabling effective noise filtering.
Thirdly, to preserve complete temporal information while reducing computational costs and GPU memory consumption, we propose a spatial aggregation-based long-term temporal enhancement module. The generated global enhancement mask operates on feature-wise representations during the backbone network’s feature extraction stage. This design enables the model to efficiently leverage global temporal characteristics for performance improvement while avoiding information loss caused by temporal dilation.
Experimental results demonstrate that our proposed DCA-based continuous sign language recognition model achieves significant performance on three public datasets. Ablation studies confirm its superior temporal modeling capability and higher recognition accuracy compared to baseline methods.
Key innovations of our approach include the following:
(1) Spatio-temporal driven deformable correlation module: We generate offsets by leveraging temporal fusion between adjacent frames and establish multi-scale correlation sampling based on motion priors for efficient temporal modeling.
(2) Correlation matrix filter: an inter-frame feature motion-oriented hybrid attention mechanism, specifically designed for sign language recognition tasks, is proposed to generate an attention mask for the correlation matrix to effectively filter out interference information.
(3) Long-term temporal enhancement module: a spatially aggregated efficient global temporal correlation model is proposed to enhance long-term temporal modeling, thereby improving the accuracy of sign language recognition.
(4) Extensive ablation studies and comparative analyses validate the effectiveness of the proposed framework.

2. Related Work

2.1. Continuous Sign Language Recognition

The objective of Continuous Sign Language Recognition (CSLR) algorithms is to convert a performer’s sign language video into a gloss sequence, representing it in natural semantics that are more comprehensible to general audiences. In early approaches [16,17], manually designed feature descriptors were employed for feature extraction, while Hidden Markov Models (HMMs) were utilized for temporal modeling. However, these traditional methods heavily relied on expert-provided prior knowledge, and handcrafted feature extraction struggled to uncover more representative abstract representations, thereby limiting recognition performance.
With the rise of deep learning, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have been widely applied to computer vision and temporal modeling tasks, leading to further improvements in fields such as visual tracking [18,19] and action recognition [20,21]. Sign language recognition [22] has evolved to prioritize mining spatiotemporal contexts using neural networks, drawing inspiration from these tasks. By constructing end-to-end networks, backpropagation and Connectionist Temporal Classification (CTC) loss [23] functions are used to update network parameters in a data-driven manner for model optimization.
Current mainstream network architectures primarily employ 2D CNNs [2,5] to extract frame-wise features from individual video frames—a stage typically referred to as frame-wise extraction. Subsequently, 1D CNNs and LSTMs are applied to model temporal dependencies in the video sequence features [12,24], thereby capturing the performer’s motion trajectories and abstracting corresponding sign language information. Other works focused on improving the objective function by introducing alignment loss [24] and pseudo-labels to enhance learning effectiveness. Since CSLR is a typical sequential task, related work has sought to enhance its temporal modeling capabilities. For instance, TLP [25] leverages a temporal lift module to obtain more robust temporal features. SEN [26] introduces multi-scale spatial features to generate a spatial attention weighting mechanism. CorrNet [12] pioneers the incorporation of temporal modeling into the frame-wise feature extraction stage by adopting Transformer-based cross-attention mechanisms to establish global correlations among features within the temporal neighborhood. This approach introduces temporal dependencies at the early feature extraction stage, significantly improving sign language recognition performance. However, CorrNet’s global correlation mechanism introduces substantial irrelevant noise features and leads to a sharp increase in computational complexity. Therefore, mitigating these issues holds significant potential for advancing CSLR models.

2.2. Deformable Neural Net Module

The primary objective of deformable neural network modules is to break the constraints imposed by manually designed fixed feature sampling ranges. By adopting a posterior-driven approach, these modules learn input-dependent offsets during training, thereby enabling adaptive adjustment of the sampling scope.
Deformable Convolution (DCN) [13] introduced learnable deformation strategies for object detection in computer vision, enabling feature sampling to effectively capture irregularly shaped objects and thereby enhancing model recognition capability. Subsequent works, such as Deformable DETR [15], incorporated deformable mechanisms into the self-attention modules of Transformers. This approach avoids exhaustive global feature correlation, reducing computational overhead while improving convergence speed during training.
Later research extended deformable strategies [14,27,28] to temporal modeling tasks, leading to the development of deformable 3D convolution [27], which enables adaptive sampling along the temporal dimension. However, the parameter size and computational complexity of 3D convolutions make them impractical for processing long temporal sequences, particularly in sign language recognition, where a single video often contains over 100 frames. Introducing multiple 3D convolutions would incur prohibitive computational costs.
To address this, we adopt CorrNet’s temporal association strategy at the frame-wise stage, which focuses solely on modeling correlations within the local temporal neighborhood. However, CorrNet employs a global correlation mechanism akin to cross-attention, which tends to introduce excessive irrelevant noise. Inspired by advanced deformable strategies [13,14,15,28], we mitigate this issue by reducing the number of sampling points and enabling dynamic sampling ranges via learnable offsets.

3. Methods

In this section, we elaborate on the proposed Continuous Sign Language Recognition (CSLR) framework based on the deformable correlation network, as illustrated in Figure 2. According to the feature dimension processing approach, the overall architecture is mainly divided into two parts: frame-wise feature extraction and gloss-wise feature extraction. (1) In the frame-wise part, consecutive video frames are fed into a ResNet-based backbone to complete video feature extraction. During the backbone stage, we not only focus on single-frame visual features but also introduce a multi-scale deformable correlation module to construct a reliable and efficient spatiotemporal context. In addition, an attention-based correlation matrix filter (CMF) is introduced to suppress unreliable correspondences, further enhancing the reliability of inter-frame associations. The features are then fused with the output of the long-term temporal enhanced module to derive the frame-wise features at each stage in the backbone. (2) In the gloss-wise part, we employ temporal modeling based on 1D convolutions and a Bi-LSTM to obtain the final gloss-wise features, which are fed into a classifier to produce the final sign language recognition results.

3.1. Method Overview and Motivation

Overview. Given a continuous video frame sequence $\{x\} \in \mathbb{R}^{T \times 3 \times H_i \times W_i}$ ($H_i$, $W_i$ are the height and width of the original images) of length $T$, the primary objective of a deep CSLR model is to feed $\{x\}$ into a neural network to obtain a sequence of glosses $y = \{y_i\}_{i=1}^{N}$ that describes a natural language sentence, where $N$ denotes the length of the gloss sequence. Specifically, CSLR first employs a ResNet-based backbone for feature extraction, producing frame-wise features $f_v \in \mathbb{R}^{T \times C_1 \times H \times W}$. Subsequently, $f_v$ is passed into a temporal modeling combination, consisting of 1D convolutions and a Bi-LSTM, to perform long-term temporal modeling while compressing the temporal dimension ($T \rightarrow N$). Finally, a fully connected network (FCN) layer-based classifier is used to output the predicted gloss probability distribution. During training, the entire network's parameters are optimized via backpropagation using the Connectionist Temporal Classification (CTC) loss. During inference, the predicted probability distribution is decoded to produce the gloss sequence representing the natural language sentence.
Motivation. To improve spatio-temporal enhancement in the feature-wise extraction stages, CorrNet introduced a correlation module at each stage of the backbone to establish inter-frame relationships. Specifically, every element in the visual features at time $t$ computes dot products with all elements in the visual features at time $t-1$ or $t+1$. Assuming there are $T$ frames, the total number of global correlation computations is $2 \times T \times C \times (H \times W)^2$. If $T \times C$ is treated as a constant significantly smaller than $H \times W$, the final time complexity becomes $O((H \times W)^2)$. In practice, the limb movements that require attention in videos typically exhibit spatio-temporal locality. Consequently, employing global correlations not only incurs excessive computational overhead but also introduces substantial irrelevant noise. Instead, effective correlation modeling can be achieved by focusing solely on spatio-temporally continuous local regions. Assuming the local region has a size of $(2r+1)^2$ (where $r$ is the sampling radius, which is significantly smaller than both $H$ and $W$), the total computation required becomes $2 \times C \times T \times H \times W \times (2r+1)^2$, with a computational complexity of $O(H \times W)$. Therefore, this paper replaces the original global receptive field with a smaller sampling range and achieves efficient temporal modeling by introducing adaptive deformable operations.
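To make the two estimates above concrete, the short Python sketch below evaluates both operation counts for illustrative feature sizes; the specific values of T, C, H, W, and r are assumptions chosen for illustration rather than the exact configuration of our backbone.

```python
# Rough operation counts for global vs. local correlation sampling (illustrative sizes).
T, C, H, W = 100, 64, 28, 28   # assumed frame count and feature map size
r = 2                          # local sampling radius

global_ops = 2 * T * C * (H * W) ** 2              # every element vs. all H*W positions
local_ops = 2 * T * C * H * W * (2 * r + 1) ** 2   # every element vs. a (2r+1)^2 window

print(f"global: {global_ops:.2e}, local: {local_ops:.2e}, "
      f"ratio: {global_ops / local_ops:.0f}x")      # ratio = H*W / (2r+1)^2, about 31x here
```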

3.2. Deformable Correlation Module Based on Motion Prior

Given a sequence of video frames, the deep CSLR model first employs a backbone network to extract feature-wise representations. In the ResNet-based backbone, there are four stages of feature downsampling. In the first stage, the relatively high resolution would lead to excessive computational overhead if correlation computation were used for inter-frame modeling. Therefore, CorrNet omits the correlation module at this stage. In contrast, the proposed deformable correlation module is efficient enough to be deployed at every stage (as depicted in Figure 3) for temporal modeling. Furthermore, the deformable correlation module can adaptively expand the sampling range and adjust the attended regions even under restricted sampling conditions, thereby achieving superior spatio-temporal context modeling for describing the real limb movement between adjacent frames. We elaborate on the motion-prior-based Deformable Correlation Module in two parts below.
(1) Spatio-temporal driven deformable sampling. Assuming the current stage yields a frame-sequence feature denoted as $f_s \in \mathbb{R}^{T \times C_1 \times H \times W}$, when fed into the correlation module, it first undergoes a 3D convolution to capture local spatio-temporal relationships while performing channel-wise downsampling, resulting in $f_t^l \in \mathbb{R}^{T \times C \times H \times W}$. To establish temporal dependencies within the neighborhood of time step $t$, we temporally shift $f_t^l$ leftward to obtain $f_{t-1}^l \in \mathbb{R}^{T \times C \times H \times W}$ and rightward to obtain $f_{t+1}^l \in \mathbb{R}^{T \times C \times H \times W}$. During left shifting, features at time $t$ are replaced by those at $t-1$, whereas during right shifting, features at time $t$ are replaced by those at $t+1$. For the convenience of subsequent explanation, we only analyze the correlation calculation of the feature $f_t \in \mathbb{R}^{C \times H \times W}$ at time $t$. Given a sampling range of coordinates, the correlation sampling between $f_t$ and $f_{t+1}$ can be computed as follows:
$$M_c(x, y, x_1, y_1) = \frac{1}{C} \sum_{c=1}^{C} \left( f_t^c(x, y) \cdot f_{t+1}^c(x_1, y_1) \right), \quad (x, y) \in R_1,\ (x_1, y_1) \in R_2$$
where $(x, y)$ represents the coordinates of an element in $f_t$, $R_1 \in \mathbb{R}^{H \times W \times 2}$ represents the coordinate set of $f_t$, $(x_1, y_1)$ denotes coordinates in $f_{t+1}$, and $M_c \in \mathbb{R}^{H \times W \times (2r+1) \times (2r+1)}$ is the objective correlation matrix; the dot product "$\cdot$" is used to compute the visual similarities stored in $M_c$. During correlation sampling, $x_1$ and $y_1$ are constrained within the prior-specified range $R_2 \in \mathbb{R}^{(2r+1) \times (2r+1) \times 2}$. In CorrNet, the range $R_2$ covers the entire feature plane of $f_{t+1}$. In our proposed method, we reduce the sampling area to $1/4$ of the original size to eliminate redundant correlation sampling. To ensure unbiased sampling after area reduction, we employ dilation operations (as shown in Figure 3b) to uniformly distribute sampling points across the feature plane, thereby obtaining the initial reference sampling coordinates $\mathrm{coord}_i \in \mathbb{R}^{(2r+1) \times (2r+1) \times 2}$.
Since correlation computation is a binary operation, we design an offset mechanism dependent on both frames, termed spatio-temporal driven deformation. By concatenating $f_t$ and $f_{t+1}$ along the channel dimension and processing them through convolutional layers, we obtain learnable offsets $\mathrm{coord}_{ofs} \in \mathbb{R}^{(2r+1) \times (2r+1) \times 2}$. These offsets are then added to $\mathrm{coord}_i$ to derive the final sampling coordinates. The spatio-temporal driven deformable correlation sampling can ultimately be calculated using the following formula:
$$\mathrm{coord}_{ofs} = \mathrm{Conv}(\mathrm{Concat}(f_t, f_{t+1}))$$
$$M_c(x, y, x_1, y_1) = \frac{1}{C} \sum_{c=1}^{C} \left( f_t^c(x, y) \cdot f_{t+1}^c(x_1, y_1) \right), \quad (x_1, y_1) \in (\mathrm{coord}_{ofs} + \mathrm{coord}_i)$$
Through the aforementioned deformable sampling operation, we obtain a much smaller correlation volume while greatly reducing computational complexity. The above discussion only addresses deformable sampling at one point in $f_t$; for the entire $f_t$, the learnable offset tensor has shape $\mathbb{R}^{H \times W \times (2r+1) \times (2r+1) \times 2}$.
(2) Multi-scale Correlation Sampling based on Motion Prior. In the deformable correlation sampling designed above, we employ a uniform distribution to set the sampling locations of each element of $f_t$ within $f_{t+1}$. Although this unbiased initialization yields a certain performance improvement for sign language recognition models, it violates the spatial displacement locality of each feature element in the spatio-temporal domain. Furthermore, although the current scheme reduces the number of sampling points through the designed deformable strategy, some degree of sampling redundancy remains. Given that sign language scenarios typically involve performers using relatively slow limb movements to convey gestures, we propose adopting a more localized sampling range to model motion features. However, in certain scenarios, sign language gestures involve significant body displacements, which necessitate an expanded sampling range for the correlation volume. To address this, we downsample the features into multiple scales and construct multi-scale correlation volumes, thereby effectively modeling body motion over local-to-global ranges.
Based on this insight, we present the Multi-scale Correlation Sampling based on the Motion Prior algorithm. The complete algorithm consists of two key steps:
(1) Motion Prior-based Sampling Point Configuration: Unlike previous approaches that adopt uniformly distributed sampling anchors, we propose to initialize the anchor points as the element coordinates $P_t$ of the feature plane $f_t$, guided by motion priors. This design is theoretically justified by the spatial locality of motion between consecutive frames in scenarios with limited movement magnitude. After anchor point determination, we further reduce the sampling radius $r$ to 2. Consequently, each point in the left feature map $f_t$ only needs to sample 25 feature points from the right feature map ($f_{t+1}$ or $f_{t-1}$), where 25 is significantly smaller than $(H/2) \times (W/2)$ in our implementation.
(2) Multi-Scale Correlation Handling: To address potential large-motion scenarios in sign language while maintaining sampling efficiency, we retain the original sampling radius but strategically reduce the resolution of the right feature map. This achieves an effective increase in the receptive field without expanding the sampling range. As illustrated in Figure 4, we perform correlation sampling on right features at three distinct scales: $\{1, 1/2, 1/4\}$ of the original resolution. The resulting correlation volumes from all scales are concatenated to form the final correlation matrix $M_c$.
It is noteworthy that the proposed multi-scale motion-prior-based scheme constructs a local-to-global correlation tensor, which further suppresses sampling redundancy through spatiotemporal locality priors, thereby reducing the overall size of the final correlation matrix $M_c$ from $\mathbb{R}^{H \times W \times H/2 \times W/2}$ to $\mathbb{R}^{H \times W \times 3 \times 5 \times 5}$. Moreover, we only need to predict correspondingly smaller sampling offsets $\mathrm{coord}_{ofs} \in \mathbb{R}^{H \times W \times 3 \times 5 \times 5 \times 2}$ to achieve deformable correlation sampling.
The motion-prior-based deformable correlation sampling cannot be directly implemented using off-the-shelf PyTorch 1.10 APIs. Therefore, we developed the corresponding custom operators using PyTorch’s CUDA extension, with a simplified pseudocode provided in Algorithm 1.
Algorithm 1: Deformable correlation sampling kernel (pseudo code)
[The pseudocode of the deformable correlation sampling kernel is provided as a figure in the original article.]
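As a runnable reference, the following is a minimal PyTorch sketch of the same idea built on bilinear sampling with F.grid_sample; it is an illustrative approximation of the custom CUDA kernel rather than the actual implementation, and the module name, offset-convolution kernel size, default radius, and scale set are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableLocalCorr(nn.Module):
    # Illustrative sketch of motion-prior multi-scale deformable correlation
    # (approximates the custom CUDA kernel with F.grid_sample; not the exact implementation).
    def __init__(self, channels, r=2, scales=(1.0, 0.5, 0.25)):
        super().__init__()
        self.r, self.scales = r, scales
        self.k = (2 * r + 1) ** 2
        # Spatio-temporal driven offsets: one (dx, dy) per sampling point and per scale,
        # predicted from the channel-wise concatenation of the two frames.
        self.offset_conv = nn.Conv2d(2 * channels, len(scales) * self.k * 2, 3, padding=1)

    def forward(self, f_t, f_nb):
        # f_t, f_nb: (B, C, H, W) features at time t and at a neighboring frame (t-1 or t+1).
        B, C, H, W = f_t.shape
        k, r = self.k, self.r
        off = self.offset_conv(torch.cat([f_t, f_nb], dim=1))
        off = off.view(B, len(self.scales), k, 2, H, W)

        # Motion prior: anchor each local window at the element's own (x, y) coordinate.
        ys, xs = torch.meshgrid(torch.arange(H, device=f_t.device),
                                torch.arange(W, device=f_t.device), indexing="ij")
        anchor = torch.stack([xs, ys], dim=0).float()                          # (2, H, W)
        dy, dx = torch.meshgrid(torch.arange(-r, r + 1, device=f_t.device),
                                torch.arange(-r, r + 1, device=f_t.device), indexing="ij")
        window = torch.stack([dx.reshape(-1), dy.reshape(-1)], dim=1).float()  # (k, 2)

        corrs = []
        for s_idx, s in enumerate(self.scales):
            f_s = f_nb if s == 1.0 else F.interpolate(f_nb, scale_factor=s,
                                                      mode="bilinear", align_corners=False)
            Hs, Ws = f_s.shape[-2:]
            # Final coordinates = scaled anchor + fixed (2r+1)^2 window + learned offsets.
            coords = (anchor.reshape(1, 1, 2, H * W) * s
                      + window.reshape(1, k, 2, 1)
                      + off[:, s_idx].reshape(B, k, 2, H * W))                 # (B, k, 2, H*W)
            grid = torch.stack([coords[:, :, 0] / max(Ws - 1, 1) * 2 - 1,      # normalise x
                                coords[:, :, 1] / max(Hs - 1, 1) * 2 - 1],     # normalise y
                               dim=-1)                                          # (B, k, H*W, 2)
            sampled = F.grid_sample(f_s, grid, mode="bilinear", align_corners=True)  # (B, C, k, H*W)
            # Channel-averaged dot product against the query feature at each position.
            corr = (f_t.reshape(B, C, 1, H * W) * sampled).mean(dim=1)         # (B, k, H*W)
            corrs.append(corr.reshape(B, k, H, W))
        return torch.cat(corrs, dim=1)    # (B, 3*(2r+1)^2, H, W) correlation matrix M_c
```

In the full module, such a sampler would be invoked twice per time step, once against the left-shifted and once against the right-shifted feature sequence, to produce the two neighborhood correlation matrices used later.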

3.3. Correlation Matrix Filter

The correlation matrix is computed through correlation sampling, and CorrNet applies it to $f_t$ to achieve a spatio-temporal affine transformation. However, during the sampling process, the results may be affected by visually similar interfering features, making the sign language recognition model prone to semantic understanding errors. Therefore, we aim to assign smaller weights to these regions, making the attention mechanism a natural solution to this problem. Upon closer examination of the correlation matrix, we observe that its dimensionality can be transformed from $M_c \in \mathbb{R}^{H \times W \times 3 \times (2r+1) \times (2r+1)}$ to $M_c \in \mathbb{R}^{3(2r+1)(2r+1) \times H \times W}$ through dimension reshaping and permutation. Given an index $(x, y)$ along the $H$ and $W$ dimensions of the correlation matrix, the elements stored across the channels ($C = 3(2r+1)(2r+1)$) at position $(x, y)$ in $M_c$ represent the local sampling of $f_t$ at $(x, y)$ on $f_{t+1}$. On this basis, we design a correlation matrix filter for sign language recognition models following the spatial-channel attention paradigm of CBAM. It is noteworthy that this paper provides an interpretability analysis of the proposed correlation matrix filter and introduces specialized improvements tailored to the modeling characteristics of sign language recognition. The overall workflow of the correlation matrix filter is shown in Figure 5.
(1) Independent motion modeling (spatial attention). Based on the aforementioned analysis, each element of $M_c$ stores a local sampling result of an element in $f_t$. If the sampling result is reliable, it indicates that the sampling range exhibits generally high semantic similarity with potentially smaller distribution variance. Therefore, compared to traditional spatial attention mechanisms, we additionally introduce a meaningful variance descriptor for the channel distributions. Through spatial attention, unreliable motion modeling between frames can be independently suppressed at each feature point. The specific attention generation formula is as follows:
$$\mathrm{Attn}_s = \alpha_1 \mathrm{Conv}_{Avg}\big(F_{Avg}(M_c)\big) + \alpha_2 \mathrm{Conv}_{Max}\big(F_{Max}(M_c)\big) + \alpha_3 \mathrm{Conv}_{Var}\big(F_{Var}(M_c)\big)$$
$$\mathrm{Attn}_s = \mathrm{Sigmoid}(\mathrm{Attn}_s)$$
where the variance $F_{Var}(\cdot)$, average $F_{Avg}(\cdot)$, and maximum $F_{Max}(\cdot)$ descriptors are introduced to comprehensively describe the feature distribution of the local sampling for modeling independent element motion, and $\alpha_1$, $\alpha_2$, and $\alpha_3$ are learnable weights.
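A minimal sketch of this spatial attention is shown below; the 7×7 convolution kernel and the use of a separate single-channel convolution per descriptor are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CorrSpatialAttention(nn.Module):
    # Sketch of the independent-motion spatial attention over the correlation matrix (illustrative).
    def __init__(self, kernel_size=7):
        super().__init__()
        pad = kernel_size // 2
        # One single-channel convolution per descriptor (avg / max / var); kernel size is an assumption.
        self.conv_avg = nn.Conv2d(1, 1, kernel_size, padding=pad)
        self.conv_max = nn.Conv2d(1, 1, kernel_size, padding=pad)
        self.conv_var = nn.Conv2d(1, 1, kernel_size, padding=pad)
        self.alpha = nn.Parameter(torch.ones(3))   # learnable alpha_1, alpha_2, alpha_3

    def forward(self, m_c):
        # m_c: (B, C', H, W) with C' = 3*(2r+1)^2 local correlation values per element.
        avg = m_c.mean(dim=1, keepdim=True)
        mx = m_c.max(dim=1, keepdim=True).values
        var = m_c.var(dim=1, keepdim=True, unbiased=False)
        attn = (self.alpha[0] * self.conv_avg(avg)
                + self.alpha[1] * self.conv_max(mx)
                + self.alpha[2] * self.conv_var(var))
        return torch.sigmoid(attn)                  # (B, 1, H, W) per-point reliability mask
```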
(2) Multi-scale neighborhoods motion modeling (channel attention). Since spatial attention only describes the motion reliability of one single element in f t , it loses spatial contextual information. However, in real-world sign language articulation, local spatial contexts often exhibit consistent motion trends. Therefore, by jointly modeling the motion patterns of all feature points within a local neighborhood, we could more effectively identify high-confidence regions in the correlation matrix.
The original channel attention mechanism primarily focuses on the global feature distribution across the entire spatial domain, which does not align well with the characteristics of sign language recognition. In contrast, features within local neighborhoods are more likely to share similar motion patterns. Hence, we employ convolution-based downsampling to generate multi-subregion channel attention maps. These maps are then upsampled via differentiable bilinear interpolation, effectively performing a broadcasting-like operation to align with the resolution of the original $f_t$ features. The multi-subregion channel attention maps are computed as follows:
$$\mathrm{Attn}_{cs} = \mathrm{Sigmoid}\big(F_{intp}(\mathrm{Conv}_s(M_c))\big), \quad \mathrm{Attn}_{cm} = \mathrm{Sigmoid}\big(F_{intp}(\mathrm{Conv}_m(M_c))\big)$$
where $F_{intp}$ denotes bilinear interpolation, and $\mathrm{Conv}_s$ and $\mathrm{Conv}_m$ downsample the features in a convolutional manner to 1/4 and 1/16 of the original resolution, respectively. After obtaining both the spatial and channel attention weights, the enhanced correlation matrix is ultimately computed using the following formula:
$$M_c = \big(\beta_1 \mathrm{Attn}_{cs} M_c + \beta_2 \mathrm{Attn}_{cm} M_c\big) \mathrm{Attn}_s$$
where $\beta_1$ and $\beta_2$ are learnable weights.
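The following sketch illustrates the multi-subregion channel attention together with the final CMF fusion; the strided-convolution downsampling factors and layer names are assumptions, and the products with the attention maps are realized as element-wise (broadcast) multiplications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrChannelFilter(nn.Module):
    # Sketch of multi-subregion channel attention plus the CMF fusion (illustrative only).
    def __init__(self, channels):
        super().__init__()
        # Strided convolutions yield subregion-wise channel attention at coarser resolutions
        # (the exact downsampling factors are assumptions).
        self.conv_s = nn.Conv2d(channels, channels, kernel_size=2, stride=2)  # 1/4 of the pixels
        self.conv_m = nn.Conv2d(channels, channels, kernel_size=4, stride=4)  # 1/16 of the pixels
        self.beta = nn.Parameter(torch.ones(2))                                # learnable beta_1, beta_2

    def forward(self, m_c, attn_s):
        # m_c: (B, C', H, W) correlation matrix; attn_s: (B, 1, H, W) spatial reliability mask.
        H, W = m_c.shape[-2:]
        up = lambda x: F.interpolate(x, size=(H, W), mode="bilinear", align_corners=False)
        attn_cs = torch.sigmoid(up(self.conv_s(m_c)))    # finer subregion channel attention
        attn_cm = torch.sigmoid(up(self.conv_m(m_c)))    # coarser subregion channel attention
        # Fuse the two channel attentions and apply the spatial mask (all element-wise).
        return (self.beta[0] * attn_cs * m_c + self.beta[1] * attn_cm * m_c) * attn_s
```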

3.4. Temporal Affine Transformation

After processing through the correlation matrix filter, we obtain an attention-enhanced correlation matrix. Since time step $t$ simultaneously establishes temporal associations with both the $t-1$ and $t+1$ time steps, this ultimately yields two correlation matrices within the temporal neighborhood of time step $t$, which we denote as $M_c^l$ and $M_c^r$. To achieve temporal enhancement at the frame-wise stage, we perform a temporal affine transformation by multiplying $M_c^l$ and $M_c^r$ with the feature $f_t$ through matrix multiplication. The specific computation is as follows:
$$f_t' = F_{mul}(M_c^l, f_t)\,\alpha_1 + F_{mul}(M_c^r, f_t)\,\alpha_2$$
where $F_{mul}(\cdot)$ denotes matrix multiplication, $f_t'$ is the enhanced frame-wise feature, and $\alpha_1$ and $\alpha_2$ are weights for temporal feature selection. Through the methods proposed above, spatio-temporal enhancement is accomplished within the frame-wise feature extraction stages.

3.5. Long-Term Temporal Enhanced Module Based on Spatial Aggregation

The deformable correlation sampling method employed in previous work can only focus on motion features between adjacent frames, which is insufficient to fully characterize the overall limb movement trajectory in sign language. Given that sign language videos typically contain hundreds of frames, directly applying 3D convolution would incur excessive computational overhead and GPU memory usage, whereas using sparse 3D convolution may miss critical temporal segments. To address this, we propose a long-term enhanced module based on spatial aggregation, as shown in Figure 6. In contrast to the aforementioned deformable correlation module, which can only process spatio-temporal information and body trajectories of adjacent frames, this module extends spatio-temporal context modeling to capture sign language feature representations of the entire video sequence.
(1) Global Temporal Embedding based on spatial aggregation. First, two convolutional operations with different receptive fields are applied to the original entire-video features $f_t$ to enhance spatial perception diversity. To accommodate the convolutional operations, tensor shapes and dimensional arrangements require adjustment before and after processing. Subsequently, spatial distribution means are computed separately for the output tensors from the different receptive-field convolutions to achieve spatial feature aggregation. These aggregated features are then concatenated to form a global temporal embedding, which is computed as follows:
$$f_{e1}, f_{e2} = F_{RP}\big(\mathrm{Conv2d}_1(F_{PR}(f_t)),\ \mathrm{Conv2d}_2(F_{PR}(f_t))\big)$$
$$f_{gte} = \mathrm{Concat}\big(F_{Max}(f_{e1}),\ F_{Max}(f_{e2})\big)$$
where $F_{PR}$ denotes the combined reshape-and-permutation operation, $F_{RP}$ is its inverse operation, and $f_{gte}$ is the global temporal embedding.
(2) Long-term Temporal Enhancement. After obtaining the global temporal embedding $f_{gte}$, this paper establishes global temporal correlations through a self-attention mechanism. Since spatial aggregation has been completed, the tensor shape of the global temporal embedding is $(B, C, T)$, where the temporal dimension $T$ can be regarded as equivalent to the text token length in NLP tasks. Accordingly, we designed analogous query, key, and value mapping layers based on 1D convolutions. Subsequently, matrix multiplication between the query and the key yields a $T \times T$ global temporal correlation matrix, which is normalized via softmax. This normalized matrix is then multiplied by the value to generate a temporal enhancement mask $M_s$. Finally, $M_s$ is added to $f_t$, thereby accomplishing long-term temporal enhancement, which can be formulated as follows:
$$Q, K, V = \mathrm{Conv}_Q(f_{gte}),\ \mathrm{Conv}_K(f_{gte}),\ \mathrm{Conv}_V(f_{gte})$$
$$M_s = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$
$$f_t = \mathrm{Broadcast}(M_s) + f_t$$
where $\mathrm{Broadcast}(\cdot)$ is the broadcast operation that aligns $M_s$ with $f_t$.
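As an illustration, a compact PyTorch sketch of this spatial-aggregation-plus-temporal-attention design is given below; the convolution kernel sizes, the max-based spatial aggregation, the projection width, and the (B, T, C, H, W) tensor layout are assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn

class LongTermTemporalEnhance(nn.Module):
    # Sketch of the spatial-aggregation long-term temporal enhancement (illustrative only).
    def __init__(self, channels, dim=256):
        super().__init__()
        # Two 2D convolutions with different receptive fields for spatial perception diversity.
        self.conv_a = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv_b = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
        # 1D query / key / value projections over the temporal dimension.
        self.q = nn.Conv1d(2 * channels, dim, 1)
        self.k = nn.Conv1d(2 * channels, dim, 1)
        self.v = nn.Conv1d(2 * channels, channels, 1)

    def forward(self, f):
        # f: (B, T, C, H, W) frame-wise features of the whole video.
        B, T, C, H, W = f.shape
        x = f.reshape(B * T, C, H, W)                       # reshape/permute so 2D convs act per frame
        a = self.conv_a(x).amax(dim=(-2, -1))               # spatial aggregation -> (B*T, C)
        b = self.conv_b(x).amax(dim=(-2, -1))
        gte = torch.cat([a, b], dim=1).reshape(B, T, 2 * C).transpose(1, 2)   # embedding (B, 2C, T)

        q, k, v = self.q(gte), self.k(gte), self.v(gte)     # (B, dim, T), (B, dim, T), (B, C, T)
        attn = torch.softmax(q.transpose(1, 2) @ k / (q.shape[1] ** 0.5), dim=-1)   # (B, T, T)
        m_s = (v @ attn.transpose(1, 2)).transpose(1, 2)    # temporal enhancement mask (B, T, C)
        # Broadcast the per-frame mask over the spatial dimensions and add it to the features.
        return f + m_s.reshape(B, T, C, 1, 1)
```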
During the feature-wise extraction phase, the three proposed spatiotemporal modeling schemes provide robust upstream support for the subsequent mining of reliable spatiotemporal contexts and sign language representation cues, ultimately ensuring that the sign language recognition model generates accurate natural language descriptions corresponding to the input videos.

4. Experiments

4.1. Datasets

PHOENIX14 [29]. The PHOENIX14 dataset was collected from German weather forecast programs, with each video sequence consisting of a sign language presenter and a clean background. The image sequences in the videos have a resolution of 210 × 260. The entire dataset contains 6841 sentences, comprising a total of 1295 sign language vocabulary items. For ease of training and testing, PHOENIX14 divides the dataset into 5672 training samples, 540 validation samples, and 629 test samples.
PHOENIX14-T [30]. The PHOENIX14-T dataset is an extension of the PHOENIX corpus, containing sign language recognition videos, sign-gloss annotations, and German translations. PHOENIX14-T includes 1085 sign language vocabulary items, totaling 8247 sentences. The dataset is partitioned into 7096 training samples, 519 validation samples, and 642 test samples.
CE-CSL [31]. The CE-CSL dataset features 12 sign language performers, including 8 females and 4 males. Among them, two are hearing-impaired individuals who primarily use sign language for daily communication, while the remaining performers are professional sign language interpreters, ensuring the dataset’s expertise and diversity. CE-CSL divides the entire dataset into 4973 training videos, 515 validation videos, and 500 test videos. These videos cover 3515 Chinese vocabulary items, encompassing a wide range of daily communication phrases. Notably, like PHOENIX14, it is a publicly available dataset, contributing to the sign language recognition research community.

4.2. Training Details

Network Details. The proposed Deformable Correlation Network (DCA) employs ResNet18 as the 2D CNN backbone for the frame-wise stage, initialized with ImageNet pre-trained weights to accelerate training and convergence. In the gloss-wise stage, DCA adopts the temporal dimension reduction network. This temporal network primarily consists of 1D convolutional layers, structured as $\{K5, P2, K5, P2\}$, where $K\sigma$ and $P\sigma$ denote a convolutional layer with kernel size $\sigma$ and a pooling layer, respectively. After temporal modeling, the frame-wise features are compressed into the specified gloss-wise dimension. Following the 1D CNN processing, a Bi-LSTM with a hidden size of 1024 is applied for long-term temporal modeling. Finally, the classification layer outputs the predicted sentence corresponding to the sign language video.
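A minimal sketch of this gloss-wise stage is given below; only the $\{K5, P2, K5, P2\}$ 1D-convolution structure and the Bi-LSTM hidden size of 1024 follow the description above, while the input feature width, the number of LSTM layers, and the vocabulary size are assumptions.

```python
import torch
import torch.nn as nn

class GlossWiseHead(nn.Module):
    # Sketch of the gloss-wise stage: {K5, P2, K5, P2} 1D convs, Bi-LSTM, classifier (illustrative).
    def __init__(self, in_dim=512, hidden=1024, num_gloss=1296):   # widths and vocab are assumptions
        super().__init__()
        self.tconv = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=5), nn.ReLU(), nn.MaxPool1d(2),   # K5, P2
            nn.Conv1d(hidden, hidden, kernel_size=5), nn.ReLU(), nn.MaxPool1d(2),   # K5, P2
        )
        self.bilstm = nn.LSTM(hidden, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_gloss)          # gloss logits (incl. CTC blank)

    def forward(self, frame_feats):
        # frame_feats: (B, T, C) spatially pooled frame-wise features from the backbone.
        x = self.tconv(frame_feats.transpose(1, 2)).transpose(1, 2)  # (B, T', hidden)
        x, _ = self.bilstm(x)
        return self.classifier(x)                                     # (B, T', num_gloss)
```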
Training Configuration. The model is trained for 50 epochs, with an initial learning rate of $1 \times 10^{-4}$. Adam is selected as the optimizer, with a weight decay of $1 \times 10^{-5}$. Furthermore, a MultiStepLR scheduler is used to adjust the learning rate. For input data preprocessing, videos are first resized to $256 \times 256$, followed by random cropping to $224 \times 224$. The loss function follows CorrNet, combining the VE loss and VA loss with weights of 1.0 and 25.0, respectively. The entire model is trained and evaluated on a single NVIDIA RTX 3090 GPU.
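The corresponding optimizer and scheduler setup can be sketched as follows; the MultiStepLR milestones and decay factor are assumptions, as they are not specified above, and the model object is a stand-in.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)   # stand-in for the full DCA model (illustrative only)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
# Milestones and gamma below are assumptions; only the scheduler type is stated in the text.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 40], gamma=0.2)

for epoch in range(50):
    # ... one training epoch over the sign language videos (CTC plus VE/VA losses) ...
    scheduler.step()
```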

4.3. Evaluation Metric

We employ the Word Error Rate (WER), a well-established evaluation metric in sign language recognition, to assess model performance. WER measures recognition accuracy by calculating the minimum number of operations (deletions, insertions, and substitutions) required to align the predicted sentence with the ground truth. The specific computation formula is as follows:
$$\mathrm{WER} = \frac{\#\mathrm{sub} + \#\mathrm{ins} + \#\mathrm{del}}{\#\mathrm{reference}}$$
Note that a lower WER indicates higher recognition accuracy.
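As a concrete reference, the following is a minimal Python sketch of this edit-distance computation over gloss sequences; the example glosses are illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """Compute WER between a reference and a predicted gloss sequence (lists of glosses)."""
    R, H = len(reference), len(hypothesis)
    # d[i][j] = minimum edits turning the first i reference glosses into the first j predicted ones.
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i                       # deletions
    for j in range(H + 1):
        d[0][j] = j                       # insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[R][H] / max(R, 1)

# Example (illustrative glosses): one substitution and one deletion over a 4-gloss reference -> 0.5
print(word_error_rate(["HEUTE", "NORD", "REGEN", "SIL"], ["HEUTE", "SUED", "REGEN"]))
```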

4.4. Ablation Study

We conduct ablation studies of the proposed model on both the PHOENIX14 and CE-CSL datasets. To thoroughly validate the effectiveness of the proposed motion-prior-based Deformable Correlation (DC) module, Correlation Matrix Filter (CMF), and Long-term Temporal Enhanced (LTE) module, comprehensive ablation analyses were performed on the validation and test sets of both datasets. Additionally, detailed performance analyses are conducted on the test sets to evaluate specific configurations of the DC and CMF modules.
(1) Overall ablation analysis
As shown in Table 1, by incorporating the deformable correlation (DC) module with a sampling range of $H/2 \times W/2$ and dilation operations, our DCA achieves a 4.2% reduction in WER on the PHOENIX14 validation set and a 3.6% reduction on its test set compared to the baseline. To verify DCA's performance improvement in Chinese contexts, we conducted specific experiments on the CE-CSL validation and test sets, obtaining WER reductions of 7.0% and 6.7%, respectively. This improvement stems from DCA's ability to suppress irrelevant information introduced by global correlations during frame-wise feature extraction while achieving precise inter-frame correlations through deformable sampling, thereby enhancing temporal modeling capability and overall sign language recognition performance.
By integrating the CMF module that employs the distribution mean, maximum, and variance for attention modeling, our model demonstrates an average 3.1% WER reduction on both validation and test sets of PHOENIX14, and an average 4.3% reduction on CE-CSL compared to baseline CorrNet. This improvement is attributed to CMF’s effective spatial attention mechanism, which combines correlation sampling distribution with learnable convolutional networks to suppress low-confidence regions in inter-frame correlation construction.
By independently embedding the LTE module, the sign language recognition accuracy improved on both the test and validation sets of the two datasets. Specifically, the WER metric decreased by an average of 0.6 on PHOENIX14 and by an average of 1.1 on CE-CSL. This improvement benefits from the effective utilization of global temporal information, which assists the sign language recognition system in achieving reliable limb motion modeling and inferring more accurate sign language information.
Finally, the combined integration of the DC, CMF, and LTE modules yields WER reductions of 1.0 and 0.9 on PHOENIX14's validation and test sets, respectively, and 5.1 and 4.9 on CE-CSL's validation and test sets. These results clearly demonstrate that DC, CMF, and LTE provide orthogonal performance gains for the overall sign language recognition system.
(2) Detailed analysis of DC module
To fully demonstrate the reliability of the motion-prior-based DC module with a multi-scale sampling configuration, we conducted experiments covering three aspects: correlation sampling scale, initial sampling point arrangement, and deformable offset generation.
Sampling scale analysis (Motion-prior). First, we investigated three down-sampling scales in combination with a fixed sampling radius ($r = 2$): $\{1,\ 1+1/2,\ 1+1/2+1/4\}$. As shown in Table 2, the DCA achieves WER scores of 19.1 and 45.5 on PHOENIX14 and CE-CSL, respectively, when only using the original $\{1\}$ scale. Our empirical findings demonstrate that as the sampling results across multiple scales are progressively aggregated, the performance of the sign language recognition model continues to improve. When all scales $\{1 + 1/2 + 1/4\}$ are utilized, the model achieves WER scores of 18.7 and 44.9 on the two test datasets, respectively.
Initial sampling point arrangement. We validate the performance of dilation sampling (with a sampling area of $H/2 \times W/2$ per point) and motion-prior-based sampling (with a sampling area of $(2r+1)^2$ per point) using only the original-scale sampling results (where the resolution of $f_{t+1}$ or $f_{t-1}$ is $H \times W$). The motion-prior-based approach outperforms dilation sampling even at the original scale, further reducing the Word Error Rate (WER) by 0.3 and 1.1 on the PHOENIX14 and CE-CSL datasets, respectively, compared to the baseline.
Offset generation methodology. Finally, we conducted a detailed experimental analysis of offset generation methods. For a fair comparison, we evaluate both offset generation methods across all motion-prior-based sampling scales. Considering that correlation computation primarily establishes temporal relationships between frames, we propose jointly generating offsets from both correlated frames. We designed two feature fusion strategies: (i) direct element-wise addition followed by convolutional processing, and (ii) channel-wise concatenation followed by convolution. Results in Table 2 demonstrate that the channel concatenation approach yields superior recognition performance, reducing WER scores by 0.6 and 1.3 compared to the element-wise addition method on the PHOENIX14 and CE-CSL test sets, respectively.
(3) Detail analysis of CMF module
As shown in Table 3, we employ only the spatial attention (SA) module to process the correlation matrix, achieving average WER scores of 19.0 and 45.75 on the PHOENIX14 and CE-CSL datasets, respectively. By leveraging the independent motion modeling capability provided by spatial attention (SA), the model can assess motion reliability through correlation sampling at individual points. Consequently, applying SA-generated masks to the correlation matrix yields performance improvements. Furthermore, when channel attention (CA) is independently incorporated, superior sign language recognition accuracy is achieved compared to the baseline across both datasets. This enhancement stems from the model’s neighborhood motion modeling capability, which enables it to infer the reliability of local motion based on spatial context. Ultimately, the integration of SA and CA produces significant performance gains.
Table 4 presents a comprehensive efficiency comparison between the baseline and our proposed method. To ensure a fair comparison, we focus solely on the differences between the correlation module in CorrNet and our proposed module across four performance metrics. For this purpose, we standardize the input video feature size to 1 × 256 × 256 × 28 × 28 . First, regarding computational complexity, our approach demonstrates superior efficiency, reducing the FLOPs from 15.57 G (Baseline) to 12.53 G. Second, in terms of model size, the proposed method optimizes the architecture effectively, decreasing the parameter count from 77.66 K to 72.42 K. Third, concerning memory consumption, our method significantly lowers the GPU memory footprint to 4.3 GiB compared to the baseline’s 5.1 GiB, facilitating deployment on resource-constrained devices. Finally, regarding inference speed, our full scheme demonstrates a marginal decrease in inference latency compared to the baseline. However, for the correlation matrix calculation component alone, our method reduces the latency from 40.13 ms to 36.86 ms.

4.5. Comparison with State-of-the-Art Methods

(1) Comparative Analysis on PHOENIX14 and PHOENIX14-T
As shown in Table 5, we conducted a comprehensive comparison between our proposed DCA sign language recognition algorithm and several state-of-the-art methods. By incorporating the deformable correlation module and correlation matrix filter, our DCA achieves average WER scores of 18.3 and 18.5 on PHOENIX14 and PHOENIX14-T datasets. These results demonstrate the superior performance and competitive advantage of our proposed algorithm.
(2) Comparative Analysis on CE-CSL
To validate the effectiveness of our algorithm in Chinese sign language recognition, we compared DCA with state-of-the-art approaches on the CE-CSL dataset. As presented in Table 6, the DCA algorithm achieves the lowest average WER score of 42.05. While its performance on the validation set is slightly inferior to THNet, DCA establishes new state-of-the-art performance on the test set. These findings confirm that the deformable correlation module enables more effective utilization of temporal information, leading to exceptional performance in Chinese sign language recognition.

4.6. Visualization Analysis

This section presents attention visualizations for the more challenging CE-CSL dataset. Experimental results verify that our approach remains capable of reliably describing sign language information carriers even in the presence of substantial background noise.
(1) PCA Visualization of Correlation Matrix
As shown in Figure 7, we perform “reshape” operations on the correlation matrices of CorrNet and our DCA, transforming their dimensions from $H \times W \times H \times W$ to $H \times W \times C_1$ ($C_1 = HW$) for CorrNet and from $H \times W \times 3 \times (2r+1) \times (2r+1)$ to $H \times W \times C_2$ ($C_2 = 3(2r+1)^2$) for our DCA. This allows us to employ principal component analysis (PCA) along the channel dimension to extract key features, thereby achieving visualization of the two correlation matrices. It can be clearly observed that, compared with CorrNet, DCA more effectively focuses on the performer’s limb movements, thereby improving sign language recognition performance.
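A minimal sketch of this channel-wise PCA projection is given below; it assumes the correlation matrix has already been reshaped to a (C, H, W) tensor as described, and the use of torch.pca_lowrank and the min-max normalization are illustrative choices rather than the exact visualization pipeline.

```python
import torch

def pca_visualize_corr(m_c, out_channels=3):
    # m_c: (C, H, W) float tensor, e.g., C = H*W for CorrNet or 3*(2r+1)^2 for DCA after reshaping.
    C, H, W = m_c.shape
    x = m_c.reshape(C, H * W).T                 # (H*W, C): one C-dimensional descriptor per pixel
    x = x - x.mean(dim=0, keepdim=True)         # center the descriptors before PCA
    # torch.pca_lowrank returns (U, S, V); the columns of V are principal directions.
    _, _, v = torch.pca_lowrank(x, q=out_channels)
    proj = x @ v[:, :out_channels]              # project onto the top components -> (H*W, out_channels)
    proj = (proj - proj.min()) / (proj.max() - proj.min() + 1e-8)   # normalize to [0, 1]
    return proj.T.reshape(out_channels, H, W)   # e.g., render the 3 channels as an RGB heatmap
```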
(2) Grad-CAM Visualization of Correlation Matrix
As illustrated in Figure 8, we employ Grad-CAM to visualize spatiotemporal correlation heatmaps of the filtered correlation matrices, leveraging gradient information from the proposed CMF (Correlation Matrix Filter). Notably, the CMF-enhanced correlation matrices can more accurately characterize the performer’s limb movements while suppressing interference from irrelevant information, thereby further improving the accuracy of sign language recognition.

4.7. Qualitative Analysis

As illustrated in Figure 9, in both scenarios CorrNet exhibits significant discrepancies between its predictions for the blank token “SIL” and the ground-truth annotations. Moreover, in full-video sign language prediction, CorrNet is more prone to erroneous gloss judgments, leading to substantial deviations between the predicted signs and the performer’s actual limb movements. This issue stems from global inter-frame correlation, which introduces excessive redundancy, forcing each feature element to attend not only to motion patterns but also to abundant background features. Consequently, the model encounters difficulties in temporal feature learning. Additionally, temporal dilation further hinders the model from effectively leveraging complete sequential information.
In contrast, DCA effectively addresses these limitations. Our approach demonstrates stronger alignment with ground-truth labels in frame-level gloss prediction while reducing the probability of incorrect sign predictions.

4.8. Convergence Analysis

As illustrated in Figure 10, we present the Word Error Rate (WER) curves on both datasets during the validation and testing phases. The DCA model requires 50 training epochs in total, with evaluations performed on both validation and test sets after each epoch. Results demonstrate that DCA begins to converge stably after 20 epochs and ultimately achieves superior performance compared to the baseline CorrNet.

5. Limitations and Future Work

While our proposed Deformable Correlation Network (DCA) demonstrates significant improvements in sign language recognition accuracy, there are still some limitations to consider. First, the proposed deformable correlation model relies heavily on the effectiveness of adaptive offsets predicted by the network; however, in complex scenarios, it may fail to consistently capture reliable spatio-temporal information. Additionally, implementing this module with off-the-shelf PyTorch (1.10) operators is challenging, increasing implementation complexity. Furthermore, our current validation is limited to public benchmarks; experiments demonstrating the method’s generalization to diverse, complex scenarios are insufficient, and further verification of the scheme’s robustness is required.
To address these limitations, we aim to refine the low-level optimization of the deformable operators to achieve seamless integration. Simultaneously, to tackle the issue of limited data scenarios, we plan to validate the model’s robustness and generalization through data augmentation or by collecting datasets from real-world environments. Finally, given the rapid advancements in Large Models, we intend to leverage cross-modal knowledge transfer in future work to further enhance the performance of our sign language recognition algorithm.

6. Conclusions

This paper analyzes the correlation computation process in CorrNet and identifies that its global association introduces redundant irrelevant features while increasing computational overhead. To address these issues, we propose a deformable correlation network as a targeted improvement. First, a motion-prior-based deformable correlation module is introduced to ensure lower computational cost during the multi-scale correlation sampling stages and enable adaptive adjustment of the sampling range, thereby providing a better technical solution for inter-frame association. Additionally, a correlation matrix filter is incorporated to further suppress low-confidence regions in the correlation matrix, ensuring the effectiveness of subsequent temporal affine computations. Thirdly, a long-term temporal enhanced module is devised to efficiently leverage global temporal information for enhancing the sign language recognition accuracy. The proposed methods significantly enhance the performance of CSLR models. We believe that our proposed model, presented in this study, will offer practical value for the deaf community and educational applications.

Author Contributions

Conceptualization, Y.J. and D.Y.; methodology, Y.J.; software, Y.J.; validation, Y.J., D.Y. and C.C.; formal analysis, Y.J.; investigation, Y.J.; resources, D.Y.; data curation, Y.J.; writing—original draft preparation, Y.J.; writing—review and editing, D.Y. and C.C.; visualization, Y.J.; supervision, D.Y.; project administration, D.Y.; funding acquisition, D.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was jointly supported by the National Natural Science Foundation of China (Grant No. 62003065), the Natural Science Foundation of Chongqing, China (Grant No. CSTB2023NSCQ-MSX0771), and the Science and Technology Research Program of Chongqing Municipal Education Commission of China (Grant No. KJZD-K202300502).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/jacky218/DCANet/tree/master (accessed on 1 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Aloysius, N.; Geetha, M. Understanding vision-based continuous sign language recognition. Multimed. Tools Appl. 2020, 79, 22177–22209.
2. Cui, R.; Liu, H.; Zhang, C. Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7361–7369.
3. Liu, C.; Hu, L. Rethinking the temporal downsampling paradigm for continuous sign language recognition. Multimed. Syst. 2025, 31, 134.
4. Wang, S.; Guo, L.; Xue, W. Dynamical semantic enhancement network for continuous sign language recognition. Multimed. Syst. 2024, 30, 313.
5. Cui, R.; Liu, H.; Zhang, C. A deep neural framework for continuous sign language recognition by iterative training. IEEE Trans. Multimed. 2019, 21, 1880–1891.
6. Cheng, K.L.; Yang, Z.; Chen, Q.; Tai, Y.W. Fully convolutional networks for continuous sign language recognition. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXIV. Springer: Berlin/Heidelberg, Germany, 2020; pp. 697–714.
7. Hao, A.; Min, Y.; Chen, X. Self-mutual distillation learning for continuous sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 11303–11312.
8. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308.
9. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459.
10. Lin, J.; Gan, C.; Han, S. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7083–7093.
11. Liu, Z.; Luo, D.; Wang, Y.; Wang, L.; Tai, Y.; Wang, C.; Li, J.; Huang, F.; Lu, T. Teinet: Towards an efficient architecture for video recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11669–11676.
12. Hu, L.; Gao, L.; Liu, Z.; Feng, W. Continuous sign language recognition with correlation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2529–2539.
13. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773.
14. Huang, Y.; Xiao, Z.; Firkat, E.; Zhang, J.; Wu, D.; Hamdulla, A. Spatio-temporal mix deformable feature extractor in visual tracking. Expert Syst. Appl. 2024, 237, 121377.
15. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
16. Koller, O.; Zargaran, S.; Ney, H.; Bowden, R. Deep sign: Hybrid CNN-HMM for continuous sign language recognition. In Proceedings of the British Machine Vision Conference 2016, York, UK, 19–22 September 2016.
17. Koller, O.; Zargaran, S.; Ney, H. Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4297–4305.
18. Liu, P.; Li, G.; Zhao, W.; Tang, X. A coupling method of learning structured support correlation filters for visual tracking. Vis. Comput. 2024, 40, 181–199.
19. Chen, Z.; Liu, L.; Yu, Z. Toward robust visual tracking for UAV with adaptive spatial-temporal weighted regularization. Vis. Comput. 2024, 40, 8987–9003.
20. Aiman, U.; Ahmad, T. Angle based hand gesture recognition using graph convolutional network. Comput. Animat. Virtual Worlds 2024, 35, e2207.
21. Xiao, Z.; Chen, Y.; Zhou, X.; He, M.; Liu, L.; Yu, F.; Jiang, M. Human action recognition in immersive virtual reality based on multi-scale spatio-temporal attention network. Comput. Animat. Virtual Worlds 2024, 35, e2293.
22. Xue, S.; Gao, L.; Wan, L.; Feng, W. Multi-scale context-aware network for continuous sign language recognition. Virtual Real. Intell. Hardw. 2024, 6, 323–337.
23. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376.
24. Min, Y.; Hao, A.; Chai, X.; Chen, X. Visual alignment constraint for continuous sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 11542–11551.
25. Hu, L.; Gao, L.; Liu, Z.; Feng, W. Temporal lift pooling for continuous sign language recognition. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 511–527.
26. Hu, L.; Gao, L.; Liu, Z.; Feng, W. Self-emphasizing network for continuous sign language recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 854–862.
27. Ying, X.; Wang, L.; Wang, Y.; Sheng, W.; An, W.; Guo, Y. Deformable 3d convolution for video super-resolution. IEEE Signal Process. Lett. 2020, 27, 1500–1504.
28. Huang, Y.; Ji, L.; Liu, H.; Ye, M. LGU-SLAM: Learnable Gaussian Uncertainty Matching with Deformable Correlation Sampling for Deep Visual SLAM. arXiv 2024, arXiv:2410.23231.
29. Koller, O.; Forster, J.; Ney, H. Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Comput. Vis. Image Underst. 2015, 141, 108–125.
30. Camgoz, N.C.; Hadfield, S.; Koller, O.; Ney, H.; Bowden, R. Neural sign language translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7784–7793.
31. Zhu, Q.; Li, J.; Yuan, F.; Fan, J.; Gan, Q. A Chinese Continuous Sign Language Dataset Based on Complex Environments. arXiv 2024, arXiv:2409.11960.
32. Zhu, Q.; Li, J.; Yuan, F.; Gan, Q. Multiscale temporal network for continuous sign language recognition. J. Electron. Imaging 2024, 33, 023059.
33. Hu, L.; Gao, L.; Liu, Z.; Feng, W. Scalable frame resolution for efficient continuous sign language recognition. Pattern Recognit. 2024, 145, 109903.
34. Zhou, H.; Zhou, W.; Zhou, Y.; Li, H. Spatial-temporal multi-cue network for continuous sign language recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13009–13016.
35. Kan, J.; Hu, K.; Hagenbuchner, M.; Tsoi, A.C.; Bennamoun, M.; Wang, Z. Sign language translation with hierarchical spatio-temporal graph neural network. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 3367–3376.
36. Liu, J.; Xue, W.; Zhang, K.; Yuan, T.; Chen, S. Tb-net: Intra- and inter-video correlation learning for continuous sign language recognition. Inf. Fusion 2024, 109, 102438.
37. Lu, H.; Salah, A.A.; Poppe, R. Tcnet: Continuous sign language recognition from trajectories and correlated regions. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 3891–3899.
38. Zhu, Q.; Li, J.; Yuan, F.; Gan, Q. Continuous sign language recognition based on motor attention mechanism and frame-level self-distillation. Mach. Vis. Appl. 2025, 36, 7.
Figure 1. Comparison of global correlation and our spatio-temporal driven deformable correlation. The symbols + and * denote element-wise addition and element-wise multiplication, respectively.
Figure 2. Global overview of our proposed continuous sign language recognition framework based on the deformable correlation network.
Figure 3. Workflow of spatio-temporal driven deformable correlation.
Figure 4. Multi-scale correlation sampling based on the motion prior.
Figure 5. Workflow of the correlation matrix filter. The symbols + and × denote element-wise addition and element-wise multiplication, respectively.
Figure 6. Long-term temporal enhancement module.
Figure 7. Visualization of the correlation matrix based on PCA.
Figure 8. Visualization of the correlation matrix based on Grad-CAM.
Figure 9. Alignment results of sentence predictions.
Figure 10. Word Error Rate (WER) curves on the CE-CSL and PHOENIX14 datasets.
Table 1. Overall ablation analysis on PHOENIX14 and CE-CSL.

Configuration                   PHOENIX14 Val (%)   PHOENIX14 Test (%)   CE-CSL Val (%)   CE-CSL Test (%)
Baseline                        19.2                19.3                 47.2             46.5
Baseline + DC                   18.5                18.8                 43.9             43.4
Baseline + CMF                  18.6                18.7                 44.7             44.9
Baseline + LTE                  18.8                18.9                 46.3             46.1
Baseline + CMF + DC + LTE       18.4                18.6                 42.1             41.6
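All percentages reported in Tables 1–6 are Word Error Rates over predicted gloss sequences. As a reminder of how such figures are computed, the following is a minimal sketch of gloss-level WER via edit distance; the function name and the example glosses are illustrative placeholders, not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Gloss-level WER = (substitutions + deletions + insertions) / len(reference)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance between the two gloss sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution over four reference glosses -> WER = 25%.
print(word_error_rate("MORGEN REGEN NORD WIND", "MORGEN SONNE NORD WIND"))  # 0.25
```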
Table 2. Detailed analysis of the DC module on PHOENIX14 and CE-CSL.

DC Configuration                                   PHOENIX14 (Test %)   CE-CSL (Test %)
Baseline                                           19.3                 46.5
Sampling scale {1}                                 19.0                 45.4
Sampling scale {1 + 1/2}                           18.8                 45.2
Sampling scale {1 + 1/2 + 1/4}                     18.7                 44.9
Initial sampling arrangement, w/ dilation          19.2                 45.9
Initial sampling arrangement, w/ motion prior      19.0                 45.4
Offset generation, element-wise addition           19.3                 46.2
Offset generation, channel-wise concatenation      18.7                 44.9
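The offset-generation comparison in Table 2 (element-wise addition versus channel-wise concatenation of adjacent-frame features) can be illustrated with a short PyTorch sketch. The code below is an assumption-laden illustration rather than the authors' implementation: the layer names, the number of sampling points, and the correlation normalization are placeholders chosen only to show how concatenation-driven offsets can deform the sampling grid before computing inter-frame correlation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableCorrelationSketch(nn.Module):
    """Illustrative deformable correlation between two adjacent frame features."""
    def __init__(self, channels, num_points=9):
        super().__init__()
        self.num_points = num_points
        # Offsets predicted from the channel-wise concat of frame t and frame t+1.
        self.offset_head = nn.Conv2d(2 * channels, 2 * num_points, kernel_size=3, padding=1)

    def forward(self, feat_t, feat_next):
        # feat_t, feat_next: (B, C, H, W) features of adjacent frames.
        b, c, h, w = feat_t.shape
        offsets = self.offset_head(torch.cat([feat_t, feat_next], dim=1))   # (B, 2K, H, W)
        offsets = offsets.view(b, self.num_points, 2, h, w)

        # Base sampling grid in normalized [-1, 1] coordinates (x first, as grid_sample expects).
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=feat_t.device),
            torch.linspace(-1, 1, w, device=feat_t.device),
            indexing="ij",
        )
        base_grid = torch.stack([xs, ys], dim=-1)                           # (H, W, 2)

        responses = []
        for k in range(self.num_points):
            # Convert pixel offsets to the normalized grid_sample coordinate range.
            dx = offsets[:, k, 0] / max(w - 1, 1) * 2
            dy = offsets[:, k, 1] / max(h - 1, 1) * 2
            grid = base_grid.unsqueeze(0) + torch.stack([dx, dy], dim=-1)   # (B, H, W, 2)
            sampled = F.grid_sample(feat_next, grid, align_corners=True)    # (B, C, H, W)
            # Correlation response between the current feature and the deformed sample.
            responses.append((feat_t * sampled).sum(dim=1, keepdim=True) / c ** 0.5)
        return torch.cat(responses, dim=1)                                  # (B, K, H, W)
```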
Table 3. Detailed analysis of the CMF module on PHOENIX14 and CE-CSL.

Configuration     PHOENIX14 Val (%)   PHOENIX14 Test (%)   CE-CSL Val (%)   CE-CSL Test (%)
w/ SA + CA        18.6                18.7                 44.7             44.9
w/ SA             18.9                19.1                 45.6             45.9
w/ CA             18.8                18.9                 45.3             45.7
Table 4. Efficiency comparison between the baseline and our proposed method.

Method                         FLOPs (G)   Params (K)   Latency (ms)   MEM (GiB)
Baseline                       15.57       77.66        40.13          5.1
Baseline + DC                  10.22       68.74        36.86          3.6
Baseline + DC + CMF + LTE      12.53       72.42        39.93          4.3
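Table 4 reports computational cost (FLOPs, parameters), inference latency, and peak GPU memory. As a hedged illustration of how latency and memory figures of this kind are typically obtained for a PyTorch model, the sketch below shows one plausible measurement protocol; the input shape, warm-up count, and iteration count are assumptions and not the authors' exact settings.

```python
import time
import torch

def measure_latency_and_memory(model, input_shape=(1, 3, 224, 224), warmup=10, iters=50):
    """Return (average latency in ms, peak GPU memory in GiB) for one forward pass.

    input_shape is a placeholder; a CSLR model would take a video clip instead.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)

    with torch.no_grad():
        for _ in range(warmup):                    # warm-up to stabilize clocks and caches
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
            torch.cuda.reset_peak_memory_stats()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        latency_ms = (time.perf_counter() - start) / iters * 1000

    peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3 if device == "cuda" else float("nan")
    return latency_ms, peak_gib
```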
Table 5. Comparative analysis on PHOENIX14 and PHOENIX14-T.

Methods           PHOENIX14 Val (%)   PHOENIX14 Test (%)   PHOENIX14-T Val (%)   PHOENIX14-T Test (%)
VAC [24]          21.2                22.3                 –                     –
MSTNet [32]       20.3                21.4                 –                     –
SEN [26]          19.5                21.0                 19.3                  20.7
AdaSize [33]      19.7                20.9                 19.7                  20.7
STMC [34]         21.1                20.7                 19.6                  21.0
HST-GNN [35]      19.5                19.8                 20.1                  20.3
DSE [4]           18.6                19.8                 18.9                  19.9
TB-Net [36]       18.9                19.6                 18.8                  20.0
CorrNet [12]      19.2                19.4                 18.9                  20.5
TCNet [37]        18.1                18.9                 18.3                  19.4
MAM-FSD [38]      19.2                18.8                 18.2                  19.4
THNet [31]        18.7                18.6                 18.0                  19.1
DCA (Ours)        18.4                18.6                 18.3                  18.6
Table 6. Comparative analysis on CE-CSL.

Methods           CE-CSL Val (%)   CE-CSL Test (%)
MSTNet [32]       54.4             53.0
CorrNet [12]      47.2             46.5
SEN [26]          46.5             45.3
VAC [24]          45.1             43.3
MAM-FSD [38]      44.9             44.7
THNet [31]        42.1             41.9
DCA (Ours)        42.1             41.6