Article

Enhancing Continuous Sign Language Recognition via Spatio-Temporal Multi-Scale Deformable Correlation

College of Computer and Information Science, Chongqing Normal University, Hu Xi Street, Chongqing 401331, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(1), 124; https://doi.org/10.3390/app16010124
Submission received: 25 November 2025 / Revised: 16 December 2025 / Accepted: 18 December 2025 / Published: 22 December 2025

Abstract

Deep learning-based sign language recognition plays a pivotal role in facilitating communication for the deaf community. Current approaches, while effective, often introduce redundant information and incur excessive computational overhead through global feature interactions. To address these limitations, this paper introduces a Deformable Correlation Network (DCA) designed for efficient temporal modeling in continuous sign language recognition. The DCA integrates a Deformable Correlation (DC) module that leverages spatio-temporal driven offsets to adjust the sampling range adaptively, thereby minimizing interference. Additionally, a multi-scale local sampling strategy, guided by motion priors, enhances temporal modeling capability while reducing computational costs. Furthermore, an attention-based Correlation Matrix Filter (CMF) is proposed to suppress interference elements by accounting for feature motion patterns. A long-term temporal enhancement module, based on spatial aggregation, efficiently leverages global temporal information to model the performer’s holistic limb motion trajectories. Extensive experiments on three benchmark datasets demonstrate significant performance improvements, with a reduction in Word Error Rate (WER) of up to 7.0% on the CE-CSL dataset, showcasing the superiority and competitive advantage of the proposed DCA algorithm.

1. Introduction

Sign language plays a vital role in daily communication within the deaf and mute community. However, due to limited popularization and inherent complexity, a significant communication gap inevitably exists between the general public and individuals with hearing impairments. To address this issue, researchers have proposed numerous sign language translation and recognition solutions, among which video-based deep learning approaches have shown remarkable promise [1,2,3,4]. Drawing inspiration from natural language processing techniques, current video-based sign language recognition methods predict multiple glosses to form coherent sentences, enabling a more structured framework for modeling sign language recognition.
Existing deep learning-based sign language recognition methods can be broadly categorized into two groups: single-frame analysis approaches and implicit temporal modeling methods. Single-frame methods [5,6,7] employ shared-parameter convolutional neural networks (CNNs) to process each video frame independently, which inherently limits inter-frame correlation during feature extraction, leading to a spatio-temporal fragmentation dilemma in sign language recognition models. Unsurprisingly, such frame-independent approaches fail to capture semantic information from body motion trajectories, resulting in suboptimal recognition accuracy.
With advancements in Convolutional Neural Networks and feature fusion techniques, many researchers have shifted their focus toward implicit temporal modeling using 3D convolutions or feature fusion. Relevant studies leverage 3D convolutions [8] or 2D convolutions [9] combined with feature fusion to construct local spatiotemporal contexts, thereby enhancing the model’s semantic understanding of body movements in videos. Additionally, some works employ temporal shift modules [10] and temporal convolutions [11] to achieve short-term multi-frame associations. However, due to the large number of input frames required for sign language recognition, 3D convolutions and other temporal modeling methods struggle to perform global analysis across the entire temporal dimension. Furthermore, convolution-based spatio-temporal fusion approaches rely heavily on data-driven assumptions while lacking sufficient prior knowledge, which hinders the interpretation of semantic information derived from motion changes in videos.
To mitigate these limitations, as shown in Figure 1a, CorrNet [12] was proposed to establish explicit correlation sampling via cross-attention mechanisms. CorrNet computes visual correlations between frames via matrix multiplication, with a channel-wise correlation matrix that captures global inter-frame dependencies. This indirectly describes pixel-level motion between frames, effectively uncovering action-related changes in videos. Additionally, CorrNet employs dilated grouped convolutions to reduce parallel computation overhead, expand the temporal receptive field, and enhance global temporal modeling capabilities. Despite its improved spatio-temporal context modeling, CorrNet inevitably incurs computational costs due to global correlations and suffers from performance degradation caused by misleading interference features in motion trajectory descriptions. Furthermore, spatio-temporal dilation operations in 3D CNNs could also miss crucial time intervals across global temporal sequences.
To solve the above problems, inspired by learnable deformable sampling techniques [13,14,15], this paper proposes a Deformable Correlation Network (DCA) for continuous sign language recognition, as shown in Figure 1b, effectively mitigating computational overhead and interference feature-induced performance decline. The deformable strategy is advantageous primarily because it employs adaptive sampling offsets within a smaller sampling range. This design efficiently captures rapid limb movements while maintaining low computational costs—a critical factor for modeling spatio-temporal context in sign language recognition. Without such adaptive offsets, fixed local sampling would inevitably miss the temporal semantic features inherent in extensive limb movements. The details of our proposed scheme are presented below.
First, we integrate deformable sampling into the correlation computation process to construct a deformable correlation (DC) module, which leverages inter-frame fusion to generate spatiotemporal context-driven offsets. This enables adaptive adjustment of the correlation sampling ranges. By employing input-dependent dynamic range adaptation, DC achieves reliable visual correlation modeling with a reduced sampling scope. Our in-depth analysis reveals that the reduced sampling area prevents each element from attending globally to the other frame; consequently, the initial sampling point placement becomes crucial for model performance. To address this, we investigated two approaches: uniform placement based on dilation operations and motion-prior-based placement. Experimental comparisons demonstrate the superior performance of the motion-prior approach. Moreover, the motion-prior method enables a further reduction of the sampling radius. By integrating multi-scale sampling results, it achieves reliable local-to-global inter-frame correlation modeling.
Secondly, to further suppress irrelevant information in the correlation matrix, we propose a correlation matrix filter (CMF) based on visual similarity distribution to generate masks for filtering out interference features. We employ spatial attention to model independent-element motion at individual points, while using channel attention to accomplish motion modeling within local spatial contexts. This dual approach enhances the attention mask’s capacity to characterize both reliable and unreliable motions, thereby enabling effective noise filtering.
Thirdly, to preserve complete temporal information while reducing computational costs and GPU memory consumption, we propose a spatial aggregation-based long-term temporal enhancement module. The generated global enhancement mask operates on feature-wise representations during the backbone network’s feature extraction stage. This design enables the model to efficiently leverage global temporal characteristics for performance improvement while avoiding information loss caused by temporal dilation.
Experimental results demonstrate that our proposed DCA-based continuous sign language recognition model achieves significant performance on three public datasets. Ablation studies confirm its superior temporal modeling capability and higher recognition accuracy compared to baseline methods.
Key innovations of our approach include the following:
(1) Spatio-temporal driven deformable correlation module: We generate offsets by leveraging temporal fusion between adjacent frames and establish multi-scale correlation sampling based on motion priors for efficient temporal modeling.
(2) Correlation matrix filter: an inter-frame feature motion-oriented hybrid attention mechanism, specifically designed for sign language recognition tasks, is proposed to generate an attention mask for the correlation matrix to effectively filter out interference information.
(3) Long-term temporal enhancement module: a spatially aggregated efficient global temporal correlation model is proposed to enhance long-term temporal modeling, thereby improving the accuracy of sign language recognition.
(4) Extensive ablation studies and comparative analyses validate the effectiveness of the proposed framework.

2. Related Work

2.1. Continuous Sign Language Recognition

The objective of Continuous Sign Language Recognition (CSLR) algorithms is to convert a performer’s sign language video into a gloss sequence, representing it in natural semantics that are more comprehensible to general audiences. In early approaches [16,17], manually designed feature descriptors were employed for feature extraction, while Hidden Markov Models (HMMs) were utilized for temporal modeling. However, these traditional methods heavily relied on expert-provided prior knowledge, and handcrafted feature extraction struggled to uncover more representative abstract representations, thereby limiting recognition performance.
With the rise of deep learning, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have been widely applied to computer vision and temporal modeling tasks, leading to further improvements in fields such as visual tracking [18,19] and action recognition [20,21]. Sign language recognition [22] has evolved to prioritize mining spatiotemporal contexts using neural networks, drawing inspiration from these tasks. By constructing end-to-end networks, backpropagation and Connectionist Temporal Classification (CTC) loss [23] functions are used to update network parameters in a data-driven manner for model optimization.
Current mainstream network architectures primarily employ 2D CNNs [2,5] to extract frame-wise features from individual video frames—a stage typically referred to as frame-wise extraction. Subsequently, 1D CNNs and LSTMs are applied to model temporal dependencies in the video sequence features [12,24], thereby capturing the performer’s motion trajectories and abstracting corresponding sign language information. Other works focused on improving the objective function by introducing alignment loss [24] and pseudo-labels to enhance learning effectiveness. Since CSLR is a typical sequential task, related work has sought to enhance its temporal modeling capabilities. For instance, TLP [25] leverages a temporal lift module to obtain more robust temporal features. SEN [26] introduces multi-scale spatial features to generate a spatial attention weighting mechanism. CorrNet [12] pioneers the incorporation of temporal modeling into the frame-wise feature extraction stage by adopting Transformer-based cross-attention mechanisms to establish global correlations among features within the temporal neighborhood. This approach introduces temporal dependencies at the early feature extraction stage, significantly improving sign language recognition performance. However, CorrNet’s global correlation mechanism introduces substantial irrelevant noise features and leads to a sharp increase in computational complexity. Therefore, mitigating these issues holds significant potential for advancing CSLR models.

2.2. Deformable Neural Net Module

The primary objective of deformable neural network modules is to break the constraints imposed by manually designed fixed feature sampling ranges. By adopting a posterior-driven approach, these modules learn input-dependent offsets during training, thereby enabling adaptive adjustment of the sampling scope.
Deformable Convolution (DCN) [13] introduced learnable deformation strategies for object detection in computer vision, enabling feature sampling to effectively capture irregularly shaped objects and thereby enhancing model recognition capability. Subsequent works, such as Deformable DETR [15], incorporated deformable mechanisms into the self-attention modules of Transformers. This approach avoids exhaustive global feature correlation, reducing computational overhead while improving convergence speed during training.
Later research extended deformable strategies [14,27,28] to temporal modeling tasks, leading to the development of deformable 3D convolution [27], which enables adaptive sampling along the temporal dimension. However, the parameter size and computational complexity of 3D convolutions make them impractical for processing long temporal sequences, particularly in sign language recognition, where a single video often contains over 100 frames. Introducing multiple 3D convolutions would incur prohibitive computational costs.
To address this, we adopt CorrNet’s temporal association strategy at the frame-wise stage, which focuses solely on modeling correlations within the local temporal neighborhood. However, CorrNet employs a global correlation mechanism akin to cross-attention, which tends to introduce excessive irrelevant noise. Inspired by advanced deformable strategies [13,14,15,28], we mitigate this issue by reducing the number of sampling points and enabling dynamic sampling ranges via learnable offsets.

3. Methods

In this section, we elaborate on the proposed Continuous Sign Language Recognition (CSLR) framework based on the deformable correlation network, as illustrated in Figure 2. According to the feature dimension processing approach, the overall architecture is mainly divided into two parts: frame-wise feature extraction and gloss-wise feature extraction. (1) In the frame-wise part, consecutive video frames are fed into a ResNet-based backbone to complete video feature extraction. During the backbone stage, we not only focus on single-frame visual features but also introduce a multi-scale deformable correlation module to construct a reliable and efficient spatiotemporal context. In addition, an attention-based correlation matrix filter (CMF) is introduced to suppress unreliable correspondences, further enhancing the reliability of inter-frame associations. The features are then fused with the output of the long-term temporal enhanced module to derive the frame-wise features at each stage in the backbone. (2) In the gloss-wise part, we employ temporal modeling based on 1D convolutions and a Bi-LSTM to obtain the final gloss-wise features, which are fed into a classifier to produce the final sign language recognition results.

3.1. Method Overview and Motivation

Overview. Given a continuous video frame sequence $\{x\} \in \mathbb{R}^{T \times 3 \times H_i \times W_i}$ ($H_i$, $W_i$ are the height and width of the original images) of length $T$, the primary objective of a deep CSLR model is to feed $\{x\}$ into a neural network to obtain a sequence of glosses $y = \{y_i\}_{i=1}^{N}$ that describes a natural language sentence, where $N$ denotes the length of the gloss sequence. Specifically, CSLR first employs a ResNet-based backbone for feature extraction, producing frame-wise features $f_v \in \mathbb{R}^{T \times C_1 \times H \times W}$. Subsequently, $f_v$ is passed into a temporal modeling combination, consisting of 1D convolutions and a Bi-LSTM, to perform long-term temporal modeling while compressing the temporal dimension ($T \rightarrow N$). Finally, a fully connected network (FCN) layer-based classifier is used to output the predicted gloss probability distribution. During training, the entire network's parameters are optimized via backpropagation using the Connectionist Temporal Classification (CTC) loss. During inference, the predicted probability distribution is decoded to produce the gloss sequence representing the natural language sentence.
Motivation. To improve spatio-temporal enhancement in the feature-wise extraction stages, CorrNet introduced a correlation module at each stage of the backbone to establish inter-frame relationships. Specifically, every element in the visual features at time $t$ computes dot products with all elements in the visual features at time $t-1$ or $t+1$. Assuming there are $T$ frames, the total number of global correlation computations is $2 \times T \times C \times (H \times W)^2$. If $T \times C$ is treated as a constant significantly smaller than $H \times W$, the final time complexity becomes $O((H \times W)^2)$. In practice, the limb movements that require attention in videos typically exhibit spatio-temporal locality. Consequently, employing global correlations not only incurs excessive computational overhead but also introduces substantial irrelevant noise. Instead, effective correlation modeling can be achieved by focusing solely on spatio-temporally continuous local regions. Assuming the local region has a size of $(2r+1)^2$ (where $r$ is the sampling radius, which is significantly smaller than both $H$ and $W$), the total computation required becomes $2 \times C \times T \times H \times W \times (2r+1)^2$, with a computational complexity of $O(H \times W)$. Therefore, this paper replaces the original global receptive field with a smaller sampling range and achieves efficient temporal modeling by introducing adaptive deformable operations.
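To make the two estimates above concrete, the short Python sketch below evaluates both operation counts for illustrative feature sizes; the specific values of T, C, H, W, and r are assumptions chosen for illustration rather than the exact configuration of our backbone.

```python
# Rough operation counts for global vs. local correlation sampling (illustrative sizes).
T, C, H, W = 100, 64, 28, 28   # assumed frame count and feature map size
r = 2                          # local sampling radius

global_ops = 2 * T * C * (H * W) ** 2              # every element vs. all H*W positions
local_ops = 2 * T * C * H * W * (2 * r + 1) ** 2   # every element vs. a (2r+1)^2 window

print(f"global: {global_ops:.2e}, local: {local_ops:.2e}, "
      f"ratio: {global_ops / local_ops:.0f}x")      # ratio = H*W / (2r+1)^2, about 31x here
```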

3.2. Deformable Correlation Module Based on Motion Prior

Given a sequence of video frames, the deep CSLR model first employs a backbone network to extract feature-wise representations. In the ResNet-based backbone, there are four stages of feature downsampling. In the first stage, the relatively high resolution would lead to excessive computational overhead if correlation computation were used for inter-frame modeling. Therefore, CorrNet omits the correlation module at this stage. In contrast, the proposed deformable correlation module is efficient enough to be deployed at every stage (as depicted in Figure 3) for temporal modeling. Furthermore, the deformable correlation module can adaptively expand the sampling range and adjust the attended regions even under restricted sampling conditions, thereby achieving superior spatio-temporal context modeling for describing the real limb movement between adjacent frames. We elaborate on the motion-prior-based Deformable Correlation Module in two parts below.
(1) Spatio-temporal driven deformable sampling. Assuming the current stage yields a frame-sequence feature denoted as $f_s \in \mathbb{R}^{T \times C_1 \times H \times W}$, when fed into the correlation module, it first undergoes a 3D convolution to capture local spatio-temporal relationships while performing channel-wise downsampling, resulting in $f_t^l \in \mathbb{R}^{T \times C \times H \times W}$. To establish temporal dependencies within the neighborhood of time step $t$, we temporally shift $f_t^l$ leftward to obtain $f_{t-1}^l \in \mathbb{R}^{T \times C \times H \times W}$ and rightward to obtain $f_{t+1}^l \in \mathbb{R}^{T \times C \times H \times W}$. During left shifting, features at time $t$ are replaced by those at $t-1$, whereas during right shifting, features at time $t$ are replaced by those at $t+1$. For the convenience of subsequent explanation, we only analyze the correlation calculation of the feature $f_t \in \mathbb{R}^{C \times H \times W}$ at time $t$. Given a sampling range of coordinates, the correlation sampling between $f_t$ and $f_{t+1}$ can be computed as follows:
$$M_c(x, y, x_1, y_1) = \frac{1}{C} \sum_{c=1}^{C} \left( f_t^c(x, y) \cdot f_{t+1}^c(x_1, y_1) \right), \quad (x, y) \in R_1,\ (x_1, y_1) \in R_2$$
where $(x, y)$ represents the coordinates of an element in $f_t$, $R_1 \in \mathbb{R}^{H \times W \times 2}$ represents the coordinate set of $f_t$, $(x_1, y_1)$ denotes coordinates in $f_{t+1}$, and $M_c \in \mathbb{R}^{H \times W \times (2r+1) \times (2r+1)}$ is the objective correlation matrix; the dot product "$\cdot$" is used to compute the visual similarities stored in $M_c$. During correlation sampling, $x_1$ and $y_1$ are constrained within the prior-specified range $R_2 \in \mathbb{R}^{(2r+1) \times (2r+1) \times 2}$. In CorrNet, the range $R_2$ covers the entire feature plane of $f_{t+1}$. In our proposed method, we reduce the sampling area to $1/4$ of the original size to eliminate redundant correlation sampling. To ensure unbiased sampling after area reduction, we employ dilation operations (as shown in Figure 3b) to uniformly distribute sampling points across the feature plane, thereby obtaining the initial reference sampling coordinates $\mathrm{coord}_i \in \mathbb{R}^{(2r+1) \times (2r+1) \times 2}$.
Since correlation computation is a binary operation, we design an offset mechanism dependent on both frames, termed spatio-temporal driven deformation. By concatenating $f_t$ and $f_{t+1}$ along the channel dimension and processing them through convolutional layers, we obtain learnable offsets $\mathrm{coord}_{ofs} \in \mathbb{R}^{(2r+1) \times (2r+1) \times 2}$. These offsets are then added to $\mathrm{coord}_i$ to derive the final sampling coordinates. The spatio-temporal driven deformable correlation sampling can ultimately be calculated using the following formula:
$$\mathrm{coord}_{ofs} = \mathrm{Conv}(\mathrm{Concat}(f_t, f_{t+1}))$$
$$M_c(x, y, x_1, y_1) = \frac{1}{C} \sum_{c=1}^{C} \left( f_t^c(x, y) \cdot f_{t+1}^c(x_1, y_1) \right), \quad (x_1, y_1) \in (\mathrm{coord}_{ofs} + \mathrm{coord}_i)$$
Through the aforementioned deformable sampling operation, we obtain a much smaller correlation volume while greatly reducing computational complexity. The above discussion only addresses deformable sampling at one point in $f_t$; for the entire $f_t$, the learnable offset tensor has shape $\mathbb{R}^{H \times W \times (2r+1) \times (2r+1) \times 2}$.
(2) Multi-scale Correlation Sampling based on Motion Prior. In the deformable correlation sampling designed above, we employ a uniform distribution to set the sampling locations of each element of $f_t$ within $f_{t+1}$. Although this unbiased initialization yields a certain performance improvement for sign language recognition models, it violates the spatial displacement locality of each feature element in the spatio-temporal domain. Furthermore, although the current scheme reduces the number of sampling points through the designed deformable strategy, some degree of sampling redundancy remains. Given that sign language scenarios typically involve performers using relatively slow limb movements to convey gestures, we propose adopting a more localized sampling range to model motion features. However, in certain scenarios, sign language gestures involve significant body displacements, which necessitate an expanded sampling range for the correlation volume. To address this, we downsample the features into multiple scales and construct multi-scale correlation volumes, thereby effectively modeling body motion over local-to-global ranges.
Based on this insight, we present the Multi-scale Correlation Sampling based on the Motion Prior algorithm. The complete algorithm consists of two key steps:
(1) Motion Prior-based Sampling Point Configuration: Unlike previous approaches that adopt uniformly distributed sampling anchors, we propose to initialize the anchor points as the element coordinates $P_t$ of the feature plane $f_t$, guided by motion priors. This design is theoretically justified by the spatial locality of motion between consecutive frames in scenarios with limited movement magnitude. After anchor point determination, we further reduce the sampling radius $r$ to 2. Consequently, each point in the left feature map $f_t$ only needs to sample 25 feature points from the right feature map ($f_{t+1}$ or $f_{t-1}$), where 25 is significantly smaller than $(H/2) \times (W/2)$ in our implementation.
(2) Multi-Scale Correlation Handling: To address potential large-motion scenarios in sign language while maintaining sampling efficiency, we retain the original sampling radius but strategically reduce the resolution of the right feature map. This achieves an effective increase in the receptive field without expanding the sampling range. As illustrated in Figure 4, we perform correlation sampling on right features at three distinct scales: $\{1, 1/2, 1/4\}$ of the original resolution. The resulting correlation volumes from all scales are concatenated to form the final correlation matrix $M_c$.
It is noteworthy that the proposed multi-scale motion-prior-based scheme constructs a local-to-global correlation tensor, which further suppresses sampling redundancy through spatiotemporal locality priors, thereby reducing the overall size of the final correlation matrix $M_c$ from $\mathbb{R}^{H \times W \times H/2 \times W/2}$ to $\mathbb{R}^{H \times W \times 3 \times 5 \times 5}$. Moreover, we only need to predict correspondingly smaller sampling offsets $\mathrm{coord}_{ofs} \in \mathbb{R}^{H \times W \times 3 \times 5 \times 5 \times 2}$ to achieve deformable correlation sampling.
The motion-prior-based deformable correlation sampling cannot be directly implemented using off-the-shelf PyTorch 1.10 APIs. Therefore, we developed the corresponding custom operators using PyTorch’s CUDA extension, with a simplified pseudocode provided in Algorithm 1.
Algorithm 1: Deformable correlation sampling kernel (pseudo code)
[The pseudocode of the deformable correlation sampling kernel is provided as a figure in the original article.]
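As a runnable reference, the following is a minimal PyTorch sketch of the same idea built on bilinear sampling with F.grid_sample; it is an illustrative approximation of the custom CUDA kernel rather than the actual implementation, and the module name, offset-convolution kernel size, default radius, and scale set are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableLocalCorr(nn.Module):
    # Illustrative sketch of motion-prior multi-scale deformable correlation
    # (approximates the custom CUDA kernel with F.grid_sample; not the exact implementation).
    def __init__(self, channels, r=2, scales=(1.0, 0.5, 0.25)):
        super().__init__()
        self.r, self.scales = r, scales
        self.k = (2 * r + 1) ** 2
        # Spatio-temporal driven offsets: one (dx, dy) per sampling point and per scale,
        # predicted from the channel-wise concatenation of the two frames.
        self.offset_conv = nn.Conv2d(2 * channels, len(scales) * self.k * 2, 3, padding=1)

    def forward(self, f_t, f_nb):
        # f_t, f_nb: (B, C, H, W) features at time t and at a neighboring frame (t-1 or t+1).
        B, C, H, W = f_t.shape
        k, r = self.k, self.r
        off = self.offset_conv(torch.cat([f_t, f_nb], dim=1))
        off = off.view(B, len(self.scales), k, 2, H, W)

        # Motion prior: anchor each local window at the element's own (x, y) coordinate.
        ys, xs = torch.meshgrid(torch.arange(H, device=f_t.device),
                                torch.arange(W, device=f_t.device), indexing="ij")
        anchor = torch.stack([xs, ys], dim=0).float()                          # (2, H, W)
        dy, dx = torch.meshgrid(torch.arange(-r, r + 1, device=f_t.device),
                                torch.arange(-r, r + 1, device=f_t.device), indexing="ij")
        window = torch.stack([dx.reshape(-1), dy.reshape(-1)], dim=1).float()  # (k, 2)

        corrs = []
        for s_idx, s in enumerate(self.scales):
            f_s = f_nb if s == 1.0 else F.interpolate(f_nb, scale_factor=s,
                                                      mode="bilinear", align_corners=False)
            Hs, Ws = f_s.shape[-2:]
            # Final coordinates = scaled anchor + fixed (2r+1)^2 window + learned offsets.
            coords = (anchor.reshape(1, 1, 2, H * W) * s
                      + window.reshape(1, k, 2, 1)
                      + off[:, s_idx].reshape(B, k, 2, H * W))                 # (B, k, 2, H*W)
            grid = torch.stack([coords[:, :, 0] / max(Ws - 1, 1) * 2 - 1,      # normalise x
                                coords[:, :, 1] / max(Hs - 1, 1) * 2 - 1],     # normalise y
                               dim=-1)                                          # (B, k, H*W, 2)
            sampled = F.grid_sample(f_s, grid, mode="bilinear", align_corners=True)  # (B, C, k, H*W)
            # Channel-averaged dot product against the query feature at each position.
            corr = (f_t.reshape(B, C, 1, H * W) * sampled).mean(dim=1)         # (B, k, H*W)
            corrs.append(corr.reshape(B, k, H, W))
        return torch.cat(corrs, dim=1)    # (B, 3*(2r+1)^2, H, W) correlation matrix M_c
```

In the full module, such a sampler would be invoked twice per time step, once against the left-shifted and once against the right-shifted feature sequence, to produce the two neighborhood correlation matrices used later.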

3.3. Correlation Matrix Filter

The correlation matrix is computed through correlation sampling, and CorrNet applies it to $f_t$ to achieve a spatio-temporal affine transformation. However, during the sampling process, the results may be affected by visually similar interfering features, making the sign language recognition model prone to semantic understanding errors. Therefore, we aim to assign smaller weights to these regions, making the attention mechanism a natural solution to this problem. Upon closer examination of the correlation matrix, we observe that its dimensionality can be transformed from $M_c \in \mathbb{R}^{H \times W \times 3 \times (2r+1) \times (2r+1)}$ to $M_c \in \mathbb{R}^{3(2r+1)(2r+1) \times H \times W}$ through dimension reshaping and permutation. Given an index $(x, y)$ along the $H$ and $W$ dimensions of the correlation matrix, the elements stored across the channels ($C = 3(2r+1)(2r+1)$) at position $(x, y)$ in $M_c$ represent the local sampling of $f_t$ at $(x, y)$ on $f_{t+1}$. On this basis, we design a correlation matrix filter for sign language recognition models following the spatial-channel attention paradigm of CBAM. It is noteworthy that this paper provides an interpretability analysis of the proposed correlation matrix filter and introduces specialized improvements tailored to the modeling characteristics of sign language recognition. The overall workflow of the correlation matrix filter is shown in Figure 5.
(1) Independent motion modeling (spatial attention). Based on the aforementioned analysis, each element of $M_c$ stores a local sampling result of an element in $f_t$. If the sampling result is reliable, it indicates that the sampling range exhibits generally high semantic similarity with potentially smaller distribution variance. Therefore, compared to traditional spatial attention mechanisms, we additionally introduce a meaningful variance descriptor for the channel distributions. Through spatial attention, unreliable motion modeling between frames can be independently suppressed at each feature point. The specific attention generation formula is as follows:
$$\mathrm{Attn}_s = \alpha_1 \mathrm{Conv}_{Avg}\big(F_{Avg}(M_c)\big) + \alpha_2 \mathrm{Conv}_{Max}\big(F_{Max}(M_c)\big) + \alpha_3 \mathrm{Conv}_{Var}\big(F_{Var}(M_c)\big)$$
$$\mathrm{Attn}_s = \mathrm{Sigmoid}(\mathrm{Attn}_s)$$
where the variance $F_{Var}(\cdot)$, average $F_{Avg}(\cdot)$, and maximum $F_{Max}(\cdot)$ descriptors are introduced to comprehensively describe the feature distribution of the local sampling for modeling independent element motion, and $\alpha_1$, $\alpha_2$, and $\alpha_3$ are learnable weights.
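A minimal sketch of this spatial attention is shown below; the 7×7 convolution kernel and the use of a separate single-channel convolution per descriptor are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CorrSpatialAttention(nn.Module):
    # Sketch of the independent-motion spatial attention over the correlation matrix (illustrative).
    def __init__(self, kernel_size=7):
        super().__init__()
        pad = kernel_size // 2
        # One single-channel convolution per descriptor (avg / max / var); kernel size is an assumption.
        self.conv_avg = nn.Conv2d(1, 1, kernel_size, padding=pad)
        self.conv_max = nn.Conv2d(1, 1, kernel_size, padding=pad)
        self.conv_var = nn.Conv2d(1, 1, kernel_size, padding=pad)
        self.alpha = nn.Parameter(torch.ones(3))   # learnable alpha_1, alpha_2, alpha_3

    def forward(self, m_c):
        # m_c: (B, C', H, W) with C' = 3*(2r+1)^2 local correlation values per element.
        avg = m_c.mean(dim=1, keepdim=True)
        mx = m_c.max(dim=1, keepdim=True).values
        var = m_c.var(dim=1, keepdim=True, unbiased=False)
        attn = (self.alpha[0] * self.conv_avg(avg)
                + self.alpha[1] * self.conv_max(mx)
                + self.alpha[2] * self.conv_var(var))
        return torch.sigmoid(attn)                  # (B, 1, H, W) per-point reliability mask
```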
(2) Multi-scale neighborhoods motion modeling (channel attention). Since spatial attention only describes the motion reliability of one single element in f t , it loses spatial contextual information. However, in real-world sign language articulation, local spatial contexts often exhibit consistent motion trends. Therefore, by jointly modeling the motion patterns of all feature points within a local neighborhood, we could more effectively identify high-confidence regions in the correlation matrix.
The original channel attention mechanism primarily focuses on the global feature distribution across the entire spatial domain, which does not align well with the characteristics of sign language recognition. In contrast, features within local neighborhoods are more likely to share similar motion patterns. Hence, we employ convolution-based downsampling to generate multi-subregion channel attention maps. These maps are then upsampled via differentiable bilinear interpolation, effectively performing a broadcasting-like operation to align with the resolution of the original $f_t$ features. The multi-subregion channel attention maps are computed as follows:
$$\mathrm{Attn}_{cs} = \mathrm{Sigmoid}\big(F_{intp}(\mathrm{Conv}_s(M_c))\big), \quad \mathrm{Attn}_{cm} = \mathrm{Sigmoid}\big(F_{intp}(\mathrm{Conv}_m(M_c))\big)$$
where $F_{intp}$ denotes bilinear interpolation, and $\mathrm{Conv}_s$ and $\mathrm{Conv}_m$ downsample the features in a convolutional manner to 1/4 and 1/16 of the original resolution, respectively. After obtaining both the spatial and channel attention weights, the enhanced correlation matrix is ultimately computed using the following formula:
$$M_c = \big(\beta_1 \mathrm{Attn}_{cs} M_c + \beta_2 \mathrm{Attn}_{cm} M_c\big) \mathrm{Attn}_s$$
where $\beta_1$ and $\beta_2$ are learnable weights.
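The following sketch illustrates the multi-subregion channel attention together with the final CMF fusion; the strided-convolution downsampling factors and layer names are assumptions, and the products with the attention maps are realized as element-wise (broadcast) multiplications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrChannelFilter(nn.Module):
    # Sketch of multi-subregion channel attention plus the CMF fusion (illustrative only).
    def __init__(self, channels):
        super().__init__()
        # Strided convolutions yield subregion-wise channel attention at coarser resolutions
        # (the exact downsampling factors are assumptions).
        self.conv_s = nn.Conv2d(channels, channels, kernel_size=2, stride=2)  # 1/4 of the pixels
        self.conv_m = nn.Conv2d(channels, channels, kernel_size=4, stride=4)  # 1/16 of the pixels
        self.beta = nn.Parameter(torch.ones(2))                                # learnable beta_1, beta_2

    def forward(self, m_c, attn_s):
        # m_c: (B, C', H, W) correlation matrix; attn_s: (B, 1, H, W) spatial reliability mask.
        H, W = m_c.shape[-2:]
        up = lambda x: F.interpolate(x, size=(H, W), mode="bilinear", align_corners=False)
        attn_cs = torch.sigmoid(up(self.conv_s(m_c)))    # finer subregion channel attention
        attn_cm = torch.sigmoid(up(self.conv_m(m_c)))    # coarser subregion channel attention
        # Fuse the two channel attentions and apply the spatial mask (all element-wise).
        return (self.beta[0] * attn_cs * m_c + self.beta[1] * attn_cm * m_c) * attn_s
```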

3.4. Temporal Affine Transformation

After processing through the correlation matrix filter, we obtain an attention-enhanced correlation matrix. Since time step $t$ simultaneously establishes temporal associations with both the $t-1$ and $t+1$ time steps, this ultimately yields two correlation matrices within the temporal neighborhood of time step $t$, which we denote as $M_c^l$ and $M_c^r$. To achieve temporal enhancement at the frame-wise stage, we perform a temporal affine transformation by multiplying $M_c^l$ and $M_c^r$ with the feature $f_t$ through matrix multiplication. The specific computation is as follows:
$$f_t' = F_{mul}(M_c^l, f_t)\,\alpha_1 + F_{mul}(M_c^r, f_t)\,\alpha_2$$
where $F_{mul}(\cdot)$ denotes matrix multiplication, $f_t'$ is the enhanced frame-wise feature, and $\alpha_1$ and $\alpha_2$ are weights for temporal feature selection. Through the methods proposed above, spatio-temporal enhancement is accomplished within the frame-wise feature extraction stages.

3.5. Long-Term Temporal Enhanced Module Based on Spatial Aggregation

The deformable correlation sampling method employed in previous work can only focus on motion features between adjacent frames, which is insufficient to fully characterize the overall limb movement trajectory in sign language. Given that sign language videos typically contain hundreds of frames, directly applying 3D convolution would incur excessive computational overhead and GPU memory usage, whereas using sparse 3D convolution may miss critical temporal segments. To address this, we propose a long-term enhanced module based on spatial aggregation, as shown in Figure 6. In contrast to the aforementioned deformable correlation module, which can only process spatio-temporal information and body trajectories of adjacent frames, this module extends spatio-temporal context modeling to capture sign language feature representations of the entire video sequence.
(1) Global Temporal Embedding based on spatial aggregation. First, two convolutional operations with different receptive fields are applied to the original entire-video features $f_t$ to enhance spatial perception diversity. To accommodate the convolutional operations, tensor shapes and dimensional arrangements require adjustment before and after processing. Subsequently, spatial distribution means are computed separately for the output tensors from the different receptive-field convolutions to achieve spatial feature aggregation. These aggregated features are then concatenated to form a global temporal embedding, which is computed as follows:
$$f_{e1}, f_{e2} = F_{RP}\big(\mathrm{Conv2d}_1(F_{PR}(f_t)),\ \mathrm{Conv2d}_2(F_{PR}(f_t))\big)$$
$$f_{gte} = \mathrm{Concat}\big(F_{Max}(f_{e1}),\ F_{Max}(f_{e2})\big)$$
where $F_{PR}$ denotes the combined reshape-and-permutation operation, $F_{RP}$ is its inverse operation, and $f_{gte}$ is the global temporal embedding.
(2) Long-term Temporal Enhancement. After obtaining the global temporal embedding $f_{gte}$, this paper establishes global temporal correlations through a self-attention mechanism. Since spatial aggregation has been completed, the tensor shape of the global temporal embedding is $(B, C, T)$, where the temporal dimension $T$ can be regarded as equivalent to the text token length in NLP tasks. Accordingly, we designed analogous query, key, and value mapping layers based on 1D convolutions. Subsequently, matrix multiplication between the query and the key yields a $T \times T$ global temporal correlation matrix, which is normalized via softmax. This normalized matrix is then multiplied by the value to generate a temporal enhancement mask $M_s$. Finally, $M_s$ is added to $f_t$, thereby accomplishing long-term temporal enhancement, which can be formulated as follows:
$$Q, K, V = \mathrm{Conv}_Q(f_{gte}),\ \mathrm{Conv}_K(f_{gte}),\ \mathrm{Conv}_V(f_{gte})$$
$$M_s = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$
$$f_t = \mathrm{Broadcast}(M_s) + f_t$$
where $\mathrm{Broadcast}(\cdot)$ is the broadcast operation that aligns $M_s$ with $f_t$.
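As an illustration, a compact PyTorch sketch of this spatial-aggregation-plus-temporal-attention design is given below; the convolution kernel sizes, the max-based spatial aggregation, the projection width, and the (B, T, C, H, W) tensor layout are assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn

class LongTermTemporalEnhance(nn.Module):
    # Sketch of the spatial-aggregation long-term temporal enhancement (illustrative only).
    def __init__(self, channels, dim=256):
        super().__init__()
        # Two 2D convolutions with different receptive fields for spatial perception diversity.
        self.conv_a = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv_b = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
        # 1D query / key / value projections over the temporal dimension.
        self.q = nn.Conv1d(2 * channels, dim, 1)
        self.k = nn.Conv1d(2 * channels, dim, 1)
        self.v = nn.Conv1d(2 * channels, channels, 1)

    def forward(self, f):
        # f: (B, T, C, H, W) frame-wise features of the whole video.
        B, T, C, H, W = f.shape
        x = f.reshape(B * T, C, H, W)                       # reshape/permute so 2D convs act per frame
        a = self.conv_a(x).amax(dim=(-2, -1))               # spatial aggregation -> (B*T, C)
        b = self.conv_b(x).amax(dim=(-2, -1))
        gte = torch.cat([a, b], dim=1).reshape(B, T, 2 * C).transpose(1, 2)   # embedding (B, 2C, T)

        q, k, v = self.q(gte), self.k(gte), self.v(gte)     # (B, dim, T), (B, dim, T), (B, C, T)
        attn = torch.softmax(q.transpose(1, 2) @ k / (q.shape[1] ** 0.5), dim=-1)   # (B, T, T)
        m_s = (v @ attn.transpose(1, 2)).transpose(1, 2)    # temporal enhancement mask (B, T, C)
        # Broadcast the per-frame mask over the spatial dimensions and add it to the features.
        return f + m_s.reshape(B, T, C, 1, 1)
```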
During the feature-wise extraction phase, the three proposed spatiotemporal modeling schemes provide robust upstream support for the subsequent mining of reliable spatiotemporal contexts and sign language representation cues, ultimately ensuring that the sign language recognition model generates accurate natural language descriptions corresponding to the input videos.

4. Experiments

4.1. Datasets

PHOENIX14 [29]. The PHOENIX14 dataset was collected from German weather forecast programs, with each video sequence consisting of a sign language presenter and a clean background. The image sequences in the videos have a resolution of 210 × 260. The entire dataset contains 6841 sentences, comprising a total of 1295 sign language vocabulary items. For ease of training and testing, PHOENIX14 divides the dataset into 5672 training samples, 540 validation samples, and 629 test samples.
PHOENIX14-T [30]. The PHOENIX14-T dataset is an extension of the PHOENIX corpus, containing sign language recognition videos, sign-gloss annotations, and German translations. PHOENIX14-T includes 1085 sign language vocabulary items, totaling 8247 sentences. The dataset is partitioned into 7096 training samples, 519 validation samples, and 642 test samples.
CE-CSL [31]. The CE-CSL dataset features 12 sign language performers, including 8 females and 4 males. Among them, two are hearing-impaired individuals who primarily use sign language for daily communication, while the remaining performers are professional sign language interpreters, ensuring the dataset’s expertise and diversity. CE-CSL divides the entire dataset into 4973 training videos, 515 validation videos, and 500 test videos. These videos cover 3515 Chinese vocabulary items, encompassing a wide range of daily communication phrases. Notably, like PHOENIX14, it is a publicly available dataset, contributing to the sign language recognition research community.

4.2. Training Details

Network Details. The proposed Deformable Correlation Network (DCA) employs ResNet18 as the 2D CNN backbone for the frame-wise stage, initialized with ImageNet pre-trained weights to accelerate training and convergence. In the gloss-wise stage, DCA adopts the temporal dimension reduction network. This temporal network primarily consists of 1D convolutional layers, structured as $\{K5, P2, K5, P2\}$, where $K\sigma$ and $P\sigma$ denote a convolutional layer with kernel size $\sigma$ and a pooling layer, respectively. After temporal modeling, the frame-wise features are compressed into the specified gloss-wise dimension. Following the 1D CNN processing, a Bi-LSTM with a hidden size of 1024 is applied for long-term temporal modeling. Finally, the classification layer outputs the predicted sentence corresponding to the sign language video.
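A minimal sketch of this gloss-wise stage is given below; only the $\{K5, P2, K5, P2\}$ 1D-convolution structure and the Bi-LSTM hidden size of 1024 follow the description above, while the input feature width, the number of LSTM layers, and the vocabulary size are assumptions.

```python
import torch
import torch.nn as nn

class GlossWiseHead(nn.Module):
    # Sketch of the gloss-wise stage: {K5, P2, K5, P2} 1D convs, Bi-LSTM, classifier (illustrative).
    def __init__(self, in_dim=512, hidden=1024, num_gloss=1296):   # widths and vocab are assumptions
        super().__init__()
        self.tconv = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=5), nn.ReLU(), nn.MaxPool1d(2),   # K5, P2
            nn.Conv1d(hidden, hidden, kernel_size=5), nn.ReLU(), nn.MaxPool1d(2),   # K5, P2
        )
        self.bilstm = nn.LSTM(hidden, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_gloss)          # gloss logits (incl. CTC blank)

    def forward(self, frame_feats):
        # frame_feats: (B, T, C) spatially pooled frame-wise features from the backbone.
        x = self.tconv(frame_feats.transpose(1, 2)).transpose(1, 2)  # (B, T', hidden)
        x, _ = self.bilstm(x)
        return self.classifier(x)                                     # (B, T', num_gloss)
```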
Training Configuration. The model is trained for 50 epochs, with an initial learning rate of $1 \times 10^{-4}$. Adam is selected as the optimizer, with a weight decay of $1 \times 10^{-5}$. Furthermore, a MultiStepLR scheduler is used to adjust the learning rate. For input data preprocessing, videos are first resized to $256 \times 256$, followed by random cropping to $224 \times 224$. The loss function follows CorrNet, combining the VE loss and VA loss with weights of 1.0 and 25.0, respectively. The entire model is trained and evaluated on a single NVIDIA RTX 3090 GPU.
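The corresponding optimizer and scheduler setup can be sketched as follows; the MultiStepLR milestones and decay factor are assumptions, as they are not specified above, and the model object is a stand-in.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)   # stand-in for the full DCA model (illustrative only)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
# Milestones and gamma below are assumptions; only the scheduler type is stated in the text.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 40], gamma=0.2)

for epoch in range(50):
    # ... one training epoch over the sign language videos (CTC plus VE/VA losses) ...
    scheduler.step()
```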

4.3. Evaluation Metric

We employ the Word Error Rate (WER), a well-established evaluation metric in sign language recognition, to assess model performance. WER measures recognition accuracy by calculating the minimum number of operations (deletions, insertions, and substitutions) required to align the predicted sentence with the ground truth. The specific computation formula is as follows:
$$\mathrm{WER} = \frac{\#\mathrm{sub} + \#\mathrm{ins} + \#\mathrm{del}}{\#\mathrm{reference}}$$
Note that a lower WER indicates higher recognition accuracy.
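As a concrete reference, the following is a minimal Python sketch of this edit-distance computation over gloss sequences; the example glosses are illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """Compute WER between a reference and a predicted gloss sequence (lists of glosses)."""
    R, H = len(reference), len(hypothesis)
    # d[i][j] = minimum edits turning the first i reference glosses into the first j predicted ones.
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i                       # deletions
    for j in range(H + 1):
        d[0][j] = j                       # insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[R][H] / max(R, 1)

# Example (illustrative glosses): one substitution and one deletion over a 4-gloss reference -> 0.5
print(word_error_rate(["HEUTE", "NORD", "REGEN", "SIL"], ["HEUTE", "SUED", "REGEN"]))
```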

4.4. Ablation Study

We conduct ablation studies of the proposed model on both the PHOENIX14 and CE-CSL datasets. To thoroughly validate the effectiveness of the proposed motion-prior-based Deformable Correlation (DC) module, Correlation Matrix Filter (CMF), and Long-term Temporal Enhanced (LTE) module, comprehensive ablation analyses were performed on the validation and test sets of both datasets. Additionally, detailed performance analyses are conducted on the test sets to evaluate specific configurations of the DC and CMF modules.
(1) Overall ablation analysis
As shown in Table 1, by incorporating the deformable correlation (DC) module with a sampling range of $H/2 \times W/2$ and dilation operations, our DCA achieves a 4.2% reduction in WER on the PHOENIX14 validation set and a 3.6% reduction on its test set compared to the baseline. To verify DCA's performance improvement in Chinese contexts, we conducted specific experiments on the CE-CSL validation and test sets, obtaining WER reductions of 7.0% and 6.7%, respectively. This improvement stems from DCA's ability to suppress irrelevant information introduced by global correlations during frame-wise feature extraction while achieving precise inter-frame correlations through deformable sampling, thereby enhancing temporal modeling capability and overall sign language recognition performance.
By integrating the CMF module that employs the distribution mean, maximum, and variance for attention modeling, our model demonstrates an average 3.1% WER reduction on both validation and test sets of PHOENIX14, and an average 4.3% reduction on CE-CSL compared to baseline CorrNet. This improvement is attributed to CMF’s effective spatial attention mechanism, which combines correlation sampling distribution with learnable convolutional networks to suppress low-confidence regions in inter-frame correlation construction.
By independently embedding the LTE module, the sign language recognition accuracy improved on both the test and validation sets of the two datasets. Specifically, the WER metric decreased by an average of 0.6 on PHOENIX14 and by an average of 1.1 on CE-CSL. This improvement benefits from the effective utilization of global temporal information, which assists the sign language recognition system in achieving reliable limb motion modeling and inferring more accurate sign language information.
Finally, the combined integration of the DC, CMF, and LTE modules yields WER reductions of 1.0 and 0.9 on PHOENIX14's validation and test sets, respectively, and 5.1 and 4.9 on CE-CSL's validation and test sets. These results clearly demonstrate that DC, CMF, and LTE provide orthogonal performance gains for the overall sign language recognition system.
(2) Detailed analysis of DC module
To fully demonstrate the reliability of the motion-prior-based DC module with a multi-scale sampling configuration, we conducted experiments covering three aspects: correlation sampling scale, initial sampling point arrangement, and deformable offset generation.
Sampling scale analysis (Motion-prior). First, we investigated three down-sampling scales in combination with a fixed sampling radius ($r = 2$): $\{1,\ 1+1/2,\ 1+1/2+1/4\}$. As shown in Table 2, the DCA achieves WER scores of 19.1 and 45.5 on PHOENIX14 and CE-CSL, respectively, when only using the original $\{1\}$ scale. Our empirical findings demonstrate that as the sampling results across multiple scales are progressively aggregated, the performance of the sign language recognition model continues to improve. When all scales $\{1 + 1/2 + 1/4\}$ are utilized, the model achieves WER scores of 18.7 and 44.9 on the two test datasets, respectively.
Initial sampling point arrangement. We validate the performance of dilation sampling (with a sampling area of $H/2 \times W/2$ per point) and motion-prior-based sampling (with a sampling area of $(2r+1)^2$ per point) using only the original-scale sampling results (where the resolution of $f_{t+1}$ or $f_{t-1}$ is $H \times W$). The motion-prior-based approach outperforms dilation sampling even at the original scale, further reducing the Word Error Rate (WER) by 0.3 and 1.1 on the PHOENIX14 and CE-CSL datasets, respectively, compared to the baseline.
Offset generation methodology. Finally, we conducted a detailed experimental analysis of offset generation methods. For a fair comparison, we evaluate both offset generation methods across all motion-prior-based sampling scales. Considering that correlation computation primarily establishes temporal relationships between frames, we propose jointly generating offsets from both correlated frames. We designed two feature fusion strategies: (i) direct element-wise addition followed by convolutional processing, and (ii) channel-wise concatenation followed by convolution. Results in Table 2 demonstrate that the channel concatenation approach yields superior recognition performance, reducing WER scores by 0.6 and 1.3 compared to the element-wise addition method on the PHOENIX14 and CE-CSL test sets, respectively.
(3) Detail analysis of CMF module
As shown in Table 3, we employ only the spatial attention (SA) module to process the correlation matrix, achieving average WER scores of 19.0 and 45.75 on the PHOENIX14 and CE-CSL datasets, respectively. By leveraging the independent motion modeling capability provided by spatial attention (SA), the model can assess motion reliability through correlation sampling at individual points. Consequently, applying SA-generated masks to the correlation matrix yields performance improvements. Furthermore, when channel attention (CA) is independently incorporated, superior sign language recognition accuracy is achieved compared to the baseline across both datasets. This enhancement stems from the model’s neighborhood motion modeling capability, which enables it to infer the reliability of local motion based on spatial context. Ultimately, the integration of SA and CA produces significant performance gains.
Table 4 presents a comprehensive efficiency comparison between the baseline and our proposed method. To ensure a fair comparison, we focus solely on the differences between the correlation module in CorrNet and our proposed module across four performance metrics. For this purpose, we standardize the input video feature size to 1 × 256 × 256 × 28 × 28 . First, regarding computational complexity, our approach demonstrates superior efficiency, reducing the FLOPs from 15.57 G (Baseline) to 12.53 G. Second, in terms of model size, the proposed method optimizes the architecture effectively, decreasing the parameter count from 77.66 K to 72.42 K. Third, concerning memory consumption, our method significantly lowers the GPU memory footprint to 4.3 GiB compared to the baseline’s 5.1 GiB, facilitating deployment on resource-constrained devices. Finally, regarding inference speed, our full scheme demonstrates a marginal decrease in inference latency compared to the baseline. However, for the correlation matrix calculation component alone, our method reduces the latency from 40.13 ms to 36.86 ms.

4.5. Comparison with State-of-the-Art Methods

(1) Comparative Analysis on PHOENIX14 and PHOENIX14-T
As shown in Table 5, we conducted a comprehensive comparison between our proposed DCA sign language recognition algorithm and several state-of-the-art methods. By incorporating the deformable correlation module and correlation matrix filter, our DCA achieves average WER scores of 18.3 and 18.5 on PHOENIX14 and PHOENIX14-T datasets. These results demonstrate the superior performance and competitive advantage of our proposed algorithm.
(2) Comparative Analysis on CE-CSL
To validate the effectiveness of our algorithm in Chinese sign language recognition, we compared DCA with state-of-the-art approaches on the CE-CSL dataset. As presented in Table 6, the DCA algorithm achieves the lowest average WER score of 42.05. While its performance on the validation set is slightly inferior to THNet, DCA establishes new state-of-the-art performance on the test set. These findings confirm that the deformable correlation module enables more effective utilization of temporal information, leading to exceptional performance in Chinese sign language recognition.

4.6. Visualization Analysis

This section presents attention visualizations for the more challenging CE-CSL dataset. Experimental results verify that our approach remains capable of reliably describing sign language information carriers even in the presence of substantial background noise.
(1) PCA Visualization of Correlation Matrix
As shown in Figure 7, we perform “reshape” operations on the correlation matrices of CorrNet and our DCA, transforming their dimensions from $H \times W \times H \times W$ to $H \times W \times C_1$ ($C_1 = HW$) for CorrNet and from $H \times W \times 3 \times (2r+1) \times (2r+1)$ to $H \times W \times C_2$ ($C_2 = 3(2r+1)^2$) for our DCA. This allows us to employ principal component analysis (PCA) along the channel dimension to extract key features, thereby achieving visualization of the two correlation matrices. It can be clearly observed that, compared with CorrNet, DCA more effectively focuses on the performer’s limb movements, thereby improving sign language recognition performance.
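A minimal sketch of this channel-wise PCA projection is given below; it assumes the correlation matrix has already been reshaped to a (C, H, W) tensor as described, and the use of torch.pca_lowrank and the min-max normalization are illustrative choices rather than the exact visualization pipeline.

```python
import torch

def pca_visualize_corr(m_c, out_channels=3):
    # m_c: (C, H, W) float tensor, e.g., C = H*W for CorrNet or 3*(2r+1)^2 for DCA after reshaping.
    C, H, W = m_c.shape
    x = m_c.reshape(C, H * W).T                 # (H*W, C): one C-dimensional descriptor per pixel
    x = x - x.mean(dim=0, keepdim=True)         # center the descriptors before PCA
    # torch.pca_lowrank returns (U, S, V); the columns of V are principal directions.
    _, _, v = torch.pca_lowrank(x, q=out_channels)
    proj = x @ v[:, :out_channels]              # project onto the top components -> (H*W, out_channels)
    proj = (proj - proj.min()) / (proj.max() - proj.min() + 1e-8)   # normalize to [0, 1]
    return proj.T.reshape(out_channels, H, W)   # e.g., render the 3 channels as an RGB heatmap
```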
(2) Grad-CAM Visualization of Correlation Matrix
As illustrated in Figure 8, we employ Grad-CAM to visualize spatiotemporal correlation heatmaps of the filtered correlation matrices, leveraging gradient information from the proposed CMF (Correlation Matrix Filter). Notably, the CMF-enhanced correlation matrices can more accurately characterize the performer’s limb movements while suppressing interference from irrelevant information, thereby further improving the accuracy of sign language recognition.

4.7. Qualitative Analysis

As illustrated in Figure 9, in both scenarios CorrNet exhibits significant discrepancies between its predictions for the blank token “SIL” and the ground-truth annotations. Moreover, in full-video sign language prediction, CorrNet is more prone to erroneous gloss judgments, leading to substantial deviations between the predicted signs and the performer’s actual limb movements. This issue stems from global inter-frame correlation, which introduces excessive redundancy, forcing each feature element to attend not only to motion patterns but also to abundant background features. Consequently, the model encounters difficulties in temporal feature learning. Additionally, temporal dilation further hinders the model from effectively leveraging complete sequential information.
In contrast, DCA effectively addresses these limitations. Our approach demonstrates stronger alignment with ground-truth labels in frame-level gloss prediction while reducing the probability of incorrect sign predictions.

4.8. Convergence Analysis

As illustrated in Figure 10, we present the Word Error Rate (WER) curves on both datasets during the validation and testing phases. The DCA model requires 50 training epochs in total, with evaluations performed on both validation and test sets after each epoch. Results demonstrate that DCA begins to converge stably after 20 epochs and ultimately achieves superior performance compared to the baseline CorrNet.

5. Limitations and Future Work

While our proposed Deformable Correlation Network (DCA) demonstrates significant improvements in sign language recognition accuracy, there are still some limitations to consider. First, the proposed deformable correlation model relies heavily on the effectiveness of adaptive offsets predicted by the network; however, in complex scenarios, it may fail to consistently capture reliable spatio-temporal information. Additionally, implementing this module with off-the-shelf PyTorch (1.10) operators is challenging, increasing implementation complexity. Furthermore, our current validation is limited to public benchmarks; experiments demonstrating the method’s generalization to diverse, complex scenarios are insufficient, and further verification of the scheme’s robustness is required.
To address these limitations, we aim to refine the low-level optimization of the deformable operators to achieve seamless integration. Simultaneously, to tackle the issue of limited data scenarios, we plan to validate the model’s robustness and generalization through data augmentation or by collecting datasets from real-world environments. Finally, given the rapid advancements in Large Models, we intend to leverage cross-modal knowledge transfer in future work to further enhance the performance of our sign language recognition algorithm.

6. Conclusions

This paper analyzes the correlation computation process in CorrNet and identifies that its global association introduces redundant irrelevant features while increasing computational overhead. To address these issues, we propose a deformable correlation network as a targeted improvement. First, a motion-prior-based deformable correlation module is introduced to ensure lower computational cost during the multi-scale correlation sampling stages and enable adaptive adjustment of the sampling range, thereby providing a better technical solution for inter-frame association. Additionally, a correlation matrix filter is incorporated to further suppress low-confidence regions in the correlation matrix, ensuring the effectiveness of subsequent temporal affine computations. Thirdly, a long-term temporal enhanced module is devised to efficiently leverage global temporal information for enhancing the sign language recognition accuracy. The proposed methods significantly enhance the performance of CSLR models. We believe that our proposed model, presented in this study, will offer practical value for the deaf community and educational applications.

Author Contributions

Conceptualization, Y.J. and D.Y.; methodology, Y.J.; software, Y.J.; validation, Y.J., D.Y. and C.C.; formal analysis, Y.J.; investigation, Y.J.; resources, D.Y.; data curation, Y.J.; writing—original draft preparation, Y.J.; writing—review and editing, D.Y. and C.C.; visualization, Y.J.; supervision, D.Y.; project administration, D.Y.; funding acquisition, D.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was jointly supported by the National Natural Science Foundation of China (Grant No. 62003065), the Natural Science Foundation of Chongqing, China (Grant No. CSTB2023NSCQ-MSX0771), and the Science and Technology Research Program of Chongqing Municipal Education Commission of China (Grant No. KJZD-K202300502).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/jacky218/DCANet/tree/master (accessed on 1 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Aloysius, N.; Geetha, M. Understanding vision-based continuous sign language recognition. Multimed. Tools Appl. 2020, 79, 22177–22209.
2. Cui, R.; Liu, H.; Zhang, C. Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7361–7369.
3. Liu, C.; Hu, L. Rethinking the temporal downsampling paradigm for continuous sign language recognition. Multimed. Syst. 2025, 31, 134.
4. Wang, S.; Guo, L.; Xue, W. Dynamical semantic enhancement network for continuous sign language recognition. Multimed. Syst. 2024, 30, 313.
5. Cui, R.; Liu, H.; Zhang, C. A deep neural framework for continuous sign language recognition by iterative training. IEEE Trans. Multimed. 2019, 21, 1880–1891.
6. Cheng, K.L.; Yang, Z.; Chen, Q.; Tai, Y.W. Fully convolutional networks for continuous sign language recognition. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXIV. Springer: Berlin/Heidelberg, Germany, 2020; pp. 697–714.
7. Hao, A.; Min, Y.; Chen, X. Self-mutual distillation learning for continuous sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 11303–11312.
8. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308.
9. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459.
10. Lin, J.; Gan, C.; Han, S. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7083–7093.
11. Liu, Z.; Luo, D.; Wang, Y.; Wang, L.; Tai, Y.; Wang, C.; Li, J.; Huang, F.; Lu, T. Teinet: Towards an efficient architecture for video recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11669–11676.
12. Hu, L.; Gao, L.; Liu, Z.; Feng, W. Continuous sign language recognition with correlation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2529–2539.
13. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773.
14. Huang, Y.; Xiao, Z.; Firkat, E.; Zhang, J.; Wu, D.; Hamdulla, A. Spatio-temporal mix deformable feature extractor in visual tracking. Expert Syst. Appl. 2024, 237, 121377.
15. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
16. Koller, O.; Zargaran, S.; Ney, H.; Bowden, R. Deep sign: Hybrid CNN-HMM for continuous sign language recognition. In Proceedings of the British Machine Vision Conference 2016, York, UK, 19–22 September 2016.
17. Koller, O.; Zargaran, S.; Ney, H. Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4297–4305.
18. Liu, P.; Li, G.; Zhao, W.; Tang, X. A coupling method of learning structured support correlation filters for visual tracking. Vis. Comput. 2024, 40, 181–199.
19. Chen, Z.; Liu, L.; Yu, Z. Toward robust visual tracking for UAV with adaptive spatial-temporal weighted regularization. Vis. Comput. 2024, 40, 8987–9003.
20. Aiman, U.; Ahmad, T. Angle based hand gesture recognition using graph convolutional network. Comput. Animat. Virtual Worlds 2024, 35, e2207.
21. Xiao, Z.; Chen, Y.; Zhou, X.; He, M.; Liu, L.; Yu, F.; Jiang, M. Human action recognition in immersive virtual reality based on multi-scale spatio-temporal attention network. Comput. Animat. Virtual Worlds 2024, 35, e2293.
22. Xue, S.; Gao, L.; Wan, L.; Feng, W. Multi-scale context-aware network for continuous sign language recognition. Virtual Real. Intell. Hardw. 2024, 6, 323–337.
23. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376.
24. Min, Y.; Hao, A.; Chai, X.; Chen, X. Visual alignment constraint for continuous sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 11542–11551.
25. Hu, L.; Gao, L.; Liu, Z.; Feng, W. Temporal lift pooling for continuous sign language recognition. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 511–527.
26. Hu, L.; Gao, L.; Liu, Z.; Feng, W. Self-emphasizing network for continuous sign language recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 854–862.
27. Ying, X.; Wang, L.; Wang, Y.; Sheng, W.; An, W.; Guo, Y. Deformable 3d convolution for video super-resolution. IEEE Signal Process. Lett. 2020, 27, 1500–1504.
28. Huang, Y.; Ji, L.; Liu, H.; Ye, M. LGU-SLAM: Learnable Gaussian Uncertainty Matching with Deformable Correlation Sampling for Deep Visual SLAM. arXiv 2024, arXiv:2410.23231.
29. Koller, O.; Forster, J.; Ney, H. Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Comput. Vis. Image Underst. 2015, 141, 108–125.
30. Camgoz, N.C.; Hadfield, S.; Koller, O.; Ney, H.; Bowden, R. Neural sign language translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7784–7793.
31. Zhu, Q.; Li, J.; Yuan, F.; Fan, J.; Gan, Q. A Chinese Continuous Sign Language Dataset Based on Complex Environments. arXiv 2024, arXiv:2409.11960.
32. Zhu, Q.; Li, J.; Yuan, F.; Gan, Q. Multiscale temporal network for continuous sign language recognition. J. Electron. Imaging 2024, 33, 023059.
33. Hu, L.; Gao, L.; Liu, Z.; Feng, W. Scalable frame resolution for efficient continuous sign language recognition. Pattern Recognit. 2024, 145, 109903.
34. Zhou, H.; Zhou, W.; Zhou, Y.; Li, H. Spatial-temporal multi-cue network for continuous sign language recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13009–13016.
35. Kan, J.; Hu, K.; Hagenbuchner, M.; Tsoi, A.C.; Bennamoun, M.; Wang, Z. Sign language translation with hierarchical spatio-temporal graph neural network. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 3367–3376.
36. Liu, J.; Xue, W.; Zhang, K.; Yuan, T.; Chen, S. Tb-net: Intra- and inter-video correlation learning for continuous sign language recognition. Inf. Fusion 2024, 109, 102438.
37. Lu, H.; Salah, A.A.; Poppe, R. Tcnet: Continuous sign language recognition from trajectories and correlated regions. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 3891–3899.
38. Zhu, Q.; Li, J.; Yuan, F.; Gan, Q. Continuous sign language recognition based on motor attention mechanism and frame-level self-distillation. Mach. Vis. Appl. 2025, 36, 7.
Figure 1. Comparison of global correlation and our spatio-temporal driven deformable correlation. The symbols + and * denote element-wise addition and element-wise multiplication, respectively.
Figure 2. Global overview of our proposed continuous sign language recognition framework based on the deformable correlation network.
Figure 3. Workflow of spatio-temporal driven deformable correlation.
Figure 4. Multi-scale correlation sampling based on the motion prior.
Figure 5. Workflow of the correlation matrix filter. The symbols + and × denote element-wise addition and element-wise multiplication, respectively.
Figure 6. Long-term temporal enhancement module.
Figure 7. Visualization of the correlation matrix based on PCA.
Figure 8. Visualization of the correlation matrix based on Grad-CAM.
Figure 9. Alignment results of sentence predictions.
Figure 10. Word Error Rate (WER) curves on the CE-CSL and PHOENIX14 datasets.
Table 1. Overall ablation analysis on PHOENIX14 and CE-CSL.

Configuration                   PHOENIX14 Val (%)   PHOENIX14 Test (%)   CE-CSL Val (%)   CE-CSL Test (%)
Baseline                        19.2                19.3                 47.2             46.5
Baseline + DC                   18.5                18.8                 43.9             43.4
Baseline + CMF                  18.6                18.7                 44.7             44.9
Baseline + LTE                  18.8                18.9                 46.3             46.1
Baseline + CMF + DC + LTE       18.4                18.6                 42.1             41.6
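All percentages reported in Tables 1–6 are Word Error Rates over predicted gloss sequences. As a reminder of how such figures are computed, the following is a minimal sketch of gloss-level WER via edit distance; the function name and the example glosses are illustrative placeholders, not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Gloss-level WER = (substitutions + deletions + insertions) / len(reference)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance between the two gloss sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution over four reference glosses -> WER = 25%.
print(word_error_rate("MORGEN REGEN NORD WIND", "MORGEN SONNE NORD WIND"))  # 0.25
```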
Table 2. Detailed analysis of the DC module on PHOENIX14 and CE-CSL.

DC Configuration                                   PHOENIX14 (Test %)   CE-CSL (Test %)
Baseline                                           19.3                 46.5
Sampling scale {1}                                 19.0                 45.4
Sampling scale {1 + 1/2}                           18.8                 45.2
Sampling scale {1 + 1/2 + 1/4}                     18.7                 44.9
Initial sampling arrangement, w/ dilation          19.2                 45.9
Initial sampling arrangement, w/ motion prior      19.0                 45.4
Offset generation, element-wise addition           19.3                 46.2
Offset generation, channel-wise concatenation      18.7                 44.9
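The offset-generation comparison in Table 2 (element-wise addition versus channel-wise concatenation of adjacent-frame features) can be illustrated with a short PyTorch sketch. The code below is an assumption-laden illustration rather than the authors' implementation: the layer names, the number of sampling points, and the correlation normalization are placeholders chosen only to show how concatenation-driven offsets can deform the sampling grid before computing inter-frame correlation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableCorrelationSketch(nn.Module):
    """Illustrative deformable correlation between two adjacent frame features."""
    def __init__(self, channels, num_points=9):
        super().__init__()
        self.num_points = num_points
        # Offsets predicted from the channel-wise concat of frame t and frame t+1.
        self.offset_head = nn.Conv2d(2 * channels, 2 * num_points, kernel_size=3, padding=1)

    def forward(self, feat_t, feat_next):
        # feat_t, feat_next: (B, C, H, W) features of adjacent frames.
        b, c, h, w = feat_t.shape
        offsets = self.offset_head(torch.cat([feat_t, feat_next], dim=1))   # (B, 2K, H, W)
        offsets = offsets.view(b, self.num_points, 2, h, w)

        # Base sampling grid in normalized [-1, 1] coordinates (x first, as grid_sample expects).
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=feat_t.device),
            torch.linspace(-1, 1, w, device=feat_t.device),
            indexing="ij",
        )
        base_grid = torch.stack([xs, ys], dim=-1)                           # (H, W, 2)

        responses = []
        for k in range(self.num_points):
            # Convert pixel offsets to the normalized grid_sample coordinate range.
            dx = offsets[:, k, 0] / max(w - 1, 1) * 2
            dy = offsets[:, k, 1] / max(h - 1, 1) * 2
            grid = base_grid.unsqueeze(0) + torch.stack([dx, dy], dim=-1)   # (B, H, W, 2)
            sampled = F.grid_sample(feat_next, grid, align_corners=True)    # (B, C, H, W)
            # Correlation response between the current feature and the deformed sample.
            responses.append((feat_t * sampled).sum(dim=1, keepdim=True) / c ** 0.5)
        return torch.cat(responses, dim=1)                                  # (B, K, H, W)
```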
Table 3. Detailed analysis of the CMF module on PHOENIX14 and CE-CSL.

Configuration     PHOENIX14 Val (%)   PHOENIX14 Test (%)   CE-CSL Val (%)   CE-CSL Test (%)
w/ SA + CA        18.6                18.7                 44.7             44.9
w/ SA             18.9                19.1                 45.6             45.9
w/ CA             18.8                18.9                 45.3             45.7
Table 4. Efficiency comparison between the baseline and our proposed method.

Method                         FLOPs (G)   Params (K)   Latency (ms)   MEM (GiB)
Baseline                       15.57       77.66        40.13          5.1
Baseline + DC                  10.22       68.74        36.86          3.6
Baseline + DC + CMF + LTE      12.53       72.42        39.93          4.3
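Table 4 reports computational cost (FLOPs, parameters), inference latency, and peak GPU memory. As a hedged illustration of how latency and memory figures of this kind are typically obtained for a PyTorch model, the sketch below shows one plausible measurement protocol; the input shape, warm-up count, and iteration count are assumptions and not the authors' exact settings.

```python
import time
import torch

def measure_latency_and_memory(model, input_shape=(1, 3, 224, 224), warmup=10, iters=50):
    """Return (average latency in ms, peak GPU memory in GiB) for one forward pass.

    input_shape is a placeholder; a CSLR model would take a video clip instead.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)

    with torch.no_grad():
        for _ in range(warmup):                    # warm-up to stabilize clocks and caches
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
            torch.cuda.reset_peak_memory_stats()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        latency_ms = (time.perf_counter() - start) / iters * 1000

    peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3 if device == "cuda" else float("nan")
    return latency_ms, peak_gib
```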
Table 5. Comparative analysis on PHOENIX14 and PHOENIX14-T.

Methods           PHOENIX14 Val (%)   PHOENIX14 Test (%)   PHOENIX14-T Val (%)   PHOENIX14-T Test (%)
VAC [24]          21.2                22.3                 –                     –
MSTNet [32]       20.3                21.4                 –                     –
SEN [26]          19.5                21.0                 19.3                  20.7
AdaSize [33]      19.7                20.9                 19.7                  20.7
STMC [34]         21.1                20.7                 19.6                  21.0
HST-GNN [35]      19.5                19.8                 20.1                  20.3
DSE [4]           18.6                19.8                 18.9                  19.9
TB-Net [36]       18.9                19.6                 18.8                  20.0
CorrNet [12]      19.2                19.4                 18.9                  20.5
TCNet [37]        18.1                18.9                 18.3                  19.4
MAM-FSD [38]      19.2                18.8                 18.2                  19.4
THNet [31]        18.7                18.6                 18.0                  19.1
DCA (Ours)        18.4                18.6                 18.3                  18.6
Table 6. Comparative analysis on CE-CSL.

Methods           CE-CSL Val (%)   CE-CSL Test (%)
MSTNet [32]       54.4             53.0
CorrNet [12]      47.2             46.5
SEN [26]          46.5             45.3
VAC [24]          45.1             43.3
MAM-FSD [38]      44.9             44.7
THNet [31]        42.1             41.9
DCA (Ours)        42.1             41.6