Article

Inverse Kinematics-Augmented Sign Language: A Simulation-Based Framework for Scalable Deep Gesture Recognition

Graduate School of Computer Science and Engineering, University of Aizu, Aizu-Wakamatsu 965-8580, Japan
*
Authors to whom correspondence should be addressed.
Algorithms 2025, 18(8), 463; https://doi.org/10.3390/a18080463
Submission received: 9 June 2025 / Revised: 9 July 2025 / Accepted: 19 July 2025 / Published: 24 July 2025

Abstract

In this work, we introduce IK-AUG, a unified algorithmic framework for kinematics-driven data augmentation tailored to sign language recognition (SLR). Departing from traditional augmentation techniques that operate at the pixel or feature level, our method integrates inverse kinematics (IK) and virtual simulation to synthesize anatomically valid gesture sequences within a structured 3D environment. The proposed system begins with sparse 3D keypoints extracted via a pose estimator and projects them into a virtual coordinate space. A differentiable IK solver based on forward-and-backward constrained optimization is then employed to reconstruct biomechanically plausible joint trajectories. To emulate natural signer variability and enhance data richness, we define a set of parametric perturbation operators spanning spatial displacement, depth modulation, and solver sensitivity control. These operators are embedded into a generative loop that transforms each original gesture sample into a diverse sequence cluster, forming a high-fidelity augmentation corpus. We benchmark our method across five deep sequence models (CNN3D, TCN, Transformer, Informer, and Sparse Transformer) and observe consistent improvements in accuracy and convergence. Notably, Informer achieves 94.1% validation accuracy with IK-AUG-enhanced training, underscoring the framework’s efficacy. These results suggest that algorithmic augmentation via kinematic modeling offers a scalable, annotation-free pathway for improving SLR systems and lays the foundation for future integration with multi-sensor inputs in hybrid recognition pipelines.

1. Introduction

Sign language serves as a primary mode of communication for millions of individuals with hearing or speech impairments [1,2]. Bridging the communication gap between signers and non-signers has long been a central challenge in inclusive human–computer interaction [3,4]. With the rise of deep learning and computer vision, there has been considerable progress in automatic Sign Language Recognition (SLR), where gesture sequences are mapped to textual or spoken representations [5,6,7]. Yet despite recent advances, existing SLR systems often fall short in real-world settings, primarily due to a lack of scalable, diverse, and structurally consistent training data [8].
Most current datasets rely on constrained acquisition environments and limited signer diversity, which in turn restricts the generalization capabilities of trained models [9,10]. Moreover, traditional data augmentation techniques such as image-level transformations or synthetic recombination fail to capture the biomechanical realism and temporal coherence intrinsic to human gestures [11,12]. As a result, models trained on such data often struggle with signer variability, camera-view inconsistencies, and gesture ambiguity in practical applications [13].
To address these challenges, we propose a unified algorithmic framework, Inverse Kinematics-Augmented Gesture Generator (IK-AUG), which systematically synthesizes biomechanically plausible sign language sequences through kinematic modeling and structured simulation. Instead of relying on pixel-level transformations or stochastic recombination, IK-AUG formulates sign augmentation as a pose-to-motion reconstruction problem, where sparse 3D landmarks are lifted into full joint trajectories using an inverse kinematics solver embedded in a virtual environment.
The core of IK-AUG integrates a differentiable kinematic engine based on forward-and-backward constraint propagation with a set of multi-modal perturbation operators parameterized by spatial noise, depth offsets, and solver sensitivity modulation. These operators act as domain-specific augmentation functions that preserve semantic intent while injecting controlled variability across anatomical and dynamic dimensions.
To enable comprehensive evaluation, we instantiate IK-AUG within a Unity-based simulation pipeline and generate an enriched dataset from limited raw gestures. The synthesized samples are then used to train five representative temporal recognition models: CNN3D [14], Temporal Convolutional Network (TCN) [15], Transformer [16], Informer [17], and Sparse Transformer [18], revealing consistent accuracy improvements and faster convergence. Notably, attention-based models exhibit the most significant gains, validating the framework’s ability to enhance generalization across spatiotemporal architectures.
By transforming raw motion capture into high-fidelity gesture distributions, our algorithm not only mitigates data scarcity but also establishes a scalable, annotation-free augmentation paradigm for future gesture-based recognition systems.
The key contributions of this work are summarized as follows:
  • We propose IK-AUG, a unified algorithmic framework that leverages inverse kinematics within a virtual simulation environment to generate anatomically and temporally coherent sign language sequences from sparse 3D inputs.
  • We design a set of differentiable perturbation operators, including depth modulation, spatial displacement, and solver sensitivity variation, that systematically inject realistic variability while preserving semantic consistency, forming a domain-specific augmentation module.
  • We conduct comprehensive empirical evaluations across five deep sequence models (CNN3D, TCN, Transformer, Informer, Sparse Transformer), demonstrating that IK-AUG consistently enhances convergence efficiency and classification performance, particularly in attention-based architectures.
  • We establish a foundation for future hybrid gesture learning systems by outlining a scalable pathway to integrate multi-sensor modalities (e.g., inertial and haptic glove data) into a unified simulation–supervision loop.
Collectively, this work situates itself at the algorithmic intersection of inverse kinematics, simulation-based data generation, and spatiotemporal learning, offering a modular and extensible paradigm for gesture-centric recognition tasks.

2. Related Work

2.1. Virtual and Immersive Environments for Sign Language Learning and Recognition

Virtual environments have increasingly been employed to facilitate interactive learning and recognition of sign language. Several works have explored immersive VR platforms that enable real-time interaction between learners and signing avatars, allowing for effective acquisition of vocabulary and grammar through motion tracking and feedback loops [19]. Leap Motion-based systems further provide low-latency hand tracking from egocentric views, supporting real-time gesture classification with lightweight hardware [20,21]. However, these efforts primarily focus on interaction and recognition, whereas our method leverages virtual environments not just for learning but also for structured gesture data synthesis, filling a crucial gap in generalizable data generation.

2.2. Inverse Kinematics for Motion Reconstruction

Inverse kinematics (IK) is widely adopted in gesture animation and full-body motion reconstruction, serving as a vital bridge between sparse positional input and anatomically consistent pose synthesis. Prior works have integrated IK with sign language notation to animate avatars from structured linguistic forms [22], while others reconstruct realistic full-body movements from sparse tracker data using learned IK models [23]. Our approach similarly employs IK but uniquely incorporates it into a data augmentation framework, ensuring that each synthesized gesture not only satisfies anatomical constraints but is also training-ready for downstream recognition models.

2.3. Data Augmentation for Sign Language Recognition

The challenge of data scarcity in sign language recognition (SLR) has prompted various augmentation strategies, including landmark transformation, image-space perturbation, and contextual recombination. For instance, landmark-based systems have reduced input dimensionality while improving generalization in in-the-wild scenarios [24]. Hybrid image augmentation involving background, lighting, and geometric shifts has also proven effective in improving model robustness under real-world conditions [25]. Similarly, studies using face swapping and affine transformations have achieved substantial accuracy improvements across top-k metrics [26]. Compared with these methods, our IK-based augmentation operates in 3D semantic space, allowing controlled, biomechanically constrained gesture variation beyond pixel-level manipulation.

2.4. Skeleton-Based and Adversarial Augmentation Techniques

Skeleton-based representations are increasingly utilized in SLR due to their interpretability and compactness. Previous work has proposed generating 3D skeletons from monocular videos and refining them through IK for animation and recognition tasks [27]. Meanwhile, adversarial learning-based augmentation frameworks like AVSN have shown that jointly optimizing augmentation and recognition can uncover model vulnerabilities and enhance generalization [28]. Our pipeline similarly adopts a skeleton-centric view, but emphasizes forward simulation with perturbation using sensitivity-controlled kinematic modifications to produce robust gesture sequences.

2.5. Multimodal and Glove-Enhanced Recognition Systems

Recent advancements in wearable technology have enabled multimodal gesture recognition via smart gloves equipped with triboelectric, inertial, and flexion sensors. These systems demonstrate the feasibility of recognizing full sentences in sign language and enabling bidirectional communication in virtual environments [29]. Despite their high fidelity, such systems face practical limitations due to hardware cost and user accessibility. In contrast, our method remains purely vision-based, while offering future extensibility to glove input for hybrid learning and supervision.

3. Methodology and Algorithms

3.1. Theoretical Foundation

3.1.1. Virtual Environments for Structured Simulation and Data Augmentation

Virtual environments have emerged as an indispensable tool in data-centric artificial intelligence, enabling researchers to decouple learning-based model development from the limitations of physical data acquisition. By leveraging physically grounded simulation engines and programmable control interfaces, modern virtual platforms allow for the synthesis of perceptually realistic, structurally consistent, and semantically controllable datasets across a wide range of modalities. In high-stakes domains such as robotics, embodied AI, and human-centric vision, this synthetic-to-real pipeline has fundamentally redefined how datasets are generated, curated, and scaled.
From a systems perspective, virtual environments offer a generative prior over interaction spaces. They permit precise specification of agent morphology, environmental constraints, temporal control, and multimodal sensory feedback. This level of control is particularly beneficial when designing data regimes that demand statistical diversity while retaining structural coherence, e.g., varying motion trajectories while preserving physical feasibility, or sampling camera perspectives without violating temporal causality. Such controllability is difficult to achieve in naturalistic data collection pipelines, where sensor noise, occlusion, and annotation ambiguity often introduce unquantifiable biases.
Virtual data augmentation is not merely a proxy for “more data”, but a mechanism for bias shaping and generalization conditioning; it enables targeted perturbations of scene structure, motion context, or agent–object relationships in ways that real-world sampling cannot support at scale. This property is increasingly central in domains where distribution shift, edge cases, or long-tail behaviors dominate downstream model performance.
Virtual environments have transitioned from visualization tools to algorithmic infrastructure, providing not only photorealistic assets but also structured, verifiable, and fully controllable simulation scaffolds that form the foundation of modern synthetic data pipelines.

3.1.2. Inverse Kinematics (IK) for Motion Modeling

Inverse kinematics (IK) is a fundamental technique in robotics and motion synthesis that focuses on determining a set of joint configurations that achieve a desired end-effector position or trajectory. In contrast to forward kinematics, where the pose of a skeleton is calculated by successively applying joint transformations from root to end-effector, IK solves the inverse problem: given the target positions of certain parts (e.g., fingertip, hand, foot), it computes the internal joint angles that satisfy kinematic constraints while maintaining anatomical plausibility.
Mathematically, IK is often formulated as a constrained nonlinear optimization problem. Let $\theta \in \mathbb{R}^n$ represent the vector of joint parameters (e.g., angles or positions), and let $f(\theta) \in \mathbb{R}^3$ be the forward kinematics function that maps joint parameters to the Cartesian position of an end-effector. Given a desired target position $x_{\mathrm{target}} \in \mathbb{R}^3$, IK seeks a solution to the following system:
$$f(\theta) = x_{\mathrm{target}}.$$
Due to redundancy or nonlinearity in articulated systems, an exact solution may not exist or may not be unique. Thus, the problem is commonly relaxed to minimizing the squared positional error:
$$\min_{\theta} \; \| f(\theta) - x_{\mathrm{target}} \|^2,$$
subject to joint limits:
$$\theta_{\min} \le \theta \le \theta_{\max}.$$
In practice, iterative methods are used to approximate solutions. One widely used approach involves the Jacobian matrix $J(\theta) \in \mathbb{R}^{3 \times n}$, defined as:
$$J(\theta) = \frac{\partial f(\theta)}{\partial \theta}.$$
A linear approximation of the end-effector motion is given by:
$$\Delta x \approx J(\theta)\, \Delta\theta,$$
and the corresponding joint update can be computed via the Jacobian pseudoinverse:
$$\Delta\theta = J(\theta)^{+}\, \Delta x,$$
where $J^{+}$ denotes the Moore–Penrose pseudoinverse of the Jacobian.
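For concreteness, the following is a minimal NumPy sketch of this pseudoinverse update for an illustrative planar two-link chain; the link lengths, target, and function names are arbitrary choices for this example and are not the hand model or solver used in this work.

```python
import numpy as np

# Illustrative two-link planar chain: fk(theta) maps joint angles to the
# end-effector position, and jacobian(theta) is its analytic Jacobian.
L1, L2 = 1.0, 0.8  # assumed link lengths

def fk(theta):
    t1, t2 = theta
    return np.array([L1 * np.cos(t1) + L2 * np.cos(t1 + t2),
                     L1 * np.sin(t1) + L2 * np.sin(t1 + t2)])

def jacobian(theta):
    t1, t2 = theta
    return np.array([[-L1 * np.sin(t1) - L2 * np.sin(t1 + t2), -L2 * np.sin(t1 + t2)],
                     [ L1 * np.cos(t1) + L2 * np.cos(t1 + t2),  L2 * np.cos(t1 + t2)]])

def ik_pseudoinverse(target, theta, iters=100, tol=1e-5):
    """Iterate delta_theta = J(theta)^+ (x_target - f(theta)) until convergence."""
    for _ in range(iters):
        err = target - fk(theta)
        if np.linalg.norm(err) < tol:
            break
        theta = theta + np.linalg.pinv(jacobian(theta)) @ err
    return theta

theta = ik_pseudoinverse(np.array([1.2, 0.6]), np.array([0.3, 0.3]))
print(fk(theta))  # approximately [1.2, 0.6] since the target is reachable
```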
Alternative real-time solvers such as Cyclic Coordinate Descent (CCD) optimize each joint sequentially and are computationally efficient for low-DOF systems like fingers.
In the domain of motion modeling and gesture synthesis, IK ensures physical realism and structural coherence. For instance, fingertip trajectories captured from video can be converted into full-hand joint configurations using IK, even when some joints are occluded or missing. This enables realistic hand articulation from sparse data.
Moreover, IK systems are often integrated with animation engines or physics-based simulators to dynamically adjust to constraints such as obstacle avoidance or balance preservation during complex transitions.
In summary, IK serves as a bridge between sparse high-level control signals and full-body anatomically plausible movements. Its integration is crucial in gesture reconstruction, animation blending, and physics-informed motion generation pipelines.

3.1.3. Deep Learning Architectures for Sign Language Recognition

Deep learning models have become foundational to the task of sign language recognition (SLR), as they offer powerful mechanisms for extracting and modeling spatiotemporal patterns from high-dimensional motion sequences. Unlike static image classification, SLR requires understanding structured motion over time, often involving complex hand articulation, temporal dependencies, and multimodal cues. To address these challenges, both convolutional and attention-based neural architectures have been extensively adopted.
Convolutional approaches such as 3D Convolutional Neural Networks (CNN3D) capture spatiotemporal dependencies by applying convolutional filters across both spatial and temporal dimensions. This enables them to learn short-range motion patterns effectively while maintaining spatial coherence, making them suitable for recognizing locally structured gestures. In contrast, self-attention-based architectures like Transformers model global temporal relationships through attention mechanisms, allowing them to capture long-range dependencies without relying on recurrence or local convolutions. Their ability to process variable-length inputs and attend to both past and future frames makes them particularly effective for recognizing continuous, coarticulated gesture sequences where temporal context is critical.
The role of these architectures in SLR extends beyond prediction; they serve as analytical tools for evaluating temporal consistency, identifying gesture boundaries, and quantifying the discriminability of learned representations. By learning to map raw motion data into structured latent spaces, deep models enable systematic assessment of gesture classes and can serve as a diagnostic mechanism for evaluating the quality and diversity of training datasets.
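As a rough illustration of how an attention-based classifier can consume pose sequences of the shape used later in this work ([100, 21, 3]), consider the following minimal PyTorch sketch; the layer sizes, head count, and class count are placeholders and do not reproduce the tuned models reported in Table 1.

```python
import torch
import torch.nn as nn

class PoseTransformer(nn.Module):
    """Toy attention-based classifier over per-frame hand keypoints."""
    def __init__(self, num_classes=10, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(21 * 3, d_model)  # flatten 21 keypoints x (x, y, z) per frame
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):                      # x: [batch, 100, 21, 3]
        b, t, j, c = x.shape
        h = self.embed(x.reshape(b, t, j * c)) # per-frame embedding: [batch, 100, d_model]
        h = self.encoder(h)                    # global temporal self-attention
        return self.head(h.mean(dim=1))        # pool over frames -> class logits

logits = PoseTransformer()(torch.randn(8, 100, 21, 3))  # -> [8, 10]
```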

3.2. Proposed Augmentation Framework via Kinematic Gesture Simulation

3.2.1. Overview of the Synthetic–Real Gesture Loop

To address the limitations of existing sign language datasets in terms of anatomical fidelity, inter-signer variability, and scalability, we propose IK-AUG, a structured kinematic augmentation framework that bridges real-world motion acquisition and virtual gesture synthesis. As shown in Figure 1, IK-AUG is a multi-stage pipeline that combines inverse kinematics-based hand reconstruction, biomechanically constrained animation, and perturbation-driven augmentation to produce diverse and semantically faithful gesture sequences for downstream sign language recognition (SLR) tasks.
As detailed in Algorithm 1, the process begins with extracting 3D keypoints $K$ from raw sign language video $V$ using a vision-based estimator $\mathcal{E}$ (e.g., MediaPipe). These keypoints are spatially normalized and mapped into a virtual coordinate system via a transformation module $\mathcal{T}$, which includes noise filtering, frame interpolation, and re-orientation. A forward-and-backward iterative IK solver $\mathcal{I}$ reconstructs complete skeletal hand poses $S$ from sparse landmarks, ensuring biomechanical plausibility. The resulting animation is rendered on a digital hand model within Unity3D (version 2022.3.40f1; Unity Technologies, San Francisco, CA, USA), forming a baseline gesture trajectory.
To simulate natural motion variability, a multi-modal perturbation operator $\mathcal{P}_{\Theta}$ is applied, where $\Theta = \{\epsilon_{xy}, \epsilon_{z}, \lambda_{IK}\}$ defines parameter ranges for spatial jitter, depth displacement, and solver sensitivity, respectively. Each configuration generates a perturbed variant $D_i$ that is exported as both video and coordinate data. Optionally, tactile and kinematic information from multi-sensor gloves $G$ can be integrated to further enrich motion fidelity and robustness. The output dataset $D_{\mathrm{aug}}$ forms a synthetic–real manifold of augmented gesture sequences suitable for deep learning-based recognition and translation systems.
Algorithm 1 IK-AUG: Inverse Kinematic-Augmented Gesture Synthesis Framework
Require: Raw video sequence $V$, pose estimator $\mathcal{E}$, transformation module $\mathcal{T}$, IK solver $\mathcal{I}$, perturbation config $\Theta = \{\epsilon_{xy}, \epsilon_{z}, \lambda_{IK}\}$, number of augmentations $N$
Ensure: Augmented dataset $D_{\mathrm{aug}} = \{(D_1, C_1), \ldots, (D_N, C_N)\}$
  1:  $K \leftarrow \mathcal{E}(V)$                ▹ Extract 3D keypoints from raw video
  2:  $K \leftarrow \mathcal{T}(K)$                ▹ Transform to virtual space: normalization, filtering, re-alignment
  3:  $S \leftarrow \mathcal{I}(K)$                ▹ Apply IK solver to reconstruct skeletal pose sequence
  4:  Initialize $D_{\mathrm{aug}} \leftarrow \emptyset$
  5:  for $i = 1$ to $N$ do
  6:        Sample perturbation config $\{\epsilon_{xy}^{(i)}, \epsilon_{z}^{(i)}, \lambda_{IK}^{(i)}\}$ from $\Theta$
  7:        $S^{(i)} \leftarrow \mathcal{P}(S;\, \epsilon_{xy}^{(i)}, \epsilon_{z}^{(i)}, \lambda_{IK}^{(i)})$                ▹ Apply spatial, depth, and kinematic variation
  8:        $D_i \leftarrow \mathrm{RenderAnimation}(S^{(i)})$
  9:        $C_i \leftarrow \mathrm{ExportCoordinates}(S^{(i)})$
 10:        $D_{\mathrm{aug}} \leftarrow D_{\mathrm{aug}} \cup \{(D_i, C_i)\}$
 11:  end for
 12:  return $D_{\mathrm{aug}}$

3.2.2. Inverse Kinematics Solver for Hand Pose Reconstruction

To reconstruct hand gestures that conform to biomechanical constraints from keypoint position data, this work introduces an inverse kinematics (IK) solver within a virtual environment. The objective is to infer joint rotations of each finger based on the spatial positions of the fingertips, enabling the digital hand model to produce anatomically plausible postures.
In our system, the IK module takes the augmented keypoints transformed into the virtual coordinate system as input, and drives a 3D digital hand model as the target. Through iterative optimization, joint angles are computed to animate the skeleton and generate complete hand gestures.
We employ the Finger Rig module from the Final IK plugin to construct an independent three-segment kinematic chain (MCP–PIP–DIP) for each finger. The inverse kinematics solving process, as detailed in Algorithm 2, is based on the Forward And Backward Reaching Inverse Kinematics (FABRIK) algorithm. FABRIK approximates the fingertip’s spatial target through iterative forward and backward propagation steps.
Specifically, in each iteration, the backward phase adjusts joint positions from the fingertip to the root (DIP → PIP → MCP), preserving fixed bone lengths while progressively moving the chain toward the desired configuration. The subsequent forward phase restores the root position (MCP) and propagates the joint updates outward, ensuring physical plausibility and spatial continuity. In cases where the target lies beyond the maximum reach of the chain, the solver enters an emergency stretch mode, aligning all segments linearly toward the unreachable target to preserve anatomical consistency.
Finally, local joint rotations are derived from the resulting directional vectors and applied to the skeletal animation system to generate smooth and continuous hand gestures. This solver architecture effectively balances geometric constraints with real-time performance, providing fast convergence, numerical stability, and seamless integration into data-driven gesture synthesis pipelines. It is particularly well-suited for high-precision sign language reconstruction and target-controlled animation tasks.
Algorithm 2 Finger Rig-Based IK Solver
Input: Target position $T \in \mathbb{R}^3$, joint positions $\{J_0, J_1, J_2\}$ (MCP → PIP → DIP), segment lengths $\{L_0, L_1\}$, convergence threshold $\varepsilon$
Output: Joint rotations $\{\theta_0, \theta_1, \theta_2\}$
  1:  if $\|J_0 - T\| > L_0 + L_1$ then
  2:        // Target unreachable: fully extend finger
  3:        $\mathrm{Direction} \leftarrow \mathrm{normalize}(T - J_0)$
  4:        $J_1 \leftarrow J_0 + L_0 \cdot \mathrm{Direction}$
  5:        $J_2 \leftarrow J_1 + L_1 \cdot \mathrm{Direction}$
  6:  else
  7:        repeat
  8:              // Backward pass
  9:              $J_2 \leftarrow T$
 10:              $J_1 \leftarrow J_2 + L_1 \cdot \mathrm{normalize}(J_1 - J_2)$
 11:              $J_0 \leftarrow J_1 + L_0 \cdot \mathrm{normalize}(J_0 - J_1)$
 12:              // Forward pass (optional if root fixed)
 13:              $J_0 \leftarrow$ fixed root position
 14:              $J_1 \leftarrow J_0 + L_0 \cdot \mathrm{normalize}(J_1 - J_0)$
 15:              $J_2 \leftarrow J_1 + L_1 \cdot \mathrm{normalize}(J_2 - J_1)$
 16:        until $\|J_2 - T\| < \varepsilon$
 17:  end if
 18:  Compute joint angles $\{\theta_0, \theta_1, \theta_2\}$ from final $\{J_0, J_1, J_2\}$
 19:  Apply joint rotations to digital hand model
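The following NumPy sketch mirrors the two-segment FABRIK iteration of Algorithm 2; the normalize helper, the example joint positions, and the segment lengths are illustrative, and the final conversion to joint rotations is omitted.

```python
import numpy as np

def normalize(v):
    return v / (np.linalg.norm(v) + 1e-9)

def fabrik_finger(T, J0, J1, J2, L0, L1, eps=1e-4, max_iter=50):
    """Two-segment FABRIK pass (MCP -> PIP -> DIP) following Algorithm 2."""
    if np.linalg.norm(T - J0) > L0 + L1:
        # Target unreachable: fully extend the chain toward the target.
        d = normalize(T - J0)
        return J0, J0 + L0 * d, J0 + (L0 + L1) * d
    root = J0.copy()
    for _ in range(max_iter):
        # Backward pass: pin the tip to the target, re-space joints inward.
        J2 = T.copy()
        J1 = J2 + L1 * normalize(J1 - J2)
        J0 = J1 + L0 * normalize(J0 - J1)
        # Forward pass: restore the fixed root and propagate outward.
        J0 = root.copy()
        J1 = J0 + L0 * normalize(J1 - J0)
        J2 = J1 + L1 * normalize(J2 - J1)
        if np.linalg.norm(J2 - T) < eps:
            break
    return J0, J1, J2

J0, J1, J2 = fabrik_finger(np.array([0.5, 0.9, 0.0]),   # target
                           np.array([0.0, 0.0, 0.0]),   # MCP (root)
                           np.array([0.0, 0.6, 0.0]),   # PIP
                           np.array([0.0, 1.1, 0.0]),   # DIP
                           L0=0.6, L1=0.5)
```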

3.2.3. Sign Language Diversity Enhancement Strategy Driven by Multimodal Perturbation

Following the accurate reconstruction of hand gestures through inverse kinematics (IK), we introduce a multimodal perturbation strategy to systematically enhance the diversity and robustness of synthesized sign language sequences within a virtual environment. As illustrated in Figure 2, this stage applies structure-preserving perturbations across spatial, kinematic, and visual dimensions to emulate the natural variability and observational noise encountered in real-world gesture execution.
The process begins with IK-based joint angle estimation from video-derived keypoints, which are mapped onto a 3D digital hand model to reconstruct precise finger trajectories. To further enhance fidelity beyond local articulation, a body animation module generates synchronized upper-body motion, including arms and torso, ensuring that the resulting digital gestures reflect both fine-grained control and holistic motion patterns consistent with natural sign communication.
To introduce controllable variation into the synthesized gestures, we design three complementary perturbation mechanisms:
  • Depth Offset: Simulates variations in camera-to-subject distance by perturbing the depth plane of the virtual viewpoint.
  • 2D Spatial Perturbation: Introduces localized shifts in the image plane (X-Y), mimicking tracking noise, occlusions, or signer-dependent expressive differences.
  • IK Sensitivity Control: Adjusts solver hyperparameters such as target weight and damping, enabling variability in articulation flexibility and gesture expressiveness.
All perturbations are governed by a sensitivity parameter $\varepsilon$, which ensures semantic consistency while introducing naturalistic motion deviation. From a single original gesture instance $D_1$, this strategy generates a diverse sequence set $D_1, \ldots, D_n$, collectively forming the Augmented Gesture Sequences shown in Figure 3.
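A minimal NumPy sketch of how the three perturbation families could act on a keypoint sequence of shape [frames, 21, 3] is given below; the parameter ranges and the way the sampled IK sensitivity weight is handed to the solver are illustrative assumptions, not the exact Unity-side implementation.

```python
import numpy as np

def perturb_sequence(seq, eps_xy=0.01, eps_z=0.05, lam_ik=(0.7, 1.0), rng=None):
    """Apply depth offset and 2D spatial jitter, and sample an IK sensitivity weight.

    seq: array of shape [T, 21, 3] holding normalized (x, y, z) keypoints.
    Returns the perturbed sequence and a solver weight for the IK stage.
    """
    rng = rng or np.random.default_rng()
    out = seq.copy()
    # Depth offset: shift the whole sequence along z to mimic camera-distance changes.
    out[..., 2] += rng.uniform(-eps_z, eps_z)
    # 2D spatial perturbation: small per-frame displacement in the image plane (x, y).
    out[..., :2] += rng.uniform(-eps_xy, eps_xy, size=(seq.shape[0], 1, 2))
    # IK sensitivity control: sampled target weight / damping passed to the solver.
    ik_weight = rng.uniform(*lam_ik)
    return out, ik_weight

# Generate several variants from one (dummy) gesture sequence.
variants = [perturb_sequence(np.zeros((100, 21, 3))) for _ in range(5)]
```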
To further illustrate the effect of the augmentation pipeline, a supplementary video is provided showing the real gesture sequence and its IK-AUG-generated variants in motion. The video demonstrates how the perturbation components preserve semantic integrity while introducing realistic variability. It is available at: https://youtu.be/hawctx3g2fE (accessed on 8 July 2025).

4. Experiments

4.1. Experimental Setup

All experiments were conducted on a workstation running Ubuntu 21.04, equipped with an NVIDIA RTX 3090 GPU and 32 GB of RAM. The implementation was based on PyTorch 1.11.0 and Python 3.10.
The dataset used in this study consists of ten manually collected sign gestures, captured from a frontal view. Raw hand landmark data was extracted using MediaPipe, and subsequently converted into structured CSV files. Each sample in the dataset is represented as a sequence of 100 frames, with 21 hand keypoints per frame, and each keypoint described by its 3D spatial coordinates (x, y, z). Thus, the final input format for all models is standardized as a tensor of shape [100, 21, 3].
All models were trained exclusively on these structured 3D pose keypoints extracted from video using the MediaPipe Hands framework. No image frames, rendered videos, or skeletal animations were used during training. This ensures that the input data is purely geometric and fully standardized across all experiments.
This setup enables a consistent pipeline for training and evaluating various gesture recognition models under controlled conditions.
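For reference, a minimal PyTorch sketch of loading such samples is shown below; the per-sample CSV layout (a headerless file with 100 rows of 63 comma-separated values) and the label handling are assumptions for illustration, not the exact export format used in this study.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class GestureKeypointDataset(Dataset):
    """Loads per-sample CSVs of hand keypoints into tensors of shape [100, 21, 3].

    Assumes each CSV holds 100 frames x 63 values (21 keypoints x x, y, z).
    """
    def __init__(self, csv_paths, labels):
        self.csv_paths = csv_paths
        self.labels = labels

    def __len__(self):
        return len(self.csv_paths)

    def __getitem__(self, idx):
        frames = np.loadtxt(self.csv_paths[idx], delimiter=",", dtype=np.float32)
        pose = torch.from_numpy(frames.reshape(100, 21, 3))
        return pose, self.labels[idx]
```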
All models were trained using specific hyperparameters tuned via the Optuna optimization framework. Each model architecture (CNN3D, TCN, Transformer, Informer, and Sparse Transformer) was independently optimized using cross-validated search over learning rate, batch size, dropout rate, hidden dimensions, and weight decay. The optimization process was conducted within a fixed computational budget to ensure fairness. The final hyperparameter configurations, selected based on best validation accuracy, are summarized in Table 1. These values were held constant across all training runs involving raw data, rotation-augmented data, and IK-AUG data to ensure a controlled and unbiased comparison across augmentation strategies.
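A minimal sketch of such an Optuna search is shown below; the search ranges and the train_and_validate helper are illustrative placeholders rather than the exact protocol used here.

```python
import optuna

def train_and_validate(params):
    # Placeholder for the actual training loop; in the paper's setup this would
    # train the chosen architecture with `params` and return validation accuracy.
    return 0.0

def objective(trial):
    # Search space mirroring the quantities listed in Table 1 (ranges are illustrative).
    params = {
        "lr": trial.suggest_float("lr", 1e-5, 1e-3, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True),
        "dropout": trial.suggest_float("dropout", 0.1, 0.5),
        "hidden_dim": trial.suggest_categorical("hidden_dim", [128, 256, 320, 384]),
        "batch_size": trial.suggest_categorical("batch_size", [16, 32]),
    }
    return train_and_validate(params)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```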

4.2. Experimental Results

To evaluate the effectiveness of the proposed data augmentation strategy, we first trained two representative baseline models, CNN3D and Transformer, on both the raw dataset and the augmented dataset for 500 epochs. The training and validation accuracy curves are shown in Figure 4.
The CNN3D model trained on the raw data exhibits signs of overfitting, where the training accuracy continues to increase while the validation accuracy fluctuates and fails to steadily improve. In contrast, as shown on the right, the CNN3D model trained on the augmented data achieves consistently higher validation accuracy with reduced fluctuation, indicating better generalization and more stable convergence.
For the Transformer model, the benefit of augmentation is even more pronounced. When trained on the raw dataset, the Transformer fails to reach satisfactory accuracy within 500 epochs and suffers from underfitting. However, when trained on the augmented dataset, the model quickly converges to a high level of both training and validation accuracy, demonstrating that the augmented data not only improves overall performance but also enables more effective model fitting within the same training budget.
These results confirm that the proposed augmentation approach enhances both convergence speed and final accuracy, especially for deeper architectures such as Transformers.
To provide a comprehensive evaluation of model performance on the augmented gesture dataset, we trained five representative architectures (CNN3D, TCN, Transformer, Informer, and Sparse Transformer) for 1000 epochs. The training trajectories and final accuracy outcomes are presented in Figure 5a and Figure 5b, respectively.
As illustrated in Figure 5a, all models demonstrate effective learning and converge successfully on the augmented dataset. CNN3D and TCN exhibit fast initial convergence, reaching high accuracy in the early training phase. Transformer-based models, including Informer and Sparse Transformer, present a more progressive and stable learning process, characterized by consistent improvement across epochs and smoother validation curves.
Figure 5b complements these findings by comparing the final training and validation accuracy. All models achieve high overall accuracy, confirming the suitability of the augmented data across architectural paradigms. Notably, Transformer-based models attain a particularly strong balance between training and validation performance, with Informer and Sparse Transformer both reaching 94.1% validation accuracy. In addition, CNN3D achieves the highest validation accuracy of 96.5%, further validating the effectiveness of the augmentation strategy on convolutional architectures.
Overall, these results highlight the versatility and strength of the augmented dataset in supporting both classical and modern architectures. In particular, Transformer variants exhibit highly consistent convergence and competitive accuracy, suggesting that the augmentation pipeline enables them to fully leverage their representational capacity for gesture modeling.
To quantify the effectiveness of the proposed IK-based data augmentation strategy, we conducted controlled experiments by training all five models (CNN3D, TCN, Transformer, Informer, and Sparse Transformer) with and without augmentation. The results are presented in Figure 6.
Figure 6 shows the validation accuracy comparison under both settings. For all models, introducing IK-based augmentation consistently leads to higher validation accuracy, confirming the general benefit of the method. Notably, the improvement is most prominent in Transformer-based architectures. The vanilla Transformer sees its validation accuracy increase from 87.1% to 93.2%, while Informer and Sparse Transformer both improve by more than 4.7%.
To further highlight the contribution of each model, Figure 6b visualizes the proportion of total accuracy gain attributable to each architecture. The Transformer model accounts for the largest share (0.061), followed closely by Informer (0.058) and Sparse Transformer (0.047). These observations demonstrate that the augmentation strategy not only improves model performance uniformly, but also enables attention-based models to realize their full potential on complex gesture sequences.
Together, these results underscore the robustness and scalability of our augmentation pipeline across different model paradigms.
To further demonstrate the advantage of our IK-based augmentation strategy, we established a baseline using a conventional augmentation technique, specifically, 2D random rotation. Each model was trained with pose sequences augmented by randomly rotating the hand keypoints within ±10° in the image plane. As shown in Figure 7, the models trained with IK-AUG consistently outperform those trained with random rotation across all five architectures. These results confirm that IK-AUG provides stronger semantic and structural regularization compared to geometric-only augmentation techniques.
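For clarity, the rotation baseline can be sketched as follows; rotating about the sequence centroid and leaving the z coordinate untouched are assumptions of this illustration rather than a specification of the exact baseline implementation.

```python
import numpy as np

def random_rotate_2d(seq, max_deg=10.0, rng=None):
    """Baseline augmentation: rotate the x-y keypoint coordinates by a random
    angle within +/- max_deg about the sequence centroid, leaving z unchanged."""
    rng = rng or np.random.default_rng()
    angle = np.deg2rad(rng.uniform(-max_deg, max_deg))
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]], dtype=seq.dtype)
    center = seq[..., :2].mean(axis=(0, 1), keepdims=True)
    out = seq.copy()
    out[..., :2] = (seq[..., :2] - center) @ R.T + center
    return out

rotated = random_rotate_2d(np.random.default_rng(0).normal(size=(100, 21, 3)))
```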
To further examine the consistency and robustness of IK-based augmentation across different training budgets, we extended the evaluation to include three epoch settings: 500, 1000, and 2000. The results are summarized in Figure 8.
As shown in Figure 8, all five models demonstrate consistent improvements in validation accuracy after applying the augmentation method, regardless of the number of training epochs. Transformer-based architectures, particularly Informer and Sparse Transformer, not only exhibit high baseline performance but also maintain stable and notable gains at all epoch levels. For example, Informer achieves gains of 0.040, 0.058, and 0.053 across the 500, 1000, and 2000 epoch settings, respectively.
To better understand the influence of training duration on augmentation effectiveness, Figure 8b plots the accuracy gains over epochs for each model. Most models show either a stable or upward trend in gain magnitude as training progresses, with Transformer achieving the largest gain at 1000 epochs (+0.070), and Informer sustaining gains above 0.05 throughout. This suggests that our augmentation strategy scales well with increased training and can be fully leveraged by attention-based architectures.
Figure 8 compares the final validation accuracy on the augmented dataset alone, without reference to raw data performance. The results demonstrate that accuracy steadily improves with longer training for nearly all models, reaffirming the compatibility of the augmented data with extended training. Notably, all models surpass 93% validation accuracy at 2000 epochs, with Transformer-based models consistently ranking at the top.
Collectively, these findings validate that the IK-based augmentation not only yields immediate performance boosts, but also maintains its effectiveness over prolonged training periods, particularly when combined with Transformer-family architectures.
To further validate the impact of IK-based augmentation on test-time performance, we conducted a comparative analysis across all models using the test set. As summarized in Table 2, all five architectures exhibit consistent improvements in final test accuracy after being trained on the augmented dataset. CNN3D and TCN achieve relative gains of +5% and +6%, respectively, while the Transformer-based models (Transformer, Informer, and Sparse Transformer) show marginal but stable improvements of +2% to +3%, further confirming the compatibility of the synthesized gestures with generalization tasks.
Additionally, Figure 9 presents the confusion matrices of the Transformer model trained with and without augmentation. The augmented model (right) demonstrates stronger diagonal dominance and reduced inter-class confusion compared to the baseline (left), highlighting the improved discriminative ability and classification robustness introduced by the synthetic diversity.
These results emphasize that the proposed augmentation framework not only boosts training efficiency and validation performance but also translates into measurable generalization gains on unseen data, making it a practical and scalable solution for gesture-based recognition systems.

4.3. Evaluation

To quantitatively assess the effectiveness of the proposed augmentation framework, we adopted a series of evaluation metrics and comparative strategies centered on model generalization, classification accuracy, and inter-class discriminability.
The primary evaluation metric is top-1 classification accuracy on a held-out test set, ensuring comparability across models and datasets. All models were trained under identical settings with and without augmentation to isolate the contribution of synthetic data. For further validation, we utilized confusion matrix analysis to visualize class-level performance and to identify reductions in misclassification errors, particularly for semantically or visually similar gestures.
In addition, we computed relative accuracy gains across multiple architectures and training budgets to evaluate consistency and scalability. The inclusion of temporal models (e.g., TCN) and attention-based architectures (e.g., Transformer, Informer) provides a broad perspective on how different inductive biases interact with the augmented data.
This evaluation protocol enables us to not only confirm the overall effectiveness of the augmentation strategy, but also to uncover architecture-specific benefits and gesture-specific improvements. These findings are further analyzed in the subsequent discussion section.
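A minimal sketch of this evaluation step (top-1 accuracy plus confusion matrix, using scikit-learn) is given below; the model and data-loader interfaces are assumed to follow the dataset sketch shown earlier and are not the exact evaluation code used in this study.

```python
import torch
from sklearn.metrics import accuracy_score, confusion_matrix

@torch.no_grad()
def evaluate(model, loader, device="cpu"):
    """Compute top-1 accuracy and the confusion matrix on a held-out loader."""
    model.eval()
    preds, targets = [], []
    for pose, label in loader:                      # pose: [batch, 100, 21, 3]
        logits = model(pose.to(device))
        preds.extend(logits.argmax(dim=1).cpu().tolist())
        targets.extend(label.tolist())
    return accuracy_score(targets, preds), confusion_matrix(targets, preds)
```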

4.4. Discussion

The experimental findings substantiate the algorithmic merit of the proposed IK-AUG framework across diverse model architectures and training configurations. Notably, inverse kinematics-driven augmentation consistently enhances both convergence stability and generalization capacity, particularly in Transformer-based models, which are otherwise susceptible to overfitting and data inefficiency in low-resource regimes. By synthesizing biomechanically constrained yet variationally rich gesture sequences, IK-AUG effectively expands the data manifold while preserving the semantic integrity of motion classes, thereby serving as an implicit regularizer in temporal representation learning.
Temporal convolutional architectures such as TCN demonstrate robust performance under this augmentation regime, exhibiting efficient convergence and favorable trade-offs between model complexity and recognition accuracy. In contrast, attention-centric models like Sparse Transformer attain high asymptotic accuracy but display greater sensitivity to hyperparameter configurations and training dynamics, suggesting that they are more reliant on augmentation-induced structural priors to capture gesture regularity.
Despite these improvements, confusion matrix analysis reveals persistent misclassification in semantically adjacent gesture classes. This indicates that further disambiguation may require the integration of complementary modalities, e.g., depth maps, surface cues, or multiview projections, as well as temporal alignment mechanisms that better capture inter-frame dependencies and gesture onset–offset transitions.
In addition, we recognize that the current study is constrained by the limited scale of the dataset, which includes 10 gesture classes recorded in a controlled environment. This limitation may restrict the generalization of the observed findings to more diverse, real-world signing scenarios. We plan to address this by scaling up the dataset in future work, incorporating a broader vocabulary, more signer diversity, and varied contextual conditions. Such efforts will enable a more comprehensive evaluation of the framework’s robustness and applicability.
Although IK-AUG integrates multiple perturbation modules, namely spatial jitter, depth modulation, and solver sensitivity control, these components are intrinsically coupled within a unified simulation loop that operates on inverse kinematics-reconstructed trajectories. As such, removing individual components would compromise the biomechanical plausibility or semantic consistency of the synthesized gestures. In future work, we plan to develop a modular variant of the pipeline to facilitate interpretable component-wise evaluation and principled ablation analysis.
Overall, the results validate IK-AUG not merely as a data expansion tool, but as a principled algorithmic mechanism for generating anatomically faithful, semantically grounded, and training-efficient gesture corpora. These insights underscore the importance of geometry-aware augmentation in deep sequence modeling, and lay a foundation for hybrid simulation–sensor pipelines in future sign language understanding systems.

5. Conclusions

This work introduces IK-AUG, a principled augmentation framework that unifies inverse kinematics, simulation-based gesture modeling, and parametric perturbation into a coherent algorithmic process for sign language recognition. By transforming sparse keypoint trajectories into anatomically valid motion sequences within a controllable virtual environment, IK-AUG addresses key limitations of existing gesture datasets in scalability, structural diversity, and biomechanical fidelity.
Through the integration of 3D pose estimation, forward–backward IK solvers, and multi-modal augmentation operators parameterized by spatial, depth, and kinematic factors, the framework yields gesture variants that span a rich data manifold while preserving semantic consistency. Experimental evaluations across multiple deep temporal architectures validate the efficacy of IK-AUG in enhancing classification accuracy, reducing overfitting, and improving class-level discriminability, particularly for high-capacity attention-based models such as Transformers. Moreover, the augmentation benefits persist across varied training budgets, confirming the robustness and transferability of the synthetic–real gesture loop.
Beyond empirical gains, this research contributes a generalized computational abstraction of gesture synthesis, whereby virtual motion priors can be encoded, perturbed, and operationalized for model training. The framework thus serves not merely as a data augmentation technique, but as a reusable generative substrate for simulation-driven learning paradigms.
Looking ahead, the integration of multi-sensor glove streams featuring inertial, haptic, and flexion feedback offers a compelling pathway toward hybrid supervision regimes, enabling the joint modeling of physical and virtual motion distributions. Additionally, advancing cross-domain adaptation techniques may allow IK-AUG-generated sequences to serve as pretraining anchors for real-world SLR, thereby narrowing the sim-to-real gap through structured algorithmic alignment.
While the IK-AUG framework demonstrates consistent improvements across diverse model architectures, it is important to acknowledge the current study’s limitations. The experimental dataset comprises 10 gesture classes collected under constrained conditions, which may limit the generalizability of the findings to broader, real-world scenarios. To address this, future work will focus on scaling up the dataset to include a more extensive vocabulary, diverse signers, and varied environmental contexts. We also plan to evaluate the framework on large-scale benchmarks such as RWTH-PHOENIX-Weather and How2Sign, and explore the integration of multimodal inputs to support hybrid simulation–sensor learning pipelines.
In total, IK-AUG exemplifies the algorithmic fusion of kinematic reasoning, generative simulation, and data-driven learning, laying the foundation for a new class of geometry-aware, sensor-informed learning frameworks situated at the intersection of embodied AI and multimodal human–computer interaction.

Author Contributions

Conceptualization, X.L.; methodology, B.W.; software, B.W.; validation, B.W., X.L. and L.J.; formal analysis, B.W.; investigation, B.W.; resources, L.J.; data curation, B.W.; writing—original draft preparation, B.W.; writing—review and editing, B.W.; visualization, B.W.; supervision, X.L.; project administration, X.L.; funding acquisition, L.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by JKA and its promotion funds from KEIRIN RACE.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The datasets presented in this article are not readily available because they are part of an ongoing study and currently being used for follow-up experiments. Requests to access the datasets should be directed to the corresponding author’s project page.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Bragg, D.; Koller, O.; Bellard, M.; Berke, L.; Boudreault, P.; Braffort, A.; Caselli, N.; Huenerfauth, M.; Kacorri, H.; Verhoef, T.; et al. Sign language recognition, generation, and translation: An interdisciplinary perspective. In Proceedings of the 21st International ACM Special Interest Group on Accessibility and Computing Conference on Computers and Accessibility, Pittsburgh, PA, USA, 28–30 October 2019; pp. 16–31. [Google Scholar]
  2. Camgoz, N.C.; Koller, O.; Hadfield, S.; Bowden, R. Sign language transformers: Joint end-to-end sign language recognition and translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 10023–10033. [Google Scholar]
  3. Maiorana-Basas, M.; Pagliaro, C.M. Technology use among adults who are deaf and hard of hearing: A national survey. J. Deaf. Stud. Deaf. Educ. 2014, 19, 400–410. [Google Scholar] [CrossRef] [PubMed]
  4. Koller, O.; Zargaran, S.; Ney, H.; Bowden, R. Deep sign: Enabling robust statistical continuous sign language recognition via hybrid CNN-HMMs. Int. J. Comput. Vis. 2018, 126, 1311–1325. [Google Scholar] [CrossRef]
  5. Li, D.; Rodriguez, C.; Yu, X.; Li, H. Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; IEEE: New York, NY, USA, 2020; pp. 1459–1469. [Google Scholar]
  6. Ge, L.; Ren, Z.; Li, Y.; Xue, Z.; Wang, Y.; Cai, J.; Yuan, J. 3D hand shape and pose estimation from a single RGB image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: New York, NY, USA, 2019; pp. 10833–10842. [Google Scholar]
  7. Khan, A.; Jin, S.; Lee, G.H.; Arzu, G.E.; Nguyen, T.N.; Dang, L.M.; Choi, W.; Moon, H. Deep learning approaches for continuous sign language recognition: A comprehensive review. IEEE Access 2025, 13, 55524–55544. [Google Scholar] [CrossRef]
  8. Duarte, A.; Palaskar, S.; Ventura, L.; Ghadiyaram, D.; DeHaan, K.; Metze, F.; Torres, J.; Giro-i Nieto, X. How2Sign: A large-scale multimodal dataset for continuous American Sign Language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 2735–2744. [Google Scholar]
  9. Forster, J.; Schmidt, C.; Hoyoux, T.; Koller, O.; Zelle, U.; Piater, J.H.; Ney, H. Rwth-phoenix-weather: A large vocabulary sign language recognition and translation corpus. Int. Conf. Lang. Resour. Eval. 2012, 9, 3785–3789. [Google Scholar]
  10. Al-Qurishi, M.; Khalid, T.; Souissi, R. Deep learning for sign language recognition: Current techniques, benchmarks, and open issues. IEEE Access 2021, 9, 126917–126951. [Google Scholar] [CrossRef]
  11. Jiang, S.; Sun, B.; Wang, L.; Bai, Y.; Li, K.; Fu, Y. Skeleton-aware multi-modal sign language recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 3413–3423. [Google Scholar]
  12. Barve, P.; Mutha, N.; Kulkarni, A.; Nigudkar, Y.; Robert, Y. Application of deep learning techniques on sign language recognition A survey. In Data Management, Analytics and Innovation, Proceedings of the International Conference on Discrete Mathematics, Rupnagar, India, 11–13 February 2021; Springer: Singapore, 2021; Volume 70, pp. 211–227. [Google Scholar]
  13. Kumar, D.A.; Sastry, A.; Kishore, P.; Kumar, E.K. 3D sign language recognition using spatio-temporal graph kernels. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 143–152. [Google Scholar] [CrossRef]
  14. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 221–231. [Google Scholar] [CrossRef] [PubMed]
  15. Lea, C.; Vidal, R.; Reiter, A.; Hager, G.D. Temporal convolutional networks: A unified approach to action segmentation. In Proceedings of the European Conference on Computer Vision 2016 Workshops, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; Part III. Springer: Cham, Switzerland, 2016; pp. 47–54. [Google Scholar]
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
  17. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the The Association for the Advancement of Artificial Intelligence (AAAI) Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 11106–11115. [Google Scholar]
  18. Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating long sequences with sparse transformers. arXiv 2019, arXiv:1904.10509. [Google Scholar]
  19. Alam, M.S.; Lamberton, J.; Wang, J.; Leannah, C.; Miller, S.; Palagano, J.; de Bastion, M.; Smith, H.L.; Malzkuhn, M.; Quandt, L.C. ASL champ!: A virtual reality game with deep-learning driven sign recognition. Comput. Educ. X Real. 2024, 4, 100059. [Google Scholar] [CrossRef]
  20. Vaitkevičius, A.; Taroza, M.; Blažauskas, T.; Damaševičius, R.; Maskeliūnas, R.; Woźniak, M. Recognition of American sign language gestures in a virtual reality using leap motion. Appl. Sci. 2019, 9, 445. [Google Scholar] [CrossRef]
  21. Schioppo, J.; Meyer, Z.; Fabiano, D.; Canavan, S. Sign language recognition: Learning American sign language in a virtual environment. In Proceedings of the Extended Abstracts of the 2019 Conference on Human Factors in Computing Systems, Glasgow, Scotland, UK, 4–9 May 2019; pp. 1–6. [Google Scholar]
  22. Papadogiorgaki, M.; Grammalidis, N.; Makris, L.; Strintzis, M.G. Gesture synthesis from sign language notation using MPEG-4 humanoid animation parameters and inverse kinematics. In Proceedings of the 2nd IET International Conference on Intelligent Environments, Athens, Greece, 5–6 July 2006; pp. 151–160. [Google Scholar]
  23. Ponton, J.L.; Yun, H.; Aristidou, A.; Andujar, C.; Pelechano, N. Sparseposer: Real-time full-body motion reconstruction from sparse data. ACM Trans. Graph. 2023, 43, 5. [Google Scholar] [CrossRef]
  24. Nunnari, F.; España-Bonet, C.; Avramidis, E. A data augmentation approach for sign-language-to-text translation in-the-wild. In Proceedings of the 3rd Conference on Language, Data and Knowledge. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Zaragoza, Spain, 1–3 September 2021; pp. 36:1–36:8. [Google Scholar]
  25. Awaluddin, B.A.; Chao, C.T.; Chiou, J.S. A hybrid image augmentation technique for user-and environment-independent hand gesture recognition based on deep learning. Mathematics 2024, 12, 1393. [Google Scholar] [CrossRef]
  26. Perea-Trigo, M.; López-Ortiz, E.J.; Soria-Morillo, L.M.; Álvarez García, J.A.; Vegas-Olmos, J. Impact of face swapping and data augmentation on sign language recognition. Univers. Access Inf. Soc. 2024, 24, 1283–1294. [Google Scholar] [CrossRef]
  27. Brock, H.; Law, F.; Nakadai, K.; Nagashima, Y. Learning three-dimensional skeleton data from sign language video. ACM Trans. Intell. Syst. Technol. 2020, 11, 1–24. [Google Scholar] [CrossRef]
  28. Nakamura, Y.; Jing, L. Skeleton-based data augmentation for sign language recognition using adversarial learning. IEEE Access 2024, 13, 15290–15300. [Google Scholar] [CrossRef]
  29. Wen, F.; Zhang, Z.; He, T.; Lee, C. AI enabled sign language recognition and VR space bidirectional communication using triboelectric smart glove. Nat. Commun. 2021, 12, 5378. [Google Scholar] [CrossRef] [PubMed]
Figure 1. System overview of the proposed IK-AUG framework. The pipeline begins with 3D keypoint extraction and inverse kinematics-based hand reconstruction (left), followed by biomechanically constrained animation and multi-modal perturbation (right), including depth offset, spatial jitter, and IK sensitivity control. The generated sequences D 1 D N are exported for downstream training. Optional multi-sensor glove input enhances signal richness and motion granularity.
Figure 2. Real video sequence extracted from raw gesture recording. The frames illustrate the original signer’s motion trajectory captured under constrained conditions. These unaugmented sequences serve as the source input for inverse kinematics-based hand pose reconstruction and virtual simulation.
Figure 3. Augmented gesture sequence generated via virtual perturbation pipeline. Structure-preserving variations are applied in depth, spatial position, and inverse kinematic sensitivity, resulting in diverse but semantically consistent gesture variants governed by a sensitivity parameter ε . This enhances generalization and simulates real-world gesture variability.
Figure 4. Training versus validation accuracy across different models and data settings over 500 epochs. (a,b): CNN3D model trained on raw and augmented datasets, respectively; (c,d): same comparison for the Transformer model. Augmented data consistently leads to improved validation accuracy and faster convergence, particularly mitigating overfitting in deep architectures.
Figure 5. Comparative performance of deep learning models on augmented gesture data. (a) illustrates the convergence behavior of five architectures across 1000 training epochs, showing consistent learning stability and reduced variance with augmentation. (b) presents final training and validation accuracy, where all models achieve high generalization, with Transformer-based models demonstrating a strong balance between training fit and validation performance.
Figure 6. Effectiveness of IK-based data augmentation across five deep learning architectures. (a) compares validation accuracy before and after augmentation, showing consistent improvements across all models. (b) presents the proportional contribution of each model to the total accuracy gain, highlighting that Transformer-based models (Transformer, Informer, Sparse Transformer) account for the majority of performance improvement.
Figure 7. Comparison of IK-AUG versus 2D rotation-based augmentation. Validation accuracy of five deep learning models trained with traditional 2D random rotation and with the proposed IK-based augmentation. IK-AUG yields consistent performance gains across all models.
Figure 8. Impact of training duration on the effectiveness of IK-based augmentation. (a) presents raw accuracy and augmentation gain for five models across 500, 1000, and 2000 training epochs, illustrating stable improvements with longer training. (b) shows the gain trends over epochs, where Transformer achieves the largest boost at 1000 epochs, while Informer and Sparse Transformer maintain consistent gains, indicating strong scalability of augmentation across model types.
Figure 9. Comparison of classification performance between models trained on raw and IK-augmented gesture data. (a) Confusion matrix of the model trained on raw data, where several classes exhibit misclassification. (b) Confusion matrix of the model trained on IK-augmented data, illustrating improved class separability with stronger diagonal dominance and reduced inter-class confusion.
Table 1. Model-specific training hyperparameters used in all experiments.
Model | Batch Size | Learning Rate | Weight Decay | Dropout | Hidden Dim | Notes
CNN3D | 16 | 3.15 × 10^-4 | 1.79 × 10^-5 | 0.395 | 320 | —
TCN | 32 | 1.48 × 10^-4 | 5.21 × 10^-5 | 0.266 | 256 | Kernel size = 3, 6 layers
Transformer | 32 | 4.60 × 10^-4 | 8.55 × 10^-5 | 0.102 | 32 | 2 layers, 8 heads
Informer | 32 | 9.83 × 10^-5 | 7.92 × 10^-3 | 0.422 | 256 | Kernel size = 0, 4 layers
Sparse Transformer | 16 | 7.79 × 10^-4 | 2.97 × 10^-4 | 0.458 | 384 | —
Table 2. Test set classification accuracy (%) for all models trained on original vs. augmented datasets.
Model | Original Data Accuracy | Augmented Data Accuracy
CNN3D | 85% | 90%
TCN | 83% | 89%
Transformer | 90% | 92%
Informer | 91% | 94%
Sparse Transformer | 89% | 92%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
