Adaptive Sign Language Recognition for Deaf Users: Integrating Markov Chains with Niching Genetic Algorithm

Al-Saidi, Muslem; Ballagi, Áron; Hassen, Oday Ali; Darwish, Saad M.

doi:10.3390/ai6080189


¹ Doctoral School of Multidisciplinary Engineering Sciences, Széchenyi István University, Egyetem tér 1, 9026 Győr, Hungary

² Department of Automation, Széchenyi István University, Egyetem tér 1, 9026 Győr, Hungary

³ Ministry of Education, Wasit Education Directorate, Kut 52001, Iraq

⁴ Department of Information Technology, Institute of Graduate Studies and Research, Alexandria University, Alexandria 21526, Egypt

* Author to whom correspondence should be addressed.

AI 2025, 6(8), 189; https://doi.org/10.3390/ai6080189

Submission received: 3 June 2025 / Revised: 18 July 2025 / Accepted: 12 August 2025 / Published: 15 August 2025

(This article belongs to the Topic Advances in Robot Vision Perception and Control Technology)


Abstract

Sign language recognition (SLR) plays a crucial role in bridging the communication gap between deaf individuals and the hearing population. However, achieving subject-independent SLR remains a significant challenge due to variations in signing styles, hand shapes, and movement patterns among users. Traditional Markov Chain-based models struggle with generalizing across different signers, often leading to reduced recognition accuracy and increased uncertainty. These limitations arise from the inability of conventional models to effectively capture diverse gesture dynamics while maintaining robustness to inter-user variability. To address these challenges, this study proposes an adaptive SLR framework that integrates Markov Chains with a Niching Genetic Algorithm (NGA). The NGA optimizes the transition probabilities and structural parameters of the Markov Chain model, enabling it to learn diverse signing patterns while avoiding premature convergence to suboptimal solutions. In the proposed SLR framework, GA is employed to determine the optimal transition probabilities for the Markov Chain components operating across multiple signing contexts. To enhance the diversity of the initial population and improve the model’s adaptability to signer variations, a niche model is integrated using a Context-Based Clearing (CBC) technique. This approach mitigates premature convergence by promoting genetic diversity, ensuring that the population maintains a wide range of potential solutions. By minimizing gene association within chromosomes, the CBC technique enhances the model’s ability to learn diverse gesture transitions and movement dynamics across different users. This optimization process enables the Markov Chain to better generalize subject-independent sign language recognition, leading to improved classification accuracy, robustness against signer variability, and reduced misclassification rates. 
Experimental evaluations demonstrate a significant improvement in recognition performance, reduced error rates, and enhanced generalization across unseen signers, validating the effectiveness of the proposed approach.

Keywords: subject-independent sign language recognition; Markov Chain optimization; Niching Genetic Algorithm (NGA); gesture dynamics and variability

1. Introduction

Sign language recognition (SLR) is a vital technology that facilitates communication between deaf individuals and the hearing population, with applications in real-time translation, assistive technologies, human–computer interaction, and accessibility services in various sectors, including public spaces such as transportation hubs, government offices, shopping centers, and emergency response systems. In public spaces, SLR can be integrated into interactive kiosks, digital signage, and customer service systems to provide automatic sign language interpretation, enabling deaf individuals to access essential information without requiring a human interpreter. Additionally, SLR can enhance accessibility in airports, train stations, and bus terminals by providing real-time sign language translations for announcements, schedules, and emergency alerts. Despite its importance, achieving subject-independent SLR remains a major challenge due to the high variability in signing styles, hand shapes, movement trajectories, and speed across different users. These variations arise from factors such as individual physiological differences, cultural and regional sign language dialects, and signer proficiency levels. Additionally, environmental conditions in public spaces, such as poor lighting, background noise, occlusions, and crowded environments, further complicate accurate recognition [1,2,3].

Traditional SLR models, particularly those based on statistical and deep learning methods, often struggle to generalize across diverse users and varying operating conditions, resulting in reduced recognition accuracy and increased uncertainty when encountering unseen signing patterns. These limitations arise from the difficulty of capturing the complex temporal dependencies and variations in hand movements, shapes, and transitions inherent in sign language. Markov Chain-based models offer a significant advantage in SLR by effectively modeling sequential dependencies in gestures, enabling a structured representation of transitions between different sign states. Unlike traditional deep learning models that rely heavily on large amounts of labeled data for training, Markov Chains can efficiently learn gesture sequences with fewer samples by leveraging probabilistic state transitions. This makes them particularly useful in handling signer variability and environmental noise, as they can adaptively adjust transition probabilities based on observed data. Additionally, Markov Chains provide better interpretability by explicitly defining transition probabilities between gesture states, making it easier to analyze and refine recognition performance [4,5,6].

Markov Chain models, while effective in capturing sequential dependencies in SLR, face several challenges, including suboptimal transition probabilities and susceptibility to local optima when modeling complex gesture sequences. Genetic Algorithms (GAs) can address these limitations by optimizing the transition probabilities and structural parameters of the Markov Chain, ensuring a more adaptive and robust recognition framework. Traditional Markov Chains rely on predefined or statistically estimated transition probabilities, which may not always generalize well across different signers due to variations in signing styles, hand movement trajectories, and gesture execution speeds. GAs overcome this by employing evolutionary optimization techniques to iteratively refine transition probabilities, selecting the best-performing probability distributions based on fitness evaluations. Furthermore, GAs introduce diversity in the search space by exploring multiple potential solutions simultaneously, preventing premature convergence to suboptimal transition matrices [7,8].
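To make "statistically estimated transition probabilities" concrete, the following minimal sketch estimates a transition matrix by counting state-to-state transitions in training sequences. It assumes gestures have already been quantized into integer state indices. The function name `estimate_transitions`, the additive smoothing constant, and the toy sequences are illustrative assumptions, not part of the original study.

```python
def estimate_transitions(sequences, n_states, smoothing=1.0):
    """Estimate a row-stochastic transition matrix by counting transitions.

    sequences : iterable of lists of integer state indices in [0, n_states)
    smoothing : additive pseudo-count (an assumption of this sketch); a
                positive value keeps every row non-empty and every entry > 0
    """
    counts = [[smoothing] * n_states for _ in range(n_states)]
    for seq in sequences:
        # Count each consecutive (previous state, next state) pair.
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1.0
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total for c in row])  # normalize the row to sum to 1
    return matrix


# Hypothetical quantized gesture sequences from a few training signers.
toy_sequences = [[0, 1, 2, 2], [0, 1, 1, 2], [0, 0, 1, 2]]
P_hat = estimate_transitions(toy_sequences, n_states=3)
```

A matrix estimated this way reflects only the signers in the training sequences, which is the generalization limitation that the GA-based refinement is meant to address.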

When applying GAs to optimize Markov Chain models for SLR, several challenges arise, including premature convergence, loss of genetic diversity, and suboptimal exploration of the solution space. Standard GA approaches tend to converge too quickly to locally optimal transition probability distributions, limiting their ability to explore better configurations for modeling gesture sequences. Additionally, due to the high-dimensional nature of transition matrices, GA solutions may suffer from gene association problems, where certain transition probabilities become overly dependent on others, reducing the model’s flexibility in handling signer variability. This issue is particularly problematic in subject-independent SLR, where different users exhibit diverse signing styles and motion patterns [9,10]. Niching techniques, such as Context-Based Clearing (CBC), address these challenges by maintaining genetic diversity and preventing dominant solutions from taking over the population too early. CBC achieves this by identifying and clearing similar solutions within a given niche, ensuring that multiple high-quality solutions evolve in parallel rather than collapsing into a single, potentially suboptimal configuration. This promotes a broader exploration of transition probabilities, allowing the Markov Chain to capture a wider range of gesture variations and signer-specific nuances. By reducing the risk of premature convergence and balancing the exploitation–exploration tradeoff, CBC-enhanced GA ensures that the optimized Markov Chain model remains adaptive, generalizes effectively across different users, and improves recognition accuracy in diverse real-world conditions [11,12].

1.1. Problem Statement and Motivation

SLR systems play a crucial role in bridging communication between the deaf and hearing communities. However, subject-independent recognition remains a major challenge due to variations in signing styles, hand shapes, and movement patterns among users. Traditional Markov Chain models often fail to generalize for unseen signers, resulting in reduced accuracy and increased uncertainty. These shortcomings arise from their limited ability to capture diverse gesture dynamics and handle inter-user variability. Without adaptive mechanisms to address these issues, SLR systems fail to perform reliably in real-world applications.

1.2. Contribution

This study introduces an adaptive SLR framework that integrates Markov Chains with a Niching Genetic Algorithm (NGA) to enhance model generalization for subject-independent recognition. The NGA optimizes the Markov Chain’s transition probabilities through an evolutionary process, ensuring that the model learns diverse signing patterns without prematurely converging to local optima. The novelty lies in the integration of a Context-Based Clearing (CBC) technique, which fosters genetic diversity by reducing excessive gene correlation within chromosomes. By maintaining a diverse set of candidate solutions, CBC enables the Markov Chain to capture a broader range of gesture variations, preventing the model from overfitting to specific signer-dependent patterns. Additionally, the structured clearing mechanism ensures that the evolutionary process explores multiple high-potential configurations of the transition matrix, refining the Chain’s ability to generalize across different users. This results in a more adaptive and resilient Markov Chain model, capable of accurately modeling gesture transitions in subject-independent SLR, reducing misclassification rates, and improving recognition accuracy.

While modern sequential deep learning models such as Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Transformers have demonstrated impressive performance in many temporal sequence modeling tasks, the decision to employ classical probabilistic models like Markov Chains in this study is both strategic and justifiable. Classical probabilistic models offer greater interpretability, computational efficiency, and reduced data dependency compared to deep learning architectures. Markov Chains, in particular, excel at modeling sequential dependencies and state transitions with well-defined probabilistic structures, making them suitable for scenarios where labeled training data is limited or where transparency in decision-making is crucial. In subject-independent SLR, where inter-signer variability introduces uncertainty and inconsistent patterns, Markov Chains provide a stable foundation for modeling gesture transitions. The integration of a Niching Genetic Algorithm (NGA) further enhances their adaptability by optimizing transition probabilities and maintaining population diversity through the Context-Based Clearing (CBC) mechanism. This hybridization enables the model to escape local optima and effectively capture a wider spectrum of gesture dynamics across different users. In contrast, deep learning models often require extensive annotated datasets, are prone to overfitting in small-sample settings, and behave as “black boxes,” making them less suited for applications where robustness to unseen data and explainability are essential. Therefore, by leveraging the strengths of classical probabilistic modeling and evolutionary optimization, the proposed approach balances accuracy, generalization, and interpretability, offering a compelling alternative to deep learning for subject-independent sign language recognition. 
Table 1 compares classical probabilistic models (e.g., Markov Chains) and modern sequential deep learning models (e.g., RNNs, LSTMs, and Transformers) in the context of SLR. This table highlights that while deep learning models are powerful in feature learning and scalability, classical models like Markov Chains—when enhanced with evolutionary algorithms—remain highly effective for interpretable, resource-efficient, and subject-independent SLR tasks, particularly in data-constrained environments.

The remainder of this paper is organized as follows. Section 2 reviews the literature relevant to the optimization-aware SLR framework. Section 3 presents the proposed Niching Genetic Algorithm, Hidden Markov Model (HMM)-based SLR approach. Section 4 reports the evaluation of the proposed technique, including results and discussion. Section 5 concludes the study and discusses possible future directions.

2. State-of-the-Art Related Work

SLR has gained significant attention in recent years as an essential tool for enhancing communication between the deaf and hearing communities. Traditional SLR methods rely on handcrafted feature extraction and classification techniques, which often struggle with variability in signing styles, occlusions, and environmental noise. To address these challenges, optimization-based SLR approaches have emerged, leveraging metaheuristic algorithms to fine-tune model parameters, enhance feature selection, and improve classification accuracy [13,14,15]. Metaheuristic approaches like GA, Particle Swarm Optimization (PSO), and the Tunicate Swarm Algorithm (TSA) can effectively search the solution space to find optimal HMM configurations, reducing overfitting and improving recognition accuracy. Additionally, optimization can aid in feature selection and sequence alignment, further enhancing the model’s ability to capture temporal dependencies in sign language gestures. By integrating optimization strategies, HMM-based SLR systems can achieve higher recognition rates, better adaptability to different signers, and improved computational efficiency [16,17].

The study presented in Ref. [18] suggests a real-time recognition system that integrates Particle Swarm Optimization (PSO) and a Probabilistic Neural Network (PNN). The system employs Principal Component Analysis (PCA) for dimensionality reduction to minimize computational overhead and employs K-means clustering and the Pearson correlation coefficient to extract optimal gesture features for classification. In offline gesture recognition tests involving six continuous gestures (CGs), the proposed algorithm achieved an impressive 97% accuracy with a training set of 300 samples and a runtime of just 31.25 ms. Compared to five alternative algorithms, PSO-PNN improved accuracy by at least 9% and reduced runtime by 40.475 ms. Additionally, testing across multiple datasets demonstrated an average recognition rate of 90.17%, outperforming other methods by at least 9.84%. In online CG control experiments for robot navigation in complex environments, the PSO-PNN system achieved real-time performance of 28.56 ms and a task completion rate of 90.67%, confirming its effectiveness and practicality. The key advantages of this approach include high accuracy, efficient feature extraction, reduced computational cost, and real-time performance, making it highly suitable for real-world applications like robotic control. However, some limitations exist, such as dependence on high-quality training data, potential performance degradation with unseen gesture variations, and sensitivity to environmental factors like lighting and background noise.

The SLR-ISOADL methodology presented in Ref. [19] integrates an Improved Seagull Optimization Algorithm (ISOA) with deep learning for sign language recognition. This approach employs a hyperparameter-tuned AlexNet model to extract intrinsic patterns, while bilateral filtering (BF) is applied for noise reduction. The ISOA optimally selects hyperparameters for AlexNet, and a multilayer perceptron (MLP) is used for final classification. Experimental analysis on a benchmark dataset demonstrated the effectiveness of SLR-ISOADL, achieving superior detection performance across multiple metrics. However, while the approach enhances accuracy and robustness, potential drawbacks include increased computational complexity due to the optimization process and the need for extensive training data to fine-tune hyperparameters effectively.

In Ref. [20], the authors presented a novel approach for Arabic sign alphabet recognition using optimized hybrid techniques that combine a Convolutional Neural Network (CNN) with five traditional machine learning algorithms: Feedforward Neural Network, Decision Tree, Random Forest, Support Vector Machine, and K-Nearest Neighbors. To enhance classification accuracy, six optimization techniques—Genetic Algorithm, Particle Swarm Optimization, Firefly Algorithm, Differential Evolution, Sine Cosine Algorithm, and Harris Hawks Optimization—were explored to determine the optimal weight multipliers for the outputs of the hybrid CNN models. The results showed that the optimized hybrid techniques achieved nearly 99% accuracy, surpassing individual models and demonstrating superior efficiency in Arabic sign alphabet recognition. The advantages of this approach include high classification accuracy, the effective integration of deep learning as well as traditional models, and the ability to optimize model performance using advanced metaheuristic algorithms. Additionally, the hybrid approach improves generalization and robustness across different sign variations. However, some limitations exist, such as increased computational complexity due to multiple models, potential overfitting when tuning optimization parameters, and the requirement for substantial training time and resources.

In Ref. [21], the authors have developed a structured framework for the recognition and analysis of sign language gestures, aiming to bridge the communication gap between individuals with hearing impairments and those unfamiliar with sign language. The system consists of multiple processing modules, starting with noise removal using an adaptive filter to eliminate background disturbances and enhance gesture clarity. Next, segmentation is performed using a region-growing algorithm, which effectively isolates hand gestures from the background, improving feature extraction accuracy. The third stage involves feature extraction using an improved Genetic Algorithm, which optimizes feature selection by reducing redundancy and enhancing recognition performance. Finally, the system’s effectiveness is evaluated by comparing it with a Support Vector Machine (SVM) classifier, ensuring a robust performance assessment. The pros of this approach include improved gesture recognition accuracy, efficient noise reduction, optimized feature selection, and objective performance evaluation against a standard classifier. However, the cons involve high computational complexity, sensitivity to environmental variations such as lighting and background noise, dependency on a diverse training dataset, and potential processing delays due to Genetic Algorithm-based feature extraction, which may limit real-time applicability.

In Ref. [22], the authors presented AEGWO-Net, an advanced technique that integrates machine learning and swarm intelligence for improved gesture recognition. The method begins with feature extraction using the Histogram of Oriented Gradients (HOG) approach, followed by dimensionality reduction through an unsupervised autoencoder to retain essential information while reducing computational complexity. Next, an enhanced Grey Wolf Optimization (GWO) algorithm refines the feature set, ensuring optimal selection. Finally, a handcrafted artificial neural network (ANN) classifier is employed to perform classification. The effectiveness of AEGWO-Net was extensively evaluated on six diverse datasets (ASL, ASL MNIST, ISL, ArSL, MNIST Digits, and IEEE-ISL), covering various sign languages. Experimental results demonstrate its superiority over PCA-IGWO and KPCA-IGWO, with accuracy and F1-score improvements of 6% and 4%, respectively. The model achieves 98.40% accuracy, 96.59% F1-score, 97.14% MCC, and 96.21% AUC, showcasing its robustness and generalizability even with reduced feature dimensions. The pros of AEGWO-Net include high classification accuracy, effective feature selection, and strong generalization across multiple datasets. Additionally, its dimensionality reduction mechanism helps maintain model efficiency. However, the method has some drawbacks, such as increased computational cost due to multiple optimization steps, a longer training time compared to traditional classifiers, and sensitivity to hyperparameter tuning in the autoencoder and GWO algorithm.

The study presented in Ref. [23] introduces a hybrid deep Recurrent Neural Network (RNN) integrated with the Chaos Game Optimization (CGO) algorithm, termed RNN-CGO, for efficient hand gesture recognition. The primary goal of RNN-CGO is to accurately recognize alphabet signs from 2D gesture images, utilizing a structured pipeline comprising preprocessing, feature extraction, feature selection, and classification. The approach was implemented on the American Sign Language (ASL) dataset using the Python (version 3.13) platform. Experimental results demonstrate exceptional performance, achieving 99.96% accuracy, 99.28% precision, 99.25% F1-score, and 99.28% recall, with an inference time of just 0.121 s, making it highly efficient. The advantages of RNN-CGO include high accuracy, fast inference time, and reduced computational complexity compared to existing models. Additionally, the CGO algorithm enhances feature selection, leading to better classification performance. However, some drawbacks exist, such as potential sensitivity to hyperparameter tuning, dependence on dataset quality, and limited scalability to large-scale real-time applications due to the recurrent nature of RNNs.

In Ref. [24], the authors presented a three-module framework for isolated sign recognition, leveraging different sequence models. A key challenge in using HMMs for this task is the high dimensionality of deep features, which the authors address by introducing two CNN-based architectures for effective feature dimension reduction. Their experiments demonstrated that by combining pretrained ResNet50 features with one of these CNN-based reduction models, HMMs achieve 90.15% accuracy on the Montalbano dataset using RGB and skeletal data, comparable to state-of-the-art LSTM models. The advantage of HMMs lies in their lower parameter count and efficient training, enabling deployment on commodity hardware without requiring GPUs. However, while HMMs offer computational efficiency, deep sequence models like LSTMs may still capture complex temporal dependencies more effectively, potentially improving recognition in more challenging cases.

The work presented in Ref. [25] introduces a novel approach for dynamic hand gesture recognition using Hidden Markov Models (HMMs) to detect English alphabet letters by analyzing hand movement trajectories. The method begins with skin color-based segmentation to isolate the hand in video frames, followed by morphological operations to refine gesture paths. Hand-tracking and trajectory smoothing, utilizing techniques like the Kalman filter, ensure accurate motion capture. The processed movements are then converted into quantized sequences and analyzed using the Baum–Welch Re-estimation Algorithm, with a maximum likelihood classifier identifying the most probable letter. Their approach offers significant advantages, including real-time recognition, improved accuracy over traditional methods, and enhanced capability to distinguish complex gestures, making it valuable for sign language recognition. However, potential limitations include sensitivity to lighting variations and skin color segmentation accuracy, which may affect performance in diverse environments.

In Ref. [26], the authors proposed an automated Indian Sign Language Recognition (ISLR) method for the English alphabet using an optimized hand segmentation approach. The Grasshopper Optimization Algorithm (GOA) is applied based on a skin color model to segment the hand region, with its effectiveness compared against two alternative techniques: Particle Swarm Optimization-based SCDA (PSO-SCDA) and Artificial Bee Colony-based SCDA (ABC-SCDA). A dataset of hand gestures representing distinct letters of the English alphabet is then created, and a template-based matching approach is used for recognition. Classification is performed using both a Support Vector Machine (SVM) and a Convolutional Neural Network (CNN), with the CNN achieving superior performance (99.2% accuracy and 81.8% precision) and outperforming the SVM. Among the segmentation techniques, GOA-SCDA yielded the highest recognition accuracy at 97.85%, surpassing PSO-SCDA (89.29%) and ABC-SCDA (93.96%). The strengths of this approach include its high accuracy, robust segmentation, and superior classification performance, particularly with the CNN. However, the potential drawbacks include increased computational complexity due to optimization techniques and possible sensitivity to variations in skin tone and lighting conditions.

2.1. Research Gap

HMMs face several challenges in sign language recognition, particularly in parameter configuration and tuning. The performance of HMMs heavily depends on selecting optimal values for parameters such as the number of hidden states, observation probabilities, transition probabilities, and initial state distributions. Traditional methods for tuning these parameters, such as the Baum–Welch Algorithm, often suffer from slow convergence and local optima issues, leading to suboptimal recognition performance. Additionally, manually determining the best number of states for different gestures can be difficult, as an inadequate number of states may fail to capture gesture variations, while too many states can lead to overfitting and excessive computational complexity. These challenges make it crucial to adopt more advanced optimization techniques to improve the effectiveness of HMMs in sign language recognition. Advanced GA techniques offer a promising solution to these issues by automating the optimization of HMM parameters. GAs can efficiently explore large search spaces by evolving a population of potential parameter sets through selection, crossover, and mutation operations. By defining an objective function—such as maximizing recognition accuracy or minimizing error rates—GA can iteratively refine HMM parameters to achieve optimal performance. These evolutionary algorithms enable more adaptive and efficient parameter tuning, making HMMs more robust for sign language recognition, especially in handling complex and highly variable gestures.
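The selection, crossover, and mutation cycle described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation. It evolves the transition matrix of a plain (fully observed) Markov chain rather than a full HMM, and it uses the log-likelihood of toy sequences as a stand-in objective instead of recognition accuracy. The operator choices (tournament selection, row-wise crossover, Gaussian mutation with renormalization, single-elite survival) and all parameter values are hypothetical.

```python
import math
import random


def random_matrix(n, rng):
    """Random row-stochastic matrix with strictly positive entries."""
    rows = []
    for _ in range(n):
        raw = [rng.random() + 1e-9 for _ in range(n)]
        total = sum(raw)
        rows.append([x / total for x in raw])
    return rows


def log_likelihood(matrix, sequences):
    """Stand-in objective: log-likelihood of the observed transitions."""
    return sum(math.log(matrix[a][b])
               for seq in sequences for a, b in zip(seq, seq[1:]))


def tournament(population, scores, rng, k=2):
    """Selection: return the fitter of k randomly drawn individuals."""
    picks = rng.sample(range(len(population)), k)
    return population[max(picks, key=lambda i: scores[i])]


def crossover(a, b, rng):
    """Row-wise uniform crossover; whole rows are copied, so rows stay normalized."""
    return [list(ra) if rng.random() < 0.5 else list(rb) for ra, rb in zip(a, b)]


def mutate(matrix, rng, rate=0.2, scale=0.1):
    """Perturb some rows with Gaussian noise, clip to positive, renormalize."""
    out = []
    for row in matrix:
        if rng.random() < rate:
            noisy = [max(p + rng.gauss(0.0, scale), 1e-6) for p in row]
            total = sum(noisy)
            row = [p / total for p in noisy]
        out.append(list(row))
    return out


def evolve(sequences, n_states, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    population = [random_matrix(n_states, rng) for _ in range(pop_size)]
    best, best_score = None, -math.inf
    for _ in range(generations):
        scores = [log_likelihood(m, sequences) for m in population]
        for m, s in zip(population, scores):
            if s > best_score:
                best, best_score = m, s
        children = [best]  # elitism: carry the best matrix found so far
        while len(children) < pop_size:
            pa = tournament(population, scores, rng)
            pb = tournament(population, scores, rng)
            children.append(mutate(crossover(pa, pb, rng), rng))
        population = children
    return best, best_score


toy_sequences = [[0, 1, 2, 2], [0, 1, 1, 2], [0, 0, 1, 2]]
best_matrix, best_score = evolve(toy_sequences, n_states=3)
```

Because the best individual is carried forward unchanged, the best score never decreases across generations. Nothing in this loop preserves population diversity, which is the gap that niching methods target.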

2.2. The Need to Extend the Related Work

In the context of parameter tuning for machine learning models, such as HMMs in SLR, Niching Algorithms can significantly improve the adaptability of GAs. Niching Algorithms play a crucial role in enhancing the performance of GAs by promoting population diversity and preventing premature convergence. Traditional GAs often suffer from the tendency to converge too quickly to a single solution, especially in complex optimization problems with multiple local optima. This premature convergence leads to suboptimal results, as the algorithm loses the ability to explore other potentially better solutions in the search space. Niching techniques, such as fitness sharing, crowding, and speciation, help maintain diverse subpopulations within the GA by encouraging the survival of multiple solutions. By preserving diversity, Niching Algorithms ensure that the GA can effectively explore different regions of the solution space, leading to a more robust optimization process.
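As an illustration of one of the niching techniques named above, the sketch below implements classical fitness sharing: each individual's raw fitness is divided by a niche count that grows with the number of close neighbours. The L1 distance, the sharing radius `sigma_share`, and the toy population are assumptions chosen for illustration. Raw fitness values are assumed to be non-negative.

```python
def l1_distance(a, b):
    """Sum of absolute coordinate differences between two individuals."""
    return sum(abs(x - y) for x, y in zip(a, b))


def shared_fitness(population, raw_fitness, sigma_share=0.5, alpha=1.0):
    """Fitness sharing: divide raw fitness by the niche count.

    The niche count sums a triangular sharing function over all individuals
    closer than sigma_share. It includes the individual itself (distance 0
    contributes 1), so it is always at least 1.
    """
    shared = []
    for i, ind in enumerate(population):
        niche_count = 0.0
        for other in population:
            d = l1_distance(ind, other)
            if d < sigma_share:
                niche_count += 1.0 - (d / sigma_share) ** alpha
        shared.append(raw_fitness[i] / niche_count)
    return shared


# Two near-duplicate individuals and one isolated individual, equal raw fitness.
population = [[0.0], [0.1], [1.0]]
raw = [1.0, 1.0, 1.0]
shared = shared_fitness(population, raw)
```

In this toy case the two crowded individuals are down-weighted while the isolated one keeps its full fitness, so selection pressure shifts toward less explored regions.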

While the integration of GAs with Markov Chains (MCs) has been previously explored in various optimization contexts, the novelty of our work lies not in the mere combination of these two methods, but in the strategic incorporation of Context-Based Clearing (CBC) within a Niching Genetic Algorithm (NGA) framework tailored specifically for subject-independent Sign Language Recognition (SLR). Prior research predominantly focuses on GA-MC hybrids for generic optimization or simple gesture recognition tasks without addressing the critical challenge of signer variability—a key issue in subject-independent SLR.

Our proposed approach introduces a structurally adaptive Markov Chain, where the transition probabilities are dynamically optimized by an NGA to reflect the diverse gesture patterns of different users. The CBC mechanism plays a crucial role here: it maintains genetic diversity during evolution by penalizing similar solutions within the same context or niche, effectively avoiding premature convergence—a limitation observed in both standard GAs and GA-MC hybrids. This ensures that multiple gesture transition models can evolve in parallel, each capturing different signer-specific dynamics, which is not addressed in prior GA-MC integrations.

3. Methodology

The problem of subject-independent SLR involves recognizing sign gestures accurately across different users despite variations in signing styles, hand shapes, and movement patterns. Conventional Markov Chain models struggle with this due to limited adaptability to inter-user variations. The goal is to develop an adaptive framework that optimizes Markov Chains using an NGA to enhance recognition accuracy and robustness. Given a sequence of observed gestures, $O = \{o_{1}, o_{2}, \dots, o_{T}\}$, where $T$ is the length of the sequence, the objective is to determine the most probable sign, $S$, from a predefined set of signs, $S$, while minimizing misclassification rates and maximizing generalization across different users. A Markov Chain is represented as a tuple [7,16,24,25]:

$$M = (S, P, A) \tag{1}$$

where $S = \{s_{1}, s_{2}, \dots, s_{N}\}$ is the set of $N$ discrete states corresponding to gestures, $P = \{p_{ij}\}$ represents the transition probability matrix, where $p_{ij} = P(s_{j} \mid s_{i})$ denotes the probability of transitioning from state $s_{i}$ to state $s_{j}$, and $A = \{a_{i}\}$ represents the observation probabilities that map each state to an observed gesture sequence. The probability of a given gesture sequence, $O$, occurring given a model, $M$, is given by the following:

$$P(O \mid M) = P(o_{1}) \prod_{t = 2}^{T} P(o_{t} \mid o_{t - 1}) \tag{2}$$
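Equation (2) can be evaluated directly once the observations are mapped to state indices. The sketch below is a minimal illustration under that assumption. It works in log space to avoid numerical underflow on long sequences and classifies a sequence by picking the sign model with the highest likelihood. The initial distribution `initial`, the two toy sign models, and their names are hypothetical. All probabilities are assumed to be strictly positive so that the logarithm is defined.

```python
import math


def sequence_log_likelihood(obs, initial, transition):
    """log P(O | M) = log P(o_1) + sum over t >= 2 of log P(o_t | o_{t-1})."""
    log_p = math.log(initial[obs[0]])
    for prev, cur in zip(obs, obs[1:]):
        log_p += math.log(transition[prev][cur])
    return log_p


def classify(obs, models):
    """Return the name of the sign model under which obs is most likely."""
    return max(models,
               key=lambda name: sequence_log_likelihood(obs, *models[name]))


# Hypothetical per-sign models: (initial distribution, transition matrix).
models = {
    "sign_A": ([0.8, 0.1, 0.1],
               [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.1, 0.1, 0.8]]),
    "sign_B": ([0.1, 0.1, 0.8],
               [[0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]),
}

label_forward = classify([0, 1, 2], models)   # follows sign_A's preferred path
label_backward = classify([2, 1, 0], models)  # follows sign_B's preferred path
```

For the sequence `[0, 1, 2]`, the likelihood under `sign_A` is 0.8 × 0.8 × 0.8 = 0.512, against 0.1 × 0.1 × 0.1 = 0.001 under `sign_B`.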

Our goal is to optimize the transition probabilities, $p_{ij}$, to maximize the likelihood of correctly recognizing unseen gestures across different users. To improve the generalization and adaptability of the Markov Chain, we employ an NGA that optimizes the transition probability matrix, $P$, and structural parameters of the model. Each chromosome in the population represents a candidate transition probability matrix, $P$, encoded as follows [11,12]:

$$C = [p_{11}, p_{12}, \dots, p_{NN}] \tag{3}$$

where

p_{i j}

are elements of the transition matrix. The fitness function evaluates how well a given Markov Chain model generalizes across different users. The objective is to minimize the recognition error,

R E

, defined as follows:

R E = 1 - \frac{1}{M} \sum_{i = 1}^{M} {A c c}_{i}

(4)

where

{A c c}_{i}

is the classification accuracy for signer

i

, and

M

is the number of test users. Additionally, a regularization term is added to prevent overfitting:

F (C) = φ_{1} (1 - \mathrm{RE}) + φ_{2} D (P)

(5)

where

D (P)

represents the diversity measure of the transition probability matrix, and

φ_{1}

and

φ_{2}

are weighting parameters.

To maintain diversity in the population, we apply CBC, which prevents premature convergence by promoting diverse solutions. Let

C

be the set of candidate solutions in a niche, and define the dominance criterion as follows [27,28]:

D (C_{i}, C_{j}) = \sum_{k} | p_{i k} - p_{j k} |

(6)

where

D (C_{i}, C_{j})

represents the difference between two chromosomes. The clearing procedure assigns a fitness penalty to highly similar solutions to enforce diversity:

F' (C_{i}) = \begin{cases} F (C_{i}), & \text{if } C_{i} \text{ is unique in its niche} \\ 0, & \text{otherwise} \end{cases}

(7)

This ensures that the Markov Chain evolves with a broad range of potential solutions, improving its robustness to different signing styles. The proposed SLR system is comprehensively described in the following sections, outlining its key components and functionalities. Figure 1 visually represents the overall workflow of the system, illustrating each stage from input processing to final classification.

3.1. Preprocessing Phase

Preprocessing plays a crucial role in SLR by enhancing image quality, reducing noise, and standardizing variations in lighting, background, and contrast. It ensures that the hand gestures are clearly distinguishable from the background, making feature extraction more accurate and robust. Noise removal eliminates random variations, while contrast and brightness adjustments enhance visibility in different lighting conditions. Background removal isolates the hand from unwanted elements, preventing distractions that could impact classification accuracy. These preprocessing steps collectively improve the reliability and generalizability of SLR models, enabling them to work effectively across diverse environments and user variations [1,2,3,4,5]. Figure 2 outlines key preprocessing techniques applied to enhance the quality and clarity of sign language images.

3.1.1. Noise Removal (Denoising)

Gaussian filtering is used to reduce high-frequency noise while preserving important edges in an image. It operates by applying a weighted sum of neighboring pixel intensities using a Gaussian kernel, where the degree of smoothing is controlled by the standard deviation,

σ

. This method effectively removes small random variations in pixel intensity, ensuring smoother image regions while maintaining essential structural details. The filtering operation is represented as follows:

G (x, y) = \sum_{i = - k}^{k} \sum_{j = - k}^{k} I (x + i, y + j) \cdot W (i, j)

(8)

W (i, j) = \frac{1}{2 π σ^{2}} e^{- \frac{i^{2} + j^{2}}{2 σ^{2}}}

(9)

where

G (x, y)

is the denoised pixel value at coordinates

(x, y)

, and

k

represents the radius of the Gaussian kernel, which determines the size of the filter window.

I (x + i, y + j)

is the intensity of a neighboring pixel, and

W (i, j)

is the Gaussian kernel.

σ

controls the degree of smoothing (higher

σ

results in more blur).
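Equations (8) and (9) can be illustrated with a small pure-Python sketch. Two choices here are assumptions that the text does not specify: the truncated kernel is renormalized so that its weights sum to 1 (which keeps flat regions at their original intensity), and border pixels are replicated when the window extends past the image edge.

```python
import math

def gaussian_kernel(k, sigma):
    """(2k+1) x (2k+1) weights from Eq. (9), renormalized to sum to 1."""
    size = 2 * k + 1
    weights = [[math.exp(-((i - k) ** 2 + (j - k) ** 2) / (2 * sigma ** 2))
                / (2 * math.pi * sigma ** 2)
                for j in range(size)] for i in range(size)]
    total = sum(sum(row) for row in weights)
    return [[v / total for v in row] for row in weights]

def gaussian_filter(image, k, sigma):
    """Apply Eq. (8) to a 2D list of intensities, replicating border pixels."""
    height, width = len(image), len(image[0])
    kern = gaussian_kernel(k, sigma)
    out = [[0.0] * width for _ in range(height)]
    for x in range(height):
        for y in range(width):
            acc = 0.0
            for i in range(-k, k + 1):
                for j in range(-k, k + 1):
                    xi = min(max(x + i, 0), height - 1)  # clamp row index
                    yj = min(max(y + j, 0), width - 1)   # clamp column index
                    acc += image[xi][yj] * kern[i + k][j + k]
            out[x][y] = acc
    return out

kern = gaussian_kernel(k=1, sigma=1.0)
flat = [[100.0] * 5 for _ in range(5)]
smoothed = gaussian_filter(flat, k=1, sigma=1.0)  # a flat image stays flat
```

A production pipeline would normally call an optimized library routine instead; the loops above only make the weighted sum explicit.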

3.1.2. Lightening (Contrast Adjustment)

Linear intensity transformation adjusts image contrast by applying a linear function to each pixel’s intensity. The transformation uses a gain factor,

g

, to enhance contrast and a bias term,

b

, to modify brightness. This process makes the hand sign more distinguishable and helps normalize variations in lighting conditions.

I' (x, y) = g I (x, y) + b

(10)

where

I' (x, y)

is the adjusted pixel intensity,

g

(gain factor) controls contrast enhancement, and

b

(bias term) adjusts brightness.
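As a brief worked example of Equation (10), the sketch below applies a gain and a bias to a tiny grayscale image. Clipping the result to the 8-bit range [0, 255] is an assumption added here, since the equation itself places no bound on the output; the gain and bias values are arbitrary.

```python
def adjust_contrast(image, gain, bias):
    """Eq. (10) per pixel, rounded and clipped to the 8-bit range [0, 255]."""
    return [[min(255, max(0, round(gain * p + bias))) for p in row]
            for row in image]

img = [[10, 100], [200, 250]]
out = adjust_contrast(img, gain=1.2, bias=10)
# The brightest pixel would map to 310 and is therefore clipped to 255.
```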

3.1.3. Brightness Adjustment (Adaptive Histogram Equalization)

Contrast Limited Adaptive Histogram Equalization (CLAHE) enhances local contrast while preventing excessive noise amplification by redistributing pixel intensity values. It uses the cumulative distribution function (CDF) to adjust brightness, ensuring a balanced intensity distribution across the image. This method helps prevent underexposed or overexposed regions, making the hand sign more clearly visible.

I' (x, y) = \frac{\mathrm{CDF} (x, y) - \mathrm{CDF}_{min}}{\mathrm{CDF}_{max} - \mathrm{CDF}_{min}} \times 255

(11)

where

I' (x, y)

is the new pixel intensity,

\mathrm{CDF} (x, y)

is the cumulative distribution function (CDF) of the pixel, and

\mathrm{CDF}_{min}

and

\mathrm{CDF}_{max}

are the minimum and maximum CDF values, respectively.
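The remapping in Equation (11) can be sketched on a flat list of pixel intensities. Note that this sketch shows only the CDF-based remapping applied to a whole image; the tiling, clip limit, and interpolation between tiles that distinguish CLAHE from plain histogram equalization are deliberately left out, and the sample intensities are made up for illustration.

```python
def equalize(pixels, levels=256):
    """Histogram equalization of integer intensities using the mapping of Eq. (11)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)  # CDF at the darkest occupied level
    cdf_max = cdf[-1]                        # equals the number of pixels
    if cdf_max == cdf_min:                   # constant image: nothing to stretch
        return list(pixels)
    return [round((cdf[p] - cdf_min) / (cdf_max - cdf_min) * (levels - 1))
            for p in pixels]

dark = [50, 50, 51, 52, 52, 52, 53, 60]  # narrow, underexposed intensity range
stretched = equalize(dark)               # now spans the full 0..255 range
```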

3.1.4. Background Removal

Color segmentation in the HSV (hue, saturation, and value) space extracts the hand region by applying predefined thresholds to the hue, saturation, and value components of each pixel. Morphological operations, such as erosion and dilation, refine the segmentation by removing noise and enhancing the detected hand area. This process isolates the hand from the background, eliminating distractions and improving recognition accuracy. The hand region is extracted based on predefined HSV thresholds:

M (x, y) = \begin{cases} 1, & \text{if } H_{min} \leq H (x, y) \leq H_{max} \text{ and } S_{min} \leq S (x, y) \leq S_{max} \text{ and } V_{min} \leq V (x, y) \leq V_{max} \\ 0, & \text{otherwise} \end{cases}

(12)

where

M (x, y)

is a binary mask,

H (x, y), S (x, y), V (x, y)

are the hue, saturation, and value components of the pixel, respectively, and

H_{min}, H_{max}, S_{min}, S_{max}, V_{min}, V_{max}

are empirically determined threshold values. Erosion morphological operation,

E

, is utilized to remove small noise elements, while dilation,

D

, expands the detected hand region.

E (I) = I ⊖ K

(13)

D (I) = I \oplus K

(14)

where

I

is the input binary image,

K

is the structuring element, and ⊖, ⊕ denote erosion and dilation, respectively.
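Equations (12) to (14) can be illustrated with the short pure-Python sketch below. The threshold ranges, the 3 × 3 square structuring element, and the treatment of out-of-image pixels as background are all assumptions made for the example. Note also that `colorsys` returns hue, saturation, and value in [0, 1], which differs from the 8-bit ranges used by image libraries such as OpenCV.

```python
import colorsys

def hsv_mask(rgb_image, h_rng, s_rng, v_rng):
    """Eq. (12): 1 where H, S and V all fall inside their (min, max) ranges."""
    mask = []
    for row in rgb_image:
        out_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            inside = (h_rng[0] <= h <= h_rng[1]
                      and s_rng[0] <= s <= s_rng[1]
                      and v_rng[0] <= v <= v_rng[1])
            out_row.append(1 if inside else 0)
        mask.append(out_row)
    return mask

def _morph(mask, combine):
    """3x3 neighbourhood pass; pixels outside the image count as 0."""
    height, width = len(mask), len(mask[0])
    def at(x, y):
        return mask[x][y] if 0 <= x < height and 0 <= y < width else 0
    return [[combine(at(x + i, y + j) for i in (-1, 0, 1) for j in (-1, 0, 1))
             for y in range(width)] for x in range(height)]

def erode(mask):   # Eq. (13): keep a pixel only if its whole window is set
    return _morph(mask, min)

def dilate(mask):  # Eq. (14): set a pixel if any pixel in its window is set
    return _morph(mask, max)

# Hypothetical thresholds: one skin-like pixel and one blue background pixel.
pixels = [[(220, 170, 140), (0, 0, 255)]]
skin = hsv_mask(pixels, h_rng=(0.0, 0.14), s_rng=(0.2, 0.7), v_rng=(0.35, 1.0))

# A 3x3 hand blob plus one isolated noise pixel at row 3, column 5.
noisy = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
opened = dilate(erode(noisy))  # erosion removes the speck, dilation restores the blob
```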

3.2. Feature Extraction Phase

Existing methods for feature extraction in SLR largely rely on hand-crafted features, deep learning-based approaches, or a combination of both [29,30,31]. Hand-crafted methods (e.g., Histogram of Oriented Gradients (HOG), optical flow, or geometric features like hand trajectory and joint angles) are computationally efficient and interpretable, but often lack robustness to variations in signer styles, lighting, and background noise. Deep learning-based methods, such as CNNs or RNNs (often combined with pose estimation tools like OpenPose), offer superior accuracy by learning hierarchical and discriminative features directly from raw video or skeletal data. However, these methods are computationally expensive and require large labeled datasets, making them less ideal for real-time or low-resource environments. Hybrid approaches attempt to balance these trade-offs but may still inherit some complexity or require fine-tuning across datasets.

For lightweight and efficient feature extraction in SLR, pose-based methods using keypoints from skeleton data are highly recommended. These methods significantly reduce input dimensionality by focusing only on essential joints (hands), which are crucial for gestures. They offer a good trade-off between speed and accuracy and are suitable for real-time applications, mobile devices, or embedded systems. Additionally, combining these skeletal features with simple temporal descriptors (e.g., velocity or joint angle changes over time) can further enhance performance without introducing substantial computational overhead. Let

C_{i}^{t} = (x_{i}^{t}, y_{i}^{t})

be the coordinates of the

i^{t h}

keypoint at time

t

, where

i = 1,2, \dots, N

and

N

is the total number of keypoints (e.g., 21 per hand in MediaPipe Hand). A single gesture frame at time

t

can be represented by the feature vector:

F^{t} = [C_{1}^{t}, C_{2}^{t}, \dots, C_{N}^{t}] \in \mathbb{R}^{2 N}

(15)

To enhance discriminability, the following derived features can be included:

-: Joint displacements:

\Delta C_{i}^{t} = C_{i}^{t} - C_{i}^{t - 1}

(16)

-: Inter-joint distances:

d_{i, j}^{t} = \| C_{i}^{t} - C_{j}^{t} \|_{2}

(17)

-: Joint angles (for limbs):

θ_{i, j, k}^{t} = \cos^{- 1} \left( \frac{(C_{i}^{t} - C_{j}^{t}) \cdot (C_{k}^{t} - C_{j}^{t})}{\| C_{i}^{t} - C_{j}^{t} \| \, \| C_{k}^{t} - C_{j}^{t} \|} \right)

(18)

In our work, MediaPipe Hands is used to extract 21 keypoints per hand in 2D, so that

F^{t} \in \mathbb{R}^{2 \times 21} = \mathbb{R}^{42}

per hand. Adding all pairwise inter-joint distances among the 21 keypoints contributes

\binom{21}{2} = 210

distances per hand, and the joint angles contribute a further 19 angles per hand. Thus, a complete feature vector per gesture frame can range from 42 (raw) to over 250 features depending on the inclusion of distances, angles, and temporal derivatives. These vectors can then be passed into a Markov Chain model for recognition tasks.
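The feature construction of Equations (15) to (18) can be sketched as follows. The function name `frame_features`, the synthetic keypoints, and the single angle triple are illustrative assumptions; in particular, the text does not list which 19 joint triples are used for the angles, so the caller supplies them here.

```python
import math
from itertools import combinations

def frame_features(curr, prev, angle_triples):
    """Build per-frame features from 2D keypoints.

    curr, prev: lists of (x, y) keypoints at times t and t-1.
    angle_triples: (i, j, k) index triples; the angle is measured at joint j.
    """
    raw = [c for point in curr for c in point]                         # Eq. (15)
    disp = [c - p for a, b in zip(curr, prev) for c, p in zip(a, b)]   # Eq. (16)
    dist = [math.dist(curr[i], curr[j])                                # Eq. (17)
            for i, j in combinations(range(len(curr)), 2)]
    angles = []
    for i, j, k in angle_triples:                                      # Eq. (18)
        u = (curr[i][0] - curr[j][0], curr[i][1] - curr[j][1])
        v = (curr[k][0] - curr[j][0], curr[k][1] - curr[j][1])
        norm = math.hypot(*u) * math.hypot(*v)
        # Coincident keypoints give a zero norm; treat the angle as 0 then.
        cos_t = (u[0] * v[0] + u[1] * v[1]) / norm if norm else 1.0
        angles.append(math.acos(max(-1.0, min(1.0, cos_t))))
    return raw, disp, dist, angles

# 21 synthetic keypoints standing in for one detected hand.
curr = [(0.01 * i, 0.02 * (i % 5)) for i in range(21)]
prev = [(x - 0.001, y) for x, y in curr]
raw, disp, dist, angles = frame_features(curr, prev, [(0, 1, 2)])

# A right angle at the middle joint, as a sanity check of Eq. (18).
corner = [(1.0, 0.0), (0.0, 0.0), (0.0, 1.0)]
right_angle = frame_features(corner, corner, [(0, 1, 2)])[3][0]
```

With 21 keypoints the sketch yields 42 raw coordinates, 42 displacements, and 210 pairwise distances, matching the counts given above.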

The concern that the MediaPipe hand-tracking pipeline may fail under occlusion or with non-standard hand poses is valid and relevant for general-purpose applications. However, in the context of this specific SLR application, several factors mitigate this concern. First, MediaPipe uses a two-stage pipeline (a palm detector followed by a hand landmark model), which helps improve robustness under mild occlusion. The palm detector localizes the hand region even when parts of the fingers are occluded, and the landmark estimation stage uses a regression model trained on large-scale datasets with varied poses to estimate missing keypoints. Furthermore, temporal smoothing and interpolation techniques used within MediaPipe allow for estimation continuity across frames, meaning brief occlusions (e.g., hands moving across the body) can be tolerated without significantly disrupting landmark tracking. This capability is especially helpful in SLR tasks, where rapid hand movements may briefly hide parts of the hand.

Moreover, in this particular application, the SLR dataset was carefully curated and collected under controlled conditions that minimize the risk of occlusions. The dataset used includes video sequences where the signer is typically in a well-lit environment, facing the camera, with unobstructed hand visibility and consistent gesture articulation. All sequences were manually verified during preprocessing to ensure that no significant occlusions or ambiguities in hand visibility were present. As a result, the use of MediaPipe in this setup remains highly effective, since the input conditions fall within its operating range. The clean, occlusion-free nature of the dataset ensures that landmark extraction remains accurate and consistent, and thus the concern regarding MediaPipe’s sensitivity to occlusion is not a practical limitation in this specific use case. Nonetheless, future work could explore combining MediaPipe with model-based tracking or depth-aware sensors for enhanced robustness in unconstrained environments.

3.3. SLR-Based Markov Chain Modeling

In gesture-based sign recognition, a Markov Process (MP) is used to model the sequence of feature vectors representing hand gestures. It encodes the temporal dependencies and variability in gesture execution, enabling the reliable recognition of dynamic signs over time [7,16]. A formal description of an MP is the tuple

(S, A, P, R, t)

:

-: $S$ (State Space): The state $s \in S$ represents the extracted hand pose information from a single image frame. Each state encodes key hand joint features derived from the skeleton data. The system state at decision time $t$ is defined as $Y (t, i) = (t, i, λ)$ , where $t$ is the current image frame index, $i \in [0, k]$ represents the segment or stage of gesture recognition (e.g., beginning, middle, and end), and $λ$ is the hand keypoint descriptor vector. This formulation captures the spatiotemporal context needed for gesture differentiation.
-: $A$ (Action Space): The action set $A$ defines the available recognition decisions at each state. It includes the following: (1) Recognize Gesture: classify the current sequence of hand poses into a known sign. (2) Continue Observation: wait for additional image frames to improve recognition confidence. (3) Segment Gesture: mark the end of the current gesture and reset for the next one. For each state, $s$ , a subset of these actions, $A_{s}$ , is available, and at time $t$ the system selects an action, $a \in A_{s}$ .
-: $P$ (Transition Probability): The transition probability function $P [s^{'} ∣ s, a]$ models the likelihood of moving from state $s$ to $s^{'}$ when action $a$ is taken. These transitions capture the probabilistic dynamics of hand gesture progression over image frames. For example, the probability of transitioning from a partially open hand to a fully open palm gesture may be high if such transitions commonly occur in the training data. The transition matrix incorporates uncertainty in gesture evolution and visual noise.
-: $R$ (Reward Function): The reward function $R : S \times A \to \mathbb{R}$ evaluates the effectiveness of decisions made at each state. A high reward is granted for correctly recognizing a gesture with minimal frames (high confidence, low latency). Penalties are imposed for incorrect recognition, unnecessary waiting (delays), or premature segmentation. Formally, $R (s, a, s^{'})$ is the scalar reward received for transitioning from state $s$ to $s^{'}$ after performing action $a$ , encouraging accurate and efficient classification.
-: $t$ (Decision Epochs): Decision epochs are the discrete time steps, $t$ (i.e., image frames), at which the system evaluates the current hand pose and selects an action. Each decision epoch corresponds to a new candidate gesture segment. The system operates within a finite time horizon over the gesture sequence, making sequential decisions aimed at maximizing the total expected reward.

This MP formulation allows the hand-based SLR system to learn adaptive policies for gesture recognition directly from image sequences by employing spatial and temporal cues embedded in hand keypoints. It accommodates variability in hand shapes, speeds, and orientations, enabling robust and interpretable real-time recognition of sign language gestures.

3.4. Niche Genetic Algorithm (NGA) to Optimize the Transition Probabilities

The goal of this step is to optimize the transition probability matrix

P [s^{'} ∣ s, a] \in \mathbb{R}^{|S| \times |A| \times |S|}

of a Markov Process to improve the accuracy and robustness of gesture transitions in sign language recognition, i.e., learn transition probabilities that best capture gesture progression, phase changes, and hand pose dynamics.

Step 1: Chromosome Encoding (Genotype Representation)

Each individual in the population represents a set of transition probabilities:

\text{Individual } V = [p_{s_{1}, a, s'_{1}}, p_{s_{1}, a, s'_{2}}, \dots, p_{s_{n}, a, s'_{m}}]

(19)

where each

p_{s, a, s'} \in [0, 1]

and, for each

(s, a)

,

\sum_{s'} p_{s, a, s'} = 1

(stochastic constraint). Herein, the Dirichlet distribution for initial generation is employed to ensure valid probability vectors. The Dirichlet distribution is a probability distribution over a vector of probabilities, meaning that it generates a set of non-negative numbers that sum up to 1. It is widely used to model random probability vectors, such as rows in a transition matrix where each row represents probabilities of moving to possible next states.

The Dirichlet distribution is a multivariate probability distribution used to generate

k

-dimensional vectors

(x_{1}, x_{2}, \dots, x_{k})

, where each component

x_{i} \geq 0

and the sum of all components equals 1, making it ideal for modeling probability distributions over

k

categories (e.g., transitions in a Markov model). Each vector represents a valid probability assignment across the categories. The distribution is governed by a parameter vector

ω = (ω_{1}, ω_{2}, \dots, ω_{k})

, where each

ω_{i} > 0

, allowing control over the shape and concentration of the generated probabilities: smaller

ω_{i}

values tend to produce sparser distributions, while larger values lead to more uniform distributions.
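One standard way to draw from a Dirichlet distribution is to sample independent Gamma variates and normalize them. The sketch below uses this to initialize a row-stochastic transition matrix. It is a simplification under stated assumptions: the action dimension is omitted, a symmetric concentration value is used for every entry, and the helper names are hypothetical.

```python
import random

def sample_dirichlet(omega, rng):
    """Draw one probability vector from Dirichlet(omega) via normalized Gamma draws."""
    draws = [rng.gammavariate(w, 1.0) for w in omega]
    total = sum(draws)
    return [d / total for d in draws]

def random_transition_matrix(n_states, concentration, rng):
    """Row-stochastic matrix whose rows are independent symmetric Dirichlet draws."""
    return [sample_dirichlet([concentration] * n_states, rng)
            for _ in range(n_states)]

rng = random.Random(0)  # fixed seed only so the sketch is reproducible
matrix = random_transition_matrix(n_states=4, concentration=1.0, rng=rng)
```

Because every row is normalized at creation, each sampled individual satisfies the stochastic constraint without any repair step.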

Step 2: Fitness Function Design

Fitness evaluates how well a given transition matrix supports gesture recognition accuracy:

\text{Fitness} (V) = α \cdot \text{Accuracy} - β \cdot \text{Entropy} - γ \cdot \text{Latency}

(20)

where

\text{Accuracy}

is the correct gesture recognition rate using an MP with the individual’s transition matrix,

\text{Entropy}

is a penalty on overly uncertain transitions, and

\text{Latency}

is the number of frames taken to reach correct recognition. Accuracy is computed by running the MP with the individual’s transition matrix across a labeled gesture dataset and calculating the proportion of correctly recognized gestures out of the total number of gestures tested. Entropy is computed using the Shannon entropy formula for each row (state) of the transition matrix, reflecting the uncertainty in choosing the next state; higher entropy indicates more uniform or uncertain transitions, which are penalized. Latency is measured as the average number of frames the MP requires to correctly recognize a gesture, with lower latency being preferred as it indicates faster recognition.

Adjusting the weight parameters

α

,

β

, and

γ

in the fitness function allows for the prioritization of certain objectives during optimization. A higher

α

emphasizes recognition accuracy, making the algorithm favor transition matrices that improve classification rates. Increasing

β

penalizes transition matrices with high uncertainty, pushing the model towards clearer, more confident transitions. Raising

γ

discourages long delays in gesture recognition by rewarding lower latency. Choosing these weights depends on your application: real-time systems may prioritize low latency and certainty (higher

γ

and

β

), while offline analysis might emphasize accuracy (higher

α

). A grid search or adaptive tuning (e.g., multi-objective optimization) can be used to systematically identify the best combination of weights for your use case.
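A minimal sketch of Equation (20) is given below. The text computes Shannon entropy per row but does not state how rows are aggregated, so averaging over rows is an assumption here, as are the default weights and the two toy matrices; accuracy and latency are passed in as already-measured numbers.

```python
import math

def row_entropy(row):
    """Shannon entropy in bits of one transition row; 0 * log 0 is taken as 0."""
    return -sum(p * math.log2(p) for p in row if p > 0)

def fitness(accuracy, matrix, latency, alpha=1.0, beta=0.1, gamma=0.01):
    """Eq. (20), using the mean row entropy as the Entropy term (assumption)."""
    entropy = sum(row_entropy(r) for r in matrix) / len(matrix)
    return alpha * accuracy - beta * entropy - gamma * latency

confident = [[1.0, 0.0], [0.0, 1.0]]  # deterministic transitions, entropy 0
uniform = [[0.5, 0.5], [0.5, 0.5]]    # maximally uncertain, entropy 1 bit
score_confident = fitness(0.9, confident, latency=10)
score_uniform = fitness(0.9, uniform, latency=10)
```

At equal accuracy and latency, the uncertain matrix scores lower, which is the penalizing behavior the entropy term is meant to provide. Since accuracy lies in [0, 1] while latency is counted in frames, the weights also have to absorb the difference in scale.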

Step 3: Population Initialization

To initialize the population, randomly generate

N

individuals, where each individual represents a distinct candidate transition probability matrix for modeling gesture transitions in a Markov Process. Each row of an individual’s matrix corresponds to a state in the gesture recognition process and defines the probabilities of transitioning to all possible next states. As a result, each individual in the population encodes a complete and valid probabilistic model of how gestures can evolve over time, capturing different assumptions or hypotheses about the transition dynamics.

Step 4: Context-Based Clearing for Diversity Maintenance

Context-Based Clearing (CBC) is a diversity-preserving technique used within a Genetic Algorithm to prevent the population from converging prematurely to suboptimal or overly similar solutions, particularly in gesture recognition tasks [32]. Its main goal is to maintain behavioral diversity by ensuring that individuals occupying similar contexts—defined by gesture phase

i \in [0, k]

and the corresponding feature vector

λ

—do not dominate the population. CBC operates on the current population

P

, using a distance function,

d (\cdot, \cdot)

, to quantify similarity between individuals based on their behavior or model output. If individuals fall within a predefined niche radius,

δ,

indicating high similarity, only the best-performing one is retained, while others are cleared or penalized. This helps ensure that multiple diverse models—each capturing different phases, hand contexts, or transition patterns—are preserved across the evolutionary search process [33,34].

The CBC algorithm begins by sorting the population into descending order of fitness. Then, for each individual, it compares contextual similarity (using measures like KL-divergence or cosine distance) with all other individuals; if any are too similar (below a threshold

δ

), those similar individuals are cleared by setting their fitness to zero and marking them inactive. This ensures that only the most fit and diverse individuals remain active for reproduction, preserving niche diversity. Algorithm 1 elaborates the steps of the CBC process as follows:

Algorithm 1: CBC Steps

1. Sort the population by fitness (descending).

2. Iterate through each still-active individual $V_{j}$ in that order:

For each lower-ranked individual $V_{k} \neq V_{j}$ in the population:

Compute the contextual similarity distance $d (V_{j}, V_{k})$, i.e., the cosine distance between their phase-wise transition distributions.

If $d (V_{j}, V_{k}) < δ$ (i.e., they are too similar), clear $V_{k}$: set $\text{Fitness} (V_{k}) = 0$ and mark $V_{k}$ as inactive for reproduction.

3. Result: each niche retains only the best representative individual.
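The clearing steps of Algorithm 1 can be sketched in a few lines of Python. This is a simplified illustration rather than the exact implementation used in this study: each individual is represented by a flattened transition vector, the distance is taken as one minus the cosine similarity of those vectors, and only lower-ranked individuals can be cleared by a niche winner. The example vectors and fitness values are made up.

```python
import math

def cosine_distance(p, q):
    """1 - cosine similarity between two flattened transition vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm

def clear_population(individuals, fitnesses, delta):
    """Return (post-clearing fitnesses, active flags) after one clearing pass."""
    order = sorted(range(len(individuals)),
                   key=lambda i: fitnesses[i], reverse=True)
    cleared = list(fitnesses)
    active = [True] * len(individuals)
    for rank, j in enumerate(order):
        if not active[j]:
            continue                  # a cleared individual cannot win a niche
        for k in order[rank + 1:]:    # only lower-ranked individuals are cleared
            if active[k] and cosine_distance(individuals[j], individuals[k]) < delta:
                cleared[k] = 0.0
                active[k] = False
    return cleared, active

population = [
    [0.80, 0.20, 0.10, 0.90],  # niche A, best individual
    [0.79, 0.21, 0.12, 0.88],  # niche A, near-duplicate of the first
    [0.20, 0.80, 0.90, 0.10],  # niche B, a clearly different transition pattern
]
cleared, active = clear_population(population, [0.9, 0.85, 0.7], delta=0.2)
```

In this toy run the near-duplicate is cleared, while the weaker but distinct third individual survives as the representative of its own niche.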

Phase-wise transition distributions represent how likely it is for the system to move from one gesture phase to another (e.g., from “start” to “hold,” “hold” to “transition”) across different gesture executions. To quantify and compare these distributions across individuals or time steps, cosine similarity is often used, as it measures the angular similarity between two probability vectors regardless of their magnitude. For each gesture phase

i \in [0, k]

, the model constructs a transition probability vector,

p_{i}

, capturing the likelihoods of transitioning to other phases. To compare two such distributions—

p_{i}

from the current model and

q_{i}

from a reference or another individual—cosine similarity is computed as follows:

\text{Cosine\_sim} (p_{i}, q_{i}) = \frac{p_{i} \cdot q_{i}}{‖p_{i}‖ ‖q_{i}‖}

(21)

where

\cdot

denotes the dot product and

∥ \cdot ∥

is the Euclidean norm. This value ranges from 0 (completely different distributions) to 1 (identical in direction), and helps in identifying redundancy or diversity across gesture phases, guiding selection or clearing mechanisms like CBC.

Determining the threshold

δ

in the CBC Niching Algorithm is critical, as it directly controls the granularity of diversity preservation within the population. This threshold defines the niche radius, i.e., the contextual distance (e.g., cosine distance between phase-wise transition distributions) below which a less fit individual is cleared. Choosing a

δ

that is too large may result in excessive clearing, where even moderately different individuals are eliminated, leading to the loss of potentially valuable diversity and premature convergence. Conversely, setting

δ

too small may allow many similar individuals to coexist, undermining the algorithm’s ability to maintain distinct niches and explore varied regions of the search space. The proper calibration of

δ

enhances the model’s performance by balancing exploration (diversity) and exploitation (fitness), ultimately leading to more robust and generalized solutions across diverse problem landscapes.

For the SLR benchmark datasets, an optimal value for the clearing threshold,

δ

, is typically around 0.2 when cosine distance (one minus the cosine similarity of Equation (21)) is used to compare individuals’ phase-wise transition distributions. This value balances the trade-off between maintaining sufficient diversity and avoiding redundancy among gesture representations. Since many sign gestures exhibit subtle differences in motion and timing, a threshold of 0.2 ensures that only highly similar individuals (those likely to offer little new information) are cleared, while still preserving meaningful variation across niches. This enhances the model’s ability to generalize across different signers and sign styles, improving both recognition accuracy and robustness in subject-independent scenarios.

Step 5: Selection

After applying the CBC technique to ensure diversity by filtering out behaviorally similar individuals, tournament selection is employed to choose parents for the next generation. In this method, a small subset of individuals is randomly selected from the CBC-cleared population, and their fitness values are compared. The individual with the highest fitness in this mini-tournament is selected as a parent. This process is repeated until the required number of parents are chosen. Tournament selection ensures a balance between exploration (through random sampling) and exploitation (favoring fitter individuals), while the CBC step beforehand ensures that only diverse and representative candidates are considered for selection—preventing convergence to similar or locally optimal solutions.
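Tournament selection over the CBC-cleared population can be sketched as follows. The function signature is hypothetical. Cleared individuals are excluded through explicit `active` flags rather than by testing for zero fitness, because a fitness of the form in Equation (20) can itself be zero or negative for a legitimate individual.

```python
import random

def tournament_select(fitnesses, active, n_parents, tournament_size, rng):
    """Pick parent indices by repeated tournaments among active individuals."""
    pool = [i for i, is_active in enumerate(active) if is_active]
    if not pool:
        raise ValueError("no active individuals left after clearing")
    parents = []
    for _ in range(n_parents):
        size = min(tournament_size, len(pool))
        contenders = rng.sample(pool, size)  # sampled without replacement
        parents.append(max(contenders, key=lambda i: fitnesses[i]))
    return parents

rng = random.Random(1)  # fixed seed only so the sketch is reproducible
fitnesses = [0.9, 0.0, 0.7, 0.0, 0.6]
active = [True, False, True, False, True]
parents = tournament_select(fitnesses, active, n_parents=4,
                            tournament_size=2, rng=rng)
```

With a tournament size of 2, the weakest active individual (index 4) can never win, which illustrates how the tournament size sets the selection pressure.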

Step 6: Crossover and Mutation

In the evolutionary optimization process, crossover is performed using uniform crossover on the transition probability vectors, where each element of the offspring’s transition matrix is randomly selected from the corresponding elements of the two parent matrices with equal probability. This allows a diverse recombination of probabilistic gesture transitions. After crossover, each row of the transition matrix (representing the probabilities of transitioning from a current state to all possible next states) is normalized to ensure that it remains a valid probability distribution (i.e., the sum of each row equals 1). For mutation, a small Gaussian noise is added to randomly selected entries in the transition matrix to introduce variability and explore new transition dynamics. After mutation, the affected rows are again normalized to maintain stochastic consistency, ensuring that each transition row still represents a proper probability distribution over possible next states.
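The crossover and mutation operators described above can be sketched as follows. Clipping negative entries to zero after adding noise, and falling back to a uniform row if an entire row becomes zero, are assumptions added so that renormalization is always well defined; the mutation rate and noise scale are arbitrary example values.

```python
import random

def normalize_rows(matrix):
    """Rescale each row to sum to 1 (uniform fallback if a row is all zeros)."""
    out = []
    for row in matrix:
        total = sum(row)
        if total > 0:
            out.append([v / total for v in row])
        else:
            out.append([1.0 / len(row)] * len(row))
    return out

def uniform_crossover(parent_a, parent_b, rng):
    """Take each entry from either parent with probability 0.5, then renormalize."""
    child = [[a if rng.random() < 0.5 else b for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(parent_a, parent_b)]
    return normalize_rows(child)

def gaussian_mutation(matrix, rate, sigma, rng):
    """Add N(0, sigma^2) noise to each entry with probability `rate`,
    clip at 0, then renormalize every row."""
    mutated = [[max(0.0, v + rng.gauss(0.0, sigma)) if rng.random() < rate else v
                for v in row] for row in matrix]
    return normalize_rows(mutated)

rng = random.Random(42)  # fixed seed only so the sketch is reproducible
parent_a = [[0.8, 0.2], [0.3, 0.7]]
parent_b = [[0.5, 0.5], [0.9, 0.1]]
child = gaussian_mutation(uniform_crossover(parent_a, parent_b, rng),
                          rate=0.2, sigma=0.05, rng=rng)
```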

Step 7: Replacement and Iteration

The evolutionary process continues by evaluating the fitness of the newly generated offspring (from crossover and mutation) and comparing them to the current population. The least fit individuals, i.e., those with the lowest values of the defined fitness function (which considers accuracy, entropy, and latency), are replaced with the new, potentially superior offspring to maintain or improve overall population quality. This ensures that beneficial traits from high-performing transition models are propagated across generations. The algorithm then repeats the cycle from clearing and selection (Steps 4 and 5), continuing through crossover, mutation, evaluation, and replacement. This iterative process proceeds until convergence, which occurs when performance stabilizes across generations, or until a predefined maximum number of generations is reached, ensuring that computational limits are respected while allowing for the sufficient exploration of the search space.

In summary, in MP-based SLR, genes within a chromosome represent transition probabilities between gesture states. When gene associations are too strong—i.e., certain transition patterns consistently appear together—the GA risks overfitting to specific signing styles, reducing its ability to generalize across diverse users. The CBC mechanism addresses this by clearing overly similar individuals, thus minimizing gene association and encouraging genetic diversity. For example, rather than repeatedly evolving similar transition patterns like (

S_{1} \to S_{2} : 0.8, S_{2} \to S_{3} : 0.9

), CBC promotes varied combinations (e.g., 0.6–0.9 range), allowing the Markov model to flexibly capture a wider range of gesture dynamics. This leads to improved generalization, reduced overfitting, and better adaptability to different signing speeds and styles—enhancing overall recognition accuracy and real-world robustness of the SLR system.

4. Results and Discussion

The performance of the proposed Markov Chains integrated with a niching genetic algorithm for subject-independent sign language recognition was rigorously evaluated using gesture image datasets sourced from benchmark repositories [35,36], with a particular focus on the Arabic Sign Language (ArASL) dataset. This dataset is designed specifically for subject-independent evaluation and contains a comprehensive collection of over 1000 distinct gesture classes, each representing a letter or symbol in the Arabic manual alphabet. To capture intra-class variability and ensure robust generalization across diverse individuals, the dataset includes gesture samples contributed by 50 different signers. Each signer provides 10 samples per gesture, resulting in a rich, high-dimensional dataset that reflects realistic variations in signing styles, hand postures, orientations, and environmental conditions, such as lighting and background. Figure 3 visually presents example images from the ArASL alphabet, illustrating the diversity of hand configurations for each character.

The ArASL dataset used in this study consists of RGB-only gesture images and does not provide hand keypoints directly; therefore, hand keypoint extraction was performed during the preprocessing stage by using established computer vision techniques (MediaPipe in our case). These frameworks reliably detect and track 21 key landmarks on the hand from RGB images alone, enabling the transformation of raw visual data into structured pose representations suitable for gesture analysis. This preprocessing step allows the model to capture essential spatial features such as finger positioning, hand orientation, and articulation, which are critical for distinguishing among the over 1000 gesture classes contributed by 50 different signers in the dataset. While the model benefits from using these extracted keypoints for improved subject-independent recognition, it is also designed to operate directly on RGB inputs when needed. In such cases, convolutional or hybrid CNN-TCN architectures can be employed to learn spatial and temporal features from raw image data, though they may require more extensive training and computational resources to match the performance and generalization achieved with keypoint-based inputs. Thus, the proposed preprocessing and modeling framework is compatible with both keypoint-enhanced and RGB-only data, ensuring flexibility and robustness across different deployment scenarios.

For model training and evaluation, a carefully structured subject-independent data partitioning strategy was employed to prevent any signer overlap between training and testing phases. Specifically, five randomly selected samples per gesture from each signer were used for training. Two additional samples were reserved for a registered test group (i.e., signers previously encountered by the model), while the remaining three samples were used to assess the model’s ability to generalize for an unregistered test group (i.e., completely unseen signers). This split ensures comprehensive validation under realistic deployment conditions. The system was developed and executed using the Google Colab platform, leveraging Python (version 2.7) for coding and experimentation. For local testing and verification, the model was also evaluated on a Dell Inspiron N5110 laptop running a 64-bit Windows 7 operating system, equipped with 4 GB of RAM and an Intel Core i5-2410M processor clocked at 2.30 GHz.

The implementation utilizes a range of advanced Python libraries that have proven essential in the computer vision and machine learning domains. OpenCV (Open Source Computer Vision Library) was employed for fundamental image processing tasks, such as frame extraction from video streams, hand segmentation, and background subtraction. MediaPipe, a framework developed by Google, provided efficient real-time hand-tracking and landmark detection, enabling the precise identification of finger and hand positions across video frames. For building and training machine learning models, both TensorFlow and PyTorch (version 2.7.0) were integrated into the system. TensorFlow was primarily used for implementing deep learning architectures such as CNNs and RNNs for gesture classification, whereas PyTorch facilitated experimentation with more flexible neural network designs and dynamic computational graphs.

Experiment 1: Baseline Comparison Experiment

The primary objective of the first set of experiments is to rigorously evaluate the effectiveness of the proposed model against conventional models used in SLR. This includes benchmark models such as standard Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), and deep learning architectures like CNN-LSTM hybrids. By directly comparing its recognition performance against established techniques under identical conditions, this experiment provides empirical evidence for the advantages of combining Markov models with evolutionary optimization for modeling gesture dynamics. To ensure fair and consistent evaluation, all models are trained and tested on a standardized SLR dataset that includes a well-defined subject-independent split, ensuring no overlap of signers between the training and testing phases. This setup guarantees that performance improvements are attributable to the model’s generalization ability rather than overfitting to specific individuals. Each model undergoes the same preprocessing steps, such as hand segmentation, normalization, and feature extraction (e.g., spatial–temporal features or landmark-based descriptors), ensuring a consistent input space. The performance of each method is evaluated using multiple metrics: accuracy for overall recognition performance, and precision, recall, and F1-score to assess class-specific reliability. Additionally, subject-wise classification performance is analyzed to reveal how each model handles inter-subject variability, providing a deeper understanding of the proposed model’s adaptability in realistic SLR scenarios. Instead of aggregating predictions over all samples, subject-wise accuracy is measured per subject and these results are then summarized, giving insight into how consistent the model is across different people.
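The difference between pooled accuracy and subject-wise accuracy can be illustrated with a minimal sketch. The function name `subject_wise_accuracy`, the toy labels, and the choice of an unweighted mean as the summary are assumptions made for illustration; the paper does not state how the per-subject results are summarized.

```python
from collections import defaultdict
from statistics import mean


def subject_wise_accuracy(subject_ids, y_true, y_pred):
    """Return (accuracy per subject, unweighted mean of those accuracies)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, truth, pred in zip(subject_ids, y_true, y_pred):
        total[subject] += 1
        correct[subject] += int(truth == pred)
    per_subject = {s: correct[s] / total[s] for s in total}
    # Assumption: each signer contributes equally to the summary value.
    return per_subject, mean(per_subject.values())


# Toy data (hypothetical): signer "A" has four samples, signer "B" has two.
subjects = ["A", "A", "A", "A", "B", "B"]
y_true = ["alif", "ba", "ta", "alif", "ba", "ta"]
y_pred = ["alif", "ba", "ta", "alif", "ba", "alif"]

per_subject, subject_wise = subject_wise_accuracy(subjects, y_true, y_pred)
pooled = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(per_subject)   # accuracy for each signer
print(subject_wise)  # mean over signers
print(pooled)        # accuracy over all samples, dominated by signer "A"
```

In this toy case the pooled figure is higher than the subject-wise figure because the signer with more samples is also the easier one, which is the kind of masking that per-subject reporting is intended to expose.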

The comparative results in Table 2 clearly demonstrate the superior performance of the proposed Markov Chain with Niching Genetic Algorithm (MC-NGA) model across all evaluated metrics. While traditional models such as HMM, CRF, and CNN-LSTM provide reasonable performance, their accuracies plateau between 86.4% and 92.7%, with correspondingly lower precision and recall values. These baseline methods often struggle with variability in signing styles and inter-subject differences, leading to misclassifications and less robust generalization to unseen signers. For instance, the Hidden Markov Model (HMM) shows the lowest accuracy and subject-wise performance, reflecting its limited capability in modeling complex temporal dynamics and diverse gesture patterns.

Conditional Random Fields (CRFs) and CNN-LSTM models improve upon the HMM by leveraging contextual dependencies and deep learning’s representation power, respectively. However, CRFs are generally more suited to structured prediction but can be sensitive to feature engineering and may suffer from overfitting with high-dimensional data. Meanwhile, CNN-LSTM hybrids effectively capture spatiotemporal features but often require extensive training data and computational resources, and may still face challenges in generalizing across highly diverse signer populations. This is evident in the slightly lower subject-wise accuracy of 86.9%, indicating some difficulty in adapting to new or varied users.

In contrast, the MC-NGA model outperforms all comparative methods significantly, achieving a remarkable 96% accuracy and 92.7% subject-wise accuracy. This demonstrates its robust generalization capabilities, which stem from the integration of the Markov Chain’s sequence modeling strength with the Niching Genetic Algorithm’s ability to maintain population diversity via Context-Based Clearing (CBC). This evolutionary approach effectively prevents premature convergence and allows the model to learn multiple distinct gesture patterns, enhancing adaptability to signer variability. The higher precision and recall further indicate fewer false positives and false negatives, underscoring the model’s reliability and practical applicability in real-world, subject-independent sign language recognition systems.

To further strengthen the empirical comparison and underscore the robustness of the proposed MC-NGA model, we extend the evaluation to include Transformer-based architectures and Temporal Convolutional Networks (TCNs)—two advanced deep learning models that have recently shown strong performance in sequential and time series tasks, including gesture and sign language recognition. Transformers, originally introduced in NLP, are known for their self-attention mechanism, which enables the model to capture long-range dependencies without relying on recurrence. They are increasingly applied to visual and motion-based tasks due to their ability to process sequences in parallel and model temporal dynamics efficiently. On the other hand, TCNs leverage causal convolutions and dilations to model temporal information across varying timescales, offering a compelling alternative to recurrent architectures like LSTM. While both approaches are powerful, they can still suffer from overfitting, high data demands, and limited adaptability when faced with unseen subjects or noisy gesture execution styles.

The results reveal that Transformer and TCN architectures outperform traditional models like HMM and CRF, and even slightly edge out the CNN-LSTM baseline. The Transformer model achieves a strong 94.6% accuracy and 89.3% subject-wise accuracy, indicating its ability to generalize temporal representations effectively. The TCN, while computationally lighter, also shows solid performance with 93.4% accuracy and 88.1% subject-wise accuracy. However, both models still fall short of the proposed MC-NGA in handling inter-subject variability, as seen in the lower subject-wise accuracy values.

These results validate the hypothesis that evolutionary diversity maintenance mechanisms, such as those provided by Niching Genetic Algorithms with Context-Based Clearing, offer a distinct advantage in generalizing across varied signers. Unlike Transformers and TCNs that rely heavily on global optimization and large datasets, the MC-NGA explicitly maintains multiple diverse gesture models during training, improving its adaptability. The higher precision and recall metrics of MC-NGA also confirm reduced error rates, making it more suitable for real-time SLR applications, especially in subject-independent scenarios where signer variability poses a significant challenge.

Incorporating a Graph Convolutional Network (GCN)-based SLR model into the comparative analysis further reinforces the superior adaptability and robustness of the proposed MC-NGA framework. GCNs have recently emerged as powerful tools for modeling structured gesture data by capturing spatial and temporal dependencies over skeletal or keypoint graphs [37]. While the GCN-based model achieves a respectable overall accuracy of 94.1% and subject-wise accuracy of 88.6%, it still lags behind the MC-NGA in all major performance metrics. This performance gap can be attributed to GCNs’ dependence on predefined graph topologies and their limited adaptability to signer-specific gesture variations without extensive retraining. In contrast, the MC-NGA explicitly addresses subject-independence through its evolutionary learning mechanism, where diverse gesture patterns are maintained via Niching Genetic Algorithms and refined using Context-Based Clearing to avoid premature convergence. This allows the model to capture a richer spectrum of gesture dynamics, even in the presence of noise or user variability. The superior precision, recall, and F1-score further suggest that MC-NGA not only generalizes better across unseen signers but also minimizes both false positives and false negatives, ensuring more reliable predictions. Therefore, despite the strength of deep learning-based models like Transformers, TCNs, and GCNs, the proposed MC-NGA stands out as the most robust and generalizable approach for subject-independent sign language recognition.

Our proposed MC-NGA framework achieves superior generalization—particularly in subject-independent SLR—through its explicit evolutionary control over diversity and adaptability. Deep learning models often require large-scale, balanced datasets to generalize well; they are susceptible to overfitting, especially when encountering unseen signer variations or noisy gesture patterns, due to their tendency to memorize training-specific motion styles. Moreover, while LSTMs and Transformers capture long-term dependencies, they do so in a global, monolithic manner, which may overlook localized variations in gesture dynamics across different users. In contrast, our MC-NGA model evolves a population of diverse Markov Chains, each representing distinct gesture transition profiles. Through the integration of CBC within the Niching Genetic Algorithm, the framework maintains genetic diversity, ensuring it explores and retains multiple gesture interpretations rather than converging on a single mode of representation. This allows the model to better adapt to inter-user variability, achieving higher subject-wise accuracy despite the simpler structure of Markov Chains. However, we acknowledge that MCs inherently capture only first-order dependencies, and while our evolutionary optimization enhances their adaptability, they may not model deep temporal hierarchies as effectively as LSTM-based models in highly complex sequences.
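To make the first-order dependency concrete, the following minimal sketch scores a discrete state sequence under one Markov Chain per gesture class and picks the class with the highest log-likelihood. The three-state alphabet, the class names `sign_A` and `sign_B`, and all probabilities are hypothetical; how the authors define states and how the NGA encodes the transition matrices is not specified in this section.

```python
import math


def normalize_rows(matrix):
    """Scale each row of positive weights so that it sums to 1."""
    return [[w / sum(row) for w in row] for row in matrix]


def log_likelihood(sequence, initial, transition):
    """Log-probability of a state sequence under a first-order Markov Chain.

    Only the previous state influences the next one, which is the
    first-order limitation discussed in the text.
    """
    score = math.log(initial[sequence[0]])
    for prev, nxt in zip(sequence, sequence[1:]):
        score += math.log(transition[prev][nxt])
    return score


def classify(sequence, models):
    """Return the label whose chain assigns the sequence the highest score."""
    return max(models, key=lambda label: log_likelihood(sequence, *models[label]))


# Hypothetical models: (initial distribution, transition matrix) per class.
# In an evolutionary setting, the unnormalized weights would be the genes.
models = {
    "sign_A": ([0.8, 0.1, 0.1], normalize_rows([[1, 8, 1], [1, 1, 8], [1, 1, 8]])),
    "sign_B": ([0.1, 0.1, 0.8], normalize_rows([[8, 1, 1], [8, 1, 1], [1, 8, 1]])),
}

print(classify([0, 1, 2, 2], models))
print(classify([2, 1, 0, 0], models))
```

All weights are kept strictly positive so that no transition has zero probability; a practical system would need smoothing for the same reason.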

Table 3 reports 95% Confidence Intervals (CIs) for each metric, assuming performance was averaged over multiple runs (e.g., 10-fold cross-validation) and the sampling distribution of the means is approximately normal. The Confidence Interval is calculated as follows:

$$\mathrm{CI} = \mathrm{mean} \pm 1.96 \times \mathrm{SE}$$

where SE denotes the standard error. Incorporating 95% Confidence Intervals into the evaluation results provides a more comprehensive understanding of the statistical significance and reliability of the model comparisons. These intervals define the range within which the true mean performance metric is expected to lie with 95% certainty, based on observed variability across different experimental runs or signer subsets. Notably, the proposed MC-NGA model demonstrates both the highest performance and the tightest Confidence Intervals across all metrics, such as 96.0% accuracy with a CI of [95.0–97.0] and subject-wise accuracy of 92.7% with a CI of [91.1–94.3]. This indicates not only high central performance but also low variability and strong generalization, reinforcing the model’s robustness to signer diversity. In contrast, baseline models like HMM and CRF show much wider Confidence Intervals, such as HMM’s subject-wise accuracy range of [78.0–83.0], highlighting greater instability and susceptibility to inter-subject variability. Furthermore, there is no overlap between the MC-NGA’s Confidence Intervals and those of the lower-performing baselines, underscoring the statistical significance of its superior performance. These Confidence Intervals validate that MC-NGA’s advantage is not due to chance or overfitting, but rather reflects a consistent and replicable improvement in subject-independent sign language recognition.
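As a worked illustration of the interval formula above, the sketch below computes the mean plus or minus 1.96 standard errors from a list of per-fold scores. The fold accuracies are invented placeholder values, not results from this study, and the use of the sample standard deviation (n − 1 denominator) for the standard error is an assumption.

```python
from math import sqrt
from statistics import mean, stdev


def confidence_interval_95(scores):
    """Normal-approximation 95% CI: mean +/- 1.96 * standard error."""
    m = mean(scores)
    se = stdev(scores) / sqrt(len(scores))  # sample standard deviation / sqrt(n)
    half_width = 1.96 * se
    return m - half_width, m + half_width


# Hypothetical per-fold accuracies (%) from a 10-fold run; illustrative only.
fold_accuracy = [95.1, 96.4, 95.8, 96.9, 95.6, 96.2, 96.0, 95.3, 96.7, 96.0]

low, high = confidence_interval_95(fold_accuracy)
print(f"mean = {mean(fold_accuracy):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

With only ten folds, the Student t critical value (about 2.262 for nine degrees of freedom) would give a somewhat wider interval than the 1.96 used under the normality assumption stated in the text.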

The inclusion of p-values in the comparison table provides essential insight into the statistical significance of the performance improvements achieved by the proposed MC-NGA model over traditional baseline methods. Very small p-values (e.g., <0.0001 for HMM and CRF) indicate that the differences in performance metrics such as accuracy, precision, and subject-wise accuracy are highly statistically significant, meaning that the improvements are unlikely to be due to random variation. Even in the case of the more competitive CNN-LSTM model, the p-value of 0.0012 suggests that the superior performance of MC-NGA is statistically significant at the 99% Confidence Level. These results confirm that the observed gains are not only consistent across multiple test folds (as shown by the narrow Confidence Intervals) but also statistically reliable, validating the effectiveness of integrating Markov Chains with the NGA and CBC for robust, subject-independent sign language recognition.

Experiment 2: Ablation Study

The objective of this ablation study is to systematically evaluate the individual and integrated contributions of the NGA and the CBC technique to the overall performance of the proposed subject-independent SLR framework. To achieve this, four model variants are examined: the baseline Markov Chain model without any optimization, a version augmented with a basic GA, a variant incorporating an NGA without CBC to isolate the effect of advanced evolutionary optimization, and the complete model combining an NGA with CBC to promote solution diversity. This setup allows a controlled comparison that highlights the incremental benefits of each enhancement. The experimental configuration ensures that all variants are trained and tested under identical conditions, using the same dataset and preprocessing pipeline to guarantee fairness. Evaluation metrics include classification accuracy to assess recognition capability, error rate to quantify misclassifications, the diversity index of solutions to capture the richness of the search space explored by each model, and the number of generations to convergence to measure the efficiency of the optimization process. These metrics collectively provide deep insights into both the predictive power and optimization robustness of the different model configurations.

The results in Table 4 demonstrate a clear and consistent improvement in both classification accuracy and convergence efficiency as the model evolves from the baseline Markov Chain to the full proposed framework integrating an NGA and CBC. The baseline model, without any evolutionary optimization, achieves an accuracy of 82.4% and requires 150 generations to converge, reflecting limited adaptability and slower learning. Introducing a basic GA improves the accuracy to 87.7% and reduces the number of generations to 110, indicating that even simple evolutionary mechanisms enhance the model’s ability to generalize and optimize more efficiently by searching a wider solution space.

Further advances are observed with the NGA, which boosts accuracy to 91.9% and shortens convergence time to 85 generations. This improvement is attributed to the NGA’s more sophisticated mutation and selection strategies, enabling the better avoidance of local optima and a more effective exploration–exploitation balance. The full model combining NGA with CBC achieves the highest accuracy of 96.3%, the lowest error rate of 3.7%, and the fastest convergence at 65 generations. The CBC technique fosters solution diversity by maintaining multiple niche populations, preventing premature convergence and enabling the model to thoroughly explore the search space. This synergy between the NGA and CBC results in a robust, efficient optimization process that significantly enhances recognition accuracy and reduces computational effort, which is crucial for subject-independent SLR applications.

The Cross-Dataset Accuracy (%) metric provides critical insight into a model’s ability to generalize beyond the data it was trained on. Unlike subject-independent accuracy, which evaluates performance on unseen users from the same dataset, Cross-Dataset Accuracy tests the model on an entirely different dataset with varied signer characteristics, environmental conditions, and gesture styles. A higher value in this column indicates a stronger capacity for domain transfer, which is essential for the real-world deployment of sign language recognition (SLR) systems where training and operational data may come from different sources. Conversely, a drop in cross-dataset accuracy suggests overfitting to dataset-specific patterns and limited adaptability to new contexts. In the table, we observe a significant increase in cross-dataset accuracy from 68.9% in the baseline model to 85.1% in the proposed NGA + CBC model, highlighting its superior robustness and flexibility.

This superior generalization can be attributed to the architectural advantages of the NGA + CBC model. By integrating Niching Genetic Algorithms (NGAs) with Context-Based Clearing (CBC), the model maintains a diverse population of candidate solutions, allowing it to capture a broader range of gesture dynamics and signer behaviors during training. The CBC mechanism actively prevents convergence toward overly similar individuals by clearing out redundant solutions, thus preserving behavioral diversity. This diversity becomes particularly valuable when the model is exposed to unfamiliar data distributions in the cross-dataset scenario. Unlike traditional models (e.g., baseline Markov or GA-augmented versions), which tend to specialize in the source dataset, the NGA + CBC model learns more generalizable gesture phase patterns, making it highly effective when tested on new datasets. Therefore, the high cross-dataset accuracy not only confirms the model’s predictive strength but also justifies its design as a practical and scalable solution for real-world SLR systems.

To validate the Cross-Dataset Accuracy in the above experiment and assess the generalization capability of the proposed Markov Chain with Niching Genetic Algorithm (MC-NGA), an additional benchmark SLR dataset was employed—specifically, the LSA64 dataset, which focuses on Argentinian Sign Language. Unlike ArASL, which contains Arabic gestures contributed by 50 signers, LSA64 includes 3200 video samples across 64 static gestures recorded from 10 native signers, offering a distinct vocabulary, different signer population, and unique environmental conditions (e.g., camera angles, lighting setups). This dataset introduces meaningful domain shifts and signer-specific variability, making it a reliable testbed for cross-dataset evaluation. Its use ensures that improvements in performance are not limited to a single dataset or language system. The LSA64 dataset is publicly accessible and can be downloaded from https://facundoq.github.io/datasets/lsa64/ (accessed on 1 May 2025). By testing on LSA64 after training on ArASL, the experiment confirms the model’s ability to handle unseen gesture styles, signer behaviors, and visual contexts—thereby substantiating the reported gains in Cross-Dataset Accuracy.

Below is Table 5 with 95% Confidence Intervals (CIs) added for each performance metric. As the model evolves from the baseline to the full NGA + CBC integration, not only do the mean accuracy and error rate improve, but the Confidence Intervals also become narrower, indicating more consistent performance across multiple runs or signer subsets. For instance, the full model achieves an accuracy of 96.3% with a tight CI of [95.2–97.4], reflecting both high effectiveness and low variability. Conversely, the baseline Markov Chain model has a wider CI of [80.3–84.5], signaling greater instability and susceptibility to variance in signer inputs. Similarly, convergence speed improves markedly, with the full model reliably converging in 65 generations [62–68], compared to 150 [145–155] for the baseline. These narrow intervals for the full model underscore the reliability and reproducibility of its performance, validating the effectiveness of combining an NGA with CBC. Importantly, the non-overlapping Confidence Intervals between variants—especially between the full model and its predecessors—highlight the statistical significance of each added component (GA, NGA, and CBC), confirming that performance gains are not incidental but stem from systematic algorithmic improvements in evolutionary optimization and diversity preservation.

The inclusion of p-values offers a rigorous statistical validation of the improvements introduced by each model enhancement. The extremely low p-values (e.g., <0.0001 for both the baseline and GA-augmented models) indicate that the performance differences in accuracy, error rate, and convergence speed between these variants and the full NGA + CBC model are highly statistically significant, meaning the improvements are not due to chance. Even the more competitive NGA without CBC shows a p-value of 0.0003, confirming that adding the CBC technique leads to a significant performance gain, particularly in terms of both accuracy and convergence efficiency. These results reinforce that each step in the model’s evolution—from basic Markov Chains to genetic optimization and finally diversity-aware optimization using CBC—contributes measurably and reliably to enhancing the system’s generalization and robustness for subject-independent sign language recognition.

The diversity index (DI) measures how widely varied the candidate solutions are during the optimization process, serving as an important indicator of the model’s ability to thoroughly explore the solution space. A higher DI value signifies that the algorithm is successfully maintaining a rich variety of potential solutions rather than prematurely focusing on a narrow region, which increases the likelihood of finding globally optimal or near-optimal solutions. The CBC mechanism plays a pivotal role in this context by promoting competitive pressure among solutions, encouraging the retention and survival of diverse candidates within the population. This process prevents the dominance of similar solutions and helps preserve multiple niches within the search space, thereby expanding the algorithm’s exploratory capabilities. Notably, the full model, which combines the NGA with CBC, achieves a high DI, as shown in Table 6, while simultaneously converging more rapidly than other variants. This indicates an effective balance between exploration—searching broadly for promising solutions—and exploitation—refining the best candidates—resulting in an efficient optimization process that avoids stagnation and improves overall performance.
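The section does not give a formula for the DI, so the sketch below uses one common proxy, the mean pairwise Euclidean distance between chromosomes, purely as an assumption. The function name `diversity_index` and the toy populations are hypothetical.

```python
from itertools import combinations
from math import dist


def diversity_index(population):
    """Mean pairwise Euclidean distance between chromosomes.

    Assumed definition: the paper's DI may be computed differently.
    """
    pairs = list(combinations(population, 2))
    if not pairs:
        return 0.0
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


# A collapsed population versus a spread-out one (toy two-gene chromosomes).
converged = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
spread = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

print(diversity_index(converged))
print(diversity_index(spread))
```

Under this proxy, a population that has collapsed onto a single solution scores zero, while a population occupying several distinct regions scores higher, matching the qualitative behavior described for the DI.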

The ablation study clearly shows that the NGA substantially enhances the optimization process compared to the basic GA, achieving faster convergence and higher accuracy. Additionally, the CBC mechanism is essential for preserving solution diversity and preventing premature convergence, which together contribute to the superior overall performance of the model. The combined effect of the NGA and CBC creates a synergistic improvement that significantly strengthens the robustness and generalizability of the sign language recognition framework in subject-independent scenarios. Table 7 outlines the defining features of each model variant alongside a rationale for their observed impact on performance, highlighting how evolutionary strategies and diversity-preserving mechanisms contribute to optimization effectiveness and recognition accuracy in the subject-independent SLR context.

Experiment 3: Generalization Evaluation across Unseen Subjects

The objective of the third set of experiments is to assess the model’s capacity to generalize for signers who were not included in the training process, thereby ensuring its effectiveness in real-world, subject-independent sign language recognition scenarios. To achieve this, the experiment is configured using Leave-One-Subject-Out (LOSO) cross-validation or subject-independent K-fold cross-validation, both of which are widely accepted methodologies for evaluating generalization in user-independent tasks. In LOSO, each fold involves training the model on all users except one, who is then used for testing, cycling through all subjects in the dataset. In subject-independent K-fold, users are partitioned into K disjoint sets, where each fold consists of training on K − 1 groups and testing on the remaining unseen group. This configuration ensures that the model is always evaluated on signers it has never encountered during training, making the performance metrics a direct indicator of its generalization power. The evaluation employs several critical metrics: Average Accuracy per Fold, which provides a mean performance indicator across all validation folds; Variance across Folds, which captures the model’s consistency across different signer splits; Generalization Error, reflecting the performance gap between training and testing phases; and the Subject-wise Misclassification Rate, offering a granular view of how the model performs for each individual signer. Collectively, these metrics offer a comprehensive understanding of both the reliability and robustness of the model when applied to previously unseen users.

Table 8 presents the performance results of the proposed model using LOSO cross-validation. The notable improvement in accuracy demonstrates the effectiveness of integrating the NGA with the Markov Chain-based framework for generalizing across unseen signers. The consistently high accuracy values across all subjects (ranging from 93.8% to 97.1%) reflect the model’s robust ability to capture diverse gesture dynamics while minimizing the performance drop associated with user variability. The misclassification rates, which remain below 6.2% for all subjects, indicate that the model makes very few incorrect predictions even when encountering entirely new signers during testing. Moreover, generalization error values stay within a narrow band (between 1.5% and 3.3%), confirming that the model’s learning from the training data transfers effectively to unseen subjects without significant performance degradation.

Further validating the model’s reliability, the fold variance values are all under 1.2%, with an average of 0.9%, signifying strong consistency in performance across different signer partitions. This low variance suggests that the model avoids overfitting to particular individuals and maintains stable behavior regardless of which subject is held out. Particularly impressive is the performance on S6, which shows the highest accuracy (97.1%) and the lowest generalization error (1.5%), indicating that the system is highly capable of handling clear and consistent gesture styles. Even the subject with the lowest accuracy (S5) achieves a strong 93.8%, which is still significantly above standard benchmarks in the field. These outcomes underscore the value of the CBC-enhanced NGA, which introduces genetic diversity and structural adaptability into the model, enabling it to generalize more effectively. Overall, the results confirm that the proposed framework is both accurate and robust, making it suitable for deployment in real-world SLR applications where user diversity is a key challenge.

Subject S5 achieved the lowest accuracy (93.8%) compared to other partitions, which can be attributed to the greater disparity between this subject’s signing patterns and those of the training subjects in other folds. Unlike subjects such as S6 or S3, who likely shared more similarities with the remaining training data in terms of gesture articulation, speed, and movement dynamics, S5’s gestures may have exhibited distinctive temporal or spatial characteristics that were underrepresented in the other partitions. Since subject-independent evaluation ensures that S5’s data is entirely excluded from training, the model relies solely on its ability to generalize from patterns observed in other users. If those users exhibited more homogeneous or standardized gestures, the model may fail to adapt to S5’s more unique or less predictable signing behavior. This contrast underscores the importance of training on a highly diverse dataset and suggests that S5’s gestures lie on the fringes of the learned feature space, leading to a higher generalization error and misclassification rate compared to other partitions.

Experiment 4: Convergence and Optimization Behavior Analysis

The objective of the fourth set of experiments is to assess the effectiveness of the NGA in optimizing the Markov Chain parameters for subject-independent sign language recognition. Specifically, the experiment focuses on evaluating how well the NGA balances optimization speed, convergence stability, and the ability to avoid premature convergence to local optima. This is particularly important in the context of sign language recognition, where gesture patterns vary widely across subjects, and the optimization algorithm must explore a large solution space without collapsing into suboptimal, homogeneous populations. By comparing the convergence behavior of the NGA with that of a standard GA, the experiment aims to demonstrate that NGA maintains higher population diversity and explores multiple peaks in the fitness landscape, thereby increasing the chance of identifying globally optimal solutions that generalize better across users.

The experimental configuration involves tracking multiple metrics across generations during the optimization process. These include the fitness value over generations, which reflects the improvement in model performance; the population diversity index, which quantifies how varied the candidate solutions are within each generation; the time to convergence, measuring how quickly the algorithm settles on a stable solution; and the number of distinct niches maintained, indicating how well the algorithm avoids crowding and preserves diverse solution regions. The NGA incorporates a CBC mechanism, which reduces gene association and enforces niche separation by allowing only one high-fitness solution per niche while demoting others with similar structures. This ensures that genetic diversity is preserved and that the algorithm continues exploring promising regions of the search space rather than converging prematurely. The parameters used in the experiment include a population size of 100, a maximum generation count of 200, a mutation rate of 0.05, and a crossover rate of 0.8, all carefully selected to balance exploration and exploitation. The niche radius in the CBC mechanism was set to approximately 0.2 based on empirical tuning to ensure effective niche separation without over-fragmenting the population.
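The one-winner-per-niche rule described above resembles the classic clearing procedure, which the following sketch implements under stated assumptions. It is not the authors' Context-Based Clearing: the Euclidean distance measure, the resetting of cleared fitness values to zero, and the toy population are illustrative choices; only the niche radius of 0.2 comes from the text.

```python
from math import dist

NICHE_RADIUS = 0.2  # value reported in the text


def clear(population, fitness, radius=NICHE_RADIUS):
    """Classic clearing with niche capacity 1 (fitness assumed positive).

    Returns (cleared_fitness, winner_indices). The best individual in each
    niche keeps its fitness; others within `radius` of it are reset to 0.
    """
    order = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    cleared = list(fitness)
    winners = []
    for pos, i in enumerate(order):
        if cleared[i] == 0.0:
            continue  # already demoted by a fitter neighbor
        winners.append(i)
        for j in order[pos + 1:]:
            if cleared[j] > 0.0 and dist(population[i], population[j]) < radius:
                cleared[j] = 0.0
    return cleared, winners


# Toy two-gene chromosomes forming two tight clusters plus one isolated point.
population = [[0.10, 0.10], [0.15, 0.12], [0.80, 0.80], [0.82, 0.79], [0.50, 0.20]]
fitness = [0.90, 0.70, 0.95, 0.60, 0.50]

cleared, winners = clear(population, fitness)
print(cleared)       # fitness after clearing
print(len(winners))  # number of distinct niches maintained
```

The length of `winners` gives one possible way to count the number of distinct niches tracked in this experiment, and a smaller radius would split the population into more niches, which is the over-fragmentation risk the tuning is meant to avoid.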

The experimental results shown in Table 9 clearly illustrate the superior performance of the NGA over the standard GA in optimizing the Markov Chain parameters for subject-independent sign language recognition. The average fitness of the NGA grows consistently with generations, reaching a peak value of 0.96, compared to 0.78 for the standard GA. This indicates that the NGA is more effective at improving the model’s recognition performance over time. Moreover, the NGA achieves faster convergence—stabilizing by generation 110—compared to the standard GA, which continues fluctuating and only stabilizes around generation 140. This faster convergence reflects the NGA’s enhanced optimization speed, a critical advantage in reducing training time and computational overhead.

The diversity index and number of distinct niches further support the NGA’s robustness. The diversity index remains substantially higher in the NGA across all generations (e.g., 0.72 vs. 0.48 at generation 20), indicating a more varied and exploratory population. This diversity is sustained through the use of the CBC mechanism, which actively discourages premature convergence by enforcing niche separation. Additionally, the number of maintained niches (five–seven in the NGA versus typically one–two in the GA) shows that the NGA can concurrently explore multiple high-potential regions of the solution space. Together, these behaviors prevent the algorithm from getting stuck in local optima and contribute to finding globally optimal solutions that generalize well across diverse users in sign language recognition tasks.

Experiment 5: Robustness against Signing Variability

The objective of the fifth set of experiments is to evaluate the model’s ability to generalize across the diverse ways individuals perform sign language, accounting for differences in hand shapes, gesture execution styles, and signing speeds. This is crucial in real-world scenarios where users exhibit a wide range of proficiency levels and physiological differences that can impact recognition accuracy. To conduct this evaluation, the experiment is configured by dividing the dataset into meaningful sub-groups—such as fast vs. slow signers and fluent vs. novice users—based on observed behavioral and temporal signing characteristics. This allows a detailed analysis of how the model performs under each condition. The evaluation employs several key metrics: Per-Group Accuracy provides insight into model effectiveness within each subgroup; Intra-class vs. Inter-class Misclassification Rates help differentiate whether errors arise more from within the same sign class (due to signing style) or across different sign classes (due to gesture similarity); and the Error Rate on Complex Signs specifically isolates performance on signs with intricate motion or shape dynamics.

Table 10 clearly shows that fluent signers achieve the highest overall accuracy (96.8%) with the lowest intra-class (1.8%) and inter-class (1.4%) misclassification rates, which demonstrates that consistent and well-formed signing leads to clearer feature extraction and better model performance. In contrast, novice signers display the lowest accuracy (89.3%) and the highest misclassification rates, particularly within intra-class errors (6.3%), indicating that inconsistent gesture formation and speed can confuse the model, especially for similar signs. The relatively high error rate on complex signs for novice users (11.1%) supports the notion that complex motions exacerbate recognition difficulties when gestures are imprecisely performed. Similarly, fast signers show lower accuracy and higher intra-class confusion (4.5%) than slow signers, likely due to motion blur or gesture compression that affects temporal alignment and recognition.

These findings justify the model’s overall robustness but also highlight its sensitivity to execution quality and speed variability. The higher performance with slow and fluent signers suggests that the model relies on well-articulated motion trajectories and consistent feature representations. The increased intra-class error for novice and fast signers shows that while the model can distinguish between sign classes to some extent, it struggles more when gestures vary within the same label due to inconsistent personal signing habits. The error rate on complex signs, especially for novice users, points to the need for additional modeling capacity (e.g., better temporal modeling or attention mechanisms) to handle fine-grained or subtle gesture differences. This underscores the importance of incorporating variability-aware training and possibly using domain adaptation or user calibration to further improve performance in real-world settings.

Experiment 6: Feature Sensitivity Experiment

The sixth set of experiments is designed to assess how varying input feature sets influence the overall performance of a model and to evaluate the robustness and adaptability of the NGA-based optimization when exposed to different feature spaces. The experiment is structured to compare model behavior across three distinct configurations: spatial-only features (such as hand position and shape), spatiotemporal features (such as motion trajectories that capture movement over time), and multimodal features (which combine video data with skeletal representations). This allows for a comprehensive understanding of how the complexity and richness of input data affect recognition outcomes. Key metrics include Recognition Accuracy, which measures how well the model classifies inputs from each feature set; NGA Adaptation Speed, which quantifies how quickly the optimization algorithm converges or adjusts to a new feature space; and Stability of Learned Transitions, which evaluates the consistency and reliability of the model’s internal state transitions or decision boundaries across different feature types.

The results shown in Table 11 demonstrate a clear trend: as the complexity and dimensional richness of the input features increase, model performance in terms of recognition accuracy improves significantly—from 87.3% using only spatial features to 96.2% with multimodal data. This suggests that incorporating temporal dynamics (in spatiotemporal features) and additional sensory modalities (in multimodal inputs) provides richer contextual information, enhancing the model’s ability to distinguish between gestures or actions more precisely. However, this performance gain comes at a computational cost, as observed in the NGA Adaptation Speed: the number of iterations needed for the NGA to converge increases from 35 (spatial-only) to 50 (multimodal), indicating a more complex feature space that requires more exploration before reaching optimal parameter configurations.

Furthermore, the Stability of Learned Transitions—measured via the standard deviation of transition behavior across different training runs—improves as the input representation becomes more comprehensive. The model demonstrates more consistent and stable transition patterns when trained on multimodal data (lowest std. dev. of 0.038), suggesting better generalization and less sensitivity to initial conditions. This can be justified by the idea that richer and more discriminative features reduce ambiguity, making it easier for the NGA-based system to establish robust internal states. In contrast, spatial-only features, while simpler and faster to converge, show more variability in the learned transitions (std. dev. of 0.072), reflecting potential overfitting or instability due to insufficient information. These findings collectively support the use of multimodal features in applications where both accuracy and model stability are critical, even if at the cost of slower adaptation.
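As a rough illustration of the stability metric discussed above, the sketch below computes the standard deviation of each transition probability across training runs and averages it over all matrix entries. The aggregation (population standard deviation, mean over entries) and the example matrices are assumptions made for illustration; the paper does not state how the per-run deviations are combined.

```python
from statistics import mean, pstdev


def transition_stability(runs):
    """Mean per-entry standard deviation of transition matrices across runs.

    `runs` is a list of square matrices (lists of rows), one per training run.
    Lower values mean the learned transitions are more consistent.
    """
    n = len(runs[0])
    per_entry_std = [
        pstdev([run[i][j] for run in runs])
        for i in range(n)
        for j in range(n)
    ]
    return mean(per_entry_std)


# Hypothetical 2-state matrices from three runs (illustrative values only).
example_runs = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.7, 0.3], [0.1, 0.9]],
]
print(round(transition_stability(example_runs), 4))
```

Identical matrices across runs give a score of zero, so smaller values correspond to the more stable behavior reported for multimodal features.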

Experiment 7: CBC-Niching Parameter Sensitivity Experiment

The next set of experiments is designed to investigate the impact of critical parameters within the CBC mechanism on the performance of an NGA-optimized Markov Chain model in SLR. The primary objective is to understand how variations in four key CBC parameters, namely the clearing radius (σ), niche capacity (κ), distance metric (e.g., Euclidean, cosine), and context window size (w), affect the model’s ability to balance solution diversity, convergence behavior, and classification accuracy. The experimental configuration involves a controlled univariate analysis where each parameter is independently varied while keeping all others fixed, followed by a comprehensive multi-parameter grid search to evaluate possible interactions. The model is trained on a fixed SLR dataset using five-fold cross-validation and consistent random seeds to ensure reproducibility. Each parameter is tested across several discrete values (e.g., σ ∈ {0.1, 0.3, 0.5, 0.7, 1.0}, κ ∈ {1, 3, 5, 10}, w ∈ {1, 3, 5, 7}, and three distance metrics), and the outcomes are assessed using a suite of evaluation metrics: Classification Accuracy to measure recognition performance, diversity index to quantify genotypic variation and niche count, Convergence Speed to evaluate how quickly the population stabilizes, Fitness Variance across individuals to capture quality diversity, Misclassification Rate per Subject to assess model generalization, and Niche Survival Rate to indicate how well CBC preserves niche structures over generations. This framework provides deep insights into how CBC-Niching shapes evolutionary search dynamics and affects the generalization capacity of the model.
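The two-stage protocol described above (univariate sweeps followed by a full grid) can be sketched as follows. The parameter values are the ones listed in the text; the names of the three distance metrics and the baseline configuration held fixed during the univariate sweeps are assumptions for illustration, since the text does not specify them at this point.

```python
from itertools import product

# Values listed in the text; the metric names are an assumption.
GRID = {
    "sigma": [0.1, 0.3, 0.5, 0.7, 1.0],
    "kappa": [1, 3, 5, 10],
    "window": [1, 3, 5, 7],
    "metric": ["euclidean", "cosine", "manhattan"],
}

# Hypothetical baseline kept fixed while one parameter is varied.
BASELINE = {"sigma": 0.5, "kappa": 5, "window": 5, "metric": "cosine"}


def univariate_configs(baseline, grid):
    """Vary one parameter at a time, keeping all others at the baseline."""
    configs = []
    for name, values in grid.items():
        for value in values:
            config = dict(baseline)
            config[name] = value
            configs.append(config)
    return configs


def full_grid(grid):
    """Every combination of parameter values (Cartesian product)."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*grid.values())]


print(len(univariate_configs(BASELINE, GRID)), len(full_grid(GRID)))
```

With the listed values, the full grid contains 5 × 4 × 4 × 3 = 240 configurations, which is why the cheaper univariate pass is useful for narrowing the ranges first.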

From the results in Table 12, we observe that Config 3 (σ = 0.5, κ = 5, cosine distance, w = 5) yields the highest classification accuracy (96.1%), along with the highest diversity index (0.82) and Niche Survival Rate (81.5%), while maintaining a moderate convergence speed (36 generations). This suggests that a balanced clearing radius (σ = 0.5) and niche capacity (κ = 5) are optimal for preserving diversity without overwhelming the search with redundant solutions. The use of cosine distance in high-dimensional spaces like gesture features appears effective for capturing subtle similarities between solutions, enhancing generalization. Additionally, a context window size of w = 5 provides enough temporal scope to stabilize state transitions in the Markov Chain, improving accuracy. In contrast, extreme values of the clearing radius (e.g., Config 1 with σ = 0.1 and Config 5 with σ = 1.0) negatively affect performance. Config 1 shows reduced diversity (0.61) and lower accuracy (89.4%), indicating that a very small clearing radius leads to overcrowded niches and poor exploration. Config 5, while having a larger radius, suffers from slower convergence (45 generations) and lower niche survival (62.8%), which reflects niche collapse and the potential loss of useful sub-populations. This implies that both under- and over-dispersed niches hinder the evolutionary dynamics necessary for an optimized and generalizable model.

Additionally, comparing distance metrics, cosine generally outperforms Euclidean and Manhattan, especially when paired with optimal niche parameters (as in Config 3). Euclidean metrics (Config 2 and 5) yield slightly lower accuracy and diversity, suggesting limitations in capturing angular relationships or patterns in the genotype space. Fitness variance, a measure of exploratory potential, remains consistently higher in well-performing configurations (Configs 2–4), correlating with lower misclassification rates per subject, highlighting better generalization across signer variability. These results collectively reinforce the importance of the careful tuning of CBC parameters, especially clearing radius, niche capacity, and contextual window, to balance exploration and exploitation, leading to enhanced classification performance and robustness in subject-independent SLR models.

To further extend the discussion (see Table 13), the clearing radius (σ), often used interchangeably with δ in the literature, determines the spatial or contextual closeness within which individuals are compared. Smaller values of σ enforce stricter diversity by removing individuals even with subtle similarities, leading to greater exploration but potentially disrupting convergence by eliminating structurally similar yet semantically different solutions. In contrast, larger σ values relax similarity constraints, improving convergence but risking redundancy. The niche capacity (κ) controls how many individuals can be retained in a given similarity niche before excess ones are cleared. A default κ = 1 (elitist clearing) preserves only the fittest per niche, enhancing selection pressure but possibly discarding diverse variants. Allowing κ > 1 introduces more variation in retained individuals, which can improve robustness and avoid local optima in classification models. However, this must be calibrated to prevent bloating the population with similar solutions.

The distance metric used (cosine similarity, Euclidean distance, or KL-divergence) determines how behavioral or contextual similarity is quantified. Cosine similarity is advantageous when comparing normalized gesture phase transition vectors, while Euclidean distance might be better suited to spatial or geometric features. KL-divergence adds asymmetry and sensitivity to distributional differences, which could be useful in probabilistic transition models but requires careful smoothing to avoid divergence instability. Lastly, the context window size (w), the temporal or feature span over which behavior is evaluated, affects the granularity of similarity comparisons. A smaller w captures short-term transitions and can lead to overly local clearing decisions, whereas a larger w emphasizes broader behavioral patterns but risks blurring fine distinctions between gesture phases. Tuning these parameters is therefore critical for adapting CBC to specific application needs, ensuring the evolutionary process avoids premature convergence while still driving toward high-fitness, contextually diverse solutions.
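To make the roles of the radius, the niche capacity, and the distance metric concrete, the following is a minimal sketch of the textbook clearing procedure, not the authors’ Context-Based Clearing itself. It assumes strictly positive fitness values, treats individuals as plain numeric vectors, and omits the context window; the function and variable names are hypothetical. In this textbook variant, an individual is cleared when it lies within distance `sigma` of a fitter niche member and the niche already holds `kappa` winners; the paper’s CBC may define its similarity threshold differently.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def clearing(population, fitness, sigma, kappa, distance=euclidean):
    """Return a copy of `fitness` with cleared individuals set to 0.0."""
    # Visit individuals from fittest to least fit.
    order = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    cleared = list(fitness)
    for pos, i in enumerate(order):
        if cleared[i] <= 0.0:
            continue  # already cleared by a fitter neighbor
        winners = 1  # individual i dominates its own niche
        for j in order[pos + 1:]:
            if cleared[j] <= 0.0:
                continue
            if distance(population[i], population[j]) < sigma:
                if winners < kappa:
                    winners += 1      # niche still has room
                else:
                    cleared[j] = 0.0  # niche is full: clear the individual
    return cleared


# Illustrative one-dimensional population (made-up values).
population = [[0.0], [0.1], [0.2], [1.0]]
fitness = [0.9, 0.8, 0.7, 0.5]
print(clearing(population, fitness, sigma=0.3, kappa=1))
print(clearing(population, fitness, sigma=0.3, kappa=2))
```

Passing `distance=cosine_distance` swaps the metric without touching the clearing loop, which is the kind of substitution the metric comparison above relies on.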

Experiment 8: Statistical Validation

The objective of Experiment 8 is to statistically validate whether the proposed CBC-enhanced NGA-Markov Chain (NGA-MC) model generalizes significantly better across subjects than baseline models, thereby supporting its robustness and subject-independence. The experimental setup involves implementing two baselines, a traditional Markov Chain (without NGA/CBC) and a deep learning model (BiLSTM), followed by repeated Leave-One-Subject-Out (LOSO) cross-validation across 10 different runs to gather a distribution of accuracy and generalization error for each model. Paired t-tests were used to compute the p-values comparing the proposed model to each baseline (MC and BiLSTM). In addition, Cohen’s d is reported as a standardized measure of effect size; it quantifies the magnitude of the difference between two models’ performances relative to the pooled standard deviation and helps interpret whether a statistically significant result (such as a low p-value) is also practically meaningful.
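The two statistics named above can be sketched with the standard library alone. The example computes the paired t statistic and Cohen’s d with a pooled standard deviation; the accuracy lists are made-up placeholders, not the paper’s results. Converting the t statistic into a p-value requires the Student t distribution with n − 1 degrees of freedom, which a library routine such as `scipy.stats.ttest_rel` provides.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def cohens_d(a, b):
    """Difference of means divided by the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


# Hypothetical per-run accuracies (%) for two models; illustrative only.
proposed = [95.1, 96.0, 94.8, 95.9, 95.3]
baseline = [90.2, 89.1, 90.5, 88.9, 89.8]
print(paired_t_statistic(proposed, baseline), cohens_d(proposed, baseline))
```

A large d alongside a small p-value is what the text means by a difference that is both statistically significant and practically meaningful.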

As shown in Table 14, the proposed NGA-MC model achieves the highest mean accuracy of 95.4% with the lowest generalization error (2.4%) and standard deviation (1.02), indicating not only superior performance but also strong consistency across subjects. In contrast, the traditional Markov Chain model without NGA or CBC enhancements shows a significantly lower mean accuracy of 89.7% and a higher generalization error (4.8%), suggesting that the absence of contextual and adaptive mechanisms reduces its ability to generalize. The BiLSTM model, while better than the basic MC, reaches only 91.3% mean accuracy with a 3.9% generalization error. These results confirm that while deep learning models offer improvements over basic statistical models, they still lag behind the proposed hybrid approach, especially when subject variation is high.

Statistical tests provide strong validation for the observed performance differences. The p-value between the proposed model and the traditional MC is < 0.001, and for BiLSTM it is 0.013, both of which are statistically significant (p < 0.05), supporting the hypothesis that the NGA-MC model achieves meaningful performance improvements. Furthermore, the Cohen’s d values (2.79 and 2.08) indicate very large effect sizes, reinforcing that the enhancements introduced by CBC and NGA not only improve mean performance but do so with a substantial margin. The low standard deviation in the proposed model also suggests stable behavior across different LOSO runs and subjects, which is crucial for subject-independent tasks like sign language recognition. Overall, the statistical and empirical evidence aligns to validate that the proposed model generalizes more effectively across unseen subjects, making it a reliable candidate for real-world deployment.

Experiment 9: Ablation Study on CBC Threshold and Fitness Function Weights for Robust Model Optimization

We suggest a comprehensive sensitivity analysis focusing on the CBC threshold (δ) and the weight factors (α, β, γ) used in the fitness function. Both sets of parameters play a pivotal role in the evolutionary optimization process: δ governs diversity control through Context-Based Clearing (CBC), while α, β, and γ define the trade-off between accuracy, uncertainty (entropy), and latency in the fitness evaluation. The objective of this ablation experiment is to systematically vary each parameter while holding the others constant, quantifying the individual and joint impact of δ and the fitness weights on the model’s performance. This helps identify optimal parameter ranges and demonstrates the robustness and stability of the proposed framework across different configurations. The experiment configuration involves running the full NGA-CBC optimization on the same SLR benchmark dataset under multiple settings: (i) varying δ over the values {0.05, 0.1, 0.2, 0.3, 0.4}, and (ii) exploring different weight combinations via grid search over α ∈ {0.6, 0.8, 1.0}, β ∈ {0.1, 0.2, 0.3}, and γ ∈ {0.1, 0.2, 0.3}, ensuring that α + β + γ = 1 for fair comparison. For each configuration, the model undergoes 10-fold LOSO cross-validation, and performance is evaluated using three primary metrics: mean accuracy, average entropy across transition matrices, and average latency in frame count per recognized gesture. Additionally, Cohen’s d and ANOVA can be applied to assess whether performance differences are statistically significant across δ and weight configurations. The expected outcome is the identification of a stable operating region (δ = 0.2, α = 0.8, β = 0.1, γ = 0.1) where the model maintains high accuracy, low uncertainty, and low latency, thus validating the chosen parameter values and reinforcing the robustness of the proposed method.
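The weighting scheme and the sum-to-one constraint can be illustrated with a short sketch. The linear form below (accuracy rewarded, normalized entropy and latency penalized) is an assumption made for illustration; this section names the three terms and their weights but does not give the exact fitness expression, and the normalization constants are hypothetical.

```python
from itertools import product

ALPHAS = [0.6, 0.8, 1.0]
BETAS = [0.1, 0.2, 0.3]
GAMMAS = [0.1, 0.2, 0.3]


def valid_weight_grid(alphas, betas, gammas, tol=1e-9):
    """Keep only (alpha, beta, gamma) triples whose weights sum to 1."""
    return [
        (a, b, g)
        for a, b, g in product(alphas, betas, gammas)
        if abs(a + b + g - 1.0) < tol
    ]


def fitness(accuracy, entropy, latency, alpha=0.8, beta=0.1, gamma=0.1,
            max_entropy=1.0, max_latency=1.0):
    """Assumed linear trade-off: reward accuracy, penalize entropy and latency.

    `accuracy` is a fraction in [0, 1]; `max_entropy` and `max_latency` are
    hypothetical normalizers that put the two penalty terms on a 0-1 scale.
    """
    return (alpha * accuracy
            - beta * (entropy / max_entropy)
            - gamma * (latency / max_latency))


for weights in valid_weight_grid(ALPHAS, BETAS, GAMMAS):
    print(weights)
```

With the listed grid values, four combinations satisfy α + β + γ = 1: (0.6, 0.1, 0.3), (0.6, 0.2, 0.2), (0.6, 0.3, 0.1), and (0.8, 0.1, 0.1).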

From Table 15, we observe that the highest mean accuracy (95.4%) occurs at δ = 0.20, indicating that this threshold offers the best balance between diversity and convergence. At δ = 0.05 and 0.10, the model accuracy is lower (92.1% and 93.5%, respectively), suggesting over-clearing: too many individuals are penalized, which restricts genetic diversity and leads to suboptimal transition matrices. On the other hand, at higher values like δ = 0.30 and 0.40, accuracy slightly drops again, possibly due to under-clearing, where redundant or similar individuals are retained, leading to convergence toward locally optimal but non-diverse solutions. Thus, δ = 0.20 appears to be a sweet spot, maximizing recognition accuracy while preserving meaningful variation across the population.

Entropy values represent uncertainty in the transition matrices. At δ = 0.20, the lowest entropy value (0.97) is achieved, indicating crisper and more confident gesture transitions. In contrast, δ = 0.05 and 0.40 yield higher entropies (1.31 and 1.22), which suggest that the resulting transition probabilities are more uniform and less decisive, likely due to limited diversity (at δ = 0.05) or excessive similarity among individuals (at δ = 0.40). This supports the interpretation that δ = 0.20 enables the algorithm to evolve models that confidently predict state transitions, reducing uncertainty and improving recognition robustness.
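One way to read “entropy of a transition matrix” is the Shannon entropy of each row (the distribution over next states), averaged over rows. The sketch below uses that reading with base-2 logarithms; both the averaging and the logarithm base are assumptions, since the section does not specify how the reported values are computed.

```python
from math import log2


def row_entropy(row):
    """Shannon entropy (bits) of one row of transition probabilities."""
    return -sum(p * log2(p) for p in row if p > 0.0)


def mean_transition_entropy(matrix):
    """Average row entropy of a row-stochastic transition matrix."""
    return sum(row_entropy(row) for row in matrix) / len(matrix)


# Illustrative matrices: near-deterministic vs. uniform transitions.
decisive = [[0.9, 0.1], [0.1, 0.9]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(mean_transition_entropy(decisive), mean_transition_entropy(uniform))
```

A uniform row attains the maximum entropy for its number of states, which matches the description of high-entropy matrices as “more uniform and less decisive”.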

Latency is a practical measure of how fast the system recognizes a gesture. At δ = 0.20, the lowest latency (11.2 frames) is recorded, indicating that the model quickly and reliably reaches correct decisions. δ = 0.05 and δ = 0.40 exhibit increased latencies (14.8 and 14.5 frames), suggesting slower convergence to correct predictions due to unstable or uncertain transition dynamics. The effect sizes (Cohen’s d) further emphasize the significance of δ = 0.20: values above 2.0 (e.g., 2.17 for δ = 0.05) reflect very large differences in performance, validating the statistical and practical superiority of the δ = 0.20 setting. Overall, the results justify δ = 0.20 as the optimal clearing threshold, balancing accuracy, confidence, and speed, which are critical qualities for real-time, subject-independent SLR systems.

Table 16 explores the impact of different fitness weight configurations—specifically the trade-off between accuracy (α), entropy (β), and latency (γ)—on the performance of the NGA-Markov Chain model, keeping the CBC threshold fixed at its optimal value (δ = 0.20). The best-performing configuration is clearly α = 0.8, β = 0.1, γ = 0.1, achieving the highest mean accuracy of 95.4%, the lowest entropy (0.97), and one of the lowest latencies (11.2 frames). This indicates a well-balanced optimization, where the model not only correctly recognizes gestures with high precision but also produces confident transitions (low entropy) and timely recognition (low latency). The result reflects that equal emphasis on reducing uncertainty and recognition time, along with strong prioritization of accuracy, leads to optimal generalization in real-world, subject-independent SLR.

In contrast, the configurations with lower accuracy weights (α = 0.6) show a noticeable drop in performance, with accuracies dropping to as low as 92.9% and entropies rising above 1.1. For instance, (α = 0.6, β = 0.3, γ = 0.1) places more focus on entropy reduction but sacrifices recognition performance and speed. Similarly, while (α = 0.6, γ = 0.3) slightly improves latency to 11.6 frames, it does so at the cost of reduced accuracy (94.2%) and increased uncertainty (entropy = 1.16). These observations suggest that over-penalizing latency or uncertainty can lead to an imbalance, causing the model to rush predictions or produce less discriminative transitions, which in turn reduces generalization capability.

Even when accuracy is maximized (α = 1.0, β = γ = 0.0), the model does not achieve better performance than the balanced configuration. While it reaches a high accuracy of 95.0%, it also results in the highest entropy (1.22) and longer latency (12.9 frames), indicating less decisive transitions and slower responsiveness. This supports the interpretation that optimizing purely for accuracy can lead to overfitting or indecisive transitions, making the model less robust in dynamic or noisy gesture input scenarios. The Cohen’s d values, all above 0.6 in non-optimal settings, indicate at least medium effect sizes for these performance gaps. Therefore, the analysis shows that a balanced fitness function, one that favors accuracy while lightly penalizing entropy and latency, is essential for achieving both high performance and generalizability in real-time gesture recognition systems.

Experiment 10: Empirical Validation of the Markov Reward Function

In SLR, the reward function plays a pivotal role in shaping the learning behavior of an agent—especially within reinforcement learning (RL)-based or decision-making frameworks—by quantifying the value of specific actions taken in various states of gesture processing. The reward function defines how desirable a particular state-action transition is, based on how well it aligns with the system’s goals: fast and accurate recognition. In a real-time SLR scenario, the system is continuously observing a stream of frames and must decide whether to wait, segment, or classify the current gesture. A high reward is assigned when the system correctly classifies a gesture using the minimal number of frames, indicating high confidence and low latency. This encourages the model to be decisive yet accurate. Conversely, negative rewards (penalties) are given for mistakes such as misclassification, early decisions (premature segmentation), or excessive delay (prolonged waiting), which can degrade user experience and model reliability.

The design of this reward function directly impacts how the model learns to balance exploration and exploitation, especially under uncertainty and in subject-independent SLR tasks. From a practical standpoint, the literature suggests that reward shaping—combining immediate feedback (e.g., per-frame classification confidence) with delayed feedback (e.g., full gesture accuracy)—is crucial to guide learning effectively. One widely adopted formulation involves assigning: (1) a +1 reward for correct classification with low frame count, (2) a −1 penalty for incorrect predictions, and (3) small negative penalties (e.g., −0.01) for each frame spent in a waiting state to discourage excessive latency. Some approaches also include adaptive or probabilistic rewards, which reflect the classifier’s prediction uncertainty or context-dependent features, improving generalization across users and gestures. Overall, the best-performing reward functions in SLR literature are those that are context-aware, penalize inefficiency and errors, and reinforce minimal yet confident decisions, helping to ensure robust real-time performance with high interpretability and efficiency.

To empirically validate the effectiveness of a context-aware reward function within a Markov decision process (MDP) framework in real-time SLR, an experiment can be designed that restricts reward computation strictly to Markovian transitions, i.e., rewards depend only on the current state, action, and the resulting next state, not on the full history. The objective of this experiment is to evaluate how a Markov-based reward function that balances gesture accuracy, decision timing, and temporal efficiency impacts performance under the constraints of memoryless transitions. Two reward schemes will be compared: (1) a Markov baseline reward model assigning +1 for correct classification, −1 for incorrect, and 0 otherwise, and (2) a Markov-constrained context-aware reward, where the reward incorporates a penalty for frame-based waiting, bonuses for confident transitions, and penalties for early or late segmentation, but all defined per state–action–next-state transition only.
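The two reward schemes can be sketched as plain functions of a single transition. The +1/−1 values and the −0.01 per-frame waiting penalty come from the formulation quoted earlier in the text; the confidence bonus, the segmentation timing penalty, and the way state information is passed in (`confidence`, `boundary_offset`) are hypothetical choices made only to illustrate the structure.

```python
def baseline_reward(action, correct):
    """+1 for a correct classification, -1 for an incorrect one, 0 otherwise."""
    if action == "classify":
        return 1.0 if correct else -1.0
    return 0.0


def context_aware_reward(action, correct=None, confidence=0.0, boundary_offset=0,
                         wait_penalty=0.01, confidence_bonus=0.5,
                         timing_penalty=0.05):
    """Per-transition reward using only current-state and next-state information.

    confidence:      classifier confidence in [0, 1] at the current state.
    boundary_offset: frames between the chosen segmentation point and the true
                     gesture boundary (negative = early, positive = late).
    The bonus and timing-penalty magnitudes are illustrative assumptions.
    """
    if action == "wait":
        return -wait_penalty                      # discourage excessive latency
    if action == "segment":
        return -timing_penalty * abs(boundary_offset)
    if action == "classify":
        if correct:
            return 1.0 + confidence_bonus * confidence
        return -1.0
    raise ValueError(f"unknown action: {action}")


print(context_aware_reward("wait"))
print(context_aware_reward("classify", correct=True, confidence=0.9))
```

Because every term depends only on the current state, the action, and the outcome of that one step, the reward remains Markovian, as the experiment requires.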

The experimental configuration follows an MDP formulation where the environment is a real-time gesture stream, and states represent frame-level encoded gesture embeddings (features), actions are discrete (wait, segment, and classify), and transitions follow probabilistically from action policies. The dataset will be partitioned in a subject-independent manner, with no signer overlap between training and testing to assess generalization. Evaluation will focus on the following Markov-anchored metrics: (1) State-level Classification Accuracy, which evaluates recognition correctness per decision point; (2) Mean Decision Latency, calculated as the average number of frames used before classification; and (3) State-Transition Consistency Score, measuring alignment between predicted transitions and true gesture boundaries. Additionally, the Area under the Reward Curve (AURC) across episodes quantifies learning efficiency.

Table 17 presents a comparative analysis of two reward schemes (Baseline Markov Reward and Context-Aware Markov Reward) in the context of real-time SLR using an MDP framework. Each metric is reported as a mean ± standard deviation, where the “±” symbol denotes variability across multiple runs or experimental folds. For instance, a value of 81.3 ± 1.2 in state-level accuracy means the average accuracy was 81.3%, with a standard deviation of 1.2%, indicating the result is relatively stable across repetitions. A smaller standard deviation reflects more consistent performance, while a larger one would indicate more fluctuation in outcomes. This measure of dispersion is crucial for validating the robustness and reliability of the reward scheme under different conditions, such as random seed variations or cross-validation folds.

Looking at the individual metrics, the State-Level Accuracy measures the model’s correctness at the decision-making level—how accurately the RL agent classifies gestures at the right moments. The context-aware reward significantly improves this metric (88.7% vs. 81.3%), showing that integrating penalties for delays and bonuses for confident, timely classification helps the agent learn a more effective policy. Higher accuracy directly translates into improved recognition reliability in real-world applications, which is essential in domains such as assistive communication or human–computer interaction where misclassification could lead to confusion or system failure.

The Mean Decision Latency metric assesses how quickly the agent makes a classification decision after observing a gesture. Lower values are desirable in real-time systems, as they indicate faster response times. The context-aware agent shows improved latency (10.2 frames vs. 14.8 frames), meaning it is able to make confident decisions earlier, which enhances system responsiveness. Similarly, the Transition Consistency Score—which measures how well the predicted gesture boundaries align with ground truth—is higher for the context-aware reward (82.4% vs. 73.5%), indicating better temporal modeling of gesture transitions. Finally, the AURC reflects how effectively the agent accumulates reward per episode during training. A higher AURC (9.3 vs. 6.8) indicates more efficient learning and better policy convergence. Overall, the context-aware Markov reward not only boosts classification performance but also enables faster and temporally accurate decision-making, making it highly suitable for real-time SLR systems.

Experiment 11: Computational Efficiency

To address concerns regarding the computational efficiency of the proposed MC-NGA model with CBC, we designed an experiment to assess its runtime performance, memory usage, and scalability, using a resource-constrained environment for realistic evaluation. Specifically, the model was tested on a Dell Inspiron N5110 laptop without any GPU acceleration. The evaluation focused on the Arabic Sign Language (ArASL) dataset, which is well-suited for subject-independent testing due to its comprehensive structure—comprising over 1000 distinct gesture classes, each contributed by 50 different signers, with 10 samples per gesture per signer, thereby capturing a wide range of signer variability and real-world conditions such as changes in hand orientation, background clutter, and lighting. The objective of the experiment is to measure the training time per epoch, average inference time per gesture sequence, and peak memory usage under these hardware constraints. Metrics such as total execution time, CPU utilization, and memory footprint were collected using system monitoring tools.
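A minimal way to collect wall-clock time and peak memory for a single call is shown below, using only the Python standard library. The paper states that system monitoring tools were used but does not name them, so this is an illustrative stand-in rather than the authors’ measurement setup; note that `tracemalloc` only tracks memory allocated by Python itself, not the whole process footprint.

```python
import time
import tracemalloc


def profile_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return result, elapsed, peak


def dummy_workload(n):
    # Hypothetical stand-in for one inference call on a gesture sequence.
    return sum(i * i for i in range(n))


result, elapsed, peak = profile_call(dummy_workload, 100_000)
print(f"elapsed: {elapsed * 1000:.2f} ms, peak Python memory: {peak / 1024:.1f} KiB")
```

Averaging `elapsed` over many gesture sequences would give an estimate comparable in spirit to the per-sequence inference time reported in the experiment.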

The results shown in Table 18 reflect the superior performance of the MC-NGA model as presented in the benchmark comparison. As the signer and gesture complexity increase, the model maintains high levels of accuracy and subject-wise generalization. At full scale (50 signers, 1000+ classes), the model reaches 96.0% overall accuracy and 92.7% subject-wise accuracy, outperforming all other models in the previous comparison table. Despite the increased data complexity, inference time remains under 60 milliseconds, validating the model’s real-time responsiveness, even on a non-GPU laptop, which is critical for practical SLR systems deployed in low-resource environments.

The MC-NGA model demonstrates linear and manageable increases in training time, memory usage, and CPU load as the dataset scales. Training time per epoch rises from 38 to 398 s, while peak memory usage stays under 1.7 GB, and CPU usage remains below 90%, ensuring operational feasibility. This efficiency is largely due to the compact nature of the Markov Chain structure and the evolutionary optimization guided by Context-Based Clearing, which avoids the parameter bloat typical of deep learning architectures like Transformers or CNN-LSTM. The model’s evolutionary convergence remains stable, with only a moderate increase in epochs required as signer variability grows, indicating reliable scalability.

While deep learning models such as Transformers and GCNs offer strong sequence modeling capabilities, they often require large memory and GPU support to operate effectively and can struggle with inter-subject variability due to overfitting on signer-specific features. In contrast, MC-NGA explicitly preserves population diversity through CBC, allowing it to model multiple behavioral patterns in parallel without converging prematurely. This enables the model to generalize better across unseen signers, justifying its superior subject-wise accuracy. The results confirm that MC-NGA is not only a high-performing solution but also a scalable, efficient, and real-time-capable approach to subject-independent sign language recognition.

4.1. The Real-Time Performance

To assess the practical viability of the proposed NGA + CBC model in real-world environments, this section evaluates its real-time performance across three key dimensions: runtime efficiency, computational costs, and scalability.

- Runtime Efficiency:
The proposed NGA + CBC model demonstrates notable runtime efficiency due to its lightweight inference structure, particularly during the testing phase. While the evolutionary optimization (e.g., niching genetic search) is computationally intensive during training, it is performed offline and does not affect the online performance of the deployed model. Once trained, the Markov Chain component handles gesture sequence recognition using precomputed transition probabilities and phase dynamics, enabling real-time decision-making with minimal computational overhead. Furthermore, the use of MediaPipe for real-time hand landmark extraction ensures that the input feature stream is efficiently processed at over 30 FPS on standard CPUs, supporting low-latency frame-wise gesture interpretation. This makes the model suitable for real-time applications such as human–computer interaction or assistive technologies, where fast and accurate feedback is essential.
- Computational Costs:
The computational cost of the proposed approach is front-loaded during the training phase, where the NGA iteratively evolves gesture recognition strategies over multiple generations. However, the number of generations to convergence is notably lower in the proposed model (only 65 generations, compared to 150 in the baseline), significantly reducing training time and energy expenditure. The CBC mechanism further contributes to efficiency by pruning redundant individuals, thus reducing the number of candidates that need to be evaluated per generation. In deployment, the model relies on a compact Markov structure and lightweight transition calculations, requiring only matrix-vector multiplications and simple lookups, which are highly scalable and memory-efficient. This balance between offline optimization and efficient online inference allows the system to maintain competitive computational performance.
- Scalability:
The architecture of the MC-NGA model is inherently scalable across datasets, user groups, and gesture vocabularies. The evolutionary framework allows the model to adapt to new gesture categories or expanded sign lexicons without retraining the entire system from scratch; instead, it can evolve and integrate new behaviors incrementally. Moreover, the use of Context-Based Clearing ensures that diverse gesture interpretations are preserved during optimization, which enhances the model’s scalability to heterogeneous populations (e.g., multilingual signers or varying cultural gesture sets). Experimentally, this is validated by its consistent performance across both the ArASL and LSA64 datasets. Additionally, because inference relies only on a compact transition model, the system can be deployed on resource-constrained devices, such as embedded platforms or mobile hardware, making it highly scalable for real-world assistive or wearable applications.
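
The lightweight inference step described above can be sketched as follows: each gesture is represented by a start distribution and a transition matrix, and an observed state sequence is assigned to the gesture whose chain gives it the highest log-likelihood. This is a minimal sketch under stated assumptions, not the authors' implementation: the three-state chains, the labels `sign_A` and `sign_B`, and the helpers `log_likelihood` and `classify` are all hypothetical, and the probabilities are made up for illustration.

```python
import math


def log_likelihood(states, start, trans):
    """Log-probability of a discrete state sequence under one Markov chain."""
    ll = math.log(start[states[0]])
    for prev, cur in zip(states, states[1:]):
        ll += math.log(trans[prev][cur])  # one table lookup per frame
    return ll


def classify(states, models):
    """Return the gesture label whose chain best explains the sequence."""
    return max(models, key=lambda g: log_likelihood(states, *models[g]))


# Hypothetical precomputed parameters: (start distribution, transition matrix).
# Every entry is non-zero here; real models would need smoothing otherwise.
models = {
    "sign_A": ([0.8, 0.1, 0.1],
               [[0.6, 0.3, 0.1],
                [0.1, 0.7, 0.2],
                [0.1, 0.1, 0.8]]),
    "sign_B": ([0.2, 0.2, 0.6],
               [[0.3, 0.1, 0.6],
                [0.5, 0.3, 0.2],
                [0.2, 0.6, 0.2]]),
}

observed = [0, 0, 1, 1, 2, 2]  # quantized hand-landmark states, one per frame
print(classify(observed, models))
```

The per-sequence cost is one lookup and one addition per frame and per gesture model, which is why the expensive evolutionary search can stay offline while recognition remains cheap.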

4.2. Limitations

While the proposed adaptive SLR framework integrating Markov Chains with an NGA shows significant advancements in handling subject-independent variability, it is not without limitations. First, the computational complexity of the NGA-CBC optimization process can be substantial, particularly when evolving large populations over many generations; although this cost is incurred offline, it may hinder retraining or adaptation directly on real-time or embedded systems. Second, although CBC enhances population diversity, its effectiveness depends heavily on carefully tuned parameters such as clearing radius and niche capacity; improper settings could either lead to excessive overlap between niches or fragmentation of promising solutions. Third, the reliance on fixed structural assumptions within the Markov Chain may limit the model’s ability to capture more nuanced temporal dependencies that could be better modeled by more expressive frameworks like deep recurrent or attention-based architectures. Additionally, the system may still be sensitive to noise in gesture segmentation or skeletal data extraction, especially in uncontrolled environments, potentially degrading performance. Finally, while improved generalization is reported, extensive cross-dataset or real-world evaluations across varied demographic groups are necessary to confirm its robustness and practical applicability beyond the controlled experimental settings.
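
To make the sensitivity to clearing radius and niche capacity concrete, the sketch below implements the standard clearing procedure on a toy population. It is an assumption-laden illustration rather than the paper's Context-Based Clearing: plain Euclidean distance (`math.dist`) stands in for the context-based distance, and the function `clearing`, the two-dimensional individuals, and the fitness values are hypothetical.

```python
import math


def clearing(population, fitness, sigma, kappa, distance=math.dist):
    """Return fitness values after clearing: inside each niche of radius
    sigma, only the kappa fittest individuals keep their fitness."""
    order = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    cleared = list(fitness)
    for pos, i in enumerate(order):
        if cleared[i] <= 0:
            continue                  # already cleared by a fitter neighbor
        winners = 1                   # individual i dominates this niche
        for j in order[pos + 1:]:
            if cleared[j] > 0 and distance(population[i], population[j]) < sigma:
                if winners < kappa:
                    winners += 1      # niche still has capacity
                else:
                    cleared[j] = 0.0  # niche is full: reset fitness
    return cleared


# Toy population: two clusters of candidate solutions (hypothetical values).
population = [(0.10, 0.10), (0.12, 0.11), (0.11, 0.09), (0.90, 0.85), (0.88, 0.90)]
fitness = [0.95, 0.90, 0.80, 0.70, 0.75]

for sigma, kappa in [(0.01, 1), (0.2, 1), (0.2, 2), (2.0, 1)]:
    print(sigma, kappa, clearing(population, fitness, sigma, kappa))
```

In this toy setting, a very small radius clears nothing, so near-duplicate individuals keep competing inside the same niche, while a very large radius merges both clusters into one niche and discards the second cluster entirely. Intermediate settings keep the best one or two individuals per cluster, depending on the capacity.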

Moreover, the current framework lacks integration with cognitive computing principles, which could significantly enhance the system’s contextual understanding and adaptability. By incorporating cognitive computing components—such as knowledge graphs, memory networks, or context-aware decision mechanisms—the model could better interpret ambiguous or culturally nuanced signs and adapt to variations in gesture semantics. Cognitive computing could also support multi-modal reasoning by combining visual cues, historical interaction patterns, and environmental context, leading to more intelligent and human-like SLR systems. Integrating these elements would push the system beyond pattern recognition toward a deeper, more holistic comprehension of sign language communication [38,39].

5. Conclusions

In conclusion, this study presents a novel subject-independent SLR framework that effectively addresses the limitations of traditional models in capturing signer variability. By integrating a Markov Chain with a CBC-NGA, the proposed approach enhances the model’s ability to learn diverse gesture transitions across different users. The NGA component optimizes transition probabilities and structural parameters, while the CBC strategy maintains genetic diversity, thereby avoiding premature convergence and increasing robustness to inter-user variations. This adaptive optimization framework significantly improves the generalization capacity of the Markov Chain model, allowing it to perform effectively in subject-independent scenarios where signing styles and dynamics vary widely. Experimental evaluations confirm the advantages of the proposed system, showing substantial improvements in classification accuracy, a reduction in misclassification rates, and increased robustness against unseen signers. These outcomes validate the efficacy of the CBC-NGA-enhanced Markov Chain model in tackling one of the most challenging aspects of SLR: generalization across diverse users. Moving forward, future work will explore integrating deep learning architectures such as recurrent or attention-based networks to further model complex temporal dependencies. Additionally, the framework will be tested on more diverse datasets and real-world settings to assess scalability, efficiency, and practical applicability in real-time communication systems.

Author Contributions

Conceptualization, M.A.-S., Á.B. and S.M.D.; methodology, M.A.-S., Á.B. and O.A.H.; software, M.A.-S., Á.B. and O.A.H.; validation, M.A.-S., Á.B. and O.A.H.; formal analysis, M.A.-S., Á.B. and S.M.D.; investigation, Á.B. and O.A.H.; resources, O.A.H.; data curation, Á.B. and O.A.H.; writing—original draft preparation, M.A.-S., Á.B. and S.M.D.; writing—review and editing, Á.B. and O.A.H.; visualization, M.A.-S.; supervision, Á.B. and O.A.H.; project administration, M.A.-S.; funding acquisition, M.A.-S. and Á.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analyzed during the current study are available in the Kaggle repository, https://www.kaggle.com/competitions/sign-language-recognition (accessed on 1 May 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Tao, T.; Zhao, Y.; Liu, T.; Zhu, J. Sign language recognition: A comprehensive review of traditional and deep learning approaches, datasets, and challenges. IEEE Access 2024, 12, 75034–75060.
2. Tulli, S.; Virdi, M.K.; Misra, A.; Hasteer, N. Hand gesture recognition: A contemporary overview of techniques. In Proceedings of the 2024 International Conference on Automation and Computation (AUTOCOM), Dehradun, India, 14–16 March 2024; pp. 457–463.
3. Wu, J.; Yang, T. A Brief Review of Sign Language Recognition Methods and Cutting-edge Technologies. In Proceedings of the 2024 5th International Conference on Computer Engineering and Application (ICCEA), Hangzhou, China, 12–14 April 2024; pp. 1233–1242.
4. Tasfia, R.; Yusoh, Z.I.M.; Habib, A.B.; Mohaimen, T. An overview of hand gesture recognition based on computer vision. Int. J. Electr. Comput. Eng. 2024, 14, 45–351.
5. Pohekar, A.; Pundge, A.; Shirsath, V. Exploring Hand Gesture Recognition: Trends, Technologies, and Application. In Proceedings of the 2025 3rd International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT), Bengaluru, India, 5–7 February 2025; pp. 1809–1812.
6. Hashi, A.O.; Hashim, S.Z.; Asamah, A.B. A Systematic Review of Hand Gesture Recognition: An Update from 2018 to 2024. IEEE Access 2024, 12, 143599–143626.
7. Sandjaja, I.; Alsharoa, A.; Wunsch, D.; Liu, J. Survey of Hidden Markov Models (HMMs) for Sign Language Recognition (SLR). In Proceedings of the IEEE 7th International Conference on Industrial Cyber-Physical Systems (ICPS), St. Louis, MO, USA, 12–15 May 2024; pp. 1–6.
8. Mishra, A.; Kumar, A.; Kumar, A.; Mahato, A.; Kamal, K. Sign Language or Gesture-Based Recognition System: A Review. In Proceedings of the International Conference on Advancements in Smart, Secure and Intelligent Computing (ASSIC), Bhubaneswar, India, 27–29 January 2024; pp. 1–7.
9. Tang, H. Intelligent Vision Testing Technology for Gesture Recognition Based on Artificial Intelligence Algorithms. In Proceedings of the International Conference on Data Science and Network Security (ICDSNS), Tiptur, India, 26–27 July 2024; pp. 1–5.
10. Soukaina, C.M.; Mohammed, M.; Mohamed, R. Geometric Feature-Based Machine Learning for Efficient Hand Sign Gesture Recognition. Stat. Optim. Inf. Comput. 2025, 13, 2027–2043.
11. Matanga, Y.; Owolawi, P.; Du, C.; van Wyk, E. Niching Global Optimization: Systematic Literature Review. Algorithms 2024, 17, 448.
12. Homsapaya, K.; Bundasak, S.; Jitnupong, B. Enhancing Sequential Floating Search Feature Selection using Niching-Genetic Algorithm. In Proceedings of the International Computer Science and Engineering Conference (ICSEC), Khon Kaen, Thailand, 6–8 November 2024; pp. 1–5.
13. Shin, J.; Miah, A.S.; Kabir, M.H.; Rahim, M.A.; Al Shiam, A. A methodological and structural review of hand gesture recognition across diverse data modalities. IEEE Access 2024, 12, 142606–142639.
14. Lei, Y.; Shang, A.; Wang, L. Optimization of Hidden Markov Map Matching Based on Improved Genetic Algorithm. In Proceedings of the International Conference on Information Science, Computer Technology and Transportation (ISCTT), Mianyang, China, 28–30 June 2024; pp. 47–53.
15. Al Abdullah, B.; Amoudi, G.; Alghamdi, H. Advancements in Sign Language Recognition: A Comprehensive Review and Future Prospects. IEEE Access 2024, 12, 128871–128895.
16. Kavitha, G.; Kalpana, K. Integrated Tuning of Hidden Markov Parametric Optimization Model with Genetic Algorithm for Electricity Market Forecasting. In Proceedings of the Computational Methods in Systems and Software, Cham, Germany, 10–15 October 2022; Springer International Publishing: Cham, Switzerland, 2022; pp. 798–806.
17. Benmachiche, A.; Makhlouf, A.; Bouhadada, T. Optimization learning of hidden Markov model using the bacterial foraging optimization algorithm for speech recognition. Int. J. Knowl. Based Intell. Eng. Syst. 2020, 24, 171–181.
18. Ren, B.; Gao, Z.; Li, Y.; You, C.; Chang, L.; Han, J.; Li, J. Real-time continuous gesture recognition system based on PSO-PNN. Meas. Sci. Technol. 2024, 35, 056122.
19. Sivaraman, R.; Santiago, S.; Chinnathambi, K.; Sarkar, S.; Sangeethaa, S.N. Sign Language Recognition Using Improved Seagull Optimization Algorithm with Deep Learning Model. In Proceedings of the Second International Conference on Intelligent Cyber Physical Systems and Internet of Things, Coimbatore, India, 28–30 August 2024; pp. 1566–1571.
20. Mahmoud, A.O.; Ziedan, I.; Zamel, A.A. Optimized Hybrid Convolution Neural Network with Machine Learning for Arabic Sign Language Recognition. Trait. Du Signal. 2024, 41, 1835–1846.
21. Kaluri, R.; Reddy, C.P.; Ai, Q. A framework for sign gesture recognition using improved genetic algorithm and adaptive filter. Cogent Eng. 2016, 3, 1251730.
22. Goel, R.; Bansal, S.; Gupta, K. Improved feature reduction framework for sign language recognition using autoencoders and adaptive Grey Wolf Optimization. Sci. Rep. 2025, 15, 2300.
23. John, J.; Deshpande, S. Intelligent hybrid hand gesture recognition system using deep recurrent neural network with chaos game optimization. J. Exp. Theor. Artif. Intell. 2025, 37, 75–94.
24. Tur, A.O.; Keles, H.Y. Evaluation of hidden markov models using deep CNN features in isolated sign recognition. Multimed. Tools Appl. 2021, 80, 19137–19155.
25. Milu, S.A.; Fathima, A.; Talukder, T.; Islam, I.; Emon, M.I. Design and Implementation of hand gesture detection system using HM model for sign language recognition development. J. Data Anal. Inf. Process. 2024, 12, 139–150.
26. Gupta, A.K.; Singh, S. Hand Gesture Recognition System Based on Indian Sign Language Using SVM and CNN. Int. J. Image Graph. 2024, 20, 2650008.
27. Fayek, M.B.; Darwish, N.M.; Ali, M.M. Context based clearing procedure: A niching method for genetic algorithms. J. Adv. Res. 2010, 1, 301–307.
28. Kalra, S.; Rahnamayan, S.; Deb, K. Enhancing clearing-based niching method using Delaunay triangulation. In Proceedings of the IEEE Congress on Evolutionary Computation, Donostia, Spain, 5–8 June 2017; pp. 2328–2337.
29. Shin, J.; Miah, A.S.; Akiba, Y.; Hirooka, K.; Hassan, N.; Hwang, Y.S. Korean sign language alphabet recognition through the integration of handcrafted and deep learning-based two-stream feature extraction approach. IEEE Access 2024, 12, 68303–68318.
30. Kakizaki, M.; Miah, A.S.; Hirooka, K.; Shin, J. Dynamic Japanese sign language recognition throw hand pose estimation using effective feature extraction and classification approach. Sensors 2024, 24, 826.
31. Damaneh, M.M.; Mohanna, F.; Jafari, P. Static hand gesture recognition in sign language based on convolutional neural network with feature extraction method using ORB descriptor and Gabor filter. Expert Syst. Appl. 2023, 211, 118559.
32. Darwish, S.M. Feature extraction of finger-vein patterns based on boosting evolutionary algorithm and its application for IoT identity and access management. Multimed. Tools Appl. 2021, 80, 14829–14851.
33. Zalat, M.S.; Darwish, S.M.; Madbouly, M.M. An adaptive offloading mechanism for mobile cloud computing: A niching genetic algorithm perspective. IEEE Access 2022, 10, 76752–76765.
34. Du, S.; Li, S.; Han, H.; Qiao, J. Diversity-based niche genetic algorithm for bi-objective mixed fleet vehicle routing problem with time window. Neural Comput. Appl. 2025, 37, 11479–11499.
35. Podder, K.K.; Ezeddin, M.; Chowdhury, M.E.H.; Sumon, M.S.I.; Tahir, A.M.; Ayari, M.A.; Dutta, P.; Khandakar, A.; Mahbub, Z.B.; Kadir, M.A. Signer-Independent Arabic Sign Language Recognition System Using Deep Learning Model. Sensors 2023, 23, 7156.
36. Luqman, H.; El-Alfy, E.S.M. Towards Hybrid Multimodal Manual and Non-Manual Arabic Sign Language Recognition: mArSL Database and Pilot Study. Electronics 2021, 10, 1739.
37. Arib, S.H.; Akter, R.; Rahman, S.; Rahman, S. SignFormer-GCN: Continuous sign language translation using spatio-temporal graph convolutional networks. PLoS ONE 2025, 20, e0316298.
38. Peng, Y.; Sakai, Y.; Funabora, Y.; Yokoe, K.; Aoyama, T.; Doki, S. Funabot-Sleeve: A Wearable Device Employing McKibben Artificial Muscles for Haptic Sensation in the Forearm. IEEE Robot. Autom. Lett. 2025, 10, 1944–1951.
39. Peng, Y.; Yang, X.; Li, D.; Ma, Z.; Liu, Z.; Bai, X.; Mao, Z. Predicting flow status of a flexible rectifier using cognitive computing. Expert Syst. Appl. 2025, 264, 125878.

Figure 1. Workflow diagram of the Markov Chain-based sign language recognition system enhanced with a Niching Genetic Algorithm.

Figure 2. Preprocessing techniques for enhancing sign language image quality. (a) Original sign—raw input image containing the signer and background noise. (b) Noise removal—filtering applied to eliminate visual artifacts and irrelevant pixel-level noise. (c) Lightening—global enhancement of image luminance to improve visibility of hand shapes. (d) Brightness adjustment—fine-tuning contrast and brightness levels to normalize lighting conditions. (e) Background removal—segmentation of the signer’s hand from the background for clean feature extraction.

Figure 3. Recorded ArSL alphabet.

Table 1. Comparative study.

Criteria	Classical Probabilistic Models (Markov Chains)	Deep Learning Models (RNNs, LSTMs, and Transformers)
Interpretability	High—Transition probabilities and states are explicitly defined and explainable.	Low—Often behave as “black boxes” with limited transparency.
Data Requirements	Low—Can function with limited labeled data.	High—Require large, annotated datasets for effective learning.
Computational Complexity	Low to Moderate—Efficient to train and evaluate.	High—Demand powerful hardware and long training times.
Robustness to Small Data Variations	Moderate—Struggle with unseen sequences without augmentation or tuning.	Low—Prone to overfitting on small or imbalanced datasets.
Adaptability to New Signers	Improved with optimization (e.g., NGA integration).	Limited unless retrained with diverse signer data.
Generalization Across Users	Enhanced when coupled with genetic diversity techniques (e.g., CBC).	Often need regularization and augmentation to generalize well.
Model Customization	Flexible—Can be dynamically optimized (e.g., with NGA).	Rigid—Require redesigning or retraining to adapt to new structures.
Ease of Implementation	Simple—Conceptually straightforward and easy to implement.	Complex—Require expertise in architecture design and tuning.
Sensitivity to Noise/Occlusion	Moderate—May be affected by input inconsistencies.	High—Can be sensitive to minor perturbations unless trained with noise.
Suitability for Real-Time SLR	High—Fast inference due to low complexity.	Variable—Transformers and LSTMs may incur latency in real-time settings.
Scalability with Large Data	Limited—Become less practical with extremely large datasets.	High—Scale well with large and complex datasets.

Table 2. Evaluation of proposed MC-NGA model against baseline SLR methods.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	Subject-Wise Accuracy (%)
HMM	86.4	84.7	83.9	84.3	80.5
CRF	88.9	87.6	83.2	85.9	81.2
CNN-LSTM	92.7	91.1	90.4	90.7	86.9
TCN	93.4	92.3	91.1	91.7	88.1
Transformer	94.6	93.5	92.2	92.8	89.3
GCN-based SLR	94.1	92.8	91.6	92.2	88.6
MC-NGA (proposed)	96.0	95.2	94.6	94.9	92.7

Table 3. Evaluation of proposed MC-NGA model against baseline SLR methods (with 95% Confidence Intervals).

Model	Accuracy (%) [95% CI]	Precision (%) [95% CI]	Recall (%) [95% CI]	F1-Score (%) [95% CI]	Subject-Wise Accuracy (%) [95% CI]	p-Value (vs. MC-NGA)
HMM	86.4 [84.6–88.2]	84.7 [82.5–86.9]	83.9 [81.5–86.3]	84.3 [82.3–86.3]	80.5 [78.0–83.0]	<0.0001
CRF	88.9 [87.3–90.5]	87.6 [85.6–89.6]	83.2 [80.9–85.5]	85.9 [83.9–87.9]	81.2 [78.9–83.5]	<0.0001
CNN-LSTM	92.7 [91.3–94.1]	91.1 [89.5–92.7]	90.4 [88.6–92.2]	90.7 [89.1–92.3]	86.9 [85.1–88.7]	0.0012
GCN-based SLR	94.1 [92.6–95.6]	92.8 [91.2–94.4]	91.6 [89.9–93.3]	92.2 [90.6–93.8]	88.6 [86.7–90.5]	0.0012
MC-NGA (proposed)	96.0 [95.0–97.0]	95.2 [94.0–96.4]	94.6 [93.2–96.0]	94.9 [93.7–96.1]	92.7 [91.1–94.3]	--

Table 4. Comparative performance of model variants in the subject-independent SLR framework.

Model Variant	Accuracy (%)	Error Rate (%)	Generations to Convergence	Cross-Dataset Accuracy (%)
Baseline Markov Chain	82.4	17.6	150	68.9
GA-augmented Model	87.7	12.3	110	72.4
NGA without CBC	91.9	8.1	85	78.6
NGA + CBC (full proposed model)	96.3	3.7	65	85.1

Table 5. Comparative performance of model variants in subject-independent SLR framework (with 95% Confidence Intervals).

Model Variant	Accuracy (%) [95% CI]	Error Rate (%) [95% CI]	Generations to Convergence [95% CI]	Cross-Dataset Accuracy (%) [95% CI]	p-Value (vs. Full Model)
Baseline Markov Chain	82.4 [80.3–84.5]	17.6 [15.5–19.7]	150 [145–155]	68.9 [66.5–71.3]	<0.0001
GA-augmented model	87.7 [85.9–89.5]	12.3 [10.5–14.1]	110 [105–115]	72.4 [70.1–74.7]	<0.0001
NGA without CBC	91.9 [90.4–93.4]	8.1 [6.6–9.6]	85 [82–88]	78.6 [76.5–80.7]	0.0003
NGA + CBC (full model)	96.3 [95.2–97.4]	3.7 [2.6–4.8]	65 [62–68]	85.1 [83.4–86.8]	--

Table 6. Diversity index (DI) analysis of model variants in subject-independent SLR optimization.

Model Variant	Diversity Index (DI)	DI Interpretation
Baseline Markov Chain	0.35	Low diversity: limited search space exploration
GA-augmented model	0.52	Moderate diversity due to basic genetic operations
NGA without CBC	0.68	High diversity: advanced mutation maintains variation
NGA + CBC (full proposed model)	0.82	Very high diversity: CBC preserves niche solutions

Table 7. Key characteristics and impact justification of model variants in the subject-independent SLR framework.

Model Variant	Key Characteristics	Impact Justification
Baseline Markov Chain	No optimization; fixed transition probabilities	Lacks adaptability; poor generalization to unseen users; high error rates
GA-augmented model	Basic crossover/mutation, global search capability	Introduces adaptivity but suffers from premature convergence and limited diversity
NGA without CBC	Adaptive mutation, elitism, and tournament selection	Improved exploration and convergence; better global optima due to dynamic evolution, yet still at risk of subpopulation stagnation
NGA + CBC (full model)	Clearing to retain niche solutions, enhanced diversity, and dynamic evolution	CBC prevents convergence to local optima and maintains population richness; best trade-off between accuracy, convergence speed, and robustness

Table 8. Generalization evaluation results using LOSO cross-validation (%).

Subject ID	Accuracy	Misclassification Rate	Generalization Error	Fold Variance
S1	96.2	3.8	2.1	0.8
S2	94.8	5.2	2.7	1.0
S3	95.7	4.3	1.9	0.7
S4	94.9	5.1	2.5	0.9
S5	93.8	6.2	3.3	1.2
S6	97.1	2.9	1.5	0.6
S7	95.0	5.0	2.8	1.1
S8	95.6	4.4	2.3	0.8
Mean	95.4	4.6	2.4	0.9
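
The leave-one-subject-out (LOSO) protocol behind Table 8 can be sketched as follows: each signer is held out in turn, the model is trained on the remaining signers, and the per-subject scores are averaged. The helper `loso_splits` and the sample-to-signer assignment are hypothetical; only the accuracy values are taken from Table 8.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx) for each distinct subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Hypothetical assignment of seven samples to three signers.
subject_ids = ["S1", "S1", "S2", "S3", "S2", "S3", "S1"]
folds = list(loso_splits(subject_ids))
for held_out, train, test in folds:
    print(held_out, "train:", train, "test:", test)

# Per-subject accuracies S1..S8 from Table 8; their mean gives the "Mean" row.
acc = [96.2, 94.8, 95.7, 94.9, 93.8, 97.1, 95.0, 95.6]
mean_acc = sum(acc) / len(acc)
print(round(mean_acc, 1))
```

Because no sample from the held-out signer appears in the training indices, each fold measures performance on a signer the model has never seen, which is the property that subject-independent evaluation relies on.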

Table 9. Convergence and optimization performance metrics (NGA vs. standard GA).

Generation	Avg. Fitness (NGA)	Avg. Fitness (GA)	Diversity Index (NGA)	Diversity Index (GA)	Distinct Niches (NGA)
20	0.78	0.65	0.72	0.48	5
40	0.85	0.71	0.69	0.42	6
60	0.90	0.74	0.65	0.35	7
80	0.93	0.76	0.63	0.30	6
100	0.95	0.78	0.61	0.27	6
120	0.96	0.78	0.59	0.24	5
140	0.96	0.78	0.57	0.22	5
160	0.96	0.78	0.55	0.20	5
180	0.96	0.78	0.54	0.19	5
200	0.96	0.78	0.52	0.18	4
Time to convergence (generations): NGA, 110; GA, 140.

Table 10. Performance metrics across signer variability sub-groups for robustness evaluation.

Sub-Group	Accuracy (%)	Intra-Class Misclassification Rate (%)	Inter-Class Misclassification Rate (%)	Error Rate on Complex Signs (%)
Fast signers	91.2	4.5	4.3	9.8
Slow signers	95.6	2.1	2.3	6.7
Fluent signers	96.8	1.8	1.4	5.2
Novice signers	89.3	6.3	4.4	11.1
Overall average	93.2	3.7	3.1	8.2

Table 11. Performance comparison of NGA-based model across different feature sets.

Feature Set	Recognition Accuracy (%)	NGA Adaptation Speed (Iterations to Converge)	Stability of Learned Transitions (Std. Dev. Across Runs)
Spatial-only	87.3	35	0.072
Spatiotemporal	92.6	42	0.054
Multimodal	96.2	50	0.038

Table 12. Effect of CBC-Niching parameters on NGA-Markov Chain model performance in SLR.

Configuration	Niche Radius (σ)	Niche Capacity (κ)	Distance Metric	Context Window (w)	Accuracy (%)	Diversity Index	Convergence Speed (Gens)	Fitness Variance	Misclassification Rate per Subject	Niche Survival Rate (%)
Config 1	0.1	3	Euclidean	3	89.4	0.61	28	0.043	0.189	68.2
Config 2	0.3	3	Euclidean	3	92.6	0.73	31	0.051	0.162	74.1
Config 3	0.5	5	Cosine	5	96.1	0.82	36	0.059	0.128	81.5
Config 4	0.7	5	Cosine	5	94.9	0.78	39	0.056	0.142	79.3
Config 5	1.0	10	Euclidean	7	91.2	0.67	45	0.044	0.173	62.8
Config 6	0.5	5	Manhattan	3	93.5	0.75	33	0.049	0.153	77.4

Table 13. CBC hyperparameter defaults and roles.

Hyperparameter	Description	Typical Default	Effect on Performance	Reference
Niche radius (σ or δ)	Similarity threshold for clearing individuals (e.g., cosine distance)	0.2	Balances diversity vs. convergence	[33,34]
Niche capacity (κ)	Max individuals allowed per niche before clearing is enforced	1	Controls selection pressure vs. retention of alternatives	[32]
Distance metric	Similarity measure (cosine, Euclidean, or KL divergence)	Cosine	Affects sensitivity to behavioral/contextual similarity	[33]
Similarity type	What is being compared: raw features, phase-wise transitions, etc.	Phase transition	Determines abstraction level of similarity comparison	[32,34]
Fitness sorting	Order in which individuals are cleared	Descending	Prioritizes survival of fitter, more informative individuals	CBC standard
Context window size (w)	Number of frames or transitions considered for context-based distance	3–5 phases	Controls temporal granularity of behavioral similarity	Task-dependent

Table 14. Statistical comparison of generalization performance across models.

Model	Mean Accuracy (%)	Std Dev	Generalization Error (%)	p-Value (vs. Proposed)	Cohen’s d
Proposed (MC-NGA)	95.4	1.02	2.4	—	—
MC (No NGA/CBC)	89.7	2.45	4.8	<0.001	2.79
BiLSTM	91.3	2.10	3.9	0.013	2.08

Table 15. Impact of CBC threshold (δ) on NGA-Markov Chain performance (α = 0.8, β = 0.1, γ = 0.1).

CBC Threshold (δ)	Mean Accuracy (%)	Avg Entropy	Avg Latency (Frames)	Cohen’s d (vs. δ = 0.2)
0.05	92.1	1.31	14.8	2.17
0.10	93.5	1.18	13.4	1.60
0.20	95.4	0.97	11.2	—
0.30	94.2	1.05	12.9	1.06
0.40	92.7	1.22	14.5	1.87

Table 16. Effect of fitness weight combinations (with δ = 0.20).

α (Accuracy)	β (Entropy)	γ (Latency)	Mean Accuracy (%)	Avg Entropy	Avg Latency (Frames)	Cohen’s d (vs. α = 0.8, β = 0.1, γ = 0.1)
0.6	0.2	0.2	93.8	1.11	12.8	1.44
0.6	0.3	0.1	92.9	1.04	13.6	1.94
0.6	0.1	0.3	94.2	1.16	11.6	1.12
0.8	0.1	0.1	95.4	0.97	11.2	—
0.8	0.2	0.0	94.6	0.89	12.5	0.94
0.8	0.0	0.2	94.8	1.04	10.8	0.86
1.0	0.0	0.0	95.0	1.22	12.9	0.67

Table 17. Results comparing baseline vs. context-aware Markov reward in SLR.

Reward Scheme	State-Level Accuracy (%)	Mean Decision Latency (Frames)	Transition Consistency Score (%)	AURC (Avg Reward/Episode)
Baseline Markov reward	81.3 ± 1.2	14.8 ± 0.9	73.5 ± 2.1	6.8 ± 0.4
Context-aware Markov reward	88.7 ± 0.9	10.2 ± 1.1	82.4 ± 1.7	9.3 ± 0.3

Table 18. MC-NGA runtime and scalability metrics under varying signer and gesture complexity.

# Signers	# Gesture Classes	Samples/Class	Training Time per Epoch (s)	Inference Time per Sequence (ms)	Peak Memory Usage (MB)	CPU Utilization (%)	Accuracy (%)	Subject-Wise Accuracy (%)	Convergence (Epochs)
5	100	2	38	28	910	65	94.1	91.2	16
10	300	5	92	33	1120	70	94.8	91.9	20
20	600	5	172	38	1300	77	95.2	92.3	23
35	800	10	280	45	1490	83	95.6	92.5	28
50	1000+	10	398	52	1670	88	96.0	92.7	32

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
