Smart Home Automation-Based Hand Gesture Recognition Using Feature Fusion and Recurrent Neural Network

Gestures have been used for nonverbal communication for a long time, but human–computer interaction (HCI) via gestures is becoming more common in the modern era. To obtain a greater recognition rate, the traditional interface comprises various devices, such as gloves, physical controllers, and markers. This study provides a new markerless technique for obtaining gestures without the need for any barriers or pricey hardware. In this paper, dynamic gestures are first converted into frames. The noise is removed, and intensity is adjusted for feature extraction. The hand gesture is first detected in the images, and the skeleton is computed through mathematical computations. From the skeleton, the features are extracted; these features include joint color cloud, neural gas, and directional active model. After that, the features are optimized, and a selective feature set is passed through a recurrent neural network (RNN) classifier to obtain the classification results with higher accuracy. The proposed model is trained and experimentally assessed over three datasets: HaGRI, Egogesture, and Jester. The proposed system achieved an accuracy of 92.57% over HaGRI, 91.86% over Egogesture, and 91.57% over the Jester dataset. Also, to check the model's reliability, the proposed method was tested on the WLASL dataset, attaining 90.43% accuracy. This paper also includes a comparison of our model with other state-of-the-art recognition methods. Our model presented a higher accuracy rate with a markerless approach to save money and time for classifying the gestures for better interaction.


Introduction
In recent years, home automation has emerged as a research topic. Many researchers have started investigating the demand criteria for home automation in different environments. Human-computer interaction (HCI) [1] is considered a more interactive and resourceful method of engaging with different appliances to make the system work. In the conventional approach, different devices like a mouse, keyboard, touch screen, and remote devices are used to fulfill requirements so that users can interact by only using their hands with different home appliances, home healthcare, and home monitoring systems. Usually, changing channels and controlling light on/off switches are more demanding research areas for HCI [2]. Earlier systems were divided into two approaches for interacting with computers. The first approach is inertial sensor-based and the second approach is vision-based. In the first approach, sensors are built with one or more arrays. They track the position, velocity, and acceleration of the hand. Then, these motion features are trained and tested for hand gesture recognition. They are used to control home appliances like TV, radio, and lights [3][4][5][6][7]. Despite its high sensitivity, this approach makes it difficult to obtain higher accuracy. This approach demands a proper setup with high-quality sensors. The use of high-quality sensors can attain better results, but they make the system more expensive, and durability issues arise. With the advancement of technology, new sensors are continually being launched in the market [8], the purpose of which is to minimize sensitivity, making them more expensive.
The second approach is vision-based, which reduces the limitations arising from the sensor-based approach [9]. With this approach, hand gestures are recognized using images. The images consist of RGB and depth. The RGB images are collected using cameras. The cameras are less expensive and easy to set up properly. The color, shape, orientation, contours, and positions in the RGB image are calculated for hand gesture recognition. The vision-based sensors with depth images gain more dimensions than RGB [10]. For depth, thresholding techniques are either empirical or automated. Empirical techniques include the trial-and-error method, in which the search space is excluded, and the computation cost is a priority for hand localization. In automated solutions, the hand is considered the main focus area for data acquisition [11]. The hand is localized as the closest object in front of the camera in the depth image.
Vision-based sensors also pose some challenges for researchers, such as light intensity, clutter sensitivity, and skin tone color [12]. Hand localization is a crucial step. For this, the conventional systems are divided into different steps to obtain better accuracy while keeping the challenges in view. First, data acquisition is performed, followed by hand detection. For hand detection, multiple methods are used, including segmentation, tracking, and color-based extractions. The features are extracted using different algorithms. After that, the gesture is recognized. For the given approach, both images and videos are collected [13]. The still images provide static gestures, whereas videos provide dynamic hand gestures, as changes in hand gestures from one frame to another are noticed. Static gestures are still images and require less computation cost [14][15][16][17], whereas dynamic gestures contain three-dimensional motion. The movement in dynamic hand gestures becomes a challenging task as the speed varies, and gesture acquisition is difficult due to speed issues. In the literature, static and dynamic gesture recognition has been performed using two different methods: supervised and unsupervised learning. Supervised learning methods include decision trees, random forests, and SVM, whereas unsupervised learning methods include k-means, hidden Markov model, and PCA [18].
In our proposed model, we have used dynamic gestures to challenge our limitations. Our system proved its compatibility. In this paper, the videos are first converted into frames. An adaptive median filter and gamma correction are applied to the images to reduce noise and adjust the light intensity, respectively. Then, the hand is detected using saliency maps. The extracted hand is then available for feature extraction. We have extracted different features while keeping in view the issues that hinder classification. For this purpose, we have chosen three different state-of-the-art algorithms. These features are named the joint color cloud, neural gas, and directional active model. The features are then optimized using an active bee colony algorithm. The optimized features are passed through the RNN. The resulting accuracies compare favorably with existing model designs. The main contributions of our system are as follows:

• The system approach is different from previous systems; it recognizes dynamic gestures with complex backgrounds.
• Hands are detected from the images using two-way detection: first, the skin tone pixels are extracted, and then the saliency map is applied for greater precision.
• Features are collected using different algorithms, like fast marching, neural gas, and the 8-Freeman chain model. All the features are extracted with modifications to the algorithms listed. The features are collected and fused to make a feature fusion for recognition.
• The proposed system uses a deep learning algorithm, an RNN, to achieve higher accuracy.

The rest of this article is organized as follows: Section 2 includes a related study of the existing methods. Section 3 presents the architecture of the proposed system. Section 4 shows the experimental section with system performance evaluations. Section 5 describes the strengths and weaknesses of our proposed system. Section 6 presents the conclusion of the system and future work directions.

Literature Review
Multiple methods have been introduced to acquire hand gestures. This section presents the most useful and popular methods. A literature review was conducted to study the research work carried out in particular areas.

Hand Gesture Recognition via RGB Sensors
In hand gesture recognition systems, many researchers use sensors and cameras to recognize gestures. The RGB videos can be collected using different cameras. Table 1 presents the methods used by researchers for hand gesture recognition using RGB videos.
Table 1. Related studies on hand gesture recognition using RGB sensors.

Authors
Methodology

S. Nagarajan et al. [19]
The proposed system captures American Sign Language gestures and filters the images using Canny edge detection. An Edge Orientation Histogram (EOH) was used for feature extraction, and these feature sets were classified by a multiclass SVM classifier; however, some signs were not detected due to hand orientation and gesture similarity.

Mandeep et al. [20]
The hand gesture system used the skin color model and thresholding; the hand region was segmented in YCbCr, skin color segmentation was used to extract the skin pixels, and Otsu thresholding removed the image's background. In the last step, PCA with the template-matching method was used to recognize a total of twenty images per gesture from five different poses from four gesture captures. On the other hand, this system has some limitations in that skin color varies with lighting, and the background may contain skin color pixels.

Thanh et al. [21]
Multimodal streams are used to increase the performance of hand recognition by combining depth, RGB, and optical flow. A deep learning model is used for feature extraction from each stream; afterward, these features are combined with different fusion methods for the final classification. This system achieves improved results with multimodal streams of different viewpoints collected from twelve gestures.

Noorkholis et al. [22]
In dynamic hand gesture recognition, the dataset of RGB and depth images is preprocessed using the Region of Interest (ROI) to extract the original pixel values of the hand instead of other unnecessary points. A combination of a three-dimensional convolutional neural network (3DCNN) and long short-term memory (LSTM) is used to extract the spatio-temporal features, which are further classified by a finite state machine (FSM) model to handle the different gestures used in different applications. This proposed system is designed for a smart TV environment, and for this purpose, eight gestures out of 24 perform robustly in real-time testing.

K. Irie et al. [23]
In this paper, the hand gesture is detected from the motion of the hand in front of the camera. The hand motion is detected to control the electronic appliances in intelligent rooms with complete control of hand gestures. The cameras have the ability to zoom in and focus on the user to detect the hand gesture. The hand is detected via color information and motion direction using fingers.

Chen-Chiung Hsieh et al. [24]
This research was conducted to reduce issues like hand gesture detection from complex backgrounds and light intensity issues. The hand gesture was detected with the help of the body skin detection method. The gestures were classified with the help of a new hand gesture recognition model called the motion history image-based method. A total of six hand gestures at different distances from the camera were used as the dataset. The images were trained using Haar-like features with up, down, right, and left movements. The home automation-based system achieved 94.1% accuracy using the proposed method.

Zhou Ren et al. [25]
A new study was conducted on hand gesture recognition using the finger earth mover distance (FEMD) approach. They examined the speed and accuracy of the FEMD, shape context, and shape-matching algorithms. The dataset was collected from the Kinect camera, so it contained both depth and RGB images.

Jaya Prakash Saho [26]
Currently, convolutional neural networks (CNNs) exhibit good recognition rates for image classification problems. It is difficult to train deep CNN networks such as AlexNet, VGG-16, and ResNet from scratch due to the lack of large, labelled sets of static hand gesture images. To recognize hand gestures in a dataset with a low number of gesture images, they used an end-to-end fine-tuning strategy for a pre-trained CNN model with score-level fusion. They used two benchmark datasets, and the efficacy of the proposed approach was assessed using leave-one-subject-out cross-validation (LOO CV) and conventional CV tests. They also proposed and evaluated a real-time American Sign Language (ASL) recognition system.

Ing-Jr Ding [27]
The suggested method consists of two sequential computation steps: phase 1 and phase 2. The deep learning model, a visual geometry group (VGG)-type convolutional neural network (CNN), also known as the VGG-CNN, is used to assess the recognition rate. The experiments showed that image extraction efficiently eliminates the undesirable shadow region in hand gesture depth pictures and greatly improves the identification accuracy.

Jun Li [28]
They proposed MFFCNN-LSTM for forearm sEMG signal recognition using time-domain and time-frequency spectrum features. They first extracted hand movements from the NinaPro db8 dataset, and then images were denoised via empirical Fourier decomposition. The images were passed through the different channels using CNN to collect the time-domain and time-frequency-spectrum features. The features were fused and passed to the LSTM. They achieved 98.5% accuracy with the proposed system.

Hand Gesture Recognition via Marker Sensors
Many researchers worked on marker sensors with proper equipment setup. Gloves were attached to the hands to note down the locations and movements. Table 2 presents the researchers' methods for hand gesture recognition using marker videos.

Authors
Methodology

Safa et al. [29]
Currently, hand gesture systems deploy many recognition systems with sensors to locate the correct motion and gesture of the hand without any distortion. Combining machine learning and sensors increases the potential in the field of digital entertainment by using touchless and touch-dynamic hand motion. In a recent study, a leap motion device was used to detect the dynamic motion of the hand without touching it, analyse the sequential time series data using long short-term memory (LSTM) for recognition, and separate unidirectional and bidirectional LSTM. The novel model, named Hybrid Bidirectional Unidirectional LSTM (HBU-LSTM), improves performance by considering spatial and temporal features between leap motion data and neural network layers.

Xiaoliang et al. [30]
The hand gesture system, with a novel approach, combines a wearable armband and customized pressure sensor smart gloves for sequential hand motion. The data collected from the inertial measurement unit (IMU), fingers, palm pressure, and electromyography was processed using deep learning. Long short-term memory (LSTM) models were applied for training and testing. The experimental work showed outstanding results with dynamic and air gestures collected from ten different participants.

Muhammad et al. [31]
In a smart home, the automatic system developed for elder care deployed a home automation system that uses gestures to control appliances of daily use, with embedded hand gloves detecting the motion of the hand. For hand movements, wearable sensors such as an accelerometer and gyroscope were used to collect the combined feature set, and a random forest classifier was used to recognize the nine different gestures.

Dong-Luong-Dinh et al. [32]
This paper provides a novel approach to hand gesture detection for home appliances. They controlled home appliances using hand gestures by detecting hands and generating control commands. They created a database for hand gestures via labelling part maps and then classifying them using random forests. They built a system for TV, lights, doors, changing channels, fans, temperature, and volume using hand gestures.

Muhammad Muneeb et al. [33]
Smart homes for elderly and disabled people need special attention, as awareness of geriatric problems is necessary to resolve these issues. Researchers have developed many gesture recognition systems in various domains, but the authors of this paper presented a way to deal with elderly issues in particular. They used gloves to record the rotation, tilting, and acceleration of the hand. The nine gestures were classified using random forest, attaining an accuracy of 94% over the benchmark dataset.

Chi-Huang Hung et al. [34]
They proposed a system for an array lamp that performed ON/OFF actions and dimmed the light. They used a gyroscope and an accelerometer for hand detection. The noise was removed using a Kalman filter, and signals were decoded after receiving them from the devices to convert them into the desired gestures.

Marvin S. Verdadero et al. [35]
Remote control devices are common, but the setup is very expensive. The static hand gestures are taken from an Android mobile, and the signals are passed to the electronic devices. The distance should be 6 m from the device to pass the signals accurately for gesture recognition.

Zhiwen Deng [36]
Sign language recognition (SLR) is an efficient way to bridge communication gaps. SLR can additionally be used for human-computer interaction (HCI), virtual reality (VR), and augmented reality (AR). To enhance the research study, they proposed a skeleton-based self-distillation multi-feature learning method (SML). They constructed a multi-feature aggregation module (MFA) for the fusion of the features. For feature extraction and recognition, a self-distillation-guided adaptive residual graph convolutional network (SGA-ResGCN) was used. They tested the system on two benchmark datasets, WLASL and AUTSL, attaining accuracies of 55.85% and 96.85%, respectively.

Elahe Rahimian [37]
To reduce computation costs in complex architectures while training on larger datasets, they proposed a temporal convolution-based hand gesture recognition system (TC-HGR). The 17 gestures were trained using attention mechanisms and temporal convolutions. They attained 81.65% and 80.72% classification accuracy for window sizes of 300 ms and 200 ms, respectively.

System Methodology
The proposed architecture detects hand gestures in a dynamic environment. First, each dynamic gesture video is converted into frames. The acquired images are passed through an adaptive median filter for noise reduction, and then gamma correction is applied to the images to adjust the image intensity for better detection. On the filtered images, skin color is detected, and a saliency map is applied over it for hand extraction. The extracted hand is trained over a pre-defined model for the hand skeleton. After that, the detected hand and skeleton are used for feature extraction. The features include a joint color cloud, neural gas, and a directionally active model. The features are optimized to reduce complexity via graph mining. Finally, for the gestures, an RNN is implemented for classification. The architecture of the proposed system is presented in Figure 1.

Images Pre-Processing
In the acquired image, noise reduction is necessary to remove extra pixel information, as extra pixels hinder detection [38][39][40][41]. An adaptive median filter is used to detect the pixels affected by noise. This filter maintains the image quality and avoids the image blurring effect. The pixels in the noisy image are compared with the values of their neighboring pixels. A pixel showing a dissimilar value is labelled as a noisy pixel, and the filter is applied over it. The pixel value is adjusted and replaced with a value taken from its neighboring pixels. For every pixel, the local region statistical estimate is calculated, resulting in $\hat{a}$; $a$ is the uncorrupted image, and $\hat{a}$ is its estimate. The mean square error (MSE) between these two images, $\hat{a}$ and $a$, is minimized. The MSE is presented as follows:

$$e^2 = E\left\{(a - \hat{a})^2\right\}$$

Conventional filters change all pixel values to denoise the image, but adaptive median filters work in two levels, A and B, to change only the dissimilar pixels. Level A is presented as follows:

$$A1 = Q_{med} - Q_{min}, \qquad A2 = Q_{med} - Q_{max}$$

where $Q_{med}$ represents the median of the gray level in the original image $I_{xy}$; $Q_{min}$ is the minimum gray level in $I_{xy}$; $Q_{max}$ is the maximum gray level in $I_{xy}$. If $A1 > 0$ and $A2 < 0$, there is a shift to level B. Otherwise, the window size is increased; if the window size is less than or equal to $I_{max}$, level A is repeated, where $I_{max}$ represents the maximum size of $I_{xy}$. Otherwise, the gray level $Q_{xy}$ at the pixel coordinates is output. Level B is presented as follows:

$$B1 = Q_{xy} - Q_{min}, \qquad B2 = Q_{xy} - Q_{max}$$

If $B1 > 0$ and $B2 < 0$, then $Q_{xy}$ is output; otherwise, $Q_{med}$ is output. Figure 2 shows a flowchart of the algorithm implemented for the filter.

The denoised image intensity is adjusted via gamma correction, as brightness plays a key role in the detection of a region of interest [42]. The power law for gamma correction is defined as follows:

$$W_o = G \, W_I^{\gamma}$$

where $W_I$ is the non-negative input value raised to the power $\gamma$, $G$ is a constant usually equal to 1, and the range can lie between 0 and 1. $W_o$ is the output value [43][44][45]. The denoised intensity-adjusted image, including the plot, is shown in Figure 3.
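The two pre-processing steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the paper: the function names, the starting window of 3, the default `max_window=7`, and the edge padding are all assumptions made for the example.

```python
import numpy as np

def adaptive_median_filter(img, max_window=7):
    """Two-level (A/B) adaptive median filter for a 2-D grayscale image.

    max_window plays the role of I_max; its default is an assumption.
    """
    img = np.asarray(img, dtype=np.float64)
    pad = max_window // 2
    padded = np.pad(img, pad, mode="edge")  # assumed border handling
    out = img.copy()
    rows, cols = img.shape
    for y in range(rows):
        for x in range(cols):
            q_xy = img[y, x]
            cy, cx = y + pad, x + pad
            window = 3
            while True:
                half = window // 2
                region = padded[cy - half:cy + half + 1, cx - half:cx + half + 1]
                q_min, q_med, q_max = region.min(), np.median(region), region.max()
                # Level A: the median is usable only if it is not an extreme value.
                if q_med - q_min > 0 and q_med - q_max < 0:
                    # Level B: keep the pixel unless it is itself an extreme value.
                    if q_xy - q_min > 0 and q_xy - q_max < 0:
                        out[y, x] = q_xy
                    else:
                        out[y, x] = q_med
                    break
                window += 2  # grow the window and repeat level A
                if window > max_window:
                    out[y, x] = q_xy  # window exhausted: output Q_xy
                    break
    return out

def gamma_correct(img, gamma, gain=1.0):
    """Power-law transform W_o = G * W_I**gamma for values scaled to [0, 1]."""
    return gain * np.power(np.asarray(img, dtype=np.float64), gamma)

# Usage: a smooth ramp with one impulse ("salt") pixel in the centre.
frame = np.arange(25, dtype=np.float64).reshape(5, 5) / 50.0
frame[2, 2] = 1.0
denoised = adaptive_median_filter(frame)
brightened = gamma_correct(denoised, gamma=0.5)
```

In this sketch only pixels that equal the local minimum or maximum are replaced, which is what lets the filter remove impulses without blurring the rest of the frame.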

Hand Detection
In this section, the hand is detected from the images using a two-way model. First, the skin tone pixels are detected to localize the region of interest [46][47][48][49][50]. Then, a saliency map is applied over the image to obtain a better view of the desired gesture. The goal of the saliency map is to find the appropriate localization map $M^h_s$ for the region of interest, where $u * v$ represents the width and height of the image; $i$ is the region of interest; $\alpha^h_i$ represents the global average pooling; and $R$ is the gradient via backpropagation. The average of the feature map is calculated using the weights assigned to the pixel gradient. Then, the ReLU is applied over the feature map. The image view range is set between 0 and 1, and the image is upscaled and overlaid on the original image, resulting in a saliency map [51][52][53]. Figure 4 presents the saliency map for the HaGRI dataset "stop" and "ok" gestures.
Sensors 2023, 23, 7523
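The steps described above (gradient pooling, weighted feature-map average, ReLU, rescaling to [0, 1], upscaling) resemble a Grad-CAM-style computation. The sketch below works under that assumption; the array shapes, the function names, and the nearest-neighbour upscaling are illustrative choices rather than details taken from the paper.

```python
import numpy as np

def saliency_map(feature_maps, gradients):
    """Grad-CAM-style localization map.

    feature_maps, gradients: arrays of shape (channels, u, v), assumed to come
    from the last convolutional layer of some network and its backpropagated
    gradients.
    """
    feature_maps = np.asarray(feature_maps, dtype=np.float64)
    gradients = np.asarray(gradients, dtype=np.float64)
    # Global average pooling of the gradients: one weight per channel.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of the feature maps, followed by ReLU.
    cam = np.maximum(np.tensordot(weights, feature_maps, axes=1), 0.0)
    # Rescale to the range [0, 1].
    peak = cam.max()
    return cam / peak if peak > 0 else cam

def upscale_nearest(cam, factor):
    """Nearest-neighbour upscaling so the map can be overlaid on the image."""
    return np.kron(cam, np.ones((factor, factor)))

# Usage with toy inputs: two 2x2 feature maps and constant gradients.
maps = np.array([[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]])
grads = np.array([np.ones((2, 2)), -np.ones((2, 2))])
heat = upscale_nearest(saliency_map(maps, grads), factor=2)
```

The resulting `heat` array can be alpha-blended with the original frame to visualize which pixels drive the detection.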

Hand Skeleton
For hand skeleton mapping, hand localization is the foremost step [54]. In our research, we first separated the palm and fingers for an accurate classification of the skeleton points. For palm extraction, a single-shot multibox detector (SSMD) is used; it excludes the fingers, and only the palm is bound by the blob. Then, the palm region is converted into a binary image, and a four-phase sliding window is moved across the whole area to detect the four extreme left, right, top, and bottom points. The second phase of the system includes finger identification; again, SSMD is used to detect the fingers. The palm is excluded, and the four-phase sliding window is moved over the extracted fingers. It identifies the extreme top, bottom, left, and right points [55]. From the extreme tops, the curves of the pixels are noted and marked. As a result, five points on the fingers and four points on the palm are obtained. Figure 5 shows the hand skeleton results for the HaGRI dataset.

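The search for the four extreme points of a binary palm or finger region can be illustrated as follows. The paper does not specify how its four-phase sliding window works, so this sketch simply scans the whole mask with NumPy; the function name and the returned dictionary layout are hypothetical.

```python
import numpy as np

def extreme_points(mask):
    """Return the extreme left, right, top, and bottom foreground pixels.

    mask: 2-D binary array (non-zero = foreground). Points are (row, col).
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask contains no foreground pixels")
    return {
        "left": (int(ys[xs.argmin()]), int(xs.min())),
        "right": (int(ys[xs.argmax()]), int(xs.max())),
        "top": (int(ys.min()), int(xs[ys.argmin()])),
        "bottom": (int(ys.max()), int(xs[ys.argmax()])),
    }

# Usage: a plus-shaped blob standing in for a binarized palm region.
palm = np.zeros((5, 5), dtype=np.uint8)
palm[2, 1:4] = 1
palm[1:4, 2] = 1
points = extreme_points(palm)
```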


Fusion Features Extraction
In this section, we illustrate how to extract various features from the acquired hand gestures. In hand gesture recognition systems, feature extraction contains two types of features: full-hand and point-based [56]. The full-hand feature set is made up of two techniques: a joint color cloud and neural gas. A directionally active model is included in the point-based feature. Both types of extracted features are fused together to generate a feature set for recognition.

Joint Color Cloud
For this feature, an algorithm is used to generate a cloud with different colors, which helps to obtain the skeleton point accuracy, and the geodesic distance for all fingers, including the palm, is extracted for the feature set. The color cloud is generated using a fast-marching algorithm [56][57][58]. This algorithm is defined as follows:
(1) Suppose we are interested in the function value $f(i, j)$ over the region of interest. This leads to two types of spatial derivative operators:

$$S^{+i} f = \frac{f(i + t, j) - f(i, j)}{t}, \qquad S^{-i} f = \frac{f(i, j) - f(i - t, j)}{t}$$

where $S^{+i} f$ is the forward operator, as it uses $f(i + t, j)$ to propagate from right to left when finding the value of $f(i, j)$. On the other hand, $S^{-i} f$ represents the backward operator, propagating from left to right.
(2) For the difference operator, a discrete function is used to calculate $f_{i,j}$. For this purpose, the speed function $P_{i,j}$ is defined at each specific point, and $f_{i,j}$ is interpreted as the arrival time at $(i, j)$.
(3) For the neighbor pixel value calculation, only the $f_{i,j}$ values of points already included in the point set can be used.
(4) A quadratic equation is formulated for $f_{i,j}$, with the form of its solution depending on whether $1/f_{i,j} > |p - q|$.
These computations are performed only on the neighbors of the newly added points. If a neighboring point already has a value, the newly calculated value is compared with it, and the smaller value is kept. In every iteration, the smallest value is found and stored. To save time, a min-heap is used in the fast-marching algorithm to retrieve the minimum values quickly. These iterations continue until the endpoint is reached [59,60]. Figure 6 shows the results for the point-colored cloud.
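As an illustration of how a min-heap drives the propagation, the sketch below implements the textbook first-order fast-marching update on a 2-D grid with unit spacing. It is an assumption that the paper follows this standard form; the function name, the `seeds` argument, and the quadratic update (the usual two-neighbour solution) are not taken from the source.

```python
import heapq
import numpy as np

def fast_marching(speed, seeds):
    """Arrival times f on a grid for a front starting at `seeds`.

    speed: 2-D array of positive speeds P (cells with speed <= 0 are skipped).
    seeds: iterable of (row, col) start points with arrival time 0.
    """
    speed = np.asarray(speed, dtype=np.float64)
    rows, cols = speed.shape
    f = np.full((rows, cols), np.inf)
    frozen = np.zeros((rows, cols), dtype=bool)
    heap = []
    for i, j in seeds:
        f[i, j] = 0.0
        heapq.heappush(heap, (0.0, i, j))

    def known(i, j):
        # Only already accepted points may contribute to an update.
        if 0 <= i < rows and 0 <= j < cols and frozen[i, j]:
            return f[i, j]
        return np.inf

    def solve(i, j):
        p = min(known(i - 1, j), known(i + 1, j))  # best vertical neighbour
        q = min(known(i, j - 1), known(i, j + 1))  # best horizontal neighbour
        inv = 1.0 / speed[i, j]
        if abs(p - q) < inv:
            # Both directions contribute: larger root of the quadratic.
            return 0.5 * (p + q + np.sqrt(2.0 * inv ** 2 - (p - q) ** 2))
        return min(p, q) + inv  # one-sided update

    while heap:
        _, i, j = heapq.heappop(heap)  # smallest tentative arrival time
        if frozen[i, j]:
            continue  # stale heap entry
        frozen[i, j] = True
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < rows and 0 <= nj < cols and not frozen[ni, nj] and speed[ni, nj] > 0:
                t_new = solve(ni, nj)
                if t_new < f[ni, nj]:  # keep the smaller of old and new value
                    f[ni, nj] = t_new
                    heapq.heappush(heap, (t_new, ni, nj))
    return f

# Usage: uniform speed, front started at the centre of a 5x5 grid.
arrival = fast_marching(np.ones((5, 5)), seeds=[(2, 2)])
```

With a hand mask as the speed map (1 inside the hand, 0 outside) and a palm point as the seed, the arrival times approximate geodesic distances that could be mapped to colors.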


Neural Gas
Neural maps organize themselves and form neural gas, which has the ability to rank neighborhood vectors that determine the neighborhood data space [61,62]. The neural gas is composed of multiple neurons, $n$, comprising weight vectors $W(r)$ that result in forming clusters. During training, every single neuron presents a change in position with an abrupt movement. A feature vector is randomly assigned to every single neuron. From the formed neural gas network, random data $r$ is selected from the feature vector. With the help of this data vector $r$, the Euclidean distance is calculated from all the weight vectors. The computed distance values determine the center adjustment with the selected data vector [63][64][65]. The feature vector itself is defined as follows:

where the probability distribution $W(r)$ of the data vector $n$ with a finite number of sets $s_f$, $f = 1, \ldots, N$. A data vector $n$ for probability distribution $W(r)$ is presented at each time step $m$. The distance order is determined from the feature vector of the given data $r$. If $n_0$ is the index of the closest feature vector, $n_1$ is the second closest, and $n_{N-1}$ is the most distant to the data vector $n$, then $\varepsilon$ represents the adaptation step size and $\lambda$ represents the neighborhood range. After most of the adaptation steps, the data space is covered with a feature vector with minimum errors. Algorithm 1 defines the pseudocode for neural gas formation, and Figure 7 presents the structure of the neural gas over the HaGRI dataset gesture.

Algorithm 1. Neural gas formation.
Output: G = (n_0, n_1, ..., n_N): the map; I ← []
Method:
    I ← N(n_0, n_1), where n_0 represents the first node and n_1 represents the second node
    n_0 ← 0; n_N ← 100
    Whereas, the input signal Φ is as follows:
    Adjust edge I ← [n_{i+1}]
    repeat until n_N ← 100
    end while
    return G = (n_0, n_1, ..., n_N)
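
The rank-based adaptation that the text describes (order the units by distance to the presented data vector, then move each one by an amount that decays with its rank) can be sketched as follows. This is the textbook neural gas rule, used here as an assumption because the paper's own equation did not survive extraction; the function name `neural_gas`, the exponential annealing of the step size `eps` and neighborhood range `lam`, and all default values are illustrative choices, not taken from the paper.

```python
import math
import random


def neural_gas(data, n_units=4, epochs=30, eps=(0.5, 0.01), lam=(2.0, 0.1), seed=0):
    """Fit `n_units` weight vectors to `data` with the standard neural gas rule.

    `eps` and `lam` are (initial, final) pairs for the adaptation step size and
    the neighborhood range; both are annealed exponentially (assumed schedule).
    """
    rng = random.Random(seed)
    # Start every unit on a randomly chosen data vector.
    units = [list(rng.choice(data)) for _ in range(n_units)]
    total = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):            # one shuffled pass
            frac = step / total
            e = eps[0] * (eps[1] / eps[0]) ** frac       # current step size
            l = lam[0] * (lam[1] / lam[0]) ** frac       # current neighborhood range
            # Rank the units by Euclidean distance to the presented vector.
            order = sorted(range(n_units), key=lambda i: math.dist(units[i], x))
            for rank, i in enumerate(order):
                g = e * math.exp(-rank / l)              # closest unit moves most
                units[i] = [w + g * (xi - w) for w, xi in zip(units[i], x)]
            step += 1
    return units
```

Each update is a convex combination of the old weight and the data vector, so the units never leave the region spanned by the data.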

Directional Active Model
The next feature is extracted using the eight-direction Freeman chain code algorithm, which measures the change in the directions of the curves at the boundary of the hand gesture [66]. Freeman chain codes are shape descriptors, and they change structural schemes with a contour-dependent scheme. A shape description possesses a set of lines oriented in a particular manner. The oriented vectors are in eight and four directions, and the chain code vectors have integer numbers represented in a possible direction, as shown in Figure 8.

First, the boundary of the hand is identified to obtain the curves. Suppose the points on the curve are denoted by $c$ on the boundary $d$. The starting point $t$ on the top-right side of the thumb orientation is checked for its vector position. The curve points on the boundary $P_b$ are calculated for all points, so it becomes $P_b = \{t_0, t_1, \ldots, t_{n-1}\}$. After attaining the vector positions of $t_0$ and $t_1$, the directions of both curve points are compared; if they have the same value, the value of $t_1$ is not considered and the vector position of the next point $t_2$ is checked; otherwise, both curve point values are added to the list. This procedure continues until $t_{n-1}$ is reached. Figure 9 depicts the flow of the point extraction in a directionally active model.

For our proposed system feature vector, we considered only 12 positions: 8 with an angle of 45° and 5 with an angle of 90°. The angle description is shown in Figure 10, which illustrates a better demonstration of the feature vector [67].
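
The point-selection rule above (keep a boundary point only when its direction differs from the previously kept one) can be illustrated with a short sketch. The direction numbering is an assumption: code 0 points along +x and the codes increase counter-clockwise in image coordinates, where y grows downward; the numbering in Figure 8 may differ. The helper names `chain_code` and `direction_changes` are hypothetical, and the boundary is assumed to be an ordered list of 8-connected pixel coordinates.

```python
# Assumed numbering: 0 = east, then counter-clockwise, with y growing downward.
DIRECTIONS = {
    (1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
    (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7,
}


def chain_code(boundary):
    """Freeman code of each step between consecutive boundary points."""
    codes = []
    for (x0, y0), (x1, y1) in zip(boundary, boundary[1:]):
        codes.append(DIRECTIONS[(x1 - x0, y1 - y0)])
    return codes


def direction_changes(boundary):
    """Keep only the points at which the chain-code direction changes."""
    kept = []
    for point, code in zip(boundary, chain_code(boundary)):
        # A point with the same direction as the last kept one is skipped.
        if not kept or kept[-1][1] != code:
            kept.append((point, code))
    return kept


# An L-shaped boundary fragment: right, right, down, down, left.
boundary = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2)]
print(chain_code(boundary))         # [0, 0, 6, 6, 4]
print(direction_changes(boundary))  # [((0, 0), 0), ((2, 0), 6), ((2, 2), 4)]
```

Only three of the five steps survive in the example, one per run of identical directions, which is how the list of curve points stays compact.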

Feature Analysis and Optimization
After feature extraction from all datasets, the extracted features are passed through an artificial bee colony algorithm (ABCA) for optimization [68]. This helps reduce the computation time and the complexity of the data. ABCA consists of two groups of bees: employed bees and onlooker bees. Both groups contain the same number of bees, which equals the number of solutions in the swarm. An initial population of the swarm size is generated with a random distribution. Suppose the $j$-th solution in the swarm is denoted as $X_j = \{x_{j,1}, x_{j,2}, \ldots, x_{j,n}\}$. Employed bees find their food sources as follows:

where $X_l$ represents the candidate solution and is randomly selected such that $l \neq j$. $\phi_{j,i}$ represents a random number from the range [-1, 1], and $i$ is the dimension index from {1, 2, 3, ..., N}. When the food search by the employed bees is completed, they share the nectar information with the onlooker bees, which then choose food sources in proportion to the nectar amount. The fitness function of the new candidate solution is defined as follows:

where $Pn_j$ is the selection probability of the $j$-th food source, which is higher for better solutions, and $fit_j$ represents the fitness value of the $j$-th solution in the swarm. If the position of a food source does not change within a predefined number of iterations, the food source $X_j$ is replaced with $X_{j,i}$ found by scout bees:

$$X_{j,i} = ob_i + \mathrm{rand}(0, 1)\,(kb_i - ob_i)$$

where $ob_i$ and $kb_i$ are the lower and upper boundaries of the $i$-th dimension, and $\mathrm{rand}(0, 1)$ represents a random value between 0 and 1. Figure 11 presents the overall flowchart of the artificial bee colony to determine the decision steps, while Figure 12 presents the best fitness result over the "call" gesture in the HaGRI dataset.
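
The three phases described above (employed bees, onlooker bees, scout bees) can be sketched as a small optimizer. This follows the commonly published form of the artificial bee colony algorithm and is an assumption wherever the paper's equations are missing: the neighbor move `x[j][i] + phi * (x[j][i] - x[l][i])`, the fitness `1 / (1 + cost)`, and the fitness-proportional onlooker selection are the standard choices. The function name `abc_minimize`, the parameter names, and the toy cost function are hypothetical.

```python
import random


def abc_minimize(cost, bounds, n_sources=10, limit=20, iterations=200, seed=0):
    """Minimize `cost` over box `bounds` = [(ob_i, kb_i), ...] with a basic ABC loop."""
    rng = random.Random(seed)
    dim = len(bounds)

    def scout():
        # Scout rule from the text: ob_i + rand(0, 1) * (kb_i - ob_i).
        return [lo + rng.random() * (hi - lo) for lo, hi in bounds]

    def fitness(value):
        # Standard ABC fitness (assumed): larger is better.
        return 1.0 / (1.0 + value) if value >= 0 else 1.0 + abs(value)

    def try_neighbor(j):
        l = rng.choice([k for k in range(n_sources) if k != j])  # partner, l != j
        i = rng.randrange(dim)                                   # one dimension
        phi = rng.uniform(-1.0, 1.0)
        cand = sources[j][:]
        cand[i] += phi * (sources[j][i] - sources[l][i])
        cand[i] = min(max(cand[i], bounds[i][0]), bounds[i][1])  # stay in bounds
        c = cost(cand)
        if c < costs[j]:                                         # greedy selection
            sources[j], costs[j], trials[j] = cand, c, 0
        else:
            trials[j] += 1

    sources = [scout() for _ in range(n_sources)]
    costs = [cost(s) for s in sources]
    trials = [0] * n_sources
    best = min(zip(costs, sources))
    for _ in range(iterations):
        for j in range(n_sources):                               # employed bees
            try_neighbor(j)
        fits = [fitness(c) for c in costs]
        # Onlookers pick sources with probability proportional to fitness.
        for j in rng.choices(range(n_sources), weights=fits, k=n_sources):
            try_neighbor(j)
        best = min(best, min(zip(costs, sources)))               # remember the best
        for j in range(n_sources):                               # scout bees
            if trials[j] > limit:
                sources[j] = scout()
                costs[j] = cost(sources[j])
                trials[j] = 0
    return best[1], best[0]


# Toy usage: minimize a 2D sphere function inside [-5, 5] x [-5, 5].
x, value = abc_minimize(lambda v: sum(c * c for c in v), [(-5.0, 5.0)] * 2)
```

For feature selection, the cost function would score a candidate feature subset or weighting instead of the toy sphere function; how the paper encodes that objective is not stated in this section.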

Gesture Classification Using RNN
We used a recurrent neural network (RNN) on our optimized feature vectors to classify gestures [69]. An RNN is a deep neural network that has the ability to learn distributive and structured data. Therefore, it is ideal for our proposed system of classification. In an RNN, the last output is typically used as the input for the next layer with hidden states. For each timestamp $ts$, the activation $d_{ts}$ and the output $o_{ts}$ are defined as follows:

where $U_{dd}$, $U_{db}$, $U_{od}$, $g_d$, and $g_y$ are the coefficients shared temporally, and $k_1$, $k_2$ are activation functions. Figure 13 presents the overall flow of the RNN architecture.
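
A forward pass through a simple recurrent cell with the symbols named above can be sketched as follows. The recurrences used here, `d_ts = k1(U_dd d_{ts-1} + U_db b_ts + g_d)` and `o_ts = k2(U_od d_ts + g_y)`, are the standard Elman form and are an assumption, since the paper's equations are missing from this copy. The input symbol `b`, the choice of tanh for `k1` and softmax for `k2`, and the function names are likewise illustrative.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(z):
    top = max(z)                                  # subtract max for stability
    exps = [math.exp(v - top) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def rnn_forward(inputs, U_dd, U_db, U_od, g_d, g_y):
    """Run a simple recurrent cell over a sequence of feature vectors.

    The same coefficients are reused at every timestamp, which is what
    "shared temporally" means. Returns per-step outputs and the final state.
    """
    d = [0.0] * len(g_d)                          # initial hidden activation
    outputs = []
    for b in inputs:
        pre = [r + i + g for r, i, g in zip(matvec(U_dd, d), matvec(U_db, b), g_d)]
        d = [math.tanh(p) for p in pre]           # k1: hidden activation
        logits = [o + g for o, g in zip(matvec(U_od, d), g_y)]
        outputs.append(softmax(logits))           # k2: class probabilities
    return outputs, d
```

For gesture classification, one would typically read the class from the output at the last timestamp; whether the authors do this or aggregate over all timestamps is not stated here.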

Experimental Setup and Evaluation
Experiments were performed on a system with an Intel Core i7-9750H processor at 2.60 GHz and 16 GB of RAM, running 64-bit Windows 10. MATLAB and Google Colab were used to obtain the results. The system assessed the performance of the proposed architecture on four benchmark datasets: HaGRI, Egogesture, Jester, and WLASL. The k-fold cross-validation technique was applied to all three datasets to verify the reliability of our proposed system. This section includes a dataset description, the experiments performed, and a system comparison with other state-of-the-art systems.
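
As a reminder of how k-fold cross-validation partitions the data, the sketch below generates the train/test index splits. The number of folds is not given in this section, so `k=5` is only a placeholder, and the function name `k_fold_indices` and the seeded shuffle are assumptions for illustration.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)              # shuffle once, reproducibly
    folds = [idx[i::k] for i in range(k)]         # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

The reported accuracy would then be the mean over the k held-out folds.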

Jester Dataset
The Jester dataset [72] contains 148,092 video clips of pre-defined human hand gestures collected in front of cameras; it comprises 27 gestures. The video quality of the gestures is set to 100 pixels at 12 fps. Seven hand gestures are selected for system training and testing: sliding two fingers down, stop sign, swiping left, swiping right, turning the hand clockwise, turning the hand counterclockwise, and zooming in with two fingers. The example gestures of the Jester dataset are shown in Figure 16.

WLASL Dataset
The WLASL dataset has the largest number of videos of American Sign Language hand gestures [73]. It has a total of 2000 hand gesture classes. The dataset was created specifically for communication between the deaf and hearing communities. We used seven classes to test the validity of our proposed model on hand gesture recognition datasets: hungry, wish, scream, forgive, attention, appreciate, and abuse. The WLASL dataset sample images are shown in Figure 17.

Evaluation via Experimental Results
We evaluated the performance of our proposed system on the benchmark datasets, and the experiments demonstrated the system's efficiency. Tables 3-6 show the confusion matrices for the HaGRI, Egogesture, Jester, and WLASL datasets, with accuracies of 92.57%, 91.86%, 91.57%, and 90.43%, respectively. The experiments were repeated many times to evaluate the efficiency of the results. The HaGRI dataset yielded the highest accuracy because of its higher resolution, and its hand extraction showed better results than on the other datasets. Tables 7-10 report the per-gesture accuracy, precision, recall, and F1 score for the HaGRI, Egogesture, Jester, and WLASL datasets. This section also compares the selected classifier's accuracy to that of other conventional methods to demonstrate why it is preferred over other algorithms. Figure 18 compares the accuracy of the RNN with other state-of-the-art algorithms. Table 11 presents a comparison of our system with other conventional systems in the literature.
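
The per-class metrics reported in the evaluation tables follow directly from a confusion matrix. The sketch below shows the arithmetic under the assumption that rows are true classes and columns are predicted classes; the function name `per_class_metrics` and the 2 × 2 example matrix are made up for illustration and are not taken from the paper's tables.

```python
def per_class_metrics(cm):
    """Return overall accuracy and a (precision, recall, f1) tuple per class.

    `cm[r][c]` counts samples of true class r predicted as class c (assumed layout).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    report = []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                          # missed samples of class i
        fp = sum(cm[r][i] for r in range(n)) - tp     # others predicted as class i
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report.append((precision, recall, f1))
    return accuracy, report


# Illustrative two-class matrix, not data from the paper.
accuracy, report = per_class_metrics([[8, 2], [1, 9]])
print(accuracy)  # 0.85
```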

Discussion
The proposed hand gesture recognition system model is designed to achieve state-of-the-art performance over RGB images. Initially, images with a variety of gestures and complex backgrounds are used as inputs from benchmark datasets, such as HaGRI, Egogesture, and Jester. Our suggested two-way method is used to process the images provided for hand extraction. There were also some shortcomings in the proposed approach that prevented concealed information from being accurately obtained from the hand skeletons. Frames with no suitable camera angle made it difficult to acquire the exact key points of the hand. As presented in Figure 5a, the extreme key points are localized on the knuckles of the fingers due to the absence of the fingertips in the frame. The suggested system performed well on frames that initially presented the entire hand, followed by the movement of the hand. After the hand and skeleton extractions, the region of interest was passed through the fusion of features. The full-hand and one-point-based features were optimized and passed through the RNN for recognition. The RNN attained an accuracy of 92.57% on the HaGRI dataset, 91.86% on Egogesture, 91.57% on Jester, and 90.43% on WLASL.

Conclusions
This paper provides a novel way of recognizing gestures in a home automation system. Home appliances like TVs, washing machines, lights, cleaning robots, printers, stoves, etc. can be controlled using hand gestures. Our system proposed a way to fulfill the requirement of detecting hands from a complex background via six steps, namely noise removal, hand detection, hand skeleton, feature extraction, optimization, and classification. The hand gestures were first preprocessed using the adaptive median algorithm. Then, the hand detection was performed using the two-way method, and after that, the hand skeleton was extracted using SSMD. From the extracted hand and skeleton points, fusion features were extracted, namely joint color cloud, neural gas, and directional active model. The features were optimized using the artificial bee colony algorithm, which provided promising results for all four datasets. The accuracy attained using the HaGRI dataset was 92.57%; for Egogesture, it was 91.86%; Jester provided 91.57%; and WLASL showed 90.43%. The proposed system is for smart home automation, which was designed using different techniques. It provides a set of features for recognition, rather than conventional features, using only deep learning methods.
The proposed system needs to be trained with more gestures, and various experiments can be performed in different environments like healthcare, robotics, sports, and industries. The computation time also needs to be considered to reduce the complexity of the system.

Figure 1. Architecture of the proposed system for hand gesture recognition.

Figure 2. Sequential model representation for adaptive median filter algorithm.

Figure 4. Hand detection using saliency map on the gestures (a) stop and (b) ok.

Figure 5. Hand skeleton mapping presenting palm and finger extreme points over gestures: (a) call; (b) stop; (c) two up.

Figure 8. Direction representation of eight Freeman chain codes.

Figure 9. Flow sheet of point extraction in a directionally active model.

Figure 11. Flowchart of a working model for ABCA.

Figure 12. Angle fitness along the number of iterations over the "call" gesture.

Dataset Descriptions

HaGRI Dataset
The HaGRI dataset [70] is specially designed for home automation, the automotive sector, and video conferencing. It consists of 552,992 RGB frames with 18 different gestures. The dataset includes 34,730 subjects who performed gestures with different backgrounds. The subjects were aged between 18 and 65 years old. The gestures were performed indoors with different light intensities. The gestures used in our experiments were call, dislike, like, mute, ok, stop, and two up. Figure 14 presents the gestures from the HaGRI dataset.

Egogesture Dataset
The Egogesture dataset [71] contains 2081 RGB videos and 2,953,224 frames with 83 different static and dynamic gestures. The gestures contain indoor and outdoor scenes. For our system training and testing, we selected seven dynamic gesture classes: scroll hand towards the right, scroll hand downward, scroll hand backward, zoom in with fists, zoom out with fists, rotate finger clockwise, and zoom in with fingers. The dataset samples with different gestures and different backgrounds are presented in Figure 15.

Figure 14. Example gesture frames from the HaGRI dataset.

Figure 15. Example gesture frames from the Egogesture dataset.

Figure 16. Example gesture frames from the Jester dataset.

Figure 17. Example gesture frames from the WLASL dataset.

Figure 18. Accuracy comparison of RNN with other state-of-the-art algorithms.

Table 2. Related work for hand gesture recognition using marker sensors.

Table 3. Confusion matrix for gesture classification by the proposed approach using the HaGRI dataset.

Table 4. Confusion matrix for gesture classification by the proposed approach using the Egogesture dataset.

Table 5. Confusion matrix for gesture classification by the proposed approach using the Jester dataset.

Table 6. Confusion matrix for gesture classification by the proposed approach using the WLASL dataset.

Table 7. Performance evaluation of the proposed approach using the HaGRI dataset.

Table 8. Performance evaluation of the proposed approach using the Egogesture dataset.

Table 9. Performance evaluation of the proposed approach using the Jester dataset.

Table 10. Performance evaluation of the proposed approach using the WLASL dataset.

Table 11. Comparison of the proposed method with conventional systems.