Real-Time Hand Gesture Recognition Based on Deep Learning YOLOv3 Model

Using gestures can help people with certain disabilities communicate with other people. This paper proposes a lightweight model based on YOLO (You Only Look Once) v3 and DarkNet-53 convolutional neural networks for gesture recognition without additional preprocessing, image filtering, or image enhancement. The proposed model achieved high accuracy even in a complex environment, and it successfully detected gestures even in low-resolution picture mode. The proposed model was evaluated on a labeled dataset of hand gestures in both Pascal VOC and YOLO format. By extracting features from the hand, our YOLOv3-based model recognized hand gestures with an accuracy, precision, recall, and F-1 score of 97.68, 94.88, 98.66, and 96.70%, respectively. Further, we compared our model with Single Shot Detector (SSD) and Visual Geometry Group (VGG16), which achieved an accuracy between 82 and 85%. The trained model can be used for real-time detection, both for static hand images and dynamic gestures recorded on a video.


Introduction
The interaction between humans and computers has increased widely, and the domain is witnessing continuous development, with new methods derived and techniques discovered. Hand gesture recognition is one of the most advanced domains in which computer vision and artificial intelligence have helped not only to improve communication with deaf people but also to support gesture-based signaling systems [1,2]. Subdomains of hand gesture recognition include sign language recognition [3][4][5], recognition of special signal language used in sports [6], human action recognition [7], pose and posture detection [8,9], physical exercise monitoring [10], and controlling smart home/assisted living applications with hand gesture recognition [11].
Over the years, computer scientists have used different computation algorithms and methods to help solve our problems while easing our lives [12]. The use of hand gestures in different software applications has contributed towards improving computer and human interaction [13]. The progress of the gesture recognition systems plays a vital role in the
• A lightweight proposed model that does not need much preprocessing, such as filtering and enhancement of images;
• A labeled dataset in both Pascal VOC and YOLO format;
• This is the first gesture recognition model that is dedicated to the mentioned gestures using YOLOv3. We use YOLOv3 as it is faster, stronger, and more reliable compared to other deep learning models. By using YOLOv3, our hand gesture recognition system has achieved a high accuracy even in a complex environment, and it successfully detected gestures even in low-resolution picture mode;
• The trained model can be used for real-time detection: it can be used for static hand images, and it can also detect gestures from a video feed.
The organization of our study is as follows: Section 2 presents the related work and reviews the latest studies on hand gesture recognition. The materials and methods that were used in this study are described in Section 3. Section 4 presents the results of the proposed real-time hand gesture recognition model based on YOLOv3. These results are discussed in Section 5. Finally, Section 6 concludes the study and outlines directions for future work.

Related Work
There are various studies on hand gesture recognition, as the area is expanding rapidly, and there are multiple implementations involving both machine learning and deep learning methods that aim to recognize a gesture made by a human hand. In the following, some papers are reviewed to understand the mechanisms of hand gesture recognition techniques.
The study [28] demonstrated that a CNN, one of the common and well-known architectures, achieved higher recognition rates at negligible computational cost. The proposed method focused only on gestures present in static images, without hand localization, and handled cases of hand occlusion with 24 gestures. It used a backpropagation algorithm and a segmentation algorithm to train the multi-layer network, with backpropagation passing the error through the network in reverse order [29]. Moreover, it used a convolutional neural network, and the filtering included a distinct stage of image segmentation and recognition.
Among these techniques, a popular method used by several detection applications is the Hidden Markov Model (HMM). Applications commonly rely on different detection variants, and the application described in this paper considers all of them, drawing on other papers. This application deals with all three media: image, video, and webcam [30]. A general paper by Francois describes an application that detects posture from video using the HMM. The process that Francois and other researchers worked on is related to the application in this paper, which first extracts information from an image, a video, or real-time webcam detection. Features are detected by the three methods mentioned, but the common point of all methods, whether CNN-related, RNN-related, or based on any other technique, is that they use fitting techniques, and these techniques refer to the bounding box discussed in this paper. The box represents the detected information, from which a confidence value is obtained, and the class with the highest value is output as the content of the displayed image. Beyond this, the remaining details vary with the method used; nevertheless, other devices and techniques related to segmentation, general localization, and even fusion of different partial results help achieve the tasks of detection and recognition.
Nyirarugira et al. [31] proposed social touch gesture recognition using a CNN. They used various algorithms, such as Random Forest (RF), boosting algorithms, and the Decision Tree algorithm, to recognize gestures, using 600 sets of palmar images with 30 examples and a convolutional neural network, and searched for an optimal method on the basis of an 8 × 8 grid. The frame length is variable, so the result determines the optimal frame length. The method was evaluated on a recently assessed dataset in which different subjects perform varying social gestures [32]. A system that recognizes touch gestures in nearly real time using a deep neural network is a useful contribution. The results showed that their method performed better than previous work, which relied on leave-one-subject-out cross-validation on the CoST dataset. The proposed approach presents two advantages over the existing literature, obtaining an accuracy of 66.2% when recognizing social gestures.
In the study of multiscale CNNs for hand detection, motivated by the progress of object recognition in the field of computer vision, numerous techniques have been proposed for hand detection in the last decade [33]. The simplest technique relies on the detection of skin tone, which works on hands, faces, and arms, but also has issues because of its sensitivity to illumination changes. The factors contributing to the difficulty of the task include heavy occlusion, low resolution, varying illumination conditions, diverse hand gestures, and complex interactions between hands and objects or between multiple hands. The authors further presented a Multiscale Fast R-CNN approach to accurately detect human hands in unconstrained images. By merging multi-level convolutional features, the CNN model can achieve better results than the standard VGG16 model, reaching nearly 85% accuracy with 5500 images in the testing set and 5500 in the training set.
Saqib et al. [34] used a CNN model augmented by edit distance for the recognition of static and dynamic gestures of Pakistani sign language, and achieved 90.79% accuracy. Al-Hammadi et al. [35] proposed a 3DCNN model to learn region-based spatiotemporal features for hand gestures. The fusion techniques to globalize the local features learned by the 3DCNN model were used to improve the performance. The approach obtained recognition rates of 98.12, 100, and 76.67% on three color video gesture datasets.
Do et al. [36] proposed a multi-level feature LSTM with Conv1D, the Conv2D pyramid, and the LSTM block. The proposed method exploited skeletal point-cloud features from skeletal data, as well as depth shape features from the hand component segmentation model. The method achieved accuracies of 96.07 and 94.40% on the Dynamic Hand Gesture Recognition (DHG) dataset with 14 and 28 classes, respectively. The study extracted diversity of dynamic hand gestures from 14 depth and 28 skeletal data through the LSTM model with two pyramid convolutional blocks. The accuracy of 18 classes was 94.40%. Elboushaki et al. [37] learned high-level gesture representations by using Convolutional Residual Networks (ResNets) for learning the spatiotemporal features from color images, and Convolutional Long Short-Term Memory Networks (ConvLSTM) to capture the temporal dependencies between them. A two-stream architecture based on 2D-ResNets was then adopted to extract deep features from gesture representations.
Peng et al. [38] combined a feature fusion network with a ConvLSTM network to extract spatiotemporal feature information from local, global, and deep aspects. Local feature information was acquired from videos by a 3D residual network, while the ConvLSTM network learned the global spatiotemporal information of a dynamic gesture. The proposed approach obtained 95.59% accuracy on the Jester dataset, and 99.65% accuracy on the SKIG (Sheffield Kinect Gesture) dataset. Tan et al. [39] proposed an enhanced, densely connected convolutional neural network (EDenseNet) for hand gesture recognition. The method achieved 99.64% average accuracy on three hand gesture datasets. Tran et al. [40] suggested a 3D convolution neural network (3DCNN) that could extract fingertip locations and recognize hand gestures in real-time. The 3DCNN model achieved 92.6% accuracy on a dataset of videos with seven hand gestures.
Rahim et al. [41] analyzed the translation of the gesture of a sign word into text. The authors of this paper performed the skinMask segmentation to extract features along the CNN. Having a dataset of 11 gestures from a single hand and 9 from double hands, the support vector machine (SVM) was applied to classify the gestures of the signs with an accuracy of 97.28%. Mambou et al. [42] analyzed hand gestures associated with sexual assault from indoor and outdoor scenes at night. The gesture recognition system was implemented with the combination of the YOLO CNN architecture, which extracted hand gestures, and a classification stage of bounding box images, which lastly generated the assault alert. Overall, the network model was not lightweight and had a lower accuracy.
Ashiquzzaman et al. [43] proposed a compact spatial pyramid pooling (SPP) CNN model for decoding gestures or finger-spelling from videos. The model used 65% fewer parameters than traditional classifiers and worked 3× faster than classical models. Benitez-Garcia et al. [44] employed a lightweight semantic segmentation FASSD-Net network, which was improved over Temporal Segment Networks (TSN) and Temporal Shift Modules (TSM). They demonstrated the efficiency of the proposal on a dataset of thirteen gestures focused on interaction with touchless screens in real-time.
In summary, the biggest challenge faced by researchers is designing a robust hand gesture recognition framework that overcomes the most typical problems with fewer limitations and gives an accurate and reliable result. Real-time processing of hand gestures also has some limitations, such as illumination variation, background problems, distance range, and multi-gesture problems. There are approaches to hand gesture recognition that use non-machine-learning algorithms, but their accuracy varies, and under different environmental conditions, such as lighting, one gesture can be confused with another, which makes these approaches less flexible and unable to adapt independently when compared to the machine-learning approach. Therefore, the machine-learning approach was used for developing the system.

Material and Methods
This section is dedicated to the materials and the methods that were used in this study to achieve the gesture recognition that this paper aimed for. Section 3.1 explains the dataset and all the information related to the material. Section 3.2 deals with the algorithm and the methods that we used to solve the problem.

Dataset
Our dataset consisted of 216 images. These images were further classified into 5 different sets. Each set held an average of 42 images, which were labeled using the YOLO labeling format. The dataset was labeled using an open-source labeling tool for custom datasets. We used the YOLO format, which stores labels in text files and holds information such as the class ID and the class to which it belongs. Class IDs ranged from 0 to 4, where class ID 0 is labeled as 1 and class ID 4 is labeled as 5. There were a total of 5 sets that our application detected, which were finger-pointing positions of 1, 2, 3, 4, and 5. Figure 1 displays the hand gestures in our collected dataset.

Data Preprocessing
Data preprocessing is an important step before the training and testing phases. We used a YOLO configuration with a total of 200+ images in the dataset. Of these, 30 were augmented with affine transformations, doubling their number: each image was duplicated by flipping it horizontally so that both left and right hands were represented, and in some cases real images of the other hand were taken to make the set more accurate. Furthermore, an additional 15 images were taken and labeled for the testing set. The data preprocessing step is important before moving towards post-processing, because we have to look at what type of data we have collected and which part of it will be useful for training, testing, and obtaining better accuracy. Table 1 shows the instances of five classes with the features of the YOLO-labeled data.
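The horizontal flip described above also changes the labels, because the box center moves to the mirrored position. The following is a minimal sketch of that label update, assuming the common YOLO layout (class ID, then center x, center y, width, and height, all normalized to the range 0 to 1). The function name and the sample label are hypothetical and are not taken from the study.

```python
def flip_yolo_label_horizontally(line: str) -> str:
    """Mirror one YOLO label line left-to-right.

    Assumes the layout: class_id x_center y_center width height,
    with the four box values normalized to [0, 1].
    """
    class_id, x_center, y_center, width, height = line.split()
    # Only the horizontal center moves; the size and the vertical
    # position of the box are unchanged by a horizontal flip.
    flipped_x = 1.0 - float(x_center)
    return (
        f"{class_id} {flipped_x:.6f} {float(y_center):.6f} "
        f"{float(width):.6f} {float(height):.6f}"
    )


# Hypothetical label: class 2, box centered at 30% of the image width.
print(flip_yolo_label_horizontally("2 0.300000 0.500000 0.200000 0.400000"))
```

The image itself would be mirrored separately, for example with an image library; only the label arithmetic is shown here.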

The above example shows how the label files look once the dataset has been labeled for training on the desired model. Each line contains 5 attributes, and each of them matters. From left to right, we have the class ID, followed by the x-axis and y-axis coordinates of the labeled box of a gesture, followed by the width and height of the annotated box. Figure 2 represents the correlation heat map of the YOLO-labeled dataset. Here, we can see the diversity of values among different labels, explaining the concentration and depth of our images represented in the form of a multi-dimensional matrix.
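To make the five attributes concrete, the sketch below parses one label line and converts the box to pixel corners (xmin, ymin, xmax, ymax), which is the corner form used by Pascal VOC annotations. It assumes that the x and y values are the normalized box center, as in the common YOLO convention; the names `YoloLabel`, `parse_yolo_line`, and `to_pixel_corners`, and the sample values, are illustrative only.

```python
from typing import NamedTuple, Tuple


class YoloLabel(NamedTuple):
    class_id: int
    x_center: float  # normalized by image width
    y_center: float  # normalized by image height
    width: float     # normalized by image width
    height: float    # normalized by image height


def parse_yolo_line(line: str) -> YoloLabel:
    """Parse one 'class_id x_center y_center width height' line."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return YoloLabel(int(parts[0]), *(float(p) for p in parts[1:]))


def to_pixel_corners(label: YoloLabel, img_w: int, img_h: int) -> Tuple[int, int, int, int]:
    """Convert a normalized center/size box to (xmin, ymin, xmax, ymax) in pixels."""
    xmin = (label.x_center - label.width / 2) * img_w
    ymin = (label.y_center - label.height / 2) * img_h
    xmax = (label.x_center + label.width / 2) * img_w
    ymax = (label.y_center + label.height / 2) * img_h
    return round(xmin), round(ymin), round(xmax), round(ymax)


# Hypothetical label on a 416 x 416 image.
label = parse_yolo_line("0 0.5 0.5 0.25 0.5")
print(label.class_id, to_pixel_corners(label, 416, 416))
```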

Proposed Method
Figure 3 gives a general overview of the application that we have developed and of how it detects objects. The training process first requires collecting the dataset. The next step is to label it, so we use YOLO annotation to label our data, which gives us some values that are explained later in the model process. Once the data is labeled, we feed it to the DarkNet-53 model, which is trained according to our defined configuration. The image is captured through a camera, which can be an integrated (primary) camera or any external (secondary) camera. The application can also detect gestures from a video input.

After capturing real-time objects with the help of the OpenCV [45] module, which is an open-source computer vision library, we send the captured images frame by frame to the model. Because of incorrect filtering, our dataset of collected images currently has a variable number of images. These images are labeled according to the classes we have stored for our YOLO algorithm, so we obtain the coordinates and the class for each image in our set. We can then move on to training: we pass this set to our training algorithm, which is the deep neural network model YOLO.
Further, we discuss how YOLO produces the desired network output, which is obtained with a formula that takes the quantities pw, ph, tx, ty, tw, th, cx, and cy. These are the variables used for the bounding box dimensions. The values of the bounding box (x-axis, y-axis, width, and height) are given by Equation (1):

\[
\begin{aligned}
b_x &= \sigma(t_x) + c_x, \\
b_y &= \sigma(t_y) + c_y, \\
b_w &= p_w \, e^{t_w}, \\
b_h &= p_h \, e^{t_h},
\end{aligned}
\tag{1}
\]

where bx, by, bw, and bh are the box prediction components: bx and by are the center coordinates, and bw and bh are the width and height of the bounding box. In the standard YOLOv3 formulation, tx, ty, tw, and th are the raw network outputs, cx and cy are the offsets of the grid cell from the top-left corner of the image, pw and ph are the width and height of the anchor (prior) box, and σ is the sigmoid function. Equation (1), used in the YOLOv3 algorithm, shows how the values of the bounding box are extracted from the image, and the diagram below shows how these values are obtained. From Figure 3, we can understand how each value of the bounding box from the algorithm provides us with the coordinates of the center, x and y. The next important element of the prediction is the sigmoid function, which keeps the predicted center offsets between 0 and 1, so that the prediction stays focused on the main part that is going to be recognized.
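The box computation can be sketched in a few lines. The example assumes that Equation (1) is the standard YOLOv3 decoding, in which the sigmoid of tx and ty is added to the grid-cell offsets cx and cy, and the anchor sizes pw and ph are scaled by the exponentials of tw and th. The function name `decode_box` and the sample numbers are illustrative, not taken from the study.

```python
import math
from typing import Tuple


def sigmoid(value: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def decode_box(
    tx: float, ty: float, tw: float, th: float,
    cx: float, cy: float, pw: float, ph: float,
) -> Tuple[float, float, float, float]:
    """Turn raw outputs (tx, ty, tw, th) into a box (bx, by, bw, bh).

    cx, cy: grid-cell offsets; pw, ph: anchor (prior) width and height.
    All values are in grid-cell units under this assumption.
    """
    bx = sigmoid(tx) + cx     # center stays inside the cell at (cx, cy)
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)    # anchor width scaled by the prediction
    bh = ph * math.exp(th)    # anchor height scaled by the prediction
    return bx, by, bw, bh


# With all raw outputs at zero, the box sits in the middle of cell (3, 4)
# and keeps the anchor size unchanged.
print(decode_box(0.0, 0.0, 0.0, 0.0, cx=3, cy=4, pw=2.0, ph=5.0))
```

Multiplying the result by the stride of the feature map would give pixel coordinates; that step is omitted here.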
From Figure 4, we can understand the backend of the algorithm. An input first enters the input layer and is then passed through the hidden layers, which are several interconnected convolutional layers of different sizes. These convolutional layers determine a special value, which is the confidence value. In our case, if the confidence value is greater than the threshold of 0.5, we assume that the application has successfully determined what object it has encountered.
Figure 5 explains how the YOLO algorithm plays its part once it acquires the image with the help of the OpenCV module. The image is passed to the YOLO network, which identifies the required target. It is then sent forward to the feature map prediction block, which extracts the information required to identify the gesture. After prediction, the result is sent to the decoding part, where the predicted output is mapped onto the image and then displayed, as shown in Figure 5.
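The thresholding step can be illustrated as follows. This is a simplified sketch, assuming each detection has already been reduced to a pair of class ID and confidence value; it applies the 0.5 threshold from the text and maps class IDs 0 to 4 to the finger counts 1 to 5 described in the Dataset section. The function name `pick_gesture` and the sample detections are hypothetical.

```python
from typing import List, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.5  # threshold value stated in the text

# Class IDs 0..4 correspond to finger counts 1..5.
CLASS_NAMES = ["1", "2", "3", "4", "5"]


def pick_gesture(detections: List[Tuple[int, float]]) -> Optional[Tuple[str, float]]:
    """Return (label, confidence) of the most confident detection.

    Detections at or below the threshold are discarded; None is
    returned when nothing passes.
    """
    confident = [d for d in detections if d[1] > CONFIDENCE_THRESHOLD]
    if not confident:
        return None
    class_id, confidence = max(confident, key=lambda d: d[1])
    return CLASS_NAMES[class_id], confidence


# Hypothetical detections for one frame: (class_id, confidence).
print(pick_gesture([(0, 0.42), (2, 0.91), (4, 0.63)]))
```

A full detector would typically also apply non-maximum suppression to overlapping boxes before this step.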

Implementation
We changed the configuration of YOLO and defined the activation according to the stride and the pad. For the YOLOv3 model, we set the mask to 0.5. The learning rate was set to 0.001, and the value of jitter was set to 0.3.
For the exploratory study of the best alternatives for implementing the gesture recognition system, we also used some other models to check which algorithm works best for gesture detection; these are discussed in the next section.
The developed neural network model is summarized in Table 2. The network is a typical CNN: it has convolutional layers, a max-pooling layer, and a simple output layer. The general architecture of the application, which includes both the training and testing set, can be seen in Figure 6.

Results
In this section, we present, discuss, and evaluate our results.

Environmental Setup
To carry out this experiment, Python 3.7 was used to train the algorithm on a personal computer with a local 4 GB GPU. Other important parameters are presented in Table 3. The training of the neural network model on our dataset took more than 26 h on a GPU of 24 GB.
Figure 7 shows the output of the developed application that performs real-time hand gesture recognition. The bounding box is somewhat large because the input was intentionally left at 416 × 416 to cover the maximum area; unnecessary information was then removed from the image so that we could obtain a larger coverage area. This also helps in the zoom case, when the object is too large: the model still identifies which gesture is being represented and returns the class ID with the best match.

Results and Performance of YOLOv3
Figure 8 shows the training performance of the proposed deep learning model. The results are 98% correct, as measured by experiments. The experiment is in real time because we trained our model with YOLO and Pascal VOC [46] configurations and tested it with live images. We added the YOLO-annotated labels into a CSV file and re-ran training tests to confirm the accuracy of the model by plotting the curve, as we can find the accuracy of an individual gesture in the real-time experiment. Faults were seen in the transitions from one gesture to another; because the application runs in real time, the system can detect dynamic objects too, as shown by our experiments.
We used several different algorithms to test which was the best method for the gesture recognition application. We compared YOLOv3 [47] with other deep learning models (VGG16 [48] and SSD [49]). In the end, YOLOv3 produced the best results and was chosen, with an accuracy of 97.68% during training and 96.2% during testing. DarkNet-53 was used to train the dataset, and for the detection, YOLOv3 was used. In real-time experiments it is not possible to replicate the gestures exactly, so in order to determine the accuracy of the model, we generated a CSV file of a few YOLO-annotated labels and re-ran the code with some modifications for the accuracy curve. Table 4 explains the accuracy of the individual models, where stochastic gradient descent gave the lowest value because it could not identify moving objects correctly in real time, compared to the other algorithms. We evaluated the performance of deep learning models using the precision, recall, F-1 score, and accuracy measures, achieving an accuracy of 97.68% for YOLOv3. Table 5 shows the comparison of the state-of-the-art works of Chen et al. [28], Nyirarugira et al. [31], Albawi et al. [32], Fong et al. [50], Yan et al. [51], and Ren et al. [52] with our proposed model, which produced better results with higher accuracy.
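For reference, the four evaluation measures can be computed from the counts of true positives, false positives, false negatives, and true negatives. The sketch below uses the standard definitions; the counts in the example are made up for illustration and are not the counts behind the reported results.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute precision, recall, F-1 score, and accuracy from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts, for illustration only.
for name, value in classification_metrics(tp=90, fp=10, fn=5, tn=95).items():
    print(f"{name}: {value:.4f}")
```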

Discussion
Real-time hand gesture recognition based on deep learning models plays a critical role in many applications, as it is one of the most advanced domains in which computer vision and artificial intelligence methods have helped to improve communication with deaf people and to support gesture-based signaling systems. In this study, we experimented with hand gesture recognition using YOLOv3 and DarkNet-53 deep learning network models [47]. We collected and labeled the dataset ourselves and trained the model on it. We compared the performance of YOLOv3 with several other state-of-the-art algorithms, as can be seen in Table 5. YOLOv3 achieved better results than the other state-of-the-art algorithms. However, our proposed approach was not tested on the YOLO-LITE model. The YOLOv3 model was trained on a YOLO-labeled dataset with DarkNet-53. YOLOv3 produces tighter bounding boxes and is generally more accurate than YOLO-LITE [53].
Moreover, the aim of our study was not to apply real-time hand gesture recognition on devices with limited computing power, since we collected hand images of different sizes and used different angles for diversity. Hence, we focused on recognition performance rather than on the speed criteria of YOLO-LITE [54]. Complex applications such as communication with deaf people, gesture-based signaling systems [55], sign language recognition [4], special signal languages used in sports [56], human action recognition [7], posture detection [57], physical exercise monitoring [10], and controlling smart homes for assisted living [58] are where GPUs could perform better through YOLOv3.
In future work, we can apply mixed YOLOv3-LITE [59] to our dataset and to new datasets of the numbers 1-10, for all kinds of applications on GPU and non-GPU based computers, to achieve real-time object detection precisely and quickly. Furthermore, we can enhance our images through oversampling and real-time augmentation [60] as well. The major contribution is the dataset: no such dataset is available in YOLO-labeled format, and research on YOLOv3 and later versions specifically requires YOLO-labeled data. With our contribution, such a dataset will be readily available for future research and improvement; because the domain of hand gesture recognition is very wide, there is a need for a dataset that is readily available in the YOLO format.

Conclusions
In this paper, we have proposed a lightweight model based on the YOLOv3 and DarkNet-53 deep learning models for hand gesture recognition. The developed hand gesture recognition system detects both real-time objects and gestures from video frames with an accuracy of 97.68%. Despite the accuracy obtained, there is still room for improvement, as the proposed model currently detects static gestures. The model can be improved to detect multiple gestures, that is, more than one gesture at a time. The proposed method can be used for improving assisted living systems, which are used for human-computer interaction both by healthy and impaired people. Additionally, we compared the performance and execution of the YOLOv3 model with different methods, and our proposed method achieved better results by extracting features from the hand, recognizing hand gestures with an accuracy, precision, recall, and F-1 score of 97.68, 94.88, 98.66, and 96.70%, respectively. For future work, we will focus on hybrid methods with smart mobile applications or robotics in different scenarios. Furthermore, we will design a more advanced convolutional neural network with data fusion, inspired by recent works [61][62][63], to enhance the precision of hand gesture recognition.