American Sign Language Alphabet Recognition by Extracting Feature from Hand Pose Estimation

Sign language is designed to assist the deaf and hard of hearing community to convey messages and connect with society. Sign language recognition has been an important domain of research for a long time. Previously, sensor-based approaches have obtained higher accuracy than vision-based approaches. Due to the cost-effectiveness of vision-based approaches, research has also been conducted in this area despite the drop in accuracy. The purpose of this research is to recognize American sign characters using hand images obtained from a web camera. In this work, the MediaPipe Hands algorithm was used for estimating hand joints from RGB images of hands obtained from a web camera, and two types of features were generated from the estimated joint coordinates for classification: one is the distances between the joint points, and the other is the angles between vectors and the 3D axes. The classifiers utilized to classify the characters were the support vector machine (SVM) and the light gradient boosting machine (GBM). Three character datasets were used for recognition: the ASL Alphabet dataset, the Massey dataset, and the finger spelling A dataset. The accuracies obtained were 99.39% for the Massey dataset, 87.60% for the ASL Alphabet dataset, and 98.45% for the finger spelling A dataset. The proposed design for automatic American sign language recognition is cost-effective, computationally inexpensive, does not require any special sensors or devices, and has outperformed previous studies.


Introduction
Sign language is a form of communication that utilizes visual-manual methodologies, such as expressions, hand gestures, and body movements, to interact within the deaf and hard of hearing community, express opinions, and convey meaningful conversations [1]. The term deaf and hard of hearing is employed to identify a person who is either deaf, unable to speak an oral language, or has some level of speaking ability but prefers not to speak in order to avoid the negative or undesired attention that atypical voices sometimes attract.
Deafness is often described as hearing loss or impairment, a total or partial inability to hear that may affect one or both ears of an individual [2,3]. The main causes of hearing loss include aging, genetics, noise exposure, a variety of infections such as chronic ear infections, and certain toxins or medications [2]. Hearing loss can be diagnosed when a person is unable to hear 25 decibels in at least one ear in a hearing test [2], and this test is recommended for all newborn children [4]. Hearing loss can be classified as mild (25-40 decibels), moderate (41-55 decibels), moderate-severe (56-70 decibels), severe (71-90 decibels), and profound (greater than 90 decibels) [2]. As of 2015, approximately 1.33 billion people had been affected by hearing impairment to some extent, covering 18.5% of the world's overall population.
The proposed system offers advantages such as the estimation of 3D information from 2D images and contactless, affordable recognition. At the same time, it avoids the weaknesses of both sensor-based and vision-based approaches, such as the use of expensive devices and costly cameras, high computational complexity, and lower accuracy. Since this study uses a webcam, similar studies that classify characters from RGB cameras or images will be used for comparison. One of these studies reported a very high accuracy of 99.31% on the Massey dataset [26], and this result serves as one benchmark for the present work.
Similar to natural languages, sign language also holds specific grammar and vocabulary [27]. However, despite having similarities and notable connections, sign languages all over the world are not widely the same and not mutually recognized [27]. Depending on the community, the corresponding sign language also differs in terms of gestures. In this research, American sign language has been considered as it is utilized by the American and Canadian deaf community consisting of approximately 250,000 to 500,000 Americans and some Canadians [28].

Literature Review
In this section, related works will be discussed, considering both sensor-based and vision-based approaches. Research on hand tracking and hand pose recognition has also been discussed here, as sign language recognition is an application of hand pose recognition.
One recent study suggested a novel approach for textual input in which the authors conducted air-writing recognition using smart bands [29]. In [29], the authors proposed a user-dependent method based on k-nearest neighbors (KNN) with dynamic time warping (DTW) as the distance measure and a user-independent method based on a convolutional neural network (CNN), which achieved 89.2% and 83.2% average accuracy, respectively. Apart from smart bands, Kinect sensors have long been utilized by researchers. Earlier research suggested that up to a 98.9% average recognition rate can be achieved by capturing the letters with Kinect sensors and recognizing them using dynamic programming (DP) matching based on inter-stroke information [30]. However, mastering the technique of writing in the air and the usage of Kinect sensors requires specialized training, experience, and a suitable environment with the necessary equipment. Another earlier work suggested a similar approach in which the authors captured alphanumeric characters written in the air through a video camera instead of Kinect sensors; further experimentation with dynamic programming matching revealed an overall accuracy of 75% [31]. The main limitations of this study were determining the starting and ending points of the input and extracting the user's hand region in each picture frame.
A more recent study proposed an RGB and RGB-D static gesture recognition mechanism that utilized a fine-tuned VGG19 model after capturing the gestures with Kinect sensors and reported a recognition rate of 94.8% [32]. Following the successful recognition of hand gestures, these techniques were soon adapted for sign language alphabet recognition. Microsoft Kinect has been utilized for American sign language recognition, where the authors proposed a random forest classifier on segmented hand configurations and obtained 90% accuracy [33]. A recent study utilized InceptionV3, a convolutional neural network model, to obtain 90% validation accuracy on an American sign language dataset containing 24 characters from the American sign alphabet [34]. One of the recent works on American sign language recognition proposed a restricted Boltzmann machine (RBM) fusing mechanism and reported 99.31%, 97.56%, 90.01%, and 98.13% recognition accuracy for the Massey dataset, the ASL finger spelling A dataset, the NYU dataset, and the ASL finger spelling dataset of Surrey University, respectively [26].
Another popular way of hand gesture recognition is via leap motion. Recent work on British sign language recognition suggested a multimodality approach by fusing two artificial neural networks (ANN) and 94.44% overall accuracy was reported by utilizing the leap motion [35]. Leap motion has also been utilized recently for the recognition of American sign language gestures in a virtual reality environment and the authors reported a mean accuracy of 86.1% [36]. In [36], the authors have utilized the data from the leap motion device and hidden Markov classifier (HMC) was utilized for the recognition process. In another work, the authors used a leap motion controller and convolutional neural network to achieve 80.1% accuracy [37]. Moreover, the leap motion controller with support vector machine (SVM) and the deep neural network (DNN) has been applied on 36 American signs beforehand with a reported accuracy of 72.79% and 88.79%, respectively [38].
Apart from the above-mentioned approaches, some other schemes for the recognition of American sign language have been proposed. A work on static American signs utilized a skin-color modeling technique and a convolutional neural network to achieve 93.67% accuracy [39]. Another research effort utilized a deep neural network on RGB images with a SqueezeNet architecture to make it suitable for mobile devices and achieved an overall accuracy of 83.28% [40]. Skeletal data and distance descriptors with TreeBag and neural network (NN) classifiers have been utilized to achieve 90.7% accuracy [41]. Another work proposed a recognition system for the sign language alphabet that utilizes geometrical features with an artificial neural network and achieved 96.78% accuracy [42]. Besides, neuromorphic sensors with an artificial neural network have previously reported 79.58% accuracy for 24 American signs [43]. Furthermore, a convolutional neural network with multiview augmentation and inference fusion has been used to achieve 93% accuracy [44]. Table 1 presents the related works with their corresponding approach, classifiers, and recognition rate for a better understanding. It can be observed that the sensor-based approaches have achieved higher accuracy although they are costly. Additionally, some vision-based approaches have utilized CNNs to achieve relatively higher accuracy; however, in such cases, the computational complexity has increased considerably as well.

Materials and Methods
In this section, first, the details of the datasets have been discussed. After that, the hand pose estimation, the distance-based and angle-based features, and the two classification methods (SVM and light GBM) have been described.

Dataset Description
American sign language, popularly known as ASL [45], is a sign language used in English-speaking countries such as the United States and Canada. It consists of the 26 letters of the alphabet from A to Z, which can be expressed with one hand, as illustrated in Figure 1. In this study, a total of three datasets have been utilized. First, the ASL alphabet dataset from Kaggle [46] has been used for character recognition to evaluate performance on more difficult data. The Massey dataset [47] has been utilized to compare the obtained results with previous studies and has produced the best recognition rate. In addition, the finger spelling A dataset [48] has been used in this study. Figure 2 shows similar samples from all three datasets for a better understanding of the similarity and complexity of the three considered datasets.

ASL Alphabet Dataset
The first dataset used in this study is the ASL Alphabet dataset [46], which contains the letters A to Z. In Figure 2a, it can be observed that the dataset contains images that are difficult to distinguish, making it a very challenging dataset. The experimental analysis will later show that the proposed methodology works decently even on this difficult dataset. There are a total of 78,000 images in the dataset, with 3000 samples per class.

Massey Dataset
Previously, researchers focused on the Massey dataset [47] to report classification accuracy. Hence, in this study, the Massey dataset has also been considered for a fair comparison with previous work. The dataset contains all 26 letters of the American sign alphabet. However, it is relatively easy to perform sign alphabet recognition on the Massey dataset, as the areas other than the hand in the images are black and the hand is shown clearly. The dataset contains a total of 1815 images. Apart from the 65 samples of class T, all the other 25 classes have 70 samples each.

Finger Spelling A Dataset
The finger spelling A [48] dataset is another popular dataset that has been considered in this study. From Figure 2c, it can be observed that the dataset is characterized by less clear image quality of the hand than the Massey dataset. In addition, this dataset has both RGB and depth images; however, only the RGB images have been used in this study. There are a total of 24 characters in this dataset. The authors of the dataset decided not to include J and Z, as they are motion-based signs and the study was about static signs. There are a total of 65,774 images, and the number of images per class varies from 2615 to 3108.

Feature Extraction
Feature extraction has been used to recognize the ASL alphabet in this study. The hand pose estimator provides 21 joint coordinates in 3D space, containing values on the X, Y, and Z axes, and these coordinates have been utilized to extract new features. This is because some problems may arise if the coordinates are used as they are. For example, if the hand is at the right edge of the camera frame, the output values will differ from those of a hand at the left edge, even if both hands form the same sign. Therefore, features that are not affected by the location on the screen are needed. In addition, some signs in American sign language have the same hand shape but represent different characters depending on the degree of tilt; therefore, features that work effectively even in those cases must be extracted. In this study, both distance-based and angle-based features were extracted from the initial joint points, as described in the following sections.

Hand Pose Estimation
MediaPipe Hands is an API developed by Google for estimating the coordinates of each hand joint from a web camera [25]. It can also estimate the coordinates of joints from RGB images. The output produced by the API consists of 21 points, each with 3D (X, Y, Z) coordinates. The points are ordered as follows: the first point is the wrist at the base of the hand; points 2-5 are the thumb joints from the base to the tip; points 6-9 are the index finger joints, again from the base; and so on for the remaining fingers. The position of the wrist and the other joints is not fixed; the coordinates of each joint change as the hand moves. Figure 3 illustrates a sample input image, the estimated joint points, and the order of the joint points.
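The landmark ordering can be summarized programmatically. The sketch below uses MediaPipe's own 0-based indexing (0 = wrist, 1-4 = thumb from base to tip, 5-8 = index finger, and so on); the 1-21 numbering used in this paper corresponds to these indices shifted by one. The FINGER_RANGES table and the finger_of helper are our own illustration, not part of the MediaPipe API.

```python
# MediaPipe Hands landmark layout (0-based, as returned by the API).
# The paper's joint numbers 1-21 correspond to these indices plus one.
FINGER_RANGES = {
    "wrist": range(0, 1),
    "thumb": range(1, 5),    # base to tip
    "index": range(5, 9),
    "middle": range(9, 13),
    "ring": range(13, 17),
    "pinky": range(17, 21),
}

def finger_of(landmark_index: int) -> str:
    """Return which finger (or the wrist) a 0-based landmark index belongs to."""
    for name, idx_range in FINGER_RANGES.items():
        if landmark_index in idx_range:
            return name
    raise ValueError(f"invalid landmark index: {landmark_index}")
```

This layout is what the distance and angle features in the following sections index into.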

Distance-Based Features
In order to extract features that are not affected by the screen position, the distances between the 21 joint coordinates are calculated first. However, the distances between neighboring joints were not considered. The distance between two joint points i and j can be obtained using Equation (1):

d(i, j) = sqrt((x_i − x_j)^2 + (y_i − y_j)^2 + (z_i − z_j)^2) (1)

Figure 4 illustrates the distance between the 8th and 10th joint points. Here, neighboring joints are joints that are connected by bones; for example, in the case of the third joint, the second and fourth joints are the adjacent joints. Since the relative positions of neighboring joints are fixed by the bones, the distances between adjacent joints do not change even when the formation of the hand varies. Hence, the distances between adjacent joints will not have any impact on the classification, as they produce the same distance value regardless of the hand position or formation in the image. If neighboring joints are excluded, 190 features can be obtained from each image. Table 2 presents all 190 features and how they are obtained. It can be noticed that for joint points 20 and 21, the sets are empty; this is because the pairs that would start from points 20 and 21 have already been covered by the previous pairs. Although using the distances between joints solves the problem with the location, the problem with the size of the object remains: if the recognized hand is large, the distance between each pair of joints will be large, and if it is small, the distances will be small. Therefore, the obtained distance values were normalized to solve this problem.
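A minimal sketch of the distance features follows. The 190-feature count comes from the 210 unordered pairs of 21 joints minus the 20 bone-connected (parent-child) pairs. The PARENT table encodes the hand skeleton with 0-based MediaPipe indices, and the max-distance normalization is our own assumption; the paper does not specify its exact normalization scheme.

```python
import itertools
import math

# Parent joint of each of the 21 landmarks (0-based; the wrist has no parent).
# The 20 parent-child pairs are the bone connections excluded below.
PARENT = [None, 0, 1, 2, 3, 0, 5, 6, 7, 0, 9, 10, 11, 0, 13, 14, 15, 0, 17, 18, 19]
BONES = {(PARENT[i], i) for i in range(1, 21)}

def distance_features(joints):
    """joints: list of 21 (x, y, z) tuples -> 190 normalized distances."""
    dists = []
    for i, j in itertools.combinations(range(21), 2):
        if (i, j) in BONES:          # skip neighboring (bone-connected) joints
            continue
        dists.append(math.dist(joints[i], joints[j]))   # Equation (1)
    # Divide by the largest distance so the features are independent of the
    # hand's size in the image (our assumed normalization).
    scale = max(dists) or 1.0
    return [d / scale for d in dists]
```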

Angle-Based Features
The feature values describing how much the hand is tilted were calculated as the angle-based features. The direction vectors between the coordinates of each pair of joints were calculated, along with how much each vector is tilted from the X, Y, and Z-axis directions. Figure 5 illustrates this process, where a vector has been created by connecting the 6th and 11th joint points; the angles between this vector and the coordinate axes (the x, y, and z unit vectors) have then been calculated. Since the number of joints to be estimated is 21, a total of 210 vectors can be created, and three angle-based features can be calculated for each vector, resulting in a total of 630 angle-based features. Table 3 illustrates all 210 vectors and the extraction of the 630 angle-based features. Similarly to the distance-based features, it can be noticed that for joint point number 21 the set is empty, because the expected pairs have already been covered by the earlier joint points.
These features are useful, and the classifier is expected to have an advantage in the recognition process when signs that have the same shape but different tilts are under consideration. For example, in this study, I and J are two such classes, and the angle-based features are useful for such letters. Considering only the distance-based features, both letters produce the same feature values, since apart from the tilt the hand shape is the same; the distances between the joints do not change, and the classifier cannot distinguish the two letters based on distance alone. However, the angle-based features can eliminate this problem. Hence, the angle from each axis is expected to be important.
Additionally, since the angle information is not affected by the size of the hand, the extracted features do not require normalization, unlike the distance-based features described beforehand; as a result, the effect of the size of the hand is reduced. The calculation method is to first compute the direction vector between two points. The angle between this vector and each of the X, Y, and Z-axis directions can then be calculated. Figure 5 illustrates the extraction of such angles. The calculation used the cosine of the angle between two spatial vectors. Suppose we have two vectors a = (a1, a2, a3) and b = (b1, b2, b3). The angle between these two spatial vectors can be calculated using Equation (3):

cos θ = (a · b) / (|a| |b|) = (a1·b1 + a2·b2 + a3·b3) / (sqrt(a1^2 + a2^2 + a3^2) · sqrt(b1^2 + b2^2 + b3^2)) (3)

In Figure 5, a is the vector created by the 6th and 11th joints, and the angles between a and the X-axis (left), Y-axis (middle), and Z-axis (right) have been calculated.
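The angle extraction can be sketched as follows, assuming the 210 direction vectors are formed over all unordered joint pairs and each is compared with the three unit axis vectors via Equation (3). The function name is our own.

```python
import itertools
import math

def angle_features(joints):
    """joints: list of 21 (x, y, z) tuples -> 630 angles in radians."""
    axes = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    feats = []
    for i, j in itertools.combinations(range(21), 2):    # 210 vectors
        a = tuple(joints[j][k] - joints[i][k] for k in range(3))
        norm_a = math.sqrt(sum(c * c for c in a)) or 1e-12
        for b in axes:
            # Equation (3): cos(theta) = (a . b) / (|a| |b|); here |b| = 1.
            cos_t = sum(ac * bc for ac, bc in zip(a, b)) / norm_a
            feats.append(math.acos(max(-1.0, min(1.0, cos_t))))
    return feats
```

Note that, unlike the distances, these angles need no normalization step, since scaling the hand leaves all direction vectors (and hence the angles) unchanged.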

Classification
For classification, two methods, support vector machine (SVM) and light gradient boosting machine (GBM), have been utilized. SVM works well for unstructured and semistructured high-dimensional datasets. With an appropriate kernel function, SVM can solve complex problems. Unlike neural networks, SVM training does not get trapped in local optima. SVM models generalize well in practice and, therefore, carry a lower risk of over-fitting. On the other hand, light GBM has faster training speed, lower memory usage, and better performance than other boosting algorithms, is compatible with large datasets, and supports parallel learning. For all these reasons, both SVM and light GBM were chosen in this research.

Support Vector Machine
SVM is a pattern recognition model that utilizes supervised learning [50], and in this study, it has been utilized for classification. The support vector machine constructs a pattern discriminator using linear input elements. From the training data, the parameters of the linear input elements are learned based on the criterion of finding the hyperplane that maximizes the margin, i.e., the distance to the nearest data points. The kernel used in this study is the radial basis function (RBF) kernel, represented by Equation (7), where X1 and X2 are two points, K denotes the kernel, γ is the kernel coefficient, and ||X1 − X2|| denotes the Euclidean distance between the two points:

K(X1, X2) = exp(−γ ||X1 − X2||^2) (7)
The support vector machine has hyperparameters, and parameter tuning was performed to optimize them. Grid search has been used to find the optimal values of the cost (C) and gamma parameters in this research.
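The tuning procedure can be sketched with scikit-learn's GridSearchCV. The toy data and the grid values below are placeholders for illustration; the actual search space used in the paper is given in Table 4, and the real inputs would be the 820-dimensional (190 distance + 630 angle) feature vectors.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Toy stand-in for the distance + angle feature vectors.
X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)      # 20% held-out test set

# RBF-kernel SVM, grid search over C and gamma with 5-fold cross validation.
param_grid = {"C": [1, 10, 100], "gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```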

Light Gradient Boosting Machine
Light GBM is a machine learning framework for gradient boosting based on the decision tree algorithm [51]. Gradient boosting is an ensemble learning method that combines multiple weak learners (in the case of light GBM, decision trees) into one using 'boosting'. Before the arrival of light GBM, gradient boosting with XGBoost was the mainstream method. Normal decision tree models, including XGBoost, are trained level-wise. Light GBM instead uses leaf-wise learning, which is more efficient because it avoids unnecessary splits. Therefore, light GBM addresses the drawback of gradient boosting methods such as XGBoost, which have high prediction accuracy but long computation times. Figure 6 illustrates both level-wise learning and leaf-wise learning.

Experimental Settings and Evaluation Metric
Each of the three datasets was divided into a train set and a test set, with 20% of the data in the test set. While tuning the support vector machine and light GBM, 5-fold cross validation was utilized. Accuracy has been used as the evaluation metric in this research, defined as

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP = true positive, FP = false positive, TN = true negative, and FN = false negative.
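In code, the metric is simply the fraction of correct predictions over all predictions:

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)
```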

Experimental Analysis
This section starts with the details on parameter tuning. After that, the experimental settings, evaluation metric, and result analysis have been presented, along with a comparison with previous works. Later, the necessity of both the distance-based and angle-based features for the overall performance has been discussed.

Parameter Tuning
In this study, SVM and light GBM have been used to classify the ASL alphabet. To obtain the best parameters, grid search was used. The parameters searched for SVM were the cost (C) and gamma values. Table 4 shows the parameter search space for the SVM classifier, and Table 5 shows the selected C and gamma values after performing parameter tuning for each of the datasets. Grid search was also applied to select the best parameters for light GBM. The parameters searched were the number of leaves, learning rate, minimum child samples, and the number of estimators. Table 6 presents the parameter search space and Table 7 presents the selected parameters for each of the datasets while using light GBM.

Results Analysis
In this study, two types of classifiers, SVM and light GBM, have been used. SVM has been used as the main classifier, while light GBM has been utilized for comparison. There are two types of features: distance-based features and angle-based features. Results after applying SVM and light GBM are illustrated in Table 8. From Table 8, it can be observed that, when used alone, the angle-based features gave better results. In ASL, there are letters that have the same shape but different inclinations to express different characters. The distance-based features may be able to determine the shape but not the inclination; therefore, the performance increased when angle-based features, which can also determine the degree of inclination, were used. Next, from Table 8 it can also be observed that the results are better when both distance-based and angle-based features are used than when either is used individually. Although the shape of the hand can be inferred from the inclination, the distance features make it possible to estimate the shape of the hand more clearly. Therefore, combining the two features led to further improvement in accuracy, which can be observed in Table 8. In addition, Table 9 presents the average hand pose estimation time, average feature extraction time, prediction time per sample, recognized frames per second, and memory required to load the final trained model for all three datasets using SVM while considering both distance-based and angle-based features. Here, all times are measured in seconds. It can be seen that the proposed system can recognize at least 62 samples per second, which indicates that the proposed system is suitable for real-time gesture recognition. A Kaggle CPU environment (2 CPU cores, 16 gigabytes of RAM, 20 gigabytes of disk space) was utilized for all experimentation.

Comparison with Previous Studies
In this section, the comparison between this study and previous studies will be discussed. Two datasets are compared in this study, the Massey dataset and the finger spelling A dataset, as these two datasets have been used in previous studies. As can be seen from Table 10, these two datasets showed better results in this study than in the previous studies. Specifically, the obtained accuracy for the Massey dataset is 99.39% and the accuracy for the finger spelling A dataset is 98.45%, both higher than in previous studies. In addition, 87.60% accuracy was achieved on the ASL alphabet dataset, which is considered a difficult dataset to classify. By using the coordinate estimation method used in this study, it is possible to obtain 3D information that cannot be obtained from 2D images, and this 3D information is important because it allows the system to easily identify important features of joint points that are difficult to identify in 2D. For example, if a person is clenching his or her hand, 3D information is much more useful in identifying the hand pose because it captures the information more clearly.
There are cameras that can obtain 3D information, such as leap motion devices and cameras equipped with depth sensors. However, in this research, we used images captured by a webcam, a camera that does not provide three-dimensional information like a depth camera. The reason we considered webcam input is that it is easier to use than the above-mentioned cameras. Moreover, leap motion devices and depth sensors are more expensive. Web cameras, on the other hand, are inexpensive, and since even laptops are equipped with them, it is not difficult to obtain and use one. Therefore, we believe that obtaining good recognition rates even with a web camera can be highly beneficial for ASL recognition and will have a great impact on future research.


Necessity of Distance-Based Features, Angle-Based Features, and Both
In this study, two types of features are used for character recognition: one uses the distances between joints, and the other uses angles. Distance-based features capture the shape of the hand more clearly, whereas angle-based features capture the tilt along with the shape. Normally, the recognition rate would be higher for the distance-based features if the classes of ASL depended solely on the shape of the hand. However, in ASL, some signs have the same shape but different inclinations to represent different letters; for example, I and J have the same shape but different inclinations. Hence, specifically for ASL recognition, the angle-based features achieved better performance than the distance-based features. Figures 7 and 8 illustrate the confusion matrices for distance-based features and angle-based features, respectively. A clear difference in performance can be seen here; specifically, I and J showed a difference in character recognition accuracy. When only distance-based features were used, the recognition rate of I was 74% and that of J was 78%. However, when angle information was used, the recognition rate of I improved to 87% and that of J to 92%. This indicates the importance of including angle-based features.
However, it is also true that distance-based features capture the hand shape more clearly than the angle-based features. As many letters in ASL have different hand shapes, distance-based features can often be highly useful. For example, the hand may sometimes tilt unintentionally to the left or right where tilting is not intended; in those cases, the angle-based features can cause classification problems, which the distance-based features can compensate for. Therefore, the better choice seemed to be combining both the distance-based and angle-based features, and it turned out that combining both features indeed boosts the performance, as reported above. Table 8 illustrates the difference between using these features individually and in combination.

Conclusions
In this study, we used images obtained from a web camera to recognize sign characters in ASL. However, instead of using the images directly, we estimated the coordinates of the hand joints from the images and used the estimated coordinates for recognition. Features were then generated from the estimated coordinates, and character recognition was performed based on these features. The features we created were based on the distances between the joints and the angles between the direction vectors connecting pairs of joints and the X, Y, and Z axes. By using these features, it was expected that the complex shape of the hand could be easily represented and that the results would be better than using the images themselves. The results were as expected, and the method used in this study performed very well for sign language recognition in ASL. The experiments also showed that the accuracy of our method was better than that of previous papers. We believe this will be a great contribution to the field of character recognition. There has been a great demand for systems that can input text without touching anything, as is currently being researched for contactless text input systems, and we hope this research will contribute to that field as well. Another difference between this research and existing research is that, as mentioned earlier, this research considered webcam input, which is inexpensive and easy to obtain. In the future, we plan to recognize not only ASL but also sign characters from other languages. In addition, the system used in this study can be applied not only to sign language recognition but also to air writing, the recognition of characters written in the air. This indicates the diverse applications of this study and its potential to contribute greatly to future research.