Deep Forest-Based Monocular Visual Sign Language Recognition

Abstract: Sign language recognition (SLR) is a bridge linking the hearing impaired and the general public. Some SLR methods that rely on wearable data gloves are not portable enough to provide daily sign language translation, while visual SLR is flexible enough to work in most scenes. This paper introduces a monocular vision-based approach to SLR. Human skeleton action recognition is proposed to express semantic information, including the representation of sign gestures, using the regularization of body joint features and a deep-forest-based semantic classifier with a voting strategy. We test our approach on the public American Sign Language Lexicon Video Dataset (ASLLVD) and a private testing set. The approach achieves promising performance and shows high generalization capability on the testing set.


Introduction
Sign language plays an indispensable role in the soundless world. It is used across the world as the language of the hearing impaired. However, communication between the hearing impaired and people with normal hearing remains a prominent problem. Written communication on paper is a common method, but it is inefficient. Great effort is needed to help the hearing impaired communicate well with hearing people, and sign language recognition (SLR) is an efficient way to do so because it can convert sign language into text or even voice. For example, a hearing impaired person could use a portable device to communicate with someone in real time. The SLR tool recognizes the signs and then shows the messages on the screen or speaks them aloud. This requires the device to work online with limited computing resources and power.
Research in the field of SLR can be divided into two categories [1,2]. One is static gesture, mainly used to represent letters of the alphabet, and the other is dynamic gesture, covering most sign languages. To recognize both static and dynamic gestures, feature extraction and semantic identification are the keys. The visual method and the wearable method are the two main ways to acquire sign language features. The former uses monocular/stereo/depth cameras to capture gesture images and extract visual features [3]. The latter uses data gloves, which are equipped with embedded sensors, to get the joints' locations directly [4][5][6].
Despite the flexible motion of human hands, data gloves can accurately obtain the three-dimensional information of a gesture in space [7]. Their disadvantage is that the operator must wear the gloves, which weakens the naturalness and flexibility of human-computer interaction. In addition, the price of data gloves is another reason for their limited use and promotion.
The visual SLR method using depth/stereo cameras, such as Kinect®, can generate outstanding results using a depth vision feature [8,9]. Kinect® is a motion-sensing device produced by Microsoft that is based on a time-of-flight camera. These cameras carry a large hardware overhead, and they are sensitive to environmental conditions such as varying illumination. Compared with the depth/stereo camera, the monocular camera has a lighter structure and is less expensive. Since it only produces RGB images without depth information, however, it requires a robust and powerful recognition algorithm to achieve results comparable to those of depth/stereo cameras [2].
In the field of semantic recognition, there are two tendencies: the traditional methods, including hidden Markov models (HMMs) and dynamic time warping (DTW) [10,11], and the machine learning methods, such as support vector machines (SVMs) and deep neural networks (DNNs) [12].
The HMM was mainly used in the field of speech recognition in the early days [13,14]. Although the HMM has achieved great success in speech recognition, its performance in SLR is not satisfactory. The reason is that the traditional HMM method needs to establish a separate HMM for each gesture, which affects the real-time performance of the system. In contrast, the DTW method is simple and effective. Its optimal dynamic programming matching algorithm can be used to improve the accuracy rate of SLR [15]. Since DTW is based on template matching, it is difficult for it to learn from data, which limits its robustness.
Machine learning methods are characterized by high parallelism, adaptability, and a certain learning capability [16,17]. In particular, DNNs come in various network models, which satisfy different application requirements. The accuracy of DNNs depends on the number of training samples. In the field of sign language, as shown in Table 1, there is no public dataset that both covers a large number of signs and provides a sufficient volume of samples for each sign. It is hard for DNNs to achieve good performance if there are only between one and three samples for a sign. In addition, the computational overhead is an obstacle in practical applications.
This paper introduces a novel SLR method for dynamic gestures that has high robustness and strong generalization performance. We propose a combined joint model with both hand and arm joints to represent a human's pose. Considering the characteristics of joints, we employ the extracted joints as the body features to explain the sign language. Overall, our SLR model involves two steps. In the first step, a visual skeleton extraction method is used to encode the body joint information via the OpenPose detector [18]. In the second step, a small-sample dataset is used to train a classifier based on the deep forest model [19], which is compared with an SVM-based one.
The paper is organized as follows. Section 2 introduces relevant research from the aspect of skeleton detection. Section 3 presents the basic modeling formulation, including the visual skeleton extraction method, the joint feature re-encoding method, the semantic classification, and the voting mechanism. Section 4 reports the experiments on the public dataset and our private testing set. Finally, the conclusion is given in Section 5.

Related Work
Before the emergence of end-to-end learning, the main steps of visual sign language recognition could be divided into three parts: human skeleton detection, feature extraction, and semantic classification.
In terms of human skeleton detection, most traditional methods use a color-based skin segmentation model, which detects the human skeleton by the difference between the human's color and the background color [20]. Nada et al. presented a dynamic skin detector based on face color tone and a skin-blob tracking technique for hand segmentation. It has a recognition rate of 97% in a signer-independent mode [21]. In recent years, along with the rise of depth cameras, e.g., Microsoft's Kinect, many researchers have tried to combine depth information with appearance information. The Kinect can provide RGB images, depth images, and skeleton data [22]. Dong et al. adopted a 3D hand template with joint angle features using Microsoft's Kinect [23]. Sílvia et al. used seven vision-based features from RGB-D images and achieved average accuracy above 80% on Brazilian Sign Language [24].
After obtaining the human skeleton data, it is necessary to extract features from them. There are two types of features: graphical features and interaction information between skeleton joints. Graphical features involve the Fourier descriptors, the Zernike moments, the pseudo-Zernike moments, the Hu moments, the complex moments, the Gabor features, and others [25,26]. Özbay and Safar used the Hausdorff distance and Hu invariants to process hand movements in a universal sign language recognition system [27]. Since the depth camera is able to convert human body images into human joint information, the meanings of signs are embedded in the distribution of joints [28]. Kishore et al. proposed a characterization of sign language gestures articulated at different body parts as 3D motionlets, which describe the signs with a subset of joint motions [29].
In the area of semantic classification, HMMs and DTW have been used to classify features since early times [30]. Pradeep et al. performed the recognition process using an HMM. The results showed the efficiency of the proposed framework, with an accuracy of 83.77% on occluded gestures [31]. However, due to the high efficiency of SVM, many scholars have tended to use SVM and improve on it. Naresh combined linear discriminant analysis (LDA) and SVM to form the basis of tenfold classification to recognize sign language symbols. His work reports 97.3% accuracy on a random sign symbolic dataset of gestural communication [32].
With the rise of neural networks, and especially deep learning methods, the boundary between feature extraction and semantic classification in SLR became blurred and even disappeared. A neural network can automatically learn and extract classification features from the input images [33]. Kiran et al. applied convolutional neural networks (CNNs) to the recognition of 3D motion-captured sign language. The 3D spatiotemporal information of each sign was interpreted using joint angular displacement maps (JADMs), which encoded the sign as a color texture image [16]. Although deep neural networks show strong performance, that performance depends on the size and quality of the datasets.

Problem Formulation
Human speech conveys information through sound, while sign language conveys information through body gestures. To identify sign language, gestures should be described precisely. In sign language, gestures mainly involve the movements of hands and arms. Therefore, gestures can be described by the continuous posture of hands and arms, including position, movement, and shape. Position and movement can be represented using a human skeleton model. Since the shape of hands and arms has little effect on the meaning of sign language, modeling with the skeleton, especially joints, is sufficient for SLR. This paper presents a combined joint model with both hand and arm joints. Figure 1 shows the 48-marker template designed to represent hand and arm joints. The number of markers is defined as
$$N = N_{arm} + N_{hand},$$
where $N_{arm}$ is the number of arm joints, denoted as $J_1, \ldots, J_6$ (blue points in Figure 1), and $N_{hand}$ is the number of finger joints, denoted as $J_7, \ldots, J_{48}$ (green points in Figure 1). An extra root joint $J_0$ (black point in Figure 1) is added to the template.
The framework of the SLR method is illustrated in Figure 2. The input is the raw image sequence from a monocular camera. The keypoints, regarded as the joints of the skeleton, are extracted frame by frame, and then the position vector and its confidence are calculated in terms of the keypoints. In addition, we regularize the position vector into the normal position vector using a scaling coefficient generated from the position vector. The regularized position vectors are used as the input of the deep forest-based classifier. The output of the classifier is combined with the confidence and used to generate the result by a voting strategy.
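The joint indexing described above can be written down as a small lookup. This is a minimal sketch for illustration only: the constant and function names are hypothetical, and it assumes that the 48 markers count the arm and hand joints while the root joint is indexed separately as 0.

```python
# Index layout of the combined joint model (names are illustrative).
ROOT = 0                  # J_0, the extra root joint
ARM = range(1, 7)         # J_1..J_6, arm joints
HAND = range(7, 49)       # J_7..J_48, hand joints

N_ARM = len(ARM)
N_HAND = len(HAND)
N_MARKERS = N_ARM + N_HAND   # the 48-marker template (root not counted)


def joint_group(j: int) -> str:
    """Return the group that joint index j belongs to."""
    if j == ROOT:
        return "root"
    if j in ARM:
        return "arm"
    if j in HAND:
        return "hand"
    raise ValueError(f"joint index out of range: {j}")
```

Keeping the groups as index ranges makes it easy to slice a per-frame list of 49 keypoints into root, arm, and hand parts.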

Feature Extraction
We employ OpenPose to obtain the joint information. OpenPose is a library for real-time keypoint detection [18]. It is a bottom-up human pose estimation algorithm based on part affinity fields (PAFs) and runs as a real-time system that jointly detects human body, hand, and facial keypoints from a single image. Here, we use one chest point, 6 arm joints (corresponding to the shoulders, the elbows, and the wrists), and 42 hand points (corresponding to the palms and the fingers).
The anatomical keypoints of people are extracted from an RGB image. PAFs are used to describe the direction of pixels in the skeleton, denoted as $L(p)$, and the confidence maps for body part location are represented by $S(p)$, where $p$ represents the locations of the keypoints in the image. The network uses a pre-trained VGG network as the encoder. The detection and association of keypoints are conducted simultaneously via two branches: the confidence map prediction and the affinity field prediction. Finally, the confidence maps and PAFs are parsed to generate the 2D keypoints of people. The overall loss functions are defined as
$$f_S = \sum_{j=1}^{J} \sum_{p} W(p) \cdot \left\| S_j(p) - S_j^*(p) \right\|_2^2,$$
$$f_L = \sum_{c=1}^{C} \sum_{p} W(p) \cdot \left\| L_c(p) - L_c^*(p) \right\|_2^2,$$
where $S_j^*$ represents the ground-truth part confidence map, $J$ represents the keypoints, $L_c^*$ represents the ground-truth part affinity vector field, $C$ represents the limbs of the human body, $W(p)$ represents the indicator function that diminishes the loss of missing annotations, and $\| \cdot \|_2$ stands for the Euclidean distance.
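A masked squared-error loss of the kind described above, in which $W(p)$ zeroes out pixels without annotation, can be sketched in a few lines of plain Python. The function name and the nested-list data layout are assumptions made for illustration; this is not OpenPose's implementation, which operates on network tensors.

```python
def masked_l2_loss(pred, truth, mask):
    """Sum over channels and pixels of W(p) * ||pred(p) - truth(p)||_2^2.

    pred, truth: list of channels; each channel is a list of rows; each
                 pixel is a tuple of floats (length 1 for a confidence
                 map, length 2 for a part affinity field).
    mask:        rows of 0/1 values W(p); 0 marks unannotated pixels.
    """
    total = 0.0
    for pred_ch, truth_ch in zip(pred, truth):
        for y, row in enumerate(mask):
            for x, w in enumerate(row):
                diff = [a - b for a, b in zip(pred_ch[y][x], truth_ch[y][x])]
                total += w * sum(d * d for d in diff)
    return total


# One confidence-map channel on a 1x2 image; the second pixel is masked out.
pred = [[[(0.5,), (1.0,)]]]
truth = [[[(1.0,), (0.0,)]]]
mask = [[1, 0]]
loss = masked_l2_loss(pred, truth, mask)
```

The same function serves both branches: confidence maps use one value per pixel, and affinity fields use a 2D vector per pixel.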

Regularization
Due to differences in human body size, distance from the lens, and camera viewpoint, the distribution of joint points varies greatly between images. A scaling method is used to make the features consistent. At each frame $t$, a scaling coefficient $k_t$ is defined from the body sizes $L_t^i$, which refer to the shoulder width and the lengths of the upper arms and lower arms. In the default coordinates, the origin is located at the lower left corner of the image, which cannot reflect hand movement efficiently given the symmetry of the left and right parts of the human body. Hence, another regularization method is used to make the data symmetrical: in each frame $t$, a regularized position vector is defined for each joint. In the complete definition of the regularization process, $f$ is the regularization function, and $R_t^j$ and $r_t^j$ represent the regularized point and the regularized position vector, respectively. $k_t$ and $v_t^j$ are the scaling coefficient and the regularized position vector, respectively. $c_t^j$ is the coordinate confidence. $p_t^j$ and $L_t^i$ are the position vector and the size of the human body, respectively.
In each frame $t$, a confidence $C_t$ is used to evaluate the total confidence of the frame.
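The idea of removing the effects of body size and image position can be illustrated as follows. This is a sketch under stated assumptions, not the formula used in this paper: it assumes that the scaling coefficient is the reciprocal of the mean limb length, that positions are taken relative to the root joint, and that joints 1 and 2 are the shoulders, 3 and 4 the elbows, and 5 and 6 the wrists. All names are hypothetical.

```python
import math

# Hypothetical limb list: pairs of joint indices whose lengths stand in for
# the body sizes (shoulder width, upper arms, lower arms). The index
# assignment is an assumption, not given in the text.
LIMBS = [(1, 2), (1, 3), (3, 5), (2, 4), (4, 6)]


def scale_coefficient(points, limbs=LIMBS):
    """Assumed form of the scaling coefficient: 1 / mean limb length."""
    lengths = [math.hypot(points[a][0] - points[b][0],
                          points[a][1] - points[b][1]) for a, b in limbs]
    mean_len = sum(lengths) / len(lengths)
    if mean_len == 0:
        raise ValueError("degenerate skeleton: all limb lengths are zero")
    return 1.0 / mean_len


def regularize(points, root=0, limbs=LIMBS):
    """Shift every joint so the root is the origin, then rescale."""
    k = scale_coefficient(points, limbs)
    rx, ry = points[root]
    return [(k * (x - rx), k * (y - ry)) for x, y in points]
```

With this choice, the same pose captured closer to the camera or at a different place in the image maps to the same regularized vectors, which is the property the scaling step is meant to provide.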

Semantic Classification
Deep neural networks require a huge amount of training data and are hard to apply to tasks where only small-scale data are available, so an alternative is needed. In this paper, we explore a deep forest-based classifier, which is suited to small-scale datasets. This classifier combines the characteristics of deep learning and random forests [19]. The framework of the deep forest is shown in Figure 3. In our task, for each frame $t$, the input feature matrix $F_t^1$ is built from the regularized position vectors $r_t^j \in \mathbb{R}^2$. The deep forest consists of several layers $L_1, L_2, \ldots, L_m$, where $m$ depends on the training data. In each layer $L_i$, $i = 1$ to $m$, there are two random forests ($RF_{i1}$ and $RF_{i2}$) and two complete-random forests ($CRF_{i1}$ and $CRF_{i2}$). Each random forest contains 500 decision trees, each of which selects the feature with the largest Gini value from a set of randomly picked features. In contrast, each complete-random forest contains 500 decision trees that select a feature completely at random at each node of the tree [34]. The combination of two random forests and two complete-random forests, chosen according to performance and model size, gives an optimized result. The input of the first layer is the source data $F_t^1$. The deep forest classifier used in Section 4 has 3 layers. Each sample follows a path in each tree to its corresponding leaf node, and the training data in this leaf node are likely to belong to different categories. The class statistics of the leaf are collected over the $u$ categories, where $u$ is the total number of semantic categories in the dataset, and the probability distribution of the entire forest is generated by averaging the proportions over all trees. Finally, the semantic category with the highest probability is selected as the recognition result for the sample.
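The data flow through such a cascade can be sketched without any particular forest implementation. In the sketch below, each forest is represented by a plain callable that maps a feature list to a class-probability list; in practice these would be trained random and complete-random forests. The function name, the stand-in callables, and the choice to average the class vectors of the last layer before taking the maximum are assumptions for illustration.

```python
def cascade_predict(layers, source):
    """Run one sample through the cascade; return (class index, probabilities).

    layers: list of layers; each layer is a list of callables mapping a
            feature list to a class-probability list (stand-ins for the
            two random forests and two complete-random forests).
    source: the source feature vector of the sample.
    """
    if not layers:
        raise ValueError("the cascade needs at least one layer")
    features = list(source)
    class_vectors = []
    for forests in layers:
        class_vectors = [forest(features) for forest in forests]
        # The next layer sees this layer's output cascaded with the source.
        features = [p for vec in class_vectors for p in vec] + list(source)
    n = len(class_vectors)
    probs = [sum(col) / n for col in zip(*class_vectors)]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```

Because every layer receives the source features again, later layers can refine the class vectors of earlier layers without losing access to the original joint positions.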

Voting Mechanism
A sign language word is presented by a series of gestures captured by a monocular camera as a video consisting of a sequence of frames. For each frame $t$, the prediction $NP_t = (p_{t1}, p_{t2}, \ldots, p_{tk})^T \in \mathbb{R}^k$ is generated by the trained classifier, where $p_{ti}$ is the probability of category $i$ among the $k$ categories. To decide the final prediction for the whole gesture, a voting strategy is employed to differentiate and select among the $NP_t$. The weighted prediction on frame $t$ weights $NP_t$ by $C_t$, the confidence of the prediction for frame $t$. The final prediction on the gesture is then computed from the weighted predictions of all frames. The meaning of a sign language symbol is decided by the category with the top prediction score. The meaning of a sentence is decided by the categories with the top $n$ prediction scores. The category none is excluded, as it has no meaning. The value of $n$ is given in advance according to the number of symbols in the sentence.
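A confidence-weighted vote over the frames of a clip could look like the sketch below. It assumes that the weighted prediction of a frame is the product of its confidence and its probability vector and that the final score of a category is the sum of these products over all frames; the text does not confirm this exact form, and the function and parameter names are hypothetical.

```python
def vote(frame_probs, frame_conf, labels, n=1, ignore=("none",)):
    """Confidence-weighted voting over the frames of one clip.

    frame_probs: one probability vector per frame.
    frame_conf:  one confidence value per frame.
    labels:      category names, in the same order as the vectors.
    n:           number of signs to return (1 for an isolated word).
    ignore:      categories without meaning that are never returned.
    """
    if len(frame_probs) != len(frame_conf):
        raise ValueError("need one confidence per frame")
    scores = [0.0] * len(labels)
    for probs, conf in zip(frame_probs, frame_conf):
        for i, p in enumerate(probs):
            scores[i] += conf * p   # assumed: weight each frame, sum over frames
    ranked = sorted(
        (i for i in range(len(labels)) if labels[i] not in ignore),
        key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in ranked[:n]]


# Hypothetical four-frame clip with three categories.
labels = ["none", "drink", "water"]
frame_probs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1],
               [0.1, 0.2, 0.7], [0.0, 0.9, 0.1]]
frame_conf = [1.0, 0.5, 1.0, 1.0]
top_two = vote(frame_probs, frame_conf, labels, n=2)
```

Note that the result is ordered by score rather than by time, so a sentence-level system would still need to recover the order of the signs.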

Experiments and Results
We use the public dataset ASLLVD and a private testing set in the experiments. The ASLLVD consists of more than 3300 American Sign Language (ASL) signs in video clips, including nouns, verbs, adjectives, and pronouns. Each sign is performed by 1 to 6 native ASL signers. In total, there are more than 9800 clips of signs. This dataset includes multiple synchronized videos showing the signs from different viewpoints. We use only the front view, which corresponds to a monocular camera. Body joint features and sign recognition result examples are shown in Figure 4.
The classifiers are trained with the annotated images. Four classifiers are used and compared in our experiments: the state-of-the-art deep forest and three standard classifiers, namely support vector machine, decision tree, and logistic regression. We list the performance of all four classifiers, while only the SVM and the deep forest are chosen as typical classifiers for an in-depth analysis. The experiment has 10 steps, and 10 signs are randomly chosen in the first step. Each sign includes around 3 video clips, and each clip consists of between 40 and 500 frames. All frames are labeled with their corresponding signs. Then, 20 signs are selected in the second step, 30 signs in the third step, and so on. Lastly, 104 signs are used to train the classifiers (we add 4 extra signs because they can make up some common sentences). All of the experiments are conducted on a workstation with an Intel Xeon E5-1620 CPU, 16 GB RAM, and an Nvidia GTX 1080 Ti GPU. The average training time of the deep forest classifier is about 5143 s.
The samples of the 104 signs (about 24,385 samples) are randomly split into 80% training data (19,508 samples) and 20% testing data (4877 samples). We adopt precision, recall, and F1 score to evaluate the performance of multi-sign recognition. The F1 score is the harmonic mean of precision and recall and ranges between 0 and 1. These measures are defined as
$$Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall},$$
where TP, FP, and FN refer to the number of true positive, false positive, and false negative samples. The F1 scores of the four classifiers in the 10 experiments are shown in Figure 5.
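For a multi-sign problem, these measures are computed per sign by treating that sign as the positive class. The following sketch shows the computation on frame-level labels; the function name is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def per_sign_scores(y_true, y_pred, sign):
    """Precision, recall, and F1 for one sign treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == sign and p == sign)
    fp = sum(1 for t, p in pairs if t != sign and p == sign)
    fn = sum(1 for t, p in pairs if t == sign and p != sign)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical frame labels, for illustration only.
y_true = ["hot", "hot", "milk", "milk"]
y_pred = ["hot", "milk", "milk", "milk"]
milk_scores = per_sign_scores(y_true, y_pred, "milk")
```

In this toy example, the sign milk has 2 true positives, 1 false positive, and no false negatives, giving a precision of 2/3, a recall of 1, and an F1 score of 0.8.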
In the random 10-sign experiments, all classifiers show good performance, with F1 scores over 97%, as shown in Figure 5. However, as the number of signs increases, the F1 score of the standard classifiers decreases. It declines below 90% for 100 signs, while the deep forest classifier still has an F1 score of 97.7%. A possible explanation is that the standard classifiers are better suited to binary classification than to multi-class classification. The performances of the two typical classifiers on the top 20 signs in the testing data are illustrated in the form of box plots in Figure 6. For each classifier, the 20 signs with the largest number of test samples are listed in Table 2. The names of the signs, the number of frames, precision, recall, and F1 score are listed in the columns "Signs", "Number", "Precision", "Recall", and "F1". As shown in Table 2, SVM and deep forest have average F1 scores of 86% and 98%, respectively. The performances of these two classifiers are quite different on some signs. For example, for the sign hot, SVM gets 0.00 for precision, recall, and F1 score, while deep forest gets 1.00, 0.92, and 0.96, respectively. Moreover, for the sign milk, SVM gets precision, recall, and F1 scores of 0.48, 0.76, and 0.58, respectively, while deep forest gets 1.00 for all scores. The minimum precision, recall, and F1 scores for deep forest are 0.82, 0.79, and 0.86, respectively. This shows that the deep forest classifier performs better than the SVM classifier.
In the private testing dataset, 11 signs chosen to make up 6 daily words/sentences are listed in Table 3. We use a monocular camera to capture sign language videos of two people in two scenes: an office and a corridor with a black background. Six video clips are captured, containing 37 to 158 frames each. Figure 7 shows the visualized features and extracted vectors from the private testing dataset. All signs are picked from the 104 signs of the ASLLVD. The sign none, which has no meaning, is used to label the beginning and the end of a clip. The RoS score is defined as
$$RoS = \frac{S_C}{S_A},$$
where $S_C$ represents the number of correctly recognized signs and $S_A$ represents the number of signs that should be recognized. Table 4 shows a high recognition rate for single-sign words. For example, the word apple has a recognition rate of 100% because we clip the video by retaining only the key frames of the whole video. For the word banana, the sign banana has 60 weighted frames, which is far more than that of the M3 sign again. Double-sign sentences are recognized at a moderate rate, since some other signs are recognized by mistake. In the sentence drink-water, the sign drink scores 11 weighted frames, but the sign water scores only 4 weighted frames, while the incorrect signs come and home score 4 and 3 weighted frames, respectively. A similar result appears in the sentence father-walk. In the triple-sign sentences, since the transition frames between the key frames make up a large proportion of the frames, the meaning of the sentences might be confused. In the sentence hello-where-toilet, the core signs toilet and hello are correctly recognized, although some frames are wrongly recognized as home and finish. As a next step, we will focus on this issue, which could be handled by natural language processing models such as long short-term memory (LSTM) [35,36].
Overall, the correctly classified frames are dominant, and promising performance is achieved on isolated words. However, for multi-word sentences, the desired signs might sometimes be missing or confused with other signs, because the transition frames between two signs might be recognized incorrectly and the ASLLVD dataset provides only isolated signs for training. The method is therefore not yet highly effective for multi-word sentences.
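The recognition rate of a clip can be computed from the list of recognized signs and the list of expected signs. The sketch below assumes that RoS is the ratio of correctly recognized signs to expected signs and that matching ignores the order of the signs; both points, as well as the function name, are assumptions made for illustration.

```python
def ros(recognized, expected):
    """Ratio of correctly recognized signs to signs that should be recognized."""
    if not expected:
        raise ValueError("expected sign list must not be empty")
    remaining = list(recognized)
    correct = 0
    for sign in expected:
        if sign in remaining:
            remaining.remove(sign)   # each recognized sign counts once
            correct += 1
    return correct / len(expected)


# Hypothetical outputs, not results from the private testing set.
single = ros(["apple"], ["apple"])
double = ros(["drink", "come"], ["drink", "water"])
```

Here the single-sign clip scores 1.0, and the double-sign clip scores 0.5 because only one of the two expected signs was recognized.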

Conclusions
In this paper, we propose a monocular vision-based sign language recognition system that is flexible and accurate for translating visual gesture semantic information into words. The state-of-the-art human keypoint feature extraction system, OpenPose, is employed to provide the keypoint positions of the human skeleton from a single image sequence. We then propose a feature regularization to normalize the various features and use a deep forest-based classifier to train our model on a small dataset, including the public ASLLVD and our private testing set. The system has shown high generalization performance across these datasets and promise for real-world applications.
During the development of this system, we identified some improvements that can be made in the future. The voting strategy is not very robust for complex semantic sentences. With real-time performance in mind, we do not employ a network with memory units such as LSTM or the gated recurrent unit (GRU) [37,38]. These methods have proven effective in the field of natural language processing and are possible solutions for improving the performance of our sign language recognition system. Moreover, future investigation of other sign languages, such as Chinese Sign Language, is necessary to gain a more accurate representation, and thus datasets from more participants are important to extend and validate our proposal.
For each joint at each frame $t$, $p_t^j$ and $c_t^j$ are defined as the position vector and the position confidence of the joint.

Figure 1. 48 markers of hand and arm joints with one root joint.

Figure 4. Visualized features and extracted vectors of sign recognition from the American Sign Language Lexicon Video Dataset (ASLLVD).

Figure 5. Performance of the support vector machine (SVM) classifier and the deep forest classifier.

Table 1. Public sign language datasets.
To avoid over-fitting, the training of each forest uses K-fold cross-validation. Each sample is used for training $k - 1$ times and inspected $k - 1$ times, so the probability vector generated by each forest is not the training result from a single batch of training data but the average of the $k - 1$ results after cross-checking. After the training of layer $L_1$, the trained model is used to estimate a testing set, and a cutoff accuracy $\Delta_c$ is selected. If the accuracy of the obtained result $\Delta_r$ is less than $\Delta_c$, the training is terminated. This step automatically determines the number of layers $m$.
The output of the first layer $Re_t^1$ is cascaded with the source data $F_t^1$ as the input of the second layer, $F_t^2 = [Re_t^1, F_t^1] \in \mathbb{R}^{128}$. Each layer $L_i$, $i = 2$ to $m$, takes the output of the previous layer $Re_t^{i-1}$ cascaded with the source data $F_t^1$ as its input, $F_t^i = [Re_t^{i-1}, F_t^1] \in \mathbb{R}^{128}$.
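The rule that determines the number of layers can be sketched as a loop around two user-supplied callables. The sketch assumes that a layer whose estimated accuracy falls below the cutoff is discarded and that a maximum depth guards against unbounded growth; the function names, the callables, and both of these choices are hypothetical.

```python
def grow_cascade(train_layer, evaluate, cutoff, max_layers=10):
    """Add layers until the estimated accuracy falls below the cutoff.

    train_layer(i, layers): trains and returns layer i given earlier layers.
    evaluate(layers):       returns the accuracy of the cascade so far.
    cutoff:                 the cutoff accuracy.
    max_layers:             assumed safeguard on the cascade depth.
    """
    layers = []
    for i in range(1, max_layers + 1):
        candidate = layers + [train_layer(i, layers)]
        accuracy = evaluate(candidate)
        if accuracy < cutoff:
            break               # terminate training, drop the failing layer
        layers = candidate
    return layers


# Stand-in callables with made-up accuracies, for illustration only.
accuracies = {1: 0.95, 2: 0.97, 3: 0.98, 4: 0.90}
cascade = grow_cascade(lambda i, layers: f"L{i}",
                       lambda layers: accuracies[len(layers)],
                       cutoff=0.93)
```

With these made-up accuracies, the fourth layer falls below the cutoff, so the loop stops and returns a three-layer cascade.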

Table 2. Performance of SVM and deep forest classifiers for the top 20 signs.

Table 3. Private testing set.