Backhand-Approach-Based American Sign Language Words Recognition Using Spatial-Temporal Body Parts and Hand Relationship Patterns

Most of the existing methods focus mainly on the extraction of shape-based, rotation-based, and motion-based features, usually neglecting the relationship between the hands and the body parts, which can provide significant information for addressing the problem of similar sign words under the backhand approach. Therefore, this paper proposes four feature-based models. The first model, the spatial-temporal body parts and hand relationship patterns, is the main feature. The second model consists of the spatial-temporal finger joint angle patterns. The third model consists of the spatial-temporal 3D hand motion trajectory patterns. The fourth model consists of the spatial-temporal double-hand relationship patterns. Then, a two-layer bidirectional long short-term memory method is used as a classifier to deal with time-independent data. The performance of the method was evaluated and compared with existing works using 26 ASL letters, achieving an accuracy and F1-score of 97.34% and 97.36%, respectively. The method was further evaluated using 40 double-hand ASL words and achieved an accuracy and F1-score of 98.52% and 98.54%, respectively. The results demonstrate that the proposed method outperforms the existing works under consideration. Furthermore, in the analysis of 72 new ASL words, including single- and double-hand words from 10 participants, the accuracy and F1-score were approximately 96.99% and 97.00%, respectively.


Introduction
Sign language is a medium of communication for hearing-impaired people, a group which includes 466 million people worldwide [1], and it is expressed using the fingers and hands. American Sign Language (ASL) is the most widely used among sign languages and consists of over ten thousand word gestures [2]. Moreover, 65% of ASL gestures represent sign words during a full conversation [3]. Communication between hearing people and hearing-impaired people is very important because they must work together; however, hearing people generally do not know much about sign language, so a communication gap exists, and it occurs very often in society. For instance, in the case of medical diagnosis where the patient is hard of hearing, the information received from the patient can be inaccurate, which can affect critical healthcare decisions. Poor communication often has catastrophic effects: according to a World Health Organization report, the simple failure to follow doctor's orders results in 125,000 deaths in the U.S. each year [4]. Moreover, unemployment among deaf persons, caused by this lack of communication, has reached 3.8% in the U.S. [5]. Therefore, an automatic sign language interpretation system is necessary to bridge the communication gap. Such a system needs to be portable and mobile; thus, a backhand approach, in which the sensor is worn on the signer's body and observes the back of the hands, is adopted in this work.

Problem Analysis
Many sign words have a similar shape, rotation, and movement; these are called the SRM sign group [13]. Consequently, the features used in the existing work may not be sufficient to distinguish these types of sign words. In previous research on the backhand view [2,12], a similar problem was solved using rotation-based analysis, such as the pitch, roll, and yaw angles. Moreover, other previous work [3,6,11] used position-based feature extraction based on a backhand view. Nevertheless, it would be difficult to apply these features to the SRM sign group, as shown in Figure 1. In the first row of Figure 1, the images in the first column show a single-handed representation of the sign language words "fox" and "fruits", and the images in the second column show the double-handed representation of the sign language words "brother" and "sister." The outputs of the rotation-based (including pitch, yaw, and roll angles), motion-based, and shape-based (shape representation of the thumb, pinky, and wrist joints in terms of time series) methods are demonstrated in the second, third, and fourth rows, respectively.

As a result, in both the single- and double-hand groups, there are signs with similar shapes, rotation, and movement. However, the position of the hands relative to a part of the body can be used to distinguish the words in this group. Therefore, in this paper we present a method based on the spatial-temporal body parts and hand relationship patterns (ST-BHR) as the main feature, using the 3D distance-based Cartesian product, as displayed in Figure 2, which can be expressed as in Equation (5).
According to the solution presented above, F(t) and B(t) are the set of fingertip and palm positions and the set of the key body-part positions, as displayed in Equations (2) and (3). Let F(t), J(i,h,t), L, R, and t = {1, . . . , T} stand for the set of fingertip and palm positions, the individual fingertip and palm positions, the left hand, the right hand, and the total number of frames, respectively.
Here, B(t) = {S_1, S_2, S_3, S_4, S_5, S_6, S_7, S_8}, where S_1, . . . , S_8 stand for the key points of the body: the forehead (S_1), the right ear (S_2), the left ear (S_3), the nose (S_4), the chin (S_5), the right shoulder (S_6), the left shoulder (S_7), and the chest position (S_8), respectively.
The 3D distance between two points, such as F(t)(i) = (x_{(i,t)}, y_{(i,t)}, z_{(i,t)}) and B(t)(k) = (x_{(k,t)}, y_{(k,t)}, z_{(k,t)}), in xyz-space is given by the following generalization of the distance formula in Equation (4):

$\phi_{(i,k)}(t) = \sqrt{\left(x_{(i,t)} - x_{(k,t)}\right)^2 + \left(y_{(i,t)} - y_{(k,t)}\right)^2 + \left(z_{(i,t)} - z_{(k,t)}\right)^2}$ (4)
where φ(t), F(t), B(t), x, y, z, i = {1, 2, 3, . . . , 12}, and k = {1, 2, 3, . . . , 8} stand for the 3D distance, the set of fingertip and palm positions, the set of the key body-part positions, the x-axis, the y-axis, the z-axis, the total number of elements in set F(t), and the total number of elements in set B(t), respectively.
The proposed 3D distance-based features, based on the Cartesian product, are used as the feature extraction method, as shown in Figure 3 and Equation (5), in which the output of the Cartesian product is normalized before being fed into a classifier:

$H_1(t) = Norm\left(F(t) \times B(t)\right) = Norm\left(\{\phi_{(i,k)}(t) \mid J_{(i,h,t)} \in F(t),\ S_k \in B(t)\}\right)$ (5)

where H_1(t), F(t), B(t), φ_{jk}, J_{(i,h,t)}, and Norm indicate the Cartesian product of the sets F(t) and B(t), the set of fingertip and palm positions, the set of the key body-part positions, the 3D distance, the finger joints, and min-max normalization, respectively.
This feature represents the spatial-temporal body parts and hand relationship patterns based on the 3D distance, normalized via min-max normalization [50], as shown in Figure 4. As a result, the words "fox" and "fruit" show different patterns over the time series of the video, represented by a red line between the circle and the rectangle, which stand for the normalized distances of the words "fox" and "fruit", respectively.
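To make the ST-BHR computation concrete, the following Python sketch evaluates the Cartesian-product distances of Equations (4) and (5) under assumed array shapes (12 hand points, 8 calibrated body points); the function and argument names (st_bhr_features, hand_points, body_points) are illustrative and not taken from the paper's implementation.

```python
# Hedged sketch of the ST-BHR feature (H1); names and shapes are illustrative.
import numpy as np

def st_bhr_features(hand_points, body_points):
    """hand_points: (T, 12, 3) fingertip/palm positions per frame (6 per hand).
    body_points: (8, 3) calibrated key body positions (forehead ... chest).
    Returns (T, 96) min-max normalized 3D distances, one per pair of the
    Cartesian product F(t) x B(t)."""
    # Pairwise Euclidean distances for every frame (Equation (4)).
    diff = hand_points[:, :, None, :] - body_points[None, None, :, :]   # (T, 12, 8, 3)
    dist = np.linalg.norm(diff, axis=-1).reshape(len(hand_points), -1)  # (T, 96)
    # Min-max normalization of the Cartesian-product output (Equation (5)).
    return (dist - dist.min()) / (dist.max() - dist.min() + 1e-8)
```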

System Overview
Hearing-impaired people usually communicate through sign language, whereas non-hearing-impaired people normally use speech based on a natural language. An approach that allows these two groups of people to communicate via a mobile device that converts sign language to natural language and vice versa in real time is illustrated in Figure 5. When a hearing-impaired person signs in front of both hearing-impaired and non-hearing-impaired people, the hearing-impaired people can understand the meaning from the signs, while the non-hearing-impaired people can simultaneously listen to the generated speech. In this scenario, a mobile device is installed on a body part of the hearing-impaired signer, such as the chest, face, or head, and functionally converts the video of signs into speech for non-hearing-impaired people to understand, and vice versa.


Hardware Unit
The Leap Motion sensor [51], a 3D sensor, is set on the signer's chest so that it automatically moves along with the signer and always detects the hand signs from the same view, i.e., the backhand view, as illustrated in Figure 6. In this paper, we focus on translating sign words and showing them as text, presented in a red rectangle. In addition, the interaction zone of the sensor extends from 10 cm to 80 cm, with a typical field of view of 140° × 120°. Therefore, if the sensor lies in the normal plane with no tilt, the interaction zone is unable to cover the selected body points. The interaction zone can be designed as shown in Figure 6, where θ_base, θ_3 = 120°, θ_1, δ, and h_1 stand for the tilt angle of the sensor, the vertical angle of the 3D sensor, an angle, a bias angle, and the sensor's position in the chest area, determined by the height from the shoulder to the chest, respectively.

Software Unit
The software unit is divided into three parts, as illustrated schematically in Figure 7. First, the preprocessing step is performed, containing the data mining and analysis processes, such as receiving the 3D skeleton hand data from the 3D depth sensor and the 3D body input obtained from individual calibration, as described in Section 5.1. Second, the feature extraction process consists of four features: the spatial-temporal body parts and hand relationship patterns as the main feature, the spatial-temporal finger joint angle patterns, the spatial-temporal double-hand relationship patterns, and the spatial-temporal 3D hand motion trajectory patterns based on PCA, as reported in Section 5.2. Third, the classification method, specifically the two-layer BiLSTM neural network, is used to deal with time-independent data, as described in Section 5.3. Otherwise, the system returns to the start of the loop, with new finger joint positions being obtained by the 3D sensor.


Proposed Method
In this section, we present the preprocessing, feature extraction, and classification techniques, which are shown schematically in Figure 8.

Preprocessing Technique
The preprocessing technique consists of two parts. First, the 3D skeleton joint data received from the data acquisition device describe the 3D skeleton of the hand. Second, the calibration technique provides the 3D points of interest on the body, obtained by calibration using the tip of the index finger as a pointer.

3D Skeleton Joint Data
The 3D sensor is an optical hand-tracking module that captures the movements of the hands, such as their 3D position and direction. In the case of a single hand, the zero-padding technique [2] is used to replace the absent finger joints of the second hand, which are assumed to be zero. Then, the total number of finger joints is expressed in the t-th frame, as demonstrated in Equation (1).
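A minimal sketch of the zero-padding step is given below; the joint count per hand (6: five fingertips plus the palm) and the function name are assumptions made only for illustration.

```python
# Minimal zero-padding sketch for single-hand frames; shapes are illustrative.
import numpy as np

JOINTS_PER_HAND = 6  # assumed: five fingertips plus the palm

def pad_single_hand(frame_joints):
    """frame_joints: (n, 3) detected joint positions for one frame
    (n = 6 if only one hand is visible, 12 if both are).
    Returns (12, 3) with the absent hand's joints set to zero."""
    padded = np.zeros((2 * JOINTS_PER_HAND, 3))
    padded[:frame_joints.shape[0]] = frame_joints
    return padded
```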


Calibration Technique
The implementation of a calibration technique is an important step before using the system, and the calibration must be performed for each individual person. The result of this step is the three-dimensional positioning of the key points of the body, obtained by placing the tip of the index finger on the desired locations at the points of set B(t) and recording the fingertip's position using the Leap Motion sensor as a reference position point, as shown in Figure 2 and Equation (7).
where the new key-point positions, the pointed fingertip positions, B(t), and k = {1, . . . , 7} stand for the new 3D positions of the key points of the body, the 3D positions created by pointing the tip of the index finger at the desired locations of the points of set B(t) according to Equation (2), the set of the key body-point positions, and the total number of selected body points, respectively.
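The calibration routine can be sketched as follows; read_index_fingertip is a hypothetical callable wrapping the sensor API, not an actual function of the Leap Motion SDK.

```python
# Hedged sketch of the per-user calibration: the index fingertip is placed on
# each selected body point and its 3D position is recorded as that key point.
BODY_POINTS = ["forehead", "right_ear", "left_ear", "nose",
               "chin", "right_shoulder", "left_shoulder", "chest"]

def calibrate_body_points(read_index_fingertip):
    body_points = {}
    for name in BODY_POINTS:
        input(f"Place the index fingertip on the {name} and press Enter...")
        body_points[name] = read_index_fingertip()  # returns (x, y, z)
    return body_points
```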

Feature Extraction
In this section we use the spatial-temporal body parts and hand relationship patterns as the main feature to identify signed words with similar shapes, rotation, and movement. Moreover, in this study we have added more features to make the system more efficient because, in addition to solving the problems mentioned above, they can also be used with other samples [37], since in this work we have applied both single- and double-handed approaches. The case of double-handed signs involves the relationship between the left and the right hand. Therefore, we extract four features: the spatial-temporal body parts and hand relationship patterns, the spatial-temporal finger joint angle patterns, the spatial-temporal double-hand relationship patterns, and the spatial-temporal 3D hand motion trajectory patterns. These are proposed in Sections 5.2.1-5.2.4, respectively.

Spatial-Temporal Body Parts and Hand Relationship Patterns
The spatial-temporal body parts and hand relationship patterns are the feature that determines the relationship between the positions of the left and right hands and the selected points on the body, to solve the problem of words which have similar shapes and movements but different finger positions. Information on the relationship between both hands and the selected 3D position points on the body can be used to solve this problem by calculating the distance between each pair of 3D points, as shown in Figure 2. All distances are normalized to store a set of patterns in terms of spatial-temporal data, as shown in Equation (5).

Spatial-Temporal Finger Joint Angle Patterns
We have proposed the use of this feature due to the similar shapes used in sign words [2,12]. There are two kinds of features, as indicated in Figure 9. Firstly, the finger joint angles are used to find the angles between the finger joints of the same finger and the angles between two adjacent fingertips, determined as in Equation (8). This feature can be used to characterize the shape of the hand. Secondly, the pitch, yaw, and roll angles indicate the palm orientation, enabling us to obtain the pitch (ρ) (angle about the x-axis), the yaw (φ) (angle about the y-axis), and the roll (angle about the z-axis), as shown in Equations (9)-(11), respectively.
where θ is the angle between the vectors $\vec{M}$ and $\vec{N}$, which indicate the 3D finger joint positions.
where ρ, φ, the roll angle, A_X, A_Y, and A_Z are the pitch angle, the yaw angle, and the roll angle of the palm orientation, and the x-axis, y-axis, and z-axis components, respectively.
In the final step, the pitch, yaw, and roll angles of the palm joint and the angles of the consecutive joints in the same finger must be collected in a set, in which all data are normalized to store a set of patterns in terms of the spatial-temporal data in H_2(t), as shown in Equation (12).
where Norm, t = {1, . . . , T}, R, L, ρ(t), φ(t), and the roll angle stand for the min-max normalization, the total number of frames, the right hand, the left hand, the pitch angle, the yaw angle, and the roll angle, respectively.
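As an illustration of the H2 components, the sketch below computes a joint angle and a palm orientation; the arctan2 orientation formulae are a conventional assumption and may differ from the paper's exact Equations (9)-(11).

```python
# Illustrative sketch of the H2 feature components; formulae partly assumed.
import numpy as np

def joint_angle(m, n):
    """Angle (radians) between two 3D joint vectors, cf. Equation (8)."""
    cos = np.dot(m, n) / (np.linalg.norm(m) * np.linalg.norm(n) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def palm_orientation(a):
    """Pitch, yaw, and roll estimated from a palm direction/normal vector a."""
    ax, ay, az = a
    pitch = np.arctan2(ay, az)  # rotation about the x-axis
    yaw = np.arctan2(ax, az)    # rotation about the y-axis
    roll = np.arctan2(ay, ax)   # rotation about the z-axis
    return pitch, yaw, roll
```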

Spatial-Temporal Double-Hand Relationship Patterns
Due to the similar shapes of sign words [6], the spatial-temporal double-hand relationship patterns are proposed based on the 3D Euclidean distance, as demonstrated in Figure 10 and Equation (13). This feature describes the relationship between the left and right hands to solve the problem of words or letters signed with similar movements. Therefore, all 3D distances are collected in a set; then, the selected distance-based data are normalized to store the set H_3(t) in terms of the spatial-temporal data.
where the 3D distance, Norm, t = {1, . . . , T}, and n = {1, 2, 3, . . . , 20} stand for the distance between two points in xyz-space, the min-max normalization, the total number of frames, and the total number of points, respectively.
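A brief sketch of the H3 feature is given below, assuming that the 20 points of each hand are paired one-to-one between the left and right hands; the array shapes and function name are illustrative.

```python
# Hedged sketch of the H3 feature: distances between corresponding left- and
# right-hand joints, min-max normalized; the one-to-one pairing is assumed.
import numpy as np

def double_hand_distances(left_joints, right_joints):
    """left_joints, right_joints: (T, 20, 3) joint positions per frame.
    Returns (T, 20) normalized joint-to-joint distances (Equation (13))."""
    dist = np.linalg.norm(left_joints - right_joints, axis=-1)  # (T, 20)
    return (dist - dist.min()) / (dist.max() - dist.min() + 1e-8)
```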

Spatial-Temporal 3D Hand Motion Trajectory Patterns
Due to the similar shapes used in different signs [3], the concept of the spatial-temporal 3D hand motion trajectory patterns is presented to extract the movement trajectory of the finger joints, in which the 3D positions of the joints are collected in a set, as shown in Figure 11. Then, principal component analysis (PCA) [52] is used to reduce the dimensionality of the data set to one dimension. Finally, the selected data are normalized to store the set H_4(t) in terms of the spatial-temporal data, as shown in Equation (14).
where T_n(t), Norm, PCA, and t = {1, . . . , T} are the tip and palm positions in the time series of xyz-space, the min-max normalization, principal component analysis, and the total number of frames, respectively.
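The H4 feature can be sketched as follows, using scikit-learn's PCA to project each 3D trajectory onto its first principal component; the per-trajectory min-max normalization shown here is an assumption about how Equation (14) is applied.

```python
# Sketch of the H4 feature: PCA reduces a 3D trajectory to one dimension,
# followed by min-max normalization; grouping of the joints is assumed.
import numpy as np
from sklearn.decomposition import PCA

def trajectory_pattern(positions):
    """positions: (T, 3) trajectory of one tip/palm point over T frames.
    Returns a (T,) normalized 1D trajectory pattern."""
    proj = PCA(n_components=1).fit_transform(positions).ravel()
    return (proj - proj.min()) / (proj.max() - proj.min() + 1e-8)
```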
In the final stage of feature extraction, the four features consist of the spatial-temporal body parts and hand relationship patterns (H_1(t)), the spatial-temporal finger joint angle patterns (H_2(t)), the spatial-temporal double-hand relationship patterns (H_3(t)), and the spatial-temporal 3D hand motion trajectory patterns (H_4(t)). These features are concatenated into one-dimensional data by means of a concatenation technique, as shown in Equation (15). The final feature set, in terms of spatial-temporal patterns (X(t)), is the input of the stacked BiLSTM in the classification process.
where H_1(t), H_2(t), H_3(t), and H_4(t) stand for the spatial-temporal body parts and hand relationship patterns, the spatial-temporal finger joint angle patterns, the spatial-temporal double-hand relationship patterns, and the spatial-temporal 3D hand motion trajectory patterns, respectively; the remaining symbols denote the concatenation operator and the total number of frames, t = {1, . . . , T}.
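A minimal sketch of the frame-wise concatenation of Equation (15) is shown below; the feature dimensions are simply whatever the four extractors produce.

```python
# Minimal sketch of the frame-wise feature concatenation in Equation (15).
import numpy as np

def concatenate_features(h1, h2, h3, h4):
    """Each argument is a (T, d_i) array; the result X is (T, d1+d2+d3+d4)."""
    return np.concatenate([h1, h2, h3, h4], axis=1)
```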

Classification
A deep learning algorithm, a recurrent neural network (RNN), is applied for the analysis of data in the form of a serial sequence, such as a video (a series of images) or text (a sequence of words). However, RNNs exhibit a vanishing and exploding gradient problem [53], which results in poor performance when dealing with long sequences. Therefore, bidirectional long short-term memory (BiLSTM) [54] is used for solving long sequences because it is possible to choose which data should be remembered or eliminated. Recent experimental works [3,12,37,55] have demonstrated that the BiLSTM network outperformed various models, such as the standard CNN, SVM, RNN, ARMA, and ARIMA. The use of a single LSTM network for sign word recognition led to low accuracy and overfitting, especially when learning complex sign sequences. To address this problem, stacking more than one BiLSTM unit, as in [2,43], enhances performance in the recognition of sign words. Therefore, inspired by these works, we designed our BiLSTM architecture using two BiLSTM units, as shown in Figure 12. This two-unit BiLSTM architecture allows us to achieve the high-level sequential modeling of the selected features.
The structure of the designed two-layer BiLSTM model consists of the input, BiLSTM hidden, dropout, and classification layers, as shown in Figure 12. The input layer contains the time series data demonstrated in Equation (15). Each BiLSTM hidden layer consists of four main gates: the forget gate, the input gate, the input modulation gate, and the output gate. The forget gate (f_{t+1}) controls the flow of information to forget or keep the previous state (C_t). The input data (x_{n(t+1)}), the previous hidden state (h1_t), the bias (b_f), and the sigmoid function (σ) are used for making this decision, as shown in Equation (16). If the forget gate is set to 0, the previous state is forgotten; if it is set to 1, the previous state is kept. The input gate (i_{t+1}), computed from the input (x_{n(t+1)}), the sigmoid function (σ), and h1_t, decides which information should be passed to update the cell state, as demonstrated in Equation (17). Equation (18) expresses the cell state generated from the updated cell state in terms of the forget gate (f_{t+1}), the input gate (i_{t+1}), and the previous state (C_t). Then, the output gate (o_{t+1}) decides the value passed to the next sequence, h1_{t+1}, as shown in Equations (19) and (20). After that, a dropout layer is used to prevent overfitting by randomly turning off nodes in the network [56]. Based on previous work [57], dropout rates were compared and a value of 0.2 showed the best performance; therefore, a dropout value of 0.2 was applied in this study.
Lastly, the Softmax function is applied in the classification layer. The Softmax function is a generalization of logistic regression that normalizes an input vector into a probability distribution whose values lie in the range [0, 1].
where f_{t+1}, i_{t+1}, C_{t+1}, o_{t+1}, h1_{t+1}, x_{n(t+1)}, σ, h1_t, C_t, tanh, W ∈ R^{U×V}, and b ∈ R^V stand for the forget gate vector, the input gate vector, the cell input vector, the output gate vector, the hidden state vector, the input data, the sigmoid function, the previous hidden state vector, the previous cell state vector, the hyperbolic tangent function, the weight matrices (where the superscripts U and V refer to the number of input features and the number of hidden units), and the bias vector parameters (where the superscript V refers to the number of hidden units), respectively. Figure 13 shows the experimental setup used for training and testing the proposed idea. It consisted of a sensor that was set up in the chest area and a laptop computer used to collect the output from the sensor.
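For reference, a hedged Keras sketch of a two-layer BiLSTM classifier with the structure described above (two stacked BiLSTM units, a dropout of 0.2, and a Softmax classification layer) is given below; the hidden-unit count and optimizer are illustrative assumptions, since the paper's exact settings are listed in Table 4.

```python
# Hedged Keras sketch of the two-layer BiLSTM classifier; hyperparameters assumed.
from tensorflow.keras import layers, models

def build_two_layer_bilstm(num_frames, num_features, num_classes, hidden_units=128):
    model = models.Sequential([
        layers.Input(shape=(num_frames, num_features)),                  # X(t) sequences
        layers.Bidirectional(layers.LSTM(hidden_units, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(hidden_units)),                 # second BiLSTM unit
        layers.Dropout(0.2),                                             # dropout rate of 0.2
        layers.Dense(num_classes, activation="softmax"),                 # classification layer
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```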

Experiments
This section consists of three parts. First, the dataset subsection describes the datasets used in the experiments, their types, and the datasets from existing research that were used to compare the efficiency of the system. Second, the configuration parameters describe the parameter settings of the sign language recognition system and of the classification model. Finally, we present the evaluation of the classification model and give details regarding the system's validity.


Dataset
The American Sign Language datasets include signed words and signed letters that are currently used in sign language recognition. Most existing datasets use the forehand view, but in this study we propose a backhand approach; therefore, it was necessary to create a new dataset due to insufficient data. The created datasets are divided into two types, single-handed and double-handed signed words, with 36 words each, giving a total of 72 words. The datasets were collected from ten deaf and hard-of-hearing people, with each person performing each word ten times, so that the two types together contain 7200 samples. This method uses the backhand approach; therefore, the sensor must be set on the chest, and it can be used both day and night, as shown in Figure 2. In addition, this method was tested on signed letters (the letters A-Z) that exhibited the problem of similar signed letters, using the datasets from [3], which presented a backhand dataset of 5200 samples. In the same way, this method was also tested with the dataset of signed words from [12], which collected 40 double-hand dynamic ASL words, giving a total of 4000 samples of similar signed words. Therefore, the total dataset contained 16,400 samples, as listed in Table 2. The protocol of k-fold cross validation was applied with these datasets.
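A sketch of the k-fold protocol is shown below; stratified shuffling is an assumption, since the paper only states that k-fold cross validation was applied to these datasets.

```python
# Sketch of the k-fold cross-validation protocol; stratification is assumed.
from sklearn.model_selection import StratifiedKFold

def kfold_splits(features, labels, k=5):
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    return list(skf.split(features, labels))  # list of (train_idx, test_idx) pairs
```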

Configuration Parameter
The configuration parameters can be categorized into two parts. The first part, the hardware specifications, describes the computer system and the Leap Motion sensor [51], as given in Table 3. In the second part, the classification parameter settings are used to configure the parameters of the BiLSTM model, as demonstrated in Table 4.

Table 2. Summary of the datasets used in the experiments.
  36 single-hand ASL words (created by the authors): 10 signers, 10 repetitions, 3600 samples
  36 double-hand ASL words (created by the authors): 10 signers, 10 repetitions, 3600 samples
  26 signed letters (A-Z) from [3]: 10 signers, 20 repetitions, 5200 samples
  40 double-hand ASL words from [12]: 10 signers, 10 repetitions, 4000 samples
  Total samples: 16,400

Evaluation of the Classification Model
Accuracy [12] is described as a measure of correct predictions. Accuracy is given by Equation (21). The standard deviation is used to measure the amount of variation of a set of values, as shown in Equation (22). Moreover, the error, precision, recall, and F1-score [12] are applied for this model.
where SD, x̄, x_i, and n stand for the standard deviation, the mean of all samples, each sample, and the total number of samples, respectively.
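The reported metrics can be computed as in the following sketch, which uses scikit-learn for accuracy, precision, recall, and F1-score and NumPy for the standard deviation across folds; the macro averaging over classes is an assumption.

```python
# Sketch of the evaluation metrics; averaging choices are assumptions.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred, fold_accuracies):
    acc = accuracy_score(y_true, y_pred)                        # Equation (21)
    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    sd = np.std(fold_accuracies, ddof=1)                        # SD across folds, Equation (22)
    return acc, prec, rec, f1, sd
```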

Results
The performance comparison in the task of signed-letter recognition (letters A-Z) using the experimental data set in [3] is demonstrated in Table 5. The overall evaluation results were obtained using measures of accuracy, error, precision, recall, F1-score, and standard deviation (SD). The recognition results for the 26 signed letters obtained using the proposed method were compared with the conventional method presented in [37], a feature-based model using finger-motion-based features from the forehand view, and with the method using a trajectory-based backhand view [3]. The results demonstrated that our proposed features improved the accuracy rate in signed-letter recognition (letters A-Z) by about 1.27%. The other metrics under consideration were not available in [3,37]; thus, these columns contain no data.

Moreover, the performance comparison of the experimental results obtained for the 40 double-hand dynamic ASL words from [12] using our proposed method is shown in Table 6. Using the proposed method, the accuracy rate was improved by 0.54% compared to the conventional method [12] through the use of shape-, motion-, position-, and angle-based features. As shown in Table 6, the conventional method [12] achieved an accuracy of 97.98%, whereas our proposed method increased the accuracy by 0.54%, with an overall accuracy of 98.52%, an error of 1.48%, a precision of 98.56%, a recall of 98.52%, an F1-score of 98.54%, and an SD of 0.22%. The overall accuracy for the 72 American Sign Language (ASL) words was 96.99%. The performance comparison of the experimental results in the recognition of signed words (72 words, including single- and double-handed ASL words) is shown in Table 7. We performed an ablation test to demonstrate the significance of the proposed features over those used in previous works.

Ablation Test
We conducted ablation tests to determine the significance of each feature in the proposed models. The feature combination used in this method consists of the spatial-temporal body parts and hand relationship patterns (H1), the spatial-temporal finger joint angle patterns (H2), the spatial-temporal double-hand relationship patterns (H3), and the spatial-temporal 3D hand motion trajectory patterns (H4). For signed-letter recognition, we evaluated different feature combinations among all the features, as shown in Table 8. We evaluated the 1st (H2 + H3), 2nd (H2 + H3 + H4), and 3rd (H1 + H2 + H3 + H4) combinations, and found that the 3rd combination (H1 + H2 + H3 + H4) was the best combination of features for signed-letter recognition (letters A-Z), achieving an accuracy of 97.34%, a precision of 97.39%, a recall of 97.34%, an F1-score of 97.36%, and an SD of 0.26%. The results also showed that adding the H1 and H4 features improves the accuracy.

In our performance comparison, we conducted an ablation test for signed-word recognition and evaluated different feature combinations, as shown in Table 9. The feature combinations again consisted of the spatial-temporal body parts and hand relationship patterns (H1), the spatial-temporal finger joint angle patterns (H2), the spatial-temporal double-hand relationship patterns (H3), and the spatial-temporal 3D hand motion trajectory patterns (H4), in three different combination models: the 1st (H2 + H3), 2nd (H2 + H3 + H4), and 3rd (H1 + H2 + H3 + H4) combinations. The results showed that the proposed model provided the best performance, with an accuracy of 98.52%, a precision of 98.56%, a recall of 98.52%, an F1-score of 98.54%, and an SD of 0.22%.

The experimental results for the 72 sign words, including single- and double-handed words, based on five-fold cross-validation [58], one of the most commonly used model evaluation methods, are shown in Tables 10 and 11 and Figure 14. In Table 10, the group (G.) column indicates a pair of words with similar signs, and the error column indicates other misclassified words. For example, in Table 11, group (G.) 13 of the word "father" displayed an error in relation to the word "mother" of 2.6%. Overall, these results indicate that the system has high accuracy.

Discussion
In the case of using a single hand, there is some hand occlusion by the palm, which obstructs the view of the sensor, resulting in incorrectly predicted finger positions and missing positions when a finger is occluded. For example, the word "spit" in the third row causes occlusion by the palm, as shown in Figure 15. Unfortunately, the predicted finger joint position in the second row is similar to the position of the word "grandmother" in the first row, causing the predicted meaning to be incorrect. For future solutions, prior and post information may be used to predict the lost position of the finger.

In the second case of single- and double-hand words, some hand occlusion problems are caused by the fingers. For example, in Figure 16, the hand position for the word "are" shown in the third row is captured by the sensor with an incorrect predicted position, shown in the second row. Regrettably, this wrong position is similar to the word "true", shown in the first row, thus causing a misclassification problem. Moreover, in terms of double-hand words, Figure 17 shows an instance in which the erroneous position of the word "keep" in the second row is similar to the word "sister" in the first row. In addition, since the Leap Motion sensor is mounted on the chest, which limits the area in which the hands can be detected, there is a problem with some signed words, such as "introduce", for which the hand position sometimes falls outside the interaction zone. A solution to this problem is to enlarge the interaction zone by using two Leap Motion sensors.

Conclusions
When using a backhand approach, some signed words have a similar shape, rotation, and hand movement, but different hand positions; thus, detection systems suffer from misclassification errors, which results in lower accuracy. Therefore, in this study we propose the use of the spatial-temporal body parts and hand relationship patterns (ST-BHR) as the main feature, in which the set of the 3D positions of the finger joints and the set of the key positions of the body parts are applied, measuring the Euclidean distances based on a Cartesian product to derive a series of 3D distance-based features. Then, the bidirectional long short-term memory method is used as a classifier for time-independent data. The performance of the method was evaluated using 72 sign words from 10 participants, covering both single- and double-handed words, and the accuracy was found to be approximately 96.99%. The method was further evaluated on the 26 ASL sign letters and the 40 double-hand dynamic ASL word datasets, improving upon the conventional methods by 1.27% and 0.54%, respectively.