Multi‑Stroke Thai Finger‑Spelling Sign Language Recognition System with Deep Learning

Sign language is a type of language for the hearing impaired that the general public commonly does not understand. A sign language recognition system, therefore, represents an intermediary between the two sides. As a communication tool, a multi-stroke Thai finger-spelling sign language (TFSL) recognition system featuring deep learning was developed in this study. This research uses a vision-based technique on a complex background, with semantic segmentation performed with dilated convolution for hand segmentation, hand strokes separated using optical flow, and feature learning and classification done with a convolutional neural network (CNN). We then compared five CNN structures. The first used 64 filters of size 3 × 3 with 7 layers; the second used 128 filters of size 3 × 3 with 7 layers; the third used an ascending number of filters with 7 layers, all with an equal 3 × 3 filter size; the fourth used an ascending number of filters, with the filter size based on a small size, with 7 layers; the final structure was based on AlexNet. The average accuracies were 88.83%, 87.97%, 89.91%, 90.43%, and 92.03%, respectively. We implemented the CNN structure based on AlexNet to create models for the multi-stroke TFSL recognition system. The experiment was performed using isolated videos of 42 Thai alphabets, divided into three categories: one stroke, two strokes, and three strokes. The results showed an average accuracy of 88.00% for one stroke, 85.42% for two strokes, and 75.00% for three strokes.


Introduction
Hearing-impaired people around the world use sign language as their medium for communication. However, sign language is not universal, with one hundred varieties used around the world [1]. One of the most widely used types of sign language is American Sign Language (ASL), which is used in the United States, Canada, West Africa, and Southeast Asia and influences Thai Sign Language (TSL). Typically, global sign language is divided into two forms: gesture language and finger-spelling. Gesture language, or sign language, involves the use of hand gestures, facial expressions, and the mouth and nose to convey meanings and sentences. This type is used for everyday communication between deaf people, focusing on terms such as eating, ok, sleep, etc. Finger-spelling sign language is used for spelling the letters of the written language with one's fingers and is used to spell the names of people, places, animals, and objects.
ASL is the foundation of Thai finger-spelling sign language (TFSL). TFSL was invented in 1953 by Khunying Kamala Krairuek using American finger-spelling as a prototype to represent the 42 Thai consonants, 32 vowels, and 6 intonation marks [2]. All forty-two Thai letters can be presented with a combination of twenty-five hand gestures. For this purpose, the number signs are combined with alphabet signs to create additional
Table 1. Research on Thai finger-spelling sign language (TFSL) recognition systems.
| Researchers | No. of Signs | Method | Dataset | Accuracy (%) |
|---|---|---|---|---|
| Adhan and Pintavirooj [5] | 42 | Black glove with 6 sphere markers, geometric invariant and ANN | 1050 | 96.19 |
| Saengsri et al. [6] | 16 | Glove with motion tracking | N/A | 94.44 |
| Nakjai and Katanyukul [2] | 25 | CNN | 1375 | 91.26 |
| Chansri and Srinonchat [7] | 16 | HOG and ANN | 320 | 83.33 |
| Silanon [8] | 16 | HOG and ANN | 2100 | 78.00 |
The study of TFSL recognition systems for practical applications requires a variety of techniques, since TFSL has multi-stroke gestures and combines hand gestures to convey letters, and a practical system must also handle complex backgrounds and a wide range of lighting conditions.
Deep learning is a tool that is increasingly being used in sign language recognition [2,9,10], face recognition [11], object recognition [12], and others. This technology is used for solving complex problems such as object detection [13], image segmentation [14], and image recognition [12]. With this technique, feature learning is implemented instead of hand-crafted feature extraction. A convolutional neural network (CNN) is a feature-learning process that can be applied to recognition and can provide high performance. However, deep learning requires much data for training, and such data may consist of images with simple or complex backgrounds. In a picture with a simple background, the object and background colors are clearly different, so the image can be used as training data without cutting out the background. A complex background image, on the other hand, features objects and backgrounds that have similar colors. Cutting out the background yields input images for training that contain only the objects of interest, without distraction. This can be done using semantic segmentation, which applies deep learning to image segmentation. This method requires labeling the objects of interest to separate them from the background. Autonomous driving is one application of semantic segmentation, where it is used to identify objects on the road; the method can also be used for a wide range of other applications.
This study is based on the challenges of the TFSL recognition system. There are up to 42 Thai letters that are spelled with one's fingers, i.e., with a combination of multi-stroke sign language gestures. The gestures for several letters are similar, however, and a practical sign language recognition system must also operate under a complex background. There are many studies about TFSL recognition systems, but none of them addresses multi-stroke TFSL recognition using a vision-based technique with a complex background. The main contribution of this research is a new framework for a multi-stroke Thai finger-spelling sign language recognition system that works on videos under a complex background and a variety of light intensities. The framework separates people's hands from the complex background via semantic segmentation, detects changes in the strokes of hand gestures with optical flow, and learns features with a CNN under the structure of AlexNet. This system supports recognition that covers all 42 characters in TFSL.
The proposed method focuses on developing a multi-stroke TFSL recognition system with deep learning that can act as a communication medium between the hearing impaired and the general public. Semantic segmentation is then applied to hand segmentation for complex background images, and optical flow is used to separate the strokes of sign language. The processes of feature learning and classification use CNN.

Review of Related Literature
Research on sign language recognition systems has been developed in sequence with a variety of techniques to obtain a system that can be used in the real world. Deep learning is one of the most popular methods effectively applied to sign language recognition systems. Nakjai and Katanyukul [2] used CNN to develop a recognition system for TFSL with a black background image and compared 3-layer CNN, 6-layer CNN, and HOG, finding that the 3-layer CNN with 128 filters on every layer had an average precision (mAP) of 91.26%. Another study that applied deep learning to a sign language recognition system was published by Lean Karlo S. et al. [9], who developed an American sign language recognition system with CNN under a simple background. The dataset was divided into 3 groups: alphabet, number, and static word. The results show that alphabet recognition had an average accuracy of 90.04%, number recognition had an accuracy average of 93.44%, and static word recognition had an average accuracy of 97.52%. The total average of the system was 93.67%. Rahim et al. [15] applied deep learning to a non-touch sign word recognition system using hybrid segmentation and CNN feature fusion. This system used SVM to recognize sign language. The research results under a real-time environment provided an average accuracy of 97.28%.
Various sign language studies have aimed to develop vision-based systems using, e.g., image and video processing, object detection, image segmentation, and recognition systems. Vision-based sign language recognition systems can also be divided by the type of background: simple backgrounds [3,4,9,16] and complex backgrounds [17][18][19]. A simple background entails the use of a single color such as green, blue, or white. This makes hand segmentation easier, and the system can achieve a high level of recognition accuracy. Pariwat et al. [4] used sign language images on a blue background while the signer wore a black jacket in a TFSL recognition system with SVM on RBF, including global and local features. The average accuracy was 91.20%. Pariwat et al. [3] also used a simple background in a system that was developed by combining PHOG and local features with KNN. This combination enhanced the average accuracy to levels as high as 97.80%. Anh et al. [10] presented a Vietnamese language recognition system using deep learning in a video sequence format. A simple background, such as a white background, was used in training and testing, providing an accuracy of 95.83%. Using a simple background makes it easier to develop a sign language recognition system and provides high-accuracy results, but this method is not practical due to the complexity of the backgrounds in real-life situations.
Studying sign language recognition systems with complex backgrounds is challenging. Complex backgrounds consist of a variety of elements that are difficult to distinguish, which requires more complex methods. Chansri et al. [7] developed a Thai sign language recognition system using Microsoft Kinect to help locate a person's hand in a complex background. This study used the fusion of depth and color video, HOG features, and a neural network, resulting in accuracy of 84.05%. Ayman et al. [17] proposed the use of Microsoft Kinect for Arabic sign language recognition systems with complex backgrounds with histogram of oriented gradients-principal component analysis (HOG-PCA) and SVM. The resulting accuracy value was 99.2%. In summary, research on sign language recognition systems using vision-based techniques remains a challenge for real-world applications.

The Proposed Method
The proposed method is a multi-stroke TFSL recognition system covering 42 letters. The system consists of four parts: (1) creating a SegNet model via semantic segmentation using dilated convolutions; (2) creating a hand-segmented library with the SegNet model; (3) creating the TFSL model with CNN; and (4) classification, which uses optical flow for hand stroke detection and the CNN model for classification. Figure 1 shows an overview of the system.

Hand Segmentation Model Creation
Segmentation is one of the most important processes for recognition systems. In particular, sign language recognition systems require the separation of the hands from the background image. Complex backgrounds are a challenging task for hand segmentation because they feature a wide range of colors and lighting. Semantic segmentation has been effectively applied to complex background images and it is used to describe images at a pixel level with a class label [20]. For example, in an image containing people, trees, cars, and signs, semantic segmentation can find separate labels for the objects depicted in the image, and it is applied to autonomous vehicles, robotics, human-computer interactions, etc.
Dilated convolution is used in segmentation tasks and has consistently improved accuracy [21]. Dilated convolution provides a way to increase the receptive field (global view) of the network exponentially while the number of parameters grows only linearly. In general, dilated convolution applies the filter to the input with spacing defined by the dilation rate. For example, in Figure 2, a 1-dilated convolution (a) is the normal convolution with a 3 × 3 receptive field, a 2-dilated convolution (b) uses one-pixel spacing and gives a receptive field of 7 × 7, and a 4-dilated convolution (c) uses three-pixel spacing and gives a receptive field of 15 × 15. The red dots represent the points where the 3 × 3 filter is applied, and the green area is the receptive field of these inputs.
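The receptive-field sizes quoted above can be checked with a short calculation. The sketch below is illustrative only: it assumes stride-1 layers stacked in sequence (1-dilated, then 2-dilated, then 4-dilated), and the function names `receptive_fields` and `weights_per_channel` are hypothetical helpers, not part of the original system.

```python
def receptive_fields(kernel_size, dilations):
    """Receptive-field width after each stacked, stride-1 dilated convolution."""
    rf = 1  # a single pixel sees only itself
    sizes = []
    for d in dilations:
        # A k x k filter with dilation d spans (k - 1) * d + 1 pixels,
        # so each layer widens the receptive field by (k - 1) * d.
        rf += (kernel_size - 1) * d
        sizes.append(rf)
    return sizes


def weights_per_channel(kernel_size, num_layers):
    """Filter weights (per input/output channel pair) after each layer."""
    return [kernel_size * kernel_size * n for n in range(1, num_layers + 1)]


if __name__ == "__main__":
    print(receptive_fields(3, [1, 2, 4]))  # [3, 7, 15]
    print(weights_per_channel(3, 3))       # [9, 18, 27]
```

The receptive field roughly doubles with each layer (3, 7, 15), while the weight count grows by a fixed 9 per layer, which is the exponential-versus-linear behavior described in the text.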

This study employs semantic segmentation using dilated convolutions to create a hand-segmented library. The process for creating a segmented library is shown in Figure 3. First, image files and image labels for hand signs are created as training data. Second, a semantic segmentation network model (SegNet model) is created. The convolution part of the network uses three layers, namely dilated convolution, batch normalization, and rectified linear unit (ReLU), with the final layer used for softmax and pixel classification. Once training is completed, the resulting SegNet model is applied in the next step, the creation of the hand-segmented library.

Hand-Segmented Library Creation
Creating a hand-segmented library is done as preparation for training, with the input hand sign images used to create a library consisting of 25 gestures of TFSL. Each gesture has 5000 images for a total of 125,000 images. This process includes the following steps, as shown in Figure 4. The first step performs semantic segmentation by applying the SegNet model from the previous process to separate the hands from the background. The second step labels the position of the hand as a binary image. The third step applies a Gaussian blur to reduce image noise and removes small objects from the binary image, leaving only large regions. The fourth step sorts the regions by area in ascending order, selects the largest one (the area of the hand), and creates a bounding box at the position of the hand to frame only the features of interest. The last step crops the hand by applying the cropped binary image to the input image to keep only the part of interest and then resizes the image to 150 × 250 pixels. Examples of images from the hand-segmented library are shown in Figure 5.
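The region-selection and cropping steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the binary mask is a plain list of lists, regions are 4-connected, the Gaussian blur and the final resize are omitted, and the names `largest_region_bbox` and `crop` are hypothetical.

```python
from collections import deque


def largest_region_bbox(mask):
    """Bounding box (top, left, bottom, right) of the largest region of 1s."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    best = []  # pixels of the largest region found so far
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill collects one connected region.
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    if not best:
        return None  # no foreground pixels at all
    ys = [p[0] for p in best]
    xs = [p[1] for p in best]
    return min(ys), min(xs), max(ys), max(xs)


def crop(image, bbox):
    """Keep only the rows and columns inside the bounding box (inclusive)."""
    top, left, bottom, right = bbox
    return [row[left:right + 1] for row in image[top:bottom + 1]]


if __name__ == "__main__":
    mask = [
        [0, 0, 0, 0, 0, 1],  # isolated pixel: a small object to discard
        [0, 1, 1, 0, 0, 0],
        [0, 1, 1, 1, 0, 0],
        [0, 0, 1, 0, 0, 0],
    ]
    image = [[10 * r + c for c in range(6)] for r in range(4)]
    box = largest_region_bbox(mask)
    print(box)               # (1, 1, 3, 3)
    print(crop(image, box))  # [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
```

Selecting the largest region directly has the same effect as sorting all regions by area and taking the last one; the small isolated pixel is ignored because it never becomes the largest region.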

TFSL Model Creation
This process uses training data to learn features and create a CNN classification model, as shown in Figure 6. Training starts by feeding data from the hand-segmented library into the CNN, whose feature detection stage consists of multiple layers, each with convolution (Conv), ReLU, and pooling [22]. These three processes are repeated for each cycle of layers, with each layer learning to detect different features. Convolution is a layer that learns features from the input image. For convolution, small squares of input data are used to learn the image features while preserving the relationship between pixels. The output is produced mathematically as the dot product of the image matrix and the filter (kernel), as shown in Figure 7.
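To make the dot-product description concrete, the following sketch slides a small kernel over an image and sums the element-wise products at each position. It is a plain-Python illustration that assumes no padding and a stride of 1; the function name `conv2d_valid` and the example values are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product of the kernel with the image patch at (i, j).
            acc = 0
            for u in range(kh):
                for v in range(kw):
                    acc += image[i + u][j + v] * kernel[u][v]
            row.append(acc)
        output.append(row)
    return output


if __name__ == "__main__":
    image = [
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
    ]
    kernel = [
        [1, 0, 1],
        [0, 1, 0],
        [1, 0, 1],
    ]
    print(conv2d_valid(image, kernel))  # [[5, 0], [0, 5]]
```

The output is largest where the image patch matches the kernel pattern, which is how a learned filter responds to a particular local feature.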

ReLU is an activation function that allows an algorithm to work faster and more effectively. The function returns 0 if it receives any negative input, but for any positive value x, it returns that value, as presented in Figures 8 and 9. Thus, ReLU can be written as Equation (1):
$$f(x) = \max(0, x) \qquad (1)$$
Pooling resizes the data to a smaller size while keeping the important details of the input. It also reduces the amount of computation and helps mitigate overfitting. There are two types of pooling layers, max and average pooling, as illustrated in Figure 10.
The final layer of the CNN architecture is the classification layer, consisting of 2 sublayers: the fully connected (FC) and softmax layers. The fully connected layer uses the feature map matrix, in the form of a vector derived from the feature detection process, to create predictive models. The softmax function layer provides the classification output.
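The ReLU, pooling, and softmax steps described for the TFSL model can be illustrated with a few lines of plain Python. This is a sketch under assumptions, not the network used in the study: pooling windows are non-overlapping squares, and the names `relu`, `pool2d`, and `softmax` are illustrative.

```python
import math


def relu(x):
    """Return 0 for negative input and the input itself otherwise."""
    return max(0.0, x)


def pool2d(feature_map, size, mode="max"):
    """Non-overlapping pooling with a size x size window."""
    output = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + u][j + v]
                      for u in range(size) for v in range(size)]
            if mode == "max":
                row.append(max(window))
            else:  # average pooling
                row.append(sum(window) / len(window))
        output.append(row)
    return output


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    fmap = [
        [1, -2, 3, 0],
        [4, 5, -6, 7],
        [-1, 0, 2, 2],
        [3, 1, 0, -4],
    ]
    activated = [[relu(v) for v in row] for row in fmap]
    print(pool2d(activated, 2, "max"))      # [[5, 7], [3, 2]]
    print(pool2d(activated, 2, "average"))  # [[2.5, 2.5], [1.0, 1.0]]
    print(softmax([2.0, 1.0, 0.1]))
```

Max pooling keeps the strongest response in each window, while average pooling smooths it; both halve the width and height of the 4 × 4 example.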

Classification
In the classification process, video files showing isolated TFSL were used as the input. The video input files consisted of one stroke, two strokes, and three strokes. These files were taken with a digital camera and required the signer to sign with the right hand while wearing a black jacket and standing in front of a complex background. The whole process of testing involved the following steps: motion segmentation by optical flow, splitting the strokes of the hand gestures, hand segmentation via the SegNet model, classification, combining signs, and displaying the text, as shown in Figure 11.
Motion segmentation used Lucas and Kanade's optical flow calculation method, which tracks the whole image using the pyramid algorithm. Tracking starts at the top layer of the pyramid and runs down to the bottom. Motion segmentation was performed by optical flow using the orientation and magnitude of the flow vector. The orientation and magnitude of the optical flow were calculated frame by frame, considering 2D motion. In each frame, the angle of the motion at each point varied, depending on the magnitude and direction of the motion observed from the vector relative to the axes (x, y). The direction of motion can be obtained from Equation (2) [24]:
$$\theta = \tan^{-1}\left(\frac{y}{x}\right) \qquad (2)$$
where θ is the direction of vector v = [x, y]^T, x is coordinate x, and y is coordinate y, in the range of
$$-\frac{\pi}{2} + \pi\,\frac{b-1}{B} \le \theta < -\frac{\pi}{2} + \pi\,\frac{b}{B}$$
where b is the number of block and B is the amount of bins.
The magnitude of optical flow is the length of each vector between the previous frame and the current frame. The magnitude can be calculated from the equation below [24]:
$$m = \sqrt{x^{2} + y^{2}}$$
where m is the magnitude, x is coordinate x, and y is coordinate y. In this research, motion was distinguished by finding the mean of all magnitudes within each frame in order to split the hand signs. The mean of the magnitude was high when the hand signals were changed, as shown in Figure 12.
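The per-frame motion measure can be sketched as below. This is an illustration under assumptions rather than the authors' code: Equation (2) is taken to be the usual arctangent of y/x, the magnitude is taken to be the Euclidean length of the flow vector, the flow field is a plain list of (x, y) pairs, and the function names are hypothetical.

```python
import math


def direction(x, y):
    """Angle of the flow vector [x, y], in the range -pi/2 to pi/2."""
    if x == 0:
        # Vertical motion: avoid dividing by zero.
        if y == 0:
            return 0.0
        return math.copysign(math.pi / 2, y)
    return math.atan(y / x)


def magnitude(x, y):
    """Length of the flow vector [x, y]."""
    return math.hypot(x, y)


def mean_magnitude(flow):
    """Mean magnitude over all flow vectors of one frame."""
    return sum(magnitude(x, y) for x, y in flow) / len(flow)


if __name__ == "__main__":
    frame_flow = [(3.0, 4.0), (0.0, 0.0)]  # made-up flow vectors
    print(magnitude(3.0, 4.0))         # 5.0
    print(mean_magnitude(frame_flow))  # 2.5
    print(direction(1.0, 1.0))         # pi / 4
```

A frame in which the hand is moving between signs contains many long vectors and therefore has a high mean magnitude, which is the signal used for stroke splitting.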
Splitting the strokes of the hand was achieved by comparing the mean magnitude of each frame with a threshold value. Testing various threshold values showed that a magnitude of 1-3 corresponds to slight movement, whereas a value of 4 or more indicates a change in the stroke of the hand signal. The threshold value in this experiment was therefore set to 4. If the mean magnitude in a frame was lower than the threshold, then s = 0 (meaningful); otherwise, s = 1 (meaningless), as shown in Figure 13. After that, 10 frames between the start and the end of each s = 0 (meaningful) segment were taken to represent the hand sign.
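The thresholding rule above can be sketched as follows. The threshold of 4 and the 10 frames per sign come from the text; how the 10 frames are chosen within a segment is not specified, so the even spacing used here is an assumption, and the function names and sample values are hypothetical.

```python
def label_frames(mean_magnitudes, threshold=4.0):
    """s = 0 (meaningful) below the threshold, s = 1 (meaningless) otherwise."""
    return [0 if m < threshold else 1 for m in mean_magnitudes]


def meaningful_segments(labels):
    """Return (start, end) frame indices of each run of s = 0."""
    segments, start = [], None
    for i, s in enumerate(labels):
        if s == 0 and start is None:
            start = i  # a meaningful run begins
        elif s == 1 and start is not None:
            segments.append((start, i - 1))  # the run ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, len(labels) - 1))
    return segments


def pick_frames(segment, count=10):
    """Pick `count` frame indices from a segment (assumed evenly spaced)."""
    start, end = segment
    length = end - start + 1
    if length <= count:
        return list(range(start, end + 1))
    step = (length - 1) / (count - 1)
    return [start + round(k * step) for k in range(count)]


if __name__ == "__main__":
    mags = [1, 2, 1, 5, 6, 2, 1, 1, 4.5, 2]  # made-up mean magnitudes
    labels = label_frames(mags)
    print(labels)                       # [0, 0, 0, 1, 1, 0, 0, 0, 1, 0]
    print(meaningful_segments(labels))  # [(0, 2), (5, 7), (9, 9)]
    print(pick_frames((0, 18)))         # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Each meaningful segment corresponds to one held hand sign, so a two-stroke letter would ideally produce two long s = 0 segments separated by a burst of s = 1 frames.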
After extracting the image from the hand stroke separation, the next step was the hand segmentation process. This process involves the separation of hands from the background using the SegNet model, as illustrated in Figure 14.  Splitting the stroke of the hand was achieved by comparing each stroke with the threshold value. Testing the various threshold values, it was found that a magnitude value of 1-3 means slight movement, and a value equal to 4 or more indicates a change in the stroke of the hand signal. The threshold value in this experiment was considered as 4. If the mean value of magnitude in each frame was lower than the threshold value, then s = 0 (meaning), but if the value was more than the threshold value, then s = 1 (meaningless), as shown in Figure 13. After that, 10 frames between the start and the end of s = 0 (meaning) represent the hand sign. where m is the magnitude, x is coordinate x, and y is coordinate y. In this research, motion was distinguished by finding the mean of all magnitudes within each frame of the split hand sign. The mean of the magnitude was high when the hand signals were changed, as shown in Figure 12. Splitting the stroke of the hand was achieved by comparing each stroke with the threshold value. Testing the various threshold values, it was found that a magnitude value of 1-3 means slight movement, and a value equal to 4 or more indicates a change in the stroke of the hand signal. The threshold value in this experiment was considered as 4. If the mean value of magnitude in each frame was lower than the threshold value, then s = 0 (meaning), but if the value was more than the threshold value, then s = 1 (meaningless), as shown in Figure 13. After that, 10 frames between the start and the end of s = 0 (meaning) represent the hand sign.
After extracting the image from the hand stroke separation, the next step was the hand segmentation process. This process involves the separation of hands from the background using the SegNet model, as illustrated in Figure 14.  After extracting the image from the hand stroke separation, the next step was the hand segmentation process. This process involves the separation of hands from the background using the SegNet model, as illustrated in Figure 14.
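The stroke-splitting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-frame mean magnitudes are taken as already computed from optical flow, the function names are hypothetical, and the evenly spaced choice of 10 frames is an assumption, since the text does not say how the 10 frames are picked within an s = 0 interval.

```python
from typing import List, Tuple

THRESHOLD = 4.0        # from the text: a mean magnitude of 4 or more marks a stroke change
FRAMES_PER_SIGN = 10   # from the text: 10 frames represent each hand sign


def label_frames(mean_magnitudes: List[float], threshold: float = THRESHOLD) -> List[int]:
    """Label each frame: s = 0 (meaning) below the threshold, s = 1 (meaningless) otherwise."""
    return [0 if m < threshold else 1 for m in mean_magnitudes]


def stable_segments(labels: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for every run of s = 0."""
    segments = []
    start = None
    for i, s in enumerate(labels):
        if s == 0 and start is None:
            start = i                      # a run of s = 0 begins
        elif s == 1 and start is not None:
            segments.append((start, i))    # the run ends at the first s = 1 frame
            start = None
    if start is not None:
        segments.append((start, len(labels)))
    return segments


def pick_frames(segment: Tuple[int, int], count: int = FRAMES_PER_SIGN) -> List[int]:
    """Pick up to `count` evenly spaced frame indices (assumed sampling strategy)."""
    start, end = segment
    length = end - start
    if length <= count:
        return list(range(start, end))
    return [start + (i * length) // count for i in range(count)]


if __name__ == "__main__":
    # Hypothetical mean magnitudes: a held sign, a transition, then a second held sign.
    magnitudes = [1.0] * 12 + [6.0] * 3 + [2.0] * 20
    for seg in stable_segments(label_frames(magnitudes)):
        print(seg, pick_frames(seg))
```

Each s = 0 run yields one candidate stroke, so a two-stroke letter would produce two segments separated by a short run of s = 1 frames.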
Figure 14. Image of hand segmentation after splitting the strokes of the hand.
The next step was to classify the hand segmentation images with the CNN model, predicting the sign in each of the 10 frames of a hand sign. The system tallies the per-frame predictions, ranks the signs in descending order of votes, and selects the sign predicted most often. As shown in Figure 11, after classification the strokes are predicted as the K sign and the 3 sign. The signs are then combined and translated into Thai letters. The rules for the combination of signs are shown in Table 2.
Table 2. Combination rules for TFSL (columns: No., Rule, Alphabet).
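The frame-voting step described above can be illustrated with a short Python sketch. It assumes the CNN has already produced one label per frame; the function names and the `"K+3"` key format are hypothetical, and the actual mapping from stroke combinations to Thai letters (Table 2) is not reproduced here.

```python
from collections import Counter
from typing import List, Tuple


def vote(frame_predictions: List[str]) -> Tuple[str, List[Tuple[str, int]]]:
    """Return the most frequently predicted sign and the full ranking by vote count."""
    ranking = Counter(frame_predictions).most_common()  # sorted in descending order of votes
    return ranking[0][0], ranking


def stroke_key(signs: List[str]) -> str:
    """Join the per-stroke winners into a lookup key, e.g. ['K', '3'] -> 'K+3'."""
    return "+".join(signs)


if __name__ == "__main__":
    # Hypothetical per-frame predictions for two strokes of 10 frames each.
    stroke_1 = ["K"] * 7 + ["2"] * 3
    stroke_2 = ["3"] * 9 + ["W"]
    winners = [vote(stroke_1)[0], vote(stroke_2)[0]]
    print(stroke_key(winners))  # the key that would be looked up in the combination rules
```

Majority voting over 10 frames makes the per-stroke decision tolerant of a few misclassified frames, which matters for similar signs such as K and 2.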

Data Collection
Data collection for this research was divided into two types: images from 6 locations for training, and videos for testing at 3 locations (Library, Computer Lab2, and Office1), recorded from slightly different camera views. They were taken with a digital camera and required the signer to use his or her right hand to make a hand signal while standing in front of a complex background at the 6 locations presented in Figure 15.
The hand sign images used for training were divided into 2 groups: the training data for the SegNet model and the training data for the CNN model, with 25 signs in total. The training data for the SegNet model used original images and image labels, 320 × 240 pixels in size, totaling 10,000 images. The second group contained images that were hand-segmented via semantic segmentation and stored in the hand-segmented library. These images were 150 × 250 pixels in size. The training dataset contained 25 signs, as presented in Figure 16, each with 5000 images, for a total of 125,000 images.
The videos used in the testing process were isolated TFSL featuring 42 Thai letters. These videos were taken from a group of 4 hand signers, and each letter was recorded 5 times, totaling 840 files (42 × 4 × 5).

Experimental Results
The experiments in this study were divided into three parts. The first part measured performance using intersection over union (IoU); the second part determined the efficiency of the CNN models; and the last part tested the multi-stroke TFSL recognition system.

Intersection Over Union Areas (IoU)
IoU is a statistical measure of the consistency between a predicted bounding box and the ground truth. It is obtained by dividing the area of overlap between the prediction and the ground truth by the area of their union, as given in Equation (4) and shown in Figure 17 [18]. A high IoU (a value closer to 1) indicates a bounding box with high accuracy. A threshold of 0.5 determines whether the predicted bounding box is considered accurate, as shown in Figure 18. According to the standard of the Pascal Visual Object Classes Challenge 2007, acceptable IoU values must be greater than 0.5 [13].

Figure 18. IoU is evaluated as follows: the left IoU is 0.4034, which is poor; the middle IoU is 0.7330, which is good; and the highest IoU is 0.9264, which is excellent [13].
$$\mathrm{IoU} = \frac{\text{Area of overlap between bounding boxes}}{\text{Area of union between bounding boxes}} \quad (4)$$
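As a worked illustration of Equation (4), the following Python sketch computes the IoU of two axis-aligned boxes. The `(x1, y1, x2, y2)` corner format and the function name are assumptions made for this example; the 0.5 acceptance threshold is the one quoted in the text.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2), with x1 < x2 and y1 < y2


def iou(pred: Box, truth: Box) -> float:
    """Area of overlap divided by area of union for two axis-aligned boxes."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(pred[0], truth[0]), max(pred[1], truth[1])
    ix2, iy2 = min(pred[2], truth[2]), min(pred[3], truth[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    overlap = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_truth = (truth[2] - truth[0]) * (truth[3] - truth[1])
    union = area_pred + area_truth - overlap
    return overlap / union if union > 0 else 0.0


if __name__ == "__main__":
    score = iou((0, 0, 2, 2), (1, 1, 3, 3))  # overlap 1, union 7
    print(f"IoU = {score:.4f}, acceptable: {score > 0.5}")
```

The same ratio applies to segmentation masks by counting overlapping and united pixels instead of rectangle areas.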
The research used sign language gesture images on a complex background along with their label images, all of which were 320 × 240 pixels in size. A total of 10,000 images were used for semantic segmentation training with dilated convolutions. Based on tests of dilation rate configurations and a variety of convolutions, the structure is as follows. The network consists of five blocks of convolution. Each block includes 32 convolution filters (3 × 3 in size) with different dilation factors, batch normalization, and ReLU. The dilations are 1, 2, 4, 8, 16, and 1, respectively. The training options were MaxEpochs = 500 and MiniBatchSize = 64. According to the IoU performance assessment, the average IoU was 0.8972, which is excellent.
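To show why increasing dilation factors are useful, the sketch below computes the receptive field of a stack of 3 × 3 convolutions with the dilations listed above. It assumes stride 1, no pooling, and one 3 × 3 convolution per listed dilation; these are simplifying assumptions for illustration, not a description of the exact network used in the study.

```python
from typing import Iterable


def receptive_field(dilations: Iterable[int], kernel_size: int = 3) -> int:
    """Receptive field (in pixels per side) of stacked stride-1 dilated convolutions."""
    field = 1
    for d in dilations:
        # A k x k kernel with dilation d spans (k - 1) * d + 1 pixels,
        # so each layer widens the receptive field by (k - 1) * d.
        field += (kernel_size - 1) * d
    return field


if __name__ == "__main__":
    dilated = receptive_field([1, 2, 4, 8, 16, 1])
    plain = receptive_field([1, 1, 1, 1, 1, 1])
    print(f"dilated stack: {dilated} x {dilated}, undilated stack: {plain} x {plain}")
```

Under these assumptions the dilated stack sees a 65 × 65 window, compared with 13 × 13 for the same number of undilated layers, without adding parameters or reducing resolution.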

Experiments on the CNN Models
Experiments to determine the effectiveness of the CNN models used the 5-fold cross validation method. The input consisted of 125,000 hand sign images from the hand-segmented library, covering 25 finger-spelling signs with 5000 images each. This dataset was divided into 5 groups, each with 25,000 images. Each round of 5-fold cross validation used 100,000 images for training and 25,000 images for testing. This experiment compared the effectiveness of five CNN structures that differ in their configuration, number of filters, and filter sizes.
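The 5-fold partition described above can be sketched as follows. This is a minimal illustration with a hypothetical helper name; it splits contiguous index ranges, whereas a real experiment would normally shuffle or stratify the images by sign first, which the text does not specify.

```python
from typing import Iterator, List, Tuple


def kfold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], range]]:
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds take one extra sample when n is not divisible by k.
        size = fold_size + (1 if fold < remainder else 0)
        test = range(start, start + size)
        train = [i for i in range(n_samples) if i not in test]
        yield train, test
        start += size


if __name__ == "__main__":
    for round_no, (train, test) in enumerate(kfold_indices(125_000, 5), start=1):
        print(f"round {round_no}: {len(train)} training images, {len(test)} testing images")
```

Every image is used for testing exactly once across the five rounds, so the reported accuracy is an average over all 125,000 images.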

The first format used 7 layers with 64 filters each and a 3 × 3 filter size. The second format used 7 layers with 128 filters each and a 3 × 3 filter size. The third format used 7 layers with the number of filters in ascending order (2, 5, 10, 20, 40, 80, and 160, respectively), all with a 3 × 3 filter size. The fourth format used 7 layers with the number of filters in ascending order and the filter sizes decreasing from large to small (number of filters/filter size: 4/11 × 11, 8/5 × 5, 16/5 × 5, 32/3 × 3, 64/3 × 3, 128/3 × 3, 256/3 × 3). The final format was a structure based on AlexNet, which defined the numbers and sizes of the filters as follows (number of filters/filter size): 96/11 × 11, 256/5 × 5, 384/3 × 3, 384/3 × 3, 256/3 × 3, 256/3 × 3 [26]. The structure details are shown in Figure 19.
The results of the 5-fold cross validation are shown in Table 3. The structure based on AlexNet achieved the highest average accuracy, 92.03%. The fourth format, with 7 layers, an ascending number of filters, and filter sizes decreasing from large to small, offered an average accuracy of 90.04%. The third format, which uses 7 layers with an ascending number of filters, followed with an average accuracy of 89.91%. The first and second formats, which use a fixed number of filters in all 7 layers (64 and 128, respectively), had average accuracies of 88.23% and 87.97%. Because the fifth format provided the highest accuracy of the five, we used it to train the model for the multi-stroke TFSL recognition system, with the 125,000 images divided into 112,500 training images (90%) and 12,500 testing images (10%). The accuracy of this model was 99.98%.

The Experiment of the Multi-Stroke TFSL Recognition System
This recognition system focuses on user-independent testing with isolated TFSL videos. Each letter was tested 20 times. The experiment was divided into three groups: the one-stroke results are shown in Table 4, the two-stroke results in Table 5, and the three-stroke results in Table 6.
In terms of performance, the one-stroke TFSL recognition system shown in Table 4 had an average accuracy of 88.00%. According to the results, the system had difficulty discriminating between very similar sign language gestures in three groups: (a) closed-hand gestures that differ in the position of the thumb, used for the letters T, S, M, N, and A, which lowered the accuracy of the letter N to 55%; (b) two-finger gestures made with the index finger and middle finger that differ slightly in the position of the thumb, consisting of the letter K and the number 2, which lowered the accuracy of the letter K to 65%; and (c) gestures similar to raising one finger: the R gesture involves crossing the index finger and ring finger, which is similar to the gesture for the number 1, made by holding up one index finger. This resulted in an accuracy of 75% for the letter R, as shown in Figure 20. These three groups of similar gestures lowered the average accuracy.
Figure 20. A group of one-stroke TFSL signs with many similarities.
Table 5 shows the results of the two-stroke TFSL recognition system. Groups of very similar signs, such as T, S, M, N, and A, reduced the accuracy of two-stroke recognition: T + 2, S + 2, and S + Q had an accuracy of 75%, N + G had an accuracy of 60%, and S + 1 and N + 1 had an accuracy of 55%. The gesture for the number 2 was similar to the K gesture, which affected K + 2, with an accuracy of 80%. The similarity between sign language gestures affected the average accuracy of the two-stroke recognition system, which was 85.42%. For the two-stroke system, we also measured the error rate of stroke detection. Sign language gestures with slight hand movements, such as K + 2 and S + 1, had a high error rate of 20%, affecting the overall accuracy of the system. The average error rate was 3.54%.
Three-stroke TFSL combines three sign language gestures to display one letter. Three-stroke TFSL covers three letters: T + H + 1, C + H + 1, and C + H + 2. The results showed that the accuracy of T + H + 1 was 85%, that of C + H + 2 was 75%, and that of C + H + 1 was 65%, indicating low accuracy. This low accuracy had several causes, such as faulty stroke detection due to the use of multiple strokes and similarities between signs. The average overall accuracy was 75%. Because the three strokes involve distinct posture movements, the stroke detection error rate was small, averaging 5%, as shown in Table 6.
Table 7 shows that the multi-stroke TFSL recognition system has an overall average accuracy of 85.60%. The experiment used training images for CNN modeling and tested the system with multi-stroke isolated TFSL videos. One-stroke signs were tested 300 times and were correct 264 times and incorrect 36 times, for an average accuracy of 88%. Two-stroke signs were tested 480 times and were correct 410 times and incorrect 70 times, for an average accuracy of 85.42%. Three-stroke signs were tested 60 times and were correct 45 times and incorrect 15 times, for an average accuracy of 75%. The overall stroke detection error rate of the system averaged 3.70%. The results show that sign language gestures with little movement affected stroke detection and caused misclassification.
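The overall figure follows directly from the per-group counts quoted above, as this short Python check shows. The dictionary layout and function name are illustrative choices; the counts themselves are taken from the text.

```python
from typing import Dict, Tuple

# (correct, total) test counts per group, as reported in the text.
RESULTS: Dict[str, Tuple[int, int]] = {
    "one-stroke": (264, 300),
    "two-stroke": (410, 480),
    "three-stroke": (45, 60),
}


def accuracy(correct: int, total: int) -> float:
    """Accuracy as a percentage."""
    return 100.0 * correct / total


if __name__ == "__main__":
    for group, (correct, total) in RESULTS.items():
        print(f"{group}: {accuracy(correct, total):.2f}%")
    all_correct = sum(c for c, _ in RESULTS.values())
    all_total = sum(t for _, t in RESULTS.values())
    print(f"overall: {accuracy(all_correct, all_total):.2f}% of {all_total} tests")
```

The overall accuracy is weighted by the number of tests in each group, so the large two-stroke group influences it more than the three-stroke group.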

Conclusions
This research focused on the development of a multi-stroke TFSL recognition system that works on complex backgrounds using deep learning. The study compared five CNN structures. The first and second formats are static CNN architectures in which the number and size of the filters are the same for all layers; they provided the lowest accuracy. The third format uses an ascending number of filters with the same filter size in every layer, which increased accuracy. The fourth format also uses an ascending number of filters, but its first layer uses large filters to learn global features, whereas the following layers use smaller filters to learn local features. This resulted in higher accuracy. The fifth format is a mixed architectural style based on the AlexNet structure. The number of filters increases over its first and second convolution layers, the third and fourth layers use a fixed number of filters that is higher than in the second layer, and the fifth and sixth layers use a fixed number of filters that is lower than in the two previous layers. As in the fourth format, a large filter is used in the first layer to learn global features, while smaller filters are used in the following layers to learn local features. The results show that a mixed architecture with global feature learning followed by local feature learning gives the best accuracy.
By using the fifth format to train the system, the overall results indicate that the factors affecting the system's average accuracy include (1) very similar sign language gestures, which negatively affect classification and lower the average accuracy; (2) finger-spelling gestures with little movement, which affect the detection of multiple strokes and cause faulty stroke detection; and (3) multi-stroke finger-spelling, in which too much movement can sometimes be detected as a stroke change between gestures, resulting in recognition errors. A solution could be to improve the motion detection system so that the strokes are separated more accurately, or to apply a long short-term memory (LSTM) network to enhance the recognition system's accuracy.
In conclusion, the study results demonstrate that similarities in the gestures and strokes when finger-spelling Thai sign language caused decreases in accuracy. For future studies, further TFSL recognition system development is needed.