Ensemble Learning of Multiple Deep CNNs Using Accuracy-Based Weighted Voting for ASL Recognition

Abstract: More than 466 million people worldwide suffer from hearing loss. Recently, new CNNs and deep ensemble-learning technologies have brought promising opportunities to the image-recognition field, so many studies aiming to recognize American Sign Language (ASL) have been conducted to help these people express their thoughts. This paper proposes an ASL Recognition System using Multiple deep CNNs and accuracy-based weighted voting (ARS-MA), composed of three parts: data preprocessing, feature extraction, and classification. An ensemble of multiple deep CNNs based on LeNet, AlexNet, VGGNet, GoogleNet, and ResNet was set up for feature extraction, and its results were used to create three new datasets for classification. The proposed accuracy-based weighted voting (AWV) algorithm and four existing machine-learning algorithms were compared for the classification. Two parameters, α and λ, are introduced to increase the accuracy and reduce the testing time in AWV. The experimental results show that the proposed ARS-MA achieved 98.83% and 98.79% accuracy on the ASL Alphabet and ASLA datasets, respectively.


Introduction

Problem Statement
According to the World Health Organization (WHO), approximately 466 million people worldwide suffer from hearing loss, of whom 34 million are children. Moreover, an estimated 900 million people will suffer from hearing loss by 2050 [1]. Sign language is the most effective bridge between the hearing-impaired community and the outside world. Although hearing-impaired people also use written language to communicate, this is inefficient and inconvenient. Sign language uses gestures to articulate meaning in text or speech [2], and sign language recognition uses techniques to recognize these gestures. The emergence of sign language recognition has made the lives of hearing-impaired people more convenient and has eased communication, but most available solutions for sign language recognition are imperfect and inaccurate [3].
Expressions in American Sign Language (ASL) are simple, rich, and diverse [4], but complex backgrounds and the high similarity between gestures make recognition challenging. Building a reliable ASL letter recognition model is essential for improving communication, giving hearing-impaired people a tool to spell names and book titles and to correct letters.
Deep learning allows computational models with multiple processing layers to learn and represent data with multiple levels of abstraction, imitating the human brain mechanism and implicitly capturing the complex structures of large-scale data [5]. Deep learning has also enabled better solutions to hundreds of practical problems and has been widely used in natural language processing, human-computer interaction, and other fields. With the growth of data and the improvement of computing power, the problems of the lack of data and difficulties in training deep neural networks are gradually being solved. At present, computer vision based on deep learning has brought significant progress in image classification [6] and face recognition [7], and the appearance of convolutional neural networks (CNNs) [8] has greatly improved deep learning.
CNNs have different feature extraction structures but a single CNN has difficulty eliminating the influence of the background, so an ensemble of different CNNs is useful for accurate recognition by decreasing the background influence.

Literature Review
Some machine-learning methods have been used in sign language recognition. Halder et al. [9] proposed a static sign language recognition method based on Principal Component Analysis (PCA) and Support Vector Machine (SVM) to recognize the five vowels in English, with an accuracy of 80%. PCA and SVM were used for hand feature extraction and classification, respectively. Chuan et al. [10] used the Leap Motion sensor to collect the 26 letters of the ASL alphabet. The k-nearest neighbor algorithm and SVM were used for classification, with accuracies of 72.78% and 79.83%, respectively. Roy et al. [11] used skin-color detection and contour-extraction techniques to detect sign language in videos, and the Camshift and hidden Markov model (HMM) algorithms were used for hand tracking and classification, respectively. Ahmed et al. [12] proposed a skin-color-based detection method to binarize the input data to obtain the face and hand regions, and they calculated the similarity between training and testing data with the dynamic time warping (DTW) algorithm. The DTW algorithm did not use a statistical model framework for training because it was difficult to connect the semantic information of the context. Consequently, DTW had disadvantages when handling problems such as large data volumes and complex gestures. However, most machine-learning methods must convert the original data into feature vectors that are suitable for operation because the image information is too complex and diverse. Thus, the incomplete feature vectors limit the image recognition performance of machine-learning methods. Some deep-learning approaches have been used for sign language recognition. Hasan et al. [13] used a CNN to recognize ASL, achieving 97.62% accuracy. Pigou et al. [14] proposed a CNN-based sign language recognition method to recognize 20 words in the ChaLearn dataset. This model extracted hand and upper-body features from two sets of input data. Jing et al. [15] proposed a multi-channel, multi-modality framework based on a 3D-CNN. The multiple channels contained color, depth, and optical-flow information, and the multiple modalities included gestures, facial expressions, and body poses. Huang et al. [16] proposed a 3D-CNN network model based on multimodal input, including color, depth, and skeleton-joint point information. However, these methods are limited to the single-stream feature extraction portion, and obtaining sufficient feature information to distinguish similar gestures in a single-stream CNN is very difficult due to the high similarity of gestures in ASL.
The gated recurrent unit-relative sign transformer [17], the STMC-Transformer [18], and a full transformer network [19] have been successfully used in sign language recognition. The image frame features from sign language videos were extracted and combined into a standard encoder to boost gesture sequence attention and improve recognition performance. However, transformer methods are not appropriate for simple image recognition, such as ASL letters, because these techniques are designed to handle lengthy and complicated sign language videos, so much processing power is wasted.
Some researchers have focused on combining the advantages of different deep-learning models to improve image-recognition performance. Ye et al. [20] proposed a three-dimensional recurrent convolutional neural network (3DRCNN), which combined a 3D-CNN with a fully connected recurrent neural network (FC-RNN). The 3D-CNN learned from color, optical flow, and depth channels. The FC-RNN obtained timing information of sequence segmentation from the video. Yu et al. [21] used deep ensemble learning to decompose body poses automatically and perceive their background information. Zaidi et al. [22] proposed two methods for automatically constructing ensembles with various architectures, using different architectures to achieve feature diversity. These methods allow image features to be extracted more comprehensively. However, due to overly diverse feature information, the final network layer for decisions tended to ignore small parts of the features.
The self-mutual distillation learning-based system [23] focuses on both short-term and long-term information to enhance the discriminative power for better sign language recognition. A 3D ConvNet with a bidirectional long short-term memory system [24] improved sign language recognition performance through data extraction and time-series information. Excellent recognition performance for continuous gestures is achieved by a long short-term memory (LSTM) technique [25] with four different sequences. However, the above methods do not perform well for datasets in which most letters are not continuous gestures.
Transfer deep-learning methods [26,27] have been used to recognize ASL. The advantage of these methods is their ability to fine-tune the weights of advanced deep neural networks for image recognition, reducing the waste of computational resources while performing well. However, these methods have the slight limitation of incomplete gesture feature extraction. A 2D-CNN with the joint encoding [28] technique performed excellently but had high hardware requirements. A two-stream CNN [29] method used addition and concatenation operations to extend the feature maps and thus help the CNN better recognize gestures. This method is slightly insufficient in environments with complex backgrounds. Some researchers are studying ensemble models [30] to improve image-recognition performance. A trainable ensemble [31] feeds the outputs of individual models into the final decision and demonstrates that an ensemble can improve sign language recognition performance by learning the correlation of independent model prediction results. Another ensemble-learning method [32] uses various learning algorithms to generate recognition results based on multiple features. Better recognition performance is achieved through a voting scheme. By voting on different recognition results, the models complement each other's drawbacks, allowing different CNN models to maximize their performance advantages. This allows the CNNs to obtain various features in the feature extraction and reduces the possibility of losing information in the decision portion. Thus, ensemble learning brings new opportunities for better ASL recognition performance.

Contributions and Structure of the Paper
This paper designs an ASL recognition system for translation applications in the hearing-impaired community; the goal is to help hearing-impaired people communicate better with others. An ensemble-learning model for sign language recognition is proposed, using multiple CNNs with accuracy-based weighted voting (AWV) to increase ASL recognition performance. The contributions of the paper can be summarized as follows:

• The proposed model recognizes 29 gestures with accuracies of 98.83% for the ASL Alphabet dataset and 98.79% for the ASLA dataset with complex backgrounds.
The remainder of this paper is organized as follows: Section 2 introduces the details of the methods and models. Section 3 presents the results and a comparison of our method to other methods. Finally, Section 4 summarizes the conclusions.

Datasets and Image Preprocessing
This paper uses two datasets, the ASL Alphabet [33] and ASLA (American Sign Language Alphabet) [34]. The ASL Alphabet dataset includes 29 classes, comprising the 26 alphabetic characters A-Z and three other characters: space, delete, and nothing. Each class has 3000 images, each of which is 200 × 200 pixels in size. The ASLA dataset is similar to the ASL Alphabet except for the backgrounds, as shown in Figure 1. Each class has 7000 images, each of which is 400 × 400 pixels. The two datasets are split into 85% training images and 15% testing images.
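The 85%/15% split described above can be sketched as follows. This is a minimal illustration, not the authors' code; the image loader is a random-array stand-in, and `train_test_split` with stratification keeps the per-class ratio uniform:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real image loader: 29 classes with a few
# dummy grayscale images per class, just to illustrate the 85%/15% split.
rng = np.random.default_rng(0)
images = rng.random((290, 200, 200))      # 10 placeholder images per class
labels = np.repeat(np.arange(29), 10)     # class indices 0..28

# Stratified split so every class keeps the same train/test ratio
X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.15, stratify=labels, random_state=0)
print(len(X_train), len(X_test))
```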
The ASL Alphabet and ASLA datasets have motion, non-motion, and complex background properties, which bring challenges for ASL recognition. The images are preprocessed using gray data normalization, median filtering, reshaping, and label encoding to reduce the noise and environmental effects. Gray data normalization compresses all pixel data into the 0-1 interval and changes the image channel to one. Furthermore, it reduces the effects of light in the images and makes them scale invariant, meaning that the mean and variance are the same for all features. The median filter removes background noise from the image. The images are reshaped to 227 × 227 pixels for AlexNet and LeNet and to 224 × 224 pixels for GoogleNet, VGGNet, and ResNet50. Decimal labels are encoded into one-hot vectors by the label encoding method to conveniently compare the results in fully connected layers.
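The preprocessing pipeline above can be sketched in a few NumPy/SciPy lines. This is an illustrative sketch, not the authors' implementation: min-max scaling stands in for the paper's gray normalization, and a simple nearest-neighbour index stands in for the resizing step:

```python
import numpy as np
from scipy.ndimage import median_filter

def preprocess(img_rgb, size):
    """Sketch of the preprocessing: grayscale, 0-1 normalization,
    median filtering, and nearest-neighbour resizing to `size` x `size`."""
    gray = img_rgb.mean(axis=2)                                    # channels -> one
    gray = (gray - gray.min()) / (gray.max() - gray.min() + 1e-8)  # 0-1 interval
    gray = median_filter(gray, size=3)                             # remove noise
    idx = np.linspace(0, gray.shape[0] - 1, size).astype(int)      # resample grid
    return gray[np.ix_(idx, idx)]

def one_hot(label, n_classes=29):
    """Encode a decimal class label as a one-hot vector."""
    v = np.zeros(n_classes)
    v[label] = 1.0
    return v

img = np.random.default_rng(1).random((200, 200, 3))  # dummy 200x200 RGB image
x = preprocess(img, 227)                              # 227x227 for AlexNet/LeNet
print(x.shape, one_hot(3).argmax())                   # (227, 227) 3
```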

Proposed Model
An ensemble-learning method allows different CNN models to maximize feature extraction and minimize information loss. Deeper CNN models capture deeper semantic expressions from images, but they often ignore information extracted from lower dimensions, losing some image information in the feature extraction process. ARS-MA is proposed to combine the feature capture capabilities of different deep CNN models, as shown in Figure 2. It consists of three steps: data preprocessing, feature extraction and probability prediction by CNN models, and final gesture classification.
CNNs are used for feature extraction, and new datasets for different classifiers are created from the results of the multiple deep CNNs. The features extracted from CNNs of different depths increase the diversity of the information to improve recognition performance. The five CNN models are LeNet, AlexNet, GoogleNet, VGGNet, and ResNet, with different feature extraction modes and depths in Step 2, which improve the extraction ability and reduce incomplete semantic expressions. In addition, the five CNN models independently predict probabilities and labels to create new datasets for the next classifier step. The advantage of independent prediction is that it reduces the mutual influence between different depth information in the feature maps. Although it is relatively difficult for machine-learning algorithms to handle image classification tasks well, they have strong classification abilities for non-image data. SVM, Random Forest (RF), AdaBoost [35], soft voting [36], and the proposed AWV algorithm are used as the final classifier to recognize the ASL alphabet in Step 3.
By comparing the classifiers' recognition performance, the proposed AWV method was selected as the final classifier for the ARS-MA model as shown in the blue box of Figure 2.

New Datasets for Classifiers
Three new datasets (ND1, ND2, and ND3) are built with the results of the multiple deep CNNs for final recognition after they finish the predictions.
Dataset ND1 (P_CNN1, P_CNN2, P_CNN3, P_CNN4, P_CNN5, Label_CNN1, Label_CNN2, Label_CNN3, Label_CNN4, Label_CNN5) becomes the new input data for SVM, RF, and AdaBoost to obtain the final recognition results Result_SVM, Result_RF, and Result_Ada. P_CNN1, P_CNN2, P_CNN3, P_CNN4, and P_CNN5 are the prediction probabilities of the predicted labels Label_CNN1, Label_CNN2, Label_CNN3, Label_CNN4, and Label_CNN5, respectively, from the five CNN models. For example, Label_CNN1 is the predicted label from the LeNet model.
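The construction of ND1 can be sketched as below. This is an illustrative sketch under the assumption that each CNN's softmax output is available as an (n_samples, 29) array; `build_nd1` is a hypothetical helper name:

```python
import numpy as np

def build_nd1(cnn_probs):
    """cnn_probs: list of five (n_samples, 29) softmax outputs.
    Returns ND1 rows: [P_CNN1..P_CNN5, Label_CNN1..Label_CNN5]."""
    top_p = [p.max(axis=1) for p in cnn_probs]     # P_CNNi: highest class probability
    top_l = [p.argmax(axis=1) for p in cnn_probs]  # Label_CNNi: predicted class index
    return np.column_stack(top_p + top_l)

rng = np.random.default_rng(2)
# Dummy softmax outputs of the five CNNs for four test images
probs = [rng.dirichlet(np.ones(29), size=4) for _ in range(5)]
nd1 = build_nd1(probs)
print(nd1.shape)   # (4, 10): five probabilities plus five labels per sample
```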
Similarly, P_CNN1 is the set of probabilities of all test images for the predicted class in the LeNet model, calculated by Equation (1):

P_CNN1 = max(P_1,1, P_1,2, . . ., P_1,29), (1)

where P_1,1 to P_1,29 are the probabilities of the 29 gesture classes from the output of the LeNet model (CNN1). The max() function returns the largest value, so P_CNN1 is the largest value among P_1,1 to P_1,29. Dataset ND2 (P_1,1, P_1,2, . . ., P_i,j, . . ., P_5,28, P_5,29) is used for soft voting to obtain the final recognition result Result_SV.
Dataset ND3 (P_1,1, P_1,2, . . ., P_i,j, . . ., P_5,28, P_5,29, ACC_1,1, ACC_1,2, . . ., ACC_i,j, . . ., ACC_5,28, ACC_5,29) is used for the AWV algorithm to obtain the final recognition result Result_WDE. P_i,j and ACC_i,j refer to the predicted probability and the accuracy of the jth gesture class in the ith CNN. For example, ACC_1,1 is the accuracy obtained after finishing the training process for the LeNet model on Class A; all test images of Class A have the same ACC_1,1 in the next voting step, and P_1,1 is the probability of all test images for Class A in the LeNet model.

Classification and the Proposed AWV Algorithm
The final classifier in the ARS-MA model was selected from among five algorithms: SVM, RF, AdaBoost [35], soft voting, and the AWV algorithm. These obtain final results from the perspectives of four classification approaches: nonlinear mapping (SVM), an ensemble tree (RF), a weighted ensemble tree (AdaBoost), and voting (soft voting and the AWV algorithm).
The key of the SVM algorithm is to establish the maximum-margin hyperplane for classification, which gives good generalization ability. The RF algorithm creates new training samples, which are used to train several different decision trees. Then, the final result is aggregated from the various decision trees. Boosting is a machine-learning approach that aims to create a highly accurate model by combining many less accurate models. AdaBoost, the most widely used boosting algorithm, combines many decision trees and gives greater weights to the higher-accuracy tree classifiers. The new dataset ND1 is used for SVM, RF, and AdaBoost to fuse the CNNs.
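Fitting the three ND1-based fusion classifiers can be sketched with scikit-learn. This is a minimal sketch with random stand-in data, not the paper's training setup; hyperparameters are library defaults:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# Hypothetical ND1-style training data: five probabilities and five predicted
# labels per sample (stand-ins for real CNN outputs), with 29 target classes.
rng = np.random.default_rng(3)
X = np.column_stack([rng.random((200, 5)), rng.integers(0, 29, (200, 5))])
y = rng.integers(0, 29, 200)

# The three fusion classifiers compared in the paper, with default settings
classifiers = {
    "SVM": SVC(),
    "RF": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}
results = {name: clf.fit(X, y).predict(X) for name, clf in classifiers.items()}
print({name: preds.shape for name, preds in results.items()})
```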
However, a class result having a high probability from a CNN model does not mean that the class is correct. For example, AlexNet generally performs better than LeNet. When both are used in an ensemble together, LeNet may incorrectly predict Class A gestures as other classes while AlexNet accurately predicts them as Class A, yet the predicted probability of Class A in LeNet may coincidentally be higher than in AlexNet. Soft voting makes the final recognition decision using the gesture class with the highest prediction probability, so it only considers the probability and cannot help the ensemble model correct this kind of error. Therefore, a new method, Accuracy-based Weighted Voting (AWV), is proposed to also consider accuracy when voting for the gesture class.
The soft-voting algorithm only uses the probabilities for the final recognition of the ensemble model, but the AWV algorithm considers both the CNN prediction accuracies and the probabilities corresponding to each class, which allows the ensemble model to fuse the results of the CNNs more accurately. The new dataset ND3 is used for the AWV algorithm. The weights in the proposed AWV algorithm are defined in Equation (2):

w_i,j = λ · ACC_i,j^α, (2)

where w_i,j is the weight value of the jth class of the dataset in the ith CNN classifier; ACC_i,j is the recognition accuracy of the jth class in the ith CNN classifier; and α is an arbitrary number. The accuracy of each CNN is less than one, so raising ACC_i,j to the power α increases the difference between the w_i,j. λ is an arbitrary number used to reduce the testing time; its value is assigned by simulation to prevent the weights from becoming too large.
The output of the AWV algorithm is defined in Equation (3):

Result_WDE = argmax_j ( Σ_{i=1}^{5} w_i,j · P_i,j ), (3)

where P_i,j is the probability calculated for the jth class in the ith CNN classifier. Each CNN model calculates the probabilities for each class, and these probabilities are combined with the weights to compute a group of values using the formula Σ_{i=1}^{5} w_i,j · P_i,j. The class with the largest value is selected as the final recognition result.
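A compact sketch of the AWV vote for a single image, reconstructed from the definitions of Equations (2) and (3). The exact form of the weight formula is an assumption based on the surrounding text (w_i,j = λ · ACC_i,j^α), and the inputs are dummy arrays:

```python
import numpy as np

def awv_predict(probs, acc, alpha=2, lam=5):
    """Accuracy-based weighted voting, reconstructed from Equations (2)-(3).
    probs: (5, 29) per-class probabilities from the five CNNs for one image.
    acc:   (5, 29) per-class training accuracies ACC_{i,j} of each CNN."""
    w = lam * acc ** alpha            # Equation (2): accuracy-based weights
    scores = (w * probs).sum(axis=0)  # weighted vote summed over the five CNNs
    return int(scores.argmax())       # Equation (3): class with the largest value

rng = np.random.default_rng(4)
probs = rng.dirichlet(np.ones(29), size=5)  # dummy softmax outputs of 5 CNNs
acc = np.full((5, 29), 0.95)                # dummy per-class accuracies
cls = awv_predict(probs, acc)               # α = 2, λ = 5 as in the paper
print(0 <= cls < 29)                        # True
```

Note that with uniform accuracies the vote reduces to a plain probability sum; the accuracy weighting only changes the outcome when the CNNs differ in per-class reliability, which is exactly the LeNet/AlexNet scenario described above.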

CNN Algorithms
The main structures in LeNet [37], AlexNet [38], and VGGNet [39] are shown in Figure 4a-c. The inception structure of GoogleNet [40], whose width increases by branching and merging to improve the accuracy of the model, is shown in Figure 4d. The residual module of ResNet [41] in Figure 4e generates an output F(x) + x from an input x and reduces the gradient-vanishing problem, improving the model performance in the prediction task. Table 1 shows the structure of M-LeNet in detail. The size of the kernels is modified, and one convolution layer is removed to decrease the consumption time. Table 2 shows the structure of M-AlexNet. Batch normalization [42] is added after each convolution layer in M-AlexNet to reduce the impact of unstable gradients. Table 3 shows the structure of M-GoogleNet in detail, where MP indicates a max-pooling layer, the convolution stride is one, and there is one MP layer in the inception block. Table 4 shows the structure of M-VGGNet in detail, where the 3 × 3 convolution kernel and 2 × 2 max-pooling size are used in the entire network to improve the performance by continuously deepening the network structure. Batch normalization is added behind each convolution layer. The layer configuration in M-VGGNet is similar to LeNet and AlexNet. The purpose of using M-LeNet, M-AlexNet, and M-VGGNet is to help the ARS-MA model extract gesture semantic information at different depths with a similar layer configuration in the feature extraction step. Table 5 shows the structure of M-ResNet in detail, where the number of convolution layers is reduced to obtain better performance on the ASL Alphabet and ASLA datasets. M-GoogleNet, M-VGGNet, and M-ResNet are all high-depth models with different layer configurations. Hence, the ARS-MA model gains the ability to obtain more high-depth information in the feature extraction, because gesture semantic information is easily captured in high-depth feature maps [43] and deeper categories of feature information are obtained. In all models, dropout is used to reduce overfitting, the activation function in the convolution layers is the rectified linear unit (ReLU) function, and the softmax function is used for multi-class prediction.
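The three components shared by all M-CNNs (ReLU activation, dropout, and softmax output) can be illustrated with toy NumPy versions. These are pedagogical stand-ins, not the layers used in the paper's networks:

```python
import numpy as np

def relu(x):
    """Rectified linear unit: the activation in the convolution layers."""
    return np.maximum(0.0, x)

def dropout(x, p, rng):
    """Randomly silence units with probability p to curb overfitting;
    inverted scaling keeps the expected activation unchanged."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def softmax(z):
    """Numerically stable softmax for multi-class prediction."""
    e = np.exp(z - z.max())
    return e / e.sum()

h = dropout(relu(np.array([-1.0, 2.0, 3.0])), p=0.5, rng=np.random.default_rng(5))
p = softmax(np.array([2.0, 1.0, 0.1]))
print(p.argmax(), round(p.sum(), 6))   # 0 1.0
```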

Evaluation Methods
The performances of the five M-CNNs were evaluated after completing the training. The performance scores [44] were compared using the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), accuracy, recall, precision, and F1-score evaluation methods. The ROC curve is a tool to examine classifier performance. AUC is an important metric for model comparison calculated from the ROC. The ROC curve for a multi-class problem is obtained by averaging the ROC curves of each class.
The accuracy, recall, precision, and F1-score are calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
F1-score = 2 × Precision × Recall / (Precision + Recall)

where TP is true positive, FN is false negative, FP is false positive, and TN is true negative. These four evaluation methods are used to measure the effectiveness of the model.
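The four scores follow directly from the confusion counts. A minimal sketch with an illustrative set of counts (not results from the paper):

```python
def scores(tp, fn, fp, tn):
    """Accuracy, recall, precision, and F1 from the four confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

# Illustrative counts: accuracy 0.925, recall 0.9, precision ~0.947, F1 ~0.923
print(scores(90, 10, 5, 95))
```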

Results and Analysis
The performance results for the five M-CNN models are shown in Table 6. The accuracies for all M-CNNs are higher than 93%, which means that each M-CNN model effectively extracts useful features for gesture prediction in ASL letter images. The performance scores of the M-CNNs decrease on the ASLA dataset, which has complex background environments, but they are no more than 1% lower than on the ASL Alphabet dataset, which proves that the M-CNN models achieve almost the same results on both datasets. M-ResNet, with its residual module, had the best recognition performance of all five M-CNNs on both datasets because high-depth models with high-depth feature maps are well suited to extracting semantic information for prediction. The recognition scores of the five different classifiers in the classification step are shown in Table 7. The accuracy, precision, recall, and F1-scores range from 97.89% to 98.83%, 96.12% to 97.98%, 95.78% to 97.59%, and 95.95% to 97.60%, respectively. Comparing Tables 6 and 7 shows that the classification accuracies of the fusion classifiers are higher than those of a single CNN, proving that the proposed ensemble models using multiple deep CNNs are superior to single CNN models. The recognition accuracies for the two datasets in Table 7 are nearly the same, proving that the proposed ARS-MA effectively removes the noise of a complex background. The proposed AWV model was the most accurate, at 98.83% and 98.79% for the ASL Alphabet and ASLA datasets, respectively, so it was chosen as the classifier in the classification step.
The ROC curves and AUC values of the five M-CNNs for the two datasets are shown in Figure 5a,b, respectively; M-ResNet achieved the highest AUCs of 0.89 on the ASL Alphabet dataset and 0.88 on the ASLA dataset. The ROC curves and AUC values of the five classifiers for the two datasets are depicted in Figure 5c,d. The AWV algorithm achieved the highest AUCs, of 0.93 and 0.91, on the ASL Alphabet and ASLA datasets, respectively. The AUCs of all single M-CNNs were lower than that of AWV in the ARS-MA model, which proves that the proposed ARS-MA model works well for both datasets.
When w_i,j is calculated in Equation (2), two parameters must be assigned. α is related to the relative accuracies among the M-CNN models. Table 8 shows the accuracy of the ARS-MA model by α when λ is 5.
Table 9 shows the accuracy and average test time for an image in ARS-MA as λ varies from 1 to 9 when α is 2. An optimal value of 5 is assigned to λ because it reduces the training time, while the accuracy and average test time are nearly the same over the range of 3 to 7. The proposed ARS-MA model performed best when α = 2 and λ = 5, as shown in Table 7.
Each dataset has 27 non-motion gesture classes and two motion gesture classes (J and Z), with different background complexities. Because non-motion characters make up a much larger proportion of the data, the ARS-MA model prioritizes them over motion gestures. ARS-MA has accuracies of 98.98% and 96.88% for non-motion and motion gesture images, respectively. This means the accuracy for non-motion images is around 2% higher than for motion images, so a dedicated model for motion gestures is necessary for better performance.
Table 10 compares the proposed method to other methods for sign language recognition with motions or gestures in recent papers. The proposed ARS-MA model achieved better recognition accuracy on both datasets.

Conclusions
ASL letters are used as an auxiliary language for exceptional cases, such as names, book titles, and letter correction. Building a reliable ASL letter recognition model is essential to improving communication as a tool for hearing-impaired people.
In this paper, the ARS-MA model was proposed, consisting of data preprocessing, feature extraction, and classification. Five M-CNN models with different depths and feature capture methods were designed to combine feature extraction, and a novel accuracy-based weighted voting (AWV) algorithm was proposed to increase accuracy in the final classification step, where two parameters, α and λ, were introduced to improve the accuracy and reduce the consumption time. The best performance was achieved when α = 2 and λ = 5. Three new datasets, ND1, ND2, and ND3, were created from the results of the multiple deep CNNs for the classifiers. The two datasets used in this paper have two types of images: non-motion gestures and motion gestures (J and Z). The ARS-MA model obtained 98.83% and 98.79% accuracies for the ASL Alphabet and ASLA datasets, respectively. In addition, it has accuracies of 98.98% and 96.88% for non-motion and motion gesture images, respectively, with the accuracy for non-motion images around 2% higher than for motion images.
In the future, research on increasing the accuracy for motion gesture images will be conducted toward a real-time ASL recognition system, which will help hearing-impaired people communicate better.

Figure 1 .
Figure 1. Sample images from the ASL Alphabet and ASLA datasets.


Figure 2 .
Figure 2. The proposed model schematic.

Figure 3. The process of the proposed accuracy-based weighted voting (AWV) algorithm. In the AWV algorithm, the five probabilities for each sign-language class are predicted by five different CNN models, and the weight of each class is determined by the accuracy of the corresponding CNN model. The total number of weights is 5 × 29 in the ARS-MA model, which recognizes 29 gesture classes in the ASL Alphabet and ASLA datasets. AWV lets the CNN models complement each other's drawbacks after independent probability predictions. The execution process of the AWV algorithm with α = 2 and λ = 5 is shown in Algorithm 1. The proposed weighted voting is accuracy-based: a CNN model with higher recognition accuracy is given greater weight in AWV to improve the accuracy of the ARS-MA model.
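The voting scheme described in the caption can be sketched as follows. The paper's exact Equation (2) for w_{i,j} is not reproduced here, so this is a hedged approximation: the weight is assumed to be the model's per-class accuracy raised to the power α (so more accurate models dominate the vote), and λ is assumed to round probabilities to a fixed number of decimal places, matching its reported effect on test time. A reduced 3-model, 3-class example is used instead of the full 5 × 29 weight table.

```python
import numpy as np

def awv_predict(probs, acc, alpha=2, lambda_=5):
    """Accuracy-based weighted voting for one image (illustrative sketch).

    probs: (n_models, n_classes) probabilities, one row per CNN.
    acc:   (n_models, n_classes) per-class accuracy of each CNN.
    """
    probs = np.round(probs, lambda_)        # assumed role of lambda
    weights = acc ** alpha                  # assumed form of Equation (2)
    scores = (weights * probs).sum(axis=0)  # weighted vote per class
    return int(np.argmax(scores))           # final label via max()

# Toy per-class accuracies and predictions for three models, three classes.
acc = np.array([[0.95, 0.90, 0.85],
                [0.80, 0.97, 0.88],
                [0.90, 0.92, 0.99]])
probs = np.array([[0.2, 0.5, 0.3],
                  [0.1, 0.6, 0.3],
                  [0.3, 0.3, 0.4]])
print(awv_predict(probs, acc))  # 1
```

In the full model, `acc` would be a 5 × 29 table of validation accuracies, one weight per (CNN, class) pair, as stated in the caption.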

Figure 5. The ROC curves and AUC values: (a) M-CNN models on ASL Alphabet, (b) M-CNN models on ASLA, (c) classifiers on ASL Alphabet, and (d) classifiers on ASLA. To obtain the best model in the feature extraction step, three types of multiple deep CNNs were designed: CNN-3, CNN-4, and CNN-5. CNN-3 uses M-LeNet, M-AlexNet, and M-GoogleNet; CNN-4 adds M-VGGNet to CNN-3; and CNN-5 adds M-ResNet to CNN-4. Figure 6 shows the performance of the ARS-MA models with the three types of CNNs. It shows that increasing the diversity of the higher-depth CNN models (M-GoogleNet, M-VGGNet, and M-ResNet) in the feature extraction step improves recognition accuracy. The accuracies and AUC values of the five classifiers increase as the number of M-CNNs increases, as shown in Figure 6a,b. CNN-5 with the AWV algorithm performed better than the other types of CNN models on both datasets. When w_{i,j} is calculated in Equation (2), two parameters must be assigned. α is related to the relative accuracies among the M-CNN models. Table 8 shows the accuracy of the ARS-MA model by α when λ is 5. Table 9 shows the accuracy and average test time per image in ARS-MA as λ varies from 1 to 9 when α is 2. An optimal value of 5 is assigned to λ because it reduces the training time, even though the accuracy and average test time are nearly the same over the range of 3 to 7. The proposed ARS-MA model performed best when α = 2 and λ = 5, as shown in Table 7.
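The parameter selection summarized in Tables 8 and 9 amounts to a small grid search over α and λ. The sketch below illustrates that procedure; `evaluate` is a toy surrogate (an assumption, not the authors' code) whose maximum is placed at the reported optimum α = 2, λ = 5, standing in for running AWV on a held-out set at each setting.

```python
import itertools

def evaluate(alpha, lam):
    # Toy surrogate for held-out accuracy, peaking at alpha=2, lam=5 to
    # mimic the optimum reported in Tables 8 and 9.
    return 98.83 - 0.1 * abs(alpha - 2) - 0.05 * abs(lam - 5)

# Sweep alpha over 1..4 and lambda over 1..9, keeping the best pair.
best = max(itertools.product([1, 2, 3, 4], range(1, 10)),
           key=lambda p: evaluate(*p))
print(best)  # (2, 5)
```

In practice each `evaluate` call would also record the average test time per image, since λ trades accuracy against speed.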


Figure 6. The performance scores for three types of multiple deep CNNs in the ARS-MA model: (a) AUC and (b) accuracy.


Table 1. The architecture of M-LeNet5.

Table 2. The architecture of M-AlexNet.

Table 5. The architecture of M-ResNet.

Table 6. Performance evaluation of the five CNNs on both datasets.

Table 7. Performance evaluation of five classifiers on two datasets.

Table 8. Accuracies by α in the ARS-MA model.


Table 9. Accuracy and time/image by λ. Each dataset has 27 non-motion gesture images and two motion gesture images (J and Z), with different background complexity. Relative to motion gestures, non-motion gestures account for a significantly larger proportion, so the ARS-MA model prioritizes them. ARS-MA has accuracies of 98.98% and 96.88% for non-motion and motion gesture images, respectively.


Table 10. Comparison of the proposed work and previous works.