Compact Spatial Pyramid Pooling Deep Convolutional Neural Network Based Hand Gestures Decoder

Abstract: Current deep learning convolutional neural network (DCNN)-based hand gesture detectors with high precision demand incredibly high-performance computing power. Although DCNN-based detectors are capable of accurate classification, the sheer computing power needed for this form of classification makes it very difficult to run with lower computational power in remote environments. Moreover, classical DCNN architectures have a fixed number of input dimensions, which forces preprocessing and thus makes them impractical for real-world applications. In this research, a practical DCNN with an optimized architecture is proposed with DCNN filter/node pruning, and spatial pyramid pooling (SPP) is introduced in order to make the model input dimension-invariant. This compact SPP-DCNN module uses 65% fewer parameters than traditional classifiers and operates almost 3× faster than classical models. Moreover, the proposed algorithm, which decodes gestures or sign language finger-spelling from videos, achieved benchmark-leading accuracy with the fastest processing speed. This proposed method paves the way for various practical and applied hand gesture input-based human-computer interaction (HCI) applications.


Introduction
Hand gesture recognition is a key part of human communication. Such gestures were the primary means of communication in the prehistoric age [1]. In modern days, hand gestures are still useful, for example, in the case of difficulties with oral communication or human-computer interactions (HCIs). Today, visual experience plays a significant part in HCIs. An interactive and properly gesture-classifiable computer program can properly recognize human behavior as an input for processing. The use of sign language as a gesture-based, as opposed to voice-based, means of communication with another medium makes HCI tremendously promising. The most natural form of gesture-based communication is sign language. Hearing-impaired individuals and people with speech impediments use sign language in their everyday lives. American Sign Language (ASL), which is the most popular form of sign language in the world, is derived from the old French hand symbols and used throughout the continent of North America. Sign language, first developed two centuries ago, is now widely recognized throughout the world. Visual perception plays a vital role in HCI [2]. HCI bridges the gap between machines and humans, allowing them to communicate with each other. Hence, naturally, hand gesture recognition has been the subject of a significant amount of research in both the communication and machine learning domains. The main idea in the machine learning domain for HCI is to properly classify and detect human gestures and improve the accuracy of the classifier.
The main drawback to implementing any kind of machine learning or computer vision-based gesture classification involves the detection and classification of different hand gestures. Theoretically, there are a limited number of hand gestures, and mapping those to the almost infinite number of possible expressions is close to impossible. For this reason, most computer vision-based research solely focuses on classification of a limited number of gestures. On the other hand, HCI studies concern the development of deep learning-based classification and hand gesture detection systems. Deep convolutional neural networks (DCNNs) have recently gained popularity because of their superior ability to recognize visual data. Utilizing DCNNs for object classification is not a recent advancement. Much research has been done, solely focusing on the DCNN-based classification of hand gestures.
DCNNs are widely used machine learning algorithms that automate feature extraction and classification without the need for external selective feature extraction methods. This is done by the two (2) main steps executed by a DCNN algorithm. The convolutional layers in a DCNN utilize convolutional operations in order to extract the important features of input images. Subsequently the fully connected or dense layer of the DCNN classifies the features and produces a prediction of the given input images. The use of a DCNN eliminates the need for human or machine-based feature selection criteria and improves the overall accuracy of classification performance.
However, DCNNs have certain fundamental limitations for hand gesture classification. By definition, the input image that is presented to the convolutional layers of a DCNN must have a fixed number of dimensions. This creates problems that are related to resizing or cropping of the input images and results in reduced accuracy and mislabelling due to data loss. Moreover, most research on DCNNs has solely focused on achieving accuracy. Thus, DCNN layers are increasing in size, number, and complexity over time. This means that practical applications of DCNNs are extremely power and computational resource-intensive (in terms of CPU, GPU, RAM, etc.).
Addressing high computational costs and optimizing neural networks is a well-researched topic. Many DCNN optimization techniques have been presented in recent years. DCNN node pruning is one such optimization technique, first suggested by LeCun et al. [3]. Neural node and convolution filter pruning is the process of eliminating neural network nodes and filters based on rational or mathematical comparisons. Although this process is not new, applying it to modern DCNNs is a comparatively new development. Spatial pyramid pooling (SPP) is a feature-gathering method, applied after the convolutional layers of a DCNN, that eliminates the need for fixed-dimensional input. He et al. first described this process for a convolutional neural network [4]. To the best of our knowledge, the combination of both of these techniques has not yet been studied.
This research addresses the ideas of DCNN node pruning and the introduction of an SPP layer in the DCNN, and uses several metrics to evaluate the performance of the resulting model. In addition, a new algorithm that decodes hand gestures in real time for practical uses is proposed and analyzed. The rest of the article is organized as follows. Section 2 describes previous work done in this field that inspired this research. Section 3 discusses the fundamentals of DCNN node pruning and SPP. Our proposed method is presented in Section 4. Section 5 describes the dataset that was used to train and test the proposed method. Section 6 concerns the evaluation metrics that were used to validate and test the results presented in Section 7. We compare the proposed system with other notable work in Section 8 and, finally, provide a summary of this work and future directions in Section 9.

Related Works
In recent decades, many researchers have investigated different approaches that use touch and non-touch gestures in various real-world practical applications. The practicality of manufacturing such systems has been the subject of many experiments in the last few years. Because of the nature of hand gestures and the specifics of machine learning research, discussing all of this work is a gargantuan task. Accordingly, this section will briefly discuss the original research on machine learning hand gesture classification, followed by the application of DCNNs and the potential drawbacks of such methods.
Earlier research into both machine learning and computer vision-based ASL recognition relied heavily on traditional feature selection methods. These methods involved processing input images through various transformations and then selecting the best features for classifying the ASL symbol represented in the image. Vogler et al. presented notable work that utilized the above-mentioned methods. Their work used an adaptive hidden Markov model (HMM) and three-dimensional (motion) analysis for recognition [5,6]. These types of studies are usually expensive, as they require additional motion capture instrumentation that is both time- and resource-intensive. The development of a more practical setup to overcome such limitations must follow a vision learning-based approach.
A variety of approaches to computer vision-based ASL detection have been developed [7]. These include both traditional machine learning- and deep learning-based approaches. Notable work in traditional machine learning approaches includes singular value decomposition (SVD) [8], wavelet transformation via discrete wavelet transformation (DWT) [9], geometric features, and local binary patterns (LBP) [10]. These methods rely heavily on feature extraction from input images while using a variety of classifiers, including but not limited to ensemble classifiers that are based on support vector machines (SVMs) [11], artificial neural networks (ANNs) [12], and linear regression and genetic algorithms (GAs) [13]. These methods produced a significant improvement in both image- and sensor-based input processing. Hand gesture image feature extraction has many applications, some of which rely on Gabor filters [14]. This research follows a common theme of input data preprocessing and feature extraction systems. Although much of this research reported high accuracy, the preprocessing and real-time processing requirements for input data extraction make these systems very impractical to deploy as real-time gesture recognition applications.
In the middle of the 20th century, Hubel and Wiesel explained the biological underpinnings of visual sensory processing in the mammalian context [15]. In essence, they showed that a small amount of edge detection can lead to superior image detection. This research served as a precursor to DCNN-based research in computer vision. LeCun et al. [16] were the first to successfully demonstrate convolutional neural network training with backpropagation. Krizhevsky et al. [17], in work that won the 2012 Imagenet competition, opened the door to the use of DCNNs in different image detection applications. Their work popularized the deep feed-forward artificial neural network, which is now the most popular visual image analysis technique. It has been used with exceptional precision in a number of contexts in different models of object recognition and classification [18].
A great deal of previous work has been focused on DCNN-based sign recognition [19,20]. Some studies used a sensor-only approach [21]. An optimized form of Microsoft's Kinect device [22,23], which is primarily a gesture feedback interface for gaming, can be used for gesture classification input data collection [24]. The classic DCNN approach to image-based hand gesture classification has also been studied, with a sole focus on accuracy [25]. This has resulted in some extremely complex models with a very high number of parameters, which often require data resizing or background subtraction to achieve an acceptable degree of accuracy [26,27]. Waeerasekera et al. [10] classified this problem as the ASL-based finger-spelling recognition/classification problem. However, the underlying machine learning mechanism of ASL-based finger-spelling recognition and of gesture recognition is the same, as the dataset does not differ.
Both high network size and input dimension restrictions limit the applicability of DCNN algorithms. Hand gesture recognition must be robust and input dimension-invariant, as the same hand gesture can appear in different scenarios, which results in a variation in the number of input dimensions. Moreover, DCNN algorithms need to be faster and computationally less expensive if they are to be deployed in practical scenarios. To address the data dimension restrictions and computational complexity of DCNNs, we propose a new algorithm that utilizes L2-normalization-based DCNN node and filter pruning and introduces a spatial pyramid pooling (SPP) layer that eliminates the need to restrict the number of input dimensions. We also propose a new real-time hand gesture classifier that can recognize and decode multiple gestures in real time from an input video. We utilized the open-source ASL dataset [28], which provides 29 separate hand gesture classes for robust ASL finger-spelling or gesture recognition.

Deep Convolutional Neural Network and Node Pruning
The basic concept of a DCNN originated from the theory that an input matrix that controls edge features can be used to understand and define the patterns that appear. An arbitrary affine transformation alone cannot be used to obtain valuable visual information. Many previous works have described the basic DCNN theory thoroughly [29,30]. Convolution is a sparse operation, and the parameters are reused by sharing. This extracts the edge or essential pattern information from the visual input and labels it according to the supplied data. Figure 1 shows a simplified DCNN with 2 (two) convolutional layers, followed by a flatten layer whose extracted feature vector is used to classify the given image input. Convolution is a mathematical operation that can be applied to a matrix. Basically, all of the images that serve as the input for a convolutional neural network can be represented as an N-dimensional matrix. The filter sub-matrix is the underlying weight of the neural network and the inputs are integrated across the channels, which are the image dimensions. If a single node of a neural network is denoted as y, the summation over the i inputs x_i multiplied by the weights w_i, the node can be expressed in the following matrix form,

$$y = \sum_{i} w_i x_i + b = \mathbf{w}^{\top}\mathbf{x} + b \qquad (1)$$

Here, b is the bias. To reduce the result to a threshold-based non-linearity, an activation function σ(x) can be used, as follows,

$$\hat{y} = \sigma(y) \qquad (2)$$

Accordingly, the convolution accepts a volume of size W, H, D, denoting the width, height, and depth, together with the number of filters K, their stride S, and their spatial extent F, and calculates the following,

$$\hat{W} = \frac{W - F}{S} + 1, \qquad \hat{H} = \frac{H - F}{S} + 1, \qquad \hat{D} = K \qquad (3)$$

Here, Ŵ, Ĥ, D̂ are the resulting output matrix dimensions. It is important to mention that the width and height are computed equally by symmetry, and the output convolved matrix provides the exact filter-sized volume for the next convolutional layer to process.
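As a minimal sketch of the output-size relation described above, the helper below computes the spatial size produced by a convolution or pooling window along one axis. The function name `conv_output_size` and the `padding` argument are assumptions introduced for illustration; the text itself only names the input size, the spatial extent F, and the stride S.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output size along one axis: (size - kernel + 2 * padding) // stride + 1.

    `padding` is an illustrative extra (not named in the text); with
    padding=0 this reduces to (W - F) / S + 1.
    """
    return (size - kernel + 2 * padding) // stride + 1


# 3x3 filter, stride 1, no padding
print(conv_output_size(224, kernel=3))              # 222
# 3x3 filter with one pixel of zero padding keeps the size
print(conv_output_size(224, kernel=3, padding=1))   # 224
# 2x2 max-pooling with stride 2 halves the size
print(conv_output_size(224, kernel=2, stride=2))    # 112
```

Width and height use the same relation, while the output depth is simply the number of filters K.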
Thus, based on the equations shown above, the filters and nodes of each layer, in both the convolutional and fully connected layers, can be ranked in ascending order of the magnitude of their weights. The l1 normalization of a weight vector w is calculated as the sum of its element-wise absolute values, as shown below,

$$\lVert \mathbf{w} \rVert_1 = \sum_{i} \lvert w_i \rvert \qquad (4)$$

Subsequently, the nodes can be sorted by their calculated l1 normalization, and the lowest-scored nodes can be pruned, as needed. Figure 2 displays the block representation of node pruning in one particular layer of the DCNN.
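The ranking step can be illustrated with a short NumPy sketch. It assumes a convolutional weight tensor laid out as (kernel height, kernel width, input channels, filters); that layout, the random weights, and the helper names `rank_filters_l1` and `prune_filters` are all hypothetical and are not taken from the authors' implementation.

```python
import numpy as np


def rank_filters_l1(weights):
    """Return (filter indices in ascending l1 order, l1 score per filter)."""
    # Sum of absolute weights over every axis except the filter axis.
    scores = np.abs(weights).sum(axis=(0, 1, 2))
    return np.argsort(scores), scores


def prune_filters(weights, fraction):
    """Drop the lowest-scored `fraction` of filters; keep the rest in order."""
    order, _ = rank_filters_l1(weights)
    n_prune = int(weights.shape[-1] * fraction)
    keep = np.sort(order[n_prune:])
    return weights[..., keep], keep


rng = np.random.default_rng(0)
w = rng.normal(size=(3, 3, 3, 64))     # stand-in for a trained 64-filter layer
pruned, kept = prune_filters(w, 0.5)
print(pruned.shape)                    # (3, 3, 3, 32)
```

In a real network the input channels of the following layer would have to be sliced with the same `kept` indices, and the pruned model retrained.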

Image Resizing Restriction in DCNN
A DCNN, by definition, must be given input images with a fixed number of dimensions, namely the dimensions with which it was trained. For example, LeNet-5, which was trained on the MNIST dataset, requires input dimensions of (32, 32) [31]. This means that all of the DCNN developments based on the MNIST dataset are forced to resize any input to the aforementioned dimensions. Figure 3 demonstrates the problem that arises when resizing input images.
Moreover, the Imagenet challenge dataset has a fixed (224, 224) input dimension restriction [32]. Consequently, most DCNN architectures developed in recent years require those same dimensions. A trained DCNN model might incorrectly classify images that have been resized from a different number of dimensions. In Figure 3, the original (224, 224) input image has been resized into 2 different images of (200, 400) and (400, 200). This results in the DCNN wrongly assigning the same image to different classes, making the DCNN unreliable in practical use, where the input dimension varies heavily. This is the main motivation to incorporate the spatial pyramid pooling (SPP) layer into the DCNN.

Spatial Pyramid Pooling Layer (SPP)
The use of a spatial pyramid pooling (SPP) layer eliminates the input image dimension constraints of DCNNs. SPP is based on earlier work more commonly referred to as spatial pyramid matching (SPM). SPM is an extension of the bag-of-words (BoW) model that was proposed by Sivic et al. [33], a classical computer vision algorithm that divides input image feature vectors into finer-to-coarser sections and then aggregates the feature maps in those sections. SPP not only helps to produce representations for processing from uniformly scaled images/windows, but also makes it possible to feed the DCNN images of different sizes or scales during training. Training with images of variable size improves the scale-invariance and reduces the over-fitting of a DCNN. He et al. demonstrated that integration of SPP into a DCNN design improves accuracy in traditional DCNN architectures, such as LeNet-5, Alexnet with the Imagenet dataset, etc. Figure 4 shows the main workings of the SPP layer in the DCNN. In the block diagram, (A) is the input image, which can have arbitrary dimensions. The image is put through the convolutional layers in (B). The features extracted by the convolutional layers are passed to the SPP layer (C). SPP generates a fixed-length output regardless of the size of the input, whereas the common sliding window pooling used in normal deep networks cannot perform this operation. The operation improves upon BoW models, in that the network retains spatial information by performing max-pooling in local spatial bins. The sizes of these spatial bins are proportional to the image size, so the number of bins is fixed regardless of the image size. This is in contrast to the sliding window pooling used in classical DCNN approaches, in which the number of sliding windows depends on the input size. Each spatial bin holds the max-pooled response of each filter.
The outputs of SPP are presented as kM-dimensional vectors, where M denotes the number of bins and k is the number of filters in the last convolutional layer. Figure 4D shows this process. These vectors are then transferred to the fully connected layers for classification (Figure 4E) and display (Figure 4F).
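The fixed-length property can be checked with a small NumPy sketch of max-pooling over spatial bins whose sizes scale with the input. The function name `spatial_pyramid_pool`, the (height, width, filters) layout, and the default bin levels (1, 2, 4), borrowed from the multi-scale configuration proposed later in the paper, are assumptions made for illustration rather than the authors' code.

```python
import numpy as np


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool an (H, W, k) feature map into a vector of length k * M,
    where M = sum(n * n for n in levels) is the total number of bins."""
    h, w, _ = feature_map.shape
    pooled = []
    for n in levels:
        # Bin edges scale with the input, so every level has n * n bins.
        rows = np.linspace(0, h, n + 1)
        cols = np.linspace(0, w, n + 1)
        for i in range(n):
            r0, r1 = int(np.floor(rows[i])), int(np.ceil(rows[i + 1]))
            for j in range(n):
                c0, c1 = int(np.floor(cols[j])), int(np.ceil(cols[j + 1]))
                # One max response per filter inside this spatial bin.
                pooled.append(feature_map[r0:r1, c0:c1, :].max(axis=(0, 1)))
    return np.concatenate(pooled)


rng = np.random.default_rng(0)
small = spatial_pyramid_pool(rng.normal(size=(7, 7, 256)))
large = spatial_pyramid_pool(rng.normal(size=(13, 9, 256)))
print(small.shape, large.shape)   # (5376,) (5376,)
```

Both inputs yield the same vector length, k × M = 256 × 21, even though their spatial sizes differ.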

The Proposed Methodology
This section of the article discusses the implementation details of the proposed method. For the sake of simplicity, we separate the proposed method into two main sections. The first section deals with selection of the optimal nodes in the DCNN by pruning. The second section discusses transformation of the DCNN to a compact SPP-based DCNN, along with a practical approach to decoding video gestures in real time based on the compact SPP-based DCNN.

Practical DCNN Architecture Selection and Pruning Strategy for Optimal Node Selection
Although SPP is not a modern concept in relation to DCNN, the use of both node pruning and transformation from a classical DCNN to an SPP-based DCNN has not yet been thoroughly researched. Additionally, we accomplish single image-based decoding by using traditional methods in a new manner; our technique makes the most of other approaches that have until now not been practical to deploy in real-life settings. Figure 5 demonstrates the simplicity of the proposed method. Almost all general feature selection-based classifiers carry out resizing and transformation from red, green, blue (RGB)-scale to gray-scale for classification, whereas the DCNN with SPP only requires a dimension-invariant input image for classification.
The proposed DCNN architecture was constructed based on research by Oxford's Vision Geometry Group (VGG) [34]. Their proposed DCNN model has several versions for deployment. We started with the smallest DCNN in VGGnet and pruned from that architecture, as our original dataset contains far fewer labels than the Imagenet dataset used to train VGGnet. The main goal of this pruning is optimization and creation of a compact model that is less computationally intensive. The smallest original VGGnet model is the VGGnet-11. The numeral eleven (11) represents the total number of hidden convolutional and fully connected layers. Figure 6 shows the full network structure and node/filter count for each layer. The input layer of the DCNN takes a fixed (224, 224) RGB input, and it is followed by the first convolutional layer, with 64 units/filters. Then, a (2 × 2) max-pooling layer is used to reduce the number of parameters. This is followed by 2 (two) more convolutional layers with 128 units/filters and the same max-pooling process. There are then 2 more convolutional layers with 256 units/filters with max-pooling and, finally, 4 more convolutional layers with 512 units/filters. All of the convolutional layers were given individual batch normalization layers to reduce the over-fitting of the altered proposed VGGnet [35]. The output of the last convolutional layer is then passed through 2 (two) fully connected (FC) layers or dense layers. This is followed by an output layer of 29 nodes with Softmax output for gesture classification. This model has a total of 11 hidden layers and, with the input dimension of (224, 224), a total of 22,506,781 learnable parameters.
[Figure 6. Full network structure of the original DCNN and the number of nodes/filters in each layer. In the architecture diagram, max-pooling layers are shown in forest green and are followed by a flatten layer shown in emerald green. The number of dimensions in each layer is shown beside/below that layer.]

Version November 2, 2020 submitted to Appl. Sci.
The pruning strategy for the DCNN only works in the hidden layers. Neural network node pruning is a basic process, in which nodes are ranked in order to prune them. Hu et al. [36] explored sparsity in activations for network pruning. The rectified linear unit (ReLU) [37] activation function imposes sparsity during inference, and the average percentage of positive activations in the output can determine the importance of the neuron. L1 normalization is useful for estimating the saliency of feature maps in a given layer. This idea can be used to rank the filters in each layer. The trained neural network is then pruned to make it more compact in size while retaining accuracy, so that it produces the correct results with less computational overhead. The proposed pruning of the new DCNN takes place in 2 stages. First, the model is trained with the proper dataset and evaluated on validation accuracy to establish the baseline. Second, the model is pruned based on the l1-normalized rankings of its filters and nodes, and the pruned compact network is then retrained to give the same or higher accuracy with a reduced computational workload. As the layers in the original model were extremely large, 50% pruning of the convolutional layers and 25% pruning of the FC layers were applied to make the final compact DCNN that is shown in Figure 7. Overall, pruning resulted in more than a 50% reduction in parameters. The first convolutional layer with 64 units/filters in Figure 6 now has 32 filters, as shown in Figure 7. The same strategy was implemented in the next two convolutional layers, which have 128 units/filters in Figure 6. The pruned network has 2 convolutional layers with 64 units/filters after the first 32 node/filter layer, as shown in Figure 7.
The pruning of the next two convolutional layers with 256 units/filters and the last four convolutional layers with 512 units/filters in Figure 6 followed, resulting in 2 convolutional layers with 128 units/filters and the last 4 convolutional layers with 256 units/filters, as shown in Figure 7. The FC layers of the original model each had 512 nodes, so 25% pruning of both layers leaves 384 nodes in each hidden FC layer, as displayed in Figure 7. Table 1 shows the exact number of parameters in each layer in the original and pruned network. The original model presented in Figure 6 has 22,506,781 trainable parameters (weights), compared with 7,328,285 parameters in the pruned model of Figure 7, as noted in Table 1. The proposed 50% pruning of the convolutional nodes halves the number of filters in each convolutional layer, and the batch normalization parameters are cut in half as a result. The output of the convolutional layers is flattened and fed into the FC layers, following the usual DCNN design. Table 1 demonstrates a sharp reduction to 4,817,280 parameters in the first hidden FC layer, as opposed to 12,845,568 parameters in the original model in Figure 6. These figures show the potential for pruning the DCNN at a high rate in each hidden layer, whether it consists of convolutional or dense/fully connected nodes. However, it is not possible to prune the last FC layer in the DCNN, because the number of output nodes is fixed by design. This is why Table 1 shows no change in the last FC layer with 29 nodes, as those are the classes/labels for the dataset.
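The parameter totals quoted above can be reproduced with a short calculation. The sketch below assumes (3 × 3) convolution kernels with biases, a (7 × 7) feature map before the flatten layer, and four batch normalization parameters per channel after every convolutional and hidden FC layer; these are assumptions chosen because they reproduce the reported totals, not details confirmed by the text. The pruned configuration uses 384 nodes per hidden FC layer, which is 75% of 512 and matches the reported 4,817,280 parameters of the first hidden FC layer. The function name `count_parameters` is hypothetical.

```python
def count_parameters(conv_filters, fc_nodes, n_classes=29,
                     in_channels=3, kernel=3, final_map=7, bn_params=4):
    """Count parameters of a VGG-style network under the stated assumptions."""
    total = 0
    channels = in_channels
    for f in conv_filters:
        total += kernel * kernel * channels * f + f   # conv weights + biases
        total += bn_params * f                        # batch normalization
        channels = f
    inputs = final_map * final_map * channels         # flatten layer output
    for n in fc_nodes:
        total += inputs * n + n                       # dense weights + biases
        total += bn_params * n                        # batch normalization
        inputs = n
    total += inputs * n_classes + n_classes           # Softmax output layer
    return total


original = count_parameters([64, 128, 128, 256, 256, 512, 512, 512, 512],
                            [512, 512])
pruned = count_parameters([32, 64, 64, 128, 128, 256, 256, 256, 256],
                          [384, 384])
print(original, pruned)   # 22506781 7328285
```

Under these assumptions the pruned network keeps roughly one third of the original parameters, consistent with the stated reduction of more than 50%.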
[Figure 7. The final pruned DCNN after 50% pruning of the convolutional layers and 25% pruning of the FC layers. The number of dimensions in each layer is shown beside/below that layer.]

Integration of Multi-Spatial Pyramid Pooling into the Pruned DCNN
The model and the pruning strategy described in the section above have a fundamental limitation when it comes to the image input dimension. This results in forced data resizing and, thus, reduced accuracy, as illustrated in Figure 3. He et al. [4] showed that SPP in a DCNN improves the accuracy and removes the input dimension restriction. Several studies [38,39], including the original research from He et al. [4], have suggested combining pooling at various spatial scales. To adapt the DCNN to arbitrary image input dimensions, we propose a three-level multi-scale SPP with (1 × 1), (2 × 2), and (4 × 4) pooling. Table 2 shows the modified pruned DCNN of Figure 7 with the proposed multi-scale SPP fitted after the convolutions. As the last convolutional layer has 256 filters, the SPP receives 256 convolutional feature maps as input. Subsequently, the data are pooled through the 3 separate (1 × 1), (2 × 2), and (4 × 4) SPP scales, and the results are merged together. This gives the SPP layer a fixed output length of 256 × (1 × 1 + 2 × 2 + 4 × 4) = 256 × 21 = 5376 features. The SPP scales (1, 2, 4) were chosen both by trial and error and following suggestions from other research [38,39]. Although increasing the number of pooling scales might further increase the accuracy, it would add more computational overhead to the DCNN model. The proposed pruned DCNN has already been optimized to less than half of the original weight parameters with SPP-based pooling. Introducing multi-scale SPP removes the dimension restriction and improves the accuracy.

Practical Classification Algorithm for Real-Time Gesture-To-Word Decoding
The proposed SPP-based compact DCNN gives proper results for hand gestures in real time. However, to the best of our knowledge, a practical video-to-gesture decoding algorithm with an SPP-based DCNN has not yet been properly researched. In this subsection, we devise a novel practical approach to decode hand gestures into consecutive alphabets in real time. Algorithm 1 shows the approach for decoding gesture video(s) in real time.
Algorithm 1 Hand Gesture Decoding Algorithm
Input: Hand gesture video or image frames set
Output: Decoded alphabets instruction set
BEGIN
Step 1: Process the video for the given frames per second (fps)
Step 2: Transform the images into a 2D array (M, N), where M is the fps and N is the instruction/word length (images)
Step 3: Apply the proposed Compact Spatial Pyramid Pooling Deep Convolutional Neural Network to transform each image
Step 4: Column-wise majority selection
Step 5: Return the array of n-sized output instruction
END

ASL has 29 distinct classes denoting all of the English alphabet from A to Z, along with space, del, and nothing gestures for additional support. The gesture set can be computed as a permutation problem with Equation (5),

$$P(n, r) = \frac{n!}{(n - r)!} \qquad (5)$$

Here, $n$ is the total of 29 gesture classes and $r$ is the number of selected gestures. Substituting $n = 29$ and $r = 4$ into Equation (5) gives a total of 570,024 usable gestures for various instructions. Another important step of the gesture decoding is the frames per second (fps) processing. Modern video cameras have a standard of 30 fps. Accordingly, an input video containing a 4-instruction set will have 120 frames to decode. The algorithm transforms the images into a 2D array of (M, N) dimensions and then applies the proposed compact SPP-DCNN to transform each image into a corresponding gesture alphabet label. At this point in the algorithm, the majority member selection is done based on the median operator applied to the predicted array, as shown in Equation (6) for $\mathrm{Majority}(x_1, x_2, \ldots, x_n)$. Here, $x_1, x_2, \ldots, x_n$ denote the decoded labels in the array. As a simple example, a 5 fps gesture video of A, B, C, D can produce [0, 0, 0, 1] for the first row of data with the SPP-DCNN at 95% accuracy. However, the proposed majority member selection will change the result into the correct gesture based on the majority member in that row. This practical approach makes the proposed algorithm extremely effective for practical gesture decoding. A detailed accuracy analysis of the proposed algorithm is discussed in Section 7 for more clarification.
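The decoding steps of Algorithm 1 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this research: the SPP-DCNN is replaced by a hypothetical list of per-frame predictions, each gesture is assumed to span exactly `fps` frames, and the majority selection is implemented as a plain most-frequent-label vote rather than the median operator of Equation (6). The name `decode_gestures` is hypothetical.

```python
import math
from collections import Counter

import numpy as np


def decode_gestures(frame_labels, fps):
    """Decode per-frame class labels into one label per gesture.

    frame_labels: flat sequence of predicted labels, one per video frame,
                  ordered in time; its length must be a multiple of fps.
    fps:          frames per gesture (assumes each gesture lasts one second).
    """
    labels = np.asarray(frame_labels)
    if labels.size == 0 or labels.size % fps != 0:
        raise ValueError("number of frames must be a positive multiple of fps")
    # Shape (M, N): M = fps rows, N = gestures; column j holds gesture j's frames.
    grid = labels.reshape(-1, fps).T
    decoded = []
    for column in grid.T:
        # Majority vote: the most frequent label among the frames wins.
        winner, _count = Counter(column.tolist()).most_common(1)[0]
        decoded.append(winner)
    return decoded


# Hypothetical per-frame predictions for "ABCD" at 5 fps with two wrong frames.
predictions = list("AAAAB" "BBBBB" "CCACC" "DDDDD")
print(decode_gestures(predictions, fps=5))   # ['A', 'B', 'C', 'D']

# Number of ordered 4-gesture sequences drawn from 29 classes, Equation (5).
print(math.perm(29, 4))                      # 570024
```

The vote corrects isolated misclassified frames as long as the correct label remains the most frequent one within each gesture's frames.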

Dataset Description
A DCNN is practically useless without training and validation on proper data. In our research, we aim to develop practical hand gesture decoding in real time. A proper dataset with enough labels ensures the validity of the proposed method. American Sign Language (ASL) is a generalized gesture set that has been curated and standardised by the National Institute on Deafness and Other Communication Disorders (NIDCD). The precise origins of ASL are not clear, although some claim that it originated from the combination of local hand gestures and French sign language (LSF, or Langue des Signes Française) more than two centuries ago. There are many datasets that curate different forms of ASL [26], including some focused only on numerical hand gestures [11,40]. So, to be more robust and more practically applicable, this research focused on the data that were developed for the Kaggle ASL challenge [28].
The training data comprise 87,000 images of (200 × 200) dimensions. There are 29 categories, including 26 for the characters A-Z and 3 for SPACE, DELETE, and EMPTY. These 3 extra classes are a huge benefit in implementing and decoding gestures in real time. The test data collection includes only 29 image categories, to allow for real-world reference images to be included. Figure 8 displays examples of all data classes. Islam et al. [26] also proposed and collected a dataset containing 26 classes of the same ASL. However, its preprocessing from RGB to grayscale makes it unusable for our learning purpose. The Kaggle ASL challenge dataset was more suitable, as no preprocessing was done to it prior to classification. It is worth mentioning that, although this dataset is titled as an ASL hand gesture dataset, the same labeling principle applies to ASL finger-spelling. This makes them virtually the same task with different terminologies. Table 3 shows the training/testing split of the dataset for proper validation. The splitting was done randomly to ensure bias-free training, and data augmentation with random rotation and shifting was used during training to reduce over-fitting.
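A random split of this kind can be sketched as below. The 80/20 fraction is an assumption inferred from the 69,600 training images reported in the Results section (87,000 × 0.8 = 69,600); the helper name `random_split` and the seed are hypothetical, and the sketch only shuffles indices rather than loading images.

```python
import numpy as np


def random_split(n_samples, train_fraction, seed=0):
    """Return shuffled, disjoint train and test index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)            # random order of all indices
    n_train = int(round(n_samples * train_fraction))
    return order[:n_train], order[n_train:]


train_idx, test_idx = random_split(87_000, train_fraction=0.8)
print(len(train_idx), len(test_idx))   # 69600 17400
```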

Evaluation Metrics
It is important to establish proper evaluation metrics for analyzing the performance of any machine learning algorithm. Although most DCNN-related research focuses solely on accuracy for validation, precision, recall, and $f_1$-score can also help to interpret the result more accurately. Precision $P$, Recall $R$, and $f_1$ score can be calculated as follows,

$$P = \frac{TP}{TP + FP} \qquad (7)$$

$$R = \frac{TP}{TP + FN} \qquad (8)$$

$$f_1 = \frac{2 \times P \times R}{P + R} \qquad (9)$$

Here, $TP$ is the number of true positives, i.e., the number of correctly classified faulty instances. $FP$ is the number of false positives, i.e., the number of normal instances that were wrongfully classified as faulty ones. $FN$ stands for false negatives, which is the number of faulty instances classified as normal ones. The properly labeled normal cases are denoted as true negatives, $TN$. Additionally, the accuracy ($AC$) is computed as follows,

$$AC = \frac{TP + TN}{TP + TN + FP + FN} \qquad (10)$$

Additionally, the $R^2$ (coefficient of determination) regression score function has been computed for 5 consecutive tests of all 3 proposed models for evaluation. Assuming the original labels as $y$ and the predicted labels as $\hat{y}$, the $R^2$ will be

$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (11)$$

Here, $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$. It reflects the proportion of variance that the independent variables in the model have explained. It gives an indicator of fitness effectiveness and, thus, a measure of how well the model is able to assess unknown samples. For the newly proposed member majority function, the mean majority $Mj_{mean}(a_1, a_2, \ldots, a_n)$ of Equation (12) is calculated from the majority counts $a_i$ and expressed as a percentage, where $k$ is the corrected majority derived from the dataset. Basically, it is a mean average of the majority count in the array $[a_1, a_2, \ldots, a_n]$. The detailed analysis of the result evaluated based on this metric is discussed in Section 7.
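The metrics above can be computed with a few lines of NumPy. The following is a minimal sketch assuming a one-vs-rest treatment, in which one class is taken as positive and all others as negative; the function names and the toy labels are hypothetical, and the mean majority metric is not included.

```python
import numpy as np


def binary_metrics(y_true, y_pred, positive):
    """Precision, recall, f1, and accuracy with one class treated as positive."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    tn = int(np.sum((y_pred != positive) & (y_true != positive)))
    # Guard against empty denominators when a class is never predicted/present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Toy labels for three classes (0, 1, 2); class 1 is evaluated as positive.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(binary_metrics(y_true, y_pred, positive=1))
print(r2_score(y_true, y_pred))
```

For a multi-class report, the same function would be called once per class and the per-class values listed or averaged.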

Results
The implementation, training with validation, and testing of the proposed DCNN network were done exclusively in the Python programming language. Python allows for the creation of a virtual environment and a cloning mechanism in that environment with relative ease. Thus, developing the proposed DCNN algorithm with SPP in a Python-based application allows for rapid prototyping and instant deployment in a variety of practical scenarios. The experiments on the proposed DCNN node pruning and SPP integration were constructed and modified with the Python-based Keras library, which uses the TensorFlow library as a backend [41]. A high-performance server computer with an Intel Xeon E5/Core i7, 32 GB of RAM, and an Nvidia RTX 2080 GPU running the Ubuntu 16.04.6 LTS operating system (OS) was used for training and application deployment. The DCNNs were each trained for 100 epochs with a batch size of 40 and a learning rate of 0.001 for faster convergence. The convolution and dense (FC) layers used in the proposed DCNNs were built with ReLU activation [37], along with Adadelta optimization and categorical cross-entropy as the loss/error function. The main addition in this ensemble is the introduction of batch normalization [42] to reduce over-fitting and aid convergence during training. The original VGG-11 did not use batch normalization. The original DCNN, the pruned DCNN, and the pruned DCNN with SPP were trained and tested using a single Nvidia RTX 2080 GPU instance. However, in modern application deployment, it is often necessary to use parallel GPU training and deployment. Moreover, sometimes two, three, or more GPU instances are available for use in real-time applications. In order to deploy with GPU-based parallelism, the final proposed SPP-based DCNN model was also trained with 2 Nvidia RTX 2080 GPU instances.
In Figures 9 and 10, the loss charts display the gradual error reduction during training and validation/testing. Even though the loss charts are comparatively smooth, the accuracy varies drastically during training and validation. This instability is due to the random data augmentation and highly randomised testing in each iteration, which force the model to learn without over-fitting. The figures also display the training and validation of the original proposed DCNN and the retraining after the pruning was applied. The original and the pruned training were both done for 100 epochs. Model weights were saved at the end of every epoch in order to select the best weights from the training. The original DCNN had a highest accuracy of 90.0% and the pruned DCNN had 91.0% accuracy. This finding is initially surprising, as the previous literature suggested that accuracy is usually retained or slightly decreased after pruning. However, it can be explained by the training data volume and the DCNN size. The first proposed modified VGG-11-like DCNN had 11 hidden layers in total, and most of its parameters sit in the last convolution layer and the first hidden FC layer, which have 2,359,808 and 12,845,568 weight parameters, respectively. On the other hand, the total training instances were 69,600 images. A simple representative convolution of six parameters each can be "memorized" or saved into the weight parameters and, thus, over-fitting occurs. This result supports the hypothesis stated at the beginning of this research: the hand gesture classifier has less data than the ImageNet challenge dataset and does not need a bigger network, like the proposed original, to achieve higher accuracy.
Additionally, this theory is further supported by the analysis of the weights and the l1-normalization ranking of the nodes and filters in all hidden layers. Table 4 has the experimental results supporting the assumptions above. The 20 worst-scored filter and node weights can be seen in Table 4. In the first hidden layers, the low-weight nodes vary, for example, middle nodes have less weight, but in the last convolutional layer the low-scoring weights are almost exclusively the outermost filters when compared to the projection of the previous convolved output. Figure 11 is the visual representation of the pruning justification in the proposed method. The filters of the original proposed DCNN's first six convolutional layers are shown based on their l1-normalized values. Half of these nodes have a normalized weight value that falls below the median. Accordingly, pruning half of the nodes did not, in reality, affect the total accuracy of the proposed DCNN. Moreover, retraining the pruned model forces the weights to generalize feature detection rather than "memorize" the dataset. As a result, the pruned DCNN gained more accuracy than the original model. This practical pruning also shows that a bigger DCNN is often not useful for classification of datasets with fewer features/labels. Table 5 shows the testing accuracy, precision, recall, and f1 score for all 3 proposed models. Although the original DCNN had 90% overall accuracy across all classes, the individual precision was skewed, being higher in some classes and lower in others. This results in biased classification in real life. However, pruning and retraining made the DCNN more generalized, and the class-wise precision and recall improved significantly. The overall accuracy also increased with the newly pruned DCNN. After pruning and retraining for 100 epochs, the general accuracy improved to 92% in the pruned DCNN. Additionally, the introduction of SPP in the pruned DCNN further improved the accuracy up to 95%.
This is around a 2% improvement over the previous model, as suggested by He et al. [4]. Moreover, SPP makes the final model dimension-invariant. As a result, the input image can be fed directly without any pre-processing. The pruning of the DCNN not only improved the accuracy, it also compacted the DCNN with fewer weight parameters. Table 6 shows the final pruning statistics of the proposed DCNN. The original DCNN had a total of 22,498,973 weight parameters and the pruned DCNN had a final 7,323,869 weight parameters. This pruning makes the network more than 3× smaller than the original, with a 67.45% compression rate. Figure 12 shows that some common mislabeling occurs in all proposed DCNN models. At first glance, this might contradict the high accuracy of the pruned model and the SPP-DCNN, but further examination of the input images shows that the original input image is often too visually obscure for convolution to separate distinguishable features. Furthermore, some of the images are visually more similar to the predicted class than to the ground truth. Table 7 shows the advanced performance of all 3 proposed DCNNs. Each model's performance is based on 4000 random input samples, re-initialized 5 times for randomness in each epoch. The accuracy and $R^2$ score of all 3 models both support the viability of using the proposed DCNN in real-life scenarios. A comparison among the 3 models based on the $R^2$ score makes the SPP-based DCNN the best model to use in real-life applications. This makes the prediction quality of the proposed SPP-DCNN suitable for a fast, reliable hand gesture classifier. The proposed gesture decoder algorithm needs fast prediction output in order to be applied in real time in gesture decoder systems. Table 8 shows a comparative run-time of the proposed algorithms in real time based on the data generated from Algorithm 1. The setup described at the beginning of Section 7 was followed. The proposed 3 models perform relatively well.
However, it is noticeable that the run-time for a video sequence containing 4 gestures averages 19.30 ± 1.32 s. On the other hand, the pruned version was sped up significantly, averaging 9.34 ± 1.37 s. This is predictable, as the pruned model has fewer weight parameters to calculate in real time. Moreover, in real life, some server-based applications have the option of multi-GPU based parallelization. The proposed SPP-based DCNN was also implemented with multi-GPU support in this experiment and had an average run of 0.013 ± 0.019 s, far lower than the other models. Table 9 shows the results of multi-GPU based parallelization in greater detail. Here, a 6-gesture sequence video, or the finger-spelling of 6 consecutive alphabets, takes an average of 267 ± 2.13 ms to decode per frame. Based on these results, the proposed algorithm is fast enough to decode gestures in the real world.

Table 9. Compact deep spatial pyramid pooling convolutional network classification in real-time (mean ± standard deviation of seven runs, 10 loops each) (with multi-GPU functionality; 2 RTX 2080).

Table 10 shows the results of gesture decoding by the proposed algorithm while using a video with various frame rates. The mean majority accuracy was counted based on Equation (12) from Section 6. This provides an idea of the overall ability of the proposed decoding algorithm. Even at 60 fps, the decoder gives 99% accuracy. The decrease in accuracy with increased fps can be explained by the transitional frames between gestures. However, as this error is minimal, the algorithm can cope with this minuscule source of error. Based on this experiment, 30 fps is the recommended speed for fast real-world decoding in practical applications. Our proposed decoder or ASL finger-spelling system is usually 100% accurate in various finger-spelling or gesture decoding tasks.
However, the mean majority in rows shows that the gradual rise of error can be cumulative and might cause wrong classifications in the future. Nevertheless, more training with a diverse dataset and applying the proposed algorithm at the recommended 30 fps should uphold the reported accuracy for any ASL finger-spelling gesture decoding in real time.
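The median-based filter pruning and the compression figures reported in this section can be illustrated with a short sketch. It is not the pruning code used in this research: the kernel layout (kernel height, kernel width, input channels, output filters) is assumed to follow the Keras convolution convention, the weights are random stand-ins for a trained layer, and the function names are hypothetical. The parameter counts are the ones quoted in the text.

```python
import numpy as np


def l1_filter_scores(kernel):
    """L1 norm of each output filter of a (kh, kw, c_in, c_out) kernel."""
    return np.abs(kernel).sum(axis=(0, 1, 2))


def prune_below_median(kernel):
    """Drop the filters whose L1 score falls below the median score."""
    scores = l1_filter_scores(kernel)
    keep = np.flatnonzero(scores >= np.median(scores))
    return kernel[..., keep], keep


# Hypothetical random weights standing in for one trained 3 x 3 conv layer.
rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 3, 8, 16))
pruned, kept = prune_below_median(weights)
print(weights.shape, "->", pruned.shape)   # (3, 3, 8, 16) -> (3, 3, 8, 8)

# Compression figures from the parameter counts reported in the text.
original_params = 22_498_973
pruned_params = 7_323_869
size_ratio = original_params / pruned_params
compression_rate = 100.0 * (1.0 - pruned_params / original_params)
print(f"{size_ratio:.2f}x smaller, {compression_rate:.2f}% compression")
# 3.07x smaller, 67.45% compression
```

In a full network, removing output filters from one layer also requires removing the matching input channels from the next layer before retraining, which this sketch leaves out.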

Comparison with Other Methodologies
Modern deep learning-based hand gesture classification research has been solely focused on accuracy-based classification and regression. Depending on the image dataset, the size and computational complexity of such classification algorithms are increasing rapidly. In this study, we have introduced a new real-time approach to classical gesture decoding. Islam et al. [26] approached DCNN-based classification while using the same dataset as used in this research and reported 94% accuracy over all input data. However, that method involved a background subtraction process along with data image resizing and gray-scale conversion. Moreover, integrating a multi-support vector machine (MSVM) into the DCNN might further improve classification accuracy, but ultimately the algorithm lacks practicality for real-time applications. Tushar et al. [40] approached gesture recognition in a similar fashion as the present research, but the image resizing and limitation of gesture classes to numerical hand gestures limits its utility in real-time applications. Our previous approach to node pruning with a DCNN yielded a similar increase in prediction response time, but the image resizing restriction was still present [43].
The proposed compact SPP-based DCNN model eliminates the image resizing restriction. We utilized a large ASL dataset to create a new gesture or ASL finger-spelling decoding algorithm that works in real time. Although the proposed pruned model with the SPP layer achieved the same accuracy as the latest relevant research using this dataset, the combination with the newly proposed video-based decoding upgraded the result to a maximum of 99% accuracy in real time with very low processing time. All of the previous algorithms focused on accuracy improvement. In this research, we have combined the improved accuracy with a novel algorithm to achieve a new benchmark system that provides a practical technique for the development of future applications that are based on gesture recognition.

Conclusions
The rise of edge-based remote applications has created a demand for high-performance low-cost computing-based deep learning convolutional neural networks (DCNNs). The main idea of the DCNN is not new. However, optimization and expanding the applications of neural networks are now very demanding tasks due to the rise of edge computing and real-time response-based application deployment. Decoding hand gestures in real time requires a fast DCNN capable of interpreting variable-sized image inputs owing to the variation between cameras in modern systems. In this research, we proposed a new hand gesture classification system that can classify various hand gestures with 94% accuracy. Moreover, we integrated a spatial pyramid pooling (SPP) layer into the proposed DCNN and used node pruning to make it less computationally resource intensive and image input dimension-invariant. Consequently, the proposed SPP-DCNN is the most reliable method of real-time gesture decoding. Moreover, we have introduced a novel algorithm for video-based gesture decoding that can process a video with any arbitrary input dimensions and variable frames per second (fps), and decodes an input gesture video into consecutive gesture classes. This proposed new and faster system can also be used as an advanced ASL finger-spelling recogniser. The use of these compact SPP-DCNNs in various remote smart locations with minimal computing resources will ensure high performance with lower computing costs and better connectivity.