Design and Implementation of Deep Learning Based Contactless Authentication System Using Hand Gestures

Sign language digits expressed as hand gestures have several contactless applications, including communication for impaired people (such as elderly and disabled people), health-care applications, automotive user interfaces, and security and surveillance. This work presents the design and implementation of a complete end-to-end deep learning based edge computing system that can verify a user contactlessly using an 'authentication code'. The 'authentication code' is an 'n' digit numeric code whose digits are hand gestures of sign language digits. We propose a memory-efficient deep learning model to classify the hand gestures of the sign language digits. The proposed deep learning model is based on a bottleneck module inspired by deep residual networks. The model achieves a classification accuracy of 99.1% on the publicly available Sign Language Digits Dataset. The model is deployed on a Raspberry Pi 4 Model B edge computing system to serve as an edge device for user verification. The edge computing system works in two steps. In the first step, it takes input from the attached camera in real time and stores it in the buffer. In the second step, the model classifies the digit with an inference time of 280 ms, taking the first image in the buffer as input.


Introduction
Contactless biometric authentication has become a necessity, as it is perceived to be not only more hygienic but also secure and efficient. Biometrics can be divided into two classes, physiological biometrics and behavioral biometrics [1]. Physiological biometrics usually include fingerprints [2], facial features [3], palm prints [4], retinas [5], ears [6], and irises [7]. Behavioral biometrics usually consist of keystrokes [8], signatures [9], and gaits [10]. Voice biometrics can be assigned to both classes because they include features belonging to both [1].
Before the era of deep learning, biometric authentication was mainly based on hand-crafted features extracted using methods such as the scale-invariant feature transform (SIFT) [11], wavelets [12], etc. With the advent of deep learning in this decade, the biometric authentication field has been completely transformed. Most contemporary biometric authentication systems use convolutional neural networks and their variants. A convolutional neural network (CNN) is a type of deep multilayer artificial neural network. CNNs are widely used across computer vision tasks because their convolution process obtains an effective representation of the input images directly from raw pixels with little to no preprocessing and can easily recognize visual patterns [13]. The representations learned by CNN models are visual features and are more effective than handcrafted features [14]. These networks are being applied in a variety of applications, such as object detection, speech recognition, unmanned aerial vehicles (UAVs), and unmanned ground vehicles (UGVs) [15]. Examples of contemporary biometric authentication systems that use deep learning models include the following: ref. [16] uses a CNN for facial biometric authentication, ref. [17] authenticates a user based on fingerprints using a CNN, ref. [18] uses graph neural networks and a CNN for palmprint recognition, ref. [19] uses deep neural networks to identify and authenticate a user based on voice, ref. [20] uses deep learning to recognize gaits, and ref. [21] recognizes signature biometrics using deep learning. Even though deep learning has brought a lot of progress to the field of biometric authentication, there are still several challenges to overcome.
These challenges include the development of more challenging datasets to train the models, the need for interpretable deep learning models, real-time deployment of the models, memory-efficient models, and security and privacy issues [22].
In the security space, the most dominant usage of CNNs has been in intrusion detection [23]. These networks primarily use object detection modules, and recent CNNs have proven to be effective for object detection [24]. Many of these CNNs are memory hungry due to a high number of parameters (hundreds of millions) and have high computational complexity. This impedes deployment on edge computing devices, which is a typical requirement in edge computing/cyber-physical systems (CPS) [25]. The development of memory-efficient CNNs has therefore come to the forefront in recent years [25], as they can provide compact models that are deployable in edge computing/CPS arenas and have the hardware-level support that is needed.
To tackle the real-time deployment and memory efficiency challenges of deep learning models, we propose an end-to-end contactless authentication system that verifies a user by validating the 'authentication code' using a memory-efficient CNN model. The 'authentication code' is unique for each user and is an 'n' digit numeric code, with each digit in the range 0-9. The task of automatically classifying an input image into one of the given classes using convolutional neural networks is known as the image classification task [26]. There are quite a few standard image classification datasets, namely 'ImageNet Large Scale Visual Recognition Challenge' [27], 'CIFAR 10', 'CIFAR 100' [28], 'Oxford IIIT Pet' [29], 'Oxford Flowers' [30], 'MNIST' [31], etc. There are also quite a few deep learning models available today that achieve high performance on these image classification datasets using different variants of convolutional neural networks, e.g., the Sharpness-Aware Minimization procedure applied to a CNN model achieves 88.61% top-1 accuracy on the ILSVRC-2012 dataset [32], the EnAET model achieves an error rate of 1.99% on the CIFAR-10 dataset [33], the Big Transfer (BiT) model achieves an accuracy of 99.4% on the CIFAR-10 dataset and 87.5% top-1 accuracy on the ILSVRC-2012 dataset [34], and the branching and merging convolutional network with homogeneous filter capsules achieves an accuracy of 99.84% on the MNIST dataset [35]. These models consist of several convolution layers, max-pooling layers, and different regularization techniques such as dropout and L2 regularization. However, all the aforementioned models are very large, requiring more memory and longer inference times. Running a memory-efficient CNN without compromising accuracy has been a challenge, especially when the inference has to be performed on an edge computing device/CPS. State-of-the-art performance has so far been achieved with complex models that require more computational resources [25].
In this work, we develop a memory-efficient CNN model that is well suited to edge computing devices. We also compare the performance of the proposed memory-efficient CNN model with the MobileNetV2 model, which is the state-of-the-art model for memory-efficient CNNs, and show that the proposed network outperforms the existing MobileNetV2 model in classification accuracy and model size.

Dataset
We have utilized the 'Sign Language Digits Dataset' [36] for training the proposed CNN and MobileNetV2. The dataset consists of a total of 2062 samples in 10 classes, from digit 0 to digit 9. Samples from each class are shown in Figure 1. Each sample is of size [100 × 100] pixels. Table 1 shows the dataset details with the total sample count per class. The entire dataset is split into three parts: training data, validation data, and test data. The test data is selected such that each of the 10 classes contributes an equal number of samples. We randomly selected 41 samples from each class, leading to a total of 410 samples for the test set, which is 19.88% of the entire dataset. The rest of the dataset is split into training and validation sets with 60.03% and 20.07% of the samples of the dataset, respectively. We upscale the images to [256 × 256] pixels because of the real-time testing constraints explained in Section 2.4. The resized image samples are then used as input images to train the deep learning model.
Figure 1. Samples from the Sign Language Digits Dataset [36]. The number below each image represents the decoded sign.
Table 1. Dataset [36] details. Example images in the dataset are given in Figure 2 [37].
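The class-balanced test split described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `stratified_split`, the seed, the held-out size `val_count=414`, and the synthetic label list are all assumptions (the real per-class counts are in Table 1).

```python
import random
from collections import defaultdict


def stratified_split(labels, test_per_class=41, val_count=414, seed=0):
    """Return (train, val, test) lists of sample indices.

    The test set takes the same number of samples from every class;
    the remaining samples are shuffled and divided into a validation
    part of val_count samples and a training part with the rest.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    test, rest = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        test.extend(indices[:test_per_class])   # equal share per class
        rest.extend(indices[test_per_class:])

    rng.shuffle(rest)
    val, train = rest[:val_count], rest[val_count:]
    return train, val, test


# Hypothetical labels: 2062 samples spread over 10 classes (assumed counts).
labels = [i % 10 for i in range(2062)]
train, val, test = stratified_split(labels)
for name, part in (("train", train), ("val", val), ("test", test)):
    print(f"{name}: {len(part)} samples ({100 * len(part) / len(labels):.2f}%)")
```

Fixing the seed keeps the split reproducible, which matters when two models are compared on the same test samples.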

Proposed End to End System for Contactless Authentication
In this work, we use CNNs to obtain a contactless authentication code from input images of hand gestures of sign language digits. The whole system is composed of two steps, namely the capture of images and the classification (digit recognition) of the captured images. The entire classification task used in the system is shown in Figure 2. As shown in Figure 2, the input image to the system is first resized to [256 × 256] pixels and is then fed into the deep learning model for digit recognition. The model itself is detailed in Figure 4 and in Table 5.

Lightweight Deep Learning Models for Hand Gestures Recognition
This section discusses lightweight deep learning based CNNs suitable for hand gesture recognition on edge computing devices. Both the proposed model, which is based on a bottleneck module inspired by deep residual networks, and MobileNetV2 (for comparison) are discussed.

MobileNetV2
The MobileNetV2 architecture is a state-of-the-art lightweight CNN that has been shown to provide improved performance over other models, especially in tasks like object recognition [38]. The network has depth-wise separable convolution layers (as shown in Figure 3), which act as efficient building blocks. The hypothesis behind the network is that an efficient encoding between intermediate inputs and outputs can be achieved through these bottlenecks. We incrementally train the MobileNetV2 model using two different methods of transfer learning, namely feature extraction and fine-tuning. We first train the model with the feature extraction method and then continue training the model with the fine-tuning method. Both methods are briefly described below:

1. Feature extraction: In this method, we use the MobileNetV2 model pre-trained on the ImageNet dataset as a feature extractor. We do not include the final dense layer (the 'classification layer') of the MobileNetV2 model because the number of classes in the ImageNet dataset differs from that of the 'Sign Language Digits Dataset' we use. In total, we use 154 pre-trained layers of the MobileNetV2 model as the base model. The output of this base model is the learned features, a 4-dimensional tensor of size [None, 8, 8, 1280]. We then reduce the output of the base model to a 2-dimensional matrix of size [None, 1280] using the 'Global Average Pooling 2D' function [39]. After this layer, we add a dense layer, which is the classification layer for the dataset we use. With the feature extraction method, we therefore do not train the base model (MobileNetV2, excluding its final layer). Instead, we use the base model to extract features from the input sample, feed these features to the added dense layer, and train only the dense layer on our dataset. The following hyper-parameter values are used to train the MobileNetV2 model using this method: the 'RMSProp optimizer' [40] with a learning rate of 0.0001 and a batch size of 32, for a total of 10 epochs. The architecture and the total number of parameters (model weights) used for training the MobileNetV2 model using this method are shown in Tables 2 and 3, respectively.
2. Fine-tuning: In this method, we use the same model architecture as in the method above: the MobileNetV2 model (excluding its classification layer) pre-trained on the ImageNet dataset is the base model, followed by an added dense layer (the classification layer for our dataset). The only difference is that in this method, we train a few layers of the base model along with the final added dense layer on the 'Sign Language Digits Dataset'. We freeze the first 99 layers and train layers 100 to 154 of the base model along with the final added dense layer. Therefore, the number of trainable parameters is larger than in the method above. The following hyper-parameter values are used for training the MobileNetV2 model using this method: the 'RMSProp optimizer' with a learning rate of 0.00001 and a batch size of 32, for a total of 25 epochs. The architecture and the total number of parameters used for training the MobileNetV2 model using this method are shown in Tables 2 and 3, respectively.
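The tensor sizes and layer counts quoted for the two transfer learning stages can be checked with simple arithmetic. The sketch below is not the training code; the function name `mobilenet_head_summary` is hypothetical, and the overall downsampling factor of 32 is a property of MobileNetV2 that the text does not state explicitly.

```python
def mobilenet_head_summary(input_size=256, total_stride=32, channels=1280,
                           num_classes=10, base_layers=154, frozen_layers=99):
    """Summarise tensor shapes and layer counts for the two training stages."""
    spatial = input_size // total_stride           # 256 / 32 = 8
    feature_map = (spatial, spatial, channels)     # output of the base model
    pooled = channels                              # after global average pooling
    dense_params = pooled * num_classes + num_classes   # weights + biases
    # Layers are numbered from 1, as in the text: 1..99 frozen, 100..154 trained.
    trainable = [layer > frozen_layers for layer in range(1, base_layers + 1)]
    return {
        "feature_map": feature_map,
        "pooled_features": pooled,
        "dense_params": dense_params,
        "fine_tuned_base_layers": sum(trainable),
    }


print(mobilenet_head_summary())
```

Under these assumptions, only the dense layer's weights and biases are trained during feature extraction, while fine-tuning additionally unfreezes the last 55 layers of the base model.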

Proposed Model
We propose a lightweight CNN (as shown in Figure 4) for the task at hand. The proposed network utilizes a novel bottleneck motivated by deep residual learning [41]. The bottleneck consists of stacked residual blocks, as presented in Figure 4 and Table 4. The layer-wise details of the overall architecture, such as the type of operation performed and the size of the feature maps, are shown in Table 5.
Table 4. Architecture details of the proposed bottleneck within the proposed CNN presented in Table 5. Note that 'M' is the spatial extent and 'Z' is the depth of the feature maps.
The initial residual block is motivated by ref. [41] and consists of a 1 × 1 convolution for depth squeezing, followed by a 3 × 3 convolution for local feature extraction and finally a 1 × 1 convolution for depth stretching. This kind of 1 × 1 projection-based embedding carries more information about a large input patch [42]. The second residual block is similar to the first one, except that the 3 × 3 convolution is performed in a dilated manner, as shown in Figure 5, to enlarge the receptive field throughout the network. In total, the proposed CNN has 4 such bottleneck blocks arranged alternately, as presented in Table 5, and each convolution operation in the proposed model is followed by ReLU [43] and batch normalization [44]. Compared with MobileNetV2, the proposed model has 2.35 × fewer parameters and 1.85 × less memory usage (refer to Table 6). In short, the operations performed in the proposed bottleneck can be summarized as follows. Given a feature map x (at the input of the bottleneck), the output h(x) of the initial residual block can be written as:

$$h(x) = f(x, \theta_1) + x$$
where $f(x, \theta_1)$ is a sequence of convolution operations parameterized by $\theta_1$, performing depth squeezing, feature extraction and depth stretching. Note that it is easier to optimize $f(x, \theta_1)$ than to learn the underlying $h(x)$ directly from $x$ [41]. Finally, the output $b(x)$ of the bottleneck is given by:
$$b(x) = g(h(x), \theta_2) + h(x)$$
where $g(h(x), \theta_2)$ is a sequence of convolution operations parameterized by $\theta_2$, performing depth squeezing, feature extraction via dilated convolution and depth stretching. Given a mini-batch with $N$ samples, the cross-entropy loss $L$ is computed as:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}$$
where $y_i$ is the one-hot encoded label (C × 1) and $\hat{y}_i$ is the vector of predicted softmax probabilities (C × 1) of the $i$-th sample, and $C$ is the number of classes. Being extremely lightweight, the proposed network does not require any transfer learning. The model was trained end to end on Google Colab using the Keras deep learning library [45]. We use the following hyper-parameter values to train the model: the 'Adam optimizer' [46] with a learning rate of 0.005, weights initialized using Kaiming initialization [47], and a batch size of 8, for a total of 40 epochs.
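To make the loss concrete, the following is a minimal pure-Python sketch of softmax followed by cross-entropy over a mini-batch. It assumes the loss is averaged over the N samples; the function names, the small `eps` guard against log(0), and the toy scores are illustrative choices rather than details taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean cross-entropy over a mini-batch (averaging over N is assumed).

    y_true: list of one-hot label vectors (length C each)
    y_prob: list of predicted probability vectors (length C each)
    """
    batch_loss = 0.0
    for label, prob in zip(y_true, y_prob):
        batch_loss -= sum(t * math.log(p + eps) for t, p in zip(label, prob))
    return batch_loss / len(y_true)


# Toy mini-batch with N = 2 samples and C = 3 classes (made-up scores).
logits = [[2.0, 0.5, 0.1], [0.2, 0.2, 3.0]]
labels = [[1, 0, 0], [0, 0, 1]]
probs = [softmax(row) for row in logits]
print(round(cross_entropy(labels, probs), 4))
```

Because the labels are one-hot, only the log-probability assigned to the correct class contributes to each sample's loss.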

Deployment on Edge Computing Device
A low-cost edge computing platform, the Raspberry Pi 4 Model B single-board computer, is utilized for implementing the given task. The Raspberry Pi 4 Model B is the latest product in the Raspberry Pi collection of single-board edge computing devices. This new version provides increased processor speed, connectivity, and memory capacity compared to the Raspberry Pi 3 Model B+. The cost of this edge computing device is around $35. We attached a Raspberry Pi Camera V2.1 module to the Raspberry Pi 4 to capture images in real time. The camera module is capable of capturing images at 3280 × 2464 pixel resolution. It can also record video at 1080p30, 720p60 and 640 × 480p90. The camera module costs around $27. The trained model is deployed on the Raspberry Pi 4 to predict the hand gestures. The full TensorFlow versions of the proposed model and the MobileNetV2 model contain a large number of parameters, and using these models in their original form causes latency during prediction. To tackle this latency issue for real-time applications, TensorFlow Lite (TFL) versions of these models are developed and deployed. The inference time of both models, along with the model size, is presented in Table 6. Compared with MobileNetV2, the proposed model has 2.35 × fewer parameters and 1.85 × less memory usage (refer to Table 6). The complete hardware setup, with a one Euro coin for size comparison, is shown in Figure 6. During real-time testing, the images captured by the camera are the input to the system. These images, of size [3280 × 2464] pixels, are resized (downscaled) to [256 × 256] pixels before being sent as input to the deep learning model. This is because, in real-time prediction, images resized to resolutions below [256 × 256] were distorted. The entire workflow of the contactless authentication process is shown in Figure 7.
The variable 'i' shown in Figure 7 is the counter variable that keeps track of the number of iterations run by the system. We run the loop 'n' times, where n is the length of the authentication code. Inside the loop, the system performs two basic steps. In the first step, the system takes the real-time camera feed from the Pi camera as input and stores the entire video (consisting of multiple image frames) in the frame buffer. In the next step, the input image (the first image frame in the buffer) is resized to [256 × 256] pixels, and the trained proposed deep learning model classifies it into one of the ten digit classes; the predicted digit class is printed on the screen. We then pause for 2 s, during which the user changes the digit sign and the frame buffer is emptied of all its stored frames. We then continue with the next iteration of the loop. This pause for the sign digit transition is programmable and can be adapted to the application requirements. Once the loop terminates, the authentication code is displayed on the screen. For verification in this work, it is printed. As the digits are represented as ASCII code, the output can be passed directly to any device for authentication (an ATM being one example).
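The loop described above can be sketched as follows. This is a hedged illustration rather than the deployed program: `read_code`, `verify`, and the stand-in camera and classifier callables are hypothetical names, resizing is assumed to happen inside `classify`, and the constant-time comparison with `hmac.compare_digest` is an added assumption about how a stored code might be checked.

```python
import hmac
import time


def read_code(capture_frames, classify, n, pause_s=2.0, sleep=time.sleep):
    """Collect an n-digit code, one hand gesture per loop iteration.

    capture_frames: callable returning a list of frames (the frame buffer)
    classify: callable mapping one frame to a digit in 0..9
    """
    digits = []
    for i in range(n):
        buffer = capture_frames()      # step 1: fill the frame buffer
        digit = classify(buffer[0])    # step 2: classify the first frame
        print(f"digit {i + 1}: {digit}")
        digits.append(str(digit))
        buffer.clear()                 # discard the remaining frames
        sleep(pause_s)                 # time for the user to change the sign
    return "".join(digits)


def verify(entered_code, stored_code):
    """Compare codes in constant time to avoid leaking matching prefixes."""
    return hmac.compare_digest(entered_code.encode(), stored_code.encode())


# Stand-ins for the camera and the model: each "frame" is just the digit shown.
shown = iter([[4, 4], [0, 0], [7, 7]])
code = read_code(lambda: next(shown), lambda frame: frame, n=3,
                 sleep=lambda seconds: None)
print(code, verify(code, "407"))
```

With the reported 280 ms inference time and a 2 s pause, each digit would take roughly 2.28 s, so a 4-digit code would need about 9 s, not counting capture time.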

Performance Evaluation Metrics
We use the following performance evaluation metrics to compare the proposed model and the MobileNetV2 model.

Recall
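This metric calculates the ratio of correctly predicted positive observations to all the observations that belong to the actual positive class [48]. It is calculated according to Equation (5), where TP and FN denote the numbers of true positives and false negatives, respectively:
$$\text{Recall} = \frac{TP}{TP + FN} \quad (5)$$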

Precision
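This metric calculates the ratio of correctly predicted positive observations to all the predicted positive observations [48]. It is calculated according to Equation (6), where TP and FP denote the numbers of true positives and false positives, respectively:
$$\text{Precision} = \frac{TP}{TP + FP} \quad (6)$$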

F1-Score
This metric uses both recall and precision to calculate its value [48]. It is the harmonic mean of the two metrics, according to Equation (7):
$$\text{F1-score} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \quad (7)$$
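As an illustration of how these three metrics follow from a confusion matrix, here is a small self-contained Python sketch. The function name `per_class_metrics` and the 3-class example matrix are made up for demonstration and are not taken from the paper's results.

```python
def per_class_metrics(confusion):
    """Compute recall, precision and F1 for every class.

    confusion[i][j] is the number of samples of true class i
    that were predicted as class j.
    """
    num_classes = len(confusion)
    results = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                                  # row minus diagonal
        fp = sum(confusion[r][c] for r in range(num_classes)) - tp   # column minus diagonal
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = recall + precision
        f1 = 2 * recall * precision / denom if denom else 0.0
        results.append({"recall": recall, "precision": precision, "f1": f1})
    return results


# Made-up 3-class confusion matrix, not taken from the paper's results.
example = [
    [40, 1, 0],
    [0, 41, 0],
    [2, 0, 39],
]
for digit, m in enumerate(per_class_metrics(example)):
    print(digit, {k: round(v, 3) for k, v in m.items()})
```

Recall is read along a row (all samples of the true class), while precision is read down a column (all samples predicted as that class).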

Results
The confusion matrices (on the test set) for both models are shown in Figure 8. It can be observed from Figure 8a that the proposed CNN model correctly predicts (true positive) 100% of the test samples corresponding to the digits '1', '2', '3', '5', '6', '7' and '9'. Digits '0' and '8' are each correctly predicted for 98% of their test samples, and digit '4' is correctly predicted for 95% of its test samples. From Figure 8b, the MobileNetV2 model correctly predicts (true positive) 100% of the test samples of the digits '0', '1', '2', '5', '6' and '9'. Digits '3', '7' and '8' are each predicted correctly for 98% of their test samples, and digit '4' is predicted correctly for 93% of its test samples. Using the confusion matrices, all the other evaluation metrics are calculated. The 'Accuracy' metric for both models is shown in Table 7. The other evaluation metrics for both models are summarized in Table 8. As seen from Table 8, the 'Recall' metric of the proposed CNN model is equal to or better than that of the MobileNetV2 model for all digits except digit '0'. The 'Precision' metric of the proposed CNN model is better than that of the MobileNetV2 model, especially for digits '2', '4', '6' and '9'. The 'F1-score' metric of the proposed model is better than that of the MobileNetV2 model, especially for the digits '0', '2', '3', '6' and '8'. Figure 9 shows a few examples of labels predicted by the proposed model on the test data. The real-time predictions of the designed system under different lighting conditions are shown in Figure 10. It can be observed that the model predicts the authentication code correctly under both uniform and non-uniform lighting conditions.
Figure 10. Real-time predictions of the designed system under different lighting conditions, where the corresponding sign language images were captured and processed using the edge computing device deployed in this work, and the final code generated using the proposed CNN is given as the last image in each row.

Discussion
The results presented in this work indicate that contactless authentication based on hand gestures using a Raspberry Pi-based system combined with deep learning is feasible. The proposed deep learning model is also compared to the popular MobileNetV2, and the comparison is summarized in Table 6. It can be observed from Table 6 that the proposed model is considerably smaller than the MobileNetV2 model for both TensorFlow and TensorFlow Lite. On the other hand, the inference time of the proposed CNN model is slightly higher than that of MobileNetV2 due to the bottlenecks applied. With these bottlenecks, the computation pattern of compressed models becomes more irregular, leading to severe data locality and load balancing issues, which in turn increase the inference time (latency).
One example use case for the developed system is entering an authentication PIN without touching the keypad. As both sides of a typical ATM usually have enclosures, the digit signs (hand gestures) will not be visible to others, and the whole authentication can be a low-cost solution. As shown in Figure 10, the code generation is robust to lighting conditions and achieves good accuracy in both uniform and non-uniform lighting conditions. The Raspberry Pi 4 Model B utilized as the edge computing device in this work can provide various modes of output to feed the code to security systems. The developed models, in both TensorFlow and TensorFlow Lite versions, are available as open source at https://github.com/aveen-d/sign_detection.

Limitations
The number of samples of each class (0-9) in the utilized hand gesture sign recognition dataset is limited. Nevertheless, the performance of the proposed deep learning model is promising, as discussed in the Results section. The dataset can also be extended with additional classes in future work. Motion artifacts caused during image acquisition might slightly reduce the performance of the model. The limited field of view of the camera and the practical positioning of the hands in front of the camera may also affect the performance.

Conclusions
This work presented the design of a complete end-to-end system that can utilize the hand gestures of sign language digits to authenticate the user contactlessly at public and business places, such as ATMs, gas stations, and shopping malls. More specifically, we have developed a convolutional neural network (CNN) that provides an authentication code based on camera data, making the system truly contactless. The whole pipeline, including data capture, inference of the deep learning model, and conversion of the image data to an authentication code, was performed on a Raspberry Pi 4 Model B single-board computer with an attached camera (total cost being less than $75), serving as a complete solution that is highly suitable for large-scale deployment. The designed system achieved an accuracy of 99.1% on the test data set, compared to 98.5% for the popular MobileNetV2 model, despite the proposed CNN being 2.35 × lighter in terms of the number of parameters. The developed system works in real time with a model inference time of 280 ms per image frame and can be an alternative to conventional authentication methods that use touchpads and keypads. In the future, the dataset can be extended with additional classes such as "alphabets", "accept", "close", and "go back" to further reduce the need for contact with surfaces when navigating through web pages or applications in these secure systems.