# Design and Implementation of Deep Learning Based Contactless Authentication System Using Hand Gestures

^{1}

^{2}

^{3}

^{*}

## Abstract

**:**

## 1. Introduction

## 2. Materials and Methods

#### 2.1. Dataset

#### 2.2. Proposed End to End System for Contactless Authentication

#### 2.3. Lightweight Deep Learning Models for Hand Gestures Recognition

#### 2.3.1. Mobilenetv2

- Feature extraction: In this method, we use the MobileNetV2 model pre-trained on the ImageNet dataset to work as a feature extractor. We do not include the final dense layer, ‘classification layer’ of the MobileNetV2 model because the number of classes for the ImageNet dataset is different from the ‘Sign Language Digit Dataset’ we use. So in total, we use 154 pre-trained layers of the MobileNetV2 model as the base model. The output from this base model is the learned features which is a 4 dimensional tensor of size [None, 8, 8, 1280]. We then flatten the output of the base model into a 2-dimensional matrix of size [None, 1280] using ‘Global Average Pooling 2D function’ [39]. After this layer we add a Dense Layer which is the classification layer for the dataset we use. Therefore using the Feature extraction method we do not train the base model (MobileNetV2, excluding its final layer) instead, we use the base model to extract features from the input sample and then use these extracted features as input to the dense layer (the added classification layer according to the ‘Sign Language Digit Dataset’) and just train the dense layer on our dataset. The following hyper-parameter values are used to train the MobileNetV2 model using this method, ‘RMS Prop optimizer’ [40] with a learning rate of 0.0001, batch size of 32 for a total of 10 epochs. The architecture and the total number of parameters (model weights) used for training the MobileNetV2 model using this method are shown in Table 2 and Table 3 respectively.
- Fine-tuning: In this method, we use the same model architecture as the above method, the MobileNetV2 model (excluding its classification layer) pre-trained on the ImageNet dataset is the base model and an added dense layer (classification layer according to our dataset). The only difference is that in this method, we train a few layers of the base model along with the final added dense layer on the ‘Sign Language Digit Dataset’. So in total, we freeze the first ‘99’ layers and start training from the layer ‘100’ to layer ‘154’ of the base model along with the final added dense layer on the ‘Sign Language Digit Dataset’. Therefore in this method, the number of trainable parameters is more as opposed to the above method. The following are the hyper-parameter values used for training the MobileNetV2 model using this method, ‘RMS Prop optimizer’ with a learning rate of 0.00001, batch size of 32 for a total of 25 epochs. The architecture and the total number of parameters used for training the MobileNetV2 model using this method are shown in Table 2 and Table 3 respectively.

#### 2.3.2. Proposed Model

#### 2.4. Deployment on Edge Computing Device

#### 2.5. Performance Evaluation Metrics

#### 2.5.1. Confusion Matrix

#### 2.5.2. True Positive

#### 2.5.3. True Negative

#### 2.5.4. False Positive

#### 2.5.5. False Negative

#### 2.5.6. Accuracy

#### 2.5.7. Recall

#### 2.5.8. Precision

#### 2.5.9. F1-Score

## 3. Results

## 4. Discussion

## 5. Limitations

## 6. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

- Jain, A.K.; Ross, A.; Prabhakar, S. An introduction to biometric recognition. IEEE Trans. Circuits Syst. Video Technol.
**2004**, 14, 4–20. [Google Scholar] [CrossRef] [Green Version] - Yahya, F.; Nasir, H.; Kadir, K.; Safie, S.; Khan, S.; Gunawan, T. Fingerprint Biometric Systems. Trends Bioinform.
**2016**, 9, 52–58. [Google Scholar] [CrossRef] [Green Version] - Zhao, W.; Chellappa, R.; Phillips, P.J.; Rosenfeld, A. Face Recognition: A Literature Survey. ACM Comput. Surv.
**2003**, 35, 399–458. [Google Scholar] [CrossRef] - Zhang, D.; Lu, G.; Kong, A.W.K.; Wong, M. Palmprint Authentication System for Civil Applications. In Biometric Authentication; Maltoni, D., Jain, A.K., Eds.; Springer: Berlin/Heidelberg, Germany, 2004; pp. 217–228. [Google Scholar]
- Mazumdar, J. Retina Based Biometric Authentication System: A Review. Int. J. Adv. Res. Comput. Sci.
**2018**, 9, 711–718. [Google Scholar] [CrossRef] [Green Version] - Chowdhury, R.; Ghosh, D.; Agarwal, P.; Samir, K.; Bandyopadhyay, S. Ear Based Biometric Authentication System. World J. Eng. Res. Technol.
**2018**, 2, 224–233. Available online: https://www.researchgate.net/publication/328610283_EAR_BASED_BIOMETRIC_AUTHENTICATION_SYSTEM (accessed on 12 December 2020). - Olatinwo, S.; Shoewu, O.; Omitola, O. Iris Recognition Technology: Implementation, Application, and Security Consideration. Pac. J. Sci. Technol.
**2013**, 14, 228–233. [Google Scholar] - Ali, M.; Tappert, C.; Qiu, M.; Monaco, V. Keystroke Biometric Systems for User Authentication. J. Signal Process. Syst.
**2017**, 86. [Google Scholar] [CrossRef] - Jabin, S.; Zareen, F. Biometric signature verification. Int. J. Biom.
**2015**, 7, 97. [Google Scholar] [CrossRef] - Gafurov, D. A Survey of Biometric Gait Recognition: Approaches, Security and Challenges. In Proceedings of the Norsk Informatikkonferanse; Oslo, Norway, 2007; Available online: https://www.researchgate.net/profile/Davrondzhon_Gafurov/publication/228577046_A_survey_of_biometric_gait_recognition_Approaches_security_and_challenges/links/00b49528e834aa68eb000000.pdf (accessed on 12 December 2020).
- Iannizzotto, G.; Rosa, F. A SIFT-Based Fingerprint Verification System Using Cellular Neural Networks. In Pattern Recognition Techniques, Technology and Applications; I-Tech: Vienna, Asutria, 2008. [Google Scholar] [CrossRef]
- Naidu, S.; Chemudu, S.; Satyanarayana, V.; Pillem, R.; Hanuma, K.; Naresh, B.; CH.Himabin, d. New Palm Print Authentication System by Using Wavelet Based Method. Signal Image Process.
**2011**, 2. [Google Scholar] [CrossRef] - Gu, J.; Wang, Z.; Kuen, J.; Ma, L.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X.; Wang, G.; Cai, J.; et al. Recent advances in convolutional neural networks. Pattern Recognit.
**2018**, 77, 354–377. [Google Scholar] [CrossRef] [Green Version] - Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE
**1998**, 86, 2278–2324. [Google Scholar] [CrossRef] [Green Version] - Sze, V.; Chen, Y.H.; Yang, T.J.; Emer, J. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE
**2017**, 105. [Google Scholar] [CrossRef] [Green Version] - Zulfiqar, M.; Syed, F.; Khan, M.; Khurshid, K. Deep Face Recognition for Biometric Authentication. In Proceedings of the 2019 International Conference on Electrical, Communication, and Computer Engineering (ICECCE), Swat, Pakistan, 24–25 July 2019. [Google Scholar] [CrossRef]
- Praseetha, V.; Bayezeed, S.; Vadivel, S. Secure Fingerprint Authentication Using Deep Learning and Minutiae Verification. J. Intell. Syst.
**2019**, 29. [Google Scholar] [CrossRef] - Shao, H.; Zhong, D. Few-shot palmprint recognition via graph neural networks. Electron. Lett.
**2019**, 55, 890–892. [Google Scholar] [CrossRef] - Aizat, K.; Mohamed, O.; Orken, M.; Ainur, A.; Zhumazhanov, B. Identification and authentication of user voice using DNN features and i-vector. Cogent Eng.
**2020**, 7, 1751557. [Google Scholar] [CrossRef] - Terrier, P. Gait Recognition via Deep Learning of the Center-of-Pressure Trajectory. Appl. Sci.
**2020**, 10, 774. [Google Scholar] [CrossRef] [Green Version] - Poddar, J.; Parikh, V.; Bharti, D. Offline Signature Recognition and Forgery Detection using Deep Learning. Procedia Comput. Sci.
**2020**, 170, 610–617. [Google Scholar] [CrossRef] - Minaee, S.; Abdolrashidi, A.; Su, H.; Bennamoun, M.; Zhang, D. Biometric Recognition Using Deep Learning: A Survey. arXiv
**2019**, arXiv:cs:CV/1912.00271. [Google Scholar] - Wu, K.; Chen, Z.; Li, W. A Novel Intrusion Detection Model for a Massive Network Using Convolutional Neural Networks. IEEE Access
**2018**, 6, 50850–50859. [Google Scholar] [CrossRef] - Kim, K.H.; Hong, S.; Roh, B.; Cheon, Y.; Park, M. Pvanet: Deep but lightweight neural networks for real-time object detection. arXiv
**2016**, arXiv:1608.08021. [Google Scholar] - Deng, B.L.; Li, G.; Han, S.; Shi, L.; Xie, Y. Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey. Proc. IEEE
**2020**, 108, 485–532. [Google Scholar] [CrossRef] - Dey, N.; Mishra, G.; Kar, J.; Chakraborty, S.; Nath, S. A Survey of Image Classification Methods and Techniques. In Proceedings of the International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), Kanyakumaru, India, 10–11 July 2014. [Google Scholar] [CrossRef]
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. (IJCV)
**2015**, 115, 211–252. [Google Scholar] [CrossRef] [Green Version] - Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. Master’s Thesis, Department of Computer Science, University of Toronto, Toronto, ON, Canada, 2009. Available online: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 13 December 2020).
- Parkhi, O.M.; Vedaldi, A.; Zisserman, A.; Jawahar, C.V. Cats and Dogs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012. [Google Scholar]
- Nilsback, M.E.; Zisserman, A. A Visual Vocabulary for Flower Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22 June 2006; Volume 2, pp. 1447–1454. [Google Scholar]
- LeCun, Y.; Cortes, C. MNIST Handwritten Digit Database. 2010. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 3 January 2021).
- Foret, P.; Kleiner, A.; Mobahi, H.; Neyshabur, B. Sharpness-Aware Minimization for Efficiently Improving Generalization. arXiv
**2020**, arXiv:cs.LG/2010.01412. [Google Scholar] - Wang, X.; Kihara, D.; Luo, J.; Qi, G.J. EnAET: Self-Trained Ensemble AutoEncoding Transformations for Semi-Supervised Learning. arXiv
**2019**, arXiv:cs.CV/1911.09265. [Google Scholar] - Kolesnikov, A.I.; Beyer, L.; Zhai, X.; Puigcerver, J.; Yung, J.; Gelly, S.; Houlsby, N. Big Transfer (BiT): General Visual Representation Learning. Comput. Vis. Pattern Recognit.
**2019**, 6, 8. [Google Scholar] - Byerly, A.; Kalganova, T.; Dear, I. A Branching and Merging Convolutional Network with Homogeneous Filter Capsules. arXiv
**2020**, arXiv:abs/2001.09136. [Google Scholar] - Zeynep Dikle, A.M.; Students, T.A.A.A.H.S. Sign Language Digits Dataset. 2017. Available online: https://github.com/ardamavi/Sign-Language-Digits-Dataset (accessed on 2 June 2020).
- Gavade, A.; Sane, P. Super Resolution Image Reconstruction By Using Bicubic Interpolation. In Proceedings of the ATEES 2014 National Conference, Belgaum, India, October 2013. [Google Scholar] [CrossRef]
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
- Lin, M.; Chen, Q.; Yan, S. Network in Network. arxiv
**2013**, arXiv:cs:CV/1312.4400. [Google Scholar] - Tieleman, T.; Hinton, G. Lecture 6.5—RmsProp: Divide the Gradient by a Running Average of Its Recent Magnitude. COURSERA: Neural Networks for Machine Learning, 2012. Available online: http://www.cs.toronto.edu/~hinton/coursera/lecture6/lec6.pdf (accessed on 4 January 2021).
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv
**2015**, arXiv:1502.03167. [Google Scholar] - Keras: The Python Deep Learning Library. Available online: http://ascl.net/1806.022 (accessed on 4 January 2021).
- Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar]
- Ghori, K.M.; Abbasi, R.A.; Awais, M.; Imran, M.; Ullah, A.; Szathmary, L. Performance Analysis of Different Types of Machine Learning Classifiers for Non-Technical Loss Detection. IEEE Access
**2020**, 8, 16033–16048. [Google Scholar] [CrossRef]

**Figure 1.**Sample images of the dataset [36]. The number below each image represents the decoded sign.

**Figure 2.**The classification (digit recognition) task used in the proposed end to end system for contactless authentication. The architecture details of the deep learning model are shown in Figure 4 and in Table 5.

**Figure 3.**Comparison between the proposed bottleneck and the bottleneck of MobileNetV2. For convolution with stride 2, the bottleneck of MobileNetV2 does not have the residual connection. The details of the proposed bottleneck are given in Section 2.3.2.

**Figure 4.**Flow diagram of the proposed deep learning model for hand gesture recognition. Layer wise details along with the sizes of the feature maps are shown in Table 5.

**Figure 5.**Sample example showing 3 × 3 convolution on 7 × 7 patch with dilation rates of 1 and 2 utilized in the proposed CNN.

**Figure 7.**Workflow diagram for the contactless authentication code generation. Note that ‘n’ is the length of the code (n = 4 here).

**Figure 9.**Predictions (given on top of images) of the proposed CNN in classifying the given test samples from the sign dataset.

**Figure 10.**Four Numbers are predicted one after another in real-time under different natural lighting conditions (uniform lighting condition in first row and non uniform lighting condition in second row), where the corresponding sign language images were captured and processed using edge computing device deployed in this work, and the final code that was generated using proposed CNN given as the last image in each row.

Class | Number of Samples | Number of Training Samples | Number of Validation Samples | Number of Testing Samples |
---|---|---|---|---|

0 | 205 | 123 | 41 | 41 |

1 | 206 | 124 | 41 | 41 |

2 | 206 | 124 | 41 | 41 |

3 | 206 | 124 | 41 | 41 |

4 | 207 | 124 | 42 | 41 |

5 | 207 | 124 | 42 | 41 |

6 | 207 | 124 | 42 | 41 |

7 | 206 | 124 | 41 | 41 |

8 | 208 | 125 | 42 | 41 |

9 | 204 | 122 | 41 | 41 |

Total | 2062 | 1238 | 414 | 410 |

**Table 2.**The architecture used to train MobileNetV2 model using feature extraction and fine-tuning methods.

Layer Type | Output Size |
---|---|

MobileNetV2 (Base Model) | (None, 8, 8, 1280) |

Global_Average_Pooling | (None, 1280) |

Output | (None, 10) |

**Table 3.**Comparison of transfer learning methods on MobileNetV2 model utilized in this work Section 2.3.1 in terms of parameters.

Method | Total no. of Parameters | Trainable | Non-Trainable |
---|---|---|---|

Feature extraction method | 2,270,794 | 12,810 | 2,257,984 |

Fine-tuning method | 2,270,794 | 2,236,682 | 34,112 |

**Table 4.**Architecture details of the proposed Bottleneck within proposed CNN presented in Table 5. Note that ‘M’ is spatial extent and ‘Z’ is depth of the feature maps.

Layer Type | Operation | Output Size |
---|---|---|

Input | - | (None, M, M, Z) |

1 × 1 Conv | Depth Squeezing | (None, M, M, Z/4) |

3 × 3 Conv | Feature Extraction | (None, M, M, Z/4) |

1 × 1 Conv | Depth Squeezing | (None, M, M, Z) |

Add | Addition | (None, M, M, Z) |

1 × 1 Conv | Depth Squeezing | (None, M, M, Z/4) |

3 × 3 Dilated Conv | Feature Extraction | None, M, M, Z/4) |

1 × 1 Conv | Depth Stretching | (None, M, M, Z) |

Add | Addition | (None, M, M, Z) |

Output | - | (None, M, M, Z) |

**Table 5.**Architecture details of the proposed CNN for hand gesture recognition. Note that each Conv layer represents 3 ×3 convolution, followed by ReLU activation along with batch normalization. The flow of the architecture is from left to right in each row (top to bottom for successive step). The details of the bottleneck are given in Table 4.

Layer Type | Input | Conv | Conv | Bottleneck |

Output Size | (None, 256, 256, 3) | (None, 127, 127, 64) | (None, 63, 63, 128) | (None, 63, 63, 128) |

Layer Type | Conv | Bottleneck | Conv | Bottleneck |

Output Size | (None, 31, 31, 256) | (None, 31, 31, 256) | (None, 15, 15, 128) | (None, 15, 15, 128) |

Layer Type | Conv | Bottleneck | Conv | Flatten |

Output Size | (None, 7, 7, 64) | (None, 7, 7, 64) | (None, 3, 3, 16) | (None, 144) |

Layer Type | Dense | - | - | - |

Output Size | (None, 10) | - | - | - |

**Table 6.**Comparison of Deep learning models considered in this work for deploying in the edge computing device (Raspberry pi 4 model B). Note that notation wise, TF represents Tensorflow and TFL represents’ Tensorflow-Lite.

Model | MobileNetV2 | Proposed |
---|---|---|

Tag | ||

Parameters | 2.27 Million | 0.97 Million |

TF Model Size | 11.1 MB | 6 MB |

TFL Model Size | 8.5 MB | 4 MB |

TFL Inference time | 212 ms | 280 ms |

Model | Accuracy |
---|---|

Proposed CNN | 99.1% |

MobileNetV2 | 98.5% |

**Table 8.**Recall, Precision and F1-score evaluation metric values for both the proposed model and MobileNetV2 model.

Class | Recall | Precision | F1-score | |||
---|---|---|---|---|---|---|

Proposed CNN | MobileNetV2 | Proposed CNN | MobileNetV2 | Proposed CNN | MobileNetV2 | |

0 | 0.98 | 1.0 | 1.0 | 1.0 | 0.99 | 1.0 |

1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |

2 | 1.0 | 1.0 | 1.0 | 0.98 | 1.0 | 0.99 |

3 | 1.0 | 0.98 | 1.0 | 1.0 | 1.0 | 0.99 |

4 | 0.95 | 0.93 | 0.97 | 1.0 | 0.96 | 0.96 |

5 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |

6 | 1.0 | 1.0 | 0.98 | 0.93 | 0.99 | 0.97 |

7 | 1.0 | 0.98 | 1.0 | 1.0 | 1.0 | 0.99 |

8 | 0.98 | 0.98 | 1.0 | 0.98 | 0.99 | 0.98 |

9 | 1.0 | 1.0 | 0.95 | 0.98 | 0.98 | 0.99 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Dayal, A.; Paluru, N.; Cenkeramaddi, L.R.; J., S.; Yalavarthy, P.K.
Design and Implementation of Deep Learning Based Contactless Authentication System Using Hand Gestures. *Electronics* **2021**, *10*, 182.
https://doi.org/10.3390/electronics10020182

**AMA Style**

Dayal A, Paluru N, Cenkeramaddi LR, J. S, Yalavarthy PK.
Design and Implementation of Deep Learning Based Contactless Authentication System Using Hand Gestures. *Electronics*. 2021; 10(2):182.
https://doi.org/10.3390/electronics10020182

**Chicago/Turabian Style**

Dayal, Aveen, Naveen Paluru, Linga Reddy Cenkeramaddi, Soumya J., and Phaneendra K. Yalavarthy.
2021. "Design and Implementation of Deep Learning Based Contactless Authentication System Using Hand Gestures" *Electronics* 10, no. 2: 182.
https://doi.org/10.3390/electronics10020182