1. Introduction
Communication is an essential part of life; without it, life would be very difficult. Every living being communicates in its own way. We, as human beings, usually communicate by speaking a language. There are exceptions, however: people who are deaf or hard of hearing use signs to communicate among themselves (i.e., deaf to deaf or deaf to hard of hearing). Over time, these signs evolved into a language. Like all other languages, American Sign Language (ASL) has its own syntax and semantics [
1,
2]. One must follow its syntax and semantics to communicate correctly and efficiently, and for communication to be successful, it is important to understand what is being communicated. Most people without these disabilities are unaware of these signs, how to use them, or what they mean, largely owing to a lack of exposure. As a result, they struggle to communicate with deaf and hearing-impaired people.
ASL has its own grammar and culture, which differ from place to place. Hence, many versions of sign language exist around the world; French Sign Language (LSF), British Sign Language (BSL), and ASL are a few of the well-known ones. Because different locations use different signs for the same words, it is also important to know which signs to use in which area. For this research, we focus on ASL. ASL is a sign language used by deaf and hearing-impaired people in the United States and Canada, devised in part by Thomas Hopkins Gallaudet and Laurent Clerc based on sign language in France [
2,
3]. It is a visual–gestural language used by approximately 500,000 deaf or hearing-impaired people in North America. ASL has a specific sign for each letter of the English alphabet, as well as for many words. Once this letter- and word-to-sign mapping is known, ASL becomes much easier to understand.
Several existing studies focus on identifying this mapping. Researchers have developed wearable devices that help identify ASL signs [
4,
5]. A few research studies have also explored ways to identify ASL using Convolutional Neural Network (CNN) models and deep learning methodologies. Most of these studies focused on identifying fingerspelling, i.e., recognizing the signs for the letters of the English alphabet [
6,
7,
8]. However, ASL has a vast variety of signs for different words, and little work has been conducted on the identification of these signs.
We propose a method to identify word-level ASL using deep learning, a CNN model, and the rolling average prediction method. A rolling average, or moving average, is often used with CNN models to improve prediction accuracy [
9]. CNN models are well suited to image data, and an ASL video for a particular word is essentially a series of image frames showing different hand gestures [
10]. Hence, we train the CNN model with images capturing the temporal and spatial changes of hand gestures and use the trained model to predict the correct ASL word. The ResNet50 model is used as the base model to achieve this goal. The ASL videos are converted into image frames and pre-processed before being used to train the model. It is equally important to provide hearing people with a way to communicate with deaf people. To facilitate this, we also concentrated on producing an ASL fingerspelling video from their speech content. We hosted the trained model behind a FastAPI back-end. A user-friendly application was developed using the ReactJS framework, which takes user input either to produce an ASL fingerspelling video or to translate an ASL video into English. The user uploads an audio or video file to the application dashboard. This file is transferred via a REST API to the model in the FastAPI back-end. The API interprets the request and performs the appropriate translation using the trained model. The translation result is sent back to the application and displayed to the user.
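As a minimal sketch of the rolling-average step described above (the window size and the use of per-frame softmax vectors are illustrative assumptions, not the exact configuration used in this study):

```python
from collections import deque

import numpy as np

def rolling_average_predict(frame_probs, window=16):
    """Smooth per-frame class probabilities with a rolling (moving) average.

    frame_probs: iterable of 1-D softmax vectors, one per video frame.
    Returns the predicted class index for each frame after smoothing,
    which suppresses single-frame jitter in the predictions.
    """
    history = deque(maxlen=window)  # keep only the most recent frames
    predictions = []
    for probs in frame_probs:
        history.append(np.asarray(probs, dtype=float))
        mean_probs = np.mean(history, axis=0)  # average over the window
        predictions.append(int(np.argmax(mean_probs)))
    return predictions
```

With a small window, a single noisy frame that would flip the raw per-frame prediction is averaged away, so the predicted word stays stable across the video.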
We summarize the contributions of this research study as follows:
We used the ResNet50 model and the transfer learning concept to train our model to recognize and classify word-level ASL signs. We used rolling average prediction to capture the temporal changes present in the video and recognize the word without prediction jitter.
The proposed framework translates both ASL to English and English to ASL.
To showcase the framework, we developed a web application, which makes use of the trained CNN model to translate ASL to English and vice versa.
We generated a dataset consisting of images showing the hand gestures and facial expressions used by people to convey 2000 different words.
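The English-to-ASL fingerspelling direction mentioned above can, at its core, be reduced to mapping each letter of the recognized speech text to a stored sign clip and concatenating the clips into a video. The following is a hedged sketch; the `signs/<letter>.png` filenames are hypothetical placeholders, not the actual asset layout of our application:

```python
import string

# Hypothetical mapping from each letter to a stored sign image/clip.
SIGN_CLIPS = {letter: f"signs/{letter}.png" for letter in string.ascii_lowercase}

def text_to_fingerspelling(text):
    """Return the ordered list of sign clips spelling out `text`.

    Characters without a fingerspelling sign in this sketch (spaces,
    punctuation, digits) are skipped.
    """
    return [SIGN_CLIPS[ch] for ch in text.lower() if ch in SIGN_CLIPS]
```

The resulting clip sequence would then be rendered into the fingerspelling video returned to the user.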
The rest of the paper is organized as follows. In
Section 2, we discuss some existing works related to the classification of ASL. The details of the proposed approach are presented in
Section 3. Information on the ASL-to-English/English-to-ASL translation application developed is provided in
Section 4. In
Section 5, the results and evaluation of the proposed approach are discussed. Finally, in
Section 6 and
Section 7, we provide a discussion and conclusion, respectively.
2. Literature Review
Over the past several years, a good number of research studies have been conducted on interpreting ASL. Starner et al. proposed sign language recognition based on Hidden Markov Models (HMMs) [
6]. This study used a camera to track hand movement and identify hand gestures. They extracted features from the hand movements and fed them into a four-state HMM to identify the ASL words in sentences. They evaluated their work using a desk-mounted camera (second-person view) and a wearable camera (first-person view) on a 40-word lexicon. Similarly, Gaus and Wong [
11] presented a real-time hidden Markov model-based system that recognized ASL sentences by tracking the user's hands with a camera. The authors used a word lexicon, and their system employed a desk-mounted camera to observe the user's hands.
In [
7], Munib et al. proposed a method that does not require any wearable gloves or virtual markers to identify ASL. Their process is divided into two phases: feature extraction and classification. In the feature-extraction phase, features are extracted from the input images using the Hough transform. These features are then passed as input to a neural network classification model. Their work mainly focused on recognizing static signs. Several studies, such as [
8,
12,
13,
14] used CNN models to classify ASL alphabets. In a separate study, Garcia and Viesca [
8] used the transfer learning concept and developed the model using the Berkeley version of GoogLeNet. Most of these works concentrated on recognizing the ASL fingerspelling corresponding to the English alphabet and numbers [
6,
7,
13]. Furthermore, Rahman et al. [
12] used a CNN model to recognize ASL alphabets and numerals. Using a publicly available dataset, their study mainly focused on improving the performance of the CNN model and did not involve any human interaction to assess the accuracy of the approach. A similar approach was presented in [
15], where the authors used an ensemble classification technique to show performance improvement. In a separate study, Kasapbasi et al. [
16] used a CNN model to predict American Sign Language Alphabets (ASLA), and Bellen et al. [
17] focused on recognizing ASL-based gestures during video conferencing.
Ye et al. [
18] used a 3D recurrent convolutional neural network (3DRCNN) to recognize ASL signs from continuous videos. They also used a fully connected recurrent neural network (FC-RNN) to capture temporal information, and they were able to recognize ASL alphabets and several ASL words. In [
13,
18], the authors used 3D-CNN models to classify ASL. In [
13], the authors developed a 3D-CNN architecture consisting of eight layers and used multiple feature maps as inputs for better performance. The five features they considered were color-R, color-G, color-B, depth, and body skeleton. They achieved better prediction accuracy than the GMM-HMM model. In [
7], Munib et al. used images of signers' bare hands captured naturally; their goal was to develop an automatic translation system for ASL alphabets and signs, again using the Hough transform and a neural network to recognize the ASL signs.
In [
18], the authors proposed a hybrid model consisting of a 3D-CNN and a Fully Connected Recurrent Neural Network (FC-RNN). The 3D-CNN learns the RGB, motion, and depth channels, whereas the FC-RNN captures the temporal features in the video. They collected their own dataset consisting of sequence videos and sentence videos and achieved 69.2% accuracy. However, 3D-CNNs are a resource-intensive approach. Lichtenauer et al. [
19] proposed a hybrid approach to recognizing sign language that uses statistical dynamic time warping for temporal alignment, with the warped features classified by separate classifiers. This approach relied mainly on 3D hand motion features. Mahesh et al. [
20] tried to improve the performance of traditional approaches by minimizing the CPU processing time.
Several of these existing works focus on building applications that enable communication between deaf and hearing people [
20]. However, creating an app requires careful design; one must consider memory usage and other operations to enable a smooth user experience. Li et al. [
21] gathered a word-level ASL dataset and proposed an approach to recognize the signs. They concluded that more advanced learning algorithms are needed to handle the large dataset they created. In [
14,
22], the authors developed a means to convert ASL to text. They used a CNN model to identify the ASL sign and then converted the predicted label to text, concentrating mainly on generating text for fingerspelling rather than word-level signs. Garcia and Viesca [
8] focused on correctly classifying the handshapes for letters a–k rather than the full ASL alphabet. Another work presented in [
23] detected ASL signs and converted them to audio, and the authors of [
24] focused on constructing a corpus using the Mexican Sign Language (MSL).
After studying what has been achieved in existing work, we set the goal for this study: to develop a framework to translate English to ASL and vice versa. We understood that not all deaf people know English and that several existing works focused on CNN models and on improving computational performance. CNN models are a good choice for classifying ASL signs from image and video data, so we experimented with several CNN models before selecting the one that provided the best results: VGG16 [
14], 3D-CNN [
13], I3D [
21], 3DRCNN [
18], and ResNet50 (see
Table 1 for more details). Among these, the ResNet50 model provided the best performance; hence, it was used for training and testing our research data. The model processes input images through four stages of residual blocks. The next section presents the methodology used in this study, i.e., dataset collection, pre-processing, model training, model evaluation, and application development.
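A transfer-learning setup of this kind can be sketched as a frozen ResNet50 backbone with a new classification head, for example in Keras. The head sizes, dropout rate, and `weights=None` (pre-trained ImageNet weights would normally be loaded) are illustrative assumptions, not our exact configuration:

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

def build_asl_classifier(input_shape=(224, 224, 3), num_classes=2000):
    """ResNet50 backbone with a new softmax head, one class per ASL word."""
    base = ResNet50(include_top=False, weights=None,  # 'imagenet' in practice
                    input_shape=input_shape, pooling="avg")
    base.trainable = False  # freeze the pre-trained convolutional stages
    return models.Sequential([
        base,                                   # pooled (batch, 2048) features
        layers.Dense(512, activation="relu"),   # new trainable head
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
```

Only the head is trained initially; the frozen backbone supplies generic visual features, which is what makes training on a comparatively small ASL dataset feasible.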
Author Contributions
Conceptualization: M.A. and S.A.; methodology: M.A. and S.A.; software: V.D.A.; validation: V.D.A., M.A., S.A., L.B.N. and M.A.A.D.; formal analysis: V.D.A., M.A., S.A. and M.A.A.D.; investigation: V.D.A., M.A. and S.A.; resources: V.D.A., M.A. and S.A.; data curation: V.D.A.; writing—original draft preparation: V.D.A., M.A. and S.A.; writing—review and editing: V.D.A., M.A., S.A. and M.A.A.D.; visualization: V.D.A.; supervision: M.A. and S.A.; project administration: M.A. and S.A.; funding acquisition: M.A., S.A. and M.A.A.D. All authors have read and agreed to the published version of the manuscript.
Funding
This research was partly funded by the Pennsylvania State System of Higher Education (PASSHE) Faculty Professional Development Council (FPDC) grant.
Data Availability Statement
All data are publicly available; the data were downloaded from Kaggle.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Kuhn, J.; Aristodemo, V. Pluractionality, iconicity, and scope in French Sign Language. Semant. Pragmat. 2017, 10, 1–49. [Google Scholar] [CrossRef]
- Liddell, S.K. American Sign Language Syntax; Walter de Gruyter GmbH & Co KG: Berlin, Germany, 2021; Volume 52. [Google Scholar]
- Vicars, W.G. ASL—American Sign Language. Available online: https://www.lifeprint.com/asl101/pages-layout/lesson1.htm (accessed on 1 March 2023).
- Kudrinko, K.; Flavin, E.; Zhu, X.; Li, Q. Wearable sensor-based sign language recognition: A comprehensive review. IEEE Rev. Biomed. Eng. 2020, 14, 82–97. [Google Scholar] [CrossRef]
- Lee, B.; Lee, S.M. Smart wearable hand device for sign language interpretation system with sensors fusion. IEEE Sens. J. 2017, 18, 1224–1232. [Google Scholar] [CrossRef]
- Starner, T.; Weaver, J.; Pentl, A. Real-time american sign language recognition using desk and wearable computer based video. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1371–1375. [Google Scholar] [CrossRef]
- Munib, Q.; Habeeb, M.; Takruri, B.; Al-Malik, H.A. American sign language (ASL) recognition based on Hough transform and neural networks. Expert Syst. Appl. 2007, 32, 24–37. [Google Scholar] [CrossRef]
- Garcia, B.; Viesca, S.A. Real-time American sign language recognition with convolutional neural networks. Convolutional Neural Netw. Vis. Recognit. 2016, 2, 8. [Google Scholar]
- Kurian, E.; Kizhakethottam, J.J.; Mathew, J. Deep learning based surgical workflow recognition from laparoscopic videos. In Proceedings of the 2020 5th International Conference on Communication and Electronics Systems (ICCES), Coimbatore, India, 10–12 June 2020; pp. 928–931. [Google Scholar]
- Dabre, K.; Dholay, S. Machine learning model for sign language interpretation using webcam images. In Proceedings of the 2014 International Conference on Circuits, Systems, Communication and Information Technology Applications (CSCITA), Mumbai, India, 4–5 April 2014; pp. 317–321. [Google Scholar]
- Gaus, Y.F.A.; Wong, F. Hidden Markov Model-Based gesture recognition with overlapping hand-head/hand-hand estimated using Kalman Filter. In Proceedings of the 2012 Third International Conference on Intelligent Systems Modelling and Simulation, Kota Kinabalu, Malaysia, 8–10 February 2012; pp. 262–267. [Google Scholar]
- Rahman, M.M.; Islam, M.S.; Rahman, M.H.; Sassi, R.; Rivolta, M.W.; Aktaruzzaman, M. A new benchmark on american sign language recognition using convolutional neural network. In Proceedings of the 2019 International Conference on Sustainable Technologies for Industry 4.0 (STI), Dhaka, Bangladesh, 24–25 December 2019; pp. 1–6. [Google Scholar]
- Huang, J.; Zhou, W.; Li, H.; Li, W. Sign language recognition using 3d convolutional neural networks. In Proceedings of the 2015 IEEE international conference on multimedia and expo (ICME), Turin, Italy, 29 June–3 July 2015; pp. 1–6. [Google Scholar]
- Thakar, S.; Shah, S.; Shah, B.; Nimkar, A.V. Sign Language to Text Conversion in Real Time using Transfer Learning. In Proceedings of the 2022 IEEE 3rd Global Conference for Advancement in Technology (GCAT), Bangalore, India, 7–9 October 2022; pp. 1–5. [Google Scholar]
- Chung, H.X.; Hameed, N.; Clos, J.; Hasan, M.M. A Framework of Ensemble CNN Models for Real-Time Sign Language Translation. In Proceedings of the 2022 14th International Conference on Software, Knowledge, Information Management and Applications (SKIMA), Phnom Penh, Cambodia, 2–4 December 2022; pp. 27–32. [Google Scholar]
- Kasapbaşi, A.; Elbushra, A.E.A.; Omar, A.H.; Yilmaz, A. DeepASLR: A CNN based human computer interface for American Sign Language recognition for hearing-impaired individuals. Comput. Methods Progr. Biomed. Update 2022, 2, 100048. [Google Scholar] [CrossRef]
- Enrique, M.B., III; Mendoza, J.R.M.; Seroy, D.G.T.; Ong, D.; de Guzman, J.A. Integrated Visual-Based ASL Captioning in Videoconferencing Using CNN. In Proceedings of the TENCON 2022-2022 IEEE Region 10 Conference (TENCON), Hong Kong, 1–4 November 2022; pp. 1–6. [Google Scholar]
- Ye, Y.; Tian, Y.; Huenerfauth, M.; Liu, J. Recognizing american sign language gestures from within continuous videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2064–2073. [Google Scholar]
- Lichtenauer, J.F.; Hendriks, E.A.; Reinders, M.J. Sign language recognition by combining statistical DTW and independent classification. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 2040–2046. [Google Scholar] [CrossRef]
- Mahesh, M.; Jayaprakash, A.; Geetha, M. Sign language translator for mobile platforms. In Proceedings of the 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Udupi, India, 13–16 September 2017; pp. 1176–1181. [Google Scholar]
- Li, D.; Rodriguez, C.; Yu, X.; Li, H. Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1459–1469. [Google Scholar]
- Patil, P.; Prajapat, J. Implementation of a real time communication system for deaf people using Internet of Things. In Proceedings of the 2017 International Conference on Trends in Electronics and Informatics (ICEI), Tirunelveli, India, 11–12 May 2017; pp. 313–316. [Google Scholar]
- Santon, A.L.; Margono, F.C.; Kurniawan, R.; Lucky, H.; Chow, A. Model for Detect Hand Sign Language Using Deep Convolutional Neural Network for the Speech/Hearing Impaired. In Proceedings of the 2022 International Conference on Informatics, Multimedia, Cyber and Information System (ICIMCIS), Jakarta, Indonesia, 16–17 November 2022; pp. 118–123. [Google Scholar]
- Trujillo-Romero, F.; García-Bautista, G. Mexican Sign Language Corpus: Towards an automatic translator. ACM Trans. Asian-Low-Resour. Lang. Inf. Process. 2023, 22, 1–24. [Google Scholar] [CrossRef]
- Kaggle. Available online: https://www.kaggle.com/ (accessed on 13 June 2023).
- Hashemi, M. Web page classification: A survey of perspectives, gaps, and future directions. Multimed. Tools Appl. 2020, 79, 11921–11945. [Google Scholar] [CrossRef]
- Lee, C.S.; Baughman, D.M.; Lee, A.Y. Deep learning is effective for classifying normal versus age-related macular degeneration OCT images. Ophthalmol. Retin. 2017, 1, 322–327. [Google Scholar] [CrossRef] [PubMed]
- Du, L. How much deep learning does neural style transfer really need? An ablation study. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020; pp. 3150–3159. [Google Scholar]
- Mehta, T.I.; Heiberger, C.; Kazi, S.; Brown, M.; Weissman, S.; Hong, K.; Mehta, M.; Yim, D. Effectiveness of radiofrequency ablation in the treatment of painful osseous metastases: A correlation meta-analysis with machine learning cluster identification. J. Vasc. Interv. Radiol. 2020, 31, 1753–1762. [Google Scholar] [CrossRef] [PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).