Every normal human being has been granted a precious, irreplaceable gift: the ability to express themselves by responding to events in their surroundings, observing, listening, and then reacting through speech [1]. Unfortunately, some people lack this precious gift, which creates a massive gap between normal human beings and disadvantaged ones [1]. Because communication is a necessary element of everyday life, deaf/mute individuals must be able to communicate as normally as possible with others.
Communication is a tedious task for people who have hearing and speech impairments. Hand gestures, which involve the movement of the hands, are used as sign language for natural communication between ordinary people and deaf people, just as speech is for vocal people [4]. Nonetheless, sign languages differ by country and are used for a variety of purposes; they include American Sign Language (ASL), British Sign Language (BSL), Japanese Sign Language [10], and Turkish Sign Language (TSL) [12]. This project focuses on Bahasa Isyarat Malaysia (BIM), also known as Malaysian Sign Language (MSL). BIM began its journey with the founding of a deaf school in Penang, the Federation School for the Deaf (FSD), in 1954. Studies have revealed that indigenous sign words arose through gestural communication amongst deaf students at the FSD outside their classroom. With the aim of educating deaf students, American Sign Language (ASL) was introduced in Johor in 1964, while Kod Tangan Bahasa Malaysia (KTBM) became established in Penang in 1978, when Total Communication was introduced into education for deaf students [13]. BIM has been the main form of communication amongst the deaf population in Malaysia since it was first developed [14].
Communication is a vital aspect of everyday life; deaf/mute individuals must communicate as normally as possible with others [9]. The inability to speak is considered a serious problem [17], because deaf/mute people cannot clearly understand the words of normal people and, hence, cannot answer them [17]. This inability to express oneself verbally creates a significant disadvantage and, thus, a communication gap between the deaf/mute community and normal people [1]. The deaf/mute population, i.e., sign language speakers, experience social integration challenges [4], and they constantly feel helpless because no one understands them and vice versa. This major humanitarian issue requires a specialised solution. Deaf/mute individuals face difficulties connecting with the community [3], particularly those who were denied the blessing of hearing before developing spoken language and learning to read and write [3].
Traditionally, the deaf/mute communicate with normal people through a human translator who aids the discussion. However, this can be challenging: human translators are scarce [7], they might not always be accessible [19] to the deaf/mute, and paying for them can be expensive. It also makes such persons dependent on interpreters [2]. The procedure may also be relatively slow, making conversation between deaf/mute and normal people seem unnatural and tiring, which indirectly causes a lack of engagement in social activities [2]. Correspondingly, as previously stated, the deaf/mute use sign language to communicate with others. This poses a challenge when the deaf/mute are required to communicate with normal people, who must then be proficient in sign language, which only a minority of people learn and understand [19].
In addition, gesture recognition is a challenging undertaking because of the substantial variation in gesture form and meaning across cultures, situations, and people. This heterogeneity makes it difficult to create precise and trustworthy gesture recognition models. The most important factors that influence gesture recognition are as follows: (i) gestures vary in speed, amplitude, duration, and spatial placement, which can make it challenging to identify them consistently [20]; (ii) gestures can mean a variety of things depending on the situation, the culture, and the individual's perception; (iii) interfering modalities: speech, facial expressions, and other nonverbal cues can be used in conjunction with gestures and affect how they are perceived [21]; (iv) individual variation: different gesturing styles can influence how accurately recognition models work [20]; (v) the distinctions between spoken languages and sign languages present extra difficulties for sign language recognition [22]; (vi) the unique grammar, syntax, and vocabulary of sign languages can make it difficult to translate them effectively into written or spoken language; and (vii) regional and cultural variation further complicates sign language recognition.
Undoubtedly, advancements in technology, such as smartphones that can be used to make calls or send messages, have significantly improved people's quality of life. This includes the numerous assistive technologies available to the deaf/mute, such as speech-to-text and speech-to-visual technologies and sign language tools, which are portable and simple. Several applications are accessible today; however, each has certain restrictions [16]. Additionally, there is a shortage of good smartphone programs that support sign language translation [14] between deaf/mute and normal people. Therefore, despite the tremendous benefits of cutting-edge technologies, deaf/mute people cannot fully benefit from them. Moreover, most Malaysians are unfamiliar with BIM, and present platforms for translating sign language are inefficient, highlighting the limited capability of the market's existing mobile translation applications [16].
As previously stated, the smartphone is a dependable technology for connecting the deaf/mute with normal people. Thus, this project intends to design and build a two-way communication system, a Bahasa Isyarat Malaysia (BIM) application, that allows deaf/mute and normal users to engage freely. The deaf/mute community will benefit from the sign-language-to-text module, while the normal community will benefit from the speech-to-text module. This application will make it simpler for deaf/mute individuals to converse with normal people and vice versa, decreasing the time spent communicating. It will also be advantageous when the deaf/mute attend a meeting or a ceremony, where they can easily interpret the speech using the Android application without the assistance of a translator.
Today, there are many applications available for deaf/mute individuals to communicate with non-deaf/mute individuals. Despite all the benefits of state-of-the-art technology, each application has certain limitations [23], and there are few BIM mobile translation applications on the market [7]. This is because BIM is little known amongst Malaysians, and existing sign language translation platforms are inefficient, not to mention the incomplete functionality of existing mobile translation applications on the market [16]. Therefore, people with hearing and speech disabilities cannot fully benefit from them [24]. The main challenge in this study is the availability of a recognisable character database: existing databases, especially BIM databases, are often provided without adequate standards for image resolution, structure, and compression [25]. Hence, this project aims to reduce the communication gap between deaf and normal people by enabling easy communication through an Android application. The project can also eliminate the need to hire a human translator, making communication substantially more cost-effective while enabling quicker and more engaging conversations. Finally, this app can increase the use of Bahasa Isyarat Malaysia, boosting the recognition of this language in Malaysian society.
2. Related Work
Bahasa Isyarat Malaysia (BIM), also known as Malaysian Sign Language (MSL), was initially developed in 1998, shortly after the Malaysian Federation of the Deaf was founded. One related paper aims to create a mobile application that bridges the communication gap between hearing people and the deaf–mute community by assisting the community in learning BIM. In that work, a survey of possible consumers was conducted as the methodology. The target populations were Universiti Tenaga Nasional (UNITEN) students and Bahasa Isyarat Malaysia Facebook Group (BIMMFD) members. Multiple-choice, open-ended, and dichotomous items were included in the surveys. The research demonstrates that the software is considered helpful for society and suggests creating a more user-friendly and accessible way to study and communicate in BIM using the app.
The current state of the art, with modern and more efficient gesture recognition methods, has been discussed in several papers. In [26], the authors introduced two deep-neural-network-based models: one for audio–visual speech recognition (AVSR) using the Lip Reading in the Wild (LRW) dataset and one for gesture recognition using the Ankara University Turkish Sign Language Dataset (AUTSL). That paper combines visual and acoustic features through fusion approaches, achieving 98.56% accuracy and demonstrating the possibility of recognising speech and gestures on mobile devices. The authors of [27] trained models on datasets from different sign languages (Word-Level American Sign Language (WLASL), AUTSL, and Russian Sign Language (RSL)) using the Video Swin Transformer and MViT to improve sign recognition quality and demonstrate the possibility of real-time sign language recognition without GPUs, with the potential to benefit speech- or hearing-impaired individuals. In contrast, this paper focuses on the development of BIM letter and word recognition using SSD-MobileNet-V2 FPNLite and the COCO mAP metric.
2.1. SSD-MobileNet-V2 FPNLite
SSD-MobileNet-V2 can recognise multiple objects in a single image or frame. The model detects each object's position, producing the object's name and bounding box. Ninety different object classes can be recognised using the pre-trained SSD-MobileNet model.
Due to the elimination of bounding box proposals, Single-Shot Multibox Detector (SSD) models run faster than R-CNN models. The processing speed of detection and the model size were the deciding factors in the choice of the SSD-MobileNet-V2 model. As demonstrated in Table 1, the model requires 320 × 320 input photos and detects objects and their locations in those images in 19 milliseconds, whereas other models require more time; for example, SSD-MobileNet-V1-COCO, the second-fastest model, takes longer to categorise objects in a picture, followed by SSD-MobileNet-V2-COCO, the third-fastest model, and so on. SSD-MobileNet-V2 320 × 320 is the most recent MobileNet model for Single-Shot Multibox detection. It is optimised for speed at a very low cost in accuracy, with a mean average precision (mAP) difference of only 0.8 compared to the second-fastest model, SSD-MobileNet-V1-COCO [28].
2.2. TensorFlow Lite Object Detection
An open-source deep learning framework called TensorFlow Lite was created for devices with limited resources, such as mobile devices and Raspberry Pi modules. TensorFlow Lite enables the use of TensorFlow models on mobile, embedded, and Internet of Things (IoT) devices. It allows for low-latency, small-binary-size on-device machine learning inference; as a result, latency and power consumption are reduced [28].
TensorFlow Lite was explicitly created for edge-based machine learning. It enables various resource-constrained edge devices, such as smartphones, microcontrollers, and other circuits, to run multiple lightweight algorithms [29].
An open-source machine learning tool called the TensorFlow Object Detection API is utilised in many different applications and has recently grown in popularity. When using the TensorFlow Object Detection API, an implicit assumption is that it will be provided with noise-free or benign datasets. However, in the real world, datasets can contain inaccurate information due to noise, naturally occurring adversarial objects, adversarial tactics, and other flaws. Therefore, for the API to handle real-world datasets, it needs to undergo thorough testing to increase its robustness and capabilities [30].
Another paper defines object detection as a computer technology, linked to computer vision and image processing, that detects instances of semantic objects (such as people, buildings, or cars) in digital photos and videos. Research areas for object detection include pedestrian and face detection.
Many computer vision applications require object detection, such as image retrieval and video surveillance. Applying this method on an edge device could enable tasks such as an autopilot [29].
2.3. MobileNets Architecture and Working Principle
Efficiency in deep learning is the key to designing a helpful tool that is feasible to use with as little computation as possible. There are various methods for addressing efficiency in deep learning, and MobileNet is one such approach. MobileNets reduce computation by factorising convolutions. The MobileNets architecture is built primarily from depth-wise separable filters: MobileNets factorise a standard convolution into a depth-wise convolution and a 1 × 1 (pointwise) convolution [31]. A standard convolution filters and combines inputs into a new set of outputs in one step. In contrast, a depth-wise separable convolution splits this into a filtering layer and a combining layer, drastically decreasing computation and model size.
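The saving can be made concrete with a quick multiply–accumulate count. The sketch below compares a standard convolution with its depth-wise separable factorisation for one illustrative layer shape; the kernel and channel sizes are example values, not taken from the paper's model:

```python
def conv_mults(k, m, n, f):
    """Multiply-accumulates of a standard k*k convolution with
    m input channels, n output channels, and an f*f feature map."""
    return k * k * m * n * f * f

def separable_mults(k, m, n, f):
    """Depth-wise (k*k per input channel) plus 1*1 pointwise convolution."""
    return k * k * m * f * f + m * n * f * f

# Illustrative layer: 3*3 kernel, 32 -> 64 channels, 112*112 feature map.
std = conv_mults(3, 32, 64, 112)
sep = separable_mults(3, 32, 64, 112)
print(std, sep, round(std / sep, 2))  # 231211008 29302784 7.89
```

The reduction factor is 1/N + 1/K², so a 3 × 3 depth-wise separable convolution uses roughly 8–9 times less computation than a standard convolution, which is what makes MobileNet-based detectors practical on phones.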
2.4. Android Speech-to-Text API
Google Voice Recognition (GVR) is a tool with an open API that converts the user's speech to readable text. GVR usually requires an internet connection from the user to the GVR server. GVR uses neural network algorithms to convert raw audio speech to text and works for several languages [32]. The tool uses two communication threads: the first thread receives the user's audio speech and sends it to the Google Cloud server, where it is converted into text and stored as strings; the second thread, which resides on the user's workstation, reads the strings returned by the server.
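The two-thread hand-off can be illustrated with a minimal producer/consumer sketch. This is only an analogy of the pattern described above; `recognise` is a placeholder for the cloud round trip, not a real API call:

```python
import queue
import threading

audio_chunks = queue.Queue()   # hand-off from thread 1 to thread 2
transcripts = []

def capture_thread():
    # Thread 1: pretend to record three audio chunks from the microphone.
    for chunk in ("chunk-1", "chunk-2", "chunk-3"):
        audio_chunks.put(chunk)
    audio_chunks.put(None)     # sentinel: no more audio

def recognise(chunk):
    # Placeholder for the cloud speech-to-text round trip.
    return f"text({chunk})"

def reader_thread():
    # Thread 2: read the converted strings back on the user's device.
    while True:
        chunk = audio_chunks.get()
        if chunk is None:
            break
        transcripts.append(recognise(chunk))

t1 = threading.Thread(target=capture_thread)
t2 = threading.Thread(target=reader_thread)
t1.start()
t2.start()
t1.join()
t2.join()
print(transcripts)  # ['text(chunk-1)', 'text(chunk-2)', 'text(chunk-3)']
```

The queue decouples audio capture from recognition, so the capturing thread is never blocked waiting for the server.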
Google Cloud Speech-to-Text, or the Cloud Speech API, is another tool for the speech-to-text feature. It has far more features than the standard Google Speech API; for example, it offers more than 30 voices in multiple languages and variants. However, it is not just a tool but a product made by Google, and users must subscribe and pay a fee to use it. Table 2 lists the advantages and disadvantages of these tools.
3. Materials and Methods
This project includes three main categories: BIM letters, BIM word hand gestures, and Android application development. These three categories are each divided into the database acquisition phase, the system design phase, and the system testing phase. The BIM sign language implemented uses static hand gestures, which involve capturing only a single image at the classifier's input.
3.1. BIM Letters
The first category, BIM letters, had three phases: the database acquisition phase, the system design phase, and the system testing phase. Phase 1: in the database acquisition phase, datasets were obtained from a deaf/mute teacher's datasets, Kaggle, and self-generated datasets. BIM datasets on Kaggle are limited; thus, ASL letters were used, with the letters G and T replaced by self-generated images. Phase 2: in the system design phase, TensorFlow/Keras was implemented as the deep learning framework to train on the dataset. Phase 3: in the system testing phase, the functionality was verified by generating a confusion matrix.
Collected data were processed for classification using a CNN model, in this case MobileNet, and the model was trained using 10% of the dataset for testing and 90% for training. Once the result was obtained, the model was converted to TensorFlow Lite to be imported into Android Studio for application development. The flow process is shown in Figure 1.
There are 29 classes in the dataset, comprising the letters plus delete, nothing, and space, which are beneficial for real-time applications. A total of 3000 images were collected per class, comprising 87,000 images, which were then resized to 200 px × 200 px before being provided as input, because smaller images allow faster training.
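The 90/10 split mentioned above can be sketched as a stratified split that holds out the same fraction of every class; the class and file names below are synthetic placeholders, not the actual dataset:

```python
import random

def split_per_class(images_per_class, test_frac=0.10, seed=0):
    """Split each class's image list into train/test sets (stratified split)."""
    rng = random.Random(seed)
    train, test = {}, {}
    for cls, images in images_per_class.items():
        shuffled = images[:]
        rng.shuffle(shuffled)            # shuffle before splitting
        n_test = int(len(shuffled) * test_frac)
        test[cls] = shuffled[:n_test]
        train[cls] = shuffled[n_test:]
    return train, test

# 29 classes * 3000 images each, as in the text (87,000 images in total).
data = {f"class_{i}": [f"img_{i}_{j}" for j in range(3000)] for i in range(29)}
train, test = split_per_class(data)
print(sum(map(len, train.values())), sum(map(len, test.values())))  # 78300 8700
```

Holding out the same fraction per class keeps the test set balanced, so per-class accuracy in the confusion matrix is comparable across letters.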
The system was tested to ensure its operation executed effectively using a confusion matrix, as seen in Figure 2. The confusion matrix consists of True Negatives, True Positives, False Positives, and False Negatives for the two classes, where zero (class 0) means false and one (class 1) means true. True Positives and True Negatives are samples that the model classified correctly, whereas False Positives and False Negatives are samples assigned to the wrong class and therefore count the incorrect predictions.
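These four counts can be computed directly from the true and predicted labels; the following is a generic binary-classification sketch with made-up labels, not the project's code:

```python
def confusion_counts(y_true, y_pred):
    """Binary confusion matrix counts: (TP, TN, FP, FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(tp, tn, fp, fn, accuracy)  # 3 3 1 1 0.75
```

Overall accuracy is (TP + TN) divided by the total number of test samples.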
3.2. BIM Word Hand Gestures
The dataset includes five classes: three from the family (keluarga) category, containing the words brother (abang), father (bapa), and mother (emak); one from feelings (perasaan), which is love (sayang); and one from pronouns (ganti nama), which is I (saya). Data were gathered and processed for classification using a CNN model. A pre-trained model from the TensorFlow 2 Detection Model Zoo was used to achieve the best accuracy. This process included splitting the data into 25% for testing and 75% for training. The model was converted to TensorFlow Lite and imported into Android Studio to construct the app. Database acquisition, system design, and system testing were the three steps that make up this category. The flow process for BIM word hand gestures is shown in Figure 1.
The datasets were self-generated: 100 images were captured for each class, for 500 images in total, with a size of 512 px × 290 px. The images were captured under different positions and light intensities, including varying distances from the camera and brightness levels. The pictures were also mirrored to acquire a greater variety of images. To label the images, labelImg was downloaded and used. This software generates an XML file for each labelled image so that it can be used with the TensorFlow Object Detection API. Figure 3 shows an example of selecting and labelling the hand gesture for brother (abang) using the labelImg software.
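labelImg writes Pascal-VOC-style XML annotations, which can be read back with the standard library. The annotation below is a made-up example of the file structure, with a hypothetical file name and box coordinates:

```python
import xml.etree.ElementTree as ET

# A minimal Pascal-VOC-style annotation of the kind labelImg writes
# (the file name and coordinates here are invented for illustration).
xml_text = """
<annotation>
  <filename>abang_001.jpg</filename>
  <size><width>512</width><height>290</height><depth>3</depth></size>
  <object>
    <name>abang</name>
    <bndbox>
      <xmin>120</xmin><ymin>40</ymin><xmax>360</xmax><ymax>250</ymax>
    </bndbox>
  </object>
</annotation>
"""

root = ET.fromstring(xml_text)
for obj in root.iter("object"):
    label = obj.findtext("name")
    box = obj.find("bndbox")
    coords = tuple(int(box.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax"))
    print(label, coords)  # abang (120, 40, 360, 250)
```

Each `object` element pairs a class name with one bounding box, which is exactly what the Object Detection API consumes after conversion.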
The pre-processed datasets were classified using the TensorFlow 2 Detection Model Zoo: the SSD-MobileNet-V2 FPNLite 320 × 320 model, with a reported speed of 22 ms per frame and a COCO mAP of 22.2, was used to determine the model's accuracy before being converted into TensorFlow Lite and exported to Android Studio. The collection of 500 images (512 px × 290 px) was split such that 25%, or 125 images, was utilised for testing and 75%, or 375 images, was used for training.
Once the training process was completed, the hand gestures were detected in real time using the TensorFlow Object Detection API; the SSD-MobileNet-V2 FPNLite 320 × 320 model from the TensorFlow 2 Detection Model Zoo, with a reported speed of 22 ms per frame and a COCO mAP of 22.2, was used as the pre-trained model. A pre-trained model is used because it was trained on a large dataset, which saves much more time than creating a model from scratch. COCO is an extensive dataset for object identification, segmentation, and captioning; since a larger COCO mAP is advised, other models may also be employed to recognise the objects correctly. TensorFlow Records (TFRecords), a binary file format for storing data, can be used; this helps speed up training for custom object detection, in this case hand gestures. The model was trained three times, with the number of steps set to 2000 and then 2500 to evaluate the model's accuracy.
3.3. Android Application
This application has features for converting speech to text, converting BIM letter hand gestures into letters that can form words, and converting BIM word hand gestures into text. The recognisers for BIM letters and word hand gestures were obtained from trained models converted into TensorFlow Lite. Android Studio was used to build the Android application. Users need to sign up and log in to the application to gain access to its features. To ensure the system functions properly, it was tested against the objectives of this project, which in turn allows the developer to improve the application. The flow process of the Android application for BIM recognition is presented in Figure 4.
For this phase of Android application development, two files, from the BIM letters and BIM word hand gestures, were included. To acquire these files, the trained models of BIM letters and BIM word hand gestures were converted into TensorFlow Lite files and used for application development.
The BIM letters, BIM word hand gestures, and Android speech-to-text features were developed using Android Studio. To enable real-time hand gesture detection in the application, the trained models of BIM letters and BIM word hand gestures were converted to TensorFlow Lite and imported into Android Studio. The speech-to-text capability was implemented in Android Studio by importing the SpeechRecognizer class, which gives access to the speech recognition service. This API's implementation involves sending audio to remote servers for speech recognition, such as converting microphone input to text.
In this project, the trained models were created with TensorFlow and converted into the TensorFlow Lite format. The converted models were then used to develop an Android app that analyses a live video stream and identifies objects using a machine learning model, in this case BIM letters and BIM word hand gestures.
This machine learning model detects objects, namely BIM hand gestures: it evaluates visual data in a prescribed manner to categorise components of the image as belonging to one of the set of classes it was trained to identify. The time a model takes to recognise a known object (also known as object prediction or inference) is frequently measured in milliseconds. In practice, the amount of data being processed, the size of the machine learning model, and the hardware hosting the model all affect how quickly inferences are made.
For the user’s Android application, there are a few stages and features that need to be fulfilled by the user, such as:
The user needs to turn on the internet connection.
The user needs to download and install the app on their smartphone.
The user needs to register to the app if they are a first-time user (input name, email address, and password).
The user needs to log in as a user with their successfully registered account (input name and password).
The user must allow the app to use the camera and record audio.
The implementation of the Android application that allows two-way communication between deaf/mute and normal people, which integrates Bahasa Isyarat Malaysia (BIM), consists of four main buttons that enable users to choose whether they want to use speech-to-text conversion, BIM letters to text conversion, BIM letters to create words conversion, and BIM word hand gestures to text conversion.
4. Results and Discussion
Three main categories make the application fully functional: BIM letters, BIM word hand gestures, and the development of the Android application itself. For BIM letters, the trained model achieved its highest accuracy of 99.78% using the MobileNet pre-trained model with a 10% test size and a 90% training size; the result was evaluated using a normalised confusion matrix. For BIM word hand gestures, using the TensorFlow 2 Detection Model Zoo with SSD-MobileNet-V2 FPNLite 320 × 320, the average precision was 61.60% after training three times with 2000 and 2500 steps. Lastly, for the development of the Android application, '2 CUBE' is the name of the application, which stands for '2 Cara Untuk BErkomunikasi dalam Bahasa Isyarat Malaysia' (two ways to communicate in Malaysian Sign Language). Furthermore, the application's features include speech-to-text conversion, and the trained models of BIM letters and BIM word hand gestures were converted to TensorFlow Lite for real-time hand gesture detection.
4.1. BIM Letters
Using the MobileNet pre-trained model, 29 BIM letter classes were trained and evaluated. Figure 5 displays a normalised confusion matrix for the trained model with the 10% test size and 90% training size. The diagonal elements represent the correctly predicted proportion for each class. The result demonstrates that the model accurately predicted all classes with approximately 99% accuracy.
4.2. BIM Word Hand Gestures
The training of BIM words using hand gestures was monitored using TensorBoard, as explained in Section 3.2, which reports the loss, learning rate, and steps per second. The first training run was set to 2000 steps, while the second and third runs were set to 2500 steps. Figure 6 shows the classification loss via TensorBoard, while Table 3 shows the loss, learning rate, and steps per second.
For the evaluation result, the model obtained 0.616, i.e., 61.60% average precision (AP), with intersection over union (IoU) thresholds between 0.50 and 0.95 over all datasets, with a maximum of 100 detections. The precision is not that high because the collected dataset is small: the laptop used for this project had limited capacity, and training on the CPU instead of a GPU required a lot of time. In addition, the hand gestures for father (bapa), mother (emak), and I (saya) are almost the same; hence, these were sometimes detected as the same class. For the average recall (AR), the model obtained a value of 0.670, or 67%, with IoU between 0.50 and 0.95 over all datasets, with a maximum of one detection. The evaluation results can be seen in Figure 7.
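The AP and AR figures above are averaged over IoU thresholds from 0.50 to 0.95. IoU itself is the overlap ratio between a predicted box and a ground-truth box, and can be sketched as:

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # overlap area (0 if disjoint)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A prediction shifted half a box-width from the ground truth:
print(iou((0, 0, 100, 100), (50, 0, 150, 100)))  # ~0.33, below a 0.50 threshold
```

A detection only counts as correct at a given threshold if its IoU with a ground-truth box of the same class reaches that threshold, which is why AP drops as the threshold approaches 0.95.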
The model accuracy was also estimated per class using test images: brother (abang) at 86%, father (bapa) at 88%, mother (emak) at 92%, I (saya) at 97%, and love (sayang) at 98%. When a live webcam was used to detect the hand gestures in real time, the accuracy was 83% for saya, 94% for sayang, and 93% for emak.
4.3. Development of Android Application
Figure 8a shows the launcher icon for the application, a graphic representing the mobile application; this icon appears on the user's home screen once the user downloads the application. The main page for this application is shown in Figure 8b, where users need to register before they can use the application. If the user already has a registered account, they can log in with it.
Figure 9a shows the user registration page, where users need to input their name, email address, and password before clicking the register button, and Figure 9b shows the login page, where the user must enter their registered email and password and click the login button. Figure 9c shows the home page after the user successfully logs in, with four clickable buttons offering different functions: BIM letter hand gestures to text, BIM letter hand gestures to create a word, BIM word hand gestures to text, and, lastly, speech-to-text conversion.
Figure 10 shows the page after clicking the BIM letter recognition button: Figure 10a shows that users need to click start camera (mulakan kamera) before using this feature, and Figure 10b shows that first-time users must allow the app to take pictures and record video before proceeding.
Figure 11 shows the BIM letters page once camera recognition has been allowed: Figure 11a shows the camera detecting the letter 'D' when the corresponding BIM hand gesture is shown, while Figure 11b shows the camera detecting the letter 'I' when that hand gesture is directed at the camera. As for Figure 11c, when the camera does not recognise the hand gesture shown, the app displays Tidak dapat dikesan, which means it cannot be detected.
Figure 12a shows the sidebar menu, where users can see their name and registered email address. The sidebar menu also includes four buttons with different features, as well as the option to sign out of the application when they no longer want to use it. Figure 12b shows the BIM combined letter page, where the user needs to click the start camera recognition button; this page also has add and clear buttons for combining the hand gestures shown or erasing unwanted letters.
Figure 13 shows the BIM combined letter page in use: Figure 13a shows the hand gesture for the letter 'B', which is added in the app by clicking the add (Tambah) button. Figure 13b shows the hand gesture for the letter 'C' being added, resulting in the word 'bilc', which is incorrect; therefore, the user clicks the delete (Padam) button to delete the letter 'C'. Lastly, Figure 13c shows the hand gesture for the letter 'A' after deleting the letter 'C'; hence, the resulting word is 'bila'.
Figure 14 shows the BIM word hand gesture page, where users can detect BIM word hand gestures by clicking the start camera recognition button. Figure 14a shows the BIM hand gesture being translated to brother (abang) as text, while Figure 14b shows mother (emak) being translated when the user displays the corresponding hand gesture. Lastly, Figure 14c shows I (saya) being translated from the hand gesture shown by the user.
Figure 15 shows the speech-to-text page: Figure 15a shows the main page once the user clicks the speech-to-text button. The microphone can then be clicked; a first-time user of the app must grant access to record audio, as shown in Figure 15b. Finally, Figure 15c shows the Semua kebenaran dibenarkan message, meaning the user has granted all permissions.
Figure 16 shows the speech-to-text feature in use: in Figure 16a, the user clicks the microphone icon and the Google speech recogniser pops up; the user can then talk, and the captured speech is detected and converted to text, as shown in Figure 16b. Users need to click the change (Tukar) button for the next speech-to-text conversion.
4.4. Analysis of Android Application
By selecting the speech-to-text, BIM letter recognition, BIM letters to construct a word, or BIM word hand gesture buttons in the BIM Android application, deaf/mute and normal people can communicate with one another.
This test was conducted by repeating the hand gesture of each BIM letter, captured by the phone camera, ten times; the accuracy results are tabulated in Table 4. The letters 'B', 'D', 'I', 'M', and 'V' achieved the highest accuracy over ten trials at 100%, while the lowest was the letter 'E', with 50% accuracy. All the other letters achieved an average accuracy above 50%.
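Per-letter accuracy here is simply the fraction of successful detections over the repeated trials. A small sketch, using hypothetical trial records rather than the paper's raw data:

```python
def trial_accuracy(outcomes):
    """Accuracy (%) over repeated recognition trials (True = correctly detected)."""
    return 100 * sum(outcomes) / len(outcomes)

# Hypothetical ten-trial records for two letters (illustrative only):
letter_b = [True] * 10        # recognised on every trial
letter_e = [True, False] * 5  # recognised on half of the trials
print(trial_accuracy(letter_b), trial_accuracy(letter_e))  # 100.0 50.0
```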
A speech-to-text analysis was also conducted, and the accuracy results are presented in Table 5. The test aims to determine whether or not the application accurately recognises the speech. For example, the words 'abang' and 'sayang' have an accuracy of 100%, 'bapa' has an accuracy of 90%, and 'emak' and 'saya' have an accuracy of 80%.
5. Conclusions
In summary, the Bahasa Isyarat Malaysia (BIM) Android application was successfully developed, and all of this project's goals were met. This success can be seen in the findings for the BIM letters, where the trained model achieved 99.75% accuracy. The app was built and subjected to testing and analysis to determine the effectiveness of the whole system; the test analysis reveals that, over ten trials, the average accuracy of the letter hand gestures was greater than 50%. The same may be said for speech to text, where an acceptable accuracy of 80% or more was attained. In brief, this application can help deaf/mute and normal people communicate with ease. The project can also eliminate the hassle of hiring a human translator, making communication significantly more cost-effective while enabling quicker and more engaging interaction.
Additionally, several potential areas for future research can be considered: (i) to increase the accuracy of speech recognition, audio–visual speech recognition with lip-reading could be introduced; and (ii) to increase the performance of hand gesture recognition, attention models that enable the system to concentrate on the most informative portion of a sign video sequence could be used.