Article
Peer-Review Record

Interpretation of Bahasa Isyarat Malaysia (BIM) Using SSD-MobileNet-V2 FPNLite and COCO mAP

Information 2023, 14(6), 319; https://doi.org/10.3390/info14060319
by Iffah Zulaikha Saiful Bahri 1, Sharifah Saon 1,*, Abd Kadir Mahamad 1,*, Khalid Isa 1, Umi Fadlilah 2, Mohd Anuaruddin Bin Ahmadon 3 and Shingo Yamaguchi 3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 12 March 2023 / Revised: 2 May 2023 / Accepted: 12 May 2023 / Published: 31 May 2023
(This article belongs to the Section Information and Communications Technology)

Round 1

Reviewer 1 Report

The manuscript entitled "Interpretation of Bahasa Isyarat Malaysia(BIM) using SSD MobileNet V2 FPNLite and COCO mAP" proposed a two-way communication between deaf and hearing people. The manuscript is well-written and well-explained. The result is also significant. I have only one concern.

1. Figure 6 and Figure 7 need to be clarified why the author showed only one epoch. Explain the number of epochs, and the author should plot the summary at the end. One epoch or the first second doesn't mean anything.

Author Response

Dear Reviewer

Thank you for your comments, suggestions, and ideas; we really appreciate them and have tried our best to address them.

Reviewer 1

 

The manuscript entitled "Interpretation of Bahasa Isyarat Malaysia(BIM) using SSD MobileNet V2 FPNLite and COCO mAP" proposed a two-way communication between deaf and hearing people. The manuscript is well-written and well-explained. The result is also significant. I have only one concern.

 

1

Figure 6 and Figure 7 need to be clarified why the author showed only one epoch. Explain the number of epochs, and the author should plot the summary at the end. One epoch or the first second doesn't mean anything.

This has been addressed in Section 4.1 of the revised version; the original Figures 6 and 7 have been removed.

Author Response File: Author Response.pdf

Reviewer 2 Report

The article examines various aspects related to technologies that improve the sensitivity of gestures for organizing human-machine interaction and facilitating communication between individuals with and without hearing or speech impairments. The authors propose a gesture recognition approach that focuses solely on static gestures. They conduct a comparative analysis of the results obtained from a small corpus that includes marked-up gestures from the Malaysian Sign Language and the dactyl of American Sign Language. Based on this analysis, it can be argued that the quality of the results still depends on the input data, which, in turn, may be influenced by the selected gesture corpora. On a positive note, the article includes many helpful illustrations and tables. However, it is necessary to further discuss the shortcomings that require correction.

 

Shortcomings:

1) First and foremost, it is apparent that the introduction lacks a comprehensive description of the factors that can influence gesture recognition. Therefore, it is advisable to expand this section to provide a clearer understanding of the fact that despite its great practical potential, the problem of effective gesture recognition remains unresolved due to significant differences in the semantic-syntactic structure of any gestures. As a result, it is still not feasible to perform an unambiguous translation from sign language into, for instance, textual representation. This makes it impossible to operate fully automated models and methods for recognizing set systems, static and dynamic gestures. Developing comprehensive models requires performing deep semantic analysis, which is currently only possible at a superficial level due to the failure of text analysis algorithms, knowledge bases, and similar tools. Moreover, it is essential to note that the issue of sign language recognition is relevant as the number of individuals with hearing or speech impairments continues to grow every year (see the latest data from the World Health Organization: https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss). This underscores the importance of the chosen research field, as gesture recognition belongs to the field of assistive technologies.

 

2) The article fails to describe the challenges that arise from the lack of universal methods for creating multimodal gesture corpora. These corpora are crucial for improving the efficiency of machine learning and the accuracy of automatic gesture recognition using various video information capture devices that enable the acquisition of high-quality images in optical mode, as well as additional data on the coordinates of graphic areas of interest in other modes such as depth map and infrared. With such universal methods, it would be possible to create corpora that can be used in analyzing non-verbal communication, including the recognition of hand gestures. Therefore, it would be worthwhile to expand the article by analyzing a table with already available gesture corpora (InterHand2.6M, TheRuSLan, AUTSL, LSE_eSaude_UVIGO, WLASL, Chalearn Synthetic Hand Dataset, LSA64, MS-ASL, and others), and then explain why certain corpora are suitable or unsuitable for this study. For example, some corpora are not available for open use, or they may only contain dynamic hand gestures (AUTSL, TheRuSLan, WLASL), or they may not include the Malaysian Sign Language. This would help readers understand why the authors decided to create their own corpus and use the American Sign Language dactyl.

 

3) Section 2 of the article lacks a description of related work, as the authors only focus on the neural network architectures used. To provide a better overview of the current state of the art (SOTA), it would be wise to expand the section with a description of modern and more efficient gesture recognition methods. Information on the latest methods can be found on Papers with Code, which provides a SOTA page for each corpus. For instance, the best methods for the well-known large AUTSL corpus (https://paperswithcode.com/sota/sign-language-recognition-on-autsl) are currently: STF+LSTM, SAM-SLR (RGB-D), and MViT-SLR. By including information about these modern methods, the authors can show that they are aware of the latest advancements in the field and that they have not limited themselves to neural network models alone. Among the SOTA methods for AUTSL, some are built with MediaPipe and other approaches, and some describe their application on mobile devices (STF+LSTM), which is the authors' subject. In addition, this will be useful information for future readers of the article. It may also lay the foundation for future research by the authors, since most of the current approaches analyze dynamic gestures rather than only static gestures, as in the current article. Additionally, the authors can add information on the best results from other gesture corpora, including those that focus on the analysis of static gestures, in the description of previous works. This would further demonstrate their knowledge of the current state of the art and help readers understand the context of the authors' work.

 

4) The authors of the article mention that people without hearing and speech impairments can interact with the developed system by voice, which is recognized through the Google API. However, it is worth noting that voice recognition is not always accurate, especially in noisy environments. Audiovisual speech recognition has gained popularity in recent years, where visual cues such as lip reading are used alongside acoustic speech recognition to improve accuracy. It may be useful to mention this in the article to provide a more complete picture. There has been a significant amount of research on audiovisual speech recognition at various conferences such as ICASSP, INTERSPEECH, and LREC, with dedicated corpora available for different conditions, such as CN-CVS and RUSAVIC, among others. In addition, SOTA (https://paperswithcode.com/sota/lipreading-on-lip-reading-in-the-wild) also has results for audiovisual recognition of English words (for example, see the best 3 results). Given that the authors' system uses speech recognition, it may be relevant to discuss the potential benefits of incorporating audiovisual speech recognition in future work.

 

This is not really a shortcoming, just a recommendation: it is not only speech recognition that is relevant now but audiovisual speech recognition, and it is reasonable to write about it, since the authors’ system works with speech recognition.

 

5) It is recommended to include references to previous works of the world scientific community (2020-23), which are constantly presented at conferences focused on working with video modality (CVPR, ICCV, INTERSPEECH, ICASSP, EUSIPCO, ICMI, SPECOM, among others) or in journals of the first (Q1) quartile. If the authors correct shortcomings 2-4, then this shortcoming will be addressed.

 

6) The authors may consider adding information about the augmentation techniques used in the experiments (e.g., MixUp) and whether cosine annealing was employed. If these techniques were not used, it would be helpful to explain why. Further details are also necessary to provide a comprehensive description of the experiments.

 

7) The authors may consider explaining why they did not use an attention model in addition to their current approach.

 

8) In the conclusion section, the authors may describe their plans for future work to provide readers with an idea of the next steps in this research.

 

9) If possible, the authors are encouraged to present all illustrations in vector format.

 

10) The authors may consider changing the background of figures 8 and 10 to improve their readability.

 

11) Finally, it is recommended that the authors revise the style of the article to correct spelling and punctuation errors. The idea and experiments presented in the article are clear, but some points may require further corrections to enhance readability. Additionally, more explanations of the experiments conducted are necessary.

 

In this form, the article is still unfinished. Especially the first sections. Experimental explanations are also needed. It seems to me that all the proposed additions will only improve this article.

Author Response

Dear Reviewer

Thank you for your comments, suggestions, and ideas; we really appreciate them and have tried our best to address them.

Reviewer 2

 

The article examines various aspects related to technologies that improve the sensitivity of gestures for organizing human-machine interaction and facilitating communication between individuals with and without hearing or speech impairments. The authors propose a gesture recognition approach that focuses solely on static gestures. They conduct a comparative analysis of the results obtained from a small corpus that includes marked-up gestures from the Malaysian Sign Language and the dactyl of American Sign Language. Based on this analysis, it can be argued that the quality of the results still depends on the input data, which, in turn, may be influenced by the selected gesture corpora. On a positive note, the article includes many helpful illustrations and tables. However, it is necessary to further discuss the shortcomings that require correction.

 

1

 First and foremost, it is apparent that the introduction lacks a comprehensive description of the factors that can influence gesture recognition. Therefore, it is advisable to expand this section to provide a clearer understanding of the fact that despite its great practical potential, the problem of effective gesture recognition remains unresolved due to significant differences in the semantic-syntactic structure of any gestures. As a result, it is still not feasible to perform an unambiguous translation from sign language into, for instance, textual representation. This makes it impossible to operate fully automated models and methods for recognizing set systems, static and dynamic gestures. Developing comprehensive models requires performing deep semantic analysis, which is currently only possible at a superficial level due to the failure of text analysis algorithms, knowledge bases, and similar tools. Moreover, it is essential to note that the issue of sign language recognition is relevant as the number of individuals with hearing or speech impairments continues to grow every year (see the latest data from the World Health Organization: https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss). This underscores the importance of the chosen research field, as gesture recognition belongs to the field of assistive technologies.

I appreciate your input and helpful criticism. I like your perspective and concur that a more thorough discussion of the variables affecting gesture recognition would be beneficial for the introduction.

 

You correctly noted that undertaking deep semantic analysis, which is now only doable at a surface level, is required to construct comprehensive models for gesture recognition.

 

The article has been revised accordingly in Section 1, paragraph 5.

 

2

The article fails to describe the challenges that arise from the lack of universal methods for creating multimodal gesture corpora. These corpora are crucial for improving the efficiency of machine learning and the accuracy of automatic gesture recognition using various video information capture devices that enable the acquisition of high-quality images in optical mode, as well as additional data on the coordinates of graphic areas of interest in other modes such as depth map and infrared. With such universal methods, it would be possible to create corpora that can be used in analyzing non-verbal communication, including the recognition of hand gestures. Therefore, it would be worthwhile to expand the article by analyzing a table with already available gesture corpora (InterHand2.6M, TheRuSLan, AUTSL, LSE_eSaude_UVIGO, WLASL, Chalearn Synthetic Hand Dataset, LSA64, MS-ASL, and others), and then explain why certain corpora are suitable or unsuitable for this study. For example, some corpora are not available for open use, or they may only contain dynamic hand gestures (AUTSL, TheRuSLan, WLASL), or they may not include the Malaysian Sign Language. This would help readers understand why the authors decided to create their own corpus and use the American Sign Language dactyl.

Thank you for your comment. You are correct that the article could benefit from a more comprehensive discussion of the challenges surrounding the creation of multimodal gesture corpora.

 

The article has been revised accordingly in Section 1, paragraph 5.

 

3

Section 2 of the article lacks a description of related work, as the authors only focus on the neural network architectures used. To provide a better overview of the current state of the art (SOTA), it would be wise to expand the section with a description of modern and more efficient gesture recognition methods. Information on the latest methods can be found on Papers with Code, which provides a SOTA page for each corpus. For instance, the best methods for the well-known large AUTSL corpus (https://paperswithcode.com/sota/sign-language-recognition-on-autsl) are currently: STF+LSTM, SAM-SLR (RGB-D), and MViT-SLR. By including information about these modern methods, the authors can show that they are aware of the latest advancements in the field and that they have not limited themselves to neural network models alone. Among the SOTA methods for AUTSL, some are built with MediaPipe and other approaches, and some describe their application on mobile devices (STF+LSTM), which is the authors' subject. In addition, this will be useful information for future readers of the article. It may also lay the foundation for future research by the authors, since most of the current approaches analyze dynamic gestures rather than only static gestures, as in the current article. Additionally, the authors can add information on the best results from other gesture corpora, including those that focus on the analysis of static gestures, in the description of previous works. This would further demonstrate their knowledge of the current state of the art and help readers understand the context of the authors' work.

Thank you for your thoughtful feedback.

 

The article has been revised accordingly in Section 2, paragraphs 1 to 3.

 

4

The authors of the article mention that people without hearing and speech impairments can interact with the developed system by voice, which is recognized through the Google API. However, it is worth noting that voice recognition is not always accurate, especially in noisy environments. Audiovisual speech recognition has gained popularity in recent years, where visual cues such as lip reading are used alongside acoustic speech recognition to improve accuracy. It may be useful to mention this in the article to provide a more complete picture. There has been a significant amount of research on audiovisual speech recognition at various conferences such as ICASSP, INTERSPEECH, and LREC, with dedicated corpora available for different conditions, such as CN-CVS and RUSAVIC, among others. In addition, SOTA (https://paperswithcode.com/sota/lipreading-on-lip-reading-in-the-wild) also has results for audiovisual recognition of English words (for example, see the best 3 results). Given that the authors' system uses speech recognition, it may be relevant to discuss the potential benefits of incorporating audiovisual speech recognition in future work.

 

This is not really a shortcoming, just a recommendation: it is not only speech recognition that is relevant now but audiovisual speech recognition, and it is reasonable to write about it, since the authors’ system works with speech recognition.

Thank you for your insightful comment. This will be considered for future work.
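For context, the current voice input relies on Google's speech API. A minimal desktop-style sketch of such recognition is shown below; it uses the third-party speech_recognition package and the "ms-MY" (Malay) language code, both of which are assumptions for illustration only and not part of our Android implementation.

```python
# Illustrative sketch only (assumptions: the third-party `speech_recognition`
# package, a working microphone, and the "ms-MY" language code); this is not
# the Android implementation described in the paper.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)  # partly mitigates noisy environments
    audio = recognizer.listen(source)

try:
    # Sends the audio to Google's web speech API and returns the transcript.
    text = recognizer.recognize_google(audio, language="ms-MY")
    print("Recognized:", text)
except sr.UnknownValueError:
    print("Speech was not understood.")
except sr.RequestError as err:
    print("Speech API request failed:", err)
```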

 

5

It is recommended to include references to previous works of the world scientific community (2020-23), which are constantly presented at conferences focused on working with video modality (CVPR, ICCV, INTERSPEECH, ICASSP, EUSIPCO, ICMI, SPECOM, among others) or in journals of the first (Q1) quartile. If the authors correct shortcomings 2-4, then this shortcoming will be addressed.

OK; this has been added in Sections 1 and 2.

 

6

The authors may consider adding information about the augmentation techniques used in the experiments (e.g., MixUp) and whether cosine annealing was employed. If these techniques were not used, it would be helpful to explain why. Further details are also necessary to provide a comprehensive description of the experiments.

Augmentation techniques are not used in this work.

 

In sign language recognition, augmentation approaches can introduce artificial changes into the sign language data that may not accurately depict real signing and can therefore decrease performance. Additionally, no augmented BIM dataset is currently available.
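For reference, the MixUp technique the reviewer mentions blends pairs of training images and their one-hot labels; a minimal NumPy sketch is given below (illustrative only, not used in this work), which also indicates why the blended frames no longer depict real hand shapes.

```python
# Illustrative MixUp sketch (not used in this work): each output image is a
# convex combination of two training images, so blended frames no longer
# depict real signing hand shapes.
import numpy as np

def mixup_batch(images, labels, alpha=0.2, seed=None):
    """images: (N, H, W, C) float array; labels: (N, num_classes) one-hot array."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)          # mixing coefficient drawn from Beta(alpha, alpha)
    perm = rng.permutation(len(images))   # random pairing of samples
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_images, mixed_labels
```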

7

The authors may consider explaining why they did not use an attention model in addition to their current approach.

Thank you for your comments; we believe this can be a part of our future work.

8

In the conclusion section, the authors may describe their plans for future work to provide readers with an idea of the next steps in this research.

Thank you.

 

The article has been revised accordingly in Section 5, paragraph 2.

9

If possible, the authors are encouraged to present all illustrations in vector format.

We are sorry; these are currently the best illustrations that can be provided.

10

The authors may consider changing the background of figures 8 and 10 to improve their readability.

Thank you for the feedback.

 

Revised as shown in Figures 5 and 7.

11

Finally, it is recommended that the authors revise the style of the article to correct spelling and punctuation errors. The idea and experiments presented in the article are clear, but some points may require further corrections to enhance readability. Additionally, more explanations of the experiments conducted are necessary.

 

In this form, the article is still unfinished. Especially the first sections. Experimental explanations are also needed. It seems to me that all the proposed additions will only improve this article.

Thank you for the feedback.

Author Response File: Author Response.pdf

Reviewer 3 Report

The article describes an Android application that allows communication between deaf people who use sign language and people who do not know this language. The authors assumed that the characters of the alphabet, five words, and several special gestures would be recognized - a total of 34 gestures (if I counted correctly). The application also allows the use of a speech recognition engine.

The article is quite long (22 pages) and contains information about the sign language, the system architecture (a neural network trained with tens of thousands of images was used), how to develop applications for Android, and the test results. I have the following comments:

1. A short description of the BIM Sign Language would be useful, in terms of its volume (the number of gestures that a person using this language should know) and the syntax - whether the sequence of gestures exactly corresponds to the successive words of the spoken language, or whether a different syntax is used, and whether it is sufficient to limit the recognition ability to 34 gestures.

2. I don't think it is necessary to describe how to develop Android applications (Section 2.4). Extensive manuals are available on this subject, describing what the Java Virtual Machine is, how to create buttons in the graphical interface, what "visual programming" is all about, etc.

3. The Materials and Methods chapter is a bit chaotic. The reader expects a description of the corpus used (training and test data) and then the methods. Here these elements are mixed. It is not clear to me why the gestures for letters and gestures for words have been separated - are they really so different? Do they have some dynamic characteristics or are they static?

4. Section 3.1 is titled "BIM Alphabets" - are there multiple alphabets? Or is the word "alphabet" used to mean "letter"? It seems that the term "alphabet" is used interchangeably with "letter" (which is not correct), as in line 500: "repeating a hand gesture of each BIM alphabet to the camera ten times".

5. If the two categories of gestures are not significantly different, one can combine their descriptions (eg Figure 1 and Figure 3). Figure 2 is redundant in my opinion - it describes an obvious concept.

6. Does the term "one epoch" (Figure 6 and text) mean that the images from the training set were presented once? This is not enough, many such epochs must be carried out.

7. Recognition accuracy score of 2.30% means some error in the network architecture or its use - after all, this is a weaker result than if the classifier answer were randomly drawn.

8. In my opinion, it is not necessary to place here screenshots of so many application screens. And it would be useful to add translations of words used in the UI at least in captions of figures (Fig. 16, 17 - "Tambah", "padam": "add", "delete", Fig. 19 - "Tukar": change etc.)

9. Figure 5: it is "Android application flowchart", not "Flowchart of the development of the Android application"

10. Table 5, 6 - isn't all the useful information in the last column? Does the order of correct results matter? Maybe only the last columns of these tables should be left?

11. I suggest ensuring the appropriate quality of graphics (applies to Fig. 9 - poor text and chart resolution)


In general, I propose to shorten and organize this text, focusing on the essential elements:
1. Description of the BIM sign language - how many gestures are used in normal communication
2. Description of the application concept - it should recognize gestures based on a video image and recognize speech and display it as written text
3. Description of gesture recognition tools used (network type, learning data)
4. Description of the results of the examination of the correctness of gesture recognition
5. Description of the developed application.
6. Clarification whether the described application can be used in real conversations or is just a demonstration of the method. The doubt arises from the number of recognized gestures: 34, while there are thousands of them in practically used sign languages.

Correction suggestions:

is: whereas other models require longer.
suggestion: whereas other models require more time.

is: Today, there are many applications for the deaf/mute with normal.
suggestion: Today, there are many applications available for deaf/mute individuals to communicate with non-deaf/mute individuals.

is: ...with as little computational as possible.
suggestion: ...with as little computation as possible.

In the paragraph starting on line 175 - somewhat awkward wording; many references to "this book" (26 in bibliography?)

is: integrated Android application has successfully evolved
suggestion: the application has been successfully developed

Errors:
is: ...one from filling (perasaan)
should be: ...one from feelings (perasaan)

is: Then, the collected of 500 (512px × 290px) images...
should be: Then, the collection of 500 (512px × 290px) images...

Author Response

Dear Reviewer

Thank you for your comments, suggestions, and ideas; we really appreciate them and have tried our best to address them.

Reviewer 3

 

The article describes an Android application that allows communication between deaf people who use sign language and people who do not know this language. The authors assumed that the characters of the alphabet, five words, and several special gestures would be recognized - a total of 34 gestures (if I counted correctly). The application also allows the use of a speech recognition engine.

The article is quite long (22 pages) and contains information about the sign language, the system architecture (a neural network trained with tens of thousands of images was used), how to develop applications for Android, and the test results. I have the following comments:

 

1

A short description of the BIM Sign Language would be useful, in terms of its volume (the number of gestures that a person using this language should know) and the syntax - whether the sequence of gestures exactly corresponds to the successive words of the spoken language, or whether a different syntax is used, and whether it is sufficient to limit the recognition ability to 34 gestures.

Thank you for the comments.

 

As stated in the paper, the 34 gestures used in this work (26 letters, 3 special characters, and 5 words) are only a small set of hand gestures, intended as a proof of concept for this Android application system.

2

I don't think it is necessary to describe how to develop Android applications (Section 2.4). Extensive manuals are available on this subject, describing what the Java Virtual Machine is, how to create buttons in the graphical interface, what "visual programming" is all about, etc.

Noted; accordingly, the original Section 2.4 has been removed.

3

The Materials and Methods chapter is a bit chaotic. The reader expects a description of the corpus used (training and test data) and then the methods. Here these elements are mixed. It is not clear to me why the gestures for letters and gestures for words have been separated - are they really so different? Do they have some dynamic characteristics or are they static?

Thank you for your comment.

 

Regarding the separation of gestures for letters and words, we made this distinction because they have different recognition requirements. In this project, BIM letter recognition uses a MobileNet model, while BIM word recognition uses the SSD-MobileNet-V2 FPNLite model.

 

The flowchart has been revised accordingly for better understanding.
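For illustration, the two-model split can be sketched as follows; the file names, input sizes, preprocessing, label lists, and TFLite output ordering below are assumptions made for the sketch, not our actual project code.

```python
# Illustrative sketch of the two-model split (file names, input sizes, label
# lists, and TFLite output ordering are assumptions, not the actual project code).
import numpy as np
import tensorflow as tf

LETTERS = [chr(c) for c in range(ord("A"), ord("Z") + 1)]        # assumed label order
WORDS = ["word_1", "word_2", "word_3", "word_4", "word_5"]       # placeholder word labels

letter_model = tf.keras.models.load_model("bim_letters_mobilenet.h5")           # MobileNet classifier
word_detector = tf.lite.Interpreter(model_path="bim_words_ssd_fpnlite.tflite")  # SSD detector
word_detector.allocate_tensors()

def classify_letter(frame_rgb):
    """MobileNet branch: classify a static BIM letter from an RGB frame."""
    x = tf.image.resize(frame_rgb, (224, 224))[tf.newaxis] / 255.0
    probs = letter_model.predict(x, verbose=0)[0]
    return LETTERS[int(np.argmax(probs))], float(np.max(probs))

def detect_word(frame_rgb):
    """SSD-MobileNet-V2 FPNLite branch (TFLite): return the top-scoring BIM word."""
    inp = word_detector.get_input_details()[0]
    height, width = inp["shape"][1], inp["shape"][2]
    x = tf.image.resize(frame_rgb, (height, width))[tf.newaxis]
    word_detector.set_tensor(inp["index"], x.numpy().astype(inp["dtype"]))
    word_detector.invoke()
    out = word_detector.get_output_details()
    classes = word_detector.get_tensor(out[1]["index"])[0]  # output tensor order is an assumption
    scores = word_detector.get_tensor(out[2]["index"])[0]
    best = int(np.argmax(scores))
    return WORDS[int(classes[best])], float(scores[best])
```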

4

Section 3.1 is titled "BIM Alphabets" - are there multiple alphabets? Or is the word "alphabet" used to mean "letter"? It seems that the term "alphabet" is used interchangeably with "letter" (which is not correct), as in line 500: "repeating a hand gesture of each BIM alphabet to the camera ten times".

I appreciate your input.

 

The word “letters” is more appropriate; thus, we have changed “alphabets” to “letters” throughout.

5

If the two categories of gestures are not significantly different, one can combine their descriptions (eg Figure 1 and Figure 3). Figure 2 is redundant in my opinion - it describes an obvious concept.

I appreciate the comments.

 

Figures 1 and 2 have been revised into a single Figure 1.

6

Does the term "one epoch" (Figure 6 and text) mean that the images from the training set were presented once? This is not enough, many such epochs must be carried out.

I appreciate your input.

 

This has been addressed in Section 4.1 of the revised version; the original Figures 6 and 7 have been removed.

7

Recognition accuracy score of 2.30% means some error in the network architecture or its use - after all, this is a weaker result than if the classifier answer were randomly drawn.

I appreciate the comment.

 

We believe the recognition accuracy of 99.75% is acceptable when compared with other papers, as discussed in Section 2.
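For reference, the COCO mAP evaluation referred to in the paper's title is commonly produced with the pycocotools evaluator; a minimal sketch follows, in which the JSON file names are placeholders and not our actual evaluation artifacts.

```python
# Minimal COCO mAP evaluation sketch with pycocotools; the JSON file names are
# placeholders, not the actual evaluation artifacts of this work.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("bim_val_annotations.json")             # ground-truth boxes and labels
coco_dt = coco_gt.loadRes("bim_val_detections.json")   # detector outputs in COCO results format

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP@[.50:.95], AP@.50, AP@.75, and AR metrics
```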

8

In my opinion, it is not necessary to place here screenshots of so many application screens. And it would be useful to add translations of words used in the UI at least in captions of figures (Fig. 16, 17 - "Tambah", "padam": "add", "delete", Fig. 19 - "Tukar": change etc.)

Thank you for the idea.

Revised as shown on pages 14 and 16.

9

Figure 5: it is "Android application flowchart", not "Flowchart of the development of the Android application"

Changed accordingly.

10

Table 5, 6 - isn't all the useful information in the last column? Does the order of correct results matter? Maybe only the last columns of these tables should be left?

Revised accordingly.

11

I suggest ensuring the appropriate quality of graphics (applies to Fig. 9 - poor text and chart resolution).

Revised accordingly.

12

In general, I propose to shorten and organize this text, focusing on the essential elements:
1. Description of the BIM sign language - how many gestures are used in normal communication
2. Description of the application concept - it should recognize gestures based on a video image and recognize speech and display it as written text
3. Description of gesture recognition tools used (network type, learning data)
4. Description of the results of the examination of the correctness of gesture recognition
5. Description of the developed application.
6. Clarification whether the described application can be used in real conversations or is just a demonstration of the method. The doubt arises from the number of recognized gestures: 34, while there are thousands of them in practically used sign languages

 

Thank you for the suggestion; however, we have retained the structure of this paper.

13

Correction suggestions:

is: whereas other models require longer.
suggestion: whereas other models require more time.

is: Today, there are many applications for the deaf/mute with normal.
suggestion: Today, there are many applications available for deaf/mute individuals to communicate with non-deaf/mute individuals.

is: ...with as little computational as possible.
suggestion: ...with as little computation as possible.

In the paragraph starting on line 175 - somewhat awkward wording; many references to "this book" (26 in bibliography?)

is: integrated Android application has successfully evolved
suggestion: the application has been successfully developed

Errors:
is: ...one from filling (perasaan)
should be: ...one from feelings (perasaan)

is: Then, the collected of 500 (512px × 290px) images...
should be: Then, the collection of 500 (512px × 290px) images...

Revised accordingly.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

The authors of this article have answered all questions and corrected the article. As it stands, the article appears to be complete. Overall, the article deserves the attention of the scientific community and can be recommended for publication.
