Peer-Review Record

Subunits Inference and Lexicon Development Based on Pairwise Comparison of Utterances and Signs

Information 2019, 10(10), 298; https://doi.org/10.3390/info10100298
by Sandrine Tornay and Mathew Magimai.-Doss
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 22 July 2019 / Revised: 7 September 2019 / Accepted: 24 September 2019 / Published: 26 September 2019
(This article belongs to the Special Issue Computational Linguistics for Low-Resource Languages)

Round 1

Reviewer 1 Report

The article develops a hidden Markov model-based abstract methodology to extract subword units given pairwise comparisons between utterances. In the case of spoken language, it demonstrates that the proposed methodology can lead to the discovery of a phone set and the development of a phonetic dictionary. In the case of sign language, it demonstrates how hand movement information can be modeled for sign language processing.

The paper uses some standard methods to derive subunits from speech and sign language and uses them for recognition. The innovation factor is rather average. However, the study has some merit due to the numerous experiments, which give some insights into the related research.

Some detailed comments:

“Such linguistically motivated subword units help in handling data scarcity issues when training statistical models and handle words that are unseen during training.”

Unclear statement: what sort of handling is meant here?

"In the world, there are approximately 6900 languages and only about 5-10% of them employ a writing system [3]. Furthermore, not all of the languages that have a writing system may have a well developed phonetic dictionary. Studying these languages manually to acquire linguistic knowledge and phonetic dictionary from the acoustic data is a highly challenging and non-trivial task."

This is not showcased for any such languages, which would add interest to the paper.

Table 1 displays RA per word and per subunit, as far as I understand (in the first case, does #units refer to subunits/phones and in the second to words? A clarification is needed).

However, these are not comparable because they calculate RA on different quantities, so the related comments need to be changed (lines 227-231). Similarly for Table 2.

Minor spelling errors like "day-today", etc.

 

Author Response

We thank the reviewer for her/his valuable time, feedback, and suggestions. In the revised version, the changes made based on Reviewer 1's comments are in blue font. In the remainder, we provide a point-by-point response to Reviewer 1's comments.

Point 1: “Such linguistically motivated subword units help in handling data scarcity issues when training statistical models and handle words that are unseen during training.”

Unclear statement: what sort of handling is meant here?

Response: Thank you for raising this point. We have clarified this statement in the abstract.

"When developing language technologies, as words in a language do not have the same prior probability, there may not be sufficient training data for each word to model. Furthermore, the training data may not cover all possible words in the language. Due to these data sparsity and word unit coverage issues, language  technologies employ modeling of subword units or subunits, which are based on prior linguistic knowledge."

Point 2: "In the world, there are approximately 6900 languages and only about 5-10% of them employ a writing system [3]. Furthermore, not all of the languages that have a writing system may have a well developed phonetic dictionary. Studying these languages manually to acquire linguistic knowledge and phonetic dictionary from the acoustic data is a highly challenging and non-trivial task."
This is not showcased for any such languages, which would add interest to the paper.

Response: The reviewer has raised an interesting point. We would indeed like to investigate the approach on truly such a language. However, as clarified in the revised version of the paper in Section 4, Page 6, to validate the methodology up to the pronunciation level, we needed a language that has such well-developed resources. So, we simulated the scenario through an investigation on Phonebook. If we were to carry out an experimental study on a language which does not have any linguistic resources, then validation up to the pronunciation level would not be an easy task, as the extracted subword units and lexicon would have to be substantiated through a manual linguistic analysis, which in turn could be subjective. Having said that, as you can notice, at no point does the experimental validation depend on whether the target language has a writing system or other linguistic resources for subword unit extraction and lexicon development. So, the proposed methodology should generalize to truly under-resourced languages.

Point 3: Table 1 displays RA per word and per subunit, as far as I understand (in the first case, does #units refer to subunits/phones and in the second to words? A clarification is needed).

However, these are not comparable because they calculate RA on different quantities, so the related comments need to be changed (lines 227-231). Similarly for Table 2.

Response: The average number of units in Table 1 corresponds to the number of HMM states. In the case of the word-level system, we used 15 states per word, which gives 1125 units (i.e., 75 words × 15 states). In the case of the clustered subword unit-based system, we clustered these 1125 units through pairwise discrimination using the Bhattacharyya distance, leading to 810 units (see Section 3.1). The same explanation holds for Table 4 and Table 8. We have further clarified this aspect in the text in Section 4.2.1.
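
To illustrate the kind of pairwise state clustering described in the response above, here is a minimal Python sketch. It is not the authors' actual HTK-based implementation; the diagonal-covariance single-Gaussian emission densities, the average-linkage agglomerative scheme, and the distance threshold are all illustrative assumptions.

```python
# A minimal sketch (not the authors' actual implementation) of clustering
# HMM states by pairwise Bhattacharyya distance between their emission
# densities. Diagonal-covariance single Gaussians, average linkage, and
# the threshold value are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def bhattacharyya(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    var = 0.5 * (var1 + var2)                       # averaged covariance
    term1 = 0.125 * np.sum((mu1 - mu2) ** 2 / var)  # mean separation
    term2 = 0.5 * np.sum(np.log(var)                # covariance mismatch
                         - 0.5 * (np.log(var1) + np.log(var2)))
    return term1 + term2

def cluster_states(means, variances, threshold=1.0):
    """Merge states whose emission densities fall within `threshold`."""
    n = len(means)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = bhattacharyya(
                means[i], variances[i], means[j], variances[j])
    # condensed distances -> average-linkage tree -> flat cluster labels
    labels = fcluster(linkage(squareform(dist), method="average"),
                      t=threshold, criterion="distance")
    return labels  # e.g., maps 1125 word-level states to subunit labels
```

With the 1125 word-level states as input and a suitable threshold, such a procedure would yield a reduced label set along the lines of the 810 clustered units reported above.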

 

Point 4: Minor spelling errors like "day-today", etc.

Response: Thank you for pointing it out. We have corrected it.

 

Reviewer 2 Report

In this paper, the authors propose methods of pairwise comparison of utterances and signs for subunit inference and lexicon development.

 

Advantages:

1) The problem is interesting.

2) The use of HMM/GMM, hybrid HMM/ANN, and KL-HMM methods is reasonable.

3) The experimental results support the proposed method.

 

Suggestions:

1) It would be better to add a figure in the Introduction section to clearly depict the input/output of the problem, why state-of-the-art methods cannot do well, and why the proposed methods can do well.

2) It would also be better to add a baseline method for comparison in the experiments.

3) It would be nice if the software could be released.

 

Thank you very much!

 

Author Response

We thank the reviewer for her/his valuable time and for the feedback and suggestions. The changes made to the paper based on the reviewer's comments are in red font. In the remainder, we provide a point-by-point response to the reviewer's comments.

Point 1: It would be better to add a figure in the Introduction section to clearly depict the input/output of the problem, why state-of-the-art methods cannot do well, and why the proposed methods can do well.

Response: We thank the reviewer for this suggestion. For the sake of clarity, instead of modifying the Introduction section, we have added Section 2, where we discuss the methods used in speech processing and sign language processing for subunit extraction. At the end of that section, we have teased out the shortcomings of the existing approaches for the lightly supervised computational linguistics problem posed in the present paper.

We found that adding a figure would not bring added value, so we have clarified the input-output aspect at the beginning of Section 3.

Point 2: It would also be better to add a baseline method for comparison in the experiments.

Response: As pointed out in Section 2, although the general methods for subunit extraction are based on time series segmentation and clustering, existing approaches have limitations for the problem we have studied. In speech recognition and sign language recognition studies, the word-level and sign-level systems serve as baselines.

In the spoken language part of the study, as there is a phonetic dictionary for Phonebook, we developed standard phone-based systems with exactly the same protocol as in the paper:

HMM/GMM: 97.7 ± 2.5 (CI), 96.1 ± 4.8 (CD)

KL-HMM: 98.1 ± 3.4 (CI), 99.7 ± 0.9 (CD)

where CI denotes the context-independent phone-based system and CD denotes the context-dependent phone-based system. As we are adopting exactly the same protocol, the hybrid HMM/ANN system studies required more time and were incomplete within the revision time period. Nevertheless, it can be observed that the phone-based HMM/GMM and KL-HMM system results and the performance of the systems reported in the paper are comparable. As the paper does not focus on speech recognition and as these results do not change the findings of the paper, we are not sure whether these results should be added to the paper. The reviewer's opinion in this regard would be helpful.

For sign language recognition, the recent work in citation [44] also serves as a baseline.

Point 3: It would be nice if the software can be released.

Response: We understand the reviewer's point and tend to agree with the reviewer about it. However, the challenge is that the implementation has been carried out with a mix of scripts and tools. More precisely, training of neural networks has been done with Quicknet. For HMM/GMM, we have used HTK. For hybrid HMM/ANN and KL-HMM, we have used a locally modified version of HTK, which requires the release of a patch. The clustering of HMM states stored in HTK mmf format has been implemented in Python, which can be released. Hand movement synthesis has been carried out using Jupyter notebooks and Python scripts based on the pbdlib library. Bundling all this properly and releasing the software will need some more work and time.

One potential solution could be to release it in a phase-by-phase manner: for instance, first releasing the Idiap Speech Scripts (ISS) for HMM and ANN training using HTK and Quicknet, together with the Python scripts and library (ISS is publicly available at https://github.com/idiap/iss), and then releasing the KL-HMM patch after the internal formalities at Idiap.

 

Round 2

Reviewer 1 Report

No more comments

Reviewer 2 Report

The authors have addressed all of my comments, thank you very much!
