Using Transfer Learning to Realize Low Resource Dungan Language Speech Synthesis
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The article discusses a framework for Dungan language speech synthesis. My general comment is that the article is too long and not focused on the novelty, generally repeating history in many parts, leading to losing focus when reading. Also, I have the following comments:
1. The abstract is very confusing. For instance, the word Dungan language is repeated a lot, there is a level of technicality in the statement: "These sequences with the speech corpus, provide <phoneme sequence with prosodic labels, speech > pairs as the input for input into Tacotron2," which is hard to follow and finally, the level of improvement in the result is not declared.
2. The introduction introduced the problem along with a literature review of speech synthesis and the scarcity of work on the Dungan language. So why do you need a related work section? You don't need to explain the basics or background material. I don't think you need Section 2.
3. Concerning the figures starting from Figure 6, I can see a "Liner projection" block in the training by TL; is it actually "liner" or "linear"? Those blocks were not explained.
4. Section 3 then explains the proposed TL speech synthesis using Tacotron2+WaveRNN. Why do you need figure 7, again? A plain background could be retrieved from the literature if needed by the reader.
5. Again, Section 4 contains a lot of basics that do not need to be included. Your article should focus on novelty rather than repeating history.
6. The assessment methods need to be explained in the end of the proposed model so that they can be read and understood before the results section
7. What are the results in Table 4? Is there no previous literature to compare with?
8. It is not common to name an acronym (DSD and MDSD) for the proposed model in the experimentation section. It should have been introduced much earlier.
Overall, I think the paper needs severe restructuring. Starting from Section 2, you should have one figure explaining the overall proposed model, then sub-images (if needed) to explain parts of that model, followed by the evaluation metrics for this model. The figures need much better explanation. Background explanation should be minimized.
Comments on the Quality of English Language
The English language is fine but very repetitive in some parts.
Author Response
Comments 1: The article discusses a framework for Dungan language speech synthesis. My general comment is that the article is too long and not focused on the novelty, generally repeating history in many parts, leading to losing focus when reading.
Response 1: Thank you for pointing this out. We agree with this comment. Therefore, we have removed some content and focused more on our work. Please find the changes in Section 2 and Section 3.
Comments 2: The abstract is very confusing. For instance, the word Dungan language is repeated a lot, there is a level of technicality in the statement: "These sequences with the speech corpus, provide <phoneme sequence with prosodic labels, speech > pairs as the input for input into Tacotron2," which is hard to follow and finally, the level of improvement in the result is not declared.
Response 2: Thank you for pointing this out. We agree with this comment. Therefore, we have revised the abstract to reflect the work of the manuscript more clearly. Because we used several objective and subjective evaluation metrics when comparing our methods, listing these evaluation metrics in the abstract would affect its readability. Therefore, we did not specify the level of improvement achieved by the proposed method in the abstract. Please find the changes in the Abstract.
Comment 3: The introduction introduced the problem along with a literature review of speech synthesis and the scarcity of work on the Dungan language. Now why do you need a related work section. You don't need to explain the basics or background material. I don't think you need Section 2.
Response 3: Thank you very much for all your suggestions. Another reviewer also believed that our manuscript's structure was unreasonable. We agree with this comment. Therefore, to make the manuscript clearer, we reorganized it according to the structure required by Applied Sciences, deleted Section 2 (Related Work), and merged Sections 3 to 5 into the Models and Methods section. Please find the changes in Section 2 and Section 3.
Comment 4: Concerning the figures starting from Figure 6, I can see a "Liner projection" block in the training by TL, is it actually liner or linear?? Those blocks were not explained.
Response 4: Thank you for pointing this out. We agree with this comment. Therefore, we corrected the typo "Liner" to "Linear" in the figures. Please see Figure 1 and Figure 5.
Comment 5: Section 3 then explains the proposed TL speech synthesis using Tacotron2+WaveRNN. Why do you need figure 7, again? A plain background could be retrieved from the literature if needed by the reader.
Response 5: Thank you very much for all your suggestions. We agree with this comment. Therefore, we deleted Figure 7 and Figure 12.
Comment 6: Again, Section 4 contains a lot of basics that do not need to be included. Your article should focus on novelty rather than repeating history.
Response 6: Thank you very much for your suggestions. We agree with this comment. Therefore, we removed some basics of the Mandarin acoustic model. Please see Section 2.
Comment 7: The assessment methods need to be explained in the end of the proposed model so that they can be read and understood before the results section.
Response 7: We appreciate your kind views. We agree with this comment. In this study, we used subjective and objective evaluation and employed multiple evaluation metrics to compare the proposed model with others. To clarify the manuscript, we have removed the general introduction to the evaluation methods and provided the specific evaluation indicators and their references in Sections 3.2.3 and 3.2.4.
Comment 8: What are the results in Table 4? Is there no previous literature to compare with.
Response 8: Thank you for pointing this out. We agree with this comment. The study's original contributions include a front end for the Dungan language and a transfer learning-based Dungan acoustic model. The text analysis in the front end affects the quality of speech synthesis in the back end, so we evaluated the Dungan text analyzer, in which the character-to-unit conversion module is the most critical factor affecting the quality of synthesized speech. As far as we know, no text analyzer is available for the Dungan language. We therefore present the performance of our Transformer-based character-to-unit conversion module in Table 4 (as this part directly generates the final unit sequence with prosodic information) to show that this text analysis can be used for subsequent acoustic model training. We have adjusted the manuscript's structure by placing the experiment on the text analysis module in Section 3.1, and the results are shown in Table 3.
Comment 9: It is not common to name an acronym (DSD and MDSD) for the proposed model in the experimentation section. It should have been introduced much earlier.
Response 9: Thank you for pointing this out. We agree with this comment. Therefore, we rewrote this in Section 3.2.2.
Comment 10: Overall, I think the paper needs severe restructuring. Starting from section 2 you should have one figure explaining the overall proposed model then sub images (if needed) to explain parts of that model followed by the evaluation metrics for this model. The figures need much better explanation. Background explanation should be minimized
Response 10: We appreciate your suggestions and agree with this comment. Therefore, we restructured the manuscript and removed some background explanations. The figures are also explained in detail.
Comment 11: The English language is fine but very repetitive in some parts
Response 11: Thank you for pointing this out. We agree with this comment. Therefore, we carefully revised the manuscript.
Reviewer 2 Report
Comments and Suggestions for Authors
The article is dedicated to solving the problem of speech synthesis using Transfer Learning. The topic of the article is relevant. The structure of the article does not correspond to the format accepted by MDPI for research articles (Introduction (including literature review), Models and Methods, Results, Discussion, Conclusions). The level of English is acceptable. The article is easy to read. The figures in the article are of acceptable quality. The article cites 56 sources, many of which are outdated. The References section is poorly formatted.
The following comments and recommendations can be formulated regarding the material of the article:
1. The task of automatic speech synthesis consists of three stages. The first stage is linguistic analysis, which includes text normalization, word segmentation, morphological tagging, grapheme-to-phoneme conversion (G2P), and the extraction of various linguistic features. The second stage involves converting the input sequence of phonemes into a spectrogram – a representation of the signal in the frequency-time domain. The final stage is the reconstruction of the sound wave from the spectrogram, usually using a special algorithm called a vocoder. In my opinion, not all operations of the first stage are fully described by the authors.
2. Recently, the quality of modern adaptive synthesis models has become comparable to real human speech. This has largely been achieved through the use of end-to-end TTS models, which employ data-driven methods based on generative modeling. I think it would be beneficial to compare these with the authors' approach.
3. Modern solutions used in speech technology are based on neural networks and, consequently, require extensive training datasets. For tasks such as speech and emotion recognition, speaker identification, and audio synthesis, datasets with expressive speech are necessary. It is easy to imagine the problems that arise when collecting such data. Firstly, it is necessary to evoke the required emotion in a person, but not all reactions can be induced in simple and natural ways. Secondly, using recordings of professional actors can lead to significant financial costs and artificial emotions. I believe these difficulties fully apply to the Dungan language. I ask the authors to comment on this point.
4. To calculate MOS scores, one must take the arithmetic mean of the quality ratings of synthesized speech, given by specific individuals on a scale from 1 to 5. It should be noted that this assessment is not absolute, as it is subjective, so comparing experiments conducted by different groups of people at different times on different data is incorrect. Additionally, I note some difficulties: - With significant discrepancies in phrase length, the sound is heavily distorted; - Style transfer works unstably; - The quality of data used during training critically affects the final result when changing the speaker. How did the authors address these challenges?
5. I think it would be interesting to compare the authors' approach with VITS. VITS is a parallel end-to-end TTS system that uses a variational autoencoder (VAE) to connect the acoustic model with the vocoder through a latent (hidden) representation. This approach allows generating high-quality audio recordings by enhancing the expressive features of the network with a mechanism of normalizing flows and adversarial training in the signal's time domain. It also adds the ability to pronounce text with various variations by modeling the uncertainty of the latent state and a stochastic duration predictor.
Author Response
Comment 1: The structure of the article does not correspond to the format accepted by MDPI for research articles (Introduction (including literature review), Models and Methods, Results, Discussion, Conclusions). The article cites 56 sources, many of which are outdated. The References section is poorly formatted.
Response 1: Thank you for pointing this out. We agree with this comment. Therefore, to make the manuscript clearer, we reorganized it according to the structure required by Applied Sciences, deleted Section 2 (Related Work), and merged Sections 3 to 5 into the Models and Methods section. We also updated some references and carefully revised the References section. Please find the changes in Section 2, Section 3, and the References section.
Comment 2: The task of automatic speech synthesis consists of three stages. The first stage is linguistic analysis, which includes text normalization, word segmentation, morphological tagging, grapheme-to-phoneme conversion (G2P), and the extraction of various linguistic features. The second stage involves converting the input sequence of phonemes into a spectrogram – a representation of the signal in the frequency-time domain. The final stage is the reconstruction of the sound wave from the spectrogram, usually using a special algorithm called a vocoder. In my opinion, not all operations of the first stage are fully described by the authors.
Response 2: Thank you very much for all your comments. We agree with this comment. One important part of our work is a text analyzer for the Dungan language that generates unit sequences with prosodic information. The Dungan language can be regarded as pinyin-based Chinese: although it is written in Cyrillic script, each written symbol corresponds to a Chinese Pinyin syllable, and there are spaces between syllables. Therefore, we can convert Dungan characters to Chinese Pinyin by table lookup. We have also established a dictionary for the Dungan language to obtain word segmentation information. Using this information, we achieved prosodic boundary prediction for the Dungan language and, based on this, implemented a Transformer-based character-to-unit conversion. We have reorganized the manuscript and present the text analysis process of the Dungan language in Section 2.1.
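To illustrate the table-lookup step described above, here is a minimal Python sketch; the dictionary entries and function name are hypothetical placeholders (tones and prosodic labels are omitted), not the actual tables used in the paper.

```python
# Minimal sketch of the character-to-Pinyin table lookup described above.
# The mapping entries are illustrative placeholders, not the paper's tables.
DUNGAN_TO_PINYIN = {
    "ни": "ni",   # hypothetical entry
    "хо": "hao",  # hypothetical entry
}

def dungan_to_pinyin(sentence: str) -> list[str]:
    """Convert a space-separated Dungan sentence into a Pinyin unit sequence."""
    units = []
    for syllable in sentence.split():
        # Fall back to the raw syllable if it is not covered by the table.
        units.append(DUNGAN_TO_PINYIN.get(syllable, syllable))
    return units

print(dungan_to_pinyin("ни хо"))  # -> ['ni', 'hao']
```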
Comment 3: Recently, the quality of modern adaptive synthesis models has become comparable to real human speech. This has largely been achieved through the use of end-to-end TTS models, which employ data-driven methods based on generative modeling. I think it would be beneficial to compare these with the authors' approach.
Response 3: Thank you very much for your suggestions. We agree with this comment. We proposed an end-to-end Dungan TTS model based on Tacotron2. End-to-end methods need a large training corpus, which is very difficult to obtain for low-resource languages such as Dungan and Tibetan. To the best of our knowledge, only our team has achieved speech synthesis in the Dungan language, so we cannot compare our work with others'. We therefore compared the end-to-end model trained solely on the Dungan language with our transfer learning-based model on Tacotron, Tacotron2, and different vocoders (Griffin-Lim, WaveNet, WaveRNN). The results show that transferring the Mandarin model to the Dungan language yields high-quality synthesized speech, owing to the similarity between Dungan and Mandarin pronunciation. Therefore, our method provides a way to synthesize various Chinese dialects and minority languages. Please find it in Section 2 and Section 3.
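As a rough illustration of this transfer learning setup (not the authors' actual training code), the following PyTorch sketch uses a tiny stand-in model: a Mandarin-pretrained checkpoint would initialize the weights, the encoder is frozen, and only the remaining parameters are fine-tuned on the small Dungan corpus. The module structure, checkpoint path, and hyperparameters are assumptions.

```python
import torch
from torch import nn

# Tiny stand-in for a Tacotron2-style acoustic model (hypothetical structure).
class TinyAcousticModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Embedding(100, 32)  # unit/phoneme embedding + encoder
        self.decoder = nn.GRU(32, 80)         # decodes to 80-dim mel frames

model = TinyAcousticModel()

# 1) Initialize from a (hypothetical) Mandarin-pretrained checkpoint;
#    strict=False tolerates layers whose shapes differ, e.g. a new unit set.
# state = torch.load("mandarin_acoustic_model.pt", map_location="cpu")
# model.load_state_dict(state, strict=False)

# 2) Freeze the Mandarin-learned encoder and fine-tune the remaining
#    parameters on the small Dungan corpus.
for p in model.encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```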
Comment 4: Modern solutions used in speech technology are based on neural networks and, consequently, require extensive training datasets. For tasks such as speech and emotion recognition, speaker identification, and audio synthesis, datasets with expressive speech are necessary. It is easy to imagine the problems that arise when collecting such data. Firstly, it is necessary to evoke the required emotion in a person, but not all reactions can be induced in simple and natural ways. Secondly, using recordings of professional actors can lead to significant financial costs and artificial emotions. I believe these difficulties fully apply to the Dungan language. I ask the authors to comment on this point.
Response 4: Thank you very much for your very interesting views. We agree with this comment. We comment below on the challenges of collecting training datasets for speech technology, especially for the Dungan language. Thanks to large-scale speech datasets, speech technologies based on neural networks have developed rapidly for resource-rich languages. Dungan, however, is a low-resource and less commonly studied language, so collecting expressive speech datasets for it poses additional challenges. These difficulties can be categorized as follows.
- Inducing Genuine Emotions: Evoking authentic emotional responses in speakers is a fundamental challenge. Emotions are often complex and context-dependent, making it hard to create a controlled environment where natural emotions can be consistently reproduced. This issue is particularly pronounced in lesser-known languages like Dungan, where cultural and linguistic nuances may further complicate the process.
- Cost of Professional Recordings: Hiring professional actors to create expressive speech datasets can be prohibitively expensive. While actors can provide a wide range of emotions, there is a risk that their portrayals may come off as artificial or exaggerated, which can negatively impact the realism and effectiveness of the training data. This concern is amplified for niche languages like Dungan, where the pool of available actors might be limited, thus driving up costs and potentially reducing the quality of the recorded data. This may lead to fewer speakers available for recording, limited access to recording studios, and a lack of existing annotated datasets that can be used as a foundation for training neural networks.
- Cultural and Linguistic Authenticity: Maintaining cultural and linguistic authenticity in the dataset is crucial for the Dungan language. Any artificiality in the recorded emotions can skew the training process, leading to less effective or biased models. This is particularly important for tasks like emotion recognition, where the subtleties of vocal expression must be accurately captured.
To address these challenges, researchers can recruit a large number of ordinary people through crowdsourcing platforms to participate in data collection. This approach can reduce costs and obtain more diverse and natural speech data. Furthermore, researchers can use machine learning and natural language processing technologies to develop automated data annotation and analysis tools, improving the efficiency and accuracy of data processing.
Comment 5: To calculate MOS scores, one must take the arithmetic mean of the quality ratings of synthesized speech, given by specific individuals on a scale from 1 to 5. It should be noted that this assessment is not absolute, as it is subjective, so comparing experiments conducted by different groups of people at different times on different data is incorrect. Additionally, I note some difficulties: - With significant discrepancies in phrase length, the sound is heavily distorted; - Style transfer works unstably; - The quality of data used during training critically affects the final result when changing the speaker. How did the authors address these challenges?
Response 5: Thank you for pointing this out. We agree with this comment. As MOS is a subjective evaluation method, we acknowledge the issues you mention. However, in this study, the main purpose of MOS scoring is to compare the relative quality of different models: listeners simply give higher scores to the model whose synthesized speech they perceive as better. We randomly selected 30 sentences from the test set and had 20 native Mandarin-speaking students and 10 Dungan international students (who understood Chinese) perform the MOS evaluation. These participants received training before the formal evaluation. We take the average over all raters as the final result. To mitigate the shortcomings of MOS ratings, we also asked raters to score natural speech as the ground truth. The final results show that our proposed transfer learning-based Dungan speech synthesis model synthesizes more natural speech than the other methods. Please find it in Section 3.2.4.
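For clarity, here is a minimal sketch of the MOS aggregation described above (the arithmetic mean of all 1-5 ratings across listeners and sentences); the scores shown are illustrative only, not actual experimental data.

```python
import statistics

# Each listener rates each test sentence on a 1-5 scale; the MOS for a system
# is the arithmetic mean over all ratings. Scores below are illustrative only.
ratings = {
    "listener_01": [4, 5, 4],  # ratings for three example sentences
    "listener_02": [3, 4, 4],
}

all_scores = [s for scores in ratings.values() for s in scores]
mos = statistics.mean(all_scores)
print(f"MOS = {mos:.2f}")  # -> MOS = 4.00
```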
Comment 6: I think it would be interesting to compare the authors' approach with VITS. VITS is a parallel end-to-end TTS system that uses a variational autoencoder (VAE) to connect the acoustic model with the vocoder through a latent (hidden) representation. This approach allows generating high-quality audio recordings by enhancing the expressive features of the network with a mechanism of normalizing flows and adversarial training in the signal's time domain. It also adds the ability to pronounce text with various variations by modeling the uncertainty of the latent state and a stochastic duration predictor.
Response 6: Thank you for pointing this out. We agree with this comment. Numerous breakthroughs have been achieved in TTS based on deep neural networks. We have noticed that several new speech synthesis methods have been proposed in recent years, and VITS, which you mention, is one of them. With the widespread use of discrete audio tokens, the language-model research paradigm has had a profound impact on speech modeling and synthesis. Motivated by recent advancements in auto-regressive (AR) models employing decoder-only architectures for text generation, several studies, such as VALL-E and BASE TTS, apply similar architectures to TTS tasks. These studies demonstrate the remarkable capacity of decoder-only architectures to produce natural-sounding speech. However, these new speech synthesis methods have only begun to be applied to low-resource languages. We will further deepen our research, use these new methods to improve the quality of Dungan language speech synthesis, and compare them with the method proposed in this manuscript. We mention this in the Conclusions section.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The article is improved.
Author Response
We would like to thank you for evaluating our manuscript and providing valuable comments.
Reviewer 2 Report
Comments and Suggestions for Authors
I have formulated the following comments on the previous version of the article:
1. The task of automatic speech synthesis consists of three stages. The first stage is linguistic analysis, which includes text normalization, word segmentation, morphological tagging, grapheme-to-phoneme conversion (G2P), and the extraction of various linguistic features. The second stage involves converting the input sequence of phonemes into a spectrogram – a representation of the signal in the frequency-time domain. The final stage is the reconstruction of the sound wave from the spectrogram, usually using a special algorithm called a vocoder. In my opinion, not all operations of the first stage are fully described by the authors.
2. Recently, the quality of modern adaptive synthesis models has become comparable to real human speech. This has largely been achieved through the use of end-to-end TTS models, which employ data-driven methods based on generative modeling. I think it would be beneficial to compare these with the authors' approach.
3. Modern solutions used in speech technology are based on neural networks and, consequently, require extensive training datasets. For tasks such as speech and emotion recognition, speaker identification, and audio synthesis, datasets with expressive speech are necessary. It is easy to imagine the problems that arise when collecting such data. Firstly, it is necessary to evoke the required emotion in a person, but not all reactions can be induced in simple and natural ways. Secondly, using recordings of professional actors can lead to significant financial costs and artificial emotions. I believe these difficulties fully apply to the Dungan language. I ask the authors to comment on this point.
4. To calculate MOS scores, one must take the arithmetic mean of the quality ratings of synthesized speech, given by specific individuals on a scale from 1 to 5. It should be noted that this assessment is not absolute, as it is subjective, so comparing experiments conducted by different groups of people at different times on different data is incorrect. Additionally, I note some difficulties: - With significant discrepancies in phrase length, the sound is heavily distorted; - Style transfer works unstably; - The quality of data used during training critically affects the final result when changing the speaker. How did the authors address these challenges?
5. I think it would be interesting to compare the authors' approach with VITS. VITS is a parallel end-to-end TTS system that uses a variational autoencoder (VAE) to connect the acoustic model with the vocoder through a latent (hidden) representation. This approach allows generating high-quality audio recordings by enhancing the expressive features of the network with a mechanism of normalizing flows and adversarial training in the signal's time domain. It also adds the ability to pronounce text with various variations by modeling the uncertainty of the latent state and a stochastic duration predictor.
The authors have addressed all my comments. I found their responses quite convincing. I support the publication of the current version of the article. I wish the authors creative success.
Author Response
Comments 1: The authors have addressed all my comments. I found their responses quite convincing. I support the publication of the current version of the article. I wish the authors creative success.
Response 1: We are grateful for your acknowledgment of our revision efforts and the insightful comments you offered, which have significantly enhanced the quality of our paper.
