Article
Peer-Review Record

Noise-Robust Voice Conversion Using High-Quefrency Boosting via Sub-Band Cepstrum Conversion and Fusion

Appl. Sci. 2020, 10(1), 151; https://doi.org/10.3390/app10010151
by Xiaokong Miao, Meng Sun *, Xiongwei Zhang * and Yimin Wang
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 11 November 2019 / Revised: 14 December 2019 / Accepted: 19 December 2019 / Published: 23 December 2019

Round 1

Reviewer 1 Report

The authors propose a method for noise-robust voice conversion (VC) using bidirectional long short-term memory (BLSTM) neural networks. The main novelty of their approach is high-quefrency boosting via sub-band cepstrum conversion and fusion.

The exposition of previous research is informative and adequate. The description of the proposed method is well structured: the differences from previous similar methods are described, as is the rationale for the novel features that characterize the new method.

The method is extensively and soundly evaluated and the results are convincing.

The Conclusions section is rather short. The authors should enrich it, adding (1) conditions in which their method could falter and (2) ways to alleviate such possible problems, as well as additional avenues of future research on extending the method to specific fields of VC.

Other corrections/suggestions:

1) In Section 1.2, the authors state “For more detailed conversion process of VC and DIFFVC and the training process of parallel VC based on GMM, …”. DIFFVC is obviously differential VC, but it should be made clear whether the authors mean that DIFFVC is implemented using DIFFGMM.

2) The authors state: “The remainder of this paper is organized as follows: Section 2 mainly presents related work and recent progress. Section 3 describes the proposed approach. The setup of experiments and analysis of results are discussed in Section 4. Finally, the conclusions and future work are presented in Section 5.” The section numbers in this passage should be corrected to 1, 2, 3, and 4, respectively, since this is the numbering followed in the rest of the manuscript.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The article describes a method of voice conversion that provides results better than existing methods in terms of objective and subjective measures of generated voices.

The article is well structured and well written. The presented study design seems plausible.

Regarding the content, it would be good to know what computing power the individual methods require, i.e. whether the advantages obtained have to be bought with higher computing power or whether there is no difference in computing power required.

The conclusions are short, and the discussion is missing. This needs more work.

Regarding the form, the article should be thoroughly revised so that the contents are adequately presented.

Formal issues:

The abbreviation BLSTM is introduced in the Abstract, but not in the main text. I suggest writing the main text so that it can be read independently of the Abstract.
HUB subtask: what is it? A subtask of the VCC? Please spend a half-sentence on an explanation.
L53: “of the system”: which system? The N10? Please specify.
Although the quality of writing is very sound, the paper would benefit from being proofread by a native speaker. Example: the structure of the sentence “The motivation for doing this is that, based on the experience of our previous experiments, when the voice is contaminated by noises, the high-quefrency components of speech seem to be vulnerable to interference.” seems strange IMHO. Another example: “The motivation of designing a sub-band cepstrum conversion system is that when voice is affected by strong noises, the high quefrency components of cepstrum failed to be recovered accurately, given our previous experiences.”
L73 and L76: “demonstrate” in simple past and simple present. Please ensure the consistent use of tenses.
L103: “maximum-likelihood estimation” -> without hyphen.
L104: “as a novel features which are” -> “as a novel feature, which is”.
L107: Is a comma required?
L113-L115: Missing verb in the sentence? Please check.
L122: “in reference [2] and [28]” -> “in references [2] and [28]” or “in reference [2] and in reference [28]”.
L77: “The remainder of this paper is organized as follows: Section 2 mainly presents related work” -> The sections have to be renumbered: the section that follows is numbered “1”.
L84: What is F0? Please specify in a half-sentence. (L157)
L89-90: “In this paper, two benchmark methods are taken as baselines, one is the GMM-based VC method from VCC 2018 and the other is a BLSTM-based VC method.” Imbalance: while the first method is defined in a way that it can be identified and reviewed, the second method remains unclear. Please specify.
L96: obsolete space before the period.
L97: “that accomplishing feature mapping by soft clustering.” Sounds strange. Please check the language.
L102: comma instead of period.
L157: comma instead of period before “which”. At the end of the line: obsolete period.
L162: “has showed” -> “has shown”.
L166: Mismatch of singular and plural (“process of T1 and T2 are presented”).
L187: “By mapping the full-band information of the source MCEPs with the high quefrency band of the target MCEPs”. This is a subclause, not a complete sentence.
Figure 3: “foefficient”: typo?
L194: Subclause instead of sentence: “And …”
L224: Superscript of variables, e.g. xta, intended?
L232: References 33, 34 and 35 are missing.
Figure 5 (and others): I would move the legend below the diagram and, in particular, reduce the size of the “95% confidence intervals” to keep the focus on the relevant data.
L293: The information that one of the databases used is English and the other Mandarin should be provided when the databases are introduced.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

The review comments have been considered in the revised version of the manuscript. Thank you!

Two remarks remain:

Please check line 28: "improve the quality of the converted speechvoicehe quality and similarity of the converted voice". First, the "he" after "voice" seems to be superfluous. Second, "quality" occurs twice, which does not read well; consider replacing one of the occurrences with another word.

Figures 4 and 5 are understandable. However, as already noted, another design would increase their readability (reducing the size of the "confidence interval" and placing the legend below the diagram).

 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
