Noise Modeling to Build Training Sets for Robust Speech Enhancement
Round 1
Reviewer 1 Report
Judging from the literature, making synthetic noisy speech training sets is a valuable contribution in the field of speech enhancement, so I congratulate the authors.
If possible, add a few more recent references concerning training data and noise in DNN-based SE.
In several places (line 29, line 53, ...) there are introductory one-term sentences. These should either become subheadings or be better integrated with the rest of the text.
I would be interested to know why l1 was assigned a value of 120 and why 50 epochs without improvement were chosen?
The meaning of lines 257 and 258 is unclear!
In Figure 5, text seems to be missing in the lowest block on the right.
If possible, explain the method or criterion for choosing 120,000 pairs among 150,000 on page 9.
Presenting the evaluation results not only in Tables 1 and 2, but also in the form of graphs, maybe column charts or something appropriate, would be welcome.
The formatting of references is not entirely unified.
Some suggested corrections of English:
line 50: end the sentence after 'collected'
line 51: ...speech, and the fact that clean speech signal...
line 86: ...may not be ideal...
line 149: ...such as CGANs...
line 173: ...into both the...
line 228: ...This is followed by...
line 280: In order to carry out...
line 283: ...which is here referred to as...
line 294: ...A recorder was placed 0.5m...
Author Response
Reviewer 1:
Judging from the literature, making synthetic noisy speech training sets is a valuable contribution in the field of speech enhancement, so I congratulate the authors.
Q1. If possible, add a few more recent references concerning training data and noise in DNN-based SE.
Response:
Thank you for your valuable comment concerning the addition of recent work related to training data and noise in DNN-based SE. In the third subheading of Chapter 1, we explain that our work is the first that specifically focuses on using GAN-based noise modeling to create training sets in the field of speech enhancement. Although similar ideas appear in image enhancement tasks, these methods have not been applied to speech enhancement. Our approach resembles data augmentation, so the article does include some recent references on data augmentation. However, data augmentation is commonly used in image or speech recognition, where transformations such as translation, rotation, scaling, and reflection can produce significant improvements in recognition accuracy. The speech enhancement tasks addressed in this paper typically involve only limited amounts of noisy speech and no training set at all, so data augmentation is not applicable.
According to Comment 2 presented below, we added subtitles to clarify the related work section further.
Q2. In several places (line 29, line 53, ...) there are introductory one-term sentences. These should either become subheadings or be better integrated with the rest of the text.
Response:
We added subheadings to highlight the problems solved in the article and the work we have done.
Q3. I would be interested to know why L1 was assigned a value of 120 and why 50 epochs without improvement were chosen?
Response:
Regarding the weight of our L1 regularization, after some experimentation we set it to 120 for the entire training. We initially set it to 50, but we observed that the L1 component of the G loss was two orders of magnitude below the adversarial one, so the L1 term had no practical effect on learning. Once we set it to 120, we saw a minimization behavior in the L1 term and an equilibrium behavior in the adversarial one. As the L1 loss decreased, the quality of the output samples increased, which we hypothesize helped G generate more realistic samples. This value was therefore chosen based on the verification of multiple training runs.
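For illustration only (this is our own minimal PyTorch-style sketch, not the paper's implementation; the loss form and variable names are assumptions), such a weight enters the generator objective as follows:

    import torch
    import torch.nn.functional as F

    LAMBDA_L1 = 120  # weight discussed above; at 50 the L1 term was ~100x smaller than the adversarial term

    def generator_loss(d_fake_logits, fake, reference):
        # Adversarial term: push the discriminator's score on generated samples toward "real".
        adv = F.binary_cross_entropy_with_logits(
            d_fake_logits, torch.ones_like(d_fake_logits))
        # Weighted L1 term: with too small a weight it has no practical effect on learning.
        return adv + LAMBDA_L1 * F.l1_loss(fake, reference)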
As for the epochs, one of the critical issues when training a neural network is overfitting. When the number of epochs exceeds what is necessary, the model learns patterns specific to the training data, preventing it from performing well on new datasets. The model then attains high accuracy on the training set but fails to achieve good accuracy on the test set; in other words, it loses its generalization capacity by overfitting the training data. To mitigate overfitting and increase generalization, the model should be trained for an optimal number of epochs. A part of the training data is set aside to validate the model and check its performance after each training epoch, and the loss and accuracy on the training and validation sets are monitored to identify the epoch after which the model starts overfitting. Choosing an optimal stopping point is therefore a problem worth studying. In our experiments, we stopped training after 50 epochs without improvement, a value chosen based on our processor's speed and the observed results, which we found suitable for training the network.
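The stopping criterion described above follows the standard early-stopping pattern; a generic sketch (the train_one_epoch and validate callables are assumed placeholders):

    def train_with_early_stopping(model, train_one_epoch, validate,
                                  max_epochs=10_000, patience=50):
        """Stop once the validation loss has not improved for `patience` epochs."""
        best_loss, stale_epochs = float("inf"), 0
        for epoch in range(max_epochs):
            train_one_epoch(model)      # one pass over the training data
            val_loss = validate(model)  # loss on the held-out validation split
            if val_loss < best_loss:
                best_loss, stale_epochs = val_loss, 0
            else:
                stale_epochs += 1
                if stale_epochs >= patience:
                    break               # assume the model has started overfitting
        return model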
Q4. The meaning of lines 257 and 258 is unclear!
Response:
We explained the meaning of “GAN-train” and “GAN-test” after lines 257 and 258 (now lines 374 to 342 in the revised manuscript) as follows:
“GAN-train” and “GAN-test” were first proposed by Shmelkov to measure the quality of data generated by GANs by calculating the distance between the generated samples and the real data manifold in terms of precision and recall [15]. “GAN-train” refers to training the SE model on our synthetic training sets and evaluating its performance on real noisy samples. The synthetic samples are considered sufficiently diverse if an SE model trained on them can provide appealing noise reduction on real noisy speech. “GAN-test” utilizes an SE model trained on real training sets but tested on synthetic samples. Here, if an SE model trained on a real training set can provide appealing noise reduction on synthetic noisy speech, the generated samples are considered a realistic approximation of the natural samples' (unknown) distribution. The “GAN-train” and “GAN-test” experiments aim to measure the diversity and similarity between synthetic and real noisy samples and to assess the extent to which NM-GAN can match real noise distributions.
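Schematically, the protocol can be summarized as follows (our own sketch; train_se and evaluate stand for the SE training and scoring procedures and are passed in as placeholders):

    def gan_train(train_se, evaluate, synthetic_pairs, real_noisy_set):
        """GAN-train: train the SE model on synthetic pairs, score it on real noisy speech."""
        se_model = train_se(synthetic_pairs)
        return evaluate(se_model, real_noisy_set)

    def gan_test(train_se, evaluate, real_pairs, synthetic_noisy_set):
        """GAN-test: train the SE model on real pairs, score it on synthetic noisy speech."""
        se_model = train_se(real_pairs)
        return evaluate(se_model, synthetic_noisy_set)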
Q5. In Figure 5, text seems to be missing in the lowest block on the right.
Response:
Thank you for your comment. We have corrected this error; the missing text in the lowest block on the right of Figure 5 should read “synthesizing training set by our method”.
Q6. If possible, explain the method or criterion for choosing 120,000 pairs among 150,000 on page 9.
Response:
In traditional machine learning, datasets are divided into training and test sets at a typical ratio of 80%–20% or 70%–30%. Our experiment selected 80% of the prepared data set as the training set, i.e., 120,000 of the 150,000 pairs for training and the remaining 30,000 for testing. This strategy applies to small-to-medium-sized datasets; for larger datasets involving millions of samples, the division is closer to 98% and 2%, respectively. Other ratios between the training and test sets have also been proposed, mainly depending on the data volume. Further details on the division proportion are explained in the revised paper.
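As a sketch, the split amounts to no more than shuffling the paired samples and cutting at the 80% mark (our own illustration, not the paper's code):

    import random

    def split_dataset(pairs, train_ratio=0.8, seed=0):
        """Shuffle and split, e.g. 150,000 pairs -> 120,000 train / 30,000 test."""
        shuffled = list(pairs)            # copy so the caller's list is untouched
        random.Random(seed).shuffle(shuffled)
        cut = int(len(shuffled) * train_ratio)
        return shuffled[:cut], shuffled[cut:]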
Q7. Presenting the evaluation results not only in Tables 1 and 2, but also in the form of graphs, maybe column charts or something appropriate, would be welcome.
Response:
Thank you for your comments. The differences between the data in Table 1 and Table 2 would indeed be more obvious with a columnar comparison chart. However, the limited time to revise the paper's content and language (five days) did not allow us to create additional charts or plots to present the evaluation results. Instead, we have improved the analysis and comparison so that the contrast between the two tables is clearer.
Q8. The formatting of references is not entirely unified.
Response:
We revised the format of the references to make them uniform.
Q9. Some suggested corrections of English:
line 50: end the sentence after 'collected'
line 51: ...speech, and the fact that clean speech signal...
line 86: ...may not be ideal...
line 149: ...such as CGANs...
line 173: ...into both the...
line 228: ...This is followed by...
line 280: In order to carry out...
line 283: ...which is here referred to as...
line 294: ...A recorder was placed 0.5m...
Response:
Thank you for pointing out these mistakes. We have now had the paper carefully proofread by a native English-speaking expert.
Reviewer 2 Report
Report attached
Comments for author File: Comments.pdf
Author Response
Reviewer 2:
This research aimed at a vital research parameter, the dataset, on which the effectiveness of the conducted research depends. The authors addressed the scarce availability of noisy speech datasets that can be used in the problem of robust speech enhancement. They employed GANs to generate enough noisy speech data to effectively take part in the solution of the speech enhancement problem. However, there are a few major observations based on which this paper cannot be accepted in its current form until it is improved. The comments are as follows:
Q1. The language of this paper is ordinary, and there are grammatical mistakes at various places. This needs thorough improvement; I suggest authors can get help from professional editing services providers to improve the language.
Response:
Thank you for this suggestion. We have now had the paper carefully proofread by a native English-speaking expert.
Q2. The introduction section is too short, there must be more explanation to set the ground for upcoming sections of the paper.
Response:
Thank you for your comments. Within the limited five-day revision time, we have reorganized the logic and content of Chapter 1. In addition, we added subheadings to make the content clearer and improve the reading experience. The revised Chapter 1 includes five parts: the first introduces deep learning speech enhancement and the related research methods; the second states the current problems; the third presents our scheme; the fourth notes the current lack of relevant research; and the last outlines the content of each chapter. We hope this modification helps readers clearly understand our paper's structure. Thank you again for your comments.
Q3. The authors pointed out the problem of synthetic datasets, based on which they conducted this research. My question is: are most of the available datasets synthetic, and is that the issue the authors tried to address? The authors should provide details on the available state-of-the-art datasets in terms of being synthetic.
Response:
Thank you for your valuable comments. According to the literature, deep learning-based speech enhancement methods typically focus on the design of SE networks, and the model's performance is verified on synthetic data sets. However, the synthesis steps are relatively simple: additive noise is commonly assumed, and the clean speech and noise from public data are directly mixed to synthesize noisy speech at a certain signal-to-noise ratio. Although much work has been done in the SE field, the latest deep learning models cannot be directly applied because they do not consider noisy speech captured under other acoustic conditions. Unlike current trends, we consider this application type because our team works on laser speech detection. Owing to these special circumstances, we can only obtain limited noisy speech, which makes speech enhancement challenging. Overall, we believe that most current works on deep learning speech enhancement focus on the design of algorithms and models while ignoring such extreme practical situations.
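To make this synthesis step concrete, here is a minimal sketch of the common additive mixing at a target SNR (our own illustration, not the paper's code; the inputs are assumed to be NumPy arrays at the same sampling rate):

    import numpy as np

    def mix_at_snr(clean, noise, snr_db):
        """Additively mix clean speech and noise at a target SNR in dB."""
        noise = noise[:len(clean)]                  # assume noise is at least as long
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12       # guard against silent noise clips
        scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
        return clean + scale * noise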
In order to solve the problem of limited noisy speech and the lack of paired speech training sets, this paper proposes using a GAN to generate a large amount of noise and then synthesize data sets. Relevant research first appeared in the field of image enhancement: in 2018, Chen [1] applied similar methods to solve a practical image enhancement problem, but Chen's network design is relatively simple and its feature extraction ability is weak. We investigated the SE field and, to the best of our knowledge, no existing works employ deep learning methods to synthesize noise and prepare data sets for speech enhancement; therefore, we have not found relevant references. Although data augmentation, a method of preparing training sets, has been widely used for images and speech, it relies on label-preserving transformations, which typically deform existing training sets and expand upon them to create more training data. Data augmentation has the advantage of enhancing the adaptability of DNN-based networks and addresses the problem of small or limited training sets. Moreover, it is commonly used in image recognition, where transformations such as translation, rotation, scaling, and reflection can significantly improve recognition accuracy. However, in the case of speech enhancement, there are often only limited amounts of noisy speech or no training sets at all.
Considering speech synthesis, several ways besides adding noise and pure speech exist, but these are out of this paper’s scope.
[1] J. Chen et al., “Image Blind Denoising with Generative Adversarial Network Based Noise Modeling,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Q4. On page 2, line 50, the authors mention that only limited noisy speech could be collected. Why is the availability of noisy speech for recording a dataset limited? There are many situations and environments where noisy datasets can be recorded in abundance, such as on a train or in a fish market. There are different kinds of speech and noise available.
Response:
In the revised paper, we explain why we can only obtain limited noise sets in some practical cases. As you mentioned, if the voice detection or collection scene is determined in advance and is unrestricted, we can obtain many noisy voice instances from it. However, in special fields such as intelligence acquisition and national security, we cannot enter the voice acquisition scene in advance, and the collection time is limited, so we can only record limited noisy voice samples.
Unlike public noise data sets, we capture real noisy speech recordings under diverse noisy conditions, with the speech and noise captured simultaneously by the same microphone under similar acoustic conditions. These conditions are difficult to simulate with synthetic training data that combines clean speech with noise obtained from the internet, because in that case the clean speech and noise signals are captured independently or under different acoustic conditions.
As a result, we can only obtain limited noisy speech and lack the data set to train deep learning speech enhancement networks, which is also the problem solved in this paper.
Q5. Page 4, line 139, what kind of auxiliary information is added? More elaboration is required about it.
Response:
Conditional Generative Adversarial Nets (CGANs) were originally proposed for image generation in 2014. In an unconditioned generative model, there is no control over the modes of the generated data; however, it is possible to direct the data generation process by conditioning the model on additional information. Depending on the target to be generated, this extra information can be any auxiliary information, such as class labels or data from other modalities. Please refer to the revised paper for further details.
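A minimal PyTorch-style sketch of such conditioning (the dimensions and architecture are illustrative assumptions, not those of CGAN or of our model): the auxiliary information is embedded and concatenated with the noise vector, steering which mode of the data distribution is generated.

    import torch
    import torch.nn as nn

    class ConditionalGenerator(nn.Module):
        """Generator conditioned on auxiliary information (here, a class label)."""
        def __init__(self, z_dim=100, n_classes=10, out_dim=784):
            super().__init__()
            self.label_emb = nn.Embedding(n_classes, n_classes)
            self.net = nn.Sequential(
                nn.Linear(z_dim + n_classes, 256), nn.ReLU(),
                nn.Linear(256, out_dim), nn.Tanh(),
            )

        def forward(self, z, labels):
            cond = self.label_emb(labels)           # embed the auxiliary information
            return self.net(torch.cat([z, cond], dim=1))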
Q6. The paper's format is irregular; in some places a part of the text is referred to as a “chapter,” and in others as a “section.” The section headings are also kept at the same level as the remaining text, which seems odd, so they must be highlighted and bolded to be distinguished.
Response:
Indeed, this observation is perfectly correct. We have unified the expression as “chapter” throughout the paper.
Reviewer 3 Report
Deep learning-based speech enhancement models depend on synthetic paired datasets and degrade significantly when there is a mismatch between the synthetic datasets on which they are trained and real test sets.
To solve this problem, this paper proposes a new Generative Adversarial Network framework for Noise Modeling (NM-GAN) that can build realistic paired training sets by imitating real noise distribution. This model consists of a novel 7-layer U-Net with two bidirectional long short-term memory layers.
Extensive experiments were carried out. Qualitative and quantitative evaluation of the generated noise samples and training sets demonstrate the potential of the framework.
Though the proposed method has some limitations, this paper offers a good solution for improving the quality of paired synthetic datasets. Some of the limitations are discussed in chapter “5.2. Limitations and Future work”.
Another limitation of this paper concerns the comprehensiveness of the experiments. It is hard to evaluate a pure noise model. As this noise model is used for speech enhancement, adding speech enhancement results from both objective and subjective experiments would make the approach more reliable.
Author Response
Reviewer 3:
Deep learning-based speech enhancement models depend on synthetic paired datasets and degrade significantly when there is a mismatch between the synthetic datasets on which they are trained and real test sets.
To solve this problem, this paper proposes a new Generative Adversarial Network framework for Noise Modeling (NM-GAN) that can build realistic paired training sets by imitating real noise distribution. This model consists of a novel 7-layer U-Net with two bidirectional long short-term memory layers.
Extensive experiments were carried out. Qualitative and quantitative evaluation of the generated noise samples and training sets demonstrate the potential of the framework.
Though the proposed method has some limitations, this paper offers a good solution for improving the quality of paired synthetic datasets. Some of the limitations are discussed in chapter “5.2. Limitations and Future work”.
Another limitation of this paper concerns the comprehensiveness of the experiments. It is hard to evaluate a pure noise model. As this noise model is used for speech enhancement, adding speech enhancement results from both objective and subjective experiments would make the approach more reliable.
Response:
Thank you for your affirmation of our paper. As you mentioned, it is difficult to evaluate a pure noise model directly, since it is ultimately used for speech enhancement. Based on this consideration, we selected a public and widely cited SE network [1] as the evaluation network to assess the adaptability of the synthetic data set created from our generated noise. For comparison, we used both a data set synthesized directly from the limited real noise and a real data set to train the same SE network, and then compared the enhancement effects. This evaluation and comparison method is also called “GAN-train” and “GAN-test” [2].
“GAN-train” refers to training the SE model on our synthetic training sets and evaluating its performance on real noisy samples. The synthetic samples are considered sufficiently diverse if an SE model trained on them can provide appealing noise reduction on real noisy speech. “GAN-test” utilizes an SE model trained on real training sets but tested on synthetic samples. Here, if an SE model trained on a real training set can provide appealing noise reduction on synthetic noisy speech, the generated samples are considered a realistic approximation of the natural samples' (unknown) distribution. The “GAN-train” and “GAN-test” experiments aim to measure the diversity and similarity between synthetic and real noisy samples and to assess the extent to which NM-GAN can match real noise distributions.
The experimental setup involves two groups of “GAN-train” and “GAN-test” comparative experiments. In the first “GAN-train” experimental group, the SE model is trained on our synthesized training set but tested on real noisy samples. In the first “GAN-test” experimental group, the SE model is trained on a real training set but tested on our synthesized noisy samples. In contrast, in the second group of “GAN-train” experiments, we employ the “limited” training set, based on the direct synthesis of speech and the limited real noise samples, to train the same SE model and then test it on the real noisy set. Similarly, for the second group of “GAN-test” experiments, the SE model is trained on the real training set and tested on the “limited” set.
In order to perform the two “GAN-train” and “GAN-test” comparative experiments, we prepared three different training sets. The first consists of real-world recordings, referred to as the real training set. The second is a set synthesized using the limited real noise samples, referred to as the “limited” training set, while the third is prepared using our proposed method. The preparation of the three data sets is illustrated in Figure 5.
To compare the speech enhancement quality of the “GAN-train” and “GAN-test” experimental groups, we used PESQ, CSIG, CBAK, and COVL as objective metrics and obtained the results presented in Tables 1 and 2. From these tables, we conclude that the SE model trained on our synthesized training set has a robust SE effect on real noisy speech samples and that the SE model trained on the real training set has a similar SE effect on the synthesized noisy speech. Moreover, by comparing Tables 1 and 2, we conclude that the NM-GAN-prepared training sets can train DNN-based SE models better than “limited” training sets synthesized directly from the limited noise samples. These experimental results highlight that the proposed method is feasible and reliable.
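For illustration, base scores such as PESQ can be computed per utterance and averaged as below (our own sketch; the pesq PyPI package is an assumption, not necessarily the evaluator used in the paper, and CSIG, CBAK, and COVL are composite measures built on top of such base metrics):

    import numpy as np
    from pesq import pesq  # pip install pesq

    def mean_pesq(clean_refs, enhanced, fs=16000):
        """Average wideband PESQ over paired clean/enhanced utterances."""
        scores = [pesq(fs, ref, deg, 'wb') for ref, deg in zip(clean_refs, enhanced)]
        return float(np.mean(scores))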
However, your question made us reflect on the fact that evaluating noise generation by comparing speech enhancement networks is still an indirect method. In future research, we need to find a way to evaluate the quality of the generated noise directly. This is also stated in chapter “5.2. Limitations and Future work”.
[1] S. Pascual, A. Bonafonte, and J. Serrà, “SEGAN: Speech Enhancement Generative Adversarial Network,” in Proc. Interspeech, 2017.
[2] K. Shmelkov, C. Schmid, and K. Alahari, “How Good Is My GAN?,” in Proc. European Conference on Computer Vision (ECCV), 2018.