Peer-Review Record

A Dual Stream Generative Adversarial Network with Phase Awareness for Speech Enhancement

Information 2023, 14(4), 221; https://doi.org/10.3390/info14040221
by Xintao Liang, Yuhang Li, Xiaomin Li, Yue Zhang and Youdong Ding *
Submission received: 5 November 2022 / Revised: 31 March 2023 / Accepted: 1 April 2023 / Published: 4 April 2023

Round 1

Reviewer 1 Report

The phase unwrapping algorithm should be described in more detail.

Author Response

Dear Professor,

Thank you very much for your suggestion.

I have added a section on the experiments and a description of the training method in lines 326 to 340 of the manuscript. Taking into account your comments and those of the other reviewers, I have made substantial changes. All changes in the manuscript have been highlighted.

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper proposed a generative adversarial network (GAN) for single-channel speech enhancement. The GAN's generator, built on the Transformer architecture, is trained to predict an amplitude mask and a phase mask, while the discriminator predicts objective quality scores (PESQ or STOI) given the reference signal.

To improve the enhancement performance, the amplitude mask prediction and the phase mask prediction, which mainly operate in parallel, are connected by Eq. (2). The two networks are trained alternately.

The evaluation on the VoiceBank-DEMAND dataset shows that the proposed method improves PESQ and STOI scores compared to the noisy input.

Major comments

* The introduction and the related work provide a nice overview of speech enhancement methods, yet they are too high-level and abstract. The choice of the proposed architecture could be better motivated (e.g., with more concrete evidence of the advantage of Transformers over RNNs).

* If PESQ/STOI evaluation is always needed when training the discriminator, and G and D are trained alternately, what is the point of using a discriminator to approximate the metric instead of using the metric directly?

* How is the phase wrapping problem solved in the proposed method (phase masking)?

* In the results section, it is not clear how the metrics are obtained for the other baseline systems: are they trained and evaluated on the same dataset, or are the numbers taken from the original papers? If taken from the original papers, does the comparison remain fair (complexity, training data, ...)?

* The reported evaluation results in Table 1 show a strange trend: the proposed methods score below average on CSIG and CBAK but exceed all other methods on COVL. Is there any explanation for this behaviour? I also did not understand how the conclusion `DPGAN outperforms the vast majority of the SOTA systems' (line 402) is drawn from the results in Table 1.

* A lower minimum-gain boundary should offset reduced noise-reduction ability with less speech distortion; however, as the authors note in lines 344 to 348, it leads to lower scores on both metrics (CSIG and CBAK). I could not find enough evidence in the paper for the argument the authors provide (`ultra-low frequency noise that the human ear cannot distinguish').

* The conclusions `using waveform mapping tends to perform better on these metrics' and `fixes to spectrograms may produce detailed information that does not affect listening and speech quality but can degrade the metrics' are not supported by any evidence in the paper.

* Most of the recent methods score higher than the proposed method on all metrics apart from PESQ, which the proposed network is explicitly trained to optimise. Could the authors provide more evidence for the claim `In general, our work was able to perform the speech enhancement task better'?

* The comment on STOI, `The main reason for this is the design of the STOI scores themselves' (lines 386-387), does not coincide with the findings of previous research. In [1], the authors reported similar performance when using STOI or a common MSE loss function.

[1] Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen. Monaural speech enhancement using deep neural networks by maximizing a short-time objective intelligibility measure. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5059–5063. IEEE, 2018.

* The test set only contains two unseen speakers from the same speech corpus. How is the generalisation ability guaranteed and tested?

* The choice of $\beta$ in Eq. (1) can be better motivated. `One hundred speech data' (line 209) is not clear. Does it mean 100 sentences or 100 frames? In any case, it is a small subset and not representative enough. 

* Q(I) is a normalised evaluation metric according to line 272, but the normalisation method is not mentioned in the paper. 

 

Minor comments:

* Lines 118-120: `widely used networks such as RNNs and LSTMs often fail to extract long-term dependencies between sounds directly and do not focus on feature extraction' vs. line 155: `LSTMs have a cyclic structure that adequately captures the dependencies of time series'. Pick a side: these two statements contradict each other.

* Lines 350-352: `This is because the spectrogram is a higher dimensional representation than the waveform, and its data tends to be more abstract and able to show more concrete audio details.' It is not clear what is meant here. The spectrogram contains as much information as the waveform (the STFT is a linear transform), and the representation is not abstract.

* Lines 377-379: `We can understand this problem simply from the point of view of speech features, i.e. a single treatment targeting the phase structure will destroy its original information to a certain extent.' Descriptions like this, given without proof, cannot support or explain anything.

* Lines 91-92: `Similar to deep learning methods, we first need to convert the waveform into amplitude and phase by short-time Fourier transform (STFT).' The description of the statistical methods is not necessarily wrong, but the chronological order could be maintained. (A short sketch of this conversion appears after this list.)

* Figure 3 has no axis label. 

* Brackets in Eq. (3) are not matched. 

And some descriptions are not fully correct in my opinion:

* Line 29 `... the T-F masking method, which extracts frequency domain features from speech and applies them as training targets'

* Line 55-57 `the convolutional neural network (CNN) used in traditional image processing is not able to detect the harmonic signals in the T-F spectrogram [10], as the speech signal is more generally correlated than the image'

* Lines 138-140: `As a result, the performance of time-domain methods is not fully applicable to speech enhancement tasks and degrades in the face of unfamiliar noise outside the training set.' The SOTA time-domain methods perform similarly to time-frequency methods according to the latest research (e.g., Conv-TasNet).

Author Response

Please see the attachment.

The Word document is the cover letter, and the PDF document is the revised manuscript with the changes marked.

Author Response File: Author Response.rar

Reviewer 3 Report

Summary:

This paper focused on speech enhancement. The authors proposed a dual-stream generative adversarial network (DPGAN) with four components: (1) a dual-stream generator with phase awareness, (2) information communication, (3) mask estimation blocks (MEB), and (4) a perception-guided discriminator. The experimental results on the VoiceBank-DEMAND dataset show that DPGAN achieved high performance on some metrics.

 

Strengths:

1. The related work is reviewed in detail.

2. The methods and discussion sections are well presented.



Weaknesses:

1. The authors only report results on a single dataset, VoiceBank-DEMAND. Results on other datasets should also be reported, e.g., the DNS (Deep Noise Suppression) challenge dataset [1].

2. In Table 1, it would be better to add references after the method names.

3. The proposed method only achieved high performance on the PESQ and COVL metrics. The reasons for this should be explained in more detail.

4. It would be better to show some of the generated images so that readers can judge whether they are good or not.

5. The authors emphasized the phase and amplitude masks; it is necessary to show some images of them.

6. In Table 2, the ablation study is incomplete; it is unclear which module is the most useful. For example:

- DPGAN(BLSTM) + MEB + phase

- DPGAN(BLSTM) + MEB + phase + IC(P)

- DPGAN(BLSTM) + MEB + phase + IC(C)

etc.

[1] Reddy, C. K., Gopal, V., Cutler, R., Beyrami, E., Cheng, R., Dubey, H., ... & Gehrke, J. (2020). The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results. arXiv preprint arXiv:2005.13981.

Author Response

Please see the attachment.

Author Response File: Author Response.rar
