Cross-Domain Conv-TasNet Speech Enhancement Model with Two-Level Bi-Projection Fusion of Discrete Wavelet Transform †
Abstract
1. Introduction
Introduction of Cross-Domain Features in DNN-Based SE Network
- According to [31], features drawn from multiple domains may help learn a more effective SE model than single-domain features.
- The well-known discrete wavelet transform (DWT) can split a signal into multiple sub-bands without losing information or introducing distortion, which makes it a good candidate for producing features or serving as a pre-processing stage for an SE model. If the adopted wavelet function has small support, the DWT can be implemented efficiently. Furthermore, because of the downsampling in each decomposition step, the total length of the DWT sub-band signals is approximately equal to that of the original signal, so the DWT does not increase the data size as the number of sub-bands grows (a minimal sketch illustrating this property follows this list). Additionally, our recent studies [32,33,34] have shown that the DWT can be applied in various ways in SE to improve its performance.
- The work in [31] employs BPF to fuse two feature sources. We wondered whether the BPF mechanism could be extended to fuse more than two feature sources.
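The following minimal sketch illustrates the length-preserving and lossless properties of a one-level DWT mentioned above; the choice of the "db4" wavelet and the use of the PyWavelets package are assumptions made for illustration and are not taken from this paper.

```python
import numpy as np
import pywt  # PyWavelets (assumed here only for illustration)

# A frame of speech samples (random stand-in for a real frame).
frame = np.random.randn(256).astype(np.float32)

# One-level DWT: approximation (low-pass) and detail (high-pass) sub-bands.
# 'db4' is an assumed wavelet choice; the paper does not fix it here.
cA, cD = pywt.dwt(frame, "db4")

# Each sub-band has roughly half the length of the frame, so the total
# number of coefficients stays close to the original frame length.
print(len(frame), len(cA) + len(cD))  # e.g., 256 vs. 262 (small border overhead)

# The transform is invertible: the frame is recovered without distortion.
recon = pywt.idwt(cA, cD, "db4")[: len(frame)]
print(np.allclose(frame, recon, atol=1e-5))  # True
```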
2. Background Knowledge
2.1. Conv-TasNet
2.1.1. Encoder and Decoder
2.1.2. Mask Estimation Network
2.2. An Advanced Variant of Conv-TasNet: DPTNet
- Segmentation: In the segmentation stage, the input feature matrix from the upstream encoder is split into fixed-length chunks with hopping (overlap). These chunks are then stacked to form a 3-D tensor.
- Dual-path modeling: The 3-D tensor passes through a stack of dual-path modeling blocks, each containing intra-chunk and inter-chunk processing. The intra-chunk processing models the local information within each chunk with an improved transformer defined in [40]. The respective output then passes through another improved transformer for inter-chunk processing, which captures the global dependencies among different chunks.
- Overlap-add: After the dual-path modeling is completed, the overlap-add method is applied to the chunks of the resulting tensor to generate a preliminary mask. A hyperbolic tangent function is then applied as a non-linearity to the preliminary mask to ensure that the final mask values lie in the range (−1, 1). A minimal sketch of the segmentation and overlap-add steps is given below.
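The sketch below shows, under assumptions, how the segmentation and overlap-add stages described above can be realized with PyTorch's unfold/fold operations; the chunk size, hop size, tensor shapes, and function names are illustrative and are not taken from the paper (the paper's 3-D tensor is per utterance, so the batch dimension here makes it 4-D).

```python
import torch
import torch.nn.functional as F

def segment(x, chunk_len=100, hop=50):
    """Split an encoder output (batch, channels, frames) into overlapping
    fixed-length chunks, giving (batch, channels, chunk_len, n_chunks).
    Chunk and hop lengths are illustrative assumptions."""
    b, c, t = x.shape
    pad = (hop - (t - chunk_len) % hop) % hop       # zero-pad so the last chunk is complete
    x = F.pad(x, (0, pad))
    chunks = x.unfold(dimension=2, size=chunk_len, step=hop)  # (b, c, n_chunks, chunk_len)
    return chunks.permute(0, 1, 3, 2).contiguous(), t

def overlap_add(chunks, orig_len, hop=50):
    """Inverse of segment(): overlap-add the chunks back to (batch, channels, frames)."""
    b, c, chunk_len, n_chunks = chunks.shape
    padded_len = chunk_len + (n_chunks - 1) * hop
    out = F.fold(chunks.reshape(b, c * chunk_len, n_chunks),
                 output_size=(1, padded_len), kernel_size=(1, chunk_len), stride=(1, hop))
    return out.reshape(b, c, padded_len)[..., :orig_len]

# A dual-path block would process the chunked tensor along the intra-chunk axis
# and then the inter-chunk axis before the overlap-add step shown here.
feats = torch.randn(2, 64, 403)             # (batch, channels, frames)
chunks, t = segment(feats)                  # (2, 64, 100, n_chunks)
mask = torch.tanh(overlap_add(chunks, t))   # preliminary mask squashed into (-1, 1)
print(mask.shape)                           # torch.Size([2, 64, 403])
```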
2.3. Discrete Wavelet Transform
3. Proposed SE Framework with DWT Features
3.1. Fusion of One-Level DWT Features and Time-Domain Features
- Create time-domain features and one-level DWT features: At the encoder end of the framework, the input time-domain utterance is split into K overlapping frames of length L, each frame represented by a length-L vector. These frame vectors are concatenated as columns to form a data matrix X. The time-domain feature matrix, denoted W_T hereafter, is then created as in Equation (1) of Section 2.1.1. To create another branch of features, we propose to apply a one-level DWT to the data matrix X column by column. That is, each frame signal passes through a one-level DWT (as introduced in Section 2.3) to produce its approximation (low-pass) and detail (high-pass) sub-band signals. Compared with the original frame, each sub-band signal has half the length and half the bandwidth. We organize these sub-band signals into two feature matrices, X_A and X_D, whose columns are the frame-wise approximation and detail sub-band signals, respectively. Furthermore, we process X_A and X_D individually with a 1-D trainable convolution (together with the nonlinear function H) to produce two matrices, W_A and W_D, of the same size as the time-domain feature matrix W_T. Analogously to Equation (1), the operations can be written as W_A = H(conv1D_A(X_A)) and W_D = H(conv1D_D(X_D)).
- Integrate time-domain features and DWT features: At this point, we have three feature matrices of the same size: W_T (time-domain features), W_A (DWT-wise approximation features), and W_D (DWT-wise detail features). We have devised three ways to integrate them into the ultimate encoder feature matrix W_E:
- Addition: Here, W_E is simply the weighted sum of the three matrices, W_E = α W_T + β W_A + γ W_D, with weights α, β, and γ.
- Concatenation: The other intuitive way to integrate them is to concatenate the three matrices along the feature (channel) dimension: W_E = [W_T; W_A; W_D].
- Fusion and concatenation: To exploit the information across the two DWT feature matrices more effectively, we adopt the bi-projection fusion (BPF) method [31]. BPF has been used to integrate time-domain and frequency-domain features and has exhibited superior behavior in DPTNet [39]. In addition, the two DWT feature matrices, which reflect the low-pass and high-pass half-band short-time spectra of speech, are expected to be of unequal importance. For example, the low-pass part, W_A, contains more information about vowels, whereas the high-pass part, W_D, might correspond better to consonants. Furthermore, W_A and W_D usually exhibit different signal-to-noise ratios (SNRs) because of the embedded speech components and the background noise. Therefore, the two complementary masks of the BPF module (non-negative and summing to one) are well suited to leveraging these two features for SE. After obtaining the BPF features from the two DWT feature matrices W_A and W_D, as in Figure 9a, we concatenate them with the time-domain features W_T, as in Figure 9b. The details are as follows: First, the concatenation of W_A and W_D is used to estimate a ratio mask matrix M through a trainable projection followed by a sigmoid function. The DWT-wise BPF features are then computed as W_F = M ⊙ W_A + (1 − M) ⊙ W_D, and the ultimate encoder features are W_E = [W_T; W_F]. Consequently, the final feature matrix has double the size of each original feature matrix. A code sketch of this fusion is given after this list.
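As a complement to the description above, the following PyTorch sketch shows one plausible realization of the DWT-branch encoder and the "one BPF and concatenation" integration; all layer sizes, the gate design (a 1x1 convolution followed by a sigmoid), and the variable names are assumptions for illustration rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class BPFConcatEncoder(nn.Module):
    """Sketch of 'one BPF and concatenation': fuse the DWT approximation and
    detail features with complementary masks, then concatenate the result
    with the time-domain features. Sizes are illustrative assumptions."""
    def __init__(self, frame_len=32, n_feats=256):
        super().__init__()
        half = frame_len // 2
        # Time-domain encoder (Equation (1) style: 1-D conv plus ReLU as H).
        self.enc_time = nn.Sequential(nn.Conv1d(frame_len, n_feats, 1), nn.ReLU())
        # Encoders for the DWT approximation / detail sub-band frames.
        self.enc_approx = nn.Sequential(nn.Conv1d(half, n_feats, 1), nn.ReLU())
        self.enc_detail = nn.Sequential(nn.Conv1d(half, n_feats, 1), nn.ReLU())
        # BPF gate: estimate a ratio mask M from the concatenated DWT features.
        self.gate = nn.Sequential(nn.Conv1d(2 * n_feats, n_feats, 1), nn.Sigmoid())

    def forward(self, x_frames, x_approx, x_detail):
        # Inputs: (batch, frame_len, K) and two (batch, frame_len // 2, K) matrices.
        w_t = self.enc_time(x_frames)     # W_T: (batch, n_feats, K)
        w_a = self.enc_approx(x_approx)   # W_A
        w_d = self.enc_detail(x_detail)   # W_D
        m = self.gate(torch.cat([w_a, w_d], dim=1))   # ratio mask M in (0, 1)
        w_f = m * w_a + (1.0 - m) * w_d                # BPF-fused DWT features
        return torch.cat([w_t, w_f], dim=1)            # W_E: double the channels

# Usage example with random stand-in data (K = 200 frames).
enc = BPFConcatEncoder()
w_e = enc(torch.randn(4, 32, 200), torch.randn(4, 16, 200), torch.randn(4, 16, 200))
print(w_e.shape)  # torch.Size([4, 512, 200])
```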
3.2. Fusion of Two-Level DWT Features and Time-Domain Features
- Use two BPF modules for two pairs of DWT feature matrices: This method is illustrated in Figure 11. Denoting the three two-level DWT feature matrices, from the lowest to the highest sub-band, by W₁, W₂, and W₃, we pair the matrices whose sub-bands are closer in frequency and apply a BPF module to each pair. The two BPF outputs are then added to form the final DWT feature matrix, W_F = BPF(W₁, W₂) + BPF(W₂, W₃), which is subsequently concatenated with the time-domain features.
- Use an MPF module with intra-channel softmax for three DWT feature matrices: This method is illustrated in Figure 12. It extends the idea of BPF to linearly combine more than two sources, which we term multiple projection fusion (MPF). Rather than the sigmoid function used in BPF, the softmax function is used to obtain three multiplicative masks, M₁, M₂, and M₃, one for each DWT feature matrix. More precisely, the softmax is applied across the three DWT features individually for each channel (here, a channel means an entry of the frame-wise feature vector), just as in the BPF process for two DWT feature matrices. The three masks obtained in this way therefore satisfy M₁ + M₂ + M₃ = 1 element-wise. The DWT-wise MPF features are then generated by multiplying the masks with the corresponding DWT feature matrices and summing: W_F = M₁ ⊙ W₁ + M₂ ⊙ W₂ + M₃ ⊙ W₃.
- Use an MPF module with inter-channel softmax for three DWT feature matrices: This method is illustrated in Figure 13. The BPF and MPF modules in the previous methods compute the weights of the different DWT-wise sub-band feature vectors separately at each channel (i.e., at each position of the feature vector), so the resulting masks indicate the relative importance of each sub-band at that channel. Accordingly, the sigmoid or softmax function is applied N times in the BPF or MPF module, where N is the number of channels (the dimensionality) of each sub-band feature vector. By contrast, this method applies a single softmax in the MPF module over all channels of the three sub-band feature vectors simultaneously. The obtained mask values are therefore supposed to reflect the relative importance of the different channels across the different sub-band feature vectors. The three masks obtained in this way satisfy Σₙ (M₁[n] + M₂[n] + M₃[n]) = 1 for each frame, where the sum runs over all N channels. The DWT-wise MPF features are again generated as W_F = M₁ ⊙ W₁ + M₂ ⊙ W₂ + M₃ ⊙ W₃. A code sketch of both MPF variants follows this list.
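The sketch below contrasts the two MPF variants described above under the same assumptions as the earlier BPF sketch; the projection layer, tensor shapes, and names are illustrative and not the authors' exact design.

```python
import torch
import torch.nn as nn

class MPF(nn.Module):
    """Multiple projection fusion of three sub-band feature matrices.
    mode='intra': softmax over the 3 sub-bands, separately at each channel,
                  so M1 + M2 + M3 = 1 element-wise.
    mode='inter': a single softmax over all channels of all 3 sub-bands per
                  frame, so the mask entries of one frame sum to 1 overall.
    Shapes and the 1x1-conv projection are illustrative assumptions."""
    def __init__(self, n_feats=256, mode="intra"):
        super().__init__()
        self.mode = mode
        # Project the concatenated sub-band features to three mask logits.
        self.proj = nn.Conv1d(3 * n_feats, 3 * n_feats, 1)

    def forward(self, w1, w2, w3):
        # w1, w2, w3: (batch, n_feats, K) feature matrices, low to high frequency.
        b, n, k = w1.shape
        logits = self.proj(torch.cat([w1, w2, w3], dim=1))    # (b, 3n, K)
        logits = logits.view(b, 3, n, k)
        if self.mode == "intra":
            masks = torch.softmax(logits, dim=1)               # over the 3 sub-bands
        else:  # "inter"
            masks = torch.softmax(logits.reshape(b, 3 * n, k), dim=1).view(b, 3, n, k)
        m1, m2, m3 = masks[:, 0], masks[:, 1], masks[:, 2]
        return m1 * w1 + m2 * w2 + m3 * w3                     # fused DWT features

# Usage: the fused output would then be concatenated with the time-domain features.
w1, w2, w3 = (torch.randn(4, 256, 200) for _ in range(3))
print(MPF(mode="intra")(w1, w2, w3).shape)  # torch.Size([4, 256, 200])
print(MPF(mode="inter")(w1, w2, w3).shape)  # torch.Size([4, 256, 200])
```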
4. Experimental Setup, Results, and Discussion
4.1. Experimental Setup
- Perceptual evaluation of speech quality (PESQ) [47]: This metric rates the quality of the processed utterances relative to the original noise-free ones. PESQ indicates the quality difference between the enhanced and clean speech signals and ranges from −0.5 to 4.5, with higher scores indicating better quality. Briefly speaking, the PESQ calculation involves several stages: time alignment, level alignment to a calibrated listening level, time-frequency mapping, frequency warping, and compressive loudness scaling.
- Short-time objective intelligibility (STOI) [48]: This metric measures the objective intelligibility over short-time time-frequency (TF) regions of an utterance using the discrete Fourier transform (DFT). STOI ranges from 0 to 1, and a higher STOI score corresponds to better intelligibility. Briefly speaking, the STOI of a processed utterance with respect to its clean counterpart is computed as follows: First, the STFT is applied to the processed and clean utterances to obtain their spectrograms, X̂ and X. Then, a one-third octave band analysis is performed by grouping the DFT bins of X̂ and X, yielding Ŷ and Y, respectively. The octave band energy of Ŷ is further normalized to match that of Y and then clipped to lower-bound the signal-to-distortion ratio (SDR), producing Ŷ′. Finally, the linear correlation coefficient between Ŷ′ and Y is computed for each octave band j and each frame m, and the STOI score is the average of these correlation coefficients over all octave bands and frames.
- Scale-invariant signal-to-noise ratio (SI-SNR) [46]: This metric usually reflects the degree of artifact distortion between the processed utterance ŝ and its clean counterpart s. It is formulated by s_target = (⟨ŝ, s⟩ / ‖s‖²) s, e_noise = ŝ − s_target, and SI-SNR = 10 log₁₀(‖s_target‖² / ‖e_noise‖²), where ⟨·,·⟩ denotes the inner product and ‖·‖ the Euclidean norm. A minimal computation sketch follows this list.
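For concreteness, here is a small NumPy sketch of the SI-SNR computation defined above; the zero-mean normalization step is a common convention in SI-SNR implementations and is an assumption here rather than a detail stated in the paper.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR (in dB) between an estimated and a reference signal.
    Mean removal before projection is a common convention (assumed here)."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to obtain the target component.
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10.0 * np.log10((np.sum(s_target ** 2) + eps) / (np.sum(e_noise ** 2) + eps))

# Example: a noisy estimate of a clean reference signal.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noisy = clean + 0.1 * rng.standard_normal(16000)
print(round(si_snr(noisy, clean), 2))  # roughly 20 dB for this noise level
```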
4.2. Experiments on Conv-TasNet with the VoiceBank-DEMAND Task
- Compared with the unprocessed baseline, all Conv-TasNet variants used here give significantly superior PESQ and SI-SNR scores, reflecting their excellent SE capability. By contrast, the STOI improvement they bring is moderate, probably because the baseline STOI score is already as high as 0.921.
- The cross-domain Conv-TasNet [31] with time and STFT features outperforms the original Conv-TasNet with time-domain features alone in PESQ and SI-SNR, and it provides the best SI-SNR among all of the variants. These results reveal that adding STFT features benefits Conv-TasNet significantly.
- When the STFT is replaced by a one-level DWT, the corresponding three Conv-TasNet variants exhibit similar or better PESQ results. This indicates that DWT features can complement time-domain features to provide Conv-TasNet with better speech quality. Among the three one-level DWT-based Conv-TasNets, addition-wise integration performs best in PESQ, outperforming the more complicated "one BPF and concatenation" method. However, in terms of SI-SNR, the one-level DWT-wise methods perform worse than the method with STFT.
- As for the cases in which two-level DWT features are involved, the achieved PESQ scores are close to, or moderately better than, those obtained with STFT, but not necessarily superior to those with a one-level DWT. Notably, the best PESQ score is obtained by concatenating the time features with the two-level DWT features fused by one MPF with inter-channel softmax. This result suggests that further weighting the different channel values of the encoding features is a promising direction for improving SE performance.
- To further examine whether the presented fusion feature sets provide statistically significant improvement in PESQ relative to the pure time-domain feature, we perform a one-tailed two-sample t-test, the details of which are given in Appendix A (a small computation sketch follows this list). Referring to the results in Table A1, the four fusion types "addition", "concatenation", "one BPF and concatenation", and "one MPF with inter-channel softmax" provide Conv-TasNet with significant PESQ improvements over Conv-TasNet using the time-domain feature only. The other two fusion types ("two BPFs" and "one MPF with intra-channel softmax") do not improve PESQ significantly.
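The following sketch shows how such a one-tailed two-sample t-test on per-utterance PESQ scores could be carried out with SciPy; the equal-variance assumption and the 0.05 significance level are illustrative choices, not details confirmed by the paper.

```python
import numpy as np
from scipy import stats

def one_tailed_improvement(scores_new, scores_baseline, alpha=0.05):
    """One-tailed two-sample t-test: is the mean PESQ of the new system
    significantly higher than that of the baseline? Equal variances and
    alpha = 0.05 are assumptions made for this sketch."""
    t_stat, p_two_sided = stats.ttest_ind(scores_new, scores_baseline)
    p_one_sided = p_two_sided / 2.0 if t_stat > 0 else 1.0 - p_two_sided / 2.0
    return t_stat, p_one_sided, p_one_sided < alpha

# Example with synthetic per-utterance PESQ scores (stand-ins, not real data).
rng = np.random.default_rng(1)
baseline = rng.normal(2.62, 0.60, size=824)
candidate = rng.normal(2.69, 0.63, size=824)
t_stat, p, significant = one_tailed_improvement(candidate, baseline)
print(f"t = {t_stat:.3f}, one-sided p = {p:.4f}, significant = {significant}")
```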
4.3. Experiments on Conv-TasNet with the VoiceBank-QUT Task
4.4. Experiments on DPTNet with Two Tasks
- As expected, almost all encoding features yield better metric scores with DPTNet than with Conv-TasNet, except for the time-domain features. In our opinion, an improper model configuration might cause this performance disagreement in the case of time features.
- When integrated with time features, the STFT and the various forms of DWT features provide DPTNet with similar STOI and SI-SNR results.
- The PESQ results for DPTNet are somewhat the converse of those for Conv-TasNet. With the DPTNet framework, STFT features yield better PESQ than DWT features on the VoiceBank-DEMAND task, while they are inferior on the VoiceBank-QUT task. However, on the VoiceBank-DEMAND task, the DWT features with the "one MPF with inter-channel softmax" integration provide PESQ close to that of the STFT features. In contrast, the DWT features with the "one BPF and concatenation" integration perform significantly better than the STFT features in PESQ on the more challenging VoiceBank-QUT task.
- The results for the different feature sets are not entirely consistent between DPTNet and Conv-TasNet. Nevertheless, we can conclude that the presented DWT-domain features offer beneficial SE information: they are complementary to the time-domain features and improve the performance of both Conv-TasNet and DPTNet. In addition, referring to Table A3 and Table A4, all three fusion types listed there ("one BPF and concatenation", "two BPFs", and "one MPF with inter-channel softmax") give DPTNet a statistically significant PESQ improvement relative to DPTNet using the time-domain feature only. These results indicate that the DWT-domain features improve DPTNet more than Conv-TasNet in both the DEMAND and QUT noise scenarios.
4.5. Spectrogram Demonstration for Various SE Methods
5. Concluding Remarks
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
Table A1. One-tailed two-sample t-test results for PESQ (Conv-TasNet, VoiceBank-DEMAND task).

| Feature Type | Integration Manner | PESQ Mean | PESQ Standard Deviation | t-Statistic | Significant Improvement? |
|---|---|---|---|---|---|
| Time | – | 2.618 | 0.5991 | – | – |
| Time and one-level DWT | Addition | 2.681 | 0.6311 | 2.081 | True |
| | Concatenation | 2.669 | 0.6216 | 1.698 | True |
| | One BPF and concatenation | 2.668 | 0.6084 | 1.683 | True |
| Time and two-level DWT | Two BPFs | 2.654 | 0.6039 | 1.216 | False |
| | One MPF with intra-channel softmax | 2.667 | 0.6240 | 1.628 | False |
| | One MPF with inter-channel softmax | 2.690 | 0.6337 | 2.373 | True |
Table A2. One-tailed two-sample t-test results for PESQ (Conv-TasNet, VoiceBank-QUT task).

| Feature Type | Integration Manner | PESQ Mean | PESQ Standard Deviation | t-Statistic | Significant Improvement? |
|---|---|---|---|---|---|
| Time | – | 1.908 | 0.5875 | – | – |
| Time and one-level DWT | Addition | 1.922 | 0.5897 | 0.483 | False |
| | Concatenation | 1.932 | 0.5932 | 0.826 | False |
| | One BPF and concatenation | 1.936 | 0.5929 | 0.964 | False |
| Time and two-level DWT | Two BPFs | 1.922 | 0.5936 | 0.482 | False |
| | One MPF with intra-channel softmax | 1.917 | 0.5868 | 0.312 | False |
| | One MPF with inter-channel softmax | 1.926 | 0.6009 | 0.616 | False |
Table A3. One-tailed two-sample t-test results for PESQ (DPTNet, VoiceBank-DEMAND task).

| Feature Type | Integration Manner | PESQ Mean | PESQ Standard Deviation | t-Statistic | Significant Improvement? |
|---|---|---|---|---|---|
| Time | – | 2.549 | 0.5885 | – | – |
| Time and one-level DWT | One BPF and concatenation | 2.724 | 0.5962 | 6.004 | True |
| Time and two-level DWT | Two BPFs | 2.745 | 0.6061 | 6.668 | True |
| | One MPF with inter-channel softmax | 2.779 | 0.6250 | 7.700 | True |
Table A4. One-tailed two-sample t-test results for PESQ (DPTNet, VoiceBank-QUT task).

| Feature Type | Integration Manner | PESQ Mean | PESQ Standard Deviation | t-Statistic | Significant Improvement? |
|---|---|---|---|---|---|
| Time | – | 1.804 | 0.5079 | – | – |
| Time and one-level DWT | One BPF and concatenation | 2.044 | 0.6158 | 8.641 | True |
| Time and two-level DWT | Two BPFs | 2.033 | 0.6372 | 8.077 | True |
| | One MPF with inter-channel softmax | 2.034 | 0.6116 | 8.315 | True |
References
- Kahneman, D.; Sibony, O.; Sunstein, C.R. Noise: A Flaw in Human Judgment; Little, Brown Spark: New York, NY, USA, 2021. [Google Scholar]
- Boll, S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 1979, 27, 113–120. [Google Scholar] [CrossRef]
- Scalart, P.; Filho, J.V. Speech enhancement based on a priori signal to noise estimation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Atlanta, GA, USA, 9 May 1996. [Google Scholar]
- Gauvain, J.; Lee, C.-H. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Acoust. Speech Signal Process. 1994, 2, 291–298. [Google Scholar] [CrossRef]
- Leggetter, C.J.; Woodland, P.C. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput. Speech Lang. 1995, 9, 171–185. [Google Scholar] [CrossRef]
- Gales, M.J.F. Maximum likelihood linear transformations for HMM-based speech recognition. Comput. Speech Lang. 1998, 12, 75–98. [Google Scholar] [CrossRef]
- Gales, M.J. Model-Based Techniques for Noise Robust Speech Recognition. Ph.D. Thesis, Cambridge University, Cambridge, UK, 1995. [Google Scholar]
- Ephraim, Y.; Malah, D. Speech enhancement using a minimum mean square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 1985, 33, 443–445. [Google Scholar] [CrossRef]
- Wu, J.; Huo, Q. An environment-compensated minimum classification error training approach based on stochastic vector mapping. IEEE Trans. Audio Speech Lang Process. 2006, 14, 2147–2155. [Google Scholar] [CrossRef]
- Buera, L.; Lleida, E.; Miguel, A.; Ortega, A.; Saz, O. Cepstral vector normalization based on stereo data for robust speech recognition. IEEE Trans. Audio Speech Lang Process. 2007, 15, 1098–1113. [Google Scholar] [CrossRef]
- Xu, Y.; Du, J.; Dai, L.; Lee, C. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 7–19. [Google Scholar] [CrossRef]
- Zhao, Y.; Wang, D.; Merks, I.; Zhang, T. DNN-based enhancement of noisy and reverberant speech. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 6525–6529. [Google Scholar]
- Wang, D. Deep learning reinvents the hearing aid. IEEE Spectr. 2017, 54, 32–37. [Google Scholar] [CrossRef]
- Chen, J.; Wang, Y.; Yoho, S.E.; Wang, D.; Healy, E.W. Large-scale training to increase speech intelligibility for hearing impaired listeners in novel noises. J. Acoust. Soc. Am. 2016, 139, 2604–2612. [Google Scholar] [CrossRef]
- Karjol, P.; Kumar, M.A.; Ghosh, P.K. Speech enhancement using multiple deep neural networks. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018. [Google Scholar]
- Kounovsky, T.; Malek, J. Single channel speech enhancement using convolutional neural network. In Proceedings of the 2017 IEEE International Workshop of Electronics, Control, Measurement, Signals and their Application to Mechatronics (ECMSM), Donostia, Spain, 24–26 May 2017. [Google Scholar]
- Chakrabarty, S.; Wang, D.; Habets, E.A.P. Time-frequency masking based online speech enhancement with multi-channel data using convolutional neural networks. In Proceedings of the 16th International Workshop on Acoustic Signal Enhancement (IWAENC), Tokyo, Japan, 17–20 September 2018. [Google Scholar]
- Luo, Y.; Mesgarani, N. Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1256–1266. [Google Scholar] [CrossRef]
- Fu, S.; Tsao, Y.; Lu, X.; Kawai, H. Raw waveform-based speech enhancement by fully convolutional networks. In Proceedings of the Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, Malaysia, 12–15 December 2017. [Google Scholar]
- Kiranyaz, S.; Ince, T.; Abdeljaber, O.; Avci, O.; Gabbouj, M. 1-D convolutional neural networks for signal processing applications. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019. [Google Scholar]
- Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Proceedings of the NIPS Workshop on Deep Learning; Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
- Sun, L.; Du, J.; Dai, L.; Lee, C. Multiple-target deep learning for LSTMRNN based speech enhancement. In Proceedings of the Hands-Free Speech Communication and Microphone Arrays (HSCMA), San Francisco, CA, USA, 1–3 March 2017. [Google Scholar]
- Wang, Y.; Narayanan, A.; Wang, D. On training targets for supervised speech separation. IEEE/ACM Trans. Audio Speech Lang. 2014, 22, 1849–1858. [Google Scholar] [CrossRef] [PubMed]
- Wang, D.; Chen, J. Supervised speech separation based on deep learning: An overview. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 1702–1726. [Google Scholar] [CrossRef] [PubMed]
- Roman, N.; Woodruff, J. Ideal binary masking in reverberation. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO), Bucharest, Romania, 27–31 August 2012; pp. 629–633. [Google Scholar]
- Narayanan, A.; Wang, D. Ideal ratio mask estimation using deep neural networks for robust speech recognition. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 7092–7096. [Google Scholar] [CrossRef]
- Williamson, D.S.; Wang, Y.; Wang, D. Complex ratio masking for monaural speech separation. IEEE/ACM Trans. Audio Speech Lang. 2015, 24, 483–492. [Google Scholar] [CrossRef] [PubMed]
- Erdogan, H.; Hershey, J.R.; Watanabe, S.; Roux, J.L. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015. [Google Scholar]
- Vani, H.Y.; Anusuya, M.A. Hilbert Huang transform based speech recognition. In Proceedings of the 2016 Second International Conference on Cognitive Computing and Information Processing (CCIP), Mysuru, India, 12–13 August 2016; pp. 1–6. [Google Scholar] [CrossRef]
- Ravanelli, M.; Bengio, Y. Speech and speaker recognition from raw waveform with sincnet. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018. [Google Scholar]
- Chao, F.-A.; Hung, J.-W.; Chen, B. Cross-Domain Single-Channel Speech Enhancement Model with BI-Projection Fusion Module for Noise-Robust ASR. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021. [Google Scholar]
- Lin, J.-Y.; Chen, Y.-T.; Liu, K.-Y.; Hung, J.-W. An evaluation study of modulation-domain wavelet denoising method by alleviating different sub-band portions for speech enhancement. In Proceedings of the 2019 IEEE International Conference on Consumer Electronics-Taiwan (ICCE-TW), Yilan, Taiwan, 20–22 May 2019. [Google Scholar]
- Chen, Y.-T.; Lin, Z.-Q.; Hung, J.-W. Employing low-pass filtered temporal speech features for the training of ideal ratio mask in speech enhancement. In Proceedings of the Conference on Computational Linguistics and Speech Processing (ROCLING), Taoyuan, Taiwan, 15–16 October 2021. [Google Scholar]
- Liao, C.-W.; Wu, P.-C.; Hung, J.-W. A Preliminary Study of Employing Lowpass-Filtered and Time-Reversed Feature Sequences as Data Augmentation for Speech Enhancement Deep Networks. In Proceedings of the 2022 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Penang, Malaysia, 22–25 November 2022. [Google Scholar]
- Chen, Y.T.; Wu, Z.T.; Hung, J.W. A Preliminary Study of the Application of Discrete Wavelet Transform Features in Conv-TasNet Speech Enhancement Model. In Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022), Taipei, Taiwan, 21–22 November 2022; pp. 92–99. [Google Scholar]
- Ochiai, T.; Delcroix, M.; Ikeshita, R.; Kinoshita, K.; Nakatani, T.; Araki, S. Beam-TasNet: Time-domain Audio Separation Network Meets Frequency-domain Beamformer. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6384–6388. [Google Scholar] [CrossRef]
- Lea, C.; Flynn, M.D.; Vidal, R.; Reiter, A.; Hager, G.D. Temporal Convolutional Networks for Action Segmentation and Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1003–1012. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv 2015, arXiv:1502.01852v1. [Google Scholar]
- Chen, J.; Mao, Q.; Liu, D. Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation. arXiv 2020, arXiv:2007.13975. [Google Scholar]
- Luo, Y.; Chen, Z.; Yoshioka, T. Dual-path rnn: Efficient long sequence modeling for time-domain single-channel speech separation. arXiv 2019, arXiv:1910.06379. [Google Scholar]
- Valentini-Botinhao, C.; Wang, X.; Takaki, S.; Yamagishi, J. Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech. In Proceedings of the 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 13–15 September 2016; pp. 146–152. [Google Scholar]
- Veaux, C.; Yamagishi, J.; King, S. The voice bank corpus: Design, collection and data analysis of a large regional accent speech database. In Proceedings of the 2013 International Conference Oriental COCOSDA Held Jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE 2013), Gurgaon, India, 25–27 November 2013. [Google Scholar]
- Thiemann, J.; Ito, N.; Vincent, E. Demand: A collection of multi-channel recordings of acoustic noise in diverse environments. In Proceedings of the 21st International Congress on Acoustics (ICA 2013), Montreal, QC, Canada, 2–7 June 2013. [Google Scholar]
- Dean, D.B.; Sridharan, S.; Vogt, R.J.; Mason, M.W. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), Chiba, Japan, 26–30 September 2010; pp. 3110–3113. [Google Scholar]
- Choose a Wavelet. Available online: https://www.mathworks.com/help/wavelet/gs/choose-a-wavelet.html (accessed on 27 April 2023).
- Isik, Y.; Roux, J.L.; Chen, Z.; Watanabe, S.; Hershey, J.R. Single-Channel Multi-Speaker Separation Using Deep Clustering. arXiv 2016, arXiv:1607.02173. [Google Scholar]
- ITU-T Recommendation P. 862; Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs. ITU Telecommunication Standardization Sector: Geneva, Switzerland, 2001.
- Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2125–2136. [Google Scholar] [CrossRef]
- Défossez, A.; Usunier, N.; Bottou, L.; Bach, F. Demucs: Deep Extractor for Music Sources with extra unlabeled data remixed. arXiv 2019, arXiv:1909.01174. [Google Scholar]
- Park, H.J.; Kang, B.H.; Shin, W.; Kim, J.S.; Han, S.W. MANNER: Multi-View Attention Network For Noise Erasure. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 7842–7846. [Google Scholar] [CrossRef]
Metric scores of Conv-TasNet variants on the VoiceBank-DEMAND task:

| Feature Domains | Integration Manner | PESQ | STOI | SI-SNR (dB) |
|---|---|---|---|---|
| Unprocessed baseline | – | 1.970 | 0.921 | 8.445 |
| Time | – | 2.618 | 0.943 | 19.500 |
| Time and STFT [31] | – | 2.648 | 0.942 | 19.712 |
| Time and one-level DWT | Addition | 2.681 | 0.942 | 19.352 |
| | Concatenation | 2.669 | 0.942 | 19.609 |
| | One BPF and concatenation | 2.668 | 0.943 | 19.496 |
| Time and two-level DWT (concatenation) | Two BPFs | 2.654 | 0.943 | 19.703 |
| | One MPF with intra-channel softmax | 2.667 | 0.942 | 19.540 |
| | One MPF with inter-channel softmax | 2.690 | 0.942 | 19.378 |
Metric scores of Conv-TasNet variants on the VoiceBank-QUT task:

| Feature Domains | Integration Manner | PESQ | STOI | SI-SNR (dB) |
|---|---|---|---|---|
| Unprocessed baseline | – | 1.247 | 0.784 | 3.876 |
| Time | – | 1.908 | 0.860 | 13.694 |
| Time and STFT [31] | – | 1.936 | 0.863 | 13.779 |
| Time and one-level DWT | Addition | 1.922 | 0.858 | 13.645 |
| | Concatenation | 1.932 | 0.861 | 13.775 |
| | One BPF and concatenation | 1.936 | 0.862 | 13.824 |
| Time and two-level DWT (concatenation) | Two BPFs | 1.922 | 0.861 | 13.837 |
| | One MPF with intra-channel softmax | 1.917 | 0.861 | 13.748 |
| | One MPF with inter-channel softmax | 1.926 | 0.859 | 13.729 |
Metric scores of DPTNet variants on the VoiceBank-DEMAND task:

| Feature Domains | Integration Manner | PESQ | STOI | SI-SNR (dB) |
|---|---|---|---|---|
| Unprocessed baseline | – | 1.970 | 0.921 | 8.445 |
| Time | – | 2.549 | 0.935 | 19.080 |
| Time and STFT [31] | – | 2.782 | 0.946 | 19.963 |
| Time and one-level DWT | One BPF and concatenation | 2.724 | 0.945 | 19.960 |
| Time and two-level DWT (concatenation) | Two BPFs | 2.745 | 0.944 | 19.624 |
| | One MPF with inter-channel softmax | 2.779 | 0.944 | 19.470 |
Metric scores of DPTNet variants on the VoiceBank-QUT task:

| Feature Domains | Integration Manner | PESQ | STOI | SI-SNR (dB) |
|---|---|---|---|---|
| Unprocessed baseline | – | 1.247 | 0.784 | 3.876 |
| Time | – | 1.804 | 0.845 | 12.802 |
| Time and STFT [31] | – | 2.019 | 0.870 | 14.500 |
| Time and one-level DWT | One BPF and concatenation | 2.044 | 0.873 | 14.543 |
| Time and two-level DWT (concatenation) | Two BPFs | 2.033 | 0.872 | 14.549 |
| | One MPF with inter-channel softmax | 2.034 | 0.871 | 14.611 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).