
Monaural Singing Voice and Accompaniment Separation Based on Gated Nested U-Net Architecture

by Haibo Geng 1,2, Ying Hu 1,2,* and Hao Huang 1,3

1 School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
2 Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Urumqi 830046, China
3 Key Laboratory of Multilingual Information Technology in Xinjiang Uygur Autonomous Region, Urumqi 830046, China
* Author to whom correspondence should be addressed.
Symmetry 2020, 12(6), 1051; https://doi.org/10.3390/sym12061051
Received: 31 May 2020 / Revised: 23 June 2020 / Accepted: 23 June 2020 / Published: 24 June 2020
This paper proposes a separation model based on a gated nested U-Net (GNU-Net) architecture, essentially a deeply supervised, symmetric encoder–decoder network that can generate full-resolution feature maps. Through a series of nested skip pathways, it reduces the semantic gap between the feature maps of the encoder and decoder subnetworks. In the GNU-Net architecture, gated linear units (GLUs) replace conventional convolutions only in the backbone, not in the nested part. The outputs of GNU-Net are fed into a time-frequency (T-F) mask layer, which generates two masks, one for the singing voice and one for the accompaniment. These two estimated masks, together with the magnitude and phase spectra of the mixture, are then transformed into time-domain signals. We explored two types of T-F mask layer, a discriminative training network and a difference mask layer; the experimental results show the latter to be better. We evaluated the proposed model against three other models, and also against ideal T-F masks. The results demonstrate that the proposed model outperforms the compared models, and its performance comes close to that of the ideal ratio mask (IRM). More importantly, the proposed model outputs the separated singing voice and accompaniment simultaneously, whereas each of the three compared models can separate only one source per trained model.
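The gating mechanism mentioned in the abstract is not spelled out on this page; as a rough illustration only, a GLU splits its input feature maps into two halves along the channel axis and uses one half, passed through a sigmoid, to gate the other: GLU(A, B) = A ⊙ σ(B). A minimal numpy sketch (all array shapes here are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(features, axis=0):
    """Gated linear unit: split the channels in half and gate
    one half with the sigmoid of the other, GLU(A, B) = A * sigmoid(B)."""
    a, b = np.split(features, 2, axis=axis)
    return a * sigmoid(b)

# 8 channels over a hypothetical 4x4 time-frequency patch.
x = np.random.randn(8, 4, 4)
y = glu(x, axis=0)
print(y.shape)  # (4, 4, 4): the gate halves the channel count
```

In a convolutional layer, A and B would be produced by two parallel convolutions (or one convolution with doubled output channels, split as above); the sigmoid gate lets the network control which features flow forward.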
Keywords: singing voice separation; nested U-Net; gated linear units; CNN; monaural source separation
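The abstract's masking-and-reconstruction step can be pictured with a small numpy sketch. This is not the paper's network, just the generic ratio-mask idea it builds on: each source gets a mask equal to its share of the mixture magnitude (the two masks summing to one, as in a difference mask), and the masked magnitudes are recombined with the mixture phase to form complex spectra that an ISTFT would turn back into waveforms. The spectrogram shapes and random data below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1e-8

# Hypothetical magnitude spectrograms (257 freq bins x 100 frames) of the
# two sources; in practice these come from STFTs of real audio.
voice_mag = np.abs(rng.standard_normal((257, 100)))
accomp_mag = np.abs(rng.standard_normal((257, 100)))

# Ratio masks in the spirit of the IRM: each source's share of the total.
mask_voice = voice_mag / (voice_mag + accomp_mag + eps)
mask_accomp = 1.0 - mask_voice  # difference mask: the two masks sum to one

# Mixture magnitude (additivity assumed) and mixture phase, reused for both
# estimated sources since only the mixture's phase is observed.
mix_mag = voice_mag + accomp_mag
mix_phase = rng.uniform(-np.pi, np.pi, mix_mag.shape)

# Masked complex spectra; an ISTFT would map these to time-domain signals.
voice_spec = mask_voice * mix_mag * np.exp(1j * mix_phase)
accomp_spec = mask_accomp * mix_mag * np.exp(1j * mix_phase)
```

Because the masks sum to one, the two estimated complex spectra add back up to the mixture spectrum exactly, which is what lets a single forward pass yield both sources at once.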