This paper proposes a separation model based on a gated nested U-Net (GNU-Net) architecture, a deeply supervised symmetric encoder–decoder network that generates full-resolution feature maps. A series of nested skip pathways reduces the semantic gap between the feature maps of the encoder and decoder subnetworks. In the GNU-Net architecture, gated linear units (GLUs) replace conventional convolutional layers only in the backbone, not in the nested part. The outputs of GNU-Net are fed into a time-frequency (T-F) mask layer that generates two masks, one for the singing voice and one for the accompaniment. The two estimated masks, together with the magnitude and phase spectra of the mixture, are then transformed into time-domain signals. We explored two types of T-F mask layer: a discriminative training network and a difference mask layer. The experimental results show that the latter performs better. We evaluated the proposed model against three other models and against ideal T-F masks. The results demonstrate that the proposed model outperforms the compared models and that its performance approaches that of the ideal ratio mask (IRM). More importantly, the proposed model outputs the separated singing voice and accompaniment simultaneously, whereas each of the three compared models can separate only one source per trained model.
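As a rough illustration of the masking step described above, the following NumPy sketch computes ideal ratio masks for two sources and applies them to the mixture magnitude, reusing the mixture phase. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `ideal_ratio_masks` and `apply_mask` are hypothetical, the IRM is taken in its common magnitude-ratio form, the spectrograms are random placeholders, and the final inverse STFT back to the time domain is omitted.

```python
import numpy as np


def ideal_ratio_masks(mag_voice, mag_accomp, eps=1e-8):
    """Magnitude-ratio IRMs for two sources (assumed common definition)."""
    total = mag_voice + mag_accomp + eps  # eps avoids division by zero
    return mag_voice / total, mag_accomp / total


def apply_mask(mask, mix_mag, mix_phase):
    """Masked complex spectrogram: masked magnitude with the mixture phase."""
    return mask * mix_mag * np.exp(1j * mix_phase)


# Placeholder spectrograms (frequency bins x frames); not real audio.
rng = np.random.default_rng(0)
mag_voice = rng.random((513, 10))
mag_accomp = rng.random((513, 10))
mix_phase = rng.uniform(-np.pi, np.pi, size=(513, 10))

# Assumption for illustration only: source magnitudes add up to the
# mixture magnitude. Real mixtures satisfy this only approximately.
mix_mag = mag_voice + mag_accomp

mask_voice, mask_accomp = ideal_ratio_masks(mag_voice, mag_accomp)
voice_spec = apply_mask(mask_voice, mix_mag, mix_phase)
accomp_spec = apply_mask(mask_accomp, mix_mag, mix_phase)
```

Because the two masks sum to approximately one in every T-F bin, the two masked spectrograms add back up to the mixture spectrogram. A learned mask layer would replace `ideal_ratio_masks` with network outputs, while `apply_mask` would stay the same.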
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.