Monaural Singing Voice and Accompaniment Separation Based on Gated Nested U-Net Architecture

This paper proposes a separation model adopting a gated nested U-Net (GNU-Net) architecture, which is essentially a deeply supervised symmetric encoder–decoder network that can generate full-resolution feature maps. Through a series of nested skip pathways, it can reduce the semantic gap between the feature maps of the encoder and decoder subnetworks. In the GNU-Net architecture, gated linear units (GLUs) replace conventional convolutional layers only in the backbone, not in the nested part. The outputs of GNU-Net are further fed into a time-frequency (T-F) mask layer to generate two masks, one for singing voice and one for accompaniment. Then, those two estimated masks, along with the magnitude and phase spectra of the mixture, can be transformed into time-domain signals. We explored two types of T-F mask layer, a discriminative training network and a difference mask layer. The experimental results show the latter to be better. We evaluated our proposed model by comparing it with three models, and also with ideal T-F masks. The results demonstrate that our proposed model outperforms the compared models, and its performance comes near to that of the ideal ratio mask (IRM). More importantly, our proposed model can output separated singing voice and accompaniment simultaneously, while the three compared models can only separate one source per trained model.


Introduction
Singing voice separation attempts to isolate the singing voice (also called the vocal line) from a song. In recent years, this problem has attracted increasing attention with the demand for singer identification [1][2][3], automatic lyrics recognition [4,5] and alignment [6], singing pitch estimation [7], singing style visualization [8], and so on. Meanwhile, isolating pure accompaniment from a song also has important applications such as leading instrument detection [9] and drum source separation [10]. Although these tasks seem effortless to humans, they turn out to be very difficult for machines, especially when the singing voice is accompanied by musical instruments. However, such a requirement can be satisfied if successful separations of singing voice and accompaniment are used as preprocessing.
A popular song often has two major acoustic components, the singing voice and the background accompaniment. Due to the harmony of a popular song, the singing voice and accompaniment are strongly correlated in both time and frequency [11]; thus, separating the singing voice from a single-channel song is a challenging task. Several approaches have been proposed for singing voice separation. Po-Sen Huang et al. [12] proposed using robust principal component analysis for singing voice separation from music accompaniment. Hu and Liu proposed a system based on non-negative matrix factorization (NMF) to separate singing voice from monaural music for singer identification [2]. It indeed helps to improve the performance of singer identification. However, the performance of singing voice separation still needs to be improved, especially when the energy of the accompaniment in a recording is larger than that of the singing voice.
With the development of deep learning, most recent methods based on deep learning show better performance [11,13]. Po-Sen Huang et al. explored using deep recurrent neural networks (RNNs) for singing voice separation from monaural recordings [14]. Moreover, they proposed the joint optimization of mask functions and deep RNNs, exploring a discriminative training criterion for neural networks to further enhance the separation performance [15]. Fan et al. proposed a monaural singing voice separation model using a generative adversarial network (GAN) with a T-F masking function [16]. Generator G takes mixture spectra as input and generates realistic singing voice and accompaniment spectra, while discriminator D distinguishes the clean spectra from the generated spectra, which can be transformed into time-domain signals using the inverse short-time Fourier transform (ISTFT) with phase information. He et al. [17] also used the adversarial mechanism to improve the separation effect of monaural singing voice separation networks. The GAN's discriminator was introduced to measure the correlation between the latent variables of the vocals and music generated by the variational autoencoder probability encoder. Stoller et al. proposed a semisupervised approach, also using a GAN on multitrack data for singing voice extraction [18].
The above supervised source separation approaches are all conducted in the time-frequency (T-F) domain [13][14][15][16][17][18][19]. These approaches reconstruct the target source signal in the time domain from the frequency domain by applying the ISTFT with the phase of the mixture. This paper also works in the T-F domain.
Gating mechanisms were first proposed by Dauphin et al. for language modeling [20] in 2017. Since then, the gating mechanism, also termed gated linear units (GLUs), has been broadly applied in the speech processing field. Tan and Wang [21] extended the convolutional recurrent network and incorporated GLUs for complex spectral mapping, which aims to estimate the real and imaginary spectrograms of clean speech from noisy speech for monaural speech enhancement. Convolutional neural network (CNN) models additionally incorporating gating mechanisms were proposed for speech enhancement [22], speech separation [23], and audio classification [24].
Various methods based on the U-Net architecture have sprung up in various fields since the U-Net model was first proposed for biological cell segmentation by Ronneberger et al. [25]. Jansson et al. adopted the U-Net architecture for the task of singing voice separation [26]. Stoller et al. investigated end-to-end audio source separation and introduced further architectural improvements on the U-Net architecture [27]. They proposed Wave-U-Net, an adaptation of U-Net to the one-dimensional time domain. In addition, for image segmentation, Zhou et al. [28] made further improvements on the model structure and proposed a nested U-Net architecture, which was used for medical image segmentation and achieved better results.
Motivated by the success of the U-Net-based architectures and the gating mechanism mentioned above, we further develop the nested U-Net (NU-Net) architecture by applying gated linear units on the backbone, not including the nested part, to replace the conventional convolution layers. We term the separation model based on NU-Net with gated linear units the gated nested U-Net (GNU-Net). The outputs of GNU-Net are further fed into a mask layer to generate two masks, one for singing voice and one for accompaniment. Thus, our proposed model can output singing voice and pure accompaniment simultaneously. We also explored two mask layers, the discriminative training network (DTN) and the difference mask layer (DML). The experimental results show that, on the whole, the latter is better.
The rest of this paper is organized as follows. We introduce the proposed separation model in Section 2. Section 3 presents the experimental setting. Section 4 presents the results for monaural singing voice and accompaniment separation. We then make our conclusions in Section 5.

Gated Nested U-Net Separation Model
We use a fully convolutional neural network that is composed of a series of convolutional and deconvolutional layers. We first describe the proposed GNU-Net separation model and then detail the gated nested U-Net architecture and two kinds of mask layers.

Proposed GNU-Net Separation Model
Figure 1 shows the framework of the proposed singing voice and accompaniment separation model. The mixed time-domain signals are converted into magnitude and phase spectra using the short-time Fourier transform (STFT). The magnitude spectra are fed into a gated nested U-Net, and the outputs are then fed into a mask layer to produce two masks, one for singing voice and one for accompaniment. These two estimated T-F masks are each applied to the magnitude spectrum of the mixture to get two predicted spectra, which can be transformed into time-domain signals using the inverse short-time Fourier transform (ISTFT) with the phase information of the mixture. Note that the dashed-line arrow over the mask layer with the dashed-line box denotes a data flow (mixture magnitude spectra) that exists only in the training phase, while the solid-line arrow over the multiplication symbol denotes a data flow (mixture magnitude and phase spectra) that exists only in the validation and test phases.
As can be seen in Figure 1, the nested U-Net architecture in the dashed-line box takes as input the magnitude spectrum of the mixture and outputs a 2-dimensional (2-D) feature map (shown as a green block) through a series of convolution and deconvolution layers. The convolution layers and deconvolution layers accomplish the tasks of the encoder and decoder, respectively. The encoder and decoder subnetworks are connected through the redesigned skip pathways (shown as dotted arrows). The dotted box denotes the concatenation operation. The concatenated feature maps are taken as input to the deconvolution operation, which outputs upsampled feature maps. Skip-connections have been shown to help recover the full resolution at the network output, where the downsampling operation is performed in the encoder subnetwork and the upsampling operation in the decoder subnetwork. We refer to the number of layers of the encoder subnetwork as the number of levels of the nested U-Net; for example, the nested U-Net in Figure 1 is a 3-level nested U-Net, since there are a total of three downsampling operations in the encoder subnetwork.
The outputs of the nested U-Net are fed into a mask layer to generate two masks, one for singing voice and one for accompaniment. Then, each of the two masks is multiplied element-wise with the mixture magnitude spectrum to obtain the two estimated source spectra. Through the ISTFT operation with the phase of the mixture, we can obtain the estimated singing voice and accompaniment time-domain waveforms.
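The masking step described above can be sketched for a single spectral frame as follows. This is a minimal illustration, not the authors' implementation: the function name `apply_masks` and the toy values are hypothetical, the accompaniment mask is assumed to be the complement of the singing voice mask (as enforced by the mask layers of Section 2.4), and the ISTFT itself is left out.

```python
import cmath

def apply_masks(mix_mag, mix_phase, mask_voice):
    """Apply complementary T-F masks to one frame of a mixture spectrum.

    mix_mag, mix_phase: magnitude and phase of the mixture per frequency bin.
    mask_voice: singing voice mask in [0, 1]; the accompaniment mask is
    assumed (for this sketch) to be 1 - mask_voice.
    """
    voice, accomp = [], []
    for mag, phase, m in zip(mix_mag, mix_phase, mask_voice):
        # Element-wise product of each mask with the mixture magnitude.
        voice_mag = m * mag
        accomp_mag = (1.0 - m) * mag
        # Both estimates reuse the phase of the mixture.
        voice.append(cmath.rect(voice_mag, phase))
        accomp.append(cmath.rect(accomp_mag, phase))
    return voice, accomp

# Toy frame with three frequency bins (hypothetical values).
mix_mag = [1.0, 0.5, 2.0]
mix_phase = [0.0, 1.0, -2.0]
mask_voice = [0.9, 0.2, 0.5]
voice, accomp = apply_masks(mix_mag, mix_phase, mask_voice)
mixture = [cmath.rect(a, p) for a, p in zip(mix_mag, mix_phase)]
```

Because the two masks sum to one and share the mixture phase, the two complex estimates add up to the mixture spectrum in every bin; the time-domain waveforms would then follow from an ISTFT of each estimate.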

Gated Nested U-Net Architecture
The nested U-Net architecture in Figure 1 is illustrated in detail in Figure 2, which shows a 6-level nested U-Net and clearly exhibits the details of the operations and skip-connections. The triangle-like pink shaded area denotes the nested encoder-decoder, which distinguishes the nested U-Net from U-Net. In the U-Net architecture for singing voice separation [26], the feature maps of the last convolution layer undergo the deconvolution operation the same number of times as convolution. Before each deconvolution operation, the output of the previous deconvolution layer is concatenated with the output of the convolution layer at the same level. This paper adopts the same concatenation operation.
$\{X^{i,0}\}$ are the outputs of convolution layers, and $\{X^{i,j},\ j > 0\}$ are stacks of outputs of convolution and deconvolution layers. They are computed as in Equation (4), where the function $H(\cdot)$ is a convolution operation followed by a leaky rectified linear unit (ReLU) activation function and then a batch normalization process, and $U(\cdot)$ denotes a deconvolution operation followed by ReLU activation and a batch normalization process. $[\cdot]$ denotes the concatenation operation, which is denoted by the dotted box in Figures 1 and 2. Specifically, $X^{3,3}$ and $X^{3,2}$ would undergo a deconvolution process to output a fraction of $X^{2,4}$ and $X^{2,3}$, respectively. Due to the symmetry of encoding and decoding, $X^{i,0}$ and $\{U(X^{i+1,j}),\ i + j \le 5\}$ have the same size.
In conclusion, in Equation (4), the upper half formulates the encoding process and the outputs of the convolution layers, while the lower half formulates the decoding process and the outputs of the concatenation operation. Note that the skip-connection in the nested U-Net is designed to concatenate two boxes rather than to sum them directly, as for image segmentation in U-Net [25].
Owing to the nested skip pathways, a nested U-Net could generate full-resolution feature maps at multiple semantic levels, $\{X^{0,j},\ j \in \{1, 2, 3, 4, 5\}\}$ (this part does not exist in Figure 2). For medical image segmentation, Zhou et al. [28] added a combination of binary cross-entropy and dice coefficient as the loss function to each of the full-resolution feature maps. According to the results of our experiments, however, $X^{0,6}$ contains abundant information that is sufficient for the subsequent mask estimation of each source. So, our proposed GNU-Net does not include the deconvolution layers used to generate the full-resolution feature maps $\{X^{0,j},\ j \in \{1, 2, 3, 4, 5\}\}$.

Gated Linear Unit
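The gating mechanism controls the information flow throughout the network, which potentially allows for modeling more sophisticated interactions [21]. The gating mechanism was first proposed for recurrent neural networks (RNNs) [29] and further developed for CNNs [20]. Oord et al. [30] have shown the effectiveness of LSTM-style gating, dubbed the gated tanh unit (GTU):
$$y = \tanh(x * W_1 + b_1) \odot \sigma(x * W_2 + b_2) = \tanh(v_1) \odot \sigma(v_2),$$
where the $W$'s and $b$'s denote kernels and biases, respectively, $v_1 = x * W_1 + b_1$ and $v_2 = x * W_2 + b_2$, $\sigma$ represents the sigmoid function, and $\odot$ means the element-wise (Hadamard) product. The gradient of GTUs is
$$\nabla[\tanh(v_1) \odot \sigma(v_2)] = \tanh'(v_1)\,\nabla v_1 \odot \sigma(v_2) + \sigma'(v_2)\,\nabla v_2 \odot \tanh(v_1).$$
The gradient gradually vanishes as the network depth increases because of the downscaling factors $\tanh'(v_1)$ and $\sigma'(v_2)$. To tackle this problem, Dauphin et al. [20] introduced the gated linear unit (GLU):
$$y = (x * W_1 + b_1) \odot \sigma(x * W_2 + b_2) = v_1 \odot \sigma(v_2).$$
The gradient of the GLUs,
$$\nabla[v_1 \odot \sigma(v_2)] = \nabla v_1 \odot \sigma(v_2) + \sigma'(v_2)\,\nabla v_2 \odot v_1,$$
has a path $\nabla v_1 \odot \sigma(v_2)$ without downscaling for the activated gating units in $\sigma(v_2)$. This can be regarded as a multiplicative skip-connection which helps gradients flow through the layers.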
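The difference between the two gates can be checked numerically with a scalar sketch. The function names below are hypothetical, and the scalars `v1` and `v2` stand in for the two pre-activations (the linear branch and the gate branch); a real layer would compute them with convolutions.

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def gtu(v1, v2):
    """Gated tanh unit: tanh(v1) * sigmoid(v2)."""
    return math.tanh(v1) * sigmoid(v2)

def glu(v1, v2):
    """Gated linear unit: v1 * sigmoid(v2)."""
    return v1 * sigmoid(v2)

def gtu_grad_v1(v1, v2):
    """Derivative of the GTU w.r.t. v1: tanh'(v1) * sigmoid(v2)."""
    return (1.0 - math.tanh(v1) ** 2) * sigmoid(v2)

def glu_grad_v1(v1, v2):
    """Derivative of the GLU w.r.t. v1: sigmoid(v2), with no tanh' factor."""
    return sigmoid(v2)

# For a large pre-activation the GTU gradient is strongly downscaled,
# while the GLU gradient depends only on how open the gate is.
v1, v2 = 3.0, 2.0
gtu_slope = gtu_grad_v1(v1, v2)
glu_slope = glu_grad_v1(v1, v2)
```

Since $\tanh'(v_1) \le 1$, the GTU slope along the linear branch can never exceed the GLU slope for the same gate value, which is the gradient-flow argument made above.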
A convolutional GLU block (denoted as "ConvGLU") is illustrated in Figure 3a. A deconvolutional GLU block (denoted as "DeconvGLU") is analogous, except that the convolutional layers are replaced by deconvolutional layers, as shown in Figure 3b. In our proposed GNU-Net model, GLUs are applied only to the backbone of GNU-Net (the two dashed-line columns shown in Figure 2), not to the nested subnetworks. We use the convolution GLU block (black arrows in Figure 2) and the deconvolution GLU block (red arrows in Figure 2) instead of the convolution layer and deconvolution layer in the backbone part. Figure 2 clearly exhibits the details of the concatenation operation, skip-connections, and GLU blocks. The triangle-like shaded area again denotes the nested part. So, we rewrite Equation (4) by replacing the convolution and deconvolution operations on the backbone with $H_{GLU}(\cdot)$ and $U_{GLU}(\cdot)$, the convolution GLU block and deconvolution GLU block, respectively. Both are followed by a leaky ReLU activation function and then a batch normalization process. $U(\cdot)$ denotes the conventional deconvolution operation, also followed by ReLU activation and a batch normalization process, and is retained in the nested part. Taking the same examples as in Section 2.2, $X^{3,3}$ lies on the backbone and is upsampled by $U_{GLU}(\cdot)$ to form part of $X^{2,4}$, while $X^{3,2}$ lies in the nested part and is upsampled by $U(\cdot)$ to form part of $X^{2,3}$.
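To make the split between backbone and nested part concrete, the sketch below labels every node $X^{i,j}$ of an L-level network. It is only one reading of the text, not the authors' code: the function name `classify_nodes` and the role labels are hypothetical, and the index convention (encoder at j = 0, backbone decoder at i + j = L, top-row nodes with 0 < j < L omitted) is an assumption drawn from the descriptions in Sections 2.2 and 2.3.

```python
def classify_nodes(levels):
    """Label each node (i, j) of an L-level nested U-Net.

    Assumed convention (a reading of the text, not a verified spec):
      j == 0            -> input (i == 0) or encoder node
      i + j == levels   -> backbone decoder node (GLU blocks)
      i == 0, 0 < j < L -> full-resolution node omitted in GNU-Net
      otherwise         -> nested node (conventional deconvolution)
    """
    roles = {}
    for i in range(levels + 1):
        for j in range(levels + 1 - i):
            if j == 0:
                roles[(i, j)] = "input" if i == 0 else "encoder"
            elif i + j == levels:
                roles[(i, j)] = "backbone decoder"
            elif i == 0:
                roles[(i, j)] = "omitted"
            else:
                roles[(i, j)] = "nested"
    return roles

def uses_glu(role):
    """GLU blocks are applied on the backbone only."""
    return role in ("encoder", "backbone decoder")

roles = classify_nodes(6)
nested_nodes = sorted(node for node, role in roles.items() if role == "nested")
```

Under this convention, a 6-level network has ten nested nodes, and the two example nodes fall on different sides of the split: $X^{3,3}$ is on the backbone, whereas $X^{3,2}$ belongs to the nested part.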

Mask Layer
Jansson et al. [26] chose to train two distinct separation models for the two sources using the U-Net model. Our goal is to separate singing voice and accompaniment from a mixture simultaneously; so, instead of learning one of the sources as the target, we propose to model all the sources simultaneously. The output of GNU-Net, $X^{0,6}$, is fed into a mask layer to generate two masks, one for singing voice and one for accompaniment. In this paper, we explore two kinds of mask layer, the discriminative training network and the difference mask layer.

A. Discriminative Training Network (DTN)
The discriminative training network was proposed by Po-Sen Huang et al. [15] to jointly train the network with a T-F mask function. In our proposed separation model, the output of GNU-Net, $X^{0,6}$, is fed into two linear layers, each followed by a ReLU activation operation. These two linear layers output magnitude predictions of the two sources, $\hat{y}_{1t}$ and $\hat{y}_{2t}$, as shown in Figure 4a. Here, we also add an extra layer to the output of the linear layers as
$$\tilde{y}_{1t} = \frac{|\hat{y}_{1t}|}{|\hat{y}_{1t}| + |\hat{y}_{2t}|} \odot z_t, \qquad \tilde{y}_{2t} = \frac{|\hat{y}_{2t}|}{|\hat{y}_{1t}| + |\hat{y}_{2t}|} \odot z_t, \tag{10}$$
where the addition, division, and $\odot$ (Hadamard product) operators are element-wise operations. $z_t$ denotes the magnitude spectra of the mixture signals. $\tilde{y}_{1t}$ and $\tilde{y}_{2t}$ are the two estimated magnitudes of sources $y_{1t}$ and $y_{2t}$ through a soft mask function, $t = 1, 2, \ldots, T$, where $T$ is the frame length of an input sequence. Equation (10) enforces the constraint that the sum of the prediction results is equal to the original mixture. This implies a soft T-F mask function
$$m_{1t} = \frac{|\hat{y}_{1t}|}{|\hat{y}_{1t}| + |\hat{y}_{2t}|}, \qquad m_{2t} = \frac{|\hat{y}_{2t}|}{|\hat{y}_{1t}| + |\hat{y}_{2t}|}. \tag{11}$$
Here, the two predictions $\hat{y}_{1t}$ and $\hat{y}_{2t}$ are non-negative because of the ReLU activation function. Equation (11) implies that $m_{1t} + m_{2t} = 1$. In this way, we integrate the constraints into the network and optimize the network jointly with the masking functions. Although this extra layer is a deterministic layer, the network weights are optimized for the error metric of Equation (11). Thus, the discriminative training network can also be considered to output two masks, one for singing voice and one for accompaniment, as shown in Figure 4a.
To reduce the interference from other sources, we adopt the discriminative network training criterion in a simple and useful form [14,15]:
$$J = \|y_{1t} - \tilde{y}_{1t}\|^2 + \|y_{2t} - \tilde{y}_{2t}\|^2 - \gamma\|y_{1t} - \tilde{y}_{2t}\|^2 - \gamma\|y_{2t} - \tilde{y}_{1t}\|^2. \tag{12}$$
The first half of Equation (12) is the general mean squared error (MSE), which directly optimizes the reconstruction objective; adding the extra term $-\gamma\|y_{1t} - \tilde{y}_{2t}\|^2 - \gamma\|y_{2t} - \tilde{y}_{1t}\|^2$ further penalizes the interference from the other source. In our experiments, this criterion generally yielded higher source-to-interference ratio (SIR) and source-to-distortion ratio (SDR) but slightly lower source-to-artifacts ratio (SAR). We think that an appropriate value of $\gamma$ would further improve the performance.
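The soft-mask layer described above can be sketched per frame as follows. The function name `soft_mask_layer` and the toy values are hypothetical, and the even split used when both predictions are zero is an assumption added here to avoid a 0/0 division; the source does not say how that case is handled.

```python
def soft_mask_layer(y1_hat, y2_hat, z):
    """Turn two magnitude predictions into soft-masked estimates.

    y1_hat, y2_hat: raw magnitude predictions of the two sources.
    z: magnitude spectrum of the mixture (same length).
    Returns (y1_tilde, y2_tilde), which sum to z bin by bin.
    """
    y1_tilde, y2_tilde = [], []
    for a, b, mix in zip(y1_hat, y2_hat, z):
        total = abs(a) + abs(b)
        if total > 0.0:
            m1 = abs(a) / total
            m2 = abs(b) / total
        else:
            # Assumption for this sketch: split evenly in the 0/0 case.
            m1 = m2 = 0.5
        y1_tilde.append(m1 * mix)
        y2_tilde.append(m2 * mix)
    return y1_tilde, y2_tilde

# Toy frame with three bins (hypothetical values).
y1_tilde, y2_tilde = soft_mask_layer([3.0, 0.0, 1.0], [1.0, 0.0, 1.0], [2.0, 4.0, 6.0])
```

Whatever the raw predictions are, the two masked estimates add up to the mixture magnitude in every bin, which is the constraint the extra layer is meant to enforce.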
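A minimal sketch of this criterion for one frame is given below. The function name `discriminative_loss` is hypothetical, the default `gamma=0.05` follows the value stated in the experiment configuration, and the absence of any averaging or scaling factor is an assumption of the sketch.

```python
def discriminative_loss(y1, y2, y1_tilde, y2_tilde, gamma=0.05):
    """Squared-error reconstruction term minus a weighted interference term.

    y1, y2: reference magnitudes of the two sources.
    y1_tilde, y2_tilde: estimated magnitudes of the two sources.
    """
    def sq_err(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((p - q) ** 2 for p, q in zip(a, b))

    # Each estimate should be close to its own reference...
    reconstruction = sq_err(y1, y1_tilde) + sq_err(y2, y2_tilde)
    # ...and far from the reference of the other source.
    interference = sq_err(y1, y2_tilde) + sq_err(y2, y1_tilde)
    return reconstruction - gamma * interference

# Two orthogonal toy sources (hypothetical values).
y1, y2 = [1.0, 0.0], [0.0, 1.0]
loss_perfect = discriminative_loss(y1, y2, y1, y2)
loss_swapped = discriminative_loss(y1, y2, y2, y1)
```

With `gamma=0` the function reduces to the plain squared-error objective; a positive `gamma` lowers the loss further when the estimates stay away from the wrong source, so swapped estimates are penalized more than in the plain case.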

B. Difference Mask Layer (DML)
To speed up learning and improve performance, the difference output layer was proposed by Stoller et al. [27]. Similarly, we adopt a difference mask layer (DML) to constrain the mask $M_{jt}$ for source $j$ at time $t$. If a mixture includes $K$ sources, then, to enforce $\sum_{j=1}^{K} M_{jt} = 1$, only $K - 1$ convolutional filters with a size of 1 are applied to the last feature map of the network, followed by a sigmoid nonlinearity to estimate the masks of the first $K - 1$ source signals. The last mask is then simply computed as
$$M_{Kt} = 1 - \sum_{j=1}^{K-1} M_{jt}. \tag{13}$$
In our singing voice and accompaniment separation task, there are just two sources, so $K = 2$. As shown in Figure 4b, the output $X^{0,6}$ and the mixture spectrum input $X^{0,0}$ are concatenated, forming a feature map with dimensions 2 × 512 × 128. Through a convolutional layer with a filter size of 2 × 1 × 1 followed by a sigmoid activation operation, the difference mask layer outputs the mask of source 1 with dimensions 1 × 512 × 128. $M_{2t}$, computed by Equation (13), is obtained simultaneously as the mask of source 2.
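The difference mask computation can be sketched as below. The function name `difference_masks` and the input values are hypothetical; the pre-sigmoid values stand in for the outputs of the 1 × 1 convolution, which is not modeled here.

```python
import math

def difference_masks(logits):
    """Compute K masks from K-1 rows of pre-sigmoid values.

    logits: list of K-1 rows, one per explicitly estimated source,
            each row holding one value per T-F bin.
    Returns K rows of masks; the last row is 1 minus the sum of the others.
    """
    # Sigmoid gives the first K-1 masks.
    masks = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in logits]
    # The remaining mask is the difference to 1 in every bin.
    last = [1.0 - sum(column) for column in zip(*masks)]
    return masks + [last]

# Two-source case (K = 2): one estimated mask, one derived mask.
voice_mask, accomp_mask = difference_masks([[0.0, 2.0, -3.0]])
```

For K = 2 the derived mask always stays in [0, 1] because it is the complement of a single sigmoid output. For larger K the sketch does not clamp the last mask, so it could become negative if the estimated masks sum to more than one.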

Dataset and Preprocessing
The iKala dataset has been used as a standardized evaluation for the annual Music Information Retrieval Evaluation eXchange (MIREX) campaign for several years, so there are many existing results that can be used for comparison. The iKala dataset [31] includes 352 30-second song clips with a sample rate of 44,100 Hz. These clips are recorded from Chinese popular songs performed by professional singers. Only 252 song clips are released as a public subset for evaluation. Each song clip is a stereo recording, with one channel for the singing voice and the other for the accompaniment. We first downsample the input audio to the same sampling frequency of 8192 Hz as in the U-Net model [26], then extract the magnitude spectrum using a 1024-point STFT with 75% overlap. All sample clips are cut into segments of roughly 11 s so that the number of time frames of each patch can be set to 128 (a power of 2). The magnitude spectrograms are normalized by $x \to \log(1 + x)$. (See Supplementary Materials.)
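The magnitude normalization is a simple element-wise compression. The sketch below shows it together with its inverse; the function names are hypothetical, and whether the authors invert the compression at any point is not stated in the text, so the inverse is included only to show that the mapping is lossless.

```python
import math

def compress(magnitudes):
    """Apply x -> log(1 + x) to a list of non-negative magnitudes."""
    return [math.log1p(x) for x in magnitudes]

def expand(compressed):
    """Inverse mapping x -> exp(x) - 1 (assumed here for illustration)."""
    return [math.expm1(x) for x in compressed]

frame = [0.0, 0.5, 10.0, 250.0]
normalized = compress(frame)
restored = expand(normalized)
```

The mapping leaves zero at zero and compresses large magnitudes strongly, which narrows the dynamic range seen by the network.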

Evaluation Metrics
To measure the quality of the estimated time-domain signal $\hat{v}$ with respect to the original signal $v$, we use the source-to-interference ratio (SIR), source-to-artifacts ratio (SAR), and source-to-distortion ratio (SDR) [32] provided in the commonly used BSS EVAL toolbox. The source-to-distortion ratio (SDR) is computed as follows:
$$\mathrm{SDR}(\hat{v}, v) = 10 \log_{10} \frac{\langle \hat{v}, v \rangle^2}{\|\hat{v}\|^2 \|v\|^2 - \langle \hat{v}, v \rangle^2}.$$
Normalized SDR (NSDR) is the improvement of SDR from the original mixture $x$ to the separated singing voice $\hat{v}$, and is commonly used to measure the separation performance for each mixture [12,26]:
$$\mathrm{NSDR}(\hat{v}, v, x) = \mathrm{SDR}(\hat{v}, v) - \mathrm{SDR}(x, v),$$
where $\hat{v}$ is the estimated source signal, $v$ is the reference source signal, and $x$ is the mixed signal.
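As a rough illustration of how NSDR relates to SDR, the sketch below uses a single-reference, projection-based form of SDR. That form is an assumption of this sketch: the BSS EVAL toolbox performs a more general decomposition into target, interference, and artifact components, so its numbers can differ. The function names and toy signals are hypothetical.

```python
import math

def sdr(estimate, reference):
    """Single-reference SDR in dB (assumed projection-based form).

    Undefined (division by zero) when the estimate is an exact
    scaled copy of the reference.
    """
    dot = sum(e * r for e, r in zip(estimate, reference))
    est_energy = sum(e * e for e in estimate)
    ref_energy = sum(r * r for r in reference)
    return 10.0 * math.log10(dot ** 2 / (est_energy * ref_energy - dot ** 2))

def nsdr(estimate, reference, mixture):
    """SDR improvement of the estimate over the unprocessed mixture."""
    return sdr(estimate, reference) - sdr(mixture, reference)

# Toy signals: the accompaniment is orthogonal to the voice (hypothetical).
voice = [1.0, 0.0, -1.0, 0.0]
accompaniment = [0.0, 1.0, 0.0, -1.0]
mixture = [v + a for v, a in zip(voice, accompaniment)]
estimate = [1.0, 0.1, -1.0, -0.1]  # voice with a small residual
improvement = nsdr(estimate, voice, mixture)
```

In this toy case the mixture has equal voice and accompaniment energy, so its SDR with respect to the voice is 0 dB, and the NSDR equals the SDR of the estimate.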

Experiment Configurations
The networks are trained on 11-second-long segments. Mean squared error (MSE) is used as the loss function. ADAM [33] is used as the optimizer. The learning rate is set to $10^{-5}$ with decay rate $\beta_1 = 0.9$. The batch size is 4. A stride of 2 is used in the convolutional encoder. $\gamma$ of the discriminative training network in Equation (12) is set to 0.05.
The detailed description of GNU-Net is shown in Table 1. The column Shape represents the dimension of the outputs (cubes in Figure 2). The column Operation represents the different neural network operations. $F_c$ equals 16. ConvGLU-2D (A), Deconv-2D (A), and DeconvGLU-2D (A) denote the operations, where A is the number of output channels of each operation. The filter size is 5 × 5. Concat(A, B) denotes the concatenation operation of A and B. i in the row Encoder block refers to the number of downsampling processes, j = 0, 1 ≤ i ≤ L. Note that the Decoder blocks are applied in reverse order, so that j is from level L to 1, j = 0, 2 ≤ i + j < L. In nested part, 1 ≤ i, j ≤ L, i + j = L in backbone part. As shown in Figure 2, L = 6.
Table 1. Schematic diagram of the proposed GNU-Net architecture.

Block Operation Shape
Input $X^{0,0}$ = (1, 512, 128)

ConvGLU-2D (A) denotes the convolutional GLU operation with a stride of 2, followed by leaky rectified linear unit (ReLU) activation with a leakiness of 0.2. Deconv-2D (A) and DeconvGLU-2D (A) denote the conventional deconvolutional and deconvolutional GLU operations, respectively. The deconvolutional operations are both followed by a batch normalization operation and leaky ReLU activation with a leakiness of 0.2. Note that in the decoder block, before the deconvolutional operation, we concatenate the output of the last deconvolutional operation and the previous output at the same level.
For the difference mask layer in Section 2.4, the output of the last DeconvGLU operation, $X^{0,6}$, and the input mixture spectrum, $X^{0,0}$, are concatenated and further fed into a 2-D conventional convolution layer with a stride of 1 and a kernel size of 2 × 1 × 1, followed by a sigmoid activation, as shown in Figure 4b.

Comparison with Ideal Time-Frequency Masks
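Following the common configurations in [34,35], the ideal time-frequency masks were calculated using the STFT with a 32-ms window size and 8-ms hop size with a Hanning window. The ideal masks include the ideal binary mask (IBM), ideal ratio mask (IRM), and Wiener filter-like mask (WFM), which are defined for source $i$ as
$$\mathrm{IBM}_i(f, t) = \begin{cases} 1, & |S_i(f, t)| > |S_j(f, t)| \ \ \forall j \ne i \\ 0, & \text{otherwise} \end{cases}$$
$$\mathrm{IRM}_i(f, t) = \frac{|S_i(f, t)|}{\sum_{j=1}^{C} |S_j(f, t)|}$$
$$\mathrm{WFM}_i(f, t) = \frac{|S_i(f, t)|^2}{\sum_{j=1}^{C} |S_j(f, t)|^2}$$
where $S_i(f, t) \in \mathbb{C}^{F \times T}$ are the complex-valued spectrograms of the clean sources $i = 1, \ldots, C$.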
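For a single T-F bin, the three ideal masks can be computed from the clean source magnitudes as sketched below. The function name `ideal_masks` is hypothetical, and returning all-zero ratio masks for a silent bin is an assumption made here to avoid division by zero.

```python
def ideal_masks(mags):
    """Compute IBM, IRM, and WFM at one T-F bin.

    mags: magnitudes |S_i(f, t)| of the C clean sources at this bin.
    Returns three lists (ibm, irm, wfm), each with one value per source.
    """
    count = len(mags)
    # IBM: 1 only for a source strictly louder than every other source.
    ibm = [
        1.0 if all(mags[i] > mags[j] for j in range(count) if j != i) else 0.0
        for i in range(count)
    ]
    total = sum(mags)
    power = sum(m * m for m in mags)
    # Assumption for this sketch: a silent bin gets all-zero ratio masks.
    irm = [m / total if total > 0.0 else 0.0 for m in mags]
    wfm = [m * m / power if power > 0.0 else 0.0 for m in mags]
    return ibm, irm, wfm

# Singing voice three times louder than the accompaniment in this bin.
ibm, irm, wfm = ideal_masks([3.0, 1.0])
```

The example shows how the masks differ in sharpness: the binary mask assigns the whole bin to the louder source, the ratio mask splits it by magnitude, and the Wiener filter-like mask splits it by power and is therefore closer to the binary decision.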

Results
Firstly, the proposed GNU-Net model and the two kinds of mask layer were verified by their separation performance, and the effect of the nested U-Net was assessed by comparison with U-Net [26]. Then, various network levels were compared in terms of model parameters and system performance to select a proper network level. Finally, the performance of the GNU-Net separation model was compared with three models and ideal T-F masks on the iKala dataset.

Optimizing the Network Model
The performance of the GNU-Net separation model was evaluated on the iKala dataset. Table 2 shows the performance scores of various models with the 6-level nested U-Net and the 6-level U-Net [26]. In the first row, the results for singing voice and accompaniment are based on two U-Net separation models, as the U-Net [26] model can output only one source signal, while our proposed model can output the estimated singing voice and accompaniment simultaneously. NU-Net denotes the nested U-Net without GLUs. The contents of Table 2 are shown in another form in Figure 5, which helps to distinguish the various models intuitively by the means and variances of the evaluation metrics. From Table 2 and Figure 5, we can draw the following conclusions: (iv) On the whole, the NSDR scores of the accompaniment are higher than those of the singing voice. This may be because, in the most general case, the intensity of the accompaniment is greater than that of the singing voice, and the accompaniment has more continuous components over time.
Figure 6 shows a comparison of the magnitude spectra of the estimated sources and the original sources. From the magnitude spectra of the estimated singing voice, we can clearly see that our proposed models outperform the U-Net model. Some experiments were performed to select the depth of the network. Table 3 shows the model size and system performance of the U-Net [26] architecture and our proposed methods (NU-Net and GNU-Net) with the difference mask layer (DML) as the mask layer. The numbers of parameters of the different methods are based on our implementations. The results of U-Net [26] obtained with our implementation are basically the same as the reported results. The GNU-Net model has the largest model size compared with U-Net and NU-Net at the same network level and has the best separation performance on NSDR, SIR, and SAR. As a compromise between system performance and complexity, the 6-level network was selected for the GNU-Net separation model.

Comparison of Proposed Method with Previous Methods
Finally, the proposed models were also compared with the RPCA [12] and Chimera [36] models, which produced the highest evaluation scores in the 2016 MIREX Source Separation campaign. Table 4 shows the means of the evaluation metrics on the iKala dataset. The first row shows the RPCA results as reported in the original paper. The second row shows the results reported in Reference [26], which were obtained from the Chimera web server running the improved Chimera network [36]. NU-Net+DML and GNU-Net+DML denote our proposed methods, which separate singing voice and accompaniment simultaneously, while Chimera and U-Net separate singing voice and accompaniment using two distinct trained separation models. We can see from Table 4 that the separation performance of our proposed GNU-Net with the DML mask layer approaches the results of IBM, especially the NSDR of the separated singing voice. Our proposed separation model even surpasses IBM in the SAR metric for both singing voice and accompaniment.

Conclusions
We propose a separation model based on the GNU-Net architecture. The outputs of GNU-Net are further fed into a T-F mask layer to generate two masks, one for singing voice and one for accompaniment. Then,

Figure 2. Illustration of the 6-level nested U-Net architecture with gated linear units (GLUs) applied only on the backbone. Dashed-line columns denote the backbone of the gated nested U-Net (GNU-Net), and the light-pink triangle denotes the nested part. Cubes denote the output of each layer or concatenation operation, except for $X^{0,0}$, which denotes the input.

Figure 3. Diagrams of a convolutional GLU block and a deconvolutional GLU block, where σ denotes a sigmoid function.

Figure 4. Two kinds of mask layer.

Figure 5. Three evaluation metrics of estimated singing voice and accompaniment by various network models.
(i) The nested U-Net architecture outperforms the U-Net architecture; this result verifies that the nested decoder subnetworks can remedy the information loss caused by the previous downsampling operations. (ii) Introducing gating mechanisms can noticeably improve system performance. (iii) As a mask layer, the difference mask layer (DML) is superior to the discriminative training network (DTN).

Figure 6. (a) The mixture magnitude spectrogram of a clip in the iKala dataset; (b,c) the ground truth spectra of clean singing voice and pure accompaniment; (d-f) the magnitude spectra of the singing voice estimated by the U-Net model and our two proposed models; (g-i) the magnitude spectra of the accompaniment estimated by the U-Net model and our two proposed models (model1, NU-Net+DML; model2, GNU-Net+DML). Accom denotes accompaniment.

Table 2. Comparison between various network models and mask layers on the iKala dataset.

Table 3. Comparison of model size and evaluation results.

Table 4. Comparison of proposed methods (NU-Net+DML and GNU-Net+DML) and previous methods using the iKala dataset.