Underdetermined Blind Source Separation Combining Tensor Decomposition and Nonnegative Matrix Factorization

: Underdetermined blind source separation (UBSS) is a hot topic in signal processing, which aims at recovering the source signals from a number of observed mixtures without knowing the mixing system. Recently, expectation-maximization algorithm shows a great potential in the UBSS. However, the ﬁnal separation results depend strongly on the parameter initialization, leading to poor separation performance. In this paper, we propose an effective algorithm that combines tensor decomposition and nonnegative matrix factorization (NMF). In the proposed algorithm, we ﬁrst employ tensor decomposition to estimate the mixing matrix, and NMF source model is used to estimate the source spectrogram factors. Then a series of iterations are derived to update the model parameters. At the same time, the spatial images of source signals are estimated with Wiener ﬁlters constructed from the learned parameters. Therefore, time-domain sources can be obtained through inverse short-time Fourier transform. Finally, plenty of experimental results demonstrate the effectiveness and advantages of our proposed algorithm over the compared algorithms.


Introduction
Blind source separation (BSS) considers the recovery of source signals from observed signals without knowing the recording environment.Recently, the use of BSS has become an active research area.If the number of source signals is less, equal or greater than the number of microphones, BSS can be classified as the overdetermined case [1], the determined case [2,3], or the underdetermined case [4,5], respectively.In particular, in the natural environment, the mixing process is generally considered to be convolutive, i.e., the channel between each source and each microphone is modeled in a linear filter that represents multiple source-to-microphone paths because it considers the reverberation of the channel.Therefore, underdetermined convolutive BSS is a challenging problem in the field of BSS.
To address this underdetermined convolutive BSS problem, tensor decomposition shows great potential, because an interesting property of higher-order tensors is that their rank decomposition is unique.Additionally, parallel factor (PARAFAC) decomposition factorizes a tensor into a sum of component rank-one tensors, and the factor matrices refer to the combination of the vectors from the rank-one components.By singular value decomposition of a series of matrices, the parallel factorization problem is transformed into a joint matrix diagonalization problem, such that the PARAFAC analysis is solved [6,7].Therefore, the PARAFAC method can be used to identify the mixing matrix in the underdetermined case, which has been proven usefully in a wide range of applications from sensor array processing to communication, speech and audio signal processing [8,9].In the phase of source separation, source signals can be estimated by using the l 0 -norm minimization method [10] or the binary masking algorithm [11].However, these methods suffer from poor separation performance.To improve separation performance, we found out that the source model includes some specific information on the spectral structures of sources.Therefore, a better source model has the potential to improve the source separation performance.
In BSS, the non-negative matrix factorization (NMF) source model is usually applied on the speech/music power spectrogram, where the spectrogram is approximated by the product of two non-negative matrices, i.e., a basis matrix and an activation matrix.The basis matrix represents the repeating spectral patterns, and the activation matrix represents the presence of these patterns over time.Additionally, NMF aims to decompose a non-negative factor matrix into the product of two low-rank non-negative factor matrices [12,13].The NMF model can be used to efficiently exploit the low-rank nature of the speech spectrogram and its dependency across the frequencies.In some NMF-based methods [14][15][16][17], non-negative matrix factor two-dimensional deconvolution is an effective machine learning method in audio source separation field.In particular, in the convolutive frequency-domain model, the well-known permutation alignment problem cannot be solved without using additional a priori knowledge about the sources or the mixing filters.However, the NMF source model implies a coupling of the frequency bands, and joint estimation of the source parameters and mixing coefficients, which frees us from the permutation problem.Furthermore, NMF is well suited to polyphony as it basically takes the source to be a sum of elementary components with characteristic spectral signatures.Therefore, NMF source model is able to improve the source separation performance.
Additionally, in order to obtain better source separation results, the estimated mixing matrix and NMF variables need to be updated using an optimization algorithm.In most BSS optimization algorithms, we found out that expectation-maximization (EM) algorithm [18], which is a popular choice for Gaussian models, provided faster convergence.The EM algorithm is related to some multichannel source separation techniques by employing Gaussian mixture model as source models.However, it is very sensitive to initialization in source separation tasks.There had been some studies of parameter initialization of NMF to optimize separation performance [19,20].Therefore, we try to take an optimization algorithm to improve the source separation performance.
In this paper, an alternative optimization algorithm is proposed to deal with the parameter initialization problem and improve separation performance.First, we employ tensor decomposition to detect the mixing matrix, and NMF is used to estimate the source spectrogram factors.Then these model parameters are updated using the EM algorithm.Meanwhile, the spatial images of source signals are estimated using Wiener filters constructed from the learned parameters.The time-domain sources can be obtained through inverse short-term Fourier transform (STFT) using an adequate overlap-add procedure with dual synthesis window.Thanks to the linearity of the inverse STFT, the reconstruction is conservation in the time-domain as well.Finally, a series of experimental results including synthetic instantaneous and convolutive music and speech source mixtures, as well as live real recordings, show that our improved algorithm outperforms the state-of-the-art baseline methods.We can highlight the main contributions of this article as follows.
(1) We propose an improved algorithm that combines tensor decomposition and advanced NMF to deal with the underdetermined linear BSS.The mixing matrix is estimated using tensor decomposition, and NMF is used to decompose the given spectrogram into several spectral bases and temporal activations.Then the mixing matrix, NMF variables, and noise components are updated by a series of iteration rules.The proposed algorithm combines the advantages of tensor decomposition and NMF, which is beneficial to improve the performance of source separation.Additionally, the improved algorithm can be extended to underdetermined convolutive BSS.
(2) We have demonstrated the superiority in the underdetermined linear and convolutive BSS cases.Additionally, in this paper we mainly consider the audio datasets, the proposed algorithm demonstrates the effectiveness and superiority compared with the state-of-the-art algorithms, which improves the source separation performance based on a series of simulation experiments.
The structure of the remaining of this paper is organized as follows.Section 2 formulates the problem of the underdetermined blind source separation.In Section 3, an optimization algorithm is presented based on tensor decomposition and NMF.Experimental results compared the source separation performance with the state-of-the-art techniques in various experimental settings are shown in Section 4. Finally, Section 5 summarizes our conclusion and the future work.

Linear Instantaneous Mixture Model
The signal model with noise used in this paper is described as follows: where n = 1, 2, ..., N denotes the index of the time window for applying the Fourier transform, f = 0, ..., F − 1 is the index of the frequency bins, are the STFT of the mixtures and the sources at time-frequency point ( f , n), respectively.v f n = [v 1, f n , ..., v J, f n ] T , the noise v j, f n is assumed to be stationary and spatially uncorrelated, i.e., v j, f n ∼ N c (0, σ 2 j, f ) and

The NMF Source Model
Let K ≥ I is known in advance, and {K i } I i=1 be a nontrivial partition of K = 1, ..., K. Following [16,17], a coefficient s i, f n is modeled as the sum of latent components c k, f n , such that where G ∈ N I×K is a binary selection matrix with entries where N c (µ, Σ) denotes the proper multivariate complex Gaussian distribution [21] with probability density function (pdf is the proper complex Gaussian distribution.),w f k , h kn ∈ R + , w f k represents the spectral basis of i-th source, and h f k represents the temporal code for each spectral basis element of the i-th source.In the rest of the paper, the quantities s i, f n and c k, f n are referred to as "source" and "component", respectively.The components are assumed to be mutually independent and individually independent across frequency and time.It follows that This corresponds to model the source power spectral densities (PSD) with the NMF model, i.e., where S i denotes the F × N STFT matrix of source i and the matrices respectively.Then for the maximum likelihood estimation of W i and H i , it is shown that the minus log-likelihood (ML) of the parameters describing S i writes where "cst" denotes constant terms and is the Itakura-Satio (IS) Divergence (In this paper, the Itakura-Satio divergence is chosen as a measure of fit, which is appropriate for Gamma multiplicative noise.In addition, the Euclidean distance can cope with Gaussian additive noise and the Kullback-Leibler divergence fits multinomial distributions or Poisson noise.).In addition, the following two types of divergence are widely used [20]: Squared Euclidean (EU) distance Therefore, the ML estimation of W i and H i given source STFT S i is equivalent to NMF of the power spectrogram |S i | 2 into W i H i .In our simulation experiments, we build the initialization of the source spectrogram estimation using the EU divergence, KL divergence, and IS divergence, respectively.

Objective
We are interested in jointly updating the source spectrogram factors W i , H i , the mixing matrix A, and estimating the sources at the same time.In this paper, we propose a robust parameter initialization scheme to optimize EM algorithm.The block diagram of our proposed BSS algorithm is shown in Figure 1.Initially, tensor method is used to estimate the mixing matrix A. Then the time-frequency sources are estimated and the source spectrogram factors are detected using NMF of the power spectrogram.Finally, the model parameters are updated and the spatial images of source signals are estimated.Detailed descriptions of each step are given in the following section.

The Proposed Optimization Algorithm
In the following, we will propose a robust parameters initialization scheme using tensor decomposition and NMF to optimize the EM algorithm.First, the mixing matrix is estimated using tensor decomposition.Second, the state-of-the-art source separation algorithms are reviewed.Finally, the optimization algorithm is presented in detail.

Mixing Matrix Estimation Using Tensor Decomposition
Let us denote the J × J auto-correlation matrix as follows: where R s = E[ss H ] is the auto-correlation matrix of the source signal, the superscripts • H denotes the complex conjugate transpose.For simplicity, we have dropped the noise terms.Let us divide the whole data block into P non-overlapping sub-blocks, which are indexed by p = 1, ..., P. Then the spatial covariance matrices of the observation satisfy The problem we want to solve is the estimation of A from the set R p x .The solution will be obtained by interpreting as a tensor decomposition.It can equivalently be written as PARAFAC decomposition of a third-order tensor R x ∈ C J×J×P built by stacking the P matrices {R 1 x , ..., R P x } one after each other along the third dimension.Each element of the tensor R x is denoted by r (x) j 1 ,j 2 ,p , with j 1 = 1, ..., J, j 1 = 1, ..., J, and p = 1, ..., P. Define the matrix C ∈ C P×I whose element on the p-th row and i-th column, denoted c p,i , is the i-th diagonal element of R s .Then we have The PARAFAC decomposition (13) of the tensor R x ∈ C J×J×P is a decomposition R x as a linear combination of a minimal number of rank-1 term: where • denotes the tensor outer product, the superscripts • * denotes the complex conjugate, a i and c i are the column of A and C, respectively.
In this paper, we will use the following J 2 × P matrix representation of R x [22,23] [ Then ( 14) can be written in a matrix format as where denotes the Khatri-Rao product.As a result, its reduced-size SVD can be written as where U ∈ C J 2 ×I , Σ ∈ R I×I is diagonal, and V ∈ C P×I .Then there exists a nonsingular matrix where the columns of A A * are the vectors a i ⊗ a * i (⊗ denotes the Kronecker product), which are the vectorized representations of the rank-1 matrices a i a H i .As a consequence, the mixing matrix A can be determined using some optimization algorithms.The standard way is by means of an alternating least squares (ALS) algorithm [24].To enhance the convergence of ALS algorithm, an exact line search method is also used [25][26][27].The discussion [25] is limited to the real case and the complex case is addressed [26,27].Additionally, the matrix Z is to impose that has a Khatri-Rao structure.It was shown that Z diagonalizes a set of symmetric matrices by congruence.For further details on the way these matrices are built [28].This tensor method is uniquely identifiable in certain underdetermined cases, thus proving uniqueness of the estimated mixing matrix.

Source Separation Using the Baseline Methods
Now the mixing matrix had been estimated using the above tensor decomposition method, we can separate the source signals using some state-of-the-art methods.In the following, we review two baseline methods for the source separation.One is complex l p norm minimization method [10], the other is binary masking method [11].

l p Norm Minimization Method
The phases of the source STFT coefficients s i, f n are assumed to be uniformly distributed, while their magnitudes are modeled by where the parameters p > 0 and β > 0 govern the shape and the variance of the prior, respectively.Γ(•) is the gamma function.Therefore, the maximum a posterior source coefficients are given as follows: where s p is the l p norm of the source s defined by s

Binary Masking Method
We create the time-frequency mask corresponding to each source and produce the original source time-frequency representation.For instance, defining which is the indicator function for the support of s i .Then, we obtain the time-frequency representation of s i from the mixture via ŝi, Therefore, the spectral basis and temporal code are estimated based on NMF of the source spectrogram estimates by using the outputs of the above source separation methods.Then the mixing matrix and the source spectrogram factors are updated jointly, the spatial images of all sources are obtained using the following optimization EM algorithm.

The Optimization EM Algorithm
Let θ = {A, W, H, Σ v } be the set of all parameters, where A is the J × I matrix with entries a ji , W is the F × K matrix with entries w f k , H is the K × N matrix with entries h kn , Σ v, f is the noise covariance parameters.We derive an optimization EM algorithm, and the set {R xx, f , R xs, f , R ss, f , {u k, f n } kn } f is defined as follows: We select the following Minus Log-likelihood (ML) criterion: Then the mixing matrix, noise covariance, and W i , H i will be updated by using the following two-step iteration.

• E-step: Conditional Expectations of Natural Statistics
The minimum mean square error estimates ŝ f n of the source STFT are directly retrieved, and the spatial images of all source signals are obtained by using Wiener filtering, which is expressed as follows: and the component estimates is where • M-step: Update of Parameters In the linear instantaneous mixture case, the mixing matrix is real-valued.Therefore, we obtain the updated mixing matrix where Rss, f = 1 • Normalize A, W, and H.
Finally, by conservativity of Wiener reconstruction the spatial images of the estimated sources and noise sum up to the original mixture in the STFT domain.Then the inverse STFT can be used to transform them to the time-domain due to the linearity of the STFT.The source separation algorithm in the linear mixture case is outlined in Algorithm 1.
Algorithm 1: Proposed Algorithm for Underdetermined Linear BSS.
• Underdetermined Linear Mixture Case (I > J) Step 1. Estimate the mixing matrix A by using the time-domain tensor decomposition.
Step 2. Perform STFT on x(t) to get x f n .
Step 3. Estimate the sources using (20) and detect the source spectrogram factors employing the NMF method with (7).
Step 4. Initialize the updated matrix, the spectral basis, and temporal code, then update these parameters using EM algorithm.i.e., repeat (i).Update A with (33) in the linear mixture case.
(ii).Alternately update w f k and h kn with (35).until convergence Step 5. Estimate ŝ f n by using Wiener filter of (28).
Step 6. Transform ŝ f n into time-domain to obtain s(t) through inverse STFT.

Convolutive Mixed Sources Case
The derivation of optimization EM algorithm for convolutive model is more complex since each mixing filter boils down to the combination of a delay so that the updated mixing matrix cannot be expressed using (33) in the M-step.In the following, we consider the underdetermined multichannel convolutive mixture model, namely in which A(l) is the mixing system's impulse response matrix at the time-lag l, and L denotes the maximum channel length.Then the convolutive mixtures can be decoupled into a series of linear instantaneous mixtures by applying STFT on consecutive time windows.Therefore, (39) can be expressed as follows: where A f is the frequency component of the mixing filter A(l) at frequency f , and x f n , s f n are defined by the same way as in the linear instantaneous mixture model.In this case, the updated mixing matrix (33) needs to be replaced by In the convolutive mixture model, the mixing matrix is estimated in the Fourier domain.Therefore, the main difficulty is the need to deal with the permutation and scaling ambiguities.In our algorithm, the minimal distortion principle is used to compensate the scaling ambiguity, and K-mean clustering algorithm is employed to deal with the frequency-dependent permutation ambiguity problem.Finally, the source separation algorithm in the convolutive mixture case is outlined in Algorithm 2.
• Underdetermined Convolutive Mixture Case (I > J) Step 1. Perform STFT on x(t) to get x f n Step 2. Estimate the mixing matrix A f by using frequency-domain tensor decomposition.
Step 3. Estimate the sources using (22), and detect the source spectrogram factors employing the NMF method with (7).
Step 4. Initialize the updated matrix, the spectral basis, and temporal code, then update these parameters using EM algorithm.i.e., repeat (i).Update A f with (41) in the convolutive mixture case.
(ii).Alternately update w f k and h kn with (35).until convergence Step 5. Estimate ŝ f n by using Wiener filter of (28).
Step 6. Transform ŝ f n into time-domain to obtain s(t) through inverse STFT.

Experiments
In this section, all the simulation experiments are conducted on a computer with Inter (R) Xeon (R) CPU E5-2630 v3 @ 2.40GHz, 32.00 GB memory under Ubuntu 15.04 operational system and the programs are coded by Matlab R2016b installed in a personal computer.
First, we describe the test datasets and evaluation criteria, and proceed with experiments including the music mixture signals and speech mixture signals.Based on these criteria, we select two models for further study, namely the linear instantaneous mixture model and convolutive mixture model.Second, we compare the proposed algorithm with the baseline algorithms over synthetic reverberant speech/music mixtures and the real-world speech/music mixtures.Finally, numerous simulation examples are shown to illustrate the performance of our proposed algorithm.

Datasets
We talk about four audio datasets, i.e., two synthetic stereo linear instantaneous mixture (Dataset A and Dataset B) and two convolutive mixture (Dataset C and Dataset D).In the linear instantaneous mixture case, Dataset A matches with the development dataset dev2 (dev2-wdrums-inst-mix) of the 2008 Signal Separation Evaluation Campaign "under-determined speech and music mixtures" task development datasets (SiSEC'08) (http://www.sisec.wiki.irisa.fr),which consists of one synthetic stereo mixture, including three musical sources with drums which consist of percussive instruments.Dataset B comes from the development dataset dev1 (dev1-female3-inst-mix) of SiSEC'08 which consists of three speech mixtures.
In the convolutive mixture case, Dataset C comes from the music data with drums in dataset dev2 (dev2-wdrums-liverec-250ms-1m-mix), which has 250 ms of reverberation time with 1 m space between their microphones in the live real-recording environment.Dataset D is from the dataset dev1 (dev1-male3-synthconv-130ms-1m-mix) of the SiSEC'08 which has 130 ms of reverberation time with 1 m space between their microphones.

Source Signal Separation Evaluation Criteria
In order to evaluate our proposed algorithm in the blind audio source separation, we use several objective performance criteria [29] which compare the reconstructed source signal images with the original ones.Now we define numerical performance criteria by computing energy ratios expressed in decibels (dB) from estimated source decomposition to global performance.
The criteria derive from the decomposition of an estimated source image as where and e arti f ij (t) are distinct error components representing distortion, interference, and artifacts in the channel j, respectively.Therefore, these criteria are defined as follows: The Signal to Distortion Ratio (SDR) The Source Image to Spatial Distortion Ratio (ISR) The Source to Interference Ratio (SIR) The Source to Artifacts Ratio (SAR) In our paper, we employ the above measures (SDR, ISR, SIR, SAR) to evaluate the performance of our proposed algorithm and compare with the baseline methods.Finally, a series of simulation results verify the competence of our proposed algorithm.

Algorithm Parameters
The proposed algorithm will be compared with the EM, MU algorithms [16], and full-rank algorithm [30].In the linear instantaneous case, the initial values of the NMF parameters for the MU and EM algorithms are based on a mixing matrix estimate obtained with the method of Arberet et al. [31].In the convolutive case, the initial values are based on frequency-dependent complex-valued mixing matrix estimation [32].For verifying the effective of our proposed algorithm, we employ the time-domain tensor decomposition to estimate the linear mixing matrix and the frequency-domain tensor decomposition to estimate the convolutive mixing matrix.Additionally, we build the initialization of the source spectrogram estimation W i and H i based on EU-NMF, KL-NMF, and IS-NMF, respectively.The initial values for the NMF parameters {w f k , h kn }, k ∈ K i of a given source i are calculated by applying the NMF algorithm to mono-channel power spectrogram of source.
Finally, we set the following parameters for our optimization algorithm, the number of components is K i = 4 for every experiment.Furthermore, since the choice of the STFT window size and the number of iteration are rather important, so we use the STFT with the half-overlapping sine windows (typically a Hanning Window), these parameters are reported in Table 1.In the first place, we consider underdetermined music mixtures and speech mixtures in the linear case, and compare our proposed algorithms (Tensor-EU, Tensor-KL, Tensor-IS) with the baseline algorithms (l 0 min [10], EM [16], MU [16]).Additionally, we run the EM and MU from 100 different random initializations (EM, MU), and select the average as the results for the tasks of underdetermined music and speech mixtures in the linear instantaneous case, respectively.
In the second place, we test the performance of our proposed algorithms in the realistic underdetermined convolutive mixture case.For example, music mixtures are the live recording dataset which are more complicated than the synthetic convolutive case, and the speech recorded in an indoor environment are often convolutive, due to multipath reflections.We compare our proposed algorithms with the methods [11,16,30].In addition, the separation result obtained with the EM and MU methods depends on the initial values, we conducted 100 trials with random initializations and selected the average as the results.

Music Signal Mixtures in the Linear Instantaneous Case
In Dataset A, we first select the music signal mixtures in the linear case.The average SDR, ISR, SIR, and SAR are depicted in Figures 2 and 3 based on the MU, EM with the random initialization, MU [16], EM [16], l 0 min [10], and our proposed algorithm (Tensor-EU, Tensor-KL, Tensor-IS).Finally, the waveforms of the estimated sources are shown in Figure 4.   (c) Waveforms of the estimated sources using MU algorithm for drum case [16]; (d) Waveforms of the estimated sources using EM algorithm for drum case [16]; (e) Waveforms of the estimated sources using l 0 minimization algorithm for drum case [10]; and (f) Waveforms of the estimated sources using our proposed algorithm (Tensor-IS) in the linear instantaneous mixture case.

Speech Signal Mixtures in the Linear Instantaneous Case
In Dataset B, we select the speech signal mixtures in the linear instantaneous case.The average SDR, ISR, SIR, and SAR are depicted in Figures 5 and 6 based on the MU, EM with the random initialization, MU [16], EM [16], l 0 min [10], and our proposed algorithm (Tensor-EU, Tensor-KL, Tensor-IS), respectively.In Dataset C, we select the real live recording convolutive dataset which consists of vocal and musical instrument with drum.The average SDR, ISR, SIR, and SAR are depicted in Figures 7 and 8 based on the MU, EM with the random initialization, MU [16], EM [16], Full-rank [30], Bm [11] and our proposed algorithm (Tensor-EU, Tensor-KL, Tensor-IS), respectively.Finally, the waveforms of the estimated sources are shown in Figure 9. Performance evaluation MU EM MU [16] EM [16] Full-rank [30] Bm [11] Tensor-EU Tensor-KL Tensor-IS   (e) Waveforms of the estimated sources using binary masking algorithm [11]; and (f) Waveforms of the estimated sources using the proposed algorithm (Tensor-IS) in the convolutive mixture case.

Speech Signal Mixtures in the Convolutive Case
In Dataset D, we select the synthetic convolutive mixtures including three speech sources and two mixing channels.The average SDR, ISR, SIR, and SAR are depicted in Figures 10 and 11 based on the MU, EM with the random initialization, MU [16], EM [16], Full-rank [30], Bm [11] and our proposed algorithm (Tensor-EU, Tensor-KL, Tensor-IS), respectively.Performance evaluation MU EM MU [16] EM [16] Full-rank [30] Bm [11] Tensor-EU Tensor-KL Tensor-IS  Performance evaluation MU EM MU [16] EM [16] Full-rank [30] Bm [11] Tensor-EU Tensor-KL Tensor-IS Discussion 1.According to the above experimental results of Dataset A, Dataset B, Dataset C, and Dataset D, it can be seen that our proposed algorithm can separate music signal mixtures and speech signal mixtures in the underdetermined linear and convolutive case.What is more, according to the average value of source separation results, it is also shown that our proposed algorithm outperforms the baseline algorithms.

The Runtime of All Algorithms
The corresponding runtimes of the algorithms are shown in Table 2.It can be seen that the proposed algorithm takes more time than the MU and EM methods.It is mainly because the time consuming on estimation of mixing matrix based on tensor decomposition.However, compared with the full-rank algorithm, our proposed algorithm takes less time.Additionally, as for the source separation results, the proposed algorithms exhibit better separation performance than the compared algorithms.In our future work, it is still necessary to develop a better algorithm to reduce time cost.
Table 2.The runtime of all algorithms (sec.).

Conclusions and Future Work
In this paper, we proposed an optimization underdetermined multichannel BSS algorithm based on tensor decomposition and NMF.Because the EM method is very sensitive to the parameter initialization, we first estimated the mixing matrix employing tensor decomposition; meanwhile, the source spectrogram factors were estimated using NMF source model, and produced an optimization parameter initialization scheme.Then the model parameters were updated using the EM algorithm.The spatial images of all sources were obtained in the minimum mean square error sense by multichannel Wiener filtering.The time-domain sources can be obtained through inverse STFT.Finally, a series of experimental results showcase that our proposed optimization algorithm improves the separation performance compared with the baseline algorithms.
In addition, there are some aspects that deserve further study.Firstly, the estimation of number of components of NMF model is an open topic.There have been some articles to solve this problem, such as the automatic order selection [33], Information Theoretic Criteria [34], and N-way Probabilistic Clustering [35].Secondly, the window length used in the STFT has been taken advantage to match the characteristics of audio signals, and different window lengths have different effects on the separation results.Furthermore, taking into account source or microphone motions [36,37], i.e., convolutive mixture corresponding to source-to-microphone channel that can change over time, is a challenging problem.Therefore, these problems would be the focus of our future work.

Figure 1 .
Figure 1.Block diagram of our proposed blind source separation algorithm.

Figure 2 .
Figure 2. The average SDR and ISR results in the linear music signal mixture case.

Figure 3 .
Figure 3.The average SIR and SAR results in the linear music signal mixture case.

Figure 4 .
Figure 4.A numerical example demonstrating that (a) Waveforms of music source signals with drum in the linear mixture case; (b) Waveforms of the mixture sources; (c) Waveforms of the estimated sources using MU algorithm for drum case[16]; (d) Waveforms of the estimated sources using EM algorithm for drum case[16]; (e) Waveforms of the estimated sources using l 0 minimization algorithm for drum case[10]; and (f) Waveforms of the estimated sources using our proposed algorithm (Tensor-IS) in the linear instantaneous mixture case.

Figure 5 .
Figure 5.The average SDR and ISR results in the linear speech signal mixture case.

Figure 6 .
Figure 6.The average SIR and SAR results in the linear speech signal mixture case.4.4.3.Music Signal Mixtures in the Convolutive Case

Figure 7 .
Figure 7.The average SDR and ISR results in the convolutive music signal mixture case.

Figure 9 .
Figure 9.A numerical example demonstrating that (a) Waveforms of music source signals with drum in the convolutive mixture case; (b) Waveforms of the mixture sources[16]; (c) Waveforms of the estimated sources using MU algorithm[16]; (d) Waveforms of the estimated sources using EM algorithm; (e) Waveforms of the estimated sources using binary masking algorithm[11]; and (f) Waveforms of the estimated sources using the proposed algorithm (Tensor-IS) in the convolutive mixture case.

Figure 10 .
Figure 10.The average SDR and ISR results in the convolutive speech signal mixture case.

Figure 11 .
Figure 11.The average SIR and SAR results in the convolutive speech signal mixture case.

in which x(t) =
[x 1 (t), ..., x J (t)] ∈ C J represents the received J signals, s(t) = [s 1 (t), ..., s I (t)] ∈ C I denotes the I source signals (unknown), and I > J, i.e., in the underdetermined mixture case.A = [a 1 , ..., a I ] ∈ C J×I is the unknown mixing matrix, v(t) ∈ C J is an additional noise with zero mean and variance σ 2 .Nevertheless, for audio signals, the separation is much easier in the short-time discrete frequency transform domain, where the source signals are sparser.Therefore, the mixture model (1) can be expressed as follows:

Table 1 .
The parameter setting of all the algorithms.
The average SIR and SAR results in the convolutive music signal mixture case.