Article

Underdetermined Blind Source Separation Combining Tensor Decomposition and Nonnegative Matrix Factorization

Yuan Xie, Kan Xie, Junjie Yang and Shengli Xie
1 School of Automation, Guangdong University of Technology, Guangzhou 510006, China
2 Institute of Intelligent Information Processing and the Guangdong Provincial Key Laboratory for Information Technology in Internet of Things, Guangzhou 510006, China
* Authors to whom correspondence should be addressed.
Symmetry 2018, 10(10), 521; https://doi.org/10.3390/sym10100521
Submission received: 25 September 2018 / Revised: 14 October 2018 / Accepted: 16 October 2018 / Published: 18 October 2018

Abstract

Underdetermined blind source separation (UBSS) is a hot topic in signal processing, which aims at recovering the source signals from a number of observed mixtures without knowing the mixing system. Recently, the expectation-maximization (EM) algorithm has shown great potential for UBSS. However, its final separation results depend strongly on the parameter initialization, which can lead to poor separation performance. In this paper, we propose an effective algorithm that combines tensor decomposition and nonnegative matrix factorization (NMF). In the proposed algorithm, we first employ tensor decomposition to estimate the mixing matrix, and an NMF source model is used to estimate the source spectrogram factors. Then a series of iterations is derived to update the model parameters. At the same time, the spatial images of the source signals are estimated with Wiener filters constructed from the learned parameters, and the time-domain sources are obtained through the inverse short-time Fourier transform. Finally, extensive experimental results demonstrate the effectiveness and advantages of our proposed algorithm over the compared algorithms.

1. Introduction

Blind source separation (BSS) considers the recovery of source signals from observed signals without knowledge of the recording environment. Recently, BSS has become an active research area. If the number of source signals is less than, equal to, or greater than the number of microphones, BSS is classified as the overdetermined case [1], the determined case [2,3], or the underdetermined case [4,5], respectively. In particular, in natural environments the mixing process is generally considered to be convolutive, i.e., the channel between each source and each microphone is modeled as a linear filter that represents the multiple source-to-microphone paths and thus accounts for the reverberation of the channel. Therefore, underdetermined convolutive BSS is a challenging problem in the field of BSS.
To address the underdetermined convolutive BSS problem, tensor decomposition shows great potential, because an interesting property of higher-order tensors is that their rank decomposition is unique under mild conditions. Parallel factor (PARAFAC) decomposition factorizes a tensor into a sum of rank-one component tensors, and the factor matrices collect the vectors of the rank-one components. By singular value decomposition of a series of matrices, the parallel factorization problem can be transformed into a joint matrix diagonalization problem, so that the PARAFAC analysis is solved [6,7]. Therefore, the PARAFAC method can be used to identify the mixing matrix in the underdetermined case, which has proven useful in a wide range of applications, from sensor array processing to communication, speech, and audio signal processing [8,9]. In the source separation phase, the source signals can be estimated using the $\ell_0$-norm minimization method [10] or the binary masking algorithm [11]. However, these methods suffer from poor separation performance. To improve it, we note that a source model encodes specific information about the spectral structure of the sources; therefore, a better source model has the potential to improve the source separation performance.
In BSS, the non-negative matrix factorization (NMF) source model is usually applied to the speech/music power spectrogram, where the spectrogram is approximated by the product of two non-negative matrices, i.e., a basis matrix and an activation matrix. The basis matrix represents the repeating spectral patterns, and the activation matrix represents the presence of these patterns over time. More generally, NMF aims to decompose a non-negative matrix into the product of two low-rank non-negative factor matrices [12,13]. The NMF model can thus efficiently exploit the low-rank nature of the speech spectrogram and its dependency across frequencies. Among NMF-based methods [14,15,16,17], non-negative matrix factor two-dimensional deconvolution is an effective machine learning method in the audio source separation field. In particular, in the convolutive frequency-domain model, the well-known permutation alignment problem cannot be solved without additional a priori knowledge about the sources or the mixing filters. The NMF source model, however, implies a coupling of the frequency bands and a joint estimation of the source parameters and mixing coefficients, which frees us from the permutation problem. Furthermore, NMF is well suited to polyphony, as it essentially models each source as a sum of elementary components with characteristic spectral signatures. Therefore, the NMF source model is able to improve the source separation performance.
Additionally, in order to obtain better source separation results, the estimated mixing matrix and the NMF variables need to be updated with an optimization algorithm. Among BSS optimization algorithms, the expectation-maximization (EM) algorithm [18], which is a popular choice for Gaussian models, provides fast convergence and is related to several multichannel source separation techniques that employ Gaussian mixture models as source models. However, it is very sensitive to initialization in source separation tasks. There have been some studies on the parameter initialization of NMF to optimize separation performance [19,20]. Therefore, we design an optimized initialization scheme to improve the source separation performance.
In this paper, an alternative optimization algorithm is proposed to deal with the parameter initialization problem and improve separation performance. First, we employ tensor decomposition to estimate the mixing matrix, and NMF is used to estimate the source spectrogram factors. Then these model parameters are updated using the EM algorithm. Meanwhile, the spatial images of the source signals are estimated using Wiener filters constructed from the learned parameters. The time-domain sources are obtained through the inverse short-time Fourier transform (STFT) using an adequate overlap-add procedure with a dual synthesis window. Thanks to the linearity of the inverse STFT, the reconstruction is conservative in the time domain as well. Finally, a series of experiments including synthetic instantaneous and convolutive music and speech mixtures, as well as live real recordings, shows that our improved algorithm outperforms the state-of-the-art baseline methods. The main contributions of this article are as follows.
(1) We propose an improved algorithm that combines tensor decomposition and an advanced NMF model to deal with the underdetermined linear BSS problem. The mixing matrix is estimated using tensor decomposition, and NMF is used to decompose the given spectrogram into several spectral bases and temporal activations. Then the mixing matrix, NMF variables, and noise components are updated by a series of iteration rules. The proposed algorithm combines the advantages of tensor decomposition and NMF, which helps to improve the source separation performance. Additionally, the improved algorithm can be extended to underdetermined convolutive BSS.
(2) We demonstrate the superiority of the proposed algorithm in both the underdetermined linear and convolutive BSS cases. Focusing on audio datasets, a series of simulation experiments shows that the proposed algorithm is effective and outperforms the state-of-the-art algorithms in terms of source separation performance.
The remainder of this paper is organized as follows. Section 2 formulates the underdetermined blind source separation problem. In Section 3, an optimization algorithm based on tensor decomposition and NMF is presented. Experimental results comparing the source separation performance with state-of-the-art techniques in various experimental settings are given in Section 4. Finally, Section 5 summarizes our conclusions and future work.

2. Problem Formulation

2.1. Linear Instantaneous Mixture Model

The signal model with noise used in this paper is described as follows:
$\mathbf{x}(t) = \mathbf{A}\mathbf{s}(t) + \mathbf{v}(t) \qquad (1)$
in which $\mathbf{x}(t) = [x_1(t), \ldots, x_J(t)] \in \mathbb{C}^J$ represents the $J$ received signals, $\mathbf{s}(t) = [s_1(t), \ldots, s_I(t)] \in \mathbb{C}^I$ denotes the $I$ (unknown) source signals, and $I > J$, i.e., we are in the underdetermined mixture case. $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_I] \in \mathbb{C}^{J \times I}$ is the unknown mixing matrix, and $\mathbf{v}(t) \in \mathbb{C}^J$ is additive noise with zero mean and variance $\sigma^2$.
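For concreteness, the following short sketch (not part of the original paper; it uses NumPy with hypothetical toy sources and a random mixing matrix) generates an underdetermined instantaneous mixture with $I = 3$ sources and $J = 2$ sensors according to (1).

```python
import numpy as np

rng = np.random.default_rng(0)

I, J, T = 3, 2, 16000          # 3 sources, 2 microphones, 1 s of audio at 16 kHz
t = np.arange(T) / 16000.0

# Hypothetical toy sources (sinusoid, chirp-like tone, white noise), for illustration only.
s = np.vstack([
    np.sin(2 * np.pi * 440 * t),
    np.sin(2 * np.pi * (200 + 300 * t) * t),
    rng.standard_normal(T),
])

A = rng.standard_normal((J, I))          # unknown real mixing matrix (instantaneous case)
sigma = 0.01
v = sigma * rng.standard_normal((J, T))  # additive sensor noise

x = A @ s + v                            # Equation (1): x(t) = A s(t) + v(t)
print(x.shape)                           # (2, 16000): J observed mixtures
```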
Nevertheless, for audio signals the separation is much easier in the short-time Fourier transform (STFT) domain, where the source signals are sparser. Therefore, the mixture model (1) can be expressed as follows:
$\mathbf{x}_{fn} = \mathbf{A}\mathbf{s}_{fn} + \mathbf{v}_{fn} \qquad (2)$
where $n = 1, 2, \ldots, N$ denotes the index of the time window used for the Fourier transform, $f = 0, \ldots, F-1$ is the index of the frequency bin, and $\mathbf{x}_{fn} = [x_{1,fn}, \ldots, x_{J,fn}]^T$ and $\mathbf{s}_{fn} = [s_{1,fn}, \ldots, s_{I,fn}]^T$ are the STFTs of the mixtures and of the sources at time-frequency point $(f, n)$, respectively. $\mathbf{v}_{fn} = [v_{1,fn}, \ldots, v_{J,fn}]^T$, where the noise $v_{j,fn}$ is assumed to be stationary and spatially uncorrelated, i.e., $v_{j,fn} \sim \mathcal{N}_c(0, \sigma_{j,f}^2)$ and $\boldsymbol{\Sigma}_{v,f} = \mathrm{diag}[\sigma_{j,f}^2]_j$.

2.2. The NMF Source Model

Let $K \geq I$ be known in advance, and let $\{K_i\}_{i=1}^{I}$ be a nontrivial partition of $\{1, \ldots, K\}$. Following [16,17], a coefficient $s_{i,fn}$ is modeled as the sum of latent components $c_{k,fn}$, such that
$s_{i,fn} = \sum_{k \in K_i} c_{k,fn}, \quad \text{i.e.,} \quad \mathbf{s}_{fn} = \mathbf{G}\mathbf{c}_{fn} \qquad (3)$
where $\mathbf{G} \in \mathbb{N}^{I \times K}$ is a binary selection matrix with entries
$G_{ik} = \begin{cases} 1, & k \in K_i \\ 0, & \text{otherwise} \end{cases} \qquad (4)$
and $\mathbf{c}_{fn} = [c_{1,fn}, \ldots, c_{K,fn}] \in \mathbb{C}^K$ is the vector of component coefficients at $(f, n)$. Each component $c_{k,fn}$ is distributed as
$c_{k,fn} \sim \mathcal{N}_c(0, w_{fk}h_{kn}) \qquad (5)$
where $\mathcal{N}_c(\mu, \Sigma)$ denotes the proper multivariate complex Gaussian distribution [21], with probability density function (pdf) $\mathcal{N}_c(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = |\pi\boldsymbol{\Sigma}|^{-1}\exp[-(\mathbf{x}-\boldsymbol{\mu})^H\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})]$, and $w_{fk}, h_{kn} \in \mathbb{R}_+$; $w_{fk}$ represents the spectral basis of the $i$-th source, and $h_{kn}$ represents the temporal activation of each spectral basis element of the $i$-th source. In the rest of the paper, the quantities $s_{i,fn}$ and $c_{k,fn}$ are referred to as "source" and "component", respectively. The components are assumed to be mutually independent and individually independent across frequency and time. It follows that
$s_{i,fn} \sim \mathcal{N}_c\!\left(0, \sum_{k \in K_i} w_{fk}h_{kn}\right) \qquad (6)$
This corresponds to modeling the source power spectral densities (PSDs) with the NMF model, i.e.,
$\mathbb{E}[|\mathbf{S}_i|^2] = \mathbf{W}_i\mathbf{H}_i \qquad (7)$
where $\mathbf{S}_i$ denotes the $F \times N$ STFT matrix of source $i$, and $\mathbf{W}_i = [w_{fk}]_{f, k \in K_i}$ and $\mathbf{H}_i = [h_{kn}]_{k \in K_i, n}$ are the corresponding factor matrices. Then, for the maximum likelihood estimation of $\mathbf{W}_i$ and $\mathbf{H}_i$, the minus log-likelihood of the parameters describing $\mathbf{S}_i$ writes
$-\log p(\mathbf{S}_i \mid \mathbf{W}_i, \mathbf{H}_i) = \sum_{fn} d_{IS}\!\left(|s_{i,fn}|^2 \,\Big|\, \sum_{k \in K_i} w_{fk}h_{kn}\right) + \text{cst} \qquad (8)$
where "cst" denotes constant terms and
$d_{IS}(x \mid y) = \dfrac{x}{y} - \log\dfrac{x}{y} - 1 \qquad (9)$
is the Itakura-Saito (IS) divergence. (In this paper, the Itakura-Saito divergence is chosen as the measure of fit because it is appropriate for Gamma multiplicative noise; in contrast, the Euclidean distance copes with Gaussian additive noise, and the Kullback-Leibler divergence fits multinomial distributions or Poisson noise.)
In addition, the following two types of divergence are widely used [20]:
Squared Euclidean (EU) distance:
$d_{EU}(x \mid y) = |x - y|^2 \qquad (10)$
Kullback-Leibler (KL) divergence:
$d_{KL}(x \mid y) = x\log\dfrac{x}{y} - x + y \qquad (11)$
Therefore, the ML estimation of W i and H i given source STFT S i is equivalent to NMF of the power spectrogram | S i | 2 into W i H i . In our simulation experiments, we build the initialization of the source spectrogram estimation using the EU divergence, KL divergence, and IS divergence, respectively.
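As an illustration of fitting the factorization (7) under the IS divergence, the sketch below implements standard multiplicative updates for IS-NMF on a power spectrogram. It is a generic, minimal routine (the spectrogram `V`, the rank `K`, and the iteration count are placeholders), not the exact initialization code used in our experiments; EU- and KL-NMF variants only change the update rules.

```python
import numpy as np

def is_nmf(V, K, n_iter=200, eps=1e-12, seed=0):
    """Factorize a nonnegative power spectrogram V (F x N) as V ~ W H
    by minimizing the Itakura-Saito divergence with multiplicative updates."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, N)) + eps
    for _ in range(n_iter):
        V_hat = W @ H + eps
        # IS-divergence multiplicative update for the spectral bases W
        W *= ((V / V_hat**2) @ H.T) / ((1.0 / V_hat) @ H.T)
        V_hat = W @ H + eps
        # IS-divergence multiplicative update for the temporal activations H
        H *= (W.T @ (V / V_hat**2)) / (W.T @ (1.0 / V_hat))
    return W, H

# Toy usage on a random placeholder "power spectrogram".
V = np.abs(np.random.default_rng(1).standard_normal((257, 100)))**2
W, H = is_nmf(V, K=4)
print(W.shape, H.shape)   # (257, 4) (4, 100)
```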

2.3. Objective

We are interested in jointly updating the source spectrogram factors $\mathbf{W}_i$, $\mathbf{H}_i$ and the mixing matrix $\mathbf{A}$, while estimating the sources at the same time. In this paper, we propose a robust parameter initialization scheme to optimize the EM algorithm. The block diagram of our proposed BSS algorithm is shown in Figure 1. Initially, the tensor method is used to estimate the mixing matrix $\mathbf{A}$. Then the time-frequency sources are estimated and the source spectrogram factors are detected using NMF of the power spectrogram. Finally, the model parameters are updated and the spatial images of the source signals are estimated. Detailed descriptions of each step are given in the following section.

3. The Proposed Optimization Algorithm

In the following, we propose a robust parameter initialization scheme using tensor decomposition and NMF to optimize the EM algorithm. First, the mixing matrix is estimated using tensor decomposition. Second, the state-of-the-art source separation algorithms are reviewed. Finally, the optimization algorithm is presented in detail.

3.1. Mixing Matrix Estimation Using Tensor Decomposition

Let us denote the $J \times J$ auto-correlation matrix as follows:
$\mathbf{R}_x = \mathbb{E}[\mathbf{x}(t)\mathbf{x}^H(t)] = \mathbf{A}\mathbf{R}_s\mathbf{A}^H$
where $\mathbf{R}_s = \mathbb{E}[\mathbf{s}\mathbf{s}^H]$ is the auto-correlation matrix of the source signals, and the superscript $(\cdot)^H$ denotes the complex conjugate transpose. For simplicity, we have dropped the noise terms. Let us divide the whole data block into $P$ non-overlapping sub-blocks, indexed by $p = 1, \ldots, P$. Then the spatial covariance matrices of the observations satisfy
$\mathbf{R}_x^1 = \mathbf{A}\,\mathbf{R}_s^1\,\mathbf{A}^H, \quad \ldots, \quad \mathbf{R}_x^P = \mathbf{A}\,\mathbf{R}_s^P\,\mathbf{A}^H \qquad (12)$
in which $\mathbf{R}_s^p = \mathbb{E}[\mathbf{s}^p(\mathbf{s}^p)^H]$ is diagonal. The problem we want to solve is the estimation of $\mathbf{A}$ from the set $\{\mathbf{R}_x^p\}$. The solution will be obtained by interpreting this set as a tensor decomposition: it can equivalently be written as the PARAFAC decomposition of a third-order tensor $\mathcal{R}_x \in \mathbb{C}^{J \times J \times P}$ built by stacking the $P$ matrices $\{\mathbf{R}_x^1, \ldots, \mathbf{R}_x^P\}$ one after the other along the third dimension. Each element of the tensor $\mathcal{R}_x$ is denoted by $r^{(x)}_{j_1, j_2, p}$, with $j_1 = 1, \ldots, J$, $j_2 = 1, \ldots, J$, and $p = 1, \ldots, P$. Define the matrix $\mathbf{C} \in \mathbb{C}^{P \times I}$ whose element on the $p$-th row and $i$-th column, denoted $c_{p,i}$, is the $i$-th diagonal element of $\mathbf{R}_s^p$. Then we have
$r^{(x)}_{j_1, j_2, p} = \sum_{i=1}^{I} a_{j_1,i}\,c_{p,i}\,a_{j_2,i}^* \qquad (13)$
The PARAFAC decomposition (13) of the tensor $\mathcal{R}_x \in \mathbb{C}^{J \times J \times P}$ expresses $\mathcal{R}_x$ as a linear combination of a minimal number of rank-1 terms:
$\mathcal{R}_x = \sum_{i=1}^{I} \mathbf{a}_i \circ \mathbf{c}_i \circ \mathbf{a}_i^* \qquad (14)$
where $\circ$ denotes the tensor outer product, the superscript $(\cdot)^*$ denotes the complex conjugate, and $\mathbf{a}_i$ and $\mathbf{c}_i$ are the columns of $\mathbf{A}$ and $\mathbf{C}$, respectively.
In this paper, we will use the following $J^2 \times P$ matrix representation $\mathbf{R}_x$ of the tensor $\mathcal{R}_x$ [22,23]:
$[\mathbf{R}_x]_{(j_1-1)J + j_2,\,p} = [\mathcal{R}_x]_{j_1, j_2, p} \qquad (15)$
Then (14) can be written in matrix format as
$\mathbf{R}_x = [\mathbf{A} \odot \mathbf{A}^*]\,\mathbf{C}^T \qquad (16)$
where $\odot$ denotes the Khatri-Rao product. As a result, its reduced-size SVD can be written as
$\mathbf{R}_x = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^H \qquad (17)$
where $\mathbf{U} \in \mathbb{C}^{J^2 \times I}$, $\boldsymbol{\Sigma} \in \mathbb{R}^{I \times I}$ is diagonal, and $\mathbf{V} \in \mathbb{C}^{P \times I}$. Then there exists a nonsingular matrix $\mathbf{Z} \in \mathbb{C}^{I \times I}$ such that
$\mathbf{A} \odot \mathbf{A}^* = \mathbf{U}\boldsymbol{\Sigma}\mathbf{Z}, \qquad \mathbf{C}^T = \mathbf{Z}^{-1}\mathbf{V}^H \qquad (18)$
where the columns of $\mathbf{A} \odot \mathbf{A}^*$ are the vectors $\mathbf{a}_i \otimes \mathbf{a}_i^*$ ($\otimes$ denotes the Kronecker product), which are vectorized representations of the rank-1 matrices $\mathbf{a}_i\mathbf{a}_i^H$. As a consequence, the mixing matrix $\mathbf{A}$ can be determined using optimization algorithms. The standard approach is an alternating least squares (ALS) algorithm [24]. To enhance the convergence of the ALS algorithm, an exact line search method can also be used [25,26,27]; the discussion in [25] is limited to the real case, while the complex case is addressed in [26,27]. Additionally, the matrix $\mathbf{Z}$ is chosen so as to impose the Khatri-Rao structure on $\mathbf{U}\boldsymbol{\Sigma}\mathbf{Z}$; it was shown that $\mathbf{Z}$ diagonalizes a set of symmetric matrices by congruence, and we refer to [28] for the way these matrices are built. This tensor method is uniquely identifiable in certain underdetermined cases, thus guaranteeing uniqueness of the estimated mixing matrix.
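A simplified numerical sketch of this idea follows: the observations are split into $P$ sub-blocks, the covariance tensor of (12)-(14) is formed, and a plain alternating-least-squares PARAFAC fit (with the conjugate factor tied to $\mathbf{A}$) recovers the mixing matrix up to column scaling and permutation. This is only an illustrative ALS variant under these assumptions, not the exact SVD/joint-diagonalization or line-search procedures of [24,25,26,27,28].

```python
import numpy as np

def estimate_mixing_tensor(x, I, P=20, n_iter=200, seed=0):
    """Estimate the J x I mixing matrix from observations x (J x T) via a
    PARAFAC fit of sub-block covariances, R_x^p = A diag(c_p) A^H (Eq. (12))."""
    rng = np.random.default_rng(seed)
    J, T = x.shape
    Lb = T // P
    # Third-order covariance tensor R[j1, j2, p] of the P sub-blocks (Eq. (13)).
    R = np.stack([x[:, p*Lb:(p+1)*Lb] @ x[:, p*Lb:(p+1)*Lb].conj().T / Lb
                  for p in range(P)], axis=2)

    A = rng.standard_normal((J, I)) + 1j * rng.standard_normal((J, I))
    C = rng.standard_normal((P, I)) + 1j * rng.standard_normal((P, I))
    for _ in range(n_iter):
        B = A.conj()                          # conjugate factor tied to A (Eq. (14))
        # Update A: mode-1 unfolding of R, then a linear least-squares solve.
        M = np.einsum('ki,pi->kpi', B, C).reshape(J * P, I)
        A = np.linalg.lstsq(M, R.reshape(J, J * P).T, rcond=None)[0].T
        B = A.conj()
        # Update C: mode-3 unfolding of R, then a linear least-squares solve.
        M = np.einsum('ji,ki->jki', A, B).reshape(J * J, I)
        C = np.linalg.lstsq(M, R.transpose(2, 0, 1).reshape(P, J * J).T, rcond=None)[0].T
    return A / np.linalg.norm(A, axis=0, keepdims=True)   # unit-norm columns

# Toy usage: 2 mics, 3 nonstationary (block-modulated) sources.
rng = np.random.default_rng(3)
A_true = rng.standard_normal((2, 3))
s = rng.standard_normal((3, 40000)) * np.repeat(rng.random((3, 20)) + 0.1, 2000, axis=1)
A_hat = estimate_mixing_tensor(A_true @ s, I=3)
# Compare |A_hat| with the column-normalized |A_true|, up to column order.
print(np.round(np.abs(A_hat), 3))
```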

3.2. Source Separation Using the Baseline Methods

Now that the mixing matrix has been estimated using the above tensor decomposition method, we can separate the source signals using some state-of-the-art methods. In the following, we review two baseline methods for source separation: the complex $\ell_p$-norm minimization method [10] and the binary masking method [11].

3.2.1. $\ell_p$-Norm Minimization Method

The phases of the source STFT coefficients $s_{i,fn}$ are assumed to be uniformly distributed, while their magnitudes are modeled by
$P(|s_{i,fn}|) = \dfrac{p\,\beta^{1/p}}{\Gamma(1/p)}\,e^{-\beta|s_{i,fn}|^p} \qquad (19)$
where the parameters $p > 0$ and $\beta > 0$ govern the shape and the variance of the prior, respectively, and $\Gamma(\cdot)$ is the gamma function. Therefore, the maximum a posteriori source coefficients are given by
$\hat{\mathbf{s}}_{fn} = \arg\min_{\mathbf{s} \in \mathbb{C}^I} \|\mathbf{s}\|_p^p \quad \text{subject to} \quad \mathbf{A}\mathbf{s} = \mathbf{x}_{fn} \qquad (20)$
where $\|\mathbf{s}\|_p$ is the $\ell_p$ norm of the source vector $\mathbf{s}$, defined by $\|\mathbf{s}\|_p^p = \sum_{i=1}^{I} |s_i|^p$.
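In practice, (20) can be approximated, for example, with an iteratively reweighted least-squares (FOCUSS-type) scheme, sketched below for a single time-frequency point. This generic illustration is not necessarily the exact nonconvex solver used in [10], and the parameter values are placeholders.

```python
import numpy as np

def lp_min_focuss(A, x, p=0.7, n_iter=30, eps=1e-9):
    """Approximate arg min ||s||_p^p subject to A s = x (Eq. (20)) with a
    FOCUSS-style iteratively reweighted least-squares scheme (complex-valued)."""
    s = np.linalg.pinv(A) @ x                         # minimum l2-norm starting point
    for _ in range(n_iter):
        W = np.diag(np.abs(s)**(1 - p / 2) + eps)     # reweighting from current estimate
        s = W @ np.linalg.pinv(A @ W) @ x             # constrained weighted LS step
    return s

# Toy usage at a single time-frequency point (J = 2 mixtures, I = 3 sources).
rng = np.random.default_rng(5)
A = rng.standard_normal((2, 3)) + 1j * rng.standard_normal((2, 3))
s_true = np.array([0.0, 1.2 + 0.3j, 0.0])             # sparse source vector
x = A @ s_true
print(np.round(lp_min_focuss(A, x), 3))               # sparse estimate of the sources
```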

3.2.2. Binary Masking Method

We create the time-frequency mask corresponding to each source and produce the original source time-frequency representation. For instance, defining
$M_{i,fn} = \begin{cases} 1, & \hat{s}_{i,fn} \neq 0 \\ 0, & \text{otherwise} \end{cases} \qquad (21)$
which is the indicator function for the support of $s_i$. Then, we obtain the time-frequency representation of $s_i$ from the mixture via
$\hat{s}_{i,fn} = M_{i,fn}\,x_{i,fn} \qquad (22)$
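For illustration, a minimal sketch of the masking step (21)-(22) is given below; it assumes preliminary source STFT estimates (e.g., from (20)) and a chosen reference mixture channel, both hypothetical placeholders here.

```python
import numpy as np

def binary_mask_separation(s_hat, x_ref, threshold=0.0):
    """Apply the binary masks of Eq. (21)-(22).
    s_hat : (I, F, N) preliminary source STFT estimates
    x_ref : (F, N)    STFT of one reference mixture channel"""
    masks = (np.abs(s_hat) > threshold).astype(float)   # indicator of each source's support
    return masks * x_ref[None, :, :]                     # masked mixture = per-source STFTs

# Toy usage with random complex "STFTs" standing in for real data.
rng = np.random.default_rng(2)
s_hat = rng.standard_normal((3, 257, 100)) + 1j * rng.standard_normal((3, 257, 100))
x_ref = s_hat.sum(axis=0)
print(binary_mask_separation(s_hat, x_ref).shape)        # (3, 257, 100)
```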
Therefore, the spectral basis and temporal code are estimated by applying NMF to the source spectrogram estimates obtained as the outputs of the above source separation methods. Then the mixing matrix and the source spectrogram factors are updated jointly, and the spatial images of all sources are obtained, using the following optimized EM algorithm.

3.3. The Optimization EM Algorithm

Let $\theta = \{\mathbf{A}, \mathbf{W}, \mathbf{H}, \boldsymbol{\Sigma}_v\}$ be the set of all parameters, where $\mathbf{A}$ is the $J \times I$ matrix with entries $a_{ji}$, $\mathbf{W}$ is the $F \times K$ matrix with entries $w_{fk}$, $\mathbf{H}$ is the $K \times N$ matrix with entries $h_{kn}$, and $\boldsymbol{\Sigma}_{v,f}$ are the noise covariance parameters. We derive an optimized EM algorithm, in which the set of natural statistics $\{\mathbf{R}_{xx,f}, \mathbf{R}_{xs,f}, \mathbf{R}_{ss,f}, \{u_{k,fn}\}_{kn}\}_f$ is defined as follows:
$\mathbf{R}_{xx,f} = \frac{1}{N}\sum_n \mathbf{x}_{fn}\mathbf{x}_{fn}^H \qquad (23)$
$\mathbf{R}_{xs,f} = \frac{1}{N}\sum_n \mathbf{x}_{fn}\mathbf{s}_{fn}^H \qquad (24)$
$\mathbf{R}_{ss,f} = \frac{1}{N}\sum_n \mathbf{s}_{fn}\mathbf{s}_{fn}^H \qquad (25)$
$u_{k,fn} = |c_{k,fn}|^2 \qquad (26)$
We select the following minus log-likelihood (ML) criterion:
$C(\theta) = \sum_{fn}\left[\mathrm{trace}\!\left(\mathbf{x}_{fn}\mathbf{x}_{fn}^H\boldsymbol{\Sigma}_{x,fn}^{-1}\right) + \log\det\boldsymbol{\Sigma}_{x,fn}\right] \qquad (27)$
Then the mixing matrix, the noise covariance, and $\mathbf{W}_i$, $\mathbf{H}_i$ will be updated by using the following two-step iteration.
E-step: Conditional Expectations of Natural Statistics
The minimum mean square error estimates $\hat{\mathbf{s}}_{fn}$ of the source STFT coefficients are directly retrieved, and the spatial images of all source signals are obtained, by Wiener filtering, which is expressed as follows:
$\hat{\mathbf{s}}_{fn} = \boldsymbol{\Sigma}_{s,fn}\mathbf{A}^H\boldsymbol{\Sigma}_{x,fn}^{-1}\mathbf{x}_{fn} \qquad (28)$
and the component estimates are
$\hat{\mathbf{c}}_{fn} = \boldsymbol{\Sigma}_{c,fn}\mathbf{A}^H\boldsymbol{\Sigma}_{x,fn}^{-1}\mathbf{x}_{fn} \qquad (29)$
where
$\boldsymbol{\Sigma}_{x,fn} = \mathbf{A}\boldsymbol{\Sigma}_{s,fn}\mathbf{A}^H + \boldsymbol{\Sigma}_{v,f} \qquad (30)$
$\boldsymbol{\Sigma}_{s,fn} = \mathrm{diag}\!\left[\sum_{k \in K_i} w_{fk}h_{kn}\right]_i \qquad (31)$
$\boldsymbol{\Sigma}_{c,fn} = \mathrm{diag}\!\left[w_{fk}h_{kn}\right]_k \qquad (32)$
M-step: Update of Parameters
In the linear instantaneous mixture case, the mixing matrix is real-valued. Therefore, we obtain the updated mixing matrix
$\mathbf{A} = \mathrm{real}\!\left(\sum_f \hat{\mathbf{R}}_{xs,f}\right)\left[\mathrm{real}\!\left(\sum_f \hat{\mathbf{R}}_{ss,f}\right)\right]^{-1} \qquad (33)$
and
$\boldsymbol{\Sigma}_{v,f} = \mathrm{diag}\!\left(\hat{\mathbf{R}}_{xx,f} - \mathbf{A}\hat{\mathbf{R}}_{xs,f}^H - \hat{\mathbf{R}}_{xs,f}\mathbf{A}^H + \mathbf{A}\hat{\mathbf{R}}_{ss,f}\mathbf{A}^H\right) \qquad (34)$
$w_{fk} = \frac{1}{N}\sum_n \frac{\hat{u}_{k,fn}}{h_{kn}}, \qquad h_{kn} = \frac{1}{F}\sum_f \frac{\hat{u}_{k,fn}}{w_{fk}} \qquad (35)$
where
$\hat{\mathbf{R}}_{xx,f} = \mathbf{R}_{xx,f}, \qquad \hat{\mathbf{R}}_{xs,f} = \frac{1}{N}\sum_n \mathbf{x}_{fn}\hat{\mathbf{s}}_{fn}^H \qquad (36)$
$\hat{\mathbf{R}}_{ss,f} = \frac{1}{N}\sum_n\left[\hat{\mathbf{s}}_{fn}\hat{\mathbf{s}}_{fn}^H + \boldsymbol{\Sigma}_{s,fn} - \boldsymbol{\Sigma}_{s,fn}\mathbf{A}^H\boldsymbol{\Sigma}_{x,fn}^{-1}\mathbf{A}\boldsymbol{\Sigma}_{s,fn}\right] \qquad (37)$
$\hat{u}_{k,fn} = \left[\hat{\mathbf{c}}_{fn}\hat{\mathbf{c}}_{fn}^H + \boldsymbol{\Sigma}_{c,fn} - \boldsymbol{\Sigma}_{c,fn}\mathbf{A}^H\boldsymbol{\Sigma}_{x,fn}^{-1}\mathbf{A}\boldsymbol{\Sigma}_{c,fn}\right]_{kk} \qquad (38)$
• Normalize A, W, and H.
Finally, by the conservativity of the Wiener reconstruction, the spatial images of the estimated sources and the noise sum up to the original mixture in the STFT domain. The inverse STFT can then be used to transform them to the time domain, owing to the linearity of the STFT. The source separation algorithm for the linear mixture case is outlined in Algorithm 1.
Algorithm 1: Proposed Algorithm for Underdetermined Linear BSS.
 • Underdetermined Linear Mixture Case ( I > J )
Step 1. Estimate the mixing matrix A by using the time-domain tensor decomposition.
Step 2. Perform STFT on x ( t ) to get x f n .
Step 3. Estimate the sources using (20) and detect the source spectrogram factors employing
  the NMF method with (7).
Step 4. Initialize the updated matrix, the spectral basis, and temporal code, then update these
  parameters using EM algorithm. i.e.,
  repeat
   (i). Update A with (33) in the linear mixture case.
   (ii). Alternately update w f k and h k n with (35).
   until convergence
Step 5. Estimate s ^ f n by using Wiener filter of (28).
Step 6. Transform s ^ f n into time-domain to obtain s ( t ) through inverse STFT.
 • end
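To make Steps 4 and 5 of Algorithm 1 concrete, the sketch below performs one EM iteration of the linear instantaneous case as a direct, unoptimized transcription of (28)-(38). The array shapes, the use of the component-level mixing matrix $\mathbf{A}\mathbf{G}$, and the omission of the final normalization step are assumptions of this sketch rather than details prescribed by the paper.

```python
import numpy as np

def em_iteration(X, A, W, H, Sigma_v, G):
    """One EM iteration for the linear instantaneous model.
    X: (J, F, N) mixture STFT      A: (J, I) real mixing matrix
    W: (F, K) spectral bases       H: (K, N) temporal activations
    Sigma_v: (F, J) noise variances   G: (I, K) binary selection matrix"""
    J, F, N = X.shape
    I, K = G.shape
    Ac = A @ G                                   # (J, K) component-level mixing
    S_hat = np.zeros((I, F, N), dtype=complex)
    Rxx = np.zeros((F, J, J), dtype=complex)
    Rxs = np.zeros((F, J, I), dtype=complex)
    Rss = np.zeros((F, I, I), dtype=complex)
    U = np.zeros((K, F, N))

    # E-step: Wiener estimates (28)-(29) and natural statistics (36)-(38).
    for f in range(F):
        for n in range(N):
            sig_c = W[f] * H[:, n]                                 # Eq. (32)
            sig_s = G @ sig_c                                      # Eq. (31)
            Sigma_x = (A * sig_s) @ A.T + np.diag(Sigma_v[f])      # Eq. (30)
            Sx_inv = np.linalg.inv(Sigma_x)
            x = X[:, f, n]
            Ws = (sig_s[:, None] * A.T) @ Sx_inv                   # source Wiener filter
            Wc = (sig_c[:, None] * Ac.T) @ Sx_inv                  # component Wiener filter
            s_hat = Ws @ x                                         # Eq. (28)
            c_hat = Wc @ x                                         # Eq. (29)
            S_hat[:, f, n] = s_hat
            Rxx[f] += np.outer(x, x.conj()) / N
            Rxs[f] += np.outer(x, s_hat.conj()) / N                # Eq. (36)
            Rss[f] += (np.outer(s_hat, s_hat.conj())
                       + np.diag(sig_s) - Ws @ (A * sig_s)) / N    # Eq. (37)
            U[:, f, n] = np.real(np.abs(c_hat) ** 2 + sig_c
                                 - np.einsum('kj,jk->k', Wc, Ac * sig_c))  # Eq. (38)

    # M-step: update mixing matrix, noise variances, and NMF factors.
    A_new = np.real(Rxs.sum(0)) @ np.linalg.inv(np.real(Rss.sum(0)))       # Eq. (33)
    Sigma_v_new = np.stack([np.real(np.diag(
        Rxx[f] - A_new @ Rxs[f].conj().T - Rxs[f] @ A_new.T
        + A_new @ Rss[f] @ A_new.T)) for f in range(F)])                   # Eq. (34)
    W_new = np.mean(U / (H[:, None, :] + 1e-12), axis=2).T                 # Eq. (35)
    H_new = np.mean(U / (W.T[:, :, None] + 1e-12), axis=1)                 # Eq. (35)
    # (The normalization of A, W, and H mentioned in the M-step is omitted here.)
    return A_new, W_new, H_new, Sigma_v_new, S_hat

# Toy usage with random placeholder data (illustrating shapes only).
rng = np.random.default_rng(0)
J, I, K, F, N = 2, 3, 6, 65, 40
G = np.kron(np.eye(I), np.ones((1, K // I)))          # two components per source
X = rng.standard_normal((J, F, N)) + 1j * rng.standard_normal((J, F, N))
A0, W0, H0 = rng.standard_normal((J, I)), rng.random((F, K)) + 0.1, rng.random((K, N)) + 0.1
A1, W1, H1, Sv1, S_hat = em_iteration(X, A0, W0, H0, np.full((F, J), 0.1), G)
print(A1.shape, W1.shape, H1.shape)                   # (2, 3) (65, 6) (6, 40)
```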

3.4. Convolutive Mixed Sources Case

The derivation of the optimized EM algorithm for the convolutive model is more complex, since each mixing filter is a combination of delayed and attenuated propagation paths, so that the updated mixing matrix can no longer be expressed using (33) in the M-step. In the following, we consider the underdetermined multichannel convolutive mixture model, namely
$\mathbf{x}(t) = \sum_{l=0}^{L}\mathbf{A}(l)\,\mathbf{s}(t-l) + \mathbf{v}(t) \qquad (39)$
in which $\mathbf{A}(l)$ is the mixing system's impulse response matrix at time lag $l$, and $L$ denotes the maximum channel length. The convolutive mixture can then be decoupled into a series of linear instantaneous mixtures by applying the STFT on consecutive time windows. Therefore, (39) can be expressed as follows:
$\mathbf{x}_{fn} \approx \mathbf{A}_f\,\mathbf{s}_{fn} + \mathbf{v}_{fn} \qquad (40)$
where $\mathbf{A}_f$ is the frequency response of the mixing filter $\mathbf{A}(l)$ at frequency $f$, and $\mathbf{x}_{fn}$, $\mathbf{s}_{fn}$ are defined in the same way as in the linear instantaneous mixture model. In this case, the mixing matrix update (33) needs to be replaced by
$\mathbf{A}_f = \hat{\mathbf{R}}_{xs,f}\,\hat{\mathbf{R}}_{ss,f}^{-1} \qquad (41)$
In the convolutive mixture model, the mixing matrix is estimated in the Fourier domain. Therefore, the main difficulty is the need to deal with the permutation and scaling ambiguities. In our algorithm, the minimal distortion principle is used to compensate for the scaling ambiguity, and the K-means clustering algorithm is employed to deal with the frequency-dependent permutation ambiguity. Finally, the source separation algorithm for the convolutive mixture case is outlined in Algorithm 2.
Algorithm 2: Proposed Algorithm for Underdetermined Convolutive BSS.
 • Underdetermined Convolutive Mixture Case ( I > J )
Step 1. Perform STFT on x ( t ) to get x f n
Step 2. Estimate the mixing matrix A f by using frequency-domain tensor decomposition.
Step 3. Estimate the sources using (22), and detect the source spectrogram factors employing
  the NMF method with (7).
Step 4. Initialize the updated matrix, the spectral basis, and temporal code, then update these
  parameters using EM algorithm. i.e.,
  repeat
   (i). Update A f with (41) in the convolutive mixture case.
   (ii). Alternately update w f k and h k n with (35).
   until convergence
Step 5. Estimate s ^ f n by using Wiener filter of (28).
Step 6. Transform s ^ f n into time-domain to obtain s ( t ) through inverse STFT.
 • end
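Steps 1 and 6 of both algorithms rely on STFT analysis and overlap-add synthesis; a minimal sketch using SciPy with half-overlapping Hann windows (window lengths as in Table 1; parameter names follow `scipy.signal`, not the paper) is shown below. The per-frequency separation of (40)-(41) then operates on the slices `X[:, f, :]`.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
nperseg = 2048                    # 128 ms window for the convolutive datasets (Table 1)

# x: (J, T) time-domain mixture channels (placeholder random data here).
x = np.random.default_rng(0).standard_normal((2, 10 * fs))

# Step 1: STFT of every channel -> X[j, f, n], so that x_fn ~ A_f s_fn + v_fn per bin.
freqs, frames, X = stft(x, fs=fs, window='hann', nperseg=nperseg,
                        noverlap=nperseg // 2)
print(X.shape)                    # (J, F, N)

# ... per-frequency mixing-matrix estimation and Wiener filtering act on X[:, f, :] ...

# Step 6: back to the time domain with the inverse STFT (overlap-add synthesis).
_, x_rec = istft(X, fs=fs, window='hann', nperseg=nperseg, noverlap=nperseg // 2)
print(np.allclose(x_rec[:, :x.shape[1]], x, atol=1e-8))   # near-perfect reconstruction
```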

4. Experiments

In this section, all the simulation experiments are conducted on a computer with an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40 GHz and 32.00 GB memory under the Ubuntu 15.04 operating system, and the programs are implemented in Matlab R2016b.
First, we describe the test datasets and evaluation criteria, and proceed with experiments including the music mixture signals and speech mixture signals. Based on these criteria, we select two models for further study, namely the linear instantaneous mixture model and convolutive mixture model. Second, we compare the proposed algorithm with the baseline algorithms over synthetic reverberant speech/music mixtures and the real-world speech/music mixtures. Finally, numerous simulation examples are shown to illustrate the performance of our proposed algorithm.

4.1. Datasets

We consider four audio datasets, i.e., two synthetic stereo linear instantaneous mixtures (Dataset A and Dataset B) and two convolutive mixtures (Dataset C and Dataset D). In the linear instantaneous mixture case, Dataset A corresponds to the development dataset dev2 (dev2-wdrums-inst-mix) of the 2008 Signal Separation Evaluation Campaign "under-determined speech and music mixtures" task development datasets (SiSEC'08) (http://www.sisec.wiki.irisa.fr), which consists of one synthetic stereo mixture of three musical sources, including percussive drums. Dataset B comes from the development dataset dev1 (dev1-female3-inst-mix) of SiSEC'08, which consists of three speech sources.
In the convolutive mixture case, Dataset C comes from the music data with drums in dataset dev2 (dev2-wdrums-liverec-250ms-1m-mix), which has 250 ms of reverberation time with 1 m space between their microphones in the live real-recording environment. Dataset D is from the dataset dev1 (dev1-male3-synthconv-130ms-1m-mix) of the SiSEC’08 which has 130 ms of reverberation time with 1 m space between their microphones.

4.2. Source Signal Separation Evaluation Criteria

In order to evaluate our proposed algorithm for blind audio source separation, we use several objective performance criteria [29], which compare the reconstructed source images with the original ones. The numerical performance criteria are energy ratios, expressed in decibels (dB), computed from a decomposition of each estimated source image.
The criteria derive from the decomposition of an estimated source image as
$\hat{s}_{ij}^{img}(t) = s_{ij}^{img}(t) + e_{ij}^{spat}(t) + e_{ij}^{interf}(t) + e_{ij}^{artif}(t) \qquad (42)$
where $s_{ij}^{img}(t)$ is the true source image of source $i$ ($1 \leq i \leq I$) on channel $j$ ($1 \leq j \leq 2$), and $e_{ij}^{spat}(t)$, $e_{ij}^{interf}(t)$ and $e_{ij}^{artif}(t)$ are distinct error components representing spatial distortion, interference, and artifacts in channel $j$, respectively. Therefore, these criteria are defined as follows:
The Signal to Distortion Ratio (SDR)
$\mathrm{SDR}_i = 10\log_{10}\dfrac{\sum_{j=1}^{J}\sum_t \left(s_{ij}^{img}(t)\right)^2}{\sum_{j=1}^{J}\sum_t \left(e_{ij}^{spat}(t) + e_{ij}^{interf}(t) + e_{ij}^{artif}(t)\right)^2} \qquad (43)$
The Source Image to Spatial Distortion Ratio (ISR)
$\mathrm{ISR}_i = 10\log_{10}\dfrac{\sum_{j=1}^{J}\sum_t \left(s_{ij}^{img}(t)\right)^2}{\sum_{j=1}^{J}\sum_t \left(e_{ij}^{spat}(t)\right)^2} \qquad (44)$
The Source to Interference Ratio (SIR)
$\mathrm{SIR}_i = 10\log_{10}\dfrac{\sum_{j=1}^{J}\sum_t \left(s_{ij}^{img}(t) + e_{ij}^{spat}(t)\right)^2}{\sum_{j=1}^{J}\sum_t \left(e_{ij}^{interf}(t)\right)^2} \qquad (45)$
The Source to Artifacts Ratio (SAR)
$\mathrm{SAR}_i = 10\log_{10}\dfrac{\sum_{j=1}^{J}\sum_t \left(s_{ij}^{img}(t) + e_{ij}^{spat}(t) + e_{ij}^{interf}(t)\right)^2}{\sum_{j=1}^{J}\sum_t \left(e_{ij}^{artif}(t)\right)^2} \qquad (46)$
In our paper, we employ the above measures (SDR, ISR, SIR, SAR) to evaluate the performance of our proposed algorithm and to compare it with the baseline methods. Finally, a series of simulation results verifies the effectiveness of our proposed algorithm.
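Given the decomposition (42), the four ratios are straightforward to compute; the sketch below evaluates them directly from the error components for one source. In practice, the decomposition itself is produced by the BSS Eval toolbox of [29], which is not re-implemented here; the inputs in this sketch are placeholders.

```python
import numpy as np

def bss_eval_ratios(s_img, e_spat, e_interf, e_artif):
    """Compute SDR, ISR, SIR, SAR (in dB) for one source from the
    decomposition of Eq. (42); each input has shape (J, T) = (channels, samples)."""
    def db_ratio(num, den):
        return 10 * np.log10(np.sum(num**2) / np.sum(den**2))
    sdr = db_ratio(s_img, e_spat + e_interf + e_artif)            # Eq. (43)
    isr = db_ratio(s_img, e_spat)                                 # Eq. (44)
    sir = db_ratio(s_img + e_spat, e_interf)                      # Eq. (45)
    sar = db_ratio(s_img + e_spat + e_interf, e_artif)            # Eq. (46)
    return sdr, isr, sir, sar

# Toy usage with synthetic error components.
rng = np.random.default_rng(4)
s_img = rng.standard_normal((2, 16000))
errors = [0.1 * rng.standard_normal((2, 16000)) for _ in range(3)]
print(np.round(bss_eval_ratios(s_img, *errors), 2))
```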

4.3. Algorithm Parameters

The proposed algorithm will be compared with the EM and MU algorithms [16] and the full-rank algorithm [30]. In the linear instantaneous case, the initial values of the NMF parameters for the MU and EM algorithms are based on a mixing matrix estimate obtained with the method of Arberet et al. [31]. In the convolutive case, the initial values are based on a frequency-dependent complex-valued mixing matrix estimate [32]. To verify the effectiveness of our proposed algorithm, we employ the time-domain tensor decomposition to estimate the linear mixing matrix and the frequency-domain tensor decomposition to estimate the convolutive mixing matrix. Additionally, we build the initialization of the source spectrogram estimates $\mathbf{W}_i$ and $\mathbf{H}_i$ based on EU-NMF, KL-NMF, and IS-NMF, respectively. The initial values of the NMF parameters $\{w_{fk}, h_{kn}\}$, $k \in K_i$, of a given source $i$ are calculated by applying the NMF algorithm to the single-channel power spectrogram of that source.
Finally, we set the following parameters for our optimization algorithm: the number of components is $K_i = 4$ for every experiment. Furthermore, since the choice of the STFT window size and of the number of iterations is rather important, we use the STFT with half-overlapping sine windows (typically a Hanning window); these parameters are reported in Table 1.

4.4. Underdetermined BSS in the Linear Instantaneous Case and Convolutive Mixture Case

In the first place, we consider underdetermined music mixtures and speech mixtures in the linear case, and compare our proposed algorithms (Tensor-EU, Tensor-KL, Tensor-IS) with the baseline algorithms ($\ell_0$-min [10], EM [16], MU [16]). Additionally, we run the EM and MU algorithms from 100 different random initializations and take the average as the result for the tasks of underdetermined music and speech mixtures in the linear instantaneous case, respectively.
In the second place, we test the performance of our proposed algorithms in the realistic underdetermined convolutive mixture case. For example, the music mixtures are live recordings, which are more complicated than the synthetic convolutive case, and speech recorded in an indoor environment is often convolutive due to multipath reflections. We compare our proposed algorithms with the methods of [11,16,30]. In addition, since the separation results obtained with the EM and MU methods depend on the initial values, we conducted 100 trials with random initializations and took the average as the result.

4.4.1. Music Signal Mixtures in the Linear Instantaneous Case

In Dataset A, we first select the music signal mixtures in the linear case. The average SDR, ISR, SIR, and SAR are depicted in Figure 2 and Figure 3 based on the MU and EM with random initialization, MU [16], EM [16], $\ell_0$-min [10], and our proposed algorithms (Tensor-EU, Tensor-KL, Tensor-IS). Finally, the waveforms of the estimated sources are shown in Figure 4.

4.4.2. Speech Signal Mixtures in the Linear Instantaneous Case

In Dataset B, we select the speech signal mixtures in the linear instantaneous case. The average SDR, ISR, SIR, and SAR are depicted in Figure 5 and Figure 6 based on the MU and EM with random initialization, MU [16], EM [16], $\ell_0$-min [10], and our proposed algorithms (Tensor-EU, Tensor-KL, Tensor-IS), respectively.

4.4.3. Music Signal Mixtures in the Convolutive Case

In Dataset C, we select the real live recording convolutive dataset which consists of vocal and musical instrument with drum. The average SDR, ISR, SIR, and SAR are depicted in Figure 7 and Figure 8 based on the MU, EM with the random initialization, MU [16], EM [16], Full-rank [30], Bm [11] and our proposed algorithm (Tensor-EU, Tensor-KL, Tensor-IS), respectively. Finally, the waveforms of the estimated sources are shown in Figure 9.

4.4.4. Speech Signal Mixtures in the Convolutive Case

In Dataset D, we select the synthetic convolutive mixtures including three speech sources and two mixing channels. The average SDR, ISR, SIR, and SAR are depicted in Figure 10 and Figure 11 based on the MU, EM with the random initialization, MU [16], EM [16], Full-rank [30], Bm [11] and our proposed algorithm (Tensor-EU, Tensor-KL, Tensor-IS), respectively.
Discussion 1. According to the above experimental results on Dataset A, Dataset B, Dataset C, and Dataset D, it can be seen that our proposed algorithm can separate music signal mixtures and speech signal mixtures in both the underdetermined linear and convolutive cases. Moreover, according to the average source separation results, our proposed algorithm outperforms the baseline algorithms.

4.5. The Runtime of All Algorithms

The corresponding runtimes of the algorithms are shown in Table 2. It can be seen that the proposed algorithm takes more time than the MU and EM methods, mainly because of the time consumed in estimating the mixing matrix via tensor decomposition. However, compared with the full-rank algorithm, our proposed algorithm takes less time. Additionally, in terms of source separation results, the proposed algorithms exhibit better separation performance than the compared algorithms. In future work, it is still necessary to develop a better algorithm to reduce the time cost.

5. Conclusions and Future Work

In this paper, we proposed an optimized underdetermined multichannel BSS algorithm based on tensor decomposition and NMF. Because the EM method is very sensitive to the parameter initialization, we first estimated the mixing matrix employing tensor decomposition; meanwhile, the source spectrogram factors were estimated using the NMF source model, which yields a robust parameter initialization scheme. Then the model parameters were updated using the EM algorithm. The spatial images of all sources were obtained in the minimum mean square error sense by multichannel Wiener filtering, and the time-domain sources were obtained through the inverse STFT. Finally, a series of experimental results showed that our proposed optimization algorithm improves the separation performance compared with the baseline algorithms.
In addition, there are some aspects that deserve further study. Firstly, the estimation of the number of components of the NMF model is an open topic; several approaches have been proposed to address it, such as automatic order selection [33], information theoretic criteria [34], and N-way probabilistic clustering [35]. Secondly, the window length used in the STFT should be chosen to match the characteristics of the audio signals, and different window lengths have different effects on the separation results. Furthermore, taking source or microphone motion into account [36,37], i.e., convolutive mixtures whose source-to-microphone channels change over time, is a challenging problem. These problems will be the focus of our future work.

Author Contributions

Y.X. designed the experiment and drafted the manuscript; K.X. and J.Y. reviewed the experiment and manuscript; S.X. reviewed and refined the paper.

Funding

This work was partially supported by the National Natural Science Foundation of China (grants 613300032, 61773128).

Conflicts of Interest

The authors declare no conflicts of interest with respect to the research, authorship, and publication of this article.

References

  1. Wang, L.; Reiss, J.D.; Cavallaro, A. Over-determined source separation and localization using distributed microphones. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 1573–1588. [Google Scholar] [CrossRef]
  2. Loesch, B.; Yang, B. Adaptive segmentation and separation of determined convolutive mixtures under dynamic conditions. In Proceedings of the International Conference on Latent Variable Analysis and Signal Separation, St. Malo, France, 27–30 September 2010; pp. 41–48. [Google Scholar]
  3. Kitamura, D.; Ono, N.; Sawada, H.; Kameoka, H.; Saruwatari, H. Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 1622–1637. [Google Scholar] [CrossRef]
  4. Sawada, H.; Araki, S.; Makino, S. Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 516–527. [Google Scholar] [CrossRef]
  5. Cho, J.; Yoo, C.D. Underdetermined convolutive BSS: Bayes risk minimization based on a mixture of super-Gaussian posterior approximation. IEEE Press 2015, 23, 828–839. [Google Scholar] [CrossRef]
  6. Harshman, R.A. Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multi-model factor analysis. Ucla Work. Pap. Phon. 1970, 16, 1–84. [Google Scholar]
  7. Kolda, T.G.; Bader, B.W. Tensor decompositions and applications. Siam Rev. 2009, 51, 455–500. [Google Scholar] [CrossRef]
  8. Nion, D.; Mokios, K.N.; Sidiropoulos, N.D.; Potamianos, A. Batch and adaptive PARAFAC-based blind separation of convolutive speech mixtures. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 1193–1207. [Google Scholar] [CrossRef]
  9. Liavas, A.P.; Sidiropoulos, N.D. Parallel algorithms for constrained tensor factorization via alternating direction method of multipliers. IEEE Trans. Signal Process. 2015, 63, 5450–5463. [Google Scholar] [CrossRef]
  10. Vincent, E. Complex Nonconvex lp Norm Minimization for Underdetermined Source Separation. In Proceedings of the International Conference on Independent Component Analysis and Signal Separation, London, UK, 9–12 September 2007; pp. 430–437. [Google Scholar]
  11. Yilmaz, O.; Rickard, S. Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Signal Process. 2004, 52, 1830–1847. [Google Scholar] [CrossRef]
  12. Lee, D.D.; Seung, H.S. Learning the parts of objects by non-negative matrix factorization. Nature 1999, 401, 788–791. [Google Scholar] [CrossRef] [PubMed]
  13. Gillis, N.A.; Vavasis, S. Fast and robust recursive algorithms for separable nonnegative matrix factorization. IEEE Pattern Anal. Mach. Intell. 2014, 36, 698–714. [Google Scholar] [CrossRef] [PubMed]
  14. Févotte, C.; Bertin, N.; Durrieu, J.L. Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Comput. 2009, 21, 793. [Google Scholar] [CrossRef] [PubMed]
  15. Gao, B.; Woo, W.L.; Dlay, S.S. Variational regularized 2-D nonnegative matrix factorization. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 703–716. [Google Scholar] [PubMed]
  16. Ozerov, A.; Fevotte, C. Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 550–563. [Google Scholar] [CrossRef]
  17. Al-Tmeme, A.; Woo, W.L.; Dlay, S.S.; Gao, B. Underdetermined convolutive source separation using GEM-MU with variational approximated optimum model order NMF2D. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 35–49. [Google Scholar] [CrossRef]
  18. Dempster, A.P. Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. 1977, 39, 1–38. [Google Scholar]
  19. Kitamura, D.; Saruwatari, H.; Kameoka, H.; Yu, T.; Kondo, K.; Nakamura, S. Multichannel signal separation combining directional clustering and nonnegative matrix factorization with spectrogram restoration. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 654–669. [Google Scholar] [CrossRef]
  20. Sawada, H.; Kameoka, H.; Araki, S.; Ueda, N. Multichannel extensions of non-negative matrix factorization with complex-valued data. IEEE Trans. Audio Speech Lang. Process. 2013, 21, 971–982. [Google Scholar] [CrossRef]
  21. Neeser, F.D.; Massey, J.L. Proper complex random processes with applications to information theory. IEEE Trans. Inform. Theor. 2002, 39, 1293–1302. [Google Scholar] [CrossRef]
  22. Zhou, G.; Cichocki, A.; Zhao, Q.; Xie, S. Nonnegative matrix and tensor factorizations: An algorithmic perspective. IEEE Signal Process. Mag. 2009, 31, 54–65. [Google Scholar] [CrossRef]
  23. Cichocki, A.; Mandic, D.; Lathauwer, L.D.; Zhou, G.; Zhao, Q.; Caiafa, C.; Phan, H.A. Tensor decompositions for signal processing applications: From two-way to multiway component analysis. IEEE Signal Process. Mag. 2015, 32, 145–163. [Google Scholar] [CrossRef]
  24. Sidiropoulos, N.D.; Giannakis, G.B.; Bro, R. Blind parafac receivers for Ds-Cdma systems. IEEE Trans. Signal Process. 2000, 48, 810–823. [Google Scholar] [CrossRef]
  25. Rajih, M.; Comon, P. Enhanced Line Search: A novel method to accelerate Parafac. In Proceedings of the 13th European Signal Processing Conference, Antalya, Turkey, 4–8 September 2005; pp. 1–4. [Google Scholar]
  26. Nion, D.; Lathauwer, L.D. Line search computation of the block factor model for blind multi-user access in wireless communications. In Proceedings of the IEEE 7th Workshop on Signal Processing Advances in Wireless Communications, Cannes, France, 2–5 July 2006; pp. 1–4. [Google Scholar]
  27. Domanov, I.; De Lathauwer, L. An Enhanced Plane Search Scheme for Complex-Valued Tensor Decompositions. In Proceedings of the 16th Conference of the International Linear Algebra Society (ILAS), Pisa, Italy, 1 June 2010. [Google Scholar]
  28. De Lathauwer, L. A link between the canonical decomposition in multilinear algebra and simultaneous matrix diagonalization. SIAM J. Matrix Anal. Appl. 2006, 28, 642–666. [Google Scholar] [CrossRef]
  29. Vincent, E.; Sawada, H.; Bofill, P.; Makino, S.; Rosca, J.P. First stereo audio source separation evaluation campaign: Data, algorithms and results. In Proceedings of the International Conference on Independent Component Analysis and Signal Separation (ICA 2007), London, UK, 9–12 September 2007; pp. 552–559. [Google Scholar]
  30. Duong, N.Q.K.; Vincent, E. Under-Determined Reverberant Audio Source Separation Using a Full-Rank Spatial Covariance Model; IEEE Press: New York, NY, USA, 2010; pp. 1830–1840. [Google Scholar]
  31. Arberet, S. A robust method to count and locate audio sources in a stereophonic linear instantaneous mixture. In Proceedings of the International Conference on Independent Component Analysis and Blind Signal Separation, Charleston, SC, USA, 5–8 March 2006; pp. 536–543. [Google Scholar]
  32. O’Grady, P.D.; Pearlmutter, B.A. Soft-LOST: EM on a mixture of oriented lines. In Proceedings of the International Conference on Independent Component Analysis and Signal Separation, Granada, Spain, 22–24 September 2004; Volume 3195, pp. 430–436. [Google Scholar]
  33. Tan, V.Y.F.; Févotte, C. Automatic Relevance Determination in Nonnegative Matrix Factorization. In Proceedings of the Signal Processing with Adaptive Sparse Structured Representations, SPARS, St-Malo, France, 6–9 April 2009; Volume 35, pp. 1592–1605. [Google Scholar]
  34. Wax, M.; Kailath, T. Determining the number of signals by information theoretic criteria. In Proceedings of the IEEE International Conference on ICASSP Acoustics, Speech, and Signal Processing, San Diego, CA, USA, 19–21 March 1984; pp. 232–235. [Google Scholar]
  35. He, Z.; Cichocki, A.; Xie, S.; Choi, K. Detecting the number of clusters in n-way probabilistic clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 2006–2021. [Google Scholar] [PubMed]
  36. Nikunen, J.; Diment, A.; Virtanen, T. Separation of moving sound sources using multichannel NMF and acoustic tracking. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 26, 281–295. [Google Scholar] [CrossRef]
  37. Taseska, M.; Habets, E.A.P. Blind source separation of moving sources using sparsity-based source detection and tracking. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 26, 657–670. [Google Scholar] [CrossRef]
Figure 1. Block diagram of our proposed blind source separation algorithm.
Figure 2. The average SDR and ISR results in the linear music signal mixture case.
Figure 3. The average SIR and SAR results in the linear music signal mixture case.
Figure 4. A numerical example demonstrating that (a) Waveforms of music source signals with drum in the linear mixture case; (b) Waveforms of the mixture sources; (c) Waveforms of the estimated sources using MU algorithm for drum case [16]; (d) Waveforms of the estimated sources using EM algorithm for drum case [16]; (e) Waveforms of the estimated sources using $\ell_0$ minimization algorithm for drum case [10]; and (f) Waveforms of the estimated sources using our proposed algorithm (Tensor-IS) in the linear instantaneous mixture case.
Figure 5. The average SDR and ISR results in the linear speech signal mixture case.
Figure 6. The average SIR and SAR results in the linear speech signal mixture case.
Figure 7. The average SDR and ISR results in the convolutive music signal mixture case.
Figure 8. The average SIR and SAR results in the convolutive music signal mixture case.
Figure 9. A numerical example demonstrating that (a) Waveforms of music source signals with drum in the convolutive mixture case; (b) Waveforms of the mixture sources [16]; (c) Waveforms of the estimated sources using MU algorithm [16]; (d) Waveforms of the estimated sources using EM algorithm; (e) Waveforms of the estimated sources using binary masking algorithm [11]; and (f) Waveforms of the estimated sources using the proposed algorithm (Tensor-IS) in the convolutive mixture case.
Figure 10. The average SDR and ISR results in the convolutive speech signal mixture case.
Figure 11. The average SIR and SAR results in the convolutive speech signal mixture case.
Table 1. The parameter setting of all the algorithms.

Dataset | Window Length (Samples) | Window Length (ms) | Sampling Freq. (Hz) | Iterations
A-inst  | 1024 | 64  | 16,000 | 200
B-inst  | 1024 | 64  | 16,000 | 200
C-conv  | 2048 | 128 | 16,000 | 500
D-conv  | 2048 | 128 | 16,000 | 500
Table 2. The runtime of all algorithms (sec.).

Algorithm (Linear BSS Case) | Runtime  | Algorithm (Convolutive BSS Case) | Runtime
ℓ0-min [10]                 | 21.3982  | bm [11]                          | 38.5658
MU [16]                     | 65.5909  | MU [16]                          | 70.3608
EM [16]                     | 90.5785  | EM [16]                          | 182.2820
—                           | —        | Full-rank [30]                   | 346.5758
Proposed                    | 114.6651 | Proposed                         | 208.8797
