Unsupervised Learning for Monaural Source Separation Using Maximization–Minimization Algorithm with Time–Frequency Deconvolution

This paper presents an unsupervised learning algorithm for sparse nonnegative matrix factor time–frequency deconvolution with optimized fractional β-divergence. The β-divergence is a group of cost functions parametrized by a single parameter β. The Itakura–Saito divergence, Kullback–Leibler divergence and Least Square distance are special cases that correspond to β=0, 1, 2, respectively. This paper presents a generalized algorithm that uses a flexible range of β that includes fractional values. It describes a maximization–minimization (MM) algorithm leading to the development of a fast convergence multiplicative update algorithm with guaranteed convergence. The proposed model operates in the time–frequency domain and decomposes an information-bearing matrix into two-dimensional deconvolution of factor matrices that represent the spectral dictionary and temporal codes. The deconvolution process has been optimized to yield sparse temporal codes through maximizing the likelihood of the observations. The paper also presents a method to estimate the fractional β value. The method is demonstrated on separating audio mixtures recorded from a single channel. The paper shows that the extraction of the spectral dictionary and temporal codes is significantly more efficient by using the proposed algorithm and subsequently leads to better source separation performance. Experimental tests and comparisons with other factorization methods have been conducted to verify its efficacy.


Introduction
Blind source separation (BSS) [1][2][3][4][5][6][7][8] is an ill-posed problem that cannot be totally solved without some prior information. This entails a certain number of assumptions have to be imposed to render the  [2]), mutual statistical independence among the sources [3], the number of sources [4], how the sources are mixed (instantaneous [5] versus convolutive [6]), and the location of the sources with respect to the microphones. Several recent solutions have been developed to mitigate some of these constraints. In the work [7], it was previously shown that non-Gaussian stationary process can be approximated as non-stationary Gaussian process which enabled separation involving mixtures of non-Gaussian sources. Of similar concept, a method is proposed for separation by decorrelating multiple non-stationary stochastic sources using a multivariable crosstalk-resistant adaptive noise canceller [8]. In a related method, the problem of speech quality enhancement is tackled using adaptive and non-adaptive filtering algorithms [9]. A two-microphone Gauss-Seidel pseudo affine projection algorithm combined with forward blind source separation is proposed. A higher efficiency in speech enhancement in noisy environment has been attained. The paper [10] proposes rational polynomial functions to replace the original score functions used in standard independent component analysis (ICA). The rational polynomials are derived by the Pade approximant from Taylor series expansion of the original nonlinearities which can be quickly evaluated to enable large-scale multidimensional sets of data characterized by super-Gaussian distribution to be separated within a short period of time. Recently, a bi-variate empirical mode decomposition algorithm combined with complex ICA by entropy bound minimization technique is proposed for convolutive signal separation [11]. In telecommunication problems, neither the direction of arrival (DOA) nor a training sequence is assumed to be available at the receiver. The only assumption is that the transmitted signals satisfy the constant modulus property. In the work [12], a multistage space-time equalizer is proposed to blindly separate signals received by an antenna array from different sources simultaneously. In the algorithm, each stage consists of an adaptive beamformer, a DOA estimator and an equalizer which are jointly optimized using the constant modulus property of the sources. Other than statistical independence and non-Gaussianity, signal separation approach based on second-order statistics of the speech signals using canonical correlation approach [13] has also been proposed. The work [14] considers complex-valued mixing matrix estimation and direction-of-arrival estimation of synchronous orthogonal frequency hopping signals in the underdetermined blind source separation (UBSS). A mixing matrix estimation algorithm is proposed by detecting single source points where only one source contributes its power. While traditional algorithms are usually applied in the ideal sparse environment, the work [15] proposes a solution where multiple input multiple output mixed signals are insufficiently sparse in both time and frequency domains under noisy conditions. The work [16] demonstrates the application of UBSS in addresses the mixing of pipe abrasive debris problem and focuses on the superimposed abrasive debris separation of a radial magnetic field abrasive sensor. Through accurately separating and calculating the morphology and amount of the abrasive debris, the abrasive sensor has provided the system with wear trend and sizes estimation of the wear particles.
In recent years, an alternate class of solutions for BSS based on nonnegative matrix factorization (NMF) [17] has been proposed. Compared to ICA, NMF gives a more part based decomposition and the decomposition is unique under certain conditions, making it unnecessary to impose the constraints in the form of orthogonality and independence [18]. These properties have led to a significant interest in NMF lately for its application in areas of BSS [5,[19][20][21][22][23][24], pattern recognition [25], and dimensionality reduction [26]. Multiplicative update-based families of parameterized cost functions such as the Csiszar's divergences [27,28] were also presented. The NMF is a matrix decomposition technique. Let the data matrix V be a nonnegative matrix of dimensions I × J. The aim of NMF is to find two matrices W and H such that: or in scalar form, where i = 1, 2, ..., k = 1, 2, . . . , K, and j = 1, 2, . . . , J. When W and H are nonnegative matrices of dimensions I × K and K × J, then is usually chosen such that A sparseness constraint can be added to the cost function [26][27][28][29][30][31], and this can be achieved by regularization using the L 1 -norm leading to Sparse NMF (SNMF). Here, "sparseness" refers to a representational scheme where only a few units (out of a large population) are effectively used to represent typical data vectors. In effect, this implies most units taking values close to zero while only few take significantly non-zero values. Several other types of prior distribution over W and H can be defined, e.g., it is assumed that the prior of W and H satisfy the exponential density and the prior for the noise variance is chosen as an inverse gamma density [27]. In the work [28], Gaussian distributions are chosen for both W and H. The model parameters and hyper parameters are adapted by using the Markov chain Monte Carlo (MCMC) [32]. In all cases, a fully Bayesian treatment is applied to approximate inference for both model parameters and hyper parameters. While these approaches increase the accuracy of matrix factorization, it only works efficiently when a large sample dataset is available. Moreover, it consumes significantly high computational complexity at each iteration to adapt the parameters and its hyper parameters. The NMF with the β-divergence has been previously used in music signal processing [33,34]. In our previous paper [35], we investigated β-divergence for source separation problem. It was shown that improved performance has been attained over integer-based β-divergence. Thus, this motivates research of using β-divergence for music signal processing and source separation. However, all of these works fixed β to some constant values within 0-2, and have not presented any method to determine the desired β value. This significantly constrains the performance of matrix factorization and its ability in separating mixed sources. In addition, these works do not consider the issue of sparsity of the temporal codes which would undermine the quality of matrix factorization when the β value is inappropriately chosen. The selection of the β value should consider the sparseness constraint used in the cost function.
Regardless of the cost function and sparseness constraint being used, the standard NMF or SNMF models are only satisfactory for solving source separation provided that the spectral frequencies of the analyzed audio signal do not change over time. However, this is not the case for many realistic signals such as music and speech. As a result, the spectral dictionary obtained via the NMF or SNMF decomposition is not adequate to capture the temporal dependency of the frequency patterns within the signal. To remedy the situation, a pragmatic approach is to work on a more holistic model based on matrix factor deconvolution [21][22][23][24]. In this paper, we work with NMF model extended to two-dimensional time-frequency deconvolution of W and H where (W, H) are considered as the matrix factors [22]. Mathematically, this is expressed as where i and i represent the frequency and time index, respectively, k indicates the factor number, τ represents the temporal shift and φ is the frequency shift. The terms τ max and φ max are the maximum temporal and frequency shift, respectively. With this definition, both W τ i,k and H φ k,j have tensorial structures with dimension I × K × τ max and K × J × φ max , respectively. Thus, W τ i,k represents the τth-slice of the kth-spectral basis while H φ k,j represents the associated φth-slice of the kth-temporal code. The downward and rightward arrow signs denote the corresponding shifting direction of each column in W τ and each row in H φ by the amount indicated by τ and φ, respectively.
Model (4) represents both temporal structure and the pitch change which occur when an instrument plays different notes. In the log-frequency spectrogram, the pitch change corresponds to a displacement on the frequency axis. Where previous NMF methods needed one component to model each note for each instrument, Model (4) represents each instrument compactly by a single time-frequency profile convolved in both time and frequency by a time-pitch weight matrix. This model dramatically decreases the number of components needed to model various instruments and effectively solves the blind single channel source separation problem for certain classes of musical signals. When polyphonic music is modeled by factorizing the magnitude spectrogram with NMF, each instrument is modeled by an instantaneous frequency signature which can vary over time. However, the NMF requires multiple basis functions to represent tones with different pitch values. The two-dimensional time-frequency deconvolution model implicitly solves the problem of grouping notes. Thus, all notes for an instrument is an identical pitch shifted time-frequency signature, Model (4) will give better estimates of these signatures, because more examples of different notes are used to compute each time-frequency signature. In the event when this assumption does not hold, it might still hold in a region of notes for an instrument. Furthermore, the two-dimensional time-frequency deconvolution model can explain the spectral differences between two notes of different pitch by the two-dimensional deconvolution of the time-frequency signature.
The novelty of this paper can be summarized as follows: Firstly, a new algorithm is developed for sparse nonnegative matrix factor time-frequency deconvolution optimized with fractional β-divergence. Secondly, the maximization-minimization algorithm is developed to derive the auxiliary cost function which caters for any β value. The paper shows that the optimal β that leads to the desired performance is not necessarily limited to the special cases of integer β but extends to fractional values. Thirdly, it is analytically shown that the convergence of the proposed algorithm is guaranteed under the auxiliary function. Fourthly, a method is proposed to estimate the fractional β within the context of monoaural source separation. Finally, the paper proposes an adaptive method to estimate the sparsity parameter for each of the individual temporal code.
The remainder of the paper is organized as follows: In Section 2, the new algorithm for matrix factor time-frequency deconvolution model with β-divergence based on the maximization-minimization algorithmic framework is derived. Real application of blind source separation using the proposed method and comparisons with other matrix factorization methods are presented in Section 3. Finally, Section 4 concludes the paper.

β-Divergence Cost Function
The NMF problem can be written as the minimization of an objective function: The general β-divergence [24,31] is defined as: when β = 2, this matches with the first β-divergence and the update algorithm is referred to as the "Least Square" [17]. When we use the second β-divergence with β = 1, the update algorithm is referred to as the "Kullback-Leibler" [17]. When the third β-divergence with β = 0 is used, the update algorithm is referred to as the "Itakura-Saito" [33]. These algorithms have their own advantages and disadvantages. If the sources have large dynamic difference in the power, the Itakura-Saito divergence would have better performance than other NMF algorithms. The Least Square and Kullback-Leibler NMFs are more suited when the power of sources are close to other. However, it is difficult to define the difference of power between the sources, and therefore it is difficult to choose the algorithms.
In this paper, we present the results to show that the best results are not necessarily limited to the above integer β special cases. The use the fractional β-divergence is expected to yield more realistic and optimized results than the previous NMF algorithms. For completeness of presentation, the following section briefly reviews the update function based on the Least Square and Kullback-Leibler criterion.

Least Square Distance
The Least Square NMF algorithm introduced by Lee and Seung [17] defines the β-divergence as Least Squares divergence when β = 2. First, we consider the least square cost function: Differentiating C LS with respect to W τ i,k and H φ k,j , and plugging the multiplicative update algorithm where ∂d β (y|x)/∂θ = ∇d β (y|x) + − ∇d β (y|x) − , which leads to the following W τ and H φ updates: where "A·B" represents element-wise multiplication.

Kullback-Liebler Divergence
When β = 1, the β-divergence is identical to the Kullback-Leibler divergence. The Kullback-Leibler divergence is expressed as: By following similar steps as the Least Square, we can derive the update function as follow: where " A B " represents element-wise division and "1" is a column vector of unit elements.

Auxiliary Cost Function of Fractional β-Divergence for Matrix Factors Time-Frequency Deconvolution
In this subsection, we introduce the cost function for the fractional β-divergence matrix factors time-frequency deconvolution model. The algorithm allows the user to choose a fractional β value instead of using the previous NMF algorithms which constrain β to special cases of integer value. After the derivation, this paper shows the steps on how the update function of the fractional β-divergence is obtained for the parameters. Firstly, the first derivative of d β (y|x) are given by This shows that y is continuous in β and thus the second derivative of d β (y|x) is given by The second derivative shows that the β-divergence is convex for y in β ∈ [1, 2]. Outside of this range, d β (y|x) can be expressed as: whereď(y|x) is a convex function of y,d(y|x) is a concave function of y, and d(x) is a constant of y.
In Equation (16), G 1 (β) is convex for β ≥ 1 and concave for β < 1, and G 2 (β) is convex for β ≤ 2 and concave for β > 2. Thus, there is a need to alleviate this problem by decomposing the above function into several terms to be either convex or concave depending on the value of β and use the appropriate inequalities to build an auxiliary function.
The equality holds when Proof. Let f : R → R be a continuously differentiable and concave function. Then, for any point z, and with equality holds if and only if x = z. (17).
Using Lemmas 1 and 2, we may proceed with the following analysis. When β < 1, we use Q , then the cost function becomes a convex function.
Thus, the cost function can be shown to be bounded by the auxiliary function G + (θ|θ): instead of G 2 (β), then the cost function becomes a convex function and is bounded by the auxiliary function G + (θ|θ): Finally, when β > 2, we use P instead of G 2 (β), then the cost function is bounded by From above, we can conclude that The equality holds whenθ satisfies Equations (18) and (20). The above function yields three different sub-functions which depend on the β value. In different β range, we use different cost function in the algorithm. This allows the user to choose the optimal β value to separate the mixture and caters for more flexibility than the previous algorithms.

Auxiliary Update Function of "Fractional" β-Divergence
To minimize G + (θ|θ), we formulate the derivative of G + (θ|θ) with respect to θ. First, we consider the derivative for W τ i,k : The second derivative of G + (θ|θ) with respect to W τ i,k in then expressed as: where We can see G(θ|θ) is a convex function in W τ i,k , so by setting to 0, we can then express the update function for W τ i,k as: We next consider the auxiliary variablesθ. Since both Equations (17) and (18) minimize G + (θ|θ) with respect toθ, substituting these into Equation (30) gives the following update rule: where The above can be written in the matrix form: Similarly, for H φ k,j update function, first we have where Again, since both Equations (19) and (20) minimizes G + (θ|θ) with respect toθ, substituting these into Equation (38) gives the following update rule for H φ k,j : In matrix form, the above can be written as

Sparsity-Aware Optimization
The cost-function in Equation (21) can be augmented with a regularization term to render sparsity to the solution. We can define a prior on H as an exponential distribution with independent decay parameters, namely, It is worth pointing out that each individual element in H is constrained to an exponential distribution with independent decay parameter λ φ k,j so that each element in H can be driven to be optimally sparse in the L 1 -norm. Other forms of sparseness exist 19 but the proposed L 1 -norm is computationally favourable. First, we define G s (θ) and G s (θ|θ) as follow: (42) where α is the regularization constant. To avoid the scaling misbehavior when incorporating the sparseness for H, we reformulate the cost function to work with normalized matrix for W τ i.e., and Thus, the cost function takes the following form: where To obtain λ φ k,j , we minimize the cost function with respect λ φ k,j and set it to zero which results: provided that H φ k,j = 0. However, it has been observed in many cases that optimizing the factor matrices with β-divergence and the sparseness in Equation (46) increases the likelihood for some H φ k,j to converge very close to zero, thus leading to numerical divergence when dividing by zero. Other practices introduced a small constant to H φ k,j to prevent direct division by zero. Unfortunately, such approach is identical to constant sparsity and no longer preserves the L 1 -norm optimal solution. In this paper, we adopt the maximum likelihood approach to formulating the adaptive estimation of sparsity parameter λ φ k,j . Considering the following maximum likelihood criterion [31,36]: where ln p(v|λ,W) is the log-likelihood conditional probability of the observations givenW and λ. By using the Jensen's inequality, for any distribution Q(h), the log-likelihood function satisfies the following: Since each element of H is constrained to be exponential distributed with independent decay parameters, Equation (48) becomes: Thus, we have One can easily check that the distribution that maximizes the maximum likelihood is given by Q(h) = p(h|v, λ,W) = p(v|h, λ,W)p(h|λ)/p(v|λ,W) which is the posterior distribution of h and p(v|h, λ,W) is the log-likelihood function of the observation which is usually expressed by a Gaussian density function with mean centered at ∑ k,τ,φ W τ i−φ,k H φ k,j−τ . However, as H φ k,j is directly acquired from the original code matrix H 0 k,j , we can simply work with τ max = 0. This allows us to express the log-likelihood function of the posterior distribution of h up to a constant as ln p(h|v, λ,W) . = ln p(v|h, λ,W) + ln p(h|λ) =" denotes equality up to a constant, " " is the Kronecker product, 1 is vector contains unit elements, I is the identity matrix, α assumes the role of a regularization constant to balance the cost function fit and smoothness of H. For ease of presentation, we simplify the above terms 0T . . . λ φ max T T which enables us to rewrite Equation (46) as For ease of analysis, Q(h) is represented using Gibbs distribution as Q(h) = 1 Z exp(−F(h)) where Z = exp(−F(h))dh. Let P represents the index set of inactive code i.e., P = φ, k, j|H φ k,j = 0 and M the index set of active code i.e., M = φ, k, j|H φ k,j = 0 . Thus, Q(h) can be factorized as Since h M corresponds to the original non-zero value of h, it then follows that Q M (h M ) is not of interest to us. We are only interested in h P and therefore, to characterize Q P (h P ), we need to allow some positive deviation to h P . A suitable distribution is to use the factorized exponential distribution given byQ as the approximate distribution. The variational parameters u = u p are determined by minimizing the Kullback-Leibler divergence between true Q P and approximateQ P : which leads to the following optimization: where b P = (Ch −W T v + λ) P and C = C P + diag(C P ) with C =W TW , C P =W T PW P . Solving Equation (56) for u p leads to the following update: Once u p is obtained and re-arranged to the original form u φ k,j , the final update for λ φ k,j takes the form of: Equipped with above, we obtain the multiplicative update for the normalized W as for τ = 0, 1, . . . , τ max . By using the same approach, we can obtain the update for the sparse H as follows: for φ = 0, 1, . . . , φ max . In Equation (61), α assumes the role of a regularization constant to balance the cost function fit and smoothness of H. In this work, we set α ∈ [0.5, 1] which has been found to give satisfactory results.

Optimizing the Fractional β
To determine the optimal value for β, we perform the investigation from the source separation viewpoint. Mathematically, the single-channel signal separation (SCSS) [37][38][39] problem can be treated as one mixture of N unknown source signals: where t = 1, 2, . . . , T denotes time index and the goal is to estimate the sources x k (t), ∀k ∈ N of length T when only the observation signal y(t) is available. For simplicity, we consider only N = 2 sources in the mixture. We also use 50 different pieces of piano music, 50 different pieces of trumpet music and 50 different pieces of violin music from the RWC [40] database to generate different mixtures.
The signal-to-distortion (SDR) [41] is used to measure the performance. The SDR results and its corresponding β value that produces the best performance in the separation of mixtures are shown in Table 2. From these experiments, we can propose some general ideas of how to choose a suitable β value: (i) The mixtures from same type of music share similar β value, e.g., the best results of piano and trumpet mixture occur around β = 2 but the best results of piano and violin mixture occur around β = 1. (ii) If the power of one source is clearly weaker than the other source, then a smaller β value should be selected. (iii) When there is a large amount of overlap between the two sources in the in the time-frequency domain, a larger β value should be selected.  Table 2 strongly suggests that the β value depends on the mixture of original sources. Generally, it depends mainly on the two factors: (i) the weight of each source in the mixture; and (ii) the frequency spread of each source which the frequency band contains most weight of the signal. Firstly, we define weight of each source in the mixture by the following function: for k = 1, . . . , N. The term γ k is nonnegative and bounded to unity. It measures the dominance of k-th source in the mixture. The higher value of γ k , the greater the contribution from the k-th source is to the mixture. Secondly, we consider the separability of each source in the time-frequency domain by the following function: where ||·|| F is the Frobenius norm, X k (i, j) is the short-time Fourier Transform (STFT) of x k (t) with i representing the frequency bins and j the timeslot, and M k (i, j) is the binary mask Obtained from the k-th source as The function η k is also nonnegative and determines the degree separability of the signal in each frequency band. Based on the experiments conducted, both γ k and η k have an inverse relation to β. Thus, one possible empirical approach to determine β is proposed as follows: where ρ(n) is step size, ε 1 is a constant to weight the effects of η k and γ k . For example, in the experiments conducted, we have given more emphasis to γ k and set ε 1 = 1/3. The term ε 2 is a constant to control the value of β(n + 1) to ensure its value is bounded within an interval chosen by the user (for example, in the experiments conducted we have set ε 2 = 4 as normally does not exceed 4). Equation (66) is inserted into the update funtions in Equations (60) and (61) to update β at every iteration in conjunction with the update of W and H. In this case, β can be optimized based on the type of sources and the separation process. This enables the separation process to be fully automated and enables more accurate performance. In the case of SCSS, the sources are unknown and these are estimated from the mixture as:X and The expression in (69) is computed using the time-frequency deconvolution model. The main steps of the proposed algorithm have been summarized in Algorithm 1.

1.
Initialize W τ and H φ with non-negative random values.

10.
Repeat Steps 3-9 until it converges or reaches the pre-defined number of iteration.

Experiments, Results and Analysis
In this section, we conduct in-depth investigations of the proposed algorithm to analyze the impact of fixed and adaptive sparsity, the adaptive behavior of the sparsity parameter, and the analysis of fractional β-divergence. The analysis is necessary as the issue of sparsity of the temporal codes would undermine the quality of matrix factorization when the β value is inappropriately chosen. The selection of the β value should consider the sparseness constraint used in the cost function. In addition, the proposed algorithm based on matrix factor time-frequency deconvolution is also compared to conventional NMF models. This allows us to quantify the impacts of fractional β-divergence and sparsity behaviors when using the time-frequency deconvolution model.

Experimental Set-Up
To investigate the proposed method, we use the algorithm to separate several pieces of mixed music signals. Several experimental simulations under different conditions have been designed to investigate the efficacy of the proposed method. All simulations were performed using MATLAB as the programming platform and performed using a PC with dual core processor @ 2.4 GHz (i7 Intel processor) 8 GB RAM and 320 GB HDD. The tested signals are generated by mixing several music sources. The polyphonic music is 4 s long and the sampling frequency is 16 kHz. In this experiment, we randomly chose 50 different pieces of piano music, 50 different pieces of trumpet music and 50 different pieces of violin music from the RWC database to produce the different mixtures. The mixed signal was then generated by adding the chosen sources. In all cases, the sources were mixed with equal average power over the duration of the signals. The time-frequency (TF) representation was obtained by first normalizing the time-domain signal to unit power and then by computing the STFT using 2048 point Hanning window FFT with a 50% overlap. We evaluated our separation performance in terms of the signal-to-distortion ratio (SDR) which is one form of perceptual measure. This is a global measure that unifies source-to-interference ratio (SIR), source-to-artifacts ratio (SAR) and source-to-noise ratio (SNR). The definition and mathematical expression and MATLAB routines for computing these criteria can be obtained online [42].

Analysis of Adaptive and Fixed Sparsity
In this implementation, we conducted several experiments to compare the performance of the proposed method using different β values. Our aim was to investigate the impact of β value used in the separation. Figure 1 shows the time and TF domains of the original trumpet, piano music and its mixture. The TF domain is displayed using the log-frequency spectrogram. The trumpet and the piano play a different short melodic passage each consisting of three distinct notes. However, both trumpet and piano overlap in time, and the piano notes are interspersed in frequency with the trumpet notes. Hence, this is a challenging task for single channel separation which tests the impact of flexible β for matrix factorization.
in the separation. Figure 1 shows the time and TF domains of the original trumpet, piano music and its mixture. The TF domain is displayed using the log-frequency spectrogram. The trumpet and the piano play a different short melodic passage each consisting of three distinct notes. However, both trumpet and piano overlap in time, and the piano notes are interspersed in frequency with the trumpet notes. Hence, this is a challenging task for single channel separation which tests the impact of flexible for matrix factorization.   Figure 2 shows the estimation of bases W τ and temporal codes H φ when using different λ values. In Figure 2a, we set λ = 0 which renders non-sparse solution. There is obvious spreading of the estimated temporal codes, as shown in the red part of the figure. In Figure 2b, when λ = 0.1, there are some improvements over the spreading but they still exist in the red parts. Here, the sparseness is not strong enough and, as a result, the estimated mixture becomes under-sparse. In Figure 2c, when λ = 100, it is visibly shown in the blue parts that some information has been lost in the estimated temporal codes and the resulting estimated mixture becomes very noisy. Finally, Figure 2d shows the case where the sparseness parameters are adaptively and individually estimated using the prior information of H. The obtained result has shown that the estimated temporal codes are just appropriately sparse and, by visual inspection, the resulting estimated mixture retains all information, as evidenced by the musical notes, while the noise level has been kept small, which very closely resembles the original mixture.
temporal codes and the resulting estimated mixture becomes very noisy. Finally, Figure 2d shows the case where the sparseness parameters are adaptively and individually estimated using the prior information of . The obtained result has shown that the estimated temporal codes are just appropriately sparse and, by visual inspection, the resulting estimated mixture retains all information, as evidenced by the musical notes, while the noise level has been kept small, which very closely resembles the original mixture.

Adaptive Behavior of Sparsity Parameter
In this section, we use the piano and trumpet mixture, and fix value to 1 and the value ranges from 0 to 3 with each step of 0.1 to show the results. The adaptive behavior of the sparsity parameters using the proposed method is demonstrated. Figure 3 presents the convergence trajectory of four adaptive sparsity parameters, , , , , , and , , corresponding to their respective element codes, ℎ , , ℎ , , ℎ , and ℎ , . All sparsity parameters are initialized as = 1 . After 150 iterations, the above sparsity parameters converge to their steady-states. By examining Figure 3, it is noted that the converged steady-state values are significantly different for each sparsity parameter, e.g., , = 0.9, , = 0.18, , = 0.29 and , = 0.08 , even though they started at the same initial condition. This shows that each element code has its own sparseness.

Adaptive Behavior of Sparsity Parameter
In this section, we use the piano and trumpet mixture, and fix β value to 1 and the λ value ranges from 0 to 3 with each step of 0.1 to show the results. The adaptive behavior of the sparsity parameters using the proposed method is demonstrated. Figure 3 presents the convergence trajectory of four adaptive sparsity parameters, λ  15 . All sparsity parameters are initialized as λ = 1. After 150 iterations, the above sparsity parameters converge to their steady-states. By examining Figure 3, it is noted that the converged steady-state values are significantly different for each sparsity parameter, e.g., λ φ=0 1,1 = 0.9, λ φ=0 1,5 = 0.18, λ φ=0 1,10 = 0.29 and λ φ=0 1,15 = 0.08, even though they started at the same initial condition. This shows that each element code has its own sparseness.
In Figure 4, we compare the SDR results of using different λ values. In the figure, we can see the λ value that can get the best result changes with the mixture. For each different mixture, the best λ values are different. In Figure 4, we can see the best separation results of piano and trumpet mixture occurs near λ = 1, and the SDR = 14.7 dB. However, as λ increases, the SDR performance begins to deteriorate rapidly due to over-sparseness of the temporal code H φ . Sensors 2018, 18, x FOR PEER REVIEW 18 of 24 In Figure 4, we compare the SDR results of using different values. In the figure, we can see the value that can get the best result changes with the mixture. For each different mixture, the best values are different. In Figure 4, we can see the best separation results of piano and trumpet mixture occurs near = 1, and the SDR=14.7 dB. However, as increases, the SDR performance begins to deteriorate rapidly due to over-sparseness of the temporal code . Although the SDR performance in Figure 4 seems to suggest good performance, it may not necessary refer to the optimum setting. In fact, when is fixed to a constant value, the matrix factors deconvolution process may still be subjected to under-and over-fitting. In the previous sparsity algorithm, is fixed for the whole process and this sparsity may not necessarily suited for the whole signal. This calls for the need to allow each temporal code to have its own sparsity parameter. In the   In Figure 4, we compare the SDR results of using different values. In the figure, we can see the value that can get the best result changes with the mixture. For each different mixture, the best values are different. In Figure 4, we can see the best separation results of piano and trumpet mixture occurs near = 1, and the SDR=14.7 dB. However, as increases, the SDR performance begins to deteriorate rapidly due to over-sparseness of the temporal code . Although the SDR performance in Figure 4 seems to suggest good performance, it may not necessary refer to the optimum setting. In fact, when is fixed to a constant value, the matrix factors deconvolution process may still be subjected to under-and over-fitting. In the previous sparsity algorithm, is fixed for the whole process and this sparsity may not necessarily suited for the whole signal. This calls for the need to allow each temporal code to have its own sparsity parameter. In the  Although the SDR performance in Figure 4 seems to suggest good performance, it may not necessary refer to the optimum setting. In fact, when λ is fixed to a constant value, the matrix factors deconvolution process may still be subjected to under-and over-fitting. In the previous sparsity algorithm, λ is fixed for the whole process and this sparsity may not necessarily suited for the whole signal. This calls for the need to allow each temporal code to have its own sparsity parameter. In the adaptive sparsity algorithm in Equation (53), the sparsity parameter λ is updated alongside W τ and H φ in the process. Therefore, the sparsity parameter is optimized for each element of the temporal code. In addition, we plot the histogram of the converged adaptive sparsity parameters in Figure 5. The plot strongly suggests that the histogram can be represented as a bimodal distribution. We have used the Gaussian mixture model (GMM) [43] to learn the distribution of this histogram and the result adaptive sparsity algorithm in Equation (53), the sparsity parameter is updated alongside and in the process. Therefore, the sparsity parameter is optimized for each element of the temporal code. In addition, we plot the histogram of the converged adaptive sparsity parameters in Figure 5. The plot strongly suggests that the histogram can be represented as a bimodal distribution. We have used the Gaussian mixture model (GMM) [43] to learn the distribution of this histogram and the result produces two Gaussian distributions with mean 0.16 and 1.1. The global mean of the GMM is given by 0.92. With the GMM analysis, we can proceed to further investigate the assignment of sparsity parameters and compare them with the adaptive approach. We considered the following sparseness assignments: The SDR results are tabulated in Table 3 where we can see the separation results of all the six cases. The obtained results readily informed that the source separation with adaptive sparsity has rendered the best separation result.  With the GMM analysis, we can proceed to further investigate the assignment of sparsity parameters and compare them with the adaptive approach. We considered the following sparseness assignments: The SDR results are tabulated in Table 3 where we can see the separation results of all the six cases. The obtained results readily informed that the source separation with adaptive sparsity has rendered the best separation result.  Figure 6 shows the SDR values of the separation results using different values of β values. In this implementation, we fixed the sparsity parameter λ to 0, and the value of β ranges from 0 to 4, and each step is 0.1. In the figure, we can observe that the SDR value changes as the β value is increased. The best result is obtained at when β = 2.5, where SDR = 14.26 dB. The SDR value keeps increasing for values of β value within the range of 0 ≤ β < 2.5, and, after the best performance is attained, the performance deteriorates as β increases. In this figure, we can see that the best performance does not necessarily occur at value of β used other algorithms, i.e., β = 2, 1, 0. This means, if we choose the best β to carry out the separation, we can obtain better results than the other algorithms.  Figure 6 shows the SDR values of the separation results using different values of β values. In this implementation, we fixed the sparsity parameter to 0, and the value of β ranges from 0 to 4, and each step is 0.1. In the figure, we can observe that the SDR value changes as the value is increased. The best result is obtained at when = 2.5, where SDR = 14.26 dB. The SDR value keeps increasing for values of value within the range of 0 ≤ < 2.5, and, after the best performance is attained, the performance deteriorates as β increases. In this figure, we can see that the best performance does not necessarily occur at value of used other algorithms, i.e., = 2, 1, 0. This means, if we choose the best β to carry out the separation, we can obtain better results than the other algorithms. We also conducted several experiments to compare the performance of the proposed method with the non-sparse and normal sparse methods using different value. To investigate the impact of and value used in the separation, we used that ranges from 0 to 4 (with every increment of 0.1), and was set to several configurations, i.e., non-sparse, best fixed sparsity that is obtained in Figure 4 and the proposed adaptive sparsity. The results are plotted in Figure 7, which shows the mutual influence between and . It is noted that the SDR performance of non-sparse algorithm is the lowest of which its best result occurs when = 2.5 with SDR = 14.26 dB. The normal sparse algorithm using best value gives better performance than the non-sparse method with its best result occuring when = 0.5 with SDR = 15.96 dB. On the other hand, the proposed algorithm with adaptive delivers the best performance where occurs around 0.3 with SDR = 16.71 dB. It is also noted that the worst SDR performance given by = 2.9 with adaptive is still higher than the highest SDR when = 2.5 without sparsity optimization = 0.

Analysis of Fractional β-Divergence
All the above experiments used the pre-determined β that are fixed for the whole process. In this section, we present the results of experiments where is adaptively tuned alongside with the adaptation of , and . We applied this method to the source separation problem and compared the results with the situation where = 1 and = 2 corresponding to the Kullback-Leibler divergence and Least Square cost function, respectively. In the update of adaptive , the step size ( ) = 0.95 is selected which represents an exponential decay update process. The SDR results of the top five best performance are tabulated in Table 4. It is interesting to note from the table that the proposed algorithm with adaptive delivers better performance by about 2.1 dB compared to that when = 1 (Kullback-Leibler divergence) and 1.9 dB compared to that when = 2 (Least Square distance). The obtained performance improvement is attributed to the fact that the joint optimization of β with , and has enabled the current estimate of to better fit mixture of sources and thus rendered better source separation performance. We also conducted several experiments to compare the performance of the proposed method with the non-sparse and normal sparse methods using different β value. To investigate the impact of β and λ value used in the separation, we used β that ranges from 0 to 4 (with every increment of 0.1), and λ was set to several configurations, i.e., non-sparse, best fixed sparsity that is obtained in Figure 4 and the proposed adaptive sparsity. The results are plotted in Figure 7, which shows the mutual influence between λ and λ. It is noted that the SDR performance of non-sparse algorithm is the lowest of which its best result occurs when β = 2.5 with SDR = 14.26 dB. The normal sparse algorithm using best λ value gives better performance than the non-sparse method with its best result occuring when β = 0.5 with SDR = 15.96 dB. On the other hand, the proposed algorithm with adaptive λ delivers the best performance where β occurs around 0.3 with SDR = 16.71 dB. It is also noted that the worst SDR performance given by β = 2.9 with adaptive λ is still higher than the highest SDR when β = 2.5 without sparsity optimization λ = 0.
All the above experiments used the pre-determined β that are fixed for the whole process. In this section, we present the results of experiments where β is adaptively tuned alongside with the adaptation of W τ , H φ and λ φ . We applied this method to the source separation problem and compared the results with the situation where β = 1 and β = 2 corresponding to the Kullback-Leibler divergence and Least Square cost function, respectively. In the update of adaptive β, the step size ρ(n) = 0.95 n is selected which represents an exponential decay update process. The SDR results of the top five best performance are tabulated in Table 4. It is interesting to note from the table that the proposed algorithm with adaptive β delivers better performance by about 2.1 dB compared to that when β = 1 (Kullback-Leibler divergence) and 1.9 dB compared to that when β = 2 (Least Square distance). The obtained performance improvement is attributed to the fact that the joint optimization of β with W τ , H φ and λ φ has enabled the current estimate of β to better fit mixture of sources and thus rendered better source separation performance.

Comparison with Other Nonnegative Factorization Models
In this section, we compare the proposed algorithm with other signal separation algorithms, in both time-domain representation and analysis the SDR results of all algorithms. The signal chosen was the same piano and trumpet mixture music used in Section 3.4. We compared the Least Square NMF (NMF-LS) and Kullback-Leibler NMF (NMF-KLD) algorithms introduced in the earlier sections of this paper, NMF with temporal continuity and sparseness criteria (NMF-TCS) [23], and NMF with automatic relevance determination (NMF-ARD) [44]. The obtained results are summarized in Table  5.

Comparison with Other Nonnegative Factorization Models
In this section, we compare the proposed algorithm with other signal separation algorithms, in both time-domain representation and analysis the SDR results of all algorithms. The signal chosen was the same piano and trumpet mixture music used in Section 3.4. We compared the Least Square NMF (NMF-LS) and Kullback-Leibler NMF (NMF-KLD) algorithms introduced in the earlier sections of this paper, NMF with temporal continuity and sparseness criteria (NMF-TCS) [23], and NMF with automatic relevance determination (NMF-ARD) [44]. The obtained results are summarized in Table 5. When using the various NMF models, it is seen that the average improvement per source of about 3 dB has been gained by leveraging the fractional β value and adaptive sparsity. In addition, a step jump of approximately 5-8 dB in performance improvement is further obtained when the model is switched to the matrix factor time-frequency deconvolution. This is attributed to the latter model which represents both temporal structure and the pitch change which occur when an instrument plays different notes. The pitch change corresponds to a displacement on the frequency axis. Where NMF methods needed one component to model each note for each instrument, the time-frequency deconvolution model represents each instrument compactly by a single time-frequency profile convolved in both time and frequency by a time-pitch weight matrix [45]. The model dramatically decreases the number of components needed to model various instruments and effectively solves the blind single channel source separation problem for certain classes of musical signals.

Conclusions
This paper presents an adaptive fractional β-divergence with sparsity-aware optimization for non-negative factor time-frequency deconvolution algorithm. The impetus behind this work is that the previous β-divergence algorithms are all limited to special cases of β, and the previous sparsity methods are limited to a fixed sparsity parameter which are determined manually. Thus, these algorithms may not always produce the best results. In the proposed method, β is made adaptive and takes on fractional value. The sparsity parameter is also concurrently updated along with the estimation of β and model parameters. The convergence is theoretically proven for any β based on the auxiliary function method. This paper has shown that the proposed method is more general and can deliver better performance than other algorithms, as demonstrated using real audio recordings.