1. Introduction
Blind source separation (BSS) [1,2,3,4,5,6,7,8] is an ill-posed problem that cannot be fully solved without prior information. A number of assumptions must therefore be imposed to render the problem solvable, such as the channel type (linear [1] versus nonlinear [2]), mutual statistical independence among the sources [3], the number of sources [4], how the sources are mixed (instantaneous [5] versus convolutive [6]), and the location of the sources with respect to the microphones. Several recent solutions have been developed to relax some of these constraints. In [7], it was shown that a non-Gaussian stationary process can be approximated as a non-stationary Gaussian process, which enabled the separation of mixtures of non-Gaussian sources. In a similar vein, a method was proposed for separation by decorrelating multiple non-stationary stochastic sources using a multivariable crosstalk-resistant adaptive noise canceller [8]. In a related method, the problem of speech quality enhancement is tackled using adaptive and non-adaptive filtering algorithms [9]: a two-microphone Gauss–Seidel pseudo affine projection algorithm combined with forward blind source separation is proposed, and a higher efficiency in speech enhancement in noisy environments is attained. The paper [10] proposes rational polynomial functions to replace the original score functions used in standard independent component analysis (ICA). The rational polynomials are derived as Padé approximants from the Taylor series expansion of the original nonlinearities; because they can be evaluated quickly, large-scale multidimensional datasets characterized by super-Gaussian distributions can be separated within a short period of time. Recently, a bivariate empirical mode decomposition algorithm combined with complex ICA by entropy bound minimization was proposed for convolutive signal separation [11]. In telecommunication problems, neither the direction of arrival (DOA) nor a training sequence is assumed to be available at the receiver; the only assumption is that the transmitted signals satisfy the constant modulus property. In [12], a multistage space–time equalizer is proposed to blindly separate signals received simultaneously by an antenna array from different sources. Each stage of the algorithm consists of an adaptive beamformer, a DOA estimator and an equalizer, which are jointly optimized using the constant modulus property of the sources. Beyond statistical independence and non-Gaussianity, a signal separation approach based on the second-order statistics of speech signals using canonical correlation [13] has also been proposed. The work [14] considers complex-valued mixing matrix estimation and direction-of-arrival estimation of synchronous orthogonal frequency hopping signals in underdetermined blind source separation (UBSS). A mixing matrix estimation algorithm is proposed that detects single-source points, where only one source contributes power. While traditional algorithms are usually applied in an ideally sparse environment, the work [15] proposes a solution for the case where multiple-input multiple-output mixed signals are insufficiently sparse in both the time and frequency domains under noisy conditions. The work [16] applies UBSS to the problem of mixed pipe abrasive debris and focuses on separating the superimposed abrasive debris signals of a radial magnetic field abrasive sensor. By accurately separating the abrasive debris and calculating its morphology and amount, the abrasive sensor provides the system with the wear trend and size estimates of the wear particles.
In recent years, an alternative class of solutions for BSS based on nonnegative matrix factorization (NMF) [17] has been proposed. Compared with ICA, NMF gives a more parts-based decomposition, and the decomposition is unique under certain conditions, making it unnecessary to impose constraints in the form of orthogonality and independence [18]. These properties have led to significant interest in NMF for applications in BSS [5,19,20,21,22,23,24], pattern recognition [25], and dimensionality reduction [26]. Families of parameterized cost functions with multiplicative updates, such as Csiszár’s divergences [27,28], have also been presented. NMF is a matrix decomposition technique. Let the data matrix V be a nonnegative matrix of dimensions $I \times J$. The aim of NMF is to find two matrices W and H such that:

$$\mathbf{V} \approx \mathbf{W}\mathbf{H} \qquad (1)$$

or in scalar form,

$$V_{ij} \approx \sum_{d=1}^{D} W_{id} H_{dj} \qquad (2)$$

where $i = 1, 2, \ldots, I$, and $j = 1, 2, \ldots, J$. When W and H are nonnegative matrices of dimensions $I \times D$ and $D \times J$, then $D$ is usually chosen such that

$$(I + J)D < IJ \qquad (3)$$
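To make the factorization above concrete, the following minimal Python sketch factorizes a small nonnegative matrix with the classical multiplicative updates for the squared Euclidean cost. The helper names (`matmul`, `transpose`, `frobenius_error`, `nmf`), the toy matrix, the rank and the iteration count are assumptions made for this illustration only; the algorithm developed in this paper uses a different cost function.

```python
import random

def matmul(A, B):
    """Matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Squared Euclidean distance between V and the model WH."""
    WH = matmul(W, H)
    return sum((v - m) ** 2 for rv, rm in zip(V, WH) for v, m in zip(rv, rm))

def nmf(V, rank, n_iter=200, seed=0, eps=1e-12):
    """Factorize nonnegative V (I x J) into W (I x rank) and H (rank x J)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_rows)]
    H = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H

# Toy 4 x 5 matrix of rank 2; note that (4 + 5) * 2 < 4 * 5.
V = matmul([[1, 0], [0, 1], [1, 1], [2, 1]],
           [[1, 2, 0, 1, 3], [0, 1, 1, 2, 1]])
W, H = nmf(V, rank=2)
print(frobenius_error(V, W, H))
```

Because the updates only multiply by nonnegative ratios, W and H stay nonnegative as long as they are initialized with positive entries.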
A sparseness constraint can be added to the cost function [26,27,28,29,30,31]; this can be achieved by regularization using the L1-norm, leading to sparse NMF (SNMF). Here, “sparseness” refers to a representational scheme in which only a few units (out of a large population) are effectively used to represent typical data vectors. In effect, most units take values close to zero while only a few take significantly non-zero values. Several other types of prior distribution over W and H can be defined; e.g., the priors of W and H are assumed to follow an exponential density and the prior for the noise variance is chosen as an inverse gamma density [27]. In [28], Gaussian distributions are chosen for both W and H. The model parameters and hyperparameters are adapted using Markov chain Monte Carlo (MCMC) [32]. In all cases, a fully Bayesian treatment is applied to approximate inference for both model parameters and hyperparameters. While these approaches increase the accuracy of matrix factorization, they work efficiently only when a large sample dataset is available. Moreover, they incur a high computational cost at each iteration to adapt the parameters and hyperparameters. NMF with the β-divergence has previously been used in music signal processing [33,34]. In our previous paper [35], we investigated the β-divergence for the source separation problem. It was shown that improved performance was attained over the integer-based β-divergence. This motivates further research on the β-divergence for music signal processing and source separation. However, all of these works fixed β to some constant value within 0–2, and have not presented any method to determine the desired β value. This significantly constrains the performance of matrix factorization and its ability to separate mixed sources. In addition, these works do not consider the sparsity of the temporal codes, which undermines the quality of matrix factorization when the β value is inappropriately chosen. The selection of the β value should take into account the sparseness constraint used in the cost function.
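As a small illustration of the quantities discussed above, the sketch below evaluates the β-divergence in its commonly used element-wise form and adds an L1 penalty on the temporal codes. The function names `beta_divergence` and `sparse_cost`, the penalty weight `lam` and the sample values are assumptions for this example; it is not the cost function derived later in the paper, only the standard definition on which such cost functions are built.

```python
import math

def beta_divergence(x, y, beta):
    """Scalar beta-divergence d_beta(x | y) for x, y > 0."""
    if beta == 1:   # limit case: generalized Kullback-Leibler divergence
        return x * math.log(x / y) - x + y
    if beta == 0:   # limit case: Itakura-Saito divergence
        return x / y - math.log(x / y) - 1
    return (x ** beta + (beta - 1) * y ** beta
            - beta * x * y ** (beta - 1)) / (beta * (beta - 1))

def sparse_cost(V, V_hat, H, beta, lam):
    """Sum of element-wise beta-divergences plus the L1 penalty lam * sum(H)."""
    fit = sum(beta_divergence(v, m, beta)
              for rv, rm in zip(V, V_hat) for v, m in zip(rv, rm))
    penalty = lam * sum(h for row in H for h in row)  # H is nonnegative
    return fit + penalty

# The same pair (x, y) is penalized differently for integer and fractional beta.
for beta in (0, 0.5, 1, 1.5, 2):
    print(beta, beta_divergence(2.0, 3.0, beta))
```

With β = 2 the divergence reduces to half the squared error, while β = 1 and β = 0 give the Kullback–Leibler and Itakura–Saito divergences; fractional values interpolate between these cases.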
Regardless of the cost function and sparseness constraint used, the standard NMF or SNMF models are satisfactory for source separation only if the spectral frequencies of the analyzed audio signal do not change over time. However, this is not the case for many realistic signals such as music and speech. As a result, the spectral dictionary obtained via NMF or SNMF decomposition is not adequate to capture the temporal dependency of the frequency patterns within the signal. To remedy this, a pragmatic approach is to work with a more holistic model based on matrix factor deconvolution [21,22,23,24]. In this paper, we work with the NMF model extended to the two-dimensional time–frequency deconvolution of W and H, where (W, H) are considered as the matrix factors [22]. Mathematically, this is expressed as

$$\mathbf{V} \approx \sum_{d=1}^{D} \sum_{\tau=0}^{\tau_{\max}} \sum_{\phi=0}^{\phi_{\max}} \overset{\downarrow \phi}{\mathbf{W}_{d}^{\tau}} \; \overset{\rightarrow \tau}{\mathbf{H}_{d}^{\phi}}, \qquad V_{f,t} \approx \sum_{d=1}^{D} \sum_{\tau=0}^{\tau_{\max}} \sum_{\phi=0}^{\phi_{\max}} W_{f-\phi,d}^{\tau} \, H_{d,t-\tau}^{\phi} \qquad (4)$$

where $f$ and $t$ represent the frequency and time index, respectively, $d$ indicates the factor number, $\tau$ represents the temporal shift and $\phi$ is the frequency shift. The terms $\tau_{\max}$ and $\phi_{\max}$ are the maximum temporal and frequency shift, respectively. With this definition, both W and H have tensorial structures with dimensions $F \times D \times (\tau_{\max} + 1)$ and $D \times T \times (\phi_{\max} + 1)$, respectively, where $F$ and $T$ denote the numbers of frequency bins and time frames. Thus, $\mathbf{W}_{d}^{\tau}$ represents the $\tau$-th slice of the $d$-th spectral basis while $\mathbf{H}_{d}^{\phi}$ represents the associated $\phi$-th slice of the $d$-th temporal code. The downward and rightward arrow signs denote the corresponding shifting direction of each column in $\mathbf{W}^{\tau}$ and each row in $\mathbf{H}^{\phi}$ by the amount indicated by $\phi$ and $\tau$, respectively.
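The two-dimensional deconvolution described above can be sketched in a few lines of Python. The sketch assumes the element-wise form Λ[f][t] = Σ_d Σ_τ Σ_φ W^τ[f−φ][d] · H^φ[d][t−τ], with entries shifted past the matrix boundary treated as zero, as is common in the two-dimensional NMF deconvolution literature. The function name `reconstruct`, the index layout `W[tau][f][d]` and `H[phi][d][t]`, and the toy values are choices made for this illustration, not the notation or implementation of the paper.

```python
def reconstruct(W, H):
    """Two-dimensional deconvolution model.

    W[tau][f][d]: spectral bases, one F x D slice per temporal shift tau.
    H[phi][d][t]: temporal codes, one D x T slice per frequency shift phi.
    Returns the F x T model matrix as a nested list.
    """
    n_freq, n_factors = len(W[0]), len(W[0][0])
    n_time = len(H[0][0])
    model = [[0.0] * n_time for _ in range(n_freq)]
    for tau in range(len(W)):
        for phi in range(len(H)):
            # Rows f < phi and columns t < tau receive nothing: the shifted
            # factors are zero-filled there.
            for f in range(phi, n_freq):
                for t in range(tau, n_time):
                    model[f][t] += sum(W[tau][f - phi][d] * H[phi][d][t - tau]
                                       for d in range(n_factors))
    return model

# One factor, no temporal shift: the same spectral signature is placed at
# frequency bin 0 in frame 0 and, shifted down by one bin, in frame 1.
W = [[[1.0], [0.0], [0.0]]]             # one tau slice, F = 3, D = 1
H = [[[1.0, 0.0]], [[0.0, 1.0]]]        # two phi slices, D = 1, T = 2
print(reconstruct(W, H))
```

In this toy case a single spectral signature accounts for two "notes" at different frequency bins, which is the compactness argument made for the model.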
Model (4) represents both the temporal structure and the pitch change that occur when an instrument plays different notes. In the log-frequency spectrogram, a pitch change corresponds to a displacement along the frequency axis. Where previous NMF methods needed one component to model each note of each instrument, Model (4) represents each instrument compactly by a single time–frequency profile convolved in both time and frequency with a time–pitch weight matrix. This model dramatically decreases the number of components needed to model various instruments and effectively solves the blind single-channel source separation problem for certain classes of musical signals. When polyphonic music is modeled by factorizing the magnitude spectrogram with NMF, each instrument is modeled by an instantaneous frequency signature which can vary over time. However, NMF requires multiple basis functions to represent tones with different pitch values. The two-dimensional time–frequency deconvolution model implicitly solves the problem of grouping notes. If all notes of an instrument share an identical pitch-shifted time–frequency signature, Model (4) will give better estimates of these signatures, because more examples of different notes are used to compute each time–frequency signature. When this assumption does not hold over the full range of an instrument, it might still hold within a region of notes. Furthermore, the two-dimensional time–frequency deconvolution model can explain the spectral differences between two notes of different pitch through the two-dimensional deconvolution of the time–frequency signature.
The novelty of this paper can be summarized as follows. Firstly, a new algorithm is developed for sparse nonnegative matrix factor time–frequency deconvolution optimized with the fractional β-divergence. Secondly, the maximization–minimization algorithm is developed to derive an auxiliary cost function which caters for any β value. The paper shows that the optimal β that leads to the desired performance is not necessarily limited to the special cases of integer β but extends to fractional values. Thirdly, it is shown analytically that the convergence of the proposed algorithm is guaranteed under the auxiliary function. Fourthly, a method is proposed to estimate the fractional β within the context of monaural source separation. Finally, the paper proposes an adaptive method to estimate the sparsity parameter for each individual temporal code.
The remainder of the paper is organized as follows: In Section 2, the new algorithm for the matrix factor time–frequency deconvolution model with β-divergence, based on the maximization–minimization algorithmic framework, is derived. Real applications of blind source separation using the proposed method and comparisons with other matrix factorization methods are presented in Section 3. Finally, Section 4 concludes the paper.