Generalized Alpha-Beta Divergences and Their Application to Robust Nonnegative Matrix Factorization

Abstract: We propose a class of multiplicative algorithms for Nonnegative Matrix Factorization (NMF) which are robust with respect to noise and outliers. To achieve this, we formulate a new family of generalized divergences, referred to as the Alpha-Beta-divergences (AB-divergences), which are parameterized by two tuning parameters, alpha and beta, and smoothly connect the fundamental Alpha-, Beta- and Gamma-divergences. By adjusting these tuning parameters, we show that a wide range of standard and new divergences can be obtained. The corresponding learning algorithms for NMF are shown to integrate and generalize many existing ones, including the Lee-Seung, ISRA (Image Space Reconstruction Algorithm), EMML (Expectation Maximization Maximum Likelihood), Alpha-NMF, and Beta-NMF algorithms.


Introduction to NMF and Basic Multiplicative Algorithms for NMF
The Nonnegative Matrix Factorization (NMF) problem has been investigated by many researchers, e.g., Paatero and Tapper [36], and has gained popularity through the work of Lee and Seung [37,38]. Based on the argument that nonnegativity is important in human perception, they proposed simple algorithms (often called the Lee-Seung algorithms) for finding physically meaningful nonnegative representations of nonnegative signals and images [38].
The basic NMF problem can be stated as follows: Given a nonnegative data matrix Y = P = [p_it] ∈ R_+^{I×T} (with p_it ≥ 0, or equivalently P ≥ 0) and a reduced rank J (J ≤ min(I, T), typically J ≪ min(I, T)), find two nonnegative matrices A = [a_1, a_2, ..., a_J] ∈ R_+^{I×J} and X = B^T = [b_1, b_2, ..., b_J]^T ∈ R_+^{J×T} which factorize P as faithfully as possible, that is, P = AX + E = Q + E, where the matrix E ∈ R^{I×T} represents the approximation error. Since we usually operate on column vectors of matrices, for convenience we shall often use the matrix B = X^T instead of the matrix X.
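The problem setting above can be sketched numerically; a minimal NumPy sketch (all dimensions and variable names here are illustrative, not taken from the paper):

```python
import numpy as np

# Hypothetical dimensions: I observations of length T, inner rank J << min(I, T).
I, T, J = 20, 100, 4
rng = np.random.default_rng(0)

# Ground-truth nonnegative factors (for illustration only).
A_true = rng.uniform(0.0, 1.0, size=(I, J))
X_true = rng.uniform(0.0, 1.0, size=(J, T))

# Data matrix P = AX + E; here E = 0 (noiseless case).
P = A_true @ X_true

# P is nonnegative of rank at most J; NMF seeks A, X >= 0 with P ~= AX.
assert P.shape == (I, T) and np.all(P >= 0)
```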
The factors A and X may have different physical meanings in different applications: in Blind Source Separation (BSS), A denotes the mixing matrix and X the source signals; in clustering problems, A is the basis matrix and X is a weight matrix; in acoustic analysis, A represents the basis patterns, while each row of X corresponds to the activation of a sound pattern [7].
Standard NMF only assumes nonnegativity of factor matrices A and X. Unlike blind source separation methods based on independent component analysis (ICA), here we do not assume that the sources are independent, although we can impose some additional constraints such as smoothness, sparseness or orthogonality of A and/or X [7].
Although the NMF can be applied to the BSS problems for nonnegative sources and nonnegative mixing matrices, its application is not limited to the BSS and it can be used in various and diverse applications far beyond BSS [7].
In NMF, our aim is to find the entries of the nonnegative matrices A and X assuming that a data matrix P is known: P ≅ AX = Q, or equivalently in a scalar form: p_it ≅ q_it = Σ_{j=1}^{J} a_ij x_jt. In order to estimate the nonnegative factor matrices A and X in the standard NMF, we need to consider similarity measures to quantify the difference between the data matrix P and the approximative NMF model matrix P̂ = Q = AX. The choice of the similarity measure (also referred to as distance, divergence or measure of dissimilarity) mostly depends on the probability distribution of the estimated signals or components, on the structure of the data, and on the distribution of the noise.
The best known and most frequently used adaptive multiplicative algorithms for NMF are based on two loss functions: the squared Euclidean distance and the generalized Kullback-Leibler divergence, also called the I-divergence.
The squared Euclidean distance is based on the Frobenius norm: D_E(P||Q) = (1/2) ||P − AX||_F^2 = (1/2) Σ_it (p_it − q_it)^2, (4) which is optimal for additive Gaussian noise [7]. It should be noted that the above cost function is convex with respect to either the elements of matrix A or matrix X, but not both.
Remark: Although the NMF optimization problem is not convex, the objective functions are separately convex in each of the two factors A and X, which implies that finding the optimal factor matrix A corresponding to a fixed matrix X reduces to a convex optimization problem, and vice versa. However, the convexity is lost as soon as we try to optimize the factor matrices simultaneously [39]. Using a gradient descent approach for the cost function (4) and switching alternately between the two sets of parameters, we obtain the simple multiplicative update formulas x_jt ← x_jt [A^T P]_jt / ([A^T Q]_jt + ε) and a_ij ← a_ij [P X^T]_ij / ([Q X^T]_ij + ε) (Equations (5)-(6); see Section 3 for a derivation of the algorithms in a more general form), where a small positive constant ε prevents division by zero, P = Y and Q = AX. The above algorithm (5)-(6), often called the Lee-Seung NMF algorithm, can be considered as a natural extension of the well known ISRA algorithm proposed first by Daube-Witherspoon and Muehllehner [40] and investigated extensively by De Pierro and Byrne [41][42][43][44][45].
The above update rules can be written in the compact matrix form X ← X ⊛ (A^T P) ⊘ (A^T Q + ε), (8) and analogously A ← A ⊛ (P X^T) ⊘ (Q X^T + ε), where ⊛ is the Hadamard (component-wise) product and ⊘ is element-wise division between two matrices. In practice, the columns of the matrix A should be normalized to the unit ℓ_p-norm (typically, p = 1).
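The Lee-Seung/ISRA updates for the Frobenius cost translate directly into code; a minimal NumPy sketch (the function name, initialization, and default settings are our own, not the paper's reference implementation):

```python
import numpy as np

def isra_nmf(P, J, n_iter=300, eps=1e-9, seed=0):
    """Multiplicative (Lee-Seung/ISRA-type) updates for ||P - AX||_F^2:

    X <- X * (A^T P) / (A^T A X + eps)
    A <- A * (P X^T) / (A X X^T + eps)
    """
    rng = np.random.default_rng(seed)
    I, T = P.shape
    A = rng.uniform(0.1, 1.0, size=(I, J))
    X = rng.uniform(0.1, 1.0, size=(J, T))
    for _ in range(n_iter):
        X *= (A.T @ P) / (A.T @ A @ X + eps)
        A *= (P @ X.T) / (A @ X @ X.T + eps)
        # Normalize columns of A to unit l1-norm; move the scale into X.
        scale = A.sum(axis=0, keepdims=True) + eps
        A /= scale
        X *= scale.T
    return A, X
```

On exactly low-rank nonnegative data, the reconstruction error typically drops to a small fraction of the data norm within a few hundred iterations.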
The original ISRA algorithm is relatively slow, and several heuristic approaches have been proposed to speed it up. For example, a relaxation approach raises the multiplicative coefficients to some power w ∈ (0, 2], that is, X ← X ⊛ [(A^T P) ⊘ (A^T Q + ε)]^{.[w]} [7,47], in order to achieve faster convergence. Another frequently used cost function for NMF is the generalized Kullback-Leibler divergence (also called the I-divergence) [38]: D_I(P||Q) = Σ_it ( p_it ln(p_it/q_it) − p_it + q_it ). (11) Similar to the squared Euclidean cost function, the I-divergence is convex with respect to either A or X, but it is not generally convex with respect to A and X jointly, so the minimization of such a cost function can yield many local minima.
By minimizing the cost function (11) subject to the nonnegativity constraints, we can easily derive the following multiplicative learning rule, referred to as the EMML algorithm (Expectation Maximization Maximum Likelihood) [44,45]: x_jt ← x_jt (Σ_i a_ij (p_it/q_it)) / (Σ_i a_ij) and a_ij ← a_ij (Σ_t x_jt (p_it/q_it)) / (Σ_t x_jt). The EMML algorithm is sometimes called the Richardson-Lucy algorithm (RLA) or simply the EM algorithm [48][49][50]. In fact, the EMML algorithm was developed for a fixed and known A. The Lee-Seung algorithm (which is in fact the EMML algorithm) based on the I-divergence employs alternating switching between A and X [37,38]. We shall derive the above algorithm in a much more general and flexible form in Section 3.
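The EMML (I-divergence) updates can be sketched analogously; a hedged NumPy illustration (helper names and settings are ours). Each alternating half-step is a standard multiplicative KL update, so the I-divergence is non-increasing over iterations:

```python
import numpy as np

def i_divergence(P, Q):
    """Generalized KL (I-) divergence: sum p ln(p/q) - p + q."""
    return float(np.sum(P * np.log(P / Q) - P + Q))

def emml_update(P, A, X, eps=1e-12):
    """One alternating EMML step for the I-divergence:

    x_jt <- x_jt * sum_i a_ij (p_it/q_it) / sum_i a_ij
    a_ij <- a_ij * sum_t (p_it/q_it) x_jt / sum_t x_jt
    """
    Q = A @ X + eps
    X = X * (A.T @ (P / Q)) / (A.sum(axis=0)[:, None] + eps)
    Q = A @ X + eps
    A = A * ((P / Q) @ X.T) / (X.sum(axis=1)[None, :] + eps)
    return A, X
```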
To accelerate the convergence of the EMML, we can apply the following heuristic extension of the EMML algorithm [8,51]: x_jt ← x_jt ( Σ_i a_ij (p_it/q_it) / Σ_{i'} a_{i'j} )^w (14) and a_ij ← a_ij ( Σ_t x_jt (p_it/q_it) / Σ_{t'} x_{jt'} )^w (15), where the positive relaxation parameter w helps to improve the convergence (see Section 3.3).
In an enhanced form that reduces the computational cost, the denominators in (14) and (15) can be ignored by normalizing a_j to unit ℓ1-norm. One of the objectives of this paper is to develop generalized multiplicative NMF algorithms which are robust with respect to noise and/or outliers and which integrate (combine) both the ISRA and the EMML algorithms into a more general and flexible one.

The Alpha-Beta Divergences
For positive measures P and Q, consider the following new dissimilarity measure, which we shall refer to as the AB-divergence: D_AB^{(α,β)}(P||Q) = −(1/(αβ)) Σ_it ( p_it^α q_it^β − (α/(α+β)) p_it^{α+β} − (β/(α+β)) q_it^{α+β} ), (19) for α ≠ 0, β ≠ 0, α + β ≠ 0, or equivalently in terms of the parameters (α, λ) with λ = α + β, for α ≠ 0, λ ≠ 0, α ≠ λ. Note that Equation (19) is a divergence since the following relationship holds: D_AB^{(α,β)}(P||Q) ≥ 0 for α ≠ 0, β ≠ 0, α + β ≠ 0, (21) with equality holding if and only if p_it = q_it for all i, t. In Appendix A, we show that Equation (21) is a combination of three different inequalities of Young's type, each one holding true for a different combination of the signs of the constants αβ, α(α + β) and β(α + β).
In order to avoid indeterminacy or singularity for certain values of the parameters, the AB-divergence can be extended by continuity (by applying l'Hôpital's rule) to cover all values of α, β ∈ R, so that the AB-divergence can be expressed in a more explicit form as a sum of element-wise terms d^{(α,β)}(p_it, q_it), defined separately for the singular parameter combinations.
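For the generic case α ≠ 0, β ≠ 0, α + β ≠ 0, the AB-divergence can be evaluated directly; a small NumPy sketch (the helper is our own). At (α, β) = (1, 1) the expression reduces to half the squared Frobenius distance, which provides a quick sanity check:

```python
import numpy as np

def ab_divergence(P, Q, alpha, beta):
    """AB-divergence, generic case (alpha, beta, alpha+beta all nonzero):

    D = -1/(a*b) * sum( p^a q^b - a/(a+b) p^(a+b) - b/(a+b) q^(a+b) )
    """
    lam = alpha + beta
    term = (P**alpha * Q**beta
            - (alpha / lam) * P**lam
            - (beta / lam) * Q**lam)
    return float(-term.sum() / (alpha * beta))
```

For P = Q every element-wise term vanishes, so the divergence is exactly zero, and it is nonnegative for generic (α, β), in agreement with inequality (21).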

Special Cases of the AB-Divergence
We shall now illustrate that a suitable choice of the (α, β) parameters simplifies the AB-divergence into some existing divergences, including the well-known Alpha-and Beta-divergences [4,7,19,35].
For both α → 0 and β → 0, the AB-divergence converges to the Log-Euclidean distance, defined as D(P||Q) = (1/2) Σ_it (ln p_it − ln q_it)^2.
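This limit can be checked numerically: for small equal parameters α = β, the generic AB-divergence formula approaches the Log-Euclidean distance (a self-contained sketch with illustrative values):

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.uniform(0.5, 2.0, size=(4, 5))
Q = rng.uniform(0.5, 2.0, size=(4, 5))

def ab_div(P, Q, a, b):
    # Generic AB-divergence (a, b, a+b all nonzero).
    lam = a + b
    return float(-(P**a * Q**b - (a/lam)*P**lam - (b/lam)*Q**lam).sum() / (a*b))

# Log-Euclidean distance: half the summed squared log-ratios.
log_euclid = 0.5 * ((np.log(P) - np.log(Q)) ** 2).sum()
# AB-divergence with alpha = beta = 1e-5 should be very close to it.
approx = ab_div(P, Q, 1e-5, 1e-5)
```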

Properties of AB-Divergence: Duality, Inversion and Scaling
Let us denote by P^{.[r]} the one-to-one transformation that raises each positive element of the matrix P to the power r, i.e., each entry p_it is mapped to p_it^r. According to the definition of the AB-divergence, we can easily check that it satisfies the duality property D_AB^{(α,β)}(P||Q) = D_AB^{(β,α)}(Q||P) (see Figure 1) and the inversion property D_AB^{(α,β)}(P||Q) = D_AB^{(−α,−β)}(P^{.[−1]}||Q^{.[−1]}). The above properties can be considered as particular cases of the scaling property of the parameters α and β by a common factor ω ∈ R \ {0}: the divergence whose parameters have been rescaled is proportional to the original divergence with both arguments raised to the common power, i.e., D_AB^{(ωα,ωβ)}(P||Q) = (1/ω^2) D_AB^{(α,β)}(P^{.[ω]}||Q^{.[ω]}). The scaling of α and β by a common factor ω < 1 can be seen as a 'zoom-in' on the arguments P and Q. This zoom gives more relevance to the smaller values over the larger values. On the other hand, ω > 1 yields a 'zoom-out' effect, where the smaller values decrease in relevance at the expense of the larger values, whose relevance is increased (see Figure 1). Scaling the arguments of the AB-divergence by a positive factor c > 0 yields the relation D_AB^{(α,β)}(cP||cQ) = c^{α+β} D_AB^{(α,β)}(P||Q). These basic properties imply that, whenever α ≠ 0, we can rewrite the AB-divergence in terms of a (β/α)-order Beta-divergence combined with an α-zoom of its arguments. Similarly, for α + β ≠ 0, the AB-divergence can also be expressed in terms of an α/(α+β)-order Alpha-divergence with an (α + β)-zoom of the arguments (see Figure 2), illustrating that the AB-divergence equals an α/(α+β)-order Alpha-divergence with an (α + β)-zoom-in of the arguments. On the other hand, D_AB^{(α1,β1)} can be seen as an Alpha-divergence of order α1/(α1 + β1) with a zoom-out of the arguments (with α1 + β1 > 1), while D_AB^{(α2,β2)} can be seen as an Alpha-divergence of order α2/(α2 + β2) with a zoom-in of its arguments (with α2 + β2 < 1) (see Figure 2).
When α ≠ 0 and β = 0, the AB-divergence can be expressed in terms of the Kullback-Leibler divergence with an α-zoom of its arguments. When α + β = 0 with α ≠ 0 and β ≠ 0, the AB-divergence can also be expressed in terms of a generalized Itakura-Saito distance with an α-zoom of the arguments. Remark: in this case (α + β = 0) the scaling relation gives c^{α+β} = 1, so the divergence is scale-invariant, i.e., D_AB^{(α,−α)}(cP||cQ) = D_AB^{(α,−α)}(P||Q) for any c > 0. Note that, from the AB-divergence, a more general scale-invariant divergence can be obtained by monotonic nonlinear transformations, leading to the scale-invariant divergence (49), which generalizes a family of Gamma-divergences [35].
The divergence (49) is scale-invariant in a more general sense, holding for any positive scale factors c1 and c2 applied to P and Q, respectively.

Why is AB-Divergence Potentially Robust?
To illustrate the role of the hyperparameters α and β in the robustness of the AB-divergence with respect to errors and noise, we shall compare the behavior of the AB-divergence with that of the standard Kullback-Leibler divergence. We shall assume, without loss of generality, that the proposed factorization model Q (for given noisy P) is a function of the vector of parameters θ and that each of its elements q_it(θ) > 0 is strictly positive for a certain range of parameters Θ.
The estimator θ̂ obtained for the Kullback-Leibler divergence between two discrete positive measures P and Q is a solution of the estimating equation Σ_it (∂q_it/∂θ)(p_it/q_it − 1) = 0, while, for the Beta-divergence, the estimator solves Σ_it (∂q_it/∂θ) q_it^β (p_it/q_it − 1) = 0. The main difference between the two equations lies in the weighting factors q_it^β for the Beta-divergence, which are controlled by the parameter β. In the context of probability distributions, these weighting factors control the influence of the likelihood ratios p_it/q_it. It has been shown [55,56] that the parameter β determines a tradeoff between robustness to outliers (for β > 0) and efficiency (for β near 0). In the special case of β = 1, the Euclidean distance is obtained, which is known to be more robust and less efficient than the Kullback-Leibler divergence (β = 0).
On the other hand, for the Alpha-divergence, the estimating equation takes a different form: Σ_it (∂q_it/∂θ) ln_{1−α}(p_it/q_it) = 0. In this case, the influence of the individual ratios p_it/q_it is controlled not by weighting factors but by the deformed logarithm of order 1 − α. This feature can be interpreted as a zoom or over-weighting of the interesting details of the likelihood ratio. For α > 1 (a zoom-out), we emphasize the relative importance of larger values of the ratio p_it/q_it, whereas for α < 1 (a zoom-in), we put more emphasis on smaller values of p_it/q_it (see Figure 3). The major consequence is the inclusive (α → ∞) and exclusive (α → −∞) behavior of the Alpha-divergence discussed in [52]. The estimating equation for the Alpha-Beta divergence combines both effects, Σ_it (∂q_it/∂θ) q_it^{λ−1} ln_{1−α}(p_it/q_it) = 0 with λ = α + β, and is therefore much more flexible and powerful regarding insensitivity to error and noise. As illustrated in Figure 3, depending on the value of α, we can zoom-in or zoom-out on the interesting sets of the ratios p_it/q_it and simultaneously weight these ratios by scaling factors q_it^{λ−1} controlled by the parameter λ = α + β. Therefore, the parameter α can be used to control the influence of large or small ratios in the estimator, while the parameter β provides some control on the weighting of the ratios, depending on the demand for a better fit to larger or smaller values of the model. (Figure 3 distinguishes three regimes: downweighting the ratios p_it/q_it where q_it is small; downweighting the larger ratios p_it/q_it with respect to the smaller ones; and downweighting the ratios p_it/q_it where q_it is large.)
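The deformed logarithm that drives this zooming behavior, ln_{1−α}(z) = (z^α − 1)/α for α ≠ 0 and ln z for α = 0, is easy to implement; a small sketch (helper name ours):

```python
import numpy as np

def dlog(z, alpha):
    """Deformed logarithm of order 1 - alpha: (z**alpha - 1)/alpha, ln z at alpha = 0."""
    if alpha == 0:
        return np.log(z)
    return (z**alpha - 1.0) / alpha

# alpha = 1 recovers z - 1 (linear in the ratio); alpha -> 0 recovers ln z.
# Larger alpha stretches the scale for z > 1 (zoom-out on large ratios),
# smaller alpha stretches it for z < 1 (zoom-in on small ratios).
z = np.array([0.5, 1.0, 2.0])
assert np.allclose(dlog(z, 1.0), z - 1.0)
assert np.allclose(dlog(z, 1e-8), np.log(z), atol=1e-6)
```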

Derivation of Multiplicative NMF Algorithms Based on the AB-Divergence
We shall now develop generalized and flexible NMF algorithms (see Equation (1)) by employing the AB-divergence as the cost function. In this case, the gradient of the AB-divergence (20) can be expressed in a compact form (for any α, β ∈ R) in terms of the (1 − α)-deformed logarithm (see (30)): ∂D_AB/∂x_jt = −Σ_i a_ij q_it^{λ−1} ln_{1−α}(p_it/q_it) and ∂D_AB/∂a_ij = −Σ_t x_jt q_it^{λ−1} ln_{1−α}(p_it/q_it), where λ = α + β. The new multiplicative learning algorithm is based on gradient descent in the natural parameter space φ(x) of the proposed divergence. These updates can be considered as a generalization of the exponentiated gradient (EG) [57].
In general, such a nonlinear scaling (or transformation) provides a stable solution, and the obtained gradients are much better behaved in the φ space. We use the (1 − α)-deformed logarithm transformation φ(z) = ln_{1−α}(z), whose inverse is the (1 − α)-deformed exponential exp_{1−α}(y) = (1 + αy)^{1/α} for α ≠ 0 and e^y for α = 0. For positive measures z > 0, the direct transformation φ(z) and the composition φ^{−1}(φ(z)) are bijective functions which define a one-to-one correspondence, so we have φ^{−1}(φ(z)) = z.
By choosing suitable learning rates, a new multiplicative NMF algorithm (referred to as the AB-multiplicative NMF algorithm) is obtained: x_jt ← x_jt exp_{1−α}( Σ_i a_ij q_it^{λ−1} ln_{1−α}(p_it/q_it) / Σ_{i'} a_{i'j} q_{i't}^{λ−1} ) and a_ij ← a_ij exp_{1−α}( Σ_t x_jt q_it^{λ−1} ln_{1−α}(p_it/q_it) / Σ_{t'} x_{jt'} q_{it'}^{λ−1} ). (61) In these equations, the deformed logarithm of order 1 − α of the quotients p_it/q_it plays a key role in controlling the relative error terms whose weighted mean provides the multiplicative corrections. This deformation of the relative error is controlled by the parameter α: α > 1 gives more relevance to large values of the quotient, while α < 1 puts more emphasis on smaller values of the quotient. On the other hand, the parameter λ − 1 (where λ = α + β) controls the influence of the values of the approximation (q_it) on the weighting of the deformed error terms. For λ = 1, this influence disappears.
It is interesting to note that the multiplicative term of the main updates can be interpreted as a weighted generalized mean across the elements with indices in the set S.
Depending on the value of α, we obtain as particular cases: the minimum of the vector z (for α → −∞), its weighted harmonic mean (α = −1), the weighted geometric mean (α = 0), the weighted arithmetic mean (α = 1), the weighted quadratic mean (α = 2), and the maximum of the vector (α → ∞). The generalized weighted means are monotonically increasing functions of α, i.e., if α1 < α2, then m_{α1}(z) ≤ m_{α2}(z). Thus, by increasing the value of α, we put more emphasis on large relative errors in the update formulas (61).
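These generalized weighted means coincide with weighted power means of order α; a small NumPy sketch (helper name ours) verifying the harmonic, geometric, and arithmetic special cases and the monotonicity in α:

```python
import numpy as np

def weighted_power_mean(z, w, alpha):
    """Weighted generalized (power) mean of order alpha; weights are normalized.

    alpha = -1: harmonic mean, alpha -> 0: geometric mean, alpha = 1: arithmetic mean.
    """
    z = np.asarray(z, float)
    w = np.asarray(w, float)
    w = w / w.sum()
    if alpha == 0:
        return float(np.exp(np.sum(w * np.log(z))))
    return float(np.sum(w * z**alpha) ** (1.0 / alpha))

z, w = [1.0, 4.0], [0.5, 0.5]
assert np.isclose(weighted_power_mean(z, w, -1), 1.6)  # harmonic: 1/(0.5*1 + 0.5*0.25)
assert np.isclose(weighted_power_mean(z, w, 0), 2.0)   # geometric: sqrt(1*4)
assert np.isclose(weighted_power_mean(z, w, 1), 2.5)   # arithmetic
```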
In the special case of α = 0, the above update rules simplify to weighted geometric means, x_jt ← x_jt Π_i (p_it/q_it)^{w_ijt} with weights w_ijt = a_ij q_it^{β−1} / Σ_{i'} a_{i'j} q_{i't}^{β−1} (and analogously for a_ij), where q_it = [AX]_it and at every iteration the columns of A are normalized to unit length. The above multiplicative update rules can be written in a compact matrix form as X ← X ⊛ [ (A^T Z) ⊘ (A^T Q^{.[λ−1]}) ]^{.[1/α]} and A ← A ⊛ [ (Z X^T) ⊘ (Q^{.[λ−1]} X^T) ]^{.[1/α]}, where Z = P^{.[α]} ⊛ Q^{.[β−1]} and λ = α + β. In order to fix the scaling indeterminacy between the columns of A and the rows of X, in practice, after each iteration we can evaluate the ℓ1-norms of the columns of A and normalize a_j ← a_j/||a_j||_1 while multiplying the corresponding row of X by ||a_j||_1. This normalization does not alter Q = AX, thus preserving the value of the AB-divergence. The above novel algorithms are natural extensions of many existing algorithms for NMF, including the ISRA, EMML, Lee-Seung algorithms and the Alpha- and Beta-multiplicative NMF algorithms [7,58]. For example, by selecting α + β = 1 we obtain the Alpha-NMF algorithm; for α = 1 we have the Beta-NMF algorithms; for α = −β ≠ 0 we obtain a family of multiplicative NMF algorithms based on the extended Itakura-Saito distance [8,53]. Furthermore, for α = 1 and β = 1 we obtain the ISRA algorithm, and for α = 1 and β = 0 the EMML algorithm.
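For α ≠ 0, the compact update rules can be sketched in NumPy as follows (a simplified illustration with our own naming and initialization, not the paper's reference implementation). With (α, β) = (1, 1) it reduces to the ISRA updates and with (1, 0) to the EMML updates:

```python
import numpy as np

def ab_nmf(P, J, alpha=0.5, beta=0.5, n_iter=300, eps=1e-12, seed=0):
    """AB-multiplicative NMF updates for alpha != 0:

    X <- X * ( A^T (P^a * Q^(b-1)) / (A^T Q^(a+b-1)) )^(1/a)
    A <- A * ( (P^a * Q^(b-1)) X^T / (Q^(a+b-1) X^T) )^(1/a)
    """
    rng = np.random.default_rng(seed)
    I, T = P.shape
    A = rng.uniform(0.1, 1.0, size=(I, J))
    X = rng.uniform(0.1, 1.0, size=(J, T))
    a, b = alpha, beta
    for _ in range(n_iter):
        Q = A @ X + eps
        X *= ((A.T @ (P**a * Q**(b - 1))) / (A.T @ Q**(a + b - 1) + eps)) ** (1 / a)
        Q = A @ X + eps
        A *= (((P**a * Q**(b - 1)) @ X.T) / (Q**(a + b - 1) @ X.T + eps)) ** (1 / a)
        # l1-normalize the columns of A; move the scale into the rows of X.
        s = A.sum(axis=0, keepdims=True) + eps
        A /= s
        X *= s.T
    return A, X
```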
It is important to note that in low-rank approximations, we do not need access to all input data p it . In other words, the above algorithms can be applied for low-rank approximations even if some data are missing or they are purposely omitted or ignored. For large-scale problems, the learning rules can be written in a more efficient form by restricting the generalized mean only to those elements whose indices belong to a preselected subsets S T ⊂ {1, . . . , T } and S I ⊂ {1, . . . , I} of the whole set of indices.
Using the duality property D_AB^{(α,β)}(P||Q) = D_AB^{(β,α)}(Q||P) of the AB-divergence, we now obtain the dual update rules, which simplify further for β = 0.

Conditions for a Monotonic Descent of AB-Divergence
In this section, we first explain the basic principle of the auxiliary function method, which allows us to establish conditions under which the NMF update rules provide a monotonic descent of the cost function during the iterative process. In other words, we analyze the conditions for the existence of auxiliary functions that justify the monotonic descent of the AB-divergence under multiplicative update rules of the form (61). NMF update formulas with this property have been previously obtained for certain particular divergences: in [38,41] for the Euclidean distance and Kullback-Leibler divergence, in [58] for the Alpha-divergence, and in [59][60][61] for the Beta-divergence.
Let Q^{(k)} = A^{(k)} X^{(k)} denote our current factorization model of the observations at the k-th iteration, and let {Q^{(k+1)}} denote our candidate model for the next, (k+1)-th, iteration. In the following, we adopt the notation {Q^{(k)}} ≡ {A^{(k)}, X^{(k)}} to refer compactly to the nonnegative factors of the decompositions.
An auxiliary function G({Q^{(k+1)}}, {Q^{(k)}}; P) for the surrogate optimization of the divergence should satisfy G({Q^{(k+1)}}, {Q^{(k)}}; P) ≥ D_AB^{(α,β)}(P||Q^{(k+1)}), with equality holding for {Q^{(k+1)}} = {Q^{(k)}}. In analogy to the Expectation Maximization techniques developed in [38,41,58], our objective is to find a convex upper bound of the AB-divergence and to replace the minimization of the original AB-divergence by an iterative optimization of the auxiliary function.
Given the factors of the current factorization model {Q^{(k)}}, the factors of the next candidate {Q^{(k+1)}} are chosen as the minimizers of the auxiliary function, where, for simplicity, the minimization is usually carried out with respect to only one of the factors A or X at a time. Assuming that such an auxiliary function is found, a monotonic descent of the AB-divergence is a consequence of the chain of inequalities D_AB^{(α,β)}(P||Q^{(k+1)}) ≤ G({Q^{(k+1)}}, {Q^{(k)}}; P) ≤ G({Q^{(k)}}, {Q^{(k)}}; P) = D_AB^{(α,β)}(P||Q^{(k)}). The inequality (a) is due to the optimization in (75), while the inequality (b) reflects the definition of the auxiliary function in (74).

A Conditional Auxiliary Function
It is shown in Appendix B that, under a certain condition, the function constructed there is an auxiliary function for the surrogate optimization of D_AB^{(α,β)}(P||Q). The required condition is, at each iteration, the convexity of the element-wise divergence d_AB^{(α,β)}(p_it, q̂_it) with respect to q̂_it. Recall that Q^{(k)} = A^{(k)} X^{(k)} corresponds to the current iteration, while a candidate model Q^{(k+1)} = A^{(k+1)} X^{(k+1)} for the next iteration is the solution of the optimization problem (75). For updating A at each iteration, we keep X^{(k+1)} equal to X^{(k)} and find the global minimum of the auxiliary function G({A^{(k+1)}, X^{(k)}}, {A^{(k)}, X^{(k)}}; P) with respect to A^{(k+1)}.
By setting the derivative of the auxiliary function to zero and solving with respect to a_ij = a_ij^{(k+1)}, we obtain the update for A. Analogously, for the intermediate factorized model Q̂^{(k)} = A^{(k+1)} X^{(k)}, the global minimum of the auxiliary function G({A^{(k+1)}, X^{(k+1)}}, {A^{(k+1)}, X^{(k)}}; P) with respect to the new update X^{(k+1)}, while keeping A = A^{(k+1)} fixed, gives the update for X. The two resulting equations match the updates proposed in (61). As previously indicated, a sufficient condition for a monotonic descent of the AB-divergence is the convexity of the terms d_AB(p_it, q̂_it) with respect to q̂_it. Depending on the parameter β, it is required that one of the conditions (85) be satisfied, where the upper and lower bounds depend on the function whose contour plot is shown in Figure 4(b). The convexity of the divergence w.r.t. q_it holds for the parameters (α, β) within the convex cone β ∈ [min{1, 1 − α}, max{1, 1 − α}]. Therefore, for this set of parameters, a monotonic descent of the AB-divergence with the update formulas (83), (84) is guaranteed. Outside this cone, convexity can still hold locally, provided that each q̂_it is bounded and sufficiently close to p_it so that the conditions (85) are satisfied. Figure 4(b) illustrates this property, demonstrating that as the ratios p_it/q_it approach unity, the convexity of the divergence w.r.t. q_it holds on an increasing portion of the (α, β) plane. In other words, for sufficiently small relative errors between p_it and q̂_it, the update formulas (83), (84) still guarantee a monotonic descent of the AB-divergence within a reasonably wide range of the hyperparameters (α, β).

Unconditional Auxiliary Function
The conditions of the previous section for a monotonic descent can be avoided by upper-bounding the element-wise divergence d_AB(p_it, q̂_it) with a function which is convex with respect to q̂_it and, at the same time, tangent to the former curve at q̂_it = q_it^{(k)}. Similarly to the approach in [59][60][61] for the Beta-divergence, we have constructed an auxiliary function for the AB-divergence by linearizing the additive concave terms of the AB-divergence. The upper bounds for the singular cases can be obtained by continuity using l'Hôpital's rule. The unconditional auxiliary function for the surrogate minimization of the AB-divergence D_AB^{(α,β)}(P||Q) is then defined in terms of this convex upper bound. The alternating minimization of this auxiliary function with respect to A and X yields the following stabilized iterations (with monotonic descent of the AB-divergence). The stabilized formulas (92), (93) coincide with (83), (84), except for the exponent w(α, β), which is shown in Figure 5. This exponent is bounded between zero and one, and plays a role in the multiplicative update similar to that of a normalized step-size in an additive gradient descent update. Its purpose is to slow down the convergence so as to guarantee a monotonic descent of the cost function; this is a consequence of the fact that the multiplicative correction term is progressively contracted towards unity as w(α, β) → 0. For the same reason, the stabilized formulas completely stop the updates for α = 0 and β = 1. Therefore, for α close to zero, we recommend preventing this undesirable situation by enforcing a positive lower bound on the value of the exponent w(α, β).

Multiplicative NMF Algorithms for Large-Scale Low-Rank Approximation
In practice, for low-rank approximations with J ≪ min{I, T}, we do not need to process or store the whole data matrix Y, nor is it necessary to compute at each iteration step the products of the whole estimated matrices, such as Y^T A or Y X^T (see Figure 6, drawn for J < C ≪ T and J < R ≪ I; for simplicity of graphical illustration, the first C columns of the matrices P and X and the first R rows of A are selected).

In other words, in order to perform the basic low-rank NMF Y = P = AX + E for large-scale problems, we need to perform two associated nonnegative matrix factorizations using much smaller-scale matrices: Y_r ≅ A_r X and Y_c ≅ A X_c, where Y_r ∈ R_+^{R×T} and Y_c ∈ R_+^{I×C} are the matrices constructed from the selected rows and columns of the matrix Y, respectively. Analogously, we can construct the reduced matrices A_r ∈ R_+^{R×J} and X_c ∈ R_+^{J×C} by using the same row and column indices as those used for the construction of the data sub-matrices Y_r and Y_c. In practice, it is usually sufficient to choose J < R ≤ 4J and J < C ≤ 4J.
Using this approach, we can formulate the update learning rules for large-scale multiplicative NMF (see Figure 6): X is updated using only Y_r and A_r, and A is updated using only Y_c and X_c. In fact, we need to store only the two reduced data matrices Y_r ∈ R_+^{R×T} and Y_c ∈ R_+^{I×C}. For example, for a large dense data matrix Y ∈ R_+^{I×T} with I = T = 10^5, J = 10 and R = C = 50, we need to store only 10^7 nonzero entries instead of 10^10 entries, thus reducing the memory requirement by a factor of 1000.
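The block scheme can be sketched as follows, here with ISRA-type (Frobenius) updates for illustration; the index selection, sizes, and iteration count are arbitrary choices of ours:

```python
import numpy as np

# X is updated from R selected rows of Y only (Y_r, A_r); A is updated from
# C selected columns only (Y_c, X_c). Only Y_r and Y_c need to be stored.
rng = np.random.default_rng(0)
I, T, J, R, C = 200, 300, 4, 16, 16
Y = rng.uniform(0.1, 1.0, (I, J)) @ rng.uniform(0.1, 1.0, (J, T))  # rank-J data

rows = rng.choice(I, size=R, replace=False)
cols = rng.choice(T, size=C, replace=False)
Y_r, Y_c = Y[rows, :], Y[:, cols]          # reduced data matrices

A = rng.uniform(0.1, 1.0, (I, J))
X = rng.uniform(0.1, 1.0, (J, T))
eps = 1e-9
err_before = np.linalg.norm(Y_r - A[rows, :] @ X)
for _ in range(500):
    A_r = A[rows, :]
    X *= (A_r.T @ Y_r) / (A_r.T @ A_r @ X + eps)   # update X from the row block
    X_c = X[:, cols]
    A *= (Y_c @ X_c.T) / (A @ X_c @ X_c.T + eps)   # update A from the column block
err_after = np.linalg.norm(Y_r - A[rows, :] @ X)
```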
We can modify and extend the AB NMF algorithms in several ways. First of all, we can use two different cost functions for the updates of the matrices A and X, under the assumption that both functions are convex with respect to the set of parameters being updated (the other set is assumed fixed). In a special case, we can use two AB-divergences with two different sets of parameters: one, D_AB^{(αA,βA)}, for updating A, and another, D_AB^{(αX,βX)}, for updating X. Furthermore, it would be very interesting to apply AB-multiplicative NMF algorithms to inverse problems in which the matrix A is known and we need to estimate only the matrix X for ill-conditioned and noisy data [62,63].

Simulations and Experimental Results
We have conducted extensive simulations with experiments designed specifically to address the following questions: • What, approximately, is the range of the parameters alpha and beta for which the AB-multiplicative NMF algorithm exhibits the best balance between fast convergence and good performance?
• What, approximately, is the range of the parameters alpha and beta for which the AB-multiplicative NMF algorithm provides a stable solution, independent of the number of iterations?
• How robust is the AB-multiplicative NMF algorithm to noisy mixtures under multiplicative Gaussian noise, additive Gaussian noise, and spiky biased noise? In other words, what is a reasonable range of parameters for which the AB-multiplicative NMF algorithm gives improved performance when the data are contaminated by different types of noise?
In order to test the performance of the AB-divergence on the NMF problem, we considered the matrix X* with three non-negative sources (rows) shown in Figure 7(a). These sources were obtained by superposing pairs of signals from the "ACsincpos10" benchmark of the NMFLAB toolbox and truncating their length to the first 250 samples [7]. We mixed these sources with a random mixing matrix A* of dimension 25 × 5, whose elements were drawn independently from a uniform distribution on the unit interval. In this way, we obtained a noiseless observation matrix P = A*X*, whose first five rows are displayed in Figure 7(b). Our objective is to reconstruct, from the noisy data P, the matrix X of the nonnegative sources and the mixing matrix A, up to the inherent scaling and permutation ambiguities.
The performance was evaluated with the mean Signal to Interference Ratio (SIR) of the estimated factorization model AX and the mean SIR of the estimated sources (the rows of X) [7].
We evaluated the proposed NMF algorithm in (61), based on the AB-divergence, for a very large number of pairs (α, β) covering a wide range of values. The algorithm used a single-trial random initialization, followed by a refinement which consists of running ten initial iterations of the algorithm with (α, β) = (0.5, 0.5). This initialization phase serves to bring the model closer to the observations (AX → P), which is important for guaranteeing the posterior monotonic descent of the divergence when the parameters are arbitrary, as discussed in Section 3.3. Then, we ran only 250 iterations of the proposed NMF algorithm for each selected pair of parameters (α, β).
To address the influence of noise, the observations were modeled as P = Q* + E, where Q* = A*X* and E denote, respectively, the desired components and the additive noise. To cater for different types of noise, we assume that the elements of the additive noise e_it are functions of the noiseless model q*_it and of another noise z_it, which is independent of q*_it. Moreover, we also assume that the signal q*_it and the noise combine additively in the deformed logarithm domain, ln_{1−α*}(p_it) = ln_{1−α*}(q*_it) + z̃_it, which is controlled by the parameter α*. Solving for the observations, we obtain p_it = exp_{1−α*}( ln_{1−α*}(q*_it) + z̃_it ): for α* = 0 this yields multiplicative noise, for α* = 1 additive noise, and for other values of α* additive noise in a deformed log-domain.
Such an approach allows us to model, under a single umbrella, noises of multiplicative type, additive noises, and other noises that act additively in a transformed domain, together with their distributions. In order to generate the observations, we first assume a probability density function g(z̃_it) for z̃_it ≡ ln_{1−α*}(z_it), the noise in the transformed domain. This distribution is corrected, when necessary, to the nearby distribution that satisfies the positivity constraint of the observations.
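The observation model can be sketched as follows (function names are ours): for σ = 0 the observations reduce to the noiseless model for any α*, while α* = 0 and α* = 1 recover multiplicative and additive noise, respectively. Negative arguments of the deformed exponential are rectified to keep the observations nonnegative:

```python
import numpy as np

def deformed_log(q, a):
    """ln_{1-a}(q) = (q**a - 1)/a for a != 0, ln(q) for a = 0."""
    return np.log(q) if a == 0 else (q**a - 1.0) / a

def deformed_exp(y, a):
    """Inverse map: exp_{1-a}(y) = (1 + a*y)**(1/a) for a != 0, e**y for a = 0."""
    return np.exp(y) if a == 0 else np.maximum(1.0 + a * y, 0.0) ** (1.0 / a)

def noisy_observations(Q, alpha_star, sigma, rng):
    """p_it = exp_{1-a*}( ln_{1-a*}(q_it) + z_it ), z_it ~ N(0, sigma^2)."""
    Z = rng.normal(0.0, sigma, size=np.shape(Q))
    return deformed_exp(deformed_log(Q, alpha_star) + Z, alpha_star)
```

For α* = 1 the identity exp_0(ln_0(q) + z) = q + z (when positive) makes the additive-noise case explicit.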
In the following, we assume that the noise in the transformed domain, z̃_it, is Gaussian. Figure 8 presents mixtures obtained for this noise under different values of α*. The transformation (106) of the Gaussian density g(z̃_it) leads to a marginal distribution of the observations which can be recognized as a deformed log-Gaussian distribution (of order 1 − α*), with mean ln_{1−α*}(q*_it) and variance σ². This distribution is exact for multiplicative noise (α* = 0), since in this case no correction of the distribution is necessary to guarantee the non-negativity of the observations. For the remaining cases the distribution is only approximate, but the approximation improves when α* is not far from zero, when q*_it is sufficiently large for all i, t, and when σ² is sufficiently small. Interestingly, provided that (108) is sufficiently accurate, the maximization of the log-likelihood of the matrix of observations (for mutually independent components distributed according to (108)) is equivalent to the minimization of the AB-divergence for (α, β) = (α*, α*), as expressed in (110). Figure 9 presents the simulation results for mixtures with multiplicative noise. Usually, performances with SIR > 15 dB are considered successful. Observe that the domain where the performance is satisfactory is restricted to α + β ≥ −0.5; otherwise, some terms could become extremely small due to the inversion of the arguments of the AB-divergence. In other words, for α + β < −0.5 there is an excessively strong enhancement of the observations that correspond to values of the model close to zero, deteriorating the performance. In our simulations, we restricted the minimum value of the entries of the true factorization model to a very small value of 10^−7. We confirmed this behavior by extensive computer simulations.
As theoretically predicted, for a Gaussian distribution of the noise z̃_it with α* = 0, the best performance in the reconstruction of the desired components was obtained in the neighborhood of the origin of the (α, β) plane. The generalized Itakura-Saito divergence of Equation (46) usually gives a reasonable or the best performance for the estimation of X and A. The result of the simulation can be interpreted in light of the discussion presented in Section 2.3. As the multiplicative noise increases the distortion at large values of the model, the best performance was obtained for those parameters that prevent the inversion of the arguments of the divergence and, at the same time, suppress the observations that correspond to larger values of the model. This leads to the close-to-optimal choice of parameters satisfying α + β = 0.
The performance for the case of additive Gaussian noise (α* = 1) is presented in Figure 10. For an SNR of 20 dB, the distribution of the noisy observations was approximately Gaussian, since only 1% of the entries of the matrix P were rectified to enforce positivity. This explains why the best performance of the NMF model was obtained in the neighborhood of the pair (α, β) = (1, 1), since according to (110) minimizing this divergence approximately maximizes the likelihood of these observations. Additionally, we observed poor performance for negative values of α, explained by the large ratios p_it/q_it being much less reliable, due to the distortion of the noise for small values of q*_it, together with the rectification of the negative observations.

Figure 10. Performance of the AB-multiplicative NMF algorithm for 25 mixtures with additive Gaussian noise and an SNR of 20 dB. The best performance for AX was an SIR of 31.1 dB, obtained for (α, β) = (0.8, 0.7), that is, close to the pair (1, 1) that approximately maximizes the likelihood of the observations. The best performance for X and A was obtained in the vicinity of (−0.2, 0.8), with respective mean SIRs of 17.7 dB and 20.5 dB.

Figure 11 shows simulation results with α* = 3, the case for which the effect of the noise is more pronounced for small values of q*_it. The noise in the transformed domain was again Gaussian and set to an SNR (in this domain) of 20 dB. In this case, the distribution of the observations was no longer sufficiently close to Gaussian, because 11% of the entries of the matrix P were rectified to enforce their positivity. The graphs reveal that the best performance was obtained for the pair (α, β) = (0.9, 4.0), which is quite close to a Beta-divergence. As a consequence of the strong distortion by the noise of the small values of the factorization model, both the large and the small ratios p_it/q_it were unreliable, leaving values of α close to unity as preferable.
On the other hand, the ratios associated with small q_it should be penalized, which leads to a larger parameter β as a robust choice against the contamination of the small values of the model.

Figure 11. Performance of the AB-multiplicative NMF algorithm when the observations are contaminated with Gaussian noise in the ln_{1−α*}(·) domain, for α* = 3. The best performance for AX was an SIR of 22.6 dB, obtained for (α, β) = (0.9, 4.0). The best SIR of 16.1 dB for X was obtained for (α, β) = (0.5, 1.7), which gave an SIR of 19.1 dB for A.

The last two simulations, shown in Figures 12 and 13, illustrate the effect of uniform spiky noise which was activated with a probability of 0.1 and contained a bias in its mean. When the observations are biased downwards by the noise, the performance improves for positive α since, as Figure 3 illustrates, in this case α suppresses the smaller ratios p_it/q_it. On the other hand, a positive bias has the opposite effect, making negative values of α preferable. With these changes in α, the value of β should be modified accordingly to lie in the vicinity of −α plus a given offset, so as to avoid an excessive penalty for observations that correspond to large or small values of the factorization model q_it.

Figure 12. Performance for biased (non-zero mean), spiky, additive noise. For α* = 1, we have uniform noise with support in the negative unit interval, which is spiky, or sparse, in the sense that it is activated only with a probability of 0.1, i.e., it corrupts only 10% of the observed samples. The best SIR results were obtained around the line (α, 1 − α) for positive and large values of α.

Figure 13. Performance for multiplicative noise that is positively biased and spiky (activated with a probability of 0.1). For α* = 0, the noise in the ln(·) domain (z_it) followed a uniform distribution with support in the unit interval. The best SIR results were obtained along the line (α, −α) for negative values of α.
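The SIR values quoted throughout this section can be computed as follows; this is a sketch assuming the standard definition of the Signal-to-Interference Ratio with a least-squares scale alignment (NMF recovers components only up to scale, so the estimate must be rescaled before comparison):

```python
import numpy as np

def sir_db(s_true, s_est):
    """Signal-to-Interference Ratio in dB after optimal scaling:
    SIR = 10 log10( ||s||^2 / ||s - c * s_hat||^2 ),
    with c the least-squares scale aligning the estimate to the true signal."""
    s_true = np.asarray(s_true, float).ravel()
    s_est = np.asarray(s_est, float).ravel()
    c = np.dot(s_true, s_est) / np.dot(s_est, s_est)
    err = s_true - c * s_est
    return 10.0 * np.log10(np.dot(s_true, s_true) / np.dot(err, err))

rng = np.random.default_rng(1)
s = rng.uniform(0.0, 1.0, 1000)

# A lightly perturbed estimate clears the 15 dB "successful" threshold ...
assert sir_db(s, s + 0.01 * rng.standard_normal(1000)) > 15.0
# ... while an unrelated estimate does not.
assert sir_db(s, rng.uniform(0.0, 1.0, 1000)) < 15.0
```

The 15 dB threshold in the assertions mirrors the criterion for a "successful" separation used in the discussion of Figure 9.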

Conclusions
We have introduced the generalized Alpha-Beta divergence, which serves as a flexible and robust cost function and forms the basis for the development of a new class of generalized multiplicative algorithms for NMF. Natural extensions of the Lee-Seung, ISRA, EMML and other NMF algorithms have been presented in order to obtain more flexible and robust solutions with respect to different contaminations of the data by noise. This class of algorithms allows us to reconstruct (recover) the original signals and to estimate the mixing matrices even when the observed data are imprecise and/or corrupted by noise.
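A minimal sketch of one pass of the AB-multiplicative updates, assuming the form X ← X ⊛ [A^T(P^{.α} ⊛ Q^{.β−1}) / A^T Q^{.α+β−1}]^{.1/α} (and symmetrically for A) with Q = AX and α ≠ 0; stabilization details such as relaxation exponents are omitted. Function and variable names are ours:

```python
import numpy as np

def ab_nmf_update(P, A, X, alpha, beta, eps=1e-12):
    """One alternating pass of AB-multiplicative NMF updates (alpha != 0)."""
    Q = np.maximum(A @ X, eps)                      # current model Q = A X
    X = X * ((A.T @ (P**alpha * Q**(beta - 1.0))) /
             np.maximum(A.T @ Q**(alpha + beta - 1.0), eps))**(1.0 / alpha)
    Q = np.maximum(A @ X, eps)
    A = A * (((P**alpha * Q**(beta - 1.0)) @ X.T) /
             np.maximum(Q**(alpha + beta - 1.0) @ X.T, eps))**(1.0 / alpha)
    return A, X

# For (alpha, beta) = (1, 1) this reduces to the Lee-Seung / ISRA updates
# for the Euclidean cost, so the reconstruction error should decrease.
rng = np.random.default_rng(2)
P = rng.uniform(0.1, 1.0, (20, 30))
A = rng.uniform(0.1, 1.0, (20, 4))
X = rng.uniform(0.1, 1.0, (4, 30))
err0 = np.linalg.norm(P - A @ X)
for _ in range(50):
    A, X = ab_nmf_update(P, A, X, 1.0, 1.0)
assert np.linalg.norm(P - A @ X) < err0
```

Note how the special case (α, β) = (1, 1) recovers the familiar ISRA update X ← X ⊛ (A^T P) / (A^T Q), which is one instance of the unification claimed above.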
The optimal choice of the tuning parameters α and β depends strongly both on the distribution of the noise and on the way it contaminates the data. It should be emphasized, however, that we do not expect the AB-multiplicative NMF algorithms proposed in this paper to work well for every possible set of parameters.
In summary, we have rigorously derived a new family of robust AB-multiplicative NMF algorithms using the generalized Alpha-Beta divergence, which unifies and extends a number of existing divergences. The proposed AB-multiplicative NMF algorithms have been shown to work over a wide range of parameters and to combine smoothly many existing NMF algorithms. Moreover, they can also be adapted to large-scale low-rank approximation problems. Extensive simulation results have confirmed that the developed algorithms are efficient, robust, and stable for a wide range of parameters. The proposed algorithms can also be extended to Nonnegative Tensor Factorizations and Nonnegative Tucker Decompositions in a straightforward manner.