2.1. Orthogonal NMF and K-Means
Nonnegative Matrix Factorization (NMF), originally introduced by Paatero and Tapper in 1994 as positive matrix factorization [
38], is a matrix factorization method designed to obtain a low-rank approximation of a given, typically large, nonnegative data matrix. In contrast to the widely used Principal Component Analysis (PCA), which is based on the singular value decomposition and allows the computation of a best rank-$K$ approximation of an arbitrary matrix, the NMF constrains the matrix factors to be nonnegative. This property makes the NMF the method of choice whenever the considered data naturally fulfill a nonnegativity constraint, since it ensures the interpretability of the factor matrices. NMF has been widely used for data compression, source separation, feature extraction, clustering, and even for solving inverse problems. Possible application fields are hyperspectral unmixing [
30,
31,
32], document clustering [
8,
39], and music analysis [
40] but also medical imaging problems, such as dynamic computed tomography, to perform a joint reconstruction and low-rank decomposition of the corresponding dynamic inverse problem [
41], or Matrix-Assisted Laser Desorption/Ionization (MALDI) imaging, where it can be used for tumor typing in the field of bioinformatics as a supervised classification method [
42].
Mathematically, the standard NMF problem can be formulated as follows: For a given nonnegative matrix
$X\in {\mathbb{R}}_{\ge 0}^{M\times N},$ the task is to find two nonnegative matrices
$U\in {\mathbb{R}}_{\ge 0}^{M\times K}$ and
$V\in {\mathbb{R}}_{\ge 0}^{K\times N}$ with
$K\ll \min\{M,N\},$ such that
$$X\approx UV.$$
This allows the approximation of the columns ${X}_{\bullet ,n}$ and rows ${X}_{m,\bullet }$ via a superposition of just a few basis vectors ${\left\{{U}_{\bullet ,k}\right\}}_{k}$ and ${\left\{{V}_{k,\bullet }\right\}}_{k},$ such that ${X}_{\bullet ,n}\approx {\sum}_{k}{V}_{kn}{U}_{\bullet ,k}$ and ${X}_{m,\bullet }\approx {\sum}_{k}{U}_{mk}{V}_{k,\bullet }$. In this way, the NMF can be seen as a basis learning tool with additional nonnegativity constraints.
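To make the column and row superpositions concrete, the following small NumPy sketch (with arbitrary illustrative values that are not taken from the paper) verifies that every column of $X=UV$ is a nonnegative combination of the $K$ columns of $U$, and every row a combination of the rows of $V$:

```python
import numpy as np

# Small illustrative factors (hypothetical values): M = 4, N = 3, K = 2.
U = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0],
              [2.0, 1.0]])          # M x K, nonnegative basis vectors as columns
V = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])     # K x N, nonnegative coefficients
X = U @ V                           # exact rank-K product for this demonstration

# Column n of X is a superposition of the K basis vectors U[:, k],
# weighted by the coefficients V[k, n].
n = 1
col = sum(V[k, n] * U[:, k] for k in range(U.shape[1]))
assert np.allclose(col, X[:, n])

# Analogously, row m of X is a superposition of the rows V[k, :].
m = 3
row = sum(U[m, k] * V[k, :] for k in range(U.shape[1]))
assert np.allclose(row, X[m, :])
```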
The typical variational approach to tackle the NMF problem is to reformulate it as a minimization problem by defining a suitable discrepancy term
$\mathcal{D}$ according to the noise assumption of the underlying problem. The default case of Gaussian noise corresponds to the Frobenius norm, on which we will focus in this work. Further possible choices include the Kullback–Leibler divergence or more generalized divergence functions [
25].
Moreover, NMF problems are non-linear and ill-conditioned [
43,
44]. Thus, they require stabilization, which is typically achieved by adding regularization terms
${\mathcal{R}}_{j}$ into the NMF cost function
$\mathcal{F}$. However, besides the use case of regularization, the penalty terms
${\mathcal{R}}_{j}$ can also be used to enforce additional properties on the factorization matrices
U and
$V$. The general NMF minimization problem can, therefore, be written as
$$\underset{U\ge 0,\,V\ge 0}{min}\ \mathcal{F}(U,V):=\mathcal{D}(X,UV)+\sum_{j}{\alpha}_{j}{\mathcal{R}}_{j}(U,V), \tag{1}$$
where
${\alpha}_{j}$ are regularization parameters. Common choices for
${\mathcal{R}}_{j}$ are
${\ell}_{1}$ and
${\ell}_{2}$ penalty terms. Further possible options are total variation regularization or other penalty terms which enforce orthogonality or even allow a supervised classification scheme in case the NMF is used as a prior feature extraction step [
37,
42]. In this work, we will focus on the combination of orthogonality constraints and total variation penalty terms to construct an NMF model for spatially coherent clustering methods.
Another essential step for computing the NMF of a given matrix
X is the determination of an optimal number of features
$K$. Typical techniques used throughout the literature are based on heuristic or approximate methods, including core consistency diagnostics via a PCA or residual analysis (also see Reference [
25]). A straightforward technique used in Reference [
45] is based on the analysis of the rank-one matrices
${U}_{\bullet ,k}{V}_{k,\bullet }.$ For a
$K\in \mathbb{N}$ chosen sufficiently large, the considered NMF algorithm is executed to obtain the factorization
$UV.$ Afterwards, the norm of the rank-one matrices
${U}_{\bullet ,k}{V}_{k,\bullet }$ for every
$k\in \{1,\cdots ,K\}$ is analyzed. By the choice of a large
$K,$ the NMF algorithm is forced to compute additional irrelevant features, which can be identified by the small norm of the corresponding rank-one matrices. By defining an appropriate threshold on these norms, a suitable
K can be obtained.
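The rank-one screening heuristic of Reference [45] can be sketched as follows; the synthetic data, the deliberately suppressed third feature, and the 1% threshold are assumptions for illustration only. Note that $\parallel u{v}^{\intercal}{\parallel}_{F}={\parallel u\parallel}_{2}{\parallel v\parallel}_{2},$ so the rank-one matrices never need to be formed explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical factorization with K deliberately chosen too large: two strong
# features and one near-irrelevant feature with tiny magnitude.
U = np.abs(rng.normal(size=(50, 3)))
V = np.abs(rng.normal(size=(3, 40)))
U[:, 2] *= 1e-4                      # force the third feature to be negligible

# Frobenius norm of each rank-one matrix U[:, k] V[k, :], computed via
# ||u v^T||_F = ||u||_2 ||v||_2 without forming the outer product.
norms = np.array([np.linalg.norm(U[:, k]) * np.linalg.norm(V[k, :])
                  for k in range(U.shape[1])])

threshold = 0.01 * norms.max()       # heuristic 1% cutoff (an assumption)
K_effective = int((norms > threshold).sum())
print(K_effective)                   # prints 2: only the relevant features survive
```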
This work, however, will not focus on methods to determine an optimal number of features. Hence, we assume in the numerical part in
Section 4 that the true number of features in the considered dataset is known in advance so that
K can be set a priori.
On the other hand, K-means clustering is one of the most commonly used prototype-based, partitional clustering techniques. As for any other clustering method, the main task is to partition a given set of objects into groups such that objects in the same group are more similar to each other than to those in other groups. These groups are usually referred to as clusters. In mathematical terms, the aim is to partition the index set $\{1,2,\cdots ,M\}$ of a corresponding given dataset $\{{x}_{m}\in {\mathbb{R}}^{N}\ |\ m=1,\cdots ,M\}$ into disjoint sets ${\mathcal{I}}_{k}\subset \{1,\cdots ,M\},$ such that ${\cup}_{k=1,\cdots ,K}{\mathcal{I}}_{k}=\{1,\cdots ,M\}$.
Many different variations and generalizations of
K-means have been proposed and analyzed (see, for instance, References [
1,
46] and the references therein), but we will focus in this section on the most common case. The method is based on two main ingredients. On the one hand, a similarity measure
$dist(\,\cdot\,,\,\cdot\,)$ is needed to specify the similarity between data points. The default choice is the squared Euclidean distance
$dist({x}_{i},{x}_{j}):={\parallel {x}_{i}-{x}_{j}\parallel}_{2}^{2}$. On the other hand, so-called representative centroids
${c}_{k}\in {\mathbb{R}}^{N}$ are computed for each cluster
${\mathcal{I}}_{k}$. The computation of the clusters and centroids is based on the minimization of the within-cluster variances given by
$\mathcal{J}={\sum}_{k=1}^{K}{\sum}_{m\in {\mathcal{I}}_{k}}dist({x}_{m},{c}_{k})$. Due to the NP-hardness of the minimization problem [
47], heuristic approaches are commonly used to find an approximate solution. The
K-means algorithm is the most common optimization technique and is based on an alternating minimization. After a suitable initialization, the first step is to assign each data point
${x}_{m}$ to the cluster with the closest centroid with respect to the distance measure
$dist(\,\cdot\,,\,\cdot\,)$. In the case of the squared Euclidean distance, the centroids are recalculated in a second step as the mean of their newly assigned data points to minimize the sum of the within-cluster variances. Both steps are repeated until the assignments do not change anymore.
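The two alternating steps can be sketched in a minimal NumPy implementation; the deterministic initialization and the synthetic two-blob data are illustrative assumptions, and empty clusters are not handled:

```python
import numpy as np

def kmeans(X, K, n_iter=100):
    """Plain K-means with squared Euclidean distance (illustrative sketch)."""
    # Deterministic initialization with K evenly spaced data points;
    # K-means++ or random restarts are common choices in practice.
    centroids = X[:: max(1, len(X) // K)][:K].copy()
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its closest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points,
        # which minimizes the sum of the within-cluster variances.
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break                      # assignments are stable: converged
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs (illustrative data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])
labels, centroids = kmeans(X, K=2)
assert (labels[:20] == labels[0]).all() and (labels[20:] == labels[20]).all()
```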
The relationship between the NMF and
K-means clustering can be easily seen by adding further constraints to both problems. First of all, from the point of view of
K-means, we assume nonnegativity of the given data and write the vectors
${x}_{m}$ row-wise to a data matrix
X such that
$X={[{x}_{1},\cdots ,{x}_{M}]}^{\intercal}\in {\mathbb{R}}_{\ge 0}^{M\times N}.$ Furthermore, we define the so-called cluster membership matrix
$\tilde{U}\in {\{0,1\}}^{M\times K},$ such that
$${\tilde{U}}_{mk}=\begin{cases}1 & \text{if}\ m\in {\mathcal{I}}_{k},\\ 0 & \text{otherwise},\end{cases}$$
and the centroid matrix
$\tilde{V}:={[{c}_{1},\cdots ,{c}_{K}]}^{\intercal}\in {\mathbb{R}}_{\ge 0}^{K\times N}$. With this, and by choosing the squared Euclidean distance function, it can be easily shown that the objective function
$\mathcal{J}$ of
K-means can be rewritten as
$\mathcal{J}=\parallel X-\tilde{U}\tilde{V}{\parallel}_{F}^{2},$ which has the same structure as the usual cost function of an NMF problem. However, the usual NMF does not constrain one of the matrices to have binary entries or, more importantly, to be row-orthogonal, as is the case for
$\tilde{U}$. This ensures that each row of
$\tilde{U}$ contains only one nonzero element which gives the needed clustering interpretability.
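The identity $\mathcal{J}=\parallel X-\tilde{U}\tilde{V}{\parallel}_{F}^{2}$ can be checked numerically on a toy example; the data and the cluster assignment below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, K = 6, 4, 2

# Hypothetical assignment: points 0-2 in cluster 0, points 3-5 in cluster 1.
labels = np.array([0, 0, 0, 1, 1, 1])
X = np.abs(rng.normal(size=(M, N)))      # nonnegative data, one point per row

# Binary cluster membership matrix U~ (one-hot rows) and centroid matrix V~.
U_t = np.zeros((M, K))
U_t[np.arange(M), labels] = 1.0
V_t = np.array([X[labels == k].mean(axis=0) for k in range(K)])

# K-means objective: sum of squared distances to the assigned centroids ...
J = sum(((X[m] - V_t[labels[m]]) ** 2).sum() for m in range(M))
# ... equals the squared Frobenius norm of the factorization residual,
# since row m of U~ V~ is exactly the centroid assigned to point m.
assert np.isclose(J, np.linalg.norm(X - U_t @ V_t) ** 2)

# Each row of U~ has exactly one nonzero entry, which gives the
# clustering interpretability discussed above.
assert (U_t.sum(axis=1) == 1).all()
```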
This gives rise to the problem of ONMF, which is given by
$$\underset{U\ge 0,\,V\ge 0}{min}\ \mathcal{D}(X,UV)+\sum_{j}{\alpha}_{j}{\mathcal{R}}_{j}(U,V)\quad \text{s.t.}\quad {U}^{\intercal}U=I,$$
where
I is the identity matrix. The matrices
U and
V of an ONMF problem will be henceforth also referred to as cluster membership matrix and centroid matrix, respectively. In the case of the Frobenius norm as discrepancy term and without any regularization terms
${\mathcal{R}}_{j},$ it can be shown that this problem is equivalent to a weighted variant of the spherical
K-means problem [
16]. For further variants of the relations between ONMF and
K-means, we refer to the works of References [
2,
3,
4] and the review articles of References [
23,
24].
2.2. Algorithms for Orthogonal NMF
Due to the ill-posedness of NMF problems and possible constraints on the matrices, tailored minimization approaches are needed. In this section, we briefly review common optimization techniques for NMF and ONMF problems, which will also be used in this work for the derivation of algorithms for ONMF models including spatial coherence.
For usual choices of
$\mathcal{D}$ and
${\mathcal{R}}_{j}$ in the NMF problem (
1), the corresponding cost function
$\mathcal{F}$ is convex in each of the variables
U and
V but non-convex in
$(U,V).$ Therefore, the majority of optimization algorithms for NMF and ONMF problems are based on alternating minimization schemes of the form
$${U}^{[i+1]}\in \underset{U\ge 0}{argmin}\ \mathcal{F}(U,{V}^{[i]}), \tag{2}$$
$${V}^{[i+1]}\in \underset{V\ge 0}{argmin}\ \mathcal{F}({U}^{[i+1]},V). \tag{3}$$
One classical technique to tackle these minimization problems is given by alternating multiplicative algorithms, which only consist of summations and multiplications of matrices and, therefore, ensure the nonnegativity of
U and
V without any additional projection step provided that they are initialized appropriately. This approach was mainly popularized by the works of Lee and Seung [
48,
49], which also brought much attention to the NMF, in general. The update rules are usually derived by analyzing the Karush–Kuhn–Tucker (KKT) first-order optimality conditions for each of the minimization problems in (
2) and (
3) or via the so-called Majorize-Minimization (MM) principle. The basic idea of the latter technique is to replace the NMF cost function
$\mathcal{F}$ by a majorizing surrogate function
${\mathcal{Q}}_{\mathcal{F}}:dom\left(\mathcal{F}\right)\times dom\left(\mathcal{F}\right)\to \mathbb{R},$ which is easier to minimize and whose tailored construction leads to the desired multiplicative update rules defined by
$${A}^{[i+1]}:=\underset{A\in dom\left(\mathcal{F}\right)}{argmin}\ {\mathcal{Q}}_{\mathcal{F}}(A,{A}^{[i]}).$$
Using the defining properties of a surrogate function, namely that
${\mathcal{Q}}_{\mathcal{F}}$ majorizes
$\mathcal{F}$ and
${\mathcal{Q}}_{\mathcal{F}}(A,A)=\mathcal{F}\left(A\right)$ for all
$A\in dom\left(\mathcal{F}\right),$ it can be easily shown that the above update rule leads to a monotone decrease of the cost function
$\mathcal{F}$ (also see
Appendix A). However, the whole method is based on an appropriate construction of the surrogate functions, which is generally non-trivial. Possible techniques for common choices of
$\mathcal{D}$ and
${\mathcal{R}}_{j}$ in the NMF cost function are based on the quadratic upper bound principle and Jensen’s inequality [
37]. Overall, multiplicative algorithms offer a flexible approach to various choices of NMF cost functions and will also be used in this work for some of the proposed and comparative methods.
Another classical method is given by Alternating (nonnegative) Least Squares (ALS) algorithms, which estimate the stationary points of the cost function with a corresponding fixed-point approach and apply a subsequent projection step to ensure the nonnegativity of the matrices. An extension of this procedure is given by Hierarchical Alternating (nonnegative) Least Squares (HALS) algorithms, which solve nonnegative ALS problems column-wise for both matrices
U and
V [
11,
12] and will also be used as a comparative methodology.
An optimization approach, which was recently used for NMF problems, is the widely known Proximal Alternating Linearized Minimization (PALM) [
14,
15], together with its extensions, including stochastic gradients [
50]. As a first step, the cost function is split up into a differentiable part
${\mathcal{F}}_{1}$ and a non-differentiable part
${\mathcal{F}}_{2}.$ In its basic form, the PALM update rules consist of alternating gradient descent steps for
U and
V with learning rates based on the Lipschitz constants of the gradient of
${\mathcal{F}}_{1}$ in combination with a subsequent computation of a proximal operator of the function
${\mathcal{F}}_{2}.$ Some of these techniques will be used for the proposed methods in this work and will be discussed in more detail in
Section 3.
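Under the assumption of a plain Frobenius cost as the smooth part ${\mathcal{F}}_{1}$ and the nonnegativity constraints as the nonsmooth part ${\mathcal{F}}_{2}$ (whose proximal operator reduces to the projection onto the nonnegative orthant), a basic PALM iteration can be sketched as follows; the data and the step-size safety factor are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.abs(rng.normal(size=(30, 20)))
K = 4
U = np.abs(rng.normal(size=(30, K)))
V = np.abs(rng.normal(size=(K, 20)))

def cost(U, V):
    return np.linalg.norm(X - U @ V) ** 2

c0 = cost(U, V)
gamma = 1.1                                   # step-size safety factor > 1
for _ in range(100):
    # Gradient step on F1(U, V) = ||X - UV||_F^2 in U, with the step size
    # determined by the Lipschitz constant 2 ||V V^T||_2 of the partial
    # gradient, followed by the prox of F2 (projection onto U >= 0).
    L_U = 2.0 * np.linalg.norm(V @ V.T, 2) + 1e-12
    U = np.maximum(0.0, U - (2.0 / (gamma * L_U)) * (U @ V - X) @ V.T)
    # Analogous linearized step and projection for V.
    L_V = 2.0 * np.linalg.norm(U.T @ U, 2) + 1e-12
    V = np.maximum(0.0, V - (2.0 / (gamma * L_V)) * U.T @ (U @ V - X))

# With steps shorter than 1/L in each block, the cost decreases monotonically.
assert cost(U, V) < c0
```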
Further well-known techniques are, for example, projected gradient algorithms consisting of additive update rules, quasi-Newton approaches based on second-order derivatives of the cost function, or algorithms based on an augmented Lagrangian concept [
16,
25].
All these methods can also be used to derive suitable algorithms for ONMF problems. Common approaches to incorporate the orthogonality constraint are the use of Lagrangian multipliers [
3,
5,
6,
11,
12] or the replacement of the hard constraint
${U}^{\intercal}U=I$ by a suitable penalty term in the NMF cost function, which enforces approximate row-orthogonality of
U controlled by a regularization parameter [
7,
9,
10,
15,
20,
37]. Other methods include optimization algorithms on the Stiefel manifold [
19], the use of sparsity and nuclear norm minimization [
8,
39], or other techniques [
14,
16,
18].
In the next section, we will introduce the considered NMF models in this work and derive the corresponding optimization algorithms.