1. Introduction
Clustering is a technique attributed to MacQueen et al. [1], who posed the problem of partitioning a set of observations into disjoint subsets such that the within-group variance is as small as possible, and who proposed the benchmark k-means method [1]. Currently, there is a well-established corpus in the statistics and machine learning (ML) fields devoted to the formal study of algorithms and methods for grouping or clustering objects according to measured or perceived intrinsic characteristics or similarities [2]. The vertiginous growth experienced in these fields in recent decades, particularly in terms of concepts and computational techniques, is reflected in several excellent books. One of them, providing detailed explanations of the most relevant contributions, is [3], which clearly reflects the difficulty of producing a simple classification of existing techniques, as methods overlap, a situation aggravated by the introduction of eigenanalysis criteria.
An alternative to partitioning methods can be designed by admitting that, in the presence of uncertainty in the observations, items may belong to more than one category (to a greater or lesser extent) and by using supporting generative models. These approaches are based on determining suitable mixtures of components, in terms of both their number and their parameters, denoted by $f(x)=\sum_{c=1}^{k}\pi_c\, g(x;\theta_c)$, where $g$ is a density with parameter $\theta_c$ and $\pi_c$ denotes the mixing weight assigned to each density. Two interpretations of this formulation exist: the first supposes that each entity belongs to each cluster with a different probability [4], while the second classifies the observations according to the distribution to which they most likely belong [5].
If the data frame of observations can be handled as a matrix, its transformation to the probability space yields a stochastic matrix, or probabilistic image of the data matrix, thus making non-negative matrix factorization (NMF) techniques especially apt for handling probabilistic structures. Many authors attribute the introduction of these techniques to [6], while others attribute it to [7], a study clearly centered on ML. We prefer to attribute their introduction to Chen [8], who was influential in establishing conditions for the existence of such factorizations of stochastic matrices [9]. The main idea is to factorize the stochastic matrix as the product of two other matrices with non-negative entries. This factorization depends on the dimension of the space span (the dimension of the factorization space), which is not determined a priori. The problem of parameter estimation is thus translated into determining the number of suitable mixture components, or the space span of the factorization, converting the problem into the unsupervised classification task of assigning to each item a probability of membership in each cluster. Pioneering applications of clustering using NMF techniques have been demonstrated in studies on the classification of textures [10], bioinformatics [11], instrument classification [12], facial recognition [13], and spectroscopy [14].
Independent of the classification method, an important step is determining the quality of the classification. In this context, the number of clusters is one of the most relevant aspects. This problem, called clustering validation, is of active and increasing interest in this research field, and it is critical in unsupervised settings. Many validation methods exist, and a drawback shared by many of them is that the use of one or more criteria may inadvertently favor different algorithms. Thus, clustering is a problem in which precise quantification is often not possible because of its unsupervised nature [3] (p. 22). The practical consequence is that many existing validation procedures seem to be less stringent with respect to the clustering methods that they validate. From a theoretical viewpoint, the relationships between many of these criteria have not yet been well established. In contrast, qualitative methods exist, usually based on graphical criteria, that provide good estimates but have drawbacks in terms of providing quantitative results [15]. Moreover, for large datasets, the presence of noise can be significant, making estimations and results that are formally correct unrealistic.
A line of work that does not share these problems is based on the idea of stability. Stability is an internal measure that evaluates the results for a given choice of the number of clusters. The main idea is that a dataset is a sample from a statistical distribution; the sample is stable if the underlying distribution is the same for different samples obtained from the data [16]. A review providing insight into these methods is [17]. A more recent work that determines the partition from the underlying distribution is [18], which redefines the concept of clustering as the partitioning of data into groups such that the partition is stable, proposing the Stadion index. In essence, this index maximizes a function of the stability between clusters and minimizes it within clusters.
In our approach, we derive a probability density function (pdf) to assess the degree of validity for classifications. The first key step is the formulation of the problem in light of the NMF. The second key step is to obtain a sequence of traces by varying the space span of the NMF. Then, the expectation of the limit behavior is given as a pdf.
While the term inference is not vague, it has been given different definitions. It is extensively used in logic, referring to the correct result of reasoning over several propositions. In statistics, it refers to the branch that deals with generalizations from samples to population parameters, with probability understood either in the relative frequency sense or, in the Bayesian context, in a subjective sense, whereby one may estimate the probability of an event on the basis of experience or credibility [19]. In the ML context, according to the Google machine learning glossary (https://developers.google.com/machine-learning/glossary?hl=es-419#i, accessed 15 September 2023), it is the process of making predictions by applying a trained model to unlabeled examples. This definition implies a simple descriptive statistical framework, restricting the clustering validation problem to the best description of the data structure in order to handle the data while sacrificing predictions. In this manuscript, the definition of inference is related to inferential statistics.
The remainder of this manuscript is structured as follows. Studies covering similar topics are presented in Section 2. Basic formulas that are well known and widely used by NMF practitioners are presented in Section 3; in this section, we also propose an error bound for the case of low-rank approximations. The manipulations necessary to obtain the traces and the underlying hypothesis (linear independence of the observational variables) are explained in Section 3.4, together with the main result: the sequence of NMF traces follows a gamma pdf when the dimension of the space span varies, and this gamma is the posterior of a Poisson distribution. We provide some indications for the application of this result to clustering validation in Section 4. In Section 5, we illustrate how the pdf operates in the context of several data configurations. We discuss several gaps and open questions in Section 6 before presenting our conclusions.
A key contribution of this manuscript is the generalization of NMF to real input matrices achieved through transformations to probabilistic space, which can always be justified in the presence of data uncertainty. However, the most important contribution of this work is the provision of a probabilistic validation criterion that allows for credibility to be assigned to clustering results independently of the clustering method. This result enables the construction of credible or acceptance regions, thereby allowing for a comparison of hypotheses or acceptance intervals to be carried out.
The existence of a pdf extends the validation problem to the quantitative evaluation of criteria other than purely numerical ones, including expert criteria. In the case of serious discrepancies, it justifies re-analyzing the problem with more relevant data. Furthermore, in the case of big data, it may be more appropriate to provide a confidence interval than a single value for the classification.
2. Related Work
The problem of clustering validation is as old as the techniques of clustering themselves. Although there are no unified criteria to determine whether it is a geometric problem related to partitioning a set based on similarities or a problem of label assignment, this conceptual nuance does not affect the practical results. Validation requires that several properties (e.g., sensitivity, impact of the cluster number, invariance) be determined in order to establish the capability of a clustering method for various data structures. Many studies have justified these criteria, such as [3] (chap. 23) and [20].
One of the first statistical attempts to use multivariate techniques to determine the number of clusters was detailed in [21]. At that time, the use of statistical concepts to construct validation indices was quasi-mandatory [22]. A later work by the same author pointed out the effects of sample size, data dimension, and cluster spread [23]. These ideas remain relevant, considering the increasing size of datasets. In [24], the difference between graphical methods and those that postulate the existence of an underlying parametric statistical model was pointed out. This idea is relevant when using the likelihood ratio test, which is the starting point for many fuzzy and probabilistic validation criteria. Furthermore, the current classification of validation methods was introduced in that work. One such criterion is the classical silhouette index, which evaluates the within- and between-group variances of the groups into which the data have been divided in order to select the cluster number [25]. Further studies focusing on this viewpoint of statistical techniques for clustering validation include Har [26], which introduced an indicator function for observations based on kernelization and used the null hypothesis as a classifier.
In Smyth [27], likelihood cross-validation was used to infer information on the number of model components. At that time, and as a consequence of the expansion of the Internet, other problems appeared regarding what constitutes a good measure of similarity. In Halkidi et al. [28], a detailed survey analyzing the problem of data similarity was provided. This study raised the problem of how to measure the goodness of fit of the data relative to the groups using concepts of compactness and separation, and it helped create a consensus on the types of indices: external validation (when additional information, such as a subset of labeled data, is available) and internal validation (when only information on the data themselves is available). This work also provided many examples. A classical index derived from these ideas is the gap statistic, which compares the within-cluster dispersion to the expectation under a reference distribution [29].
Using a different approach, the similarity between clusters can be evaluated using a statistic computed between probabilistic classifications [30]. In Brun et al. [31], a model for error prediction based on the correlation of the error rate against several validity measures was provided; the Kendall rank correlation index can be used to quantify the degree of similarity between the validation indices and the classification errors. By introducing an index and assuming that each cluster is generated by a parametric distribution, the minimum can be taken as the validation index [32]. The effect of clusters with different sizes and densities was studied in [33].
More recent works include [34], which, within the scope of astronomical observations and under the hypotheses of normality and the existence of correlation, presented an algorithm in which the posterior distribution of the correlations follows a gamma pdf. A quantitative discriminant method using the elbow point for the determination of the optimal cluster number was presented in [35]. In [36], a purely algebraic approach was presented, in which elements are clustered according to their co-linearity. A review centered on the impact and importance of clustering validation in the context of the recent growth of bioinformatics was presented in [37], with the evaluation of the number of clusters being a more recent development [38].
3. Probabilistic Space Domain Transformation, Parametrization, and Non-Negative Factorization Sequence Traces
The transformation of data to the probabilistic space allows one to search for differences or similarities in terms of probabilities. This transformation is justified when uncertainty exists in the data estimation. Several techniques involving kernel-based transformations are extensively used in the ML context [3] (p. 10), and well-established smoothing methods are a commonly employed way to obtain good results.
One of the relevant issues in the statistics literature is the determination of the underlying pdf of a sample. In the multivariate case, the data frame containing this sample is a matrix in which $m$ observations are arranged as rows $i$ ($i=1,\dots,m$) obtained under the $j$ ($j=1,\dots,n$) observational criteria. Thus, the observations under the $j$-th criterion form a column vector, while the row vectors provide a description of an item or observation. Additionally, we assume the linear independence of the columns of this matrix.
3.1. Probabilistic Image of a Real-Valued Data Matrix
3.1.1. Real Field Domain
The transformation to the probabilistic space, for the case in which the data frame of observations can be written as the matrix of (1), requires the determination of the underlying pdfs $f_j$, such that the condition in (2) holds. Each column vector is associated with a density corresponding to the observational criterion used to generate the data; these densities are unknown in the general case.
The estimate of each density in (2) using a smoothing technique requires the $j$-th density to be approximated by $p$ mixtures of a kernel density function (kdf), providing the estimate of Equation (3), where $K$ denotes the kdf. This adjustment involves the choice of the shape of the kdf and the selection of the smoothing parameter $h$. This point is controversial, as a small value provides a good estimation at each point but increases the variance, while large values reduce the variance but increase the bias and obscure the underlying data structure. An extended criterion for its selection involves the estimation of the mean integrated squared error (MISE), defined in (4) as the expected integrated squared deviation of the estimate from $f$, i.e., $\mathrm{MISE}(h)=\mathbb{E}\int\big(\hat f(x;h)-f(x)\big)^2\,dx$.
After juxtaposing the densities $\hat f_j$, the result is the stochastic column matrix given in (5), where $\|\cdot\|$ denotes the norm in the Schatten sense.
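As an illustration of this construction, the sketch below builds a column-stochastic probabilistic image in R by smoothing each column with a Gaussian kernel and normalizing. The evaluation of the densities at the observed points, the kernel choice, and the default bandwidth rule are assumptions of this sketch, not prescriptions of the manuscript.

```r
# Minimal sketch: column-wise kernel smoothing followed by L1 normalization,
# yielding a matrix whose columns sum to one (a "probabilistic image").
prob_image <- function(X, grid_size = 512) {
  apply(X, 2, function(x) {
    d <- density(x, n = grid_size)     # univariate Gaussian kdf estimate of f_j
    p <- approx(d$x, d$y, xout = x)$y  # evaluate the estimate at the observed points
    p / sum(p)                         # L1-normalize the column
  })
}

set.seed(1)
X <- matrix(rnorm(200), ncol = 4)      # 50 observations under 4 criteria
P <- prob_image(X)
colSums(P)                             # each column sums to one
```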
This case is univariate; the multivariate case involves considering the rows of (1) as multivariate observations. Smoothing with the kdf is then carried out as in (6), where the matrix of smoothing parameters, usually written as $\mathbf{H}$, receives the sub-index $K$ (referring to the kernel) in order to avoid confusion with the matrix $\mathbf{H}$ used in the NMF context. Additionally, the argument in (6) is a row vector of (1), and the estimate weights the other row vectors by the kdf. $\mathbf{H}_K$ is a positive semi-definite matrix containing the smoothing parameters, as given in (7).
Moreover, the probabilistic image is given in (8). The relationship between (5) and (8) is determined by (9), where the involved matrix satisfies the corresponding normalization condition; thus, pseudo-inverses are taken as in (10), where $\mathbf{I}$ is an identity matrix of suitable dimension.
More details on smoothing techniques have been presented in [39], which contains many examples and the corresponding R code; the multivariate case has been explained in depth in [40]. Additionally, the probabilistic sense of the matrices in (5) and (8) can be formally justified by considering the set of all possible outcomes. The problem can then be stated in terms of the triplet $(\Omega,\mathcal{B},P)$, with $P$ being a measure or probability and $\mathcal{B}$ a Borel $\sigma$-algebra. The map given by the inverse images of the Borel sets defines probabilities with their corresponding distributions; this allows for the estimation of the density $f$ associated with the distribution $P$ that generates the data [41] (Chap. 1).
3.1.2. The Particular Case of the Positive Integer Domain
A classical probabilistic transformation widely used in many contexts consists of estimating the relative frequencies of a data frame obtained from a corpus of documents crossed with a thesaurus of words [42,43]. This approach has been extended to many other cases in which the data domain takes values in the positive integers [44], many of which have been explained in our recent work [45]. The relative frequencies are then the probabilities $p_{ij}$, estimated via the joint probability $\hat p_{ij}=x_{ij}/\sum_{i,j}x_{ij}$.
The Bayes rule then provides the relation in (11). The introduction of matrix notation requires identifying the corresponding matrices; each column of the resulting matrix has an $L_1$ norm of 1. In this case, we obtain (12), with the diagonal matrix as defined in (11). Furthermore, under an additional condition, the transformation (12) can be written as (14).
Transformation (12) is based on Laplace's probability definition, and its use is limited to the case of count matrices or contingency tables. This restriction would greatly limit the use of NMF techniques.
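For the count-matrix case, a minimal sketch of the Laplace (relative frequency) transformation follows; the matrix entries are illustrative only.

```r
# Joint relative frequencies and the column-stochastic image of a count matrix.
X <- matrix(c(3, 0, 1,
              2, 5, 0,
              0, 1, 4), nrow = 3, byrow = TRUE)

P_joint  <- X / sum(X)                    # joint probabilities p_ij
P_column <- sweep(X, 2, colSums(X), "/")  # each column has L1 norm 1

colSums(P_column)                         # all ones
```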
3.1.3. Equivalence
The probabilistic transformations given by (9) and (14) for the cases of positive-integer and real-entry matrices are equivalent, and equality between both results is obtained if the entries of the two matrices are the same. This result can be shown by introducing a triangular kdf in (3) together with a suitable smoothing parameter $h$.
The triangular kdf is based on the triangular function shown in Figure 1. It was investigated in [46,47], where it was stated to correspond to a discrete pdf, whereas it was found to be continuous in [41] (Chap. 13); this question depends on how the variable domain and its support are defined.
The triangular kernel is introduced in (15) following [46,47]. Writing the difference between consecutive points of the grid (i.e., the values at which the density is estimated), each resulting interval contains a unique point. Then, we obtain (17), with the density estimate given in (18), whose numerator counts the number of observations at each grid point. To obtain the probabilistic image, this procedure must be carried out for all the columns of (1).
The multivariate case requires the consideration of a multivariate kernel. According to our literature search, the $d$-dimensional triangular kernel has not been defined; although we think that it would not be difficult to obtain, it remains an unresolved issue. We point out that, once (18) is obtained, the analogous multivariate case can be obtained immediately by introducing the diagonal matrix defined in (11). This matrix corresponds to a uniform distribution, with equal weights assigned to all observational variables. This result demonstrates that the Laplace rule is a particular case of the wider transformation to the probabilistic space.
We call these stochastic matrices the probabilistic image and the column probabilistic image, respectively; they are the transformations to the probabilistic space, or simply the probabilistic transformations, of the data.
3.2. Parametrization
To achieve the factorization of (8) while maintaining a probabilistic sense, one must impose the NMF on the probabilistic image to obtain the matrices $\mathbf{W}$ and $\mathbf{H}$ as the product of (19), which minimizes an objective or cost function. (The standard NMF formulation is usually stated as a product whose divergence from the data is the objective to be minimized [48] (p. 8); in this work, we prefer to formulate it strictly as an approximation problem.) This factorization always exists for non-negative real matrices [8], and it has a probabilistic sense if the normalization conditions of (19) are fulfilled.
The Kullback–Leibler (KL) divergence is chosen as the objective function [49], as given in (21) and (22), where ⊙ is the Hadamard (or element-wise) product and the matrix fraction denotes the quotient of entries with equal sub-indices.
Then, (21) is minimized according to the Karush–Kuhn–Tucker conditions [48] (p. 141), which provide the solutions (25) and (26) [50]. This procedure is a particular case of NMF in which divergences are used; such schemes are known as iterative updates or multiplicative iterative algorithms [48] (Chap. 3). In [51], it was demonstrated that this procedure maintains the normalization conditions during the iterative process.
To adjust the product (19), or approximation (20), it is necessary to select a value for $k$, initialize the matrices $\mathbf{W}$ and $\mathbf{H}$, and then apply an iterative process (switching between (25) and (26)) until the desired approximation degree of (19) is reached.
This procedure of approximating while maintaining matrix normalization involves an approximation between densities. Thus, the factorization acts as a sufficient (not minimal) statistic; roughly speaking, the density contains the full information of the sample. Therefore, the factors are distributional parameters obtained with no distributional assumption, constituting a non-parametric approach (in the sense that no hypothesis is made about the parameters). We also refer to the columns of $\mathbf{W}$ and the rows of $\mathbf{H}$ as components.
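A minimal R sketch of the multiplicative KL updates is given below. The symbols W and H, the fixed iteration count, and the final rescaling that makes the columns of W stochastic are assumptions of this sketch; the manuscript's variant preserves the normalization conditions throughout the iterations [51].

```r
# Minimal sketch of multiplicative KL updates for P ~ W %*% H with k components.
nmf_kl <- function(P, k, iters = 500, eps = 1e-12) {
  m <- nrow(P); n <- ncol(P)
  W <- matrix(runif(m * k), m, k)                        # random initialization
  H <- matrix(runif(k * n), k, n)
  for (it in seq_len(iters)) {
    R <- P / (W %*% H + eps)                             # element-wise quotient
    H <- H * (t(W) %*% R) / (colSums(W) + eps)           # KL update for H
    R <- P / (W %*% H + eps)
    W <- W * sweep(R %*% t(H), 2, rowSums(H) + eps, "/") # KL update for W
  }
  s <- colSums(W)                                        # rescale: columns of W sum to one,
  list(W = sweep(W, 2, s + eps, "/"), H = H * s)         # while W %*% H is unchanged
}
```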
3.3. Convergence of Solutions
The approximation of (19) requires the condition stated in (27); thus, convergence of the iterative approximation process must be proven.
Usually, these proofs are based on the same strategy as the Expectation–Maximization (EM) algorithm, which always converges [52]. Basically, they consist of postulating the existence of an auxiliary function $G$ that majorizes the objective and decreases at each step, which leads to a monotonically non-increasing sequence of objective values. This procedure is similar to that detailed in [7].
Based on this proof, it is usually stated that the factorization (19) with the KL divergence converges to the target matrix, through a simple observation of the expansion of the KL divergence in (22): the first term is a constant, while the second is the log-likelihood. A more general conclusion was derived in [53], which establishes the asymptotic properties of this solution.
An alternative naive proof can be obtained by noting that the divergence (22) takes positive values (divergences are non-negative). Then, for matrices satisfying this condition, Formulas (25) and (26) can be rewritten for each iteration as in (28) and (29), interpreting the involved matrices as updating weight matrices. As all the entries of these weight matrices take values in the unit interval, the same holds for the matrices in (29); hence, the product of the factors is monotonically decreasing, and the difference in (27) also decreases.
A consequence of convergence is an approximation error bound, which requires the consideration of two cases. First, when the dimension $k$ of the space span is not smaller than the rank of the matrix, convergence is well known (a formal proof can be found in [50]).
The case in which $k$ is smaller than the rank of the matrix is known as the low-rank approximation, which is of particular interest in practice. It poses several problems, as reported in the literature; in the context of probabilistic latent semantic analysis (PLSA), it has been pointed out that the convergence limit does not necessarily occur at a global optimum, thus providing sub-optimal results [54].
Our proposal to establish a convergence limit is similar to that used by Schmidt for the approximation theorem [55]. Theorem 1 below is formulated according to the singular value decomposition (SVD) theorem for real-valued matrices [56] (p. 275).
Theorem 1. Let $\mathbf{A}\in\mathbb{R}^{m\times n}$. Then, orthogonal (or unitary) matrices $\mathbf{U}\in\mathbb{R}^{m\times m}$ and $\mathbf{V}\in\mathbb{R}^{n\times n}$ exist such that
$$\mathbf{A}=\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}, \qquad (30)$$
where $\boldsymbol{\Sigma}=\operatorname{diag}(\sigma_{1},\dots,\sigma_{p})$, $p=\min(m,n)$, with diagonal entries $\sigma_{1}\geq\sigma_{2}\geq\cdots\geq\sigma_{p}\geq 0$.
The approximation obtained by retaining only the first $k<p$ singular values is known as the low-rank SVD approximation. Additionally, it is necessary to consider that the trace norm of a matrix is the sum of its eigenvalues (or trace). Thus, by writing the corresponding inequality and taking into account approximation (27) for the eigenvalues of the second term of (30), as the common eigenvalues are the same [56] (p. 277), we obtain the bound (33). Formula (33) provides the error bound for the approximation case in which $k$ is smaller than the rank.
Furthermore, the related bound (34) was demonstrated in [43].
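The following numerical illustration shows the classical Eckart–Young property behind low-rank SVD approximations, which motivates bounds of this type; it is an independent check in R, not a reproduction of (33) or (34).

```r
# For the best rank-k SVD approximation A_k, the Frobenius error equals the
# square root of the sum of the discarded squared singular values.
set.seed(2)
A  <- matrix(runif(30), 6, 5)
s  <- svd(A)
k  <- 2
Ak <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])

c(frobenius_error = norm(A - Ak, "F"),
  theoretical     = sqrt(sum(s$d[-(1:k)]^2)))   # the two values coincide
```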
3.4. Non-Negative Factorization Sequence Traces
In Formula (19), the trace of the factorization leads to a sequence when the value of $k$ varies over the NMF factorization space span, as written in (38) and (39), where the sub-index brackets indicate that the terms are given in increasing order. In Formulas (38) and (39), the quotient leads to a monotonically decreasing sequence. To obtain a reliable sequence, it is necessary to establish the same convergence condition for all of the terms, which requires consideration of the following cases.
If the dimension $k$ of the space span is not smaller than the rank, then convergence is reached during the iteration process. By imposing the same approximation condition on all of the products obtained with different values of $k$, the corresponding inequality holds only in this case.
If $k$ is smaller than the rank, then the convergence error bound given by (33) applies. Thus, it is necessary to impose a wider condition for the approximation of (27), namely a jump in the empirical univariate distribution of the columns of the probabilistic image. As the convergence of the empirical distribution to each component (column) is almost sure (a.s.), the difference is bounded. After taking the maximum difference, a decreasing behavior is observed when dividing by increasing values of $k$.
3.4.1. Trace Sequence Limit Behavior
For sequence (40), obtained from a full-rank matrix, the function of interest represents the similarity between $z$ and the sequence terms in terms of the inverse of the (non-logarithmic) likelihood; this function can be written as (43). By introducing the transformations (44)–(46), with the corresponding Jacobian, we can express (43) as a function of the new variables. Formula (45) is merely a displacement, and Relation (46) transforms the domain to a set with lower bound 1 but no upper bound. This transformation does not change the dimension of the space, as the new variable depends on the original one.
By substituting (52) into (51), and taking into account that $z$ is a constant, we obtain (53). With the sole purpose of facilitating further calculations and avoiding an incomplete inverse gamma function, we change variables, which ensures a suitable variation domain with no effect on the scale (the Jacobian is 1). We then take (54), and Function (53) can be rewritten accordingly.
3.4.2. Expectation of Trace Sequence Limit
To solve (56), we take into account (55). As exponential functions are sufficiently regular to allow interchanging the derivative signs, and considering Formula (55), we introduce the auxiliary function (57). After taking a derivative, (57) becomes (58), whose Laplace transform is given in (59)–(61). The relationship between (60) and (61) provides the recursive formula (62), where we indicate the order of the derivative in parentheses to avoid confusion with the exponents. By reversing the change given by (57), we obtain (63)–(65). The negative signs cancel each other out, and we find (66).
Formula (66) is a solution of Equation (56), and it is the main result of this manuscript. As the Laplace transform can be viewed as an expectation, this relation can be interpreted as stating that the expectation of the limit function, obtained from the sequence of non-negative matrix traces, follows a gamma pdf.
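For reference, the identity behind this interpretation, written here with generic shape and rate symbols $\alpha$ and $\beta$ (not necessarily those of (66)), is the Laplace transform of a gamma density viewed as an expectation:
$$\mathcal{L}\{f\}(s)=\int_{0}^{\infty} e^{-st} f(t)\,dt=\mathbb{E}\big[e^{-sT}\big],\qquad T\sim f,$$
$$f(t)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}\,t^{\alpha-1}e^{-\beta t}\ \Longrightarrow\ \mathbb{E}\big[e^{-sT}\big]=\Big(\frac{\beta}{\beta+s}\Big)^{\alpha},\qquad s>-\beta.$$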
3.4.3. Gamma Parameter Selection
The adjustment of an unbiased (gamma) density to sequence (43) implicitly assumes the constant $c$ given by Formula (54) and imposes values for the shape and scale parameters. Additionally, the expectation can be obtained in closed form, as given in (68). The classical transformation of (66) is introduced in (69), which leads to (70). This is a standard gamma density whose maximum is given in (72), with the corresponding expectation.
Figure 2 shows the effects of the parameters on the gamma pdf.
If (66) and (70) are to reproduce the same shape, it is necessary to adjust the expectation of (68) to the maximum of (72); this yields the parameter relations (73) and (74).
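A small R sketch of this matching step follows; the shape value and the target peak location are illustrative, and the relations mode $=(a-1)/b$ and mean $=a/b$ hold for the shape–rate parametrization.

```r
# Match an assumed peak location t_max by fixing the shape a and solving for the rate b.
a     <- 4
t_max <- 6
b     <- (a - 1) / t_max               # mode of Gamma(a, b) equals t_max for a > 1

x <- seq(0, 20, by = 0.1)
plot(x, dgamma(x, shape = a, rate = b), type = "l", xlab = "k", ylab = "density")
abline(v = t_max, lty = 2)             # fitted mode

c(mode = (a - 1) / b, mean = a / b)    # mode vs. expectation of the fitted gamma
```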
3.5. Relation between Solutions
Another solution of Equation (56) can be obtained by writing (75) and introducing the corresponding notation. The relation between the derivatives with respect to $x$ is given in (76). Following the same reasoning as in Section 3.4.2, the Laplace transforms of the two functions are given in (77) and (79), respectively. Comparing (77) and (79), we find (80), which leads to (81). Equation (81) is a Poisson distribution with the corresponding parameter, and it also solves the differential Equation (56).
The pdfs that solve Equation (56) place the problem in a Bayesian context. The gamma and Poisson densities are conjugate, and the interpretation in this context is immediate when comparing (66) and (81): equality is achieved by introducing a suitable factor, as in (82). The Bayes theorem can then be written as in (83), where the factors are the posterior, the likelihood, and the prior, respectively. We identify these factors in (84)–(86), where the last one involves the minimal sufficient statistic.
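As a reminder, the gamma–Poisson conjugacy invoked here can be written, with generic symbols that are not necessarily those of (82)–(86), as
$$k\mid\lambda\sim\mathrm{Poisson}(\lambda),\qquad \lambda\sim\mathrm{Gamma}(\alpha,\beta)\ \Longrightarrow\ \lambda\mid k\sim\mathrm{Gamma}(\alpha+k,\ \beta+1),$$
which mirrors the identification of posterior, likelihood, and prior factors in (83)–(86).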
4. Moving to Clustering Validation
Transferring the previous results to the clustering validation problem requires some considerations. Clustering denotes the task of assigning each item of the data matrix to a subset, or cluster, based on the geometric or probabilistic properties of the row vectors; formally, it is the map of (87) [4], which provides the cluster assignment. The evaluation of the quality of this assignment in a probabilistic clustering model is stated in terms of the optimization of a loss function, usually the log-likelihood; in particular, the parameters that maximize the log-likelihood are those that provide the best clustering result. It can be immediately observed that (43) is the inverse of the non-logarithmic likelihood. From this point, we obtain the expectation of the limit, and the result is a pdf.
The existence of a pdf allows for the construction of acceptance regions, and its use, with or without disjoint subsets, allows for the division of the original dataset. This fact allows for the definition of acceptance regions for the distributional parameters. This problem, which is classic in the field of statistics, is usually known as hypothesis testing. It is stated over a set of parameter values, certain values of which constitute the null hypothesis or acceptance region; its complement constitutes the rejection of the null hypothesis. In particular, the acceptance region is given in (89) and, for a sample, dependence on the data leads to (90), providing the $\beta$-level confidence interval whose endpoints are the boundaries of the acceptance region. The confidence level is then given in (91). In the literature, this level is usually denoted differently; we use the notation $\beta$ to avoid confusion with the parameters of the gamma pdf. A detailed pedagogical exposition of these concepts, containing many examples, has been presented in [57]; this text deals with point estimation, the construction of hypothesis tests, and the construction of confidence intervals, emphasizing the equivalence between them in a standard statistical exposition.
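A minimal sketch of how such a $\beta$-level acceptance interval can be read off a fitted gamma pdf in R is given below; the shape and rate values are illustrative, and the central-interval choice is an assumption of the sketch.

```r
# Central beta-level acceptance interval for the number of clusters,
# obtained from a fitted gamma pdf and rounded to positive integers.
acceptance_interval <- function(shape, rate, beta = 0.90) {
  lo <- qgamma((1 - beta) / 2, shape = shape, rate = rate)
  hi <- qgamma(1 - (1 - beta) / 2, shape = shape, rate = rate)
  c(lower = max(1, floor(lo)), upper = ceiling(hi))  # only positive integers make sense
}

acceptance_interval(shape = 4, rate = 0.8, beta = 0.90)
```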
Taking into account that only positive integers make sense in the clustering problem, Formula (66) should be rewritten in terms of a suitable column matrix; we follow the statistical convention of omitting the one-dimensional basis.
Moreover, the normalization of the data matrix must be carried out in the same space. As a consequence of smoothing via columns, the norm depends on the rank of the image of the matrix of (1). To achieve accurate results, it is necessary to transform to the same representation space.
This results in a re-estimation of sequence (38). This nuance has no theoretical effect, since the limit of the sequence (43) is still a gamma, but it introduces bias in the estimation of the parameters.
On the other hand, the clustering validation problem makes sense only for positive integers, so the solution needs to be re-scaled for transformation to this space. These results allow the evaluation of the number of clusters for any data frame that can be written in matrix form, with the only restriction being that the observational variables are linearly independent. This makes it possible to evaluate data structures that simultaneously contain continuous and categorical variables, with no distributional assumptions and no requirement of independence on the parameters.
4.1. Computational Remarks
Several parameters affecting the reliability of the gamma density are required. To clarify their effect, we divide the process into three phases: obtaining the sequence of traces, evaluating the limit sequence, and adjusting the gamma density. For the latter, it is necessary to reproduce the shape, especially the maximum, and to have a sufficient number of values in the tail, which is governed by the value of the ancillary statistic ($k$).
4.1.1. NMF Parameters
The first step is to obtain an NMF with random initialization (otherwise, repeating the process would not make sense), varying $k$ from 1 to the ancillary statistic in increasing order. For each factorization, the sequence (43) is computed. This process is repeated, and the results are kept in a matrix. In this phase, the parameters of the data matrix are also obtained, including its dimension, its rank, and $z$. Algorithm 1 provides the details of this process.
As the computational cost required to obtain a reliable approximation of (19) is high, an alternative approach for the trace sequence is to relax the approximation degree and re-compute the terms with $q$ random re-initializations of the matrices in Formulas (25) and (26) (otherwise, the re-estimation would not make sense). Then, by selecting a suitable estimator over the re-initializations, we obtain an estimated sequence.
The convergence condition of (27) is somewhat delicate in practice. A tight condition provides good values when $k$ is at least the rank, but it causes problems in the low-rank case, where the iterative process may not stop; conversely, an overly wide condition provides no representative values. A moderate tolerance is sufficient to achieve reasonable results. Furthermore, deciding the rank of the matrix is not a trivial problem; this is a classical situation in which a problem that is well solved in theory faces difficulties in practice, an issue intensified by the transformation to the probabilistic space. We use the SVD, selecting the significant eigenvalues, for this purpose.
Algorithm 1: Data Parameters and Trace Estimation.
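A structural sketch of this phase is given below; it assumes the nmf_kl() sketch of Section 3.2, and trace_stat() is a stand-in statistic (the KL divergence of the fit), not the manuscript's Formula (43).

```r
# Stand-in per-k statistic: generalized KL divergence between P and its approximation.
trace_stat <- function(P, W, H, eps = 1e-12) {
  A <- W %*% H
  sum(P * log((P + eps) / (A + eps)) - P + A)
}

# Repeat the factorization with q random re-initializations for each k.
trace_matrix <- function(P, k_max, q = 20, iters = 200) {
  S <- matrix(NA_real_, nrow = q, ncol = k_max)
  for (k in 1:k_max) {
    for (r in 1:q) {                       # q random re-initializations
      fit <- nmf_kl(P, k, iters = iters)
      S[r, k] <- trace_stat(P, fit$W, fit$H)
    }
  }
  S                                        # q x k_max matrix of per-k estimates
}
```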
4.1.2. Sequence of Traces
This step involves selecting an estimator for the matrix containing the trace sequences, computed for each value of $k$. Once the estimator is computed, we handle the results using local likelihood regression with the help of the corresponding R package [58]. The selected smoothing parameter has no practical effect on the data. It is also necessary to define a support (grid) that contains the positive integers evaluated at those same points, as detailed in Algorithm 2.
Algorithm 2: Trace Sequence.
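A sketch of this smoothing step follows, using base-R loess as a stand-in for the local likelihood regression package cited in [58]; the median estimator and the span value are assumptions of the sketch.

```r
# S is assumed to be the q x k_max matrix of repeated trace estimates from Algorithm 1.
smooth_traces <- function(S, span = 0.75) {
  k_grid <- seq_len(ncol(S))
  est    <- apply(S, 2, median)                  # a robust per-k estimator
  fit    <- loess(est ~ k_grid, span = span, degree = 1)
  predict(fit, newdata = data.frame(k_grid = k_grid))   # smoothed sequence on the integer grid
}
```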
4.1.3. Gamma Density Adjustment
In this step, we estimate the peak of (67) according to the criteria discussed in Section 3.4.3, with no further complications. Algorithm 3 details this simple procedure.
Algorithm 3: Gamma Parameters.
Input: S. Output: the parameters of the gamma pdf.
4.1.4. Overlapping
The application of the previous theoretical results for the clustering validation is based solely on the hypothesis of the linear independence of the observational variables. However, this situation may be very different from that which occurs in practical situations.
Assuming that the probabilistic image has been obtained from a matrix whose columns are linearly independent (the information contained in the sample is preserved by omitting linearly dependent observational variables), the existence of overlap plays an important role in obscuring the estimation of the parameters of (66), even when the hypotheses required for its derivation are fulfilled. This is because the overlap between variables affects their effective number and the information they contain; in the case of uncorrelated variables, this effect is dramatic. Consider, for instance, a mixture with strong overlap: the overlapping components are barely distinguishable, and the estimated number of components is reduced accordingly. This drawback can be corrected by introducing the overlapping effect as a factor.
Furthermore, in multivariate structures, the correction must consider how the information is distributed: it can lie within variables, between variables, or both. This requires a quantity describing the distributional behavior of the information (when the number of non-overlapping informative variables is small, the information is mainly contained within each variable; otherwise, the larger number of variables provides the main source of information between them). This suggests the introduction of the correction factor of (96), which depends on the maximum overlap and on the determinant of the correlation matrix; a determinant equal to one indicates non-correlation, while smaller values correspond to correlation. This correction is also described in Algorithm 4.
Algorithm 4: Overlap and Correlation.
On the other hand, we indicate that this correction requires the choice of a value for the overlap parameter, and the choice of the distributional criteria is an open question.
5. Examples
The examples have the purpose of evaluating the pros and cons of the proposed method. Before performing experiments, we provide a classical toy example related to the classification of documents.
Example 1. Consider a co-occurrence data frame consisting of a corpus of seven documents containing letters, which we assimilate to words in a thesaurus, together with its probabilistic image. The objective is to find the suitable number of subjects that classifies the documents, the number of subjects being unknown.
Simple visual inspection of the data frame suggests three or four clusters. We first perform NMF clustering for several choices of the number of components.
The resulting factor matrix represents the probabilities of documents versus model components or clusters, which are assimilated to its columns. Once obtained, the qualitative matrix is a probabilistic classification per cluster; in each column, dashes mark probabilities close to zero. This matrix is usually written as lists in decreasing order, yielding a probabilistic classification.
Figure 3 shows the gamma pdf that assigns credibility to the model components (or the number of clusters) used to perform the NMF, together with its fitted parameters and the corresponding confidence β-level acceptance interval. From the factor matrix, it is possible to obtain a disjoint classification through the introduction of a Bayes classifier [43]; the resulting qualitative matrix contains the documents classified according to their weights in the span space. This procedure illustrates the equivalence of NMF and k-means clustering. Although NMF clustering is related to probabilistic and k-means clustering, the construction of the gamma pdf rests on no hypothesis beyond the construction of a statistic, and it does not depend on the clustering method.
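A minimal sketch of the hard (disjoint) assignment step follows; the name of the membership matrix is illustrative, since the manuscript's symbols are not reproduced here.

```r
# Assign each document to the component with the largest membership probability.
assign_cluster <- function(M) apply(M, 1, which.max)  # M: documents x components
```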
To examine the examples, we also compared the proposed criterion with indices that provide good results: the Dunn and silhouette indices and the gap statistic. These indices were chosen due to their widespread use in the related literature and the good results they provide; notably, they return a single value. To make comparisons possible, we took the maximum of (66) as the reported value. We also considered the number of labels for each dataset or the number of distributions that generate it.
The Dunn index is a long-standing internal evaluation index for identifying compact sets of clusters with small within-cluster variance [59]. The silhouette index evaluates the compactness and separation of clusters [25]. The underlying idea of the gap statistic is to obtain the expected within-cluster dispersion for k clusters, each with r elements, and to compare it with that of a reference distribution [29]. These indices assume the optimization of a loss function related to the clustering method; we assume an underlying k-means partition schema.
We use the clValid package of the R computing environment to obtain the Dunn and silhouette indices [60]. For the gap statistic, we use the factoextra package [61] with a brute-force procedure consisting of a loop varying the number of clusters.
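For orientation, the sketch below computes two of these baselines with the cluster package as a stand-in for the clValid/factoextra calls used in the paper (the interfaces differ); the dataset, K.max, and B values are illustrative.

```r
library(cluster)
set.seed(3)
X <- scale(iris[, 1:4])

# Gap statistic under a k-means partition schema
gap <- clusGap(X, FUNcluster = kmeans, K.max = 8, B = 50, nstart = 10)

# Average silhouette width for each candidate k
sil <- sapply(2:8, function(k) {
  cl <- kmeans(X, centers = k, nstart = 10)$cluster
  mean(silhouette(cl, dist(X))[, "sil_width"])
})
which.max(sil) + 1   # k maximizing the silhouette criterion
```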
We conducted experiments in two contexts. The first involved illustrative examples: we selected four well-known datasets of reduced size from the UCI repository (https://archive.ics.uci.edu/datasets, accessed on 2 September 2023) to illustrate the validation ability of the proposed criterion. The other examples possess more complex data structures (first used in [20] and available at http://cs.joensuu.fi/sipu/datasets, last accessed on 2 September 2023); with them, we examined the validation criterion against overlapping, dimension variation, and different numbers of observations per cluster.
5.1. Simple Examples
The four datasets from the UCI repository are briefly described in Table 1. The selection criterion was the absence of missing values, in order to avoid a pre-processing step.
The iris dataset is one of the most popular and best-studied datasets, and it has been used in many statistical studies. It contains three species of labeled flowers, and its original use can be attributed to Fisher. A discussion of how it was obtained and of the conditions under which various authors attribute different results (two or three categories, while some report four) can be found in [62]. The ecoli dataset was proposed to illustrate the unsupervised classification of biological functions [63]. The seeds dataset was first used to analyze several clustering algorithms on a real dataset of seed images [64]. The glass dataset allows for the study of the identification of seven types of glass and the effects of certain additives. Figure 4 provides a graphical description of the datasets.
Table 2 shows the results obtained for the considered validation indices in terms of the estimated number of clusters; the value reported for our criterion refers to the maximum of Formula (66).
The results for the glass (a) dataset assume the selection of only the first columns of the data frame. The glass dataset contains seven classes of glasses, identified by labels, and there are additional treatments consisting of additives; despite the quantities not being the same, it is easy to identify them (columns 8 and 9 of the data frame) as binary variables. Table 3 presents both results for this dataset, and Figure 5 presents one case alone. All results obtained with our validation criterion were similar to those of the gap statistic; however, the Dunn and silhouette indices and the gap statistic did not capture the presence of the binary variables. This illustrates, from a practical point of view, an advantage of our less restrictive assumptions.
A comparison is presented in Table 3. Figure 5 shows the behavior of the pdf compared with the non-parametric curve obtained with Equation (43), in which the relation between both maxima can be seen.
Figure 6 shows the NMF clustering of the iris dataset. This dataset is useful to illustrate the independence of our validation criterion from the clustering method. To obtain the density of Equation (46), we used far fewer iterations in the switching process to adjust (34), together with a number of re-estimates, than were needed to obtain the classification of Figure 6. This shows that the underlying matrices of (40) do not necessarily constitute a plausible clustering and cannot be used to obtain the corresponding classification (except when a good approximation of (30) is reached). This is due to the different convergence speeds of the adjustment process for obtaining the factorization and for obtaining the trace sequence.
5.2. More Complex Data Structures
To show the effects of overlapping, dimension variation, and different numbers of observations per cluster, we selected several synthetic datasets created in [20] to study the behavior of validation criteria (available at http://cs.joensuu.fi/sipu/datasets, last accessed on 2 September 2023). These datasets were generated using a certain number of distributions, in which the overlaps and the number of generating distributions vary.
The datasets include artificial configurations in which the overlapping effect increases, the number of clusters varies, and the dimension varies; these datasets are balanced (i.e., they contain an equal number of entities in each cluster). An additional dataset allows us to examine the effect of different numbers of observations per cluster.
To proceed with the examples, we selected the same parameters for each case: the number of component estimations, the number of re-estimations, no condition on the degree of approximation, and the number of iterations in the process of switching between Equations (25) and (26). To determine the rank of the matrix, we considered the number of relevant eigenvalues with the help of the condition number (i.e., the quotient with respect to the largest involved eigenvalue).
Running Algorithms 1–3, we obtain the corresponding parameter estimates. These values are acceptable if there is no overlapping between the observational variables; when overlapping exists, it is necessary to adjust them according to Formula (96), providing the new values reported in Table 4.
Figure 7 shows the biplots of the data and the corresponding distributions for estimating the credibility of the number of clusters according to the proposed criteria.
5.3. Comparative
Comparisons illustrate the pros and cons of the proposed method versus the Dunn, silhouette, and gap statistic indices. In all cases, the most suitable value is chosen to minimize the corresponding loss function. The number of re-sampling steps in the bootstrap procedure used to estimate the reference density for determining the gap statistic was
. Those cases corresponded to the default parameters of the used R packages. Additionally, as in the previous examples, we took the value of the maximum of the gamma pdf for comparison with the other indices, with the results summarized in
Table 5.
A quantitative comparison of the results can be carried out using the Adjusted Rand Index (ARI) [65], which is defined in terms of the number of clusters $k$ and its expectation and measures the similarity of clustering results. To avoid negative values, and to be consistent with the metric used in the non-negative context, we define a relative error in terms of the maximal and minimal optimal numbers of clusters provided by a method. This index takes values in the unit interval, and values close to zero are preferable. The results are shown in Table 6.
On the other hand, we provide the corresponding confidence levels, assuming that the reported number of densities (D) is the correct value. These examples show that the proposed validation criterion does not assume any partition or fuzzy scheme; it evaluates the number of clusters and provides no classification or clusterization.
It can be observed that the proposed procedure captured overlapping, providing underestimation in these cases, and exhibited stability when the dimension increased (compared with D). Furthermore, the procedure works well in the unbalanced case. A discussion of the computational cost of obtaining the NMF solutions can be found in [50], and some useful tricks are explained in [45].
Also, while the examples show the behavior of the proposed validation method and open the door to statistical inference, they do not constitute a numerical confirmation. Confirmatory numerical experiments would require a study including real-world examples and other recent validation indices, such as the Wemmert–Gancarski index [66], SpecialK [67], and Stadion [18], in order to establish richer conclusions.
6. Discussion
Usually, the gamma density is obtained as a sum of exponentials [41] (p. 179). It is possible to prove that the independence of sums and sums of squares implies that the coefficient of variation follows this distribution [68,69], which is also true for the harmonic mean of the posterior distribution [70]. Recently, [71] focused on reducing the variance in such cases. Our approach aims to evaluate the expectation of a trace sequence in the probabilistic space; in fact, the Laplace transform confirms this interpretation. The same result can be reached by taking derivatives and carrying out the normalization of (56); however, we minimized the statistical assumptions and procedures needed to achieve it. Additionally, we present an approach for applications in the context of clustering validation.
Our approach imposes no hypotheses on the space of parameters (neither on the distribution nor on the dimensionality of the parameters) and therefore constitutes a non-parametric approach. One of the advantages of such an approach is that it supports any type of variable. The only hypothesis, in this sense, is the independence of the observational variables. Conversely, the transformation to the probabilistic space and the use of the KL divergence provide a maximum-likelihood estimate. However, some issues need to be highlighted.
First, the main drawback of this method seems to be that the NMF approach is a slow iterative process with high computational costs; we alleviated this through the introduction of several random re-estimations. Another problem is that the gamma pdf is continuous, while the clustering problem makes sense only for non-negative integers. Additionally, the type of smoothing used in this method implicitly assumes Euclidean distances, while the approximation problem requires a different norm. Although some works have focused on this issue [72,73], according to our literature search, it has received little attention in recent years. We recall that this choice is critical to reproduce the data structure and is related to the variance.
Although, from a practical viewpoint, the provided examples do not fully validate our clustering validation criterion, they do provide a practical framework that illustrates several issues. Our proposal worked well for the UCI Machine Learning Repository datasets, but we also selected some of the synthetic datasets developed by Fränti to illustrate certain problems, as the effects of superposition, stability, and size could be controlled there. In this sense, we believe that the greatest difficulties arise when attempting to establish conditions to adjust the model to cases in which compactness is high.
We did not apply the proposed procedure to other data structures, such as those with well-defined geometric shapes, for which available methods such as CLARANS [74] offer good results and have been extended to the Big Data context, although we think that exploring this would be worthwhile. However, according to the experiments in Section 5, our results are comparable to those provided by the selected indices.
Moreover, the provision of a pdf offers several advantages. In unsupervised environments, a graph can be easily interpreted by human operators without analytical skills, allowing them to incorporate other results into the analysis as part of their expert judgment. Expert judgment does not always coincide with the statistical results, leading to controversial situations; however, it comes from many years of studying a discipline or practice, and it should not be disregarded. We believe that such controversies should be placed in the area of the selection of relevant observational variables [75].
Without entering into this discussion, which is philosophical and profound, we merely indicate that graphical visualization can provide relevant considerations in practical situations, and a contribution of the proposed method is the visualization of densities, thereby providing relevant graphical information.
Future research should include our proposal in a more exhaustive comparison, including artificial datasets that reproduce shapes. The effect of superposition is another avenue to explore.
7. Conclusions
Although inference on clustering is controversial, a pdf was built from the sequence of traces obtained with NMF techniques, whose construction requires no assumptions beyond linearly independent, uncorrelated observational variables. Thus, after transforming the observations to the probabilistic space, the expectation of the limit of the sequence of traces, obtained by varying the dimension of the space span, follows a gamma density. This is the main result of this manuscript.
This result allows us to assign credence to clustering results regardless of the method used. To this end, we established a bound for the approximation error between the matrix of observations in the probabilistic space and its approximate factorization in the low-rank case.
Our proposal allows non-specialists to visualize the results in a fully unsupervised validation environment through a single plot of the adjusted gamma density. Additionally, in the context of Big Data and Computer Engineering, an interval of plausible estimates seems more advantageous than a single value. This result allows quantitative discussions between different validation results and, in practical situations, allows verifying whether the selection of observational or experimental criteria is correct.