Incorporating Clustering Techniques into GAMLSS

: A method for statistical analysis of multimodal and/or highly distorted data is presented. The new methodology combines different clustering methods with the GAMLSS (generalized additive models for location, scale, and shape) framework, and is therefore called c-GAMLSS, for “clustering GAMLSS. ” In this new extended structure, a latent variable (cluster) is created to explain the response-variable (target). Any and all parameters of the distribution for the response variable can also be modeled by functions of the new covariate added to other available resources (features). The method of selecting resources to be used is carried out in stages, a step-based method. A simulation study considering multiple scenarios is presented to compare the c-GAMLSS method with existing Gaussian mixture models. We show by means of four different data applications that in cases where other authentic explanatory variables are or are not available, the c-GAMLSS structure outperforms mixture models, some recently developed complex distributions, cluster-weighted models, and a mixture-of-experts model. Even though we use simple distributions in our examples, other more sophisticated distributions can be used to explain the response variable.


Introduction
Multiple regression models in which the response variables are generated by distributions that belong to a family of parametric probability distributions have been widely used. The goal is to explain the behavior of a response variable, denoted here as Y. The increasing ease with which data can be collected and stored allows for quick construction of databases of ever-increasing sizes. With large volumes of data available for study, patterns of greater complexity are being observed, forcing the search for more flexible probabilistic (regression) models to deal with such patterns. Examples of such complexities include asymmetry around central values, the presence of a high excess kurtosis, multimodality, and other aspects of separate subgroups with little variability. These anomalies, common in a large database, can be caused by the effect of latent variables or latent categories, which are variables or categories that are not observed or collected.
For example, consider the weight distribution of a certain animal species for which the sexes of the sampled units are not recorded. A possible difference between the average weights of males and females may lead to a bimodal distribution of frequencies. Other aspects not noted in the database may also be responsible for additional anomalies. Another common issue in establishing species profiles occurs when the variabilities of the two subgroups determine the corresponding profiles. If latent characteristics are not considered, marginal distributions of the response variables can produce anomalies such as those illustrated in Figure 1-(a) bimodality with little variability between subgroups; (b) high asymmetry on the right with little difference between location parameters. This high asymmetry can be explained by the variability differences among subgroups. Several methods of handling these kinds of behavior are described in the literature. It is common for analysts not to have access to features that may be causing these distortions in the target variable. Usually, it is assumed that the response variables have bimodal and/or skewed distributions. New overly complex probabilistic distributions are being developed to deal with these problems, such as in [1][2][3]. When one uses these new distributions to explain such distortions, one is only considering the marginal distribution of the target variable, and not the possibility that they might be due to one or more additional features.
Mixture distributions, such as in [4], are a promising alternative approach. The simplest way to use this kind of model in problems such as those in Figure 1 is to consider the observations of the response variable to have been generated by two distinct distributions that although they belong to the same family of distributions, have different values between their respective parameters. Mathematically, this can be written as where p represents the proportion of observations generated by distribution ω 1 , (1 − p) is from ω 2 , and ξ 1 and ξ 2 are parameter vectors. For instance, if both generators are considered to follow Gaussian distributions, then we have ω 1 ∼ N(µ 1 , σ 1 ) and ω 2 ∼ N(µ 2 , σ 2 ). Some other well-known options for dealing with such behavior available in the literature are the class of mixture-of-expert models (MoE) [5] and cluster-weighted models (cwm) [6]. The present work provides another alternative approach in which clustering techniques are used early in the modeling process. The use of these tools has the following objectives: (i) identify different clusters in the data set from a possible latent explanatory variable (e.g., in a bimodal data set, two different clusters would be created, producing a new dummy variable, while a trimodal data set would lead to three clusters, generating a factor with three levels, and so on); and (ii) include this new covariate in a regression model to explain the specific behavior observed in the response, as in [7]. In our approach, we use the GAMLSS framework, short for "generalized additive models for location, scale and shape" [8] with simple distributions. As stated in [9] the use of simple distributions with highly sophisticated regression-type models may return reliable and easily interpreted results.
The paper is organized as follows. In Section 2, we briefly describe some clustering methods that could be used in the proposed methodology. In Section 3, we incorporate clusters as latent variables into the GAMLSS framework. In Section 4, we perform some simulation studies to validate the proposed modeling process. In Section 5, four different applications are discussed. In the first two, no extra covariates are available, and we compare the proposed methodology to both mixture models and sophisticated distributions. In the others, authentic explanatory variables are also considered, and we compare our approach with cluster-weighted regression models and the mixture-of-experts model. Finally, Section 6 contains concluding remarks.

Clustering Methods
In this section, we present a brief summary of four well-established clustering methods that will be considered to identify different groups in both Sections 4 and 5. Nevertheless, it is worth noting that any clustering method may be applied in the proposed framework in this paper (e.g., the ones presented at https://cran.r-project.org/web/views/Cluster.html, accessed on 1 July 2021), as we show in the last two applications in Sections 5.3 and 5.4.

k-Means Clustering
The k-means clustering method is the most well-known and popular grouping method in the literature. The basic idea of the k-means method is to minimize intra-cluster variation. There are several algorithms available to minimize that variation, of which the best-known were pioneered by [10,11]. Here we focus only on the algorithm available in [11], which defines the total intra-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid ∑ y i ∈ C k (y i − µ k ) 2 , where y i is a point that belongs to cluster C k and µ k is the mean value of the points assigned to the cluster C k . The main idea of this method is to minimize J(C) = ∑ K k=1 ∑ y i ∈ C k (y i − µ k ) 2 , the sum of the squared distances between points and their respective clusters' centroids over all K clusters. This is known to be an NP-hard problem [12].

Ward's Hierarchical Clustering
Ward's hierarchical clustering method is the only one among the agglomerative methods that is based on a sum-of-squares method and produces groups that minimize intra-group dispersion at each binary fusion [13]. The function to be minimized is given by As can be seen in [13], Ward's method shares the sum-of-squares minimization criterion with k-means partitioning. A full flowchart regarding the hierarchical grouping procedure is available in the original paper by [14].

Fuzzy Clustering
Fuzzy clustering allows a given observation (or an individual) to belong to more than one cluster, generalizing partitioning methods such as k-means [15]. Here, the objective function to be minimized is where u ik represents membership of object i in cluster k, d(i, j) are the dissimilarities and n is the number of observations. The considered objective function may vary in this method, however, as mentioned above, any clustering method (and objective function) can be used in the proposed methodology.

Model-Based Clustering
Model-based clustering is based on finite Gaussian mixture modeling [16] and may be achieved using an EM (expectation-minimization) algorithm as described in [17]:

1.
A maximum number K of clusters is defined and a mixture model is considered; 2.
A hierarchical agglomeration to approximately maximize the classification likelihood for each model is carried out, followed by obtaining classifications for up to K groups; 3.
The EM algorithm is performed for each model and each number of clusters; 4.
A goodness-of-fit measure (such as Bayesian information criterion) is computed for the one-cluster case for each model and for the mixture model based on the estimates obtained in Step 3.

Incorporating Clustering into GAMLSS
As highlighted in Section 1, many different sophisticated statistical distributions have been proposed to deal with some specific patterns presented in the marginal distributions of the response Y. In this paper, we use simpler distributions, modeling not only the mean (location) parameter µ in terms of the latent variable (clusters) and any other known features (explanatory variables) through the GAMLSS (generalized additive models for location, scale, and shape) framework [8], a generalization of both well-established generalized linear models (GLM) [18] and generalized additive models (GAM) [19].
Generically, let Y ∼ D(θ), i.e., Y follows any distribution (that does not necessarily belong to the exponential family) with parameter vector θ. For an extensive list of distributions used in GAMLSS, see [20]. Clustering GAMLSS (c-GAMLSS), which considers different clusters obtained through one of the methods described in Section 2, can then be described as follows: where g r (·) denotes an appropriate link function for parameter θ r , usually determined by the range of the parameter, m r denotes the number of explanatory variables related to the rth parameter, X r is a known n × (m r + 1) model matrix, β r = (β 0r , β 1r , . . . , β m r r ) is a parameter vector of length (m r + 1), s jr (·) are smooth functions of x jr (e.g., p-splines [21]), C k denotes latent variables (clusters) and ψ kr is a parameter vector of length (K − 1). Note here that any and all of the parameters of the response distribution may be modeled as functions of a given set of explanatory and latent variables. As in the classical GAMLSS framework, it is possible to consider interactions between covariates and varying coefficient terms (as introduced in [22]), where the varying coefficient function fits separate smooth curves against X for each cluster [23]. Finally, model (2) can be seen as a flexible expertnetwork mixture of experts [24]. However, in this approach, the mixing proportion does not depend on the covariates. Due to the high flexibility of GAMLSS, there are multiple techniques that may be used to select and fit a suitable model for the response variable considering the available covariates. They are extensively described in [23]. In the c-GAMLSS framework, we can use the following steps in the model-selection process:

•
Step 1: Select a clustering method (e.g., among the ones presented in Section 2) to create different groups (estimating the number of clusters can be achieved using one of multiple available strategies. See, for instance, [25]). • Step 2: Select subsets of authentic and latent variables for each of the parameters of the response-variable distribution, using any of the applied strategies for GAMLSS models [23]. The most common procedure is a stepwise procedure called "Strategy A" [23,26,27]).
Strategy A uses a goodness-of-fit measure to select the most suitable model to describe the data being studied. One may say that the Bayesian information criterion (BIC) [28] would perform better than the Akaike information criterion (AIC) [29] in the c-GAMLSS framework [30,31]. However, since we may consider smoothing functions within model (2), BIC can lead to underfitting (i.e., oversmoothing) [32]. A richer discussion regarding this topic can be seen in [27]. Furthermore, as in MoE, an important issue to be addressed in the future is whether a given covariate may be (or not) considered in either or both of the above-mentioned steps.
If only latent variables (clusters) are used to explain the behavior of a given response, then model (2) reduces to which can be seen as a mixture model [24] with no covariates affecting the mixing proportion. It is worth mentioning that the new coefficient ψ 1r in (3) is the intercept associated with each of the regression structures related to the first cluster obtained from any of the methods described in Section 2. In model (2), the intercept is the first element of each vector of parameters β r . As a straightforward example, letting Y have a Gaussian distribution, with vector of parameters θ r = (θ 1 , θ 2 ) = (µ, σ) in (3), results in the following: Note that we choose appropriate link functions for both parameters (identity and logarithm functions for µ and σ, respectively). See [23] for more information.
Regarding the inference processes, all model parameters can be estimated by the penalized maximum-likelihood method through the Rigby and Stasinopoulos (RS) and/or Cole and Green (CG) algorithms described in [8,33]. As stated in [23], both algorithms are stable and fast using simple starting values, such as constants.
Implementation of fitting a c-GAMLSS model (including or not extra covariates and considering different response-variable distributions) is easily achieved using a new generic function called gamlss.cluster(), based on the gamlss package [33] for the R software environment [34] and available at https://git.io/JtOZW (accessed on 1 July 2021).

Simulation
In this section, we conduct some Monte Carlo simulation studies to understand the behavior of the c-GAMLSS framework based on the Gaussian distribution and compare it to the usual Gaussian mixture model approach, considering four different scenarios (marginal response-variable shapes) where each is composed of two different clusters.
All observations were generated in the R software environment via the rnorm() function. The averages (µ) and standard deviations (σ) for each cluster and scenario are reported in Table 1 and the resulting empirical densities (considering both scenarios together) are displayed in Figure 2.  For each scenario, we evaluate the maximum-likelihood estimates (MLEs) of the parameters for both approaches and then, after all replications, we compute the biases and mean squared errors (MSEs) based on the average estimates. In order to obtain the MLEs for the mixture models, we are using the normalmixEM() function available in the mixtools package for the R software environment, which returns the best fitted model after five attempts. All results here are obtained from 1000 Monte Carlo replications. Here we are considering two different sample sizes for each cluster, 50 and 300, totaling n = 100 and n = 600 observations for each data/replication. Results are displayed in Table 2.  (4), in c−GAMLSS µ 1 = ψ 11 , µ 2 = ψ 11 + ψ 21 , σ 1 = ψ 12 and σ 2 = ψ 12 + ψ 22 , where ψ 1r and ψ 2r , r = 1, 2, are coefficients associated with the first and second clusters, respectively.
Please note that we have five different parameters in Table 2: µ 1 , σ 1 , µ 2 , σ 2 and p. Regarding the mixture model, all parameters were already introduced in Equation (1) and refer to the mean and standard deviation of the first cluster, mean and standard deviation of the second cluster and the proportion of observations that will be modeled by the first Gaussian distribution, respectively. However, we note here that we temporarily reparameterize the c-GAMLSS model in this subsection to compare the performance of the two approaches, since they will have the same interpretation. This reparameterization, based on model (4), is given as follows: µ 1 = ψ 11 , µ 2 = ψ 11 + ψ 21 , σ 2 = ψ 12 , σ 2 = ψ 12 + ψ 22 and p is the proportion of observations in the first cluster obtained through the best clustering method available in Table 1 for each of the scenarios (Scenario A: k-means; Scenario B: Fuzzy; Scenario C: model-based; and Scenario D: model-based).
In Scenario A we may note that the c-GAMLSS method clearly performs better than the Gaussian mixture models for both sample sizes considered (nevertheless, as expected, biases and MSEs decrease for both methods in the greater sample). For Scenario B (n = 50 in each cluster), we may note that the absolute biases for the standard deviation (σ 1 and σ 2 ) estimates are greater for the c-GAMLSS method; however, the MSEs are drastically smaller, indicating a better accuracy. Considering n = 300 in each cluster, all results are clearly favorable to the new proposed alternative. In Scenario C, the MSEs obtained for σ 2 by the mixture model are 6.67 and 21.99 times greater than the one returned by the c-GAMLSS method for n = 50 and n = 300 in each cluster, respectively. Finally, for Scenario D, the MSE returned by the Gaussian mixture when n = 300 in each cluster for the standard deviation σ 2 was almost 130 times greater than the one returned by c-GAMLSS. We can conclude that our proposed methodology clearly outperformed the already well-known Gaussian mixture model in all simulated scenarios.

Applications
In this section, we provide two applications comparing the adequacy of the c-GAMLSS framework, Gaussian mixture models, and some sophisticated bimodal distributions for marginally modeling the response variable (i.e., not considering any further explanatory variables), and another two applications where extra authentic covariates exist, comparing c-GAMLSS to the mixture-of-experts models (MoE; Gaussian parsimonious clustering models with covariates) and cluster-weighted models (cwm). In order to compare these approaches, both AIC and BIC statistics are calculated.
The flexible recently proposed four-parameter distributions considered in the first two applications, where only the target variable is available, are: the odd log-logistic exponentiated Gumbel (OLLEGu) [1], extended generalized odd half-Cauchy log-logistic (EGOHC-LL) [2], and exponentiated log-sinh Cauchy (ELSC) [3] distributions. All parameter roles and their respective ranges from each of these distributions are described in Table 3.

Actuarial Sciences Data
In the first application, we present a data set already studied in [1], where the authors compare their new model with some of its sub-models and other alternative distributions. The data consist of the lifespans (in years) of 280 retired women covered by the Mexican public health-insurance system who had temporary disabilities and died during 2004. The data are more fully reported in [35]. Table 4 contains the MLEs, their respective SEs, and the AIC and BIC statistics for all fitted models. The c-GAMLSS framework based on the Gaussian distribution clearly performs better than all other approaches, yielding the lowest AIC and BIC values (1788. 35 and 1802.90, respectively). Please note that in this particular case, the latent variable (clusters) was considered only for the parameter µ, i.e., both clusters present the same dispersion (σ), and hence no estimates for ψ 22 are displayed. This was achieved through the stepwise method (Strategy A), which also selected the k-means clustering method as the most appropriate to divide these data. The Gaussian mixture model did not perform well when compared to the other approaches (Table 4), returning the highest AIC and BIC values among all models (2116.10 and 2134.30, respectively), presenting a roughly unimodal curve in Figure 3a. Moreover, the c-GAMLSS method was able to precisely identify two different clusters (red and blue curves). Although the OLLEGu distribution was elected as the best one to describe these data according to [1] when compared to some other models, we can see that the EGOHC-LL distribution, proposed two years earlier by [2], returned better AIC and BIC statistics (2091.69 and 2106.23, respectively). Their fitted densities are displayed in Figure 3b.

Ozone Data
The second application corresponds to daily ozone-level measurements (in ppb = ppm × 1000) collected in New York during 1973 and is available in [36]. As with the previous application, we provide in Table 5 the MLEs, their corresponding SEs and AIC and BIC statistics. Once again, the proposed methodology is the best alternative to explain the behavior of the data since it returned the lowest AIC and BIC values (936.18 and 947.78, respectively) and the Gaussian mixture approach performed poorly. In this application, the c-GAMLSS framework based on the Gaussian distribution uses the model-based clustering selected via Strategy A, where the generated latent variable was included in both regression structures (µ and σ). Table 5. MLEs of the model parameters for the ozone data, the corresponding SEs (given in parentheses), and the AIC and BIC statistics.

Model
Estimates These data were also analyzed in [2], where the EGOHC distribution was the model selected when compared to some of its sub-models and other distributions. Nevertheless, we can see that the OLLEGu model returns better AIC and BIC values (1054.41 and 1065.28, respectively). Finally, all fitted densities are displayed in Figure 4.

Two-Moon
The two-moon is a well-known synthetic data set introduced by [37] which can be obtained in the clusterSim package in the R software environment, and is mainly used in studies regarding clustering methods. The generated data consist of two variables: the response Y and an authentic covariate X, both defined on the real numbers. The relationship between Y and X, as well as the two clusters (in red and black colors) are shown in Figure 5. We may note a clear nonlinear relationship between response and the explanatory variable, and hence clustering algorithms that consider linear separations may not be suitable for this example. Here we are considering the cluster classification already available in the data set [37]. Furthermore, this nonlinear relationship indicates that the use of smoothing functions in the regression structure would be appropriate, and the shape of such relationship may depend on each cluster, so considering an interaction between each cluster and the explanatory variable X may also be necessary. Now we compare c-GAMLSS based on the Gaussian distribution, considering the presence of an authentic explanatory variable, against three different models: (i) fitting a regression structure only for the location parameter, with a constant variance (reducing to a c-GLM); (ii) incorporating smoothing functions in this regression structure (reducing to a c-GAM); and (iii) considering a mixture-of-experts model (Gaussian parsimonious clustering models with covariates, obtained using the MoEClust package for the R software environment). We shall highlight here that due to the behavior observed in Figure 5, in the c-GLM structure, we are considering a quadratic term and interactions between covariate X and the clusters, whereas in both c-GAM and c-GAMLSS, we fit a varying coefficient term (pvc).  . We emphasize here that the BIC criterion highly penalizes models that consider smoothing functions to explain a given response-variable parameter. However, as can be seen in Figure 6, such functions are quite important to explain the nonlinear effect of X on each of the response-variable parameters (mean µ and standard deviation σ). Table 6. Regression structures for all fitted models applied to the two-moon data and their corresponding AIC and BIC statistics.

Model
Regression Structure AIC BIC

Eruption Data
In this last application we revisit the well-known Old Faithful Geyser data from Yellowstone National Park in Wyoming, USA [38]. This data set contains 299 pairs of measurements regarding the times between the beginnings of successive eruptions, continuously collected over the first 15 days of August of 1985, and can be obtained in the MASS package for the R software environment. The explanatory variable X, which represents eruption duration, can also be used here to explain the response variable.
Here, we apply the full model displayed in Equation (2) to these data, disregarding the smoothing functions, i.e., the eruption duration X is treated as a linear predictor of the c-GAMLSS model. Furthermore, in order to show the great flexibility of this approach, we now consider a more complex distribution (apart from the celebrated Gaussian) to explain the target variable. In the classical GAMLSS context, when Y > 0, one of the most-used distributions is the Box-Cox Cole and Green (BCCG), a three parameter distribution where µ > 0 is the median, σ > 0 is the approximate coefficient of variation and −∞ < ν < ∞ is the skewness parameter. For further details regarding the BCCG, see [20].
We compare the results obtained by the c-GAMLSS framework based on both the Gaussian and BCCG distributions with a cluster-weighted model (cwm) [6], the estimates of which were obtained using the cwm function in the flexCWM package. All model structures and their respective AIC and BIC statistics are presented in Table 7, where we see that c-GAMLSS based on the BCCG distribution provides a better fit than the other two alternatives since its AIC and BIC values are the smallest (1840.6 and 1870.2, respectively). A visual comparison of all models is provided in Figure 7 (the graphical representation of c-GAMLSS based on both Gaussian and BCCG distributions overlap). Finally, we also present in Figure 8 the term plot for each effect fitted, i.e., the effects of each explanatory variable on the model parameters. Table 7. Regression structures for c-GAMLSS (based on the BCCG and Gaussian distributions) and cwm models applied to the eruption data and their corresponding AIC and BIC values.

Conclusions
This article introduces c-GAMLSS-clustering generalized additive models for location, scale, and shape. It is an alternative model to deal with highly distorted data and even though only bimodal responses were considered in this paper, this approach can also be applied in multimodal response problems. It is based on adding clustering techniques to the celebrated GAMLSS framework. It provides a new generic function implemented in R, a widely used statistical software. This new approach helps to make the fitting process more reliable, outperforming Gaussian mixture models, as illustrated in both simulation and real studies. Moreover, c-GAMLSS based on quite simple distributions returned better (smaller) AIC and BIC values than highly complex recently proposed distributions, cluster-weighted models and a mixture-of-experts model. Nevertheless, more sophisticated distributions may be used with c-GAMLSS for further applications. Data Availability Statement: Actuarial sciences, ozone and moon data can be found in [35][36][37], respectively. The eruption data may be obtained in the MASS package for the R software environment and further information may be found in [38]. The new generic function for use in R is made available by the authors at https://git.io/JtOZW (accessed on 1 July 2021).