Abstract
In this paper, we present a new two-component unit Weibull mixture model. An important property of this model is that it accommodates bimodality, which can appear in data from heterogeneous populations. We provide statistical properties, such as the quantile function and moments. The expectation-maximization (EM) algorithm is used to find maximum likelihood estimates of the model parameters, and a Monte Carlo study is carried out to evaluate the performance of the estimators in finite samples. The relevance of the new model is shown with an application to the vote proportions in the 2018 Brazilian presidential election runoff. The proportion of votes is an important measure in analyzing electoral data. Since it is a variable restricted to the unit interval, unit distributions should be considered to analyze its probabilistic behavior. The introduced model is suitable for describing the characteristics detected in these data, namely asymmetry, bimodality, and support on the unit interval. In the application, the proposed model provides a better fit than the two-component beta mixture model.
1. Introduction
Finite mixture models first appeared in a study of asymmetry in non-homogeneous grouped material [1] and are useful in the presence of multimodality, heavy tails, and asymmetry [2]. Many works on finite mixtures have since appeared in the literature. For example, Jewell [3] proposed a model for exponential mixtures. For Weibull mixture models, we can cite [4] for characterizations of the failure rate function and [5] for reliability approximations. Recently, Huang and Dong [6] analyzed individual wave periods in combined sea states using parametric mixture models.
For data with bounded support, beta mixture models have been studied by several authors. Ji et al. [7] used beta mixtures to address problems related to correlations of gene expression levels, Bouguila et al. [8] presented a Bayesian analysis, and Grün et al. [9] studied beta mixtures in regression models. The Kumaraswamy mixture model is an alternative to beta mixture models. Khalid et al. [10] carried out a Bayesian study of the three-component Kumaraswamy mixture.
In this paper, a new two-component mixture model is proposed as an alternative for modeling population heterogeneity on the unit interval. We consider that each mixture component follows a unit Weibull distribution [11]. Some of the contributions of the resulting two-component unit Weibull mixture model are: (i) all estimation routines, including simulations and applications, are performed using the expectation-maximization (EM) algorithm, and (ii) applicability to electoral data modeling. The EM algorithm is a computational method used to compute the maximum likelihood estimator (MLE) iteratively [12], and it is widely used for maximum likelihood estimation in finite mixture models [13]. Regarding the electoral application, the vote proportion is defined as the share of votes obtained in a district relative to the total number of valid votes cast in that district. Proportions of votes are useful since electoral districts can vary considerably in population size [14]. Additionally, this measure can be used to analyze other characteristics of the electoral process, such as electoral volatility [15] and the nationalization of electoral change [14]. The data set used refers to the proportions of votes in the 2018 Brazilian presidential election runoff.
The rest of the work is organized as follows. In Section 2, the new mixture model is presented. Section 3 introduces the EM algorithm to perform maximum likelihood estimation for the model. In Section 4, an application is made with electoral data. The final considerations of this work are addressed in Section 5.
2. The Proposed Model
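In this section, the two-component unit Weibull mixture distribution is introduced. Let $X$ be a random variable following this distribution. Then, its cdf is obtained as
$$F(x;\boldsymbol{\theta}) = p\,\tau^{\left(\frac{\log x}{\log \mu_1}\right)^{\beta_1}} + (1-p)\,\tau^{\left(\frac{\log x}{\log \mu_2}\right)^{\beta_2}}, \quad 0 < x < 1,$$
where $\boldsymbol{\theta} = (p, \boldsymbol{\theta}_1^\top, \boldsymbol{\theta}_2^\top)^\top$, $0 < p < 1$ is the mixing weight, $\boldsymbol{\theta}_j = (\mu_j, \beta_j)^\top$ for $j = 1, 2$, $\mu_1, \mu_2 \in (0,1)$ are location parameters associated with the $\tau$th quantiles of each component of the mixture, $\beta_1, \beta_2 > 0$ are shape parameters, and $\tau \in (0,1)$ is assumed to be known. Note that each component of the mixture is formulated through a quantile-based parameterization. The advantage of working with a reparameterization in terms of quantiles is its flexibility to model data with heterogeneous conditional distributions [16,17]. The probability density function (pdf) is given by
$$f(x;\boldsymbol{\theta}) = p\,f_1(x;\boldsymbol{\theta}_1) + (1-p)\,f_2(x;\boldsymbol{\theta}_2), \qquad f_j(x;\boldsymbol{\theta}_j) = \frac{\beta_j}{x}\,\frac{\log \tau}{\log \mu_j}\left(\frac{\log x}{\log \mu_j}\right)^{\beta_j - 1}\tau^{\left(\frac{\log x}{\log \mu_j}\right)^{\beta_j}}, \quad j = 1, 2.$$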
Figure 1 shows plots of the pdf for some parameter combinations, which reveal the high flexibility of the new distribution. It accommodates bimodal, unimodal, decreasing, and bathtub shapes with different degrees of asymmetry. Additionally, a bimodal shape can be identified for different values of p. Hereafter, X denotes a random variable following the two-component unit Weibull mixture distribution.
Figure 1.
Plots of the density for some parameter values, shown in panels (a) and (b).
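To make the quantile-based parameterization concrete, the following is a minimal Python sketch of the component and mixture cdf, pdf, and quantile function. It assumes the unit Weibull form of [11,17], in which the cdf of a component is `tau ** ((log x / log mu) ** beta)`; the function names, the default `tau=0.5`, and the inverse-cdf sampler are illustrative choices, not the authors' code.

```python
import math
import random


def uw_cdf(x, mu, beta, tau=0.5):
    """Unit Weibull cdf; mu in (0, 1) is the tau-th quantile, beta > 0 the shape."""
    return tau ** ((math.log(x) / math.log(mu)) ** beta)


def uw_pdf(x, mu, beta, tau=0.5):
    """Derivative of uw_cdf with respect to x."""
    u = math.log(x) / math.log(mu)  # positive whenever x and mu lie in (0, 1)
    return (beta / x) * (math.log(tau) / math.log(mu)) * u ** (beta - 1) * tau ** (u ** beta)


def uw_quantile(q, mu, beta, tau=0.5):
    """Inverse of uw_cdf, obtained by solving tau ** (u ** beta) = q for x."""
    return math.exp(math.log(mu) * (math.log(q) / math.log(tau)) ** (1.0 / beta))


def mix_cdf(x, p, mu1, beta1, mu2, beta2, tau=0.5):
    """Two-component mixture cdf with mixing weight p on the first component."""
    return p * uw_cdf(x, mu1, beta1, tau) + (1 - p) * uw_cdf(x, mu2, beta2, tau)


def mix_pdf(x, p, mu1, beta1, mu2, beta2, tau=0.5):
    """Two-component mixture pdf with mixing weight p on the first component."""
    return p * uw_pdf(x, mu1, beta1, tau) + (1 - p) * uw_pdf(x, mu2, beta2, tau)


def mix_sample(n, p, mu1, beta1, mu2, beta2, tau=0.5, seed=0):
    """Draw n values: pick a component, then invert its cdf at a uniform draw."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        mu, beta = (mu1, beta1) if rng.random() < p else (mu2, beta2)
        q = min(max(rng.random(), 1e-12), 1 - 1e-12)  # keep q strictly inside (0, 1)
        draws.append(uw_quantile(q, mu, beta, tau))
    return draws
```

Evaluating `uw_cdf(mu, mu, beta, tau)` returns `tau`, which is the sense in which each location parameter is the `tau`-th quantile of its component.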
3. Parameter Estimation
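An approach to the iterative computation of MLEs when the observations can be treated as incomplete data is the well-known expectation-maximization (EM) algorithm. In the context of two-component mixture models, let $x_1, \ldots, x_n$ be a random sample of size n from a random variable X having the mixture pdf given in Section 2, with unknown parameter vector $\boldsymbol{\theta} = (p, \boldsymbol{\theta}_1^\top, \boldsymbol{\theta}_2^\top)^\top$, where $\boldsymbol{\theta}_1 = (\mu_1, \beta_1)^\top$ and $\boldsymbol{\theta}_2 = (\mu_2, \beta_2)^\top$. It is customary to call $\boldsymbol{x} = (x_1, \ldots, x_n)^\top$ the "incomplete data" since it is associated with a second component of unobserved values $\boldsymbol{z} = (z_1, \ldots, z_n)^\top$ of a latent random variable Z. Each value $z_i$ indicates which component of the mixture the ith observation belongs to, such that
$$z_i = \begin{cases} 1, & \text{if } x_i \text{ belongs to the first component}, \\ 0, & \text{if } x_i \text{ belongs to the second component}, \end{cases}$$
where $P(Z_i = 1) = p$ and $P(Z_i = 0) = 1 - p$. The complete-data specification is determined by the joint density of $(X, Z)$,
$$f(x_i, z_i; \boldsymbol{\theta}) = \left[p\,f_1(x_i;\boldsymbol{\theta}_1)\right]^{z_i}\left[(1-p)\,f_2(x_i;\boldsymbol{\theta}_2)\right]^{1 - z_i},$$
and based on it, the complete log-likelihood function, for the sample of size n, is given by
$$\ell_c(\boldsymbol{\theta}) = \sum_{i=1}^{n}\left\{ z_i\left[\log p + \log f_1(x_i;\boldsymbol{\theta}_1)\right] + (1 - z_i)\left[\log(1-p) + \log f_2(x_i;\boldsymbol{\theta}_2)\right]\right\}.$$
The EM algorithm iterates between two steps to compute the MLE of $\boldsymbol{\theta}$. In the E-step, or expectation step, since $z_i$ is unobservable, it is replaced by its conditional expectation with respect to the conditional distribution of Z, given $x_i$ and the current parameter estimates. More specifically, in the $(k+1)$th iteration, the E-step computes
$$Q(\boldsymbol{\theta};\boldsymbol{\theta}^{(k)}) = \sum_{i=1}^{n}\left\{ \hat{z}_i^{(k)}\left[\log p + \log f_1(x_i;\boldsymbol{\theta}_1)\right] + (1 - \hat{z}_i^{(k)})\left[\log(1-p) + \log f_2(x_i;\boldsymbol{\theta}_2)\right]\right\},$$
where
$$\hat{z}_i^{(k)} = \frac{p^{(k)} f_1(x_i;\boldsymbol{\theta}_1^{(k)})}{p^{(k)} f_1(x_i;\boldsymbol{\theta}_1^{(k)}) + (1 - p^{(k)})\,f_2(x_i;\boldsymbol{\theta}_2^{(k)})},$$
and $p^{(k)}$, $\boldsymbol{\theta}_1^{(k)}$, and $\boldsymbol{\theta}_2^{(k)}$ are obtained from the kth iteration.
The M-step, or maximization step, requires the maximization of the expected complete log-likelihood with respect to $\boldsymbol{\theta}$. That is,
$$\boldsymbol{\theta}^{(k+1)} = \arg\max_{\boldsymbol{\theta}}\; Q(\boldsymbol{\theta};\boldsymbol{\theta}^{(k)}).$$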
The vector $\boldsymbol{\theta}^{(k+1)}$ is used to initialize the next iteration. Thus, the EM algorithm is initialized with the starting values $\boldsymbol{\theta}^{(0)}$, and the MLE of $\boldsymbol{\theta}$ is given by $\boldsymbol{\theta}^{(k+1)}$ once a convergence criterion is reached [12]. We set the maximum number of iterations to 10,000. It should be noted that it is not possible to obtain analytical solutions from these expressions, so the maximization must be performed with an iterative technique such as the Newton–Raphson method [18].
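The E- and M-steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: it assumes the quantile-parameterized unit Weibull density of [11,17]; it starts from initial responsibilities (a split at the sample median) instead of starting parameter values; and, in place of Newton–Raphson, the weighted M-step works on the transformed scale $y = -\log x$, where each component is an ordinary Weibull with rate $\alpha$ and shape $\beta$, solves the profile score for $\beta$ by bisection, and maps back through $\mu = \exp\{-(-\log\tau/\alpha)^{1/\beta}\}$. All function names are hypothetical.

```python
import math
import random


def uw_pdf(x, mu, beta, tau=0.5):
    """Unit Weibull pdf with mu as the tau-th quantile (assumed form of [11,17])."""
    u = math.log(x) / math.log(mu)
    return (beta / x) * (math.log(tau) / math.log(mu)) * u ** (beta - 1) * tau ** (u ** beta)


def weighted_uw_mle(x, w, tau=0.5):
    """Weighted ML fit of one component; returns (mu, beta)."""
    # Work with log y_i, y_i = -log x_i; drop zero-weight points.
    pairs = [(wi, math.log(-math.log(xi))) for xi, wi in zip(x, w) if wi > 0]
    sw = sum(wi for wi, _ in pairs)
    lmax = max(li for _, li in pairs)
    mean_ly = sum(wi * li for wi, li in pairs) / sw

    def score(beta):
        # Profile score for beta (alpha maximized out); decreasing in beta.
        s = [(wi * math.exp(beta * (li - lmax)), li) for wi, li in pairs]
        return 1.0 / beta + mean_ly - sum(si * li for si, li in s) / sum(si for si, _ in s)

    lo, hi = 1e-3, 100.0
    for _ in range(60):  # bisection on the monotone score
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    alpha = sw / sum(wi * math.exp(beta * li) for wi, li in pairs)
    mu = math.exp(-((-math.log(tau) / alpha) ** (1.0 / beta)))
    return mu, beta


def em_uwm(x, tau=0.5, max_iter=10000, tol=1e-8):
    """EM for the two-component unit Weibull mixture; returns a dict of estimates."""
    n = len(x)
    median = sorted(x)[n // 2]
    z = [1.0 if xi <= median else 0.0 for xi in x]  # crude initial responsibilities
    old = -math.inf
    for _ in range(max_iter):
        # M-step: closed-form weight update, numerical update for each component.
        p = sum(z) / n
        mu1, beta1 = weighted_uw_mle(x, z, tau)
        mu2, beta2 = weighted_uw_mle(x, [1.0 - zi for zi in z], tau)
        # E-step: posterior probability that each observation is from component 1.
        f1 = [p * uw_pdf(xi, mu1, beta1, tau) for xi in x]
        f2 = [(1.0 - p) * uw_pdf(xi, mu2, beta2, tau) for xi in x]
        z = [a / (a + b) for a, b in zip(f1, f2)]
        loglik = sum(math.log(a + b) for a, b in zip(f1, f2))
        if abs(loglik - old) < tol:
            break
        old = loglik
    return {"p": p, "mu1": mu1, "beta1": beta1, "mu2": mu2, "beta2": beta2, "loglik": loglik}
```

Because the map between $(\alpha, \beta)$ and $(\mu, \beta)$ is one-to-one for fixed $\tau$, maximizing on the transformed scale gives the same weighted maximizer as working with $(\mu, \beta)$ directly. The sketch assumes all observations lie strictly inside $(0, 1)$ and does not guard against degenerate components.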
4. Application
In what follows, we present a case study that illustrates the suitability of the proposed distribution for modeling real data on the unit interval. The data set consists of the proportion of votes obtained by the winning candidate in each municipality in the 2018 Brazilian presidential election runoff. Since the data present a bimodal shape (see Figure 2a), a unimodal distribution would not be appropriate to fit this data set. Therefore, the proposed distribution is a suitable alternative to model these data. Its performance is compared with other two-component mixtures of double-bounded distributions that have already been studied in the literature, including the two-component beta mixture model. In this paper, the parameterization proposed in [19] is considered to define the beta mixture model, which has pdf given by
$$g(x; p, \mu_1, \phi_1, \mu_2, \phi_2) = p\,b(x;\mu_1,\phi_1) + (1-p)\,b(x;\mu_2,\phi_2), \qquad b(x;\mu,\phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1-\mu)\phi)}\,x^{\mu\phi - 1}(1-x)^{(1-\mu)\phi - 1}, \quad 0 < x < 1,$$
where $\mu_1, \mu_2 \in (0,1)$ are location parameters associated with the mean of each mixture component, $\phi_1, \phi_2 > 0$ are precision parameters, and $p \in (0,1)$ is the parameter that measures the weights of the mixture.
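For comparison with the proposed model, the competing beta mixture density can be evaluated as in the short Python sketch below. It assumes the mean-precision parameterization of [19], in which a beta component with mean `mu` and precision `phi` has shape parameters `mu * phi` and `(1 - mu) * phi`; the function names are illustrative.

```python
import math


def beta_pdf_mean_precision(x, mu, phi):
    """Beta pdf with mean mu in (0, 1) and precision phi > 0."""
    a, b = mu * phi, (1.0 - mu) * phi
    # Work on the log scale so that large precisions do not overflow the gamma function.
    log_norm = math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x))


def beta_mix_pdf(x, p, mu1, phi1, mu2, phi2):
    """Two-component beta mixture pdf with weight p on the first component."""
    return (p * beta_pdf_mean_precision(x, mu1, phi1)
            + (1.0 - p) * beta_pdf_mean_precision(x, mu2, phi2))
```

Under this parameterization the mixture mean is `p * mu1 + (1 - p) * mu2`, whereas the location parameters of the unit Weibull components are quantiles, so the two sets of location estimates are not directly comparable unless the component distributions are symmetric.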
Figure 2.
Estimated densities (a) and empirical and estimated cdfs (b) for the three fitted models.
For all competing mixture models, parameter estimation is carried out using the EM algorithm following the steps described in Section 3. The corrected Anderson–Darling [20], Cramér–von Mises [21], and Kolmogorov–Smirnov [22] statistics are calculated to assess the goodness of fit of the three fitted models. The lower their values, the better the model fit. All analyses are performed in the R programming language, and the goodness-of-fit measures are computed using the AdequacyModel package [23].
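As an illustration of how such statistics are obtained from a fitted cdf, the following Python sketch computes the Kolmogorov–Smirnov distance together with corrected Cramér–von Mises and Anderson–Darling statistics. The transformation to normal scores and the small-sample correction factors follow our reading of the Chen–Balakrishnan procedure [20] and should be treated as assumptions to check against that reference; the paper itself relies on the R package AdequacyModel [23], not on this code.

```python
import math
from statistics import NormalDist, mean, stdev


def gof_statistics(data, cdf):
    """Return corrected Cramér-von Mises (W*), Anderson-Darling (A*), and KS statistics.

    `cdf` is the fitted cdf, a function of one argument returning values in (0, 1).
    """
    x = sorted(data)
    n = len(x)
    v = [cdf(xi) for xi in x]

    # Kolmogorov-Smirnov: largest gap between the empirical and fitted cdfs.
    ks = max(max((i + 1) / n - vi, vi - i / n) for i, vi in enumerate(v))

    # Map to normal scores, standardize, and map back to (0, 1).
    std_norm = NormalDist()
    y = [std_norm.inv_cdf(vi) for vi in v]
    m, s = mean(y), stdev(y)
    u = [std_norm.cdf((yi - m) / s) for yi in y]

    # With 0-based index i, (2i + 1) plays the role of (2j - 1) for j = i + 1.
    w2 = sum((ui - (2 * i + 1) / (2 * n)) ** 2 for i, ui in enumerate(u)) + 1 / (12 * n)
    a2 = -n - sum(
        (2 * i + 1) * math.log(ui) + (2 * n - 1 - 2 * i) * math.log(1 - ui)
        for i, ui in enumerate(u)
    ) / n

    return {
        "W*": w2 * (1 + 0.5 / n),
        "A*": a2 * (1 + 0.75 / n + 2.25 / n ** 2),
        "KS": ks,
    }
```

A call such as `gof_statistics(votes, lambda t: fitted_cdf(t))`, where `votes` and `fitted_cdf` are placeholders for the data and a fitted mixture cdf, returns the three values to compare across models.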
Table 1 displays the parameter estimates, standard errors, and model comparison criteria for the three models considered. The results indicate that the proposed distribution provides the lowest values for all goodness-of-fit statistics. One of the competing models presents the worst performance and is not an adequate alternative for these data.
Table 1.
Parameter estimates and standard errors (given in parentheses) for the models fitted to Bolsonaro’s vote proportion in Brazilian presidential elections in 2018.
Figure 2a presents the histogram of the vote proportion data overlaid with the estimated densities of the fitted models. The bimodality of the data is confirmed, and the proposed model provides the closest fit to the histogram, whereas one of the competing models is clearly not adequate for these data. Further, Figure 2b gives plots of the empirical and estimated cdfs. This visual inspection supports the results in Figure 2a and Table 1, indicating that the proposed model is appropriate to fit these data. Thus, it can be an effective alternative to analyze vote proportions, being quite competitive with one of the competing mixtures and providing consistently better fits than the other. Therefore, the proposed model provides a useful tool for modeling bimodal data restricted to the unit interval. Additionally, from the estimate of the mixing parameter, it is possible to identify that more than 50% of the observations belong to the first mixture component. The fitted model also provides an estimated median for each of the two components.
5. Conclusions
A two-component mixture model was defined to describe heterogeneous populations on a bounded domain. The two-component unit Weibull mixture model is formulated by assuming that each mixture component follows the unit Weibull distribution. Some of its main properties have been presented, such as ordinary moments. The EM algorithm was used to obtain maximum likelihood estimates of the model parameters, and Monte Carlo simulations were performed to evaluate its performance. An application to electoral data illustrates the importance and potential of the new model. The motivating data set concerns the vote proportions obtained by the winning candidate in the 2018 Brazilian presidential runoff election. The results indicate that our proposal is adequate to fit this data set since it is suitable for analyzing asymmetric and bimodal behavior. From the mixing parameter estimate, we conclude that 53.68% of the observations come from the first component of the mixture; the fitted model also provides an estimated median for the municipalities in each mixture component. This application shows empirically that the proposed model can outperform other two-component mixture models based on widely known unit distributions such as the beta and Kumaraswamy.
Author Contributions
Conceptualization, R.R.G.; methodology, R.R.G. and F.A.P.-R.; software, R.R.G.; validation, R.R.G. and F.A.P.-R.; formal analysis, R.R.G.; investigation, R.R.G., F.A.P.-R. and C.P.M.; resources, C.P.M.; data curation, R.R.G.; writing—original draft preparation, R.R.G., F.A.P.-R. and C.P.M.; writing—review and editing, G.M.C. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
References
1. Pearson, K. Contributions to the mathematical theory of evolution. Philos. Trans. R. Soc. Lond. A 1894, 185, 71–110.
2. Lachos, V.H.; Moreno, E.J.L.; Chen, K.; Cabral, C.R.B. Finite mixture modeling of censored data using the multivariate Student-t distribution. J. Multivar. Anal. 2017, 159, 151–167.
3. Jewell, N.P. Mixtures of exponential distributions. Ann. Stat. 1982, 10, 479–484.
4. Jiang, R.; Murthy, D. Mixture of Weibull distributions-parametric characterization of failure rate function. Appl. Stoch. Model. Data Anal. 1998, 14, 47–65.
5. Bučar, T.; Nagode, M.; Fajdiga, M. Reliability approximation using finite Weibull mixture distributions. Reliab. Eng. Syst. Saf. 2004, 84, 241–251.
6. Huang, W.; Dong, S. Probability distribution of wave periods in combined sea states with finite mixture models. Appl. Ocean Res. 2019, 92, 101938.
7. Ji, Y.; Wu, C.; Liu, P.; Wang, J.; Coombes, K.R. Applications of beta-mixture models in bioinformatics. Bioinformatics 2005, 21, 2118–2122.
8. Bouguila, N.; Ziou, D.; Monga, E. Practical Bayesian estimation of a finite beta mixture through Gibbs sampling and its applications. Stat. Comput. 2006, 16, 215–225.
9. Grün, B.; Kosmidis, I.; Zeileis, A. Extended Beta Regression in R: Shaken, Stirred, Mixed, and Partitioned; Technical Report, Working Papers in Economics and Statistics; University of Innsbruck, Research Platform Empirical and Experimental Economics (EEECON): Innsbruck, Austria, 2011.
10. Khalid, M.; Aslam, M.; Sindhu, T.N. Bayesian analysis of 3-components Kumaraswamy mixture model: Quadrature method vs. Importance sampling. Alex. Eng. J. 2020, 59, 2753–2763.
11. Mazucheli, J.; Menezes, A.; Ghitany, M. The unit-Weibull distribution and associated inference. J. Appl. Probab. Stat. 2018, 13, 1–22.
12. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–22.
13. Redner, R.A.; Walker, H.F. Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev. 1984, 26, 195–239.
14. Alemán, E.; Kellam, M. The nationalization of presidential elections in the Americas. Elect. Stud. 2017, 47, 125–135.
15. Powell, E.N.; Tucker, J.A. Revisiting electoral volatility in post-communist countries: New data, new results and new approaches. Br. J. Political Sci. 2013, 44, 123–147.
16. Bayes, C.L.; Bazán, J.L.; De Castro, M. A quantile parametric mixed regression model for bounded response variables. Stat. Its Interface 2017, 10, 483–493.
17. Mazucheli, J.; Menezes, A.F.B.; Fernandes, L.B.; de Oliveira, R.P.; Ghitany, M.E. The unit-Weibull distribution as an alternative to the Kumaraswamy distribution for the modeling of quantiles conditional on covariates. J. Appl. Stat. 2020, 47, 954–974.
18. Press, W.H.; Teukolsky, S.A.; Vetterling, W.T.; Flannery, B.P. Numerical Recipes 3rd Edition: The Art of Scientific Computing; Cambridge University Press: Cambridge, UK, 2007.
19. Ferrari, S.L.P.; Cribari-Neto, F. Beta regression for modelling rates and proportions. J. Appl. Stat. 2004, 31, 799–815.
20. Chen, G.; Balakrishnan, N. A general purpose approximate goodness-of-fit test. J. Qual. Technol. 1995, 27, 154–161.
21. Durbin, J.; Knott, M. Components of Cramér-Von Mises statistics. J. R. Stat. Soc. Ser. B (Methodol.) 1972, 34, 290–307.
22. Goodman, L.A. Kolmogorov-Smirnov tests for psychological research. Psychol. Bull. 1954, 51, 160–168.
23. Marinho, P.R.D.; Silva, R.B.; Bourguignon, M.; Cordeiro, G.M.; Nadarajah, S. AdequacyModel: An R package for probability distributions and general purpose optimization. PLoS ONE 2019, 14, e0221487.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).