Improved Initialization of the EM Algorithm for Mixture Model Parameter Estimation

A commonly used tool for estimating the parameters of a mixture model is the Expectation-Maximization (EM) algorithm, an iterative procedure that can serve as a maximum-likelihood estimator. The EM algorithm has well-documented drawbacks, such as the need for good initial values and the possibility of being trapped in local optima. Nevertheless, because of its appealing properties, EM plays an important role in estimating the parameters of mixture models. To overcome these initialization problems, in this paper, we propose the Rough-Enhanced-Bayes mixture estimation (REBMIX) algorithm as a more effective initialization for EM. Three different strategies are derived for dealing with the unknown number of components in the mixture model. These strategies are thoroughly tested on artificial datasets, density-estimation datasets, and image-segmentation problems and compared with state-of-the-art initialization methods for the EM. Our proposal shows promising results in terms of clustering and density-estimation performance as well as in terms of computational efficiency. All the improvements are implemented in the rebmix R package.


Introduction
The Expectation-Maximization (EM) algorithm was introduced in [1]. Today, after a little more than 40 years, it is still one of the most popular algorithms for statistical pattern recognition. Studies range from the theoretical, e.g., the convergence of the EM and its variant DA-EM in [2], to modifications of the EM algorithm for different purposes: image matching [3]; parameter estimation [4,5]; malaria diagnosis [6]; mixture simplification [7]; and audio-visual scene analysis [8].
EM's popularity has risen largely due to its use in estimating mixture-model (MM) parameters [9,10]. The mixture model is an umbrella term for probabilistic models in which the modeled population contains several sub-populations. To describe the characteristics of each sub-population, it is assumed to follow a simple parametric distribution, usually referred to as a component of the MM. The type of the MM is therefore identified by the distribution family of its components, e.g., the components of the Gaussian Mixture Model (GMM) have a Gaussian distribution. Once estimated, the parameters of an MM can be used for multiple purposes. Their use is equally interesting for density-estimation tasks [7,11] and clustering tasks [6,12,13]. In the context of MM parameter estimation, the EM algorithm can be seen as a clustering algorithm that maximizes the missing-data log-likelihood as its objective function. Therefore, the EM for MMs can be seen as a maximum-likelihood estimator. EM also has many appealing properties, such as

Estimation of the Gaussian Mixture Model Parameters
Let Y = {y_1, y_2, . . . , y_n} be a d-dimensional observed dataset of n continuous observations y_j = (y_{1j}, y_{2j}, . . . , y_{dj}). Each observation y_j is assumed to follow the probability density function (PDF):

f(y_j | c, w, Θ) = ∑_{l=1}^{c} w_l f_l(y_j | Θ_l), (1)

where the components have the multivariate Gaussian PDF:

f_l(y_j | Θ_l) = (2π)^{-d/2} |Σ_l|^{-1/2} exp( -(1/2)(y_j - µ_l)^T Σ_l^{-1} (y_j - µ_l) ). (2)

The GMM consists of a finite number of components c, and every component has a weight w_l. The weights of the components w_l have the properties of a convex combination, w_l ≥ 0 ∧ ∑_{l=1}^{c} w_l = 1 [9]. Therefore, the estimation of the parameters of the GMM consists of an estimation of the number of components c, each component's weight w_l, and each component's distribution parameters Θ_l; in the GMM case, the mean vector µ_l and the covariance matrix Σ_l.
Let L(Y|Θ) be the likelihood function. The log-likelihood function of the GMM is defined as:

log L(Y|Θ) = ∑_{j=1}^{n} log ∑_{l=1}^{c} w_l f_l(y_j | Θ_l), (3)

where f_l(y_j | Θ_l) is defined in Equation (2). The maximum-likelihood estimate of the parameters Θ can be defined as:

Θ̂ = argmax_Θ log L(Y|Θ). (4)
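To make the estimation target concrete, the mixture density and its log-likelihood can be sketched in a few lines of Python. For brevity, the sketch is univariate (d = 1), so each Θ_l reduces to a mean and a variance; the function names are our own illustration, not part of the rebmix package.

```python
import math

def gaussian_pdf(y, mu, var):
    """Univariate Gaussian density; the paper's f_l(y_j | Theta_l) with d = 1."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def log_likelihood(Y, w, mu, var):
    """log L(Y | Theta) = sum_j log sum_l w_l f_l(y_j | Theta_l), Equation (3)."""
    return sum(math.log(sum(w[l] * gaussian_pdf(y, mu[l], var[l])
                            for l in range(len(w))))
               for y in Y)
```

For a single standard-normal component evaluated at y = 0, this reduces to log(1/√(2π)).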

Expectation-Maximization Algorithm
The EM algorithm is an iterative algorithm for the estimation of the parameters Θ of a GMM (or any other MM). Instead of maximizing the log-likelihood function for the dataset Y containing n observations, it maximizes the complete-data log-likelihood. The complete data Y_c are assembled from the observed data Y (the sampled dataset) and the missing data (or data labels) Z. The missing data are the labels of the observations, representing the membership of each observation in the dataset to one of the components of the GMM. Since the labels z_kj of the observations are unknown (for example, the label of the observation y_j from the dataset Y), the EM algorithm utilizes two steps, i.e., the expectation (E) step and the maximization (M) step. In the E step of the algorithm, the posterior probability τ_kj that the observation y_j from Y arose from the k-th component is calculated:

τ_kj = w_k f_k(y_j | Θ_k) / ∑_{l=1}^{c} w_l f_l(y_j | Θ_l). (5)

In the M step, the iteration-wise update equations for the parameters can be derived by maximizing the conditional expectation of the complete-data log-likelihood function [9]. The updates of the parameters can be obtained with the following equations: the weights w_k with

w_k = (1/n) ∑_{j=1}^{n} τ_kj, (6)

the mean vectors µ_k with

µ_k = ∑_{j=1}^{n} τ_kj y_j / ∑_{j=1}^{n} τ_kj, (7)

and the covariance matrices Σ_k with the update equation

Σ_k = ∑_{j=1}^{n} τ_kj (y_j - µ_k)(y_j - µ_k)^T / ∑_{j=1}^{n} τ_kj. (8)

After every successful iteration of the E and M steps, the value of the likelihood is guaranteed to increase [9]. The expectation and maximization steps are repeated iteratively until the convergence of the log-likelihood is achieved. Convergence is mostly assessed by the change in the log-likelihood value, i.e., when it is smaller than some predefined threshold value ε [9,33].
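One E step followed by one M step can be sketched as follows; the sketch is again univariate for brevity, and the helper name em_step is hypothetical.

```python
import math

def em_step(Y, w, mu, var):
    """One E and M step of EM for a univariate GMM (illustrative sketch)."""
    n, c = len(Y), len(w)
    # E step: posterior tau[k][j] that observation y_j arose from component k
    tau = [[0.0] * n for _ in range(c)]
    for j, y in enumerate(Y):
        dens = [w[k] * math.exp(-0.5 * (y - mu[k]) ** 2 / var[k])
                / math.sqrt(2 * math.pi * var[k]) for k in range(c)]
        total = sum(dens)
        for k in range(c):
            tau[k][j] = dens[k] / total
    # M step: closed-form updates of weights, means, and variances
    nk = [sum(tau[k]) for k in range(c)]
    w_new = [nk[k] / n for k in range(c)]
    mu_new = [sum(tau[k][j] * Y[j] for j in range(n)) / nk[k] for k in range(c)]
    var_new = [sum(tau[k][j] * (Y[j] - mu_new[k]) ** 2 for j in range(n)) / nk[k]
               for k in range(c)]
    return w_new, mu_new, var_new
```

On two well-separated groups of points, a single step already pulls the component means toward the group centers.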

Model Selection with EM Algorithm
Let Θ(c) denote the GMM parameters with c components. Let I = {Θ init (c = c min ), Θ init (c = c min + 1), . . . , Θ init (c = c max )} denote the set of initial parameters of the GMM for the different number of components c. The final set S of estimated parameters becomes S = {EM(i), ∀i ∈ I}, where EM(i) denotes the iterative process of GMM parameters estimation with the EM algorithm. The problem of model selection can then be summarized as: Choose s from S so that measure C(s) is optimal, where C(s) is the cost function for the calculation of the selected measure.
The most commonly used model-selection measures, namely information criteria (IC), utilize the estimated value of the likelihood function (or log-likelihood). Many ICs have been presented in the literature; see Chapter 6 of [9]. Of these, the most frequently used are certainly the AIC and the BIC. The AIC is defined as:

AIC = -2 log L(Y|Θ(c)) + 2M, (9)

where M = c - 1 + cd + cd(d + 1)/2 is the number of parameters of a d-dimensional GMM with c components. This criterion mostly favors complex models, due to its small penalization term [25]. Although for most datasets with a larger number of observations n it will select an over-fitted GMM (in terms of the number of components c), it is still an interesting criterion for datasets with a smaller number of observations n. The second one, the BIC, is defined as:

BIC = -2 log L(Y|Θ(c)) + M log n. (10)
The BIC penalizes complex models more heavily than the AIC, due to the multiplicative term log n. On the other hand, for datasets with a smaller number of observations, it can result in a GMM with a very small number of components c (under-fitting).
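Both criteria are straightforward to compute once the log-likelihood is known; a minimal sketch using the parameter count M given above (function names are ours):

```python
import math

def gmm_num_params(c, d):
    """M = c - 1 + c*d + c*d*(d+1)/2 free parameters of a d-dimensional GMM."""
    return c - 1 + c * d + c * d * (d + 1) // 2

def aic(loglik, c, d):
    """AIC = -2 log L + 2M, Equation (9)."""
    return -2.0 * loglik + 2 * gmm_num_params(c, d)

def bic(loglik, c, d, n):
    """BIC = -2 log L + M log n, Equation (10)."""
    return -2.0 * loglik + gmm_num_params(c, d) * math.log(n)
```

For example, a two-component bivariate GMM has M = 1 + 4 + 6 = 11 free parameters.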

REBMIX Algorithm
The Rough-Enhanced-Bayes mixture estimation (REBMIX) algorithm is a heuristic approach to the estimation of the GMM [16]. Here, the algorithm is briefly described; interested readers are referred to [16,34,35]. The algorithm utilizes five simple steps, illustrated in Figure 1. The first step, the preprocessing of observations, conducts an empirical density estimation of the dataset. The empirical density estimation can be carried out in three different ways within the REBMIX algorithm, see ([16], p. 16); however, here the attention is given only to the histogram estimation. Let v be the number of bins in the histogram. The preprocessing step assigns each bin in the histogram a frequency k_j, which represents the fraction of the observations falling into the volume V_j occupied by the j-th bin. This is used to transform the input dataset into a new, shrunken dataset consisting only of the bin origins y_j and the corresponding empirical frequencies k_j of the bins j ∈ {1, 2, . . . , v*}. (It is worth noting that v* is the number of non-empty bins of the histogram, as only these are used for the estimation. Additionally, for convenience, the number of bins v refers to a one-dimensional histogram; in the multidimensional setting, the actual number of bins is equal to v^d, where d is the dimension of the problem.) This transformed dataset X = (x_1, x_2, . . . , x_j, . . . , x_{v*}), where x_j = (y_1, . . . , y_d, k_j), will be denoted as the main cluster. The second step in the algorithm is the global mode detection step. The global mode is simply the bin origin y_m with the highest frequency, where m = argmax(k_1, k_2, . . . , k_{v*}). The global mode can always be detected, and it serves as the basis for the Rough component parameter estimation (Step 3) [16,36]. The Rough component parameters are additionally used to split the main cluster X into two clusters, the base cluster X_l and the residue R, where X_l = (x_{l1}, . . . , x_{lj}, . . . , x_{lv*}) ∧ x_{lj} = (y_1, . . . , y_d, k_{lj}) and R = (r_1, . . . , r_j, . . . , r_{v*}) ∧ r_j = (y_1, . . . , y_d, r_j). In addition, the equality k_j = k_{lj} + r_j, ∀j ∈ {1, . . . , v*} holds. The base cluster holds the observations tied to the currently estimated component of the GMM, and the residue is the residual cluster. The split process ends when the criterion D_l ≤ D_min/w_l is met, where D_l is the sum of positive deviations [16], w_l is the component weight, and D_min is an internally controlled parameter. After the split process is done, the base cluster is used for the Enhanced component parameter estimation. The Enhanced component parameter estimates are an approximation of the Gaussian-distribution maximum-likelihood estimates for the histogram preprocessing [16,36]. With n_l = ∑_{j=1}^{v*} k_{lj}, the weight of the currently estimated component l is:

w_l = n_l / n, (11)

the mean vector:

µ_l = (1/n_l) ∑_{j=1}^{v*} k_{lj} y_j, (12)

and the covariance matrix:

Σ_l = (1/n_l) ∑_{j=1}^{v*} k_{lj} (y_j - µ_l)(y_j - µ_l)^T. (13)

The Enhanced component parameter estimation concludes the l-th component parameter estimation. The residue then becomes the main cluster, i.e., X = R, and the Global mode detection, Rough component parameter estimation, and Enhanced component parameter estimation are repeated to estimate the parameters of the newly added (l + 1)-th component. This addition of components ends when the criterion n_res/n ≤ cD_min is met, where n_res = ∑_{j=1}^{v*} r_j is the number of observations in the residue and c is the current number of estimated components in the GMM. Two intuitive assumptions are made here. The first is that the addition of new components should end when the number of observations in the residue is too small to form a new component in the GMM; how small the residue may be is controlled via the parameter D_min. The second is that the unassigned observations left in the residue should be redistributed to the already-estimated components and consequently make a certain impact on their parameters. This is done in the Bayes classification of the unassigned observations step [16,36].
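The histogram preprocessing and global mode detection steps can be illustrated for the one-dimensional case roughly as follows; equal-width bins over a known range are assumed, and the function names are our own.

```python
from collections import Counter

def histogram_preprocess(Y, v, lo, hi):
    """Bin a 1-D dataset into v equal-width bins over [lo, hi]; keep only the
    non-empty bins. Returns (origins, freqs): the bin origins y_j (taken here
    as bin centers) and frequencies k_j for j = 1..v*."""
    width = (hi - lo) / v
    counts = Counter(min(int((y - lo) / width), v - 1) for y in Y)
    origins = [lo + (i + 0.5) * width for i in sorted(counts)]
    freqs = [counts[i] for i in sorted(counts)]
    return origins, freqs

def global_mode(origins, freqs):
    """Bin origin y_m with the highest frequency, m = argmax(k_1, ..., k_v*)."""
    m = max(range(len(freqs)), key=freqs.__getitem__)
    return origins[m]
```

Note how the shrunken dataset keeps only v* ≤ v entries, regardless of the number of observations n.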
The art of estimating the parameters of the GMM for different numbers of components c lies in updating the parameter D_min iteratively after each Bayes classification of the unassigned observations step is conducted. The parameter D_min is iteratively decreased with the empirical equation D_min^new = cD_min^old/(c + 1), until c > c_max. In doing so, the algorithm is forced to iterate until the final set {Θ_final(c = 1), Θ_final(c = 2), . . . , Θ_final(c = c_max)} of GMM parameters is obtained.
Initially, the value of the parameter D_min is taken to be 1. As the weights of the components w_l form a convex combination, the initial value D_min = 1 ensures that the first estimated GMM contains only one component. In practice, a decrease in the parameter D_min does not guarantee that the estimated GMM will have exactly one more component (or, generally speaking, more components), as the decreasing equation is empirical. Thus, the set of final estimated parameters of the GMM can be, e.g., {Θ_final(c = 1), Θ_final(c = 2), Θ_final(c = 2), Θ_final(c = 4), . . . , Θ_final(c = c_max)}. Still, employing different numbers of bins v eases the burden of the above-mentioned problem. In addition, since there is no specific rule for identifying v, the algorithm flow presented in [16] runs over an ordered set K of numbers of bins v.
In [16], it is suggested to use the same model-selection procedure as described in Section 3.1.1 to select the optimal GMM parameters. However, it is reported in [17,18] that this model-selection procedure usually yields sub-optimal parameter estimates. This happens because the parameters of the GMM estimated with the REBMIX algorithm can be further optimized to yield higher values of the likelihood function (Equation (3)) and therefore have a beneficial impact on the values of commonly used ICs, for example, the BIC defined in Equation (10).

Combined REBMIX and EM Algorithm
Let R denote the set of solutions (the estimated parameters of the GMM for different numbers of components c) obtained by the REBMIX algorithm. For a certain number of bins v in the histogram preprocessing step and a specified maximum number of components c_max and minimum number of components c_min, REBMIX will always provide the same set of solutions R. However, there are many different values of the number of bins v that can be chosen for the preprocessing step. The two extreme cases are v = 1 and v = v_uni, where v_uni is a sufficiently large number of bins for which k_j = 1, ∀j ∈ {1, 2, . . . , v*} holds. All the values that lie in between can be used for REBMIX to provide different initial parameters for the EM algorithm. As opposed to the stand-alone REBMIX, the use of the EM algorithm ensures that the final estimates yield the maximum likelihood, thus providing more stable grounds for the model-selection procedure. Three different strategies for coupling the REBMIX and EM algorithms are proposed. The first strategy, denoted as Exhaustive REBMIX&EM, can be summarized as follows:
1. Select multiple numbers of bins v_1, v_2, . . . for the histogram preprocessing;
2. For each v from v_1, v_2, . . . , obtain the corresponding set of REBMIX estimates R_v1, R_v2, . . . ;
3. Merge all solutions from the sets R_v1, R_v2, . . . into one set R;
4. For each solution r ∈ R, run the EM algorithm and obtain the set S of EM-improved estimates of the GMM parameters;
5. Run the model-selection procedure on the set S.
Although this strategy gives the best potential for the estimation of the parameters of the GMM, it is computationally intensive, due to the many repetitions of the EM algorithm. To ease the computational intensity, the second strategy, denoted as Best REBMIX&EM, is summarized as follows:
1. Select multiple numbers of bins v_1, v_2, . . . for the histogram preprocessing;
2. For each v from v_1, v_2, . . . , obtain the corresponding set of REBMIX estimates R_v1, R_v2, . . . ;
3. Group all solutions from the sets R_v1, R_v2, . . . by the number of components c, i.e., form the sets R_{c=c_min}, R_{c=c_min+1}, . . . , R_{c=c_max}, where each set holds the estimated parameters of the GMMs with the same number of components c;
4. From each set R_{c=j}, choose the solution r_{c=j,best} for which the likelihood value is the largest and merge it into a new set R_best = {r_{c=c_min,best}, r_{c=c_min+1,best}, . . . , r_{c=c_max,best}};
5. For each solution r_{c=j,best} ∈ R_best, run the EM algorithm and obtain the set S of EM-improved estimates of the GMM parameters;
6. Run the model-selection procedure on the set S.
Finally, the least computationally intensive strategy, which additionally minimizes the computation in the REBMIX algorithm itself, can be derived. This strategy is denoted as Single REBMIX&EM. If the number of bins v is estimated or intuitively taken as known, the overhead of computing the different parameters Θ of the GMM for different numbers of bins v can be removed from the REBMIX algorithm. This strategy can be summarized as follows:
1. Estimate or intuitively select the number of bins v;
2. Obtain the corresponding set R of REBMIX estimates for the known v;
3. For each solution r ∈ R, run the EM algorithm and obtain the set S of EM-improved estimates of the GMM parameters;
4. Run the model-selection procedure on the set S.
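Steps 3 and 4 of the Best REBMIX&EM strategy, i.e., grouping the candidate solutions by the number of components and keeping the highest-likelihood solution per group, can be sketched as follows; the (c, loglik, params) tuple encoding of a solution is a hypothetical simplification, not the rebmix data structure.

```python
def best_per_component_count(solutions):
    """Group candidate GMM solutions by their number of components c and keep
    the one with the largest log-likelihood in each group (steps 3-4 of the
    Best REBMIX&EM strategy). Each solution is a (c, loglik, params) tuple."""
    best = {}
    for c, loglik, params in solutions:
        if c not in best or loglik > best[c][1]:
            best[c] = (c, loglik, params)
    return [best[c] for c in sorted(best)]
```

Only the surviving solutions in R_best are then passed to the EM algorithm, which is what reduces the number of EM runs relative to the Exhaustive strategy.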

Experiments
The proposed method for estimating the parameters of the GMM was tested on artificially generated datasets, density-estimation tasks, and color-image-segmentation tasks. Additionally, as a sanity check, two further strategies were included for the EM algorithm: multiple repeated random initialization of the EM algorithm and multiple repeated k-means initialization of the EM algorithm, referred to as Random&EM and Kmeans&EM, respectively. The number of repetitions of the k-means and random initializations was 5. For the model-selection criterion, the BIC was used.

Artificial Datasets
Artificial datasets were created using the MixSim R package [37]. This package simulates GMMs with various parameters, e.g., with different levels of overlap between the mixture components or different numbers of components (Figure 2). For this research, we varied the dimension of the dataset d, the number of components in the GMM c, and the overlap level a between the components in the GMM. Table 1 summarizes the input parameters. Using all the combinations of input parameters, 18 different sets of GMM parameters were simulated. Each simulated GMM was sampled with n = 100, n = 1000, and n = 10,000 observations, which resulted in 54 different datasets.

Density Estimation
To study the performance of the proposed strategies for density estimation, we chose the multivariate distributions presented in [11], for which their implementation performed well. The sizes of the sampled datasets were n = 50, 100, 200, 300, 400, 500, 750, 1000, and 1500, and the number of replications of each dataset was 10. As the performance measure for the density estimates, we chose the same measure as in the original paper, the mean integrated square error (MISE) between the true density and the estimated density. The MISE is defined as:

MISE = (1/r) ∑_{i=1}^{r} ∫ ( f̂_i(y) - f(y) )² dy,

where r is the number of replications, f̂(y) is the estimated density, and f(y) is the true density. Additionally, Reference [11] provides an R implementation of their work, so we included the results obtained with their Adaptive density estimation by Bayesian averaging (ADEBA) algorithm.
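For a one-dimensional density, the inner integral of the MISE can be approximated with a simple Riemann sum; a minimal sketch (the grid size m = 2000 is an arbitrary choice, and the function name is ours):

```python
def ise_on_grid(f_hat, f_true, lo, hi, m=2000):
    """Approximate the integrated square error, the integral of
    (f_hat(y) - f_true(y))^2 dy over [lo, hi], by a midpoint Riemann sum.
    Averaging this over r replications gives the MISE."""
    h = (hi - lo) / m
    return sum((f_hat(lo + (i + 0.5) * h) - f_true(lo + (i + 0.5) * h)) ** 2
               for i in range(m)) * h
```

In higher dimensions the grid grows exponentially, which is the memory issue the paper raises when it falls back to the MSE measure in Section 5.1.2.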

Image Segmentation
To study the performance of the proposed strategies for clustering, we selected the digital-image-segmentation task, which can be seen as a real-world clustering problem. To elaborate, segmenting an image means grouping, i.e., clustering, the image into coherent parts. The pixels in a digital image hold certain values of color intensity (brightness). In general, color digital images are represented by three channels, where each channel holds the intensity of one primary color, such as red, green, and blue (known as RGB images). The clustering problem in this setting can be conceived as color quantization of the digital image or, more simply, grouping the parts of the digital image by color similarity; hence, the use of GMMs here is appropriate. The estimation and the model selection are carried out on all the pixel color values in the digital image. Once the GMM parameters are estimated, the clustering scheme can be obtained by assigning each pixel j ∈ {1, . . . , n} in a given image with n pixels to one of the components of the estimated GMM. The assignment is made with the posterior probabilities τ_lj, namely to the component for which the posterior probability is the highest, l = argmax(τ_1j, τ_2j, . . . , τ_cj) [33]. The posterior probabilities τ_lj for pixel j are obtained using Equation (5). The number of clusters therefore becomes the number of components c in the estimated GMM.
The clustering performance is assessed with the Adjusted Rand Index (ARI) [38]. To define the ARI, we first need to define the contingency table (Table 2). The contingency table is used to summarize the overlap between two clustering schemes M = {M_1, M_2, . . . , M_r} and N = {N_1, N_2, . . . , N_s} with (possibly different) numbers of clusters r and s. Each entry n_ij in the contingency table represents the number of objects common to clusters M_i and N_j of the M and N clustering schemes, respectively.
The ARI is then defined as:

ARI = [ ∑_{ij} (n_ij choose 2) - ∑_i (a_i choose 2) ∑_j (b_j choose 2) / (n choose 2) ] / [ (1/2)( ∑_i (a_i choose 2) + ∑_j (b_j choose 2) ) - ∑_i (a_i choose 2) ∑_j (b_j choose 2) / (n choose 2) ],

where the entries n_ij, the row sums a_i, and the column sums b_j can be acquired from the contingency table. The ARI is bounded between -1 and 1, where 1 means that the two clustering schemes are in complete agreement and a negative ARI value means that the agreement is no better than random. Therefore, the higher the ARI value, the better the match between the two clustering schemes. For the image-segmentation tasks, five images from the well-known Berkeley segmentation dataset were used [39]; see Figure 3. The images additionally have human segmentations, which can serve as ground truth for the evaluation of the clustering performance. As reported in [39], each image in the dataset has multiple human segmentations, and they are all considered valid. Therefore, the ARI is calculated against each human segmentation and the maximum value is reported.
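The ARI can be computed directly from the contingency table; the function below is our own illustration, not the implementation used in the experiments.

```python
from math import comb

def adjusted_rand_index(table):
    """ARI from a contingency table n_ij (rows: scheme M, columns: scheme N)."""
    a = [sum(row) for row in table]        # row sums a_i
    b = [sum(col) for col in zip(*table)]  # column sums b_j
    n = sum(a)
    index = sum(comb(nij, 2) for row in table for nij in row)
    sum_a = sum(comb(ai, 2) for ai in a)
    sum_b = sum(comb(bj, 2) for bj in b)
    expected = sum_a * sum_b / comb(n, 2)  # expected index under randomness
    max_index = 0.5 * (sum_a + sum_b)
    return (index - expected) / (max_index - expected)
```

Two identical clustering schemes give an ARI of 1, while an evenly mixed table gives a negative value.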

Results on Artificially Generated Datasets
First, let us introduce the parameters used for the proposed strategies. The Exhaustive REBMIX&EM and Best REBMIX&EM strategies require an ordered set K of numbers of bins, and the Single REBMIX&EM strategy requires a single number of bins v as input. The ordered set K contained the numbers of bins from 3 to 100. For the Single REBMIX&EM strategy, the number of bins v was estimated using the Knuth rule [40]. The maximum number of components c_max = 20 and the minimum number of components c_min = 1 were kept the same for every strategy, as well as for Kmeans&EM and Random&EM. The threshold value on the average log-likelihood (for the convergence assumption) was 0.0001, and the maximum number of EM iterations allowed was 1000. Convergence of the EM was assessed with the average log-likelihood because it removes the impact of larger and smaller datasets (in terms of the number of observations). The average log-likelihood was calculated simply by dividing the log-likelihood value log L(Y|Θ) by the number of observations n in the dataset.
Secondly, let us describe the methodology used to compare the strategies. As explained in Section 3.1.1, the purpose of the model-selection procedure is to choose the appropriate GMM parameters based on the defined measure, which in our case was the BIC. Therefore, the quality of the estimated GMM parameters can be expressed by the corresponding value of the BIC. If a strategy estimated the GMM parameters that gave the best BIC value, i.e., the lowest value compared to the other strategies, then we assume that the strategy also estimated the best GMM parameters. This is a reasonable and commonly used method for comparing the quality of estimated GMM parameters [14,32]. In addition, the use of the MixSim R package to sample the artificial datasets (Section 4.1) allows the true density and the true clustering of each artificial dataset to be determined, based on the fact that the GMM from which the dataset was sampled is known. This was used to additionally compare the clustering (Section 5.1.1) and density-estimation (Section 5.1.2) performance of the estimated GMM parameters.
The general results, acquired on all the artificial datasets, are presented in Figure 4. The first plot represents the percentage of datasets on which each REBMIX&EM strategy selected the GMM with the best BIC value versus the Kmeans&EM and Random&EM strategies. Judging by the BIC value, Exhaustive REBMIX&EM selected a better GMM on more datasets than the Kmeans&EM and Random&EM strategies. On the other hand, the Best and Single REBMIX&EM strategies yielded the best GMM in terms of the BIC value on fewer datasets than Kmeans&EM and on as many as the Random&EM strategy. The second plot represents the box-plots of the EM iterations used by each strategy. The Random&EM strategy needed the largest number of EM iterations, followed by the Exhaustive REBMIX&EM strategy and then the Kmeans&EM strategy, while the smallest number of iterations was needed by the Best and Single REBMIX&EM strategies. This is inferred from the median value of the box-plots (the horizontal line inside each box). In addition, the Best and Single REBMIX&EM strategies needed a much smaller number of EM iterations because they had only one set of initial parameters for each number of components c = 1, 2, 3, . . . , 20, as opposed to, e.g., Random&EM, which had five sets of initial parameters for each number of components. However, interestingly, the Exhaustive REBMIX&EM strategy had at least 90 different sets of initial parameters for each number of components and still needed fewer iterations than Random&EM, with the same median as the Kmeans&EM iteration distribution. This appears to be a benefit of the REBMIX algorithm, which seems to provide initial GMM parameters that are very close to the final estimated parameters. Finally, the third plot shows the amount of time spent on estimating the GMM parameters with the different strategies.
The Best and Single REBMIX&EM strategies had similar computing times, and their efficiency therefore seems to be equivalent. Most of the differences in the performance of these two strategies stem from the fact that the Best REBMIX&EM strategy used multiple numbers of bins v (3-100) for the REBMIX algorithm, as opposed to the Single REBMIX&EM strategy, for which we used the Knuth rule to obtain a single number of bins v. Moving on, the Exhaustive REBMIX&EM strategy gradually increased the required computation time with the increase in the number of observations. This is to be expected, since this strategy required the most EM iterations (second plot) of the three REBMIX&EM strategies. The Random&EM and Kmeans&EM strategies had the worst performance in terms of computing time. Although the Random&EM strategy needed less time when the number of observations was lower, the Kmeans&EM strategy outperformed it as the number of observations increased. Judging by the actual values, however, the difference between these two strategies was not great.

Application to Clustering
The ARI is used to study the clustering performance. The results are given as box-plots in Figure 5. The grouping in Figure 5 was selected because it reflects the difficulty of the dataset, i.e., a small number of observations along with a high degree of overlap between the components can be seen as a difficult clustering problem, while a large number of observations with a small degree of overlap can be seen as a simpler one. The performance of the Exhaustive REBMIX&EM and Best REBMIX&EM strategies on the smaller datasets (n = 100) with a lower degree of overlap was better than that of the multiple repeated Kmeans&EM and multiple repeated Random&EM strategies. The Kmeans&EM strategy improved its results on the datasets with a larger number of observations and smaller overlap, although the Exhaustive REBMIX&EM and Best REBMIX&EM strategies still performed equivalently or better. The Random&EM only achieved equivalent results on the largest datasets, n = 10,000. The Single REBMIX&EM performed slightly worse than the Exhaustive and Best REBMIX&EM strategies, although equivalently to the Kmeans&EM strategy and mostly better than the Random&EM strategy. On the other hand, the datasets with larger overlap reduced the clustering performance of all the strategies, especially for n = 100 and n = 1000. In this context, the Kmeans&EM performed better than all the REBMIX&EM strategies on the datasets with a small number of observations, n = 100. On the datasets with n = 1000, only the Exhaustive REBMIX&EM strategy performed as well as the Kmeans&EM strategy, while the Random&EM strategy performed poorly. On the datasets with the largest number of observations, n = 10,000, the Exhaustive and Best REBMIX&EM strategies slightly outperformed the others. We have also included the results obtained by running the stand-alone REBMIX algorithm.
For K, we chose the 3-100 range; in other words, the same as for the Exhaustive and Best REBMIX&EM strategies. The results for n = 100 were mainly worse than the results of all the other strategies. As the number of observations increased, i.e., n = 1000 and n = 10,000, the performance improved, although the other strategies still performed better.

Application to the Density Estimation
To study the density-estimation performance of the GMMs estimated with the different strategies, we used the mean square error (MSE) measure [33]. The MSE is defined as:

MSE = (1/r) ∑_{i=1}^{r} || Θ̂_i - Θ ||²,

where Θ̂_i are the GMM parameters estimated with the different strategies and Θ are the true GMM parameters from the simulation. The choice of the MSE instead of the MISE was made here because the distributions used for sampling the datasets have higher dimensions, i.e., 3, 5, and 10, and the creation of a good integration grid requires a lot of computing power in terms of memory. The results are presented as box-plots of the MSE values in the upper plot in Figure 6, grouped by the number of observations in the dataset. Again, for the estimation, we used all the strategies and the stand-alone REBMIX algorithm. Judging by the values presented, all the strategies and the stand-alone REBMIX gave good results for the datasets with n = 1000 and n = 10,000. In contrast, for n = 100, the Kmeans&EM, Random&EM, and REBMIX had much better results than the REBMIX&EM strategies. To study more carefully why this happened, the box-plots for the datasets with n = 100 observations only are given in the lower plot (Figure 6), grouped by the number of dimensions d of the dataset. It is clear that, as the dimensionality of the dataset increased, the results became worse. Although the stand-alone REBMIX algorithm had good results, the results of the REBMIX&EM strategies clearly deteriorated. To understand why this happened, we must recall that the log-likelihood function for the GMM, defined in Equation (3), is unbounded. Namely, a singularity in one of the component covariance matrices Σ_l can arise when det(Σ_l) → 0 [33]. When this problem appears, the EM can be jeopardized and the obtained solution can degenerate. A high-dimensional setting, such as d = 5 or d = 10, emphasizes this problem even more [14].
Because we did not include any of the safeguards for the EM algorithm, in terms of bounding the log-likelihood function, the resulting log-likelihood increased and the model-selection procedure was misled into choosing the degenerated solution.

Density-Estimation Tasks
For the density estimates, the same parameters were used for the three REBMIX&EM strategies as in the case of the artificial datasets, and the same holds for the Random&EM and Kmeans&EM strategies. For the ADEBA algorithm, we used the parameters of the ADEBA-α implementation (see [11], Section 3), i.e., the parameter α is chosen with Bayesian averaging and β = 1, for which we will use the generic term ADEBA.
The acquired values of the MISE for each pair of strategy/method and distribution are shown on separate plots in Figure 7. In general, the GMM estimates obtained with the different strategies yielded better results as the dataset size n increased. For the small dataset sizes, e.g., n = 50 and n = 100, the ADEBA usually had the best results, although the REBMIX&EM strategies proved to be good competitors, which can be seen from the results on the distributions Bend 2-d, Bend 3-d, Trips 2-d, Blob 2-d, Tetra 3-d, and Wave 2-d. For the datasets sampled from the distributions Trips 2-d and Trips 3-d, all the strategies outperformed the ADEBA as the dataset size increased. This is expected, because the Trips 2-d and Trips 3-d distributions are actually GMM distributions with c = 3 non-overlapping components. Interestingly, the ADEBA outperformed all the strategies for all the dataset sizes n on the Blob 2-d distribution, which is also a GMM distribution, with c = 8 overlapping components. However, as the dimension of the distribution increased (the Blob 3-d distribution is the same distribution as Blob 2-d, only in three-dimensional space), the results of the ADEBA deteriorated. On the distributions Bend 2-d and Bend 3-d, the performance of all the estimators deteriorated; Kmeans&EM and Random&EM performed worse than the REBMIX&EM strategies and the ADEBA. On the Spiral 3-d distribution for n = 50, the REBMIX&EM, Kmeans&EM, and Random&EM strategies performed worse than the ADEBA, although their performance improved as n increased (for the REBMIX&EM strategies already from n = 100, and for the Kmeans&EM and Random&EM strategies from n = 300 and n = 500, respectively). For the distributions Tetra 3-d and Wave 2-d, the results are similar to those for the Blob 2-d distribution. The results of the different REBMIX&EM strategies did not differ greatly; a slight difference can be seen for the smaller dataset sizes n.
However, as n increased, the results were almost identical. We also evaluated REBMIX as a stand-alone algorithm for this experiment; again, the bin range K was 3-100. REBMIX was better when the number of observations in the dataset was small, usually only for n = 50. On the other hand, merely increasing from n = 50 to n = 100 improved the results of the REBMIX&EM strategies on many distributions, most notably on the GMM distributions such as Trips 3-d.
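The MISE comparison above rests on the integrated squared error between an estimated and a true density. As a hedged illustration (not the paper's implementation), the ISE can be approximated numerically on a regular grid; averaging it over repeated samples and estimates approximates the MISE. The 1-d Gaussian densities below are hypothetical stand-ins.

```python
import numpy as np

def ise_on_grid(f_true, f_est, grid, spacing):
    """Approximate the integrated squared error between two densities
    evaluated on a regular grid (rectangle rule)."""
    diff = f_true(grid) - f_est(grid)
    return float(np.sum(diff ** 2) * spacing)

def gauss(x, mu, sigma):
    """Univariate Gaussian PDF."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical example: true density N(0, 1), estimated density N(0.1, 1.1).
xs = np.linspace(-8.0, 8.0, 4001)
dx = xs[1] - xs[0]
ise = ise_on_grid(lambda x: gauss(x, 0.0, 1.0),
                  lambda x: gauss(x, 0.1, 1.1), xs, dx)
# Averaging the ISE over many repeated estimates would estimate the MISE.
```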

To study the performance of the different EM strategies and ADEBA, the computation times for each method/strategy were measured for the different distributions and dataset sizes n. The values are grouped by the dataset size n and the method/strategy, and the mean values are shown in Figure 8. The Single and Best REBMIX&EM strategies far outperformed all the other strategies and the ADEBA algorithm for all the dataset sizes, with the exception of n = 50, where ADEBA and the REBMIX&EM strategies achieved equivalent results. The Exhaustive REBMIX&EM strategy performed as well as ADEBA. Kmeans&EM and Random&EM performed worse for all the dataset sizes, with the exception of n = 1500, where they performed slightly better than ADEBA and Exhaustive REBMIX&EM. Stand-alone REBMIX outperformed all the strategies and the ADEBA algorithm in terms of computational efficiency.

Image-Segmentation Tasks
The image segmentations were carried out using the following strategies: the Single REBMIX&EM strategy with the number of bins v estimated using the Knuth rule; the Single REBMIX&EM strategy with the number of bins v = 255; the Best REBMIX&EM strategy with the bin range 3-100; the Kmeans&EM strategy; and the Random&EM strategy. The number of bins v = 255 was chosen to match the range of the color-channel values: the intensities of RGB images take only integer values, so with v = 255 the frequency of each color-channel value is counted directly and the marginal distributions of the channel intensities are represented. The maximum number of components was kept the same for each strategy, i.e., cmax = 20. The threshold value for the EM algorithm was 0.0001 and the maximum allowed number of EM iterations was 1000.
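With integer 8-bit intensities, the histogram preprocessing at v = 255 essentially reduces to counting the frequency of each intensity value per channel. A minimal numpy sketch (the random pixels here are a hypothetical stand-in for a flattened RGB image):

```python
import numpy as np

# Hypothetical 8-bit RGB image flattened to an n x 3 array of pixel values.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(1000, 3))

# Binning integer intensities amounts to counting the frequency of each
# value on every color channel, i.e., the marginal histograms.
marginals = [np.bincount(pixels[:, ch], minlength=256) for ch in range(3)]

# Each marginal histogram sums to the number of pixels n.
```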
The numerical results in terms of the number of components c and the ARI value are given in Table 3. One of the REBMIX&EM strategies always had the best ARI value. Additionally, the Single REBMIX&EM strategy with v = 255 yielded very good results in terms of the ARI value on the flower image (353013), the herd image (38092) and the woman image (216053); the intuitive guess of v = 255 therefore proved beneficial. Finally, the GMMs selected with all the strategies had a large number of components c, which can be related to the BIC model-selection criterion.
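For reference, the Adjusted Rand Index used throughout can be computed directly from the contingency table of two labelings; the sketch below follows the standard Hubert-Arabie formula (this is a generic implementation, not the code used in the paper):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings (Hubert-Arabie form)."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    a = Counter(labels_true)   # row sums of the contingency table
    b = Counter(labels_pred)   # column sums
    sum_ij = sum(comb(v, 2) for v in contingency.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g., both labelings trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

A permutation of the cluster labels does not change the score, which is why the ARI is suitable for comparing a segmentation against ground truth.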
To establish a qualitative comparison, the segmentations of the flower and herd images are given in Figure 9. The first issue is the black or white background. Such a background causes a high density at the edges of the feature space, creating a heavy-tailed distribution, for which GMMs are not suitable. The GMMs estimated with each strategy needed at least two components to address the heavy tails of the distribution; therefore, a background like that of the flower image (label 353013) was mostly over-segmented, which lowered the ARI value. Next, the strategies mostly had problems with objects that occupy a large part of the image scene and have little variation in color (for example, the flower object in the flower image or the soil in the herd image). An object that occupies a large part of the scene causes a high-density peak in the distribution. This peak introduces a certain asymmetry, and the strategies tried to fit multiple GMM components to that part, producing a lot of noise in the segmentation, as can be seen in Figure 9. This problem was slightly reduced by the Single REBMIX&EM strategy with v = 255, which mostly tied the peaks to one component of the GMM (the flower object in the flower image or the soil-ground part of the herd image in Figure 9), leading to a much higher ARI value for those segmentations. The sky gradients mostly caused over-segmentations, which were not properly addressed by any strategy. Additionally, we included results from purely clustering algorithms: the MeanShift (MS) algorithm and the newer Cutting Edge Spatial Clustering (CutESC) [41], which exhibited good results in its original study. We used the scikit-learn Python MS implementation [42] with default values; the segmentation was made on the original pixel-color values. For the CutESC algorithm, we followed the authors' recommendation.
First, we calculated superpixels of the images with the scikit-image SLIC implementation [43] with default values and then constructed a shrunken dataset containing the spatial and color values. In addition, to obtain consistently good segmentations with the CutESC algorithm, the parameters α and β were set to 0.5. Table 4 summarizes the ARI values obtained with these algorithms. As expected, CutESC outperformed all the other strategies/algorithms. The GMMs estimated with the REBMIX&EM strategies outperformed the MS clustering algorithm on the images 353013, 38092 and 48055, while MS yielded slightly better results on the images 216053 and 22093. To investigate the robustness of the different strategies in a clustering scenario of the image-segmentation tasks, we added different types of noise to the images. Seven types of noise were used: salt, pepper, salt&pepper (s&p), Gaussian, Poisson, speckle and localvar, as implemented in the scikit-image Python module; their effects on the herd image can be seen in Figure 10. The resulting ARI values are shown in the plots in Figure 11 for each strategy/method and the corresponding image and noise type. As expected, the different types of noise affected the quality of the segmentation differently; however, surprisingly, not all the types of noise deteriorated the results in terms of a reduced ARI value. Poisson noise had an improving effect on almost every image and every method/strategy. Noise types such as salt, pepper and salt&pepper generally had a deteriorating effect on the segmentation results; with certain methods/strategies, this type of noise reduced the quality of the segmentation by up to 20%. To better evaluate the results, we provide additional box plots (Figure 12) for the different methods/strategies, showing the percentage difference between the results on the non-noisy images and on the noisy images. The box plots are constructed as follows.
The upper limit of the box is the maximum value of the difference (a positive value means improvement), the lower limit is the minimum value (a negative value means deterioration), and the horizontal line within the box is the median value. Since the ARI measure is bounded between −1 and 1, a difference of 10% can be interpreted as a difference of 0.05 in the ARI value. The following general remarks can be derived from the box plots in Figure 12. The segmentation results of the MS algorithm usually deteriorated; however, this deterioration was only minor and ranged from 5% up to 10%. On the other hand, the CutESC algorithm mostly improved on the image 48055, while on the image 216053 it produced only a slight deterioration of the segmentation results. On the image 22093, there were slight improvements for certain noise types and large deteriorations for the others. On the images 353013 and 38092, the noise usually deteriorated the segmentation results of the CutESC algorithm. Furthermore, the noise had the following effects on the segmentation results of the Kmeans&EM and Random&EM strategies. The different noise types had little effect (either improving or deteriorating) on the segmentation results with the Kmeans&EM strategy, while the noise deteriorated the segmentation results with the Random&EM strategy in most cases, although specific noise types improved the segmentation results for the images 48055 and 22093.
The REBMIX&EM strategies were influenced by the different types of noise in different ways. We start with the Single REBMIX&EM strategy with the number of bins v = 255. This strategy showed the greatest deterioration for the images 353013, 216053 and 38092. Although this is still not dramatic, since 20% represents a decrease of the ARI value of 0.1, it indicates that this method should be used with caution in the presence of noise. On the other hand, the Best REBMIX&EM strategy and the Single REBMIX&EM strategy with the number of bins v estimated with the Knuth rule showed minor to mild (image 38092 with the Best REBMIX&EM strategy) deterioration of the segmentation results and, interestingly, the greatest improvements of the segmentation results in the presence of different types of noise. In fact, the Best REBMIX&EM strategy only improved the segmentation results for the image 216053, without any deterioration (the resulting rectangle in Figure 12 is above zero). The Single REBMIX&EM strategy with the number of bins v estimated with the Knuth rule had similar results for the image 353013. This can be interpreted as follows: by adding noise to particular images, the density shape of the image pixel values became more similar to one that a GMM can generate, so the REBMIX&EM strategies were able to estimate it successfully. The results of the different strategies/algorithms on the noisy images revealed some interesting facts, such as that adding some types of noise can improve the results of GMM-based image segmentation. An in-depth investigation of when and why this happens goes beyond the scope and topic of this article, so we conclude with a final remark. In many cases, the noise will deteriorate the results of the segmentation; this is to be expected. For most of the strategies/algorithms used, the amount of deterioration was low.
If the amount of noise present is large, the Random&EM strategy and the Single REBMIX&EM strategy with the number of bins v = 255 are not recommended, while the others are safe to use. It is also worth considering whether the addition of some artificial noise can improve the result of the segmentation.
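The salt&pepper corruption used in the robustness study can be sketched with plain numpy; the function below mimics scikit-image's s&p mode (the exact sampling details are our assumption, not scikit-image's implementation):

```python
import numpy as np

def salt_and_pepper(image, amount=0.05, seed=0):
    """Corrupt a fraction `amount` of pixels: half set to 0 (pepper),
    half set to the maximum 8-bit intensity (salt)."""
    rng = np.random.default_rng(seed)
    noisy = image.copy()
    n_noisy = int(amount * noisy.size)
    # Pick distinct pixel positions, then split them between salt and pepper.
    idx = rng.choice(noisy.size, size=n_noisy, replace=False)
    flat = noisy.reshape(-1)           # view into `noisy`
    flat[idx[: n_noisy // 2]] = 0      # pepper
    flat[idx[n_noisy // 2:]] = 255     # salt
    return noisy
```

Applying this to a grayscale or flattened RGB array leaves the original untouched and returns a corrupted copy.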

Comparison of the Time Complexity of REBMIX and k-Means Algorithm
As the k-means algorithm is essentially the fastest algorithm used to initialize the parameters for the EM estimation of the GMM, we compared the time complexity of the k-means algorithm and the REBMIX algorithm. It is hard to estimate the time complexity of both algorithms theoretically. Judging by [31], the theoretical time complexity of k-means is O(tcnd), where t is the number of iterations needed, c is the number of components, n is the number of observations, and d is the dimension of the dataset. It is hard to know in advance how many iterations t the k-means algorithm will need. On the other hand, the time complexity of the REBMIX algorithm depends on multiple factors. The most important is certainly the number of non-empty bins v* in the histogram preprocessing. Further, it depends on the chosen maximum number of components cmax and the dimension of the problem d. Finally, it depends on the number of different bin counts v tried for the histogram preprocessing, e.g., the range 3-100 contains 97 values. Therefore, we chose to give an empirical comparison of both algorithms in terms of the number of observations n, the number of dimensions d, and the maximum number of components cmax. The results are presented in Figure 13. For the REBMIX algorithm, we used the REBMIX function provided by the rebmix R package. For the k-means algorithm, we used the default kmeans implementation provided in the base R package. Although the k-means algorithm needs a varying number of iterations, here we considered 100 iterations; this seems reasonable because repetitions of k-means are often preferable [31]. In addition, because, for the GMM initialization, the k-means algorithm needs to be repeated cmax − cmin times, we took this into consideration by repeating the k-means algorithm for each c ∈ {cmin, cmin + 1, . . . , cmax}.
For the REBMIX algorithm, we used the setting in which the number of bins v is not known and the range used was 3-100. When the time complexity with respect to one parameter was evaluated, the other two were kept constant. In other words, when comparing the time complexity with respect to the number of dimensions d, the number of observations n = 1000 and the maximum number of components cmax = 10 were kept constant. For the comparison with respect to the number of observations n, the number of dimensions was d = 5 and the maximum number of components was cmax = 10. Finally, for the comparison with respect to the maximum number of components cmax, the number of dimensions was d = 5 and the number of observations was n = 1000. We start with the comparison with respect to the number of dimensions d, given in the first plot of Figure 13. The time complexity of k-means is linear with respect to this parameter, which agrees with the theoretical assumption. On the other hand, REBMIX seems to have a nonlinear complexity with respect to the parameter d. The nonlinearity comes from the definition of the multivariate Gaussian distribution in Equation (2): to calculate the PDF of the GMM, the inverse of the covariance matrix Σ of each component is required, and matrix inversion is a nonlinear operation, so the nonlinear dependence in time is justified. The second parameter for which the comparison was made was the number of observations n in the dataset; the results are shown in the second plot of Figure 13. Both the REBMIX and k-means algorithms exhibited a linear dependency with respect to the number of observations n. Finally, k-means had a linear dependency with respect to the cmax parameter, while that of REBMIX was nonlinear.
The nonlinear complexity in terms of the parameter cmax comes from its empirical stopping criterion, which does not guarantee that each iteration adds exactly one more component to the GMM (details of the stopping criterion of the REBMIX algorithm are explained in the last two paragraphs of Section 3.2). The nonlinear dependencies of the REBMIX algorithm with respect to the parameters d and cmax are troublesome, given that k-means has a linear dependence; this gives the k-means algorithm an advantage over the REBMIX algorithm. However, with respect to the number of observations n, they proved to be equal competitors. In the end, judging from the measured computation times, the REBMIX algorithm was generally faster than the k-means algorithm.
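The empirical protocol (vary one parameter, hold the other two fixed, time the run) can be outlined with a simple harness. The workload below is a stand-in (one k-means-style assignment step), not the actual REBMIX or base-R kmeans implementations, so only the measurement pattern should be taken from this sketch:

```python
import time
import numpy as np

def time_workload(fn, repeats=3):
    """Median wall-clock time of fn() over a few repeats."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

# Vary n while holding d = 5 and c_max = 10 fixed, as in the comparison.
rng = np.random.default_rng(0)
timings = {}
for n in (200, 400, 800):
    X = rng.standard_normal((n, 5))
    centers = rng.standard_normal((10, 5))
    # One Lloyd-style assignment step as a stand-in workload.
    timings[n] = time_workload(
        lambda: np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1))
```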

Note on Selection of Hyperparameters for the REBMIX&EM Strategies and Their Impact on Performance
In this section, we will discuss the hyperparameters of the REBMIX&EM strategies used for the presented experiments. There were five different hyperparameters used for all the experiments, in particular the number of bins v, the minimum number of components c min in the GMM, the maximum number of components c max in the GMM, the maximum number of iterations of the EM algorithm, and the threshold for convergence of the EM algorithm.
First, we discuss the choice of the number of bins v for the histogram preprocessing. There is no general rule for estimating the number of bins v in a histogram [44]. Some common estimators are Sturges' rule, log2(n) + 1, originating in [45], and the RootN rule, √n, from [46]. In [16], it was suggested to create the lower and upper bounds of the bin range K based on these rules: Sturges' rule always yields a smaller number of bins v and is suitable for the lower bound, while the RootN rule gives a larger number of bins v and can thus be used for the upper bound. The ranges K based on these two rules are narrow for datasets with a small number of observations n, e.g., for n = 100, the range is 7-10 bins; for a dataset with a large number of observations, e.g., n = 10,000, this leads to the very wide range 14-100. The bin range 3-100, used as a constant in each experiment, was chosen to test whether these additional ranges allow a better estimation of the GMM parameters with the REBMIX algorithm, and thus a better initialization of the EM algorithm, or only cause unnecessary computational effort. For the estimation of the number of bins v for the Single REBMIX&EM strategy, we used the Knuth rule. The Knuth rule uses a Bayesian approach to estimate the number of bins v in a histogram, thus making the estimation dependent on the observations in the dataset; Ref. [40] gives a straightforward extension to multivariate histograms. Although, in the original paper, the multivariate histograms can have a different number of bins in each dimension, i.e., vi, ∀i ∈ {1, 2, . . . , d}, it is worth noting that we used the same number of bins v in each dimension. The number of bins v estimated with the Knuth rule for the datasets from Section 4.1 is shown in Figure 14. It is clear from Figure 14 that the Knuth rule depends on both the dimension d of the dataset and the number of observations n, unlike the above-mentioned estimators.
In addition, it estimated a different number of bins v for datasets with the same dimension d and number of observations n (the boxes span multiple values of v), which indicates that n and d are not the only influences. It is also clear that the number of bins v is greatly reduced as the dimension of the dataset increases. This can be tied to the so-called curse of dimensionality [44]: a high-dimensional histogram grid contains many bins, e.g., for v = 5 and d = 10, there are 5^10 = 9,765,625 bins. Such an extremely large number of bins leads to many empty bins, i.e., bins with frequency kj = 0, which are heavily penalized by the Knuth rule (it can be seen that, for d = 10, the estimated number of bins was mostly v = 2 or v = 3). Additionally, Figure 14 gives the number of bins v that yielded the optimal GMM parameters using the Exhaustive and Best REBMIX&EM strategies for the same datasets. Although we provided a large range, judging by the values in Figure 14, both the Exhaustive and the Best REBMIX&EM strategies estimated the GMM parameters at a different number of bins v than proposed by the Knuth estimator, which in some experiments leads to benefits, while in others it leads to some numerical instabilities. For example, recalling Figure 4, the number of optimal GMM parameters estimated with the Exhaustive REBMIX&EM strategy was the greatest. On the other hand, as found by comparing the density-estimation performance in Section 5.1.2, for a small number of observations, n = 100, this caused some degeneracies in the estimated GMM parameters. The Best REBMIX&EM strategy favored a larger number of bins v in the histogram preprocessing in all the presented cases.
This is also a useful insight, because in the image-segmentation experiment (Table 3) the results were better than those acquired with the Single REBMIX&EM strategy with the number of bins v estimated with the Knuth rule. For the density-estimation tasks in Section 5.2, the results do not seem to have been strongly affected by the chosen REBMIX&EM strategy, at least in terms of the MISE value (Figure 7). Recall that we also used the intuitive v = 255 for the image-segmentation tasks. As already stated, this intuition was based on the fact that digital images have 255 different intensity values for each color channel. This assumption proved to be good, judging by the results we obtained with it; however, along with the Random&EM strategy, it was probably the most sensitive to noise.
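For reference, the Knuth rule used by the Single REBMIX&EM strategy can be sketched in one dimension: for each candidate bin count m, the (log) posterior of an equal-width histogram is evaluated and the maximizer is chosen. This is a hedged univariate sketch of the rule from [40], not the paper's multivariate implementation:

```python
import math
import numpy as np

def knuth_log_posterior(data, m):
    """Log posterior of an equal-width histogram with m bins (Knuth's rule)."""
    n = len(data)
    counts, _ = np.histogram(data, bins=m)
    return (n * math.log(m)
            + math.lgamma(m / 2)
            - m * math.lgamma(0.5)
            - math.lgamma(n + m / 2)
            + sum(math.lgamma(k + 0.5) for k in counts))

def knuth_bins(data, candidates=range(2, 101)):
    """Number of bins maximizing the posterior over the candidate range."""
    return max(candidates, key=lambda m: knuth_log_posterior(data, m))

# Hypothetical example: 1000 standard-normal observations.
rng = np.random.default_rng(0)
v = knuth_bins(rng.standard_normal(1000))
```

Note how the sum of lgamma terms over the bin counts penalizes histograms with many empty bins, which explains the small v estimated for high-dimensional grids.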
To conclude, the Single REBMIX&EM strategy with the Knuth rule can be safely used when a short computation time is preferred and the dimension of the dataset is small (the datasets in the density-estimation tasks in Section 5.2 had d = 2 or d = 3). For the Exhaustive and Best REBMIX&EM strategies, we recommend the RootN rule for the upper bound and the Knuth rule for the lower bound. The RootN rule will mostly overestimate the number of bins for larger n, judging by the values in Figure 14; however, because fewer EM iterations are needed with REBMIX as an initialization technique, this remains acceptable. If time efficacy is a major concern, to obtain the best possible results for datasets with a larger number of observations n, the Best REBMIX&EM strategy can first be used to obtain the upper bound of the number-of-bins range v for the histogram preprocessing; the Exhaustive REBMIX&EM strategy can then be used with this value as the upper bound, while keeping the Knuth rule for the lower bound.
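The two classical bin-count rules referenced above are one-liners; the rounding conventions below are our assumption, chosen to reproduce the ranges quoted earlier (7-10 for n = 100, 14-100 for n = 10,000):

```python
import math

def sturges_bins(n):
    """Sturges' rule: log2(n) + 1 bins (truncated)."""
    return math.floor(math.log2(n)) + 1

def rootn_bins(n):
    """RootN rule: sqrt(n) bins (rounded up)."""
    return math.ceil(math.sqrt(n))

def bin_range(n):
    """Bin range K: Sturges' rule as lower bound, RootN as upper bound."""
    return sturges_bins(n), rootn_bins(n)
```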
We would like to make some points about the selection of the cmin and cmax values for the GMM, or any other MM, parameter estimation. The number of MM parameters increases with the dimension, especially for the GMM. It is therefore always advisable to check whether the number of observations n in the dataset is larger than the number of parameters M, i.e., n ≥ M. For a dataset with a smaller number of observations n, this can be used as a rule-of-thumb selector of cmax. Choosing a smaller value for cmax when the number of observations n is small additionally reduces the probability of estimating a degenerate solution. Otherwise, the maximum number of components cmax can be as large as required, depending on the application of the GMM. For example, in the image-segmentation tasks, the number of observations was large; as can be seen in Table 3, the estimated number of components c was always high. The images used had a resolution of 480 × 320 or 320 × 480, which results in n = 153,600 observations. Based on our rule of thumb above, the value of cmax could go as high as cmax = 15,000; however, this would be numerically unstable due to the floating-point arithmetic. For this kind of example, where the number of observations n is high and we can expect a large number of components in the GMM, this hyperparameter can also be fine-tuned. Some lower value can be used as the starting point to obtain information about how the IC (in our case the BIC) changes with respect to an increase in the number of components c in the GMM. If there are large fluctuations in the value of the BIC, or the slope of the BIC-c curve seems too steep, we can increase the value of cmax until the curve stabilizes and the values of the BIC slowly start to increase. If some other IC is used instead of the BIC, the same analogy applies.
As for cmin, it is advisable to use cmin = 1 as the minimum value, to check whether the dataset is well fitted by a non-mixture distribution.
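The n ≥ M rule of thumb is easy to make concrete. For a full-covariance GMM, each component contributes d mean parameters and d(d+1)/2 covariance parameters, plus c − 1 free mixing weights; the sketch below computes the largest c satisfying n ≥ M (for d = 3 and n = 153,600 it gives a value on the order of the 15,000 quoted above):

```python
def gmm_param_count(c, d):
    """Free parameters of a c-component, d-dimensional full-covariance GMM:
    means (c*d), covariances (c*d*(d+1)/2), and mixing weights (c - 1)."""
    return c * d + c * d * (d + 1) // 2 + (c - 1)

def cmax_rule_of_thumb(n, d):
    """Largest c with at least as many observations as parameters (n >= M)."""
    c = 1
    while gmm_param_count(c + 1, d) <= n:
        c += 1
    return c
```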
For the maximum number of iterations of the EM algorithm and the threshold of the EM algorithm, we chose 1000 and 0.0001, respectively. These two values were chosen based on empirical evidence from other software implementing the EM algorithm (for example, Sklearn's EM implementation for GMM estimation has a default of 100 maximum iterations and 0.001 for the average log-likelihood threshold). These values should always be chosen on pragmatic grounds, i.e., we expect that the threshold can be reached within the given number of iterations. For example, the authors of [25] argue that perhaps the most efficient way of managing the convergence of the EM algorithm for GMM parameter estimation is to use only the specified number of iterations as the stopping criterion, because setting small threshold values can be hazardous in terms of computation times due to the unboundedness of the log-likelihood function. However, in our experiments, we observed that the REBMIX algorithm tends to estimate initial parameters close to the final ones, and we usually did not use even half of the expected iterations (see the second plot in Figure 4). To conclude, we can safely recommend these values as starting points; however, if necessary, fine-tuning can also be applied.
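The combined stopping rule (log-likelihood gain below a threshold, or a maximum iteration count, whichever comes first) can be captured in a generic driver. The function names and the toy update below are illustrative, not the paper's implementation:

```python
def run_em(e_step_m_step, init_params, loglik, tol=1e-4, max_iter=1000):
    """Generic EM driver: stop when the log-likelihood gain drops below
    `tol` or after `max_iter` iterations, whichever comes first."""
    params = init_params
    prev = loglik(params)
    for it in range(1, max_iter + 1):
        params = e_step_m_step(params)  # one combined E-step + M-step
        cur = loglik(params)
        if cur - prev < tol:            # convergence threshold reached
            return params, it
        prev = cur
    return params, max_iter

# Toy stand-in: a scalar "parameter" whose update halves its distance to 1,
# with a surrogate objective that increases toward 0.
params, iters = run_em(lambda p: (p + 1) / 2, 0.0, lambda p: -(1 - p) ** 2)
```

A good initialization corresponds to an `init_params` already near the optimum, which is exactly why REBMIX-initialized runs needed far fewer iterations than the allowed 1000.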

Conclusions
We have proposed the use of the REBMIX algorithm as an initialization technique for the EM algorithm, as applied to the MM parameter-estimation problem. Additionally, three different strategies were proposed for when the number of components in the MM is unknown. The proposed strategies were thoroughly tested on artificially created datasets (Section 5.1), density-estimation tasks (Section 5.2), and image-segmentation tasks (Section 5.3). Our proposals showed benefits in terms of accuracy as well as computational efficacy: while preserving good or even better results across various applications of the GMM, fewer of the more expensive EM iterations were needed for the EM to converge. Applied to real-world tasks, such as density estimation and image segmentation, the proposed initialization and strategies were again competitive with, and even better than, the state-of-the-art initialization strategies and, additionally, the state-of-the-art algorithms for density estimation and image segmentation (clustering), respectively. Moreover, the proposed initialization and strategies provided comparable or better results than the ADEBA algorithm in terms of computational effectiveness, as was shown in Figure 8. In addition, we provided some insights into the time complexity of the REBMIX algorithm in Section 5.4. Even though we did not provide a theoretical proof of the time complexity, due to the multiple stochastic factors impacting its performance, the empirical evaluation showed that REBMIX has linear time with respect to the parameter n and nonlinear time with respect to the parameters c and d. On the other hand, the k-means algorithm was linear with respect to all three parameters. However, the actual computation times in Figure 13 showed that REBMIX was faster than k-means in this specific case.
In the third plot of Figures 4 and 8, we provided some empirical insights into the time efficacy of the proposed REBMIX&EM strategies. A theoretical estimation of the time-complexity classes was not provided here, due to the use of the EM algorithm, which does not have clear termination criteria. That being said, we have made our best effort to provide enough empirical insight to state that our proposal is more time efficient than the state-of-the-art approaches.
From the paragraphs above, the main advantages of our proposal with regard to the state of the art can be extracted: it is time efficient and yields good results. The current state-of-the-art approach, regardless of the initialization technique used, is to run the EM algorithm from as many initializations as possible in a selected/given amount of time; therefore, algorithms/initializations that allow multiple repeats are preferred. The REBMIX heuristic is deterministic and cannot be improved with multiple restarts: for a given number of bins v, it will always estimate the same parameters of the GMM. We see this as the main disadvantage with regard to the state-of-the-art approaches. In addition, this parameter, i.e., the number of bins v, is usually not known in advance; therefore, we proposed the use of the Exhaustive and Best REBMIX&EM strategies. Both strategies are more computationally demanding than the Single REBMIX&EM strategy. The choices we made for the number of bins v are thoroughly described in Section 5.5. We see the task of choosing the appropriate range of the number of bins, or some specific value or set of values, as the largest overhead introduced by our proposal; other state-of-the-art strategies do not need this kind of parameter, only the remaining hyperparameters (cmax, cmin, the maximum number of iterations, and the threshold). Some smaller limitations were also encountered. Namely, datasets with a large dimension d and a small number of observations n led to some degeneracies in the estimates, since we did not include any safeguards for the EM algorithm; therefore, at this point, our proposal cannot be recommended for such problems. The empirically evaluated time complexity of the REBMIX algorithm also showed some problems.
The nonlinearity in terms of the dimension of the dataset d was to be expected, while the nonlinearity in terms of the maximum number of components c max was troublesome. This overhead introduced with our proposal can be burdensome, especially with larger values of the parameter c max .
To conclude this article, we have implemented all our proposals in the rebmix R package (https://cran.r-project.org/web/packages/rebmix/index.html), and they are ready to use. In addition, let us explain how our work can be extended. First, Bayesian regularization for the EM algorithm could be used to additionally avoid degeneracies and constrain the likelihood [14]. Second, the use of the GMM for certain image-segmentation and density-estimation tasks was questionable, e.g., for the flower image or the Bend distribution. Although we used the GMM as a proof of concept, our work can be extended to other types of MMs. For an extension to another parametric family of MM, the equations for the REBMIX and EM algorithms need to be derived. In [17], it was shown how the REBMIX algorithm can be extended to other types of MM, in particular to the von Mises MM. For the EM algorithm, the M-step equations need to be derived; for example, in [18], the derivation of the Weibull-Normal MM parameters is given. In addition, there is a plethora of literature on how the EM can be extended to other parametric families, [9,10] to name just a few. We will also strive to provide more theoretical insights into the REBMIX algorithm. Specifically, the empirical equation for decreasing the parameter Dmin should be improved to avoid the nonlinear time complexity in terms of the parameter cmax. The hyperparameter v, the number of bins for the histogram preprocessing, needs special care. For the histogram preprocessing, we used the same number of bins in each dimension, although the number of bins can differ between dimensions; alternatively, some adaptive bandwidth for the histogram preprocessing could be applied, like the ones used for kernel density estimation in [11]. Other REBMIX preprocessing capabilities, like kernel density or k-nearest neighbors, could also be used.
In the end, the application of the MM to image segmentation should be researched more thoroughly. Throughout the testing of the proposals on the image-segmentation tasks, we encountered some interesting facts, outlined in Section 5.3. As this was not the primary problem of this article, we leave this topic open to be revisited.
Author Contributions: The conceptualization and methodology was done by B.P.; The REBMIX algorithm software implementation was done by M.N.; The EM algorithm software implementation was done by B.P.; The combined REBMIX&EM software implementation was done by B.P. and M.N.; The software for testing purposes was made by B.P.; Formal analysis, visualization and writing-original draft preparation was made by B.P.; Resources, supervision, validation, writing-review and editing, project administration, and funding acquisition was made by J.K. and M.N. All authors have read and agreed to the published version of the manuscript.

Funding:
The authors acknowledge the financial support from the Slovenian Research Agency (research core funding No. 1000-18-0510).

Conflicts of Interest:
The authors declare no conflict of interest.