1. Introduction
The Expectation-Maximisation (EM) algorithm [1] is one of the most useful and widely adopted algorithms in data science, statistics, and pattern recognition. It is best known for its role as a maximum likelihood estimator in the presence of missing or incomplete data [2]. The EM algorithm has a broad range of applications, the most prominent being mixture model parameter estimation, with further use in clustering [3,4], image segmentation [5], density estimation [6,7], regression [8], anomaly detection [9] and more.
EM’s popularity lies in its simplicity and stability [2]. It addresses otherwise intractable maximum likelihood estimation problems involving missing data through an iterative procedure consisting of an Expectation (E) step and a Maximisation (M) step, repeated until convergence. At each iteration, the algorithm guarantees that the likelihood function does not decrease, making it a hill-climbing method [10]. Nevertheless, due to the presence of multiple local optima in the likelihood surface, EM requires a reasonably good initialisation [11]. The choice of the starting point can significantly influence the convergence path and ultimately determine which local optimum the algorithm converges to. The initialisation of EM has been extensively studied, and a variety of methods have been proposed to address this issue effectively [12,13,14,15].
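The E-step/M-step cycle described above can be illustrated with a minimal sketch for a two-component, one-dimensional Gaussian mixture. The function names, the simulated data, and the starting values are assumptions made for illustration only; this is not the implementation used in the study.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_two_gaussians(data, weights, means, variances, max_iter=200, tol=1e-8):
    """Plain EM for a two-component 1-D Gaussian mixture.

    Returns the final parameters and the log-likelihood recorded at the
    start of every iteration.
    """
    trace = []
    for _ in range(max_iter):
        # E-step: posterior component probabilities for each observation.
        resp, loglik = [], 0.0
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            loglik += math.log(total)
            resp.append([d / total for d in dens])
        trace.append(loglik)
        if len(trace) > 1 and trace[-1] - trace[-2] < tol:
            break
        # M-step: closed-form weighted updates of weights, means, variances.
        n_k = [sum(r[k] for r in resp) for k in range(2)]
        weights = [nk / len(data) for nk in n_k]
        means = [sum(r[k] * x for r, x in zip(resp, data)) / n_k[k]
                 for k in range(2)]
        variances = [
            max(sum(r[k] * (x - means[k]) ** 2
                    for r, x in zip(resp, data)) / n_k[k], 1e-6)  # variance floor
            for k in range(2)
        ]
    return weights, means, variances, trace

# Hypothetical, well-separated test data: 150 draws from N(0, 1) and N(4, 1).
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(150)]
        + [random.gauss(4.0, 1.0) for _ in range(150)])
weights, means, variances, trace = em_two_gaussians(
    data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

The recorded trace should be non-decreasing up to floating-point error, which is the hill-climbing property referred to in the text; a poor starting point would change which local optimum is reached, not this property.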
Another limitation of the EM algorithm is its slow, linear convergence to the optimum of the likelihood function [10]. Although this linear convergence contributes to EM’s stability, it often requires a large number of iterations to closely approach the optimum. This limitation can be partially addressed by using better initialisation methods, but it remains a significant concern in practice [16]. The issue becomes even more pronounced in scenarios where the updates in the E-step are minimal, or in other words, when the data poorly separates the latent components and the estimated entropy of their posterior distribution remains high [2]. In such cases, obtaining a reasonable initialisation is particularly challenging due to the high degree of overlap in the data [17]. Additionally, some distributions do not have closed-form solutions, and iterative procedures are required to obtain the M-step updates [18]. Therefore, accelerated EM approaches should be explored, and their effects carefully evaluated.
In this paper, simple yet effective approaches to accelerate the EM algorithm are reviewed and empirically investigated. A related study by [19] compared acceleration techniques for the EM algorithm in the context of item response theory models; the present work focuses instead on mixture model estimation with varying degrees of overlap between components, examining how overlap influences the performance and stability of different acceleration schemes. The primary objective is to establish clear guidelines on when EM acceleration is most effective, to identify which methods perform best under different conditions, and to examine how different initialisation strategies influence the stability and effectiveness of acceleration. A key finding, which to our knowledge has not been previously documented for mixture modelling, is that only one of the three acceleration parameter estimates proposed by [10] provides genuine acceleration; the other two consistently act as deceleration. To meet these objectives, a comprehensive simulation study was designed to incorporate a wide range of factors, including the degree of overlap between mixture components, the number of dimensions, the number of components in the model, the size of the data set, and other relevant parameters. These factors significantly affect the complexity of the estimation task and are therefore essential for evaluating the performance of various acceleration schemes. The study includes approximately 2400 distinct data sets, making it, to the best of our knowledge, one of the most extensive investigations on this topic. In addition, multiple estimation strategies based on different EM initialisation methods are considered. Finally, the empirical findings are validated using a real-world data set to demonstrate their practical relevance and general applicability.
The outline of this article is as follows. Section 2 provides the theoretical background on mixture modelling and the EM algorithm. Section 3 presents a review of the acceleration schemes considered in this work. Section 4 offers a brief overview of EM initialisation strategies. Section 5 introduces the experimental setup used to evaluate the performance of different acceleration schemes. Section 6 reports the main results and discussion based on simulated data sets. Section 7 presents a real-world data set along with the corresponding results. The article concludes with Section 8.
6. Results and Discussion
This section presents and discusses the results obtained on the simulated data sets.
6.1. Convergence Performance of Acceleration Schemes
Figure 1 presents the mean number of EM iterations (top row) and computation time (bottom row, log scale) as a function of overlap level, faceted by sample size n, for all nine acceleration variants evaluated in this study. The figure reveals three distinct groups of methods.
The first group consists of the
and
variants of STEM and SQUAREM. These methods consistently require more iterations than standard EM: STEM with
requires
more iterations than STEM with
, while
requires
more. The pattern is nearly identical for SQUAREM (
and
, respectively). Both
and
also exceed standard EM in iteration count (
–
for
,
–
for
), effectively acting as deceleration rather than acceleration. This occurs because
and
frequently produce oversized step sizes that fail to increase the log-likelihood, triggering a fallback to an unaccelerated EM update at each such iteration. The
estimate, being the geometric mean of
and
, yields a more conservative step size that succeeds more frequently. The quality of the obtained solutions, as measured by ARI, is statistically indistinguishable across all
variants (spread
). These findings hold across all overlap levels, sample sizes, and initialisations tested, confirming the recommendation of [
10]. In all subsequent analyses, only
is used for STEM and SQUAREM.
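The fallback mechanism described in this paragraph can be sketched as follows. The sketch assumes the commonly used SQUAREM formulation, in which the step length is the negative ratio of the norms of the first and second differences of two successive EM updates; the names `squarem_step`, `em_update`, and `loglik` are hypothetical, and the code is not taken from the implementation evaluated here.

```python
import math

def squarem_step(theta, em_update, loglik):
    """One squared-extrapolation cycle with an unaccelerated-EM fallback.

    theta     : list of floats (current parameter vector)
    em_update : function mapping a parameter vector to its EM update
    loglik    : function returning the log-likelihood of a parameter vector
    """
    theta1 = em_update(theta)
    theta2 = em_update(theta1)
    r = [b - a for a, b in zip(theta, theta1)]                # first difference
    v = [c - b - ri for b, c, ri in zip(theta1, theta2, r)]   # second difference
    norm_r = math.sqrt(sum(x * x for x in r))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_v == 0.0:
        return theta2  # nothing to extrapolate; keep the plain EM result
    alpha = -norm_r / norm_v  # assumed geometric-mean-type step length
    candidate = [t - 2.0 * alpha * ri + alpha ** 2 * vi
                 for t, ri, vi in zip(theta, r, v)]
    # In a real mixture model the extrapolated point may be infeasible
    # (e.g. negative variances) and would need a validity check here.
    candidate = em_update(candidate)  # stabilising EM update
    if loglik(candidate) < loglik(theta2):
        return theta2  # oversized step: fall back to the unaccelerated update
    return candidate

# Toy fixed-point map standing in for an EM update (fixed point at 10).
toy_update = lambda t: [0.9 * t[0] + 1.0]
toy_loglik = lambda t: -(t[0] - 10.0) ** 2
accelerated = squarem_step([0.0], toy_update, toy_loglik)
```

On this toy linear map a single accelerated cycle lands essentially on the fixed point, whereas two plain updates only reach 1.9. Each rejected candidate still costs the extra update and log-likelihood evaluations, which is why frequent fallbacks show up as deceleration.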
The second group contains the greedy search methods—line search and golden section search. These achieve the largest iteration reductions (up to and , respectively), as they explicitly maximise the log-likelihood over a range of values at each iteration. However, this comes at a substantial computational cost: line search increases total computation time by on average, with increases exceeding at small-to-moderate sample sizes, due to multiple additional log-likelihood evaluations per iteration. Golden section search is less expensive (approximately slower on average) but exhibits erratic behaviour: at and , it requires more iterations than standard EM, and it produced catastrophic numerical instability in a small number of runs, with MSE values for covariance estimates exceeding . Both greedy methods are therefore impractical for routine use despite their iteration efficiency.
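For reference, a generic golden section search for maximising a one-dimensional function is sketched below. In an acceleration setting the function would be the log-likelihood along the search direction as a function of the step length; the bracketing interval, the tolerance, and the toy objective used here are assumptions for illustration.

```python
import math

def golden_section_max(f, lo, hi, tol=1e-6):
    """Maximise a unimodal function f on [lo, hi] by golden section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0  # 1 / golden ratio
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc > fd:
            # The maximum lies in [lo, d]; reuse c as the new upper probe.
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = f(c)
        else:
            # The maximum lies in [c, hi]; reuse d as the new lower probe.
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = f(d)
    return 0.5 * (lo + hi)

# Toy concave objective with its maximum at 2.
best_step = golden_section_max(lambda a: -(a - 2.0) ** 2, 0.0, 5.0)
```

Each pass through the loop costs one additional function evaluation, which in the EM context means one additional log-likelihood evaluation. The method also presumes a unimodal objective on the bracket; when that assumption fails the returned step can be poor.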
The third group comprises STEM and SQUAREM with , which provides the best trade-off between iteration reduction and computational overhead. SQUAREM reduces iterations by overall, with the benefit increasing monotonically with both overlap and sample size: from a modest at , to at , . STEM follows a similar but slightly less pronounced pattern ( overall reduction). At small sample sizes and low overlap, both methods may slightly increase the iteration count (by up to at , ), indicating that the overhead of computing is not offset by the acceleration gain in these easy problems. Crucially, the per-iteration overhead of STEM and SQUAREM is minimal: at , both methods achieve wall-clock time savings that match or exceed their iteration savings, while at smaller n the time overhead remains modest.
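Table 4 provides detailed iteration and time reduction percentages across the full grid of sample sizes and overlap levels. The increase of SQUAREM’s iteration benefit with both n and overlap level is evident and almost uniformly monotonic (the only exception is the step from overlap 0.15 to 0.20 at n = 1600): the method becomes increasingly valuable precisely in the settings where standard EM struggles most.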
Table 4.
Percentage reduction in EM iterations and computation time relative to standard EM (%). Positive values indicate improvement; negative values indicate degradation.
| n | Method | Metric | Overlap 0.01 | Overlap 0.05 | Overlap 0.10 | Overlap 0.15 | Overlap 0.20 |
|---|---|---|---|---|---|---|---|
| 200 | Line | Iterations | 27.5 | 35.0 | 33.9 | 35.8 | 40.1 |
| 200 | Line | Time | −86.0 | −92.2 | −104.2 | −112.3 | −112.7 |
| 200 | Golden | Iterations | −61.5 | −5.3 | 12.1 | 13.1 | 16.8 |
| 200 | Golden | Time | −135.4 | −121.1 | −98.6 | −104.6 | −86.4 |
| 200 | STEM | Iterations | −11.7 | 2.0 | 7.7 | 9.1 | 11.6 |
| 200 | STEM | Time | −46.8 | −54.8 | −50.7 | −59.3 | −56.4 |
| 200 | SQUAREM | Iterations | −13.7 | 3.2 | 9.6 | 11.5 | 15.2 |
| 200 | SQUAREM | Time | −49.7 | −50.7 | −49.2 | −58.5 | −57.0 |
| 400 | Line | Iterations | 33.2 | 39.1 | 38.5 | 38.9 | 45.1 |
| 400 | Line | Time | −82.9 | −93.9 | −105.0 | −108.1 | −99.0 |
| 400 | Golden | Iterations | 24.8 | 38.4 | 43.5 | 39.8 | 41.8 |
| 400 | Golden | Time | −23.6 | −24.9 | −20.3 | −28.9 | −23.8 |
| 400 | STEM | Iterations | −1.9 | 14.5 | 16.6 | 20.5 | 27.2 |
| 400 | STEM | Time | −40.7 | −38.4 | −42.8 | −27.8 | −27.7 |
| 400 | SQUAREM | Iterations | −4.4 | 13.9 | 20.1 | 26.0 | 33.3 |
| 400 | SQUAREM | Time | −45.0 | −42.4 | −43.2 | −33.5 | −20.1 |
| 800 | Line | Iterations | 36.6 | 38.0 | 42.6 | 43.0 | 42.9 |
| 800 | Line | Time | −55.0 | −78.2 | −80.9 | −84.7 | −89.0 |
| 800 | Golden | Iterations | 17.7 | 45.3 | 46.1 | 45.1 | 40.0 |
| 800 | Golden | Time | −20.6 | −14.3 | −13.5 | −22.3 | −24.1 |
| 800 | STEM | Iterations | 9.0 | 19.2 | 30.6 | 31.0 | 32.9 |
| 800 | STEM | Time | −11.1 | −16.8 | −5.6 | −0.8 | 2.4 |
| 800 | SQUAREM | Iterations | 5.2 | 22.1 | 32.9 | 36.7 | 39.6 |
| 800 | SQUAREM | Time | −14.5 | −20.6 | −11.8 | −3.9 | 2.8 |
| 1600 | Line | Iterations | 39.4 | 43.0 | 40.9 | 43.2 | 44.3 |
| 1600 | Line | Time | −19.7 | −34.8 | −49.7 | −52.8 | −52.8 |
| 1600 | Golden | Iterations | 28.6 | 47.7 | 45.8 | 47.8 | 42.1 |
| 1600 | Golden | Time | −7.1 | −5.3 | −10.5 | −9.7 | −15.0 |
| 1600 | STEM | Iterations | 19.3 | 28.9 | 36.3 | 36.6 | 35.5 |
| 1600 | STEM | Time | 0.7 | −0.3 | 1.0 | −1.2 | 0.3 |
| 1600 | SQUAREM | Iterations | 16.9 | 36.0 | 42.6 | 47.7 | 43.3 |
| 1600 | SQUAREM | Time | 1.2 | 2.0 | 3.6 | 6.5 | 0.8 |
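To help read the entries of Table 4, the short sketch below shows how a percentage reduction relative to standard EM is presumably computed and how a negative time reduction converts into a slowdown factor. Both function names are hypothetical, and the definition is the usual relative-change formula, assumed rather than quoted from the study.

```python
def reduction_pct(baseline, accelerated):
    """Percentage reduction of `accelerated` relative to `baseline`.

    Positive values mean the accelerated run needed fewer iterations
    (or less time) than standard EM; negative values mean it needed more.
    """
    return 100.0 * (baseline - accelerated) / baseline

def time_ratio(reduction):
    """Ratio accelerated/baseline implied by a percentage reduction."""
    return 1.0 - reduction / 100.0

# Hypothetical iteration counts: 100 for standard EM, 60 when accelerated.
example_reduction = reduction_pct(100, 60)   # 40.0
# Table 4 entry for line search at n = 200, overlap 0.20 (time): -112.7
line_search_ratio = time_ratio(-112.7)       # about 2.13 times the EM time
```

Under this reading, a time entry of −100 corresponds to a run that takes twice as long as standard EM, and an entry of 0 to no change.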
The number of mixture components is the strongest single predictor of iteration count (), followed by overlap level (), sample size and acceleration method (both ). Dimensionality and initialisation have negligible effects on iteration count (). Importantly, the ranking of acceleration methods remains stable across all tested dimensionalities (), component counts (), and initialisation methods. For computation time, sample size dominates (), followed by the number of components (), while the acceleration method explains less than of time variance for STEM and SQUAREM—reflecting their minimal per-iteration overhead.
6.2. Effect of Acceleration on Estimation Quality
A central question is whether acceleration schemes that reduce iteration counts also degrade the quality of parameter estimates. Figure 2 presents the mean bias and mean squared error for weights, means, and covariance matrices as a function of overlap, averaged across all sample sizes, dimensionalities, component counts, and initialisations. Values are clipped at the 99th percentile to prevent distortion from a small number of degenerate runs associated with the golden section scheme.
Standard EM, STEM, and SQUAREM produce virtually identical parameter estimates across all six metrics and all overlap levels. The lines overlap almost perfectly in every panel, confirming that acceleration with does not introduce systematic bias or increase estimation variance. Line search yields marginally higher bias and MSE for means and covariances at high overlap, likely because its aggressive per-iteration optimisation occasionally overshoots into regions of the parameter space that are harder to recover from.
The golden section scheme presents a paradox: after clipping, it appears to produce the best estimates—lowest bias and MSE across all panels. However, this is a survivorship effect—that is, the clipping procedure preferentially removes the golden scheme’s degenerate runs, and the remaining sample is no longer representative of the method’s overall performance. The golden scheme generated degenerate solutions with near-singular covariance matrices in a small but non-negligible fraction of runs, producing MSE values exceeding
. After clipping these extreme values, the surviving runs are precisely those where the greedy search successfully located a superior optimum. The unclipped mean log-likelihood for golden (−50,650) is an order of magnitude worse than standard EM (−1952), confirming that its apparent superiority in
Figure 2 does not generalise.
Table 5 reports the mean ARI and median log-likelihood across overlap levels for each method. Median log-likelihood is used rather than the mean to mitigate the influence of golden degenerate runs. The ARI values are nearly indistinguishable: the total spread across all five methods is
, and the acceleration method explains less than
of the total variance in ARI (
). For comparison, overlap level and the number of components each explain over
of ARI variance (
and
, respectively). This confirms that the choice of acceleration scheme has no practical impact on clustering quality—the dominant factors are the inherent difficulty of the mixture (overlap, number of components, dimensionality) and sample size, not the optimisation strategy.
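Since clustering quality is reported through the adjusted Rand index (ARI), a compact reference computation may be useful. The sketch below implements the standard pair-counting definition using only the Python standard library; the function name and the toy label vectors are illustrative assumptions.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same observations."""
    n = len(labels_true)
    # Contingency counts and their row/column margins.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)   # chance agreement
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)

# The index is invariant to label permutations.
same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A value of 1 indicates identical partitions, values near 0 indicate agreement no better than chance, and negative values indicate less agreement than expected by chance.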
In summary, STEM and SQUAREM with
provide a 24–
reduction in iterations (
Section 6.1) without any measurable degradation in parameter estimation quality, clustering performance, or model fit. Acceleration is, for practical purposes, a free improvement to the EM algorithm in the settings studied here.
6.3. Interaction Between Initialisation and Acceleration
We now examine whether the benefit of acceleration depends on the initialisation method, and whether acceleration degrades estimation quality for any particular initialisation. This section first analyses the four initialisation methods (hclust, k-means, random, and REBMIX averaged across all its configurations), and then provides a detailed breakdown of REBMIX preprocessing and mode-traversing options.
6.3.1. Acceleration Benefit Across Initialisations
Figure 3 presents the average iteration reduction and time reduction achieved by STEM and SQUAREM (relative to standard EM) as a function of overlap and sample size n, separately for each initialisation method.
The benefit of acceleration varies substantially across initialisations. Hierarchical clustering and k-means benefit the most, with iteration reductions reaching 46– at and . Random initialisation follows closely (43– at ). REBMIX benefits the least, achieving only 35– at and substantially less at smaller n. At , REBMIX shows no iteration benefit (and slight degradation at low ), while hclust and k-means already achieve 15– reduction.
The time reduction (bottom row) reveals a more important distinction. For k-means and random, the iteration savings translate directly into wall-clock savings: up to 48– at . For hclust, the time savings are more modest (10– at ) because the expensive initialisation step dominates the total cost. For REBMIX, acceleration increases computation time at small n—by as much as at —because the preprocessing step contributes substantially to the total cost, and the per-iteration overhead of computing is not offset by the modest iteration savings. Only at does REBMIX begin to see marginal time savings.
This interaction has a practical implication: acceleration should always be applied when using k-means, random, or hclust initialisation, as the overhead is negligible and the potential savings are substantial. With REBMIX initialisation, acceleration is beneficial only at larger sample sizes ().
6.3.2. Estimation Quality Is Unaffected
Figure 4 shows the change in six estimation quality metrics (ARI, MSE and bias for weights, means, and covariances) when switching from standard EM to each acceleration scheme, separately for each initialisation.
For hclust and k-means, all six quality metrics change by less than regardless of which acceleration scheme is used. The acceleration method ranking is also stable: the initialisation ranking (hclust > k-means > random > REBMIX by ARI) is preserved across all five acceleration schemes. Acceleration is therefore a quality-neutral transformation that can be applied independently of the initialisation choice.
For random initialisation, SQUAREM shows a small increase in MSE() of and a small decrease in bias() of , suggesting that the accelerated scheme occasionally converges to slightly different local optima when starting from a poor initialisation. These differences are small in practical terms and do not change the method ranking.
The golden section scheme shows anomalous behaviour with REBMIX: substantially lower MSE and bias values. As discussed in Section 6.2, this is a survivorship effect: the golden scheme’s degenerate runs are removed by clipping, and the surviving runs happen to reach superior optima.
6.3.3. REBMIX Configuration Analysis
The REBMIX algorithm offers three preprocessing methods and three mode traversing strategies. Figure 5 shows how the acceleration benefit varies across these nine configurations.
The iteration reduction pattern (
Figure 5a) is consistent across preprocessing methods: all configurations show 15–
reduction at high overlap, with the outliers mode benefiting slightly more than all or outliersplus. KNN with outliers shows the largest apparent reduction, but this reflects survivorship: this configuration completes only
of the experimental grid, with failures concentrated at high
d and high
c where the EM problem is most difficult. When restricted to the same configurations where KNN succeeded, histogram preprocessing achieves equal or higher ARI.
The time reduction (
Figure 5b) reveals the cost of preprocessing. Histogram configurations show the most favourable time profile (up to
savings at high
), while KNN configurations show severe time penalties (
to
, i.e., 2–
slower) because the
k-nearest neighbour preprocessing is
more expensive than histogram binning, and this cost is not offset by the iteration savings.
Figure 6 confirms that the choice of REBMIX configuration does not interact meaningfully with acceleration in terms of estimation quality: the changes in MSE and bias are negligible across all nine configurations and all acceleration schemes.
Based on these results, we recommend histogram preprocessing with the outliers mode traversing strategy as the default REBMIX configuration: it achieves the best ARI among configurations with full completion, requires the fewest iterations, and is the fastest. KDE is a viable alternative with equivalent quality but longer computation time. KNN preprocessing is not recommended due to its computational cost and poor scalability to high-dimensional, many-component settings.
7. Hard Disk Drive Failure Data Set
We analysed hard disk drive failure patterns using publicly available SMART telemetry data from Backblaze [28], covering drive failures from the years 2022, 2023, 2024, and Q1 of 2025. Each daily drive snapshot was filtered to retain only records associated with failed drives (failure > 0). This approach isolates the characteristics of drives at or near their point of failure, allowing for a more precise analysis of failure-related patterns. To ensure consistency across model types, all SSD models were excluded using a predefined list, focusing solely on mechanical hard drives, where mechanical wear and age-related degradation are dominant failure mechanisms.
The data set includes a total of 197 features, of which 186 correspond to SMART attributes, each reported in both raw and normalized form [29]. The remaining 11 features contain metadata such as date, serial number, model, capacity (in bytes), data center location (pod_id), and failure status. Normalized features are highly vendor-specific, and since the data set includes a variety of manufacturers, these features are not meaningful for comparison. In addition, many of the features are highly correlated, skewed, and sparse, often unreported or zero-valued across many models, which limits their usefulness in the context of mixture modelling.
We therefore selected only three SMART raw features: smart_9_raw (power-on hours), smart_193_raw (load cycle count), and smart_194_raw (temperature). These features account for most of the variance in the data and are also theoretically linked to reliability, as well as the mechanical and thermal load of the hard disk drive. The load cycle count exhibited low to moderate skewness and was log-transformed to mitigate this issue. All selected features were then rescaled to the interval after removing extreme outliers, defined as values outside the 2.5th to 97.5th percentile range, to ensure a uniform range across the data. Prior to rescaling, feature ranges varied considerably. Although the data set was not an ideal fit for Gaussian mixture modelling, this preprocessing enabled us to obtain reasonably interpretable results. After the final removal of missing values, the dataset contained 10,678 observations.
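The trimming and rescaling step can be sketched for a single feature as follows. The target interval is exposed as a parameter whose default of (0, 1) is an assumption, as are the function name and the use of inclusive percentile interpolation. Note that this sketch handles one feature in isolation, whereas a full pipeline would drop entire observations that are outliers in any feature so that rows stay aligned.

```python
import math
import statistics

def trim_and_rescale(values, target=(0.0, 1.0), transform=None):
    """Trim values outside the 2.5th-97.5th percentile range, then rescale.

    target    : (low, high) interval to rescale to; (0, 1) is an assumed default
    transform : optional function applied first, e.g. math.log for a
                skewed, strictly positive feature
    """
    if transform is not None:
        values = [transform(v) for v in values]
    # With n=40 the first and last cut points are the 2.5th and 97.5th percentiles.
    cuts = statistics.quantiles(values, n=40, method="inclusive")
    low, high = cuts[0], cuts[-1]
    kept = [v for v in values if low <= v <= high]
    lo, hi = min(kept), max(kept)
    a, b = target
    return [a + (b - a) * (v - lo) / (hi - lo) for v in kept]

# Hypothetical feature with values 0, 1, ..., 100.
scaled = trim_and_rescale(list(range(101)))
# Hypothetical positive counts, log-transformed before trimming.
scaled_log = trim_and_rescale([float(k) for k in range(1, 202)], transform=math.log)
```

Trimming before rescaling keeps a handful of extreme readings from compressing the bulk of the data into a narrow part of the target interval.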
The number of mixture components was not known in advance. Apart from the hard disk drive model or manufacturer, the data carry no inherent labelling that could serve as an initial guess. As there are over 50 different models, it is highly unlikely that each represents a unique pattern. A second possible guess is the number of hard disk drive manufacturers, which was five, namely HGST, Western Digital, Seagate, Toshiba, and Hitachi. Among these, only one hard disk drive model was from Hitachi; most of the observations were Seagate drives (5856 or 54.8%), followed by Toshiba (2269 or 21.2%), HGST (2079 or 19.4%), and Western Digital (473 or 4.5%). Hence, there could be five, or more realistically four, major patterns, yet this is also unlikely, as there may be other traits shared across different manufacturers. Therefore, we chose to use a model selection procedure and determine the best model via the Bayesian Information Criterion (BIC) [30],

$$\mathrm{BIC} = -2 \log L + M \log n,$$

where $\log L$ is the obtained log-likelihood value, M is the number of parameters in the mixture model, and n is the number of observations; lower values indicate a better model. The minimum number of components was chosen as 2, and the maximum number of components was 10. We also used the same EM initialisations and acceleration schemes as in the simulation study.
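The model selection procedure can be sketched as below, assuming the common convention in which BIC equals minus twice the log-likelihood plus the parameter count times the logarithm of the sample size, so that lower values are preferred. The parameter count assumes unrestricted covariance matrices, and the log-likelihood values in the example are hypothetical placeholders, not results from this study.

```python
import math

def gmm_param_count(c, d):
    """Free parameters of a c-component, d-dimensional Gaussian mixture
    with unrestricted covariance matrices (an assumption of this sketch)."""
    return (c - 1) + c * d + c * d * (d + 1) // 2

def bic(loglik, n_params, n_obs):
    """BIC under the lower-is-better convention (assumed here)."""
    return -2.0 * loglik + n_params * math.log(n_obs)

def select_by_bic(logliks, d, n_obs):
    """Pick the component count with the smallest BIC.

    logliks : dict mapping component count -> maximised log-likelihood
    """
    scores = {c: bic(ll, gmm_param_count(c, d), n_obs)
              for c, ll in logliks.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical log-likelihoods for a 1-D problem with 100 observations.
best_c, scores = select_by_bic({2: -100.0, 3: -99.9}, d=1, n_obs=100)
```

In this toy case the small gain in log-likelihood from the third component does not offset the penalty for its three extra parameters, so two components are selected. For three features and ten components, the same counting rule gives 99 free parameters.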
Table 6 presents the results for all six initialisation strategies. Most estimation strategies converge toward models with 10 components. The best (lowest) BIC value was achieved using hclust initialisation (−19,539), though at the cost of substantially longer processing time (155–170 s) due to the agglomerative clustering step. The k-means initialisation yielded the second-best BIC (−19,496) with reasonable computational time, and was therefore selected as the preferred method.
Across all estimation strategies, a consistent reduction in both computation time and the number of iterations was observed when using SQUAREM and STEM acceleration. SQUAREM achieved the largest iteration reductions (27–60%) and, with k-means initialisation, reduced computation time by . As in the simulation study, line and golden section search effectively reduced the number of EM iterations but at the cost of increased computational time—line search was 2– slower than standard EM across all initialisations.
The interaction between initialisation and acceleration also confirms the simulation findings. With hclust, the iteration reduction translates to only time saving because the initialisation dominates the total cost. With REBMIX histogram preprocessing, SQUAREM reduces iterations by and time by —a more modest benefit than with k-means, consistent with the simulation observation that REBMIX starts closer to the optimum and thus benefits less from acceleration. REBMIX with KNN preprocessing is entirely dominated by the preprocessing cost (305 s), with acceleration contributing negligible time savings.
Notably, with random initialisation, standard EM and SQUAREM converge to a model (BIC = −18,887), while line search, golden section, and STEM find a model with substantially better BIC (−19,428). This suggests that the more exploratory search strategies occasionally escape local optima that trap the standard and quadratic acceleration schemes. However, this is an isolated observation; for all other initialisations, all acceleration schemes converge to the same solution, confirming the simulation finding that acceleration does not systematically alter the quality of the obtained estimates.
8. Conclusions
In this article, we have examined simple acceleration schemes applicable to the EM algorithm for Gaussian mixture modelling, with a focus on their behaviour under varying degrees of component overlap. We evaluated linear (STEM) and quadratic (SQUAREM) acceleration with three parameter estimates (, , and ), as well as greedy line search and golden section search, across a comprehensive simulation study comprising 240 mixture configurations (3 dimensionalities, 4 component counts, 5 overlap levels, and 4 sample sizes) and four initialisation methods (hierarchical clustering, k-means, random, and REBMIX). The findings were validated on a real-world Backblaze hard drive failure dataset.
A key empirical contribution of this study is the systematic comparison of the three acceleration parameter estimates proposed by [
10]. Across all tested configurations,
and
consistently required more iterations than standard EM (52–
and 35–
more, respectively), effectively acting as deceleration. Only
, the geometric mean of
and
, provides genuine acceleration. This occurs because
and
frequently produce oversized step sizes that fail to increase the log-likelihood, triggering fallback to unaccelerated updates. This finding, which to our knowledge has not been documented for mixture modelling applications, confirms and strengthens the recommendation of [
10] to use
exclusively.
With , both SQUAREM and STEM reduce EM iterations by and on average, with the benefit increasing monotonically with overlap level and sample size—reaching at and for SQUAREM. The per-iteration overhead is minimal: at , the iteration savings translate directly into wall-clock time reductions. Line search and golden section search achieve larger iteration reductions (up to ) but increase total computation time by 50– due to repeated log-likelihood evaluations, making them impractical for routine use. The golden section scheme additionally exhibited catastrophic numerical instability in a small fraction of runs, producing degenerate covariance estimates.
Crucially, acceleration does not degrade estimation quality. Across all six metrics examined (ARI, log-likelihood, bias and MSE for weights, means, and covariances), the acceleration method explained less than of the total variance. The initialisation ranking was preserved regardless of which acceleration scheme was applied, confirming that the choice of acceleration and the choice of initialisation can be made independently.
The interaction between initialisation and acceleration revealed that REBMIX benefits least from acceleration ( iteration reduction vs. 35– for other methods), because it already starts near the optimum. At small sample sizes, the per-iteration overhead of computing can exceed the savings, making acceleration counterproductive for REBMIX at . In contrast, k-means is the most favourable partner for acceleration, achieving up to time savings at . Among REBMIX configurations, the outliers mode traversing strategy consistently outperformed all and outliersplus, and histogram preprocessing offered the best cost-efficiency, being faster than kernel density estimation at equivalent quality and an order of magnitude faster than k-nearest neighbour preprocessing, which additionally failed to scale beyond or .
The main findings of this study can be summarised as follows:
Only the estimate provides genuine acceleration; and act as deceleration and should not be used.
SQUAREM with is the most effective acceleration scheme, reducing iterations by up to with negligible per-iteration overhead.
Acceleration effectiveness depends on sample size: benefits are negligible at and increase monotonically with n.
Acceleration does not deteriorate parameter estimates under any tested combination of initialisation, dimensionality, number of components, or overlap level.
Greedy methods (line search, golden section) reduce iteration counts but are computationally inefficient and, in the case of golden section, numerically unstable.
Initialisation and acceleration do not interact: the initialisation ranking is preserved across all acceleration schemes, and acceleration benefits all initialisations (though REBMIX benefits least).
For REBMIX, histogram preprocessing with outliers mode is recommended; k-nearest neighbour preprocessing is not recommended due to poor scalability.
In summary, practitioners fitting Gaussian mixture models should use SQUAREM with the parameter estimate as their default acceleration scheme. The and estimates should be avoided entirely, as they reliably decelerate convergence. All methods evaluated in this study are implemented in the free and open-source R package rebmix, making acceleration readily available. For instance, enabling SQUAREM on the Backblaze hard drive dataset reduced computation time from s to s, which is a saving achieved by changing a single parameter in the estimation call. To our best knowledge, this capability is not natively available in any other R package for mixture modelling.
For future research, it would be valuable to investigate whether the acceleration behaviour observed here extends to non-Gaussian mixture models, constrained covariance structures (e.g., diagonal or tied covariances), and mixtures with very large numbers of components. Additionally, histogram-based EM schemes could substantially reduce per-iteration cost, potentially amplifying the benefits of acceleration in large-scale settings.