Comparison of Initialization Strategies for EM in High-Dimensional Multivariate Diagonal Gaussian Mixture Models
Abstract
1. Introduction
2. Materials and Methods
2.1. Multivariate Diagonal Gaussian Distribution
2.2. Multivariate Diagonal Gaussian Mixture Model (MDGMM)
2.3. Likelihood Function of the Dataset
2.4. EM Recursive Algorithm for Mixture Parameters Estimation
2.4.1. Initial Values
2.4.2. E-Step
2.4.3. M-Step
2.4.4. Stopping Criteria
2.4.5. Condition for Regularization/Stabilization of Iterations
2.5. Initialization Strategies Compared in the Study
2.5.1. Random Initialization
2.5.2. K-Means Initialization
2.5.3. Hierarchical Clustering Initialization
2.5.4. Ensemble Clustering Initialization
- Estimation of one-dimensional models (univariate GMM): For each of the M features, a one-dimensional Gaussian mixture model with K components is fitted independently to the values of that feature across all observation vectors. This gives a preliminary view of the data structure along each dimension. The same number of components K is used for all features. K is set to the known number of components in each dataset; therefore, no separate model selection procedure is performed for individual features.
- Generating component partitions (MAP method): After fitting the one-dimensional GMM for a given feature m, each measurement of that feature (one per observation vector, indexed by n) is classified into one of the K clusters. The assignment uses the MAP (Maximum A Posteriori) rule, which assigns the observation to the mixture component with the highest posterior probability. As a result, each of the M features yields an independent partition of all observation vectors into K classes.
- Building consensus: Given the M independent partitions of the observation vectors, the algorithm aggregates them to identify the most stable division. This involves constructing a co-association matrix. Each element of this matrix corresponds to a pair of observation vectors, and its value is the number of partitions in which the pair was assigned to the same cluster. The co-association matrix is treated as a similarity measure, from which a distance matrix is derived. The final consensus partition is determined using hierarchical clustering (Ward’s method), which divides the observation vectors into K groups.
- Initialization of the Multivariate Diagonal GMM: As with the previous initialization strategies (Random, K-means, and Hierarchical), the initial estimates of the mixture weights, component means, and component variances are computed from the Ensemble Clustering partition.
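The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ensemble_init` is hypothetical, scikit-learn's `GaussianMixture` stands in for the univariate fitting routine, the distance is taken as one minus the co-association frequency (the text does not specify the conversion), and a small variance floor of `1e-6` is added for numerical safety. SciPy's Ward linkage is applied directly to this distance matrix, although Ward's criterion is formally defined for Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.mixture import GaussianMixture


def ensemble_init(X, K, seed=0):
    """Sketch: return (labels, weights, means, variances) for a diagonal GMM."""
    N, M = X.shape
    coassoc = np.zeros((N, N))
    for m in range(M):
        column = X[:, [m]]  # feature m as an (N, 1) array
        # Step 1: fit a univariate K-component GMM to feature m.
        gmm = GaussianMixture(n_components=K, random_state=seed).fit(column)
        # Step 2: MAP assignment (predict returns the most probable component).
        part = gmm.predict(column)
        # Step 3a: add 1 for every pair placed in the same cluster.
        coassoc += part[:, None] == part[None, :]
    # Step 3b (assumption): distance = 1 - fraction of partitions that agree.
    dist = 1.0 - coassoc / M
    np.fill_diagonal(dist, 0.0)
    # Ward linkage on the condensed distance matrix, cut into at most K groups.
    tree = linkage(squareform(dist, checks=False), method="ward")
    labels = fcluster(tree, t=K, criterion="maxclust") - 1
    # Step 4: initial weights, means and diagonal variances from the partition.
    groups = [X[labels == k] for k in range(K) if np.any(labels == k)]
    weights = np.array([len(g) / N for g in groups])
    means = np.array([g.mean(axis=0) for g in groups])
    variances = np.array([g.var(axis=0) + 1e-6 for g in groups])  # assumed floor
    return labels, weights, means, variances
```

The co-association matrix has N × N entries, so the memory cost of this sketch grows quadratically with the number of observation vectors and linearly in time with the number of features M.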
2.6. Reference Algorithm for Fitting MDGMM Models to Data
2.7. Datasets Used
2.7.1. Artificially Generated Datasets
2.7.2. Real-World Biological Data
2.8. Performance Evaluation
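- Adjusted Rand Index (ARI): evaluates the similarity between two partitions of the same dataset, correcting for chance agreement between them [36,37], and is defined as

  $$\mathrm{ARI} = \frac{\sum_{i,j}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right] \Big/ \binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_i}{2} + \sum_{j}\binom{b_j}{2}\right] - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right] \Big/ \binom{n}{2}},$$

  where:
  - $n_{ij}$ is the number of objects common to groups $U_i$ and $V_j$ (elements of the contingency matrix).
  - $a_i$ is the sum of the elements in the i-th row of the contingency matrix ($a_i = \sum_j n_{ij}$).
  - $b_j$ is the sum of the elements in the j-th column of the contingency matrix ($b_j = \sum_i n_{ij}$).
  - $n$ is the total number of samples in the dataset.
  - $\binom{n}{2}$ is the number of all possible pairs that can be formed from n elements.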
- Matthews Correlation Coefficient (MCC): measures the quality of binary classification; it takes into account the full confusion matrix and is defined as

  $$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

  where:
  - $TP$ is the number of true positives, when the model correctly predicts the positive class.
  - $TN$ is the number of true negatives, when the model correctly predicts the negative class.
  - $FP$ is the number of false positives (type I errors), when the model incorrectly predicts the positive class.
  - $FN$ is the number of false negatives (type II errors), when the model incorrectly predicts the negative class.

  This coefficient takes values between −1 and 1, where −1 means complete disagreement, 0 means random assignment, and 1 means complete agreement [38].
- Error Rate: measures the frequency of incorrect predictions (the ratio of the number of incorrect decisions to the total number of decisions made by the model) [39] and is defined as

  $$\mathrm{Error\ Rate} = \frac{FP + FN}{TP + TN + FP + FN}.$$
- Normalized Mutual Information (NMI): measures the amount of information shared by two partitions of the data, normalized by the arithmetic mean of the entropies $H(U)$ and $H(V)$ [28], and is defined as

  $$\mathrm{NMI}(U, V) = \frac{2\, I(U; V)}{H(U) + H(V)},$$

  where:
  - $I(U; V)$ is the Mutual Information;
  - $H(U)$ and $H(V)$ are the Shannon entropies of distributions U and V.
- Maximum-Match Measure (MMM): measures the clustering accuracy under the optimal label assignment [40] and is defined as

  $$\mathrm{MMM} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left(y_i = \mathrm{map}(c_i)\right),$$

  where:
  - $y_i$ is the true label for the i-th sample.
  - $c_i$ is the cluster label obtained from the algorithm for the i-th sample.
  - $\mathrm{map}(\cdot)$ is a function mapping cluster labels to class labels, chosen to maximize the number of matches.
  - $\mathbb{1}(\cdot)$ is an indicator function that returns 1 when the condition is met and 0 if it is not.
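The five measures listed above can be computed directly from label vectors. The following pure-Python sketch is illustrative only and is not the evaluation code used in the study; all function names are hypothetical. Three conventions in it are assumptions: ARI and NMI return 1.0 for degenerate inputs where the denominator is zero, MCC returns 0.0 when its denominator is zero, and the optimal mapping in MMM is found by exhaustive search over one-to-one assignments, which is only practical for a small number of clusters (an assignment solver such as the Hungarian algorithm would be the usual choice otherwise).

```python
from collections import Counter
from itertools import permutations
from math import comb, log, sqrt


def adjusted_rand_index(u, v):
    """ARI from the contingency counts of two labelings (assumes len(u) >= 2)."""
    n = len(u)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(u, v)).values())
    sum_a = sum(comb(c, 2) for c in Counter(u).values())  # from row sums
    sum_b = sum(comb(c, 2) for c in Counter(v).values())  # from column sums
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:  # degenerate partitions; 1.0 is an assumed convention
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0  # 0.0 is an assumed convention


def error_rate(tp, tn, fp, fn):
    return (fp + fn) / (tp + tn + fp + fn)


def entropy(labels):
    """Shannon entropy (natural logarithm) of a labeling."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def nmi(u, v):
    """Mutual information normalized by the arithmetic mean of the entropies."""
    n = len(u)
    cu, cv = Counter(u), Counter(v)
    mi = sum(c / n * log(n * c / (cu[a] * cv[b]))
             for (a, b), c in Counter(zip(u, v)).items())
    total = entropy(u) + entropy(v)
    return 2 * mi / total if total else 1.0  # 1.0 is an assumed convention


def maximum_match_measure(y_true, y_pred):
    """Accuracy under the best one-to-one mapping of cluster labels to classes."""
    classes = list(dict.fromkeys(y_true))
    clusters = list(dict.fromkeys(y_pred))
    # Pad with None so surplus clusters can stay unmatched.
    classes += [None] * max(0, len(clusters) - len(classes))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        best = max(best, sum(mapping[c] == y for y, c in zip(y_true, y_pred)))
    return best / len(y_true)
```

ARI, NMI, and MMM are invariant to how the cluster labels are named, so they can be applied to raw clustering output. MCC and the error rate need cluster labels that have already been mapped to the positive and negative classes.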
3. Results
3.1. Artificial Datasets
3.2. Real-World Biological Data
3.3. Computational Time
4. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| ARI | Adjusted Rand Index |
| EM | Expectation–Maximization |
| GMMs | Gaussian Mixture Models |
| HT | High-Throughput |
| MAP | Maximum A Posteriori |
| MCC | Matthews Correlation Coefficient |
| MDGMM | Multivariate Diagonal Gaussian Mixture Model |
| MGMMs | Multivariate Gaussian Mixture Models |
| MMM | Maximum-Match Measure |
| NMI | Normalized Mutual Information |
| ssGSEA | single-sample Gene Set Enrichment Analysis |
Appendix A




References
1. Bouveyron, C.; Celeux, G.; Murphy, T.B.; Raftery, A.E. Model-Based Clustering and Classification for Data Science: With Applications in R; Cambridge University Press: Cambridge, UK, 2019.
2. Reynolds, D.A. Gaussian Mixture Models. Encycl. Biom. 2009, 741, 659–663.
3. Bouveyron, C.; Brunet, C. Model-Based Clustering of High-Dimensional Data: A review. Comput. Stat. Data Anal. 2013, 71, 52–78.
4. Frühwirth-Schnatter, S.; Celeux, G.; Robert, C.P. Handbook of Mixture Analysis; CRC Press: Boca Raton, FL, USA, 2019.
5. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–38.
6. Bilmes, J.A. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Int. Comput. Sci. Inst. 1998, 4, 126.
7. McLachlan, G.J.; Peel, D. Finite Mixture Models; John Wiley & Sons: New York, NY, USA, 2000.
8. McLachlan, G.J.; Krishnan, T. The EM Algorithm and Extensions; John Wiley & Sons: New York, NY, USA, 2007.
9. Zhang, H.; Zhao, L.; Yang, S.; Deng, Y.; Ouyang, Z. Fatigue evaluation of Orthotropic steel deck welds based on WIM data and UD-BP neural network. Structures 2025, 78, 109198.
10. Zyla, J.; Szumala, K.; Polanski, A.; Polanska, J.; Marczyk, M. dpGMM: A new R package for efficient and robust Gaussian mixture modeling of 1D and 2D data. J. Comput. Sci. 2026, 95, 102811.
11. Baudry, J.-P.; Celeux, G. EM for mixtures: Initialization requires special care. Stat. Comput. 2015, 25, 713–726.
12. Karlis, D.; Xekalaki, E. Choosing initial values for the EM algorithm for finite mixtures. Comput. Stat. Data Anal. 2003, 41, 577–590.
13. Biernacki, C. Initializing EM using the properties of its trajectories in Gaussian mixtures. Stat. Comput. 2004, 14, 267–279.
14. Polanski, A.; Marczyk, M.; Pietrowska, M.; Widlak, P.; Polanska, J. Signal partitioning algorithm for highly efficient Gaussian mixture modeling in mass spectrometry. PLoS ONE 2015, 10, e0134256.
15. Polański, A.; Marczyk, M.; Pietrowska, M.; Widłak, P.; Polańska, J. Initializing the EM algorithm for univariate Gaussian, multi-component, heteroscedastic mixture models by dynamic programming partitions. Int. J. Comput. Methods 2018, 15, 1850012.
16. Maitra, R. Initializing partition-optimization algorithms. IEEE/ACM Trans. Comput. Biol. Bioinform. 2009, 6, 144–157.
17. Biernacki, C.; Celeux, G.; Govaert, G. Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Comput. Stat. Data Anal. 2003, 41, 561–575.
18. Melnykov, V.; Melnykov, I. Initializing the EM algorithm in Gaussian mixture models with an unknown number of components. Comput. Stat. Data Anal. 2012, 56, 1381–1395.
19. O’Hagan, A.; Murphy, T.; Gormley, I. Computational aspects of fitting mixture models via the expectation-maximization algorithm. Comput. Stat. Data Anal. 2012, 56, 3843–3864.
20. Zhang, H.; Zhao, L.; Chen, F.; Luo, Y.; Xiao, X.; Liu, Y.; Deng, Y. A machine learning and multi-source authentic data-driven framework for accurate fatigue life prediction of welds in existing steel bridge decks. Thin-Walled Struct. 2026, 222, 114559.
21. Panić, B.; Simić, S.; Kovačević, M.; Panić, M. Improved initialization of the EM algorithm for mixture model parameter estimation. Mathematics 2020, 8, 373.
22. Ingrassia, S. A likelihood-based constrained algorithm for multivariate normal mixture models. Stat. Methods Appl. 2004, 13, 151–166.
23. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Oakland, CA, USA, 1967; pp. 281–297.
24. Arthur, D.; Vassilvitskii, S. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2007; pp. 1027–1035.
25. Murtagh, F.; Contreras, P. Algorithms for hierarchical clustering: An overview. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2012, 2, 86–97.
26. Ward, J.H., Jr. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 1963, 58, 236–244.
27. Fred, A.L.N. Finding consistent clusters in data partitions. In Multiple Classifier Systems; Kittler, J., Roli, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2001; pp. 309–318.
28. Fred, A.L.N.; Jain, A.K. Robust data clustering: A combiner approach. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Madison, WI, USA, 2003; pp. 128–136.
29. Mouselimis, L. R Package, version 1.3.6; ClusterR: Gaussian Mixture Models, K-Means, Mini-Batch-Kmeans, K-Medoids and Affinity Propagation Clustering; CRAN: Vienna, Austria, 2025.
30. Sanderson, C.; Curtin, R. Armadillo: A template-based C++ library for linear algebra. J. Open Source Softw. 2016, 1, 26.
31. Widłak, W. High-Throughput Technologies in Molecular Biology. In Molecular Biology. Lecture Notes in Computer Science; Widłak, W., Ed.; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8248.
32. Geistlinger, L.; Csaba, G.; Santarelli, M.; Ramos, M.; Schiffer, L.; Turaga, N.; Waldron, L. Toward a gold standard for benchmarking gene set enrichment analysis. Brief. Bioinform. 2021, 22, 545–556.
33. Barbie, D.A.; Tamayo, P.; Boehm, J.S.; Kim, S.Y.; Moody, S.E.; Dunn, I.F.; Hahn, W.C. Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature 2009, 462, 108–112.
34. Kanehisa, M.; Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000, 28, 27–30.
35. Jassal, B.; Matthews, L.; Viteri, G.; Gong, C.; Lorente, P.; Fabregat, A.; D’Eustachio, P. The reactome pathway knowledgebase. Nucleic Acids Res. 2020, 48, D498–D503.
36. Santos, J.M.; Embrechts, M. On the use of the adjusted Rand index as a metric for evaluating supervised classification. In Proceedings of the International Conference on Artificial Neural Networks; Springer: Berlin/Heidelberg, Germany, 2009; pp. 175–184.
37. Hubert, L.; Arabie, P. Comparing partitions. J. Classif. 1985, 2, 193–218.
38. Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6.
39. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2009.
40. Cai, D.; He, X.; Han, J. Document clustering using locality preserving indexing. IEEE Trans. Knowl. Data Eng. 2005, 17, 1624–1637.
41. Kruskal, W.H.; Wallis, W.A. Use of ranks in one-criterion variance analysis. J. Am. Stat. Assoc. 1952, 47, 583–621.
42. Conover, W.J. Practical Nonparametric Statistics, 3rd ed.; John Wiley & Sons: New York, NY, USA, 1999.
43. Jonckheere, A.R. A distribution-free k-sample test against ordered alternatives. Biometrika 1954, 41, 133–145.




| ID | HT Platform | Disease | Sample Size [Control/Case] | No. of Features |
|---|---|---|---|---|
| GSE15471 | Microarray | Pancreatic Cancer | 70 [35/35] | 17,656 |
| GSE16515 | Microarray | Pancreatic Cancer | 30 [15/15] | 17,656 |
| GSE18842 | Microarray | NSC Lung Cancer | 88 [44/44] | 17,656 |
| GSE19188 | Microarray | NSC Lung Cancer | 153 [62/91] | 17,656 |
| GSE19728 | Microarray | Astrocytoma | 21 [4/17] | 17,656 |
| GSE5281-HIP | Microarray | Alzheimer’s | 23 [13/10] | 17,656 |
| TCGA-BRCA | RNA-Seq | Breast Cancer | 226 [113/113] | 12,112 |
| TCGA-COAD | RNA-Seq | Colorectal Cancer | 82 [41/41] | 11,876 |
| TCGA-HNSC | RNA-Seq | HNSCC | 86 [43/43] | 11,456 |
| TCGA-KIRC | RNA-Seq | Kidney Cancer | 144 [72/72] | 12,049 |
| TCGA-LIHC | RNA-Seq | Liver Cancer | 100 [50/50] | 10,468 |
| TCGA-LUAD | RNA-Seq | Lung Adenocarcinoma | 116 [58/58] | 12,081 |
| TCGA-LUSC | RNA-Seq | Lung SCC | 102 [51/51] | 12,100 |
| TCGA-PRAD | RNA-Seq | Prostate Cancer | 104 [52/52] | 12,004 |
| TCGA-THCA | RNA-Seq | Thyroid Cancer | 118 [59/59] | 11,747 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Radwan, E.; Kania, M.; Widzisz, K.; Zyla, J.; Szczęsna, A.; Polański, A. Comparison of Initialization Strategies for EM in High-Dimensional Multivariate Diagonal Gaussian Mixture Models. Appl. Sci. 2026, 16, 5427. https://doi.org/10.3390/app16115427

