A Clustering Approach for Rare Variant Classification by Effect Direction and Magnitude
Abstract
1. Introduction
2. Methods
2.1. Association Testing of Rare Variants
2.2. Multiple-Variant Model to Obtain Variant-Level Summary Statistics
2.3. Gaussian Mixture Model
2.4. Parameter Estimation by Expectation–Maximization (EM) Algorithm
2.4.1. Initialization
2.4.2. Expectation Step
2.4.3. Maximization Step
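The initialization, expectation, and maximization steps named in Sections 2.4.1–2.4.3 follow the general EM pattern for a Gaussian mixture. The sketch below is a textbook EM loop for a one-dimensional Gaussian mixture, not the gvClust implementation: the function names (`normal_pdf`, `em_gmm_1d`), the starting values, the convergence tolerance, and the toy data (three groups with means in a −1:0:1 ratio) are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_gmm_1d(x, means, sds, weights, n_iter=200, tol=1e-8):
    """Generic EM for a 1-D Gaussian mixture, started from user-supplied values."""
    k = len(means)
    means, sds, weights = list(means), list(sds), list(weights)
    prev_ll = -math.inf
    for _ in range(n_iter):
        # E-step: posterior probability that each observation belongs to each component.
        resp, ll = [], 0.0
        for xi in x:
            dens = [weights[j] * normal_pdf(xi, means[j], sds[j]) for j in range(k)]
            total = max(sum(dens), 1e-300)  # guard against underflow
            ll += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: update mixing proportions, means, and standard deviations.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            denom = max(nj, 1e-12)  # guard against an empty component
            weights[j] = nj / len(x)
            means[j] = sum(r[j] * xi for r, xi in zip(resp, x)) / denom
            var = sum(r[j] * (xi - means[j]) ** 2 for r, xi in zip(resp, x)) / denom
            sds[j] = math.sqrt(max(var, 1e-12))
        # Stop when the log-likelihood no longer improves noticeably.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    # Hard assignment from the responsibilities of the last E-step.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, sds, weights, labels


# Toy data (hypothetical): 100 draws around each of three group means.
rng = random.Random(1)
true_means = [-1.0, 0.0, 1.0]
x = [rng.gauss(m, 0.3) for m in true_means for _ in range(100)]
means, sds, weights, labels = em_gmm_1d(x, [-0.5, 0.1, 0.5], [0.5] * 3, [1 / 3] * 3)
```

In this toy setting the fitted means should land close to the generating values, and the mixing proportions sum to one by construction. A method-specific variant could constrain some parameters (for example, fixing one component mean at zero), but that is outside what this sketch shows.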
2.5. A Simulation Study
- (i) k-means clustering, a widely used baseline unsupervised clustering method;
- (ii) gvClust-initial, which uses the same initialization procedure as gvClust but skips the subsequent expectation–maximization (EM) updates (i.e., clustering based solely on initialization).

We compare these methods under each of the six scenarios for a continuous trait in the absence of LD and with N = 25,000 individuals.
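As a rough illustration of the k-means baseline, the sketch below runs Lloyd's algorithm on one-dimensional effect estimates and scores the result against known group labels. The adjusted Rand index is used here as one common agreement measure; choosing it, along with the function names (`kmeans_1d`, `adjusted_rand_index`), the random initialization, and the toy data (two groups with effects in a −1:1 ratio), is an assumption for illustration and not a description of the study's code.

```python
import random
from collections import Counter
from math import comb


def kmeans_1d(x, k, n_iter=100, seed=0):
    """Lloyd's algorithm on scalar data with centers initialized from random points."""
    rng = random.Random(seed)
    centers = rng.sample(x, k)
    labels = [0] * len(x)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        new_labels = [min(range(k), key=lambda j: (xi - centers[j]) ** 2) for xi in x]
        # Move each center to the mean of its members (skip empty clusters).
        for j in range(k):
            members = [xi for xi, lab in zip(x, new_labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two labelings of the same items."""
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(len(a), 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


# Toy data (hypothetical): 50 variants per group, effects centered at -1 and +1.
rng = random.Random(2)
truth = [0] * 50 + [1] * 50
estimates = [rng.gauss(-1.0 if g == 0 else 1.0, 0.2) for g in truth]
pred, centers = kmeans_1d(estimates, k=2)
ari = adjusted_rand_index(truth, pred)
```

Because cluster labels are arbitrary, a label-invariant score such as the adjusted Rand index is convenient when comparing a clustering against simulated ground truth.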
2.6. Application to Exome-Wide Association and a Rare-Variant GWAS of Blood Pressure Traits
3. Results
3.1. Simulation Studies
3.1.1. Simulations Without LD Structure
3.1.2. Comparison of Simulations with and Without LD
3.2. Application to a GWAS on BP Traits with Rare Variants
3.2.1. Identification of Blood Pressure Trait-Associated Genes
3.2.2. Annotation of Rare Variants Within Blood Pressure Trait-Associated Genes
3.2.3. Clustering of Rare Variants
3.3. Computational Burden
4. Discussion
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest



| Scenario | Number of Groups of Rare Variants | Ratio of True Beta Effects Between Groups | Proportions of Rare Variants Among Groups |
|---|---|---|---|
| Scenario 1 | 3 | 0:1:2 | |
| Scenario 2 | 3 | −1:0:1 | |
| Scenario 3 | 4 | −1:0:1:2 | |
| Scenario 4 | 5 | −2:−1:0:1:2 | |
| Scenario 5 | 2 | 1:2 | |
| Scenario 6 | 2 | −1:1 | |
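The scenario definitions in the table above can be written down as a small lookup and used to draw group memberships and true effects. This is a minimal sketch under stated assumptions: the group proportions are not recoverable from the table, so equal proportions are used as a placeholder, and the names `SCENARIOS`, `draw_true_effects`, and `unit_effect` (with its value of 0.1) are hypothetical.

```python
import random

# Effect ratios copied from the scenario table; proportions are not given there.
SCENARIOS = {
    1: [0, 1, 2],
    2: [-1, 0, 1],
    3: [-1, 0, 1, 2],
    4: [-2, -1, 0, 1, 2],
    5: [1, 2],
    6: [-1, 1],
}


def draw_true_effects(scenario, n_variants, unit_effect=0.1, proportions=None, seed=0):
    """Assign each variant to a group and give it that group's true effect.

    unit_effect scales the ratios to an effect size; its default is an assumption.
    """
    ratios = SCENARIOS[scenario]
    if proportions is None:
        # Assumption: equal group proportions, used only as a placeholder.
        proportions = [1 / len(ratios)] * len(ratios)
    rng = random.Random(seed)
    groups = rng.choices(range(len(ratios)), weights=proportions, k=n_variants)
    betas = [ratios[g] * unit_effect for g in groups]
    return groups, betas


groups, betas = draw_true_effects(scenario=4, n_variants=50)
```

Passing an explicit `proportions` list lets the same helper reproduce unequal group sizes once the intended values are known.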

| Scenario | N = 5,000 | N = 15,000 | N = 25,000 |
|---|---|---|---|
| Without LD | | | |
| Scenario 1 | 0.0102 | 0.00329 | 0.00169 |
| Scenario 2 | 0.0102 | 0.00221 | 0.000868 |
| Scenario 3 | 0.0124 | 0.00366 | 0.00179 |
| Scenario 4 | 0.0127 | 0.00461 | 0.00228 |
| Scenario 5 | 0.00535 | 0.0024 | 0.00141 |
| Scenario 6 | 0.00574 | 0.000633 | 0.000143 |
| With LD | | | |
| Scenario 1 | 0.0102 | 0.00429 | 0.00225 |
| Scenario 2 | 0.0118 | 0.00226 | 7.00 × 10^−4 |
| Scenario 3 | 0.014 | 0.00446 | 0.0022 |
| Scenario 4 | 0.0148 | 0.00557 | 0.00268 |
| Scenario 5 | 0.00405 | 0.00282 | 0.00177 |
| Scenario 6 | 0.0061 | 0.00042 | 0.000116 |

| Trait | # Variants | # Clusters | Mu | Phi | # Variants in Clusters | p-Value ANOVA | p-Value MWU |
|---|---|---|---|---|---|---|---|
| SBP | 58 | 3 | 0/−0.0826/0.0573 | 0.719/0.178/0.102 | 46/8/4 | 4.26 × 10^−18 | 0.559 |
| DBP | 45 | 3 | 0/−0.0924/0.0471 | 0.718/0.165/0.117 | 37/5/3 | 1.68 × 10^−11 | 0.338 |
| PP | 55 | 5 | 0/−0.0438/−0.0968/0.334/0.173 | 0.673/0.134/0.0654/0.0313/0.0966 | 43/4/2/1/5 | 1.39 × 10^−19 | 0.976 |
| HTN | 59 | 3 | 0/−2.785/4.646 | 0.802/0.147/0.0516 | 48/8/3 | 3.66 × 10^−16 | 0.32 |

| Gene | Chr | Start (bp) | End (bp) | # Variants | # Clusters | Mu | Phi | # Variants in Clusters | p-Value ANOVA | p-Value MWU |
|---|---|---|---|---|---|---|---|---|---|---|
| DBH | 9 | 136501569 | 136523555 | 27 | 2 | 0/−0.082 | 0.75/0.25 | 21/6 | 5.18 × 10^−8 | 0.785 |
| NPR1 | 1 | 153652129 | 153665650 | 13 | 3 | 0/−0.0846/0.164 | 0.727/0.198/0.0753 | 10/2/1 | 6.02 × 10^−5 | 0.322 |
| PLCB3 | 11 | 64021930 | 64034975 | 18 | 2 | 0/0.0529 | 0.716/0.284 | 15/3 | 2.33 × 10^−5 | 0.934 |

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Sun, X.; Liu, X.; Cao, Y.; Liu, C. A Clustering Approach for Rare Variant Classification by Effect Direction and Magnitude. Algorithms 2026, 19, 426. https://doi.org/10.3390/a19060426

