Tracking Down the Evolution of Microorganisms by Exhaustive Bottom-Up Analysis of Proteomes
Abstract
1. Introduction
2. Results
2.1. Hierarchical Clustering and Construction of Phylogenetic Tree
- k ∈ {1, 2, 3};
- Distance metrics L1, L2, corr(p, q), and inter_a(p, q) for a ∈ {0.1, 0.05};
- Protein groups: whole proteome, membrane, non-membrane, nucleotide-binding, non-nucleotide-binding, ribosomal;
- Amino acid alphabet of 20 symbols or its reduction to 5;
- “Single”, “average” and “complete” clustering linkage parameter values.
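The options listed above define a parameter grid. As a minimal sketch, assuming every option is crossed with every other (the excerpt does not state this explicitly), the combinations can be enumerated as follows. The dictionary keys and the function name are hypothetical, chosen only for this illustration.

```python
from itertools import product

# Option values copied from the list above; the dictionary keys are
# hypothetical names chosen for this sketch.
PARAMETER_GRID = {
    "k": [1, 2, 3],
    "metric": ["L1", "L2", "corr", "inter0.1", "inter0.05"],
    "protein_group": [
        "whole proteome",
        "membrane",
        "non-membrane",
        "nucleotide-binding",
        "non-nucleotide-binding",
        "ribosomal",
    ],
    "alphabet_size": [20, 5],
    "linkage": ["single", "average", "complete"],
}


def parameter_combinations(grid):
    """Yield one dict per combination, assuming the options are fully crossed."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


all_combinations = list(parameter_combinations(PARAMETER_GRID))
print(len(all_combinations))  # 3 * 5 * 6 * 2 * 3 = 540
```

Under the full-crossing assumption this gives 540 clustering runs, one per combination.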
2.2. Non-Hierarchical Clustering
2.3. The Set of Most Abundant k-Mers
2.4. Clustering of Protein Groups
3. Discussion
4. Materials and Methods
4.1. Representation of Proteins and Proteomes
4.2. Proteome Dataset and Its Processing
- Aliphatic: Alanine, Isoleucine, Leucine, Methionine, Proline, Valine.
- Aromatic: Phenylalanine, Tryptophan, Tyrosine.
- Uncharged: Cysteine, Glycine, Asparagine, Glutamine, Serine, Threonine.
- Positively charged: Histidine, Lysine, Arginine.
- Negatively charged: Aspartic acid, Glutamic acid.
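The five classes above can be read as a mapping from the 20-letter amino acid alphabet to a 5-letter one. The sketch below illustrates that mapping together with a normalized k-mer frequency vector. The one-letter class codes, the function names, and the choice to skip non-standard residues are assumptions made for this illustration, not details taken from the article.

```python
from collections import Counter
from itertools import product

# Class membership follows the list above (standard one-letter residue codes).
# The class codes A, R, U, P, N are hypothetical labels for this sketch.
CLASSES = {
    "A": "AILMPV",  # aliphatic
    "R": "FWY",     # aromatic
    "U": "CGNQST",  # uncharged
    "P": "HKR",     # positively charged
    "N": "DE",      # negatively charged
}
REDUCE = {aa: code for code, members in CLASSES.items() for aa in members}


def reduce_sequence(seq):
    """Recode a protein sequence into the 5-letter alphabet.

    Assumption: residues outside the 20 standard amino acids are skipped.
    """
    return "".join(REDUCE[aa] for aa in seq if aa in REDUCE)


def kmer_frequencies(seq, k, alphabet="".join(sorted(CLASSES))):
    """Return the relative frequency of every possible k-mer over `alphabet`."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {
        "".join(kmer): (counts["".join(kmer)] / total if total else 0.0)
        for kmer in product(alphabet, repeat=k)
    }


reduced = reduce_sequence("MKVLAD")
print(reduced)                          # APAAAN
print(len(kmer_frequencies(reduced, 2)))  # 25 possible 2-mers
```

With the reduced alphabet a k-mer vector has 5^k components instead of 20^k, so for k = 3 the vector shrinks from 8000 to 125 entries.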
4.3. Clustering Algorithms
4.4. Clustering Assessment
4.5. Clustering Feasibility
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest







| k | Mean Rand | Alphabet | Mean Rand | Protein Group | Mean Rand | Metric | Mean Rand | Linkage | Mean Rand |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.9649 ± 0.0043 | 20 symbols | 0.9706 ± 0.0067 | all | 0.9668 ± 0.0054 | L1 | 0.9712 ± 0.0055 | single | 0.9668 ± 0.0061 |
| 2 | 0.9670 ± 0.0058 | 5 symbols | 0.9640 ± 0.0049 | membrane | 0.9679 ± 0.0057 | L2 | 0.9693 ± 0.0051 | average | 0.9680 ± 0.0073 |
| 3 | 0.9701 ± 0.0082 | | | non-membrane | 0.9663 ± 0.0048 | corr | 0.9691 ± 0.0056 | complete | 0.9677 ± 0.0068 |
| | | | | nucleotide-binding | 0.9676 ± 0.0068 | inter0.05 | 0.9629 ± 0.0063 | | |
| | | | | non-nucleotide-binding | 0.9666 ± 0.0056 | inter0.1 | 0.9637 ± 0.0074 | | |
| | | | | ribosomal | 0.9700 ± 0.0103 | | | | |
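The Rand values in the table above score agreement between two partitions of the same objects. As a minimal sketch of the unadjusted Rand index (Rand, 1971), the function below counts the fraction of object pairs that both partitions treat the same way: together in both, or apart in both. The function name and the toy labels are hypothetical, and this excerpt does not state which reference partition the authors compared against.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Unadjusted Rand index between two labelings of the same objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # fewer than two objects: nothing to disagree on
    agreeing = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreeing / len(pairs)


# Toy example: cluster names differ but the grouping is identical.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# One object moved to the other cluster: 3 of 6 pairs still agree.
print(rand_index([0, 0, 1, 1], [0, 1, 1, 1]))  # 0.5
```

The index depends only on which objects are grouped together, not on the cluster labels themselves, which is why the first toy comparison scores 1.0.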
| # | Rand | k | Alphabet | Protein Group | Metric | Linkage |
|---|---|---|---|---|---|---|
| 1 | 0.9937 | 3 | 20 symbols | ribosomal | corr | average |
| 2 | 0.9929 | 3 | 20 symbols | ribosomal | L1 | average |
| 3 | 0.9912 | 3 | 20 symbols | ribosomal | L2 | average |
| 4 | 0.9902 | 3 | 20 symbols | ribosomal | L1 | complete |
| 5 | 0.9899 | 3 | 20 symbols | ribosomal | inter0.1 | average |
| 1 | 0.9776 | 3 | 20 symbols | all | L1 | average |
| 2 | 0.9762 | 3 | 20 symbols | all | inter0.1 | average |
| 3 | 0.9748 | 3 | 20 symbols | all | inter0.1 | complete |
| 4 | 0.9743 | 2 | 20 symbols | all | L1 | average |
| 5 | 0.9737 | 3 | 20 symbols | all | corr | complete |
| | | 2 | 20 symbols | all | L2 | average |
| Protein Group | Number (Out of 800) |
|---|---|
| All | 226 |
| Membrane | 261 |
| Non-membrane | 191 |
| Nucleotide-binding | 155 |
| Non-nucleotide-binding | 228 |
| Ribosomal | 56 |
| Protein Set | Average Within-Group Distance |
|---|---|
| All | 1.893 |
| Membrane | 1.861 |
| Non-membrane | 1.898 |
| Nucleotide-binding | 1.889 |
| Non-nucleotide-binding | 1.892 |
| Ribosomal | 1.932 |
| Non-ribosomal | 1.891 |
| | All | Membrane | Non-Membrane | Nucleotide-Binding | Non-Nucleotide-Binding | Ribosomal | Non-Ribosomal |
|---|---|---|---|---|---|---|---|
| All | — | 1.883 | 1.895 | 1.893 | 1.892 | 1.926 | 1.891 |
| Membrane | 1.883 | — | 1.891 | 1.890 | 1.882 | 1.927 | 1.882 |
| Non-membrane | 1.895 | 1.891 | — | 1.894 | 1.895 | 1.926 | 1.894 |
| Nucleotide-binding | 1.893 | 1.890 | 1.894 | — | 1.895 | 1.919 | 1.893 |
| Non-nucleotide-binding | 1.892 | 1.882 | 1.895 | 1.895 | — | 1.927 | 1.891 |
| Ribosomal | 1.926 | 1.927 | 1.926 | 1.919 | 1.927 | — | 1.927 |
| Non-ribosomal | 1.891 | 1.882 | 1.894 | 1.893 | 1.891 | 1.927 | — |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Kostenko, D.O.; Bogatyreva, N.S.; Fedorov, A.N. Tracking Down the Evolution of Microorganisms by Exhaustive Bottom-Up Analysis of Proteomes. Int. J. Mol. Sci. 2026, 27, 109. https://doi.org/10.3390/ijms27010109

