Whole-Genome k-mer Topic Modeling Associates Bacterial Families
Abstract
1. Introduction
2. Methods
2.1. Corpus & Bacterial Families
2.2. Topic Model
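- Choose $N \sim \mathrm{Poisson}(\xi)$.
- Choose $\theta \sim \mathrm{Dir}(\alpha)$.
- For each of the $N$ words $w_n$:
  - Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$.
  - Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$.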
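The generative steps listed above can be sketched in Python as a small simulation. This is an illustrative sketch only, not the authors' implementation. It assumes the standard latent Dirichlet allocation notation (document length mean `xi`, Dirichlet parameter `alpha`, per-document topic mixture `theta`, topic-word matrix `beta`), and it assumes that "words" are k-mers over the alphabet ACGT. The function name `sample_document`, the toy k-mer length of 2, the three topics, and all parameter values are hypothetical choices made for the example.

```python
import itertools

import numpy as np


def sample_document(alpha, beta, xi, rng):
    """Draw one document from the LDA generative process.

    alpha: Dirichlet parameter, shape (n_topics,)
    beta:  topic-word probabilities, shape (n_topics, vocab_size), rows sum to 1
    xi:    mean of the Poisson distribution over document length
    rng:   a numpy.random.Generator
    """
    n_topics, vocab_size = beta.shape
    n_words = rng.poisson(xi)                  # N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)               # theta ~ Dir(alpha)
    # For each word position, pick a topic z_n ~ Multinomial(theta).
    topics = rng.choice(n_topics, size=n_words, p=theta)
    # Then pick the word w_n from the word distribution of topic z_n.
    words = np.array(
        [rng.choice(vocab_size, p=beta[z]) for z in topics], dtype=int
    )
    return theta, topics, words


# Toy vocabulary: all k-mers of length 2 over ACGT (assumed for illustration).
K_MER_LENGTH = 2
vocabulary = ["".join(p) for p in itertools.product("ACGT", repeat=K_MER_LENGTH)]

rng = np.random.default_rng(0)
n_topics = 3                                   # hypothetical number of topics
alpha = np.full(n_topics, 0.5)
# One random word distribution per topic, drawn from a flat Dirichlet.
beta = rng.dirichlet(np.ones(len(vocabulary)), size=n_topics)

theta, topics, words = sample_document(alpha, beta, xi=50, rng=rng)
kmers = [vocabulary[w] for w in words]
print("topic mixture:", np.round(theta, 3))
print("first k-mers:", kmers[:10])
```

Fitting a topic model runs this process in reverse: given the observed k-mer counts of each genome, inference estimates the topic mixtures and the topic-word distributions that most plausibly generated them.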
3. Results and Discussion
4. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Accession No. | Family | Organism | Genome Size (bp) |
---|---|---|---|
AE001273.1 | Chlamydiales | Chlamydia trachomatis D/UW-3/CX | 1,042,519 |
AE002160.2 | Chlamydiales | Chlamydia muridarum Nigg | 1,072,950 |
AE009440.1 | Chlamydiales | Chlamydophila pneumoniae TW-183 | 1,225,935 |
AE015925.1 | Chlamydiales | Chlamydophila caviae GPIC | 1,173,390 |
AP006861.1 | Chlamydiales | Chlamydia felis Fe/C-56 | 1,166,239 |
CP002549.1 | Chlamydiales | Chlamydophila psittaci 6BC | 1,171,660 |
CP002608.1 | Chlamydiales | Chlamydophila pecorum E58 | 1,106,197 |
CP006571.1 | Chlamydiales | Chlamydia avium 10DC88 | 1,041,170 |
CP015840.1 | Chlamydiales | Chlamydia gallinacea 08-1274/3 | 1,059,583 |
CR848038.1 | Chlamydiales | Chlamydophila abortus strain S26/3 | 1,144,377 |
BA000031.2 | Vibrionaceae | Vibrio parahaemolyticus RIMD 2210633 | 3,288,558 |
BA000037.2 | Vibrionaceae | Vibrio vulnificus YJ016 | 3,354,505 |
CP000020.2 | Vibrionaceae | Vibrio fischeri ES114 | 2,897,536 |
CP000626.1 | Vibrionaceae | Vibrio cholerae O395 | 1,108,250 |
CP000789.1 | Vibrionaceae | Vibrio harveyi ATCC BAA-1116 | 3,765,351 |
CP002284.1 | Vibrionaceae | Vibrio anguillarum 775 | 3,063,912 |
CP002377.1 | Vibrionaceae | Vibrio furnissii NCTC 11218 | 3,294,546 |
CR354531.1 | Vibrionaceae | Photobacterium profundum SS9 | 4,085,304 |
FM178379.1 | Vibrionaceae | Aliivibrio salmonicida LFI1238 | 3,325,165 |
FM954972.2 | Vibrionaceae | Vibrio splendidus LGP32 | 3,299,303 |
AL590842.1 | Yersiniaceae | Yersinia pestis CO92 | 4,653,728 |
CP000720.1 | Yersiniaceae | Yersinia pseudotuberculosis IP 31758 | 4,723,306 |
CP000826.1 | Yersiniaceae | Serratia proteamaculans 568 | 5,448,853 |
CP002505.1 | Yersiniaceae | Rahnella sp. Y9602 | 4,864,217 |
CP002774.1 | Yersiniaceae | Serratia sp. AS12 | 5,443,009 |
CP006250.1 | Yersiniaceae | Serratia plymuthica 4Rx13 | 5,328,010 |
CP016940.1 | Yersiniaceae | Yersinia enterocolitica strain YE5 | 4,593,248 |
CP017236.1 | Yersiniaceae | Yersinia ruckeri strain QMA0440 isolate 14/0165-5k | 3,856,634 |
HG738868.1 | Yersiniaceae | Serratia marcescens SMB2099 | 5,123,091 |
LN890288.1 | Yersiniaceae | Serratia symbiotica strain STs | 650,317 |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Borrayo, E.; May-Canche, I.; Paredes, O.; Morales, J.A.; Romo-Vázquez, R.; Vélez-Pérez, H. Whole-Genome k-mer Topic Modeling Associates Bacterial Families. Genes 2020, 11, 197. https://doi.org/10.3390/genes11020197