Using Machine Learning to Predict Genes Underlying Differentiation of Multipartite and Unipartite Traits in Bacteria
Abstract
1. Introduction
2. Materials and Methods
2.1. Data Collection and Data Preparation
2.2. Performance Assessment of Machine Learning Methods
2.3. Generating Gene Sets of Differentially Present (or Absent) Genes Using the Genes Deemed Important in Discrimination by Machine Learning
2.4. Principal Component Analysis (PCA)
2.5. Determination of Gene Function Using eggNOG
3. Results and Discussion
3.1. Assessment of Performance of Different Machine Learning Algorithms
3.2. Principal Component Analysis
3.3. Gene Sets Derived from Genes That Consistently Ranked High across All Rounds of the Cross-Validation Gene Sets
3.4. Functional Annotation of Genes Obtained from All-Genes Approach and from Differentially Present Genes Approach Using eggNOG
4. Conclusions
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Rank | Most Important Features for Gene-Level Analysis |
---|---|
1 | transposase, IS4 family |
2 | MATE efflux family protein |
3 | dihydrodipicolinate synthase |
4 | metallo-beta-lactamase family protein |
5 | tRNA pseudouridine synthase A |
6 | transcriptional regulator, AraC family |
7 | dihydrodipicolinate reductase |
8 | glyoxalase family protein |
9 | efflux transporter, RND family, MFP subunit |
10 | OsmC-like protein |
11 | transcriptional regulator, MarR family |
12 | riboflavin biosynthesis protein RibF |
13 | tRNA pseudouridine synthase B |
14 | FeS assembly ATPase SufC |
15 | cyclic nucleotide-binding domain protein |
Rank | Most Important Features for Differentially Present Gene-Level Analysis |
---|---|
1 | transcriptional regulator |
2 | chemotaxis protein |
3 | sugar ABC transporter permease |
4 | flagellar M-ring protein FliF |
5 | acetolactate synthase |
6 | PAS domain-containing protein |
7 | sugar ABC transporter substrate-binding protein |
8 | porin |
9 | cell envelope biogenesis protein TolA |
10 | hybrid sensor histidine kinase/response regulator |
11 | GntR family transcriptional regulator |
12 | chemotaxis protein CheD |
13 | 2-C-methyl-D-erythritol 4-phosphate cytidylyltransferase |
14 | short-chain dehydrogenase |
15 | 2-C-methyl-D-erythritol 2,4-cyclodiphosphate synthase |
Set 1 | Set 2 | Set 3 |
---|---|---|
3-octaprenyl-4-hydroxybenzoate carboxy-lyase (H) | 10 kda chaperonin (O) | farnesyltranstransferase (H) |
30s ribosomal protein s17 (J) | acyl-coa dehydrogenase domain-containing protein (I) | maf-like protein (D) |
aminopeptidase p (E) | cyclic nucleotide-binding domain protein (T) | nad-dependent deacetylase (K) |
dihydrodipicolinate synthase (E) | cytochrome p450 family protein (Q) | outer membrane efflux protein (MU) |
prephenate dehydratase (E) | fad dependent oxidoreductase (Q) | prevent-host-death family protein (J) |
gtp-dependent nucleic acid-binding protein engd (J) | ribosomal large subunit pseudouridine synthase d (-) | |
inorganic polyphosphate/atp-nad kinase (G) | thymidine kinase (F) | |
mate efflux family protein (V) | transcriptional regulator, padr family (K) | |
nitroreductase family protein (C) | anthranilate synthase component I (E) | |
pii uridylyl-transferase (O) | cation diffusion facilitator family transporter (P) | |
protein of unknown function duf403 (S) | fad-dependent pyridine nucleotide-disulfide oxidoreductase (S) | |
ribosome biogenesis gtp-binding protein ysxc (D) | inositol monophosphatase family protein (G) | |
s1 rna binding domain protein (J) | preprotein translocase, sece subunit (U) | |
thiamine-phosphate kinase (H) | preprotein translocase, yajc subunit (U) | |
transporter, cpa2 family (PT) | transcriptional regulator, arac family (K) | |
universal stress family protein (T) | transcriptional regulator, merr family (K) | |
HPS_21876 (-) | transcriptional regulator, tetr family (K) | |
cupin domain protein (L) | trna pseudouridine synthase a (J) | |
efflux transporter, rnd family, mfp subunit (M) | protein of unknown function duf482 (-) | |
lysophospholipase l2 (I) | | |
transposase is3/is911 family protein (L) | | |
transposase, is4 family (L) | | |
Set 1 | Set 2 | Set 3 |
---|---|---|
cell division protein ftsz (D) | d-isomer-specific 2-hydroxyacid dehydrogenase, nad-binding protein (CH) | chloramphenicol acetyltransferase (V) |
cobalamin biosynthesis protein cbig (S) | glycine dehydrogenase subunit 2 (E) | dna protecting protein dpra (LU) |
diguanylate cyclase/phosphodiesterase (T) | hypothetical protein cp97_01065 (-) | flagellar protein flgj (MNO) |
extensin family protein (S) | hypothetical protein turpa_2028 (-) | gtp-binding protein hflx (S) |
gcra cell-cycle regulator (-) | outer membrane chaperone skp family protein (M) | lipoprotein, putative (S) |
invasion-associated locus b family protein (-) | preprotein translocase, secg subunit (U) | stage ii sporulation protein e (T) |
mgs domain protein (F) | protein of unknown function duf389 (S) | transcriptional regulator, crp/fnr family (K) |
ompa/motb domain protein (M) | single-stranded dna-binding protein-1 (L) | transcriptional regulator, merr family (K) |
protein of unknown function duf1127 (S) | tpr repeat domain protein (S) | transcriptional regulator, tetr family (K) |
putative metal-dependent protease (L) | twin-arginine translocation pathway signal domain protein (S) | hypothetical protein cp97_01545 (-) |
HPS_22 (-) | HPS_105 (-) | hypothetical protein cp97_01880 (-) |
| | HPS_107 (-) | hypothetical protein cp97_06070 (-) |
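Each entry in the two gene-set tables above ends with a parenthesized tag such as (J), (MU), or (-). Given the eggNOG annotation step named in Sections 2.5 and 3.4, these tags are presumably COG functional category letters; that reading is an assumption here, not something the tables state. Under that assumption, the following minimal sketch shows one way to tally categories for a gene set. The helper name `tally_categories` and the regular expression are illustrative and do not come from the paper; the input is Set 3 of the first gene-set table.

```python
import re
from collections import Counter

# Matches a trailing tag such as "(J)", "(MU)" or "(-)" at the end of an entry.
TAG = re.compile(r"\(([A-Z]+|-)\)\s*$")


def tally_categories(entries):
    """Count category letters across entries.

    A multi-letter tag such as (MU) is counted once for each letter.
    The "-" tag (no category assigned) is counted under the key "-".
    """
    counts = Counter()
    for entry in entries:
        match = TAG.search(entry)
        if match is None:
            continue  # entry carries no parenthesized tag
        tag = match.group(1)
        if tag == "-":
            counts["-"] += 1
        else:
            counts.update(tag)  # iterating a string adds one count per letter
    return counts


# Set 3 of the first gene-set table, copied as listed.
set3 = [
    "farnesyltranstransferase (H)",
    "maf-like protein (D)",
    "nad-dependent deacetylase (K)",
    "outer membrane efflux protein (MU)",
    "prevent-host-death family protein (J)",
]

counts = tally_categories(set3)
for letter, n in sorted(counts.items()):
    print(letter, n)
```

Counting a two-letter tag once per letter is a design choice; an alternative would be to treat (MU) as its own combined category, which keeps the per-gene total equal to the number of genes.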
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Almalki, F.; Sunuwar, J.; Azad, R.K. Using Machine Learning to Predict Genes Underlying Differentiation of Multipartite and Unipartite Traits in Bacteria. Microorganisms 2023, 11, 2756. https://doi.org/10.3390/microorganisms11112756