Genome-Wide Identification and Evolutionary Analyses of SrfA Operon Genes in Bacillus

A variety of secondary metabolites contributing to plant growth are synthesized by bacterial nonribosomal peptide synthases (NRPSs). Among them, the NRPS biosynthesis of surfactin is regulated by the SrfA operon. To explore the molecular mechanism for the diversity of surfactins produced by bacteria within the genus Bacillus, we performed a genome-wide identification study focused on three critical genes of the SrfA operon—SrfAA, SrfAB and SrfAC—from 999 Bacillus genomes (belonging to 47 species). Gene family clustering indicated the three genes can be divided into 66 orthologous groups (gene families), of which a majority comprised members of multiple genes (e.g., OG0000009 had members of all three SrfAA, SrfAB and SrfAC genes), indicating high sequence similarity among the three genes. Phylogenetic analyses also found that none of the three genes formed monophyletic groups, but were usually arranged in a mixed manner, suggesting the close evolutionary relationship among the three genes. Considering the module structure of the three genes, we propose that self-duplication, especially tandem duplications, might have contributed to the initial establishment of the entire SrfA operon, and further gene fusion and recombination as well as accumulated mutations might have continuously shaped the different functional roles of SrfAA, SrfAB and SrfAC. Overall, this study provides novel insight into metabolic gene clusters and operon evolution in bacteria.


Introduction
A nonribosomal peptide synthase (NRPS) is a multifunctional mega-enzyme that usually includes many modules and functions to assemble the peptide backbones of structurally diverse and biologically active natural products. It is well known that surfactin is one of the most important biosurfactants; it has the ability to efficiently decrease the surface tension of water and has good heat endurance and salt tolerance, meaning that it has great potential for use in enhanced oil recovery (EOR) [1][2][3]. The SrfA operon, consisting of four adjacent genes, SrfAA, SrfAB, SrfAC and SrfAD, is required for the production of surfactin [4]. Among these four genes, SrfAA and SrfAB are both composed of three functional modules, and SrfAC consists of only one module. Each module is comprised of an adenylation (A) domain, a thiolation (T) domain and a condensation (C) domain [5,6], and has the ability to incorporate a specific amino acid into a peptide backbone, so the order of the modules in assembly lines is always in accordance with the product peptide sequence. SrfAD contains a thioesterase (TE) domain, and is responsible for catalyzing the cyclization of peptide chains. Additionally, a few single-module NRPS-like enzymes that lack the important C domain have been reported in fungi and bacteria [7]. The A domain, C domain and T domain in the functional module have been associated with selective substrate activation, peptide bond formation and substrate shuttling in activation sites, respectively. They are vital for biosynthesis pathways, and the TE domain alone has been shown to be responsible for the release of compounds [8]. Thus, the SrfAA, SrfAB and SrfAC genes are considered as critical components in the SrfA operon.
Bacillus bacteria, known as a remarkable source of bioactive lipopeptides, are widely applied in the biosynthesis of a wealth of diverse compounds, including antibiotics, biosurfactants and antitumor agents via gene-cluster mining [9][10][11][12]. Bacillus bacteria produce a diversity of surfactins; however, it remains poorly understood how such a diversity is generated. In this study, we performed genome-wide identification of SrfAA, SrfAB and SrfAC genes from 999 Bacillus genomes, searched for conserved motifs in whole genes and conducted phylogenetic analyses to clarify the gene fusion of SrfAA, SrfAB and SrfAC genes in the SrfA operon, and proposed that self-duplication within the operon has been a driving force in the formation of the SrfA operon. Overall, our findings provide evidence that self-duplication could be an important mechanism in the formation of metabolic gene clusters and operon evolution in bacteria.

Data Source
A total of 999 Bacillus genomes (belonging to 47 species), including sequences and annotation files, were downloaded from the NCBI database (https://www.ncbi.nlm.nih. gov/, accessed on 7 June 2022). All data sources are shown in Table S1.

Genome-Wide Identification of SrfAA, SrfAB and SrfAC Genes
SrfAA.hmm was built using hundreds of representative SrfA gene sequences from various bacteria downloaded from the NCBI database, using a hidden Markov model build (HMMbuild). SrfAB.hmm and SrfAC.hmm were built in the same way (Supplementary Datasets S1-S6). Briefly, a two-step strategy was adopted to identify SrfAA genes. The first step was to perform an hmmsearch for protein sequences in each genome, with SrfAA.hmm as a query. Then, the remaining sequences were used to conduct a Blast search against representative SrfAA genes, using the amino acid sequences of every genome as a query for potential SrfAA genes. The threshold expectation value was set to 10 −4 for the BLAST search. The same strategy was adopted for the identification of SrfAB and SrfAC genes.

Classification of Orthologous Genes and Phylogenetic Analysis
Orthologous genes encoded in 999 Bacillus genomes were classified using OrthoFinder (version 2.2.7) [13] with default parameters. SrfAA, SrfAB and SrfAC genes were selected from orthologous groups (OGs) containing SrfAA, SrfAB and SrfAC genes. Amino acid sequences of target genes were used for multiple alignments using ClustalW integrated in MEGA 7.0 with default parameter settings [14]. Overly divergent sequences were removed to prevent interference with the alignments and subsequent phylogenetic inference. Then, the resulting alignments were manually corrected using MEGA 7.0 for further improvement. Phylogenetic analyses were conducted using IQ-TREE and the maximum likelihood method [15]. The best-fit model was estimated using ModelFinder [16]. Branch support values were assessed using UFBoot2 tests [17]. The scale bar in the model indicates genetic distance.

Conserved Motif Analysis
The conserved protein motifs in the whole SrfAA, SrfAB and SrfAC genes were analyzed using Multiple Expectation Maximization for Motif Elicitation (MEME) and WebLogo with the default settings [18,19].

Genome-Wide Identificaiton and Orthology Classification of SrfAA, SrfAB and SrfAC Genes
Altogether, 4680 SrfAA genes, 3522 SrfAB genes and 3558 SrfAC genes were identified from 999 Bacillus genomes using OrthoFinder. They were scattered across 66 orthologous groups (OGs) (Table 1), indicating that these genes can be classified into 66 evolutionary gene families and suggesting a great diversity within the SrfA operon. Different OGs contained varying numbers of genes, with two OGs (OG0000009 and OG0000006) accounting for nearly 50% (5864/11,760) of all identified genes; the 18 largest OGs contained 98.3% of the genes, and the remaining 48 OGs contained fewer than 15 genes each. SrfAA, SrfAB and SrfAC genes were distributed in 49, 45 and 35 OGs, respectively. Of the 66 OGs, 39 contained more than one gene type; 21 OGs contained SrfAA, SrfAB and SrfAC genes, 11 contained only SrfAA and SrfAB genes, 4 contained only SrfAA and SrfAC genes, and 3 contained only SrfAB and SrfAC genes (Supplementary Table S1). Such mixed compositions of genes suggest high sequence similarity between SrfAA, SrfAB and SrfAC genes, and implies potential intraspecies recombination within the same SrfA operon.
Among the OGs containing mixtures of gene members, the gene compositions were usually not balanced; sometimes SrfAA dominated the others, e.g., in OG0000009 and OG0000878, and sometimes SrfAB outnumbered the others, e.g., in OG0000006 and OG0000736. The largest family, OG0000009, had 1867 copies of SrfAA and 1269 copies of SrfAB, but had only 9 copies of SrfAC. Such discrepancy of gene numbers within OGs may suggest differential degrees of duplication events among species and/or specificity of different gene families.

Phylogenetic Analysis and Gene Duplications of SrfA Operon
Phylogenetic analyses were conducted on the 17 largest OGs containing members of all three gene families (OG0004815 was excluded because it had no SrfAC genes). In each of the constructed phylogenetic trees, it was observed that SrfAA, SrfAB and SrfAC genes often clustered with each other while seldom forming separate monophyletic branches for different gene types, suggesting a high identity of the sequences within the SrfA operon and close evolutionary relationships among different genes (Figure 1 and Supplementary  Figures S1-S17). These results are in accordance with the above-mentioned high sequence similarity among the SrfA operon genes. All evidence supports a hypothesis that the complete operon is probably derived from self-duplication of only one gene, and further module fusion and recombination may have led to the formation of the complete operon.  Gene duplication events within the SrfA operon and the corresponding mechanisms were therefore analyzed. By randomly investigating 10 genomes, it was observed that a complete SrfA operon usually comprises multiple copies of SrfAA, SrfAB and SrfAC genes at a close distance (fewer than 50 bp) to each other, forming a gene cluster (Figure 2, Supplementary Table S2 and Supplementary Figures S18-S26). These physically close genes are normally the results of tandem duplications, which have likely made the biggest contribution to the establishment of a complete operon; they do not result from segmental or ectopic duplications.

Conserved Motifs in SrfAA, SrfAB and SrfAC Genes
Although the three genes of the SrfA operon have very similar gene sequence, their functional roles are different and thus they can be expected to have evolved distinct important motifs that are critical to their differentiation of functions. To explore the critical motifs, we searched for conserved motifs in the amino acid sequences of SrfAA, SrfAB and SrfAC proteins. For each gene, the five most conserved motifs were identified and displayed using WebLogo. We found that some of these conserved motifs were shared by different genes, suggesting conserved functional roles in these regions. For instance, motif 1 was shared among all three genes, motif 4 was shared between SrfAA and SrfAB. There were conserved motifs specific to one gene, for instance, motif 2 showed absolutely different sequences among three genes, while motif 3 exhibited a homology among the three genes but had a low level of sequence similarity (Figure 3). Therefore, these divergent motifs may be derived from different regions or the results of different degrees of mutations among different genes, and should indicate functional divergence of the three genes.

Conserved Motifs in SrfAA, SrfAB and SrfAC Genes
Although the three genes of the SrfA operon have very similar gene sequence, their functional roles are different and thus they can be expected to have evolved distinct important motifs that are critical to their differentiation of functions. To explore the critical motifs, we searched for conserved motifs in the amino acid sequences of SrfAA, SrfAB and SrfAC proteins. For each gene, the five most conserved motifs were identified and displayed using WebLogo. We found that some of these conserved motifs were shared by different genes, suggesting conserved functional roles in these regions. For instance, motif 1 was shared among all three genes, motif 4 was shared between SrfAA and SrfAB. There were conserved motifs specific to one gene, for instance, motif 2 showed absolutely different sequences among three genes, while motif 3 exhibited a homology among the three genes but had a low level of sequence similarity (Figure 3). Therefore, these divergent motifs may be derived from different regions or the results of different degrees of mutations among different genes, and should indicate functional divergence of the three genes.

Discussion
Microorganisms produce a large variety of products, the majority of which are derived via biosynthesis assembly lines known as NRPSs. The SrfA operon comprises a number of NRPS genes producing diverse metabolic products with a broad application, e.g., surfactin can be used to break the long carbon chains found in crude oil, degrading it into small molecules and thereby improving oil recovery [20]. SrfAA, SrfAB and SrfAC genes harbor entire modules (C-A-T) or didomains (e.g., T-A, T-C), the order and specificity of which determine the amino acid sequence in final peptide products [21,22]. Using a total of 999 Bacillus genomes, we identified all SrfAA, SrfAB and SrfAC genes and performed comparative and phylogenetic analyses to understand how these genes have evolved in the genus. The usual way in which prokaryotes acquire new genes is through horizontal gene transfer [3,[23][24][25]. However, this study reported another potential mechanism-selfduplication, which does not require any novel genetic material from the environment or other organisms, and thus may be considered an alternative approach for genomic innovation.
Certainly, to equip a complete SrfA operon is a complicated process, and gene duplication is just the beginning. Duplicated gene copies need to fuse to form multimodule genes like SrfAA and SrfAB, which comprise three duplicated modules each, and different genes have to accumulate mutations to achieve functional divergence, allowing them to be functionally responsible for different metabolic products. Meanwhile, tandem duplications should result in high-density gene clusters, and the multiple gene copies of a cluster likely increase the expressional efficiency and improve the yield of final products. Recent studies have provided a body of evidence supporting the opinion that gene duplication serves as an important driving force in the evolution of operons in bacteria. For instance, gene duplication within the mamAB operon in Alphaproteobacteria, the EPS operon in

Discussion
Microorganisms produce a large variety of products, the majority of which are derived via biosynthesis assembly lines known as NRPSs. The SrfA operon comprises a number of NRPS genes producing diverse metabolic products with a broad application, e.g., surfactin can be used to break the long carbon chains found in crude oil, degrading it into small molecules and thereby improving oil recovery [20]. SrfAA, SrfAB and SrfAC genes harbor entire modules (C-A-T) or didomains (e.g., T-A, T-C), the order and specificity of which determine the amino acid sequence in final peptide products [21,22]. Using a total of 999 Bacillus genomes, we identified all SrfAA, SrfAB and SrfAC genes and performed comparative and phylogenetic analyses to understand how these genes have evolved in the genus. The usual way in which prokaryotes acquire new genes is through horizontal gene transfer [3,[23][24][25]. However, this study reported another potential mechanism-selfduplication, which does not require any novel genetic material from the environment or other organisms, and thus may be considered an alternative approach for genomic innovation.
Certainly, to equip a complete SrfA operon is a complicated process, and gene duplication is just the beginning. Duplicated gene copies need to fuse to form multimodule genes like SrfAA and SrfAB, which comprise three duplicated modules each, and different genes have to accumulate mutations to achieve functional divergence, allowing them to be functionally responsible for different metabolic products. Meanwhile, tandem duplications should result in high-density gene clusters, and the multiple gene copies of a cluster likely increase the expressional efficiency and improve the yield of final products. Recent studies have provided a body of evidence supporting the opinion that gene duplication serves as an important driving force in the evolution of operons in bacteria. For instance, gene duplication within the mamAB operon in Alphaproteobacteria, the EPS operon in Gramnegative bacteria, and PE and PPE gene duplication in the ppe-pe operon in Mycobacterium genus has resulted in functional diversification and increased complexity of biological pathways [26][27][28].
Among the studied Bacillus genomes, it is true that some species had a complete SrfA operon (the Bacillus albus operon comprised all four genes, SrfAA, SrfAB, SrfAC and SrfAD), and some had incomplete SrfA operons with certain genes missing (e.g., Bacillus beveridgei). However, a similar scenario can be observed within the same Bacillus species: certain stains/accessions of one Bacillus species had the complete SrfA operon (Bacillus altitudinis strain: Ba1449), but other strains/accessions of the same species possessed incomplete operons (Bacillus altitudinis strain: SCU11). This phenomenon reflects the highly active and complicated evolution of operons, which are subject to gene duplication, loss, rearrangement and horizontal gene transfer [29].
Moreover, a broad repertoire of synthetic compounds depends on module self-function and module-module interaction among the SrfAA, SrfAB and SrfAC genes. Whole-module deletions/insertions or individual domain changes in a module are likely to re-engineer the biosynthetic pathway, resulting in the generation of novel products. A previous study reported that to reorganize the molecular structure in gene clusters and to deliver new products, gene editing was applied to swap the conserved flavodoxin-like subdomain (FSD), which still preserved the overall structure of the A domain [30]. However, it is possible that deletion of domains was followed by module skipping [31]. To date, the engineering of biosynthesis pathways and production of novel peptides needs further exploration.
It is well known that the majority of the metabolic gene clusters in bacterial genomes are organized into operons and clusters, which probably benefits the co-regulation and co-expression of adjacent genes with related functions [32]. Additionally, gene order and structure in operons are vital to the products and synthetic efficiency of protein complexes [33]. Thus, complicated metabolic networks including regulatory checkpoints and cascades are dominant in bacterial metabolism. To sum up, our findings suggest another means by which genomic innovation can be achieved, and provide novel insight into bacterial genome evolution.

Data Availability Statement:
The datasets generated for this study can be found in the National Center for Biotechnology Information (NCBI) (https://www.ncbi.nlm.nih.gov/, accessed on 7 June 2022).

Conflicts of Interest:
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.