Pangenomics of the Symbiotic Rhizobiales. Core and Accessory Functions Across a Group Endowed with High Levels of Genomic Plasticity

Pangenome analyses reveal major clues on evolutionary instances and critical genome core conservation. The order Rhizobiales encompasses several families with rather disparate ecological attitudes. Among them, Rhizobiaceae, Bradyrhizobiaceae, Phyllobacteriacreae and Xanthobacteriaceae, include members proficient in mutualistic symbioses with plants based on the bacterial conversion of N2 into ammonia (nitrogen-fixation). The pangenome of 12 nitrogen-fixing plant symbionts of the Rhizobiales was analyzed yielding total 37,364 loci, with a core genome constituting 700 genes. The percentage of core genes averaged 10.2% over single genomes, and between 5% to 7% were found to be plasmid-associated. The comparison between a representative reference genome and the core genome subset, showed the core genome highly enriched in genes for macromolecule metabolism, ribosomal constituents and overall translation machinery, while membrane/periplasm-associated genes, and transport domains resulted under-represented. The analysis of protein functions revealed that between 1.7% and 4.9% of core proteins could putatively have different functions.


Introduction
With increasing availabilities of sequenced genomes in databases, pangenome-wide analyses have become a valuable tool to explore the breadth of genetic variability and capture the critical core-genes subset within a clade [1,2]. The boundaries of this comparison could be set at species level, in which a number of strains are analyzed, as in the cases of Escherichia coli [3], Sinorhizobium meliloti [4], Pseudomonas aeruginosa [5], or at higher rank level, as for the genus Clostridium [6] or Rhizobium [7]. Furthermore, comparisons can cross the hierarchical taxonomy and focus on common physiological/ecological features of distantly related organisms, as shown in the study of the predatory bacteria pangenome [8]. This functional group includes as many as 11 different lineages that occur throughout diverse unrelated habitats and has revealed interesting clues on the importance of gene acquisition processes in the adaptation of those species. In a recent study, the horizontal DNA transfer among 27 rhizobial genomes was analyzed and the resulting patterns concurred with the

Selected Bacterial Strains and Main Genomic Features
The dataset of species to be analyzed was chosen by using the following criteria: (i) available genomes that were selected all belong to the Rhizobiales order, and (ii) these organisms are nitrogen-fixing symbionts of plants. Under these constraints, among the existing possibilities, we selected 12 strains belonging to different rhizobia clades to be included in the present analysis. Their features are described hereafter.
Azorhizobium caulinodans ORS571; Family: Xanthobacteraceae; GenBank accession: GCA_000010525.1; genome size: 5,369,772 nucleotides [25]; number of plasmids: 0. Fast growing, isolated from Sesbania rostrata in the Sahel region of Africa [26]. A. caulonidans has the peculiarity of nodulating also on the stem, as the symbiont plant grows in habitats that are often submerged. Its host range is restricted to the host of isolation.
Bradyrhizobium japonicum USDA 6 T ; Family: Bradyrhizobiaceae; GenBank accession: GCA_000284375.1; genome size: 9,207,384 nucleotides [28]; number of plasmids: 0. Slow growing, isolated from soybean in Japan. The strain of B. japonicum USDA 6 T is a hostspecific symbiont that shares extended portions of genomic collinearity with B. japonicum, USDA 110 but a large region of genomic inversion. The nodulation occurs at the level of the root system. It is the type strain of the species.
Bradyrhizobium sp. BTAi1; Family: Bradyrhizobiaceae; GenBank accession: GCA_000015165.1; genome size: 8,493,513 nucleotides [29]; number of plasmids: 1. Slow growing, isolated from stem nodules of the aquatic legume Aeschinomene indica in North America. Bradyrhizobium sp. BTAi1 is capable of performing anoxygenic photosynthesis, denitrification, and it lacks the common nodulation genes which are shared by the vast majority of known rhizobia. The nodulation can occur at either roots or stems of specific plant hosts.
Bradyrhizobium sp. ORS278; Family: Bradyrhizobiaceae; GenBank accession: GCA_000026145.1; genome size: 7,456,587 nucleotides [29]; number of plasmids: 0. Slowgrowing, isolated from stem nodules of Aeschinomene sensitiva in Senegal, Africa. It shares the peculiarities described for strain BTAi1 (anoxygenic photosynthesis, denitrification, and lack of nod genes) to which is highly similar but lacks a large plasmid. Nodulation occurs on roots or stems of specific plant hosts.
Mesorhizobium loti MAFF303099 (also known as Mesorhizobium huakuii bv. loti) Family: Phyllobacteriaceae; Genbank Accession: GCA_000009625.1; genome size: 7,596,297 nucleotides [30]; number of plasmids: 0. Fast growing, isolated from Lotus japonicus in Japan. Nodulation on roots. The host range is wider than that described for the prior cases and includes the genera Lotus, Anthyllis and Lupinus.
Rhizobium etli CFN42 T Family: Rhizobiaceae; Genbank Accession: GCA_000092045.1; genome size: 6,530,228 nucleotides [32]; number of plasmids: 6. Fast growing; isolated from bean plants of the species Phaseolus vulgaris in Mexico. Nodulation on roots. The host range includes species within the single genus Phaseolus. It is the type strain of the species.
Rhizobium sullae IS123 T Family: Rhizobiaceae; Genbank Accession: GCA_900169785.1; genome size: 7,889,576 nucleotides [33]; number of plasmids: not specified. Fast-growing, isolated from wild stands of Sulla coronaria (formerly Hedysarum coronarium) in Spain. The host plant, widely diffused in the Mediterranean countries, has the peculiarity of existing still as native weed in wild conditions as well as agriculturally cropped ecotypes. Nodulation on roots; Host specific to H.coronarium. It is the type strain of the species.
Sinorhizobium fredii NGR234; Family: Rhizobiaceae; RefSeq Accession: GCF_000018545.1; genome size: 6,891,900 nucleotides [35]; number of plasmids: 2. Fast growing, isolated from Lablab purpureus in New Guinea. Nodulation on roots, this strain is able to establish a nitrogen fixing symbiosis with more than 112 plant genera including the non-legume Parasponia [36]. Currently no other rhizobial strain is known to exhibit such extremely wide host promiscuity.

Bioinformatics Analyses
Average nucleotide identity (ANI) and average amino acid identity (AAI) between Rhizobiales proteomes (i.e., predicted genes and proteins) were computed as previously described [38,39]. The pan-genome analysis of the references was performed using the software GET_HOMOLOGUES [40], implemented with the -M option which applies the orthoMCL algorithm to infer reciprocal homologies between genes. According to the results provided by such software, genes belonging to the strict-core [41] were separated by the others and analyzed in further depth. A phylogenomic tree was produced for the core gene set of the analyzed genomes using the PhyloPhlAn software tool, version 3.0 [42], with the Phylophlan database for broad bacteria comparisons.
A whole-genome orthologs search was performed to identify conserved functions across the different organisms using the Reciprocal Best BLAST Hit (RBBH, version 0.1.11) [43] setting an E-value cutoff of 1 × 10 −5 . Genes with a minimum of 60% identity and coverage higher than 60% were considered as orthologous candidates.
In addition to the aforementioned analyses, the core genome of Rhizobiales was analysed using a HMM (Hidden Markov Model) similarity search approach implemented in the bcgTree pipeline [44]. Genes which were confirmed as core genome components were aligned back to all genomes under study in order to score for conservation percentage.
The MAPLE resource (Metabolic and Physiological Potential Evaluator) [45] was used to estimate function abundance and evaluate metabolic and physiological potential [46]. The reference database was KEGG (Kyoto Encyclopedia of Genes and Genomes) [47], and proteins were mapped and normalized on the ribosomal proteins counts in its pathway database. KAAS (KEGG Automatic Annotation Server) was used for ortholog assignment (KO, KEGG Orthology) and pathway mapping [48] Functional divergence was assessed using DIVERGE 2.0 software [49]. Output data were parsed by ad hoc Perl scripts, while R provided statistical and graphic support.

Phylogenetic Relatedness Among the 12 Rhizobiales Strains
To frame the systematics position of the 12 pure isolates compared, ANI and AAI were computed using the whole genome information (Figure 1). The span of difference resulted from a minimum of 78% to a maximum of 99% ANI, which corresponded to minimum 50% to the maximum of 95% AAI. genomes, favoring the divergence in case of Bradhyrhizobium japonicum and promoting conservation in the case of Rhizobium sullae. This difference can be explained by the fact that the soybean host of B. japonicum was originally confined in far eastern Asia and its presence elsewhere, including America, is the result of man-managed cropping. On the contrary, both Italy and Spain are within the native home range of Sulla coronaria which is an Eurimediterranean plant.  Top ANI and AAI values were found between Rhizobium species, as well as between species belonging to the genus Bradyrhizobium. Highest ANI and AAI were found between R. sullae IS123 T and R. sullae WSM1592 (~99%,~95%), pointing that these two strains are the most closely related pair of the analysis. Within Bradyrhizobium species, it should be noted that~92% ANI and~87% AAI between the two B. japonicum species indicated that genetic and potential functional drift occurred in spite of extended and conserved collinearities at the level of their genomes. Interestingly, both Rhizobium sullae and Bradyrhizobium japonicum strain pairs were isolated from different places (Italy and Spain for R. sullae; USA and Japan for B. japonicum), and both genera show high specificity for their host macro-symbiont. This might imply that selective pressures differentially reshaped their genomes, favoring the divergence in case of Bradhyrhizobium japonicum and promoting conservation in the case of Rhizobium sullae. This difference can be explained by the fact that the soybean host of B. japonicum was originally confined in far eastern Asia and its presence elsewhere, including America, is the result of man-managed cropping. On the contrary, both Italy and Spain are within the native home range of Sulla coronaria which is an Eurimediterranean plant.

Pangenome and Core Genome Analyses
A total number of 84,482 genes was included in the analysis and comprised genes encoded either by the chromosome or by plasmids in the genomes of Rhizobiales. After removing duplicates, i.e., paralog genes in the genome, the reference pangenome was reduced to 37,364 non-redundant loci.
Concerning the core genome across the 12 taxa which was retrieved by GET_ HOMOLOGUES, it was found to be constituted by 700 genes, which corresponded to 1.87% of the pangenome (the full list of genes that belong to the core genome is available in Supplementary Dataset S1 and the comparisons between core genome and whole genome are provided in Supplementary Dataset S2). The accessory (flexible) genome, defined by genes which were not unique but not shared by all strains considered, and thus not included in the core genome [50], was equal to 15,394 genes. The number of unique genes (i.e., those that were identified as strain-specific) was in total 21,970 and accounted the 58.8% of the pangenome. The percentage of core genes over the number of total genes in each of the single genomes ranged from a minimum of 7.88% (B. japonicum UDSDA 6) to a maximum of 15.64% (A. caulinodans ORS571), with an average value per genome of 10.2%.
The gene partitioning on chromosomal or extra-chromosomal replicons in each of the genomes and its relations with the core genome showed differences at the level of different Rhizobiales species and strains. The details are presented in Table 1. To verify whether genes belonging to the core genome were still supportive of the taxonomic distances resulting from the ANI of full Rhizobiales genomes, a phylogenomic tree was constructed using solely the 700 genes which were identified as core genome components ( Figure 2). The topology obtained was consistent with the overall systematic placement of each strain at high taxonomy level, nevertheless different relationships were found regarding the position of A. caulinodans ORS57 with respect to the other Rhizobiales. The comparison between whole genomes, indeed, placed this microorganism close to the clade constituted by M. loti MAFF303099 and the two Sinorhizobium species. Considering the gene subset belonging to the core genome ( Figure 2), A. caulinodans ORS57 appears closer to the clade which includes the four Bradyrhizobium species here considered. We speculate that core and flexible genomes followed two distinct paths. The core genome appears more conserved in specific Rhizobiales groups, such as Azorhizobium, and Bradhyrizobium. The flexible genome would have been more likely reshaped by evolutionary shortcuts due to horizontal gene transfer, which would have brought higher similarities among Azorhizobium, Mesorhizobium and Sinorhizobium. An unrooted phylogenomic topology showing relationships between Rhizobiales basing on core-gene comparisons is instead shown in Figure S2. The analysis conducted through RBBH approach showed that Azorhizobium caulinodans ORS571 and the two Sinorhizobium species, S. fredii and S. meliloti, shared the lowest number of orthologous genes. As expected, the most correlated species belonged to same bacterial lineages and were identified to be the different Bradyrhizobium japonicum species and Bradyrhizobium sp. BtAi1 and ORS278. The Pearson correlation coefficient based on the RBBH result, showed correlation values comprised between -0.94 and 1 (Figure 3, panel A), and was in general consistent with the clustering of Rhizobiales genomes based on the ANI. Coefficient values visually separated Rhizobiales into two main clades, one which included Rhizobium and Sinorhizobium species (correlation coefficients comprised between 0.86 and 1), and the other which included all Bradyrhizobium species and Azorhizobium caulinodans ORS571 (coefficients between 0.85 and 0.95). In addition, the analysis of each genome percentage, which corresponded to conserved orthologous in the specific set of pairwise genomes, comprised between 49-75% between Bradyrhizobium species (average 58.3%), and it was just slightly lower (average 54.6%) between Rhizobium and Sinorhizobium. The relative genome percentage, which resulted constituted by proteins encoded by ortholog genes, was therefore comparable between distinct Rhizobiales groups, although the latter were constituted by organisms belonging to different species. The numbers of shared genes for each pairwise comparison is shown in the Figure S1. The analysis conducted through RBBH approach showed that Azorhizobium caulinodans ORS571 and the two Sinorhizobium species, S. fredii and S. meliloti, shared the lowest number of orthologous genes. As expected, the most correlated species belonged to same bacterial lineages and were identified to be the different Bradyrhizobium japonicum species and Bradyrhizobium sp. BtAi1 and ORS278. The Pearson correlation coefficient based on the RBBH result, showed correlation values comprised between -0.94 and 1 (Figure 3, panel A), and was in general consistent with the clustering of Rhizobiales genomes based on the ANI. Coefficient values visually separated Rhizobiales into two main clades, one which included Rhizobium and Sinorhizobium species (correlation coefficients comprised between 0.86 and 1), and the other which included all Bradyrhizobium species and Azorhizobium caulinodans ORS571 (coefficients between 0.85 and 0.95). In addition, the analysis of each genome percentage, which corresponded to conserved orthologous in the specific set of pairwise genomes, comprised between 49-75% between Bradyrhizobium species (average 58.3%), and it was just slightly lower (average 54.6%) between Rhizobium and Sinorhizobium. The relative genome percentage, which resulted constituted by proteins encoded by ortholog genes, was therefore comparable between distinct Rhizobiales groups, although the latter were constituted by organisms belonging to different species. The numbers of shared genes for each pairwise comparison is shown in the Figure S1. Microorganisms 2021, 9, x FOR PEER REVIEW 8 of 17 Regarding the degree of in-genome duplication, which corresponds to the generation of paralog genes, the first selection of the core genome included the filtering-off of those genes which belonged to a same gene family. Nevertheless, we sought to envisage the extent of this phenomenon and analyze the corresponding number of paralog genes within each genome ( Table 2). The most frequent cases of paralogs were originated by single-step gene duplication (2 copies of the gene per genome), but in some species (mostly Bradyrhizobiaceae) cases of a multiple-duplication of a given gene with up to 13 copies were observed. Regarding the degree of in-genome duplication, which corresponds to the generation of paralog genes, the first selection of the core genome included the filtering-off of those genes which belonged to a same gene family. Nevertheless, we sought to envisage the extent of this phenomenon and analyze the corresponding number of paralog genes within each genome ( Table 2). The most frequent cases of paralogs were originated by singlestep gene duplication (2 copies of the gene per genome), but in some species (mostly Bradyrhizobiaceae) cases of a multiple-duplication of a given gene with up to 13 copies were observed.

Analysis of Core Genome Function Enrichment and Depletion Using a Single Reference Genome
Having established the size of the 12 symbiotic Rhizobiales core genome, in order to ascertain whether such critically conserved gene subset would be related to particular biological processes, we used a reference-driven conserved genes estimation strategy. In essence, we correlated a full representative genome with predicted functions identified among the core genome genes. The functional analysis was performed by compiling the corresponding tables for the gene categories and Gene Ontology (GO) terms, distinguishing the cellular component, biological process and molecular function items, drawn from the corresponding genome project annotation. As reference genome for this analysis, we chose Rhizobium leguminosarum bv. viciae 3841 and plotted the data for the full genome (7342 genes) and for the core genome containing 700 of its genes (Table 3). Data showed that the top represented categories encoded by the R.leguminosarum 3841 genome include genes putatively involved in metabolism and processing of small molecules (~21% of the total genes) and related to cellular processes (~20%). Nevertheless, although the first of these categories also constituted the majority of shared genes with other Rhizobiales, the second most abundant function is represented by metabolism of macromolecules (39% of the core genome). Looking at the detailed GO terms, we analyzed the difference between the percentage of proteins for each GO annotation in the reference genomes and the percentage of those proteins at the level of the core, considering as significant those categories which were upor down-represented above of below 1%. The comparison showed that the top enriched categories (i.e., positive percentage difference at the level of the core genome) regarding Cellular Component GO included overall proteins involved in the cytoplasm metabolism ( Figure 4). They constituted 9.6% of the reference R. leguminosarum bv. viciae 3841 genome and represent 21% of the core genome (+12%). The second most represented category was found to be related to ribosome metabolism (2.5% of the reference genomes, 9.75% of the core), suggesting a conservation of the transcription-related metabolism among Rhizobiales bacteria. Among the less-represented functions related to this GO category, membrane-related components appeared strongly depleted in the core genome. Although 729 proteins belonged to this GO group, resulting in the most represented category in the reference genome (36% of total genes with GO term), they constituted 14% of the core proteins (-22%). Consistent with the low representation of membrane related proteins in the core functions of Rhizobiales, also proteins integral to membrane and proteins bound to the periplasmic space were less represented in the core genome (7.51% and 1.2%, respectively), in spite of the fact that both were among the top 10 most represented categories in the reference genome. Looking at the detailed GO terms, we analyzed the difference between the percentage of proteins for each GO annotation in the reference genomes and the percentage of those proteins at the level of the core, considering as significant those categories which were up-or down-represented above of below 1%. The comparison showed that the top enriched categories (i.e., positive percentage difference at the level of the core genome) regarding Cellular Component GO included overall proteins involved in the cytoplasm metabolism ( Figure 4). They constituted 9.6% of the reference R. leguminosarum bv. viciae 3841 genome and represent 21% of the core genome (+12%). The second most represented category was found to be related to ribosome metabolism (2.5% of the reference genomes, 9.75% of the core), suggesting a conservation of the transcription-related metabolism among Rhizobiales bacteria. Among the less-represented functions related to this GO category, membrane-related components appeared strongly depleted in the core genome. Although 729 proteins belonged to this GO group, resulting in the most represented category in the reference genome (36% of total genes with GO term), they constituted 14% of the core proteins (-22%). Consistent with the low representation of membrane related proteins in the core functions of Rhizobiales, also proteins integral to membrane and proteins bound to the periplasmic space were less represented in the core genome (7.51% and 1.2%, respectively), in spite of the fact that both were among the top 10 most represented categories in the reference genome. Concerning proteins which function was related to broad biological processes, enriched categories in the core genome were frequently associated to basal metabolism and housekeeping functions related to DNA-activity and genome maintenance ( Figure 5). The top enriched category in the core genome was related to translation machinery, which constituted the 1.6% of total proteins in the reference genome and corresponded to the 7.25% of the core (+5.58%), followed by DNA-repair (+1.68% in the core genome) and rRNA-metabolism (1.17%). Consistent with high variability of membrane-associated components between Rhizobiales bacteria, among the top depleted categories in the core genome resulted different transport-related proteins (transport components, -8.84% occurrences in the core genome; transmembrane transport, -1.82%) and proteins involved in regulation of the gene expression transcription regulation, -1.95%). Concerning proteins which function was related to broad biological processes, enriched categories in the core genome were frequently associated to basal metabolism and housekeeping functions related to DNA-activity and genome maintenance ( Figure 5). The top enriched category in the core genome was related to translation machinery, which constituted the 1.6% of total proteins in the reference genome and corresponded to the 7.25% of the core (+5.58%), followed by DNA-repair (+1.68% in the core genome) and rRNAmetabolism (1.17%). Consistent with high variability of membrane-associated components between Rhizobiales bacteria, among the top depleted categories in the core genome resulted different transport-related proteins (transport components, -8.84% occurrences in the core genome; transmembrane transport, -1.82%) and proteins involved in regulation of the gene expression transcription regulation, -1.95%). Similar results were found also for the category related to the molecular function ( Figure 6), in which ribosomal components, RNA-and ATP-binding proteins were conserved core-genome constituents (between +1.38% and +2.17%). Proteins which have transport activity, and that are likely to be associated with cellular membranes, were those which showed higher species-specificity and were less represented in the core genome (-5.45%). The low representation of proteins whose function is related to transcription processes, including transcription factors and DNA-binding proteins, suggests adaptation at different lifestyles, probably related to different host specificities and types of symbiotic relationship.

Functional Divergence Analysis
Besides the above described pangenome analysis, we carried out a pairwise comparison of each of the 12 isolates (in total, 144 matches), checking the possible extent of functional divergence between each shared orthologous gene belonging to the core genome. The objective of this analysis was to assess the magnitude of the genetic drift that could have determined a functional variation. Similar results were found also for the category related to the molecular function ( Figure 6), in which ribosomal components, RNA-and ATP-binding proteins were conserved core-genome constituents (between +1.38% and +2.17%). Proteins which have transport activity, and that are likely to be associated with cellular membranes, were those which showed higher species-specificity and were less represented in the core genome (−5.45%). The low representation of proteins whose function is related to transcription processes, including transcription factors and DNA-binding proteins, suggests adaptation at different lifestyles, probably related to different host specificities and types of symbiotic relationship. Similar results were found also for the category related to the molecular function ( Figure 6), in which ribosomal components, RNA-and ATP-binding proteins were conserved core-genome constituents (between +1.38% and +2.17%). Proteins which have transport activity, and that are likely to be associated with cellular membranes, were those which showed higher species-specificity and were less represented in the core genome (-5.45%). The low representation of proteins whose function is related to transcription processes, including transcription factors and DNA-binding proteins, suggests adaptation at different lifestyles, probably related to different host specificities and types of symbiotic relationship.

Functional Divergence Analysis
Besides the above described pangenome analysis, we carried out a pairwise comparison of each of the 12 isolates (in total, 144 matches), checking the possible extent of functional divergence between each shared orthologous gene belonging to the core genome. The objective of this analysis was to assess the magnitude of the genetic drift that could have determined a functional variation.

Functional Divergence Analysis
Besides the above described pangenome analysis, we carried out a pairwise comparison of each of the 12 isolates (in total, 144 matches), checking the possible extent of functional divergence between each shared orthologous gene belonging to the core genome. The objective of this analysis was to assess the magnitude of the genetic drift that could have determined a functional variation.
Between 19 and 208 instances scoring significantly as bona fide functionally divergent genes were found ( Figure S3). These genes corresponded to the minimum 1.39% (A. caulinodans vs. R. sullae WSM 1592) to the maximum 4.48% (R.etli vs. R. leguminosarum) of shared orthologous genes between Rhizobiales genomes. The top list (the eight highestscoring cases) of these genes, displaying a high-frequency of parallel divergence across several taxa, is shown in Table 4, along with the corresponding gene function annotations. Analyzing in depth all the diverged genes, we noticed that some among them encoded housekeeping proteins which were not expected to being subjected to functional divergence. For example, the gene encoding the DNA polymerase I was found to be a divergent candidate in the four Bradyrhizobium species well as A. caulinodans ORS571. Another gene which the analysis scored as putatively divergent encodes the gamma chain of the ATPsynthase, and data indicated that such as functional variation would have occurred in three out of the four Bradyrhizobium genomes. Other genes were involved in different metabolic pathways (e.g., cobyrinic acid synthase) and metabolisms (e.g., glutathione synthetase). Notably, these genes corresponded to same orthologous in more than two rhizobia genomes. It could be argued that these genes could have independently acquired functional divergence in several different rhizobia.

Discussion
The pangenome of the taxonomically heterogeneous group of Rhizobiales investigated here displayed a particularly large size, amounting to 37,364 gene families, and a nearly 50-fold smaller core genome (1.87 % of the pangenome). To give some comparisons, an intra-species pangenome with three strains which was reported for S.meliloti had a core genome of 5075 orthologous groups while those groups which constituted the flexible genome were 3810 [51]. The analysis of predatory bacteria [8], which was performed by testing 11 species, classified only 1977 gene families with a core genome constituted of 509 units. Our data, therefore, strengthen the evidence that symbiotic rhizobial diversity is represented by transversal events of gene acquisition moving genetic islands across species of rather distant ancestry [7,18].
The pairwise shared genes calculation yielded results that can be compared to the standard genome-to-genome alignments as ANI and AAI. In relation to the latter two, it can be commented that the pairwise orthologous presence analysis offers a resolution higher than each of those contributed by ANI or AAI, going from a minimum of 15.57 % (R. etli vs. B. japonicum USDA6) of conserved genes over the complete genome, to a maximum of 86.33 % (R. sullae IS123 T vs. R. sullae WSM1592). The 71% difference between these two extremes defines the interval which could be considered the index of resolution. The corresponding minimum and maximum ANI values were 68% and 98%, with an interval at 30%, while for AAI values (minimum 49%, maximum 82%) the fork spanned across the 33% difference.
The analysis of the number of paralogs showed a degree of variability that amounted from a minimum of 0.75% of the genes present in the full genome of A. caulinodans ORS571to the maximum 3.22% recorded in the genome of S. fredii NGR234. As expected, an analysis of the annotated functions of the paralogous genes revealed that all those genes that had high degrees of duplication (above 7 copies per genome) corresponded to transposase-encoding genes. This finding highlights the fact that many Rhizobiales genomes contain a particularly abundant set of insertion elements, which are regarded as a major source of genomic evolutionary plasticity [52][53][54]. Further supporting the magnitude of extrachromosomal inheritance in shaping the genome architecture of this clade, it is worth noticing that 5-7% of the core genes map on plasmids. Indeed, this testifies the relevance of mobile replicons in spreading core genes among N-fixing symbionts. Arguably, the whole phenotypic complex of plant symbiotic N-fixing capability could have been moved by plasmids. This instance was advanced early by authors that analyzed the Alphaproteobacteria 16S-based taxonomy trees [17]. These analyses reckoned that the last possible common ancestor of plant symbiotic N-fixing rhizobia would have existed in an era (> 500 MY before present) in which plants were just barely conquering land. As a consequence, the onset of root nodulation and all further rhizobial evolution was reputed to be explainable only by lateral gene transfer events [17,18].
Analyzing the differences in gene categories and functions in the core genome, and comparing those to a reference full genome, allowed us to test whether the process of core conservation would be due to a simple random gene subsampling ("null" hypothesis) of functional ontology groups, or precise genes would be retained as critical in the core genome. For this comparison, we took as reference the Rhizobium leguminosarum bv. viciae genome and analyzed its distribution both in the full genome and in the 700 genes core subset shared among the 12 species considered. This comparison rests on the a priori assumption that ontology annotation of orthologous genes should be the same in all of the genome projects analyzed. The core genome resulted in a highly enriched in macromolecule metabolism group, with percentages that increased from 5.64% (full genome) to 20.57% (core genome), and in the metabolism of small molecules that nearly doubles it share. The ribosomal components category was also highly preserved as, out of 55 genes present in the full genome, 43 of those are retained in the core genome, shifting the percentage of this group of genes from 0.75% to 6.14 %.
In the detailed GO terms, as regards cellular components, the membrane item appears strongly depleted in the core. This result applies also to the outer membrane-bound periplasmic space genes. By contrast, the cytoplasm and ribosome domains appear strongly enriched. For the biological processes a strongly lower proportion of conserved genes is shown for the transport and DNA-dependent regulation of transcription complexes, while translation and DNA repair are highly up-represented. The molecular function scheme confirms the under-representation of transport, transcription and DNA binding-related genes, and a corresponding enrichment of those involved in ATP binding, RNA binding and translational machinery.
Overall data suggest a more critical conservation of the downstream aspects of gene expression (protein synthesis) and a looser constraint on the upper sections (mRNA synthesis regulation, inducers, signaling components), whose nature would more flexibly explain the rather diverged ecological context of each taxon.
Coming to the functional divergence analyses, the phenomenon was detected in spite of the orthologous prediction. The observed rate appeared not just a function of available genes, as the percentage of diverged-over available-orthologs varies from a minimum of 1.74% to a maximum of 4.84% depending on the combination of strains examined. Among proteins which resulted under selective functional pressure, we found serine-type endopeptidases, a histidinol-phosphatase activity, glutamate-cysteine ligase activity, all belonging to a context of active metabolic genes. The fact that candidate orthologous attributes by bioinformatic tools could be labelled as functionally diverged in their products functions at the same time is puzzling. In a way, this could fuel the debate known as 'ortholog conjecture' [13][14][15][16]. However, in the absence of experimental functional assays, we leave open the possibility that such conflictual definition in a variety of cases could also be due to merely technical issues arising from mismatches in the GO terms definition. This might happen for genomes that have been annotated at different times between which the ontology nomenclature could have undergone updates and refinements.
Summarizing the findings in relation to our hypotheses, the following comments can be made.
The main evidence in terms of involved protein functions were the following: transporters result more prone to variation; our interpretation is that such category is highly involved in different sorts of adaptation and this can bring to host-driven speciation. Support for this consideration can be observed in Figures 4-6 in which these functions appear underrepresented in the core genome and are also those that show the highest degrees of divergence.
The Rhizobiales genera considered appear to group in two main clusters based on their genome average identities, one encompassing the Bradyrhizobium, Azorhizobium and Mesorhizobium and the other including Rhizobium and Sinorhizobium.
In conclusion, the pangenome of the N-fixing symbiotic Rhizobiales proved to be extremely broad in comparison to its core genome. The importance of mobile DNA in the genomics of these bacteria stands out in particular evidence. In this ecologically versatile order, horizontal gene transfer is confirmed to be a major driver of genomic plasticity and a provider of key genes for interactions and finely-tuned adaptations towards plant hosts. The analysis of the core genome allowed us to pinpoint that the majority of its conserved elements belong to an RNA-centered concept, which appears to be critically preserved along the evolution of this functional group of flexibly interactive bacteria.
Supplementary Materials: The following are available online at https://www.mdpi.com/2076-2 607/9/2/407/s1, Figure S1. Pairwise (two-taxa) core genomes. Top table: number of shared genes between each pair. Bottom table: Percent of own genome; In the horizontal dimension: percent of the genome of each taxon listed in rows which shares genes with taxa listed in columns. In the vertical dimension: percent of the genome of each taxon listed in columns which shares genes with taxa listed in rows; (e.g., the core set of 1737 in the A.caulinodans ORS571-B. japonicum USDA110 pair, represents 36.34% of the A. caulinodans ORS571 genome and 20.74% of the B. japonicum USDA110 genome), Figure S2 Unrooted phylogenomic topology showing relationships between Rhizobiales basing on core-gene comparisons, Figure S3: Functionally diverged genes. The number of cases observed in each pairwise comparison is reported on the bottom-left triangle. The corresponding percentage of the total number of shared genes (shown in Figure S1) is reported on the top-right triangle. Supplementary Dataset 1, Core genome, Supplementary Dataset 2, Comparisons Whole genome-Core genome.
Author Contributions: Conceptualization, A.S., N.L.P., G.C., R.M. and P.S.; methodology, software, validation, and investigation, R.R., writing-original draft preparation, writing-review and editing, supervision, project administration, funding acquisition, A.S. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported in part by grant "Progetto di Ateneo PRAT CPDA154841/15" from the University of Padova.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.