Genomic Diversity in the Endosymbiotic Bacterium Rhizobium leguminosarum

Rhizobium leguminosarum bv. viciae is a soil α-proteobacterium that establishes a diazotrophic symbiosis with different legumes of the Fabeae tribe. The number of genome sequences from rhizobial strains available in public databases is constantly increasing, although complete, fully annotated genome structures from rhizobial genomes are scarce. In this work, we report and analyse the complete genome of R. leguminosarum bv. viciae UPM791. Whole genome sequencing can provide new insights into the genetic features contributing to symbiotically relevant processes such as bacterial adaptation to the rhizosphere, mechanisms for efficient competition with other bacteria, and the ability to establish a complex signalling dialogue with legumes, to enter the root without triggering plant defenses, and, ultimately, to fix nitrogen within the host. Comparison of the complete genome sequences of two strains of R. leguminosarum bv. viciae, 3841 and UPM791, highlights the existence of different symbiotic plasmids and a common core chromosome. Specific genomic traits, such as plasmid content or a distinctive regulation, define differential physiological capabilities of these endosymbionts. Among them, strain UPM791 presents unique adaptations for recycling the hydrogen generated in the nitrogen fixation process.


Introduction
Rhizobia are applied across 400 million ha of agricultural land per annum to improve legume forage and crop production through symbiotic nitrogen fixation [1]. These bacteria must be able to Sequences were aligned and phylogenetically analysed using MEGA 6 [39]. Average Nucleotide Identity (ANI) using MUMmer [40] as the alignment algorithm (ANIm) was calculated using the JSpecies package [41]. Distances were calculated as 1-ANIm and trees were computed with the UPMGA [42] method using MEGA 6 package. In the case of the plasmid analysis, Average Nucleotide Identity using BLAST as the alignment algorithm (ANIb) was calculated using the web server JSpeciesWS [43].

Rhizobium leguminosarum bv. viciae UPM791 Complete Genome Is a New Reference Sequence for Endosymbiotic Bacteria
The genomic sequence of Rlv UPM791 was determined by means of a combination of 454 GLS-FLX pyrosequencing, Illumina, and PacBio technologies. Previous PacBio versions had been plagued by miscalled bases. However, recent improvements have dramatically increased its reliability [46], and we found this to be true for our sequences. After the recruitment of reads from two independent Illumina MiSeq sequence sets, only 30 SNPs were uncovered. One of them was localized in the chromosome and the remaining 29 in pRlvD, in a region between positions 60,000 and 88,000. In this region, 36 genes were predicted, out of which 16 open reading frames (ORFs) were annotated as transposase-related genes, one as integrase, and two as resolvases. It appears that regions with a high content of mobilisable elements contain an increased number of repeats and are, therefore, harder to assemble [47,48]. Seven insertions were also identified, all of them localized in the pRlvD plasmid, presumably for similar reasons. Additionally, 74 deletions were detected across the entire genome (32 in the chromosome, nine in pRlvA, four in pRlvB, 12 in pRlvE, six in pRlvC, and 11 in pRlvD). Sixty-six of those deletions consisted of the absence of a G or a C within either a rich GC-content region, or a short stretch containing more than three repeats of these nucleotides.
The Rlv UPM791 finished genome sequence ( Figure 1) has a total size of 7,837,567 bp organized in a circular chromosome (4,760,731 bp, representing 61% of the total genome) and five extrachromosomal replicons, as follows: pRlvA (1,291,920 bp); pRlvB (597,290 bp); pRlvE (564,641 bp); pRlvC, the symbiotic plasmid (405,209 bp); and pRlvD (217,776 bp). A very similar codon usage pattern was observed in all the Rlv UPM791 replicons. GC content of the chromosome is 61%, whereas those of the plasmids are: 59.7% for pRlvA, 61% for pRlvB, 57.3% for pRlvC, 58.5% for pRlvD, and 60.9% for pRlvE. In all the replicons, the high GC content typical of Rhizobiaceae implies a high frequency of GC-rich codons. Consequently, amino acids corresponding to GC-containing codons are overrepresented.
The genome is predicted to encode 7380 genes, with 7318 coding DNA sequences (CDS), nine rRNA (clustered in three ribosomal operons), and 53 transfer RNA (tRNA) ( Table 1). All the rRNA and tRNAs are found in the chromosome. All three rRNA clusters have an insertion of a tRNA-Met before the 5S gene and a tRNA-Ala and tRNA-Ile between the 23S and 16S genes. This genetic structure appears to be conserved across the rRNA operons of rhizobial genomes [49]. The general features and statistics of the Rlv UPM791 genome compared to those from reference genomes are presented in Table 1.   The 7318 CDS were classified into 19 different TIGRfam categories (each CDS could be classified into more than one category, see Supplementary Table S1). The majority of these genes (ca. 50%) are hypothetical proteins (19.66%), proteins of an unknown function (28.13%), and unclassified (9.07%), representing more than 56% of the total genome. A total of 4292 proteins received annotations other than "hypothetical protein" from the IGS (Institute for Genomic Sciences, Baltimore, MD, USA) prokaryotic annotation pipeline [25], with a differential distribution of categories among replicons. The chromosome harbours CDS with TIGRfam categories related to amino acid, protein, purines, pyrimidines, nucleosides, and nucleotides synthesis, as well as the cell envelope. The largest plasmid pRlvA contains the highest number of CDSs classified as regulatory functions. Plasmid pRlvB shows the highest number of CDSs allocated into the energy metabolism category and signal transduction. Plasmid pRlvE carries a high percentage of CDS categorized as fatty acids and transport and binding The 7318 CDS were classified into 19 different TIGRfam categories (each CDS could be classified into more than one category, see Supplementary Table S1). The majority of these genes (ca. 50%) are hypothetical proteins (19.66%), proteins of an unknown function (28.13%), and unclassified (9.07%), representing more than 56% of the total genome. A total of 4292 proteins received annotations other than "hypothetical protein" from the IGS (Institute for Genomic Sciences, Baltimore, MD, USA) prokaryotic annotation pipeline [25], with a differential distribution of categories among replicons. The chromosome harbours CDS with TIGRfam categories related to amino acid, protein, purines, pyrimidines, nucleosides, and nucleotides synthesis, as well as the cell envelope. The largest plasmid pRlvA contains the highest number of CDSs classified as regulatory functions. Plasmid pRlvB shows the highest number of CDSs allocated into the energy metabolism category and signal transduction. Plasmid pRlvE carries a high percentage of CDS categorized as fatty acids and transport and binding proteins. The symbiotic plasmid pRlvC shares with the chromosome the highest percentage of CDSs classified into DNA metabolism. This plasmid also has the largest percentages in terms of the biosynthesis of cofactors, prosthetic groups, and carriers, central intermediary metabolism, and protein fate. The smallest plasmid, pRlvD, holds the largest percentage of CDSs classified as cell processes and mobile and extrachromosomal elements (mainly transposons and integrases).
Although only sequenced to a draft level, the genome sequence of Rlv UPM791 parent strain 128C53 is available in databanks (SAMN02441035) and is essentially identical, although in the particular isolate that was sequenced, the smallest plasmid pRlvD was missing. A known difference between both genomes should be related to the streptomycin-resistance mutation in Rlv UPM791. Many high-level streptomycin resistance (Str R ) mutations map within the small ribosomal subunit protein S12 (rpsL) coding sequence [50]. A comparison of RpsL gene product sequences from Rlv UPM791 (RLV_4249) and 128C53 (WP_003547537.1) showed the presence of a K43R mutation in Rlv UPM791 RpsL, a common alteration in high-level Str R mutants [51,52] and likely responsible for the Str R phenotype.

Rhizobium leguminosarum bv. viciae UPM791 Global Genome Comparisons
A comparison of the Rlv UPM791 16S rRNA gene sequence with other sequences in databanks placed this strain together with those from other R. leguminosarum strains (Supplementary Figure S1). For a more accurate analysis, the entire genome was aligned to the sequenced strains and the ANIm was also determined [41,53]. Our results indicate that Rlv UPM791 is phylogenetically related to Rlt WSM1325, Rlv 3841, and Rlt WSM1689 with 93.06, 93.05, and 92.50% sequence similarity, respectively ( Figure 2). Considering that the ANI values between genomes of the same species are generally above 95%, these data would indicate that Rlv UPM791 should be considered a different genospecies. However, as reported by Kumar et al. [54], R. leguminosarum includes at least five distinct genospecies and could be considered a "species complex" rather than a regular species. An ANIm comparison of Rlv UPM791 with Kumar et al. genospecies A through E, suggests that Rlv UPM791 belongs to genospecies E (data not shown). proteins. The symbiotic plasmid pRlvC shares with the chromosome the highest percentage of CDSs classified into DNA metabolism. This plasmid also has the largest percentages in terms of the biosynthesis of cofactors, prosthetic groups, and carriers, central intermediary metabolism, and protein fate. The smallest plasmid, pRlvD, holds the largest percentage of CDSs classified as cell processes and mobile and extrachromosomal elements (mainly transposons and integrases).
Although only sequenced to a draft level, the genome sequence of Rlv UPM791 parent strain 128C53 is available in databanks (SAMN02441035) and is essentially identical, although in the particular isolate that was sequenced, the smallest plasmid pRlvD was missing. A known difference between both genomes should be related to the streptomycin-resistance mutation in Rlv UPM791. Many high-level streptomycin resistance (Str R ) mutations map within the small ribosomal subunit protein S12 (rpsL) coding sequence [50]. A comparison of RpsL gene product sequences from Rlv UPM791 (RLV_4249) and 128C53 (WP_003547537.1) showed the presence of a K43R mutation in Rlv UPM791 RpsL, a common alteration in high-level Str R mutants [51,52] and likely responsible for the Str R phenotype.

Rhizobium leguminosarum bv. viciae UPM791 Global Genome Comparisons
A comparison of the Rlv UPM791 16S rRNA gene sequence with other sequences in databanks placed this strain together with those from other R. leguminosarum strains (Supplementary Figure S1). For a more accurate analysis, the entire genome was aligned to the sequenced strains and the ANIm was also determined [41,53]. Our results indicate that Rlv UPM791 is phylogenetically related to Rlt WSM1325, Rlv 3841, and Rlt WSM1689 with 93.06, 93.05, and 92.50% sequence similarity, respectively ( Figure 2). Considering that the ANI values between genomes of the same species are generally above 95%, these data would indicate that Rlv UPM791 should be considered a different genospecies. However, as reported by Kumar et al. [54], R. leguminosarum includes at least five distinct genospecies and could be considered a "species complex" rather than a regular species. An ANIm comparison of Rlv UPM791 with Kumar et al. genospecies A through E, suggests that Rlv UPM791 belongs to genospecies E (data not shown). A comparative protein cluster analysis of Rlv UPM791 was carried out using available, complete R. leguminosarum genome sequences: Rlv 3841, Rlt WSM1325, Rlt WSM2304, and Rlt WSM1689, as well as the high-quality permanent draft genome from Rlt CB782, an elite inoculant strain of commercial importance isolated from Trifolium semipilosum [45]. The six genomes shared 4033 clusters, with WSM2304 showing the highest number of exclusive CDS (132, Figure 3) and Rlv UPM791 the lowest (68). Rlv UPM791 and Rlv 3841 share the highest number of protein families, 5147. Rlv UPM791 also shares a high number of protein families with Rlt WSM1325 (5033). However, Rlv UPM791 shares the lowest number of protein families with Rlt WSM2304 (4689). In general, the chromosome is highly conserved in all the strains, as shown by the global synteny analysis in Supplementary Figure  A comparative protein cluster analysis of Rlv UPM791 was carried out using available, complete R. leguminosarum genome sequences: Rlv 3841, Rlt WSM1325, Rlt WSM2304, and Rlt WSM1689, as well as the high-quality permanent draft genome from Rlt CB782, an elite inoculant strain of commercial importance isolated from Trifolium semipilosum [45]. The six genomes shared 4033 clusters, with WSM2304 showing the highest number of exclusive CDS (132, Figure 3) and Rlv UPM791 the lowest (68). Rlv UPM791 and Rlv 3841 share the highest number of protein families, 5147. Rlv UPM791 also shares a high number of protein families with Rlt WSM1325 (5033). However, Rlv UPM791 shares the lowest number of protein families with Rlt WSM2304 (4689). In general, the chromosome is highly conserved in all the strains, as shown by the global synteny analysis in Supplementary Figure

High Diversity of the Plasmid-Associated Genome in Rhizobium leguminosarum
Rlv UPM791 extrachromosomal replicons have been named pRlvA, pRlvB, pRlvC, pRlvD, and pRlvE. All five plasmids have putative replication systems based on repABC genes. The largest extrachromosomal replicon is chromid pRlvA (1,291,920 bp and 1246 CDSs). Plasmids pRlvB and pRlvE have similar sizes, with respective values of 597,290 bp and 564,641 bp (588 and 545 CDSs, respectively). These two plasmids were previously visualized as a unique, more intense band in Eckhardt gels [55]. The two smallest plasmids are pRlvC (symbiotic plasmid, 405,209 bp and 366 CDSs) and pRlvD (217,776 bp, 239 CDSs).
When the plasmid content was compared to that of Rlv 3841 ( Figure 4, see Supplementary Figure S4 for more detail), we observed that four out of the five Rlv UPM791 plasmids had an equivalent in Rlv 3841 [6]. Plasmid pRlvA combines sequences from three different origins: (a) pRL12, the Rlv 3841 chromid [56]; (b) pRL8, a pea rhizosphere-specific plasmid with little collinearity (<5%) with other sequenced rhizobial genomes [57]; and (c) sequences not present in the Rlv 3841 genome, namely regions 342-426 kb (84 kb), 567-700 kb (133 kb), and 717-802 kb (85 kb). One of the noteworthy differences found between pRlvA and pRL12 is the absence of the Type VI Secretion System (T6SS), a system that is altogether missing in the Rlv UPM791 genome. With a GC content of 59.7%, pRlvA appears close to the chromosome (GC content of 61%), and would represent a megaplasmid or chromid, not uncommon in proteobacteria [10,56]. As such, pRlvA harbours some relevant genes, such as ATP-binding cassette (ABC) transporters (like the urea transport system UrtABCDE, RLV_0741 to RLV_0745), MCPs, or genes related to organic acids and sugar metabolism.
pRlvB showed the highest degree of conservation with pRL11 (684,202 bp), its Rlv 3841 plasmid counterpart, as shown in more detail in Supplementary Figure S4. These plasmids exhibited an ANIb value of 92.76%, with a sequence coverage of 67.36% (Table 2). Interestingly, pRlvB harbours the only copy of the minCDE operon (RLV_1599-1601), involved in the accurate septum localization for cell division, implying that this replicon should be essential for the survival of this bacterium [58]. As in pRL11, pRlvB also includes the genes for cobalamin synthesis cobFGHIJKLM at the end of its sequence (RLV_1685-1678). Synteny analysis carried out with other Rl reference genomes suggests that large regions of this plasmid are also present in Rlt CB782 (plasmid 2), Rlt WSM1305 (pRL132502), Rlt WSM1689 (plasmid 2), and Rlt WSM2304 (pRLG202) (Supplementary Figure S2). When the plasmid content was compared to that of Rlv 3841 ( Figure 4, see Supplementary Figure  S4 for more detail), we observed that four out of the five Rlv UPM791 plasmids had an equivalent in Rlv 3841 [6]. Plasmid pRlvA combines sequences from three different origins: (a) pRL12, the Rlv 3841 chromid [56]; (b) pRL8, a pea rhizosphere-specific plasmid with little collinearity (<5%) with other sequenced rhizobial genomes [57]; and (c) sequences not present in the Rlv 3841 genome, namely regions 342-426 kb (84 kb), 567-700 kb (133 kb), and 717-802 kb (85 kb). One of the noteworthy differences found between pRlvA and pRL12 is the absence of the Type VI Secretion System (T6SS), a system that is altogether missing in the Rlv UPM791 genome. With a GC content of 59.7%, pRlvA appears close to the chromosome (GC content of 61%), and would represent a megaplasmid or chromid, not uncommon in proteobacteria [10,56]. As such, pRlvA harbours some relevant genes, such as ATP-binding cassette (ABC) transporters (like the urea transport system UrtABCDE, RLV_0741 to RLV_0745), MCPs, or genes related to organic acids and sugar metabolism.
pRlvB showed the highest degree of conservation with pRL11 (684,202 bp), its Rlv 3841 plasmid counterpart, as shown in more detail in Supplementary Figure 4. These plasmids exhibited an ANIb value of 92.76%, with a sequence coverage of 67.36% (Table 2). Interestingly, pRlvB harbours the only copy of the minCDE operon (RLV_1599-1601), involved in the accurate septum localization for cell division, implying that this replicon should be essential for the survival of this bacterium [58]. As in pRL11, pRlvB also includes the genes for cobalamin synthesis cobFGHIJKLM at the end of its sequence (RLV_1685-1678). Synteny analysis carried out with other Rl reference genomes suggests that large regions of this plasmid are also present in Rlt CB782 (plasmid 2), Rlt WSM1305 (pRL132502), Rlt WSM1689 (plasmid 2), and Rlt WSM2304 (pRLG202) (Supplementary Figure S2). Plasmid pRlvE appears to be a cointegrate of sequences that in Rlv 3841 are found in plasmids pRL10 and pRL9, with 91.62 and 88.75% of the ANIb in the conserved parts, respectively ( Table 2). Sequence conservation of pRlvE with plasmids pR132505 (294,782 bp) and pR132504 (350,312 bp), both from Rlt WSM1325 (90.97% and 88.80% ANIb, respectively), also emphasizes that plasmid pRlvE could be a cointegrate, with a total size (564,641 bp) roughly corresponding to the sum of plasmids pR132504 and pR132505. An alternative explanation would of course be that pRlvE has split into more than one plasmid in other Rl strains, as it occurs in Rlv 3841 and Rlt WSM1325. Plasmid pRlvE appears to be a cointegrate of sequences that in Rlv 3841 are found in plasmids pRL10 and pRL9, with 91.62 and 88.75% of the ANIb in the conserved parts, respectively ( Table 2). Sequence conservation of pRlvE with plasmids pR132505 (294,782 bp) and pR132504 (350,312 bp), both from Rlt WSM1325 (90.97% and 88.80% ANIb, respectively), also emphasizes that plasmid pRlvE could be a cointegrate, with a total size (564,641 bp) roughly corresponding to the sum of plasmids pR132504 and pR132505. An alternative explanation would of course be that pRlvE has split into more than one plasmid in other Rl strains, as it occurs in Rlv 3841 and Rlt WSM1325.
Finally, the sequence of the smallest plasmid, pRlvD, was mapped against pRL7 (217,776 bp). However, contrary to the other replicons in Rlv UPM791, which shared a similar distribution of genes across functional categories, pRlvD consists mainly of transposons and insertion sequences. Surprisingly, while plasmids pRlvA, pRlvB, pRlvD, and pRlvE showed relatively good conservation with counterparts in Rlv 3841, symbiotic plasmid pRlvC exhibited a low degree of conservation with its Rlv 3841 counterpart, pRL10 (488,135 bp). This led us to analyse it in depth in the next section.

Rhizobium leguminosarum bv. viciae UPM791 Harbours a Unique Symbiotic Plasmid
The Rlv UPM791 symbiotic plasmid, pRlvC, was compared to each of the Sym plasmids of the reference genomes: Rlv 3841 pRL10, Rlt WSM1325 pR132501 (828,924 bp), Rlt WSM2304 pRLG201 (1,266,105 bp), Rlt WSM1689 Rleg3_Contig1811.4 (341,391 bp), and Rlt CB782 plasmid C219.4 (251,612 bp). When these plasmids were aligned using pRlvC as the reference sequence, a very low degree of conservation was observed. These results were consistent with those obtained from the comparison of the proteomes deduced from these symbiotic plasmids. A very low number of protein families were shared among the replicons (20, Figure 5). Symbiotic plasmids from Rlv UPM791 and Rlt CB782 strains share the highest number of protein families (82), despite being phylogenetically less related ( Figure 5). Genes conserved among the five replicons are essentially restricted to those required to establish the symbiosis and fixing nitrogen (nod, fix and nif genes). As expected, and due to their differential host range, the genetic structure of the nodulation genes is only fully conserved between Rlv 3841 and Rlv UPM791, with small differences regarding those from the three Rlt strains. When the sequences of pRlvC and pRL10 were aligned, we observed that their arrangement was different ( Figure 6). Whereas in pRL10, fix genes are followed by nif, nod, and rhi genes, in pRlvC, fix and nif genes are located in a different region, separated from the rhi, nod, and fix-nif clusters. pRlvC includes 13 genes involved in nodulation, nodOTNMLEFDABCIJ, located close to the rhizosphere-induced genes rhiIABCR. Genes required for nitrogenase comprise nifHDKEN and nifAB, and those also required for nitrogen fixation are fixABCX and fixNOQPGHIS (RLV_1890-RLV_1892 and RLV_1827-1834, respectively).  In addition to the above genes, Rlv UPM791 symbiotic plasmid pRlvC also includes a large cluster of genes (hupSLCDEFGHIJK hypABFCDEX, corresponding to RLV_1962-1945), encoding the previously described NiFe hydrogenase system [59]. This hydrogenase enzyme recycles hydrogen generated in the nitrogenase catalytic cycle. Consistent with their symbiotic role, hydrogenase genes Rlv 3841  In addition to the above genes, Rlv UPM791 symbiotic plasmid pRlvC also includes a large cluster of genes (hupSLCDEFGHIJK hypABFCDEX, corresponding to RLV_1962-1945), encoding the previously described NiFe hydrogenase system [59]. This hydrogenase enzyme recycles hydrogen generated in the nitrogenase catalytic cycle. Consistent with their symbiotic role, hydrogenase genes  Figure 6. Circos representation of synteny between the symbiotic plasmids from Rlv UPM791 and Rlv 3841 strains.
In addition to the above genes, Rlv UPM791 symbiotic plasmid pRlvC also includes a large cluster of genes (hupSLCDEFGHIJK hypABFCDEX, corresponding to RLV_1962-1945), encoding the previously described NiFe hydrogenase system [59]. This hydrogenase enzyme recycles hydrogen generated in the nitrogenase catalytic cycle. Consistent with their symbiotic role, hydrogenase genes are co-expressed with nif genes via NifA-dependent regulation [60]. Interestingly, the hup/hyp gene cluster is highly conserved in a reduced number of R. leguminosarum strains, but absent in many others of the same species, including Rlv 3841 [61] and all Rlt strains analysed to date [62]. The Rlv UPM791 hydrogenase cluster contains hupE, encoding an energy-independent diffusion facilitator for the uptake of Ni(II) ions, required for synthesis of the hydrogenase heterobimetallic cofactor [63]. Similar hup/hyp clusters are present in some other species of Rhizobium, Bradyrhizobium, and Azorhizobium [64].

Specific Traits Define the Physiology of Each Rhizobial Strain
Rlv establishes a diazotrophic symbiosis with legumes of the tribe Fabeae (Pisum, Lens, Vicia, and Lathyrus) over a wide range of edaphoclimatic conditions. This broad host range could be partially explained by the existence of a large adaptive fraction of its genome that differs among isolates, as already suggested by the large number of plasmids. In the case of reference genomes studied in depth, such as Ensifer meliloti 1021 [65] or Rlv 3841 [6], these large genomes show a high number of genes devoted to solute uptake, the import and export of molecules, the production of cell-surface polysaccharides, regulatory functions, and also nitrogen metabolism [9], which could be necessary to fine tune required specificities.

Metabolic Pathways
The IGS prokaryotic annotation pipeline assigned 343 Rlv UPM791 proteins to the energy metabolism category and 88 proteins potentially involved in the synthesis of cell-surface polysaccharides and peptidoglycan. Metabolic pathways that are differently covered in Rlv UPM791 and Rlv 3841 are, for example, fluorobenzoate and dicloroethane degradation, with the BenAB (2-halobenzoate 1,2-dioxygenase, RLV_3415-3416) pathway absent in Rlv 3841. Rlv UPM791 also shows a higher number of proteins involved in beta-lactam resistance. Regarding specific traits, Rlv UPM791 has two copies of isocitrate dehydrogenase: the chromosomal RLV_4899 and RLV_1979, encoded in pRlvC and subjected to microoxic regulatory control (unpublished results); this second copy is absent in Rlv 3841. It appears to be the result of a duplication or insertion event, as it is located close to a transposable element. Rlv 3841 has two pathways for malate conversion to pyruvate: Dme and phosphoenolpyruvate (PEP) carboxykinase (PckA)/pyruvate kinase (PykA) [66]. These pathways are also present in Rlv UPM791: Dme, RLV_2776; and PckA, RLV_7044, with two copies of PykA: RLV_6265, similar to the protein of Rlv 3841, and RLV_3363, both in the chromosome. However, Rlv UPM791 contains a phosphoenolpyruvate carboxylase (PPC) that converts phosphoenolpyruvate to oxaloacetate by the addition of bicarbonate (HCO 3 − ), thus feeding into the TCA cycle, which is absent in Rlv 3841. A unique characteristic in Rlv 3841 was the presence of two functional PHB synthases: a chromosomal one, active in free-living and undifferentiated bacteria (type I, PhaC1), and another in pRLl10, active in bacteroids (type III, PhaE PhaC2) [67]. However, Rlv UPM791 presents only the chromosomal type I PHB synthase equivalent to PhaC1, RLV_4485, but lacks the type III one, PhaE PhaC2, as most of the sequenced rhizobia do. This emphasizes once again the differences between the symbiotic plasmids of these endosymbiotic bacteria and their potential symbiotic behaviour.

Transport Systems
The high number of ABC transport systems harboured by rhizobia, in the order of ca. 200 (as compared to 57 for Escherichia coli), is particularly remarkable and a reflection of their lifestyle [68,69]. These transporters enable rhizobia to access a wide diversity of nutrients in the soil and the plant rhizosphere, even if they are at low concentrations [70]. This is the case, for example, for Rlv 3841, which harbours 183 complete ABC operons, representing 11% of its protein complement [6]. In Rlv UPM791, among those proteins with an assigned function, a large number of proteins are transport and binding proteins (906 proteins), the majority of which correspond to ABC transporters. For example, Rlv UPM791 has a full copy of both Aap and Bra (RLV_4520-4523, RLV_6024-6027), two broad-specificity amino acid ABC transporters necessary for an effective nitrogen fixation in the bacteroids of Rlv 3841 [71]. Active transport efflux pumps represent the largest category of metal resistance systems [72]. For example, the Rl DmeRF system is a metal-responsive efflux mechanism acting as a key element for metal homeostasis under free-living and symbiotic conditions [73]. This system, present in the chromosome of Rlv UPM791 (RLV_3758-3759), controls the concentration of Ni and Co, which can be toxic at moderate concentrations but are essential microelements for microbial nutrition that participate in a variety of cellular processes, such as hydrogenase biosynthesis [74].

Motility and Chemotaxis Proteins
The presence of methyl-accepting chemotaxis proteins (MCPs) in the Rlv UPM791 genome was also analysed. These proteins are required in order to sense chemical gradients and direct the movements of the bacterium towards different attractants by chemotaxis [75]. Their structural variability relies upon the molecules they detect, harbouring two different domains: a methyl-accepting chemotaxis protein signaling domain and a N-terminal ligand-binding region (LBR). Soil bacteria are metabolically versatile, prepared to sense and respond to a wide range of compounds [76]. In the Rlv UPM791 genome, we have identified 35 MCP genes, with 22 of them located in the chromosome. Similarly, high numbers of MCP proteins were detected in the other Rl strains analysed (28, 31, 25, and 32 MCPs for strains Rlv 3841, Rlt WSM1325, Rlt WSM1689, and Rlt WSM 2304, respectively). A high number of MCPs per genome correlates with lifestyles involving complex behaviour or interactions [77], as is the case for rhizobia. A phylogenetic tree including MCPs from the five Rl strains studied (Supplementary Figure S5) revealed that 21 MCP genes (16 chromosomal and five plasmid-borne) present in Rlv UPM791 have a close ortholog in all the other four strains, thus indicating the existence of a shared core of chemical sensing machinery in this species. On the other hand, two MCP genes (RLV_0526 and RLV_0630) were not closely related to MCPs from any of the other strains; interestingly, the genes for these "orphan" MCPs are encoded in pRlvA. This set of MCPs might give specific sensing capabilities to some of the strains within this species.

Cell-Surface Polysaccharide Production
Bacterial surface polysaccharides have several functions, such as nutrient gathering, environmental stresses protection or attachment, and biofilm formation, ensuring adaptation to changing environmental conditions. In rhizobia, acidic extracellular polysaccharides (EPS) are crucial in the initial bacterial invasion for the establishment of a successful symbiosis with legumes, especially for the formation of indeterminate nodules [78]. Rl strains produce a characteristic EPS skeleton formed by octasaccharide repeating units, with glucose as the dominant sugar component [79]. However, they differ in the sugar composition of their repeating units, the length of the side chain, and the pattern of non-sugar decoration (acetyl, pyruvyl, and hydroxybutanoyl residues) [78]. The genes directing the biosynthesis of exopolysaccharides (exo/exs or pss genes) are found in large clusters either on the chromosome or on the megaplasmids [78]. In Rlv UPM791 and 3841, these genes are found in the chromosome. Both strains harbour a copy of the exoB (RLV_6933) and exoY (RLV_6099) genes, responsible for the synthesis of sugar precursors and unit assembly, respectively. The overall genetic organization of pss genes is conserved in Rlv strains 3841 and UPM791. Both genomes harbour two copies of the pssA gene (RLV_6029, RLV_4118 in Rlv UPM791), which encodes a conserved protein involved in the first step of the EPS synthesis located in a single ORF at a long distance from the other pss genes (RLV_5910-5935). In both Rlv strains, the pss cluster contains the EPS modification gene pssRMK and the polymerization and translocation machinery encoded by pssTNOP and pssL genes. However, they differ in the number of glycosyl transferases, with pssGH genes absent in Rlv UPM791, as happens in Rlt TA1 [80]; this difference suggests that Rlv UPM791 produces an alternatively structured EPS. Apart from the pss genes, other genes encoding proteins essential for EPS synthesis are also present in this large cluster in both Rlv strains, as is the case of the prsDE genes (RLV_3415-3416).
They encode two components of a type I protein secretion system, responsible for the secretion of a broad range of substrates, such as the nodulation protein NodD, the glycanase PlyA, or the rhizobial adhesion protein RapA. Consistently, rapA (RLV_3418) and plyA (RLV_3417) genes are part of the large polysaccharide gene cluster in both strains [80].

Secretion Systems
Interactions of bacteria with other organisms, prokaryotic or eukaryotic, rely on their ability to export proteins through dedicated protein secretion machineries. The general secretion (Sec) and twin-arginine (Tat) translocation pathways, responsible for the majority of protein export into the periplasm, were identified in the chromosome of Rlv UPM791. The Sec pathway primarily translocates proteins containing a signal peptide and in an unfolded state. The translocase comprises seven proteins, including an ATPase, SecA (RLV_6542); a chaperone, SecB (RLV_7013); a translocation channel, complex SecYEG (RLV_4274, RLV_4240, and RLV_4771); and two additional membrane proteins, SecDF, that promote the release of the mature protein into the periplasm (RLV_3073, RLV_4445). The Tat pathway secretes folded proteins with twin arginine signal peptides. In Rlv UPM791, the Tat system consists of three subunits encoded in the tatABC operon (RLV_4436-4438). This system was shown to be essential in this strain for symbiosis with peas [81]. The sequence identity of the genes involved in the Sec and Tat pathways compared to Rlv 3841 is above 95%.
In rhizobia, type I, III, IV, and VI secretion systems have been shown to be involved in symbioses with legume hosts [82]. Rlv UPM791 presents only type I (T1SS) and IV (T4SS) systems. A complete T1SS is encoded in pRlvA (RLV_1006-1012) and consists of two permeases, a LacI-type and a TetR-type regulator, two Haemolysin secretion protein D (HlyD) proteins, and a transmembrane protein.
Rlv UPM791 lacks a tolC gene, encoding a common outer membrane protein able to interact with T1SS. Other transmembrane HlyD type I secretion proteins are encoded by genes RLV_3630, RLV_5928, RLV_2681, and RLV_2693. Putative auto-aggregation protein type I substrates correspond to genes RLV_5342, RLV_5930, and RLV_2396. A Type II secretion system, which depends on the activity of the Sec and/or Tat pathways to deliver their substrate proteins into the periplasm, was not found in Rlv UPM791. A Type III secretion system (T3SS), responsible for producing nodulation outer proteins (Nops), was reported in Mesorhizobium loti, B. diazoefficiens, and Ensifer fredii NGR234 [83][84][85], but appeared to be absent in R. leguminosarum strains. Whereas a Type VI secretion system (T6SS) is present in pRL12 from Rlv 3841 [6], no T6SS has been found in the Rlv UPM791 genome. However, both Rlv UPM791 and Rlv 3841 contain a T4SS tra/trb conjugative system in their symbiotic plasmids. In pRlvC, this T4SS is adjacent to the repABC genes, a situation also found in pSyms from several Rhizobium strains, including pRL10 from Rlv 3841 [6], pRtrCIAT899a from Rhizobium tropici CIAT899, and pPRF81b from Rhizobium sp. PRF81 [86]. All these plasmids also contain traI and traR, suggesting a Quorum Sensing-dependent regulation. This system has also been located in the M. loti symbiosis island, indicating a potential for the transmission of this chromosomal island [83] that has been experimentally confirmed [87]. An additional Type-IV pilus T4SS cluster [88], containing putative genes virB1-B11 (RLV_0329-0340), was identified in pRlvA. Type-IV pili have been described as virulence factors in bacteria, and can be involved in the colonization of surfaces in different genera of Gram-negative bacteria [89]. A pilus-assembly system, similar to that of Rlv 3841, appears in the operon RLV_7221-7252. Finally, we found one type V autotransporter barrel domain protein (RLV_5683) in the Rlv UPM791 genome, highly similar to Rlv 3841 RL1196.

Control of Microoxic Metabolism
The fixNOQP fixGHIS operons encode a cbb 3 -type cytochrome oxidase that is induced under microoxic conditions and is essential for respiration during symbiotic nitrogen fixation. As mentioned above, these genes are present in the symbiotic pRlvC plasmid in Rlv UPM791 (RLV_1827-1830, RLV_1831-1834). The regulation of these operons in rhizobia has been thoroughly studied in E. meliloti and in B. diazoefficiens, where they are under the control of the oxygen sensor/regulator/regulator system FixLJK that also controls the expression of the key nitrogen fixation regulator NifA [90]. In R. leguminosarum and R. etli, a canonical FixLJK system is absent, and fixNOQP fixGHIS expression is controlled by FnrN [91][92][93][94][95]. As previously shown in our laboratory, in Rlv UPM791, NifA and FnrN constitute the main regulators of both nitrogenase and hydrogenase expression, thus ensuring their coordinated expression [94,96]. While nifA is present as a single copy gene (RLV_1894) located, as in Rlv 3841, in the symbiotic plasmid pRlvC, fnrN is present in two functional copies in Rlv UPM791, one in the chromosome (fnrN1, RLV_5077), and the other in pRlvC (fnrN2, RLV_1980). This sets Rlv UPM791 apart from Rlv 3841, which only contains a chromosomal copy [6], and confirms previous studies from our laboratory [92][93][94]96].

Regulatory Proteins
A total of 738 regulatory proteins were found in the Rlv UPM791 genome, constituting ca. 10% of the predicted proteins (data not shown). We made a specific search for transcriptional regulators of the LuxR-type family in the genome of Rlv UPM791. These proteins have two characteristic domains: an autoinducer-binding domain that binds AHLs in the N-terminal region, and the LuxR-family DNA-binding, helix-turn-helix (HTH) domain in the C-terminal region [97]. The latter domain is a general DNA-binding domain that is present in other, non-LuxR-like response regulators. We identified 21 candidate proteins from which only five also included an autoinducer-binding domain, indicating their potential role in Quorum Sensing (QS): RLV_5631 (243 aa), RLV_3298 (244 aa), RLV_1922 (247 aa), RLV_6914 (251 aa), and RLV_2364 (270 aa). Among these proteins we found the QS systems previously described in our laboratory [98]: CinR (RLV_5631), with its cognate synthase CinI (RLV_5632; 221 aa); and RhiR (RLV_1922), with its corresponding synthase RhiI (RLV_1926; 185 aa). Both systems are similar to their counterparts in Rl A34 and Rlv 3841 strains. In all these genomes, CinRI is located in the chromosome, whereas RhiRI is encoded in the symbiotic plasmid. For Rlv UPM791, Cantero (2005) [98] reported that the Rlv 3841 raiRI system was absent in this strain; our genome analysis suggests that the same applies for the LuxR orphan regulator bisR [99]. No other QS synthase genes were found associated with the remaining LuxR homologues, indicating that they are probably orphan LuxR proteins [100]. Not included among these five potential LuxR functional proteins is the pseudogene RLV_2010, which corresponds to TraR and that, as previously shown [98], presents a frameshift mutation located in the autoinducer binding domain. However, sequence analysis of the gene for the associated TraI synthase suggests that it is functional, although further studies are needed to confirm its potential role in the cell-cell communication machinery of Rlv UPM791. The presence of the antirepressor TraM [98] was also confirmed (RLV_2009, 78 aa).
Regarding other regulatory proteins important in rhizobia, we found a high number of GntR-like regulators in Rlv UPM791 (50). Well-known members of the GntR family in rhizobia control diverse processes, such as rhizopine catabolism in E. meliloti (MocR, [101]) or arabinose metabolism in B. diazoefficiens (AraR, [102]). LysR-family transcriptional regulators were also represented, among them NodD (RLV_1905), an important LysR regulator in rhizobia that mediates the activation of nod-gene expression in response to plant-produced flavonoids [103]. Among the ArsR-type regulators, the R. leguminosarum transcriptional repressor NolR acts on nod-gene expression after induction by NodD, which is required for optimal nodulation in some hosts [104], and is usually located in chromosomes or chromids [86]. In Rlv UPM791, it was accordingly found in the chromosome (RLV_5806), whereas a nolR-like gene is located in pRlvA (RLV_1120), a situation also found in Rlv 3841 (RL3537 and pRL120789). RosR is an additional transcriptional regulator involved in the positive regulation of exopolysaccharide synthesis [105] and the homologous gene is also present in the chromosome of Rlv UPM791 (RLV_3788).

Stress Proteins
Stress conditions are a key aspect in the definition of different symbiotic structures [106]. As in other host-microbe associations, endosymbiotic forms of rhizobia are subjected to different types of stress, such as the presence of antimicrobial NCR peptides [107]. Relevant members of the stress-response machinery are the small Heat-Shock Proteins (sHSPs), ubiquitous chaperones bearing a characteristic alpha-crystalline domain 80-100 residues long. These proteins are able to prevent the irreversible aggregation of proteins following denaturation due to high temperatures or other stress factors [108]. Rlv UPM791 encodes 7 sHSPs, five of which are located in plasmids (RLV_2751, RLV_6296, RLV_0361, RLV_0502, RLV_0817, RLV_0818, and RLV_1399). The genomes of Rlv 3841, Rlt WSM1325, and Rlt WSM1689 contain four, five, and five sHSPs, respectively. This is in sharp contrast to the situation commonly found in most bacteria, where no more than one member of the family is found per genome (71% of 318 genomes studied, [109]). In that study, a reduced group, including just 9% of the genomes studied, contained more than four sHSPs. This group included all legume endosymbiotic bacteria analysed, with the exception of Rlv 3841, which contains four sHSPs. It is tempting to speculate that a high number of sHSPs allows the rhizobia to cope with different kinds of stresses during their complex life style. A known source of stress for bacteria within indeterminate legume nodules is the presence of antimicrobial NCR (Nodule-specific Cysteine-Rich) peptides first described in alfalfa [107] and later on in other legumes showing differentiated bacteroids, including the pea [110]. NCR peptides are able to induce changes in free-living bacterial cells, such as the inhibition of cell division and endoreduplication, similar to those observed in bacteroids. Tolerance to these peptides under symbiotic conditions requires the presence of BacA, a protein proposed to function as a peptide transporter and required for bacteroid development in alfalfa and pea nodules [111,112]. The Rlv UPM791 genome encodes a BacA homologue (RLV_5830) highly conserved (96% identity) with those from the other Rlv and Rlt strains analyzed here.

Discussion
Genomic approaches are being used nowadays to define and understand the genetic systems from bacterial symbionts implicated in the interaction with their respective hosts. The genome sequences of strains from model symbiotic rhizobial species, such as E. meliloti 1021 [65], Rlv 3841 [6], R. etli bv. phaseoli CFN42 T [113], M. loti MAFF303099 [83], or Bradyrhizobium diazoefficiens USDA110 T [84], have been determined, providing new insights into these symbiotic interactions. Moreover, with advances in sequencing technologies, a genomic encyclopaedia of root nodule bacteria (GEBA-RNB) was created by sequencing 107 genomes and capturing their phylogenetic and biogeographic diversity [45]. However, despite these holistic approaches, the number of complete rhizobial genomes still remains low. Given the wide genomic diversity of rhizobia uncovered by the GEBA-RNB study [45], this is a limitation to in-depth comparative studies.
One of the characteristics of rhizobial genomes is the presence of multiple plasmids, especially in fast-growing symbionts such as Rlv UPM791. In this strain, five different extra-chromosomal replicons were identified. These multipartite genomes are often associated with species that encounter a host, and may increase their adaptive potential by extending their metabolic and symbiotic capabilities [10,12]. For this reason, soil-dwelling species tend to have large genomes with extra-chromosomal DNA, allowing them to adjust to survival as free-living cells in the soil, or associated with their symbiotic partners [114]. Rhizobia appear to have evolved by means of horizontal gene transfer and gene duplication events [113]. This could explain why the plasmids from Rlv UPM791 contain large DNA regions also found in different plasmids harboured by other rhizobial strains, especially for pRlvA and pRlvE. The presence of IS elements and repeated DNA sequences facilitates co-integration events [115], and this might be the case for plasmids pRlvA and pRlvE. The co-integration of E. meliloti megaplasmids with the chromosome has been previously reported [116]. These phenomena have also been demonstrated in E. fredii NGR234 between the symbiotic plasmid, its chromosome, and even its megaplasmid [117].
As expected from previous studies [10,118,119], the chromosome displays a high synteny across different Rl genomes, whereas extra-chromosomal DNA represents an important source of variability among strains. The fact that all the strains used in the analysis share a core genome would explain why they are considered as the same species, whereas the adaptive genome is strain-specific. This is specifically the case for the symbiotic plasmid, which clusters genes needed for symbiosis in rhizobia [9]. Notably, this plasmid showed the lowest degree of sequence similarity across different Rl strains. However, when Rlv UPM791 pRlvC was compared to the symbiotic plasmids from the other four strains (Rlv 3841, Rlt WSM1325, Rlt WSM2304, Rlt WSM1689, and Rlt CB782), we observed that the regions harbouring the symbiotic gene clusters were very similar. For example, the 13 nod genes involved in nodulation showed the same structure in Rlv UPM791 and Rlv 3841, with the rhizosphere rhiIABCR genes [120] nearby this nod cluster along with the genes required for nitrogen fixation (nif genes). This fact probably indicates recent transfer events of symbiosis genes between distantly related plasmids, as was already proposed when comparing symbiotic plasmids pRL10 and pRL1 (a symbiotic plasmid from another Rlv isolate [121]). A similar situation was also observed in a study where the sequences of symbiotic plasmids from 14 different rhizobial strains were compared [14]. In that study, little identity was detected among them, despite having similar nod, nif, and fix gene clusters. This was also true when different genera and symbiotic strategies were compared, such as with the M. loti MAFF303099 and E. meliloti 1021 genomes; the M. loti symbiotic island contained no genes with orthologues in the E. meliloti genome, except for the highly conserved nodulation and nitrogen fixation genes [65]. In view of this conservation, it has been suggested that the symbiotic and nitrogen fixing abilities of these bacteria have been acquired through multiple lateral transfer events [14]. The extent of these events and the selective forces implicated in the evolution of the symbiotic traits have been thoroughly reviewed recently [122].
Rhizobial genomes appear to be highly dynamic, as reflected by the presence of a high number of insertion sequence (IS) elements and transposases [12]. In the Rlv UPM791 genome, this is clearly observed in the case of plasmid pRlvD, the counterpart of pRL7 from Rlv 3841, which contains an uncommon number of repeated sequences and mobile elements. This may be due to the accessory nature of many of the genes located in these replicons, which are thought to offer potential sites for recombination [115]. Plasmid pRlvD is very different from the rest of the Rlv UPM791 genome, a situation also noted for pRL7 [6], with more than 80% of foreign genes and/or genes of an unknown function and up to 31 pseudogenes. Among these genes, 28% encoded transposases [6].
Regarding large plasmids or megaplasmids, the term "chromid" was proposed to designate all those replicons that have a nucleotide composition and codon usage similar to those found in the chromosome, carry essential genes, but have a plasmid-type replication system and are larger than the accompanying plasmids [56]. These are characteristics that describe pRlvA, the largest Rlv UPM791 plasmid (1.3 Mb). Chromids appear to contain genus-specific genes and in Rhizobium, Ensifer, and Agrobacterium, it was already proposed that chromids were acquired from the common ancestor of these genera [123]. Their biological role appears to be that of enabling the bacterium to have a larger genome, allowing the chromosome to remain small, thus ensuring shorter generation times, a characteristic that might be desirable in fast-changing environments. This would be the case for fast-growing rhizobia, like Rhizobium and Ensifer, both harbouring chromids, whereas Mesorhizobium and Bradyrhizobium, which do not have such replicons, grow more slowly in free-living culture [56]. The difficulty in obtaining derivative strains cured of such chromids in the laboratory (our unpublished results for pRlvA) suggests their relevance for bacterial viability.
Plasmid pRlvB, the Rlv UPM791 counterpart of pRL11 from Rlv 3841, could also be considered a chromid. These megaplasmids appear to be highly conserved among Rl strains, as was previously reported for pRL11, pRLG202 from Rlt WSM2304, pR132502 from Rlt WSM1325, and even p42e from R. etli CFN42 T , thus suggesting that these plasmids might carry essential metabolic genes [14]. Megaplasmid p42e was considered to be a secondary chromosome in R. etli, as it is a highly stable replicon impossible to eliminate due to the presence of genes involved in primary metabolism [124]. One of the common characteristics of these highly conserved plasmids in rhizobia, also including pA from R. etli CIAT652, is the presence of the minCDE operon. The products of these genes are involved in the accurate localization of the cell division site, leading to septum formation [124]. This cluster is also present in the megaplasmid pNGR234b from E. fredii NGR234 [85] and in pRlvB, suggesting that this plasmid is also essential in Rlv UPM791. In a comprehensive review, DiCenzo and Finan (2017) [10] observed that megaplasmids are common in genera containing soil and marine bacteria interacting with eukaryotic hosts. They also noted that the presence of both a chromid and a megaplasmid in the genome seemed to be a unique trait, as only ca. 2.5% of the 1708 bacterial genomes examined had both replicons, uniquely widespread in the genus Burkholderia and the order Rhizobiales [10].
Rhizobia must survive within the legume nodule, but also in the heterogeneous environment of the soil, which might explain the high metabolic burden and versatility of their genomes. The accessory (adaptive) genes present in the rhizobial genomes sequenced to date might be related to their lifestyle, conferring adaptability to different ecological niches [6]. These bacteria are known to be rich in ABC transporters. While not reaching the 816 ABC genes identified in Rlv 3841, the Rlv UPM791 genome encoded 906 transport and binding proteins, thus predicting capabilities for a wide range of substrates to be taken up. These proteins appear to be widely distributed in Rlv 3841 genome, and are especially abundant on plasmids pRL12, pRL10, and pRL9. The number of predicted proteins with regulatory functions is also high (738 proteins), which is consistent with such large genomes, given that all the metabolic pathways should be tightly regulated [6]. Also high is the number of sHSPs in rhizobia, although the reasons for this are still unknown, as the exact function of these proteins has not been determined in rhizobia [109]. It is tempting to speculate that sHSPs might have a specific role in the stabilization of proteins partly denatured as a consequence of specific interactions with host factors. Interestingly, an sHSP from Agrobacterium tumefaciens has recently been implicated in virulence towards Arabidopsis thaliana through the protection of a component of the virulence (Vir) system [125].
Similar to what has been described in other rhizobia, Rlv UPM791 contains two gene clusters encoding chemotaxis systems (RLV_3079-3087 and RLV_6231-6241), orthologs of che1 and che2 systems, respectively, described in Rlv 3841 [126]. Although the multiplicity of chemotaxis systems is a common theme in plant-associated bacteria [127], there is still little information on the actual role that chemotaxis might play in symbiosis. Deletion of the Rlv 3841 che1 system resulted in a significant reduction of nodulation competitiveness [126], likely due to less efficient localization on the root surface. In the case of Rlv UPM791, analysis of the genomic regions flanking mcp genes revealed that four out of the 35 chemoreceptor genes identified (RLV_3079, RLV_6234, RLV_6235, and RLV_6236) are physically linked to the chemotaxis-related operons. Interestingly, two mcp genes (RLV_5537 and RLV_6634) are linked to additional copies of cheW-like genes. CheW is known to participate in a highly ordered superlattice, also involving MCP chemoreceptors and CheA, a histidine kinase involved in signal transduction [128]. The physical association of mcp genes with cheW copies suggests that some MCPs might require specific mediators to attain optimal interaction with a CheA-dependent signaling pathway. The presence of multiple copies of mcp genes is widespread in plant-associated bacteria, likely providing additional tools to compete for root surface colonization [127]. However, information on the specific substrates recognized by rhizobial MCPs is also scarce. Work carried out with McpU, one of the most highly expressed MCPs in E. meliloti, has shown that this MCP binds proline, a major constituent of alfalfa seed exudate, and also directs chemotaxis towards other amino acids [129,130]. The existence of a significantly high number of mcp genes in R. leguminosarum (35 vs. 8 in E. meliloti) suggests a higher ability of this bacterium to direct the cell towards different compounds, perhaps linked to different specific hosts legumes. In addition, recent evidence indicates that, in bacteria with multiple mcp genes such as Azospirillum or Myxococcus, some chemoreceptors are involved in CheA-and CheW-dependent signal transduction pathways regulating other cellular functions such as the formation of resistant cyst cells, flocs, or biofilms [131]. It is possible that, also in rhizobia, some of the MCPs might be involved in other, non-chemotactic functions induced in response to specific chemical stimuli.
Comparative genomics represents a valuable tool for capturing the specificities and generalities of each genome. When compared to its closest relative, Rlv UPM791 shows distinctive features that set it apart from Rlv 3841. For example, the pSym-borne hup system for hydrogen uptake is only present in Rlv UPM791, whereas a T6SS, absent in this strain, is present in pRL12 from Rlv 3841 [6]. The T6SS in Rlv 3841 could be non-functional, because it presents a natural deletion in the promoter region of the two main clusters. This deletion, together with the absence in Rlv UPM791, may indicate that T6SS is not necessary for symbiosis and that its presence could be deleterious, as it was shown in the ineffective symbiosis of R. leguminosarum RBL5523 with pea plants [132]. These Rlv strains also differ in the number of glycanases forming the exopolysaccharide, with pssH and pssI absent in Rlv UPM791, indicating the production of a different EPS. Another distinct feature was the presence of a second copy of FnrN in Rlv UPM791. Contrary to the situation in other rhizobia, FnrN controls both Rlv UPM791 hydrogenase and nitrogenase activities in the nodule, indicating the lack of a functional fixK gene in this strain [93]. A genomic FnrN − mutant exhibited wild-type levels of hydrogenase activity [92], and only the fnrN1 fnrN2 double mutant formed ineffective nodules lacking both nitrogenase and hydrogenase activities [93]. The FnrN system has been hypothesized to be more flexible for adaptation to environments with fluctuating oxygen tensions [94], a situation probably encountered by the rhizobia, aerobic soil organisms that have to adapt to the microoxic conditions required inside the nodule to maintain an active nitrogenase enzyme. Moreover, a distinct redox environment has also been proposed for the rhizosphere after observing the induction of a rhizosphere-specific cytochrome oxidase, distinct from the purely aerobic aa 3 cytochrome oxidase complex and from the high-affinity cbb 3 cytochrome oxidase complex induced in the nodule, in Rlv 3841 [57].
Regarding the presence of intercellular communication systems in Rlv UPM791, both cinRI and rhiRI had been previously reported [98], as well as a traI synthase and a truncated traR regulator. The existence of other disrupted traR genes had already been reported, as is the case for trlR (traR-like regulator, [133]) in A. tumefaciens and traR from R. tropici CIAT899 [86]. In the case of Rlv UPM791 traR, a frameshift mutation has been found in the autoinducer binding domain. Conversely, the trlR mutation is located in the DNA-binding domain, which makes this protein still able to bind AHLs but prevents TraR activity in this bacterium, thus inhibiting conjugation [133]. In CIAT899, an insertion element is located upstream of the traR gene, disrupting its promoter [86]. The availability of the genome sequence allowed a global search for transcriptional regulators of the LuxR-type, and produced three additional proteins potentially involved in QS. Rhi and CinI genes seem to play an important role in the first steps of the symbiosis, as they were upregulated in the pea rhizosphere [57]. Further work is needed to ascertain the role of these proteins in Rlv UPM791.
Despite all the above remarkable traits, a large fraction of the gene content is annotated as hypothetical genes or unknown function (25.2% for Rlv 3841 and 32.2% for Rlv UPM791), which calls for a large scale functional investigation to understand the overall capabilities of these genomes. High-throughput experimental approaches are being developed to decipher the role of these genes. Among them, the Insertion Sequencing (INSeq) transposon (Tn) mutagenesis-based approach appears to be a very powerful technique to study gene function at the genome scale [134][135][136][137]. INSeq has been adapted for use in the Rhizobiaceae and tested under different metabolic conditions in Rlv 3841 [138][139][140]. When different growth media were analysed, the group of genes encoding hypothetical proteins was one of the five major categories assigned to the core functional genome, highlighting the importance of these proteins in core cellular processes. INSeq was also very helpful in establishing plasmid essentiality in Rlv 3841, with pRL12 encoding enzymes predicted to function in central carbon metabolism, such as pRL120209 (putative tpiA) and pRL120210 (putative rpiB), thus explaining why pRL12 cured strains were unable to grow on minimal media [139,141]. INSeq also helped to understand how succinate catabolism is affected by low-O 2 environments and how C 4 -dicarboxylic acids fuel N 2 fixation [140]. In that study, glnB (RL2393), encoding the nitrogen regulatory protein PII, was identified as specifically essential for growth on succinate at 1% O 2 , a condition similar to that experienced by the bacteroid. These functional approaches would be useful, for example, to explain the reasons why Rlv UPM791 outcompetes Rlv 3841 when inoculated on pea plants [142] and to identify those genetic determinants involved in root infection and nodule occupancy.

Conclusions
The newly reported Rhizobium leguminosarum complete genome Rlv UPM791 and the genomic comparisons presented here advance our understanding of the symbiotic potential within this species and demonstrate the wide diversity of plasmids within the same species, especially for the symbiotic plasmid, the most unique replicon in each of the strains analysed. In common with all Rhizobium leguminosarum genomes analysed, Rlv UPM791 shows a high degree of metabolic versatility, enabling the species to survive in a wide range of environments and to adapt to changing lifestyles as free-living or symbiotic cells. However, pathways involved in hydrogen recycling or the regulation of nitrogen fixation are unevenly distributed amongst genomes, suggesting that Rlv UPM791 harbours key specific traits and can become a useful reference sequence for hydrogenase positive rhizobial strains.
Supplementary Materials: The following are available online at www.mdpi.com/2073-4425/9/2/60/s1, Figure  S1. Maximum likelihood 16S rRNA phylogenetic tree; Figure S2. Global synteny between Rlv UPM791 and the Rhizobium leguminosarum complete genomes, including the chromosome and all the different plasmid replicons; Figure S3. NJ phylogenetic tree of RepABC concatenated protein sequences (numbers indicate the position of the genes according to the JGI database); Figure S4. Rlv UPM791 chromosome and plasmids mapped against Rlv 3841 replicons; Figure S5. NJ phylogenetic tree of MCP proteins from R. leguminosarum strains. Symbols indicate the strain of origin of each MCP as shown in the inset. Multiple alignment was carried out using CLC Main Workbench with the full sequences of a set of 152 MCP proteins. Phylogenetic tree was constructed using the Neighbor-Joining algorithm. Distance measures were calculated using the Kimura 2-parameter method with 1000 bootstrapped replicates; Table S1: List of TIGRfam categories for proteins found in the different replicons of Rlv UPM791 and in the overall genome (total). Acknowledgments: Work on Rhizobium leguminosarum bv. viciae UPM791 (128C53) started in the Ruiz-Argüeso lab in Madrid 40 years ago. Many past and present members of our laboratories, too numerous to mention here, have contributed to our understanding of the physiology, biochemistry, molecular biology, and, more recently, the genomics of this organism. We are grateful to all of them. Special thanks go to Ana Isabel Bautista for technical assistance, to Gonzalo Martín for IT support, and to Santiago de la Peña, Pablo Rodriguez Palenzuela, and Jorge Sánchez Colmenero for bioinformatics support. Funding for this work was provided by Spain's MINECO (BIO2013-43040-P to José M. Palacios) and the USA's National Institutes of Health (2R01GM080227 for the Analysis Engine to Michelle G. Giglio).