Comparative Genomics of Aeschynomene Symbionts: Insights into the Ecological Lifestyle of Nod-Independent Photosynthetic Bradyrhizobia

Tropical aquatic species of the legume genus Aeschynomene are stem- and root-nodulated by bradyrhizobia strains that exhibit atypical features such as photosynthetic capacities or the use of a nod gene-dependent (ND) or a nod gene-independent (NI) pathway to enter into symbiosis with legumes. In this study we used a comparative genomics approach on nine Aeschynomene symbionts representative of their phylogenetic diversity. We produced draft genomes of bradyrhizobial strains representing different phenotypes: five NI photosynthetic strains (STM3809, ORS375, STM3847, STM4509 and STM4523) in addition to the previously sequenced ORS278 and BTAi1 genomes, one photosynthetic strain ORS285 hosting both ND and NI symbiotic systems, and one NI non-photosynthetic strain (STM3843). Comparative genomics allowed us to infer the core, pan and dispensable genomes of Aeschynomene bradyrhizobia, and to detect specific genes and their location in Genomic Islands (GI). Specific gene sets linked to photosynthetic and NI/ND abilities were identified, and are currently being studied in functional analyses.

recent study, Miché and colleagues [16] sampled many rhizobial strains on CI group 3 Aeschynomene species in Central and South America. All isolates belonged to the PB clade, except a new group of non-photosynthetic strains phylogenetically placed at an intermediate position between the photosynthetic clade and the B. japonicum/B. elkanii clades, and nodulating the Aeschynomene CI group 3 (stem and roots) by the NI pathway (see Figure 1, STM3843 clade).  [16]. Numbers at nodes are bootstrap percentages from 100 replicates. Roman numbers indicate clades defined from AFLP analyses and recA phylogeny in [16]. Abbreviations: PB: photosynthetic bradyrhizobia; Sep: separate group with no affiliated number; ND/NI: nod gene dependent/independent symbiotic pathway. Research is currently under way to identify the genetic bases of the NI symbiotic pathway. Traditional genetic approaches using a random Tn5 transposon insertion allowed the identification of mutants affected in nodule development and nitrogen fixation [12]. However, no complete nodulation deficient mutants were identified during this screening, underlining a potential functional redundancy in genes involved in the early steps of this interaction, and the need for different approaches to identify them.
To better understand how photosynthetic bradyrhizobia have adapted to their particular niches (aquatic, nodules on Aeschynomene), we developed in this study a comparative genomics approach on several strains with a focus on the nod-dependent/nod-independent and photosynthetic/non-photosynthetic bacterial lifestyles. To reach this aim, we sequenced and compared the genomes of (i) five nod gene independent (NI) PB strains of CI group 3 from different sub-branches of the PB clade (ORS375, STM3809, STM3847, STM4509 and STM4523) and included the previously sequenced ORS278 and BTAi1 genomes; (ii) a PB strain hosting both ND and NI genetic bases (ORS285, CI group 2); (iii) a NI non-photosynthetic strain of CI group 3 (STM3843). Comparative genomics analyses allowed us to (i) infer the genomic diversity and evolutionary dynamics of PB genomes (core and accessory gene sets, Genomic Islands); (ii) identify gene sets potentially involved in symbiosis and photosynthesis. The latter are currently being studied by functional analyses.

Genomic Features of Bradyrhizobial Genomes
This study compares 10 Bradyrhizobium genomes, including nine genomes of Aeschynomene symbionts, and the genome of B. japonicum USDA110. Three complete genomes were already published (Bradyrhizobium sp. BTAi1, Bradyrhizobium sp. ORS278 [13], B. japonicum USDA110 [17]). Seven strains were sequenced for this study, selected as representatives of almost every phylogenetic branch of the photosynthetic bradyrhizobia clade (isolated from worldwide Aeschynomene species) as established in the recA gene phylogeny published in [16] (see Figure 1 for phylogenetic placement). Genetic features and sequencing methodologies used for each strain are given in Table 1.   [16]; Sep: separate group with no affiliated number; ND/NI: nod-dependent or nod-independent strain according to [13,16]; CI: Aeschynomene cross inoculation group according to [14,15]; % : ORS285 hosts both systems; § : genome size estimates from Solexa read contig assemblies; & EBI accession numbers; $ ENA (trace archive) accession number for solexa reads [18].
Presence of plasmids was checked for by PFGE analysis (not shown), as well as the search for replication genes (repABC) within each genome. All 10 genomes seem to be composed of a unique chromosome, except strain BTAi1 which also harbors a plasmid [19]. B. japonicum USDA110 possesses the biggest genome with a length of 9,105,828 bp, and Bradyrhizobium. sp. STM3809 the smallest with 7,391,986 bp (Table 1; when excluding Solexa draft genomes for which size inference is uncertain). Bradyrhizobial genomes are GC rich bacteria with an average content of 65% (G + C). The protein coding density has an average of 86% and the number of CDS fluctuates between 6,750 (ORS278) and 9,650 (B. japonicum). Strains host between 3 and 6 rRNA genes in 1 to 3 operons, and 47 to 52 tRNA genes were identified in the various strains (Table 2). Codon and amino-acid usage is identical within bradyrhizobia (Supplementary Figure S1), which can be explained by the same GC genome content (64%-66%) and their phylogenetic proximity.

Core, Dispensable and Strain-Specific Genes
A comparative analysis of gene content was undertaken to classify genes into groups depending on their degree of orthology/conservation among strains, following a bidirectional best hits approach (see Experimental Section). This analysis was performed on seven genomes (fragmented Solexa genomes were not included). First, genes present in strictly all seven genomes were pooled in a group called "core-genome". Then, genes identified in only one genome were pooled in groups "specific" to each strain. All other genes, present in a least two genomes (but not in all), were pooled in a group called "dispensable genome". Finally, the pangenome includes all genes (core, dispensable and specific). This analysis is illustrated by a Venn diagram presented in Figure 2A.

Figure 2. (A) Comparative gene orthology between seven genomes of Bradyrhizobium.
Numbers indicates the shared gene sets between groups of strains, satisfying the following criteria: BBH (bidirectional best hits) with 40% amino-acid identity on 80% of the smallest protein; (B) Core and pangenome gene count in the Bradyrhizobiaceae family. CDS indicates the number of protein-coding genes in each genome listed in x-axis. The y-axis indicates the number of genes in core and pangenome when adding genomes in the x-axis to the comparative analysis. The presented tree was built by Maximum likelihood (GTR + I + G model) on a partition of 5 taxonomic marker genes (atpD, dnaK, recA, rpoB, rpoD), with bootstrapped nodes (100 replicates The core genome of bradyrhizobia (i.e., genes shared by all seven strains) is composed of 3,663 genes, of which 67% have an annotated function, 32% encode conserved proteins of unknown function, and only 1% are hypothetical proteins (see Supplementary Figure S2A). This core gene set included all 206 genes from the minimal bacterial gene-set described in [20]: genes encoding DNA replication or repair machinery (DNA polymerase subunits, gyrase or helicase), transcription and translation enzymes and factors (RNA polymerase, amino-acids-tRNA synthases, ribosomal proteins, etc.), transport (PTS system), energy and intermediary metabolism (ATP synthase, acyl-CoA synthase, glycolysis enzymes). The dispensable genome was the largest group with 9,395 genes, reflecting a high rate of horizontal gene transfer and the metabolic versatility among bradyrhizobia. It was composed of 8% of hypothetical proteins, 35% of conserved proteins of unknown function, and 65% of proteins with functional annotation (statistics almost identical to the core genome annotation). The specific gene set of each genome ranges from 489 to 3,268 genes, and was directly related to the phylogenetic distance between one strain and its closest neighbors. For example, a strong correlation was found between the phylogenetic distance of Bradyrhizobium strains towards ORS278 (based on a distance matrix of the 6 genes used in MLSA for Figure 2B) and their number of specific genes compared to ORS278 (R 2 = 0.87). B. japonicum USDA110 is the most distant from other strains (harboring 3,268 specific genes) followed by STM3843 (1,892 specific genes), that is at an intermediate phylogenetic distance between USDA110 and the photosynthetic strains clade ( Figure 2B).
Photosynthetic strains harbors between 489 and 1,002 specific genes. BTAi1 genome exhibits the largest number of specific genes, of which 125 are located on its plasmid (pBRAB). pBRAB contains 296 CDS, meaning that 171 (57%) of these plasmidic genes are chromosomal in other strains, but they were not arranged in groups with obvious synteny to the chromosomes. Among the strain-specific gene sets, most genes were annotated as hypothetical protein (from 37% to 66% depending on strain) or conserved protein of unknown function (4% to 27%; with orthologs founds in RefSeq or Uniprot databases, see Supplementary Figure S2A-C), but several functions of ecological interest could be identified. For example, the ORS278 specific gene set encodes gas vesicle proteins (used by cyanobacteria to adjust their position in water depending on light conditions [21]) and RTX toxins (gram-negative exotoxin); the BTAi1 one encodes transporters involved in heavy metal resistance, a type IV secretion system, uptake hydrogenases (Hup/Hyp proteins, recycling of hydrogen produced during the nitrogen fixation process in legume nodules, [19]), a siderophore, an additional ATP synthase; the ORS285 specific gene set encodes several non-ribosomal peptide synthetase, a glycopeptide antibiotic (Tyrocidine synthetase 3); ORS375 contains specific genes encoding enzymes involved in anaerobic benzoate degradation pathway (AliB), gas vesicle proteins (GvpF/L), arsenic resistance; a bacteriophytochrome (BphB); STM3809 harbors specific genes encoding enzymes involved in catechol (CatABCD) and aromatic acid degradation (XylY, XylZ, BenA, BenD). In the case of STM3843 (1,892 specific genes), it was possible to identify functions involved in vitamin biosynthesis (B1, B2, B12), fumarate catabolism, urease metabolism, many adenylate/guanylate cyclases (regulation of AMPc/GMPc secondary messengers), rhizopine catabolism (moc genes), fermentation of carbohydrates (pyruvate pathway-porAB genes), biosynthesis of LPS and EPS (exo genes, O-antigen synthesis), and several toxin/antitoxin (TA, as YoeB-YefM) modules involved in bacterial stress response and/or stabilization of chromosomal integrons [22].
Each group of genes was functionally annotated according to the KEGG classification and distribution of functional categories of genes were compared between strain-specific, dispensable and bradyrhizobial core genes (shown in Supplementary Figure S2B,C). In general, specific and dispensable gene sets exhibited the same KEGG distribution, both differing highly in proportions from the core genome for several functions such as "xenobiotics degradation and metabolism" and "membrane transport" (mainly ABC transporters and secretions systems; see Supplementary Figure S2B) that are over-represented in the specific and dispensable gene sets, or cell routine functions (translation, protein folding, nucleotide metabolism) that are almost specific to the core gene set.
A distribution of the core and pangenome at different taxonomic levels (PB clade, Bradyrhizobium genus, Bradyrhizobiaceae family) is shown in Figure 2B. The core genome of the PB clade (including 5 photosynthetic bradyrhizobia) is composed of 4,792 genes, which decreases to 4,208 for Aeschynomene symbionts (PB clade + STM3843; excluding Bj USDA110), then to 3,663 (Bradyrhizobium genus), 1,632 (Bradyrhizobiaceae family, including Rhodopseudomonas and Nitrobacter genomes), and finally drops to 1,212 after inclusion of an out-family alphaproteobacteria (Azorhizobium). The pangenome gene number increases dramatically at each addition of a new bacterial genome, reaching 12,040 genes at 5 PB strains, with an increased rate when shifting taxonomical borders (13,898 with STM3843, 16,488 with USDA110, ending at 20,162 for the Bradyrhizobiaceae family). It does not seem to reach a plateau even after addition of 5 PB genomes, reflecting the large genomic diversity within this clade. By comparison, the core and pangenome of 5 Rhodopseudomonas strains (Bradyrhizobiaceae family, closely related to bradyrhizobia and belonging to at least 4 several species) contain 3,319 and 8,000 genes, respectively [23], around 30% less than the Bradyrhizobium genus.

Whole Genome Comparative Analyses and Distribution of Genomic Islands
Whole genome alignments were performed on the basis of a reference genome (see Experimental Section). Figure 3 presents whole genome alignments based on ORS278 (3A), ORS285 (3B), STM3843 (3C) and USDA110 (3D) with 9 genomes of Bradyrhizobium (other genomes and larger genome views are available as Supplementary Figure S4). This representation allows inclusion of complete, draft and even highly fragmented genomes (obtained by Solexa technology).
The first result obtained from Figure 3 is the high genome homology between strains from the PB clade (darker blue layers on Figure 3A,B), while less apparent when STM3843 ( Figure 3C) or USDA110 ( Figure 3D) were used as reference genomes (lighter blue circles around the reference genome). Interestingly, when B. japonicum USDA110 was used as the reference genome ( Figure 3D), the 680 kb symbiotic island (localized between 1.7 and 2.3 Mb in USDA110 genome, integrated into a val-tRNA gene) did not show any match with the Aeschynomene symbionts genomes, except for nif and fix genes for all strains (encoding the symbiotic nitrogen fixation system), the nodABCIJ and a type 3 secretion system for ORS285 (that use both ND and NI systems). However, circular views also allow the detection of various "small" GI, exemplified by holes in the blue layers around the reference genome. As shown in Figure 3, many GI were found among genomes, some being shared between strains. This analysis was further explored using RGP finder (see Experimental Section), and all detected GI were numbered and their distribution among bradyrhizobial genomes was investigated (except on the 3 solexa-based genomes). Supplementary Table S1 presents the listing and distribution of 613 detected GI (see Mat&Methods section for details), with associated functions and distribution in bradyrhizobial genomes. In PB genomes (5 genome-based), 278 GI were found (53 to 78 GI per genome), of which 33 (11%) host phage-related functions, 14 (6%) carry conjugal transfer genes, and 45 (16%) carry transposases or fragments of transposases. Overall, these GI carrying mobile elements represents 27% of PB GI-hosted genes.
A dendrogram based on the presence/absence of GI was built, and compared to a phylogeny based on the core genome set ( Figure 4). Interestingly, both trees were slightly discordant with a different position of STM3809 within the PB clade that seemed correlated in the GI-based dendrogram with the geographical origin of strains (STM3809 and BTAi1 are of American origin, while ORS strains are of West African origin). In both trees, STM3843 and B. japonicum USDA110 diverged from the PB clade, reflecting their high taxonomic distance to the PB clade. Aligned genomes to reference (9) Core Genome

Metabolic Network Comparison of Strains
A principal component analysis was performed to describe the completion (measured as a percentage) of all known metabolic pathways present in the seven Bradyrhizobium genomes in order to find metabolic specificities or common pathways among several strains. This analysis was associated with the symbiotic and photosynthetic phenotypes of strains. The three first factorial axis ( Figure 5, Supplementary Figure S5) captured over 80% of the total data variability. The first factor (Factor 1) isolates metabolic specificities of B. japonicum USDA110 against other strains, Factor 2 captures variability of STM3843 compared with the others, and the third axis (Factor 3) separates BTAi1 from the other genomes.
Strain segregation in the PCA plots can be explained by the metabolic differences which are larger in USDA110, STM3843 and BTAi1 compared to other strains. On the first factorial plane ( Figure 5), 64 pathways more complete in B. japonicum are represented by red vectors pointing in its direction, indicative of many metabolic specificities of this strain compared to others (biosynthesis of various molecules from carbohydrates to amino-acids, degradation of many xenobiotics, see Supplementary  Table S2). Pathways more complete in STM3843 (separated from others on second factorial plane, vector group 6, 20 pathways) included metabolic enzymes involved in biosynthesis and degradation of various substrates. Specific metabolic pathways for photosynthetic strains (vector 4) included 13 pathways with known roles in photosynthesis (bacteriochlorophyll a biosynthesis, phytyl diphosphate biosynthesis). Finally, 60 pathways more complete in nod-independent strains (First factorial plan, vector 2 and 7) separated them from nod-dependent ones. These pathways included many degradation enzymes (of molecules such as atrazine, nitrobenzene, purine, pyrimidine) and biosynthesis enzymes (for products as phycocyanobilin).

Figure 5.
Principal Component Analysis (PCA) of seven Bradyrhizobium genomes, performed on a two dimensional matrix compiling metabolic pathway completion values (the number of enzymatic reactions with coding genes for pathway x in a given organism, divided by the total number of reactions in pathway x as defined in the MetaCyc database; see Experimental Section) across all genomes. In the plot, each genome is represented by a point, whereas pathways are shown as colored vectors. Pathways with correlated completion values across organisms (vectors with similar orientation) have been clustered (corresponding to numbers 1 to 7) and drawn in the same color (vectors of pathways with identical completions overlap). Genomes can be associated with their representative and characteristic groups of metabolic pathways (i.e., vectors "pointing in their direction"). The corresponding pathway functions are listed in Supplementary Table S2. Vector length encodes the quality of representation of the pathway in the presented plot (i.e., the longer a vector is, the more representative its pathway is for genomes on the same side of the plot). Abbreviations: photo−/photo+: non-photosynthetic/photosynthetic strains; Nod+/Nod−: nod-dependent/nod-independent strains).

Functional Exploration of Genomes: Focus on Photosynthetic and Symbiotic Abilities
Orthologs within dispensable gene groups were searched for among strains sharing the photosynthetic/non-photosynthetic and nod-dependent/nod-independent phenotypes (strain using nod genes to develop their symbiosis with the plant host or not). The phylogenetic profile ("phyloprofile" for short) exploration of genomes is presented in Figures 6A (photosynthesis) and 6B (symbiosis). A KEGG functional classification of specific genes for each phenotype is presented in Supplementary Figure S3.

Photosynthetic Versus Non-Photosynthetic Strains
The specific gene set of PB is composed of 269 genes, of which 50% were annotated with functions known to be involved in bacterial photosynthesis (see Supplementary Table S3 for the complete listing of genes), 64 (23%) were annotated as hypothetical proteins, and 74 (27%) as conserved proteins of unknown functions. Supplementary Table S4 presents a list of photosynthetic strain-specific gene set with annotated functions, sort by functional roles.
Bacterial photosynthesis has been shown to play an important role during stem symbiosis with Aeschynomene plants. It has been proposed that during the early steps of symbiosis, the energy provided by the photosynthetic apparatus facilitates ex planta survival and infectivity whereas during the later steps this energy can be used by the bacteria to fix nitrogen, thus limiting the demand for plant photosynthetates [10]. All the genes necessary for the formation of the photosystem (PS) are clustered in a 50 kb-region, the so-called "photosynthesis gene cluster" or PGC that includes genes found in the 269 specific gene set of PB: (i) bch genes involved in bacteriochlorophyll biosynthesis, (ii) crt genes involved in carotenoid biosynthesis (spirilloxanthin), (iii) puc and puf genes which encode respectively the core proteins of the reaction center (RC) and for the light-harvesting complexes I (LHI), and (iv) several regulators (bacteriophytochrome and ppsR genes). The organization of the genes in the PGC is perfectly conserved in all the photosynthetic Bradyrhizobium strains examined during this study (ORS278, BTAi1, ORS285, ORS375 and STM3809) whereas none of these genes can be found in the non-photosynthetic Bradyrhizobium strains (USDA110 and STM3843). It is hypothesized that the ancestor of the Bradyrhizobium japonicum and B. elkanii clades was a photosynthetic free-living bacteria that later adapted to soil conditions and acquired nodulation genes, leading subsequently to a loss of the photosynthetic character due to a light-devoided habitat [11]. The absence of any hallmark of acquisition of the PGC by lateral gene transfer (absence of insertion sequences, GC% nearly identical to the whole genome, no GC skew deviation), a phylogeny of pufL being nearly identical to the one of the core genome (not shown), and the fact that Bradyrhizobium species are closely related with the photosynthetic Rhodopseudomonas palustris species sustain this hypothesis.
The synthesis of the photosynthetic apparatus is under the control of an original regulatory mechanism that has been described in ORS278 and BTAi1 strains [24,25]. Its formation occurs only under far-red light illumination and semi-aerobic conditions and involves a control by a light receptor (the bacteriophytochrome BrBphP1) and two transcriptional factors of the PpsR family (PpsR1 which is redox sensitive acts as an activator whereas PpsR2 acts as a repressor and its activity is modulated by BrBphP1). These three regulators are encountered in the three other photosynthetic Bradyrhizobium strains, suggesting that this unusual mechanism of regulation of PS synthesis by light and oxygen tension is a typical feature of this bacterial group. This regulatory circuit appears to be perfectly adapted to promote PS synthesis during stem symbiosis [24].
Unlike ORS278 strain, the PS apparatus of BTAi1 strain contains additional peripheral light-harvesting complex (LH2) encoded by a pucBA operon. The expression of this operon is also under the control of a bacteriophytochrome (BBta_3080) found at the vicinity of the pucBA operon [26]. Interestingly, this gene organization is found in two of the other three photosynthetic strains (ORS375 and STM3809) that were studied, suggesting that the presence of LH2 complexes is rather frequent in PB strains.
Several bacteriophytochromes genes have been detected in photosynthetic Bradyrhizobium strains (2 to 4). In particular, one bacteriophytochrome (BRADO2008) is encountered in all PB as well as in the non-photosynthetic strain STM3843. Up to now, no function for this bacteriophytochrome has been identified and no genes in relation with light could be found at its vicinity. We suspect that this function is not in relation with photosynthesis as it is also found in the non-photosynthetic strain STM3843 and no alteration in the photosynthetic activity of an ORS278 mutant was detected (data not shown). Some PB strains possess specific bacteriophytochromes. The ORS278 strain contains the gene BRADO1262, located on a Horizontally Acquired Island (HAI) together with genes involved in phycocyanobilin (PCB) and gas vesicle syntheses [27]. This bacteriophytochrome binds PCB as a chromophore instead of biliverdin and displays very atypical spectral properties [27]. The ORS375 strain also possesses a specific bacteriophytochrome (BRAO375v1_4770021) hosted on a HAI, though no functional data are yet available for this bacteriophytochrome.
The relatively high number of bacteriophytochromes found in PB strains (2 to 4, while 1 in STM3843, and none in B. japonicum USDA110) indicate that these bacteria are well equipped to cope with a luminous environment, which is probably related to their ability to occupy an ecological niche exposed to sunlight (aquatic environment and phyllosphere). This niche is different from those of "classic" rhizobia, which inhabit soil and usually lack any bacteriophytochromes.
Photosynthesis can be defined as the biological process by which organisms make carbohydrates from CO 2 using energy captured from sunlight. The functional relationship between the photosystem and the Calvin Benson Bassham cycle (CBB) is also found at the genetic level. Indeed in all PB strains studied, a cbb gene cluster is found contiguous to the PGC. Interestingly, whereas the PGC has been lost (or is absent) in the non-photosynthetic strains (USDA110 and STM3843), a cbb gene cluster is found in perfect synteny in photosynthetic strains. These cbb genes are surely functional as it has been clearly demonstrated that B. japonicum is able to fix CO 2 in presence of Hydrogen [28]. The maintenance of a functional CBB cycle might be the result of a selective advantage but its role remains unknown. Recently, it has been described in the ORS278 strain that Rubisco, the key enzyme of the Calvin cycle, is required for an efficient symbiosis with A. indica [29]. It has been proposed that the primary role of the CBB cycle is not to produce biomass but to act as an electron sink driving the cell's excess electrons to CO 2 and thus oxidizing reduced factors. This role might be critical in the early steps of symbiosis when oxygen tension in the nodule becomes limiting and the nitrogenase activity is not yet established. In such conditions, the functioning of the CBB cycle might help deal with the pool of reduced cofactors produced by the Krebs cycle.

Focus On Nod-Dependent (ND) and Nod-Independent (NI) Symbiotic Strains
Since our genome set included strains exhibiting ND and NI symbiotic phenotypes, we looked at specific gene sets present only in nod-independent strains, by subtracting the B. japonicum USDA110 gene set from the core genome of all Aeschynomene symbionts (ORS278, ORS285, BTAi1, ORS375 and STM3809). We hypothesized that all NI strains possess conserved genetic bases to establish their symbiotic interaction. This analysis revealed 446 genes specific to NI strains ( Figure 6B). This relatively high number of genes is probably related to the close phylogenetic distance between the PB strains (thus sharing many metabolic pathways due to their common ecological niche) and their monophyly. This gene set was also explored in solexa-based draft genomes (STM3847, STM4509 and STM4523) using tblastn searches, and 62 genes were found absent among the 3 genomes, though their absence could be the consequence of the fact that these genomes aren't completely assembled. These absent genes encoded mainly hypothetical or conserved hypothetical proteins.
Some NI candidate genes were mutated by directed Tn5 insertions (BRADO7016-IAA biosynthesis, BRADO0151-type 3 polyketide synthase, and BRADO3025-1,4-alpha-glucan branching enzyme) but none of the obtained mutants were found to be altered in their symbiotic performance suggesting that these determinants are not critical for the Nod-independent process [34].
Localization in GI was also investigated for the NI specific gene set. We had anticipated that the NI symbiotic pathway genes could be carried by a genomic island conserved among the NI strains. No signature of a conserved large symbiotic island (as in the USDA110 SI that harbored 60% of transposases of the whole genome [17]) could be identified among the NI strains. Unfortunately, the RGP search engine looked at various signatures of DNA (GC skew, tRNA) but focused on specific regions within one genome compared to others, thus excluding a genomic island if shared by all (or almost all) genomes. We thus mapped our gene set to the HAIs detected by DNA signatures (GC skew, tRNAs, Insertion Sequences) in our previous study on ORS278 [13]. Twenty genes were found harbored by HAIs, and are labeled in bold in Table 3. Most genes belonged to two HAIs in ORS278 (BRADO5154-BRADO5204, and BRADO4807-4814), and were also located in HAI in all NI strains. These genes are involved in the biosynthesis of polysaccharides (EPS, LPS), a class of molecules known to play important roles in legume invasion during symbiosis (attachment to roots, biofilm formation, or interaction with plant defenses) [35].
To obtain further insight into the bacterial genes involved during the early steps of the Nod-independent symbiosis, a blind approach consisting in the screening of a large Tn5 mutant library (25,000 mutants) of the Bradyrhizobium ORS278 strain for their inability to induce nodules on Aeschynomene plants has been developed [12,13]. No strict nodulation-deficient mutants were found, but several mutants severely affected in nodule organogenesis were obtained. Among them, a mutant affected in a putative transcriptional factor of the TetR family (BRADO3272) was the only one to belong to the 446 NI specific gene pool. Interestingly, two upstream genes (BRADO3274 and BRADO3275) encode proteins of unknown functions and are also NI specific. Further studies are in progress in order to understand the role of this gene cluster during the NI symbiotic process. Apart from the BRADO3272 gene, no gene overlap could be found between mutants altered in symbiotic abilities and the comparative genomics approach. This result can be partially explained by the high redundancy of the 446 nod-independent specific gene set, of which 48% possess paralogs (>40% amino acid identity) in ORS278 (Supplementary Table S4). Symbiotic signals in classic rhizobia (Nod factors) and endomycorhizal fungi (Myc factors) are N-acetyl glucosamine derivatives [36][37][38]. Enzymes involved in Glucosamine modifications were detected in the 446 genes set, as BRADO6639, encoding an UDP-N-acetylglucosamine 1-carboxyvinyltransferase, and BRADO4235/BRADO7123, two homologs encoding a putative O-N-acetylglucosamine transferase related protein with TPR domains. These genes are also good candidates for future directed mutagenesis studies.   The specific gene set of nod-dependent strains (Bj USDA110 and ORS285) was also investigated for functions of ecological interest. This gene set was 10 times smaller (49 genes) than the nod-independent one (446 genes). This low number of genes reflects the high phylogenetic distance between USDA110 (BJ clade) and ORS285 (BP clade), as all genes in PB have been subtracted from the gene pool of ORS285 in this comparative analysis. This result adds support to the hypothesis of a transfer of the symbiotic island from the BJ clade to BP. Interestingly, if such transfers can be detected, they are either rare since no other genes (apart from symbiotic ones) were detected in common, or they are very widespread in PB and so were present in the subtracted gene pool. Among the 49 common genes in nod-dependent strains, two main classes of genes with a predicted function can be distinguished: as expected, the nod genes (nodD, nodA, nodB, nodC, nodS, nodU, nodI, nodJ, nodZ, noeI, nolA), and more interestingly, the rhc genes (also named tts) that encode core components of a type three-secretion system (T3SS) [39,40].
The nod gene equipment therefore appears very similar between the two strains ORS285 and USDA110. The only differences observed are the additional presence in B. japonicum of: (i) nolO which encodes a carbamoyltransferase and (ii) the two component regulatory system nodV/nodW. In counter-part, ORS285 possesses an additional nodL gene that encodes an acetyl transferase. In accordance with similar nod genes content are the facts that both strains produce the same major Nod factor (a pentameric LCO with a 2-O-methylfucose at the reducing end) and are able to nodulate common host plants; in particular, the B. japonicum strain was recently shown to induce root and stem nodules on A. afraspera [41]. However, in contrast to the ORS285 strain, B. japonicum failed to induce nodule on Aeschynomene species which are nodulated by the nod-lacking strains such as A. indica and A. sensitiva suggesting that B. japonicum lacks key determinants required for the Nod-independent symbiotic process [34].
In B. japonicum, the tts gene cluster is found on a large symbiotic island of 681 kb [17]. As these genes are absent in the nod-lacking strains, we can postulate that ORS285 also acquired a symbiotic island harboring both the nod and the tts gene clusters by a lateral gene transfer. The T3SS is used by many bacterial pathogens to deliver directly a cocktail of effector proteins in eukaryotic cells that will subvert host defense [42]. This secretory machinery has been shown in several strains of rhizobia (including B. japonicum USDA 110) to affect symbiosis positively or negatively depending on the host [43]. It could therefore be particularly interesting to examine the impact of mutations in the tts cluster of the ORS285 strain to infer its role in the Nod-dependent or Nod-independent symbiotic processes in relation to the host plant. Table 1 were grown on YMA [44], and bacterial genomic DNA was isolated following a modified CTAB protocol as recommended by the Joint Genome Institute [45].

Sequencing and Assembly of 7 Bradyrhizobium Genomes
Four genomes (ORS285, ORS375, STM3809, STM3843) were sequenced with Roche 454 GS-FLX technology (Genoscope, CEA, France) with a coverage of 11 to 28X, and with Illumina Solexa technology (GATC company, 36 bp-reads length, 20X cover). GS-FLX reads were assembled using Newbler software (v. 1.1.03.24), leading to 300 to 800 contigs. These contigs were fused to obtain a single scaffold, as a single chromosome and no plasmid could be visualized by PFGE analyses (not shown) and no repC homologs could be found in the gene content of each of these genomes (repC being a specific gene for plasmid replication in alphaproteobacteria). Solexa reads were used to correct contig consensus and sequencing errors. Three other genomes were sequenced only with Illumina Solexa technology (STM3847, STM4509 and STM4523) by the company GATC (Germany). Short 36 bp reads were pre-assembled with PERL script Vcake [46], using a first round of k-mer extensions of Solexa reads, then a second round using progressive extension of contigs obtained from the first round. This strategy allowed the construction of a sequence database of 100-140,000 contigs for each genome, subsequently used in global alignments. A last contig assembly was also built using CLC Genomics Workbench 4.0.2 (mismatch cost 2, Insertion cost 3, Deletion cost 3, length fraction 0.5, Similarity 0.8; length kept >50 bp), producing 12 to 14,000 contigs per strain, allowing gene search and phylogeny. Annotated genomes based on 454 GS-FLX assemblies were submitted to the EBI [47]. Solexa reads were submitted to the Sequence Read Archive (SAR) of the European Nucleotide Archive (ENA) under accession number ERP000868. Genomes are also available on the MicroScope annotation and comparative genomics platform [48] in the Rhizoscope Project.

Automatic and Expert Annotation of Bradyrhizobial Genomes
Annotation of bacterial genomes was conducted following 3 main steps: syntactic annotation to find genomic objects, followed by functional annotation to assign roles on the basis of sequence similarity with sequences from different databanks, and a relational annotation which establishes links between different genomic objects. All these steps were performed in the MicroScope platform pipeline and results are available via the MaGe (Magnifying Genomes) graphical interface [48,49]. For syntactic annotation, coding sequences were predicted using AMIGene (Annotation of Microbial Genomes) software [50]. Functional annotation used tools listed in [51] to give functions to the set of predicted genes. First, for strains ORS285, ORS375, STM3809 and STM3843, annotations were automatically transferred from strains ORS278, BTAi1 and E. coli K12 (in that order of priority) for genes having good similarities (i.e., 80% identity over at least 70% of the length of the smallest protein). These automatically generated results were stored in a relational database, and were further used as a starting point for manual validation with the MaGe web-interface, allowing data exploration, graphical visualizations and genomic comparisons. A functional classification was also produced for strain(-set)-specific genes using the KEGG database (via the KEGG Automatic Annotation Server [52]).

Whole Genome Alignments
Four sequenced Bradyrhizobium genomes were selected as reference. The remaining genomes were aligned on each reference genome using the MUMmer 3.0 whole genome global alignment system [53] to find conserved and organism-specific regions. In order to find potential alignment anchors, defined as maximal unique matches (MUMs) between two genome sequences, the MUMmer software uses suffix-trees. A suffix tree is a compact representation that stores all possible suffixes of an input sequence. MUMer then performed a protein alignment, starting by the translation of nucleic sequences in all 6 reading frames. Aligned regions between the reference and another genome that did not have a minimal identity of 70% and a match length of at least 20 amino-acids were filtered out. The CGview application [54] was used for graphical representation of whole genome alignments.

GC% Deviation
GC% deviation was computed using sliding 1,000 bp windows and represents the ratio between GC percent in the window and the mean GC percent of the genome. High deviations can indicate a potential foreign origin of the sequence within the window.

Core-Genome and Pangenome
Genes from two or more genomes were considered as orthologous if their encoded proteins satisfied bidirectional best-hit (BBH) criteria and alignment thresholds (40% amino acid identity on 80% of the length of the smallest protein). BBH are commonly used in comparative genomics to infer orthologous genes [55][56][57], and are integrated in the DOE IMG and MAGE platforms [49,58]. Two genes from two different genomes are bidirectional best hits when each is the best hit of the other. A bidirectional best-hit is considered as a good evidence that the genes may be orthologs arising from a common ancestor.
For Venn diagrams displayed in Figure 2 and Figure 6, comparison of gene sets between genomes were performed using the Phyloprofile Exploration tool of the MicroScope platform [48], using the following orthology criterion: BBH with 40% amino acid identity on 80% of the length of the smallest protein. Gene exploration in the 3 highly fragmented genomes (derived from solexa reads assemblies) was performed using tblastn searches with the Stand alone Blast software [59].

Phylogenomics
The core genome-based phylogeny was produced by aligning partitions for the 3,663 bradyrhizobia orthologs using ASAP software [60] (based on the Muscle alignment software [61]), and using PAUP4b10 to infer phylogenies on a combined partitions of all 3,663 genes (criterion set to distance, using Jukes-Cantor distance correction, and 100 bootstrap replicates). MLSA-based phylogeny shown in Figure 2B was obtained by Maximum Likelihood using a GTR + I + G model and 100 bootstrap replicates (using MEGA5 [62])

Search for Genomic Islands (GI)
The «RGPfinder» tool of the MicroScope annotation platform [63] is a comparative genomics method that searches for synteny breakpoints that delimit regions of at least 5 kbp in a target genome that are missing in one or more other related genomes. These RGPs (for Regions of Genomic Plasticity) can be insertion sites of GI, or the result of deletions of particular DNA segments in one or more strains. Region prediction is accompanied by the characterization of several common GI features, such as detection of tRNAs, IS (Insertion Sequence) and repeats, as well as information about nucleotide composition abnormalities [% (G + C) deviation, Codon Adaptation Index]. GI from two distinct organisms were considered identical when they shared homologous genes (at least 40% identity on 80% of the length of the smallest protein) in synteny on at least half of the GI (considering its size and number of genes). Correspondences between genomic islands in different Bradyrhizobium genomes were thus established in order to construct a binary matrix wherein each line represents a Bradyrhizobium genome and each column a GI. This matrix was then used to build a similarity tree which was compared to the phylogenetic tree of the Bradyrhizobium genomes.

Metabolic Comparisons
In each genome, the metabolic network was predicted by the "Pathway Tools" software [64] using MetaCyc [65] as a reference pathway database (version 13.0). The metabolic networks for each Bradyrhizobium sp. genome are directly available in the MaGe graphical interface at [66]. Using these data, a two-dimensional matrix was built for use in a Principal Component Analysis (with R package "FactoMineR"), which will highlight possible metabolic similarities and specificities between the genomes. For graphical representations, the variable (pathways) plot and the individual (genomes) plot were combined, restricting plotted variables to those with a quality of representation greater than 0.75, in order to conserve interpretability. The clustering method used a Euclidean distance and ward's criterion; the number of classes was chosen after manual examination of the cluster tree, and led to 7 classes for the first factorial plan (see Supplementary Table S2).

Conclusions
Bacterial comparative genomics revealed a powerful tool to analyze Aeschynomene symbionts genome characteristics and their ecological niche adaptation. Photosynthetic Bradyrhizobium have large genomes (8-10 Mb), with a rather small core genome (3,663 genes) compared to their large pangenome (18,040 genes). The dispensable genome and strain-specific genes (mainly located on GI) exhibit numerous functions of adaptation towards the special ecological niche of PB (soil/aquatic, light-environment, stem/root nodule).
Our comparative genomics approach led us to the detection of specific gene pools in bradyrhizobial strains linked to two phenotypes of interest, i.e., the photosynthetic ability and the nod-independent/ nod-dependent symbiosis. Concerning bacterial photosynthesis, genome comparison reveals a clear ancestral origin for the photosynthetic activity. The LH2 type antennae complexes appear to be more widely distributed among PB that expected. The presence of a cbb gene cluster in synteny for all sequenced genomes suggests that preserving a functional CBB cycle confers a clear advantage. Finally, several new bacteriophytochromes were identified and further studies are required to explain their functional role. For the symbiotic process, we identified several target genes (as transcriptional regulators, polysaccharide biosynthesis and N-acetylglucosamine modification enzymes or plant-wall degrading enzymes) that will be further studied at the genetic level in order to prove their contribution to the nod-independent process. Finally, transcriptomic approaches on PB during their symbiosis with Aeschynomeme are currently being developed to identify more precisely the genetic determinants involved in this atypical mutualistic interaction. and ANR-SESAM-2010-BLAN-170801). D.M. was funded by a grant from IRD (Institut de Recherche pour le Développement). We wish to thank the Génoscope LABGeM team (Laboratoire d'Analyses Bioinformatiques en Génomique et Métabolisme, CEA) for support and training on bioinformatic analyses.