Reliable Identification of Environmental Pseudomonas Isolates Using the rpoD Gene

The taxonomic affiliation of Pseudomonas isolates is currently assessed by using the 16S rRNA gene, MultiLocus Sequence Analysis (MLSA), or whole genome sequencing. Therefore, microbiologists are facing an arduous choice, either using the universal marker, knowing that these affiliations could be inaccurate, or engaging in more laborious and costly approaches. The rpoD gene, like the 16S rRNA gene, is included in most MLSA procedures and has already been suggested for the rapid identification of certain groups of Pseudomonas. However, a comprehensive overview of the rpoD-based phylogenetic relationships within the Pseudomonas genus is lacking. In this study, we present the rpoD-based phylogeny of 217 type strains of Pseudomonas and defined a cutoff value of 98% nucleotide identity to differentiate strains at the species level. To validate this approach, we sequenced the rpoD of 145 environmental isolates and complemented this analysis with whole genome sequencing. The rpoD sequence allowed us to accurately assign Pseudomonas isolates to 20 known species and represents an excellent first diagnostic tool to identify new Pseudomonas species. Finally, rpoD amplicon sequencing appears as a reliable and low-cost alternative, particularly in the case of large environmental studies with hundreds or thousands of isolates.


Introduction
Pseudomonas species are ubiquitous bacteria present in terrestrial, aquatic, and marine environments [1,2]. While several species are a threat to human health or food industry, with detrimental impacts on crops and aquaculture, the large majority of the species are commensals [3][4][5]. In fact, several species have been described as potential biocontrol agents to fight against diverse plant pathogens but also for bioaccumulation or biodegradation of pollutants [6,7]. Furthermore, Pseudomonas species are known producers of a wide diversity of bioactive secondary metabolites with potentially high added value [8]. Consequently, it is of great interest to be able to identify Pseudomonas isolates, in a fast and inexpensive way, to monitor their occurrence and diversity in the environment.
To date, the identification of bacteria in microbiology is still based on the 16S rRNA gene sequence, although its discriminative power is often limited to the delineation of groups or clades within a particular genus [9]. For Pseudomonas, the 16S rRNA gene allows the differentiation with sister genera in the Pseudomonadaceae family (Cellvibrio, Oblitimonas, Thiopseudomonas, and Ventosimonas) and the delineation of the three main lineages (P. aeruginosa, P. fluorescens, and P. pertucinogena) [10][11][12]. However, intra-genomic heterogeneities originate from multiple copies of the 16S rRNA gene in Pseudomonas genomes, and, based on previously established cutoff values (between 98.2 to 99% of similarity), it is not possible to differentiate environmental isolates at the species level [9,[12][13][14]. Therefore, MultiLocus Sequence Analysis (MLSA), based on the concatenation of four housekeeping genes (16S rRNA, gyrB, rpoB, and rpoD), has been proposed for the identification of Pseudomonas isolates [15,16]. Nevertheless, the expansion of genomics in bacterial taxonomy, and attractive prices, have led to the genome sequencing of many environmental isolates of Pseudomonas [17,18]. Recently, the Pseudomonas phylogeny based on the 16S rRNA gene and the concatenation of 4, 100, and 120 genes was compared to whole genome analyses [12]. As a result, it seems that the MLSA based on four housekeeping genes has the best price-performance ratio, although the use of only two housekeeping genes (rpoB and rpoD) can also be found in the literature [19,20]. However, these methodologies are time-consuming, demanding, and/or expensive, particularly in the case of large environmental studies with hundreds or thousands of isolates. Consequently, the use of the 16S rRNA gene to identify Pseudomonas isolates remains relevant despite being prone to misidentifications, as observed frequently in public databases [21,22].
The rpoD gene is included in most MLSA procedures and was previously proposed to identify Pseudomonas species in environmental samples [23,24] or for the rapid identification of isolates belonging to the Pseudomonas syringae complex [25]. However, since these studies included only a limited number of Pseudomonas species, a genus-wide comprehensive overview of the rpoD phylogeny is presently missing. In this study, we present the rpoD-based Pseudomonas phylogeny, including a total of 217 type strains for which genomes were available in public databases. Secondly, we adapted the methodology described by Mulet et al. (2011) to sequence the rpoD gene and identify 145 environmental Pseudomonas isolates. To test our rpoD-based taxonomic affiliations, we sequenced one-third of these isolates with Illumina and used whole genome analysis (Average Nucleotide Identity and digital DNA-DNA hybridization) for comparison with the established taxonomy. We show here that the rpoD locus streamlines the identification of Pseudomonas isolates from a labor and cost perspective and provides, in comparison with the 16 rRNA gene, an excellent tool to accurately affiliate isolates.

Materials and Methods
All type strains used in this study, together with NCBI accession numbers, are listed in Table S1. The environmental isolates of Pseudomonas, their origin, and their NCBI accession number (rpoD and/or genome) are listed in Table S2.
To avoid DNA extractions, PCR was performed on cell lysates. One colony of each strain was suspended in 50 µL of Milli-Q water in 96-well PCR microplates, and plates were immersed three times, alternating between liquid nitrogen and water bath (+70 • C). The lysates were then stored at −20 • C until their use as PCR templates. PCR amplifications of the rpoD gene were performed using previously designed primers (PsEG30F and PsEG790R; [23]) and KAPA2G Fast HotStart ReadyMix (Sigma-Aldrich, Saint-Louis, Missouri, USA). Cycling conditions were as follows: initial denaturation at 95 • C for 5 min followed by 30 cycles of annealing at 60 • C for 30 s, extension at 72 • C for 30 s and denaturation at 95 • C for 15 s, and reactions were completed at 72 • C for 2 min. PCR products were then purified using the GenElute PCR Clean-Up kit (Sigma-Aldrich Saint-Louis, Missouri, USA). Sequencing was performed using the reverse primer PsEG790R first, and, in case of failed sequencing, the forward primer PsEG30F was used. Purified PCR products were then sequenced using Sanger sequencing (Macrogen Europe, Amsterdam, The Netherlands) to obtain a final fragment of approximately 650 bp. The rpoD sequences of 145 environmental isolates were aligned to those from the 217 type strains of Pseudomonas to generate a similarity matrix based on a~650 bp fragment (Table S3). A first phylogenetic tree containing all type strains of Pseudomonas and members of the sister genera was constructed to confirm the relatedness between validly published species ( Figure 1; Table S1). A second, restricted tree shows the phylogenetic relationship of our 145 Pseudomonas isolates with the known Pseudomonas species (Figure 2; Table S2). MEGA-X was used to estimate the best evolutionary model. Both trees were constructed using the general time-reversible model (GTR+G+I), and bootstrap values were calculated based on 1000 replications [26]. iTOL was then used to annotate the trees and create high-quality figures [27].
To validate the rpoD-based taxonomic affiliation, the whole genomes of 55 environmental isolates were sequenced. Three strains, namely SWRI103, BW11P2, and BW11M1, were included in this analysis but were previously sequenced by our group [28][29][30]. Briefly, genomic DNA was extracted using the Gentra Puregene Yeast/Bact. Kit (Qiagen, Hilden, Germany). A first subset of the genomes was sequenced by BASECLEAR (Leiden, The Netherlands) using the Nextera XT library preparation kit and the Illumina MiSeq sequencer. The second batch of strains was sequenced in-house using the Nextera Flex preparation kit and the Illumina MiniSeq device. All libraries were sequenced using a paired-end approach (2 × 150 bp), and the genome coverage was routinely above 40×. The quality of the Illumina reads was assessed using FastQC v. 0.11.9 and Trimmomatic v. 0.38 for adapter clipping, quality trimming (LEADING:3 TRAILING:3 SLIDINGWINDOW:4.15), and minimum length exclusion (>50 bp) [31]. De novo genome assembly was performed with the SPAdes assembler v. 3.13.0 [32].
Bioinformatics calculations, such as the Average Nucleotide Identity (ANI) and digital DNA-DNA Hybridization (dDDH), are now commonly used to describe new species or to affiliate strains to a specific taxon [20,33]. Threshold values for species delineation are, respectively, between 95 and 96.5 for ANI and 70 for dDDH [13,[34][35][36][37]. In this study, we calculated ANIb values between a total of 275 Pseudomonas strains (217 type strains and 58 environmental strains) using pyani v0.2.10 with default parameters and the ANIb method. ANIb values in Tables 1-3 are the averages of the bidirectional comparisons. In the case of ambiguous ANIb values (between 95 and 96.5%), we supplemented the analysis with the calculation of dDDH using the online tool Genome-to-Genome Distance Calculator GGDC (https://ggdc.dsmz.de/home.php, June 2020) [34,35]. Finally, in order to evaluate reliability of rpoD-based taxonomic affiliations, we calculated the correlation rpoD/ANIb using the statistical functions included in the SciPy Python package [38].

Species Delineation
Within the Pseudomonas genus, the lowest nucleotide identity based on the rpoD locus among the 217 type strains included in this study was 51.57%. We observed that the lowest intra-group nucleotide identity ranged from 66.06 to 88.54% within, respectively, the P. pertucinogena and P. oleovorans group. Overall, all type strains were differentiated at a species level with a cutoff value of 98% identity (Table S3), which is supported by ANIb values <95% (Table S4). On the other hand, a certain number of Pseudomonas species were already described as synonymous, and we highlighted, in Table 1, ten cases where similarities based on rpoD (>98%) and ANIb values (>96.5%) were consistent with previous observations [12,16,33,37]. However, we also observed limitations where similarities based on rpoD were not concordant with ANIb values (Table 2). MLSA was described as an efficient tool to identify Pseudomonas isolates, and a threshold value of 97% was recently proposed to differentiate strains at the species level [12]. Consequently, we also performed MLSA (following previous instructions [15]) to determine if these misinterpretations could be avoided by the use of multiple loci (Table S5). Table 2. Inconsistent species affiliations based on the comparison between rpoD, 4-genes MultiLocus Sequence Analysis (MLSA), and ANIb (Tables S3-S5). Species differentiation based on rpoD < 98%, MLSA < 97%, and ANIb < 96.5 were used. It is important to note that in 12 cases out of 15, MLSA and rpoD led to the same misinterpretations, while in three cases, incorrect affiliations were made by using the single locus. Overall, among the 23,436 pairwise comparisons, including the 217 type strains, only in 0.017% of the cases two strains belonging to the same species presented an identity < 98%, and in 0.047% of the cases, two strains belonging to different species presented an identity > 98%. In order to evaluate the overall reliability of these rpoD-based affiliations, we calculated the correlation between rpoD similarities (Table S3) and ANIb values (Table S4). We included 275 strains (217 type strains and 58 environmental strains) and obtained a Pearson correlation coefficient of 0.872 (p-Value < 10e-300).

Identification of Environmental Isolates
A total of 145 environmental Pseudomonas isolates were identified using the rpoD gene (taxonomic affiliations are summarized in Table S2). Briefly, Pseudomonas strains were isolated from banana plants (n = 7), rice (n = 33), maize (n = 16), or wheat (n = 89) and from diverse geographical locations in Belgium, Iran, and Sri Lanka [39,40]. Only two isolates clustered in the P. resinovorans group within the P. aeruginosa lineage while the majority, 143 isolates, were affiliated to the P. fluorescens lineage, respectively belonging to the P. putida (n = 63), P. asplenii (n = 1), and P. fluorescens (n = 79) groups ( Figure 2). Within the P. fluorescens group, strains are distributed in five subgroups, P. jessenii (n = 3), P. gessardii (n = 4), P. corrugata (n = 22), P. koreensis (n = 23), and P. fluorescens (n = 27) ( Figure 2). As previously observed with type strains of Pseudomonas, we identified a cutoff value of 98% nucleotide identity on the rpoD gene to differentiate strains at the species level. On this basis, 56 isolates were identified and affiliated to 20 known Pseudomonas species, with the most abundant ones being P. asiatica (n = 9), P. simiae (n = 8), and P. alloputida (n = 6), while the remaining 89 isolates were suspected to represent new species (Tables S2 and S3). These taxonomic affiliations were supported by ANIb calculations between our 58 genomes and all known Pseudomonas species included in this study (Table 3 and Table S4). Finally, the 89 unaffiliated isolates were confirmed to belong to 31 new Pseudomonas species (listed Pseudomonas sp. #1 to #31 in Table 3 and Table S2).

SWR I19
P. furukawaii KF707 S W R I1 0 * SWRI133 P . s y n x a n th a A T C C 9 8  The new species are distributed as follows: P. putida group (n = 13), P. asplenii group (n = 1), P. gessardii subgroup (n = 1), P. koreensis subgroup (n = 7), P. corrugata subgroup (n = 4), and P. fluorescens subgroup (n = 5). Their description and an update of the Pseudomonas phylogeny will be published elsewhere (Girard et al., in preparation). Overall,~95% of taxonomic affiliations based on the rpoD gene that were confirmed by whole genome comparison were accurate. However, we also observed limitations (~5% of the cases) where the rpoD alone would have led to affiliation to known species (i.e., SWRI103, SWRI126, OE 28.3, Table 3), while based on ANIb/dDDH these strains represent new Pseudomonas species. These strains are also affiliated to known species when using MLSA (>97% ,  Table S5). Therefore, based on a total of 37,402 dual comparisons (217 type strains and 58 environmental strains), a threshold of 98% similarity led to 0.048% of misidentifications.
The 16S rRNA gene is still widely used in microbiology to identify bacteria [9]. However, the presence of multiple copies of this gene can lead to different ribotypes for a given Pseudomonas strain [14]. On the other hand, the rpoD is present in a single copy, offers a higher resolution than the 16S rRNA gene [12], and appears to be as efficient as MLSA to taxonomically assign environmental isolates (this study). Furthermore, the sequencing of four genes for MLSA or the 16S rRNA gene (i.e., sequencing of both ends to have the longest fragment) generates higher costs than rpoD (one end only). Consequently, the use of this locus for the identification of large batches of environmental isolates is advantageous in terms of resolution but also from a practical and financial point of view. Moreover, the widespread distribution of the rpoD gene in public databases, especially through its use for MLSA, now makes it an interesting target for metagenomics studies heading to evaluate the diversity of Pseudomonas within environmental samples [23].

Conclusions
In this study, we used a~650 bp rpoD locus to describe the phylogenetic structure of the genus Pseudomonas. We provide a genus-wide overview of the taxonomy and an integrative study making the link between single locus and whole genome analysis. The use of the rpoD allowed us to accurately assign strains to specific phylogenetic groups or subgroups and affiliate strains to known Pseudomonas species. Furthermore, it has proven to be efficient as a first diagnostic tool to identify 31 new species. The high correlation between rpoD similarities and ANIb values emphasize the high discriminative power of this gene. Importantly, we show that the rpoD is a powerful tool for microbiologists, superior to the 16S rRNA gene, for accurate identification of Pseudomonas isolates.
Supplementary Materials: The following are available online at http://www.mdpi.com/2076-2607/8/8/1166/s1. Table S1: Species type strains used in this study. Table S2: List of environmental strains used in this study. Origin, NCBI accession number for the rpoD gene and/or genome, and taxonomic affiliations are specified. Table S3: rpoD similarity matrix, including all type and environmental strains. Table S4: ANIb similarity matrix, including all type and strains for which genomes were available. Table S5: