Comparative Analysis of the Core Proteomes among the Pseudomonas Major Evolutionary Groups Reveals Species-Speciﬁc Adaptations for Pseudomonas aeruginosa and Pseudomonas chlororaphis

: The Pseudomonas genus includes many species living in diverse environments and hosts. It is important to understand which are the major evolutionary groups and what are the genomic / proteomic components they have in common or are unique. Towards this goal, we analyzed 494 complete Pseudomonas proteomes and identiﬁed 297 core-orthologues. The subsequent phylogenomic analysis revealed two well-deﬁned species ( Pseudomonas aeruginosa and Pseudomonas chlororaphis ) and four wider phylogenetic groups ( Pseudomonas ﬂuorescens , Pseudomonas stutzeri , Pseudomonas syringae , Pseudomonas putida ) with a su ﬃ cient number of proteomes. As expected, the genus-level core proteome was highly enriched for proteins involved in metabolism, translation, and transcription. In addition, between 39–70% of the core proteins in each group had a signiﬁcant presence in each of all the other groups. Group-speciﬁc core proteins were also identiﬁed, with P. aeruginosa having the highest number of these and P. ﬂuorescens having none. We identiﬁed several P. aeruginosa -speciﬁc core proteins (such as CntL , CntM , PlcB , Acp1 , MucE , SrfA , Tse1 , Tsi2 , Tse3 , and EsrC ) that are known to play an important role in its pathogenicity. Finally, a holin family bacteriocin and a mitomycin-like biosynthetic protein were found to be core-speciﬁc for P. cholororaphis and we hypothesize that these proteins may confer a competitive advantage against other root-colonizers.


Introduction
The Pseudomonas genus is very diverse and ubiquitous and, thus far, it includes 272 species (http: //www.bacterio.net/pseudomonas.html) isolated from many different environments, such as soil, water, air, and sediment, as well as from many types of hosts, such as animals, plants fungi, and algae [1,2]. In addition, many of these species or strains are human, animal, or plant pathogens of great importance; in contrast, some actually promote plant growth. Of particular prominence is Pseudomonas aeruginosa, an opportunistic human pathogen that is resistant to many antibiotics. This species is a major cause of nosocomial infections and severely affects patients who are immunocompromised or suffering from cystic fibrosis [3]. Another close evolutionary relative, Pseudomonas stutzeri, is widely distributed in the environment and has been studied as a model for studies on denitrification, the degradation of effort has focused on the widely studied human pathogen P. aeruginosa [31,32]. With all these genomes available, new computational approaches have been introduced in order to define new Pseudomonas species by using the average nucleotide identity (ANI) at a threshold of 94-96% ANI similarity or the genome-to-genome distance (GGDC) [33][34][35].
The goal of the present study was to analyze the available high-quality complete Pseudomonas genomes/proteomes via a phylogenomic approach in order to better understand the extent to which the various species of the genus; the major evolutionary groups that they form; and the properties of the core genome/proteome at the genus, group, and species levels are well defined.

Materials and Methods
A total of 8999 proteomes (accessed on October 2019) were downloaded from the Pseudomonas Genome Database. One Candidatus proteome was subsequently excluded. In addition, if two or more assemblies existed for the same strain, only the one with the highest sequence coverage was kept. Thus, we retained 8883 proteomes. Of these, 494 were annotated as "complete", whereas, for the rest, their assembly level was characterized as either "chromosome", "scaffold", or "contig". In order to ensure that our analyses were not affected by badly assembled genomes/proteomes that could result in a significant underestimation of the core Pseudomonas proteome, we focused on the 494 "complete" proteomes.
In order to identify the core proteome of a set of organisms, we developed three Python scripts that perform best reciprocal protein Basic Local Alignment Search Tool (BLAST), using the biopython blast+ tools. The user has to define a reference proteome that is used to perform best reciprocal BLAST against all the other proteomes of the set. The parameters of the best reciprocal BLAST are E-value < 1 × 10 −5 . The user may define another E-value threshold. We also implemented an additional criterion in order to stringently filter out problematic orthologues. For all reciprocal BLAST hits of the reference proteome against a selected proteome, the Python script gathers all the percent identities (of the BLAST results), estimates the mean value and standard deviation, and then filters out all best reciprocal BLAST hits that have identities two standard deviations below the average value. The user may alter this threshold of standard deviations. This approach defines a cut-off of orthology that is adjustable, depending on the genetic distance of the two genomes/proteomes undergoing reciprocal BLAST, instead of fixed cut-offs of sequence identity/similarity or defined BLAST score ratios, as applied in many previous pangenome analyses [28,[36][37][38][39]. We did not implement any additional criterion/threshold of a required minimum pairwise alignment length. Afterwards, a matrix of best reciprocal BLAST hits is generated, where the rows correspond to the proteins of the reference proteome and their orthologues in the other proteomes, and the columns correspond to the various proteomes of the selected set. Next, a second matrix is generated that includes only the proteins and the orthologues that are present in all of the analyzed proteomes. This second matrix corresponds to the actual core proteome. Afterwards, the script generates multiple alignments with the Muscle software [40] for all identified groups of core orthologues, concatenating them in a super-alignment and then filtering badly aligned regions with the gblocks software (default parameters). The filtered alignment is then used to generate a neighbor-joining phylogenomic tree with the Kimura model and 500 bootstraps [41]. All the above parameters and directories that contain the reference proteome and the other proteomes are defined in a parameters file that is used by the scripts. The scripts are available upon request.
In order to identify the core proteome of the whole genus, we used (as a reference) the model strain P. aeruginosa PAO1. The same strain was used as a reference for identifying the core proteome of the P. aeruginosa species. However, in order to identify the core proteomes of the other evolutionary groups, we used a specific reference strain from within each group. Our approach of using a reference proteome for identifying core proteins with reciprocal BLASTs is scalable and thus may be readily applied in the future. The number of reciprocal BLASTs grows linearly with the total number of genomes/proteomes analyzed. In contrast, an all vs. all reciprocal BLAST approach is subject to a combinatorial explosion and so is not scalable. Future analyses with the all vs. all approach may be expected to require vast amounts of computing power, given the rate at which bacterial genomes are now being sequenced.
Average nucleotide identity for species boundary identification [34], based on the MUMmer software (from now on abbreviated as ANIm), was calculated with the pyani software [42]. In order to define the borders of a certain Pseudomonas species or group, (i.e., P. aeruginosa, P. stutzeri, P. putida, P. flluorescens, P. chlororaphis), we observed the genus phylogenomic tree and the ANIm values of the various genomes against a selected strain of a particular species. A workflow is presented in Figure 1. In order to compare the results from our approach with those from other published computational tools, we also implemented Orthofinder [43], which follows an all vs. all pairwise alignment approach, such as BLAST, or other fast alternative methods, such as Diamond [44]. For Orthofinder, we provided the rooted species tree that we calculated on the basis of our approach and we applied all other default parameters. Orthofinder analyzed the 494 proteomes in 3 days in an Intel dual processor workstation with 36 threads. Although it is feasible to run such all vs. all analyses with 500-1000 proteomes, due to the exponential computational cost of an all vs. all approach, such an analysis would be infeasible if we repeated it for the 8999 originally downloaded proteomes (estimated run time is 1026 days).
The assignment of functional category of the identified core proteins was based on the EGGNOG database v4.5 [45] and the Kyoto Encyclopedia of Genes and Genomes (KEGG) Orthologies with the help of the KAAS tool [46], as well as on COG [47]. Heatmaps were generated with the help of the R programming language (https://www.r-project.org/).

Identification of the Major Phylogenetic Groups and Species
We analyzed all the 494 complete Pseudomonas proteomes using as reference the proteome of the first completely sequenced strain, that of P. aeruginosa PAO1 [48]. In this set of complete proteomes, the majority of them were annotated as P. aeruginosa strains (187), whereas the other most numerous proteomes were strains annotated as P. chlororaphis (43), P. syringae (28), P. putida (27), P. fluorescens (25), and P. stutzeri (18). Our first analysis identified 198 core proteins for the whole genus (see Supplementary File 1, Table S1). The multiple alignment of these 198 core proteins contained 27,997 variable sites (after G-blocks filtering) that were used to generate a phylogenomic tree in Seaview4 [41] with neighbor-joining (Kimura model) and 500 bootstraps (see Figure 2 and Figure S1). However, we observed an oddly long branch for Pseudomonas mesoacidophila ATCC 31433 7131, which is in actuality a mis-assigned Burkholderia cepacia [49]. Thus, this particular strain was further used as outgroup. In addition, at the base of this phylogenomic tree, we also observed two branches/strains that raised doubts about their correct classification as members of the genus. Accordingly, on the basis of the phylogenomic tree, we removed several basal branches/strains in a step-wise manner, created five genus core sets (see Figure 2 and Supplementary File 1; Table S1), and observed how the number of core proteins increased. Our rationale was that removal of strains that are not members of the genus would significantly increase the number of core proteins, whereas once we reached the genus borders, removal of true genus strains would not cause a significant increase in the number of core proteins. Indeed, we observed a significant increase of core proteins until core set 4, which contained 297 core proteins and excluded 3 Pseudomonas annotated strains, P. mesoacidophila (the mis-annotated Burkholderia), Pseudomonas sp. S-6-2 (PubMLST assigned it as Pseudomonas), and Pseudomonas sp. QZS01 (PubMLST did not assign it as Pseudomonas). In addition, Pseudomonas sp. QZS01 had an atypically low GC content of 38% with only 3004 proteins, whereas the GC content for all the other strains ranged between 54 and 67%. However, core set 5 further excluded four more strains-two Pseudomonas psychrotolerans, one Pseudomonas oryzihabitans, and Pseudomonas sp. LTJR-52-and contained 305 proteins, an increase of only eight proteins from core set 4. In addition, PubMLST annotated these four strains as Pseudomonas (Pseudomonas sp. LTJR-52 was annotated by PubMLST as Pseudomonas luteola). Thus, we subsequently used core set 4 as the genus core set.
According to a Gene Ontology enrichment analysis with PANTHER [50], the core proteins were housekeeping proteins, mostly related to metabolism, organic compound biosynthesis, translation, and gene expression, as expected (see supplementary file 1, Table S2), in accordance with the work of [51].
The observation of the phylogenomic tree together with ANIm values from selected representatives of key species and the PubMLST database allowed us to identify two well-defined species, P. aeruginosa and P. chlororaphis, as well as four other groups, P. fluorescens, P. putida, P. stutzeri, and P. syringae, but with much higher intragroup diversity. Our phylogenomic analysis is in accordance with a previous analysis of 107 Pseudomonas strains, using the 16S rRNA, gyrB, rpoB, and rpoD [6]. This previous analysis identified two major lineages or intrageneric groups (IGs). The first IG contained the species P. aeruginosa, P. stutzeri, and Pseudomonas oleovorans. However, the second IG was more complex and contained six groups, represented by P. fluorescens, P. syringae, Pseudomonas lutea, P. putida, Pseudomonas anguilliseptica, and P. straminea. The P. fluorescens group was further divided into nine subgroups, one of which was P. chlororaphis. They also included Pseudomonas psychrotolerance, P. oryzihabitans, and P. luteola within the genus, but not in any IG. The genus-level phylogenomic tree is important for identifying mis-assigned Pseudomonas genomes, a problem that has been observed in the past [35,52]. Our phylogenomic tree also revealed the position of Pseudomonas sp. S-6-2 and Pseudomonas sp. QZS01 at the borders of the genus. A recent study identified 13 Pseudomonas groups, composed of 2 to 66 species each [51].
As is evident from Figure 2 and Figure S1, the P. aeruginosa species (composed of 189 strains) is well grouped and monophyletic, in accordance with previous observations [6,29]. One genome was misidentified as P. fluorescens NCTC10783 9425 and should be renamed as a P. aeruginosa strain. This is also supported by the ANIm values when we used P. aeruginosa PAO1 as the reference strain, as well as by PubMLST. Interestingly, four P. aeruginosa strains (PA7 119, CR1 6461, AR441 6528, AR 0356 7083) that seem to form the most distant subgroup have ANIm values around 94%, whereas all the other members of the group have ANIm values above 98%. However, the inclusion of these four strains into the P. aeruginosa species is well supported, not only by the phylogenetic tree, but also from the PubMLST analysis. The core proteome of the P. aeruginosa species comprised 1811 proteins. A separate phylogenomic analysis of these 189 proteomes (see Figure S1), based on the 1811 P. aeruginosa core proteins and 31,145 variable sites, further confirmed the mis-annotation of P. fluorescens NCTC10783 9425. The protein count and GC content of the strains of this group range between 5500-7352 (average: 6192) and between 65.6-66.9% (average: 66.1%), respectively. Although it is a very well-defined species, phylogenomically and in terms of ANIm values, it is surprisingly diverse in terms of protein content, thus revealing a very dynamic accessory proteome, in accordance with several analyses [53][54][55]. In particular, an analysis of 1311 P. aeruginosa genomes revealed 665 core genes, a five-group population structure, and evidence of very extensive horizontal gene transfer [55], whereas another analysis of 22 strains showed that up to 20% of the genome consists of transferable genetic loci [54]. Another analysis of 17 reference strains revealed a core set of 5233 orthologs [56]. In addition, it appears that, on average, industrial strains have the largest genomes, followed by environmental strains, and then clinical isolates [57]. Interestingly, a recent study based on experimentally determined essential genes (by using transposon insertion sequencing) for several P. aeruginosa clinical isolates across different conditions revealed 321 core essential genes [58], with 24% (76/321) and 58% (185/321) of them being members of the genus and P. aeruginosa core sets of our analysis (see supplementary file 1, Table S10). In total, five studies of this type have identified 864 essential genes [58][59][60][61][62], with 15% (128/864) and 42% (359/864) of them being members of the genus and P. aeruginosa core sets of our analysis.
Another well-defined Pseudomonas species, based on the genus phylogenomic tree and the ANIm values was P. chlororaphis, comprised of 43 genomes. Pseudomonas sp. MRSN12121 3055 was found to be within the P. chlororaphis species, whereas P. chlororaphis UFB2 isolate Soil 2921 was not within the P. chlororaphis group that we defined but was within our wider P. fluorescens group. The core proteome of the P. chlororaphis species, using P. chlororaphis subsp. aurantiaca 449 7539 strain as the reference point, comprised 3587 proteins. A separate phylogenomic analysis (see Figure S1) of these 43 proteomes, on the basis of their 3587 core proteins and 103483 variable sites, further confirmed the inclusion of Pseudomonas sp. MRSN12121 3055 within the P. chlororaphis species. The protein count and GC content of the strains of this group range between 5599-6401 (average: 6076) and between 61.9-64% (average: 62.8%), respectively.
The P. fluorescens group that we defined, on the basis of the genus phylogenomic tree, comprised 96 genomes and displayed high levels of phylogenetic heterogeneity, in accordance with [29]. It comprised many species, such as Pseudomonas corrugata, Pseudomonas brassicacearum, Pseudomonas frederiksbergensis, Pseudomonas mandelii, Pseudomonas kribbensis, Pseudomonas koreensis, Pseudomonas mucidolens, Pseudomonas veronii, Pseudomonas antarctica, Pseudomonas azotoformans, Pseudomonas trivialis, Pseudomonas lurida, Pseudomonas azotoformans, Pseudomonas poae, Pseudomonas libanensis, Pseudomonas synxantha, and Pseudomonas orientalis. Interestingly, the genomes that were named as P. fluorescens did not form a monophyletic clade within this wider group, but were dispersed. In addition, two genomes within the fluorescens group have been mistakenly named as P. stutzeri (KGS-2 9480, KGS-8 9481), one genome was mistakenly named as P. putida (JBC17 6779), one was mistakenly named as P. syringae (isolate inb918 11494), and another as P. chlororaphis (UFB2 isolate Soil 2921). In essence, the name P. fluorescens encompasses many different and diverse species lumped together. The core proteome of the P. fluorescens group, using P. fluorescens Pf0-1 124 strain as the reference point, comprised 1396 proteins. A separate phylogenomic analysis (see Figure S1) of these 96 proteomes, on the basis of their 1396 core proteins and 115,099 variable sites further confirmed the observations above. The protein count and GC content of the strains of this group range between 4152-6678 (average: 5603) and between 58.7-62% (average: 60.3%), respectively. Other analyses, based on 3 and 8 P. fluorescens genomes, revealed 3642 [63] and 3039 [64] core genes, respectively. In addition, a comparative genomic analysis of 71 P. fluorescens genomes identified eight major subgroups and developed a set of nine genes as markers for classification within this lineage [65].
The P. syringae group that we defined, on the basis of the genus phylogenomic tree, comprised 34 genomes. It also included species such as Pseudomonas avellanae, Pseudomonas savastanoi, Pseudomonas amygdali, and Pseudomonas cerasi. The genomes that were named as P. syringae did not form a monophyletic clade within this wider group, but were dispersed. In addition, the P. syringae group formed three subgroups, according to our phylogenomic analysis and the ANIm values. The core proteome of the P. syringae group, using P. syringae pv tomato DC3000 111 strain as the reference point, comprised 2944 proteins. A separate phylogenomic analysis of these 34 proteomes (see Figure S1), based on their 2944 core proteins and 110,643 variable sites, further confirmed the above observations. The protein count and GC content of the strains of this group range between 4973-6026 (average: 5465) and between 58-59.3% (average: 58.6%), respectively.
The P. putida group that we defined, on the basis of the genus phylogenomic tree, comprised 63 genomes. It also included species such as Pseudomonas alkylphenolia, Pseudomonas monteilii, Pseudomonas cremoricolorata, Pseudomonas fulva, Pseudomonas parafulva, Pseudomonas entomophila, Pseudomonas mosselii, and Pseudomonas plecoglossicida. The genomes that were named as P. putida did not form a monophyletic clade within this wider group, but were dispersed. The core proteome of the P. putida group, using P. putida PP112420 5661 strain as the reference point, comprised 1724 proteins. A separate phylogenomic analysis of these 63 proteomes (see Figure S1), based on their 1724 core proteins and 155,470 variable sites, further confirmed the above observations. The protein count and GC content of the strains of this group range between 3748-6780 (average: 5197) and between 58.7-64.4% (average: 62.3%), respectively. Another analysis that was based on nine strains identified 3386 core genes [66].
The P. stutzeri group that we defined, on the basis of the genus phylogenomic tree, comprised 19 genomes. It also included Pseudomonas balearica, Pseudomonas sp. 20 BN 3032, and Pseudomonas sp. R2A2 5971. Although the P. stutzeri group is well defined, the ANIm values and the branch lengths of the genus phylogenomic tree support the idea that this is not a species, but rather a wider taxonomic group. The core proteome of the P. stutzeri group, using P. stutzeri 28a24 3065 strain as the reference point, comprised 2080 proteins. A separate phylogenomic analysis of these 19 proteomes (see Figure S1), based on their 2080 core proteins and 193,427 variable sites, further confirmed the above observations. The protein count and GC content of the strains of this group range between 3342-4524 (average: 4086) and between 60.3-64.7% (average: 63.2%), respectively.

Properties of the Core Proteomes
A critical issue that affects the estimation of a core genome/proteome is the number of strains used to analyze it. As more proteomes are included in the analysis, less proteins remain present in all the analyzed strains. By observing how the curve of core proteins decreases as more strains are analyzed, one can estimate the true number of core proteins for that set of organisms. We generated a graph by random sampling (20 times each) 16, 30, 50, 100, 150, up to 491 Pseudomonas strains and estimated the average number of core proteins for that sampling depth of the genus. Curve fitting by applying exponential decay suggests that our sampling depth provides a set of proteins that is very close to the core genus proteome (see Figure 3). In accordance with these observations, our genus core set (based on 491 proteomes-core set 4) composed of 297 proteins. Previous analyses based on 14, 166, and 1073 diverse Pseudomonas genomes had identified 1705 core genes [63], 794 core genes [51], and 1224 protein-coding gene families [29], respectively. Another type of analysis that was based on 58 complete and 432 draft Pseudomonas genomes identified a core set consisting of 2687 and 726 protein domains [67]. We repeated this analysis for the P. aeruginosa, P. chlororaphis, P. fluorescens, P. syringae, and P. stutzeri groups. The core proteome of P. aeruginosa, with 189 analyzed proteomes, seems to approach saturation, whereas this is not the case for the other groups. Most probably, the saturation level of a certain core proteome depends on the sampling depth (i.e., the number of sequenced genomes analyzed), the quality of the data (i.e., quality of the genome assembly), and the level of diversity within that particular group.
For each of these seven core proteome sets, that is, of the whole genus and the six evolutionary groups, we identified the frequency of functional categories of their core proteins using the EggNog 4.5 server. The results are summarized in Table 1 and in Supplementary File 1, Table S9. Accordingly, Gene Ontology analysis of the genus core proteome revealed a high enrichment for metabolism (e.g., nitrogen compound metabolic process) translation and gene expression, as expected (see Supplementary File 1, Table S2). It is intriguing that many proteins of the various core sets are of unknown function, thus highlighting the need for more focused in-depth molecular biology/biochemistry analyses by wet-lab biologists. Table 1. Summary table of the core proteomes and their functional categories for the Pseudomonas genus and the six evolutionary groups. Note that, for the phylogenomic analysis of the entire genus (core4), we used the alignment produced by core1, that also includes 3 outgroups.

Genus Core4
Pseudomonas aeruginosa We investigated the overlap of orthologues among the various core proteomes of the genus and the six evolutionary groups by mapping them to KEGG orthologs with KAAS. We also identified their wider functional category with EggNOG 4.5. The vast majority of the genus core proteins were mapped to KEGG orthologues, whereas for each of the other six core proteomes, between 50-70% of their proteins were mapped. Thus, more than 2000 KEGG orthologous groups were identified that had an orthologue in one or more core proteome sets. Of note, this analysis should be interpreted with caution, because the core proteomes of P. chlororaphis, P. fluorescens, P. putida, P. stutzeri, and P. syringae are based on a small number of complete proteomes and show no evidence of saturation, in contrast to P. aeruginosa. In addition, every phylogenetic group has a different level of intragroup diversity. Overall, the vast majority (91-99.5%) of the core proteins of each group were also identified in one or more of the other core sets.
We further investigated to what extent the core proteins of a certain group had a very significant presence in all the other evolutionary groups (see Table 1). Our criterion was that the core protein of that set should be present at least in 90% of the strains in each of the other five evolutionary groups. We observed such a high presence for 44% (803/1811) of the core proteins in P. aeruginosa, 39% (1387/3587) of the core proteins in P. chlororaphis, 70% (1458/2080) of the core proteins in P. stutzeri, 70% (1215/1724) of the core proteins in P. putida, 63% (875/1396) of the core proteins in P. fluorescens, and 50% (1475/2944) of the core proteins in P. syringae. By obtaining the functional annotation of these widely distributed core proteins via eggNOG, we observed that they were enriched (more than twofold compared to the background frequency of the P. aeruginosa PAO1 genome) for functions related to translation and the ribosome, as well as co-enzyme and nucleotide transport and metabolism.
Next, we identified 60 group-specific core proteins that may play a critical role in the adaptation of that specific group. Our criterion was that these proteins were found in all strains of that specific group and were absent in all the other Pseudomonas strains (based on our strictly defined orthology criteria of best reciprocal BLAST hit). These group-specific core proteins were usually hypothetical or of unknown function. More specifically, we observed 41 out of the 1811 core orthologues of the P. aeruginosa species that were specific to that group (see Supplementary File 1, Table S3), with 26 of them being annotated as hypothetical. Accordingly, the same analysis revealed 11 out of the 3587 core orthologs of the P. chlororaphis species that were specific for that group (see Supplementary File 1, Table S4). For the P. stutzeri group, we observed that 7 out of its 2080 core proteins were group-specific (see Supplementary File 1, Table S5). For P. putida, only 1 out of the 1724 core proteins was group-specific (see Supplementary File 1, Table S6). For the P. fluorescens and P. syringae groups, none of their 1396 and 2944 core orthologs (respectively) were group-specific (see Supplementary File 1, Tables S7 and S8).
In order to assess the reliability of our methodology, we also identified group-specific core proteins with Orthofinder (see Supplementary File 1, Table S11). Orthofinder correctly identified 41 of our 60 group-specific core proteins. It did not identify 14 of our group-specific core proteins because the Orthofinder orthogroups were broader and also included within them other widely distributed paralogues (of the reference proteome) and their respective orthologues. Orthofinder also did not identify five of our group-specific core proteins, because within those orthogroups, Orthofinder further included distant homologues from out-of-the-group species that did not pass our orthology criteria. Finally, Orthofinder identified three group-specific orthogroups that we did not detect. In these cases, our orthology groups further included out-of-the-group orthologues. However, Orthofinder would assign these out-of-the-group orthologues in other orthogroups. Thus, we conclude that our approach is accurate and scalable for future analyses with thousands of proteomes.
By further investigating Supplementary File 1, Tables S3-S8, one can identify more group-specific core ortholog candidates after applying two more relaxed criteria, where the group-specific core protein is allowed to be present in other groups as well, but with a maximum presence of 10% or 20% (in each group). By applying these two relaxed criteria of 10% and 20%, the group-specific core proteins are, respectively, 84 and 116 in P. aeruginosa, 32 and 61 in P. chlororaphis, 32 and 51 in P. stutzeri, 4 and 4 in P. putida, 0 and 0 in P. fluorescens, and 61 and 87 in P. syringae. The lack of group-specific core proteins for the P. fluorescens group could be attributed to the fact that this is a very diverse group. In addition, one may further hypothesize that this great diversity combined with the lack of group-specific core proteins could be attributed to the fact that this may be the oldest evolutionary group that gave rise to the other more specialized evolutionary groups; however, its position in the genus phylogenomic tree does not support this hypothesis. The relatively high number of group-specific core proteins in the P. aeruginosa group could be attributed to the fact that this is a rather specialized species that frequently uses humans as hosts and has very low genetic diversity (albeit having a very dynamic accessory proteome), whereas the other groups are not specialized as mainly human pathogens. An alternative explanation is that there is a sampling bias in favor of human-pathogenic P. aeruginosa strains.

Many P. aeruginosa-Specific Core Proteins Contribute to Pathogenicity
Our comparative analysis identified several P. aeruginosa-specific core proteins (present in all P. aeruginosa strains, but not present at all in any other Pseudomonas strains) with an important role in its pathogenicity. Our findings, explained further below, are summarized in Table 2. Notably, two of these (cntL and cntM) are members of a four-gene operon (cntOLMI), which is regulated by zinc levels. This operon is responsible for the production of pseudopaline, an opine metallophore used to scavenge metals such as nickel and zinc from the host that is believed to be implicated in pathogenesis [68,69]. Homologous operons that produce such opine metallophores are associated with virulence and have also been identified in other human pathogens, such as Staphylococcus aureus, Serratia marcescens (WW4), and Yersinia pestis (CO92). It has been shown that opine metallophores of this kind are probably implicated in nickel, cobalt, and iron acquisition in metal-limited environments [69,70]. Interestingly, cobalt is essential for the formation of biofilms, under oxygen-limited conditions, by P. aeruginosa [71]. The other two members of this operon encode membrane proteins implicated in pseudopaline export (cntI:PA4834) and the import of pseudopaline-metal complexes (cntO:PA4837) [68]. These two coding sequences were absent in only two and four P. aeruginosa strains, respectively. Apart from pseudopaline, pyoverdine, and pyochelin, there are the other two major metallophores (siderophores) identified in P. aeruginosa. Pyoverdine consists of three distinct structural parts: a conserved chromophore, a variable peptide arm, and a small dicarboxylic acid connected via an amide group to the chromophore. Many genes are required for pyoverdine biosynthesis, but four genes (pvdL, pvdI, pvdJ, pvdD) encoding non-ribosomal peptide synthetases (NRPSs) are the major biosynthetic genes [72,73]. NRPSs are large multimodular enzymes that enable microorganisms to produce a large repertoire of secondary metabolites including antibiotics, toxins, immunomodulators, and siderophores [74,75]. Pyoverdine biosynthesis starts with PvdL, which incorporates L-Glu, D-Tyr, and L-Dab (2,4-diaminobutyrate) into a precursor peptide. Through condensation and further modifications, this peptide forms the chromophore [76]. The pvdL gene was identified in 90% (171/189) of P. aeruginosa strains and in 84% (255/305) of all the other Pseudomonas strains in this study. Pyochelin biosynthesis requires salicylic acid and two cysteine residues. Three genes (pchD, pchE, pchF) are involved in pyochelin synthesis, the last two (pchE, pchF) encoding NRPSs [72]. The pyochelin synthetase PchF (PA4225) was present in 93% (176/189) of P. aeruginosa strains and in 19% (58/305) of all the other Pseudomonas strains. Both pyoverdine and pyochelin contribute to the pathogenicity exerted by P. aeruginosa in lungs of patients with cystic fibrosis [72]. Our analyses show that, overall, pseudopaline is P. aeruginosa-specific, phyochelin is mostly limited to P. aeruginosa, whereas pyoverdine has a very wide distribution across the genus.  PlcB (PA0026) and Acp1 (PA1869) are another two P. aeruginosa-specific core-proteins, with an important role in motility and pathogenesis. More specifically, PlcB is a phospholipase C implicated in directional twitching motility during chemotaxis [77]. PlcB expression is under the control of quorum sensing, a strategy employed by many pathogens during infection [78]. Regarding the potential role of PlcB in pathogenesis, it may be noted that the levels of phosphatidylethanolamine (the substrate of the PclB protein) are increased in the bronchoalveolar lavage fluid of young adults with cystic fibrosis (CF) in comparison with age-matched controls without CF [79]. Consequently, it is plausible that these increased levels of phospholipids could serve as a chemoattractant or a factor enhancing chemokinesis for P. aeruginosa bacteria that have initially colonized the upper airways in CF patients [77]. Acp1 is the only one of three acyl carrier proteins that functions in fatty acid synthesis. Replacement of Acp1 caused reduced production of the quorum-sensing signals (N-acyl homoserine lactones) and abolished swarming motility, thus demonstrating its indirect role in the pathogenicity exerted by P. aeruginosa [80].
Our analyses further identified two interesting P. aeruginosa-specific core-proteins, MucE (PA4033) and SrfA (PA2559), that are involved in mucilage production (mucoidy). P. aeruginosa very often exhibits mucoidy, a phenotype that is due to the overproduction of an exopolysaccharide called alginate. The emergence of mucoid P. aeruginosa isolates in sputum specimens from CF patients signifies the onset of chronic respiratory infections. Mucoidy plays an important role in the pathogenesis of P. aeruginosa infections in CF patients, which might include increased resistance to antibiotics, increased resistance to phagocytic killing, and assistance in evading the host's immune response. MucE is a small envelope lipoprotein that positively regulates mucoid conversion, thus directly contributing to pathogenicity [81,82].
A previous comparative analysis using the genomes of Pseudomonas species has demonstrated that MucE orthologues are found only in P. aeruginosa strains [81] this was confirmed in this study. SrfA appears to act as a stress response facilitator and positively regulates the expression of the algD operon. Furthermore, the sequence of SrfA was found to be extremely well conserved only in P. aeruginosa, suggesting that it is of high evolutionary importance in this species [83], again in agreement with our analysis.
Another very intriguing finding is the detection of three P. aeruginosa-specific toxin-antitoxin operons (Tse1-Tsi1, Tse2-Tsi2, Tse3-Tsi3) whose toxins are inserted into the periplasm of antagonizing bacteria by the H1 type VI (H1-T6SS) secretion system [84,85]. The Tse1 toxin (a peptidoglycan degrading amidase) is found in all P. aeruginosa strains, and in none of the other Pseudomonas strains, whereas its antitoxin Tsi1 is found in all P. aeruginosa strains, and in only one (P. alcaligenes NEB 585 4015) of the other Pseudomonas strains. The Tse2 toxin (an NAD-dependent cytotoxin) is found in 187/189 P. aeruginosa strains, and in none of the other Pseusodomonas strains, whereas its antitoxin Tsi2 is found in all P. aeruginosa strains, and in none of the other Pseusodomonas strains. The Tse3 toxin (peptidoglycan-degrading muramidase) is found in all P. aeruginosa strains, and in none of the other Pseudomonas strains, whereas its antitoxin Tsi3 is found in 186/189 of P. aeruginosa strains, and in none of the other Pseudomonas strains. It is reasonable to hypothesize that these three operons may allow P. aeruginosa to antagonize and prevail over other bacteria for the same ecological niche. Therefore, it would be interesting to investigate whether aerosols containing any of the three antitoxins (Tsi1-3) could block P. aeruginosa infections in CF patients by protecting other antagonizing bacteria.
EsrC encodes a negative regulator (moderator) of the mexC/D-oprJ envelope stress-inducible multidrug efflux operon that contains five cistrons. This operon is suggested to play a role in the remodeling of P. aeruginosa membranes in response to envelope stress [86]. EsrC is present in all P. aeruginosa strains, whereas the whole operon is found intact in 183/189 strains. However, this regulator (EsrC) that moderates (but does not totally inhibit) the expression of this efflux pump system is not detected in any other strains of the genus, although mexC and mexD are widely present (in 87% and 79%, respectively). Therefore, we suggest that this particular response (moderate expression of the efflux system) that affects membrane remodeling is a fine-tuning adaptation, specific for P. aeruginosa, which requires EsrC.
We also identified 19 other orthologous protein groups that are present in at least 188/189 P. aeruginosa strains and absent in all the other strains of the genus. Sixteen of them were annotated as hypothetical; however, of particular interest is fucose-binding lectin PA-IIL (lecB: PA3361). This lectin is important for biofilm formation and is believed to facilitate bacterial adhesion to the airway mucosa [87][88][89].
3.4. Pseudomonas chlororaphis-Specific Core Proteins with an Important Role in Niche Adaptation P. chlororaphis can be employed as a biocontrol agent (used as inoculant) in agriculture against certain fungal plant pathogens via production of phenazine-type antibiotics [90]. In this study, a phage-like protein (C4K22 RS23595), belonging to the holin family, has been detected as a P. chlororaphis-specific core protein. Members of the holin family are small transmembrane proteins that accumulate in the bacterial cell membrane and permeabilize it, thus allowing the endolysins produced by certain phages to reach and degrade the peptidoglycan layer [91]. Recently, it was demonstrated that the holin gene is present in a P. chlororaphis strain that produces distinct high molecular weight bacteriocins. These bacteriocins resemble bacteriophage tails known as R-type tailocins [92] and help P. chlororaphis to outcompete closely related bacteria and to persist in rhizosphere communities [92,93].
Furthermore, our analysis identified another interesting protein-coding gene (C4K22 RS23415). It was annotated as a mitomycin-biosynthesis protein and had orthologues in all but one P. chlororaphis strains and did not have any orthologues in the other Pseudomonas proteomes. Mitomycin is an anticancer drug produced by Streptomyces spp. Recently, it has been demonstrated that mitomycin in combination with a tobramycin-ciprofloxacin hybrid is very effective (at least in vitro) against multidrug-resistant Gram-negative bacteria [94]. Although mitomycin production by P. chlororaphis has not yet been demonstrated, it is plausible that the compound produced by C4K22 RS23415 is mitomycin, or something similar. Thus, we speculate that it may confer a distinct advantage over other Gram-negative root colonizers and may represent an adaptation by P. chlororaphis to its ecological niche.

Conclusions and Future Perspectives
The Pseudomonas genus is very diverse and includes organisms from many different habitats that may function as human, animal, or plant pathogens, but they may also promote plant growth and nitrogen fixation [1,2,95]. This analysis of more than 490 complete Pseudomonas proteomes revealed six major evolutionary groups that each have sufficient numbers of genomes/proteomes for core proteome analyses [17]. We identified the core protein sets of the entire genus and each of the six evolutionary groups, as well as the properties of each core set. However, we also observed many species with a very low number of sequenced strains. In addition, due to the high number of proteomes from P. chlororaphis and especially from P. aeruginosa, our computational approaches identified several proteins that may play a significant role in species-specific adaptations. Although the number of P. aeruginosa proteomes was sufficient to provide a saturated core set, this was not the case for the other five evolutionary groups. In addition, although P. chlororaphis is a well-defined species, the other four groups of P. fluorescens, P. putida, P. stutzeri, and P. syringae harbor much more genetic diversity, and each one of these evolutionary groups is not actually one species, rather, each group includes more than one species. In the near future, sufficient numbers of high-quality genomes from strains of the same species should become available for many species of the genus. Thus, this type of computational analysis clearly demonstrates that the secrets of species-specific adaptation events for many pseudomonads will be unraveled in the near future, as many of their strains will be sequenced.

Supplementary Materials:
The following are available online at http://www.mdpi.com/1424-2818/12/8/289/s1, Figure S1: Phylogenomic trees of the various Pseudomonas groups and the genus, Table S1: The genus core proteins. Table S2: Gene ontology of the genus core4. Table S3: The P. aeruginosa core proteins. Table S4: The P. chlororaphis core proteins. Table S5: The P. stutzeri core proteins. Table S6: The P. putida core proteins. Table S7: The P. fluorescens core proteins. Table S8: The P. syringae core proteins. Table S9: Information about the various groups. Table S10: Core proteins that are also essential. Table S11