Comparative Genomic Analysis Confirms Five Genetic Populations of the Select Agent, Rathayibacter toxicus

Rathayibacter toxicus is a Gram-positive, nematode-vectored bacterium that infects several grass species in the family Poaceae. Unique in its genus, R. toxicus has the smallest genome, possesses a complete CRISPR-Cas system, a vancomycin-resistance cassette, produces tunicamycin, a corynetoxin responsible for livestock deaths in Australia, and is designated a Select Agent in the United States. In-depth, genome-wide analyses performed in this study support the previously designated five genetic populations, with a core genome comprising approximately 80% of the genome for all populations. Results varied as a function of the type of analysis and when using different bioinformatics tools for the same analysis; e.g., some programs failed to identify specific genomic regions that were actually present. The software variance highlights the need to verify bioinformatics results by additional methods; e.g., PCR, mapping genes to genomes, use of multiple algorithms). These analyses suggest the following relationships among populations: RT-IV ↔ RT-I ↔ RT-II ↔ RT-III ↔ RT-V, with RT-IV and RT-V being the most unrelated. This is the most comprehensive analysis of R. toxicus that included populations RT-I and RT-V. Future studies require underrepresented populations and more recent isolates from varied hosts and geographic locations.


Individual Gene, Gene Cluster, Multi-Locus and Whole-Genome Sequence Analyses
MAUVE version 2.4.0 [108] was used to perform full genome sequence alignments with the progressive Mauve algorithm. The MAUVE guide tree output file was input into Geneious version 7.1.9 [44] to produce a phylogenetic tree.
PhyloSift [115] analysis was performed on all complete R. toxicus genome sequences. Sequences were queried for similarities to genes contained in a reference database using LAST [116]. The hmmalign program from the HMMER 3.0 software package [64] was used to concatenate and align marker gene sequences. The Phylogenetic Placer (pplacer) program [117] was used to infer phylogenies and produce a phylogenetic tree under default conditions (Maximum-likelihood). The output tree file was input into FigTree v1.4.3 (tree.bio.ed.ac.uk/software/figtree/) to produce the phylogenetic tree.
IslandViewer4, which incorporates IslandPick, IslandPath-DIMOB and SIGI-HMM to find potential HGT events, was used to analyze five strains acting as representatives for each of the five population of R. toxicus. IslandPick identified only one region in both RT-I and RT-IV (Table S13; Figure S3). IslandPath-DIMOB identified 15, 14, 15, 15 and 14 regions, while SIGI-HMM identified 6, 4, 4, 5 and 5 regions, in RT-I, RT-II, RT-III, RT-IV and RT-V, respectively; most regions found by either IslandPath-DIMOB or SIGI-HMM were the same, and most of the SIGI-HMM regions overlapped with those found by IslandPath-DIMOB (Table S13; Figure S3).
In all, HGT regions were identified throughout the length of the genome ( Figure S2). Interestingly, even though both IslandViewer and MicroScope use SIGI-HMM to identify HGT regions, there were a few exceptions that did not agree with each other; however, the majority of the SIGI-HMM regions overlapped between programs. Additionally, there was no instance in which all means for HGT identification agreed; there was no region that was identified by IslandPick, SIGI-HMM, IslandPath-DIMOB and IVOM. There was a variable overlap between the programs used, with some regions being identified by only a single program ( Figure S2). It is of note that the CRISPR-Cas system and tunicamycin BGC are known to have been acquired through an HGT event. While the tunicamycin cluster (ASC-12) was identified by SIGI-HMM from both IslandViewer and MicroScope, as well as IVOM, neither IslandPick nor IslandPath-DIMOB identified this region. This was also observed for the majority of the CRISPR region; however, no regions were assigned to the Cas portion. Regions of suspected HGT were assigned to almost all ASCs, with slightly varying results depending on the length of a specific ASC in a specific strain; no HGT region was identified in ASC-5, ASC-7, ASC-8 and ASC-11.
IslandPick uses a comparative genomics approach to determine HGT events [102]. SIGI-HMM looks for codon bias using hidden Markov models (HMM) [105,106]. IslandPath-DIMOB looks for regions of at least eight genes, one of which has to be a mobility gene (transposase, integrase, etc.), that also has biased dinucleotide composition [103,104]. AlienHunter IVOM looks for compositional biases based on variable order motif distributions [107]. Lack of consensus among these programs reflects variations among the methods used and raises questions as to the accuracy of HGT event calling, especially since no single region was identified by all methods. Perhaps any significant HGT events happened so long ago that they have lost the majority of the defining factors, making identification difficult.
Prophage and phage remnants. PhageWeb, PHAST (PHAge Search Tool), PHASTER (PHAge Search Tool Enhanced Release), Prophinder, and VirSorter was used to identify potential prophage and phage remnants within the genome. PhageWeb, which compares sequence similarity to a phage database and considers changes in G+C and presence of flanking tRNA, did not identify any regions of potential phage origin. PHAST was able to identify a~7 kb region. This region was considered incomplete by PHAST and had a score of 20. PHAST determined that this region contained seven proteins, six of which had matches to a phage protein database. These phage genes were considered to be similar to genes found in six phage species. These genes include a conserved protein of unknown function, a heme oxygenase, a PIF1-like helicase, an ABC-2 type transport permease, a copper transporter/copper resistance protein D, and a DNA-binding protein HU-1/HU-beta. Initially, analysis was performed on strains SA03-04, WAC7056, WA40-23C, CS36 and CS39, which were representative strains of RT-I, RT-II, RT-III, RT-IV and RT-V, respectively. The same~7 kb region was identified in all strains, except WAC7056; it is yet unknown why PHAST did not identify this region in the representative RT-II strain. As a result, this region was mapped to the genome of WAC7056 using the Bowtie2 in Geneious and then extracted. Mapping revealed that the PHAST phage region flanks ASC-20 on the right ( Figure S2). This process was repeated in all 18 strains of R. toxicus used in this study. Extracted sequences were aligned (7019 nt) using MAFFT and NJ trees were constructed using the Jukes-Cantor model, with confidence intervals being assessed using the bootstrap method with 1000 replications ( Figure S4A). Relative to CS36 (RT-IV), RT-I contained seven unique SNPs and possessed a 12-nt deletion, RT-II and RT-III had three identical SNPs that were unique to these populations, and RT-V had one unique SNP. Phylogeny of this region would seem to indicate that RT-II and RT-III are one population that is more closely related to both RT-IV and RT-V, with RT-I being the most distant; however, analysis was based only on a few SNPs and the deletion of 12 nt in all RT-I strains relative to the other populations would account for it being the most distantly related. It is interesting to note that PHASTER, which is supposed to have a very similar phage identification pipeline as PHAST with exception of some software and hardware enhancements, did not identify any putative prophage regions within any of the genomes analyzed. Why this "enhanced" version of PHAST, which found ã 7 kb region, did not identify any regions remains unknown.
Prophinder, which compares query sequences to the ACLAME (a CLAssification of Mobile genetic Elements) database [133,134], was able to identify a different~10 kb region that mapped to ASC-8 ( Figure S1). This region contained an ABC-2 type transport system permease / O-antigen export system permease, a conserved protein of unknown function, a glycosyl-transferase involved in cell-wall biosynthesis, a group 1 glycosyl-transferase, a GDP-mannose 4,6-dehydrogenase, a GDP-mannose 4,6-dehydrogenase NAD(P)-binding subunit, a conserved protein of unknown function, an EPS production protein, and an acyl-CoA dehydrogenase / glutaryl-CoA dehydrogenase. Prophinder only provided results for CS36 (RT-IV) and WAC7056 (RT-II); it is yet unknown why it did not identify this region in other representative strains. As a result, this region was mapped to the genomes of all 18 strains of R. toxicus used in this study. Extracted sequences were aligned (10,128 nt) using MAFFT and NJ trees were constructed using the Jukes-Cantor model, with confidence intervals being assessed using the bootstrap method with 1000 replications ( Figure S4B). Relative to RT-III, RT-I strains possessed 5 unique SNPs and one SNP in common with CS36 at position 5603. RT-II strains possessed unique SNPs. RT-IV possessed 15 unique SNPs and 7 SNPs in common with RT-V. RT-V possessed four unique SNPs. Phylogeny based on this region would suggest that RT-IV and RT-V are most closely related, followed by RT-I, then RT-III and RT-II; however, while this Prophinder region was larger than the PHAST region, it contained only slightly more SNPs.
VirSorter compares sequences to a database using HMM and BLAST and then detects viral regions by looking for the presence of viral "hallmark" genes, enrichment in viral, uncharacterized and/or short genes, depletion of Pfam affiliated genes, and depletion in strand switch. VirSorter was able to identify regions of varying size (~27-40 kb) in each of the representative R. toxicus strains. These regions were categorized as "category 3," which means they possess low confidence. These regions were also identified as ambiguous, as they only present secondary viral metrics, but do not possess viral enrichment nor viral "hallmark" genes. Nevertheless, these regions were analyzed further. The regions varied in size but were all located in the same area and had overlapping sections; therefore, all regions were extracted and aligned in Geneious and used to produce consensus sequence. The consensus sequence was mapped to all 18 R. toxicus strains used in this study. Sequences were aligned (47,107 nt) using MAFFT and NJ trees were constructed using the Jukes-Cantor model, with confidence intervals being assessed using the bootstrap method with 1000 replications ( Figure S4C). Phylogeny based on this region suggests that RT-IV and RT-V are most closely related, followed by RT-III, RT-II, and then RT-I.
In addition to all previous analyses, identified regions that were extracted from the representative genomes were also analyzed using NCBI BLAST, ViroBLAST, and the Actinobacteriophage Database. All three programs were unable to yield any usable results, as any hits to a phage genome were based on matches to only~20-30 nt, suggesting these regions may not be of viral origin. The relationships established by all three phage regions differ from each other, as well as from other analyses. The arbitrary nature of the VirSorter region, along with the few numbers of SNPs compared in the other two analyses would suggest a poor indication of evolutionary relationships; however, together, all these analyses reinforce the existence of five distinct genetic populations. All three programs identified different regions of the genome as being of viral origin and the lack of concordance in calling a region as being of viral origin speaks to the differences in approaches used by each method. The fact that none of these methods identified the same region, or at least regions within the same vicinity, as well as the fact that one program yielded no results when a similar version of the program yielded results, points to vastly different methodologies with varying levels of accuracy. It is unknown if these regions are artifacts or true viral elements, but these discrepancies cast some doubt as to the authenticity of these designations. Regardless, the programs found these elements in all five populations of R. toxicus; therefore, if these are true prophage regions, this would suggest that they were acquired a long time ago, prior to the delineation of these populations. Perhaps all viral elements in R. toxicus were acquired a sufficient time ago that they lost their viral signatures and are now hard to differentiate from the rest of the genome.
Core/Variable/Pan-genome analyses. The representative strains of R. toxicus populations were analyzed with the MicroScope Microbial Genome Annotation and Analysis Platform. The pan-, core-, and variable-genome for the five R. toxicus populations comprises 3262, 2114, and 1148 gene families (Figure 1), which correspond to 13,117, 10,812 and 2365 genes, respectively. The MicroScope summary output is shown in Table 2. Gene family counts ( Table 2) differed slightly from those reported in the Venn diagram generated from the same program ( Figure 1); the reason is unclear. Based on Figure 1, RT-I and RT-IV are most closely related, sharing the most total gene families (241) outside the core and having the most gene families shared between only these two populations (74). RT-I is next closest to RT-II and then RT-III, sharing 207 and 193 total gene families outside the core, and 20 and 18 gene families only between the two populations, respectively. RT-II and RT-III appear most closely related, sharing 237 total gene families outside the core and 33 gene families only between these two populations. RT-V is closest to RT-II, RT-III, and then RT-I, sharing 204, 198 and 177 total gene families outside the core, and 26, 20 and 25 gene families only between the two populations, respectively. RT-IV and RT-V were the least related, sharing only 156 total gene families outside the core and 21 gene families only between these two populations. RT-IV is closer to RT-II and then RT-III, sharing 172 and 168 total gene families outside the core, and 13 and 14 gene families only between the two populations, respectively. It is important to note that these numbers reflect gene families, as designated by MicroScope; the actual number of genes common between populations will differ, but the trends should remain the same. Based on shared gene family content, the relationship among strains is as follows: RT-IV ↔ RT-I ↔ RT-II ↔ RT-III ↔ RT-V. Unfortunately, evolutionary relationships cannot be determined based on gene content information only. (RT-IV). The KEGG database only identified genes for the Sec and Tat pathways, which corroborates results obtained by the other software tools. The absence of any bacterial type secretion system is not surprising, as R. toxicus is a Gram-positive bacterium, which do not typically possess these advanced secretion systems. Core/Variable/Pan-genome analyses. The representative strains of R. toxicus populations were analyzed with the MicroScope Microbial Genome Annotation and Analysis Platform. The pan-, core-, and variable-genome for the five R. toxicus populations comprises 3262, 2114, and 1148 gene families (Figure 1), which correspond to 13,117, 10,812 and 2365 genes, respectively. The MicroScope summary output is shown in Table 2. Gene family counts ( Table 2) differed slightly from those reported in the Venn diagram generated from the same program ( Figure 1); the reason is unclear. Based on Figure 1, RT-I and RT-IV are most closely related, sharing the most total gene families (241) outside the core and having the most gene families shared between only these two populations (74). RT-I is next closest to RT-II and then RT-III, sharing 207 and 193 total gene families outside the core, and 20 and 18 gene families only between the two populations, respectively. RT-II and RT-III appear most closely related, sharing 237 total gene families outside the core and 33 gene families only between these two populations. RT-V is closest to RT-II, RT-III, and then RT-I, sharing 204, 198 and 177 total gene families outside the core, and 26, 20 and 25 gene families only between the two populations, respectively. RT-IV and RT-V were the least related, sharing only 156 total gene families outside the core and 21 gene families only between these two populations. RT-IV is closer to RT-II and then RT-III, sharing 172 and 168 total gene families outside the core, and 13 and 14 gene families only between the two populations, respectively. It is important to note that these numbers reflect gene families, as designated by MicroScope; the actual number of genes common between populations will differ, but the trends should remain the same. Based on shared gene family content, the relationship among strains is as follows: RT-IV ↔ RT-I ↔ RT-II ↔ RT-III ↔ RT-V. Unfortunately, evolutionary relationships cannot be determined based on gene content information only.   Annotations from MicroScope were compared to RAST, PGAP and JGI data, then consensus annotations were input manually. There were varying degrees of agreement between annotations, which differed in start-end/gene size and functional call, particularly in hypothetical genes. The core-genome data output was analyzed by RAST (Table S14). As expected, majority of the subsystems identified by RAST were located within the core-genome (97.46%-98.71%). Analysis of the core-genome by RAST only identified 10 RNAs. The MicroScope output only includes coding genes that could be assigned to functional gene families, therefore non-coding regions that may code for RNAs were not included; the 10 identified RNAs are most likely located within the sequence that also codes for a protein. The core-genome data was compiled and compared among the strains representing the five populations of R. toxicus (File S9). Consensus annotation resulted in~27% of genes still having unknown functions.
By definition, the core-genome should only include sequence (gene content) contained in all genomes analyzed. In addition to genes essential for survival and basic functioning, other genes such as the CRISPR-Cas system, tunicamycin BGC, and vancomycin resistance were included within the core. Interestingly, the MicroScope program core outputs were not all the same among the strains analyzed. Certain genes were called in some strains and not others, which would suggest that they should be part of the variable genome; however, this occurred mainly with genes of unknown function. Additionally, some genes that were present in all strains were annotated as "fragments" in some strains relative to others. Even though some genes in the core-genome appear variable, it is believed that the sequences that these genes belong to are common among all strains, but the algorithm used makes different/extra gene calls based on sequence variation.
AntiSMASH identified 20 BGCs common to all strains tested, except for ASC-14 and ASC18, which was absent from RT-II and unique to RT-V, respectively. The majority of the genes included in these 20 ASCs are found in the core-genome. Even though ASC-14 was not identified in RT-II, genes from ASC-14 are present in the core-genome. However,~12,000 nt from the middle of ASC-14 are not present in the core, suggesting this portion is missing from RT-II, which may be the reason why AntiSMASH could not identify it in RT-II. Additionally, ASC-18 is unique to RT-V (CS39), though the first 5 genes in this cluster appear in the core-genome.
The variable-genome data was compiled and compared among the strains representing the five populations of R. toxicus (File S10). A small subset of genes in the variable-genome could not be compiled due to various reasons; for example, for a given gene family designation, the products differed in length, some strains had an extra gene called proximal or distant to each other, possibly resulting from gene duplications, or a gene in one strain was in a completely different location, possibly resulting from a rearrangement. Consensus annotation resulted in~79% of genes still having unknown function. Genes from twenty-one ASCs, except for ASC-18 which is unique to RT-V (CS39), were represented in the variable-genome to varying degrees.
The strain-specific-genome data output (File S11) was also analyzed by RAST (Table S14). Only one subsystem was identified by RAST, which occurred in RT-V (CS39), and only 1 RNA was identified each in RT-I (SA03-04) and RT-IV (CS36). Consensus annotation resulted in~90%,~94%,~76%, , including the unique thiopeptide cluster, were represented in the RT-V-specific-genome to varying degrees. It is of note to mention that MicroScope identified several proteins of unknown function in both the variable-and strain-specific-genome that were located within the CRISPR-Cas region of the genome. It is unclear how many inappropriate gene calls were made within the pan-genome or how this is reflected in the gene family/gene number comparisons; the numbers reported by MicroScope are likely higher than what actually occurs.
PanSeq was used to query each population, relative to the others, to find regions unique to each population (Table 3). PanSeq identified 10, 6, 8, 10 and 4 regions for RT-I, RT-II, RT-III, RT-IV and RT-V, respectively, that vary in length from approximately 600-20,000 bp. All regions were aligned to strains from all populations using Geneious, which found several regions that exist in one or more populations. Results were corroborated when compared to the variable-and strain-specific-genome results, as well as when compared to AntiSMASH results. One example is that RT-III-5, RT-III-6 and RT-III-7 were found in all populations except RT-II. It was also found that these regions were located in ASC-14, which was not found in population RT-II, thus corroborating results. Another example is that three of the four regions found specific for RT-V were found to occur within ASC-18, which is a thiopeptide cluster unique to RT-V. With regards to the variable-genome, one example is that region RT-I-3 was identified in the variable-genome and genes attributed to this region from MicroScope were found only in population RT-I and RT-V, which match results from mapping using Geneious. The genes attributed to strain-specific regions found by PanSeq were all represented in the MicroScope output, with the exception of RT-II-2. This is a relatively small region (607 bp) and since MicroScope only looks at coding regions, it is possible this region comprises mostly non-coding DNA and was therefore not accounted for in the MicroScope output. The RGP Finder program in MicroScope looked for regions unique to a particular genome. The RGP regions found for each representative strain were mostly associated with novel regions found using PanSeq ( Figure S2, Table 3), though some were not. Table 3. Unique regions determined by PanSeq for each population of R. toxicus.

RT-I-1
x 799 In ASC-2; ASC-2 longer in RT-V than all others but missing RT-I-1

RT-V-3 x 8451
In RT-V unique Thiopeptide Cluster (ASC-18); CS39 RGP-3; LAMP primers designed here [36] RT-V-4 x 10,447 In RT-V unique Thiopeptide Cluster (ASC-18); CS39 RGP-3 a Region name based on the strain analyzed for unique regions, relative to the others; i.e., RT-III-2 was the second region identified in population RT-III by PanSeq. X denotes which populations the regions were found in by mapping using Geneious.

Average Nucleotide Identity and Digital DNA-DNA Hybridization
DNA-DNA hybridization (DDH) has been considered the gold standard for species delineation at the genomic level for the past 50 years, as it was the only method to offer a numerical and relatively stable species boundary [149]. Traditional DDH suffers from the fact that the method is laborious and requires specially trained personnel, who are only available in select laboratories. Recent advancements in next-generation sequencing (NGS) and bioinformatics have caused scientists to push for easier and more current methods. ANI has been suggested as the best alternative for a gold standard [149,150], as it can provide values that correlate to DDH values, with~95%-96% ANI corresponding to the 70% DDH value for species delineation [67,149]. dDDH has also been suggested as a replacement for traditional DDH [82][83][84]. dDDH has an advantage over ANI in that the values are on the same scale as traditional DDH values, making comparison between digital and wet-lab results simple. Additionally, dDDH values calculated with GGDC have a higher correlation with traditional DDH. In this section, ANI and dDDH results were obtained from and compared using several independent programs.
ANI calculator: ANI values were compared between all strain genomes tested using the ANI calculator (Table S15). ANI values produced by the ANI calculator were over 99.99% for all strain comparisons, which were higher than those produce by other analyses; higher ANI values from the ANI calculator were observed previously [150]. Nonetheless, ANI values were converted into a visual heat map for comparison ( Figure 2). All strains grouped together with those of the same population type, reinforcing the population designations. Based on ANI values, RT-II and RT-III appear most closely related, as does RT-I and RT-IV. RT-V appears most closely related to RT-III and then RT-II; RT-V was most distantly related to RT-IV. This data is highly similar to that from pan-genome analyses. The ANI calculator data output includes a phylogenetic tree, which was imported into Geneious for tree generation (Figure 3). The phylogenetic tree produced from ANI distances shows the same trend; RT-II and RT-III are most closely related, with RT-V being the next closely related, while RT-I and RT-IV are most closely related. The web-interface for the ANI calculator also produces phylogenetic trees based on the NJ ( Figure S5A) [112], BIONJ ( Figure S5B) [151], and Unweighted Pair Group Method with Arithmetic Mean (UPGMA) ( Figure S5C) [152] methods. The UPGMA tree is in complete agreement with Figure 3 and the ANI and pan-genome data with regards to population relatedness. The NJ and BIONJ trees have identical topologies that differ from the UPGMA tree; however, the relationships inferred from these trees are the same as the UPGMA tree, although lineages cannot be elucidated.     JSpecies: ANI values were compared between all strain genomes tested using JSpecies (Tables S16-S18). ANIb data (Table S16) suggests that RT-II and RT-III are most closely related among the five populations. RT-V was more closely related to RT-II and RT-III than to RT-I, and most distantly related to RT-IV. These results corroborate those found thus far. However, results suggest that RT-I is more closely related to RT-II and RT-III than RT-IV, though results for RT-IV show that it is still most closely related to RT-I. ANIm values (Table S17) corroborated the relationships between populations established by the ANI calculator and the pan-genome analysis. Tetranucleotide signatures were compared (Table S18) and results suggest that RT-II and RT-III are most closely related, with RT-V being more closely related to these populations and least related to RT-V. In contrast to other results, Tetra analysis suggests that RT-I is most closely related to RT-II; RT-IV also appears to be most closely related to RT-II. The only clear conclusions that could be drawn were that RT-II and RT-III were the most closely related and RT-IV and RT-V were the most distantly related. ANIb, ANIm and Tetra values were converted into visual heat maps for comparison ( Figure S6A-C, respectively). All strains grouped together with those of the same population type, reinforcing the population designations.
EZGenome: ANI values were compared between all strain genomes tested using EZGenome (Table S19). The OrthoANIu values corroborated the relationships between populations established by the ANI calculator and the pan-genome analysis. OrthoANIu values were converted into a visual heat map for comparison ( Figure S6D). All strains grouped together with those of the same population type, reinforcing the population designations.
Genome-to-genome distance calculator (GGDC): the GGDC was used to determine the similarity between strains/populations. BLAST+, the recommended default in GGDC, was used to calculate distance values, and then dDDH values were estimated according to three mathematical formulas (Tables S20-S25). dDDH values calculated based on Formula 1, 2 and 3 were converted into visual heat maps for comparison ( Figure S7A-C, respectively). All strains grouped together with those of the same population type, reinforcing the population designations. Formula 1 and 2 produced dDDH values that suggested relationships congruent with each other, as well as previous analyses, suggesting RT-II and RT-III are most closely related, with RT-V being the next closely related, while RT-I and RT-IV are most closely related. In contrast, Formula 2 produced a relationship similar to those produced by ANIb, except that RT-IV was suggested to be more closely related to RT-I and RT-V was more similar to RT-III and then RT-I. It is interesting to note that GGDC recommends Formula 2 due to its independence of genome length, thus making it useful for analyzing incomplete draft genomes [82][83][84]. The GGDC suggests that for complete genomes, much higher DDH values for Formula 1 over Formula 2 would indicate that the strains differ less in gene content and more so in the sequences of their gene content [82][83][84]. GGDC suggests that Formula 2 will provide more resolution for strains that only differ by a plasmid [82][83][84]. These recommendations are confusing in that analyses of the R. toxicus genomes shows sequence differences, along with differences in gene content, yet the values for Formula 1 were larger than Formula 2, appearing to contradict statements made by GGDC. MUMmer was also used to determine distances and estimate dDDH values (Tables S26-S27); however, results were very unusual, as strains within the same populations were suggested to be least related. For example, the two RT-II strains have a dDDH value of 88.7 between them, while RT-II strains have dDDH values over 99% when compared to strains from all other populations. It is suggested to use the recommended BLAST+ analysis in GGDC to avoid erroneous results. GGDC also calculated differences in G+C content (Table  S28); values were converted into a visual heat map for comparison ( Figure S7D). Differences in G+C did not provide a conclusive resolution between populations.
All analyses, except for GGDC MUMmer and G+C, agreed that RT-II and RT-III are most closely related, and RT-IV and RT-V are most distant. Varying relationships between other populations were observed depending on the program used; differences were observed between programs, as each program used slightly different algorithms for analysis, as well as between different formulas for calculating values within the same program. Another study observed these differences and suggests the use of multiple approaches for confirmation of taxonomic affiliations [150], which is the sentiment shared with this manuscript. It is believed that ANI is a robust measure of genetic and evolutionary distance; one reason being that it is not affected by HGT [67,153]. dDDH values are computed in a similar manner to ANI, therefore it can be assumed that dDDH may not account for HGT as well. Knowing now that HGT occurs frequently, it seems irresponsible to not account for it when analyzing the differences between two strains/populations; granted, current tools are limited in their ability to account for HGT events.
The fact that different programs gave different results and these calculations may not account for HGT led to further investigation into how these programs calculate ANI/dDDG values, in an attempt to understand how to interpret the data. Aside from reading the methods attributed to these programs, an experiment was performed to assess if and how each program accounts for HGT (Table S29). Strain SA03-04 was tested against itself as a control, and then the CRISPR-Cas region was deleted and compared with the unedited SA03-04 genome. The experiment was repeated with the CRISPR-Cas and tunicamycin gene cluster removed to test a larger deletion that occurs in two separate regions of the genome. As expected, all programs resulted in 100% identity values for the control experiment. ANI values from the ANI calculator were 100% for both experiments using the one-way ANI1 and two-way ANI. The one-way ANI2 produced an ANI value of 99.99% relative to the 100% control. There was a trend of decreasing fragment number analyzed as the amount of sequence removed from SA03-04 increased. When looking at the web-interface, the ANI value was~100% when removing CRISPR-Cas alone or both the CRISPR-Cas and tunicamycin cluster. Strangely, removing both segments resulted in a slightly higher ANI percentage (99.9964184516498% vs 99.9963614798399%). The NJ, BIONJ and UPGMA trees grouped the modified SA03-04 genomes with the RT-I population, but the alterations were reflected in a slight shift in location relative to the others within the RT-I clade ( Figure S8). ANIb and ANIm analyses from JSpecies showed a corresponding decrease in the percent genome assessed relative to the control as the amount of sequence removed increased; however, ANIb and ANIm values were 100% for both experiments. The Tetra values showed a corresponding decrease relative to the control, appearing to reflect the deletion made in SA03-04. OrthoANIu analysis from EZGenome was able to show a change relative to the control, reflecting the change in the analyzed genome; however, more of the genome was assed when both the CRISPR-Cas and tunicamycin were removed relative to when only the CRISPR-Cas system was removed. GGDC BLAST+ analysis showed no change relative to the control for the recommended Formula 2, thus showing that it does now account for HGT. Both Formulas 1 and 3 showed an increase in distance values and decrease in dDDH values relative to the controls that corresponded with increase deletion size, with both values being almost identical. The same trend was observed for BLAST+ and MUMmer percent G+C differences. GGDC MUMmer dDDH values reflected changes relative to the control when sections of the genome were removed; however, while the distance values increase with the corresponding increases in deletion, the dDDH values remained the same at 99.3%. These experiments demonstrate different programs may or may not account for genetic differences such as HGT within the system of interest; the extent to which these differences are represented is unknown. This again highlights the fact that multiple analyses should be performed in order to ensure accurate statements are made regarding the relatedness of organisms being studied.

Phylogenetic Analyses Using Single Gene, Multi-Locus and Whole-Genome Sequence
Phylogenetic analyses were used to determine the relationship among the R. toxicus strains used in this study. Complete genome sequences were aligned using MAUVE ( Figure S9A). Strain 70137 (RT-III) appeared to have an inversion not present in any other strain; Sechler et al. [19] also showed this phenomenon. Simple sequence repeat (SSR) analysis of R. toxicus strain 70137 showed a drastic difference in repeat length when comparing a microsatellite identified within the sequence deposited in GenBank to its PCR verification results. Strain 70137 was isolated around the same time as CS28 and WAC3373, which have similar alignments. Being the only strain to possess this inversion, along with the SSR discrepancy, suggests that this anomaly is attributed to inaccurate sequence assembly. Therefore, a new alignment was performed in MAUVE that did not include strain 70137 ( Figure  S9B). No major inversions or rearrangements were observed for any R. toxicus strain tested, with the exception of strain 70137, which is consistent with results from Sechler et al. [19]. Strains of RT-I, RT-II and RT-III showed a high degree of similarity to others included in their respective populations; therefore, a MAUVE alignment was created using the strains representative of populations RT-I, RT-II, RT-III, RT-IV and RT-V (Figure 4). sequence assembly. Therefore, a new alignment was performed in MAUVE that did not include strain 70137 ( Figure S9B). No major inversions or rearrangements were observed for any R. toxicus strain tested, with the exception of strain 70137, which is consistent with results from Sechler et al. [19]. Strains of RT-I, RT-II and RT-III showed a high degree of similarity to others included in their respective populations; therefore, a MAUVE alignment was created using the strains representative of populations RT-I, RT-II, RT-III, RT-IV and RT-V (Figure 4). The MAUVE guide tree output and REALPHY alignment data were input into Geneious for phylogenetic tree generation ( Figure 5). All strains grouped together with those of the same population type, reinforcing the population designations. Phylogenetic analysis by MUAVE shows that RT-II and RT-III are most closely related, while RT-V being next closely related ( Figure 5A). MAUVE analysis also suggests that RT-I and RT-IV are most closely related ( Figure 5A). These results agree with the majority of results presented thus far. REALPHY analysis also suggests that RT-I and RT-IV are most closely related; however, it also suggests that RT-V is the next closely related population ( Figure 5B). MAUVE and REALPHY, which are both supposed to infer phylogenies based on whole-genome sequence, provide completely different topologies that suggest different relationships among populations. The phylogenetic tree produced by REALYPHY, as opposed to being created in Geneious using the REALPHY alignment file, has a similar topology to Figure 5B ( Figure S10). The discrepancy between these two programs was most likely due to the fact that REALPHY looks at single nucleotide polymorphisms within the core; therefore, is unable to account for segments of the genome that do not match, such as those from HGT. This was tested in the same manner as the ANI/dDDH. REALPHY does not account for HGT, as SA03-04 with the CRISPR-Cas system removed, or both the CRISPR-Cas and tunicamycin gene cluster removed, grouped perfectly with the unedited SA03-04 strain within the phylogenetic tree. In contrast, MAUVE analysis reflected the deletion of the CRISPR-Cas system ( Figure S11A) and both the CRISPR-Cas and tunicamycin gene cluster ( Figure S11B), showing a corresponding increase in phylogenetic distance as more sequence was deleted. The MAUVE guide tree output and REALPHY alignment data were input into Geneious for phylogenetic tree generation ( Figure 5). All strains grouped together with those of the same population type, reinforcing the population designations. Phylogenetic analysis by MUAVE shows that RT-II and RT-III are most closely related, while RT-V being next closely related ( Figure 5A). MAUVE analysis also suggests that RT-I and RT-IV are most closely related ( Figure 5A). These results agree with the majority of results presented thus far. REALPHY analysis also suggests that RT-I and RT-IV are most closely related; however, it also suggests that RT-V is the next closely related population ( Figure 5B). MAUVE and REALPHY, which are both supposed to infer phylogenies based on whole-genome sequence, provide completely different topologies that suggest different relationships among populations. The phylogenetic tree produced by REALYPHY, as opposed to being created in Geneious using the REALPHY alignment file, has a similar topology to Figure 5B (Figure S10). The discrepancy between these two programs was most likely due to the fact that REALPHY looks at single nucleotide polymorphisms within the core; therefore, is unable to account for segments of the genome that do not match, such as those from HGT. This was tested in the same manner as the ANI/dDDH. REALPHY does not account for HGT, as SA03-04 with the CRISPR-Cas system removed, or both the CRISPR-Cas and tunicamycin gene cluster removed, grouped perfectly with the unedited SA03-04 strain within the phylogenetic tree. In contrast, MAUVE analysis reflected the deletion of the CRISPR-Cas system ( Figure  S11A) and both the CRISPR-Cas and tunicamycin gene cluster ( Figure S11B), showing a corresponding increase in phylogenetic distance as more sequence was deleted. MAUVE and REALPHY analysis showed strains CS28 and WAC3773 group close together with negligible distance within the clade; while 70137 groups with these two strains, it is separated from these two strains within the same clade by some phylogenetic distance. This was also observed from phylogenetic analyses using the ANI calculator ( Figure S5C). These observed differences agree with the notion that the genome sequence of strain 70137 is inaccurate. Strain WA40-23C groups together with the other RT-III strains, but groups out separately in all phylogenetic analyses, including those from the ANI calculator. This is most likely due to the fact that this strain was isolated many years after the others, providing ample time to accumulate mutations that make its genome sequence different enough to be observed by these analyses. Similar findings were observed when analyzing the CRISPR spacer.
Complete R. toxicus genome sequences were also analyzed using PhyloSift, under default conditions. The output tree file was input into FigTree v1.4.3 (tree.bio.ed.ac.uk/software/figtree/) to display the Maximum-likelihood tree ( Figure 6). PhyloSift analysis grouped all strains with their respective populations, which is congruent with all other analyses mentioned previously. Based on the topology of the phylogenetic tree produced by PhyloSift, RT-I and RT-IV appear most closely related, followed by RT-V, and then RT-II and RT-III. Unlike some of the analyses mentioned previously, there were no intra-population differences observed, particularly within RT-III. The inability of PhyloSift to show differences among RT-III, particularly regarding recent strain WA40-23C, can most likely be attributed to the conserved nature of the gene families used during analyses. For prokaryotic analysis, PhyloSift uses a reference database of 37 gene families that were MAUVE and REALPHY analysis showed strains CS28 and WAC3773 group close together with negligible distance within the clade; while 70137 groups with these two strains, it is separated from these two strains within the same clade by some phylogenetic distance. This was also observed from phylogenetic analyses using the ANI calculator ( Figure S5C). These observed differences agree with the notion that the genome sequence of strain 70137 is inaccurate. Strain WA40-23C groups together with the other RT-III strains, but groups out separately in all phylogenetic analyses, including those from the ANI calculator. This is most likely due to the fact that this strain was isolated many years after the others, providing ample time to accumulate mutations that make its genome sequence different enough to be observed by these analyses. Similar findings were observed when analyzing the CRISPR spacer.
Complete R. toxicus genome sequences were also analyzed using PhyloSift, under default conditions. The output tree file was input into FigTree v1.4.3 (tree.bio.ed.ac.uk/software/figtree/) to display the Maximum-likelihood tree ( Figure 6). PhyloSift analysis grouped all strains with their respective populations, which is congruent with all other analyses mentioned previously. Based on the topology of the phylogenetic tree produced by PhyloSift, RT-I and RT-IV appear most closely related, followed by RT-V, and then RT-II and RT-III. Unlike some of the analyses mentioned previously, there were no intra-population differences observed, particularly within RT-III. The inability of PhyloSift to show differences among RT-III, particularly regarding recent strain WA40-23C, can most likely be attributed to the conserved nature of the gene families used during analyses. For prokaryotic analysis, PhyloSift uses a reference database of 37 gene families that were found to be single copy, almost without exception, and essentially universal [154]. These genes comprise mostly ribosomal proteins, translation elongation and initiation factors (EF-2 and IF-2), a metalloendopeptidase, ffh signal recognition particle, phenylalanyl-tRNA synthetase alpha and beta subunits, tRNA pseudouridine synthase B, porphobilinogen deaminase, phosphoribosylformylglycinamidine cyclo-ligase and ribonuclease HII. PhyloSift databases also include 16S and 18S ribosomal RNA genes, as well as mitochondrial, Eukaryote-specific and viral gene families [115]. As PhyloSift only analyzes genes contained within its database and which are common to all strains, it is likely that this platform cannot account for horizontal gene transfer events. This was tested in the same manner as the ANI/dDDH, MAUVE and REALPHY, by rerunning analyses with an SA03-04 sequence devoid of the CRISPR-Cas system or both the CRISPR-Cas system and tunicamycin gene cluster. Both analyses showed the altered sequence grouped perfectly with the unaltered SA03-04 sequence, suggesting the large deletions were not accounted for during analysis if not occurring among the preselected genes. found to be single copy, almost without exception, and essentially universal [154]. These genes comprise mostly ribosomal proteins, translation elongation and initiation factors (EF-2 and IF-2), a metalloendopeptidase, ffh signal recognition particle, phenylalanyl-tRNA synthetase alpha and beta subunits, tRNA pseudouridine synthase B, porphobilinogen deaminase, phosphoribosylformylglycinamidine cyclo-ligase and ribonuclease HII. PhyloSift databases also include 16S and 18S ribosomal RNA genes, as well as mitochondrial, Eukaryote-specific and viral gene families [115]. As PhyloSift only analyzes genes contained within its database and which are common to all strains, it is likely that this platform cannot account for horizontal gene transfer events. This was tested in the same manner as the ANI/dDDH, MAUVE and REALPHY, by rerunning analyses with an SA03-04 sequence devoid of the CRISPR-Cas system or both the CRISPR-Cas system and tunicamycin gene cluster. Both analyses showed the altered sequence grouped perfectly with the unaltered SA03-04 sequence, suggesting the large deletions were not accounted for during analysis if not occurring among the preselected genes. Multi-locus sequence analysis has been suggested as the new gold standard for species delineation [155,156]; however, one of the major drawbacks is gene selection bias [149]. Arif et al. [35] performed MLSA using the partial gene sequences of the chromosome partition protein SMC, CRISPR-associated protein cse4, cysteine desulfurase, secA ATPase, tRNA dihydrouridine synthase, and vancomycin A-type resistance protein vanA. These genes were chosen based on their discriminative power, variety of cellular function and spatial coverage of the entire genome [35]. As with many studies, attempts are made to minimize gene bias by selecting genes of various function Multi-locus sequence analysis has been suggested as the new gold standard for species delineation [155,156]; however, one of the major drawbacks is gene selection bias [149]. Arif et al. [35] performed MLSA using the partial gene sequences of the chromosome partition protein SMC, CRISPR-associated protein cse4, cysteine desulfurase, secA ATPase, tRNA dihydrouridine synthase, and vancomycin A-type resistance protein vanA. These genes were chosen based on their discriminative power, variety of cellular function and spatial coverage of the entire genome [35]. As with many studies, attempts are made to minimize gene bias by selecting genes of various function and spatial distribution throughout the genome. PhyloSift removes selection bias by analyzing a large number of preselected genes, regardless of the system used. In an attempt to minimize gene selection bias, MLSA was performed using genes selected for another system, Clavibacter michiganenis [157,158]; Rathayibacter spp. once grouped within the Clavibacter genus [1,8,159]. MLSA analysis of six full gene sequences was able to reproduce the population grouping for RT-I, RT-II, RT-III, RT-IV and RT-V ( Figure 7); results revealed different topology from PhyloSift and previous MLSA results [35,36] (Figure S12), thus suggesting different relationships between populations. Results are based on only 46 unique SNPs ( Figure 7A), which corresponds to 13 unique amino acid substitutions ( Figure 7B). and spatial distribution throughout the genome. PhyloSift removes selection bias by analyzing a large number of preselected genes, regardless of the system used. In an attempt to minimize gene selection bias, MLSA was performed using genes selected for another system, Clavibacter michiganenis [157,158]; Rathayibacter spp. once grouped within the Clavibacter genus [1,8,159]. MLSA analysis of six full gene sequences was able to reproduce the population grouping for RT-I, RT-II, RT-III, RT-IV and RT-V ( Figure 7); results revealed different topology from PhyloSift and previous MLSA results [35,36] (Figure S12), thus suggesting different relationships between populations.
Results are based on only 46 unique SNPs ( Figure 7A), which corresponds to 13 unique amino acid substitutions ( Figure 7B). Each gene included in the MLSA was also analyzed individually ( Figure S13). Analysis of atpD (1476 bp), dnaK (1863 bp), gyrB (2061 bp), ppK (2199 bp), recA (1068 bp) and rpoB (3492 bp) was based on 2, 9, 18, 9, 2 and 6 unique SNPs, respectively (Figures S13A-S13F), which corresponded to 0, 1, 3, 1, 2 and 6 amino acid substitutions, respectively. Interestingly, Arif et al. [35] screened dnaK and rpoB for their ability to differentiate populations and found no differences in nucleotide sequence; results were based on analysis of a 951 and 871 bp gene fragment, respectively. The dnaK gene contained only one unique SNP in RT-I and eight unique SNPs in RT-IV; therefore, not much resolution was provided ( Figure S13B). In contrast, the rpoB gene alone could separate all strains into their respective populations ( Figure S13F); however, distances separating the clades is based on only six total unique SNPs, which corresponded to six total unique amino acid substitutions.
Production of the tunicamycin toxin is the main reason R. toxicus has received attention in the past. How this organism utilizes this toxin remains unknown. It is speculated that the toxin is used to compete with the nematode or other microorganisms. As different populations may be exposed to different vectors, plant hosts and other competing microorganisms depending on geographic Each gene included in the MLSA was also analyzed individually ( Figure S13). Analysis of atpD (1476 bp), dnaK (1863 bp), gyrB (2061 bp), ppK (2199 bp), recA (1068 bp) and rpoB (3492 bp) was based on 2, 9, 18, 9, 2 and 6 unique SNPs, respectively ( Figure S13A-F), which corresponded to 0, 1, 3, 1, 2 and 6 amino acid substitutions, respectively. Interestingly, Arif et al. [35] screened dnaK and rpoB for their ability to differentiate populations and found no differences in nucleotide sequence; results were based on analysis of a 951 and 871 bp gene fragment, respectively. The dnaK gene contained only one unique SNP in RT-I and eight unique SNPs in RT-IV; therefore, not much resolution was provided ( Figure S13B). In contrast, the rpoB gene alone could separate all strains into their respective populations ( Figure S13F); however, distances separating the clades is based on only six total unique SNPs, which corresponded to six total unique amino acid substitutions.
Production of the tunicamycin toxin is the main reason R. toxicus has received attention in the past. How this organism utilizes this toxin remains unknown. It is speculated that the toxin is used to compete with the nematode or other microorganisms. As different populations may be exposed to different vectors, plant hosts and other competing microorganisms depending on geographic location, the tunicamycin gene cluster was analyzed for potential differences that may reflect this notion ( Figure  S14). The full tunicamycin gene cluster was analyzed ( Figure S14A) and grouped strains of R. toxicus into their respective populations; however, RT-II and RT-III strains grouped together in a single clade. Analysis also revealed some sequence length differences attributed to a poly-G repeat portion that could be a sequencing artifact. Therefore, individual tunicamycin gene coding sequences were extracted and analyzed individually; the order of tunicamycin genes were as described previously (tunC-A-B-K-G-H-L-D-E-I-J-F) [19]. Individual tunicamycin genes possess very few SNPs, with most occurring in RT-IV. Individual analyses did not provide resolution of R. toxicus populations; therefore, tun gene coding sequences were concatenated for reanalysis ( Figure S14B). Concatenated sequences were also translated into aminoacid sequences and analyzed ( Figure S14C). Nucleotide and amino acid analyses of concatenated tun gene coding sequences showed identical groupings of strains, which was also congruent with those produced when analyzing the full tunicamycin gene cluster; topologies varied slightly.
Tunicamycin production was thought to be attributed to the presence of a bacteriophage [160]; however, a recent study identified toxin-producing isolates without the bacteriophage [161]. To date, the trigger for the production of tunicamycin by the bacterium remains unknown, though toxin seems to appear as seeds are senescing [21]. Secondary metabolites tend to be produced during times of stress. Whether the stress is attributed to phage infection or senescence during the natural lifecycle remains unclear. Trehalose is a non-reducing disaccharide that can accumulate in bacteria during stress, such as osmotic stress experienced during seed senescence, to enable survival [162]. Bacteria possess five trehalose biosynthetic pathways [162]; R. toxicus has two trehalose biosynthetic pathways ( Figure S15). Specific plant hosts may senesce differently, which could be reflected in how R. toxicus populations deal with this stress through their trehalose pathways; therefore, each pathway was analyzed ( Figure S16). The tps has five total SNPs, one in RT-II/IV, two in RT-III/IV and two only in RT-IV, corresponding to only one amino acid substitution in RT-IV. tps analysis grouped each strain with their respective population; however, RT-II and RT-V were grouped into a single clade ( Figure  S16A). The tpp has 17 total SNPs, one in RT-V, two in RT-II/III and 14 in RT-IV, corresponding to 7 amino acid substitutions in RT-IV and only one amino acid substitution in RT-V. tpp analysis also grouped each strain with their respective population; however, RT-II and RT-III were now grouped into a single clade ( Figure S16B). Interestingly, when tps and tpp were concatenated, strain grouped with their respective populations ( Figure S16C), providing further validation of the five genetically distinct populations; however, phylogenetic topology differed from larger-scale analyses.
The treY has eight total SNPs, two in RT-I/IV and six in RT-V, corresponding to six amino acid substitutions in only RT-V. The treY analysis grouped RT-II and RT-III together, as well as RT-I and RT-IV ( Figure S16D). treZ has three total SNPs, one in RT-II and two in RT-V, corresponding to only one amino acid substitution in RT-II. treZ analysis was less informative, with RT-I, RT-III and RT-IV grouping together ( Figure S16E). Concatenating treY and treZ grouped each strain with their respective population; however, RT-I and RT-IV were now grouped into a single clade ( Figure S16F). Individual analyses of amino acid sequences did not provide resolution of R. toxicus populations. The analysis of individual genes from each pathway was based on only a few SNPs; therefore, all four genes were concatenated and reanalyzed ( Figure 8). All tested strains grouped with their respective populations and the topology seemed to mostly agree with those provided by larger-scale analyses, with RT-I and RT-IV being most closely related, as is RT-II and RT-III. Again, due to the few numbers of SNPs involved in this analysis, concatenated amino acid sequences of the four genes produced conflicting results; strains grouped with their respective populations, but RT-I and RT-III grouped together ( Figure S17). Again, this reinforces the point that phylogenetic relationships can differ depending on the type and number of genes used in a study. Exopolysaccharide (EPS) interacts with the environment (vector, host, etc.) and may play a role in protection during dehydration as seeds senesce. The exopolysaccharide production protein was analyzed and revealed five total SNPs, three in RT-I, one in RT-I/IV and one in RT-II; each SNP corresponded to an amino acid substitution. Strains grouped with their respective populations ( Figure S18A) and the inferred phylogenies seemed to mostly agree with those provided by larger-scale analyses, with RT-I and RT-IV being most closely related; however, RT-III grouped with RT-V. Amino acid analysis produced identical groupings, but the topology was reversed.
Vancomycin resistance genes (vanHAX) are possessed by all R. toxicus isolates, though not present in R. festucae, R. iranicus, R. rathayi, and R. tritici; presence of vancomycin resistance genes in the remaining Rathayibacter spp. is unknown. Each vancomycin gene was assessed individually. The full vancomycin cassette was assessed and revealed only five total SNPs, corresponding to only a single amino acid change in RT-I/IV in the vanX gene. Phylogenetic analysis of individual genes was uninformative. The full vancomycin cluster grouped strains with their respective populations; however, RT-II and RT-V grouped together into a single clade ( Figure S18B). It appears that selective Exopolysaccharide (EPS) interacts with the environment (vector, host, etc.) and may play a role in protection during dehydration as seeds senesce. The exopolysaccharide production protein was analyzed and revealed five total SNPs, three in RT-I, one in RT-I/IV and one in RT-II; each SNP corresponded to an amino acid substitution. Strains grouped with their respective populations ( Figure  S18A) and the inferred phylogenies seemed to mostly agree with those provided by larger-scale analyses, with RT-I and RT-IV being most closely related; however, RT-III grouped with RT-V. Amino acid analysis produced identical groupings, but the topology was reversed.
Vancomycin resistance genes (vanHAX) are possessed by all R. toxicus isolates, though not present in R. festucae, R. iranicus, R. rathayi, and R. tritici; presence of vancomycin resistance genes in the remaining Rathayibacter spp. is unknown. Each vancomycin gene was assessed individually. The full vancomycin cassette was assessed and revealed only five total SNPs, corresponding to only a single amino acid change in RT-I/IV in the vanX gene. Phylogenetic analysis of individual genes was uninformative. The full vancomycin cluster grouped strains with their respective populations; however, RT-II and RT-V grouped together into a single clade ( Figure S18B). It appears that selective pressure is not only strong enough for R. toxicus to retain these genes, but keep the gene sequenced conserved. Out of curiosity, EPS and vancomycin genes were concatenated and analyzed; all strains grouped with their respective populations and all populations formed unique clades. It should be noted that the vancomycin gene cassette (vanHAX) was inverted in only strain 70137; this is consistent with other analyses that found inversions and other errors in the published genome sequence.
All the genes tested individually thus far, including those used in MLSA, were analyzed by BLAST to determine where the genes may have been inherited from and/or what organisms closely share these gene sequences (Table S30). The CRISPR-Cas cassette was acquired by HGT; therefore, Cas genes were also included in this analysis (Table S30). Under parameters to identify highly similar matches, majority of the genes analyzed matched only to the two complete R. toxicus genomes (70137 and WAC3373) deposited on GenBank, with the exceptions being vanA, tps, treY and treZ, as well as tpp in RT-IV. vanA matched to Sacchrothrix espanaensis DSM 44299 with an identity and query coverage of 70% and 90%, respectively, while the TPP of RT-IV matched to R. tritici NCPPB 1953 with an identity and query coverage 77% and 91%, respectively. Interestingly, the other trehalose genes also matched with R. iranicus, with identities and query coverages of 74%-78% and 88%-97%, respectively. The cas and tun genes did not yield other matches, knowing that they were acquired horizontally. This may indicate/reflect that these genes were acquired long ago, and that sufficient time has passed to allow for the accumulation of enough SNPs that sequences no longer resemble that of the original acquisition.
Variable results were obtained for all genes when analyzed under less stringent conditions using the "Somewhat similar sequences (blastn)" option within NCBI BLAST. ESP matched with Agromyces aureus AR33, R. festucae DSM 15932, R. rathayi DSM 7485, R. iranicus NCPPB 2253 and R. tritici NCPPB 1953, but with varying query coverage (17%-65%) and identities (68%-74%). The individual vancomycin genes yielded various matches from a variety of organisms (none of the other Rathayibacter spp.), with high query coverages (>70%) and~70% identity. Trehalose genes only matches with few different organisms, including R. festucae, R. rathayi, R. iranicus, R. tritici, Clavibacter, Plantibacter and Leifsonia, which were all part of the Clavibacter genus at one point in time [1,8,159,163,164], with high query coverages (~60%-98%) and high identities (~65%-79%). This suggests that trehalose genes belonged to an ancient ancestor. The tun and cas genes yielded interesting results. The cas3 and cse1-cse3 genes also showed no matches under less stringent parameters. The cas5e gene only produced a single match to Actinomyces with relatively low query coverage (53%) and identity (66%), while the remaining genes matched a few different organisms with varying levels of query coverage and identity; matches to Nocardia were common. Under relaxed BLAST parameters, no matches were found for tunE, tunG, tunH and tunI. The remaining tunicamycin genes matched to the genome of R. iranicus NCPPB 2253, with tunA, tunB, tunC, tunD, tunF, tunJ, tunK and tunL having varying query coverages (16%, 97%, 25%, 11%, 90%, 65%, 22% and 59%, respectively) and identities (67%, 72%, 68%, 68%, 63%, 68%, 82% and 65%, respectively). tunB also matched to Streptomyces, Nocardia and Actinosynnema with high query coverage (~97%) and identity (~64%). This was expected as it was shown previously that R. toxicus tunicamycin genes are most closely related to that of Streptomyces chartreusis [19]; lack of any match to Streptomyces for other tun genes was unexpected. Tunicamycin production by R. iranicus has not been confirmed, but it is suspected to be able to produce corynetoxin [21]. BLAST data may suggest this possibility; however, mapping tun genes to R. iranicus NCPPB 2253 was unsuccessful. Any potential tunicamycin gene sequence, length and arrangement within the genome of R. iranicus NCPPB 2253 appear to differ considerably when compared to that of R. toxicus, based on searching for functional annotations similar to those of tun genes.
The lack of matches to anything except the two R. toxicus genomes for all cas and tun genes under stringent BLAST conditions, along with the few or no significant matches under less stringent conditions, suggest that these genes, whether acquired vertically or horizontally, were present in R. toxicus for a significant amount of time to allow accumulation of enough mutations to make these sequences unique to this organism. If it is assumed horizontally transferred genes such as cas and tun were acquired at or just after the split of R. toxicus from other Rathayibacter spp., as all others lack these genes, these genes could have been acquired 18-33 million years ago; the difference in G+C content between R. toxicus and R. festucae, R. iranicus, R. rathayi and R. tritici is approximately 11%, 6%, 8% and 8%, respectively, and studies suggest that genome-wide changes in G+C content in prokaryotes occur at a rate of approximately 1% every 3 million years [165,166].
Manual annotation of the core-genome led to the discovery of genes with interesting functional annotations, prompting further investigation; these genes had functions involved with cell wall synthesis, secretion systems and toxins, cell cycle, antibiotic resistance and phage-related proteins. Sequences for the cyclic-di-GMP-binding biofilm dispersal mediator protein, cell division trigger factor and cell division inhibitor protein were 100% identical; other genes shared varying numbers of SNPs: phage-shock protein PspC, cell division inhibitor protein and zeta toxin contained a single SNP. These genes were found in the core-genome designated by MicroScope, but cyclic-di-GMP-binding biofilm dispersal mediator protein could not be mapped to strain 70137. The core-genome was assessed using representative strains of each R. toxicus population, so it is possible that 70137 does not have this gene or sequence variation prevents proper mapping of this gene to strain 70137 using Geneious. As the sequence for this gene was found to be 100% identical in the other 17 R. toxicus strains tested, it is likely that errors within the 70137 sequence are responsible; other errors found in the 70137 sequence were explained previously. Phylogenetic analysis of these genes was not informative.
The T3SS SseB-like protein sequence was very large (6783 nt) and contained many SNPs; the sse gene is a Salmonella enterica effector associated with assembly of an F-actin coat around intracellular bacteria. Phylogenetic analysis was able to group all strains into their respective populations, with RT-I being most closely related to RT-IV ( Figure S19A). Additionally, the topology of the phylogenetic tree suggests that RT-IV and RT-V are the most distant ( Figure S19A); results corroborate previous analyses. This gene may warrant further investigation. MLSA was performed on 13 of these genes, excluding the three genes with 100% identity and the T3SS SseB-like protein. Concatenated sequences (15,795 nt) were analyzed and resulting phylogenies grouped all strains with their respective populations; topology suggested different relationships among the populations. Analysis was performed again, now including the T3SS SseB-like protein (14 genes). Concatenated sequences (22,578 nt) were analyzed and resulting phylogenies grouped all strains with their respective populations ( Figure S19B). The topology of the phylogenetic tree produced from MLSA of these 14 genes suggests the same relationships as gene family content analysis by MicroScope; RT-IV ↔ RT-I ↔ RT-II ↔ RT-III ↔ RT-V. Strain WA40-23C grouped within the RT-III clade but separate from all other strains within this clade. This observation can be attributed to the inclusion of the kojibiose/trehalose phophorylase gene. This gene contained nine total unique SNPs; WA40-23C had its own unique SNPs which caused it to group out separately from all other strains.
The T3SS SseB-like protein sequence was very large (6783 nt) and contained many SNPs; the sse gene is a Salmonella enterica effector associated with assembly of an F-actin coat around intracellular bacteria. Phylogenetic analysis was able to group all strains into their respective populations, with RT-I being most closely related to RT-IV ( Figure S19A). Additionally, the topology of the phylogenetic tree suggests that RT-IV and RT-V are the most distant ( Figure S19A); results corroborate previous analyses. This gene may warrant further investigation. MLSA was performed on 13 of these genes, excluding the three genes with 100% identity and the T3SS SseB-like protein. Concatenated sequences (15,795 nt) were analyzed and resulting phylogenies grouped all strains with their respective populations; topology suggested different relationships among the populations. Analysis was performed again, now including the T3SS SseB-like protein (14 genes). Concatenated sequences (22,578 nt) were analyzed and resulting phylogenies grouped all strains with their respective populations ( Figure S19B). The topology of the phylogenetic tree produced from MLSA of these 14 genes suggests the same relationships as gene family content analysis by MicroScope; RT-IV ↔ RT-I ↔ RT-II ↔ RT-III ↔ RT-V. Strain WA40-23C grouped within the RT-III clade but separate from all other strains within this clade. This observation can be attributed to the inclusion of the kojibiose/trehalose phophorylase gene. This gene contained nine total unique SNPs; WA40-23C had its own unique SNPs which caused it to group out separately from all other strains.

Conclusions
Rathayibacter toxicus is aptly named and unique among Rathayibacter species, in that five populations contain a functional tunicamycin gene cluster, evident by the livestock toxicities reported at each of the geographic locations where the bacterium has been detected. All five R. toxicus populations also contain a CRISPR-Cas system and vancomycin resistance genes; the CRISPR-Cas system appears most similar to a type I-E CRISPR-Cas system (Escherichia coli-type) [19]. Rathayibacter toxicus is the only Rathayibacter spp. containing these gene clusters, suggesting their acquisition from an HGT event that occurred early in the speciation of R. toxicus away from other Rathayibacter spp.
The purpose for retention of the tunicamycin gene cluster by R. toxicus remains unknown. Researchers have been unable to recreate the lifecycle of this pathogen in vitro, which includes the bacterium, a nematode vector and a plant host. This limits the ability to determine the role of the toxin in the life history of the bacterium. R. toxicus tunicamycin genes were shown to be closely related to that of Streptomyces chartreusis [19]; S. chartreusis is an Actinobacteria that was isolated from soil in Africa. Although only confirmed in Australia, the fact that R. toxicus-like disease was reported in South Africa, along with the similarity of the tunicamycin cluster to that of S. chartreusis, may suggest R. toxicus originated in South Africa. Exactly when and how this may have occurred requires further investigation; this could have been caused by movement of animals or plant host species, such as Ehrharta longiflora [167], into Australia. Another uncharacterized species of Rathayibacter, Rathayibacter sp. EV, was isolated by Riley et al. [168] from Ehrhata villosa var. villosa in South Africa and was shown to be vectored by Anguina woodi, but not Anguina funesta or Anguina tritici. Rathayibacter sp. EV, along with Rathayibacter iranicus, are suggested to be able produce tunicamycin [21]. Perhaps Rathayibacter sp. EV represents a population similar to the ancestral population, or it diverged from the common ancestor on a separate evolutionary path from populations in Australia. It is crucial that future studies include isolates from South Africa to either confirm or deny these statements.
To date, five distinct populations of R. toxicus have been identified based on multiple analyses, including AFLP [31], PFGE [31], MLSA [35,36], SNP [28,30] and ISSR analysis [35]. This study represents the first large-scale genome-wide study of R. toxicus to include strains from populations RT-I and RT-V. Genomes were analyzed using a variety of methods and bioinformatics tools. In regard to the conclusions, the term "method" is used to describe the type of analysis used to query the genome, such as ANI, MLST or whole-genome. The term "tool" refers to the software program used to perform a given method of analysis, such as using JSpeciesWS for ANI calculation or MAUVE for whole-genome analysis. The term "algorithm" is used to describe the mathematical formulas inherent and built into a given tool. Algorithms can differ between tools used for a given method, such as differing between the ANI calculator and EZGenome even though both are used to calculate ANI; these can also differ for a given tool, such as ANIb or ANIm, which are based on BLAST+ or MUMmer for JSpeciesWS, respectively.
All large-scale analyses grouped strains into five distinct populations. For any given set of analyses, only subtle differences were usually observed between programs; however, these subtle differences altered inferred relationships among these populations. The use of different annotation programs resulted in varying gene counts, gene locations, and functional annotations. The use of different algorithms by various bioinformatic tools used to calculate ANI and dDDH produced varying relationships among populations.
Significant differences were obtained for HGT and prophage/phage remnant analyses depending on the analytical tool used; there was no instance where all programs agreed. Lack of concordance was even more apparent between programs used to identify prophage and phage remnants. Three bioinformatics tools identified prophage/phage remnants; however, there was no agreement between them, as they each identified different sequences of different sizes and at different locations within the genome. In addition, there was a lack of consistency among and within programs. In some instances, sequences were not identified in a given strain even though it was actually present. One specific example is strain SA03-14 when analyzed by AntiSMASH 4.0. The majority of the BGCs were not identified in this strain but were identified by mapping these BGCs from other strains to the SA03-14 genome in Geneious. This was also observed to varying degrees during prophage analyses. These observations, along with SSR analysis of strain 70137, emphasize the importance of verifying results by other means, such as PCR or mapping known sequences to a query genome, to ensure the presence or absence of a locus, as well as its sequence accuracy. Another observation to note was the differences in results produced by different versions of the same program; AntiSMASH 4.0 and AntiSMASH 4.2.0 produced different results, while PHASTER, a more advanced version of PHAST with a very similar phage identification pipeline, did not identify prophage sequence found by PHAST.
In this study, different methods and different versions of the same software programs occasionally produced different results, sometimes leading to different conclusions. These observations indicate a certain risk in using a single method for analysis, not only the analytical method itself but also the tools used for that analysis. Additionally, it is important to understand how these tools function (e.g., what is included in the analysis and what is not considered) and the purpose for which they were designed. Understanding the limitations of the analytical tools will allow informed conclusions to be drawn from a particular analysis. Phylogenetic analyses like PhyloSift, REALPHY and MLSA did not account for HGT events during analyses; scientists now agree that HGT events occur at rates far higher than believed previously. Additionally, it is important to know whether an analytical tool considers other aspects of the genome, such as inversions and rearrangements, as the orientation and organization of the same genes can differ among strains. Many analyses do not account for these genome characteristics: they are more focused on the presence/absence of genes and the sequence differences; thus, the true phylogeny of the organism may not be accurately reflected. It is also important to note that even though phylogenetic relationships can be established with these analyses, evolutionary lineages cannot always be directly inferred; strain data and historical data need to be assessed together with the genetic data to infer evolutionary trajectories. For example, knowing the ages of the isolates, the location of isolates, and the historical movement of host material, can assist in putting the genetic data into perspective.
Many studies rely highly on MLSA data. Results obtained in this study demonstrate that depending on the type and number of genes used, phylogenetic relationships can differ. It seems apparent that including more genes in an MLSA would increase the resolution and provide results comparable to the true nature of the organism being studied; however, there appear to be no standards for MLSA. Additionally, results may vary depending on the PCR primers used to sequence the same gene, as different primers can target completely different regions within the same gene. Arif et al. [35] analyzed partial dnaK and rpoB gene sequences that were amplified using specific primers and saw no difference in sequence among the tested strains; however, this study showed apparent differences, particularly in the rpoB gene, when the full gene sequence was used.
Cumulative data from this study reinforce the existence of these five genetically distinct populations and suggest RT-II and RT-III, and then RT-V, are more closely related to each other than other populations, while RT-I and RT-IV are more closely related to each other than other populations. Cumulative data also suggest that RT-IV and RT-V are the most unrelated among the five populations. The idea that RT-III evolved from RT-II and reached Western Australia from South Australia through the movement of ryegrass is consistent with historical and genetic data; this conclusion was also reached by Davis II et al. [28,30]. Additionally, Davis II et al. [28,30] suggests that R. toxicus tends toward genome reductions. If true, the fact that population RT-IV has the largest genomes compared to other populations, followed by RT-V, would support the idea of R. toxicus populations evolving in Australia from RT-IV, as well as populations RT-II and RT-III arising from RT-V. A comprehensive analysis of CRISPR-Cas system also suggests this pattern of evolution. Although only two strains each were available for populations RT-IV and RT-V, every analysis performed supported their identification as distinct populations.
Apart from the co-existence of RT-I and RT-II on the Yorke Peninsula in South Australia, the remaining R. toxicus populations are geographically isolated; RT-III in Western Australia, RT-IV in New South Wales, and RT-V in a small area of southeast South Australia. This geographic isolation may have contributed to the genetic differences among populations. Future research will involve attempts to collect additional isolates belonging to populations RT-IV and RT-V, as well as regular sampling over time to track evolutionary changes.