Curtobacterium spp. and Curtobacterium flaccumfaciens: Phylogeny, Genomics-Based Taxonomy, Pathogenicity, and Diagnostics

The genus of Curtobacterium, belonging to the Microbacteriaceae family of the Actinomycetales order, includes economically significant pathogenic bacteria of soybeans and other agricultural crops. Thorough phylogenetic and full-genome analysis using the latest genomic data has demonstrated a complex and contradictory taxonomic picture within the group of organisms classified as the Curtobacterium species. Based on these data, it is possible to delineate about 50 new species and to reclassify a substantial part of the Curtobacterium strains. It is suggested that 53 strains, including most of the Curtobacterium flaccumfaciens pathovars, can compose a monophyletic group classified as C. flaccumfaciens. A genomic analysis using the most recent inventory of bacterial chromosomal and plasmid genomes deposited to GenBank confirmed the possible role of Microbacteriaceae plasmids in pathogenicity and demonstrated the existence of a group of related plasmids carrying virulence factors and possessing a gene distantly related to DNA polymerase found in bacteriophages and archaeal and eukaryotic viruses. A PCR diagnostic assay specific to the genus Curtobacterium was developed and tested. The presented results assist in the understanding of the evolutionary relations within the genus and can lay the foundation for further taxonomic updates.


Introduction
Since the definition of the genus Curtobacterium, in 1972 [1], the representatives of these microbacteria have been isolated from numerous plants and environmental samples. For a long time, the genus Curtobacterium comprised eight validated species, recently accompanied by a number of proposed species and unclassified strain groups. Phytopathogenic strains causing the wilting and rotting of various legumes (Fabaceae) and ornamental plants have been established as a separate species, C. flaccumfaciens, and further subdivided into several pathovars, depending on the host plants and physiological properties [2,3]. Meanwhile, most described Curtobacterium spp. are not known to cause any disease of plants from which they were primarily isolated [4]. Many strains have been isolated as endophytes in sugarcane [5], grapevines [6], maize [7], sorghum [8], tomatoes [9], coffee [10], black peppers [11], strawberries [12], citrus fruits [13], poplars [14] and eucalyptus [15] and have been found in oil brines [5] and marine sediments [7].
Therefore, the role of Curtobacterium spp. in plant pathogenesis worldwide and the links between genomic features and virulence with respect to plants used for taxonomic distribution need elaboration and revision.
It is difficult to unravel the structure of the complex genus Curtobacterium using data acquired from previous taxonomic studies. Polyphasic studies largely depend on the 16S rRNA gene [16], but despite their usefulness for resolving taxonomic questions in the past, one or a few household genes contain only a limited number of informative characteristics and, thus, can yield phylogenetic trees that lack the resolution to distinguish between closely related species [17]. Classifications based on whole-genome sequences and associated bioinformatic tools provide a significant change in the reliability of phylogenomic trees [17].
New taxonomic technologies often raise the question as to whether discrepancies associated with earlier phylogenetic structures were caused by conflicts between the phenotype, single genes or entire genomes or by data interpretation, such as those between the different algorithms of a taxonomic analysis [18]. In phylogenetic systematics, only monophyletic taxa can be accepted in taxonomic classifications [19].
The present study was designed to provide an improved framework for the classification of the genus Curtobacterium based on the principles of phylogenetic systematics applied for genome comparison. A comprehensive sampling of publicly available wholegenome sequences of strains representing the genus Curtobacterium was used to construct genome-scale phylogenetic trees and to address the following questions: (a) Is there any conflict between the phylogenies calculated from whole-genome sequences and the current classification of the genus Curtobacterium? (b) Which species within the genus Curtobacterium need to be revised because they are evidently nonmonophyletic?
Another important question concerns the relatedness between the mechanisms of pathogenicity of curtobacteria and their taxonomy and phylogeny. The genomes of Curtobacterium plasmids have been shown to contain possible virulence genes, and the effect of mobile elements and gene exchange could reveal the ways of the emergence of pathogenic strains and pathovars [20][21][22][23]. In this study, Curtobacterium plasmids were studied in the context of the genomics of pathogenicity.
In addition, this research concerns the development of existing diagnostic methods to better correspond to the Curtobacterium taxonomy and genomic data. The abbreviations in the text below are as follows: C.-Curtobacterium, C. f.-Curtobacterium flaccumfaciens, C. fpf -flaccumfaciens pv. f laccumfaciens and C. f. pv.-Curtobacterium flaccumfaciens pathovars.

Genomes' Annotation and Comparison
Bacterial chromosomal and plasmid genomes were downloaded from the NCBI Genome database [24]. Chromosomal genomes were annotated with Prokka [25], using the default settings of Prodigal [26] for finding open reading frames (ORFs) and Barrnap [27] for rRNA gene detection. The plasmid genomes were annotated manually, using Geneiuos [28], Glimmer [29] and Prodigal [26] for ORF detection. Functional assignments of plasmid genes were made using a BLAST homology search [30] on the NCBI nr/nt database and custom databases using a HMM-HMM search with HHpred [31] and Phyre2 [32], and using an InterPro [33] search.
A genome sequence comparison and a visualisation of the annotated plasmid genomes were made with EasyFig [34], applying TBLASTX [30].

ANI Calculation and Clustering
The average nucleotide identity (ANI) was calculated using OrthoANIu [35]. The data obtained were clustered using the Phylogeny.fr server [36], applying the BIONJ algorithm [37] to the distance matrix. The BIONJ dendrogram was visualised using the iTol server [38].
2.3. Phylogenetic Analysis of rRNAs, Ribosomal Proteins, gyrB, parE, rpoA and rpoB Genes sequences of the rRNAs, ribosomal proteins, gyrB, parE, rpoA and rpoB were extracted from the annotated genomes and aligned with MAFFT [39] with L-INS-i algorithms and other default settings. The concatenation of the sequences was carried out using Geneious tools [28]. Phylogenetic trees were constructed using maximum likelihood methods implemented in MEGA X [40] and RAxML [41] software packages, applying Tamura-Nei [42] and General Time Reversible (GTR) [43] nucleotide substitution models.

Phylogenetic Analysis of Core Genome
The core genes' concatenated alignment was obtained with the Panaroo pipeline [44] using Prokka annotated genomes, (-core_threshold = 0.95 -aligner clustal) settings, and with other parameters set to default. Phylogenetic inference was performed by RAxML using the GTR-CAT nucleotide substitution model [41].

Protein Structure Modelling
Protein structure modelling was performed in two steps. First, the structure was predicted using AlphaFold 2.0 [45] with default settings, I-TASSER [46,47] with default settings, Phyre2 [32] in intensive modelling mode and Rosetta [48] using RoseTTAFold modelling mode [49]. The accuracy of prediction of the best-scoring models obtained by the methods listed above was estimated using ModFold8 [50], and the model with the highest global model quality score was used for the refinement of the structure using ReFOLD3 [51] in the next step. The quality of the prediction of the resulting structure was also estimated by ModFold8 [50]. Similar experimentally obtained structures were found by the HHpred [31], Phyre2 [32] and I-TASSER [47] servers. The superimposition and visualisation of the structures were achieved using PyMOL v.2 [52].

Primer Design
The phylogenetic analysis showed that the 23S rRNA sequence could be used to differentiate Curtobacterium from other genera with fairly good reliability. To develop primers, the 23S rRNA sequences of all the available Curtobacterium strains and a set of unrelated strains were aligned using MAFFT [39]. Then, following the alignment, a conserved and unique plot for Curtobacterium was manually determined. Oligonucleotides for the amplification of this region were optimised using Primer3Plus [53].

Bacterial Strains and Growth Conditions
A total of 81 bacterial strains, listed in Table S1, were used in the research. Twenty-two strains were purchased from the All-Russian Collection of Microorganisms (Pushchino, Moscow Oblast, Russia). Another 23 strains were isolated from the affected plants using a semi-selective medium for the isolation of Curtobacterium [54]. The remaining strains were taken from the collection of phytopathogenic bacteria of the Laboratory of Molecular Bioengineering of the Institute of Bioorganic Chemistry of the Russian Academy of Sciences (Moscow, Russia), used in previous research [55,56].
Routine cultivation of the strains was carried out on YD medium (yeast extract-10 g/L, dextrose-20 g/L, and agar 15 g/L) at a temperature of 28 • C. Long-term storage of the strains was carried out at −80 • C in 30% glycerol.

DNA Isolation
A GeneJET Genomic DNA Purification Kit (Thermo Scientific, Waltham, MA, USA) was used to isolate genomic DNA following the manufacturer's instructions. The concentration of purified DNA was measured using a NanoProteometer N60 spectrophotometer (NanoProteometer, Munich, Germany). On average, the DNA amount in the samples was 20 ng.

Polymerase Chain Reaction (PCR)
Conventional PCR was performed using a BioRad T100 ThermalCycler (Bio-Rad, Hercules, CA, USA) in a volume of 25 µL. Each reaction contained five µL of ScreenMix (Evrogen, Moscow, Russia), 0.3 mM of each primer, and 20 ng of template DNA.
The thermal cycling mode was 94 • C for 300 s, followed by 30 cycles of 94 • C for 30 s, 65 • C for 10 s, and 72 • C for 20 s, with a circuit extension at 72 • C for 5 min. PCR products were visualised by electrophoresis (1.5% agarose gel with the addition of ethidium bromide and 1× TAE buffer). A 1-kb ladder length marker (Evrogen) was used to determine the size of the DNA fragments.

Test Plasmid Construction
To construct a test plasmid, the target sequence was first amplified in the Cff VKM Ac-1923 (← DSM 20129) strain. After the resulting PCR, the product was purified using the Qiagen MinElute PCR Purification kit (Qiagen, Dusseldorf, Germany) and ligated into the pDrive vector (Qiagen) using the Qiagen PCR Cloning kit. The correctness of the insertion was checked using Sanger sequencing.

qPCR Conditions
PCR was performed in a final volume of 10 µL. The reaction mixture included 2 µL of Evrogen qPCRmix-HS SYBR, 0.3 mM of each primer, and 20 ng of template DNA. The final concentration of dNTP was 0.12 mM, and the concentration of magnesium was 3 mM. Thermal cycling was performed on a LightCycler 96 (Roche, Basel, Switzerland) in the following mode: 94 • C for 300 s, then 45 cycles of 94 • C for 10 s, 65 • C for 10 s, and 72 • C for 10 s. All experiments were carried out in duplicate per run and repeated twice. Thus, there were four technical replicates. The processing of the amplification curves and calculation of the threshold cycles were carried out using Roche software. Reactions with water were used as a negative control and with a test plasmid as a positive control.
For the experiment to determine the detection sensitivity, tenfold dilutions of the test plasmid and genomic DNA Ac-1923 (← DSM 20129) were prepared. PCR was performed with each dilution in the same way as described above.

Curtobacterium Genomes in GenBank Database, Mislabelled Strains, and Curtobacterium flaccumfaciens Pathogenic Strains
As of October 2021, the NCBI Genome database [24] contained drafts and complete genomes of 191 strains attributed as Curtobacterium, including two metagenome assemblies. Most of the strains were unclassified on the species level; 35 strains were classified as Curtobacterium flaccumfaciens, including 27 pathovars; and seven strains were classified as representatives of seven other Curtobacterium species (C. albidum, C. ammoniigenes, C. citreum, C. herbarum, C. luteum, C. oceanosedimentum, and C. pusillum). The preliminary analysis demonstrated that one of these strains, Curtobacterium S6, had the lowest average nucleotide identity (ANI) value, i.e., about 69% of all the other Curtobacterium strains. The phylogenetic analysis using 16S rRNA and 23S rRNA sequences of the closest homologues belonging to different Microbacteriaceae genera placed strain S6-1 distantly from the Curtobacterium clade (data not shown). Seemingly, this strain has been misclassified and might possibly represent a novel genus. Thus, only 190 genomes can be considered as Curtobacterium spp. (Figure 1). It is difficult to estimate the number of C. flaccumfaciens pathogenic strains deposited in the genome database confidently, since the majority of the deposited strains have not been tested in biological experiments. However, the data available in the literature makes it possible to distinguish at least 28 pathogenic strains (Table 1).

ANI Analysis
Although the current taxonomy still relies on the classification system designed by Carolus Linnaeus, the use of criteria such as the genome index of the average nucleotide identity (ANI) can assist taxonomy [57][58][59]. The proposed minimal standards [60] and the regular practice of species delineation include ANI calculations and a multilocus phylogenetic analysis (MLPA). These standards apply to the ANI cut-off score of >95%, which indicates that strains belong to the same species [61][62][63]. The ANI data correlate well with DNA-DNA hybridisation results, and the recommended cut-off point of 70% DDH for species demarcation corresponds to 95% ANI [64,65]. A number of software packages can be used for calculations of ANI, including orthoANIu [35], Jspecies [65], ANI calculator [66], FastANI [62] and Gegenees [67]. Calculations with different packages can yield slightly different results [61].
In this study, the orthoANIu pipeline was used for the assessment of ANI. The advantage of orthoANIu is fast and accurate calculations due to the employment of the Usearch tool [68]. ANI values have been calculated for all 190 Curtobacterium genomes (Supplementary Figure S2) and used for BIONJ clustering ( Figure 2). The dendrogram places most C. flaccumfaciens strains, including the type strains C. fpf CFBP 3418 and C. f. LMG 3645 and other strains with confirmed pathogenicity, into a distinct clade containing 69 strains with an ANI value above 92.5% compared to C. flaccumfaciens-type strains. The strain labelled as C. f. pv. oortii CFBP 3400, however, was placed in other clades distant from the other C. flaccumfaciens strains. These results indicate that the results of whole-genome based calculations do not match the present taxonomic classification of C. flaccumfaciens strains.

16 S, 23S, and Concatenated Ribosomal RNA Genes Phylogeny
Genomic loci encoding ribosomal RNAs are often used for evolutionary phylogenetic analysis and species delineation [70][71][72]. As well as in other bacteria [73], the 16S rRNA and 23S rRNA genes of Curtobacterium are located in one operon, together with 5S rRNA, and are separated by internal transcribed spacer (ITS) regions. The majority of 16 complete Curtobacterium genomes contain three rRNA operons, and several complete genomes contain four copies of rRNA operons. In the latter case, two operons are located next to each other and are separated by the 1300-2200 base pairs (bp) region and are located in one direction. The average length of Curtobacterium 16S rRNA is about 1530 bases, and the average length of 23S rRNA is about 3120 nucleotides.
The copies of rRNA genes belonging to the same genome are often not identical. To conduct the phylogenetic analysis, the rRNA gene, which has an identical copy (copies), was chosen. If all the sequences were different and shared the same, or the adjacent, clade, the sequences with the least sum of branch lengths to the root of draft trees were chosen for the final tree.
The 16S sequence of C. fpf CFBP 3423 was found to be identical to Moraxella osloensis YV1 16S rRNA, which could have been the result of sample contamination and assembly error. Furthermore, this gene was not located in a common operon, together with the 23S and 5S rRNA genes. One of the 23S sequences of C. sp. HSID17257 was 100% identical to a Cryptococcus neoformans complex 25S rRNA and showed a high level of similarity with rRNA genes found in plasmids (Actinomyces oris strain FDAARGOS_1051 plasmid unnamed, Enterobacter sp. T2 plasmid unnamed, and Acinetobacter baumannii VB2139 plasmid pVB2139_3) and several bacteria.
The results of phylogenetic analysis of 16S rRNA genes belonging to 185 Curtobacterium strains (Supplementary Figures S2 and S3) raised doubts as to the employment of 16S phylogeny for Curtobacterium species delineation. A 16S maximum likelihood phylogeny could not distinguish the clades of Curtobacterium strains reliably-the bootstrap supporting values were often significantly lower than 50%. The use of different phylogeny inferring methods and nucleotide substitution models (RAxML, MrBayes, MEGA, TN93, GTR, TN93 + G + I, GTR + G + I, etc.) did not improve the robustness of inference and resulted in a similar consensus tree topology. This can be due to the low difference between the 16S rRNA sequences-the pairwise identity (the percentage of pairwise residues that are identical in the alignment, including gap versus non-gap residues but excluding gap versus gap residues) of the 16S rRNA alignment was as high as 99.4%.
The phylogenetic analysis of 23S rRNA genes belonging to 178 Curtobacterium strains (Supplementary Figures S4 and S5) seems to have been more informative, due to a bigger difference between the sequences (the pairwise identity of the 23S rRNA alignment was as high as 96.8%). The supporting bootstrap values were higher than in the case of 16S trees, and the composition of the clades revealed was often similar to the composition of ANI clusters ( Figure 2). It seems, however, that the 23S was not able to describe the entire complexity of taxonomic and evolutionary relations within the Curtobacterium genus. The concatenated 16S and 23S rRNA genes' phylogeny (Supplementary Figures S6 and S7) did not provide any improvement compared to the phylogeny of the single genes.

gyrB, parE, rpoA, rpoB, and Concatenated Genes Phylogeny
Conservative genes coding for proteins, mainly related to DNA processing (i.e., replication and transcription) and their concatenation, can be efficiently used for high-resolution phylogenetic analysis [74][75][76][77][78][79]. In this study, the nucleotide sequences of the following genes and their concatenation were used: gyrase subunit B (gyrB), topoisomerase IV subunit B (parE) evolutionarily related to gyrase B, DNA-directed RNA polymerase subunit α (rpoA), and DNA-directed RNA polymerase subunit β (rpoB). All the genes were found in single copies in almost all genome assemblies.
The phylogenetic trees of all the genes listed above (Supplementary Figures S8-S15) featured the overall resolution and bootstrap supporting values that were significantly higher than those in the rRNA phylogenetic trees. The composition of the revealed clades was close to the corresponding clades of the ANI tree, supporting the suggestion of the rarity of horizontal exchange events associated with the genes analysed. The gene parE encoding topoisomerase IV subunit B demonstrated good phylogenetic potential resolving the tree topology slightly better, and with a little better bootstrap support, than the gyrB, rpoA, and rpoB analyses, presumably because of the greater divergence of parE. The pairwise identities of the alignments were 91.4% for gyrB (189 sequences), 89.2% for parE (189 sequences), 96.6% for rpoA (189 sequences), and 95.7% for rpoB (188 sequences). It appears that a low number of variable sites hampers the fidelity and informativeness of the phylogeny.
The multilocus phylogenetic analysis (MLPA) (Supplementary Figures S16 and S17) employed the concatenated alignments of gyrB, parE, rpoA, and rpoB and demonstrated a higher resolution and better bootstrap support than the single genes phylogenies and, moreover, the rRNA phylogenies. The topologies of the MLPA trees and the compositions of the clades supported with high bootstrap values were close to those of the ANI tree ( Figure 2). This concatenated tree did not, however, properly resolve the branches that were close to the root of the tree.

Ribosomal Proteins Phylogeny
The phylogenetic analysis based on the amino acid and nucleic acid sequences of ribosomal proteins (r-proteins) was shown to contain a reliable phylogenetic signal at a wide range of taxonomic depths, which was not significantly affected by mutational saturation or lateral gene transfer [80][81][82][83]. The phylogenetic studies made it possible to efficiently reveal the evolutionary and taxonomic relations between both distant, and closely related, organisms [84][85][86][87].
The resulting best-scoring phylogenetic tree obtained with RAxML ( Figure 3) demonstrated bootstrap support that was significantly higher than that of the rRNA trees (Supplementary Figures S2-S7), four conservative genes' trees, and four concatenated genes' MLPA trees. The composition of clades, including the clade containing the strains classified as C. flaccumfaciens, was very close to the ANI tree ( Figure 2).

Multigene-Based Phylogenomic Analysis
The employment of large-scale phylogenetic analyses involving many orthologous genes provides a substantial number of opportunities for the phylogeny and taxonomy of prokaryotes [60,[88][89][90]. The multigene-based phylogenomic analysis was conducted using the core genes' alignment, obtained with pangenome pipeline Panaroo, which uses a graphbased algorithm to share information between genomes, allowing improvements to annotation calls and the clustering of orthologues and paralogues within the pangenome [44]. The definition of a core (≥0.95) was applied, and 506 genes were found by the pipeline using all 190 reannotated genomes. The total length of the concatenated alignment was 502,785 bases, and the pairwise identity was 82.7%.
The ML best-scoring phylogenetic tree is shown in Figure 4. The tree demonstrates high bootstrap support and a topology that is similar to the topologies of ANI and r-protein trees. The compositions of the clades within the range of 95% ANI species cut-off are identical to those of the ANI tree ( Figure 2).

Possible Taxonomy Revisions Based on ANI and Phylogenetic Analysis
Applying the criterion of full-genome similarity measured by ANI, which requires ≥95% identity and requirements of cladistics, which, in turn, demands the monophyleticity of taxa and using the results obtained by ANI calculations and phylogenetic analysis with high confidence, it is possible to propose the updates in the taxonomy of genus Curtobacterium that are shown in Table 2. Table 2. Suggested Curtobacterium species based on the genomic data.

Species
Strains Starting at the root of the core genome ( Figure 4) or r-protein (Figure 3) trees, the monophyletic group of 53 strains, containing most phytopathogenic strains, can be classified as "Curtobacterium flaccumfaciens" (Figure 5A). The group includes the strains currently classified as C. f., C. fpf, C. f. pv. Betae, C. f. pv. Oortii, and unclassified strains. The type strain can be replaced with another to maintain an ANI that is higher than the 95% threshold. This could be the strain CFBP 3423 or a different one. It appears that several subspecies might be established within this group.

Genomospecies 2
The group of 12 strains comprising the strains currently classified as C. f., C. fpf, C. f. pv. Poinsettiae, and unclassified strains ( Figure 5B). To satisfy the requirement of ANI ≥ 95%, any of these strains can be suggested to be the type strain.

Genomospecies 11-21
The proposed genomospecies 11-20 belong to a single clade of phylogenetic trees (Figures 3 and 4) and have ANI distance values that are large enough to distinguish them ( Figure 5D). These species comprise five strains currently classified as C. pusillum, the strain labelled as C. flaccumfaciens JUb65, and ten unclassified Curtobacterium strains. Taking into account the genetic distances and phylogenetic data, the strains, currently classified as C. pusillum, should be assigned to three distinct species.

Genomospecies 21-24
The phylogenetic trees' clade and ANI matrix cluster ( Figure 5E), comprising proposed genomospecies 21-24, contain seven strains, five of which are currently classified as C. luteum. One strain is unclassified, and one strain represents a metagenome assembly. It appears that the strains, currently classified as C. luteum, should be assigned to two different species.

Genomospecies 26-33
Twenty-three strains, including the strains currently classified as C. albidum, C. citreum, C. oceanosedimentum, an unclassified draft genome, and a metagenome assembly, constitute a clade on phylogenetic trees (Figures 2-4 and 5F) and can be assigned to at least eight genomospecies. It is proposed that the strain C. sp. SGAir0471, as well as two strains labelled as C. oceanosedimentum, should be classified as C. oceanosedimentum, but the closeness of ANI to 95% requires further analysis of the taxonomic positions of C. sp. SGAir0471. It appears that the strains currently classified as C. albidum DSM 20512 and C. citreum should be assigned to one species.

Genomospecies 42-48
Twenty-three strains, including the strains currently classified as C. herbarum and 19 unclassified strains, can be assigned to seven species. It appears that, according to ANI and phylogenetic data, the strains, currently classified as C. herbarum, should be assigned to two distinct species.

Genomospecies 49, 50
These two proposed genomospecies represent deeply rooted branches, which, nevertheless, may be assigned to the genus of Curtobacterium, according to the genomic data.
The ANI values of the representatives of these species are about 78% compared to other proposed species. Genomospecies 49 contains eight unclassified strains ( Figure 5I), and genomospecies 50 contains one unclassified strain.

C. ammoniigenes NBRC 101786
According to the results of core genome phylogeny and ANI clustering, C. ammoniigenes NBRC 101786 represents the strain closest to the root of the trees. The ANI values of C. ammoniigenes NBRC 101786 are about 75-76% compared to other Curtobacterium strains. The taxonomic assignment of C. ammoniigenes should be the subject of a separate discussion. It might be possible to elevate this species to the level of a genus, taking into account biological and biochemical data.

Curtobacterium Plasmid pCff1 and Curtobacterium Plasmids
As of October 2021, there were ten Curtobacterium plasmid genomes in the NCBI Genome database (Table 3). They comprise a giant plasmid pCPAA3 featuring a genome of 567,298-bp size, five relatively large plasmids with genomes of 77-147 kbp, a mediumsized plasmid pTC5 of 42-kbp genome size, and two relatively small plasmids of 22-25 kbp. Interestingly, the two latter plasmids possess a GC content as low as about 32-35%, while the remaining plasmids are characterised by a GC content close to that of most Curtobacterium chromosomes (about 65-72%). The genomes of the plasmids with a low GC contain multiple repeats, CTTT, CCTTTT, and similar, with the overweight of the thymine residues. The plasmids pCff1 and pCff113 attract special interest because of virulence proteins reported to be encoded in their genomes [23]. These plasmids have prominent withingenome similarities, with a linear plasmid, pCSL1, occurring in Clavibacter sepedonicus strain ATCC33113 [23].
The genome of pCff1 was reannotated using Prokka [25] and a thorough BLAST and HMM-HMM motif comparison involving NCBI and custom BLAST databases and databases offered by the HHpred [31], InterPro [33], and Phyre2 [32] servers ( Figure 6). The reannotation predicted 178 open reading frames (ORFs). The list of genes that may be related to virulence includes a pectate lyase gene, two adjacent genes of cellulases and cellulose-binding proteins, six genes of trypsin-like serine proteases, and seven genes of putative hydrocarbon hydrolases. A BLAST search on these genes demonstrated the presence of their homologues in most, or all, pathogenic curtobacteria, while the majority of genes presumably not related with virulence were also found in the genomes of strains with pathogenicity not confirmed ( Figure 6). Interestingly, close homologues of Curtobacterium trypsin proteases were found in pathovars of Xanthomonas campestris and Xanthomonas citri. Homologues of several supposed virulence genes were also found in phytopathogenic Clavibacter, Dickeya, and other phytopathogenic bacteria. Distant homologues of pectate lyase were found to be encoded by the genomes of phytopathogenic nematodes.
The pCff1 genome also contains the genes of conjugation apparatus, a toxin-antitoxin system, DNA repair and restriction, and several transposases.
The sequence search on pCff1 demonstrated the presence of extended homologous regions in chromosomes of 13 Curtobacterium strains with confirmed pathogenicity, Curtobacterium sp. PhB130 and PhB136 chromosomes, and Curtobacterium plasmids pCff113 and pCff119 (Figure 7). In addition, the homology search and sequence alignments demonstrated the partial collinearity of pCff1 with Clavibacter plasmids pCI3 (genome length 80,801 bp), pCM2 (133,237 bp), and pVQ28-1 (89,287 bp).  ; and other bacterial, archaeal, and eukaryotic viruses. Significant similarities between phage ϕ29 and pCff1 gp137 were also shown by the I-TASSER homology modelling server [47] (TM-score 0.844, RMSD 1.35 Å), as well as more distant similarities with other viral DNA polymerases. However, the BLAST search did not reveal homologues of gp137 among the sequences, other than Actinomycete hypothetical proteins.
The structure of gp137 has been predicted using several approaches-homology modelling with Phyre2 and I-TASSER and deep learning algorithms implemented in AlphaFold2 [45] and RoseTTAFold [49]. According to the ModFOLD8 model quality assessment, the latter two demonstrated high predictive accuracy, which far exceeded that of homology modelling. The ModFOLD8 global model quality score was 0.1870 for Phyre2, 0.2188 for I-TASSER, 0.3804 for AlphaFold2, and 0.3720 for RoseTTAFold. The AlphaFold2 model refolded with ReFOLD3 [51] is shown in Figure 8. The final ModFOLD8 global model quality score was 0.3937, with high confidence and a p-value of 4.523 × 10 −3 ; the predicted average residue error was 8.16 Å (Supplementary Figure S18). . The models are coloured based on a rainbow gradient scheme, where the N-terminus of the polypeptide chain is coloured blue, and the C-terminus is coloured red. (B) Superimposition of the predicted structure of putative DNA polymerase from plasmid pCff1, the predicted structure of pCff1 putative DNA polymerase and the experimentally found structure of ϕ29 DNA polymerase complexed with DNA using the protein surfaces. The arrows indicate the presence of similarly located cavities and tunnels in both proteins.
A comparison of the predicted structure of plasmid pCff1 gp137 and phage ϕ29 demonstrates an overall similarity and the presence of similarly located cavities and tunnels in both proteins, which can be used in DNA binding and processing.
It is noteworthy that the genome of giant plasmids pCPAA3 also encodes for putative DNA polymerases. The BLAST search and HMM-HMM motif comparison predicted the presence of the genes of DNA polymerase III subunits α, ε, and κ. Furthermore, the pCPAA3 genome encodes a protein showing distant homology with the temperate Siphoviridae Clavibacter phage CN1A large subunit of terminase.

Curtobacterium PCR Diagnostics and Genus-Specific Primers
The unavailability of a selective medium for isolating Curtobacterium causes substantial difficulties for isolating new strains and studies of the biodiversity of Curtobacterium sp. The use of a semi-selective medium, in practice, still led to the isolation of visually similar bacterial strains from plants that did not belong to Curtobacterium sp. To simplify the selection procedure for field isolates of Curtobacterium, a set of genus-specific primers are proposed for the initial assessment of isolated samples using the qPCR method.
For the development of a diagnostic kit, a 23S rRNA region was chosen, which seemed to be conservative for Curtobacterium and differed from the outer groups. To amplify the selected genus-specific region, oligonucleotides were designed, the sequences of which are shown in Table 4. To check the correctness of the identification with the designed primers, the assay was tested on the set of the strains indicated in Table S2.
The qPCR results are shown in Table S2, and an example of the amplification curves is shown in Figure 9. As can be seen from the data obtained, a positive signal was observed for all Curtobacterium strains, while no amplification was observed for non-Curtobacterium isolates. According to the results obtained, amplification proceeded quite selectively, making it possible to differentiate Curtobacterium isolates from other microbiota found in infected plants, including closely related Microbacteria.
Additionally, for the qPCR detection system, the efficiency and sensitivity of the detection were evaluated. For greater accuracy in calculating the number of copies in the medium, a test plasmid was constructed. For this purpose, an amplified fragment of the target 23S rRNA region was ligated to the pDrive vector. With the resulting plasmid and the genomic DNA of the C. flaccumfaciens Ac-1923 (← DSM 20129) strain, serial tenfold dilutions and qPCR were conducted. The amplification curves for the plasmid experiment are shown in Figure 10. 21.8 ----- Figure 10. Amplification curves of tenfold dilutions of the test plasmid. The numbers represent the corresponding dilution shown in Table 5.
The Cq values obtained made it possible to plot a graph of the dependence of the threshold cycle on the logarithm of the DNA concentration per reaction. The data obtained are presented in Table 5 and Figure 11. The figure shows that the standard curves obtained were linear, with slopes of 3.34 and 3.4 for the plasmid and genomic DNA, respectively. Thus, the PCR efficiency was 99.25 and 96.76%. The LoD in both cases was comparable and amounted to ≈10 3 cfu/mL, which is a normal sensitivity for such detection systems.

Challenges of Curtobacterium Taxonomic Classification
The genus Curtobacterium (family Microbacteriaceae) was defined by Yamada and Komagata in 1972 for a group of motile Brevibacteria [1]. It comprises a wide range of bacteria isolated from different environments and plants, including eight approved species:

1.
C. albidum (Komagata and Iizuka 1964 Other species have been proposed but not yet approved, such as C. glycinis from Glycine max, C. gossypii from Gossypium hirsutum, and C. oryzae sp. from Oryza sativa [96]. In the current work, 190 genomes of Curtobacterium spp. available at the NCBI Genome database were analysed using different bioinformatic methods. Deposited complete and draft genomes include 35 strains classified as C. flaccumfaciens, only 27 of which are identified as C. flaccumfaciens pathovars with confirmed virulence to a certain host plant. Seven genomes were classified as being representatives of other Curtobacterium species: C. albidum, C. ammoniigenes, C. citreum, C. herbarum, C. luteum, C. oceanosedimentum, and C. pusillum ( Figure 1). Thus, 77.9% (148) of available genomes of Curtobacterium spp. have not yet been identified to a certain species. The relatively large number of sequenced genomes of C. flaccumfaciens (18.4%) can be explained by the status of C. fpf as an emerging and regulated plant pathogen that is spreading rapidly worldwide and that occurs worldwide in legume-producing countries.
The results of a phylogenetic analysis using ANI and conventional 16S rRNA and 23S rRNA genes showed limited applicability to the definition of species within the genus Curtobacterium. It is probable that this limitation is the major reason for the uncertain taxonomic positioning of 148 analysed genomes.
Using the sequences of concatenated conservative genes coding for DNA processing proteins (gyrB, parE, rpoA, and rpoB) produced phylogenetic trees with significantly higher resolution and bootstrap supporting values than the rRNA phylogenetic analysis. The topology of these trees (Supplementary Figures S8-S15) suggests the rare occurrence of horizontal exchange events among the chosen genes.
A phylogenetic analysis based on concatenated nucleotide sequences of 46 ribosomal proteins (r-proteins) extracted from reannotated genomes was used. The resulting bestscoring phylogenetic tree obtained with RAxML ( Figure 4) demonstrated bootstrap support that was significantly higher than that of all other trees (Supplementary Figures S2-S7). The composition of the clades, including the clade containing the strains classified as Curtobacterium flaccumfaciens, was very close to that of the ANI tree ( Figure 2).
Applying the criterion of full-genome similarity as measured by ANI, which requires ≥95% identity and requirements for the monophyleticity of taxa, and using the results obtained by ANI calculations and phylogenetic analysis with a high level of confidence, a more detailed taxonomy of genus Curtobacterium with 51 potential genomospecies can be proposed (Table 2). A group of 53 monophyletic strains can be assigned to the species C. flaccumfaciens, which can include both pathogenic and non-pathogenic strains. All results presented in this work indicate the necessity for taxonomic revisions within the genus of Curtobacterium and correlate with previous genomic studies.
Chen et al. [22] analysed 45 strains from the NCBI GenBank database designated as Curtobacterium spp. The ANI values in pairs of these strains varied from 75% to 99% and, using standard criteria based on ANI and dDDH [97], showed that only four strains belonged to C. flaccumfaciens sensu stricto among the evaluated genome sequences. The strains UNCCL17 and MCBA15-005 were phylogenetically closely related to C. flaccumfaciens, but it is likely that they belong to a novel species. According to Reference [22], sensu stricto C. flaccumfaciens strains were clustered in a monophyletic clade showing 97% ANI with one another.
MLSA based on housekeeping genes recA, gyrB, ppK, atpD, dnaK, and rpoB was used to investigate the phylogenetic relationships among 84 strains of C. flaccumfaciens, which were distributed among four pathovars (C. flaccumfaciens pv. flaccumfaciens, poinsettiae, oortii, and betae) and isolated from host plants over a period of 77 years from North America, South America, Europe, and Asia. The C. flaccumfaciens strains were grouped into three main clusters. Colony colour aside, the three clusters did not correlate with pathovar affiliation, isolation date, geographical location, or isolation host. Thus, the pv. flaccumfaciens strains, pathogenic on beans, were scattered among the three clusters as diverse as other Curtobacterium species. Strains from the same pathovars (hosts of isolation) were placed in different clusters, irrespective of their geographical origin [98].
A core genome phylogeny of 50 genomes [99] provided evidence that Curtobacterium spp. strains with glycosyl hydrolases (GHs) and, thus, with the potential for being degraders of cellulose and other polysaccharides, were not phylogenetically related to the type strain of C. flaccumfaciens, because they shared only 85% ANI. Furthermore, strains GD1, BH-2-1-1, MCBA15_013, and YR515 probably belong to separate species that are currently unnamed.
Chen et al. [22] evaluated 45 bacterial strains designated as C. flaccumfaciens in the literature and found the genetic diversity within the group to be greater than had so far been described. Only two strains, MMLR14-002 and MMLR14-014, were identified as being members of C. flaccumfaciens, while the remaining strains needed to be reclassified as novel taxa.
Recently, MLST, using the sequences of five housekeeping genes (i.e., atpD, gyrB, ppk, recA, and rpoB), revealed that the three strains, MCBA15-007, MEB126, and UCD-AKU, were phylogenetically closely related to the type strain of the poinsettia (Euphorbia pulcherrima) pathogen (pv. poinsettiae) ICMP 2566T, while the strains MCBA15-005 and UNCCL17 were phylogenetically closely related to the type strain of the sugar beet pathogen (pv. betae) ICMP 2594T [21]. The authors indicated that the plant pathogenic members of C. flaccumfaciens must be attributed to different species, reinforcing the need to reconsider the taxonomy of phytopathogenic members of the species. Thus, previous evaluations of Curtobacterium strains using genome sequences or MLST/MLSA have illustrated the presence of strain groups that are different enough from type strains of the eight validated species for new unnamed species to be claimed.
Jain et al. [62] computed the pairwise average nucleotide identity (ANI) of 91,761 microbial genomes and found that the ANI values calculated from the 8 billion comparisons showed a strong bimodal distribution concentrated at 83 and 95%, with a wide gap between these two peaks. The authors concluded that a clear genetic discontinuum and species boundary were evident from the unprecedented large-scale ANI analysis and claimed that the 95% ANI threshold represented an accurate threshold for demarcating almost all currently named prokaryotic species. Murray et al. [100] argued that the creation of a universal genetic boundary among the named species in the current NCBI taxonomy was questionable and that it resulted from substantially biased sampling in genome sequencing. They urged caution against being excessively confident in using 95% ANI for microbial species delineation, since the high benchmarks reported in the paper were inflated by the use of highly redundant genomes. Microbes occupying narrow ecological niches and with a limited dispersal rate (e.g., obligate intracellular bacteria) may develop genetic clusters. Free living microbes exploring different habitats are more likely to exhibit a genetic continuum. Selection is also unlikely to produce a universal genetic boundary, as microbial species are unique in nature, with each species subject to its own evolutionary and ecological forces [100].
In answer to the critics of the general concept of Jain et al. [62], the authors said that some bacterial populations "show large intrapopulation sequence diversity, probably due to unique ecological niche(s) they have occupied for long evolutionary times compared to other marine taxa (i.e., they lack direct competition), and thus this threshold is around 90-92% ANI for these populations. In contrast, several more recently emerged pathogens like Bacillus anthracis show limited intrapopulation species diversity (ANI values > 99%). Hence, the area of genetic discontinuity may vary, depending on the taxa considered and their unique ecophysiologic and evolutionary characteristics, and 95% ANI appears to be the genetic level that distinguishes most natural discrete populations and named species, but not necessarily all." It was concluded that, "for these reasons, taxon descriptions should not be based on a single metric or threshold but the careful investigation of ecological and functional data together with genetic relatedness (e.g., ANI values)" [101].

Curtobacterium Plasmids and Pathogenicity
The occurrence of pCff1-related plasmids in the pathogenic strains of Curtobacterium and Clavibacter strains appears to be related to their virulence. The important role of plasmids has been shown for many phytopathogenic bacteria [102][103][104][105]. The analysis of, and homology search on, the genes of those plasmids indicated their presence mainly in virulent strains with pathogenicity confirmed. The occurrence of putative virulence genes in strains with no confirmed pathogenicity may be explained by the lack of experimentation on their pathogenic behaviour, their past lifestyle, and other benefits from the presence of such genes. For example, the presence of enzymes that assist in the use of plant cell walls and polymers present in the integuments of hosts is related to both saprotrophic nutrition and pathogenic lifestyle [106,107]. However, if such enzymes are useful for saprophytes, they may be obligate for pathogens.
Discovery of the gene that encodes the protein structurally similar to viral DNA polymerase I (which belongs to family B DNA polymerases) raises important questions about the origin and evolution of pCff1-related plasmids. A plasmid-encoded DNA polymerase is an unusual event for bacteria and archaea, but several incidents, including actinomycetes, have been reported [108][109][110][111][112]. Plasmid DNA polymerase genes are not unusual for plasmids of fungal mitochondria [113][114][115][116][117][118], including the mitochondria of a phytopathogenic fungus Claviceps purpurea [119]. Additionally, DNA polymerases are encoded in some temperate bacteriophages like N15, which persist like a linear plasmid in a host cell [120]. It is noteworthy that mitochondrial plasmid and phage N15 polymerases appear to belong to family B DNA polymerases.
The homologues of pCff1 putative DNA polymerase have been detected only in chromosomal and plasmid genomes of actinomycetes. The relatedness of pCff1 putative DNA polymerase with the DNA polymerases of bacteriophages and archaeal and eukaryotic viruses have been revealed with the HMM-HMM motif comparison and structure modelling. These analyses have demonstrated that pCff1 putative DNA polymerase is more similar to the phage DNA polymerase. It is possible to hypothesise about the comparatively ancient origin and divergence of this protein. It is necessary to study the functioning of this enzyme experimentally. It might be suggested that the pCff1 putative DNA polymerase I could participate in the initial stages of replication, as has been reported in the context of some other plasmids [121][122][123].

Curtobacterium spp. PCR Diagnostics
The current study has seen, for the first time, the development of a genus-specific primer set for the diagnosis of Curtobacterium. The set makes it possible to differentiate Curtobacterium from representatives of other microbiota of an infected plant, which is an urgent applied problem. Primers for the species-specific detection of Cff have previously been described in the literature [124], but there have been no kits for genus-specific diagnostics. The high level of efficiency of PCR with the test system developed and with a sufficiently low detection limit have been shown experimentally. The values are comparable to those for similar qPCR systems for other genera of phytopathogens [125]. The resulting test system can become a valuable and easy-to-use tool for rapidly assessing the strains isolated from a diseased plant. In addition, the sensitivity of detection should allow the use of this method for assessing plant extracts for the presence of a pathogen, even in asymptomatic plants.

Conclusions
ANI comparisons and the phylogenetic analysis of ribosomal and core proteins testify to the necessity for global taxonomic revisions within the genus of Curtobacterium. Based on these data, it is possible to discuss the delineation of up to several dozen genomospecies. A monophyletic group of 53 strains can be assigned to the species of Curtobacterium flaccumfaciens, which can include both pathogenic and non-pathogenic strains. The pathogenicity of Curtobacterium and other Microbacteriaceae can be related to a group of plasmids carrying virulence factors and featuring the presence of a gene distantly related to viral DNA polymerase I. The recent increase in genomic data challenges widespread diagnostic methods. The presented genus-specific PCR diagnostic kit developed and tested in the present work may serve as a good complementary tool for further studies of Curtobacterium strains Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/cimb44020060/s1: Figure S1: ANI matrix using 190 Curtobacterium genomes obtained with orthoANIu and clustered using BioNJ. Figure S2: Consensus tree obtained with MEGA using 16S rRNA genes of representatives of Curtobacterium and Gryllotalpicola ginnsengisoli DSM 22003 used as an outgroup. Bootstrap support values are shown near their branches as a percentage of 1000 replicates. The branches with bootstrap support of less than 50% were collapsed. Figure S3: Bestscoring tree obtained with RAxML using 16S rRNA genes of representatives of Curtobacterium and Gryllotalpicola ginnsengisoli DSM 22003 used as an outgroup. Bootstrap support values are shown near their branches as a percentage of 1000 replicates. The scale bar shows 0.005 estimated substitutions per site. Figure S4: Consensus tree obtained with MEGA using 23S rRNA genes of representatives of Curtobacterium and Gryllotalpicola ginnsengisoli DSM 22003 used as an outgroup. Bootstrap support values are shown near their branches as a percentage of 1000 replicates. The branches with bootstrap support of less than 50% were collapsed. Figure S5: Best-scoring tree obtained with RAxML using 23S rRNA genes of representatives of Curtobacterium and Gryllotalpicola ginnsengisoli DSM 22003 used as an outgroup. Bootstrap support values are shown near their branches as a percentage of 1000 replicates. The scale bar shows 0.01 estimated substitutions per site. Figure S6: Consensus tree obtained with MEGA using concatenated 16S and 23S rRNA genes of representatives of Curtobacterium and Gryllotalpicola ginnsengisoli DSM 22003 used as an outgroup. Bootstrap support values are shown near their branches as a percentage of 1000 replicates. The branches with bootstrap support of less than 50% were collapsed. Figure S7: Best-scoring tree obtained with RAxML using concatenated 16S and 23S rRNA genes of representatives of Curtobacterium and Gryllotalpicola ginnsengisoli DSM 22003 used as an outgroup. Bootstrap support values are shown near their branches as a percentage of 1000 replicates. The scale bar shows 0.01 estimated substitutions per site. Figure S8: Consensus tree obtained with MEGA using gyrB nucleotide sequences of representatives of Curtobacterium and Gryllotalpicola ginnsengisoli DSM 22003 used as an outgroup. Bootstrap support values are shown near their branches as a percentage of 1000 replicates. The branches with bootstrap support lower than 50% were collapsed. Figure S9: Best-scoring tree obtained with RAxML using gyrB nucleotide sequences of representatives of Curtobacterium and Gryllotalpicola ginnsengisoli DSM 22003 used as an outgroup. Bootstrap support values are shown near their branches as a percentage of 1000 replicates. The scale bar shows 0.05 estimated substitutions per site. Figure S10: Consensus tree obtained with MEGA using parE nucleotide sequences of representatives of Curtobacterium and Gryllotalpicola ginnsengisoli DSM 22003 used as an outgroup. Bootstrap support values are shown near their branches as a percentage of 1000 replicates. The branches with bootstrap support of less than 50% were collapsed. Figure S11: Best-scoring tree obtained with RAxML using parE nucleotide sequences of representatives of Curtobacterium and Gryllotalpicola ginnsengisoli DSM 22003 used as an outgroup. Bootstrap support values are shown near their branches as a percentage of 1000 replicates. The scale bar shows 0.1 estimated substitutions per site. Figure S12: Consensus tree obtained with MEGA using rpoA nucleotide sequences of representatives of Curtobacterium and Gryllotalpicola ginnsengisoli DSM 22003 used as an outgroup. Bootstrap support values are shown near their branches as a percentage of 1000 replicates. The branches with bootstrap support of less than 50% were collapsed. Figure S13: Best-scoring tree obtained with RAxML using rpoA nucleotide sequences of representatives of Curtobacterium and Gryllotalpicola ginnsengisoli DSM 22003 used as an outgroup. Bootstrap support values are shown near their branches as a percentage of 1000 replicates. The scale bar shows 0.05 estimated substitutions per site. Figure S14: Consensus tree obtained with MEGA using rpoB nucleotide sequences of representatives of Curtobacterium and Gryllotalpicola ginnsengisoli DSM 22003 used as an outgroup. Bootstrap support values are shown near their branches as a percentage of 1000 replicates. The branches with bootstrap support of less than 50% were collapsed. Figure S15: Best-scoring tree obtained with RAxML using rpoB nucleotide sequences of representatives of Curtobacterium and Gryllotalpicola ginnsengisoli DSM 22003 used as an outgroup. Bootstrap support values are shown near their branches as a percentage of 1000 replicates. The scale bar shows 0.05 estimated substitutions per site. Figure S16: Consensus tree obtained with MEGA using gyrB, parE, rpoB, and rpoB concatenated nucleotide sequences of representatives of Curtobacterium and Gryllotalpicola ginnsengisoli DSM 22003 used as an outgroup. Bootstrap support values are shown near their branches as a percentage of 1000 replicates. The branches with bootstrap support lower than 50% were collapsed. Figure S17: Best-scoring tree obtained with RAxML using gyrB, parE, rpoB, and rpoB concatenated nucleotide sequences of representatives of Curtobacterium and Gryllotalpicola ginnsengisoli DSM 22003 used as an outgroup. Bootstrap support values are shown near their branches as a percentage of 1000 replicates. The scale bar shows 0.1 estimated substitutions per site. Figure S18: (A) Predicted structure of putative DNA polymerase from C. fpf P990 plasmid pCff1 obtained with Alphafold and refined with ReFOLD coloured based on a rainbow gradient scheme, where the N-terminus of the polypeptide chain is coloured blue, and the C-terminus is coloured red. (B) ModFOLD 3D view of per-residue accuracy of the model of putative DNA polymerase from C. fpf P990 plasmid pCff1. The model is coloured based on a rainbow gradient scheme, where the residues with the lowest predicted residue errors are coloured blue, and the residues with the highest predicted residue errors are coloured red. (C) ModFOLD residue error plot. Table S1: List of bacterial strains used for the PCR development. Table S2: List of Curtobacterium genomes deposited in the NCBI GenBank database, as of October 2021.

Conflicts of Interest:
The authors declare no conflict of interest.