Retrospective Definition of Clostridioides difficile PCR Ribotypes on the Basis of Whole Genome Polymorphisms: A Proof of Principle Study

Clostridioides difficile is a cause of health care-associated infections. The epidemiological study of C. difficile infection (CDI) traditionally involves PCR ribotyping. However, ribotyping will be increasingly replaced by whole genome sequencing (WGS). This implies that WGS types need correlation with classical ribotypes (RTs) in order to perform retrospective clinical studies. Here, we selected genomes of hyper-virulent C. difficile strains of RT001, RT017, RT027, RT078, and RT106 to try and identify new discriminatory markers using in silico ribotyping PCR and De Bruijn graph-based Genome Wide Association Studies (DBGWAS). First, in silico ribotyping PCR was performed using reference primer sequences and 30 C. difficile genomes of the five different RTs identified above. Second, discriminatory genomic markers were sought with DBGWAS using a set of 160 independent C. difficile genomes (14 ribotypes). RT-specific genetic polymorphisms were annotated and validated for their specificity and sensitivity against a larger dataset of 2425 C. difficile genomes covering 132 different RTs. In silico PCR ribotyping was unsuccessful due to non-specific or missing theoretical RT PCR fragments. More successfully, DBGWAS discovered a total of 47 new markers (13 in RT017, 12 in RT078, 9 in RT106, 7 in RT027, and 6 in RT001) with minimum q-values of 0 to 7.40 × 10−5, indicating excellent marker selectivity. The specificity and sensitivity of individual markers ranged between 0.92 and 1.0 but increased to 1 by combining two markers, hence providing undisputed RT identification based on a single genome sequence. Markers were scattered throughout the C. difficile genome in intra- and intergenic regions. We propose here a set of new genomic polymorphisms that efficiently identify five hyper-virulent RTs utilizing WGS data only. Further studies need to show whether this initial proof-of-principle observation can be extended to all 600 existing RTs.

genomes obtained from Creighton University, 23 C. difficile genomes of five selected RTs were downloaded from NCBI to verify in silico amplification of the ISR region (Supplementary Table S1).

DBGWAS-Mediated Discovery New RT-Specific Markers
A total number of 160 C. difficile genome assemblies (training set) of 14 different RTs including hyper-virulent RT001, RT017, RT027, RT078, and RT106 were used for the discovery of unique RT genomic markers ( Table 2). This small training set allowed for the development of discriminatory markers to characterize the five major RTs among the 14 different RTs. Genomes were collected from the National Center for Biotechnology and Information (NCBI; www.ncbi.nlm.nih.gov), Creighton University, and the Enterobase databases (https://enterobase.warwick.ac.uk). Metadata of these genomes are summarized in Supplementary Table S2. To identify associations between variant genetic loci and PCR RT, a hypothesis-free DBGWAS method was used. DBGWAS defines genetic variants linked to phenotypic traits via single nucleotide polymorphism (SNP), insertions, deletions, and consequences of recombination [41,42]. We used an open source tool (https://gitlab.com/leoisl/dbgwas) [40]. The tool is able to cover variants in coding as well as non-coding regions of bacterial genomes. DBGWAS was performed keeping the tool parameters in the default setting for different C. difficile ribotypes (RT001, RT017, RT027, RT078, and RT106) and their RT-specific genetic variants observed in the training set. Each C. difficile RT was considered independently in this study. DBGWAS identified short signature sequences called (overlapping) k-mers, yielding a compact summary of all variations across a set of genomes [40]. Overlapping k-mers are called unitigs and were selected on the basis of their specific and unique presence in a particular RT. Q-values define test sensitivity and specificity and are Benjamini-Hochberg-transformed p-values for controlling the false-positive results in case of multiple testing [40,43]. R scripting was used to deal with large matrices defining the presence (1) or absence (0) of extracted unitigs in the training set of C. difficile genomes.

Validation of Markers
Validation of novel unitig markers was performed by means of BLAST searches against the test set of genomes (Table 3). A wide range of 2425 genomes covering 132 different C. difficile RTs was downloaded [44] and processed using a Linux shell script. These sequences represented PCR ribotyped strains from different countries and clinical and environmental specimens for which phylogenetic analyses were already performed by Frentrup et al. [44] (Supplementary Table S3). A database of this test set was created to perform local command line BLAST searches against the set of significant unitigs identified above. The specificity of all the unitigs was tested using strict parametric filters of 100% coverage and identity.

Statistically Reliable Ribotype Prediction
To evaluate the potential typing significance of the unitigs as compared to the classical ribotyping of C. difficile strains, sensitivity and specificity (selectivity) were computed for all the unitigs [45]. The efficiency of GWAS can be measured by assessing the false discovery rate (FDR) [46]. To increase the potential typing significance of our new method, combination statistics were performed. Sensitivity and specificity were also computed for certain combinations of two or more selected unitig sequences. The parameters defined were, next to the FDR, TP (true-positives, correctly predicting positive values, e.g., number of true RT017 predicted as RT017), FP (false-positives, missed negative values, e.g., number of non-RT017 genomes still predicted as RT017), FN (false-negatives, missed positive values), and TN (true-negatives, correctly rejected values).

Functional Annotation of Unitigs
Selected unitigs were annotated using BLASTn alignment. Well-characterized C. difficile genomes were used as a reference to locate these new markers. Specific annotation for each marker was filtered out using minimum E-value, 100% identity, 100% coverage, and 0 gap score.

In Silico PCR
The visualization of amplified DNA sequences from the intergenic region between 16S and 23S ribosomal genes is the current Gold Standard for C. difficile typing [25,47]. In our study, in silico PCR for 30 randomly selected, well-characterized C. difficile genome sequences was essentially unsuccessful ( Figure 1 and Supplementary Table S1). Genome sequences included generated insufficient numbers of differently sized fragments. The fragment sizes that were calculated were verified with the online tool available at http://insilico.ehu.es/PCR [10]. On the other hand, more recently sequenced C. difficile genomes were showing only one or even none of the expected amplified fragments (Supplementary  Table S1). There is a substantial possibility that the PCR ribotyping fragments observed upon laboratory experimentation for these strains may not derive from ISR variants but rather from random amplification. Thus, in silico PCR failed to generate reliable RTs which prompted us to explore the feasibility of DBGWAS-based typing. Of note, we presume here that NGS-based methods are very likely to be more reliable than any of the many other molecular typing methods.

Functional Annotation of Unitigs
Selected unitigs were annotated using BLASTn alignment. Well-characterized C. difficile genomes were used as a reference to locate these new markers. Specific annotation for each marker was filtered out using minimum E-value, 100% identity, 100% coverage, and 0 gap score.

In Silico PCR
The visualization of amplified DNA sequences from the intergenic region between 16S and 23S ribosomal genes is the current Gold Standard for C. difficile typing [25,47]. In our study, in silico PCR for 30 randomly selected, well-characterized C. difficile genome sequences was essentially unsuccessful ( Figure 1 and Supplementary Table S1). Genome sequences included generated insufficient numbers of differently sized fragments. The fragment sizes that were calculated were verified with the online tool available at http://insilico.ehu.es/PCR [10]. On the other hand, more recently sequenced C. difficile genomes were showing only one or even none of the expected amplified fragments (Supplementary Table S1). There is a substantial possibility that the PCR ribotyping fragments observed upon laboratory experimentation for these strains may not derive from ISR variants but rather from random amplification. Thus, in silico PCR failed to generate reliable RTs which prompted us to explore the feasibility of DBGWAS-based typing. Of note, we presume here that NGS-based methods are very likely to be more reliable than any of the many other molecular typing methods. Bar graphs show the number of theoretical PCR bands (vertical axis, number of bands labeled on each bar) in the ribosomal region of respective genome sequences (horizontal axis), whereas the genomes without any fragments depict the complete absence of primer binding sites in those genomes. Note that the expected outcome would be an identical number of fragments for each of the strains belonging to a single ribotype. We indicated this number as the first marker of reproducibility; it has to be stated that besides this variation of numbers of fragments, the size of the fragment was also determined as a variable as well.

New Genotyping Markers
A total number of 47 RT-specific unitigs (13 for RT017, 12 for RT078, 9 for RT106, 7 for RT027, and 6 for RT001) were identified. The unitigs shared an average length of 56 base pairs (Table 4). DBGWAS Bar graphs show the number of theoretical PCR bands (vertical axis, number of bands labeled on each bar) in the ribosomal region of respective genome sequences (horizontal axis), whereas the genomes without any fragments depict the complete absence of primer binding sites in those genomes. Note that the expected outcome would be an identical number of fragments for each of the strains belonging to a single ribotype. We indicated this number as the first marker of reproducibility; it has to be stated that besides this variation of numbers of fragments, the size of the fragment was also determined as a variable as well.

New Genotyping Markers
A total number of 47 RT-specific unitigs (13 for RT017, 12 for RT078, 9 for RT106, 7 for RT027, and 6 for RT001) were identified. The unitigs shared an average length of 56 base pairs (Table 4). DBGWAS generated compacted De Bruijn graphs (cDBG) containing the specific unitigs as nodes defining a genotypic association between a particular RT and the C. difficile genomes included ( Figure 2). Unitigs that were specific for a particular RT were color-coded according to their association to the RT (red for positive association, blue for negative association) and minimum q-values were provided by subgraphs. Q-values for selected unitigs ranged from 0 to 7.40 × 10 −5 . Significantly associated unitigs were extracted as FASTA formatted sequences (Supplementary Table S4).  ) and completely absent in the other ribotypes in the training set (Pheno 0). Additionally, the q-value linked to the first node is very significantly below 0.05 and hence, the estimated effect is high (represented by the red color of the node).

Validation of Markers
Unitigs showing 100% identity in all genomes belonging to a single RT in the validation set demonstrated the efficiency of these unique patterns to carry out in silico ribotyping. Although the individual unitig-based characterization of C. difficile strains was not absolute, it allowed RT determination with approximate sensitivity and specificity of between 0.90 and 1.0 ( Figure 3). FDR for all the unitigs for RT017 was the lowest (0.06) followed by RT027 (0.08), RT078 (0.23), RT001 (0.30), and RT106 (0.46) (Figure 3).  1 in the table) and completely absent in the other ribotypes in the training set (Pheno 0). Additionally, the q-value linked to the first node is very significantly below 0.05 and hence, the estimated effect is high (represented by the red color of the node).

Validation of Markers
Unitigs showing 100% identity in all genomes belonging to a single RT in the validation set demonstrated the efficiency of these unique patterns to carry out in silico ribotyping. Although the individual unitig-based characterization of C. difficile strains was not absolute, it allowed RT determination with approximate sensitivity and specificity of between 0.90 and 1.0 ( Figure 3). FDR for all the unitigs for RT017 was the lowest (0.06) followed by RT027 (0.08), RT078 (0.23), RT001 (0.30), and RT106 (0.46) (Figure 3). , which are denoted by their estimated effect ranging from high (28.304; red) to low (4.00; blue). Allele frequency is represented by the size of the node. The table explains that from the two selected significant nodes in terms of their association with ribotype, the node on the top right (n180654) is specific to RT001 (called Pheno 1 in the table) and completely absent in the other ribotypes in the training set (Pheno 0). Additionally, the q-value linked to the first node is very significantly below 0.05 and hence, the estimated effect is high (represented by the red color of the node).

Validation of Markers
Unitigs showing 100% identity in all genomes belonging to a single RT in the validation set demonstrated the efficiency of these unique patterns to carry out in silico ribotyping. Although the individual unitig-based characterization of C. difficile strains was not absolute, it allowed RT determination with approximate sensitivity and specificity of between 0.90 and 1.0 ( Figure 3). FDR for all the unitigs for RT017 was the lowest (0.06) followed by RT027 (0.08), RT078 (0.23), RT001 (0.30), and RT106 (0.46) (Figure 3). Some of the unitigs were shared by closely related RTs. Unitigs for RT001 were able to identify the genetically closely related RT087, RT241, and RT012, which altogether form a clonal complex (CC) 141 [44]. One of the markers identified for RT017 showed no false-positives or false negatives. Other  Some of the unitigs were shared by closely related RTs. Unitigs for RT001 were able to identify the genetically closely related RT087, RT241, and RT012, which altogether form a clonal complex (CC) 141 [44]. One of the markers identified for RT017 showed no false-positives or false negatives. Other markers for RT017 initially generated a small number of false-positives, but 100% true-positives in the validation dataset. Markers for RT078 identified 78 out of 79 isolates of RT126 and all of the RT413 strains from the test dataset, likely due to the close genetic relatedness of these RTs (CC 1) [44,48,49]. Unitig sequences for RT106 were also able to identify C. difficile RT500 along with RT106 from the test set. Phylogenetic grouping of C. difficile genomes [44] showed that C. difficile core genome multi locus sequence typing (cgMLST) of RT106 and RT500 (CC 22) generated completely indistinguishable groupings. Considering closely related strains as true-positives based on their respective RTs, the FDRs for each unitig subset were found to be smaller, underscoring the biological consistency of the results. Adding genomes of RT413, RT126, and RT500 to the training set resulted in a decreased FDR rate. The continuously increasing number of publicly available C. difficile genome sequences will provide substantial opportunities for improvement of our new characterization technique.

Marker Combination Study
For the ribotypes RT027, RT078, RT106, and RT001, every possible combination of RT-specific unitigs was created and tested for statistical significance. A combination of two unitigs was shown to increase sensitivity and specificity up to 1 and to reduce the FDR to 0.05 ( Figure 4A-D). Each combination was defined on the basis of logical operators "AND/OR". The AND operator symbolizes that both the markers in a combination need to be present with 100% identity, whereas the OR operator means that any one of the two markers in a combination need to be present at one time, again with 100% sequence identity. There is no combination required in the case of RT017. Conclusively, as clearly exemplified in Figure 4A-D, in certain cases, the combination of markers improves RT testing by suppression of the false discovery rate. Marker's SEQ ID numbers and their sequences are given in Supplementary  Table S4.
Diagnostics 2020, 10, x FOR PEER REVIEW 8 of 15 Figure 3. Statistical comparison of genome typing efficiency of discovered unique patterns for selected C. difficile ribotypes in terms of specificity, sensitivity, and false discovery rate (FDR).
Some of the unitigs were shared by closely related RTs. Unitigs for RT001 were able to identify the genetically closely related RT087, RT241, and RT012, which altogether form a clonal complex (CC) 141 [44]. One of the markers identified for RT017 showed no false-positives or false negatives. Other markers for RT017 initially generated a small number of false-positives, but 100% true-positives in the validation dataset. Markers for RT078 identified 78 out of 79 isolates of RT126 and all of the RT413 strains from the test dataset, likely due to the close genetic relatedness of these RTs (CC 1) [44,48,49]. Unitig sequences for RT106 were also able to identify C. difficile RT500 along with RT106 from the test set. Phylogenetic grouping of C. difficile genomes [44] showed that C. difficile core genome multi locus sequence typing (cgMLST) of RT106 and RT500 (CC 22) generated completely indistinguishable groupings. Considering closely related strains as true-positives based on their respective RTs, the FDRs for each unitig subset were found to be smaller, underscoring the biological consistency of the results. Adding genomes of RT413, RT126, and RT500 to the training set resulted in a decreased FDR rate. The continuously increasing number of publicly available C. difficile genome sequences will provide substantial opportunities for improvement of our new characterization technique.

Marker Combination Study
For the ribotypes RT027, RT078, RT106, and RT001, every possible combination of RT-specific unitigs was created and tested for statistical significance. A combination of two unitigs was shown to increase sensitivity and specificity up to 1 and to reduce the FDR to 0.05 ( Figure 4A-D). Each combination was defined on the basis of logical operators "AND/OR". The AND operator symbolizes that both the markers in a combination need to be present with 100% identity, whereas the OR operator means that any one of the two markers in a combination need to be present at one time, again with 100% sequence identity. There is no combination required in the case of RT017. Conclusively, as clearly exemplified in Figure 4A-D, in certain cases, the combination of markers improves RT testing by suppression of the false discovery rate. Marker's SEQ ID numbers and their sequences are given in Supplementary Table S4.  for the combination of two selected markers using OR operator for the identification of C. difficile RT027 (Panel A) and RT078 (Panel B). Panels C and D display similar analyses but then using the AND operator for identification of RT106 and RT001, respectively.

Functional Annotation of Markers
Functional characterization of the regions from which our unitigs originated demonstrated that 34% of the unitigs were localized in intergenic regions (five for RT027, four for RT001, three for of RT017 and RT106 each, and one for RT078 (Figures 5-9). Six percent of all markers were left unannotated in Diagnostics 2020, 10, 1078 9 of 14 RT001, RT027, and RT078 (one marker for each) ( Table 4). Only RT001 was identified with a unitig marker residing within the rRNA-23S ribosomal gene showing at least some correspondence with ribotyping ( Figure 5). This marker did not show sufficient diagnostic power and was thus not selected in the final set of markers. All other markers were observed to be scattered throughout the C. difficile genome. In RT078, one of these markers was identified in a mobile genetic element (Figure 8). Mostly, genes and intergenic regions, apart from the conserved ribosomal ISR, were observed to play a potential role in the unitig-mediated C. difficile typing.

Functional Annotation of Markers
Functional characterization of the regions from which our unitigs originated demonstrated that 34% of the unitigs were localized in intergenic regions (five for RT027, four for RT001, three for of RT017 and RT106 each, and one for RT078 (Figures 5-9). Six percent of all markers were left unannotated in RT001, RT027, and RT078 (one marker for each) ( Table 4). Only RT001 was identified with a unitig marker residing within the rRNA-23S ribosomal gene showing at least some correspondence with ribotyping ( Figure 5). This marker did not show sufficient diagnostic power and was thus not selected in the final set of markers. All other markers were observed to be scattered throughout the C. difficile genome. In RT078, one of these markers was identified in a mobile genetic element ( Figure 8). Mostly, genes and intergenic regions, apart from the conserved ribosomal ISR, were observed to play a potential role in the unitig-mediated C. difficile typing.

Conclusions
Strain typing has a proven value in monitoring the persistence and spread of bacterial pathogens in human populations. For C. difficile, PCR ribotyping is the current first choice but may be challenged now that genome sequencing is an option. No single-step test or algorithm is available so far for correlating C. difficile RTs with WGS data. This implies that there may be an issue with the correlation between WGS-based epidemiological analysis and PCR ribotyping for C. difficile. Here, we show that DBGWAS identified unique genomic markers that would suit that specific purpose. A combination of two unitigs led to 100% sensitive and specific discrimination between five important RTs. We believe that this approach is highly promising, providing a clear opportunity to define backward compatibility between classical RTs and WGS data.
Supplementary Materials: The following are available online at http://www.mdpi.com/2075-4418/10/12/1078/s1. Table S1 describes a collection of genome sequences that were used for the in silico search of ribotypes based on amplification of tentative ISRs. Table S2 contains a training set of C. difficile genomes used for the initial DBGWAS. Table S3 contains the C. difficile genomes used for DBGWAS validation. Table S4 shows a review of all unitigs that are statistically significantly associated with specific ribotypes.  Acknowledgments: During this study, M.G., L.H., H.P., M.J., K.D.B. and A.v.B. were employees of bioMérieux, a company designing, developing, and marketing tests in the domain of infectious diseases. The company was not involved in the design of the current study and the opinions expressed are those of the authors and may be different from formal company opinions and policies. We thank our colleagues from bioMérieux and Creighton University who provided expertise and insight that greatly assisted the research.

Conflicts of Interest:
This research was conducted in association with RG from Creighton University School of Medicine, NE, USA which could be constructed as a potential conflict of interest.