MALDI-TOF MS Using a Custom-Made Database, Biomarker Assignment, or Mathematical Classifiers Does Not Differentiate Shigella spp. and Escherichia coli

Shigella spp. and E. coli are closely related and cannot be distinguished using matrix-assisted laser desorption-ionization time-of-flight mass spectrometry (MALDI-TOF MS) with commercially available databases. Here, three alternative approaches using MALDI-TOF MS to identify and distinguish Shigella spp., E. coli, and its pathotype EIEC were explored and evaluated using spectra of 456 Shigella spp., 42 E. coli, and 61 EIEC isolates. Identification with a custom-made database resulted in >94% Shigella identified at the genus level and >91% S. sonnei and S. flexneri at the species level, but the distinction of S. dysenteriae, S. boydii, and E. coli was poor. With biomarker assignment, 98% S. sonnei isolates were correctly identified, although specificity was low. Discriminating markers for S. dysenteriae, S. boydii, and E. coli were not assigned at all. Classification models using machine learning correctly identified Shigella in 96% of isolates, but most E. coli isolates were also assigned to Shigella. None of the proposed alternative approaches were suitable for clinical diagnostics for identifying Shigella spp., E. coli, and EIEC, reflecting their relatedness and taxonomical classification. We suggest the use of MALDI-TOF MS for the identification of the Shigella spp./E. coli complex, but other tests should be used for distinction.


Introduction
The E. coli pathotype entero-invasive E. coli (EIEC) is thought to cause the same disease as Shigella spp. [1]. This pathotype consists of isolates that possess some of the E. coli phenotypic characteristics and the invasive nature of Shigella spp. [2,3]. EIEC harbors the same virulence markers as Shigella spp. that are used in molecular diagnostics to detect both Shigella spp. and EIEC but are not suitable to distinguish them [4]. Shigella spp. and E. coli are described to belong to one taxonomic species genetically, but classification in two genera is maintained for practical and taxonomic reasons [2][3][4][5]. Therefore, differentiation is challenging and is historically performed using phenotypical tests, serotyping, and the determination of virulence markers using PCR [6,7]. Multiple researchers have designed molecular methods to distinguish Shigella and E. coli, and EIEC in particular [8][9][10]. Although most molecular methods are ≥95% accurate using the initially selected set of isolates, they appeared not to be reliable when these methods were used for additional isolates [3,11].
Most clinical laboratories currently use matrix-assisted laser-desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS) to identify bacteria in a routine diagnostic setting. Commercially available databases, such as MALDI Biotyper® in combination with the MALDI Security-Relevant (SR) library® (Bruker Daltonik GmbH, Bremen, Germany) and VITEK® MS (BioMérieux, Marcy-l'Etoile, France), can distinguish Shigella spp. and E. coli from other Enterobacteriaceae. However, they cannot distinguish between the different Shigella species and E. coli, including the EIEC pathotype [12].
The development of custom-made databases to identify bacteria using MALDI-TOF MS as an alternative to commercially available databases has previously proved successful for multiple species [13][14][15]. Most notably, an earlier study developed a custom-made database to identify and distinguish Shigella spp. and E. coli specifically. However, EIEC isolates were not included in that database [16]. In a database approach, the whole spectrum of an unknown isolate is compared to the spectra in the database for species identification. However, for more closely related groups, a more subtle approach can be essential, in which variations within the spectra are examined for the presence or absence of specific peaks as biomarkers [17,18]. The biomarker approach has been used to type E. coli isolates before, with varying success rates [18]. These approaches mainly targeted a selection of isolates representing the pathotype entero-hemorrhagic E. coli (EHEC) or the highly virulent ST131 clone [18], although two studies used biomarker typing specifically for Shigella spp. and E. coli, without EIEC isolates [19,20]. One of those studies identified biomarkers outside the mass range of 2000-20,000 Da used in routine applications [20], and the other did not specify in which species the biomarkers were present or absent [19]. Besides determining the presence or absence of single biomarkers, patterns of these biomarkers can be investigated and recognized with machine-learning algorithms [21]. These machine-learning-based methods can establish classifiers for identifying groups within species of bacteria [22,23]. Moreover, such classifiers have been developed to identify Shigella spp. and E. coli before, although EIEC isolates were not included [19].
In this study, the ability of MALDI-TOF MS was assessed for the distinction of the four Shigella species, EIEC, and non-invasive E. coli using alternatives for the commercially available databases. First, a custom-made database, including all Shigella species, E. coli, and EIEC isolates, was developed and evaluated. Second, biomarkers were assigned and evaluated, and third, classifier models based on machine learning were defined, applied, and evaluated.
All isolates, except the references, were identified using a previously described culture-based identification algorithm [25]. They were divided into a set of training isolates (n = 288) and test isolates (n = 271), both having similar species and serotype distributions. The training set was used to construct the custom-made database, assign biomarkers, and define and train machine-learning classifier models. The test set was used to test all of these algorithms in duplicate, with both direct smear and ethanol-formic acid extraction application methods.

MALDI-TOF MS Preparation of Isolates
All isolates were grown overnight on Columbia Sheep Agar (CSA, Biotrading, Mijdrecht, The Netherlands) at 37 °C and were subsequently subjected to the direct smear method and the ethanol-formic acid extraction with silica beads as previously described [26]. Colonies or 1 µL extract were applied onto a polished steel plate, air-dried, and overlaid with 1 µL α-Cyano-4-hydroxycinnamic acid in 50% acetonitrile-2.5% trifluoroacetic acid (HCCA matrix). The samples were analyzed using a Bruker Microflex LT (Bruker Daltonik GmbH, Bremen, Germany) in a linear and positive mode, with 30-40% laser power and within a mass range of 2000-20,000 Da.

Database Development
The MSPs produced from 288 isolates in the training set were used to build a custom-made database with Maldi Biotyper OC V3.1.66 (Bruker Daltonik). In addition, a dendrogram to assess the relatedness of these MSPs was inferred using default settings. The isolates in the test set were identified using this custom-made database. Additionally, the test isolates were identified using the commercially available Bruker MALDI Biotyper database (V8.0.0.0) and the Bruker Security-Relevant Library (V1.0.0.0), and using a combination of the commercial and custom-made databases. The quality of the results was indicated by a log-score, calculated by Maldi Biotyper 3.0 RTC: a log-score of 2.000-2.300 corresponds to "secure genus identification, probable species identification", and a log-score of >2.300 corresponds to a "highly probable species identification". Both duplicate spots were analyzed, and the highest log-score of at least 2.000 was considered the definitive MALDI-TOF MS identification, as is done in a routine workflow. If an isolate had a log-score < 2.000 caused by a poor spectrum, it was excluded from further analysis. Isolates were then assigned to different discrimination levels, "genus", "pathotype", "group", and "species", as displayed in Figure 1. For accurate identification, only matches with database MSPs from the same species within a log-score range of 2.000-2.300 or >2.300 should be expected in one spot.
Therefore, the ten MSPs from the database that produced the highest scores within a log-score of 2.000-2.300 or >2.300 per spot were determined. For each species identified with the culture-dependent identification algorithm, the median number of species resulting from MALDI-TOF MS and their quartile ranges per spot with a log-score of 2.000-2.300, or >2.300 were calculated and visualized using SPSS 24.0.0.1 (IBM, New York, NY, USA).
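The log-score interpretation and the duplicate-spot rule described above can be illustrated with a minimal Python sketch. This is not the Biotyper software's implementation; the function names and the label "no reliable identification" for scores below 2.000 are assumptions introduced here for illustration only.

```python
from typing import List, Optional, Tuple

GENUS_THRESHOLD = 2.000    # lower bound of "secure genus, probable species" (from the text)
SPECIES_THRESHOLD = 2.300  # scores above this are "highly probable species" (from the text)


def interpret_log_score(score: float) -> str:
    """Map a log-score to the categories quoted in the text."""
    if score > SPECIES_THRESHOLD:
        return "highly probable species identification"
    if score >= GENUS_THRESHOLD:
        return "secure genus identification, probable species identification"
    # Hypothetical label: the text only states that such isolates were excluded.
    return "no reliable identification"


def definitive_identification(
    spots: List[Tuple[str, float]]
) -> Optional[Tuple[str, float]]:
    """Pick the best-scoring duplicate spot; return None if no spot reaches 2.000."""
    if not spots:
        return None
    best = max(spots, key=lambda spot: spot[1])
    return best if best[1] >= GENUS_THRESHOLD else None
```

For example, duplicate spots scoring 2.12 and 2.31 would yield the second spot as the definitive identification, whereas two spots scoring below 2.000 would return `None` and the isolate would be excluded.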

Biomarker Assignment and Principal Component Analysis
Spectra files from MSPs of 288 isolates in the training set were exported as mzXML files using Compassxport CXP3.0.5 (Bruker Daltonik) or exported via a batch process in Flexanalysis (Bruker Daltonik). A new database was created in Bionumerics v7.6.3 (Applied Maths NV, Sint-Martens-Latem, Belgium) according to the manufacturer's instructions. All raw spectra were imported into the Bionumerics database with x-axis trimming to a minimum of 2000 m/z. Baseline subtraction, noise computing, smoothing, baseline detection, and peak detection were performed with default settings. Spectra summarizing, peak matching, and peak assignment were performed according to instructions from Bionumerics [23].
In short, all raw spectra were summarized into isolate spectra. Peak matching was performed on isolate spectra using a constant tolerance of 1.9, a linear tolerance of 550, and a peak detection rate of 10%. Binary peak matching tables were exported to summarize the presence of peak classes on all discrimination levels, as depicted in Figure 1. Decision diagrams were produced for the levels genus, pathotype, and groups (Supplementary Figure S1a-c). The spectra files of isolates from the test set were imported and preprocessed in Bionumerics, using the same methods and settings as for the spectra from the isolates in the training set. Peak matching of test isolates was performed using the option "existing peak classes only" to compare the presence of peaks in the test isolates with peaks in the isolates from the training set. Decision diagrams (Supplementary Figure S1) and the presence or absence of peak masses as depicted in Table 2 were applied to assign unknown isolates from the test set according to the different levels, as shown in Figure 1. By assigning biomarkers, only the presence and absence of peaks were investigated. To assess quantitative peak data such as peak intensity and peak area, a principal component analysis (PCA) was performed on all isolates in the training set to visualize the position of isolates in three dimensions.
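As a rough illustration of how a binary peak matching table can be derived, the Python sketch below matches observed peaks against existing peak classes. It assumes that the matching window is the constant tolerance plus the linear tolerance expressed in ppm of the peak position; this combination rule, the function names, and the example peak positions are assumptions made here and are not taken from the Bionumerics documentation or from this study's data.

```python
from typing import List

CONSTANT_TOLERANCE = 1.9  # m/z units (value from the text)
LINEAR_TOLERANCE = 550.0  # assumed to be ppm of the peak position


def tolerance(mz: float) -> float:
    """Assumed matching window: constant part plus a part that grows with m/z."""
    return CONSTANT_TOLERANCE + LINEAR_TOLERANCE * mz / 1e6


def binary_peak_row(peaks: List[float], peak_classes: List[float]) -> List[int]:
    """Return 1 for each existing peak class that has an observed peak within tolerance."""
    return [
        int(any(abs(peak - cls) <= tolerance(cls) for peak in peaks))
        for cls in peak_classes
    ]


# Hypothetical peak classes and observed peaks, for illustration only.
row = binary_peak_row([4165.0, 9230.0], [4163.0, 7157.0, 9227.0])
```

Under these assumptions the window at m/z 10,000 would be 1.9 + 5.5 = 7.4, and `row` marks the first and third peak classes as present and the second as absent.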

Presence of Biomarkers Identified in Previous Studies
All isolates in the training and test sets were examined for the unique masses (±500 ppm) assigned as biomarkers for Shigella spp. and E. coli in previous studies [19,20]. Additionally, because peaks in our study were assigned at m/z values rather than masses, and because ions could potentially carry a double charge, the previously published masses divided by two (±500 ppm) were also examined [19,20].
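The ±500 ppm window and the check at half the published mass can be sketched as follows. This is a simplified Python illustration, not the procedure used in the study: the function names and example values are hypothetical, and the division by two ignores the mass of the charge carrier, as the description above does.

```python
from typing import Iterable, Optional


def within_ppm(observed: float, reference: float, ppm: float = 500.0) -> bool:
    """True if the observed m/z lies within +/- ppm of the reference value."""
    return abs(observed - reference) <= reference * ppm / 1e6


def find_biomarker(
    observed_peaks: Iterable[float], published_mass: float, ppm: float = 500.0
) -> Optional[str]:
    """Look for a published mass at full value, then at half value (double charge)."""
    peaks = list(observed_peaks)
    candidates = (
        ("single charge", published_mass),
        ("double charge", published_mass / 2),
    )
    for label, target in candidates:
        if any(within_ppm(peak, target, ppm) for peak in peaks):
            return label
    return None
```

With a hypothetical published mass of 10,192 Da, an observed peak at m/z 5096.5 would be reported as a double-charge match, because it lies within 500 ppm (about 2.5 m/z units) of 5096.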

Classifier Models Based on Machine Learning
Peak data of the summarized isolate spectra of the 288 isolates in the training set were used to define and train machine learning-based classifiers using Bionumerics v7.6.3 according to the manufacturer's instructions. In short, peak matching with a constant tolerance of 1.9 and a linear tolerance of 550 was performed on isolate spectra on the different levels: genus, pathotype, group, and species. Classifiers were created at all levels using character values. Support vector machine (linear) learning was used as a scoring method in which p-values were used for ranking. The classifiers were trained and cross-validated to check their performance for identification. Subsequently, the classifier models were used to classify the unknown isolates in the test set at the different discrimination levels to evaluate their performance.

Database Development
All MSPs of 288 training isolates were added to a custom-made database. The relatedness of these MSPs is shown in a dendrogram (Figure 2). The Maldi Biotyper OC software recognized three large MSP clusters within this custom database that are not species-specific. This did not change if clusters were assigned manually with a lower distance level at 50-100 relative units, indicating that similarity in spectrum profiles is distributed across species (Figure 2).
Additionally, the duplicate spots of test isolates using either the direct smear or extraction method resulted in a different species designation in 10-15% of the samples. Furthermore, with an accurate distinction of species, one would not expect assignment to multiple species above the threshold of log-score 2.000. However, with both application methods, most isolates were assigned to several species with a log-score of 2.000-2.300 or >2.300 per spot, indicating no specificity at all (Figure 3).
One isolate from the test set (S. boydii serotype 13) showed a low-quality spectrum (log score 1.574-1.930), and one isolate (S. dysenteriae serotype 1) had initially been incorrectly stored, as this isolate was identified as Corynebacterium diphtheriae using the Bruker databases. Both these isolates were ignored in further analyses. All other isolates had log-scores higher than 2.000, and percentages of MALDI-TOF MS identification concordant with the original identification on all discrimination levels were as displayed in Table 3. With the Bruker databases only, percentages of correctly identified Shigella spp. on all discrimination levels are low, ranging from 6% to 45% correct designations, both for the direct smear and extraction methods (Table 3). In contrast, 90-100% of E. coli isolates were correctly identified. When identification was based on the custom-made database with or without the Bruker databases, the percentage of correctly identified E. coli isolates decreased to a range of 29-71%. In contrast, Shigella spp. were correctly identified, ranging from 94% to 99% of cases on the genus, pathotype and group levels. In addition, 91-97% of S. flexneri and S. sonnei were correctly identified at the species level, in contrast to S. dysenteriae and S. boydii, for which the percentages of correct identification were low (Table 3).
Figure 3. Identity (x-axis) was assigned using the culture-based identification algorithm. Black horizontal bars represent the median number of species; the 25-75% interquartile ranges are indicated by the blue vertical bars, and 5-95% intervals by the black vertical lines. Outliers are indicated with blue dots.

Biomarker Assignment and Principal Component Analysis
The decision diagrams based on biomarkers assigned to the isolates in the training set were used to identify unknown isolates in the test set. Distinctive peaks on the species level are summarized in Table 2. High percentages of correct identification of S. sonnei isolates were achieved at the species level using both the direct smear and the extraction method. However, the biomarkers are not specific for S. sonnei, as other species contain them as well. For other species, the identified biomarkers correctly identified fewer than 38% of isolates. Specific biomarkers were not detected for all the classes at the different discrimination levels, as depicted in Figure 1. Consequently, it was not possible to identify S. dysenteriae, S. boydii, and E. coli isolates at all because of the absence of discriminating peaks for these species (Table 3).
In the PCA of the detected peaks in the isolates of the training set, one large cluster was formed, with a few outliers at both ends (Figure 4). When the isolates were colored according to their identity based on the culture-based identification method, no separate groups of isolates were seen at any of the discrimination levels (Figure 4a-d).

Presence of Biomarkers Identified in Previous Studies
The specific biomarkers for S. flexneri, S. sonnei, and E. coli assigned by Everley et al. [20] were not present in any of the 559 isolates in this study when using an error limit of ±500 ppm. They were also absent after correction for a double charge (published mass divided by two). A few biomarkers for Shigella spp. and E. coli described by Khot and Fisher [19] were present within a range of 500 ppm in isolates used in this study, i.e., 4163 Da, 7157 Da, 8326 Da, and 9227 Da, and, after correction for a double charge, 5096 Da and 5752 Da.

Classifier Models Based on Machine Learning
Using the internal cross-validation of the classifiers at all discrimination levels, all but one class offered an accuracy of more than 87.5%. The only class with a lower accuracy (77%) was "Escherichia" at the genus discrimination level.
When using machine learning-based classifiers for identification, 96% of Shigella spp. isolates and 21% of the E. coli isolates from the test set were correctly identified at the genus level using the direct smear application method, and 100% and 8%, respectively, using the ethanol-formic acid extraction method (Table 3). Correct identification percentages for the pathotype, group, and species levels are displayed in Table 3. Although more than 80% of S. sonnei isolates were correctly identified with the species classifier, specificity was low, as more than 70% of S. flexneri isolates were also identified as S. sonnei.
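The relation between a high percentage of correctly identified Shigella and a low specificity can be made explicit with a small worked example. The Python sketch below uses hypothetical counts of 100 isolates per genus, chosen only to mirror the direct smear percentages above; they are not the study's actual isolate numbers, and the function name is an assumption.

```python
from typing import Tuple


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts: Shigella is the "positive" class.
# 96 of 100 Shigella isolates are called Shigella (TP = 96, FN = 4);
# 21 of 100 E. coli isolates are called E. coli (TN = 21, FP = 79).
sens, spec = sensitivity_specificity(tp=96, fn=4, tn=21, fp=79)
```

Under these assumed counts the sensitivity for Shigella is 0.96 while the specificity is only 0.21, because 79 of the 100 E. coli isolates are counted as false-positive Shigella calls.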

Discussion
Current commercially available MALDI-TOF MS databases cannot distinguish between Shigella spp. and E. coli. Therefore, three different alternatives were explored in this study. A custom-made database was developed, biomarkers were identified, and machine learning classification models were designed.
Compared to a previous study, our custom-made database assigned fewer E. coli isolates correctly [16]. This indicates that the inclusion of EIEC isolates in the custom-made database and the test set complicates the identification. Half of the EIEC isolates were assigned to one of the Shigella species, thereby decreasing the percentage of correctly identified E. coli. The poor performance of identifying E. coli with our custom-made database could result from an overrepresentation of S. flexneri and S. sonnei. To investigate this hypothesis, a second custom-made database was developed, based on 17 isolates of each species, representing the diversity in serotypes. This database did not perform better or worse than the custom-made database that contained 288 MSPs (Supplementary Table S1), indicating that a more even distribution of species in the database does not improve the identification of E. coli. Although percentages of correct species assignments to S. flexneri and S. sonnei were high, other species were falsely assigned to them in our study and a previous study [16]. In the latter study, correct species identification was based on the majority rule that three out of four spots should indicate the same species. Besides the fact that the interpretation of four spots per isolate is not feasible in clinical diagnostics, this indicates that the assignment of species is based on probabilities rather than actual variations in spectra. Our study confirms this phenomenon because multiple species identifications within the same log-score range were made per spot. Moreover, 10-15% of duplicate spots resulted in different species assignments using commercially available and custom-made databases. Additionally, in the dendrogram inferred from the MSPs of the training set in the custom-made database, isolates of the same species did not cluster together, indicating that the resulting database would not be capable of identifying the isolates from the test set correctly.
Another alternative to using commercially available databases is the detection of discriminating biomarkers. However, in our study, many isolates resulted in inconclusive identification, as specific biomarkers were not detected for most classes. Although more than 90% of S. sonnei isolates were identified at the species level, other species, such as S. boydii and E. coli, were also frequently falsely identified as S. sonnei. Moreover, when also analyzing peak intensity and area rather than just peak presence, the PCA showed that Shigella spp. and E. coli did not represent separated groups based on their biomarkers. Instead, one large cluster with a few outliers was formed, demonstrating their genetic similarity. Furthermore, the absence of 85% of the masses assigned as biomarkers in a former study [15] in our isolates indicates that the detected biomarkers vary amongst isolate sets tested and that a stable variation per species is not observed. Consequently, we anticipate that the assignment of biomarkers based on yet another set of isolates will lead to even more diversity in biomarkers, demonstrating their unsuitability for distinct identification of Shigella spp., E. coli, and EIEC. In fact, peaks described in specific sets of isolates should not be considered as biomarkers if they are not detectable in (almost) all isolates of a species.
The use of classifier models based on machine learning resulted in comparable percentages of correctly identified Shigella on the genus level, i.e., ≥94%, as reported in other studies [19]. In our classifier model designed on the pathotype level, EIEC isolates were not incorporated in the class E. coli; correct identification was 67%, comparable to a previous study [19]. Nonetheless, the remaining E. coli isolates were falsely classified as Shigella, both with our classifiers and with previously published ones [19], decreasing the specificity for identifying Shigella. Classifiers performed even worse at the group and species levels, and most species could not be identified at all. The poor performance of the classifier models may be caused by an overrepresentation of S. flexneri and S. sonnei, as discussed for the custom-made database in our study. Therefore, 17 isolates of each species were selected again, and alternative classifiers were designed. These classifiers did not perform better or worse than the classifiers designed using all 288 isolates in the training set, indicating that the uneven distribution of species was not the cause of poor identification with classifiers (Supplementary Table S1).
We used a substantially more extensive set of isolates than previous studies and included the E. coli pathotype EIEC. Another strength of our study was that multiple alternative approaches for identifying Shigella spp. and E. coli using MALDI-TOF MS were explored. Although S. sonnei and S. flexneri isolates were overrepresented in both the training and test sets, this distribution represents high-resource settings.
In conclusion, none of our explored alternative approaches for identifying Shigella spp., E. coli, and EIEC with MALDI-TOF MS were suitable to use in clinical diagnostics, as all rendered a poor distinction based on spectra or biomarkers. This poor discrimination merely reflects the problematic taxonomical classification of Shigella spp. and E. coli into two different genera and does not reflect MALDI-TOF MS's performance as an identification technique in general. Therefore, we propose an identification algorithm in which MALDI-TOF MS is used to identify and differentiate Shigella/E. coli as a group from other Enterobacteriaceae, followed by tests other than MALDI-TOF MS to distinguish between the different Shigella species, E. coli, and specific E. coli pathotypes, including EIEC.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.