An Unsupervised Prediction Model for Salmonella Detection with Hyperspectral Microscopy: A Multi-Year Validation

Abstract: Hyperspectral microscope images (HMIs) have been previously explored as a tool for the early and rapid detection of common foodborne pathogenic bacteria. A robust unsupervised classification approach to differentiate bacterial species with the potential for single cell sensitivity is needed for real-world application, in order to confirm the identity of pathogenic bacteria isolated from a food product. Here, a one-class soft independent modelling of class analogy (SIMCA) was used to determine if individual cells are Salmonella positive or negative. The model was constructed and validated with a spectral library built over five years, containing 13 Salmonella serotypes and 14 non-Salmonella foodborne pathogens. An image processing method designed to take less than one minute paired with the one-class Salmonella prediction algorithm resulted in an overall classification accuracy of 95.4%, with a Salmonella sensitivity of 0.97, and specificity of 0.92. SIMCA's prediction accuracy was only achieved after a robust model incorporating multiple serotypes was established. These results demonstrate the potential for HMI as a sensitive and unsupervised presumptive screening method, moving towards the early (<8 h) and rapid (<1 h) identification of Salmonella from food matrices.


Introduction
Salmonella is a leading cause of gastroenteritis, with severe cases occasionally resulting in death. The World Health Organization estimates that 550 million people fall ill from foodborne diseases annually, with 33 million healthy life years calculated as lost. Nontyphoidal Salmonella is one of the four primary pathogenic bacteria responsible [1]. Traditional detection methods, such as nutrient enriched growth media or polymerase chain reaction (PCR), have been the standard for the detection of Salmonella for years. While these methods are effective, the incubation time required for nutrient enriched growth media, and the recurring costs and advanced training requirements of PCR, are disadvantages that influence the time required to correctly identify the causative agent and source of a foodborne disease outbreak.
In recent years, hyperspectral imaging (HSI) and hyperspectral microscope images (HMI) have been explored for food safety and quality assessment. HSI methods have been applied for the estimation of bacterial total viable counts (TVC) on the surface of salmon, pork, and chicken cuts [2][3][4]. HSI has also been applied to determine the Campylobacter species or Shiga toxin-producing E. coli (STEC) serogroup of bacterial colonies grown on their respective selective nutrient enriched agar plates [5][6][7]. Anderson et al. [8] found that an HMI system could differentiate between the spectral patterns of viable Bacillus anthracis spores and non-viable spores damaged by contact with hydrogen peroxide. Previously, our laboratory's research has shown that bacterial species, as well as serotypes of the same species, can be differentiated through HMI by using a single cell-based mean pixel intensity pattern, and that early detection was possible in 8 h or less [9].
The previous project objectives involved the use of discriminant analyses (DA) or other multivariate approaches to determine if differences existed between specific experimental treatments. To advance this technology, an unsupervised classification approach for HMI data is necessary to determine a taxonomical identification. Presumptive pathogen screening of a food sample would require HMI to produce a yes/no answer, similar to qualitative PCR. In order to construct an unsupervised prediction model that results in a bacterial HMI slide testing positive or negative for the presence of Salmonella, a soft independent modelling of class analogy (SIMCA) approach was chosen. In this application, SIMCA was preferable to a DA with hard decision boundaries, because DA will force a sample into a positive or negative category, whereas the soft boundaries of SIMCA can reject a sample outside of the calibration model's boundaries [10]. This is preferable for a qualitative food safety approach that gives a binary yes or no determination for a bacterium's presence in a food product. If a DA forces a sample into a false negative, or type II error, a potentially contaminated product can be overlooked and erroneously regarded as safe, jeopardizing public health. Previously, food authentication research has addressed the issue of a food product's quality by using spectroscopy methods paired with SIMCA to determine if product adulterations have been made for economic benefit [11].
HMI research has shown potential for application in early and rapid food safety methodologies, but the validation of a comprehensive and robust modeling approach is necessary in moving the unsupervised classification technology forward. Data were collected over the span of five years, between May 2012 and May 2017. The aim of this study was to construct a robust one-class SIMCA calibration model for rapid Salmonella prediction at a cellular level, and to determine whether validation over a multi-year study could predict Salmonella presence with performance comparable to traditional detection methods such as PCR and nutrient enriched plating, with approximately 95% accuracy.

Sample Preparation and Collection
Bacterial cultures were isolated and purified from broiler chicken carcass rinses at the U.S. National Poultry Research Center by the Poultry Microbiological Safety and Processing Research Unit and were stored in 20% glycerol at −80 °C, except for the Campylobacter species, which were obtained from the American Type Culture Collection (Manassas, VA, USA). Stock cultures were removed from the freezer as needed and were inoculated onto the organism's appropriate growth media, then incubated for the necessary time-temperature relationship [12]. A list of the microorganisms and abbreviations used can be found in Table 1. After incubation, the cultures were stored at 4 °C with sample slides prepared, and the HMI was collected within 24 h. Bacterial cultures were sampled as described in Park et al. [13]. In brief, an inoculation loop was used to pick a typical colony from an agar plate, which was suspended in 100 µL of deionized water and vortexed; 3 µL of the bacterial suspension was then placed on a common glass microscope slide and allowed to air dry under a biosafety cabinet for 15 min. A coverslip was applied, and the glass slide was placed on the HMI system's sample stage and viewed under a 100× oil objective (Olympus, Tokyo, Japan). This effectively affixes the cells to the slide for hypercube image collection without damaging the microorganisms, resulting in HMI of individual live cells obtained without the use of reagents, tags, or dyes.
The HMI system consists of an acousto-optic tunable filter (AOTF; Gooch and Housego, Ilminster, UK), 16-bit electron multiplying charge coupled device (EMCCD) (Andor Technology, Belfast, Northern Ireland), optimized darkfield condenser (Cytoviva, Auburn, AL, USA), 24 W tungsten halogen (TH) light (Osram, Munich, Germany), and a digital upright microscope (i80 Nikon, Lewisville, TX, USA). The TH light source was offset from the HMI system in a lamp house connected underneath the sampling stage via a fiber optic cable, which prevents heat damage to bacterial cells generated from the lamp. The HMI system collected 89 TIFF files in 4-nm increments in the range of 450-800 nm, stacking files together to form a hypercube. Hypercubes were 1000 × 1000 × 89, resulting in 89 million data points per hypercube from one sample.
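The hypercube layout described above can be illustrated with a minimal NumPy sketch. It assumes each band has already been read into a 2-D array; the function names `build_hypercube` and `mean_spectrum` and the small synthetic 4 × 4 image are hypothetical stand-ins, not part of the acquisition software.

```python
import numpy as np

def build_hypercube(band_images):
    """Stack 2-D band images (rows x cols) into a rows x cols x bands array."""
    return np.stack(band_images, axis=-1)

def mean_spectrum(cube, mask):
    """Mean intensity per band over the pixels selected by a boolean mask."""
    return cube[mask].mean(axis=0)

# Synthetic stand-in: 4 x 4 pixels and 89 bands, where band k holds the value k.
bands = [np.full((4, 4), k, dtype=np.uint16) for k in range(89)]
cube = build_hypercube(bands)

# Hypothetical pixel selection: the central 2 x 2 block of the image.
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
spectrum = mean_spectrum(cube, mask)

# Size of a full hypercube as reported in the text (not allocated here).
full_size_points = 1000 * 1000 * 89
```

Keeping the spectral axis last means a boolean pixel mask indexes directly into an (n_pixels × 89) block, so a per-region mean spectrum is a single reduction.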

HMI Processing
Fiji (ImageJ 2.0) [14] was used to process raw TIFF images collected in the hypercube stacks. Figure 1 shows a flowchart for the image processing method that extracts the mean single cell spectra in less than 5 min. The hypercube was imported into Fiji as a virtual stack, and the spectral band resulting in a high cell to background contrast was identified and duplicated as an 8-bit grayscale image for shape analysis. The auto-thresholding option in Fiji was selected, with 16 thresholding algorithms being tested. It was found that Otsu's method gave the optimal separation of cells from the background. Here, Otsu's thresholding method was applied to mask the background, leaving a mask with only pixels representing cells. Otsu's thresholding assumes a Gaussian distribution for image values, where the objective is to maximize the between-group variance, in this case between the feature (bacterial cells) and the background [15]. The probabilities of a pixel value falling into one of two groups can be calculated by Equation (1), as follows:

$$P_1 = \sum_{i \in S_1} P_i, \qquad P_2 = \sum_{i \in S_2} P_i \tag{1}$$

where $P_1$ and $P_2$ represent cumulative probabilities of the two groups, $T$ = a threshold that divides the image into pixel set $S_1$ or $S_2$, and $P_i$ = the probability of image value $i$. After the global thresholding was computed, the Time Series 3.0 plugin [16] was used to apply the masks to the virtual stack, calculating the mean of the pixels in each region of interest (bacterial cell). Next, Fiji exported two comma-separated value (CSV) files, where one file represented the spectral data and one file represented the shape metrics.
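The thresholding step can be sketched in plain Python as an exhaustive search for the threshold that maximizes the between-group variance. This is a minimal illustration rather than Fiji's implementation: the function name `otsu_threshold` is hypothetical, 8-bit integer pixel values are assumed, and cells are assumed to be brighter than the background, as in darkfield imaging.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold T that maximizes the between-group variance."""
    hist = [0] * levels
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    prob = [count / total for count in hist]          # probability of value i
    mean_total = sum(i * p for i, p in enumerate(prob))

    best_t, best_between = 0, -1.0
    p1 = 0.0         # cumulative probability of the group with values <= T
    weighted1 = 0.0  # running sum of i * prob[i] over that group
    for t in range(levels - 1):
        p1 += prob[t]
        weighted1 += t * prob[t]
        p2 = 1.0 - p1
        if p1 <= 0.0 or p2 <= 0.0:
            continue  # one group is empty, so the variance is undefined
        mu1 = weighted1 / p1
        mu2 = (mean_total - weighted1) / p2
        between = p1 * p2 * (mu1 - mu2) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t

# Synthetic example: dark background pixels and bright cell pixels.
pixels = [10] * 50 + [200] * 50
t = otsu_threshold(pixels)
cell_mask = [value > t for value in pixels]  # True where a pixel belongs to a cell
```

When several thresholds give the same between-group variance, this sketch keeps the lowest one; other implementations may break ties differently.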
The two CSV files were combined into one matrix, where rows were single cells with corresponding shape and spectral data shown as columns. Circularity represents how close a shape is to a perfect circle on a scale of 0 to 1, and was computed by Equation (2), as follows:

$$Cir = \frac{4\pi A}{P^2} \tag{2}$$

where $Cir$ = circularity, $A$ = area, and $P$ = perimeter. Bacterial cells are not always close to a value of 1, as Salmonella, E. coli, and many others are rod-shaped, in addition to Campylobacter, which can take on an S-shape. It was found that extremely low circularity values were correlated with clumps of overlapping cells, and extremely high values were typical of a small number of pixels representing extracellular debris. Thresholding values of 0.35-0.9 were optimal in removing large clumps of cells, as well as extracellular debris. Figure 2 shows an example of the bacterial hypercube and data files.
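As an illustration of the shape filter, the sketch below computes circularity from area and perimeter and keeps only objects inside the 0.35-0.9 window. The function names and the example shapes are hypothetical, and treating the window bounds as inclusive is an assumption.

```python
import math

def circularity(area, perimeter):
    """4*pi*A / P^2: equal to 1 for a perfect circle, lower for elongated shapes."""
    return 4.0 * math.pi * area / perimeter ** 2

def keep_cell(area, perimeter, low=0.35, high=0.9):
    """True if the object falls inside the circularity window (bounds assumed inclusive)."""
    return low <= circularity(area, perimeter) <= high

# Idealized shapes, for illustration only.
circle = circularity(math.pi * 5 ** 2, 2 * math.pi * 5)  # perfect circle
rod = circularity(4 * 1, 2 * (4 + 1))                     # 4 x 1 rectangle
chain = circularity(20 * 1, 2 * (20 + 1))                 # 20 x 1 rectangle
```

In this idealized setting the rod-like rectangle is retained, while the near-perfect circle and the long chain are both rejected, which mirrors the reported use of the window to drop small debris and clumps of overlapping cells.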
Appl. Sci. 2021, 11, x FOR PEER REVIEW

Spectral Pre-Processing
The standard normal variate (SNV) transformation has been shown to reduce spectral variation in hypercube data sets caused by small variations in sampling conditions, particle size, or bacterial size [17,18]. The SNV was calculated by Equation (3), as follows:

$$x_i^{SNV} = \frac{x_i - m_i}{\delta_i} \tag{3}$$

where $x_i^{SNV}$ is the SNV adjusted spectra, $m_i$ is the sample's mean, $x_i$ is the sample's spectra, and $\delta_i$ is the sample's standard deviation. Following SNV, outlier detection was performed by applying a centroid-based Mahalanobis distance (MD) between two vectors, one being the individual cell's mean spectra and the other representing the class mean spectra, calculated by Equation (4), as follows:

$$MD_i = \sqrt{(x_i - \bar{x})^{T} C^{-1} (x_i - \bar{x})} \tag{4}$$

where $x_i$ = an object vector, $\bar{x}$ = the cluster centroid, and $C$ = the covariance matrix of the class. From here, single cell values falling outside ±3δ of the class mean MD were removed from the dataset, with 0.97% of the calibration data and 1.37% of the validation data being labeled as outliers and removed.
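A minimal NumPy sketch of the two preprocessing steps is given below. The function names are hypothetical, and two choices are assumptions rather than details from the study: the sample standard deviation (`ddof=1`) is used throughout, and a pseudo-inverse replaces the covariance inverse because row-wise SNV makes the covariance matrix singular.

```python
import numpy as np

def snv(spectra):
    """Row-wise SNV: centre each spectrum on its own mean, scale by its own SD."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def mahalanobis_to_centroid(X):
    """Mahalanobis distance of every row of X to the class centroid."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates collinear bands
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.clip(d2, 0.0, None))

def outlier_mask(distances, n_sd=3.0):
    """True for cells whose distance lies more than n_sd SDs from the mean distance."""
    d = np.asarray(distances, dtype=float)
    return np.abs(d - d.mean()) > n_sd * d.std(ddof=1)

# Synthetic check: 200 ordinary points plus one extreme point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 3)), [[50.0, 50.0, 50.0]]])
d = mahalanobis_to_centroid(X)
flagged = outlier_mask(d)
```

Cells where `flagged` is True would be dropped before model building; the ±3 SD cut-off follows the text, while the rest of the pipeline here is only illustrative.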

SIMCA Classification Model
The SIMCA approach has previously been well defined [19][20][21]. Here, the SIMCA model was constructed for a single class, Salmonella. The calibration model was obtained through a principal component analysis (PCA), built on an optimal number of significant principal components (PCs) and defined as Equation (5), as follows:

$$X_K = T_K V_K^{T} + E_K \tag{5}$$

where $n$ = the number of objects, $r$ = selected PCs, $p$ = selected variables, $X_K$ = the mean centered matrix, $T_K$ ($n \times r$) = the score matrix obtained from $n$ objects and $r$ selected PCs, $V_K^{T}$ ($r \times p$) = the loading matrix obtained for $r$ selected PCs and $p$ variables, and $E_K$ ($n \times p$) = the residual matrix [22]. The leave-one-out-cross-validation (LOO-CV) was an important step in the development of the prediction model, which has previously been shown to reduce the number of false outliers by inflating the within class component variances [23]. Class boundaries of the SIMCA are determined by Equation (6), where $s_0$ = mean distance between objects belonging to the $k$ class model and $e_{ki}^2$ = squared residual of the $k$th object for the $i$th (latent) variable. The critical distance value, $S_{crit}$, is then calculated through an F-test at a specified significance level (α) by Equation (7).

Thirteen Salmonella serotypes were used to establish the calibration model. HMI were collected with multiple repetitions of each serotype, resulting in a collection of 3315 bacterial cells after outlier removal. Each repetition involved culturing the serotypes from frozen stock cultures. The experimental conditions were kept the same; however, small variances in colony size or cellular size could be noticed after the incubation of the same strain. For this reason, multiple repetitions of the same strains were regrown from frozen stock for each serotype in the calibration model in order to sufficiently cover a robust set of Salmonella bacterial conditions and spectral variation within the species.
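For orientation, the class boundary referred to as Equations (6) and (7) is commonly written in the SIMCA literature in the form below. This is the textbook residual-based form expressed with the symbols defined above, not a transcription of the study's equations; the degrees of freedom in particular are an assumption, since several variants exist.

```latex
s_0 = \sqrt{\frac{\sum_{k=1}^{n}\sum_{i=1}^{p} e_{ki}^{2}}{(n - r - 1)(p - r)}}
\qquad
S_{crit} = \sqrt{F_{crit}\, s_0^{2}}
```

Here $F_{crit}$ denotes the tabulated F value at the chosen significance level α, conventionally taken with $(p - r)$ and $(n - r - 1)(p - r)$ degrees of freedom.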

SIMCA Validation
Over five years, the SIMCA prediction model was validated by Salmonella serotypes, similar Enterobacteriaceae family members, and other pathogenic/spoilage microbes commonly found in food products, totaling 19 microorganisms and 3421 bacterial cells after outlier removal. Table 2 describes the sample size breakdown of the Salmonella spectral library and validation. Five Salmonella serotypes common to foodborne disease outbreaks, namely S. Enteritidis (SE), S. Heidelberg (SH), S. Infantis (SI), S. Kentucky (SK), and S. Typhimurium (ST), were cultured, in addition to 14 other organisms known to be foodborne pathogens [24]. The HMI for these samples were collected in the same manner as the calibration model. Preprocessing and outlier detection methods were also repeated. New single cell mean spectra were projected onto the Salmonella calibration model's PC space, and distances towards the class model were calculated by Equations (8)-(10), where $e_{new}^2$ = the new object's squared residual and $S_K$ = the distance towards the class model, which is compared to the $S_{crit}$ value from Equation (7). Bacterial cells are labeled as Salmonella if $S_K < S_{crit}$. If $S_K > S_{crit}$, the cell is classified as a non-Salmonella cell.
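The calibration and projection steps can be sketched as a one-class SIMCA in NumPy. This is a simplified illustration under stated assumptions, not the study's implementation: it uses only the residual (orthogonal) distance with the textbook degrees of freedom, it omits the LOO-CV variance adjustment, and `f_crit` is a hypothetical parameter that the user would look up for the chosen significance level. The names `fit_simca` and `predict_simca` are likewise hypothetical.

```python
import numpy as np

def fit_simca(X, n_pcs, f_crit):
    """Fit a one-class PCA model and its critical residual distance."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    mean = X.mean(axis=0)
    Xc = X - mean                                # mean centered matrix
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_pcs].T                             # loadings, p x r
    T = Xc @ V                                   # scores, n x r
    E = Xc - T @ V.T                             # residuals, n x p
    s0 = np.sqrt((E ** 2).sum() / ((n - n_pcs - 1) * (p - n_pcs)))
    s_crit = np.sqrt(f_crit) * s0                # assumed value of f_crit
    return {"mean": mean, "V": V, "r": n_pcs, "s_crit": s_crit}

def predict_simca(model, X_new):
    """True where a new object falls inside the class boundary."""
    X_new = np.atleast_2d(np.asarray(X_new, dtype=float))
    Xc = X_new - model["mean"]
    E = Xc - (Xc @ model["V"]) @ model["V"].T    # residual after projection
    p = Xc.shape[1]
    s_k = np.sqrt((E ** 2).sum(axis=1) / (p - model["r"]))
    return s_k < model["s_crit"]

# Synthetic class lying close to a 2-D plane in a 6-D space.
rng = np.random.default_rng(1)
basis = rng.normal(size=(2, 6))
X_class = rng.normal(size=(300, 2)) @ basis + 0.05 * rng.normal(size=(300, 6))
model = fit_simca(X_class, n_pcs=2, f_crit=3.0)

inside = predict_simca(model, X_class)                          # mostly True
outside = predict_simca(model, 5.0 * rng.normal(size=(50, 6)))  # unrelated points
```

A fuller implementation would typically also bound the score distance (for example with Hotelling's T²), so that objects far from the class centre but within the PC plane are rejected as well.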

Standard Normal Variate and Spectra
The number of outliers detected by the MD method was less than 1% for the calibration dataset and less than 2% for the validation dataset, which was due to the image processing method setting thresholding limits that removed large clumps of cells. While Otsu's thresholding method did improve the cell cluster separation, overlapping cells still existed. Figure 3 shows an example image of Salmonella Heidelberg taken at 638 nm, with the raw image shown in Figure 3A, and the cell segmentation image shown in Figure 3B. Here, we can see that some cells are touching other cells and some are not. To increase the number of cells analyzed per image, an improved single cell separation method would need to be implemented. Figure 4 shows the mean spectra for the Salmonella calibration data set (n = 3315). In Figure 4A, it is noticeable that the raw TH spectra show a large variance in intensity values, ranging from around 1500 to 16,000 a.u. at a maximum peak of 638 nm. Applying the row-based SNV preprocessing step placed the spectra on a consistent scale, as shown in Figure 4B.
High collinearity between bacterial species is an issue that should be taken into consideration. Because PCA utilizes an orthogonal transformation of the spectra to calculate the PCs, this aids in negating the influence of collinearity in the classification model. An advantage of SIMCA is that it is sensitive to dissimilarities between objects [22], which is significant given the close spectral relationships between bacteria. Eliminating these false outliers is key in the prediction of Salmonella, as type II errors can result in a pathogenically contaminated food product being released to the consumer market. Careful consideration of outliers was performed in this application; determining too many bacterial cells to be outliers would result in underfitting the prediction model, thus being counterproductive to the purpose of this SIMCA application, and potentially resulting in a high number of type II errors.

SIMCA Calibration Model
As a result of the highly collinear nature of the mean bacterial cell spectra, a large amount of data benefited the robustness of the SIMCA's prediction capability. It was found that increasing the number of Salmonella serotypes and serotype repetitions incorporated sufficient robustness over time, and that the model could predict the Salmonella HMI collected several years later. In Figure 5A, the distribution of the PCA score plots can be seen, and as more data points are added to the calibration model, the distribution across PC1 and PC2 becomes more normally distributed. The plots shown in Figure 5 were indicative of a robust model that could offer unsupervised classification of Salmonella cells. Figure 5B shows the loadings vectors for PCs 1-4. PC1 shows the strongest loading vectors in the red color bands, while PCs 2, 3, and 4 appear to be strongest in the green color bands, and PC 4 represented the strongest of the blue color bands. The explained variance of PCs 1-4 is detailed in Figure 5C, with 95% of the Salmonella calibration model's explained variance described in the first four PCs. The error matrix plotting Hotelling's $T^2$ values against the F-residuals is shown in Figure 5D.
There are over 2500 known serotypes of Salmonella [25]. As new serotypes are added to this calibration model, it would be assumed that some serotypes may skew the spread of these scores in the principal component space, but with enough HMI repetitions, the PCA scores will progress towards filling the multivariate space representative of Salmonella. Bacteria share many physiological traits, especially those of the same Enterobacteriaceae taxonomical family, including common foodborne pathogens such as Salmonella, E. coli, Shigella, Enterobacter, and Klebsiella [26]. These microbes tend to share many common traits, such as lipopolysaccharide cell wall structures, porins, and other features, that make single pixel differentiation between cells virtually impossible under the given conditions. For example, the pixelwise classification of E. coli cells resulted in many pixels misclassified as Salmonella pixels because of the common physiological characteristics of the two Enterobacteriaceae species. For this reason, a mean spectrum was calculated per cell. Single cell mean spectra offer an overview of the cellular characteristics, while maintaining the representation of the inherent biological variability between bacterial species.

SIMCA Validation
Validation of the SIMCA model consisted of HMI collected from 19 microorganisms, and resulted in 3222 of 3421 bacterial cells correctly labeled as Salmonella or non-Salmonella, as shown in Table 3. The SIMCA prediction model had an accuracy of 95.4%, sensitivity of 0.97, and specificity of 0.92. The five Salmonella serotypes used for validation are serotypes that commonly appear in foodborne disease outbreaks, especially SE and ST. Fairly consistent unsupervised prediction accuracies were obtained for all five serotypes, ranging between 94.6% (SH) and 98.0% (SE) accuracy. The PCA projections of the score plots calculated from the validation set are shown overlaying the Salmonella calibration score plot. Figure 6A shows a visual representation of the SE scores projected onto the Salmonella model, with most points projected inside the SIMCA boundaries of the second and third PC, while Figure 6B projects the validation set of Staphylococcus aureus (Sa) scores and the SIMCA calibration boundaries, with most Sa cells projected just outside of the model.
Of the 14 non-Salmonella microorganisms in the validation dataset, there was a larger range of prediction accuracy, varying from 63.6 to 100%. Pseudomonas putida (Ppu) showed the lowest accuracy, with 63.6% classified as non-Salmonella bacteria, while 36.4% were misclassified as Salmonella cells. Of the three Ppu HMI repetitions, one HMI had a significantly higher misclassification rate at 49%. The single cell mean spectra of this HMI were not marked as outliers and thus were not removed from the dataset; this could suggest that the MD outlier detection threshold should be lowered.
Salmonella and E. coli (Ec) are both similar in composition and taxonomy, which is why a larger number of Ec (767 cells) were selected to validate the Salmonella SIMCA prediction model. Previously, Eady and Park [18] showed that the spectral patterns of Salmonella and Ec were more similar than comparing Salmonella to Sa or Li, with Salmonella and Sa being the most dissimilar.
The prediction model resulted in a lower type II error rate, of 0.030, than a type I error rate, of 0.076. This was preferable in regard to a single class model for food safety application, reducing the potential of a false negative sample being made available to consumers. Standard microbial analysis methods for food items such as PCR or the use of nutrient enriched growth media are well established, but come with disadvantages. These results suggest that it is possible to establish a reference library for a bacterial species of interest and to build a SIMCA calibration model that is robust enough for species level detection as a presumptive screening tool, effectively reducing the amount of time and recurring cost associated with traditional detection methods. Microorganisms of interest to the food industry, such as Listeria, Campylobacter, or Staphylococcus aureus, could have HMI reference libraries established and validated. Here, the Salmonella model can be tuned over time to incorporate the addition of more serotypes and wild type bacteria isolated from field trials, and it could eventually be tested in industry settings for the early and rapid presumptive screening of pathogenic microorganisms.
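The relationship between the reported error rates and the sensitivity and specificity can be made explicit with a short sketch. The function name `screening_metrics` is hypothetical, and the counts in the example are illustrative round numbers, not the values from Table 3.

```python
def screening_metrics(tp, fn, tn, fp):
    """Rates for a positive/negative screen, with Salmonella as the positive class."""
    positives = tp + fn
    negatives = tn + fp
    return {
        "sensitivity": tp / positives,    # 1 - type II error rate
        "specificity": tn / negatives,    # 1 - type I error rate
        "type_II_error": fn / positives,  # Salmonella cells called negative
        "type_I_error": fp / negatives,   # non-Salmonella cells called positive
        "accuracy": (tp + tn) / (positives + negatives),
    }

# Illustrative counts only (not taken from the study).
m = screening_metrics(tp=97, fn=3, tn=92, fp=8)

# Rates reported in the text, converted to sensitivity and specificity.
reported_sensitivity = 1 - 0.030
reported_specificity = 1 - 0.076
```

With the reported rates, a type II error of 0.030 corresponds to a sensitivity of 0.97, and a type I error of 0.076 to a specificity of about 0.92.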

Conclusions
Previous HMI experiments addressed base studies in the system's design and approach to pathogenic bacteria detection. In order to build an unsupervised HMI classification model for bacterial species with the sensitivity potential of single cell detection, it was essential to include HMI collected from a range of timeframes and repetitions for adequate model boundary definition. Here, 13 Salmonella serotypes commonly associated with poultry were used to build the calibration model. The SIMCA prediction for Salmonella can be used as a presumptive screening method for early and rapid bacterial detection with a minimal recurring sample cost versus detection methodologies requiring expensive reagent kits, dyes, or markers. Here, a Salmonella prediction accuracy of 95.4% was achieved, along with a sensitivity of 97%. Industry standards for Salmonella detection are approximately 97-98% with qualitative PCR or plating methods. The SIMCA prediction model can be tuned with potential outlier identification or preprocessing methods to increase the selectivity of the model. Future work can add additional Salmonella serotypes to SIMCA's calibration model, tuning the soft boundaries of the unsupervised classification approach for a slight prediction selectivity increase. The results shown here indicate that it is possible to build qualitative single class prediction models for bacteria at a species level, as a tool for high-throughput foodborne pathogen detection.