Non-Destructive Detection Pilot Study of Vegetable Organic Residues Using VNIR Hyperspectral Imaging and Deep Learning Techniques

Contamination is a critical issue that affects food consumption adversely. Therefore, efficient detection and classification of food contaminants are essential to ensure food safety. This study applied a visible and near-infrared (VNIR) hyperspectral imaging technique to detect and classify organic residues on the metallic surfaces of food processing machinery. The experimental analysis was performed by diluting both potato and spinach juices to six different concentration levels using distilled water. The 3D hypercube data were acquired in the range of 400–1000 nm using a line-scan VNIR hyperspectral imaging system. Each diluted residue in the spectral domain was detected and classified using six classification methods, including a 1D convolutional neural network (CNN-1D) and five pre-processing methods. Among them, CNN-1D exhibited the highest classification accuracy, with a 0.99 and 0.98 calibration result and a 0.94 validation result for both spinach and potato residues. Therefore, in comparison with the validation accuracy of the support vector machine classifier (0.9 and 0.92 for spinach and potato, respectively), the CNN-1D technique demonstrated improved performance. Hence, the VNIR hyperspectral imaging technique with deep learning can potentially afford rapid and non-destructive detection and classification of organic residues in food facilities.


Introduction
Contamination inspection of food facilities is indispensable to ensure food safety. While some raw agricultural products are consumed without any processing, processed agricultural products require certain procedures, such as peeling, shredding, cutting, trimming, extruding, and sanitizing [1]. After the completion of processing operations, certain amounts of organic materials may remain on the blades, cracks, or crevices in the facilities. These residues can harbor infectious foodborne bacteria. Therefore, several studies investigated the risk of infection from food processing machines. Researchers collected bacterial samples from fresh-cut processing facilities after sanitization. The collected samples were incubated on a general growth medium for 24 h, from which the mesophilic and psychrotrophic bacteria were isolated and identified. The studies reported that approximately 30% of more than 1000 isolated pathogen samples can potentially provide environmental protection to the foodborne pathogens [2]. Therefore, contamination of fresh-cut products can occur at any time during the processing of agricultural products [3,4].
Conventionally, the hygiene assessment in food processing facilities is performed by plating and incubating the samples on growth media for 24-48 h [5]. Typically, the cultivation of microorganisms is time-consuming and requires trained operators to perform intensive lab work. However, several rapid and non-destructive organic molecular component detection and classification techniques exist to evaluate food safety, particularly to ensure hygiene and sanitation in mass food processing facilities.
For decades, X-ray, near-infrared spectroscopy, and computer vision have contributed to the development of non-destructive safety inspection technologies [6,7]. For instance, Fourier transform near-infrared (FT-NIR) and FT-IR spectroscopic methods identified the unexpected contamination of onion powder by starch [8], and Raman spectroscopic technology differentiated between fake and real eggs [9]. Hyperspectral imaging (HSI) is an emerging technology that uses a data cube with spectral and spatial data to analyze organic residues. For instance, the melamine content in infant formulas was detected and isolated using HSI technology [10]. Furthermore, multiple HSI modalities, including visible and near-infrared (VNIR, 400-1000 nm) and short-wavelength infrared (900-1700 nm), were used to detect and classify mislabeled fish fillets [11]. Additionally, multispectral laser-induced fluorescence imaging was applied to different dilutions of animal fecal matter on apples; across three dilutions of 1:2, 1:20, and 1:200, approximately 80% of the fecal matter was detected within 24 h after application. However, the detection accuracy decreased when the apples were brushed and washed; the fecal matter detected was 100%, 30%, and 0% for the 1:2, 1:20, and 1:200 dilutions, respectively [12].
This study aimed to develop a non-destructive technique to detect and classify the concentration of spinach and potato juice residues on stainless-steel surfaces. The proposed technique acquired the 3D hypercube data using a VNIR HSI system. Furthermore, a CNN-1D and several chemometric methods were used to detect and classify the potato and spinach droplets diluted to six different concentrations.

Sample Preparation
We purchased fresh potato (Solanum tuberosum) and spinach (Spinacia oleracea) from a local market to prepare the diluted residues. Initially, the products were cut and squeezed to extract the juice. The diluted juice samples were then placed on stainless-steel plates and dried for 24 h. The experimental analysis was performed on six samples, namely, a 100% undiluted original fresh juice and five dilutions of potato and spinach, at 20% (1:5), 10% (1:10), 5% (1:20), 2% (1:50), and 1% (1:100), prepared by adding distilled water to the juice. Approximately 10 µL of the diluted solutions were placed on the plate with 15 rows and 2 replicates using a pipette. Thus, the total number of diluted droplets was 90 (6 dilutions × 15 repeats) on the stainless-steel plate.
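The dilution arithmetic above can be checked with a short sketch. It assumes that a ratio written as 1:N means one part juice in N parts of total solution, which is the reading consistent with the percentages given (1:5 corresponds to 20%). The helper name `dilution_volumes` and the 1000 µL batch volume are hypothetical and are not taken from the study.

```python
# Hypothetical helper: juice and water volumes for a "1:N" dilution,
# read here as 1 part juice in N parts of total solution (assumption).
def dilution_volumes(total_ul, ratio_n):
    juice = total_ul / ratio_n
    water = total_ul - juice
    return juice, water

# Class names used in the paper, mapped to the N of each 1:N ratio.
ratios = {"Hundred": 1, "Twenty": 5, "Ten": 10, "Five": 20, "Two": 50, "One": 100}

# Juice concentration in percent for every class.
percent = {name: 100.0 / n for name, n in ratios.items()}

# Droplet count on one plate: 6 dilutions x 15 repeats.
n_droplets = len(ratios) * 15

for name, n in ratios.items():
    juice, water = dilution_volumes(1000.0, n)  # illustrative 1000 uL batch
    print(f"{name:8s} {percent[name]:6.1f}%  juice={juice:7.1f} uL  water={water:7.1f} uL")
print("droplets per plate:", n_droplets)
```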

VNIR HSI System and Data Acquisition
The HSI system comprises a 14-bit electron-multiplying charge-coupled device (EMCCD) camera (Luca R DL-604M, Andor Technology, South Windsor, CT, USA) with a shutter speed of 80 ms, coupled with a C-mount objective lens (F1.9 35 mm compact lens, Schneider Optics, Van Nuys, CA, USA). The VNIR spectra in the range of 400-1000 nm were acquired using a spectrophotometer (VNIR Hyperspec, Headwall Photonics, Inc.; Fitchburg, MA, USA) that was combined with the EMCCD camera. Additionally, two halogen lamps provided the illumination. Each sample plate was placed on a linear motorized platform (Velmex, Inc.; Bloomfield, NY, USA) that conveyed the samples through a lab-built HSI system (ARS, USDA, Beltsville, MD, USA). As the sample plates moved on the platform, the line-scan camera acquired the VNIR 3D hypercube data of the diluted droplets. Before the hyperspectral images were acquired, the dark and white references were captured; they were then applied to calibrate the raw images. The VNIR HSI data contained images of 1000 × 1004 pixels in size and 128 bands in the range of 400-1000 nm. Figure 1 illustrates the flow chart of the image processing and the development of the classification models. To extract spectral data from the raw VNIR HSI images, an optimal region of interest (ROI) must be selected. The Otsu algorithm, principal component analysis (PCA), and U-net were compared for the ROI selection.
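The text does not state the formula used for the dark and white reference calibration. A minimal sketch, assuming the commonly used flat-field form R = (raw − dark) / (white − dark), is given below. The function name `calibrate_reflectance`, the array shapes, and the synthetic values are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def calibrate_reflectance(raw, dark, white, eps=1e-9):
    """Relative reflectance from raw counts (assumed flat-field form)."""
    raw = np.asarray(raw, dtype=float)
    dark = np.asarray(dark, dtype=float)
    white = np.asarray(white, dtype=float)
    # Guard against division by zero where white and dark coincide.
    denom = np.maximum(white - dark, eps)
    return (raw - dark) / denom

# Synthetic example: 10 scan lines, 4 pixels per line, 128 bands.
rng = np.random.default_rng(0)
dark = np.full((4, 128), 100.0)     # dark reference (lamp off / lens capped)
white = np.full((4, 128), 4100.0)   # white reference panel
true_reflectance = rng.uniform(0.1, 0.9, size=(10, 4, 128))
raw = dark + true_reflectance * (white - dark)

# The references broadcast over the scan-line axis.
reflectance = calibrate_reflectance(raw, dark, white)
print(reflectance.shape, float(np.abs(reflectance - true_reflectance).max()))
```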

The Otsu method is a well-known image binarization algorithm that uses an image thresholding technique [23]. In this study, the threshold values for spinach and potato were determined as 110 and 98, respectively. On the other hand, PCA calculates the correlation with input data. However, the spectral data were extracted based on the ROI selected using the U-net method. Additionally, stainless-steel background (BG) spectral data were randomly selected from 12 regions apart from the sample droplets. To enhance the quality of the selected band image, a median filter and an image sharpening technique were applied. The dilutions of the potato and spinach droplets were considered as "Hundred", "Twenty", "Ten", "Five", "Two", and "One" for the classification, corresponding to the dilutions of 100% (original fresh juice), 20% (1:5), 10% (1:10), 5% (1:20), 2% (1:50), and 1% (1:100), respectively.
Figure 2 illustrates the schematic of the U-net architecture applied in the image processing. U-net is a CNN developed for biomedical image segmentation [24], wherein the architecture comprises the encoding and decoding procedures. The left (box# 1-4) and right (box# 6-9) portions in the figure represent the encoding and decoding procedures, respectively. A convolution kernel exists throughout the procedure. The data extraction and compression occur during encoding. Additionally, this model adopted the rectified linear unit (ReLU) as the activation function and a 2 × 2 max-pooling kernel with a dropout coefficient of 0.5 to achieve dimensionality reduction in the data. The input data were downsized by a quarter during the encoding procedure and subsequently merged during up-sampling in the initial convolution of the decoding procedure. The concatenate function merges the dropout and up-sampling convolution. In the decoding process, a 2 × 2 convolution layer was implemented to reconstruct a new feature map rather than the max-pooling layer. Moreover, the concatenate function (merging operation) was implemented with the corresponding feature maps (results of the dropout or convolution) from the encoding process to develop a feature map in each decoding layer. For instance, two 128-channel feature maps (from boxes #4 and #6) were merged using the concatenate function to generate a 256-channel feature map (box #7). At the final layer (box #9), a 1 × 1 convolution kernel transformed the final feature map to yield the output (mask image in binary mode).
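The Otsu thresholding mentioned above for mask generation can be illustrated with a minimal NumPy sketch. It selects the gray level that maximizes the between-class variance of an 8-bit image. The study itself ran the Otsu algorithm in ImageJ, so the function `otsu_threshold` and the synthetic droplet image below are illustrative assumptions and do not reproduce the reported thresholds of 110 and 98.

```python
import numpy as np

def otsu_threshold(image, n_levels=256):
    """Gray level t maximizing between-class variance (pixels <= t vs. > t)."""
    values = np.asarray(image, dtype=np.int64).ravel()
    hist = np.bincount(values, minlength=n_levels).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(hist.size)
    w0 = np.cumsum(prob)                   # weight of the class at or below t
    w1 = 1.0 - w0                          # weight of the class above t
    cum_mean = np.cumsum(prob * levels)    # cumulative first moment
    total_mean = cum_mean[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (total_mean * w0 - cum_mean) ** 2 / (w0 * w1)
    # Thresholds that leave one class empty give 0/0; treat them as zero variance.
    between = np.nan_to_num(between, nan=0.0, posinf=0.0, neginf=0.0)
    return int(np.argmax(between))

# Synthetic single-band image: dark background with one brighter droplet.
image = np.full((20, 20), 50, dtype=np.uint8)
image[5:15, 5:15] = 150

t = otsu_threshold(image)
mask = image > t   # binary mask of the droplet region
print("threshold:", t, "droplet pixels:", int(mask.sum()))
```

Because a single global threshold is applied to the whole band image, faint low-concentration droplets whose intensity is close to the background can be lost, which is the limitation the text attributes to the Otsu and PCA masks.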

Development of the Classification Model
The classification model was developed based on two strategies (Figure 1). As indicated in the figure, STEP #1 constitutes the chemometric methods that involve multivariate analysis methods and machine learning algorithms. Conversely, STEP #2 uses the CNN-1D algorithm. Table 1 presents the detailed model architecture and specifications. Both linear and non-linear multivariate classification methods, such as linear discriminant analysis (LDA), partial least squares discriminant analysis (PLS-DA), support vector machine (SVM), decision tree (DT), least squares support vector machine (LSSVM), and random forest (RF), were used to analyze the results [25,26]. RF is an ensemble algorithm of DTs $\{T_1(X), \ldots, T_N(X)\}$, wherein $X = \{x_1, \ldots, x_p\}$ is a $p$-dimensional vector of properties associated with a dependent variable (spectrum of diluted droplets). The tree ensemble yields $N$ outputs $\{\hat{Y}_1 = T_1(X), \ldots, \hat{Y}_N = T_N(X)\}$, wherein $\hat{Y}_n$, $n = 1, \ldots, N$, represents the class predicted by the $n$-th tree for the input data [27,28]. The classification results were obtained using six pre-processing methods. No-P denotes no pre-processing; D1 and D2 are the 1st and 2nd derivatives, respectively, based on the Savitzky-Golay algorithm; MSC, MA, and NM represent the multiplicative scatter correction, moving average, and normalization, respectively. The accuracy and Cohen's kappa coefficient, both derived from the confusion matrix, were used to report the results of the classification methods [29]. Cross-validation was performed to evaluate the accuracy of the classification models using the leave-one-out (LOO) method. All classification algorithms and pre-processing methods were coded using R (Ver. 3.6.2), the statistical open-source environment and language. The model was developed using multiple classification packages, such as caret (Ver. 6.0-85), e1071 (Ver. 1.7-3), rpart (Ver. 4.1-15), kernlab (Ver. 2004), and randomForest (Ver. 4.6). The Otsu algorithm was performed using ImageJ (Ver. 1.53c), which is an open-source scientific image processing program.
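The two reported metrics can be derived from a confusion matrix as in the sketch below. It uses the standard definitions: accuracy is the fraction of samples on the diagonal, and Cohen's kappa is (p_o − p_e) / (1 − p_e), where p_e is the agreement expected by chance from the row and column totals. The function name `accuracy_and_kappa` and the 2 × 2 matrix are made up for illustration; the study computed these metrics in R, and the numbers here are not taken from Tables 2 or 3.

```python
import numpy as np

def accuracy_and_kappa(confusion):
    """Overall accuracy and Cohen's kappa from a square confusion matrix."""
    cm = np.asarray(confusion, dtype=float)
    n = cm.sum()
    p_observed = np.trace(cm) / n
    # Chance agreement from the marginal totals of predictions and references.
    p_expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2
    kappa = (p_observed - p_expected) / (1.0 - p_expected)
    return p_observed, kappa

# Illustrative matrix (rows: reference class, columns: predicted class).
cm = np.array([[45, 5],
               [10, 40]])
acc, kappa = accuracy_and_kappa(cm)
print(f"A = {acc:.2f}, K = {kappa:.2f}")
```

Kappa is lower than accuracy here because half of the observed agreement would be expected by chance with these marginals, which is why the paper reports both A and K.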
To classify the spectral data obtained from the diluted residues, a CNN-1D model was developed based on the architecture and parameters presented in Table 1. The optimized CNN-1D algorithm comprises convolution, average pooling, max-pooling, dropout, and output. The activation function uses ReLU to produce an image from a linear model. Both average pooling and max-pooling are applied to reduce the dimensionality of the spectral data. The total number of parameters and repeated epochs were 123,967 and 5000, respectively. However, these values can vary depending on the state of convergence. Deep learning classification was performed using Python (Ver

ROI Segmentation
We used a mask image to select the ROIs of the potato and spinach residues automatically from the corrected sample images. The mask image was developed to observe the segmentation results in the column regions (20 × 86 pixels) of six dilutions obtained from the 15 repeated raw images (391 × 86 pixels). Figure 3 depicts the mask images of the spinach and potato residues obtained from the Otsu algorithm, PCA, and U-net methods along with the sample raw images. In the case of spinach, the color of the residue Hundred (100%) in the raw image is different from that of the other diluted residues. While the Otsu and PCA masks segmented the entire sample in the five diluted residues, U-net produced all the samples with limited loss in image pixels. Although the droplets of potato exhibited different intensities between the residues of Hundred (100%) and One (1%) in the raw image, PCA and U-net produced appropriate results.
Figure 4 illustrates the mean spectra extracted from the HSI data obtained from the residues of the potato juice and the stainless-steel BG. The colored image depicts six dilutions of potato residues and the extracted region of the BG spectrum. The peaks in the mean spectrum were observed at 625, 720, 785, and 860 nm. Typically, most of the selected bands within the VNIR regions are associated with physiological substances, such as the CH, NH, and OH stretching, in the vibrational spectrum. For instance, absorption of anthocyanin and carotenoid occurs at 650 and 680 nm, respectively [30,31]. Spectral bands at 690-710 nm and 760-800 nm represent the total chlorophyll bands, whereas the absorption bands at 705, 842, and 920 nm are associated with carbohydrates [31]. The band at 995 nm represents the 2nd vibration of the NH bonds in proteins or amino acids, whereas that at 880 nm constitutes the 3rd overtone absorption of CH. Additionally, the bands at 750-900 nm and 962-1000 nm relate to the 2nd overtone absorption of the OH and NH bonds, respectively [32].
Figure 5 illustrates the score scatter plots of the principal components (PCs) for intuitive data analysis. We assigned seven colors to the residues based on the dilutions of potato and the BG surface. The first PC (PC1) denotes the variance of the potato residues at six dilutions and the BG spectral data as 99% and 1%, respectively. Conversely, the second PC (PC2) indicates the variance of the spinach residues at six dilutions and the BG spectral data as 98% and 2%, respectively. The original juice (Hundred, no dilution) was easily distinguishable in the PC score plots (black, circle). Moreover, a class of BG (yellow, square) was isolated from the diluted residues. Figure 5 shows that the low-dilution residues (<10%) of potato demonstrated overlapping clusters.
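The PC score plots discussed above come from projecting mean-centered spectra onto their principal components. A minimal sketch using the singular value decomposition is shown below; the function `pca_scores` and the synthetic spectra (three concentration levels of one spectral shape plus small noise) are assumptions made for illustration and do not reproduce the measured data or the variance figures reported for Figure 5.

```python
import numpy as np

def pca_scores(spectra, n_components=2):
    """PC scores, loadings, and explained-variance ratios via SVD."""
    X = np.asarray(spectra, dtype=float)
    Xc = X - X.mean(axis=0)                      # mean-center each band
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S**2 / np.sum(S**2)
    return scores, Vt[:n_components], explained[:n_components]

# Synthetic spectra: 15 droplets at each of three concentrations, 128 bands.
rng = np.random.default_rng(0)
n_bands = 128
base = np.linspace(0.2, 0.8, n_bands)            # hypothetical spectral shape
levels = np.repeat([1.0, 0.2, 0.1], 15)          # relative concentrations
spectra = levels[:, None] * base[None, :] + rng.normal(0.0, 0.002, (45, n_bands))

scores, loadings, explained = pca_scores(spectra)
print("explained variance ratio:", np.round(explained, 4))
```

In this toy setting the undiluted class separates clearly along PC1 while the two weaker levels sit close together, which mirrors the overlap of the low-concentration clusters described for Figure 5.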

Classification Results
Six multivariate analysis methods and machine learning algorithms were used to classify the diluted residues on the stainless-steel surface. Tables 2 and 3 present the classification results of the potato and spinach residues, respectively, considering the accuracy (A) and kappa coefficient (K) in the classification models.

In the case of potato residues, LSSVM and RF exhibited accuracies higher than 0.86 with pre-processing methods such as No-P, D1, MA, and NM. Additionally, LDA demonstrated reasonable classification results at an accuracy of 0.83; however, the accuracies of PLS-DA and DT were less than 0.77. Among the chemometric methods, SVM yielded the highest accuracy at 0.90 (Table 2), and the detailed results for each of the diluted residues were 1.0, 0.89, 0.87, 0.93, 0.71, 0.94, and 0.95 for Hundred, Twenty, Ten, Five, Two, One, and BG, respectively. By comparison, the per-class accuracies of CNN-1D were 1.0, 0.97, 0.96, 0.89, 0.90, 0.89, and 1.0 (Figure 6), which match or exceed those of SVM for every class except Five and One.
In the case of spinach residues, SVM exhibited the most accurate classification results with an accuracy of 0.92 (Table 3). Moreover, the accuracies of the classification results obtained from RF were higher than 0.9 in the case of D1 and MSC. While SVM classified the results of each of the residues at accuracies of 1. ... (Figure 7).
To compare the results of the classification models, spectral data analysis was performed using the CNN-1D algorithm. Six diluted residues were classified, and the results were presented using a confusion matrix. In this study, the numbers of training epochs and parameters were 500 and 123,967, respectively. The mean absolute error and the loss were 0.0262 and 0.0093, respectively, after the model was trained. Figures 6 and 7 depict the confusion matrices representing the prediction accuracies of the developed CNN-1D model applied to the validation dataset. While Figure 6 depicts the classification and validation results of the potato residues (Ac = 0.99, Av = 0.94), Figure 7 presents those of the spinach residues (Ac = 0.98, Av = 0.94). These results indicate that the CNN-1D improves the classification accuracy by 2-4% from 0.92 and 0.90 in the case of potato and spinach, respectively, using the chemometric method.
Table 2. Result of the classification of the diluted residues of potato as the accuracy (A) and kappa coefficient (K) based on chemometric methods.

Conclusions
To detect and classify the organic residues on a metal surface accurately, we developed a classification model using VNIR HSI technology and machine learning methods. We implemented two deep learning methods: U-net to generate the mask image and CNN-1D to classify the spectra. Owing to the enhanced ROI segmentation and fine-tuned parameters in the CNN layers, both the deep learning methods demonstrated improved classification accuracies in the case of diluted residues. The other two mask image algorithms, Otsu and PCA, rely on a single intensity threshold, which is calculated from the intra-class and between-class variances or from the loading vector, respectively. A single threshold makes these methods fast and simple, but they struggle to separate fine image details from the background. In turn, U-net adopted encoding and decoding procedures for data reduction and optimal feature selection. Furthermore, data augmentation using the annotated images demonstrated a precise feature segmentation [24]. Typically, organic residues can potentially generate biofilms that cause extracellular polymeric substance production and maturation [33]. Therefore, this study can potentially afford the early detection of biofilms in food processing machines using VNIR HSI and machine learning at an accuracy of A = 0.94. However, further research is necessary to obtain diverse evidence to fine-tune the hyperparameters of the deep learning methods, particularly when multiple samples are considered. Additionally, the proposed model must be validated across areas and in more practical locations of the food industry.