Rapid Identification of Rainbow Trout Adulteration in Atlantic Salmon by Raman Spectroscopy Combined with Machine Learning

This study evaluates the potential of Raman spectroscopy combined with machine learning to rapidly identify rainbow trout adulteration in Atlantic salmon. The adulterated samples contained various concentrations (0–100% w/w at 10% intervals) of rainbow trout mixed into Atlantic salmon. Four spectral preprocessing methods were employed: first derivative (FD), second derivative (SD), multiplicative scatter correction (MSC), and standard normal variate (SNV). Supervised algorithms, namely recursive feature elimination (RFE), genetic algorithm (GA), and simulated annealing (SA), and the unsupervised K-means clustering (KM) algorithm were used to select important spectral bands, reducing the spectral complexity and improving the model stability. Finally, the performances of various machine learning models, including linear regression, nonlinear regression, regression tree, and rule-based models, were verified and compared. The GA–KM–Cubist machine learning model with MSC preprocessing achieved satisfactory results: the determination coefficient (R²) and root mean square error of prediction (RMSEP) for the test sets were 0.87 and 10.93, respectively. These results indicate that Raman spectroscopy can be used as an effective method for identifying Atlantic salmon adulteration and that the developed model can be used to quantify rainbow trout adulteration in Atlantic salmon.


Introduction
Atlantic salmon (Salmo salar) has attracted consumer interest because of its unique taste and rich nutritional value. Even though there is a huge demand for Atlantic salmon in the Chinese market, imported Atlantic salmon is often in short supply. Under these circumstances, rainbow trout (Oncorhynchus mykiss) is often used to imitate or adulterate Atlantic salmon meat or products. Rainbow trout is considerably less expensive than Atlantic salmon but looks similar, making it difficult for consumers to distinguish between the two. Adulterated Atlantic salmon meat not only infringes the legitimate rights and interests of consumers but also causes serious food safety problems, leading to widespread concern among consumers, producers, retailers, and food regulatory agencies.
Therefore, an accurate and expedient method is required for identifying Atlantic salmon adulteration. Traditional meat identification methods have mainly used enzyme-linked immunosorbent assays, deoxyribonucleic acid (DNA) [1][2][3][4][5][6], proteome [7][8][9], and triacylglycerol-based analytical techniques [10]. Although these methods have proved to be accurate, they exhibit long processing times.
To eliminate the effects of baseline drift and scattering distortion on the spectra and to compare the spectral differences between the two fish, preprocessing methods, including baseline correction and MSC, were applied, as depicted in Figure 1a. Some significant differences were observed between the intensities of the Raman peaks of the two fish. The mean and standard deviation spectra of salmon and rainbow trout are depicted in Figure 1b. Eight overlapping peaks with different intensities (with the exception of 1748 cm⁻¹) were identified in the two spectra; the rainbow trout peaks were the stronger of the two. These peak intensity differences formed the basis for distinguishing between rainbow trout and Atlantic salmon. The component functional groups corresponding to the Raman peaks of the two fish fats were analyzed and are provided in Table 1.
As presented in Table 1, the peak at 1748 cm⁻¹ is weak and is attributed to the C=O ester stretching mode. The peak at 1659 cm⁻¹ corresponds to a Z-alkene, ν(C=C), in the fatty acid chain, whereas the strong peak at 1441 cm⁻¹ corresponds to the C-H bending (scissoring) modes. The peak at 1303 cm⁻¹ is attributed to the CH2 twisting modes (C-H), the peak at 1268 cm⁻¹ is due to the Z conformation stretching modes (=CH) from the unsaturated fatty acids, the peaks at 1079 cm⁻¹ and 872 cm⁻¹ are due to gauche C-C stretching vibrations (C-C), and the peak at 974 cm⁻¹ is caused by the bending vibration of trans (=CH). These peaks exhibited medium strength. The eight characteristic peaks have also been reported as common features in the Raman spectra of edible oils below 2000 cm⁻¹ [40][41][42], and the peak positions and intensities of different fatty acids have been observed to be slightly different [43].
The Raman spectra of Atlantic salmon with different proportions of rainbow trout adulteration are depicted in Figure 2. The Raman spectra of the adulterated Atlantic salmon fat were very similar, with characteristic peaks being observed at 1748, 1659, 1441, 1303, 1268, 1079, 974, and 872 cm⁻¹. The Raman peak intensity increased with increasing amounts of rainbow trout meat, enabling the Raman spectra of the different Atlantic salmon meat samples to be distinguished. These differences in peak intensity provided the basis for further model development.

Preprocessing Analysis
The spectra of the Atlantic salmon samples obtained using different pretreatment methods, such as baseline correction, MSC, SNV, FD, and SD, are depicted in Figure 3a-f.
To compare the effects of the different pretreatment methods, a partial least squares regression (PLSR) model was fitted to the spectra produced by each method. Modeling without a pretreatment step was used as a reference. The results are presented in Table 2. The model performance was the highest when the number of principal components in the PLSR model was 10. For the test sets, modeling on the raw spectra reached an RMSEP of 17.27 and an R² of 0.70. In contrast, the RMSEP values obtained using FD and SD were 23.10 and 30.15, and the R² values were 0.48 and 0.12, respectively. These results demonstrated that FD and SD modeling were less effective than raw spectral modeling. The FD and SD methods amplified the noise in the spectra, which can explain the poor performance of these models. Both MSC and SNV achieved better results than the original spectra, i.e., the resulting RMSEP was smaller and R² was larger. The two methods also eliminated the scattering effect that negatively influenced the spectral data.
Where the two methods performed similarly, SNV modeling required more principal components than MSC modeling. Therefore, the MSC modeling performance was considered to be slightly better than that of SNV, and MSC was the only spectral preprocessing method employed when comparing the different machine learning models in this study.
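The comparison described above (fit PLSR on each pretreated version of the spectra, then compare RMSEP on held-out samples) can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' workflow, which used The Unscrambler and R. The function names `pls1_fit`, `rmsep`, and `compare_pretreatments` are hypothetical, and the single-response NIPALS algorithm is assumed as the PLSR variant.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit single-response PLS regression with the NIPALS algorithm.

    Returns (coef, intercept) so that predictions are X @ coef + intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean        # centered copies, deflated below
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                      # direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                         # scores
        tt = t @ t
        p = Xk.T @ t / tt                  # X loadings
        qk = yk @ t / tt                   # y loading
        Xk = Xk - np.outer(t, p)           # deflate X
        yk = yk - qk * t                   # deflate y
        W.append(w)
        P.append(p)
        q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

def rmsep(y_true, y_pred):
    """Root mean square error of prediction."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def compare_pretreatments(variants, y, train, test, n_components):
    """variants maps a pretreatment name to its spectra matrix (rows match y)."""
    scores = {}
    for name, X in variants.items():
        coef, b0 = pls1_fit(X[train], y[train], n_components)
        scores[name] = rmsep(y[test], X[test] @ coef + b0)
    return scores
```

Here `variants` would hold, for example, the raw, FD, SD, MSC, and SNV versions of the same spectra, with `n_components` set to the value chosen by cross-validation (10 in the study).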

Important Spectral Band Selection
In this study, supervised methods (RFE, GA, and SA) were combined with an unsupervised method (KM) to reduce the dimensionality of the spectral wavelengths, and the optimal bands were selected (Table 3).
Table 3. Three feature wavelength selection methods based on the PLSR modeling results.
The results presented in Table 3 show that the performances of the three methods were relatively similar and not distinct from that of the full-spectra model. However, the number of required spectral bands was considerably smaller than in the full-spectra method, improving both the efficiency and stability of the model. GA-KM was considered to be the best method; it reduced the number of wavelengths required by the model from 882 to 431 and exhibited improved prediction performance (R² = 0.81, RMSEP = 13.34%).

Results of the Cubist Model
After applying the MSC pretreatment method and selecting the optimal feature bands by GA-KM, the Cubist model was established to identify adulterated Atlantic salmon samples containing different proportions of rainbow trout. To optimize the Cubist model, different numbers of committees and instances were examined. The cross-validation curve of the Cubist model is presented in Figure 4. Regardless of the number of instances, the error decreased significantly as the number of committees increased to 10. Increasing the number of committees from 10 to 20 decreased the error only slightly. The error was the smallest, and the model performed best, when the Cubist model used 20 committees and 5 instances. Furthermore, the performance of the model decreased when the number of instances was too low or too high. The adulteration ratio of Atlantic salmon was predicted based on the aforementioned parameters, and the results are presented in Figure 5. The RMSE in the calibration sets was 12.67, and R² was 0.84; these values were 10.93 and 0.87, respectively, in the test sets. The experimental data exhibited a high degree of agreement with the predicted data, and the modeling performance was good, suggesting that this technique could be used to quickly identify Atlantic salmon adulteration.

Discussion
To demonstrate the advantages of the Cubist algorithm with respect to model prediction, 13 machine learning methods and the PLSR method were used to model the selected spectral bands. The results are summarized in Table 4. For the test sets, the Cubist model had the smallest RMSEP (10.98), followed by PLSR (13.34). This result indicated that models built on linear regression (Cubist and PLSR) may perform better than the other models. One explanation is that adulteration with rainbow trout followed a linear relation in the Raman spectra: as the adulteration ratio increased, the peak intensity of the Raman spectra increased. The RMSEP of the Cubist method was much smaller than that of the remaining models, which could be because Cubist is a rule-based model. Each leaf node of the model tree holds a linear regression model, and the regression equations fitted at the nodes were more flexible and more accurate than those of the other regression models [44].
Furthermore, complex linear regression models did not yield better performances. The RMSEP values of Glmboost, Enet, Ridge, and Rqlasso were higher than that of the PLSR model. The modeling performances obtained using nonlinear models, such as random forests and neural networks, were even worse than those of the complex linear models. This indicated that the more complex linear or nonlinear models were not globally optimal when the number of samples was not sufficiently large and that they were prone to over-fitting, which decreased the accuracy of the model.
In summary, the modeling performance of the linear models was generally better than those of the other models, and there was a linear relation between the rainbow trout adulteration and the peak intensity of the Raman spectra. The Cubist model exhibited the best modeling performance and was combined with Raman spectroscopy to develop a new technique for identifying Atlantic salmon adulteration.
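The linear relation invoked above can be illustrated with an idealized mixing model. The sketch below assumes, purely for illustration, that the spectrum of a mixture is the weighted sum of the two pure spectra; real spectra of extracted fat need not mix exactly linearly, and the function names and synthetic peak shapes are hypothetical. Under that assumption, every peak intensity is linear in the trout fraction, and the fraction can be recovered by least squares.

```python
import numpy as np

def mix_spectra(salmon, trout, trout_frac):
    """Idealized linear mixture of two pure spectra (assumption, not measured data)."""
    return (1.0 - trout_frac) * salmon + trout_frac * trout

def estimate_trout_fraction(spectrum, salmon, trout):
    """Least-squares estimate of f in: spectrum = salmon + f * (trout - salmon)."""
    d = trout - salmon
    return float(d @ (spectrum - salmon) / (d @ d))

# Synthetic example: two Gaussian peaks near 1441 and 1659 cm^-1,
# with the trout spectrum set to be more intense than the salmon spectrum.
shift = np.linspace(800.0, 1800.0, 501)
salmon = np.exp(-((shift - 1441.0) / 15.0) ** 2) + 0.6 * np.exp(-((shift - 1659.0) / 15.0) ** 2)
trout = 1.3 * salmon

for frac in (0.0, 0.3, 1.0):
    mixed = mix_spectra(salmon, trout, frac)
    print(frac, round(estimate_trout_fraction(mixed, salmon, trout), 6))
```

A model with linear leaf equations, such as Cubist or PLSR, is well matched to data of this form, which is consistent with the explanation given in the text.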

Sample Preparation
Different amounts of rainbow trout were added to Atlantic salmon to create adulterated samples. To expand the sample diversity and improve the credibility of the experimental results, Atlantic salmon was obtained from different batches, at different times, and from different regions (Denmark, Scotland, Chile, and Norway). To ensure the authenticity of the samples, import-certified Atlantic salmon stores were selected. Danish Atlantic salmon meat was purchased from Hippo Fresh Food in Guangzhou, China; Chilean Atlantic salmon meat was purchased from Jingdong Supermarket in Guangzhou, China; Scottish Atlantic salmon meat was purchased from Haidi Wang Fresh Seafood in Shanghai, China; and Norwegian Atlantic salmon meat was purchased from the Yuesheng official store in Shenzhen, China. Rainbow trout was also purchased from different regions and stores. Qinghai rainbow trout was purchased from the Tmall Longyangxia store in Gonghe, China; Gansu rainbow trout was purchased from the Tmall Shangzhi store in Lanzhou, China; Shandong rainbow trout was purchased from the Laoshan ecological farm in the Taobao store in Qingdao, China; and Liaoning rainbow trout was purchased from the Tmall supermarket in Benxi, China.
The Atlantic salmon meat obtained from four regions (Denmark, Chile, Scotland, and Norway) and the rainbow trout obtained from Qinghai, Gansu, Shandong, and Liaoning were all crushed using a grinder (KRUPS, Germany). The Atlantic salmon and rainbow trout were then mixed at the following rainbow trout weight percentages: 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100% (w/w). Each of the 176 mixed samples weighed 50 g, and 2-3 parallel samples were prepared for each gradient, affording 516 prepared samples. Each sample was subsequently homogenized using a meat grinder and centrifuged at 10,000 rpm for 5 min. After centrifugation, the upper oily layer was pipetted into a chemical reaction plate for the Raman spectroscopy measurements.
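As a quick check of the mixing scheme, the masses of each fish in a 50 g sample at every adulteration level can be tabulated. The helper name `mixture_masses` is hypothetical; the 50 g total and the 10% step are taken from the text.

```python
def mixture_masses(total_g=50.0, step_pct=10):
    """Return (trout_pct, trout_g, salmon_g) for each adulteration level."""
    levels = range(0, 101, step_pct)
    return [(p, total_g * p / 100.0, total_g * (100 - p) / 100.0) for p in levels]

for pct, trout_g, salmon_g in mixture_masses():
    print(f"{pct:3d}% trout: {trout_g:5.1f} g trout + {salmon_g:5.1f} g salmon")
```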

Raman Spectral Data Measurements
A portable Raman spectrometer (FoodDefend RM, Thermo Fisher, Waltham, MA, USA) was used to collect the Raman spectra. The Raman system was equipped with a laser excited at 785 nm. When scanning, the laser power source was set to 250 mW, and the spectral range was 250-2500 cm⁻¹. The spectral resolution was 7 cm⁻¹, the exposure time was 5 ms, the scanning delay was 60 s, and the operating temperature was set to 30 °C to achieve the optimal Raman peaks. To eliminate noise and ensure data repeatability, each sample was scanned thrice; the average values were used as the sample spectra.

Spectral Pretreatment
Before modeling, spectral preprocessing was required to reduce noise and to eliminate random and systematic variations in the data [26]. The four preprocessing methods (FD, SD, MSC, and SNV) affect the spectral data differently. For example, FD removes baselines, SD removes baselines and linear trends [45], and MSC and SNV are typically used to eliminate unwanted scattering effects [46]. In this study, PLSR was used to model the same spectral data after each of the four pretreatments, and the effects of the pretreatments on the Atlantic salmon adulteration analysis were evaluated.
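The four pretreatments can be sketched in a few lines of NumPy. This is an illustrative sketch only: the study performed these steps in The Unscrambler, whose exact settings (for example, any smoothing applied with the derivatives) are not given here, so the plain finite-difference derivative and the mean-spectrum MSC reference below are assumptions.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) by its own mean and std."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def msc(spectra, reference=None):
    """MSC: regress each spectrum on a reference (default: the mean spectrum)
    and remove the fitted offset and slope."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        slope, intercept = np.polyfit(ref, row, 1)   # row ~ intercept + slope * ref
        corrected[i] = (row - intercept) / slope
    return corrected

def derivative(spectra, order=1):
    """Finite-difference derivative along the wavenumber axis (order 1 = FD, 2 = SD).
    No smoothing is applied, so noise is amplified."""
    return np.diff(np.asarray(spectra, dtype=float), n=order, axis=1)

# Synthetic spectra that differ only by an additive offset and a multiplicative gain.
base = np.sin(np.linspace(0.0, 3.0, 50)) + 2.0
offsets = np.array([0.0, 0.5, -0.2])
gains = np.array([1.0, 1.5, 0.8])
X = offsets[:, None] + gains[:, None] * base
print(np.ptp(msc(X), axis=0).max())   # spread across samples after MSC
print(np.ptp(snv(X), axis=0).max())   # spread across samples after SNV
```

In this synthetic case both MSC and SNV collapse the three spectra onto one curve, which is the scatter-removal behavior described in the text; each derivative shortens the spectrum by one point per order.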

Spectral Band Selection Methods
Selecting important spectral bands was critical to reducing the high dimensionality of the spectral data and increasing the processing speed [16]. Some researchers have used unsupervised methods, such as clustering algorithms, for feature selection [47][48]. As the clustering method has been shown to result in the selection of irrelevant features [47][48][49], irrelevant features were deleted before feature clustering. Some supervised feature selection methods, such as RFE [50], GA [51], and SA [52], have been shown to be effective approaches for removing unrelated features. We therefore combined the supervised and unsupervised methods for dimensionality reduction and variable selection. First, the RFE, GA, and SA algorithms were used to remove the uncorrelated wavelengths and select the relevant characteristic bands. Then, KM [53][54][55] was used to optimize the feature wavelengths, and PLSR was used to monitor the modeling error of these selections. The optimal feature wavelengths were obtained based on these results.
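The clustering stage of this two-step scheme can be sketched as below. The text does not specify how KM was applied to the wavelengths, so this is one plausible reading, stated as an assumption: the wavelengths that survive the supervised step are clustered by their intensity profiles across samples, and one representative wavelength per cluster is kept. The functions `kmeans` and `representative_bands` and the seeded initialization are hypothetical; in the study the number of clusters would be chosen by monitoring the PLSR error.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def representative_bands(X, candidate_idx, k, seed=0):
    """Cluster candidate wavelength columns of X and keep the column nearest each centroid."""
    candidate_idx = np.asarray(candidate_idx)
    profiles = X[:, candidate_idx].T                      # one row per wavelength
    profiles = profiles - profiles.mean(axis=1, keepdims=True)
    labels, centers = kmeans(profiles, k, seed=seed)
    keep = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size == 0:                             # skip empty clusters
            continue
        d = ((profiles[members] - centers[j]) ** 2).sum(axis=1)
        keep.append(candidate_idx[members[d.argmin()]])
    return np.sort(np.array(keep))
```

`candidate_idx` stands for the column indices retained by RFE, GA, or SA; that supervised step is not implemented in this sketch.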

Modeling Methods
Certain machine learning models have proven to be effective for identifying food adulteration [56][57][58]. In this study, the applicability of several machine learning models for predicting Atlantic salmon adulteration was evaluated. The R language was used for modeling, and a total of 14 mainstream machine learning algorithms, including linear regression, nonlinear regression, tree-based, and rule-based models, were used for training and testing. The linear regression models included PLSR, the boosted generalized linear model (Glmboost) [59], elastic net regression (Enet) [60], ridge regression (Ridge) [61], quantile regression with LASSO penalty (Rqlasso) [62], and multi-step adaptive MCP-net (Msaene) [63]. The nonlinear regression models included quantile random forest (Qrf) [64], parallel random forest (parRF) [65], random forest (Rf) [66], k-nearest neighbors (Kknn) [67], and multivariate adaptive regression spline (Earth) [68]. The tree-based models included conditional inference tree (ctree) [69] and extreme gradient boosting (xgbTree) [70]. The Cubist model (Cubist) [71] was the only rule-based model considered in this study. The selected wavelength sets and adulteration levels were used as the input and output variables, respectively, and the input and output data and all other conditions were kept consistent while evaluating and comparing the model performance. Random sampling was used to divide the data sets into two subsets: training data (75%) and test data (25%). During modeling, 10-fold cross-validation was repeated five times, and the results were averaged. The aforementioned process was implemented using the R language Caret package.
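The resampling scheme (a random 75/25 split followed by 10-fold cross-validation repeated five times on the training portion) can be sketched as follows. This is a NumPy illustration rather than the Caret implementation used in the study; the function names and seeds are hypothetical, and the split here is a simple random permutation.

```python
import numpy as np

def train_test_split_idx(n, train_frac=0.75, seed=0):
    """Randomly split sample indices 0..n-1 into training and test subsets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = int(round(train_frac * n))
    return perm[:n_train], perm[n_train:]

def repeated_kfold(indices, k=10, repeats=5, seed=0):
    """Yield (repeat, fold, fit_idx, holdout_idx) for repeated k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    indices = np.asarray(indices)
    for r in range(repeats):
        folds = np.array_split(rng.permutation(indices), k)   # reshuffle each repeat
        for f in range(k):
            fit_idx = np.concatenate([folds[j] for j in range(k) if j != f])
            yield r, f, fit_idx, folds[f]

train_idx, test_idx = train_test_split_idx(516)
n_resamples = sum(1 for _ in repeated_kfold(train_idx))
print(len(train_idx), len(test_idx), n_resamples)
```

Each model would be fitted on `fit_idx` and scored on `holdout_idx`, with the errors averaged over all resamples to choose tuning parameters before the final evaluation on `test_idx`.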

Model Evaluation
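The determination coefficient (R²), root mean square error of the calibration sets (RMSEC), and root mean square error of the test sets (RMSEP) were used to evaluate the performance of the regression model. The definitions were as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$

$$\mathrm{RMSEC},\ \mathrm{RMSEP} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2}$$

where $\hat{y}_i$ is the predicted adulteration level of the ith sample, $y_i$ is the true adulteration level of the ith sample, $\bar{y}$ is the average of $y_i$, and N is the number of samples (in the calibration sets for RMSEC and in the test sets for RMSEP).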
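A minimal sketch of these metrics, assuming the standard definitions (R² as one minus the ratio of the residual to the total sum of squares, and RMSE as the square root of the mean squared error), is given below. The function names are hypothetical; the same `rmse` function gives RMSEC on the calibration sets and RMSEP on the test sets.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted adulteration levels."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Determination coefficient: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [0.0, 50.0, 100.0]     # true adulteration levels (% w/w)
y_pred = [10.0, 50.0, 90.0]     # illustrative predictions, not study data
print(rmse(y_true, y_pred), r_squared(y_true, y_pred))
```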

Software
All the Raman spectral data pretreatments were performed using The Unscrambler X14.1 software (CAMO, Oslo, Norway). All the calculations were performed using the R program (version 3.5.1). The Kknn package (version 1.3.1) was used for variable clustering, and the Caret package (version 6.0-82) was used for performing feature wavelength selections and machine learning modeling.

Conclusions
In this study, we evaluated the ability of a combined Raman spectroscopy and machine learning approach to rapidly detect the adulteration of Atlantic salmon using rainbow trout. A linear relation can be observed between the adulteration ratio of Atlantic salmon and the Raman spectra intensity. In this experiment, MSC was shown to be a better pretreatment method when compared with FD, SD, and SNV. GA was used to delete the irrelevant wavelengths, and KM was used to optimize the spectral bands. The Cubist method achieved the highest performance while modeling the spectra. Thus, the machine learning model developed in this study based on the MSC-GA-KM-Cubist method is an effective tool for quickly identifying the adulteration of Atlantic salmon meat.

Conflicts of Interest:
The authors declare no conflict of interest. This article does not contain any studies with human or animal subjects. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results. All the authors involved with the work agree to submit this paper to Molecules and claim that none of the material in the paper has been published or is under consideration for publication elsewhere.