A Support Vector Machine-Assisted Metabolomics Approach for Non-Targeted Screening of Multi-Class Pesticides and Veterinary Drugs in Maize

The contamination risks of plant-derived foods due to the co-existence of pesticides and veterinary drugs (P&VDs) have not been fully understood. With an increasing number of unexpected P&VDs illegally added to foods, it is essential to develop a non-targeted screening method for P&VDs for their comprehensive risk assessment. In this study, a modified support vector machine (SVM)-assisted metabolomics approach by screening eligible variables to represent marker compounds of 124 multi-class P&VDs in maize was developed based on the results of high-performance liquid chromatography–tandem mass spectrometry. Principal component analysis and orthogonal partial least squares discriminant analysis indicate the existence of variables with obvious inter-group differences, which were further investigated by S-plot plots, permutation tests, and variable importance in projection to obtain eligible variables. Meanwhile, SVM recursive feature elimination under the radial basis function was employed to obtain the weight-squared values of all the variables ranging from large to small for the screening of eligible variables as well. Pairwise t-tests and fold changes of concentration were further employed to confirm these eligible variables to represent marker compounds. The results indicate that 120 out of 124 P&VDs can be identified by the SVM-assisted metabolomics method, while only 109 P&VDs can be found by the metabolomics method alone, implying that SVM can promote the screening accuracy of the metabolomics method. In addition, the method’s practicability was validated by the real contaminated maize samples, which provide a bright application prospect in non-targeted screening of contaminants. The limits of detection for 120 P&VDs in maize samples were calculated to be 0.3~1.5 µg/kg.


Introduction
It has always been a headache to resolve pesticide contamination in plant-derived foods, which poses a considerable threat to global food safety.Some national governments (e.g.,China [1], the United States [2], and Japan [3]) and international authorities (e.g., the European Union [4]) have issued a series of formal regulatory documents on the maximum residue limits (MRLs) of pesticides in plant-derived foods.With the deepening of research, more and more evidence has shown that plant-derived foods are also suffering from serious contamination by veterinary drugs [5][6][7][8][9][10][11][12][13].Previous studies [5][6][7][8][9][10][11][12][13] have shown that veterinary drugs excreted by livestock and poultry are commonly used as fertilizers on farmland, and some plant-derived foods (e.g., maize, potatoes, cucumbers, and lettuce) are easy to absorb from the soil, resulting in a variety of veterinary drugs (e.g., tetracycline, quinolones, and sulfonamides) accumulating in the foods, with a total concentration up to several mg/kg levels [8,[14][15][16][17].Due to the lack of risk evaluation standards for veterinary drugs in plant-derived foods at home and abroad, it is difficult to directly judge whether the concentration of veterinary drugs will cause adverse effects.Referring to the regulatory documents of MRLs of veterinary drugs in animal-derived foods [18,19], the concentration of 10 µg/kg was deemed the safety threshold for most veterinary drugs, and it is inferred that the reported concentration level of veterinary drugs in some plant-derived foods is likely to cause serious food safety incidents.For example, some organophosphorus pesticides can damage the nervous system, while sulfonamide antibiotics can cause allergic reactions after entering the human body.To sum up, resolving the food safety issue caused by the co-existence of pesticides and veterinary drugs (P&VDs) is urgent, and developing an effective screening method is a feasible way to prevent the occurrence of potential contamination risks.
Traditional methods to screen contaminants in foods are mostly based on the establishment of a set of databases containing a variety of P&VDs [20][21][22].These methods have proved to be successful in discerning contaminants in the database, but they are helpless for contaminant identification outside the database, which casts a shadow on food safety.For example, some infamous food safety incidents took place in the 21st century, including melamine milk powder and fipronil eggs, which arose from some enterprises or individuals illegally adding unexpected contaminants not in the routine test scope of foods.Even if the concentrations of these contaminants were highly above the red line of food safety, these defective products were also considered to be qualified.As a result, they have gradually developed into extremely serious food safety incidents.It can be seen that the constant establishment of new databases of contaminants to screen unexpected P&VDs in plant-derived foods is a passive way that consumes a lot of human and material resources and is also unable to meet the increasing demands for the detection of unexpected P&VDs in foods.Therefore, it has become a popular trend to develop highly efficient non-targeted screening methods for contaminants, as the European Commission NORMAN network proposed [23,24].
Metabolomics has shown great promise in screening unexpected contaminants in the food safety field due to its advantages in handling massive data with complicated characteristics, such as small sample sizes, mass interferents, and high noise [25][26][27][28][29]. Marker compounds on behalf of unexpected contaminants in foods were screened and identified by in-house or network databases (e.g., Massbank, SciFinder, ChemSpider, PubChem, and Metlin) during metabolomics analysis in order to achieve non-targeted screening of contaminants [28,29].
As we know, metabolomics data are composed of fewer samples but massive features (i.e., variables).How to extract meaningful information and find the eligible variables on behalf of marker compounds is a primary issue for researchers to resolve.Generally speaking, the distinguishable factors for metabolomics data between any two groups (e.g., experiment vs. control) are a combination of variables rather than a single one.Therefore, several chemometric approaches, such as principal component analysis (PCA) [30][31][32], orthogonal partial least squares discriminant analysis (OPLS-DA) [30][31][32], random forest (RF) [33,34], and support vector machine (SVM) [33,35,36], have widely been used to deal with these metabolomics data.Among all these modeling methods, SVM is gaining popularity in a wide variety of metabolomics studies due to its prediction performance.SVM is known to have excellent generalization ability and can be applied to nonlinear cases with the assistance of kernels relative to PCA and OPLS-DA, with only an assumption of linearity.SVM recursive feature elimination (SVMRFE) is a wrapper approach that adopts the rule of weight (w) to rank the variables and works well when the number of samples is small but the number of variables is large.Due to this feature of SVMRFE, which is similar to that of metabolomics data, SVMRFE is often used in metabolomics studies [37][38][39][40].In addition, SVMRFE has shown better predictability than OPLS-DA.To date, SVMRFE-involved metabolomics studies are mainly applied to seek eligible biomarkers for disease diagnosis, but they remain insufficient to complete the non-targeted screening of contaminants in foods.In this work, we try to apply SVM to metabolomics analysis to better screen and identify marker compounds on behalf of 124 multi-class P&VDs in maize so as to realize non-targeted screening of residual contaminants in plant-derived foods.

Data Preprocessing
The total ion chromatograms containing 124 P&VDs in three concentration groups are shown in Figure 1.The matrix complexity of maize may cause the recoveries of each contaminant in different samples to be significantly discrepant.As a result, no evident correlation between peak intensities and concentration levels was observed, as shown in Figure 1.To eliminate the peak intensity errors of variables arising from different recoveries of P&VDs during the pretreatment process, enrofloxacin-d5 (parent ion m/z 365.21092; fragment ions m/z 348.19692, 322.21804, and 246.11012; retention time 7.01 min) and atrazine-d5 (parent ion m/z 221.14017; fragment ions m/z 179.08602, 137.06393, and 101.08703; retention time 5.77 min) were jointly spiked for recovery calibration.Blank maize extract solutions were employed to prepare the standard curves (5, 10, 25, 50, and 100 ng/mL) of enrofloxacin-d5 and atrazine-d5 to calculate the recoveries of these two deuterated compounds in maize samples, with the recoveries to be 75.6%~93.1%,72.9%~93.4%,and 74.9%~93.6%for enrofloxacin-d5 and 73.6%~95.2%,81.2%~94.7%,and 72.9%~96.4% for atrazine-d5 in the 20, 50, and 100 ng/mL groups, respectively (Table S1, Supplementary Materials).multi-class P&VDs in maize so as to realize non-targeted screening of residual contaminants in plant-derived foods.

Data Preprocessing
The total ion chromatograms containing 124 P&VDs in three concentration groups are shown in Figure 1.The matrix complexity of maize may cause the recoveries of each contaminant in different samples to be significantly discrepant.As a result, no evident correlation between peak intensities and concentration levels was observed, as shown in Figure 1.To eliminate the peak intensity errors of variables arising from different recoveries of P&VDs during the pretreatment process, enrofloxacin-d5 (parent ion m/z 365.21092; fragment ions m/z 348.19692, 322.21804, and 246.11012; retention time 7.01 min) and atrazine-d5 (parent ion m/z 221.14017; fragment ions m/z 179.08602, 137.06393, and 101.08703; retention time 5.77 min) were jointly spiked for recovery calibration.Blank maize extract solutions were employed to prepare the standard curves (5, 10, 25, 50, and 100 ng/mL) of enrofloxacin-d5 and atrazine-d5 to calculate the recoveries of these two deuterated compounds in maize samples, with the recoveries to be 75.6%~93.1%,72.9%~93.4%,and 74.9%~93.6%for enrofloxacin-d5 and 73.6%~95.2%,81.2%~94.7%,and 72.9%~96.4% for atrazine-d5 in the 20, 50, and 100 ng/mL groups, respectively (Table S1, Supplementary Materials).For each sample, we adopted the formula of 2 × 100%/((100%/recovery of enrofloxacin-d5) + (100%/recovery of atrazine-d5)) to calculate the final recovery of internal standards (Table S1), whose average value from the same concentration group needs to multiply a calibration coefficient to obtain 100% recovery.Peak intensities for internal standards and all other variables were also calibrated after multiplying the coefficient.After this, eligible variables with a relative standard deviation of peak intensity less than 30% in QC and three concentration groups were obtained to establish a 1447 × 39 data matrix for further analysis [41].

PCA Results
As indicated in Figure 2, all samples showed their credibility within a 95% confidence level.QC group samples have a close gathering, implying that the high quality of the data was available and deserved further analysis [42].Nine samples from the same concentration group clustered closely, indicating the good intra-group similarity of these samples.Inter-group samples separated obviously on the first principal component axis, For each sample, we adopted the formula of 2 × 100%/((100%/recovery of enrofloxacin-d5) + (100%/recovery of atrazine-d5)) to calculate the final recovery of internal standards (Table S1), whose average value from the same concentration group needs to multiply a calibration coefficient to obtain 100% recovery.Peak intensities for internal standards and all other variables were also calibrated after multiplying the coefficient.After this, eligible variables with a relative standard deviation of peak intensity less than 30% in QC and three concentration groups were obtained to establish a 1447 × 39 data matrix for further analysis [41].

Multivariate Analysis 2.2.1. PCA Results
As indicated in Figure 2, all samples showed their credibility within a 95% confidence level.QC group samples have a close gathering, implying that the high quality of the data was available and deserved further analysis [42].Nine samples from the same concentration group clustered closely, indicating the good intra-group similarity of these samples.Inter-group samples separated obviously on the first principal component axis, implying distinguishable differences in variables existed among these groups, which provided the possibility to hunt for marker compounds among concentration groups.implying distinguishable differences in variables existed among these groups, which provided the possibility to hunt for marker compounds among concentration groups.

Cluster Analysis Results
Cluster analysis can directly describe the similarity level of samples, meaning that samples with the highest similarity gather together preferentially, level by level, until all the samples finish their aggregation.As observed in Figure S1 (Supplementary Materials), the samples from the same concentration group were obviously separated from others, showing their high intra-group congeniality.The evident inter-group separation can be attributed to the differences in peak intensities of variables.The findings of cluster analysis were consistent with those of PCA, and both supported the existence of variables with significant differences.

OPLS-DA Results
As we know, OPLS-DA is a two-class classification model; the size imbalance of two classes may introduce a bias in the computation of the decision rule, which can result in the unreliability of the OPLS-DA model [43].To resolve this issue, we adopted the synthetic minority over-sampling technique (SMOTE) [44] to increase the sample size of the class with fewer samples.Herein, when the 20 ng/mL concentration group was designed as Class 1 (only 9 samples), and the 50 and 100 ng/mL concentration groups as a whole (18 samples in total) were designated Class 2, it needed to increase the size of Class 1 from 9 to 18 samples (Figure 3a).Similarly, the 100 ng/mL concentration group in Class 1 also needed to increase the class size to match that of the 20 and 50 ng/mL concentration groups in Class 2 (Figure 3b).As indicated in Figure 3, the obvious separation between Class 1 and Class 2 on the first principal component axis in OPLS-DA score plots was observed, implying that the differences in variables were significant between the two classes.R 2 Y and Q 2 are two key parameters in the OPLS-DA model to evaluate the interpretability along the Y-axis and the predictability of the model, respectively [45,46].When R 2 Y and Q 2 values are close to 1, it means that the OPLS-DA model has high reliability and predictability.In Figure 3, the R 2 Y and Q 2 values were both equal to 1, which favored the robustness of these OPLS-DA models.

Cluster Analysis Results
Cluster analysis can directly describe the similarity level of samples, meaning that samples with the highest similarity gather together preferentially, level by level, until all the samples finish their aggregation.As observed in Figure S1 (Supplementary Materials), the samples from the same concentration group were obviously separated from others, showing their high intra-group congeniality.The evident inter-group separation can be attributed to the differences in peak intensities of variables.The findings of cluster analysis were consistent with those of PCA, and both supported the existence of variables with significant differences.

OPLS-DA Results
As we know, OPLS-DA is a two-class classification model; the size imbalance of two classes may introduce a bias in the computation of the decision rule, which can result in the unreliability of the OPLS-DA model [43].To resolve this issue, we adopted the synthetic minority over-sampling technique (SMOTE) [44] to increase the sample size of the class with fewer samples.Herein, when the 20 ng/mL concentration group was designed as Class 1 (only 9 samples), and the 50 and 100 ng/mL concentration groups as a whole (18 samples in total) were designated Class 2, it needed to increase the size of Class 1 from 9 to 18 samples (Figure 3a).Similarly, the 100 ng/mL concentration group in Class 1 also needed to increase the class size to match that of the 20 and 50 ng/mL concentration groups in Class 2 (Figure 3b).As indicated in Figure 3, the obvious separation between Class 1 and Class 2 on the first principal component axis in OPLS-DA score plots was observed, implying that the differences in variables were significant between the two classes.R 2 Y and Q 2 are two key parameters in the OPLS-DA model to evaluate the interpretability along the Y-axis and the predictability of the model, respectively [45,46].When R 2 Y and Q 2 values are close to 1, it means that the OPLS-DA model has high reliability and predictability.In Figure 3, the R 2 Y and Q 2 values were both equal to 1, which favored the robustness of these OPLS-DA models.
A series of data points to represent the variables were shown in the S-plot plots (Figure 4), in which the points near the two tips of the 'S' plots signified the larger contribution and higher confidence levels of the corresponding variables in differentiating two classes of the OPLS-DA model.These variables were qualified to be marker compound candidates.Herein, marker compounds in the significantly low (20 ng/mL) and high (100 ng/mL) concentration groups should be sought at the right end of Figure 4a   A series of data points to represent the variables were shown in the S-plot plots (Figure 4), in which the points near the two tips of the 'S' plots signified the larger contribution and higher confidence levels of the corresponding variables in differentiating two classes of the OPLS-DA model.These variables were qualified to be marker compound candidates.Herein, marker compounds in the significantly low (20 ng/mL) and high (100 ng/mL) concentration groups should be sought at the right end of Figure 4a and the left end of Figure 4b, respectively.
As we know, the Y-axis in the permutation test plot is on behalf of R 2 Y and Q 2 Y for the model, and the X-axis represents the correlation coefficient between original and permuted response data.It is desirable to summarize the results of the permutation test in a quantitative manner.One way to do this is to conduct conventional regression analysis on the two sets of points, i.e., one regression line is fitted among the R 2 Y points (green circles) and another line among the Q 2 Y points (blue squares), as shown in Figure 5.The intercepts of the resulting regression lines are interpretable as measures of 'background' R 2 Y and Q 2 Y obtained by fit to random data.Experience shows that the R 2 Yintercept should not exceed 0.3 ~ 0.4 and that the O 2 Y-intercept should not exceed 0.05.Intercepts below these limits indicate valid models.A 200-iteration permutation test was a common method to evaluate whether the OPLS-DA model underwent over-fitting by the significance test (p-value) of Q 2 Y metrics [47].If p < 0.05, it meant that the OPLS-DA models were free of over-fitting, as reflected by the Q 2 Y-intercept being less than 0.05 [47].Figure 5 clearly points out that all OPLS-DA models were immune to over-fitting under the principles mentioned above.two classes of the OPLS-DA model.These variables were qualified to be marker compound candidates.Herein, marker compounds in the significantly low (20 ng/mL) and high (100 ng/mL) concentration groups should be sought at the right end of Figure 4a and the left end of Figure 4b, respectively.
As we know, the Y-axis in the permutation test plot is on behalf of R 2 Y and Q 2 Y for the model, and the X-axis represents the correlation coefficient between original and permuted response data.It is desirable to summarize the results of the permutation test in a quantitative manner.One way to do this is to conduct conventional regression analysis on the two sets of points, i.e., one regression line is fitted among the R 2 Y points (green circles) and another line among the Q 2 Y points (blue squares), as shown in Figure 5.The intercepts of the resulting regression lines are interpretable as measures of 'background' R 2 Y and Q 2 Y obtained by fit to random data.Experience shows that the R 2 Yintercept should not exceed 0.3 ~ 0.4 and that the O 2 Y-intercept should not exceed 0.05.Intercepts below these limits indicate valid models.A 200-iteration permutation test was a common method to evaluate whether the OPLS-DA model underwent over-fitting by the significance test (p-value) of Q 2 Y metrics [47].If p < 0.05, it meant that the OPLS-DA models were free of over-fitting, as reflected by the Q 2 Y-intercept being less than 0.05 [47].Figure 5 clearly points out that all OPLS-DA models were immune to over-fitting under the principles mentioned above.As we know, the Y-axis in the permutation test plot is on behalf of R 2 Y and Q 2 Y for the model, and the X-axis represents the correlation coefficient between original and permuted response data.It is desirable to summarize the results of the permutation test in a quantitative manner.One way to do this is to conduct conventional regression analysis on the two sets of points, i.e., one regression line is fitted among the R 2 Y points (green circles) and another line among the Q 2 Y points (blue squares), as shown in Figure 5.The intercepts of the resulting regression lines are interpretable as measures of 'background' R 2 Y and Q 2 Y obtained by fit to random data.Experience shows that the R 2 Y-intercept should not exceed 0.3~0.4 and that the O 2 Y-intercept should not exceed 0.05.Intercepts below these limits indicate valid models.A 200-iteration permutation test was a common method to evaluate whether the OPLS-DA model underwent over-fitting by the significance test (p-value) of Q 2 Y metrics [47].If p < 0.05, it meant that the OPLS-DA models were free of over-fitting, as reflected by the Q 2 Y-intercept being less than 0.05 [47].Figure 5 clearly points out that all OPLS-DA models were immune to over-fitting under the principles mentioned above.VIP, as a measure of the importance of the variable in the OPLS-DA model, was computed for each extracted variable.Usually, VIP > 1 was considered to be the crucial threshold to pick eligible variables on behalf of marker compounds.In this study, a total of 228 variables with VIP > 1 from 20 concentration groups vs. 50 and 100 concentration groups (20 vs. 50 and 100 in abbreviation applied below) were chosen to be the marker compound candidates, but only 188 variables were eligible from 100 concentration groups VIP, as a measure of the importance of the variable in the OPLS-DA model, was computed for each extracted variable.Usually, VIP > 1 was considered to be the crucial threshold to pick eligible variables on behalf of marker compounds.In this study, a total of 228 variables with VIP > 1 from 20 concentration groups vs. 50 and 100 concentration groups (20 vs. 50 and 100 in abbreviation applied below) were chosen to be the marker compound candidates, but only 188 variables were eligible from 100 concentration groups vs. 20 and 50 concentration groups (100 vs. 20 and 50 in abbreviation applied below).One hundred and thirty-nine variables were observed to be overlapped between two VIP lists.

SVM Results
After running SVM codes, all the classification accuracy of the 10-fold cross-validation was computed to be 100%, in favor of the models with good generalization ability.The best penalty factor (c) and best kernel function parameter (g) from 3-fold cross-validation at 100% accuracy rate were computed to be 0.0039 and 0.0039 for 20 vs. 50 and 100, and 0.0037 and 0.0037 for 100 vs. 20 and 50.Two hundred and twenty-eight variables with VIP > 1 obtained from 20 vs. 50 and 100 were included in the weight table of two hundred and fifty-three variables.Similarly, 188 variables with VIP > 1 corresponded to those in the weight table, which contained an additional 26 variables with VIP < 1 from 100 vs. 20 and 50.From the above, we can see that the overlap ratio of variables obtained between VIP and weight ranking methods was 90.1% (i.e., 228 × 100%/253) from 20 vs. 50 and 100, while this ratio fell to 87.8% (i.e., 188 × 100%/214) from 100 vs. 20 and 50, indicating that the variables sought by the two methods were of high consistency, which laid a solid foundation to accurately seek eligible variables to represent marker compounds.One hundred and fifty-four overlapped variables from the two weight lists mentioned above were obtained for univariate analysis to further confirm the validity of those variables.

Univariate Analysis
Pairwise t-tests [47][48][49] and fold changes of concentration [27,50] have proved to be appropriate for univariate analysis of the metabolomics data.A pairwise t-test focuses on the significant differences of variables in peak intensity between two concentration groups, while a fold change of concentration considers the peak intensity ratio of variables between the high and low concentration groups.In general, the variables with a significance level (p) of pairwise t-tests below 0.05 and fold change (FC) of concentration above 2 are deemed eligible.In this study, p values of 154 variables between 20 and 50, 20 and 100, as well as 50 and 100 ng/mL concentration groups, i.e., p 20vs.50 , p 20vs.100, and p 50vs.100 , were calculated to be lower than 0.05 in a pairwise t-test (Table S2), indicating that the significant inter-group differences of 154 variables indeed existed.In addition, the fold changes of concentration between 50 and 20 (FC 50vs.20 ), as well as 100 and 20 ng/mL groups (FC 100vs.20 ), were also calculated to be 2.333~2.671and 4.555~5.332(Table S3), which were above 2 to further support the eligibility of 154 variables as marker compound candidates.
As shown in Table 1, 120 out of 154 variables were confirmed as marker compounds (124 in total) by the weight ranking method, with a screening rate of over 96% (i.e., 120 × 100%/124), while the VIP method can only find 109 marker compounds with a screening rate of about 88% (i.e., 109 × 100%/124), showing that the weight ranking method expressed better performance in selecting eligible variables to represent marker compounds than the VIP method.Fleroxacin, lincomycin, clindamycin, and mebendazole, as four veterinary drugs, have no variables equal to their identities as marker compounds in weight and VIP lists.To be different with these four contaminants, three pesticides (i.e., pyridaben, hexazinone, and phosmet) and another eight veterinary drugs, including sulfaphenazole, sulfamethoxazole, sulfamethoxypyridazine, lomefloxacin, fenthion, danofloxacin, sparfloxacin, and tilmicosin, played the role of unique marker compounds, which corresponded to the variables with VIP < 1 from 20 vs. 50 and 100, but were also included in the weight lists (Table 1).Thirty-four in one hundred and fifty-four variables not identified as marker compounds in weight lists showed more complicated and erratic LC-MS/MS and metabolomics information, which needed more in-depth studies to discern their origins from a matrix or foreign impurity.A blank maize sample was also investigated by the SVM-assisted metabolomics method with no corresponding variables of 124 P&VDs found, eliminating the inherent (rather than spiked) interference of these contaminants in maize to screen marker compounds and the existence of false positives.

Limits of Detection
According to the method proposed by the US Environmental Protection Agency to calculate the limits of detection (LODs) of 120 P&VDs [51], we first prepared a 2.0 g blank maize sample for the same pretreatment as introduced above to obtain a 1 mL extract solution.In this extract, we spiked a mixed solution (1 µg/mL, 20 µL) containing 120 P&VDs, resulting in a final concentration of 20 ng/mL.This process was repeated to gain seven replicates, which were also subjected to metabolomics analysis to obtain peak intensities of 120 P&VDs.For each P&VD, the concentration level of 20 ng/mL corresponds to the peak intensity average value of seven replicates.Therefore, the concentration (ng/mL) of each P&VD can be calculated by the formula for the respective peak intensity × 20/average peak intensity.As shown in Table 1, the LOD results for 120 P&VDs varied between 0.3 and 1.5 µg/kg.
The metabolomics method has successfully been applied to non-targeted contaminant screening in plant-derived foods such as tea [27], lettuce [29], maize [29], and orange juice [33].In contrast to these studies, which only focused on pesticides or veterinary drugs, our study expands the domain of target compounds to simultaneously screen multi-class P&VDs.This is the first time to prove the introduction of SVM as an effective tool for promoting the screening rate of the metabolomics approach.In addition, method LODs were calculated to be significantly below 10 µg/kg for 120 P&VDs, which even have the upper hand over some targeted multi-residue methods [20][21][22].

Practicability Test
For plant-derived foods, the contamination pathways of pesticides are almost explicit; however, the contamination sources of veterinary drugs have not been fully understood.As reported in previous studies [52][53][54], the application of fertilizer and irrigation water contaminated with veterinary drugs onto farmland is a primary reason to cause the contamination of plant-derived foods; we hypothesized that some plant-derived foods from rural areas exposed to the abuse of veterinary drugs in livestock farming may suffer higher contamination risks.To investigate the practicability of our proposed SVM-assisted metabolomics screening method, we adopted maize samples from Qili and Bali villages as a case study.These two villages in Jinpu New Area, affiliated with Dalian City, are not only crucial maize suppliers but also pivotal providers of fresh livestock and poultry meat, which makes the maize samples highly susceptible to contamination by veterinary drugs.
Maize samples were collected after the maturity period in late September 2023.After metabolomics analysis, four contaminants, including norfloxacin, enrofloxacin, imidacloprid, and carbendazim, were all detected in two villages, with their concentrations being 10.7~17.7 µg/kg (Table S4).The abundance of norfloxacin and enrofloxacin in maize samples can be attributed to their extensive use as medicines and feed additives in animal husbandry [55], while the residual imidacloprid and carbendazim in maize samples with relatively high concentrations can be due to their widespread application to combat aphid [56] and bacterial wilt [56], respectively.As indicated in Table S4, the contamination of pesticides remained predominant over that of veterinary drugs, and imidacloprid presented the highest concentration among the four contaminants in both villages.In addition, Qili village suffered more serious overall contamination from four contaminants than Bali village.The test results supported the practicability of our proposed SVM-assisted metabolomics method in non-targeted screening of contaminants.It is worth noting that all four contaminants identified by the method presented a concentration above 10 µg/kg, which was understandable to show the feasibility of the method.But for those P&VDs with concentrations below 10 µg/kg, it is still unclear how to prove the applicability of the method, which probably led to those contaminants being unable to be detected.That is to say, the actual contamination of maize by P&VDs in the Jinpu New Area can be even worse than what we expected, and it deserves more attention to defuse the underlying risks.Food safety has no borders.In view of some food issues that initially occurred in local areas on a small scale, it is still essential for any country and international organization to come to notice and to strengthen cooperation in the prevention and control of the issues, such as through information sharing, experience exchange, and even the publication of standardized documents to ensure food safety from the source.

Chemicals and Materials
Maize samples were purchased from a local farmer's market in Dalian City, China.Methanol and acetonitrile (HPLC grade) were purchased from Merck Corporation (Darmstadt, Germany).Formic acid and acetic acid (HPLC grade) were provided by China Shanghai ANPEL Laboratory Technologies Inc. (Shanghai, China).Ammonium formate, anhydrous sodium sulfate (Na 2 SO 4 ), anhydrous sodium acetate (NaAC), and primary secondary amine (PSA) sorbent were obtained from China Sinopharm Chemical Reagent Co., Ltd.(Shanghai, China).Filter membrane (0.22 µm) was purchased from Agilent Technologies (Santa Clara, CA, USA).Ultrapure water was produced by the Milli-Q ultrapure water system from Merck Corporation (Darmstadt, Germany).Enrofloxacin-d5 and atrazine-d5 (100 µg/mL in methanol) for recovery calibration were provided by First Standard (Worcester, MA, USA).Sixty-nine pesticides and fifty-five veterinary drugs were purchased from the corporations First Standard (Worcester, MA, USA), Sigma (St. Louis, MO, USA), Dr. Ehrenstorfer (Wesel, Germany), and TRC (Montréal, QC, Canada), with a purity greater than 98%.The detailed information on these P&VDs is shown in Table 2.

Solution Preparation
A total of 124 P&VDs was separately prepared in methanol with a concentration of 100 µg/mL as the stock solution, 1 mL of which was taken, mixed together, and diluted with methanol to prepare a 1 µg/mL working solution.A 100 ng/mL mixed working solution of enrofloxacin-d5 and atrazine-d5 was prepared by diluting their 100 µg/mL stock solution with methanol.

Sample Preparation and Pretreatment Process
(a) The maize sample was ground into powder by a grinder; (b) 2.0, 5.0, and 10.0 g of maize powder, together with corresponding 20, 50, and 100 µL of 124 P&VDs mixed solutions (1 µg/mL), were poured into 50 mL polypropylene centrifuge tubes, respectively.To calibrate the recovery during the sample pretreatment process, a mixed solution (0.5 mL, 100 ng/mL) of enrofloxacin-d5 and atrazine-d5 was further added as recovery internal standards; (c) a total of 20 mL of acetonitrile/water (80/20, v/v) with 1% acetic acid was dumped into the tube by vortex and ultrasonic extraction for 1 and 30 min, respectively.A total of 6.0 g anhydrous Na 2 SO 4 and 1.5 g NaAC were added, shocking violently for 1 min and centrifuging at 6000 r/min for 5 min; (d) all the supernatant solutions were taken into a new tube containing 2.0 g anhydrous Na 2 SO 4 and 0.3 g PSA, centrifuging at 6000 r/min for 2 min; (e) all the solutions were extracted, dried with nitrogen gas flow, and redissolved with 1 mL 40% (v/v) methanol-4 mmol/L ammonium formate buffer solution, vortexing for 1 min; (f) filtered with a 0.22 µm filter membrane, the sample solutions of 124 P&VDs at the theoretical concentrations of 20, 50, and 100 ng/mL were prepared.Each concentration experiment was completed in nonuplicate to obtain replicate samples.

Sample Grouping and Naming
Twenty-seven samples named 20 ng/mL−1~20 ng/mL−9, 50 ng/mL−1~50 ng/mL−9, and 100 ng/mL−1~100 ng/mL−9 were obtained from three concentration groups.A quality control (QC) sample was prepared by mixing 30 µL of each sample from the abovementioned concentration groups [27,57,58] and underwent three repeated injections into the LC-MS/MS system before and after the injection of each concentration group.Twelve injections of QC samples were named after QC-1, QC-2, . .., and QC-12 to evaluate the stability of LC-MS/MS.

Analytical Method
A quadrupole/electrostatic field orbitrap LC-MS/MS system (Q Exactive Plus, Thermo Fisher Scientific Inc., Waltham, MA, USA) equipped with a heatable electrospray ion (ESI) source was used to analyze all pesticides, veterinary drugs, enrofloxacin-d5, and atrazine-d5.The ESI-positive mode was adopted to obtain M+H adducts for metabolomics analysis.An Accucore RP-MS column (100 mm × 2.1 mm, 2.6 µm particle diameter) purchased from Thermo Fisher Corporation (Waltham, MA, USA) was employed to separate components after an injection of 10 µL sample solution.The mobile phases were composed of 0.1% (v/v) formic acid and ammonium formate at 4 mmol/L in water (eluent A) and 0.1% (v/v) formic acid and ammonium formate at 4 mmol/L in methanol (eluent B).The flow rate was kept at 0.3 mL/min.The elution program was as follows: 100% A (0 min), 100% A (2 min), 0% A (8 min), 0% A (13 min), and 100% A (14 min) until the end of the run.The oven temperature was kept at 40 • C. The Q Exactive Plus parameter settings include (a) heating and capillary temperature at 320 • C; (b) sheath and auxiliary gas from N 2 , with flow rates of 40 and 10 arb, respectively; (c) lens and spray voltage at 50 and 3200 V, respectively; (d) scan mode to be full-scan/data-dependent two-stage scanning; (e) MS parameters to be full-scan resolution at 70,000, AGC target at 1 × 10 6 , maximum dwell time at 100 ms, and scan range of m/z between 100 and 1000; (f) MS/MS parameters to be resolution at 17,500, AGC target at 2 × 10 5 , and maximum dwell time at 50 ms.
Trace Finder software (Version 3.3) installed on the LC-MS/MS system was employed for residue analysis, with specific screening conditions as follows: (a) for primary parent ions, the response intensity threshold at 10,000, the ratio of signal to noise at 5.0, and mass error at 5 ppm; (b) for secondary fragment ions, the minimum number of matching ions at 1, the response intensity threshold at 10,000, and mass error at 5 ppm.The quantification of enrofloxacin-d5 and atrazine-d5 was finished by standard curves on the basis of the peak areas of primary parent ions for recovery calibration.

Metabolomics Data Processing
LC-MS/MS output files cannot be directly employed for metabolomics analysis and need to be converted from .RAW format to .mzXMLformat [59].The new-formatted files were uploaded onto the Workflow4Metabolomics (W4M) platform (https://workflow4 metabolomics.usegalaxy.fr/,accessed on 19 October 2022) for further analysis [60].Chromatographic peak detection, alignment, and retention time calibration were performed on the platform, with further operations including normalization, centralizing, scaling, and data transformation of peak intensity to acquire the data matrix, in which variable and sample names were designated abscissa and ordinate, respectively [60,61].PCA [62][63][64] and OPLS-DA [65,66] embedded in SIMCA software (Version 14.1) [67], together with cluster analysis in Heml software (Version 1.0.3.7), were performed for multivariate analysis of the data matrix.Over-fitting of the OPLS-DA model was evaluated by permutation tests [67,68].Variable importance in projection (VIP) > 1 is the threshold to screen variables as marker compound candidates [67][68][69].Under this principle, eligible variables from three concentration groups were included in the VIP lists.
SVM can perform nonlinear classification by mapping the input data into a highdimensional feature space, the process of which is known to be the kernel trick [70].Many kernel functions are available, but the most commonly used one is the radial basis function (RBF).Therefore, we chose SVMRFE under RBF to run the program in MATLAB software (Version R2019b).A 10-fold cross-validation was performed by randomly dividing the dataset into 10 subsets and computing the mean value of 10 accuracy values.This process was repeated 10 times, and the mean accuracy was obtained as the final classification accuracy.Kernel parameters were chosen through a 3-fold cross-validation approach, which divided the training dataset randomly into three subsets.The classifier was trained on either of the two subsets and tested on the third one.A set of parameters that provided the best cross-validation accuracy was employed for further analysis using the LIBSVM interface [71].The kernel parameters with the best cross-validation accuracy were chosen to perform the actual classification.Under the conditions described above, weight values (w) and weight squared values (w 2 ) of variables were obtained, with the latter ones ranging from large to small to form a variable weight table for further analysis.The MATLAB codes used to generate the results were provided in the Supplementary Materials.
All VIP > 1 variables filled their corresponding vacancies in the variable weight table.For those variables, their sequence in the weight table was sequential, implying that two methods, including VIP and weight ranking, to seek eligible variables have completely consistent findings.But for those variables, their VIP > 1 sequence in the weight table was not consecutive, implying some variables with VIP < 1 were also listed in the weight table.As mentioned above, SVMRFE has higher predictability than OPLS-DA in general, i.e., the weight ranking method is more reliable than the VIP method in picking eligible variables.Therefore, we still kept those variables with VIP < 1 in the weight table to prevent the possible loss of some valid variables during VIP computation for further analysis.Overlapped variables in 20 and 100 ng/mL groups, on behalf of their significantly low and high concentrations, underwent further verification by pairwise t-test [47][48][49] performed in SPSS Statistics software (Version 17.0) and fold change calculation of concentration during univariate analysis.Finally, eligible variables were confirmed to represent the marker compounds by comparing the retention time and precise molecular weight (absolute value of error less than 5 ppm) of 124 P&VDs (Table 2).

Conclusions
This study developed an SVM-assisted metabolomics method to screen the marker compounds of 124 P&VDs in maize samples.One hundred and twenty out of one hundred and twenty-four P&VDs can be identified by our proposed method, while only one hundred and nine P&VDs can be found by the metabolomics method alone, implying that SVM can promote the screening accuracy of the metabolomics method.It is the first time to apply the SVM-assisted metabolomics method to non-targeted screening of P&VDs in plant-derived foods.Method LODs were calculated for 120 P&VDs, with values significantly below 10 µg/kg for all of them, which were favorably comparable with a targeted multi-residue method.Our approach, developed on a simple and self-designed case, has been validated on a more complicated and realistic condition.This study promoted insight into the development of truly non-targeted screening approaches on the basis of metabolomics (particularly LC-MS/MS and chemometrics) for food safety assessment, which may be in need of non-targeted methods as a complement to targeted methods in view of screening potentially contaminated food products in the near future.Since our proposed method was rather generic, it was applicable with only a few modifications to any other LC-MS/MS dataset or even to other fields like authenticity or origin issues to complement existing methods.In conclusion, there are reasons to believe that these findings will encourage more breakthroughs in analytical issues, both in the methods and the tools.

Figure 1 .
Figure 1.Total ion chromatograms of spiked maize sample groups on the W4M platform.

Figure 1 .
Figure 1.Total ion chromatograms of spiked maize sample groups on the W4M platform.

Figure 2 .
Figure 2. PCA score plot of spiked maize sample groups.

Figure 2 .
Figure 2. PCA score plot of spiked maize sample groups.
and the left end of Figure4b, respectively.

Table 1 .
Marker compounds were screened in maize sample groups.

Table 1 .
Cont.Note: a two VIP values from 20 vs. 50 and 100 and 100 vs. 20 and 50, respectively; b two weight squared values (×10 5 ) from 20 vs. 50 and 100 and 100 vs. 20 and 50, respectively; c m/z represents extracted molecular weight from W4M platform; d t represents retention time from W4M platform; e Mass error (ppm) = (extracted molecular weight from W4M platform − extracted molecular weight from LC-MS/MS) × 10 6 /extracted molecular weight from LC-MS/MS.