Combined Experimental and Multivariate Model Approaches for Glycoalkaloid Quantification in Tomatoes

The intake of tomato glycoalkaloids can exert beneficial effects on human health. For this reason, methods for the rapid quantification of these compounds are required. Most of the methods for α-tomatine and dehydrotomatine quantification are based on chromatographic techniques. However, these techniques require complex and time-consuming sample pre-treatments. In this work, HPLC-ESI-QqQ-MS/MS was used as the reference method. Subsequently, multiple linear regression (MLR) and partial least squares regression (PLSR) were employed to create two calibration models for the prediction of the tomatine content from thermogravimetric (TGA) and attenuated total reflectance (ATR) infrared spectroscopy (IR) analyses. These two fast techniques proved to be suitable and effective in alkaloid quantification (R² = 0.998 and 0.840, respectively), achieving low errors (0.11 and 0.27%, respectively) relative to the reference technique.


Introduction
Fruit ripening is a complex process that causes considerable changes in color, texture and flavor. Additionally, the chemical composition of fruits is affected during this process, including conversion of starch to sugars, biosynthesis and accumulation of pigments and aromatic volatiles, as well as modification of cell wall ultrastructure [1].
In tomatoes, a significant decrease in steroidal glycoalkaloid content (i.e., α-tomatine and dehydrotomatine) occurs as a function of the ripening process. These components are not useful for plant growth but play a significant role in defense mechanisms against pathogens. According to their specific function, the glycoalkaloid concentration is highest in stems and leaves, during the first stages of plant growth. In the fruits, the glycoalkaloid content decreases as a function of the ripening process and at the same time, the color changes from green to red [2][3][4][5]. It is well known that the tomato glycoalkaloids are biosynthesized and then degraded during fruit ripening. In particular, in tomato fruits, the tomatine content decreases as the fruits grow and it is completely degraded when the fruits turn red [4,6,7].
Most of the studies on natural glycoalkaloid quantification are based on chromatographic techniques [17], including the quantification of tomatine [6,7,10]. Although high-performance liquid chromatography (HPLC) remains the gold standard technique for

HPLC-ESI-QqQ-MS/MS Glycoalkaloid Determination in Different Industrial Tomato Varieties at Different Vine-Ripe Stages
The present study reports the quantification of α-tomatine and dehydrotomatine in eight industrial tomato varieties, carried out via an HPLC-ESI-QqQ-MS/MS protocol on hydroalcoholic acidic extracts of lyophilized samples. The results are summarized in Table 1. In agreement with other studies [4,6,7,27] on the same variety, a higher content of glycoalkaloids was found in green tomatoes. The turning stage represents the ripening phase at which the most significant changes occur. Indeed, up to a 77% decrease of the two glycoalkaloids was observed. A further decrease of between 24 and 77% was observed at the pink ripening stage. Only the H7204 red ripened tomato variety showed a very low α-tomatine content, corresponding to 1.3% of the α-tomatine found in the corresponding green tomatoes, while dehydrotomatine was detected only in trace amounts.
The pseudo-exponential decrease in tomatine (the sum of α-tomatine and dehydrotomatine) as a function of the ripening stage is strictly related to the variety, and the equations that describe the decreasing pattern are reported in Figure 1.
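As an illustration of how a pseudo-exponential decrease of this kind could be described, the following minimal sketch fits C(s) = C0·exp(−k·s) by linear regression of ln(C) on the stage index s. The functional form, the integer coding of the ripening stages and the numerical values are assumptions made for the example; they are not the equations of Figure 1 nor data from Table 1.

```python
import math


def fit_exponential_decay(stages, contents):
    """Fit C(s) = C0 * exp(-k * s) by linear regression of ln(C) on s."""
    n = len(stages)
    logs = [math.log(c) for c in contents]
    s_mean = sum(stages) / n
    l_mean = sum(logs) / n
    slope = sum((s - s_mean) * (l - l_mean) for s, l in zip(stages, logs)) / sum(
        (s - s_mean) ** 2 for s in stages
    )
    intercept = l_mean - slope * s_mean
    return math.exp(intercept), -slope


# HYPOTHETICAL example: stages coded 0 (green), 1 (turning), 2 (pink) and
# synthetic contents generated from C0 = 1000 mg/kg DW and k = 0.8.
stages = [0, 1, 2]
contents = [1000.0 * math.exp(-0.8 * s) for s in stages]

c0_fit, k_fit = fit_exponential_decay(stages, contents)
print(f"C0 = {c0_fit:.1f} mg/kg DW, k = {k_fit:.3f} per stage")
```

A variety-specific decay constant k obtained in this way would give a single number with which to compare how quickly tomatine declines in each variety.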
Nevertheless, more significant differences were found among varieties. In particular, the H1301 and H3406 varieties were the richest in α-tomatine and dehydrotomatine (Table 1), whereas the lowest values were found in the H5108 and Lyco1 varieties. The α-tomatine content at the green ripening stage is strongly influenced by the varietal factor, covering a wide range of concentrations (from 1772 ± 33 to 552 ± 45 mg/kg dry weight, DW). The difference in the α-tomatine content among the tomato varieties became less significant proceeding through the ripening stages, from the green to the pink ripening stage. In the latter, the range of the α-tomatine content was rather narrow, from 238 ± 17 to 120 ± 11 mg/kg DW. Interestingly, the rate of α-tomatine degradation during fruit ripening also differed among the varieties, underlining that this process is mainly influenced by the intrinsic genetic differences linked to the variety of the plants.

Table 1. α-Tomatine and dehydrotomatine contents in vine-ripened industrial tomato varieties. The values are expressed as mg/kg DW and mean ± SD (standard deviation; n = 9). The content of tomatine (sum of the two glycoalkaloids) is also reported.


Thermogravimetric Analysis (TGA)
Thermogravimetry has already been applied to analyze and quantify vegetal compounds in complex matrices [28,29]. The weight losses of each freeze-dried sample in the ranges 120-200 and 200-400 °C are summarized in Table 2. The weight loss in the range 120-200 °C increases from 16 ± 1% for tomatoes at the green stage to 18.7 ± 0.5% and 21 ± 1% for tomatoes at the turning and pink stages, respectively. Concerning the decomposition between 200 and 400 °C, a clear trend can be observed in the analyzed samples: moving from the green to the pink stage, the weight loss decreases. Indeed, a mean weight loss of 38 ± 1% is found at the green stage, which decreases to 36 ± 1% at the turning stage and reaches 33.7 ± 0.4% at the pink stage.
In Figure 2, a comparison between thermograms of H7204 variety samples at different ripening stages is depicted, showing the peculiar increase in weight loss between 120 and 200 °C. In this temperature range, the degradation of volatiles occurs [30]. The increase in weight loss observed in this region of the thermogram could be explained by the accumulation of flavor and aromatic compounds as a consequence of the ripening process. Concurrently, a decrease in weight loss between 200 and 400 °C is observed. At these temperatures, macromolecules such as cellulose, hemicellulose, pectin and starch are thermally degraded [30,31]. During the ripening process, many hydrolytic enzymes are involved in cell wall metabolism, causing the softening of fleshy fruits. At the same time, the depolymerization of many polysaccharides to sugars occurs. Consequently, the content of these macromolecules in tomatoes progressively decreases as a function of the ripening process.
Therefore, a multiple linear regression (MLR) model was proposed to extrapolate the tomatine concentration in lyophilized tomato samples from TGA, taking into account the weight loss in the selected temperature ranges. The MLR analysis gave the equation Y = −0.0472X₁ + 0.0983X₂ (R² = 0.999 and adjusted R² = 0.998). The intercept of the regression model was forced to zero since it was not statistically significant (p > 0.05), while both the coefficients of X₁ and X₂ were significantly different from zero (p << 0.05). The F-test showed that the overall model is significant (p << 0.05), with a root mean square error (RMSE) of 0.11.
The relationship between the tomatine concentrations predicted by the MLR model and the measured ones is reported in Figure 4a. The residual plot of the model (Figure 4b), reporting the autoscaled Y-residuals vs. the predicted Y, gave a random distribution between ± 3 (99% confidence interval) for all the samples, confirming that the MLR model reported here is able to predict tomatine values along the whole range of concentrations employed.
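As a worked illustration of how the fitted equation is applied, the sketch below evaluates Y = −0.0472X₁ + 0.0983X₂ for the mean weight losses reported above for the three ripening stages. Y is the log-transformed tomatine content; the base of the logarithm is not stated in the text, so the back-transformation with base 10 is an assumption, as is the function name.

```python
# Coefficients of the zero-intercept MLR model reported in the text.
B1 = -0.0472  # weight loss (%) between 120 and 200 °C (X1)
B2 = 0.0983   # weight loss (%) between 200 and 400 °C (X2)


def predict_log_tomatine(x1, x2):
    """Return Y, the log-transformed tomatine content predicted from TGA."""
    return B1 * x1 + B2 * x2


# Mean weight losses (%) per ripening stage, taken from the TGA results.
stage_means = {
    "green": (16.0, 38.0),
    "turning": (18.7, 36.0),
    "pink": (21.0, 33.7),
}

predictions = {
    stage: predict_log_tomatine(x1, x2) for stage, (x1, x2) in stage_means.items()
}

for stage, y in predictions.items():
    # ASSUMPTION: base-10 logarithm; the text does not state the base.
    print(f"{stage}: Y = {y:.3f}, 10**Y = {10 ** y:.0f}")
```

The predicted Y decreases from the green to the pink stage, in line with the trend measured by the reference method.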

Attenuated Total Reflection-Fourier Transform Mid-Infrared Spectroscopy (ATR-FT-MIR)
Figure 5a reports the spectrum of a dried tomato sample as an example. None of the spectra deviates from this example except in peak intensity. In the spectra displayed, the characteristic bands of vegetal samples are highlighted and assigned as follows. The band at 3289 cm⁻¹ is characteristic of N-H and O-H stretching vibrations. The region of 2923-2853 cm⁻¹ can be assigned to the symmetric and asymmetric stretching modes of the CH₃ and CH₂ groups. The 1720 cm⁻¹ band corresponds to the stretching of the C=O ester carbonyl or carboxylic acid groups, which are characteristic of fatty acids and polysaccharides. The amide I band at 1652 cm⁻¹ results from the C=O stretching in the amides I, II and III, while the amide II band at 1520 cm⁻¹ originates from the bending vibrations of the N-H groups.
Table 3 summarizes the bands in the IR region from 1800 to 900 cm⁻¹, which includes the "fingerprint" region, containing bands corresponding to the vibrations of the C-O, C-C, C-H and C-N bonds. This region is, on the one hand, very rich in information but, on the other hand, difficult to analyze due to its complexity. It provides important information about the organic compounds present in the sample, such as sugars, alcohols and organic acids, through their molecular vibrations (stretching, bending and torsion of the chemical bonds) in specific infrared regions.
This region was dominated by a broad band centred at 1055 and 1035 cm⁻¹, with evident shoulders at 1145 and 1100 cm⁻¹, due to strong vibrational modes of various carbohydrates and acids, which are abundant in tomatoes. Tomatine shows its main bands in the region between 1100 and 950 cm⁻¹, which, however, are covered by the stronger sugar and polysaccharide absorption and are not useful for quantitative analysis, as the absorbance band at 956 cm⁻¹ corresponds to a trans -CH=CH- out-of-plane bending deformation band that is the unique IR marker band specific to lycopene [32].

Table 3. Main functional groups assigned to ATR-FTIR spectra of tomato [33].

As discussed, the spectra contain a multitude of bands that are characteristic of vegetal samples, so information on a specific compound cannot be obtained without interference from the matrix as a whole. For this reason, we used an ATR-FT-MIR "fingerprint" analytical approach [21,34] for the structural identification of compounds, considering that no two chemical structures will have the same ATR-FT-MIR spectrum. ATR-FT-MIR provides a characteristic signature of the chemical or biochemical substances present that can be used for chemometric studies.

Chemometric Approach
The average spectrum of the 22 samples in the most informative region, located at low wavenumbers (752-1800 cm⁻¹), is reported in Figure 5b. This region was used to build the regression model. The predictive model for the determination of tomatine in extracted tomato samples was obtained by applying partial least squares regression (PLSR) to the 16 samples of the calibration set, after data pre-treatment and column mean centering. The uncertainty test identified 48 significant variables, reducing the number of variables relative to the number of samples. This lowers the risk of finding random correlations and increases the reliability of the regression model [35].
The first two components have been identified as significant for tomatine quantification. In more detail, in the final model, the first three latent variables explain 95% of the Y-variance and 73% of the variance in the X-block, with Pearson's regression coefficient R² = 0.95 and root mean square error of calibration (RMSEC) = 0.11. Figure 6a shows the relationship between the measured values and the predicted ones for the calibration set; all the samples are randomly distributed around the regression line with a negligible dispersion over the whole range of variability. In Figure 6b, the relationship between the autoscaled Y-residuals and the predicted values is shown: all the samples are randomly distributed within ±3, which corresponds to the 99% confidence interval. Therefore, it is possible to confirm that the regression model presents a comparable predictability along the whole concentration interval with a negligible BIAS value. Finally, the model was validated on the test set. The tomatine contents predicted by the PLS regression model were compared with the experimental data to evaluate the actual predictability of the proposed model.
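The performance was evaluated through the joint analysis of the Pearson's regression coefficient, $R^2_{pred}$, and the root mean square error in prediction, RMSEP:

$$\mathrm{RMSEP} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}$$

where $y_i - \hat{y}_i$ is the difference between the measured and predicted value for the i-th sample in the test set and $n$ is the number of samples in the test set.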
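The prediction metrics named above can be computed with a few lines of code. The following is a minimal sketch: R²pred is taken as the squared Pearson correlation between measured and predicted values, and the bias as the mean of the signed differences (measured minus predicted); both conventions are assumptions, since the text does not spell them out, and the numbers are hypothetical rather than data from this study.

```python
import math


def rmsep(measured, predicted):
    """Root mean square error in prediction over the test set."""
    n = len(measured)
    return math.sqrt(sum((y - yh) ** 2 for y, yh in zip(measured, predicted)) / n)


def bias(measured, predicted):
    """Mean signed difference between measured and predicted values."""
    return sum(y - yh for y, yh in zip(measured, predicted)) / len(measured)


def r2_pred(measured, predicted):
    """Squared Pearson correlation between measured and predicted values."""
    n = len(measured)
    my, mp = sum(measured) / n, sum(predicted) / n
    sxy = sum((y - my) * (p - mp) for y, p in zip(measured, predicted))
    sxx = sum((y - my) ** 2 for y in measured)
    syy = sum((p - mp) ** 2 for p in predicted)
    return sxy ** 2 / (sxx * syy)


# HYPOTHETICAL log-tomatine values, not data from this study.
y_meas = [2.0, 2.5, 3.0, 3.5]
y_pred = [2.1, 2.4, 3.2, 3.3]

print(f"RMSEP = {rmsep(y_meas, y_pred):.3f}")
print(f"BIAS  = {bias(y_meas, y_pred):.3f}")
print(f"R2    = {r2_pred(y_meas, y_pred):.3f}")
```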
The quality parameters calculated for the proposed model confirmed its goodness in predicting tomatine concentration from the spectral features of tomato extract, showing a high $R^2_{pred}$ (0.84) and a low RMSEP (0.27), as well as a negligible BIAS value (0.08).
Table 4 reports the comparison between the experimental data obtained via HPLC-ESI-QqQ-MS/MS determination of tomatine and the data predicted through the combined experimental/statistical approaches: TGA/MLR and the ATR-FT-MIR/PLS model. The data have been statistically compared via two-way ANOVA followed by Dunnett's test. The results are not statistically different at the 95% confidence level (Dunnett's test, p >> 0.05). This result underlines that the two alternative semi-quantitative methods, TGA/MLR and ATR-FT-MIR/PLS, which do not require any pre-treatment extraction of the lyophilized and ground tomato material, are reasonable alternatives to the quantitative HPLC-ESI-QqQ-MS/MS determination (strictly dependent on a solid/liquid extraction).
In conclusion, the complex changes occurring during the ripening process produce significant changes in the chemical composition of tomato fruits. TGA and ATR-FT-MIR analyses revealed significant differences in tomatoes at the different ripening stages, probably due to the variation in the macromolecule content as a consequence of the enzymatic reactions occurring during fruit growth and maturation. Through chemometric approaches, these variations were found to correlate with the glycoalkaloid content. The MLR and the PLS regression models, constructed using the values detected by HPLC-ESI-QqQ-MS/MS analyses as a reference, allowed us to accurately quantify tomatine in the tomato samples via TGA and ATR-FT-MIR analyses. These two techniques may represent a valid alternative for the quantification of tomatine in tomatoes, permitting the omission of the pre-treatments required for chromatographic analyses.

Plant Materials
Eight different vine-ripened industrial varieties of tomato fruits, harvested in summer 2017 at different ripening stages (green, turning, pink and red) were analyzed (Figure 7). The samples were classified according to the definitions provided by the California Tomato Commission and United States Department of Agriculture (USDA, [36]): green (the surface of the tomato is completely green in color, the shade of green may vary from light to dark), turning (more than 10%, but not more than 30%, of the total surface shows a definite change in color from green to tannish-yellow, pink, red or a combination thereof), pink (more than 30%, but not more than 60%, of the total surface is pink or red in color).
The samples were washed with deionized water and dried. Six tomatoes of each variety at the different ripening stages were homogenized with a blender (Moulinex, 1000 W) and the homogenates were freeze-dried (VirTis BenchTop Pro lyophilizer, −51 ± 2 °C, 1.3 ± 0.5 mbar) until constant weight (5 days; average sample water content, 92 ± 1%). Lyophilized samples were then ground in a porcelain mortar, sieved to a 500 µm particle size and stored at −20 ± 1 °C before the subsequent analyses.

HPLC-ESI-QqQ-MS/MS Quantification of Glycoalkaloids
Glycoalkaloid extraction and analytical quantification were carried out via HPLC-ESI-QqQ-MS/MS as previously validated and reported [6,7], with slight modifications. Briefly, lyophilized samples were extracted with a hydroalcoholic acidic mixture consisting of EtOH/CH₃COOH 1% (70:30, v/v) and the extraction was ultrasound assisted (two-cycle extraction on the solid phase). The extracts were dried under nitrogen flow with

The ESI-MS conditions were optimized through the direct injection of a tomatine standard solution in MeOH in positive ionization mode, and α-tomatine and dehydrotomatine were quantified in multiple reaction monitoring (MRM) mode (collision energy, −31 V; scan time, 500 ms). The analytical quantification was carried out via an external calibration method (linearity range, 0.5-30 mg/L of tomatine, a mixture of α-tomatine, 85%, and dehydrotomatine, 13%). Tomatidine was used as an internal standard, taking into account that it does not naturally occur as the aglycone in tomatoes [4,6,7]. Calibration curves with R² > 0.990 were used for the quantification. The limit of detection (LOD) and limit of quantification (LOQ) were 0.20/0.05 and 0.50/0.20 mg/L for α-tomatine and dehydrotomatine, respectively. The results were expressed as mg/kg of sample dry weight (DW).
All the samples were extracted in triplicate and each extract was analyzed three times (n = 9), and the results were reported as mean ± standard deviation (SD). The significant differences between means were assessed by analysis of variance (ANOVA) followed by Tukey's post hoc test. All the statistical treatments were run in Microsoft Office Excel 365, implemented with the Real Statistic subroutine, setting the level of significance at p < 0.05.

Thermogravimetric Analysis (TGA)
The analyses were performed using a Q1000 thermogravimeter (TA Instruments), applying a thermal program from 30 to 900 °C with a heating ramp of 10 °C/min under constant nitrogen flow [37]. Three lyophilized aliquots of each variety (15 mg in weight) were analyzed.

Attenuated Total Reflection-Fourier Transform Mid-Infrared Spectroscopy (ATR-FT-MIR)
All the samples were analyzed by FTIR using a Nicolet IS50 FTIR spectrophotometer (Thermo Nicolet Corp., Madison, WI, USA), equipped with a single-reflection germanium ATR crystal (Pike 16154, Pike Technologies, Madison, WI, USA) and a deuterated triglycine sulphate (DTGS) detector. The spectra were acquired (32 scans per sample or background) in the range of 4000-800 cm⁻¹ at a nominal resolution of 4 cm⁻¹. The spectra were corrected using the background spectrum of air. The analysis was carried out at room temperature by spreading a lyophilized sample onto the surface of the ATR crystal. Before each sample was analyzed, the ATR crystal was carefully cleaned with water-wet cellulose tissue and dried using a flow of pure nitrogen gas. The cleaned crystal was checked spectrally to ensure that no residue was retained from the previous sample. The spectrum of every sample was collected three times to check the reproducibility and to allow statistical analysis. It should be noted that the individual spectra of the same tomato did not vary in band position but varied to some extent in absorbance. However, the characteristic signatures remained very similar, so averaged spectra are shown in the results.
The frequency scale was internally calibrated with a helium-neon reference laser to an accuracy of 0.01 cm⁻¹. OMNIC software (OMNIC software system Version 9.8, Thermo Nicolet) was used for spectra acquisition and manipulation [38].

Multiple Linear Regression (MLR)
The multiple linear regression (MLR) analysis was performed in Microsoft Office Excel 365, implemented with the Real Statistic subroutine, regressing the tomatine concentration (Y) determined by HPLC-ESI-QqQ-MS/MS analyses on the weight losses detected by TGA analyses in the temperature ranges 120-200 °C (X₁) and 200-400 °C (X₂). Since the concentration of tomatine ranged over one order of magnitude, the values were converted into logarithmic form to make the data distribution symmetrical. To build the MLR model, the H7204 sample at the red ripening stage was discarded as a leverage outlier, since its tomatine content was too low. In this way, the distribution of tomatine content was fairly symmetrical, allowing us to build a reliable regression model.
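To make the regression step concrete, the sketch below fits a two-predictor model with the intercept forced to zero by solving the 2 × 2 normal equations. It is only an illustration of the technique, not the Excel procedure used in the study: the weight-loss values and concentrations are hypothetical, the function name is invented, and the base-10 logarithm is an assumption because the text does not state the base.

```python
import math


def fit_mlr_no_intercept(x1, x2, y):
    """Least-squares fit of y = b1*x1 + b2*x2 (intercept forced to zero)."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# HYPOTHETICAL weight losses (%) in the two temperature ranges.
x1 = [16.0, 18.7, 21.0, 17.0]
x2 = [38.0, 36.0, 33.7, 37.0]
# Synthetic concentrations generated from known coefficients (-0.05, 0.10).
conc = [10 ** (-0.05 * a + 0.10 * b) for a, b in zip(x1, x2)]

# ASSUMPTION: base-10 log transform of the concentration before fitting.
y = [math.log10(c) for c in conc]

b1, b2 = fit_mlr_no_intercept(x1, x2, y)
print(f"b1 = {b1:.4f}, b2 = {b2:.4f}")
```

Because the synthetic response is generated without noise, the fit recovers the generating coefficients, which is a convenient check that the normal equations are solved correctly.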

Data Processing
The ATR-FT-MIR spectra were imported into UnscramblerX Software (Camo Analytics, Oslo, Norway) for multivariate data analysis. The absorbances of the ATR-FT-MIR spectra at different wavelengths were stored in a data matrix with samples placed in rows and reference values of tomatine placed in the first column. As extensively reported in the literature [39][40][41][42], it is necessary to split the entire database into two subsets to evaluate the performance of the prediction model. The first subset, representing 75% of the available samples, was used to build the regression model, whilst the second subset, consisting of 25% of the total samples, was used to estimate the actual predictability of the model. Therefore, the dataset was divided into a calibration set and a test set using random sample selection in order to avoid a biased model. In addition, an internal cross-validation step was performed using the venetian blind algorithm to find the best modeling settings (i.e., the number of latent variables, [43]). This procedure consisted of building, optimizing and testing models to obtain a reliable prediction of tomatine concentration for future extracted tomato samples.
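The splitting scheme described above can be sketched as follows. This is an illustrative reimplementation rather than the UnscramblerX routine: the function names, the random seed and the number of cross-validation segments (four) are assumptions. With 22 samples, truncating 75% gives the 16 calibration samples mentioned in the results.

```python
import random


def split_calibration_test(n_samples, calibration_fraction=0.75, seed=0):
    """Randomly split sample indices into calibration and test subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_cal = int(n_samples * calibration_fraction)
    return sorted(indices[:n_cal]), sorted(indices[n_cal:])


def venetian_blind_folds(n_samples, n_splits):
    """Yield (train, validation) positions; fold k holds k, k + n_splits, ..."""
    for k in range(n_splits):
        validation = list(range(k, n_samples, n_splits))
        train = [i for i in range(n_samples) if i % n_splits != k]
        yield train, validation


calibration, test = split_calibration_test(22)

# ASSUMPTION: four segments; the text does not report the number used.
folds = list(venetian_blind_folds(len(calibration), n_splits=4))

print(len(calibration), "calibration samples,", len(test), "test samples")
for train, validation in folds:
    print("validation positions:", validation)
```

The fold positions index into the calibration list, so the test samples never take part in choosing the number of latent variables.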

Spectral Pre-Processing
The ATR-FT-MIR spectra of the 22 samples were averaged. The most informative region was located at low wavenumbers, and the regression model was built using the spectral features from 752 cm⁻¹ to 1800 cm⁻¹. The H7204 sample at the red ripening stage was also discarded in this case.
Usually, when a regression model is built from spectral data, it is necessary to pre-treat the input variables to remove useless information resulting from unwanted systematic variations. In the present work, a combination of standard normal variate (SNV) and second derivative was used to pre-treat the spectral data, removing the baseline offset and differences in global signal intensity [44][45][46]. SNV transforms the original data according to the following equation:

$$x_{ij}^{\mathrm{SNV}} = \frac{x_{ij} - \bar{x}_i}{s_i}$$

where $x_{ij}$ represents the absorbance at the j-th wavelength for the i-th sample, and $\bar{x}_i$ and $s_i$ are the average and the standard deviation of the absorbances of the i-th spectrum, respectively. For this reason, SNV transformation is also known as row autoscaling. A second derivative transformation was then applied to the SNV-pre-treated FT-MIR spectra. In more detail, a Savitzky-Golay algorithm was applied using a second-order polynomial and the symmetric kernel option with 7 smoothing points [46].
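The two pre-treatments can be sketched in a few lines. The example below is an assumption-laden illustration, not the UnscramblerX implementation: it uses the sample standard deviation (n − 1) for SNV, the tabulated Savitzky-Golay convolution weights for a second derivative with a 7-point window and a second-order polynomial (unit spacing between points), and it simply drops the three points at each edge of the spectrum.

```python
import statistics

# Savitzky-Golay convolution weights for a second derivative with a
# 7-point symmetric window and a second-order polynomial (unit spacing).
SG_D2_7PT = [5 / 42, 0.0, -3 / 42, -4 / 42, -3 / 42, 0.0, 5 / 42]


def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum (one row)."""
    mean = statistics.fmean(spectrum)
    std = statistics.stdev(spectrum)  # ASSUMPTION: sample std (n - 1)
    return [(x - mean) / std for x in spectrum]


def second_derivative(spectrum):
    """Savitzky-Golay second derivative; 3 points at each edge are dropped."""
    half = len(SG_D2_7PT) // 2
    return [
        sum(w * spectrum[i + k - half] for k, w in enumerate(SG_D2_7PT))
        for i in range(half, len(spectrum) - half)
    ]


def preprocess(spectrum):
    """SNV followed by the second derivative, in the order given in the text."""
    return second_derivative(snv(spectrum))


# Sanity check on a parabola, whose second derivative is 2 everywhere.
parabola = [float(x * x) for x in range(10)]
d2 = second_derivative(parabola)
print([round(v, 6) for v in d2])
```

Checking the filter on a parabola is a quick way to confirm the weights, since a second-order polynomial must be differentiated exactly by a second-order Savitzky-Golay filter.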

Partial Least Squares Regression (PLSR) and Martens' Uncertainty Test
Partial least squares regression (PLSR) was used on the calibration set to obtain the model that was subsequently used to predict the tomatine content in industrial tomato samples. Also in this case, the tomatine concentration values were converted into logarithmic form to make the distribution fairly symmetrical.
A kernel PLS algorithm was used to correlate the logarithm of tomatine concentration with the spectral features transformed as described previously [47]. The best number of latent variables to use in the regression function was estimated by the cross-validation procedure, taking into account the trend of the cross-validation root mean square error (RMSECV) with respect to the number of retained components. In the present work, only three components were necessary to explain the majority of the total X- and Y-variance. Once the optimal number of latent variables was chosen, Martens' uncertainty test [48] was performed in order to remove unimportant information within the spectral data. This powerful tool allowed us to improve the predictability of the model, retaining only the significant variables and giving a more reliable estimate of the prediction error when the model was tested on new samples. Moreover, since a reduced number of spectral variables was used, a simpler model was generated.
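For readers who want to see the mechanics of a single-response PLS model, the sketch below uses the classical NIPALS formulation of PLS1 on mean-centred data. This is a swapped-in, simpler algorithm chosen for readability and is not the kernel PLS implementation used in the study; the function names and the small data set are hypothetical.

```python
def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pls1_fit(X, y, n_components):
    """Fit PLS1 by NIPALS on mean-centred data; returns the model as a dict."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # X residuals
    f = [v - y_mean for v in y]                                # y residuals
    W, P, Q = [], [], []
    for _ in range(n_components):
        # Weight vector: direction in X most covariant with the y residuals.
        w = [_dot([E[i][j] for i in range(n)], f) for j in range(p)]
        norm = _dot(w, w) ** 0.5
        w = [v / norm for v in w]
        t = [_dot(row, w) for row in E]  # scores
        tt = _dot(t, t)
        p_load = [_dot([E[i][j] for i in range(n)], t) / tt for j in range(p)]
        q = _dot(f, t) / tt
        # Deflate X and y by the part explained by this component.
        E = [[E[i][j] - t[i] * p_load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        W.append(w)
        P.append(p_load)
        Q.append(q)
    return {"x_mean": x_mean, "y_mean": y_mean, "W": W, "P": P, "Q": Q}


def pls1_predict(model, x):
    """Predict y for one sample by replaying the deflation steps."""
    e = [v - m for v, m in zip(x, model["x_mean"])]
    y_hat = model["y_mean"]
    for w, p_load, q in zip(model["W"], model["P"], model["Q"]):
        t = _dot(e, w)
        y_hat += q * t
        e = [v - t * pl for v, pl in zip(e, p_load)]
    return y_hat


# HYPOTHETICAL data: two variables and an exact relation y = 2*x1 - x2.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]]
y = [2 * a - b for a, b in X]

model = pls1_fit(X, y, n_components=2)
fitted = [pls1_predict(model, row) for row in X]
```

In practice the number of components would be chosen from the RMSECV trend rather than fixed in advance; here two components are used only because the toy data have two variables.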
In Martens' uncertainty test, the regression coefficients $B_i$ for each cross-validation sub-model, chosen with the venetian blind option, were calculated and the differences from the regression coefficients of the total model, $B_{tot}$, were computed. The sum of the squares of the differences over all sub-models was finally evaluated in order to obtain an expression of the variance of the $B_i$ estimate for a specific wavelength. With a t-test, the significance (confidence level of 95%) of the estimate of $B_i$ was calculated and the resulting regression coefficients were presented with uncertainty limits. Variables whose uncertainty limits did not contain zero were considered significant. This procedure was iteratively repeated until the difference between RMSEC and RMSECV reached a minimum value.
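The variable-selection logic of the uncertainty test can be sketched as follows. This is a simplified illustration under stated assumptions: the jack-knife scaling factor (M − 1)/M, the use of M − 1 degrees of freedom for the critical t value, the function name and the coefficient values are all hypothetical and do not reproduce the software's internal computation.

```python
def martens_uncertainty(b_total, b_submodels, t_crit):
    """Flag variables whose jack-knife uncertainty limits exclude zero.

    b_total     : regression coefficients of the model built on all samples
    b_submodels : one coefficient list per cross-validation sub-model
    t_crit      : two-sided critical t value for the chosen confidence level
    """
    m = len(b_submodels)
    scale = (m - 1) / m  # ASSUMPTION: jack-knife scaling factor
    significant = []
    for j, b in enumerate(b_total):
        variance = scale * sum((b - sub[j]) ** 2 for sub in b_submodels)
        half_width = t_crit * variance ** 0.5
        lower, upper = b - half_width, b + half_width
        significant.append(lower > 0 or upper < 0)
    return significant


# HYPOTHETICAL coefficients for two spectral variables and four sub-models.
b_total = [0.50, 0.02]
b_submodels = [[0.48, 0.10], [0.52, -0.08], [0.49, 0.05], [0.51, -0.03]]

# Two-sided 95% critical t value for 3 degrees of freedom.
flags = martens_uncertainty(b_total, b_submodels, t_crit=3.182)
print(flags)
```

The first variable has a stable coefficient across sub-models and is retained, whereas the second changes sign between sub-models, so its limits span zero and it would be dropped before the model is refitted.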
Funding: This research was partially funded by the Italian Ministry for Economic Development (MISE), financing the project "New industrial biotechnology processes for the recovery and use of bioactive tomatine from tomato by-products", in which Società Agricola Cooperativa Consorzio Casalasco del Pomodoro was the Coordinator and requested the cooperation of the University of Siena as a third party.
Data Availability Statement: Data sharing not applicable.