Rapid Authentication of 100% Italian Durum Wheat Pasta by FT-NIR Spectroscopy Combined with Chemometric Tools

Italy is the country with the largest durum wheat pasta production and consumption. Mandatory labelling indicating the country of origin of the wheat has made consumers more aware of the pasta they buy and is steering their choice towards 100% Italian wheat pasta. This highlights the need both to promote the use of domestic wheat and to develop rapid methods for pasta authentication. A rapid, inexpensive, and easy-to-use method based on infrared spectroscopy was developed and validated for authenticating pasta made with 100% Italian durum wheat. The study was conducted on pasta marketed in Italy and made with durum wheat cultivated in Italy (n = 176 samples) and on pasta made with mixtures of wheat cultivated in Italy and/or abroad (n = 185 samples). Pasta samples were analyzed by Fourier transform-near infrared (FT-NIR) spectroscopy coupled with supervised classification models. The validation-set results (sensitivity of 95%, specificity and accuracy of 94%) obtained using principal component-linear discriminant analysis (PC-LDA) demonstrated the high prediction capability of this method and its suitability for authenticating 100% Italian durum wheat pasta. This outcome is of interest both to Italian pasta producers seeking to authenticate their products and to consumer associations aiming to preserve and promote the typicity of Italian products.


Introduction
Pasta is a key element of the Mediterranean diet providing complex carbohydrates, proteins, vitamins, mineral salts, and dietary fiber to consumers. Pasta produced from durum wheat (Triticum durum Desf.) is of superior cooking quality thanks to several properties, including rheological properties, texture, color, and taste [1]. A quarter of the worldwide pasta output is produced in Italy, which has the highest annual production of about 3.4 million tons/year, accounting for 6% of total Italian food industry production [2]. Italy is also the largest pasta consumer (25.3 kg per capita/year), meaning that pasta represents a strategic product in the Italian agro-food industry [2]. Over the years, the consumption of pasta has spread to other countries, thus constantly increasing the growth of the global pasta market, and pasta composition has been influenced by regional tastes and preferences.
Pasta is considered an indisputable symbol of "Made in Italy" products over the world, thanks to its nutritional value, high digestibility, good shelf life, availability through the market in many shapes, and low cost, representing a good attraction for the consumer. Italian pasta companies commonly use

FT-NIR Spectroscopy Analysis and Multivariate Statistical Analysis
FT-NIR spectroscopy analysis of ground pasta samples (approximately 30 g) was carried out according to De Girolamo et al. [23] using a Nicolet iS50 FT-IR spectrometer (Thermo Fisher Scientific Inc., Madison, WI, USA). Spectra were recorded by using 32 interferometer sub-scans and a resolution of 4 cm⁻¹ in the range from 10,000 to 4000 cm⁻¹.
Before multivariate analysis, maximum normalization followed by standard normal variate (SNV) and mean centering were applied to the FT-NIR spectral data to reduce spectral baseline shift, improve the signal-to-noise ratio, and remove the influence of light scatter [34]. Multivariate statistical analyses, i.e., principal component analysis (PCA), principal component-linear discriminant analysis (PC-LDA), partial least-squares discriminant analysis (PLS-DA), and support vector machine (SVM) classification, were conducted with The Unscrambler® X, v10.1, software (CAMO Software AS, Oslo, Norway, 2011) [35]. The Mann-Whitney U test, performed on the protein, carbohydrate, lipid, fiber, and salt contents declared on the labels of the collected pasta samples, was carried out using Statistica 6.0 (StatSoft, Tulsa, OK, USA).
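The pre-processing sequence described above (maximum normalization, then SNV, then mean centering) can be illustrated with a minimal sketch. It is not the implementation of The Unscrambler® X: the function names and the toy absorbance values are hypothetical, and the use of the sample standard deviation in SNV is an assumption.

```python
from statistics import mean, stdev


def max_normalize(spectrum):
    """Divide one spectrum by its largest absolute value."""
    peak = max(abs(v) for v in spectrum)
    return [v / peak for v in spectrum]


def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own
    mean and standard deviation (sample SD assumed here)."""
    m = mean(spectrum)
    s = stdev(spectrum)
    return [(v - m) / s for v in spectrum]


def mean_center(spectra):
    """Subtract the mean of each column (wavenumber) from every spectrum."""
    col_means = [mean(col) for col in zip(*spectra)]
    return [[v - cm for v, cm in zip(row, col_means)] for row in spectra]


def preprocess(spectra):
    """Maximum normalization, then SNV, then mean centering."""
    return mean_center([snv(max_normalize(s)) for s in spectra])


# Hypothetical toy data: three spectra recorded at four wavenumbers.
raw = [
    [0.20, 0.40, 0.60, 0.80],
    [0.30, 0.45, 0.75, 0.90],
    [0.40, 0.80, 1.20, 1.60],  # first spectrum multiplied by 2 (scatter-like effect)
]
processed = preprocess(raw)
```

In this toy example the first and third spectra differ only by a multiplicative factor, so they become identical after SNV, which is the scatter-correcting behaviour the pre-processing is meant to provide.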

Principal Component Analysis (PCA)
PCA is an unsupervised pattern recognition approach that was first applied to the raw spectra to explore the data and to recognize potential clustering (similarities and differences) of the pasta samples based on the wheat origin of the two classes (i.e., "Pasta 100% ITA wheat" and "Pasta MIX wheat"). The data matrix used for the PCA consisted of 1557 columns (corresponding to the wavenumbers recorded every 4 cm⁻¹ in the range from 10,000 to 4000 cm⁻¹) and 361 rows (corresponding to the number of samples). Outliers were sought by using the graphical tools of The Unscrambler® X software, i.e., the Hotelling T² line plot, with a critical limit of p-value < 5%, and the influence plot, which displays as outliers those samples having both high leverage and high residual variance. No outliers were detected: 12 samples showed high leverage and 5 different samples showed high residuals, but no sample showed both. The correlation loading plot was also used to identify the FT-NIR variables with correlation values between +0.7 and +1 or between −0.7 and −1 (cut-off values were arbitrarily defined), which were selected as those contributing to the differentiation between the two classes. Then, the entire spectra and the selected variables were pre-processed before performing a new PCA. The Kennard-Stone (KS) algorithm was applied to each FT-NIR range (entire and partial) to split the sample set into a training set (2/3 of the total) to calibrate the models and a test set (1/3 of the total) to validate the prediction ability of the suggested models [36] (Table 1). The PC-LDA supervised approach was used to classify samples based on the origin of the wheat according to the label information. The final PC-LDA model was built by applying PCA and retaining the first PCs that maximized the performance of the model.
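The Kennard-Stone split mentioned above can be sketched as follows. This is a generic illustration of the algorithm using Euclidean distances, not the routine of the software used in the study; the function name and the toy points are hypothetical.

```python
import math


def kennard_stone(samples, n_train):
    """Return (train_idx, test_idx) selected with the Kennard-Stone algorithm."""
    n = len(samples)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of samples")
    dist = [[math.dist(a, b) for b in samples] for a in samples]

    # Step 1: start with the two samples that are farthest apart.
    i0, j0 = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in (i0, j0)]

    # Step 2: repeatedly add the sample whose nearest selected neighbour
    # is farthest away, so the training set covers the data space evenly.
    while len(selected) < n_train:
        best = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected, remaining


# Hypothetical toy data: six samples described by two scores each.
points = [(0, 0), (10, 0), (5, 0), (1, 0), (8, 0), (4.5, 0)]
train_idx, test_idx = kennard_stone(points, n_train=4)  # 2/3 for training
```

With these toy points the training set is built from the extremes inwards, and the samples left for the test set are those lying close to an already selected sample.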
Considering that the number of variables should not exceed (n − g)/3, where n is the number of objects in the training set (i.e., 238 samples) and g is the number of categories (2 classes) [37], the maximum number of PCs was 78. The number of PCs maximizing the performance of the model was 20. Class membership was assigned by comparing the Mahalanobis distance of each spectrum from the two classes of the model. Specifically, samples were classified into class "Pasta 100% ITA wheat" or "Pasta MIX wheat" considering the lowest distance between the origin of the discrimination plot and the projection of the sample.
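The upper bound on the number of PCs and the lowest-distance assignment rule can be checked with a short sketch. The function names and the two distance values are hypothetical; only n = 238 and g = 2 come from the text.

```python
def max_components(n_samples, n_classes):
    """Largest number of variables (PCs) allowed by the (n - g) / 3 rule."""
    return (n_samples - n_classes) // 3


def assign_class(distances):
    """Assign a sample to the class with the lowest distance."""
    return min(distances, key=distances.get)


limit = max_components(238, 2)  # (238 - 2) / 3 = 78.67, rounded down to 78

# Hypothetical Mahalanobis distances of one sample from the two classes.
example = assign_class({"Pasta 100% ITA wheat": 1.8, "Pasta MIX wheat": 3.2})
```

The 20 PCs retained in the final model are therefore well below the limit of 78.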

Partial Least-Squares Discriminant Analysis (PLS-DA)
The supervised PLS-DA tool compresses the spectral data into orthogonal structures called latent variables (LVs), which describe the maximum covariance between the spectral information and the reference values. The chosen number of LVs was the one providing the lowest prediction error in cross-validation (20 segments, equivalent to leave-11-out). A total of 9 LVs, explaining 56% of the total variance, gave the optimal model complexity. The PLS-DA model is based on the PLS algorithm, where the dependent variable y is categorical and represents the samples' class membership [35]. The categorical variable y was given a value of +1 for class "Pasta 100% ITA wheat" and a value of −1 for "Pasta MIX wheat", with the classification threshold set at 0. Samples were assigned to the class with the highest value of the y variable.
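The class coding and the cross-validation segmentation can be illustrated with a minimal sketch. The function names are hypothetical, and the way samples are allocated to segments (contiguous blocks here) is an assumption, since the text does not state it.

```python
POSITIVE = "Pasta 100% ITA wheat"
NEGATIVE = "Pasta MIX wheat"


def encode(label):
    """Code the class of interest as +1 and the other class as -1."""
    return 1 if label == POSITIVE else -1


def decode(y_pred, threshold=0.0):
    """Assign the class from a predicted y value using a threshold at 0
    (a value exactly at the threshold is assigned to the negative class here)."""
    return POSITIVE if y_pred > threshold else NEGATIVE


def cv_segments(n_samples, n_segments):
    """Split sample indices into contiguous segments of near-equal size."""
    base, extra = divmod(n_samples, n_segments)
    segments, start = [], 0
    for k in range(n_segments):
        size = base + (1 if k < extra else 0)
        segments.append(list(range(start, start + size)))
        start += size
    return segments


# 238 training samples in 20 segments: each segment holds 11 or 12 samples.
segments = cv_segments(238, 20)
```

Each segment is left out once while the model is fitted on the remaining samples, which is why 20 segments of 238 samples correspond roughly to leave-11-out.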

Support Vector Machine Classification (SVMc)
The SVMc is a classification method based on statistical learning that uses kernel functions to map from the original space to the feature space. It attempts to find the optimal separation between the classes of the training set by fitting a unique hyperplane between them [35,38]. The final decision function of SVMc is determined by a small number of support vectors, which are the points lying on the margins of the hyperplane. Proper selection of the kernel function is essential to SVMc and affects the performance of the model. In the present paper, among the different available functions, the linear kernel was used to determine the hyperplane separating the two classes, with a nu-value set at 0.5 (and a threshold set at 0.25), where nu serves as an upper bound on the fraction of errors and a lower bound on the fraction of support vectors.

Evaluation of Classification Performance
The performance of the classification models in terms of sensitivity, specificity, and accuracy was calculated for both the training and the test sets from the confusion matrices for binary classification [35,39]. Taking the class "Pasta 100% ITA wheat" as the class of interest, samples of this class are defined as true positive (TP) if they are correctly found as belonging to it, or false negative (FN) if they are classified as not belonging to it. By analogy, samples of the class "Pasta MIX wheat" are defined as true negative (TN) if they are correctly rejected, or false positive (FP) if they are assigned to the class of interest.
Sensitivity, calculated for both training and test sets, is defined as the fraction of samples belonging to the class of interest that are correctly classified by the model and is a measure of the confidence level of the class space:
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
Specificity is defined as the fraction of samples not belonging to the class of interest that are correctly rejected by the model:
$$\text{Specificity} = \frac{TN}{TN + FP}$$
Accuracy is defined as the fraction of correctly classified samples with respect to the entire set:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
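The three performance measures defined above can be computed from the confusion-matrix counts as in the following sketch. The function name and the counts are hypothetical and are not data from this study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) as fractions."""
    sensitivity = tp / (tp + fn)  # correctly accepted samples of the class of interest
    specificity = tn / (tn + fp)  # correctly rejected samples of the other class
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts: 20 samples of the class of interest, 20 of the other class.
sens, spec, acc = classification_metrics(tp=19, fn=1, tn=16, fp=4)
```

Multiplying each fraction by 100 gives the percentage rates used when reporting the results.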

Spectral Information
The unprocessed average FT-NIR spectra of pasta samples manufactured with wheat grown exclusively in Italy and of those manufactured with a mixture of wheat cultivated abroad are presented in Figure 1. The assignment of pasta signals was done through comparison with the list of the basic characterizing wavelengths in the NIR region of different functional groups related to agricultural products described by Shenk et al. [40], Stuart et al. [34], and Manley et al. [41]. By comparing the average FT-NIR raw spectra of the two classes, no visible differences in the shape of the spectra were observed, with main absorbance features around 8300 cm⁻¹, 6800 cm⁻¹, 5300 cm⁻¹, and 4000 cm⁻¹ (Figure 1).

Figure 1. Overlay of average FT-NIR raw spectra of pasta classes, i.e., pasta made with 100% Italian wheat (blue trace) and pasta made with a mixture of wheat cultivated abroad (red trace), with fundamental spectral bands.
Specifically, the band between 8600 and 7900 cm⁻¹ was attributed to the second overtone of C-H stretching and was ascribed to lipids, while the broad absorption bands at 6800 cm⁻¹ and at 5200 cm⁻¹ are combination bands related to the hydrogen of water as well as of other hydrogen-containing molecules. The latter band is also associated with the O-H stretching first overtone of starch and with the third overtone of the carbonyl group of proteins. Furthermore, the regions between 7400 and 7000 cm⁻¹ and between 4350 and 4030 cm⁻¹ could be attributed to a combination of stretching and deformation of the C-H group, typically from fatty acids and carbohydrates. The band at 6300 cm⁻¹ arose from the first overtone of the O-H stretching of starch, while the band between 5800 and 5500 cm⁻¹ was attributed to the first overtone of C-H stretching of lipids and to the O-H combination of water. Finally, the band between 4900 and 4500 cm⁻¹ arose from the combination of the C=O stretch second overtone with C-N stretching and N-H in-plane bending of both proteins and carbohydrates [34,40,41] (Figure 1). These outcomes confirmed those reported by other authors applying NIR spectroscopy to the traceability of wheat and pasta [24,27,28,30-33].

PCA
Unsupervised PCA with a total of 20 PCs was applied to the pre-processed spectra to visualize data trends in a two-dimensional scatter plot. The PCA score plot of PC1 vs. PC2 (explaining 77% and 16% of the total variance, respectively) showed a low level of clustering of the pasta samples in relation to their wheat origin. Specifically, a group of pasta samples manufactured with a mixture of foreign wheat was separated from the pasta samples manufactured exclusively with Italian wheat, while the others completely overlapped (Figure 2). This could be explained by the high percentage of Italian wheat used in pasta samples belonging to the class "Pasta MIX wheat". It should be highlighted that, within this class, 102 samples were labelled as "EU and NON-EU", and 83 were labelled as "Italy and other EU or NON-EU countries". This means that the latter pasta samples could contain 50-99% of Italian wheat, thus explaining the observed overlap between the two classes of samples.
To explore the contribution of the original variables to the PCs, we focused on the loading plot, which might reveal regions of high importance for the model (variations related to the geographical origin of wheat). Variables lying within the upper and lower bounds (correlation values between +0.7 and +1 and between −0.7 and −1) were the most important ones modelled by PC1. Specifically, it was observed that only the absorbance range between 7500 and 4000 cm⁻¹ was dominant in PC1, showing correlation loading values ≥+0.7. This region was related to lipid, carbohydrate, water, and protein absorption, thus suggesting that it contained most of the information responsible for the discrimination of pasta samples. The Mann-Whitney U test performed on the content of proteins, carbohydrates, lipids, fiber, and salt indicated that the level of lipids in pasta manufactured with a mixture of durum wheat (1.5 ± 0.21%) was significantly (p < 0.05) different from that of pasta manufactured with durum wheat grown exclusively in Italy (1.4 ± 0.32%), thus suggesting the contribution of these compounds to the variability between the two classes. This variability was probably related to the different growing zones, latitudes, and moisture conditions of the wheat used for manufacturing the samples analyzed in the present study, as reported elsewhere [8,9,42]. These results also confirmed those reported by Firmani et al. [32,33], who investigated the variable importance in projection indices to examine which variables were mainly involved in the discrimination between Gragnano and non-Gragnano pasta samples and between durum semolina varieties harvested in different Italian macro-areas. They found that the most relevant spectral zones were around 4000, 5000, and 7000 cm⁻¹.
Then, a new PCA was run by using a reduced spectral range, i.e., from 7500 to 4000 cm⁻¹. The PCA score plot of PC1 vs. PC2 (explaining 63% and 29% of the total variance, respectively) showed a slightly better clustering of the pasta samples based on wheat origin, even though some overlap between the two classes remained.
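The selection of variables by correlation loading can be sketched as follows, taking the correlation loading of a variable to be the Pearson correlation between that variable and the PC scores. The function names, wavenumbers, absorbances, and scores are hypothetical; only the cut-off of 0.7 comes from the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable carries no correlation information
    return sxy / sqrt(sxx * syy)


def select_variables(spectra, scores, wavenumbers, cutoff=0.7):
    """Keep the wavenumbers whose correlation with the PC scores is
    at least `cutoff` in absolute value."""
    columns = zip(*spectra)
    return [
        w for w, col in zip(wavenumbers, columns)
        if abs(pearson(col, scores)) >= cutoff
    ]


# Hypothetical toy data: four samples, three wavenumbers, PC1 scores.
wavenumbers = [8000, 6000, 5000]
spectra = [
    [0.5, 0.1, 0.9],
    [0.9, 0.2, 0.7],
    [0.4, 0.3, 0.6],
    [0.6, 0.4, 0.2],
]
pc1_scores = [-1.0, 0.0, 1.0, 2.0]
selected = select_variables(spectra, pc1_scores, wavenumbers)
```

In this toy case the variable at 8000 cm⁻¹ is only weakly correlated with the scores and is discarded, while the other two exceed the cut-off (one positively, one negatively) and are retained.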

Supervised Classification Models
The supervised classification models used for the authentication of Italian pasta samples manufactured exclusively with durum wheat grown in Italy were PC-LDA, PLS-DA and SVMc. The performance results in terms of sensitivity, specificity, and accuracy were compared to select the best classification approach.
According to the above-mentioned results, the reduced spectral range, i.e., from 7500 to 4000 cm⁻¹, was used for developing and validating the three supervised classification models; the results are reported in Table 2. With the class "Pasta 100% ITA wheat" as the class of interest, sensitivity, specificity, and accuracy were calculated as reported in Section 2.2.5. Sensitivity rates were between 93 and 98% for the training set and between 88 and 95% for the test set, while the specificity rates of the three models were between 77 and 95% and between 82 and 94%, respectively. Finally, the accuracy rates were between 86 and 96% for the training set and between 85 and 94% for the test set (Table 2).
Table 2. Performance parameters (sensitivity, specificity, and accuracy) of the principal component-linear discriminant analysis (PC-LDA), support vector machine classification (SVMc), and partial least-squares discriminant analysis (PLS-DA) for both the training and the test sets of pasta samples obtained in the spectral range between 7500 and 4000 cm⁻¹.
By comparison, PC-LDA was the best-performing approach, yielding a sensitivity of 98% for the training set and of 95% for the test set and an accuracy of 96% for the training set and of 94% for the test set. The goodness of the model was also confirmed by the specificity of 94% for the test set, indicating the high ability of the model to reject the objects of the other class. The good ability of the model to discriminate the two classes, i.e., "Pasta 100% ITA wheat" and "Pasta MIX wheat", was also confirmed by the PC-LDA score plot of the test set shown in Figure 3. The observed partial scattering of the class "Pasta MIX wheat" was probably due to the variable percentage of Italian wheat used in these samples, as previously observed for PCA. On the other hand, the worst model was that obtained using the PLS-DA approach, as clearly visible in the score plot showing the numerous misclassified samples in the two classes (Figure 4).
To verify that the use of the selected range did not affect the final results, the three supervised classification models were also applied to the entire spectral range (i.e., 10,000-4000 cm⁻¹). Comparable results in terms of sensitivity and accuracy rates between the three classification models were obtained for both training and validation sets. Specifically, sensitivity rates were between 92 and 97% for the training set and between 87 and 90% for the test set, while accuracy rates were between 92 and 98% and between 86 and 89%, respectively (data not shown). The accuracy values below 90% obtained for the test set indicated that, although the entire spectral range could increase the amount of information, it gave worse results than the reduced range.

The results obtained in the present study are in agreement with those recently reported by Firmani and co-workers [32] on the use of NIR spectroscopy in combination with PLS-DA and soft independent modelling of class analogies (SIMCA) for the authentication of protected geographical indication (PGI) Gragnano pasta, a typical durum pasta produced in a specific area in the South of Italy. Moreover, Biancolillo et al. [29] applied NIR spectroscopy for the determination of turmeric adulteration in egg pasta by using the PLS regression model. Recently, FT-NIR spectroscopy combined with the PC-LDA classification model was successfully applied for tracing the geographical origin of durum wheat samples from different areas of Italy (North, Center, and South) and to discriminate Italian durum wheat samples from those cultivated abroad [23]. Accuracy values up to 100% confirmed the robustness of this classification model. Furthermore, the FT-NIR spectral range between 7700 and 4500 cm⁻¹ provided the best performance results, thus suggesting that starch, lipids, and proteins that absorb in this region may be the factors responsible for the geographical origin discrimination of wheat [23]. In another work, NIR spectroscopy was applied for the classification of different cultivars of Italian durum wheat semolina. The fusion of NIR data with alveographic parameters correctly classified 100% of samples based on their geographical origin [33]. NIR spectroscopy, in combination with PC-LDA and PLS-DA, was also successfully applied for the geographical origin discrimination of wheat flour [24,27,28]. Furthermore, very recently, an EU validation procedure for screening methods was successfully applied to a multivariate FT-NIR spectroscopic method for durum wheat pasta authentication. The results of this proof-of-concept strategy demonstrated that FT-NIR is an attractive and suitable tool for the detection of durum wheat pasta adulteration [31].

Conclusions
This paper describes, for the first time, the application of FT-NIR spectroscopy, combined with chemometrics, for the authentication of Italian pasta made exclusively with durum wheat cultivated in Italy. To achieve this goal, more than 350 pasta samples marketed in Italy were analyzed by FT-NIR coupled with supervised classification models, i.e., PC-LDA, SVMc, and PLS-DA, and the performance results were compared. The analysis of correlation loadings indicated that the FT-NIR region between 7500 and 4000 cm⁻¹ contained most of the information useful for the discrimination of pasta samples based on the geographical origin of wheat, which was also supported by the lipid content, which differed significantly between the two classes. PC-LDA provided the highest accuracy values, while PLS-DA showed the lowest ones. Furthermore, the PC-LDA model proved to be particularly suitable for rejecting samples not belonging to the class "Pasta 100% ITA wheat", making it the best discriminant approach.
The outputs of this work showed that FT-NIR spectroscopy, coupled with a suitable multivariate classification model, i.e., PC-LDA, is a reliable and robust tool that can be used for authenticating durum wheat pasta samples based on the geographical origin of wheat, with a low probability of discrimination error. These results are expected to be of value both to Italian pasta producers that base their business on local authenticated products and to consumer associations interested in preserving and promoting the typicity of Italian pasta and of other classes of products with high added value.