Sequential Data Fusion Techniques for the Authentication of the P.G.I. Senise (“Crusco”) Bell Pepper

Abstract: Bell pepper is the common name of the berry obtained from some varieties of the Capsicum annuum species. This agro-food is appreciated all over the world and is a key ingredient of several traditional dishes. It is used as a fresh product, or dried and ground as a seasoning (e.g., paprika). Specific varieties of sweet pepper have distinctive organoleptic characteristics and have been awarded quality marks as a further confirmation of their uniqueness (e.g., Piment d’Espelette, Pimiento de Herbón, Peperone di Senise). Owing to its market value, this food can be subject to fraud, such as adulteration and sophistication. The present study builds on these considerations and aims at developing a spectroscopy-based approach for authenticating Senise bell pepper and for detecting its adulteration with common paprika. To this end, 60 pure samples of bell pepper from Senise were analyzed by mid- and near-infrared spectroscopy. Then, to mimic adulteration, 40 mixtures of Senise bell pepper and paprika were prepared and analyzed by the same spectroscopic techniques. Finally, two multi-block classification approaches (sequential and orthogonalized partial least squares linear discriminant analysis and sequential and orthogonalized covariance selection linear discriminant analysis) were used to discriminate between pure and adulterated Senise bell pepper samples. Both procedures correctly classified all thirty-five test samples in external validation, indicating that either approach is an effective solution for the investigated classification problem.


Introduction
Bell pepper is the common name of the berry obtained from some varieties of the Capsicum annuum species. This product is appreciated all over the world and represents one of the key ingredients of several traditional dishes.
From the nutritional point of view, this food is valued for its high content of antioxidants and vitamins. In particular, bell peppers are rich in vitamins A, C and E and, to a lesser extent, in vitamin D and the B-group vitamins [1]. The levels of these compounds in capsicum berries depend on several factors: the genus, the variety, the production practices, the maturity at harvest and the storage conditions [2]. In Europe, several varieties of bell pepper are grown. Among these, some have been awarded quality marks by the European Union, as a further confirmation and protection of their quality. Examples are the Protected Designation of Origin (PDO) Piment d'Espelette, grown in France, the PDO Pimiento de Herbón, harvested in Spain, and the Protected Geographical Indication (PGI) Peperone di Senise, produced in southern Italy.
The main quality controls conducted on these agro-foods are generally aimed at quantifying the specific substances they contain. For example, considerable effort has been devoted to the qualitative and quantitative analysis of carotenoids, as carried out by Gregory and collaborators [3], who used high performance liquid chromatography (stationary phase: octadecyl silica; mobile phase: methanol-ethyl acetate) for the quantification of carotenoids and carotenoid esters in bell peppers, or by Gentili et al. [4], who exploited HPLC-photodiode array detection-tandem mass spectrometry (HPLC-PAD-MS/MS) for the untargeted determination of carotenoids in three different varieties of sweet peppers. Similarly, the literature reports different approaches for the quantification of capsaicinoids and polyphenols and for the estimation of the antioxidant capacity of bell peppers. In this context, a valuable example is the study by Sora and collaborators, who analyzed all these families of compounds in the berries by means of HPLC (for the quantification of capsaicin, dihydrocapsaicin, and the total phenol content) and radical tests (for estimating the scavenging activity) [5].
While the analytical methodologies developed for the quality control of bell peppers are very advanced, only a small number of papers in the literature focus on the authentication and characterization of high value-added peppers.
Some authors have highlighted the possibility of using analytical approaches for discriminating greenhouse from outdoor-grown bell peppers [6], or for differentiating organic vegetables from those cultivated following traditional systems, but few have investigated the possibility of developing an analytical methodology for detecting adulteration of this agro-food product. This aspect is particularly relevant considering that, in many countries, bell pepper is not only consumed as a fresh product, but is often dried and ground (e.g., paprika) and used as the main ingredient or seasoning in several traditional dishes. Given its powdery state, the product is easily subjected to fraud, such as adulteration and sophistication. Additionally, the production processes needed to obtain the final product greatly increase its market value, which further motivates fraud and hence the need to develop detection methodologies to prevent possible illicit actions.
The present study builds on these considerations and aims to propose a non-destructive approach for detecting adulteration of the Senise bell pepper. This typical food, produced in a restricted area in southern Italy (around Senise, a small town in Basilicata), is a Protected Geographical Indication product. It is generally dried and ground and used as seasoning in different typical dishes. It can be easily adulterated with similar products, for instance paprika, and, even at high levels of adulteration, the difference between the two products would not be visible to the naked eye.
The proposed adulteration-detection tool is based on coupling mid- and near-infrared spectroscopies (MIR/NIR) with different sequential multi-block classification approaches. In particular, sequential and orthogonalized partial least squares linear discriminant analysis (SO-PLS-LDA) and sequential and orthogonalized covariance selection linear discriminant analysis (SO-CovSel-LDA) were used to process the spectroscopic signals.
These two instrumental techniques were chosen because they are relatively rapid and non-destructive, and they have proven to be suitable allies against fraud in food matrices [7][8][9][10][11]. The choice of the classifier fell on data fusion (DF) approaches because, when applicable, multi-block methodologies are expected to perform better than the separate analysis of the individual data blocks [12][13][14]. In particular, SO-PLS-LDA and SO-CovSel-LDA were chosen because their sequential nature provides a number of benefits related to the interpretation of the system [15] and because they have proven to be suitable tools in similar situations [16][17][18][19][20][21].

Samples
Ground Senise bell pepper samples were collected from local producers in Senise (Basilicata region in southern Italy). Forty adulterated samples were prepared in the laboratory by mixing, in different proportions, pure ground Senise bell pepper and lower-valued ground paprika samples, purchased from different retailers in Italy. Details about the relative composition of these samples are reported in Table 1. In total, one hundred samples were available: sixty pure ground Senise bell pepper samples and forty mixtures of Senise bell pepper and paprika.

NIR Measurements
All the available samples were analyzed in diffuse reflectance by a Fourier transform-Near infrared (FT-NIR) instrument (Nicolet 6700, Thermo Scientific Inc., Madison, WI, USA) equipped with an integrating sphere which allowed the direct analysis of the samples without any further pretreatment.
An aliquot of each sample was introduced into a glass vial placed on top of the window of the integrating sphere. For each sample, two analytical replicates were analyzed in the spectral range between 4000 cm⁻¹ and 10,000 cm⁻¹, at a nominal resolution of 4 cm⁻¹. Spectra were collected in reflectance mode, visualized with the OMNIC software (Thermo Scientific Inc., Madison, WI, USA) and exported as .CSV files to be further processed by means of in-house written functions running in MATLAB (R2015b; The Mathworks, Natick, MA, USA). Prior to the chemometric analysis, signals (originally acquired as reflectance, R) were converted to pseudo-absorbance (log(1/R)).
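As an illustration of the reflectance-to-pseudo-absorbance conversion mentioned above, the following minimal Python sketch may help. It is not the authors' MATLAB code: the function name is hypothetical, and it assumes that reflectance is stored as a fraction between 0 and 1 and that the logarithm is taken in base 10, neither of which is stated in the text.

```python
import numpy as np


def reflectance_to_pseudo_absorbance(reflectance, floor=1e-12):
    """Convert reflectance R to pseudo-absorbance log10(1/R).

    Assumptions (not stated in the paper): R is a fraction in (0, 1]
    and the logarithm is base 10. Values are clipped at `floor` to
    avoid taking the logarithm of zero.
    """
    r = np.clip(np.asarray(reflectance, dtype=float), floor, None)
    return -np.log10(r)


# Example: one hypothetical spectrum with three reflectance readings.
spectrum_r = np.array([1.0, 0.1, 0.01])
print(reflectance_to_pseudo_absorbance(spectrum_r))
```

If the instrument exports percent reflectance instead, the values would need to be divided by 100 before the conversion.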

MIR Measurements
MIR spectra of both pure and adulterated bell pepper samples were collected using a PerkinElmer Spectrum Two™ (PerkinElmer, Waltham, MA, USA) FT-IR spectrometer, equipped with a PerkinElmer Universal Attenuated Total Reflectance (uATR) device (with single-bounce diamond crystal) and a deuterated triglycine sulfate (DTGS) detector. The inspected spectral range was between 4000 cm⁻¹ and 400 cm⁻¹ (1 cm⁻¹ nominal resolution). The background was collected with the crystal exposed to the air and updated approximately every hour. Any sample residue on the attenuated total reflectance (ATR) crystal was removed using soft tissues and methanol. The cleaning procedure was carried out prior to every measurement and the crystal was air-dried before collecting a new spectrum. IR signals were recorded by pressing the ATR device on the ground samples. The pressure applied was optimized for each sample, by means of a monitoring system implemented in the software of the instrument.

Data Fusion Approaches: Sequential and Orthogonalized Partial Least Squares-Linear Discriminant Analysis (SO-PLS-LDA) and Sequential and Orthogonalized Covariance Selection-Linear Discriminant Analysis (SO-CovSel-LDA)
Sequential and orthogonalized partial least squares (SO-PLS) [15,22] is a multi-block method conceived for solving regression problems, which has been extended to classification problems by combining it with linear discriminant analysis (LDA) [23,24]. SO-PLS allows a sequential extraction of non-redundant information from different data blocks and is particularly suitable for handling highly correlated data sets. Considering the case where two sets of measurements ($X_1$ and $X_2$), e.g., data collected by different analytical platforms, are used to estimate a response $Y$, the SO-PLS algorithm can be briefly summarized as follows:
(a) $Y$ is fitted to $X_1$ by PLS regression: $\hat{Y} = X_1 B_1$.
(b) $X_2$ is orthogonalized with respect to the $X_1$-scores estimated in (a), obtaining $X_{2,\mathrm{Orth}}$.
(c) The residuals from step (a) are fitted to $X_{2,\mathrm{Orth}}$ by partial least squares (PLS) regression.
(d) The overall regression model is obtained by combining the outcomes of (a) and (c), and can be expressed as $\hat{Y} = X_1 B_1 + X_{2,\mathrm{Orth}} B_{2,\mathrm{Orth}}$, where $\hat{Y}$ indicates model predictions and $B_1$ and $B_{2,\mathrm{Orth}}$ are the regression coefficient matrices.
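The four steps above can be sketched in Python for the two-block case. This is only an illustrative sketch, not the implementation used in the study: the function names are hypothetical, the PLS routine is a basic NIPALS-style variant with SVD-derived weights, all blocks are mean-centered, and at least one component per block is assumed (SO-PLS itself also allows zero components for a block).

```python
import numpy as np


def pls_fit(X, Y, n_components):
    """Basic PLS2 on centered X (n x p) and Y (n x m).

    Returns scores T and coefficients B such that Y is approximated by X @ B.
    """
    X_res, Y_res = X.copy(), Y.copy()
    n, p = X.shape
    W = np.zeros((p, n_components))
    P = np.zeros((p, n_components))
    Q = np.zeros((Y.shape[1], n_components))
    T = np.zeros((n, n_components))
    for a in range(n_components):
        # Weight vector: dominant left singular vector of X'Y.
        u, _, _ = np.linalg.svd(X_res.T @ Y_res, full_matrices=False)
        w = u[:, 0]
        t = X_res @ w
        tt = t @ t
        p_load = X_res.T @ t / tt
        q = Y_res.T @ t / tt
        # Deflate both X and Y by the extracted component.
        X_res = X_res - np.outer(t, p_load)
        Y_res = Y_res - np.outer(t, q)
        W[:, a], P[:, a], Q[:, a], T[:, a] = w, p_load, q, t
    B = W @ np.linalg.solve(P.T @ W, Q.T)
    return T, B


def so_pls_fit(X1, X2, Y, n_comp1, n_comp2):
    """Two-block SO-PLS regression following steps (a)-(d)."""
    y_mean = Y.mean(axis=0)
    X1c, X2c, Yc = X1 - X1.mean(axis=0), X2 - X2.mean(axis=0), Y - y_mean
    # (a) Fit Y to X1 by PLS.
    T1, B1 = pls_fit(X1c, Yc, n_comp1)
    residuals = Yc - X1c @ B1
    # (b) Orthogonalize X2 with respect to the X1 scores.
    X2_orth = X2c - T1 @ np.linalg.solve(T1.T @ T1, T1.T @ X2c)
    # (c) Fit the residuals of (a) to the orthogonalized X2.
    T2, B2_orth = pls_fit(X2_orth, residuals, n_comp2)
    # (d) Combine the two contributions.
    Y_hat = y_mean + X1c @ B1 + X2_orth @ B2_orth
    return {"T1": T1, "T2": T2, "X2_orth": X2_orth, "Y_hat": Y_hat}
```

Because the second block is orthogonalized before it is modelled, the scores of the two blocks are mutually orthogonal, which is what makes the contribution of each block separately interpretable.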
In order to use SO-PLS for classification problems, i.e., in SO-PLS-LDA, information about class membership first has to be encoded in a dummy response matrix, as is customarily done in partial least squares-discriminant analysis (PLS-DA). Then, an SO-PLS regression model is calculated between the independent blocks of data and the dummy $Y$. Finally, once the SO-PLS model is built, LDA can be applied either to the concatenated scores of $X_1$ and $X_{2,\mathrm{Orth}}$ or to the predicted responses [15,24].
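The dummy coding and the final LDA step can be illustrated with a short Python sketch. The names are hypothetical and the LDA below is a plain pooled-covariance classifier with empirical priors, given here only as one possible reading of the procedure; the score matrix passed to it would be the concatenated block scores (or the predicted responses) of an SO-PLS model.

```python
import numpy as np


def dummy_encode(labels):
    """Encode class labels as a binary dummy matrix (one column per class)."""
    classes = sorted(set(labels))
    Y = np.array([[1.0 if lab == c else 0.0 for c in classes] for lab in labels])
    return Y, classes


def lda_fit(scores, labels):
    """Fit LDA with a pooled within-class covariance matrix."""
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    means = np.array([scores[labels == c].mean(axis=0) for c in classes])
    scatter = sum(
        (scores[labels == c] - m).T @ (scores[labels == c] - m)
        for c, m in zip(classes, means)
    )
    pooled = scatter / (len(labels) - len(classes))
    priors = np.array([np.mean(labels == c) for c in classes])
    return classes, means, np.linalg.pinv(pooled), priors


def lda_predict(model, scores):
    """Assign each row to the class with the highest linear discriminant score."""
    classes, means, s_inv, priors = model
    disc = (
        scores @ s_inv @ means.T
        - 0.5 * np.sum(means @ s_inv * means, axis=1)
        + np.log(priors)
    )
    return [classes[i] for i in disc.argmax(axis=1)]
```

In a two-class problem such as pure versus adulterated samples, the dummy matrix has two columns, and the LDA decision boundary is a straight line in the plane of the block scores.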
SO-CovSel is a multi-block regression method derived from SO-PLS [25]. The two approaches rely on similar algorithms; the main difference is that, while in SO-PLS feature reduction is achieved by extracting latent variables (PLS components) from the independent blocks, in SO-CovSel experimental variables are selected by means of an algorithm known as covariance selection (CovSel) [26]. As a consequence, the regression steps (a) and (c) of SO-PLS are replaced by ordinary least squares fits, and $X_2$ is orthogonalized with respect to the variables selected by CovSel on $X_1$. SO-CovSel (for the case of two predictor blocks) therefore involves the following steps:
(a) A set of $X_1$-variables is selected by means of CovSel (obtaining $X_1^{\mathrm{Sel}}$).
(b) $Y$ is fitted to $X_1^{\mathrm{Sel}}$ by means of ordinary least squares (OLS).
(c) $X_2$ is orthogonalized with respect to $X_1^{\mathrm{Sel}}$ (obtaining $X_{2,\mathrm{Orth}}$).
(d) A set of $X_{2,\mathrm{Orth}}$-variables is selected by means of CovSel (obtaining $X_{2,\mathrm{Orth}}^{\mathrm{Sel}}$).
(e) The $Y$-residuals from step (b) are fitted to $X_{2,\mathrm{Orth}}^{\mathrm{Sel}}$ by means of OLS.
(f) The overall regression model is obtained by combining the outcomes of steps (b) and (e): $\hat{Y} = X_1^{\mathrm{Sel}} B_1 + X_{2,\mathrm{Orth}}^{\mathrm{Sel}} B_{2,\mathrm{Orth}}$.
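A compact Python sketch of steps (a)-(f) is given below for illustration. It assumes the usual formulation of CovSel, in which the variable with the largest squared covariance with the response is picked and both X and Y are then deflated with respect to it; the function names are hypothetical and this is not the code used by the authors.

```python
import numpy as np


def covsel(X, Y, n_select):
    """Select n_select columns of X by covariance selection."""
    Xr = X - X.mean(axis=0)
    Yr = Y - Y.mean(axis=0)
    selected = []
    for _ in range(n_select):
        # Squared covariance of every variable with all response columns.
        cov2 = np.sum((Xr.T @ Yr) ** 2, axis=1)
        j = int(np.argmax(cov2))
        selected.append(j)
        # Deflate X and Y: remove what the selected variable explains.
        x = Xr[:, [j]]
        denom = float(np.sum(x ** 2))
        Xr = Xr - x @ (x.T @ Xr) / denom
        Yr = Yr - x @ (x.T @ Yr) / denom
    return selected


def ols(X, Y):
    """Ordinary least squares coefficients for Y approximated by X @ B."""
    return np.linalg.lstsq(X, Y, rcond=None)[0]


def so_covsel_fit(X1, X2, Y, n_sel1, n_sel2):
    """Two-block SO-CovSel regression following steps (a)-(f)."""
    y_mean = Y.mean(axis=0)
    X1c, X2c, Yc = X1 - X1.mean(axis=0), X2 - X2.mean(axis=0), Y - y_mean
    idx1 = covsel(X1c, Yc, n_sel1)                      # (a)
    X1_sel = X1c[:, idx1]
    B1 = ols(X1_sel, Yc)                                # (b)
    residuals = Yc - X1_sel @ B1
    X2_orth = X2c - X1_sel @ ols(X1_sel, X2c)           # (c)
    idx2 = covsel(X2_orth, residuals, n_sel2)           # (d)
    X2_sel = X2_orth[:, idx2]
    B2_orth = ols(X2_sel, residuals)                    # (e)
    Y_hat = y_mean + X1_sel @ B1 + X2_sel @ B2_orth     # (f)
    return idx1, idx2, Y_hat
```

The returned index lists point directly at original spectral variables, which is why the selected predictors can be read off the model without any further analysis.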

Results
All the collected spectra were imported into MATLAB for subsequent data processing. Since two spectra were acquired on each sample, these two replicates were first averaged, leading to two sets of 100 profiles each, one corresponding to the MIR results and the other to the NIR signals, which are displayed in Figure 1a,b, respectively. In the same Figure 1 (panels c and d), the mean MIR and NIR spectra of the two investigated categories (pure Senise or adulterated) are compared: the profiles in Figure 1c,d make it apparent that the differences between pure Senise and adulterated samples are rather subtle and that the identification of adulterated samples cannot rely on visual inspection of the recorded signals alone.
Prior to the creation of the classification models, samples were divided into a training and a test set by the Duplex algorithm [27], and the splitting was carried out on each class separately. In order to create two representative sets of samples taking into account the variability present in both data blocks, the same procedure as reported in [18] was followed. Briefly, for each category, the MIR and NIR spectra were concatenated row-wise, leading to the two augmented matrices $X_{\mathrm{Aug,Senise}} = [X_{\mathrm{MIR,Senise}}\; X_{\mathrm{NIR,Senise}}]$ and $X_{\mathrm{Aug,Adult}} = [X_{\mathrm{MIR,Adult}}\; X_{\mathrm{NIR,Adult}}]$. Afterwards, a principal component analysis (PCA) model was separately calculated on each of the augmented matrices (after block-scaling) and, in both cases, the first five principal components (PCs) were extracted. Finally, the Duplex algorithm was run individually on each of the two score matrices (i.e., category-wise) and signals were divided accordingly. Of the 100 investigated samples, 65 (40 pure Senise and 25 adulterated) were selected as the training set, while the remaining 35 (20 pure Senise and 15 adulterated) were left out for the validation of the models (test set).
At first, classification models were built on each spectroscopic block individually, both to investigate how effective each of the two infrared regions could be in identifying the adulteration of the product and to compare the resulting outcomes with those obtained by the multi-block strategies. For the analysis of the individual data blocks, two different classifiers were used: partial least squares discriminant analysis (PLS-DA) [28] and a classification strategy based on the fusion of different pre-treatments through SO-PLS-LDA, known as sequential preprocessing through orthogonalization (SPORT) [29].
Briefly, in the context of classification, SPORT is based on applying different pretreatments to the spectroscopic matrix of interest, so as to create a multi-block data set, which is then processed by SO-PLS-LDA.
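The class-wise training/test splitting described earlier in this section (block scaling, PCA on the augmented matrix, then Duplex on the scores) can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, block scaling is assumed to mean dividing each centered block by its Frobenius norm, and the Duplex variant shown stops feeding the test set once it reaches the requested size, which is one common way of obtaining unequal splits such as 40/20.

```python
import numpy as np


def block_scale(X):
    """Center a block and scale it to unit Frobenius norm (assumed definition)."""
    Xc = X - X.mean(axis=0)
    return Xc / np.linalg.norm(Xc)


def pca_scores(X, n_components):
    """PCA scores of a (centered) matrix via singular value decomposition."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]


def duplex_split(scores, n_test):
    """Duplex-style split; assumes distinct points and n_test >= 2."""
    D = np.linalg.norm(scores[:, None, :] - scores[None, :, :], axis=2)
    remaining = list(range(len(scores)))

    def take_farthest_pair():
        sub = D[np.ix_(remaining, remaining)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        pair = [remaining[i], remaining[j]]
        for idx in pair:
            remaining.remove(idx)
        return pair

    def take_farthest_from(group):
        # Candidate whose nearest neighbour in the group is farthest away.
        min_dist = D[np.ix_(remaining, group)].min(axis=1)
        idx = remaining[int(np.argmax(min_dist))]
        remaining.remove(idx)
        return idx

    train = take_farthest_pair()
    test = take_farthest_pair()
    while remaining:
        train.append(take_farthest_from(train))
        if remaining and len(test) < n_test:
            test.append(take_farthest_from(test))
    return sorted(train), sorted(test)
```

For one class, a call such as `duplex_split(pca_scores(np.hstack([block_scale(X_mir), block_scale(X_nir)]), 5), n_test=20)` would mimic the described procedure, where `X_mir` and `X_nir` are placeholder names for the spectra of that class.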

Classification Models Built on Individual Spectroscopic Blocks
In order to use the SPORT approach to build (and validate) the individual classification models for the MIR and the NIR data sets, each of the two matrices was pre-processed using the following pre-treatments: no pretreatment (raw data), standard normal variate (SNV) [30], first derivative and second derivative (both calculated with a 15-point window and a third-order polynomial) [31]. Since SO-PLS-LDA is a sequential model, it is important to specify the order of the blocks, which was the one reported above; moreover, mean centering was always used as a further pre-processing step. For the sake of comparison, PLS-DA was also performed on the MIR and NIR data sets after the individual application of the above-mentioned pretreatments; consequently, 8 different models (four per data block) were calculated.
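The pre-treatments listed above can be illustrated with a NumPy-only sketch. It assumes that the derivatives are Savitzky-Golay derivatives (a local polynomial fit over the 15-point window), expressed per data point rather than per wavenumber, and that the spectrum edges are handled by simple edge padding; these details and the function names are assumptions, not taken from the paper.

```python
from math import factorial

import numpy as np


def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row)."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def savgol_coefficients(window, polyorder, deriv):
    """Filter weights giving the deriv-th derivative at the window center."""
    half = window // 2
    positions = np.arange(-half, half + 1)
    # Columns are positions**0, positions**1, ..., positions**polyorder.
    A = np.vander(positions, polyorder + 1, increasing=True)
    return np.linalg.pinv(A)[deriv] * factorial(deriv)


def savgol_derivative(spectra, window=15, polyorder=3, deriv=1):
    """Apply the derivative filter to every spectrum (row)."""
    spectra = np.asarray(spectra, dtype=float)
    coeffs = savgol_coefficients(window, polyorder, deriv)
    half = window // 2
    padded = np.pad(spectra, ((0, 0), (half, half)), mode="edge")
    return np.array([np.correlate(row, coeffs, mode="valid") for row in padded])


def sport_blocks(spectra):
    """Build the four pre-processing blocks in the order given in the text."""
    return [
        np.asarray(spectra, dtype=float),    # raw data
        snv(spectra),                        # SNV
        savgol_derivative(spectra, deriv=1), # first derivative
        savgol_derivative(spectra, deriv=2), # second derivative
    ]
```

The resulting list of blocks is what a sequential model would receive, in this order, after mean centering each block.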
The classification accuracy of the PLS-DA and SPORT models built on the MIR and the NIR data, both for the training (in cross-validation) and the test sets, is summarized in Table 2, together with their optimal complexity (the range of latent variables explored was 0-10). It should be stressed that, for each data set, in the model selection stage, the number of latent variables (LVs) to be retained in each pre-processing block was selected as the one leading to the lowest classification error in a fivefold cross-validation procedure. Moreover, in the case of PLS-DA, the best pre-processing was also selected based on the maximum classification accuracy in cross-validation.

Table 2. Results of partial least squares-discriminant analysis (PLS-DA) and sequential preprocessing through orthogonalization (SPORT) classification for the MIR and the NIR data sets. Columns: Method; Data Set; Optimal Number of LVs (Raw, SNV, 1st Der., 2nd Der.); Correct Classification Rate (%).

Looking at the table, it is possible to notice how the use of SPORT not only allows one to fuse different pre-processings into a single model, but also straightforwardly indicates which ones are most effective for the problem at hand. Indeed, in the case of the MIR signals, only two of the four matrices corresponding to the different pre-treatments contributed a non-zero number of latent variables to the classification model, i.e., the raw data and the second derivative (with one and four LVs, respectively). The SPORT model on the MIR data performed very well on the training set (around 92% correct classification rate for both classes) but showed a lower, though still good, predictive accuracy (82.8%, corresponding to five Senise samples and one adulterated sample misclassified) on the test samples. The comparison of these results with those of the best individual PLS-DA model (the one built on data pretreated with the second derivative) indicates that the fusion of different preprocessing strategies can help improve the classification accuracy on the training data; however, in this case, the classification error on the test set is the same. On the other hand, the best model on NIR spectra was created by selecting only one LV from the block preprocessed by the first derivative. This led to the perfect classification of all the training objects in cross-validation and an overall accuracy of 94.3% on the test set, corresponding to the misclassification of two adulterated samples. Since only the first derivative block was selected by SPORT, in the case of the NIR data the results are identical to those of the best PLS-DA model. In any case, it should be stressed that, for both data sets, the use of SPORT provides comparable or better results without the need to choose the optimal pre-processing strategy in advance.
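The test-set rates quoted above follow directly from the misclassification counts on the 35 test samples, as the short Python check below shows (the helper name is hypothetical). For the MIR block the exact value is 29/35, i.e., about 82.86%, which the text reports as 82.8%.

```python
def correct_classification_rate(n_samples, n_misclassified):
    """Percentage of correctly classified samples."""
    return 100.0 * (n_samples - n_misclassified) / n_samples


# MIR: five Senise samples and one adulterated sample misclassified.
mir_rate = correct_classification_rate(35, 5 + 1)
# NIR: two adulterated samples misclassified.
nir_rate = correct_classification_rate(35, 2)

print(f"MIR test accuracy: {mir_rate:.2f}%")
print(f"NIR test accuracy: {nir_rate:.2f}%")
```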

Multi-Block Analysis
Two multi-block classification approaches, namely SO-PLS-LDA and SO-CovSel-LDA, were used to integrate the information in the two spectroscopic data sets into a single model, with the expectation that this could also lead to better predictions. In the model selection stage, the best pre-processing and the optimal number of latent/original variables to be retained for each block were identified as those resulting in the lowest classification error in a fivefold cross-validation procedure. Specifically, for each block the same four pre-processings already discussed in the case of the SPORT models were tested, and the optimal combination, consistently with the results obtained on the individual spectroscopic data, was found to be second derivative for MIR and first derivative for NIR. The optimal SO-PLS-LDA model included one latent variable from the MIR block and three from the NIR data matrix. On the other hand, three and two original variables were selected from the NIR and MIR blocks, respectively, to build the SO-CovSel-LDA model. Both data fusion models achieved 100% correct classification in both calibration and validation, correctly assigning all the training and test samples. A graphical representation of the classification accuracy of the SO-PLS-LDA model is displayed in Figure 2, where the projection of the training and test samples onto the space spanned by the first LVs extracted from the MIR and the NIR blocks is shown; this representation exploits the sequential multi-block nature of the SO-PLS algorithm, allowing one to straightforwardly visualize the separation between the categories and the within-class scatter [24]. The accurate model predictions result from the fact that the pure Senise (red diamonds) and adulterated (blue squares) samples are clearly separated in this space.
Moreover, inspection of Figure 2 also shows that, as could be expected given the nature of the samples, the within-class variance of the adulterated category is higher than that of pure Senise. The distribution of the samples also indicates that the information from both blocks is needed to discriminate between the two classes, since the separation occurs along a direction which is not parallel to either of the axes. At the same time, it is also evident that, consistently with the classification results obtained on the individual blocks, the NIR block is more discriminant than the MIR one, as the two classes are more separated along LV1 NIR than along LV1 MIR: indeed, almost all the pure Senise samples have positive scores on LV1 NIR, while the large majority of the adulterated peppers fall at negative values of the component. In addition to the excellent classification performances, not only on the training data but, more relevantly, on the test samples, both SO-PLS-LDA and SO-CovSel-LDA models can be interpreted so as to identify which experimental variables contribute the most to the observed discrimination between the investigated categories. Since SO-CovSel relies on a variable selection algorithm, such information results directly from the model building stage, where the most relevant predictors from each block are extracted.
On the other hand, in the case of SO-PLS the identification of the variables contributing the most requires a post-hoc analysis of the model parameters, which can be carried out, e.g., by inspecting the values of the variable importance in projection (VIP) scores, as discussed in [32]; this was the strategy adopted in the present study. The variables identified as relevant for the SO-PLS-LDA model based on the VIP analysis and those selected by the SO-CovSel-LDA algorithm are graphically compared in Figure 3. In these plots, the relevant predictors in the two blocks are highlighted in red over the corresponding mean spectrum, which is plotted in black. SO-CovSel-LDA has an embedded variable selection step based on an algorithm that is specifically designed to provide an extremely parsimonious solution: indeed, only five predictors (two from the MIR and three from the NIR block) are selected. On the other hand, the definition of the VIP scores, which are the basis for the identification of the relevant predictors in SO-PLS-LDA, is such that a rather high number of variables is usually selected. In particular, by adopting the "greater-than-one" criterion to establish whether a predictor was significantly contributing to the model or not, VIP analysis identified 920 variables (out of 3601) and 730 variables (out of 3112) for the MIR and the NIR blocks, respectively. Despite the differences between the two approaches, it is evident from the figure that they are consistent in terms of the spectral bands identified as relevant for the discrimination between pure and adulterated Senise samples.
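As an illustration of the VIP analysis and of the "greater-than-one" criterion, the sketch below computes VIP scores for a single block from its PLS weights, scores and Y-loadings, using the standard single-block PLS definition. How exactly the scores are computed per block in SO-PLS is described in [32]; the function name and the inputs here are therefore assumptions.

```python
import numpy as np


def vip_scores(W, T, Q):
    """VIP scores for one block of a PLS model.

    W: weights (variables x components), T: scores (samples x components),
    Q: Y-loadings (responses x components).
    """
    n_variables = W.shape[0]
    # Y-variance explained by each component.
    ss_y = np.sum(T ** 2, axis=0) * np.sum(Q ** 2, axis=0)
    W_norm = W / np.linalg.norm(W, axis=0)
    return np.sqrt(n_variables * (W_norm ** 2 @ ss_y) / ss_y.sum())


def relevant_variables(vip, threshold=1.0):
    """Indices of the variables passing the 'greater-than-one' criterion."""
    return np.flatnonzero(vip > threshold)
```

By construction the squared VIP scores average to one over the variables of a block, which is why one is the customary threshold and also why a fairly large share of variables typically exceeds it.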
In fact, as far as the MIR block is concerned, VIP analysis identifies as relevant the intervals of 955-1223 cm⁻¹ and 2844-2960 cm⁻¹, while, in the same spectral ranges, SO-CovSel-LDA selects two variables, the peak at 1024 cm⁻¹ and the one at 2924 cm⁻¹, attributable to the C-O-C, C-C and C-O stretching in organic acids and carbohydrates and to the (a)symmetric stretching of CH2 and CH3 [33], respectively. The other relevant spectral intervals selected based on the values of the VIP scores are those between 3171 cm⁻¹ and 3429 cm⁻¹, ascribable to the O-H stretching, and between 1691 cm⁻¹ and 1757 cm⁻¹, probably related to the absorption of the C=O, C=C and the O-H vibrations in phenolic compounds, carotenes and organic acids [33]. For the NIR block, the bands identified as relevant by the VIP analysis of the SO-PLS-LDA model and by SO-CovSel-LDA are highly consistent. In fact, VIP selects the range of 4923-5294 cm⁻¹ while SO-CovSel-LDA selects the peak at 4337 cm⁻¹, probably indicating that moisture could be a suitable parameter to differentiate pure and adulterated Senise samples. Moreover, SO-CovSel-LDA selected the spectral variable at 4007 cm⁻¹, whereas VIP analysis highlighted as relevant the region between 4000 cm⁻¹ and 4800 cm⁻¹, which can be ascribed to sugars in bell peppers [6], suggesting that those compounds could also represent a marker for the identification of the adulteration of Senise bell peppers with paprika.

Conclusions
The present work aimed to develop a spectroscopy-based tool for the authentication of PGI Senise bell pepper, and for detecting its possible adulteration with common paprika. In order to achieve this goal, pure and adulterated Senise bell pepper samples were analyzed by MIR and NIR, and then spectra were jointly classified by means of two different multi-block approaches (SO-PLS-LDA and SO-CovSel-LDA). At the same time, classification models were also built on the individual blocks by means of SPORT, a recently proposed technique which exploits the possibility of combining multiple versions of the same data matrix, differently preprocessed, into a single, boosted model. Even if rather satisfactory results were obtained when using either the MIR (82.8% accuracy on the test set) or the NIR (94.3% accuracy on the test set) block, a perfect classification of all the training and validation samples could only be obtained when integrating the information from both spectral ranges through multi-block approaches.
In particular, in the case of SO-CovSel-LDA, only five variables (three from NIR and two from MIR) were necessary to achieve 100% accuracy. In general, the results have clearly shown that infrared spectroscopy coupled to chemometrics can represent a non-destructive and effective tool for authenticating ground Senise bell pepper, and for detecting its adulteration with common paprika.