Authentication of Rice ( Oryza sativa L. ) Using Near Infrared Spectroscopy Combined with Different Chemometric Classiﬁcation Strategies

: Rice is a staple food in Vietnam, and the concern about rice is much greater than that for other foods. Preventing fraud against this product has become increasingly important in order to protect producers and consumers from possible economic losses. The possible adulteration of this product is done by mixing, or even replacing, high-quality rice with cheaper rice. This highlights the need for analytical methodologies suitable for its authentication. Given this scenario, the present work aims at testing a rapid and non-destructive approach to detect adulterated rice samples. To fulﬁll this purpose, 200 rice samples (72 authentic and 128 adulterated samples) were analyzed by near infrared (NIR) spectroscopy coupled, with partial least squares-discriminant analysis (PLS-DA) and soft independent modeling of class analogies (SIMCA). The two approaches provided different results; while PLS-DA analysis was a suitable approach for the purpose of the work, SIMCA was unable to solve the investigated problem. The PLS-DA approach provided satisfactory results in discriminating authentic and adulterated samples (both 5% and 10% counterfeits). Focusing on authentic and 10%-adulterated samples, the accuracy of the approach was even better (with a total classiﬁcation rate of 82.6% and 82.4%, for authentic and adulterated samples, respectively).


Introduction
Rice (Oryza sativa L.) has increasingly become a hugely important staple food worldwide, and its consumption has surged as a result of population growth, changing food preferences, and urbanization [1]. Vietnam is among the top-five leading net exporters of rice and, while quantity is no longer the main problem, high quality is important to serve the high-end market segments. From 2015 to 2018, Vietnam's rice price increased at a fast pace, from 334 to 391 USD/ton, but then suddenly dropped to only 323 USD/ton in 1 year [2]. Not only the export market, but also the domestic market is facing the same problem: high volume but low value. Currently, it is estimated that more than 130 different types of rice brand exist on the market, and no brand accounts for more than 3% of the market. This means that Vietnam's rice market is considered as "fragmented", and only a few businesses have a clear vision for branding.
The benefit race is more intense than ever, so fraud is one way that some producers choose to get an advantage in this game. This can be done easily when high-quality rice can be mixed with, or even replaced by, cheaper rice. Rice products are particularly vulnerable, and become a "prime target" of these frauds, as an additional processing step is included, and adulteration cannot easily be detected by visual examination alone. Moreover, with Appl. Sci. 2021, 11, 362 2 of 11 low awareness, limited knowledge about rice, and a loose tracing system, consumers and even some trading companies cannot distinguish or identify counterfeit products. Consequently, a reliable and highly reproducible technique that can effectively authenticate blended products with excellent sensitivity and accuracy is urgently needed.

Techniques Used in Rice Authentication
Given the importance rice has on a global level, in the literature it is possible to find several studies aimed at the analysis and characterization of this agro-food.
Nevertheless, due to the market value of this product, non-destructive approaches are generally preferred. Hence, imaging analysis has been proposed for the characterization and quality assessment of rice [3], as, for instance, has been done by Kuo et al. [14], where the images collected by a digital camera were elaborated and used to classify 30 different varieties of rice. In addition, spectroscopy represents a suitable non-destructive (or semi-destructive) solution. For instance, 1H-NMR has been used for discriminating rice according to its geographical origin [15], and lased-induced fluorescence to classify according to the variety [16], whereas terahertz spectroscopy imaging has been exploited to detect transgenic rice [17].
Despite the efficiency of traditional approaches, they present a common drawback: the preparation of samples can be time-consuming, as well as economically and environmentally disadvantageous. In this context, near infrared (NIR) spectroscopy has been widely used in the measurement of agricultural products due to its many advantages, such as being easy-to-use, non-destructive, fast and accurate, providing highly reproducible results, requiring minimum or, often, no sample preparation, and allowing the analysis of multiple constituents with a single measurement [18][19][20][21]. Direct spectroscopic measurements have been widely applied for several foods and commodities, especially in the grain, cereal products, and oilseed processing industries [22][23][24][25][26][27][28][29][30]; additionally, they have been successfully applied for the classification of rice [24][25][26][27][28]. Moreover, in the framework of the evaluation of rice quality, NIR spectroscopy has been used for the discrimination of rice [21,29]; the classification of varieties and detection of transgenic rice [31]; the quantification of various physico-chemical properties (such as moisture content, sound whole kernel, whiteness, translucency, color, and amylogram characteristics) [32]; classification of cultivars [33], prediction of protein and amylase content [34,35]; detection of wax rice [36]; and prediction of eating quality [37].
In this regard, the present paper aims at developing a non-destructive analytical methodology for detecting rice adulteration using a hand-held NIR spectrometer. Classification models were built using two different chemometric techniques, which often coupled with NIR spectroscopy, in particular for the analysis of plants and agro-food products [38,39]: partial least squares discriminant analysis (PLS-DA) and soft independent modelling of class analogy (SIMCA).

Spectral Acquisition
The spectrum of each rice sample was collected using a hand-held spectrometer SCIO™ (Consumer Physics, Inc., Saint Cloud, MN), with a spectral range between 740 nm and 1070 nm at a 1-nm resolution. Samples (50 g) were collected in glass containers, and scanned three times after rotating the cup. The whole process was carried out at ambient temperature.

Rice Data Set
Rice samples were collected across different seasons from local millers, and from four regions in Vietnam. Authentic samples were high-quality rice (namely J85), while Appl. Sci. 2021, 11, 362 3 of 11 adulterated samples were obtained by the addition of two different percentages (5% and 10%, given in w/w%) of lower quality rice (namely DT8).
Jasmine 85 (J85) is an aromatic rice variety developed in Thailand by the International Rice Research Institute from a cross between IR262-43-8-11 and KHAO DAWK MALI 4-2-105. J85 was imported to Vietnam in 1993 by the Cuu Long Delta Rice Research Institute, and mostly cultivated in 13 provinces in the Mekong Delta. This variety has many outstanding characteristics, such as a good-looking grain, and amylose content (around 18-20%). When cooked, the rice has a medium grain cooking quality, soft texture, and is cohesive.
Dai Thom 8 (DT8) is a rice variety cultivated in different places, from the Mekong Delta to the Central Highlands and Central Coast. This variety produces long, clear, silverbrown rice, with an incredibly low amylose content (around 16.29%). When cooked, the rice has a light, soft, fragrant aroma.
Authentic samples (J85) were collected from 4 provinces in the Mekong Delta: Can Tho, An Giang, Dong Thap, and Hau Giang from 2 factories during 3 seasons (Monsoon, Summer-Autumn, Winter-Spring), for a total of 24 lots. Each lot was further sampled three times; therefore, there were 72 high-quality rice samples (Authentic rice).
DT8 samples were obtained from the same provinces and factories as the authentic ones, and were collected across the same three seasons. For the preparation of the adulterated samples, each J85 rice resulting from the combination of a particular region and factory was cross-mixed with any of the region/factory combinations of DT8 rice at both 5% and 10% ratios. Despite the season not being explicitly considered as an additional combinatorial factor, care was taken that all the three seasons were fairly represented in the adulterated samples. Accordingly, 64 adulterated samples at 5% addition, and 64 adulterated samples at 10% addition, were available for the analysis.
As the main purpose was to discriminate high-quality rice from adulterated rice, the two groups with 5% and 10% addition were collected together into a single class of adulterated samples; consequently, 128 adulterated samples (Adulterated rice) were used for this study. Prior to the creation of the classification models, replicated measurements were averaged, obtaining a data matrix of dimensions 200 × 331.
The spectra were split by the Duplex algorithm [40] into a training set, used for the calibration of the models, and a test set, which was only used for their external validation. The most appropriate data-preprocessing approach, and the number of latent variables (LVs) to be extracted, were chosen in a 5-fold cross-validation procedure on the calibration samples only.

Spectral Preprocessing Techniques
Different signal preprocessing approaches were applied to the profiles [41]: the first and second derivative [42], and the standard normale variate (SNV) [43]. This step was necessary to remove the non-informative variability present in the spectra. First and the second derivatives were calculated according to the Savitzky-Golay (SG) approach [42] using a 19 points window and a 2nd or 3rd order polynomial. Each spectrum was mean centered (MC) prior to the creation of the models, to remove the variability due to the offsets differences identifiable with the average trend of the data, and this step was performed either as the only pretreatment step or, in all the other cases, after the application of other preprocessing methods.

Partial Least Squares Discriminant Analysis (PLS-DA)
Partial least squares discriminant analysis (PLS-DA) [44][45][46][47][48] is a discriminant classifier, and is particularly appropriate for handling correlated features (e.g., spectroscopic variables). Briefly, PLS-DA is based on expressing the classification problem as a regression one, which can be solved by PLS [49][50][51]. More specifically, a predictor block is used to estimate (by PLS) a binary response called dummy Y (a binary response matrix encoding the class-belonging).
In a two-class problem, as in the present study, Y is a dummy vector (y) presenting either 1 or 0 in the rows corresponding to objects belonging to Class 1 (Authentic rice) and Class 2 (Adulterated rice), respectively. Mathematically, the regression relation between the data matrix X and the dummy vector y for a two-class case can be represented by the model in Equation (1) y =ŷ + e = Xb + e (1) whereŷ, b, and e are the vectors of predicted responses, regression coefficients, and residuals, respectively. After solving the regression problem in Equation (1) by PLS, a further step is needed to achieve classification, since the predicted responses,ŷ, are not binary, but continuous values: accordingly, a suitable classification rule has to be built based on the values of y. One possibility is to use the predicted responses,ŷ, (or the PLS scores) as input for classification methods such as linear discriminant analysis (LDA) or quadratic discriminant analysis (QDA) [52]. In the present study, the threshold was calculated based on the probabilistic approach proposed by Perez and colleagues [53].
When new samples (test set) need to be classified, their predicted responses,ŷ new , are calculated based on the measurements, X new , and the regression coefficients, b, estimated on the training set, and the classification rule described above is then applied to assign each individual to one of the categories under study.

Soft Independent Modelling of Class Analogies (SIMCA)
SIMCA [54] is a class-modeling approach, meaning that, in defining the class boundaries, the method focuses on the similarities among samples from the same category [55]. In SIMCA, unlike PLS-DA, since every class is individually modeled (i.e., region-boundaries are defined individually for each class and independently from the others), it can happen that an object can be assigned to more than one class (being "confused") or excluded from every category. SIMCA results are presented in terms of "sensitivity" and "specificity", where the former indicates the percentage of samples truly belonging to the category correctly accepted by the class model, while the latter expresses the percentage of the objects from other classes which have been correctly rejected.
SIMCA starts from a principal component analysis (PCA) of only the training objects belonging to the category to be modeled, in order to "capture" the systematic variability ascribable to the similarities among samples of the same class [56,57]. Once the PCA is calculated, objects are accepted or refused by the class-model according to their reduced distance from the class space, indicated as d.
For a generic i th sample, the d value is calculated by Equation (2).
where T 2 is the Mahalanobis distance of the sample from the center of the class space and Q is its orthogonal distance from the PC subspace. These values are divided by T 2 0.95 and Q 0.95 , which are the 95th percentiles of the T 2 and Q distributions, obtaining the reduced T 2 (T 2 red ) and the reduced Q (Q red ), respectively [56]. Due to the normalization, T 2 and Q limit values are equal to 1; a sample will then be accepted by the class model if d ≤ √ 2, otherwise it is rejected.

Software Details
Calculations were run in MATLAB 2015b (The Mathworks Inc., Natick, MA, USA) using in-house functions. In particular, PLS-DA models were built using the routines freely available for download at: https://www.chem.uniroma1.it/romechemometrics/research/ algorithms/plsda/.

Results
Prior to the creation of any classification models, the spectra of the analytical replicates were averaged, obtaining 200 NIR signals. Furthermore, in order to allow the external validation of the models, samples were divided into a training set of 140 objects (49 authentic and 91 adulterated), and a test set of 60 individuals (23 authentic and 37 adulterated) by the Duplex algorithm.
The NIR spectra of all the analyzed samples are displayed in Figure 1a, while, in Figure 1b, the average profile for the authentic and adulterated samples are plotted in red and blue, respectively. limit values are equal to 1; a sample will then be accepted by the class model if ≤ √2, otherwise it is rejected.

Software Details
Calculations were run in MATLAB 2015b (The Mathworks Inc., Natick, MA, USA) using in-house functions. In particular, PLS-DA models were built using the routines freely available for download at: https://www.chem.uniroma1.it/romechemometrics/research/algorithms/plsda/.

Results
Prior to the creation of any classification models, the spectra of the analytical replicates were averaged, obtaining 200 NIR signals. Furthermore, in order to allow the external validation of the models, samples were divided into a training set of 140 objects (49 authentic and 91 adulterated), and a test set of 60 individuals (23 authentic and 37 adulterated) by the Duplex algorithm.
The NIR spectra of all the analyzed samples are displayed in Figure 1a, while, in Figure 1b, the average profile for the authentic and adulterated samples are plotted in red and blue, respectively. Irrespective of the classifier used, NIR signals were pretreated by different preprocessing approaches: besides considering raw data, the first derivative (Savitzky-Golay approach, 15 points window, 2nd order polynomial), second derivative (Savitzky-Golay approach, 15 points window, 3rd order polynomial), and standard normal variate (SNV) were also tested. Prior to the creation of any calibration model, data were further meancentered.
The most suitable preprocessing approach, together with the optimal complexity (number of LVs or PCs to be extracted) of any classification model, were defined based on a cross-validation procedure (5 cancellation groups). In particular, PLS-DA selection was based on the combination of pre-processing and model complexity leading to the lowest mean classification error, whereas for SIMCA the maximum efficiency (geometric mean of sensitivity and specificity) was sought.

PLS-DA Analysis
As mentioned above, four different PLS-DA models were calculated on the training samples, one per each tested pre-treatment. Cross-validated classification rates, together with the number of latent variables (LVs) used for the creation of the models, are reported in Table 1. Irrespective of the classifier used, NIR signals were pretreated by different preprocessing approaches: besides considering raw data, the first derivative (Savitzky-Golay approach, 15 points window, 2nd order polynomial), second derivative (Savitzky-Golay approach, 15 points window, 3rd order polynomial), and standard normal variate (SNV) were also tested. Prior to the creation of any calibration model, data were further mean-centered.
The most suitable preprocessing approach, together with the optimal complexity (number of LVs or PCs to be extracted) of any classification model, were defined based on a cross-validation procedure (5 cancellation groups). In particular, PLS-DA selection was based on the combination of pre-processing and model complexity leading to the lowest mean classification error, whereas for SIMCA the maximum efficiency (geometric mean of sensitivity and specificity) was sought.

PLS-DA Analysis
As mentioned above, four different PLS-DA models were calculated on the training samples, one per each tested pre-treatment. Cross-validated classification rates, together with the number of latent variables (LVs) used for the creation of the models, are reported in Table 1. Inspecting the outcome of the PLS-DA analysis, the first derivative was considered the most suitable preprocessing approach for this data, because the model built on data pretreated by this preprocessing led to the most accurate predictions.
The application of this model to the test set led to a correct classification rate of 63.3%, corresponding to 10 authentic and 12 adulterated samples erroneously assigned. It has to be noted that, among the 12 misclassified adulterated samples, 10 are those with the lowest extent of adulteration (5%).
Consequently, the counterfeited samples with 5% adulteration were removed from the analysis and new PLS-DA models were calculated on the remaining samples. In this way, the training set consisted of 96 objects (49 authentic and 47 adulterated), and the test set of 40 individuals (23 authentic and 17 adulterated). Data were preprocessed as described before, and PLS-DA models were built and cross-validated; the corresponding results are reported in Table 2. In this case the best calibration model was the one built on data preprocessed by SNV. Its application to the test set led to the correct classification of 82.6% of the authentic samples, and 82.4% of the adulterated objects, corresponding to four (over 23) and three (over 17) misclassified samples, respectively.
Eventually, in order to investigate which spectral variables most influenced the PLS-DA model, the values of the variable importance in projection (VIP) [53] for the individual predictors were inspected, and 82 NIR wavelengths were identified as significantly contributing (based on the greater than one criterion). A graphical representation of the selected bands is reported in Figure 2 Additionally, a further PLS-DA model was calculated on the data set reduced in agreement with the VIP analysis, but the feature selection did not lead to improvements from a prediction point of view.

SIMCA Analysis
The distinction of authentic and counterfeit products can be associated with the "asymmetric classification problems", i.e., all those cases where the interest is on a specific category (authentic) rather than on the other class under examination. These kinds of problems are often solved by means of class-modeling approaches, which focus on identifying the (usually bound) portion of the multivariate space where it is more likely to find individuals from a specific category. For this reason, despite the relatively accurate results obtained by PLS-DA analysis, SIMCA was also used in order to model the authentic class of samples.
The same pre-treatments investigated in the case of PLS-DA analysis were tested as well in the SIMCA context; nevertheless, different figures of merit were used for assessing the optimal preprocessing approach and the number of principal components (PCs) to be extracted. In particular, the outcome of the SIMCA models was reported in terms of sensitivity (the percentage of the authentic samples correctly accepted by the model), specificity (the percentage of the adulterated samples correctly rejected by the model), and efficiency (the geometric average of sensitivity and specificity). The optimal model parameters (pre-treatments and number of PCs) were defined on the basis of the efficiency values resulting from a cross-validation procedure with seven cancelation groups. Similarly to above, four different models were created, one for each pre-treatment; the results are reported in Table 3.  In the plot, variables with a VIP higher than one are highlighted in red, over the mean spectrum of the samples (in black). It can be seen that the spectral ranges contributing the most to the discrimination are those between 1040 and 1070 nm, corresponding to the second overtones of N-H stretching, and O-H stretching; those between 950 to 1000 nm, also corresponding to the second overtone of N-H stretching and of O-H stretching; and some variables between 900 and 950 nm, corresponding to the third overtone of C-H stretching [58,59].
Additionally, a further PLS-DA model was calculated on the data set reduced in agreement with the VIP analysis, but the feature selection did not lead to improvements from a prediction point of view.

SIMCA Analysis
The distinction of authentic and counterfeit products can be associated with the "asymmetric classification problems", i.e., all those cases where the interest is on a specific category (authentic) rather than on the other class under examination. These kinds of problems are often solved by means of class-modeling approaches, which focus on identifying the (usually bound) portion of the multivariate space where it is more likely to find individuals from a specific category. For this reason, despite the relatively accurate results obtained by PLS-DA analysis, SIMCA was also used in order to model the authentic class of samples.
The same pre-treatments investigated in the case of PLS-DA analysis were tested as well in the SIMCA context; nevertheless, different figures of merit were used for assessing the optimal preprocessing approach and the number of principal components (PCs) to be extracted. In particular, the outcome of the SIMCA models was reported in terms of sensitivity (the percentage of the authentic samples correctly accepted by the model), specificity (the percentage of the adulterated samples correctly rejected by the model), and efficiency (the geometric average of sensitivity and specificity). The optimal model parameters (pre-treatments and number of PCs) were defined on the basis of the efficiency values resulting from a cross-validation procedure with seven cancelation groups. Similarly to above, four different models were created, one for each pre-treatment; the results are reported in Table 3. Inspecting Table 3, it is straightforward to see that the models provide very high sensitivity, but, on the contrary, their specificity is definitely low, indicating that adulterated samples were accepted by the model of the authentic class. This leads to generally low efficiencies: the highest one was 32.8%, and it was provided by the model built on NIR signals preprocessed by the second derivative. As expected, the application of this model to the test set was not completely satisfactory; in fact, it led to a sensitivity of 98.0% and a specificity of 11.0%. A possible explanation for these results can be identified in the high heterogeneity of the modeled category (the authentic samples); in order to account for the wide range of variability spanned by the class (different factories, different regions, and different seasons) and to obtain a high sensitivity, the model space needs to be expanded to the point that a high amount of samples of the alternative class is accepted. Altogether, these results indicate that the tested strategy is not suitable for the investigated classification problem.

Conclusions
The present study represents a proof-of-concept study to investigate the possibility of coupling NIR spectroscopy and chemometric classifiers with the aim of detecting adulterated rice samples. In order to achieve this goal, two different strategies were exploited; one, based on a discriminant classifier (PLS-DA), and one employing a class-modelling technique (SIMCA). The two approaches provided different results; in particular, SIMCA appeared unable to solve the investigated problem. On the other hand, PLS-DA analysis is a suitable approach for the purpose of the work. In fact, while this approach provided not completely satisfactory results in discriminating authentic and adulterated samples, when considering both 5% and 10% counterfeits (with a total correct classification rate of 63.3%), on the other hand, when the analysis was restricted to the 10%-adulterated samples only, a very good accuracy was achieved. In fact, 82.6% of the authentic samples and 82.4% of the adulterated objects were correctly classified (corresponding to 4 authentic and 3 adulterated misclassified samples). These results indicate that the high within-class variability (due to the different origins of the samples in terms of factory, region, and season) can have an impact on the possibility of detecting low levels of adulteration; at the same time, they also suggest that the proposed approach could be useful for detecting samples adulterated at 10% or more.
In conclusion, we assert that the present preliminary study demonstrates that the combination of NIR spectroscopy and PLS-DA can represent an effective, rapid and nondestructive tool for the determination of adulteration in jasmine rice.