## 1. Introduction

With the rising prices of cooking oil, unscrupulous traders and suppliers may resort to unethical practices, such as blending low-value cooking oils into high-value ones [1]. Consumers cannot detect these low-value, inexpensive ingredients in cooking oils and therefore overpay for adulterated products. Extra virgin olive oil (EVOO), native to the Mediterranean area, is known as “the gold of liquids”, “the queen of plant oils”, and “the Mediterranean nectar”, and has become an established favorite among Chinese consumers [2]. EVOO consumption has increased in recent years; however, because of the demanding conditions of EVOO production, supply cannot keep pace with the growing consumer demand in China. EVOO adulteration has therefore spread in the Chinese market. Adulteration not only causes confusion in the edible oil market but also violates the rights of consumers. A fast and effective analytical method for detecting EVOO adulteration is thus required to support government regulation.

Fourier transform near-infrared (FT-NIR) molecular spectroscopy is a technique widely applied in food quality analysis [3,4,5,6] that can provide abundant information about the chemical composition and molecular structure of various food substances. In addition, this technology is non-destructive, fast, and low-cost, with good reproducibility and broad application prospects. Recently, FT-NIR spectroscopy has been extensively used in quality and safety analysis of EVOO [7,8,9]. Other molecular spectroscopy techniques, such as fluorescence spectroscopy [10,11,12], infrared spectroscopy [13,14,15], Raman spectroscopy [16,17,18], and nuclear magnetic resonance spectroscopy [19,20], have also been successfully applied to the analysis of EVOO adulteration. With technological developments and improvements in instrument resolution, the amount of spectral data acquired has become increasingly large. Therefore, the selection of characteristic spectral wavenumbers plays an important role in spectral model development. Moreover, a growing body of research has shown that selecting characteristic wavenumbers during multivariate model calibration can not only improve the prediction performance of a chemometric model but also enhance its interpretability [21,22,23,24].

Partial least-squares (PLS) regression is a statistical method related to principal component regression (PCR) that searches for a linear regression model by projecting the predictor and response variables into a new space [25]. Owing to its suitability for variable selection, many PLS-based feature selection algorithms have been developed [26], for example, the variable importance in projection (VIP) score [27], the successive projections algorithm (SPA) [28], the uninformative variable elimination (UVE) algorithm [29], and the selectivity ratio (SR) [30]. These methods were developed on the basis of criteria involving variable weights or regression coefficients. Additionally, several feature wavenumber selection methods based on model population analysis (MPA) strategies have been developed [31], for instance, iteratively retaining informative variables (IRIV) [32], the variable iterative space shrinkage approach (VISSA) [33,34], variable combination population analysis (VCPA) [35], and bootstrapping soft shrinkage (BOSS) [36]. Compared with IRIV, VISSA, and VCPA, an important feature of the BOSS algorithm is the introduction of weighted bootstrap sampling (WBS), a criterion the other three algorithms do not consider. Furthermore, unlike other bootstrap-based algorithms, which perform bootstrap sampling in the sample space, BOSS performs it in the variable space. Thus, in this study, the BOSS algorithm was applied to wavenumber selection on the spectral data of adulterated EVOO samples.

The aim of this study was to verify the feasibility of establishing an improved and reliable reduced spectral model that can directly and quantitatively determine the doping content of EVOO from its spectra. The feature wavenumbers were first selected by the BOSS algorithm, and a detection model based on PLS regression was built using the selected wavenumbers. Finally, the performance of the reduced BOSS–PLS model was compared with that of three other commonly used reduced models, i.e., competitive adaptive reweighted sampling PLS (CARS–PLS), Monte Carlo uninformative variable elimination PLS (MCUVE–PLS), and iteratively retaining informative variables PLS (IRIV–PLS).

## 3. Discussion

In order to show the advantages of the BOSS algorithm in terms of wavenumber selection, it was compared with three other high-performance wavenumber selection approaches, i.e., CARS, MCUVE, and IRIV. The best results of the PLS models based on the variables selected by the different algorithms are shown in Table 1. The results in Table 1 show that the prediction accuracy of the PLS model could be improved by all four wavenumber selection algorithms with respect to the full-spectrum PLS model. Moreover, compared with the CARS–PLS, MCUVE–PLS, and IRIV–PLS models, the BOSS–PLS model achieved better results not only in the calibration process but also in the validation process. The most likely reason is that the BOSS algorithm combines the strategies of soft shrinkage, MPA, and WBS and makes full use of the regression coefficient information.

In addition, the BOSS algorithm adopts a soft shrinkage strategy to select informative variables. Whereas variable selection methods based on a hard shrinkage strategy, such as CARS and MCUVE, delete less informative wavenumbers outright, the soft shrinkage strategy assigns smaller weights to wavenumbers carrying less information. These wavenumbers can still participate in the construction of sub-models and be re-evaluated in the next iteration. Thus, the advantage of the soft shrinkage strategy is that it reduces the risk of removing characteristic variables during the iterations and allows the selection of optimal variable subsets with better prediction ability.

The best variable subset is finally obtained by the BOSS algorithm on the basis of the MPA criteria combined with those of the WBS. Concretely, the sub-models are obtained according to the weight of each variable. The weight of each wavenumber is determined from the regression coefficients of multiple PLS sub-models using the MPA strategy, rather than from a single full-spectrum model. The WBS strategy is then used to update the wavenumber weights stepwise so that the variable space can be compressed more effectively. Thus, the BOSS algorithm considers all possible combinations of the selected wavenumbers, which is reasonable because the optimal size of the variable subset is unknown both before and during wavenumber selection.

## 4. Materials and Methods

#### 4.1. Sample Preparation and Division

In this study, extra virgin olive oil, peanut oil, sunflower seed oil, soybean oil, sesame oil, and corn oil were purchased from local supermarkets. In the experiments, peanut oil, sunflower seed oil, soybean oil, sesame oil, and corn oil were used as adulterating oils and were added separately to the EVOO to prepare the samples to be tested. That is, each adulterated oil sample contained only two kinds of edible oil, namely, the EVOO and one of the adulterating oils. The specific preparation process is reported below.

The adulterated oil samples were prepared using the EVOO and one of the adulterating oils. The volume fraction of each adulterating oil ranged from 2.5% to 50% v/v, in increments of 2.5% v/v, i.e., 20 concentration levels per adulterating oil. With five adulterating oils, 100 samples in total were obtained.

In this study, the 100 samples were divided into two subsets. One was the calibration set, which was used to construct the prediction model; the other was the validation set, which was used to verify the accuracy and generalization performance of the model. To meet the statistical requirements, three samples at each doping concentration were randomly selected and placed in the calibration set during sample division. Thus, there were 60 samples in the calibration set and 40 samples in the validation set. Because the adulterated samples obtained in this study contained only two kinds of edible oil, the calibration model established here can only be used to quantitatively detect a single adulterating oil mixed with EVOO.
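The sample design and split described above can be sketched as follows (a minimal illustration; the oil labels and random seed are placeholders, and the reproducibility of the 60/40 counts is what the sketch demonstrates):

```python
import numpy as np

rng = np.random.default_rng(42)

oils = ["peanut", "sunflower", "soybean", "sesame", "corn"]
levels = np.arange(2.5, 50.0 + 1e-9, 2.5)  # 20 doping levels, % v/v

# One sample per (oil, level) pair -> 100 samples, as in this study
samples = [(oil, lvl) for oil in oils for lvl in levels]

# At each doping level, randomly send 3 of the 5 samples to the
# calibration set and the remaining 2 to the validation set
calibration, validation = [], []
for lvl in levels:
    group = [s for s in samples if s[1] == lvl]
    picked = rng.choice(len(group), size=3, replace=False)
    for i, s in enumerate(group):
        (calibration if i in picked else validation).append(s)
```

Stratifying the random draw by doping level, as above, guarantees that every concentration is represented in both subsets.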

#### 4.2. FT-NIR Spectra Acquisition

In this study, the NIR spectra of the adulterated samples were collected in transmission mode using an Antaris II NIR spectrophotometer (Thermo Scientific Co., Waltham, MA, USA). The number of scans was set to 32, and the spectral resolution was set to 4 cm^{−1}. The spectral range was set from 10,000 cm^{−1} to 4000 cm^{−1}. Thus, the original spectrum of each adulterated sample contained 1557 wavenumbers (i.e., 1557 wavelength variables). The absorbance data were stored as log(1/T), T being the transmittance.

For spectral collection, each adulterated sample was first placed in a cuvette with a diameter of 6.0 mm and then in the sampling chamber of the spectrometer. The spectra of each sample were collected three times, and the mean of the three measured spectra was taken as the original NIR spectrum of the sample. During spectral collection, the laboratory temperature was maintained at 25 °C.

#### 4.3. Spectra Preprocessing

Figure 4a shows the raw FT-NIR spectra of all collected samples. As can be seen from Figure 4a, the spectra obtained contained not only useful sample information but also a certain amount of noise; overflow even occurred at some wavenumbers. In order to eliminate the influence of these adverse factors, it was necessary to preprocess the spectra appropriately before multivariate model calibration. Standard normal variate (SNV) transformation, which can be used to eliminate both the baseline drift and the overflow phenomenon of diffuse reflectance spectra, is mainly used to remove the influence of surface scattering and optical path changes on diffuse reflectance spectra. Therefore, in this study, the SNV method was adopted to pretreat the spectra, and the FT-NIR spectra after SNV preprocessing are presented in Figure 4b.
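SNV operates on each spectrum individually: the mean absorbance of the spectrum is subtracted from every point, and the result is divided by the spectrum's standard deviation. A minimal sketch (the spectral matrix here is simulated, not the study's data):

```python
import numpy as np

def snv(spectra: np.ndarray) -> np.ndarray:
    """Standard normal variate: center and scale each spectrum (row)."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

# Example: 100 simulated spectra with 1557 wavenumbers each
rng = np.random.default_rng(0)
X = rng.normal(loc=0.5, scale=0.1, size=(100, 1557))
X_snv = snv(X)
# After SNV, each row has mean ~0 and unit standard deviation
```

Because the correction is computed per spectrum, SNV needs no reference measurement, which is why it is a common first step before multivariate calibration.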

#### 4.4. Data Analyses Methods

The BOSS algorithm applied here, which can be used to select characteristic variables in the presence of collinearity, was described by Deng et al. [36]. The BOSS algorithm is based on a favorable soft shrinkage criterion and utilizes the information of the regression coefficients instead of the traditional hard shrinkage strategy. Based on the bootstrap sampling (BSS) [37] and WBS [38] techniques, the algorithm generates random combinations of wavenumbers and establishes the corresponding sub-models. The MPA is then applied to extract informative variable subsets from the sub-models developed on the basis of PLS regression. The specific steps of the BOSS algorithm are as follows:

In the process of spectral data analysis, suppose the spectral data matrix is X, of size N × P, which contains N samples and P wavenumbers, and let Y, of size N × 1, be the vector of reference measurements.

Step 1: K subsets were generated in the variable space by BSS. Because sampling is performed with replacement, duplicated variables in each subset were retained only once. In this step, all wavenumbers were treated equally, so that each had the same probability of being selected into a variable subset; that is, each variable had the same weight (w).

Step 2: K PLS sub-models were first developed using the data from the selected subsets. Then, the root mean square error of cross-validation (RMSECV) of each sub-model was calculated, the sub-models were sorted by RMSECV from smallest to largest, and the sub-models ranked in the top 10% were extracted.

Step 3: The regression coefficients of each extracted sub-model were calculated. Each regression vector was normalized to unit length, and its elements were replaced by their absolute values. The new weight of each selected variable was then obtained according to the following summation formula:

${w}_{i}={\sum }_{k=1}^{K}{b}_{i,k}$

where K represents the number of extracted sub-models, and ${b}_{i,k}$ is the absolute value of the normalized regression coefficient of the ith wavenumber in the kth sub-model.

Step 4: The WBS was used to generate new subsets on the basis of the new weight of each selected variable, and the number of wavenumbers drawn in the WBS was set according to the average number of wavenumbers selected in the previous step.

Step 5: Steps 2 to 4 were repeated until the number of wavenumbers in the renewed variable subset equaled one, and the variable subset with the lowest RMSECV over all iterations was finally selected as the best variable subset.

#### 4.5. Model Evaluation

The prediction and generalization performances of the models were examined by five-fold cross-validation and by an independent validation set. The values of the RMSECV, the root mean square error of prediction (RMSEP), and the coefficient of determination (R^{2}) were used as measures for model performance evaluation. RMSECV, RMSEP, and R^{2} are given by the expressions

$\mathrm{RMSECV}=\sqrt{\frac{1}{n}{\sum }_{i=1}^{n}{\left({y}_{i}-\widehat{{y}_{\backslash i}}\right)}^{2}}$

$\mathrm{RMSEP}=\sqrt{\frac{1}{n}{\sum }_{i=1}^{n}{\left({y}_{i}-\widehat{{y}_{i}}\right)}^{2}}$

${R}^{2}=1-\frac{{\sum }_{i=1}^{n}{\left({y}_{i}-\widehat{{y}_{i}}\right)}^{2}}{{\sum }_{i=1}^{n}{\left({y}_{i}-\overline{y}\right)}^{2}}$

For RMSECV, n is the number of samples in the calibration set, ${y}_{i}$ is the reference measurement value of the ith sample, and $\widehat{{y}_{\backslash i}}$ is the value of the ith sample estimated by the model constructed with the ith sample removed. For RMSEP, n is the number of samples in the validation set, ${y}_{i}$ is the reference measurement value of the ith sample in the validation set, and $\widehat{{y}_{i}}$ is the estimated value of the ith sample in the validation set. For R^{2}, n is the number of samples, ${y}_{i}$ is the reference measurement value of the ith sample, $\widehat{{y}_{i}}$ is the estimated value of the ith sample, and $\overline{y}$ is the mean of the reference values of all samples.
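As a sketch, these measures can be computed as follows. RMSECV and RMSEP share the same root-mean-square formula and differ only in which predictions are supplied (cross-validated versus independent-set); the numeric values below are placeholders, not the study's results:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error, shared by RMSECV and RMSEP."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Placeholder reference and predicted doping levels (% v/v)
y_ref = np.array([2.5, 5.0, 7.5, 10.0, 12.5])
y_hat = np.array([2.7, 4.8, 7.9, 9.6, 12.4])
```

A perfect model gives an RMSE of 0 and an R^{2} of 1; R^{2} falls toward 0 as the residual variance approaches the total variance of the reference values.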