The methodology used to correlate the near-infrared (NIR) spectra of EVOO with its polyphenol content was structured in five stages. First, the sampling point in the production process was selected so that the turbidity of the EVOO samples fell within an acceptable range. Second, the spectra were acquired with an experimental setup built ad hoc; as will be shown, several acquisition parameters were varied at this stage. Third, the raw spectral values were processed with different pretreatment algorithms to remove undesired effects from the NIR spectra. Fourth, the spectra were filtered to select the wavelengths related to the compound of interest. Finally, a regression model was built to correlate the absorbance values with the concentration of total polyphenols.

Figure 1 shows the aforementioned workflow and the next sections explain the details.

#### 3.1. Extra Virgin Olive Oil Samples

Different olive oil batches were sampled at the olive oil factory, at the exit of the vertical centrifuge (Figure 2) and before the filtering step. At this point, the microdroplets of water and the solid particles have already been removed from the olive oil. The samples were produced using different values of the process variables, such as malaxing temperature, malaxing time, and water addition before the oil extraction, which resulted in olive oils with different contents of minor components. The initial set was composed of 11 samples with the following polyphenol concentrations in mg/kg: (1017, 1132, 1247, 1362, 1478, 1593, 1708, 1823, 1939, 2054, 2169). These samples were then blended to increase the dataset to 21 olive oil samples. Each blend was prepared by mixing equal quantities of the two samples with consecutive concentrations. The final dataset was composed of the following concentrations in mg/kg: (1017, 1074, 1132, 1189, 1247, 1305, 1362, 1420, 1478, 1535, 1593, 1650, 1708, 1766, 1823, 1881, 1939, 1996, 2054, 2112, 2169). The olive oil was monovarietal, obtained from olives of the Picual variety. The samples were stored at 6 °C in darkness until their chemical and spectral analysis.
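As an illustration of how the 21-sample dataset follows from the 11 original concentrations, the short sketch below interleaves each pair of neighbouring samples with their equal-parts blend. It assumes that the concentration of a 1:1 blend is the arithmetic mean of its two components; the function name `blend_consecutive` is hypothetical. The midpoints agree with the reported values to within 1 mg/kg, a difference that is presumably due to rounding of the original measurements.

```python
# Polyphenol concentrations (mg/kg) of the 11 original samples, as listed in the text.
base = [1017, 1132, 1247, 1362, 1478, 1593, 1708, 1823, 1939, 2054, 2169]


def blend_consecutive(concentrations):
    """Interleave each pair of neighbouring samples with their equal-parts blend."""
    out = []
    for low, high in zip(concentrations, concentrations[1:]):
        out.append(low)
        # Assumption: a 1:1 blend has the mean concentration of its components.
        out.append((low + high) / 2)
    out.append(concentrations[-1])
    return out


dataset = blend_consecutive(base)
print(len(dataset))  # 11 originals + 10 blends
```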

This study was carried out in collaboration with the Department of Food Biotechnology of the Spanish National Research Council (CSIC) (www.ig.csic.es), which assessed the total polyphenol content of the collected samples. Briefly, the standardized method was as follows: 0.6 g of olive oil was extracted using 3 × 0.6 mL of dimethylformamide (DMF); the extract was then washed with hexane, and N_{2} was bubbled into the DMF extract to eliminate residual hexane. Finally, the extract was filtered through a 0.22 µm pore-size filter and injected into the chromatograph. The method is described in more detail in [27].

#### 3.3. Spectra Acquisition Procedure

First, the transflectance spectra were acquired for each sample with the aforementioned probe. This process was carried out in the laboratory of the Robotic, Automation and Computer Vision research group at the University of Jaén, at room temperature (20 ± 0.5 °C). The optical path length was constant and equal to 5 mm. The transflectance methodology can be briefly explained by following the path of the light. The light travels from the halogen lamp through the optical fiber to the tip of the probe, traverses the oil, and is reflected by a specular surface. The reflected light then returns through the optical fiber until it arrives at the near-infrared sensor.

Ten captures were taken for each sample, and each acquisition was saved independently. The transflectance spectra were then arranged in matrix form (Equation (1)), where the rows (I) correspond to the samples and the columns (Υ) to the light power transmitted by the sample at each wavelength. In our case, I was 21 and Υ was 921.

The power received by the NIR device (P_{r}) follows the classical energy balance: it depends on the power transmitted by the halogen lamp (P_{t}), the light power absorbed by the olive oil sample (P_{abs}), and the losses (L) caused by hardware components, such as connectors or multiplexers in a multipoint system. The energy balance for each wavelength (υ) is given in Equation (2).

One of the aims of this work is to present the effect of the halogen light power on the prediction results. Therefore, three P_{r}(υ) targets were fixed for the acquisitions: 200 ut, 300 ut and 400 ut, where ut are units of transflectance, i.e., the raw value delivered by the analogue-to-digital converter of the NIR device. The reference wavelength υ was set to 1600 nm, because the olive oil transflectance spectrum reaches its maximum there (Figure 3). Three groups of spectra (S1, S2, and S3) were then built, and the output light power was adjusted for each group in order to reach the targeted P_{r}(1600 nm) value (Table 1).

#### 3.4. Spectra Preprocessing

First, the raw spectra were transformed from transflectance values to absorbance values. Absorbance spectra show the quantity of light that is absorbed at each wavelength, which is useful when the purpose is to build a calibration model [29].

Therefore, Equation (3) was applied to the whole dataset, where X_{abs} are the absorbance spectra and X_{trans} are the transflectance spectra. The sensitivity of the NIR sensor is not the same at all wavelengths, and there are spectral bands where the signal-to-noise ratio is poor due to the low gain (see Figure 3, from 2300 nm to 2600 nm).
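The transflectance-to-absorbance conversion can be sketched as follows. This is a minimal example that assumes the common definition A = −log10(T), optionally with T normalised by a reference spectrum; the exact form of Equation (3) may differ, and the function name `to_absorbance` and the `reference` argument are hypothetical.

```python
import numpy as np


def to_absorbance(x_trans, reference=None):
    """Convert transflectance values to absorbance, A = -log10(T).

    Assumption: T is x_trans itself, or x_trans / reference when a
    reference spectrum (hypothetical argument) is supplied.
    """
    x_trans = np.asarray(x_trans, dtype=float)
    if reference is not None:
        x_trans = x_trans / np.asarray(reference, dtype=float)
    return -np.log10(x_trans)


# One spectrum with two wavelengths, relative to a flat reference of 100 ut.
x_abs = to_absorbance([[10.0, 100.0]], reference=[100.0, 100.0])
```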

Thus, the coefficient of variation (CV) (Equation (4)) was computed for each sample and each wavelength in order to detect these spectral bands, where s is the sample standard deviation and $\overline{x}$ is the sample mean at each wavelength υ.
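A minimal sketch of the CV computation is given below, under the assumption that s and $\overline{x}$ are taken over the repeated captures of one sample at each wavelength. The synthetic data, the function name, and the 0.05 threshold are illustrative choices, not values from this study.

```python
import numpy as np


def coefficient_of_variation(captures):
    """CV = s / mean per wavelength, over repeated captures (rows) of one sample."""
    captures = np.asarray(captures, dtype=float)
    s = captures.std(axis=0, ddof=1)   # sample standard deviation
    mean = captures.mean(axis=0)
    return s / mean


# Synthetic stand-in: 10 captures x 3 wavelengths, the last one much noisier.
rng = np.random.default_rng(0)
captures = 300.0 + rng.normal(0.0, [1.0, 1.0, 60.0], size=(10, 3))
cv = coefficient_of_variation(captures)
noisy_band = cv > 0.05                 # hypothetical threshold
```

Wavelengths flagged in `noisy_band` would correspond to low-gain regions such as the 2300 nm to 2600 nm band mentioned above.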

Furthermore, the acquired spectra were preprocessed to remove or correct disturbances in the measurements. These disturbances can be caused by scattering in the olive oil due to microparticles or surface inhomogeneities. They can also arise from shifts in the spectrum baseline caused by differences in the optical path length, or from external light and random noise. Normally, the success of a preprocessing algorithm is evaluated experimentally: the classification or regression model is configured first, and the error of the model is then computed with the same inputs preprocessed with different algorithms. The pretreatment methods studied in this work were the standard normal variate (SNV) [30], multiplicative scatter correction (MSC) [31], and the Savitzky–Golay (SG) [32] derivative method.

SNV is a pretreatment method often used in NIR spectroscopy to remove the scatter added to the spectrum. It calculates the mean and standard deviation of all the data points of a spectrum; the mean is then subtracted from each data point, and the result is divided by the standard deviation. The transformation is applied to each spectrum individually and allows different spectra to be compared on the basis of the same reference (Equation (5)), where x_{iυ,SNV} is the transformed element for the original element x_{iυ}, ${\overline{x}}_{i}$ is the mean of spectrum i, and Υ is the number of variables in the spectrum.
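The SNV transformation can be sketched in a few lines. The example assumes that the standard deviation uses the Υ − 1 denominator, and the function name `snv` is hypothetical.

```python
import numpy as np


def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) individually."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    # Assumption: sample standard deviation (denominator: number of variables - 1).
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


# Two spectra that differ only by a multiplicative factor become identical.
x_snv = snv([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
```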

MSC, in turn, is normally used to correct additive and/or multiplicative effects in spectral signals. It detects and removes physical effects, such as heterogeneities in particle size, and corrects the intensity at wavelengths of the spectrum that do not carry any chemical or physical information.

First, each spectrum is fitted to the average spectrum by least squares (Equation (6)), where x_{i} is the spectrum of an individual sample i, $\overline{{x}_{\mathsf{\upsilon}}}$ is the mean spectrum of the group, and the error e_{i} corresponds to all other effects in the spectrum that cannot be modelled by an additive and a multiplicative constant; in other words, it represents the chemical differences among the olive oil samples.

The corrected spectrum x_{i,MSC} is calculated from the fitted constants a_{i} (intercept) and b_{i} (slope) using Equation (7).
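The two MSC steps described above (least-squares fit against the mean spectrum, then correction with the fitted intercept and slope) can be sketched as follows. The function name `msc` and the optional `reference` argument are hypothetical; by default the mean of the supplied spectra is used as reference.

```python
import numpy as np


def msc(spectra, reference=None):
    """Multiplicative scatter correction of each spectrum (row)."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, x in enumerate(spectra):
        # Fit x_i = a_i + b_i * ref + e_i by least squares (slope first, then intercept).
        b_i, a_i = np.polyfit(ref, x, deg=1)
        corrected[i] = (x - a_i) / b_i
    return corrected


# Spectra that are purely offset/scaled copies of the reference collapse onto it.
ref = np.array([1.0, 2.0, 3.0, 4.0])
x_msc = msc(np.vstack([2.0 * ref + 1.0, 0.5 * ref - 3.0]), reference=ref)
```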

Alternatively, the derivative approach known as the Savitzky–Golay method [32] was evaluated in order to resolve possible overlapping peaks and correct the spectra baseline. It is a spectral low-pass filtering method in which a convolution is used instead of the mean of the spectrum.

Its mathematical expression appears in Equation (8), where C_{n} is the convolution coefficient for the spectral value υ, and N is the size of the window. The convolution coefficients are obtained by a least-squares fit of the points of the spectrum to a polynomial of a given degree.

The result is a spectrum similar to the input one, but smoother. This approach preserves the features of the original signal, such as the maxima, the minima, and the widths of the peaks, as well as its original distribution. When applied in derivative form, it also removes the baseline of the spectra at the same time.
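A minimal sketch of Savitzky–Golay smoothing and differentiation is shown below, using `scipy.signal.savgol_filter`. The window size (11 points) and polynomial degree (2) are illustrative assumptions, not the settings used in this study, and the wrapper name `sg_filter` is hypothetical.

```python
import numpy as np
from scipy.signal import savgol_filter


def sg_filter(spectra, window=11, polyorder=2, deriv=0):
    """Apply a Savitzky-Golay filter along the wavelength axis of each spectrum (row)."""
    spectra = np.asarray(spectra, dtype=float)
    return savgol_filter(spectra, window_length=window, polyorder=polyorder,
                         deriv=deriv, axis=1)


# A synthetic spectrum and the same spectrum with a constant baseline offset.
x = np.arange(30.0)
spectrum = (0.1 * x ** 2 + 5.0)[None, :]
shifted = spectrum + 100.0

smoothed = sg_filter(spectrum)            # deriv=0: smoothing only
d1 = sg_filter(spectrum, deriv=1)         # first derivative
d1_shifted = sg_filter(shifted, deriv=1)  # the constant offset vanishes
```

Because the derivative of a constant is zero, `d1` and `d1_shifted` coincide, which is the baseline-removal property mentioned above.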

#### 3.5. Spectra Filtering

The amplitude of the absorption at any particular wavelength is determined by the absorptivity and by the number of molecules encountered within the beam path of the measuring instrument. This relationship is described by Beer’s law [33]. However, the absorptivity is not the same throughout the spectrum, and some wavelengths provide more information than others about the studied compounds.

In this step, one-way ANOVA was employed to compare the features between samples in the same class and samples in different classes. Two classes were defined: the first was composed of samples with high polyphenol concentrations, and the second of samples with low concentrations. The aim was to remove irrelevant wavelengths from the feature vector and to find the most discriminant ones. If the ratio of between-group variation to within-group variation for one wavelength is sufficiently high, we can conclude that the group means are significantly different from each other. This is measured with a test statistic that follows an F-distribution with (k − 1, N − k) degrees of freedom, where k is the number of groups and N is the number of samples. The test was applied to each wavelength. If the p-value for the F-statistic is smaller than the significance level, the test rejects the null hypothesis that all group means are equal and concludes that at least one group mean differs from the others. The most common significance levels are 0.05 and 0.01. In our case, features with a p-value of 0.05 or higher were candidates to be removed.
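The per-wavelength test can be sketched with `scipy.stats.f_oneway`, which computes the F-statistic and p-value column by column when given 2-D arrays. Splitting the samples into high and low classes at the median concentration is an assumption made for this example, as are the function name and the toy data; the sketch only flags the wavelengths whose class means differ at the chosen significance level.

```python
import numpy as np
from scipy.stats import f_oneway


def anova_per_wavelength(X, y, alpha=0.05):
    """One-way ANOVA for each wavelength (column of X) between two classes."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    threshold = np.median(y)          # assumption: classes split at the median
    high = X[y >= threshold]
    low = X[y < threshold]
    f_stat, p_value = f_oneway(high, low)   # evaluated along axis 0, per column
    return f_stat, p_value, p_value < alpha


# Toy data: wavelength 0 tracks the concentration, wavelength 1 does not.
y = np.arange(10.0)
X = np.column_stack([y, y % 2])
f_stat, p_value, significant = anova_per_wavelength(X, y)
```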

#### 3.6. Dimensionality Reduction with Regression Models

There are two general approaches to dimensionality reduction: feature selection and feature extraction [34]. The first identifies a subset of features without applying any transformation, whereas the second employs mathematical operations to transform the original features into a lower-dimensional space. The algorithm evaluated in this work was stepwise multilinear regression (SMLR) [35], which was used to evaluate the effect on the prediction results when different wavelengths were introduced in the model. It uses the polyphenol content as the dependent variable (Y) and the wavelengths as independent variables (X).

Stepwise multilinear regression is an iterative algorithm that consists of adding and removing terms from a linear model based on their statistical significance in explaining the response value. The method begins with an initial model and then compares the explanatory power of incrementally larger or smaller models.

SMLR uses forward and backward stepwise regression to build the final model. At each step, the algorithm searches for wavelengths to add to or remove from the model according to a specific criterion. In our case, the criterion was to use the statistical p-value and F-value to test models with and without a potential wavelength at each step. If a wavelength is not currently in the model, the null hypothesis (H_{0}) is that the coefficient attached to the wavelength would be zero if it were added to the model (forward approach). If the null hypothesis is rejected, the wavelength is added to the model. Conversely, if a wavelength is currently in the model, H_{0} is that the coefficient of the wavelength is equal to zero. If there is insufficient evidence to reject the null hypothesis, the wavelength is removed from the model (backward approach) [36].

The algorithm was applied through the following steps:

Step 1: Fit a linear model with β_{0} and the set of selected (B) wavelengths:

Step 2: For each β_{k} (k from 1 to the number of wavelengths) not in B, iterate Steps 3 to 6.

Step 3: Fit the linear model:

Step 4: Null hypothesis: H_{0}: β_{k} = 0

Step 5: F-value and p-value are obtained.

Step 6: If p-value < 0.05: β_{k} is added to the set B and go to Step 1.
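The forward part of the steps above can be sketched as follows. This is a simplified illustration rather than the exact implementation used in this work: it tests each candidate wavelength with a partial F-test against the current model and, as an assumption, adds the candidate with the smallest p-value in each pass instead of the first significant one. The backward (removal) step is omitted, and the function names are hypothetical.

```python
import numpy as np
from scipy.stats import f as f_dist


def _rss(X, y, columns):
    """Residual sum of squares and parameter count of y ~ b0 + X[:, columns]."""
    idx = np.asarray(columns, dtype=int)
    A = np.column_stack([np.ones(len(y)), X[:, idx]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ beta
    return residual @ residual, A.shape[1]


def forward_stepwise(X, y, alpha=0.05):
    """Forward selection of wavelengths (columns of X) by partial F-test."""
    selected = []                                   # the set B
    while True:
        rss0, _ = _rss(X, y, selected)              # Step 1: current model
        best = None
        for k in range(X.shape[1]):                 # Step 2: candidates not in B
            if k in selected:
                continue
            rss1, n_params = _rss(X, y, selected + [k])   # Step 3: model with k
            dof = len(y) - n_params
            if dof <= 0:
                continue
            f_value = (rss0 - rss1) / (rss1 / dof)  # Steps 4-5: H0: beta_k = 0
            p_value = f_dist.sf(f_value, 1, dof)
            if p_value < alpha and (best is None or p_value < best[1]):
                best = (k, p_value)
        if best is None:                            # no significant candidate left
            return selected
        selected.append(best[0])                    # Step 6: add to B, refit


# Synthetic data: only column 1 carries information about y.
rng = np.random.default_rng(0)
X = rng.normal(size=(21, 4))
y = 3.0 + 2.0 * X[:, 1] + 0.1 * rng.normal(size=21)
selected = forward_stepwise(X, y)
```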

Finally, the regression model was evaluated with leave-one-out cross-validation (LOOCV) and holdout validation. LOOCV is an iterative validation in which one sample x_{i} is removed from the dataset in each iteration. The number of iterations is equal to the number of samples (N). In each iteration, the model is built with the remaining samples in the dataset and predicts the y_{i} related to the previously removed sample x_{i}. Holdout validation, in contrast, employs part of the dataset for training and the remaining samples for validation. In our case, the split was 50%/50% and the samples in each group were selected randomly.

The predicted value is represented as $\widehat{{y}_{i}}$. At the end, the root mean square error (RMSE) is obtained according to Equation (11). Furthermore, the coefficient of determination R^{2} is calculated in order to evaluate how much of the variability of the response variable Y is explained by the regressor variable X, as given by Equation (12).
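The LOOCV loop and the two error metrics can be sketched as follows. The example uses an ordinary least-squares linear model and assumes the usual definitions RMSE = sqrt(mean((y − ŷ)²)) and R² = 1 − SS_res/SS_tot; the exact forms of Equations (11) and (12) may differ, and all function names are hypothetical.

```python
import numpy as np


def fit_predict(X_train, y_train, X_test):
    """Fit y = b0 + X b by least squares and predict for X_test."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ beta


def loocv_predictions(X, y):
    """Leave-one-out cross-validation: predict each y_i from all other samples."""
    n = len(y)
    y_hat = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                    # remove sample i
        y_hat[i] = fit_predict(X[keep], y[keep], X[i:i + 1])[0]
    return y_hat


def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Noise-free toy data: LOOCV recovers every left-out value.
X = np.arange(10.0).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0
y_hat = loocv_predictions(X, y)
```

Holdout validation would reuse `fit_predict` with a random 50%/50% split of the rows instead of the leave-one-out loop.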