Authentication of Greek PDO Kalamata Table Olives: A Novel Non-Target High Resolution Mass Spectrometric Approach

Food science continually requires the development of novel analytical methods to prevent fraudulent actions and guarantee food authenticity. Greek table olives, one of the most emblematic and valuable Greek national products, are often subjected to economically motivated fraud. In this work, a novel ultra-high-performance liquid chromatography–quadrupole time of flight tandem mass spectrometry (UHPLC-QTOF-MS) analytical method was developed to detect the mislabeling of Greek PDO Kalamata table olives, and thereby establish their authenticity. A non-targeted screening workflow was applied, coupled to advanced chemometric techniques such as Principal Component Analysis (PCA) and Partial Least Square Discriminant Analysis (PLS-DA) in order to fingerprint and accurately discriminate PDO Greek Kalamata olives from Kalamata (or Kalamon) type olives from Egypt and Chile. The method performance was evaluated using a target set of phenolic compounds and several validation parameters were calculated. Overall, 65 table olive samples from Greece, Egypt, and Chile were analyzed and processed for the model development and its accuracy was validated. The robustness of the chemometric model was tested using 11 Greek Kalamon olive samples that were produced during the following crop year, 2018, and they were successfully classified as Greek Kalamon olives from Kalamata. Twenty-six characteristic authenticity markers were indicated to be responsible for the discrimination of Kalamon olives of different geographical origins.


Introduction
The olive tree (Olea Europaea L.) is cultivated mainly in the Mediterranean region and has diverged naturally to many cultivars [1]. Table olives, prepared from the olive fruit, are considered a highly functional food and their quality depends on the cultivar [2,3]. Eating table olives provides us with energy, vitamins, minerals, and a large variety of phenolic antioxidants, such as phenolic acids, phenolic alchohols, flavonoids, and secoiridoids, compounds known for their antioxidant, antiinflammatory, anticarcinogenic, antimicrobial, antidyslipidemic, antihypertensive, cardiotonic, laxative, and antiplatelet properties [3][4][5][6][7].
According to the International Agreement on Olive Oil and Table Olives, the table olive is defined as the healthy fruit of certain varieties of the cultivated olive tree, harvested at a stage of appropriate maturity and quality, to yield a well-preserved product after proper treatment [8]. Table olives are classified as green olives, turning color olives, and black olives depending on their ripening stage. Olive drupe processing is essential, aiming primarily at the degradation of the phenolic glycoside oleuropein, a compound that gives a bitter taste to the fruit, making immediate consumption impossible [7]. There cultivated in different regions: Kalamata (Greece), Egypt, and Chile. Partial Least Square Discriminant Analysis (PLS-DA) was used for the development of a reliable and robust prediction model able to accurately distinguish between the authentic Greek PDO Kalamata table olives and Kalamata type table olives of different origins.

Table Olive Samples
Sixty-five commercially available Kalamon table olive samples were acquired from the Greek market. The samples were all produced in the crop year 2017 and originated from Kalamata, Greece (PDO Kalamata table olives, 20 samples), Egypt (24 samples), and Chile (21 samples). Eleven more samples originated from Kalamata, Greece, were obtained during the following crop year (2018) and were used as a set of "suspect" samples to evaluate the prediction accuracy of the PLS-DA discrimination model. All samples had been subjected to the same de-bittering protocol involving brining with dry salt and water.
Besides the table olive samples, Quality Control (QC) samples were also prepared and analyzed throughout the batch to verify adequate analytical performance. The QC samples were prepared by mixing same-volume aliquots of all table olive extracts examined, representing both the sample matrix and the metabolic profile of the samples. Finally, four analytical procedural blanks (mixtures of MeOH: H 2 O, 80:20, v/v) were prepared and analyzed.

Sample Extraction
Firstly, the hardened endocarp (shell) was removed from the olive drupe with a sharp knife and the fleshy mesocarp (hull) and exocarp (external skin) were homogenized in a food processor and further ground into a fine paste using a mortar and pestle. Then, 0.5g (± 0.1 mg) of the homogenized sample was weighted in 2 mL Eppendorf tubes and spiked with myricetin (IS) at a final concentration of 1 mg kg −1 . A 1 mL aliquot of MeOH: deionized water mixture (80:20, v/v) was used for the extraction of the analytes. The mixture was vortexed for 1 min and centrifuged for 5 min at 3400 rpm. Consequently, the upper phase was collected and filtered through RC syringe filters and 5 µL of the extract was injected into the chromatographic system.

UHPLC-ESI-QTOF-MS/MS Analysis
Reversed phase liquid chromatographic analysis was performed using a UHPLC system equipped with a HPG-3400 pump (Dionex UltiMate 3000 RSLC, Thermo Fisher Scientific, Dreieich, Germany) interfaced with a QTOF-MS analyzer (Maxis Impact, Bruker Daltonics, Bremen, Germany). An Acclaim RSLC C18 column (2.1 mm × 100 mm, 2.2 µm) was used for chromatographic separation, purchased from Thermo Fisher Scientific (Driesch, Germany) coupled with an ACQUITY UPLC BEH C18 precolumn (1.7 µm, VanGuard precolumn, Waters, Ireland). The column temperature was set at 30 • C. The mobile phase consisted of water: MeOH (90:10%, v/v) containing 5 mM ammonium acetate (solvent A), and 100% MeOH containing 5 mM ammonium acetate (solvent B). The elution gradient program adopted started with 1% of solvent B (flow rate 0.2 mL min −1 ) for 1 min, gradually increasing to 39% in the following 2 min, and then increasing to 99.9% of solvent B (flow rate 0.4 mL min −1 ) in the following 11 min. These conditions were kept constant for 2 min (flow rate 0.48 mL min −1 ), and then the initial conditions (1% solvent B, 99% solvent A) were restored within 0.1 min (flow rate decreased to 0.2 mL min −1 ) to re-equilibrate the column for 3 min before the next injection. The QTOF MS system was equipped with an electrospray ionization (ESI) interface, operating in negative mode with capillary voltage of 3500 V, end plate offset of 500 V, nebulizer pressure of 2 bar (N 2 ), drying gas flow rate of 8 L min −1 (N 2 ), and drying temperature of 200 • C. A QTOF external calibration was performed daily using a sodium formate cluster solution, and an internal calibration was performed in every chromatogram, using a calibrant injection at the beginning of each run (0.1−0.25 min). The sodium formate calibration mixture consisted of 10 mM sodium formate in a mixture of water and 2-propanol (1:1, v/v). Full scan mass spectra were recorded in the range from 50 to 1000 m/z, with a scan rate of 2 Hz. MS/MS experiments were conducted using data dependent AutoMS mode in which the five ions with the highest abundance per MS scan are fragmented with a pre-defined collision energy set based on the mass and charge stante of the ions. AutoMS mode provides clear and compound-specific MS/MS spectra, enabling the structure elucidation of unknown compounds. The analytical instrument provided a typical resolving power (full width at half maximum) between 36,000 and 40,000 at m/z 226.1593, 430.9137, and 702.8636.

Method Performance Evaluation
The performance of the UHPLC-ESI-QTOF-MS/MS methodology was evaluated in order to verify its performance characteristics. A set of 14 phenolic compounds was used for the assessment of the applicability of the methodology and several performance characteristics were examined such as the method's linearity, accuracy, precision, sensitivity, and matrix effects. A target list was created, including the phenolic compounds that belong to the classes of phenolic acids, flavonoids secoiridoids, and lignans, and that have already been identified in Greek table olives and olive oils [23,25]. The compiled target list is presented in Table S1 in the Electronic Supplementary Material, including the names and molecular formulas of the compounds, as well as their molecular ions in negative ionization mode (M − H) − , retention times (t R ), and most abundant fragment ions. The data processing was performed using Bruker Daltonics Data Analysis 4.1 and TASQ Client 1.4 software packages (Bremen, Germany).
Standard addition curves were constructed to assess the linearity of the method. Caffeic acid, ferulic acid, gallic acid, syringic acid, p-coumaric acid, homovanillic acid, pinoresinol, apigenin, vanillin, epicatechin, luteolin, hydroxytyrosol, tyrosol, and oleuropein were spiked at concentrations between 0.02 and 10 mg·kg −1 (10 calibration levels with 3 replicates at each level). Myricetin, a phenolic compound that did not exist in the matrix, was chosen and used as an IS to compensate for recovery losses and to correct any potential drift of the analytical signal throughout the batch. The calibration curves were constructed by correlating the spiked concentration of each analyte to the peak area of the spiked analyte, after subtracting the peak area of the neat sample, and divided by the peak area of the IS. The limits of detection (LODs) and limits of quantification (LOQs) of the method were evaluated using the standard error and slopes of low-concentration standard addition calibration curves, constructed using the lowest concentration range 0.02-1 mg kg −1 , according to the following equations [5]: where S a is the standard error of the intercept a and b is the slope. The accuracy of the method was estimated by calculating the relative recovery of all target analytes at the 0.5 mg·kg −1 concentration level. The matrix effect was also calculated at the 0.5 mg·kg −1 concentration level and, finally, the precision of the method was estimated as repeatability (intra-day precision) and intra-laboratory reproducibility (inter-day precision). Repeatability was expressed as the %RSDr values of eight replicate analyses (n = 8) of spiked samples at a concentration of 0.5 mg kg −1 analyzed at the same laboratory day, while the reproducibility results were expressed as the%RSDR value of three replicates of spiked table olive samples analyzed during two consecutive days (n × k = 3 × 2 = 6).

Data Processing
The raw data from the analysis of 65 samples were converted into mzXML files using the MSconvert module from Proteowizard [32]. These raw data were then imported into R workspace and the centWave algorithm implemented in XCMS R package was used to perform peak picking [33]. The internal parameters of XCMS were optimized by IPO R package using the analyzed QC samples. The optimized values of ppm (threshold for tolerated mass deviation), minimum and maximum chromatographic peak width were 23.3, 17.5, and 40, respectively. The signal-to-noise and the intensity threshold was set to 3 and 1000. The "prefilter" parameter, being the threshold for an m/z to be considered as a true peak if it exists in k consecutive scans at J intensity threshold (k, I), was set at 3-1000. The rector and group functions were used from XCMS to align the retention time data and group the detected peaks across all samples. The retention time alignment method was based on the non-linear retention time alignment wrapping algorithm (loess). The annotation of extracted m/z features was done by CAMERA and non-target R package. This was to prevent adducts/isotopic peaks to cofound with their molecular ions. The masses (m/z) present within the analytical procedural blank were removed from sample peaks-list using an advanced chemometric method based on deep learning [34]. Furthermore, the (MS-FLO) approach (https://msflo.fiehnlab.ucdavis.edu/#/) was used to improve the quality of the feature lists after initial processing, for instance by removing the duplicates and decreasing the initial list of 4144 m/z features to 2108. Eventually, a table including 65 samples and 2108 m/z features was generated. Each of these m/z values could be a potential mass of interest to be identified.
Consequently, a normalization or transformation of the data was necessary before performing the chemometric data analysis on the peaks-list, [35][36][37][38]. Normalization omits the influence of confounding variations that might have been comprised from experimental sources, such as experimental bias, analytical noise, or instrumental sensitivity. Moreover, the signal of a compound with lower concentration can be easily manipulated by the analytical noise, and scaling of the peaks-list is a necessity in order to compare different biomarkers. Here, the optimal division of the peak-list consisted of 65 samples and 2108 m/z values into training (51 samples) and independent test (14 samples); the subset was achieved based on the Kennard-Stone approach [39]. The Kennard-Stone dataset division method begins by selecting a pair of points (i samples and m/z features) that are the furthest apart. The selected samples are assigned to the training sets and then excluded from the initial list. Then, the next pair of samples, which are the second furthest apart, are listed into the test subset. The procedure continues to assign each remaining sample alternatively to the training and test sets based on a distance function and compared to the previously selected samples. The distance function used was Euclidean distance. Prior to the final discrimination model, the peaks-list (training and test subset) was scaled. In this case, Pareto scaling was selected after evaluating various pretreatment methods including range scaling, Mean centering, Autoscaling, variable stability (Vast) scaling [40], Level scaling [35], Variance Stabilizing Normalization (VSN) [41], mean centered log2 transformation [42], and Probabilistic Quotient Normalization (PQN) [43]. PQN is based on calculating the most probable dilution factor by looking at the distribution of the quotients of the response factor of a test spectrum by those of a reference spectrum. For calculating the quotients in the PQN method, a reference spectrum was needed. The reference spectrum could be a single representative spectrum of the study, or a calculated median or mean spectrum of a subset of controls samples (QC). When using PQN, the median as an estimation of the most probable quotient was applied on the set of 6 QC samples.

Chemometrics
Chemometrics-based classification methods were used to discriminate the Greek, Egyptian, and Chilean Kalamon olives based on their geographical origins and to reveal characteristic geographical origin markers. At first, Principal Component Analysis (PCA) was used to perform data exploratory analysis for QC and analytical procedural blank samples. Consequently, the peaks-list generated from the UHPLC-QTOF-MS analysis of the samples was used for PLS-DA [44]. More details about PLS-DA can be found elsewhere [45]. The optimal number of 3 latent variables (LVs) was selected after evaluating the miss-classification error in 5-leave out cross validation. Variable Importance in Projection (VIP) [46] was also used to find the potential markers revealed by the PLS-DA model discriminating between Greek, Egyptian, and Chilean table olive samples. VIP scores represent the influence of each individual m/z value and they are calculated as the weighted sum of squares of the PLS-DAs (PLS-DA1, PLS-DA2 etc.) weights. The cut-off value of above 0.83 was used for the VIP score [47]. In addition, in order to highlight the key features most responsible for the differentiation of the Kalamon olives, an orthogonal PLS-DA (OPLS-DA) was employed. One class was faced against two other ones in turn, so that the pairwise VIP values could be derived. To build the aforementioned supervised method and conduct cross-validation analysis, the PLS-DA toolbox was written in MATLAB [45] and R (ropls R package) [48] were used.

Model Validation and Classification Performance
The PLS-DA model was validated internally and externally using mis-classification error rate in leave-many-out cross validation and using 14 table olive samples as the test set. The total error rate, class specificity, and sensitivity, as well as class assigned probabilities for training, k-fold-cross-validation (k = 5), and test set were calculated and used in Receiver Operating Characteristics (ROC) curve. Considering these validation steps, a classification model is internally appropriate when it gives a point in the upper left corner of the ROC curve, providing the total area under curve equal to 1 as well as the maximum sensitivity and specificity. However, a random model gives a curve positioned around diagonal line from the left bottom to the top right corner [45]. Several statistical parameters were calculated for ranking and evaluating the PLS-DA model. These parameters were R 2 X (measures the accumulative variance explained by selected number of latent variables), R 2 Y (measures the goodness of fit), Q 2 (k-fold-cross-validated, predictive ability of the model), as well as RMSEE (Root Mean Square Error of Estimation) and RMSEP (Root Mean Square Error of Prediction), which are the square root of the mean error between the actual and the predicted classes for the training and test set, respectively. The statistical significance of the PLS-DA model was also assessed by the diagnostic statistics of Q residuals and Hotelling's T 2 (for outlier analysis) [45] and the permutation test of cumulative R 2 Y and Q 2 values [49]. Hotelling's T 2 test is based on the sum of normalized squared scores, which calculates the variation for each sample within the developed supervised classification model. A high Hotelling's T 2 value means that the sample is distant from the center of the established model. On the other hand, a high Q residual indicates that there is a high difference between the actual data and the one that is reconstructed from the supervised classification model. Therefore, any samples with higher Hotelling's T 2 and Q residual values can be considered as outliers.

Marker Identification
Variable Importance in Projection (VIP) [46] values revealed all the significant markers that were responsible for the discrimination between the Greek, Egyptian, and Chilean Kalamon table olive samples. These markers were identified using an in-house compound identification protocol and following the identification confidence levels for HRMS proposed by Schymanski et al. [50].
At first, the corresponding m/z features (reaching to the level of identification 5) were looked up within a detailed list of known unknowns including 259,298 natural products, built by our group, as well as PubChem. The look up task was done with a mass accuracy threshold of below 5 mDa. Next, the theoretical isotopic pattern was calculated for each matched candidate by "enviPat" R package [51] and then compared with the observed experimental isotopic pattern using dot product algorithm (reaching to the level of identification 4). Isotopic fit was expressed as mSigma with a threshold for positive identification of < 100. Then, the experimental and predicted retention time of each candidate was evaluated to exclude any false positives. For the interpretation of the MS/MS fragments as well as ranking (based on their explained MS/MS fragments and mass accuracy), the in silico fragmentation tools of MetFrag [52] and CFM-ID [53] were used (reaching to the level of identification confidence 3). The online public mass spectrum libraries (mzcloud, massBank, METLIN, and GNPS) were also used to increase the identification confidence of the compounds to 2a. In this regard, a MS/MS similarity score threshold of 0.7 was used between the MS/MS spectra of the candidates and the spectra of reference standards reported. Where no intensity information was available for the MS/MS spectra of reference standard, the presence of at least three matched fragments was required to positive identify the detected compound. All of these steps were done using an in-house program called "AutoSuspect" [54].

Method Performance Results
The calibration curves were constructed, and excellent linearity was achieved (r 2 > 0.99, in all cases). The precision limit was ≤ 3.6% RSD for intra-day experiments and ≤ 4.4% RSD for inter-day experiments, demonstrating the good precision of the method developed. LODs and LOQs were satisfactory and ranged between 0.019 (apigenin) and 0.041 (p-coumaric acid) mg kg −1 and 0.056 (apigenin) and 0.091 (gallic acid) mg kg −1 , respectively. All examined analytes showed adequate recovery efficiency (between 81.4% and 94%) and low matrix suppression (ME%: 8.09-19.4). The method performance parameters of the developed methodology are presented in Table 1. Table 1. Reversed phase ultra-high-pressure liquid chromatography quadrupole time-of-flight mass spectrometric (RP-UHPLC-ESI-QTOF-MS/MS) method performance parameters.

Data Processing
Nine different scaling methods were tested, and Pareto scaling was selected as the optimum after evaluating the R 2 X, R 2 Y, Q 2 , and ROC curve, as well as RMSEE and RMSEP. Table S2 in the Electronic Supplementary Material shows the evaluation results of each scaling method tested. Most of the scaling methods showed comparable R 2 Y and Q 2 values (except for VSN) whereas Pareto scaling showed all in all satisfactory results (with the score of 0.932) for RMSEP and RMSEE. PQN, tested for the first time in authenticity studies of olive drupes by HRMS, provided relatively compatible statistical results in comparison with other methods (Pareto scaling, mean centering, and Level). PQN is a robust method to reduce any variance arising from subtly differing dilutions between the extracts. Although PQN was not used for the final scaling method after considering the final score, it can have a good potential use in foodomics studies. The score was developed based on the importance of each statistical parameter to rank the scaling methods used. Equation (3) shows the weights used for each parameter:

Chemometrics
PCA was used in order to qualitatively evaluate the distribution of the samples on the score plot before establishing the final supervised classification model. As depicted in Figure 1, PCA was applied on the scaled peaks-list including 2108 m/z values and 65 samples, as well as 6 QCs and 4 analytical procedural blanks. This was important to verify that the analytical procedural blank and QC samples were grouped together and apart from the other samples. A clear separation was also observed between samples (within 95% confidence intervals) from the different geographical origins explaining over 62% of variance.
Consequently, a PLS-DA supervised classification model was developed using the scaled peaks-list from Pareto scaling and 51 out of 65 samples as the training set. The remaining 14 samples were used as a test set to externally evaluate the accuracy of the PLS-DA model. Three latent variables were selected for PLS-DA according to the leave-k-out cross validation (k = 5) results. The mis-classification error rate for leave-one-out cross validation, the test set, and the training set was 0.00. The plot of Hotelling's T 2 versus Q residuals was constructed, and is shown in Figure 2A. All classes had only a few samples with high Q residuals and only one sample ("A23") higher than the confidence limits (both Hotelling's T 2 value and Q residuals). However, attempting to exclude this sample did not affect the overall classification performance and PLS-DA model; therefore, the "A23" sample was not excluded from the dataset. After 20 random permutations of the classes, the random PLS-DA had a significantly lower Q 2 Y value than the one obtained by the trained PLS-DA model, as shown in Figure 2B. This confirms that overfitting did not occur. ROC curves were also calculated for each class (plots of sensitivity versus 100-specificity) and the area under curve was calculated as 1.00, representing an excellent classification model ( Figure 2C). Consequently, a PLS-DA supervised classification model was developed using the scaled peakslist from Pareto scaling and 51 out of 65 samples as the training set. The remaining 14 samples were used as a test set to externally evaluate the accuracy of the PLS-DA model. Three latent variables were selected for PLS-DA according to the leave-k-out cross validation (k = 5) results. The misclassification error rate for leave-one-out cross validation, the test set, and the training set was 0.00. The plot of Hotelling's T 2 versus Q residuals was constructed, and is shown in Figure 2A. All classes had only a few samples with high Q residuals and only one sample ("A23") higher than the confidence limits (both Hotelling's T 2 value and Q residuals). However, attempting to exclude this sample did not affect the overall classification performance and PLS-DA model; therefore, the "A23" sample was not excluded from the dataset. After 20 random permutations of the classes, the random PLS-DA had a significantly lower Q 2 Y value than the one obtained by the trained PLS-DA model, as shown in Figure 2B. This confirms that overfitting did not occur. ROC curves were also calculated for each class (plots of sensitivity versus 100-specificity) and the area under curve was calculated as 1.00, representing an excellent classification model ( Figure 2C).   Figure 3 shows the final PLS-DA model, along with the test set and suspect samples located within the 95% confidence intervals of each geographical origin. Even though different environmental conditions could cause variation in the chemical content of table olive samples from different years, the suspect samples were grouped very closely to the training set used by the PLS-DA model and were classified correctly as Greek Kalamon olive drupes. This proved the robustness of the model and its applicability to the accurate discrimination of Greek, Egyptian, and Chilean Kalamon olive drupes, indicating the potential use of the current workflow for PDO authentication purposes. Finally, the developed PLS-DA model's robustness was proven by analyzing a set of "suspect" samples, consisting of 11 table olive samples of the Greek Kalamata olives collected in a different harvesting year (2018). Figure 3 shows the final PLS-DA model, along with the test set and suspect samples located within the 95% confidence intervals of each geographical origin. Even though different environmental conditions could cause variation in the chemical content of table olive samples from different years, the suspect samples were grouped very closely to the training set used by the PLS-DA model and were classified correctly as Greek Kalamon olive drupes. This proved the robustness of the model and its applicability to the accurate discrimination of Greek, Egyptian, and Chilean Kalamon olive drupes, indicating the potential use of the current workflow for PDO authentication purposes.

Marker Identification
With the use of 65 samples and a peak-list of 2108 features (m/z), PLS-DA (from VIP values) indicated 26 m/z features that are significant to discriminate the geographical origins of the Kalamon table olive samples (Greek, Chilean, or Egyptian). The list of these m/z features can be found in Table  2.

Marker Identification
With the use of 65 samples and a peak-list of 2108 features (m/z), PLS-DA (from VIP values) indicated 26 m/z features that are significant to discriminate the geographical origins of the Kalamon table olive samples (Greek, Chilean, or Egyptian). The list of these m/z features can be found in Table 2.  The identification methodology that was used is described in Section 2.6.4. One of the most important markers, detected in high abundance in Egyptian Kalamata type table olives, was the mass feature with m/z 109.0299 and t R = 4.14 min (Figure 4). For this mass feature, the molecular formula C 6 H 6 O 2 was assigned using Bruker "Smart Formula Manually" with mass accuracy and isotopic fit (mSigma) being −0.4 mDa and 10.5, respectively (Figure 4a). Then, 371 compounds were retrieved from PubChem using a mass accuracy threshold of 5 ppm. Afterwards, the candidates were processed by MetFrag [52] and an in-house QSRR-based retention time prediction model that was developed by our group [55]. The most probable candidate was found to be catechol and the elucidated chemical structures of the fragments are given in Figure 4b. According to the degree of MEAN value (absolute values of mean of predictive residuals) in the Monte Carlo Sampling (MCS) plot [55], this candidate was not classified as a potential false positive (Figure 4c). The spectra of the reference standard of catechol was found in mzCloud database (mzCloud No: 2991) and the fragments matched the ones of the mass feature detected (Figure 4d). Therefore, the identification was confirmed by t R prediction, MCS plot, and MS/MS comparison (spectra similarity score of 0.966) at a level of identification confidence of 2a. The identification methodology that was used is described in Section 2.6.4. One of the most important markers, detected in high abundance in Egyptian Kalamata type table olives, was the mass feature with m/z 109.0299 and tR = 4.14 min (Figure 4). For this mass feature, the molecular formula C6H6O2 was assigned using Bruker "Smart Formula Manually" with mass accuracy and isotopic fit (mSigma) being −0.4 mDa and 10.5, respectively (Figure 4a). Then, 371 compounds were retrieved from PubChem using a mass accuracy threshold of 5 ppm. Afterwards, the candidates were processed by MetFrag [52] and an in-house QSRR-based retention time prediction model that was developed by our group [55]. The most probable candidate was found to be catechol and the elucidated chemical structures of the fragments are given in Figure 4b. According to the degree of MEAN value (absolute values of mean of predictive residuals) in the Monte Carlo Sampling (MCS) plot [55], this candidate was not classified as a potential false positive (Figure 4c). The spectra of the reference standard of catechol was found in mzCloud database (mzCloud No: 2991) and the fragments matched the ones of the mass feature detected (Figure 4d). Therefore, the identification was confirmed by tR prediction, MCS plot, and MS/MS comparison (spectra similarity score of 0.966) at a level of identification confidence of 2a.  Two mass features with the same retention time were indicated from the PLS-DA model to be important for the Greek, Chilean, and Egyptian Kalamon olive discrimination: the marker detected at m/z 153.0556_t R = 3.53 min, as well as the marker detected at m/z 123.0452_t R = 3.53 min. For the first compound, the molecular formula of C 8 H 10 O 3 was assigned (mass accuracy: 0.1mDa, mSigma 42.6) and after evaluation of MS/MS fragments and retention time, it was found to be hydroxytyrosol, for which a reference standard was also available (identification level 1). For the marker detected at m/z 123.0452_t R = 3.53 min, the molecular formula C 7 H 8 O 2 was assigned using "SmartFormula Manually", fulfilling the required criteria of mass accuracy (−0.1 mDa) and isotopic fit (9.4 mSigma). This m/z was also found to be present in the MS and MS/MS spectrum of hydroxytyrosol ( Figure S1b,c in the Electronic Supplementary Material). Therefore, it was concluded that m/z 123.0447_t R = 3.53 min is an in-source fragment of hydroxytyrosol with its most possible and appropriate structure being methyl-catechol moiety.
For the marker detected at m/z 137.0609_t R = 4.14 min, the molecular formula of C 8 H 10 O 2 was suggested (mass accuracy: −0.1mDa, mSigma 19.2). After the evaluation of the MS/MS fragments and retention time it was confirmed that the compound was tyrosol, which had also been used for the method performance evaluation (level of identification 1).
For the mass feature detected at m/z 133.0137_t R = 1.15 min, which was more abundant in the Greek olives than in Egyptian and Chilean ones, the molecular formula of C 4 H 6 O 5 was assigned with a mass accuracy and isotopic fit of 0.5 mDa and 10.6 mSigma, respectively. Sixty-two candidates were retrieved from PubChem and processed using MetFrag. This list was further filtered with the use of the QSRR-based retention time prediction model. Twelve potential candidates presented a MetFrag Score above 0.5 and retention time error below 1.8 min, among which L-malic acid for which a reference standard was available in our lab. After the analysis of the reference standard, the mass feature with m/z 133.0137_t R = 1.15 min was positively identified as L-malic acid. Another marker compound detected in high abundance in Greek olive drupes was m/z 191.0562_t R = 1.60 min, with an assigned molecular formula of C 7 H 12 O 6 (mass accuracy 0.1mDa, mSigma 18.6). Then, 702 candidates were retrieved from PubChem and evaluated by MetFrag score and MCS value. This m/z marker was identified as quinic acid with a level of identification 2a after matching the MS/MS fragments with those in the METLIN database (METLIN ID: 3329).
Several mass features were detected in high abundance in Chilean olives, such as m/z 269.0453_t R = 8.23, m/z 285.0403_t R = 8.20, and m/z 285.0403_t R = 7.51 min. These markers were identified as the flavonoids apigenin, kaempferol, and luteolin, respectively, after evaluating the mass accuracy, isotopic pattern, experimental retention time, and MS/MS match with reference standards.
The next group of markers, identified with levels of confidence 1 and 2a, belong to the fatty acids group and were found at high levels in Egyptian and Greek Kalamon Three isobaric compounds were determined with m/z 403.1244 and t R = 3.08, 3.58, 4.03 min and were indicated by the PLS-DA model to play an important role in the geographical origin determination of the Kalamon olives. For these compounds, the C 17 H 24 O 11 molecular formula was assigned with a mass accuracy and an isotopic fit of 0.2mDa and 25.6 mSigma, respectively. After the evaluation of MS/MS fragmentation and retention time, several tentative candidates were found in the literature [56]. Therefore, three secoiridoids, namely Oleoside methyl ester, 8-epikingiside, and secoxyloganin were tentatively identified at a level of identification 3. Since they all share the common fragment of 165.0554 (loss of glycoside and methyl ester moiety) and 223.0607 (loss of glycoside moiety), it was difficult to distinguish between these candidates based on MS/MS fragmentation and, therefore, the predicted retention time was used to prioritize them (as shown in Figure 5). In addition, the in-source fragment of 223.0607 was detected in the MS spectrum.
Molecules 2020, 25, 2919 18 of 26 glycoside moiety), it was difficult to distinguish between these candidates based on MS/MS fragmentation and, therefore, the predicted retention time was used to prioritize them (as shown in Figure 5). In addition, the in-source fragment of 223.0607 was detected in the MS spectrum. Apart from the above mentioned secoiridoids, three additional compounds from this class including p-coumaroyl-6-secologanoside [6], oleuropein, and dihydrooleuropein were identified at the level of identification of 2a, 1, and 3, respectively. p-Coumaroyl-6-secologanoside was assigned to the m/z feature 535.1455_tR = 4.61 min with a molecular formula C25H28O13 (mass accuracy 0.2 mDa and mSigma 11.1). The predicted retention time and MCS value was in line with the experimental. Two characteristic MS/MS fragments were reported in the literature [56] for p-coumaroyl-6secologanoside (m/z 265.0728 and 491.1560), matching to the experimental MS/MS spectra (265.0715 (S/N = 15.8, ionic formula = C13H13O6, 0.3 mDa) and 491.1547 (S/N = 2.5, ionic formula = C24H27O11, 1.2 mDa) of the samples from Greek Kalamon variety. The next important marker (m/z 539.1764_tR = 6.47 min), highly abundant in Egyptian Kalamata type table olives, was identified as oleuropein. The mass accuracy and isotopic fitting were 0.6 mDa and 93.2 mSigma, respectively. After the retention time and MS/MS evaluation, the most probable candidate was oleuropein and it was positively identified using the corresponding reference standard (level of identification 1). Three more isomers of oleuropein were observed at retention times of 6.46, 6.72, and 7.05 min. These isomers have been also reported previously [56]. Figure 6 shows the matched information between oleuropein and the reference standard as well as observed isomers. Apart from the above mentioned secoiridoids, three additional compounds from this class including p-coumaroyl-6-secologanoside [6], oleuropein, and dihydrooleuropein were identified at the level of identification of 2a, 1, and 3, respectively. p-Coumaroyl-6-secologanoside was assigned to the m/z feature 535.1455_t R = 4.61 min with a molecular formula C 25 H 28 O 13 (mass accuracy 0.2 mDa and mSigma 11.1). The predicted retention time and MCS value was in line with the experimental. Two characteristic MS/MS fragments were reported in the literature [56] for p-coumaroyl-6-secologanoside (m/z 265.0728 and 491.1560), matching to the experimental MS/MS spectra (265.0715 (S/N = 15.8, ionic formula = C 13 H 13 O 6 , 0.3 mDa) and 491.1547 (S/N = 2.5, ionic formula = C 24 H 27 O 11 , 1.2 mDa) of the samples from Greek Kalamon variety. The next important marker (m/z 539.1764_t R = 6.47 min), highly abundant in Egyptian Kalamata type table olives, was identified as oleuropein. The mass accuracy and isotopic fitting were 0.6 mDa and 93.2 mSigma, respectively. After the retention time and MS/MS evaluation, the most probable candidate was oleuropein and it was positively identified using the corresponding reference standard (level of identification 1). Three more isomers of oleuropein were observed at retention times of 6.46, 6.72, and 7.05 min. These isomers have been also reported previously [56]. Figure 6 shows the matched information between oleuropein and the reference standard as well as observed isomers. Dihydrooleuropein (molecular formula of C25H36O13, mass accuracy of 0.3 mDa, 23.0 mSigma) was tentatively assigned to the mass feature with m/z 543.2080_tR = 5.39 min. The prediction of retention time for this compound failed due to the fact that it lies outside of the application domain of the model in ESI negative mode. Therefore, this candidate was tentatively identified after evaluating its MS/MS fragments with the in silico fragmentation tool MetFrag (Figure 7).   Three more characteristic markers of Egyptian Kalamon table olives were tentatively identified as Verbascoside, Isoacteoside, and Campneoside II, belonging to the class of phenylpropanoids. These compounds give a diagnostic fragment ion at m/z 461.1670 that relates to the caffeoyl moiety loss [56]. Verbascoside and Isoacteoside are isomeric compounds and give similar MS/MS spectra. Unfortunately, we could not use the retention time prediction model to define their elution order as these compounds are out of the application domain of the model For the third mass feature with m/z 639.1927, only one peak was detected representing the hydroxyl form of one of these isomers (C 29 H 36 O 16 with isotopic pattern fit of 20.5 mSigma and mass accuracy of 0.4 mDa). According to the literature, β-Hydroxyverbascoside (Campneoside II) exists dominantly in olive drupes [56]. Therefore, evaluating the polarity and similarity of these three compounds, we can assume that Verbascoside would elute closer to β-Hydroxyverbascoside than Isoacteoside and, thus, the isomer eluting at 4.97 min is Verbascoside. Figure 8 shows the abundance of these phenylpropanoid compounds in all samples, the assigned compounds based on elution order, and their MS/MS evaluation.
The molecular formula of C 9 H 8 O 4 (with the mass accuracy of −0.2 and isotopic fit of 16.1) was assigned on the mass detected at m/z 179.0350, t R = 4.16 min. We retrieved 810 candidates from PubChem using the assigned molecular formula. After evaluating the MetFrag score and MCS value, caffeic acid and some derivatives of cinnamic acid as well as coumaric acid were found to be in the top hit list. However, caffeic acid was rejected as a potential candidate due to the retention time shift in comparison with the available reference standard. Although this peak shares common fragments (123.0442 and 137.0238) in contrast to hydroxytyrosol and cinnamic acid derivatives which could reveal the substructure, it could also be the in-source fragment of the m/z 639.1927, t R = 4.41 min. Therefore, the level of identification 4 was assigned to this m/z due to the lack of extra evidence for further prioritization.

Conclusions
This study examined the differences between the metabolomic fingerprints of Kalamon table olives of different geographical origins (Greece, Egypt, and Chile). A novel non-targeted UHPLC-ESI-QTOF-MS/MS methodology was developed and evaluated, providing adequate analytical performance characteristics. The non-target results were used for the development of a PLS-DA prediction model that successfully discriminated the PDO Greek Kalamata table olives from Kalamata type table olives from Egypt and Chile. Twenty-six compounds were suggested as characteristic authenticity markers responsible for the discrimination between the samples and all of them were tentatively identified using an in-house marker identification workflow. This study provides a comprehensive workflow that can be used to guarantee PDO product authenticity preventing fraud, protecting the consumers, and stopping unfair competition between producers.
Supplementary Materials: The following are available online: Table S1: Target list of phenolic compounds, Table S2: Evaluation of scaling methods used by PLS-DA model, Figure S1