Quantitative Structure–Property Relationship for the Retention Index of Volatile and Semi-Volatile Compounds of Coffee †

: This study describes the development of a quantitative structure–property relationship to predict the retention indices of volatile and semi-volatile compounds identiﬁed in Arabica coffee samples from different geographical origins. The analytical method utilized rapid headspace solid-phase microextraction (HSSPME)–gas chromatography–time-of-ﬂight mass spectrometry (GC-TOFMS) data measured in divinylbenzene/carboxen/polydimethylsiloxane (DVB/CAR/PDMS) ﬁber. A total of 102 molecules were optimized with the PM6/ZDO level of theory in order to calculate several molecular descriptors. Ordinary least squares were coupled with genetic algorithm–supervised variable subset selection to ﬁnd the best three descriptors. For model validation, the dataset was split into a training set (70%) and a test set (30%). The quality of the model was evaluated by means of the coefﬁcient of determination and the root-mean-square error.


Introduction
Brewed coffee is a popular, versatile and widely consumed beverage throughout the world.It is estimated that around 400 billion cups of coffee are consumed annually [1].Hence, it is one of the most important agricultural products in the international trade of coffee-producing countries and adds approximately 35 billion US dollars per year to the overall world economy [2].According to The International Coffee Organization (ICO), there was an increase in coffee consumption of 4.5% between the years 2016-2021 [3].In addition to the coffee culture and pleasure of the aroma and taste, the bioactive components of coffee, which include alkaloids, terpenes, phenolic compounds and other secondary metabolites, have been associated with antioxidant and anti-inflammatory properties that are associated with improved human health [4].
The aroma of coffee is one of its most relevant organoleptic characteristics, which is associated with volatile organic compounds (VOCs) and semi-volatile organic compounds (SVOCs) [5].VOCs are strong-smelling molecules that contribute to the definition of the aroma and flavor profile of coffee, while semi-volatile organic compounds (SVOCs) are a subgroup of VOCs exhibiting both high molecular weight and high boiling-point temperatures.These chemicals are fundamental for defining the organoleptic quality of coffee developed during the production process [6,7].In order to study the composition of these compounds, rapid headspace solid-phase microextraction (HS-SPME)-gas Chem.Proc.2022, 8, 48 2 of 9 chromatography-time-of-flight (high-speed data acquisition rate option) mass spectrometry (GC-TOFMS), using DVB/CAR/PDMS (divinylbenzene/carboxen/polydimethylsiloxane) fiber, were used for analysis [8].This fiber was shown to be the most suitable stationary phase for analyzing complex components with different polarities in food matrices due to its capability to deal with high temperatures.In this context, our research group developed a QSPR model to predict the retention indices (I) of diverse volatiles detected in the headspace of rice using DVB/CAR/PDMS fiber in solid-phase microextraction-gas chromatography-mass spectrometry (SPME-GC-MS) analysis [9].
The aim of the work presented here is the development of a computational model based on the quantitative structure-property relationship (QSPR) for the study of 102 VOCs and SVOCs detected in coffee samples from different origins.The specific property end-point is the calculation of a retention index quantified in DVB/CAR/PDMS (divinylbenzene/carboxen/polydimethylsiloxane) fiber by means of a coupled system of rapid headspace solid-phase microextraction (HS-SPME)-gas chromatography-time-of-flight (high-speed data acquisition rate option) mass spectrometry (GC-TOFMS).Compounds were optimized according to quantum chemical calculations and represented by several molecular descriptors and fingerprints.Then, unsupervised V-WSP variable reduction was used to obtain a pool of the most relevant descriptors to be submitted to the genetic algorithm-variable subset selection (GA-VSS) coupled with ordinary least squares (OLS) to search for an optimal model.The QSPR model was evaluated by diverse internal and external validation approaches, as well as an applicability domain assessment.As a complement, the mechanism of action of each molecular descriptor used to predict the I of the VOCs and SVOCs was provided.The model was developed following the five principles stated by the Organization for Economic Co-operation and Development [10].

Database Description
For this study, we considered the retention indices (I) of 102 volatile and semi-volatile organic compounds identified in Arabica coffee samples from different geographical origins (Brazil, Colombia, Costa Rica, Guatemala, Ethiopia and Indonesia) [8].The experimental I was obtained through rapid headspace solid-phase microextraction (HS-SPME)-gas chromatography-time-of-flight (high-speed data acquisition rate option) mass spectrometry (GC-TOFMS).A system comprising the CombiPAL SPME autosampler with a gas chromatograph 6890 coupled to a Pegasus III mass spectrometer was used for the automatization of the SPME procedure.In addition, the mass spectrometer was equipped with a time-of-flight mass analyzer.An SLB-5 column (10 m × 180 µm × 0.18 µm) was used for the separation of volatile and semi-volatile compounds.This column is constituted by 5% diphenyl and 95% dimethylpolysiloxane. Subsequently, a fiber optimization experiment was performed in order to evaluate diverse commercially available fiber coatings under the same GC and MS conditions.Divinylbenzene/carboxen/polydimethylsiloxane (DVB/CAR/PDMS) 50/30 µm proved to be optimal for the quantification of the retention indices of the volatile and semi-volatile organic compounds present in coffee samples.Refer to Table S1 for details of the retention indices of the 102 compounds.To the best of our knowledge, no computational modeling has been previously conducted with this dataset.

Molecular Representation and Geometry Optimization
The chemical name of each of the compounds [1-102] reported in Table S1 was used to retrieve the PubChem CID and the CAS registry number in the PubChem open library [11].In addition, the .sdffile format of each compound was also obtained for molecule visualization and optimization.Initially, the alvaMolecule software [12] was used for molecule standardization and curation.These processes include the verification of each query through several filters, which performed the standardization of benzene rings into aromatic form, converted unusual covalent bonds to ionic forms, added charge to quaternary nitrogen atoms, removed/added excessive/missing hydrogens, and standardized nitro, azide and diazo groups.The canonical SMILES (simplified molecular-input line-entry system) notation of each compound was also generated in alvaMolecule.Then, an initial optimization of all molecules was developed with the Avogadro program using the UFF force field and the steepest descent algorithm [13].The optimized conformations were then used to perform quantum chemical calculations in the ground state (gas phase) of the compounds with the program package Gaussian 09, Rev D.01 [14].The final optimization was performed with the semi-empirical method PM6/ZDO.In addition, the frontier orbital energies (HOMO and LUMO) and global reactivity parameters associated with chemical reactivity were calculated as 3D molecular descriptors.In this way, the gap in energy between HOMO and LUMO (∆E) allowed the estimation of the stability of the molecules.The electron affinity (EA) and ionization potential (IP) can be defined as EA = −E LUMO and IP = −E HOMO , respectively.Moreover, descriptors associated with the electronic structure and chemical reactivity such as the electronegativity (χ = (I + A)/2), chemical potential (µ = −χ), chemical hardness (η = ∆E/2), chemical softness (σ = 1/2η), electrophilicity index (ω = µ2/2η) and nucleophilicity index (N = 1/ω) were also calculated [15,16].

Molecular Descriptor Calculation and Reduction
For the development of the QSAR model, a new set of 5663 molecular descriptors (MDs) and 166 MACCS (Molecular ACCess System) fingerprints were computed in the alvaDesc software [17].The molecular descriptor is a useful number (or the result of a standardized experiment) obtained from a well-defined mathematical algorithm applied to a symbolic representation of molecules [18].In a first attempt to reduce the data dimensionality, descriptors with constant values and near-constant values were excluded, along with descriptors with missing values.These descriptors were merged with the 3D ones calculated in Gaussian.Subsequently, unsupervised variable reduction based on the algorithm of Wootton, Sergent and Phan-Tan-Luu (V-WSP) was applied [19].The V-WSP method uses a correlation threshold (defined by the user) in order to reduce the presence of descriptors with redundancy, multicollinearity and noise in such a way as to obtain an optimal pool of descriptors with minimal correlation in multidimensional space.

Molecular Descriptor Selection
For model development, a crucial step is the selection of a pool of the most important molecular descriptors for multiple linear regression (MLR) based on ordinary least squares (OLS).Among the diverse methods available for the variable selection, the genetic algorithm-variable subset selection (GA-VSS) technique [20] was coupled with the OLS method in order to find the optimal subset of molecular descriptors.This supervised variable selection starts with an initial random population of models (i.e., chromosomes) constituted by binary vectors indicating the presence (or absence) of each descriptor in the model.Then, new models are created through an evolutionary process by combining chromosomes (models) of the initial population (crossover), as well as by randomly including (or excluding) descriptors (mutation).During the evolution of the population, the root-mean-square error (RMSE) in Venetian blinds cross-validation is optimized.The model development, along with the descriptor selection, was performed in the alvaModel software [21].

Validation of the Model
The merit of a QSPR model is related to the ability to correctly predict the properties of an external set of VOCs and SVOCs, commonly not considered during the calibration of the model.Consequently, the dataset should be split into a training set and a test set in such a way as to guarantee a structure-property representation of the test set molecules in the space defined by the volatile and semi-volatile compounds of the training set.To this end, the Balanced Subsets Method (BSM) [22] was used for the partition of 102 compounds in the dataset.The essence of the BSM is to create groups based on k-means cluster analysis in order to identify similar volatile and semi-volatile compounds according to the descriptors (structure) and the retention index (property).During the genetic algorithm-supervised selection, the training set was used to calibrate the MLR model, while the test set was used to measure the predictive ability of the QSPR model.The computational model was further validated by the leave-one-out and leave-many-out cross-validation approaches.The former procedure excludes one compound from the training set at a time, while the latter randomly excludes a user-defined percentage of the molecules at each iteration.In the leave-many-out cross-validation, we applied the Venetian blinds approach, Monte Carlo random sub-sampling validation and the Bootstrap method [23].

Applicability Domain
Since QSPR models are reductionist models associated with the reliability of the predictions of certain molecules, it is necessary to define the theoretical region (chemical space) defined by the MDs within the model that performs reliable predictions for new molecules.In this work, the leverage approach was used to determine if a compound of the test set lies within (or outside) this theoretical space [24].This approach quantifies the distance of each test molecule with respect to the centroid of training compounds defined by the Hat matrix (X matrix of descriptors only).Thus, it is possible to define a threshold value when the warning leverage is equal to h* = 3p/n (p is the number of parameters in the QSPR model, and n is the number of training set molecules).Then, if the leverage value of each volatile or semi-volatile organic compound in the test set (h ii ) is lower than the defined threshold, the predicted I can be considered reliable.The applicability domain analysis was performed in the alvaModel software [21].

Results and Discussion
After the exclusion of non-informative alvaDesc descriptors, a pool of 3006 MDs was merged with the 11 quantum descriptors.Then, V-WSP variable reduction was applied at a threshold value of 0.95 (thr = 0.95).Thus, 1237 descriptors were submitted to supervised selection through GA-VSS coupled with the OLS method to develop the QSPR model.For splitting the 102 VOCs and SVOCs, the BSM was used to define the training set (71 molecules) and the test set (31 molecules).Refer to Table S1 for the training set and test set assignments.Subsequently, the 71 training molecules were used for GA-VSS coupled with the OLS method in order to search for the optimal pool of MDs.During the supervised selection, the root-mean-square error (RMSE) in the Venetian blinds cross-validation was optimized (minimized) to avoid overfitting the models.The coefficient of determination (R 2 ) was also considered as a parameter of the model.Thus, a three-variable model was retained as the QSPR model for further analysis: The statistical parameters for the training set (R 2 = 0.920 and RMSEC = 71.78)and the test set (R 2 = 0.897 and RMSEP = 81.50)reflected negligible differences and indicated the absence of potential overfitting in the model.Consequently, an appropriate QSPR model for predicting the I property was achieved.In addition, the model derived by Equation (1) was subjected to several cross-validation approaches: leave-one-out (R 2 = 0.869 and RMSECV = 91.74);Venetian blinds (R 2 = 0.872 and RMSECV = 90.62);Monte Carlo, leaving 20% out with 1000 iterations (R 2 = 0.870 and RMSECV = 90.74);and the Bootstrap method with 1000 iterations (R 2 = 0.867 and RMSECV = 92.97).Details of the I predicted by Equation ( 1) are available in Table S1, while the numerical values for the three MDs are available in Table S2.
Figure 1a shows the relationship between the experimental and predicted retention indices obtained by Equation ( 1).This figure suggests that this property has a linear relationship around the perfect fit line.Complementarily, Figure 1b shows the dispersion of the residuals vs. the predicted I, which suggests the random distribution of residuals around the zero line.In this model, no training molecules lie outside the limit of ± 2 times the RMSEC.
relationship around the perfect fit line.Complementarily, Figure 1b shows the dispersion of the residuals vs. the predicted I, which suggests the random distribution of residuals around the zero line.In this model, no training molecules lie outside the limit of ± 2 times the RMSEC.The molecular weight is a zero-order constitutional index calculated as the sum of the atomic weights of all of the atoms present in a molecule; that is, it describes the molecular size of both volatile and semi-volatile organic compounds [18].Since this descriptor represents a linear group contribution (atomic masses) for predicting the retention index property, the larger the MW for a given compound, the greater the retention index of that compound.In a recent study, the synergistic effect of MW for predicting the I property in comprehensive two-dimensional gas chromatography combined with quadrupole-mass spectrometry (GC × GC/qMS) using a BPX5 and BP20 column coupled system was presented [25].On the other hand, MDEN-23 is a descriptor belonging to the molecular distance-edge (MDE) vector [26].This descriptor measures the topological distances (2D) between nitrogen atoms, in particular, type 2 for secondary (-NH-) and type 3 for ternary (-N<) nitrogen atoms.Thus, the synergistic effect on the retention time could be related to the silanophilic effect.In fact, when dealing with the I of pesticides in ultrahigh-performance liquid chromatographyelectrospray ionization quadrupole-Orbitrap (UHPLC/ESI Q-Orbitrap) mass spectrometry (MS), using the Hypersil Gold selectivity column (and the guard column Accucore aQ), it was shown that the slow elution of polar molecules (pesticides) through the stationary phase was caused by the high affinity of basic amine compounds for the silica surface of the column (which was constituted by active or acidic silanol groups) [27].
The model presented here also considers a conformation-dependent descriptor belonging to the 3D-MoRSE (3D Molecule Representation of Structures based on Electron diffraction).These descriptors transform the 3D coordinates of compounds into a molecular code by applying a modified equation used in electron diffraction studies for preparing theoretical scattering curves [28,29].Specifically, the Mor13v descriptor considers the scattering parameter s = 12 Å −1 and weighs the atoms by the corresponding van der Waals volume.This weighting scheme considers a minimum effect of hydrogen atoms, decreases the contribution of nitrogen, oxygen and fluorine, and attributes significant influence to silicon, phosphorus, bromine and iodine atoms.Since the coefficient of Mor13v is negative, this descriptor could be related to the contribution of the volume of the volatile and semi-volatile compounds, expressed in terms of the van der Waals volume.Thus, higher retention indices are related to molecules having scaffolds with fragments (pairwise atoms) with high volume, which in turn interact with the stationary phase and delay the time of elution.
Finally, the applicability domain of the QSPR model was analyzed in order to define the theoretical space where the model makes reliable predictions of the retention indices of new VOCs or SVOCs (i.e., interpolations).The leverage approach (Figure 2) established a threshold limit of h* = 0.1408, which indicates that predictions were reliable for only VOCs or SVOCs with a leverage value below this threshold limit; that is, predictions are the result of interpolation of the model (i.e., reliable).In this work, no test set compounds fell outside the AD of the model, indicating that Equation (1) makes reliable predictions of the retention index property and, consequently, could be useful for eliciting this property of new VOCs and SVOCs of different coffee samples.

Conclusions
In this study, we developed a computational model based on the quantitative structure-property relationship for the retention indices of 102 volatile and semi-volatile organic compounds detected in coffee samples.A 3D chemical structure based on the PM6/ZDO of each molecule was used to calculate several molecular descriptors and fingerprints.To handle the large number of variables, V-WSP unsupervised reduction was applied to obtain a pool of useful descriptors to be used to calibrate the multiple linear regression model coupled with the genetic algorithm-variable subset selection.The threedescriptor predictive model was validated through several internal and external criteria, according to the five principles stated by the OECD, to make it applicable to predict the retention indices of new volatile and semi-volatile organic compounds present in coffee samples of diverse origin by means of HS-SPME-GC-TOFMS quantified in DVB/CAR/PDMS.

Supplementary Materials:
The following are available online at www.mdpi.com/xxx/s1,Table S1: Details of the dataset: chemical names, PubChem CID, CAS registry number, canonical SMILES and retention indices (I) for 102 VOCs and SVOCs by HS-SPME-GC-TOFMS, as well as the training set and test set assignments using the BSM technique; Table S2: Numerical values for the three molecular descriptors for each of the 102 VOCs and SVOCs in the computational model.

Conclusions
In this study, we developed a computational model based on the quantitative structureproperty relationship for the retention indices of 102 volatile and semi-volatile organic compounds detected in coffee samples.A 3D chemical structure based on the PM6/ZDO of each molecule was used to calculate several molecular descriptors and fingerprints.To handle the large number of variables, V-WSP unsupervised reduction was applied to obtain a pool of useful descriptors to be used to calibrate the multiple linear regression model coupled with the genetic algorithm-variable subset selection.The three-descriptor predictive model was validated through several internal and external criteria, according to the five principles stated by the OECD, to make it applicable to predict the retention indices of new volatile and semi-volatile organic compounds present in coffee samples of diverse origin by means of HS-SPME-GC-TOFMS quantified in DVB/CAR/PDMS.

Figure 1 .Figure 1 .
Figure 1.(a) Experimental versus predicted retention indices for the volatile and semi-volatile organic compounds of coffee detected in the DVB/CAR/PDMS fiber in the HS-SPME-GC-TOFMS technique.(b) Standardized residuals versus the predicted retention indices for the VOCs and SVOCs identified in coffee samples.Training set molecules are labeled in black, and test set compounds are labeled in blue.The mechanism of action of the retention index phenomenon presented in Equation (1) was constituted by two conformation-independent molecular descriptors: molecular weight (MW) and the molecular distance edge between all secondary and tertiary nitrogen atoms (MDEN-23).In addition, signal 13/weighted by van der Waals volume (Mor13v) conformation-dependent descriptor complements the topological information of the retention phenomenon.The maximum correlation (R 2 = 0.189) between MW and Mor13v

Figure 2 .
Figure 2. William plot for defining the applicability domain of the QSPR model.Training set molecules are labeled in black, and test set compounds are labeled in blue.

Author
Contributions: C.R.: conceptualization, methodology, formal analysis, validation and writing-original draft preparation; C.D.A.L.: methodology, software, formal analysis and writingoriginal draft preparation; E.C.A.: methodology, software, formal analysis and writing-original draft preparation; P.V.M.A.: software, data curation and validation; D.M.: data curation and validation.All authors have read and agreed to the published version of the manuscript.

Figure 2 .
Figure 2. William plot for defining the applicability domain of the QSPR model.Training set molecules are labeled in black, and test set compounds are labeled in blue.