Predicting Retention Times of Naturally Occurring Phenolic Compounds in Reversed-Phase Liquid Chromatography: A Quantitative Structure-Retention Relationship (QSRR) Approach

Quantitative structure-retention relationships (QSRRs) have successfully been developed for naturally occurring phenolic compounds in a reversed-phase liquid chromatographic (RPLC) system. A total of 1519 descriptors were calculated from the optimized structures of the molecules using MOPAC2009 and DRAGON softwares. The data set of 39 molecules was divided into training and external validation sets. For feature selection and mapping we used step-wise multiple linear regression (SMLR), unsupervised forward selection followed by step-wise multiple linear regression (UFS-SMLR) and artificial neural networks (ANN). Stable and robust models with significant predictive abilities in terms of validation statistics were obtained with negation of any chance correlation. ANN models were found better than remaining two approaches. HNar, IDM, Mp, GATS2v, DISP and 3D-MoRSE (signals 22, 28 and 32) descriptors based on van der Waals volume, electronegativity, mass and polarizability, at atomic level, were found to have significant effects on the retention times. The possible implications of these descriptors in RPLC have been discussed. All the models are proven to be quite able to predict the retention times of phenolic compounds and have shown remarkable validation, robustness, stability and predictive performance.


Introduction
Naturally occurring phenolic compounds are widespread among plants; they are synthesized during various metabolic pathways and their concentration varies over a wide range depending upon the plant [1][2][3][4]. They have significant importance during the current decade, due to their well-proven antioxidant, anti-aging, antimicrobial and immunomodulatory activities [5,6]. Phenolic compounds provide oxidative stability to foods and beverages, besides contributing health benefits [7][8][9]. A recent rising interest in the determination of phenolic compounds is mainly due to their potential protective roles against number of diseases associated with oxidative stress or initiated by free radicals, including coronary heart disease, stroke and cancer [10,11]. So the overwhelming beneficial attributes of phenolics requires detailed study of their structure and availability in different food items. For this purpose, separation as well as identification of these compounds is necessary. Numerous analytical approaches have been described in the literature for the analysis of variety of phenolics [12][13][14][15]. In this context, reversed-phase liquid chromatography-mass spectrometry is considered a practically state-of-the-art technique; as reversed-phase liquid chromatography (RPLC) provides better separation and mass spectrometry (MS) gives sensitive detection and confirms structures of compounds [16].
Quantitative structure-retention relationships (QSRRs) have gained wide attention in the area of separation science recently. These models are based on the relationship between structures and properties of compounds. Retention times of different compounds can be predicted from their formulae and even unknown compounds can be identified by using this method. In general, QSRR models attempt to predict the retention time of a molecule by characterizing it with a series of molecular descriptors. These models can effectively be used for the prediction of molecular structures, determination of retention times of new analytes and to understand the separation mechanism for a chromatographic system [17]. Several QSRRs have been developed to predict the retention times of different analytes on different systems [18][19][20][21][22][23][24]. Applications and implications of QSRR methodology in chromatography has recently been thoroughly reviewed and emphasized [25,26]. No comprehensive report describing the QSRR study of phenolic compounds from natural sources has been presented so far. Naturally occurring phenolic compounds belong to varied classes, have a range of simple to complex structures and therefore, need a compact statistical approach of QSRRs. The aim of this study is to develop statistically significant QSRR models, based on structural descriptors, for the prediction of retention times of naturally occurring phenolic compounds in RPLC. The approach consists of reduction of large descriptor pool to the most relevant descriptors with minimum multicollinearity and redundancy. The SMLR and UFS-SMLR have been used as supervised and unsupervised-supervised algorithms to reduce the descriptor pool. The selected descriptors are then used to generate ANN models with enhanced statistical significance. The study has generated reasonably stable, robust, and predictive models, which could provide an effective tool for predicting and analyzing the retention behavior of naturally occurring phenolic compounds in RPLC.

Results and Discussion
A total of 1519 descriptors were calculated from optimized structures of phenolics by use of MOPAC2009 and DRAGON version 3 softwares (Table 1). The descriptors were initially filtered by removing those with zero values, constant values for 50% of the compounds and variance less than 0.0005. This pretreatment left a total of 915 descriptors in the data, which were subsequently used for model generation. For QSRR development, data set of 39 phenolic compounds [27] was randomly split into a training set of 30 molecules and an external validation set of nine molecules. For the purpose of model generation, retention times (RT) were used as response variables.

Stepwise Multiple Linear Regression Model (SMLR Model)
The 915 descriptors, survived after initial filtration, were used to construct models by SMLR method using a sufficiently stringent criterion (F = 6 to enter, F = 3 to remove) in order to keep less number of descriptors in the model so as to avoid multi-collinearity. The five descriptor model based on training set for predicting retention times of phenolics is RT Equation 1 showed good stability as indicated by internal and external validation coefficients of determination. All the five descriptors exhibited very weak or negligible correlations with one another ( Table 2). Of all the descriptors, Ke, which appeared in step five of SMLR, showed somewhat more correlations with others, though not much significant, therefore, dropping this from the equation resulted in another equation with less number of descriptors and still of good statistical quality (Equation 2). Dropping step four descriptor Mor32e also resulted in a good model but it was comparatively poor in terms of external validation (Equation  The 915 descriptors left after pretreatment were subjected to UFS algorithm with R 2 max = 0.90, that decreased the data set to only 22 linearly independent descriptors with minimum multi-collinearity and redundancy (Table 3). Leaverage-weighted autocorrelation of lag 7/weighted by atomic poloarizabilities

GETAWAY
The SMLR method applied to UFS-selected descriptors produced a six descriptors model (Equation 4). This model is quite good in terms of the entire applied statistical criterion, though, less significant than SMLR model as indicated by the PRESS and co-efficient of determination statistics. y-Scrambling result for UFS-SMLR was found similar to SMLR model, though slightly of less quality with R 2 value 0.172.

Artificial Neural Network
The network architecture and validation statistics are given in Table 4. In this study, the whole data has been divided into three sets: training, test and validation sets. A test set is used for early stopping of training in order to avoid overfitting. Sometimes the test data alone may not provide an evidence of a good generalization an ANN e.g., it can be just a coincidence. To make sure that this is not the case, another validation set was used. This puts an extra check on the performance and generality of ANN. To make things clearer, the training, test and validation sets have been marked in Table 5. ANN models are better than both SMLR and UFS-SMLR models. Though, SMLR model is comparable to SMLR-ANN model, nevertheless, the real strength of artificial neural network mapping technique was observed for UFS-SMLR-ANN model, which showed considerably better prediction ability than the simple UFS-SMLR model as depicted by Q 2 ext . In ANN models, the global sensitivity analysis was performed which ranked the descriptors of SMLR-ANN model as HNar > DISPe > GATS2v > Mor32e and UFS-SMLR-ANN model as Mp > DISPm > Mor22v > IDM > Mor28e. The predictions of SMLR-ANN and UFS-SMLR-ANN are presented in Figure 3.

Interpretation of the Models
In case of selected SMLR model (Equation 2), HNar and GATS2v are 2D descriptors derived from molecular graph. HNar is the Narumi harmonic topological index related to molecular branching and represents the number of non-hydrogen atoms divided by the reciprocal vertex degree [28]. Its positive coefficient suggests that increase in HNar leads to an increase in RT. GATS2v is the Geary autocorrelation-lag2 weighted by atomic van der Waals volumes. The autocorrelation descriptors show the distribution of a certain property in the topological structure [29]. The GATS2v descriptor shows the distribution of atomic volume at a distance of two bonds in the topological structure of molecule. The negative coefficient of GATS2v is an indicative of decrease in RT with an increase in lag2 autocorrelations of atomic volumes on molecular graph. The descriptors DISPe and Mor33e are derived from three dimensional structures of the molecules. DISPe is the d COMMA2 value weighted by atomic senders on electronegativities and it represents the displacement between the geometric and the electronegativity centers of the molecule [30]. The positive coefficient of DISPe indicates that molecules with increased displacement between the geometric and the electronegativity centers will take more time to elute. Mor32e is the 3D-MoRSE signal-32, weighted by atomic Sanderson electronegativities. The 3D-MoRSE signals give three dimensional molecular representation of structure based on electron diffraction and contain information on mass distribution and branching within a molecule [31]. The negative coefficient for Mor32e suggests an inverse relation with RT. It follows, therefore, that molecule with more branching, less lag-2 autocorrelation of atomic volumes, enhanced displacement between the geometric and the electronegativity centers and low value of Mo32e descriptor will have more retention times in RPLC.
For UFS-SMLR selected model (Equation 5), Mp is a constitutional descriptor while IDM is a 2D topological descriptor. Mp is the mean atomic polarizability scaled on carbon atom, IDM is the mean information content on the distance magnitude. DISPm, and 3D-MoRSE signals are 3D descriptors. DISPm is the d COMMA2 value weighted by atomic masses, Mor22v and Mor28e are the 3D-MoRSE signals, 22 and 28, weighted by atomic van der Waals volumes and atomic Sanderson electronegativities, respectively. UFS-SMLR model also emphasized the importance of topological descriptor (IDM), atomic volume and atomic electronegativity based 3D descriptors of molecules for the retention behavior of phenolic compounds, as was observed in SMLR selected descriptors. Despite the weighing schemes, the behavior of three dimensional descriptors was similar in both approaches. 3D-MoRSE descriptors related negatively and 3D geometrical DISP descriptors related positively with the retention times in both types of models. This corresponds to similar effects of 3D descriptors in developed QSRRs. A positive coefficient for Mp is an indication of increase in retention time with increase in mean atomic polarizability. In phenolics, oxygen atom is largely present either as hydroxyl group (independent or as a part of carboxyl group) or as ether linkage. Based on the relative nature of carbon, hydrogen and oxygen, it is expected that a decrease in number of hydroxyl groups increases the Mp value. It therefore, suggests that molecules with more hydroxyl group will have low values of Mp and hence they are eluted earlier with the polar mobile phase due to greater number of polar hydroxyl groups in them and hence have less retention times. This behavior can be well observed in case of Gallic acid, Gentisic acid and Salicylic acid ( Table 5, Table S1 supplementary data) containing four, three and two hydroxyl groups with Mp values 0.64, 0.65 and 0.67, respectively. The other descriptor IDM also relates directly to RT suggesting an increase in RT with increase in its value. This descriptor provides mean information content on distance magnitude and it is expected to increase with increase in number of atoms in a molecule. Another descriptor DISPm is an indicative of conformational features of molecules. It is generally suggested that rigid molecules have low values of DISPm [29]. This descriptor relates directly to RT which suggests that rigid molecule will have less retention time. The foregoing discussion revealed that generally molecules with more hydroxyl groups, less number of atoms, rigidity and high values of 3D-MoRSE descriptors are eluted faster than others. Mathematical detail of the molecular descriptors is available in the Handbook of Molecular Descriptors [32].
Quantum mechanical descriptors failed to make any impact, whatsoever, on the models. Equations 2 and 5 and optimal artificial neural networks (Table 4) were used to predict the retention times of naturally occurring phenolic compounds. The predicted results are presented in Table 5, Figures 2 and 3 and residual plot for the developed models is presented in Figure 4.

Data for Retention Times of Phenolic Compounds
Data used to generate structure-retention relationship of phenolic compounds were obtained from a recently developed sharp method of their analysis in RPLC-MS system [27]. Briefly, the compounds were separated by gradient elution, using a reversed-phase C 18 analytical column (50 × 2 mm, 2.5 µm particle size; Phenomenex Synergi Fusion-RP100A) with a C 18 guard column (4 × 2 mm; Phenomenex Fusion-RP) maintained at 35 °C. The mobile phase used was deionized water (A) and acetonitrile (B); each containing 0.1% (v/v) formic acid in a linear gradient from 1% to 100% B during 9.5 min.

Descriptor Computation
Three dimensional structures of phenolic compounds, created by using Chemsketch, were optimized by the use of semi-empirical PM6 Hamiltonian with eigen vector following (EF) algorithm implemented in MOPAC2009 software [33]. Calculation of numerical descriptors from optimized geometries was performed usingMOPAC2009 and DRAGON, version 3 [34] softwares. Total number of calculated descriptors was 1519. Molecular weight (MW) descriptor was duplicated in both the softwares, therefore, MW only from MOPAC2009 was used in this study. Dragon was used to compute 1497 descriptors divided into 18 logical blocks and 23 descriptors were obtained from MOPAC2009 (Table 1).

Feature Selection and Model Generation
Step-wise multiple linear regression (SMLR) and unsupervised forward selection followed by step-wise multiple linear regression (UFS-SMLR) was used for feature selection. UFS is a technique to remove redundant and multi-collinear descriptors from the data set [35]. UFS was performed with ufs-1.8, obtained from the Centre for Molecular Design (CMD), University of Portsmouth, using R 2 max = 0.9. The subset of descriptors produced by UFS was later used to develop model by SMLR method. Before applying the regression method, all the data were standardized to zero mean and unit variance in order to avoid any biased nature of the calculated descriptors, which may lead to series errors in generation and application of the models. The standardized data were subjected to SMLR method for model generation.
ANN is a powerful multivariate data analysis technique, capable of both linear and non-linear modeling and has been widely used in modeling structure-property relationships [22,36,37]. An ANN mathematical model mimics the human brain intelligence system and consists of various interconnecting neurons organized in a sequential manner into an input layer, one or more hidden layers and an output layer. Each interconnection of the neurons has some numerical value (weight) associated with it. The signals are transmitted from the input layer to output layer through the neurons. The whole network is first trained on some data by adjusting the interconnection weights and is subsequently used to make predictions for external data. In the present study, optimal number of descriptors, selected by SMLR and UFS-SMLR techniques, was entered as continuous input signals into ANNs and output was the response variable RT. 500 ANNs were trained in both cases by the use of Statistica 8.0 automated artificial neural network implementation. Multilayer perceptrons (MLP) type network with feed-forward topology, Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm and normal randomization were used for ANNs training and sum-of-squares error function was used to test their performances. Identity, logistic, exponential and tanh activation functions both for hidden and output layer and number of hidden units from 3 to 8 were used in ANNs building. The models, exhibiting least external validation errors, were selected as optimal models. In ANNs building process, an early stopping technique was employed to avoid over-training of the ANN models. For this purpose, the training set was further sub-divided randomly into a subset of 25 molecules for training the ANNs and a subset of five molecules as a test set to avoid over-fitting. In the development of both SMLR descriptors based ANN (SMLR-ANN) and UFS-SMLR descriptors based ANN (UFS-SMLR-ANN), same subsets of training set were used. Further, for external validation of all the models, same external validation set of nine molecules was used.

Model Validation
Model validation is a requisite to assess the applicability of generated models. Several techniques are in use in chemometrics [38][39][40][41]. In the present study, models were validated both internally as well as externally and any chance correlation was tested by the use of a y-scrambling technique: a method frequently used for this purpose. Internal validation was performed by leave-one-out cross validation and external validation by applying the model on external validation set of nine molecules. The statistical quality of the model was judged by considering the sum of squares of prediction errors and the validation correlation coefficients Q 2 int & Q 2 ext for internal and external validation respectively (Equations 6 and 7, respectively).

PRESS
(6) where ŷ i is the predicted value, y i is the observed value for ith case in training or validation set as the case may be, and ў train is the mean of the training set. In above expressions, mean of the training set was used in order to have same standard reference for both internal and external validation statistics. However, using mean of validation set made almost no difference in the present study. For example, in case of SMLR model, Q 2 ext using training set mean was 0.769, while using validation set mean, it was 0.770. y-Scrambling was performed 500 times for the models in order to establish the stability of model and to negate any chance correlation. The statistical quality parameters of the scrambled models were compared with those of the original models. Performance of the selected ANN models was judged by the Q 2 ext statistics. All the statistical calculations were performed using Statistica 8.0 and MS Excel ® 2007.

Conclusions
SMLR, UFS-SMLR and ANN directed QSRR models have successfully been developed for predicting the retention times of naturally occurring phenolic compounds in the RPLC system. ANN models are more authentic in prediction of retention times of phenolics in RPLC than the other two approaches. SMLR model is comparable to SMLR-ANN, however, UFS-SMLR model was found less predictive than others. The models identified Mp, IDM, HNar, DISP, GATS2v and 3D-MoRSE (signals 22, 28 and 32), descriptors responsible for the retention of phenolic compounds. These descriptors signify the importance of branching, size, hydroxyl groups and 3D geometric,