Chemical Characterization of Marrubium vulgare Volatiles from Serbia

Marrubium vulgare is a cosmopolitan medicinal plant from the Lamiaceae family, which produces structurally highly diverse groups of secondary metabolites. A total of 160 compounds were determined in the volatiles from Serbia during two investigated years (2019 and 2020). The main components were E-caryophyllene, followed by germacrene D, α-humulene and α-copaene. All these compounds are from sesquiterpene hydrocarbons class which was dominant in both investigated years. This variation in volatiles composition could be a consequence of weather conditions, as in the case of other aromatic plants. According to the unrooted cluster tree with 37 samples of Marrubium sp. volatiles from literature and average values from this study, it could be said that there are several chemotypes: E-caryophyllene, β-bisabolene, α-pinene, β-farnesene, E-caryophyllene + caryophyllene oxide chemotype, and diverse (unclassified) chemotypes. However, occurring polymorphism could be consequence of adaptation to grow in different environment, especially ecological conditions such as humidity, temperature and altitude, as well as hybridization strongly affected the chemotypes. In addition, this paper aimed to obtain validated models for prediction of retention indices (RIs) of compounds isolated from M. vulgare volatiles. A total of 160 experimentally obtained RIs of volatile compounds was used to build the prediction models. The coefficients of determination were 0.956 and 0.964, demonstrating that these models could be used for predicting RIs, due to low prediction error and high r2.


Introduction
Marrubium vulgare L., also known as white horehound, is a perennial species from the Lamiaceae family. It is indigenous to the region between the Mediterranean Sea and Central Asia; however, today it is found worldwide, apart from the coldest regions and high altitudes [1]. This plant is highly resistant to drought and due to this it grows well in semiarid areas [2]. Additionally, as it is a moderate salt-tolerant species this medicinal plant could be grown on saline soil [3]. The surface of M. vulgare vegetative and generative organs is densely clothed with glandular and nonglandular trichomes which accumulate secondary metabolites [4]. M. vulgare produces structurally highly diverse groups of secondary metabolites, thus represents a valuable source of bioactive compounds and preparations with health-promoting effects: antioxidant, hepatoprotective, antiproliferative, anti-inflammatory, antidiabetic, and antimicrobial [5]. The use of this herb in traditional medicine is recorded worldwide for ameliorating chronic cough and cold, numerous conditions related to skin, liver, gastric, heart, and immune system [6]. Generally, M. vulgare is poor in essential oil, and the major compounds are diverse [1,[7][8][9][10][11][12][13][14][15][16][17].
The main aim of this investigation was to determine volatiles composition of M. vulgare grown in Serbia during two years and to compare its chemical composition with literature data not only of M. vulgare but with other species from this genus as well (M. anisodon, M. aschersonii, M. astracanicum, M. crassidens, M. deserti, M. duabense, M. parviflorum, M. peregrinum, M. persicum, M. propinquum, M. velutinum). Another goal was to establish the new quantitative structure retention relationship (QSRR) models for anticipating the retention indices (RIs) of certain compounds in M. vulgare volatiles obtained by GC-MS chromatography utilizing the genetic algorithm (GA) variable selection method and the boosted trees regression. Furthermore, we gather information about the volatile compounds of species from Marrubium genus in order to classify the chemotype of M. vulgare from this study according to unrooted cluster tree.

Results
The main components in M. vulgare volatiles were E-caryophyllene with 24.6% and 23.0%, followed by germacrene D with 9.6% and 17.0%, α-humulene with 5.2% and 5.3% as well as α-copaene with 3.3% and 6.1% in 2019 and 2020, respectively. All these compounds are from the sesquiterpene hydrocarbons class which was dominant in both years of the investigation, 52.0% in 2019 and 67.8% in 2020. This variation in volatiles composition could be a consequence of weather conditions, as in case of other aromatic plants [27][28][29][30][31][32][33].
The predicted RIs and molecular descriptors are presented in Table 1. Seven molecular descriptors were utilized for predictions of RIs in the two BRT models. The predicted RIs are presented in Figure 2, and visually confirm the adequate prediction capabilities of the constructed BRT by showing the relationship between the predicted and experimental retention values. In order to calculate the molecular descriptors, the PaDel-descriptor was used in this investigation. Due to a great amount of data that was obtained, it was required to select the most important set of descriptors to build the adequate model which would be able to predict the RIs [55]. The factor analysis was done before the GA calculation, and only ca. 320 uncorrelated descriptors remained in the GA calculation [56,57]. The seven most significant molecular descriptors chosen by GA are as follows: four 2D autocorrelation descriptors (AATSC4e, AATSC2p, GATS6v and MATS5v), two Barysz matrix descriptors (VR1_Dzs and SM1_Dzv) and Vertex adjacency information (magnitude) descriptor (VA-djMat).
The predicted RIs and molecular descriptors are presented in Table 1. Seven molecular descriptors were utilized for predictions of RIs in the two BRT models. The predicted RIs are presented in Figure 2, and visually confirm the adequate prediction capabilities of the constructed BRT by showing the relationship between the predicted and experimental retention values. Separation of compounds in GC-MS and their RIs is linked to their affinity towards mobile and stationary phases. Affinity and solubility of separated molecules directly depend on their chemical structure and physico-chemical properties, which could be expressed by molecular descriptors. According to Pearson's correlation coefficients, there

Separation of compounds in GC-MS and their
RIs is linked to their affinity towards mobile and stationary phases. Affinity and solubility of separated molecules directly depend on their chemical structure and physico-chemical properties, which could be expressed by molecular descriptors. According to Pearson's correlation coefficients, there was a rather poor correlation between all 3D autocorrelation descriptors (Table 2). Therefore, utilized molecular descriptors were appropriate to predict RIs of compounds in M. vulgare by the two multivariate BRT models [58]. Detailed explanations about the descriptors were found in the Handbook of Molecular Descriptors [59]. These descriptors encode different aspects of the molecular structure and were applied to develop the QSRR model. According to Pearson's correlation, there was a rather poor correlation between all molecular descriptors. Hence, utilized descriptors were appropriate to predict RIs of compounds isolated from M. vulgare volatiles by the two multivariate BRT models. The calibration and predictive capability of a QSRR model should be tested through model validation. The most widely used squared correlation coefficient (r 2 ) can provide a reliable indication of the fit of the model, thus, it was employed to validate the calibration capability of a QSRR model.
In order to explore the nonlinear relationship between RIs and the descriptors selected by GA, BRT technique was used to build the two predictive models. Two BRT models were constructed to predict the retention time of compounds isolated from M. vulgare volatiles, respectively. The coefficients of determination were 0.956 and 0.964, respectively, indicating that these models could be used for prediction of RIs, due to low prediction error and high r 2 . The tests of the two BRT models fit (2019 and 2020) are shown in Table 3, with the higher r 2 values and lower χ 2 , MBE, RMSE, and MPE values showing the better fit to the experimental results [60,61]. Obtained results reveal the reliability of the BRT models for predicting the RIs of compounds in M. vulgare volatiles obtained by GC-MS analysis. The influence of the seven most important molecular descriptors, identified by using genetic algorithm on the RIs was studied in this section. According to the Figure 3, VAdjMat was the most important molecular descriptor for chemical compounds' RIs calculation in M. vulgare, during 2019, while VR1 Dzs was the most important variable during 2020.

Discussion
According to the cluster analysis (unrooted cluster tree) with 37 samples of Marrubium sp. volatiles from literature and average values from this study (Figure 4), it could be said that there are several chemotypes, but only E-caryophyllene chemotype [9,12,36,38,47,50,51] is clearly segregated. However, these are samples of M. vulgare, M. incanum, M. parviflorum, M. peregrinum, and M. crassidens grown in Serbia, Poland, Slovakia, Egypt, and Iran. This indicated that genus Marrubiumis very diverse in the case of volatiles composition.
Occurring polymorphism could be a consequence of adaptation to grow in different environments [19,62], especially ecological conditions such as humidity, temperature and altitude [22] as well as hybridization [20] strongly affected the chemotypes, as well as bi-

Plant
Occurring polymorphism could be a consequence of adaptation to grow in different environments [19,62], especially ecological conditions such as humidity, temperature and altitude [22] as well as hybridization [20] strongly affected the chemotypes, as well as biotic and abiotic stresses (including temperature, light, water, salt, and oxidative stresses) [63].
Detected compounds in M. vulgare volatiles obtained by GC-MS analysis were used for QSRR analysis. The following seven molecular descriptors that characterize the RIs of obtained compounds were suggested by the genetic algorithm. Selected molecular descriptors were not autocorrelated which was suggested by a correlation coefficient matrix; thus, descriptors were suitable for QSRR analysis. These descriptors were utilized as inputs for the boosted trees regression models, for estimating the RIs using a set of GC-MS data from a series of 160 compounds found in M. vulgare volatiles. Statistical models that quantify the relation between the structure of molecules and their chromatographic RIs were represented by the quantitative structure retention relationship (QSRR) model [58,64]. Numerous publications related to the QSRR analysis in plants from Lamiaceae family could be found in the literature: Thymus vulgaris [65], T. serpyllum [66], Satureja kitaibelii [55], Salvia officinalis [67], as well as Stachys sp. [68]. The connection between the molecular descriptors and the retention time can be established by artificial neural network, machine learning algorithms [69][70][71][72][73] or by boosted trees regression (BRT) [74].

Volatiles Isolation and Analysis
Flowering aerial parts of M. vulgare (Marrubii herba) were collected during July 2019 and 2020, dried in a solar dryer at a temperature of 40 • with air circulation. After drying, plant material was fragmented, and volatiles was extracted by Clevenger apparatus. Taking in account that M. vulgare produces trace amounts of essential oil, it was trapped in nhexane. This process was performed in tree repetition for both years, as well as analysis of chemical composition.
GC-MS analysis was carried out using an Agilent 7890A apparatus equipped with a 5975 C MSD, FID and a nonpolar HP-5MS fused-silica capillary column (30 m × 0.25 mm, film thickness 0.25 µm). The carrier gas was helium, and its inlet pressure was 19.6 psi and linear velocity of 1 mL/min at 210 • C. The injector temperature was 250 • C, injection volume was 1 µL, split ratio, 10:1. Mass detection was carried out under source temperature conditions of 230 • C and interface temperature of 315 • C. The EI mode was set at electron energy, 70 eV with mass scan range of 40-600 amu. Temperature was programmed from 60 to 300 • C at a rate of 3 • C/min. The components were identified based on their linear retention index relative to C8-C32 n-alkanes, compared with data reported in the literature (Adams4 and NIST11 databases). The relative percentage of the oil constituents was expressed as percentages by FID peak area normalization.

QSRR Analysis
PaDel-descriptor software was used to calculate specified molecular descriptors [75], as described in our previous investigation [66]. Factor analysis and genetic algorithm (GA) were used to determine the most important descriptors [76,77]. The relationship between the chosen descriptors was examined and collinear descriptors were excluded. Statistica 10 software was used for the statistical investigation of the data [78].

BRT Model
In order to relate and to predict categorical or continuous dependent variables the BRT model could be used [79,80], as it does not require transformation or outliers [81]. The BRT method calculation is connected to the boosting methods enforced to regression trees [82]. The main idea is to calculate a set of simple trees, where each successive tree is built for the prediction residuals of the preceding tree [83]. This method builds binary trees such as partition the data into two samples at each split node [78].
The decision trees are combined through a cross-validation or "boosting" procedure in order to acquire the single computational model [84]. BRT modeling consists of the following steps: (a) an initial regression tree is defined according to a minimum loss function; (b) the other trees are engaged in the iterative process in which several new regression trees were developed and selected to the subsequent according to the StatSoft Statistica's recommendation-the least square error (LSE); (c) step (b) is repeated until a stopping criterion is reached (for instance, the value of LSE).
In this study, several regularization parameters were set in order to optimize the fit between experimental results and computing model: the number of trees (between 100 and 1000), learning rate (between 0.0005 and 0.1), random test data proportion (0.1-0.9) and subsample proportion (0.1-0.9). According to Statistica's manual, prior to computation, a subsample of data is created, according to random test data proportion of the cases, and these data are treated as test samples used to evaluate the appropriate fit of the model. The remaining set of data is used for the analyses via stochastic gradient boosting (for the selection of samples for consecutive boosting steps).

Cluster Analysis
The cluster analysis (CA) was used to evaluate intra-and interpopulation variability and differentiation of volatile constituents of Marrubium samples collected in different locations and/or taken from literature reports. The phylogenetic tree diagram for Marrubium samples was calculated and plotted using R software 4.0.3 (64-bit version). The R package "ape" (Analysis of Phylogentics and Evolution) was used for calculation, applied as a graphical tool to represent the arrangements of similar volatiles concentration (evaluated in the cluster analysis). The obtained experimental results were collected in the matrix, after which the hierarchical cluster analysis was performed. The distance matrix was determined using the Euclidean method, while the cluster analysis was performed using the "complete" method.
The results demonstrated that the boosted trees regression models were adequate in predicting the RIs of the compounds in M. vulgare volatiles obtained by GC-MS analysis on a HP-5MS column. The coefficients of determination were 0.956 and 0.964 (for compounds found in M. vulgare volatiles, during the years 2019 and 2020, respectively), which is a good indication that these models could be used as a fast mathematical tool for prediction of RIs, due to low prediction error and moderately high r 2 . Suitable models with high statistical quality and low prediction errors were derived, and it could be further used for estimation of RIs of newly detected compounds.
According to the unrooted cluster tree with 37 samples of Marrubium sp. volatiles from literature and average values from this study, it could be said that there are several chemotypes: E-caryophyllene, β-bisabolene, α-pinene, β-farnesene, E-caryophyllene + caryophyllene oxide chemotype, and diverse (unclassified) chemotypes. However, occurring polymorphism could be a consequence of adaptation to grow in different environments, especially ecological conditions such as humidity, temperature and altitude, as well as hybridization which strongly affected the chemotypes. Further research on M. vulgare chemotypes needs to be focused on genetic markers, because evaluation of genetic diversity has key importance in improving the quality of raw material used for industrial purposes.