Self-Organizing Maps of Molecular Descriptors for Sesquiterpene Lactones and Their Application to the Chemotaxonomy of the Asteraceae Family

The Asteraceae, one of the largest families among angiosperms, is chemically characterised by the production of sesquiterpene lactones (SLs). A total of 1,111 SLs, which were extracted from 658 species, 161 genera, 63 subtribes and 15 tribes of Asteraceae, were represented and registered in two dimensions in the SISTEMATX, an in-house software system, and were associated with their botanical sources. The respective 11 block of descriptors: Constitutional, Functional groups, BCUT, Atom-centred, 2D autocorrelations, Topological, Geometrical, RDF, 3D-MoRSE, GETAWAY and WHIM were used as input data to separate the botanical occurrences through self-organising maps. Maps that were generated with each descriptor divided the Asteraceae tribes, with total index values between 66.7% and 83.6%. The analysis of the results shows evident similarities among the Heliantheae, Helenieae and Eupatorieae tribes as well as between the Anthemideae and Inuleae tribes. Those observations are in agreement with systematic classifications that were proposed by Bremer, which use mainly morphological and molecular data, therefore chemical markers partially corroborate with these classifications. The results demonstrate that the atom-centred and RDF descriptors can be used as a tool for taxonomic classification in low hierarchical levels, such as tribes. Descriptors obtained through fragments or by the two-dimensional representation of the SL structures were sufficient to obtain significant results, and better results were not achieved by using descriptors derived from three-dimensional representations of SLs. Such models based on physico-chemical properties can project new design SLs, similar structures from literature or even unreported structures in two-dimensional chemical space. Therefore, the generated SOMs can predict the most probable tribe where a biologically active molecule can be found according Bremer classification.


Introduction
The Asteraceae family is one of the largest families of angiosperms in the World. The main secondary metabolites isolated from the species of Asteraceae are monoterpenes, sesquiterpenes, sesquiterpene lactones-SLs [1], polyacetylenes [2], flavonoids [3,4], benzofurans and benzopyrans [5], coumarins [6], diterpenoids [7] and triterpenoids [8]. All of the classes currently have a number of representatives (at least several hundred) that are considered to be satisfactory for studying the chemotaxonomy.
The first classifier was Cassini, who published a diagram in 1816 showing the interrelationships of 19 tribes [12]. In 1976, Carlquist [15] divided the Asteraceae into two subfamilies, based on morphological studies: Asteroideae and Cichorioideae. Wagenitz [16] also proposed a division, in the same year, into two subfamilies that differed from Carlquist by including the Eupatorieae tribe in the Asteroideae instead of the Cichorioideae. This diphyletic view of the family was a major step in understanding the relationships between the Asteraceae tribes.
In 1987, Bremer presented a cladogram of the Asteraceae that was based on 81 characters, 10 of which were chemical [17]. The remaining characters were mostly morphological and DNA characteristics. This study is an example of classification that incorporates chemical characters combined with morphological and molecular characters. Years later, Bremer unveiled a new classification of the Asteraceae that was based mainly on morphology, proposing four subfamilies: Asteroideae (Ast), Cichorioideae (Cic), Carduoideae (Car) and Barnadesioideae (Bar) [18]. Bremer placed the Mutiseae tribe on a branch (clade) that was not well-positioned between Barnadesioideae and Carduoideae ( Figure 1).
In 2005, Funk et al. produced a "supertree" that showed the phylogeny of the Asteraceae family [19], using the most cited studies as well as unpublished data that were provided by contributing authors.
Secondary metabolites have a restricted distribution and specific botanical sources. In the Chemistry of Natural Products, secondary metabolites are important chemotaxonomic markers [20,21]. Chemosystematics consists of classifying organisms through chemical characteristics, thus providing some answers and/or proposals for a greater understanding of evolution [22][23][24][25]. Some chemosystematics studies tried to relate the oxidation of secondary metabolites with the phytochemical evolution and diversity based on that oxidative pathways in plants occur parallel to protective mechanisms against oxidative degradation [23,25].
There are many different types of molecular descriptors, which are often distinguished by their physico-chemical meanings or by the specific mathematical tools that are used for their calculation, such as the DRAGON v. 6.0 program [26]. The use of robust methodology to attempt to detect groupings and patterns that may be unclear even for trained human experts generally employs artificial neural networks (ANNs). ANNs are defined as computational models with structures that are derived from a simplified concept of the brain, in which a number of nodes, called neurons, are interconnected in a network-like structure. Because ANNs are not restricted to linear correlations and are capable of considering non-linear data correlations, they can be used efficiently for modelling, prediction and classification. The most commonly used ANN architecture for pattern recognition and classification is the self-organising map (SOM). This method can map multivariate data onto a two dimensional grid, grouping similar patterns near to each other [27,28]. In natural product chemistry there are few works applications of ANNs for classification of Asteraceae tribes using SLS [29,30]. Da Costa and coworkers used SOM with Radial Distribuction Function (RDF) descriptors of a set of 144 SLs of three tribes (Eupatoriae, Heliantheae and Vernonieae) of the family Asteraceae [29]. Hristozov and co-workers used two different molecular descriptors: Atom counts and with two supervised methods-counterpropagation neural networks and k-nearest neighbors (k-NN) [30].
In the present study, 1,111 SLs, which were extracted from 658 species, 161 genera, 63 subtribes and 15 tribes of the Asteraceae family, were registered in two dimensions using SISTEMATX, our in-house databank software, which allowed a chemotaxonomic analysis among the tribes of Asteraceae family, employing molecular descriptors (including average degree of oxidation), multiple linear regression and unsupervised neural networks. The results are compared with the systematic classifications proposed by Bremer [18]. Therefore, this work has the following objectives: (1) To compare the classification using SLs as chemical markers with the morphological and molecular classification used in the Bremer tree. Such models based on physico-chemical properties can project new design SLs, similar structures from literature or even unreported structures in two-dimensional chemical space. Therefore, the generated SOMs can predict the most probable tribe where a biologically active molecule can be found according Bremer classification. (2) To verify the relationship between the degree of oxidation of the sesquiterpene lactones and the evolution of the Asteraceae tribes.

Data Generated from the Three-Dimensional Structures and Their Respective Chemical Occurrences, Using SISTEMATX
Data shown in Table 1 were obtained after the generation of botanical data. There are 1111 different sesquiterpene lactones, which correspond to 1979 chemical occurrences for 15 tribes, 63 subtribes, 161 genera and 658 species of the Asteraceae family. We define number of occurrences for a superior taxon, counting how many times a compound appears in determined species belonging to that taxon. The Heliantheae was found to be the tribe with the greatest number of genera, species, occurrences and compounds. Other tribes with large numbers were the Anthemidae, Eupatorieae, Vernonieae and Helenieae.

Molecular Descriptors and Average Degree of Oxidation (NOX/nC)
The 1,111 molecules obtained were used as input data in the DRAGON 6.0 program for calculating the descriptors. During the pre-treatment of data, constant variables were excluded for each block of descriptors, as well as those that presented the same value except in one case (near constant variable). For the remaining descriptors, pairwise correlation (r < 0.99) analysis was performed to exclude those highly correlated.
After the values of the descriptors were calculated, each of the 11 generated files, one for each block of descriptor, was attached to its respective botanical occurrence (tribe, subtribe, genus) and to its degree of oxidation (NOX/nC), i.e., the value equal to the number of oxidations divided by the number of carbon atoms present in each molecule. Subsequently, the mean of the descriptor values and the degree of oxidation was calculated for each tribe. The resulting file was used as input data for MobyDigs 1.1, to select the descriptors and to generate multiple linear equations.
Several statistically significant equations were obtained for each block of descriptors, to explain the variance in the values of the degree of oxidation between tribes. The equations yielding the highest values of Q cv 2 with only one descriptor were selected, as follows: Equation (1) contains a constitutional descriptor (AMW) and has high values of statistical coefficients (r 2 , Q cv 2 and F). This descriptor represents the mean molecular weight, including the atomic weights of the molecule. Sesquiterpene lactones mostly present only carbon, hydrogen and oxygen atoms; SLs with few hydrogen atoms, more double bonds and more oxygen atoms have larger AMW values, and consequently, have higher values for the degree of oxidation (NOX / nC).
A WHIM (Weighted Holistic Invariant Molecular) (L2v) descriptor is present in equation (2), which also showed high statistical coefficients. The L2v descriptor is the second eigenvalue of the covariance matrix of the molecule atomic coordinates, weighted by the van der Waals volume, and it accounts for the molecular size along the second principal direction. The SLs with an inverted configuration of the double bonds of the ring, in other words, Z and E configurations on the positions 1 and 4 respectively, show branches with angles that are closer to 90° relative to the plane of the ring. These ramifications were found to be rich in esters and hydroxyls ( Figure 2). Moreover, epoxidised SLs were found to have high L2v values.
Equation (3) contains a GETAWAY (Geometry, Topology, and Atom-Weights Assembly) descriptor (H5m). The degree of accessibility between the atoms that are separated by a topological distance of 5 and their respective atomic masses, are positively related to the value of H5m. Compounds with higher value of this descriptor have greater accessibility between oxygen with topological distance of 5. Considering a graph representation of a compound, topological distance is the number of edges (bonds) between two vertices (atoms), Sesquiterpene lactones with ester-rich ramifications have high H5m values. Table 2 was obtained through equation (1), which yields the best statistical indices.
It is evident, observing the degree of oxidation for 15 tribes (Table 2), that the Cichorioideae and Asteroideae tribes did not cluster and apparently, present no relationship between the degree of oxidation of the sesquiterpene lactones and the evolution of the tribes and for this family, which produce several metabolites classes, evolution not only occurs through oxidation process as have reported in others studies [25].

Self-Organising Maps (Kohonen) and Molecular Descriptors in the Chemotaxonomy of the Asteraceae Family Tribes
The following blocks of descriptors were generated from 2D molecular structures: Constitutional descriptors, which reflecting the molecular composition of a compound, Functional groups, that are groups of atoms having a characteristic and specific reactivity, Atom-centred descriptors defined as the number of specific atom types in a molecule, BCUT (Burden Eigenvalues) are eigenvalues of a modified connectivity matrix, 2D autocorrelations are spatial autocorrelations on a molecular graph weighted by atom physico-chemical properties, Topological are descriptors based on a graph representation of the molecule [31].
Geometrical are defined in several different ways, but always derived from the 3D structure of the molecule, RDF (Radial Distribution Function) are descriptors based on a radial distribution function which can be interpreted as the probability distribution of finding an atom in a spherical volume of radius R, 3D-MoRSE (3D-Molecule Representation of Structures based on Electron diffraction) are based on the idea of obtaining information from the 3D atomic coordinates by the transform used in electron diffraction studies, GETAWAY (GEometry, Topology, and Atom-Weights AssemblY) are related to the influence of the atoms in the determination of the molecular form (leverages), and to the distance between them, WHIM (Weighted Holistic Invariant Molecular descriptors) are geometrical descriptors based on statistical indices calculated on the projections of the atoms along principal axes [31].
The 11 files of botanical occurrences were used (one for each block of descriptors) for nine tribes, along with the values of the descriptors, as input data for the SOM toolbox 2.0 software. The nine tribes selected for analysis were those with the highest numbers of botanical occurrences (Table 3). SOM projects objects (chemical occurrence) from a multidimensional space (variables-molecular descriptors) into a space of lower-dimensionality (2D plane). In this projection, the similarity relationship between objects is conserved. Thus, Kohonen networks (SOMs) can be used for clustering of objects. SOMs accomplish two things, they reduce dimensions and display similarities, which make the SOM a powerful visualization tool. The training of these networks (Kohonen) is unsupervised, that is, the investigated property is not used during the training process. Each neuron in the grid is associated with a weight, and similar patterns stimulate neurons with similar weight, so that similar patterns are mapped near each other [28].
All of the blocks of descriptors, with the exception of the constitutional (66.7%) and the 3D-MoRSe (Molecule Representation of Structure based on Electron diffraction) (67.5%) descriptors, had overall hit rates that were above 80%. The maps measured 40 by 30, with 1200 neurons, except for the maps that were obtained using functional group descriptors (35 by 35, with 1225 neurons) and GETAWAY descriptors (40 by 35, with 1400 neurons).
In the represented maps, the chemical occurrences of certain tribes occupy regions that are labelled by the following colours:
The self-organising map (SOM) using the RDF (Radial Function Distribution) descriptors showed the largest overall hit rate (83.6%), this match rate is similar obtained by Hristozov and co-workers using the same kind of descriptors and supervised methods [30]. The SOM obtained using the constitutional descriptors had the worst hit rate. There was no significant difference regarding the percentage of hits among the other blocks of descriptors, indicating that functional groups or atom-centred descriptors, with respect to their simplicity of interpretation and acquisition, are sufficient to encode information of SLs that are differentiated according to their occurrence among the tribes of the Asteraceae family. Figure 3 shows some the most representative maps.
The Inuleae tribe had the worst hit rate; the maps could not differentiate this tribe from the others. The hit rates ranged from 27.6%, for the map that was obtained from using constitutional descriptors, to 56.7%, for the map obtained using the BCUT descriptors. This tribe's areas (pink neurons) were close to the Anthemideae tribe (blue neurons) in all of the maps. The low hit rate of the Inuleae tribe was a result of the neurons that were occupied by this tribe, which were mixed with neurons of the Anthemideae (blue regions) and Heliantheae (regions in red) tribes (Figure 3). Table 3. Results of the self-organising maps and their respective dimensions, with the values of the occurrences and the number of absolute and relative hits for the nine tribes of the Asteraceae family, using the blocks of descriptors that were generated by the DRAGON 6.0 program.   The Anthemideae tribe had the highest hit rates, the lowest value of 73.8% on the map using constitutional descriptors and the highest value obtained with the BCUT descriptors (94.2%). In all of the maps that were obtained, the regions occupied by this tribe (in blue) were different from those for the Eupatorieae (yellow), Vernonieae (green) and, to a lesser extent, the Heliantheae (red) and Helenieae (orange) tribes (Figure 3).

Constitutionals
The Senecioneae tribe had high hit rates, which were above 75%, except for the maps that were obtained by using the constitutional and 3D-MoRSE descriptors. In most of the maps, the regions occupied by this tribe (light blue) were near to the Anthemideae (blue) tribe, which agrees with the classifications proposed by Bremer, Jansen and Funk.
The Heliantheae tribe had high values of hit rates in all of the maps. This tribe had the highest number of occurrences and compounds in the present study, and its structural diversity, in terms of SLs, is displayed in the self-organising maps. Despite the fact that it is mainly concentrated in a specific region (red), this tribe occupies an extensive area in the SOMs. As previously mentioned, the regions of the Heliantheae tribe are close to those of the Helenieae (orange) and Eupatorieae (yellow) tribes ( Figure 3).
The Vernonieae tribe generated high hit rates, except for the maps that use constitutional (68.4) and 3D-MoRSE (66.5) descriptors. With the exception of the map that used the atom-centred block of descriptors, no proximity was noted between the Vernonieae tribe and the Lactuceae (grey), which also belongs to the same subfamily (Cichorioideae). This observation can be explained by the low number of occurrences (28) and compounds (17) of the Lactuceae tribe used in this study. This result helps elucidate the low hit rates that this tribe achieved in all of the maps, except for in the maps that used functional groups (71.4%) and topological (78.6%) and geometrical (71.4%) descriptors. This tribe (grey) occupied few neurons that are distributed in all of the maps, thus preventing the determination of a predominant region (Figure 3).
The Cardueae tribe (subfamily Carduoideae) presented high hit rates in all of the maps, except in the SOM obtained from using constitutional descriptors (68.6%). The regions that are occupied by this tribe (brown) were distributed between the Heliantheae (red), Eupatorieae (yellow), Vernonieae (green) and Anthemideae (blue) tribes.
The prediction performance of the most significant block of descriptors 2D and 3D, atom-centred and RDF, respectively for the nine training set and test set generated from original set, are shown on Tables 4 and 5.    Table 3 and show the same performance on test results. Inuleae is the only one that the prediction is insignificant (lower than 50%). The test results for the atom-centred descriptors are slight higher than SOMs using RDF (Table 5) which have the overall.
The chemical pattern of SLs from Asteraceae family, obtained by using molecular descriptors and neural networks is a powerful tool for bioprospecting studies. Since these models are built based on physico-chemical properties, some SLs proposed and design based on QSAR models [32][33][34][35][36][37][38][39] of drug design studies can be projected in these SOMs, which provide the following information: association of similar compounds with potential analogous biological activities and the probable tribe that these structures can be isolated. Some models with the same or similar tasks were proposed in literature as already cited work Hristozov and coworkers [30] and the "ChemGPS Chemical Space" proposed by Larsson and coworkers [40].

Acquisition and Registering of Structures of Sesquiterpene Lactones and Their Respective Botanical Occurrences
Data on the structure of the investigated molecule and its respective botanical occurrences were added to the SISTEMATX, based on a literature review [1]. In the SISTEMATX, the molecule must be associated with its class (sesquiterpene lactones) and with the respective skeleton. The botanical record is made in the "Botanical Data" module of the SISTEMATX. For the present study, Bremer's classification was followed [18]. In the SISTEMATX program, a new version of SISTEMAT [41], compounds are drawn in two dimensions. For the chemotaxonomic studies nine tribes with higher values of botanical occurrence were used (>50). Lactuceae tribe was included in the studies for two reasons: to analyse the chemical pattern with Vernonieae tribe (both belong to Cichorioideae subfamily) and, have a significant value of botanical occurrence/compounds (>1.5).

Acquisition of Three-Dimensional Structures of Sesquiterpene Lactones and Exportation of Botanical Data
The 3D coordinates of the SLs were generated by the SISTEMATX program, based on the 2D constitution data of the molecules that were drawn directly into the system, with the "Export Botanical Data" module, using the CORINA 3.2 software [42]. Three-dimensional structures generated can be saved in files under the formats .mol or .hin, thus allowing them to be used as input data for the generation of various molecular descriptors. All of the registered sesquiterpene lactones from the Asteraceae family were selected. The molecules were saved in MDL format files (.mol).

Molecular Modelling
Molecular modelling computations were performed on Hyperchem 8.0 for Windows [43]. The molecules were subjected to geometry optimisation and conformational analysis. The rotatable cyclic bonds were included as variable torsions and allowed to be changed simultaneously in the conformational search using Hyperchem's force field MM+. The search was performed applying a usage-directed search method which was terminated after energy minimization of 500 unique starting geometries. The resulting structures were energy minimized using semi-empirical quantum chemical method Austin model 1 (AM1), and the value of 0.001 kcal/mol was used for the root mean square gradient termination condition [44,45].
The molecules were saved as .mol (mdl) files, to compute various molecular descriptors using DRAGON Professional v. 6.0 [26].
Constant variables were excluded for each block of descriptors as well as those that presented only one different value in the series. For the remaining descriptors, pairwise correlation (r < 0.99) analysis was performed to exclude those that were highly correlated. The remaining descriptors, 32 from Constitutional, 35 from Functional groups, 40 from BCUT, 42 from Atom-centred, 38 from 2D autocorrelations, 59 from Topological, 42 from Geometrical, 150 from RDF, 160 from 3D-MoRSE, 188 from GETAWAY and, 99 from WHIM, were submitted to statistical analysis.

Calculation of Multiple Linear Regression
The MobyDigs 1.1 program [52] was used for the calculation of MLR models using a Genetic Algorithm (GA) [53]. Given the need to select the most statistically significant models, the search for the best models is usually performed using the ordinary least squares regression under the GA approach, i.e., through the genetic algorithm variable subset selection method. In the GA terminology, a population is characterised by a set of candidate variables (the genetic heritage of the population) and is composed of individuals, meaning that models are made of one or more populations of variables. The parameters used in GA setup of the MobyDigs program were: maximum number of models in a population = 50; maximum number of variables in a model = 3; trade-off parameter = 0.5 (both crossover and mutation are taken into account and the role played by the two processes is equally balanced); number of the best models for each size = 3; selection pressure (bias) = 0.5.

Self-Organising Maps
The whole dataset was presented to the network before any adjustments were made. In each training step, the dataset was partitioned according to the Voronoi regions of the map weight vectors. The neural network was trained until its convergence to minimal error. The correct prediction of these groups and the total correct prediction of the compounds were subsequently evaluated. All of the DRAGON descriptors that were previously selected by pairwise correlation analysis were analysed with SOMs in Matlab 6.5 and the SOM Toolbox 2.0 [54][55][56].
The SOM toolbox is a set of Matlab functions that can be used for the creation, visualisation and analysis of self-organising maps. The literature shows that the determination of the size of an SOM is an empiric process. Initially, a heuristic formula of m = 5.(n) 0.5 is typically used for the total number of map units, where n is the number of samples [28]. For the most significant models, to evaluate the prediction ability, the set was split into training and test set. The size of the test set was around 10% of the whole set, randomly selected, and it was generated nine series. This is important because there must be a trend between the learning of the network and its ability of prediction that is verified in the results. Training and test performance are evaluated by computing the ratio of the number of samples correctly classified by SOM.

Conclusions
No relationship could be established between the degree of oxidation of the sesquiterpene lactones and the evolution of the Asteraceae tribes. Several equations had highly significant coefficients, with only one descriptor of several blocks. Some structural features, which were related to the degree of oxidation, could be identified by interpreting molecular descriptors.
The self-organising maps obtained for the nine tribes, except Inuleae, had high hit rates and separated the tribes according to previously proposed classifications when using molecular descriptors. SOMs generated from atom-centred and RDF descriptors yield most significant results and can be used as a classification tool for lower hierarchical levels, such as tribes. Since these SOMs are built based on physico-chemical properties, they can be used in the search for SLs with potential biological activities with the respective taxonomic information. For this propose the authors are working to increase the database and improve the models.