Machine Learning Analysis of Essential Oils from Cuban Plants: Potential Activity against Protozoa Parasites

Essential oils (EOs) are mixtures of chemical compounds with a long history of use in the food, cosmetics, perfume, agricultural and pharmaceutical industries. The main objective of this study was to find chemical patterns between 45 EOs and antiprotozoal activity (antiplasmodial, antileishmanial and antitrypanosomal) using different machine learning algorithms. The analyses included 45 EO samples and used unsupervised Self-Organizing Maps (SOM) and supervised Random Forest (RF) methodologies. In the generated map, the hit rate was higher than 70%, and the results demonstrate that it is possible to find chemical patterns using both supervised and unsupervised machine learning approaches. A total of 20 compounds were identified (19 terpenes and one sulfur-containing compound) and compared with literature reports. These models can be used to investigate and screen EOs for antiprotozoal bioactivity more effectively and with less time and financial cost.


Introduction
An essential oil (EO) is a concentrated plant secondary metabolite composed of a mixture of volatile chemical compounds [1], with a long history of use in the food, cosmetics, perfume, agricultural and pharmaceutical industries [2]. In the last decade, almost 5000 articles related to uses of EOs have been published, with an increase of more than 7% per year [3]. In this scenario, EOs are growing in scientific, economic, and biological importance as alternatives to the synthetic compounds commonly used in industry [4][5][6].
In particular, numerous studies have demonstrated the wide pharmacological spectrum of EOs, including antimicrobial [7,8], antifungal [9], antiparasitic [10], antiviral [10,11], insecticidal [12], anticarcinogenic [13][14][15][16], immunomodulatory [17], anti-inflammatory and antioxidant [14] effects. Nevertheless, the chemical composition of the EOs obtained can differ depending on the method chosen to obtain them, the geographical origin, the season, the type of soil and the agricultural conditions in which the plants have grown. Thus, the same plant could produce different EO chemical composition profiles and therefore display different biological effects [3,8,14]. In this sense, some approaches have been used, such as the 'chemotype concept', with the aim of discerning the bioactivity of EOs based on their chemical profile. However, more complex questions have emerged due to the high complexity in the
The dataset of 45 EOs was analyzed to find a chemical pattern between the EOs and antiprotozoal activity, including three activities: antiplasmodial, antileishmanial and antitrypanosomal. The analysis started with the SOM, where the chemical composition of the 45 EOs was used as the information to find patterns with the antiparasitic activity. Among them, 21 had at least one of the analyzed activities (antiplasmodial, antileishmanial and antitrypanosomal), with median inhibitory concentrations (IC50) in in vitro cultures < 100 µg/mL. The remaining 24 EOs had no antiprotozoal activity, or their activity had not been reported.
In the generated map, the hit rate was higher than 84%. The SOM validation was then performed using the 5-fold external cross-validation technique [32,33]: the entire dataset is partitioned five times into a modeling set (training set) comprising 80% of the compounds and an external cross-validation set comprising the remaining 20%. Only the modeling set is used to build the models, which are then validated on the external set. Accordingly, the dataset was subdivided into five training groups and five test groups, always keeping the ratio between active and not reported EOs. The validation results are described in Table 1.
Table 1. Accuracy statistics of the training and test groups of the 5-fold external cross-validation of the Self-Organizing Map (SOM).
Analyzing Table 1, we see that the true positive rate (active EOs) and the true negative rate (EOs that did not display antiprotozoal activity or had not been reported) were higher than 0.7 in both training and test sets, showing that the SOM model is robust. Model accuracy summarizes the overall performance of the model as its overall hit rate, i.e., the proportion of EOs that the model classified correctly. Accuracy values vary between 0 and 1: the closer the accuracy is to 1, the higher the model's hit rate, and models with an accuracy equal to or greater than 0.7 are considered to have optimal performance [23,27].
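The stratified 5-fold partitioning described above can be illustrated with a minimal Python sketch. It is not the procedure used in the study (which relied on Matlab and the SOM Toolbox); the function name `stratified_kfold`, the seed, and the round-robin assignment are assumptions made only for illustration. The class sizes (21 active, 24 not reported) are taken from the text.

```python
import random

def stratified_kfold(labels, k=5, seed=0):
    """Split sample indices into k folds, keeping the class ratio in each fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    counter = 0  # running position, so that fold sizes stay balanced across classes
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the indices of this class round-robin across the folds.
        for i in idx:
            folds[counter % k].append(i)
            counter += 1
    splits = []
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        splits.append((train, test))
    return splits

# 21 active (1) and 24 not reported (0) EOs, as in the dataset described above.
labels = [1] * 21 + [0] * 24
splits = stratified_kfold(labels)
for train, test in splits:
    n_active = sum(labels[i] for i in test)
    print(len(train), len(test), n_active)
```

With 45 samples, each test fold holds 9 EOs (20%), of which 4 or 5 are active, and each training set holds the remaining 36 (80%).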

Classification of EOs
The SOM managed to find a chemical pattern between the chemical composition of EOs and antiprotozoal activity. In parallel, we checked whether this chemical pattern is also found by a supervised algorithm, RF. The RF model was generated using the 5-fold external cross-validation technique [32,33]: the entire dataset is partitioned five times into a modeling set (training set) comprising 80% of the compounds and an external cross-validation set comprising the remaining 20%. Only the modeling set is used to build the models, which are then validated on the external set. Performance was evaluated through statistics such as specificity and sensitivity, which reached satisfactory values that corroborate a model accuracy above 70%. The performances are shown in Table 2; these parameters are averages over the five models. During the creation of the model, we also checked the applicability domain to ensure that the samples tested were within the chemical space of each model. Table 3 shows the accuracy and the global hit rate of both models and indicates that the same chemical pattern could be obtained using unsupervised (SOM) and supervised (RF) machine learning. In both analyses, accuracy rates higher than 0.7 were obtained.
Table 3. Summary of test averages corresponding to 5-fold cross-validation using the different machine learning algorithms, self-organizing maps (SOM) and random forest (RF).
Figure 2 shows the U-matrix of the SOM, i.e., the visual analysis of the SOM. The U-matrix is constructed by measuring the Euclidean distance in the vector space between adjacent neurons [21,24,34]. The distances can be normalized and represented by colors or shades of gray [21,24]. The U-matrix represents the clusters mapped by the SOM, not the individual samples.
Forty-five EOs were used for the SOM analysis. After mapping the SOM, the 45 EOs were correctly grouped into active and inactive (EOs that do not display antiprotozoal activity or have not been reported). Groups were also separated according to their similarity and difference in chemical composition, which placed the EOs closer together or farther apart in the SOM. Thus, in the U-matrix, each square represents a group of EOs that are organized both by activity and chemical similarity, with the purple squares relating to active EOs and the yellow ones to inactive EOs.
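As an illustration of how a U-matrix is derived from the Euclidean distances between adjacent neurons, the following Python sketch computes, for each neuron, the mean distance to its grid neighbours. It is a simplified variant and an assumption on our part: the SOM Toolbox used in this study also stores the individual between-neuron distances in a larger matrix, and the toy weight vectors below are hypothetical.

```python
import math

def u_matrix(weights):
    """Mean Euclidean distance from each neuron to its 4-connected grid neighbours."""
    rows, cols = len(weights), len(weights[0])
    umat = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            # A 1 x 1 map has no neighbours; report 0.0 in that case.
            umat[r][c] = sum(dists) / len(dists) if dists else 0.0
    return umat

# Hypothetical 2 x 3 map: the two left columns hold similar prototypes,
# the right column holds a clearly different one.
weights = [
    [[0.0, 0.0], [0.0, 1.0], [5.0, 1.0]],
    [[0.0, 0.0], [0.0, 1.0], [5.0, 1.0]],
]
umat = u_matrix(weights)
for row in umat:
    print(row)
```

Low values mark neurons lying inside a cluster, whereas high values mark the boundaries ("valleys" in the colour scheme of Figure 2) that separate clusters.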

It is also worth noting that the U-matrix is a visual representation of the topological mapping of the SOM; the white squares are therefore valleys that separate the clusters that were generated.
Figure 2 also shows the principal component analysis (PCA) graph, which was generated from the correlation matrix of the EO dataset used in the generation of the SOM. PCA is used to reduce the dimensionality of the data and allow a better visualization of the clusters, since it allows the input data to be represented as linear combinations of their projections [23]. The PCA performed in this study has an explained variance of 25.47%; that is, the first two principal components alone explain about a quarter of the entire variance.
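For reference, the explained variance quoted above can be written in terms of the eigenvalues of the correlation matrix. The symbols are notation introduced here for illustration and do not appear in the original analysis: $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$ are the eigenvalues and $p$ is the number of descriptors. Because the trace of a correlation matrix equals $p$, the eigenvalues sum to $p$.

```latex
\[
\text{explained variance (first two PCs)}
  = \frac{\lambda_1 + \lambda_2}{\sum_{i=1}^{p} \lambda_i}
  = \frac{\lambda_1 + \lambda_2}{p}
  \approx 0.2547
\]
```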
While in the U-matrix the white squares represent valleys that separate the clusters, in the PCA graph neighboring cartographic units are connected by lines to make the map view clearer and more defined [23].
After the general analysis with the 45 EOs, the SOM was constructed considering the chemical patterns of each sample. The most significant molecules for separating the chemical patterns of active and not reported EOs, obtained with the SOM Toolbox, are shown in Figure 3. In total, 20 compounds present in the EOs were associated with at least one of the three biological activities, of which 19 are terpenes (10 monoterpenes and 9 sesquiterpenes) and one is a sulfur-containing compound. Terpene-type compounds therefore clearly predominate. The role of terpene compounds has been reviewed previously, suggesting promising therapeutic value against protozoa parasites [35][36][37].
The most significant molecules are identified by observing the region of active EOs in the U-matrix. Once the region is identified, we look for the most expressive molecules in that region. For example, the lower right corner of the U-matrix contains a purple region, indicating active EOs. We then observe which molecules are most representative of that region: (E)-β-ocimene, (Z)-β-ocimene and β-phellandrene. Note, in Figure 3, that the individual matrices of these molecules indicate their greater presence in the lower right region, the region of active EOs.
In a general comparison of the components listed in Figures 1 and 3, only three compounds appear both as major components of EOs and as significant molecules generated by the SOM strategy: camphor, piperitone and safrole. In general, pharmacological studies of EOs suggest that the major identified components could be responsible for the biological activity. However, some studies did not correlate the main compound with the antiprotozoal effect [38][39][40]. Thus, using the present model, we selected molecules present in EOs that can influence the antiprotozoal activity of the studied EOs, and the model could suggest other EOs based on the complete chemical composition and not only on the major components. In the data used, camphor was identified in 5 samples at concentrations from 0.1 to 17.1%, piperitone was present in 7 samples ranging from 0.1 to 23.7%, and safrole was documented in 3 samples from 1.6 to 71.8% [17]. Regarding antiprotozoal activity, among the samples containing these compounds at concentrations higher than 5%, camphor, for example, was reported in the EOs from Piper aduncum L. and Piper aduncum var. ossanum (C.DC.) Trel., which showed antiplasmodial, antileishmanial and antitrypanosomal activity, as was piperitone. Safrole, in contrast, was identified as the major compound in Piper auritum Kunth, which displayed antileishmanial activity [17]. These examples could corroborate the results of the SOM analysis and could highlight Piper as a promising genus for studying antiprotozoal properties, related to the main compounds or to synergism resulting from the presence of these components in this genus. In fact, the antileishmanial potential of the Piper genus was recently reviewed [41].
18.4 µg/mL [43], respectively; while methyleugenol had an IC50 of 5.7 µg/mL against Plasmodium falciparum [44]. However, it was noted that several of the identified compounds have not been evaluated against these protozoa parasites, which could be addressed in further screening assays.
Nevertheless, several studies in the literature have already confirmed the antiparasitic action of EOs containing the identified compounds and obtained from plants in other geographical locations; these are summarized in Table 4 together with the results for EOs from Cuban plants (supplementary material). The highest number of reports, for Cuban and other EOs, was found for camphor. For example, antiprotozoal activity was evaluated for EOs from Cuban plants against Plasmodium falciparum, Leishmania spp. and Trypanosoma spp. from Alpinia zerumbet (Pers.) B. L. Burtt & R. M. Smith [45], Piper aduncum L. [46] and Piper ossanum (C.DC.) Trel [47]; while the rest of the EOs, from Alpinia speciosa K. Schum. [39], Artemisia absinthium L. [42], Piper cubeba L. [48], and Thymus hirtus sp. algeriensis Boiss. et Reut [30], displayed activity only against kinetoplastid parasites. Camphor proved to be one of the major substances in all included samples.
However, although in the literature piperitone was found only in an EO from Benin with antitrypanosomal and antiplasmodial activity [49], in Cuban samples it was found at higher concentrations (19 to 24%) in EOs that showed a broad spectrum of antiprotozoal effects, mainly from Piper species [46,47]. In contrast, in a diverse number of studies on plants from around the world, EOs containing germacrene D and showing antikinetoplastid activity correlated with the antiplasmodial activity shown by Cuban EOs with this compound [50].

Essential Oils Database
The 31 articles previously selected and analyzed by Monzote et al. [19] were used. A database of the identified compounds with a concentration > 0.1% was built and stored in an Excel spreadsheet; traces were not included. In parallel, the pharmacological properties described for each EO were assigned.

Self-Organizing Maps (SOMs)
The database contained information on 45 essential oils, with chemical composition and biological activity. To build the neural maps, the composition of each EO in the dataset was used as descriptors. The chemical components were analyzed with SOMs in Matlab 6.5 and SOM Toolbox 2.0 [24]. The SOM Toolbox is a set of Matlab functions that can be used for the elaboration and implementation of neural networks, since it contains functions for the creation, visualization, and analysis of SOMs.
The dataset was presented to the network before any adjustments were made. Subsequently, in each training stage, the data group was partitioned according to the regions of the weight vectors of the map. Then, the correct prediction of these sets and the total correct predictions of the compounds were evaluated. In the most relevant models, the set was divided into training and test sets to assess the predictive capacity. Training and test performance were assessed by calculating the proportion of samples correctly classified by the SOM. For each map, 5 cross-validations were performed, each partitioned into 80% training and 20% testing. In the SOM, sites containing molecules for each descriptor were identified to highlight existing chemical patterns. The SOM was generated with a 4 × 6 rectangular grid.
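To make the mapping step concrete, the sketch below trains a small 4 × 6 rectangular SOM in plain Python. It is only an illustration under stated assumptions and not the SOM Toolbox implementation used in this study: the online update rule, the Gaussian neighbourhood, the parameters `epochs`, `lr0`, `sigma0` and `seed`, and the toy composition vectors are all hypothetical choices.

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid coordinates (row, col) of the neuron whose weight vector is closest to x."""
    coords = ((r, c) for r in range(len(weights)) for c in range(len(weights[0])))
    return min(coords, key=lambda rc: math.dist(weights[rc[0]][rc[1]], x))

def train_som(data, rows=4, cols=6, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Train a rectangular SOM with online updates; returns weights[row][col]."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # linearly decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighbourhood radius
        for x in rng.sample(data, len(data)):    # present samples in random order
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                    w = weights[r][c]
                    for j in range(dim):
                        # Pull the neuron towards the sample, weighted by h.
                        w[j] += lr * h * (x[j] - w[j])
    return weights

# Hypothetical composition vectors (fractions of three components per EO).
data = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.0, 0.1, 0.9], [0.0, 0.2, 0.8]]
weights = train_som(data)
hits = [best_matching_unit(weights, x) for x in data]
print(hits)
```

Each EO is assigned to the grid position of its best matching unit; labelling each unit by the activity of the EOs that fall on it is what allows a hit rate to be computed for training and test samples.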

Principal Component Analysis (PCA)
PCA was calculated using the SOM Toolbox 2.0 [24]. PCA is useful for dimension reduction because the PCs are generated so that they explain maximal amounts of variance [27].
The PCA was calculated using the database containing information on 45 essential oils.
[76,77] was used to perform the analyses and to generate the model in silico. The EO dataset was divided using the "Partitioning" tool with the "Stratified sample" option into training and testing datasets, which represented 80% and 20% of all compounds, respectively. Molecules in the training and testing datasets were randomly selected, but the same proportions of active and not reported substances were maintained in both datasets. The composition of the EOs was used as descriptors.
The model utilized a "5-fold external cross-validation" procedure and the Random Forest (RF) algorithm. The RF parameters selected for all models were 100 trees to be built per forest and a static random seed of −5,440,374,124,525,988,069, which was generated from random numbers and fixed to obtain reproducible results.
The external performances of the selected models were analyzed for sensitivity (true positive rate, which represents the active rate), specificity (true negative rate, which represents the inactive rate), and accuracy (general predictability).
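The three statistics named above follow directly from the confusion matrix. The Python sketch below shows the calculation; the function name `classification_metrics` and the example labels are hypothetical and are not results from this study.

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy for binary labels (1 = active, 0 = not reported)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # active, predicted active
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # inactive, predicted inactive
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),        # true positive rate
        "specificity": tn / (tn + fp),        # true negative rate
        "accuracy": (tp + tn) / len(pairs),   # overall hit rate
    }

# Hypothetical test fold: 4 active and 5 not reported EOs.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

The sketch assumes that each fold contains at least one active and one inactive sample, which the stratified partitioning guarantees; otherwise a division by zero would occur.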
The applicability domain (APD) corresponds to the chemical space that surrounds the descriptors of the molecules used in the construction of the model. In this way, the applicability domain provides information about the similarity between what is being tested and what was used to build the model [78][79][80].
The APD was used to assess whether predictions for the compounds in each dataset were reliable. The APD is based on Euclidean distances, and measures of similarity between the training set descriptors are used to define it. Therefore, if a compound in the test set has distances and similarities beyond the APD limit, its prediction will not be reliable. The APD can be calculated using the following formula:
\[ \mathrm{APD} = d + Z\sigma \]
where d and σ are the mean and standard deviation, respectively, of the Euclidean distances between the compounds in the training set, and Z is an empirical cutoff value, which was set to 0.5 in this study [81].
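A minimal Python sketch of this calculation is given below. It follows the description in the text (mean and standard deviation of the pairwise Euclidean distances in the training set, with Z = 0.5). The use of the population standard deviation, the nearest-neighbour criterion in `in_domain`, the function names and the toy descriptors are assumptions made for illustration and may differ from the software used in this study.

```python
import math
from itertools import combinations
from statistics import mean, pstdev

def apd_threshold(train, z=0.5):
    """APD = d + Z * sigma over all pairwise Euclidean distances in the training set."""
    dists = [math.dist(a, b) for a, b in combinations(train, 2)]
    return mean(dists) + z * pstdev(dists)  # pstdev: population SD (an assumption)

def in_domain(x, train, threshold):
    """Assumed criterion: x is inside the domain if its nearest training
    sample lies within the APD threshold."""
    return min(math.dist(x, t) for t in train) <= threshold

# Hypothetical two-descriptor training set.
train = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
threshold = apd_threshold(train)
print(round(threshold, 3))
print(in_domain([0.5, 0.5], train, threshold))  # close to the training samples
print(in_domain([5.0, 5.0], train, threshold))  # far from every training sample
```

A test sample flagged as outside the domain would have its prediction treated as unreliable, as described above.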

Conclusions
Scientific studies corroborate the results found in this study. Thus, this EO analysis establishes a way to find chemical patterns between EOs and antiparasitic activity (antileishmanial, antitrypanosomal and antimalarial). This finding makes it possible to direct studies and biological tests toward EOs that have antiparasitic activity more effectively and with less time and financial cost. In particular, taking into account the data from Cuban EOs, we strongly suggest further antiprotozoal studies with EOs from species of the Piper genus and with the pure compound camphor. In addition, machine learning analyses of EOs from different geographical locations will be of interest for predicting bioactive components with potential antiplasmodial, antileishmanial, and antitrypanosomal activity.