Essential Oils Biofilm Modulation Activity and Machine Learning Analysis on Pseudomonas aeruginosa Isolates from Cystic Fibrosis Patients

The opportunistic pathogen Pseudomonas aeruginosa is often involved in airway infections of cystic fibrosis (CF) patients. It persists in the hostile CF lung environment, inducing chronic infections due to the production of several virulence factors. In this regard, the ability to form a biofilm plays a pivotal role in CF airway colonization by P. aeruginosa. Mitigating bacterial virulence, hampering bacterial cell adhesion, and/or reducing biofilm formation could therefore represent major targets for the development of new therapeutic treatments for infection control. Essential oils (EOs) are being considered as a potential alternative in clinical settings for the prevention, treatment, and control of infections sustained by microbial biofilms. EOs are complex mixtures of different classes of organic compounds, usually used for the treatment of upper respiratory tract infections in traditional medicine. Recently, a wide series of EOs were investigated for their ability to modulate biofilm production by different pathogens, including S. aureus, S. epidermidis, and P. aeruginosa strains. Machine learning (ML) algorithms were applied to develop classification models in order to suggest a possible antibiofilm action for each chemical component of the studied EOs. In the present study, we assessed the biofilm growth modulation exerted by 61 commercial EOs on a selected number of P. aeruginosa strains isolated from CF patients. Furthermore, ML has been used to shed light on the EO chemical components likely responsible for the positive or negative modulation of bacterial biofilm formation.


Introduction
The opportunistic pathogen Pseudomonas aeruginosa is a significant cause of healthcare-associated infections correlated with high morbidity and mortality in individuals with pneumonia, chronic obstructive pulmonary disease (COPD), or cystic fibrosis (CF) [1-4]. These infections are particularly problematic in intensive care units. For these reasons, this microorganism is included in the critical category of the World Health Organization's (WHO) priority list of pathogens for which the discovery of new therapeutics is urgently needed [5]. P. aeruginosa can cause both acute and chronic infections, since its pathogenic profile originates from a large and variable arsenal of virulence factors and antibiotic resistance determinants. In the airways of CF patients, P. aeruginosa persists, inducing a chronic infection; furthermore, it is widely known that the CF pulmonary environment confers multiple advantages to P. aeruginosa over other pathogens, such as Staphylococcus aureus and Klebsiella pneumoniae [6]. The ability to form a biofilm plays a pivotal role in CF airway colonization by P. aeruginosa. Indeed, among its various virulence factors, the ability to produce highly structured biofilms confers important advantages, including phenotypic resistance to host defenses, antibiotics, and disinfectants [7]. These characteristics prevent bacterial clearance and allow the establishment of highly recalcitrant chronic infections [8,9].
A novel strategy to fight P. aeruginosa infection could derive from the identification of compounds acting on the biofilm phenotype without affecting bacterial viability; these antibiofilm compounds could also enhance the effectiveness of conventional therapies, particularly in chronic infections such as those occurring in CF [10,11].
Herbal antimicrobials are considered as a potential alternative in clinical settings for the prevention, treatment, and control of infections sustained by microbial biofilms [12]. Essential oils (EOs) are complex mixtures of different classes of organic compounds, and they are usually used for the treatment of upper respiratory tract infections in traditional medicine [13]. Furthermore, bacteria fail to develop resistance to multi-component treatments such as EOs due to their multitarget actions [14].
Recently, a wide series of EOs from Mediterranean plants were investigated for their ability to modulate biofilm production by different pathogens, including S. aureus, S. epidermidis, and P. aeruginosa strains [15,16]. In those studies, machine learning (ML) algorithms were applied to develop classification models in order to suggest a possible antibiofilm action for each chemical component of the studied EOs. An analysis of the ML models indicated the chemical components possibly responsible for the inhibition or stimulation of bacterial biofilms. In two recent publications, ML-based clustering was used to develop a convergent microbiological protocol in which 61 EOs were evaluated on 40 clinical isolates of S. aureus and P. aeruginosa from CF patients [16,17]. First, the antimicrobial activity of each EO was tested against each S. aureus and P. aeruginosa clinical strain. Then, the antibiofilm activity was evaluated in the same S. aureus clinical isolates [17]. Based on these results, in the present study, we assessed the biofilm growth modulation exerted by the same EOs on a selected number of P. aeruginosa strains isolated from CF patients. Furthermore, ML has been used to shed light on the EO chemical components likely responsible for the positive or negative modulation of bacterial biofilm formation.

Ethics Approval and Informed Consent
This research, performed according to the principles of the Helsinki Declaration, was approved by the ethics committee of the Children's Hospital and Institute of Research Bambino Gesù (OPBG) in Rome, Italy (no. 1437_OPBG_2017 of July 2017). The individual participants included in the study and the parents/legal guardians of the patients signed an informed consent form.

Description of P. aeruginosa Clinical Isolates from CF Patients
Six representative clinical P. aeruginosa strains were used in this investigation, previously selected by means of unsupervised ML clustering, as recently described [17].
Patients were treated according to the current standards of care [18]. Microbiological cultures were performed according to the approved guidelines as already described in Ragno et al. [17]. In Table S1, the 18 qualitative descriptors used to cluster and define the six selected P. aeruginosa strains are described. Phenotypic and genotypic characteristics of these strains are summarized in Table S2. The moderately virulent P. aeruginosa PAO1 (PAO1) and the highly virulent P. aeruginosa PA14 (PA14) were used as reference strains [19].

Biofilm Production Assay in the Presence of EO
The biofilm production was quantified in vitro by microtiter plate biofilm assay (MTP). A bacterial suspension in the exponential growth phase (optical density at 600 nm of about 0.5) was diluted into the wells of a sterile 96-well polystyrene flat-bottom plate prefilled with medium with or without each of the EOs listed in Table S3, as previously reported [20]. Each EO was solubilized by adding DMSO to generate a mother stock solution at 50% v/v concentration. As a control, the bacterial cells were grown in Brain Heart Infusion broth (BHI, Oxoid, Basingstoke, UK) in the first row of the plate. In the second row, the same culture medium was supplemented with each EO at a final concentration of 1.00% v/v. The incubation was performed aerobically overnight at 37 °C. After 18 h of incubation, planktonic cells were gently removed by washing each well three times with double-distilled water, and the plates were patted dry in an inverted position. Each well was stained with 0.1% crystal violet for 15 min at room temperature, rinsed twice with double-distilled water, and thoroughly dried to quantify the biofilm formation. The biofilm was subsequently solubilized with 20% (v/v) glacial acetic acid and 80% (v/v) ethanol. The total biomass of the biofilm was spectrophotometrically quantified at 590 nm. Each data point is composed of four independent experiments, each performed in at least three replicates.
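The percentages of biofilm formation reported later can be derived from such absorbance readings as the ratio between treated and untreated wells. The following is a minimal sketch of that arithmetic, not the authors' code: the function name and the OD590 readings are hypothetical, and blank subtraction is assumed to have been done beforehand.

```python
import numpy as np

def biofilm_percent(od_treated, od_control):
    """Biofilm biomass of EO-treated wells as a percentage of untreated wells."""
    treated = np.asarray(od_treated, dtype=float)
    control = np.asarray(od_control, dtype=float)
    # Replicate wells are averaged before taking the ratio (an assumption).
    return 100.0 * treated.mean() / control.mean()

# Hypothetical OD590 readings from three replicate wells, for illustration only.
control_wells = [1.20, 1.10, 1.30]   # BHI without EO
treated_wells = [0.50, 0.40, 0.60]   # BHI with EO at 1.00% v/v
pct = biofilm_percent(treated_wells, control_wells)
print(f"{pct:.1f}% of the untreated biofilm")
```

Values below 100% would indicate reduced biofilm production and values above 100% increased production relative to the untreated control.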

Essential Oil Chemical Composition Analysis
The EOs are listed in Table S3. They were purchased from Farmalabor srl (Assago, Italy) and their chemical composition was analyzed by gas chromatography-mass spectrometry (GC-MS). The adopted operative conditions followed Papa et al. [16]. Each component was identified by comparing the obtained mass spectra with those reported in the NIST 02 and Wiley mass spectra libraries. Linear retention indices (LRIs) of each compound were also calculated using a mixture of aliphatic hydrocarbons (C8-C30, Ultrasci Bologna, Bologna, Italy) injected directly into the GC injector. All analyses were repeated twice.
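For readers unfamiliar with LRIs, the sketch below shows how such an index is commonly obtained by linear interpolation between the two n-alkanes that bracket the analyte. The text does not state which formula was used, so the interpolation formula (the usual temperature-programmed one), the function name, and the retention times are assumptions made for illustration.

```python
from bisect import bisect_right

def linear_retention_index(t_x, alkane_times):
    """LRI of a peak eluting at t_x, given {carbon number: retention time} of n-alkanes."""
    carbons = sorted(alkane_times)
    times = [alkane_times[c] for c in carbons]
    i = bisect_right(times, t_x) - 1          # index of the alkane eluting just before t_x
    if i < 0 or i >= len(times) - 1:
        raise ValueError("retention time outside the alkane ladder")
    n, n_next = carbons[i], carbons[i + 1]
    t_n, t_next = times[i], times[i + 1]
    # Linear interpolation between the two bracketing alkanes.
    return 100.0 * (n + (n_next - n) * (t_x - t_n) / (t_next - t_n))

# Hypothetical retention times (min) for C8-C10, for illustration only.
ladder = {8: 5.0, 9: 7.0, 10: 9.5}
lri = linear_retention_index(8.0, ladder)
print(lri)
```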

Machine Learning Binary Classification Modeling
All analyses were performed using the Python programming language (version 3.7, https://www.python.org/) [21,22] by executing in-house code in the Jupyter Notebook platform [16,17,20]. The chemical composition of each EO and the microbiological data were imported, loaded into a Python Pandas dataframe, and pre-processed into the final datasets used to build the classification models. The Scikit-learn (sklearn) [23] and Pandas [24,25] libraries were used to implement the ML algorithm protocols.
During model development, an unsupervised dimensionality reduction/transformation was performed with principal component analysis (PCA) [26] to extract 60%, 80%, 90%, and 100% of the explained variance (Table S4). Different cut-off values related to the percentage of biofilm reduction/augmentation were used to develop ad hoc models to inspect strong, moderate, and weak biofilm inhibition and biofilm enhancement. In a departure from previous applications, a data augmentation (DA) approach was also implemented herein [27]. The EO dataset was augmented by means of random perturbation of the composition, while keeping the same bioactivity for each augmented EO. In particular, for each EO, all the components were randomly modified by adding or subtracting up to 15% to/from each EO component, increasing the number of data rows by 10 (aug10) or 20 (aug20) times. In the case of unbalanced augmentation, for each EO, 10 new "virtual" records were generated (baug10 and baug20 in the table), while for the balanced process, with w being the weight of the EO class, it was augmented w*10 times. Moreover, components occurring fewer than 2, 4, or 6 times were eliminated from the training set. The robustness of the models, both during hyperparameter tuning and for the final models, was evaluated by cross-validation (CV).
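The composition perturbation described above can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the function name is hypothetical, the perturbation is assumed to be relative (each percentage multiplied by a random factor between 0.85 and 1.15), the original rows are assumed to be kept alongside the virtual ones, and only the unbalanced scheme is shown.

```python
import numpy as np
import pandas as pd

def augment_compositions(X, y, n_copies=10, max_rel_change=0.15, seed=0):
    """Return the original EO rows plus n_copies randomly perturbed copies of each.

    X: DataFrame of component percentages (rows = EOs); y: Series of class labels.
    Each virtual record inherits the label of the EO it was derived from.
    """
    rng = np.random.default_rng(seed)
    frames, labels = [X], [y]
    for _ in range(n_copies):
        # One independent factor per EO and per component; absent components stay at 0.
        factors = rng.uniform(1 - max_rel_change, 1 + max_rel_change, size=X.shape)
        frames.append(X * factors)
        labels.append(y)
    return pd.concat(frames, ignore_index=True), pd.concat(labels, ignore_index=True)

# Toy composition table (hypothetical component names and percentages).
X = pd.DataFrame({"comp_a": [50.0, 0.0, 10.0],
                  "comp_b": [30.0, 60.0, 0.0],
                  "comp_c": [5.0, 20.0, 70.0]})
y = pd.Series([1, 0, 1])
X_aug, y_aug = augment_compositions(X, y, n_copies=10)
print(X_aug.shape, y_aug.value_counts().to_dict())
```

Because the perturbation is multiplicative, a component that is absent from an EO remains absent in its virtual copies. The sketch does not renormalize the perturbed compositions to 100%.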
Due to the high number of considered hyperparameter combinations, the ML modeling strategy was conducted as follows:

1. A first coarse ML model generation was run with 10 random hyperparameter combination runs from all possible considered combinations (Tables S5 and S6) [28];
2. A second level of investigation was run with 100 random hyperparameter combination runs from all possible considered combinations (Tables S6 and S7) to select the optimal DA settings;
3. A pre-final level was run with 1000 random hyperparameter combinations to check for protocol correctness, while extracting statistical coefficients for preliminary model evaluation;
4. A final hyperparameter combination selection was performed by running 10,000 random combinations;
5. The best model was finally further investigated with 1000 runs of DA perturbations, and the top scored model was used to deeply analyze the data.
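The staged random search listed above can be sketched with sklearn's RandomizedSearchCV. This is a hedged illustration, not the in-house code: the dataset is synthetic, the search space is a small hypothetical one (the real settings are in the supplementary tables), only the rf algorithm is shown, and the budgets are reduced to 10 and 20 iterations so that the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in for an EO composition matrix and binarized bioactivity.
X, y = make_classification(n_samples=120, n_features=30, n_informative=8, random_state=0)

pipe = Pipeline([
    ("pca", PCA(svd_solver="full")),          # a float n_components keeps that share of variance
    ("clf", RandomForestClassifier(random_state=0)),
])
param_distributions = {                       # hypothetical, deliberately small search space
    "pca__n_components": [0.6, 0.8, 0.9, None],   # None keeps all components (100% variance)
    "clf__n_estimators": [10, 25, 50],
    "clf__max_depth": [None, 3, 5, 10],
    "clf__class_weight": [None, "balanced"],
}
mcc_scorer = make_scorer(matthews_corrcoef)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

best_by_stage = {}
for n_iter in (10, 20):                       # the text uses 10, 100, 1000, and 10,000
    search = RandomizedSearchCV(pipe, param_distributions, n_iter=n_iter,
                                scoring=mcc_scorer, cv=cv, random_state=0)
    search.fit(X, y)
    best_by_stage[n_iter] = (search.best_score_, search.best_params_)

for n_iter, (score, params) in best_by_stage.items():
    print(n_iter, round(score, 3), params)
```

Increasing the budget stage by stage, as in the list above, allows poor DA settings and algorithms to be discarded cheaply before the most expensive searches are run.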
Linear and non-linear ML classification algorithms were used to develop different models: random forest (rf), logistic regression (lr), support vector machine (sv), gradient boosting (gb), decision tree (dt), and k-nearest neighbors (knn), as implemented in sklearn. The accuracy (ACC), F1 score, and Matthews correlation coefficient (MCC) were used to numerically and graphically evaluate the binary classification models. The importance of each chemical component present in the EOs was independently evaluated through the "feature importance" (FI) and partial dependence (PD) [29] methods, as implemented in the Skater Python library [30,31].
Models were validated by leave-some-out CV by means of five groups using the stratified K-fold method monitoring the average value of MCC obtained from 50 random CV iterations [15,32]. The selection of the final models was based on the MCC values.
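A possible way to reproduce this validation scheme with sklearn is sketched below. It assumes that the "50 random CV iterations" correspond to repeating a shuffled stratified five-fold split with different seeds; the data are synthetic, the classifiers use near-default hypothetical settings, and the number of repeats is lowered to 5 to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one strain/threshold dataset.
X, y = make_classification(n_samples=100, n_features=20, n_informative=5, random_state=0)

classifiers = {
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
    "lr": LogisticRegression(max_iter=1000),
    "sv": SVC(),
    "gb": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "dt": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}

N_REPEATS = 5  # the text reports 50; reduced here for speed
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=N_REPEATS, random_state=0)
mcc_scorer = make_scorer(matthews_corrcoef)

# Average MCC over all folds and repeats, one value per algorithm.
mean_mcc = {name: cross_val_score(clf, X, y, scoring=mcc_scorer, cv=cv).mean()
            for name, clf in classifiers.items()}
best = max(mean_mcc, key=mean_mcc.get)
print({k: round(v, 3) for k, v in mean_mcc.items()}, "best:", best)
```

MCC is a sensible selection criterion for this kind of data because, unlike accuracy, it remains informative when the two classes are unbalanced.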

Biofilm Production Modulation by EOs
The EOs' ability to modulate P. aeruginosa biofilm production was evaluated at a concentration of 1.00% v/v on the basis of a previous report [17]. The antimicrobial activity of the 61 EOs listed in Table S3 was evaluated, and the results are reported in Table S8. EOs without antimicrobial activity at this concentration were then investigated for their ability to modulate biofilm production. Biofilm production was compared to that of untreated bacteria (Table 1).

Table 1. Effect of EO on biofilm formation. Percentage of bacterial biofilm formation in the presence of each EO listed in Table S3 at a concentration of 1.00% v/v relative to untreated bacteria. Each data point is composed of four independent experiments, each performed with at least three replicates. NA: not applicable; at the concentration tested, the EO was antimicrobial for this strain, and consequently the biofilm modulation was not evaluated.

Essential Oil Chemical Composition
The chemical compositions of the 61 EOs have already been reported as described in reference [16], and they are also reported in the Supplementary Material (Table S9).

Datasets
Considering the antimicrobial activity data (Table S8), the biofilm production investigations (Table 1), and the eight P. aeruginosa strains, a total of eight different initial datasets were loaded into Pandas dataframes. Each dataset was composed of a data matrix of 61 rows (EO1-EO61, samples listed in Table S3) and 240 columns (one bioactivity and 239 chemical components). To evaluate the ability of the ML models under development to discriminate between biofilm-inhibiting and biofilm-stimulating EOs, the biological data were binarized (partitioned into two classes) using different percentages of biofilm production as the threshold value, SM. For all the strains used, threshold values of 40% (strong biofilm inhibition) and 120% (strong biofilm stimulation) were selected.
For completeness, moderate biofilm inhibition (threshold of 80%) and a direct classification of biofilm inhibitors and enhancers (threshold of 100%) were also taken into consideration, and the results are reported in the Supplementary Material. As the antimicrobial data were too unbalanced, no attempt was made to develop ML models on them.

Classification Models
To avoid excessively unbalanced datasets, the modeling was restricted to binarized data showing, at most, a 10%:90% (or 90%:10%) class distribution, thus allowing the development of 27 models out of the 32 possible combinations (eight strains by four thresholds).
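The binarization and the class-balance filter can be sketched as below. This is an assumption-laden illustration: the function names and the biofilm percentages are hypothetical, the class labelled 1 is assumed to be the one falling below the threshold, and EOs marked NA (antimicrobial at the tested concentration) are assumed to be dropped before labelling.

```python
import pandas as pd

def binarize(biofilm_pct, threshold):
    """Label 1 where biofilm production (% of untreated) is below the threshold, else 0."""
    values = pd.Series(biofilm_pct, dtype=float).dropna()   # NA entries are excluded
    return (values < threshold).astype(int)

def is_balanced_enough(labels, min_fraction=0.10):
    """True when the minority class represents at least min_fraction of the samples."""
    frac_positive = labels.mean()
    return min_fraction <= frac_positive <= 1 - min_fraction

# Hypothetical biofilm production values for ten EOs on one strain (None = NA).
pct = [31, 35, 80, 95, 110, 130, 150, None, 60, 20]

labels_40 = binarize(pct, 40)     # strong inhibition threshold
labels_120 = binarize(pct, 120)   # strong stimulation threshold
print(labels_40.sum(), len(labels_40), is_balanced_enough(labels_40))
print(labels_120.sum(), len(labels_120), is_balanced_enough(labels_120))
```

A strain/threshold combination failing the balance check would simply be skipped, which is how 27 of the 32 combinations could remain.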
Classification modeling at the 40% and 120% thresholds was carried out with six different ML algorithms (rf, gb, sv, lr, dt, and knn) using the datasets introduced above. Initial classification models were built using the same protocol reported in reference [16] but, unfortunately, statistically acceptable models (MCC values greater than 0.4) were obtained for only two strain/threshold combinations (Table S10). Similarly, only a few weak models were obtained for the 80% and 100% threshold values (Table S11). Recently, DA has been reported as a useful tool to develop ML models suffering from either an insufficient amount of data or the presence of noisy experimental data [33]. Despite the intrinsic power of ML, these conditions can lead to poor models, as the available data do not cover the possible range of applications, such as the variability of EO chemical composition. Therefore, DA was implemented herein in a new strategy to develop ML models (see Materials and Methods). Classification models were built with a number of latent variables corresponding to 60%, 80%, 90%, and 100% of the whole chemical components' variance extracted by PCA. Moreover, to avoid the development of models driven by poorly represented components, those components with occurrences lower than 2, 4, and 6 were systematically removed from the training set. Hyperparameter optimization was carried out with a wide range of settings, leading to between thousands and billions of combinations (Tables S5 and S7). Therefore, to speed up the calculations, a random search was used in place of the more common and exhaustive grid search. Random search hyperparameter optimization has been shown to have a 95% probability of finding a combination of parameters within the optimal 5% with only 60 iterations [28].
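The figure of 60 iterations follows from elementary probability: if each random draw independently has a 5% chance of landing in the best 5% of the search space, the chance that at least one of n draws does so is 1 - (1 - 0.05)^n. The short check below uses hypothetical function names and assumes independent, uniformly sampled combinations.

```python
import math

def prob_hit_top_fraction(n_iter, top_fraction=0.05):
    """Probability that at least one of n_iter random draws falls in the top fraction."""
    return 1.0 - (1.0 - top_fraction) ** n_iter

def iterations_needed(confidence=0.95, top_fraction=0.05):
    """Smallest number of random draws reaching the requested confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - top_fraction))

p60 = prob_hit_top_fraction(60)
n95 = iterations_needed()
print(f"P(hit within 60 draws) = {p60:.3f}; draws needed for 95% = {n95}")
```

Under these assumptions, 59 draws are already sufficient, so 60 iterations is a convenient round number for the 95% level.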
Herein, the procedure described in the Materials and Methods section led to the elaboration of more than three quarters of a million models (Table S12) to seek the best combination of settings (DA and hyperparameters) and define eleven final ML models (Table 2). The initial DA and hyperparameter optimization was run with only 10 iterations and with coarse settings (Table S5), leading to the generation of 2880 models for each of the 11 datasets of Table 2. For each dataset, the top 3 models were retained, yielding the 33 preliminary ML models P1-P33, with cross-validated MCC values ranging from 0.34 to 0.78 (Tables S13 and S14). Then, the P1-P33 models were subjected to a further 100 iterations to select 11 models in which the DA settings (Tables S4 and S16) were fixed, leading to the intermediate models I100_1-I100_11, characterized by MCC values in the 0.47-0.88 range (Table S15). A third round of hyperparameter optimization was performed with 1000 random iterations while keeping the DA settings of models I100_1-I100_11, furnishing models I1000_1-I1000_11 (Tables S17 and S18), which were optimized to the pre-final ML models (PF1-PF11) through a further 10,000 random iterations. Interestingly, models PF1-PF11 were characterized by the same range of MCC values as models I1000_1-I1000_11 and I100_1-I100_11, indicating that a sort of convergence had been reached in the selection of the optimal hyperparameters (Tables S19 and S20). The models PF1-PF11 were then subjected to 100 rounds of random DA, using the DA settings selected with the associated models I100_1-I100_11 and the hyperparameters of models PF1-PF11 themselves. The top-scoring final models F1-F11 were then selected, and the associated MCC, ACC, and F1 values were calculated (Table 2).
Models F1-F11 were finally analyzed through FI and PD values and plots to identify the chemical components most likely responsible for biofilm modulation (FIs) and to assess their statistical contribution to each model. For completeness, the same procedures were applied using threshold values of 80% and 100% (Table S21).

Chemical Components Importance and Partial Dependences
Chemical component importance was evaluated through FIs and PDs. Each FI indicates a sort of absolute correlation coefficient for each of the chemical components (Figures S1-S13), while the associated PD indicates whether its influence is negative, positive, or absent. The positive or negative trends of the PDs were evaluated through the Spearman correlation (SP) coefficient. The SP values were used to correct the corresponding FIs into positive or negative weighted FIs (WFIs), which were then plotted. To avoid redundancy, only the 10 highest and 10 lowest WFI values were inspected (Figures 1 and 2). The analysis of the WFI values led to the association of an overall effect on biofilm inhibition or stimulation with each chemical component (Table 3).

Figure 1.
Weighted feature importance (WFI) plot for models F1 to F6 obtained on the dataset binarized at 40% biofilm inhibition. Positive bars are associated with inhibition of biofilm production, whereas negative bars are associated with augmented biofilm production. Only the 10 highest (antibiofilm) and 10 lowest (pro-biofilm) values are displayed.

Figure 2.
Weighted feature importance (WFI) plot for models F7 to F11 obtained on the dataset binarized at 120% biofilm inhibition. Positive bars are associated with inhibition of biofilm production, whereas negative bars are associated with augmented biofilm production. Only the 10 highest (anti-biofilm) and 10 lowest (pro-biofilm) values are displayed.
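One possible reading of the WFI construction is sketched below. It is not the Skater-based procedure used in the study: the function name, component names, FI values, and PD curves are hypothetical, and the "correction" of each FI is assumed to consist of giving it the sign of the Spearman correlation between the component abundance grid and its PD curve.

```python
import numpy as np
from scipy.stats import spearmanr

def weighted_feature_importance(fi, grid, pd_curve):
    """Sign the feature importance with the monotonic trend of the partial dependence."""
    pd_curve = np.asarray(pd_curve, dtype=float)
    if np.ptp(pd_curve) == 0:            # flat PD curve: no detectable influence
        return 0.0
    rho, _ = spearmanr(grid, pd_curve)   # Spearman correlation (SP) of abundance vs. PD
    return float(fi * np.sign(rho))

grid = np.linspace(0.0, 30.0, 7)         # hypothetical abundance grid (% in the EO)
components = {
    # name: (FI, PD curve evaluated on the grid)
    "component_A": (0.30, 0.20 + 0.01 * grid),        # PD rises with abundance
    "component_B": (0.20, 0.60 - 0.01 * grid),        # PD falls with abundance
    "component_C": (0.10, np.full_like(grid, 0.40)),  # PD is flat
}
wfi = {name: weighted_feature_importance(fi, grid, curve)
       for name, (fi, curve) in components.items()}

# Rank the components; the study inspects the 10 highest and 10 lowest values.
ranked = sorted(wfi.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

An alternative, equally plausible choice would be to multiply each FI by the SP value itself rather than by its sign, which would also down-weight components with a weakly monotonic PD.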


Chemical Components Importance and Partial Dependences at 40% Biofilm Production Threshold Value
At a 40% biofilm production threshold value, good MCC, ACC, and F1 values were obtained for six out of the eight P. aeruginosa strains (models F1-F6, Table 2 and Figure 1). In particular, linalool, listed among the top 30 most frequent EO components with an occurrence of about 60% (Table S22), proved to be the chemical component most likely to be involved in strong inhibition of biofilm production, as identified in four out of six ML models (22P, 25P, 27P, and 39P). Other compounds that seem to be important for a strong biofilm reduction are eucalyptol, linalyl anthranilate, geranyl acetate, bornyl acetate, cis-geraniol, sabinene, and cis-3-pinanone. Unlike linalool, these compounds are associated with the inhibition of biofilm production for one, two, or three strains. Taken together, these components might ensure a wide spectrum of activity against the 22P, 25P, 27P, 37P, and 39P isolates. Interestingly, linalool and geranyl acetate are two of the most abundant components in EO54 and, in agreement with the above analysis, this EO showed a strong biofilm reduction, with an average percentage of biofilm production as low as 31% against the 22P, 25P, 27P, 37P, and 39P isolates. Indeed, linalool was present at different percentages in seven of the eight most potent biofilm-reducing EOs (EO10, EO11, EO24, EO44, EO46, EO53, and EO54, each composition reported in Table S9), combined mainly with eucalyptol and geranyl acetate, likely acting in a synergistic way. Interestingly, β-caryophyllene, α-pinene, limonene, and p-cymene were indicated as important for decreasing biofilm production in some strains, while having a negative impact on the EOs' biofilm inhibition in other strains (Table 3). In contrast, β-pinene and carvacrol were found to exert only negative modulation on biofilm inhibition.

Chemical Components Importance and Partial Dependences at a 120% Biofilm Production Threshold Value
Similarly to the 40% threshold, at 120%, ML models with acceptable MCC values (F7-F11, Table 2 and Figure 2) were obtained for only five out of eight strains (PAO1, 25P, 26P, 27P, and 39P). Eucalyptol and o-cymene were the components calculated as likely to be responsible for slowing down biofilm production in PAO1, while thymol, p-cymene, citronellal, and carvacrol were mainly found as compounds possibly important for biofilm production stimulation. The compounds indicated as counterbalancing biofilm production enhancement were linalool, linalyl anthranilate, limonene, and α-pinene.

Discussion
Biofilm represents the strongest form of phenotypic resistance deployed by bacteria against the host immune defenses and antibacterial drugs. It plays a pivotal role in the chronicization of many infections, including lung infections in CF patients. The identification of new compounds able to interfere with biofilm development could lead to the removal of a primary cause of the persistence of infections.
In previous reports, it has been demonstrated that EOs can exert either antibacterial [15,17,34-42] or biofilm modulation effects [15-17,20,42-48]. As a continuation of a previously reported screen for antibacterial and antibiofilm EOs [15-17,20,42], herein, 61 previously investigated commercial samples have been evaluated for their abilities to modulate the biofilm formation of six P. aeruginosa clinical strains (22P, 25P, 26P, 27P, 37P, and 39P) in comparison with the reference strains PAO1 and PA14. Except for a few samples, the EOs tested at a concentration of 1.00% v/v showed a wide variability in either positively or negatively modulating bacterial biofilm production. A biofilm is continuously in equilibrium between accumulation and disruption, being subjected to a wide array of intracellular and extracellular factors. Therefore, it is not surprising that the components of the same EO, which is a complex mixture of many chemical compounds, may act synergistically or anti-synergistically in stimulating or inhibiting biofilm development. The application of ML algorithms led to models that allowed the identification of the chemical compounds most related to strong biofilm growth inhibition. In particular, linalool (and to a lesser extent eucalyptol, linalyl anthranilate, geranyl acetate, bornyl acetate, cis-geraniol, sabinene, and cis-3-pinanone) is indicated as the most important component endowing EOs with a strong antibiofilm potency. In agreement with previous reports on several chemical constituents of the same EOs [16,17], it could be speculated that eucalyptol and linalool are common chemical compounds that reduce biofilms in both reference and clinical isolate strains of S. aureus and P. aeruginosa. Indeed, Karuppia and coworkers and Kifer and coworkers demonstrated in two independent reports that eucalyptol plays an antibiofilm role in S. aureus and P. aeruginosa [49,50], while linalool was independently pointed to by Lahiri and Kerekes as an important regulator of S. aureus and P. aeruginosa biofilm formation [51,52].
Regarding the biofilm enhancement driven by our 61 tested EOs, thymol, p-cymene, citronellal, and carvacrol were indicated by the ML models as those compounds important for biofilm production stimulation. In the face of our experimental evidence, a literature survey on Scopus (www.scopus.com, accessed on 1 March 2022) showed almost no reports on small molecules' or EOs' abilities to increase biofilm production.
In this regard, 89 EOs extracted from Mediterranean plants, previously screened for their biofilm modulation capability in P. aeruginosa PAO1 [15] and in four Staphylococcus strains [20], showed the ability to stimulate biofilm production. The analysis of their composition by means of ML methods highlighted the important role of a few chemical compounds in modulating biofilm production. Nevertheless, the overall chemical composition of those EOs did not overlap with that of the EOs investigated herein, and therefore different conclusions were drawn. Speculatively, in previously published reports, limonene was indicated as a potential key molecule that, due to its lipophilic nature, could act as a sort of gate for different anti-biofilm or pro-biofilm compounds. Herein, limonene and other hydrophobic components (α-pinene and p-cymene) seem to be confirmed as enhancers (positive or negative) of the effect of other components.
In spite of reports supporting the above hypothesis on biofilm inhibition [12,49-51], further investigations on ad hoc selected EOs or their isolated chemical compounds are required to confirm the role of single molecules and their synergistic or anti-synergistic effects.
In conclusion, in agreement with previously published articles, this study makes the role of EOs and their chemical components less obscure, and ML algorithms have further confirmed their potential as valuable tools to shed light on the EOs' likely mechanism of activity. Furthermore, the DA application proved herein to be a valid method for building robust models when the classical ML application failed. In particular, DA seems particularly suitable for EOs, which are always regarded as critical by the chemistry and medicinal chemistry communities because they are difficult to standardize. As applied herein, DA accounts for the composition variability of EOs obtained from the same plants, and also for the intrinsically low stability of the component ratios due to the different and high volatility of each compound.

Supplementary Materials:
The following are available online at https://www.mdpi.com/article/10.3390/microorganisms10050887/s1, Table S1. Qualitative descriptors used for the unsupervised machine learning clusterization of P. aeruginosa strains. Table S2. Phenotypical and genotypical characterization of 6 representative strains of P. aeruginosa. Table S3. Essential oil IDs and associated plant names. Table S4. List of systematic DA settings varied during ML hyperparameter optimization. Table S5. List of hyperparameter settings used for the preliminary ML models through random search optimization. Table S6. List of weights for the class_weight hyperparameters in Table S6. Data are presented as Python dictionaries. Table S7. List of hyperparameter settings used for the models' refinement through random search optimization. Table S8. Antimicrobial activity of EOs listed in Table S1, on representative clinical and reference strains of P. aeruginosa. Table S9. Compositions of the 61 essential oils used in the study. Table S10. Preliminary models developed with the procedure described in reference [16]. Table S11. Preliminary models developed for thresholds at 80% and 100% biofilm modulation [16]. Table S12. Number of models evaluated during the ML optimization process. NA means no models were developed for the strain/threshold combination due to the low number of active or inactive samples. Table S13. Preliminary models P1-P33 obtained with the combination of DA and random search hyperparameter optimization. Table S14. Hyperparameters associated with the preliminary models P1-P33. Table S15. Intermediate ML models I100_1-I100_11 with the data augmentation and 100 random iterations. Table S16. Hyperparameters associated with the intermediate ML models I100_1-I100_11 listed in Table S15. Table S17. Intermediate ML models with the data augmentation setting selected from models I100_1-I100_11 and 1000 random iterations. Table S18. Hyperparameters associated with the intermediate models I1000_1-I1000_11 listed in Table S17. Table S19. Final models PF1-PF11 with the data augmentation setting selected from models I1000_1-I1000_11 and 10,000 random iterations to seek the best hyperparameters. Table S20. Hyperparameters of the models listed in Table S19. Table S21. Optimized final models obtained with 100 random iterations of data augmentation at threshold values of 80% and 100%. Table S22. Occurrences of the EOs' chemical components. Only the most frequent compounds are listed. Figure S1. Feature importance for model F1 (see main text Table 1). The top 20 components are displayed. Figure S2. Feature importance for model F2 (see main text Table 1). The top 20 components are displayed. Figure S3. Feature importance for model F3 (see main text Table 1). The top 20 components are displayed. Figure S4. Feature importance for model F4 (see main text Table 1). The top 20 components are displayed. Figure S5. Feature importance for model F5 (see main text Table 1). The top 20 components are displayed. Figure S6. Feature importance for model F6 (see main text Table 1). The top 20 components are displayed. Figure S7. Feature importance for model F7 (see main text Table 1). The top 20 components are displayed. Figure S8. Feature importance for model F8 (see main text Table 1). The top 20 components are displayed. Figure S9. Feature importance for model F9 (see main text Table 1). The top 20 components are displayed. Figure S10. Feature importance for model F10 (see main text Table 1). The top 20 components are displayed. Figure S11. Feature importance for model F11 (see main text Table 1). The top 20 components are displayed. Figure S12. Normalized feature importances for the final models F1-F6 developed at a threshold value of 40% (see main text Table 1). The top 20 components are displayed. Figure S13. Normalized feature importances for the final models F23-F27 developed at a threshold value of 120% (see main text Table 1). The top 20 components are displayed.

Informed Consent Statement: Informed consent was obtained from all subjects involved in this study.