A Computer-Driven Approach to Discover Natural Product Leads for Methicillin-Resistant Staphylococcus aureus Infection Therapy

The risk of methicillin-resistant Staphylococcus aureus (MRSA) infection is increasing in both the developed and developing countries. New approaches to overcome this problem are in need. A ligand-based strategy to discover new inhibiting agents against MRSA infection was built through exploration of machine learning techniques. This strategy is based in two quantitative structure–activity relationship (QSAR) studies, one using molecular descriptors (approach A) and the other using descriptors (approach B). In the approach A, regression models were developed using a total of 6645 molecules that were extracted from the ChEMBL, PubChem and ZINC databases, and recent literature. The performance of the regression models was successfully evaluated by internal and external validation, the best model achieved R2 of 0.68 and RMSE of 0.59 for the test set. In general natural product (NP) drug discovery is a time-consuming process and several strategies for dereplication have been developed to overcome this inherent limitation. In the approach B, we developed a new NP drug discovery methodology that consists in frontloading samples with 1D NMR descriptors to predict compounds with antibacterial activity prior to bioactivity screening for NPs discovery. The NMR QSAR classification models were built using 1D NMR data (1H and 13C) as descriptors, from crude extracts, fractions and pure compounds obtained from actinobacteria isolated from marine sediments collected off the Madeira Archipelago. The overall predictability accuracies of the best model exceeded 77% for both training and test sets.


Introduction
The use as well as misuse of antibiotics in animal feeding and human medicine has resulted in global antimicrobial resistance epidemics [1][2][3]. The Center for Disease Control and Prevention classifies methicillin-resistant Staphylococcus aureus (MRSA) as a severe threat in health care, leading to more than 80,000 invasive infections and resulting in over 11,000 deaths per year [4]. MRSA infection is one of the leading causes of hospital-acquired infections and is commonly associated with significant morbidity, mortality, length of stay, and cost burden in hospital, e.g., the mortality rates vary from 5-60% and depending on the patient population and site of infection [3,5]. Recently was reported  1 Cluster number and chemical structure of the cluster centroid. 2 Number of molecules. 3 Within the cluster for the training set.
Although the molecules of these five structural clusters were distributed over a wide range of pMIC values, between 2.64 and 11.82, all clusters showed an average pMIC value greater than or equal to 4.59 (corresponding to a MIC value less than or equal to 25.7 mM), and a maximum pMIC value greater than 7.80 (corresponding to a MIC value less than 0.0158 mM) for the training set. The well-known, Lipinski rule, informs mainly if a molecule is more likely to be an orally administrated active drug and if it is easily absorbed by the body. Furthermore, one of the most important molecular descriptors is the octanol-water partition coefficient (Log P), which is highly correlated with lipophilicity, thus, more lipophilic molecules are often discontinued from drug development and are frequently related to toxicity issues [40]. Besides molecular weight (MW) and Log P, as described above in the introduction the electronic descriptor TopoPSA appears to have a remarkable performance in discriminating between antibiotic and non-antibiotic compounds [21]. In order to   1 Cluster number and chemical structure of the cluster centroid. 2 Number of molecules. 3 Within the cluster for the training set.
Although the molecules of these five structural clusters were distributed over a wide range of pMIC values, between 2.64 and 11.82, all clusters showed an average pMIC value greater than or equal to 4.59 (corresponding to a MIC value less than or equal to 25.7 mM), and a maximum pMIC value greater than 7.80 (corresponding to a MIC value less than 0.0158 mM) for the training set. The well-known, Lipinski rule, informs mainly if a molecule is more likely to be an orally administrated active drug and if it is easily absorbed by the body. Furthermore, one of the most important molecular descriptors is the octanol-water partition coefficient (Log P), which is highly correlated with lipophilicity, thus, more lipophilic molecules are often discontinued from drug development and are frequently related to toxicity issues [40]. Besides molecular weight (MW) and Log P, as described above in the introduction the electronic descriptor TopoPSA appears to have a remarkable performance in discriminating between antibiotic and non-antibiotic compounds [21]. In order to  Although the molecules of these five structural clusters were distributed over a wide range of pMIC values, between 2.64 and 11.82, all clusters showed an average pMIC value greater than or equal to 4.59 (corresponding to a MIC value less than or equal to 25.7 mM), and a maximum pMIC value greater than 7.80 (corresponding to a MIC value less than 0.0158 mM) for the training set. The well-known, Lipinski rule, informs mainly if a molecule is more likely to be an orally administrated active drug and if it is easily absorbed by the body. Furthermore, one of the most important molecular descriptors is the octanol-water partition coefficient (Log P), which is highly correlated with lipophilicity, thus, more lipophilic molecules are often discontinued from drug development and are frequently related to toxicity issues [40]. Besides molecular weight (MW) and Log P, as described above in the introduction the electronic descriptor TopoPSA appears to have a remarkable performance in discriminating between antibiotic and non-antibiotic compounds [21]. In order to Although the molecules of these five structural clusters were distributed over a wide range of pMIC values, between 2.64 and 11.82, all clusters showed an average pMIC value greater than or equal to 4.59 (corresponding to a MIC value less than or equal to 25.7 mM), and a maximum pMIC value greater than 7.80 (corresponding to a MIC value less than 0.0158 mM) for the training set. The well-known, Lipinski rule, informs mainly if a molecule is more likely to be an orally administrated active drug and if it is easily absorbed by the body. Furthermore, one of the most important molecular descriptors is the octanol-water partition coefficient (Log P), which is highly correlated with lipophilicity, thus, more lipophilic molecules are often discontinued from drug development and are frequently related to toxicity issues [40]. Besides molecular weight (MW) and Log P, as described above in the introduction the electronic descriptor TopoPSA appears to have a remarkable performance in discriminating between antibiotic and non-antibiotic compounds [21]. In order to Although the molecules of these five structural clusters were distributed over a wide range of pMIC values, between 2.64 and 11.82, all clusters showed an average pMIC value greater than or equal to 4.59 (corresponding to a MIC value less than or equal to 25.7 mM), and a maximum pMIC value greater than 7.80 (corresponding to a MIC value less than 0.0158 mM) for the training set. The well-known, Lipinski rule, informs mainly if a molecule is more likely to be an orally administrated active drug and if it is easily absorbed by the body. Furthermore, one of the most important molecular descriptors is the octanol-water partition coefficient (Log P), which is highly correlated with lipophilicity, thus, more lipophilic molecules are often discontinued from drug development and are frequently related to toxicity issues [40]. Besides molecular weight (MW) and Log P, as described above in the introduction the electronic descriptor TopoPSA appears to have a remarkable performance in discriminating between antibiotic and non-antibiotic compounds [21]. In order to 540 169 5. 18/8.68 Although the molecules of these five structural clusters were distributed over a wide range of pMIC values, between 2.64 and 11.82, all clusters showed an average pMIC value greater than or equal to 4.59 (corresponding to a MIC value less than or equal to 25.7 mM), and a maximum pMIC value greater than 7.80 (corresponding to a MIC value less than 0.0158 mM) for the training set. The well-known, Lipinski rule, informs mainly if a molecule is more likely to be an orally administrated active drug and if it is easily absorbed by the body. Furthermore, one of the most important molecular descriptors is the octanol-water partition coefficient (Log P), which is highly correlated with lipophilicity, thus, more lipophilic molecules are often discontinued from drug development and are frequently related to toxicity issues [40]. Besides molecular weight (MW) and Log P, as described above in the introduction the electronic descriptor TopoPSA appears to have a remarkable performance in discriminating between antibiotic and non-antibiotic compounds [21].
In order to exploit the training set chemical and biological diversity, the active and inactive molecules of the training set were analyzed, in accordance with the five structural clusters, using MW, XLogP (an estimation of the octanol-water partition coefficient, Log P) and TopoPSA. The analysis of these data indicates that the active and inactive molecules against MRSA in the training set are distributed over a wide range of MW (i.e., 89-1490 Da), XLogP (i.e., −9.34-21.90) and TopoPSA (i.e., 0-644.98 Å 2 ) values. Interestingly, approximately 63% of the compounds present in the training set have a MW bellow 500 Da. This MW interval contains approximately 54% and 67% of all active and inactive molecules against MRSA in the training set, respectively. In the same way, using this rule (MW < 500 Da) it is possible to discriminate inactive molecules in relation to active molecules in four structural clusters, namely in clusters: A, B, D, and E, which comprises 67%, 69%, 70%, and 56% of inactive molecules as compared to 48%, 57%, 52%, and 44% of active molecules, respectively. The same behavior was described by Ebejer et al. [24] in which only 75% of marketed antibacterial drugs have a MW bellow 500 Da, whereas for marketed other (non-antibacterial) drugs was 86.7%. In addition, more than 68% and 69% of the active and inactive molecules against MRSA in the training set have an XLogP that is lower than 5, respectively. In spite of this using the XLogP < 5 it is possible to prioritize active molecules in relation to inactive molecules in two clusters, namely in clusters: D and E, which comprises 74% and 81% of active molecules as compared to 67% and 77% of inactive molecules, respectively. Furthermore, using the rule TopoPSA ≥ 120.7 Å 2 for antimicrobial compounds, it was possible to discriminate approximately 50% and 41% of all active and inactive molecules against MRSA in the training set, respectively. Therefore, using this rule it was possible to prioritize active molecules in relation to inactive antibacterial molecules in all the five clusters, A-E, which comprises 58%, 38%, 40%, 55%, and 73% of active antibacterial molecules as compared to 43%, 36%, 37%, 36%, and 69% of inactive molecules, respectively.

Approach B
The whole data set, covering 155 samples (50 crude extracts, 55 fractions, 50 pure compounds), was divided into a training set of 116 samples (37 crude extracts, 41 fractions, and 38 pure compounds) and a test set of 39 samples (13 crude extracts, 14 fractions, and 12 pure compounds), which were used for the development and external validation of the QSAR classification models, respectively. Two classes of antibacterial activity against MRSA were set, moderate-active-to-active with MIC < 100 µg/mL (in total 49 samples) and inactive with MIC ≥ 100 µg/mL (in total 106 samples). The whole data set comprising five actinobacteria genera (i.e., Actinomadura, Brevibacterium, Micromonospora, Salinispora, and Streptomyces) in accordance with our previously reported work [38]. The five actinobacteria genera are represented in Table 2 along with their activity classes and average anti-MRSA MIC values. It is interesting to highlight that the most abundant genera in our data set are Streptomyces (in total 80 samples, 62 and 18 samples in the training and test sets, respectively), Salinispora (in total 39 samples, 29 and 10 samples in the training and test sets, respectively), and Micromonospora (in total 32, 23, and 9 samples in the training and test sets, respectively). The genus with the most bioactive potential against MRSA is Streptomyces, comprising 28 (corresponding to 45%) and 9 (corresponding to 50%) active samples out of the 62 and 18 total samples in the training and test sets, respectively. This result is not surprising since the genus Streptomyces over the last decades has stirred huge interest as a source of bioactive compounds, more than 60% of all known antibiotics have been isolated from streptomycetes [39], and moreover a similarly result was obtained by us for the anticancer screening against HCT116 in a previously study from actinobacteria isolated from marine sediments collected off the Madeira Archipelago [37].

Exploration of Empirical Molecular Descriptors and Fingerprints for QSAR Approach A
Two wide sets of descriptors were explored, one with 6 different types of fingerprints (FPs) with different sizes (166 MACCS, MACCS keys; 307 Substructure, presence and count Sub and SubC respectively; 881 PubChem fingerprints; 1024 CDK, circular fingerprints; and 1024 CDK Ext, extended circular fingerprints with additional bits describing ring features) and other with a total of 218 1D and 2D descriptors (including 6 electronic, 195 topological, and 17 constitutional descriptors). The FPs and the molecular descriptors were calculated by PaDEL-Descriptor [41] and CDK Descriptor Calculator (http://www.rguha.net/code/java/cdkdesc.html), respectively. Random Forests (RF) [42] ML technique was used for building pMIC against MRSA prediction regression models, and the performance of the models was successfully evaluated by internal validation (out-of-bag, OOB, estimation on the training set), Table 3.  it was very surprising the poor performance of the 1D and 2D descriptors in the modeling of pMIC against MRSA since several molecular descriptors, as detailed in Section 2.1.1, correlate relatively well with the values of pMIC. In this way we analyzed the Pearson correlation (R) of each of these descriptors with the pMIC values, and it was possible to verify that 36, 73, and 126 descriptors out of the 179 descriptors (39 descriptors have constant values) have an absolute R pMIC higher than the R pMIC obtained by the MW (R pMIC = 0.2388), TopoPSA (R pMIC = 0.1760), and XLogP (R pMIC = 0.0439), respectively. From the seven sets of descriptors and fingerprints, only MACCS, SubC, PubChem, and CDK Ext fingerprints were used in further investigations. The 3D descriptors have a well-established relationship with biological activity and are expected to increase both the accuracy and robustness of the predictive models. Several well-established 3D molecular descriptors were exploited, such as 3D CDK descriptors (e.g., BCUT, WHIM, CPSA and geometrical) and Radial Distribution Function (RDF) pair [43] derived 3D descriptors in which the property is the partial atomic charge, Table 4. Contrary to what would be expected, the predictive power of the anti-MRSA QSAR models does not improve and even worse by including 3D descriptors to the best four sets (MACCS, SubC, PubChem and CDK Ext). Although the model using the CDK Ext FP with 3D CDK descriptors achieved the best results for the training set with R 2 of 0.54 and RMSE of 0.67 (Table 4), the model using only the CDK Ext FP still has better performance (R 2 = 0.57 and RMSE = 0.64, Table 3).
Procedure for feature selection based on RF variable importance were applied to the descriptors of models SubC, PubChem, and CDK Ext- Table 5.  1 Percent of molecules predicted with absolute error above or below 0.5. 2 OOB estimation. 3 Using the mean decrease in accuracy measure of importance for the descriptors in the RF algorithm.
The selection of the 150 most important descriptors from the CDK Ext FP set used to build the model with the RF enabled the training of much smaller RF models with even slightly better prediction accuracies (R 2 = 0.574 and RMSE = 0.635) than the models trained with the whole set of descriptors (1024 descriptors, R 2 = 0.572 and RMSE = 0.636) for the training set.

Exploration of Other State-of-the-art Machine Learning (ML) Techniques
A comparison of three ML techniques, RF, support vector machines (SVM), and Gaussian processes (GPs), for building QSAR pMIC models with the 150 most important descriptors using in the model with CDK Ext FPs by the five structural clusters for training and test sets is shown in Table 6 and Figure 1, respectively.  Variation of the ML algorithm could not achieve any consistent improvement of the results obtained with RF for all molecules in the training and test sets taking into account the R 2 and RMSE values, Table 6 and Figure 1. For all ML models, the best predictions obtained for the structural clusters are obtained from cluster A-indole derivative centroid-and are also better than those obtained for all training and test set taking into account the RMSE value. Interestingly, the worse Variation of the ML algorithm could not achieve any consistent improvement of the results obtained with RF for all molecules in the training and test sets taking into account the R 2 and RMSE values, Table 6 and Figure 1. For all ML models, the best predictions obtained for the structural clusters are obtained from cluster A-indole derivative centroid-and are also better than those obtained for all training and test set taking into account the RMSE value. Interestingly, the worse predictions obtained taking into account the RMSE value for all the clusters in the training and test set were to the cluster C-2-oxazolidone derivative centroid.
Averaged predictions (consensus) obtained by the RF, SVM and GPs models with the 150 most important CDK Ext FPs (CM1), or by the four best set of descriptors, MACCS, SubC, PubChem and CDK Ext (CM2) further improved the results, Table 7. The best results for the prediction of pMIC against MRSA are achieved with CM2 model with 35% of the structures predicted with a deviation less than 0.5 and 91% of the structures predicted with a deviation less than 1 for the test set. Figure 2 represents the plot of predicted versus experimental pIMIC against MRSA for the test set using the best consensus model, CM2. The best results for the prediction of pMIC against MRSA are achieved with CM2 model with 35% of the structures predicted with a deviation less than 0.5 and 91% of the structures predicted with a deviation less than 1 for the test set. Figure 2 represents the plot of predicted versus experimental pIMIC against MRSA for the test set using the best consensus model, CM2.

Applicability Domain of the pMIC against MRSA Model
The similarity between a molecule of an external data set and all the 5112 molecules in the training set for the regression MRSA model was used to define its applicability domain. The molecules of the training set were mapped on a Kohonen self-organizing map (SOM), using in-house developed software based on JATOON Java applets [44], on the basis of the 512 FPs calculating by GenerFP program in JChem according to the five structural clusters, Table 1. No information about anti-MRSA activity was used. In Figure 3, a trend for clustering according to structural clusters features of the compounds was shown. molecules of the training set were mapped on a Kohonen self-organizing map (SOM), using in-house developed software based on JATOON Java applets [44], on the basis of the 512 FPs calculating by GenerFP program in JChem according to the five structural clusters, Table 1. No information about anti-MRSA activity was used. In Figure 3, a trend for clustering according to structural clusters features of the compounds was shown.
(a) (b) After training, the metric of similarity used was the SOM response patterns after normalization, where d(x,ni) is the Euclidian distance between the molecular descriptor vector x and ni (represents the centroid vector of the i th SOM neuron). The threshold definition based on the average of SOM distance (ASD) between each molecule of the test set and all the molecules of the training set was accomplished in order to reduce the MAE of the molecules belonging to the applicability domain of the anti-MRSA model and was set in accordance with the mapping of the five structural clusters (A-E) on SOM and MAE obtained in the best RF model (CM2). The applicability domain of the model is defined as containing all molecules of the training set that were mapped as belonging to one of the five clusters on SOM with an ASD lower than 0.4. Therefore, using this threshold for the test set, 1230 molecules belonging to the applicability domain of the anti-MRSA model were obtained, with R 2 = After training, the metric of similarity used was the SOM response patterns after normalization, where d(x,n i ) is the Euclidian distance between the molecular descriptor vector x and n i (represents the centroid vector of the i th SOM neuron). The threshold definition based on the average of SOM distance (ASD) between each molecule of the test set and all the molecules of the training set was accomplished in order to reduce the MAE of the molecules belonging to the applicability domain of the anti-MRSA model and was set in accordance with the mapping of the five structural clusters (A-E) on SOM and MAE obtained in the best RF model (CM2). The applicability domain of the model is defined as containing all molecules of the training set that were mapped as belonging to one of the five clusters on SOM with an ASD lower than 0.4. Therefore, using this threshold for the test set, 1230 molecules belonging to the applicability domain of the anti-MRSA model were obtained, with R 2 = 0.693, RMSE = 0.586 and MAE = 0.436. However for the molecules of the test set outside the defined applicability domain (i.e., 303 molecules) worse predictions were obtained, with R 2 = 0.639, RMSE = 0.620 and MAE = 0.479.

Application of the in silico Anti-MRSA Model in Virtual Screening
In the present study, a virtual screening campaign was carried out to search for new lead-like inhibitors against MRSA. The best model, the consensus model-CM2 that is averaged predictions obtained by the four best set of descriptors, MACCS, SubC, PubChem, and CDK Ext, was selected for the virtual screening procedure. The virtual library comprises 3990 molecules from the StreptomeDB 2.0 database [39], an extended compilation of NPs produced by Streptomyces. In addition, StreptomeDB 2.0 includes comprehensive background information e.g., host organisms, predicted physicochemical properties, synthesis routes, and biological activities. Although the virtual library includes 1994 molecules with at least one biological activity record, the complete activity spectrum for this data set is not clarified yet.
The best model, CM2, selected 212 virtual hits from the StreptomeDB 2.0 library with a pMIC against MRSA higher than 5.3 (corresponding to a MIC value lower than 5 mM, the cutoff defined by us for antibacterial active molecules against MRSA).  Furthermore, using the threshold defined for the applicability domain of the anti-MRSA model (i.e., ASD < 0.4), it was possible to prioritize the most probable lead-like antimicrobial against MRSA NPs from the StreptomeDB 2.0 library. Therefore, the virtual hits were reduced to 150 compounds, which were ranked by their predicted pMIC, predicted structural clusters and ASD on SOM and are shown in Table S1, Supporting Information. The predicted structural clusters with the best predictive power in accordance with RMSE for the external test set were A, D and E, Figure 1. None of the NPs that were predicted as belonging to the structural cluster D passed the thresholds defined, i.e., pMIC > 5.3 and ASD < 0.4. The list of twelve NPs that were prioritized based upon their ranking is shown in Table 8 and Figure 5.  Furthermore, using the threshold defined for the applicability domain of the anti-MRSA model (i.e., ASD < 0.4), it was possible to prioritize the most probable lead-like antimicrobial against MRSA NPs from the StreptomeDB 2.0 library. Therefore, the virtual hits were reduced to 150 compounds, which were ranked by their predicted pMIC, predicted structural clusters and ASD on SOM and are shown in Table S1, Supporting Information. The predicted structural clusters with the best predictive power in accordance with RMSE for the external test set were A, D and E, Figure 1. None of the NPs that were predicted as belonging to the structural cluster D passed the thresholds defined, i.e., pMIC > 5.3 and ASD < 0.4. The list of twelve NPs that were prioritized based upon their ranking is shown in Table 8 and Figure 5.  From the twelve resulting virtual screening hits only the bis-pyrrole derivative, ID 10301, was already presented in our training or test sets (training set Mol3405, Table S2, Supporting Information). From the twelve resulting virtual screening hits only the bis-pyrrole derivative, ID 10301, was already presented in our training or test sets (training set Mol3405, Table S2, Supporting Information).

Exploration of NMR Descriptors for QSAR Approach B
Three ranges for 13 C NMR and four ranges for 1 H NMR were used to generate the NMR descriptors and were the following, respectively: (1) 1.5 (133 descriptors), 1.0 (200 descriptors), and 0.5 ppm (400 descriptors); and (2) 0.5 (23 descriptors), 0.2 (61 descriptors), 0.1 (120 descriptors) and 0.05 ppm (240 descriptors). Exploratory QSAR experiments using three NMR descriptors sets (with 1 H NMR descriptors, 13 C NMR descriptors and combining 1 H and 13 C NMR descriptors) were built with RF algorithm to predict two classes of antimicrobial activity against MRSA (i.e., moderate-active-to-active with MIC < 100 µg/mL and inactive with MIC ≥ 100 µg/mL) within a OOB estimation procedure for the training set (Table 9). In Table 9, only the best model of the twelve models, which were trained combining 1 H and 13 C NMR descriptors, is represented. The best model was achieved using 0.5 and 0.2 ppm ranges for 13 C and 1 H NMR, respectively, in total 461 NMR descriptors. As we did for approach A, a procedure for feature selection based on RF variable importance were applied to the descriptors of the best 13  A comparison of three ML techniques, RF, SVM, and Convolutional Neural Network (CNN), for building QSAR classification models with the 100 most important descriptors using in the model with 461 NMR descriptors for training and test sets is shown in Table 10. Averaged predictions (consensus) obtained by the RF, SVM, and CNN models with the 100 most important 13 C and 1 H NMR descriptors (CM_NMR) further improved the results. The two classes of antimicrobial activity against MRSA were predicted for the training and test sets with Q of 0.81 and 0.77, respectively. The MCC were also improved to 0.55 and 0.49 for the training and test sets, respectively.
Moreover, the CM_NMR model was validated with a final prediction set consisting of four pure compounds not used for any task before, which were isolated in our MNP group. The four MNP are two diketopiperazine derivatives, one napyradiomycin derivative and one unknown compound that appears to be a new macrocyclic derivative. The chemical structure of these 4 MNP has not yet been fully elucidated and consequently they are excellent candidates for this QSAR approach B model. In Table 11 is showed the predictions obtained using the CM_NMR model for the four pure compounds and anti-MRSA experimental assay results. Only a misclassification between moderate-active-to-active and inactive class was obtained for the pure compound, PTM-420 F5,F45, however its MIC value is close to the threshold for the inactive class, MIC ≥ 100 µg/mL.

Analysis of NMR Descriptors Identified as Relevant for Modeling Anti-MRSA Activity in the RF Model
The 100 most important descriptors, found by the RF algorithm using the MeanDecreaseAccuracy parameter (Mean Decrease in Accuracy) [45] and using for building the CM_NMR were analyzed and the fifteen most relevant descriptors were represented in Table 12. Interestingly, there are nine descriptors that codify 1 H NMR data out of the ten most important NMR descriptors. From those nine 1 H NMR descriptors, five descriptors codify polar functional groups such as -NH (3, 8, 19 and 22), -OH (3, 8, 19 and 22), -COOH (19 and 22), C=N-OH (22), or -CHO (28). On the other hand, one out nine 1 H NMR descriptors codifies saturated alkyl groups (58). The only 13 C NMR descriptor in the ten most important NMR descriptors discriminated aromatic, olefinic or nitrile functional groups (318). Moreover, it is also verified that importance by activity classes (moderate-active-to-active and inactive classes) for each of the ten most important descriptors is more or less similar except to descriptors C318, which gives the inactive class a higher weight.

Approach A
A data set of 7744 organic molecules was extracted from the ChEMBL (https://www.ebi.ac.uk/ chembl/) [46], PubChem (http://pubchem.ncbi.nlm.nih.gov) [47], ZINC (https://zinc15.docking. org/) [48] databases, searching by antibacterial activity against MRSA with MIC values and their chemical structures saved in the SMILES or MDL SDF data format. In addition, more 653 chemical structures with anti-MRSA activity records were also added by searching in the literature indexed in Web of Science™ Core Collection between May 2013 and March 2015. After assembling these databases, the duplicates were removed based on InChI codes, although the chirality was taken into account, racemic molecules (or cases where the stereochemistry was not shown) were considered as one of the possible stereoisomers. For the duplicates with different MIC values, the most recent were considered. After this, the final data set comprises 6645 molecules and the MIC values were converted to pMIC. The SMILES strings of the data set, the corresponding experimental and predicted activities are available as Supplementary Material, Tables S2 and S3.

Approach B
A total of 155 samples was used for approach B, comprising 50 crude extracts, 55 fractions, and 50 pure compounds obtained from microbial actinobacteria isolated from ocean sediments collected off the Madeira Archipelago [38], corresponding to 49 moderate-active-to-active (MIC < 100 µg/mL) and 106 inactive (MIC ≥ 100 µg/mL) samples antibacterial against MRSA. Actinobacteria strains were isolated from the marine sediments, taxonomically characterized through 16S rRNA gene sequencing, and the crude extracts were obtained through liquid-liquid extraction with ethyl acetate (EtOAc) in accordance with our previously reported work [38]. The EtOAc crude extracts were fractionated by silica flash chromatography, eluted with step gradients of isooctane/EtOAc followed by EtOAc/MeOH and were obtained nine fractions. Pure compounds were isolated by reversed phase HPLC (Phenomenex Luna, 250 × 100 mm, 5 µm, 100 Å, 1,5 mL/ min, UV 210, 250 and 360 nm) using a gradient solvent system of acetonitrile and water. The code, type, and the actinobacteria genus of the samples comprising the data set, the corresponding experimental and predicted activity classes are available as Supplementary Information, Tables S4, S5 and S6.

Approach A
The molecular structures were standardized by normalizing tautomeric and mesomeric groups and by removing small disconnected fragments using JChem Standardizer tool version 5.7.13.0 (ChemAxon Ltd., Budapest, Hungary). Three-dimensional models of the molecular structures were generated with CORINA version 2.4 (Molecular Networks GmbH, Erlangen, Germany. Empirical Molecular fingerprints were calculated by PaDEL-Descriptor version 2.21 (Yap Chun Wei, Pharmaceutical Data Exploration Laboratory)) [41]. Different types of fingerprints with different sizes were calculated and explored: 166 MACCS (MACCS keys), 307 Substructure (presence and count of SMARTS patterns for Laggner functional group classification-Sub and SubC respectively), 881 PubChem fingerprints (ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/ pubchem_fingerprints.txt), 1024 CDK (circular fingerprints), and 1024 CDK extended (Ext circular fingerprints with additional bits describing ring features). 1D, 2D, and 3D Molecular descriptors were calculated by CDK Descriptor Calculator version 1.4.6 (Rajarshi Guha), which comprise 218 1D/2D descriptors (including 6 electronic, 195 topological, and 17 constitutional descriptors) and 66 3D descriptors (including 6 BCUT, 13 WHIM, 29 CPSA and 18 geometrical descriptors). Radial Distribution Function (RDF) pair descriptors [43], 3D RDF descriptors, were calculated by sampling the function of Equation (1) at 128 equally distributed values of r between 0 and 12.8 Å: where N is the number of atoms in the molecule, pi is the charge of atom i, B is a fuzziness parameter (it was 100 in this study), and r ij is the 3D distance between atoms i and j. Three sets of 128 RDF descriptors were separately calculated, derived from atom pairs with (a) a positive and a negative charge, (b) two positive charges, and (c) two negative charges. The partial atomic charges-natural bond orbital (NBO) partial atomic charges were estimated using a ML tool developed by Aires-de-Sousa and co-workers [49] (http://joao.airesdesousa.com/charges).

Approach B
All samples were evaluated for antibacterial activity against MRSA and the 1D NMR spectra were also acquired. NMR spectra were obtained using a Bruker Advance spectrometer, model ARX 400, (400 MHz for 1 H and 100 MHz for 13 C) with tetramethylsilane (TMS) as internal reference and deuterated chloroform as solvent. NMR spectra were handled with the ACD/NMR Processor (version 12.01) and the range of chemical shifts used was 0-200 ppm and 0-12 ppm for the 13 C and 1 H, respectively. The NMR descriptors were generated using the following ranges: (1) 1.5 (133 descriptors), 1.0 (200 descriptors), and 0.5 ppm (400 descriptors) for 13 C NMR; and (2) 0.5 (32 descriptors), 0.2 (61 descriptors), 0.1 (120 descriptors) and 0.05 (240 descriptors) ppm for 1 H NMR.

Approach A
The whole data set of 6645 small molecules was randomly divided into a training set of 5112 molecules (comprising 1688 active molecules with MIC < 5 µM and 3424 inactive molecules with MIC ≥ 5 µM) and a test set of 1533 molecules (comprising 543 active molecules and 990 inactive molecules) in order to the biological diversity of the data set was captured by both sets. The built QSAR models were developed and externally validated using the training and test sets, respectively.

Approach B
The whole data set, comprising 155 samples (50 crude extracts, 55 fractions, 50 pure compounds), was split into a training set of 116 samples (37 crude extracts, 41 fractions, and 38 pure compounds) and a test set of 39 samples (13 crude extracts, 14 fractions, and 12 pure compounds), which were used for the development and external test validation of the QSAR models, respectively. The approximate 3:1 partition for training and test sets, respectively, was carried out randomly according to the two classes of antibacterial activity against MRSA (i.e., moderate-active-to-active with MIC < 100 µg/mL, in total 49 samples, and inactive with MIC ≥ 100 µg/mL, in total 106 samples), the type of sample (i.e., crude extracts, fractions, or pure compounds), and the five actinobacteria genera (i.e., Actinomadura, Brevibacterium, Micromonospora, Salinispora, and Streptomyces) in order to the biological diversity of the data set was captured by both sets.

Random Forests (RF)
A RF [42,50] is an ensemble of unpruned trees, which are created using bootstrap samples of the training set and for each individual tree the best split at each node is defined using a randomly selected subset of descriptors. A different training and validation set was used to create each individual tree. Prediction is made by a majority vote of the classification trees (classification) or by average of the individual regression trees (regression) in the forest. The prediction error for the compounds left out in the bootstrap procedure (internal cross-validation or OOB estimation) was used to assess the internal performance. The RF method quantifies also the importance of a descriptor by the increase in misclassification occurring when the values of the descriptor are randomly permuted, correlated with the mean decrease in accuracy parameter, or by the decrease in a node's impurity every time the descriptor is used for splitting, correlated with the mean decrease in the Gini coefficient parameter.
In the experiments presented here, RF were used for the development of regression or classification models to estimate antibacterial activity against MRSA. The R program [51], version 3.2.3 was used to grow RF using the RandomForest library [52]. The number of trees in the forest was set to 500 and the other parameters, except mtry, were used with default values. The mtry parameter values were selected using factor levels of the default value (i.e., 1/3 of the number of descriptors or square root of the number of descriptors in the data for regression or classification, respectively).

Support Vector Machines (SVM)
SVM [53] map the data into a hyperspace through a nonlinear mapping (a boundary or hyperplane), for classification models the two class of compounds are separated in this space and for regression models a linear regression is performed in this space. Support vectors, which are examples in the training set, were used to position the boundary. Kernel functions can be used to transform nonlinear data into a hyperspace where the classes become linearly separable. In the current study, SVM models were explored with the Weka (version 3.8.2) [54] implementation of the LIBSVM software [55]. The C-SVM-classification or ε-SVM-regression types were chosen, the kernel function selected was the radial basis function and used the default value for the gamma parameter, and the parameter C was optimized in the range of 1-200 through 10-fold cross-validation with the training set.

Gaussian Processes (GPs)
GPs use lazy learning and a measure of the similarity between compounds (the kernel function) to predict the value for an unseen compound from training set. The prediction is not just an estimate for that compound, but also has uncertainty information, it is a one-dimensional Gaussian distribution. It was used in this work as specifically implemented by the Gaussian Processes class in Weka version 3.8.2 [54], with the options set as default, except the kernel function and the noise that were optimized in cross-validation experiments with the training set. GPs were implemented for regression without hyperparameter-tuning using a RBF kernel, Equation (2).
The noise level was chosen applying normalization/standardization to the target attribute as well as the other attributes based on normalize training set.

Convolutional Neural Network (CNN)
CNN is a feed-forward neural network (NN) with more than one hidden layers or convolutional layers. CNN implementation with dropout regularization and Rectified Linear Units. The Weka [54] NeuralNetwork package (version 3.8.2) was used to implement feed-forward neural networks. The NeuralNetwork options were set as default, except the number of hidden units, number of layers, hiddenLayersDropoutRate and inputLayerDropoutRate parameters that were optimized in cross-validation experiments with the training set.

Antibacterial Screening against MRSA Strain
The antibacterial activity was evaluated for the crude extracts, fractions and pure compounds by performing screening against methicillin-resistant Staphylococcus aureus (MRSA) COL [56], using Brain Heart Infusion (BHI) medium (DIFCO Laboratories, Detroit, USA, 1l DI water). The screening was performed in 96 well plates; each crude extract, fractions or pure compound, previously concentrated at 10mg/mL in DMSO, was added to a log-phase grown culture (OD 600 nm = 0.04−0.06) to a 2.5% (v/v) final concentration. All samples were two-fold serially diluted five times, resulting in final concentrations of the tested compounds ranging from 250 to 7.81 µg/mL. Further dilutions were tested, values down to 0.03 µg/mL, if the 7.81 µg/mL concentration showed inhibitory activity. After 18 h of incubation at 37 • C, minimal inhibitory concentrations were determined by visual inspection and spectrophotometric analysis.

Conclusions
Following our previous work that modeling the anticancer activity against HCT116 [37], the current results suggest as well that the chemoinformatics QSAR approach relying on a ligand-based methodology either based on the molecular structures or the NMR spectra, corroborated with an experimental approach, could be used to predict new inhibitory compounds against MRSA. To our knowledge, the QSAR regression model developed here, approach A, is the largest study ever performed with regard both to the number of compounds involved and to the number of structural families involved in the modeling of the antibacterial activity against MRSA [19,20,25]. The NMR QSAR classification model, approach B, was extended to a high number of samples containing additional 45 pure compounds and therefore the overall predictability accuracies (Q) were improved as compared with those obtained in our previously work [37]. The Q of the best model exceeded 77% and 63% for test set in MRSA and HCT116 modeling, respectively. Moreover, the development of QSAR model using predicted NMR spectra of molecules and mixture of molecules (simulating fractions and extracts) would be an important strategy to be developed in future work.
Supplementary Materials: The following are available online at http://www.mdpi.com/1660-3397/17/1/16/s1. The following files are available free of charge. The SMILES strings of the list of the virtual screening hits from the StreptomeDB 2.0 library, training and test sets, and the corresponding experimental and predicted activities against MRSA for approach A are available as Supporting Information, Tables S1, S2, and S3, respectively. Furthermore, the code and type and the actinobacteria genus of the samples comprising the training, test, and test 2 sets and the corresponding experimental and predicted activity classes against MRSA for approach B are also available as Supporting Information, Tables S4, S5, and S6, respectively (XLSX).