In Silico HCT116 Human Colon Cancer Cell-Based Models En Route to the Discovery of Lead-Like Anticancer Drugs

To discover new inhibitors against the human colon carcinoma HCT116 cell line, two quantitative structure–activity relationship (QSAR) studies using molecular and nuclear magnetic resonance (NMR) descriptors were developed through exploration of machine learning techniques and using the value of half maximal inhibitory concentration (IC50). In the first approach, A, regression models were developed using a total of 7339 molecules that were extracted from the ChEMBL and ZINC databases and recent literature. The performance of the regression models was successfully evaluated by internal and external validations, the best model achieved R2 of 0.75 and 0.73 and root mean square error (RMSE) of 0.66 and 0.69 for the training and test sets, respectively. With the inherent time-consuming efforts of working with natural products (NPs), we conceived a new NP drug hit discovery strategy that consists in frontloading samples with 1D NMR descriptors to predict compounds with anticancer activity prior to bioactivity screening for NPs discovery, approach B. The NMR QSAR classification models were built using 1D NMR data (1H and 13C) as descriptors, from 50 crude extracts, 55 fractions and five pure compounds obtained from actinobacteria isolated from marine sediments collected off the Madeira Archipelago. The overall predictability accuracies of the best model exceeded 63% for both training and test sets.


Introduction
Colorectal cancer is the third most commonly detected cancer and the fourth foremost cause of cancer deaths in the world, accounting for about 1.4 million new cases and almost 700,000 deaths in 2012 [1,2]. The distribution of colorectal cancer burden varies widely, with more than two-thirds of all cases and almost 50% of all deaths occurring in more developed regions [2]. Therefore, colorectal cancer is considered one of the clearest markers of the cancer transition [1], replacing infection-related cancers in low developed regions that are undergoing rapid societal and economic changes together with other cancers predominantly linked to western lifestyles, which are often found in highly developed countries. Furthermore, drug research and development (R&D) is comprehensive, complex, expensive, 2.2.2. Approach B All samples were evaluated for HCT116 cytotoxic activity and the 1D NMR spectra were also acquired. NMR spectra were obtained using a Bruker Advance spectrometer, model ARX 400, (400 MHz for 1 H and 100 MHz for 13 C) with tetramethylsilane (TMS) as internal reference and deuterated chloroform as solvent. NMR spectra were handled with the ACD/NMR Processor (version 12.01, Advanced Chemistry Development, Toronto, Canada) and the range of chemical shifts used were 0-200 ppm and 0-12 ppm for the 13 C and 1 H, respectively. The NMR descriptors were generated using the following ranges: (1) 1.5 (133 descriptors), 1.0 (200 descriptors), and 0.5 ppm (400 descriptors) for 13 C; and (2) 0.2 (61 descriptors), 0.1(120 descriptors) and 0.05 (240 descriptors) ppm for 1 H.

Approach A
Two different approaches were used for the partition of the training and test sets. In one approach, the entire data set was divided into a training and a test sets of 5873 and 1466 compounds, respectively. The built QSAR models were developed and externally validated using the training and test sets, respectively. The approximate 4:1 partition for training and test sets, respectively, was carried out randomly according to the three categories of anticancer activity (i.e., active-to-very-active with IC 50 < 10 µM, active-to-moderate-active with 10 µM ≤ IC 50 < 50 µM, and inactive with IC 50 ≥ 50 µM) in order to the biological diversity of the data set was captured by both sets. These three categories of activity were used only for the partition of the training and test sets. In the following experiments the definition of the limit of the anticancer activity against human colon carcinoma HCT116 cell line that was referred in Section 2.1.1 (active category with IC 50 ≤ 10 µM) will always be used. In the other approach, the entire data set was divided into a training set of 5875 compounds and a test set of 1464 compounds, respectively. The approximate 4:1 partition was performed by a Kohonen Self-Organizing Map (SOM) [35] in such a way that both sets reflect the chemical diversity of the data set. The 7339 chemical structures of the whole data set were mapped on a SOM on the basis of Substructure fingerprint according to the three categories of anticancer activity. A tendency for clustering according to the structural classes of compounds was verified. The compounds for the test set were selected from occupied neurons and belonging to each of the structural clusters. The two approaches using the Substructure fingerprints were compared using the random forest (RF) algorithm in an out-of-bag (OOB) estimation, and a better performance was achieved by the SOM partition for the training set. The SOM partition between training and test sets of 5875 and 1464 compounds, respectively, was therefore included in the following experiments.

Approach B
The whole data set, comprising 110 samples (50 crude extracts, 55 fractions, five pure compounds), was split into a training set of 74 samples (35 crude extracts and 39 fractions) and a test set of 36 samples (15 crude extracts, 16 fractions, five pure compounds), which were used for the development and external test validation of the QSAR models, respectively. The approximate 2:1 partition for training and test sets, respectively, was carried out randomly according to the two classes of anticancer activity (i.e., moderate-active-to-active with IC 50 < 156 µg/mL, in total 34 samples, and inactive with IC 50 ≥ 156 µg/mL, in total 76 samples) and the type of sample (i.e., crude extracts, fractions, or pure compounds) in order to the biological diversity of the data set was captured by both sets.

Selection of Descriptors and Optimization of QSAR Models
Approach A The constant descriptors were removed and the training set was used to build MLRs [36] with Weka 3.7.12 [37] to select descriptors by the M5 method. With this method, all descriptors were used to build a first regression model, and then descriptors with the smallest standardized regression coefficients were removed in a stepwise way until no improvement was observed in the estimate of the average prediction error given by the Akaike information criterion (AIC) [36]. This procedure was separately applied to all the type of descriptors for a preliminary selection of descriptors. There is a demand for QSAR models with the minimum possible number of descriptors in order to develop more interpretable QSAR models, descriptor selection was further performed with the correlation-based feature selection (CFS) [38] algorithm implemented in Weka 3.7.12 (The University of Waikato, Hamilton 3216, New Zealand). The CFS takes into account the usefulness of individual descriptors for predicting the anticancer activity category or the pIC 50 value together with the level of intercorrelation among them. The AttributeSelectedClassifier routine of Weka with the CfsSubsetEval option for evaluator and BestFirst, LinearForwardSelection, GreedyStepwise and PSOsearch options for search were used to compare experiments using the ten collections of descriptors, as well as different machine learning (ML) techniques. Selection of descriptors was accomplished using this procedure with the CFS algorithm within a ten-fold cross-validation methodology and k-nearest neighbors (k-NN) algorithm as ML technique. Optimization of QSAR regression models was performed using ten-fold or OOB cross-validation methodology with the training set employing the following statistical metrics: (1) R 2 , the square of the correlation coefficient (Pearson's r); (2) RMSE, root mean square error; and (3) MAE, mean absolute error.

k-Nearest Neighbors
The k-NN technique predicts the activity category or the pIC 50 value for a compound by majority voting of the k most similar compounds or by the average of the values for the k most similar compounds in the training set, respectively. Here, the k-NN algorithm was applied with the Weka (version 3.7.12) [37] using a k of 10, Euclidean distances, and contributions of neighbors weighted by the inverse of distance, which were optimized in ten-fold cross-validation methodology with the training set.

Random Forests
A RF [39,40] is an ensemble of unpruned trees, which are created using bootstrap samples of the training set and for each individual tree the best split at each node is defined using a randomly selected subset of descriptors. A different training and validation set was used to create each individual tree. Prediction is made by a majority vote of the classification trees (classification) or by average of the individual regression trees (regression) in the forest. Moreover, the prediction error for the objects left out in the bootstrap procedure (internal cross-validation or OOB estimation) was used to assess the internal performance. The RF method quantifies also the importance of a descriptor by the increase in misclassification occurring when the values of the descriptor are randomly permuted, correlated with the mean decrease in accuracy parameter. RF give as well a probability to every prediction on the basis of the number of votes obtained by the predicted class. In the experiments presented here, RF were used for the development of classification or regression models to estimate anticancer activity against HCT116. The R program [41], version 3.2.3 was used to grow RF using the RandomForest library [42]. The number of trees in the forest was set to 500 and the other parameters, except mtry, were used with default values. The mtry parameter values were selected using factor levels of the default value (i.e., square root of the number of descriptors or 1/3 of the number of descriptors in the data for classification or regression, respectively).

Support Vector Machines
Support vector machines (SVM) [43] map the data into a hyperspace through a nonlinear mapping (a boundary or hyperplane), for classification models the two class of compounds are separated in this space and for regression models a linear regression is performed in this space. In the current study, SVM models were explored with the Weka (version 3.7.12) [37] implementation of the LIBSVM software [44]. The C-SVM-classification or ε-SVM-regression types were chosen, the kernel function selected was the radial basis function and used the default value for the gamma parameter, and the parameter C was optimized in the range of 10−1000 through ten-fold cross-validation with the training set. The descriptors selected with the CFS algorithm within a ten-fold cross-validation for the training set were normalized and used to develop the classification and regression models.
2.6. Anticancer Screening in HCT116 Human Colon Carcinoma Cells 2.6.1. Cell Culture The HCT116 human colon carcinoma cell line was grown in McCoy's 5A supplemented with 10% fetal bovine serum, 1% antibiotic/antimycotic (Invitrogen, Grand Island, NY, USA) and cultured at 37 • C in a humidified atmosphere of 5% CO 2 . The MTS metabolism assay was performed with cells seeded in 96-well plates at 3750 cells/well.

Crude Extract and 5-FU Exposure
Stock solutions of 10 mg mL −1 of samples (actinomycete crude extracts, fractions, and pure compounds) and positive control 5-fluorouracil (5-FU) at 8 mM (Sigma, St. Louis, MO, USA) were prepared in dimethylsulfoxide (DMSO). Twenty-four hours after cell platting, cells were exposed to serial dilutions of samples and 5-FU, or DMSO vehicle control, for 72 h. All test samples, 5-FU or DMSO were serially diluted four-fold in culture medium.

Evaluation of Cytotoxicity
To determine cancer cell response to chemotherapeutics and other compounds in targeted screenings, as well as to explore colon cancer signaling pathways, we tested the anticancer activity of samples in HCT116 cells. After 72 h of cell exposure to samples, the activity was evaluated and in parallel to the positive and vehicle controls. CellTiter 96 ® aqueous non-radioactive cell proliferation assay (Promega, Madison, WI, USA) was used to anticancer activity evaluation using 4-(4,5-dimethylathiazol-2-yl)-5-(3carboxymethoxyphenyl)-2-(4-sulfophenyl), inner salt (MTS), according to the manufacturer's instructions. And the quantity of formazan product was measured after 1 h of incubation, using a Bio-Rad microplate reader Model 680 (BioRad, Hercules, CA, USA) at 490 nm. GraphPad Prism (version 5 GraphPad Software) was used to the IC 50 values determination. All samples were diluted, resulting in final concentrations of the tested samples ranging from 156.2 to 0.08 µg mL −1 .

Results and Discussion
In the current work, we report the building of two QSAR studies using the chemical structures of a data set of molecules and the 1D NMR spectra of a data set of samples (crude extracts, fractions and pure compounds) isolated from marine sediments collected from the Madeira Archipelago for the prediction of anticancer activity against human colon carcinoma HCT116 cell line, using the value of IC 50 . The two approaches, A and B, comprise several steps in order to build a comprehensive HCT116 QSAR model building process, Figure 1. dimethylathiazol-2-yl)-5-(3carboxymethoxyphenyl)-2-(4-sulfophenyl), inner salt (MTS), according to the manufacturer's instructions. And the quantity of formazan product was measured after 1 h of incubation, using a Bio-Rad microplate reader Model 680 (BioRad, Hercules, CA, USA) at 490 nm. GraphPad Prism (version 5 GraphPad Software) was used to the IC50 values determination. All samples were diluted, resulting in final concentrations of the tested samples ranging from 156.2 to 0.08 μg mL −1 .

Results and Discussion
In the current work, we report the building of two QSAR studies using the chemical structures of a data set of molecules and the 1D NMR spectra of a data set of samples (crude extracts, fractions and pure compounds) isolated from marine sediments collected from the Madeira Archipelago for the prediction of anticancer activity against human colon carcinoma HCT116 cell line, using the value of IC50. The two approaches, A and B, comprise several steps in order to build a comprehensive HCT116 QSAR model building process, Figure 1.

Approach A
The whole data set (i.e., 7339 small molecules) was divided by the SOM into a training set of 5875 molecules (comprising 3441 active-to-very-active, Act-to-Vact, with IC50 < 10 μM, and 2434 inactive-to-moderate-active, Iact-to-Mact, with IC50 ≥ 10 μM, molecules) and a test set of 1464 molecules (comprising 1071 Act-to-Vact and 393 Iact-to-Mact molecules), which were used for the development and external validation of the QSAR regression models, respectively. The whole data set was clustered into 10 structural classes or scaffold types (I-X) using the ward tool in JChem. The ten structural clusters are represented in Table 1 along with as their average and maximum HCT116

Approach A
The whole data set (i.e., 7339 small molecules) was divided by the SOM into a training set of 5875 molecules (comprising 3441 active-to-very-active, Act-to-Vact, with IC 50 < 10 µM, and 2434 inactive-to-moderate-active, Iact-to-Mact, with IC 50 ≥ 10 µM, molecules) and a test set of 1464 molecules (comprising 1071 Act-to-Vact and 393 Iact-to-Mact molecules), which were used for the development and external validation of the QSAR regression models, respectively. The whole data set was clustered into 10 structural classes or scaffold types (I-X) using the ward tool in JChem. The ten structural clusters are represented in Table 1 along with as their average and maximum HCT116 pIC 50 values. Although, the molecules of these structural clusters are distributed over a wide range of pIC50 values between −0.03 and 12, all the ten clusters have an average pIC50 value higher or equal to 5.24 (corresponding to an IC50 value lower or equal to 5.75 μM), and a maximum pIC50 value higher than 9 (corresponding to an IC50 value lower than 0.001 μM).
The Lipinski rule only informs if a molecule is more likely to be an orally administrated active drug and if it is easily absorbed by the body. Furthermore, LogP is one of the most important molecular descriptors since it is highly correlated with lipophilicity, thus, more lipophilic molecules are often discontinued from drug development and are frequently related to toxicity issues [45]. Besides MW and LogP, we recently reported that the topological descriptor MDEO-12 (molecular distance edge between primary and secondary oxygen atoms) [46], the electronic descriptor TopoPSA (topologicalpolar surface area) [46] and the quantum-chemical descriptor HOMO (highest occupied molecular orbital energy) [47], have a remarkable performance in discriminating antitumor, antibiotic and overall biological lead-like compounds, respectively.
In order to exploit the training set chemical diversity, the Act-to-Vact and Iact-to-Mact molecules of the training set were analyzed, in accordance with the 10 structural clusters, using MW, XLogP (an estimation of the octanol-water partition coefficient, LogP) and MDEO-12. The analysis of these data indicates that the Act-to-Vact and Iact-to-Mact molecules against HCT116 in the training set are distributed over a wide range of MW (i.e., 76-1461 Da), XLogP (i.e., −8.14-21.24) and MDEO-12 (i.e., 0-27.67) values. The MDEO-12 descriptor is known to codify the molecular size by taking into account oxygen atoms also characterizes polarity [48] and provides an indication of the presence of oxygen-containing groups such as glycosyl, amide, lactam, ester or lactone together with hydroxyl, carboxylic acid or ether functional groups [46]. For example the Act-to-Vact spirostane-type saponin, orchidastroside A, from the Cluster I of the training set has the highest MDEO-12 descriptor value of 27.67. Interestingly, more than 63% of the compounds present in the training set have a MW that belongs to the interval between 300 and 500 Da. This MW interval contains approximately 64% and 62% of all Act-to-Vact and Iact-to-Mact molecules against HCT116 in the training set, respectively. However, using this rule (300 < MW ≤ 500 Da) it is only possible to discriminate Act-to-Vact molecules in relation to Iact-to-Mact molecules in three structural clusters, namely in Clusters III, VI and VIII, which comprises 79%, 82% and 77% of Act-to-Vact molecules as compared to 74%, 79% and 66% of Iact-to-Mact molecules, respectively. In addition, more than 66% and 74% of the Act-to-Vact and Iactto-Mact molecules against HCT116 in the training set have an XLogP that is lower or equal to 5 and Although, the molecules of these structural clusters are distributed over a wide range of pIC50 values between −0.03 and 12, all the ten clusters have an average pIC50 value higher or equal to 5.24 (corresponding to an IC50 value lower or equal to 5.75 μM), and a maximum pIC50 value higher than 9 (corresponding to an IC50 value lower than 0.001 μM).
The Lipinski rule only informs if a molecule is more likely to be an orally administrated active drug and if it is easily absorbed by the body. Furthermore, LogP is one of the most important molecular descriptors since it is highly correlated with lipophilicity, thus, more lipophilic molecules are often discontinued from drug development and are frequently related to toxicity issues [45]. Besides MW and LogP, we recently reported that the topological descriptor MDEO-12 (molecular distance edge between primary and secondary oxygen atoms) [46], the electronic descriptor TopoPSA (topologicalpolar surface area) [46] and the quantum-chemical descriptor HOMO (highest occupied molecular orbital energy) [47], have a remarkable performance in discriminating antitumor, antibiotic and overall biological lead-like compounds, respectively.
In order to exploit the training set chemical diversity, the Act-to-Vact and Iact-to-Mact molecules of the training set were analyzed, in accordance with the 10 structural clusters, using MW, XLogP (an estimation of the octanol-water partition coefficient, LogP) and MDEO-12. The analysis of these data indicates that the Act-to-Vact and Iact-to-Mact molecules against HCT116 in the training set are distributed over a wide range of MW (i.e., 76-1461 Da), XLogP (i.e., −8.14-21.24) and MDEO-12 (i.e., 0-27.67) values. The MDEO-12 descriptor is known to codify the molecular size by taking into account oxygen atoms also characterizes polarity [48] and provides an indication of the presence of oxygen-containing groups such as glycosyl, amide, lactam, ester or lactone together with hydroxyl, carboxylic acid or ether functional groups [46]. For example the Act-to-Vact spirostane-type saponin, orchidastroside A, from the Cluster I of the training set has the highest MDEO-12 descriptor value of 27.67. Interestingly, more than 63% of the compounds present in the training set have a MW that belongs to the interval between 300 and 500 Da. This MW interval contains approximately 64% and 62% of all Act-to-Vact and Iact-to-Mact molecules against HCT116 in the training set, respectively. However, using this rule (300 < MW ≤ 500 Da) it is only possible to discriminate Act-to-Vact molecules in relation to Iact-to-Mact molecules in three structural clusters, namely in Clusters III, VI and VIII, which comprises 79%, 82% and 77% of Act-to-Vact molecules as compared to 74%, 79% and 66% of Iact-to-Mact molecules, respectively. In addition, more than 66% and 74% of the Act-to-Vact and Iactto-Mact molecules against HCT116 in the training set have an XLogP that is lower or equal to 5 and Although, the molecules of these structural clusters are distributed over a wide range of pIC 50 values between −0.03 and 12, all the ten clusters have an average pIC 50 value higher or equal to 5.24 (corresponding to an IC 50 value lower or equal to 5.75 µM), and a maximum pIC 50 value higher than 9 (corresponding to an IC 50 value lower than 0.001 µM).
The Lipinski rule only informs if a molecule is more likely to be an orally administrated active drug and if it is easily absorbed by the body. Furthermore, LogP is one of the most important molecular descriptors since it is highly correlated with lipophilicity, thus, more lipophilic molecules are often discontinued from drug development and are frequently related to toxicity issues [45]. Besides MW and LogP, we recently reported that the topological descriptor MDEO-12 (molecular distance edge between primary and secondary oxygen atoms) [46], the electronic descriptor TopoPSA (topologicalpolar surface area) [46] and the quantum-chemical descriptor HOMO (highest occupied molecular orbital energy) [47], have a remarkable performance in discriminating antitumor, antibiotic and overall biological lead-like compounds, respectively.
In order to exploit the training set chemical diversity, the Act-to-Vact and Iact-to-Mact molecules of the training set were analyzed, in accordance with the 10 structural clusters, using MW, XLogP (an estimation of the octanol-water partition coefficient, LogP) and MDEO-12. The analysis of these data indicates that the Act-to-Vact and Iact-to-Mact molecules against HCT116 in the training set are distributed over a wide range of MW (i.e., 76-1461 Da), XLogP (i.e., −8.14-21.24) and MDEO-12 (i.e., 0-27.67) values. The MDEO-12 descriptor is known to codify the molecular size by taking into account oxygen atoms also characterizes polarity [48] and provides an indication of the presence of oxygen-containing groups such as glycosyl, amide, lactam, ester or lactone together with hydroxyl, carboxylic acid or ether functional groups [46]. For example the Act-to-Vact spirostane-type saponin, orchidastroside A, from the Cluster I of the training set has the highest MDEO-12 descriptor value of 27.67. Interestingly, more than 63% of the compounds present in the training set have a MW that belongs to the interval between 300 and 500 Da. This MW interval contains approximately 64% and 62% of all Act-to-Vact and Iact-to-Mact molecules against HCT116 in the training set, respectively. However, using this rule (300 < MW ≤ 500 Da) it is only possible to discriminate Act-to-Vact molecules in relation to Iact-to-Mact molecules in three structural clusters, namely in Clusters III, VI and VIII, which comprises 79%, 82% and 77% of Act-to-Vact molecules as compared to 74%, 79% and 66% of Iact-to-Mact molecules, respectively. In addition, more than 66% and 74% of the Act-to-Vact and Iact-to-Mact molecules against HCT116 in the training set have an XLogP that is lower or equal to 5 and MDEO-12 that is lower than 0.999, respectively. Therefore, using the XLogP ≤ 5 and MDEO-12 < 0.999 rules it is possible to prioritize Act-to-Vact molecules in relation to Iact-to-Mact molecules in six (I, III, IV, VII, VIII, and X) and five (II, III, VIII, IX, X) structural clusters, respectively.

Approach B
The whole data set, comprising 110 samples (50 crude extracts, 55 fractions, 5 pure compounds), was divided into a training set of 74 samples (35 crude extracts and 39 fractions) and a test set of 36 samples (15 crude extracts, 16 fractions, 5 pure compounds), which were used for the development and external validation of the QSAR classification models, respectively. Two classes of anticancer activity were set, moderate-active-to-active with IC 50 < 156 µg/mL (in total 34 samples) and inactive with IC 50 ≥ 156 µg/mL (in total 76 samples). The whole data set was divided into five actinomycetes genera (Actinomadura, Brevibacterium, Micromonospora, Salinispora, and Streptomyces) in accordance with our previously reported work [29]. The five actinomycetes genera are represented in Table 2 along with their activity classes and average HCT116 IC 50 values. It is interesting to highlight that the most abundant genera in our data set are Streptomyces (in total 50 samples, 36 and 14 samples in the training and test sets, respectively), Salinispora (in total 29 samples, 21 and 8 samples in the training and test sets, respectively), and Micromonopora (in total 27 samples, 15 and 12 samples in the training and test sets, respectively). The genus with the most bioactive potential against HCT116 is Streptomyces, comprising 20 (corresponding to 56%) and 7 (corresponding to 50%) active samples out of the 36 and 14 total samples in the training and test sets, respectively. It is not surprising since the genus Streptomyces over the last decades has stirred huge interest as a source of bioactive compounds, more than 60% of all known antibiotics have been isolated from streptomycetes [49].

Exploration of Empirical Molecular Descriptors and Fingerprints for QSAR Approach A
Two wide sets of descriptors were calculated by PaDEL-descriptor [34], one with 12 different types of fingerprints (FPs) with different sizes (79 Estate, E-State fragments; 166 MACCS, MACCS keys; 307 Substructure, presence and count Sub and SubC respectively; 780 2D atom pairs (presence and count of atom pairs at various topological distances, AP2D and APC2D, respectively), 881 PubChem fingerprints; 1024 CDK, circular fingerprints; 1024 CDK Ext, extended circular fingerprints with additional bits describing ring features; 1024 CDK graph, specialized version of the FP which does not take bond orders into account; and 4860 Klekotha-Roth, presence and count of chemical substructures, KR and KRC respectively) and other with a total of 1869 1D, 2D and 3D molecular descriptors (including electronic, topological, geometrical, constitutional and hybrid, BCUT and WHIM, descriptors). For the calculation of the 3D molecular descriptors the 3D models of the molecular structures were generated with CORINA. The performances of the two sets of descriptors in QSAR experiments in predicting pIC 50 against HCT116 were compared. These exploratory QSAR experiments employed selection of descriptors with the CFS filter [38] followed by the simple k-nearest neighbor (k-NN) prediction of pIC 50 against HCT116, within a ten-fold cross-validation procedure ( Table 3). Table 3. Exploration of two collections of empirical descriptors for the quantitative structure-activity relationship k-nearest neighbors (QSAR k-NN) model of pIC 50 for the training set with a ten-fold cross-validation. The best models are highlighted in bold. The CFS filter maximizes the correlation with the variable to predict and minimizes intercorrelation between descriptors. The two molecular descriptors sets,1D2D and 1D2D3D, and the four fingerprints sets, CDK, CDK Ext and PubChem, achieved the best results, taking into account the value of the RMSE ( Table 3, the best models are highlighted in bold). From the fourteen sets of descriptors and fingerprints, only 1D2D, 1D2D3D, CDK Ext, and PubChem fingerprints were used in further investigations.

Exploration of Other State-of-the-Art Machine Learning Techniques
A comparison of three ML techniques, RF, SVM, and k-NN, for building QSAR models with the descriptors are described in Table 3 for SVM and k-NN, and without selection for RF is shown in Table 4.
In general, we did not observe an effective improvement in performance of RF algorithm with descriptor selection as has been reported in literature [39]. Here, RF showed a better performance when compared to SVM and k-NN for all descriptors sets (i.e., 1D2D, 1D2D3D, PubChem, CDK) in the prediction of the pIC 50 against HCT116 taking into account the value of the RMSE ( Table 4). The best model was accomplished by the RF using the PubChem fingerprints for the training set with an R 2 of 0.751 and RMSE of 0.664. This model was further optimized through descriptor selection, based on the importance assigned by the RF model- Figure 2.
The selection of the 350 most important (mi) descriptors from the PubChem fingerprints set used to build the model with the RF enabled the training of much smaller RF models with even better prediction accuracies (R 2 = 0.752, RMSE = 0.664 and R 2 = 0.729, RMSE = 0.689) than the models trained with the whole set of descriptors (881 descriptors) for the training and test sets, respectively. The analysis of the best HCT116 QSAR model by the ten structural clusters, was displayed in Table 5 for training and test sets. In general the predictions obtained for the structural clusters are better than those obtained for all training set taking into account the RMSE value, except for Clusters I, IV, and VIII (bold highlighted in Table 5).    Interesting, the worse prediction obtained taking into account the RMSE value for all the ten clusters in the test set was also to the Cluster IV. Analyzing the number of outliers, i.e., with an absolute error greater than 3 × MAE (0.455), in each of the ten clusters of the test set allows to identify two clusters (I and IV) that stand out for having a percentage higher than the one obtained for all the test set (78 outliers, 5.33%), 7.75% and 7.50%, respectively. Figure 3 represents the plot of predicted vs. experimental pIC 50 values against HCT116. Interesting, the worse prediction obtained taking into account the RMSE value for all the ten clusters in the test set was also to the Cluster IV. Analyzing the number of outliers, i.e., with an absolute error greater than 3 × MAE (0.455), in each of the ten clusters of the test set allows to identify two clusters (I and IV) that stand out for having a percentage higher than the one obtained for all the test set (78 outliers, 5.33%), 7.75% and 7.50%, respectively. Figure 3 represents the plot of predicted vs. experimental pIC50 values against HCT116.  Table 6 we describe the twenty most important PubChem FPs for Section 6-Simple SMARTS patterns (SSP), 253 FPs; and Section 7-Complex SMARTS patterns (CSP), 168 FPs. The twenty most important PubChem FPs, found by the best RF model, combined two HEC FPs, two ESSSR FPs, two SAP FPs, four SANN FPs, one DANh FP, six SSP FPs, and three CSP FPs. In Table 6 we describe the twenty most important PubChem FPs for modeling the pIC 50 against HCT116.               In the set of the twenty most important descriptors, the descriptors codifying the presence of oxygen-containing groups are very relevant; eleven descriptors out of twenty most important In the set of the twenty most important descriptors, the descriptors codifying the presence of oxygen-containing groups are very relevant; eleven descriptors out of twenty most important descriptors. The alcohol functional group appears to be very relevant for modeling the activity against HCT116 and is codified by the descriptors CSP_819 (the 2nd), SANN_339 (the 11th), SSP_631 (the 20th) and may be also codified by the descriptor SANN_346 (the 15th). The others oxygen containing groups were methyl ketones (DANh_432, 8th), hydroxylamines (SSP_514, 12th), alkoxy alkylamines (SSP_615.13th), α,β-unsaturated carbonyls (SSP_672, 14th), nitro or N-oxide groups (SAP_301, 18th) and may be also codified aldehydes (SANN_346, 15th). Moreover, the methyl group appears also to be very important and is codified by several descriptors, namely the dimethylphenyl group (para-substituted, CSP_713, the 1st and ortho-substituted, CSP_755, the 9th), and 2-methylcyclohexanol (CSP_819, 2nd).

Applicability Domain of the pIC 50 Against HCT116 Model
A definition of applicability domain based on the similarity between a molecule of an external data set and all the 5875 molecules in the training set was explored. The molecules of the training set were mapped on a Kohonen self-organizing map (SOM), using in-house developed software based on JATOON Java applets [35], on the basis of the 307 Substructure fingerprints (Sub) according to the ten structural clusters, Table 1. No information about HCT116 activity was used. A trend for clustering according to structural cluster features of the compounds was observed, Figure 4. descriptors. The alcohol functional group appears to be very relevant for modeling the activity against HCT116 and is codified by the descriptors CSP_819 (the 2nd), SANN_339 (the 11th), SSP_631 (the 20th) and may be also codified by the descriptor SANN_346 (the 15th). The others oxygen containing groups were methyl ketones (DANh_432, 8th), hydroxylamines (SSP_514, 12th), alkoxy alkylamines (SSP_615.13th), α,β-unsaturated carbonyls (SSP_672, 14th), nitro or N-oxide groups (SAP_301, 18th) and may be also codified aldehydes (SANN_346, 15th). Moreover, the methyl group appears also to be very important and is codified by several descriptors, namely the dimethylphenyl group (para-substituted, CSP_713, the 1st and ortho-substituted, CSP_755, the 9th), and 2methylcyclohexanol (CSP_819, 2nd).

Applicability Domain of the pIC50 Against HCT116 Model
A definition of applicability domain based on the similarity between a molecule of an external data set and all the 5875 molecules in the training set was explored. The molecules of the training set were mapped on a Kohonen self-organizing map (SOM), using in-house developed software based on JATOON Java applets [35], on the basis of the 307 Substructure fingerprints (Sub) according to the ten structural clusters, Table 1. No information about HCT116 activity was used. A trend for clustering according to structural cluster features of the compounds was observed, Figure 4.  Then, the SOM response patterns were used as a metric of similarity after normalization, where d(x,ni) is the Euclidian distance between the molecular descriptor vector x and ni (represents the centroid vector of the ith SOM neuron). A threshold based on the average of SOM distance (ASD) between each molecule of the test set and all the molecules of the training set were set in accordance with the mapping of the ten structural clusters on SOM and MAE obtained in the best RF model in order to no misclassification of the structural cluster was obtained. The applicability domain of the model is defined as containing all molecules of the training set that were mapped as belonging to one of the ten clusters on SOM with an ASD lower than 0.421. Therefore, using this threshold for the test set, 873 molecules belonging to the applicability domain of the HCT116 model were obtained, with R 2 = 0.758, RMSE = 0.681 and MAE = 0.462. However, for the molecules of the test set outside the defined applicability domain (i.e., 591 molecules) worse predictions were obtained, with R 2 = 0.669, RMSE = 0.702 and MAE = 0.489.
Finally, the best RF model and its applicability domain were validated with a final prediction set consisting of 151 molecules not used for any task before, which were recently reported in the literature [50][51][52]. Only 50 molecules of this data set were in the applicability domain of the HCT116 Then, the SOM response patterns were used as a metric of similarity after normalization, where d(x,n i ) is the Euclidian distance between the molecular descriptor vector x and n i (represents the centroid vector of the ith SOM neuron). A threshold based on the average of SOM distance (ASD) between each molecule of the test set and all the molecules of the training set were set in accordance with the mapping of the ten structural clusters on SOM and MAE obtained in the best RF model in order to no misclassification of the structural cluster was obtained. The applicability domain of the model is defined as containing all molecules of the training set that were mapped as belonging to one of the ten clusters on SOM with an ASD lower than 0.421. Therefore, using this threshold for the test set, 873 molecules belonging to the applicability domain of the HCT116 model were obtained, with R 2 = 0.758, RMSE = 0.681 and MAE = 0.462. However, for the molecules of the test set outside the defined applicability domain (i.e., 591 molecules) worse predictions were obtained, with R 2 = 0.669, RMSE = 0.702 and MAE = 0.489.
Finally, the best RF model and its applicability domain were validated with a final prediction set consisting of 151 molecules not used for any task before, which were recently reported in the literature [50][51][52]. Only 50 molecules of this data set were in the applicability domain of the HCT116 model and were predicted with acceptable accuracy of R 2 = 0.544, RMSE = 1.024 and MAE = 0.879. The SMILES strings of this data set, the corresponding experimental and predicted activities are also available as Supplementary Material.

Exploration of NMR Descriptors for QSAR Approach B
The NMR descriptors were generated using the following ranges: (1) 1.5 (133 descriptors), 1.0 (200 descriptors), and 0.5 ppm (400 descriptors) for 13 C; and (2) 0.2 (61 descriptors), 0.1 (120 descriptors) and 0.05 (240 descriptors) ppm for 1 H. Exploratory QSAR experiments employed three NMR descriptors sets (with 1 H-NMR descriptors, 13 C-NMR descriptors and combining 1 Hand 13 C-NMR descriptors) followed by RF prediction of two classes of anticancer activity (i.e., moderate-active-to-active with IC 50 < 156 µg/mL and inactive with IC 50 ≥ 156 µg/mL) within a OOB estimation procedure (Table 7). In Table 7, only three of the best models of the nine models, which were trained combining 1 H-and 13 C-NMR descriptors, are represented. The best model was achieved using 0.1 and 0.5 ppm ranges for 1 H and 13 C, respectively, in total 520 NMR descriptors (bold highlighted in Table 7). For the HCT116 NMR model, the data are imbalanced as concerns the moderate-active-to-active and inactive classes (34, 76 and 9, 27 samples for the moderate-active-to-active and inactive classes of the training and test sets, respectively), and therefore in order to alleviate this problem the class weights were adjusted to 50:50, Table 8. This procedure achieved an improvement of the predictive power of the model for training and test sets.
For the best model, the results were analyzed with respect to the most active samples, i.e., with IC 50 values lower than 5 µg/mL. In the training set, there are two samples (one crude extract, PTM-304 and one fraction, PTM-128_F8) that are predicted to be inactive i.e., FNs, with probability of being moderate-active-to-active lower than 0.25, and five samples that are correctly predicted i.e., TPs, with probability of being moderate-active-to-active higher than 0.80 (one crude extract, PTM-46 and four fractions, PTM-128_F9, PTM-29_F5, PTM-81_F2_F3, and PTM-420_F5). In relation to the test set, there are two samples with IC 50 values lower than 5 µg/mL (one crude extract, PTM-99 and one fraction, PTM-29_F2), both are incorrectly predicted, i.e., FNs, with values of probability of being moderate-active-to-active very low (0.07 and 0.22, respectively). For pure compounds, none were in the training set because of the low amount and the fact that they are all inactive. When the five pure compounds present in the test set were predicted, the model predicted them as being moderate-active-to-active when all are inactive, with the range of probability to be moderate-active-to-active ranging from 0.57 to 0.73. This may be due to the lack of representation of these compounds in the training set that is only constituted by crude extracts and fractions. Finally, the best RF model was validated with a final prediction set consisting of five pure compounds not used for any task before, which were recently isolated and purified in our group. These five marine natural products appear to be of the same structural family of macrocyclic compounds, however their chemical structure has not yet been fully elucidated, and therefore, are excellent candidates for this QSAR approach B model. In Table 9, we show the predictions obtained using the best model for the five pure compounds that had not previously been tested against the HCT116 cell line. For the test set, no misclassifications between moderate-active-to-active and inactive classes were obtained with probability of being moderate-active-to-active higher than 0.77 and probability of being moderate-active-to-active lower than 0.39, respectively. In this way, and with a high degree of confidence, we can only affirm that the compound PTM-99_F2_F27 belongs to the inactive class and only this one should be validated experimentally. Table 9. Prediction of activity classes against HCT116 of the five pure compounds with the best model.

Code
Activity Class Probability of Being Moderate-Active-to-Active However, the five compounds were tested experimentally against the HCT116 cell line, and for the standard range considered in the biological assays it was difficult to calculate an IC 50 for any of the compounds. Since the model used had a high cut-off, assays with high concentrations were carried out at the concentration of 125 µg/mL and the following absorbance results were found: 0.300, 0.324, 0.056, 0.289 and 0.076 for PTM-99_F2_F27, PTM-99_F2_F31, PTM-420_F4_F15, PTM-420_F5_F42 and PTM-420_F5_F43, respectively. Absorbance for the negative control, using the DMSO solvent, and positive control, using the reference compound 5-FU, of 0.322 and 0.052, respectively, were obtained.

Analysis of NMR Descriptors Identified as Relevant for Modeling HCT116 Activity in the Best RF Model
The fifty most relevant descriptors, found by the RF algorithm using the MeanDecreaseAccuracy parameter (Mean Decrease in Accuracy) [53] were analyzed and represented in Table 10.                  Interestingly, there are nine descriptors that codify 1 H-NMR data out of the ten most important NMR descriptors. From those nine 1 H-NMR descriptors, six descriptors codify saturated alkyl groups (14, 2, 3, 4, 5 and 6), where five out of the six descriptors are correlated to the methyl group. The other three 1 H-NMR descriptors codify the bonding to an electronegative atom such as N, O or halogen (44 and 45) and vinyl groups (52). The only 13 C-NMR descriptor in the ten most important NMR descriptors discriminated the alcohol, ether or alkyne functional groups (271). Moreover, we also verified that the importance by activity classes (moderate-active-to-active and inactive classes) for each of the ten most important descriptors is more or less similar, which seems to indicate that although they are the most relevant descriptors, they do not allow the discrimination among the classes of activity. On the other hand, the analysis of the fifty most important descriptors permitted the identification of seven descriptors (H: 32, 51, 73, 13; C: 170, 352, 280) that allow the discrimination among the classes of activity. The descriptors that give the moderate-active-to-active class a higher weight are C (352) and H (13), which encode aromatic, olefinic or nitrile carbon atoms and saturated alkyl methylene proton atoms, respectively.

Conclusions
The results suggest that the chemoinformatics QSAR approach relying on a ligand-based methodology either based on the molecular structures or the NMR spectra, corroborated with an experimental approach and could be used to predict new compounds inhibitors against the human colon carcinoma HCT116 cell line. To our knowledge, the QSAR regression model developed here, Approach A, is the largest study ever performed with regard both to the number of compounds involved and to the number of structural families involved in the modeling of inhibitory activity against HCT116 [10][11][12][13][14][15][16][17][18][19][20][21][22]. The performance achieved by the NMR QSAR classification model, Approach B, allowed us to conclude that it was an excellent effort and a useful tool for dereplication to be developed if this study is extended to a high number of samples containing crude extracts, fractions and mainly pure compounds. This will be an interesting approach in future work. The two approaches developed (A, through molecular structures, and B, through NMR spectra) allow the development of a complementary strategy to predict new anticancer compounds against the human colon carcinoma HCT116 cell line. Approach B allows the prioritization of the isolation, purification and structural elucidation of crude extracts, fractions and pure compounds. Pure compounds that are elucidated may be subjected to model A and the compounds predicted to be most active against HCT116 line may be evaluated experimentally.
Supplementary Materials: The following are available online at http://www.mdpi.com/2218-273X/8/3/56/s1, Tables S1-S3: the SMILES strings of the training, test, and test 2 sets, and the corresponding experimental and predicted activities against HCT116 for Approach A, respectively; Tables S4-S6: the code and type and the actinomycete genus of the samples comprising the training, test, and test 2 sets and the corresponding experimental and predicted activity classes against HCT116 for Approach B, respectively.