High-Performance Prediction of Agonists on Human Estrogen Receptor Based on Chemical Structures

Many agonists for the estrogen receptor are known to disrupt endocrine functioning. We have developed a computational model that predicts agonists for the estrogen receptor ligandbinding domain in an assay system. Our model was entered into the Tox21 Data Challenge 2014, a computational toxicology competition organized by the National Center for Advancing Translational Sciences. This competition aims to find high-performance predictive models for various adverse-outcome pathways, including the estrogen receptor. Our predictive model, which is based on the random forest method, delivered the best performance in its competition category. In the current study, the predictive performance of the random forest models was improved by strictly adjusting the hyperparameters to avoid overfitting. The random forest models were optimized from 4,000 descriptors simultaneously applied to 10,000 activity assay results for the estrogen receptor ligand-binding domain, which have been measured and compiled by Tox21. At this time, our model delivers the highest predictive power on estrogen receptor agonists in the world. Furthermore, analysis of the optimized model revealed some important features of the agonists, such as the number of hydroxyl groups in the molecules.


Introduction
Estrogen receptors (ER) belong to the steroid receptor superfamily of ligand-dependent transcription factors [1]. Compounds related to ER activation, such as isoflavones and polycyclic aromatic hydrocarbons, disrupt the endocrine processes in humans and other species, severely affecting reproduction and growth [2,3]. Therefore, screening for ER agonists can counteract environmental contaminants and improve public health. Although ultra-high-throughput screening systems have been developed for several adverse-outcome pathways [4], experimental in vitro detection is limited by the vast number of screening targets. Consequently, a comprehensive assay is precluded by both economics and time. In contrast, predictive methods based on chemical structures can greatly accelerate the estimation and are expected to replace wet experimental systems.
The National Institute of Health (NIH), Environmental Protection Agency, and Food and Drug Administration have collaborated to launch the Tox21 challenge, a large project targeting a variety of toxicity problems, including the environmental effects of ER agonists [5] [6]. The Tox21 project has assayed the activities of the compounds in the Tox21 10K library [4], which contains 10,000 chemicals for toxicity estimation. The adverse-outcome pathways selected for toxicity evaluations are the androgen receptor (AR), aryl hydrocarbon receptor (AhR), estrogen receptor (ER), aromatase and peroxisome proliferator-activated receptor (PPAR), nuclear factor (erythroid-derived-2)-like 2/antioxidant responsive element (ARE), ATP-ase family AAA domain containing 5 (ATAD5), heat shock factor response element (HSE), mitochondrial membrane potential (MMP), and p53 [7].
One aim of the Tox21 project is to construct a predictive system based on computational toxicology. The Tox21 Data Challenge 2014 was organized by NIH's National Center for Advancing Translational Sciences as a "cloud-sourcing" search for high-performance predictive models of adverse-outcome pathways. Using computational toxicology technologies, participants competed in accuracy of predictive models with the biological toxic responses of compounds in the Tox21 10 K compound library, whose activities were known, applying the abovementioned adverse-outcome pathways as the training set [8]. Dr. Yoshihiro Uesawa, a winner of the 2014 challenge and a coauthor of the present study, constructed a predictive model for the ER-ligand binding domain (ER-LBD), which showed the best performance among the models submitted by the registered teams [9].
Uesawa's ER model was based on a machine learning method called random forest (RF). RF is an ensemble learning method that decides the best predictive result by majority rule of the various predictive results gained from many decision trees [10]. Each tree is constructed from bootstrapped data of the training set. This method achieves high predictive performance at low computational cost, even when the dataset is large or prejudiced. It also assesses the importance of the variables used in the construction. In the previous study, we confirmed that rigorous selection from many kinds of RF models significantly improved the performance of RF. However, we also observed an overfitting tendency [9]. In this study, we attempt to improve the predictive performance of the previous RF construction by a novel RF optimization technique.

Conformations and Descriptors
Various descriptors were calculated by optimizing the three-dimensional structures of their chemical constituents, as implemented in the Tox21 Data Challenge 2014 [9]. Briefly, the SD file of each descriptor, which contains the chemical structure and ER-LBD assay results (active or inactive) of the compound, was downloaded from a homepage dedicated to the competition [8]. The threedimensional conformations of the chemical structures in two configurations (a charged form at neutral pH and an uncharged form) were calculated in the Molecular Operating Environment (MOE) 2013.08 (Chemical Computing Group Inc., Quebec, Canada) [11]. Finally, 4071 different molecular descriptors were generated by MOE, MarvinView 5.12.4, and Dragon 6.

Construction of Predictive Models
In the previous study [9], the original training set of 8,733 chemicals and final evaluation set of 599 chemicals were downloaded from the homepage of the Tox21 Data Challenge website [8]. The same data were used in the present study. The original training set was randomly divided into a training set (50%) for constructing the predictive model and a test set (50%) for model validation. The selection processes in the predictive models were constructed from the training set (50%). In our basic method, we applied the bootstrap-forest function in JMP Pro 12.0.1 (SAS Institute Inc., Cary, NC, USA) as the RF algorithm [12]. Finally, the compounds in the final evaluation set were predicted by the model. The evaluation indices were the predictive ability of each model in the evaluation step, and the area under the receiver operating characteristic curve (ROC_AUC Evaluation) (see Figure 1).  activities of the estrogen receptor ligand-binding domain for the compounds in the final evaluation set. 100 ROC_AUC values were plotted for each group. Green lines denote the averages and their 95% confidence intervals.

Number of Descriptors
To ascertain whether the performance of our model would be improved by adding more descriptors, we constructed 100 RF models of the 4071 descriptors calculated by MOE, MarvinView, and Dragon, and compared their predictive abilities with those of RF models constructed from MOE descriptors (738 kinds) alone (Figure 3).

Effects of Hyperparameters
To find the best combination of hyperparameters in the RF construction, we scanned two hyperparameters under the following conditions (10 iterations per condition): Number of Terms (the number of columns considered as splitting candidates at each split (range 1-1000)) and Maximum Splits per Tree (the maximum number of splits for each tree (range 2-400)). For each Number of Terms we constructed 190 models. The ROC_AUC values of all constructed models were then estimated on the compounds in the test set (50%) and evaluation set ( Figure 4). Furthermore, the 190 ROC_AUC Evaluation values obtained for each Number of Terms were compared ( Figure 5). Finally, for the models with Number of Terms = 1000, we generated scatter plots between Maximum Splits per Tree and the ROC_AUCs in the training set (50%) and evaluation set ( Figure 6).

Statistical Treatment
Significant differences in the means were tested by an unpaired Student's t-test and one-way ANOVA, followed by least-significant difference analysis. All analyses, including the ROC_AUC calculations, were performed in JMP-Pro. The significance level was set to P < 0.05.

Effects of Descriptors
Integrated descriptors from both charged and uncharged forms improved the predictive ability of the RF models, relative to descriptors from unilateral forms (Figure 2). Varying the charge conditions increased the diversity of the information in the RF models. Increasing the number of descriptors from 738 (calculated by MOE alone) to 4071 (calculated by three software programs, MOE, Dragon and MarvinView) also improved the predictive ability of the RF models (Figure 3). On the other hand, feature selections based on the known importance of each descriptor during the RFmodeling failed to improve the model performance (data not shown). These observations suggest that a large number of descriptors are advantageous for our models' performance.

Effects of Hyperparameters
The AUC values in the test set (50%) and the final evaluation set were not simply correlated; rather, there was an optimal point at which the AUC of the test set (50%) corresponded to the highest AUC of the evaluation model ( Figure 4). This point was emphasized in our previous paper [9]. In the competition, model selection was optimized using the AUC values in the test set (50%) because the AUCs between the test set (50%) and the training set (50%) showed good linear correlation [9]. Next, we analyzed the true performance of our models on the final evaluation set [9] (see Figure 4).
All further investigations in the current study were performed on the final evaluation set. In a scanning search for the Number of Terms hyperparameter, the best predictive model was obtained at Number of Terms = 1000 ( Figure 5). A scanning search for the predictive capabilities of Maximum Splits per Tree under restricted conditions, with Number of Terms = 1000, was also performed ( Figure  6). RF models with high Maximum Splits per Tree were found to overfit the training set. The optimal predictive ability of the models was obtained for Maximum Splits per Tree = 6 ( Figure 6). We inferred that we could regulate the overfitting by lowering the optimal point of Maximum Splits per Tree; that is, by restricting the tree growth. Figure 7 shows the ROC curves that evaluate the discrimination capacities of the new predictive model and the best model in the Tox21 Data Challenge 2014. The ROC-AUC values for the compounds with ER-LBD activities in the final evaluation test set were 86.6% and 82.7% in the present and previous models, respectively. In addition, the ROC between sensitivity and 1-specificity was better balanced in the present model than in the previous model. Overall, the predictive performance of the present model surpasses that of the previous model.

Most important descriptors
The importance of the descriptors used in constructing the present model was calculated from their split rankings (count of each descriptor usage when partitioning the decision trees in the RF modeling) and G2 values (chi-squared values of the likelihood-ratios) [13]. The descriptor rankings in 100 RF models under the optimized hyperparameter conditions are listed in Table 1. The topranking ER-LBD activator was SpMin 1_Bh(m). The topological shape, nArOH and number of OH groups bound to the aromatic rings might contribute to the activity of this compound in the ER-LBD assay [14].

Conclusions
We constructed a new predictive model with higher discrimination ability for estrogenic compounds than our previous best model, which was submitted to the Tox21 Data Challenge 2014. It means that we had to succeed to deliver the highest predictive power on estrogenic compounds in the world. The best conditions for the RF modeling regulated the overfitting of the test set (50%). This regulation was found to be important in the RF model constructions. Furthermore, the model analyses revealed that in compounds such as SpMin 1_Bh(m), interactions with ER were facilitated by the structural and physicochemical properties of the compounds and the number of phenolic OH groups. This modeling approach will be useful for predicting the toxicity of compounds and producing new drugs and chemicals.