Combined Naïve Bayesian, Chemical Fingerprints and Molecular Docking Classifiers to Model and Predict Androgen Receptor Binding Data for Environmentally- and Health-Sensitive Substances

Many chemicals that enter the environment, food chain, and the human body can disrupt androgen-dependent pathways and mimic hormones and therefore, may be responsible for multiple diseases from reproductive to tumor. Thus, modeling and predicting androgen receptor activity is an important area of research. The aim of the current study was to find a method or combination of methods to predict compounds that can bind to and/or disrupt the androgen receptor, and thereby guide decision making and further analysis. A stepwise procedure proceeded from analysis of protein structures from human, chimp, and rat, followed by docking and subsequent ligand, and statistics based techniques that improved classification gradually. The best methods used multivariate logistic regression of combinations of chimpanzee protein structural docking scores, extended connectivity fingerprints, and naïve Bayesians of known binders and non-binders. Combination or consensus methods included data from a variety of procedures to improve the final model accuracy.


Introduction
Concern has been rising due to the fact that many environmental factors can modulate the androgen receptor (AR) pathway: agricultural and industrial chemicals, pharmacological drugs and chemotherapeutics, aging, hyperthermia, and chronic infection, alcohol, tobacco, and other drugs [1][2][3][4]. These chemical agents enter the waterways, food chain, and affect other environments. In vivo rodent models are one way to try to elucidate the exact roles of AR in reproduction and the molecular mechanism of AR modulation in reproductive health [1][2][3][4]. Modulation of the AR has multiple biological effects on species in the environment, on health, and diseases. These include important roles in the development and maintenance of reproductive [5], musculoskeletal [6,7], cardiovascular [8][9][10], immune [11,12], nervous [13][14][15], and hematopoietic [16,17] systems. The AR has also been shown to be associated with the development of prostate [18], breast [19,20], bladder [21], liver [22], kidney [23], and lung [24] tumors [25]. Chemicals that mimic hormones can cause abnormalities and undesirable effects in the hormonal system, which in turn can lead to many of the aforementioned diseases. However, the number of synthesized and commercially produced chemicals is constantly increasing, resulting in more of them reaching the environment, the food chain, and eventually the human body. The compounds of most concern are those that may disrupt hormone-receptor binding in the body, or prevent other interactions that are responsible for metabolism, transport, or synthesis of substances, such as endocrine disrupting chemicals (e.g., disrupting the hormonal system) [5,26]. For example, AR binding compounds, or those that interfere with androgen-dependent path-

Data Sets and Molecular Structures
The dataset of active, i.e., binding compounds (n = 205, composed of both agonists and antagonists), and inactive (i.e., non-binding, n = 1480) compounds was obtained from the CoMPARA project website [40,41], and named training set (n = 1685, Table S1). A validation data set (n = 20, Kleinstreuer et al. [41]) with AR reference compounds included was used, that contained indication of their binding status. An evaluation data set (n = 3882, Table S2) was also provided by the CoMPARA project organizers [42] and was processed and scored with the best model trained with the training set. The status for each binding compound was reported as "Binding" in the SDF file (ToxCast_AR_Binding-2016-11-17.sdf). We considered the field "hitcall" in the SDF file that divided the compounds into "Active" and "Inactive" in order to test the procedure. This gave n active = 453, n inactive = 3429. All chemical structures were used as the QSAR ready SMILES provided within the CoMPARA project; 'QSAR ready' means that salts were converted into neutral form, counterions were removed, tautomers were normalized, etc. (detailed description about datasets and their assembly are provided in reference [42]). Minimized 3D structures of the ligands were also prepared by the CoMPARA organizers.

Molecular Docking
The data provided by the CoMPARA organizers contained data including in vitro compound assays on chimp, human, and rat androgen receptors [41]. In the present work we were more interested in grouping the actives as binders, irrespective if they were either agonist or antagonist, as both classes would presumably bind to the AR. The inactive compounds (organizer definition 'inactive' was Activity concentration >800 µM) were taken to be non-binders. Docking was performed using Glide XP v. 2017, as contained in the Schrodinger suite [44], with conditions as described previously, [34] with the main difference of docking being performed in the orthosteric site of AR using 15 Å inner box and 40 Å outer search boxes.

Protein Structures
The protein structure for the androgen receptor for chimp (Pan troglodytes), human (Homo sapiens), and rat (Rattus norwegicus) species were downloaded from the PDB [45], with structure codes (resolution) 1t7r (1.4 Å), 3v49 (1.7 Å), and 3g0w (1.95 Å), respectively. These structures were selected based on the availability of a protein-ligand complex, the highest possible X-ray crystal structure resolution, and the completeness of amino acid residue sequence. The best crystal structure among these was selected as the one that gave the best separation of known binders and non-binders. Structure 1t7r contains the AR in complex with 5-alpha-dihydrotestosterone at a resolution of 1.4 Ångströms.

Characterization and Comparison of Ligands
Chemical fingerprints were generated using Extended Chemical Fingerprints (ECFP), a circular fingerprint as encoded by Instant JChem [46]. Distances between chemical fingerprints were calculated by Tanimoto (a.k.a. Jaccard, T) coefficient for the case of a binary fingerprint (bit string), according to Equation (1), where N A and N B are the number of bits set in the bit strings of molecules A and B, respectively, and N A&B is the number of bits that are set in both. T The dissimilarity or distance between molecules is calculated according to Equation (2), where T(A, B) is the Tanimoto coefficient for molecules A and B.

Naïve Bayesians
Bayesian classifiers were calculated according to Equation (3), where µ is the mean, σ is the standard variation, and x is the independent variable, in this case the docking scores [47].
The probabilities are calculated for both binders and non-binders and their ratio for the values calculated for a new compound determines their classification into either group, i.e., if a probability for a compound was higher for the binding group than for the non-binding group, the compound was classified as a binder and vice versa.

Multivariate Logistic Regression
Multivariate logistical regression [48] was used according to Equation (4), where P cmpd is the probability of a compound of belonging to class 1 (classified as binding), or class 0 The linear form of Equation (4) (logit(Y)) can have infinitely large or small values for the dependent variable, so instead of ordinary least squares, maximum likelihood techniques are used to maximize the value of the log likelihood (LL) function, which indicates how likely it is to obtain the observed values of Y, given the values of the independent variables and the parameters β, α 1 , . . . , α n .

Performance Analysis
The confusion matrix is usually defined as the collection of four fields: true positives For more detailed analysis, we added several other measures: the probability that a chemical predicted as a binder is actually a binder (PPV, Equation (6)), the probability that a chemical predicted as a nonbinder is actually a nonbinder (NPV, Equation (7)), positive (+LR, Equation (8)) and negative (−LR, Equation (9)) likelihood ratios [49], and modified correct classification rate, giving higher scores for models with optimal balance between SE and SP (BCR, Equation (10)) [50].

Availability of Best Model
The numerical raw data and best classification model is provided in the QSAR Data Bank format [51] and uploaded to the QsarDB repository [52,53]. A digital object identifier (DOI) is assigned for the model and data [54].

Androgen Receptor from Chimp as Model Protein
The self-docking of the known binder dihydrotestosterone into structure 1t7r gave a strong docking score of −11.91 kcal/mol. The root-mean-square deviation (RMSD) of atom positions between the docked pose and the initial position of the co-crystallized ligand was 0.32 Å, indicating a good fit and small deviation from the crystal structure. Another known binder, testosterone, also scored a strong docking score of −10.32 kcal/mol. To widen the window for the predictions, we decided heuristically to use a (relatively strong) threshold value of −7 kcal/mol as this would correspond approximately to ligand submicromolar K d values. After docking the known binders and known non-binders to the three different protein structures from the different species, the results showed best agreement with experimental values (binding status) using the chimp protein structure, rather than using human or rat protein structures in molecular docking, or a combination of the three (Figure 1). Receiver-operator curves and area under the curve (AUC) values provided further validation, giving AUC values of 0.832 for chimp, 0.797 for human, and 0.744 for rat AR.

Androgen Receptor from Chimp as Model Protein
The self-docking of the known binder dihydrotestosterone into structure 1t7r gave a strong docking score of −11.91 kcal/mol. The root-mean-square deviation (RMSD) of atom positions between the docked pose and the initial position of the co-crystallized ligand was 0.32 Å, indicating a good fit and small deviation from the crystal structure. Another known binder, testosterone, also scored a strong docking score of −10.32 kcal/mol. To widen the window for the predictions, we decided heuristically to use a (relatively strong) threshold value of −7 kcal/mol as this would correspond approximately to ligand submicromolar Kd values. After docking the known binders and known non-binders to the three different protein structures from the different species, the results showed best agreement with experimental values (binding status) using the chimp protein structure, rather than using human or rat protein structures in molecular docking, or a combination of the three (Figure 1). Receiver-operator curves and area under the curve (AUC) values provided further validation, giving AUC values of 0.832 for chimp, 0.797 for human, and 0.744 for rat AR. The chimp docking results distribution was analyzed for both binding and non-binding compounds in the training set ( Figure 2), showing that around half of the compounds for both binding and non-binding compounds had a docking score of zero, while the other half was indeed separated, with binding compounds having values distributed over deeper docking scores. This good resolution for compounds with a non-zero docking score prompted to search for a further way to separate the compounds that did not resolve well, i.e., those binding and non-binding compounds that had a docking score of zero. The chimp docking results distribution was analyzed for both binding and nonbinding compounds in the training set ( Figure 2), showing that around half of the compounds for both binding and non-binding compounds had a docking score of zero, while the other half was indeed separated, with binding compounds having values distributed over deeper docking scores. This good resolution for compounds with a non-zero docking score prompted to search for a further way to separate the compounds that did not resolve well, i.e., those binding and non-binding compounds that had a docking score of zero.

Consensus Methods for Best Performance
Subsequently, various methods and combinations thereof were applied to separate the binding and non-binding compounds ( Figure 3). For this, several different metrics were checked for each procedure, with the Acc. and MCC as the primary metrics to guide the improvement of the procedures. The range of MCC is from −1, total disagreement between prediction and observation, to +1, a perfect prediction. Different procedures applied to the training set and their sequential relationship are schematically viewed in Figure 3

Consensus Methods for Best Performance
Subsequently, various methods and combinations thereof were applied to sepa binding and non-binding compounds (Figure 3). For this, several different metr checked for each procedure, with the Acc. and MCC as the primary metrics to g improvement of the procedures. The range of MCC is from −1, total disagreem tween prediction and observation, to +1, a perfect prediction. Different proced plied to the training set and their sequential relationship are schematically vi Figure 3, where horizontal levels from top to bottom show the complexity of c tions and path towards the improvement of the Acc. and MCC values. A first attempt at separation used a threshold on the docking scores in order to classify compounds into binding and non-binding. Different values were tested, with the best being a threshold score value of −7 kcal/mol. Compounds with scores above this level were classified as non-binders, and scores below this level were classified as binders (Figure 3: Procedure 1). Despite the fact that the Acc. value is relatively good, the MCC A first attempt at separation used a threshold on the docking scores in order to classify compounds into binding and non-binding. Different values were tested, with the best being a threshold score value of −7 kcal/mol. Compounds with scores above this level were classified as non-binders, and scores below this level were classified as binders (Figure 3: Procedure 1). Despite the fact that the Acc. value is relatively good, the MCC value is the lowest of all the tested Procedures (Table 1).   Naïve Bayesian classifiers were constructed with the docking scores of both active (binding) and inactive (non-binding) compounds, resulting in mean values of µ = −8.91 and −5.97 kcal/mol, and standard deviations (SD) = 1.94 and 2.01, respectively. The results of using only this classifier on its own are shown in Table 1: Procedure 2.
Procedure 3 was similar to Procedure 2, using a univariate logistic regression on the naïve Bayesian classifier probabilities. Procedure 4 was a modification of the Bayesian classifier, in which the docking score was used as in the first procedure, reducing the number of false positives by assuming zeroes were non-binding compounds. Both procedures did not improve the MCC values, while giving reduced Acc. values.
The following, Procedure 5, was ligand-based and involved calculating the distance between ECFPs for a compound and the known binders and non-binders. The smaller the distance between the ECFPs and each group, either the average distance towards known binders or average distance to known non-binders, would indicate the similarity of a compound to either group. This procedure resulted in an improved separation and therefore improved MCC at 0.3, while Acc. decreased.
Docking scores and ECFPs were used together for Procedure 6. Here, the threshold of −7 kcal/mol was used first on the docking scores, and the compounds that registered 0 kcal/mol were then separated using ECFPs. This combination had again the effect of improving the separation and thus, the Acc. and MCC values. Procedure 7 was similar, but was used in the inverted order, i.e., if the ECFP predicted a non-binder, then the docking score threshold was used. The latter procedure was not as effective as the former, giving a lower MCC and the lowest Acc. among Procedures.
For Procedure 8, a logistic regression was used on the docking scores and difference between ECFP distances to binders and non-binders. This improved the separation and provided the increase in MCC to 0.492 and the second best Acc. among the procedures. Procedure 9, used a sequence such as in Procedure 6, first docking scores, then ECFP, and then, if a zero was still obtained, used the logistic regression in Procedure 8. Neither this, nor Procedure 10 were better than Procedure 8. Procedure 10 was analogous to procedure 7, adding the logistic regression from 8 to the sequence.
Procedure 11 used a new logistic regression on the docking scores, ECFP differences, and ratio of Bayesian predicted classifications. This resulted in better separation than 10, an MCC to 0.4829 and third best Acc. Procedure 12 was similar to 11, but employed the Bayesian averages instead of the ratio. However, 11 and 12 performed similarly.
Out of all the combinations, a multivariate logistical model emerged as the best performing (Procedures 8 and 11). The best equation obtained was Procedure 13, represented as Equation (11)   This final model includes five descriptors: the docking score for the compound with chimp protein (ChimpDockScore), the average of the distances to the known active (toxic) chemicals (avgD Act ), the average of the distances to the known inactive (non-toxic) chemicals (avgD Inact ), the Bayesian probability value for the compound according to the distribution of known actives (toxic) compounds towards the chimp protein (P Act_dockChimp ), and the Bayesian probability value for the compound according to the distribution of known inactives (non-toxic) compounds towards the chimp protein (P Inact ).
For Equation (11) In more detailed analysis, NPV, PPV, +LR, −LR, and BCR statistics showed a slightly different view of the models (Table 2). In computational toxicology, one wishes to predict potentially harmful chemicals, minimizing FNs. This implies a classification model is usable even if it gives a high rate of FPs (low PPVs) but a low number of FNs (high NPVs) [56]. With this criterion, Procedure 8 (logistic regression on docking scores and ECFPs) gives the best model, with one of the lowest PPVs of 19.92, and the highest NPV of 98.07.
Positive (+LR) and negative (−LR) likelihoods ratios and balanced classification rate (BCR) are considered to be independent of the data distribution within the training set (noticeably unbalanced) [56]. By these measures, Procedure 8 again has the best, i.e., lowest, −LR value of 0.1420. On the other hand (admittedly less important in computational toxicology, since the objective is to minimize FNs), the best (highest) +LR value of 49.  [56]. Table 2. PPV (probability of predicted binder), NPV (probability of predicted nonbinder), positive (+LR) and negative (−LR) likelihood ratios, and BCR (modified correct classification rate) for 13 different procedures for the training set.

Validation of the Best Model
Using the best procedure, the AR pathway in vitro reference compounds presented in Kleinstreuer et al. [41] were used to further validate the model after recording a high docking score for the co-crystallized ligand. The results of the predictions on compounds that had verified experimental data as agonist and antagonist (i.e., without NA values), or that had only one strong or moderate value with the other being NA, are presented in Table 3. The results in Table 3 on the reference compounds show a good balance of 13 successful predictions versus seven mispredictions. If the moderate/weak compounds bisphenol A, linuron, flutamide, and prochloraz are classified as weak instead of moderate, then the balance of correct to incorrect predictions becomes 15 versus 5, respectively (75%). It is interesting to note that the most successful procedure combined structure-based values: docking to the chimp protein; ligand-based values: distances between extended connectivity fingerprints; and statistical comparisons: naïve Bayesian classifiers. This may reflect the need to use a variety of methods for a complex dataset as the one provided in the CoMPARA project.
An evaluation set provided by the organizers was also used. Results showed reasonably good Acc. These MCC values are still higher than random that would correspond to MCC = 0, lower than those obtained for the training set, yet appropriate for an evaluation set that was nearly twice the size of the training set and also highly unbalanced. They are comparable to the MCC obtained using support vector machines (SVM) in a different study by other groups on the same evaluation set [57]. Comparison of MCC values also clearly shows that MCC is not sufficient as the only measure for classification measurement on imbalanced datasets [58]. Procedure 10 has a BCR of 0.68, which is comparable to that obtained on the same evaluation set by a different group using SVM [57]. The Acc. values for Procedures 13 and 5 are comparable to those of Manganelli et al. using SVM, artificial neural networks (ANN), decision trees (DT), SARpy1 (fragment SAR [59]) SARpy2 [59], consensus models [57]. In parallel with this study, we developed a model using a random forest (RF) algorithm for balanced data sets, which did not use molecular docking data and were simpler in design than the current best model and gave the evaluation data set an Acc. of 0.78 (binders model only) [60]. A later development of the present study, the analysis of a balanced data set with deep neural networks (DNN), gave improved Acc. of 0.91 (MCC = 0.4685) [61].

Discussion
For the protein structural information part of this study, the best crystal structure for these purposes was employed using the available information at the time (see below). This is distinct from the approach of Trisciuzzi et al. [56] who used GOLD software to dock ligand decoys on nine protein structures, separating decoys from known binders and studying their applicability domain, choosing structure 2pnu (resolution of 1.65 Ångströms) instead. The assessment in the present study was done rather on the basis of the higher resolution of the crystal structure, amino acid sequence, and complex availability, and we chose structure 1t7r for the chimp AR, instead. In the present work, this was the crystal structure that best separated binders from non-binders among the CoMPARA data. Comparing results, both approaches are comparable in their active recall rate. Different measures such as MCC, BCR, +LR, and -LR give positive results for Procedures 13, 8, and 10. The consensus model for Procedure 13 is also available for the use at the QsarDB repository (http://dx.doi.org/10.15152/QDB.235 accessed on 21 June 2021.).
Structure-based design is directly impacted by the X-ray crystal protein structures and conformations of these that are employed. In the present work, it could be that the chimp AR structure best captures the conformation for binding actives/inactives, over those of rat and human.
The data set provided was challenging for any method, given the fact that the data is strongly unbalanced, containing a far larger number of non-actives than actives. This was true for the training set, n active = 205, n inactive = 1480; as well as for the evaluation set, n active = 453, n inactive = 3429. In addition, the protein structural information was not enough on its own to separate all the compounds. Though it could distinguish between binders and non-binders in the training set for those compounds that produced a docking score, there were a large number of compounds that had a docking score of 0, which required the use of additional techniques such as ligand-based ECFP fingerprints and statistical methods, such as Bayesians and combinations in consensus multivariate logistic regressions.

Conclusions
A variety of methods were systematically elaborated to model the binding of compounds to the androgen receptor for unbalanced data in order to help predict possible androgen pathway-disrupting compounds. The best methods out of the 13 tested included a multivariate logistic regression on values combining structure-based docking scores on the chimp protein, ligand-based Tanimoto dissimilarity distances using extended chemical fingerprints, and statistic comparisons between known binders and non-binders to the androgen receptor; as well as a multivariate logistic regression on docking scores and ECFP fingerprints. The best model includes only five descriptors: the docking score for the compound with chimp protein (ChimpDockScore), average of the distances to the known active (toxic) chemicals (avgD Act ), average of the distances to the known inactive (non-toxic) chemicals (avgD Inact ), the Bayesian probability value for the compound according to the distribution of known actives (toxic) compounds towards the chimp protein (P Act_dockChimp ), and the Bayesian probability value for the compound according to the distribution of known inactive (non-toxic) compounds towards the chimp protein (P Inact ). The model performed satisfactorily on the evaluation test provided in the CoMPARA project using MCC, BCR, and Acc. measures, as well as on an external reference set of compounds used in other studies and has easily interpretable variables and physicochemical reasoning.