Androgen Receptor Binding Prediction with Random Forest, Deep Neural Networks, and Graph Convolutional Neural Net- works

Substances that can modify the androgen receptor pathway in humans and animals are entering the environment and food chain with the proven ability to disrupt hormonal systems and leading to toxicity and adverse effects on reproduction, brain development, and prostate cancer, among others. State-of-the-art databases with experimental data of human, chimp, and rat effects by chemicals have been used to build machine learning classifiers and regressors and evaluate these on independent sets. Different featurizations, algorithms, and protein structures lead to different results, with deep neural networks on user-defined physicochemically-relevant features developed for this work outperform graph convolutional, random forest, and large featurizations. The results can help provide clues on risk of substances and better experimental design for toxicity assays. Source code and data are available at https://github.com/AlfonsoTGarcia-Sosa/ML


Introduction
Concerns are rising over endocrine disruptors entering the environment and food chain [1][2][3][4]. The Androgen Receptor (AR) is a protein involved in reproduction, brain development, prostate cancer, androgen insensitivity syndromes, spinal and bulbar muscular atrophy, acne, and alopecia [5]. Androgen receptor pathway modulators are compounds that can have an effect on tumors and reproductive systems [1][2][3][4]. The CoMPARA challenge was a collaborative modeling effort to predict possible AR modulators based on a wide collection of state-of-the-art experimental data [6]. Different modeling techniques have been attempted, including molecular docking [7], support-vector machines (SVMs), combined structure-based, ligand-based ECFP distances to known compounds, and Naïve Bayesians [8], among others [6].
Toxicity modeling of compounds is important in several ways: compounds that are used in pharmaceutical and industrial applications need to be assessed for possible adverse effects on humans and other organisms, as well as being an important development barrier for new drugs and useful compounds [1][2][3][4]. Difficulties include the lack of experimental tests including chronic and different exposure effects, as well as those of metabolites of compounds [1][2][3][4].
ML and AI are transforming many fields, including the computational chemistry and medicinal chemistry fields [9,10]. Particular advantages may be realized by ML methods in classification techniques for drug compound analysis and design [10,11]. A common criticism of ML methods is the potential to include a high number of variables that can have little insight into their physicochemical meaning [10]. Some ML and AI models are being developed with the aim of being 'explainable' [12] and afford interpretability to chemical groups responsible for positive or negative contributions to a prediction. The present work shows that different modeling techniques can have their advantages and disadvantages for modeling AR modulating compounds. Deep neural networks (DNNs) and graph convolutional neural networks (CNNs) have been used in other modeling studies, usually using featurization included in widely-available packages [13]. Here, an effort was made to build Random Forest (RF) and DNNs with a given set of features that are chemically important based on calculated protein-ligand binding to several targets, chemical fingerprint distances, and other results from statistical techniques. The results show that these user-provided features and specific deep learning models provided the best results as determined by metrics and interpretability and chemical meaning of the descriptors/features. In addition, the same features in the ML methods performed better than in a multivariate logistic model. ML techniques of this type may improve AR and toxicity description and prediction, improving assessment and design of compounds.

Results
The provided training dataset in the CoMPARA challenge was highly unbalanced with a large number of non-binders (N = 205) compared to binders (N = 1,468). Bias in datasets can affect strongly the results of ML algorithms and so addressing these issues is recommended [14]. The training set was thus balanced to provide an equal amount of binders and non-binders. Another effect was also apparent after docking calculations for the chimp protein (Figure 1), where distributions of docking scores for chimp androgen receptor for binders and non-binders show that for those compounds that have a nonzero docking score (N = 1,310), the docking scores are stronger for binders than for non-binders. However, for the smaller amount of those compounds with a docking score of zero (N = 363), the simple docking score on its own cannot separate both distributions. A similar effect is seen for the calculated Bayesian probabilities (Figure 2), where the distribution for binders is higher in probability to belong to the normal distribution of known binders and viceversa for non-binders. Bayesians constructed on the docking scores for both groups showed this separation with means and standard deviations of  = ̶ 8.91 kcal/mol and  = 1.94 for binders, and  = ̶ 5.97 kcal/mol and  = 2.01 for non-binders.    After balancing, the total numbers for both sets of compounds were: Number of compounds in train set: 410, composed of 205 binders and 205 nonbinders. No balancing was made for the evaluation set, where the number of compounds in validation set was 3,882.
Classifier and regression machine learning models were built and evaluated ( Table  1). For the Random Forest (RF) Classifier (I), the best results obtained after eight-fold cross-validation were: Best hyperparameters: (100, 'sqrt'), giving AUC values of train_score: 1.000000, validation_score: 0.7564. The Graph Convolutional Neural Network Classifier (CNN, VII) using hyperparameters: {'learning_rate': 0.00014031305077588868, 'weight_decay_penalty': 2.954578768759171e-06, 'nb_epoch': 40}, gave AUC values of: train_score: 0.827864, validation_score: 0.739908. There is clearly overfitting occurring in RF and CNN methods given the high ROC-AUC values for the training set ( Figure 4), with the best metrics for the evaluation score being around 19 runs. The best results for the training and validation sets based on the ROC-AUC metric was the DNN using the supplied 12 features (Table 1, model II), as well as showing less overfitting. The models for regression did not achieve good R 2 values, which is logical due to the awkwardness of fitting a regression to a binary outcome value. For RF in regression mode: Best hyperparameters: (10, 'log2'), giving R 2 values of train_score: 0.8817, validation_score: -0.0520. For CNN in regression mode, after four-fold cross-validation, with the best hyperparameters: {'learning_rate': 0.000359206871754689, 'weight_decay_penalty': 8.830664294504987e-06, 'nb_epoch': 20}, R 2 values were: train_score: 0.2721: valida-tion_score: -0.1926. The results ( Figure 4) show that increasing the number of estimators increases the degree of overfitting for the balanced training set. Tests on the full (unbalanced) initial training set show less overfitting, as well as using DNN ( Figure 5). DNN were better predictors than RF models, for a variety of metrics ( Figure 5). The results obtained are good for ROC-AUC, accuracy, F1-scores, and precision metrics for training (balanced), validation, as well as full training set (Table 1 and out.txt at https://github.com/AlfonsoTGarcia-Sosa/ML ). The MCC scores obtained are reasonable, considering the lack of balance in the datasets, as well as the lack of distinction in features between binders and non-binders in the evaluation set. This lack of feature distinction can be seen on Figures 6 and 7, where t-maps, histograms and density maps for calculated octanol/water partition (clogP), topological polar surface area (TPSA, Å 2 ), number of heterocycles, molecular weight (MW, g/mol), number of rotational bonds (nRotB), number of hydrogen bond donors (nHBDon), number of hydrogen bond acceptors (nHBAcc), and AlogP show a highly overlapping distribution for binders as well as nonbinders in the evaluation set, highlighting the difficulty of classification for this dataset. Using 20 gold-standard reference androgen receptor probe compounds as used by Kleinstreuer et al. [15] shows that there was a good result for predictions of 16/20, i.e., 80% were predicted correct for being a binder to AR (very weak binders were considered as non-binders). With respect to well-known compounds that are frequently misclassified [16], the results provided here (Table 3) show four out of 11 compounds correctly predicted compared to three out of 11 reported elsewhere, the difference being the correct prediction of finasteride [16]. Chemical structures in Tables 2 and 3 show several steroid core structures that may be difficult for algorithms to distinguish between actives and inactives, given the strong chemical similarity between them. The loss function diagram for the best DNN (Figure 8, II) shows a relatively stable function with loss for the training set starting around 0.5 and fluctuating moderately up to 0.6 but then decreasing steadily to around 0.27, while the loss function for the validation set fluctuates moderately starting from 0.3 to 0.26. The approach of using physicochemically and biochemically-relevant user-defined features (12 features, II) is seen in the better metrics performed by the DNN trained and validated on these rather than DNN trained on cddd descriptors (around 500 features, VI), as well as the closely performing RF on the user-defined features (I), and also being better than the CNN using vector featurization of the molecular graph (VII). In addition, the use of ML is warranted in this case, since the same features used in a multivariate logistic regression fashion produced metrics that were not as good, evaluation set MCC = 0.468 and training MCC = 0.868 for the present work compared to evaluation set MCC = 0.2036 and training set MCC = 0.5364 for the multivariate logistic regression [8].

Discussion
Toxicity classification problems can benefit from using DL and specific features that have rationalization on the biochemically and chemically relevant features of the compounds. In this case, predicted binding to chimp, rat, and human androgen receptor structures, in addition to average Tanimoto distances to known binders and nonbinders, as well as Naïve Bayesians as user-provided features to a DNN provided the best results. Unbalanced datasets can be transformed to balanced sets by dropping cases and perform better in training and evaluation. In particular, it is hard to produce toxicity data, especially chronic data, for chemicals with animals and humans, and this lack of data can translate into unbalanced datasets and difficulties for classification and regression techniques. Bias in datasets can be treated with different approaches, such as undersampling [17], as well as distribution following [14]. However, it is clear that more data is beneficial for methods and interpretations of androgen receptor pathway modulators and their toxicity potential. In addition, the most relevant biological assay, be that chimp, rat or human, or their use in consensus, may provide the best experimental setup for classification data.

Training Set.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 16 February 2021 doi:10.20944/preprints202102.0318.v1 The training set of compounds was provided during the CoMPARA challenge for predicting androgen receptor activity for chemicals [6], and included curated data with SMILES strings. This training set was composed of state-of-the-art experimental data from ToxCast [18], Tox21 [19], and DrugBank [20] databases, amounting to 1,673 chemical compounds with 205 positives (binders), and 1,468 negatives (non-binders). Binders were coded as actives ("1"), non-binders were coded as inactives ("0"). The SMILES strings were used as present in the files. Given that the training set was heavily unbalanced, the training set was balanced using pandas tools [21].

Independent Evaluation Set.
The evaluation data set was also provided in the CoMPARA challenge [6] from different databases being completely independent from the training set: EPA's NCCT collected and curated PubChem data (64 sources) [6,22]. After including only binding data, there were 3,882 compounds in the evaluation data set, composed of 446 positives (binders) and 3,437 negatives (non-binders). No balancing was performed for the evaluation set for an unbiased evaluation of the models.

Features.
The structures for the human, chimp, and rat androgen receptor were downloaded from the PDB [23] with codes 3v49, 1t7r, and 3g0w, ('HumDockScore', 'ChimpDockScore', 'RatDockScore') respectively, based on their resolution (1.4, 1.7, and 1.95 Å), completeness, and relevance of the complex. Protein X-ray crystal structures were preprocessed with the Protein Preparation Wizard from Schrödinger [24]. Docking scores were generated with Glide XP [25] using 15 Å inner box and 40 Å outer search boxes centered on the orthosteric site of AR, as different to default settings. The average of the docking scores for the three protein targets was also computed and stored as 'AVG', as computed before [26,27].
Extended connectivity fingerprints (ECFP), circular topology-based representations of compounds, were calculated with ChemAxon [30]. Distances between compound fingerprints were calculated by Tanimoto fingerprints using OpenBabel [31], giving 'avgD_Act' and 'avgD_Inact', respectively, for the calculated distance to the average of known active and inactive compounds.
Naïve Bayesians were constructed using the means and standard deviations of the docking scores of actives to the chimp receptor, and the probability given for each group 'P_Act_dockChimp', and 'P_Inact', respectively, as well as their ratio and Bayesian prediction 'PredBayes' for which P was greater.
For comparison, the cddd group of latent-space encoded ligand-based descriptors [31] were also used as described.

Metrics.
In all cases, the task classification or regressor was the "binding Class" status of the compounds, that is, coded 1 for binders and 0 for nonbinders that represented experimental actives and inactives for androgen receptor. Validation metrics included Area-Under-the Curve (AUC) measurements of the Receiver-Operator Curve of true positive and false positive rates, that range from 0 (complete misclassification) to 1.0 (complete classification), precision, recall, Matthews' correlation coefficient (MCC), F1-score, and accuracy as determined by sklearn [32].
A confusion matrix has four fields: All data and code were run on jupyter notebooks and python, deposited, and made available on github at: https://github.com/AlfonsoTGarcia-Sosa.
Author Contributions: Conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing-original draft preparation, writing-review and editing, visualization, supervision, project administration, funding acquisition, A.T.G.-S. All authors have read and agreed to the published version of the manuscript. Data Availability Statement: All of the data used is provided in text file format, all jupyter notebooks and python code used are also provided and are available at the public repository: https://github.com/AlfonsoTGarcia-Sosa