Molecular Image-Based Prediction Models of Nuclear Receptor Agonists and Antagonists Using the DeepSnap-Deep Learning Approach with the Tox21 10K Library

The interaction of nuclear receptors (NRs) with chemical compounds can cause dysregulation of endocrine signaling pathways, leading to adverse health outcomes due to the disruption of natural hormones. Thus, identifying possible ligands of NRs is a crucial task for understanding the adverse outcome pathway (AOP) for human toxicity as well as the development of novel drugs. However, the experimental assessment of novel ligands remains expensive and time-consuming. Therefore, an in silico approach with a wide range of applications instead of experimental examination is highly desirable. The recently developed novel molecular image-based deep learning (DL) method, DeepSnap-DL, can produce multiple snapshots from three-dimensional (3D) chemical structures and has achieved high performance in the prediction of chemicals for toxicological evaluation. In this study, we used DeepSnap-DL to construct prediction models of 35 agonist and antagonist allosteric modulators of NRs for chemicals derived from the Tox21 10K library. We demonstrate the high performance of DeepSnap-DL in constructing prediction models. These findings may aid in interpreting the key molecular events of toxicity and support the development of new fields of machine learning to identify environmental chemicals with the potential to interact with NR signaling pathways.


Introduction
Many chemical substances have potential harmful effects, causing the perturbation of endocrine homeostasis by interfering with various nuclear receptors (NRs) of hormones [1][2][3][4][5]. In the disruption of hormone pathways, structurally diverse groups of chemicals are known to interact primarily with ligand-NR bindings, which have the ability to substitute for natural ligands, ultimately resulting in proliferative, reproductive, and metabolic disorders [6][7][8][9][10][11][12]. NRs are a superfamily of ligand-dependent transcriptional factors containing n N-terminal transactivation domain, a flexible hinge region, and a C-terminal ligand-binding domain (LBD) [6,8,13]. NRs are classified mainly into two types according to their subcellular distribution in the absence of a ligand and their mechanisms: Type I steroid receptors, including the estrogen receptor (ER), androgen receptor (AR), progesterone receptor (PR), and glucocorticoid receptor (GR); and Type II nonsteroid receptors, including the thyroid receptor (TR alpha and beta), retinoic acid receptor (RAR alpha, beta, and gamma), retinoid X receptor (RXR), vitamin D receptor (VDR), peroxisome proliferator-activated receptor (PPAR alpha, beta, and gamma), liver X receptor (LXR), farnesoid X receptor (FXR), and pregane X receptor (PXR), [6,14,15]. In the absence of a ligand, the type I NR forms inactive complexes with chaperone proteins in the cytoplasm, Using these molecular images as input data into the DL, the prediction models of 35 NR agonists and antagonists were constructed using Tra, and validated with Val. The values of mean Loss (Val) and Acc (Val) were 0.0748 ± 0.0035 and 97.56 ± 0.09, respectively (Figure 3, Figure S4a, Supplementary Materials). In addition, the highest prediction performance on the Val dataset was observed in the thyroid-stimulating hormone receptor agonist (TSHR2_ago, AID: 1259393), for which the mean Loss (Val) and Acc (Val) were 0.0017 ± 0.0008 and 99.93 ± 0.02, respectively ( Figure 3, Figure S4a). The prediction performance of these models was evaluated using Test based on five metrics, namely AUC, BAC, F, Acc (Test), and MCC. The results showed that the mean AUC, BAC, F, Acc (Test), and MCC were 0.8842 ± 0.0165, 0.8471 ± 0.0168, 0.3085 ± 0.0411, 82.73 ± 3.92, and 0.3536 ± 0.0377, respectively (Figures 4 and 5, Figure S4a,b). In addition, the highest prediction performance on Test was observed in the thyroid-stimulating hormone receptor agonist (TSHR2_ago, AID: 1259393), with the mean AUC, BAC, F, Acc (Test), and MCC being 0.9994 ± 0.0006, 0.9997 ± 0.0003, 0.9286 ± 0.0714, 99.94 ± 0.06, and 0.9327 ± 0.0673, respectively (Figures 4 and 5, Figure S4a  Next, the datasets were divided into Tra:Val:Test groups with a 4:4:1 ratio. The mean numbers of active and inactive chemicals were, respectively, 120.9 ± 124.1 and 3107.1 ± 153.1 in Tra, 120.8 ± 124.3 and 3106.8 ± 153.3 in Val, and 30.1 ± 30.9 and 271.8 ± 279.2 in Test (Table S1, Supplementary Materials). In addition, the highest and lowest numbers of the active chemicals were, respectively, 683 and 2 in Tra, 684 and 6 in Val, and 170 and 2 in Test (Table S1). The molecular images derived from the 3D chemical structures were generated using the DeepSnap approach at different angles along the x-, y-, and z-axes, i.e., (176 • , 176 • , 176 • ). A total of 27 images for one chemical compound was captured ( Figure 2, Figure S1).
Using these molecular images as input data into the DL, the prediction models of 35 NR agonists and antagonists were constructed using Tra, and validated with Val. The values of mean Loss (Val) and Acc (Val) were 0.0748 ± 0.0035 and 97.56 ± 0.09, respectively (Figure 3, Figure S4a, Supplementary Materials). In addition, the highest prediction performance on the Val dataset was observed in the thyroid-stimulating hormone receptor agonist (TSHR2_ago, AID: 1259393), for which the mean Loss (Val) and Acc (Val) were 0.0017 ± 0.0008 and 99.93 ± 0.02, respectively ( Figure 3, Figure S4a). The prediction performance of these models was evaluated using Test based on five metrics, namely AUC, BAC, F, Acc (Test), and MCC. The results showed that the mean AUC, BAC, F, Acc (Test), and MCC were 0.8842 ± 0.0165, 0.8471 ± 0.0168, 0.3085 ± 0.0411, 82.73 ± 3.92, and 0.3536 ± 0.0377, respectively (Figures 4 and 5, Figure S4a,b). In addition, the highest prediction performance on Test was observed in the thyroid-stimulating hormone receptor agonist (TSHR2_ago, AID: 1259393), with the mean AUC, BAC, F, Acc (Test), and MCC being 0.9994 ± 0.0006, 0.9997 ± 0.0003, 0.9286 ± 0.0714, 99.94 ± 0.06, and 0.9327 ± 0.0673, respectively (Figures 4 and 5, Figure S4a (Arfull_ago), AID:743077 (Erlbd_ago), and AID:743140 (PPARg_ago), were 0.6500, 0.7147, and 0.7852, respectively. However, the best prediction model of AID:743122 (AhR_ago) had a BAC value of 0.8528 in the Data Challenge, whose BAC outperformed that in the DeepSnap-DL method (0.7785). Up to now, conflicting observations have been reported regarding whether DL performs better than conventional shallow machine learning (ML) methods, such as random forest, support vector machine, and gradient boosting decision tree [40,43,[49][50][51][52][53]. Although some reports suggest that DL outperforms conventional ML methods owing to various improvements, the performance of DL in terms of QSAR may be affected by many factors, such as molecular descriptors, assay targets, chemical space, hyper-parameter optimization, DL architectures, input data size, and quality [40]. Furthermore, the DeepSnap-DL approach has the black box problem, that is, it lacks explainability and interpretability of the prediction models because the convolutional area on the image picture by CNN is not defined. This issue has been extensively studied, especially in the field of image recognition. These studies try to resolve the issue by calculating the gradient of the input image with respect to the output label and highlighting the target pixel as a recognition target when a slight change in a specific input pixel causes a large change in the output label. However, a simple calculation of the gradient generates a noisy highlight, so some improved methods have been proposed for sharpening [54][55][56][57][58][59]. In addition, in the DeepSnap-DL approach, the performance improves as data size increases, and performance deterioration is observed with insufficient data size or the presence of noise. However, simply increasing the sample size causes problems such as overfitting and increased calculation costs. To resolve the issues of the DeepSnap-DL approach, critical factors include specifying the image area and type required for effective feature extraction to reduce the input data volume, and clarification of the functional relationship of chemical substances with biological activity in vivo. Future applications may include screening of target molecules in specific pathological reactions. To investigate whether the in vitro bioassays for agonist and antagonist mode in the Tox21 program affect the prediction performance of NRs, we compared prediction performances among four in vitro assays, namely, luciferase, beta-lactamase, cAMP, and intracellular calcium assays, using the results of 35 NR agonist and antagonist prediction models. In the Val dataset, the loss and accuracy values in the luciferase assay were significantly higher and lower, respectively, compared with that of the beta-lactamase assay (Figure 6a,b, p < 0.05 for both Loss (Val) and Acc (Val)).  The Tox21 Data Challenge 2014 was designed to understand the interference of the chemical compounds derived from the Tox21 10K compound library in the biological pathway via crowdsourced data analysis by independent researchers. It used data generated from seven NR signaling pathway assays to construct prediction models for QSARs [48]. The BAC values of the three models constructed by the proposed DeepSnap-DL were 0.8361, 0.8204, and 0.8494, respectively, outperforming the Data Challenge models where the BACs of three models, namely AID:743053 (Arfull_ago), AID:743077 (Erlbd_ago), and AID:743140 (PPARg_ago), were 0.6500, 0.7147, and 0.7852, respectively. However, the best prediction model of AID:743122 (AhR_ago) had a BAC value of 0.8528 in the Data Challenge, whose BAC outperformed that in the DeepSnap-DL method (0.7785). Up to now, conflicting observations have been reported regarding whether DL performs better than conventional shallow machine learning (ML) methods, such as random forest, support vector machine, and gradient boosting decision tree [40,43,[49][50][51][52][53]. Although some reports suggest that DL outperforms conventional ML methods owing to various improvements, the performance of DL in terms of QSAR may be affected by many factors, such as molecular descriptors, assay targets, chemical space, hyper-parameter optimization, DL architectures, input data size, and quality [40].   Furthermore, the DeepSnap-DL approach has the black box problem, that is, it lacks explainability and interpretability of the prediction models because the convolutional area on the image picture by CNN is not defined. This issue has been extensively studied, especially in the field of image recognition. These studies try to resolve the issue by calculating the gradient of the input image with respect to the output label and highlighting the target pixel as a recognition target when a slight change in a specific input pixel causes a large change in the output label. However, a simple calculation of the gradient generates a noisy highlight, so some improved methods have been proposed for sharpening [54][55][56][57][58][59]. In addition, in the DeepSnap-DL approach, the performance improves as data size increases, and performance deterioration is observed with insufficient data size or the presence of noise. However, simply increasing the sample size causes problems such as overfitting and increased calculation costs. To resolve the issues of the DeepSnap-DL approach, critical factors include specifying the image area and type required for effective feature extraction to reduce the input data volume, and clarification of the functional relationship of chemical substances with biological activity in vivo. Future applications may include screening of target molecules in specific pathological reactions.
To investigate whether the in vitro bioassays for agonist and antagonist mode in the Tox21 program affect the prediction performance of NRs, we compared prediction performances among four in vitro assays, namely, luciferase, beta-lactamase, cAMP, and intracellular calcium assays, using the results of 35 NR agonist and antagonist prediction models. In the Val dataset, the loss and accuracy values in the luciferase assay were significantly higher and lower, respectively, compared with that of the beta-lactamase assay (Figure 6a,b, p < 0.05 for both Loss (Val) and Acc (Val)). In addition, F and MCC in Test of the cAMP assay significantly increased compared with those of the beta-lactamase assay (Figure 6c,d, p < 0.05 for both F-measure and MCC). The BAC value in the Test dataset of the cAMP assay showed a moderate increase compared with that of the beta-lactamase assay ( Figure S5c, Supplementary Materials, p < 0.09). These results indicate that the prediction performance of the NR agonists and antagonists in the Tox21 10K library may be affected by the choice of the in vitro assay method. There are several conflicting reports regarding the in vitro receptor-mediated activity. Chemicals such as bisphenol A (BSA) and its halogenated analogs (tetrabromo-BSA and tetrachloro-BSA) show weak TR antagonist activity but have a potential agonist-like effect at lower concentrations [60,61]. Thus, competitive agonists and antagonists of the steroids have long been known [62][63][64]. Among them, ligands exhibiting agonist and antagonist activity, called selective steroid receptor modulators (SSRMs), are known to show specificity on tissue or cell type [62,[65][66][67][68][69]. In addition, a competitive antagonist, known as the passive antagonist, hinders the binding but induces the inactive state of NRs by modifying interaction with their corepressor and interfering with their nuclear translocation or DNA binding at saturated concentrations [62,70]. These reports suggest that the ligand of the steroid NRs can serve not only as competitive agonists and antagonists that affect binding to the NRs, but also as a unique allosteric modulator for subsequent molecular interactions. Therefore, classification of the chemicals in the Tox21 10K library may require In addition, F and MCC in Test of the cAMP assay significantly increased compared with those of the beta-lactamase assay (Figure 6c,d, p < 0.05 for both F-measure and MCC). The BAC value in the Test dataset of the cAMP assay showed a moderate increase compared with that of the beta-lactamase assay ( Figure S5c, Supplementary Materials, p < 0.09). These results indicate that the prediction performance of the NR agonists and antagonists in the Tox21 10K library may be affected by the choice of the in vitro assay method. There are several conflicting reports regarding the in vitro receptor-mediated activity. Chemicals such as bisphenol A (BSA) and its halogenated analogs (tetrabromo-BSA and tetrachloro-BSA) show weak TR antagonist activity but have a potential agonist-like effect at lower concentrations [60,61]. Thus, competitive agonists and antagonists of the steroids have long been known [62][63][64]. Among them, ligands exhibiting agonist and antagonist activity, called selective steroid receptor modulators (SSRMs), are known to show specificity on tissue or cell type [62,[65][66][67][68][69]. In addition, a competitive antagonist, known as the passive antagonist, hinders the binding but induces the inactive state of NRs by modifying interaction with their corepressor and interfering with their nuclear translocation or DNA binding at saturated concentrations [62,70]. These reports suggest that the ligand of the steroid NRs can serve not only as competitive agonists and antagonists that affect binding to the NRs, but also as a unique allosteric modulator for subsequent molecular

Conclusions
In this study, we built prediction models of 35 NR agonists and antagonists using the DeepSnap-DL approach with information of the chemical structure and activity from the Tox21 10K library. Three prediction models outperformed the best performing models in the Tox21 Data Challenge 2014. These results suggest that the 3D chemical structure representation in the DeepSnap-DL approach may be useful for molecular image-based QSAR analysis, and the improvements to the DeepSnap-DL method may aid in achieving high-performing prediction models.

Conclusions
In this study, we built prediction models of 35 NR agonists and antagonists using the DeepSnap-DL approach with information of the chemical structure and activity from the Tox21 10K library. Three prediction models outperformed the best performing models in the Tox21 Data Challenge 2014. These results suggest that the 3D chemical structure representation in the DeepSnap-DL approach may be useful for molecular image-based QSAR analysis, and the improvements to the DeepSnap-DL method may aid in achieving high-performing prediction models.
The cells were dispensed at 1500 to 5000 cells/5 (for antagonist mode) or 6 (for agonist mode) microL/well in 1536-well black wall/clear bottom plates [72,73,75,[78][79][80]. After the cells were incubated at 37 • C for 5 to 6 h depending on the particular NR cell line to allow for cell attachment, 23 nL of the compounds at different concentrations were transferred to the assay plates. For the antagonist mode assay, the known agonist for each NR was added into the assay plates. For the agonist and antagonist mode assays, positive control compounds were dispensed into each other's wells on the plates (Table 1) [72,73,75,[78][79][80]. The plates were incubated for 16 to 18 h at 37 • C depending on the particular NR cell line. Then, a LiveBLAzer TM B/G FRET substrate (Invitrogen, Carlsbad, CA, USA) detection mix was added, and the plates were incubated at room temperature for 1.5 to 2 h. The fluorescence intensity (405 nm excitation, 460 and 530 nm emission) was measured using an Envision plate reader (PerkinElmer, Shelton, CT, USA). Data were expressed as the ratio of 460/530 nm emission values. To measure the luciferase reporter gene activity, 4 microL of ONE-Glo TM Luciferase Assay reagent (Promega, Madison, WI, USA) were added to each plate, and the luminescence intensity was quantified by a ViewLux plate reader (PerkinElmer) after 30 min of incubation at room temperature. Data were expressed as relative luminescence units.

qHTS Data Analysis
The Tox21 10k library can be grouped into clusters with similar activity that share similar annotated models of action according to PubChem activity scores. In the qHTS of the Tox21 program, to identify the chemical compounds in both potential agonist and antagonist modes, the PubChem activity scores were determined from 0% to 100% by normalizing each titration point relative to the positive control compound (agonist mode: 100%, antagonist mode: 0%) and DMSO-only wells (agonist mode: 0%, antagonist mode: -100%) according to the following equation: % Activity = [(Vcompound − Vdmso)/(Vpos − Vdmso)] × 100, where Vcompound, Vdmso, and Vpos denote the compound well values, the median values of the DMSO-only wells, and the median value of the positive control well, respectively.
The datasets were then corrected using compound-free control plates, i.e., DMSO-only plates, at the beginning and end of the compound plate measurement [72,73,75,[78][79][80]. The half maximum inhibition values (IC 50 ) and the maximum response values for each compound were calculated by fitting the concentration-response curves of each compound to a four-parameter Hill equation [84,85].
The PubCem activity scores of the agonists and antagonists were grouped into three classes, namely (1) 0, (2) 1-39, and (3) 40-100, which represent inactive, inconclusive, and active compounds, respectively. In this study, compounds with activity scores of 40-100 or 0-39 were defined as active or inactive, respectively. The dataset includes some similar chemical compounds, but with different activity scores for different ID (identification) numbers due to the presence of possible stereoisomers or salts. Therefore, chemical compounds with indefinite activity criteria, nonorganic compounds, and/or inaccurate SMILES were eliminated.

DeepSnap
We then applied a 3D conformational import from the SMILES format using molecular operating environment (MOE) 2018 software (MOLSIS Inc., Tokyo, Japan) to generate the chemical database. Here, the neutralization of the protonation state and the coordinating washed species were used by the external program, CORINA classic software ( Figure S1, Supplementary Materials) [86]. The resulting 3D structures were then saved in an SDF file format. Using the SDF files prepared by the MOE application, the 3D chemical structures were depicted as 3D ball-and-stick models with different colors corresponding to different atoms by Jmol, an open-source Java viewer software (version number, manufacturer, city, state abbreviation, country) for 3D molecular modeling of chemical structures [44][45][46][47]. These 3D chemical structures produce different images depending on the direction. The 3D chemical models were captured automatically as snapshots with user-defined angle increments with respect to the x-, y-, and z-axes. In this study, one angle increment was used, i.e., (176 • , 176 • , 176 • ). Other parameters for the DeepSnap depiction process were set based on previous studies as follows: image pixel size: 256 × 256; molecule number per SDF file to split into: 100; zoom factor (%): 100; atom size for van der Waals radius (%): 23; bond radius (mÅ): 14.5; minimum bond distance: 0.4; and bond tolerance: 0.8 [44][45][46][47]. The snapshots saved as 256 × 256 pixel resolution PNG files (RGB) were divided into three types of datasets: training (Tra), validation (Val), and test (Test) ( Figure S1, Figure 2).

Preparation of Dataset
Three groups of datasets were prepared by dividing the data into Tra, Val, and Test groups. The data were first split into 11 groups, and the two dataset groups (4:4:1_01 and 4:4:1_02) were then built in accordance with the ratio of Tra:Val:Test = 4:4:1. A prediction model was created using the Tra and Val datasets. Then, the prediction performance was evaluated using the Test dataset (4:4:1_01) ( Figure S2, Supplementary Materials). For a subsequent analysis, the remaining Test dataset was selected from the group not used in the first analysis. The model was then built, and its probability calculation was examined in the same manner (4:4:1_02). Finally, two tests were performed and the average was calculated ( Figure S2).

Deep Learning
All the two-dimensional (2D) PNG images produced by DeepSnap were resized by utilizing the NVIDIA DL GPU Training System (DIGITS) version 4.0.0 software (NVIDIA, Santa Clara, CA, USA), on four-GPU systems, Tesla-V100-PCIE (31.7 GB), with a resolution of 256 × 256 pixels as input data, as previously reported [44][45][46][47]. The prediction model was pre-trained as transfer learning [44][45][46][47] by the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 dataset [87], which includes 1000 classes, such as animal (40%), device (12%), container (9%), consumer goods (6%), and equipment (4%). The ILSVRC 2012 dataset was divided as 1.2 million Tra, 50,000 Val, and 1 million Test datasets extracted from ImageNet [88]. To rapidly train and fine-tune the highly accurate CNNs using the input Tra and Val datasets based on the image classification and building the pre-trained prediction model, we used a pre-trained open-source DL model, Caffe, and the open-source software on the CentOS Linux distribution 7.3.1611. In this study, the deep CNN architecture was GoogLeNet, which is a complex network inspired by LeNet and implemented with a novel module called "Inception", which facilitates batch normalization, image distortions, and RMSprop; concatenates different filter sizes and dimensions into a single new filter; and introduces sparsity and multiscale information in one block ( Figure S3, Supplementary Materials). The network is a 22-layer deep CNN, comprising two convolutional layers, two types of pooling layers (four max pools and one avg pool), and nine Inception modules, each module having six convolution layers and one pooling layer, with 4 million parameters ( Figure S3) [89][90][91].
In the DeepSnap-DL method, the prediction models were constructed by training datasets using 30 epochs with 1 snapshot interval in each epoch, 1 validation interval in each epoch, 1 random seed, a stochastic gradient descent-type solver, a learning rate of 0.006, and a batch size of 108 in DL. Among the epochs, the lowest Loss value in the Val dataset (Loss (Val)), which is the error rate between the results obtained from the validation data and the corresponding labeled dataset, was selected for subsequent examination of prediction using the Test dataset.

Evaluation of the Predictive Model
Through two tests conducted on the Test datasets for the experiments, with Tra:Val:Test = 4:4:1 in the DL prediction model, we analyzed the probability of the prediction results with the lowest minimum Loss (Val) value among 30 examined epochs. We calculated the probabilities for each image of one molecule captured at different angles with respect to the x-, y-, and z-axes using DeepSnap-DL. The medians of each of these predicted values were used as the representative values for target molecules as previously reported [44][45][46][47]. The performance of each model in predicting the NR agonists and antagonists was evaluated in terms of the following metrics: area under the curve of receiver operating characteristic curve (ROC_AUC); balanced accuracy (BAC); accuracy (Acc), which is the percentage of correct answers based on the results obtained from the validation dataset and the corresponding labeled dataset; F-measure; and Matthews correlation coefficient (MCC) calculated using JMP Pro 14, which is a statistical discovery software (SAS Institute Inc., Cary, NC, USA), as previously reported [44][45][46][47]. These performance metrics are defined as follows: where TP, FN, TN, and FP denote true positive, false negative, true negative, and false positive, respectively. To determine the optimal cutoff point for the definition of TP, FN, TN, and FP, the method of maximizing sensitivity (1-specificity), which is called the Youden index [92,93], was adopted using JMP Pro software. The index has a value ranging from 0 to 1, where 1 represents maximum effectiveness and 0 represents minimum effectiveness.

Statistical Analysis
Differences in prediction performance of in vitro assays in terms of loss (Val), Acc (Val), and Acc (Test), were analyzed by Tukey-Kramer's honestly significant difference test with JMP Pro 14 [94]. Results with p < 0.05 were considered statistically significant.