A Toxicity Prediction Tool for Potential Agonist/Antagonist Activities in Molecular Initiating Events Based on Chemical Structures

Because the health effects of many compounds are unknown, regulatory toxicology must often rely on the development of quantitative structure–activity relationship (QSAR) models to efficiently discover molecular initiating events (MIEs) in the adverse-outcome pathway (AOP) framework. However, the QSAR models used in numerous toxicity prediction studies are publicly unavailable, and thus, they are challenging to use in practical applications. Approaches that simultaneously identify the various toxic responses induced by a compound are also scarce. The present study develops Toxicity Predictor, a web application tool that comprehensively identifies potential MIEs. Using various chemicals in the Toxicology in the 21st Century (Tox21) 10K library, we identified potential endocrine-disrupting chemicals (EDCs) using a machine-learning approach. Based on the optimized three-dimensional (3D) molecular structures and XGBoost algorithm, we established molecular descriptors for QSAR models. Their predictive performances and applicability domain were evaluated and applied to Toxicity Predictor. The prediction performance of the constructed models matched that of the top model in the Tox21 Data Challenge 2014. These advanced prediction results for MIEs are freely available on the Internet.


Introduction
Quantitative structure-activity relationship (QSAR) analysis is a technique used to predict the physiological activity of low-molecular-weight compounds based on their molecular structure [1,2]. In the field of toxicology, QSAR methodology is used for quantitative structure-toxicity relationship (QSTR) modeling using complex toxicity and adverse effect onset mechanisms that are objective variables [3,4].
An in silico approach, such as QSTR, is time and cost-effective for the detection of the potential toxicity of compounds in the early phases of drug development and pharmacovigilance, satisfying global ethical requirements regarding the 3R rules [5][6][7]. QSTR has therefore been extensively applied to regulatory toxicology. Recently, the critical application issue of realizing the implementation of toxicity prediction models extensively and of putting them to practical use has emerged. However, currently, one missing but desirable functionality in the practical use of QSTR prediction is that resources, such as the toxicity prediction models, should be distributed as highly convenient public software. Therefore, these toxicity prediction models should be published so that users can access QSTR prediction models for various toxicity targets [8][9][10].
The Toxicology in the 21st Century (Tox21) program is a consortium constituted by the National Institute of Health, the US Environmental Protection Agency, the National Toxicology Program, the National Center for Advancing Translational Sciences, and the Food and Drug Administration [11]. This project develops and evaluates novel efficient methods for toxicity assessments and mechanistic insights in addition to reducing time, costs, and animal usage [11,12]. Furthermore, in the ToxCast and Tox21 programs, for potentially molecular initiating event (MIE) targets for adverse outcome pathways [13,14], the in vitro quantitative high-throughput screening (qHTS) of approximately 10,000 compounds was performed [15]. These targets include nuclear receptors (NRs) and stress response (SR) pathways. Endocrine-disrupting chemicals (EDCs) interfere with the endocrine system by interacting with NRs and SR pathways and engender myriad adverse developmental, reproductive, neurological, and immunological effects in both humans and wildlife [16,17]. Therefore, identifying potential EDCs is of specific interest for the Tox21 program and environmental chemical hazard screening in general.
However, the in vitro qHTS assay is insufficient to screen all classes of chemicals, such as those still in the molecular development and optimization phase and, thus, cannot provide an accurate evaluation of the potential toxicity of chemicals in humans and the environment [18]. Therefore, a growing interest exists in a comprehensive in silico approach to detect the potential toxicity of chemicals. The literature presents the results of successful examples of alternative in silico toxicity screening methods and their applications using the Tox21 10K library [19][20][21]. However, even though there are 59 types of well-confirmed assay results of agonist/antagonist activities for toxicity targets in the Tox21 10K library, several studies have built models for only a small number of toxicity targets. There is still no comprehensive approach. Furthermore, because these models had not been opened, other researchers could not access the available constructed models. Therefore, users have found it challenging to perform and reuse the prediction of MIEs.
In this study, we overcame this problem by extensively collecting and processing databases of 59 types of assay targets based on the Tox21 10K library and constructed in silico toxicity prediction models for each assay target using XGBoost [22], which is a gradient-boosting algorithm with multiple uses for toxicity predictions [23]. The predictive performance of all models was validated and published on the web application. Using the prediction models constructed in this study, the screening of the potential toxicity of chemicals to various toxicity targets is possible.

Distributions of Active and Inactive Compounds
The PubChem activity scores were normalized between 0 and 100 using the following equation: activity = ((V compound − V DMSO )/(V pos − V DMSO )) × 100, where V compound , V DMSO , and V pos denote the compound-well value, the median value of the DMSO-only wells, and the median value of the positive-control well, respectively [24]. The most active and inactive results have scores close to 100 and 0, respectively. In the PubChem documentation, all inactive compounds have a score of 0, active compounds have scores between 40 and 100, and inconclusive compounds have scores between 5 and 30. To implement the binary classification models, the binary teacher labels of active or inactive compounds were defined in two ways. In one definition, active compounds were scored 40 or higher; in another definition, active compounds were assigned scores of 1 or higher.
We converted PubChem activity scores to binary labels using the two definitions of a criterion of 40 and a criterion of 1 to implement binary classification models for 59 toxicity targets. Figure 1a shows the number of active and negative compounds based on the definition of a criterion of 40, and Figure 1b shows that of a criterion of 1. For all toxicity targets, when we converted PubChem activity scores to binary labels with the definition of a criterion of 40, the mean ratio of active compounds to all compounds was 4.7% ± 4.0% and that of a criterion of 1 was 18.1% ± 11.0%. Lowering the criteria from 40 to 1 increased the mean ratio of active compounds by approximately 13.4%. However, when annotated with the criteria of 40, the ratio of active compounds in VDR_ago (PubChem activity score ID (AID): 743241), NFkB_ago (AID: 1159518), and TGFb_ago (AID: 1347035) were lower than 1%.

Models and Performances
For the 59 individual targets, 10% of all compounds was assigned to the test set. The other 90% of the compounds was used for the optimization, training, and validation of models in the validator, as shown in Section 3.5. The predictive performances of the constructed models were evaluated based on the area under the curve (AUC) of the receiver operating characteristic (ROC) curve in the test set. Optimal thresholds to convert the prediction probability to a binary class output were calculated using the Youden index gained from the ROC curve in the test set. Using these thresholds, predictive performances in the test set were evaluated. Table 1 shows the results of the test set. Model performances in the test set were evaluated using the metrics of AUC, sensitivity (SE), specificity (SP), accuracy (ACC), balanced accuracy (BAC), and the Matthews correlation coefficient (MCC). Figures  2 and 3 show the ROC curves for all toxicity targets in the test set in the cases of criteria 40 and 1, respectively. In both cases in which the active labels were annotated with a criterion of 40 and 1, Table  2 summarizes the averages of predictive performances in the test set. Good predictive performances were observed for the models regardless of the criteria. However, for VDR_ago (AID: 743241), HIF1_ago (AID: 1224894), and Shh ago (AID: 1259390), which were annotated by a criterion of 40, the ratios of active compounds in the test set were 0%, 0.42%, and 0.62% and the AUC values were N.D., 0.556, and 0.571, respectively.
The classification performance of models tends to deteriorate because of class distribution imbalance [25]. A between-class imbalance degrades the prediction performance because of the bias in the prediction results toward the majority class, leading to more prediction errors in the minority class [26]. Figure 1 shows a sparser distribution of active compounds and an imbalance in the case of using a criterion of 40 compared to a criterion of 1. In this study, as shown in Figure 1 and Table 1, because of the between-class imbalance caused by the criteria of 40, constructing and evaluating some toxicity prediction models was impossible. We managed this problem by lowering the criteria from 40 to 1, and with this, we could evaluate the constructed models.
When using labels annotated by the criteria of 1, all compounds were treated as active, except those ensured to be inactive, which had a PubChem activity score of 0. Therefore, using the criterion of 1 in order to develop the models, we concluded that we developed criterion 1 models that accurately learned the inactive compounds compared with criterion 40 models. On the other hand, Judson et al. reported that a phenomenon called cytotoxicity-associated "burst" was observed for tests conducted on the Tox21 program [27]. Many chemicals show the activation of large numbers of assays over a narrow range of concentrations in which cell stress and cytotoxicity are also observed.

Models and Performances
For the 59 individual targets, 10% of all compounds was assigned to the test set. The other 90% of the compounds was used for the optimization, training, and validation of models in the validator, as shown in Section 3.5. The predictive performances of the constructed models were evaluated based on the area under the curve (AUC) of the receiver operating characteristic (ROC) curve in the test set. Optimal thresholds to convert the prediction probability to a binary class output were calculated using the Youden index gained from the ROC curve in the test set. Using these thresholds, predictive performances in the test set were evaluated. Table 1 shows the results of the test set. Model performances in the test set were evaluated using the metrics of AUC, sensitivity (SE), specificity (SP), accuracy (ACC), balanced accuracy (BAC), and the Matthews correlation coefficient (MCC). Figures 2 and 3 show the ROC curves for all toxicity targets in the test set in the cases of criteria 40 and 1, respectively. In both cases in which the active labels were annotated with a criterion of 40 and 1, Table 2 summarizes the averages of predictive performances in the test set. Good predictive performances were observed for the models regardless of the criteria. However, for VDR_ago (AID: 743241), HIF1_ago (AID: 1224894), and Shh ago (AID: 1259390), which were annotated by a criterion of 40, the ratios of active compounds in the test set were 0%, 0.42%, and 0.62% and the AUC values were N.D., 0.556, and 0.571, respectively.
The classification performance of models tends to deteriorate because of class distribution imbalance [25]. A between-class imbalance degrades the prediction performance because of the bias in the prediction results toward the majority class, leading to more prediction errors in the minority class [26]. Figure 1 shows a sparser distribution of active compounds and an imbalance in the case of using a criterion of 40 compared to a criterion of 1. In this study, as shown in Figure 1 and Table 1, because of the between-class imbalance caused by the criteria of 40, constructing and evaluating some toxicity prediction models was impossible. We managed this problem by lowering the criteria from 40 to 1, and with this, we could evaluate the constructed models.
When using labels annotated by the criteria of 1, all compounds were treated as active, except those ensured to be inactive, which had a PubChem activity score of 0. Therefore, using the criterion of 1 in order to develop the models, we concluded that we developed criterion 1 models that accurately learned the inactive compounds compared with criterion 40 models. On the other hand, Judson et al. reported that a phenomenon called cytotoxicity-associated "burst" was observed for tests conducted on the Tox21 program [27]. Many chemicals show the activation of large numbers of assays over a narrow range of concentrations in which cell stress and cytotoxicity are also observed. Therefore, some of the assay activity in this concentration range may represent nonintentional chemical effects, such as cytotoxicity. Judson et al. [27] showed that the Tox21 10K library contains false positive responses induced by the burst phenomenon.  The quality of a machine learning model depends on that of the experimental data being fed into it. Ideally, machine learning models should be provided with reliable data for both active and inactive compounds during training; however, the current concern is that this decreases the number of active compounds being trained and increases the between-class imbalance in the data set being fed into the model. Consequently, the identification of burst compounds in our models has not yet been examined. Therefore, our models are still limited in terms of their ability to successfully feed the training data; particularly, their ability to exactly identify real active compounds remains a challenge. Importantly, the active compounds identified using our predictive models may actually be inactive. However, our models have learned nontoxic compounds more exactly than other approaches, and the ability to identify real negative compounds could be promising. A toxicity prediction model in the field of drug discovery must determine nontoxic compounds as well as must be capable of accurately determining toxic compounds; thus, our tool could practically aid in toxicity assessment application.     Each value of performances evaluated by six metrics were shown as mean ± standard error. n = 58 (criteria 40), n = 59 (criteria 1).

Comparison with the Tox21 Data Challenge 2014
For further validation of the predictive performance of the models established in this study, we compared their performance with the predictive models built in the Tox21 Data Challenge. The Tox21 Data Challenge 2014 was designed to understand the interference of the chemical compounds derived from the Tox21 10K library in the biological pathway using a crowd-sourced data analysis conducted by independent researchers. This challenge used data generated from seven NR and five SR signaling pathway assays to construct prediction models for QSARs [28].
There For construction of a model for each of these toxicity targets, the compounds used in this work overlapped with those used in the Tox21 Data Challenge. Moreover, the active and inactive compounds used in this work were defined using the annotation method based on the criteria of 40 and showed a 98.7% ± 0.7% match with the active and inactive compounds used in the challenge and showed strong concordance overall.
The allocations of the test sets used in the Tox21 Data Challenge were different from those used in this study. Therefore, a simple comparison using the predictive performance of the models used in the Tox21 Data Challenge and that constructed in this study is impossible. However, in this study, we established predictive models for the 10 duplicate toxicity targets using the equivalent compounds and teacher labels to those of the challenge. Therefore, the results of this challenge could be a performance benchmark to discuss the predictive performance of models built for the same targets in this study.
The AUC has been adopted as a primary metric for ranking model performance in the Tox21 Data Challenge; therefore, the predictive models in the Tox21 Data Challenge have been ranked based on the AUC [29]. The AUCs in the test set validated in this study are shown in The predictive performances for these six targets were comparable to or better than those of the top models of the Tox21 Data Challenge. Therefore, the results indicate that several predictive models developed in this study were valid toxicity models for in silico screening with high accuracy.

Implementation of the Models in the Toxicity Predictor
All 118 (two criteria for each of the 59 toxicity targets) models were implemented as part of Toxicity Predictor, which is a web application for the prediction of drug-induced liver injury. The Toxicity Predictor web application was constructed by the Development of a Drug Discovery Informatics System project in the Japan Agency for Medical Research and Development (AMED) and is available at http://mmi-03.my-pharm.ac.jp/tox1/. This application uses an input file containing one or multiple QSAR-ready structures in simplified molecular-input line-entry system (SMILES) strings or SDF format. Furthermore, it can depict a structural formula drawn in the browser and can use it as an input. The molecular structure from the input file is converted to a three-dimensional (3D) structure by the three-dimensionalization algorithm used in this study ( Figure 5). Next, Toxicity Predictor calculated the necessary descriptors for the requested models using Mordred, an opensource software application used to calculate molecular descriptors. Finally, Toxicity Predictor predicted the chemical toxicity of 59 targets using the models constructed in this study. The prediction results of the input compound for the toxicity targets were converted to inactive or active, were returned, and can be viewed in a terminal browser (Figure 6b). Furthermore, the 3D structures and prediction results for all MIEs can be downloaded in SDF and CSV formats, respectively.
A model can be evaluated locally only within its applicability domain (AD), which is the chemical space of the training set [30,31]. Any extrapolation outside of that specific area of the structure space is most probably unreliable. Therefore, the system of the toxicity predictor incorporates domain evaluation to ensure the reliability of the QSTR inference. The AD of the evaluation compound is defined using the average of the logarithmic values of the Euclidean distance with the five nearest molecules in the descriptor space and is expressed numerically as reliability in Toxicity Predictor. Furthermore, the chemical structure is assessed to evaluate if it falls within the AD of the training set chemical space, and its position in the training set chemical space can be visualized and confirmed by principal component analysis (Figure 6a).

Implementation of the Models in the Toxicity Predictor
All 118 (two criteria for each of the 59 toxicity targets) models were implemented as part of Toxicity Predictor, which is a web application for the prediction of drug-induced liver injury. The Toxicity Predictor web application was constructed by the Development of a Drug Discovery Informatics System project in the Japan Agency for Medical Research and Development (AMED) and is available at http://mmi-03.my-pharm.ac.jp/tox1/. This application uses an input file containing one or multiple QSAR-ready structures in simplified molecular-input line-entry system (SMILES) strings or SDF format. Furthermore, it can depict a structural formula drawn in the browser and can use it as an input. The molecular structure from the input file is converted to a three-dimensional (3D) structure by the three-dimensionalization algorithm used in this study ( Figure 5). Next, Toxicity Predictor calculated the necessary descriptors for the requested models using Mordred, an open-source software application used to calculate molecular descriptors. Finally, Toxicity Predictor predicted the chemical toxicity of 59 targets using the models constructed in this study. The prediction results of the input compound for the toxicity targets were converted to inactive or active, were returned, and can be viewed in a terminal browser (Figure 6b). Furthermore, the 3D structures and prediction results for all MIEs can be downloaded in SDF and CSV formats, respectively.
A model can be evaluated locally only within its applicability domain (AD), which is the chemical space of the training set [30,31]. Any extrapolation outside of that specific area of the structure space is most probably unreliable. Therefore, the system of the toxicity predictor incorporates domain evaluation to ensure the reliability of the QSTR inference. The AD of the evaluation compound is defined using the average of the logarithmic values of the Euclidean distance with the five nearest molecules in the descriptor space and is expressed numerically as reliability in Toxicity Predictor. Furthermore, the chemical structure is assessed to evaluate if it falls within the AD of the training set chemical space, and its position in the training set chemical space can be visualized and confirmed by principal component analysis (Figure 6a). From the platform, entering the compounds for prediction and describing the chemical structure formula from an input format such as SMILES strings or SDF format is possible. The compound to be predicted is three-dimensionalized based on the algorithm in "Conformations and Descriptors", and descriptors are calculated. From the platform, entering the compounds for prediction and describing the chemical structure formula from an input format such as SMILES strings or SDF format is possible. The compound to be predicted is three-dimensionalized based on the algorithm in "Conformations and Descriptors", and descriptors are calculated. From the platform, entering the compounds for prediction and describing the chemical structure formula from an input format such as SMILES strings or SDF format is possible. The compound to be predicted is three-dimensionalized based on the algorithm in "Conformations and Descriptors", and descriptors are calculated.

Biological Overview of Modeled MIEs
We outline the toxicological meanings of the endpoints established in our model construction research. The following cellular targets and their interactions with agonists and antagonists can be potential MIEs associated with diverse toxicological adverse outcomes (Tables S1 and S2).
AhR. The aryl hydrocarbon receptor (AhR), a member of the family of basic helix-loop-helix transcription factors, is crucial for the adaptation of responses to environmental changes. AhR is a ligand-activated transcription factor that is known to mediate most of the toxic and carcinogenic effects of various environmental contaminants such as polyaromatic hydrocarbons and dioxin [32].
GR. The glucocorticoid receptor (GR) is a member of the nuclear receptor family of ligand-dependent transcription factors. GR plays a critical role in carbohydrate, protein, and lipid metabolism and programmed cell death [33].
AR. The androgen receptor (AR), a nuclear hormone receptor, is significant in AR-dependent prostate cancer and other androgen-related diseases. EDCs and their interactions with steroid hormone receptors, such as AR, may disrupt normal endocrine function and interfere with metabolic homeostasis, reproduction, and developmental and behavioral functions [34].
ER and ERRs. The estrogen receptor (ER), a nuclear hormone receptor, plays an important role in development, metabolic homeostasis, and reproduction. Two subtypes of ER, ER-α and ER-β, are composed of various functional domains and have several structural regions in common [35]. EDCs and their interactions with steroid hormone receptors, such as ER, disrupt normal endocrine function. However, estrogen-related receptors (ERRs), the orphan nuclear receptors, are crucial in cellular energy metabolism control. ERR-α is a member of the NR superfamily, and studies have linked it with various cancers. In endocrine-related cancers, such as breast cancer, ERR-α regulates numerous target genes that direct cell proliferation and growth independent of ER-α [36].
PR. The progesterone receptor (PR), a nuclear hormone receptor, influences development, metabolic homeostasis, and reproduction. EDCs tend to bind to PR and disrupt normal endocrine function [37].
Aromatase. Aromatase catalyzes the conversion of androgen to estrogen and is vital in maintaining the androgen and estrogen balance in many EDC-sensitive organs [38].
TRHR. Thyrotropin-releasing hormone (TRH) receptor (TRHR) is a G-protein-coupled receptor (GPCR) that binds the tripeptide thyrotropin-releasing hormone. TRHR is found in the brain and, when bound by TRH, acts to increase the intracellular inositol trisphosphate through phospholipase C. It plays a crucial role in the anterior pituitary as it controls the synthesis and secretion of thyroid-stimulating hormone and prolactin [39].
TSHR. TSHR is a GPCR for thyrotropin (thyroid-stimulating hormone or TSH), which is a member of the glycoprotein hormone family. TSH is released by the anterior pituitary gland and is the main regulator of thyroid gland growth and development [40].
TR. Thyroid receptor (TR), a nuclear hormone receptor, plays an important role in normal brain development, metabolism control, and many aspects of normal adult physiology. A large number of industrial chemicals reduce circulating levels of thyroid hormone [41,42].
PPARs. Peroxisome proliferator-activated receptors (PPARs) are lipid-activated transcription factors of the NR superfamily with three distinct subtypes, namely PPAR-α, PPAR-δ (also called PPAR-β), and PPAR-γ. All these subtypes heterodimerize with Retinoid X receptor (RXR), and these heterodimers regulate the transcription of various genes. PPAR-γ receptor is involved in the regulation of glucose and lipid metabolism. The function of PPAR-δ includes the regulation of cholesterol and lipid metabolism [43].
FXR. Farnesoid X receptor (FXR), a member of the NR superfamily, is identified as a receptor of bile acids. It is found in large amounts in the liver, intestine, kidney, and adrenal cortex. FXR binds to FXR-response elements of DNA as a monomer or heterodimer with a common partner for NRs, RXR, to regulate the expression of the diverse genes involved in the metabolism of bile acids, lipids, and carbohydrates. Numerous studies have reported that FXR agonist is favorable for liver regeneration and hepatocarcinogenesis [44,45].
CAR. The constitutive androstane receptor (CAR) is a nuclear receptor that regulates gene expression for multiple drug-metabolizing enzymes and transporters, which are important factors in the metabolism of drugs or xenobiotics. CAR activation leads to the upregulation of organic anion transporting polypeptide (OATP) transporters-that is, hepatic uptake transporters-together with the upregulation of cytochrome P450 (CYP) and UDP-glucuronosyltransferases (UGT) enzymes [46].
PXR. Pregnane X receptor (PXR) regulates the expression of several drug-metabolizing enzymes, such as CYP3A4. The induction of these proteins is a major mechanism for developing drug resistance in cancer [47].
RAR. Retinoic acid receptor (RAR) is a nuclear receptor that regulates the development of chordate animals, including the body axis, spinal cord, forelimbs, heart, eye, and reproductive tract. Retinoic acid (RA) is derived from retinol (vitamin A) as a metabolic product and functions as a ligand for nuclear RARs. These RARs bind target genes as heterodimer complexes with RXRs at a DNA sequence known as the RA response element. Interference with RA signaling can have potential adverse effects on embryonic development [48].
ROR-γ. Nuclear receptor retinoic acid receptor-related orphan receptor gamma (ROR-γ) is a key transcription factor for the pathogenesis of autoimmune diseases mediated by Th17 cells. Because of the essential role of ROR-γ in controlling the differentiation and functioning of Th17 cells, interference with ROR-γ signaling pathways may promote susceptibility to immunotoxicants and autoimmune diseases.
VDR. Vitamin D receptor (VDR), a member of the nuclear hormone receptor superfamily, plays a critical role in calcium homeostasis and bone metabolism [51].
ARE. The Nrf2-ARE pathway is an intrinsic mechanism of defense against oxidative stress. Nuclear factor E2-related factor 2 (Nrf2) is a transcription factor that induces the expression of target genes involved in the amelioration of oxidative stress by binding to the antioxidant response element (ARE) [52]. Oxidative stress can activate various transcription factors including NF-κB (nuclear factor-kappa B), AP-1 (activator protein-1), Nrf2, hypoxia-inducible factor-1 (HIF-1α), p53, and PPAR-γ. It can lead to chronic inflammation, mediating most chronic diseases, including cancer, diabetes, cardiovascular diseases, neurological diseases, and pulmonary diseases [53].
NF-κB and AP-1. The Nuclear factor-kappa B (NF-κB) transcription factor family and activator protein-1 (AP-1) transcription family are known as key regulators of inducible gene expression in the immune system [54].
HIF-1. Hypoxia-inducible factor-1 (HIF-1) is a major transcription factor that regulates the cellular response in low-oxygen conditions. HIF-1 comprises two subunits, hypoxia-responsive HIF-1-α and HIF-1-β, and is known as the aryl hydrocarbon receptor nuclear translocator. Under hypoxic conditions, HIF-1-α and HIF-1-β form a heterodimer. The HIF-1 complex translocates into the nucleus, binds to the hypoxia-responsive element (HRE), and activates the expression of target genes, such as vascular endothelial growth factor (VEGF). The HIF-1 pathway is essential for normal growth and development, and it is involved in the pathophysiology of cancer and inflammation [55].
p53. p53, a tumor suppressor protein, is activated following cellular insult, including DNA damage and other cellular stresses. The activation of p53 regulates cell fate by inducing DNA repair, cell cycle arrest, apoptosis, or cellular senescence. Therefore, the activation of p53 is a good indicator of DNA damage and other cellular stresses [56].
HDAC. Histone deacetylases (HDACs) are a group of epigenetic enzymes that regulate gene expression by histone deacetylation. Histone acetylation plays a major and fundamental role in chromatin structure/function regulating eukaryotic gene expression, and it facilitates gene transcription and expression by relaxing the chromatin structure. HDAC inhibitors activate antitumor pathways through multiple action mechanisms, such as the activation of the apoptotic pathway and cell cycle arrest [58].
H2AX. One of the earliest cellular responses to DNA double-strand breaks is the phosphorylation at Ser139 of the core histone protein H2AX. This phospho-Ser139 serves as a sensitive biomarker for detecting such breaks, localizing the site of DNA repair [59].
HSR. Heat shock response (HSR) is a transcriptional response to elevated temperature shock, regulated by heat shock transcription factors (HSFs). The function of HSF-1, a well-studied target gene in HSR, is the protection of cells against proteotoxicity associated with misfolding, aggregation, and proteome mismanagement. While the induction of the HSR is specific to elevated temperature stress, a closely related cell stress response with HSF-1 is also induced when cells are exposed to other forms of environmental stress, such as oxidants, heavy metals, and xenobiotics, that cause protein damage and misfolding [60].
Shh. The hedgehog (Hh) pathway is crucial in many vital cellular processes, such as cell proliferation and differentiation during embryonic development. Three Hh genes discovered in vertebrates are Sonic Hedgehog (Shh), Indian Hedgehog (Ihh), and Desert Hedgehog (Dhh). Sonic Hedgehog protein (Shh) is the most widely found in adult tissues and is the most potent target. Therefore, chemicals that interfere with the Shh pathway are potential developmental toxicants [61].
TGF-β. Transforming growth factor-β (TGF-β) is a cytokine involved in various biological activities, including the regulation of proliferation, differentiation, and function of numerous cell types and the effects on glucose metabolism and fibrosis, in addition to its immunomodulatory function [62].
MMP. Mitochondrial membrane potential (MMP), a parameter for mitochondrial function, is generated by the mitochondrial electron transport chain that creates an electrochemical gradient. This gradient drives the synthesis of ATP, a crucial molecule for various cellular processes. Measuring MMP in living cells is commonly performed to assess the effect of chemicals on mitochondrial function [63].
ERsr. The endoplasmic reticulum (ER) plays a major role in the synthesis, folding, and structural maturation of proteins in the cell. If cells encounter conditions during which the workload imposed on the ER protein-folding machinery exceeds its capability, ER stress (ERsr) can occur. Under ERsr, secretory proteins start to accumulate in improperly modified and unfolded forms within the organelle [64].
ATAD5. Enhanced Level of Genome Instability Gene 1 (ELG1; human ATAD5) protein levels increase in response to various types of DNA damage. Thus, quantifying this activity can be used to identify the compounds that cause genetic stress [65].

Data Source
For this modeling study, data collection and processing work were conducted on the constructed toxic database based on Tox21. First, all datasets (training and test sets) of chemicals were downloaded in the SMILES format from the PubChem database, derived from the Tox21 program. We used a keyword for the database search, namely "Tox21 summary", and selected bioassays of 59 toxicity targets, such as the NRs and SR pathways, to identify agonists/antagonists ( Table 3). The toxicity scores (PubChem activity scores) of each toxic target were tied to the PubChem Substance IDs (SIDs). Finally, 14,250 compounds were used, but compounds with no activity score were excluded.

qHTS Data Analysis
The Tox21 10k library can rank the results of qHTS and prioritize hits according to PubChem activity scores. PubChem activity scores are assigned normalized scores between 0 and 100 for each PubChem activity score ID (AID). The most active results have scores closer to 100, and inactive scores are closer to 0. According to PubChem documentation, all inactive compounds have a score of 0, active compounds have scores between 40 and 100, and inconclusive compounds have scores between 5 and 30. In this study, to implement binary classification models, the binary labels of active or inactive compounds were adopted following two definitions: (1) Under the definition of a criterion of 40, compounds with scores from 40 to 100 were defined as active and those activity scores from 0 to 39 were defined as inactive. (2) Under the definition of a criterion of 1, compounds with scores from 1 to 100 were defined as active and those with activity scores of 0 were defined as inactive. In definition (1), only the compounds concluded to be active based on the Pubchem criterion were defined as active compounds, and the other compounds were defined as inactive even if they were inconclusive compounds. On the other hand, in definition (2), only the compounds concluded to be inactive based on the Pubchem criterion were defined as inactive compounds and the other compounds were treated as active compounds even if they were inconclusive compounds. In Figure 7

qHTS Data Analysis
The Tox21 10k library can rank the results of qHTS and prioritize hits according to PubChem activity scores. PubChem activity scores are assigned normalized scores between 0 and 100 for each PubChem activity score ID (AID). The most active results have scores closer to 100, and inactive scores are closer to 0. According to PubChem documentation, all inactive compounds have a score of 0, active compounds have scores between 40 and 100, and inconclusive compounds have scores between 5 and 30. In this study, to implement binary classification models, the binary labels of active or inactive compounds were adopted following two definitions: (1) Under the definition of a criterion of 40, compounds with scores from 40 to 100 were defined as active and those activity scores from 0 to 39 were defined as inactive. (2) Under the definition of a criterion of 1, compounds with scores from 1 to 100 were defined as active and those with activity scores of 0 were defined as inactive. In definition (1), only the compounds concluded to be active based on the Pubchem criterion were defined as active compounds, and the other compounds were defined as inactive even if they were inconclusive compounds. On the other hand, in definition (2), only the compounds concluded to be inactive based on the Pubchem criterion were defined as inactive compounds and the other compounds were treated as active compounds even if they were inconclusive compounds. In Figure  7

Conformations and Descriptors
SMILES strings were cleaned and standardized (removing salts, counterions, and fragments and adjusting the protonation state (neutralize)) by RDkit, which is a Python library [66]. Optimal 3D structures were generated by following a calculation process to handle the calculation of excessive candidate compounds using an efficient and heuristic-though not strictly ideal-method. First, chemical structures were generated from the SMILES strings, and explicit hydrogen atoms were added to the chemical structures. Next, up to 200 types of 3D conformers were randomly generated. The energy minimization calculation was performed on them by the MMFF force field, and a conformer with minimal energy was adopted from 200 types of conformers. However, when this process lasted more than 60 s, instead of the above calculations, the conformer was generated using the ETKDG method [67] and the energy minimization calculation was performed on it by the MMFF force field [68]. Finally, the optimal conformer was converted into an SDF format.

Conformations and Descriptors
SMILES strings were cleaned and standardized (removing salts, counterions, and fragments and adjusting the protonation state (neutralize)) by RDkit, which is a Python library [66]. Optimal 3D structures were generated by following a calculation process to handle the calculation of excessive candidate compounds using an efficient and heuristic-though not strictly ideal-method. First, chemical structures were generated from the SMILES strings, and explicit hydrogen atoms were added to the chemical structures. Next, up to 200 types of 3D conformers were randomly generated. The energy minimization calculation was performed on them by the MMFF force field, and a conformer with minimal energy was adopted from 200 types of conformers. However, when this process lasted more than 60 s, instead of the above calculations, the conformer was generated using the ETKDG method [67] and the energy minimization calculation was performed on it by the MMFF force field [68]. Finally, the optimal conformer was converted into an SDF format. Molecular descriptors were calculated for each compound using Mordred [69,70], a Python library; 2D and 3D descriptors were obtained; and finally, 1824 descriptors were adopted for model construction.

ML Algorithm and Modeling Scheme
Classification models based on Tox21 were developed using XGBoost. This algorithm was designed to be highly scalable by adopting a sparsity-aware algorithm for sparse data and a weighted quantile sketch for approximate tree learning [22]. In this study, the modeling scheme was designed to integrate the validator, recorder, and filter to gain a single model satisfying high-predictive performance and robustness (Figure 8). Further, 10% of all compounds was assigned to the test set without the data being fed into this pipeline. The compounds fed into the pipeline included 90% of all compounds obtained by excluding the test set, and these were used for the optimization, training, and validation of the models.
Validator. In the validator, hyperparameter exploration using a grid search was performed. ML models were trained and validated according to the respective grid-generated parameter values. One-third of the data fed into this validator was assigned to the validation set as out-of-fold (OOF) and two-thirds to the training set, where the predictive performance was validated using the hold-out method. Here, when assigning validation and training sets, extreme unlike distributions between the validation and training sets could occur by chance. Therefore, three patterns of allocations of OOF were generated, ensuring that it represented 100% coverage of the input data set and without duplication. For all pairs of validation training set allocations, the models were constructed using each grid-generated hyperparameter. They evaluated the predictive performance in the validation sets according to the ROC-AUC. The hyperparameter governing the performance of the XGBoost was explored within the following predefined ranges: learning rate ("learning_rate": 100 types of values from 0.01 × 0 to 0.01 × 99).
Recorder. The recorder works as a record-keeper for the validator. The number of conditions to evaluate in the validator reached 300 patterns consisting of three OOFs and 100 hyperparameters. This recorder stored all prediction models constructed for the respective conditions, their modeling conditions, and the predictive performances in the OOFs.
Filter. The filter eliminates some overfitting cases while selects the models with the highest predictive performance from the information stored in the recorder. In the filter, based on 300 prediction performances stored in the recorder, a set of the highest predictive performing models and their modeling conditions was selected. Here, we imposed the following request to detect some overfitting cases. We excluded some hyperparameters used for model construction when the models with this hyperparameter had a high variability of the predictive performances between other OOFs in the 100% coverage validation. Therefore, even if the selected set of hyperparameters and allocation of OOFs resulted in high predictive performance, it was not adopted when the variability of performance with other OOFs at a coverage of 100% was high.
Molecular descriptors were calculated for each compound using Mordred [69,70], a Python library; 2D and 3D descriptors were obtained; and finally, 1824 descriptors were adopted for model construction.

ML Algorithm and Modeling Scheme
Classification models based on Tox21 were developed using XGBoost. This algorithm was designed to be highly scalable by adopting a sparsity-aware algorithm for sparse data and a weighted quantile sketch for approximate tree learning [22]. In this study, the modeling scheme was designed to integrate the validator, recorder, and filter to gain a single model satisfying high-predictive performance and robustness ( Figure 8). Further, 10% of all compounds was assigned to the test set without the data being fed into this pipeline. The compounds fed into the pipeline included 90% of all compounds obtained by excluding the test set, and these were used for the optimization, training, and validation of the models.
Validator. In the validator, hyperparameter exploration using a grid search was performed. ML models were trained and validated according to the respective grid-generated parameter values. One-third of the data fed into this validator was assigned to the validation set as out-of-fold (OOF) and two-thirds to the training set, where the predictive performance was validated using the holdout method. Here, when assigning validation and training sets, extreme unlike distributions between the validation and training sets could occur by chance. Therefore, three patterns of allocations of OOF were generated, ensuring that it represented 100% coverage of the input data set and without duplication. For all pairs of validation training set allocations, the models were constructed using each grid-generated hyperparameter. They evaluated the predictive performance in the validation sets according to the ROC-AUC. The hyperparameter governing the performance of the XGBoost was explored within the following predefined ranges: learning rate ("learning_rate": 100 types of values from 0.01 × 0 to 0.01 × 99).
Recorder. The recorder works as a record-keeper for the validator. The number of conditions to evaluate in the validator reached 300 patterns consisting of three OOFs and 100 hyperparameters. This recorder stored all prediction models constructed for the respective conditions, their modeling conditions, and the predictive performances in the OOFs.
Filter. The filter eliminates some overfitting cases while selects the models with the highest predictive performance from the information stored in the recorder. In the filter, based on 300 prediction performances stored in the recorder, a set of the highest predictive performing models and their modeling conditions was selected. Here, we imposed the following request to detect some overfitting cases. We excluded some hyperparameters used for model construction when the models with this hyperparameter had a high variability of the predictive performances between other OOFs in the 100% coverage validation. Therefore, even if the selected set of hyperparameters and allocation of OOFs resulted in high predictive performance, it was not adopted when the variability of performance with other OOFs at a coverage of 100% was high.  In the validator, using three types of unduplicated out-of-folds (OOFs) as the validation set, models were trained and validated with each hyperparameter. In the recorder, all prediction models, their modeling conditions, and predictive performances were stored. In the filter, high-variability cases were excluded according to 100% coverage validation, and the highest performing model was selected simultaneously.

Evaluation Metrics
The predictive performance of the classification models was evaluated based on information calculated from confusion matrices, including the number of true positives (TP; compounds correctly identified as positive), true negatives (TN; compounds correctly identified as negative), false negatives (FN; misclassified positive compounds), and false positives (FP; misclassified negative compounds). The following six evaluation indexes were used to evaluate the classification models.
(1) SE: accuracy of predicting "positive" (active) when the true outcome is positive.
(2) SP: accuracy of predicting "negative" (inactive) when the true outcome is negative.
(3) ACC: the number of correctly predicted samples divided by the total number of samples.
(4) BAC: average between SE and SP. BAC = 1 2 (SE + SP) (4) (5) MCC: used as a measure to assess the classification accuracy of the models for an unbalanced dataset [71]. To determine the optimal cutoff points in the definitions of TP, FN, TN, and FP, we maximized SE (1-SP) using the Youden index [73]. In the toxicity predictor, the cutoff value specific to each prediction model was standardized and displayed using the following formula so that the maximum, minimum, and average values were 1, 0, and 0.5, respectively.
The value x n is obtained by normalizing the directly predicted value x u using the equation. Here, c is the cutoff value of each prediction model.

Applicability Domain
The AD of the compound entered for the prediction was defined using the Euclidean distance to the five nearest molecules in the descriptor space of Tox21 compounds. The mean of the logarithmic Euclidean distances was normalized between 0 and 1 and expressed as reliability in the toxicity predictor.

Conclusions
In this study, we built prediction models of 59 MIE agonists and antagonists with information on the chemical structure and activity from the Tox21 10K library. We aimed to support regulatory toxicity decisions comprehensively and to enable users to reuse the QSTR predictions. Therefore, a web application integrating the three-dimensionalization algorithm, toxicity prediction models, and domain evaluation used in this study was developed to access to the assessment of activity against 59 MIEs. These models were valid toxicity models for alternative in silico screening and therefore could practically aid in achieving toxicity assessment.