Using Bioinformatic Approaches to Identify Pathways Targeted by Human Leukemogens

We have applied bioinformatic approaches to identify pathways common to chemical leukemogens and to determine whether leukemogens could be distinguished from non-leukemogenic carcinogens. From all known and probable carcinogens classified by IARC and NTP, we identified 35 carcinogens that were associated with leukemia risk in human studies and 16 non-leukemogenic carcinogens. Using data on gene/protein targets available in the Comparative Toxicogenomics Database (CTD) for 29 of the leukemogens and 11 of the non-leukemogenic carcinogens, we analyzed for enrichment of all 250 human biochemical pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. The top pathways targeted by the leukemogens included metabolism of xenobiotics by cytochrome P450, glutathione metabolism, neurotrophin signaling pathway, apoptosis, MAPK signaling, Toll-like receptor signaling and various cancer pathways. The 29 leukemogens formed 18 distinct clusters comprising 1 to 3 chemicals that did not correlate with known mechanism of action or with structural similarity as determined by 2D Tanimoto coefficients in the PubChem database. Unsupervised clustering and one-class support vector machines, based on the pathway data, were unable to distinguish the 29 leukemogens from 11 non-leukemogenic known and probable IARC carcinogens. However, using two-class random forests to estimate leukemogen and non-leukemogen patterns, we estimated a 76% chance of distinguishing a random leukemogen/non-leukemogen pair from each other.

biological agents, and lifestyle factors) that can increase the risk of human cancer. Interdisciplinary working groups of expert scientists review the published studies and evaluate the weight of the evidence that an agent can increase the risk of cancer. Since 1971, more than 900 agents have been evaluated, of which more than 400 have been identified as carcinogenic (Group 1), probably carcinogenic (Group 2a), or possibly carcinogenic (Group 2b) to humans [15]. The NTP prepares the Report on Carcinogens (RoC), a congressionally mandated, science-based, public health report that identifies agents, substances, mixtures, or exposures (collectively called "substances") in the environment that may increase the risk for cancer. The most recent, the 12th RoC, was released in 2011 and includes 240 listings [16]. Substances are listed in the report as either known or reasonably anticipated to be human carcinogens (equivalent to IARC Group 1 or 2).

Biological Pathways Involved in Leukemia
Many leukemia subtypes are characterized by recurrent structural and numerical chromosomal abnormalities. For example, t-AML following alkylating agent therapy exhibits abnormalities of chromosomes 5 and/or 7 and a complex karyotype [17,18], while t-AML following treatment with topoisomerase II inhibitors is characterized by balanced chromosomal translocations. Cooperation between mutations that activate signaling pathway genes (Class I mutations) and lead to increased cell proliferation, and mutations that inactivate hematopoietic transcription factors (Class II mutations) and interfere with hematopoietic differentiation, is thought to drive leukemogenesis [19,20]. At least eight different genetic pathways to therapy-related myelodysplastic syndrome (t-MDS) and t-AML, each defined by the combination of specific abnormalities present, have been proposed [21,22]. Identical abnormalities are seen in t-AML and de novo AML, albeit at different frequencies.

Biological Pathways Targeted by Leukemogens
Limited evidence regarding the mechanisms of action of known leukemogens suggests that they target common biological pathways related to leukemogenesis. Benzene, an established human leukemogen, induces many of the specific abnormalities associated with the genetic pathways proposed for t-AML and de novo AML [21,22,43]. Both benzene and formaldehyde cause leukemia-specific chromosomal changes in the peripheral blood hematopoietic progenitors of otherwise healthy exposed workers [44,45]. Benzene is thought to target critical genes and pathways in hematopoietic stem cells (HSC) and bone marrow stromal cells, through the induction of genetic, chromosomal or epigenetic abnormalities, and genomic instability [46]. Pathways and biological processes such as apoptosis, proliferation, differentiation, oxidative stress, AhR dysregulation and reduced immunosurveillance, are thought to be involved in benzene-induced leukemogenesis. We recently reported altered expression of genes in immune response, inflammatory response, oxidative phosphorylation, and the AML pathway in the peripheral blood of workers occupationally exposed to a range of benzene levels [47]. Altered expression of genes related to mitochondria, oxidative phosphorylation, oxidative stress response, ribosomes, and DNA repair, was observed several months to years before development of clinically overt disease in patients who developed t-MDS/AML following chemotherapeutic regimens for lymphoma [48].

Study Aim
We hypothesized that common biological pathways involved in hematopoiesis and leukemogenesis would be enriched in toxicogenomic data from people exposed to leukemogens, and that distinct pathways would be enriched in those exposed to subtypes of leukemogens, such as alkylating agents. Analysis of altered pathways in human toxicogenomic data has been proposed as a basis to classify carcinogens [49] and pathway analysis of such data from the CTD [50] has been used to identify chemical-disease relationships [51]. Around 250 annotated human biochemical pathways are curated in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [52][53][54]. The goals of the current study were as follows: (1) to identify common KEGG pathways targeted by human leukemogens identified from IARC Monographs and NTP's 12th RoC, through pathway analysis of genes and proteins reported in CTD; (2) to investigate whether different subtypes of leukemogens (based on their known mechanisms of action) would target distinct pathways; and, (3) to determine whether human leukemogens could be distinguished from carcinogens that were unknown to be associated with human leukemia (non-leukemogenic carcinogens) using their identified targeted pathways.

Identification of Leukemogens and Non-Leukemogenic Carcinogens
From chemicals classified as carcinogens by IARC (Group 1 and 2a) and NTP (known and reasonably anticipated), we identified known and probable carcinogens that were associated with leukemia in humans (n = 35), and a group of carcinogens that currently were not known to be associated with leukemia (n = 16), based on the conclusions of each agency. For the purposes of this study, we considered the first group as "leukemogens" and the second group as "non-leukemogenic carcinogens". The carcinogen classification status of each chemical according to both IARC and NTP is listed in Figure 1 and detailed in supplementary material, Table S1 (leukemogens) and Table S2 (non-leukemogenic carcinogens). Definitions of the disease abbreviations used in Tables S1 and S2 are provided in supplementary material, Table S3. All carcinogens identified (both leukemogens and non-leukemogenic carcinogens), regardless of CTD data availability, are listed in the boxes in Figure 1, with the box colors coordinated with the Venn diagram segments. The chemicals for which CTD data were not available are listed in gray, italicized text within each box. Of the 35 human leukemogens identified, CTD data were available for 29 (10 from both NTP and IARC, 15 from NTP only, 4 from IARC only). Of the 16 non-leukemogenic carcinogens, CTD data were available for 11; these chemicals were selected as non-leukemogenic based on data from both IARC and NTP.
Figure 1. Human leukemogens and non-leukemogenic carcinogens identified from NTP and IARC reports. The Venn diagram shows the numbers of leukemogens (n = 29) and non-leukemogenic carcinogens (n = 11) identified from IARC and NTP, for which CTD data were available. The boxes detail the IARC and NTP carcinogen classifications of all 35 leukemogens and all 16 non-leukemogenic carcinogens, sorted by agency from which they were selected. Within the boxes, the chemicals are organized by reported disease associations, then by IARC group number, then alphabetically by name. A "-" group or class indicates no report available. The 11 chemical names for which no CTD data were available are shown in gray italics.
The characteristics of each chemical, such as CAS RN, environmental/therapeutic status, associated cancers, and purported mechanism of action, are listed in Table S1(a-c) (leukemogens) and Table S2(a,b) (non-leukemogenic carcinogens). Among the 29 leukemogens for which CTD data were available, 13 are environmental or industrial chemicals and 16 are therapeutic agents. Regarding mechanism of action, eight are alkylating agents (BCNU, busulfan, CCNU, chlorambucil, cyclophosphamide, melphalan, methyl-CCNU, and thiotepa); three are adduct-forming chemicals (1,3-butadiene, cisplatin, and vinyl chloride); one induces chromosome aberrations (benzene); one has alkylating and DNA adduct-forming properties (ethylene oxide); one has DNA adduct-forming and chromosome aberration-inducing properties (formaldehyde); two are topoisomerase II inhibitors (etoposide, teniposide); three have miscellaneous effects (propylthiouracil, antithyroid agent; adriamycin, antimitotic effects; oxymetholone, synthetic androgen); and 10 have unknown mechanisms of action. Though several of the leukemogens were predominantly associated with leukemia, many were also associated with other cancers, as listed in Table S1(a-c).
Among the 11 non-leukemogenic carcinogens for which CTD data are available, eight are environmental or industrial chemicals and three are therapeutic agents. Two of the chemicals are adduct-forming (2-naphthylamine and benzidine); TCDD acts through the AhR to modify gene expression; azathioprine induces DNA damage by 6-thioguanine accumulation in DNA; and the remaining chemicals have unknown mechanisms of action. These 11 non-leukemogenic carcinogens are predominantly associated with lymphoma; lung and respiratory cancers; bladder, liver, kidney, and gastrointestinal cancers; or reproductive organ cancers, as shown in Figure 1 and detailed in supplementary Table S2(a,b).
One limitation of the current study is that we selected the leukemogens and non-leukemogenic carcinogens from data reported by two carcinogen classification agencies, the IARC and the NTP, which have different goals and procedures. In addition, the strength of the association of each chemical with leukemia varies in the studies supporting IARC and NTP classifications; thus the possibility of misclassification exists. Further, we identified leukemogens based on evidence from human data only, which may be limited for some chemicals, such as those that have been banned. We focused on human data because differences between animals and humans in the metabolism of, and response to, xenobiotic exposures have been reported. For example, the proportions of benzene metabolites produced differ among mice, rats and humans [55], and the types of leukemia induced differ between rodents and humans [16,56,57]. Finally, one of the major rat strains used in xenobiotic animal studies, the F344 rat, has a high background rate of developing a specific type of leukemia [58].

Enrichment of KEGG Pathways in Genes and Proteins Associated with Leukemogens and Non-Leukemogenic Carcinogens
We hypothesized that known leukemogens would target common pathways. To test this, we analyzed for enrichment of all 250 KEGG pathways [52][53][54] using the SEPEA algorithm [59] in human toxicogenomic data on the 29 leukemogens and 11 non-leukemogenic carcinogens, extracted from the CTD. Unsupervised clustering of these results produced two clusters (Figure 2). The probabilities of each of the 250 pathways belonging to either one of these two clusters are listed in supplementary Table S4. Cluster 0 includes 115 pathways, each targeted on average by 3 of the 29 leukemogens and 2 of the 11 non-leukemogenic carcinogens, while Cluster 1 includes 135 pathways, each targeted on average by 11 of the 29 leukemogens and 5 of the 11 non-leukemogenic carcinogens. This suggests that the pathways in Cluster 1 are the main targets of both the leukemogens and the non-leukemogenic carcinogens, contrary to the hypothesis that the pathways targeted by the two groups would separate. Nevertheless, the average percentage of non-leukemogenic carcinogens targeting the pathways in Cluster 0 (~20%) is somewhat higher than that of the leukemogens (~10%). Thus, a systematic learning approach aimed at accurately distinguishing the leukemogens from the non-leukemogenic carcinogens could gain information from the pathways in Cluster 0 (as will be seen later).
Figure 2. [Start of caption lost in extraction] [60]. Two clusters, labeled 0 and 1, were identified.
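As a rough illustration of what a per-chemical pathway enrichment test with family-wise error control looks like, the sketch below uses a plain hypergeometric over-representation test with a Bonferroni correction. This is not the SEPEA algorithm used in the study; it is a simpler stand-in, and the gene identifiers, pathway names, and universe size are invented for the example.

```python
from math import comb


def hypergeom_sf(k, universe, pathway_size, n_drawn):
    """P(overlap >= k) when n_drawn genes are sampled from `universe` genes,
    of which `pathway_size` belong to the pathway."""
    upper = min(pathway_size, n_drawn)
    total = comb(universe, n_drawn)
    tail = sum(
        comb(pathway_size, i) * comb(universe - pathway_size, n_drawn - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def enriched_pathways(chem_genes, pathways, universe, fwer=0.01):
    """Return {pathway: adjusted p} for pathways passing a Bonferroni-adjusted
    cutoff. `chem_genes` is the set of genes associated with one chemical."""
    n_tests = len(pathways)
    hits = {}
    for name, members in pathways.items():
        overlap = len(chem_genes & members)
        p = hypergeom_sf(overlap, universe, len(members), len(chem_genes))
        p_adj = min(1.0, p * n_tests)  # Bonferroni controls the FWER
        if p_adj < fwer:
            hits[name] = p_adj
    return hits


# Hypothetical toy data: integer gene IDs in a universe of 1000 genes.
toy_pathways = {
    "pathway_A": set(range(0, 20)),
    "pathway_B": set(range(100, 150)),
}
toy_chemical_genes = set(range(0, 10)) | set(range(500, 510))
toy_hits = enriched_pathways(toy_chemical_genes, toy_pathways, universe=1000)
```

In this toy case half of the chemical's genes fall in `pathway_A` and none in `pathway_B`, so only `pathway_A` passes the cutoff. Running such a test for every chemical yields the chemical-by-pathway pattern that the clustering and classification steps operate on.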
Pathway analysis of data from human studies involving exposure to leukemogens would be expected to reveal changes in pathways that would create a permissive environment for the development of leukemia, such as apoptosis, oxidative stress, immune response, and inflammation, rather than the pathways targeted by specific mutations that occur in rare hematopoietic stem or progenitor cells. The top 10 KEGG non-disease and disease pathways, ranked by the number (and percentage) of leukemogens affected at a family-wise error rate (FWER) cutoff of 0.01, are listed in Table 1. The probabilities of membership in either cluster are also listed. These pathways lie in Cluster 1 with relatively high cluster probabilities. Many of the leukemogen-associated biochemical pathways in Table 1 have been previously implicated in leukemogen exposure and/or leukemia. The targeting of the Metabolism of xenobiotics by cytochrome P450 pathway by 20 of the 29 leukemogens is not surprising, since many chemicals (e.g., benzene) are metabolized into more toxic forms by these enzymes. Involvement of the Glutathione metabolism pathway for 18 of the 29 chemicals suggests that oxidative stress, shown to be involved in AML and MDS [61], may be a common mechanism of leukemogens. Apoptosis, MAPK signaling, Toll-like receptor signaling and B and T cell receptor signaling were all identified as pathways targeted by benzene in our recent toxicogenomic study of gene expression in 125 occupationally exposed workers [47]. TP53 is the most commonly mutated gene in many cancers, and the p53 tumor suppressor protein is involved in multiple cellular processes, including transcription, DNA repair, genomic stability, senescence, cell cycle control and apoptosis. In a previous analysis of pathways underlying disease, the p53 pathway, along with ErbB and cell cycle, characterized the cancer cluster [51]. p53 mutations and alterations have been implicated in AML [22,62].
A number of disease-related pathways (mainly cancers) were also targeted by ~60% of the leukemogens in the present study (Table 1), suggesting that common mechanisms may underlie the development of cancer and leukemia. Pathways for infectious diseases such as toxoplasmosis, HTLV-1 infection, tuberculosis, and measles were also targeted, probably through modulation of immune response and myelotoxicity (supplementary material, Table S4).
While many of the pathways make sense in the context of the current understanding of leukemia development, our findings have identified additional pathways of potential interest with less well-known associations with leukemia. Neurotrophins (NTs) and their receptors play a key role in neurogenesis and survival. Thus, a link between the neurotrophin signaling pathway and leukemia is at first surprising. However, a 2009 study of cell-surface expression in leukemic blasts of 94 acute leukemia patients identified a role for NT receptors in leukemogenesis [63]. Retinol metabolism was among the top 10 pathways associated with the leukemogens. Retinol metabolism was previously found to be associated with hormonally regulated cancers in an analysis of disease pathways [51]. Retinol (vitamin A) and its biologically active metabolites are essential signaling molecules that control various developmental pathways and influence the proliferation and differentiation of a variety of cell types. The retinoid signaling pathway is often compromised in carcinomas and various tumors. Disruption of the physiological actions of retinoids through alteration of RARα, one of the retinoic acid receptors, via the PML-RARα fusion proteins, results in acute promyelocytic leukemia (APL) [64]. Interestingly, all-trans retinoic acid (ATRA) combined with anthracycline-based chemotherapy is the current standard treatment for APL and has improved the prognosis for this disease [65]. ATRA specifically targets the PML-RARα transcripts characteristic of the majority of APL patients, releases the dominant transcription repressor, and induces specific differentiation of promyelocytes.

Unsupervised Clustering of Leukemogens
We hypothesized that subtypes of leukemogens would target distinct pathways. Unsupervised clustering of the 29 leukemogens by their associated pathways produced 18 clusters, comprising 1 to 3 chemicals, as shown in Figure 3. The medoid leukemogen of each cluster, that is, the leukemogen that best represents the pathway enrichment pattern of all other leukemogens assigned to that cluster, is also shown in Figure 3, as are the cluster membership probabilities for all 29 leukemogens. The large number of clusters and small number of leukemogens per cluster suggest a diversity of mechanisms of action among the leukemogens. Interestingly, three drugs used for cancer therapy (adriamycin, cisplatin and etoposide) cluster together. Lindane, α-hexachlorocyclohexane, β-hexachlorocyclohexane and δ-hexachlorocyclohexane often occur as a mixture, and hence the NTP and IARC reports address the carcinogenicity of the mixture. The gene interactions from CTD were, however, obtained for the individual components of this mixture. Lindane and β-hexachlorocyclohexane cluster separately from α-hexachlorocyclohexane and δ-hexachlorocyclohexane. Contrary to expectation, neither the alkylating agents nor the topoisomerase II inhibitors clustered together.
We examined whether structural similarity among chemicals explained the clustering pattern. The order of chemicals suggested by the unsupervised clustering was not significantly associated with structural similarity (p value = 0.35). Further, a heat map of the reordered distance matrix of the Tanimoto coefficients also suggests that these leukemogens are structurally diverse (supplementary material, Figure S1). The mean 2D Tanimoto coefficient between a pair of leukemogens is 0.2, and it is less than 0.4 for 90% of all pairs of leukemogens (supplementary material, Figure S2). To the best of our knowledge, structural similarity among leukemogens has not been reported. A recent study showed that in vitro myelotoxicity of chemical compounds could be predicted from molecular structure using in silico computational modeling [66]. Since the chemical leukemogens in our study cluster independently of structure and major mechanism of action (alkylating agent, topoisomerase II inhibitor, etc.), it is possible that other characteristics, such as less well-known mechanisms of action, underlie the cluster patterns. While one of the best known and most well studied leukemogens, benzene, has multiple potential mechanisms of action [46], many of the leukemogens included in our study have a paucity of mechanistic data and unknown mechanisms of action.
Figure 3. [Start of caption lost in extraction] [60]. Eighteen clusters, labeled from 0 to 17, were identified. Listed on the right are the chemical names, with the medoid chemical of each cluster (the chemical whose pathway response pattern is most similar to the other chemicals in the cluster) shown in bold, and the cluster membership probabilities of each of the leukemogens.
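For readers unfamiliar with the Tanimoto coefficient, the following minimal sketch shows how it is computed from binary structural fingerprints and how the two summary statistics quoted above (mean pairwise similarity and fraction of pairs below 0.4) can be derived. The fingerprints here are invented sets of "on" bit positions, not PubChem data, and the chemical names are placeholders.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits:
    shared bits divided by bits present in either fingerprint."""
    union = bits_a | bits_b
    if not union:
        return 0.0
    return len(bits_a & bits_b) / len(union)


def pairwise_summary(fingerprints, cutoff=0.4):
    """Mean pairwise Tanimoto coefficient and fraction of pairs below cutoff."""
    scores = [
        tanimoto(a, b) for a, b in combinations(fingerprints.values(), 2)
    ]
    mean_score = sum(scores) / len(scores)
    frac_below = sum(s < cutoff for s in scores) / len(scores)
    return mean_score, frac_below


# Hypothetical fingerprints (bit positions are made up for illustration).
toy_fingerprints = {
    "chem_a": {1, 2, 3},
    "chem_b": {2, 3, 4},
    "chem_c": {7, 8},
}
toy_mean, toy_frac_below = pairwise_summary(toy_fingerprints)
```

A coefficient of 1 means identical fingerprints and 0 means no shared features, so a mean of 0.2 across pairs indicates little structural overlap.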

Distinguishing Leukemogens and Non-Leukemogenic Carcinogens
We sought to determine whether pathway analysis could distinguish the leukemogens from the non-leukemogenic carcinogens. Such an approach could be applied to screen chemicals for leukemogenic potential or to predict individuals at increased risk of developing leukemia regardless of exposure status (e.g., exposome study [67]). We applied three bioinformatic methods to determine whether leukemogens could be distinguished from non-leukemogenic carcinogens: unsupervised clustering; one-class support vector machines (SVM); and two-class random forests.

Unsupervised Clustering
Unsupervised clustering analysis of the 29 leukemogens and 11 non-leukemogenic carcinogens revealed seven clusters, with one cluster containing the majority (30) of the 40 chemicals (Figure 4). The 11 non-leukemogenic carcinogens were not distinct, and only five of them (benzidine, cyclosporine, TCDD, phenacetin and 4-aminobiphenyl) separated out. These clusters, along with the chemical-specific cluster probabilities (supplementary material, Table S5), indicate that the set of leukemogens cannot be separated from the set of non-leukemogenic carcinogens in an unsupervised manner any better than by a fair coin toss. Even though the identified clusters of leukemogens in Figures 2 and 3 are different, their relative order remains more or less the same. Interestingly, trichloroethylene (TCE, an IARC Group 2a carcinogen) and tetrachloroethylene (Perc, a non-leukemogenic carcinogen), to which co-exposure frequently occurs, were subclustered together. Epidemiological studies have shown TCE exposure to be associated with kidney and liver cancers as well as NHL, with weak evidence for leukemia [68,69].
In subjects exposed to Perc in drinking water, an elevated relative risk of leukemia was observed among ever-exposed subjects that increased further among subjects whose exposure level was over the 90th percentile [70], but the plausibility of this finding has been questioned [71]. It is possible that the lack of association with human leukemia is due to the limited number of exposure case studies for Perc alone. Carcinogenicity studies of Perc in several rat strains showed moderate, not clearly dose-related, increases in mononuclear cell leukemia (MCL), only in F344 rats [72]. However, latency was not decreased by exposure and the incidence in the treated groups was within the overall control range. Because the F344 rat strain is highly predisposed to developing MCL, the results were not considered predictive for human cancer risk [72]. In a series of studies conducted by the National Toxicology Program (NTP), Perc was one of five chemicals for which leukemia (used collectively to summarize multiple neoplasms including MCL) was the only neoplastic change in both male and female rats [58].

One-Class Support Vector Machines
The second attempt to distinguish the leukemogens from the non-leukemogenic carcinogens used a one-class support vector machine (SVM) approach to learn the pathway enrichment pattern of the leukemogen class of chemicals. The probability that a given chemical is identified as a leukemogen is estimated via a bootstrapping procedure involving five-fold cross validation, i.e., the pathway enrichment pattern of 80% of the leukemogens is used to predict that of the remaining 20%. The fraction of leukemogens identified correctly and the fraction of non-leukemogens incorrectly identified as leukemogens, across all 1,000 bootstrap steps, are both around 50%. This again suggests that one-class SVM is no better than a coin toss for our purpose.

Two-Class Random Forests
Our third approach to classification of leukemogens and non-leukemogenic carcinogens involved the use of random forests [73]. This analysis differs from the previous two methods in that the pathway enrichment patterns for both the leukemogen and the non-leukemogen class are learned. One-class SVM involved learning only the leukemogen class patterns, while the clustering process did not involve any learning. In the two-class random forest approach, the 95% confidence interval of the area-under-the-curve (AUC) of the true positive rate (fraction of correctly identified leukemogens) versus the false positive rate (fraction of non-leukemogens incorrectly identified as leukemogens) was 0.76 ± 0.07. This implies that, given a random leukemogen and non-leukemogen pair, the random forest based classifier has a 76% chance of correctly distinguishing one from the other. The probability that a given chemical is identified as a leukemogen, at a false positive rate of around 50% (the same as that reported for the results of one-class SVM), is estimated using information across the 1,000 bootstrap steps. These probabilities are to be interpreted in the context of the pathway enrichments of the chosen leukemogens and non-leukemogenic chemicals. Thus, the relatively high probability values of the false positives among the non-leukemogenic chemicals (TCDD, phenacetin, tamoxifen, diethylstilbestrol, tetrachloroethylene and azathioprine) mean that their pathway enrichment patterns are more similar to those of the majority of leukemogens. This could reflect either the inadequacy of using pathways as features to distinguish between the two classes, or that some of these identified false positives may actually cause leukemia. Similarly, the false negatives characterized by relatively low probability values among the leukemogens (cyclophosphamide, etoposide, BCNU, benzene and trichloroethylene) may represent atypical leukemogens.
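The pairwise reading of the AUC given above follows from the fact that the area under the ROC curve equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative (ties counted as one half). The sketch below computes the AUC directly from that definition; the classifier scores are invented for illustration and are not the study's results.

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs in which the positive
    is scored higher; ties contribute one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical predicted probabilities of being a leukemogen.
toy_leukemogen_scores = [0.9, 0.8, 0.4]
toy_non_leukemogen_scores = [0.7, 0.3]
toy_auc = pairwise_auc(toy_leukemogen_scores, toy_non_leukemogen_scores)
```

Here five of the six leukemogen/non-leukemogen pairs are ordered correctly, so the toy AUC is 5/6; an AUC of 0.5 would correspond to the coin-toss performance reported for the other two methods.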
The top KEGG biochemical pathways driving the two-class classification, based on the largest mean decreases in Gini index, are given in Table 2. The larger the importance score of a pathway, the better its ability to separate the class of leukemogens from the class of non-leukemogenic carcinogens. The numbers (and percentages) of leukemogens and non-leukemogenic carcinogens affected (FWER cutoff of 0.01) are provided, as well as the probabilities that each of these pathways belongs to one of the two clusters of pathways identified in the supplementary material, Table S4.
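To make the importance score concrete, the sketch below computes the decrease in Gini impurity produced by a single split on one binary pathway feature (enriched or not enriched). A random forest averages such decreases over many nodes and trees, so the value from one split is not comparable to the scores in Table 2; the labels and features are toy values chosen to show the two extremes.

```python
def gini(labels):
    """Gini impurity of a list of binary class labels (1 = leukemogen)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def gini_decrease(feature, labels):
    """Impurity of the parent node minus the size-weighted impurity of the two
    child nodes obtained by splitting on a binary feature."""
    left = [y for x, y in zip(feature, labels) if x]
    right = [y for x, y in zip(feature, labels) if not x]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted


# Toy example: two leukemogens (1) and two non-leukemogens (0).
toy_labels = [1, 1, 0, 0]
perfect_pathway = [1, 1, 0, 0]        # enriched only in the leukemogens
uninformative_pathway = [1, 0, 1, 0]  # enriched equally in both classes
perfect_gain = gini_decrease(perfect_pathway, toy_labels)
uninformative_gain = gini_decrease(uninformative_pathway, toy_labels)
```

A pathway that splits the classes cleanly removes all impurity, while one targeted equally often in both classes removes none, which is why pathways with very different targeting rates in the two groups rank highest.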
Compared with the pathways identified in Table 1 (leukemogens only), the pathways in Table 2 (both leukemia-positive and -negative carcinogens) in general have a relatively larger probability of being in Cluster 0 and affect a larger fraction of the non-leukemogens than the leukemogens. This suggests that the differentiation of the leukemogens from the non-leukemogenic carcinogens is driven by pathways impacted by the non-leukemogenic carcinogens. Caffeine metabolism (mean decrease in Gini index = 0.36) was the top pathway supporting the distinction between leukemogens and non-leukemogenic carcinogens, being targeted by 73% of the non-leukemogens compared with only 10% of the leukemogens. Possible inverse associations between caffeine intake and breast, liver, and colon cancer, as well as cancer of the ovary, have been reported [74]. Opposing effects of caffeine and/or coffee on ovarian cancer risk in postmenopausal (inverse association) [75] and premenopausal (positive association) [75,76] women have been reported, suggesting that caffeine may be protective in a low-hormone environment. Two SNPs in the caffeine metabolizing enzyme, CYP19, were associated with ovarian cancer risk (one positively and one inversely) [77]. A common A to C polymorphism at position −163 in the CYP1A2 gene, which results in slower metabolism of caffeine [78,79], was shown to be protective against the risk of postmenopausal breast cancer [80]. Cigarette smoking accelerates caffeine metabolism, which is mediated primarily via CYP1A2 [81]. CYP1A2 activity was also shown to be increased with increased broccoli intake and exercise [82]. A role for caffeine metabolism in hormonally regulated cancers may be what drives the distinction between leukemogens and non-leukemogenic carcinogens, but this requires further investigation.
Arachidonic acid metabolism was the second pathway supporting the distinction between leukemogens and non-leukemogenic carcinogens (Table 2). The first two pathways of arachidonic acid metabolism are controlled by the enzyme families cyclooxygenase (COX) and lipoxygenase (LOX). These pathways produce prostaglandins and leukotrienes, respectively, potent mediators of inflammation [83], and both pathways have been implicated in cancer [84]. Eicosanoids may represent a missing link between inflammation and cancer [85]. In our study of human occupational benzene exposure, prostaglandin-endoperoxide synthase 2 (PTGS2 or COX2) was one of the most significant genes to be upregulated across all four doses relative to unexposed controls [47]. PTGS2 was central to a network of inflammatory response genes impacted by benzene. The distinct roles of inflammation and the arachidonic acid metabolism pathway, as well as the ribosome, retinol metabolism, and metabolism of xenobiotics by cytochrome P450 pathways, in response to leukemogens and in leukemia and other cancers, need to be further investigated.

Challenges in Discriminating Leukemogens and Non-Leukemogenic Carcinogens
The analyses reported in Gohlke et al. [51] demonstrated that it is possible to predict chemical associations with different diseases using the pathway enrichment patterns. They also showed that diseases belonging to different classes (cancer, immune, metabolic, neuropsychiatric) can be clustered separately in an unsupervised manner. Here, we took this approach one step further by asking whether the leukemia-positive chemicals can be separated from the other known carcinogens. While two-class random forests appeared to be able to distinguish leukemia-positive and -negative carcinogens, there are some caveats to these classification approaches generally. The overlap among cancer and leukemogen pathways makes the identification of common and distinct pathways among the 250 known KEGG pathways challenging. As detailed in Table S1(a-c), many of the leukemogens are associated with one or more cancers as well as leukemia. This limits the power of the discrimination analysis making it difficult to differentiate the carcinogenic and leukemogenic effects of the leukemogens. Heterogeneity in cancer types associated with the non-leukemogenic carcinogens, in leukemia subtypes, and in the mechanisms of action of leukemogens, and associated pathways, adds an additional layer of complexity. One caveat of the two-class approach is that it assumes that the non-leukemogenic carcinogens form a class. However, the group of 11 chemicals selected in the current study is heterogeneous with respect to associated cancer types and it is unclear how well the data from the 11 non-leukemogenic carcinogens analyzed in our study could be extrapolated to other sets of non-leukemogenic carcinogens. It is also unclear how well the 29 leukemia-positive carcinogens represent the full spectrum of potential leukemia pathways.
If our methodology were to be used for the purposes of risk assessment, the results suggest a hierarchical approach to identifying a particular carcinogenicity hazard, with leukemogens identified after the chemicals have been screened for other cancer types. Our study examined leukemogen pathways compared with those of non-leukemogenic carcinogens; it would be of interest to compare pathways induced by leukemogens and by non-cancer disease-causing chemicals. In a study examining pathways associated with various diseases, cytochrome P450 metabolism, retinol metabolism, JAK-STAT signaling, Toll-like receptor signaling, and adipocytokine signaling were identified as five critical pathways potentially important to disease progression from both a genetic and environmental standpoint [51]. In particular, cytochrome P450 metabolism was associated with cancers, cardiovascular disease and immune-related disorders, while retinol metabolism was associated with hormonally regulated cancers.

Comparison of Pathway Enrichment in CTD and in Data from a Single, Well-Designed, Toxico-Genomic Study
The CTD [86] is based on the curation of chemical-gene/protein interactions reported in the literature. Some chemicals and some genes are better studied than others. Thus, there is likely to be an inherent bias in the data used for the chemical-wise pathway enrichments, which cannot be overcome by the analyses used in the current study. In addition, even though we analyzed only human CTD data, these data were generated from various types of human cells (peripheral blood of exposed subjects, various primary and cancer cell lines), under in vivo or in vitro conditions, and across different exposure durations and doses of the chemical. In general, the conclusions of the underlying studies are based on different significance thresholds. Furthermore, conclusions from studies aimed at understanding the role of a given gene in the response to a given chemical are given the same weight as those from studies aimed at understanding the responses of a larger set of genes. The use of different microarray platforms or other methodologies to measure target genes/proteins could also influence experimental results. Given these variables, we felt it was important to assess, for a given chemical, how well the pathway analyses based on CTD data correlate with those based on data from a well-designed human toxicogenomic study. Recently, we generated transcriptomic data from the peripheral blood mononuclear cells of 125 workers exposed to a range of benzene levels in an occupational setting, in which we found ~3,000 differentially expressed genes [47]. We conducted pathway enrichment analyses using statistics on whether a gene was differentially expressed in at least one of the four considered dose ranges. We compared these results to those obtained using benzene-associated gene interactions from CTD. The Spearman correlation between the significance of individual pathway enrichments obtained using the two data sets was moderate (0.45) but significant (p value < 0.05). The scatter plot of the ranks of the pathways based on their enrichment p values is shown in supplementary material, Figure S3. Our findings suggest that, despite the limitations of CTD data, pathway analysis of CTD data is an informative approach.
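The rank-based comparison described above can be illustrated with a minimal sketch. It computes a Spearman correlation between two lists of pathway enrichment p values by ranking each list (ties receive the average rank) and then taking the Pearson correlation of the ranks. The function names and the p values are hypothetical and are not taken from the study; the sketch does not reproduce the significance test of the correlation.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical enrichment p values for the same five pathways in two data sets.
ctd_p = [0.001, 0.20, 0.03, 0.50, 0.04]
study_p = [0.010, 0.30, 0.25, 0.40, 0.02]
rho = spearman(ctd_p, study_p)
```

Working on ranks rather than raw p values makes the comparison insensitive to differences in scale between the two data sets, which is why a rank correlation suits data curated from heterogeneous studies.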

Identification of Human Leukemogens and Non-Leukemogenic Carcinogens
From chemicals classified as carcinogens by IARC (Group 1 and 2a) and NTP (Known and Reasonably Anticipated), we identified known and probable carcinogens that were associated with leukemia in humans (n = 35), and a group of carcinogens that were not currently known to be associated with leukemia (n = 16), based on the conclusions of each agency. Leukemia-positive and non-leukemogenic carcinogens were identified from all 100 volumes of IARC's Monograph series on the evaluation of carcinogenic risks to humans, titled A review of human carcinogens [15], and from NTP's 12th Report on Carcinogens (RoC) [16]. Leukemia-positive compounds were selected in three steps: (1) IARC (Group 1 and Group 2a) and NTP ("known" or "reasonably anticipated") carcinogens; (2) single chemical carcinogens with unique Chemical Abstracts Service Registry Numbers (CAS RN); and (3) carcinogens that are associated with leukemia risk in humans. The specific type of exposure for each leukemia-positive carcinogen was determined (e.g., environmental or therapeutic). Statistical significance (p values) consistently supporting the association of each chemical with cancer across multiple exposure case studies was provided in the IARC Monographs, and chemicals with p < 0.05 were included in the final list (statistical data not shown). As the NTP reports do not provide statistical significance to describe the strength of association between chemicals and cancer, chemicals that were concluded by NTP to be associated with human leukemia were included without supporting statistical data. Non-leukemogenic carcinogens were those not currently known to be associated with leukemia, based on the conclusions of IARC and NTP.

Analysis of Enrichment of KEGG Biochemical Pathways in CTD Data
The CTD [50] contains curated chemical-gene interactions. Human CTD data were available for 29 of the 35 leukemogens and 11 of the 16 non-leukemogenic carcinogens selected above. These data were retrieved on 10 March 2012 and were used in a pathway enrichment method called SEPEA [51,59]. All 250 human biochemical pathways in the KEGG pathway database [52,53,54], as accessed on 23 February 2012, were analyzed. SEPEA differs from other pathway enrichment methods in that it takes the network structure of the pathways into account: pathways in which the perturbed genes (as a result of treatment) are relatively close to each other in a graph/network sense are assigned more significance. The significance of a given pathway being enriched with the targets of a given chemical is reported as a p value. The number of chemicals affecting a given pathway is reported at a family-wise error rate (FWER) of 0.01, corresponding to a significance threshold of 0.01/40 for the 40 chemicals analyzed.
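As a minimal sketch of the thresholding step above, the snippet below divides the FWER of 0.01 by the number of chemicals (40) and counts, for each pathway, how many chemicals fall below that threshold. It does not implement SEPEA; the enrichment p values are assumed to be already available, and the function name, the data layout and the strict "<" comparison are assumptions made for illustration.

```python
def chemicals_per_pathway(p_values, fwer=0.01, n_tests=None):
    """Count chemicals whose enrichment p value passes a Bonferroni-style cut.

    p_values: dict mapping chemical -> dict mapping pathway -> p value.
    n_tests:  divisor for the FWER; defaults to the number of chemicals.
    """
    if n_tests is None:
        n_tests = len(p_values)
    threshold = fwer / n_tests
    counts = {}
    for per_pathway in p_values.values():
        for pathway, p in per_pathway.items():
            counts.setdefault(pathway, 0)
            if p < threshold:  # strict inequality is an assumption
                counts[pathway] += 1
    return threshold, counts

# Hypothetical p values for two chemicals and two pathways.
toy = {
    "chemical_A": {"apoptosis": 1e-6, "MAPK signaling": 0.2},
    "chemical_B": {"apoptosis": 1e-3, "MAPK signaling": 1e-5},
}
threshold, counts = chemicals_per_pathway(toy, fwer=0.01, n_tests=40)
```

With 40 chemicals the threshold is 0.00025, so an enrichment with p = 0.001 would not be counted even though it passes a conventional 0.05 cut.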

Structural Similarity between Chemicals
Structural similarity between each pair of leukemogens was determined using the 2D Tanimoto coefficient (the ratio of the number of two-dimensional structural features shared by the pair of leukemogens to the total number of structural features defined for the pair), obtained from the PubChem database [87] on 25 April 2012. The Tanimoto coefficient is widely applied to rank structural similarity and is regarded as conveying less molecular-size bias than other methods [88,89]. The CAS RN of each chemical was used for the queries. Let $c_1, c_2, \ldots, c_n$ denote a particular ordered sequence of chemicals, $O$. Let $T_{i,j}$ denote the 2D Tanimoto coefficient between chemical $c_i$ and chemical $c_j$. The statistic used to identify the significance of the given order of chemicals, $S_O$, is defined in Equation (1) as a function of the coefficients $T_{i,j}$ taken in the order $O$. The significance of the particular order, $O$, is then estimated using a permutation test: 1,000 random permutations of the given chemicals are used to compute 1,000 values, $S_1, S_2, \ldots, S_{1000}$, using Equation (1). The p value is then estimated by

$$p = \frac{1}{1000} \sum_{k=1}^{1000} I\left(S_k \geq S_O\right),$$

where $I(\text{true}) = 1$ and $I(\text{false}) = 0$.

HOPACH [60,90], as implemented in the R [91] package hopach [92], was used to identify the clusters. Cosine angle was used as the distance metric. Median split silhouette was the criterion used to define the clusters and to choose the medoids (representative chemicals or pathways) for each of the clusters. The reordered distance matrix (between the chemicals or the pathways), with the identified clusters, was plotted as an image using the dplot function. The probability that a given chemical or pathway belongs to a particular cluster (as identified by the medoid chemical or pathway) was estimated by a bootstrapping procedure, as encoded in the boothopach function.
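The structural-similarity permutation test described above can be sketched as follows. The Tanimoto function follows the definition given in the text (shared features divided by all features defined for the pair). The ordering statistic is an assumption: the sketch uses the sum of coefficients between neighbouring chemicals as an illustrative stand-in and should not be read as the study's Equation (1). The feature sets, function names, random seed and the "greater than or equal" comparison are likewise hypothetical.

```python
import random

def tanimoto(features_a, features_b):
    """Shared features divided by all features defined for the pair."""
    union = features_a | features_b
    if not union:
        return 0.0
    return len(features_a & features_b) / len(union)

def adjacent_similarity(order, sim):
    """Assumed stand-in statistic: sum of similarities between neighbours."""
    return sum(sim[a][b] for a, b in zip(order, order[1:]))

def order_p_value(order, sim, n_perm=1000, seed=0):
    """Fraction of random orders whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = adjacent_similarity(order, sim)
    hits = 0
    for _ in range(n_perm):
        shuffled = list(order)
        rng.shuffle(shuffled)
        if adjacent_similarity(shuffled, sim) >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical structural feature sets for four chemicals.
features = {
    "c1": {1, 2, 3},
    "c2": {2, 3, 4},
    "c3": {7, 8},
    "c4": {7, 8, 9},
}
sim = {a: {b: tanimoto(fa, fb) for b, fb in features.items()}
       for a, fa in features.items()}
p = order_p_value(["c1", "c2", "c3", "c4"], sim)
```

A small p value under this stand-in statistic would indicate that the given order places structurally similar chemicals next to each other more often than a random order does.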

One-Class Classification of Chemicals
One-class support vector machines (SVMs) [93] were used to identify the pathway enrichment pattern defining the leukemogen class of chemicals from the chemical-by-pathway enrichment matrix. The svm function in the e1071 package [94] in R [91] was used. The radial kernel was used with the ν parameter (a proxy for the error rate in the training samples) set to 0.05. The 29 leukemogens were arbitrarily divided into five groups: the first group comprised the first six chemicals in supplementary material, Table S6, the second comprised the next six, and so on. These groups were used in a five-fold cross-validation procedure. The pathway enrichment pattern was learnt using data from the chemicals in four of the five groups at a time. The predict method for svm objects (predict.svm) was then used to predict whether the chemicals in the remaining fifth group and the 11 identified non-leukemogens are leukemogens (1) or not (0). The sampling distribution of these 0 or 1 predictions was estimated by a bootstrapping procedure similar to the one used to estimate the chemical-specific cluster probabilities in the previous section. During the ith of 1,000 bootstrap steps, pathways were sampled with replacement from the set of all pathways. The corresponding columns of the original enrichment matrix were used to define the ith bootstrapped data matrix, which was used to estimate the ith bootstrap one-class SVM predictions for all the chemicals.
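The resampling scheme around the classifier can be sketched without the SVM itself. The snippet below splits an ordered list of chemicals into consecutive groups (six, six, six, six and five for 29 chemicals), enumerates the train and held-out groups for five-fold cross-validation, and draws one bootstrap sample of pathway columns with replacement. The classifier is deliberately left out, and all function names and the toy matrix are hypothetical.

```python
import random

def consecutive_folds(items, n_folds=5):
    """Split items, in order, into n_folds groups of near-equal size."""
    size = -(-len(items) // n_folds)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def train_test_splits(folds):
    """Yield (training items, held-out items), holding out each fold once."""
    for k, held_out in enumerate(folds):
        train = [x for i, fold in enumerate(folds) if i != k for x in fold]
        yield train, held_out

def bootstrap_columns(matrix, rng):
    """Resample columns (pathways) with replacement, keeping all rows."""
    n_cols = len(matrix[0])
    picked = [rng.randrange(n_cols) for _ in range(n_cols)]
    resampled = [[row[j] for j in picked] for row in matrix]
    return resampled, picked

chemicals = [f"chem_{i}" for i in range(1, 30)]  # 29 hypothetical names
folds = consecutive_folds(chemicals, n_folds=5)

# Toy enrichment matrix: 2 chemicals (rows) by 3 pathways (columns).
toy_matrix = [[0.1, 0.5, 0.9], [0.2, 0.6, 0.8]]
rng = random.Random(42)
boot_matrix, picked = bootstrap_columns(toy_matrix, rng)
```

In a full analysis, each bootstrapped matrix would be passed through the cross-validation loop, and the 0 or 1 predictions collected over the 1,000 bootstrap steps would form the sampling distribution for each chemical.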

Two-Class Classification of Chemicals
Random forests [73] were used to identify the pathway enrichment pattern that separates the chemicals in the leukemogen class from those in the non-leukemogenic class, using the chemical-by-pathway enrichment matrix. The CV.SuperLearner function in the SuperLearner package [95] was used with three-fold cross-validation (the percentage of leukemogens and non-leukemogenic chemicals was chosen to be approximately the same in each fold) to estimate the leukemogen class predictions for all 40 chemicals. The CV.SuperLearner function used the randomForest package [96]. The sampling distribution of these predictions was estimated using the 1,000 random bootstrapped data matrices, generated as described in the previous section. The importance of each pathway in reducing the error of differentiating the leukemogens from the non-leukemogens was obtained from the randomForest function (using the entire data matrix) as the corresponding mean decrease in Gini index. For each set of predictions from the 1,000 bootstrap steps, the area under the curve (AUC) of the true-positive rate (fraction of leukemogens correctly identified) versus the false-positive rate (fraction of non-leukemogens incorrectly identified) was estimated using the ROCR package [97].
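The AUC described above has a direct probabilistic reading: it equals the probability that a randomly chosen leukemogen receives a higher predicted score than a randomly chosen non-leukemogen, with ties counted as one half. The sketch below computes the AUC in that pairwise form rather than by tracing the ROC curve as the ROCR package does; the function name and the scores are hypothetical.

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute one half, which matches the area under the ROC curve.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical predicted leukemogen-class scores.
leukemogen_scores = [0.9, 0.4]
non_leukemogen_scores = [0.5, 0.1]
auc = pairwise_auc(leukemogen_scores, non_leukemogen_scores)
```

In this toy case three of the four leukemogen/non-leukemogen pairs are ordered correctly, so the AUC is 0.75. Repeating the calculation over bootstrap predictions would give a distribution of AUC values rather than a single estimate.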

Conclusions
We have identified common pathways targeted by single-chemical human leukemogens as well as pathways that could distinguish leukemogens from non-leukemogenic carcinogens. The pathway data contained sufficient information to enable a reasonable separation of the leukemogens from the non-leukemogenic chemicals using a two-class classification method. As the CTD becomes populated with additional toxicogenomic datasets, our current bioinformatic approach will become more informative and discriminating, with potential applicability to the next generation of risk assessment of exposure to toxic chemicals.