Assessment of Drugs Toxicity and Associated Biomarker Genes Using Hierarchical Clustering

Background and objectives: Assessment of drugs toxicity and associated biomarker genes is one of the most important tasks in the pre-clinical phase of drug development pipeline as well as in toxicogenomic studies. There are few statistical methods for the assessment of doses of drugs (DDs) toxicity and their associated biomarker genes. However, these methods consume more time for computation of the model parameters using the EM (expectation-maximization) based iterative approaches. To overcome this problem, in this paper, an attempt is made to propose an alternative approach based on hierarchical clustering (HC) for the same purpose. Methods and materials: There are several types of HC approaches whose performance depends on different similarity/distance measures. Therefore, we explored suitable combinations of distance measures and HC methods based on Japanese Toxicogenomics Project (TGP) datasets for better clustering/co-clustering between DDs and genes as well as to detect toxic DDs and their associated biomarker genes. Results: We observed that Word’s HC method with each of Euclidean, Manhattan, and Minkowski distance measures produces better clustering/co-clustering results. For an example, in the case of the glutathione metabolism pathway (GMP) dataset LOC100359539/Rrm2, Gpx6, RGD1562107, Gstm4, Gstm3, G6pd, Gsta5, Gclc, Mgst2, Gsr, Gpx2, Gclm, Gstp1, LOC100912604/Srm, Gstm4, Odc1, Gsr, Gss are the biomarker genes and Acetaminophen_Middle, Acetaminophen_High, Methapyrilene_High, Nitrofurazone_High, Nitrofurazone_Middle, Isoniazid_Middle, Isoniazid_High are their regulatory (associated) DDs explored by our proposed co-clustering algorithm based on the distance and HC method combination Euclidean: Word. Similarly, for the peroxisome proliferator-activated receptor signaling pathway (PPAR-SP) dataset Cpt1a, Cyp8b1, Cyp4a3, Ehhadh, Plin5, Plin2, Fabp3, Me1, Fabp5, LOC100910385, Cpt2, Acaa1a, Cyp4a1, LOC100365047, Cpt1a, LOC100365047, Angptl4, Aqp7, Cpt1c, Cpt1b, Me1 are the biomarker genes and Aspirin_Low, Aspirin_Middle, Aspirin_High, Benzbromarone_Middle, Benzbromarone_High, Clofibrate_Middle, Clofibrate_High, WY14643_Low, WY14643_High, WY14643_Middle, Gemfibrozil_Middle, Gemfibrozil_High are their regulatory DDs. Conclusions: Overall, the methods proposed in this article, co-cluster the genes and DDs as well as detect biomarker genes and their regulatory DDs simultaneously consuming less time compared to other mentioned methods. The results produced by the proposed methods have been validated by the available literature and functional annotation.


Introduction
Assessment of groups of similar toxic doses of drugs (DDs) and their regulatory biomarker genes is the most important objective of toxicity investigation in the pre-clinical phase of drug development process as well as in toxicogenomic studies. Biomarker genes are a set of genes that are differentially expressed in the treatment group of animal compared to the control group. This set of genes is also efficient to differentiate the toxic DDs from the non-toxic DDs. Biomarker genes and their regulatory DDs can be assessed by toxicogenomic study which emerges from toxicology. Toxicology is a field of science which studies the adverse effects of chemicals and environmental exposures in living organisms [1]. The prime objective of this study is the empirical and contextual characterization of adverse effects of chemicals/drugs from tissue, the cell, and the intracellular molecular systems of organisms. Presently, the rapid accumulation of omics (genomics, transcriptomics, proteomics, metabolomics) data, development of sophisticated statistical tools and gene and protein annotation techniques have capitalized the application of gene expression analysis to understand the toxicity mechanism of drugs or chemical compounds and environmental stressors on biological systems. The development of these technologies leads to the development of the new field "toxicogenomics" from toxicology targeting to study the response of the whole genome to DDs or environmental stressors [2][3][4][5][6][7]. The adverse effects of the toxicants in an organism cause pathological changes in certain organs which can be detected by changes in the expression of genes, protein synthesis, and metabolism. Among these, the gene expression or abundant of mRNA is the most sensitive measure of these changes. Thus, toxicogenomics, which enables us to comprehensively analyze gene expression changes caused by an external stimulus in a specific organ, is considered to be one of the most powerful strategies [8,9]. But the toxicogenomic experiment produces a gigantic size of gene expression data. Analysis of this gigantic size of data is very complex and sometimes produces non-robust results for knowledge discovery about biomarker genes and toxicants. Therefore, pathway or molecular network-based gene expression data analysis increases the predictive power and produces more stable biomarkers [10][11][12][13][14].
On the other hand, toxicogenomic data analysis as well as knowledge discovery about the biomarkers and the toxicity of the DDs and environmental stressors often becomes tardy due to the following reasons. (1) Improper selection of statistical/computational tools. (2) Traditional ways of interpretation on the results of computational tools which do not cover the objectives of the study. For example, t-test and Mann-Whitney U test [12,15], and ANOVA [16,17] have been used to detect toxicogenomic biomarker genes. However, none of these methods can assess the similar toxic DDs and their associated biomarker genes which is one of the important objectives of toxicity investigation of drug candidates in the pre-clinical phase of the drug development pipeline. The limitation of the above mention methods can be overcome by using hidden variable models [14,18,19]. The hidden variable models are capable to detect toxic DDs and their regulatory biomarker genes by co-clustering DDs and genes. Nevertheless, since hidden variable models are EM (expectation-maximization) [20] based iterative method, these methods require comparatively more time to compute the model parameters. Therefore, to overcome this problem, in this paper, we propose an alternative algorithm based on hierarchical clustering (HC) for co-clustering DDs and genes as well as to discover toxic DDs and their associated biomarker genes. The term cluster analysis refers to the process of assigning data to different groups (clusters) according to their similarity. This approach provides an intuitive method for interpreting complex data such as microarray, transcriptomic, and epigenomic data. There are several types of HC (ward, single, complete, average, mcquitty, median, centroid) approaches whose performance depends on different similarity/distance (euclidean, maximum, manhattan, canberra, minkowski) measures. Every combination of distance and HC methods do not perform equally in grouping objects for all types of datasets. Even the performance of some of these combinations is very poor in some specific fields of study. In the literature, any suitable combination of distance and HC method is not suggested yet for clustering/co-clustering of toxicogenomic data.Hence, in this paper, we explore suitable combinations of distance measures and HC methods based on known Japanese Toxicogenomics Project (TGP) datasets for better clustering/co-clustering between DDs and genes as well as to detect toxic DDs and their associated biomarker genes.

Data Processing
To investigate toxicity of drugs, mRNA abundance in the liver of Rattus Norvegicus is measured administering multiple dose levels and time points. A well-designed experiment set to measure gene expression is measured from the treatment group samples where the treatments are the underlying conditions (DDs with time combinations). There are also control samples concurrently to the treatment group samples. The fold change gene expression (FCGE) y pqrt for the p th (p = 1, 2, · · · P) drug, q th (q = 1, 2, 3) dose level, t th (t = 1, 2, · · · , T) time point, and r th (r = 1, 2, 3) animal sample can be computed from the gene expression of the treatment and control group of samples using the equation: For single time point this equation can be written as In the Equation (1) x pqtr is the expression of a gene under the treatment group of animal and x pqtr is the expression of that gene under the control group of animal when the expression is measured at multiple points of time. Similarly, in Equation (2) x pqr and x pqr are the expression of a gene for the treatment and control group of animal, respectively when expression is measured at single time point. The average FCGE value over the animal samples of a gene are Y pqt. and Y pq. respectively for multiple and single time point. From these average FCGE values the effect of DDs over the genes can be measured. The values will be positive for upregulated genes and negative for downregulated genes. The datasets of the average FCGE value are the input of our analysis.

Hierarchical Clustering (HC) Algorithms and Distance Measures
The clustering task is solved by the application of various methods depending on the data. Each of these approaches will have peculiarities and the determination of what is the correct or what determines accurate clustering is not easily defined. Hierarchical clustering can proceed using various linkage/clustering and distance methods. The distance method determines how the distance between two observations is calculated. The linkage/clustering method is used when deciding the distance for observations that have already been merged together. Commonly used distance methods are shown in Table 1. In the analysis of biological data, the most commonly used clustering methods are of two types: Hierarchical and non-hierarchical (also known as partitioning). The hierarchical clustering approach builds clusters by repeatedly joining and merging the objects separated by the shortest distance. Following merging of the closest two points the distance matrix is updated and the process repeated until all objects are joined. In this article we have considered five distance methods (euclidean, maximum, manhattan, canberra, minkowski) and seven HC clustering methods (single, complete, average, ward, mcquitty, median, and centroid). We compare all the combinations of distance and HC methods for selecting more suitable combinations for clustering genes or DDs of toxicogenomic data. The description of these HC algorithms is as follows: Table 1. Important distance measures used in hierarchical clustering.

Distance Measure Mathematical Form
Euclidean The single linkage HC algorithm clusters objects (genes or doses of chemical compounds) of toxicogenomic data based on the distance or similarity between two pairs of genes/DDs. At the starting, the smallest distance D = d G i ,G i will be found and merge the corresponding genes and form a cluster (G i G i ). In the next step, the distance between the clusters (G i G i ) and G i are computed by to form the cluster (G i , G i G i ). This process continues until all genes merge into a single cluster.

Complete Linkage
In the complete linkage HC algorithm two objects form a cluster together, when their distance is the largest. The general agglomerative algorithm starts finding the minimum entry D = d G i ,G i and merges corresponding genes, such as G i and G i , to get cluster (G i , G i ). In the next step clusters (G i G i ) and G i will be merged into a cluster (G i , G i G i ) based on their maximum distance which is computed as This process continues until all genes merge into a single cluster.

Average Linkage
Average linkage treats the distance between two clusters as the average between all pairs of items where one member of a pair belongs to each cluster. We begin searching the distance matrix D = d G i ,G i to find the nearest genes, for example, G i and G i objects are merged to get the cluster (G i G i ). In the subsequent step, the distance between (G i G i ) and cluster G i is obtained by where d ii is the distance between gene i in cluster (G i G i ) and gene i in cluster G i and N G i G i and N G i are the number of genes in clusters (G i G i ) and G i respectively.

Centroid
The centroid method involves finding out the mean vector for each of the clusters and talking distance between two centroids. Initially, each of the genes is a cluster then distance between two clusters G i and G i is calculated as:

Median
The median HC method seeks the median of each of the clusters and measures the distance between two median points. The distance between the median of two clusters G i and G i is 2.2.6. Ward's Algorithm Ward's HC algorithm clusters objects based on minimizing 'loss of information' from joining two groups. This algorithm used error sum of squares (ESS) to measure the loss of information. Firstly, for a given cluster r, let ESS r be the sum of squared deviations of every item in the cluster from the cluster mean (centroid). If there are r clusters, define ESS as ESS = ESS 1 + ESS 2 + · · · + ESS r . At each step in the analysis, the union of every possible pair of clusters is considered, and the two clusters whose combination results in the smallest increase in ESS (minimum loss of information) are joined. Initially, each cluster consists of a single item, and, if there are N items, ESS r = 0, r = 1, 2, · · · N, so ESS = 0.

Distance Measures for HC
Most of the distance measure quantifies the distance or dissimilarity among m-dimensional objects or items of a dataset. For example, for a n × m gene-DDs toxicogenomic data matrix consisting of G = (G 1 , G 2 , . . . , G n ) genes and C = (C 1 , C 2 , . . . , C m ) DDs. We consider the (i, j) th input in the data matrix as F G i , C j for convenient using. This input actually represents average FCGE value Y pq. or Y pqt. for single or multiple time points. The following are important distance measure used in HC.

Selection of the Suitable Combination of Distance and HC Method
The hierarchical clustering methods group/cluster objects are based on distance matrix which is obtained from the original data matrix. There are also different methods to obtain distance matrix. We investigated the suitability of the combination of distance and HC methods for clustering genes or DDs using DDs clustering error rate (ER) based on known pathway based real datasets. The ER measures the percentage of miss-clustered DDs according to the known DDs which is calculated as: The HC algorithm in combination with distance method which produces the least clustering ER is the more suitable combination of clustering and distance methods for grouping genes or DDs.

Co-Clustering between Genes and DDs and Detection of Toxic DDs and Associated Biomarker Genes Using HC
In the toxicity study, the subsets of DDs regulate the expression profile of the subsets of genes. Accordingly, the genes in a biological pathway perform specific functions and the toxic DDs alter the expression pattern of a subset of biomarker genes in that pathway [19,21]. These biomarker genes and the toxic DDs can be explored from the biomarker co-clusters. For this purpose, more suitable distance and HC methods that produce less ER are used to cluster genes and DDs of toxicogenomic data. Our proposed algorithm follows the following steps to make co-clusters between genes and DDs.
Step 1: Fix the number of clusters in the genes as well as in DDs observing the dendrogram produced by HC according to the researchers' interest.
Step 2: Take absolute of the FCGE values within intersection areas for all pairs of genes and DDs clusters to give them equal weight in average calculations. Since the FCGE value for upregulated and downregulated genes consists of positive and negative expression values, respectively.
Step 3: Compute the average of the absolute FCGE value for intersection areas of all pairs of genes and DDs clusters.
Step 4: Rank the average FCGE values (computed in step 3) and the respective genes and DDs clusters simultaneously.
Step 5: Assign cluster numbers for genes and DDs newly, based on the ranked average FCGE values which we get from step 4. For example, the gene and DD cluster intersection which produces the largest average FCGE value; we assign both of these gene and DD clusters as cluster 1.
Simultaneously, the genes and DDs in cluster 1 together with form co-cluster 1. Similarly, we assign both of the gene and DD cluster as cluster 2 which produces the second largest average FCGE value and they form co-cluster 2 accordingly.
According to the characteristics of toxicogenomic data, a cluster of DDs can form co-cluster with single or more than one cluster of genes, when a DDs cluster might upregulate a set of genes and simultaneously downregulate another set of genes. Researchers consider a gene as differentially expressed or biomarker if its FCGE value is greater than 1.5. In that case, the expression intensity of that gene in the treatment group of samples is almost 3 times larger comparing to its expression in the control group of samples. But when the expression of a gene in the treatment group is 2 times larger than its expression in the control group, the FCGE value of that gene is 1. Therefore, we termed the co-clusters, which average FCGE value greater than one as biomarker co-clusters, and the genes and DDs in these co-clusters as biomarker genes and their regulatory DDs.

Real TGP Datasets to Investigate Clustering Performance
The Japanese Toxicogenomics Project (TGP) [22] collected gene expression data setting out a well-planned experimental condition. There were mainly two types of experiments, one is an in vivo experiment another is an in vitro experiment. The experimental condition pattern of the in vivo experiment was the combination of four time points (3 h, 6 h, 9 h, 24 h) and three dose levels (low, middle, high) and two organs (liver and kidney) of each of the drugs. These treatment conditions were applied on the Rattus norvegicus for collecting gene expression data from the target organ. There was also the control animal concurrently for each of the treatment group of animal in the experiment. The FCGE data can be computed from the gene expression data of the treatment group and control group samples produced by this experiment using the Equations (1) and (2). Toxygates a user-friendly interactive data analysis platform as well as database [15] where the FCGE data of the TGP experiment is available. The drugs' toxicity effects are more clearly visible at 24 h time point compared to the 3 h, 6 h, and 9 h time points [15]. That is why in this paper, we have considered pathway level FCGE data from Rattus Norvegicus, in vivo, liver, and single and multiple dose experiments at the 24 h time point. We have downloaded the glutathione metabolism pathway (GMP) and peroxisome proliferator-activated receptor signaling pathway (PPAR-SP) datasets for some selected drugs along with their dose levels whose toxicity mechanism are known [15,23] from Toxygates (http://toxygates.nibiohn.go.jp/toxygates/#columns). Additionally, to investigate the performance of the selected distance and HC methods for clustering toxicogenomic data, datasets for the mentioned pathways for multiple time points and dose levels are also analyzed in this article.

Selection of Suitable Combination of Distance and HC Methods
As mentioned earlier in the toxicogenomic data, the subsets of DDs regulates the expression patterns of the respective subsets of genes. Therefore, clustering/co-clustering of genes and DDs is an important issue in toxicogenomic studies. HC is a popular and widely used clustering algorithm that uses various distance measures and clustering methods for clustering genes or DDs of toxicogenomic data. However, none of the researchers suggested yet any suitable combination of distance and HC clustering methods for toxicogenomic data. Therefore, to do this we have used two known datasets GMP and PPAR-SP at the 24 h time point [14,15,23] because toxic effects of DDs are more clearly visible at this time point compared to the 3 h, 6 h, or 9 h time points [15]. In the GMP dataset, acetaminophen, methapyrilene, and nitrofurazone are considered as glutathione depleting and erythromycin, hexachlorobenzene, isoniazid, gentamicin, glibenclamide, penicillamine, and perhexilline are considered as non-glutathione depleting drugs [15]. In the PPAR-SP dataset WY-14643, clofibrate, gemfibrozil, benzbromarone, and aspirin are considered as PPARs regulated gene influencing drugs [23] and cisplatin, diltiazem, methapyrilene, phenobarbital, and triazolam are randomly selected drugs. The detail description of the datasets is given in Section 2.5. For comparing the 35 combinations of distance and HC clustering methods, we calculate the ER for both of the datasets in two ways. In the first way, we consider the glutathione depleting and PPARs-regulatory gene influencing drugs in one cluster and the others in another cluster for the respective datasets.
In that case, the FCGE value is merged into a single value averaging over the dose levels (low, middle, and high). In the second way, we consider high and middle doses of glutathione depleting drugs and PPARs-regulated gene influencing drugs in one cluster and other DDs in another cluster for GMP and PPAR-SP datasets, respectively. Therefore, each of the datasets is split into two datasets. For these datasets, the ER is displayed against the 35 combinations of distance and HC clustering methods in Table 2. From this table it is observed that the distance and HC method combinations euclidean: ward, manhattan: ward, and minkowski: ward produce smaller and stable ER in all datasets. Therefore, we suggest these combinations of distance and HC methods for clustering DDs or genes of toxicogenomic data.

Detection of Biomarker Genes and Their Regulatory DDs from the Co-Clusters
The important objective of the toxicogenomics studies is to explore subset of DDs which have the similar mechanism of action over a subset of genes. This can be done by applying our proposed algorithm described in Section 2.4 on the results obtained from the suitable combination of distance and HC methods. It is observed from the results of the previous Section 3.1, the more suitable combinations of distance and HC methods are euclidean: ward, manhattan: ward, and minkowski: ward. As an example, in this article we show the analysis of GMP and PPAR-SP datasets at 24 h as well as multiple (3 h, 6 h, 9 h, and 24 h) time points using the combinations of HC (ward) and distance (Euclidean) methods.
The dendrogram of DDs and genes based on the distance (Euclidean) and HC (ward) methods for GMP and PPAR-SP datasets at 24 h as well as multiple time points are depicted in the figures Figure 1 and Figure S1 (Supplementary File), respectively. The ranked clusters/co-clusters (according to average FCGE value within the co-clusters) for the GMP and PPAR-SP datasets are given in Tables 3 and 4, respectively. In these tables, the genes and DDs cluster numbers within the parenthesis represent the newly assigned cluster numbers based on the proposed co-clustering algorithm described in Section 2.4. For example, in the first row of Table 3 the original HC produced cluster number for both of the gene and DDs is 3. Since, the intersection mean of these genes and DDs cluster is the largest than other genes and DDs cluster intersection mean, we assign the both of the gene and DDs cluster as 1. Figure 2 represents the image of the co-clusters in which genes and DDs are arranged according to the ranked average FCGE values within the co-clusters (Tables 3 and 4). The biomarker co-clusters along with the proposed method assigned cluster number having the largest average FCGE values (consisting of biomarker genes and their regulatory DDs) are given in Tables 5 and 6 for GMP and PPAR-SP datasets, respectively. The results generated by the proposed methods for GMP and PPAR-SP datasets are validated by the literature [14,15,23] and functional annotation by the DAVID database [24]. The results of the functional annotation for biomarker genes are given in Tables 7-10. The detail results of genes and DDs clustering results are given in Supplementary file (Tables S1-S4). Table 3. Doses of drug and gene co-clustering mean (ranked) of the glutathione metabolism pathway datasets for the combination (Euclidean: ward) of distance and hierarchical clustering methods.    Gene-Cluster-2(3): Compound-Cluster-2(1) 0.2569643 Gene-Cluster-2(3): Compound-Cluster-1 (2) 0.1757952

Discussion
The important objectives of the toxicity investigation in the pre-clinical phase of the drug development process as well as in toxicogenomic studies are the subsets of DDs which have the similar mechanism of action over the respective subsets of genes and to assess toxic DDs and their regulatory toxicogenomic biomarker genes. With a view to satisfy these objectives, different authors have incorporated a number of statistical tools in their works. For example, t-test and Mann-Whitney U test [12,15], and ANOVA [16,17] were used for the exploration of biomarker genes. Nonetheless, these methods cannot satisfy the mentioned objectives. Although, there are few statistical methods [14,18,19] for the assessment of doses of drugs (DDs) toxicity and their associated biomarker genes, these methods consume more time for computation of the model parameters using the EM (expectation-maximization) [20] based iterative approaches. To overcome this problem, in this paper, we have proposed an alternative approach based on hierarchical clustering (HC) for the same purpose. However, there are several types of HC approaches whose performance depend on different similarity/distance measures. Therefore, we explored suitable combinations of distance measures and HC methods based on Japanese Toxicogenomics Project (TGP) datasets for better clustering/co-clustering between DDs and genes as well as to detect toxic DDs and their associated biomarker genes.

Conclusions
Overall, the study has shown that the proposed methods have significant advantage over the existing biomarker gene detection as well as co-clustering methods due to the following reasons.

•
Detect the biomarker genes and the regulatory (associated) DDs simultaneously.

•
The method safe time, since it requires less time for preparing results compared to the other EM based iterative co-clustering methods.

•
The results produced by the method conform to the literature and database results.