Clustering Gene Expressions Using the Table Invitation Prior

A prior for Bayesian nonparametric clustering called the Table Invitation Prior (TIP) is used to cluster gene expression data. TIP uses information concerning the pairwise distances between subjects (e.g., gene expression samples) and automatically estimates the number of clusters. TIP’s hyperparameters are estimated using a univariate multiple change point detection algorithm with respect to the subject distances, and thus TIP does not require an analyst’s intervention for estimating hyperparameters. A Gibbs sampling algorithm is provided, and TIP is used in conjunction with a Normal-Inverse-Wishart likelihood to cluster 801 gene expression samples, each of which belongs to one of five different types of cancer.


Introduction
The goal of clustering is to provide informative groupings (i.e., clusters) of similar objects. The objects are referred to in this paper as "subjects"; an example of an individual subject is a single gene expression sample from one person. The term "subjects" is used throughout in order to avoid the ambiguity associated with the term "samples" in a statistical context. Note that, in practice, an individual subject may correspond to an individual vector, matrix, or higher-order tensor; in this work, vectors are considered for the sake of simplicity. In contrast, a "subject index" refers to an identifier for a subject. For example, in vector-variate data, a subject index i ∈ {1, 2, . . . , n} refers to the ith row in the provided dataset, where n is the total number of subjects, and the notation x_i refers to the subject (object) itself (i.e., a vector, matrix, tensor, etc.).

The goal of Bayesian clustering is to produce a set of cluster assignments for the subjects while also calculating the probability that two subjects are clustered together given the observed data and prior assumptions. Mathematically, this is represented as

P(c | X) ∝ P(X | c) P(c),    (1)

where X refers to the data, c is a vector of n cluster assignments (e.g., the cluster assignments for each of the n gene expression samples), P(c | X) is the posterior probability of a cluster configuration, P(X | c) is the likelihood, and P(c) is the cluster prior, which is the focus of this paper.

A well-known challenge in clustering is estimating the unknown number of clusters K* [1,2]. Some clustering methods, such as MCLUST, require an analyst to fit several models with varying degrees of complexity and then choose the desired model based on a chosen clustering metric [3,4]. A distinct though similar approach uses the gap statistic [2] in conjunction with another clustering algorithm (e.g., hierarchical clustering) to estimate the number of clusters.
Bayesian nonparametric models refer to a flexible class of prior distributions that may be used in a variety of settings, including clustering. In the context of clustering, the use of such priors means that an analyst does not need to specify an estimate of the number of clusters, since the number of clusters is modeled as a random variable. A variety of methods have been proposed to accomplish this task, and the relevant methods are reviewed in the sections that follow. Two such priors are the Ewens-Pitman Attraction (EPA) prior [5] and a well-known special case of the EPA called the Chinese Restaurant Process (CRP) [6,7].
The CRP, a variant of the Dirichlet process, is a well-known prior used in Bayesian clustering. One drawback of the CRP is that it does not utilize information pertaining to the similarity between subjects (e.g., gene expression samples). A natural extension of the CRP is one that includes the aforementioned similarity information, and one such extension is the EPA prior. Although the EPA utilizes similarity information, its primary drawback is that it relies on the choice of a hyperparameter (this is α in Section 1.2 below). Consequently, an analyst using the EPA must either choose a fixed value for the hyperparameter α or rely on an approximate posterior distribution for α [5,8]. In Bayesian clustering, many samples must be drawn from a posterior distribution, which can be time consuming, so manually tuning a hyperparameter is undesirable.
The focus of this paper is the Table Invitation Prior (TIP), which is an attempt to maintain the advantage of Bayesian clustering (i.e., the analyst does not have to specify the number of clusters) while removing the need for an analyst to carefully tune a hyperparameter. Although the approximate posterior distribution for α used in the EPA removes the need for the analyst to tune that hyperparameter, empirical results show that TIP gives superior results and is less susceptible to splitting clusters than the EPA. Bayesian clustering methods often rely on similarity functions to capture the relationships between subjects (e.g., gene expression samples), and thus a brief review of pairwise similarity functions is provided.

Pairwise Similarity Functions
Some Bayesian clustering priors use the similarity between subjects in order to obtain clusters whose members are similar to each other [5,9]. Let the similarity between two subjects with indices i and j be given by λ(i, j) for i = 1, 2, . . . , n and j = 1, 2, . . . , n, where n is the number of subjects. The similarity function λ may take a variety of forms, and in this paper the similarity function used is the exponential decay function [5,9]:

λ(i, j) = exp(−d_ij / τ),    (2)

where τ > 0 is a hyperparameter and d_ij is the distance between the ith and jth subjects. Following the approach taken in [10], the hyperparameter τ is set to

τ = d̄,    (3)

where d̄ is the median of the pairwise distances in the strictly upper triangular portion of the distance matrix, d̄ = median{ d_ij : 1 ≤ i < j ≤ n }. The choice of the median is heuristic, but there is a justification. Equation (3) implies that

λ(i, j) = exp(−d_ij / d̄).    (4)

Consequently, similarity values corresponding to subject pairs whose distances are significantly larger than the overall median distance go to zero, whereas subject pairs that are very close to each other have a similarity value closer to 1. Subject pairs whose distance from each other is close to the overall median distance have a similarity value between 0 and 1.
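As a concrete illustration, the similarity computation with the median-distance heuristic can be sketched in a few lines. The paper's implementation is in R; this NumPy sketch is illustrative only.

```python
import numpy as np

def exp_decay_similarity(X):
    """Pairwise exponential-decay similarity with tau set to the median
    of the strictly upper-triangular pairwise distances (a sketch of the
    heuristic described above, not the authors' code)."""
    n = X.shape[0]
    # Pairwise Euclidean distances d_ij.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # tau is the median of the strictly upper-triangular distances.
    tau = np.median(D[np.triu_indices(n, k=1)])
    # lambda(i, j) = exp(-d_ij / tau).
    return np.exp(-D / tau)

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
S = exp_decay_similarity(X)
# Nearby subjects (0, 1) get similarity near 1; distant pairs decay toward 0.
```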

Ewens-Pitman Attraction Prior
The EPA distribution uses the pairwise similarity between subjects and a sequential sampling scheme to induce a partition of n subjects [5,11]. Let σ = (σ_1, σ_2, . . . , σ_n) be a random permutation of the subject indices {1, 2, . . . , n}. Then the conditional probability of the subject with index σ_i joining an existing cluster S, or forming a new cluster, is given by the following:

Pr(σ_i joins existing cluster S | c(σ_1, . . . , σ_{i−1})) = ((i − 1 − δ q_{i−1}) / (α + i − 1)) · (Σ_{σ_j ∈ S} λ(σ_i, σ_j) / Σ_{j=1}^{i−1} λ(σ_i, σ_j)),
Pr(σ_i forms a new cluster | c(σ_1, . . . , σ_{i−1})) = (α + δ q_{i−1}) / (α + i − 1),    (5)

where α > 0 is a hyperparameter that controls the extent to which a new cluster is created, q_{i−1} is the number of clusters that are assigned among the first i − 1 subjects, δ ∈ [0, 1) is a "discount" hyperparameter, λ is a similarity function, and c(σ_1, σ_2, . . . , σ_{i−1}) are the part assignments for the first i − 1 permuted subjects σ_1, . . . , σ_{i−1}. The discount parameter is specific to the EPA; its purpose is to incorporate information about the number of clusters already formed in the sequential allocation when computing the probability of a new cluster at the current step.
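The EPA sequential allocation described above can be sketched as follows. This is an illustrative Python sketch (the paper works in R): each permuted subject joins an existing cluster with probability proportional to its similarity to that cluster's members, discounted by δ, or opens a new cluster with probability (α + δ q)/(α + i − 1). Setting δ = 0 and a constant similarity recovers the CRP.

```python
import random

def epa_sample_partition(lam, alpha, delta, rng=None):
    """Draw one partition from the EPA sequential scheme; lam is an
    n x n similarity matrix. A sketch, not the authors' implementation."""
    rng = rng or random.Random(0)
    n = len(lam)
    sigma = list(range(n))
    rng.shuffle(sigma)                    # random permutation of subject indices
    clusters = [[sigma[0]]]               # the first subject opens the first cluster
    for i in range(1, n):
        s_i = sigma[i]
        q = len(clusters)                 # q_{i-1}: clusters formed so far
        total = sum(lam[s_i][s_j] for s_j in sigma[:i])
        # Attraction-weighted mass for each existing cluster...
        probs = [(i - delta * q) / (alpha + i)
                 * sum(lam[s_i][s_j] for s_j in S) / total
                 for S in clusters]
        # ...and the mass for opening a new cluster.
        probs.append((alpha + delta * q) / (alpha + i))
        k = rng.choices(range(q + 1), weights=probs)[0]
        if k == q:
            clusters.append([s_i])
        else:
            clusters[k].append(s_i)
    return clusters
```

With all similarities equal and δ = 0, the draw behaves like a CRP partition of the n subjects.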
The value of α may be treated as a constant, or it may be sampled from a distribution as described in West [8]. Specifically, West's approximate posterior distribution for α, given the current number of clusters q_n, is

α | q_n ∼ Γ(a + q_n − 1, b + γ + log(n)),    (6)

where Γ denotes the gamma distribution, γ is Euler's constant, and a and b are the prior parameters. In this work, a = b = 1, so that the prior for α is an exponential distribution with a scale parameter of 1.

Chinese Restaurant Process
The CRP is the special case of the EPA that arises when the discount parameter δ = 0 and λ(i, j) is constant for all subject indices i and j [5–7]. The conditional probability of the subject with index σ_i joining cluster k is given by the following:

Pr(σ_i joins cluster k | c(σ_1, . . . , σ_{i−1})) = n_k / (α + i − 1) for an existing cluster k containing n_k subjects, and α / (α + i − 1) for a new cluster.    (7)

The probability of a full partition under the CRP is obtained by taking the product of (7) over the sequential allocations i = 1, 2, . . . , n.

Table Invitation Prior
In this section, the Table Invitation Prior (TIP) is presented in the context of a Gibbs sampler with iterations t = 1, 2, . . . , T. An analogy is now provided to illustrate the prior's mechanics. Suppose that n subjects x_1, x_2, . . . , x_n (i.e., vectors, matrices, tensors, etc.) are sitting in a restaurant with tables (clusters) k = 1, 2, . . . , K^(t). A randomly selected subject with index r is chosen, and the n̂_τr subjects that are most similar to the subject with index r are "invited" to a new table (cluster) K^(t) + 1 (in this paper, all n̂_τr subjects accept the invitation with probability 1). The posterior probability of every subject belonging to a table (cluster) is computed for tables (clusters) 1, 2, . . . , K^(t), K^(t) + 1 and, using these probabilities, the subjects are randomly assigned to a table (i.e., the posterior cluster assignment is sampled for every subject). The variable t, which counts the number of times this process has occurred, is then incremented by 1 and the process repeats.
A more formal description of the Table Invitation Prior now follows. For iteration t in a Gibbs sampler, let the random variable r be a randomly selected subject index (e.g., a randomly selected index corresponding to an individual gene expression sample) drawn from a discrete uniform distribution, so that r ∈ {1, 2, . . . , n}, and let x_r denote the selected subject (i.e., x_r can be a vector, matrix, higher-order tensor, etc.). The set of similarity values between subject x_r and the other n − 1 subjects is

Λ_r = { λ(r, i) : i = 1, 2, . . . , n, i ≠ r },    (8)

where λ(r, i) is the similarity between the rth subject and the ith subject. Let Λ_r(j) denote the jth ordered similarity value in the set Λ_r for j = 1, 2, . . . , n − 1, so that

Λ_r(1) ≤ Λ_r(2) ≤ · · · ≤ Λ_r(n − 1).    (9)

The set of indices of the n_τr subjects that are most similar to subject x_r is given by

S_r = { i ≠ r : λ(r, i) ≥ Λ_r(n − n_τr) },    (10)

where n_τr ∈ {1, 2, . . . , n − 1} is a hyperparameter. The estimation of the hyperparameter n_τr proceeds in the following manner. First, recall that r is a randomly selected subject index so that r ∈ {1, 2, . . . , n}. The pairwise distances with respect to subject r are extracted, and the distance from subject r to itself is removed:

d_r = ( d_{r,i} : i = 1, 2, . . . , n, i ≠ r ).    (11)

The distances are then sorted in ascending order:

d*_r = sort(d_r).    (12)

Next, a univariate multiple change point detection algorithm is applied to the sorted distances. The change point detection algorithm used in this paper is binary segmentation from the changepoint library in R [12]. The binary segmentation function takes a hyperparameter, denoted by Q, that is the maximum number of change points in (13). In this paper, Q is set to n/2 + 1, since the changepoint library throws an error if Q > n/2 + 1; this choice also allows the change point method the maximum amount of flexibility in detecting changes in the subject distances. Let the set of change points, listed in ascending order, be given by

τ̂_r = { τ̂_{r,1}, τ̂_{r,2}, . . . };    (13)

then the number of subjects that are similar to the subject with index r is taken to follow a Poisson distribution: n̂_τr ∼ Poisson(τ̂_{r,1}).
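The estimation of n̂_τr from the sorted distances can be sketched as follows. This Python sketch substitutes a minimal least-squares single change point detector for the binary segmentation routine of the R changepoint package, and draws the Poisson sample by inversion; it is an illustration of the scheme, not the authors' code.

```python
import math
import random

def first_change_point(sorted_dists):
    """Least-squares single change point in a sorted distance sequence:
    a minimal stand-in for binary segmentation (R changepoint package)."""
    x = list(sorted_dists)
    n = len(x)
    best_tau, best_cost = 1, float("inf")
    for tau in range(1, n):              # split into x[:tau] and x[tau:]
        left, right = x[:tau], x[tau:]
        m1 = sum(left) / len(left)
        m2 = sum(right) / len(right)
        cost = (sum((v - m1) ** 2 for v in left)
                + sum((v - m2) ** 2 for v in right))
        if cost < best_cost:
            best_tau, best_cost = tau, cost
    return best_tau

def estimate_n_tau(sorted_dists, rng=None):
    """Draw n_tau_r ~ Poisson(tau_{r,1}), where tau_{r,1} is the first
    change point location, as in the estimation scheme described above."""
    rng = rng or random.Random(0)
    tau1 = first_change_point(sorted_dists)
    # Poisson sampling by CDF inversion (adequate for small means).
    u, p, k = rng.random(), math.exp(-tau1), 0
    cdf = p
    while u > cdf:
        k += 1
        p *= tau1 / k
        cdf += p
    return max(k, 1)
```

For a sorted distance vector with four near neighbors followed by four far ones, the detected change point is 4, so n̂_τr concentrates around four invited subjects.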
Consequently, the set Ŝ_r is given by Equation (10) with the estimate n̂_τr in place of n_τr:

Ŝ_r = { i ≠ r : λ(r, i) ≥ Λ_r(n − n̂_τr) }.    (14)

Let the vector containing the cluster assignments be denoted by c_t, so that its ith element contains the cluster assignment for the ith subject, and let K^(t) be the total number of clusters after posterior sampling in iteration t. The Table Invitation Prior is based on selecting a random subject index r and forming a new cluster from the n̂_τr subjects that are most similar to subject x_r; that is, the new cluster is formed using the subjects whose indices are in the set Ŝ_r. Consider the modified cluster vector c̃_t, which assigns the subjects indexed by Ŝ_r ∪ {r} to the new cluster and otherwise agrees with c_{t−1} (Equation (15)); the Table Invitation Prior (TIP) is then the conditional cluster-assignment distribution given in Equation (16).

TIP Gibbs Sampler
An implementation of a Gibbs sampler with a Table Invitation Prior for clustering corresponds to the following steps and is summarized in Algorithm 1. Initially, all subjects sit at K^(0) tables (clusters). In Gibbs sampling iteration t = 1, a random subject with index r is chosen, and the n̂_τr most similar subjects, with indices r_(n−1), r_(n−2), . . . , r_(n−n̂_τr+1), are assigned to table K^(0) + 1 together with subject x_r. The conditional probabilities for all subjects x_1, x_2, . . . , x_n (i.e., vectors, matrices, tensors, etc.) are computed using Equation (16) for all clusters k = 1, 2, . . . , K^(0) + 1; if desired, a likelihood value may be computed for each table (cluster). Next, the posterior probability is computed and each subject's posterior cluster assignment is sampled from the set {1, 2, . . . , K^(0), K^(0) + 1}. This gives a partition with K^(1) clusters (tables). In the second Gibbs sampling iteration t = 2, a random subject index r ∼ U{1, n} is chosen and the n̂_τr most similar subjects, with indices r_(n−1), r_(n−2), . . . , r_(n−n̂_τr+1), are assigned to table K^(1) + 1 together with subject x_r. The conditional probabilities for all subjects x_1, x_2, . . . , x_n are computed using Equation (16) for all clusters k = 1, 2, . . . , K^(1) + 1; again, a likelihood value may be computed for each table (cluster), and each subject's posterior cluster assignment is sampled from the set {1, 2, . . . , K^(1) + 1}. This process continues for t ∈ {3, . . . , T}.

Algorithm 1: Table Invitation Prior Clustering
1: Inputs: n subjects X = {x_i, i = 1, . . . , n}, number of Gibbs sampling iterations T, initial cluster assignments c_0, similarity function λ with hyperparameters Θ, and a distance matrix D.
2: Output: posterior cluster assignments c_1, c_2, . . . , c_T.
3: Sort the ith row of the distance matrix to obtain d*_i.
4: Compute n̂_τi for each subject x_i, i ∈ {1, 2, . . . , n}, by applying a univariate multiple change point detection algorithm to the sorted distances d*_i.
5: Compute the similarity sets Ŝ_i using Equation (14) for i ∈ {1, 2, . . . , n}.
6: for t in 1:T do
7:     Draw a random subject index r ∼ U{1, n}.
8:     Using the set Ŝ_r, compute the modified cluster vector c̃_t via (15) given c_{t−1}.
9:     parallel for i in 1:n do
10:        for k in 1:(K^(t) + 1) do
11:            Compute the log-prior log(P(c_i = k | X)) via (16).
12:            Compute the log-likelihood (if desired; e.g., Normal-Inverse-Wishart).
13:        Compute the log-posterior as the sum of the log-prior and the log-likelihood.
14:        Convert the log-posterior to a probability.
15:        Sample the posterior cluster assignment c_{i,t}.

Posterior Cluster Assignments
The TIP Gibbs sampler produces a set of posterior cluster vectors c_1, c_2, . . . , c_T. However, a single clustering must be produced from this set of posterior cluster assignments, and this section describes the methodology used in this paper to accomplish this task (other methods may be used). Each posterior cluster vector is transformed into an n × n posterior proximity matrix B^(t) whose entries are

B^(t)_{ij} = 1 if c_{t,i} = c_{t,j}, and B^(t)_{ij} = 0 otherwise,

where c_{t,i} and c_{t,j} are the posterior cluster assignments for subjects x_i and x_j after Gibbs sampling iteration t ∈ {1, 2, . . . , T}. The posterior similarity matrix is given by

B̄ = (1/T) Σ_{t=1}^{T} B^(t).

A vector of posterior cluster assignments is computed using the Posterior Expected Adjusted Rand (PEAR) index; technical details, as well as an application of PEAR to gene expression data, may be found in Fritsch and Ickstadt [13]. Let ρ denote the PEAR index function. Using the posterior similarity matrix B̄, the cluster vector that maximizes the PEAR index is taken to be the posterior cluster assignment vector:

ĉ = argmax_c ρ(c, B̄).

The computation of the PEAR index is accomplished using the mcclust package in R [13,14].
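Computing the posterior similarity matrix B̄ from the Gibbs draws is straightforward; the sketch below stops at B̄, since maximizing the PEAR index is handled by the mcclust package in R. The draws used here are hypothetical.

```python
import numpy as np

def posterior_similarity(draws):
    """Average the per-iteration proximity matrices B^(t) over the T
    Gibbs draws; entry (i, j) estimates P(c_i = c_j | X)."""
    draws = np.asarray(draws)            # T x n array of cluster vectors
    T, n = draws.shape
    B = np.zeros((n, n))
    for c in draws:
        # B^(t)_{ij} = 1 exactly when c_i = c_j in draw t.
        B += (c[:, None] == c[None, :]).astype(float)
    return B / T

draws = [[1, 1, 2], [1, 1, 1], [1, 1, 2], [2, 2, 1]]
Bbar = posterior_similarity(draws)
# Subjects 0 and 1 share a cluster in every draw, so Bbar[0, 1] == 1.0.
```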

Likelihood Function
The Table Invitation Prior may be used for clustering vectors, matrices, and higher-order tensors, assuming that a suitable distance metric is available. One component of a Gibbs sampler utilizing TIP that may change depending on the dataset is the likelihood function. In this paper, vector-variate data are considered, and the conjugate Normal-Inverse-Wishart (NIW) prior for the mean and covariance is utilized. Let x_i ∈ R^p be a p × 1 vector representing the ith subject for i = 1, 2, . . . , n, and let c_i be the cluster assignment for subject i. Assume that

x_i | (c_i = k, μ_k, Σ_k) ∼ N_p(μ_k, Σ_k) and (μ_k, Σ_k) ∼ NIW(μ_0, κ_0, ν_0, Λ_0);

then the joint posterior for (μ_k, Σ_k | x_i) is given by

(μ_k, Σ_k | x_i) ∼ NIW(μ_n, κ_n, ν_n, Λ_n),

where NIW denotes the Normal-Inverse-Wishart distribution and the posterior arguments μ_n, κ_n, ν_n, and Λ_n are given by the standard conjugate updates. There are four hyperparameters: μ_0, κ_0, ν_0, and Λ_0.
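The posterior arguments follow the standard NIW conjugate updates. The sketch below implements the textbook formulas (κ_n = κ_0 + n, ν_n = ν_0 + n, a precision-weighted posterior mean, and a scale matrix combining the scatter matrix with a shrinkage term); it illustrates the update, and is not the authors' code.

```python
import numpy as np

def niw_posterior(Xk, mu0, kappa0, nu0, Lambda0):
    """Conjugate NIW update for the subjects Xk (rows) currently
    assigned to cluster k (standard textbook formulas)."""
    n, p = Xk.shape
    xbar = Xk.mean(axis=0)
    S = (Xk - xbar).T @ (Xk - xbar)                  # scatter matrix
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n       # precision-weighted mean
    diff = (xbar - mu0).reshape(-1, 1)
    Lambda_n = Lambda0 + S + (kappa0 * n / kappa_n) * (diff @ diff.T)
    return mu_n, kappa_n, nu_n, Lambda_n
```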

Visualizing The Posterior Similarity Matrix
The n × n symmetric matrix B̄ can be viewed as an undirected weighted graph with n vertices, where the edge between the ith and jth subjects (vertices) represents the posterior probability that subject x_i and subject x_j are in the same cluster. A network plot may be used to view B̄ directly, but the number of edges corresponding to small posterior probabilities may unnecessarily complicate the plot. Consequently, we show the plot of the graph B̄_1, which is the result of removing the maximum number of edges from the graph B̄ such that the number of components in the graph is 1. That is, the graph B̄_1 has the minimum entropy among all subgraphs with one component, and we call its corresponding network plot the "one-cluster plot". The idea is to remove as many connections as possible while still maintaining one component, so that the clusters' relationships with each other are revealed. The network plots are used in Section 5 to visualize the cluster results.
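Removing the maximum number of edges while keeping a single component leaves a spanning tree; keeping the heaviest (most probable) edges yields a maximum-weight spanning tree. The paper does not specify its construction, so the Kruskal-based sketch below is one natural way to realize B̄_1, stated as an assumption rather than the authors' method.

```python
def one_cluster_edges(Bbar):
    """Maximum-weight spanning tree of the posterior similarity graph via
    Kruskal's algorithm with a union-find; the kept edges connect all n
    subjects with the fewest, most probable connections."""
    n = len(Bbar)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # Heaviest (highest posterior probability) edges first.
    edges = sorted(((Bbar[i][j], i, j)
                    for i in range(n) for j in range(i + 1, n)),
                   reverse=True)
    kept = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                        # keep only edges joining components
            parent[ri] = rj
            kept.append((i, j, w))
    return kept                             # n - 1 edges spanning all subjects
```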

Simulation Data
In this section, a clustering simulation is presented to compare TIP with various clustering algorithms including EPA, MCLUST, and linkage-based methods. For EPA, δ = 0 and α follows West's posterior given by Equation (6).

Simulation Description
The simulation is given by the following. A dataset X with n subjects x_1, x_2, . . . , x_n is generated, where x_i ∈ R^p. Each subject x_i for i = 1, 2, . . . , n is generated according to its true cluster assignment k, so that

x_i | (c_i = k) ∼ N_p(μ_k, Σ_k), with Σ_k ∼ W^{−1}_p(Ψ, ν).

Here N_p denotes the p-variate multivariate normal distribution and W^{−1}_p denotes the inverse Wishart distribution. The number of burn-in iterations for both TIP and EPA is set to 1000, and the number of sampling iterations is set to 1000.
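The data-generating process above can be sketched as follows. The cluster means, Ψ, ν, and cluster sizes in the example are illustrative choices, and the inverse-Wishart draw is obtained by inverting a Wishart draw (if G has rows drawn from N_p(0, Ψ⁻¹), then (GᵀG)⁻¹ ∼ W⁻¹_p(Ψ, ν)).

```python
import numpy as np

def simulate_clusters(means, psi, nu, sizes, rng=None):
    """Generate cluster-labeled data: each subject is a p-variate normal
    draw whose cluster covariance is an inverse-Wishart draw (a sketch of
    the simulation design described above)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    p = len(means[0])
    X, labels = [], []
    for k, (mu, n_k) in enumerate(zip(means, sizes), start=1):
        # Sigma_k ~ W^{-1}_p(psi, nu): invert a Wishart(psi^{-1}, nu) draw.
        G = rng.multivariate_normal(np.zeros(p), np.linalg.inv(psi), size=nu)
        Sigma_k = np.linalg.inv(G.T @ G)
        X.append(rng.multivariate_normal(mu, Sigma_k, size=n_k))
        labels += [k] * n_k
    return np.vstack(X), np.array(labels)

means = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]
X, y = simulate_clusters(means, psi=np.eye(2), nu=5, sizes=[20, 25])
```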

Simulation Results: Normal-Inverse-Wishart Likelihood Function
In this section, simulation results for TIP and EPA in conjunction with a Normal-Inverse-Wishart likelihood function are presented. TIP and EPA are compared with the MCLUST algorithm, k-means clustering, and hierarchical clustering [4,15]. For k-means and hierarchical clustering, the number of clusters is estimated using the gap statistic [2]. In this simulation there are K* = 4 well-separated clusters, composed of n_1 = 20, n_2 = 25, n_3 = 30, and n_4 = 35 vectors in p = 2 dimensional space. The results are shown in Figure 1. Both TIP and MCLUST cluster the datasets perfectly, while EPA with West's posterior is too aggressive and results in 12 clusters. Hierarchical clustering (complete linkage) is utilized in conjunction with the gap statistic: the optimal number of clusters given by the gap statistic is 4, and hierarchical clustering using complete linkage with exactly 4 clusters separates the data perfectly. The gap statistic is also used with k-means; the optimal number of clusters is again 4, and k-means separates the data perfectly.

Overlapped Clusters: p = 2, K* = 4 and n = 120

In this simulation there are K* = 4 clusters, but two of the clusters are overlapped. In this case, the cluster sizes are given by n_1 = 20, n_2 = 25, n_3 = 30, and n_4 = 45 vectors in p = 2 dimensional space. The results are shown in Figure 2. EPA gives 15 clusters, TIP gives 11 clusters, and MCLUST gives 5 clusters. MCLUST divides the two overlapped clusters into 3 clusters, whereas TIP is too aggressive and divides the overlapped clusters into 8 clusters. Similarly, EPA divides the overlapped clusters into 8 clusters but, unlike TIP, also divides Cluster 1 into 2 clusters. Hierarchical clustering (complete linkage) in conjunction with the gap statistic gives 4 clusters, though the resulting cluster assignments are not necessarily accurate, since part of Cluster 3 is clustered with part of Cluster 4. K-means is also applied to the dataset in conjunction with the gap statistic; the optimal number of clusters according to the gap statistic is 3, which fuses two of the true clusters (i.e., Cluster 3 and Cluster 4) together.

Application: Clustering Gene Expression Data
In this section, TIP is applied to a dataset pertaining to RNA-Seq gene expression levels as measured by an Illumina HiSeq platform [16]. The data were accessed from the UCI Machine Learning Repository [17] and were collected as part of the Cancer Genome Atlas Pan-Cancer analysis project [18]. There are n = 801 gene expression samples (i.e., n = 801 subjects) and p = 20,531 gene expression levels. Each of the 801 gene expression samples belongs to one of 5 classes, and each class corresponds to a different type of cancer: BRCA, COAD, KIRC, LUAD, and PRAD.
Principal components analysis was applied to the data, and 7 principal components were retained so that p = 7. A plot showing the cumulative variance explained by a given number of principal components is shown in Figure 3. The reason that 7 principal components were used is that it takes a relatively large number of dimensions to explain percentages of the variance greater than 80%: the first 7 principal components explain about 80% of the variance, but it takes 22 principal components to explain 85% of the variance, 82 principal components to explain 90%, 269 principal components to explain 95%, and 603 principal components to explain 99%. The clustering methods are applied to the principal components so that p = 7 and n = 801.

The TIP posterior cluster assignments are shown in Table 1. There is a small overlap of classes in cluster 3, which contains 270 BRCA gene expression samples and 1 LUAD gene expression sample. Also, 30 BRCA gene expression samples form a distinct cluster (see cluster 5). The one-cluster plot is shown in Figure 4 and shows a small amount of overlap between LUAD and BRCA, which is consistent with the posterior cluster assignments in Table 1.

Figure 4. A one-cluster plot with respect to TIP where each graph vertex (i.e., a colored dot) corresponds to a subject (e.g., an individual gene expression sample) and the edge weights (i.e., the lines) correspond to the elements of the matrix B̄_1. Specifically, the edge between subject i and subject j is the posterior probability that subject i and subject j are in the same cluster. Shorter lines correspond to larger posterior probabilities, so pairs of graph vertices that are closer to each other in the plot are more likely to be assigned to the same cluster. The plot shows a minor overlap between BRCA (red diamond) and LUAD (green circle).

The posterior cluster assignments for EPA are shown in Table 2.
EPA is able to separate the classes quite well, but there is one cluster with substantial overlap between classes: Cluster 10 is comprised of samples from BRCA, COAD, KIRC, LUAD, and PRAD, whereas no such cluster occurs for TIP or MCLUST. Furthermore, Cluster 6 contains samples from both BRCA and LUAD; this is true for TIP and EPA. The one-cluster plot for EPA is shown in Figure 5, and it shows that there is overlap between LUAD and BRCA as well as between BRCA and COAD.

Figure 5. A one-cluster plot with respect to EPA where each graph vertex (i.e., a colored dot) corresponds to a subject (e.g., an individual gene expression sample) and the edge weights (i.e., the lines) correspond to the elements of the matrix B̄_1. Specifically, the edge between subject i and subject j is the posterior probability that subject i and subject j are in the same cluster. Shorter lines correspond to larger posterior probabilities, so pairs of graph vertices that are closer to each other in the plot are more likely to be assigned to the same cluster. There is an overlap between LUAD (green circle) and BRCA (red diamond) as well as between BRCA (red diamond) and COAD (orange square).
The cluster assignments for MCLUST are shown in Table 3. MCLUST, like TIP, performs well. MCLUST produces one cluster with a minor amount of overlap: cluster 1 features 57 BRCA samples and 2 LUAD samples. Furthermore, BRCA is split between two clusters: one with 57 BRCA samples and another with 243 BRCA samples. This is similar to the TIP results.
Hierarchical clustering is applied in conjunction with the gap statistic to choose the number of clusters, and the R package cluster is used to compute the gap statistic [19]. The settings used for hierarchical clustering and k-means are the default settings in the stats library in R [20]. The results for hierarchical clustering using complete linkage are shown in Table 4. The optimal number of clusters estimated via the gap statistic is 7, but complete linkage clustering is unable to separate the classes. The results for hierarchical clustering using single linkage are shown in Table 5. The optimal number of clusters estimated by the gap statistic is 1, and thus single linkage clustering is unable to separate the classes. The results for hierarchical clustering using median linkage are shown in Table 6. The optimal number of clusters given by the gap statistic is 1, and thus median linkage clustering is unable to separate the classes. K-means clustering is also used in conjunction with the gap statistic. The optimal number of clusters according to the gap statistic is 5, but the resulting clusters, which are provided in Table 7, do not separate the data well.

Table 7. The distribution of the cluster assignments using k-means (default settings in R) with the gap statistic used to select the number of clusters. The gap statistic suggests 5 clusters.

Conclusions and Discussion
In this work, a Bayesian nonparametric clustering prior called the Table Invitation Prior (TIP) was introduced. TIP does not require the analyst to specify the number of clusters, and its hyperparameters are automatically estimated via univariate multiple change point detection. EPA is a prior on partitions that is also used for Bayesian clustering; unlike TIP, however, the probability of a new cluster under EPA depends on preset hyperparameters (i.e., δ and α > −δ) that are not data-driven, and improper hyperparameter values may bias the estimated number of clusters. The main difference between TIP and MCLUST is that TIP is a Bayesian cluster prior that can be combined with various likelihoods and with priors for the parameters of those likelihoods; for example, TIP can work with a conjugate Normal-Inverse-Wishart prior for an unknown mean and covariance matrix. MCLUST, by contrast, is based on a finite Gaussian mixture model and uses an expectation-maximization (EM) algorithm [21] to estimate the Gaussian mixture parameters under a preset covariance structure. TIP was used in conjunction with a Normal-Inverse-Wishart conjugate prior to cluster gene expression data, and it was compared with a variety of other clustering methodologies: another Bayesian nonparametric clustering method (EPA), MCLUST, hierarchical clustering in conjunction with the gap statistic, and k-means clustering in conjunction with the gap statistic.

Institutional Review Board Statement: Ethical review and approval were not required for this study due to the use of publicly available data.

Informed Consent Statement: Not applicable.
Data Availability Statement: The data are publicly available from the UCI data repository mentioned in the manuscript.