1. Introduction
Change-point models deal with the analysis of an ordered sequence of random quantities. Examples of such sequences include daily average temperatures over time and sequencing data in genomics. One important component of a change-point model is change-point detection, which involves inferring the positions where some aspect of the data sequence changes, such as its location (e.g., the mean) or its distribution. These change points and their locations are of great practical interest. One of the first applications of these models dates back to the 1950s, when refs. [1,2] introduced a now well-known sequential method called the cumulative sum (CUSUM) to detect changes in the mean of a quality control process. Since then, change-point detection has been actively addressed in various application settings, such as financial analysis [3] and biostatistics [4,5,6]. Change-point detection is also widely studied in time series analysis [7,8,9,10,11]; however, in what follows, we focus on non-time-series techniques.
Change-point models are generally divided into two main groups: online methods, which perform sequential detection as new data continually arrive and are commonly used in anomaly detection, and offline methods, in which retrospective analysis is performed on the entire observed sequence [12]. In this article, we focus on the latter. Additionally, change-point models may be either parametric or nonparametric. Parametric models assume that the underlying distributions belong to some known family. In contrast, nonparametric approaches rely heavily on the estimation of density functions but may be employed in a broader range of applications [13,14,15,16].
The literature on change-point models is vast, and several methods for change-point detection have been proposed in the past few decades. Here, we discuss some approaches proposed for single and multiple change-point problems. For example, refs. [3,4] proposed circular binary segmentation and wild binary segmentation, respectively, both based on the binary segmentation algorithm proposed by [17]. These methods perform change-point tests sequentially to locate change points in the data sequence. Other methods, which are mainly used for multiple change-point problems, treat change-point detection as a model-selection problem and estimate change points by minimizing a criterion. These methods often rely on dynamic programming, such as the pruned exact linear time (PELT) algorithm [18] and the functional pruning (FP) algorithm [19]. Some well-known approaches for multiple change-point detection include the simultaneous multiscale change-point estimator (SMUCE) [20] and the heterogeneous simultaneous multiscale change-point estimator (H-SMUCE) [21], both of which are based on multiscale hypothesis testing, where the optimization process relies on the penalization of a test statistic. Additional approaches for change-point problems were described by [22,23].
The previously described approaches were proposed to detect change points in a single data sequence. However, there is also interest in identifying common patterns across multiple sequences, which allows grouping sequences that originate from the same distribution. To the best of our knowledge, techniques combining change-point estimation and model-based clustering have only been studied by [24,25,26]; hence, clustering change-point data from multiple sequences remains underexplored, especially with model-based techniques. Ref. [25] proposed a finite Gaussian mixture model for clustering observations with a single change point, whereas ref. [26] proposed a finite negative binomial mixture model for clustering multiple change-point data. Both approaches use the expectation–maximization (EM) algorithm to estimate the cluster assignments and the model parameters. The single change-point detection in [25] was performed using exhaustive searches for changes in the mean or variance, where competing models were compared based on the Bayesian information criterion (BIC). The multiple change-point approach detects changes in the mean of a count process by combining segmentation with an exhaustive search; as in the single change-point approach, the best model is selected based on the BIC. Focusing on the analysis of the mortality rate over time for 49 states in the United States, ref. [24] took a Bayesian approach to clustering multiple change-point data, assuming a functional Dirichlet process on the linear piecewise structure of the data to cluster states based on the change-point locations and slope magnitudes. Although these papers showed promising results in clustering change-point data while simultaneously performing change-point detection, none made their algorithm's implementation available. Moreover, there is currently no available software in R that simultaneously performs clustering and multiple change-point detection. Existing packages such as ecp [27] and bcp [28] can detect multiple change points within a single sequence of observations but do not perform clustering over multiple sequences.
In terms of application, an important motivation arises in clustering single-cell copy number data, where commonly used approaches estimate copy number profiles and cluster cells sequentially. Typically, hidden Markov models (HMMs) are used to infer copy number states for each cell, followed by clustering as a separate step, often relying on distance-based approaches [29,30]. Although ref. [31] proposed a method that simultaneously performs copy number profiling and clustering, it is also based on an HMM framework. Therefore, there remains a need for alternative approaches that jointly perform copy number profiling and clustering, particularly methods based on change-point models, which naturally represent structural changes in copy number profiles.
In this paper, we propose, and implement as an R package, a nonparametric Bayesian model for clustering multiple constant-wise change-point data via a Gibbs sampler. Similar to the approach of [24], our model incorporates a functional Dirichlet process on the constant-wise change-point structures that automatically controls the number of clusters, in contrast to other clustering techniques [32,33] that require this number to be pre-specified. To the best of our knowledge, this is the first work to provide an implementation for the problem of clustering multiple change-point data while simultaneously performing change-point detection. We apply our proposed approach to cluster abnormal (tumor) single-cell genomic data based on their copy number profiles, which resemble constant-wise structures. In addition, we evaluate the performance of our method under various simulated scenarios. Our proposed method is implemented as the R package BayesCPclust and is available from the Comprehensive R Archive Network (https://CRAN.R-project.org/package=BayesCPclust, accessed on 10 February 2025).
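A minimal sketch of obtaining the package, assuming a standard CRAN installation:

```r
# Install BayesCPclust from CRAN and load it; see the package
# documentation for the exported functions and their arguments.
install.packages("BayesCPclust")
library(BayesCPclust)
```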
The rest of this paper is organized as follows. Section 2 introduces our proposed methodology and provides the updating steps for the Gibbs sampler. Section 3 presents the performance results for our proposed method under various simulated scenarios. In Section 4, we show the results of applying our method to a single-cell copy number dataset. Finally, Section 5 and Section 6 present potential avenues for future work and a discussion of the implications of our work, respectively.
2. Methods
Let $\mathbf{Y}_n = (Y_{n1}, \dots, Y_{nM})$ be a data sequence ordered based on some covariate, such as time or position along a chromosome. For example, in the copy number dataset analyzed in Section 4, $Y_{nm}$ represents the GC-corrected copy number ratio aligned to genomic bin $m$ in cell $n$, where $m = 1, \dots, M$ and $n = 1, \dots, N$.
If we assume that there are $K_n$ change points in $\mathbf{Y}_n$, then $\mathbf{Y}_n$ can be partitioned into $K_n + 1$ distinct segments, $S_1, \dots, S_{K_n+1}$, with change-point positions $T_{n1}, \dots, T_{nK_n}$ such that $T_{n0} = 1$ and $T_{n,K_n+1} = M + 1$. Also, we assume that the change points are ordered; that is, $T_{nl} < T_{nl'}$ if and only if $l < l'$.
In our approach, we assume a constant-wise structure for $\mathbf{Y}_n$ defined by the model
$$Y_{nm} = f_n(x_m) + \varepsilon_{nm}, \tag{1}$$
where $\varepsilon_{nm} \sim N(0, \sigma^2_n)$ for $m = 1, \dots, M$ and $n = 1, \dots, N$.
The model in Equation (1) assumes that the mean in each interval between change points is constant, defined by an intercept $\beta_{nl}$, $l = 1, \dots, K_n + 1$. Furthermore, this model allows the variability around the mean to differ depending on the observation by specifying a variance component $\sigma^2_n$ for each $n$.
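As an illustration of Equation (1), the sketch below simulates one sequence from the constant-wise model; the helper name, the convention that a change point at position $T$ starts a new segment, and the example values are ours:

```r
# Sketch: simulate one data sequence from the constant-wise model in
# Equation (1). M locations, interior change points tau, segment
# intercepts beta (length K + 1), and a sequence-specific variance sigma2.
simulate_sequence <- function(M, tau, beta, sigma2) {
  bounds  <- c(1, tau, M + 1)                 # segment boundaries
  mean_fn <- rep(beta, times = diff(bounds))  # piecewise-constant mean
  mean_fn + rnorm(M, mean = 0, sd = sqrt(sigma2))
}
y <- simulate_sequence(M = 50, tau = c(20, 35),
                       beta = c(1, 3, 2), sigma2 = 0.5)
```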
Clustering change-point data via a functional Dirichlet process is formulated by assuming that the constant-wise structures for the observations are independent draws from some distribution $G$, which in turn follows a Dirichlet process prior. We define the constant-wise function as follows:
$$f_n(x) = \sum_{l=1}^{K_n + 1} \beta_{nl}\, \mathbb{1}\{T_{n,l-1} \le x < T_{nl}\},$$
where $\beta_{nl}$ is the intercept in segment $l$ for each observation $n$. This constant-wise function $f_n$ contains all information about the number of change points, their locations, and the intercepts of the corresponding segments. Furthermore, a Dirichlet process on $f_n$ leads to the hierarchical model
$$f_n \mid G \overset{\text{iid}}{\sim} G, \qquad G \sim \mathrm{DP}(\alpha, G_0),$$
where $G_0$ is the baseline distribution, such that $E[G] = G_0$, and $\alpha$ is the precision parameter that determines how distant the distribution $G$ is from $G_0$.
Integration over $G$ allows the predictive distribution of $f_n$ to be written as shown in [34]:
$$f_n \mid \mathbf{f}_{-n} \sim \frac{1}{\alpha + N - 1} \sum_{j \neq n} \delta_{f_j} + \frac{\alpha}{\alpha + N - 1}\, G_0, \tag{2}$$
where $\delta_{f_j}$ is a point mass distribution at $f_j$ and $\mathbf{f}_{-n}$ represents all the observations, except for $n$. Note that, under the first term in Equation (2), there is a positive probability that draws from $G$ will take on the same value. This implies that, for a long enough sequence of draws from $G$, the value of any draw will be repeated by another draw, indicating that $G$ is a discrete distribution. Therefore, a Dirichlet process on the change-point structures allows the proposed approach to control the number of clusters in the model without requiring pre-specification. More details about the Dirichlet process can be found in the works of [35,36].
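The discreteness implied by Equation (2) can be seen in a small simulation of the sequential (Pólya urn) scheme; here, purely for illustration, the baseline $G_0$ is a standard normal rather than a distribution over constant-wise functions:

```r
# Sketch: sequential draws from the Polya-urn predictive in Equation (2).
set.seed(2)
alpha <- 1; N <- 20
draws <- numeric(N)
draws[1] <- rnorm(1)                          # first draw comes from G0
for (n in 2:N) {
  if (runif(1) < alpha / (alpha + n - 1)) {
    draws[n] <- rnorm(1)                      # new value, drawn from G0
  } else {
    draws[n] <- draws[sample.int(n - 1, 1)]   # repeat an earlier draw
  }
}
table(round(draws, 3))                        # ties reveal the clustering
```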
We define the distribution $G_0$ in the following hierarchical form to cluster observations according to their constant-wise change-point profiles:
- (i) Distribution of the number of change points ($K$): We assume that each segment between change points has at least $w$ points to ensure a non-zero length. Let $I_l - w$ be the interval length of the $l$th segment after subtracting $w$, where $I_l = T_l - T_{l-1}$. As a result, $\sum_{l=1}^{K+1} (I_l - w) = M - (K+1)w \ge 0$, where $K \le \lfloor M/w \rfloor - 1$ to ensure that $(K+1)w \le M$. Therefore, $K$ follows a truncated Poisson distribution given by
$$P(K = k \mid \lambda) = \frac{\lambda^{k}/k!}{\sum_{j=0}^{K_{\max}} \lambda^{j}/j!}, \qquad k = 0, 1, \dots, K_{\max} = \lfloor M/w \rfloor - 1.$$
- (ii) Distribution of the interval lengths between change points: Given $K = k$, the distribution of the interval lengths $(I_1, \dots, I_{k+1})$ is defined over all combinations satisfying $I_l \ge w$ and $\sum_{l=1}^{k+1} I_l = M$. The change points' positions are obtained recursively by assuming that $T_0 = 1$ and $T_l = T_{l-1} + I_l$ for $l = 1, \dots, k$.
- (iii) Distribution of the constant levels ($\beta_l$): Given $K = k$, each $\beta_l$ is generated independently from a probability density function on its support, for $l = 1, \dots, k+1$.
- (iv) Finally, the constant-wise structure $f$ is then defined based on the random quantities generated according to their distributions defined in (i–iii).
- (v) The baseline distribution $G_0$ is defined based on the distributions given in (i–iii); that is, $G_0(df)$, the probability of observing the infinitesimal interval in the neighborhood of $f$, factorizes into the probabilities defined in (i–iii), where $df$ represents an infinitesimal change in $f$.
Note that, as mentioned, the distribution on the constant-wise structures is discrete. Therefore, observations in cluster $r$, for $r = 1, \dots, d$, are assumed to share the same constant-wise function $f_r^*$. Parameter estimation for the model is achieved in a Bayesian framework via a Gibbs sampler.
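A sketch of one draw from $G_0$ following (i)–(iii); the uniform density for the constant levels and the random allocation of the surplus points (a stand-in for the exact distribution over interval-length combinations) are our assumptions:

```r
set.seed(3)
M <- 100; w <- 10; lambda <- 2
Kmax <- floor(M / w) - 1
K <- sample(0:Kmax, 1, prob = dpois(0:Kmax, lambda))  # (i) truncated Poisson
# (ii) interval lengths: give every segment w points, then allocate the
# surplus at random (a stand-in for the exact combination probabilities)
extra <- if (K > 0) {
  tabulate(sample.int(K + 1, M - (K + 1) * w, replace = TRUE), nbins = K + 1)
} else M - w
len  <- w + extra
tau  <- cumsum(len)[seq_len(K)] + 1      # change-point positions
beta <- runif(K + 1, 0, 25)              # (iii) constant levels, assumed uniform
f    <- rep(beta, times = len)           # one constant-wise structure from G0
```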
2.1. Bayesian Inference
The vector with the observed data is denoted by $\mathbf{Y} = (\mathbf{Y}_1, \dots, \mathbf{Y}_N)$, where $\mathbf{Y}_n = (Y_{n1}, \dots, Y_{nM})$ for each observation $n$, while $\mathbf{f} = (f_1, \dots, f_N)$ is the set of all constant-wise functions across all $N$ observations. Let $\mathbf{K} = (K_1, \dots, K_N)$ be the vector with the number of change points for each data sequence. We define the set of all change points' positions as $\mathbf{T} = (\mathbf{T}_1, \dots, \mathbf{T}_N)$, with $\mathbf{T}_n = (T_{n1}, \dots, T_{nK_n})$, and $\boldsymbol{\beta} = (\boldsymbol{\beta}_1, \dots, \boldsymbol{\beta}_N)$ as the set of all intercept parameters, with $\boldsymbol{\beta}_n = (\beta_{n1}, \dots, \beta_{n,K_n+1})$. Let $\mathbf{X}_n$ be the design matrix for $\mathbf{Y}_n$.
The Dirichlet process hyperparameters $\alpha$ and $\lambda$ are given gamma priors, $\alpha \sim \mathrm{Gamma}(a_\alpha, b_\alpha)$ and $\lambda \sim \mathrm{Gamma}(a_\lambda, b_\lambda)$. The prior distribution for the intercepts $\beta_{nl}$, $l = 1, \dots, K_n + 1$, is improper (flat) to provide analytical simplifications in the calculations of their posterior conditional distributions. The variance components $\sigma^2_n$ are given independent inverse gamma priors, such that $\sigma^2_n \sim \mathrm{IG}(u, v)$.
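A sketch of draws from these priors (the hyperparameter values below are placeholders, not those used in the paper):

```r
library(extraDistr)
alpha  <- rgamma(1, shape = 1, rate = 1)       # DP precision prior
lambda <- rgamma(1, shape = 1, rate = 1)       # truncated-Poisson rate prior
sigma2 <- rinvgamma(10, alpha = 2, beta = 1)   # one variance per sequence
```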
Gibbs Sampler
In this section, we present the updating steps for the estimation of the parameters $f_n$, $\sigma^2_n$, $\alpha$, and $\lambda$, for $n = 1, \dots, N$ and $r = 1, \dots, d$, where $r$ denotes an individual cluster and $d$ denotes the total number of clusters. Each step involves calculating the full conditional distributions (see Appendix A for derivation details).
The following expression demonstrates the clustering capability of the Dirichlet process prior on the constant-wise structures $f_n$. The current value of $f_n$ can be selected to be one of the existing $f_r^*$ with a positive probability $q_{nr}$. In cases in which observation $n$ does not belong to any existing cluster, a new $f_n$ is generated from the posterior distribution $h(f_n \mid \mathbf{Y}_n)$, as shown in Equation (3). The posterior of $f_n$, conditional on $\mathbf{f}_{-n}$, is given by
$$f_n \mid \mathbf{f}_{-n}, \mathbf{Y}_n \sim q_{n0}\, h(f_n \mid \mathbf{Y}_n) + \sum_{r=1}^{d} q_{nr}\, \delta_{f_r^*}, \tag{3}$$
where $q_{n0}$ and $q_{nr}$ define the mixing weights when observation $n$ forms a new cluster and when observation $n$ belongs to an existing cluster $r$, respectively. Additionally, $h(f_n \mid \mathbf{Y}_n)$ is the posterior of $f_n$, given that a new cluster is formed by observation $n$. Since $q_{nr} \propto n_r\, L(\mathbf{Y}_n \mid f_r^*)$, we have that $L(\mathbf{Y}_n \mid f_r^*)$ represents the normal likelihood function corresponding to the observation $\mathbf{Y}_n$ after integrating out the variance component $\sigma^2_n$. Also, $\alpha$ corresponds to the precision hyperparameter for the Dirichlet process, and $n_r$ denotes the number of observations in cluster $r$. The full expressions for $q_{n0}$ and $h(f_n \mid \mathbf{Y}_n)$ are given in detail in Appendix A, Equations (A4) and (A5).
Regardless of whether $f_n$ is a new value or an existing $f_r^*$ (Step 1), the variance component for observation $n$ is updated using the full conditional of $\sigma^2_n$ given the other parameters:
$$\sigma^2_n \mid f_n, \mathbf{Y}_n \sim \mathrm{IG}\!\left(u + \frac{M}{2},\; v + \frac{1}{2} \sum_{m=1}^{M} \big(Y_{nm} - f_n(x_m)\big)^2\right).$$
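Under the inverse gamma prior above, this is a standard normal-inverse-gamma conjugate update; a sketch, assuming the shape-rate parametrization $\mathrm{IG}(u, v)$:

```r
# Draw sigma_n^2 from its inverse gamma full conditional given the
# current constant-wise fit for sequence n.
update_sigma2 <- function(y, fitted, u, v) {
  M <- length(y)
  extraDistr::rinvgamma(1, alpha = u + M / 2,
                        beta  = v + 0.5 * sum((y - fitted)^2))
}
```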
Note that $\mathbf{f}$ uniquely determines the collection of parameters $(\mathbf{K}, \mathbf{T}, \boldsymbol{\beta})$ and, as mentioned, it contains several identical elements. Therefore, $(\mathbf{K}, \mathbf{T}, \boldsymbol{\beta})$ also contains identical elements. In this step, we provide the updating procedures for the $d$ distinct components of $\mathbf{f}$, defined by $f_r^* = (K_r^*, \mathbf{T}_r^*, \boldsymbol{\beta}_r^*)$, for $r = 1, \dots, d$, where $d$ is the number of clusters at the current update of the Gibbs sampler. Considering the hierarchical structure for the distributions of $\mathbf{T}_r^*$ and $\boldsymbol{\beta}_r^*$, which both depend on the value of $K_r^*$, we first update $K_r^*$ from its posterior marginal probability function, whose full expression is given in detail in Appendix A, Equations (A6) and (A7).
Then, we update $\mathbf{T}_r^*$ given $K_r^*$ using the posterior probabilities of the corresponding interval-length combinations. This is carried out by exhaustively listing all combinations and numerically computing the corresponding probabilities.
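The exhaustive enumeration amounts to listing all compositions of $M$ into $K_r^* + 1$ parts of size at least $w$; the paper uses RcppAlgos for this step, while the plain recursive sketch below is ours:

```r
# All interval-length combinations: compositions of 'total' into 'parts'
# segments, each of length at least w (one combination per row).
compositions_min <- function(total, parts, w) {
  if (total < parts * w) return(NULL)          # no valid combination
  if (parts == 1) return(matrix(total, 1, 1))
  out <- lapply(w:(total - (parts - 1) * w), function(first) {
    rest <- compositions_min(total - first, parts - 1, w)
    if (!is.null(rest)) cbind(first, rest)
  })
  do.call(rbind, out)
}
nrow(compositions_min(total = 20, parts = 3, w = 5))  # 21 combinations
```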
Finally, $\boldsymbol{\beta}_r^*$, given $K_r^*$ and $\mathbf{T}_r^*$, is updated based on its full conditional distribution, where $\mathbf{Y}_{(r)}$ represents the observations in cluster $r$.
The update of $\lambda$ is based on its full conditional distribution and is carried out by the Metropolis–Hastings algorithm; that is, we generate proposals from a gamma distribution and accept them with some probability based on an acceptance ratio. Here, $a$ and $b$ are the prior hyperparameters previously defined as $a_\lambda$ and $b_\lambda$, respectively.
The update of $\alpha$ is carried out using the procedure described in [37]:
- Sample $\eta \sim \mathrm{Beta}(\alpha + 1, N)$;
- Draw $\alpha$ from the mixture $\pi_\eta\, \mathrm{Gamma}(a + d,\; b - \log \eta) + (1 - \pi_\eta)\, \mathrm{Gamma}(a + d - 1,\; b - \log \eta)$.
Here, $a$ and $b$ are the prior hyperparameters previously described as $a_\alpha$ and $b_\alpha$, respectively, and $d$ is the number of clusters at the current update of the Gibbs sampler, while the mixture membership probability satisfies
$$\frac{\pi_\eta}{1 - \pi_\eta} = \frac{a + d - 1}{N\,(b - \log \eta)}.$$
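A sketch of this two-step update, which matches the auxiliary-variable scheme of Escobar and West for the Dirichlet process precision; a shape-rate gamma parametrization is assumed:

```r
# One update of the DP precision alpha given d current clusters,
# N sequences, and a Gamma(a, b) prior.
update_alpha <- function(alpha, d, N, a, b) {
  eta <- rbeta(1, alpha + 1, N)                   # auxiliary variable
  odds <- (a + d - 1) / (N * (b - log(eta)))      # pi_eta / (1 - pi_eta)
  pi_eta <- odds / (1 + odds)
  if (runif(1) < pi_eta) {
    rgamma(1, shape = a + d,     rate = b - log(eta))
  } else {
    rgamma(1, shape = a + d - 1, rate = b - log(eta))
  }
}
```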
3. Simulations
We evaluated the performance of our method through three simulated scenarios. For each scenario, we varied one of the parameters while fixing the others, as shown in Table 1. We applied our method to 96 randomly generated datasets based on the model in Equation (1), considering the initialization for the Gibbs sampler described in Section 3.1. Then, using the evaluation metrics described in Section 3.2, we assessed our method's performance; the results are presented in Section 3.3, Section 3.4 and Section 3.5.
3.1. Gibbs Sampler Initialization and Implementation
This section describes the initialization of the Gibbs sampler and some details about our algorithm's implementation. For the simulation scenarios and real data analysis, fixed hyperparameter values were specified for the inverse gamma prior on the variance components and for the gamma priors on $\alpha$ and $\lambda$. The minimum number of locations in each segment between change points, $w$, was set to 10 in Scenario 1 and to 10, 20, and 50 in Scenario 2 for the respective settings considered.
To enable convergence diagnosis for the Gibbs sampler, we employed two chains with different initial values for each simulated scenario. The first chain started from the true settings; that is, we used the parameter values that generated the datasets as initial values for our algorithm, whereas the second chain was initialized at the true parameter values plus a small perturbation. For instance, the intercepts of each cluster were initialized at their true values plus 1.5, the change-point positions for each cluster started two points above the ground truth, and the variance components were initialized with draws from an inverse gamma distribution whose mean was twice the average used to generate the true variance components.
The simulations and computations for the Gibbs sampler algorithm were performed on Sharcnet's Graham cluster, using a single node with two Intel E5-2683 v4 "Broadwell" processors (2.1 GHz base frequency), for a total of 32 computing cores. The number of simulated datasets, 96, was chosen as a multiple of the number of cores. The computations were performed on CentOS 7 with R version 4.2.1 "Funny-Looking Kid" [38], using the parallel package version 4.4.2 [38] to simulate and compute the Gibbs sampler for independent datasets simultaneously, the extraDistr package version 1.10.0 [39] to generate samples from inverse gamma distributions, the RcppAlgos package version 2.9.3 [40] to generate all possible partitions of the number of points in each segment between two change points, the MASS package version 7.3-61 [41] to generate samples from multivariate normal distributions, and the FDRSeg package version 1.0-3 [42] to calculate the V measure. It is worth mentioning that our algorithm is implemented as the R package BayesCPclust version 0.1.0 [43].
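Schematically, the per-dataset runs parallelize as below; run_one_dataset is a hypothetical stand-in for our per-dataset routine:

```r
library(parallel)
# One independent simulation + Gibbs run per dataset, spread over the
# 32 cores of the node.
results <- mclapply(1:96, function(i) run_one_dataset(i), mc.cores = 32)
```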
3.2. Performance Metrics
For each chain, simulated setting, and randomly generated dataset, we ran our method for 5000 iterations to estimate change points and perform clustering. We discarded an initial portion of each chain as burn-in and thinned the remaining samples by keeping only every 25th iteration, ensuring that the retained samples were not highly correlated. For the 200 remaining samples, we calculated the posterior mean of each parameter, except for the discrete variables, such as the cluster assignments, number of clusters, number of change points, and their locations, for which we chose the most frequent value: the posterior mode.
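A sketch of the post-processing of one chain; the burn-in fraction and the stand-in draws are our assumptions:

```r
n_iter <- 5000
draws  <- rnorm(n_iter)                        # stand-in: one parameter's chain
keep   <- seq(n_iter / 2 + 1, n_iter, by = 25) # burn-in, then every 25th draw
mean(draws[keep])                              # posterior mean (continuous)
k_draws <- sample(2:3, n_iter, replace = TRUE) # stand-in: a discrete chain
as.numeric(names(which.max(table(k_draws[keep]))))  # posterior mode (discrete)
```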
To evaluate our method’s performance concerning intercept estimation, we computed the posterior mean for each intercept and simulated dataset, which corresponded to the optimal estimator under the squared error loss . Then, we calculated the average of these posterior means for each intercept and compared its value to the true settings considered when generating the datasets. Furthermore, we assessed uncertainty in the estimation of the intercepts by computing the average of the posterior variances across the simulated datasets, which represent the posterior expected risk under the squared loss function, and the average interval length of equal-tailed credible intervals taken over the 96 datasets. Additionally, we report the mean absolute deviation (MAD) for the variance components’ estimates.
For the discrete variables, we report the proportion of datasets in which we correctly estimated the parameters. To evaluate the clustering performance of our proposed approach, we considered the V measure [42], which assesses observation-to-cluster assignments and measures the homogeneity and completeness of a clustering result. Homogeneity measures whether each cluster contains only observations from a single true class, while completeness evaluates whether all observations from the same class are assigned to the same cluster. The V measure ranges from zero to one, where results closer to one indicate better clustering.
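The homogeneity/completeness construction can be computed directly from entropies of the confusion table; the paper uses the FDRSeg package, and the self-contained sketch below is ours:

```r
# V measure from true classes and estimated cluster labels.
v_measure <- function(truth, est) {
  p <- table(truth, est) / length(truth)          # joint distribution
  H <- function(q) { q <- q[q > 0]; -sum(q * log(q)) }
  Hc <- H(rowSums(p)); Hk <- H(colSums(p)); Hj <- H(as.vector(p))
  h  <- if (Hc == 0) 1 else 1 - (Hj - Hk) / Hc    # homogeneity
  cc <- if (Hk == 0) 1 else 1 - (Hj - Hc) / Hk    # completeness
  if (h + cc == 0) 0 else 2 * h * cc / (h + cc)
}
v_measure(truth = rep(1:2, each = 5), est = rep(c(2, 1), each = 5))  # 1
```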
3.3. Scenario 1: Varying the Number of Data Sequences with Variance Components around 0.05
Figure 1 shows the data structure of four data sequences from 1 of the 96 randomly generated synthetic datasets for Scenario 1. In this scenario, we varied the number of data sequences, considering $N = 10$, 25, and 50, while keeping the other parameters fixed, as described in Table 1. Each panel represents one observation colored by its cluster assignment. Both clusters had two change points. The change points' locations for Cluster 1 were 19 and 34, and for the second cluster, they were 15 and 32. Each segment between change points was defined by a constant level, (5, 20, 10) for the first cluster and (17, 10, 2) for the second cluster.
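A compact sketch of this design, with $M$ and the per-cluster sequence counts as our assumptions and fixed rather than inverse-gamma-drawn variances for brevity:

```r
set.seed(1)
M <- 50; N <- 10                      # illustrative values
gen_seq <- function(tau, beta, sigma2) {
  bounds <- c(1, tau, M + 1)
  rep(beta, times = diff(bounds)) + rnorm(M, sd = sqrt(sigma2))
}
cluster <- rep(1:2, each = N / 2)     # assumed balanced clusters
Y <- t(sapply(cluster, function(r) {
  if (r == 1) gen_seq(c(19, 34), c(5, 20, 10), 0.05)   # Cluster 1
  else        gen_seq(c(15, 32), c(17, 10, 2), 0.05)   # Cluster 2
}))
```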
Based on the methodology of [44], the convergence of the chains for all parameters after the burn-in period and thinning procedure was confirmed.
Table 2 presents the posterior estimates for the intercepts of each cluster when the number of data sequences was 10, 25, and 50 and the variance components were generated around 0.05. In this setting, our estimates were close to the true parameter values, showing that our proposed method retrieved the correct intercepts for each cluster. As the number of data sequences increased, the average of the posterior variances for the intercepts of each cluster and the average length of the credible intervals decreased, indicating that the uncertainty about the intercepts decreased as the number of data sequences increased. Overall, these results demonstrate that our method accurately recovered the true parameter values across all cases. Considering that each data sequence had its own variance component and $M$ was fixed, increasing the number of data sequences did not considerably improve the estimation of the variance components, as shown by the mean absolute deviation in Table 3.
The change points’s locations associated with the two clusters were correctly estimated for all 96 datasets. The number of clusters and the cluster assignment for each data sequence were correctly estimated for all 96 datasets, resulting in all V measures being equal to one. Due to these findings, we decided to not include the tables with the results for the change-point detection and the figures with the values for the V measure, which were all one for all the simulated datasets.
3.4. Scenario 2: Varying the Number of Data Sequences with Variance Components around 0.5
This section evaluates our method's performance under higher data dispersion than in the previous section. We generated 96 datasets as in the last experiment for each possible value of $N$; however, for this scenario, we sampled the variance components from an inverse gamma distribution with an average 10 times higher than in Scenario 1, as shown in Figure 2.
It is worth mentioning that the convergence of the chains for all parameters in Scenario 2 was also confirmed using the methodology of [44].
Table 4 shows the posterior estimates for the intercepts of each segment between change points for the two clusters when the number of data sequences was $N = 10$, 25, and 50 and the variance components were generated around 0.5. For every considered number of data sequences, our approach correctly estimated the intercepts. Although the average posterior variances and the average credible interval lengths of the intercepts for each cluster were noticeably higher than in the previous scenario, reflecting greater uncertainty due to the increase in data dispersion, they decreased as the number of data sequences increased. Nonetheless, our method showed satisfactory performance not only in estimating the intercepts for each cluster but also in correctly estimating the number of change points and their corresponding locations. Additionally, our method always recovered the true clustering configuration in our data, with all V measures equal to one. As in Scenario 1, we therefore do not include tables with the change-point detection results or figures with the V-measure values.
Furthermore, the mean absolute deviation for the variance components' estimates was small and remained stable, as in the previous scenario, suggesting that increasing the number of data sequences did not noticeably improve the precision of the variance component estimates, as reported in Table 5.
3.5. Scenario 3: Varying the Number of Locations
In this section, we present the performance results of our method for varying numbers of locations $M$. In this scenario, both clusters had two change points. The intercept values between change points were fixed across all cases, at (2, 15, 17) for the first cluster and at a second fixed triple for the second cluster, while the change points' locations varied with the number of locations $M$ in each case.
Table 6 presents the posterior estimates for the intercepts in each case of Scenario 3. As in the previous scenarios, the convergence of the chains for all parameters was confirmed. Based on the results, our approach correctly estimated the intercepts for each cluster and showed that, as the number of locations increased, the uncertainty in the estimation of the intercepts for each cluster decreased, as reflected by the decreasing average posterior variances. Once again, our method correctly estimated the number of change points and the change-point positions for all generated datasets. In addition, all V-measure values were equal to one, showing that our model recovered the true clustering configuration in our data.
Furthermore, in this scenario, we observed an increase in the precision of our estimates for the variance components as $M$ increased, as shown in Table 7. As discussed in the previous scenarios, the number of data sequences minimally affected the precision of our variance estimates, since each data sequence had its own variance component. However, increasing the number of locations decreased the mean absolute deviation of our estimates, suggesting that the number of locations considerably affects the estimation of the variance components.
4. Real Data Analysis
We further assessed the performance of our method on a real dataset. We applied our approach to a subset of the copy number genomic data analyzed by [29], focusing on patient CRC2. The dataset consists of copy number information for 45 cells (data sequences) from a frozen primary tumor and liver metastases of colorectal cancer. Each data point in the dataset corresponds to the ratio of reads aligned per 200-kb genomic bin per cell after GC correction. These ratios provide an indication of the number of copies in each genomic bin; a ratio greater than one indicates an amplification in the corresponding region. Genomic copy number alterations are common in many diseases, including cancer, where deletions or amplifications of DNA segments can contribute to alterations in the expression of tumor-suppressor genes [45,46]. Identifying the number and locations of these alterations is essential for understanding cancer progression. As tumors evolve, differences in genomic profiles, including copy number, are expected between primary and metastatic tumors [47,48,49,50,51].
In this work, for computational feasibility, we focused our analysis on chromosomes 19, 20, and 21, corresponding to 583 genomic bins (locations), since this is a region with visible change points, as observed by [29]. The raw data (FASTQ files) are publicly available at the NCBI Sequence Read Archive (SRA) under accession number SRP074289. The processed ratios were kindly provided by [29] upon our request.
Figure 3 displays the copy number data for six cells in our dataset: three from the primary tumor and three from a liver metastasis. Our main interest lies in clustering all 45 cells based on their copy number variations, evaluating whether they form groups according to their tissue of origin, and uncovering any novel patterns, if present.
Due to the computational cost, we fixed the maximum number of change points to two, and we applied a median moving window of size five to each data sequence using the R package zoo [52] to reduce the number of bins in the data and handle possible outliers. Considering the transformed data with 290 locations, we ran our algorithm using two chains of length 10,000. One chain was initialized using the clustering result from the K-means method with the number of clusters set to two; the other was initialized using random cluster assignments, that is, each cell was randomly assigned to one of two clusters. The number of change points for each cluster was set to zero at the beginning of the chains. Additionally, the initial values for the intercepts were the average copy number ratios taken over the cells in each initial cluster, and the sample variances were used as initial values for the variance components. The minimum number of locations in each segment between change points, $w$, was set to 50. Furthermore, convergence was confirmed using the methodology of [44] for each chain, after a burn-in of half the length of the chains and thinning the remaining samples by selecting every 50th one.
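One plausible reading of this preprocessing, which reproduces the reduction from 583 to 290 bins, is a width-5 rolling median advanced by two bins; whether this is the exact transformation used is our assumption (y_raw is a placeholder for one cell's sequence):

```r
library(zoo)
# Width-5 rolling median, advancing two bins at a time:
# floor((583 - 5) / 2) + 1 = 290 smoothed values per cell.
y_smooth <- rollapply(y_raw, width = 5, FUN = median, by = 2)
```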
Our approach identified three clusters: Cluster 1 was composed of 18 primary tumor cells with clear change points at bin locations 100 and 226; Cluster 2 was composed of four primary tumor cells and five metastatic tumor cells, with copy number ratios around one for all bins; and Cluster 3 was composed of 18 metastatic tumor cells with two change points at bin locations 165 and 215, as shown in Figure 4, Figure 5 and Figure 6, respectively.
Table 8 reports the posterior estimates for the intercepts of each segment between change points for each cluster. The intercepts for Cluster 2 did not differ significantly, since their credible intervals overlapped, suggesting the absence of change points, as shown in Figure 5, where the ratios remain steady around one for all locations. In addition, the posterior variances of the intercepts for Clusters 1 and 3 were noticeably smaller than those for Cluster 2, suggesting higher uncertainty in the estimation of the copy number information for Cluster 2; this may be because Cluster 2 comprised only 9 cells, compared with the 18 cells assigned to each of the other clusters. Interestingly, the cells belonging to Cluster 2 were not considered in the hierarchical clustering analysis performed by [29]. In addition, metastatic and primary tumor single cells were mostly clustered separately, as also observed by [29]. However, ref. [29] considered all chromosomes when clustering cells and found two clusters for the metastatic tumor cells, noting that amplifications of chromosomes 3 and 8 distinguished the metastatic subpopulations. These data were also analyzed by [31], whose authors developed a Markov chain-based method for clustering copy number data, considering the copy number data for chromosomes 18–21 from patient CRC2 to cluster tumor single cells according to their copy number profiles. As a result, ref. [31] identified two clusters of tumor single cells, separating primary from metastatic single cells.
6. Conclusions
The results from the simulation scenarios show that our approach can recover the true classification of each data sequence. Furthermore, it was precise in identifying the change points when we varied the number of data sequences and the number of locations. Importantly, the degree of dispersion in the data did not affect our method's performance; we observed satisfactory results in scenarios where the variance components were sampled from inverse gamma distributions with both small and large averages. Additionally, our method effectively recovered the true underlying data structure in the presence of outliers, demonstrating its robustness. This robustness was evaluated by introducing an outlier in the change-point location for a subset of data sequences from Cluster 1. Using a dataset from Scenario 2, we reduced the value of the 19th observation by 10 units in 9% of the sequences from Cluster 1, causing the first change point of these sequences to shift by one position from its true location. Despite this modification, our method successfully recovered both the true change-point profiles and the correct cluster assignments (see the results in Appendix C). Finally, applying our method to a single-cell copy number dataset yielded results consistent with [29]: we obtained similar clusters for tumor single cells based on their change-point structures, with some cells clustered according to their tissue of origin. However, the application also revealed a novel cluster composed of cells from both primary and metastatic tissue origins, providing new insights into the dataset.
To facilitate the use of our method, we developed the R package BayesCPclust, available from the Comprehensive R Archive Network, which, to our knowledge, is the first package that addresses the problem of clustering multiple change-point data while simultaneously performing change-point detection.
A limitation of our approach lies in its computational cost, since it requires calculating a probability for each possible combination of interval lengths between change points, which can be expensive as the number of locations increases. To remedy this, in the real data analysis, we calculated the probabilities for a sample of all possible combinations of interval lengths, reducing, though not sufficiently, the computational cost. In general, as the number of data sequences or locations increased, the average processing time to infer change points and perform the clustering analysis for the simulation scenarios also increased, with an average duration between 20 and 30 h for the scenarios with the highest number of locations (see Table A3 in Appendix B). Furthermore, we observed similar processing times for the first two scenarios (see Table A1 and Table A2 in Appendix B), suggesting that the data dispersion had a minimal effect on the computational cost of our algorithm.
A common issue in Bayesian mixture modeling is that the labels of the clusters can be permuted multiple times over the iterations of a Markov chain Monte Carlo (MCMC) method, such as the Gibbs sampler. This issue, known as label switching, happens because the data likelihood is invariant under permutation of the cluster labels. Solutions for undoing label switching are necessary to perform cluster-specific inference, and various approaches have been proposed to solve this issue [56,57,58]. In this work, we assigned the most frequent set of labels to the sequences of cluster assignments leading to the same clustering. Then, after this correction for label switching, we obtained all the corresponding posterior parameter estimates for each cluster.
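A sketch of this relabeling, where z is a hypothetical iterations-by-sequences matrix of sampled cluster labels:

```r
key   <- apply(z, 1, paste, collapse = "-")  # one signature per iteration
modal <- names(which.max(table(key)))        # most frequent labeling
use   <- which(key == modal)                 # iterations to summarize over
```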