This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

In this paper we describe an algorithm for clustering multivariate time series with variables taking both categorical and continuous values. Time series of this type are frequent in health care, where they represent the health trajectories of individuals. The problem is challenging because categorical variables make it difficult to define a meaningful distance between trajectories. We propose an approach based on Hidden Markov Models (HMMs), where we first map each trajectory into an HMM, then define a suitable distance between HMMs and finally proceed to cluster the HMMs with a method based on a distance matrix. We test our approach on a simulated, but realistic, data set of 1,255 trajectories of individuals of age 45 and over, on a synthetic validation set with known clustering structure, and on a smaller set of 268 trajectories extracted from the longitudinal Health and Retirement Survey. The proposed method can be implemented quite simply using standard packages in R and Matlab and may be a good candidate for solving the difficult problem of clustering multivariate time series with categorical variables using tools that do not require advanced statistical knowledge, and are therefore accessible to a wide range of researchers.

The interaction of a patient with the health care system takes place at different points in space and time. This implies that in many cases the natural unit of observation for health care research is the entire trajectory of a patient. As linked data and personal health records become more easily available we expect that both researchers and health care stakeholders will have an increasing need for tools that can be used to analyze health trajectories, where by health trajectory we mean a collection of time series describing some aspect of the individual's health or health care utilization.

A very common type of analysis one may need to perform on health trajectories is clustering. For example an insurer may wish to cluster claims trajectories in order to better account for risk categories, a health regulator may need to cluster administrative data trajectories for the purpose of defining appropriate groups for activity based funding, and clinicians may want to group patients with the same condition according to different courses of the disease.

While the analysis of health trajectories is our motivation and our experimental results concern health trajectories, the methods described in this paper do not depend on the fact that the variables we study are related to health, and therefore we will often refer to health trajectories as “multivariate trajectories” or “multivariate time series”.

Clustering multivariate trajectories is a very difficult task, because the notion of cluster is intrinsically linked to the notion of distance, and there is no obvious or standard way of defining a distance between arbitrary multivariate time series. When the time series contain only continuous variables, some well-defined distances are available [

Unfortunately, the time series we expect to find in health care related research are likely to contain a mix of categorical and continuous variables. Categorical variables may be used to denote a health condition (such as breast cancer or diabetes), a risk factor (such as smoking), the use of medication, or the administration of a procedure or of a laboratory test. Continuous variables may arise in conjunction with claims, costs and results of laboratory tests (such as glucose or cholesterol levels).

As a consequence, in order to make our research relevant for health care applications, in this paper we focus on the issue of clustering trajectories with a method that can handle both continuous and categorical variables at the same time. In addition, we require that the method can be explained in simple terms and implemented without the need for advanced programming or statistical skills, so that it is accessible to a large number of researchers.

Our approach is conceptually simple: since time series of continuous and categorical variables are difficult objects to deal with, we replace each time series with a probabilistic model that is likely to generate it, and then cluster the models, rather than the trajectories, since this task will turn out to be much better defined.

As probabilistic models we use Hidden Markov Models (HMMs) [

In the terminology of the survey articles [

For completeness we note that the term “model based clustering” also refers to a different strand of clustering literature, where one assumes that the data were generated probabilistically by a mixture of distributions, each defining a cluster (see [

Our data will be a set of N multivariate trajectories Y_i, with i = 1, …, N.

Standard clustering methods cannot be applied directly to this type of data because the trajectories Y_i do not live in a metric space: given two trajectories Y_i and Y_j there is no standard way to define a distance d_ij between Y_i and Y_j.

In the particular case in which the trajectories are continuous-valued some natural definitions of distance are available (for example the Fréchet distance [

Therefore we take a different approach: we associate to each trajectory Y_i a probabilistic model that is likely to have generated it.

This “embedding” approach [

The strategy to address the problem of clustering trajectories is therefore as follows:

We map each trajectory Y_i into an HMM λ_i, with associated probability density P_{λ_i}, estimating the parameters of λ_i so that the model is likely to have generated Y_i.

We define a distance D(P_{λ_i}, P_{λ_j}) between the probability densities P_{λ_i} and P_{λ_j}, and define the distance between the trajectories Y_i and Y_j as d(Y_i, Y_j) = D(P_{λ_i}, P_{λ_j}).

After having computed the distance matrix d_ij for every pair of trajectories Y_i and Y_j, we cluster the trajectories using a method that takes the distance matrix as input.

The main premise of this paper is that while the distance between two trajectories is ill-defined, the distance between two probabilistic models that are likely to generate them is well-defined. The class of probabilistic models we consider in this paper is that of Hidden Markov Models (HMMs).

Hidden Markov Models (HMMs) are probabilistic models that were introduced in the late 60s [

HMMs can be thought of as a class of probability densities over sequences of observations, indexed by a set of parameters λ.

The simplest example of an HMM is one in which an individual at any point in time is in one of two hidden health states (“sick” and “healthy”) and transitions from one state to another with certain probabilities. The state is not observed, but we observe two variables: body temperature and white blood cell count. The probability distributions of these variables (the emission probabilities) depend on the hidden state, so the observations carry indirect information about the state the individual is in.
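The two-state example above can be made concrete with a short generative sketch. The probabilities below are illustrative assumptions, not values from the paper, and two binary symptom indicators stand in for the continuous temperature and white cell count:

```python
import random

# A minimal two-state HMM ("healthy"/"sick") in the spirit of the example.
# All numbers are illustrative assumptions, not estimates from the paper.
STATES = ["healthy", "sick"]
INIT = [0.9, 0.1]                # initial state probabilities
TRANS = [[0.95, 0.05],           # P(next state | current state)
         [0.30, 0.70]]
EMIT_FEVER = [0.05, 0.80]        # P(fever | state)
EMIT_WBC = [0.10, 0.60]          # P(elevated white cell count | state)

def sample_trajectory(length, rng):
    """Sample a hidden state sequence and two binary observed variables."""
    states, obs = [], []
    s = 0 if rng.random() < INIT[0] else 1
    for _ in range(length):
        states.append(STATES[s])
        fever = int(rng.random() < EMIT_FEVER[s])
        wbc = int(rng.random() < EMIT_WBC[s])
        obs.append((fever, wbc))
        s = 0 if rng.random() < TRANS[s][0] else 1
    return states, obs

rng = random.Random(0)
states, obs = sample_trajectory(10, rng)
```

The observer sees only `obs`; the sequence `states` is hidden, which is exactly the structure the estimation algorithms must invert.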

The introduction of the hidden states allows HMMs to generate much more complex dynamic patterns than traditional Markov models, while remaining computationally tractable. HMMs are particularly appealing in the health setting because they capture the notion that the health state of an individual is not a well-defined quantity, and that the observations available on an individual only capture certain dimensions of health and do not necessarily reach the underlying construct of the health state. Therefore it seems reasonable to assume that a person's health state remains unobservable, and that the only quantities we can observe are certain “manifestations” of the hidden state.

The notion of hidden state is very much in line with the approach taken, usually in a static setting, by Latent Class Analysis (LCA). Based on the success of LCA in a wide range of applications [

In addition, recent advances in HMM theory [

Since HMMs are so well documented in the literature we do not review them in detail here. All that matters for the purpose of this paper is the following:

to each multivariate time series Y_i we can associate an HMM λ_i, whose parameters are chosen so that λ_i is likely to have generated Y_i;

a set of covariates is simply another multivariate time series, observed together with Y_i, that can enter the transition and emission probabilities of λ_i;

the parameters of the HMM λ_i can be estimated with standard algorithms, such as Expectation Maximization (EM);

Once the parameters of a model λ_i have been estimated, the likelihood that λ_i assigns to any trajectory can be computed efficiently, not only for the trajectory Y_i used in the estimation.

It is important to notice that current estimation algorithms for HMMs do not estimate the number of hidden states, which must be specified by the user and chosen using some heuristics or additional criteria. We postpone the description of our strategy for estimating the number of hidden states to Subsection 4.1.

We have used the R package depmixS4 to perform the estimation of the HMMs. Since the problem is unconstrained, parameters are estimated in depmixS4 using the Expectation Maximization (EM) algorithm. We took advantage of the option to introduce covariates in the transition and emission probabilities so that we did not have to make stationarity assumptions. The package can fit multivariate time series with both continuous and categorical variables. In our case all variables were categorical and were modeled according to multinomial distributions. Following standard practice we have assumed contemporaneous conditional independence of the multiple time series forming our data ([

The need to compare different HMMs through an appropriate distance measure is not new, and has arisen in a variety of contexts such as speech recognition [

In order to define a meaningful distance between two HMMs researchers have used the fact that it is easy to compute, by means of the forward-backward algorithm [
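To illustrate how such likelihoods are obtained, the following is a minimal forward-algorithm sketch for an HMM with discrete emissions. In our experiments this computation is performed by depmixS4; the Python version below is only an illustration of the recursion, computed in log space for numerical stability:

```python
import math

def forward_loglik(obs, init, trans, emit):
    """Log-likelihood log P(obs | lambda) for a discrete-emission HMM,
    computed with the forward recursion in log space.
    obs: list of symbol indices; init[s], trans[r][s], emit[s][o] are
    the usual HMM probability tables (all entries assumed positive)."""
    n = len(init)
    # alpha[s] = log P(obs[0..t], state_t = s)
    alpha = [math.log(init[s]) + math.log(emit[s][obs[0]]) for s in range(n)]
    for o in obs[1:]:
        alpha = [
            _logsumexp([alpha[r] + math.log(trans[r][s]) for r in range(n)])
            + math.log(emit[s][o])
            for s in range(n)
        ]
    return _logsumexp(alpha)

def _logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))
```

For a model in which every probability is 0.5, any binary sequence of length T has likelihood 0.5^T, which gives a convenient sanity check on the recursion.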

The key observation is that the likelihood of a trajectory Y_i can be computed not only under its own model λ_i but under any other model λ_j as well, so that models can be compared through the likelihoods they assign to trajectories.

One estimation strategy is a Monte Carlo approach, where one randomly draws a large number K of trajectories from the density P_{λ_i} and uses them to approximate the expectation that defines the KL divergence.

This approach suffers from the problem that the number of trajectories one has to draw in order to achieve a good approximation could be prohibitively large, since the approximation error is of order 1/√K, where K is the number of trajectories drawn.
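The Monte Carlo strategy can be sketched for a toy two-state HMM with binary emissions. The model parameters, the per-time-step normalization and the trajectory length are illustrative assumptions:

```python
import math, random

def loglik(obs, init, trans, emit):
    # Forward algorithm in the probability domain (adequate for short series).
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][o]
                 for s in range(n)]
    return math.log(sum(alpha))

def sample(T, init, trans, emit, rng):
    # Draw one binary observation sequence of length T from the HMM.
    seq = []
    s = 0 if rng.random() < init[0] else 1
    for _ in range(T):
        seq.append(0 if rng.random() < emit[s][0] else 1)
        s = 0 if rng.random() < trans[s][0] else 1
    return seq

def mc_kl(hmm_i, hmm_j, T=20, n_draws=2000, rng=None):
    """Monte Carlo estimate of the per-time-step KL divergence
    KL(P_i || P_j): average of (log P(y|i) - log P(y|j)) / T
    over draws y ~ P_i. Error decays like 1/sqrt(n_draws)."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_draws):
        y = sample(T, *hmm_i, rng)
        total += (loglik(y, *hmm_i) - loglik(y, *hmm_j)) / T
    return total / n_draws
```

The estimate is exactly zero when the two models coincide, and positive (up to Monte Carlo noise) when they differ.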

Alternative methods consist of analytical approaches, such as recursive calculations [

On the opposite side of the Monte Carlo approach one could make a strong assumption and estimate the integral using a single sample point: since the trajectory Y_i has been observed, it is by construction a likely draw from the density P_{λ_i}, and one could use it as the sole representative of that density.

Since the KL distance is not symmetric one could then “symmetrize” it by writing

d(λ_i, λ_j) = ½ [ D_KL(λ_i ‖ λ_j) + D_KL(λ_j ‖ λ_i) ].

Expressions similar to this one have been used in the literature.

In order to strike a compromise between the simplicity of this single-trajectory approximation and the accuracy of the Monte Carlo approach, we observe that we already have at our disposal the trajectories associated with the estimated models λ_1, …, λ_N. Assuming that each observed trajectory Y_i, normalized by its length T_i^{−1}, represents sufficiently well the probability density P_{λ_i}, we obtain the following symmetrized KL distance between P_{λ_i} and P_{λ_j}:

d_KL(λ_i, λ_j) = ½ [ D_KL(λ_i ‖ λ_j) + D_KL(λ_j ‖ λ_i) ]
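Given the cross-likelihoods, the symmetrized distance takes a few lines to compute. The single-trajectory, length-normalized approximation of each KL term used below is an assumption about the exact form; the normalization used in the paper may differ:

```python
def symmetric_kl_distance(ll_ii, ll_ij, ll_jj, ll_ji, T_i, T_j):
    """Symmetrized, length-normalized KL-type distance between HMMs i and j.
    ll_xy = log P(Y_x | lambda_y); T_x = length of trajectory Y_x.
    Each KL term is approximated with the single observed trajectory,
    an assumption made for illustration."""
    d_ij = (ll_ii - ll_ij) / T_i   # approx KL(lambda_i || lambda_j)
    d_ji = (ll_jj - ll_ji) / T_j   # approx KL(lambda_j || lambda_i)
    return 0.5 * (d_ij + d_ji)
```

By construction the distance of a model from itself is zero, and the expression is symmetric in i and j.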

As described in the previous section we define the distance d_ij between two trajectories Y_i and Y_j as the symmetrized KL distance between the corresponding HMMs λ_i and λ_j, and collect these values in the distance matrix d_ij, which is the input of the clustering procedure.

For our experiments we have elected to use the Partitioning Around Medoids (PAM) algorithm, a method originally developed by Kaufman and Rousseeuw [
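The core idea of PAM (choose k medoids minimizing the total distance of points to their nearest medoid, improving by swaps) can be sketched as follows. This greedy version is an illustration; in practice one would use a library implementation such as the one in the R cluster package used in our experiments:

```python
def pam(dist, k, max_iter=100):
    """Minimal Partitioning Around Medoids sketch via greedy swap search.
    dist: full symmetric distance matrix (list of lists).
    Returns (medoid indices, cluster label per point)."""
    n = len(dist)
    medoids = list(range(k))  # simple deterministic initialization

    def cost(meds):
        # Total distance of every point to its nearest medoid.
        return sum(min(dist[i][m] for m in meds) for i in range(n))

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for mi in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:mi] + [cand] + medoids[mi + 1:]
                c = cost(trial)
                if c < best:
                    medoids, best, improved = trial, c, True
        if not improved:
            break
    labels = [min(range(k), key=lambda m: dist[i][medoids[m]]) for i in range(n)]
    return medoids, labels
```

Note that PAM only ever consults the distance matrix, which is exactly why it pairs naturally with the HMM-based distance defined above.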

PAM, like other clustering methods, does not automatically determine the optimal number of clusters, which must be chosen by the user. A common approach consists of running a clustering algorithm for different numbers of clusters and computing a “validity index” that assesses the quality of the results for each number of clusters [

This procedure is particularly useful in cases where there is no prior knowledge of the nature of the clusters [

In our experience the Dunn index is the one that has always given the most interpretable and stable results, and therefore this is the one we used in this paper. However, we will also report the Silhouette and DB index, since it turned out that, for our data sets, they actually provide the same answer as the Dunn index and therefore add to the credibility of the results.
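The Dunn index is straightforward to compute from a distance matrix and a cluster assignment, as the following sketch shows (this is the standard definition; higher values indicate compact, well-separated clusters):

```python
def dunn_index(dist, labels):
    """Dunn index: minimum inter-cluster distance divided by maximum
    intra-cluster diameter. dist is a symmetric distance matrix and
    labels assigns a cluster to each point. Higher is better."""
    n = len(dist)
    inter = float("inf")  # smallest distance between points in different clusters
    intra = 0.0           # largest distance between points in the same cluster
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                intra = max(intra, dist[i][j])
            else:
                inter = min(inter, dist[i][j])
    return inter / intra
```

Running it for each candidate number of clusters and keeping the maximizer reproduces the selection procedure described above.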

Experiments were conducted using a large, synthetic and complex longitudinal data set, described in

One of the data sets used in our study comes from a simulation based on the 45 and Up Study data [

A subset of 60,000 participants was interviewed two to three years after baseline, as part of the Social, Economic and Environmental Factors (SEEF) study. The longitudinal structure of this data set allowed us to estimate transition probabilities between health states over a two-year interval.

The health state of an individual was defined by a vector of binary and categorical variables associated with the presence of each of the following chronic conditions and risk factors: heart disease, diabetes, stroke, hypertension, cancer, obesity status and smoking status.

All the transition probabilities of the model were estimated using probit regressions that included as covariates the health state at the previous time, age, gender, income, education, and insurance status. The error terms of the probit regressions are correlated, so that the time series corresponding to different health conditions are correlated and reflect the observed patterns of co-morbidity.

Applying repeatedly the transition probabilities to the original 45 and Up data we obtained a Markov model that we used to generate over 260,000 health trajectories with a time step of two years. The trajectory of an individual stops when the individual dies, where the probability of dying was estimated using data from the NSW Registry of Births, Marriages and Deaths linked to the 45 and Up data.

Out of the 260,000 trajectories, which were originally produced to forecast the health of the NSW population over the next few decades, we extracted a “challenging” subset that is complex and exhibits a high degree of variability. We wanted to avoid using trivial trajectories, in which an individual never develops any disease or only develops a condition prior to death, since those are easy to differentiate from the others. Therefore we picked the set of 1,255 trajectories associated with the individuals who developed all three of heart disease, diabetes and stroke.

The average age of the cohort is approximately 60, and the length of trajectories is about 18 time steps on average, although it does vary between 4 and 20 time steps, where a time step corresponds to a period of two years of life. The trajectory of individual i is a multivariate categorical time series recording, at each time step t, the presence or absence of heart disease, diabetes and stroke, together with the BMI category and smoking status.

We underscore that the probabilistic model underlying these data is not an HMM; it is much more complex than the HMMs used in our experiments and it includes a realistic amount of noise. We resorted to its use only because access to longitudinal health data is limited by ethics and privacy concerns. Although this is a simulated data set, there is no a priori guarantee that it contains clusters of trajectories, since we have not artificially introduced any. Therefore any clustering structure that we find is a genuine feature of the underlying data.

The University of Michigan Health and Retirement Study (HRS) is a nationally representative longitudinal study that has surveyed more than 27,000 elderly and near-elderly Americans since its inception in 1992 [

A subset of HRS data with the same characteristics as the synthetic data described in

For each of the 1,255 trajectories in the 45 and Up synthetic data set we estimate a corresponding HMM using BMI and smoking behavior as time varying covariates. The estimation is performed in R using the package depmixS4 [

Once we have estimated the HMMs we used the standard forward-backward algorithm to compute the likelihood of each trajectory under each of the estimated models, and from these likelihoods we computed the distance matrix described in the previous sections.

Clustering is then performed using the PAM algorithm. In order to determine the best number of clusters we used the cluster validity indexes discussed in

The DB, Silhouette and Dunn index for the 45 and Up data, in the case of 3 hidden states. The reason for choosing 3 hidden states is found in Subsection 4.1

| | DB | Silhouette | Dunn |
| --- | --- | --- | --- |
| 2-Cluster | 1.6739 | 0.2074 | 0.3679 |
| 3-Cluster | 3.0126 | 0.1458 | 0.1815 |
| 5-Cluster | 2.0816 | 0.1964 | 0.2125 |
| 6-Cluster | 2.9047 | 0.1291 | 0.1690 |
| 7-Cluster | 2.1018 | 0.1724 | 0.2176 |
| 8-Cluster | 2.0894 | 0.1895 | 0.1929 |
| 9-Cluster | 1.9710 | 0.1651 | 0.2210 |
| 10-Cluster | 1.6086 | 0.2216 | 0.2601 |

In order to convince ourselves that four is a reasonable number of clusters for these data we use Multidimensional Scaling (MDS) to visualize the cluster structure. MDS works by finding a set of points in an n-dimensional Euclidean space whose pairwise distances reproduce, as closely as possible, the entries of a given distance matrix.

In our case we choose

While

MDS 45 and up.

An obvious feature that may explain the composition of a cluster is the time spent with each of the three conditions: for each cluster we compute the percentage of the trajectory that its members spend with heart disease (H), diabetes (D) and stroke (S).

Since we are including risk factors in our analysis it also seems important to study how they vary across clusters. The difficulty lies in the fact that both BMI and smoking behavior have significant dynamics that correlate with the onset of disease (for example, some people may lose weight or stop smoking after developing a disease). This implies that reporting the percentage of the whole trajectory spent in a given risk factor category would be misleading; instead, we compute such percentages separately for the portion of the trajectory before the first disease and the portion after the third disease.

For example, we constructed a feature called “Normal BMI before first disease” that measures the percentage of the portion of trajectory before the first disease that is spent with a normal BMI. If in one cluster this feature is 10% this means that people in that cluster spend only 10% of their time before developing the first disease with a normal BMI, and therefore we could label these people as mostly overweight or obese.
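A feature of this kind takes only a few lines to compute. The trajectory encoding used here (one dict per time step, with the keys `any_disease` and `bmi`) is an illustrative assumption, not the paper's actual data format:

```python
def pct_normal_bmi_before_first_disease(trajectory):
    """Fraction of the pre-first-disease portion of a trajectory spent
    with normal BMI. Each time step is a dict; the keys used here are
    illustrative placeholders for the paper's encoding."""
    pre = []
    for step in trajectory:
        if step["any_disease"]:   # stop at the onset of the first disease
            break
        pre.append(step)
    if not pre:
        return 0.0
    normal = sum(1 for step in pre if step["bmi"] == "normal")
    return normal / len(pre)
```

Analogous functions for the post-third-disease portion and for the smoking categories produce the full feature profile of a cluster.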

Since for each risk factor the corresponding categories are mutually exclusive we report all of them except one. So for smoking we only report the time spent in the “Not-smoking” and “Quit smoking” states, while for BMI we only report the time spent in the “Normal” and “Overweight” states (the “Underweight” category has too few records and is not worth reporting).

The full set of features we have constructed for each cluster is shown in

The profile of the four clusters in the feature space for the 45 and Up data.

In order to make it easier to interpret the composition of the clusters based on their feature profile we describe some clusters in detail. Looking at the first three features of cluster 4 we see that people in this cluster spend about 60% of their trajectories with diabetes, and 40%–45% of their trajectories with heart disease and stroke. Therefore these people develop diabetes relatively early, and after a period of time develop both heart disease and stroke. This is quite different from what we observe, for example, in cluster 1, where people spend about 60% of their trajectories with all three conditions.

Still in cluster 4 we notice, by looking at the feature labeled “Normal BMI before first disease”, that these people, before developing their first condition (which is diabetes), spend only 20% of their trajectories in the state of “Normal BMI”, and therefore are mostly overweight and obese. However, looking at the feature labeled “Normal BMI after third disease” we notice that after developing the third disease these people spend at least 50% of their trajectories with normal BMI, and therefore a significant proportion experiences weight loss after the heart and stroke events. This is the opposite of what we observe for cluster 3, whose constituents are of relatively normal BMI before the first disease (diabetes) but experience weight gain after developing heart disease.

We have summarized the features of the different clusters in

Interpretation of the four clusters for the 45 and Up synthetic data. Note that expressions such as “mostly”, “significant” or “some” do not refer to the size of the effect on an individual, but rather to the size of the population that experiences the effect. Therefore “Some weight gain” means that some of the people in the cluster experience weight gain. Interpretation of non-smoking behavior is omitted because of lack of change.

| | Disease dynamics | Risk factor before 1st disease | BMI after 3rd disease | Smoking after 3rd disease |
| --- | --- | --- | --- | --- |
| Cluster 1 | Heart disease, stroke and diabetes almost simultaneously | Mostly overweight/obese before 1st disease | Some weight loss after 3rd disease | No change in smoking behaviour after 3rd disease |
| Cluster 2 | Diabetes, heart disease and then stroke | Significantly overweight/obese before 1st disease | Some weight gain after 3rd disease | Mild increase in quitting smoking after 3rd disease |
| Cluster 3 | Diabetes, stroke and then heart disease | Half time normal BMI before 1st disease | Significant weight gain after 3rd disease | Mild increase in quitting after 3rd disease |
| Cluster 4 | Diabetes and then heart disease and stroke | Mostly overweight/obese before 1st disease | Significant weight loss after 3rd disease | Mild increase in quitting after 3rd disease |

The algorithms that estimate the parameters of the HMMs do not estimate the optimal number of hidden states. This parameter needs to be chosen by the users according to some criterion or some additional prior knowledge. We use the fact that we can represent the clusters as “feature profiles”, as shown in

The optimal number of hidden states for the 45 and Up synthetic data is 3, since it corresponds to the lowest average correlation across the feature profiles.

| | Average correlation |
| --- | --- |
| 2 Hidden states | 0.74 |
| 4 Hidden states | 0.47 |
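The correlation criterion for choosing the number of hidden states can be sketched as follows. The use of plain Pearson correlation between cluster feature profiles is an assumption about the exact statistic behind the tables:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def avg_profile_correlation(profiles):
    """Average correlation over all pairs of cluster feature profiles;
    lower values indicate better-separated cluster descriptions."""
    k = len(profiles)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    return sum(pearson(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)
```

One would run the whole pipeline for each candidate number of hidden states and keep the value that minimizes this average.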

The reason for which we do not consider more than 4 hidden states is that for a larger number of states the number of parameters of the HMMs would be too high, relative to the length of the trajectories, and overfitting would most certainly occur. The general manifestation of overfitting would be instability [

Cross-validation involves using only a portion of a trajectory to estimate the parameters of the HMM, and using the remaining portion to check whether the HMM predicts the trajectory correctly. This is easy to implement, although it is probably not meaningful for short time series and not practical for large data sets. Regularization involves setting restrictions on the set of parameters that need to be estimated. The R package depmixS4 makes it possible to estimate the HMMs under inequality constraints, and therefore this option may be attractive for a reasonably wide range of users. The Bayesian integration method is much more involved than the other two, and would probably appeal only to a restricted set of users.

A standard way of testing whether a novel clustering method performs as expected is to apply it to a validation data set for which the structure of the data is already known, so that an appropriate comparison can be made. In our case this means applying the algorithm to a set of trajectories that have been generated by HMMs with a known number of hidden states and that are expected to belong to a given number of clusters. The validation consists of checking that we recover the correct number of hidden states and cluster structure.

This type of experiment is particularly informative when there is an agreed notion of how difficult the validation task is, and often there are benchmark data sets that can be used to this end. In our case no such data set exists, and there is no obvious notion of the “difficulty” of a data set. Therefore any test we may run is more of a sanity check than a validation test, in the sense that if the algorithm failed we would have to investigate why, but if it worked successfully we could not draw strong conclusions, since it is possible that the problem is simply not challenging enough.

Nevertheless it seems worth running this type of test, especially on a data set that does not appear trivial. Therefore, rather than constructing an arbitrary, new synthetic data set we took advantage of the results of the previous section to generate a non-trivial data set. In the previous section we showed that our method found 4 clusters in a set of 1,255 trajectories, and estimated that the optimal number of hidden states that describe the data is 3. In order to produce our artificial data set we took the 4 trajectories corresponding to the centers of the 4 clusters. To each of these corresponds an HMM with three hidden states, and therefore a probability density P_{λ_i}, from which we sampled new trajectories.

The resulting set of 1,255 trajectories is therefore entirely generated by HMMs with 3 hidden states, and by applying our algorithm we should be able to recover the fact that there are 4 clusters in the data and that the optimal number of hidden states is 3. This is indeed the case, as shown in

The optimal number of hidden states for the validation data set is 3, since it corresponds to the lowest average correlation across the feature profiles. This is indeed the number of hidden states used to generate the data.

| | Average correlation |
| --- | --- |
| 2 Hidden states | 0.12 |
| 4 Hidden states | 0.17 |

In addition, trajectories sampled from the same HMM λ_i should be assigned to the same cluster. The confusion matrix between the true cluster labels (the generating HMMs) and the predicted cluster labels shows that this is largely the case.

Confusion matrix between true cluster and predicted cluster results.

| True cluster | Predicted 1 | Predicted 2 | Predicted 3 | Predicted 4 |
| --- | --- | --- | --- | --- |
| 1 | 199 | 49 | 1 | 25 |
| 2 | 5 | 441 | 2 | 0 |
| 3 | 0 | 0 | 221 | 0 |
| 4 | 0 | 0 | 0 | 312 |
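A confusion matrix of this kind is a simple count over (true, predicted) label pairs, as the following sketch shows:

```python
def confusion_matrix(true_labels, pred_labels, k):
    """k x k matrix: entry [t][p] counts the items whose true cluster
    is t and whose predicted cluster is p."""
    mat = [[0] * k for _ in range(k)]
    for t, p in zip(true_labels, pred_labels):
        mat[t][p] += 1
    return mat
```

A diagonal-dominant matrix, like the one reported above, indicates that sampled trajectories are mostly assigned back to the cluster of their generating HMM.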

The analysis for the HRS data follows exactly the same lines of the analysis of the 45 and Up data. The optimal number of hidden states turned out to be 3, as shown in

The optimal number of hidden states for the HRS data is 3, since it corresponds to the lowest average correlation across the feature profiles.

| | Average correlation |
| --- | --- |
| 2 Hidden states | 0.59 |
| 4 Hidden states | 0.62 |

The DB, Silhouette and Dunn index for the HRS data, using 3 hidden states.

| | DB | Silhouette | Dunn |
| --- | --- | --- | --- |
| 2-Cluster | 2.0529 | 0.4636 | 0.3893 |
| 4-Cluster | 2.1257 | 0.4412 | 0.2820 |
| 5-Cluster | 1.8561 | 0.3974 | 0.2614 |
| 6-Cluster | 2.6350 | 0.4321 | 0.1841 |
| 7-Cluster | 1.8339 | 0.4304 | 0.3497 |
| 8-Cluster | 2.6883 | 0.4506 | 0.2170 |
| 9-Cluster | 2.5222 | 0.4536 | 0.1544 |
| 10-Cluster | 2.2139 | 0.4226 | 0.2243 |

MDS HRS data.

In order to interpret the clusters we use the same methodology we used for the 45 and Up data, and in

The profile of the three clusters in the feature space for the HRS data.

Interpretation of the three clusters in the HRS data. Note that expressions such as “large” or “mostly” do not refer to the size of the effect on an individual, but rather to the size of the population that experiences the effect. Therefore “Large weight loss” means that a large portion of the people in the cluster experiences weight loss. Interpretation of non-smoking behavior is omitted because of lack of change.

| | Disease dynamics | Risk factor before 1st disease | BMI after 3rd disease | Smoking after 3rd disease |
| --- | --- | --- | --- | --- |
| Cluster 1 | Heart disease, then stroke and then diabetes | Mostly overweight or obese before 1st disease | Weight gain after 3rd disease | Large increase in smoke quitting after 3rd disease |
| Cluster 2 | Diabetes and then heart disease and stroke almost at the same time | Mostly overweight or obese before 1st disease | Some weight loss after 3rd disease | Increase in smoke quitting after 3rd disease |
| Cluster 3 | Heart disease, then diabetes and much later stroke | Mostly overweight or obese before 1st disease | Large weight loss after 3rd disease | Increase in smoke quitting after 3rd disease |

People in cluster 3 spend 60% and 50% of their trajectories with heart disease and diabetes respectively, so they develop heart disease first and diabetes soon after. Unlike people in other clusters they develop stroke much later, and spend only 30% of their trajectory with that condition. A sizable proportion of people in this cluster experience weight loss and revert to normal BMI after developing stroke, unlike people in cluster 1, who may actually gain weight. Compared to people in cluster 1 they are less likely to be non-smokers before the first disease, and a moderate proportion of them quit smoking after developing stroke.

We are aware that these interpretations are not complete and that many other factors (starting with age) need to be examined, but this is beyond the scope of the paper. The purpose of these interpretations is simply to show that the clusters are well separated and that they correspond to meaningful groups of individuals.

While there is significant literature on the problem of clustering time series of continuous variables [

Our goal was to devise a method that is sound, easy to explain and easy to implement using statistical software such as R or Matlab, taking advantage of already existing packages. While it is true that research in health care is becoming more and more interdisciplinary, it seems that a method for clustering health trajectories that requires users to write their own Markov chain Monte Carlo or dynamic programming routines would be of little use to the community of practitioners.

We believe the methodology is sound because it is based on the simple idea of representing a complex, unstructured object (a multivariate time series) with a well defined dynamic model: embedding data in generative models is a common theme in the machine learning literature, and it is a well-tested strategy [

Our approach to compute distances between HMMs is based on the notion of KL distance, and it is standard in the literature. The innovation we bring over methods such as those described in [

The advantage of our procedure over more sophisticated methods, such as those based on recursive calculations [

We have performed three tests on our method. For the first test we used a synthetic data set based on the 45 and Up and SEEF surveys, which consisted of 1,255 trajectories of people over age 45, with an average trajectory length equal to 18. The simulation that generated the data was complex and it was not built to contain any clusters. We applied our method to a non-trivial set of trajectories, where all subjects develop three chronic conditions and where the transition probabilities across hidden states depend on two time varying covariates (obesity and smoking). The proposed method allowed us to identify 4 clear clusters of records that differed in the dynamics of the chronic diseases as well as of the risk factors.

For the second test we used the results of the first test to build a simulated set of four clusters of trajectories generated by HMMs with 3 hidden states. We tested our method by checking that we recovered the correct clustering structure and number of hidden states.

For the third test we used a subset of the Health and Retirement Survey, a longitudinal study of people over age 50. This set consisted of 268 trajectories of length 10, with the same variables as the simulated data. Our method was able to identify 3 clear clusters that could be easily interpreted.

The main point of this paper was not to solve a specific problem, but to show a well-grounded method for clustering health trajectories that can be implemented quite simply with statistical software such as R or Matlab. All that is required to implement this method is a package that can estimate an HMM (possibly with covariates) from a multivariate continuous or categorical time series and that can compute the likelihood of a trajectory given an HMM. Once these two components are in place, clustering can be performed with any method that takes a distance matrix as input. We found PAM [

The main limitation of this method is that HMMs are not particularly meaningful for very short trajectories, since even for a small number of hidden states one could easily end up having to estimate more parameters than the number of observations. Therefore we do not expect it to work well if one only has a few time periods. In addition, while the method worked well with a few thousand observations it would run into difficulties with a very large number of observations, say one million, since it would require estimating one million HMMs and then running PAM on a distance matrix with one million rows. Clearly for large data sets some partitioning or a hierarchical approach, such as the one proposed in [

Shima Ghassempour is responsible for the development and implementation of the algorithm, designed the experiments and wrote the first draft of the paper. Federico Girosi created the synthetic data and interpreted the results. Anthony Maeder gave conceptual advice and contributed to the analysis of the results. All authors commented on the manuscript at all stages and gave final approval of the article.

The authors declare no conflict of interest.