1. Introduction
Markovian models such as homogeneous Markov chains, Mixture Transition Distribution (MTD) models, and Hidden Markov Models (HMMs) are widely used for the analysis of time series. They are, for instance, a standard tool in speech recognition and DNA analysis, and they are becoming more and more popular in psychology, finance, and the social sciences. However, some aspects of these models have not been fully studied yet, including the construction of confidence intervals (CIs). At least two reasons can explain this: (i) Markov chains were often used in a theoretical framework which does not require real data; (ii) even if models such as HMMs are in standard use today, they are often applied in fields where the amount of available data is very large, limiting the need to worry about sample size and confidence intervals. When confidence intervals are nevertheless required, the most common solution is to rely on resampling methods such as the bootstrap. This approach can be used with all Markovian models, but on the other hand, it can be very computationally intensive [1,2]. Therefore, the aim of this article is to provide an easy-to-calculate method for the evaluation of confidence intervals in the case of the MTD model, and then to extend the same approach to other Markovian models.
The approach described in this paper can be viewed as a development of a method already applied to homogeneous Markov chains: Since the transition matrix of a Markov chain can be seen as a set of multinomial distributions, it is possible to rely on methods developed for computing simultaneous confidence intervals on multinomial distributions. For instance, the R package markovchain [3] includes an implementation of this principle based on the work of Sison and Glaz [4]. This approach works well because, in empirical situations, the number of data points used to estimate each element of the transition matrix is generally known. However, this is not the case for the MTD and HMM models. In the case of HMMs, there exists a theoretical solution using the Hessian matrix, but it cannot be applied to large samples [5] or when some of the parameters lie on the boundary of the parameter space [2], and as far as we know there is still no published method for the construction of analytical confidence intervals in the case of the MTD model.
The purpose of this paper is to show that existing methods can also be applied to the MTD model and other advanced Markovian models, the only requirement being a way to correctly estimate the number of observations entering into the computation of each transition probability. The paper is organized as follows: In Section 2, we recall some principles used to build confidence intervals in the case of multinomial distributions. In Section 3, we show how to apply these principles in the case of the MTD model, and we extend the same approach to hidden models in Section 4. Numerical simulations are provided in Section 5, and a Discussion ends the paper.
2. Simultaneous Confidence Intervals for Multinomial Distributions
Let $\pi = (\pi_1, \ldots, \pi_c)$ be the theoretical probability distribution of a categorical random variable $Y$ taking values in $\{1, \ldots, c\}$. Let $(n_1, \ldots, n_c)$ be the observed frequency distribution computed from a sample of size $n$, and let $p = (p_1, \ldots, p_c)$ be the corresponding empirical probability distribution. The objective is to provide simultaneous confidence intervals for $\pi_1, \ldots, \pi_c$. One approach is to consider the related problem of the minimal sample size required to achieve a given precision of the estimation. Mathematically, we have to find the minimal $n$ such that
$$P\left(\bigcap_{i=1}^{c} \left\{ |p_i - \pi_i| \le d \right\}\right) \ge 1 - \alpha,$$
where $\alpha$ is the type I error, and where $d$ is the precision of the estimation, that is, the maximal difference allowed between any element of the theoretical probability distribution and its empirical estimate, or equivalently the half-length of the confidence interval around the empirical estimates.
This problem has been addressed several times. An easy-to-use but imprecise solution is to provide intervals of the form $p_i \pm d$ and to simply compute $d$ as $z_{1-\alpha/2}\sqrt{0.25/n}$, the worst case occurring at $p_i = 0.5$, which is similar to the confidence interval computed for a single binomial proportion. Following Adcock [6], $d$ can also be obtained from the following equation:
$$d = \sqrt{\frac{\chi^2_{\alpha/c}}{4n}},$$
where $\chi^2_{\alpha/c}$ is the threshold of a chi-square distribution with one degree of freedom and right-tail probability equal to $\alpha/c$. This method is very easy to use, but it suffers from some shortcomings, among them the fact that the resulting confidence interval is a function of $c$, the number of categories. Thompson [7] proposed to consider instead
$$n = \max_{m \ge 2} \; \frac{z^2_{\alpha/(2m)}}{d^2} \cdot \frac{1}{m}\left(1 - \frac{1}{m}\right),$$
where the maximum is taken over the integers $m$ and $z_{\alpha/(2m)}$ is the upper $\alpha/(2m)$ quantile of the standard normal distribution. Even if this rule is not perfect, its degree of conservatism being unknown, it presents the advantage of providing a precision $d$ independent of the number of categories.
Thompson provided the quantity $d^2 n$ for different values of $\alpha$, for instance $d^2 n = 1.00635$ for $\alpha = 0.10$, $1.27359$ for $\alpha = 0.05$, and $1.96986$ for $\alpha = 0.01$. Dividing this value by the desired squared precision $d^2$ gives the required sample size, whatever the number of categories of the distribution. For instance, $\alpha = 0.05$ and $d = 0.1$ lead to $n = 1.27359/0.1^2 \approx 127.36$, so a sample of at least 128 data points should be used.
Now, considering the reverse problem, it is possible to compute the actual half-length $d$ of the confidence interval given the number of data points used for the estimation. If we have, for instance, $n$ data points and we use $\alpha = 0.05$, the precision is equal to $d = \sqrt{1.27359/n}$, which means that all probabilities of the multinomial distribution have been estimated with a precision of $\pm d$.
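To illustrate, both directions of this computation take only a few lines. The following is a minimal sketch (in Python, with function names of our choosing) of Thompson's rule, assuming only a standard normal quantile function:

```python
# Minimal sketch of Thompson's [7] rule; function names are ours.
import numpy as np
from scipy.stats import norm

def thompson_d2n(alpha, m_max=100):
    """Return d^2 * n = max over m of z_{alpha/(2m)}^2 * (1/m)(1 - 1/m)."""
    m = np.arange(2, m_max + 1)
    z = norm.ppf(1 - alpha / (2 * m))  # upper alpha/(2m) normal quantile
    return np.max(z ** 2 * (1 / m) * (1 - 1 / m))

def required_n(alpha, d):
    """Minimal sample size for precision d (forward problem)."""
    return int(np.ceil(thompson_d2n(alpha) / d ** 2))

def precision(alpha, n):
    """Achieved half-length d for a sample of size n (reverse problem)."""
    return np.sqrt(thompson_d2n(alpha) / n)

print(round(thompson_d2n(0.05), 5))  # ~1.27359
print(required_n(0.05, 0.1))         # 128
```

Running it with $\alpha = 0.05$ reproduces the value 1.27359 and the sample size of 128 mentioned above.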
Confidence intervals derived from the above formulas have in common the fact that $d$ is equal for each probability of the distribution. This is a limitation, because we can expect the number of data points available for estimating each probability to influence the width of the confidence interval. Different solutions for the computation of non-equal simultaneous confidence intervals have been proposed by, e.g., Sison and Glaz [4], Bailey [8], Fitzpatrick and Scott [9], Goodman [10], Hou, Chiang and Tai [11], and Quesenberry and Hurst [12]. To the best of our knowledge, no published results systematically compare the performance of all of these approaches from a theoretical point of view. Nevertheless, the simulation studies of Cherry [13] suggest that the method proposed by Bailey generally performs well at a very light computational cost. In Bailey's method, the lower and upper bounds of the confidence interval for the $i$-th element of a multinomial distribution, including a continuity correction, are functions of two quantities $A$ and $B$ computed from the observed cell frequency, and of $C$, the upper $(\alpha/c) \cdot 100$th percentile of the chi-square distribution with one degree of freedom divided by $n$; the explicit expressions are given in Bailey [8].
Without entering into a comparison of the different approaches cited above, which is not the purpose of this paper, we note that all these methods require knowledge not only of the empirical probability distribution $p$, but also of the corresponding frequency distribution $(n_1, \ldots, n_c)$. While these quantities are generally known in the case of a single multinomial distribution or of a homogeneous Markov chain of any order, this is not the case for more complex models such as the MTD model. This is the reason why confidence intervals for these models are usually obtained only by means of the bootstrap, which can be very time consuming. In contrast, the method developed hereafter does not require repetitive computations on many different samples, so the computational cost is reduced.
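As an illustration of the non-equal interval methods cited above, the following sketch implements Goodman's [10] simultaneous intervals, whose closed form we can state with confidence; Bailey's intervals [8] follow the same usage pattern but rely on transformed cell frequencies. The function name and example data are ours:

```python
# Hedged sketch of Goodman's [10] simultaneous confidence intervals.
import numpy as np
from scipy.stats import chi2

def goodman_intervals(counts, alpha=0.05):
    """Simultaneous CIs for multinomial proportions, one per category."""
    x = np.asarray(counts, dtype=float)
    n = x.sum()
    c = len(x)
    A = chi2.ppf(1 - alpha / c, df=1)  # Bonferroni-adjusted chi-square threshold
    half = np.sqrt(A * (A + 4 * x * (n - x) / n))
    lower = (A + 2 * x - half) / (2 * (n + A))
    upper = (A + 2 * x + half) / (2 * (n + A))
    return lower, upper

lo, up = goodman_intervals([56, 72, 73, 59, 62], alpha=0.05)
```

Each interval depends on the corresponding observed count, so categories estimated from fewer data points receive wider intervals, which is precisely the behavior motivating this family of methods.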
3. Mixture Transition Distribution Models
In many cases, the use of high-order Markov chains is not possible, because the number of parameters increases exponentially with the order of the dependence. Different modelings have been proposed to reduce this problem. Among them, the Mixture Transition Distribution (MTD) model introduced by Raftery [14] proved to be very interesting, combining parsimony with quality of modeling [15]. Let $X_t$ be a categorical random variable taking values in $\{1, \ldots, c\}$. The principle of the MTD model is to consider independently the influence of each lag upon the present. The basic equation is
$$P(X_t = i_0 \mid X_{t-1} = i_1, \ldots, X_{t-f} = i_f) = \sum_{g=1}^{f} \lambda_g \, q_{i_g i_0},$$
where $f$ is the order of the Markov chain, $\lambda = (\lambda_1, \ldots, \lambda_f)$, with $\sum_{g=1}^{f} \lambda_g = 1$, is a vector of lag weights, and where $Q = [q_{i_g i_0}]$ is a transition matrix of the same size as the transition matrix associated with a first-order homogeneous Markov chain. This model is very parsimonious, since each new lag adds only one parameter to the model.
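The basic equation translates directly into code: each conditional distribution of the present state is a $\lambda$-weighted combination of rows of $Q$. A minimal sketch (names and example values are ours):

```python
# Sketch of the MTD transition probability; names and values are ours.
import numpy as np

def mtd_prob(past, lam, Q):
    """P(X_t = i0 | past) for each i0, where past = (x_{t-1}, ..., x_{t-f})."""
    f = len(lam)
    # Lag g contributes lam[g] times row x_{t-g} of Q.
    return sum(lam[g] * Q[past[g]] for g in range(f))

lam = np.array([0.6, 0.3, 0.1])          # lag weights, f = 3, summing to one
Q = np.array([[0.7, 0.3], [0.2, 0.8]])   # c = 2 states, rows sum to one
print(mtd_prob([0, 1, 1], lam, Q))       # distribution of X_t, sums to one
```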
The parameters of the MTD model can be decomposed into $c + 1$ different distributions: The $c$ rows of the transition matrix $Q$ and the vector of weights $\lambda$. When the elements of $\lambda$ are constrained to be non-negative, the usual case, the $c + 1$ sets of parameters are all probability distributions. The requirement for applying the methods of Section 2 is then to be able to estimate the number of data points used to compute each of these $c + 1$ distributions. That can be done using a principle similar to the E-step of an EM algorithm [16]. Suppose that we have $n$ successive observations of a random variable numbered from $x_1$ to $x_n$. Suppose that at each time $t$, the probability distribution was generated by one of the $f$ lags, and let $D_t$ be a length-$f$ indicator vector defined as
$$D_t(g) = \begin{cases} 1 & \text{if lag } g \text{ was used to generate } x_t, \\ 0 & \text{otherwise.} \end{cases}$$
The lag really used to explain $x_t$ being unobserved, we say that the model includes a hidden component. Following Böhning [17], the expectation of $D_t(g)$ is computed as
$$E[D_t(g)] = \frac{\lambda_g \, q_{x_{t-g} x_t}}{\sum_{h=1}^{f} \lambda_h \, q_{x_{t-h} x_t}}.$$
The weight coefficient for the $g$-th lag is then estimated as
$$\hat{\lambda}_g = \frac{1}{n - f} \sum_{t=f+1}^{n} E[D_t(g)],$$
and the vector
$$n_\lambda = \sum_{t=f+1}^{n} E[D_t]$$
gives an estimation of the number of data points used to compute the coefficients $\lambda_g$. Confidence intervals are then obtained by using one of the methods of Section 2.
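The E-step above can be sketched as follows, assuming an integer-coded data sequence and current estimates of $\lambda$ and $Q$ (function and variable names are ours):

```python
# Sketch of the E-step for the MTD lag indicators, following Böhning [17].
import numpy as np

def expected_lag_indicators(x, lam, Q):
    """E[D_t(g)] for t = f+1, ..., n (one row per usable time point)."""
    f, n = len(lam), len(x)
    E = np.zeros((n - f, f))
    for i, t in enumerate(range(f, n)):
        # Contribution of each lag g: lam_g * q_{x_{t-g}, x_t}.
        contrib = np.array([lam[g] * Q[x[t - g - 1], x[t]] for g in range(f)])
        E[i] = contrib / contrib.sum()  # posterior probability of each lag
    return E

x = [0, 1, 1, 0, 1, 0, 0, 1]
lam = np.array([0.6, 0.4])
Q = np.array([[0.7, 0.3], [0.2, 0.8]])
E = expected_lag_indicators(x, lam, Q)
n_lambda = E.sum(axis=0)       # estimated data counts behind each lambda_g
lam_hat = n_lambda / len(E)    # re-estimated weights
```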
The computation of confidence intervals for the parameters of the transition matrix $Q$ is very similar: In a MTD model of order $f$, the transition matrix $Q$ is used to represent the relationship between each of the $f$ lags of the model and the present situation. Therefore, each transition probability of the form $q_{k i_0}$ represents the $f$ following situations where the lag takes value $k$ and the current observation is state $i_0$: ($x_{t-1} = k$ and $x_t = i_0$), ($x_{t-2} = k$ and $x_t = i_0$), …, ($x_{t-f} = k$ and $x_t = i_0$). Let $N^Q$ be a $c \times c$ matrix whose element $n^Q_{k i_0}$ is defined as
$$n^Q_{k i_0} = \sum_{t=f+1}^{n} \sum_{g=1}^{f} E[D_t(g)] \; \mathbb{1}(x_{t-g} = k, \, x_t = i_0).$$
This matrix contains estimates of the number of data points used to calculate each element of $Q$. From there, the use of the methods described in Section 2 becomes possible.
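A possible implementation of $N^Q$, reusing the expectations $E[D_t(g)]$ returned by the sketch of the previous paragraph (again, names are ours and the code is only illustrative):

```python
# Sketch of the count matrix N^Q; expects E from expected_lag_indicators().
import numpy as np

def nq_matrix(x, E, f, c):
    """Element (k, i0) accumulates E[D_t(g)] whenever x_{t-g} = k and x_t = i0."""
    NQ = np.zeros((c, c))
    for i, t in enumerate(range(f, len(x))):
        for g in range(f):
            NQ[x[t - g - 1], x[t]] += E[i, g]
    return NQ

# NQ = nq_matrix(x, E, f=2, c=2); each row total then feeds the
# confidence-interval formulas of Section 2 for the corresponding row of Q.
```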
After estimation, the MTD model can be used to construct an approximation of the high-order transition matrix of the Markov chain. The number of data points used to compute the probability corresponding to $P(X_t = i_0 \mid X_{t-1} = i_1, \ldots, X_{t-f} = i_f)$ is estimated by
$$n(i_1, \ldots, i_f; i_0) = \sum_{g=1}^{f} \hat{\lambda}_g \, n^Q_{i_g i_0},$$
and the total number of data points used for the row defined by $(i_1, \ldots, i_f)$ is
$$n(i_1, \ldots, i_f) = \sum_{i_0=1}^{c} n(i_1, \ldots, i_f; i_0).$$
These quantities are then used to compute confidence intervals for the probabilities in the order-$f$ transition matrix.
The basic MTD model allows many extensions, among them the MTDg model, in which a different matrix is used to represent the transition process between each lag and the present. There are then $f$ transition matrices $Q^{(1)}, \ldots, Q^{(f)}$ and $f$ corresponding matrices $N^{Q^{(g)}}$, the element $n^{Q^{(g)}}_{k i_0}$ of $N^{Q^{(g)}}$ being defined as
$$n^{Q^{(g)}}_{k i_0} = \sum_{t=f+1}^{n} E[D_t(g)] \; \mathbb{1}(x_{t-g} = k, \, x_t = i_0).$$
The number of data points used to compute the probability defined by $(i_1, \ldots, i_f; i_0)$ in the order-$f$ transition matrix is then estimated by
$$n(i_1, \ldots, i_f; i_0) = \sum_{g=1}^{f} \hat{\lambda}_g \, n^{Q^{(g)}}_{i_g i_0}.$$
4. Extension to Hidden Markov Models
The principles developed in the context of the MTD model can also be applied to hidden models. In a Hidden Markov Model (HMM), the state taken by the Markov chain at time $t$ is unknown. We observe instead a second random variable whose distribution depends on the state taken by the hidden chain. Let $S_t$ be a categorical random variable taking values in $\{1, \ldots, k\}$ and governed by a hidden Markov chain of order $\ell$. Let $Y_t$ be a categorical random variable taking values in $\{1, \ldots, m\}$. To each of the $k$ states of $S_t$ corresponds a different probability distribution for $Y_t$, the distribution actually used at time $t$ being unknown. See, e.g., Zucchini and MacDonald [2] for a good introduction to HMMs.
We consider a first-order hidden transition matrix and $n$ successive observations of $Y_t$ numbered from $y_1$ to $y_n$. Three different sets of probabilities are used to fully identify the model: The unconditional distribution of the first hidden state, denoted by $\delta$, the transition matrix between hidden states, $A = [a_{ij}]$, and the distributions of the random variable $Y_t$ given each of the $k$ hidden states, $b_1, \ldots, b_k$.
Theoretically, assuming the asymptotic normality of the maximum likelihood estimators of the parameters around their true values, it is possible to compute an exact confidence interval for each parameter of a HMM [18]. In practice, however, it is not possible to use this method when the series of interest is more than about $n = 100$ data points long [5]. At least three alternative methods have been introduced: Finite-differences approximation [19], bootstrap [20], and likelihood profiles [21]. These methods present the advantage of being usable on long time series. On the other hand, as noted by Visser, Raijmakers and Molenaar [5], results can be very sensitive to truncation and rounding errors. Moreover, the required computations are very intensive. The finite-differences approximation of the Hessian matrix is obtained by considering the second-order partial derivatives of the log-likelihood. This approximation is necessary because the true Hessian matrix cannot be computed on large samples. The bootstrap consists in generating a large number of new samples by sampling with replacement from the original data. The model of interest is then estimated on each sample, which provides an approximation of the distribution of each parameter. Likelihood profiling consists in comparing the likelihood of the optimal model with the likelihood of a model in which the value of one of the parameters is slightly increased by an amount $a$. Different trials are performed in order to find the maximal value of $a$ for which the likelihoods of the original and modified models are still statistically identical. The resulting value of $a$ is then considered as the upper bound of the confidence interval for the parameter. Repeating the same procedure with a negative value of $a$ provides the lower bound of the interval. The same method is successively applied to each parameter of the model.
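As an illustration of likelihood profiling, the following hedged sketch searches for the upper half-width of a confidence interval by a simple step search; `profile_loglik` is a hypothetical user-supplied function returning the maximized log-likelihood of the model when the chosen parameter is fixed at its optimum plus the given offset:

```python
# Generic sketch of likelihood profiling; profile_loglik is hypothetical.
from scipy.stats import chi2

def profile_half_width(profile_loglik, ll_opt, alpha=0.05,
                       step=1e-3, max_steps=10_000):
    """Largest offset a whose profiled model stays compatible with the optimum."""
    threshold = chi2.ppf(1 - alpha, df=1) / 2.0  # likelihood-ratio cutoff
    a = 0.0
    for _ in range(max_steps):
        if ll_opt - profile_loglik(a + step) > threshold:
            break
        a += step
    return a  # rerun with a sign-flipped profile function for the lower bound
```

Each parameter requires its own pair of searches, and each step re-optimizes the model, which is why the method is computationally intensive.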
We now show that the approach proposed for MTD models can also be applied to HMMs at a very low cost. The percentage of the time the model is using each of the $k$ possible distributions for the observed random variable $Y_t$ is unknown. However, this percentage can be estimated as follows: HMM parameters are estimated using the Baum-Welch algorithm [22], a customized version of the Expectation-Maximization (EM) algorithm. Similarly to what we did for the MTD model, we suppose that at each time $t$, one of the $k$ possible distributions was used. Let $H_t$ be a size-$k$ vector indicating which distribution was used at time $t$:
$$H_t(j) = \begin{cases} 1 & \text{if distribution } b_j \text{ was used at time } t, \\ 0 & \text{otherwise.} \end{cases}$$
During the E-step of the estimation algorithm, the expectation of $H_t$ is computed as follows: First, we define two auxiliary quantities:
$$\alpha_t(j) = P(y_1, \ldots, y_t, S_t = j), \qquad \beta_t(j) = P(y_{t+1}, \ldots, y_n \mid S_t = j).$$
Using $\alpha_1(j) = \delta_j \, b_j(y_1)$ and $\beta_n(j) = 1$, $j = 1, \ldots, k$, to initialize the computation, we obtain iteratively
$$\alpha_t(j) = \left[ \sum_{i=1}^{k} \alpha_{t-1}(i) \, a_{ij} \right] b_j(y_t), \qquad \beta_t(i) = \sum_{j=1}^{k} a_{ij} \, b_j(y_{t+1}) \, \beta_{t+1}(j).$$
The expectation of $H_t(j)$ is then
$$E[H_t(j)] = \frac{\alpha_t(j) \, \beta_t(j)}{L},$$
where $L = \sum_{j=1}^{k} \alpha_n(j)$ is the likelihood of the observed data. The total number of data points used to compute $b_j$ is estimated as
$$n_{b_j} = \sum_{t=1}^{n} E[H_t(j)],$$
and this quantity is used to compute confidence intervals for the probabilities in $b_j$. Notice that multiplying the above quantity by the $r$-th element of $b_j$ provides an estimation of the number of data points used to compute this $r$-th element.
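The computations above correspond to the standard forward-backward pass. A minimal unscaled sketch, suitable for short sequences only (all names and example values are ours):

```python
# Sketch of the forward-backward pass for a first-order HMM.
import numpy as np

def state_expectations(y, delta, A_mat, B):
    """E[H_t(j)] = alpha_t(j) * beta_t(j) / L for every t and hidden state j."""
    n, k = len(y), len(delta)
    alpha = np.zeros((n, k))
    beta = np.ones((n, k))
    alpha[0] = delta * B[:, y[0]]
    for t in range(1, n):                      # forward recursion
        alpha[t] = (alpha[t - 1] @ A_mat) * B[:, y[t]]
    for t in range(n - 2, -1, -1):             # backward recursion
        beta[t] = A_mat @ (B[:, y[t + 1]] * beta[t + 1])
    L = alpha[-1].sum()                        # likelihood of the observed data
    return alpha * beta / L

y = [0, 1, 1, 0, 2]
delta = np.array([0.5, 0.5])
A_mat = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.4, 0.5]])  # rows: states; cols: symbols
gamma = state_expectations(y, delta, A_mat, B)
n_b = gamma.sum(axis=0)        # counts behind each emission distribution b_j
n_A = gamma[:-1].sum(axis=0)   # counts behind each row of A (see below)
```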
By definition of the model, each observation of $Y_t$ is directly linked to a hidden state, so the length of the unobserved sequence of hidden states is $n$. The probability distribution of the first hidden state is given by $E[H_1]$, and the sum of this vector, one, is the number of data points used to compute $\delta$. Obviously, one data point is not enough to obtain a reliable estimation, but in practice, when only one time series is used to identify the model, precise knowledge of the active state at the beginning of the series is generally useless. On the other hand, when a HMM is estimated from several independent samples, the total number of data points used to compute each vector $E[H_1]$, and in particular $\delta$, is equal to the number of independent samples, which can lead to a reliable estimation of $\delta$.
The number of data points entering into the computation of the $g$-th row of the hidden transition matrix $A$ is estimated as
$$n_{A_g} = \sum_{t=1}^{n-1} E[H_t(g)],$$
and this quantity is used to obtain confidence intervals for the elements of this $g$-th row. The summation is taken up to $n - 1$ rather than $n$, because the transition matrix expresses the relation between two successive hidden states, and a sample of size $n$ contains only $n - 1$ transitions.
We turn now to the case of a high-order HMM, that is, a HMM with a transition matrix $A$ of order $\ell > 1$. In addition to the computation of the first hidden state distribution, $\delta$, the complete identification of the model requires the computation of the distribution of the second hidden state given the first one, the distribution of the third hidden state given the first two, and so on, up to the distribution of the $\ell$-th hidden state given the first $\ell - 1$ ones. In the Baum-Welch algorithm, the vectors $E[H_t]$ are replaced by matrices giving the expectation of observing $\ell$ successive hidden states, and, for $t < \ell$, by matrices giving the expectation of observing $t$ successive hidden states. The number of data points used to compute a probability distribution in the model is then estimated by summing up the corresponding elements of these expectation matrices.
Finally, it is also possible to use the same principle to compute confidence intervals in the case of the Double Chain Markov Model (DCMM). This model, first introduced by Paliwal [23] in the context of speech recognition and then developed and generalized by Berchtold [24,25], is an extension of the basic HMM in which the hypothesis of conditional independence between successive observations of the random variable $Y_t$ is removed, the relation being instead modeled by means of a Markov chain. The probability distributions $b_1, \ldots, b_k$ are then replaced by $k$ transition matrices, possibly of high order.
6. Discussion
The computation of confidence intervals for the parameters of Markovian models is made difficult by the possibly large number of parameters involved, and because parts of these models are sometimes not directly observed (hidden models). However, these issues are tractable using appropriate principles and algorithms. First of all, the parameters of Markovian models can be decomposed into a set of probability distributions. Secondly, when parameters of a model are hidden, it is possible to estimate the number of data points entering into their computation through the use of one iteration of the E-step of an EM algorithm. Putting these two elements together leads to the possibility of using established formulas to estimate confidence intervals.
Other methods have been proposed for the construction of confidence intervals in Markovian models, but as mentioned previously, exact computation is only feasible on very short data sequences. Alternative methods such as likelihood profiling and the bootstrap address this issue, but they are very computationally intensive, as shown through our simulations. For instance, the bootstrap implies repeatedly optimizing the same model on thousands of different samples, and likelihood profiling must be performed separately to find the lower and upper bounds of each confidence interval of a model. Even if the bootstrap can be considered as the current gold standard for calculating CIs in Markovian models, its use is made difficult when the available processor time is limited, or when the model of interest is difficult to optimize. On the other hand, the approach developed in this article is much more efficient because, whatever the number of parameters of the model, the required calculations are mainly based on a single execution of the optimization algorithm. It could therefore prove very interesting for the development of real-time or near-real-time applications.
Simulation results indicate that confidence intervals obtained from our approach are consistent with those obtained from previous methods. Moreover, our approach presents the advantage of unifying the computation of confidence intervals for homogeneous Markov chains, MTD models, and hidden models within the same theoretical framework. However, it must be recalled that the optimization of some models can prove very difficult, and that a given model is not always able to correctly represent the data. In such situations, the computation of confidence intervals may be misleading, giving a false impression of accuracy. Therefore, it is always necessary to question the merits of using one model rather than another, and to select the correct model as a function of the data to be analyzed. As demonstrated by the first simulation study, the standard MTD model, for example, is not suitable for all types of autocorrelation, and in some cases it does not necessarily represent the best choice.