Consistent Estimation of Partition Markov Models

Abstract: The Partition Markov Model characterizes the process by a partition L of the state space, where the elements in each part of L share the same transition probability to an arbitrary element in the alphabet. This model aims to answer the following questions: what is the minimal number of parameters needed to specify a Markov chain, and how can these parameters be estimated? In order to answer these questions, we build a consistent strategy for model selection which consists of the following: given a size-n realization of the process, find a model within the Partition Markov class, with a minimal number of parts, to represent the law of the process. From this strategy, we derive a measure that establishes a metric on the state space. In addition, we show that if the law of the process is Markovian, then, eventually, as n goes to infinity, L will be retrieved. We show an application to modeling internet navigation patterns.


Introduction
Markov models have received enormous visibility for being powerful tools [1][2][3]. In recent years, theoretical advances have allowed users to identify the most suitable methods for estimating them. For instance, [4] shows that the Bayesian Information Criterion (BIC) [5] can be used to consistently choose a Variable Length Markov Chain model in an efficient way using the Context Tree Maximization (CTM) algorithm. See also [6,7]. In this paper, we show that the BIC is also consistent for estimating a more general Markovian family (the Partition Markov Models), which includes the Variable Length Markov Chain models and the complete Markov chains. We consider a discrete stationary process with finite alphabet A of size |A|. Markov chains of finite order are widely used to model stationary processes with finite memory. In databases with Markovian structure, a pronounced degree of redundancy is frequently observed, which means that different sequences of symbols have the same effect on the law of the process. For example, in datasets coming from the linguistic field, we can observe words which are synonyms. In some cases, exchanging a word for a synonym does not change the meaning of sentences. In a more general context, there are also sequences of several words which are equivalent in that sense. See, for instance, [8,9]. For this kind of data, a model should retrieve and use the redundancy to improve the quality of the estimate. The Partition Markov Model represents the redundancy through a partition of the state space (see [10]). See also [11] and the literature contemporary with [10]. Under the assumption of this family, we address the problem of model selection, showing that the model can be selected consistently using the BIC. We show that, in order to apply the BIC criterion, it is not necessary to find a global maximum inside the set of partitions, which would be impossible even for a moderate size of the state space. Instead, it is possible to start the searching process from an initial partition, for instance, the state space itself, and then coarsen the partition step by step. This process is associated with a metric that governs the state space.
The Partition Markov Models are being used and explored intensively: for instance, [12] combines two statistical concepts, Copulas and Partition Markov Models, with the purpose of defining a natural correction for the estimator of the transition probabilities of a multivariate Markov process. Moreover, Reference [13] presents a simulation study that identifies when this correction performs well. A second strategy to deal with this issue is shown and applied to real data in [14]. The idea is to combine, through a copula, the partitions coming from the marginal processes and the partition coming from the multivariate process. This strategy shows excellent theoretical properties that are essential to increasing the predictive ability of the estimation. In [14], the strategy was applied to multivariate Brazilian financial data in order to show how this new estimator allows for considering a longer past (order of the process) in the estimation of the transition probabilities, in comparison with the past allowed in the Partition Markov Models when the data size is not large enough to ensure reliable results. The application of Partition Markov Models has also been useful to reveal important facts in other areas, as shown in [15], where these models were applied to written texts of European Portuguese in order to identify change points over the period from the 16th to the 19th century.
Here is a description of the issues addressed in this article, section by section. In Section 2, we introduce the concept of a Markov chain with partition L, which is a partition of the state space defined through a stochastic equivalence between the strings of the state space (see [10]). In Section 3, we describe the model selection procedure for choosing the optimal partition, which is based on the BIC criterion and on the concept of good partitions of the state space, also introduced in that section. We introduce a distance between the parts of a partition; this concept defines a metric on the state space and also allows us to build efficient algorithms for estimating the optimal partition (see [10]). In Section 3, we also show that the optimal partition can be obtained through the BIC criterion, eventually almost surely, as the sample size tends to infinity. Section 4 shows the application to modeling navigation patterns on a website. We conclude this paper with a discussion in Section 5. The proofs of the results introduced in this paper are included in Appendices A and B.

Preliminaries
Let (X_t) be a discrete time Markov chain of order M on a finite alphabet A, with M < ∞. Let us call S = A^M the state space. Denote the string a_m a_{m+1} ... a_n by a_m^n, where a_i ∈ A, m ≤ i ≤ n. For each a ∈ A and s ∈ S, P(a|s) = Prob(X_t = a | X_{t−M}^{t−1} = s). Let L = {L_1, L_2, ..., L_{|L|}} be a partition of S; for all a ∈ A and L ∈ L, define P(L, a) = Σ_{s∈L} Prob(X_{t−M}^{t−1} = s, X_t = a) and P(L) = Σ_{s∈L} Prob(X_{t−M}^{t−1} = s). If P(L) > 0, we define P(a|L) = P(L, a)/P(L). With the purpose of formulating the model, we introduce the following equivalence relation.

Definition 1. Let (X_t) be a discrete time Markov chain of order M on a finite alphabet A, with state space S = A^M:
(i) s, r ∈ S are equivalent (denoted by s ∼_p r) if P(a|s) = P(a|r) ∀ a ∈ A;
(ii) (X_t) is a Markov chain with partition L = {L_1, L_2, ..., L_{|L|}} if this partition is the one defined by the equivalence relation ∼_p introduced in item (i).
The equivalence relation defines a partition on S. The parts of this partition are subsets of S with the same transition probabilities; i.e., s, r ∈ S are in different parts if, and only if, they have different transition probabilities. To understand more deeply the motivation for a model like this, we note that the full Markov chains show a restriction in terms of point estimation, which is, for an order M model, that the number of parameters, given by |A|^M (|A| − 1), grows exponentially with the order M. Another limitation is that the class of full Markov chains is not very rich, since, for a fixed alphabet A, there is just one model for each order M, and in practical situations a more flexible structure could be necessary. For an extensive discussion of these two restrictions, see [1]. A well-known and richer class of finite order Markov models, introduced by [1,2], is composed of the Variable Length Markov Chains (VLMC). In the VLMC class, each model is identified by a prefix tree T called a Context Tree. For a given model with Context Tree T, the total number of parameters is |T|(|A| − 1). We will see later that Definition 1(ii) supports both complete Markov chains and VLMC, becoming a natural extension of the two possibilities.
Remark 1. Given a Markov chain over the alphabet A = {a_1, a_2, ..., a_{|A|}} with partition L = {L_1, L_2, ..., L_{|L|}} (Definition 1(ii)), in order to specify the process it is necessary to estimate (|A| − 1) transition probabilities for each part in L. Thus, the set of parameters to estimate is {P(a_i | L_j) : 1 ≤ i < |A|, 1 ≤ j ≤ |L|}, and the total number of parameters for the model is |L|(|A| − 1). If the estimation of the transition probabilities is performed under any other conception, for instance, considering a complete Markov chain or a VLMC, the number of parameters to estimate will be higher than |L|(|A| − 1), since those models do not take into account that there are strings that share transition probabilities.
The structure of a VLMC can be expressed by a partition in the sense described before.Each model in the family of VLMC models is identified by its Context Tree, and we will use this structure to establish the relation between VLMC and Partition Markov Models (see Example 1).
Example 1. Let (X_t) be a finite order Markov chain taking values in A = {0, 1}, and let T be a set of sequences of symbols from A such that no string in T is a suffix of another string in T, with d(T) = max{l(s) : s ∈ T}, where l(s) is the length of the string s ∈ T. Consider d(T) = 3 and T = {{0}, {01}, {011}, {111}}. Define a partition of A^3 by grouping the strings of A^3 according to their suffix in T, that is, L = {{000, 010, 100, 110}, {001, 101}, {011}, {111}}; note that L need not check Definition 1(ii), since distinct parts may still share the same transition probabilities.
In the next example, we can see a situation in which the economy in the number of parameters achieved by a Partition Markov Model following Definition 1(ii) can be observed (see also Remark 1).
Example 2. Let (X_t) be a finite order Markov chain taking values in A = {0, 1} with state space A^3. Suppose that this chain follows the transition probabilities given in Table 1. Considering the process as a full chain, we have eight parameters. If we look more closely, a Context Tree is enough to describe the process, with just four parameters, because T = {{0}, {01}, {011}, {111}}. Moreover, if we analyze this situation from the perspective of Definition 1(ii), we note that only two parameters are needed to describe the source, since L = {{000, 001, 010, 100, 101, 110, 111}, {011}}, because only the string 011 has a different transition probability to 0.
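The parameter counts of this example can be checked mechanically. The sketch below (plain Python; the probability values other than the distinguished one for 011 are illustrative, since Table 1 is not reproduced here) recovers the partition of Definition 1(ii) from a transition table and compares the counts.

```python
from itertools import product

A = ['0', '1']          # alphabet
M = 3                   # order
S = [''.join(p) for p in product(A, repeat=M)]  # state space A^M

# Transition probabilities in the spirit of Example 2 (illustrative values:
# every state shares the same law except 011, which differs).
P = {s: {'0': 0.2, '1': 0.8} for s in S}
P['011'] = {'0': 0.9, '1': 0.1}

def partition_by_law(states, law):
    """Group states whose transition distributions coincide (s ~_p r)."""
    parts = {}
    for s in states:
        key = tuple(sorted(law[s].items()))
        parts.setdefault(key, []).append(s)
    return [sorted(p) for p in parts.values()]

L = partition_by_law(S, P)

full_params = len(A) ** M * (len(A) - 1)   # complete order-3 chain: 8
pmm_params = len(L) * (len(A) - 1)         # Partition Markov Model: 2
```

With this table, the partition has two parts ({011} and everything else), so the model needs only two parameters against the eight of the full chain.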

Let x_1^n be a sample of the process (X_t), s ∈ S, a ∈ A and n > M. We denote by N_n(s, a) the number of occurrences of the string s followed by a in the sample x_1^n, that is, N_n(s, a) = |{t : M < t ≤ n, x_{t−M}^{t−1} = s, x_t = a}|. In addition, the number of occurrences of s in the sample x_1^n is denoted by N_n(s), with N_n(s) = |{t : M < t ≤ n, x_{t−M}^{t−1} = s}|. The number of occurrences of elements of L followed by a, and the total number of occurrences of strings of L, are given by N_n(L, a) = Σ_{s∈L} N_n(s, a) and N_n(L) = Σ_{s∈L} N_n(s). In order to simplify the notation, we use the same notation N_n with different arguments, a string s or a part L. In addition, we note that N_n(L) is a function of the partition L. As a consequence, if we write P(x_1^n) = Prob(X_1^n = x_1^n), we obtain, under the assumption of a hypothetical partition L of S,

log P(x_1^n) = log P(x_1^M) + Σ_{L∈L} Σ_{a∈A} N_n(L, a) log P(a|L).   (2)

The Bayesian Information Criterion (BIC) is defined through a modified maximum likelihood (see [4]). We will call maximum likelihood the maximization of the second term in Equation (2) for a given observation x_1^n, attained at P(a|L) = N_n(L, a)/N_n(L). We denote that term by ML(L, x_1^n), and the BIC is given by the next definition.
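The counts and the maximized likelihood term can be computed directly from a sample. The following sketch (Python; function names are ours) follows the definitions of N_n(s, a), N_n(L, a) and ML(L, x_1^n) given above.

```python
import math
from collections import Counter

def counts(x, M):
    """N_n(s, a): occurrences of the length-M string s followed by a in x_1^n."""
    N = Counter()
    for t in range(M, len(x)):
        N[(x[t - M:t], x[t])] += 1
    return N

def ML(partition, N, alphabet):
    """Maximized second term of Eq. (2): sum of N_n(L,a) log(N_n(L,a)/N_n(L))."""
    total = 0.0
    for L in partition:                       # a partition is a list of parts
        NL = sum(N[(s, a)] for s in L for a in alphabet)
        for a in alphabet:
            NLa = sum(N[(s, a)] for s in L)
            if NLa > 0:
                total += NLa * math.log(NLa / NL)
    return total
```

On a deterministic sample, the singleton partition gives ML = 0 (every observed transition has empirical probability 1), while merging distinct states can only lower the likelihood.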
Definition 2. Given a sample x_1^n of the process (X_t), a discrete time Markov chain of order M on a finite alphabet A with state space S = A^M, and L a partition of S, the BIC of the model given by Definition 1(ii), according to the modified likelihood, is

BIC(L, x_1^n) = ML(L, x_1^n) − ((|A| − 1)/2) |L| log n.   (3)

Remark 2. The results of this paper remain valid if we replace, in Definition 2, the constant (|A| − 1)/2 by some arbitrary constant v, positive and finite.

Below, we define some concepts that help, in practice, to limit the search for an ideal partition to a subset of possible partitions with natural characteristics.

Definition 3. Let (X_t) be a discrete time Markov chain of order M on a finite alphabet A and S = A^M the state space. Let L = {L_1, L_2, ..., L_{|L|}} be a partition of S:
(i) L_i is a good part of S if s ∼_p r for all s, r ∈ L_i;
(ii) L is a good partition of S if, for each i ∈ {1, ..., |L|}, L_i verifies item (i).
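Definition 2 can be sketched in code as the maximized likelihood term penalized by ((|A| − 1)/2) |L| log n; this penalty form follows [4,10], and the reading below is our own.

```python
import math
from collections import Counter

def bic(partition, x, M, alphabet):
    """BIC(L, x_1^n) = ML(L, x_1^n) - ((|A|-1)/2)|L| log n (our reading of Def. 2)."""
    n = len(x)
    N = Counter()
    for t in range(M, n):
        N[(x[t - M:t], x[t])] += 1
    ml = 0.0
    for L in partition:
        NL = sum(N[(s, a)] for s in L for a in alphabet)
        for a in alphabet:
            NLa = sum(N[(s, a)] for s in L)
            if NLa > 0:
                ml += NLa * math.log(NLa / NL)
    # Penalty: (|A|-1)/2 parameters per part, times log n.
    return ml - (len(alphabet) - 1) / 2 * len(partition) * math.log(n)
```

For a strictly alternating sample, the two states have different laws, so keeping them in separate parts yields the larger BIC despite the heavier penalty.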
Example 3. We will consider two situations: (ii) considering Example 1, the partition L defined there is a good partition of S = {0, 1}^3.
If L is a good partition of S, we define, for each part L ∈ L, P(a|L) = P(a|s) for every a ∈ A, where s ∈ L (the value does not depend on the choice of s ∈ L).
We introduce a notation that will be used in the next results.

Notation 1.
(a) Let L_ij denote the partition obtained from L by replacing the parts L_i and L_j with their union, that is, L_ij = (L \ {L_i, L_j}) ∪ {L_i ∪ L_j}.
(b) For a ∈ A, we write P(L_ij, a) = P(L_i, a) + P(L_j, a) and P(L_ij) = P(L_i) + P(L_j). In addition, N_n(L_ij, a) = N_n(L_i, a) + N_n(L_j, a) and N_n(L_ij) = N_n(L_i) + N_n(L_j).

Note that, if L is a good partition and P(·|L_i) = P(·|L_j), then L_ij is a good partition. We show a way to build partitions, from good partitions, which are candidates more suitable for checking Definition 1(ii). This way of building partitions seeks to reduce the size of the partition step by step.
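Representing a partition as a list of parts, the merging operation of Notation 1(a) is a one-liner; a minimal sketch:

```python
def merge_parts(partition, i, j):
    """L_ij of Notation 1(a): the partition obtained from L by replacing
    the parts L_i and L_j with their union."""
    assert i != j
    rest = [part for k, part in enumerate(partition) if k not in (i, j)]
    return rest + [sorted(partition[i] + partition[j])]
```

For instance, merging the two parts of the partition in Example 2 gives back the trivial one-part partition of the state space.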

A Metric on the State Space
The next result allows us to formulate the main findings of this section. Note that this result can be applied to partitions with at least two good parts; i.e., it is not necessary for the whole partition to be good. In this section, we also define a measure to quantify the distance between the parts of a partition. This distance is based on the practical use of the next theorem and allows building an efficient algorithm for estimating the partition given by Definition 1(ii). For a complementary perspective, see [16].
Theorem 1. Let (X_t) be a Markov chain of order M over a finite alphabet A, S = A^M the state space and x_1^n a sample of the Markov process. Let L = {L_1, L_2, ..., L_{|L|}} be a partition of S, and suppose that there exist i and j, i ≠ j, such that L_i and L_j verify Definition 3(i) (are good parts). Then, P(a|L_i) = P(a|L_j) ∀ a ∈ A if, and only if, eventually almost surely as n → ∞, BIC(L, x_1^n) < BIC(L_ij, x_1^n), where L_ij is defined from L by Notation 1(a).

Proof. See Appendix A.1.
It is also possible to decide simultaneously whether more than two good parts should be put together, as shown in the next corollary.

Corollary 1. Let (X_t) be a Markov chain of order M over a finite alphabet A, S = A^M the state space and x_1^n a sample of the Markov process. If L = {L_1, L_2, ..., L_{|L|}} is a partition of S with K_1 good parts, denoted by {L_{i_k}}_{k=1}^{K_1}, and T is an index set, T ⊆ {1, ..., K_1}, then P(a|L_{i_k}) = P(a|L_{i_l}) ∀ a ∈ A, ∀ k, l ∈ T if, and only if, eventually almost surely as n → ∞, BIC(L, x_1^n) < BIC(L_T, x_1^n), where L_T denotes the partition that joins the |T| good parts into ∪_{k∈T} L_{i_k}, generalizing Notation 1(a).

Proof. Replace Equation (A3) in the proof of Theorem 1 by its analogue for the |T| parts being joined; applying the log-sum inequality, the result follows.
Remark 3. Let (X_t) be a Markov chain of order M over a finite alphabet A, S = A^M the state space and x_1^n a sample of the Markov process. Let L = {L_1, L_2, ..., L_{|L|}} be a partition of S, and let i, j ∈ {1, 2, ..., |L|}, i ≠ j, be such that L_i and L_j verify Definition 3(i) (are good parts). If P(a|L_i) ≠ P(a|L_j) for some a ∈ A, then, eventually almost surely as n → ∞, BIC(L, x_1^n) > BIC(L_ij, x_1^n), where L_ij is given by Notation 1(a).

Now, we can introduce a distance in L. This distance allows us to establish a metric on the state space S.

Definition 4. Let (X_t) be a Markov chain of order M, with finite alphabet A and state space S = A^M, x_1^n a sample of the process, and let L = {L_1, L_2, ..., L_{|L|}} be a good partition of S. For i ≠ j, define

d_L(i, j) = (2 / ((|A| − 1) log n)) Σ_{a∈A} { N_n(L_i, a) log P̂(a|L_i) + N_n(L_j, a) log P̂(a|L_j) − N_n(L_ij, a) log P̂(a|L_ij) },

where P̂(a|L) = N_n(L, a)/N_n(L) whenever N_n(L) > 0.

The next theorem shows that d_L is a distance in L.
Theorem 2. Let (X_t) be a Markov chain of order M over a finite alphabet A, S = A^M the state space and x_1^n a sample of the Markov process. If L = {L_1, L_2, ..., L_{|L|}} is a good partition of S, then, for each n and for any i, j, k ∈ {1, 2, ..., |L|}: (i) d_L(i, j) ≥ 0; (ii) d_L(i, j) = d_L(j, i); (iii) d_L(i, j) ≤ d_L(i, k) + d_L(k, j).

Some observations of a practical order are appropriate at this point. Suppose the good partition of Theorem 2 is the space S itself, so that each part of the partition is given by a single string of S. Then the distance (Definition 4) defines the following relation of equivalence between strings of S, for each value of n: r and s are equivalent when d_S(r, s) < 1. The next result formalizes how the distance is related to the BIC criterion.
Corollary 2. Let (X_t) be a Markov chain of order M over a finite alphabet A, with S = A^M the state space and x_1^n a sample of the Markov process. Let L = {L_1, L_2, ..., L_{|L|}} be a partition of S, and let i, j ∈ {1, 2, ..., |L|}, i ≠ j, be such that L_i and L_j verify Definition 3(i) (are good parts). Then BIC(L_ij, x_1^n) > BIC(L, x_1^n) if, and only if, d_L(i, j) < 1.

Proof. From Equation (A2) in the proof of Theorem 1.
The previous corollary provides the statistical interpretation of the distance.
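Following the construction in [10], d_L(i, j) can be read as the scaled log-likelihood gain of keeping L_i and L_j separate rather than merged, so that, by Corollary 2, merging is BIC-preferred exactly when the distance falls below 1. The sketch below (Python; the exact scaling constant is our assumption) illustrates this.

```python
import math
from collections import Counter

def transition_counts(x, M):
    """N_n(s, a) for every length-M string s followed by a in x."""
    N = Counter()
    for t in range(M, len(x)):
        N[(x[t - M:t], x[t])] += 1
    return N

def part_term(part, N, alphabet):
    """Sum over a of N_n(L, a) log(N_n(L, a) / N_n(L)) for one part L."""
    NL = sum(N[(s, a)] for s in part for a in alphabet)
    total = 0.0
    for a in alphabet:
        NLa = sum(N[(s, a)] for s in part)
        if NLa > 0:
            total += NLa * math.log(NLa / NL)
    return total

def d(part_i, part_j, N, alphabet, n):
    """Sketch of d_L(i, j): merging is BIC-preferred exactly when d < 1."""
    gain = (part_term(part_i, N, alphabet) + part_term(part_j, N, alphabet)
            - part_term(part_i + part_j, N, alphabet))
    return 2.0 * gain / ((len(alphabet) - 1) * math.log(n))
```

On a strictly alternating sample, the two states have very different laws and the distance is far above 1; on a sample where both states transition uniformly, it falls well below 1.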

Consistent Estimation of the Process's Partition
In this section, we prove that the partition following Definition 1(ii), referred to herein as the minimal good partition, can be obtained by maximizing the quantity introduced in Definition 2 over the space of all possible partitions of the state space. Indeed, the smallest good partition in the universe of all possible good partitions of S is the partition defined by the equivalence relation in Definition 1. Note that, for a discrete time Markov chain of order M on a finite alphabet A, with S = A^M the state space, there exists one and only one minimal good partition of S. The next theorem shows that, for large enough n, we obtain the minimal good partition through the BIC.

Theorem 3. Let (X_t) be a Markov chain of order M over a finite alphabet A, with S = A^M the state space and x_1^n a sample of the Markov process. Let P be the set of all the partitions of S. Define L*_n = argmax_{L∈P} BIC(L, x_1^n). Then, eventually almost surely as n → ∞, L* = L*_n, where L* is the minimal good partition of S, following Definition 1(ii).
From Corollary 2, algorithms can be formulated to obtain L*. See, for instance, Algorithm 3.1 in [10]. For large enough n, the algorithm returns the minimal good partition, as shown by the next result.

Corollary 3. Let (X_t) be a Markov chain of order M over a finite alphabet A, S = A^M the state space and x_1^n a sample of the Markov process. The partition L_n given by Algorithm 3.1 of [10] converges, eventually almost surely, to L*, where L* is the minimal good partition of S.

Remark 4. Algorithm 3.1 of [10] requires a good partition as initial input. In the case in which there is no previous information about a good partition or about the length of the memory, the initial good partition can be chosen as the set of sequences satisfying the suffix property and appearing in the sample at least B times, where B is a positive integer; this corresponds to the first part of the Context Algorithm ([2,3]).
We note that Corollary 3 also applies to other clustering algorithms based on distances, such as single-linkage clustering.However, exploring this aspect is beyond the scope of this paper.
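Corollary 3 motivates distance-driven estimation. The following is a simplified agglomerative sketch in the spirit of Algorithm 3.1 of [10] (not that algorithm verbatim; the stopping rule d ≥ 1 reflects Corollary 2): start from the singleton partition and merge the closest pair of parts while their distance stays below 1.

```python
import math
from collections import Counter

def estimate_partition(x, M, alphabet):
    """Greedy agglomerative sketch: start from singleton parts and merge the
    closest pair while their distance stays below 1 (cf. Corollary 2)."""
    n = len(x)
    N = Counter()
    for t in range(M, n):
        N[(x[t - M:t], x[t])] += 1
    parts = [[s] for s in sorted({s for (s, _) in N})]

    def term(part):
        # Sum over a of N_n(L, a) log(N_n(L, a) / N_n(L)) for one part L.
        NL = sum(N[(s, a)] for s in part for a in alphabet)
        tot = 0.0
        for a in alphabet:
            NLa = sum(N[(s, a)] for s in part)
            if NLa > 0:
                tot += NLa * math.log(NLa / NL)
        return tot

    def dist(p, q):
        # Scaled likelihood gain of keeping p and q separate (our reading of d_L).
        gain = term(p) + term(q) - term(p + q)
        return 2.0 * gain / ((len(alphabet) - 1) * math.log(n))

    while len(parts) > 1:
        i, j = min(((a, b) for a in range(len(parts))
                    for b in range(a + 1, len(parts))),
                   key=lambda ab: dist(parts[ab[0]], parts[ab[1]]))
        if dist(parts[i], parts[j]) >= 1:
            break
        parts = ([p for k, p in enumerate(parts) if k not in (i, j)]
                 + [sorted(parts[i] + parts[j])])
    return sorted(tuple(p) for p in parts)
```

On a sample where both states behave identically, the two singletons are merged; on a strictly alternating sample, they stay apart.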

Navigation Patterns on a Web Site (MSNBC.com)
The MSNBC.com anonymous web data set consists of one million user sessions recorded over 24 h on the web site. The dataset can be retrieved from [17]. The web pages on the site are divided into 17 categories: frontpage, news, tech, local, opinion, on-air, misc, weather, msn-news, health, living, business, msn-sports, sports, summary, bbs and travel.
Each category will be a letter in the alphabet A, with total size equal to 17. Each user session corresponds to a sequence of symbols from the alphabet, starting with the category on which the session is initiated on the MSNBC site. The sequence of categories that the user visits defines the string, which finishes when the user leaves the MSNBC site. In Table 2, we show an illustrative sample of 12 user sessions. The issue that the model can clarify is the identification of the strings that can be considered equivalent in terms of the next step of the internet surfers. This information could be extremely important in determining the profile of users in relation to the preferences of states in A. The idealization that supports this application is that there are different sequences in the state space that share the same transition probability to the next symbol in the alphabet of the process. Sequences with such properties form a part of the minimal good partition, which completely describes the process. Furthermore, from this perspective, these sequences are equivalent for deciding the next symbol of the process. In our application, these sequences are the paths used by internet surfers.
We use the distance d_L to find the minimal partition. Several strategies based on such a measure are available; one of them is the algorithm introduced in [10]. We used three strategies, with input given by the set of strings S: (i) Algorithm 3.1 introduced in [10]; (ii) Algorithm 3.1 of [10], modified; and (iii) an agglomerative strategy. Option (ii) is composed of two stages. First, we join in the same part all the strings r and s of S for which d_L(r, s) < ε; this process generates an initial partition that is used as input to Algorithm 3.1 of [10], which is the second stage. The agglomerative strategy of (iii) exploits the ability of d_L to define distances between strings of S and between groups of strings of S. Thus, in this strategy, all the distances are computed, joining groups of strings L_i and L_j if d_L(i, j) < (|A| − 1)/2. We show in Table 3 that the agglomerative algorithm produces the best BIC value. We expose two cases, with order M = 3 and M = 2, where the first is the order chosen as usual, 3 = log_|A|(1.0 × 10^6) − 1. Let us look at some specific situations. For example, suppose that our interest is to investigate the parts that lead to the local state with probability greater than 0.6. There are seven different parts (using M = 3 and method (iii)) that fulfill this condition, and Table 4 shows the composition of each part and its probability that local is the next state.
Once the model is selected, in order to predict the next place that the user will visit, we first check in which of the 269 parts his/her path falls, and then the corresponding probabilities are used. For example, if the user's path is weather.weather.misc, then the probability for the user to visit local is 0.7822, and this probability is shared by all three strings of L_5. The full partition obtained by the algorithm and the set of transition probabilities associated with each part can be obtained from [18]. We can draw some observations. For example, the transition probabilities of each part, P(a|L_i), have been computed using, in general, several strings, representing a natural improvement in the calculation of the transition probabilities. The reader can find many larger parts in [18], making this observation more incisive. On the other hand, the strings listed as members of the same part (see, for instance, several situations in Table 4) must be considered stochastically equivalent.
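The prediction step described above can be sketched as follows (a toy setup with hypothetical parts and counts; the actual 269-part model and its probabilities are available from [18]): locate the part containing the user's last M categories and read off the estimated transition probabilities of that part.

```python
from collections import Counter

def predict(path, partition, N, alphabet, M):
    """Find the part containing the last M symbols of `path` (as a tuple of
    categories) and return the estimated next-symbol distribution
    P(a | L) = N_n(L, a) / N_n(L)."""
    s = tuple(path[-M:])
    part = next(L for L in partition if s in L)
    NL = sum(N[(r, a)] for r in part for a in alphabet)
    return {a: sum(N[(r, a)] for r in part) / NL for a in alphabet}
```

A toy usage with two categories, memory M = 1, and made-up counts: a session ending in misc is looked up in its part, and the part's empirical transition probabilities are returned.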

Conclusions
The development of the partition concept in Markov processes allows for proving that, for a stationary, finite memory process and a large enough sample, it is theoretically possible to consistently find a minimal partition to represent the process, and that this can be accomplished in practice. In this paper, we show (in Theorem 3) that the Bayesian Information Criterion can be used to obtain a consistent estimation of the partition of a Markov process. We show that the use of this criterion is also convenient for producing a consistent strategy that allows for deciding whether a candidate partition is preferable to another candidate partition (Theorem 1). We also define a metric on the state space, allowing for the introduction of a distance between the parts of a partition (Theorem 2). The distance allows the construction of consistent estimation algorithms to identify the partition. Research in progress suggests that this measure can be harnessed for the development and implementation of robust estimation techniques, given that there are records (see [19]) of the need for these techniques for Markov processes. In summary, in this paper, in addition to responding positively to the question of whether the Bayesian Information Criterion is capable of allowing a consistent estimation of the partition of the Markov process, we also obtain that, in terms of the model selection procedure, the Bayesian Information Criterion corresponds to a distance in the state space of the Markov process.

Table 2 .
Sample of 12 sessions. Each line represents the path followed by a user.

Table 3 .
Number of parts (cardinality of L) and BIC value of the model (Definition 2), for memories 2 and 3, respectively. In (ii), ε = (|A| − 1)/2.