Article

Consistent Estimation of Partition Markov Models

Department of Statistics, University of Campinas, Rua Sérgio Buarque de Holanda, 651, Campinas, São Paulo 13083-859, Brazil
*
Author to whom correspondence should be addressed.
Entropy 2017, 19(4), 160; https://doi.org/10.3390/e19040160
Submission received: 1 March 2017 / Revised: 31 March 2017 / Accepted: 4 April 2017 / Published: 6 April 2017
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)

Abstract

The Partition Markov Model characterizes the process by a partition $\mathcal{L}$ of the state space, where the elements in each part of $\mathcal{L}$ share the same transition probability to an arbitrary element of the alphabet. This model aims to answer two questions: what is the minimal number of parameters needed to specify a Markov chain, and how can these parameters be estimated? To answer these questions, we build a consistent strategy for model selection, which consists of the following: given a size-$n$ realization of the process, find a model within the Partition Markov class, with a minimal number of parts, that represents the law of the process. From this strategy, we derive a measure that establishes a metric on the state space. In addition, we show that if the law of the process is Markovian, then, eventually, as $n$ goes to infinity, $\mathcal{L}$ will be retrieved. We show an application to modeling internet navigation patterns.

1. Introduction

Markov models have received enormous visibility as powerful tools [1,2,3]. In recent years, theoretical advances have allowed users to identify the most suitable methods for estimating them. For instance, [4] shows that the Bayesian Information Criterion (BIC) [5] can be used to consistently choose a Variable Length Markov Chain model in an efficient way using the Context Tree Maximization (CTM) algorithm. See also [6,7]. In this paper, we show that the BIC is also consistent for estimating a more general Markovian family, the Partition Markov Models, which includes the Variable Length Markov Chain models and the complete Markov chains. We consider a discrete stationary process with finite alphabet $A$ of size $|A|$. Markov chains of finite order are widely used to model stationary processes with finite memory. In databases with Markovian structure, a pronounced degree of redundancy is frequently observed, meaning that different sequences of symbols have the same effect on the law of the process. For example, in datasets coming from the linguistic field, we can observe words which are synonyms. In some cases, exchanging a word for a synonym does not change the meaning of a sentence. In a more general context, there are also sequences of several words which are equivalent in that sense. See, for instance, [8,9]. For this kind of data, a model should retrieve and use the redundancy to improve the quality of the estimate. The Partition Markov Model represents the redundancy through a partition of the state space (see [10]). See also [11] and the literature contemporary to [10]. Under the assumption of this family, we address the problem of model selection, showing that the model can be selected consistently using the BIC. We show that, in order to apply the BIC criterion, it is not necessary to find a global maximum over the set of partitions, which would be impossible even for a moderate size of the state space.
Instead, it is possible to start the search from an initial partition, for instance the state space itself, and then coarsen the partition step by step. This process is associated with a metric that governs the state space. The Partition Markov Models are being used and explored intensively: for instance, [12] combines two statistical concepts, Copulas and Partition Markov Models, with the purpose of defining a natural correction for the estimator of the transition probabilities of a multivariate Markov process. Moreover, Reference [13] presents a simulation study that identifies when this correction succeeds. A second strategy to deal with this issue is shown and applied to real data in [14]. The idea is to combine, through a copula, the partitions coming from the marginal processes and the partition coming from the multivariate process. This strategy shows excellent theoretical properties that are essential to increasing the predictive ability of the estimation. In [14], the strategy was applied to multivariate Brazilian financial data in order to show how this new estimator allows for considering a longer past (order of the process) in the estimation of the transition probabilities, in comparison with the past allowed by the Partition Markov Models when the data size is not large enough to ensure reliable results. The application of Partition Markov Models has also been useful to reveal important facts in other areas, as shown in [15], where these models were applied to written texts of European Portuguese in order to identify change points over the period from the 16th to the 19th century.
Here is a description of the issues addressed in this article, section by section. In Section 2, we introduce the concept of a Markov chain with partition $\mathcal{L}$, which is a partition of the state space defined through a stochastic equivalence between the strings of the state space (see [10]). In Section 3, we describe the model selection procedure for choosing the optimal partition, which is based on the BIC criterion and on the concept of good partitions of the state space, also introduced in that section. We introduce a distance between the parts of a partition; this concept defines a metric on the state space and also allows building efficient algorithms for estimating the optimal partition (see [10]). In Section 3, we also show that the optimal partition can be obtained through the BIC criterion, eventually almost surely, as the sample size tends to infinity. Section 4 shows the application to modeling navigation patterns on a website. We conclude this paper with a discussion in Section 5. The proofs of the results introduced in this paper are included in Appendix A and Appendix B.

2. Preliminaries

Let $(X_t)$ be a discrete-time, order-$M$ Markov chain on a finite alphabet $A$, with $M < \infty$. Let us call $S = A^M$ the state space. Denote the string $a_m a_{m+1} \ldots a_n$ by $a_m^n$, where $a_i \in A$, $m \le i \le n$. For each $a \in A$ and $s \in S$, $P(a|s) = \mathrm{Prob}(X_t = a \mid X_{t-M}^{t-1} = s)$. Let $\mathcal{L} = \{L_1, L_2, \ldots, L_{|\mathcal{L}|}\}$ be a partition of $S$; for $a \in A$ and $L \in \mathcal{L}$, define $P(L,a) = \sum_{s \in L} \mathrm{Prob}(X_{t-M}^{t-1} = s, X_t = a)$ and $P(L) = \sum_{s \in L} \mathrm{Prob}(X_{t-M}^{t-1} = s)$. If $P(L) > 0$, we define $P(a|L) = \frac{P(L,a)}{P(L)}$. With the purpose of formulating the model, we introduce the following equivalence relation.
Definition 1.
Let $(X_t)$ be a discrete-time, order-$M$ Markov chain on a finite alphabet $A$, with state space $S = A^M$:
(i) 
$s, r \in S$ are equivalent (denoted by $s \sim_p r$) if $P(a|s) = P(a|r)$ $\forall a \in A$.
(ii) 
$(X_t)$ is a Markov chain with partition $\mathcal{L} = \{L_1, L_2, \ldots, L_{|\mathcal{L}|}\}$ if this partition is the one defined by the equivalence relation $\sim_p$ introduced in item (i).
The equivalence relation defines a partition on $S$. The parts of this partition are subsets of $S$ with the same transition probabilities; i.e., $s, r \in S$ are in different parts if, and only if, they have different transition probabilities. To understand more deeply the motivation for a model like this, note that the full Markov chains show a restriction, in terms of point estimation: for an order-$M$ model, the number of parameters, given by $|A|^M(|A|-1)$, grows exponentially with the order $M$. Another limitation is that the class of full Markov chains is not very rich, since, for a fixed alphabet $A$, there is just one model for each order $M$, and in practical situations a more flexible structure may be necessary. For an extensive discussion of these two restrictions, see [1]. A well-known and richer class of finite order Markov models, introduced by [1,2], is composed of the Variable Length Markov Chains (VLMC). In the VLMC class, each model is identified by a prefix tree $\mathcal{T}$ called a Context Tree. For a given model with Context Tree $\mathcal{T}$, the total number of parameters is $|\mathcal{T}|(|A|-1)$. We will see later that Definition 1(ii) supports both complete Markov chains and VLMC, becoming a natural extension of the two possibilities.
Remark 1.
Given a Markov chain over the alphabet $A = \{a_1, a_2, \ldots, a_{|A|}\}$ with partition $\mathcal{L} = \{L_1, L_2, \ldots, L_{|\mathcal{L}|}\}$ (Definition 1(ii)), in order to specify the process it is necessary to estimate $(|A|-1)$ transition probabilities for each part in $\mathcal{L}$. Thus, the set of parameters to estimate is $\{P(a_i|L_j): 1 \le i < |A|,\ 1 \le j \le |\mathcal{L}|\}$, and the total number of parameters for the model is $|\mathcal{L}|(|A|-1)$. If the estimation of the transition probabilities is performed under any other conception (for instance, considering a complete Markov chain or a VLMC), the number of parameters to estimate will be higher than $|\mathcal{L}|(|A|-1)$, since those models do not consider that there are strings sharing transition probabilities.
The structure of a VLMC can be expressed by a partition in the sense described before. Each model in the family of VLMC models is identified by its Context Tree, and we will use this structure to establish the relation between VLMC and Partition Markov Models (see Example 1).
Example 1.
Let $(X_t)$ be a finite order Markov chain taking values on $A = \{0,1\}$ and $\mathcal{T}$ a set of sequences of symbols from $A$ such that no string in $\mathcal{T}$ is a suffix of another string in $\mathcal{T}$; let $d(\mathcal{T}) = \max\{l(s): s \in \mathcal{T}\}$, where $l(s)$ is the length of the string $s \in \mathcal{T}$. Consider $d(\mathcal{T}) = 3$ and $\mathcal{T} = \{0, 01, 011, 111\}$. Define a partition of $A^3$ as $\mathcal{L} = \{L_1, L_2, L_3, L_4\}$, where $L_1 = \{000, 100, 010, 110\}$, $L_2 = \{001, 101\}$, $L_3 = \{011\}$ and $L_4 = \{111\}$:
(i) 
Suppose $P(\cdot|s) \ne P(\cdot|s')$, $\forall s, s' \in \mathcal{T}$. Then, $\mathcal{L}$ verifies Definition 1(ii);
(ii) 
Suppose $P(\cdot|s) \ne P(\cdot|s')$, $\forall s, s' \in \mathcal{T} \setminus \{0\}$, and $P(\cdot|0) = P(\cdot|01)$. Define $L_1' = L_1 \cup L_2$; then $\mathcal{L}' = \{L_1', L_3, L_4\}$ verifies Definition 1(ii), while $\mathcal{L}$ does not.
The next example shows a situation in which the economy in the number of parameters achieved by a Partition Markov Model following Definition 1(ii) can be observed (see also Remark 1).
Example 2.
Let $(X_t)$ be a finite order Markov chain taking values on $A = \{0,1\}$ with state space $A^3$. Suppose that this chain follows the transition probabilities given in Table 1.
Considering the process as a full chain, we have eight parameters. Looking more closely, a Context Tree is enough to describe the process, with just four parameters, because $\mathcal{T} = \{0, 01, 011, 111\}$. Moreover, if we analyze this situation from the perspective of Definition 1(ii), we note that only two parameters are needed to describe the source, since $\mathcal{L} = \{\{000, 001, 010, 100, 101, 110, 111\}, \{011\}\}$: only the string 011 has a different transition probability to 0.
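To make the grouping in Definition 1(ii) concrete, the sketch below builds the partition mechanically by collecting states with identical transition vectors. The probabilities are hypothetical values in the spirit of Example 2 (Table 1 is not reproduced here): we assume $P(0|011) = 0.2$ and $P(0|s) = 0.8$ for every other state.

```python
from collections import defaultdict
from itertools import product

# Hypothetical transition probabilities in the spirit of Example 2:
# the state 011 transitions to 0 with probability 0.2, every other
# state of {0,1}^3 with probability 0.8 (illustrative values only).
def transition(s):
    p0 = 0.2 if s == "011" else 0.8
    return (("0", p0), ("1", round(1.0 - p0, 10)))

# Group states that share the same transition vector: each group
# is one part of the partition L of Definition 1(ii).
groups = defaultdict(list)
for bits in product("01", repeat=3):
    s = "".join(bits)
    groups[transition(s)].append(s)

partition = sorted(groups.values(), key=len, reverse=True)
print([sorted(part) for part in partition])
# Two parts are found, so the model needs |L|(|A|-1) = 2 parameters
# instead of the 8 of the full order-3 chain.
```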

3. Consistent Estimation through the Bayesian Information Criterion

Let $x_1^n$ be a sample of the process $(X_t)$, $s \in S$, $a \in A$ and $n > M$. We denote by $N_n(s,a)$ the number of occurrences of the string $s$ followed by $a$ in the sample $x_1^n$, that is, $N_n(s,a) = |\{t : M < t \le n,\ x_{t-M}^{t-1} = s,\ x_t = a\}|$. In addition, the number of occurrences of $s$ in the sample $x_1^n$ is denoted by $N_n(s)$, with $N_n(s) = |\{t : M < t \le n,\ x_{t-M}^{t-1} = s\}|$. The number of occurrences of elements of $L$ followed by $a$ and the total number of strings in $L$ are given by

$$N_n(L,a) = \sum_{s \in L} N_n(s,a), \qquad N_n(L) = \sum_{s \in L} N_n(s), \quad L \in \mathcal{L}. \qquad (1)$$

In order to simplify the notation, we use the same symbol $N_n$ with different arguments, a string $s$ or a part $L$. In addition, we note that $N_n(L)$ is a function of the partition $\mathcal{L}$. As a consequence, if we write $P(x_1^n) = \mathrm{Prob}(X_1^n = x_1^n)$, we obtain, under the assumption of a hypothetical partition $\mathcal{L}$ of $S$:

$$P(x_1^n) = P(x_1^M) \prod_{L \in \mathcal{L},\, a \in A} P(a|L)^{N_n(L,a)}. \qquad (2)$$

The Bayesian Information Criterion (BIC) is defined through a modified maximum likelihood (see [4]). We will call maximum likelihood the maximization of the second term of Equation (2) for a given observation $x_1^n$. We denote that term by $\mathrm{ML}(\mathcal{L}, x_1^n)$,

$$\mathrm{ML}(\mathcal{L}, x_1^n) = \prod_{L \in \mathcal{L},\, a \in A} \left( \frac{N_n(L,a)}{N_n(L)} \right)^{N_n(L,a)}, \quad \text{with } N_n(L) \ne 0,\ L \in \mathcal{L}, \qquad (3)$$
and the BIC is given by the next definition.
Definition 2.
Given a sample $x_1^n$ of the process $(X_t)$, a discrete-time, order-$M$ Markov chain on a finite alphabet $A$ with state space $S = A^M$, and $\mathcal{L}$ a partition of $S$, the BIC of the model given by Definition 1(ii), according to the modified likelihood in Equation (3), is $\mathrm{BIC}(\mathcal{L}, x_1^n) = \ln \mathrm{ML}(\mathcal{L}, x_1^n) - \frac{(|A|-1)|\mathcal{L}|}{2} \ln(n)$.
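Definition 2 can be implemented directly from the counts of Equation (1); the following sketch (function names are ours) computes $\ln \mathrm{ML}$ and the penalized criterion. On an i.i.d. binary sample, the one-part model should eventually beat the full partition, since all states share the same transition law:

```python
import math
import random
from collections import Counter

def counts(x, M):
    """Empirical counts N_n(s, a) and N_n(s) from a sample x (a string)."""
    Nsa, Ns = Counter(), Counter()
    for t in range(M, len(x)):
        Nsa[(x[t - M:t], x[t])] += 1
        Ns[x[t - M:t]] += 1
    return Nsa, Ns

def bic(partition, x, M, alphabet):
    """BIC(L, x_1^n) = ln ML(L, x_1^n) - (|A|-1)|L| ln(n) / 2 (Definition 2)."""
    Nsa, Ns = counts(x, M)
    log_ml = 0.0
    for part in partition:
        NL = sum(Ns[s] for s in part)
        for a in alphabet:
            NLa = sum(Nsa[(s, a)] for s in part)
            if NLa > 0:
                log_ml += NLa * math.log(NLa / NL)
    return log_ml - (len(alphabet) - 1) * len(partition) / 2 * math.log(len(x))

random.seed(0)
x = "".join(random.choice("01") for _ in range(5000))  # i.i.d. sample
states = ["00", "01", "10", "11"]
full = [[s] for s in states]   # one part per state: 4 parameters
merged = [states]              # a single part: 1 parameter
print(bic(merged, x, 2, "01") > bic(full, x, 2, "01"))
```

On such data the extra likelihood gained by the full partition is typically far smaller than the extra penalty, so the merged model wins.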
Remark 2.
The results of this paper remain valid if we replace, in Definition 2, the constant $\frac{(|A|-1)}{2}$ by an arbitrary constant $v$, positive and finite.
Below, we define some concepts that help, in practice, to limit the search for an ideal partition to a subset of possible partitions, with natural characteristics.
Definition 3.
Let $(X_t)$ be a discrete-time, order-$M$ Markov chain on a finite alphabet $A$ and $S = A^M$ the state space. Let $\mathcal{L} = \{L_1, L_2, \ldots, L_{|\mathcal{L}|}\}$ be a partition of $S$:
(i) 
$L \in \mathcal{L}$ is a good part of $\mathcal{L}$ if $\forall s, s' \in L$, $\mathrm{Prob}(X_t = \cdot \mid X_{t-M}^{t-1} = s) = \mathrm{Prob}(X_t = \cdot \mid X_{t-M}^{t-1} = s')$, for all values of $t$ with $t > M$;
(ii) 
$\mathcal{L}$ is a good partition of $S$ if, for each $i \in \{1, \ldots, |\mathcal{L}|\}$, $L_i$ verifies item (i).
Example 3.
We will consider two situations:
(i) 
$\mathcal{L} = S$ (i.e., the partition of $S$ into singletons) is a good partition of $S$.
(ii) 
Consider Example 1(ii); the partition $\mathcal{L}$ is a good partition of $S = \{0,1\}^3$.
If $\mathcal{L}$ is a good partition of $S$, we define, for each part $L \in \mathcal{L}$,
$$P(a|L) = \mathrm{Prob}(X_t = a \mid X_{t-M}^{t-1} = s) \quad \forall a \in A,$$
where $s \in L$.
We introduce a notation that will be used in the next results.
Notation 1.
(a) 
Let $\mathcal{L}_{ij}$ denote the partition
$$\mathcal{L}_{ij} = \{L_1, \ldots, L_{i-1}, L_{ij}, L_{i+1}, \ldots, L_{j-1}, L_{j+1}, \ldots, L_{|\mathcal{L}|}\},$$
where $\mathcal{L} = \{L_1, \ldots, L_{|\mathcal{L}|}\}$ is a partition of $S$, $1 \le i < j \le |\mathcal{L}|$, and $L_{ij} = L_i \cup L_j$.
(b) 
For $a \in A$, we write $P(L_{ij}, a) = P(L_i, a) + P(L_j, a)$ and $P(L_{ij}) = P(L_i) + P(L_j)$. In addition,
$$N_n(L_{ij}, a) = N_n(L_i, a) + N_n(L_j, a); \qquad N_n(L_{ij}) = N_n(L_i) + N_n(L_j).$$
Note that, if $\mathcal{L}$ is a good partition and $P(\cdot|L_i) = P(\cdot|L_j)$, then $\mathcal{L}_{ij}$ is a good partition. We show a way to build, from good partitions, partitions that are more suitable candidates for verifying Definition 1(ii). This way of building partitions seeks to reduce the size of the partition step by step.

3.1. A Metric on the State Space

The next result allows formulating the main findings of this section. Note that this result can be applied to partitions with at least two good parts; i.e., it is not necessary to have a good partition. In this section, we also define a measure to quantify the distance between the parts of a partition. This distance is based on the practical use of the next theorem and allows building an efficient algorithm for estimating the partition given by Definition 1(ii). For a complementary discussion, see [16].
Theorem 1.
Let $(X_t)$ be a Markov chain of order $M$ over a finite alphabet $A$, $S = A^M$ the state space and $x_1^n$ a sample of the Markov process. Let $\mathcal{L} = \{L_1, L_2, \ldots, L_{|\mathcal{L}|}\}$ be a partition of $S$, and suppose that there exist $i$ and $j$, $i \ne j$, such that $L_i$ and $L_j$ verify Definition 3(i) (are good parts). Then, $P(a|L_i) = P(a|L_j)$ $\forall a \in A$ if, and only if, eventually almost surely as $n \to \infty$,
$$\mathrm{BIC}(\mathcal{L}_{ij}, x_1^n) > \mathrm{BIC}(\mathcal{L}, x_1^n),$$
where $\mathcal{L}_{ij}$ is obtained from $\mathcal{L}$ as in Notation 1(a).
Proof. 
See Appendix A.1. ☐
It is also possible to decide simultaneously whether more than two good parts should be put together, as shown in the next corollary.
Corollary 1.
Let $(X_t)$ be a Markov chain of order $M$ over a finite alphabet $A$, $S = A^M$ the state space and $x_1^n$ a sample of the Markov process. If $\mathcal{L} = \{L_1, L_2, \ldots, L_{|\mathcal{L}|}\}$ is a partition of $S$ with $K_1$ good parts, denoted by $\{L_{i_k}\}_{k=1}^{K_1}$, and $T$ is an index set, $T \subseteq \{1, \ldots, K_1\}$, then $P(a|L_{i_k}) = P(a|L_{i_l})$ $\forall a \in A$, $\forall k, l \in T$ if, and only if, eventually almost surely as $n \to \infty$, $\mathrm{BIC}(\mathcal{L}, x_1^n) < \mathrm{BIC}(\mathcal{L}_T, x_1^n)$, where $\mathcal{L}_T$ denotes the partition which joins the $|T|$ good parts into $\cup_{k \in T} L_{i_k}$, generalizing Notation 1(a).
Proof. 
Replace Equation (A3) in the proof of Theorem 1 by
$$\sum_{a \in A} \left[ \sum_{k \in T} \frac{N_n(L_{i_k}, a)}{n} \ln \frac{N_n(L_{i_k}, a)}{N_n(L_{i_k})} - \frac{N_n(\cup_{k \in T} L_{i_k}, a)}{n} \ln \frac{N_n(\cup_{k \in T} L_{i_k}, a)}{N_n(\cup_{k \in T} L_{i_k})} \right] < \frac{(|A|-1)(|T|-1) \ln(n)}{2n}.$$
Applying the log-sum inequality, the result follows. ☐
Remark 3.
Let $(X_t)$ be a Markov chain of order $M$ over a finite alphabet $A$, $S = A^M$ the state space and $x_1^n$ a sample of the Markov process. Let $\mathcal{L} = \{L_1, L_2, \ldots, L_{|\mathcal{L}|}\}$ be a partition of $S$, and take $i, j \in \{1, 2, \ldots, |\mathcal{L}|\}$, $i \ne j$, such that $L_i$ and $L_j$ verify Definition 3(i) (are good parts). If $P(a|L_i) \ne P(a|L_j)$ for some $a \in A$, then, eventually almost surely as $n \to \infty$, $\mathrm{BIC}(\mathcal{L}, x_1^n) > \mathrm{BIC}(\mathcal{L}_{ij}, x_1^n)$, where $\mathcal{L}_{ij}$ follows Notation 1(a).
Now, we can introduce a distance on $\mathcal{L}$. This distance allows us to establish a metric on the state space $S$.
Definition 4.
Let $(X_t)$ be a Markov chain of order $M$, with finite alphabet $A$ and state space $S = A^M$, $x_1^n$ a sample of the process, and let $\mathcal{L} = \{L_1, L_2, \ldots, L_{|\mathcal{L}|}\}$ be a good partition of $S$. Define
$$d_{\mathcal{L}}(i,j) = \frac{1}{\ln(n)} \sum_{a \in A} \left\{ N_n(L_i, a) \ln \frac{N_n(L_i, a)}{N_n(L_i)} + N_n(L_j, a) \ln \frac{N_n(L_j, a)}{N_n(L_j)} - N_n(L_{ij}, a) \ln \frac{N_n(L_{ij}, a)}{N_n(L_{ij})} \right\}.$$
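The distance of Definition 4 is computable directly from the empirical counts. The sketch below (our own naming) evaluates $d_{\mathcal{L}}(i,j)$ for two candidate parts of an order-2 binary chain; by construction it is symmetric and non-negative:

```python
import math
import random
from collections import Counter

def d_L(part_i, part_j, x, M, alphabet):
    """Empirical distance of Definition 4 between two parts, from a sample x."""
    Nsa, Ns = Counter(), Counter()
    for t in range(M, len(x)):
        Nsa[(x[t - M:t], x[t])] += 1
        Ns[x[t - M:t]] += 1
    total = 0.0
    for a in alphabet:
        Nia = sum(Nsa[(s, a)] for s in part_i)
        Nja = sum(Nsa[(s, a)] for s in part_j)
        Ni = sum(Ns[s] for s in part_i)
        Nj = sum(Ns[s] for s in part_j)
        for Na, N in ((Nia, Ni), (Nja, Nj)):
            if Na > 0:
                total += Na * math.log(Na / N)
        if Nia + Nja > 0:  # counts of the merged part L_ij
            total -= (Nia + Nja) * math.log((Nia + Nja) / (Ni + Nj))
    return total / math.log(len(x))

random.seed(0)
x = "".join(random.choice("01") for _ in range(10000))  # i.i.d. sample
d1 = d_L(["00"], ["11"], x, 2, "01")
print(d1 >= 0.0, abs(d1 - d_L(["11"], ["00"], x, 2, "01")) < 1e-9)
```

For an i.i.d. source the two parts share the same law, so the computed distance is typically far below the merging threshold $(|A|-1)/2$.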
The next theorem shows that $d_{\mathcal{L}}$ is a distance on $\mathcal{L}$.
Theorem 2.
Let $(X_t)$ be a Markov chain of order $M$ over a finite alphabet $A$, with $S = A^M$ the state space and $x_1^n$ a sample of the Markov process. If $\mathcal{L} = \{L_1, L_2, \ldots, L_{|\mathcal{L}|}\}$ is a good partition of $S$, then, for each $n$ and for any $i, j, k \in \{1, 2, \ldots, |\mathcal{L}|\}$:
(i) 
$d_{\mathcal{L}}(i,j) \ge 0$, with equality if and only if $\frac{N_n(L_i, a)}{N_n(L_i)} = \frac{N_n(L_j, a)}{N_n(L_j)}$ $\forall a \in A$;
(ii) 
$d_{\mathcal{L}}(i,j) = d_{\mathcal{L}}(j,i)$;
(iii) 
$d_{\mathcal{L}}(i,k) \le d_{\mathcal{L}}(i,j) + d_{\mathcal{L}}(j,k)$.
Proof. 
See Appendix A.2. ☐
Some practical observations are appropriate at this point. Suppose the good partition of Theorem 2 is the space $S$ itself, so that each part of the partition is given by a single string of $S$. Then, the distance (Definition 4) defines the following equivalence relation between strings of $S$, for each value of $n$:
$$s \sim_n r \iff \frac{N_n(s, a)}{N_n(s)} = \frac{N_n(r, a)}{N_n(r)} \quad \forall a \in A, \quad s, r \in S.$$
The next result formalizes how the distance is related to the BIC criterion.
Corollary 2.
Let $(X_t)$ be a Markov chain of order $M$ over a finite alphabet $A$, with $S = A^M$ the state space and $x_1^n$ a sample of the Markov process. Let $\mathcal{L} = \{L_1, L_2, \ldots, L_{|\mathcal{L}|}\}$ be a partition of $S$, and take $i, j \in \{1, 2, \ldots, |\mathcal{L}|\}$, $i \ne j$, such that $L_i$ and $L_j$ verify Definition 3(i) (are good parts). Then
$$\mathrm{BIC}(\mathcal{L}, x_1^n) - \mathrm{BIC}(\mathcal{L}_{ij}, x_1^n) < 0 \iff d_{\mathcal{L}}(i,j) < \frac{(|A|-1)}{2}.$$
Proof. 
From Equation (A2) in the proof of Theorem 1. ☐
The previous corollary provides the statistical interpretation of the distance.

3.2. Consistent Estimation of the Process’s Partition

In this section, we prove that the partition following Definition 1(ii), referred to herein as the minimal good partition, can be obtained by maximizing the criterion introduced in Definition 2 over the space of all possible partitions of the state space. Indeed, the smallest good partition in the universe of all possible good partitions of $S$ is the partition defined by the equivalence relation in Definition 1. Note that, for a discrete-time, order-$M$ Markov chain on a finite alphabet $A$, with $S = A^M$ the state space, there exists one and only one minimal good partition of $S$. The next theorem shows that, for large enough $n$, we obtain the minimal good partition through the BIC.
Theorem 3.
Let $(X_t)$ be a Markov chain of order $M$ over a finite alphabet $A$, with $S = A^M$ the state space and $x_1^n$ a sample of the Markov process. Let $\mathcal{P}$ be the set of all the partitions of $S$. Define
$$\mathcal{L}_n^* = \underset{\mathcal{L} \in \mathcal{P}}{\arg\max}\ \mathrm{BIC}(\mathcal{L}, x_1^n).$$
Then, eventually almost surely as $n \to \infty$, $\mathcal{L}^* = \mathcal{L}_n^*$, where $\mathcal{L}^*$ is the minimal good partition of $S$, following Definition 1(ii).
Proof. 
See Appendix A.3. ☐
From Corollary 2, algorithms can be formulated to obtain $\mathcal{L}^*$. See, for instance, Algorithm 3.1 in [10]. For large enough $n$, the algorithm returns the minimal good partition, as shown by the next result.
Corollary 3.
Let $(X_t)$ be a Markov chain of order $M$ over a finite alphabet $A$, $S = A^M$ the state space and $x_1^n$ a sample of the Markov process. Then $\hat{\mathcal{L}}_n$, given by Algorithm 3.1 of [10], converges eventually almost surely to $\mathcal{L}^*$, where $\mathcal{L}^*$ is the minimal good partition of $S$.
Remark 4.
Algorithm 3.1 of [10] requires a good partition as initial input. In the case in which there is no previous information about a good partition or about the length of the memory, the initial good partition can be chosen as the set of sequences satisfying the suffix property and appearing in the sample at least $B$ times, where $B$ is a positive integer; this corresponds to the first part of the Context Algorithm ([2,3]).
We note that Corollary 3 also applies to other clustering algorithms based on distances, such as single-linkage clustering. However, exploring this aspect is beyond the scope of this paper.

4. Navigation Patterns on a Web Site (MSNBC.com)

The MSNBC.com anonymous web data set consists of one million user sessions recorded over 24 h on the website. The dataset can be retrieved from [17]. The web pages on the site are divided into 17 categories: frontpage, news, tech, local, opinion, on-air, misc, weather, msn-news, health, living, business, msn-sports, sports, summary, bbs and travel.
Each category is a letter in the alphabet $A$, of total size 17. Each user session corresponds to a sequence of symbols from the alphabet, starting with the category in which the session is initiated on the MSNBC site. The sequence of categories which the user visits defines the string, which finishes when the user leaves the MSNBC site. In Table 2, we show an illustrative sample of 12 user sessions.
The issue that the model can clarify is which strings can be considered equivalent in terms of the next step of the internet surfers. This information could be extremely important in determining the profile of users in relation to their preferences among the states in $A$. The idealization that supports this application is that there are different sequences in the state space that share the same transition probability to the next symbol in the alphabet of the process. Sequences with such properties form a part of the minimal good partition, which completely describes the process. Furthermore, from this perspective, these sequences are equivalent for deciding the next symbol of the process. In our application, these sequences are the paths used by internet surfers.
We use the distance $d_{\mathcal{L}}$ to find the minimal partition. There are several strategies based on a measure which allow us to reach this purpose, one of them being the algorithm introduced in [10]. We used three strategies, with input given by the set of strings $S$: (i) Algorithm 3.1 introduced in [10]; (ii) Algorithm 3.1 of [10], modified; and (iii) an agglomerative strategy. Option (ii) is composed of two stages. First, we join in the same part all the strings $r$ and $s$ of $S$ for which $d_{\mathcal{L}}(r,s) < \epsilon$; this process generates an initial partition that is used as input to Algorithm 3.1 of [10], which is the second stage. The agglomerative strategy of (iii) exploits the ability of $d_{\mathcal{L}}$ to build distances between strings of $S$ and between groups of strings of $S$. Thus, in this strategy, all of the distances are computed, joining groups of strings $L_i$ and $L_j$ whenever $d_{\mathcal{L}}(i,j) < \frac{(|A|-1)}{2}$. We show in Table 3 that the agglomerative algorithm produces the best BIC value. We present two cases, with orders $M = 3$ and $M = 2$, where the first is the order chosen as usual, $3 = \lfloor \log_{|A|}(1.0 \times 10^6) \rfloor - 1$.
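A simplified version of the agglomerative strategy (iii) can be sketched as follows: start from the singleton partition and repeatedly merge the closest pair of parts under $d_{\mathcal{L}}$, as long as that distance stays below the threshold $(|A|-1)/2$ of Corollary 2. This is an illustrative greedy implementation on a small binary example, not the exact Algorithm 3.1 of [10]:

```python
import math
import random
from collections import Counter

def log_ml_term(counter):
    """Sum over a of N(L, a) * ln(N(L, a) / N(L)) for one part."""
    NL = sum(counter.values())
    return sum(c * math.log(c / NL) for c in counter.values() if c > 0)

def distance(ci, cj, n):
    """d_L of Definition 4, computed from the per-part counters N_n(L, .)."""
    gain = log_ml_term(ci) + log_ml_term(cj) - log_ml_term(ci + cj)
    return gain / math.log(n)

def agglomerate(x, M, alphabet):
    """Greedy merging of parts while the closest pair is below (|A|-1)/2."""
    n = len(x)
    per_state = {}
    for t in range(M, n):
        per_state.setdefault(x[t - M:t], Counter())[x[t]] += 1
    parts = [({s}, c) for s, c in per_state.items()]  # singleton partition
    threshold = (len(alphabet) - 1) / 2
    while len(parts) > 1:
        d, i, j = min((distance(ci, cj, n), i, j)
                      for i, (_, ci) in enumerate(parts)
                      for j, (_, cj) in enumerate(parts) if i < j)
        if d >= threshold:
            break
        (si, ci), (sj, cj) = parts[i], parts[j]
        parts = [p for k, p in enumerate(parts) if k not in (i, j)]
        parts.append((si | sj, ci + cj))
    return [sorted(states) for states, _ in parts]

random.seed(1)
x = "".join(random.choice("01") for _ in range(20000))  # i.i.d. sample
result = agglomerate(x, 2, "01")
print(result)  # for an i.i.d. source, the states tend to collapse together
```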
Let us look at some specific situations. For example, suppose that our interest is to investigate the parts that lead to the local state with probability greater than 0.6. There are seven different parts (using $M = 3$ and method (iii)) that fulfill this condition, and Table 4 shows the composition of each part and the probability that local is the next state.
For instance, once the model has been selected, in order to predict the next place that the user will visit, we first check in which of the 269 parts his/her path falls, and then the corresponding probabilities are used. For example, if the user's path is weather.weather.misc, then the probability for the user to visit local is 0.7822, and this probability is shared by all three strings of $L_5$. The full partition obtained by the algorithm and the set of transition probabilities associated with each part can be obtained from [18]. We can draw some observations. For example, the transition probabilities of each part, $P(a|L_i)$, have been computed using, in general, several strings, representing a natural improvement in the calculation of the transition probabilities. The reader can find many larger parts in [18], making this observation more incisive. On the other hand, the strings listed as members of the same part (see, for instance, several situations in Table 4) must be considered stochastically equivalent.

5. Conclusions

The development of the partition concept in Markov processes allows for proving that, for a stationary, finite memory process and a large enough sample, it is theoretically possible to consistently find a minimal partition to represent the process, and this can be accomplished in practice. In this paper, we show (in Theorem 3) that the Bayesian Information Criterion can be used to obtain a consistent estimation of the partition of a Markov process. We show that the use of this criterion is also convenient for producing a consistent strategy that allows for deciding whether a candidate partition is preferable to another (Theorem 1). We also define a metric on the state space, allowing for the introduction of a distance between the parts of a partition (Theorem 2). The distance allows the construction of consistent estimation algorithms to identify the partition. Research in progress suggests that this measure can be harnessed for the development and implementation of robust estimation techniques, given that there are records (see [19]) of the need for these techniques for Markov processes. In summary, in this paper, in addition to responding positively to the question of whether the Bayesian Information Criterion is capable of allowing a consistent estimation of the partition of the Markov process, we also obtain that, in terms of the model selection procedure, the Bayesian Information Criterion corresponds to a distance on the state space of the Markov process.

Acknowledgments

The authors wish to express their gratitude to three referees for their helpful comments on an earlier draft of this paper.

Author Contributions

The authors of this paper jointly conceived the idea for this paper, discussed the agenda for the research, performed the theoretical and numerical calculations, and prepared each draft of the paper. The authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proofs

We first define some concepts and establish some useful notation that will be used in this section.
Definition A1.
Let $P$ and $Q$ be probability distributions on $A$. The relative entropy between $P$ and $Q$ is given by $D(P(\cdot) \,\|\, Q(\cdot)) = \sum_{a \in A} P(a) \ln \frac{P(a)}{Q(a)}$, with $Q(a) \ne 0$, $a \in A$.
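As a numerical sanity check of the two facts used repeatedly in the proofs below, namely the non-negativity of the relative entropy of Definition A1 and the log-sum inequality, consider this short sketch (the distributions and numbers are arbitrary examples):

```python
import math

def relative_entropy(P, Q):
    """D(P || Q) = sum_a P(a) ln(P(a) / Q(a))  (Definition A1)."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

# Gibbs' inequality: D(P || Q) >= 0, with equality iff P = Q.
assert relative_entropy([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]) == 0.0
assert relative_entropy([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]) > 0.0

# Log-sum inequality: sum_i a_i ln(a_i / b_i)
#   >= (sum_i a_i) ln(sum_i a_i / sum_i b_i), for non-negative a_i, b_i.
a, b = [3.0, 5.0], [4.0, 2.0]
lhs = sum(x * math.log(x / y) for x, y in zip(a, b))
rhs = sum(a) * math.log(sum(a) / sum(b))
assert lhs >= rhs
print("checks passed")
```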
From Equation (1), define
$$r_n(L,a) = \frac{N_n(L,a)}{n} \quad \text{and} \quad r_n(L) = \frac{N_n(L)}{n}, \quad a \in A,\ L \in \mathcal{L}.$$
Because $N_n(L) \ne 0$ $\forall L \in \mathcal{L}$, $r_n(L) \ne 0$, $L \in \mathcal{L}$.

Appendix A.1. Proof of Theorem 1

$$\mathrm{BIC}(\mathcal{L}, x_1^n) = \sum_{a \in A} \ln \prod_{L \in \mathcal{L}} \left( \frac{r_n(L,a)}{r_n(L)} \right)^{N_n(L,a)} - \frac{(|A|-1)|\mathcal{L}|}{2} \ln(n). \qquad (A1)$$
Then,
$$\mathrm{BIC}(\mathcal{L}, x_1^n) - \mathrm{BIC}(\mathcal{L}_{ij}, x_1^n) = \sum_{a \in A} \left[ N_n(L_i,a) \ln \frac{r_n(L_i,a)}{r_n(L_i)} + N_n(L_j,a) \ln \frac{r_n(L_j,a)}{r_n(L_j)} - N_n(L_{ij},a) \ln \frac{r_n(L_{ij},a)}{r_n(L_{ij})} \right] - \frac{(|A|-1)}{2} \ln(n). \qquad (A2)$$
We note that $\mathrm{BIC}(\mathcal{L}_{ij}, x_1^n) > \mathrm{BIC}(\mathcal{L}, x_1^n)$ if, and only if,
$$\sum_{a \in A} \left[ r_n(L_i,a) \ln \frac{r_n(L_i,a)}{r_n(L_i)} + r_n(L_j,a) \ln \frac{r_n(L_j,a)}{r_n(L_j)} - r_n(L_{ij},a) \ln \frac{r_n(L_{ij},a)}{r_n(L_{ij})} \right] < \frac{(|A|-1) \ln(n)}{2n}. \qquad (A3)$$
Because $r_n(L,a)$ and $r_n(L)$ are non-negative, applying the log-sum inequality we obtain
$$r_n(L_i,a) \ln \frac{r_n(L_i,a)}{r_n(L_i)} + r_n(L_j,a) \ln \frac{r_n(L_j,a)}{r_n(L_j)} \ge \left( r_n(L_i,a) + r_n(L_j,a) \right) \ln \frac{r_n(L_i,a) + r_n(L_j,a)}{r_n(L_i) + r_n(L_j)},$$
or, equivalently,
$$r_n(L_i,a) \ln \frac{r_n(L_i,a)}{r_n(L_i)} + r_n(L_j,a) \ln \frac{r_n(L_j,a)}{r_n(L_j)} \ge r_n(L_{ij},a) \ln \frac{r_n(L_{ij},a)}{r_n(L_{ij})},$$
with equality if, and only if, $\frac{r_n(L_i,a)}{r_n(L_i)} = \frac{r_n(L_j,a)}{r_n(L_j)}$, $a \in A$.
As a consequence, the left-hand side of inequality (A3) verifies
$$\sum_{a \in A} \left[ r_n(L_i,a) \ln \frac{r_n(L_i,a)}{r_n(L_i)} + r_n(L_j,a) \ln \frac{r_n(L_j,a)}{r_n(L_j)} - r_n(L_{ij},a) \ln \frac{r_n(L_{ij},a)}{r_n(L_{ij})} \right] \ge 0, \qquad (A4)$$
with equality if, and only if, $\frac{r_n(L_i,a)}{r_n(L_i)} = \frac{r_n(L_j,a)}{r_n(L_j)}$ $\forall a \in A$.
Considering that $\frac{(|A|-1)\ln(n)}{2n} \to 0$ as $n \to \infty$, and from Equation (A3), we have that if $\lim_{n \to \infty} I_{\{\mathrm{BIC}(\mathcal{L}_{ij}, x_1^n) > \mathrm{BIC}(\mathcal{L}, x_1^n)\}} = 1$, where $I_W$ is the indicator function of the set $W$, then
$$\lim_{n \to \infty} \sum_{a \in A} \left[ r_n(L_i,a) \ln \frac{r_n(L_i,a)}{r_n(L_i)} + r_n(L_j,a) \ln \frac{r_n(L_j,a)}{r_n(L_j)} - r_n(L_{ij},a) \ln \frac{r_n(L_{ij},a)}{r_n(L_{ij})} \right] \le 0.$$
From Equation (A4), taking the limit inside the sum, we obtain
$$\sum_{a \in A} \left[ P(L_i,a) \ln \frac{P(L_i,a)}{P(L_i)} + P(L_j,a) \ln \frac{P(L_j,a)}{P(L_j)} - P(L_{ij},a) \ln \frac{P(L_{ij},a)}{P(L_{ij})} \right] = 0,$$
applying the log-sum inequality. This means that $\frac{P(L_i,a)}{P(L_i)} = \frac{P(L_j,a)}{P(L_j)}$ $\forall a \in A$, or, equivalently, $P(a|L_i) = P(a|L_j)$ $\forall a \in A$.
For the other half of the proof, suppose that $P(a|L_i) = P(a|L_j)$ $\forall a \in A$. As a consequence, we have that
$$P(a|L_{ij}) = P(a|L_i) \quad \forall a \in A, \qquad (A5)$$
and
$$\mathrm{BIC}(\mathcal{L}, x_1^n) - \mathrm{BIC}(\mathcal{L}_{ij}, x_1^n) = \ln \prod_{a \in A} \left( \frac{N_n(L_i,a)}{N_n(L_i)} \right)^{N_n(L_i,a)} + \ln \prod_{a \in A} \left( \frac{N_n(L_j,a)}{N_n(L_j)} \right)^{N_n(L_j,a)} - \ln \prod_{a \in A} \left( \frac{N_n(L_{ij},a)}{N_n(L_{ij})} \right)^{N_n(L_{ij},a)} - \frac{(|A|-1)}{2} \ln(n).$$
Now, considering that $\frac{N_n(L_{ij},a)}{N_n(L_{ij})}$ is the maximum likelihood estimator of $P(a|L_{ij})$,
$$\prod_{a \in A} \left( \frac{N_n(L_{ij},a)}{N_n(L_{ij})} \right)^{N_n(L_{ij},a)} \ge \prod_{a \in A} P(a|L_{ij})^{N_n(L_{ij},a)},$$
so $\mathrm{BIC}(\mathcal{L}, x_1^n) - \mathrm{BIC}(\mathcal{L}_{ij}, x_1^n)$ is bounded above by
$$\ln \prod_{a \in A} \left( \frac{N_n(L_i,a)}{N_n(L_i)} \right)^{N_n(L_i,a)} + \ln \prod_{a \in A} \left( \frac{N_n(L_j,a)}{N_n(L_j)} \right)^{N_n(L_j,a)} - \ln \prod_{a \in A} P(a|L_{ij})^{N_n(L_{ij},a)} - \frac{(|A|-1)}{2} \ln(n) = N_n(L_i)\, D\!\left( \frac{N_n(L_i,\cdot)}{N_n(L_i)} \,\Big\|\, P(\cdot|L_i) \right) + N_n(L_j)\, D\!\left( \frac{N_n(L_j,\cdot)}{N_n(L_j)} \,\Big\|\, P(\cdot|L_j) \right) - \frac{(|A|-1)}{2} \ln(n).$$
Here, Equation (A5) was used, and $D(P(\cdot) \,\|\, Q(\cdot))$ is the relative entropy given by Definition A1. Applying Lemma 6.3 from [4], for any $\delta > 0$ and large enough $n$,
$$D\!\left( \frac{N_n(L,\cdot)}{N_n(L)} \,\Big\|\, P(\cdot|L) \right) \le \sum_{a \in A} \frac{\left( \frac{N_n(L,a)}{N_n(L)} - P(a|L) \right)^2}{P(a|L)} \le \sum_{a \in A} \frac{\delta \ln(n)}{N_n(L)\, P(a|L)}.$$
Then, for any δ > 0 and large enough n,
B I C ( L , x 1 n ) B I C ( L i j , x 1 n ) 2 δ | A | p ln ( n ) ( | A | 1 ) 2 ln ( n ) = ln ( n ) 2 δ | A | p ( | A | 1 ) 2 ,
where p = min { P ( a | L ) : a A , L { L i , L j } } .
In particular, taking δ < p ( | A | 1 ) 4 | A | , for n large enough,
B I C ( L , x 1 n ) B I C ( L i j , x 1 n ) < 0 .
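The quantity controlled in this proof is computable directly from the counts $N_n(\cdot,\cdot)$. The sketch below (function names and sample counts are ours, not from the paper) evaluates the BIC gain of merging two parts, where the merge removes one part and therefore saves $\frac{(|A|-1)}{2}\ln(n)$ of penalty; a positive gain means the merged model is preferred.

```python
import math

def loglik(counts):
    """Maximized log-likelihood of one part: sum_a N(L,a) * ln(N(L,a)/N(L))."""
    total = sum(counts)
    return sum(c * math.log(c / total) for c in counts if c > 0)

def bic_gain_of_merge(counts_i, counts_j, n):
    """BIC(merged) - BIC(split) for two parts over an alphabet of size len(counts_i).
    Merging removes one part, saving (|A|-1)/2 * ln(n) of penalty."""
    A = len(counts_i)                          # counts indexed by the alphabet
    merged = [ci + cj for ci, cj in zip(counts_i, counts_j)]
    return (loglik(merged) - loglik(counts_i) - loglik(counts_j)
            + (A - 1) / 2 * math.log(n))

# Nearly equal conditional laws: the merged (smaller) model wins.
assert bic_gain_of_merge([50, 50], [48, 52], n=10_000) > 0
# Clearly different conditional laws: keeping the parts separate wins.
assert bic_gain_of_merge([90, 10], [10, 90], n=10_000) < 0
```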

Appendix A.2. Proof of Theorem 2

Proof. 
$(i)$ Fix a value $a \in A$, take two parts $L_i, L_j \in \mathcal{L}$, and define $a_1 = N_n(L_i,a)$, $b_1 = N_n(L_i)$, $a_2 = N_n(L_j,a)$, $b_2 = N_n(L_j)$.
Then $\sum_{s=1,2} a_s = N_n(L_{ij},a)$ and $\sum_{s=1,2} b_s = N_n(L_{ij})$. Thus, by the log-sum inequality, $a_1\ln\left(\frac{a_1}{b_1}\right) + a_2\ln\left(\frac{a_2}{b_2}\right) \geq \left(\sum_{s=1,2} a_s\right)\ln\left(\frac{\sum_{s=1,2} a_s}{\sum_{s=1,2} b_s}\right)$, with equality if, and only if, $\frac{a_1}{b_1} = \frac{a_2}{b_2}$. This means that
$$\sum_{a\in A}\left[N_n(L_i,a)\ln\frac{N_n(L_i,a)}{N_n(L_i)} + N_n(L_j,a)\ln\frac{N_n(L_j,a)}{N_n(L_j)}\right] \geq \sum_{a\in A} N_n(L_{ij},a)\ln\frac{N_n(L_{ij},a)}{N_n(L_{ij})},$$
with equality if, and only if, $\frac{N_n(L_i,a)}{N_n(L_i)} = \frac{N_n(L_j,a)}{N_n(L_j)}\ \forall a \in A$. Thereby, $(i)$ is proved.
$(iii)$ $d_{\mathcal{L}}(i,k) \leq d_{\mathcal{L}}(i,j) + d_{\mathcal{L}}(j,k)$ if, and only if,
$$0 \leq \sum_{s=i,k}\sum_{a\in A} N_n(L_j,a)\left[\ln\frac{N_n(L_j,a)}{N_n(L_j)} - \ln\frac{N_n(L_{sj},a)}{N_n(L_{sj})}\right] + \sum_{a\in A} N_n(L_i,a)\left[\ln\frac{N_n(L_{ik},a)}{N_n(L_{ik})} - \ln\frac{N_n(L_{ij},a)}{N_n(L_{ij})}\right] + \sum_{a\in A} N_n(L_k,a)\left[\ln\frac{N_n(L_{ik},a)}{N_n(L_{ik})} - \ln\frac{N_n(L_{kj},a)}{N_n(L_{kj})}\right],$$
and the right side is equivalent to
$$\sum_{s=i,k} N_n(L_j)\sum_{a\in A}\frac{N_n(L_j,a)}{N_n(L_j)}\ln\left(\frac{N_n(L_j,a)}{N_n(L_j)}\middle/\frac{N_n(L_{sj},a)}{N_n(L_{sj})}\right) + \sum_{s=i,k}\sum_{a\in A}\frac{N_n(L_s,a)\,N_n(L_{ik})}{N_n(L_{ik},a)}\,\frac{N_n(L_{ik},a)}{N_n(L_{ik})}\ln\left(\frac{N_n(L_{ik},a)}{N_n(L_{ik})}\middle/\frac{N_n(L_{sj},a)}{N_n(L_{sj})}\right),$$
which is greater than or equal to
$$N_n(L_j)\sum_{s=i,k} D\left(\frac{N_n(L_j,\cdot)}{N_n(L_j)}\,\middle\|\,\frac{N_n(L_{sj},\cdot)}{N_n(L_{sj})}\right) + \frac{1}{n}\sum_{s=i,k} D\left(\frac{N_n(L_{ik},\cdot)}{N_n(L_{ik})}\,\middle\|\,\frac{N_n(L_{sj},\cdot)}{N_n(L_{sj})}\right) \geq 0.$$
Thus, $(iii)$ is proved. ☐

Appendix A.3. Proof of Theorem 3

Proof. 
The first part of the proof is devoted to showing that the maximum of the BIC can only be attained on a good partition.
Consider $\mathcal{L}^b = \{L_1^b, L_2^b, \ldots, L_{|\mathcal{L}^b|}^b\}$, a partition of $S$ which is not a good partition. This means that at least one part does not verify Definition 3(i). Suppose, without loss of generality, that $\mathcal{L}^b$ has exactly one part that is not good, $L_1^b$. Suppose also that $BIC(\mathcal{L}^b, x_1^n) > BIC(\mathcal{L}, x_1^n)$ for every other partition $\mathcal{L}$.
Let $\mathcal{L}^{b*} = \{S_{11}, S_{12}, \ldots, S_{1k_1}, L_2^b, \ldots, L_{|\mathcal{L}^b|}^b\}$, where $\cup_{i=1}^{k_1} S_{1i} = L_1^b$ and each $S_{1i}$ verifies Definition 3(i). In addition, impose $P(\cdot|S_{1i}) \neq P(\cdot|S_{1j})$ for all $i, j \in \{1, \ldots, k_1\}$, $i \neq j$. By Corollary 1 and Remark 3, we obtain $BIC(\mathcal{L}^b, x_1^n) < BIC(\mathcal{L}^{b*}, x_1^n)$, eventually almost surely as $n \to \infty$. This contradicts the supposition that the BIC attains its maximum at $\mathcal{L}^b$.
The second part of the proof is dedicated to identifying the minimal good partition within the space of good partitions, $\mathcal{P} = \{\mathcal{L} : \mathcal{L}\ \text{is a good partition of}\ S\}$.
Let $\mathcal{L} = \{L_1, L_2, \ldots, L_{|\mathcal{L}|}\} \in \mathcal{P}$ be an arbitrary good partition; by Corollary 1, with $K-1 = |\mathcal{L}|$ and taking an appropriate $T \subseteq \{1, \ldots, |\mathcal{L}|\}$, $BIC(\mathcal{L}, x_1^n) \leq BIC(\mathcal{L}_T, x_1^n)$, eventually almost surely as $n \to \infty$.
Note that $|\mathcal{P}| < \infty$ because $|S| < \infty$, and define $\mathcal{L}^* = \operatorname{argmax}_{\mathcal{L} \in \mathcal{P}} BIC(\mathcal{L}_T, x_1^n)$ when $n \to \infty$. By construction, $\mathcal{L}^* \in \mathcal{P}$. If $\mathcal{L}^* = \{L_1^*, \ldots, L_{|\mathcal{L}^*|}^*\}$ were not minimal, there would exist $i$ and $j$ such that $BIC(\mathcal{L}^*, x_1^n) < BIC(\mathcal{L}^{*ij}, x_1^n)$, which is impossible because $\mathcal{L}^*$ was defined as the maximizer of the BIC criterion over the class $\mathcal{P}$, and by construction $\mathcal{L}^{*ij} \in \mathcal{P}$ (see Notation 1). As a consequence, $\mathcal{L}^*$ is the minimal good partition. ☐
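Theorem 3 justifies searching for the BIC maximizer without scanning every partition. One common way to exploit this in practice is a greedy agglomerative search that repeatedly merges the pair of parts with the largest positive BIC gain; the sketch below is only illustrative (the data and the stopping rule are ours), not the paper's exact estimation procedure.

```python
import math
from itertools import combinations

def loglik(counts):
    """Maximized log-likelihood of a part: sum_a N(L,a) * ln(N(L,a)/N(L))."""
    total = sum(counts)
    return sum(c * math.log(c / total) for c in counts if c > 0)

def estimate_partition(counts_by_state, n, alphabet_size):
    """Greedily merge the pair of parts whose union most increases the BIC;
    stop when no merge improves it. Each merge removes one part and so
    saves (|A|-1)/2 * ln(n) of penalty."""
    parts = [(frozenset([s]), list(c)) for s, c in counts_by_state.items()]
    penalty = (alphabet_size - 1) / 2 * math.log(n)
    while True:
        best_gain, best_pair = 0.0, None
        for pi, pj in combinations(parts, 2):
            merged = [a + b for a, b in zip(pi[1], pj[1])]
            gain = loglik(merged) - loglik(pi[1]) - loglik(pj[1]) + penalty
            if gain > best_gain:
                best_gain, best_pair = gain, (pi, pj)
        if best_pair is None:
            return [set(s) for s, _ in parts]
        pi, pj = best_pair
        parts.remove(pi); parts.remove(pj)
        parts.append((pi[0] | pj[0], [a + b for a, b in zip(pi[1], pj[1])]))

# Toy counts N_n(s, a) for four states over a binary alphabet (made-up data):
counts = {"00": [90, 10], "01": [88, 12], "10": [12, 88], "11": [50, 50]}
partition = estimate_partition(counts, n=200, alphabet_size=2)
print(sorted(sorted(p) for p in partition))  # [['00', '01'], ['10'], ['11']]
```

Here the states "00" and "01" have nearly identical conditional laws and are merged into one part, while "10" and "11" remain separate, mirroring the role of the minimal good partition.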

Appendix B. Auxiliary Results

Proposition A1.
Let the process have transition law $P(\cdot|L)$ on $A$, where $L \in \mathcal{L}$ and $\mathcal{L}$ is a good partition of $S$. For arbitrary $\delta > 0$, there exists $\alpha > 0$ (depending on P) such that, eventually almost surely as $n \to \infty$,
$$\left|\frac{N_n(L,a)}{N_n(L)} - P(a|L)\right| \leq \sqrt{\frac{\delta\ln(n)}{N_n(L)}},$$
with $M < \alpha\ln(n)$.
Proof. 
From Corollary 2 of [6], we have that, for any $\epsilon > 0$, there exists $\alpha > 0$ (depending on P) such that, eventually almost surely as $n \to \infty$,
$$\left|\frac{N_n(s,a)}{N_n(s)} - P(a|s)\right| \leq \sqrt{\frac{\epsilon\ln(N_n(s))}{N_n(s)}},$$
with $M < \alpha\ln(n)$. Letting $\delta > 0$ and $\epsilon = \frac{\delta}{|A|^{2M}}$ in Equation (A7), we obtain
$$\frac{N_n(s,a)}{N_n(s)} - P(a|s) \leq \sqrt{\frac{\delta\ln(N_n(s))}{|A|^{2M}\, N_n(s)}}, \qquad P(a|s) - \frac{N_n(s,a)}{N_n(s)} \leq \sqrt{\frac{\delta\ln(N_n(s))}{|A|^{2M}\, N_n(s)}}.$$
Because $\mathcal{L}$ is a good partition of $S$, $P(a|s) = P(a|L)\ \forall s \in L$ and $L \in \mathcal{L}$, and we obtain
$$\sum_{s\in L} N_n(s,a) - P(a|L)\sum_{s\in L} N_n(s) \leq \sum_{s\in L}\sqrt{\frac{\delta\ln(N_n(s))\,N_n(s)}{|A|^{2M}}}.$$
Following the equations in (1), we have
$$N_n(L,a) - P(a|L)\,N_n(L) \leq \frac{\sqrt{\delta\ln(n)}}{|A|^{M}}\sum_{s\in L}\sqrt{N_n(s)}.$$
Then,
$$\frac{N_n(L,a)}{N_n(L)} - P(a|L) \leq \frac{\sqrt{\delta\ln(n)}}{|A|^{M}\, N_n(L)}\,|L|\max_{s\in L}\sqrt{N_n(s)} \leq \frac{\sqrt{\delta\ln(n)}}{|A|^{M}\, N_n(L)}\,|A|^{M}\sqrt{\sum_{s\in L} N_n(s)} = \frac{\sqrt{\delta\ln(n)}\,\sqrt{N_n(L)}}{N_n(L)} = \sqrt{\frac{\delta\ln(n)}{N_n(L)}}.$$
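The final chain of inequalities relies on $\sum_{s\in L}\sqrt{N_n(s)} \leq |L|\max_{s\in L}\sqrt{N_n(s)}$, on $|L| \leq |A|^M$ (a part contains at most all strings of length $M$), and on $\max_{s\in L} N_n(s) \leq \sum_{s\in L} N_n(s)$. A quick numeric check of the combined bound, with made-up counts:

```python
import math

# Made-up counts N_n(s) for the strings s of length M collected in one part L.
A, M = 2, 3
counts = [412, 97, 3051, 260]

sum_sqrt = sum(math.sqrt(c) for c in counts)
lhs = len(counts) * math.sqrt(max(counts))   # |L| * max_s sqrt(N_n(s))
rhs = A**M * math.sqrt(sum(counts))          # |A|^M * sqrt(N_n(L))
assert len(counts) <= A**M                   # |L| <= |A|^M
assert sum_sqrt <= lhs <= rhs                # the bound used in the proof
```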

References

1. Bühlmann, P.; Wyner, A. Variable length Markov chains. Ann. Stat. 1999, 27, 480–513.
2. Rissanen, J. A universal data compression system. IEEE Trans. Inf. Theory 1983, 29, 656–664.
3. Weinberger, M.; Rissanen, J.; Feder, M. A universal finite memory source. IEEE Trans. Inf. Theory 1995, 41, 643–652.
4. Csiszár, I.; Talata, Z. Context tree estimation for not necessarily finite memory processes, via BIC and MDL. IEEE Trans. Inf. Theory 2006, 52, 1007–1016.
5. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
6. Csiszár, I. Large-scale typicality of Markov sample paths and consistency of MDL order estimators. IEEE Trans. Inf. Theory 2002, 48, 1616–1628.
7. Csiszár, I.; Shields, P.C. The consistency of the BIC Markov order estimator. Ann. Stat. 2000, 28, 1601–1619.
8. Jääskinen, V.; Xiong, J.; Corander, J.; Koski, T. Sparse Markov chains for sequence data. Scand. J. Stat. 2014, 41, 639–655.
9. Manning, C.D.; Schütze, H. Foundations of Statistical Natural Language Processing; MIT Press: Cambridge, MA, USA, 1999.
10. García, J.E.; González-López, V.A. Minimal Markov Models. In Proceedings of the Fourth Workshop on Information Theoretic Methods in Science and Engineering, Helsinki, Finland, 7–10 August 2011; Volume 1, pp. 25–28.
11. Farcomeni, A. Hidden Markov Partition Models. Stat. Probab. Lett. 2011, 81, 1766–1770.
12. García, J.E.; Fernández, M. Copula based model correction for bivariate Bernoulli financial series. In Proceedings of the 11th International Conference of Numerical Analysis and Applied Mathematics (ICNAAM 2013), Rhodes, Greece, 21–27 September 2013; AIP Publishing: Melville, NY, USA, 2013; Volume 1558, pp. 1487–1490.
13. Fernández, M.; García, J.E.; González-López, V.A. Multivariate Markov chain predictions adjusted with copula models. In New Trends in Stochastic Modeling and Data Analysis; ISAST: Athens, Greece, 2015.
14. García, J.E.; González-López, V.A.; Hirsh, I.D. Copula-Based Prediction of Economic Movements. In Proceedings of the 13th International Conference of Numerical Analysis and Applied Mathematics (ICNAAM 2015), Rhodes, Greece, 23–29 September 2015; AIP Publishing: Melville, NY, USA, 2015; Volume 1738, p. 140005.
15. García, J.E.; González-López, V.A. Detecting regime changes in Markov models. In Proceedings of the Sixth Workshop on Information Theoretic Methods in Science and Engineering, Tokyo, Japan, 26–29 August 2013.
16. Gusfield, D. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology; Cambridge University Press: Cambridge, UK, 1997.
17. MSNBC.com Anonymous Web Data Data Set. Available online: https://archive.ics.uci.edu/ml/datasets/MSNBC.com+Anonymous+Web+Data (accessed on 5 April 2017).
18. Index of /~jg/MSNBC. Available online: http://www.ime.unicamp.br/~jg/MSNBC/ (accessed on 5 April 2017).
19. Galves, A.; Galves, C.; García, J.; Garcia, N.L.; Leonardi, F. Context tree selection and linguistic rhythm retrieval from written texts. Ann. Appl. Stat. 2012, 6, 186–209.
Table 1. Transition probabilities.

s        | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
$P(0|s)$ | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 | 0.1
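Under the law of Table 1, every state except 011 shares the same transition probability, so the minimal partition has two parts: {011} and the remaining seven states. The simulation below is our own sketch (the alphabet and memory are taken from the table); it recovers the conditional probabilities empirically from a long sample path.

```python
import random

# Transition law of Table 1 (binary alphabet A = {0, 1}, memory 3):
# P(0|s) = 0.2 for s = 011 and P(0|s) = 0.1 otherwise, so the minimal
# partition has the two parts {011} and its complement.
P0 = {s: (0.2 if s == "011" else 0.1)
      for s in ["000", "001", "010", "011", "100", "101", "110", "111"]}

random.seed(1)
x = "000"
counts = {s: [0, 0] for s in P0}          # counts[s] = [N(s,0), N(s,1)]
for _ in range(1_000_000):
    s = x[-3:]
    a = 0 if random.random() < P0[s] else 1
    counts[s][a] += 1
    x += str(a)

for s in sorted(counts):
    n0, n1 = counts[s]
    print(s, round(n0 / (n0 + n1), 3))    # empirical P(0|s), near the table's values
```

States with many 0's are rarely visited (0 has probability about 0.1 at each step), which is exactly the small-count situation where pooling equivalent states into one part improves the estimate.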
Table 2. Sample of 12 sessions. Each line represents the path followed by a user.

User 1: frontpage, tech, tech, frontpage.
User 2: weather, weather, weather, misc, local, weather, weather, weather.
User 3: on-air, msn-news, msn-news, msn-news, msn-news, misc, msn-news.
User 4: news.
User 5: msn-sports, sports, msn-sports.
User 6: frontpage, frontpage, frontpage.
User 7: news, business, tech, local, business, business.
User 8: frontpage.
User 9: local.
User 10: frontpage, tech, tech.
User 11: frontpage, frontpage, business, frontpage.
User 12: sports, sports, sports, sports, sports, sports.
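For data of this form, the counts $N_n(s,a)$ are pooled over sessions, each session contributing one sample path. A minimal counting sketch for the 12 sessions shown (order 1 only, for brevity; the paper fits memories 2 and 3 on the full data set):

```python
from collections import Counter

# The 12 sessions of Table 2, as symbol sequences.
sessions = [
    ["frontpage", "tech", "tech", "frontpage"],
    ["weather", "weather", "weather", "misc", "local", "weather", "weather", "weather"],
    ["on-air", "msn-news", "msn-news", "msn-news", "msn-news", "misc", "msn-news"],
    ["news"],
    ["msn-sports", "sports", "msn-sports"],
    ["frontpage", "frontpage", "frontpage"],
    ["news", "business", "tech", "local", "business", "business"],
    ["frontpage"],
    ["local"],
    ["frontpage", "tech", "tech"],
    ["frontpage", "frontpage", "business", "frontpage"],
    ["sports", "sports", "sports", "sports", "sports", "sports"],
]

# N(s, a): order-1 transition counts pooled over all sessions.
N = Counter((s, a) for path in sessions for s, a in zip(path, path[1:]))
print(N[("tech", "tech")], N[("sports", "sports")])  # 2 5
```

From counts like these, the ratios $N_n(s,a)/N_n(s)$ feed the BIC comparison between candidate partitions of the state space.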
Table 3. Number of parts (cardinality of $\mathcal{L}$) and BIC value of the model (Definition 2), for memories 3 and 2, respectively. In (ii), $\epsilon = \frac{(|A|-1)}{2}\cdot\frac{1}{10}$ was used. The highest BIC value in each case, attained by method (iii), indicates the best method.

Order 3
Method | Number of Parts ($|\mathcal{L}|$) | BIC Value
(i)    | 196 | −2957442
(ii)   | 210 | −2895322
(iii)  | 269 | −2865622

Order 2
Method | Number of Parts ($|\mathcal{L}|$) | BIC Value
(i)    | 177 | −3614825
(ii)   | 177 | −3613655
(iii)  | 181 | −3611092
Table 4. Parts of $\mathcal{L}$ such that $P(\text{local}\,|\,L) > 0.6$.

Part | Strings | $P(\text{local}\,|\,L_i)$
$L_1$ | msn-news.news.local, msn-news.business.local, on-air.tech.local, tech.local.local, msn-news.tech.local, business.local.local, on-air.local.local, msn-news.local.local | 0.7257
$L_2$ | health.news.local, health.local.local, news.local.local | 0.6096
$L_3$ | local.local.local | 0.8874
$L_4$ | misc.local.local, tech.weather.local, frontpage.opinion.misc, local.news.misc, local.misc.local, misc.misc.local | 0.6355
$L_5$ | weather.local.local, local.weather.misc, weather.weather.misc | 0.7822
$L_6$ | local.local.misc, on-air.weather.misc, msn-news.weather.misc, local.misc.misc | 0.7373
$L_7$ | misc.local.misc | 0.8563
