Permutation Complexity and Coupling Measures in Hidden Markov Models

In [Haruna, T. and Nakajima, K., 2011. Physica D 240, 1370-1377], the authors introduced the duality between values (words) and orderings (permutations) as a basis to discuss the relationship between information theoretic measures for finite-alphabet stationary stochastic processes and their permutation analogues. This duality has been used to give a simple proof of the equality between the entropy rate and the permutation entropy rate for any finite-alphabet stationary stochastic process and to establish some results on the excess entropy and the transfer entropy for finite-alphabet stationary ergodic Markov processes. In this paper, we extend our previous results to hidden Markov models and show the equalities between various information theoretic complexity and coupling measures and their permutation analogues. In particular, we show the following two results within the realm of hidden Markov models with ergodic internal processes: the two permutation analogues of the transfer entropy, the symbolic transfer entropy and the transfer entropy on rank vectors, are both equivalent to the transfer entropy when they are considered as rates, and the directed information theory can be captured by the permutation entropy approach.


Introduction
Recently, the permutation-information theoretic approach to time series analysis proposed by Bandt and Pompe [1] has become popular in various fields [2]. The method of permutation has been shown to be easy to implement compared with other traditional methods and to be robust against noise [3][4][5][6][7]. However, on the theoretical side, few results are known for the permutation analogues of information theoretic measures other than the entropy rate.
There are two approaches to introducing permutations into dynamical systems theory [8]. The first approach was introduced by Bandt et al. [9]. Given a one-dimensional interval map, they considered permutations induced by iterations of the map. Each point in the interval is classified into one of n! permutations according to the permutation defined by n − 1 iterations of the map starting from that point. The Shannon entropy of this partition of the interval (called the standard partition) is then taken and normalized by n. The quantity obtained in the limit n → ∞ is called the permutation entropy if it exists. It was proven that the permutation entropy is equal to the Kolmogorov-Sinai entropy for any piecewise monotone interval map [9]. This approach based on standard partitions was extended by Keller et al. [10][11][12][13][14].
The second approach was taken by Amigó et al. [2,15,16]. In this approach, given a measure-preserving map on a probability space, an arbitrary finite partition of the space is first taken. This gives rise to a finite-alphabet stationary stochastic process. An arbitrary ordering is introduced on the alphabet, and the permutation types of words of finite length can be defined naturally (see Section 2 below). It is proven that the Shannon entropy of the occurrence of permutations of a fixed length, normalized by that length, converges as the length of the permutations tends to infinity. The quantity obtained is called the permutation entropy rate (also called the metric permutation entropy) and is shown to be equal to the entropy rate of the process. By taking the limit over finer partitions of the measurable space, the permutation entropy rate of the measure-preserving map is defined if the limit exists. Amigó [16] proved that it exists and is equal to the Kolmogorov-Sinai entropy.
In this paper, we restrict our attention to finite-alphabet stationary stochastic processes. Thus, we follow the second approach, namely, an ordering on the alphabet is introduced arbitrarily. For quantities other than the entropy rate, three results for finite-alphabet stationary ergodic Markov processes were shown in our previous work: the equality between the excess entropy and the permutation excess entropy [17], the equality between the mutual information expression of the excess entropy and its permutation analogue [18] and the equality between the transfer entropy rate and the symbolic transfer entropy rate [19]. Whether these equalities for the permutation entropies can be extended to general finite-alphabet stationary ergodic stochastic processes is still unknown. However, for the modified permutation entropies defined via the partition of the set of words based on both permutations and equalities between occurrences of symbols, which is finer than the partition obtained from permutations only, the corresponding equalities hold for general finite-alphabet stationary ergodic stochastic processes [20].
The purpose of this paper is to generalize our previous results on the permutation entropies for finite-alphabet stationary ergodic Markov processes to output processes of finite-state finite-alphabet hidden Markov models with ergodic internal processes. Upon this generalization, the somewhat ad hoc proofs in our previous work for multivariate stationary ergodic Markov processes become straightforward. The key property of hidden Markov models (HMMs), which we will use repeatedly, is the following: a marginal process of the output process of a hidden Markov model with an ergodic internal process is, again, the output process of a hidden Markov model with an ergodic internal process obtained from the original hidden Markov model. In general, this property does not hold for multivariate stationary ergodic Markov processes. The generalization also gives us easy access to quantities that have not been considered theoretically in the permutation approach. In this paper, we treat the following quantities: excess entropy [21], transfer entropy [22,23], momentary information transfer [24] and directed information [25,26]. As far as the authors are aware, the equality between the momentary information transfer and its permutation analogue and that for the directed information have not been discussed anywhere. These equalities could be proven directly, with some discussion in addition to that in [17], for finite-alphabet multivariate stationary ergodic Markov processes. However, they can be proven straightforwardly, as in [17], within the realm of HMMs with ergodic internal processes, once we show Lemma 3 below.
This paper is organized as follows: In Section 2, we briefly review our previous result on the duality between words and permutations to make this paper as self-contained as possible. In Section 3, we prove a lemma about finite-state finite-alphabet hidden Markov models.
In Section 4, we show equalities between various information theoretic complexity and coupling measures and their permutation analogues that hold for output processes of finite-state finite-alphabet hidden Markov models with ergodic internal processes. In Section 5, we discuss how our results are related to the previous work in the literature.

The Duality between Words and Permutations
In this section, we summarize the results from our previous work [17] that will be used in this paper. Let A n be a finite set consisting of natural numbers from one to n, called an alphabet. In this paper, A n is considered as a totally ordered set ordered by the usual "less-than-or-equal-to" relationship. When we emphasize the total order, we call A n an ordered alphabet.
Note that the results in this paper hold for every total order on A n . This is because the probability of the occurrence of the permutations in a given stationary stochastic process over A n with an arbitrary total order is just a re-indexing of that with the "less-than-or-equal-to" total order.
Let A_n^L = A_n × · · · × A_n be the L-fold product of A_n. A word of length L ≥ 1 is an element of A_n^L. It is denoted by x_{1:L} := x_1 · · · x_L := (x_1, · · · , x_L) ∈ A_n^L. Let S_L denote the set of all permutations of {1, 2, · · · , L}. We say that the permutation type of a word x_{1:L} is π ∈ S_L if we have x_{π(i)} ≤ x_{π(i+1)} for i = 1, 2, · · · , L − 1 and π(i) < π(i + 1) when x_{π(i)} = x_{π(i+1)}. Namely, the permutation type of x_{1:L} is the permutation of indices defined by re-ordering the symbols x_1, · · · , x_L in increasing order, breaking ties by position. We write φ_{n,L} : A_n^L → S_L for the map sending each word to its permutation type. For example, the permutation type of x_{1:5} = 31212 ∈ A_3^5 is π(1)π(2)π(3)π(4)π(5) = 24351, because x_2 x_4 x_3 x_5 x_1 = 11223 and 2 < 4, 3 < 5 for the tied symbols. This example illustrates the following two properties of the map φ_{n,L}: first, φ_{n,L}(A_n^L) can be a proper subset of S_L. As one can see from Theorem 1 below, φ_{n,L}(A_n^L) is a proper subset of S_L if and only if L > n. Second, two different words can have the same permutation type.
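For concreteness, the permutation type φ_{n,L} can be computed with a stable sort of the indices. The following is a minimal Python sketch (the alphabet sizes and words below are illustrative assumptions); it reproduces the example above and checks, for n = 2 and L = 3, that φ_{n,L}(A_n^L) is a proper subset of S_L.

```python
from itertools import permutations, product

def permutation_type(word):
    """Permutation type of a word: indices sorted so that the symbols are
    non-decreasing, ties broken by position (sorted() is stable)."""
    return tuple(sorted(range(1, len(word) + 1), key=lambda i: word[i - 1]))

# The example from the text: the permutation type of 31212 over A_3 is 24351.
print(permutation_type((3, 1, 2, 1, 2)))          # (2, 4, 3, 5, 1)

# For n = 2 and L = 3 > n, only 5 of the 3! = 6 permutations are attained:
# the permutation 321 would need two descents, which is impossible over A_2.
attained = {permutation_type(w) for w in product((1, 2), repeat=3)}
print(len(attained), len(list(permutations((1, 2, 3)))))   # 5 6
```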
For π ∈ φ_{n,L}(A_n^L), we define µ_{n,L}(π) = x_{1:L} as follows: decompose the sequence π(1)π(2) · · · π(L) into its maximal ascending subsequences and set x_{π(i)} = j when π(i) belongs to the j-th maximal ascending subsequence. Let Desc(π) denote the number of descents of π, namely, the number of indices i such that π(i) > π(i + 1); the number of maximal ascending subsequences of π is then k = Desc(π) + 1. Note that Desc(π) ≤ n − 1, because π is the permutation type of some word y_{1:L} ∈ A_n^L. Thus, we have k = Desc(π) + 1 ≤ n. Hence, µ_{n,L} is well-defined as a map from φ_{n,L}(A_n^L) to A_n^L.
By construction, we have φ_{n,L} ∘ µ_{n,L}(π) = π for all π ∈ φ_{n,L}(A_n^L). To illustrate the construction of µ_{n,L}, let us consider the word y_{1:5} = 21123 ∈ A_3^5. The permutation type of y_{1:5} is π(1)π(2)π(3)π(4)π(5) = 23145. The decomposition of 23145 into maximal ascending subsequences is 23, 145. We obtain µ_{3,5}(23145) = 21122, whose permutation type is again 23145.
Theorem 1 (i) For any π ∈ S_L, we have φ_{n,L}^{-1}(π) = ∅ if and only if Desc(π) ≥ n; (ii) let B_{n,L} ⊆ A_n^L and C_{n,L} ⊆ S_L be the subsets constructed in [17]. Then, φ_{n,L} restricted to B_{n,L} is a map into C_{n,L}, and µ_{n,L} restricted to C_{n,L} is a map into B_{n,L}. They form a pair of mutually inverse maps. Furthermore, an explicit estimate involving B_{n,L} and C_{n,L} holds; we refer to [17] for its precise statement.
Proof. The theorem is a recasting of statements in Lemma 5 and Theorem 9 in [17].
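The construction of µ_{n,L} from the maximal ascending subsequences can be sketched in a few lines of Python; the permutation below is the one from the example and is used only for illustration. The sketch also checks the identity φ_{n,L} ∘ µ_{n,L} = id on its input.

```python
def permutation_type(word):
    return tuple(sorted(range(1, len(word) + 1), key=lambda i: word[i - 1]))

def mu(perm):
    """mu_{n,L}: write symbol j at the positions that belong to the j-th
    maximal ascending subsequence of perm (assumes Desc(perm) <= n - 1)."""
    word = [0] * len(perm)
    j = 1
    word[perm[0] - 1] = j
    for i in range(1, len(perm)):
        if perm[i] < perm[i - 1]:      # a descent starts a new ascending run
            j += 1
        word[perm[i] - 1] = j
    return tuple(word)

pi = (2, 3, 1, 4, 5)                    # permutation type of y = 21123
print(mu(pi))                           # (2, 1, 1, 2, 2), i.e., the word 21122
print(permutation_type(mu(pi)) == pi)   # True: phi_{n,L} . mu_{n,L} = id
```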
Let X = {X_1, X_2, · · · } be a finite-alphabet stationary stochastic process, where each stochastic variable X_i takes its value in A_n. By the assumed stationarity, the probability of the occurrence of any word x_{1:L} ∈ A_n^L is time-shift invariant:
Pr(X_{k:k+L−1} = x_{1:L}) = Pr(X_{1:L} = x_{1:L}) for all k, L ≥ 1.
Hence, it makes sense to define it without referring to the starting time. We denote the probability of the occurrence of a word x_{1:L} ∈ A_n^L by p(x_{1:L}) = p(x_1 · · · x_L). The probability of the occurrence of a permutation π ∈ S_L is given by p(π) = Σ_{x_{1:L} ∈ φ_{n,L}^{-1}(π)} p(x_{1:L}). For a finite-alphabet stationary stochastic process X over the alphabet A_n, we also need the quantities α_{X,L} and β_{x,X,L} for L ≥ 1 and x ∈ A_n, whose definitions involve N = ⌊L/2⌋, where ⌊a⌋ is the largest integer not greater than a; we refer to [17] for their explicit definitions.
Lemma 2 Let X be a finite-alphabet stationary stochastic process and ε a positive real number. If β_{x,X,L} < ε for all x ∈ A_n, then we have α_{X,L} < 2nε.
Proof. The claim follows from Theorem 1 (ii). See Lemma 12 in [17] for the complete proof.
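As a small numerical illustration of p(π), the following Python sketch sums word probabilities over φ_{n,L}^{-1}(π) for a toy i.i.d. process and confirms that the Shannon entropy of the induced permutation distribution never exceeds that of the word distribution, since the permutation partition is coarser than the word partition. The marginal distribution used is an arbitrary assumption.

```python
from itertools import product
from collections import defaultdict
from math import log2, prod

def permutation_type(word):
    return tuple(sorted(range(1, len(word) + 1), key=lambda i: word[i - 1]))

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Toy stationary process: i.i.d. over A_2 with p(1) = 0.3, p(2) = 0.7 (assumed).
marginal = {1: 0.3, 2: 0.7}
L = 4
word_dist = {w: prod(marginal[s] for s in w) for w in product(marginal, repeat=L)}

perm_dist = defaultdict(float)
for w, pw in word_dist.items():
    perm_dist[permutation_type(w)] += pw          # p(pi) = sum over phi^{-1}(pi)

print(entropy(word_dist), entropy(perm_dist))     # word entropy >= permutation entropy
```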

A Result on Finite-State Finite-Alphabet Hidden Markov Models
In this paper, we use the parametric description of hidden Markov models as given in [27]. A finite-state finite-alphabet hidden Markov model (in short, HMM) [27] is a quadruple (Σ, A, {T^{(a)}}_{a∈A}, µ), where Σ and A are finite sets, called the state set and the alphabet, respectively, {T^{(a)}}_{a∈A} is a family of |Σ| × |Σ| matrices indexed by elements of A, where |Σ| is the size of the state set Σ, and µ is a probability distribution on the set Σ. The following conditions must be satisfied: (i) T^{(a)}_{ss'} ≥ 0 for any s, s' ∈ Σ and a ∈ A; (ii) Σ_{a∈A} Σ_{s'∈Σ} T^{(a)}_{ss'} = 1 for any s ∈ Σ; and (iii) µ(s') = Σ_{s,a} µ(s) T^{(a)}_{ss'} for any s' ∈ Σ.
Any probability distribution satisfying condition (iii) is called a stationary distribution. The |Σ| × |Σ| matrix T := Σ_{a∈A} T^{(a)} is called the state transition matrix. The triple (Σ, T, µ) defines the underlying Markov chain. Note that condition (iii) is equivalent to condition (iii') µ(s') = Σ_s µ(s) T_{ss'} for any s' ∈ Σ. Two finite-alphabet stationary processes are induced by an HMM (Σ, A, {T^{(a)}}_{a∈A}, µ). One is solely determined by the underlying Markov chain. It is called the internal process and is denoted by S = {S_1, S_2, · · · }. The alphabet for S is Σ. The joint probability distributions that characterize S are given by
Pr(S_1 = s_1, · · · , S_L = s_L) = µ(s_1) T_{s_1 s_2} · · · T_{s_{L−1} s_L}
for any s_1, · · · , s_L ∈ Σ and L ≥ 1. The other process X = {X_1, X_2, · · · } with the alphabet A is defined by the joint probability distributions
Pr(X_1 = x_1, · · · , X_L = x_L) = Σ_{s_0, s_1, · · · , s_L ∈ Σ} µ(s_0) T^{(x_1)}_{s_0 s_1} T^{(x_2)}_{s_1 s_2} · · · T^{(x_L)}_{s_{L−1} s_L}
for any x_1, · · · , x_L ∈ A and L ≥ 1 and is called the output process. The stationarity of the probability distribution µ ensures that of both the internal and the output processes.
Symbols a ∈ A such that T^{(a)} = O, where O denotes the zero matrix, occur in the output process with probability zero. Hence, we obtain an equivalent output process even if we remove these symbols. Thus, we can assume that T^{(a)} ≠ O for any a ∈ A without loss of generality.
The internal process S of an HMM (Σ, A, {T^{(a)}}_{a∈A}, µ) is called ergodic if the state transition matrix T is irreducible [28]: for any s, s' ∈ Σ, there exists k > 0 such that (T^k)_{ss'} > 0. If the internal process S is ergodic, then the stationary distribution µ is uniquely determined by the state transition matrix T via condition (iii). Every finite-alphabet finite-order multivariate stationary ergodic Markov process can be described as an HMM with an ergodic internal process.
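The output-process word probabilities above have a convenient matrix form, p(x_1 · · · x_L) = µ^T T^{(x_1)} · · · T^{(x_L)} 1, where 1 is the all-ones vector. The following Python sketch evaluates it for a small two-state, two-symbol HMM; the particular matrices are an illustrative assumption, chosen only so that conditions (i)-(iii) hold and the state transition matrix is irreducible.

```python
import numpy as np
from itertools import product

# Emission-labelled transition matrices T^(a): T[a][s, s'] is the probability of
# moving from state s to s' while emitting symbol a (illustrative values).
T = {1: np.array([[0.4, 0.1],
                  [0.0, 0.2]]),
     2: np.array([[0.0, 0.5],
                  [0.5, 0.3]])}
Tsum = sum(T.values())                       # the state transition matrix T

# Stationary distribution mu of the underlying Markov chain (condition (iii')):
# the left Perron eigenvector of T, normalized to sum to one.
vals, vecs = np.linalg.eig(Tsum.T)
mu = np.real(vecs[:, np.argmax(np.real(vals))])
mu = mu / mu.sum()

def word_probability(word):
    """p(x_1 ... x_L) = mu^T T^(x_1) ... T^(x_L) 1."""
    v = mu.copy()
    for a in word:
        v = v @ T[a]
    return v.sum()

print(word_probability((1, 2, 1)))
print(sum(word_probability(w) for w in product(T, repeat=3)))   # ~1.0
```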

Permutation Complexity and Coupling Measures
In this section, we discuss the equalities between complexity and coupling measures and their permutation analogues for the output processes of HMMs whose internal processes are ergodic.

Fundamental Lemma
Let (X^1, · · · , X^m) be a multivariate finite-alphabet stationary stochastic process, where each univariate process X^k = {X^k_1, X^k_2, · · · }, k = 1, 2, · · · , m, is defined over an ordered alphabet A_{n_k}. Note that the notation for stochastic variables is different from that in [17]. Here, X^k_t is the stochastic variable for the k-th component of the multivariate process at time step t.
On the other hand, we have: This completes the proof of the inequality.

Excess Entropy
Let X be a finite-alphabet stationary stochastic process. Its excess entropy is defined by [21]
E(X) := lim_{L→∞} (H(X_{1:L}) − L h(X)),
if the limit on the right-hand side exists, where h(X) = lim_{L→∞} H(X_{1:L})/L is the entropy rate of X, which exists for any finite-alphabet stationary stochastic process [31].
The excess entropy has been used as a measure of complexity [32][33][34][35][36][37]. Indeed, it quantifies the global correlations present in a given stationary process in the following sense: if E(X) exists, then it can be written as the mutual information between the past and the future,
E(X) = lim_{L→∞} I(X_{1:L}; X_{L+1:2L}).
It is known that if X is the output process of an HMM, then E(X) exists [38].
When the alphabet of X is an ordered alphabet A_n, we define the permutation excess entropy of X by [17]
E*(X) := Σ_{L=1}^{∞} (H*(X_L | X_{1:L−1}) − h*(X)),
if the limit on the right-hand side exists, where h*(X) = lim_{L→∞} H*(X_{1:L})/L is the permutation entropy rate of X, which exists for any finite-alphabet stationary stochastic process and is equal to the entropy rate h(X) [2,15,16], H*(X_L | X_{1:L−1}) := H*(X_{1:L}) − H*(X_{1:L−1}), and H*(X_{1:L}) and H*(X_{1:L−1}) are as defined in the statement of Lemma 4.
The following proposition is a generalization of our previous results in [17,18].
Proposition 5 Let X be the output process of an HMM (Σ, A_n, {T^{(a)}}_{a∈A_n}, µ) with an ergodic internal process. Then, E*(X) exists and E*(X) = E(X).
This can be shown similarly to the above discussion, by applying Lemma 4 to the bivariate process (X^1, X^2) := (X, X) and, then, using Lemma 2 and Lemma 3.
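To make the finite-L behaviour behind these definitions concrete, the sketch below computes the approximant H(X_{1:L}) − L(H(X_{1:L}) − H(X_{1:L−1})) for a two-state Markov chain observed directly (the simplest HMM); the transition matrix is an assumption made only for illustration. For a first-order Markov chain this approximant already equals E(X) = H(X_1) − h(X) for every L ≥ 2; the permutation analogue would be obtained by replacing the block entropies with the permutation entropies H* built from p(π).

```python
import numpy as np
from itertools import product
from math import log2

# Two-state ergodic Markov chain observed directly (an assumed toy example).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])                  # transition probabilities
stat = np.array([0.75, 0.25])               # its stationary distribution

def block_entropy(L):
    """H(X_{1:L}) computed exhaustively from the Markov chain."""
    H = 0.0
    for w in product((0, 1), repeat=L):
        p = stat[w[0]]
        for i in range(1, L):
            p *= P[w[i - 1], w[i]]
        if p > 0:
            H -= p * log2(p)
    return H

for L in range(2, 7):
    h_L = block_entropy(L) - block_entropy(L - 1)
    print(L, block_entropy(L) - L * h_L)    # constant: E(X) = H(X_1) - h(X)
```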

Transfer Entropy and Momentary Information Transfer
In this subsection, we consider two information rates that are measures of coupling direction and strength between two jointly distributed processes and discuss the equalities between them and their permutation analogues. One is the rate of the transfer entropy [22], and the other is the rate of the momentary information transfer [24]. Both are particular instances of the conditional mutual information [39].
Let (X, Y) be a bivariate finite-alphabet stationary stochastic process. We assume that the alphabets of X and Y are ordered alphabets A_n and A_m, respectively. For τ = 1, 2, · · · , we define the τ-step transfer entropy rate from Y to X by
t_τ(Y → X) := lim_{L→∞} (1/τ) I(X_{L+1:L+τ}; Y_{1:L} | X_{1:L}).
When τ = 1, t_1(Y → X) is called simply the transfer entropy rate [40] from Y to X and is denoted by t(Y → X).
If we introduce the τ-step entropy rate of X and the τ-step conditional entropy rate of X given Y by
h_τ(X) := lim_{L→∞} (1/τ) H(X_{L+1:L+τ} | X_{1:L}) and h_τ(X|Y) := lim_{L→∞} (1/τ) H(X_{L+1:L+τ} | X_{1:L}, Y_{1:L}),
then we can write
t_τ(Y → X) = h_τ(X) − h_τ(X|Y),
because both h_τ(X) and h_τ(X|Y) exist. We call h_1(X|Y) the conditional entropy rate and denote it by h(X|Y). Note that the conditional entropy rate here is slightly different from that found in the literature. For example, in [41], the conditional entropy rate (called the conditional uncertainty) is defined by lim_{L→∞} H(X_{L+1} | X_{1:L}, Y_{1:L+1}). The difference from the conditional entropy rate defined here is in whether the conditioning on Y_{L+1} is involved or not.
h_τ(X) is additive, namely, we always have
(τ_1 + τ_2) h_{τ_1+τ_2}(X) = τ_1 h_{τ_1}(X) + τ_2 h_{τ_2}(X).
However, for the τ-step conditional entropy rate, additivity cannot hold in general. It is at most super-additive: we only have the inequality
(τ_1 + τ_2) h_{τ_1+τ_2}(X|Y) ≥ τ_1 h_{τ_1}(X|Y) + τ_2 h_{τ_2}(X|Y)
in general. Indeed, we have
H(X_{L+1:L+τ_1+τ_2} | X_{1:L}, Y_{1:L}) = H(X_{L+1:L+τ_1} | X_{1:L}, Y_{1:L}) + H(X_{L+τ_1+1:L+τ_1+τ_2} | X_{1:L+τ_1}, Y_{1:L}) ≥ H(X_{L+1:L+τ_1} | X_{1:L}, Y_{1:L}) + H(X_{L+τ_1+1:L+τ_1+τ_2} | X_{1:L+τ_1}, Y_{1:L+τ_1}),
and taking the limit L → ∞ on both sides yields the inequality above. This leads to the sub-additivity of the τ-step transfer entropy rate:
(τ_1 + τ_2) t_{τ_1+τ_2}(Y → X) ≤ τ_1 t_{τ_1}(Y → X) + τ_2 t_{τ_2}(Y → X).
An example with strict inequality can easily be given: let Y be an independent and identically distributed (i.i.d.) process with positive entropy rate and let X be defined by X_t := Y_{t−1}.
There are two permutation analogues of the transfer entropy. One is called the symbolic transfer entropy (STE) [42], and the other is called the transfer entropy on rank vectors (TERV) [43]. Here, we introduce their rates, the rate of STE from Y to X and the rate of TERV from Y to X, by replacing the Shannon block entropies in the finite-L expressions defining t_τ(Y → X) with the corresponding permutation entropies H*, provided the limits on the right-hand sides exist; we refer to [42,43] (and [19] for the one-step case) for the precise expressions and denote the rate of TERV by t*_τ(Y → X). If E*(X) exists, then, by the definition of the permutation excess entropy, H*(X_{1:L+1}) − H*(X_{1:L}) converges to h*(X) as L → ∞. In this case, t*_1(Y → X) coincides with a quantity called the symbolic transfer entropy rate, introduced in [19].
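As a practical illustration of these permutation analogues, the following Python sketch estimates a finite-sample symbolic transfer entropy by mapping sliding windows of each series to ordinal patterns and computing the empirical conditional mutual information. The coupled toy series and the pattern length m = 3 are assumptions chosen only for illustration, and the estimator is a plug-in sketch, not the rate defined above.

```python
import random
from collections import Counter
from math import log2

def ordinal(window):
    """Ordinal pattern of a window (ties broken by position)."""
    return tuple(sorted(range(len(window)), key=lambda i: window[i]))

def symbolic_transfer_entropy(x, y, m=3):
    """Plug-in estimate of I(next pattern of x ; past pattern of y | past pattern of x)."""
    joint = Counter()
    for t in range(len(x) - m):
        xf = ordinal(x[t + 1:t + m + 1])     # pattern of the shifted x window
        xp = ordinal(x[t:t + m])             # past pattern of x
        yp = ordinal(y[t:t + m])             # past pattern of y
        joint[(xf, xp, yp)] += 1
    n = sum(joint.values())
    p_xp, p_fp, p_py = Counter(), Counter(), Counter()
    for (xf, xp, yp), c in joint.items():
        p_xp[xp] += c
        p_fp[(xf, xp)] += c
        p_py[(xp, yp)] += c
    return sum((c / n) * log2(c * p_xp[xp] / (p_fp[(xf, xp)] * p_py[(xp, yp)]))
               for (xf, xp, yp), c in joint.items())

random.seed(0)
y = [random.randint(1, 3) for _ in range(5000)]
x = [1] + [y[t - 1] for t in range(1, 5000)]      # x copies y with a one-step delay
print(symbolic_transfer_entropy(x, y, m=3))       # clearly positive
print(symbolic_transfer_entropy(y, x, m=3))       # much smaller
```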
Proposition 6 Let (X, Y) be the output process of an HMM (Σ, A_n × A_m, {T^{(a,b)}}_{(a,b)∈A_n×A_m}, µ) with an ergodic internal process. Then, the rate of STE and the rate of TERV from Y to X both exist and are equal to t_τ(Y → X).
Proof. Since both X and Y are the output processes of appropriate HMMs with ergodic internal processes, the equalities follow from a discussion similar to that in the proof of Proposition 5. Indeed, for example, X is the output process of the HMM (Σ, A_n, {T^{(a)}}_{a∈A_n}, µ), where T^{(a)} := Σ_{b∈A_m} T^{(a,b)}.
A different instance of conditional mutual information, called momentary information transfer, is considered in [24]. It was proposed to improve the ability to detect coupling delays, which is lacking in the transfer entropy. Here, we consider its rate, the momentary information transfer rate, and its permutation analogue, called the momentary sorting information transfer rate (we refer to [24] for the original definition of the momentary information transfer). By a discussion similar to that in the proof of Proposition 6, we obtain the following equality: if (X, Y) is the output process of an HMM (Σ, A_n × A_m, {T^{(a,b)}}_{(a,b)∈A_n×A_m}, µ) with an ergodic internal process, then the momentary sorting information transfer rate coincides with the momentary information transfer rate.
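The marginalization step used in this proof is easy to check numerically. The sketch below builds the X-marginal matrices T^{(a)} = Σ_b T^{(a,b)} of a small bivariate HMM (the matrices are an illustrative assumption) and verifies that summing the joint word probabilities over the Y-words reproduces the marginal HMM's word probabilities.

```python
import numpy as np
from itertools import product

# Bivariate emission matrices T[(a, b)] (illustrative values; the summed matrix
# is stochastic and the underlying chain is irreducible).
T = {(1, 1): np.array([[0.2, 0.1], [0.1, 0.0]]),
     (1, 2): np.array([[0.1, 0.2], [0.2, 0.1]]),
     (2, 1): np.array([[0.1, 0.0], [0.3, 0.1]]),
     (2, 2): np.array([[0.2, 0.1], [0.1, 0.1]])}
Tx = {a: sum(T[(a, b)] for b in (1, 2)) for a in (1, 2)}   # X-marginal HMM

Tsum = sum(T.values())
vals, vecs = np.linalg.eig(Tsum.T)
mu = np.real(vecs[:, np.argmax(np.real(vals))])
mu = mu / mu.sum()

def prob(word, mats):
    """mu^T M^(w_1) ... M^(w_L) 1 for a family of matrices 'mats'."""
    v = mu.copy()
    for s in word:
        v = v @ mats[s]
    return v.sum()

x = (1, 2, 2)
marginal_by_summation = sum(prob(tuple(zip(x, yw)), T)
                            for yw in product((1, 2), repeat=len(x)))
print(marginal_by_summation, prob(x, Tx))     # the two values agree
```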

Directed Information
Directed information is a measure of coupling direction and strength based on the idea of causal conditioning [26,44]. Since it is not a particular instance of conditional mutual information, we treat it separately here. In the following presentation, we make use of terminology from [40,45].
Let (X, Y) be a bivariate finite-alphabet stationary stochastic process. The alphabets of X and Y are ordered alphabets A_n and A_m, respectively. The directed information rate from Y to X is defined by
I_∞(Y → X) := lim_{L→∞} (1/L) I(Y_{1:L} → X_{1:L}),
where
I(Y_{1:L} → X_{1:L}) := Σ_{i=1}^{L} I(X_i; Y_{1:i} | X_{1:i−1}).
Note that if Y_{1:i} in the expression on the right-hand side is replaced by Y_{1:L}, then we obtain the mutual information between X_{1:L} and Y_{1:L}:
I(X_{1:L}; Y_{1:L}) = Σ_{i=1}^{L} I(X_i; Y_{1:L} | X_{1:i−1}).

Thus, conditioning on Y_{1:i} for i = 1, · · · , L, not on Y_{1:L}, distinguishes the directed information from the mutual information. Following [44], we write
H(X_{1:L} ‖ Y_{1:L}) := Σ_{i=1}^{L} H(X_i | X_{1:i−1}, Y_{1:i})
and call this quantity the causal conditional entropy. By using this notation, we have
I(Y_{1:L} → X_{1:L}) = H(X_{1:L}) − H(X_{1:L} ‖ Y_{1:L}).
The permutation analogue of the directed information rate, which we call the symbolic directed information rate and denote by I*_∞(Y → X), is defined by replacing each Shannon block entropy in the above expressions with the corresponding permutation entropy H*, if the limit on the right-hand side exists; in particular, we write H*(X_{1:L} ‖ Y_{1:L}) for the resulting permutation analogue of the causal conditional entropy. This yields the following proposition: let (X, Y) be the output process of an HMM (Σ, A_n × A_m, {T^{(a,b)}}_{(a,b)∈A_n×A_m}, µ) with an ergodic internal process; then, I*_∞(Y → X) exists and I*_∞(Y → X) = I_∞(Y → X).
Proof. We have:

We know that the first term on the right-hand side in the above inequality goes to zero as L → ∞. Let us evaluate the second sum. By Lemma 4, it holds that: By Lemma 2 and Lemma 3, we have: where C := max_{x∈A_n} {C_x} and γ := max_{x∈A_n} {γ_x} < 1. It is elementary to show that lim_{L→∞} Σ_{i=1}^{L} γ^i log(i + n) is finite. The limits of the other terms are also shown to be finite similarly. Thus, we can conclude that the limit of the second sum is bounded. Similarly, the limit of the third sum is also bounded. The equality in the claim follows immediately.
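To make the causal-conditioning idea concrete, the following Python sketch computes the finite-L directed information I(Y_{1:L} → X_{1:L}) = Σ_i I(X_i; Y_{1:i} | X_{1:i−1}) exactly from an explicit joint distribution of a short block. The joint distribution (X copies Y with a one-step delay) is an assumption made only for illustration; the same routine with the roles of the two series exchanged shows the asymmetry of the directed information.

```python
from itertools import product
from collections import defaultdict
from math import log2

L = 3
# Joint distribution of (X_{1:3}, Y_{1:3}): Y is i.i.d. fair on {1, 2},
# X_1 is an independent fair coin and X_i = Y_{i-1} for i >= 2 (assumed toy model).
joint = defaultdict(float)
for y in product((1, 2), repeat=L):
    for x1 in (1, 2):
        x = (x1,) + y[:L - 1]
        joint[(x, y)] += (0.5 ** L) * 0.5

def cond_mi(triples):
    """I(A; B | C) from a list of (a, b, c, probability) tuples."""
    pabc, pbc, pac, pc = defaultdict(float), defaultdict(float), defaultdict(float), defaultdict(float)
    for a, b, c, p in triples:
        pabc[(a, b, c)] += p; pbc[(b, c)] += p; pac[(a, c)] += p; pc[c] += p
    return sum(p * log2(p * pc[c] / (pac[(a, c)] * pbc[(b, c)]))
               for (a, b, c), p in pabc.items() if p > 0)

def directed_information(dist, L):
    """I(Y_{1:L} -> X_{1:L}) = sum_i I(X_i; Y_{1:i} | X_{1:i-1})."""
    total = 0.0
    for i in range(1, L + 1):
        triples = [(x[i - 1], y[:i], x[:i - 1], p) for (x, y), p in dist.items()]
        total += cond_mi(triples)
    return total

print(directed_information(joint, L))     # 2.0 bits: Y drives X
swapped = defaultdict(float)
for (x, y), p in joint.items():
    swapped[(y, x)] += p
print(directed_information(swapped, L))   # 0.0 bits: no flow from X to Y
```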
For output processes of HMMs with ergodic internal processes, properties of the directed information rate can be transferred to the symbolic directed information rate. Since their proofs can be given in the same manner as those of the above propositions, we list some of them here without proofs. For the proofs of the properties of the directed information rate, we refer to [44,45].
Let (X, Y) be the output process of an HMM (Σ, A_n × A_m, {T^{(a,b)}}_{(a,b)∈A_n×A_m}, µ) with an ergodic internal process. Then, we have the following properties, where the symbol D denotes the one-step delay and I*_∞(DY → X) denotes the corresponding permutation analogue of I_∞(DY → X).
(ii) I_∞(DY → X) = I*_∞(DY → X) = lim_{L→∞} I*(X_L; Y_{1:L−1} | X_{1:L−1}). The second equality is the permutation analogue of the equality I_∞(DY → X) = lim_{L→∞} I(X_L; Y_{1:L−1} | X_{1:L−1}). Since I_∞(DY → X) coincides with the transfer entropy rate, the first equality is just the equality between the transfer entropy rate and the symbolic transfer entropy rate (or the rate of one-step TERV) proven in Proposition 6, given the second equality;
(iii) I_∞(Y → X ‖ DY) = I*_∞(Y → X ‖ DY), where I_∞(Y → X ‖ DY) is called the instantaneous information exchange rate and is defined by I_∞(Y → X ‖ DY) := lim_{L→∞} (1/L) I(Y_{1:L} → X_{1:L} ‖ DY_{1:L}), with I(Y_{1:L} → X_{1:L} ‖ DY_{1:L}) := Σ_{i=1}^{L} I(X_i; Y_{1:i} | X_{1:i−1}, Y_{1:i−1}) = Σ_{i=1}^{L} I(X_i; Y_i | X_{1:i−1}, Y_{1:i−1}). From the last expression of I(Y_{1:L} → X_{1:L} ‖ DY_{1:L}), we obtain its permutation analogue I*_∞(Y → X ‖ DY), called the symbolic instantaneous information exchange rate;
(iv) I*_∞(Y → X) = t*_1(Y → X) + I*_∞(Y → X ‖ DY); namely, the symbolic directed information rate decomposes into the sum of the symbolic transfer entropy rate and the symbolic instantaneous information exchange rate. This follows immediately from (ii), (iii) and the equality stating that the directed information rate decomposes into the sum of the transfer entropy rate and the instantaneous information exchange rate, I_∞(Y → X) = t(Y → X) + I_∞(Y → X ‖ DY);
(v) I*_∞(X; Y) = I*_∞(Y → X) + t*_1(X → Y). This is the permutation analogue of the equality stating that the mutual information rate between X and Y is the sum of the directed information rate from Y to X and the transfer entropy rate from X to Y, I_∞(X; Y) = I_∞(Y → X) + t(X → Y), where I_∞(X; Y) := lim_{L→∞} (1/L) I(X_{1:L}; Y_{1:L}) is the mutual information rate and I*_∞(X; Y) is its permutation analogue, called the symbolic mutual information rate. It is known that they are equal for any bivariate finite-alphabet stationary stochastic process [19]. Thus, the symbolic mutual information rate between X and Y is the sum of the symbolic directed information rate from Y to X and the symbolic transfer entropy rate from X to Y.
The second equality is the permutation analogue of the equality I_∞(DY → X ‖ Z^1, · · · , Z^k) = lim_{L→∞} I(X_L; Y_{1:L−1} | X_{1:L−1}, Z^1_{1:L}, · · · , Z^k_{1:L}). The quantities I_∞(DY → X ‖ Z^1, · · · , Z^k) and I*_∞(DY → X ‖ Z^1, · · · , Z^k) are called the causal conditional transfer entropy rate and the symbolic causal conditional transfer entropy rate, respectively. Real-world time series data are often multivariate. However, it seems that univariate analysis is still the mainstream in the field of ordinal pattern analysis (see, for example, the papers in [47]). We hope that this work stimulates the multivariate analysis of real-world time series data.