Article

Permutation Complexity and Coupling Measures in Hidden Markov Models

1 Department of Earth and Planetary Sciences, Graduate School of Science, Kobe University, Rokkodaicho, Nada, Kobe 657-8501, Japan
2 Department of Mechanical and Process Engineering, ETH Zurich, Leonhardstrasse 27, Zurich 8092, Switzerland
* Author to whom correspondence should be addressed.
Entropy 2013, 15(9), 3910-3930; https://doi.org/10.3390/e15093910
Submission received: 24 July 2013 / Revised: 29 August 2013 / Accepted: 11 September 2013 / Published: 16 September 2013

Abstract

Recently, the duality between values (words) and orderings (permutations) has been proposed by the authors as a basis to discuss the relationship between information theoretic measures for finite-alphabet stationary stochastic processes and their permutation analogues. It has been used to give a simple proof of the equality between the entropy rate and the permutation entropy rate for any finite-alphabet stationary stochastic process and to show some results on the excess entropy and the transfer entropy for finite-alphabet stationary ergodic Markov processes. In this paper, we extend our previous results to hidden Markov models and show the equalities between various information theoretic complexity and coupling measures and their permutation analogues. In particular, we show the following two results within the realm of hidden Markov models with ergodic internal processes: the two permutation analogues of the transfer entropy, the symbolic transfer entropy and the transfer entropy on rank vectors, are both equivalent to the transfer entropy when they are considered as rates, and directed information theory can be captured by the permutation entropy approach.
Classification: PACS:
02.50.Ey; 02.50.Ga
Classification: MSC:
94A17; 60G10; 60J10

1. Introduction

Recently, the permutation-information theoretic approach to time series analysis proposed by Bandt and Pompe [1] has become popular in various fields [2]. The permutation method has been shown to be easy to implement compared with other traditional methods and to be robust in the presence of noise [3,4,5,6,7]. On the theoretical side, however, few results are known for the permutation analogues of information theoretic measures other than the entropy rate.
There are two approaches to introducing permutations into dynamical systems theory [8]. The first approach was introduced by Bandt et al. [9]. Given a one-dimensional interval map, they considered permutations induced by iterations of the map. Each point in the interval is classified into one of n! permutations according to the permutation defined by n − 1 iterations of the map starting from that point. The Shannon entropy of this partition (called the standard partition) of the interval is then taken and normalized by n. The quantity obtained in the limit n → ∞ is called the permutation entropy if it exists. It was proven that the permutation entropy is equal to the Kolmogorov-Sinai entropy for any piecewise monotone interval map [9]. This approach based on standard partitions was extended by Keller et al. [10,11,12,13,14].
The second approach was taken by Amigó et al. [2,15,16]. In this approach, given a measure-preserving map on a probability space, an arbitrary finite partition of the space is first taken. This gives rise to a finite-alphabet stationary stochastic process. An arbitrary ordering is introduced on the alphabet, and the permutations of words of finite length can be naturally defined (see Section 2 below). It is proven that the Shannon entropy of the occurrence of the permutations of a fixed length, normalized by the length, converges as the length of the permutations tends to infinity. The quantity obtained is called the permutation entropy rate (also called metric permutation entropy) and is shown to be equal to the entropy rate of the process. By taking the limit of finer partitions of the measurable space, the permutation entropy rate of the measure-preserving map is defined if the limit exists. Amigó [16] proved that it exists and is equal to the Kolmogorov-Sinai entropy.
In this paper, we restrict our attention to finite-alphabet stationary stochastic processes. Thus, we follow the second approach, namely, ordering on the alphabet is introduced arbitrarily. For quantities other than the entropy rate, three results for finite-alphabet stationary ergodic Markov processes have been shown by our previous work: the equality between the excess entropy and the permutation excess entropy [17], the equality between the mutual information expression of the excess entropy and its permutation analogue [18] and the equality between the transfer entropy rate and the symbolic transfer entropy rate [19]. Whether these equalities for the permutation entropies can be extended to general finite-alphabet stationary ergodic stochastic processes is still unknown. However, for the modified permutation entropies defined by the partition of the set of words based on permutations and equalities between occurrences of symbols, which is finer than the partition obtained by permutations only, we have the corresponding equalities for general finite-alphabet stationary ergodic stochastic processes [20].
The purpose of this paper is to generalize our previous results on the permutation entropies for finite-alphabet stationary ergodic Markov processes to output processes of finite-state finite-alphabet hidden Markov models with ergodic internal processes. Upon this generalization, the somewhat ad hoc proofs in our previous work for multivariate stationary ergodic Markov processes become straightforward. The key property of hidden Markov models (HMMs), which we will use repeatedly, is the following: a marginal process of the output process of a hidden Markov model with an ergodic internal process is, again, the output process of a hidden Markov model with an ergodic internal process obtained from the original hidden Markov model. In general, this property does not hold for multivariate stationary ergodic Markov processes. The generalization also gives us easy access to quantities that have not yet been considered theoretically in the permutation approach. In this paper, we shall treat the following quantities: excess entropy [21], transfer entropy [22,23], momentary information transfer [24] and directed information [25,26]. As far as the authors are aware, the equality between the momentary information transfer and its permutation analogue and that for the directed information have not been discussed anywhere. The equalities could be proven directly, with some discussion in addition to that in [17], for finite-alphabet multivariate stationary ergodic Markov processes. However, the equalities can be proven straightforwardly, as in [17], within the realm of HMMs with ergodic internal processes, once we show Lemma 3 below.
This paper is organized as follows: In Section 2, we briefly review our previous result on the duality between words and permutations to make this paper as self-contained as possible. In Section 3, we prove a lemma about finite-state finite-alphabet hidden Markov models. In Section 4, we show equalities between various information theoretic complexity and coupling measures and their permutation analogues that hold for output processes of finite-state finite-alphabet hidden Markov models with ergodic internal processes. In Section 5, we discuss how our results are related to the previous work in the literature.

2. The Duality between Words and Permutations

In this section, we summarize the results from our previous work [17] that will be used in this paper.
Let $A_n$ be a finite set consisting of the natural numbers from one to $n$, called an alphabet. In this paper, $A_n$ is considered as a totally ordered set ordered by the usual “less-than-or-equal-to” relationship. When we emphasize the total order, we call $A_n$ an ordered alphabet.
Note that the results in this paper hold for every total order on $A_n$. This is because the probability of the occurrence of the permutations in a given stationary stochastic process over $A_n$ with an arbitrary total order is just a re-indexing of that with the “less-than-or-equal-to” total order.
The set of all permutations of length $L \geq 1$ is denoted by $S_L$. Namely, $S_L$ is the set of all bijections $\pi$ on the set $\{1, 2, \ldots, L\}$. For convenience, we sometimes denote a permutation $\pi$ of length $L$ by the string $\pi(1)\pi(2)\cdots\pi(L)$. The number of descents, namely places with $\pi(i) > \pi(i+1)$, of $\pi \in S_L$ is denoted by $\mathrm{Desc}(\pi)$. For example, if $\pi \in S_5$ is given by $\pi(1)\pi(2)\pi(3)\pi(4)\pi(5) = 35142$, then $\mathrm{Desc}(\pi) = 2$.
Let $A_n^L = A_n \times \cdots \times A_n$ be the $L$-fold product of $A_n$. A word of length $L \geq 1$ is an element of $A_n^L$. It is denoted by $x_{1:L} := x_1 \cdots x_L := (x_1, \ldots, x_L) \in A_n^L$. We say that the permutation type of a word $x_{1:L}$ is $\pi \in S_L$ if we have $x_{\pi(i)} \leq x_{\pi(i+1)}$, and $\pi(i) < \pi(i+1)$ when $x_{\pi(i)} = x_{\pi(i+1)}$, for $i = 1, 2, \ldots, L-1$. Namely, the permutation type of $x_{1:L}$ is the permutation of indices defined by re-ordering the symbols $x_1, \ldots, x_L$ in increasing order. For example, the permutation type of $x_{1:5} = 31212 \in A_3^5$ is $\pi(1)\pi(2)\pi(3)\pi(4)\pi(5) = 24351$, because $x_2 x_4 x_3 x_5 x_1 = 11223$.
Let $\phi_{n,L}: A_n^L \to S_L$ be the map sending each word $x_{1:L}$ to its permutation type $\pi = \phi_{n,L}(x_{1:L})$. For example, the map $\phi_{2,3}: A_2^3 \to S_3$ is given by $\phi_{2,3}(111) = \phi_{2,3}(112) = \phi_{2,3}(122) = \phi_{2,3}(222) = 123$, $\phi_{2,3}(121) = 132$, $\phi_{2,3}(212) = 213$, $\phi_{2,3}(211) = 231$ and $\phi_{2,3}(221) = 312$. This example illustrates the following two properties of the map $\phi_{n,L}$: first, $\phi_{n,L}(A_n^L)$ can be a proper subset of $S_L$. As one can see from Theorem 1 below, $\phi_{n,L}(A_n^L)$ is a proper subset of $S_L$ if and only if $L > n$. Second, two different words can have the same permutation type.
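As a concrete illustration (a sketch of our own, not part of the original paper), the permutation type of a word can be computed by a stable sort of its indices. The following Python snippet reproduces the $\phi_{2,3}$ example and the word $31212$ above; the function name permutation_type is our own choice.

```python
# A minimal sketch (not from the paper): the permutation type of a word is the
# stable sort order of its indices, i.e., indices sorted by (value, index).
from itertools import product

def permutation_type(word):
    """Return the permutation type of `word` as a tuple of 1-based indices."""
    L = len(word)
    # Ties are broken by index, matching the convention pi(i) < pi(i+1)
    # when x_{pi(i)} = x_{pi(i+1)}.
    return tuple(sorted(range(1, L + 1), key=lambda i: (word[i - 1], i)))

# Reproduce the map phi_{2,3}: A_2^3 -> S_3 listed above.
for w in product((1, 2), repeat=3):
    print(w, "->", permutation_type(w))

# The example from the text: phi_{3,5}(31212) = 24351.
assert permutation_type((3, 1, 2, 1, 2)) == (2, 4, 3, 5, 1)
```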
We define another map, $\mu_{n,L}: \phi_{n,L}(A_n^L) \subset S_L \to A_n^L$, by the following procedure:
(i)
Given a permutation $\pi \in \phi_{n,L}(A_n^L) \subset S_L$, we decompose the sequence $\pi(1)\cdots\pi(L)$ of length $L$ into maximal ascending subsequences. A subsequence $i_j i_{j+1} \cdots i_{j+k}$ of a sequence $i_1 \cdots i_L$ of length $L$ is called a maximal ascending subsequence if it is ascending, namely, $i_j \leq i_{j+1} \leq \cdots \leq i_{j+k}$, and neither $i_{j-1} i_j \cdots i_{j+k}$ nor $i_j i_{j+1} \cdots i_{j+k+1}$ is ascending;
(ii)
If $\pi(1)\cdots\pi(i_1),\ \pi(i_1+1)\cdots\pi(i_2),\ \ldots,\ \pi(i_{k-1}+1)\cdots\pi(L)$ is the decomposition of $\pi(1)\cdots\pi(L)$ into maximal ascending subsequences, then a word $x_{1:L} \in A_n^L$ is defined by:
$x_{\pi(1)} = \cdots = x_{\pi(i_1)} = 1,\quad x_{\pi(i_1+1)} = \cdots = x_{\pi(i_2)} = 2,\quad \ldots,\quad x_{\pi(i_{k-1}+1)} = \cdots = x_{\pi(L)} = k.$
We define $\mu_{n,L}(\pi) = x_{1:L}$. Note that $\mathrm{Desc}(\pi) \leq n - 1$, because $\pi$ is the permutation type of some word $y_{1:L} \in A_n^L$. Thus, we have $k = \mathrm{Desc}(\pi) + 1 \leq n$. Hence, $\mu_{n,L}$ is well-defined as a map from $\phi_{n,L}(A_n^L)$ to $A_n^L$.
By construction, we have $\phi_{n,L}(\mu_{n,L}(\pi)) = \pi$ for all $\pi \in \phi_{n,L}(A_n^L)$. To illustrate the construction of $\mu_{n,L}$, let us consider the word $y_{1:5} = 21123 \in A_3^5$. The permutation type of $y_{1:5}$ is $\pi(1)\pi(2)\pi(3)\pi(4)\pi(5) = 23145$. The decomposition of $23145$ into maximal ascending subsequences is $23, 145$. We obtain $\mu_{n,L}(\pi) = x_1 x_2 x_3 x_4 x_5 = 21122$ by putting $x_2 x_3 x_1 x_4 x_5 = 11222$.
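The map $\mu_{n,L}$ admits an equally short sketch (again ours, not from the paper): split $\pi(1)\cdots\pi(L)$ at its descents and label the resulting maximal ascending subsequences $1, 2, \ldots, k$. The snippet below reproduces the $21123$ example and checks $\phi_{n,L} \circ \mu_{n,L} = \mathrm{id}$ on it.

```python
# A minimal sketch (not from the paper) of mu_{n,L}: decompose pi(1)...pi(L) into
# maximal ascending subsequences and assign the symbols 1, 2, ..., k in order.

def permutation_type(word):
    L = len(word)
    return tuple(sorted(range(1, L + 1), key=lambda i: (word[i - 1], i)))

def mu(pi):
    """Given a permutation pi (tuple of 1-based indices), return the word mu(pi)."""
    word = [0] * len(pi)
    symbol = 1
    word[pi[0] - 1] = symbol
    for a, b in zip(pi, pi[1:]):
        if b < a:            # a descent starts a new maximal ascending subsequence
            symbol += 1
        word[b - 1] = symbol
    return tuple(word)

# Example from the text: the permutation type of 21123 is 23145, whose maximal
# ascending decomposition is 23 | 145, so mu(23145) = 21122.
pi = (2, 3, 1, 4, 5)
assert mu(pi) == (2, 1, 1, 2, 2)
assert permutation_type(mu(pi)) == pi    # phi_{n,L}(mu_{n,L}(pi)) = pi
```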
Theorem 1 
(i) 
For every $\pi \in S_L$,
$|\phi_{n,L}^{-1}(\pi)| = \binom{L + n - \mathrm{Desc}(\pi) - 1}{L},$
where $\binom{a}{b} = 0$ if $a < b$. In particular, $\phi_{n,L}^{-1}(\pi) = \emptyset$, if and only if $\mathrm{Desc}(\pi) \geq n$;
(ii) 
Let us put:
$B_{n,L} := \{ x_{1:L} \in A_n^L \mid \phi_{n,L}^{-1}(\pi) = \{x_{1:L}\} \text{ for some } \pi \in S_L \}, \quad C_{n,L} := \{ \pi \in S_L \mid |\phi_{n,L}^{-1}(\pi)| = 1 \}.$
Then, $\phi_{n,L}$ restricted on $B_{n,L}$ is a map into $C_{n,L}$ and $\mu_{n,L}$ restricted on $C_{n,L}$ is a map into $B_{n,L}$. They form a pair of mutually inverse maps. Furthermore, we have:
$B_{n,L} = \{ x_{1:L} \in A_n^L \mid \forall\, 1 \leq i \leq n-1,\ \exists\, 1 \leq j < k \leq L \text{ s.t. } x_j = i+1,\ x_k = i \}, \quad C_{n,L} = \{ \pi \in S_L \mid \mathrm{Desc}(\pi) = n-1 \}.$
Proof. 
The theorem is a recasting of statements in Lemma 5 and Theorem 9 in [17].
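For small $n$ and $L$, the counting formula of Theorem 1 (i) can be verified by brute force. The following sketch (ours, not from the paper) enumerates all words in $A_n^L$ and compares the preimage sizes with the binomial expression.

```python
# A minimal sketch (not from the paper): brute-force check of Theorem 1 (i),
# |phi_{n,L}^{-1}(pi)| = C(L + n - Desc(pi) - 1, L), for small n and L.
from itertools import product, permutations
from math import comb
from collections import Counter

def permutation_type(word):
    L = len(word)
    return tuple(sorted(range(1, L + 1), key=lambda i: (word[i - 1], i)))

def descents(pi):
    return sum(1 for a, b in zip(pi, pi[1:]) if a > b)

n, L = 3, 4
counts = Counter(permutation_type(w) for w in product(range(1, n + 1), repeat=L))
for pi in permutations(range(1, L + 1)):
    # Counter returns 0 for permutations with empty preimage (Desc(pi) >= n).
    assert counts[pi] == comb(L + n - descents(pi) - 1, L)
```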
Let $\mathbf{X} = \{X_1, X_2, \ldots\}$ be a finite-alphabet stationary stochastic process, where each stochastic variable $X_i$ takes its value in $A_n$. By the assumed stationarity, the probability of the occurrence of any word $x_{1:L} \in A_n^L$ is time-shift invariant:
$\Pr\{X_1 = x_1, \ldots, X_L = x_L\} = \Pr\{X_{k+1} = x_1, \ldots, X_{k+L} = x_L\}$
for all $k, L \geq 1$. Hence, it makes sense to define it without referring to the time to start. We denote the probability of the occurrence of a word $x_{1:L} \in A_n^L$ by $p(x_{1:L}) = p(x_1 \cdots x_L)$. The probability of the occurrence of a permutation $\pi \in S_L$ is given by $p(\pi) = \sum_{x_{1:L} \in \phi_{n,L}^{-1}(\pi)} p(x_{1:L})$.
For a finite-alphabet stationary stochastic process $\mathbf{X}$ over the alphabet $A_n$, we define:
$\alpha_{\mathbf{X},L} := \sum_{\pi \in S_L,\ |\phi_{n,L}^{-1}(\pi)| > 1} p(\pi) = \sum_{\pi \notin C_{n,L}} p(\pi)$
and:
$\beta_{x,\mathbf{X},L} := \Pr\{ x_{1:N} \in A_n^N \mid x_j \neq x \text{ for all } 1 \leq j \leq N \} = \sum_{x_j \neq x,\ 1 \leq j \leq N} p(x_1 \cdots x_N),$
where $L \geq 1$, $x \in A_n$, $N = \lfloor L/2 \rfloor$, and $\lfloor a \rfloor$ is the largest integer not greater than $a$.
Lemma 2 
Let $\mathbf{X}$ be a finite-alphabet stationary stochastic process and $\epsilon$ a positive real number. If $\beta_{x,\mathbf{X},L} < \epsilon$ for all $x \in A_n$, then we have $\alpha_{\mathbf{X},L} < 2n\epsilon$.
Proof. 
The claim follows from Theorem 1 (ii). See Lemma 12 in [17] for the complete proof.
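To make $\alpha_{\mathbf{X},L}$, $\beta_{x,\mathbf{X},L}$ and the bound of Lemma 2 concrete, here is a small numerical sketch of our own (not from the paper) for a uniform i.i.d. binary process, where both quantities can be computed exactly.

```python
# A minimal sketch (not from the paper): alpha_{X,L} and beta_{x,X,L} for a
# uniform i.i.d. binary process, together with the bound of Lemma 2.
from itertools import product
from collections import Counter

def permutation_type(word):
    L = len(word)
    return tuple(sorted(range(1, L + 1), key=lambda i: (word[i - 1], i)))

n, L = 2, 6
N = L // 2
p_word = 1.0 / n ** L                       # i.i.d. uniform word probability

perm_prob, preimage_size = Counter(), Counter()
for w in product(range(1, n + 1), repeat=L):
    pi = permutation_type(w)
    perm_prob[pi] += p_word
    preimage_size[pi] += 1

alpha = sum(p for pi, p in perm_prob.items() if preimage_size[pi] > 1)
beta = {x: (1.0 - 1.0 / n) ** N for x in range(1, n + 1)}   # Pr[x absent in N steps]
eps = max(beta.values())
print(alpha, 2 * n * eps)   # Lemma 2: alpha < 2*n*eps whenever beta_x < eps for all x
```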

3. A Result on Finite-State Finite-Alphabet Hidden Markov Models

In this paper, we use the parametric description of hidden Markov models as given in [27].
A finite-state finite-alphabet hidden Markov model (in short, HMM) [27] is a quadruple $(\Sigma, A, \{T(a)\}_{a \in A}, \mu)$, where $\Sigma$ and $A$ are finite sets, called the state set and the alphabet, respectively, $\{T(a)\}_{a \in A}$ is a family of $|\Sigma| \times |\Sigma|$ matrices indexed by the elements of $A$, where $|\Sigma|$ is the size of the state set $\Sigma$, and $\mu$ is a probability distribution on the set $\Sigma$. The following conditions must be satisfied:
(i)
$T(a)_{s s'} \geq 0$ for any $s, s' \in \Sigma$ and $a \in A$;
(ii)
$\sum_{s', a} T(a)_{s s'} = 1$ for any $s \in \Sigma$;
(iii)
$\mu(s') = \sum_{s, a} \mu(s) T(a)_{s s'}$ for any $s' \in \Sigma$.
Any probability distribution satisfying condition (iii) is called a stationary distribution. The $|\Sigma| \times |\Sigma|$ matrix $T := \sum_{a \in A} T(a)$ is called the state transition matrix. The triple $(\Sigma, T, \mu)$ defines the underlying Markov chain. Note that condition (iii) is equivalent to condition (iii') $\mu(s') = \sum_{s} \mu(s) T_{s s'}$ for any $s' \in \Sigma$.
Two finite-alphabet stationary processes are induced by an HMM $(\Sigma, A, \{T(a)\}_{a \in A}, \mu)$. One is solely determined by the underlying Markov chain. It is called the internal process and is denoted by $\mathbf{S} = \{S_1, S_2, \ldots\}$. The alphabet for $\mathbf{S}$ is $\Sigma$. The joint probability distributions that characterize $\mathbf{S}$ are given by:
$\Pr\{S_1 = s_1, S_2 = s_2, \ldots, S_L = s_L\} := \mu(s_1) T_{s_1 s_2} \cdots T_{s_{L-1} s_L}$
for any $s_1, \ldots, s_L \in \Sigma$ and $L \geq 1$. The other process $\mathbf{X} = \{X_1, X_2, \ldots\}$ with the alphabet $A$ is defined by the following joint probability distributions:
$\Pr\{X_1 = x_1, X_2 = x_2, \ldots, X_L = x_L\} := \sum_{s, s'} \mu(s) \left( T(x_1) \cdots T(x_L) \right)_{s s'}$
for any $x_1, \ldots, x_L \in A$ and $L \geq 1$, and is called the output process. The stationarity of the probability distribution $\mu$ ensures that of both the internal and output processes.
Symbols $a \in A$ such that $T(a) = O$ occur in the output process with probability zero. Hence, we obtain an equivalent output process even if we remove these symbols. Thus, we can assume that $T(a) \neq O$ for any $a \in A$ without loss of generality.
The internal process $\mathbf{S}$ of an HMM $(\Sigma, A, \{T(a)\}_{a \in A}, \mu)$ is called ergodic if the state transition matrix $T$ is irreducible [28]: for any $s, s' \in \Sigma$, there exists $k > 0$ such that $(T^k)_{s s'} > 0$. If the internal process $\mathbf{S}$ is ergodic, then the stationary distribution $\mu$ is uniquely determined by the state transition matrix $T$ via condition (iii). Every finite-alphabet finite-order multivariate stationary ergodic Markov process can be described as an HMM with an ergodic internal process.
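The following sketch (our own illustration, not from the paper; the two-state, two-symbol matrices are made-up numbers) encodes an HMM in this parametric description and evaluates output-word probabilities with the matrix-product formula above.

```python
# A minimal sketch (not from the paper) of an HMM (Sigma, A, {T(a)}, mu) and the
# word probability Pr{X_1=x_1,...,X_L=x_L} = sum_{s,s'} mu(s) (T(x_1)...T(x_L))_{s s'}.
import numpy as np
from itertools import product

# Illustrative matrices: T(a)_{s s'} >= 0 and the rows of sum_a T(a) sum to one.
T = {1: np.array([[0.3, 0.1],
                  [0.2, 0.2]]),
     2: np.array([[0.4, 0.2],
                  [0.1, 0.5]])}
T_total = sum(T.values())          # state transition matrix of the internal chain

# Stationary distribution mu: left eigenvector of T_total for eigenvalue 1.
w, v = np.linalg.eig(T_total.T)
mu = np.real(v[:, np.argmax(np.real(w))])
mu = mu / mu.sum()

def word_probability(word):
    """Probability of an output word under the HMM."""
    M = np.eye(len(mu))
    for a in word:
        M = M @ T[a]
    return float(mu @ M @ np.ones(len(mu)))

# Sanity check: the probabilities of all words of length 3 sum to one.
assert abs(sum(word_probability(w) for w in product((1, 2), repeat=3)) - 1.0) < 1e-9
```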
Lemma 3 
Let $\mathbf{X}$ be the output process of an HMM $(\Sigma, A_n, \{T(a)\}_{a \in A_n}, \mu)$, where $A_n = \{1, 2, \ldots, n\}$ is an ordered alphabet. If the internal process $\mathbf{S}$ of the HMM is ergodic, then for every $x \in A_n$, there exist $0 < \gamma_x < 1$ and $C_x > 0$ such that $\beta_{x,\mathbf{X},L} < C_x \gamma_x^L$ for all $L \geq 1$.
Proof. 
Given $L \geq 1$, let us put $N := \lfloor L/2 \rfloor$. Fix an arbitrary $x \in A_n$. We have:
$\beta_{x,\mathbf{X},L} = \sum_{x_j \neq x,\ 1 \leq j \leq N} p(x_1 \cdots x_N) = \sum_{x_j \neq x,\ 1 \leq j \leq N} \sum_{s, s'} \mu(s) \left( T(x_1) \cdots T(x_N) \right)_{s s'} = \left\langle \mu^T \left( T - T(x) \right)^N, \mathbf{1} \right\rangle,$
where $\mathbf{1} = (1, 1, \ldots, 1)$ and $\langle \cdot, \cdot \rangle$ is the usual inner product of the $|\Sigma|$-dimensional Euclidean space $\mathbb{R}^{|\Sigma|}$. The spectral radius $\rho(\bar{T}(x))$ of the matrix $\bar{T}(x) := T - T(x)$ is less than one. Indeed, this follows immediately from the Perron-Frobenius theorem for non-negative irreducible matrices: $T$ is a non-negative irreducible matrix with $\rho(T) = 1$ by the assumption. Since $O \leq T - T(x) \leq T$ and $T - T(x) \neq T$ (recall $T(x) \neq O$), applying Theorem 1.5 (e) in [29] implies that $\rho(\bar{T}(x)) < \rho(T) = 1$. By Lemma 5.6.10 in [30], for any $\epsilon > 0$, there exists a matrix norm $\|\cdot\|$ such that $\rho(\bar{T}(x)) \leq \|\bar{T}(x)\| < \rho(\bar{T}(x)) + \epsilon$. It follows that for any $\epsilon > 0$, there exists $C_\epsilon > 0$ such that for all $k \geq 1$:
$\left\| \mu^T \bar{T}(x)^k \right\|_2 \leq C_\epsilon \left( \rho(\bar{T}(x)) + \epsilon \right)^k \left\| \mu \right\|_2,$
where $\|\cdot\|_2$ is the Euclidean norm. Since we have $\rho(\bar{T}(x)) < 1$, we can choose $\epsilon > 0$ so that $\rho(\bar{T}(x)) + \epsilon < 1$. If we put $\gamma_x := \left( \rho(\bar{T}(x)) + \epsilon \right)^{1/2}$ and $C_x := C_\epsilon \left( \rho(\bar{T}(x)) + \epsilon \right)^{-1} \left\| \mu \right\|_2 \left\| \mathbf{1} \right\|_2$, then we obtain $\beta_{x,\mathbf{X},L} < C_x \gamma_x^L$ by the Cauchy-Schwarz inequality, as desired.
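As a numerical companion to the proof (our own sketch, not from the paper), one can evaluate $\beta_{x,\mathbf{X},L} = \langle \mu^T (T - T(x))^N, \mathbf{1} \rangle$ for the illustrative HMM above and watch it shrink at a rate governed by the spectral radius of $T - T(x)$.

```python
# A minimal sketch (not from the paper): geometric decay of beta_{x,X,L} for the
# illustrative two-state HMM used above, with N = floor(L / 2).
import numpy as np

T = {1: np.array([[0.3, 0.1], [0.2, 0.2]]),
     2: np.array([[0.4, 0.2], [0.1, 0.5]])}
T_total = sum(T.values())
mu = np.array([0.5, 0.5])                    # stationary distribution of T_total

x = 1
T_bar = T_total - T[x]                       # transitions that avoid emitting symbol x
rho = max(abs(np.linalg.eigvals(T_bar)))     # spectral radius; < 1 by Perron-Frobenius
one = np.ones(2)

for L in (4, 8, 16, 32):
    N = L // 2
    beta = mu @ np.linalg.matrix_power(T_bar, N) @ one
    print(L, beta, rho ** N)                 # beta shrinks roughly like rho**N
```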

4. Permutation Complexity and Coupling Measures

In this section, we discuss the equalities between complexity and coupling measures and their permutation analogues for the output processes of HMMs whose internal processes are ergodic.

4.1. Fundamental Lemma

Let $(\mathbf{X}^1, \ldots, \mathbf{X}^m)$ be a multivariate finite-alphabet stationary stochastic process, where each univariate process $\mathbf{X}^k = \{X_1^k, X_2^k, \ldots\}$, $k = 1, 2, \ldots, m$, is defined over an ordered alphabet $A_{n_k}$. Note that the notation for stochastic variables is different from that in [17]. Here, $X_t^k$ is the stochastic variable for the $k$-th component of the multivariate process at time step $t$.
We use the notations:
$p(x_{1:L_1}^1, \ldots, x_{1:L_m}^m) := \Pr\{ X_{t_1:t_1+L_1-1}^1 = x_{1:L_1}^1, \ldots, X_{t_m:t_m+L_m-1}^m = x_{1:L_m}^m \}, \quad p(\pi^1, \ldots, \pi^m) := \Pr\{ \phi_{n_k,L_k}(X_{t_k:t_k+L_k-1}^k) = \pi^k,\ k = 1, \ldots, m \}$
and:
$p(\pi^k) := \Pr\{ \phi_{n_k,L_k}(X_{t_k:t_k+L_k-1}^k) = \pi^k \},$
where $t_k \geq 1$, $L_k \geq 1$, $x_{1:L_k}^k \in A_{n_k}^{L_k}$ and $\pi^k \in S_{L_k}$ for $k = 1, \ldots, m$. In general, if $m \geq 2$, then $p(x_{1:L_1}^1, \ldots, x_{1:L_m}^m)$ and $p(\pi^1, \ldots, \pi^m)$ depend on $(t_1, \ldots, t_m)$ and are invariant only under the simultaneous time shift $(t_1, \ldots, t_m) \mapsto (t_1 + \tau, \ldots, t_m + \tau)$. However, here, we make the dependence on $(t_1, \ldots, t_m)$ implicit for notational simplicity.
Lemma 4 
Let:
$\Delta H := H(X_{t_1:t_1+L_1-1}^1, \ldots, X_{t_m:t_m+L_m-1}^m) - H^*(X_{t_1:t_1+L_1-1}^1, \ldots, X_{t_m:t_m+L_m-1}^m),$
where:
$H(X_{t_1:t_1+L_1-1}^1, \ldots, X_{t_m:t_m+L_m-1}^m) = - \sum_{x_{1:L_1}^1, \ldots, x_{1:L_m}^m} p(x_{1:L_1}^1, \ldots, x_{1:L_m}^m) \log p(x_{1:L_1}^1, \ldots, x_{1:L_m}^m)$
and:
$H^*(X_{t_1:t_1+L_1-1}^1, \ldots, X_{t_m:t_m+L_m-1}^m) = - \sum_{\pi^1, \ldots, \pi^m} p(\pi^1, \ldots, \pi^m) \log p(\pi^1, \ldots, \pi^m)$
are the Shannon entropies of the joint occurrence of the words $x_{1:L_1}^1, \ldots, x_{1:L_m}^m$ and the permutations $\pi^1, \ldots, \pi^m$, respectively, and the base of the logarithm is taken to be two. Then, we have:
$0 \leq \Delta H \leq \left( \sum_{k=1}^m \alpha_{\mathbf{X}^k, L_k} \right) \left( \sum_{k=1}^m n_k \log(L_k + n_k) \right).$
Proof. 
We have:
$\Delta H = H(X_{t_1:t_1+L_1-1}^1, \ldots, X_{t_m:t_m+L_m-1}^m) - H^*(X_{t_1:t_1+L_1-1}^1, \ldots, X_{t_m:t_m+L_m-1}^m) = \sum_{\pi^1, \ldots, \pi^m,\ p(\pi^1, \ldots, \pi^m) > 0} p(\pi^1, \ldots, \pi^m) \times \left( - \sum_{x_{1:L_k}^k \in \phi_{n_k,L_k}^{-1}(\pi^k),\ 1 \leq k \leq m} \frac{p(x_{1:L_1}^1, \ldots, x_{1:L_m}^m)}{p(\pi^1, \ldots, \pi^m)} \log \frac{p(x_{1:L_1}^1, \ldots, x_{1:L_m}^m)}{p(\pi^1, \ldots, \pi^m)} \right).$
By Theorem 1 (i), it holds that:
$0 \leq - \sum_{x_{1:L_k}^k \in \phi_{n_k,L_k}^{-1}(\pi^k),\ 1 \leq k \leq m} \frac{p(x_{1:L_1}^1, \ldots, x_{1:L_m}^m)}{p(\pi^1, \ldots, \pi^m)} \log \frac{p(x_{1:L_1}^1, \ldots, x_{1:L_m}^m)}{p(\pi^1, \ldots, \pi^m)} \leq \log \prod_{k=1}^m \binom{L_k + n_k - \mathrm{Desc}(\pi^k) - 1}{L_k} \leq \log \prod_{k=1}^m (L_k + n_k)^{n_k} = \sum_{k=1}^m n_k \log(L_k + n_k)$
for $(\pi^1, \ldots, \pi^m) \in S_{L_1} \times \cdots \times S_{L_m}$ such that $p(\pi^1, \ldots, \pi^m) > 0$.
If $|(\phi_{n_1,L_1} \times \cdots \times \phi_{n_m,L_m})^{-1}(\pi^1, \ldots, \pi^m)| = 1$, then:
$- \sum_{x_{1:L_k}^k \in \phi_{n_k,L_k}^{-1}(\pi^k),\ 1 \leq k \leq m} \frac{p(x_{1:L_1}^1, \ldots, x_{1:L_m}^m)}{p(\pi^1, \ldots, \pi^m)} \log \frac{p(x_{1:L_1}^1, \ldots, x_{1:L_m}^m)}{p(\pi^1, \ldots, \pi^m)} = 0.$
On the other hand, we have:
$\sum_{\pi^1, \ldots, \pi^m,\ \exists k \text{ s.t. } |\phi_{n_k,L_k}^{-1}(\pi^k)| > 1} p(\pi^1, \ldots, \pi^m) \leq \sum_{k=1}^m \sum_{\pi^k,\ |\phi_{n_k,L_k}^{-1}(\pi^k)| > 1} p(\pi^k) = \sum_{k=1}^m \alpha_{\mathbf{X}^k, L_k}.$
This completes the proof of the inequality.

4.2. Excess Entropy

Let $\mathbf{X}$ be a finite-alphabet stationary stochastic process. Its excess entropy is defined by [21]:
$\mathbf{E}(\mathbf{X}) = \lim_{L \to \infty} \left( H(X_{1:L}) - h(\mathbf{X}) L \right) = \sum_{L=1}^{\infty} \left( H(X_L | X_{1:L-1}) - h(\mathbf{X}) \right),$
if the limit on the right-hand side exists, where $h(\mathbf{X}) = \lim_{L \to \infty} H(X_{1:L})/L$ is the entropy rate of $\mathbf{X}$, which exists for any finite-alphabet stationary stochastic process [31].
The excess entropy has been used as a measure of complexity [32,33,34,35,36,37]. Actually, it quantifies the global correlations present in a given stationary process in the following sense. If $\mathbf{E}(\mathbf{X})$ exists, then it can be written as the mutual information between the past and the future:
$\mathbf{E}(\mathbf{X}) = \lim_{L \to \infty} I(X_{1:L}; X_{L+1:2L}).$
It is known that if $\mathbf{X}$ is the output process of an HMM, then $\mathbf{E}(\mathbf{X})$ exists [38].
When the alphabet of $\mathbf{X}$ is an ordered alphabet $A_n$, we define the permutation excess entropy of $\mathbf{X}$ by [17]:
$\mathbf{E}^*(\mathbf{X}) = \lim_{L \to \infty} \left( H^*(X_{1:L}) - h^*(\mathbf{X}) L \right) = \sum_{L=1}^{\infty} \left( H^*(X_L | X_{1:L-1}) - h^*(\mathbf{X}) \right),$
if the limit on the right-hand side exists, where $h^*(\mathbf{X}) = \lim_{L \to \infty} H^*(X_{1:L})/L$ is the permutation entropy rate of $\mathbf{X}$, which exists for any finite-alphabet stationary stochastic process and is equal to the entropy rate $h(\mathbf{X})$ [2,15,16], $H^*(X_L | X_{1:L-1}) := H^*(X_{1:L}) - H^*(X_{1:L-1})$, and $H^*(X_{1:L})$ and $H^*(X_{1:L-1})$ are as defined in the statement of Lemma 4.
The following proposition is a generalization of our previous results in [17,18].
Proposition 5 
Let $\mathbf{X}$ be the output process of an HMM $(\Sigma, A_n, \{T(a)\}_{a \in A_n}, \mu)$ with an ergodic internal process. Then, we have:
$\mathbf{E}(\mathbf{X}) = \mathbf{E}^*(\mathbf{X}) = \lim_{L \to \infty} I^*(X_{1:L}; X_{L+1:2L}),$
where $I^*(X_{1:L}; X_{L+1:2L}) := H^*(X_{1:L}) + H^*(X_{L+1:2L}) - H^*(X_{1:L}, X_{L+1:2L}) = 2 H^*(X_{1:L}) - H^*(X_{1:L}, X_{L+1:2L})$.
Proof. 
Let $L \geq 1$. We have:
$\left| \left( H(X_{1:L}) - h(\mathbf{X}) L \right) - \left( H^*(X_{1:L}) - h^*(\mathbf{X}) L \right) \right| = \left| H(X_{1:L}) - H^*(X_{1:L}) \right| \leq \alpha_{\mathbf{X},L}\, n \log(L + n) \leq 2 C n^2 \log(L + n)\, \gamma^L,$
where $C := \max_{x \in A_n} \{C_x\}$, $\gamma := \max_{x \in A_n} \{\gamma_x\} < 1$, and we have used $h(\mathbf{X}) = h^*(\mathbf{X})$ for the first equality, Lemma 4 for the second inequality and Lemma 2 and Lemma 3 for the last inequality. By taking the limit $L \to \infty$, we obtain $\mathbf{E}(\mathbf{X}) = \mathbf{E}^*(\mathbf{X})$.
To prove $\lim_{L \to \infty} I(X_{1:L}; X_{L+1:2L}) = \lim_{L \to \infty} I^*(X_{1:L}; X_{L+1:2L})$, it is sufficient to show that $| H(X_{1:L}, X_{L+1:2L}) - H^*(X_{1:L}, X_{L+1:2L}) | \to 0$ as $L \to \infty$. This is because we have:
$\left| I(X_{1:L}; X_{L+1:2L}) - I^*(X_{1:L}; X_{L+1:2L}) \right| \leq 2 \left| H(X_{1:L}) - H^*(X_{1:L}) \right| + \left| H(X_{1:L}, X_{L+1:2L}) - H^*(X_{1:L}, X_{L+1:2L}) \right|.$
However, this can be shown similarly to the above discussion by applying Lemma 4 to the bivariate process $(\mathbf{X}^1, \mathbf{X}^2) := (\mathbf{X}, \mathbf{X})$ and, then, using Lemma 2 and Lemma 3.
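As a numerical illustration of Proposition 5 (a sketch of our own, not from the paper), the following code compares the finite-$L$ approximants $H(X_{1:L}) - hL$ and $H^*(X_{1:L}) - hL$ for a binary ergodic Markov chain, which is an HMM whose internal states coincide with its output symbols; the transition matrix is made up.

```python
# A minimal sketch (not from the paper): finite-L approximants of E(X) and E*(X)
# for a binary ergodic Markov chain with made-up transition probabilities.
import numpy as np
from itertools import product
from collections import Counter

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])                   # illustrative transition matrix
mu = np.array([2 / 3, 1 / 3])                # its stationary distribution
h = -sum(mu[i] * P[i, j] * np.log2(P[i, j]) for i in range(2) for j in range(2))

def entropy(probs):
    return -sum(p * np.log2(p) for p in probs if p > 0)

def permutation_type(word):
    L = len(word)
    return tuple(sorted(range(1, L + 1), key=lambda i: (word[i - 1], i)))

for L in (2, 4, 6, 8, 10):
    word_p = {}
    for w in product((0, 1), repeat=L):
        p = mu[w[0]]
        for a, b in zip(w, w[1:]):
            p *= P[a, b]
        word_p[w] = p
    perm_p = Counter()
    for w, p in word_p.items():
        perm_p[permutation_type(w)] += p
    # For a Markov chain, H(X_{1:L}) - h*L equals E(X) exactly for every L;
    # the permutation column converges to the same value, but more slowly.
    print(L, entropy(word_p.values()) - h * L, entropy(perm_p.values()) - h * L)
```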

4.3. Transfer Entropy and Momentary Information Transfer

In this subsection, we consider two information rates that are measures of coupling direction and strength between two jointly distributed processes and discuss the equalities between them and their permutation analogues. One is the rate of the transfer entropy [22], and the other is the rate of the momentary information transfer [24]. Both are particular instances of the conditional mutual information [39].
Let $(\mathbf{X}, \mathbf{Y})$ be a bivariate finite-alphabet stationary stochastic process. We assume that the alphabets of $\mathbf{X}$ and $\mathbf{Y}$ are ordered alphabets $A_n$ and $A_m$, respectively. For $\tau = 1, 2, \ldots$, we define the $\tau$-step transfer entropy rate from $\mathbf{Y}$ to $\mathbf{X}$ by:
$t_\tau(\mathbf{Y} \to \mathbf{X}) = \lim_{L \to \infty} \left( H(X_{L+1:L+\tau} | X_{1:L}) - H(X_{L+1:L+\tau} | X_{1:L}, Y_{1:L}) \right) = \lim_{L \to \infty} \left( H(X_{1:L+\tau}) - H(X_{1:L}) - H(X_{1:L+\tau}, Y_{1:L}) + H(X_{1:L}, Y_{1:L}) \right).$
When $\tau = 1$, $t_1(\mathbf{Y} \to \mathbf{X})$ is called just the transfer entropy rate [40] from $\mathbf{Y}$ to $\mathbf{X}$ and is simply denoted by $t(\mathbf{Y} \to \mathbf{X})$.
If we introduce the $\tau$-step entropy rate of $\mathbf{X}$ by:
$h_\tau(\mathbf{X}) = \lim_{L \to \infty} H(X_{L+1:L+\tau} | X_{1:L})$
and the $\tau$-step conditional entropy rate of $\mathbf{X}$ given $\mathbf{Y}$ by:
$h_\tau(\mathbf{X} | \mathbf{Y}) = \lim_{L \to \infty} H(X_{L+1:L+\tau} | X_{1:L}, Y_{1:L}),$
then we can write:
$t_\tau(\mathbf{Y} \to \mathbf{X}) = h_\tau(\mathbf{X}) - h_\tau(\mathbf{X} | \mathbf{Y}),$
because both $h_\tau(\mathbf{X})$ and $h_\tau(\mathbf{X} | \mathbf{Y})$ exist. We call $h_1(\mathbf{X} | \mathbf{Y})$ the conditional entropy rate and denote it by $h(\mathbf{X} | \mathbf{Y})$. Note that the conditional entropy rate here is slightly different from that found in the literature. For example, in [41], the conditional entropy rate (called the conditional uncertainty) is defined by $\lim_{L \to \infty} H(X_{L+1} | X_{1:L}, Y_{1:L+1})$. The difference from the conditional entropy rate defined here is in whether the conditioning on $Y_{L+1}$ is involved or not.
$h_\tau(\mathbf{X})$ is additive, namely, we always have:
$h_\tau(\mathbf{X}) = \tau h_1(\mathbf{X}) = \tau h(\mathbf{X}).$
However, for the $\tau$-step conditional entropy rate, additivity does not hold in general; it is only super-additive: we have the inequality:
$h_\tau(\mathbf{X} | \mathbf{Y}) \geq \tau h(\mathbf{X} | \mathbf{Y})$
in general. Indeed, we have:
$h_\tau(\mathbf{X} | \mathbf{Y}) = \lim_{L \to \infty} H(X_{L+1:L+\tau} | X_{1:L}, Y_{1:L}) = \lim_{L \to \infty} \sum_{\tau'=1}^{\tau} H(X_{L+\tau'} | X_{1:L+\tau'-1}, Y_{1:L}) \geq \lim_{L \to \infty} \sum_{\tau'=1}^{\tau} H(X_{L+\tau'} | X_{1:L+\tau'-1}, Y_{1:L+\tau'-1}) = \tau h(\mathbf{X} | \mathbf{Y}).$
This leads to the sub-additivity of the $\tau$-step transfer entropy rate:
$t_\tau(\mathbf{Y} \to \mathbf{X}) \leq \tau\, t(\mathbf{Y} \to \mathbf{X}).$
An example with strict inequality can easily be given. Let $\mathbf{Y}$ be an independent and identically distributed (i.i.d.) process and $\mathbf{X}$ be defined by $X_1 = Y_1$ and $X_{i+1} = Y_i$. We have $h(\mathbf{X}) = h(\mathbf{Y}) = H(Y_1)$ and $h_\tau(\mathbf{X} | \mathbf{Y}) = (\tau - 1) H(Y_1)$. Hence, $t_\tau(\mathbf{Y} \to \mathbf{X}) = H(Y_1)$ for all $\tau = 1, 2, \ldots$.
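This example can be checked by exact enumeration; the sketch below (ours, not from the paper) does so for a fair-coin $\mathbf{Y}$, for which the finite-$L$ conditional entropies already equal their limiting values.

```python
# A minimal sketch (not from the paper): for Y a fair-coin i.i.d. process and
# X_1 = Y_1, X_{i+1} = Y_i, the tau-step transfer entropy equals H(Y_1) = 1 bit.
import numpy as np
from itertools import product
from collections import defaultdict

def cond_entropy(joint):
    """H(A | B) from a dict {(a, b): prob}."""
    marg = defaultdict(float)
    for (a, b), p in joint.items():
        marg[b] += p
    return -sum(p * np.log2(p / marg[b]) for (a, b), p in joint.items() if p > 0)

L, tau = 4, 3
joint_x, joint_xy = defaultdict(float), defaultdict(float)
for y in product((0, 1), repeat=L + tau):           # Y_1, ..., Y_{L+tau}
    p = 0.5 ** (L + tau)
    x = (y[0],) + y[:-1]                            # X_1 = Y_1, X_{i+1} = Y_i
    future, past_x, past_y = x[L:L + tau], x[:L], y[:L]
    joint_x[(future, past_x)] += p
    joint_xy[(future, (past_x, past_y))] += p

t_tau = cond_entropy(joint_x) - cond_entropy(joint_xy)
print(t_tau)                                        # 1.0 = H(Y_1), independent of tau
```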
There are two permutation analogues of the transfer entropy. One is called the symbolic transfer entropy (STE) [42], and the other is called the transfer entropy on rank vectors (TERV) [43]. Here, we introduce their rates as follows: the rate of STE from $\mathbf{Y}$ to $\mathbf{X}$ is defined by:
$t_\tau^{**}(\mathbf{Y} \to \mathbf{X}) = \lim_{L \to \infty} \left( H^*(X_{1:L}, X_{1+\tau:L+\tau}) - H^*(X_{1:L}) - H^*(X_{1:L}, X_{1+\tau:L+\tau}, Y_{1:L}) + H^*(X_{1:L}, Y_{1:L}) \right),$
if the limit on the right-hand side exists. The rate of TERV from $\mathbf{Y}$ to $\mathbf{X}$ is defined by:
$t_\tau^{*}(\mathbf{Y} \to \mathbf{X}) = \lim_{L \to \infty} \left( H^*(X_{1:L+\tau}) - H^*(X_{1:L}) - H^*(X_{1:L+\tau}, Y_{1:L}) + H^*(X_{1:L}, Y_{1:L}) \right),$
if the limit on the right-hand side exists. If $\mathbf{E}^*(\mathbf{X})$ exists, then, by the definition of the permutation excess entropy, we have:
$h^*(\mathbf{X}) = \lim_{L \to \infty} \left( H^*(X_{1:L+1}) - H^*(X_{1:L}) \right).$
In this case, $t_1^{*}(\mathbf{Y} \to \mathbf{X})$ coincides with a quantity called the symbolic transfer entropy rate, introduced in [19].
Proposition 6 
Let $(\mathbf{X}, \mathbf{Y})$ be the output process of an HMM $(\Sigma, A_n \times A_m, \{T(a,b)\}_{(a,b) \in A_n \times A_m}, \mu)$ with an ergodic internal process. Then, we have:
$t_\tau(\mathbf{Y} \to \mathbf{X}) = t_\tau^{*}(\mathbf{Y} \to \mathbf{X}) = t_\tau^{**}(\mathbf{Y} \to \mathbf{X}).$
Proof. 
Since both $\mathbf{X}$ and $\mathbf{Y}$ are the output processes of appropriate HMMs with ergodic internal processes, the equalities follow from a discussion similar to that in the proof of Proposition 5. Indeed, for example, $\mathbf{X}$ is the output process of the HMM $(\Sigma, A_n, \{T(a)\}_{a \in A_n}, \mu)$, where $T(a) := \sum_{b \in A_m} T(a,b)$.
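The marginalization used in this proof is easy to make explicit. The sketch below (ours, not from the paper; the joint output matrices are made-up numbers) builds the $\mathbf{X}$-marginal HMM by summing the joint output matrices over the second symbol.

```python
# A minimal sketch (not from the paper): the X-marginal of a joint HMM with output
# matrices T(a, b) is the HMM with T(a) := sum_b T(a, b) over the same internal chain.
import numpy as np

T_joint = {(1, 1): np.array([[0.10, 0.05], [0.05, 0.10]]),
           (1, 2): np.array([[0.20, 0.05], [0.10, 0.15]]),
           (2, 1): np.array([[0.15, 0.10], [0.20, 0.10]]),
           (2, 2): np.array([[0.25, 0.10], [0.10, 0.20]])}

# Output matrices of the marginal HMM for X.
T_x = {a: sum(M for (a2, b), M in T_joint.items() if a2 == a) for a in (1, 2)}

# The underlying state transition matrix is unchanged by marginalization.
assert np.allclose(sum(T_joint.values()), sum(T_x.values()))
```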
A different instance of the conditional mutual information, called the momentary information transfer, is considered in [24]. It was proposed to improve the ability to detect coupling delays, an ability that the transfer entropy lacks. Here, we consider its rate: the momentary information transfer rate is defined by:
$m_\tau(\mathbf{Y} \to \mathbf{X}) = \lim_{L \to \infty} \left( H(X_{L+\tau} | X_{1:L+\tau-1}, Y_{1:L-1}) - H(X_{L+\tau} | X_{1:L+\tau-1}, Y_{1:L}) \right) = \lim_{L \to \infty} \left[ H(X_{1:L+\tau}, Y_{1:L-1}) - H(X_{1:L+\tau-1}, Y_{1:L-1}) - H(X_{1:L+\tau}, Y_{1:L}) + H(X_{1:L+\tau-1}, Y_{1:L}) \right].$
Its permutation analogue, called the momentary sorting information transfer rate, is defined by:
$m_\tau^{*}(\mathbf{Y} \to \mathbf{X}) = \lim_{L \to \infty} \left[ H^*(X_{1:L+\tau}, Y_{1:L-1}) - H^*(X_{1:L+\tau-1}, Y_{1:L-1}) - H^*(X_{1:L+\tau}, Y_{1:L}) + H^*(X_{1:L+\tau-1}, Y_{1:L}) \right].$
By a discussion similar to that in the proof of Proposition 6, we obtain the following equality:
Proposition 7 
Let $(\mathbf{X}, \mathbf{Y})$ be the output process of an HMM $(\Sigma, A_n \times A_m, \{T(a,b)\}_{(a,b) \in A_n \times A_m}, \mu)$ with an ergodic internal process. Then, we have:
$m_\tau(\mathbf{Y} \to \mathbf{X}) = m_\tau^{*}(\mathbf{Y} \to \mathbf{X}).$

4.4. Directed Information

Directed information is a measure of coupling direction and strength based on the idea of causal conditioning [26,44]. Since it is not a particular instance of the conditional mutual information, we treat it separately here. In the following presentation, we make use of terminology from [40,45].
Let $(\mathbf{X}, \mathbf{Y})$ be a bivariate finite-alphabet stationary stochastic process. The alphabets of $\mathbf{X}$ and $\mathbf{Y}$ are ordered alphabets $A_n$ and $A_m$, respectively. The directed information rate from $\mathbf{Y}$ to $\mathbf{X}$ is defined by:
$I(\mathbf{Y} \to \mathbf{X}) = \lim_{L \to \infty} \frac{1}{L} I(Y_{1:L} \to X_{1:L}),$
where:
$I(Y_{1:L} \to X_{1:L}) = \sum_{i=1}^{L} I(X_i; Y_{1:i} | X_{1:i-1}) = H(X_{1:L}) - \sum_{i=1}^{L} H(X_i | X_{1:i-1}, Y_{1:i}).$
Note that if $Y_{1:i}$ in the above expression on the right-hand side is replaced by $Y_{1:L}$, then we obtain the mutual information between $X_{1:L}$ and $Y_{1:L}$:
$I(X_{1:L}; Y_{1:L}) = H(X_{1:L}) - \sum_{i=1}^{L} H(X_i | X_{1:i-1}, Y_{1:L}).$
Thus, conditioning on $Y_{1:i}$ for $i = 1, \ldots, L$, not on $Y_{1:L}$, distinguishes the directed information from the mutual information. Following [44], we write:
$H(X_{1:L} \| Y_{1:L}) := \sum_{i=1}^{L} H(X_i | X_{1:i-1}, Y_{1:i})$
and call this quantity the causal conditional entropy. By using this notation, we have:
$I(Y_{1:L} \to X_{1:L}) = H(X_{1:L}) - H(X_{1:L} \| Y_{1:L}).$
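To illustrate how these finite-$L$ quantities can be evaluated in practice (a sketch of our own, not an estimator proposed in the paper), the following code computes $I(Y_{1:L} \to X_{1:L})$ directly from an explicit joint distribution over pairs of words; the function directed_information and the toy distribution are hypothetical.

```python
# A minimal sketch (not from the paper): I(Y_{1:L} -> X_{1:L}) = H(X_{1:L}) -
# sum_i H(X_i | X_{1:i-1}, Y_{1:i}), computed from a joint distribution of word pairs.
import numpy as np
from collections import defaultdict

def entropy_of(dist):
    return -sum(p * np.log2(p) for p in dist.values() if p > 0)

def directed_information(p_joint, L):
    """p_joint: dict {(x_word, y_word): prob} over words of length L."""
    p_x = defaultdict(float)
    for (x, y), p in p_joint.items():
        p_x[x] += p
    di = entropy_of(p_x)                                    # H(X_{1:L})
    for i in range(1, L + 1):
        joint_i, cond_i = defaultdict(float), defaultdict(float)
        for (x, y), p in p_joint.items():
            joint_i[(x[:i], y[:i])] += p
            cond_i[(x[:i - 1], y[:i])] += p
        # H(X_i | X_{1:i-1}, Y_{1:i}) = H(X_{1:i}, Y_{1:i}) - H(X_{1:i-1}, Y_{1:i})
        di -= entropy_of(joint_i) - entropy_of(cond_i)
    return di

# Toy usage: Y is a fair coin, X_1 is an independent fair coin and X_2 = Y_1.
p = defaultdict(float)
for y1, y2, x1 in ((a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)):
    p[((x1, y1), (y1, y2))] += 0.125
print(directed_information(p, 2))            # 1 bit: only X_2 is "caused" by Y
```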
The permutation analogue of the directed information rate, which we call the symbolic directed information rate, is defined by:
$I^*(\mathbf{Y} \to \mathbf{X}) = \lim_{L \to \infty} \frac{1}{L} I^*(Y_{1:L} \to X_{1:L}),$
if the limit on the right-hand side exists, where:
$I^*(Y_{1:L} \to X_{1:L}) := H^*(X_{1:L}) - \sum_{i=1}^{L} \left( H^*(X_{1:i}, Y_{1:i}) - H^*(X_{1:i-1}, Y_{1:i}) \right).$
If we write:
$I^*(X_i; Y_{1:i} | X_{1:i-1}) := H^*(X_{1:i}) - H^*(X_{1:i-1}) - H^*(X_{1:i}, Y_{1:i}) + H^*(X_{1:i-1}, Y_{1:i})$
and:
$H^*(X_{1:L} \| Y_{1:L}) := \sum_{i=1}^{L} \left( H^*(X_{1:i}, Y_{1:i}) - H^*(X_{1:i-1}, Y_{1:i}) \right),$
then we have the expressions:
$I^*(Y_{1:L} \to X_{1:L}) = \sum_{i=1}^{L} I^*(X_i; Y_{1:i} | X_{1:i-1}) = H^*(X_{1:L}) - H^*(X_{1:L} \| Y_{1:L}).$
Proposition 8 
Let $(\mathbf{X}, \mathbf{Y})$ be the output process of an HMM $(\Sigma, A_n \times A_m, \{T(a,b)\}_{(a,b) \in A_n \times A_m}, \mu)$ with an ergodic internal process. Then, we have:
$I(\mathbf{Y} \to \mathbf{X}) = I^*(\mathbf{Y} \to \mathbf{X}).$
Proof. 
We have:
$\left| I(Y_{1:L} \to X_{1:L}) - I^*(Y_{1:L} \to X_{1:L}) \right| \leq \left| H(X_{1:L}) - H^*(X_{1:L}) \right| + \sum_{i=1}^{L} \left| H(X_{1:i}, Y_{1:i}) - H^*(X_{1:i}, Y_{1:i}) \right| + \sum_{i=1}^{L} \left| H(X_{1:i-1}, Y_{1:i}) - H^*(X_{1:i-1}, Y_{1:i}) \right|.$
We know that the first term on the right-hand side in the above inequality goes to zero as $L \to \infty$. Let us evaluate the second sum. By Lemma 4, it holds that:
$\sum_{i=1}^{L} \left| H(X_{1:i}, Y_{1:i}) - H^*(X_{1:i}, Y_{1:i}) \right| \leq \sum_{i=1}^{L} \left( \alpha_{\mathbf{X},i} + \alpha_{\mathbf{Y},i} \right) \left( n \log(i + n) + m \log(i + m) \right).$
By Lemma 2 and Lemma 3, we have:
$\sum_{i=1}^{L} \alpha_{\mathbf{X},i}\, n \log(i + n) \leq 2 C n^2 \sum_{i=1}^{L} \gamma^i \log(i + n),$
where $C := \max_{x \in A_n} \{C_x\}$ and $\gamma := \max_{x \in A_n} \{\gamma_x\} < 1$. It is elementary to show that $\lim_{L \to \infty} \sum_{i=1}^{L} \gamma^i \log(i + n)$ is finite. The limits of the other terms are also shown to be finite similarly. Thus, we can conclude that the limit of the second sum is bounded. Similarly, the limit of the third sum is also bounded. The equality in the claim follows immediately after dividing by $L$ and letting $L \to \infty$.
For output processes of HMMs with ergodic internal processes, properties of the directed information rate can be transferred to the symbolic directed information rate. Since their proofs can be given in the same manner as those of the above propositions, here, we list some of them without proofs. For the proofs of the properties of the directed information rate, we refer to [44,45].
Let $(\mathbf{X}, \mathbf{Y})$ be the output process of an HMM $(\Sigma, A_n \times A_m, \{T(a,b)\}_{(a,b) \in A_n \times A_m}, \mu)$ with an ergodic internal process. Then, we have:
(i)
$I^*(\mathbf{Y} \to \mathbf{X}) = \lim_{L \to \infty} I^*(X_L; Y_{1:L} | X_{1:L-1}).$
This is the permutation analogue of the equality:
$I(\mathbf{Y} \to \mathbf{X}) = \lim_{L \to \infty} I(X_L; Y_{1:L} | X_{1:L-1});$
(ii)
$I(D\mathbf{Y} \to \mathbf{X}) = I^*(D\mathbf{Y} \to \mathbf{X}) = \lim_{L \to \infty} I^*(X_L; Y_{1:L-1} | X_{1:L-1}).$
Here:
$I(D\mathbf{Y} \to \mathbf{X}) := \lim_{L \to \infty} \frac{1}{L} I(DY_{1:L} \to X_{1:L})$
and:
$I(DY_{1:L} \to X_{1:L}) := \sum_{i=1}^{L} I(X_i; Y_{1:i-1} | X_{1:i-1}).$
The symbol $D$ denotes the one-step delay. $I^*(D\mathbf{Y} \to \mathbf{X})$ is the corresponding permutation analogue. The second equality is the permutation analogue of the equality $I(D\mathbf{Y} \to \mathbf{X}) = \lim_{L \to \infty} I(X_L; Y_{1:L-1} | X_{1:L-1})$. Since $I(D\mathbf{Y} \to \mathbf{X})$ coincides with the transfer entropy rate, the first equality is just the equality between the transfer entropy rate and the symbolic transfer entropy rate (or the rate of one-step TERV) proven in Proposition 6, given the second equality;
(iii)
$I(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y}) = I^*(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y}) = \lim_{L \to \infty} I^*(X_L; Y_L | X_{1:L-1}, Y_{1:L-1}),$
where $I(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y})$ is called the instantaneous information exchange rate and is defined by:
$I(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y}) := \lim_{L \to \infty} \frac{1}{L} I(Y_{1:L} \to X_{1:L} \| DY_{1:L})$
and:
$I(Y_{1:L} \to X_{1:L} \| DY_{1:L}) = H(X_{1:L} \| DY_{1:L}) - H(X_{1:L} \| Y_{1:L}, DY_{1:L}) = \sum_{i=1}^{L} I(X_i; Y_{1:i} | X_{1:i-1}, Y_{1:i-1}) = \sum_{i=1}^{L} I(X_i; Y_i | X_{1:i-1}, Y_{1:i-1}).$
From the last expression of $I(Y_{1:L} \to X_{1:L} \| DY_{1:L})$, we can obtain:
$I(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y}) = \lim_{L \to \infty} I(X_L; Y_L | X_{1:L-1}, Y_{1:L-1}).$
$I^*(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y})$ is the corresponding permutation analogue and is called the symbolic instantaneous information exchange rate;
(iv)
$I^*(\mathbf{Y} \to \mathbf{X}) = I^*(D\mathbf{Y} \to \mathbf{X}) + I^*(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y}).$
Namely, the symbolic directed information rate decomposes into the sum of the symbolic transfer entropy rate and the symbolic instantaneous information exchange rate. This follows immediately from (ii), (iii) and the equality saying that the directed information rate decomposes into the sum of the transfer entropy rate and the instantaneous information exchange rate:
$I(\mathbf{Y} \to \mathbf{X}) = I(D\mathbf{Y} \to \mathbf{X}) + I(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y});$
(v)
$I^*(\mathbf{Y} \to \mathbf{X}) + I^*(D\mathbf{X} \to \mathbf{Y}) = I^*(\mathbf{X}; \mathbf{Y}).$
This is the permutation analogue of the equality saying that the mutual information rate between $\mathbf{X}$ and $\mathbf{Y}$ is the sum of the directed information rate from $\mathbf{Y}$ to $\mathbf{X}$ and the transfer entropy rate from $\mathbf{X}$ to $\mathbf{Y}$:
$I(\mathbf{Y} \to \mathbf{X}) + I(D\mathbf{X} \to \mathbf{Y}) = I(\mathbf{X}; \mathbf{Y}),$
where:
$I(\mathbf{X}; \mathbf{Y}) := \lim_{L \to \infty} \frac{1}{L} I(X_{1:L}; Y_{1:L})$
is the mutual information rate and $I^*(\mathbf{X}; \mathbf{Y})$ is its permutation analogue, called the symbolic mutual information rate. It is known that they are equal for any bivariate finite-alphabet stationary stochastic process [19]. Thus, the symbolic mutual information rate between $\mathbf{X}$ and $\mathbf{Y}$ is the sum of the symbolic directed information rate from $\mathbf{Y}$ to $\mathbf{X}$ and the symbolic transfer entropy rate from $\mathbf{X}$ to $\mathbf{Y}$.
We can also introduce the permutation analogue of the causal conditional directed information rate and prove the corresponding properties. To be precise, let us consider a multivariate finite-alphabet stationary stochastic process $(\mathbf{X}, \mathbf{Y}, \mathbf{Z}^1, \ldots, \mathbf{Z}^k)$ with the alphabet $A_n \times A_m \times A_{l_1} \times \cdots \times A_{l_k}$. The causal conditional directed information rate from $\mathbf{Y}$ to $\mathbf{X}$ given $(\mathbf{Z}^1, \ldots, \mathbf{Z}^k)$ is defined by:
$I(\mathbf{Y} \to \mathbf{X} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k) := \lim_{L \to \infty} \frac{1}{L} I(Y_{1:L} \to X_{1:L} \| Z_{1:L}^1, \ldots, Z_{1:L}^k),$
where:
$I(Y_{1:L} \to X_{1:L} \| Z_{1:L}^1, \ldots, Z_{1:L}^k) = H(X_{1:L} \| Z_{1:L}^1, \ldots, Z_{1:L}^k) - H(X_{1:L} \| Y_{1:L}, Z_{1:L}^1, \ldots, Z_{1:L}^k) = \sum_{i=1}^{L} I(X_i; Y_{1:i} | X_{1:i-1}, Z_{1:i}^1, \ldots, Z_{1:i}^k).$
Corresponding to Proposition 8, we have the following equality if $(\mathbf{X}, \mathbf{Y}, \mathbf{Z}^1, \ldots, \mathbf{Z}^k)$ is the output process of an HMM with an ergodic internal process:
$I(\mathbf{Y} \to \mathbf{X} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k) = I^*(\mathbf{Y} \to \mathbf{X} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k),$
where $I^*(\mathbf{Y} \to \mathbf{X} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k)$ is the symbolic causal conditional directed information rate, which is defined in the same manner as the symbolic directed information rate. The following properties also hold: assume that $(\mathbf{X}, \mathbf{Y}, \mathbf{Z}^1, \ldots, \mathbf{Z}^k)$ is the output process of an HMM with an ergodic internal process. Then, we have:
(i’)
$I^*(\mathbf{Y} \to \mathbf{X} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k) = \lim_{L \to \infty} I^*(X_L; Y_{1:L} | X_{1:L-1}, Z_{1:L}^1, \ldots, Z_{1:L}^k).$
This is the permutation analogue of the equality:
$I(\mathbf{Y} \to \mathbf{X} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k) = \lim_{L \to \infty} I(X_L; Y_{1:L} | X_{1:L-1}, Z_{1:L}^1, \ldots, Z_{1:L}^k);$
(ii’)
$I(D\mathbf{Y} \to \mathbf{X} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k) = I^*(D\mathbf{Y} \to \mathbf{X} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k) = \lim_{L \to \infty} I^*(X_L; Y_{1:L-1} | X_{1:L-1}, Z_{1:L}^1, \ldots, Z_{1:L}^k).$
The second equality is the permutation analogue of the equality:
$I(D\mathbf{Y} \to \mathbf{X} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k) = \lim_{L \to \infty} I(X_L; Y_{1:L-1} | X_{1:L-1}, Z_{1:L}^1, \ldots, Z_{1:L}^k).$
The quantities $I(D\mathbf{Y} \to \mathbf{X} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k)$ and $I^*(D\mathbf{Y} \to \mathbf{X} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k)$ are called the causal conditional transfer entropy rate and the symbolic causal conditional transfer entropy rate, respectively;
(iii’)
$I(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y}, \mathbf{Z}^1, \ldots, \mathbf{Z}^k) = I^*(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y}, \mathbf{Z}^1, \ldots, \mathbf{Z}^k) = \lim_{L \to \infty} I^*(X_L; Y_L | X_{1:L-1}, Y_{1:L-1}, Z_{1:L}^1, \ldots, Z_{1:L}^k),$
where $I(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y}, \mathbf{Z}^1, \ldots, \mathbf{Z}^k)$ is called the causal conditional instantaneous information exchange rate. The second equality is the permutation analogue of the equality:
$I(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y}, \mathbf{Z}^1, \ldots, \mathbf{Z}^k) = \lim_{L \to \infty} I(X_L; Y_L | X_{1:L-1}, Y_{1:L-1}, Z_{1:L}^1, \ldots, Z_{1:L}^k).$
$I^*(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y}, \mathbf{Z}^1, \ldots, \mathbf{Z}^k)$ is the corresponding permutation analogue and is called the symbolic causal conditional instantaneous information exchange rate;
(iv’)
$I^*(\mathbf{Y} \to \mathbf{X} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k) = I^*(D\mathbf{Y} \to \mathbf{X} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k) + I^*(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y}, \mathbf{Z}^1, \ldots, \mathbf{Z}^k).$
This is the permutation analogue of the following equality:
$I(\mathbf{Y} \to \mathbf{X} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k) = I(D\mathbf{Y} \to \mathbf{X} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k) + I(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y}, \mathbf{Z}^1, \ldots, \mathbf{Z}^k);$
(v’)
$I^*(\mathbf{Y} \to \mathbf{X} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k) + I^*(D\mathbf{X} \to \mathbf{Y} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k) = I^*(\mathbf{X}; \mathbf{Y} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k).$
This is the permutation analogue of the equality:
$I(\mathbf{Y} \to \mathbf{X} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k) + I(D\mathbf{X} \to \mathbf{Y} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k) = I(\mathbf{X}; \mathbf{Y} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k),$
where:
$I(\mathbf{X}; \mathbf{Y} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k) := \lim_{L \to \infty} \frac{1}{L} \left( H(X_{1:L} \| Z_{1:L}^1, \ldots, Z_{1:L}^k) + H(Y_{1:L} \| Z_{1:L}^1, \ldots, Z_{1:L}^k) - H(X_{1:L}, Y_{1:L} \| Z_{1:L}^1, \ldots, Z_{1:L}^k) \right)$
is the causal conditional mutual information rate and $I^*(\mathbf{X}; \mathbf{Y} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k)$ is its permutation analogue, called the symbolic causal conditional mutual information rate. It can be shown that:
$I(\mathbf{X}; \mathbf{Y} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k) = I^*(\mathbf{X}; \mathbf{Y} \| \mathbf{Z}^1, \ldots, \mathbf{Z}^k)$
if $(\mathbf{X}, \mathbf{Y}, \mathbf{Z}^1, \ldots, \mathbf{Z}^k)$ is the output process of an HMM with an ergodic internal process.

5. Discussion

In this section, we discuss how our theoretical results in this paper are related to the previous work in the literature.
When confronted with real-world time series data, we cannot take the limit of large word length. Hence, we have to estimate information rates from words of finite length. In such a situation, one permutation method could have some advantages over the others. As a matter of fact, TERV was originally proposed as an improved analogue of STE [43]. However, it has been unclear whether they coincide in the limit of large permutation length. In this paper, we provide a partial answer to this question: the two permutation analogues of the transfer entropy rate, the rate of STE and the rate of TERV, are equivalent to the transfer entropy rate for bivariate processes generated by HMMs with ergodic internal processes.
The Granger causality graph [46] is a model of the causal dependence structure in multivariate stationary stochastic processes. Given a multivariate stationary stochastic process, the nodes in a Granger causality graph are the components of the process. There are two types of edges: one is directed, and the other is undirected. The absence of a directed edge from one node to another node indicates the lack of the Granger cause from the former to the latter relative to the other remaining processes. Similarly, the absence of an undirected edge between two nodes indicates the lack of the instantaneous cause between them relative to the other remaining processes. Amblard and Michel [40,45] proposed that the Granger causality graph can be constructed based on directed information theory: let $\mathbf{X} = (\mathbf{X}^1, \mathbf{X}^2, \ldots, \mathbf{X}^m)$ be a multivariate finite-alphabet stationary stochastic process with the alphabet $A_{n_1} \times A_{n_2} \times \cdots \times A_{n_m}$, and $(V, E_d, E_u)$ the Granger causality graph of the process $\mathbf{X}$, where $V = \{1, 2, \ldots, m\}$ is the set of nodes, $E_d$ is the set of directed edges and $E_u$ is the set of undirected edges. Their proposal is that:
(i)
for any $i, j \in V$, $(i, j) \notin E_d$, if and only if $I(D\mathbf{X}^i \to \mathbf{X}^j \| \mathbf{X} \setminus \{\mathbf{X}^i, \mathbf{X}^j\}) = 0$;
(ii)
for any $i, j \in V$, $(i, j) \notin E_u$, if and only if $I(\mathbf{X}^i \to \mathbf{X}^j \| D\mathbf{X}^i, \mathbf{X} \setminus \{\mathbf{X}^i, \mathbf{X}^j\}) = 0$.
Thus, in the Granger causality graph construction proposed in [40], the causal conditional transfer entropy rate captures the Granger cause from one process to another process relative to the other remaining processes. On the other hand, the causal conditional instantaneous information exchange rate captures the instantaneous cause between two processes relative to the other remaining processes.
Now, let us consider the case when X is an output process of an HMM with an ergodic internal process. Then, from the results of Section 4.4, we have:
(i’)
for any $i, j \in V$, $(i, j) \notin E_d$, if and only if $I^*(D\mathbf{X}^i \to \mathbf{X}^j \| \mathbf{X} \setminus \{\mathbf{X}^i, \mathbf{X}^j\}) = 0$;
(ii’)
for any $i, j \in V$, $(i, j) \notin E_u$, if and only if $I^*(\mathbf{X}^i \to \mathbf{X}^j \| D\mathbf{X}^i, \mathbf{X} \setminus \{\mathbf{X}^i, \mathbf{X}^j\}) = 0$.
Thus, the Granger causality graphs in the sense of [40,45] for multivariate processes generated by HMMs with ergodic internal processes can be captured in the language of the permutation entropy: the symbolic causal conditional transfer entropy rate and the symbolic causal conditional instantaneous information exchange rate. This statement opens up the possibility of a permutation approach to the problem of assessing the causal dependence structure of multivariate stationary stochastic processes. However, of course, the details of the practical implementation should be an issue for further study.
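A possible implementation skeleton of this construction (entirely our own sketch, not an algorithm given in [40,45]) is shown below; estimate_rate is a hypothetical callback returning finite-sample estimates of the symbolic causal conditional rates, and the fixed threshold merely stands in for a proper statistical test of whether a rate vanishes.

```python
# A minimal sketch (not from the paper): build the edge sets (E_d, E_u) of a Granger
# causality graph from estimated symbolic causal conditional rates.
def granger_causality_graph(num_nodes, estimate_rate, threshold=1e-3):
    """estimate_rate(kind, i, j) -> estimated rate, with kind in {"transfer", "instant"}."""
    directed, undirected = set(), set()
    for i in range(num_nodes):
        for j in range(num_nodes):
            if i == j:
                continue
            # (i, j) is in E_d unless the causal conditional transfer rate vanishes.
            if estimate_rate("transfer", i, j) > threshold:
                directed.add((i, j))
            # (i, j) is in E_u unless the instantaneous exchange rate vanishes.
            if i < j and estimate_rate("instant", i, j) > threshold:
                undirected.add((i, j))
    return directed, undirected
```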
Real-world time series data are often multivariate. However, it seems that univariate analysis is still the mainstream in the field of ordinal pattern analysis (see, for example, the papers in [47]). We hope that this work stimulates multivariate analysis of real-world time series data.

Acknowledgments

The authors would like to thank D. Kugiumtzis for his useful comments and discussion on the relationship between STE and TERV. The authors also appreciate the anonymous referees for their comments, which significantly improved the manuscript. T.H. was supported by the Precursory Research for Embryonic Science and Technology (PRESTO) program of Japan Science and Technology Agency (JST).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Bandt, C.; Pompe, B. Permutation entropy: A natural complexity measure for time series. Phys. Rev. Lett. 2002, 88, e174102.
2. Amigó, J.M. Permutation Complexity in Dynamical Systems; Springer-Verlag: Berlin/Heidelberg, Germany, 2010.
3. Bahraminasab, A.; Ghasemi, F.; Stefanovska, A.; McClintock, P.V.E.; Kantz, H. Direction of coupling from phases of interacting oscillators: A permutation information approach. Phys. Rev. Lett. 2008, 100, e084101.
4. Cao, Y.H.; Tung, W.W.; Gao, J.B.; Protopopescu, V.A.; Hively, L.M. Detecting dynamical changes in time series using the permutation entropy. Phys. Rev. E 2004, 70, e046217.
5. Kugiumtzis, D. Partial transfer entropy on rank vectors. Eur. Phys. J. Special Topics 2013, 222, 401–420.
6. Nakajima, K.; Haruna, T. Symbolic local information transfer. Eur. Phys. J. Special Topics 2013, 222, 421–439.
7. Rosso, O.A.; Larrondo, H.A.; Martin, M.T.; Plastino, A.; Fuentes, M.A. Distinguishing noise from chaos. Phys. Rev. Lett. 2007, 99, e154102.
8. Amigó, J.M.; Keller, K. Permutation entropy: One concept, two approaches. Eur. Phys. J. Special Topics 2013, 222, 263–273.
9. Bandt, C.; Keller, G.; Pompe, B. Entropy of interval maps via permutations. Nonlinearity 2002, 15, 1595–1602.
10. Keller, K.; Sinn, M. A standardized approach to the Kolmogorov-Sinai entropy. Nonlinearity 2009, 22, 2417–2422.
11. Keller, K.; Sinn, M. Kolmogorov-Sinai entropy from the ordinal viewpoint. Phys. D 2010, 239, 997–1000.
12. Keller, K. Permutations and the Kolmogorov-Sinai entropy. Discr. Cont. Dyn. Syst. 2012, 32, 891–900.
13. Keller, K.; Unakafov, A.M.; Unakafova, V.A. On the relation of KS entropy and permutation entropy. Phys. D 2012, 241, 1477–1481.
14. Unakafova, V.A.; Unakafov, A.M.; Keller, K. An approach to comparing Kolmogorov-Sinai and permutation entropy. Eur. Phys. J. Special Topics 2013, 222, 353–361.
15. Amigó, J.M.; Kennel, M.B.; Kocarev, L. The permutation entropy rate equals the metric entropy rate for ergodic information sources and ergodic dynamical systems. Phys. D 2005, 210, 77–95.
16. Amigó, J.M. The equality of Kolmogorov-Sinai entropy and metric permutation entropy generalized. Phys. D 2012, 241, 789–793.
17. Haruna, T.; Nakajima, K. Permutation complexity via duality between values and orderings. Phys. D 2011, 240, 1370–1377.
18. Haruna, T.; Nakajima, K. Permutation excess entropy and mutual information between the past and future. Int. J. Comput. Ant. Sys. 2012, in press.
19. Haruna, T.; Nakajima, K. Symbolic transfer entropy rate is equal to transfer entropy rate for bivariate finite-alphabet stationary ergodic Markov processes. Eur. Phys. J. B 2013, 86, e230.
20. Haruna, T.; Nakajima, K. Permutation approach to finite-alphabet stationary stochastic processes based on the duality between values and orderings. Eur. Phys. J. Special Topics 2013, 222, 383–399.
21. Crutchfield, J.P.; Feldman, D.P. Regularities unseen, randomness observed: Levels of entropy convergence. Chaos 2003, 15, 25–54.
22. Schreiber, T. Measuring information transfer. Phys. Rev. Lett. 2000, 85, 461–464.
23. Kaiser, A.; Schreiber, T. Information transfer in continuous processes. Phys. D 2002, 166, 43–62.
24. Pompe, B.; Runge, J. Momentary information transfer as a coupling measure of time series. Phys. Rev. E 2011, 83, e051122.
25. Marko, H. The bidirectional communication theory—A generalization of information theory. IEEE Trans. Commun. 1973, 21, 1345–1351.
26. Massey, J.L. Causality, Feedback and Directed Information. In Proceedings of International Symposium on Information Theory and Its Applications, Waikiki, HI, USA, 27–30 November 1990.
27. Anderson, B.D.O. The realization problem for hidden Markov models. Math. Control Signals Syst. 1999, 12, 80–120.
28. Walters, P. An Introduction to Ergodic Theory; Springer-Verlag: New York, NY, USA, 1982.
29. Seneta, E. Non-Negative Matrices and Markov Chains; Springer: New York, NY, USA, 1981.
30. Horn, R.A.; Johnson, C.R. Matrix Analysis; Cambridge University Press: Cambridge, UK, 1985.
31. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2006.
32. Arnold, D.V. Information-theoretic analysis of phase transitions. Complex Syst. 1996, 10, 143–155.
33. Bialek, W.; Nemenman, I.; Tishby, N. Predictability, complexity, and learning. Neural Comput. 2001, 13, 2409–2463.
34. Feldman, D.P.; McTague, C.S.; Crutchfield, J.P. The organization of intrinsic computation: Complexity-entropy diagrams and the diversity of natural information processing. Chaos 2008, 18, e043106.
35. Grassberger, P. Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys. 1986, 25, 907–938.
36. Li, W. On the relationship between complexity and entropy for Markov chains and regular languages. Complex Syst. 1991, 5, 381–399.
37. Shaw, R. The Dripping Faucet as a Model Chaotic System; Aerial Press: Santa Cruz, CA, USA, 1984.
38. Löhr, W. Models of Discrete Time Stochastic Processes and Associated Complexity Measures. Ph.D. Thesis, Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany, 2010.
39. Frenzel, S.; Pompe, B. Partial mutual information for coupling analysis of multivariate time series. Phys. Rev. Lett. 2007, 99, e204101.
40. Amblard, P.O.; Michel, O.J.J. On directed information theory and Granger causality graphs. J. Comput. Neurosci. 2011, 30, 7–16.
41. Ash, R. Information Theory; Wiley Interscience: New York, NY, USA, 1965.
42. Staniek, M.; Lehnertz, K. Symbolic transfer entropy. Phys. Rev. Lett. 2008, 100, e158101.
43. Kugiumtzis, D. Transfer entropy on rank vectors. J. Nonlin. Sys. Appl. 2012, 3, 73–81.
44. Kramer, G. Directed Information for Channels with Feedback. Ph.D. Thesis, Swiss Federal Institute of Technology, Zurich, Switzerland, 1998.
45. Amblard, P.O.; Michel, O.J.J. Relating Granger causality to directed information theory for networks of stochastic processes. 2011; arXiv:0911.2873v4.
46. Dahlaus, R.; Eichler, M. Causality and graphical models in time series analysis. In Highly Structured Stochastic Systems; Green, P., Hjort, N., Richardson, S., Eds.; Oxford University Press: New York, NY, USA, 2003; pp. 115–137.
47. European Physical Journal Special Topics on Recent Progress in Symbolic Dynamics and Permutation Complexity. Eur. Phys. J. 2013, 222, 241–598.
