Article

Theoretical Aspects on Measures of Directed Information with Simulations

by Thomas Gkelsinis and Alex Karagrigoriou *,†
Department of Statistics and Actuarial-Financial Mathematics, Lab of Statistics and Data Analysis, University of the Aegean, 83200 Karlovasi, Greece
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Mathematics 2020, 8(4), 587; https://doi.org/10.3390/math8040587
Submission received: 21 March 2020 / Revised: 8 April 2020 / Accepted: 10 April 2020 / Published: 15 April 2020
(This article belongs to the Special Issue Probability, Statistics and Their Applications)

Abstract:
Measures of directed information are obtained from classical measures of information by taking into account specific qualitative characteristics of each event. These measures are classified into two main categories, the entropic and the divergence measures. In statistics we often wish to emphasize not only the quantitative characteristics but also the qualitative ones. For example, in financial risk analysis it is common to take into consideration the existence of fat tails in the distribution of returns of an asset (especially in the left tail), and in biostatistics to use robust statistical methods to trim extreme values. Motivated by these needs, in this work we present, study and provide simulations for measures of directed information. These measures quantify the information with emphasis on specific parts (or events) of their probability distribution, without losing the information of the less significant parts, while at the same time concentrating on the information of the parts we care about the most.

1. Introduction

In statistics and other fields, one of the most challenging aspects is to investigate the probabilistic behaviour of a random process with respect to specific events. From finance [1] to signal processing [2], researchers try to distinguish random processes from each other and study their behaviour. A very important tool in the ‘quiver’ of the researcher for that purpose is the concept of information measures. These measures are divided into two main categories, the entropic and the divergence ones. Entropic measures quantify the diversity (or the information) within a population, while divergence measures quantify the dissimilarities between two different populations. Both types of measures are based only on the probabilistic behaviour of each population. We often state that divergences measure the discrepancy between two probability distributions, or the information needed in order to distinguish one from the other. There is a plethora of estimators and hypothesis tests associated with such measures [3,4], as well as many tests of fit that are based on measures of divergence and take into account dissimilarities between the distributions involved [5], or that rely on the maximum entropy principle [6]. Model selection criteria [7,8,9] are also based on this type of information measure.
The challenge here is to construct measures which take into account not only the probabilistic aspects of random processes but also their qualitative characteristics. These characteristics are sometimes subjective, and one could say that they are related to the significance, the relevance or the utility of the information contained, with respect to a specific goal [10]. Such measures exist in the literature [11], but they do not address some important issues that we collect and present here. Also, Barbu et al. [12] provide the weighted form of the generalizations of Alpha and Beta divergence measures for Markov chains. We call this type of measure Directed Information Measures. These measures do not assume that all possible states of a random process have the same significance with respect to a goal (as the classical ones do); instead, they apply specific weights to different states or parts of these processes. By applying these directed measures we can detect small dissimilarities in the probabilistic behaviour of two random processes which would otherwise be difficult to notice.
A related approach is the concept of local divergence measures [13], which studies the dissimilarities between two random processes on a specific subset of their support. The main difference between that method and the one we study is that here the information of the less significant parts is not lost while, at the same time, attention is focused on the more significant parts.
The present article is structured as follows. Section 2 is devoted to the notion of directed entropy and its properties as proposed by Guiaşu [10]. We also present some theoretical problems of extending the classical Shannon entropy to the continuous case. In Section 3 we present the Corrected Weighted Kullback–Leibler (CWKL) divergence, which is a measure of directed divergence, both in the discrete and the continuous case. We also present the asymptotic distribution of the estimated CWKL divergence for tests of fit. In Section 4 we provide simulations of measures of directed entropy and of the CWKL divergence, based on discrete and continuous distributions. The aim of Section 4 is to scrutinize the behaviour of the above directed measures. Finally, Section 5 offers some concluding remarks.

2. Measures of Directed Entropy

How should we measure the information or the uncertainty of an experiment with respect to certain characteristics of its events? The foundations of the answer lie in the work of Belis and Guiaşu [14], while the answer itself was given by Guiaşu [10], who proposed the weighted entropy. He explicitly defined the axioms, the properties and the maximum value of the weighted entropic formula. After this pioneering work, Guiaşu [15] used the weighted entropy to group data with respect to the importance of specific regions of the domain. Later, Nawrocki and Harding [1] proposed the use of weighted entropy as a measure of investment risk, Di Crescenzo and Longobardi [16] proposed the weighted residual and past entropies, and Suhov and Zohren [17] proposed the quantum version of weighted entropy and studied its properties in quantum statistical mechanics.

2.1. Discrete Case

As mentioned in the introduction of this section, the first to propose the weighted form of Shannon entropy and its properties was Guiaşu [10]. He proposed this form of entropy only in the discrete case, because the continuous analogue, as proposed by Shannon [18], raises some concerns that are discussed later. The relevant definition is presented below.
Definition 1.
Let a stochastic source be described by a discrete random variable X with n possible states, with distribution $P_X$ and probability mass function $\tilde{p} = (p_1, \ldots, p_n)^T$, and let $\tilde{w} = (w_1, \ldots, w_n)^T$ be a vector of weights associated with these states, where $w_i \ge 0$, $i = 1, \ldots, n$. The weighted Shannon entropy measure is defined by:
$$H_w(X) = H_w(\tilde{w}, \tilde{p}) = \sum_{i=1}^{n} w_i\, p_i \log \frac{1}{p_i}.$$
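For concreteness, here is a minimal Python sketch of the weighted entropy of Definition 1 (an illustration of ours, not the authors' code); the natural logarithm is used, matching the numerical results of Section 4, and the convention that zero-probability terms contribute nothing is ours.

```python
import numpy as np

def weighted_entropy(w, p):
    """Weighted Shannon entropy H_w(X) = sum_i w_i * p_i * log(1 / p_i)."""
    w = np.asarray(w, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = p > 0                      # terms with p_i = 0 contribute 0 by convention
    return float(np.sum(w[mask] * p[mask] * np.log(1.0 / p[mask])))

# With equal weights the measure reduces to w * H(X) (see Proposition 1, item 2):
print(weighted_entropy([1.0, 1.0], [0.5, 0.5]))   # log 2, roughly 0.693
```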
We proceed below with the properties of the above weighted entropy as proposed by Guiaşu [10].
Proposition 1.
1. $H_w(X) \ge 0$.
2. If $w_1 = w_2 = \cdots = w_n = w$, then $H_w(X) = w\,H(X)$, where $H(X)$ is the Shannon entropy.
3. If $p_i = 1$ for some $i = 1, \ldots, n$, then $H_w(X) = 0$ irrespective of the values of the weights $\tilde{w}$.
4. If $p_i = 0$, $w_i \neq 0$ for all $i \in I$ and $p_j \neq 0$, $w_j = 0$ for all $j \in J$, where $I \cup J = \{1, 2, \ldots, n\}$ and $I \cap J = \emptyset$, then $H_w(X) = 0$.
5. $H_w(w_1, \ldots, w_n, w_{n+1}; p_1, \ldots, p_n, 0) = H_w(w_1, \ldots, w_n; p_1, \ldots, p_n) = H_w(X)$, for any $w_{n+1}$.
6. For every non-negative real number $\lambda$ we have $H_w(\lambda \tilde{w}; \tilde{p}) = \lambda H_w(\tilde{w}, \tilde{p}) = \lambda H_w(X)$. The weight of the union of two incompatible events E and F is given by:
$$w(E \cup F) = \frac{p(E)\,w(E) + p(F)\,w(F)}{p(E) + p(F)} \qquad (2)$$
If the events are complementary, (2) reduces to $w(E \cup F) = p(E)\,w(E) + (1 - p(E))\,w(F)$.
7. If Equation (2) holds, then for $w_n = \dfrac{p'\,w' + p''\,w''}{p' + p''}$ and $p_n = p' + p''$:
$$H_w(w_1, \ldots, w_{n-1}, w', w''; p_1, \ldots, p_{n-1}, p', p'') = H_w(w_1, \ldots, w_n; p_1, \ldots, p_n) + p_n\, H_w\!\left(w', w''; \frac{p'}{p_n}, \frac{p''}{p_n}\right).$$
Guiaşu [10] also proposed the axioms of weighted entropy and provided the conditions under which it is maximized. We give a short presentation of them here and assume that the following axioms hold from now on.
Axiom 1.
$H_w(w_1, w_2; p, 1 - p)$ is a continuous function of p on the interval $[0, 1]$.
Axiom 2.
$H_w(\tilde{w}, \tilde{p})$ is a symmetric function with respect to all pairs of variables $(w_i, p_i)$, $i = 1, \ldots, n$. This means that it is invariant under any permutation that keeps the pairs $(w_i, p_i)$, $i = 1, \ldots, n$, unchanged.
Axiom 3.
If all probabilities are equal ($p_i = \frac{1}{n}$, $i = 1, \ldots, n$), then:
$$H_w\!\left(w_1, \ldots, w_n; \frac{1}{n}, \ldots, \frac{1}{n}\right) = \log n\; \frac{w_1 + \cdots + w_n}{n},$$
where $\log n > 0$ for $n > 1$.
Remark 1.
For the justification of Axioms 1–3, note that small changes in the probability of an event must affect the information measure analogously (Axiom 1), the information of an experiment must depend solely on the probabilistic behaviour of the associated events and not on their ordering (Axiom 2), while for equal event probabilities the weighted entropy should be equal to $\log n$ times the average of the weights (Axiom 3).
The following theorem states the condition between p and w that maximizes the weighted entropy.
Theorem 1.
(Guiaşu [10]) Consider the random variable X associated with the discrete probability distribution $p_i \ge 0$, $i = 1, \ldots, n$, where $\sum_{i=1}^{n} p_i = 1$, and the weights $w_i > 0$, $i = 1, \ldots, n$, associated with the significance of event i. The weighted entropy:
$$H_w(X) = H_w(\tilde{w}, \tilde{p}) = \sum_{i=1}^{n} w_i\, p_i \log \frac{1}{p_i}$$
is maximum if and only if:
$$p_i = e^{-\alpha/w_i - 1}, \qquad i = 1, \ldots, n,$$
where α is the solution of the equation:
$$\sum_{i=1}^{n} e^{-\alpha/w_i - 1} = 1,$$
and the maximum value of $H_w(X)$ is:
$$\alpha + \sum_{i=1}^{n} w_i\, e^{-\alpha/w_i - 1}.$$
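The maximizing condition of Theorem 1 can be checked numerically. The sketch below is our own illustration (assuming strictly positive weights): it solves the equation for α by root bracketing and returns the corresponding maximizing distribution.

```python
import numpy as np
from scipy.optimize import brentq

def maximizing_distribution(w):
    """Solve sum_i exp(-alpha / w_i - 1) = 1 for alpha and return the
    entropy-maximizing probabilities p_i = exp(-alpha / w_i - 1) of Theorem 1."""
    w = np.asarray(w, dtype=float)               # assumes w_i > 0
    g = lambda a: np.sum(np.exp(-a / w - 1.0)) - 1.0
    # g is strictly decreasing; g(-max(w)) > 0 and g(a) -> -1 as a grows, so a root is bracketed
    alpha = brentq(g, -np.max(w), 50.0 * np.max(w))
    return alpha, np.exp(-alpha / w - 1.0)

alpha, p = maximizing_distribution([1.0, 1.0, 4.0])
print(alpha, p, p.sum())                          # the probabilities sum to one
```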

2.2. Continuous Case

Before we propose the continuous version of the weighted entropy, we have to present the continuous version of entropy itself. Shannon proposed entropy as a function which measures the uncertainty or the information of a random source. However, his entropic formula for a discrete random source was designed to describe the uncertainty (or the information) that a discrete signal contains. A signal is not necessarily discrete; it may also be continuous [19]. So there is a need to introduce a formula which measures the uncertainty in the continuous case. In the literature the continuous entropy is described by the following definition [20].
Definition 2.
Let X be a stochastic source described by a continuous probability distribution P with support $S_X$, let $\mu_p$ be the induced probability measure, assumed absolutely continuous with respect to the Lebesgue measure λ, and let p be the corresponding density (Radon–Nikodym derivative). Then the continuous version of the Shannon entropy measure is defined by:
$$h(X) = -\int_{S_X} p(x) \log_c\big(p(x)\big)\, d\lambda(x).$$
This function, also called differential entropy, satisfies some of the properties of a suitable measure of uncertainty, but fails to fulfil two of them: positivity and invariance under changes of variables. The following example makes the problem of positivity clear.
Example 1.
Suppose the stochastic sources $X \sim U(a, b)$ and $Y \sim Exp(\kappa)$; then the entropies of X and Y are given respectively by:
$$h(X) = \log_c(b - a) \quad \text{and} \quad h(Y) = 1 - \log_c(\kappa).$$
It is easy to see that the entropies $h(X)$ and $h(Y)$ will be negative for appropriate values of the corresponding parameters (for instance, when $b - a < 1$).
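A quick numerical check of Example 1 (our own sketch, with illustrative parameter choices, using scipy for the integration) shows how the differential entropy turns negative for a narrow uniform distribution or a large exponential rate.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import uniform, expon

def diff_entropy(pdf, lo, hi):
    """Numerical differential entropy h(X) = -int p(x) log p(x) dx (natural log)."""
    integrand = lambda x: -pdf(x) * np.log(pdf(x)) if pdf(x) > 0 else 0.0
    return quad(integrand, lo, hi)[0]

a, b, kappa = 0.0, 0.5, 4.0                                            # illustrative values
print(diff_entropy(uniform(a, b - a).pdf, a, b), np.log(b - a))        # both about -0.693
print(diff_entropy(expon(scale=1 / kappa).pdf, 0, 50), 1 - np.log(kappa))  # both about -0.386
```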
As a result, constructing the continuous analogue along the same lines as Guiaşu would be meaningless.

3. Measures of Directed Divergence

In contrast with the weighted entropy, the weighted form of divergence measures has not been extensively studied by researchers. In the previous section we stressed that the weighted entropy is a concept for measuring the probabilistic behaviour of a statistical population with greater significance attached to some events. Correspondingly, weighted divergences measure the probabilistic dissimilarities between two statistical populations while taking into account the qualitative characteristics of each region of the support.
This section is divided into two subsections in which we present and study the problems arising from the concept of measures of directed divergence, both in the discrete and in the continuous case.

3.1. Discrete Case

One of the most popular and extensively used divergence measures in the literature is the Kullback–Leibler divergence [21]. From the statistical point of view, as a divergence, it quantifies the dissimilarities between two distributions with the same support. On the other hand, in information theory the Kullback–Leibler divergence $D_{KL}(P, Q)$ quantifies the amount of information gained by learning that a variable previously thought to be distributed as Q is actually distributed as P.
However, as in the case of Shannon entropy, the Kullback–Leibler divergence does not take into account the qualitative characteristics of random events. The idea here is to determine the dissimilarities between two distributions with a specific weight on each part of the support. In the discrete case this is achieved by putting a weight on each event of the sample space. The concept of weighting is analogous to that of the weighted Shannon entropy: the weights are related to the significance of each event with respect to a specific goal. If one were to define the weighted Kullback–Leibler divergence, in the discrete case, by direct analogy with the weighted entropy, one would arrive at the following.
Consider two probability mass functions $\tilde{p} = (p_1, \ldots, p_n)^T$, $\tilde{q} = (q_1, \ldots, q_n)^T$ and let $\tilde{w} = (w_1, \ldots, w_n)^T$ be a vector of weights (here T denotes transposition). Then the discrete version of the weighted Kullback–Leibler divergence would be:
$$D_{KL}^w(\tilde{p}, \tilde{q}) = \sum_{i=1}^{n} w_i\, p_i \log \frac{p_i}{q_i}.$$
The above expression is not a proper divergence, due to the fact that the summand of the Kullback–Leibler divergence is not non-negative everywhere in the support but only on average. The following theorem confirms the average non-negativity of the Kullback–Leibler divergence measure.
Theorem 2.
The Kullback–Leibler divergence $D_{KL}(P, Q)$ between two distributions P, Q is non-negative on average.
Proof. 
First, $f(x) = \log(x)$ is a concave function for $x = \frac{q(x)}{p(x)} \in \mathbb{R}_+$, where $p(x)$ and $q(x)$ are the Radon–Nikodym derivatives of the induced probability measures P and Q with respect to some base measure μ (commonly the Lebesgue measure for continuous random variables). Then:
$$D_{KL}(P, Q) = \int p(x) \log \frac{p(x)}{q(x)}\, d\mu = -\int p(x) \log \frac{q(x)}{p(x)}\, d\mu = -E_p\!\left[\log \frac{q(X)}{p(X)}\right].$$
From Jensen's inequality for a concave function we have that:
$$D_{KL}(P, Q) = -E_p\!\left[\log \frac{q(X)}{p(X)}\right] \ge -\log E_p\!\left[\frac{q(X)}{p(X)}\right] = -\log \int \frac{p(x)\, q(x)}{p(x)}\, d\mu = -\log \int q(x)\, d\mu = -\log 1 = 0.$$
Thus we have shown that $D_{KL}(p, q) = E_p\!\left[\log \frac{p(X)}{q(X)}\right] \ge 0$, with equality if and only if $p(x) = q(x)$ for all $x \in S_X$. □
Thus, even though the weighted expression is non-negative in the case of equal weights (where it reduces to a multiple of the classical divergence), it is not non-negative in general. The following example makes clear that the weighted Kullback–Leibler divergence can be negative.
Example 2.
Let P be a binomial distribution with $n = 2$, $p = 0.4$, so that $P(X = 0) = 0.36$, $P(X = 1) = 0.48$, $P(X = 2) = 0.16$, and let Q be a discrete uniform distribution over the three possible outcomes $X = 0, 1, 2$, each with probability $1/3$. The Kullback–Leibler divergence between P and Q is $D_{KL}(P, Q) = 0.08529$. Now, if we give the event $X = 2$ an enormously greater significance than the others, for example by putting the weights $\tilde{w} = (1, 1, 4)^T$, then the weighted Kullback–Leibler divergence, according to the previous definition, becomes $D_{KL}^w(P, Q) = -0.11342$. This is due to the fact that the logarithm is negative on the interval $(0, 1)$.
Due to this fact, Kapur [22] stressed that a weighted divergence measure will be an appropriate measure of directed divergence if the following conditions are satisfied:
• It is a continuous function of $\tilde{p}$, $\tilde{q}$ and $\tilde{w}$.
• It is a permutationally symmetric function of $\tilde{p}$, $\tilde{q}$ and $\tilde{w}$, i.e., it does not change when the triplets $(p_1, q_1, w_1)$, $(p_2, q_2, w_2)$, ⋯, $(p_n, q_n, w_n)$ are permuted among themselves.
• It is always greater than or equal to zero for all possible choices of weights $\tilde{w}$ and vanishes when $p_i = q_i$ for each $i = 1, \ldots, n$.
• It is a convex function of $p_1, p_2, \ldots, p_n$ which attains its minimum value, zero, when $p_i = q_i$ for each $i = 1, \ldots, n$.
• It reduces to an ordinary measure of directed divergence upon ignoring the weights ($w_i = c$, $c > 0$, for all $i = 1, \ldots, n$).
The most important of these conditions is Condition 3, which is violated by most of the usual measures when weights are introduced. The solution to this problem is quite simple: we just need to transform the usual measure into a positive equivalent. For this purpose, Kapur [22] introduced the following function and proved that the above conditions hold for the divergence based on it:
$$\phi(x) = p(x) \log \frac{p(x)}{q(x)} - p(x) + q(x).$$
This function is non-negative ($\phi(x) \ge 0$ for all $x \in S_X$) everywhere in the support, so the divergence based on this φ-function is also non-negative on every subset of the support. This divergence belongs to the φ-divergence family [23], so we can use all the theoretical results available for this family, such as the asymptotic distribution of the φ-divergence estimator [24] or the minimum φ-divergence estimator [25].
As a result, we have the following definition of the Corrected Weighted Kullback–Leibler (CWKL) divergence in the discrete case.
Definition 3.
Consider two probability mass functions $\tilde{p} = (p_1, \ldots, p_n)^T$, $\tilde{q} = (q_1, \ldots, q_n)^T$ and let $\tilde{w} = (w_1, \ldots, w_n)^T$ be a vector of weights. Then the discrete CWKL divergence is defined by:
$$D_{CKL}^w(\tilde{p}, \tilde{q}) = \frac{\sum_{i=1}^{n} w_i \left( p_i \log \frac{p_i}{q_i} - p_i + q_i \right)}{\sum_{j} w_j\, p_j},$$
where every term in the numerator is non-negative, so that $D_{CKL}^w(\tilde{p}, \tilde{q}) \ge 0$ for any choice of weights.
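The contrast between the naive weighted divergence and the corrected one can be seen directly on the data of Example 2. The following Python sketch (an illustration of ours; the helper names are not from the paper) implements both formulas: the naive form turns negative for the weights of Example 2, while the CWKL divergence stays non-negative.

```python
import numpy as np

def kl_naive_weighted(w, p, q):
    """Naive weighted Kullback-Leibler: sum_i w_i * p_i * log(p_i / q_i)."""
    w, p, q = (np.asarray(a, dtype=float) for a in (w, p, q))
    return float(np.sum(w * p * np.log(p / q)))

def cwkl(w, p, q):
    """Corrected Weighted Kullback-Leibler divergence of Definition 3."""
    w, p, q = (np.asarray(a, dtype=float) for a in (w, p, q))
    terms = p * np.log(p / q) - p + q            # each term is non-negative
    return float(np.sum(w * terms) / np.sum(w * p))

p = [0.36, 0.48, 0.16]       # Binomial(2, 0.4)
q = [1/3, 1/3, 1/3]          # discrete uniform
w = [1.0, 1.0, 4.0]          # extra significance on the event X = 2
print(kl_naive_weighted(w, p, q))   # negative: not a proper divergence
print(cwkl(w, p, q))                # non-negative by construction
```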

3.2. Continuous Case

We proceed now to the continuous form of the weighted divergence measures, which follows a slightly different line of construction. In the discrete case we multiply each point of the support by the desired weight. In the continuous case we have to take into account that the support is uncountable, so we partition it and apply the appropriate weight on each interval. More generally, we can restrict the appropriate measure (usually the Lebesgue measure) to each subset of the support we want to emphasize.
Definition 4.
Consider two probability measures $\mu_f$ and $\mu_g$ that are absolutely continuous with respect to the Lebesgue measure λ, with Radon–Nikodym derivatives f and g, respectively, and let $\{A_i\} \subset \mathcal{A}$ be a partition of the support $S_X$, i.e., $\bigcup_{i=1}^{n} A_i = S_X$ and $A_i \cap A_j = \emptyset$ for all $i \neq j$. Then, if $w_i$, $i = 1, \ldots, n$, are the weights, the continuous CWKL divergence measure is:
$$D_{CKL}^w(f, g) = \frac{\sum_{i=1}^{n} w_i \int_{S_X} \left( f(x) \log \frac{f(x)}{g(x)} - f(x) + g(x) \right) d\lambda|_{A_i}(x)}{\sum_{j} w_j \int_{S_X} f(x)\, d\lambda|_{A_j}(x)},$$
where $\lambda|_{A_i}$ is the Lebesgue measure restricted to the subset $A_i$; as in the discrete case, $D_{CKL}^w(f, g) \ge 0$.
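A minimal numerical sketch of Definition 4 (our own illustration, using scipy) evaluates the restricted integrals on a partition of the real line; with the standard normal and the Student's t(1) densities it reproduces, up to numerical integration error, the behaviour reported in Table 4.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, t

def cwkl_continuous(f, g, breakpoints, weights):
    """Continuous CWKL (Definition 4): weighted restricted integrals of
    phi(x) = f*log(f/g) - f + g over a partition of the real line."""
    edges = [-np.inf] + list(breakpoints) + [np.inf]

    def phi(x):
        fx, gx = f(x), g(x)
        if fx <= 0.0:              # limiting value of phi as f -> 0
            return gx
        return fx * np.log(fx / gx) - fx + gx

    pairs = list(zip(weights, edges[:-1], edges[1:]))
    num = sum(w * quad(phi, a, b)[0] for w, a, b in pairs)
    den = sum(w * quad(f, a, b)[0] for w, a, b in pairs)
    return num / den

# Standard Normal vs Student's t(1), partition (-inf, -3), (-3, 3), (3, inf)
f, g = norm(0, 1).pdf, t(df=1).pdf
print(cwkl_continuous(f, g, [-3, 3], [1, 1, 1]))        # equal weights (corrected KL)
print(cwkl_continuous(f, g, [-3, 3], [1.3, 0.4, 1.3]))  # extra weight on the tails
```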

3.3. Asymptotic Distribution of CWKL Divergence Estimator

In this subsection we give the asymptotic distribution of the CWKL divergence estimator for goodness-of-fit testing between multinomial distributions. For this purpose we rely on the theoretical results of Frank et al. [11], who provide the asymptotic distribution of the estimator of the weighted Kullback–Leibler divergence for this type of test.
Consider a random variable X described by a probability distribution $F_X(x)$. If we wish to test the hypothesis $H_0: F(x) = F_0(x)$, where $F_0(x)$ is a hypothesized distribution, then we partition the range of the distribution into K classes, say $C_1, \ldots, C_K$. The probability of falling into class i is $P(X \in C_i) = p_i$, $i = 1, \ldots, K$, and $w_i$, $i = 1, \ldots, K$, is the weight, or importance, of each class. Now suppose that we have a random sample $X_1, \ldots, X_n$ from the distribution $F_X(x)$ and that $N = (N_1, \ldots, N_K)$ is the vector of observed counts in the classes $C_1, \ldots, C_K$. It is straightforward that the vector N has a multinomial distribution with parameters $(n, p_1, \ldots, p_K)$, $n = \sum_i N_i$. Also, the estimator of $P = (p_1, \ldots, p_K)$ is $\hat{P} = (\hat{p}_1, \ldots, \hat{p}_K)$, where $\hat{p}_i = N_i / n$, $i = 1, \ldots, K$. Since the null hypothesis is equivalent to $H_0: P = P_0$, where $P_0$ is the hypothesized distribution, $D_{CKL}^w(\hat{P}, P_0)$ should be small under the null; if $D_{CKL}^w(\hat{P}, P_0)$ is sufficiently large, the null hypothesis should be rejected. The theorem below provides the asymptotic distribution of $D_{CKL}^w(\hat{P}, P_0)$, which is a natural extension of the result of Frank et al. [11].
Theorem 3.
Consider the weighted directed divergence $D_{CKL}^w(P, P_0)$ and its estimator $D_{CKL}^w(\hat{P}, P_0)$. Under the null hypothesis $H_0: P = P_0$ we have:
$$2n\, D_{CKL}^w(\hat{P}, P_0) \xrightarrow[n \to \infty]{\mathcal{L}} \sum_{i=1}^{r} \beta_i Z_i^2,$$
where $Z_i$, $i = 1, \ldots, r$, are i.i.d. standard normal variables, $\beta_i$, $i = 1, \ldots, r$, are the non-zero eigenvalues of the matrix $C \Sigma_{P_0}$ with
$$c_{ij} = \begin{cases} 0, & i \neq j, \\ \dfrac{w_i}{p_i^0 \sum_{l} w_l\, p_l^0}, & i = j, \end{cases} \qquad \Sigma_{P_0} = \mathrm{diag}(P_0) - P_0 (P_0)^T,$$
and $r = \mathrm{rank}(\Sigma_{P_0} C\, \Sigma_{P_0})$.
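For a goodness-of-fit test based on Theorem 3, the weights $\beta_i$ and the critical value of the limiting law can be obtained numerically. The sketch below is our own illustration of that computation (the specific $P_0$ and class weights are only examples).

```python
import numpy as np

def limiting_eigenvalues(p0, w):
    """Non-zero eigenvalues beta_i of C * Sigma_{P0} appearing in Theorem 3."""
    p0, w = np.asarray(p0, dtype=float), np.asarray(w, dtype=float)
    C = np.diag(w / (p0 * np.sum(w * p0)))
    Sigma = np.diag(p0) - np.outer(p0, p0)
    beta = np.linalg.eigvals(C @ Sigma).real
    return beta[beta > 1e-10]

def critical_value(beta, alpha=0.05, n_mc=200_000, seed=0):
    """Monte Carlo (1 - alpha)-quantile of sum_i beta_i * Z_i^2."""
    z = np.random.default_rng(seed).standard_normal((n_mc, len(beta)))
    return float(np.quantile((z ** 2) @ beta, 1 - alpha))

p0 = [0.36, 0.48, 0.16]                 # hypothesized multinomial probabilities
w = [1.0, 1.0, 4.0]                     # class weights
beta = limiting_eigenvalues(p0, w)
print(beta, critical_value(beta))       # reject H0 when 2*n*D_CKL^w exceeds this value
```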

4. Simulations

In this Section we implement all weighted cases from the previous sections. In Section 4.1 we present simulations of the weighted Shannon entropy for a discrete (Bernoulli) case, while in Section 4.2 we present simulations of the CWKL divergence for both discrete and continuous cases. The aim of this Section is to study the behaviour of these Directed Information Measures, something that is missing from the literature, and in particular to identify whether the CWKL divergence can detect small dissimilarities between distributions which the classical measures cannot.

4.1. Weighted Shannon Entropy

Example 3.
Consider a coin toss and let X be the random variable indicating the occurrence of heads; X is described by a Bernoulli distribution with success probability p. In Table 1 we present the weighted entropy of this variable for various values of p and $\tilde{w}$.
The above results show that with equal weights the (weighted) Shannon entropy is symmetric in p around its maximum value, whereas in all cases with unequal weights (associated with unequal significance) the weighted Shannon entropy ceases to be symmetric. All of this is depicted in Figure 1.
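The pattern of Table 1 is easy to reproduce. The short sketch below (our own code, natural logarithm, with the first weight attached to the event of probability p) recovers the tabulated values up to rounding.

```python
import numpy as np

def h_w_bernoulli(w1, w2, p):
    """Weighted entropy of a Bernoulli(p) variable; w1 weights the event of
    probability p and w2 its complement."""
    return w1 * p * np.log(1 / p) + w2 * (1 - p) * np.log(1 / (1 - p))

for w1, w2 in [(1, 1), (1.5, 0.5), (0.5, 1.5), (1.8, 0.2), (0.2, 1.8)]:
    print((w1, w2), [round(h_w_bernoulli(w1, w2, p), 3) for p in (0.1, 0.5, 0.9)])
# equal weights give a curve symmetric in p; unequal weights break the symmetry
```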

4.2. CWKL Divergence

In this subsection we present examples of the CWKL divergence, considering cases based on several distributions with various weights; our aim is to examine whether small dissimilarities between two distributions are easily distinguished by the directed measures.
Example 4.
Consider the assets A, B, C, D, E and F. The returns associated with five states and the probability of each state occurring are given in Table 2.
In Table 3 we present the Corrected Weighted Kullback–Leibler (CWKL) divergence between the assets A, B, C, D, E and F for various vectors of weights.
As we can easily see, the probability distributions of assets such as A and B are quite similar, in contrast with C and D, which differ considerably in the ‘extreme’ states (1 and 5), or E and F, which show their greatest dissimilarity in state 3. From Table 3 we can see that if we focus on the states with greater dissimilarities then the CWKL divergence increases. All the above are visualized in Figure 2.
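For reference, the entries of Table 3 can be recomputed with a few lines of Python (our own sketch; the cwkl helper simply repeats Definition 3 so the snippet runs on its own), reproducing the reported values up to rounding.

```python
import numpy as np

def cwkl(w, p, q):
    """Corrected Weighted Kullback-Leibler divergence of Definition 3."""
    w, p, q = (np.asarray(a, dtype=float) for a in (w, p, q))
    return float(np.sum(w * (p * np.log(p / q) - p + q)) / np.sum(w * p))

pA = [0.10, 0.21, 0.38, 0.21, 0.10]     # asset A
qB = [0.14, 0.22, 0.35, 0.22, 0.07]     # asset B
for w in ([1, 1, 1, 1, 1],              # equal weights
          [2, 0.333, 0.333, 0.333, 2],  # emphasis on the extreme states 1 and 5
          [0.75, 1, 1.5, 1, 0.75]):     # emphasis on the middle state 3
    print(w, round(cwkl(w, pA, qB), 3))
```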
Example 5.
Let P be a Standard Normal distribution and Q a Student's t-distribution with one degree of freedom (df = 1). The CWKL divergence between P and Q for various weights $\tilde{w}$ applied to different parts of the support is given in Table 4.
Now, let P be a Standard Normal distribution and Q a Student's t-distribution with thirty degrees of freedom (df = 30). The CWKL divergence between P and Q for various weights $\tilde{w}$ applied to different parts of the support is given in Table 5.
In Figure 3 we visualize the variation of the CWKL divergence between a Standard Normal distribution and Student's t-distributions with various degrees of freedom (df). A variety of weights is applied on a fixed partition of the support: $S_1 = (-\infty, -3)$, $S_2 = (-3, 3)$, $S_3 = (3, \infty)$.
In Example 6 we revisit a real-life case proposed by Johnson and Wichern [26] and used by Avlogiaris et al. [13] to study the performance of local divergence measures. The example deals with the grade point average (GPA) scores of 85 students who applied to a business school. Students are categorized into three groups (populations) according to their GPA performance: $\pi_1$: admit, $\pi_2$: do not admit, and $\pi_3$: borderline. Sample means and variances of each group, together with the group sample sizes, are given in Table 6. The aim of Example 6 is to investigate the dissimilarities between the distributions which describe each population and to compare the results with those provided by Avlogiaris et al. [13].
Example 6.
A suitable distribution to describe each population, according to Avlogiaris et al. [13], is the Normal distribution with the estimated parameters of each population given in Table 6. So we have $\pi_1 \sim N(3.4, 0.04)$, $\pi_2 \sim N(2.48, 0.03)$ and $\pi_3 \sim N(2.99, 0.03)$. In Table 7, Table 8 and Table 9 we present the CWKL divergence between each pair of populations for appropriate partitions of the support.
As we can observe, when we focus on the parts where one of the two distributions is heavier, the CWKL divergence increases. This means that all distributions are far from each other, a result that has also been verified by Avlogiaris et al. [13]. Nevertheless, in our case, as opposed to the local divergence measures (Avlogiaris et al. [13]), we do not discard the information of the less significant parts, so we can easily identify not only whether the distributions differ but also in which parts they differ the most. Indeed, as we give more attention to the more ‘extreme’ parts of the distributions (for $\pi_1$ and $\pi_2$ the interval $(-\infty, 2)$, for $\pi_1$ and $\pi_3$ the interval $(-\infty, 2.7)$ and for $\pi_2$ and $\pi_3$ the interval $(3.2, \infty)$), the divergence increases, which means that the populations differ considerably as we ‘move’ to the tails.

5. Conclusions

The present study reveals that the results based on the classical and the weighted entropies and divergences can be entirely different, and that the weighted results are directly related to the specific parts of the distributions we focus on.
The simulation results illustrate the behaviour of the weighted Shannon entropy and clearly show the effect of the weights as they are applied to specific parts of the distribution. Such results can be used as important probabilistic tools for descriptive as well as discriminatory purposes.
Also, as the above simulations show, the CWKL divergence is larger (smaller) than the Corrected Kullback–Leibler divergence if we apply bigger weights to the parts with greater (lesser) dissimilarities. At the same time, the Corrected Kullback–Leibler divergence is a special case of the CWKL divergence obtained when equal weights are applied.
As is clear from all the above, an appropriate choice of weights will result in better discrimination between similar distributions. In addition, the weights provide us with a framework for concentrating on the ‘important’ parts of the distribution. Here, we have to mention that the choice of the weights is subjective and related to the researcher's belief about the significance of each part of the domain. Nevertheless, if one wishes to identify the parts of the domain where the distributions differ the most, an iterative approach could be adopted: first, the regions of the domain to be studied must be clearly established; then the sum of the weights is fixed and, through multiple iterations (with simultaneous permutations of the weights over the regions), the CWKL divergence is maximized. A small sketch of such a search is given below.
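A minimal version of this search, for the discrete case and a fixed pool of weights, could look as follows (our own sketch; the probability vectors and the weight pool are purely illustrative).

```python
import numpy as np
from itertools import permutations

def cwkl(w, p, q):
    """Corrected Weighted Kullback-Leibler divergence of Definition 3."""
    w, p, q = (np.asarray(a, dtype=float) for a in (w, p, q))
    return float(np.sum(w * (p * np.log(p / q) - p + q)) / np.sum(w * p))

def most_dissimilar_assignment(p, q, weight_pool):
    """Permute a fixed pool of weights over the regions and keep the assignment
    that maximizes the CWKL divergence."""
    best = max(permutations(weight_pool), key=lambda w: cwkl(w, p, q))
    return best, cwkl(best, p, q)

p = [0.05, 0.25, 0.40, 0.25, 0.05]
q = [0.30, 0.10, 0.20, 0.10, 0.30]
print(most_dissimilar_assignment(p, q, [2, 1, 1, 1, 0.5]))
# the largest weights tend to land on the regions where p and q differ the most
```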
These weighted measures could be a useful tool for the construction of ‘directed’ statistical tests focusing on the parts of the distribution we wish to emphasize. Such tests could include goodness-of-fit tests or model selection criteria which concentrate on specific parts of the distribution by assigning appropriate weights. The use of weighted divergences could also turn out to be useful in financial time series analysis. It is not unusual for a single stochastic model, like the Barndorff–Nielsen and Shephard (BN-S) model ([27,28,29]), to fail to adequately describe derivative or commodity market dynamics. In various financial time series, jumps often play an important role and are typically captured by a Lévy process. The weighted divergences could be incorporated into the analysis of Lévy processes for capturing (identifying) the fluctuations in the jump term of such processes. This way, the jump term could be replaced or modified accordingly, resulting in a more effective and efficient modelling approach. Thus, there is much room for improvement and research on this promising concept.

Author Contributions

Conceptualization, T.G. and A.K.; Formal analysis, T.G.; Methodology, A.K.; Project administration, A.K.; Software, T.G.; Validation, A.K.; Visualization, T.G.; Writing—original draft, T.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

The authors wish to express their appreciation to the Editor, the Special Editor and 3 anonymous referees for their valuable comments and recommendations. This work was completed as part of the activities of the Laboratory of Statistics and Data Analysis of the University of the Aegean.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Nawrocki, D.N.; Harding, W.H. State-value weighted entropy as a measure of investment risk. Appl. Econ. 1986, 18, 411–419.
2. Basseville, M. Distance measures for signal processing and pattern recognition. Signal Process. 1989, 18, 349–369.
3. Jimenez-Gamero, M.D.; Batsidis, A. Minimum distance estimators for count data based on the probability generating function with applications. Metrika 2017, 80, 503–545.
4. Pardo, L. Statistical Inference Based on Divergence Measures; CRC Press: Boca Raton, FL, USA, 2005.
5. Cressie, N.; Read, T.R. Multinomial goodness-of-fit tests. J. R. Stat. Soc. Ser. B Methodol. 1984, 46, 440–464.
6. Lee, S.; Vonta, I.; Karagrigoriou, A. A maximum entropy type test of fit. Comput. Stat. Data Anal. 2011, 55, 2635–2643.
7. Cavanaugh, J.E. Criteria for linear model selection based on Kullback's symmetric divergence. Aust. N. Z. J. Stat. 2004, 46, 257–274.
8. Shang, J.; Cavanaugh, J.E. Bootstrap variants of the Akaike information criterion for mixed model selection. Comput. Stat. Data Anal. 2008, 52, 2004–2021.
9. Toma, A. Model selection criteria using divergences. Entropy 2014, 16, 2686.
10. Guiaşu, S. Weighted Entropy. Rep. Math. Phys. 1971, 2, 165–179.
11. Frank, O.; Menéndez, M.L.; Pardo, L. Asymptotic distributions of weighted divergence between discrete distributions. Commun. Stat. Theory Methods 1998, 27, 867–885.
12. Barbu, V.S.; Karagrigoriou, A.; Preda, V. Entropy and divergence rates for Markov chains: II. The weighted case. Proc. Rom. Acad. Ser. A 2018, 1, 3–10.
13. Avlogiaris, G.; Micheas, A.; Zografos, K. On local divergences between two probability measures. Metrika 2016, 79, 303–333.
14. Belis, M.; Guiaşu, S. A quantitative-qualitative measure of information in cybernetic systems. IEEE Trans. Inf. Theory 1968, 14, 593–594.
15. Guiaşu, S. Grouping data by using the weighted entropy. J. Stat. Plan. Inference 1990, 15, 63–69.
16. Di Crescenzo, A.; Longobardi, M. On weighted residual and past entropies. arXiv 2007, arXiv:math/0703489.
17. Suhov, Y.; Zohren, S. Quantum weighted entropy and its properties. arXiv 2014, arXiv:1411.0892.
18. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
19. Moddemeijer, R. On estimation of entropy and mutual information of continuous distributions. Signal Process. 1989, 16, 233–248.
20. Marsh, C. Introduction to Continuous Entropy; Department of Computer Science, Princeton University: Princeton, NJ, USA, 2013.
21. Kullback, S.; Leibler, R. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
22. Kapur, J.N. Measures of Information and Their Applications; Publishing House: New York, NY, USA, 1994.
23. Csiszar, I. On infinite products of random elements and infinite convolutions of probability distributions on locally compact groups. Z. Wahrscheinlichkeitstheorie verw. Gebiete 1966, 5, 279–295.
24. Zografos, K.; Ferentinos, K.; Papaioannou, T. Divergence statistics: Sampling properties and multinomial goodness of fit and divergence tests. Commun. Stat. Theory Methods 1990, 19, 1785–1802.
25. Morales, D.; Pardo, L.; Vajda, I. Asymptotic divergence of estimates of discrete distributions. J. Stat. Plan. Inference 1995, 48, 347–369.
26. Johnson, R.A.; Wichern, D.W. Applied Multivariate Statistical Analysis, 3rd ed.; Prentice Hall International Editions: Englewood Cliffs, NJ, USA, 1992.
27. Barndorff-Nielsen, O.E.; Shephard, N. Non-Gaussian Ornstein–Uhlenbeck-based models and some of their uses in financial economics. J. R. Stat. Soc. Ser. B 2001, 63, 167–241.
28. Barndorff-Nielsen, O.E.; Shephard, N. Modelling by Lévy processes for financial econometrics. In Lévy Processes; Birkhäuser: Boston, MA, USA, 2001; pp. 283–318.
29. Barndorff-Nielsen, O.E. Superposition of Ornstein–Uhlenbeck type processes. Theory Probab. Appl. 2001, 45, 175–194.
Figure 1. Weighted Shannon Entropy of Bernoulli trials.
Figure 2. CWKL divergence between assets.
Figure 3. CWKL divergence between Standard Normal and various Student's t-distributions.
Table 1. The Weighted Shannon Entropy for $X \sim B(1, p)$.
| Weights | p = 0.1 | p = 0.5 | p = 0.9 |
| $\tilde{w} = (1, 1)^T$ | 0.325 | 0.693 | 0.325 |
| $\tilde{w} = (1.5, 0.5)^T$ | 0.392 | 0.693 | 0.257 |
| $\tilde{w} = (0.5, 1.5)^T$ | 0.257 | 0.693 | 0.392 |
| $\tilde{w} = (1.8, 0.2)^T$ | 0.433 | 0.693 | 0.216 |
| $\tilde{w} = (0.2, 1.8)^T$ | 0.216 | 0.693 | 0.433 |
Table 2. The returns of assets A–F with 5 states and their associated probabilities.
| State i | Return | A ($p_i$) | B ($q_i$) | C ($p_i$) | D ($q_i$) | E ($p_i$) | F ($q_i$) |
| 1 | 1% | 0.10 | 0.14 | 0.05 | 0.30 | 0.10 | 0.05 |
| 2 | 3% | 0.21 | 0.22 | 0.25 | 0.10 | 0.25 | 0.15 |
| 3 | 5% | 0.38 | 0.35 | 0.40 | 0.20 | 0.30 | 0.60 |
| 4 | 8% | 0.21 | 0.22 | 0.25 | 0.10 | 0.25 | 0.15 |
| 5 | 10% | 0.10 | 0.07 | 0.05 | 0.30 | 0.10 | 0.05 |
Table 3. CWKL divergence between assets A–F for various weights.
| Weights | $D_{CKL}^w(P_A, Q_B)$ | $D_{CKL}^w(P_C, Q_D)$ | $D_{CKL}^w(P_E, Q_F)$ |
| $\tilde{w} = (1, 1, 1, 1, 1)^T$ | 0.013 | 0.556 | 0.186 |
| $\tilde{w} = (2, 0.333, 0.333, 0.333, 2)^T$ | 0.036 | 1.442 | 0.189 |
| $\tilde{w} = (0.75, 1, 1.5, 1, 0.75)^T$ | 0.009 | 0.438 | 0.202 |
Table 4. CWKL divergence between P and Q in two different partitions of the support.
| Weights | $S_1 = (-\infty, -3)$, $S_2 = (-3, 3)$, $S_3 = (3, \infty)$ | $S_1 = (-\infty, -1.85)$, $S_2 = (-1.85, 1.85)$, $S_3 = (1.85, \infty)$ |
| $\tilde{w} = (1, 1, 1)^T$ | 0.259 | 0.259 |
| $\tilde{w} = (1.3, 0.4, 1.3)^T$ | 0.693 | 0.738 |
| $\tilde{w} = (0.85, 1.3, 0.85)^T$ | 0.192 | 0.185 |
| $\tilde{w} = (2, 0.5, 0.5)^T$ | 0.549 | 0.579 |
| $\tilde{w} = (0.5, 2, 0.5)^T$ | 0.119 | 0.098 |
Table 5. CWKL divergence between P and Q in two different partitions of the support.
| Weights | $S_1 = (-\infty, -3)$, $S_2 = (-3, 3)$, $S_3 = (3, \infty)$ | $S_1 = (-\infty, -1.85)$, $S_2 = (-1.85, 1.85)$, $S_3 = (1.85, \infty)$ |
| $\tilde{w} = (1, 1, 1)^T$ | 0.0016 | 0.0016 |
| $\tilde{w} = (1.3, 0.4, 1.3)^T$ | 0.0037 | 0.0051 |
| $\tilde{w} = (0.85, 1.3, 0.85)^T$ | 0.0013 | 0.0011 |
| $\tilde{w} = (2, 0.5, 0.5)^T$ | 0.0030 | 0.0039 |
| $\tilde{w} = (0.5, 2, 0.5)^T$ | 0.0009 | 0.0004 |
Table 6. Sample mean and variance for the three groups of grade point average (GPA) scores.
| Population | Sample Size | Estimates |
| $\pi_1$ | 31 | $\bar{x}_1 = 3.4$, $s_1^2 = 0.04$ |
| $\pi_2$ | 28 | $\bar{x}_2 = 2.48$, $s_2^2 = 0.03$ |
| $\pi_3$ | 26 | $\bar{x}_3 = 2.99$, $s_3^2 = 0.03$ |
Table 7. CWKL divergence between $\pi_1$ and $\pi_2$.
Partition: $S_1 = (-\infty, 2)$, $S_2 = (2, 3)$, $S_3 = (3, 4)$, $S_4 = (4, \infty)$.
| Weights | CWKL divergence |
| $\tilde{w} = (1, 1, 1, 1)^T$ | 14.129 |
| $\tilde{w} = (0, 0, 1, 0)^T$ | 13.421 |
| $\tilde{w} = (0, 0, 2, 1)^T$ | 13.431 |
| $\tilde{w} = (0, 1, 0, 0)^T$ | 43.139 |
| $\tilde{w} = (1, 2, 0, 0)^T$ | 43.261 |
Table 8. CWKL divergence between $\pi_1$ and $\pi_3$.
Partition: $S_1 = (-\infty, 2.7)$, $S_2 = (2.7, 3.2)$, $S_3 = (3.2, 3.7)$, $S_4 = (3.7, \infty)$.
| Weights | CWKL divergence |
| $\tilde{w} = (1, 1, 1, 1)^T$ | 2.821 |
| $\tilde{w} = (0, 0, 1, 0)^T$ | 2.277 |
| $\tilde{w} = (0, 0, 2, 1)^T$ | 2.492 |
| $\tilde{w} = (0, 1, 0, 0)^T$ | 3.221 |
| $\tilde{w} = (1, 2, 0, 0)^T$ | 3.365 |
Table 9. CWKL divergence between $\pi_2$ and $\pi_3$.
Partition: $S_1 = (-\infty, 2.3)$, $S_2 = (2.3, 2.8)$, $S_3 = (2.8, 3.2)$, $S_4 = (3.2, \infty)$.
| Weights | CWKL divergence |
| $\tilde{w} = (1, 1, 1, 1)^T$ | 4.331 |
| $\tilde{w} = (0, 1, 0, 0)^T$ | 2.925 |
| $\tilde{w} = (1, 2, 0, 0)^T$ | 3.341 |
| $\tilde{w} = (0, 0, 1, 0)^T$ | 19.981 |
| $\tilde{w} = (0, 0, 2, 1)^T$ | 21.726 |
