Theoretical Aspects on Measures of Directed Information with Simulations

Abstract: Measures of directed information are obtained from classical measures of information by taking into account specific qualitative characteristics of each event. These measures are classified into two main categories, the entropic and the divergence measures. Many times in statistics we wish to emphasize not only the quantitative characteristics but also the qualitative ones. For example, in financial risk analysis it is common to take into consideration the existence of fat tails in the distribution of returns of an asset (especially the left tail), while in biostatistics robust statistical methods are used to trim extreme values. Motivated by these needs, in this work we present, study and provide simulations for measures of directed information. These measures quantify the information with emphasis on specific parts (or events) of the probability distribution, without losing the overall information of the less significant parts, while concentrating on the information of the parts we care about the most.


Introduction
In statistics and other fields, one of the most challenging tasks is to investigate the probabilistic behaviour of a random process with respect to specific events. From finance [1] to signal processing [2], researchers try to distinguish random processes from each other and study their behaviour. A very important tool in the 'quiver' of the researcher for that purpose is the concept of information measures. These measures are divided into two main categories, the entropic and the divergence ones. Entropic measures quantify the diversity (or the information) within a population, while divergence measures quantify the dissimilarities between two different populations. Both types of measures are based only on the probabilistic behaviour of each population. We often say that divergences measure the discrepancy between two probability distributions, or the information needed in order to distinguish one from the other. There is a plethora of estimators and hypothesis tests associated with such measures for several cases [3,4], including many tests of fit based on measures of divergence that take into account dissimilarities between the distributions involved [5] or on the maximum entropy principle [6]. Model selection criteria [7][8][9] are also based on this type of information measure.
The challenge here is to construct measures which take into account not only the probabilistic aspects of random processes but also their qualitative characteristics. These characteristics are sometimes subjective, and one could say that they are related to the significance, the relevance or the utility of the information contained, which in turn is related to a specific goal [10]. Such measures exist in the literature [11], but they do not address some important issues that we collect and present here. Also, Barbu et al. [12] provide the weighted form of the generalizations of the Alpha and Beta divergence measures for Markov chains. We call this type of measure Directed Information Measures. These measures do not assume that all possible states of a random process have the same significance with respect to a goal (as the classical ones do), so they apply specific weights to different states or parts of these processes. By applying these directed measures we can distinguish small dissimilarities in the probabilistic behaviour of two random processes which would otherwise be difficult to notice.
A closely related approach is the concept of local divergence measures [13], which study the dissimilarities between two random processes on a specific subset of their support. The main difference between that method and the one we study here is that we do not lose the overall information of the less significant parts while, at the same time, we focus on the more significant ones.
The present article is structured as follows. Section 2 is devoted to the notion of directed entropy and its properties as proposed by Guiaşu [10]. We also present some theoretical problems that arise when the classical Shannon entropy is extended to the continuous case. In Section 3 we present the Corrected Weighted Kullback-Leibler (CWKL) divergence, which is a measure of directed divergence in both the discrete and the continuous case. We also present the asymptotic distribution of the estimated CWKL divergence for tests of fit. In Section 4 we provide simulations of measures of directed entropy and of the CWKL divergence, based on discrete and continuous distributions; the aim of that section is to scrutinize the behaviour of these directed measures. Finally, Section 5 offers brief conclusions.

Measures of Directed Entropy
How do we measure the information or the uncertainty of an experiment with respect to certain characteristics of its events? The foundations of the answer lie in the work of Belis and Guiaşu [14], while the answer itself was given by Guiaşu [10], who proposed the weighted entropy. He explicitly defined the axioms, the properties and the maximum value of the weighted entropic formula. After this pioneering work, Guiaşu [15] used the weighted entropy to group data with respect to the importance of specific regions of the domain. Later, Nawrocki and Harding [1] proposed the use of weighted entropy as a measure of investment risk, Di Crescenzo and Longobardi [16] proposed the weighted residual and past entropies, and Suhov and Zohren [17] proposed the quantum version of weighted entropy and studied its properties in quantum statistical mechanics.

Discrete Case
As mentioned in the introduction of this section, the first to propose the weighted form of Shannon entropy and its properties was Guiaşu [10]. He proposed this form of entropy only in the discrete case, because the continuous analogue as proposed by Shannon [18] raises some concerns, to be discussed later. The relevant definition is presented below.

Definition 1. Let a stochastic source be described by a discrete random variable X with n possible states, distribution P_X and probability mass function p = (p_1, ..., p_n)^T, and let w = (w_1, ..., w_n)^T be a vector of weights associated with these states, where w_i ≥ 0, i = 1, ..., n. The weighted Shannon entropy measure is defined by:

H_w(X) = − ∑_{i=1}^n w_i p_i log p_i.

We proceed below with the properties of the above weighted entropy, as proposed by Guiaşu [10].
3. If p_i = 1 for some i = 1, ..., n, then H_w(X) = 0, irrespective of the values of the weights w.

Remark 1.
For the justification of Axioms 1-3, note that small changes in the probability of an event must produce correspondingly small changes in the information measure (Axiom 1), the information of an experiment must depend solely on the probabilistic behaviour of the associated events (Axiom 2), while for equal event probabilities the weighted entropy should be equal to the average of all weights (Axiom 3).
The following theorem states the condition between p and w that maximizes the weighted entropy.
Theorem 1 (Guiaşu [10]). Consider the random variable X associated with the discrete probability distribution p_i ≥ 0, i = 1, ..., n, where ∑_{i=1}^n p_i = 1, and the weights w_i ≥ 0, i = 1, ..., n, associated with the significance of each event i. The weighted entropy:

H_w(X) = − ∑_{i=1}^n w_i p_i log p_i

is maximum if and only if:

p_i = exp( −(α + w_i)/w_i ), i = 1, ..., n,

where α is the solution of the equation:

∑_{i=1}^n exp( −(α + w_i)/w_i ) = 1,

and the maximum value of H_w(X) is:

H_w^max(X) = α + ∑_{i=1}^n w_i exp( −(α + w_i)/w_i ).
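As an illustration of Theorem 1, the maximizing distribution can be obtained numerically. The following Python sketch (the weight vector w = (1, 2, 4) is an illustrative choice, not taken from the examples of this paper) solves the equation for α by bisection and recovers the maximizing probabilities:

```python
import math

def weighted_entropy(p, w):
    # H_w(X) = -sum_i w_i * p_i * log(p_i), with the convention 0*log(0) = 0
    return -sum(wi * pi * math.log(pi) for wi, pi in zip(w, p) if pi > 0)

def maximizing_distribution(w, lo=-10.0, hi=10.0, tol=1e-12):
    # Solve sum_i exp(-(alpha + w_i)/w_i) = 1 for alpha by bisection,
    # then return alpha and the maximizing probabilities p_i.
    def excess(alpha):
        return sum(math.exp(-(alpha + wi) / wi) for wi in w) - 1.0
    # excess is strictly decreasing in alpha; widen the bracket if needed
    while excess(lo) < 0:
        lo -= 10.0
    while excess(hi) > 0:
        hi += 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    p = [math.exp(-(alpha + wi) / wi) for wi in w]
    return alpha, p

w = [1.0, 2.0, 4.0]
alpha, p_star = maximizing_distribution(w)
h_max = weighted_entropy(p_star, w)
```

Since H_w is concave in p for non-negative weights, p_star is a global maximizer, and h_max agrees with the closed form α + ∑ w_i p_i.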

Continuous Case
Before we present the continuous version of the weighted entropy, we have to present the continuous version of the entropy itself. Shannon proposed the entropy as a function which measures the uncertainty or the information of a random source. However, his entropic formula for a discrete random source was designed to describe the uncertainty (or the information) a discrete signal contains. A signal is not necessarily discrete; it may also be continuous [19]. So, there is a need for a formula which measures the uncertainty in the continuous case. In the literature, the continuous entropy is described by the following definition [20].

Definition 2.
Let X be a stochastic source described by a continuous probability distribution P with support S_X, let µ_p be the associated probability measure, absolutely continuous with respect to the Lebesgue measure λ, and let p be the induced density (the Radon-Nikodym derivative). Then, the continuous version of the Shannon entropy measure is defined by:

h(X) = − ∫_{S_X} p(x) log p(x) dλ(x).

This function, also called differential entropy, satisfies some of the properties of a suitable measure of uncertainty but fails to fulfil two of them: positivity and invariance under a change of variables.
The following examples will make the problem of positivity clear.

Example 1.
Suppose the stochastic sources X ∼ U(a, b) and Y ∼ Exp(κ), with κ the rate parameter. Then the entropies of X and Y are given respectively by:

h(X) = log(b − a) and h(Y) = 1 − log κ.

It is easy to see that the entropies h(X) and h(Y) will be negative for appropriate values of the corresponding parameters (b − a < 1 and κ > e, respectively).
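The negativity can also be checked numerically. The sketch below (the rate parameterization of Exp(κ), the rate κ = 3 and the truncation of the integral at x = 20 are illustrative assumptions) compares a midpoint-rule evaluation of h(Y) with the closed form 1 − log κ:

```python
import math

def h_uniform(a, b):
    # closed form: h(X) = log(b - a), negative whenever b - a < 1
    return math.log(b - a)

# Midpoint-rule evaluation of h(Y) = -∫ f(x) log f(x) dx for Y ~ Exp(rate)
rate = 3.0
n, upper = 200_000, 20.0      # truncation point; the tail beyond 20 is negligible
dx = upper / n
h_numeric = 0.0
for i in range(n):
    x = (i + 0.5) * dx
    fx = rate * math.exp(-rate * x)
    h_numeric -= fx * math.log(fx) * dx

h_closed = 1.0 - math.log(rate)   # negative since rate > e
```

Both h_uniform(0, 0.5) and h_closed come out negative, confirming that differential entropy violates positivity.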
As a result, the construction of a continuous analogue along the same lines as Guiaşu's would be meaningless.

Measures of Directed Divergence
In contrast with the weighted entropy, the weighted form of divergence measures has not been extensively studied by researchers. In the previous section we stressed that the weighted entropy measures the probabilistic behaviour of a statistical population while attributing greater significance to some events. Correspondingly, the weighted form of divergences measures the probabilistic dissimilarities between two statistical populations while taking into account the qualitative characteristics of each region of the support.
This section is divided into two subsections, in which we present and study the problems arising from the concept of measures of directed divergence in both the discrete and the continuous case.

Discrete Case
One of the most popular and extensively used divergence measures in the literature is the Kullback-Leibler divergence [21]. From the statistical point of view, as a divergence, it quantifies the dissimilarities between two distributions with the same support. In information theory, the Kullback-Leibler divergence D_KL(P, Q) quantifies the amount of information gained by learning that a variable previously thought to be distributed as Q is actually distributed as P.
But, as in the case of Shannon entropy, the Kullback-Leibler divergence does not take into account the qualitative characteristics of random events. The idea here is to determine the dissimilarities between two distributions with specific weight on each part of the support. In the discrete case this is resolved by putting weights on each event of the sample space. The concept of weighting is equivalent to the concept of weighted Shannon entropy. The weights are related to the significance of each event with respect to a specific goal. If one would have thought that the weighted Kullback-Leibler divergence, in discrete case, is analogous to weighted entropy then he/she would have concluded the following.
Consider two probability mass functions p = (p_1, ..., p_n)^T, q = (q_1, ..., q_n)^T and let w = (w_1, ..., w_n)^T be a vector of weights (here T denotes transposition). Then the discrete version of the weighted Kullback-Leibler divergence would be the following:

D^w_KL(p, q) = ∑_{i=1}^n w_i p_i log( p_i / q_i ).

The above expression is not proper, due to the fact that the Kullback-Leibler divergence is not positive everywhere on the support but only on average. The following theorem confirms the average non-negativity of the Kullback-Leibler divergence measure.
Theorem 2. The Kullback-Leibler divergence D_KL(P, Q) between two distributions P, Q is non-negative on average.

Proof. Let p(x) and q(x) be the Radon-Nikodym derivatives of the induced probability measures P, Q with respect to some base measure µ (commonly the Lebesgue measure for continuous random variables). Then:

D_KL(p, q) = E_p[ log( p(X)/q(X) ) ] = − E_p[ log( q(X)/p(X) ) ].

From Jensen's inequality for the concave function log we have that:

E_p[ log( q(X)/p(X) ) ] ≤ log E_p[ q(X)/p(X) ] = log ∫_{S_X} q(x) dµ(x) = log 1 = 0.

Thus, we have proved that D_KL(p, q) = E_p[ log( p(X)/q(X) ) ] ≥ 0, with equality if and only if p(x) = q(x), ∀x ∈ S_X.
Thus, even though the divergence is non-negative on average, i.e., in the 'finite case of equal weights', its individual summands are not non-negative in general. The following example makes clear that the weighted Kullback-Leibler divergence can be negative.

Example 2.
Let P be a binomial distribution with n = 2, p = 0.4, so that P(X = 0) = 0.36, P(X = 1) = 0.48, P(X = 2) = 0.16, and let Q be a discrete uniform distribution over the three possible outcomes X = 0, 1, 2, each with probability 1/3. The Kullback-Leibler divergence between P and Q is D_KL(P, Q) = 0.08529. Now, if we give the event X = 2 a much greater significance than the others, for example by putting the weights w = (1, 1, 4)^T, then the weighted Kullback-Leibler divergence, according to the previous definition, will be D^w_KL(P, Q) = −0.26701. This is due to the fact that the logarithm is negative on the interval (0, 1).
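The effect in Example 2 is easy to reproduce. The following Python sketch (natural logarithms are used throughout) evaluates the classical divergence and the naively weighted version for the distributions above:

```python
import math

p = [0.36, 0.48, 0.16]   # Binomial(2, 0.4)
q = [1/3, 1/3, 1/3]      # discrete uniform on {0, 1, 2}
w = [1.0, 1.0, 4.0]      # extra weight on the event X = 2

# Classical Kullback-Leibler divergence (natural logarithm)
d_kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Naive weighted version: each summand p_i * log(p_i / q_i) may be negative,
# so up-weighting a summand with p_i < q_i can make the total negative
d_wkl = sum(wi * pi * math.log(pi / qi) for wi, pi, qi in zip(w, p, q))
```

Here d_kl is positive while d_wkl is negative: the weight 4 amplifies the single negative summand corresponding to X = 2, where p_2 < q_2.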
Due to this fact, Kapur [22] stressed that a weighted divergence measure will be an appropriate measure of directed divergence if the following conditions are satisfied:
1. It is a continuous function of p, q and w.
2. It is a permutationally symmetric function of p, q and w, i.e., it does not change when the triplets (p_1, q_1, w_1), (p_2, q_2, w_2), ..., (p_n, q_n, w_n) are permuted among themselves.
3. It is always greater than or equal to zero for all possible choices of weights w, and vanishes when p_i = q_i for each i = 1, ..., n.
4. It is a convex function of p_1, p_2, ..., p_n which attains its minimum value, zero, when p_i = q_i for each i = 1, ..., n.
5. It reduces to an ordinary measure of directed divergence upon ignoring the weights (w_i = c, c > 0, ∀ i = 1, ..., n).
The most important of these conditions is Condition 3, which is violated by most of the usual measures. The solution to this problem is quite simple: we just need to transform the usual measure to a positive equivalent. For this purpose, Kapur [22] introduced the following function and proved the above conditions for it:

φ(x) = x log x − x + 1, x > 0.

This function is non-negative (φ(x) ≥ 0, ∀ x > 0) everywhere on its domain, so the divergence based on this φ-function is also non-negative on every subset of the support. This divergence belongs to the φ-divergence family [23], so we can use all the theoretical results of this family, such as the asymptotic distribution of the φ-divergence estimator [24] or the minimum φ-divergence estimator [25].
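A quick numerical check of Kapur's construction (here taken as φ(x) = x log x − x + 1, the standard convex function yielding the corrected Kullback-Leibler divergence) confirms that φ is non-negative and vanishes only at x = 1, so that each pointwise term q φ(p/q) = p log(p/q) − p + q is non-negative:

```python
import math

def phi(x):
    # phi(x) = x*log(x) - x + 1; by convention phi(0) = lim_{x->0+} phi(x) = 1
    return x * math.log(x) - x + 1.0 if x > 0 else 1.0

# phi is convex with unique minimum phi(1) = 0; scan a grid of x in (0, 5]
grid = [0.001 * k for k in range(1, 5001)]
min_value = min(phi(x) for x in grid)

# each pointwise term of the corrected divergence is q * phi(p/q)
p_i, q_i = 0.16, 1.0 / 3.0
term = q_i * phi(p_i / q_i)
term_direct = p_i * math.log(p_i / q_i) - p_i + q_i   # same quantity, expanded
```

The identity q φ(p/q) = p log(p/q) − p + q is exactly what replaces the possibly-negative summand p log(p/q) in the weighted sum.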
As a result, we have the following definition for the discrete case of the Corrected Weighted Kullback-Leibler (CWKL) divergence.

Definition 3. Consider two probability mass functions p = (p_1, ..., p_n)^T, q = (q_1, ..., q_n)^T and a vector of weights w = (w_1, ..., w_n)^T with w_i ≥ 0, i = 1, ..., n. The discrete CWKL divergence is defined by:

D^w_CKL(p, q) = ∑_{i=1}^n w_i [ p_i log( p_i / q_i ) − p_i + q_i ] ≥ 0.

Continuous Case
We proceed now to the continuous form of the weighted divergence measures, whose construction follows a slightly different line of thinking. In the discrete case, we multiply each point of the support by the desired weight. In the continuous case, we have to take into account that the support is uncountable, so we have to partition it and apply the appropriate weight on each interval. More generally, we can restrict the appropriate measure (usually the Lebesgue measure) to each subset of the support we want to emphasize.

Definition 4.
Consider two probability measures µ_f and µ_g, absolutely continuous with respect to the Lebesgue measure λ, and let f, g be the Radon-Nikodym derivatives of µ_f and µ_g with respect to λ. Let A_1, ..., A_n ∈ A be a partition of the support S_X, i.e., ∪_{i=1}^n A_i = S_X and A_i ∩ A_j = ∅, ∀ i ≠ j. Then, if w_i, i = 1, ..., n, are the weights, the continuous CWKL divergence measure is the following:

D^w_CKL(f, g) = ∑_{i=1}^n w_i ∫_{A_i} [ f(x) log( f(x)/g(x) ) − f(x) + g(x) ] dλ|_{A_i}(x) ≥ 0,

where λ|_{A_i} is the Lebesgue measure restricted to the subset A_i.
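The continuous CWKL divergence of Definition 4 can be approximated numerically. The sketch below (the partition of the support, the weights and the truncation of the support to (−10, 10) are illustrative assumptions) computes the divergence between a standard Normal density and a Student's t density with one degree of freedom (a Cauchy density) with the midpoint rule:

```python
import math

def f(x):  # standard Normal density
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def g(x):  # Student's t density with 1 df (Cauchy)
    return 1.0 / (math.pi * (1.0 + x * x))

def corrected_kl_on(a, b, n=20_000):
    # midpoint-rule approximation of ∫_a^b [ f log(f/g) - f + g ] dx
    dx = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * dx
        fx, gx = f(x), g(x)
        total += (fx * math.log(fx / gx) - fx + gx) * dx
    return total

# Partition of (-10, 10) into tails and centre; the second weight vector
# emphasizes the tails, where the two densities differ the most
parts = [(-10.0, -2.0), (-2.0, 2.0), (2.0, 10.0)]
w_equal = [1.0, 1.0, 1.0]
w_tails = [4.0, 1.0, 4.0]

d_equal = sum(wi * corrected_kl_on(a, b) for wi, (a, b) in zip(w_equal, parts))
d_tails = sum(wi * corrected_kl_on(a, b) for wi, (a, b) in zip(w_tails, parts))
```

Because every partial integral is non-negative, up-weighting any part can only increase the divergence, and here weighting the heavy Cauchy tails increases it noticeably.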

Asymptotic Distribution of CWKL Divergence Estimator
In this subsection we give the asymptotic distribution of the CWKL divergence estimator for goodness-of-fit testing between multinomial distributions. For this purpose we rely on the theoretical results of Frank et al. [11], who provide the asymptotic distribution of the estimator of the weighted Kullback-Leibler divergence for this type of test.
Consider a random variable X described by a probability distribution F_X(x). If we wish to test the hypothesis H_0 : F(x) = F_0(x), where F_0(x) is a hypothesized distribution, then we partition the range of the distribution into K classes, say C_1, ..., C_K. The probability of falling into class i is P(X ∈ C_i) = p_i, i = 1, ..., K, and w_i, i = 1, ..., K, is the weight, or the importance, of each class. Now, suppose that we have a random sample X_1, ..., X_n from the distribution F_X(x) and that N = (N_1, ..., N_K) records the observed numbers of values falling into the classes C_1, ..., C_K. It is straightforward that the vector N has a multinomial distribution with parameters (n, p_1, ..., p_K), n = ∑_i N_i. Also, the estimator of P = (p_1, ..., p_K) is P̂ = (p̂_1, ..., p̂_K), where p̂_i = N_i/n, i = 1, ..., K. Since the null hypothesis is equivalent to H_0 : P = P_0, where P_0 is the hypothesized distribution, D^w_CKL(P̂, P_0) has to be small under H_0. Otherwise, if D^w_CKL(P̂, P_0) is sufficiently large, the null hypothesis should be rejected. The theorem below provides the asymptotic distribution of D^w_CKL(P̂, P_0) and is a natural extension of the result of Frank et al. [11].

Theorem 3. Assume the weighted directed divergence D^w_CKL(P, P_0) and its estimator D^w_CKL(P̂, P_0). Under the null hypothesis H_0 : P = P_0 we have:

2n D^w_CKL(P̂, P_0) →_d ∑_{i=1}^r β_i Z_i^2, as n → ∞,

where Z_i, i = 1, ..., r, are i.i.d. standard Normal variables, β_i, i = 1, ..., r, are the eigenvalues of the matrix C Σ_{P_0}, with C = (c_ij), c_ij = w_i/p_i^0 if i = j and c_ij = 0 if i ≠ j, Σ_{P_0} = diag(P_0) − P_0 (P_0)^T and r = rank(Σ_{P_0} C Σ_{P_0}).
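The limiting law can be probed with a small Monte Carlo experiment. The following Python sketch (the null distribution P_0, the weights, the sample size and the number of replications are all illustrative choices) compares the empirical mean of the scaled estimator with the mean of the limiting mixture, which equals the sum of the eigenvalues, i.e., tr(C Σ_{P_0}) = ∑ w_i (1 − p_i^0):

```python
import numpy as np

rng = np.random.default_rng(0)

p0 = np.array([0.2, 0.5, 0.3])    # hypothesized multinomial probabilities
w = np.array([1.0, 1.0, 3.0])     # weights on the K = 3 classes
n, reps = 5000, 4000

def cwkl(p_hat, p_null, weights):
    # sum_i w_i [ p_i log(p_i / p0_i) - p_i + p0_i ], with 0 log 0 = 0
    safe = np.where(p_hat > 0, p_hat, 1.0)
    logs = np.where(p_hat > 0, p_hat * np.log(safe / p_null), 0.0)
    return float(np.sum(weights * (logs - p_hat + p_null)))

stats = np.array([2.0 * n * cwkl(rng.multinomial(n, p0) / n, p0, w)
                  for _ in range(reps)])

# Limiting law: sum_i beta_i Z_i^2 with beta_i the eigenvalues of C Sigma,
# C = diag(w_i / p0_i), Sigma = diag(p0) - p0 p0^T
C = np.diag(w / p0)
Sigma = np.diag(p0) - np.outer(p0, p0)
limit_mean = float(np.trace(C @ Sigma))   # = sum_i w_i (1 - p0_i)
empirical_mean = float(stats.mean())
```

Under H_0 the empirical mean of 2n D^w_CKL(P̂, P_0) settles near tr(C Σ_{P_0}), in line with the weighted chi-square mixture of the theorem.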

Simulations
In this section we implement all weighted cases from the previous sections. In Section 4.1 we present Bernoulli simulations of discrete cases for the weighted Shannon entropy, while in Section 4.2 we present simulations of continuous and discrete cases for the CWKL divergence. The aim of this section is to study the behaviour of these Directed Information Measures, something that is missing in the literature, and especially to identify whether the CWKL divergence can recognize small dissimilarities between distributions which the classical measures cannot.

Weighted Shannon Entropy
Example 3. Consider a coin toss and let the random variable X, which indicates the occurrence of heads, be described by a Bernoulli distribution with probability p. In Table 1 we present the weighted entropy of this variable for various p and w. The results show that under the equal-weights condition the Shannon entropy is symmetric around its maximum value, but in all cases with unequal weights (associated with unequal significance) the weighted Shannon entropy ceases to be symmetric. All these are depicted in Figure 1.
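The symmetry behaviour described above can be verified directly. The following Python sketch (the weight pairs are illustrative) evaluates the weighted Bernoulli entropy H_w(p) = −[w_0 (1 − p) log(1 − p) + w_1 p log p] at p and at 1 − p:

```python
import math

def weighted_bernoulli_entropy(p, w0, w1):
    # H_w = -( w0 * (1 - p) * log(1 - p) + w1 * p * log(p) ), 0 log 0 = 0
    if not 0 < p < 1:
        return 0.0
    return -(w0 * (1 - p) * math.log(1 - p) + w1 * p * math.log(p))

# Equal weights: the entropy is symmetric around p = 0.5
h_sym_a = weighted_bernoulli_entropy(0.3, 1.0, 1.0)
h_sym_b = weighted_bernoulli_entropy(0.7, 1.0, 1.0)

# Unequal weights (heads deemed more significant): symmetry is lost
h_asym_a = weighted_bernoulli_entropy(0.3, 1.0, 3.0)
h_asym_b = weighted_bernoulli_entropy(0.7, 1.0, 3.0)
```

With equal weights the values at p and 1 − p coincide; with the weight 3 on heads they differ markedly, mirroring the asymmetry seen in Table 1 and Figure 1.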

CWKL Divergence
In this subsection we present examples of the CWKL divergence, considering cases based on several distributions with various weights; our aim is to examine whether small dissimilarities between two distributions are easily distinguished by the directed measures.

Example 4. Consider the assets A, B, C, D, E and F. The returns associated with five states and the probability of each state occurring are given in Table 2.

Table 2. The returns of assets A-F with 5 states and their associated probabilities.

In Table 3 we present the Corrected Weighted Kullback-Leibler (CWKL) divergence between the assets A, B, C, D, E and F for various vectors of weights.

Table 3. CWKL divergence between assets A-F for various weights.

As we can easily see, the probability distributions of assets such as A and B are quite similar, in contrast with C and D, which differ a lot in the 'extreme' states (1 and 5), or E and F, which have their greatest dissimilarity in state 3. From Table 3 we can see that if we focus on the states with greater dissimilarities, then the CWKL divergence increases. All the above are visualized in Figure 2.

Example 5. Let P be a standard Normal distribution and Q a Student's t-distribution with one degree of freedom (df = 1). The CWKL divergence between P and Q for various weights w applied to different parts of the support is given in Table 4.

Table 4. CWKL divergence between P and Q in two different partitions of the support.

Now, let P be a standard Normal distribution and Q a Student's t-distribution with thirty degrees of freedom (df = 30). The CWKL divergence between P and Q for various weights w applied to different parts of the support is given in Table 5.

Table 5. CWKL divergence between P and Q in two different partitions of the support.
Figure 3. CWKL divergence between the standard Normal and Student's t-distributions with df = 1, 2, 5, 15 and 30.

In Example 6 we revisit a real-life case proposed by Johnson and Wichern [26] and used by Avlogiaris et al. [13] to study the performance of local divergence measures. The example deals with the grade point average (GPA) scores of 85 students who applied to a business school. Students are categorized into three groups (populations) according to their GPA performance as π_1: admit, π_2: do not admit and π_3: borderline. Sample means and variances of each group, together with the group sample sizes, are given in Table 6. The aim of Example 6 is to investigate the dissimilarities between the distributions which describe each population and to compare the results with those provided by Avlogiaris et al. [13].

Example 6. The distribution adopted, as in Avlogiaris et al. [13], is the Normal distribution with the estimated parameters of each population given in Table 6. So, we have: π_1 ∼ N(3.4, 0.04), π_2 ∼ N(2.48, 0.03) and π_3 ∼ N(2.99, 0.03). In Tables 7-9 we present the CWKL divergence between each pair of populations for appropriate parts of the support.

Table 7. CWKL divergence between π_1 and π_2.

Table 8. CWKL divergence between π_1 and π_3.

Table 9. CWKL divergence between π_2 and π_3.

As we can observe, when we focus on the parts where one of the two distributions is heavier, the CWKL divergence increases. This means that all distributions are far from each other, a result that has also been verified by Avlogiaris et al. [13]. Nevertheless, in our case, as opposed to the local divergence measures (Avlogiaris et al. [13]), we do not discard the information of the less significant parts, so we can easily identify not only whether the distributions differ but also in which parts they differ the most. Indeed, as we give attention to the more 'extreme' parts of the distributions ((−∞, 2) for π_1 and π_2, (−∞, 2.7) for π_1 and π_3, and (3.2, ∞) for π_2 and π_3), the divergence increases, which means that they differ considerably as we 'move' to the tail of each.
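The qualitative pattern of Tables 7-9 can be reproduced with a numerical sketch. Below, the CWKL divergence between π_1 ∼ N(3.4, 0.04) and π_2 ∼ N(2.48, 0.03) is approximated with the midpoint rule; the truncation of the support to (1.5, 4.5), the split of the support at x = 2 and the weight values are illustrative assumptions, not the exact settings of the tables:

```python
import math

def norm_pdf(x, mu, var):
    # Normal density with mean mu and variance var
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def cwkl_normals(mu1, v1, mu2, v2, parts, weights, n=20_000):
    # D_CWKL = sum_i w_i * ∫_{A_i} [ f log(f/g) - f + g ] dx (midpoint rule)
    total = 0.0
    for wgt, (a, b) in zip(weights, parts):
        dx = (b - a) / n
        s = 0.0
        for i in range(n):
            x = a + (i + 0.5) * dx
            fx = norm_pdf(x, mu1, v1)
            gx = norm_pdf(x, mu2, v2)
            s += (fx * math.log(fx / gx) - fx + gx) * dx
        total += wgt * s
    return total

# pi_1 ~ N(3.4, 0.04) vs pi_2 ~ N(2.48, 0.03); left tail emphasized
parts = [(1.5, 2.0), (2.0, 4.5)]
d_equal = cwkl_normals(3.4, 0.04, 2.48, 0.03, parts, [1.0, 1.0])
d_tail = cwkl_normals(3.4, 0.04, 2.48, 0.03, parts, [4.0, 1.0])
```

As every partial integral is non-negative, up-weighting the left tail (−∞, 2), where π_2 places mass that π_1 essentially lacks, can only increase the divergence, matching the behaviour reported for Table 7.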

Conclusions
The present study reveals that the results based on the classical and the weighted entropies and divergences can be entirely different, and that they are directly related to the specific parts of the distributions we focus on.
The above results illustrate the behaviour of the weighted Shannon entropy and clearly show the effect of the weights as they are applied to specific parts of the distribution. Such results can be used as important probabilistic tools for descriptive as well as discriminatory purposes.
Also, the above simulations show that the CWKL divergence is larger (smaller) than the corrected Kullback-Leibler divergence if we apply larger weights to the parts with greater (lesser) dissimilarities. At the same time, the corrected Kullback-Leibler divergence is the special case of the CWKL divergence obtained when equal weights are applied.
As is clear from all the above, an appropriate choice of weights results in better discrimination between similar distributions. In addition, the weights provide us with a framework for concentrating on the 'important' parts of the distribution. Here, we have to mention that the choice of the weights is subjective and related to the researcher's belief about the significance of each part of the domain. Nevertheless, if one wishes to identify the parts of the domain with the highest dissimilarities, an iterative approach could be adopted: first, the regions of the domain to be studied must be clearly established; then the sum of the weights must be fixed; and through multiple iterations (with simultaneous permutations of the weights over the regions) the CWKL divergence is maximized.
These weighted measures could be a useful tool for the construction of 'directed' statistical tests on the parts of the distribution we wish to emphasize. Such tests could include goodness-of-fit tests or model selection criteria which concentrate on specific parts of the distribution by assigning appropriate weights. The use of weighted divergences could also turn out to be useful in financial time series analysis. It is not unusual for a single stochastic model, such as the Barndorff-Nielsen and Shephard (BN-S) model ([27][28][29]), to fail to describe derivative or commodity market dynamics adequately. For various financial time series data, jumps often play an important role and are typically captured by a Lévy process. The weighted divergences could be incorporated into the analysis of Lévy processes for capturing (identifying) the fluctuations in the jump term of such processes. This way, the jump term could be replaced or modified accordingly, resulting in a more effective and efficient modelling approach. Thus, there is much room for improvement and research on this promising concept.