On Conditional Tsallis Entropy

There is no generally accepted definition for conditional Tsallis entropy. The standard definition of (unconditional) Tsallis entropy depends on a parameter α and converges to the Shannon entropy as α approaches 1. In this paper, we describe three definitions of conditional Tsallis entropy proposed in the literature; their properties are studied and their values, as a function of α, are compared. We also consider another natural proposal for conditional Tsallis entropy and compare it with the existing ones. Lastly, we present an online tool to compute the four conditional Tsallis entropies, given the probability distributions and the value of the parameter α.


Introduction
Tsallis entropy [1], a generalization of Shannon entropy [5,6], was extensively studied by Constantino Tsallis in 1988. It provides an alternative way of dealing with several characteristics of nonextensive physical systems, given that the information about the intrinsic fluctuations in the physical system can be characterized by the nonextensivity parameter α. (The name Tsallis entropy, used in this paper to identify the quantity presented in Equation (3), is not consensual in the community: before Tsallis presented it in 1988, and as he himself acknowledges, other authors had already introduced it [2][3][4].) It can be applied to many scientific fields, such as physics [7], economics [8], computer science [9,10], and biology [11]. We refer the reader to Reference [12] for a more extensive bibliography on applications of Tsallis entropy. Furthermore, we refer the reader to Reference [13] for a survey on the most significant areas of application of the most usual entropy measures, including Shannon [6], Rényi [14], and Tsallis entropies [1][2][3][4].
It is known that, as the parameter α approaches 1, the Tsallis entropy converges to the Shannon entropy. Unlike for Shannon entropy, but similarly to Rényi entropy (yet another generalization of Shannon entropy developed by Alfréd Rényi in 1961 [14], which also depends on a parameter α and converges to Shannon entropy when α approaches 1), there is no commonly accepted definition for the conditional Tsallis entropy: several versions have been proposed and used in the literature [15,16]. In this work, we revisit the notion of conditional Tsallis entropy by studying some natural and desirable properties in the existing proposals (see, for instance, References [15,16]): when α → 1, the usual conditional Shannon entropy should be recovered; the conditional Tsallis entropy should not exceed the unconditional Tsallis entropy; and the conditional Tsallis entropy should have values between 0 and the maximum value of the unconditional version.
The use of entropies in different fields, especially in the field of information theory and its connection to communication, allowed the development of several useful information measures, such as mutual information, symmetry of information, and information distances. See, for example, References [17][18][19] for some recent work related to the aforementioned information measures.
Depending on the entropy measure used, all of these have been applied in many different areas of knowledge, such as physics [20], information theory [21,22], complexity theory [23][24][25], security [26][27][28], biology [29][30][31][32], finance [33], and medicine [34][35][36], among others. The conditional Tsallis entropy, as suggested in Reference [37], can be directly applied to information theory, especially coding theory. Furthermore, since Tsallis entropy can be applied in many areas (see, for example, Reference [12]), the study of conditional Tsallis entropies is quite promising. This paper analyzes several definitions of conditional Tsallis entropy, with the intent of providing the reader with a description of the properties that each approach satisfies.
Continuing from previous works [37,38], we introduce a new natural definition for conditional Tsallis entropy as a possible alternative to the existing ones. Our new proposal does not intend to be the ultimate version of conditional Tsallis entropy, but an alternative to the existing ones, with its own properties that, in settings such as biomedical applications, might be useful for defining information distances or other significant measurements. None of the known definitions has all of the desired properties for a conditional version. In particular, the one presented here (as it takes the maximum over the marginal distributions) does not converge to the Shannon entropy when α → 1; it behaves similarly to a parameterized entropy, and is akin to the one proposed in Reference [38] as an alternative to Rényi's conditional entropy, another generalization of Shannon entropy.
The paper is organized as follows. In the next section, we present the definitions necessary for the rest of the paper, namely Shannon entropy and Tsallis entropy. In Section 3, we provide several definitions for the conditional Tsallis entropy, both from the existing literature and our own proposal. In Section 4, we establish several results, comparing the definitions presented previously. In Section 5, we explore some features of each variant of the conditional Tsallis entropy. Finally, in Section 6, we present the conclusions and future work.

Preliminaries
In the remainder of the paper, we use the standard notation for entropies and for probability distributions according to Reference [5]. For simplicity, we write log for the logarithm in base 2. We call the reader's attention to the fact that, whenever we say that one entropy converges to another, it is always up to a logarithmic factor that depends only on the choice of cardinality of the alphabet.
The Shannon entropy of X is the expectation of the surprise of an occurrence,

$$H(X) = -\sum_{x} P(X = x) \log P(X = x). \tag{1}$$

The conditional Shannon entropy, H(Y|X), is the expectation over x of the entropy of the distribution P(Y|X = x),

$$H(Y|X) = \sum_{x} P(X = x)\, H(Y|X = x). \tag{2}$$

It is easy to derive the chain rule H(X, Y) = H(X) + H(Y|X): to get the average information contained in (X, Y), we may first get the average information contained in X, and add to it the average information of Y, given X.
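The chain rule can be checked numerically. The following is a minimal sketch, assuming a joint distribution stored as a nested list where `joint[i][j]` stands for P(X = x_i, Y = y_j); the function names and the 2x2 example distribution are hypothetical and serve only as an illustration.

```python
import math

def shannon_entropy(p):
    """Shannon entropy, in bits, of a distribution given as a list of probabilities."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def conditional_shannon_entropy(joint):
    """H(Y|X) in bits: the expectation over x of the entropy of P(Y | X = x)."""
    total = 0.0
    for row in joint:
        px = sum(row)  # marginal P(X = x)
        if px > 0:
            total += px * shannon_entropy([pxy / px for pxy in row])
    return total

# Hypothetical joint distribution, used only for illustration.
joint = [[0.4, 0.1],
         [0.2, 0.3]]

p_x = [sum(row) for row in joint]
h_xy = shannon_entropy([pxy for row in joint for pxy in row])
h_x = shannon_entropy(p_x)
h_y_given_x = conditional_shannon_entropy(joint)

# Chain rule: H(X, Y) = H(X) + H(Y|X), up to floating-point error.
chain_rule_gap = abs(h_xy - (h_x + h_y_given_x))
```

Here `chain_rule_gap` should be zero up to rounding, since the two sides of the chain rule are algebraically identical.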
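The Tsallis entropy [1] was first introduced in [2,3] and is defined, for a random variable X and a parameter α > 0 with α ≠ 1, by:

$$T_\alpha(X) = \frac{1}{\alpha - 1}\left(1 - \sum_{x} P(X = x)^{\alpha}\right). \tag{3}$$

It is straightforward to show that, when the parameter α converges to 1, the value of the entropy converges to the Shannon entropy.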
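The convergence to the Shannon entropy as α → 1 can be observed numerically. The sketch below assumes the standard form T_α(X) = (1 − Σ_x P(X = x)^α)/(α − 1); with this form the limit is the Shannon entropy measured in nats, which differs from the base-2 value by the constant factor ln 2. The function names and the sample distribution are hypothetical.

```python
import math

def tsallis_entropy(p, alpha):
    """Tsallis entropy (1 - sum_x p(x)^alpha) / (alpha - 1), for alpha > 0, alpha != 1."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    return (1.0 - sum(q ** alpha for q in p if q > 0)) / (alpha - 1.0)

def shannon_entropy_nats(p):
    """Shannon entropy using the natural logarithm."""
    return -sum(q * math.log(q) for q in p if q > 0)

# Hypothetical distribution, used only for illustration.
p = [0.5, 0.25, 0.25]

# Distance between the Tsallis entropy just above alpha = 1 and the Shannon entropy in nats.
limit_gap = abs(tsallis_entropy(p, 1.0 + 1e-6) - shannon_entropy_nats(p))

# The same Shannon entropy expressed in bits (division by ln 2).
shannon_bits = shannon_entropy_nats(p) / math.log(2)
```

The gap shrinks linearly with the distance of α from 1, so evaluating at α = 1 ± ε for decreasing ε gives a quick sanity check of an implementation.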

Conditional Tsallis Entropy: Four Definitions
We consider three definitions for conditional Tsallis entropy that already exist in the literature and introduce a new proposal. All definitions consider a positive parameter α.
Definition 1. Let Z = (X, Y) be a random vector. One can define the following variants of conditional Tsallis entropy: One can easily verify that T α (X, Y) = T α (Y|X) + T α (X) and, therefore, it satisfies the chain rule.
3. Definition of S̃ α (Y|X) from [16] (Definition 2.10)
The first definition presented proposes that the conditional Tsallis entropy should be weighted by the probability of sampling X = x raised to the power α, while the second one proposes weighting only by the probability of sampling X = x. Therefore, for the first definition presented, the value of α largely affects the value of the conditional Tsallis entropy. The idea for the third proposal is to distribute the influence of the parameter α evenly over the entire joint distribution.
Next, we present another possible definition of the conditional Tsallis entropy. This definition is based on Definition III.6 of [38] and captures the intuitive notion of defining the conditional entropy by taking the maximum over all possible marginal distributions. Note that this definition is analogous to an existing one for the Rényi entropy; however, as we will show later, this proposal does not satisfy some of the expected basic properties.
We opted to use different notations for the variants of the conditional Tsallis entropy in the last definition, to better distinguish them in the rest of the paper. In particular, we follow the same approach as in Reference [38].
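To make the three variants from the literature concrete, the following sketch implements them under stated assumptions: the first variant weights the entropy of each P(Y|X = x) by P(X = x)^α, the second by P(X = x), and the third is assumed to take the form (1 − Σ_{x,y} P(x, y)^α / Σ_x P(x)^α)/(α − 1). These forms are a reading of the verbal descriptions in this section, not a quotation of Definition 1, and all function names are hypothetical. The fourth variant is left out of the sketch.

```python
def tsallis(p, alpha):
    """Unconditional Tsallis entropy of a list of probabilities (alpha > 0, alpha != 1)."""
    return (1.0 - sum(q ** alpha for q in p if q > 0)) / (alpha - 1.0)

def conditional_rows(joint):
    """Yield (P(X = x), distribution P(Y | X = x)) for every x with P(X = x) > 0."""
    for row in joint:
        px = sum(row)
        if px > 0:
            yield px, [pxy / px for pxy in row]

def t_cond(joint, alpha):
    """Assumed first variant: weights P(X = x)^alpha."""
    return sum(px ** alpha * tsallis(cond, alpha) for px, cond in conditional_rows(joint))

def s_cond(joint, alpha):
    """Assumed second variant: plain expectation, weights P(X = x)."""
    return sum(px * tsallis(cond, alpha) for px, cond in conditional_rows(joint))

def s_tilde_cond(joint, alpha):
    """Assumed third variant: alpha acts on the whole joint distribution."""
    num = sum(pxy ** alpha for row in joint for pxy in row if pxy > 0)
    den = sum(sum(row) ** alpha for row in joint if sum(row) > 0)
    return (1.0 - num / den) / (alpha - 1.0)

# Hypothetical joint distribution: joint[i][j] = P(X = x_i, Y = y_j).
joint = [[0.4, 0.1],
         [0.2, 0.3]]
flat = [pxy for row in joint for pxy in row]
p_x = [sum(row) for row in joint]

# Chain rule for the first variant: T(X, Y) = T(X) + T(Y|X).
chain_gaps = [abs(tsallis(flat, a) - (tsallis(p_x, a) + t_cond(joint, a)))
              for a in (0.5, 2.0)]
```

Under these assumptions the chain rule holds exactly for the first variant, because Σ_x P(x)^α Σ_y P(y|x)^α = Σ_{x,y} P(x, y)^α, while the other two variants do not satisfy it in general.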
The following expressions will be useful later.
Theorem 1. Let Z = (X, Y) be a random vector. The following identities are true:

Comparison of the Definitions
We now compare the above four definitions of the conditional Tsallis entropy by checking whether or not each definition satisfies some common properties of an entropy measure. In the next theorem, we report two simple facts with straightforward proofs. We leave the details for the interested reader to check.

Theorem 2. For any fixed joint probability distribution P(X, Y), (i) T α (Y|X), S α (Y|X) and S̃ α (Y|X), as functions of α, are continuous and differentiable; (ii) T̃ α (Y|X), as a function of α, is continuous for all α ≠ 1.
The following results provide the possible comparisons (in terms of values) between the proposed definitions. For the sake of organization, we split the comparison by types of entropy.
First we compare T α (Y|X) with S α (Y|X).
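Theorem 3. For all joint probability distributions P(X, Y) and for every α > 0,

$$T_\alpha(Y|X) \ge S_\alpha(Y|X) \text{ if } \alpha < 1, \qquad T_\alpha(Y|X) = S_\alpha(Y|X) \text{ if } \alpha = 1, \qquad T_\alpha(Y|X) \le S_\alpha(Y|X) \text{ if } \alpha > 1.$$

Proof. Consider first the case α < 1. In this case, we have that P(X = x)^α ≥ P(X = x) for every x. Thus, since each T α (Y|X = x) is non-negative, weighting these values by P(X = x)^α gives a sum at least as large as weighting them by P(X = x), that is, T α (Y|X) ≥ S α (Y|X). For the case α = 1, see the proof of Theorem 8. The case α > 1 is similar to the previous one, but this time the conclusion follows since, for α > 1, P(X = x)^α ≤ P(X = x).
In the next theorem we provide the comparison between T̃ α (Y|X) and S α (Y|X).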
The proof of the case α > 1 is similar to the previous one, but this time the conclusion follows from the fact that, for α > 1, T̃ α (Y|X) = max x T α (Y|X = x).
As a consequence of the two previous results and the definitions, we can derive the relation between T α and T̃ α .
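The comparison in Theorem 3 rests on the fact that P(X = x)^α ≥ P(X = x) for α < 1, with the reverse inequality for α > 1. The sketch below checks the resulting ordering on randomly generated joint distributions, assuming that T α (Y|X) weights the entropy of each P(Y|X = x) by P(X = x)^α and that S α (Y|X) weights it by P(X = x); the helper names, the random generator, and the chosen values of α are hypothetical.

```python
import random

def tsallis(p, alpha):
    """Unconditional Tsallis entropy (alpha > 0, alpha != 1)."""
    return (1.0 - sum(q ** alpha for q in p if q > 0)) / (alpha - 1.0)

def conditional_rows(joint):
    """Yield (P(X = x), distribution P(Y | X = x)) for every x with P(X = x) > 0."""
    for row in joint:
        px = sum(row)
        if px > 0:
            yield px, [pxy / px for pxy in row]

def t_cond(joint, alpha):
    """Assumed form: weights P(X = x)^alpha."""
    return sum(px ** alpha * tsallis(c, alpha) for px, c in conditional_rows(joint))

def s_cond(joint, alpha):
    """Assumed form: weights P(X = x)."""
    return sum(px * tsallis(c, alpha) for px, c in conditional_rows(joint))

def random_joint(n, m, rng):
    """Random n-by-m joint distribution with strictly positive entries."""
    weights = [[rng.random() + 1e-3 for _ in range(m)] for _ in range(n)]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

rng = random.Random(0)  # fixed seed so the run is reproducible
violations = 0
for _ in range(200):
    joint = random_joint(3, 4, rng)
    for alpha in (0.3, 0.7, 1.5, 3.0):
        gap = t_cond(joint, alpha) - s_cond(joint, alpha)
        # Expected: gap >= 0 for alpha < 1 and gap <= 0 for alpha > 1.
        if (alpha < 1 and gap < -1e-12) or (alpha > 1 and gap > 1e-12):
            violations += 1
```

A random search of this kind cannot prove the inequality, but a non-zero `violations` count would point to an error in an implementation of either variant.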

Corollary 1. For all joint probability distributions P(X, Y) and for every α > 0, The proof follows directly from Theorems 3 and 4.
Now, we derive the relation between S̃ α (Y|X) and T̃ α (Y|X).
Theorem 5. For all joint probability distributions P(X, Y) and for every α > 0, Proof. Consider first the case α < 1. Proving that S̃ α (Y|X) ≥ T̃ α (Y|X), by definition, is the same as proving: As α < 1, we have that 1/(α − 1) < 0. Thus, proving Equation (27) is the same as proving that: Now, the result follows by observing that the last inequality is true, since, for α < 1 and for every x, we have that The case α > 1 is proved in a similar manner.
Now, we derive the relation between T α (Y|X) and S̃ α (Y|X).

Theorem 6. For all joint probability distributions P(X, Y) and for every α > 0, Proof. Consider first the case α < 1. Thus, The result follows by observing that the last inequality is true, since for α < 1, we have that: and consequently, The proof of the case α > 1 is similar to the previous one.
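As a sketch of how the sum Σ_x P(X = x)^α can govern a comparison of this kind, assume the two forms written below for T α (Y|X) and S̃ α (Y|X). These forms are an assumption: they are a reading of the verbal descriptions in Section 3, not a quotation of the definitions. Under that assumption the two quantities differ exactly by the factor Σ_x P(X = x)^α.

```latex
% Assumed forms (hypothetical reading of the descriptions in Section 3):
\begin{aligned}
T_\alpha(Y|X) &= \frac{1}{\alpha-1}\Big(\sum_x P(x)^\alpha - \sum_{x,y} P(x,y)^\alpha\Big),\\
\tilde{S}_\alpha(Y|X) &= \frac{1}{\alpha-1}\Big(1 - \frac{\sum_{x,y} P(x,y)^\alpha}{\sum_x P(x)^\alpha}\Big).
\end{aligned}
% Factoring out the sum over x in the first expression gives
T_\alpha(Y|X) = \Big(\sum_x P(x)^\alpha\Big)\,\tilde{S}_\alpha(Y|X).
% For \alpha < 1: \sum_x P(x)^\alpha \ge 1, so T_\alpha(Y|X) \ge \tilde{S}_\alpha(Y|X).
% For \alpha > 1: \sum_x P(x)^\alpha \le 1, so T_\alpha(Y|X) \le \tilde{S}_\alpha(Y|X).
```

Both quantities are non-negative under these forms, so multiplying by a factor that is at least 1 (for α < 1) or at most 1 (for α > 1) fixes the direction of the inequality.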
Finally, we show that the values of S α and S̃ α are incomparable in the sense that there are probability distributions for which S α is greater than S̃ α and there are probability distributions for which S̃ α is greater than S α .
Theorem 7. The values of S α (Y|X) and of S̃ α (Y|X) are incomparable, i.e., for each n ≥ 2 and α ≠ 1 Proof. For Statement (32) and α < 1, consider the following joint probability distribution:

Properties of the Conditional Tsallis Entropies
In this section, we investigate some properties of the proposals considered. In particular, we show that there are probability distributions and values α ≠ 1 for which the conditional Tsallis entropies are larger than the unconditional Tsallis entropy.
Theorem 8. For any fixed joint probability distribution P(X, Y), where H(Y|X) is the conditional Shannon entropy. In general, it is not true that lim
Proof. The second equation is easy to derive directly from the definition of conditional probability and from Equation (2). Furthermore, using Equation (6) we can also easily obtain (using the previous derivation) that Equation (42) is also true. The third equation was proven in Reference [16]. Now, it is only left to prove the last statement of the theorem, i.e., in general From Equations (6) and (11) it is easy to check that The function T α (Y|x) depends on the conditional probabilities P(Y = y|X = x). Therefore, there are joint probability distributions P(X = x, Y = y), such that:
Contrary to the Shannon entropy, the value of any conditional Tsallis entropy may exceed the corresponding unconditional Tsallis entropy for all proposals.
Theorem 9. There are probability distributions P(X, Y) and values of α, such that: Proof. Consider the following joint probability distribution: For this distribution we have:

Bounds on Conditional Tsallis Entropy
As mentioned in the Introduction, one of the properties of the (conditional) Shannon entropy for discrete variables is that it is bounded by the logarithm of the number of elements of the support of the distribution. Furthermore, it is well known that the unconditional Tsallis entropy is always between 0 and (m^(1−α) − 1)/(1 − α), where m is the number of elements in the support of the distribution. In this subsection, we derive bounds for the conditional Tsallis entropies based on the number of elements in the support of each distribution.
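The bounds on the unconditional Tsallis entropy can be illustrated numerically. The maximum over distributions with m outcomes is attained by the uniform distribution, where the entropy equals (m^(1−α) − 1)/(1 − α). The sketch below checks this value and tests the bounds on random distributions; the function names, the seed, and the values of m and α are hypothetical choices.

```python
import random

def tsallis(p, alpha):
    """Unconditional Tsallis entropy (alpha > 0, alpha != 1)."""
    return (1.0 - sum(q ** alpha for q in p if q > 0)) / (alpha - 1.0)

def tsallis_upper_bound(m, alpha):
    """Value of the Tsallis entropy of the uniform distribution over m outcomes."""
    return (m ** (1.0 - alpha) - 1.0) / (1.0 - alpha)

def random_distribution(m, rng):
    """Random distribution over m outcomes."""
    weights = [rng.random() for _ in range(m)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(1)  # fixed seed so the run is reproducible
m = 5
uniform = [1.0 / m] * m

uniform_gap = 0.0      # largest distance between T(uniform) and the closed form
within_bounds = True   # whether every sampled entropy lies in [0, bound]
for alpha in (0.5, 2.0, 3.0):
    bound = tsallis_upper_bound(m, alpha)
    uniform_gap = max(uniform_gap, abs(tsallis(uniform, alpha) - bound))
    for _ in range(100):
        value = tsallis(random_distribution(m, rng), alpha)
        if value < -1e-12 or value > bound + 1e-12:
            within_bounds = False
```

For α = 2 and m = 2, for instance, the bound evaluates to 1/2, which matches 1 − (1/4 + 1/4) computed directly from the definition.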
Theorem 10. Let Z = (X, Y) be any joint random vector defined over sets of size m each. Then, Moreover all of these lower and upper bounds may be reached by suitable probability distributions P(X, Y).
Proof. The Inequalities (57) follow from the fact that S α (Y|X) is the expectation of the unconditional Tsallis entropy. For Inequalities (58), recall that Equation (10) can be written, for α < 1, as Equation (12). Note that, for all x, the values T α (Y|X = x) are the (unconditional) Tsallis entropies of the marginal distribution, and are all defined in a set of cardinality m.
So, by definition of T̃ α , for some particular x, we have T̃ α (Y|X) = T α (Y|X = x). The case α > 1 is similar. So, independently of α, for all probability distributions P(X) and P(Y) defined over sets with m elements, we have 0 ≤ T̃ α (Y|X) ≤ (m^(1−α) − 1)/(1 − α), since the same bound applies to the unconditional version and to any of its marginal distributions.
Theorem 11. Let Z = (X, Y) be any joint random vector defined over sets of size m each. Then, For α < 1, in general, the inequality does not hold.
Theorem 12. Let Z = (X, Y) be any joint random vector defined over sets of size m each. Then, Proof. The result follows directly from the Inequalities (26) and (58).
We conjecture that the above theorem also holds for α < 1. For example, the inequality is true for every uniform probability distribution over n variables.
We now show that, for any fixed joint probability distribution P(X, Y), three of the forms of conditional Tsallis entropy studied in this paper are non-increasing functions of α. First, we state a simple theorem.
1. T α (Y|X) is a non-increasing function of α.
2. S α (Y|X) is a non-increasing function of α.
Proof.
1. First consider the case α > 1, and consider dT α (Y|X)/dα, the derivative of the function T α (Y|X) with respect to α: It is easy to see that, since α > 1, dT α (X)/dα < 0. Therefore, the function T α (Y|X) is a non-increasing function of α.
Consider now the case α < 1 and assume that α, α′ are such that α < α′ < 1. In order to prove that T α (Y|X) is non-increasing we have to show that T α (X) ≥ T α′ (X), i.e.: So, the last inequality is true.
2. This part of the result follows from the fact that S α (Y|X) is the expectation of unconditional Tsallis entropies; see Equation (6).
3. Suppose that α > 1. The proof is a direct consequence of Equation (11) and Lemma 1. The case α < 1 can be proven in a similar way.
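The monotonic behaviour of the first two variants can be observed on a grid of values of α. The sketch below assumes that T α (Y|X) weights the entropy of each P(Y|X = x) by P(X = x)^α and that S α (Y|X) weights it by P(X = x); the function names, the example distribution, and the grid are hypothetical.

```python
def tsallis(p, alpha):
    """Unconditional Tsallis entropy (alpha > 0, alpha != 1)."""
    return (1.0 - sum(q ** alpha for q in p if q > 0)) / (alpha - 1.0)

def conditional_rows(joint):
    """Yield (P(X = x), distribution P(Y | X = x)) for every x with P(X = x) > 0."""
    for row in joint:
        px = sum(row)
        if px > 0:
            yield px, [pxy / px for pxy in row]

def t_cond(joint, alpha):
    """Assumed form: weights P(X = x)^alpha."""
    return sum(px ** alpha * tsallis(c, alpha) for px, c in conditional_rows(joint))

def s_cond(joint, alpha):
    """Assumed form: weights P(X = x)."""
    return sum(px * tsallis(c, alpha) for px, c in conditional_rows(joint))

def is_non_increasing(values, tol=1e-12):
    """True when each value is at least the next one, up to a small tolerance."""
    return all(a >= b - tol for a, b in zip(values, values[1:]))

# Hypothetical joint distribution: joint[i][j] = P(X = x_i, Y = y_j).
joint = [[0.4, 0.1],
         [0.2, 0.3]]
alphas = [0.2, 0.5, 0.9, 1.1, 2.0, 3.0, 5.0]  # increasing grid that avoids alpha = 1

t_values = [t_cond(joint, a) for a in alphas]
s_values = [s_cond(joint, a) for a in alphas]
t_monotone = is_non_increasing(t_values)
s_monotone = is_non_increasing(s_values)
```

A grid check only samples the function, so it illustrates the statement rather than establishing it; a finer grid near α = 1 is a reasonable extension when testing an implementation.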
It is easy to show that S̃ α (Y|X) does not fulfill the property of the last theorem.
Proof. Consider the following joint probability distribution:
We developed a small application that, given two probability distributions, computes the values of all conditional Tsallis entropies considered in the paper. The application is self-contained and simple to use. It includes two use case examples that the reader can use to try the calculator. The interested reader can find it at the following link: http://gloss.di.fc.ul.pt/tryit/Tsallis (accessed on 28 October 2021).

Conclusions
In this paper, we studied the definitions for the conditional Tsallis entropy existing in the literature. We also considered a possible alternative definition for it. This new proposal is a natural approach to consider as a possible definition. It defines the conditional value as the maximum value of all marginal distributions. Due to this fact, and similar to what happens with the Rényi entropy, this definition was also analyzed, although it was never considered in the literature before. The relationships between the four definitions, described in this work, are summarized in Figure 1. In our understanding, a proposal for conditional Tsallis entropy would be expected to satisfy the following properties:
1. Its value should not exceed the value of the unconditional Tsallis entropy;
2. Convergence to Shannon entropy as the parameter α tends to 1;
3. Its value should be between 0 and the upper bound of the unconditional version.
In Table 1, we summarize the properties that the four proposals have (we also added the property of being a non-increasing function of α). To conclude, we can say that none of the proposals fulfills all of the properties. The definition T α (Y|X) is the candidate that fulfills the most properties. For future work, since the definitions focus on different aspects of the entropy, it would be important to study this area and its possible applications in more depth, aiming to develop a theory that identifies the best proposal for each area, or eventually to present an ultimate version of the conditional Tsallis entropy that satisfies all of the desirable properties.