Information Distances versus Entropy Metric

Abstract: Information distance has become an important tool in a wide variety of applications. Various types of information distance have been proposed over the years. These information distance measures differ from the entropy metric: the former are based on Kolmogorov complexity and the latter on Shannon entropy. However, for any computable probability distribution, the expected value of Kolmogorov complexity equals the Shannon entropy up to an additive constant. We study whether a similar relationship holds between entropy and information distance. We also study the relationship between entropy and the normalized versions of information distances.


Introduction
Information distance [1] is a universal distance measure between individual objects based on Kolmogorov complexity. It is the length of a shortest program that transforms one object into the other. Since the theory of information distance was proposed, various information distance measures have been introduced. The normalized versions of information distances [2] were introduced for measuring similarity between sequences. The min distance and its normalized version, which do not satisfy the triangle inequality, were presented in [3,4]. The time-bounded version of information distance [5] has been used for studying the computability properties of the normalized information distances. The safe approximability of the normalized information distance has been discussed in [6]. Since the normalized information distance is uncomputable, two practical distance measures, the normalized compression distance and the Google similarity distance, have been presented [7][8][9][10][11]. These distance measures have been successfully applied to bioinformatics [10], music clustering [7][8][9], linguistics [2,12], plagiarism detection [13], question answering [3,4,14] and many other areas.
As mentioned in [1], information distance should be contrasted with the entropy metric. The former is based on Kolmogorov complexity and the latter on Shannon entropy. Various relations between Shannon entropy and Kolmogorov complexity are known [15][16][17]. It is well known that for any computable probability distribution the expected value of Kolmogorov complexity equals the Shannon entropy up to an additive constant [18,19]. Linear inequalities that are valid for Shannon entropy are also valid for Kolmogorov complexity, and vice versa [20]. Many notions have both a Shannon entropy based version and a Kolmogorov complexity based version. Hence, many similar relationships between entropy based notions and Kolmogorov complexity based notions have been proposed. Relations between time-bounded entropy measures and time-bounded Kolmogorov complexity have been proposed in [21]. Relations between Shannon mutual information and algorithmic (Kolmogorov) mutual information have been proposed in [18]. Relations between entropy based cryptographic security and Kolmogorov complexity based cryptographic security have been studied in [22][23][24][25]. One-way functions have been studied with both time-bounded entropy and time-bounded Kolmogorov complexity [26]. However, the relationship between information distance and entropy has not been studied. In this paper, we study whether a similar relationship holds between information distance and the entropy metric. We also analyze the validity of the relationship between normalized information distance and the entropy metric.
The rest of this paper is organized as follows: In Section 2, some basic notions are reviewed. In Section 3, we study the relationship between information distance and the entropy metric. In Section 4, we study the relationship between normalized information distance and the entropy metric. Finally, conclusions are stated in Section 5.

Preliminaries
In this paper, |x| denotes the length of the string x and log(·) denotes the function log_2(·).

Kolmogorov Complexity
Kolmogorov complexity was introduced independently by Solomonoff [27] and Kolmogorov [28] and later by Chaitin [29]. Some basic notions of Kolmogorov complexity are given below. For more details, see [16,17]. We use the prefix-free definition of Kolmogorov complexity. A string x is a proper prefix of a string y if y = xz for some z ≠ ε, where ε is the empty string. A set of strings A is prefix-free if there are no two strings x and y in A such that x is a proper prefix of y. For convenience, we use prefix-free Turing machines, i.e., Turing machines with a prefix-free domain.
Let F be a fixed prefix-free optimal universal Turing machine. The conditional Kolmogorov complexity K(y|x) of y given x is defined by K(y|x) = min{|p| : F(p, x) = y}, where F(p, x) is the output of the program p with auxiliary input x when it is run on the machine F.
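The prefix-free condition can be checked mechanically for a finite set of strings. The following is a minimal sketch; the function names `is_proper_prefix` and `is_prefix_free` are illustrative choices and do not come from the cited literature.

```python
def is_proper_prefix(x: str, y: str) -> bool:
    """Return True if y = xz for some non-empty string z."""
    return len(x) < len(y) and y.startswith(x)


def is_prefix_free(strings) -> bool:
    """Return True if no string in the collection is a proper prefix of another."""
    items = list(set(strings))
    # Quadratic scan over all ordered pairs; fine for small illustrative sets.
    return not any(
        is_proper_prefix(x, y) for x in items for y in items
    )
```

For example, the set {0, 10, 110, 111} passes the check, while {0, 01} does not, because 0 is a proper prefix of 01.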
The (unconditional) Kolmogorov complexity K(y) of y is defined as K(y|ε).

Shannon Entropy
Shannon entropy [30] is a measure of the average uncertainty in a random variable. Some basic notions of entropy are given here. For more details, see [16,18]. For simplicity, all random variables in this paper take values in sets of finite strings.
Let X, Y be two random variables with a computable joint probability distribution f(x, y). The marginal distributions of X and Y are defined by f_1(x) = ∑_y f(x, y) and f_2(y) = ∑_x f(x, y), respectively.
The joint Shannon entropy of X and Y is defined as H(X, Y) = −∑_{x,y} f(x, y) log f(x, y). The Shannon entropy of X is defined as H(X) = −∑_x f_1(x) log f_1(x). The conditional Shannon entropy of Y given X is defined as H(Y|X) = −∑_{x,y} f(x, y) log f(y|x), where f(y|x) = f(x, y)/f_1(x). The mutual information between X and Y is defined as I(X; Y) = H(X) − H(X|Y).
Kolmogorov complexity and Shannon entropy are fundamentally different measures. However, for any computable probability distribution f, the Shannon entropy equals the expected value of the Kolmogorov complexity up to K(f) + O(1) [17][18][19]. Conditional Kolmogorov complexity and conditional Shannon entropy are related in the same way.
The following two Lemmas are Theorem 8.1.1 from [17] and Theorem 5 from [22], respectively.
Lemma 1. Let X be a random variable over 𝒳. For any computable probability distribution f(x) over 𝒳, 0 ≤ ∑_x f(x)K(x) − H(X) ≤ K(f) + O(1).
Lemma 2. Let X, Y be two random variables over 𝒳, 𝒴, respectively. For any computable probability distribution f(x, y) over 𝒳 × 𝒴, 0 ≤ ∑_{x,y} f(x, y)K(x|y) − H(X|Y) ≤ K(f) + O(1).
The following two Lemmas will be used in the next section.
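The entropy quantities used in this paper can be computed directly from a finite joint distribution. The sketch below assumes the joint distribution is given as a Python dict mapping pairs (x, y) to probabilities; this representation and all function names are illustrative assumptions, and logarithms are base 2 as in the Preliminaries.

```python
from collections import defaultdict
from math import log2


def marginals(joint):
    """Return the marginal distributions (f1 over x, f2 over y) of a joint pmf."""
    f1, f2 = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        f1[x] += p
        f2[y] += p
    return dict(f1), dict(f2)


def entropy(dist):
    """Shannon entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def cond_entropy_y_given_x(joint):
    """H(Y|X) = -sum f(x,y) log f(y|x), with f(y|x) = f(x,y) / f1(x)."""
    f1, _ = marginals(joint)
    return -sum(p * log2(p / f1[x]) for (x, _), p in joint.items() if p > 0)


def cond_entropy_x_given_y(joint):
    """H(X|Y) = -sum f(x,y) log f(x|y), with f(x|y) = f(x,y) / f2(y)."""
    _, f2 = marginals(joint)
    return -sum(p * log2(p / f2[y]) for (_, y), p in joint.items() if p > 0)


def mutual_information(joint):
    """I(X;Y) = H(X) - H(X|Y)."""
    f1, _ = marginals(joint)
    return entropy(f1) - cond_entropy_x_given_y(joint)
```

On a small joint distribution one can check numerically that H(Y|X) = H(X, Y) − H(X) and that H(X) − H(X|Y) = H(Y) − H(Y|X), as the definitions require.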

Lemma 3.
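There are four positive integers a, b, c, d such that max(a, b) + max(c, d) > max(a + c, b + d).

Lemma 4.
There are four positive integers a, b, c, d such that min(a, b) + min(c, d) < min(a + c, b + d).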

Information Distance Versus Entropy
A metric on a set X is a function d : X × X → R^+ having the following properties for every x, y, z ∈ X: (1) d(x, y) = 0 if and only if x = y; (2) d(x, y) = d(y, x); (3) d(x, y) ≤ d(x, z) + d(z, y). Here, entropy metric means a metric on the set of all random variables over a set. The function d(X, Y) = H(X|Y) + H(Y|X) is a metric [16]. It is easy to see that d(X, Y) = max{H(X|Y), H(Y|X)} is also a metric.
Information distance E_max(x, y) [1], the length of a shortest program computing y from x and vice versa, is defined as E_max(x, y) = min{|p| : F(p, x) = y, F(p, y) = x}.
It is shown in [1] that, up to an additive logarithmic term, the equality E_max(x, y) = max(K(x|y), K(y|x)) holds. So E_max(x, y) is called the max distance between x and y.
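Both entropy metrics can be evaluated numerically for random variables with a known finite joint distribution. The sketch below assumes a joint pmf over tuples of outcomes, stored as a dict, and uses the identity H(A|B) = H(A, B) − H(B); the function names `d_sum` and `d_max` are illustrative labels for the two metrics above, not notation from [16].

```python
from collections import defaultdict
from math import log2


def marginal(joint, idx):
    """Marginalize a joint pmf over outcome tuples onto the coordinates in idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return dict(out)


def entropy(dist):
    """Shannon entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def cond_entropy(joint, a, b):
    """H(X_a | X_b), computed as H(X_a, X_b) - H(X_b)."""
    return entropy(marginal(joint, (a, b))) - entropy(marginal(joint, (b,)))


def d_sum(joint, a, b):
    """Entropy metric H(X_a|X_b) + H(X_b|X_a)."""
    return cond_entropy(joint, a, b) + cond_entropy(joint, b, a)


def d_max(joint, a, b):
    """Entropy metric max{H(X_a|X_b), H(X_b|X_a)}."""
    return max(cond_entropy(joint, a, b), cond_entropy(joint, b, a))
```

With three variables in one joint distribution, for instance two fair bits and their XOR, the triangle inequality d(X, Z) ≤ d(X, Y) + d(Y, Z) can be checked for both functions.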
We show the following relationship between max distance and the entropy metric.
Moreover, from Lemma 2, we get
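One way to bound the expected max distance is sketched below. It is an illustrative derivation under two assumptions: that Lemma 2 has the form 0 ≤ ∑_{x,y} f(x, y)K(x|y) − H(X|Y) ≤ K(f) + O(1), and that the additive logarithmic term in E_max(x, y) = max(K(x|y), K(y|x)) is ignored. It is not necessarily the exact statement proved in this paper.

```latex
\begin{align*}
\sum_{x,y} f(x,y)\,E_{\max}(x,y)
  &\ge \sum_{x,y} f(x,y)\,K(x|y) \ge H(X|Y), \\
\sum_{x,y} f(x,y)\,E_{\max}(x,y)
  &\ge \sum_{x,y} f(x,y)\,K(y|x) \ge H(Y|X), \\
\sum_{x,y} f(x,y)\,E_{\max}(x,y)
  &\le \sum_{x,y} f(x,y)\,\bigl(K(x|y) + K(y|x)\bigr)
   \le H(X|Y) + H(Y|X) + 2K(f) + O(1).
\end{align*}
```

The first two lines use max(K(x|y), K(y|x)) ≥ K(x|y) and ≥ K(y|x), and together give the lower bound max(H(X|Y), H(Y|X)). The third line uses the fact that the maximum of two non-negative numbers is at most their sum.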

Remark 1.
From the above Theorem, the inequality ∑_{x,y} f(x, y)E_max(x, y) ≥ max(H(X|Y), H(Y|X)) holds. Unfortunately, the inequality ∑_{x,y} f(x, y)E_max(x, y) ≤ max(H(X|Y), H(Y|X)) does not hold. For instance, let the joint probability distribution f(x, y) of X and Y be f(x_1, y_1) = f(x_2, y_2) = 0.5, and let a = K(x_1|y_1), b = K(y_1|x_1), c = K(x_2|y_2) and d = K(y_2|x_2) such that a ≠ b and d ≠ c. Assume, without loss of generality, that a > b and d > c; then from Lemma 3, we have ∑_{x,y} f(x, y)E_max(x, y) > max(∑_{x,y} f(x, y)K(x|y), ∑_{x,y} f(x, y)K(y|x)).
This means that, in some cases, we get the inequality ∑_{x,y} f(x, y)E_max(x, y) > max(∑_{x,y} f(x, y)K(x|y), ∑_{x,y} f(x, y)K(y|x)) ≥ max(H(X|Y), H(Y|X)).
From the above results, we know that the relationship ∑_{x,y} f(x, y)E_max(x, y) ≈ max(H(X|Y), H(Y|X)) does not hold in general.
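The gap in this remark comes from the difference between the expectation of a maximum and the maximum of expectations. The following sketch illustrates it for the two-point distribution of the remark. Because K is uncomputable, the numbers below are hypothetical stand-ins for K(x_i|y_i) and K(y_i|x_i), and the function name `expectation_gap` is an illustrative choice.

```python
def expected(values, probs):
    """Expected value of a finite list of values under the given probabilities."""
    return sum(p * v for p, v in zip(probs, values))


def expectation_gap(pairs, probs):
    """Compare E[max], max of expectations, E[min] and min of expectations.

    pairs[i] = (u_i, v_i) plays the role of (K(x_i|y_i), K(y_i|x_i)).
    These are hypothetical numbers, not real Kolmogorov complexities.
    """
    exp_u = expected([u for u, _ in pairs], probs)
    exp_v = expected([v for _, v in pairs], probs)
    exp_max = expected([max(u, v) for u, v in pairs], probs)
    exp_min = expected([min(u, v) for u, v in pairs], probs)
    return exp_max, max(exp_u, exp_v), exp_min, min(exp_u, exp_v)


# Two equally likely pairs with a > b and d > c, as in the remark.
probs = [0.5, 0.5]
pairs = [(3, 1), (1, 3)]  # (a, b) = (3, 1) and (c, d) = (1, 3)
exp_max, max_of_exp, exp_min, min_of_exp = expectation_gap(pairs, probs)
```

With these stand-in values the expected maximum strictly exceeds the larger of the two expectations, and the expected minimum is strictly below the smaller one, which mirrors the situation for the min distance.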
Because the mutual information between X and Y is defined as I(X; Y) = H(X) − H(X|Y), we have the following result.
Corollary 1. Let X, Y be two random variables with a computable joint probability distribution f(x, y), then
Min distance E_min(x, y) [3,4] is defined as E_min(x, y) = min{|p| : F(p, x, z) = y, F(p, y, r) = x, |p| + |z| + |r| ≤ E_max(x, y)}.
It is shown in [3,4] that the equality E_min(x, y) = min(K(x|y), K(y|x)) holds up to an additive term O(log(|x| + |y|)). Then we have the following relationship between min distance and the entropy metric.
Theorem 2. Let X, Y be two random variables with a computable joint probability distribution f(x, y), then
Remark 2. From the above Theorem, the inequality ∑_{x,y} f(x, y)E_min(x, y) ≤ min(H(X|Y), H(Y|X)) holds. Unfortunately, from Lemma 4, we know that the inequality ∑_{x,y} f(x, y)E_min(x, y) ≥ min(H(X|Y), H(Y|X)) does not hold.

Corollary 2.
Let X, Y be two random variables with a computable joint probability distribution f(x, y), then
Sum distance E_sum(x, y) [1] is defined as E_sum(x, y) = K(x|y) + K(y|x).
Then we have the following relationship between sum distance and entropy measure.
Theorem 3. Let X, Y be two random variables with a computable joint probability distribution f(x, y), then
By the definition of E_sum, ∑_{x,y} f(x, y)E_sum(x, y) = ∑_{x,y} f(x, y)K(x|y) + ∑_{x,y} f(x, y)K(y|x).
Then, from Lemma 1, we can get
Corollary 3. Let X, Y be two random variables with a computable joint probability distribution f(x, y), then
From the above results we know that, when f is given, up to an additive constant, ∑_{x,y} f(x, y)E_sum(x, y) = H(X|Y) + H(Y|X).
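The relation between the expected sum distance and the entropy metric H(X|Y) + H(Y|X) can be sketched as follows. The derivation assumes that Lemma 2 has the form 0 ≤ ∑_{x,y} f(x, y)K(x|y) − H(X|Y) ≤ K(f) + O(1) and applies it once to each conditional term; it is an illustration under that assumption rather than a restatement of the theorem.

```latex
\begin{align*}
\sum_{x,y} f(x,y)\,E_{\mathrm{sum}}(x,y)
  &= \sum_{x,y} f(x,y)\,K(x|y) + \sum_{x,y} f(x,y)\,K(y|x), \\
H(X|Y) + H(Y|X)
  &\le \sum_{x,y} f(x,y)\,E_{\mathrm{sum}}(x,y)
   \le H(X|Y) + H(Y|X) + 2K(f) + O(1).
\end{align*}
```

For a fixed computable f the term 2K(f) + O(1) is a constant, so the expected sum distance and H(X|Y) + H(Y|X) differ by at most an additive constant that depends only on f.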

Normalized Information Distance Versus Entropy
In this section, we establish relationships between entropy and the normalized versions of information distances.
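The normalized version e_max(x, y) [2] of E_max(x, y) is defined as e_max(x, y) = max(K(x|y), K(y|x)) / max(K(x), K(y)).
Theorem 4. Let X, Y be two random variables with a computable joint probability distribution f(x, y), then
Since max(K(x), K(y)) ≥ K(x) and max(K(x|y), K(y|x)) ≤ K(x|y) + K(y|x), we have K(x)e_max(x, y) ≤ K(x|y) + K(y|x).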

From Lemmas 1 and 2, we have
Corollary 4. Let X, Y be two random variables with a computable joint probability distribution f(x, y), then
The normalized version e_min(x, y) [3,4] of E_min(x, y) is defined as e_min(x, y) = min(K(x|y), K(y|x)) / min(K(x), K(y)).
Because e_min(x, y) ≤ e_max(x, y) for all x, y [3,4], the following Corollary follows directly from the above Theorem.
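The normalized distances in this section are uncomputable because K is. The normalized compression distance mentioned in the Introduction replaces K by the output length of a real compressor. The sketch below uses the commonly stated form NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), with zlib as the compressor C. The choice of zlib and the function names are assumptions made for illustration; they are not taken from the cited papers.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data, used as a stand-in for K."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)  # compress the concatenation xy
    return (cxy - min(cx, cy)) / max(cx, cy)
```

As a usage example, `ncd(a, a)` for a highly repetitive byte string `a` should come out smaller than `ncd(a, b)` for an unrelated string `b`. The exact values depend on the compressor and are only approximations of the uncomputable distance.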

Conclusions
As is well known, the Shannon entropy of a distribution is approximately equal to the expected Kolmogorov complexity, up to a constant term that depends only on the distribution [17]. We studied whether a similar relationship holds for information distance. Theorem 3 gave the analogous result for sum distance. We also gave some bounds for the expected value of other (normalized) information distances.