Algorithmic Relative Complexity

Information content and compression are tightly related concepts that can be addressed through both classical and algorithmic information theories, on the basis of Shannon entropy and Kolmogorov complexity, respectively. The definition of several entities in Kolmogorov's framework relies upon ideas from classical information theory, and these two approaches share many common traits. In this work, we expand the relations between these two frameworks by introducing algorithmic cross-complexity and relative complexity, counterparts of the cross-entropy and relative entropy (or Kullback-Leibler divergence) found in Shannon's framework. We define the cross-complexity of an object x with respect to another object y as the amount of computational resources needed to specify x in terms of y, and the relative complexity of x with respect to y as the compression power which is lost when adopting such a description for x, compared to the shortest representation of x. Properties of analogous quantities in classical information theory hold for these new concepts. As these notions are incomputable, a suitable approximation based upon data compression is derived to enable the application to real data, yielding a divergence measure applicable to any pair of strings. Example applications are outlined, involving authorship attribution and satellite image classification, as well as a comparison to similar established techniques.


Introduction
Both classical and algorithmic information theory aim at quantifying the information contained within an object. Shannon's classical information theory [1] takes a probabilistic approach. As it is based on the uncertainty of the outcomes of random variables, it cannot describe the information content of an isolated object if no a priori knowledge is available. The primary concept of algorithmic information theory is instead the information content of an individual object, which is a measure of how difficult it is to specify how to construct or calculate that object. This notion is also known as Kolmogorov complexity [2]. This area of study allowed formal definitions of concepts which were previously vague, such as randomness, Occam's razor, simplicity and complexity. The theoretical frameworks of classical and algorithmic information theory are similar, and many concepts exist in both, sharing various properties (a detailed overview can be found in [2]).
In this paper we introduce the concepts of cross-complexity and relative complexity, the algorithmic versions of cross-entropy and relative entropy (also known as Kullback-Leibler divergence). For any two strings x and y, the cross-complexity is the amount of computational resources needed to specify x only in terms of y, and the relative complexity is the compression power which is lost when using such a representation for x instead of its most compact one, which has length equal to its Kolmogorov complexity.
As the introduced concepts are incomputable, we rely on previous work which approximates the complexity of an object with the size of its compressed version, so as to quantify the shared information between two objects [3]. We derive a similarity measure from the concept of relative complexity that can be applied between any two strings.
Previously, a correspondence between relative entropy and compression-based similarity measures was considered in [4] for static encoders, directly related to the probability distributions of random variables. Additionally, methods to compute the relative entropy between any two strings have been proposed by Ziv and Merhav [5] and Benedetto et al. [6]. The concept of relative complexity introduced in this work may be regarded as an expansion of [4], and the experiments on authorship attribution contained in [6] are repeated in this paper using the proposed distance, with better results.
This paper is organized as follows. We recall basic concepts of Shannon entropy and Kolmogorov complexity in Section 2, focusing on their shared properties and their relation with data compression. Section 3 introduces the algorithmic cross-complexity and relative complexity, while in Section 4 we define their computable approximations using compression-based techniques. Practical applications and comparisons with similar methods are reported in Section 5. We conclude in Section 6.

Shannon Entropy and Kolmogorov Complexity
The Shannon entropy in classical information theory [1] is an ensemble concept; it is a measure of the degree of ignorance about the outcomes of a random variable X with a given a priori probability distribution p(x) = P(X = x):

$$H(X) = -\sum_{x} p(x) \log p(x) \tag{1}$$

This definition can be interpreted as the average length in bits needed to encode the outcomes of X, which can be obtained, for example, through the Shannon-Fano code, to achieve compression. An approach of this nature, with probabilistic assumptions, does not provide the informational content of individual objects and their possible regularity. Instead, the Kolmogorov complexity K(x), or algorithmic complexity, evaluates an intrinsic complexity for any isolated string x, independently of any description formalism. In this work we consider the "prefix" algorithmic complexity of a binary string x, which is the size in bits (binary digits) of the shortest self-delimiting program q used as input by a universal Turing machine to compute x and halt:

$$K(x) = \min_{q \in Q_x} |q| \tag{2}$$

with Q_x being the set of instantaneous codes that generate x. One interpretation of K(x) is the quantity of information needed to recover x from scratch. The original formulation of this concept is independently due to Solomonoff [7], Kolmogorov [8], and Chaitin [9]. Strings exhibiting recurring patterns have low complexity, whereas the complexity of random strings is high and almost equals their own length. It is important to remark that K(x) is not a computable function of x. A formal link between entropy and algorithmic complexity has been established in the following theorem [2].

Theorem 1:
The expected Kolmogorov complexity of the code words x which are output of a random source X, i.e., the sum of their complexities weighted by their probabilities p(x), equals the statistical Shannon entropy H(X) of X, up to an additive constant. The following holds, if the set of outcomes of X is finite and each probability p(x) is computable:

$$H(X) \le \sum_{x} p(x) K(x) \le H(X) + K(p) + O(1) \tag{3}$$

where K(p) represents the complexity of the probability function itself. So for simple distributions the expected complexity approaches the entropy.
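The entropy that Theorem 1 relates to expected complexity can be computed directly from a distribution. The following minimal sketch does so and contrasts a symbol-frequency entropy estimate with a compressed size for a highly regular string. The function names are illustrative, and the use of zlib as a stand-in for the incomputable K(x) is an assumption of this sketch, not part of the theorem.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(probabilities):
    """Entropy in bits of a discrete distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def empirical_entropy(symbols):
    """Entropy in bits per symbol of the symbol frequencies observed in a string."""
    counts = Counter(symbols)
    total = len(symbols)
    return shannon_entropy(count / total for count in counts.values())


# A periodic string: its symbol frequencies look like a fair coin,
# yet its regularity makes it very compressible.
regular = "ab" * 5000
bits_by_entropy = empirical_entropy(regular) * len(regular)
# zlib is only a rough, computable proxy for K(x) (assumption of this sketch).
bits_by_compressor = 8 * len(zlib.compress(regular.encode("ascii"), 9))

print(bits_by_entropy, bits_by_compressor)
```

The frequency-based estimate ignores the order of the symbols, which is why it cannot capture the regularity of an individual string, whereas a compressor exploits it.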

Mutual Information and Other Correspondences
The conditional complexity K(x|y) of x given y quantifies the information needed to recover x if y is given as an auxiliary input to the computation. Note that if y carries information which is shared with x, K(x|y) will be considerably smaller than K(x). Conversely, if y gives no information at all about x, then K(x|y) = K(x) + O(1) and K(x,y) = K(x) + K(y), with the joint complexity K(x,y) being defined as the length of the shortest program which outputs x followed by y. For all these definitions, the desirable properties of the analogous quantities in classical information theory, i.e., the conditional entropy H(X|Y) of X given Y and the joint entropy H(X,Y) of X and Y, hold [2].
An important issue in the analysis of information content is the estimation of the amount of information shared by two objects. From Shannon's probabilistic point of view, this is done via the mutual information I(X,Y) between two random variables X and Y, defined in terms of entropy as:

$$I(X,Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y) \tag{4}$$

It is possible to obtain a similar estimation of shared information in the Kolmogorov complexity framework by defining the algorithmic mutual information between two strings x and y as:

$$I(x:y) = K(x) - K(x|y) = K(x) + K(y) - K(x,y) \tag{5}$$

This definition resembles (4) both in properties and nomenclature [2]: one important shared property is that if I(x:y) = 0, then

$$K(x,y) = K(x) + K(y) + O(1) \tag{6}$$

and x and y are, by definition, algorithmically independent. What is probably the greatest success of these concepts is enabling the ultimate estimation of shared information between two objects: the Normalized Information Distance (NID) [3]. The NID is a similarity metric minimizing any admissible metric, proportional to the length of the shortest program that computes x given y, as well as computing y given x. The distance computed on the basis of these considerations is, after normalization:

$$NID(x,y) = \frac{\max\{K(x|y), K(y|x)\}}{\max\{K(x), K(y)\}} = \frac{K(x,y) - \min\{K(x), K(y)\}}{\max\{K(x), K(y)\}} \tag{7}$$

where, in the right term of the equation, the relation between conditional and joint complexities is used to substitute the terms in the dividend. The NID is a metric, so its result is a non-negative quantity r in the domain 0 ≤ r ≤ 1, with r = 0 iff the objects are identical and r = 1 representing maximum distance between them.
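The value of this similarity measure between two strings x and y is directly related to the algorithmic mutual information, denoted I(x:y), with

$$NID(x,y) + \frac{I(x:y)}{\max\{K(x), K(y)\}} = 1$$

This can be shown, assuming K(x) ≤ K(y), the case K(x) > K(y) being symmetric, and up to an additive constant O(1), as follows:

$$NID(x,y) + \frac{I(x:y)}{K(y)} = \frac{K(x,y) - K(x)}{K(y)} + \frac{K(x) + K(y) - K(x,y)}{K(y)} = 1$$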
Another Shannon-Kolmogorov correspondence is the one between rate-distortion theory [1] and Kolmogorov structure functions [10], which aim at separating the meaningful (structural) information contained in an object from its random part (its randomness deficiency), characterized by less meaningful details and noise.

Compression-Based Approximations
As the complexity K(x) is not a computable function of x, a suitable approximation is defined by Li and Vitányi by considering it as the size of the ultimate compressed version of x, and a lower bound for what a real compressor can achieve. This allows approximating K(x) with C(x) = K(x) + k, i.e., the length of the compressed version of x obtained with any off-the-shelf lossless compressor C, plus an unknown constant k: the presence of k is required by the fact that it is not possible to estimate how close to the lower bound represented by K(x) this approximation is. The conditional complexity K(x|y) can also be estimated through compression [11], while the joint complexity K(x,y) is approximated by compressing the concatenation of x and y. Equation (7) can then be estimated through the Normalized Compression Distance (NCD) as follows:

$$NCD(x,y) = \frac{C(x,y) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}} \tag{8}$$

where C(x,y) represents the size of the file obtained by compressing the concatenation of x and y. The NCD can be explicitly computed between any two strings or files x and y and it represents how different they are. The conditions for the NCD to be a metric hold under certain assumptions [12]: in practice the NCD is a non-negative number 0 ≤ NCD ≤ 1 + e, with the e in the upper bound due to imperfections in the compression algorithms, usually assuming a value below 0.1 for most standard compressors [4]. The NCD has a characteristic data-driven, parameter-free approach that allows performing clustering, classification and anomaly detection on diverse data types [12,13].
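The NCD described above can be sketched in a few lines. The example below uses zlib as the off-the-shelf compressor C, which is a choice made for this illustration only; any lossless compressor could be substituted, and the function names are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """C(x): size in bytes of the zlib-compressed version of the input."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, with C(x,y) taken as the
    compressed size of the concatenation of x and y."""
    c_x = compressed_size(x)
    c_y = compressed_size(y)
    c_xy = compressed_size(x + y)
    return (c_xy - min(c_x, c_y)) / max(c_x, c_y)


text_a = b"the quick brown fox jumps over the lazy dog " * 50
text_b = b"lorem ipsum dolor sit amet consectetur " * 50

print(ncd(text_a, text_a))  # small: the second copy adds little new information
print(ncd(text_a, text_b))  # larger: the two texts share few patterns
```

The value obtained for identical inputs is small but typically not exactly zero, which reflects the imperfections of real compressors mentioned above.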

Cross-Entropy and Cross-Complexity
Let us start by recalling the definition of cross-entropy in Shannon's framework:

$$H(X \oplus Y) = -\sum_{i} p_X(i) \log p_Y(i)$$

with p_X(i) = P(X = i) and p_Y(i) = P(Y = i). The cross-entropy represents the expected number of bits needed to encode the outcomes of a variable X as if they were outcomes of another variable Y. Therefore, the set of outcomes of X is a subset of the outcomes of Y. This notion can be brought into the algorithmic framework to measure the computational resources needed to specify an object x in terms of another object y.

We introduce the cross-complexity K(x ⊕ y) of x given y as the length of the shortest description of x expressed only in terms of the shortest program y* which outputs y. Let S be the set of segments of y*. We use an oracle to determine which elements of S are self-delimiting programs which halt when fed to a reference universal prefix Turing machine U [14], so that U halts with such a segment of y* as input. Let the set of these halting programs be Y, and let the set of outputs of Y be Z, with Z = {U(u) : u ∈ Y}. If two different segments u_1 and u_2 give as output the same element of Z, i.e., U(u_1) = U(u_2), only the shorter one needs to be considered, since the total length is to be minimized. Finally, determine an integer n and the way to divide x into n segments x = x_1 x_2 … x_n, so that the sum of the lengths of the codes representing the segments is minimal. This way we can write x as a binary string preceded by a self-delimiting program of constant length c telling U how to interpret the next commands, followed by the code of each segment x_j: either 1 followed by a segment of y* that U expands into x_j, or 0 followed by x_j in self-delimiting form.

The cross-complexity K(x ⊕ y) is lower bounded by the plain Kolmogorov complexity K(x) of x (10), and reaches its […], as in the above definition, is equal to 0 (11). For the case x = y, computing K(x ⊕ x) implies reusing the shortest code x* of length K(x) which outputs x, hence K(x ⊕ x) = K(x) + O(1) (12). The cross-complexity K(x ⊕ y) differs from the conditional complexity K(x|y): in the former x is expressed in terms of a description tailored for y, whereas in the latter the object y is an auxiliary input that is given "for free" and does not count in the estimation of the computational resources needed to specify x.

Key properties of the cross-entropy hold for this definition of algorithmic cross-complexity: […], as the cross-complexity in (11). The identity H(X ⊕ Y) = H(X) if X = Y also holds up to an additive term (12). Note that the stronger statement, H(X ⊕ Y) = H(X) iff X = Y, does not hold in the algorithmic framework. Consider the case of x being a substring of y, with y* containing the shortest code x* to output x: then K(x ⊕ y) = K(x) + O(1) although x ≠ y. The cross-entropy H(X ⊕ Y) of X given Y and the entropy H(X) of X share the same upper bound log(N), where N is the number of possible outcomes of X, as algorithmic complexity and algorithmic cross-complexity. This property follows from the definition of algorithmic complexity and (10).

Relative Entropy and Relative Complexity
The definition of algorithmic relative complexity derives from the idea of relative entropy (or Kullback-Leibler divergence) related to two probability distributions X and Y. This divergence represents the expected difference in the number of bits required to code an outcome i of X when using an encoding based on Y, instead of X [15]:

$$D_{KL}(X \| Y) = \sum_{i} p_X(i) \log \frac{p_X(i)}{p_Y(i)} \tag{13}$$

The relative entropy is not a metric, as it is not symmetric and the triangle inequality does not hold [16]. What is more meaningful for our purposes is the definition of relative entropy expressed in terms of the difference between cross-entropy and entropy:

$$D_{KL}(X \| Y) = H(X \oplus Y) - H(X) \tag{14}$$

We define the relative complexity in terms of cross-complexity according to (14), replacing entropies by complexities. For two finite binary strings x and y the algorithmic relative complexity K(x||y) of x towards y is equal to the difference between the cross-complexity K(x ⊕ y) and the Kolmogorov complexity K(x):

$$K(x \| y) = K(x \oplus y) - K(x) \tag{15}$$

The relative complexity between x and y represents the compression power lost when compressing x by describing it only in terms of y, instead of using its most compact representation. We may also regard K(x||y), as for its counterpart in Shannon's framework, as a quantification of the distance between x and y. It is desirable that the key properties of (13) hold also for (15). As in (13), the algorithmic relative complexity K(x||y) is non-negative and K(x||x) = 0, both up to an additive constant, as a consequence of (10) and (12).
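The classical identity that the relative complexity mirrors, relative entropy as the difference between cross-entropy and entropy, can be checked numerically. The sketch below uses two arbitrary example distributions chosen for illustration; the function names are not taken from the paper.

```python
import math


def entropy(p):
    """Shannon entropy H(X) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Expected bits needed to encode outcomes of p with a code optimal for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """Relative entropy of p with respect to q, in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Arbitrary example distributions over three outcomes (illustrative values).
p = [0.5, 0.25, 0.25]
q = [0.7, 0.2, 0.1]

# Relative entropy equals cross-entropy minus entropy.
print(kl_divergence(p, q), cross_entropy(p, q) - entropy(p))
# The divergence is not symmetric.
print(kl_divergence(p, q), kl_divergence(q, p))
```

The second line of output illustrates why the relative entropy is not a metric: exchanging the two distributions changes its value.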

Computable Algorithmic Cross-Complexity
The incomputability of algorithmic cross-complexity and relative complexity is a direct consequence of the incomputability of their Kolmogorov complexity components. We once again rely on data compression to approximate the relative complexity, following the ideas contained in [12]. To encode a string x according to the description of another string y, one could first compress y and then use the patterns found in y to compress x. But with such an approach, it would be difficult to compare the cross-compression of x given y to the compression factor of x obtained through a standard compression algorithm. Consider compressors of the Lempel-Ziv family, which use dynamic dictionaries built on the fly as a string x is analyzed: it would not be fair to compare the compression of x achieved through such a dictionary to the cross-compression obtained by compressing x with the full, static dictionary extracted from y. To reach our goal we want instead to simulate the behaviour of a real compressor which processes x and y in parallel, exploiting on the fly the information and redundancies contained in y to compress x. Such a cross-compressor would keep the key idea of relative entropy, namely encoding the outcomes of a variable X using a code which is optimal for another random variable Y, and is implemented as follows.
Consider two strings x and y. C(x ⊕ y) then represents the size of x compressed by the dictionary generated from y, if a parallel processing of x and y is simulated. It is possible to create a unique dictionary for a string y as a hash table containing couples (key, value), where key is the position of y in which the pattern occurs the first time, and value contains the full pattern. Then C(x ⊕ y) can be computed by matching the patterns in x with the portions of the dictionary of y with key < p, where p is the current position in x. So, for two strings x and y with |x| < |y|, only the first |x| elements of y will be considered. We report in Tables 1 and 2 a practical example.

Table 1. An example of cross-compression. Extracted dictionaries and compressed versions of A and B, plus cross-compressions between A and B, computed with the algorithm reported in Figure 1.

[…] by not including the items from the dictionary in the representation of x rather than by including them, and by considering only subsets of the dictionary extracted from y, gradually expanding into the full dictionary of y.
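The dictionary-matching idea described in this section can be sketched as follows. This is not the pseudo-code of Figure 1, which is not reproduced here, but a simplified illustration under stated assumptions: sizes are counted in emitted tokens rather than bits, the key of a pattern is taken to be the index in y of its last symbol at first occurrence, and matching is greedy. All function names are hypothetical.

```python
def lzw_dictionary(y):
    """Parse y with LZW and map each multi-symbol pattern to the index in y
    at which it first becomes available (assumed meaning of 'key')."""
    patterns = {}
    w = ""
    for i, c in enumerate(y):
        if len(w + c) == 1 or (w + c) in patterns:
            w += c                      # keep extending the current match
        else:
            patterns[w + c] = i         # new pattern, complete once y[i] is read
            w = c
    return patterns


def lzw_size(x):
    """Number of codes emitted by plain LZW on x (stand-in for C(x))."""
    patterns = set()
    w = ""
    codes = 0
    for c in x:
        if len(w + c) == 1 or (w + c) in patterns:
            w += c
        else:
            codes += 1
            patterns.add(w + c)
            w = c
    return codes + (1 if w else 0)


def cross_compress_size(x, y):
    """Tokens needed to write x using only patterns of y with key < p,
    where p is the current position in x (stand-in for C(x (+) y))."""
    patterns = lzw_dictionary(y)
    max_len = max((len(pattern) for pattern in patterns), default=1)
    p = 0
    tokens = 0
    while p < len(x):
        step = 1                        # fall back to a single literal symbol
        for length in range(min(max_len, len(x) - p), 1, -1):
            candidate = x[p:p + length]
            if candidate in patterns and patterns[candidate] < p:
                step = length           # longest usable pattern from y
                break
        tokens += 1
        p += step
    return tokens


A = "abcabcabcabc"
B = "abababababab"
print(lzw_size(A), lzw_size(B))
print(cross_compress_size(A, B), cross_compress_size(B, A))
```

In this sketch, B expressed through the dictionary of A needs fewer tokens than A expressed through the dictionary of B. The token counts are specific to these simplifying assumptions and are not the values of Tables 1 and 2.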

Computable Algorithmic Relative Complexity
The computable relative complexity of a string x towards another string y is the length of x represented through the dictionary of a set of substrings of y, minus the length of the compressed version of x. So it is the excess in length of representing x using y over just representing x:

$$C(x \| y) = C(x \oplus y) - C(x) \tag{16}$$

with C(x ⊕ y) computed as described in the previous section and C(x) representing the length of x after being compressed by the LZW algorithm [17]. Finally, we introduce the approximated normalized relative complexity $\overline{C}(x \| y)$:

$$\overline{C}(x \| y) = \frac{C(x \oplus y) - C(x)}{|x| - C(x)} \tag{17}$$

The distance (17) ranges from 0 to 1 + e, representing respectively maximum and minimum similarity between x and y. The term e is due to (10), as C(x ⊕ y) can be greater than |x|, and it is of the order of O(log|x|).

Symmetric Relative Complexity
Kullback and Leibler themselves define their distance in a symmetric way, as $D_{KL}(X \| Y) + D_{KL}(Y \| X)$ [15]. We define a symmetric version of (17) as:

$$\overline{C}_s(x \| y) = \frac{1}{2}\,\overline{C}(x \| y) + \frac{1}{2}\,\overline{C}(y \| x) \tag{18}$$

In our normalized equation we divide both terms by 2 to keep the values between 0 and 1. For the strings A and B considered in Tables 1 and 2, we obtain the following estimations: […]. So B can be better expressed in terms of A than vice versa, and overall the strings are similar.
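The arithmetic of the normalized and symmetric relative complexity can be sketched as below. The normalization by |x| − C(x) is an assumption of this sketch, chosen because it yields 0 when the cross-compressed size equals C(x) and 1 when it equals |x|, consistently with the range stated in the text. The input sizes are illustrative numbers, not the values of Tables 1 and 2, and all sizes are assumed to be expressed in the same unit.

```python
def relative_complexity(c_cross, c_x):
    """Excess length of x represented through y over x compressed on its own."""
    return c_cross - c_x


def normalized_relative_complexity(c_cross, c_x, len_x):
    """Assumed normalization: 0 when c_cross == c_x, 1 when c_cross == len_x."""
    if len_x <= c_x:
        raise ValueError("x must be compressible for the normalization to apply")
    return relative_complexity(c_cross, c_x) / (len_x - c_x)


def symmetric_relative_complexity(n_xy, n_yx):
    """Average of the two normalized directions, each weighted by 1/2."""
    return 0.5 * n_xy + 0.5 * n_yx


# Illustrative sizes (hypothetical): |x| = |y| = 12.
n_xy = normalized_relative_complexity(c_cross=9, c_x=7, len_x=12)
n_yx = normalized_relative_complexity(c_cross=7, c_x=6, len_x=12)
print(n_xy, n_yx, symmetric_relative_complexity(n_xy, n_yx))
```

Values above 1 are possible whenever the cross-compressed size exceeds the length of the original string, which corresponds to the term e in the upper bound.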

Applications
Even if our main concern is not the performance of the introduced distance measures, we outline practical application examples in order to show the consistency of the introduced divergence, and compare it to similar existing methods. In the following experiments a preliminary step of dividing the strings into a set of words has been performed [18], on the basis of which the dictionary extractions have been more easily carried out.

Application to Authorship Attribution
We consider the problem of automatically recognizing the author of a given text. In the following experiment the same procedure used to test the relative entropy in [6], and a dataset as close as possible, have been adopted: the collection comprises 90 texts of 11 known Italian authors spanning the XIII-XXth centuries [19]. Each text T_i was used as an unknown text against the rest of the database, and assigned to the author of its closest object T_k, i.e., the one for which the computed relative complexity with respect to T_i was minimal. Overall accuracy was then computed as the percentage of texts assigned to their correct authors. The results, reported in Table 3, show that the correct author A(T_i) for each T_i has been found in 97.8% of the cases, with Table 4 reporting a comparison with other compression-based methods. The relative complexity yields better results than the relative entropy by Benedetto et al., as (15) does not have any limitation on the size of the strings to be analyzed, and takes into account the full information content of the objects. The NCD tested with three different compressors gave slightly inferior results, along with the Ziv-Merhav method to estimate the relative entropy between two strings [20]. Only two texts by Antonio Fogazzaro are incorrectly assigned to Grazia Deledda. These errors may nevertheless be explained, as Deledda's strongest influences are Fogazzaro and Giovanni Verga [21]. According to the classification results, Fogazzaro seems to have had a stronger influence on Deledda than Verga.
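The evaluation procedure described above, in which each text is assigned to the author of its closest object among all the others, can be sketched as a leave-one-out nearest-neighbour loop. The distance is passed in as a parameter: in the experiment it would be the computed relative complexity, while the toy distance used here is only a placeholder so that the sketch runs on its own. Names and data are hypothetical.

```python
def leave_one_out_accuracy(texts, authors, distance):
    """Assign each text to the author of its closest other text and
    return the fraction of correct assignments."""
    correct = 0
    for i, text in enumerate(texts):
        others = [k for k in range(len(texts)) if k != i]
        closest = min(others, key=lambda k: distance(text, texts[k]))
        if authors[closest] == authors[i]:
            correct += 1
    return correct / len(texts)


def toy_distance(x, y):
    """Placeholder distance (difference in length), not the paper's measure."""
    return abs(len(x) - len(y))


texts = ["aa", "aaa", "bbbbbbbb", "bbbbbbbbb"]
authors = ["A", "A", "B", "B"]
print(leave_one_out_accuracy(texts, authors, toy_distance))
```

Since the relative complexity is not symmetric, the order in which the two texts are passed to the distance matters when the placeholder is replaced by the actual measure.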

Satellite Images Classification
In a second experiment we classified a labelled dataset of satellite images, containing 600 optical image subsets of size 64 × 64 acquired by the SPOT 5 satellite. The dataset has been divided into six classes (clouds, sea, desert, city, forest and fields) and split into 200 training images and 400 test images. As a first step, the images have been encoded into strings, as in [18], by traversing them in raster order; then all distances between training and test images have been computed by applying (17); finally, each subset was assigned to the class from which the average distance was minimal. Results reported in Table 5 show an overall satisfactory performance, achieved considering only the horizontal information within the image subsets. The use of NCD with an image compressor (JPEG2000), and to a lesser degree with linear compression (zlib), nevertheless yields superior results [22].

Table 5. Accuracy for satellite images classification (%) using the relative complexity as distance measure, and comparison to NCD using both a general and a specialized compressor. A good performance is reached for all classes except for the class fields, confused with city and desert.
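The classification rule used in this experiment, raster-order encoding followed by assignment to the class with the minimal average distance, can be sketched as follows. The encoding details of [18] are not given in the text, so the sketch only flattens the pixel grid row by row; the distance is a placeholder to be replaced by the normalized relative complexity, and all names and data are hypothetical.

```python
def raster_encode(image):
    """Flatten a 2-D grid of pixel values row by row (raster order)."""
    return [pixel for row in image for pixel in row]


def classify_by_average_distance(sample, training, distance):
    """Return the label whose training items have the smallest average
    distance from the sample. `training` maps label -> list of items."""
    averages = {
        label: sum(distance(sample, item) for item in items) / len(items)
        for label, items in training.items()
    }
    return min(averages, key=averages.get)


def toy_distance(x, y):
    """Placeholder distance (sum of absolute differences), not the paper's measure."""
    return sum(abs(a - b) for a, b in zip(x, y))


training = {
    "sea": [raster_encode([[0, 0], [1, 0]]), raster_encode([[0, 1], [0, 0]])],
    "city": [raster_encode([[9, 8], [9, 9]]), raster_encode([[8, 9], [7, 9]])],
}
test_image = raster_encode([[1, 0], [0, 1]])
print(classify_by_average_distance(test_image, training, toy_distance))
```

Flattening by rows keeps only the horizontal adjacency of pixels, which matches the remark that only the horizontal information within the image subsets is considered.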

Conclusions
Two new concepts in algorithmic information theory have been introduced: cross-complexity and relative complexity, both defined for any two arbitrary finite strings. Key properties of the classical information theory concepts of cross-entropy and relative entropy hold for these definitions. Due to their incomputability, suitable approximations through data compression have been derived, enabling tests on real data, which have been performed and compared against similar existing methods. The computable relative complexity can be considered as an expansion of the relation illustrated in [4] between relative entropy and static encoders, extended to dynamic encoding for the general case of two isolated objects.
The introduced approximation, in its current implementation, exhibits some drawbacks. Firstly, it requires greater computational resources and cannot be computed by simply compressing a file, as with the NCD. Secondly, it needs a preliminary step of encoding the data into strings, whereas distance measures such as the NCD may be applied directly using any compressor. This work therefore does not aim at outperforming existing methods in the field, but at expanding the relations between classical and algorithmic information theory.

Figure 1. Pseudo-code to generate an approximation C(x ⊕ y) of the cross-complexity K(x ⊕ y) between two strings x and y.

As a practical example, consider two ASCII-coded strings A = {abcabcabcabc} and B = {abababababab}. By applying the LZW algorithm, we extract and use two dictionaries Dict(A) and Dict(B) to compress A and B into two strings A* and B* of length C(A) and C(B), respectively. By applying the pseudo-code in Figure 1 we compute the cross-compressions between A and B.

The above sum is prefixed by a c-bit self-delimiting program, which depends on the Turing machine adopted, that tells U how to compute x from the following commands. The total forms the code of x in terms of y. This way, x is coded into some concatenation of subsegments of y* expanding into segments of x, prefixed with 1, and the remaining segments of x in self-delimiting form prefixed with 0.

Table 2. Estimated complexities and cross-complexities for the sample strings A and B of Table 1. As A and B share common patterns, compression is achieved, and it is more effective when B is expressed in terms of A due to the fact that A contains all the relevant patterns within B.

Table 4. Authorship attribution. Comparison with other compression-based methods.