Point Information Gain and Multidimensional Data Analysis

We generalize the point information gain (PIG) and the derived quantities, i.e., the point information gain entropy (PIE) and the point information gain entropy density (PIED), to the case of the Rényi entropy and simulate the behavior of PIG for typical distributions. We also use these methods for the analysis of multidimensional datasets. We demonstrate the main properties of the PIE/PIED spectra for real data on the example of several images and discuss possible further utilization in other fields of data processing.

approach based on a simple concept of entropy difference. By generalizing both concepts from Shannon's to Rényi's approach, we obtain a whole class of information measures that makes it possible to focus on different parts of probability distributions and to interpret this as an investigation of different parts of multifractal systems.
Despite the mathematical rigor of the concept of the Shannon/Rényi divergence, we use the latter concept, i.e., the (Rényi) entropy difference, to introduce a measure which locally determines the information contribution of a given element in a discrete set. Although there is no substantial restriction on the use of a standard divergence for calculating the information difference upon elimination of one element from the set, for practical reasons we use the simple concept of the entropy difference between the sets with and without the given element. The resulting value has been called the point information gain Γ_{α,i} [1], [2]. The goal of this article is to examine and demonstrate some properties of this variable and of the derived quantities, namely the point information gain entropy H_α and the point information gain entropy density Ξ_α. We also relate all these variables to semantic and syntactic information in multidimensional data analysis.

A. Point information gain and its relation to other information entropies
An important problem in information theory is to estimate the amount of information gained or lost by refining or approximating a probability distribution P by a distribution Q. The most popular measure used for this purpose is the Kullback-Leibler (KL) divergence, defined as
\[
D_{KL}(P\|Q) = S_P(Q) - S(P),
\]
where S_P(Q) = −Σ_j p_j ln q_j is the so-called cross-entropy [3] and S(P) is the entropy of the distribution P. In the case when P is similar to Q, this measure can be approximated by the entropy difference ΔS(P, Q) = S(Q) − S(P).
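As a quick numeric illustration (our own sketch, not code from the paper), the following snippet compares the KL divergence with the simpler entropy difference for two nearby distributions; the function names are ours.

```python
# Compare D_KL(P||Q) = S_P(Q) - S(P) with the entropy difference dS(P, Q) = S(Q) - S(P)
# for two close distributions P and Q.
import numpy as np

def shannon_entropy(p):
    """Shannon entropy S(P) = -sum p_i ln p_i (natural log; zero bins ignored)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """Cross-entropy S_P(Q) = -sum p_i ln q_i."""
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.48, 0.31, 0.21])                  # Q is a small perturbation of P

d_kl = cross_entropy(p, q) - shannon_entropy(p)   # Kullback-Leibler divergence
d_s = shannon_entropy(q) - shannon_entropy(p)     # entropy difference
print(d_kl, d_s)                                  # both small; dS may even vanish for P != Q
```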
Admittedly, this measure does not obey as many measure-theoretic axioms as the KL divergence. For instance, for P ≠ Q we can still obtain ΔS(P, Q) = 0. Nevertheless, if P ≈ Q and P ≠ Q, this measure can still be a suitable information measure. Particularly interesting is the situation when the distributions are approximate histograms of some underlying distribution P for n and n + 1 entries, respectively. In this case, the entropy difference can be interpreted as the information gained by the (n + 1)-th point.
When dealing with complex real systems, it is sometimes advantageous to introduce new information measures and entropies that capture the complexity of the system better, e.g., the Hellinger distance, the Jeffreys distance, or the J-divergence. There are also specific information measures that have special interpretations and are widely used in various applications. The two most important ones are the Tsallis-Havrda-Charvát (THC) entropy [4], the entropy of non-extensive systems, and the Rényi entropy, the entropy of multifractal systems [5], [6]. The latter is tightly connected to the theory of multifractal systems and generalized dimensions [7]; the scaling exponent of the Rényi entropy is equal to the generalized dimension
\[
D_\alpha = \lim_{l \to 0} \frac{H_\alpha(P(l))}{\ln l}.
\]
The Rényi entropy indicates the average information cost when the cost of information is an exponential function of its length [8]. Thus, changing the parameter α changes the cost of the information and therefore accentuates some parts of the probability distribution while suppressing others. The limit lim_{α→1} H_α = H_1 is equal to the Shannon entropy; thus, by taking into account the whole class of Rényi entropies, we obtain a new class of information measures.
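The following short sketch (ours; it only assumes the standard Rényi formula H_α = (1/(1−α)) ln Σ_j p_j^α) evaluates the Rényi entropy of a small distribution for several α-values and recovers the Shannon entropy in the α → 1 limit.

```python
# Renyi entropy H_alpha(P) = 1/(1-alpha) * ln(sum_i p_i^alpha); alpha -> 1 gives Shannon.
import numpy as np

def renyi_entropy(p, alpha):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(alpha, 1.0):                 # Shannon limit
        return -np.sum(p * np.log(p))
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

p = np.array([0.5, 0.25, 0.125, 0.125])
for a in (0.5, 0.99, 1.0, 2.0, 4.0):
    print(a, renyi_entropy(p, a))              # larger alpha weights frequent bins more heavily
```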
The point information gain (Γ_{α,i}) was developed as a practical tool for assessing the information contribution of an element to a given discrete distribution [9]. Similarly to the Shannon entropy difference, it is defined as the difference of two Rényi entropies, computed with and without the examined element of a discrete phenomenon. We consider a discrete distribution of k phenomena which occur exclusively. H_α = H_α(P) is the Rényi entropy of the full distribution and H_{α,i} = H_α(P_i) is the Rényi entropy of the distribution P_i, in which one point in the occurrence of the examined i-th phenomenon has been omitted. Hence, we may write the point information gain Γ_{α,i} as
\[
\Gamma_{\alpha,i} = H_{\alpha,i} - H_{\alpha} = \frac{1}{1-\alpha}\ln\sum_{j=1}^{k} p_{j,i}^{\,\alpha} - \frac{1}{1-\alpha}\ln\sum_{j=1}^{k} p_{j}^{\,\alpha},
\]
where α is the Rényi coefficient, k is the number of elements in the discrete distribution, and p_j = n_j/n and p_{j,i} = n_{j,i}/(n−1) are the probabilities of occurrence of the j-th phenomenon in the original distribution and in the distribution without one element of the i-th phenomenon, respectively¹.
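A minimal sketch of this definition (our own code, with Γ_{α,i} taken as H_{α,i} − H_α as reconstructed above; names such as `point_information_gain` are ours):

```python
# Point information gain Gamma_{alpha,i}: Renyi entropy of the histogram with one
# occurrence of phenomenon i removed, minus the Renyi entropy of the full histogram.
import numpy as np

def renyi_entropy_from_counts(counts, alpha, total):
    p = counts[counts > 0] / total
    if np.isclose(alpha, 1.0):
        return -np.sum(p * np.log(p))
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def point_information_gain(counts, i, alpha):
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    reduced = counts.copy()
    reduced[i] -= 1                                   # remove one occurrence of phenomenon i
    h_full = renyi_entropy_from_counts(counts, alpha, n)
    h_without = renyi_entropy_from_counts(reduced, alpha, n - 1)
    return h_without - h_full                         # Gamma_{alpha,i} = H_{alpha,i} - H_alpha

counts = np.array([50, 30, 15, 5])                    # k = 4 phenomena, n = 100 events
print([point_information_gain(counts, i, 2.0) for i in range(len(counts))])
```

For this toy histogram the gain is positive for the frequent phenomena and negative for the rare ones, in agreement with the discussion below.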
In contrast to the commonly used Rényi divergence [10]-[16], we use Γ_{α,i} for its relative simplicity² and practical interpretation.
After substituting for the probabilities, one obtains
\[
\Gamma_{\alpha,i} = C_\alpha(n) + \frac{1}{1-\alpha}\ln\frac{\sum_{j=1}^{k} n_{j,i}^{\,\alpha}}{\sum_{j=1}^{k} n_{j}^{\,\alpha}},
\]
where C_α(n) = \frac{1}{1-\alpha}\ln\frac{n^\alpha}{(n-1)^\alpha} depends only on the number of events n. For n → ∞, Γ_{α,i} → 0, while the whole entropy remains finite³. Since C_α(n) does not depend on the examined element, we examine only the second term, which we denote D_{α,i}. When the argument of the logarithm is close to 1, which leads to the condition that ((n_i − 1)/n_i)^α ≈ 1, one can approximate the logarithm by a first-order Taylor expansion, and D_{α,i} can be approximated as
\[
D_{\alpha,i} = \frac{1}{1-\alpha}\ln\!\left(1 + \frac{(n_i-1)^{\alpha} - n_i^{\alpha}}{\sum_{j=1}^{k} n_j^{\,\alpha}}\right) \approx \frac{1}{1-\alpha}\,\frac{(n_i-1)^{\alpha} - n_i^{\alpha}}{\sum_{j=1}^{k} n_j^{\,\alpha}}.
\]
Let us note that the last term of Eq. (6) is nothing else than the THC entropy formula [4], [17]. In other words, if the THC entropy is used instead of the Rényi entropy, the result almost corresponds to the point information gain derived from Eq. (5). This is due to the fact that, for large n, the omission of one point has no large impact on the whole distribution. It also reflects the fact that, for our purposes, the degree of the α-parameter, which rescales the probabilities to p_i^α, is more important than the particular form of the information measure. We shall continue utilizing the Rényi entropy due to its correspondence to the generalized dimension of multifractal systems [18], [19]. Let us concentrate again on the term D_{α,i}. Specifically, for α = 2, we obtain
\[
D_{2,i} \approx \frac{2 n_i - 1}{\sum_{j=1}^{k} n_j^{2}},
\]
which explains why the dependence of Γ_{2,i} on n_i is approximately linear (Fig. 2d). In general, if α ≠ 1, the point information gain is a monotonic function of n_i, or equivalently of p_i, for all possible discrete distributions. Thus, it may be used as a measure of the information gain between two discrete distributions which differ in the occurrence of one particular feature.
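The approximate linearity of Γ_{2,i} in n_i can be checked numerically; the sketch below (ours, assuming the decomposition into C_2(n) and D_{2,i} given above) compares the exact value with the first-order approximation.

```python
# Check Gamma_{2,i} ~= C_2(n) + (2*n_i - 1)/sum_j n_j^2 (first-order Taylor expansion).
import numpy as np

counts = np.array([200.0, 120.0, 60.0, 15.0, 5.0])
n = counts.sum()

def h2(c, total):                          # Renyi entropy for alpha = 2
    p = c / total
    return -np.log(np.sum(p ** 2))

c2 = -np.log((n / (n - 1)) ** 2)           # C_2(n) = 1/(1-2) * ln(n^2/(n-1)^2)
for i, n_i in enumerate(counts):
    reduced = counts.copy()
    reduced[i] -= 1
    exact = h2(reduced, n - 1) - h2(counts, n)
    approx = c2 + (2 * n_i - 1) / np.sum(counts ** 2)
    print(i, round(exact, 6), round(approx, 6))   # nearly identical, linear in n_i
```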
In general, Γ_{α,i} < 0 corresponds to the tail parts of the distribution, i.e., to rare events, while Γ_{α,i} > 0 corresponds to frequent events. Thus, in addition to defining a measure of the contribution of each event to the examined distribution, we also obtain a discrimination between points according to how they contribute information about the given distribution under the given statistical assumption represented by the particular α-value. This opens the question of the existence of a so-called "optimal" distribution for a given α. Two possible measures of such optimality arise naturally. The first would be defined by a distribution for which exactly half of the n_i values produce Γ_{α,i} > 0 and the other half yield Γ_{α,i} < 0. The second requires that the Γ_{α,i}-values be equally spaced. The existence of such an "ideal" distribution would be another generalization of the concept of the entropy power, similar to that reported recently [20], [21]. We intend to address this question in our future research.
With respect to the previous discussion and the practical utilization of this notion, we emphasize that for real systems with large n, the Γ_{α,i}-values are rather small numbers. Their further computer averaging and numerical representation lead to significant errors (e.g., Fig. 1c). At lower α-values, the Γ_{α,i}-values are broadly separated for rare points, while at higher α-values the resolution is higher for more frequent data points. Therefore, it is more advisable to compute the Γ_{α,i}(α)-spectrum rather than a single Γ_{α,i}-value at a chosen α.
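A sketch of such a Γ_{α,i}(α)-spectrum computation (our own code; the α-grid is arbitrary):

```python
# Gamma_{alpha,i}(alpha)-spectrum: the gain of each phenomenon evaluated over a grid of
# alpha values, which resolves rare points at low alpha and frequent points at high alpha.
import numpy as np

def gamma_spectrum(counts, alphas):
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()

    def h(c, total, a):                       # Renyi entropy from counts
        p = c[c > 0] / total
        return -np.sum(p * np.log(p)) if np.isclose(a, 1) else np.log(np.sum(p ** a)) / (1 - a)

    spectrum = np.empty((len(counts), len(alphas)))
    for i in range(len(counts)):
        reduced = counts.copy()
        reduced[i] -= 1
        for j, a in enumerate(alphas):
            spectrum[i, j] = h(reduced, n - 1, a) - h(counts, n, a)
    return spectrum                           # rows: phenomena i, columns: alpha values

alphas = np.array([0.1, 0.5, 1.0, 2.0, 4.0])
print(gamma_spectrum([500, 300, 150, 40, 10], alphas).round(6))
```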

C. Point information gain entropy and point information gain entropy density
In the previous sections we showed that Γ_{α,i} is different for different n_i and that the dependence of Γ_{α,i} on n_i is a monotonically increasing function for all α > 0. Further, if α > 0, the term Σ_{j=1}^{k} n_j^α is different for any pair of dissimilar distribution classes. Here, we propose new variables, the point information gain entropy (H_α) and the point information gain entropy density (Ξ_α), defined by the formulae
\[
H_\alpha = \sum_{j=1}^{k} n_j\,\Gamma_{\alpha,j}
\]
and
\[
\Xi_\alpha = \sum_{j=1}^{k} \Gamma_{\alpha,j}.
\]
They can be understood as a multiple of the average point information gain and, under linear averaging, as an average gain of the phenomenon j, respectively. The information content is generally measured by the entropy. The famous Shannon source coding theorem [22] refers to a specific process of transmission of a discretized signal and the introduction of noise. The Rényi entropy is one of a class of one-parametric entropies and offers numerous additional features over the Shannon entropy [5], [10], [23], such as the determination of a generalized dimension of a strange attractor [18], [19]. The universality of the generalized dimension for the characterization of any distribution, whose regularity may be only coincidental, is still under dispute. However, the H_α- and Ξ_α-values characterize a given distribution unambiguously for any α.
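Under this reading of the definitions (our assumption, consistent with the text above, that H_α sums Γ_{α,j} weighted by the counts n_j, while Ξ_α counts each phenomenon once), a minimal sketch is:

```python
# Point information gain entropy H_alpha = sum_j n_j * Gamma_{alpha,j} and
# point information gain entropy density Xi_alpha = sum_j Gamma_{alpha,j}.
import numpy as np

def renyi(p, a):
    p = p[p > 0]
    return -np.sum(p * np.log(p)) if np.isclose(a, 1) else np.log(np.sum(p ** a)) / (1 - a)

def pig_entropy_and_density(counts, alpha):
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    gammas = np.empty(len(counts))
    for j in range(len(counts)):
        reduced = counts.copy()
        reduced[j] -= 1
        gammas[j] = renyi(reduced / (n - 1), alpha) - renyi(counts / n, alpha)
    h_alpha = np.sum(counts * gammas)        # each phenomenon weighted by its count n_j
    xi_alpha = np.sum(gammas)                # each phenomenon counted once
    return h_alpha, xi_alpha

for a in (0.5, 1.0, 2.0, 4.0):
    print(a, pig_entropy_and_density([500, 300, 150, 40, 10], a))
```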
Differences between distributions are expressed in counts along the Γ_α-axes. Thus, independently of the mechanism of its generation, the H_α/Ξ_α-value of a given distribution, including a non-parametric one, may always be compared.
The next question is whether Ξ_α measures information in the commonly understood way. In this respect, we recall the observation made upon the examination of Eq. (9), i.e., that the point information gain Γ_{α,i} has the properties of an information measure. We may rewrite Ξ_α as
\[
\Xi_\alpha = k\,C_\alpha(n) + \frac{1}{1-\alpha}\ln\prod_{j=1}^{k}\frac{\sum_{l=1}^{k} n_{l,j}^{\,\alpha}}{\sum_{l=1}^{k} n_{l}^{\,\alpha}},
\]
where the product in the argument of the logarithm in the second term is a product of functions upper-limited by 1 and thus again a function upper-limited by 1. From the previous analysis of D_{α,i}, we may conclude that the point information gain entropy density (Ξ_α) has the properties of an information measure. Similarly, the point information gain entropy (H_α) may be rewritten as
\[
H_\alpha = n\,C_\alpha(n) + \frac{1}{1-\alpha}\ln\prod_{j=1}^{k}\left(\frac{\sum_{l=1}^{k} n_{l,j}^{\,\alpha}}{\sum_{l=1}^{k} n_{l}^{\,\alpha}}\right)^{\!n_j}.
\]
Again, the argument of the logarithm in the second term is upper-limited by 1. H_α therefore also has the properties of an information measure, although, in this case, its relation to the original Rényi entropy is more complicated.

A. Point information gain
The point information gain Γ_{α,i} introduced in Eq. (5) was originally applied to image enhancement [1], [2]. A typical digital image is a matrix of x × y × n values, where x and y are the dimensions of the image and n is the number of color channels (e.g., n is 1 for a monochrome image and 3 for an RGB image). In most cases, the intensity values lie in the range from 0 to 255 (an 8-bit image) or from 0 to 4095 (a 12-bit image) for each color channel.
Nevertheless, independently of the size and bit depth of an image, we may also examine the context of each pixel in the image via the Γ_{α,i}-calculation. In other words, apart from the semantic information (Algorithm 1), the Γ_{α,i}-calculation also allows us to analyze the syntactic information around each pixel (Algorithm 2). The semantic information is evaluated as the change of the intensity probability histogram after removing a point at each occupied intensity level. The choice of the syntactic surroundings around pixels is specific to each image. To our knowledge, surroundings chosen appropriately on the basis of the origin of the image generation have been studied, for the particular case of cellular automata, only in Refs. [26]-[28]. Thus, we do not have any systematic method for comparing the suitability of different definitions of pixel surroundings; it obviously depends on the process by which the observed pattern or other distribution was generated. This makes the study of the syntactic information very interesting, because it outlines a method for further discrimination of the processes of self-organization/pattern formation [29]. In this article, we confine ourselves to the usage of the syntactic information, such as for the shadows around a group of the jelly beans (Fig. 5b). In conformity with the statement in the next-to-last paragraph of Sect. II-A, this principle also enables the highlighting of rare points in images with much richer intensities, mainly at low α-values. Calculations using higher α-values merge the resulting Γ_{α,i}-values and outline areas.
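The paper's Algorithms 1-2 are not reproduced here; the following is our own sketch of the semantic recalculation described above, in which every pixel of a monochrome image is replaced by the Γ_{α,i} of its intensity level computed from the global histogram.

```python
# Semantic PIG transformation: Gamma_{alpha,i} is computed once per occupied intensity
# level from the global histogram and applied to the image as a look-up table.
import numpy as np

def renyi_from_counts(counts, total, alpha):
    p = counts[counts > 0] / total
    return -np.sum(p * np.log(p)) if np.isclose(alpha, 1) else np.log(np.sum(p ** alpha)) / (1 - alpha)

def semantic_pig_image(img, alpha, levels=256):
    img = np.asarray(img)
    hist = np.bincount(img.ravel(), minlength=levels).astype(float)
    n = hist.sum()
    h_full = renyi_from_counts(hist, n, alpha)
    gamma = np.zeros(levels)
    for level in np.flatnonzero(hist):            # only occupied intensity levels
        reduced = hist.copy()
        reduced[level] -= 1
        gamma[level] = renyi_from_counts(reduced, n - 1, alpha) - h_full
    return gamma[img]                             # look-up table applied per pixel

rng = np.random.default_rng(0)
demo = rng.integers(0, 256, size=(64, 64))        # stand-in for a real 8-bit image
print(semantic_pig_image(demo, alpha=2.0).shape)
```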
The syntactic information emphasizes differences based on a local concept. As the first syntactic surroundings, we chose a cross of intensity values whose shanks meet at the examined point of the original image [1].
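A corresponding sketch of the cross-shaped syntactic surroundings (our own code, not the paper's Algorithm 2): the local histogram is built from the row and column meeting at the examined pixel, and Γ_{α,i} of the central intensity is evaluated within it.

```python
# Cross-syntactic PIG transformation: local histogram from the row and column ("shanks"
# of the cross) through each pixel; Gamma of the central intensity within that distribution.
import numpy as np

def renyi_from_counts(counts, total, alpha):
    p = counts[counts > 0] / total
    return -np.sum(p * np.log(p)) if np.isclose(alpha, 1) else np.log(np.sum(p ** alpha)) / (1 - alpha)

def cross_syntactic_pig(img, alpha, levels=256):
    img = np.asarray(img)
    rows, cols = img.shape
    out = np.zeros((rows, cols), dtype=float)
    for r in range(rows):
        for c in range(cols):
            cross = np.concatenate([img[r, :], img[:, c]])   # central pixel enters twice here
            hist = np.bincount(cross, minlength=levels).astype(float)
            n = hist.sum()
            reduced = hist.copy()
            reduced[img[r, c]] -= 1                          # remove one occurrence of the central intensity
            out[r, c] = (renyi_from_counts(reduced, n - 1, alpha)
                         - renyi_from_counts(hist, n, alpha))
    return out

rng = np.random.default_rng(1)
print(cross_syntactic_pig(rng.integers(0, 256, size=(32, 32)), alpha=2.0).mean())
```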
In contrast to the global (semantic) recalculation, such a transformation of the texmos2.s512 image produces a Γ_{α,i}-image with substantially richer intensities. One can see that relatively simple semantic information is composed of more complex syntactic information (Fig. 4a, c, e).
However, the cross-syntactic type of image transformation is the least suitable approach for the analysis of the photograph of the jelly beans (Fig. 5c). In this case, a circular syntactic element is recommended instead.
As seen in Figs. 5d-f, increasing the circle diameter up to the size of the jelly beans gradually suppresses the background. A further increase makes it possible to group the jelly beans into higher-order assemblies. A similar grouping is observable for the smallest squares in the transformed texmos2.s512 image using the 29-px square surroundings (Fig. 3f). In contrast, smaller square surroundings (Fig. 3d) highlight only the intensity borders.

B. Point information gain entropy and point information gain entropy density
From the point of view of thermodynamics, H_α and Ξ_α can be considered as additive, homological, state variables, and knowledge of them can also be helpful in the analysis of multidimensional (image) data [30]. The plotting of H_α and Ξ_α vs. α in Fig. 6 is not arbitrary. As mentioned for the Γ_{α,i}-calculations (Sect. II-A), multidimensional discrete (image) data are suitably characterized not by a single value of H_α or Ξ_α at a particular α, but by their α-dependent spectra. The reason is not only to avoid digital rounding; as written in Sect. III-A, the spectra also make it possible to characterize the type and origin of geometrical structures in the image. Another application is the statistical evaluation of time-lapse multidimensional datasets [30]. This calculation method was originally developed for the study of multifractal self-organizing biological images [31], [32]; however, it can describe any type of image. Since parts of an image are forms of complex structures, the best way to interpret the image is to use a combination of its semantic and syntactic information. We demonstrate this in Fig. 6, which contains an example of a unifractal (almost non-fractal) Euclidean image and a computer-generated multifractal image. Whereas the Euclidean image gives monotonic H_α/Ξ_α(α)-spectra (for the semantic and cross-syntactic information even almost linear dependences over the particular discrete interval of α-values), the recalculation of the multifractal image shows extremes at values of α close to 1. Analogous dependences have also been plotted for the image sets of the course of the self-organizing Belousov-Zhabotinsky reaction [30].
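A compact sketch (ours) of computing the H_α(α)- and Ξ_α(α)-spectra of the semantic information of an image over a grid of α-values:

```python
# H_alpha/Xi_alpha spectra of an image's semantic information: Gamma_{alpha,j} of every
# occupied intensity level is summed with and without the weight n_j over a grid of alpha.
import numpy as np

def renyi(c, total, a):
    p = c[c > 0] / total
    return -np.sum(p * np.log(p)) if np.isclose(a, 1) else np.log(np.sum(p ** a)) / (1 - a)

def pig_spectra(img, alphas, levels=256):
    hist = np.bincount(np.asarray(img).ravel(), minlength=levels).astype(float)
    n = hist.sum()
    h_spec, xi_spec = [], []
    for a in alphas:
        h_full = renyi(hist, n, a)
        gam = np.zeros(levels)
        for lev in np.flatnonzero(hist):
            red = hist.copy()
            red[lev] -= 1
            gam[lev] = renyi(red, n - 1, a) - h_full
        h_spec.append(np.sum(hist * gam))     # H_alpha
        xi_spec.append(np.sum(gam))           # Xi_alpha
    return np.array(h_spec), np.array(xi_spec)

alphas = np.linspace(0.1, 4.0, 9)
rng = np.random.default_rng(2)
print(pig_spectra(rng.integers(0, 256, size=(64, 64)), alphas))
```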

IV. CONCLUSIONS
In this article, we propose novel information measures: the point information gain (Γ_{α,i}), the point information gain entropy (H_α), and the point information gain entropy density (Ξ_α). The variables H_α and Ξ_α may be used as measures in multidimensional datasets for the definition of the information context. This option may be practically utilized for acquiring differently resolved information measures in a dataset. In other words, it makes it possible to resolve cases where the number of occurrences of a certain event is the same, but the distributions in time, space, or along any other measure differ. Examination of the syntactic information distribution shows a potential for in-depth insight into the formation of the observed structures and patterns. Further, we found a monotonic dependence of Γ_{α,i} on the number of elements of a given property in the set. In principle, the variables H_α and Ξ_α are unique for each distribution but suffer from problems with the digital precision of the calculation. Therefore, we propose the Γ_{α,i}-spectrum as a proper characteristic of any discrete distribution.
b) Cauchy distribution: c) Gauss distribution:

Multidimensional image analysis based on the calculation of Γ_{α,i}, H_α, and Ξ_α was tested on 5 standard images (Table I). The square and circle syntactic information was set as special cases of the Cross, Rectangle, and Ellipse calculations at a rotation angle Phi of 0°. In the Image Info Extractor Professional software, the side of the square and the radius of the circle surroundings are input as width/2 and height/2 of 2, 5, and 14 px and as a and b of 2, 5, and 8 px, respectively.

B. Calculation algorithms
The algorithms implemented in the Image Info Extractor Professional software are described in Algorithms 1-2. In the case of RGB images, the algorithms were applied to each color channel. The