Entropy Approximation in Lossy Source Coding Problem

In this paper, we investigate a lossy source coding problem in which an upper limit on the permitted distortion is defined for every dataset element. It can be seen as an alternative to rate distortion theory, where a bound on the allowed average error is specified. In order to find the entropy, which gives a statistical length of source code compatible with a fixed distortion bound, a corresponding optimization problem has to be solved. First, we show how to simplify this general optimization by eliminating coding partitions that are irrelevant for the entropy calculation. In our main result, we present a fast and easy-to-implement greedy algorithm, which approximates the entropy within an additive error term of $\log_2 e$. The proof is based on the minimum entropy set cover problem, for which a similar bound was obtained.


Introduction
Lossy source coding transforms possibly continuously-distributed information into a finite number of codewords [1,2]. Although this allows one to encode data efficiently, such an operation is irreversible, and once modified, information cannot be restored accurately. One of the fundamental questions in lossy coding is the following: What is the lowest achievable statistical code length given a maximal coding error? To answer this question, a precise formulation of the coding error and the related definition of the entropy need to be given. In this paper, we present how to approximate the value of the entropy in the case when every input element has a fixed upper limit on the permitted error.

Motivation
In order to explain our results, we give a more precise problem formulation. Suppose that a random source represented by a probability measure µ produces information from a space X. We fix a partition P of X and encode an arbitrary element x ∈ X by the unique P ∈ P such that x ∈ P. The statistical code length is described by the Shannon entropy of µ with respect to P [3]:
$h(\mu; \mathcal{P}) = -\sum_{P \in \mathcal{P}} \mu(P) \log_2 \mu(P)$.
Example 1. Suppose that we want to encode numbers picked randomly from [0, n). One can use a coding partition with equally-sized sets, e.g., $\mathcal{P}_\delta = \{[k, k + \delta) : k = 0, \delta, \ldots, n - \delta\}$. When the source elements are encoded by the centers of these sets, the (de)coding error does not exceed δ/2. Clearly, one may construct partitions that contain different types of sets. Roughly speaking, highly probable elements should be coded with high accuracy (smaller sets), while rare numbers can be coded with low precision (larger sets). DjVu is an example of a file format that uses different precision for various image elements: it compresses the text layer and the background separately [4]. The proposed approach allows one to define a maximal coding error for every dataset element separately.
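The entropy of the equally-sized partition from Example 1 can be evaluated directly. The sketch below assumes a uniform source on [0, n) and a δ that divides n (the example only says the numbers are "picked randomly"); the function names are illustrative and not part of the paper.

```python
import math


def shannon(p: float) -> float:
    """Shannon function sh(p) = -p * log2(p), with sh(0) understood as 0."""
    return 0.0 if p <= 0 else -p * math.log2(p)


def partition_entropy(masses) -> float:
    """Entropy of a partition, given the measures mu(P) of its sets."""
    return sum(shannon(p) for p in masses)


def uniform_interval_partition_entropy(n: float, delta: float) -> float:
    """Entropy of P_delta = {[k, k + delta)} for a uniform source on [0, n).

    Assumption: delta divides n, so every cell has mass delta / n.
    """
    cells = round(n / delta)
    return partition_entropy([1.0 / cells] * cells)


# Coarser cells (larger permitted error) give a shorter statistical code length.
for delta in (1, 2, 4):
    print(delta, uniform_interval_partition_entropy(8, delta))
```

For a uniform source, the entropy equals the base-2 logarithm of the number of cells, so doubling δ saves one bit per symbol at the price of doubling the maximal coding error δ/2.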
To control the maximal coding error in the above formulation, we propose to use an additional family Q of subsets of X, which we call an error-control family. The error-control family is a kind of fidelity criterion [5]. We accept only those coding partitions P in which every element is a subset of some element of Q (we say that P is Q-acceptable). The optimal Q-acceptable coding partition is the one with minimal entropy. Thus, we define the entropy of an error-control family Q by [6][7][8]:
$H(\mu; \mathcal{Q}) := \inf\{h(\mu; \mathcal{P}) : \mathcal{P} \text{ is a } \mathcal{Q}\text{-acceptable partition}\}$. (1)
The above problem differs from rate distortion theory [9,10], which is the most common approach to lossy source coding. Instead of specifying an upper limit on the allowed average distortion rate, an upper limit on the permitted distortion for every symbol is considered. The above formulation can be seen as a kind of vector quantization [11,12] and was partially motivated by the notion of epsilon-entropy proposed by Posner [13,14]. The entropy of specific error-control families in the case of metric spaces also appears in the definition of the Rényi entropy dimension [15]. It is worth mentioning that our idea is partially connected with perceptual source coding considered by Jayant et al. [16].
Answering the question raised at the beginning of the paper, concerning the lowest code length given a maximal coding error, is equivalent to calculating the entropy of an error-control family (1). In this paper, we focus on methods that allow us to approximate this quantity.

Main Results
In our main result, Theorem 4, we present a method that allows us to approximate the entropy of an error-control family within an additive term. More precisely, we propose a fast and easy-to-implement algorithm, the greedy entropy algorithm, which for a given finite error-control family Q produces a Q-acceptable partition P satisfying:
$h(\mu; \mathcal{P}) \le H(\mu; \mathcal{Q}) + \log_2 e$. (2)
The obtained bound is sharp and cannot be improved. Moreover, it is independent of the coding problem instance characterized by a probability measure µ and an error-control family Q.
Our method is reminiscent of the procedure used to approximate the solution of the minimum entropy set cover problem (MESC) [17,18], where a similar bound was derived. Roughly speaking, MESC is an optimization problem in which one seeks a minimum-entropy partition compatible with a fixed cover of a finite dataset X. In fact, it looks for an assignment f : X → Q, which can be seen as a special case of a partition. A similar greedy algorithm was used there to produce a partition whose entropy approximates the minimal entropy value. To be able to apply the results obtained for MESC, the precise relationship between these two problems had to be established. In particular, in Theorem 3, we show that the entropy of an error-control family equals the minimal entropy for a set cover.
Let us observe that our main minimization problem (1) is very complex, since for most examples of error-control families, there exists an uncountable number of acceptable partitions (see Section 2.2). As a consequence, an exhaustive search through all acceptable partitions cannot be done in practice. The second key step in establishing the inequality (2) was to show that the number of partitions relevant for the entropy calculation may be drastically reduced. In Theorem 1, we show that, to find the entropy of an error-control family, it is sufficient to consider only partitions constructed from the elements of the σ-algebra generated by the error-control family.

Discussion
Before presenting the details of the proposed algorithm, let us first demonstrate its sample effects. The reader interested in its full description may skip this part during the first reading.
In order to show the capabilities of the greedy entropy algorithm, we apply this procedure to image segmentation. For simplicity, we assume that every pixel is represented by a three-dimensional feature vector, i.e., the intensity of each color coordinate ranges between zero and 255.
Let the error-control family $\mathcal{Q}^1_\delta$ consist of all cubes with side length δ, i.e., $\mathcal{Q}^1_\delta = \{[k, k + \delta)^3 : k = 0, 1, \ldots, 255 - k\}$. The greedy entropy algorithm sequentially selects the most probable cubes from the image histogram. The final partition $\mathcal{P}^1_\delta$, for two color components (green and blue), is shown in Figure 1b. Table 1a compares the entropies calculated for the partition $\mathcal{P}^1_\delta$ and another $\mathcal{Q}^1_\delta$-acceptable partition $\mathcal{P}_\delta$, which includes all pairwise disjoint cubes with side length δ > 0. Surprisingly, both partitions gave similar entropies. This might be explained by the fact that $\mathcal{P}_\delta$ contains only elements of the error-control family $\mathcal{Q}^1_\delta$.
Table 1. Entropies calculated for the partitions $\mathcal{P}^1_\delta$ (a) and $\mathcal{P}^2_\delta$ (b) returned by the greedy entropy algorithm for two cases of error-control families. The first one consists of cubes with a given side length, while the second contains balls with a given radius. In each case, the results are compared with the entropy of the acceptable partition consisting of maximally-sized pairwise disjoint cubes ($\mathcal{P}_\delta$ in case (a)).
Clearly, this is not always the case. To observe this, let the error-control family $\mathcal{Q}^2_\delta$ consist of all balls with radius δ. Figure 1c shows the partition $\mathcal{P}^2_\delta$ returned by the greedy entropy algorithm for two color components: green and blue. As in the previous case, we also consider a $\mathcal{Q}^2_\delta$-acceptable partition of maximally-sized pairwise disjoint cubes with a fixed side length. We can see from the results in Table 1b that the greedy selection provided significantly lower entropy values. The convergence behavior of the partitions produced by the greedy entropy algorithm is shown in Figure 2. One can observe that the entropy, as well as the cardinality of the resultant partitions, decreases in a hyperbolic way when the diameters of the sets included in the error-control family increase.

Paper Organization
The paper is organized as follows. In Section 2, we formulate our lossy source coding problem and show that the entropy optimization problem can be simplified by discarding partitions that are irrelevant for the entropy calculation. Section 3 contains our main result. First, the minimum entropy set cover problem is recalled, and its relationship with our notion of entropy is established. Next, we define the greedy entropy algorithm and show that it constructs a partition with entropy close to the optimal one.

Entropy Calculation
We start this section by establishing basic notation and definitions. Then, we present the main problem of this paper concerning the entropy calculation and show how to reduce its complexity by eliminating irrelevant coding partitions.

Lossy Source Coding and Error-Control Families
Let us assume that (X, Σ, µ) is a probability space. In our formulation of lossy source coding, we are interested in encoding elements of X produced by a probability measure µ by a countable number of symbols. The source code is determined by a partition of X, which is a countable family P of measurable, pairwise disjoint subsets of X, such that:
$\mu\bigl(X \setminus \bigcup_{P \in \mathcal{P}} P\bigr) = 0$.
More precisely, every element x ∈ X is transformed into a code related to the unique P ∈ P such that x ∈ P. The statistical code length of an arbitrary element of X in the optimal coding scheme can be calculated by the entropy of P [3]:
Definition 1. The entropy of a partition P is defined by:
$h(\mu; \mathcal{P}) := \sum_{P \in \mathcal{P}} \mathrm{sh}(\mu(P))$,
where $\mathrm{sh}(x) := -x \log_2 x$ for $x \in [0, 1]$ denotes the Shannon function, and by $-0 \cdot \log_2(0)$, we understand zero.
Although the entropy and the partition depend strictly on the probability measure, we consistently omit the symbol µ in their definitions to simplify the notation.
The use of a partition causes a coding error (distortion). To be able to control the maximal coding error (the upper limit of permitted distortion), an additional family Q of subsets of X is introduced, which we call an error-control family [7]. The error-control family restricts the set of permissible partitions that can be used for encoding. More precisely, a partition P is said to be Q-acceptable iff for every P ∈ P, there exists Q ∈ Q, such that P ⊂ Q (which we write P ≺ Q).
The partitions that are allowed to be used for encoding are limited by a fixed error-control family. The optimal lossy coding scheme (determined by a partition) is the one that minimizes the entropy and does not violate the upper limit of permitted distortion, i.e., P ≺ Q. This leads to the following definition of the entropy of an error-control family [7].
Definition 2. Let Q ⊂ Σ be an error-control family. The entropy of Q is defined by:
$H(\mu; \mathcal{Q}) := \inf\{h(\mu; \mathcal{P}) \in [0, \infty] : \mathcal{P} \text{ is a partition and } \mathcal{P} \prec \mathcal{Q}\}$. (3)
In this paper, we focus on computational methods for finding the value of H(µ; Q) and, if possible, a partition P ≺ Q that satisfies h(µ; P) ≈ H(µ; Q). Ideally, one would look for a partition P satisfying h(µ; P) = H(µ; Q). However, Example II.1 in [7] shows that, in general, the value of the entropy does not have to be attained on any partition.
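On a small finite space, the infimum in Definition 2 can be computed by exhaustive search, which makes the definition concrete. The following is a minimal sketch under that assumption (finite X, point probabilities given as a dictionary); the helper names and the toy instance are hypothetical and serve only as an illustration.

```python
import math


def shannon(p):
    """Shannon function sh(p) = -p * log2(p), with sh(0) = 0."""
    return 0.0 if p <= 0 else -p * math.log2(p)


def set_partitions(items):
    """Yield every partition of the list `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in set_partitions(rest):
        # Put `first` into one of the existing blocks ...
        for i, block in enumerate(smaller):
            yield smaller[:i] + [[first] + block] + smaller[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + smaller


def is_acceptable(partition, family):
    """A partition is Q-acceptable iff every block lies inside some Q."""
    return all(any(set(block) <= q for q in family) for block in partition)


def family_entropy(probs, family):
    """Brute-force H(mu; Q): minimum entropy over all acceptable partitions."""
    best_value, best_partition = math.inf, None
    for partition in set_partitions(list(probs)):
        if not is_acceptable(partition, family):
            continue
        value = sum(shannon(sum(probs[x] for x in block)) for block in partition)
        if value < best_value:
            best_value, best_partition = value, partition
    return best_value, best_partition


# Toy instance (an assumption for illustration only).
probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
family = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
value, partition = family_entropy(probs, family)
print(value, partition)
```

The number of set partitions grows as the Bell numbers, so this search is only feasible for a handful of points; it is meant as a reference against which approximations can be compared.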

Partition Reduction
In order to find the entropy of an error-control family, the minimization problem (3) has to be solved. Let us first observe that even for very simple error-control families, the number of acceptable partitions can be uncountable. Given a family Q = {(−∞, 1], [0, +∞)}, any partition of the form P = {(−∞, a], (a, ∞)}, where a ∈ (0, 1), is Q-acceptable. Clearly, some of them do not lead to the optimal solution. As a consequence, it is extremely important to eliminate partitions that are irrelevant for the entropy calculation (3).
The main result of this section shows that, to find the entropy, it is sufficient to consider only partitions constructed from the sets of $\Sigma_{\mathcal{Q}}$, the σ-algebra generated by the error-control family Q.
In the aforementioned example, $\Sigma_{\mathcal{Q}}$ is generated by the sets (−∞, 0), [0, 1] and (1, +∞), so there are only three such partitions: $\mathcal{P}_1$ = {(−∞, 0), [0, +∞)}, $\mathcal{P}_2$ = {(−∞, 1], (1, +∞)} and $\mathcal{P}_3$ = {(−∞, 0), [0, 1], (1, +∞)}.
Theorem 1. Let (X, Σ, µ) be a probability space, and let Q be an error-control family. Then, we have:
$H(\mu; \mathcal{Q}) = \inf\{h(\mu; \mathcal{P}) \in [0, \infty] : \mathcal{P} \text{ is a partition}, \ \mathcal{P} \subset \Sigma_{\mathcal{Q}} \text{ and } \mathcal{P} \prec \mathcal{Q}\}$. (4)
To derive this fact, for any Q-acceptable partition P, we will construct a partition $\mathcal{R} \subset \Sigma_{\mathcal{Q}}$ with entropy not greater than h(µ; P). To describe the construction of such a partition, let us first establish the notation: for a given partition (or, more generally, family of sets) P of X and a set A ⊂ X, we denote:
$\mathcal{P}_A := \{P \cap A : P \in \mathcal{P}\}$.
This notation will be used throughout this section.
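Then, for a Q-acceptable partition P, a family R is built by the following Algorithm 1.
Algorithm 1 (partition reduction algorithm).
Put i := 1 and $X_1 := X$.
While $\mu(X_i) > 0$, do:
  Let $P^i_{\max} \in \mathcal{P}_{X_i}$ be such that $\mu(P^i_{\max}) = \max\{\mu(P) : P \in \mathcal{P}_{X_i}\}$.
  Let $R_i \in \mathcal{Q}_{X_i}$ be an arbitrary set which satisfies $P^i_{\max} \subset R_i$.
  Put $X_{i+1} := X_i \setminus R_i$ and i := i + 1.
Return $\mathcal{R} := \{R_i\}_i$.
Our goal is to show that the partition reduction algorithm produces a partition R, such that:
$h(\mu; \mathcal{R}) \le h(\mu; \mathcal{P})$.
We need to observe that the following property of the Shannon function holds.
Proposition 1. Given numbers p ≥ q ≥ 0 and r > 0, such that p, q, p + r, q − r ∈ [0, 1], we have: sh(p) + sh(q) ≥ sh(p + r) + sh(q − r).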
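Proposition 1 can be checked with a short monotonicity argument. The derivation below is a sketch; the auxiliary function g is introduced here only for illustration and does not appear in the paper.

```latex
\begin{align*}
g(t) &:= \mathrm{sh}(p + t) + \mathrm{sh}(q - t), \qquad t \in [0, r],
  \qquad \text{where } \mathrm{sh}(x) = -x \log_2 x, \\
\mathrm{sh}'(x) &= -\log_2 x - \log_2 e, \\
g'(t) &= \mathrm{sh}'(p + t) - \mathrm{sh}'(q - t)
       = \log_2 \frac{q - t}{p + t} \le 0
       \quad \text{for } t \in (0, r), \text{ since } 0 < q - t < p + t, \\
\mathrm{sh}(p) + \mathrm{sh}(q) &= g(0) \ge g(r)
       = \mathrm{sh}(p + r) + \mathrm{sh}(q - r).
\end{align*}
```

In words, moving mass r from the lighter set to the heavier one never increases the sum of the two Shannon terms, which is exactly what a single step of the partition reduction exploits.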
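Let us focus on a single iteration of the partition reduction algorithm.
Lemma 1. Let (X, Σ, µ) be a subprobability space, i.e., (X, Σ) is a measurable space, and µ is a non-negative measure on (X, Σ), such that µ(X) ≤ 1. We consider an error-control family Q and a Q-acceptable partition P of X. Let $P_{\max} \in \mathcal{P}$ be such that:
$\mu(P_{\max}) = \max\{\mu(P) : P \in \mathcal{P}\}$.
If Q ∈ Q is chosen such that $P_{\max} \subset Q$, then:
$h(\mu; \{Q\} \cup \{P \setminus Q : P \in \mathcal{P}\}) \le h(\mu; \mathcal{P})$. (5)
Proof. Clearly, if h(µ; P) = ∞, then the inequality (5) holds trivially. Thus, we assume that h(µ; P) < ∞.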
Let us observe that it is enough to consider only elements of P with non-zero measure: the number of such sets is at most countable. Thus, let us assume that $\mathcal{P} = \{P_i\}_{i=1}^{\infty}$ (the case when P is finite can be treated in a similar manner).
For simplicity, we put $P_1 := P_{\max}$. For every k ∈ N, we consider the sequence of sets, defined by: Clearly, for k ∈ N, we have: To complete the proof, it is sufficient to derive that for every k ∈ N, we have: and: Let k ∈ N be arbitrary. Then, from (6) and (7), we get: Making use of Proposition 1, we obtain: which proves (9).
Since the sequence of functions $\{\mathrm{sh}(\mu(P \setminus Q_n))\}_{n \in \mathbb{N}}$ satisfies the assumptions of Lebesgue's dominated convergence theorem [19], we get: Consequently, we have: which completes the proof.
Making use of the above lemma, we summarize the analysis of the partition reduction algorithm in the following theorem.
Theorem 2. We assume that (X, Σ, µ) is a subprobability space. Let Q be an error-control family on X, and let P be a Q-acceptable partition of X. A family R constructed by the partition reduction algorithm is a partition of X and satisfies:
$h(\mu; \mathcal{R}) \le h(\mu; \mathcal{P})$. (11)
Proof. Directly from the partition reduction algorithm, we get that R is a countable family of pairwise disjoint sets. The fact that:
$\mu\bigl(X \setminus \bigcup_{R \in \mathcal{R}} R\bigr) = 0$
follows from Lebesgue's dominated convergence theorem [19] applied to a sequence of functions $f_n : \mathcal{P} \to \mathbb{R}$ defined by: We prove the inequality (11). If h(µ; P) = ∞, then the inequality (11) is straightforward. Thus, let us discuss the case when h(µ; P) < ∞.
We denote $\mathcal{P} = \{P_i\}_{i=1}^{\infty}$, since at most a countable number of elements of the partition can have a positive measure (the case when P is finite follows similarly). We will use the notation introduced in the partition reduction algorithm.
Directly from Lemma 1, we obtain: Consequently, for every k ∈ N, we get: Our goal is to show that: for every k ∈ N. Making use of (12), we have: for every k ∈ N.
We will calculate $\lim_{n \to \infty} \sum_{i=1}^{\infty} \mathrm{sh}\bigl(\mu\bigl(P_i \setminus \bigcup_{j=1}^{n} R_j\bigr)\bigr)$ using Lebesgue's dominated convergence theorem [19] for a sequence of functions $\{f_n\}_{n=1}^{\infty}$, defined by: Similar reasoning was used in the proof of Lemma 1.
Similarly to the proof of Lemma 1, we may assume that there exists m ∈ N, such that: and: for every n ∈ N. Moreover, for every P ∈ P, since R is a partition of X.
Making use of Lebesgue's dominated convergence theorem [19], we get: Consequently, for every k ∈ N, we have: which completes the proof.
As a consequence of Theorem 2, we directly get that Theorem 1 holds. In the case of finite error-control families, we get that there exists an acceptable partition with minimal entropy.
Corollary 1. Let (X, Σ, µ) be a probability space, and let Q be a finite error-control family. Then, there exists a Q-acceptable partition $\mathcal{P} \subset \Sigma_{\mathcal{Q}}$, such that:
$h(\mu; \mathcal{P}) = H(\mu; \mathcal{Q})$.
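The partition reduction procedure analysed in this section can be sketched for a finite space, where every step is directly computable. This is an illustrative reading of the algorithm under the assumption that X is finite and the input partition is Q-acceptable; `reduce_partition` and the toy data are hypothetical names, not taken from the paper.

```python
import math


def shannon(p):
    """Shannon function sh(p) = -p * log2(p), with sh(0) = 0."""
    return 0.0 if p <= 0 else -p * math.log2(p)


def mass(probs, subset):
    """Measure of a finite subset."""
    return sum(probs[x] for x in subset)


def entropy(probs, partition):
    """Entropy of a partition given as a list of sets."""
    return sum(shannon(mass(probs, block)) for block in partition)


def reduce_partition(probs, partition, family):
    """Replace a Q-acceptable partition by one built from sets Q ∩ X_i.

    In each round, the heaviest remaining piece of `partition` is located and
    swallowed by a set of `family` (restricted to the remaining points) that
    contains it. Assumes `partition` is acceptable with respect to `family`.
    """
    remaining = set(probs)
    reduced = []
    while mass(probs, remaining) > 0:
        pieces = [set(block) & remaining for block in partition]
        p_max = max(pieces, key=lambda piece: mass(probs, piece))
        candidates = (set(q) & remaining for q in family)
        r = next(c for c in candidates if p_max <= c)
        reduced.append(frozenset(r))
        remaining -= r
    return reduced


# Toy instance (an assumption for illustration only).
probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
family = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
fine = [{"a"}, {"b"}, {"c"}, {"d"}]
reduced = reduce_partition(probs, fine, family)
print(reduced, entropy(probs, fine), entropy(probs, reduced))
```

Every returned set has the form Q ∩ X_i, so it belongs to the σ-algebra generated by the family, and on this instance the entropy drops from that of the singleton partition to a strictly smaller value.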

Entropy Approximation
In the previous section, we simplified the problem of entropy calculation by reducing the number of partitions that need to be considered to find the entropy of an error-control family. Since the number of acceptable partitions grows exponentially with the cardinality of the error-control family, it might be impossible to test all of them for entropy calculation. In this section, we show an algorithm that allows us to approximate the entropy within an additive term.
The presented formulation of lossy source coding is closely related to the minimum entropy set cover problem (MESC) [17,18], where one focuses on a similar optimization problem. There exists an algorithm that approximates the solution of MESC within an additive term of $\log_2 e$. First, we present a description of MESC and its relationship with the introduced coding problem. Next, we use these facts to apply a similar technique for approximating the entropy of an error-control family.

Relationship with the Minimum Entropy Set Cover Problem
In order to define MESC, let $X = \{x_1, \ldots, x_n\}$ be a finite dataset, where every observation x ∈ X appears with probability $p_x$. A random source produces a signal from the probability distribution $\{p_{x_1}, \ldots, p_{x_n}\}$ and passes it through a noisy channel. Each observation has a type, but due to the noise, we only know that it is one of a given set of types defined by a finite cover $\mathcal{Q} = \{Q_1, \ldots, Q_k\}$ of the data space X. We map an observation to a type by defining an assignment $f^{\mathcal{Q}} : X \to \mathcal{Q}$ that is compatible with Q, i.e., $x \in f^{\mathcal{Q}}(x)$ for every x ∈ X. Let us denote by $q_i$ the probability that a random point is assigned to $Q_i$:
$q_i := \sum_{x : f^{\mathcal{Q}}(x) = Q_i} p_x$.
The goal is to find an assignment that minimizes the entropy of the distribution of the types, i.e.,
$h(f^{\mathcal{Q}}) := -\sum_{i=1}^{k} q_i \log_2 q_i$.
Such an optimal assignment is denoted by $f^{\mathcal{Q}}_{\mathrm{opt}}$.
MESC is an NP-hard problem (Theorem 1 in [18]). To find an assignment that efficiently approximates the minimal entropy value, a simple greedy algorithm can be used (which we call the greedy MESC algorithm). It relies on the iterative execution of the following steps:
• find a set $Q_{i_{\max}} \in \mathcal{Q}$ with maximal probability and assign all of its elements to it,
• remove from X (and from all Q ∈ Q) the elements of $Q_{i_{\max}}$,
as long as there exists Q ∈ Q with a positive probability [18].
Cardinal et al. proved a sharp bound for the entropy of the assignment constructed with the use of the greedy MESC algorithm:
Greedy MESC approximation (see [17], Theorem 1): If $f^{\mathcal{Q}}_{g}$ is an assignment produced by the greedy algorithm, then:
$h(f^{\mathcal{Q}}_{g}) \le h(f^{\mathcal{Q}}_{\mathrm{opt}}) + \log_2 e$.
To be able to obtain a similar approximation of the entropy of an error-control family, the relationship between MESC and our formulation of entropy has to be established. For this purpose, our lossy source coding problem will be considered on a discrete probability space (X, Σ, µ), where X is a finite set, Σ is the σ-algebra generated by all singletons of X and $\mu := \sum_{x \in X} p_x \delta_x$ is an atomic measure on (X, Σ). A cover $\mathcal{Q} = \{Q_1, \ldots, Q_k\}$ plays the role of an error-control family.
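The greedy MESC procedure described above can be sketched in a few lines. This is an illustrative implementation, assuming a finite dataset with strictly positive point probabilities and a cover given as a list of Python sets; the function names and the toy data are hypothetical.

```python
import math


def shannon(p):
    """Shannon function sh(p) = -p * log2(p), with sh(0) = 0."""
    return 0.0 if p <= 0 else -p * math.log2(p)


def greedy_mesc(probs, cover):
    """Greedy assignment: point -> index of the cover set it is assigned to."""
    remaining = set(probs)
    assignment = {}
    while True:
        # Probability of the still-unassigned part of every cover set.
        masses = [sum(probs[x] for x in q & remaining) for q in cover]
        i_max = max(range(len(cover)), key=masses.__getitem__)
        if masses[i_max] <= 0:
            break  # no cover set has positive probability any more
        for x in cover[i_max] & remaining:
            assignment[x] = i_max
        remaining -= cover[i_max]
    return assignment


def assignment_entropy(probs, assignment, n_types):
    """Entropy of the distribution of types induced by an assignment."""
    q = [0.0] * n_types
    for x, i in assignment.items():
        q[i] += probs[x]
    return sum(shannon(m) for m in q)


# Toy instance (an assumption for illustration only).
probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
cover = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
f = greedy_mesc(probs, cover)
print(f, assignment_entropy(probs, f, len(cover)))
```

Each round costs one pass over the cover, so the whole procedure is polynomial in the size of the input, in contrast to the NP-hard exact problem.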
We start our analysis by showing that, given an assignment $f^{\mathcal{Q}}$ compatible with Q, one can construct a Q-acceptable partition with equal entropy:
Lemma 2. Let $f^{\mathcal{Q}}$ be an assignment compatible with Q. Then, the family $\mathcal{P} = \{P_i\}_{i=1}^{k}$, where $P_i := \{x \in X : f^{\mathcal{Q}}(x) = Q_i\}$, is a Q-acceptable partition, and:
$h(\mu; \mathcal{P}) = h(f^{\mathcal{Q}})$.
Proof. Directly from the definition of a compatible assignment, we get that P is a Q-acceptable partition of X.
The following example illustrates that the natural inverse construction is not possible, i.e., for some Q-acceptable partitions, there does not exist any compatible assignment with identical entropy.
Example 2. Let (X, Σ, µ) be a probability space with an error-control family Q, where: Then: is a Q-acceptable partition with the entropy equal to one. However, there is no compatible assignment with the entropy of one: the only assignment $f^{\mathcal{Q}} : X \to \mathcal{Q}$ compatible with Q is defined by: whose entropy equals zero.
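The following result demonstrates that, given a partition, one can find an assignment whose entropy is not greater:
Lemma 3. Let P ≺ Q be a partition. Then, there exists an assignment $f^{\mathcal{Q}}$ compatible with Q, such that:
$h(f^{\mathcal{Q}}) \le h(\mu; \mathcal{P})$.
Proof. Since P is a partition, the function $g : X \ni x \mapsto P_x \in \mathcal{P}$, for $x \in P_x$, is well defined. Moreover, as P is Q-acceptable, we find a mapping $h : \mathcal{P} \to \mathcal{Q}$, such that $P \subset h(P)$ for every P ∈ P. Finally, we define an assignment $f^{\mathcal{Q}} : X \to \mathcal{Q}$ by $f^{\mathcal{Q}} := h \circ g$. Clearly, $f^{\mathcal{Q}}$ is an assignment compatible with Q. Let us calculate the entropy of $f^{\mathcal{Q}}$. We have:
$h(f^{\mathcal{Q}}) = \sum_{Q \in \mathcal{Q}} \mathrm{sh}\Bigl(\sum_{P : h(P) = Q} \mu(P)\Bigr)$.
Making use of the subadditivity of the Shannon function, we get:
$\sum_{Q \in \mathcal{Q}} \mathrm{sh}\Bigl(\sum_{P : h(P) = Q} \mu(P)\Bigr) \le \sum_{Q \in \mathcal{Q}} \sum_{P : h(P) = Q} \mathrm{sh}(\mu(P)) = h(\mu; \mathcal{P})$.
As a consequence, we get that the entropy of the optimal assignment equals the entropy of an error-control family.
Theorem 3. We have:
$H(\mu; \mathcal{Q}) = h(f^{\mathcal{Q}}_{\mathrm{opt}})$.
Proof. If $f^{\mathcal{Q}}_{\mathrm{opt}}$ is an optimal assignment, then making use of Lemma 2, we construct a Q-acceptable partition $\mathcal{P}_1$ that satisfies:
$H(\mu; \mathcal{Q}) \le h(\mu; \mathcal{P}_1) = h(f^{\mathcal{Q}}_{\mathrm{opt}})$.
On the other hand, since Q is a finite error-control family, from Theorem 1 (see Corollary 1), we get:
$H(\mu; \mathcal{Q}) = h(\mu; \mathcal{P}_2)$
for a specific Q-acceptable partition $\mathcal{P}_2$. Using Lemma 3, we find an assignment $f^{\mathcal{Q}}$ compatible with Q, such that:
$h(f^{\mathcal{Q}}_{\mathrm{opt}}) \le h(f^{\mathcal{Q}}) \le h(\mu; \mathcal{P}_2) = H(\mu; \mathcal{Q})$,
which completes the proof.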
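The construction from the proof of Lemma 3 (send every point to a cover set that contains its partition block) can be sketched as follows. The two-point instance is an assumption chosen for illustration: it has a partition of entropy one while the only compatible assignment has entropy zero, which matches the behavior described in Example 2. All names are hypothetical.

```python
import math


def shannon(p):
    """Shannon function sh(p) = -p * log2(p), with sh(0) = 0."""
    return 0.0 if p <= 0 else -p * math.log2(p)


def partition_entropy(probs, partition):
    """Entropy of a partition given as a list of sets."""
    return sum(shannon(sum(probs[x] for x in block)) for block in partition)


def assignment_from_partition(partition, family):
    """Map every point to the index of a family set containing its block.

    Assumes the partition is acceptable, so such a set exists for each block;
    the first matching set is used.
    """
    assignment = {}
    for block in partition:
        index = next(i for i, q in enumerate(family) if block <= q)
        for x in block:
            assignment[x] = index
    return assignment


def assignment_entropy(probs, assignment, family):
    """Entropy of the distribution of types induced by an assignment."""
    masses = [0.0] * len(family)
    for x, i in assignment.items():
        masses[i] += probs[x]
    return sum(shannon(m) for m in masses)


# Illustrative instance: two equally likely points, one covering set.
probs = {0: 0.5, 1: 0.5}
family = [{0, 1}]
partition = [{0}, {1}]
f = assignment_from_partition(partition, family)
print(partition_entropy(probs, partition), assignment_entropy(probs, f, family))
```

Blocks that share the same target set are merged by the assignment, and by subadditivity of the Shannon function merging can only lower the entropy.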

Greedy Approximation
In this section, we show that the analogue of the greedy MESC algorithm can be applied for the case of our formulation of lossy source coding.Furthermore, similar bounds can be established.
Let us start with an extended version of the approximation algorithm, which we call the greedy entropy algorithm (Algorithm 2). Contrary to the greedy MESC algorithm, our procedure works directly with partitions; hence, it is more general. We assume that Q is a finite error-control family.
To see that the greedy entropy algorithm is not well defined for infinite error-control families, let us consider the following example:
Example 3. Let us consider an open segment (0, 1) with the σ-algebra generated by all Borel subsets of (0, 1), the Lebesgue measure λ and an error-control family, defined by: There does not exist a set of maximal measure in the family Q; hence, the greedy entropy algorithm cannot be applied directly in such a case.
Let us observe that both greedy algorithms create partitions with the same entropies. For this purpose, we denote by $\mathrm{Greedy}_{f^{\mathcal{Q}}}$ the set of all assignments produced by the greedy MESC algorithm, while by $\mathrm{Greedy}_{\mathcal{Q}}$, we denote the set of all partitions returned by the greedy entropy algorithm:
Proposition 2. We have:
• For every $f^{\mathcal{Q}}_g \in \mathrm{Greedy}_{f^{\mathcal{Q}}}$, there exists $\mathcal{P}_g \in \mathrm{Greedy}_{\mathcal{Q}}$, such that: $h(\mu; \mathcal{P}_g) = h(f^{\mathcal{Q}}_g)$.
• For every $\mathcal{P}_g \in \mathrm{Greedy}_{\mathcal{Q}}$, there exists $f^{\mathcal{Q}}_g \in \mathrm{Greedy}_{f^{\mathcal{Q}}}$, such that: $h(f^{\mathcal{Q}}_g) = h(\mu; \mathcal{P}_g)$.
The main result of this section shows that the greedy entropy algorithm produces a partition whose entropy exceeds the entropy of the error-control family by at most $\log_2 e$.
Theorem 4. Let (X, Σ, µ) be a probability space, and let Q be a finite error-control family. Then:
$h(\mu; \mathcal{P}) \le H(\mu; \mathcal{Q}) + \log_2 e$, for $\mathcal{P} \in \mathrm{Greedy}_{\mathcal{Q}}$.

The proof of Theorem 4 involves two facts:
• To calculate the entropy, it is sufficient to consider only partitions constructed from the elements of σ-algebra generated by Q (see Corollary 1).
• The calculation of the entropy of an error-control family is closely related to MESC optimization problem (see Theorem 3 and Proposition 2).
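To be able to apply these facts, we need an additional lemma:
Lemma 4. Let (X, Σ, µ) be an arbitrary (not necessarily discrete) probability space, and let Q be a finite error-control family. Then, there exists a probability space $(\tilde{X}, \tilde{\Sigma}, \tilde{\mu})$ with an error-control family $\tilde{\mathcal{Q}}$, such that $\tilde{X}$, $\tilde{\Sigma}$ and $\tilde{\mathcal{Q}}$ are finite, $\tilde{\mu}$ is an atomic measure, and for every $\mathcal{P}_g \in \mathrm{Greedy}_{\mathcal{Q}}$, there exists $\tilde{\mathcal{P}}_g \in \mathrm{Greedy}_{\tilde{\mathcal{Q}}}$ satisfying:
$h(\mu; \mathcal{P}_g) = h(\tilde{\mu}; \tilde{\mathcal{P}}_g)$.
Proof. We restrict our consideration to partitions $\mathcal{P} \subset \Sigma_{\mathcal{Q}}$, since, by Corollary 1, the entropy H(µ; Q) is attained on some partition generated by $\Sigma_{\mathcal{Q}}$. Let us denote the set of generators of $\Sigma_{\mathcal{Q}}$: Next, for every set $G \in \mathrm{Gen}(\Sigma_{\mathcal{Q}})$, we fix exactly one point $x_G \in G$.
Then, we obtain a probability space $(\tilde{X}, \tilde{\Sigma}, \tilde{\mu})$ and an error-control family $\tilde{\mathcal{Q}}$ by: It is easy to see that every $\tilde{\mathcal{Q}}$-acceptable $\tilde{\mu}$-partition $\tilde{\mathcal{P}} \subset \Sigma_{\tilde{\mathcal{Q}}}$ corresponds naturally to a specific Q-acceptable partition $\mathcal{P} \subset \Sigma_{\mathcal{Q}}$ and conversely. The measures of corresponding sets are equal. Thus:
$H(\mu; \mathcal{Q}) = H(\tilde{\mu}; \tilde{\mathcal{Q}})$.
Moreover, for every $\mathcal{P}_g \in \mathrm{Greedy}_{\mathcal{Q}}$, there exists $\tilde{\mathcal{P}}_g \in \mathrm{Greedy}_{\tilde{\mathcal{Q}}}$, which satisfies: $h(\mu; \mathcal{P}_g) = h(\tilde{\mu}; \tilde{\mathcal{P}}_g)$.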
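The reduction used in Lemma 4 can be illustrated on a finite ground set: points that belong to exactly the same sets of the family are indistinguishable for every partition built from the generated σ-algebra, so each such group can be collapsed to one representative carrying the total mass. The sketch below assumes a finite X and takes the smallest point of every group as its representative; the function names are hypothetical.

```python
from collections import defaultdict


def atoms(points, family):
    """Group points by their membership pattern in the sets of `family`."""
    groups = defaultdict(set)
    for x in points:
        signature = tuple(x in q for q in family)
        groups[signature].add(x)
    return [frozenset(group) for group in groups.values()]


def reduce_space(probs, family):
    """Collapse every atom to a single representative point.

    Returns the atomic measure on the representatives and the family
    rewritten in terms of representatives.
    """
    representative = {}
    new_probs = {}
    for atom in atoms(probs, family):
        rep = min(atom)  # any fixed point of the atom would do
        representative[rep] = atom
        new_probs[rep] = sum(probs[x] for x in atom)
    new_family = [
        {rep for rep, atom in representative.items() if atom <= q}
        for q in family
    ]
    return new_probs, new_family


# Toy instance (an assumption): ten equally likely points, two overlapping sets.
probs = {x: 0.1 for x in range(10)}
family = [set(range(0, 6)), set(range(4, 10))]
new_probs, new_family = reduce_space(probs, family)
print(new_probs, new_family)
```

A family of k sets produces at most 2^k such groups, which is why the reduced space is finite even when the original space is not.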
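Finally, the proof of our main result is as follows:
Proof (of Theorem 4). Making use of Lemma 4, we find a probability space $(\tilde{X}, \tilde{\Sigma}, \tilde{\mu})$ with the error-control family $\tilde{\mathcal{Q}}$, such that $\tilde{X}$, $\tilde{\Sigma}$, $\tilde{\mathcal{Q}}$ are finite, $\tilde{\mu}$ is an atomic measure and:
$H(\mu; \mathcal{Q}) = H(\tilde{\mu}; \tilde{\mathcal{Q}})$.
By Theorem 3, we get that if $f^{\tilde{\mathcal{Q}}}_{\mathrm{opt}}$ is an optimal assignment compatible with $\tilde{\mathcal{Q}}$, then:
$H(\tilde{\mu}; \tilde{\mathcal{Q}}) = h(f^{\tilde{\mathcal{Q}}}_{\mathrm{opt}})$.
Moreover, making use of Proposition 2, for every $\tilde{\mathcal{P}}_g \in \mathrm{Greedy}_{\tilde{\mathcal{Q}}}$, we find $f^{\tilde{\mathcal{Q}}}_g \in \mathrm{Greedy}_{f^{\tilde{\mathcal{Q}}}}$, such that:
$h(\tilde{\mu}; \tilde{\mathcal{P}}_g) = h(f^{\tilde{\mathcal{Q}}}_g)$.
Thus, by the greedy MESC approximation, we have:
$h(\tilde{\mu}; \tilde{\mathcal{P}}_g) = h(f^{\tilde{\mathcal{Q}}}_g) \le h(f^{\tilde{\mathcal{Q}}}_{\mathrm{opt}}) + \log_2 e = H(\tilde{\mu}; \tilde{\mathcal{Q}}) + \log_2 e$.
Consequently, using Lemma 4, we get: $h(\mu; \mathcal{P}_g) \le H(\mu; \mathcal{Q}) + \log_2 e$.
The above approximation cannot be improved. To see this, we use the example from Section 2 of [17] adapted to our situation. Moreover, it is also NP-hard to approximate the entropy within an additive term lower than $\log_2 e$ (see Theorem 2 of [17]).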
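The bound of Theorem 4 can be checked numerically on small random instances. The sketch below is one possible reading of the greedy entropy algorithm for a finite space (repeatedly cut out the still-uncovered part of the most probable set of the family); the exact optimum is obtained by exhaustive search over compatible assignments, which by Theorem 3 yields H(µ; Q). The instance generator and all names are assumptions made for illustration.

```python
import math
import random
from itertools import product


def shannon(p):
    """Shannon function sh(p) = -p * log2(p), with sh(0) = 0."""
    return 0.0 if p <= 0 else -p * math.log2(p)


def greedy_partition(probs, family):
    """Greedy selection of the most probable remaining part of a family set."""
    remaining, parts = set(probs), []
    while True:
        best = max(family, key=lambda q: sum(probs[x] for x in q & remaining))
        piece = best & remaining
        if sum(probs[x] for x in piece) <= 0:
            break
        parts.append(piece)
        remaining -= piece
    return parts


def optimal_entropy(probs, family):
    """Exact minimum over all compatible assignments (exponential time)."""
    points = list(probs)
    choices = [[i for i, q in enumerate(family) if x in q] for x in points]
    best = math.inf
    for combo in product(*choices):
        masses = [0.0] * len(family)
        for x, i in zip(points, combo):
            masses[i] += probs[x]
        best = min(best, sum(shannon(m) for m in masses))
    return best


def random_instance(rng, n_points=6, n_sets=4):
    """Random positive probabilities and a random cover of the points."""
    weights = [rng.random() + 0.01 for _ in range(n_points)]
    total = sum(weights)
    probs = {x: w / total for x, w in enumerate(weights)}
    family = [set(rng.sample(range(n_points), rng.randint(2, 4)))
              for _ in range(n_sets)]
    uncovered = set(range(n_points)) - set().union(*family)
    if uncovered:
        family.append(uncovered)  # make sure the family covers every point
    return probs, family


rng = random.Random(0)
gaps = []
for _ in range(100):
    probs, family = random_instance(rng)
    greedy = sum(shannon(sum(probs[x] for x in part))
                 for part in greedy_partition(probs, family))
    gaps.append(greedy - optimal_entropy(probs, family))
print(max(gaps), math.log2(math.e))
```

On instances of this size the observed gap is typically far below log_2 e; the bound is sharp only on specially constructed families such as the one referenced from [17].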

Conclusion and Future Work
The paper focused on a non-standard type of lossy source coding. In contrast to rate distortion theory, a cover of the source alphabet, which defines a maximal distortion permitted on every element, was introduced. The calculation of the entropy in such a formulation of lossy coding is equivalent to solving the minimum entropy optimization problem, where one would like to find a minimum-entropy coding partition compatible with a fixed distortion. Our results show how to simplify this optimization problem and how to find an approximate entropy value. The proposed algorithm is fast, easy to implement and produces a partition with a proven bound on accuracy, i.e., the entropy of the returned partition exceeds the true entropy value by at most $\log_2 e$.
In the future, we plan to consider a more general family of entropy functions, including Rényi and Tsallis entropies, which are of great importance in the theory of coding and related problems [6,20,21].Moreover, there also arises a natural question concerning the compression of n-tuple random variables.More precisely, it is worth investigating how the coding efficiency increases when the larger blocks of source elements are compressed jointly.

Figure 1. Input image for compression (a) and partitions produced by the greedy entropy algorithm for two cases of error-control families: the first one (b) consists of cubes with a given side length, while the second (c) contains balls with a given radius. For the visualization, only two color components were used: green and blue in (b) and (c).

Figure 2. Convergence behavior (entropy (a) and cardinality (b)) of the partitions produced by the greedy entropy algorithm for two cases of error-control families: the first one consists of cubes with a given side length, while the second contains balls with a given radius. In every case, the diameters of elements included in the error-control family were increased, which caused the decrease of the entropy and the cardinality of the resultant partitions.