
Entropy 2015, 17(5), 3400-3418; https://doi.org/10.3390/e17053400

Article
Entropy Approximation in Lossy Source Coding Problem
by Marek Śmieja * and Jacek Tabor
Department of Mathematics and Computer Science, Jagiellonian University, Lojasiewicza 6, 30-348 Kraków, Poland
* Author to whom correspondence should be addressed.
Received: 26 March 2015 / Accepted: 12 May 2015 / Published: 18 May 2015

## Abstract

In this paper, we investigate a lossy source coding problem in which an upper limit on the permitted distortion is defined for every dataset element. It can be seen as an alternative to rate distortion theory, where a bound on the allowed average error is specified. In order to find the entropy, which gives the statistical length of a source code compatible with a fixed distortion bound, a corresponding optimization problem has to be solved. First, we show how to simplify this general optimization by discarding coding partitions that are irrelevant for the entropy calculation. In our main result, we present a fast and easily implementable greedy algorithm, which allows one to approximate the entropy within an additive error term of log2 e. The proof is based on the minimum entropy set cover problem, for which a similar bound was obtained.
Keywords:
Shannon entropy; entropy approximation; minimum entropy set cover; lossy compression; source coding

## 1. Introduction

Lossy source coding transforms possibly continuously-distributed information into a finite number of codewords [1,2]. Although this allows one to encode data efficiently, such an operation is irreversible, and once modified, information cannot be restored accurately. One of the fundamental questions in lossy coding is the following: What is the lowest achievable statistical code length given a maximal coding error? To answer this question, a precise formulation of the coding error and the related definition of the entropy need to be given. In this paper, we show how to approximate the value of the entropy in the case when every source element has a fixed upper limit on the permitted error.

#### 1.1. Motivation

In order to explain our results, we give a more precise problem formulation. Suppose that a random source represented by a probability measure µ produces information from a space X. We fix a partition $P$ of X and encode an arbitrary element x ∈ X by a unique $P ∈ P$, such that x ∈ P. The statistical code length is described by the Shannon entropy of µ with respect to $P$:
$h ( μ ; P ) = − ∑ P ∈ P μ ( P ) log 2 μ ( P ) .$
Example 1. Suppose that we want to encode numbers picked randomly from [0, n). One can use a coding partition with equally-sized sets, e.g., $P δ = { [ k , k + δ ) : k = 0 , δ , … , n − δ }$. When the source elements are encoded by the centers of these sets, then the (de)coding error does not exceed δ/2. Clearly, one may construct partitions that contain different types of sets. Roughly speaking, highly probable elements should be coded with high accuracy (smaller sets), while rare numbers can be coded with low precision (larger sets). DjVu is an example of a file format that uses different precision for various image elements: it compresses the text layer and the background separately [4]. The proposed approach allows one to define a maximal coding error for every dataset element separately.
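For a uniformly distributed source on [0, n), the entropy of the partition from Example 1 can be computed explicitly. The sketch below (the function name is ours; we assume δ divides n) illustrates that halving the permitted error δ costs exactly one extra bit:

```python
import math

def uniform_partition_entropy(n, delta):
    """Entropy of the uniform coding partition of [0, n) into cells
    [k, k + delta) under a uniformly distributed source."""
    cells = n // delta            # number of cells, each of measure delta/n
    p = 1.0 / cells
    return -cells * p * math.log2(p)
```

For instance, for n = 8, coding with δ = 1 requires 3 bits per symbol on average, while δ = 2 requires 2 bits.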
To control the maximal coding error in the above formulation, we propose to use an additional family $Q$ of subsets of X, which we call an error-control family. The error-control family is a kind of fidelity criterion [5]. We accept only those coding partitions $P$ in which every element is a subset of some element of $Q$ (we say that $P$ is $Q$-acceptable). The optimal $Q$-acceptable coding partition is the one with minimal entropy. Thus, we define the entropy of an error-control family $Q$ by [6-8]:
$H ( μ ; Q ) : = inf { h ( μ ; P ) ∈ [ 0 , ∞ ] : P is a Q - acceptable partition } .$
The above problem differs from rate distortion theory [9,10], which is the most common approach to lossy source coding. Instead of specifying an upper limit on the allowed average distortion rate, an upper limit on the permitted distortion for every symbol is considered. The above formulation can be seen as a kind of vector quantization [11,12] and was partially motivated by the notion of epsilon-entropy proposed by Posner [13,14]. The entropy of specific error-control families in the case of metric spaces also appears in the definition of the Rényi entropy dimension. It is worth mentioning that our idea is partially connected with perceptual source coding considered by Jayant et al.
Answering the question raised at the beginning of the paper concerning the lowest code length given a maximal coding error is equivalent to the calculation of the entropy of an error-control family (1). In this paper, we focus on methods that allow us to approximate this quantity.

#### 1.2. Main Results

In our main result, Theorem 4, we present a method that allows us to approximate the entropy of an error-control family within an additive term. More precisely, we propose a fast and easy to implement algorithm, the greedy entropy algorithm, which for a given finite error-control family $Q$ produces a $Q$-acceptable partition $P$ satisfying:
$h ( μ ; P ) ≤ H ( μ ; Q ) + log 2 e .$
The obtained bound is sharp and cannot be improved. Moreover, it is independent of a coding problem instance characterized by a probability measure µ and an error-control family $Q$.
Our method is reminiscent of the procedure used to approximate the solution of the minimum entropy set cover problem (MESC) [17,18], where a similar bound was derived. Roughly speaking, MESC is an optimization problem in which one seeks a minimal-entropy partition compatible with a fixed cover of a finite dataset X. In fact, it looks for an assignment $f : X → Q$, which can be seen as a special case of a partition. A similar greedy algorithm was used to produce a partition whose entropy approximates the minimal entropy value. To be able to apply the results obtained for MESC, the precise relationships between these two problems had to be established. In particular, in Theorem 3, we show that the entropy of an error-control family equals the minimal entropy for a set cover. Let us observe that our main minimization problem (1) is very complex, since for most examples of error-control families, there exists an uncountable number of acceptable partitions (see Section 2.2). As a consequence, an exhaustive search through all acceptable partitions cannot be done in practice. The second key step in establishing the inequality (2) was to show that the number of partitions relevant for the entropy calculation may be drastically reduced. In Theorem 1, we derive that, to find the entropy of an error-control family, it is sufficient to consider only partitions constructed from the elements of the σ-algebra generated by the error-control family.

#### 1.3. Discussion

Before presenting the details of the proposed algorithm, let us first demonstrate its sample effects. The reader interested in its full description may skip this part during the first reading.
In order to show the capabilities of the greedy entropy algorithm, we apply this procedure to image segmentation. For simplicity, we assume that every pixel is represented by a three-dimensional feature vector, i.e., the intensity of each color coordinate ranges between 0 and 255.
Let the error-control family $Q_δ^1$ consist of all cubes with a side length δ, i.e., $Q_δ^1 = \{ [k, k+δ)^3 : k = 0, 1, \ldots, 256−δ \}$. The greedy entropy algorithm sequentially selects the most probable cubes from the image histogram. The final partition $P_δ^1$, for two color components, green and blue, is shown in Figure 1b. Table 1a presents the comparison between the entropies calculated for the partition $P_δ^1$ and another $Q_δ^1$-acceptable partition $P_δ$, consisting of all pairwise disjoint cubes with side length δ > 0. Surprisingly, both partitions gave similar entropies. This might be explained by the fact that $P_δ$ contains only elements of the error-control family $Q_δ^1$.
Clearly, this is not always the case. To observe this, let the error-control family $Q_δ^2$ consist of all balls with radius δ. Figure 1c shows the partition $P_δ^2$ returned by the greedy entropy algorithm for two color components: green and blue. As in the previous case, we also consider a $Q_δ^2$-acceptable partition $P_{\lfloor \frac{2\sqrt{3}}{3}δ \rfloor}$ of maximally-sized pairwise disjoint cubes with a fixed side length. We can see from the results placed in Table 1b that the greedy selection provided significantly lower entropy values.
The convergence behavior of the partitions produced by the greedy entropy algorithm is shown in Figure 2. One can observe that both the entropy and the cardinality of the resultant partitions decrease hyperbolically as the diameters of the sets included in the error-control family increase.

#### 1.4. Paper Organization

The paper is organized as follows. In Section 2, we formulate our lossy source coding problem and show that the entropy optimization problem can be simplified by eliminating partitions that are irrelevant for the entropy calculation. Section 3 contains our main result. First, the minimum entropy set cover problem is recalled, and its relationship with our notion of entropy is established. Next, we define the greedy entropy algorithm and show that it constructs a partition with entropy close to the optimal one.

## 2. Entropy Calculation

We start this section by establishing basic notation and definitions. Then, we present the main problem of this paper concerning the entropy calculation and show how to reduce its complexity by eliminating irrelevant coding partitions.

#### 2.1. Lossy Source Coding and Error-Control Families

Let us assume that (X, Σ, µ) is a probability space. In our formulation of lossy source coding, we are interested in encoding elements of X produced by a probability measure µ by a countable number of symbols. The source code is determined by a partition of X, which is a countable family of measurable, pairwise disjoint subsets of X, such that:
$μ ( X \ ∪ P ∈ P P ) = 0 .$
More precisely, every element x ∈ X is transformed into a code related to a unique $P ∈ P$, such that x ∈ P. The statistical code length of an arbitrary element of X in the optimal coding scheme can be calculated by the entropy of $P$:
Definition 1. The entropy of a partition$P$ is defined by:
$h ( μ ; P ) : = ∑ P ∈ P sh ( μ ( P ) ) ,$
where sh : [0,1] → [0, ∞) is the Shannon function, i.e.,
$sh ( x ) : = − x ⋅ log 2 ( x ) , for x ∈ [ 0 , 1 ] ,$
where by −0 · log2(0), we understand zero.
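Definition 1 translates directly into code; a minimal sketch (the function names are ours):

```python
import math

def sh(x):
    """Shannon function sh(x) = -x * log2(x), with the convention sh(0) = 0."""
    return 0.0 if x == 0 else -x * math.log2(x)

def partition_entropy(masses):
    """h(mu; P) for a partition given as the list of its cell measures mu(P)."""
    return sum(sh(m) for m in masses)
```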
Although the entropy and the partition depend strictly on the probability measure, we consistently omit the symbol µ in their notation to simplify the presentation.
The use of a partition causes a coding error (distortion). To be able to control the maximal coding error (the upper limit of permitted distortion), an additional family $Q$ of subsets of X is introduced, which we call an error-control family. The error-control family restricts the set of permissible partitions that can be used for encoding. More precisely, a partition $P$ is said to be $Q$-acceptable iff for every $P ∈ P$, there exists $Q ∈ Q$, such that P ⊂ Q (which we write $P ≺ Q$).
The partitions that are allowed to be used for encoding are limited by a fixed error-control family. The optimal lossy coding scheme (determined by a partition) is the one that minimizes the entropy and does not violate the upper limit of permitted distortion, i.e., $P ≺ Q$. This leads to the following definition of the entropy of an error-control family.
Definition 2. Let$Q ⊂ ∑$ be an error-control family. The entropy of$Q$ is defined by:
$H ( μ ; Q ) : = inf { h ( μ ; P ) ∈ [ 0 , ∞ ] : P is a partition and P ≺ Q } .$
In this paper, we focus on computational methods for finding the value of $H ( μ ; Q )$ and possibly a partition $P ≺ Q$ that satisfies $h ( μ ; P ) ≈ H ( μ ; Q )$. Ideally, one would look for a partition $P$ satisfying $h ( μ ; P ) = H ( μ ; Q )$; however, Example II.1 in [7] shows that the value of the entropy does not have to be attained on any partition in general.

#### 2.2. Partition Reduction

In order to find the entropy of an error-control family, the minimization problem (3) has to be solved. Let us first observe that even for very simple error-control families, the number of acceptable partitions can be uncountable. Given the family $Q = { ( − ∞ , 1 ] , [ 0 , + ∞ ) }$, any partition of the form $P = { ( − ∞ , a ] , ( a , ∞ ) }$, where a ∈ (0,1), is $Q$-acceptable. Clearly, some of them do not lead to the optimal solution. As a consequence, it is extremely important to eliminate partitions that are irrelevant for the entropy calculation (3).
The main result of this section shows that to find the entropy, it is sufficient to consider only partitions constructed from the sets of $Σ Q$: the σ-algebra generated by the error-control family $Q$. In the aforementioned example, there are only three such partitions: $P 1 = { ( − ∞ , 0 ] , ( 0 , + ∞ ) }$, $P 2 = { ( − ∞ , 1 ] , ( 1 , + ∞ ) }$ and $P 3 = { ( − ∞ , 0 ] , ( 0 , 1 ] , ( 1 , + ∞ ) }$.
Theorem 1. Let (X, Σ, µ) be a probability space, and let$Q$ be an error-control family. Then, we have:
$H ( μ ; Q ) : = inf { h ( μ ; P ) ∈ [ 0 , ∞ ] : P is a partition , P ≺ Q and P ⊂ Σ Q } .$
To derive this fact, for any $Q$-acceptable partition $P$, we will construct a partition $R ⊂ ∑ Q$ with the entropy not greater than $h ( μ ; P )$. To describe the process of construction of such a partition, let us first establish the notation: for a given partition (or, more generally, family of sets) $P$ of X and a set AX, we denote:
$P A = { P ∩ A : P ∈ P } .$
This notation will be used throughout this section.
Then, for a $Q$-acceptable partition $P$, a family $R$ is built by the following Algorithm 1.
Algorithm 1: Partition reduction algorithm.
Our goal is to show that the partition reduction algorithm produces a partition $R$, such that:
$h ( μ ; R ) ≤ h ( μ ; P ) .$
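In a finite discrete setting, the iteration analyzed in Lemma 1 below suggests the following sketch of the reduction (the representation and all names are ours; this is a plausible reconstruction, not the authors' listing): repeatedly pick the remaining cell of maximal measure, replace it by an error-control set covering it, and restrict the rest of the partition to the complement.

```python
def partition_reduction(partition, family, mu):
    """Discrete sketch of the partition reduction algorithm.
    Sets are Python sets of points; mu maps each point to its mass."""
    mass = lambda s: sum(mu[x] for x in s)
    remaining = set(mu)                       # current space X_k
    cells = [set(p) for p in partition]
    reduced = []
    while any(mass(c) > 0 for c in cells):
        c_max = max(cells, key=mass)          # cell of maximal measure
        Q = next(set(q) for q in family if c_max <= set(q))
        reduced.append(Q & remaining)         # new cell R_k = Q ∩ X_k
        remaining -= Q                        # X_{k+1} = X_k \ Q
        cells = [c - Q for c in cells]        # restrict partition to X_{k+1}
    return [r for r in reduced if r]
```

Each iteration merges the heaviest cell into a covering error-control set, which, by Proposition 1, cannot increase the entropy.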
We need to observe that the following property of the Shannon function holds.
Proposition 1. Given numbers p ≥ q ≥ 0 and r > 0, such that p, q, p + r, q − r ∈ [0,1], we have:
$sh ( p ) + sh ( q ) ≥ sh ( p + r ) + sh ( q − r ) .$
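Proposition 1 reflects the concavity of the Shannon function: moving mass r from the smaller cell q to the larger cell p spreads the pair apart, which cannot increase the total entropy contribution. A brute-force numeric check over a grid of admissible triples (ours):

```python
import math

def sh(x):
    return 0.0 if x == 0 else -x * math.log2(x)

# Check sh(p) + sh(q) >= sh(p + r) + sh(q - r) on a grid of valid triples.
grid = [i / 50 for i in range(51)]
for p in grid:
    for q in grid:
        for r in grid:
            if r == 0 or q > p or p + r > 1 or q - r < 0:
                continue
            assert sh(p) + sh(q) + 1e-12 >= sh(p + r) + sh(q - r)
```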
Let us focus on a single iteration of the partition reduction algorithm.
Lemma 1. Let (X, Σ, µ) be a subprobability space, i.e., (X, Σ) is a measurable space, and µ is a non-negative measure on (X, Σ), such that µ(X) ≤ 1. We consider an error-control family$Q$ and a$Q$-acceptable partition$P$ of X. Let$P max ∈ P$ be such that:
$μ ( P max ) = max { μ ( P ) : P ∈ P } .$
If$Q ∈ Q$ is chosen so that PmaxQ, then:
$h ( μ ; { Q } ∪ P X \ Q ) ≤ h ( μ ; P ) .$
Proof. Clearly, if $h ( μ ; P ) = ∞$, then the inequality (5) holds trivially. Thus, we assume that $h ( μ ; P ) < ∞$.
Let us observe that it is enough to consider only elements of $P$ with non-zero measures: the number of such sets is at most countable. Thus, let us assume that $P = { P i } i = 1 ∞$ (the case when $P$ is finite can be treated in a similar manner).
For simplicity, we put P1 := Pmax. For every $k ∈ ℕ$, we consider the sequence of sets, defined by:
$Q k : = ∪ i = 1 k ( P i ∩ Q ) .$
Clearly, for $k ∈ ℕ$, we have:
$Q 1 = P 1 , Q k ⊂ Q k + 1 ,$
$P i ∩ Q k = P i ∩ Q , for i ≤ k ,$
$P i ∩ Q k = ∅ , for i > k ,$
$lim n → ∞ μ ( Q n ) = μ ( Q ) .$
To complete the proof, it is sufficient to derive that for every $k ∈ ℕ$, we have:
$h ( μ ; { Q k } ∪ P X \ Q k ) ≥ h ( μ ; { Q k + 1 } ∪ P X \ Q k + 1 )$
and:
$h ( μ ; { Q k } ∪ P X \ Q k ) ≥ h ( μ ; { Q } ∪ P X \ Q ) .$
Let $k ∈ ℕ$ be arbitrary. Then, from (6) and (7), we get:
$h ( μ ; { Q k } ∪ P X \ Q k ) = sh ( μ ( Q k ) ) + ∑ i = 2 ∞ sh ( μ ( P i \ Q k ) ) = sh ( μ ( Q k ) ) + ∑ i = 2 k sh ( μ ( P i \ Q ) ) + ∑ i = k + 1 ∞ sh ( μ ( P i ) ) = h ( μ ; { Q k + 1 } ∪ P X \ Q k + 1 ) + sh ( μ ( Q k ) ) − sh ( μ ( Q k + 1 ) ) + sh ( μ ( P k + 1 ) ) − sh ( μ ( P k + 1 \ Q ) ) .$
Making use of Proposition 1, we obtain:
$sh ( μ ( Q k ) ) + sh ( μ ( P k + 1 ) ) ≥ sh ( μ ( Q k + 1 ) ) + sh ( μ ( P k + 1 \ Q ) ) ,$
which proves (9).
To derive (10), we first use Inequality (9). Then:
$h ( μ ; { Q k } ∪ P X \ Q k ) = sh ( μ ( Q k ) ) + ∑ i = 2 ∞ sh ( μ ( P i \ Q k ) ) ≥ lim n → ∞ [ sh ( μ ( Q n ) ) + ∑ i = 1 ∞ sh ( μ ( P i \ Q n ) ) ] .$
By (8),
$lim n → ∞ sh ( μ ( Q n ) ) = sh ( μ ( Q ) ) < ∞ .$
To calculate $lim n → ∞ ∑ i = 1 ∞ sh ( μ ( P i \ Q n ) )$, we will use Lebesgue’s dominated convergence theorem . We consider a sequence of functions:
$f n : P ∍ P → sh ( μ ( P \ Q n ) ) ∈ ℝ , for n ∈ ℕ .$
Let us observe that the Shannon function sh is increasing on $[ 0 , 2 − 1 / ln 2 ]$ and decreasing on $( 2 − 1 / ln 2 , 1 ]$. Thus, for a certain $m ∈ ℕ$,
$sh ( μ ( P i \ Q n ) ) ≤ 1 , for i ≤ m$
and:
$sh ( μ ( P i \ Q n ) ) ≤ sh ( μ ( P i ) ) , for i > m ,$
for every $n ∈ ℕ$. Since $h ( μ ; P ) < ∞$, then:
$∑ i = 1 ∞ sh ( μ ( P i \ Q n ) ) ≤ m + ∑ i = m + 1 ∞ sh ( μ ( P i ) ) < ∞ .$
Moreover,
$lim n → ∞ sh ( μ ( P \ Q n ) ) = sh ( μ ( P \ Q ) ) .$
for every $P ∈ P$.
As the sequence of functions ${ sh ( μ ( P \ Q n ) ) } n ∈ ℕ$ satisfies the assumptions of Lebesgue’s dominated convergence theorem , then we get:
$lim n → ∞ ∑ i = 1 ∞ sh ( μ ( P i \ Q n ) ) = ∑ i = 1 ∞ lim n → ∞ sh ( μ ( P i \ Q n ) ) = ∑ i = 1 ∞ sh ( μ ( P i \ Q ) ) < ∞ .$
Consequently, we have:
$h ( μ ; { Q k } ∪ P X \ Q k ) ≥ lim n → ∞ [ sh ( μ ( Q n ) ) + ∑ i = 1 ∞ sh ( μ ( P i \ Q n ) ) ] = sh ( μ ( Q ) ) + ∑ i = 1 ∞ sh ( μ ( P i \ Q ) ) = h ( μ ; { Q } ∪ P X \ Q ) ,$
which completes the proof. □
Making use of the above lemma, we summarize the analysis of the partition reduction algorithm in the following theorem.
Theorem 2. We assume that (X, Σ, µ) is a subprobability space. Let$Q$ be an error-control family on X, and let$P$ be a$Q$-acceptable partition of X. A family$R$ constructed by the partition reduction algorithm is a partition of X and satisfies:
$h ( μ ; R ) ≤ h ( μ ; P ) .$
Proof. Directly from the partition reduction algorithm, we get that $R$ is a countable family of pairwise disjoint sets. The fact that:
$μ ( X \ ∪ R ∈ R R ) = 0 ,$
follows from Lebesgue’s dominated convergence theorem  applied to a sequence of functions $f n : P → ℝ$ defined by:
$f n ( P ) : = μ ( P \ ∪ i = 1 n R i ) , for P ∈ P .$
We prove the inequality (11). If $h ( μ ; P ) = ∞$, then the inequality (11) is straightforward. Thus, let us discuss the case when $h ( μ ; P ) < ∞$.
We denote $P = { P i } i = 1 ∞$, since at most a countable number of elements of the partition can have a positive measure (the case when $P$ is finite can be treated in a similar manner).
Directly from Lemma 1, we obtain:
$h ( μ ; P X k ) ≥ h ( μ ; P X k + 1 ∪ { R k } ) , for k ∈ ℕ .$
Consequently, for every $k ∈ ℕ$, we get:
$h ( μ ; ∪ i = 1 k { R i } ∪ P X k ) ≥ h ( μ ; ∪ i = 1 k + 1 { R i } ∪ P X k + 1 ) .$
Our goal is to show that:
$h ( μ ; ∪ i = 1 k { R i } ∪ P X k ) ≥ h ( μ ; R ) ,$
for every $k ∈ ℕ$.
Making use of (12), we have:
$h ( μ ; ∪ i = 1 k { R i } ∪ P X k ) = ∑ i = 1 k sh ( μ ( R i ) ) + ∑ i = 1 ∞ sh ( μ ( P i \ ∪ j = 1 k R j ) ) ≥ lim n → ∞ [ ∑ i = 1 n sh ( μ ( R i ) ) + ∑ i = 1 ∞ sh ( μ ( P i \ ∪ j = 1 n R j ) ) ] ,$
for every $k ∈ ℕ$.
We will calculate $lim n → ∞ ∑ i = 1 ∞ sh ( μ ( P i \ ∪ j = 1 n R j ) )$ using Lebesgue’s dominated convergence theorem  for a sequence of functions ${ f n } n = 1 ∞$, defined by:
$f n : P ∍ P → sh ( μ ( P \ ∪ j = 1 n R j ) ) ∈ ℝ , for n ∈ ℕ .$
Similarly to the proof of Lemma 1, we may assume that there exists $m ∈ ℕ$, such that:
$sh ( μ ( P i \ ∪ j = 1 n R j ) ) < 1 , for i ≤ m$
and:
$sh ( μ ( P i \ ∪ j = 1 n R j ) ) < sh ( μ ( P i ) ) , for i > m ,$
for every $n ∈ ℕ$. Moreover,
$lim n → ∞ sh ( μ ( P \ ∪ j = 1 n R j ) ) = sh ( μ ( P \ ∪ j = 1 ∞ R j ) ) = 0 ,$
for every $P ∈ P$, since $R$ is a partition of X.
Making use of Lebesgue’s dominated convergence theorem , we get:
$lim n → ∞ ∑ i = 1 ∞ sh ( μ ( P i \ ∪ j = 1 n R j ) ) = ∑ i = 1 ∞ sh ( μ ( P i \ ∪ j = 1 ∞ R j ) ) = 0 .$
Consequently, for every $k ∈ ℕ$, we have:
$h ( μ ; ∪ i = 1 k { R i } ∪ P X k ) ≥ lim n → ∞ [ ∑ i = 1 n sh ( μ ( R i ) ) + ∑ i = 1 ∞ sh ( μ ( P i \ ∪ j = 1 n R j ) ) ] = ∑ i = 1 ∞ sh ( μ ( R i ) ) = h ( μ ; R ) ,$
which completes the proof. □
As a consequence of Theorem 2, we directly get that Theorem 1 holds. In the case of finite error-control families, we get that there exists an acceptable partition with minimal entropy.
Corollary 1. Let (X, Σ, µ) be a probability space, and let$Q$ be a finite error-control family. Then, there exists a$Q$-acceptable partition$P ⊂ Σ Q$, such that:
$H ( μ ; Q ) = h ( μ ; P ) .$

## 3. Entropy Approximation

In the previous section, we simplified the problem of entropy calculation by reducing the number of partitions that have to be considered to find the entropy of an error-control family. Since the number of acceptable partitions grows exponentially with the cardinality of the error-control family, it might be impossible to test all of them. In this section, we present an algorithm that allows us to approximate the entropy within an additive term.
The presented formulation of lossy source coding is closely related to the minimum entropy set cover problem (MESC) [17,18], which concerns a similar optimization problem. There exists an algorithm approximating the solution of MESC within an additive term of log2 e. First, we present a description of MESC and its relationship with the introduced coding problem. Next, we use these facts to apply a similar technique to approximate the entropy of an error-control family.

#### 3.1. Relationship with the Minimum Entropy Set Cover Problem

In order to define MESC, let X = {x1, …, xn} be a finite dataset, where every observation x ∈ X appears with probability px. A random source produces a signal from the probability distribution ${ p x 1 , … , p x n }$ and passes it through a noisy channel. Each observation has a type, but due to the noise, we only know that it is one of a given set of types defined by a finite cover $Q = { Q 1 , … , Q k }$ of the data space X. We map an observation to a type by defining an assignment $f Q : X → Q$ compatible with $Q$, i.e., $x ∈ f Q ( x )$ for all x ∈ X.
Let us denote by qi the probability that the random point is assigned to Qi:
$q i : = ∑ x ∈ f − 1 ( Q i ) p x , for i = 1 , … , k .$
The goal is to find such an assignment that minimizes the entropy of the distribution of the types, i.e.,
$h ( f Q ) : = ∑ i = 1 k sh ( q i ) .$
Such an optimal assignment is denoted by $f opt Q$.
MESC is an NP-hard problem (Theorem 1 in ). To find an assignment that efficiently approximates the minimal entropy value, a simple greedy algorithm can be used (which we call the greedy MESC algorithm). It relies on the iterative execution of the following steps:
• choose the most probable type $Q max i ∈ Q ,$
• if $x ∈ Q max i$, then assign x to $Q max i$, i.e., put $f Q ( x ) : = Q max i$,
• remove from X (and from all $Q ∈ Q$) the elements of $Q max i$,
The steps are repeated as long as there exists $Q ∈ Q$ with a positive probability.
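In a finite setting, the three steps above can be sketched as follows (the representation and all names are ours):

```python
def greedy_mesc(p, cover):
    """Sketch of the greedy MESC algorithm.
    p maps each point to its probability; cover is a list of sets (types)."""
    remaining = set(p)
    assignment = {}
    weight = lambda Q: sum(p[x] for x in set(Q) & remaining)
    while any(weight(Q) > 0 for Q in cover):
        Q_max = max(cover, key=weight)        # most probable remaining type
        for x in set(Q_max) & remaining:      # assign its uncovered points
            assignment[x] = frozenset(Q_max)
        remaining -= set(Q_max)               # remove the covered elements
    return assignment
```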
Cardinal et al. proved a sharp bound for the entropy of assignment constructed with the use of the greedy MESC algorithm:
Greedy MESC approximation (see Theorem 1 therein): If $f g Q$ is an assignment produced by the greedy algorithm, then:
$h ( f g Q ) ≤ h ( f opt Q ) + log 2 e .$
To be able to obtain a similar approximation of the entropy of an error-control family, the relationship between MESC and our formulation of entropy has to be established. For this purpose, our lossy source coding problem will be considered on a discrete probability space (X, Σ, µ), where X is a finite set, Σ is the σ-algebra generated by all singletons of X and $μ : = ∑ x ∈ X p x δ x$ is an atomic measure on (X, Σ). A cover $Q = { Q 1 , … , Q k }$ plays the role of an error-control family.
We start our analysis by showing that given an assignment $f Q$ compatible with $Q$, one can construct a $Q$-acceptable partition with equal entropy:
Lemma 2. Let$f Q$ be an assignment compatible with$Q$. Then, the family$P = { P i } i = 1 k$, where$P i : = ( f Q ) − 1 ( Q i )$, is a$Q$-acceptable partition, and:
$h ( f Q ) = h ( μ ; P ) .$
Proof. Directly from the definition of compatible assignment, we get that $P$ is a $Q$-acceptable partition of X.
Moreover, let us observe that:
$h ( f Q ) = ∑ i = 1 k sh ( q i ) = ∑ i = 1 k sh ( ∑ x ∈ f − 1 ( Q i ) p x ) = ∑ i = 1 k sh ( ∑ x ∈ P i p x ) = ∑ i = 1 k sh ( μ ( P i ) ) = h ( μ ; P ) .$
The following example illustrates that the natural inverse construction is not possible, i.e., for some $Q$-acceptable partitions, there does not exist any compatible assignment with identical entropy.
Example 2. Let (X, Σ, µ) be a probability space with an error-control family$Q$, where:
$X = { 0 , 1 } , ∑ = { ∅ , { 0 } , { 1 } , { 0 , 1 } } , μ ( { 0 } ) = μ ( { 1 } ) = 1 2 , Q = { { 0 , 1 } } .$
Then:
$P = { { 0 } , { 1 } }$
is a$Q$-acceptable partition with entropy equal to one. However, there is no compatible assignment with entropy one: the only assignment$f Q : X → Q$ compatible with$Q$ is defined by:
$f Q ( 0 ) = f Q ( 1 ) = { 0 , 1 } ,$
whose entropy equals zero.
The following result demonstrates that given a partition, one can find an assignment without greater entropy:
Lemma 3. Let$P ≺ Q$ be a partition. Then, there exists an assignment$f Q$ compatible with$Q$, such that:
$h ( f Q ) ≤ h ( μ ; P ) .$
Proof. Since $P$ is a partition, the function:
$g : X ∍ x → P x ∈ P , for x ∈ P x ,$
is well defined. Moreover, as $P$ is $Q$-acceptable, we can find a mapping $h : P → Q$, such that:
$h ( P ) = Q , if P ⊂ Q .$
Finally, we define the assignment $f Q : X → Q$ by:
$f Q = h ∘ g .$
Clearly, $f Q$ is an assignment compatible with $Q$. Let us calculate the entropy of $f Q$. We have:
$h ( f Q ) = ∑ Q ∈ Q sh ( ∑ x ∈ ( f Q ) − 1 ( Q ) p x ) = ∑ Q ∈ Q sh ( ∑ x ∈ g − 1 ( h − 1 ( Q ) ) p x ) = ∑ Q ∈ Q sh ( ∑ P ∈ h − 1 ( Q ) ∑ x ∈ g − 1 ( P ) p x ) .$
Making use of the subadditivity of the Shannon function, we get:
$∑ Q ∈ Q sh ( ∑ P ∈ h − 1 ( Q ) ∑ x ∈ g − 1 ( P ) p x ) ≤ ∑ Q ∈ Q ∑ P ∈ h − 1 ( Q ) sh ( ∑ x ∈ g − 1 ( P ) p x ) = ∑ P ∈ P sh ( ∑ x ∈ g − 1 ( P ) p x ) = ∑ P ∈ P sh ( ∑ x ∈ P p x ) = h ( μ ; P ) .$
As a consequence, we get that the entropy of the optimal assignment equals the entropy of an error-control family.
Theorem 3. We have:
$h ( f opt Q ) = H ( μ ; Q ) .$
Proof. If $f opt Q$ is an optimal assignment, then making use of Lemma 2, we construct a $Q$-acceptable partition $P 1$ that satisfies:
$h ( f opt Q ) = h ( μ ; P 1 ) .$
On the other hand, since $Q$ is a finite error-control family, then from Theorem 1, we get:
$H ( μ ; Q ) = h ( μ ; P 2 ) ,$
for a specific $Q$-acceptable partition $P 2$. Using Lemma 3, we find an assignment $f Q$ compatible with $Q$, such that:
$h ( μ ; P 2 ) ≥ h ( f Q ) ,$
which completes the proof.

#### 3.2. Greedy Approximation

In this section, we show that the analogue of the greedy MESC algorithm can be applied for the case of our formulation of lossy source coding. Furthermore, similar bounds can be established.
Let us start with an extended version of the approximation algorithm, which we call the greedy entropy algorithm (Algorithm 2). In contrast to the greedy MESC algorithm, our procedure works directly with partitions; hence, it is more general. We assume that $Q$ is a finite error-control family.
Algorithm 2: Greedy entropy algorithm.
To see that the greedy entropy algorithm is not well defined for infinite error-control families, let us consider the following example:
Example 3. Let us consider the open segment (0,1) with the σ-algebra generated by all Borel subsets of (0,1), the Lebesgue measure λ and an error-control family defined by:
$Q = { [ a , b ] : 0 < a < b < 1 } .$
There does not exist a set of maximal measure in the family$Q$; hence, the greedy entropy algorithm cannot be applied directly in such a case.
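For a finite error-control family, the greedy entropy algorithm can be sketched as follows, together with a brute-force computation of $H ( μ ; Q )$ via Theorem 3 that lets one check the log2 e bound on small instances (all names are ours; this is an illustrative reconstruction, not the authors' listing):

```python
import itertools
import math

def sh(x):
    return 0.0 if x == 0 else -x * math.log2(x)

def greedy_entropy_partition(p, family):
    """Greedy entropy algorithm (finite case): repeatedly take the
    error-control set of maximal remaining measure as the next cell."""
    remaining = set(p)
    cells = []
    mass = lambda Q: sum(p[x] for x in set(Q) & remaining)
    while any(mass(Q) > 0 for Q in family):
        Q_max = max(family, key=mass)
        cells.append(set(Q_max) & remaining)
        remaining -= set(Q_max)
    return cells

def entropy_of_family(p, family):
    """Brute-force H(mu; Q) via Theorem 3: minimize over all assignments."""
    options = [[frozenset(Q) for Q in family if x in Q] for x in p]
    best = math.inf
    for combo in itertools.product(*options):
        masses = {}
        for x, Q in zip(p, combo):
            masses[Q] = masses.get(Q, 0.0) + p[x]
        best = min(best, sum(sh(m) for m in masses.values()))
    return best
```

On any instance, the entropy of the greedy partition stays within log2 e ≈ 1.44 bits of the brute-force optimum.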
Let us observe that both greedy algorithms produce the same entropies. For this purpose, we denote by $Greedy Q f$ the set of all assignments produced by the greedy MESC algorithm, while by $Greedy Q$, we denote the set of all partitions returned by the greedy entropy algorithm:
Proposition 2. We have:
• For every$f g Q ∈ Greedy Q f$, there exists$P g ∈ Greedy Q$, such that:
$h ( f g Q ) = h ( μ ; P g ) .$
• For every$P g ∈ Greedy Q$, there exists$f g Q ∈ Greedy Q f$, such that:
$h ( μ ; P g ) = h ( f g Q ) .$
The main result of this section shows that the greedy entropy algorithm produces a partition whose entropy exceeds the entropy of the error-control family by at most log2 e.
Theorem 4. Let (X, Σ, µ) be a probability space, and let$Q$ be a finite error-control family. Then:
$h ( μ ; P ) ≤ H ( μ ; Q ) + log 2 e , for P ∈ Greedy Q .$
The proof of Theorem 4 involves two facts:
• To calculate the entropy, it is sufficient to consider only partitions constructed from the elements of σ-algebra generated by $Q$ (see Corollary 1).
• The calculation of the entropy of an error-control family is closely related to MESC optimization problem (see Theorem 3 and Proposition 2).
To be able to apply these facts, we need an additional lemma:
Lemma 4. Let (X, Σ,µ) be an arbitrary (not necessarily discrete) probability space, and let$Q$ be a finite error-control family. Then, there exists a probability space$( X ˜ , Σ ˜ , μ ˜ )$ with an error-control family$Q ˜$, such that$X ˜$, $∑ ˜$ and$Q ˜$ are finite, $μ ˜$ is an atomic measure,
$H ( μ ; Q ) = H ( μ ˜ ; Q ˜ )$
and for every$P g ∈ Greedy Q$, there exists$P ˜ g ∈ Greedy Q ˜$ satisfying:
$h ( μ ; P g ) = h ( μ ˜ ; P ˜ g ) .$
Proof. We restrict our consideration to partitions $P ⊂ Σ Q$, since, by Corollary 1, the entropy $H ( μ ; Q )$ is attained on some partition generated by $Σ Q$. Let us denote the set of generators of $Σ Q$:
$Gen ( Σ Q ) : = { G ∈ Σ Q : G ≠ ∅ , and for all Q ′ ∈ Q : G ⊂ Q ′ or G ∩ Q ′ = ∅ } .$
Next, for every set $G ∈ Gen ( Σ Q )$, we fix exactly one point xGG.
Then, we obtain a probability space $( X ˜ , Σ ˜ , μ ˜ )$ and an error-control family $Q ˜$ by:
$X ˜ = { x G } G ∈ Gen ( ∑ Q ) Q ˜ : = { ∪ G ∈ Gen ( ∑ Q ) , G ⊂ Q { x G } } Q ∈ Q ∑ ˜ = ∑ Q ˜ μ ˜ = ∑ G ∈ Gen ( ∑ Q ) μ ( G ) δ x G .$
It is easy to see that every $Q ˜$-acceptable $μ ˜$-partition $P ˜ ⊂ ∑ Q ˜$ corresponds naturally to a specific $Q$-acceptable partition $P ⊂ ∑ Q$ and conversely. The measures of corresponding sets are equal.
Thus:
$H ( μ ; Q ) = H ( μ ˜ ; Q ˜ ) .$
Moreover, for every $P g ∈ Greedy Q$, there exists $P ˜ g ∈ Greedy Q ˜$, which satisfies:
$h ( μ ; P g ) = h ( μ ˜ ; P ˜ g ) .$
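In the finite discrete case, the generator atoms of $Σ Q$ used in this construction can be computed by grouping points according to their membership pattern over $Q$; a sketch (names are ours):

```python
def generator_atoms(points, family):
    """Atoms of the sigma-algebra generated by a finite family of sets:
    group points with equal membership signatures over the family
    (a sketch of Gen(Sigma_Q) from the proof of Lemma 4)."""
    atoms = {}
    for x in points:
        signature = tuple(x in Q for Q in family)
        atoms.setdefault(signature, set()).add(x)
    return list(atoms.values())
```

Lemma 4 then fixes one representative point per atom and transfers the measure of the atom onto it.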
Finally, the proof of our main result is as follows:
Proof. (of Theorem 4) Making use of Lemma 4, we find a probability space $( X ˜ , Σ ˜ , μ ˜ )$ with the error-control family $Q ˜$, such that $X ˜$, $∑ ˜$, $Q ˜$ are finite, $μ ˜$ is an atomic measure and:
$H ( μ ; Q ) = H ( μ ˜ ; Q ˜ ) .$
By Theorem 3, we get that if $f opt Q ˜$ is an optimal assignment compatible with $Q ˜$, then:
$h ( f opt Q ˜ ) = H ( μ ˜ , Q ˜ ) .$
Moreover, making use of Proposition 2, for every $P ˜ g ∈ Greedy Q ˜$, we find $f g Q ˜ ∈ Greedy Q ˜ f$, such that:
$h ( f g Q ˜ ) = h ( μ ˜ , P ˜ g ) .$
Thus, by the greedy MESC approximation, we have:
$h ( μ ˜ , P ˜ g ) ≤ H ( μ ˜ ; Q ˜ ) + log 2 e .$
Consequently, using Lemma 4, we get:
$h ( μ , P g ) ≤ H ( μ ; Q ) + log 2 e$
The above approximation cannot be improved. To see this, we use the example from Section 2 of the work of Cardinal et al., adapted to our situation. Moreover, it is also NP-hard to approximate the entropy within an additive term lower than log2 e (see Theorem 2 of the same work).

## 4. Conclusion and Future Work

The paper focused on a non-standard type of lossy source coding. In contrast to rate distortion theory, a cover of the source alphabet, which defines the maximal distortion permitted on every element, was introduced. The calculation of the entropy in such a formulation of lossy coding is equivalent to solving a minimum entropy optimization problem, where one looks for a minimal-entropy coding partition compatible with a fixed distortion. Our results show how to simplify this optimization problem and how to approximate the entropy value. The proposed algorithm is fast, easy to implement and produces a partition with a proven accuracy guarantee: the entropy of the returned partition exceeds the true entropy value by at most log2 e.
In the future, we plan to consider a more general family of entropy functions, including Rényi and Tsallis entropies, which are of great importance in the theory of coding and related problems [6,20,21]. Moreover, there also arises a natural question concerning the compression of n-tuple random variables. More precisely, it is worth investigating how the coding efficiency increases when the larger blocks of source elements are compressed jointly.

## Acknowledgments

This research was partially funded by the National Centre of Science (Poland) Grant Nos. 2014/13/N/ST6/01832 and 2014/13/B/ST6/01792. The authors are very grateful to the reviewers for many useful remarks and corrections, as well as for inspiring suggestions concerning further work.

## Author Contributions

Marek Śmieja established the connections with Minimum Entropy Set Cover Problem, proved most of the theorems, performed experiments and wrote most of the manuscript. Jacek Tabor proposed the research problem, designed Partition Reduction Algorithm, corrected proofs and the manuscript. Both authors have read and approved the final manuscript.

## Conflicts of Interest

The authors declare no conflict of interest.

## References

1. Berger, T. Lossy source coding. IEEE Trans. Inf. Theory 1998, 44, 2693–2723.
2. Gray, R.M.; Neuhoff, D.L. Quantization. IEEE Trans. Inf. Theory 1998, 44, 2325–2383.
3. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
4. Bottou, L.; Haffner, P.; Howard, P.G.; Simard, P.; Bengio, Y.; LeCun, Y. High quality document image compression with “DjVu”. J. Electron. Imaging 1998, 7, 410–425.
5. Kieffer, J.C. A survey of the theory of source coding with a fidelity criterion. IEEE Trans. Inf. Theory 1993, 39, 1473–1490.
6. Śmieja, M. Weighted approach to general entropy function. IMA J. Math. Control Inf. 2014, 32.
7. Śmieja, M.; Tabor, J. Entropy of the mixture of sources and entropy dimension. IEEE Trans. Inf. Theory 2012, 58, 2719–2728.
8. Śmieja, M.; Tabor, J. Rényi entropy dimension of the mixture of measures. In Proceedings of the 2014 Science and Information Conference, London, UK, 27–29 August 2014; pp. 685–689.
9. Berger, T. Rate-Distortion Theory; Wiley: Hoboken, NJ, USA, 1971.
10. Ortega, A.; Ramchandran, K. Rate-distortion methods for image and video compression. IEEE Signal Process. Mag. 1998, 15, 23–50.
11. Gray, R.M. Vector quantization. IEEE ASSP Mag. 1984, 1, 4–29.
12. Nasrabadi, N.M.; King, R.A. Image coding using vector quantization: A review. IEEE Trans. Commun. 1988, 36, 957–971.
13. Posner, E.C.; Rodemich, E.R. Epsilon entropy and data compression. Ann. Math. Stat. 1971, 42, 2079–2125.
14. Posner, E.C.; Rodemich, E.R.; Rumsey, J.H. Epsilon entropy of stochastic processes. Ann. Math. Stat. 1967, 38, 1000–1020.
15. Rényi, A. On the dimension and entropy of probability distributions. Acta Math. Hungar. 1959, 10, 193–215.
16. Jayant, N.; Johnston, J.; Safranek, R. Signal compression based on models of human perception. Proc. IEEE 1993, 81, 1385–1422.
17. Cardinal, J.; Fiorini, S.; Joret, G. Tight results on minimum entropy set cover. Algorithmica 2008, 51, 49–60.
18. Halperin, E.; Karp, R.M. The minimum-entropy set cover problem. Theor. Comput. Sci. 2005, 348, 240–250.
19. Kingman, J.F.C.; Taylor, S.J. Introduction to Measures and Probability; Cambridge University Press: Cambridge, UK, 1966.
20. Bercher, J.F. Source coding with escort distributions and Rényi entropy bounds. Phys. Lett. A 2009, 373, 3235–3238.
21. Czarnecki, W.M.; Tabor, J. Multithreshold entropy linear classifier: Theory and applications. Expert Syst. Appl. 2015, 42, 5591–5606.
Figure 1. Input image for compression (a) and partitions produced by the greedy entropy algorithm for two cases of error-control families: the first one (b) consists of cubes with a given side length, while the second (c) contains balls with a given radius. For the visualization, only two color components were used: green and blue in (b) and (c).
Figure 2. Convergence behavior (entropy (a) and cardinality (b)) of the partitions produced by the greedy entropy algorithm for two cases of error control families: the first one consists of cubes with a given side length, while the second contains balls with a given radius. In every case, the diameters of elements included in the error-control family were increased, which caused the decrease of the entropy and the cardinality of resultant partitions.
Table 1. Entropies calculated for partitions $P_\delta^1$ (a), $P_\delta^2$ (b) returned by the greedy entropy algorithm for two cases of error-control families. The first one consists of cubes with a given side length, while the second contains balls with a given radius. In each case, the results are compared with the entropy of the acceptable partition consisting of maximally-sized pairwise disjoint cubes $P_\delta$, $P_{\lfloor \frac{2}{3}\sqrt{3}\delta \rfloor}$, respectively.
(a)

| δ | $h(\mu; P_\delta)$ | $h(\mu; P_\delta^1)$ |
|---|---|---|
| 3 | 12.73 | 12.59 |
| 5 | 10.81 | 10.83 |
| 9 | 8.62 | 8.62 |
| 15 | 6.95 | 6.72 |
(b)

| δ | $h(\mu; P_{\lfloor \frac{2}{3}\sqrt{3}\delta \rfloor})$ | $h(\mu; P_\delta^2)$ |
|---|---|---|
| 5 | 14.11 | 12.25 |
| 9 | 10.81 | 9.87 |
| 17 | 8.62 | 7.23 |
| 25 | 7.15 | 5.82 |