Universality of Logarithmic Loss in Fixed-Length Lossy Compression

We establish the universality of logarithmic loss over a finite alphabet as a distortion criterion in fixed-length lossy compression. For any fixed-length lossy-compression problem under an arbitrary distortion criterion, we show that there is an equivalent lossy-compression problem under logarithmic loss. The equivalence is strong in the sense that finding good schemes for the corresponding lossy-compression problem under logarithmic loss is essentially equivalent to finding good schemes for the original problem. This equivalence also provides an algebraic structure on the reconstruction alphabet, which allows us to use known techniques from the clustering literature. Furthermore, our result naturally suggests a new clustering algorithm for the categorical data-clustering problem.


Introduction
Logarithmic loss is a unique distortion measure in the sense that it allows a "soft" estimation (or reconstruction) of the source. Although logarithmic loss plays a crucial role in learning theory, not much work has been published regarding lossy compression until recently. A few exceptions are a line of work on multiterminal source coding [1][2][3], the single-shot approach to lossy source coding under logarithmic loss [4], and several universal properties of logarithmic loss in information theory [5][6][7]. In [4], Shkel and Verdú focused on the lossy-compression problem when the distortion measure is given by logarithmic loss. On the other hand, Jiao et al. justified logarithmic loss by showing it is the only loss function that satisfies a natural data-processing requirement [5]. Painsky and Wornell provided a universal property of logarithmic loss in the context of classification. In [7], No focused on the universal property of logarithmic loss in the successive refinement problem. We would also like to point out that the information bottleneck method [8][9][10][11] is related to lossy compression under logarithmic loss. Indeed, it is equivalent to the noisy lossy-compression problem under logarithmic loss [12].
In this paper, we present a new universal property of logarithmic loss in fixed-length lossy-compression problems. Consider an arbitrary fixed-length lossy-compression problem, where the source and reconstruction alphabets $\mathcal{X}$ and $\hat{\mathcal{X}}$ are discrete. Suppose an arbitrary distortion measure $d : \mathcal{X} \times \hat{\mathcal{X}} \to [0, \infty)$ is given. Then, we show that there exists a corresponding fixed-length lossy-compression problem where the source alphabet remains the same, but the reconstruction alphabet is a set of distributions on $\mathcal{X}$, and the distortion measure is logarithmic loss. This implies that there is a correspondence between any fixed-length lossy-compression problem under an arbitrary distortion measure and that under logarithmic loss. The correspondence is in the following strong sense:
• optimal schemes for the two problems are the same; and
• a good scheme for one problem is also a good scheme for the other.
We make the notions of "optimal" and "good" precise in later sections. This finding essentially implies that it is enough to consider the lossy-compression problem under logarithmic loss.
The above correspondence provides new insights into the fixed-length lossy-compression problem. In general, the reconstruction alphabet in the lossy-compression problem does not have any well-defined operations. However, in the corresponding lossy compression under logarithmic loss, reconstruction symbols are probability distributions that have their own algebraic structure. Thus, under the corresponding setting, we can apply various techniques, such as the information geometric approach, clustering with Bregman divergence, and relaxation of the optimization problem. Furthermore, the equivalence relation suggests a new algorithm in the categorical data-clustering problem, where data are not in the continuous space.
The remainder of the paper is organized as follows. In Section 2, we revisit some of the known results of logarithmic loss and fixed-length lossy compression. Section 3 is dedicated to the equivalence between lossy compression under arbitrary distortion measures and that under logarithmic loss. In Section 4, we present the geometric interpretation of our result. We provide the log-convex relaxation of lossy compression and connection to the clustering problems in Section 5. Finally, we conclude in Section 6.
Notation: Uppercase $X$ denotes a random variable, and calligraphic $\mathcal{X}$ denotes its alphabet. Lowercase $x$ denotes a specific realization of the random variable $X$, i.e., $x \in \mathcal{X}$. Similarly, $X^n$ denotes an $n$-dimensional random vector $(X_1, X_2, \ldots, X_n)$, while lowercase $x^n$ denotes a realization of $X^n$. For a function $f : \mathcal{X} \to \mathcal{Y}$, $|f|$ denotes the size of the image of $f$, i.e., $|\{f(x) : x \in \mathcal{X}\}|$. When it is clear from the context, we write $\sum_x$ instead of $\sum_{x \in \mathcal{X}}$. We use the natural logarithm and nats instead of bits.

Logarithmic Loss
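Suppose $\mathcal{X}$ is a finite set of discrete symbols, and $\mathcal{M}(\mathcal{X})$ is the set of probability measures on $\mathcal{X}$. For $x \in \mathcal{X}$ and $q \in \mathcal{M}(\mathcal{X})$, logarithmic loss $\ell : \mathcal{X} \times \mathcal{M}(\mathcal{X}) \to [0, \infty]$ is defined by
$$\ell(x, q) = \log \frac{1}{q(x)}.$$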
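As a minimal sketch of the definition above, the following Python function evaluates logarithmic loss in nats. The function name `log_loss` and the representation of $q$ as a list indexed by source symbol (starting from 0) are assumptions made for illustration only.

```python
import math

def log_loss(x, q):
    """Logarithmic loss log(1 / q[x]) in nats.

    q is a probability vector over the source alphabet (hypothetical
    0-based indexing); the loss is infinite when q[x] == 0.
    """
    if q[x] == 0.0:
        return math.inf
    return -math.log(q[x])

# A "soft" reconstruction that puts mass 1/2 on symbol 0.
q = [0.5, 0.25, 0.25]
print(log_loss(0, q))  # log 2 in nats
print(log_loss(1, q))  # log 4 in nats
```

A reconstruction that assigns zero mass to the realized symbol incurs infinite loss, which is why the range of the loss includes $\infty$.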

Fixed-Length Lossy Compression
In this section, we briefly introduce the basic settings of the fixed-length lossy-compression problem [13]. In a fixed-length lossy-compression setting, we have a source $X$ with finite alphabet $\mathcal{X} = \{1, \ldots, r\}$ and source distribution $p_X$. An encoder $f : \mathcal{X} \to \{1, \ldots, M\}$ maps the source symbol to one of $M$ messages. On the other side, a decoder $g : \{1, \ldots, M\} \to \hat{\mathcal{X}}$ maps the message to the actual reconstruction $\hat{X}$, where the reconstruction alphabet $\hat{\mathcal{X}} = \{1, \ldots, s\}$ is also finite. Let $d : \mathcal{X} \times \hat{\mathcal{X}} \to [0, \infty)$ be a distortion measure between source and reconstruction.
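First, we define a code whose expected distortion does not exceed a given distortion level.
Definition 1 (Average distortion criterion). An $(M, D)$ code is a pair of an encoder $f$ with $|f| \le M$ and a decoder $g$ such that
$$\mathbb{E}[d(X, g(f(X)))] \le D.$$
The minimum achievable average distortion with $M$ codewords is
$$D^*(M) = \min\{D : \exists\, (M, D) \text{ code}\}.$$
One may consider a stronger criterion that restricts the probability of exceeding a given distortion level.
Definition 2 (Excess distortion criterion). An $(M, D, \epsilon)$ code is a pair of an encoder $f$ with $|f| \le M$ and a decoder $g$ such that
$$\mathbb{P}[d(X, g(f(X))) > D] \le \epsilon.$$
The minimum number of codewords required to achieve excess distortion probability $\epsilon$ and distortion $D$ is
$$M^*(D, \epsilon) = \min\{M : \exists\, (M, D, \epsilon) \text{ code}\}.$$
Given target distortion $D$ and $p_X$, the information rate-distortion function is defined by
$$R(D) = \inf_{p_{\hat{X}|X} :\, \mathbb{E}[d(X, \hat{X})] \le D} I(X; \hat{X}). \tag{1}$$
We make the following benign assumptions:
• There exists a unique rate-distortion-function-achieving conditional distribution $p_{\hat{X}^*|X}$.
• We assume that $p_{\hat{X}^*}(\hat{x}) > 0$ for all $\hat{x} \in \hat{\mathcal{X}}$, since we can always discard a reconstruction symbol with zero probability.
• If $d(x, \hat{x}_1) = d(x, \hat{x}_2)$ for all $x \in \mathcal{X}$, then $\hat{x}_1 = \hat{x}_2$. (If $d(x, \hat{x}_1) = d(x, \hat{x}_2)$ for all $x$, then there is no difference between $\hat{x}_1$ and $\hat{x}_2$ in terms of loss. Thus, we can always discard $\hat{x}_2$ without loss of generality.)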
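The two criteria above can be evaluated directly for any given encoder and decoder pair. The following sketch assumes 0-based symbol indices, an encoder `f` and decoder `g` stored as lists, and a distortion function `d(x, x_hat)`; all names and the toy source are hypothetical.

```python
def average_distortion(p_x, f, g, d):
    """Expected distortion E[d(X, g(f(X)))] of the code (f, g)."""
    return sum(p * d(x, g[f[x]]) for x, p in enumerate(p_x))

def excess_probability(p_x, f, g, d, D):
    """Excess distortion probability P[d(X, g(f(X))) > D] of the code (f, g)."""
    return sum(p for x, p in enumerate(p_x) if d(x, g[f[x]]) > D)

# Toy example: three source symbols, M = 2 messages, Hamming distortion.
p_x = [0.5, 0.3, 0.2]
f = [0, 1, 1]          # encoder: symbol -> message
g = [0, 1]             # decoder: message -> reconstruction symbol
hamming = lambda x, x_hat: 0.0 if x == x_hat else 1.0

print(average_distortion(p_x, f, g, hamming))        # symbol 2 is reconstructed as 1
print(excess_probability(p_x, f, g, hamming, 0.5))   # mass of symbols with distortion > 0.5
```

Under these assumptions, the pair (f, g) is an $(M, D)$ code whenever the first quantity is at most $D$, and an $(M, D, \epsilon)$ code whenever the second is at most $\epsilon$.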

D-Tilted Information
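Define the information density of a joint distribution $p_{X,\hat{X}}$ by
$$\imath_{X;\hat{X}}(x; \hat{x}) = \log \frac{p_{X,\hat{X}}(x, \hat{x})}{p_X(x)\, p_{\hat{X}}(\hat{x})}.$$
Then, we are ready to define the D-tilted information, which plays a key role in fixed-length lossy compression.

Definition 3 ([13] (Definition 6)).
The D-tilted information in $x \in \mathcal{X}$ is defined as
$$\jmath_X(x, D) = \log \frac{1}{\mathbb{E}\left[\exp\left(\lambda^* D - \lambda^* d(x, \hat{X}^*)\right)\right]},$$
where the expectation is with respect to the marginal distribution of $\hat{X}^*$ and $\lambda^* = -R'(D)$.
Note that $\hat{X}^*$ is a random variable whose marginal distribution is induced by $p_X \times p_{\hat{X}^*|X}$, and $R'(D)$ is the first derivative of the rate-distortion function $R(D)$.
For every $\hat{x}$ with $p_{\hat{X}^*}(\hat{x}) > 0$, the D-tilted information satisfies
$$\jmath_X(x, D) = \imath_{X;\hat{X}^*}(x; \hat{x}) + \lambda^* d(x, \hat{x}) - \lambda^* D; \tag{2}$$
therefore, we have
$$R(D) = \mathbb{E}[\jmath_X(X, D)].$$
Let $p_{X|\hat{X}^*}$ be the conditional probability induced by $p_{\hat{X}^*|X}$. Then, (2) can equivalently be expressed as
$$\log \frac{1}{p_{X|\hat{X}^*}(x|\hat{x})} = \log \frac{1}{p_X(x)} - \jmath_X(x, D) + \lambda^* d(x, \hat{x}) - \lambda^* D. \tag{3}$$
The following lemma shows that the distributions $p_{X|\hat{X}^*}(\cdot|\hat{x})$ are all distinct.
Lemma 1. For all $\hat{x}_1 \neq \hat{x}_2$, there exists $x \in \mathcal{X}$ such that $p_{X|\hat{X}^*}(x|\hat{x}_1) \neq p_{X|\hat{X}^*}(x|\hat{x}_2)$.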
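To make the D-tilted information concrete, the sketch below evaluates $\log \left(1 / \mathbb{E}[\exp(\lambda^* D - \lambda^* d(x, \hat{X}^*))]\right)$ for a single Bernoulli($\alpha$) symbol under Hamming distortion. It assumes the standard closed forms for this case with $D < \alpha \le 1/2$: output marginal $p_{\hat{X}^*}(1) = (\alpha - D)/(1 - 2D)$, slope $\lambda^* = \log((1-D)/D)$, and $R(D) = h(\alpha) - h(D)$ in nats. The function and variable names are hypothetical.

```python
import math

def d_tilted_information(x, D, p_xhat, d, lam):
    """log(1 / E[exp(lam*D - lam*d(x, Xhat))]) with Xhat distributed as p_xhat."""
    expectation = sum(p * math.exp(lam * D - lam * d(x, xh))
                      for xh, p in enumerate(p_xhat))
    return -math.log(expectation)

def h(t):
    """Binary entropy in nats."""
    return -t * math.log(t) - (1 - t) * math.log(1 - t)

alpha, D = 0.3, 0.1                      # assumed: D < alpha <= 1/2
hamming = lambda x, xh: 0.0 if x == xh else 1.0
lam = math.log((1 - D) / D)              # lambda* = -R'(D) for the binary source
p_x = [1 - alpha, alpha]
p_xhat = [(1 - alpha - D) / (1 - 2 * D), (alpha - D) / (1 - 2 * D)]

j = [d_tilted_information(x, D, p_xhat, hamming, lam) for x in (0, 1)]
rate = sum(p * jx for p, jx in zip(p_x, j))
print(j, rate, h(alpha) - h(D))
```

In this binary case the D-tilted information reduces to $\log(1/p_X(x)) - h(D)$, and its expectation equals $R(D)$, in agreement with the identity $R(D) = \mathbb{E}[\jmath_X(X, D)]$.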

Main Results
Consider fixed-length lossy compression under an arbitrary distortion $d(\cdot, \cdot)$, as described in Section 2.2. We have a source $X$ with finite alphabet $\mathcal{X} = \{1, \ldots, r\}$, source distribution $p_X$, and finite reconstruction alphabet $\hat{\mathcal{X}} = \{1, \ldots, s\}$. For a fixed number of messages $M$, let $f^*$ and $g^*$ be the encoder and decoder that achieve the optimal average distortion $D^*(M)$, i.e.,
$$\mathbb{E}[d(X, g^*(f^*(X)))] = D^*(M).$$
Let $p_{\hat{X}^*|X}$ denote the rate-distortion-function-achieving conditional distribution at distortion $D = D^*(M)$, i.e., the achiever of the infimum in
$$R(D^*(M)) = \inf_{p_{\hat{X}|X} :\, \mathbb{E}[d(X, \hat{X})] \le D^*(M)} I(X; \hat{X}). \tag{4}$$
Note that $R(D^*(M))$ may be strictly smaller than $\log M$ in general, since $R(\cdot)$ is an information rate-distortion function that does not characterize the best achievable performance in the "one-shot" setting in which $D^*(M)$ is defined.
Now, we define the corresponding fixed-length lossy-compression problem under logarithmic loss. In the corresponding problem, the source alphabet $\mathcal{X} = \{1, \ldots, r\}$, source distribution $p_X$, and number of messages $M$ remain the same. However, we have a different reconstruction alphabet $\mathcal{Y} = \{p_{X|\hat{X}^*}(\cdot|\hat{x}) : \hat{x} \in \hat{\mathcal{X}}\} \subset \mathcal{M}(\mathcal{X})$, where $p_{X|\hat{X}^*}$ pertains to the achiever of the infimum in Equation (4) associated with the original loss function. Recall that $\mathcal{M}(\mathcal{X})$ is the set of all probability measures on $\mathcal{X}$. Let the distortion of the corresponding problem be the logarithmic loss.
We now further connect the encoding and decoding schemes between the two problems. Suppose $f : \mathcal{X} \to \{1, \ldots, M\}$ and $g : \{1, \ldots, M\} \to \hat{\mathcal{X}}$ are an encoder and decoder pair in the original problem. Given $f$ and $g$, we define the encoder and decoder in the corresponding problem as follows. We let the encoder be the same, $f_\ell = f$, and define the decoder
$$g_\ell(m) = p_{X|\hat{X}^*}(\cdot \,|\, g(m)).$$
Then, $f_\ell$ and $g_\ell$ are a valid encoder and decoder pair for the corresponding fixed-length lossy-compression problem under logarithmic loss. Conversely, given $f_\ell$ and $g_\ell$, we can find the corresponding $f$ and $g$ because Lemma 1 guarantees that the distributions $p_{X|\hat{X}^*}(\cdot|\hat{x})$ are distinct.
The following result shows the relation between the corresponding schemes.

Theorem 2.
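For any encoder-decoder pair $(f_\ell, g_\ell)$ for the corresponding fixed-length lossy-compression problem under logarithmic loss, we have
$$\mathbb{E}[\ell(X, g_\ell(f_\ell(X)))] - H(X|\hat{X}^*) = \lambda^* \left( \mathbb{E}[d(X, g(f(X)))] - D^*(M) \right) \ge 0,$$
where $(f, g)$ is the corresponding encoder-decoder pair for the original lossy-compression problem. Note that $H(X|\hat{X}^*)$ and the expectations are with respect to distribution $p_X \times p_{\hat{X}^*|X}$. Moreover, equality holds if and only if $f_\ell = f^*$ and $g_\ell(m) = p_{X|\hat{X}^*}(\cdot|g^*(m))$.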

Proof.
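We have
$$\mathbb{E}[\ell(X, g_\ell(f_\ell(X)))] = \mathbb{E}\left[\log \frac{1}{p_{X|\hat{X}^*}(X \,|\, g(f(X)))}\right].$$
Then, Equation (3) implies that
$$\mathbb{E}[\ell(X, g_\ell(f_\ell(X)))] = \mathbb{E}\left[\log \frac{1}{p_X(X)} - \jmath_X(X, D^*(M)) + \lambda^* d(X, g(f(X))) - \lambda^* D^*(M)\right]$$
$$= H(X) - R(D^*(M)) + \lambda^* \left(\mathbb{E}[d(X, g(f(X)))] - D^*(M)\right) \tag{5}$$
$$= H(X|\hat{X}^*) + \lambda^* \left(\mathbb{E}[d(X, g(f(X)))] - D^*(M)\right) \ge H(X|\hat{X}^*),$$
where Equation (5) follows from $\mathbb{E}[\jmath_X(X, D)] = R(D)$, and the last inequality holds because $\lambda^* \ge 0$ and $D^*(M)$ is the minimum achievable average distortion with $M$ codewords. Equality holds if and only if $\mathbb{E}[d(X, g(f(X)))] = D^*(M)$, which can be achieved by the optimal scheme for the original lossy-compression problem. In other words, the equality holds if $f_\ell = f^*$ and $g_\ell(m) = p_{X|\hat{X}^*}(\cdot|g^*(m))$.
In the above theorem, distortion $D^*(M)$, which is the minimal achievable distortion in the one-shot setting, plays a critical role. We also used $p_{X|\hat{X}^*}$ in the corresponding problem, which is induced by the rate-distortion-achieving conditional distribution. This might be confusing, since the rate-distortion function provides the optimal rate in the asymptotic setting. However, recall that the minimal mutual information between $X$ and $\hat{X}$ in Equation (1) is the "information" rate-distortion function. The "information" rate-distortion function is equal to the optimum rate in the asymptotic case if the source is independent and identically distributed.
On the other hand, we view the "information" rate-distortion function differently. We consider the one-shot setting where source $X$ and reconstruction $\hat{X}$ are single variables. Given the number of messages $M$, the minimal achievable distortion is $D^*(M)$. Under this setting, we focus on the minimal mutual information between $X$ and $\hat{X}$ when the distortion between $X$ and $\hat{X}$ is restricted by $D^*(M)$. Our theorem implies that the minimum-achieving distribution $p_{X|\hat{X}^*}$ provides the corresponding one-shot lossy-compression problem under logarithmic loss.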

Remark 1.
In the corresponding fixed-length lossy-compression problem under logarithmic loss, the minimal achievable average distortion given the number of codewords $M$ is $H(X|\hat{X}^*)$, where the conditional entropy is with respect to distribution $p_X \times p_{\hat{X}^*|X}$.

Remark 2.
From now on, we refer to the original lossy-compression problem under the given distortion measure $d(\cdot, \cdot)$ with reconstruction alphabet $\hat{\mathcal{X}}$ as the "original problem", and to the corresponding lossy-compression problem under logarithmic loss with reconstruction alphabet $\mathcal{Y}$ as the "corresponding problem".

Example: Memoryless Bernoulli Source with Hamming Distortion Measure
In this section, we consider the memoryless Bernoulli source under the Hamming distortion measure as an example of the above equivalence. Let $X = U^n$ be a memoryless Bernoulli source with probability $\alpha$, where $\mathcal{X} = \mathcal{U}^n = \{0, 1\}^n$, and the reconstruction $\hat{X} = V^n$ is also an $n$-dimensional binary vector, where $\hat{\mathcal{X}} = \mathcal{V}^n = \{0, 1\}^n$. Note that block length $n$ is fixed, so the problem is in the one-shot setting. Distortion measure $d$ is the separable Hamming distortion, i.e.,
$$d(u^n, v^n) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{u_i \neq v_i\}.$$
Let $M$ be the number of messages. Then, we are interested in optimal encoding and decoding schemes that achieve distortion $D = D^*(M)$.
In this scenario, the information rate-distortion function is not hard to compute [15]:
$$R(D) = \inf_{p_{V^n|U^n} :\, \mathbb{E}[d(U^n, V^n)] \le D} I(U^n; V^n) = n \left( h_2(\alpha) - h_2(D) \right), \tag{7}$$
for $0 \le D \le \min\{\alpha, 1 - \alpha\}$, where $h_2(\cdot)$ is the binary entropy function. Let $p_{U|V^*}$ be the per-letter conditional distribution induced by the achiever of the infimum in Equation (7). We have an analytic formula for the rate-distortion-achieving distribution $p_{X|\hat{X}^*}$. For $x = u^n$ and $\hat{x} = v^n$, we have
$$p_{X|\hat{X}^*}(u^n|v^n) = \prod_{i=1}^{n} p_{U|V^*}(u_i|v_i).$$
Then, the corresponding problem is the rate-distortion problem under logarithmic loss where the set of reconstruction symbols is $\mathcal{Y} = \{p_{X|\hat{X}^*}(\cdot|\hat{x}) : \hat{x} \in \mathcal{V}^n\}$.
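The linear relation of Theorem 2 can be checked numerically in this example. The sketch below assumes the normalized Hamming distortion, the per-letter posterior $p_{U|V^*}(u|v) = 1 - D$ if $u = v$ and $D$ otherwise (the standard form for $D \le \min\{\alpha, 1-\alpha\}$), $H(X|\hat{X}^*) = n\,h_2(D)$, and $\lambda^* = n \log((1-D)/D)$. The majority-vote code is a hypothetical, generally suboptimal scheme, and the value of `D` merely stands in for $D^*(M)$, which is not computed here.

```python
import itertools
import math

def hamming(u, v):
    """Normalized Hamming distortion between two binary tuples."""
    return sum(a != b for a, b in zip(u, v)) / len(u)

def posterior(u, v, D):
    """Product-form p(u^n | v^n), assuming per-letter values 1 - D (match) and D (mismatch)."""
    return math.prod((1 - D) if a == b else D for a, b in zip(u, v))

def h2(t):
    """Binary entropy in nats."""
    return -t * math.log(t) - (1 - t) * math.log(1 - t)

n, alpha, D = 3, 0.3, 0.1   # D is a stand-in for D*(M); it is not optimized here
source = list(itertools.product([0, 1], repeat=n))
p_x = {u: math.prod(alpha if b else 1 - alpha for b in u) for u in source}

# Hypothetical code with M = 2 messages: majority vote, constant reconstructions.
f = {u: int(sum(u) > n / 2) for u in source}
g = {0: (0,) * n, 1: (1,) * n}

exp_d = sum(p_x[u] * hamming(u, g[f[u]]) for u in source)
exp_log_loss = sum(p_x[u] * -math.log(posterior(u, g[f[u]], D)) for u in source)

lam = n * math.log((1 - D) / D)            # -R'(D) for R(D) = n (h2(alpha) - h2(D))
gap_log_loss = exp_log_loss - n * h2(D)    # E[log loss] - H(X | Xhat*)
gap_original = lam * (exp_d - D)           # lambda* (E[d] - D)
print(gap_log_loss, gap_original)
```

Under these assumptions the two printed gaps agree up to rounding, which illustrates that the excess logarithmic loss of a scheme is proportional to its excess Hamming distortion.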

Remark 3. We can rewrite Equation
The above equation explicitly shows the correspondence between logarithmic loss and the original distortion measure.

One-to-One Correspondence
Theorem 2 implies that, for any fixed-length lossy-compression problem, we can find an equivalent problem under logarithmic loss where optimal encoding schemes are the same. Thus, without loss of generality, we can restrict our attention to the problem under logarithmic loss with reconstruction alphabet Y = {q (1) , . . . , q (s) } for some q (1) , . . . , q (s) ∈ M(X ).

Scheme Suboptimality
Suppose $f$ and $g$ are a suboptimal encoder and decoder for the original fixed-length lossy-compression problem. Then, the theorem implies
$$\mathbb{E}[\ell(X, g_\ell(f_\ell(X)))] - H(X|\hat{X}^*) = \lambda^* \left( \mathbb{E}[d(X, g(f(X)))] - D^*(M) \right). \tag{9}$$
The left-hand side of Equation (9) is the cost of suboptimality for the corresponding lossy-compression problem. On the other hand, the right-hand side is proportional to the cost of suboptimality for the original problem. In Section 3.3.1, we discussed that the optimal schemes of the two problems coincide. Equation (9) shows a stronger equivalence in which the costs of suboptimality are linearly related. This implies that a good code for one problem is also good for the other.

Operations on the Reconstruction Alphabet
In general, reconstruction alphabet $\hat{\mathcal{X}}$ does not have an algebraic structure. However, in the corresponding rate-distortion problem, the reconstruction alphabet is a set of probability measures, on which we have natural operations such as convex combinations of elements or projection onto a convex hull. We discuss such operations in more detail in Section 5.

Exact Performance of Optimal Scheme
In the previous section, we showed that there is a corresponding lossy-compression problem under logarithmic loss that shares the same optimal coding scheme. In this section, we investigate the exact performance of the optimal scheme for the fixed-length lossy-compression problem under logarithmic loss, when the reconstruction alphabet is the set of all probability measures on X , i.e., M(X ). (Recently, Shkel and Verdu [4] independently proposed similar results. The result was also presented in our conference version of the paper [16].) We also characterize minimal average distortion D (M) when we have a fixed number of messages M. Note that this is a single-letter version of ( [2], [Lemma 1]). Although the optimal scheme associated with M(X ) may differ from the optimal scheme with restricted reconstruction alphabets Y, it provides an insight, as we show in Section 4. In this section, we restrict our attention to deterministic schemes. However, it is not hard to show that the same result holds even if we allow a stochastic encoder and decoder.
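Let an encoder and a decoder be $f : \mathcal{X} \to \{1, \ldots, M\}$ and $g : \{1, \ldots, M\} \to \mathcal{M}(\mathcal{X})$, and write $q^{(m)} = g(m)$. Then, we have
$$\mathbb{E}[\ell(X, g(f(X)))] = \sum_{m} \mathbb{P}[f(X) = m] \sum_{x} p_{X|f(X)}(x|m) \log \frac{1}{q^{(m)}(x)}$$
$$= H(X|f(X)) + \sum_{m} \mathbb{P}[f(X) = m]\, D\!\left(p_{X|f(X)}(\cdot|m) \,\|\, q^{(m)}\right) \ge H(X|f(X)) = H(X) - H(f(X)).$$
This implies that the optimal encoder is the function $f$ that maximizes $H(f(X))$, and the optimal decoder is given by $g(m) = p_{X|f(X)}(\cdot|m)$. Since $H(f(X)) \le \log M$, the above result provides a trivial lower bound:
$$D^*(M) \ge H(X) - \log M.$$
The optimal scheme under an excess distortion criterion is given in Appendix A.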
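As an illustration of the statement that the optimal decoder is the conditional distribution $p_{X|f(X)}(\cdot|m)$, the following sketch builds that decoder for a given deterministic encoder and checks that the resulting expected logarithmic loss equals $H(X) - H(f(X))$. The function names, the 0-based indexing, and the toy source are assumptions for illustration.

```python
import math

def entropy(dist):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def optimal_decoder(p_x, f, M):
    """Return g with g[m] = p_{X|f(X)}(. | m); unused messages map to None."""
    g = []
    for m in range(M):
        mass = sum(p for x, p in enumerate(p_x) if f[x] == m)
        if mass == 0:
            g.append(None)
            continue
        g.append([p / mass if f[x] == m else 0.0 for x, p in enumerate(p_x)])
    return g

def expected_log_loss(p_x, f, g):
    """E[log(1 / g(f(X))(X))] in nats."""
    return sum(p * -math.log(g[f[x]][x]) for x, p in enumerate(p_x) if p > 0)

p_x = [0.4, 0.3, 0.2, 0.1]
f = [0, 0, 1, 1]                      # hypothetical encoder with M = 2 messages
g = optimal_decoder(p_x, f, M=2)
loss = expected_log_loss(p_x, f, g)
print(loss, entropy(p_x) - entropy([0.7, 0.3]), entropy(p_x) - math.log(2))
```

The first two printed values coincide, and both are at least the trivial lower bound $H(X) - \log M$ printed last.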

Geometrical Interpretation
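In this section, we present a geometrical interpretation of the decoder in lossy-compression problems. Consider the original lossy-compression problem with discrete reconstruction alphabet $\hat{\mathcal{X}}$ and distortion measure $d(\cdot, \cdot)$. Suppose an encoding function $f$ is given that may or may not be optimal, and let $A_m = \{x \in \mathcal{X} : f(x) = m\}$, which is the set of source symbols that are mapped to message $m$. Then, the optimal reconstruction $g(m)$ is given by
$$g(m) = \arg\min_{\hat{x} \in \hat{\mathcal{X}}} \sum_{x \in A_m} p_{X|f(X)}(x|m)\, d(x, \hat{x}). \tag{10}$$
Now, consider the corresponding lossy-compression problem under logarithmic loss. Recall that the set of reconstruction alphabets is given by $\mathcal{Y} = \{p_{X|\hat{X}^*}(\cdot|\hat{x}) : \hat{x} \in \hat{\mathcal{X}}\}$, where $\mathcal{Y} \subset \mathcal{M}(\mathcal{X})$. As we have seen in Section 3.4, the optimal reconstruction is $g_E(m) = p_{X|f(X)}(\cdot|m)$ if we have the extended reconstruction alphabet $\mathcal{M}(\mathcal{X})$. Thus, it is natural to find the probability distribution in $\mathcal{Y}$ that is nearest to $g_E(m)$. We propose the Kullback-Leibler divergence to measure the distance between probability distributions. In other words, we want to find $\tilde{g}_\ell(m) \in \mathcal{Y}$ such that
$$\tilde{g}_\ell(m) = \arg\min_{q \in \mathcal{Y}} D\!\left(g_E(m) \,\|\, q\right). \tag{11}$$
This can be viewed as projecting the optimal solution from the extended set $\mathcal{M}(\mathcal{X})$ to the original feasible set $\mathcal{Y}$. Since $q \in \mathcal{Y}$, there exists $\hat{x} \in \hat{\mathcal{X}}$ such that $q(\cdot) = p_{X|\hat{X}^*}(\cdot|\hat{x})$. Then, the above Kullback-Leibler divergence is given by
$$D\!\left(p_{X|f(X)}(\cdot|m) \,\|\, p_{X|\hat{X}^*}(\cdot|\hat{x})\right) = \sum_{x} p_{X|f(X)}(x|m) \log \frac{p_{X|f(X)}(x|m)}{p_{X|\hat{X}^*}(x|\hat{x})}$$
$$= \sum_{x} p_{X|f(X)}(x|m) \left[ \log \frac{p_{X|f(X)}(x|m)}{p_X(x)} - \jmath_X(x, D^*(M)) + \lambda^* d(x, \hat{x}) - \lambda^* D^*(M) \right],$$
where the last equality is from Equation (2). Note that $d(x, \hat{x})$ is the only term that is a function of $\hat{x}$, and $\lambda^*$ is positive. Thus, if $q(\cdot) = p_{X|\hat{X}^*}(\cdot|\hat{x})$ achieves the minimum in Equation (11), then $\hat{x}$ minimizes the following:
$$\sum_{x} p_{X|f(X)}(x|m)\, d(x, \hat{x}). \tag{12}$$
Since Equation (12) coincides with Equation (10), we have $\tilde{g}_\ell(m) = p_{X|\hat{X}^*}(\cdot|g(m))$.
Remark 4. In Section 3, we directly defined $g_\ell(m) = p_{X|\hat{X}^*}(\cdot|g(m))$. However, we obtained $\tilde{g}_\ell(m)$ via the following two-step procedure:
• extend the reconstruction set from $\mathcal{Y}$ to $\mathcal{M}(\mathcal{X})$, then characterize the optimal decoding function $g_E(m) \in \mathcal{M}(\mathcal{X})$; and
• find the measure $\tilde{g}_\ell(m) \in \mathcal{Y}$ that is closest to $g_E(m)$.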
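The projection argument can be illustrated on a single binary symbol with Hamming distortion. The sketch assumes the posteriors $p_{X|\hat{X}^*}(x|\hat{x}) = 1 - D$ if $x = \hat{x}$ and $D$ otherwise, and a hypothetical cluster posterior $p_{X|f(X)}(\cdot|m) = (0.8, 0.2)$; it checks that the KL projection onto $\mathcal{Y}$ selects the same reconstruction symbol as the minimizer of the expected distortion.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

D = 0.1
# Assumed reconstruction set Y for a binary source under Hamming distortion.
Y = {0: [1 - D, D], 1: [D, 1 - D]}

cluster_posterior = [0.8, 0.2]        # hypothetical p_{X|f(X)}(. | m)

# Step 2 of the two-step procedure: project g_E(m) onto Y in KL divergence.
projected = min(Y, key=lambda xh: kl(cluster_posterior, Y[xh]))

# Direct rule: pick the symbol minimizing the expected Hamming distortion.
direct = min(Y, key=lambda xh: sum(p * (x != xh)
                                   for x, p in enumerate(cluster_posterior)))
print(projected, direct)
```

Both rules select reconstruction symbol 0 here, consistent with $\tilde{g}_\ell(m) = p_{X|\hat{X}^*}(\cdot|g(m))$.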

Log-Convex Relaxation
In the previous section, we obtained the optimal reconstruction symbol from the extended reconstruction alphabet, and projected it to the feasible set. In this section, instead of direct projection to Y, we propose another slight extension of Y, namely, log-convex hull. As we show in the following sections, the log-convex hull has interesting properties.

rI-Projection
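Before defining the log-convex hull, we need to define the log-convex combination of probability distributions. Let $p$ and $q$ be probability distributions in $\mathcal{M}(\mathcal{X})$. For $0 < t < 1$, the log-convex combination of $p$ and $q$ is given by
$$p^t q^{1-t}(x) = \frac{p(x)^t\, q(x)^{1-t}}{\sum_{x'} p(x')^t\, q(x')^{1-t}}.$$
It is clear that $\log p^t q^{1-t}(x)$ is a convex combination of $\log p(x)$ and $\log q(x)$, up to an additive normalizing constant. We can now define the log-convex hull $\mathrm{logconv}(\mathcal{Y})$, which is the set of log-convex combinations of probability measures in $\mathcal{Y}$. More precisely,
$$\mathrm{logconv}(\mathcal{Y}) = \left\{ q^{(r)} : q^{(r)}(x) = c(r) \prod_{\hat{x} \in \hat{\mathcal{X}}} p_{X|\hat{X}^*}(x|\hat{x})^{r(\hat{x})},\; r \in \mathcal{M}(\hat{\mathcal{X}}) \right\},$$
where $r$ is a weight vector (i.e., $r \in \mathcal{M}(\hat{\mathcal{X}})$), and $c(r)$ is the normalizing constant that makes $q^{(r)}$ sum to one. By definition, $\mathrm{logconv}(\mathcal{Y})$ is log-convex since it contains all log-convex combinations of probability distributions in $\mathcal{Y}$.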
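A log-convex combination is a normalized weighted geometric mean, which is straightforward to compute. The following sketch uses a hypothetical function name and toy distributions.

```python
def log_convex_combination(p, q, t):
    """Normalized pointwise geometric mean p^t q^(1-t) of two distributions."""
    weights = [pi ** t * qi ** (1 - t) for pi, qi in zip(p, q)]
    total = sum(weights)
    return [w / total for w in weights]

p = [0.5, 0.5]
q = [0.9, 0.1]
mix = log_convex_combination(p, q, 0.5)
print(mix)   # the odds become sqrt(1 * 9) = 3 to 1
```

For $t = 1/2$, the odds of the combination are the geometric mean of the odds of $p$ and $q$, so the result is close to $(0.75, 0.25)$ rather than the arithmetic mixture $(0.7, 0.3)$.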
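Instead of projecting $p_{X|f(X)}(\cdot|m)$ to $\mathcal{Y}$, we consider the projection to $\mathrm{logconv}(\mathcal{Y})$. Since $\mathrm{logconv}(\mathcal{Y})$ is log-convex, ([17], Theorem 1) implies that there exists a unique probability distribution $q_m^* \in \mathrm{logconv}(\mathcal{Y})$ that achieves the following minimum:
$$\min_{q \in \mathrm{logconv}(\mathcal{Y})} D\!\left(p_{X|f(X)}(\cdot|m) \,\|\, q\right).$$
Projection $q_m^*$ is called an rI-projection of $p_{X|f(X)}(\cdot|m)$ to $\mathrm{logconv}(\mathcal{Y})$. Let $r_m^*$ be the corresponding weights, i.e., $q_m^* = q^{(r_m^*)}$.
On the other hand, the log-convex combination of probability measures $q^{(r)}$ is called the geometric mean of probability measures [18]. The author of [18] also provided the geometric compensation identity, which is given by
$$\sum_{\hat{x}} r(\hat{x})\, D\!\left(p_{X|f(X)}(\cdot|m) \,\|\, p_{X|\hat{X}^*}(\cdot|\hat{x})\right) = D\!\left(p_{X|f(X)}(\cdot|m) \,\|\, q^{(r)}\right) + \sum_{\hat{x}} r(\hat{x})\, D\!\left(q^{(r)} \,\|\, p_{X|\hat{X}^*}(\cdot|\hat{x})\right). \tag{16}$$
The above result holds for any $r \in \mathcal{M}(\hat{\mathcal{X}})$; therefore, Equation (16) also holds when $q^{(r)} = q_m^*$. Together with Inequality (15), we get the following result. For all $\hat{x} \in \hat{\mathcal{X}}$, if $r_m^*(\hat{x}) \neq 0$, then
$$D\!\left(p_{X|f(X)}(\cdot|m) \,\|\, p_{X|\hat{X}^*}(\cdot|\hat{x})\right) = D\!\left(p_{X|f(X)}(\cdot|m) \,\|\, q_m^*\right) + D\!\left(q_m^* \,\|\, p_{X|\hat{X}^*}(\cdot|\hat{x})\right).$$
Then, projection $h$ is on the plane generated by $\{v_i\}_{i \in E}$. Thus, the two vectors $w - h$ and $h - v_i$ are orthogonal for all $i \in E$. Then, the Pythagorean theorem implies that, for all $i$, we have either $a_i = 0$ or
$$\|w - v_i\|^2 = \|w - h\|^2 + \|h - v_i\|^2.$$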

Optimization
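As we saw in the previous section, we want to find $q \in \mathrm{logconv}(\mathcal{Y})$ that minimizes $D\!\left(p_{X|f(X)}(\cdot|m) \,\|\, q\right)$. Note that
$$D\!\left(p_{X|f(X)}(\cdot|m) \,\|\, q^{(r)}\right) = \sum_{x} p_{X|f(X)}(x|m) \log p_{X|f(X)}(x|m) + \sum_{x} p_{X|f(X)}(x|m) \log \frac{1}{q^{(r)}(x)}.$$
Since the first term is not a function of $q^{(r)}$, it is enough to consider the second term. By the definition of $q^{(r)}$, we have
$$\sum_{x} p_{X|f(X)}(x|m) \log \frac{1}{q^{(r)}(x)} = \sum_{\hat{x}} r(\hat{x}) \sum_{x} p_{X|f(X)}(x|m) \log \frac{1}{p_{X|\hat{X}^*}(x|\hat{x})} + \log \sum_{x} \prod_{\hat{x}} p_{X|\hat{X}^*}(x|\hat{x})^{r(\hat{x})}.$$
Thus, minimizing $D\!\left(p_{X|f(X)}(\cdot|m) \,\|\, q\right)$ is equivalent to solving the following optimization problem:
$$\min_{r} \;\; \sum_{\hat{x}} r(\hat{x}) \sum_{x} p_{X|f(X)}(x|m) \log \frac{1}{p_{X|\hat{X}^*}(x|\hat{x})} + \log \sum_{x} \prod_{\hat{x}} p_{X|\hat{X}^*}(x|\hat{x})^{r(\hat{x})} \quad \text{subject to} \quad r(\hat{x}) \ge 0, \;\; \sum_{\hat{x}} r(\hat{x}) = 1.$$
Since the objective function is a convex function of $r$ (the first term is linear in $r$, and the second term is a log-sum-exp of linear functions of $r$), the above problem is a convex optimization problem that can be efficiently solved.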
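For intuition, the convex problem can be solved by hand when the reconstruction set has only two elements, since the weight vector then reduces to a single scalar $t \in [0, 1]$. The sketch below uses ternary search, which is valid because the objective is convex in $t$; it is an illustrative stand-in for a general convex solver, not the method of the paper, and all names and numbers are hypothetical.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def log_convex_combination(q0, q1, t):
    """Normalized geometric mean q0^t q1^(1-t)."""
    weights = [a ** t * b ** (1 - t) for a, b in zip(q0, q1)]
    total = sum(weights)
    return [w / total for w in weights]

def ri_project_two_points(p, q0, q1, iters=200):
    """Minimize D(p || q0^t q1^(1-t)) over t in [0, 1] by ternary search."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if kl(p, log_convex_combination(q0, q1, m1)) < kl(p, log_convex_combination(q0, q1, m2)):
            hi = m2
        else:
            lo = m1
    t = (lo + hi) / 2
    return t, log_convex_combination(q0, q1, t)

q0, q1 = [0.9, 0.1], [0.1, 0.9]       # assumed two-element reconstruction set
p = [0.8, 0.2]                         # hypothetical cluster posterior
t, q_star = ri_project_two_points(p, q0, q1)
print(t, q_star, kl(p, q_star))
```

In this toy case the log-convex hull contains $p$ itself, so the projection recovers $p$ and the divergence at the optimum is numerically zero, strictly better than projecting to either element of the original set.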

Relaxation in Clustering
In the corresponding lossy-compression problem under logarithmic loss, reconstruction symbols are probability measures that have a natural algebraic structure, as we discussed in Section 3.3.3. In this section, we present the benefits of such a property when we apply some known techniques from the clustering literature.
Lossy compression is closely related to the clustering problem [19][20][21]. Many works focused on the application of k-means clustering to a lossy-compression problem [22][23][24], which is an extension of the Lloyd-Max algorithm [25,26]. However, k-means clustering is only available when there exists a well-defined operation in $\hat{\mathcal{X}}$ (e.g., $\hat{\mathcal{X}} = \mathbb{R}^n$). This is because k-means clustering requires computing the mean of data points, which is the center of each cluster. In general lossy-compression problems, reconstruction alphabet $\hat{\mathcal{X}}$ may not have such an operation. In such cases, we may have to apply k-medoid-like clustering [27], where the center of each cluster has to be a data point. The k-medoid-like algorithm in the context of lossy compression is shown in Algorithm 1.
[Algorithm listing; only the final steps of the log-convex variant survive: for each $m$, set $q_m^* \leftarrow \arg\min_{q^{(r)} \in \mathrm{logconv}(\mathcal{Y})} D\!\left(p_{X|f(X)}(\cdot|m) \,\|\, q^{(r)}\right)$; end for; repeat until convergence.]
The main idea of the above algorithm is that the log-convex combination $q_m^*$ behaves like the center of cluster $A_m$. In the clustering literature, there are many known variations of k-means clustering [28,29]. The above result shows that we can borrow those techniques and apply them to the lossy-compression problem even without any algebraic structure on the reconstruction alphabet.
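The k-medoid-like procedure described above can be sketched as a Lloyd-style alternation: assign every source symbol to its nearest center, then replace each center by the reconstruction symbol that minimizes the probability-weighted distortion within its cluster. This is a minimal sketch under stated assumptions and not a transcription of Algorithm 1; the function names, the toy distortion, and the use of exact fractions (to make tie-breaking deterministic) are illustrative choices.

```python
from fractions import Fraction

def assign(p_x, d, centers):
    """Encoder step: map each source symbol to the index of its nearest center."""
    return [min(range(len(centers)), key=lambda m: d(x, centers[m]))
            for x in range(len(p_x))]

def k_medoid_like(p_x, d, n_recon, init_centers, max_iter=100):
    """Alternate encoder and decoder updates until the centers stop changing."""
    centers = list(init_centers)
    for _ in range(max_iter):
        f = assign(p_x, d, centers)
        new_centers = []
        for m, old in enumerate(centers):
            cluster = [x for x in range(len(p_x)) if f[x] == m]
            if not cluster:               # keep the center of an empty cluster
                new_centers.append(old)
                continue
            cost = lambda xh: sum(p_x[x] * d(x, xh) for x in cluster)
            new_centers.append(min(range(n_recon), key=cost))
        if new_centers == centers:
            break
        centers = new_centers
    return assign(p_x, d, centers), centers

# Toy source with five symbols and a hypothetical distortion table |x - x_hat|.
p_x = [Fraction(3, 10), Fraction(2, 10), Fraction(1, 10), Fraction(1, 10), Fraction(3, 10)]
d = lambda x, xh: abs(x - xh)
f, centers = k_medoid_like(p_x, d, n_recon=5, init_centers=[0, 1])
print(f, centers)
```

Only the distortion table is used, so the same sketch applies to categorical alphabets; like other Lloyd-type iterations, it may stop at a local optimum that depends on the initial centers.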
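However, at most $\lfloor e^D \rfloor$ of the $q^{(m)}(x)$ can be at least $e^{-D}$, where $\lfloor x \rfloor$ is the largest integer that is smaller than or equal to $x$. Thus, we can cover at most $M \cdot \lfloor e^D \rfloor$ of the source symbols with $M$ codewords. Suppose $p_X(1) \ge p_X(2) \ge \cdots \ge p_X(r)$; then, the optimal scheme is
$$f(x) = \min\left\{ \left\lceil \frac{x}{\lfloor e^D \rfloor} \right\rceil, M \right\}, \qquad q^{(m)}(x) = \begin{cases} 1/\lfloor e^D \rfloor & \text{if } (m-1)\lfloor e^D \rfloor < x \le m \lfloor e^D \rfloor, \\ 0 & \text{otherwise,} \end{cases}$$
where $q^{(m)} = g(m)$ and $\lceil x \rceil$ is the smallest integer that is larger than or equal to $x$. The idea is that each reconstruction symbol $q^{(m)}$ covers $\lfloor e^D \rfloor$ source symbols by assigning probability mass $1/\lfloor e^D \rfloor$ to each of them. The above optimal scheme satisfies
$$\mathbb{P}[\ell(X, g(f(X))) \le D] = F_X\!\left(M \lfloor e^D \rfloor\right),$$
where $F_X(\cdot)$ is the cumulative distribution function of $X$. This implies that the minimal error probability is
$$1 - F_X\!\left(M \lfloor e^D \rfloor\right).$$
On the other hand, if we fix target error probability $\epsilon$, the minimal number of codewords is
$$M^*(D, \epsilon) = \left\lceil \frac{F_X^{-1}(1 - \epsilon)}{\lfloor e^D \rfloor} \right\rceil,$$
where $F_X^{-1}(y) = \arg\min_{1 \le x \le r} \{x : F_X(x) \ge y\}$. Note that if we allow variable-length coding without a prefix condition, the optimal coding scheme is similar to the optimal nonasymptotic lossless coding introduced in [35].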
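The covering argument can be turned into a short computation. The sketch below assumes the source probabilities are already sorted in non-increasing order and that each codeword covers $\lfloor e^D \rfloor$ symbols; the function names are hypothetical, and a small tolerance guards the floating-point comparison in the inverse cumulative distribution function.

```python
import math

def min_error_probability(p_sorted, M, D):
    """1 - F_X(M * floor(e^D)): mass left uncovered by M codewords."""
    k = math.floor(math.exp(D))
    covered = min(len(p_sorted), M * k)
    return 1.0 - sum(p_sorted[:covered])

def min_codewords(p_sorted, D, eps, tol=1e-12):
    """ceil(F_X^{-1}(1 - eps) / floor(e^D)) for a source sorted by probability."""
    k = math.floor(math.exp(D))
    cumulative = 0.0
    needed = len(p_sorted)
    for x, p in enumerate(p_sorted, start=1):   # x is the 1-based symbol index
        cumulative += p
        if cumulative >= 1 - eps - tol:
            needed = x
            break
    return math.ceil(needed / k)

p_sorted = [0.4, 0.3, 0.2, 0.1]
D = 0.7                                   # floor(e^0.7) = 2 symbols per codeword
print(min_error_probability(p_sorted, 1, D))
print(min_codewords(p_sorted, D, 0.25))
```

With one codeword, the two most likely symbols are covered and mass 0.3 is left over; to bring the excess probability down to 0.25, three symbols must be covered, which requires two codewords.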