Abstract
We establish the universality of logarithmic loss over a finite alphabet as a distortion criterion in fixed-length lossy compression. For any fixed-length lossy-compression problem under an arbitrary distortion criterion, we show that there is an equivalent lossy-compression problem under logarithmic loss. The equivalence is strong: finding a good scheme for the corresponding problem under logarithmic loss is essentially equivalent to finding a good scheme for the original problem. The equivalence also endows the reconstruction alphabet with an algebraic structure, which allows us to apply known techniques from the clustering literature. Furthermore, our result naturally suggests a new algorithm for the categorical data-clustering problem.
1. Introduction
Logarithmic loss is a unique distortion measure in the sense that it allows a “soft” estimation (or reconstruction) of the source. Although logarithmic loss plays a crucial role in learning theory, relatively little had been published about its role in lossy compression until recently. A few exceptions are a line of work on multiterminal source coding [1,2,3], the single-shot approach to lossy source coding under logarithmic loss [4], and several universal properties of logarithmic loss in information theory [5,6,7]. In [4], Shkel and Verdú focused on the lossy-compression problem when the distortion measure is given by logarithmic loss. On the other hand, Jiao et al. justified logarithmic loss by showing that it is the only loss function that satisfies a natural data-processing requirement [5]. Painsky and Wornell provided a universal property of logarithmic loss in the context of classification [6]. In [7], No focused on the universal property of logarithmic loss in the successive refinement problem. We would also like to point out that the information bottleneck method [8,9,10,11] is related to lossy compression under logarithmic loss. Indeed, it is equivalent to the noisy lossy-compression problem under logarithmic loss [12].
In this paper, we present a new universal property of logarithmic loss in fixed-length lossy-compression problems. Consider an arbitrary fixed-length lossy-compression problem, where the source and reconstruction alphabets $\mathcal{X}$ and $\hat{\mathcal{X}}$ are discrete, and suppose an arbitrary distortion measure $d : \mathcal{X} \times \hat{\mathcal{X}} \to [0, \infty)$ is given. Then, we show that there exists a corresponding fixed-length lossy-compression problem where the source alphabet remains the same, but the reconstruction alphabet is a set of distributions on $\mathcal{X}$, and the distortion measure is logarithmic loss. This implies that there is a correspondence between any fixed-length lossy-compression problem under an arbitrary distortion measure and one under logarithmic loss. The correspondence is in the following strong sense:
- optimal schemes for the two problems are the same; and
- a good scheme for one problem is also a good scheme for the other.
We make the notions of “optimality” and “goodness” of a scheme precise in later sections. This finding essentially implies that it is enough to consider the lossy-compression problem under logarithmic loss.
The above correspondence provides new insights into the fixed-length lossy-compression problem. In general, the reconstruction alphabet in the lossy-compression problem does not have any well-defined operations. However, in the corresponding lossy compression under logarithmic loss, reconstruction symbols are probability distributions that have their own algebraic structure. Thus, under the corresponding setting, we can apply various techniques, such as the information geometric approach, clustering with Bregman divergence, and relaxation of the optimization problem. Furthermore, the equivalence relation suggests a new algorithm in the categorical data-clustering problem, where data are not in the continuous space.
The remainder of the paper is organized as follows. In Section 2, we revisit some of the known results of logarithmic loss and fixed-length lossy compression. Section 3 is dedicated to the equivalence between lossy compression under arbitrary distortion measures and that under logarithmic loss. In Section 4, we present the geometric interpretation of our result. We provide the log-convex relaxation of lossy compression and connection to the clustering problems in Section 5. Finally, we conclude in Section 6.
Notation: Uppercase $X$ denotes a random variable and $\mathcal{X}$ denotes its alphabet, while lowercase $x$ denotes a specific realization of the random variable $X$, i.e., $x \in \mathcal{X}$. Similarly, $X^n$ denotes an $n$-dimensional random vector, while lowercase $x^n$ denotes a realization of $X^n$. For a function $f$, $|f|$ denotes the size of the image of $f$, i.e., $|f| = |\{f(x) : x \in \mathcal{X}\}|$. When it is clear from the context, we drop subscripts from distributions. We use the natural logarithm and measure information in nats instead of bits.
2. Preliminaries
2.1. Logarithmic Loss
Suppose $\mathcal{X}$ is a finite set of discrete symbols, and $\mathcal{P}(\mathcal{X})$ is the set of probability measures on $\mathcal{X}$. For $x \in \mathcal{X}$ and $q \in \mathcal{P}(\mathcal{X})$, the logarithmic loss is defined by
$$\ell_{\log}(x, q) = \log\frac{1}{q(x)}.$$
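For concreteness, here is a minimal sketch of this definition (the function and variable names below are ours, not from the paper):

```python
import math

def log_loss(x, q):
    """Logarithmic loss ell_log(x, q) = log(1/q(x)) in nats.

    x : a source symbol
    q : dict mapping each source symbol to its probability
        (a "soft" reconstruction of the source)
    """
    return math.inf if q[x] == 0.0 else -math.log(q[x])

# Example on a ternary alphabet: likely symbols incur small loss.
q = {"a": 0.7, "b": 0.2, "c": 0.1}
print(log_loss("a", q))  # ~0.357 nats
print(log_loss("c", q))  # ~2.303 nats
```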
2.2. Fixed-Length Lossy Compression
In this section, we briefly introduce the basic setting of the fixed-length lossy-compression problem [13]. We have a source $X$ with finite alphabet $\mathcal{X}$ and source distribution $P_X$. An encoder $f : \mathcal{X} \to \{1, \dots, M\}$ maps the source symbol to one of $M$ messages. On the other side, a decoder $g : \{1, \dots, M\} \to \hat{\mathcal{X}}$ maps the message to an actual reconstruction $\hat{x}$, where the reconstruction alphabet $\hat{\mathcal{X}}$ is also finite. Let $d : \mathcal{X} \times \hat{\mathcal{X}} \to [0, \infty)$ be a distortion measure between source and reconstruction.
First, we define codes whose expected distortion does not exceed a given distortion level.
Definition 1
(Average distortion criterion). An $(M, D)$ code is a pair of an encoder $f$ with $|f| \le M$ and a decoder $g$, such that
$$\mathbb{E}\big[d\big(X, g(f(X))\big)\big] \le D.$$
The minimum number of codewords required to achieve average distortion not exceeding $D$ is defined by
$$M^\star(D) = \min\{M : \text{there exists an } (M, D) \text{ code}\}.$$
Similarly, the minimum achievable average distortion given the number of codewords $M$ is defined by
$$D^\star(M) = \min\{D : \text{there exists an } (M, D) \text{ code}\}.$$
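For small alphabets, $D^\star(M)$ can be computed by brute force: enumerate all encoders with at most $M$ messages and pick the best reconstruction for each encoder cell. A minimal sketch (our own illustration; exponential in $|\mathcal{X}|$, so only for toy examples):

```python
from itertools import product

def optimal_average_distortion(p_x, d, M):
    """Brute-force D*(M): the smallest E[d(X, g(f(X)))] over all encoders f
    with at most M messages and all decoders g.
    p_x: list of source probabilities, d[x][xhat]: distortion matrix."""
    n_x, n_xhat = len(p_x), len(d[0])
    best = float("inf")
    for f in product(range(M), repeat=n_x):            # encoder f: x -> message
        total = 0.0
        for m in range(M):                              # best reconstruction per cell
            cell = [x for x in range(n_x) if f[x] == m]
            if cell:
                total += min(sum(p_x[x] * d[x][xh] for x in cell)
                             for xh in range(n_xhat))
        best = min(best, total)
    return best

# Ternary source, Hamming distortion, M = 2 messages.
p_x = [0.5, 0.3, 0.2]
d = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(optimal_average_distortion(p_x, d, 2))  # 0.2: the two least likely symbols share a codeword
```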
One may consider a stronger criterion that restricts the probability of exceeding a given distortion level.
Definition 2
(Excess distortion criterion). An $(M, D, \epsilon)$ code is a pair of an encoder $f$ with $|f| \le M$ and a decoder $g$, such that
$$\Pr\big[d\big(X, g(f(X))\big) > D\big] \le \epsilon.$$
The minimum number of codewords required to achieve excess-distortion probability $\epsilon$ at distortion level $D$ is defined by
$$M^\star(D, \epsilon) = \min\{M : \text{there exists an } (M, D, \epsilon) \text{ code}\}.$$
Similarly, we can define the minimum achievable excess-distortion probability $\epsilon^\star(M, D)$ given target distortion $D$ and number of codewords $M$.
Given target distortion $D$ and source distribution $P_X$, the information rate-distortion function is defined by
$$R(D) = \inf_{P_{\hat{X}|X} :\, \mathbb{E}[d(X, \hat{X})] \le D} I(X; \hat{X}). \qquad (1)$$
We make the following benign assumptions:
- There exists a unique rate-distortion-function-achieving conditional distribution $P^\star_{\hat{X}|X}$.
- We assume that $P_{\hat{X}^\star}(\hat{x}) > 0$ for all $\hat{x} \in \hat{\mathcal{X}}$, since we can always discard a reconstruction symbol with zero probability.
- If $d(x, \hat{x}_1) = d(x, \hat{x}_2)$ for all $x \in \mathcal{X}$, then $\hat{x}_1 = \hat{x}_2$. (If $d(x, \hat{x}_1) = d(x, \hat{x}_2)$ for all $x$, then there is no difference between $\hat{x}_1$ and $\hat{x}_2$ in terms of loss. Thus, we can always discard one of them without loss of generality.)
2.3. D-Tilted Information
Define the information density of the joint distribution $P_{X\hat{X}}$ by
$$\imath_{X;\hat{X}}(x; \hat{x}) = \log\frac{P_{\hat{X}|X}(\hat{x}\mid x)}{P_{\hat{X}}(\hat{x})}.$$
Then, we are ready to define D-tilted information that plays a key role in fixed-length lossy compression.
Definition 3
([13] (Definition 6)). The $D$-tilted information in $x$ is defined as
$$\jmath_X(x, D) = \log\frac{1}{\mathbb{E}\big[\exp\big(\lambda^\star D - \lambda^\star d(x, \hat{X}^\star)\big)\big]},$$
where the expectation is with respect to the marginal distribution of $\hat{X}^\star$, and $\lambda^\star = -R'(D)$.
Note that $\hat{X}^\star$ is a random variable whose marginal distribution is induced by $P_X P^\star_{\hat{X}|X}$, and $R'(D)$ is the first derivative of the rate-distortion function $R(D)$.
Theorem 1
([14] (Lemma 1.4)). For all $\hat{x} \in \hat{\mathcal{X}}$,
$$\mathbb{E}\big[\exp\big(\jmath_X(X, D) + \lambda^\star D - \lambda^\star d(X, \hat{x})\big)\big] \le 1,$$
with equality if $P_{\hat{X}^\star}(\hat{x}) > 0$; therefore, we have
$$\jmath_X(x, D) = \imath_{X;\hat{X}^\star}(x; \hat{x}) + \lambda^\star d(x, \hat{x}) - \lambda^\star D \qquad (2)$$
for $P_{\hat{X}^\star}$-almost every $\hat{x}$.
Let $P^\star_{X|\hat{X}}$ be the conditional probability induced from $P_X P^\star_{\hat{X}|X}$. Then, (2) can equivalently be expressed as
$$\log\frac{1}{P^\star_{X|\hat{X}}(x\mid\hat{x})} = \log\frac{1}{P_X(x)} - \jmath_X(x, D) - \lambda^\star D + \lambda^\star d(x, \hat{x}). \qquad (3)$$
The following lemma shows that the distributions $P^\star_{X|\hat{X}}(\cdot\mid\hat{x})$, $\hat{x} \in \hat{\mathcal{X}}$, are all distinct.
Lemma 1
([7] (Lemma 2)). For all $\hat{x}_1 \neq \hat{x}_2$, there exists $x \in \mathcal{X}$ such that $P^\star_{X|\hat{X}}(x\mid\hat{x}_1) \neq P^\star_{X|\hat{X}}(x\mid\hat{x}_2)$.
3. One-to-One Correspondence Between General Distortion and Logarithmic Loss
3.1. Main Results
Consider fixed-length lossy compression under an arbitrary distortion measure $d$, as described in Section 2.2. We have a source $X$ with finite alphabet $\mathcal{X}$, source distribution $P_X$, and finite reconstruction alphabet $\hat{\mathcal{X}}$. For a fixed number of messages $M$, let $f^\star$ and $g^\star$ be the encoder and decoder that achieve the optimal average distortion $D^\star(M)$, i.e.,
$$D^\star(M) = \mathbb{E}\big[d\big(X, g^\star(f^\star(X))\big)\big] = \min_{f, g :\, |f| \le M}\ \mathbb{E}\big[d\big(X, g(f(X))\big)\big].$$
Let $P^\star_{\hat{X}|X}$ denote the rate-distortion-function-achieving conditional distribution at distortion level $D^\star(M)$. In other words, $P^\star_{\hat{X}|X}$ achieves the infimum in
$$R\big(D^\star(M)\big) = \inf_{P_{\hat{X}|X} :\, \mathbb{E}[d(X, \hat{X})] \le D^\star(M)} I(X; \hat{X}). \qquad (4)$$
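In practice, $P^\star_{\hat{X}|X}$ can be approximated numerically, e.g., with the Blahut–Arimoto algorithm run for a fixed Lagrange multiplier $\lambda$; sweeping $\lambda$ traces the rate-distortion curve, and one selects the value whose distortion matches $D^\star(M)$. A minimal sketch of this standard algorithm (not code from the paper; the parameter `lam` and the distortion-matrix layout are our assumptions):

```python
import numpy as np

def blahut_arimoto(p_x, d, lam, n_iter=500):
    """Approximate the conditional distribution achieving the rate-distortion
    infimum for Lagrange multiplier lam (slope of the R(D) curve is -lam).
    p_x: (n_x,) source pmf, d: (n_x, n_xhat) distortion matrix.
    Returns P_{Xhat|X} as an (n_x, n_xhat) array."""
    p_x, d = np.asarray(p_x, float), np.asarray(d, float)
    q = np.full(d.shape[1], 1.0 / d.shape[1])        # marginal on Xhat
    for _ in range(n_iter):
        w = q[None, :] * np.exp(-lam * d)            # unnormalized P_{Xhat|X}
        p_cond = w / w.sum(axis=1, keepdims=True)
        q = p_x @ p_cond                             # updated marginal on Xhat
    return p_cond

p_x = np.array([0.5, 0.3, 0.2])
d = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float)
p_cond = blahut_arimoto(p_x, d, lam=3.0)
print(p_cond)
print("distortion:", float(np.sum(p_x[:, None] * p_cond * d)))
```

The backward channel $P^\star_{X|\hat{X}}$, used below, then follows from Bayes' rule: $P^\star_{X|\hat{X}}(x\mid\hat{x}) \propto P_X(x) P^\star_{\hat{X}|X}(\hat{x}\mid x)$.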
Note that $R(D^\star(M))$ may be strictly smaller than $\log M$ in general, since $R(\cdot)$ is an information rate-distortion function that does not characterize the best achievable performance for the “one-shot” setting in which $D^\star(M)$ is defined.
Now, we define the corresponding fixed-length lossy-compression problem under logarithmic loss. In the corresponding problem, the source alphabet $\mathcal{X}$, source distribution $P_X$, and number of messages $M$ remain the same. However, we have a different reconstruction alphabet
$$\hat{\mathcal{P}} = \big\{P^\star_{X|\hat{X}}(\cdot\mid\hat{x}) : \hat{x} \in \hat{\mathcal{X}}\big\} \subset \mathcal{P}(\mathcal{X}),$$
where $P^\star_{X|\hat{X}}$ pertains to the achiever of the infimum in Equation (4) associated with the original distortion measure. Recall that $\mathcal{P}(\mathcal{X})$ is the set of all probability measures on $\mathcal{X}$. Let the distortion measure of the corresponding problem be the logarithmic loss.
We now further connect the encoding and decoding schemes of the two problems. Suppose $f$ and $g$ are an encoder and decoder pair in the original problem. When $f$ and $g$ are given in the original problem, we define the corresponding encoder and decoder in the corresponding problem as follows. We let the encoder be the same, $\hat{f} = f$, and define the decoder by
$$\hat{g}(m) = P^\star_{X|\hat{X}}(\cdot\mid g(m)) \quad \text{for } 1 \le m \le M.$$
Then, $\hat{f}$ and $\hat{g}$ are a valid encoder and decoder pair for the corresponding fixed-length lossy-compression problem under logarithmic loss. Conversely, given $\hat{f}$ and $\hat{g}$, we can recover the corresponding $f$ and $g$ because Lemma 1 guarantees that the distributions $P^\star_{X|\hat{X}}(\cdot\mid\hat{x})$ are distinct.
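Concretely, once $P^\star_{X|\hat{X}}$ is available, the corresponding scheme is obtained by relabeling the decoder output with the backward conditional distributions; a sketch with our own variable names:

```python
def corresponding_decoder(g, p_x, p_xhat_given_x):
    """Build the decoder ghat of the corresponding problem under logarithmic loss.

    g               : dict message -> reconstruction symbol xhat (original decoder)
    p_x             : dict x -> P_X(x)
    p_xhat_given_x  : dict (xhat, x) -> P*_{Xhat|X}(xhat | x)
    Returns ghat    : dict message -> (dict x -> P*_{X|Xhat}(x | g(m))).
    The encoder of the corresponding problem is unchanged: fhat = f.
    """
    ghat = {}
    for m, xhat in g.items():
        # Bayes' rule: P*_{X|Xhat}(x | xhat) is proportional to P_X(x) P*_{Xhat|X}(xhat | x).
        joint = {x: p_x[x] * p_xhat_given_x[(xhat, x)] for x in p_x}
        z = sum(joint.values())
        ghat[m] = {x: v / z for x, v in joint.items()}
    return ghat
```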
The following result shows the relation between the corresponding schemes.
Theorem 2.
For any encoder–decoder pair $(\hat{f}, \hat{g})$ for the corresponding fixed-length lossy-compression problem under logarithmic loss, we have
$$\mathbb{E}\big[\ell_{\log}\big(X, \hat{g}(\hat{f}(X))\big)\big] = H(X\mid\hat{X}^\star) + \lambda^\star\Big(\mathbb{E}\big[d\big(X, g(f(X))\big)\big] - D^\star(M)\Big) \ge H(X\mid\hat{X}^\star),$$
where $(f, g)$ is the corresponding encoder–decoder pair for the original lossy-compression problem. Note that $\lambda^\star = -R'(D^\star(M))$, the conditional entropy $H(X\mid\hat{X}^\star)$ is with respect to the joint distribution $P_X P^\star_{\hat{X}|X}$, and the expectations are with respect to the source distribution $P_X$. Moreover, equality holds if and only if $f$ and $g$ achieve the optimal average distortion, i.e., $\mathbb{E}[d(X, g(f(X)))] = D^\star(M)$.
Proof.
We have
$$\mathbb{E}\big[\ell_{\log}\big(X, \hat{g}(\hat{f}(X))\big)\big] = \mathbb{E}\left[\log\frac{1}{P^\star_{X|\hat{X}}\big(X\mid g(f(X))\big)}\right].$$
Then, Equation (3) implies that
$$\begin{aligned}
\mathbb{E}\big[\ell_{\log}\big(X, \hat{g}(\hat{f}(X))\big)\big]
&= H(X) - \mathbb{E}\big[\jmath_X\big(X, D^\star(M)\big)\big] - \lambda^\star D^\star(M) + \lambda^\star\,\mathbb{E}\big[d\big(X, g(f(X))\big)\big] \\
&= H(X) - R\big(D^\star(M)\big) + \lambda^\star\Big(\mathbb{E}\big[d\big(X, g(f(X))\big)\big] - D^\star(M)\Big) \qquad (5) \\
&\ge H(X) - R\big(D^\star(M)\big) = H(X\mid\hat{X}^\star), \qquad (6)
\end{aligned}$$
where Equation (5) is because $\mathbb{E}[\jmath_X(X, D^\star(M))] = R(D^\star(M))$ with respect to distribution $P_X$. Inequality (6) is because $D^\star(M)$ is the minimum achievable average distortion with $M$ codewords, so $\mathbb{E}[d(X, g(f(X)))] \ge D^\star(M)$. Equality holds if and only if $\mathbb{E}[d(X, g(f(X)))] = D^\star(M)$, which can be achieved by the optimal scheme for the original lossy-compression problem. In other words, the equality holds if $f = f^\star$ and $g = g^\star$.
□
In the above theorem, the distortion $D^\star(M)$, the minimal achievable distortion in the one-shot setting, plays a critical role. We also used $P^\star_{\hat{X}|X}$ in the corresponding problem, which is the rate-distortion-function-achieving conditional distribution. This might be confusing, since the rate-distortion function provides the optimal rate in the asymptotic setting. However, recall that the minimal mutual information between $X$ and $\hat{X}$ in Equation (1) is the “information” rate-distortion function. The “information” rate-distortion function is equal to the optimal rate in the asymptotic case if the source is independent and identically distributed.
On the other hand, we view the “information” rate-distortion function differently. We consider the one-shot setting where source $X$ and reconstruction $\hat{X}$ are single variables. Given the number of messages $M$, the minimal achievable distortion is given by $D^\star(M)$. Under this setting, we focus on the minimal mutual information between $X$ and $\hat{X}$ when the distortion between $X$ and $\hat{X}$ is restricted to at most $D^\star(M)$. Our theorem implies that the minimum-achieving distribution $P^\star_{\hat{X}|X}$ provides the corresponding one-shot lossy-compression problem under logarithmic loss.
Remark 1.
In the corresponding fixed-length lossy-compression problem under logarithmic loss, the minimal achievable average distortion given the number of codewords $M$ is
$$\min_{\hat{f}, \hat{g}}\ \mathbb{E}\big[\ell_{\log}\big(X, \hat{g}(\hat{f}(X))\big)\big] = H(X\mid\hat{X}^\star),$$
where the conditional entropy is with respect to the joint distribution $P_X P^\star_{\hat{X}|X}$.
Remark 2.
From now on, we denote the original lossy-compression problem under the given distortion measure $d$ with reconstruction alphabet $\hat{\mathcal{X}}$ by the “original problem”. On the other hand, we denote the corresponding lossy-compression problem under logarithmic loss with reconstruction alphabet $\hat{\mathcal{P}}$ by the “corresponding problem”.
3.2. Example: Memoryless Bernoulli Source with Hamming Distortion Measure
In this section, we consider the memoryless Bernoulli source under the Hamming distortion measure as an example of the above equivalence. Let $X^n$ be a memoryless Bernoulli source with probability $p$, where $P_{X_i}(1) = p \le 1/2$ for $1 \le i \le n$, and the reconstruction $\hat{x}^n \in \{0, 1\}^n$ is also an $n$-dimensional binary vector. Note that block length $n$ is fixed, so the problem is in the one-shot setting. Distortion measure $d$ is the separable Hamming distortion, i.e.,
$$d(x^n, \hat{x}^n) = \frac{1}{n}\sum_{i=1}^{n} d_H(x_i, \hat{x}_i),$$
where $d_H(x, \hat{x}) = 0$ if $x = \hat{x}$ and $d_H(x, \hat{x}) = 1$ if $x \neq \hat{x}$. Let $M$ be the number of messages. Then, we are interested in optimal encoding and decoding schemes that achieve distortion $D^\star(M)$.
In this scenario, the information rate-distortion function is not hard to compute [15]: for $0 \le D \le p$,
$$R(D) = \inf_{P_{\hat{X}^n|X^n} :\, \mathbb{E}[d(X^n, \hat{X}^n)] \le D} I(X^n; \hat{X}^n) = n\big(h_b(p) - h_b(D)\big), \qquad (7)$$
where $h_b(q) = -q\log q - (1-q)\log(1-q)$ is the binary entropy function. Let $P^\star_{\hat{X}^n|X^n}$ be the distribution that achieves the infimum in Equation (7) at $D = D^\star(M)$. We have an analytic formula for the rate-distortion-achieving distribution: the backward channel from $\hat{X}^n$ to $X^n$ is a memoryless binary symmetric channel with crossover probability $D$. For $x^n \in \{0,1\}^n$ and $\hat{x}^n \in \{0,1\}^n$, we have
$$P^\star_{X^n|\hat{X}^n}(x^n\mid\hat{x}^n) = D^{\,n d(x^n, \hat{x}^n)}\,(1-D)^{\,n - n d(x^n, \hat{x}^n)},$$
where $n\, d(x^n, \hat{x}^n)$ is the Hamming distance between $x^n$ and $\hat{x}^n$, and $D = D^\star(M)$.
Then, the corresponding problem is the lossy-compression problem under logarithmic loss where the set of reconstruction symbols is
$$\hat{\mathcal{P}} = \big\{P^\star_{X^n|\hat{X}^n}(\cdot\mid\hat{x}^n) : \hat{x}^n \in \{0,1\}^n\big\}.$$
Remark 3.
We can rewrite Equation (3) in this case as
$$\ell_{\log}\big(x^n, P^\star_{X^n|\hat{X}^n}(\cdot\mid\hat{x}^n)\big) = \log\frac{1}{P^\star_{X^n|\hat{X}^n}(x^n\mid\hat{x}^n)} = n\log\frac{1}{1-D} + n\,d(x^n, \hat{x}^n)\log\frac{1-D}{D}. \qquad (8)$$
The above equation explicitly shows the correspondence between logarithmic loss and the original distortion measure: under the corresponding problem, the logarithmic loss is an affine function of the Hamming distortion.
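The affine relation in Equation (8) is easy to check numerically; a small self-contained sketch (the helper names are ours, and `D` plays the role of $D^\star(M)$):

```python
import itertools
import math

def backward_prob(x, xhat, D):
    """P*_{X^n|Xhat^n}(x^n | xhat^n) for the memoryless BSC(D) test channel."""
    dh = sum(a != b for a, b in zip(x, xhat))
    return (D ** dh) * ((1 - D) ** (len(x) - dh))

n, D = 4, 0.11
xhat = (0, 1, 1, 0)
for x in itertools.product([0, 1], repeat=n):
    log_loss = -math.log(backward_prob(x, xhat, D))
    dh = sum(a != b for a, b in zip(x, xhat))
    affine = n * math.log(1 / (1 - D)) + dh * math.log((1 - D) / D)
    assert abs(log_loss - affine) < 1e-12              # Equation (8)
print("logarithmic loss is affine in the Hamming distortion")
```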
3.3. Discussion
3.3.1. One-to-One Correspondence
Theorem 2 implies that, for any fixed-length lossy-compression problem, we can find an equivalent problem under logarithmic loss whose optimal encoding schemes are the same. Thus, without loss of generality, we can restrict our attention to the problem under logarithmic loss with a reconstruction alphabet $\hat{\mathcal{P}} \subset \mathcal{P}(\mathcal{X})$ consisting of probability measures on $\mathcal{X}$.
3.3.2. Scheme Suboptimality
Suppose $f$ and $g$ are a suboptimal encoder and decoder for the original fixed-length lossy-compression problem. Then, the theorem implies
$$\mathbb{E}\big[\ell_{\log}\big(X, \hat{g}(\hat{f}(X))\big)\big] - H(X\mid\hat{X}^\star) = \lambda^\star\Big(\mathbb{E}\big[d\big(X, g(f(X))\big)\big] - D^\star(M)\Big). \qquad (9)$$
The left-hand side of Equation (9) is the cost of suboptimality in the corresponding lossy-compression problem. On the other hand, the right-hand side is proportional to the cost of suboptimality in the original problem. In Section 3.3.1, we discussed that the optimal schemes of the two problems coincide. Equation (9) shows a stronger equivalence in which the costs of suboptimality are linearly related. This implies that a good code for one problem is also good for the other.
3.3.3. Operations on the Reconstruction Alphabet
In general, the reconstruction alphabet $\hat{\mathcal{X}}$ does not have an algebraic structure. However, in the corresponding rate-distortion problem, the reconstruction alphabet is a set of probability measures, for which we have natural operations such as convex combinations of elements or projection onto a convex hull. We discuss such operations more closely in Section 5.
3.4. Exact Performance of Optimal Scheme
In the previous section, we showed that there is a corresponding lossy-compression problem under logarithmic loss that shares the same optimal coding scheme. In this section, we investigate the exact performance of the optimal scheme for the fixed-length lossy-compression problem under logarithmic loss when the reconstruction alphabet is the set of all probability measures on $\mathcal{X}$, i.e., $\mathcal{P}(\mathcal{X})$. (Recently, Shkel and Verdú [4] independently proposed similar results. The result was also presented in the conference version of this paper [16].) We also characterize the minimal average distortion when we have a fixed number of messages $M$. Note that this is a single-letter version of ([2], Lemma 1). Although the optimal scheme with reconstruction alphabet $\mathcal{P}(\mathcal{X})$ may differ from the optimal scheme with the restricted reconstruction alphabet $\hat{\mathcal{P}}$, it provides insight, as we show in Section 4. In this section, we restrict our attention to deterministic schemes. However, it is not hard to show that the same result holds even if we allow a stochastic encoder and decoder.
Let an encoder and a decoder be $f : \mathcal{X} \to \{1, \dots, M\}$ and $g : \{1, \dots, M\} \to \mathcal{P}(\mathcal{X})$, where $g(m) = q_m \in \mathcal{P}(\mathcal{X})$. Then, we have
$$\mathbb{E}\big[\ell_{\log}\big(X, g(f(X))\big)\big] = \sum_{m=1}^{M} p_m \sum_{x \in \mathcal{X}} P_{X|f(X)}(x\mid m)\log\frac{1}{q_m(x)} = H(X\mid f(X)) + \sum_{m=1}^{M} p_m\, D\big(P_{X|f(X)=m}\,\big\|\,q_m\big),$$
where $p_m = \Pr[f(X) = m]$ and $P_{X|f(X)=m}$ is the conditional distribution of $X$ given $f(X) = m$. Since $D(P_{X|f(X)=m}\,\|\,q_m) \ge 0$ for all $m$, we have
$$\mathbb{E}\big[\ell_{\log}\big(X, g(f(X))\big)\big] \ge H(X\mid f(X)).$$
Equality can be achieved by choosing $q_m = P_{X|f(X)=m}$, which can be done no matter what $f$ is. Thus, we have
$$\min_{f, g}\ \mathbb{E}\big[\ell_{\log}\big(X, g(f(X))\big)\big] = \min_{f}\ H(X\mid f(X)).$$
This implies that the optimal encoder is the function $f$ that maximizes $I(X; f(X))$, and the optimal decoder is given by $g(m) = P_{X|f(X)=m}$. The above result provides a trivial lower bound:
$$\min_{f, g}\ \mathbb{E}\big[\ell_{\log}\big(X, g(f(X))\big)\big] \ge H(X) - \log M,$$
since $I(X; f(X)) \le \log|f| \le \log M$.
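A small sketch of this computation (the function names are ours): given an encoder $f$, the optimal decoder outputs the conditional distribution $P_{X|f(X)=m}$, and the achieved average logarithmic loss is $H(X\mid f(X))$.

```python
import math
from collections import defaultdict

def optimal_log_loss_decoder(p_x, f):
    """Given encoder f: x -> message, return the optimal decoder
    g(m) = P_{X | f(X) = m} and the achieved distortion H(X | f(X)) in nats."""
    cell_prob = defaultdict(float)
    for x, p in p_x.items():
        cell_prob[f[x]] += p
    g = {m: {x: p / cell_prob[m] for x, p in p_x.items() if f[x] == m}
         for m in cell_prob}
    h_cond = -sum(p * math.log(g[f[x]][x]) for x, p in p_x.items())
    return g, h_cond

p_x = {"a": 0.5, "b": 0.3, "c": 0.2}
f = {"a": 0, "b": 1, "c": 1}                 # an encoder with M = 2 messages
g, h_cond = optimal_log_loss_decoder(p_x, f)
print(g[1])      # {'b': 0.6, 'c': 0.4}
print(h_cond)    # H(X|f(X)) is about 0.337 nats
```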
The optimal scheme under an excess distortion criterion is given in Appendix A.
4. Geometrical Interpretation
In this section, we present another, geometrical interpretation of the decoder in lossy-compression problems. Consider the original lossy-compression problem with discrete reconstruction alphabet $\hat{\mathcal{X}}$ and distortion measure $d$. Suppose an encoding function $f$ with $|f| \le M$ is given, which may or may not be optimal. Let $B_m = \{x \in \mathcal{X} : f(x) = m\}$, which is the set of source symbols that are mapped to message $m$. Then, the optimal reconstruction is given by
$$g(m) = \operatorname*{arg\,min}_{\hat{x} \in \hat{\mathcal{X}}}\ \sum_{x \in B_m} P_X(x)\, d(x, \hat{x}). \qquad (10)$$
Now, consider the corresponding lossy-compression problem under logarithmic loss. Recall that the set of reconstruction symbols is given by
$$\hat{\mathcal{P}} = \big\{P^\star_{X|\hat{X}}(\cdot\mid\hat{x}) : \hat{x} \in \hat{\mathcal{X}}\big\},$$
where $P^\star_{X|\hat{X}}$ is induced by the rate-distortion-function-achieving conditional distribution. As we have seen in Section 3.4, the optimal reconstruction is $P_{X|X \in B_m}$ if we have the extended reconstruction alphabet $\mathcal{P}(\mathcal{X})$. Thus, it is natural to find the probability distribution in $\hat{\mathcal{P}}$ that is the nearest distribution to $P_{X|X \in B_m}$. We propose the Kullback–Leibler divergence to measure the distance between probability distributions. In other words, we want to find $q_m \in \hat{\mathcal{P}}$, such that
$$q_m = \operatorname*{arg\,min}_{q \in \hat{\mathcal{P}}}\ D\big(P_{X|X \in B_m}\,\big\|\,q\big). \qquad (11)$$
This can be viewed as projecting the optimal solution from the extended set $\mathcal{P}(\mathcal{X})$ to the original feasible set $\hat{\mathcal{P}}$. Since $q \in \hat{\mathcal{P}}$, there exists $\hat{x} \in \hat{\mathcal{X}}$, such that $q = P^\star_{X|\hat{X}}(\cdot\mid\hat{x})$. Then, the above Kullback–Leibler divergence is given by
$$D\big(P_{X|X \in B_m}\,\big\|\,P^\star_{X|\hat{X}}(\cdot\mid\hat{x})\big) = \sum_{x \in B_m}\frac{P_X(x)}{P_X(B_m)}\log\frac{P_X(x)/P_X(B_m)}{P^\star_{X|\hat{X}}(x\mid\hat{x})} = C_m + \frac{\lambda^\star}{P_X(B_m)}\sum_{x \in B_m} P_X(x)\, d(x, \hat{x}),$$
where $C_m = -H(P_{X|X \in B_m}) + \sum_{x \in B_m}\frac{P_X(x)}{P_X(B_m)}\big(\log\frac{1}{P_X(x)} - \jmath_X(x, D^\star(M)) - \lambda^\star D^\star(M)\big)$ does not depend on $\hat{x}$, and the last equality is from Equation (2). Note that the last term is the only term that is a function of $\hat{x}$, and $\lambda^\star$ is positive. Thus, if $q_m = P^\star_{X|\hat{X}}(\cdot\mid\hat{x}_m)$ achieves the minimum in Equation (11), then $\hat{x}_m$ minimizes the following:
$$\sum_{x \in B_m} P_X(x)\, d(x, \hat{x}). \qquad (12)$$
Since Equation (12) coincides with Equation (10), we have
$$q_m = P^\star_{X|\hat{X}}(\cdot\mid g(m)), \qquad (13)$$
where $g(m)$ is the optimal reconstruction in Equation (10).
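Since $\hat{\mathcal{P}}$ is finite, the projection in Equation (11) is a finite search; a minimal sketch (names are ours):

```python
import math

def kl(p, q):
    """D(p || q) in nats for distributions given as dicts over the same alphabet."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def project_to_reconstruction_set(p_cell, candidates):
    """Return the xhat whose backward distribution is KL-closest to p_cell.

    p_cell     : dict x -> P_{X | X in B_m}(x)
    candidates : dict xhat -> (dict x -> P*_{X|Xhat}(x | xhat))
    """
    return min(candidates, key=lambda xhat: kl(p_cell, candidates[xhat]))
```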
Remark 4.
In Section 3, we directly defined $\hat{g}(m) = P^\star_{X|\hat{X}}(\cdot\mid g(m))$. However, we can also obtain $\hat{g}$ via the following two-step procedure:
- extend the reconstruction set from $\hat{\mathcal{P}}$ to $\mathcal{P}(\mathcal{X})$, then characterize the optimal decoding function $m \mapsto P_{X|X \in B_m}$; and
- find the measure $q_m \in \hat{\mathcal{P}}$ that is closest to $P_{X|X \in B_m}$ in Kullback–Leibler divergence.
The above result (13) implies that the two procedures yield the same decoder, i.e., $\hat{g}(m) = q_m$.
5. Log-Convex Relaxation
In the previous section, we obtained the optimal reconstruction symbol from the extended reconstruction alphabet and projected it onto the feasible set. In this section, instead of direct projection onto $\hat{\mathcal{P}}$, we propose another slight extension of $\hat{\mathcal{P}}$, namely, the log-convex hull. As we show in the following sections, the log-convex hull has interesting properties.
5.1. rI-Projection
Before defining the log-convex hull, we need to define the log-convex combination of probability distributions. Let $p$ and $q$ be probability distributions in $\mathcal{P}(\mathcal{X})$. For $0 \le \theta \le 1$, the log-convex combination of $p$ and $q$ is given by
$$s_\theta(x) = \frac{p(x)^{\theta}\, q(x)^{1-\theta}}{\sum_{x' \in \mathcal{X}} p(x')^{\theta}\, q(x')^{1-\theta}}.$$
It is clear that $\log s_\theta$ is a convex combination of $\log p$ and $\log q$ up to a normalizing constant. We can now define the log-convex hull $\operatorname{logconv}(\hat{\mathcal{P}})$, which is the set of log-convex combinations of probability measures in the set $\hat{\mathcal{P}}$. More precisely,
$$\operatorname{logconv}(\hat{\mathcal{P}}) = \left\{ q_r : q_r(x) = \frac{1}{Z(r)}\prod_{\hat{x} \in \hat{\mathcal{X}}} P^\star_{X|\hat{X}}(x\mid\hat{x})^{\,r(\hat{x})},\ r \in \mathcal{P}(\hat{\mathcal{X}}) \right\},$$
where $r$ is a weight vector (i.e., $r(\hat{x}) \ge 0$ and $\sum_{\hat{x}} r(\hat{x}) = 1$), and $Z(r) = \sum_{x \in \mathcal{X}}\prod_{\hat{x} \in \hat{\mathcal{X}}} P^\star_{X|\hat{X}}(x\mid\hat{x})^{\,r(\hat{x})}$ is a normalizing constant. By definition, $\operatorname{logconv}(\hat{\mathcal{P}})$ is log-convex since it contains all log-convex combinations of probability distributions in $\hat{\mathcal{P}}$.
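A log-convex combination is simply a normalized weighted geometric mean; a minimal sketch (names are ours):

```python
import numpy as np

def log_convex_combination(dists, r, eps=1e-300):
    """Normalized weighted geometric mean q_r(x) proportional to prod_j dists[j][x] ** r[j].

    dists : (k, n) array whose rows are distributions on the source alphabet
    r     : (k,) weight vector on the probability simplex
    """
    dists, r = np.asarray(dists, float), np.asarray(r, float)
    log_q = r @ np.log(dists + eps)         # sum_j r_j log p_j(x), up to log Z(r)
    q = np.exp(log_q - log_q.max())         # stabilize before normalizing
    return q / q.sum()

p1, p2 = [0.7, 0.2, 0.1], [0.1, 0.2, 0.7]
print(log_convex_combination([p1, p2], [0.5, 0.5]))  # symmetric geometric mean
```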
Instead of the projection of $P_{X|X \in B_m}$ onto $\hat{\mathcal{P}}$, we consider the projection onto $\operatorname{logconv}(\hat{\mathcal{P}})$. Since $\operatorname{logconv}(\hat{\mathcal{P}})$ is log-convex, ([17], Theorem 1) implies that there exists a unique probability distribution $q^\star_m$ that achieves the following minimum:
$$q^\star_m = \operatorname*{arg\,min}_{q \in \operatorname{logconv}(\hat{\mathcal{P}})}\ D\big(P_{X|X \in B_m}\,\big\|\,q\big). \qquad (14)$$
The projection $q^\star_m$ is called an rI-projection of $P_{X|X \in B_m}$ onto $\operatorname{logconv}(\hat{\mathcal{P}})$. Let $r^\star$ be the corresponding weights, i.e.,
$$q^\star_m(x) = \frac{1}{Z(r^\star)}\prod_{\hat{x} \in \hat{\mathcal{X}}} P^\star_{X|\hat{X}}(x\mid\hat{x})^{\,r^\star(\hat{x})}.$$
Csiszár and Matúš ([17], Theorem 1) showed that the rI-projection satisfies the following inequality for all $q \in \operatorname{logconv}(\hat{\mathcal{P}})$:
$$D\big(P_{X|X \in B_m}\,\big\|\,q\big) \ge D\big(P_{X|X \in B_m}\,\big\|\,q^\star_m\big) + D\big(q^\star_m\,\big\|\,q\big). \qquad (15)$$
On the other hand, the log-convex combination of probability measures is also called the geometric mean of probability measures [18]. The author also provided a geometric compensation identity, which is given by
$$\sum_{\hat{x} \in \hat{\mathcal{X}}} r(\hat{x})\, D\big(P\,\big\|\,P^\star_{X|\hat{X}}(\cdot\mid\hat{x})\big) = D\big(P\,\big\|\,q_r\big) + \sum_{\hat{x} \in \hat{\mathcal{X}}} r(\hat{x})\, D\big(q_r\,\big\|\,P^\star_{X|\hat{X}}(\cdot\mid\hat{x})\big). \qquad (16)$$
The above result holds for any probability measure $P$; therefore, Equation (16) also holds when $P = P_{X|X \in B_m}$ and $r = r^\star$. Together with Inequality (15), we get the following result. For all $\hat{x} \in \hat{\mathcal{X}}$, if $r^\star(\hat{x}) > 0$, then
$$D\big(P_{X|X \in B_m}\,\big\|\,P^\star_{X|\hat{X}}(\cdot\mid\hat{x})\big) = D\big(P_{X|X \in B_m}\,\big\|\,q^\star_m\big) + D\big(q^\star_m\,\big\|\,P^\star_{X|\hat{X}}(\cdot\mid\hat{x})\big). \qquad (17)$$
Remark 5.
The above result is similar to the projection onto a polytope in Euclidean space. Suppose vectors $v_1, \dots, v_k$ form a polytope, and consider the projection of a vector $w$ onto the polytope. Let $h$ be the projection. Then, $h$ is a convex combination of the $v_i$'s. Thus, there exist coefficients $c_1, \dots, c_k$, such that
$$h = \sum_{i=1}^{k} c_i v_i,$$
where $\sum_{i} c_i = 1$, and $c_i \ge 0$ for all $i$. Let $S$ be the set of indices of nonzero coefficients. Then, the projection $h$ is on the plane generated by $\{v_i : i \in S\}$. Thus, the two vectors $w - h$ and $h - v_i$ are orthogonal for all $i \in S$. The Pythagorean theorem then implies that, for all $i$, we have either $c_i = 0$ or
$$\|w - v_i\|^2 = \|w - h\|^2 + \|h - v_i\|^2.$$
5.2. Optimization
As we saw in the previous section, we want to find the weight vector $r$ that minimizes $D(P_{X|X \in B_m}\,\|\,q_r)$. Note that
$$D\big(P_{X|X \in B_m}\,\big\|\,q_r\big) = -H\big(P_{X|X \in B_m}\big) - \sum_{x \in \mathcal{X}} P_{X|X \in B_m}(x)\log q_r(x).$$
Since the first term is not a function of $r$, it is enough to consider the second term. By the definition of $q_r$, we have
$$-\sum_{x \in \mathcal{X}} P_{X|X \in B_m}(x)\log q_r(x) = \log Z(r) - \sum_{\hat{x} \in \hat{\mathcal{X}}} r(\hat{x})\sum_{x \in \mathcal{X}} P_{X|X \in B_m}(x)\log P^\star_{X|\hat{X}}(x\mid\hat{x}).$$
Thus, minimizing $D(P_{X|X \in B_m}\,\|\,q_r)$ is equivalent to solving the following optimization problem:
$$\min_{r \in \mathcal{P}(\hat{\mathcal{X}})}\ \left\{\log Z(r) - \sum_{\hat{x} \in \hat{\mathcal{X}}} r(\hat{x})\sum_{x \in \mathcal{X}} P_{X|X \in B_m}(x)\log P^\star_{X|\hat{X}}(x\mid\hat{x})\right\}.$$
Since the objective function is a convex function of $r$ ($\log Z(r)$ is a log-sum-exp of linear functions of $r$), the above problem is a convex optimization problem that can be efficiently solved.
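One simple way to solve this problem over the probability simplex is exponentiated-gradient (mirror) descent; the gradient of the objective with respect to $r(\hat{x})$ is $\mathbb{E}_{q_r}[\log P^\star_{X|\hat{X}}(X\mid\hat{x})] - \mathbb{E}_{P_{X|X\in B_m}}[\log P^\star_{X|\hat{X}}(X\mid\hat{x})]$. A minimal sketch (the solver choice and step size are ours; any off-the-shelf convex solver would also do):

```python
import numpy as np

def ri_projection_weights(p_cell, dists, steps=2000, eta=0.5, eps=1e-300):
    """Weights r minimizing D(p_cell || q_r) over the simplex, where
    q_r(x) is proportional to prod_j dists[j][x] ** r[j].

    p_cell : (n,) distribution P_{X | X in B_m}
    dists  : (k, n) rows are the reconstruction distributions P*_{X|Xhat}
    """
    p_cell, dists = np.asarray(p_cell, float), np.asarray(dists, float)
    log_p = np.log(dists + eps)                       # (k, n)
    r = np.full(len(dists), 1.0 / len(dists))
    for _ in range(steps):
        log_q = r @ log_p
        q = np.exp(log_q - log_q.max())
        q /= q.sum()                                  # current q_r
        grad = log_p @ q - log_p @ p_cell             # E_q[log p_j] - E_cell[log p_j]
        r *= np.exp(-eta * grad)                      # exponentiated-gradient step
        r /= r.sum()
    return r
```

Plugging the returned weights into the log-convex combination of the previous section yields the rI-projection $q^\star_m$.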
5.3. Relaxation in Clustering
In the corresponding lossy-compression problem under logarithmic loss, reconstruction symbols are probability measures that have a natural algebraic structure, as we discussed in Section 3.3.3. In this section, we present the benefits of such a property when we apply some known techniques from the clustering literature.
Lossy compression is closely related to the clustering problem [19,20,21]. Many works focused on the application of k-means clustering to a lossy-compression problem [22,23,24], which is an extension of the Lloyd–Max algorithm [25,26]. However, k-means clustering is only available when there exists a well-defined operation on $\hat{\mathcal{X}}$ (e.g., averaging when $\hat{\mathcal{X}}$ is a Euclidean space). This is because k-means clustering requires computing the mean of data points, which is the center of each cluster. In general lossy-compression problems, the reconstruction alphabet $\hat{\mathcal{X}}$ may not have such an operation. In such cases, we may have to apply k-medoid-like clustering [27], where the center of each cluster has to be a data point. The k-medoid-like algorithm in the context of lossy compression is shown in Algorithm 1.
Algorithm 1: k-medoid-like clustering in lossy compression.
Randomly initialize $g(1), \dots, g(M) \in \hat{\mathcal{X}}$.
repeat
  Set $B_m \leftarrow \emptyset$ for all $1 \le m \le M$.
  for $x \in \mathcal{X}$ do
    $B_{m^\star} \leftarrow B_{m^\star} \cup \{x\}$, where $m^\star = \operatorname{arg\,min}_{1 \le m \le M} d(x, g(m))$.
  end for
  for $m = 1$ to $M$ do
    $g(m) \leftarrow \operatorname{arg\,min}_{\hat{x} \in \hat{\mathcal{X}}} \sum_{x \in B_m} P_X(x)\, d(x, \hat{x})$.
  end for
until converged
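A direct implementation sketch of Algorithm 1 (the function names are ours; ties are broken arbitrarily and empty clusters are left unchanged):

```python
import random

def k_medoid_like(p_x, d, M, n_iter=100, seed=0):
    """Algorithm 1: each cluster center is a reconstruction symbol.

    p_x: list of source probabilities, d[x][xhat]: distortion matrix, M: # messages."""
    rng = random.Random(seed)
    n_x, n_xhat = len(p_x), len(d[0])
    centers = rng.sample(range(n_xhat), M)       # initialize g(1..M); assumes M <= n_xhat
    cells = [[] for _ in range(M)]
    for _ in range(n_iter):
        cells = [[] for _ in range(M)]
        for x in range(n_x):                                 # assignment step
            m = min(range(M), key=lambda j: d[x][centers[j]])
            cells[m].append(x)
        new_centers = [
            min(range(n_xhat),
                key=lambda xh: sum(p_x[x] * d[x][xh] for x in cells[m]))
            if cells[m] else centers[m]
            for m in range(M)
        ]
        if new_centers == centers:                           # converged
            break
        centers = new_centers
    return centers, cells
```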
On the other hand, in the corresponding problem, the reconstruction alphabet is a set of probability distributions for which operations such as log-convex combinations are well-defined. This allows us to propose a k-means-like clustering algorithm, as shown in Algorithm 2.
Algorithm 2: k-means-like clustering in lossy compression.
Randomly initialize $q_1, \dots, q_M \in \operatorname{logconv}(\hat{\mathcal{P}})$.
repeat
  Set $B_m \leftarrow \emptyset$ for all $1 \le m \le M$.
  for $x \in \mathcal{X}$ do
    $B_{m^\star} \leftarrow B_{m^\star} \cup \{x\}$, where $m^\star = \operatorname{arg\,min}_{1 \le m \le M} \log\frac{1}{q_m(x)}$.
  end for
  for $m = 1$ to $M$ do
    Set $P_{X|X \in B_m}(x) = P_X(x)/P_X(B_m)$ for $x \in B_m$, where $P_X(B_m) = \sum_{x \in B_m} P_X(x)$.
    $q_m \leftarrow \operatorname{arg\,min}_{q \in \operatorname{logconv}(\hat{\mathcal{P}})} D\big(P_{X|X \in B_m}\,\big\|\,q\big)$ (the rI-projection of Section 5.2).
  end for
until converged
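A compact sketch of Algorithm 2 (our own names; the center update reuses a few exponentiated-gradient steps for the rI-projection of Section 5.2):

```python
import numpy as np

def k_means_like(p_x, dists, M, n_iter=50, proj_steps=200, eta=0.5, seed=0):
    """Algorithm 2: centers are log-convex combinations of the rows of dists
    (the reconstruction distributions P*_{X|Xhat}); assignment uses log loss.
    p_x: (n,) source pmf, dists: (k, n). Returns (centers, assignment)."""
    rng = np.random.default_rng(seed)
    p_x = np.asarray(p_x, float)
    log_p = np.log(np.asarray(dists, float) + 1e-300)       # (k, n)
    weights = rng.dirichlet(np.ones(log_p.shape[0]), size=M)

    def center(r):                                           # q_r from weights r
        log_q = r @ log_p
        q = np.exp(log_q - log_q.max())
        return q / q.sum()

    assign = np.zeros(len(p_x), dtype=int)
    for _ in range(n_iter):
        centers = np.array([center(r) for r in weights])     # (M, n)
        # Assignment: x joins the center with the smallest log loss log(1/q_m(x)).
        assign = np.argmin(-np.log(centers + 1e-300), axis=0)
        for m in range(M):                                   # update: rI-projection
            mask = assign == m
            if not mask.any():
                continue
            p_cell = p_x * mask
            p_cell = p_cell / p_cell.sum()
            r = weights[m]
            for _ in range(proj_steps):                      # exponentiated gradient
                q = center(r)
                grad = log_p @ q - log_p @ p_cell
                r = r * np.exp(-eta * grad)
                r = r / r.sum()
            weights[m] = r
    return np.array([center(r) for r in weights]), assign
```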
The main idea of the above algorithm is that the log-convex combination behaves like the center of cluster $B_m$. In the clustering literature, there are many known variations of k-means clustering [28,29]. The above result shows that we can borrow those techniques and apply them to the lossy-compression problem even without any algebraic structure on the reconstruction alphabet.
5.4. Application to General Clustering Problems
The idea of the previous section can be applied to an actual clustering problem. We mainly focus on clustering categorical data, where data points are not in a continuous space [30,31,32,33,34]. Since operations such as the mean are not well-defined in this case, it is hard to apply known data-clustering algorithms for continuous spaces. The key idea is that the equivalence relation with logarithmic loss provides an algebraic structure on any set. More precisely, we can transform any clustering problem into a clustering problem in continuous space and apply known techniques such as variations of k-means.
A more rigorous definition of the problem is given below. Assume that we have a finite set of data points $\{v_1, \dots, v_N\}$, and each data point $v_i$ has a weight $w_i$. We normalize the weights so that $\sum_{i=1}^{N} w_i = 1$; the weights may or may not be uniform. The distance between two points is given by a measure $d(\cdot, \cdot)$. Suppose we want to partition the data points into $M$ clusters.
If we let $\mathcal{X} = \hat{\mathcal{X}} = \{v_1, \dots, v_N\}$ and $P_X(v_i) = w_i$, then the clustering problem turns out to be a lossy-compression problem under distortion measure $d$, where the number of messages is $M$. Let $D^\star(M)$ be the optimal achievable distortion, and let $P^\star_{\hat{X}|X}$ be the distribution that achieves the rate-distortion function as defined in Equation (4). Then, we can find the corresponding lossy-compression problem under logarithmic loss. Finally, we can apply clustering algorithms for continuous spaces, such as k-means, to the corresponding problem. For example, Algorithm 2 can be applied to the corresponding problem.
Remark 6.
Note that it is hard to obtain an exact analytic formula for $D^\star(M)$ or $P^\star_{\hat{X}|X}$. However, as we mentioned in Section 3.3.2, we do not have to find an optimal scheme under the exact problem formulation. If we can provide a good scheme for the corresponding problem constructed with an approximation of $P^\star_{\hat{X}|X}$, that should be a good enough scheme for the original problem.
6. Conclusions
To conclude our discussion, we summarize our main contributions. We showed that, for any fixed-length lossy-compression problem under an arbitrary distortion measure, there exists a corresponding lossy-compression problem under logarithmic loss whose optimal schemes coincide with those of the original problem. We also proved that a good scheme for one lossy-compression problem is also good for the other. This equivalence provides an algebraic structure on any reconstruction alphabet, which allows using various optimization techniques in lossy-compression problems, such as log-convex relaxation. Furthermore, our results naturally suggest a k-means-like clustering algorithm for categorical data-clustering problems.
Funding
This work was supported by the National Research Foundation of Korea, funded by the Korean Government (MSIT) under Grant NRF-2017R1C1B5018298.
Conflicts of Interest
The author declares no conflict of interest.
Appendix A. Optimal Scheme Under Excess Distortion Criterion
In this section, we characterize the minimum number of codewords that can achieve distortion $D$ and excess-distortion probability $\epsilon$ when the reconstruction alphabet is $\mathcal{P}(\mathcal{X})$. Let an encoder and a decoder be $f : \mathcal{X} \to \{1, \dots, M\}$ and $g : \{1, \dots, M\} \to \mathcal{P}(\mathcal{X})$, where $g(m) = q_m$. Since $\ell_{\log}(x, q_m) \le D$ is equivalent to $q_m(x) \ge e^{-D}$, we have
$$\Pr\big[\ell_{\log}\big(X, g(f(X))\big) > D\big] = \Pr\big[q_{f(X)}(X) < e^{-D}\big].$$
However, at most $\lfloor e^{D}\rfloor$ values of $q_m(\cdot)$ can be larger than or equal to $e^{-D}$, where $\lfloor x \rfloor$ is the largest integer that is smaller than or equal to $x$. Thus, we can cover at most $M\lfloor e^{D}\rfloor$ of the source symbols with $M$ codewords. Suppose $P_X(x_1) \ge P_X(x_2) \ge \cdots \ge P_X(x_{|\mathcal{X}|})$; then, the optimal scheme is
$$f(x_i) = \min\left(\left\lceil\frac{i}{\lfloor e^{D}\rfloor}\right\rceil, M\right), \qquad g(m)(x_i) = \frac{1}{\lfloor e^{D}\rfloor}\ \text{ for } (m-1)\lfloor e^{D}\rfloor < i \le m\lfloor e^{D}\rfloor,$$
where $\lceil x \rceil$ is the smallest integer that is larger than or equal to $x$. The idea is that each reconstruction symbol covers $\lfloor e^{D}\rfloor$ source symbols by assigning probability mass $1/\lfloor e^{D}\rfloor \ge e^{-D}$ to each of them.
The above optimal scheme satisfies
$$\Pr\big[\ell_{\log}\big(X, g(f(X))\big) > D\big] = 1 - \sum_{i=1}^{M\lfloor e^{D}\rfloor} P_X(x_i) = 1 - F_X\big(M\lfloor e^{D}\rfloor\big),$$
where $F_X(k) = \sum_{i=1}^{k} P_X(x_i)$ is the cumulative distribution function of $X$ along the ordered symbols. This implies that the minimal error probability is
$$\epsilon^\star(M, D) = 1 - F_X\big(M\lfloor e^{D}\rfloor\big).$$
On the other hand, if we fix target error probability $\epsilon$, the minimal number of codewords is
$$M^\star(D, \epsilon) = \left\lceil\frac{k_\epsilon}{\lfloor e^{D}\rfloor}\right\rceil,$$
where $k_\epsilon$ is the smallest integer $k$ such that $F_X(k) \ge 1 - \epsilon$.
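The covering argument above translates directly into code; a small sketch under the assumptions of this appendix (names are ours):

```python
import math

def excess_distortion_tradeoff(p_x, D):
    """Under logarithmic loss with reconstruction alphabet P(X), each codeword
    can cover floor(e^D) source symbols, so the best strategy covers the most
    likely symbols first.  Returns functions eps_star(M) and m_star(eps)."""
    probs = sorted(p_x, reverse=True)
    cover = int(math.floor(math.exp(D)))            # symbols covered per codeword

    def eps_star(M):                                # minimal excess-distortion probability
        return max(0.0, 1.0 - sum(probs[:M * cover]))

    def m_star(eps):                                # minimal number of codewords
        total, k = 0.0, 0
        while total < 1.0 - eps and k < len(probs): # k_eps most likely symbols
            total += probs[k]
            k += 1
        return math.ceil(k / cover)

    return eps_star, m_star

eps_star, m_star = excess_distortion_tradeoff([0.4, 0.3, 0.2, 0.1], D=math.log(2))
print(eps_star(1), m_star(0.05))   # cover = 2: eps*(1) is about 0.3, M*(0.05) = 2
```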
where . Note that if we allow variable length coding without a prefix condition, the optimal coding scheme is similar to optimal nonasymptotic lossless coding introduced in [35].
References
- Courtade, T.A.; Wesel, R.D. Multiterminal source coding with an entropy-based distortion measure. In Proceedings of the 2011 IEEE International Symposium on Information Theory, 2011; pp. 2040–2044.
- Courtade, T.; Weissman, T. Multiterminal Source Coding Under Logarithmic Loss. IEEE Trans. Inf. Theory 2014, 60, 740–761.
- Ugur, Y.; Aguerri, I.E.; Zaidi, A. Vector Gaussian CEO problem under logarithmic loss. In Proceedings of the 2018 IEEE Information Theory Workshop, Guangzhou, China, 25–29 November 2018; pp. 1–5.
- Shkel, Y.Y.; Verdú, S. A single-shot approach to lossy source coding under logarithmic loss. IEEE Trans. Inf. Theory 2018, 64, 129–147.
- Jiao, J.; Courtade, T.A.; Venkat, K.; Weissman, T. Justification of logarithmic loss via the benefit of side information. IEEE Trans. Inf. Theory 2015, 61, 5357–5365.
- Painsky, A.; Wornell, G.W. Bregman divergence bounds and the universality of the logarithmic loss. arXiv 2018, arXiv:1810.07014.
- No, A. Universality of Logarithmic Loss in Successive Refinement. Entropy 2019, 21, 158.
- Tishby, N.; Pereira, F.; Bialek, W. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 22–24 September 1999; pp. 368–377.
- Harremoës, P.; Tishby, N. The information bottleneck revisited or how to choose a good distortion measure. In Proceedings of the 2007 IEEE International Symposium on Information Theory, Nice, France, 24–29 June 2007; pp. 566–570.
- Gilad-Bachrach, R.; Navot, A.; Tishby, N. An information theoretic tradeoff between complexity and accuracy. In Learning Theory and Kernel Machines; Springer: Berlin/Heidelberg, Germany, 2003; pp. 595–609.
- Aguerri, I.E.; Zaidi, A. Distributed Information Bottleneck Method for Discrete and Gaussian Sources. In Proceedings of the International Zurich Seminar on Information and Communication, Zurich, Switzerland, 21–23 February 2018.
- Kostina, V.; Verdú, S. Nonasymptotic noisy lossy source coding. IEEE Trans. Inf. Theory 2016, 62, 6111–6123.
- Kostina, V.; Verdú, S. Fixed-length lossy compression in the finite blocklength regime. IEEE Trans. Inf. Theory 2012, 58, 3309–3338.
- Csiszár, I. On an extremum problem of information theory. Studia Scientiarum Mathematicarum Hungarica 1974, 9, 57–71.
- Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2012.
- No, A.; Weissman, T. Universality of logarithmic loss in lossy compression. In Proceedings of the 2015 IEEE International Symposium on Information Theory, Hong Kong, China, 14–19 June 2015; pp. 2166–2170.
- Csiszár, I.; Matúš, F. Information projections revisited. IEEE Trans. Inf. Theory 2003, 49, 1474–1490.
- No, A. Information Geometric Approach on Most Informative Boolean Function Conjecture. Entropy 2018, 20, 688.
- Chaffee, D.L. Applications of Rate Distortion Theory to the Bandwidth Compression of Speech Signals. Ph.D. Thesis, University of California, Los Angeles, CA, USA, 1975.
- Chen, D. On two or more dimensional optimum quantizers. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Hartford, CT, USA, 9–11 May 1977; Volume 2, pp. 640–643.
- Gray, R.; Buzo, A.; Matsuyama, Y.; Gray, A., Jr.; Markel, J. Source coding and speech compression. In International Telemetering Conference Proceedings; International Foundation for Telemetering: San Diego, CA, USA, 1978; Volume 14.
- Linde, Y.; Buzo, A.; Gray, R. An algorithm for vector quantizer design. IEEE Trans. Commun. 1980, 28, 84–95.
- Gray, R.M.; Neuhoff, D.L. Quantization. IEEE Trans. Inf. Theory 1998, 44, 2325–2383.
- Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749.
- Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137.
- Max, J. Quantizing for minimum distortion. IRE Trans. Inf. Theory 1960, 6, 7–12.
- Kaufman, L.; Rousseeuw, P.J. Finding Groups in Data: An Introduction to Cluster Analysis; John Wiley & Sons: Hoboken, NJ, USA, 2009; Volume 344.
- Phillips, S.J. Acceleration of k-means and related clustering algorithms. In Proceedings of the Workshop on Algorithm Engineering and Experimentation, San Francisco, CA, USA, 4–5 January 2002; pp. 166–177.
- Pelleg, D.; Moore, A.W. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, CA, USA, 29 June–2 July 2000; Volume 1, pp. 727–734.
- Watve, A.; Pramanik, S.; Jung, S.; Jo, B.; Kumar, S.; Sural, S. Clustering Non-Ordered Discrete Data. J. Inf. Sci. Eng. 2014, 30, 1–23.
- Bai, L.; Liang, J.; Dang, C.; Cao, F. A novel attribute weighting algorithm for clustering high-dimensional categorical data. Pattern Recognit. 2011, 44, 2843–2861.
- Ng, R.T.; Han, J. CLARANS: A method for clustering objects for spatial data mining. IEEE Trans. Knowl. Data Eng. 2002, 14, 1003–1016.
- Ganti, V.; Gehrke, J.; Ramakrishnan, R. CACTUS—clustering categorical data using summaries. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 15–18 August 1999; Volume 99, pp. 73–83.
- Kumar, S.; Sural, S.; Watve, A.; Pramanik, S. CNODE: Clustering of set-valued non-ordered discrete data. Int. J. Data Min. Model. Manag. 2009, 1, 310–334.
- Kontoyiannis, I.; Verdú, S. Optimal Lossless Data Compression: Non-Asymptotics and Asymptotics. IEEE Trans. Inf. Theory 2014, 60, 777–795.
© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).