Soft Quantization Using Entropic Regularization

The quantization problem aims to find the best possible approximation of probability measures on R^d by finite, discrete measures. The Wasserstein distance is a typical choice to measure the quality of the approximation. This contribution investigates the properties and robustness of the entropy-regularized quantization problem, which relaxes the standard quantization problem. The proposed approximation technique naturally involves the softmin function, which is well known for its robustness from both theoretical and practical standpoints. Moreover, we use the entropy-regularized Wasserstein distance to evaluate the quality of the soft quantization approximation, and we implement a stochastic gradient approach to compute the optimal solutions. The control parameter in our method allows adjusting the difficulty of the optimization problem, which is a significant advantage when dealing with exceptionally challenging problems. Finally, this contribution empirically illustrates the performance of the method in various settings.


Introduction
Over the past few decades, extensive research has been conducted on optimal quantization techniques in order to tackle numerical problems arising in various fields such as data science, applied disciplines, and economic models. These problems typically center on uncertainties or probabilities, which demand robust and efficient solutions (cf. Graf and Mauldin [7], Luschgy and Pagès [11], El Nmeir et al. [5]). In general, these problems are difficult to handle, as the random components allow uncountably many outcomes. To address this difficulty, the probability measures are replaced by simpler or finite measures, which facilitates numerical computations. However, the measures should be 'close', so that the result of computations with the approximate (discrete) measures resembles the original problem. In a nutshell, the goal is to find the best approximation of a diffuse measure by a discrete measure; this is called the optimal quantization problem. For a comprehensive discussion of the optimal quantization problem from a mathematical standpoint, we refer to Graf and Luschgy [6].
On the other hand, entropy is an indispensable concept for dealing with uncertainties and probabilities. In mathematics, entropy is often used as a measure of information and uncertainty. It provides a quantitative measure of the randomness or disorder in a system or a random variable. Its applications span information theory, statistical analysis, probability theory, and the study of complex dynamical systems (cf. Breuer and Csiszár [2, 3], Pichler and Schlotter [16]).
In order to assess the closeness of probability measures, distances are often considered; one notable instance is the Wasserstein distance. Intuitively, the Wasserstein distance measures the minimal average transportation cost required to transform one probability distribution into another. Unlike other distances and/or divergences, which simply compare the probabilities of the distribution functions (e.g., the total variation distance and the Kullback-Leibler divergence), the Wasserstein distance incorporates the support of the underlying distributions. This improves the understanding of the relationships between different probability measures in a geometrically faithful manner.
In this work, we focus on entropy-adjusted quantization methods. More precisely, we consider an entropy-regularized version of the Wasserstein problem to quantify the quality of the approximation, and we adapt the stochastic gradient approach to obtain the optimal quantizers.
Some key features of our methodology include the following:
(i) The regularization approach stabilizes and simplifies the standard quantization problem by introducing penalty terms that discourage overly complex or overfit models, promoting better generalization and robustness of the solutions.
(ii) The influence of entropy is controlled by a parameter λ, which also enables us to recover the genuine optimal quantizers.
(iii) Generally, parameter tuning comes with certain limitations. However, our method builds on the framework of the well-established softmin function, which allows us to exercise parameter control without restrictions.
(iv) For a large regularization parameter λ, the optimal measure accumulates all its mass at the center of the measure.
Related works and contributions. As mentioned above, optimal quantization is a well-researched topic in information theory and signal processing. Several methods have been developed for the optimal quantization problem; here are some notable ones:
-Lloyd-Max algorithm: the Lloyd-Max algorithm, also known as Lloyd's algorithm or the k-means algorithm, is a popular iterative algorithm for computing optimal vector quantizers. It iteratively adjusts the centroids of the quantization levels to minimize the quantization error (cf. Scheunders [20]).
-Tree-structured vector quantization (TSVQ): TSVQ is a hierarchical quantization method that uses a tree structure to partition the input space into regions. It recursively applies vector quantization at each level of the tree until the desired number of quantization levels is achieved (cf. Wei and Levoy [22]).
-Expectation-maximization (EM) algorithm: the EM algorithm is a general-purpose optimization algorithm that can be used for optimal quantization. It is an iterative algorithm that estimates the parameters of a statistical model to maximize the likelihood of the observed data (cf. Heskes [8]).
-Stochastic optimization methods: stochastic optimization methods, such as simulated annealing, genetic algorithms, and particle swarm optimization, can be used to find optimal quantization strategies by exploring the search space and iteratively improving the quantization performance (cf. Pagès et al. [14]).
-Greedy vector quantization (GVQ): the greedy algorithm solves the problem iteratively, adding one code word at every step until the desired number of code words is reached, each time selecting the code word that minimizes the error. GVQ is known to provide suboptimal quantization compared to non-greedy methods like the Lloyd-Max and Linde-Buzo-Gray algorithms. However, it has been shown to perform well when the data has a strong correlation structure. Notably, it utilizes the Wasserstein distance to measure the approximation error (cf. Luschgy and Pagès [11]).
These methods provide efficient and practical solutions for finding optimal quantization schemes with different trade-offs between complexity and performance. The choice of method depends on the problem of interest and the requirements of the application. However, most of these methods depend on strict constraints, which makes the solutions overly complex or prone to overfitting. Our method mitigates this issue by promoting better generalization and robustness of the solutions.
In the optimal transport community, the entropy-regularized version of the optimal transport problem (also known as the entropy-regularized Wasserstein problem) was initially proposed by Cuturi [4]. This entropic version of the Wasserstein problem enables fast computations using Sinkhorn's algorithm. Building on this, numerous results have been obtained towards a comprehensive understanding of the subtleties involved in enhancing the computational performance of entropic optimal transport (cf. Ramdas et al. [18], Neumayer and Steidl [13], [1], Lakshmanan et al. [10]). These findings serve as a valuable foundation for further exploration in the field of optimal transport, providing insights into both the intricacies of the topic and potential avenues for improvement.
In contrast, we present a new approach that concentrates on the entropy-based optimal quantization problem and its robustness properties, which is a distinct contribution compared to standard entropy-regularized optimal transport problems.
One of the principal consequences of our research establishes the convergence behavior of the quantizers towards the center of the measure. The relationship between the center of the measure and the entropy-regularized quantization problem has not been explored before. The following plain solution is obtained by intensifying the entropy term in the regularization of the quantization problem.
Theorem 1.1. There exists a real value λ_0 > 0 such that the best approximation in the entropy-regularized optimal quantization problem is given by the Dirac measure P = δ_c for every λ > λ_0, where c is the center of the measure μ with respect to the distance d.
The interpretation of our master problem allows us to understand the transition from a hard, complex optimization problem to the simple solution in Theorem 1.1. Moreover, along with the theoretical discussion, we provide an algorithm and numerical examples, which empirically demonstrate the robustness of the method. The forthcoming sections elucidate the robustness and asymptotic properties of the method in detail.
Outline of the paper. Section 2 establishes the essential notations, definitions, and properties. Moreover, we comprehensively expound on the significance of the smooth minimum, a pivotal component of our research. In Section 3, we introduce the entropy-regularized optimal quantization problem and delve into its inherent properties. Section 4 presents the discussion of soft tessellation, optimal weights, and theoretical properties of parameter tuning. Furthermore, we systematically illustrate the computational process along with a pseudo-algorithm. Section 5 provides numerical examples and empirically substantiates the theoretical results. Finally, Section 6 summarizes our study.

Preliminaries
In what follows, (X, d) is a Polish space. The σ-algebra generated by the Borel sets induced by the distance d is F; the set of all probability measures on X is P(X).

Distances and divergences of measures
The standard quantization problem employs the Wasserstein distance to measure the quality of the approximation, which was initially studied by Monge and Kantorovich (cf. Monge [12], Kantorovich [9]). One remarkable property of this distance is that it metrizes the weak* topology of measures.
Definition 2.1 (Wasserstein distance). Let μ and P be probability measures on (X, d). The Wasserstein distance of order r ≥ 1 of μ and P ∈ P(X) is
W_r(μ, P) := ( inf_π ∬_{X²} d(x, ξ)^r π(dx, dξ) )^{1/r},
where the infimum is among all measures π ∈ P(X²) with marginals μ and P, that is,
π(A × X) = μ(A) and π(X × B) = P(B)
for all sets A and B ∈ F. The measures on X are called the marginal measures of the bivariate measure π.
We may refer to the excellent monographs [21,17] for a comprehensive discussion of the Wasserstein distance.
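In one dimension, the Wasserstein distance between two empirical measures with equally many, equally weighted atoms reduces to pairing sorted samples (the order statistics realize the optimal coupling). The following sketch illustrates this special case; it is in Python for illustration only (the function name and setup are ours, not from the paper's Julia implementation).

```python
import numpy as np

def wasserstein_1d(x, y, r=1):
    """Wasserstein distance of order r between two empirical measures on R
    with equally many, equally weighted atoms: pair the sorted samples."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert len(x) == len(y), "equal number of atoms assumed"
    return (np.mean(np.abs(x - y) ** r)) ** (1.0 / r)

# Transporting the atoms {0, 1} onto {1, 2} moves each atom by 1, so W_1 = 1.
d = wasserstein_1d([0.0, 1.0], [2.0, 1.0])
```

Note that the distance is computed between measures, not ordered tuples: the atom list `[2.0, 1.0]` is sorted internally, so its ordering is irrelevant.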
Remark 2.2 (Flexibility). In the subsequent discussion, our problem of interest is to approximate the measure μ, which may be a continuous, a discrete, or a mixed measure on X = R^d. The measure P, which is discrete, is used to approximate the measure μ. The definition of the Wasserstein distance flexibly comprises all cases, namely continuous, semi-discrete, and discrete measures.
In contrast to the standard methodology, we investigate the quantization problem by utilizing an entropic version of the Wasserstein distance. The standard Wasserstein problem is regularized by adding the Kullback-Leibler divergence, which is also known as relative entropy.
Definition 2.3 (Kullback-Leibler divergence). Let μ and ν ∈ P(X) be probability measures. Denote by Z ∈ L¹(ν) the Radon-Nikodým derivative, dμ = Z dν, if μ is absolutely continuous with respect to ν (μ ≪ ν). The Kullback-Leibler divergence is
D(μ ∥ ν) := E_μ log Z = E_ν Z log Z,
where E_μ (E_ν, resp.) is the expectation with respect to the measure μ (ν, resp.). By Gibbs' inequality, the Kullback-Leibler divergence satisfies D(μ ∥ ν) ≥ 0 (non-negativity). However, D is not a distance metric, as it satisfies neither the symmetry nor the triangle inequality property.
We would like to emphasize the following distinction to the Wasserstein distance (cf. Remark 2.2): for the Kullback-Leibler divergence to be finite (D(μ ∥ ν) < ∞), we necessarily have supp μ ⊂ supp ν, where the support of a measure is the smallest closed set with full measure (cf. Rüschendorf [19]). If ν is a continuous measure on X = R^d, then so is μ. If ν is a finite measure, then the support points of ν contain the support points of μ.
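For discrete measures, the divergence, its non-negativity, its asymmetry, and the support condition above can be checked directly. A minimal sketch (variable names are ours):

```python
import numpy as np

def kl_divergence(mu, nu):
    """D(mu || nu) = sum_i mu_i log(mu_i / nu_i) for discrete measures;
    finite only if supp mu is contained in supp nu."""
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    mask = mu > 0                      # terms with mu_i = 0 contribute 0
    if np.any(nu[mask] == 0):
        return np.inf                  # support condition violated
    return float(np.sum(mu[mask] * np.log(mu[mask] / nu[mask])))

p = [0.5, 0.5]
q = [0.9, 0.1]
```

The asymmetry D(p ∥ q) ≠ D(q ∥ p) is exactly why D is a divergence and not a metric.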

The smooth minimum
In what follows we present the smooth minimum in its general form, which comprises discrete and continuous measures. Numerical computations in the following sections rely on results for its discrete version. Therefore, we also address the special properties of the discrete version in detail.
Definition 2.4 (Smooth minimum). Let λ > 0 and Y be a random variable. The smooth minimum, or smooth minimum with respect to P, is
min_{P;λ}(Y) := −λ log E_P e^{−Y/λ},    (2.3)
provided that the expectation (integral) of e^{−Y/λ} is finite, and min_{P;λ}(Y) := −∞ if it is not finite. For λ = 0, we set
min_{P;λ=0}(Y) := ess inf Y.    (2.4)
For a σ-algebra G ⊂ F and λ > 0 measurable with respect to G, the conditional smooth minimum is
min_{P;λ}(Y | G) := −λ log E_P( e^{−Y/λ} | G ).
The following lemma relates the smooth minimum to the essential infimum (cf. (2.4)), that is, colloquially, the 'minimum' of a random variable. It also justifies the term smooth minimum.
Lemma 2.5. For λ > 0, it holds that
min_{P;λ}(Y) ≤ E_P Y,    (2.5)
and
ess inf Y ≤ min_{P;λ}(Y), with min_{P;λ}(Y) → ess inf Y as λ → 0.    (2.6)
Proof. The inequality (2.5) follows from Jensen's inequality, applied to the convex function y ↦ exp(−y/λ).
Next, the first inequality in the second display (2.6) follows from ess inf Y ≤ Y and the fact that all operations in (2.3) are monotonic. Finally, let y > ess inf Y. By Markov's inequality, we have
E_P e^{−Y/λ} ≥ e^{−y/λ} P(Y ≤ y),    (2.7)
which is a variant of Chernoff's bound. From inequality (2.7), it follows that
min_{P;λ}(Y) ≤ y − λ log P(Y ≤ y).
Since P(Y ≤ y) > 0, letting λ → 0 gives lim sup_{λ→0} min_{P;λ}(Y) ≤ y, where y is an arbitrary number with y > ess inf Y. This completes the proof. □
Remark 2.6 (Nesting property). The main properties of the smooth minimum include translation equivariance,
min_{P;λ}(Y + c) = min_{P;λ}(Y) + c, c ∈ R,
and positive homogeneity,
min_{P;tλ}(tY) = t · min_{P;λ}(Y), t > 0.
As a consequence of the tower property of the expectation, we have the nesting property
min_{P;λ}( min_{P;λ}(Y | G) ) = min_{P;λ}(Y),
provided that G is a sub-σ-algebra of F.
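The sandwich between the minimum and the mean, the convergence as λ → 0, and translation equivariance can all be verified numerically for the discrete smooth minimum. A sketch in Python (the stabilizing shift by the minimum is a standard log-sum-exp device, not part of the paper):

```python
import numpy as np

def smooth_min(y, p, lam):
    """Discrete smooth minimum: -lam * log( sum_i p_i * exp(-y_i / lam) ),
    evaluated in a numerically stable log-sum-exp form."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    m = y.min()                              # shift avoids underflow/overflow
    return float(m - lam * np.log(np.sum(p * np.exp(-(y - m) / lam))))

y = np.array([1.0, 2.0, 3.0])
p = np.full(3, 1.0 / 3.0)
```

For small λ the value is close to min(y); for every λ it lies between min(y) and the mean, and shifting y shifts the value by the same constant.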

Softmin function
The smooth minimum is related to the softmin function via its derivatives. In what follows, we express variants of these derivatives, which are needed later.
Definition 2.7 (Softmin function). For λ > 0 and a random variable Y with finite smooth minimum, the softmin function is the random variable
σ_λ(Y) := e^{−Y/λ} / E_P e^{−Y/λ} = exp( (min_{P;λ}(Y) − Y) / λ ),
where the latter equality is obvious with the definition of the smooth minimum in (2.3). The function σ_λ(Y) is also called the Gibbs density.
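For a discrete measure, the Gibbs density is a probability vector proportional to p_i · exp(−y_i/λ). A minimal sketch (illustrative Python, with the usual stabilizing shift):

```python
import numpy as np

def softmin(y, p, lam):
    """Discrete softmin / Gibbs density: weights proportional to
    p_i * exp(-y_i / lam), normalized to sum to one."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    w = p * np.exp(-(y - y.min()) / lam)     # shift avoids underflow
    return w / w.sum()

y = np.array([1.0, 2.0, 3.0])
p = np.full(3, 1.0 / 3.0)
```

As λ → 0 the density concentrates on the argmin of y; as λ → ∞ it flattens out towards the weights p.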

The derivative with respect to the probability measure
The definition of the smooth minimum in (2.3) does not require the measure P to be a probability measure. Based on (d/dt) log(f + t·h) = h/f (at t = 0) for the natural logarithm, the directional derivative of the smooth minimum in the direction of a measure Q is
(d/dt)|_{t=0} min_{P+tQ;λ}(Y) = −λ E_Q e^{−Y/λ} / E_P e^{−Y/λ} = −λ E_Q σ_λ(Y).    (2.9)
Note that −λ σ_λ(Y) is (up to the constant −λ) a Radon-Nikodým density in (2.9). The Gibbs density σ_λ(Y) thus is proportional to the directional derivative of the smooth minimum with respect to the underlying measure P.

The derivative with respect to the random variable
In what follows we shall also need the derivative of the smooth minimum with respect to its argument. With a similar reasoning as above, this is accomplished by
(d/dt)|_{t=0} min_{P;λ}(Y + tH) = E_P( H σ_λ(Y) ),
which involves the softmin function σ_λ(·) as well.

Regularized quantization
This section introduces the entropy-regularized optimal quantization problem along with its properties, and first recalls the standard optimal quantization problem. Standard quantization measures the quality of the approximation by the Wasserstein distance and considers the problem (cf. Graf and Luschgy [6])
inf { W_r(μ, P) : P ∈ P_n(X) },    (3.1)
where P_n(X) is the set of measures on X supported by not more than n (n ∈ N) points (cf. (3.2)). Soft quantization (or quantization, regularized with the Kullback-Leibler divergence) involves the regularized Wasserstein distance instead of (3.1). The soft quantization problem, regularized with the Kullback-Leibler divergence, is
inf_{P ∈ P_n(X)} inf_π { E_π d^r + λ D(π ∥ μ × P) },    (3.3)
where λ > 0 and E_π d^r = ∬_{X²} d(x, ξ)^r π(dx, dξ). The optimal measure P ∈ P_n(X) solving (3.3) depends on the regularization parameter λ.
In the following discussion, we initially investigate the regularized approximation, which also demonstrates the existence of the optimal approximation.

Approximation with inflexible marginal measures
The following proposition addresses the optimal approximation problem, regularized with the Kullback-Leibler divergence and fixed marginals. To this end, dissect the infimum in the soft quantization problem (3.3) as
inf_{P ∈ P_n(X)} inf_π { E_π d^r + λ D(π ∥ μ × P) },    (3.4)
where the marginals μ and P are fixed in the inner infimum.
The following Proposition 3.1 addresses this problem with a fixed bivariate distribution, which is the inner infimum in (3.4). Then, Proposition 3.6 reveals that the optimal marginals coincide in this case.
Proposition 3.1. Let μ be a probability measure and λ > 0. The inner optimization problem in (3.4) relative to the fixed bivariate distribution μ × P is given by the explicit formula
inf_π { E_π d^r + λ D(π ∥ μ × P) } = E_μ min_{P;λ} d(x, ξ)^r.    (3.5)
For λ = 0, this reduces to E_μ min_{ξ ∈ supp P} d(x, ξ)^r, which is the formula without regularization (cf. Pflug and Pichler [15]). Note that the preceding display explicitly involves the support supp P, while (3.5) only involves the expectation (via the smooth minimum) with respect to the measure P.
Proof of Proposition 3.1. It follows from the definition of the Kullback-Leibler divergence in (2.2) that it is enough to consider measures π which are absolutely continuous with respect to the product measure, π ≪ μ × P; otherwise, the objective is not finite. Hence, there is a Radon-Nikodým density Z such that, with Fubini's theorem,
π(A × B) = ∫_A ∫_B Z(x, ξ) P(dξ) μ(dx).
For the marginal constraint π(A × X) = μ(A) to be satisfied (cf. (2.1)), we have that
∫_A ∫_X Z(x, ξ) P(dξ) μ(dx) = μ(A)
for every measurable set A. It follows that ∫_X Z(x, ξ) P(dξ) = 1, μ(dx)-almost everywhere.
We conclude that every density of the form
Z(x, ξ) = e^{Φ(x, ξ)} / ∫_X e^{Φ(x, z)} P(dz)    (3.8)
satisfies the constraint (2.1), irrespective of Φ, and conversely, every Φ (via Z in (3.8)) defines a bivariate measure π satisfying the constraints (2.1). We set Φ(x, ξ) := log Z(x, ξ) (with the convention that log 0 = −∞ and exp(−∞) = 0, resp.) and consider the objective as a function of Φ. With that, the divergence is
D(π ∥ μ × P) = ∬_{X²} Z(x, ξ) log Z(x, ξ) P(dξ) μ(dx).
For the other term in the objective (3.3), we have
E_π d^r = ∬_{X²} d(x, ξ)^r Z(x, ξ) P(dξ) μ(dx).
Combining the last expressions, the objective in (3.5) is an integral over x of a functional in Φ(x, ·). For x fixed (x is simply suppressed in the following two displays to abbreviate the notation), consider this functional as a function of Φ, and set its directional derivative in direction h to zero. Finally, notice that the function Φ is completely arbitrary for the problem (3.5) involving the Wasserstein distance and the Kullback-Leibler divergence. As outlined above, for every measure π with finite divergence D(π ∥ μ × P), there is a density Z as considered above. With that, the assertion of Proposition 3.1 follows. □
Remark 3.4. The preceding proposition considers probability measures π with marginal π_1 = μ. The first marginal distribution is (trivially) absolutely continuous with respect to μ, π_1 ≪ μ, as π_1 = μ.
The second marginal π_2, however, is not specified. But for π to be feasible in (3.5), its Kullback-Leibler divergence with respect to μ × P is finite. There is hence a (non-negative) Radon-Nikodým density Z so that π(dx, dξ) = Z(x, ξ) μ(dx) P(dξ). It follows from Fubini's theorem that
π_2(dξ) = e(ξ) P(dξ),
where e(ξ) := ∫_X Z(x, ξ) μ(dx). The second marginal thus is absolutely continuous with respect to P, π_2 ≪ P.
Proposition 3.1 characterizes the objective of the quantization problem. Its proof, implicitly, reveals the marginal of the best approximation as well. The following lemma spells out the density of the marginal of the optimal measure with respect to P explicitly.
Proof. Recall from the proof of Proposition 3.1 the density
Z(x, ξ) = e^{−d(x, ξ)^r/λ} / E_P e^{−d(x, ξ)^r/λ}
of the optimal measure π relative to μ × P. From that we derive that
e(ξ) = ∫_X ( e^{−d(x, ξ)^r/λ} / E_P e^{−d(x, ξ)^r/λ} ) μ(dx)
is the density with respect to P, that is, dπ_2 = e dP (i.e., π_2(dξ) = e(ξ) P(dξ)). □

Approximation with flexible marginal measure
The following proposition reveals that the best approximation of a bivariate measure in terms of a product of independent measures is given by the product of its marginals. With that, it follows that the objectives in (3.4) do not deteriorate when passing to the marginals, which is the assertion. In case the measures are not absolutely continuous, the assertion (3.12) is trivial. □
Suppose now that π is a solution of the master problem (3.5) with some P. It follows from the preceding proposition that the objective (3.5) improves when replacing the initial P by the marginal of the optimal solution, P := π_2.

The relation of soft quantization and entropy
The soft quantization problem (3.5) involves the Kullback-Leibler divergence and not the entropy. The major advantage of the formulation presented above is that it works for discrete, continuous, or mixed measures, while entropy usually needs to be defined separately for discrete and continuous measures.

Soft tessellation
The quantization problem (3.4) consists in finding a good (in the best case, the optimal) approximation of a general probability measure μ on X by a simple, discrete measure P = Σ_{i=1}^n p_i δ_{ξ_i}. The problem thus consists in finding good weights p_1, ..., p_n as well as good locations ξ_1, ..., ξ_n. Quantization employs the Wasserstein distance to measure the quality of the approximation; soft quantization involves the regularized Wasserstein distance instead (as in (3.5)), where the measures on X supported by not more than n points (cf. (3.2)) are of the preceding form. We separate the problems of finding the best weights and the best locations. The following Section 4.1 addresses the problem of finding the optimal weights p, the subsequent Section 4.2 then the problem of finding the optimal locations ξ_1, ..., ξ_n. As well, we shall elaborate on the numerical advantages of soft quantization below.

Optimal weights
Proposition 3.1 above is formulated for general probability measures μ and P. The desired measure in quantization is a simple, discrete measure. To this end, recall that measures which are feasible for (3.5) have marginals π_2 with π_2 ≪ P by Remark 3.4. It follows that the support of the marginal is smaller than the support of P, that is, supp π_2 ⊂ supp P.
To unfold the result of Proposition 3.1 for discrete measures, we recall the smooth minimum and the softmin function for the discrete (empirical or uniform) measure P = Σ_{i=1}^n p_i δ_{ξ_i}. For this measure, the smooth minimum (2.3) explicitly is
min_{P;λ}(y) = −λ log Σ_{i=1}^n p_i e^{−y_i/λ}.
For λ = 1 and uniform weights p_1 = ... = p_n = 1/n, this quantity is occasionally referred to as LogSumExp. The softmin function (or Gibbs density (2.8)) is
σ_λ(y)_j = e^{−y_j/λ} / Σ_{i=1}^n p_i e^{−y_i/λ}.
It follows from Lemma 3.5 that the best approximating measure is π_2 = Σ_{j=1}^n e_j p_j δ_{ξ_j}, where the vector e of optimal weights, relative to P, is given explicitly by the expectations
e_j = E_μ σ_λ( d(x, ξ_•)^r )_j,
which involves computing expectations.
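The optimal weights above are expectations of softmin values, so they can be estimated by Monte-Carlo from samples of μ. A sketch under our notation (one-dimensional locations, r = 2, illustrative Python):

```python
import numpy as np

def optimal_weights(samples, xi, p, lam, r=2):
    """Monte-Carlo estimate of the optimal weights of the marginal pi_2:
    q_j = E_mu[ p_j * exp(-d(x, xi_j)^r / lam) / sum_i p_i * exp(-d(x, xi_i)^r / lam) ]."""
    q = np.zeros(len(xi))
    for x in samples:
        cost = np.abs(x - xi) ** r
        w = p * np.exp(-(cost - cost.min()) / lam)   # stabilized Gibbs weights
        q += w / w.sum()
    return q / len(samples)

samples = np.array([-2.0, -1.0, 1.0, 2.0])           # symmetric toy 'measure'
xi = np.array([-1.0, 1.0])
p = np.array([0.5, 0.5])
q = optimal_weights(samples, xi, p, lam=0.5)
```

By the symmetry of the toy sample and locations, the two weights come out equal.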

Soft tessellation
For λ = 0, the softmin function degenerates to an indicator of the closest quantizer. That is, the mapping x ↦ p_j · σ_λ( d(x, ξ_•)^r )_j can serve for classification, i.e., tessellation: the point x is associated with ξ_j if σ_λ(...)_j ≠ 0, and the corresponding regions are known as the Voronoi diagram.
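The hardening of the soft assignment into a Voronoi tessellation as λ decreases can be seen directly: for moderate λ the mass of a point is spread over several cells, while for λ near 0 essentially all mass sits on the nearest quantizer. A sketch (illustrative names, not the paper's code):

```python
import numpy as np

def soft_assignment(x, xi, p, lam, r=2):
    """Soft tessellation: probability that point x is associated with xi_j."""
    cost = np.abs(x - xi) ** r
    w = p * np.exp(-(cost - cost.min()) / lam)   # stabilized Gibbs weights
    return w / w.sum()

xi = np.array([-1.0, 0.0, 2.0])
p = np.full(3, 1.0 / 3.0)
a_soft = soft_assignment(0.4, xi, p, lam=1.0)    # mass spread over several cells
a_hard = soft_assignment(0.4, xi, p, lam=1e-4)   # essentially the Voronoi cell
```

With λ = 1e-4 the assignment is effectively the indicator of the Voronoi cell of the nearest point, ξ_2 = 0.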

Optimal locations
As a result of Proposition 3.1, the objective in (3.6) is an expectation. To identify the optimal support points ξ_1, ..., ξ_n, it is central to minimize
min_{ξ_1, ..., ξ_n} E_μ min_{P;λ} d(x, ξ)^r.    (4.1)
This is a stochastic, non-linear, and non-convex optimization problem.
Here, '•' denotes the Hadamard (element-wise) product, and p and d(x, ξ_•)^{r−1} are the vectors with entries p_j and d(x, ξ_j)^{r−1}, j = 1, ..., n. Algorithm 1 is a stochastic gradient algorithm to minimize (4.1), which collects the elements of the optimal weights and the optimal locations given in the preceding and this section.
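The stochastic gradient iteration can be sketched as follows for X = R, r = 2, and uniform weights: each step draws a sample x ~ μ and moves every location towards x, weighted by the softmin (Gibbs) weights. This is a simplified Python illustration of the idea behind Algorithm 1 (the paper's reference implementation is in Julia; step size, iteration count, and initialization below are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(2000)          # sample from mu = N(0, 1)

def soft_quantize(sample, n=4, lam=0.1, steps=2000, lr=0.05):
    """Stochastic gradient sketch for the locations (r = 2, p_j = 1/n):
    minimizes E_mu min_{P;lam} |x - xi|^2 over the locations xi."""
    xi = np.array(rng.choice(sample, n, replace=False))  # init at data points
    p = np.full(n, 1.0 / n)
    for _ in range(steps):
        x = sample[rng.integers(len(sample))]            # draw x ~ mu
        cost = (x - xi) ** 2
        w = p * np.exp(-(cost - cost.min()) / lam)       # softmin weights
        w /= w.sum()
        xi -= lr * w * 2.0 * (xi - x)    # gradient step on the smoothed objective
    return np.sort(xi)

xi_sharp = soft_quantize(data, lam=0.1)   # near the unregularized quantizers
xi_flat = soft_quantize(data, lam=50.0)   # collapses towards the mean (Theorem 1.1)
```

The comparison of the two runs illustrates the collapse predicted by Theorem 1.1: a large λ shrinks the spread of the quantizers towards the center of the measure.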
Example 4.1. To provide an example for the gradient of the distance function in (4.3) ((4.4), resp.), consider the derivative of the weighted norm.

Quantization with large regularization parameters
The entropy in (3.14) is minimal for the Dirac measure P = δ_z (where z is any point in X): in this case, the entropy is 1 · log 1 = 0, while it is positive for any other measure. For larger values of λ, the objective in (3.16), and thus the objective of the master problem (3.2), supposedly gives preference to measures with fewer points. This is indeed the case, as Theorem 1.1 (above) states. We give its proof below, after formally defining the center of the measure.
Definition 4.2 (Center of the measure). Let μ be a probability measure on X and d be a distance on X. The point c ∈ X is a center of the measure μ with respect to the distance d if
E_μ d(c, ·)^r = inf_{z ∈ X} E_μ d(z, ·)^r,
provided that E_μ d(z_0, ·)^r < ∞ for some (and thus any) z_0 ∈ X and r ≥ 1.
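For d(z, x) = |z − x| and r = 2 on X = R, the center is the mean of the measure, which a simple Monte-Carlo grid search confirms (an illustrative sketch; the grid and sample setup are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.exponential(1.0, 4000)       # a skewed measure on R, mean 1

def avg_cost(c, x, r=2):
    """Monte-Carlo estimate of E_mu d(c, X)^r with d(c, x) = |c - x|."""
    return float(np.mean(np.abs(c - x) ** r))

grid = np.linspace(0.0, 3.0, 301)         # candidate centers, spacing 0.01
center = grid[int(np.argmin([avg_cost(c, sample) for c in grid]))]
```

For r = 2 the minimizer of the empirical objective is exactly the sample mean, so the grid search lands on the grid point closest to it.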
In what follows, we demonstrate that the regularized quantization problem (3.16) links the optimal quantization problem and the center of the measure.
This matrix Σ is positive definite (as Σ_{i=1}^n p_i = 1) and 0 ≤ Σ ≤ 1 in Loewner order (indeed, Σ is the covariance matrix of the multinomial distribution). It follows that the first term in (4.5) is O(1), while the second is O(1/λ), so that (4.5) is positive definite for 1/λ sufficiently small. That is, the extremal point ξ_j = c is a minimum. In particular, there exists λ_0 > 0 such that (4.5) is positive definite for every λ > λ_0, and hence the result. □

Numerical illustration
This section presents numerical findings for the approaches and methods discussed earlier. The Julia implementations for these methods are available online. In the following experiments, we approximate the measure μ by a finite, discrete measure P using the stochastic gradient algorithm, Algorithm 1.

One dimension
First, we perform the analysis in one dimension. In this experiment, our problem of interest is to find entropy-regularized optimal quantizers for μ ∼ N(0, 1) and μ ∼ Exp(1) (the normal and the exponential distribution with standard parameters). To highlight the characteristics, we consider only n = 8 quantizers.
Figure 1 illustrates the results of soft quantization of the standard normal distribution and the exponential distribution. It is apparent that when λ is increased beyond a certain threshold (cf. Theorem 1.1), the quantizers converge towards the center of the measure (i.e., the mean), while for smaller values of λ, the quantizers identify the actual optimal locations with greater accuracy. Furthermore, we want to emphasize that our proposed method is capable of identifying the mean location regardless of the shape of the distribution, which this experiment empirically substantiates. In order to better understand the dissemination of weights (probabilities) and their respective positions, the following examination involves the calculation of the cumulative distribution function. Additionally, we consider μ ∼ Γ(2, 2) (Gamma distribution) as the problem of interest, a notably distinct scenario in terms of shape compared to the measures previously examined.
Figure 2 provides the results. It is evident that as λ increases, the number of distinct quantizers decreases. When λ reaches a specific threshold, such as λ = 20 in our case, all quantizers converge towards the center of the measure, represented by the mean (i.e., 4).

Two dimensions
Next, we demonstrate the behavior of entropy-regularized optimal quantization for a range of λ in two dimensions. In the following experiment, we consider μ ∼ U(0, 1) × U(0, 1) (the uniform distribution on the unit square). Once again, we consider the uniform distribution as the problem of interest in the subsequent experiment, this time employing n = 16 quantizers for enhanced comprehension. Figure 4 encapsulates the essence of the experiment, offering an extensive visual representation. In contrast to the previous experiment, we observe that for regularization values of λ = 0.037 and λ = 0.1, the quantizers assemble at the nearest strong points (in terms of high probability) rather than converging towards the center of the measure (see Subplots 4b and 4c). Subsequently, for larger λ, they move from these strong points towards the center, where they form a diagonal alignment before collapsing (see Subplot 4d). More concisely, when λ = 0, we achieve the genuine quantization solution (see Subplot 4a). As λ increases, quantizers with lower probabilities converge towards those with the nearest higher probabilities. Subsequently, all quantizers converge towards the center of the measure, represented by the mean of the respective measure.
Thus far, we have conducted two-dimensional experiments employing various numbers of quantizers (n = 4 and n = 16) with the uniform distribution. Now, we delve into the complexity of a multivariate normal distribution, aiming to enhance comprehension. More precisely, our problem of interest is to find the soft quantization for μ ∼ N(m, Σ), where m = (0, 0)ᵀ and Σ = ( 3 1 ; 1 3 ). In this endeavor, we employ more quantizers, specifically n = 100. Figure 5 captures the core of the experiment, delivering a comprehensive and visually illustrative representation. From the experiment it becomes evident that, as λ increases, an initial diagonal alignment precedes convergence towards the center of the measure. Additionally, we observe a noticeable shift of points with lower probabilities towards those with higher probabilities. Furthermore, this experiment highlights that the threshold of λ for achieving convergence or diagonal alignment at the center of the measure depends on the number of quantizers employed.

Summary
The softmin function is frequently used for classification in a maximum likelihood framework.

Figure 1: Soft quantization of measures on R with varying regularization parameter λ, with 8 quantization points

Figure 2: Soft quantization of the Gamma distribution on R with varying regularization parameter λ; the approximating measure simplifies as λ increases

Figure 3: Two dimensions: soft quantization of the uniform distribution on R² with varying regularization parameter λ, with 4 quantizers
Further, the infimum in (3.5) is attained. Remark 3.2. The notation in (3.6) ((3.7) below, resp.) is chosen to reflect the explicit expression (3.5): while the soft minimum min_{P;λ} is with respect to the measure P, which is associated with the variable ξ, the expectation E_μ is with respect to μ, and its associated variable is x (that is, the variable x in (3.6) is associated with μ, the variable ξ with P).