Rate-Distortion Bounds for Kernel-Based Distortion Measures

Kernel methods have been used for turning linear learning algorithms into nonlinear ones. These nonlinear algorithms measure distances between data points by the distance in the kernel-induced feature space. In lossy data compression, the optimal tradeoff between the number of quantized points and the incurred distortion is characterized by the rate-distortion function. However, the rate-distortion functions associated with distortion measures involving kernel feature mapping have yet to be analyzed. We consider two reconstruction schemes, reconstruction in input space and reconstruction in feature space, and provide bounds to the rate-distortion functions for these schemes. Comparison of the derived bounds to the quantizer performance obtained by the kernel K-means method suggests that the rate-distortion bounds for input space and feature space reconstructions are informative at low and high distortion levels, respectively.


Introduction
Kernel methods have been widely used for nonlinear learning problems in combination with linear learning algorithms such as the support vector machine and principal component analysis [1]. By the so-called kernel trick, kernel-based methods can use linear learning methods in the kernel-induced feature space without explicitly computing the high-dimensional feature mapping. Kernel-based methods measure the dissimilarity between data points by the distance in the feature space, which, in input space, corresponds to a distance measure involving the feature mapping [2]. If a kernel-based learning method is used as a lossy source coding scheme, its optimal rate-distortion tradeoff is indicated by the rate-distortion function associated with the distortion measure defined by the kernel feature map [3]. Successful applications of kernel methods in learning problems and the flexibility to create various distance measures suggest that kernel-based distortion measures can be suitable for certain lossy compression problems. However, the rate-distortion function of such a distortion measure has yet to be evaluated analytically. Although there are several kernel-based approaches to vector quantization [4,5], their rate-distortion tradeoffs are still unknown.
In this paper, we derive bounds for the rate-distortion functions for kernel-based distortion measures. We consider two schemes to reconstruct inputs in lossy coding methods. One is to obtain a reconstruction in the original input space. Since kernel methods usually yield results of learning as a linear combination of vectors in feature space, we need an additional step to obtain the reconstruction in input space, such as preimaging [6]. The other is to consider the linear combination of feature vectors as the reconstruction and measure the distortion in the feature space directly. We formulate the two reconstruction schemes (Sections 3.1 and 3.2), and prove that the rate-distortion function of input space reconstruction provides an upper bound of that of feature space reconstruction (Section 3.3). We derive lower and upper bounds to the rate-distortion function of input space reconstruction, which are computable by one-dimensional numerical integrations alone in the case of translation invariant and isotropic kernel functions (Sections 4.1 and 4.2). We also provide an upper bound to the rate-distortion function of feature space reconstruction for general positive definite kernel functions (Section 4.4). In the usual applications of kernel-based quantization algorithms, one fixes the rate by determining the number of quantized points, and minimizes the average distortion for training data. The distortion-rate function, which is the inverse function of the rate-distortion function, shows the minimum achievable expected distortion (or distortion for test data) at the fixed rate. The derived bounds approximately characterize such optimal tradeoffs between the rate and expected distortion.
Furthermore, we design a vector quantizer using the kernel K-means method and compare its performance with the derived rate-distortion bounds (Section 5). We also compute the preimages of the quantized points in feature space to investigate the performance of the quantizer in input space. The experiments using synthetic and image data suggest that the rate-distortion bounds of reconstruction in input space are accurate at low distortion levels while the upper bound for reconstruction in feature space is informative at high distortion levels.

Rate-Distortion Function
Let $X$ and $Y$ be random variables of input and reconstruction taking values in $\mathcal{X}$ and $\mathcal{Y}$, respectively. For the non-negative distortion measure between $x$ and $y$, $d(x, y)$, the rate-distortion function $R(D)$ of the source $X \sim p(x)$ is defined by

$$R(D) = \inf_{q(y|x):\, E[d(X,Y)] \le D} I(q), \tag{1}$$

where $I(q) = I(X; Y)$ is the mutual information and $E$ denotes the expectation with respect to $q(y|x)p(x)$. $R(D)$ gives the minimum achievable rate $R$ at expected distortion at most $D$ under the given distortion measure $d$ [3,7]. The distortion-rate function is the inverse function of the rate-distortion function and is denoted by $D(R)$.
If the conditional distributions $q_s(y|x)$ achieve the minimum of the following Lagrange functional parameterized by $s \ge 0$,

$$I(q) + s E[d(X, Y)], \tag{2}$$

then the rate-distortion function is parametrically given by

$$R(D_s) = I(q_s), \qquad D_s = E[d(X, Y)],$$

where the expectation in $D_s$ is taken with respect to $q_s(y|x)p(x)$. The parameter $s$ corresponds to the (negated) slope of the tangent of $R(D)$ at $(D_s, R(D_s))$ and hence is referred to as the slope parameter [3]. Alternatively, if there exists a marginal reconstruction density $q_s(y)$ that minimizes the functional

$$-E\left[\log \int e^{-sd(X,y)} q(y)\,dy\right],$$

then the optimal conditional reconstruction distributions are given by

$$q_s(y|x) = \frac{e^{-sd(x,y)} q_s(y)}{\int e^{-sd(x,y)} q_s(y)\,dy}$$

(see, for example, [3,8]).
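The parametric representation above can be illustrated numerically. The following is a minimal sketch, assuming a finite-alphabet source (not the continuous setting of this paper) and using the standard Blahut-Arimoto alternating iteration, which is swapped in here purely for illustration. The function name `blahut_arimoto` and its arguments are hypothetical.

```python
import math


def blahut_arimoto(p, d, s, iters=200):
    """Return (D_s, R(D_s) in nats) for a discrete source p, distortion matrix d, slope s."""
    nx, ny = len(p), len(d[0])
    q = [1.0 / ny] * ny  # initial reconstruction marginal q(y)
    cond = []
    for _ in range(iters):
        # Conditional step: q(y|x) proportional to q(y) * exp(-s d(x, y)).
        cond = []
        for x in range(nx):
            w = [q[y] * math.exp(-s * d[x][y]) for y in range(ny)]
            z = sum(w)
            cond.append([wi / z for wi in w])
        # Marginal step: q(y) = sum_x p(x) q(y|x).
        q = [sum(p[x] * cond[x][y] for x in range(nx)) for y in range(ny)]
    # Expected distortion and mutual information under p(x) q(y|x).
    dist = sum(p[x] * cond[x][y] * d[x][y] for x in range(nx) for y in range(ny))
    rate = sum(p[x] * cond[x][y] * math.log(cond[x][y] / q[y])
               for x in range(nx) for y in range(ny) if cond[x][y] > 0.0)
    return dist, rate


# Uniform binary source with Hamming distortion at slope s = 2.
D_s, R_s = blahut_arimoto([0.5, 0.5], [[0.0, 1.0], [1.0, 0.0]], s=2.0)
```

For the uniform binary source with Hamming distortion, the pair $(D_s, R_s)$ should lie on the known curve $R(D) = \log 2 - H_b(D)$ (in nats), where $H_b$ is the binary entropy function.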
From the properties of the rate-distortion function $R(D)$, we know that $R(D) > 0$ for $0 < D < D_{\max}$, where

$$D_{\max} = \inf_{y} \int d(x, y) p(x)\,dx, \tag{3}$$

and $R(D) = 0$ for $D \ge D_{\max}$ [3] (p. 90). Hence, $D_{\max} = \lim_{R \to 0} D(R)$.

Kernel-Based Distortion Measures
In kernel-based learning methods, data points in input space $\mathcal{X}$ are mapped into some high-dimensional feature space $\mathcal{H}$ by a feature mapping $\phi$. Then, the similarity between the two points $x$ and $y$ in $\mathcal{X}$ is measured by the inner product $\langle \phi(x), \phi(y) \rangle$ in $\mathcal{H}$.
The inner product is directly evaluated by a nonlinear function in input space, which is called the kernel function,

$$K(x, y) = \langle \phi(x), \phi(y) \rangle. \tag{4}$$

Mercer's theorem ensures that there exists some $\phi$ such that Equation (4) holds if $K$ is a positive definite kernel [1]. This enables us to avoid explicitly computing the feature map $\phi$ in the potentially high-dimensional space $\mathcal{H}$, which is called the kernel trick. Many learning methods that can be expressed using only the inner products between data points have been kernelized [1]. We identify the feature space $\mathcal{H}$ with the reproducing kernel Hilbert space (RKHS) associated with the kernel function $K$ by the canonical feature map, $\phi(x) = K(\cdot, x)$ [9] (Lemma 4.19). We assume that the input space $\mathcal{X}$ is a subset of $\mathbb{R}^m$, and the kernel function $K$ is continuous [9] (Lemma 4.29). We focus on the squared norm in feature space as the distortion measure, and consider two reconstruction schemes in the following respective subsections.

Reconstruction in Input Space
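If we restrict ourselves to the reconstruction in input space, that is, the reconstruction $y \in \mathcal{X} \subset \mathbb{R}^m$ is computed for each input $x \in \mathcal{X}$, the distortion measure is naturally defined by

$$d_{\mathrm{inp}}(x, y) = \|\phi(x) - \phi(y)\|^2 = K(x, x) + K(y, y) - 2K(x, y). \tag{5}$$

Note that the reconstruction $\phi(y)$ of $\phi(x)$ is restricted to the subset of the feature space, $\{\phi(y);\, y \in \mathcal{X}\}$. To obtain a reconstruction in input space, we need a technique such as preimaging [6]. This is a difference distortion measure if and only if the kernel function is translation invariant, that is, $K(x + a, y + a) = K(x, y)$ for any $a \in \mathcal{X}$. In this case, the distortion measure is expressed as

$$d_{\mathrm{inp}}(x, y) = \rho(x - y), \tag{6}$$

where $\rho(z) = 2(C - K(z, 0))$ and $C = K(0, 0)$. The rate-distortion function (distortion-rate function, resp.) for this distortion measure is denoted by $R_{\mathrm{inp}}(D)$ ($D_{\mathrm{inp}}(R)$, resp.) and the maximum distortion $D_{\max}$ in Equation (3) is denoted by $D_{\max,\mathrm{inp}}$, that is,

$$D_{\max,\mathrm{inp}} = \inf_{y \in \mathcal{X}} \int d_{\mathrm{inp}}(x, y) p(x)\,dx, \tag{7}$$

which is, in the translation invariant case,

$$D_{\max,\mathrm{inp}} = 2\left(C - \sup_{y \in \mathcal{X}} \int K(x, y) p(x)\,dx\right).$$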
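As a small illustration of the input space distortion, the following sketch evaluates the squared feature-space distance $K(x, x) + K(y, y) - 2K(x, y)$ through kernel evaluations only and compares it with the difference form $\rho(x - y)$. It assumes a Gaussian kernel with an arbitrarily chosen parameter; the helper names `gaussian_kernel`, `d_inp`, and `rho` are hypothetical.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - y||^2); gamma = 1.0 is an arbitrary choice."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def d_inp(x, y, kernel=gaussian_kernel):
    """Squared feature-space distance between phi(x) and phi(y) via the kernel trick."""
    return kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y)


def rho(z, kernel=gaussian_kernel):
    """Difference form rho(z) = 2 (C - K(z, 0)) with C = K(0, 0)."""
    zero = [0.0] * len(z)
    return 2.0 * (kernel(zero, zero) - kernel(z, zero))


x, y = [0.3, -1.2], [1.0, 0.5]
z = [a - b for a, b in zip(x, y)]
print(d_inp(x, y), rho(z))  # the two values agree for a translation invariant kernel
```

Since the Gaussian kernel satisfies $0 < K(x, y) \le 1$, the distortion stays within $[0, 2)$, which is why this measure behaves very differently from the unbounded squared distance at large $\|x - y\|$.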

Reconstruction in Feature Space
Suppose we have a sample of size $n$ in input space, $S = \{x_1, ..., x_n\}$, so that $\{\phi(x_1), ..., \phi(x_n)\}$ spans a linear subspace in feature space. If we compute the reconstruction by the linear combination $\sum_{i=1}^{n} \alpha_i \phi(x_i)$, $\alpha_i \in \mathbb{R}$, $i = 1, ..., n$, and consider it as the reconstruction in feature space, the distortion can be measured by

$$d_{\mathrm{fea}}(x, \alpha) = \left\|\phi(x) - \sum_{i=1}^{n} \alpha_i \phi(x_i)\right\|^2 = K(x, x) - 2\alpha^{T} k(x) + \alpha^{T} K \alpha,$$

where $\alpha = (\alpha_1, ..., \alpha_n)^{T} \in \mathbb{R}^n$, $k(x) = (K(x_1, x), ..., K(x_n, x))^{T}$, and $K = (K(x_i, x_j))_{ij}$ is the Gram matrix. Note that the reconstruction is identified with the coefficients $\alpha$, whose domain is not identical to the input space $\mathcal{X}$. Although the distortion measure $d_{\mathrm{fea}}$ depends on the sample $S$, we omit the dependence in the notation since we consider a fixed design of $S$ for a sufficiently large $n$. The sample does not have to be distributed according to the source distribution, but it is required to cover the support of the source.
The rate-distortion function (distortion-rate function, resp.) for this distortion measure is denoted by $R_{\mathrm{fea}}(D)$ ($D_{\mathrm{fea}}(R)$, resp.) and the maximum distortion $D_{\max}$ in Equation (3) is given by

$$D_{\max,\mathrm{fea}} = E[K(X, X)] - E[k(X)]^{T} K^{-1} E[k(X)], \tag{9}$$

which is derived from the direct minimization of the quadratic function of $\alpha$, $\int d_{\mathrm{fea}}(x, \alpha) p(x)\,dx$.
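The feature space distortion can be evaluated from kernel values alone. The sketch below is a minimal illustration under assumed settings (Gaussian kernel, a three-point sample chosen arbitrarily); the function names are hypothetical. It expands the squared norm of $\phi(x) - \sum_i \alpha_i \phi(x_i)$ into the quadratic form in $\alpha$ and checks that a unit coefficient vector recovers the input space distortion to the corresponding sample point.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - y||^2); gamma = 1.0 is an arbitrary choice."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def d_fea(x, alpha, sample, kernel=gaussian_kernel):
    """Squared feature-space distance between phi(x) and sum_i alpha_i phi(x_i)."""
    n = len(sample)
    k_x = [kernel(xi, x) for xi in sample]                      # vector k(x)
    gram = [[kernel(xi, xj) for xj in sample] for xi in sample]  # Gram matrix K
    cross = sum(a * k for a, k in zip(alpha, k_x))               # alpha^T k(x)
    quad = sum(alpha[i] * gram[i][j] * alpha[j]
               for i in range(n) for j in range(n))              # alpha^T K alpha
    return kernel(x, x) - 2.0 * cross + quad


sample = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # illustrative sample S
x = [0.4, 0.2]
unit = [0.0, 1.0, 0.0]  # reconstruction equal to phi(x_2)
single_point = (gaussian_kernel(x, x) + gaussian_kernel(sample[1], sample[1])
                - 2.0 * gaussian_kernel(sample[1], x))
print(d_fea(x, unit, sample), single_point)
```

Restricting $\alpha$ to unit vectors thus reproduces reconstructions of the form $\phi(x_i)$, while general $\alpha$ allows reconstructions outside the image of the feature map.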

$R_{\mathrm{inp}}(D)$ and $R_{\mathrm{fea}}(D)$
The following theorem claims that $R_{\mathrm{inp}}(D)$ provides an upper bound of $R_{\mathrm{fea}}(D)$ when $n$ is sufficiently large.

Theorem 1. If the input space $\mathcal{X}$ is bounded, and there exists a conditional density achieving the infimum in the definition of $R_{\mathrm{inp}}(D)$, then for any $\varepsilon > 0$, $D \ge \varepsilon$, and sufficiently large $n$, the following inequality holds:

$$R_{\mathrm{fea}}(D) \le R_{\mathrm{inp}}(D - \varepsilon).$$

The proof is given in Appendix A. This theorem shows that the feature space reconstruction gives better rates since a single feature vector $\phi(y)$ can be approximated by a linear combination $\sum_{i=1}^{n} \alpha_i \phi(x_i)$ when $n$ is sufficiently large.

Rate-Distortion Bounds
Since the rate-distortion problem (Section 2) is rarely solved in closed form [8], we derive bounds to $R_{\mathrm{inp}}(D)$ and $R_{\mathrm{fea}}(D)$.

Lower Bound to $R_{\mathrm{inp}}(D)$
Although the Shannon lower bound to $R(D)$ is defined for difference distortion measures in general [3] (p. 92), it diverges to $-\infty$ for the distortion measure in Equation (6) since $\int e^{-s\rho(z)}\,dz$ diverges to $\infty$. Hence, we consider an improved lower bound, which was introduced by [3], where $h$ denotes the differential entropy, and $u$ is the step function. $\mathcal{G}_{B,D}$ is the set of all probability densities $g(\cdot)$ for which $g(x) = 0$ for $\|x\| > B$ and $\int \rho(z) g(z)\,dz \le D/Q_B$.
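In the case of the distortion measure in Equation (6), the maximum in Equation (10) is attained by the density

$$g_s(z) = \frac{e^{2sK(z,0)}}{C_{B,s}} \quad \text{for } \|z\| \le B, \qquad g_s(z) = 0 \quad \text{otherwise}, \tag{12}$$

where $C_{B,s} = \int_{\|z\| \le B} e^{2sK(z,0)}\,dz$ for $s$ related to $D$ by $\int \rho(z) g_s(z)\,dz = D/Q_B$. Since its differential entropy is

$$h(g_s) = \log C_{B,s} - 2s \int_{\|z\| \le B} K(z, 0) g_s(z)\,dz, \tag{13}$$

we arrive at the following theorem.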
Theorem 2. The rate-distortion function $R_{\mathrm{inp}}(D)$ is parametrically lower-bounded as

If we further assume that the kernel function is radial, that is, $K(x, y) = K(x - y, 0) = k(\|x - y\|)$ for some function $k$, the integrations above reduce to one-dimensional ones, for example, $C_{B,s} = A(m) \int_0^B e^{2sk(r)} r^{m-1}\,dr$, where $A(m) = 2\pi^{m/2}/\Gamma(m/2)$ is the area of the $m$-dimensional unit sphere, and $\Gamma$ is the gamma function.
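The reduction to one-dimensional integrals for a radial kernel can be sketched numerically. The code below is an illustrative sketch, not the implementation used in this paper: it assumes the Gaussian kernel $k(r) = e^{-\gamma r^2}$ (so that $C = 1$) and $Q_B = 1$, and applies the trapezoidal rule in polar coordinates to the density $g_s \propto e^{-s\rho}$ on the ball of radius $B$. The factor $e^{2s}$ is pulled out of the integrand for numerical stability. The function name `radial_quantities` and its default arguments are hypothetical.

```python
import math


def radial_quantities(s, m, B, gamma=1.0, steps=20000):
    """Return (log C_{B,s}, D_s, h(g_s)) for k(r) = exp(-gamma r^2), assuming Q_B = 1."""
    area = 2.0 * math.pi ** (m / 2.0) / math.gamma(m / 2.0)  # area of the unit sphere in R^m
    dr = B / steps
    w_int = 0.0    # integral of exp(2s (k(r) - 1)) r^(m-1) over [0, B]
    rho_int = 0.0  # integral of rho(r) times the same weight
    for i in range(steps + 1):
        r = i * dr
        k = math.exp(-gamma * r * r)
        w = math.exp(2.0 * s * (k - 1.0)) * r ** (m - 1)
        trap = 0.5 if i in (0, steps) else 1.0  # trapezoidal end-point weights
        w_int += trap * w * dr
        rho_int += trap * 2.0 * (1.0 - k) * w * dr
    log_c = math.log(area) + 2.0 * s + math.log(w_int)
    d_s = rho_int / w_int                             # expected distortion under g_s
    h = math.log(area) + math.log(w_int) + s * d_s    # log C_{B,s} - 2s E[k] with C = 1
    return log_c, d_s, h


log_c0, d0, h0 = radial_quantities(s=0.0, m=2, B=1.0)
log_c1, d1, h1 = radial_quantities(s=1.0, m=2, B=1.0)
log_c2, d2, h2 = radial_quantities(s=10.0, m=2, B=1.0)
print(d0, d1, d2)  # the distortion decreases as the slope parameter grows
```

At $s = 0$ the density is uniform on the ball, so the entropy should equal the logarithm of the ball volume, which gives a simple sanity check of the integration.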

Upper Bound to $R_{\mathrm{inp}}(D)$
If $d_{\mathrm{inp}}$ in Equation (5) is a difference distortion measure, that is, $K$ is translation invariant, by choosing $q(y|x) = g_s(y - x)$ for the density $g_s$ in Equation (12), the following upper bound is obtained,

$$R_{\mathrm{inp}}(D_s) \le h(g_s * p) - h(g_s), \qquad D_s = \int \rho(z) g_s(z)\,dz,$$

where $h(g_s)$ is given by Equation (13) and $(g_s * p)(y) = \int g_s(y - x) p(x)\,dx$ is the convolution between $g_s$ and $p$. This type of upper bound was used to prove the asymptotic tightness of the Shannon lower bound (as $D \to 0$) for a class of general sources and distortion measures [3,10-12]. However, this upper bound requires the evaluation of the differential entropy of the convolution.
The following theorem is derived from the facts that the spherical Gaussian distribution maximizes the entropy under the constraint that $E[\|X\|^2]$ is no greater than a constant, and that and $D_s$ is given by Equation (17) (and Equation (15)).

Rate-Distortion Dimension
In this section, we evaluate the rate-distortion dimension [13] of the kernel-based distortion measure in Equation (5) to investigate its property. We focus on the radial kernel, $K(x, y) = k(\|x - y\|)$, also in this section, and assume that

$$\lim_{r \to 0} \frac{C - k(r)}{r^{\alpha}} = \beta \tag{19}$$

holds for some $\alpha > 0$ and $\beta > 0$. For example, the Gaussian kernel, $k(r) = \exp(-\gamma r^2)$ ($\gamma > 0$), satisfies Equation (19) for $\alpha = 2$ and $\beta = \gamma$.
To examine the limit $D \to 0$ of $R_{\mathrm{inp}}(D)$, we consider the asymptotic case of $s \to \infty$. Since Thus, we have from Equations (14) and (17), for both the lower and upper bounds, and from Equation (13), Since $d_{\mathrm{inp}}$ in Equation (5) is a norm squared for a valid RKHS kernel $K$, the rate-distortion dimension of the source distribution $p$ is defined by [13],

$$\lim_{D \to 0} \frac{R_{\mathrm{inp}}(D)}{-\frac{1}{2}\log D}. \tag{21}$$

From Theorems 2 and 3 and Equation (20), we conclude the following.

Theorem 4. If the source has a finite differential entropy, positive and finite $v_p$ defined in Equation (18), and a bounded support, that is, there exists a finite $B > 0$ such that $Q_B = 1$ in Equation (11), and the radial kernel, $K(x, y) = k(\|x - y\|)$, satisfies Equation (19) for $\alpha > 0$ and $\beta > 0$, then the rate-distortion dimension in Equation (21) is given by

$$\frac{2m}{\alpha}. \tag{22}$$

This theorem shows that the rate-distortion dimension depends only on the dimensionality of the input space and is independent of the dimensionality of the feature space. In the case of the linear kernel, $K(x, y) = \langle x, y \rangle$, with $\phi(x) = x$, the distortion measure in Equation (5) reduces to the usual squared distortion measure, $\|x - y\|^2$. It can be shown that under norm-based distortion measures including the squared distortion measure, the rate-distortion dimension of a source with an $m$-dimensional density is $m$ [11,12]. From the preceding theorem, this is also the case for a general radial kernel if the kernel function has the order $\alpha = 2$, as the Gaussian kernel does. Expression (22) of the rate-distortion dimension will be examined through a numerical experiment in Section 5.1.

Upper Bound to $R_{\mathrm{fea}}(D)$
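We construct an upper bound to the rate-distortion function $R_{\mathrm{fea}}(D)$. We choose the conditional distribution of the reconstruction by

$$q(\alpha|x) = N\left(\alpha;\, (K + cI)^{-1} k(x),\, \frac{1}{2s}(K + cI)^{-1}\right), \tag{23}$$

where $N(\cdot; m, \Sigma)$ denotes the $n$-dimensional normal density with mean $m$ and covariance matrix $\Sigma$. Here, we have introduced the regularization constant $c \ge 0$ with the $n \times n$ identity matrix $I$. The conditional distribution in Equation (23) is implied by Equation (2) and the approximation $q_s(\alpha) = N(\alpha; 0, I/(2sc))$. This reconstruction distribution yields the following upper bound:

$$R_{\mathrm{fea}}(D_s) \le h(M_p) - h(\alpha|x),$$

where $M_p(\alpha) = \int q(\alpha|x) p(x)\,dx$ is the marginal density of the reconstruction,

$$h(\alpha|x) = \frac{n}{2}\log\frac{\pi e}{s} - \frac{1}{2}\log\det(K + cI)$$

is the differential entropy of $q(\alpha|x)$, which is independent of the input $x$, and

$$D_s = D_{\min} + \frac{1}{2s}\mathrm{tr}\left\{K(K + cI)^{-1}\right\}, \qquad D_{\min} = E\left[d_{\mathrm{fea}}\left(X, (K + cI)^{-1} k(X)\right)\right].$$

If $c = 0$, $D_{\min}$ is the mean of the variance of the prediction by the associated Gaussian process [14]. Further upper-bounding the differential entropy $h(M_p)$ by the Gaussian entropy, we have the following theorem.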
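Theorem 5. The rate-distortion function $R_{\mathrm{fea}}(D)$ is upper-bounded as

$$R_{\mathrm{fea}}(D_s) \le R_{\mathrm{fea},G}(D_s) = \frac{1}{2}\log\det\left(I + 2s(K + cI)^{-1} V_p\right), \tag{27}$$

where $V_p$ is the covariance matrix of $k(X)$ under the source $p$ and $D_s$ is the expected distortion under the reconstruction distribution in Equation (23).

The proof is put in Appendix B. In the simplest case where $\phi(x) = x \in \mathbb{R}^1$, $n = 1$, and the source is Gaussian, $p(x) = N(x; 0, \sigma^2)$, the upper bound in Equation (27) reduces to

$$R_{\mathrm{fea},G}(D) = \frac{1}{2}\log\left(1 + \frac{\sigma^2}{D}\right),$$

which is an asymptotically (as $D \to 0$) tight upper bound of the well-known rate-distortion function for the Gaussian source under the squared distortion measure, $R(D) = \frac{1}{2}\log\frac{\sigma^2}{D}$ [3,7].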

Experimental Evaluation
We numerically evaluate the rate-distortion bounds obtained in the previous section. Designing a quantizer by the kernel K-means algorithm, we compare its performance with the bounds.
We focus on the case of the Gaussian kernel,

$$K(x, y) = \exp\left(-\gamma \|x - y\|^2\right), \tag{29}$$

with the kernel parameter $\gamma > 0$.

Synthetic Data
As a source, we first assumed the uniform distribution on the union of two regions, $C_1$ and $C_2$. We used the trapezoidal rule to compute the one-dimensional integrations in the lower bound $R_{\mathrm{inp},L}$ and the upper bound $R_{\mathrm{inp},G}$. We generated an i.i.d. sample of size $n = 200$ from the source to compute $k(x)$ and $K$ for $R_{\mathrm{fea},G}$ in Equation (27). Generating another 4000 data points, we approximated the required expectations. We optimized the regularization coefficient $c$ to minimize the upper bound $R_{\mathrm{fea},G}$ for each $D$.
Using the same data set of size 4000 as a training data set, we ran the kernel K-means algorithm 10 times with random initializations to obtain the minimum distortion for each rate. Varying the number $K$ of quantized points from $2^1$ to $2^{10}$, for each $K$, we counted the effective number $K_{\mathrm{eff}}$ of quantized points which have at least one assigned data point and computed the rate by $\log_2 K_{\mathrm{eff}}$, as the quantizer is first order, that is, the block length is one. The kernel parameter $\gamma$ was chosen so that a clear separation of $C_1$ and $C_2$ is obtained when $K = 2$.
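The quantizer design described above can be sketched as follows. This is a minimal, unoptimized illustration of kernel K-means with a single random initialization, assuming a Gaussian kernel and a small synthetic data set of two Gaussian blobs (not the source used in the experiments); all function names, the data, and the parameter values are hypothetical. Distances to cluster means in feature space are computed from the Gram matrix only, and the rate is taken as $\log_2 K_{\mathrm{eff}}$.

```python
import math
import random


def gaussian_kernel(x, y, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def cluster_distances(gram, labels, num_clusters):
    """Squared feature-space distances from every point to each non-empty cluster mean."""
    n = len(gram)
    table = []
    for c in range(num_clusters):
        mem = [i for i in range(n) if labels[i] == c]
        if not mem:
            table.append(None)  # empty cluster: no quantized point
            continue
        size = len(mem)
        self_term = sum(gram[i][j] for i in mem for j in mem) / size ** 2
        table.append([gram[i][i] - 2.0 * sum(gram[i][j] for j in mem) / size + self_term
                      for i in range(n)])
    return table


def kernel_kmeans(data, num_clusters, gamma, iters=50, seed=0):
    """Return (labels, average distortion, rate in bits) for one random initialization."""
    rng = random.Random(seed)
    n = len(data)
    gram = [[gaussian_kernel(a, b, gamma) for b in data] for a in data]
    labels = [rng.randrange(num_clusters) for _ in range(n)]
    for _ in range(iters):
        table = cluster_distances(gram, labels, num_clusters)
        active = [c for c in range(num_clusters) if table[c] is not None]
        new_labels = [min(active, key=lambda c: table[c][i]) for i in range(n)]
        if new_labels == labels:
            break
        labels = new_labels
    table = cluster_distances(gram, labels, num_clusters)
    distortion = sum(table[labels[i]][i] for i in range(n)) / n
    k_eff = len(set(labels))  # quantized points with at least one assigned data point
    return labels, distortion, math.log2(k_eff)


data_rng = random.Random(1)
data = ([[data_rng.gauss(0.0, 0.3), data_rng.gauss(0.0, 0.3)] for _ in range(30)]
        + [[data_rng.gauss(3.0, 0.3), data_rng.gauss(0.0, 0.3)] for _ in range(30)])
labels, distortion, rate = kernel_kmeans(data, num_clusters=4, gamma=1.0)
print(distortion, rate)
```

In practice one would repeat the run over several seeds and keep the minimum distortion for each rate, as done in the experiments.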
After training, we computed the distortion and rate for the test data set by assigning each of 20,000 test data points generated from the same source to the nearest quantized point in the feature space.
For each quantized point, we obtained its preimage. That is, if the $k$th quantized point is expressed as $\sum_{i=1}^{n} \alpha_{ki} \phi(x_i)$, its preimage is

$$\hat{y}_k = \operatorname*{argmax}_{y \in \mathcal{X}} \sum_{i=1}^{n} \alpha_{ki} K(x_i, y).$$

We used the mean shift procedure for the maximization, although this procedure only guarantees the convergence to a local maximum [15,16].
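The preimage computation can be sketched as a fixed-point iteration. The code below is a minimal illustration, assuming the Gaussian kernel and non-negative coefficients (as produced by kernel K-means, where each quantized point is a cluster mean); the function name `preimage_mean_shift` and its arguments are hypothetical. Setting the gradient of $\sum_i \alpha_i K(x_i, y)$ to zero gives the update $y \leftarrow \sum_i \alpha_i K(x_i, y) x_i / \sum_i \alpha_i K(x_i, y)$, which is iterated until the shift is negligible.

```python
import math


def preimage_mean_shift(sample, alpha, gamma, y0, iters=200, tol=1e-12):
    """Approximate a local maximizer of sum_i alpha_i exp(-gamma ||x_i - y||^2)."""
    y = list(y0)
    for _ in range(iters):
        # Kernel-weighted coefficients at the current estimate.
        w = [a * math.exp(-gamma * sum((xd - yd) ** 2 for xd, yd in zip(xi, y)))
             for a, xi in zip(alpha, sample)]
        total = sum(w)
        if total <= 0.0:
            break  # update undefined; keep the current estimate
        y_new = [sum(wi * xi[d] for wi, xi in zip(w, sample)) / total
                 for d in range(len(y))]
        shift = max(abs(a - b) for a, b in zip(y_new, y))
        y = y_new
        if shift < tol:
            break
    return y


# Two sample points with equal weights; by symmetry the maximizer is the midpoint.
y_hat = preimage_mean_shift([[0.0], [1.0]], [0.5, 0.5], gamma=0.5, y0=[0.9])
print(y_hat)
```

As noted in the text, only convergence to a local maximum is guaranteed in general, so the result can depend on the starting point `y0`.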
The obtained bounds and the quantizer performances are displayed in Figure 1a,b for $m = 2$ and $m = 10$, respectively, in the form of distortion-rate functions. The values of $D_{\max}$ in Equations (7) and (9) are also indicated in the figures.
In both dimensions, the upper bound $D_{\mathrm{fea},G}$ is smaller than $D_{\mathrm{inp},G}$ at low rates while the bound is above the quantizer performance. However, the value of $D_{\max,\mathrm{fea}}$ suggests that the bound is informative at low rates. As the rate becomes higher, the lower and upper bounds of the input space reconstruction, $D_{\mathrm{inp},L}$ and $D_{\mathrm{inp},G}$, approach each other. In fact, they sandwich the quantizer performance tightly in the two-dimensional case, which suggests that the rate-distortion function for the feature space reconstruction, $R_{\mathrm{fea}}(D)$, is close to the rate-distortion function of the input space reconstruction, $R_{\mathrm{inp}}(D)$, at high rates.
We see that the quantizer performances for $d_{\mathrm{fea}}$ and those for $d_{\mathrm{inp}}$ approach each other as the rate $R$ grows. The upper bound $D_{\mathrm{inp},G}$ reasonably approximates the quantizer performance by the preimages, and it indicates that, in the two-dimensional case (Figure 1a), the results for $R = 2$ and 3 bits can be improved by at least about 1 bit.
At low distortion levels, each source output should be reconstructed within a small neighborhood in the feature space where we can find another point y in the input space whose feature map φ(y) is sufficiently close to the reconstruction.This suggests that the rate-distortion function of feature space reconstruction is well approximated by the rate-distortion function of input space reconstruction.In other words, combining multiple input points to make a reconstruction in feature space does not do any good for reducing distortion and only a single input point is enough when it is mapped into feature space.Hence, the rate-distortion bounds of input space reconstruction may be informative at low distortion levels.
In the 10-dimensional case (Figure 1b), the distortion in the test data set is close to $D_{\mathrm{inp},G}(R)$ or above it at high rates. This may be due to overfitting of the kernel K-means to the training data set of size 4000. That is, as the rate grows, the distortion in the training data set decreases and the discrepancy between the distortions in the training and test sets increases.

(Figure 1: distortion-rate bounds and quantizer performances for the synthetic data, panels (a) $m = 2$ and (b) $m = 10$; horizontal axis $R$ (bits); legend includes $d_{\mathrm{fea}}$: test; annotated value $D_{\max,\mathrm{fea}} = 0.947$.)

To examine the asymptotic behavior of $R_{\mathrm{inp}}(D)$ discussed in Section 4.3, we computed $R_{\mathrm{inp},L}(D)$ and $R_{\mathrm{inp},G}(D)$ for small $D$, that is, for large $s$. In addition to the Gaussian kernel in Equation (29), which has $\alpha = 2$ in Equation (19), we applied the Laplacian kernel, $K(x, y) = e^{-\gamma \|x - y\|}$, which corresponds to $\alpha = 1$. The kernel parameter of the Laplacian kernel was set to the square root of the value used in the Gaussian kernel.
The rate-distortion bounds, $R_{\mathrm{inp},L}(D)$ and $R_{\mathrm{inp},G}(D)$, divided by $-(\log D)/2$ for small distortion levels are shown in Figure 2a,b for $m = 2$ and $m = 10$, respectively. We can see that, in each case, the ratio tends to $2m/\alpha$, that is, the rate-distortion dimension evaluated in Equation (22), as $D \to 0$. For distortion levels smaller than those presented in Figure 2, the ratios start oscillating due to the errors of numerical integrations.

Image Data
We carried out a similar evaluation of the rate-distortion bounds and quantizer performances for a grayscale image data set extracted from the COIL20 data set [18]. We used the first of the 20 categories of images, which consisted of 72 images of size 32 × 32. Dividing each 32 × 32 image into small patches of size 2 × 2 ($m = 4$), we obtained 256 data points from each image, and 18,432 data points in total. Removing duplicate data points, we finally obtained 13,368 data points. We used the first 2048 data points as the training data and the remaining 11,320 data points as the test data. The training data set was also used for approximating expectations of kernel functions required to compute $R_{\mathrm{fea}}(D)$, and the first $n = 256$ data points were used as the sample data in the definition of $d_{\mathrm{fea}}$. We evaluated only the upper bounds, $R_{\mathrm{fea},G}$ and $R_{\mathrm{inp},G}$, since the lower bound $R_{\mathrm{inp},L}$ requires estimating the source entropy from empirical data, which depends heavily on the estimation method and hence is left to be addressed in more detail.
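The patch extraction step can be sketched as follows. This is an illustrative sketch only: it uses a synthetic 32 × 32 array in place of a COIL20 image, and the function name `extract_patches` is hypothetical. Each non-overlapping 2 × 2 patch is flattened into a 4-dimensional data point, which yields 256 points per image.

```python
def extract_patches(image, patch=2):
    """Split a 2-D list into non-overlapping patch x patch blocks, each flattened row-wise."""
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(0, rows - patch + 1, patch):
        for c in range(0, cols - patch + 1, patch):
            patches.append([image[r + i][c + j]
                            for i in range(patch) for j in range(patch)])
    return patches


# Synthetic stand-in for one 32 x 32 grayscale image.
image = [[(r * 32 + c) % 256 for c in range(32)] for r in range(32)]
patches = extract_patches(image)

# Duplicate data points can be removed while preserving the original order.
unique_patches = list(dict.fromkeys(tuple(p) for p in patches))
print(len(patches), len(unique_patches))
```

With 72 images this gives 72 × 256 = 18,432 data points before duplicates are removed, matching the count stated above.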
Each dimension was normalized so that it has mean 0 and variance 1. Hence, $v_p$ in $R_{\mathrm{inp},G}$ was approximated by the empirical variance, 1. The boundary $B$ in $R_{\mathrm{inp},G}$ was approximated by the maximum norm of the training data points.
The upper bounds and quantizer performances are presented in Figure 3. Although the upper bounds are loose and above the respective quantizer performances, the upper bound $D_{\mathrm{inp},G}(R)$ is roughly predictive of the quantizer performance in the input space, and so is $\min\{D_{\mathrm{inp},G}(R), D_{\mathrm{fea},G}(R)\}$ for the reconstruction in the feature space.

(Figure 3: horizontal axis $R$ (bits), vertical axis $D$; curves $D_{\mathrm{inp},G}(R)$ and $D_{\mathrm{fea},G}(R)$; annotated value $D_{\max,\mathrm{fea}} = 0.955$.)

Conclusions
In this paper, we have shown upper and lower bounds for the rate-distortion functions associated with kernel feature mapping.As suggested in Section 5, the upper bound for the reconstruction in feature space is informative at high distortion levels while the bounds for the reconstruction in input space are informative at low distortion levels.We have also evaluated the rate-distortion dimension of sources with bounded support under kernel-based distortion measures, which shows the asymptotic behavior of the rate-distortion function.Our future directions include deriving tighter bounds and exact evaluation of the rate-distortion function in some special cases.In particular, it is an important undertaking to derive a lower bound to the rate-distortion function of the reconstruction in feature space.

where $C_1$ and $C_2$ have equal volumes and $C_1 \cup C_2$ has volume 1. This suggests that $B = \left(m(m + 1/2)/A(m)\right)^{1/m}$ and $Q_B = 1$ in Equation (10) and succeeding equations in Sections 4.1 and 4.2.

Figure 3. Upper bounds of the rate-distortion functions and quantizer performance for image data.