Open Access

*Entropy* **2017**, *19*(7), 336; doi:10.3390/e19070336

Article

Rate-Distortion Bounds for Kernel-Based Distortion Measures †

Department of Computer Science and Engineering, Toyohashi University of Technology, 1-1 Hibarigaoka Tempaku-cho Toyohashi, Aichi 441-8580, Japan; Tel.: +81-532-44-6893

† This paper is an extended version of my papers published in the Eighth Workshop on Information Theoretic Methods in Science and Engineering, Copenhagen, Denmark, 24–26 June 2015, and the IEEE International Symposium on Information Theory, Aachen, Germany, 25–30 June 2017.

Received: 9 May 2017 / Accepted: 2 July 2017 / Published: 5 July 2017

## Abstract

Kernel methods have been used for turning linear learning algorithms into nonlinear ones. These nonlinear algorithms measure distances between data points by the distance in the kernel-induced feature space. In lossy data compression, the optimal tradeoff between the number of quantized points and the incurred distortion is characterized by the rate-distortion function. However, the rate-distortion functions associated with distortion measures involving kernel feature mapping have yet to be analyzed. We consider two reconstruction schemes, reconstruction in input space and reconstruction in feature space, and provide bounds to the rate-distortion functions for these schemes. Comparison of the derived bounds to the quantizer performance obtained by the kernel $\mathtt{K}$-means method suggests that the rate-distortion bounds for input space and feature space reconstructions are informative at low and high distortion levels, respectively.

Keywords: kernel methods; rate-distortion function; kernel K-means; preimaging

## 1. Introduction

Kernel methods have been widely used for nonlinear learning problems in combination with linear learning algorithms such as the support vector machine and principal component analysis [1]. By the so-called kernel trick, kernel-based methods can apply linear learning methods in the kernel-induced feature space without explicitly computing the high-dimensional feature mapping. Kernel-based methods measure the dissimilarity between data points by the distance in the feature space, which, in input space, corresponds to a distance measure involving the feature mapping [2]. If a kernel-based learning method is used as a lossy source coding scheme, its optimal rate-distortion tradeoff is indicated by the rate-distortion function associated with the distortion measure defined by the kernel feature map [3]. Successful applications of kernel methods to learning problems, and the flexibility they offer in creating various distance measures, suggest that kernel-based distortion measures can be suitable for certain lossy compression problems. However, the rate-distortion function of such a distortion measure has yet to be evaluated analytically. Although there are several kernel-based approaches to vector quantization [4,5], their rate-distortion tradeoffs are still unknown.

In this paper, we derive bounds for the rate-distortion functions of kernel-based distortion measures. We consider two schemes for reconstructing inputs in lossy coding methods. One is to obtain a reconstruction in the original input space. Since kernel methods usually express the result of learning as a linear combination of vectors in feature space, we need an additional step, such as preimaging [6], to obtain the reconstruction in input space. The other is to treat the linear combination of feature vectors itself as the reconstruction and to measure the distortion in the feature space directly. We formulate the two reconstruction schemes (Section 3.1 and Section 3.2), and prove that the rate-distortion function of input space reconstruction provides an upper bound on that of feature space reconstruction (Section 3.3). We derive lower and upper bounds to the rate-distortion function of input space reconstruction, which are computable by one-dimensional numerical integrations alone in the case of translation invariant and isotropic kernel functions (Section 4.1 and Section 4.2). We also provide an upper bound to the rate-distortion function of feature space reconstruction for general positive definite kernel functions (Section 4.4). In the usual applications of kernel-based quantization algorithms, one fixes the rate by determining the number of quantized points and minimizes the average distortion over training data. The distortion-rate function, which is the inverse function of the rate-distortion function, shows the minimum achievable expected distortion (or distortion for test data) at the fixed rate. The derived bounds approximately characterize such optimal tradeoffs between the rate and the expected distortion.

Furthermore, we design a vector quantizer using the kernel $\mathtt{K}$-means method and compare its performance with the derived rate-distortion bounds (Section 5). We also compute the preimages of the quantized points in feature space to investigate the performance of the quantizer in input space. It is suggested through the experiments using synthetic and image data that the rate-distortion bounds of reconstruction in input space are accurate at low distortion levels while the upper bound for reconstruction in feature space is informative at high distortion levels.

## 2. Rate-Distortion Function

Let X and Y be random variables of input and reconstruction taking values in $\mathcal{X}$ and $\mathcal{Y}$, respectively. For a non-negative distortion measure $d(x,y)$ between x and y, the rate-distortion function $R(D)$ of the source $X\sim p(x)$ is defined by

$$R(D)=\underset{q(y|x):E[d(X,Y)]\le D}{inf}I(q),$$

where $I(q)=I(X;Y)$ is the mutual information and E denotes the expectation with respect to $q(y|x)p(x)$. $R(D)$ shows the minimum achievable rate R under the given distortion measure d [3,7]. The distortion-rate function is the inverse function of the rate-distortion function and is denoted by $D(R)$.

If the conditional distributions ${q}_{s}(y|x)$ achieve the minimum of the following Lagrange functional parameterized by $s\ge 0$,

$$L(q)=I(q)+s\left(E\left[d(X,Y)\right]-D\right),$$

then the rate-distortion function is parametrically given by

$$\begin{array}{ccc}\hfill R\left({D}_{s}\right)& =& I\left({q}_{s}\right),\hfill \\ \hfill {D}_{s}& =& \int {q}_{s}(y|x)p(x)d(x,y)dxdy.\hfill \end{array}$$

The parameter s corresponds to the (negated) slope of the tangent of $R(D)$ at $({D}_{s},R({D}_{s}))$ and hence is referred to as the slope parameter [3]. Alternatively, if there exists a marginal reconstruction density ${q}_{s}(y)$ that minimizes the functional

$$F(q)=-\frac{1}{s}E\left[log\int {e}^{-sd(X,y)}q(y)dy\right],$$

then the optimal conditional reconstruction distributions are given by

$${q}_{s}(y|x)=\frac{{e}^{-sd(x,y)}{q}_{s}(y)}{\int {e}^{-sd(x,y)}{q}_{s}(y)dy}$$

(see, for example, [3,8]).

From the properties of the rate-distortion function $R(D)$, we know that $R(D)>0$ for $0<D<{D}_{max}$, where

$${D}_{max}=\underset{y}{inf}\int p(x)d(x,y)dx,$$

and $R(D)=0$ for $D\ge {D}_{max}$ [3] (p. 90). Hence, ${D}_{max}={lim}_{R\to 0}D(R)$.
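For finite input and reconstruction alphabets, the constrained minimization in Equation (1) can be computed numerically by the Blahut–Arimoto algorithm, which alternates between the optimal conditional distribution of Equation (2) and the induced reconstruction marginal. The following sketch is illustrative only (it is not part of this paper's experiments; the function name and parameters are our own choices):

```python
import numpy as np

def blahut_arimoto(p_x, d, s, n_iter=200):
    """Blahut-Arimoto iteration at slope parameter s.

    p_x : source probabilities, shape (nx,)
    d   : distortion matrix, d[i, j] = d(x_i, y_j), shape (nx, ny)
    Returns one point (rate in nats, expected distortion) of R(D).
    """
    ny = d.shape[1]
    q_y = np.full(ny, 1.0 / ny)              # initial reconstruction marginal
    A = np.exp(-s * d)                       # e^{-s d(x, y)}
    for _ in range(n_iter):
        q_yx = A * q_y                       # optimal conditional given marginal, Equation (2)
        q_yx /= q_yx.sum(axis=1, keepdims=True)
        q_y = p_x @ q_yx                     # induced reconstruction marginal
    D = float(np.sum(p_x[:, None] * q_yx * d))
    R = float(np.sum(p_x[:, None] * q_yx * np.log(q_yx / q_y)))  # mutual information
    return R, D
```

For the binary symmetric source under Hamming distortion, for example, the iteration reproduces the known curve $R(D)=log2-{h}_{b}(D)$ in nats.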

## 3. Kernel-Based Distortion Measures

In kernel-based learning methods, data points in input space $\mathcal{X}$ are mapped into some high-dimensional feature space H by a feature mapping $\varphi$. Then, the similarity between two points x and y in $\mathcal{X}$ is measured by the inner product $\langle \varphi(x),\varphi(y)\rangle$ in H.

The inner product is directly evaluated by a nonlinear function in input space,

$$K(x,y)=\langle \varphi(x),\varphi(y)\rangle,$$

which is called the kernel function. Mercer’s theorem ensures that there exists some $\varphi$ such that Equation (4) holds if K is a positive definite kernel [1]. This enables us to avoid explicitly computing the feature map $\varphi$ in the potentially high-dimensional space H, which is called the kernel trick. Many learning methods that can be expressed using only the inner products between data points have been kernelized [1].

We identify the feature space H with the reproducing kernel Hilbert space (RKHS) associated with the kernel function K by the canonical feature map, $\varphi(x)=K(\cdot,x)$ [9] (Lemma 4.19). We assume that the input space $\mathcal{X}$ is a subset of ${\mathbb{R}}^{m}$ and that the kernel function K is continuous [9] (Lemma 4.29). We focus on the squared norm in feature space as the distortion measure and consider two reconstruction schemes in the following respective subsections.

#### 3.1. Reconstruction in Input Space

If we restrict ourselves to the reconstruction in input space, that is, the reconstruction $y\in \mathcal{X}\subset {\mathbb{R}}^{m}$ is computed for each input $x\in \mathcal{X}$, the distortion measure is naturally defined by

$$\begin{array}{ccc}\hfill {d}_{\mathrm{inp}}(x,y)& =& {\parallel \varphi(x)-\varphi(y)\parallel}^{2}\hfill \\ \hfill & =& K(x,x)+K(y,y)-2K(x,y).\hfill \end{array}$$

Note that the reconstruction $\varphi \left(y\right)$ of $\varphi \left(x\right)$ is restricted to the subset of the feature space, $\left\{\varphi \right(y);y\in \mathcal{X}\}$. To obtain a reconstruction in input space, we need a technique such as preimaging [6].
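As a concrete illustration of Equation (5) (a sketch, not code from the paper; the helper names `gauss_kernel` and `d_inp` are hypothetical), the distortion ${d}_{\mathrm{inp}}$ can be evaluated entirely in input space:

```python
import numpy as np

def gauss_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def d_inp(x, y, kernel=gauss_kernel):
    """Input space distortion, Equation (5):
    ||phi(x) - phi(y)||^2 = K(x, x) + K(y, y) - 2 K(x, y)."""
    return kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y)
```

For the Gaussian kernel, ${d}_{\mathrm{inp}}(x,y)=2(1-{e}^{-\gamma {\parallel x-y\parallel}^{2}})$, so the distortion is bounded by 2 however far apart x and y are.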

This is a difference distortion measure if and only if the kernel function is translation invariant, that is, $K(x+a,y+a)=K(x,y)$ for any $a\in \mathcal{X}$. In this case, the distortion measure is expressed as

$${d}_{\mathrm{inp}}(x,y)=\rho (x-y),$$

where $\rho(z)=2(C-K(z,0))$ and $C=K(0,0)$. The rate-distortion function (distortion-rate function, resp.) for this distortion measure is denoted by ${R}_{\mathrm{inp}}(D)$ (${D}_{\mathrm{inp}}(R)$, resp.), and the maximum distortion ${D}_{max}$ in Equation (3) is denoted by ${D}_{max,\mathrm{inp}}$, that is,

$${D}_{max,\mathrm{inp}}=E\left[K(X,X)\right]+\underset{y}{inf}\left\{K(y,y)-2E\left[K(X,y)\right]\right\},$$

which in the translation invariant case becomes ${D}_{max,\mathrm{inp}}=2\left(C-{sup}_{y}E\left[K(X,y)\right]\right)$.

#### 3.2. Reconstruction in Feature Space

Suppose we have a sample of length n in input space, $S=\{{x}_{1},\dots ,{x}_{n}\}$, so that $\{\varphi({x}_{1}),\dots ,\varphi({x}_{n})\}$ spans a linear subspace in feature space. If we compute the reconstruction by the linear combination ${\sum}_{i=1}^{n}{\alpha}_{i}\varphi({x}_{i})$ for ${\alpha}_{i}\in \mathbb{R},i=1,\dots ,n$, and consider it as the reconstruction in feature space, the distortion can be measured by

$$\begin{array}{ccc}\hfill {d}_{\mathrm{fea}}(x,\mathit{\alpha})& =& {d}_{\mathrm{fea}}^{\left[S\right]}(x,\mathit{\alpha})={\parallel \varphi(x)-\sum _{i=1}^{n}{\alpha}_{i}\varphi({x}_{i})\parallel}^{2}\hfill \\ \hfill & =& K(x,x)-2{\mathit{\alpha}}^{T}\mathit{k}(x)+{\mathit{\alpha}}^{T}\mathit{K}\mathit{\alpha},\hfill \end{array}$$

where $\mathit{\alpha}={({\alpha}_{1},\dots ,{\alpha}_{n})}^{T}\in {\mathbb{R}}^{n}$,

$$\mathit{k}(x)={(K({x}_{1},x),\dots ,K({x}_{n},x))}^{T},$$

and $\mathit{K}={\left(K({x}_{i},{x}_{j})\right)}_{ij}$ is the Gram matrix. Note that the reconstruction is identified with the coefficients $\mathit{\alpha}$, whose domain is not identical to the input space $\mathcal{X}$. Although the distortion measure ${d}_{\mathrm{fea}}$ depends on the sample S, we omit the dependence in the notation since we consider a fixed design of S for a sufficiently large n. The sample does not have to be distributed according to the source distribution, but it is required to overspread the support of the source.

The rate-distortion function (distortion-rate function, resp.) for this distortion measure is denoted by ${R}_{\mathrm{fea}}(D)$ (${D}_{\mathrm{fea}}(R)$, resp.), and the maximum distortion ${D}_{max}$ in Equation (3) is given by

$${D}_{max,\mathrm{fea}}=E\left[K(X,X)\right]-E{\left[\mathit{k}(X)\right]}^{T}{\mathit{K}}^{-1}E\left[\mathit{k}(X)\right],$$

which is derived by direct minimization of the quadratic function of $\mathit{\alpha}$, $\int {d}_{\mathrm{fea}}(x,\mathit{\alpha})p(x)dx$.
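The quantities of this subsection can be evaluated directly from a sample. The following sketch (illustrative, not the paper's code; the names `d_fea` and `d_max_fea` are hypothetical) computes Equation (7) and a Monte Carlo estimate of Equation (9):

```python
import numpy as np

def d_fea(x, alpha, sample, kernel):
    """Feature space distortion, Equation (7), for the reconstruction
    sum_i alpha_i phi(x_i) built on the fixed sample S."""
    K = np.array([[kernel(xi, xj) for xj in sample] for xi in sample])  # Gram matrix
    k_x = np.array([kernel(xi, x) for xi in sample])                    # vector k(x)
    return kernel(x, x) - 2.0 * alpha @ k_x + alpha @ K @ alpha

def d_max_fea(source_points, sample, kernel):
    """Monte Carlo estimate of D_max,fea in Equation (9) from source draws."""
    K = np.array([[kernel(xi, xj) for xj in sample] for xi in sample])
    k_X = np.array([[kernel(xi, x) for xi in sample] for x in source_points])
    Ek = k_X.mean(axis=0)                            # estimate of E[k(X)]
    EK_xx = np.mean([kernel(x, x) for x in source_points])
    return EK_xx - Ek @ np.linalg.solve(K, Ek)       # E[K(X,X)] - E[k]^T K^{-1} E[k]
```

Reconstructing a sample point with the corresponding unit coefficient vector gives zero distortion, as it should.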

#### 3.3. ${R}_{\mathrm{inp}}\left(D\right)$ and ${R}_{\mathrm{fea}}\left(D\right)$

The following theorem claims that ${R}_{\mathrm{inp}}\left(D\right)$ provides an upper bound of ${R}_{\mathrm{fea}}\left(D\right)$ when n is sufficiently large.

**Theorem 1.**

If the input space $\mathcal{X}$ is bounded and there exists a conditional density achieving the infimum in the definition of ${R}_{\mathrm{inp}}(D)$, then for any $\epsilon >0$, $D\ge \epsilon$, and sufficiently large n, the following inequality holds:

$${R}_{\mathrm{fea}}(D+\epsilon )\le {R}_{\mathrm{inp}}\left(D\right).$$

The proof is given in Appendix A. This theorem shows that the feature space reconstruction gives better rates since a single feature vector $\varphi \left(y\right)$ can be approximated by a linear combination ${\sum}_{i=1}^{n}{\alpha}_{i}\varphi \left({x}_{i}\right)$ when n is sufficiently large.

## 4. Rate-Distortion Bounds

Since the rate-distortion problem (Section 2) is rarely solved in closed form [8], we derive bounds to ${R}_{\mathrm{inp}}(D)$ and ${R}_{\mathrm{fea}}(D)$.

#### 4.1. Lower Bound to ${R}_{\mathrm{inp}}\left(D\right)$

Although the Shannon lower bound to $R(D)$ is defined for difference distortion measures in general [3] (p. 92), it diverges to $-\infty$ for the distortion measure in Equation (6) since $\int {e}^{-s\rho(z)}dz$ diverges to ∞. Hence, we consider an improved lower bound, which was introduced by [3] (p. 140). Let ${Q}_{B}$ be the probability that $\parallel X\parallel \le B$. Then, $R(D)$ is lower-bounded as

$$R(D)\ge {Q}_{B}\left\{h\left({p}_{B}\right)-\underset{g\in {G}_{B,D}}{max}h(g)\right\},$$

where h denotes the differential entropy,

$${p}_{B}(x)=\frac{1}{{Q}_{B}}p(x)u(B-\parallel x\parallel ),$$

and u is the step function. ${G}_{B,D}$ is the set of all probability densities $g(\cdot)$ for which $g(x)=0$ for $\parallel x\parallel >B$ and $\int \rho(z)g(z)dz\le D/{Q}_{B}$.

In the case of the distortion measure in Equation (6), the maximum in Equation (10) is explicitly attained by

$${g}_{s}(z)=\frac{1}{{C}_{B,s}}exp\left(2sK(z,0)\right)u(B-\parallel z\parallel ),$$

where ${C}_{B,s}={\int}_{\parallel z\parallel \le B}{e}^{2sK(z,0)}dz$ and s is related to D by $\int \rho(z){g}_{s}(z)dz=D/{Q}_{B}$. Since its differential entropy is

$$h\left({g}_{s}\right)=-s\frac{\partial log{C}_{B,s}}{\partial s}+log{C}_{B,s},$$

we arrive at the following theorem.

**Theorem 2.**

The rate-distortion function ${R}_{\mathrm{inp}}(D)$ is parametrically lower-bounded as

$$\begin{array}{ccc}\hfill {R}_{\mathrm{inp}}\left({D}_{s}\right)\ge {R}_{\mathrm{inp},L}\left({D}_{s}\right)& =& {Q}_{B}\left\{h\left({p}_{B}\right)+s\frac{\partial log{C}_{B,s}}{\partial s}-log{C}_{B,s}\right\},\hfill \\ \hfill {D}_{s}& =& {Q}_{B}\left\{2C-\frac{\partial log{C}_{B,s}}{\partial s}\right\}.\hfill \end{array}$$

If we further assume that the kernel function is radial, that is, $K(x,y)=K(x-y,0)=k(\parallel x-y\parallel )$ for some function k, the integrations above reduce to one-dimensional ones,

$${C}_{B,s}=A(m){\int}_{0}^{B}{r}^{m-1}{e}^{2sk(r)}dr,$$

and

$$\begin{array}{ccc}\hfill \frac{\partial log{C}_{B,s}}{\partial s}& =& \frac{2{\int}_{\parallel z\parallel \le B}K(z,0){e}^{2sK(z,0)}dz}{{C}_{B,s}}\hfill \\ \hfill & =& \frac{2A(m){\int}_{0}^{B}{r}^{m-1}k(r){e}^{2sk(r)}dr}{{C}_{B,s}},\hfill \end{array}$$

where $A(m)=\frac{m{\pi}^{m/2}}{\mathrm{\Gamma}(m/2+1)}$ is the area of the unit sphere in ${\mathbb{R}}^{m}$, and $\mathrm{\Gamma}$ is the gamma function.
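For a radial kernel, one parametric point of the lower bound in Theorem 2 can thus be evaluated by one-dimensional numerical integration, as is done in the experiments of Section 5. The sketch below is illustrative (the function names are ours); it assumes ${Q}_{B}=1$ and a caller-supplied source entropy $h({p}_{B})$:

```python
import numpy as np
from math import pi, gamma as Gamma, log

def sphere_area(m):
    """A(m) = m * pi**(m/2) / Gamma(m/2 + 1), area of the unit sphere in R^m."""
    return m * pi ** (m / 2) / Gamma(m / 2 + 1)

def lower_bound_point(k, s, B, m, h_pB, n_grid=20000):
    """One parametric point (D_s, R_{inp,L}(D_s)) of Theorem 2, assuming Q_B = 1.

    k    : radial kernel profile, K(x, y) = k(||x - y||), vectorized over r
    h_pB : differential entropy of the source (supplied by the caller)
    """
    r = np.linspace(0.0, B, n_grid)
    trap = lambda y: float(np.sum((y[1:] + y[:-1]) * np.diff(r)) / 2.0)  # trapezoidal rule
    w = r ** (m - 1) * np.exp(2.0 * s * k(r))   # integrand of C_{B,s} / A(m)
    I0 = trap(w)
    C_Bs = sphere_area(m) * I0
    dlogC_ds = 2.0 * trap(k(r) * w) / I0        # d log C_{B,s} / d s
    h_gs = -s * dlogC_ds + log(C_Bs)            # h(g_s), Equation (13)
    D_s = 2.0 * float(k(0.0)) - dlogC_ds        # distortion at slope s (Q_B = 1)
    return D_s, h_pB - h_gs                     # lower bound at D_s (Q_B = 1)
```

As s grows, the distortion ${D}_{s}$ decreases toward 0 while the bound increases, tracing the curve from right to left.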

#### 4.2. Upper Bound to ${R}_{\mathrm{inp}}\left(D\right)$

If ${d}_{\mathrm{inp}}$ in Equation (5) is a difference distortion measure, that is, if K is translation invariant, then by choosing $q(y|x)={g}_{s}(y-x)$ for the density ${g}_{s}$ in Equation (12), the following upper bound is obtained:

$${R}_{\mathrm{inp}}\left({D}_{s}\right)\le {R}_{\mathrm{inp},U}\left({D}_{s}\right)=h({g}_{s}\ast p)-h\left({g}_{s}\right),$$

$${D}_{s}=2C-\frac{\partial log{C}_{B,s}}{\partial s},$$

where $h\left({g}_{s}\right)$ is given by Equation (13) and $({g}_{s}\ast p)(y)=\int {g}_{s}(y-x)p(x)dx$ is the convolution of ${g}_{s}$ and p. This type of upper bound was used to prove the asymptotic tightness of the Shannon lower bound (as $D\to 0$) for a class of general sources and distortion measures [3,10,11,12]. However, this upper bound requires evaluating the differential entropy of the convolution.

The following theorem is derived from the facts that the spherical Gaussian distribution maximizes the entropy under the constraint that $E[{\parallel X\parallel}^{2}]$ is no greater than a constant, and that $E[{\parallel Y\parallel}^{2}]=E[{\parallel X\parallel}^{2}]+E[{\parallel Z\parallel}^{2}]$ holds for $Y=X+Z\sim {g}_{s}\ast p$ with $Z\sim {g}_{s}$ independent of X.

**Theorem 3.**

If the kernel function is translation invariant and radial, $K(x,y)=k(\parallel x-y\parallel )$, then ${R}_{\mathrm{inp}}(D)$ is parametrically upper-bounded as

$${R}_{\mathrm{inp}}\left({D}_{s}\right)\le {R}_{\mathrm{inp},G}\left({D}_{s}\right)=\frac{m}{2}log\left(2\pi e({v}_{p}+{v}_{s})\right)-h\left({g}_{s}\right),$$

where

$$\begin{array}{ccc}\hfill {v}_{p}& =& \frac{1}{m}\int {\parallel x-\mu \parallel}^{2}p(x)dx,\hfill \\ \hfill \mu & =& \int xp(x)dx,\hfill \\ \hfill {v}_{s}& =& \frac{1}{m}\int {\parallel x\parallel}^{2}{g}_{s}(x)dx\hfill \\ \hfill & =& \frac{A(m)}{m{C}_{B,s}}{\int}_{0}^{B}{r}^{m+1}{e}^{2sk(r)}dr,\hfill \end{array}$$

and ${D}_{s}$ is given by Equation (17) (and Equation (15)).
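The upper bound of Theorem 3 is computable by the same one-dimensional integrations. The sketch below is illustrative (the function name is ours) and assumes ${Q}_{B}=1$; it evaluates one parametric point $({D}_{s},{R}_{\mathrm{inp},G}({D}_{s}))$:

```python
import numpy as np
from math import pi, gamma as Gamma, log, e

def upper_bound_point(k, s, B, m, v_p, n_grid=20000):
    """One parametric point (D_s, R_{inp,G}(D_s)) of Theorem 3, assuming Q_B = 1.

    k   : radial kernel profile, K(x, y) = k(||x - y||), vectorized over r
    v_p : per-dimension variance of the source, Equation (18)
    """
    A_m = m * pi ** (m / 2) / Gamma(m / 2 + 1)  # area of the unit sphere in R^m
    r = np.linspace(0.0, B, n_grid)
    trap = lambda y: float(np.sum((y[1:] + y[:-1]) * np.diff(r)) / 2.0)  # trapezoidal rule
    w = r ** (m - 1) * np.exp(2.0 * s * k(r))   # integrand of C_{B,s} / A(m)
    I0 = trap(w)
    C_Bs = A_m * I0
    dlogC_ds = 2.0 * trap(k(r) * w) / I0        # d log C_{B,s} / d s
    h_gs = -s * dlogC_ds + log(C_Bs)            # h(g_s), Equation (13)
    v_s = trap(r ** 2 * w) / (m * I0)           # per-dimension variance of g_s
    D_s = 2.0 * float(k(0.0)) - dlogC_ds        # distortion at slope s (Q_B = 1)
    R_G = m / 2.0 * log(2.0 * pi * e * (v_p + v_s)) - h_gs
    return D_s, R_G
```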

#### 4.3. Rate-Distortion Dimension

In this section, we evaluate the rate-distortion dimension [13] of the kernel-based distortion measure in Equation (5) to investigate its properties. We again focus on the radial kernel, $K(x,y)=k(\parallel x-y\parallel )$, and assume that

$$\underset{r\to 0}{lim}\frac{k(r)-k(0)}{{r}^{\alpha}}=-\beta $$

holds for some $\alpha >0$ and $\beta >0$. For example, the Gaussian kernel, $k(r)=exp\left(-\gamma {r}^{2}\right)$ $(\gamma >0)$, satisfies Equation (19) with $\alpha =2$ and $\beta =\gamma$.

To examine the limit $D\to 0$ of ${R}_{\mathrm{inp}}(D)$, we consider the asymptotic case of $s\to \infty$. Since $k(r)=k(0)-\beta {r}^{\alpha}+o\left({r}^{\alpha}\right)$, it follows that

$$\begin{array}{ccc}\hfill {C}_{B,s}& =& A(m){\int}_{0}^{B}{r}^{m-1}{e}^{2sk(r)}dr\hfill \\ \hfill & =& A(m){e}^{2sk(0)}\frac{1}{\alpha}{\left(\frac{1}{2s\beta}\right)}^{m/\alpha}\left\{\mathrm{\Gamma}\left(\frac{m}{\alpha}\right)+o(1)\right\}\hfill \end{array}$$

and

$${\int}_{0}^{B}2k(r){r}^{m-1}{e}^{2sk(r)}dr=2k(0)\frac{{C}_{B,s}}{A(m)}-2\beta {e}^{2sk(0)}\frac{1}{\alpha}{\left(\frac{1}{2s\beta}\right)}^{1+m/\alpha}\left\{\mathrm{\Gamma}\left(1+\frac{m}{\alpha}\right)+o(1)\right\},$$

and hence

$$\begin{array}{ccc}\hfill \frac{\partial log{C}_{B,s}}{\partial s}& =& \frac{{\int}_{0}^{B}2k(r){r}^{m-1}{e}^{2sk(r)}dr}{{\int}_{0}^{B}{r}^{m-1}{e}^{2sk(r)}dr}\hfill \\ \hfill & =& 2k(0)-\frac{m}{s\alpha}+o\left(\frac{1}{s}\right).\hfill \end{array}$$

Thus, we have from Equations (14) and (17),

$$-log{D}_{s}=logs+O(1),$$

for both the lower and upper bounds, and from Equation (13),

$$\begin{array}{ccc}\hfill h\left({g}_{s}\right)& =& -\frac{m}{\alpha}logs+O(1)\hfill \\ \hfill & =& \frac{m}{\alpha}log{D}_{s}+O(1).\hfill \end{array}$$

Since ${d}_{\mathrm{inp}}$ in Equation (5) is a norm squared for a valid RKHS kernel K, the rate-distortion dimension of the source distribution p is defined by [13],

$${\mathrm{dim}}_{R}\left(p\right)=\underset{D\to 0}{lim}\frac{{R}_{\mathrm{inp}}\left(D\right)}{-\frac{1}{2}logD}.$$

From Theorems 2 and 3 and Equation (20), we conclude the following.

**Theorem 4.**

If the source has a finite differential entropy, positive and finite ${v}_{p}$ defined in Equation (18), and a bounded support, that is, there exists a finite $B>0$ such that ${Q}_{B}=1$ in Equation (11), and the radial kernel, $K(x,y)=k(\parallel x-y\parallel )$ satisfies Equation (19) for $\alpha >0$ and $\beta >0$, then the rate-distortion dimension Equation (21) of ${R}_{\mathrm{inp}}\left(D\right)$ is given by

$${\mathrm{dim}}_{R}\left(p\right)=\frac{2m}{\alpha}.$$

This theorem shows that the rate-distortion dimension depends only on the dimensionality of the input space and is independent of the dimensionality of the feature space. In the case of the linear kernel, $K(x,y)=\langle x,y\rangle$, with $\varphi(x)=x$, the distortion measure in Equation (5) reduces to the usual squared distortion measure, ${\parallel x-y\parallel}^{2}$. It can be shown that under norm-based distortion measures, including the squared distortion measure, the rate-distortion dimension of a source with an m-dimensional density is m [11,12]. From the preceding theorem, this is also the case for a general radial kernel if the kernel function has order $\alpha =2$, as does the Gaussian kernel. Expression (22) of the rate-distortion dimension will be examined through a numerical experiment in Section 5.1.

#### 4.4. Upper Bound to ${R}_{\mathrm{fea}}\left(D\right)$

We construct an upper bound to the rate-distortion function ${R}_{\mathrm{fea}}(D)$. We choose the conditional distribution of the reconstruction as

$$q(\mathit{\alpha}|x)=N(\mathit{\alpha};{\mathit{m}}_{K}(x),{\tilde{\mathit{K}}}^{-1}/2s),$$

where $\tilde{\mathit{K}}=\mathit{K}+c\mathit{I}$,

$${\mathit{m}}_{K}(x)={\tilde{\mathit{K}}}^{-1}\mathit{k}(x),$$

and $N(\cdot;\mathit{m},\Sigma )$ denotes the n-dimensional normal density with mean $\mathit{m}$ and covariance matrix $\Sigma$. Here, we have introduced the regularization constant $c\ge 0$ with the $n\times n$ identity matrix $\mathit{I}$. The conditional distribution in Equation (23) is implied by Equation (2) and the approximation ${q}_{s}(\mathit{\alpha})=N(\mathit{\alpha};\mathbf{0},\mathit{I}/\left(2sc\right))$. This reconstruction distribution yields the following upper bound:

$$\begin{array}{ccc}\hfill {R}_{\mathrm{fea}}\left({D}_{s}\right)\le {R}_{\mathrm{fea},U}\left({D}_{s}\right)& =& h\left({M}_{p}\right)-h\left(N(\mathit{\alpha};{\mathit{m}}_{K}(x),{\tilde{\mathit{K}}}^{-1}/2s)\right),\hfill \end{array}$$

$$\begin{array}{ccc}\hfill {D}_{s}& =& \frac{n-c\mathrm{tr}\left\{{\tilde{\mathit{K}}}^{-1}\right\}}{2s}+{D}_{min}(c),\hfill \end{array}$$

where ${M}_{p}(\mathit{\alpha})=\int N(\mathit{\alpha};{\mathit{m}}_{K}(x),{\tilde{\mathit{K}}}^{-1}/2s)p(x)dx$,

$$h\left(N(\mathit{\alpha};{\mathit{m}}_{K}(x),{\tilde{\mathit{K}}}^{-1}/2s)\right)=\frac{n}{2}log\left(\frac{\pi e}{s}{\left|\tilde{\mathit{K}}\right|}^{-1/n}\right),$$

which is independent of the input x, and

$${D}_{min}(c)=E\left[K(X,X)\right]-\mathrm{tr}\left\{{\tilde{\mathit{K}}}^{-1}E\left[\mathit{k}(X)\mathit{k}{(X)}^{T}\right]\right\}-c\mathrm{tr}\left\{{\tilde{\mathit{K}}}^{-1}E\left[\mathit{k}(X)\mathit{k}{(X)}^{T}\right]{\tilde{\mathit{K}}}^{-1}\right\}.$$

If $c=0$, ${D}_{min}$ is the mean of the variance of the prediction by the associated Gaussian process [14].

Further upper-bounding the differential entropy $h\left({M}_{p}\right)$ by the Gaussian entropy, we have the following theorem.

**Theorem 5.**

The rate-distortion function ${R}_{\mathrm{fea}}(D)$ is upper-bounded as

$${R}_{\mathrm{fea}}(D)\le {R}_{\mathrm{fea},G}(D)=\frac{1}{2}log\left|\mathit{I}+\frac{n-c\mathrm{tr}\left\{{\tilde{\mathit{K}}}^{-1}\right\}}{D-{D}_{min}(c)}{\tilde{\mathit{K}}}^{-1}\mathit{C}\right|,$$

where

$$\mathit{C}=E\left[\mathit{k}(X)\mathit{k}{(X)}^{T}\right]-E\left[\mathit{k}(X)\right]E{\left[\mathit{k}(X)\right]}^{T}.$$

The proof is given in Appendix B. In the simplest case where $\varphi(x)=x\in {\mathbb{R}}^{1}$, $n=1$, and the source is Gaussian, $p(x)=N(x;0,{\sigma}^{2})$, the upper bound in Equation (27) reduces to

$${R}_{\mathrm{fea},G}(D)=\frac{1}{2}log\left(1+\frac{{\sigma}^{2}}{D}\right),$$

which is an asymptotically (as $D\to 0$) tight upper bound of the well-known rate-distortion function of the Gaussian source under the squared distortion measure, $R(D)=\frac{1}{2}log\left(\frac{{\sigma}^{2}}{D}\right)$ [3,7].
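Given a sample-based Gram matrix and Monte Carlo estimates of the expectations, the bound of Theorem 5 is a single log-determinant. The following sketch is illustrative only (the function name is ours, and it assumes $D>{D}_{min}(c)$):

```python
import numpy as np

def r_fea_upper(K_gram, k_of_X, K_xx, D, c=0.0):
    """Upper bound R_{fea,G}(D) of Theorem 5, in nats.

    K_gram : n x n Gram matrix on the sample S
    k_of_X : N x n matrix whose rows are k(x)^T for N Monte Carlo source draws
    K_xx   : length-N vector of K(x, x) for the same draws
    Requires D > D_min(c).
    """
    n = K_gram.shape[0]
    Kt = K_gram + c * np.eye(n)                       # regularized Gram matrix K~
    Kt_inv = np.linalg.inv(Kt)
    Ek = k_of_X.mean(axis=0)                          # estimate of E[k(X)]
    EkkT = (k_of_X.T @ k_of_X) / k_of_X.shape[0]      # estimate of E[k(X) k(X)^T]
    Cov = EkkT - np.outer(Ek, Ek)                     # matrix C of Theorem 5
    D_min = (K_xx.mean() - np.trace(Kt_inv @ EkkT)
             - c * np.trace(Kt_inv @ EkkT @ Kt_inv))  # D_min(c), Equation (26)
    coef = (n - c * np.trace(Kt_inv)) / (D - D_min)
    _, logdet = np.linalg.slogdet(np.eye(n) + coef * Kt_inv @ Cov)
    return 0.5 * logdet
```

For $n=1$ and the linear feature map with a unit sample point, this recovers the Gaussian-source example $\frac{1}{2}log(1+{\sigma}^{2}/D)$ above.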

## 5. Experimental Evaluation

We numerically evaluate the rate-distortion bounds obtained in the previous section. Designing a quantizer by the kernel $\mathtt{K}$-means algorithm, we compare its performance with the bounds.

We focus on the case of the Gaussian kernel,

$$K(x,y)={e}^{-\gamma {\parallel x-y\parallel}^{2}},$$

with the kernel parameter $\gamma >0$.

#### 5.1. Synthetic Data

As a source, we first assumed the uniform distribution on the union of the two regions, ${C}_{1}=\{x\in {\mathbb{R}}^{m};A(m){\parallel x\parallel}^{m}\le m/2\}$ and ${C}_{2}=\{x\in {\mathbb{R}}^{m};{m}^{2}\le A(m){\parallel x\parallel}^{m}\le m(m+1/2)\}$, where ${C}_{1}$ and ${C}_{2}$ have equal volumes and ${C}_{1}\cup {C}_{2}$ has volume 1. This implies that $B={\left\{\frac{m(m+1/2)}{A(m)}\right\}}^{1/m}$ and ${Q}_{B}=1$ in Equation (10) and the succeeding equations in Section 4.1 and Section 4.2.
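This source can be simulated by rejection sampling: draw uniformly from the bounding box and keep points falling in ${C}_{1}\cup {C}_{2}$. A sketch (not the paper's code; `sample_source` is a hypothetical name):

```python
import numpy as np
from math import pi, gamma as Gamma

def sample_source(m, size, seed=0):
    """Rejection sampler for the uniform source on C1 ∪ C2 of Section 5.1."""
    A = m * pi ** (m / 2) / Gamma(m / 2 + 1)     # area of the unit sphere in R^m
    B = (m * (m + 0.5) / A) ** (1.0 / m)         # outer radius of C2, so Q_B = 1
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < size:
        x = rng.uniform(-B, B, m)                # uniform on the bounding box
        t = A * np.linalg.norm(x) ** m
        if t <= m / 2 or m ** 2 <= t <= m * (m + 0.5):  # inside C1 or C2
            out.append(x)
    return np.array(out)
```

Since ${C}_{1}\cup {C}_{2}$ has volume 1 while the box has volume ${(2B)}^{m}$, the acceptance rate is ${(2B)}^{-m}$, which shrinks quickly with m; for the small m used here this is unproblematic.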

We used the trapezoidal rule to compute the one-dimensional integrations in the lower bound ${R}_{\mathrm{inp},L}$ and the upper bound ${R}_{\mathrm{inp},G}$. We generated an i.i.d. sample of size $n=200$ from the source to compute $\mathit{k}(x)$ and $\mathit{K}$ for ${R}_{\mathrm{fea},G}$ in Equation (27). Generating another 4000 data points, we approximated the required expectations. We optimized the regularization coefficient c to minimize the upper bound ${R}_{\mathrm{fea},G}$ for each D.

Using the same data set of size 4000 as a training data set, we ran the kernel $\mathtt{K}$-means algorithm 10 times with random initializations to obtain the minimum distortion for each rate. Varying the number $\mathtt{K}$ of quantized points from ${2}^{1}$ to ${2}^{10}$, for each $\mathtt{K}$, we counted the effective number ${\mathtt{K}}_{\mathtt{eff}}$ of quantized points with at least one assigned data point and computed the rate as ${log}_{2}{\mathtt{K}}_{\mathtt{eff}}$, since the quantizer is first order, that is, the block length is one. The kernel parameter $\gamma$ was chosen so that a clear separation of ${C}_{1}$ and ${C}_{2}$ is obtained when $\mathtt{K}=2$.
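A minimal kernel $\mathtt{K}$-means sketch operating on a precomputed Gram matrix is given below. It is illustrative rather than the implementation used in the experiments, and it takes an explicit initial labeling; in practice one restarts from several random labelings and keeps the lowest-distortion run, as described above:

```python
import numpy as np

def kernel_kmeans(K, init_labels, n_iter=100):
    """Kernel K-means on a precomputed Gram matrix K (N x N).

    Point i is assigned to the cluster c minimizing its feature-space
    distance to the cluster centroid,
      ||phi(x_i) - mean_{j in c} phi(x_j)||^2
        = K[i, i] - 2 mean_j K[i, j] + mean_{j, j'} K[j, j'].
    Returns the labels and the average feature-space distortion.
    """
    N = K.shape[0]
    labels = np.asarray(init_labels).copy()
    n_clusters = int(labels.max()) + 1
    for _ in range(n_iter):
        dist = np.full((N, n_clusters), np.inf)
        for c in range(n_clusters):
            idx = np.flatnonzero(labels == c)
            if idx.size == 0:
                continue  # empty cluster: leave its column at infinity
            within = K[np.ix_(idx, idx)].mean()
            dist[:, c] = np.diag(K) - 2.0 * K[:, idx].mean(axis=1) + within
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, float(dist.min(axis=1).mean())
```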

After the training, we computed the distortion and rate for the test data set by assigning each of the 20,000 test data points generated from the same source to the nearest quantized point in the feature space.

For each quantized point, we obtained its preimage. That is, if the kth quantized point is expressed as ${\sum}_{i=1}^{n}{\alpha}_{ki}\varphi \left({x}_{i}\right)$, its preimage is

$$\begin{array}{ccc}\hfill {y}_{k}& =& \underset{y}{\mathrm{argmin}}{\parallel \varphi(y)-\sum _{i=1}^{n}{\alpha}_{ki}\varphi({x}_{i})\parallel}^{2}\hfill \\ \hfill & =& \underset{y}{\mathrm{argmax}}\sum _{i=1}^{n}{\alpha}_{ki}K(y,{x}_{i}).\hfill \end{array}$$

We used the mean shift procedure for the maximization, although this procedure only guarantees convergence to a local maximum [15,16].
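For the Gaussian kernel, the mean shift maximization above reduces to the classical fixed-point preimage iteration; the following sketch is illustrative (the function name `preimage_gauss` is ours):

```python
import numpy as np

def preimage_gauss(alpha, sample, gamma, y0, n_iter=500, tol=1e-10):
    """Fixed-point iteration for the preimage under the Gaussian kernel
    K(x, y) = exp(-gamma * ||x - y||^2): a stationary point of
    sum_i alpha_i K(y, x_i) satisfies y = sum_i w_i x_i / sum_i w_i
    with weights w_i = alpha_i exp(-gamma * ||y - x_i||^2).
    Converges only to a local maximum in general.
    """
    X = np.asarray(sample, dtype=float)          # n x m sample points
    y = np.asarray(y0, dtype=float).copy()
    for _ in range(n_iter):
        w = alpha * np.exp(-gamma * np.sum((X - y) ** 2, axis=1))
        y_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y
```

The iteration is sensitive to the starting point ${y}_{0}$; a common choice is the input point assigned to the quantized point in question.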

The obtained bounds and the quantizer performances are displayed in Figure 1a,b for $m=2$ and $m=10$, respectively, in the form of distortion-rate functions. The values of ${D}_{max}$ in Equations (7) and (9) are also indicated in the figures.

In both dimensions, the upper bound ${D}_{\mathrm{fea},G}$ is smaller than ${D}_{\mathrm{inp},G}$ at low rates, although the bound lies above the quantizer performance. However, the value of ${D}_{max,\mathrm{fea}}$ suggests that the bound is informative at low rates. As the rate becomes higher, the lower and upper bounds of the input space reconstruction, ${D}_{\mathrm{inp},L}$ and ${D}_{\mathrm{inp},G}$, approach each other. In fact, they sandwich the quantizer performance tightly in the two-dimensional case, which suggests that the rate-distortion function for the feature space reconstruction, ${R}_{\mathrm{fea}}(D)$, is close to the rate-distortion function of the input space reconstruction, ${R}_{\mathrm{inp}}(D)$, at high rates.

We see that the quantizer performances for ${d}_{\mathrm{fea}}$ and those for ${d}_{\mathrm{inp}}$ approach each other as the rate R grows. The upper bound ${D}_{\mathrm{inp},G}$ reasonably approximates the quantizer performance by the preimages, and it indicates that, in the two-dimensional case (Figure 1a), the results for $R=2$ and 3 bits can be improved by at least about 1 bit.

At low distortion levels, each source output should be reconstructed within a small neighborhood in the feature space, where we can find another point y in the input space whose feature map $\varphi(y)$ is sufficiently close to the reconstruction. This suggests that the rate-distortion function of feature space reconstruction is well approximated by the rate-distortion function of input space reconstruction. In other words, combining multiple input points to form a reconstruction in feature space does not help reduce distortion; a single input point mapped into feature space suffices. Hence, the rate-distortion bounds of input space reconstruction may be informative at low distortion levels.

In the 10-dimensional case (Figure 1b), the distortion in the test data set is close to ${D}_{\mathrm{inp},G}(R)$ or above it at high rates. This may be due to overfitting of the kernel $\mathtt{K}$-means to the training data set of size 4000. That is, as the rate grows, the distortion in the training data set decreases and the discrepancy between the distortions in the training and test sets increases.

To examine the asymptotic behavior of ${R}_{\mathrm{inp}}(D)$ discussed in Section 4.3, we computed ${R}_{\mathrm{inp},L}(D)$ and ${R}_{\mathrm{inp},G}(D)$ for small D, that is, for large s. In addition to the Gaussian kernel, Equation (29), which has $\alpha =2$ in Equation (19), we applied the Laplacian kernel,

$$K(x,y)={e}^{-\gamma \parallel x-y\parallel},$$

which corresponds to $\alpha =1$. The kernel parameter of the Laplacian kernel was set to the square root of the value used in the Gaussian kernel.

The rate-distortion bounds ${R}_{\mathrm{inp},L}(D)$ and ${R}_{\mathrm{inp},G}(D)$, divided by $-(logD)/2$, are shown for small distortion levels in Figure 2a,b for $m=2$ and $m=10$, respectively. We can see that, in each case, the ratio tends to $2m/\alpha$, that is, the rate-distortion dimension evaluated in Equation (22), as $D\to 0$. For distortion levels smaller than those presented in Figure 2, the ratios start oscillating due to the errors of the numerical integrations.

#### 5.2. Image Data

We carried out a similar evaluation of the rate-distortion bounds and quantizer performances for a grayscale image data set extracted from the COIL-20 data set [18]. We used the first of the 20 categories of images, which consisted of 72 images of size $32\times 32$. Dividing each $32\times 32$ image into small patches of size $2\times 2$ ($m=4$), we obtained 256 data points from each image, and 18,432 data points in total. Removing duplicate data points left 13,368 data points. We used the first 2048 data points as the training data and the remaining 11,320 as the test data. The training data set was also used for approximating the expectations of kernel functions required to compute ${R}_{\mathrm{fea}}\left(D\right)$, and the first $n=256$ data points were used as the sample data in the definition of ${d}_{\mathrm{fea}}$. We evaluated only the upper bounds, ${R}_{\mathrm{fea},G}$ and ${R}_{\mathrm{inp},G}$, since the lower bound ${R}_{\mathrm{inp},L}$ requires estimating the source entropy from empirical data, which depends heavily on the estimation method and hence calls for a more detailed treatment.
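The patch extraction described above can be sketched as follows (using a synthetic array in place of the actual COIL-20 loader):

```python
import numpy as np

def extract_patches(image, patch=2):
    """Divide a square image into non-overlapping patch x patch blocks,
    each flattened to a vector (m = patch * patch = 4 for 2 x 2 patches)."""
    h, w = image.shape
    return (image.reshape(h // patch, patch, w // patch, patch)
                 .transpose(0, 2, 1, 3)
                 .reshape(-1, patch * patch))

# Illustrative synthetic 32 x 32 image (not the actual COIL-20 data):
img = np.arange(32 * 32, dtype=float).reshape(32, 32)
patches = extract_patches(img)               # 256 patches of dimension 4
unique_patches = np.unique(patches, axis=0)  # duplicate removal as in the text
```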

Each dimension was normalized to have mean 0 and variance 1. Hence, ${v}_{p}$ in ${R}_{\mathrm{inp},G}$ was set to the empirical variance, 1. The boundary B in ${R}_{\mathrm{inp},G}$ was approximated by the maximum norm of the training data points.

The upper bounds and quantizer performances are presented in Figure 3. Although the upper bounds are loose and lie above the respective quantizer performances, the upper bound ${D}_{\mathrm{inp},G}\left(R\right)$ is roughly predictive of the quantizer performance in the input space, as is $min\{{D}_{\mathrm{inp},G}\left(R\right),{D}_{\mathrm{fea},G}\left(R\right)\}$ for the reconstruction in the feature space.

## 6. Conclusions

In this paper, we have shown upper and lower bounds for the rate-distortion functions associated with kernel feature mapping. As suggested in Section 5, the upper bound for reconstruction in feature space is informative at high distortion levels, while the bounds for reconstruction in input space are informative at low distortion levels. We have also evaluated the rate-distortion dimension of sources with bounded support under kernel-based distortion measures, which characterizes the asymptotic behavior of the rate-distortion function. Our future directions include deriving tighter bounds and exactly evaluating the rate-distortion function in some special cases. In particular, deriving a lower bound to the rate-distortion function of reconstruction in feature space is an important undertaking.

## Acknowledgments

The author would like to thank the anonymous reviewers for their helpful comments and suggestions. This work was supported in part by the Japan Society for the Promotion of Science (JSPS) grants 25120014, 15K16050, and 16H02825.

## Conflicts of Interest

The author declares no conflict of interest.

## Appendix A. Proof of Theorem 1

**Proof.**

Let ${q}^{\ast}(y|x)$ be the conditional density for $x\in \mathcal{X}$ that achieves the infimum of ${R}_{\mathrm{inp}}\left(D\right)$. Then, for $Y\sim \int {q}^{\ast}(y|x)p\left(x\right)dx$, it holds that $I(X;Y)={R}_{\mathrm{inp}}\left(D\right)$ and

$$E\left[{\parallel \varphi \left(X\right)-\varphi \left(Y\right)\parallel}^{2}\right]\le D.\tag{A1}$$

Since the input space $\mathcal{X}$ is bounded and separable, and the kernel function K is continuous, for any $\epsilon >0$ and $y\in \mathcal{X}$, there exist coefficients $\left\{{\alpha}_{i}\left(y\right)\right\}$ such that

$$\parallel \varphi \left(y\right)-\sum _{i=1}^{n}{\alpha}_{i}\left(y\right)\varphi \left({x}_{i}\right)\parallel \le \frac{\epsilon}{3\sqrt{D}}\tag{A2}$$

holds when n is sufficiently large. Let $\mathit{\alpha}\left(y\right)={({\alpha}_{1}\left(y\right),\cdots ,{\alpha}_{n}\left(y\right))}^{T}$ and

$${q}^{\ast}(\mathit{\alpha}|x)=\int \delta (\mathit{\alpha}-\mathit{\alpha}\left(y\right)){q}^{\ast}(y|x)dy,$$

where $\delta $ is Dirac’s delta function. Then, for $\mathit{A}\sim \int {q}^{\ast}(\mathit{\alpha}|x)p\left(x\right)dx$, it follows from the triangle inequality that

$$\begin{array}{ccc}\hfill E\left[{d}_{\mathrm{fea}}(X,\mathit{A})\right]& =& E\left[{\parallel \varphi \left(X\right)-{\sum}_{i=1}^{n}{\alpha}_{i}\left(Y\right)\varphi \left({x}_{i}\right)\parallel}^{2}\right]\hfill \\ \hfill & \le & E\left[{\parallel \varphi \left(X\right)-\varphi \left(Y\right)\parallel}^{2}\right]+2E\left[\parallel \varphi \left(X\right)-\varphi \left(Y\right)\parallel \parallel \varphi \left(Y\right)-{\sum}_{i=1}^{n}{\alpha}_{i}\left(Y\right)\varphi \left({x}_{i}\right)\parallel \right]\hfill \\ & & +E\left[{\parallel \varphi \left(Y\right)-{\sum}_{i=1}^{n}{\alpha}_{i}\left(Y\right)\varphi \left({x}_{i}\right)\parallel}^{2}\right],\hfill \end{array}$$

and hence

$$\begin{array}{ccc}\hfill E\left[{d}_{\mathrm{fea}}(X,\mathit{A})\right]& \le & D+\frac{2\epsilon}{3}+\frac{{\epsilon}^{2}}{9D}\hfill \\ \hfill & \le & D+\epsilon .\hfill \end{array}\tag{A3}$$

To obtain Inequality (A3), we used Equations (A1) and (A2), and Jensen’s inequality,

$$\begin{array}{ccc}\hfill E\left[\sqrt{{\parallel \varphi \left(X\right)-\varphi \left(Y\right)\parallel}^{2}}\right]& \le & \sqrt{E\left[{\parallel \varphi \left(X\right)-\varphi \left(Y\right)\parallel}^{2}\right]}\hfill \\ \hfill & \le & \sqrt{D}.\hfill \end{array}$$

## Appendix B. Proof of Theorem 5

**Proof.**

The mean and covariance matrix of the random vector $\mathit{A}\sim {M}_{p}\left(\mathit{\alpha}\right)$ are

$$\begin{array}{ccc}\hfill E\left[\mathit{A}\right]& =& {\tilde{\mathit{K}}}^{-1}\int \mathit{k}\left(x\right)p\left(x\right)dx\hfill \\ \hfill Cov\left[\mathit{A}\right]& =& E\left[\mathit{A}{\mathit{A}}^{T}\right]-E\left[\mathit{A}\right]E{\left[\mathit{A}\right]}^{T}\hfill \\ \hfill & =& \left\{\frac{1}{2s}\mathit{I}+{\tilde{\mathit{K}}}^{-1}\int \mathit{k}\left(x\right)\mathit{k}{\left(x\right)}^{T}p\left(x\right)dx\right\}{\tilde{\mathit{K}}}^{-1}-{\tilde{\mathit{K}}}^{-1}\int \mathit{k}\left(x\right)p\left(x\right)dx\int \mathit{k}{\left(x\right)}^{T}p\left(x\right)dx{\tilde{\mathit{K}}}^{-1}\hfill \\ \hfill & =& \left\{\frac{1}{2s}\mathit{I}+{\tilde{\mathit{K}}}^{-1}\mathit{C}\right\}{\tilde{\mathit{K}}}^{-1},\hfill \end{array}$$

where $\mathit{C}$ is defined by Equation (28). Thus, the maximum entropy principle of the Gaussian distribution implies that the differential entropy $h\left({M}_{p}\right)$ is upper-bounded by

$$h\left({M}_{p}\right)\le \frac{n}{2}log\left[\left(2\pi e\right){\left|\left\{\frac{1}{2s}\mathit{I}+{\tilde{\mathit{K}}}^{-1}\mathit{C}\right\}{\tilde{\mathit{K}}}^{-1}\right|}^{\frac{1}{n}}\right].$$

Combining this inequality with Equations (24) and (26), we have

$${R}_{\mathrm{fea}}\left({D}_{s}\right)\le \frac{1}{2}log\left|\mathit{I}+2s{\tilde{\mathit{K}}}^{-1}\mathit{C}\right|.$$
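This final bound is straightforward to evaluate numerically. A minimal sketch, assuming a Gaussian kernel and an empirical construction of $\mathit{C}$ as the covariance of the kernel vector $\mathit{k}\left(x\right)=(K(x,{x}_{1}),\cdots ,K(x,{x}_{n}))^{T}$ (cf. Equation (28)); the function and variable names are illustrative:

```python
import numpy as np

def feature_space_upper_bound(X_sample, X_data, s, gamma):
    """Evaluate (1/2) log |I + 2 s Ktilde^{-1} C| in nats.

    X_sample holds the n sample points defining d_fea; X_data approximates
    expectations over the source p(x). C is taken as the empirical
    covariance of the kernel vector k(x), which is an assumption of
    this sketch.
    """
    def gram(A, B):
        # Gaussian kernel Gram matrix between row sets A and B
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    Ktilde = gram(X_sample, X_sample)   # n x n Gram matrix
    k_vecs = gram(X_data, X_sample)     # each row is k(x)^T for one x
    C = np.cov(k_vecs, rowvar=False, bias=True)
    n = Ktilde.shape[0]
    M = np.eye(n) + 2.0 * s * np.linalg.solve(Ktilde, C)
    sign, logdet = np.linalg.slogdet(M)
    return 0.5 * logdet
```

Since the eigenvalues of ${\tilde{\mathit{K}}}^{-1}\mathit{C}$ are nonnegative, the bound is nonnegative and increases with s, consistent with small s corresponding to high distortion levels.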

## References

1. Schölkopf, B.; Smola, A.J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; MIT Press: Cambridge, MA, USA, 2001.
2. Aizerman, M.A.; Braverman, E.A.; Rozonoer, L. Theoretical foundations of the potential function method in pattern recognition learning. Autom. Remote Control **1964**, 25, 821–837.
3. Berger, T. Rate Distortion Theory: A Mathematical Basis for Data Compression; Prentice-Hall: Englewood Cliffs, NJ, USA, 1971.
4. Girolami, M. Mercer kernel-based clustering in feature space. IEEE Trans. Neural Netw. **2002**, 13, 780–784.
5. Filippone, M.; Camastra, F.; Masulli, F.; Rovetta, S. A survey of kernel and spectral methods for clustering. Pattern Recognit. **2008**, 41, 176–190.
6. Schölkopf, B.; Mika, S.; Burges, C.J.C.; Knirsch, P.; Müller, K.R.; Rätsch, G.; Smola, A.J. Input space versus feature space in kernel-based methods. IEEE Trans. Neural Netw. **1999**, 10, 1000–1017.
7. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley-Interscience: Hoboken, NJ, USA, 1991.
8. Gray, R.M. Entropy and Information Theory, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2011.
9. Steinwart, I.; Christmann, A. Support Vector Machines; Springer: Berlin/Heidelberg, Germany, 2008.
10. Linkov, Y.N. Evaluation of ϵ-entropy of random variables for small ϵ. Probl. Inf. Transm. **1965**, 1, 18–26.
11. Linder, T.; Zamir, R. On the asymptotic tightness of the Shannon lower bound. IEEE Trans. Inf. Theory **1994**, 40, 2026–2031.
12. Koch, T. The Shannon lower bound is asymptotically tight. IEEE Trans. Inf. Theory **2016**, 62, 6155–6161.
13. Kawabata, T.; Dembo, A. The rate-distortion dimension of sets and measures. IEEE Trans. Inf. Theory **1994**, 40, 1564–1572.
14. Rasmussen, C.E.; Williams, C.K.I. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning); MIT Press: Cambridge, MA, USA, 2005.
15. Fukunaga, K.; Hostetler, L. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Inf. Theory **1975**, 21, 32–40.
16. Comaniciu, D.; Meer, P. Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell. **2002**, 24, 603–619.
17. Watanabe, K. Rate-distortion analysis for kernel-based distortion measures. In Proceedings of the Eighth Workshop on Information Theoretic Methods in Science and Engineering, Copenhagen, Denmark, 24–26 June 2015.
18. Nene, S.A.; Nayar, S.K.; Murase, H. Columbia Object Image Library (COIL-20). Available online: http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php (accessed on 4 July 2017).

**Figure 2.** The ratios between the rate-distortion bounds and $-(logD)/2$ for (**a**) $m=2$ and (**b**) $m=10$. The bounds are for the Laplacian kernel ($\alpha =1$) and the Gaussian kernel ($\alpha =2$).

© 2017 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).