Article

Fixed-Rate Universal Lossy Source Coding and Model Identification: Connection with Zero-Rate Density Estimation and the Skeleton Estimator

1 Information and Decision System Group, Department of Electrical Engineering, Universidad de Chile, Av. Tupper 2007, Santiago 7591538, Chile
2 Department of Electronic Engineering, Universidad Tecnica Federico Santa Maria, Valparaiso 2390123, Chile
* Author to whom correspondence should be addressed.
Entropy 2018, 20(9), 640; https://doi.org/10.3390/e20090640
Submission received: 26 June 2018 / Revised: 11 August 2018 / Accepted: 22 August 2018 / Published: 25 August 2018
(This article belongs to the Special Issue Rate-Distortion Theory and Information Theory)

Abstract:
This work demonstrates a formal connection between density estimation with a data-rate constraint and the joint objective of fixed-rate universal lossy source coding and model identification introduced by Raginsky in 2008 (IEEE TIT, 2008, 54, 3059–3077). Using an equivalent learning formulation, we derive a necessary and sufficient condition on the class of densities for the achievability of the joint objective. The learning framework used here is the skeleton estimator, a rate-constrained learning scheme that offers achievable results for the joint coding and modeling problem by optimally adapting its learning parameters to the specific conditions of the problem. The results obtained with the skeleton estimator significantly extend the context where universal lossy source coding and model identification can be achieved, allowing for applications that move from the known case of parametric collections of densities with some smoothness and learnability conditions to the rich family of non-parametric $L_1$-totally bounded densities. In addition, in the parametric case we are able to remove one of the assumptions that constrain the applicability of the original result, obtaining a similar performance in terms of the distortion redundancy and the per-letter rate overhead.

1. Introduction

Universal source coding (USC) has a long history in information theory and statistics [1,2,3,4,5]. Davisson’s seminal work [4] formalized the variable-length lossless coding problem and introduced important information quantities for performance analysis [1,2]. In this lossless setting, it is well-understood that the Shannon entropy provides the minimum achievable rate (in bits per sample) [2] to code a stationary and memoryless source when the probability (model) of the source is available. When the probability of the source is not known but belongs to a family of distributions F (the so called universal source coding problem), the focus of the problem is to characterize the penalty (or redundancy in bits per sample) that an encoder and decoder pair will experience due to the lack of knowledge about the samples’ probability [1]. In the lossless case, a seminal result states that the least worst-case redundancy over F (or the minimax solution of the USC problem for F ) is determined by the information radius of F [1].
Building on this connection between least worst-case redundancy and the information radius of $\mathcal{F}$, numerous important results have been developed for lossless USC [1,6,7,8,9]. In particular, it is known that the information radius grows sub-linearly (with the block-length) for the family of finite alphabet stationary and memoryless sources [1], which implies the existence of a universal source code that achieves the Shannon entropy, as the block length tends to infinity, for every distribution in $\mathcal{F}$. However, universality is not possible for the family of countably infinite alphabet stationary and memoryless sources because the information radius of this family is unbounded [3,5,7]. More recent results on lossless USC over countably infinite alphabets have looked at restricting the analysis to specific collections of distributions (with some tail-bounded conditions) to achieve minimax universality [7,8,9] and also at weak variations of the lossless source coding setting [10,11,12].
In the fixed-rate lossy source coding problem, assuming first that the probability $\mu$ of a memoryless source is known, the performance limit of the coding problem is given by the Shannon distortion-rate function $D_\mu(R)$ [2,13]. Consequently, the universal lossy source coding problem reduces to comparing the distortion of a coding scheme (satisfying a fixed-rate constraint) with the Shannon distortion-rate function assuming that the designer only knows that $\mu\in\mathcal{F}$. The literature on this problem is rich [3,5,14,15,16,17,18], with a first result dating back to Ziv [17], who showed the existence of a weakly minimax fixed-rate universal lossy source code for the class of stationary sources under certain assumptions about the source, the alphabet, and the distortion measure. More refined results were presented in [5,16], one of which established necessary and sufficient conditions to achieve weakly minimax universality for the class of stationary and ergodic sources. To provide a more specific analysis of universal lossy source coding, Linder et al. [14] presented a lossy USC scheme with a distortion redundancy that goes to zero as $O(\log\log n/\log n)$ for the case of independent and identically distributed (i.i.d.) bounded sources. Later, Linder et al. [15] improved previous results, showing a fixed-rate lossy construction with a distortion redundancy that vanishes as $O(n^{-1}\log n)$ and $O(\sqrt{\log n/n})$ with $n$ for finite alphabet i.i.d. sources and bounded infinite alphabet i.i.d. sources, respectively. Similar convergence results were obtained using a nearest-neighbor vector quantization approach in [19].
It is also understood that universal variable length lossless-source coding is connected with the problem of distribution estimation [3,6,20] as there is a one-to-one correspondence between prefix-free codes and finite-entropy discrete distributions in the finite and countable alphabet case [1,2,21]. Building on this one-to-one correspondence in the lossless case, Györfi et al. ([3], Theorem 1) showed that the redundancy (in bits per sample) of a given code upper bounds the expected divergence between the true distribution of the source μ and the estimated distribution derived from the code. Therefore, the existence of a universal (lossless) source code for F implies the existence of a universal (distribution-free in F ) estimator of the distribution in expected (direct) information divergence [22]. This means that achieving lossless USC not only provides a lossless representation of the data, but it offers a consistent (error-free) estimator of the distribution at the receiver.
The connection between coding and distribution estimation that is evident in the lossless case is not, however, present in the (fixed-rate) lossy source coding problem. As argued in [18], a fixed-rate lossy source code does not offer a direct map with a probability distribution (model) for the source. In light of this gap between lossy codes and distributions (models) and motivated by some problems in adaptive control, where it is relevant to both compress data in a lossy way and identify the distribution of the source at the receiver [18,23], Raginsky explored the joint objective of fixed-rate universal lossy source coding and model (i.e., distribution) identification in [18].
Inspired by Rissanen’s achievability construction in [6,20], Raginsky [18] proposed a new setting for the problem of fixed-rate universal lossy compression of continuous memoryless sources based on the idea of a two-stage joint coding and model (or distribution) identification framework. In this context, he proposed a two-stage scheme to consider two objectives: fixed-rate universal lossy source coding and source distribution (model) identification. The first objective of the scheme is to transmit the data (optimally) in the classical distortion-rate sense [24], while the second objective is to learn and transmit a description (quantized version) of the source distribution (model) [25,26]. Taking ideas from statistical learning, Raginsky proposed [18] splitting the data into training and testing samples. The training data is used in the first stage of the encoding process to construct a quantized estimate of the source distribution and encode it (the first-stage bits). Then, in a second stage of the encoding process, the first-stage bits are used to pick a fixed-rate lossy source code matched to the estimated distribution to encode the test data (the second-stage bits). In this joint coding and modeling setting, the existence of a zero-rate consistent estimator of the density (in expected total variation) is sufficient to show the existence of a weakly minimax universal fixed-rate source coding scheme [18] (Theorem 3.2), achieving the Shannon distortion-rate function [2,24,27,28], for any given rate. This result is obtained for a wide class of single-letter bounded distortion functions and for a family of source densities $\mathcal{F} = \{\mu_\theta: \theta\in\Theta\}$ indexed over a bounded finite-dimensional space $\Theta\subset\prod_{i=1}^k[-L,L]\subset\mathbb{R}^k$ (i.e., a parametric collection) with some needed smoothness and learnability conditions [18] (Theorem 3.2).
It is important to highlight that the joint coding and modeling achievability results in [18] do not degrade the performance of the source coding objective. In fact, restricting the analysis to the source coding objective alone, the joint coding and modeling framework in [18] showed the same state-of-the-art performance as conventional two-stage universal source coding schemes (or universal vector quantizers) [14,15,19] in terms of distortion redundancy and per-letter rate overhead ($O(\sqrt{\log n/n})$ and $O(\log n/n)$, respectively) as the block length $n$ tends to infinity. Importantly, the first-stage bits of this joint coding and modeling scheme are used to achieve model identification at the receiver with arbitrary precision in total variation (with a rate of convergence of $O(\sqrt{\log n/n})$ as $n$ goes to infinity), with no extra cost in bits per letter compared with conventional fixed-rate lossy source coding methods.

Contributions of This Work

This work formally studies the interplay between density estimation under a data-rate constraint and the joint fixed-rate universal lossy source coding and modeling problem with training data (or memory) introduced in [18]. The first main result (Theorem 1) establishes a connection between zero-rate density estimation and a universal joint coding and modeling scheme that achieves optimal lossy source coding (in a distortion-rate sense) and lossless model identification. This result is obtained for the general family of bounded single-letter distortions [13]. Remarkably, this connection implies that the construction of a joint coding and modeling scheme reduces to the construction of a zero-rate density estimator. From this result, the second main result (Theorem 2) stipulates a necessary and sufficient condition for the existence of a weakly minimax universal joint coding and modeling scheme. For the achievability part of this result, we use the skeleton estimator as our learning framework [29]. Using this learning framework, we extend the parametric context explored in [18] to the rich non-parametric scenario of $L_1$-totally bounded densities [30].
Furthermore, revisiting the parametric case studied in [18], by using the skeleton estimator we are able to remove some of the assumptions that limit the applicability of the original result. We show that the skeleton estimator matches the best performance reported in [18] in terms of the distortion redundancy and (per-letter) rate overhead, in particular obtaining rates of convergence to zero of $O(\sqrt{\log n/n})$ and $O(\log n/n)$, respectively, as the block-length tends to infinity. To obtain this, our result relaxes the finite Vapnik and Chervonenkis (VC) dimension assumption considered in [18]. On the other hand, when the finite VC dimension assumption is added to the analysis, the skeleton learning scheme offers a convergence rate of $O(1/\sqrt{n})$ for the distortion redundancy as the sample-length goes to infinity. Finally, the skeleton framework is implementable in the parametric case, as its minimum-distance decision is carried out on a finite number of candidates and the oracle $\epsilon$-skeleton (or the $\epsilon$-covering in total variation of $\mathcal{F}$) [30] (Chapter 7) can be replaced by a practical uniform covering of the compact index set $\Theta\subset\mathbb{R}^k$ (Theorem 4). It is worth noting that a preliminary version of this work (in the context of density estimation under a data-rate constraint) was presented in [31].
The rest of the paper is organized as follows: Section 2 introduces the setting of the joint coding and modeling with training data. Section 3 elaborates the connections with zero-rate density estimation. Section 4 presents the main joint coding and modeling result (Theorem 2) and introduces the skeleton estimator. Finally, Section 5 revisits a special case where the distributions are indexed by finite dimensional bounded space (the parametric context). A summary of the results is presented in Section 6 and Section 7. Finally, the proofs are presented in Section 8.

2. Preliminaries

The fixed-rate coding and modeling problem introduced in [18] is presented in this section. This joint coding and modeling problem will be the main focus of this work. In addition, notations and definitions used in the rest of the paper will be presented.

2.1. Basic Definitions

Let $\mathcal{X}\in\mathcal{B}(\mathbb{R}^d)$ be a separable and complete subset of $\mathbb{R}^d$, where $\mathcal{B}(\mathbb{R}^d)$ is the Borel sigma field. Let $\mathcal{P}(\mathcal{X})$ be the collection of probability measures on $(\mathcal{X},\mathcal{B}(\mathcal{X}))$, with $\mathcal{B}(\mathcal{X})$ denoting the Borel sigma field restricted to $\mathcal{X}$, and let $\mathrm{AC}(\mathcal{X})\subset\mathcal{P}(\mathcal{X})$ denote the set of probability measures absolutely continuous with respect to the Lebesgue measure $\lambda$ [32]. For any $\mu\in\mathrm{AC}(\mathcal{X})$, $g_\mu(x) = \frac{d\mu}{d\lambda}(x)$ denotes its probability density function. The total variational distance [30] between $v$ and $\mu$ in $\mathcal{P}(\mathcal{X})$ is given by (to avoid any confusion, if $S$ is a set then $|S|$ denotes its cardinality)
$$V(\mu,v) = \sup_{A\in\mathcal{B}(\mathcal{X})} \left|\mu(A)-v(A)\right|.$$
For $\mu$ and $v$ belonging to $\mathrm{AC}(\mathcal{X})$, if we define the Scheffé set for the pair $(\mu,v)$ by
$$A_{\mu,v} \equiv \{x\in\mathcal{X}: g_\mu(x) > g_v(x)\} \in \mathcal{B}(\mathcal{X}),$$
then $V(\mu,v) = \mu(A_{\mu,v}) - v(A_{\mu,v})$ [30,33].
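To make the role of the Scheffé set concrete, the following minimal sketch (with hypothetical Gaussian densities and a discretization grid chosen only for illustration) evaluates $V(\mu,v)$ numerically by integrating $g_\mu - g_v$ over the set where $g_\mu > g_v$.

```python
import numpy as np

def total_variation(g_mu, g_nu, grid):
    """Approximate V(mu, nu) = mu(A) - nu(A) on the Scheffe set A = {g_mu > g_nu},
    using a Riemann sum over a fine grid (illustrative sketch only)."""
    dx = grid[1] - grid[0]
    f, g = g_mu(grid), g_nu(grid)
    scheffe = f > g                                  # indicator of the Scheffe set A_{mu,nu}
    return float(np.sum(f[scheffe] - g[scheffe]) * dx)

# Example with two hypothetical Gaussian densities restricted to a large interval.
gauss = lambda m, s: (lambda x: np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi)))
x = np.linspace(-10.0, 10.0, 200001)
print(total_variation(gauss(0.0, 1.0), gauss(1.0, 1.0), x))  # ~0.3829
```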

2.2. Fixed-Rate Universal Lossy Source Coding with Memory or Training Data

Let $\{X_n: n\ge 1\}$ be an i.i.d. stochastic process (or stationary and memoryless source), where $X_i$ takes values in $\mathcal{X}\subset\mathbb{R}^d$ and has a distribution $\mu$ in $\mathcal{F} = \{\mu_\theta: \theta\in\Theta\}\subset\mathrm{AC}(\mathcal{X})$. $\Theta$ is, in general, an index set for $\mathcal{F}$. The problem of lossy source coding of a finite block of the process, $X^n = (X_1,\ldots,X_n)$, reduces to finding a mapping (or code) $\mathcal{C}_n(\cdot)$ from $\mathcal{X}^n$ to $\mathcal{S}_n$, where $\mathcal{S}_n$ is a finite set. Given a cardinality constraint on $\mathcal{S}_n$, the design objective is to make $\mathcal{C}_n(X^n)$ as close as possible to $X^n$ (on average), using for that a distortion function. The standard coding problem assumes the knowledge of $\mu$ for finding the optimal code (for any finite block length $n$) [1,2,13], as well as for characterizing the fundamental performance limits of this task as $n$ goes to infinity [2,24,28,34,35,36].
A more realistic scenario is the universal source coding (USC) problem [2], where the source distribution $\mu\in\mathcal{F}$ is unknown and a coding scheme needs to be designed optimally for the family $\mathcal{F}$. Here we focus on a specific learning variation of this task introduced by Raginsky in [18], where, in addition to the data that needs to be compressed and recovered (with respect to a fidelity criterion), we have a finite number of i.i.d. samples following the same distribution $\mu$ that can be used to estimate $\mu$ in the encoding process (more details of this approach in Section 2.3). This additional data can be interpreted as memory, training data, or side information about $\mu$ available at the encoder, because it is data that is not required to be compressed and recovered. The existence of this memory departs from the standard zero-memory setting considered in universal source coding [1]. However, this information can be seen as a realistic assumption in the context of a sequential block-by-block coding of an infinite sequence, where the data is partitioned into blocks of the same finite length and compressed sequentially block by block. Then, in a given stage of this sequential process, the data from previous blocks are available (losslessly) at the encoder for compressing the current block [18].
More specifically, following the fixed-rate block coding and modeling setting introduced by Raginsky in [18], we consider an n-block coding scheme with finite memory m, where there is a distinction between the data $Z^m = (Z_1,\ldots,Z_m)$ that is available (as side information) to estimate the source distribution (training data) and the data $X^n$ that needs to be encoded and recovered (source or test data), under the important assumption that both data sets are i.i.d. samples of the same unknown probability $\mu\in\mathcal{F}$. A systematic exposition of this coding setting and its connection with the classical setting of zero-memory block coding is presented in [18] (Section II). Formally, let us define an $(m,n)$-block code by the pair
$$\mathcal{C}_{m,n} \equiv \left( f: \mathcal{X}^m\times\mathcal{X}^n \to \mathcal{S}_n,\ \ \phi: \mathcal{S}_n \to \hat{\mathcal{X}}^n \right).$$
Then, given a set of training samples $z^m\in\mathcal{X}^m$ and a finite block of the source $x^n\in\mathcal{X}^n$, $\mathcal{C}_{m,n}$ is the composition of: an encoding function $f(z^m, x^n)$ that maps $x^n$ to an element in a finite set $\mathcal{S}_n$ conditioned on the training data (or memory) represented by $z^m$, and a decoding function $\phi(\cdot)$ that maps a symbol $s\in\mathcal{S}_n$ to the reproduction points $\Gamma_{\mathcal{C}_{m,n}} \equiv \{\phi(s): s\in\mathcal{S}_n\}$, which we call the codebook of $\mathcal{C}_{m,n}$. In this context, $\hat{\mathcal{X}}$ denotes the reproduction space. As a short-hand, we denote by $\hat{x}^n = \mathcal{C}_{m,n}(x^n) = \phi(f(z^m,x^n))$ the reconstruction of $x^n$ obtained by $\mathcal{C}_{m,n}$ and its memory $z^m$ (for simplicity, the dependency of $\hat{x}^n$ or $\mathcal{C}_{m,n}(x^n)$ on the memory $z^m$ will be implicit in the rest of the exposition). The rate of $\mathcal{C}_{m,n}$ in bits per letter is given by $R(\mathcal{C}_{m,n}) \equiv \frac{\log_2|\mathcal{S}_n|}{n}$. In general, it is not possible to recover $x^n$ from $\hat{x}^n$ given the cardinality constraint on $\mathcal{S}_n$, and thus a single-letter distortion measure $\rho: \mathcal{X}\times\hat{\mathcal{X}}\to\mathbb{R}^+$ is used to quantify the n-block discrepancy by [24]
$$\rho_n(x^n,\hat{x}^n) \equiv \sum_{i=1}^{n} \rho(x_i,\hat{x}_i).$$
Finally, considering $X^n\sim\mu^n$ and $Z^m\sim\mu^m$, the average per-letter distortion of $\mathcal{C}_{m,n}$ given $Z^m$ is
$$D_\mu(\mathcal{C}_{m,n}\,|\,Z^m) \equiv \frac{1}{n}\,\mathbb{E}_{X^n\sim\mu^n}\left[\rho_n(X^n,\hat{X}^n)\right],$$
which is a function of $Z^m$, and hence the average per-letter distortion of $\mathcal{C}_{m,n}$ is
$$D_\mu(\mathcal{C}_{m,n}) \equiv \mathbb{E}_{Z^m\sim\mu^m}\left[ D_\mu(\mathcal{C}_{m,n}\,|\,Z^m)\right].$$
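As a numerical illustration of these definitions, the sketch below (a purely hypothetical setup: a scalar Gaussian source, squared-error distortion, and a codebook drawn from the training block) estimates $R(\mathcal{C}_{m,n})$ and the per-letter distortion $D_\mu(\mathcal{C}_{m,n}\,|\,Z^m)$ by Monte Carlo for a nearest-neighbor n-block code built from the memory $Z^m$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, R = 4, 2.0                          # block length and target rate (bits per letter)
S = int(2 ** (n * R))                  # |S_n| = 2^{nR} codewords

def build_codebook(z, size, block):
    """Use of the memory Z^m: draw codewords from segments of the training data."""
    idx = rng.integers(0, len(z) - block, size=size)
    return np.stack([z[i:i + block] for i in idx])

def encode_decode(x_blocks, codebook):
    """Nearest-neighbor encoding/decoding under rho(x, xhat) = (x - xhat)^2."""
    d = ((x_blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return codebook[d.argmin(axis=1)]

z = rng.normal(size=4096)                       # training data Z^m (memory)
x = rng.normal(size=(2048, n))                  # i.i.d. test blocks X^n ~ mu^n
x_hat = encode_decode(x, build_codebook(z, S, n))
print("rate R(C)  =", np.log2(S) / n)                                   # bits per letter
print("D_mu(C|Z)  =", float(((x - x_hat) ** 2).sum(axis=1).mean() / n))  # per-letter distortion
```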
In universal source coding the performance of a code D μ ( C m , n ) is evaluated over a collection of distributions μ F and is compared (point-wise) with the best code that can be obtained assuming that μ is known. For this analysis, we need the following definitions:
Definition 1
([18]). For a finite block length $n$ and a distribution $\mu\in\mathcal{F}$, the n-order operational distortion-rate function of $\mu$ at rate $R$ is
$$D_\mu^n(R) \equiv \inf_{m\ge 0}\ \inf_{\mathcal{C}_{m,n}:\ R(\mathcal{C}_{m,n})\le R} D_\mu(\mathcal{C}_{m,n}).$$
In this context, the operational distortion-rate function (DRF) [2,28] is given by
$$D_\mu(R) \equiv \lim_{n\to\infty} D_\mu^n(R) = \inf_{n\ge 1} D_\mu^n(R).$$
The celebrated Shannon lossy source-coding theorem [27] provides a single-letter theoretical characterization of $D_\mu(R)$ in (8) (also known as the Shannon DRF). A nice exposition of this celebrated result can be found in [2,24,28].
It is worth noting that the operational distortion-rate function in (7) is equivalent to the classical zero-memory n-order operational distortion-rate function given by $\inf\{D_\mu(\mathcal{C}_{0,n}): R(\mathcal{C}_{0,n})\le R\}$ [18] (Lemma 2.1). Hence, allowing a nonzero memory (side information at the encoder) does not help in the minimization of the distortion when $\mu$ is known.
For the rest of the exposition, we will concentrate on the simple case studied in [18] where n = m (i.e., the block-length is equal to the memory of the code). To be precise about the meaning of universality in this context, we resort to some standard definitions:
Definition 2
([16]). A coding scheme $\{\mathcal{C}_{n,n}: n\ge 1\}$ is weakly minimax universal for the class $\mathcal{F}$ at rate $R$ if, for every $\mu\in\mathcal{F}$,
$$\lim_{n\to\infty} D_\mu(\mathcal{C}_{n,n}) = D_\mu(R)$$
and $\limsup_{n\to\infty} R(\mathcal{C}_{n,n}) = \limsup_{n\to\infty}\frac{\log_2|\mathcal{S}_n|}{n} \le R$. Alternatively, the scheme is said to be strongly minimax universal for the class $\mathcal{F}$ at rate $R$ if
$$\lim_{n\to\infty}\sup_{\mu\in\mathcal{F}}\left( D_\mu(\mathcal{C}_{n,n}) - D_\mu(R)\right) = 0$$
and $\limsup_{n\to\infty} R(\mathcal{C}_{n,n}) \le R$.
Decomposing the distortion redundancy into two terms,
$$D_\mu(\mathcal{C}_{n,n}) - D_\mu(R) = \left[ D_\mu(\mathcal{C}_{n,n}) - D_\mu^n(R)\right] + \left[ D_\mu^n(R) - D_\mu(R)\right],$$
the first term, $D_\mu(\mathcal{C}_{n,n}) - D_\mu^n(R)$, is the n-order distortion redundancy, which is the discrepancy that can be attributed exclusively to the goodness of the coding scheme. The second term in (11), i.e., $D_\mu^n(R) - D_\mu(R)$, has to do with how fast $D_\mu^n(R)$ converges to the Shannon DRF as the block length tends to infinity (see further details in [14] (Section III) and references therein). From this observation, we introduce the following definition:
Definition 3.
A coding scheme $\{\mathcal{C}_{n,n}: n\ge 1\}$ is strongly finite-block universal for the class $\mathcal{F}$ at rate $R$ if
$$\lim_{n\to\infty}\sup_{\mu\in\mathcal{F}}\left( D_\mu(\mathcal{C}_{n,n}) - D_\mu^n(R)\right) = 0$$
and $\limsup_{n\to\infty} R(\mathcal{C}_{n,n}) \le R$.
Note that if $\{\mathcal{C}_{n,n}: n\ge 1\}$ is strongly minimax universal then it is strongly finite-block universal, but the converse is not true in general. The missing condition to make these two criteria equivalent is the uniform convergence of $D_\mu^n(R)$ to $D_\mu(R)$ over the class $\mathcal{F}$. This point is discussed further in Section 6.

2.3. Raginsky’s Two-Stage Joint Universal Coding and Modeling

Motivated by the work of Rissanen [6], Raginsky [18] proposed a two-stage block code with finite memory (training data), with the objective of doing both fixed-rate lossy source coding and identification of the source distribution at the receiver. More precisely, given $Z^n\sim\mu_\theta^n$ and $X^n\sim\mu_\theta^n$ (the training and the source-data samples, respectively), an $(n,n)$-joint coding and modeling rule is given by
$$\mathcal{C}_{n,n} \equiv \left( f_n: \mathcal{X}^n\to\tilde{\mathcal{S}}_n,\ \phi_n: \tilde{\mathcal{S}}_n\to\Theta,\ \left\{ f_{n,\tilde{s}}: \mathcal{X}^n\to\mathcal{S}_n,\ \phi_{n,\tilde{s}}: \mathcal{S}_n\to\hat{\mathcal{X}}^n\ ;\ \tilde{s}\in\tilde{\mathcal{S}}_n \right\}\right),$$
where $\mathcal{S}_n$ and $\tilde{\mathcal{S}}_n$ are finite sets (functions of $n$). $\mathcal{C}_{n,n}$ processes $(Z^n, X^n)$ in two stages. In the first stage, the pair $(f_n,\phi_n)$ in (13) uses $Z^n$ to do density estimation and finite-rate encoding (quantization) by $f_n(Z^n)$, and $\phi_n(\cdot)$ decodes an estimated density indexed in $\{\phi_n(s): s\in\tilde{\mathcal{S}}_n\}\subset\Theta$. At the end, the first stage provides a quantized estimate of $\mu_\theta\in\mathcal{F}$ given by
$$\hat{\theta}_n(Z^n) \equiv \phi_n(f_n(Z^n)) \in \Theta.$$
Using the index $\tilde{s} = f_n(Z^n)\in\tilde{\mathcal{S}}_n$, the second stage of $\mathcal{C}_{n,n}$, represented by $\{(f_{n,\tilde{s}},\phi_{n,\tilde{s}}): \tilde{s}\in\tilde{\mathcal{S}}_n\}$ in (13), encodes and decodes the source data $X^n$ by
$$\mathcal{C}_{n,n}(X^n) \equiv \phi_{n,\tilde{s}}(f_{n,\tilde{s}}(X^n)).$$
In summary, the outcome of the whole encoding process is the concatenation of the bits that represent $f_n(Z^n)$ (first-stage bits) and the bits that represent $f_{n,f_n(Z^n)}(X^n)$ (second-stage bits). The decoding process, on the other hand, reads the first-stage bits to recover $\hat{\theta}_n(Z^n)$ and then reads the second-stage bits to recover $\mathcal{C}_{n,n}(X^n)$ (see Figure 1, in which this process is illustrated). The rate (in bits per letter) of $\mathcal{C}_{n,n}$ is
$$R(\mathcal{C}_{n,n}) = \frac{\log_2|\tilde{\mathcal{S}}_n|}{n} + \frac{\log_2|\mathcal{S}_n|}{n}.$$
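To fix ideas, here is a minimal end-to-end sketch of the two-stage flow for a hypothetical Gaussian location family (the grid of candidate means, the random codebooks, and all parameter values are illustrative assumptions, not the construction of [18]).

```python
import numpy as np

rng = np.random.default_rng(1)
n, R = 8, 1.0
theta_grid = np.linspace(-4.0, 4.0, 2 ** 6)      # quantized model space (first-stage codebook)

def first_stage(z):
    """Estimate the source mean from Z^n and encode it with log2|theta_grid| first-stage bits."""
    return int(np.abs(theta_grid - z.mean()).argmin())

def second_stage_codebook(theta, size, block):
    """Fixed-rate random codebook matched to the estimated model N(theta, 1) (illustrative)."""
    return rng.normal(loc=theta, size=(size, block))

theta_true = 1.3
z = rng.normal(theta_true, 1.0, size=n)           # training data (memory)
x = rng.normal(theta_true, 1.0, size=n)           # source block to compress

s_tilde = first_stage(z)                                         # first-stage bits
codebook = second_stage_codebook(theta_grid[s_tilde], int(2 ** (n * R)), n)
s = int(((x - codebook) ** 2).sum(axis=1).argmin())              # second-stage bits

theta_hat, x_hat = theta_grid[s_tilde], codebook[s]              # decoder output
print("identified model mean:", theta_hat, " per-letter distortion:", ((x - x_hat) ** 2).mean())
```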
Based on this two-stage scheme, we could simultaneously achieve source coding and density estimation (modeling) at the decoder. This new joint coding and modeling objective motivates the introduction of the following definition:
Definition 4.
A joint coding and modeling scheme $\{\mathcal{C}_{n,n}: n\ge 1\}$ in (13) is strongly minimax universal for a class of distributions $\mathcal{F} = \{\mu_\theta:\theta\in\Theta\}\subset\mathrm{AC}(\mathcal{X})$ at the rate $R>0$ if
  • $\lim_{n\to\infty}\sup_{\mu\in\mathcal{F}}\left( D_\mu(\mathcal{C}_{n,n}) - D_\mu^n(R)\right) = 0$,
  • $\lim_{n\to\infty}\sup_{\mu\in\mathcal{F}}\mathbb{E}_{Z^n\sim\mu^n}\left[ V(\mu_{\hat\theta_n(Z^n)},\mu)\right] = 0$, and
  • $\limsup_{n\to\infty} R(\mathcal{C}_{n,n}) \le R$.
Consequently, if $\{\mathcal{C}_{n,n}: n\ge 1\}$ is strongly minimax universal for $\mathcal{F}$, it follows that, as $n$ tends to infinity, density estimation is achieved at the decoder (in expected total variation) and, from the source coding perspective, $\{\mathcal{C}_{n,n}: n\ge 1\}$ is strongly finite-block universal for $\mathcal{F}$ in the sense of Definition 3. For the rest of the paper, the strongly minimax universality of Definition 4 will be the main coding and modeling objective.

3. Connections with Zero-Rate Density Estimation

This section formalizes a connection between the objective of joint coding and modeling (declared in Definition 4) and a problem of zero-rate density estimation.

3.1. Density Estimation with a Rate Constraint

Let us first introduce the problem of rate constrained density estimation. Let F = μ θ : θ Θ AC ( X ) be an indexed collection of densities as introduced in Section 2.2.
Definition 5.
An $(n, 2^{nR})$ learning rule of length $n$ and rate $R$ for $\mathcal{F}$ is a pair of functions $(f,\phi)$, with $f:\mathcal{X}^n\to\mathcal{S}$ and $\phi:\mathcal{S}\to\Theta$, where $\mathcal{S}$ is a finite set and
$$\frac{1}{n}\log_2\left|\{f(x^n): x^n\in\mathcal{X}^n\}\right| = \frac{1}{n}\log_2|\mathcal{S}| = R.$$
The composition of these two functions, $\pi = \phi\circ f: \mathcal{X}^n\to\Theta$, defines the rate-constrained learning rule for $\mathcal{F}$ taking values in the codebook $\{\phi(s): s\in\mathcal{S}\}\subset\Theta$, where $R(\pi) = \log_2(|\mathcal{S}|)/n$ denotes its description complexity in bits per training sample.
Definition 6.
The rate $R\ge 0$ is achievable for $\mathcal{F}$ if there exists a learning scheme $\Pi = \{(f_n,\phi_n): n\ge 1\}$ such that
$$\lim_{n\to\infty}\sup_{\mu\in\mathcal{F}}\mathbb{E}_{Z^n\sim\mu^n}\left[ V(\mu_{\pi_n(Z^n)},\mu)\right] = 0 \quad\text{and}\quad \limsup_{n\to\infty} R(\pi_n)\le R,$$
where $Z_1, Z_2,\ldots$ on the left hand side (LHS) of (18) are i.i.d. realizations driven by $\mu\in\mathcal{F}$. In this case, we say that $\Pi$ is an R-rate uniformly consistent scheme (or estimator) for the class $\mathcal{F}$.

3.2. Main Results

Proposition 1.
If for a given R > 0 , C n , n : n 1 is strongly minimax universal for the class F at the rate R (Definition 4), then its induced finite-description learning scheme obtained from the first stage in (13), i.e., Π = ( f n , ϕ n ) : n 1 , is a zero-rate uniformly consistent estimator for F (Definition 6).
The proof is presented in Section 8.1.
Interestingly, the existence of a zero-rate uniformly consistent scheme for $\mathcal{F}$ is also sufficient to achieve the joint coding and modeling objective (Definition 4) if some mild conditions are adopted from the work in [18]. This is stated in the following result:
Theorem 1.
Let us assume that
(i) 
$\rho:\mathcal{X}\times\hat{\mathcal{X}}\to\mathbb{R}^+$ can be expressed as $\rho(x,\hat{x}) = d(x,\hat{x})^p$, where $d(\cdot,\cdot)$ is a bounded metric on $(\mathcal{X}\cup\hat{\mathcal{X}})\times(\mathcal{X}\cup\hat{\mathcal{X}})$ and $p>0$, and
(ii) 
for all $\mu\in\mathcal{F}$, all $n\ge 1$, and all $R>0$, there exists a $(0,n)$-block code, say $\mathcal{C}^n_\mu$, that achieves the n-order operational DRF $D_\mu^n(R)$ in (7).
Then the existence of a learning scheme $\Pi = \{(f_n,\phi_n): n\ge 1\}$ that is zero-rate uniformly consistent for $\mathcal{F}$ implies that, for every $R>0$, there exists a joint coding and modeling scheme $\{\mathcal{C}_{n,n}: n\ge 1\}$ that is strongly minimax universal for $\mathcal{F}$ at rate $R$ (Definition 4).
The proof is presented in Section 8.2.
Remark 1.
The construction proposed for $\{\mathcal{C}_{n,n}: n\ge 1\}$ at any rate $R>0$ (in Section 8.2), using the zero-rate density estimation scheme $\Pi = \{\pi_n = \phi_n\circ f_n: n\ge 1\}$, satisfies
$$\sup_{\mu\in\mathcal{F}}\left( D_\mu(\mathcal{C}_{n,n}) - D_\mu^n(R)\right) \le C\cdot \sup_{\mu\in\mathcal{F}}\mathbb{E}_{Z^n\sim\mu^n}\left[ V(\mu_{\pi_n(Z^n)},\mu)\right] \quad\text{and}$$
$$R(\mathcal{C}_{n,n}) - R \le R(\pi_n),$$
for all $n\ge 1$, where $C>0$ is a constant. It is worth noting that these two inequalities summarize the result in Theorem 1 and, importantly, these two bounds are independent of $R$.
Remark 2.
An important consequence of the bounds in (19) and (20) is the fact that constructing a learning scheme Π = π n : n 1 with specific rates of convergence for sup μ F E ( V ( μ π n ( Z n ) , μ ) ) and R ( π n ) (as n goes to infinity) produces a joint coding and modeling scheme that achieves a uniform rate of convergence to zero (over F ) of the overhead in distortion by (19) and a uniform rate of convergence to zero of the overhead in rate by (20). This observation will be used in all the achievable results presented in Section 4 and Section 5, where, consequently, the problem reduces to determine Π and expressions for sup μ F E ( V ( μ π n ( Z n ) , μ ) ) and R ( π n ) .

4. Joint Source Coding and Modeling Achievability Results

From the connection with zero-rate density estimation in Section 3, here we present a set of new results for the joint coding and modeling problem of Section 2.3. In these results, the general conditions (i) and (ii) stated in Theorem 1 are assumed.

4.1. Main Result: The Skeleton Density Estimator

Let us first introduce some notions from approximation theory [37].
Definition 7.
Let $\mathcal{F}\subset\mathrm{AC}(\mathcal{X})$ be a class of densities. We say that $\mathcal{F}$ is $L_1$-totally bounded if for every $\epsilon>0$ there is a finite set of elements $\{\mu_i: i=1,\ldots,N\}$ in $\mathcal{F}$ such that
$$\mathcal{F}\subset\bigcup_{i=1}^{N} B_\epsilon^V(\mu_i),$$
where $B_\epsilon^V(\mu) \equiv \{v\in\mathrm{AC}(\mathcal{X}): V(\mu,v)<\epsilon\}$.
Definition 8.
For $\mathcal{F}$ $L_1$-totally bounded, let $N_\epsilon$ denote the smallest positive integer $N$ that achieves the condition in (21). $N_\epsilon$ is called the $\epsilon$-covering number of $\mathcal{F}$, and $\mathcal{K}(\epsilon)\equiv\log_2(N_\epsilon)$ is called the Kolmogorov $\epsilon$-entropy of $\mathcal{F}$ [30].
Definition 9.
An $\epsilon$-covering $\mathcal{G}_\epsilon$ of $\mathcal{F}$ such that $|\mathcal{G}_\epsilon| = N_\epsilon$ is called an $\epsilon$-skeleton of $\mathcal{F}$ [29].
Theorem 2.
For every rate $R>0$, there is a strongly minimax universal joint coding and modeling scheme for $\mathcal{F}$ at rate $R$ if, and only if, $\mathcal{F}$ is $L_1$-totally bounded.
The proof is presented in Section 8.3.
The achievability part of the proof of Theorem 2 relies on the adoption of the skeleton estimator [29] (with its minimum-distance learning principle in (42)), which is a zero-rate uniformly consistent density estimator for $\mathcal{F}$ (Definition 6). Furthermore, Theorem 2 can be complemented by saying that the proposed construction $\{\mathcal{C}_{n,n}: n\ge 1\}$ derived from the skeleton estimator satisfies ($P_\mu$ is a short-hand for the process distribution of $(Z_n)_{n\ge 1}$ characterized by $\mu\in\mathcal{F}$ under the i.i.d. assumption)
$$\lim_{n\to\infty} D_\mu(\mathcal{C}_{n,n}\,|\,Z^n) = D_\mu(R), \quad P_\mu\text{-almost surely},$$
$$\lim_{n\to\infty} V(\mu_{\pi_n(Z^n)},\mu) = 0, \quad P_\mu\text{-almost surely},$$
for every $\mu\in\mathcal{F}$. The argument is presented in Appendix A.
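For concreteness, the following minimal sketch implements the minimum-distance rule (42) of Section 8.3 for a hypothetical one-dimensional Gaussian location family, exploiting the fact that the Scheffé sets of equal-variance Gaussians are half-lines (the grid of means and all numerical choices are illustrative assumptions).

```python
import numpy as np
from math import erf, sqrt

# Hypothetical skeleton G_eps: unit-variance Gaussians with means on a finite grid.
means = np.linspace(-2.0, 2.0, 9)
Phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))

# For unit-variance Gaussians the Scheffe set of mean a vs. mean b (a < b) is the half-line
# (-inf, (a+b)/2]; the Yatracos class consists of such half-lines and their complements, and
# since |P(B) - Q(B)| = |P(B^c) - Q(B^c)| it suffices to scan the midpoints of all pairs.
midpoints = np.array([0.5 * (a + b) for a in means for b in means if a < b])

def skeleton_estimate(x):
    emp = np.array([np.mean(x <= c) for c in midpoints])          # empirical measure of (-inf, c]
    scores = [np.max(np.abs(np.array([Phi(c - m) for c in midpoints]) - emp)) for m in means]
    return means[int(np.argmin(scores))]                          # minimum-distance rule (42)

rng = np.random.default_rng(2)
sample = rng.normal(0.4, 1.0, size=2000)
print("skeleton estimate of the mean:", skeleton_estimate(sample))
```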

4.2. Examples of L 1 -Totally Bounded Clases

Knowing specific expressions for $\mathcal{K}(\epsilon) = \log_2 N_\epsilon < \infty$, the skeleton estimator can be optimized by selecting its design parameter appropriately. In particular, the sequence $(\epsilon_n)_{n\ge 1}$ (see details in Section 8.3) is selected as the solution of the optimal balance between estimation and approximation errors (see (45) in Section 8.3), which is given by $\epsilon_n \equiv \inf\{\epsilon>0: \log(2N_\epsilon^2)\le n\epsilon^2\}$ [30] (Chapter 7.2). The details of this analysis are presented in Section 8.3 and [30] (Chapter 7). By doing so, an optimized zero-rate skeleton scheme $\Pi = \{(f_{\epsilon_n},\phi_{\epsilon_n}): n\ge 1\}$, with a concrete rate of convergence for $\sup_{\mu\in\mathcal{F}}\mathbb{E}_{Z^n\sim\mu^n}\left[ V(\mu_{\pi_{\epsilon_n}(Z^n)},\mu)\right]$ and for $R(\pi_{\epsilon_n})$, can be obtained. From Remarks 1 and 2, these results imply specific performance guarantees for the induced joint coding and modeling scheme. To illustrate, we present three interesting examples below; a numerical sketch of the underlying balance is also given after this paragraph.
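The following numerical sketch makes the balance in (45) explicit: given an assumed entropy profile $\mathcal{K}(\epsilon)$, it scans a grid of $\epsilon$ values and picks the one minimizing the bound $3\epsilon + \sqrt{8\log(2N_\epsilon^2)/n}$ (the entropy profiles below are hypothetical stand-ins matching the orders quoted in the examples).

```python
import numpy as np

def risk_bound(eps, n, log_N):
    """Upper bound (45): 3*eps + sqrt(8*log(2*N_eps^2)/n), with log_N(eps) = log N_eps (nats)."""
    return 3.0 * eps + np.sqrt(8.0 * (np.log(2.0) + 2.0 * log_N(eps)) / n)

def optimize_eps(n, log_N, grid=np.logspace(-4, 0, 2000)):
    """Pick eps_n by numerically balancing approximation and estimation errors."""
    values = [risk_bound(e, n, log_N) for e in grid]
    return grid[int(np.argmin(values))]

# Hypothetical entropy profiles: K(eps) ~ d*log(1/eps) (mixtures) and K(eps) ~ L^d/eps^d (monotone).
mixture_logN = lambda eps, d=5: d * np.log(1.0 / eps)
monotone_logN = lambda eps, d=2, L=1.0: (L ** d) / (eps ** d)

for n in (10 ** 3, 10 ** 5):
    print(n, optimize_eps(n, mixture_logN), optimize_eps(n, monotone_logN))
```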

4.2.1. Finite Mixture Classes

Let $\mathcal{F} = \{\mu_\theta: \theta\in\Theta\}$ with $\Theta = \{\theta\in[0,1]^d: \sum_{k=1}^d\theta_k = 1\}$ be the class of measures that are convex combinations of $\mu_1,\ldots,\mu_d\in\mathrm{AC}(\mathcal{X})$, i.e., for every $\theta\in\Theta$ and $A\in\mathcal{B}(\mathcal{X})$, $\mu_\theta(A) = \sum_{k=1}^d \theta_k\cdot\mu_k(A)$. $\mathcal{F}$ is $L_1$-totally bounded with $\mathcal{K}(\epsilon)$ being $O(d\log(1/\epsilon))$ [30] (Chapter 7.4). From (45), the optimal sequence $(\epsilon_n)$ is $O(\sqrt{d/n})$ [30], which implies the following finite-length performance bound [30] (Chapter 7.4):
$$\sup_{\theta\in\Theta}\mathbb{E}\left[ V(\mu_{\pi_{\epsilon_n}(Z^n)},\mu_\theta)\right] \le C\sqrt{\frac{d\log n}{n}},$$
with $C$ a universal non-negative constant. The rate in bits per sample, $R(\pi_{\epsilon_n}) = \mathcal{K}(\epsilon_n)/n$, is $O(\log n/n)$.

4.2.2. Monotone Densities in $[0,1]^d$

Let $\mathcal{F}$ be the collection of densities with support on $[0,1]^d$, monotonically decreasing per coordinate and bounded by a constant $L>0$. This class is known to be $L_1$-totally bounded and, furthermore, $\mathcal{K}(\epsilon)\le C\,L^d\,\epsilon^{-d}$ [30] (Lemma 7.1), with the constant $C$ depending only on $d$. From (45), $(\epsilon_n)$ being $O(L^{d/(d+2)}/n^{1/(d+2)})$ is optimal (please see details in [26,30]), with the following performance bound,
$$\sup_{\mu\in\mathcal{F}}\mathbb{E}\left[ V(\mu_{\pi_{\epsilon_n}(Z^n)},\mu)\right] \le C\,\frac{L^{d/(d+2)}}{n^{1/(d+2)}}.$$
In this case, the rate in bits per sample, $R(\pi_{\epsilon_n}) = \mathcal{K}(\epsilon_n)/n$, is $O(1/n^{2/(d+2)})$.

4.2.3. r-Moment Smooth Class in $[0,1]$

Let $\mathcal{F}$ be the class of densities defined on the bounded support $[0,1]$, with $r$ absolutely continuous derivatives ($r$ an integer greater than zero) and such that every $f\in\mathcal{F}$ satisfies $\int_{[0,1]} |f^{(r+1)}(x)|\,dx \le C$ for a constant $C>0$. This class is $L_1$-totally bounded with $\mathcal{K}(\epsilon)$ being $O(1/\epsilon^{r+1})$ [30] (Chapter 7.6). From (45), the optimal sequence $(\epsilon_n)$ is $O(1/n^{1/(3+r)})$, where $\sup_{\mu\in\mathcal{F}}\mathbb{E}\left[ V(\mu_{\pi_{\epsilon_n}(Z^n)},\mu)\right]$ is $O(1/n^{1/(3+r)})$ and the rate in bits per sample, $R(\pi_{\epsilon_n}) = \mathcal{K}(\epsilon_n)/n$, is $O(1/n^{2/(3+r)})$.
Notably, the last two examples are fully non-parametric, where $\mathcal{K}(\epsilon)$ is a polynomial function of $1/\epsilon$. Richer non-parametric examples of $L_1$-totally bounded classes of densities, where $\mathcal{K}(\epsilon)$ grows even exponentially in $1/\epsilon$, are presented in [30] (Chapters 7.6 and 7.8) and the references therein.

4.3. Yatracos Classes with Finite VC Dimension

Looking at the distortion redundancy bound in (19), when $\mathcal{F}$ is totally bounded the fastest rate of convergence that can be achieved with the skeleton estimator proposed in Theorem 2 is $O(1/\sqrt{n})$ (see Section 8.3 and the estimation error bound in (45)). In this section, more specific density collections are studied to achieve this best rate $O(1/\sqrt{n})$ for density estimation and, through (19), for the distortion redundancy. We follow the path proposed by Yatracos in [38], who explored families of distributions with a finite Vapnik and Chervonenkis (VC) dimension, the so-called VC classes [39,40]. Let us first introduce some definitions:
Definition 10
([38]). Let $\mathcal{F} = \{\mu_\theta:\theta\in\Theta\}\subset\mathrm{AC}(\mathcal{X})$ be an indexed collection of densities. The Yatracos class for such a collection is given by
$$\mathcal{A}_\Theta = \left\{ A_{\theta,\bar\theta}: \theta,\bar\theta\in\Theta,\ \theta\neq\bar\theta\right\},$$
where $A_{\theta,\bar\theta} \equiv \{x\in\mathcal{X}: g_{\mu_\theta}(x) > g_{\mu_{\bar\theta}}(x)\}\in\mathcal{B}(\mathcal{X})$ is the Scheffé set of $\mu_\theta$ with respect to $\mu_{\bar\theta}$, as defined in (2).
Theorem 3.
Let us assume that
(i) 
$\mathcal{F}$ is $L_1$-totally bounded,
(ii) 
the Yatracos class $\mathcal{A}_\Theta$ has a finite VC dimension (Definition A1 in Appendix B), and
(iii) 
the Kolmogorov entropy of $\mathcal{F}$ associated with the sequence $\epsilon_n = 1/\sqrt{n}$ grows strictly sub-linearly, i.e., $\log_2(N_{1/\sqrt{n}})$ is $o(n)$,
then there is a zero-rate density estimator scheme $\Pi = \{(f_n,\phi_n): n\ge 1\}$ for $\mathcal{F}$ such that
$$\sup_{\mu\in\mathcal{F}}\mathbb{E}_{Z^n\sim\mu^n}\left[ V(\mu_{\pi_n(Z^n)},\mu)\right]\ \text{is}\ O(1/\sqrt{n}),$$
where $\pi_n(Z^n) = \phi_n(f_n(Z^n))$ is the skeleton estimator in (42) with $\epsilon_n = 1/\sqrt{n}$. Furthermore, $\Pi$ is also a zero-rate strongly consistent density estimator, where for every $\mu\in\mathcal{F}$
$$V(\mu_{\pi_n(Z^n)},\mu)\ \text{is}\ O\!\left(\sqrt{\log n/n}\right), \quad P_\mu\text{-almost surely}.$$
The proof is presented in Section 8.4.
From Definition 8, $\log_2(N_\epsilon)$ is non-increasing in $\epsilon$. In fact, depending on how rich $\mathcal{F}$ is, $\log_2(N_\epsilon)$ can go from being $O(\log(1/\epsilon))$, passing through being polynomial in $1/\epsilon$, to being $O(e^{1/\epsilon})$ (see a number of examples in [30] (Chapter 7) and its references). Then, the role of (iii) in the statement of Theorem 3 is to bound how fast $N_\epsilon$ can tend to infinity as $\epsilon$ goes to zero in order to guarantee a zero rate for the skeleton learning scheme. It is simple to show that $N_\epsilon$ being $O(e^{(1/\epsilon)^q})$ with $q\in[0,2)$ is sufficient to achieve that $\log_2(N_{1/\sqrt{n}})$ is $o(n)$. This is a condition satisfied by a rich collection of $L_1$-totally bounded classes in $\mathrm{AC}(\mathcal{X})$. Concrete examples are presented in [30] (Chapter 7).
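To spell out why the exponent range $q\in[0,2)$ suffices (a short verification added for completeness):
$$N_\epsilon = O\!\left(e^{(1/\epsilon)^q}\right) \;\Longrightarrow\; \log_2\!\big(N_{1/\sqrt{n}}\big) = O\!\left((\sqrt{n})^{q}\right) = O\!\left(n^{q/2}\right), \qquad\text{and}\quad n^{q/2} = o(n) \iff q<2.$$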

5. The Parametric Scenario

The results presented so far are of theoretical interest because they rely on the skeleton estimator, which is constructed from the skeleton covering of $\mathcal{F}$ (see Definition 9), a covering that is unknown in practice. Moving towards making the zero-rate skeleton learning scheme of practical interest, we revisit the important parametric scenario in which $\Theta$, the index set of $\mathcal{F}$, is a compact set contained in a finite-dimensional Euclidean space $\mathbb{R}^k$. Interestingly, in this context we can consider a practical covering of $\mathcal{F}$ induced by the uniform partition of the parameter space $\Theta$, as used in [18]. Unlike [18], where a minimum-distance estimate is first found and then quantized, here we first quantize the space $\Theta$ and then find the minimum-distance estimate among a finite collection of candidates (i.e., over a finite number of prototypes in $\Theta$). Some assumptions will be needed.
Definition 11
([18]). Let $\mathcal{F} = \{\mu_\theta:\theta\in\Theta\}$ with $\Theta\subset\mathbb{R}^k$. Let $I_\mathcal{F}:\Theta\to\mathcal{F}$ be the index function of $\mathcal{F}$ that maps $\theta$ to $\mu_\theta$. $I_\mathcal{F}$ is said to be locally uniformly Lipschitz if there exist $r>0$ and $m>0$ such that, for all $\theta\in\Theta$ and all $\phi\in B_r(\theta)$,
$$V(\mu_\theta,\mu_\phi) \le m\,\|\theta-\phi\|,$$
where $B_r(\theta)\subset\Theta$ denotes the ball of radius $r$ (with respect to the Euclidean norm in $\mathbb{R}^k$) centered at $\theta$.
The following lemma shows that F is L 1 -totally bounded under some parametric assumptions.
Lemma 1.
Let $\mathcal{F} = \{\mu_\theta:\theta\in\Theta\}\subset\mathcal{P}(\mathcal{X})$ with $\Theta\subset\mathbb{R}^k$. If $\Theta$ is bounded (there exists $L>0$ such that $\Theta\subset\prod_{i=1}^k[-L,L]$) and the mapping $I_\mathcal{F}:\Theta\to\mathcal{F}$ is locally uniformly Lipschitz (Definition 11), then $\mathcal{F}$ is $L_1$-totally bounded. Furthermore, $N_\epsilon$ is $O(1/\epsilon^k)$ for this family.
The proof is presented in Section 8.5.
It is important to note that the ϵ -covering of F used in the proof of Lemma 1 to derive an upper bound for N ϵ is practical (see Appendix C). This offers the possibility of implementing a practical skeleton estimator, which is the focus of the following result.

The Practical Skeleton Estimator

Under the assumptions of Lemma 1, let $(\tilde{f}_{n,\epsilon},\tilde{\phi}_{n,\epsilon})$ denote the learning rule of length $n$ associated with the minimum-distance principle in (42) with parameter $\epsilon$ (see details in Section 8.3), where, instead of the $\epsilon$-skeleton $\mathcal{G}_\epsilon$ of $\mathcal{F}$ (in Definition 9), the implementable (see Appendix C) $\epsilon$-covering of $\Theta$ presented in the proof of Lemma 1 is used. This practical $\epsilon$-covering is denoted by $\tilde{\mathcal{G}}_\epsilon$ (by definition, $N_\epsilon = |\mathcal{G}_\epsilon| \le |\tilde{\mathcal{G}}_\epsilon| = \tilde{N}_\epsilon$, which is $O(1/\epsilon^k)$ by Lemma 1). With this, let $\tilde{\Pi}((\epsilon_n)_{n\ge 1}) \equiv \{(\tilde{f}_{n,\epsilon_n},\tilde{\phi}_{n,\epsilon_n}): n\ge 1\}$ denote our practical learning scheme indexed by the precision numbers $(\epsilon_n)_{n\ge 1}\in(\mathbb{R}^+)^{\mathbb{N}}$. We are in a position to integrate Theorem 3 and Lemma 1 to state the following:
Theorem 4.
Under the assumptions of Lemma 1, the practical skeleton estimator $\tilde\Pi((\epsilon_n)_{n\ge 1})$ with $\epsilon_n = 1/\sqrt{n}$ satisfies that
$$\sup_{\mu_\theta\in\mathcal{F}}\mathbb{E}_{Z^n\sim\mu^n}\left[ V(\mu_{\tilde\pi_{n,\epsilon_n}(Z^n)},\mu_\theta)\right]\ \text{is}\ O\!\left(\sqrt{\log n/n}\right), \quad\text{and}\quad R(\tilde\pi_{n,\epsilon_n})\ \text{is}\ O(\log n/n),$$
where $\tilde\pi_{n,\epsilon}(Z^n)\equiv\tilde\phi_{n,\epsilon}(\tilde f_{n,\epsilon}(Z^n))$.
In addition, if the Yatracos collection $\mathcal{A}_\Theta = \{A_{\theta,\bar\theta}: \theta,\bar\theta\in\Theta,\ \theta\neq\bar\theta\}$ has a finite VC dimension equal to $J$, then
$$\sup_{\mu_\theta\in\mathcal{F}}\mathbb{E}_{Z^n\sim\mu^n}\left[ V(\mu_{\tilde\pi_{n,\epsilon_n}(Z^n)},\mu_\theta)\right]\ \text{is}\ O(1/\sqrt{n}), \quad\text{and}\quad R(\tilde\pi_{n,\epsilon_n})\ \text{is}\ O(\log n/n).$$
The proof is presented in Section 8.6.
When $\mathcal{X}\subset\mathbb{R}^d$, Raginsky [18] showed that the finite VC dimension assumption of Theorem 4 is satisfied by the class of mixture families presented in Section 4.2.1 and by a rich collection of exponential families of the form $\mathcal{F} = \{\mu_\theta:\theta\in\Theta\}\subset\mathcal{P}(\mathcal{X})$ with $\frac{d\mu_\theta}{d\lambda}(x) = f(x)\cdot e^{\sum_{i=1}^k\theta_i h_i(x) - g(\theta)}$ for $x\in\mathcal{X}$, where $f(x)$ is a reference density, $\{h_i(\cdot): i=1,\ldots,k\}$ is a set of arbitrary real-valued functions, $g(\theta)$ is a normalization constant ($g(\theta) = \ln\int_\mathcal{X} e^{\sum_{i=1}^k\theta_i h_i(x)} f(x)\,dx$), and $\Theta$ is a compact subset of $\mathbb{R}^k$ (see details in [18] (Section V)).
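As a complement, the sketch below builds the uniform product-type covering of $\Theta\subset[-L,L]^k$ used in the proof of Lemma 1 and outlined in Appendix C (an illustrative construction; the per-coordinate cell count follows the loose bound used in the proof and is not tuned).

```python
import itertools
import numpy as np

def uniform_covering(L, k, eps_bar):
    """Prototype centers of a product-type eps_bar-covering of [-L, L]^k in Euclidean norm.
    Each coordinate is split into ceil(L*k/eps_bar) cells, matching the count used in Lemma 1."""
    cells = int(np.ceil(L * k / eps_bar))
    edges = np.linspace(-L, L, cells + 1)
    centers_1d = 0.5 * (edges[:-1] + edges[1:])
    return np.array(list(itertools.product(centers_1d, repeat=k)))

grid = uniform_covering(L=1.0, k=2, eps_bar=0.25)
print(grid.shape)   # O((L*k/eps_bar)^k) prototypes
```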

6. Summary of the Results

We summarize the results of the proposed zero-rate density estimation approach adopted for the problem of joint fixed-rate lossy source coding and modeling of continuous memoryless sources.
  • Proposition 1 and Theorem 1 formalize the interplay between the two-stage joint fixed-rate coding and modeling objective and the problem of zero-rate uniformly consistent (in expected total variation) density estimation.
  • Theorem 2 establishes a necessary and sufficient condition on a family of densities for the existence of a strongly minimax joint coding and modeling scheme achieving both the source coding and the model identification objectives (Definition 4). The result is obtained for the rich non-parametric collection of $L_1$-totally bounded densities.
  • For the modeling stage, we propose using the skeleton estimator, which first quantizes the space of candidate densities and then makes the minimum-distance decision over this finite set of candidates (42). This is a practical solution in the sense that the inference (minimization) is carried out over a finite set.
  • By introducing combinatorial regularity conditions on the family of distributions $\mathcal{F} = \{\mu_\theta:\theta\in\Theta\}$, the skeleton scheme achieves an $O(1/\sqrt{n})$ rate of convergence for the n-order distortion redundancy, and the same rate for the expected total variational distance of the modeling part (Theorem 3).
  • Finally, for a relevant parametric setting, a practical skeleton-based joint coding and modeling scheme is proposed that achieves a rate of $O(1/\sqrt{n})$ for the n-order distortion redundancy (Theorem 4). This rate is slightly better than the $O(\sqrt{\log n/n})$ achieved in [18] under the same rate overhead of $O(\log n/n)$. Furthermore, Theorem 4 removes the finite-VC-dimension assumption over the Yatracos class $\mathcal{A}_\Theta$ considered in [18] (Theorem 3.2), while achieving the same performance rates in terms of n-order distortion redundancy $O(\sqrt{\log n/n})$, uniform expected risk to learn the density $O(\sqrt{\log n/n})$, and rate overhead $O(\log n/n)$.
Concerning the last parametric result, we note that the result in [18] can be improved by the adoption of Dudley’s entropy bound [41], which would yield the same asymptotic rate reported in this work for the n-order distortion redundancy.
A final remark is that, under the bounded distortion metric assumption of Theorem 1 condition (i), Linder et al. [14] (Theorem 2) showed that for every $\theta\in\Theta$ and every $R>0$ such that $D_{\mu_\theta}(R)>0$, there is a constant $K_\theta(R)>0$ such that
$$D_{\mu_\theta}^n(R) - D_{\mu_\theta}(R) \le \left( K_\theta(R) + r_n\right)\sqrt{\frac{\log n}{n}},$$
where $(r_n)$ is a sequence that converges to zero ($o(1)$) uniformly in $\Theta$. This result offers a rate of convergence of the n-order operational distortion-rate function to the Shannon DRF as the block length tends to infinity. In view of (11), we can adopt this result in Theorems 3 and 4 to say that the average distortion of the respective joint coding and modeling schemes at rate $R$, i.e., $D_\mu(\mathcal{C}_{n,n})$, converges to the Shannon DRF $D_\mu(R)$ as $O(\sqrt{\log n/n})$, point-wise for every $\mu\in\mathcal{F}$. Therefore, in the process of comparing $D_\mu(\mathcal{C}_{n,n})$ with the Shannon DRF, we lose the $O(1/\sqrt{n})$ rate of convergence.

7. Conclusions

This work revisits the problem of fixed-rate universal lossy source coding and model identification with training data proposed in [18] from a learning perspective. Remarkably, we found that the problem is equivalent to the problem of density estimation of the source distribution under a concrete but non-conventional operational data-rate constraint in bits per sample. This learning problem can be seen as the task of estimating and encoding the distribution of samples with a zero rate in bits per sample, while achieving a consistent estimation (in expected total variation) of the distribution after the decoding process. From our perspective, the rate-constrained density estimation problem is interesting in itself and can have relevant applications in other contexts, such as distributed learning scenarios and sensor network problems.
Importantly for the joint coding and modeling problem, the connection with density estimation provides a context for the use of the skeleton estimator proposed by Yatracos in [29]. We highlight two important implications of its use. First, we extend results about minimax universality from the parametric context explored in [18] to the rich non-parametric family of $L_1$-totally bounded densities [26,30]. This result significantly expands the contexts where the joint modeling and coding objective can be achieved. We illustrated this with some examples in Section 4.2, and many more can be found in the literature on density estimation [26,30].
Second, in the parametric case studied in [18], we were able to remove some of the assumptions and obtain not only the same performance result in terms of rate of convergence of the n-order distortion redundancy but also slightly better convergence results. Therefore, the Skeleton estimator, though essentially a non-parametric learning scheme, is shown to be instrumental in enriching the applicability of the joint coding and modeling framework.

8. Proofs of Results

8.1. Proposition 1

Proof. 
The fact that $\Pi$ is uniformly consistent for $\mathcal{F}$ follows directly from Definition 4. On the other hand, the rate of $\pi_n = \phi_n\circ f_n$ is $R(\pi_n) = \frac{1}{n}\log_2|\tilde{\mathcal{S}}_n|$. From the definition of $D_\mu^n(R)$, it is simple to show from the strict monotonicity of $D_\mu(R)$ that, in order to have $\lim_{n\to\infty}\sup_{\mu\in\mathcal{F}}\left( D_\mu(\mathcal{C}_{n,n}) - D_\mu^n(R)\right) = 0$, it is required that $\limsup_{n\to\infty}\frac{1}{n}\log_2|\mathcal{S}_n| > R-\epsilon$ for any $\epsilon>0$. Then, from (16), and since $\log_2|\tilde{\mathcal{S}}_n|/n = R(\pi_n)$, $\limsup_{n\to\infty} R(\mathcal{C}_{n,n})\le R$ implies that $\lim_{n\to\infty} R(\pi_n) = 0$. ☐

8.2. Theorem 1

Proof. 
The proof builds upon the ideas elaborated in [18] (Theorem 3.2, p. 3065). Let us consider an arbitrary R > 0 and let Π = ( f n , ϕ n ) : n 1 be the zero-rate learning scheme of the assumption. Using Π , let us construct the joint coding and modeling rule of length n by:
$$\mathcal{C}_{n,n} = \left( f_n: \mathcal{X}^n\to\tilde{\mathcal{S}}_n,\ \phi_n: \tilde{\mathcal{S}}_n\to\Theta,\ \left\{ f_{n,\tilde{s}}: \mathcal{X}^n\to\mathcal{S}_n,\ \phi_{n,\tilde{s}}: \mathcal{S}_n\to\hat{\mathcal{X}}^n : \tilde{s}\in\tilde{\mathcal{S}}_n\right\}\right).$$
Concerning the first stage of $\{\mathcal{C}_{n,n}: n\ge 1\}$, it is induced directly from the encoding-decoding rules of $\Pi$. For the second stage, for every $n\ge 1$ and $\tilde{s}\in\tilde{\mathcal{S}}_n$, the pair $(f_{n,\tilde{s}},\phi_{n,\tilde{s}})$ is picked such that $\mathcal{C}^n_{\mu_{\theta_{n,\tilde{s}}}} = \phi_{n,\tilde{s}}\circ f_{n,\tilde{s}}$ is the optimal n-block code that achieves $D^n_{\mu_{\theta_{n,\tilde{s}}}}(R)$ (from the hypothesis in (ii)), with $\theta_{n,\tilde{s}} \equiv \phi_n(\tilde{s})$ short-hand for the reproduction codeword induced from the first-stage pair $(f_n,\phi_n)$, and with $\mathcal{S}_n$ satisfying the R-rate constraint, i.e., $|\mathcal{S}_n| = \lfloor 2^{nR}\rfloor$. From the construction and the fact that $\Pi$ has zero rate,
$$\lim_{n\to\infty} R(\mathcal{C}_{n,n}) = R + \lim_{n\to\infty}\frac{\log_2|\tilde{\mathcal{S}}_n|}{n} = R,$$
so $\{\mathcal{C}_{n,n}: n\ge 1\}$ satisfies the rate condition. On the other hand, based on the assumption that $\Pi$ is zero-rate uniformly consistent, it follows that
$$\lim_{n\to\infty}\sup_{\mu\in\mathcal{F}}\mathbb{E}\left[ V(\mu_{\hat\theta_n(Z^n)},\mu)\right] = 0,$$
where $\hat\theta_n(Z^n) = \phi_n(f_n(Z^n))$. Then $\{\mathcal{C}_{n,n}: n\ge 1\}$ achieves the modeling objective. Concerning the coding objective, we use the following key result:
Lemma 2
([18] (Lemma C.1)). Let $P$ and $Q$ be two probability measures on $(\mathcal{X},\mathcal{B}(\mathcal{X}))$. Let $\mathcal{C}_n = (f,\phi)$ be a zero-memory n-block code with the nearest-neighbor property (i.e., $\mathcal{C}_n$ is nearest-neighbor if, for all $x^n\in\mathcal{X}^n$, $\phi(f(x^n)) = \arg\min_{\hat{x}^n\in\Gamma_{\mathcal{C}_n}} \rho_n(x^n,\hat{x}^n)$, with $\Gamma_{\mathcal{C}_n}$ the reproduction codebook of $\mathcal{C}_n$). If we denote the performance of $\mathcal{C}_n$ ($\mathcal{C}_n = \phi\circ f$) with respect to $P$ by
$$D_P(\mathcal{C}_n) \equiv \frac{1}{n}\,\mathbb{E}_{X^n\sim P^n}\left[\rho_n(\mathcal{C}_n(X^n), X^n)\right],$$
where $P^n$ denotes the product measure with marginal $P$ on $(\mathcal{X}^n,\mathcal{B}(\mathcal{X}^n))$, and $\rho$ satisfies condition (i) of Theorem 1 and is bounded by $d_{max}$, then
$$\left| D_P(\mathcal{C}_n)^{1/p} - D_Q(\mathcal{C}_n)^{1/p}\right| \le 2^{1/p}\, d_{max}\cdot V(P,Q).$$
Furthermore, the inequality can be extended to the n-order operational distortions in (7), i.e.,
$$\left| D_P^n(R)^{1/p} - D_Q^n(R)^{1/p}\right| \le 2^{1/p}\, d_{max}\cdot V(P,Q),$$
for all $R>0$.
Let us work with the following distortion redundancy,
$$
\begin{aligned}
D_\mu(\mathcal{C}_{n,n}\,|\,Z^n) - D_\mu^n(R) &= \frac{1}{n}\,\mathbb{E}_{X^n\sim\mu^n}\left[\rho_n(X^n,\mathcal{C}_{n,n}(X^n))\right] - D^n_{\mu_{\hat\theta_n(Z^n)}}(R) \\
&\qquad + D^n_{\mu_{\hat\theta_n(Z^n)}}(R) - D_\mu^n(R) \\
&\le D_\mu\big(\mathcal{C}^n_{\mu_{\hat\theta_n(Z^n)}}\big) - D^n_{\mu_{\hat\theta_n(Z^n)}}(R) + 2^{1/p}\, d_{max}\cdot V(\mu_{\hat\theta_n(Z^n)},\mu) \\
&= D_\mu\big(\mathcal{C}^n_{\mu_{\hat\theta_n(Z^n)}}\big) - D_{\mu_{\hat\theta_n(Z^n)}}\big(\mathcal{C}^n_{\mu_{\hat\theta_n(Z^n)}}\big) + 2^{1/p}\, d_{max}\cdot V(\mu_{\hat\theta_n(Z^n)},\mu) \\
&\le 2^{1/p+1}\, d_{max}\cdot V(\mu_{\hat\theta_n(Z^n)},\mu).
\end{aligned}
$$
For the first equality we use (5). The inequality in (35) is from the definition in (31) and (33), and the equality in (36) is from the construction of C μ θ ^ n ( Z n ) n which is n-operational optimal for the distribution μ θ ^ n ( Z n ) at rate R. Finally, (37) is from (32).
Concluding, $D_\mu(\mathcal{C}_{n,n}\,|\,Z^n) - D_\mu^n(R)$ is random (a measurable function of $Z^n$) and dominated by $V(\mu_{\hat\theta_n(Z^n)},\mu)$ up to a constant. Hence, taking the expected value (with respect to $Z^n$) on both sides of this inequality (see (6)), the uniform convergence in (30) implies that
$$\lim_{n\to\infty}\sup_{\mu\in\mathcal{F}}\left( D_\mu(\mathcal{C}_{n,n}) - D_\mu^n(R)\right) = 0,$$
and then the coding objective is achieved. ☐

8.3. Theorem 2

Proof. 
Let us first assume that $\mathcal{F}$ is $L_1$-totally bounded and prove the direct part of the statement. We adopt the skeleton estimate proposed by Yatracos [29] and extended by Devroye et al. [42,43] (a complete presentation can be found in [30] (Chapter 7)). For an arbitrary $\epsilon>0$, let us consider the $\epsilon$-skeleton $\mathcal{G}_\epsilon = \{\mu_{\theta_i^\epsilon}: i=1,\ldots,N_\epsilon\}$ of $\mathcal{F}$. We use $g_{\theta_i^\epsilon}(x) \equiv \frac{d\mu_{\theta_i^\epsilon}}{d\lambda}(x)$ as short-hand for the i-th pdf in $\mathcal{G}_\epsilon$, and we define
$$\Theta_\epsilon \equiv \{\theta_i^\epsilon: i=1,\ldots,N_\epsilon\}\subset\Theta$$
to represent the index set of $\mathcal{G}_\epsilon$. Let us consider the Yatracos class of $\mathcal{G}_\epsilon$ given by [30]
$$\mathcal{A}_\epsilon \equiv \left\{ A_{i,j}^\epsilon,\ A_{j,i}^\epsilon: 1\le i<j\le N_\epsilon\right\},$$
where $A_{i,j}^\epsilon = \{x\in\mathcal{X}: g_{\theta_i^\epsilon}(x) > g_{\theta_j^\epsilon}(x)\}\in\mathcal{B}(\mathcal{X})$ is the Scheffé set of $\mu_{\theta_i^\epsilon}$ with respect to $\mu_{\theta_j^\epsilon}$ in (2) [30,33]. Hence, given i.i.d. realizations $X_1,\ldots,X_n$ with $X_i\sim\mu_\theta$ ($\mu_\theta\in\mathcal{F}$), let us propose the encoder-decoder pair $(f_{n,\epsilon},\phi_{n,\epsilon})$ associated with $\mathcal{A}_\epsilon$ by
$$f_{n,\epsilon}(X^n) \equiv \arg\min_{i\in\{1,\ldots,N_\epsilon\}}\ \sup_{B\in\mathcal{A}_\epsilon}\left|\mu_{\theta_i^\epsilon}(B) - \hat\mu_n(B)\right| \in \{1,\ldots,N_\epsilon\},$$
$$\phi_{n,\epsilon}(i) \equiv \theta_i^\epsilon \in \Theta_\epsilon\subset\Theta,$$
where $\hat\mu_n(B) = \frac{1}{n}\sum_{j=1}^n \mathbf{1}_B(X_j)$ is the standard empirical distribution. In this context,
$$\hat\theta_\epsilon(X^n) = \phi_{n,\epsilon}(f_{n,\epsilon}(X^n)) = \arg\min_{\theta_i^\epsilon\in\Theta_\epsilon}\ \sup_{B\in\mathcal{A}_\epsilon}\left|\mu_{\theta_i^\epsilon}(B) - \hat\mu_n(B)\right|$$
is the well-known skeleton estimate [29]. $\hat\theta_\epsilon(X^n)$ is the minimum-distance approximation of $\hat\mu_n$ by elements of $\mathcal{G}_\epsilon$ [29,30], adopting the measure on the right-hand side of (42), which is reminiscent of the total variational distance in (1). In order to choose a sequence $(\epsilon_n)_{n\ge 1}$, we consider the following performance bound.
Lemma 3
([30] (Theorem 6.3)). For any $\mu\in\mathcal{F}$,
$$V(\mu_{\hat\theta_\epsilon(X^n)},\mu) \le 3\min_{v\in\mathcal{G}_\epsilon} V(v,\mu) + 4\sup_{B\in\mathcal{A}_\epsilon}\left|\hat\mu_n(B)-\mu(B)\right|.$$
Equation (43) is valid for any $\epsilon>0$ and, consequently, it provides a trade-off between an approximation error term and an estimation error term. The approximation error is $\min_{v\in\mathcal{G}_\epsilon} V(v,\mu)$, which is bounded by $\epsilon$ by the definition of $\mathcal{G}_\epsilon$. For the estimation error, on the other hand, Yatracos proposed the use of Hoeffding's inequality [44] to obtain that, for every $\mu\in\mathcal{P}(\mathcal{X})$ [30] (Theorem 7.1),
$$\mathbb{E}_{X^n\sim\mu^n}\left[ \sup_{B\in\mathcal{A}_\epsilon}\left|\hat\mu_n(B)-\mu(B)\right|\right] \le \sqrt{\frac{\log(2N_\epsilon^2)}{2n}}.$$
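For completeness, (44) follows from Hoeffding's inequality combined with a standard maximal bound for finitely many sub-Gaussian variables; a compressed version of that argument (added here, not part of the original text) reads:
$$\mathbb{E}\Big[\max_{1\le i\le M}|Y_i|\Big] \;\le\; \frac{\log(2M)}{\lambda} + \frac{\lambda}{8n}\;\Big|_{\lambda=\sqrt{8n\log(2M)}} \;=\; \sqrt{\frac{\log(2M)}{2n}}, \qquad M = |\mathcal{A}_\epsilon|\le N_\epsilon^2,$$
where $Y_i = \hat\mu_n(B_i) - \mu(B_i)$ is an average of $n$ independent terms with range $1/n$, so that $\mathbb{E}\big[e^{\lambda|Y_i|}\big]\le 2e^{\lambda^2/(8n)}$ by Hoeffding's lemma.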
Using (44) in (43), it follows that $\sup_{\mu_\theta\in\mathcal{F}}\mathbb{E}\left[ V(\mu_{\hat\theta_\epsilon(X^n)},\mu_\theta)\right] \le 3\epsilon + \sqrt{\frac{8\log(2N_\epsilon^2)}{n}}$. This last expression is distribution-free, and it remains valid if the approximation fidelity $\epsilon$ is chosen as a function of $n$ [30]. Consequently, for any sequence $(\epsilon_n)_{n\ge 1}$,
$$\sup_{\mu_\theta\in\mathcal{F}}\mathbb{E}\left[ V(\mu_{\hat\theta_{\epsilon_n}(X^n)},\mu_\theta)\right] \le 3\epsilon_n + \sqrt{\frac{8\log(2N_{\epsilon_n}^2)}{n}},$$
for all $n\ge 1$. Hence, we consider $\epsilon_n \equiv \inf\{\epsilon>0: \log(2N_\epsilon^2)\le n\epsilon^2\}$, proposed in [30] (Chapter 7.2), which is well-defined and converges to zero as $n$ tends to infinity. Consequently, from (45), $\lim_{n\to\infty}\sup_{\mu_\theta\in\mathcal{F}}\mathbb{E}\left[ V(\mu_{\hat\theta_{\epsilon_n}(X^n)},\mu_\theta)\right] = 0$. Then the learning scheme $\Pi((\epsilon_n)_{n\ge 1}) \equiv \{(f_{n,\epsilon_n},\phi_{n,\epsilon_n}): n\ge 1\}$ satisfies the learning requirement in Definition 6, where, in particular, $R(\phi_{n,\epsilon_n}\circ f_{n,\epsilon_n}) = \frac{\log_2(N_{\epsilon_n})}{n}$ is $o(1)$ by construction. To conclude the argument of this part (i.e., presenting the construction of the second stage of a joint coding and modeling scheme), we adopt the result and the construction presented in the proof of Theorem 1 (see Remark 1 for details). This result implies that for every $R>0$ there is a strongly minimax universal joint coding and modeling scheme for $\mathcal{F}$ at rate $R$.
For the other implication (the converse part of the statement), let us fix $R>0$ and assume that we have a joint coding and modeling scheme that is strongly minimax universal (Definition 4) for $\mathcal{F}$ at rate $R$. Then, from Proposition 1, we have a learning scheme $\Pi = \{(f_n,\phi_n): n\ge 1\}$ such that $\lim_{n\to\infty} R(\pi_n = \phi_n\circ f_n) = 0$ and
$$\lim_{n\to\infty}\sup_{\mu\in\mathcal{F}}\mathbb{E}_{P_\mu^n}\left[ V(\mu_{\pi_n(X^n)},\mu)\right] = 0.$$
For the learning rule of length $n$, we have its reproduction codebook, which we denote by $\Theta_n \equiv \{\theta_j^n: j=1,\ldots,2^{nR(\pi_n)}\}\subset\Theta$. Let us define the minimum-distance oracle solution in $\Theta_n$ by
$$\tilde\theta_n(\mu) = \arg\inf_{\theta\in\Theta_n} V(\mu_\theta,\mu).$$
From (46), we have that $\lim_{n\to\infty}\sup_{\mu\in\mathcal{F}} V(\mu_{\tilde\theta_n(\mu)},\mu) = 0$. In other words, for every $\epsilon>0$ there exists $N(\epsilon)<\infty$ such that, for all $n\ge N(\epsilon)$, $V(\mu_{\tilde\theta_n(\mu)},\mu)<\epsilon$ uniformly for every element $\mu\in\mathcal{F}$. This means that for every $\epsilon>0$ there exists $N(\epsilon)<\infty$ such that, for any arbitrary $\bar{n} > N(\epsilon)$, $\mathcal{F}\subset\bigcup_{\theta\in\Theta_{\bar{n}}} B_\epsilon^V(\mu_\theta)$, where by construction $|\Theta_{\bar{n}}|<\infty$. Then $\mathcal{F}$ is totally bounded, which concludes the proof. ☐

8.4. Theorem 3

Proof. 
From Lemma 3, for any arbitrary sequence $(\epsilon_n)_{n\ge 1}$,
$$V(\mu_{\hat\theta_{\epsilon_n}(X^n)},\mu_\theta) \le 3\epsilon_n + 4\sup_{B\in\mathcal{A}_{\epsilon_n}}\left|\hat\mu_n(B) - \mu_\theta(B)\right|,$$
with $\mathcal{A}_{\epsilon_n}$ the Yatracos class of the skeleton $\mathcal{G}_{\epsilon_n}$. It is clear that, for every $\epsilon>0$, $\mathcal{A}_\epsilon\subset\mathcal{A}_\Theta$. Then, by monotonicity, $\mathbb{E}\left[\sup_{B\in\mathcal{A}_\epsilon}\left|\hat\mu_n(B)-\mu(B)\right|\right] \le \mathbb{E}\left[\sup_{B\in\mathcal{A}_\Theta}\left|\hat\mu_n(B)-\mu(B)\right|\right]$ for all $\epsilon>0$ and any distribution $\mu\in\mathcal{P}(\mathcal{X})$. Here is where we use the assumption that $\mathcal{A}_\Theta$ has finite VC dimension $J$, which implies from [30] (Theorem 3.1) that
$$\sup_{\mu\in\mathcal{F}}\mathbb{E}\left[\sup_{B\in\mathcal{A}_\Theta}\left|\hat\mu_n(B)-\mu(B)\right|\right] \le c\,\sqrt{\frac{J}{n}}$$
for some constant $c>0$. Substituting this result in (48), the argument concludes by choosing $(\epsilon_n) = (1/\sqrt{n})$, a solution that achieves the intended rate of convergence for $\sup_{\mu_\theta\in\mathcal{F}}\mathbb{E}\left[ V(\mu_{\hat\theta_{1/\sqrt{n}}(X^n)},\mu_\theta)\right]$. Finally, the rate of the learning rule is $\frac{\log_2(N_{1/\sqrt{n}})}{n}$, which tends to zero by the last hypothesis.
For the almost-sure convergence part, with $\epsilon_n = 1/\sqrt{n}$, it is sufficient to show that the second term on the right hand side (RHS) of (48) is $O(\sqrt{\log n/n})$ $P_\mu$-almost surely. From the fact that $\mathcal{A}_\Theta$ has finite VC dimension (Definition A1), and from the classical VC inequality [30] (Corollary 4.1 and Theorem 3.1) and [45] (Chapter 12.4), it follows that
$$\mathbb{P}\left(\sup_{B\in\mathcal{A}_{\epsilon_n}}\left|\hat\mu_n(B)-\mu_\theta(B)\right| > \delta\right) \le 8(n+1)^J\cdot e^{-\frac{n\delta^2}{32}},$$
for all $n\ge 1$ and $\delta>0$. Then, considering $a_n = \sqrt{\log n/n}$ and $M^2/32 > J+2$,
$$\mathbb{P}\left(\sup_{B\in\mathcal{A}_{\epsilon_n}}\left|\hat\mu_n(B)-\mu_\theta(B)\right| > M\cdot a_n\right) \le 8(n+1)^J\, n^{-M^2/32} \le \frac{K}{n^2}$$
for some $K>0$; hence $\sum_{n\ge 1}\mathbb{P}\left(\frac{1}{a_n}\sup_{B\in\mathcal{A}_{\epsilon_n}}\left|\hat\mu_n(B)-\mu_\theta(B)\right| > M\right) < \infty$. Then, from the Borel-Cantelli Lemma, $\limsup_{n\to\infty}\frac{1}{a_n}\sup_{B\in\mathcal{A}_{\epsilon_n}}\left|\hat\mu_n(B)-\mu_\theta(B)\right| \le M$ $P_\mu$-almost surely, which concludes the proof of this part. As $(a_n)$ is $o(1)$, this result implies the almost-sure convergence to zero of $V(\mu_{\hat\theta_{\epsilon_n}(X^n)},\mu_\theta)$ as $n$ goes to infinity.
Finally, using similar arguments, it is possible to show that $V(\mu_{\hat\theta_{\epsilon_n}(X^n)},\mu_\theta)$ is $o(1/n^\tau)$ $P_\mu$-almost surely for any $\tau\in(0,1/2)$. ☐

8.5. Lemma 1

Proof. 
First, note that $\Theta$ is contained in a compact set $\prod_{i=1}^k[-L,L]\subset\mathbb{R}^k$; consequently, $\Theta$ inherits the finite covering property of a compact set, i.e., for every $\epsilon>0$ there exists a finite covering $\Theta_\epsilon = \{\theta_1^\epsilon,\ldots,\theta_{K(\epsilon)}^\epsilon\}\subset\Theta$ such that
$$\Theta\subset\bigcup_{\theta\in\Theta_\epsilon} B_\epsilon(\theta) = \bigcup_{i=1}^{K(\epsilon)} B_\epsilon(\theta_i^\epsilon).$$
On the other hand, from the locally uniformly Lipschitz assumption on $I_\mathcal{F}:\Theta\to\mathcal{F}$, there exist $r>0$ and $m>0$ such that $V(\mu_\theta,\mu_\phi)\le m\|\theta-\phi\|$ for all $\theta\in\Theta$ and $\phi\in B_r(\theta)$. Then, by considering $\epsilon_o < r$, it follows by construction of $\Theta_{\epsilon_o}$ that
$$\mathcal{F}\subset\bigcup_{i=1}^{K(\epsilon_o)} I_\mathcal{F}\big(B_{\epsilon_o}(\theta_i^{\epsilon_o})\big) \subset \bigcup_{i=1}^{K(\epsilon_o)} B^V_{m\cdot\epsilon_o}\big(\mu_{\theta_i^{\epsilon_o}}\big),$$
where $B^V_\delta(\mu) = \{v\in\mathcal{P}(\mathcal{X}): V(v,\mu)<\delta\}$ is the ball centered at $\mu\in\mathcal{P}(\mathcal{X})$ induced by the total variational distance, and the last inclusion stems from the Lipschitz condition. Hence, from (51), for every $\epsilon>0$ there exist $M(\epsilon) = K(\min\{\epsilon/m, r\}) < \infty$ and $\mu_1^\epsilon,\ldots,\mu_{M(\epsilon)}^\epsilon\in\mathcal{P}(\mathcal{X})$ such that $\mathcal{F}\subset\bigcup_{i=1}^{M(\epsilon)} B^V_\epsilon(\mu_i^\epsilon)$, which proves the result.
For the final part, let $(m,r)$ be the uniform parameters that characterize the Lipschitz condition of $I_\mathcal{F}(\cdot)$ (Definition 11). Without loss of generality, let us assume the critical regime where $\epsilon/m < r$; hence, from (51), $N_\epsilon$ is upper bounded by $K(\epsilon/m)$, which is a covering number of $\Theta$. As $\Theta\subset\prod_{i=1}^k[-L,L]\subset\mathbb{R}^k$, we work with a uniform partition of $\prod_{i=1}^k[-L,L]$ to find a bound for $K(\epsilon/m)$. Let $\bar\epsilon = \epsilon/m$; then, inducing a product-type partition, where each coordinate is divided into $\lceil Lk/\bar\epsilon\rceil$ cells of uniform length, we obtain the required $\bar\epsilon$-covering. The number of prototypes is $O\big((Lk)^k\,\bar\epsilon^{-k}\big)$, which is $O(1/\epsilon^k)$ as a function of $\epsilon$ ($\epsilon = \bar\epsilon\cdot m$).
To clarify the constructive nature of the ϵ -covering used to prove this result, an algorithm with the basic steps of the construction of this practical covering is sketched in Appendix C. ☐

8.6. Theorem 4

Proof. 
Let $\tilde{\mathcal{G}}_{\epsilon}\subset\mathcal{F}$ be the $\epsilon$-covering induced from the uniform partition of $\Theta$ presented in Lemma 1. From this we can construct the minimum-distance estimate in (42) adopting the Yatracos class of $\tilde{\mathcal{G}}_{\epsilon}$ (with index set $\tilde{\Theta}_{\epsilon}$), i.e., $\tilde{\mathcal{A}}_{\epsilon}$, which, from (39), yields
$$\tilde{\theta}_{\epsilon}(X^{n})\;\in\;\arg\min_{\theta_{i}^{\epsilon}\in\tilde{\Theta}_{\epsilon}}\ \sup_{B\in\tilde{\mathcal{A}}_{\epsilon}}\left|\mu_{\theta_{i}^{\epsilon}}(B)-\hat{\mu}_{n}(B)\right|.$$
Considering $\epsilon_{n}=1/\sqrt{n}$, from (45) it follows that
$$\sup_{\mu_{\theta}\in\mathcal{F}}\ \mathbb{E}\left\{V\!\left(\mu_{\tilde{\theta}_{\epsilon_{n}}(X^{n})},\mu_{\theta}\right)\right\}\;\le\;\frac{3}{\sqrt{n}}+\sqrt{\frac{8\log\!\left(2\,|\tilde{\mathcal{G}}_{1/\sqrt{n}}|^{2}\right)}{n}}.$$
The latter upper bound is asymptotically dominated by $\sqrt{\log n/n}$, from the fact that $\log|\tilde{\mathcal{G}}_{1/\sqrt{n}}|$ is $O(k\log n)$ (Lemma 1), which proves the assertions made in (26).
Concerning part (ii), using the arguments presented in the proof of Theorem 3, we obtain that, for all $\epsilon>0$,
$$\sup_{\mu_{\theta}\in\mathcal{F}}\ \mathbb{E}\left\{V\!\left(\mu_{\tilde{\theta}_{\epsilon}(X^{n})},\mu_{\theta}\right)\right\}\;\le\;3\epsilon+4c\sqrt{\frac{J}{n}}.$$
From this point, the proof follows from the arguments of Theorem 3 and the fact that $\log_{2}|\tilde{\mathcal{G}}_{1/\sqrt{n}}|$ is $O\!\left(\tfrac{k}{2}\log_{2}n\right)$. ☐
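To make the selection rule in (42) concrete, the following is a minimal Python sketch of the minimum-distance estimate over a finite collection of candidates. It assumes, purely for illustration, that the candidate densities and the empirical measure are discretized over a common finite grid of cells; the function name and this discretization are not part of the construction above.

```python
import numpy as np

def minimum_distance_select(candidates, samples):
    """Minimum-distance (Yatracos) selection over a finite class.

    candidates: (m, n_cells) array; row i is a candidate pmf over a common
                finite grid of cells (illustrative discretization, m >= 2).
    samples:    1-D array of observed cell indices drawn from the unknown source.
    Returns the index i minimizing
        sup_{A in Yatracos class} |mu_i(A) - mu_hat_n(A)|.
    """
    cand = np.asarray(candidates, dtype=float)
    m, n_cells = cand.shape
    emp = np.bincount(np.asarray(samples), minlength=n_cells) / len(samples)

    # Yatracos class: A_{j,l} = {cells where candidate j puts more mass than l}.
    masks = [cand[j] > cand[l] for j in range(m) for l in range(m) if j != l]

    scores = np.empty(m)
    for i in range(m):
        scores[i] = max(abs(cand[i][A].sum() - emp[A].sum()) for A in masks)
    return int(np.argmin(scores))
```

In the setting of this proof, the rows of `candidates` would play the role of the skeleton $\tilde{\mathcal{G}}_{\epsilon_{n}}$ (suitably discretized), and the returned index that of $\tilde{\theta}_{\epsilon_{n}}(X^{n})$.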

Author Contributions

Conceptualization, J.F. Silva and M.S. Derpich; Methodology, J.F. Silva and M.S. Derpich; Formal Analysis, J.F. Silva and M.S. Derpich; Investigation and Results, J.F. Silva and M.S. Derpich; Writing—Original Draft Preparation, J.F. Silva and M.S. Derpich; Writing—Review & Editing, J.F. Silva and M.S. Derpich; Project Administration, J.F. Silva; Funding Acquisition, J.F. Silva.

Funding

This work was supported by funding from FONDECYT Grants 1170854 and 1171059, CONICYT-Chile, and by the Advanced Center for Electrical and Electronic Engineering (AC3E), Basal Project FB0008. In addition, J.F. Silva acknowledges support from Project Anillos ACTI 1405, CONICYT-Chile.

Acknowledgments

We want to thank the anonymous reviewers for their constructive comments, which were instrumental in improving the technical content and organization of this work. We thank Diane Greenstein for editing and proofreading all this material and Sebastian Espinosa for preparing Figure 1.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of (22) and (23)

First, we show that the zero-rate skeleton estimate $\Pi((\epsilon_{n}))=\left\{(f_{n,\epsilon_{n}},\phi_{n,\epsilon_{n}}):n\ge 1\right\}$ proposed in (40) and (41) is also strongly consistent.
Proposition A1.
$\Pi((\epsilon_{n}))=\left\{(f_{n,\epsilon_{n}},\phi_{n,\epsilon_{n}}):n\ge 1\right\}$ is strongly consistent, i.e., for any $\mu\in\mathcal{F}$,
$$\lim_{n\rightarrow\infty}V\!\left(\mu_{\hat{\theta}_{\epsilon_{n}}(X^{n})},\mu\right)=0,\quad P_{\mu}\text{-almost surely}.$$
Proof. 
Let us consider the skeleton estimate $\mu_{\hat{\theta}_{\epsilon_{n}}(X^{n})}$, where the sequence was chosen by the rule $\epsilon_{n}=\inf\left\{\epsilon>0:\log(2N_{\epsilon}^{2})\le\sqrt{n}\right\}$. Then $\log N_{\epsilon_{n}}^{2}\le(\sqrt{n}-\log 2)\le\sqrt{n}$ for all $n$. From Lemma 3, $V(\mu_{\hat{\theta}_{\epsilon_{n}}(X^{n})},\mu)\le 3\epsilon_{n}+4\sup_{B\in\mathcal{A}_{\epsilon_{n}}}\left|\hat{\mu}_{n}(B)-\mu(B)\right|$. As by construction $(\epsilon_{n})$ is $o(1)$, we only need to concentrate on the estimation error term. Applying Hoeffding's inequality [44], for all $\delta>0$,
$$P\left\{\sup_{B\in\mathcal{A}_{\epsilon_{n}}}\left|\hat{\mu}_{n}(B)-\mu(B)\right|>\delta\right\}\;\le\;2\,N_{\epsilon_{n}}^{2}\,e^{-2n\delta^{2}}\;\le\;2\,e^{\left(\sqrt{n}/\log_{2}e\right)-2n\delta^{2}},$$
where the last bound is summable in $n$ for every fixed $\delta>0$; hence, from the Borel-Cantelli lemma [46,47], the estimation error converges to zero almost surely. ☐
Finally, considering the inequality in (37), we have that $D_{\mu}(\mathcal{C}_{n},\phi_{n}\,|\,Z^{n})-D_{\mu}^{n}(R)\le 2^{1/p+1}\,d_{\mathrm{max}}\cdot V(\mu_{\hat{\theta}_{n}(Z^{n})},\mu)$, for all $\mu\in\mathcal{F}$, which concludes the argument.
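To illustrate the zero-rate behaviour of the schedule used in Proposition A1, the following minimal Python sketch evaluates the rule $\epsilon_{n}=\inf\{\epsilon>0:\log(2N_{\epsilon}^{2})\le\sqrt{n}\}$ numerically. The covering-number model $N_{\epsilon}=\lceil C/\epsilon^{k}\rceil$, the values of $C$ and $k$, and the use of base-2 logarithms are assumptions of the sketch only, chosen to mimic the $O(1/\epsilon^{k})$ scaling of Lemma 1.

```python
import math

def N(eps, k=2, C=100.0):
    # Illustrative covering-number model N_eps = ceil(C / eps**k).
    return math.ceil(C / eps ** k)

def epsilon_n(n, k=2, C=100.0):
    """Smallest eps on a crude search grid with log2(2 * N_eps**2) <= sqrt(n);
    the true infimum may be smaller than the grid resolution allows."""
    grid = [10.0 * (0.99 ** i) for i in range(2000)]
    feasible = [e for e in grid if math.log2(2 * N(e, k, C) ** 2) <= math.sqrt(n)]
    return min(feasible) if feasible else None

for n in (10 ** 2, 10 ** 4, 10 ** 6):
    e = epsilon_n(n)
    # Both eps_n and the per-letter rate log2(N_eps_n)/n decrease with n.
    print(n, e, math.log2(N(e)) / n)
```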

Appendix B. Basic Definitions of Vapnik and Chervonenkis Theory

Let $\mathcal{C}\subset\mathcal{B}(\mathcal{X})$ be a collection of measurable events, and let $x^{n}=(x_{1},\ldots,x_{n})$ be a sequence of $n$ points in $\mathcal{X}^{n}$. We define $S(\mathcal{C},x^{n})$ as the number of different sets in
$$\left\{\{x_{1},x_{2},\ldots,x_{n}\}\cap B:B\in\mathcal{C}\right\},$$
and the shatter coefficient of $\mathcal{C}$ by [40,45]
$$S_{n}(\mathcal{C})\;=\;\sup_{x^{n}\in\mathcal{X}^{n}}S(\mathcal{C},x^{n}).$$
The shatter coefficient is an indicator of the richness of $\mathcal{C}$ for dichotomizing a finite sequence of points in the space, where by definition $S_{n}(\mathcal{C})\le 2^{n}$.
Definition A1.
The first time (in the index $n$) at which $S_{n}(\mathcal{C})$ is strictly less than $2^{n}$ is called the Vapnik and Chervonenkis (VC) dimension of $\mathcal{C}$ [45]. If $\mathcal{C}$ has a finite VC dimension, then it is called a VC class; otherwise, if $S_{n}(\mathcal{C})=2^{n}$ for all $n\ge 1$, the class is said to have an infinite VC dimension.
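As a small numerical illustration of these definitions, the sketch below counts $S(\mathcal{C},x^{n})$ by brute force for the class of half-lines $(-\infty,a]$ on the real line; the choice of class, sample points, and thresholds is purely illustrative.

```python
def picked_subsets(indicators, points):
    """S(C, x^n): number of distinct subsets of `points` picked out by the
    events in C, with C given as a finite list of indicator functions."""
    return len({tuple(f(x) for x in points) for f in indicators})

# Class of half-lines (-inf, a]: thresholds placed between (and around) the
# sample points realize every dichotomy this class can produce on them.
points = [0.3, 1.7, 2.2, 5.0]
thresholds = [-1.0] + [(a + b) / 2 for a, b in zip(points, points[1:])] + [10.0]
indicators = [(lambda x, a=a: x <= a) for a in thresholds]

print(picked_subsets(indicators, points))  # 5 = n + 1, strictly less than 2**n = 16
```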

Appendix C. Pseudo Algorithm to Implement the Practical ϵ-Covering Presented in Lemma 1

Under the parametric assumptions of Lemma 1, we recognize four structural parameters that characterize $\mathcal{F}$: $k$, the dimension of the Euclidean space that contains $\Theta$; $L>0$, associated with the assumption that $\Theta\subset\prod_{i=1}^{k}[-L,L]$; and $(r,m)$, the parameters associated with the locally Lipschitz assumption on $I_{\mathcal{F}}$. Given these four parameters $(k,L,m,r)$ and $\epsilon>0$, the constructive $\epsilon$-covering presented in the proof of Lemma 1 can be implemented in the following steps (a minimal code sketch of these steps is given after the list):
  • In each of the $k$ dimensions of $\Theta$, the interval $[-L,L]$ is partitioned uniformly into sub-intervals of length $2\epsilon/(m\sqrt{k})$. This produces a scalar quantization of $[-L,L]$ with $m\sqrt{k}L/\epsilon$ prototypes per coordinate.
  • A product partition of $\prod_{i=1}^{k}[-L,L]$ is made with the scalar quantizations of the previous step. From the proof of Lemma 1, this is an $\epsilon/m$-covering of $\Theta$ with $K=\left(m\sqrt{k}L/\epsilon\right)^{k}$ prototypes. Let us denote this set by $\left\{\theta_{i},i=1,\ldots,K\right\}\subset\Theta$.
  • From the proof of Lemma 1, the covering of $\Theta$ constructed in the previous step induces an $\epsilon$-covering of $\mathcal{F}$ by applying the indexing function $I_{\mathcal{F}}$, i.e., $\left\{I_{\mathcal{F}}(\theta_{i}):i=1,\ldots,K\right\}$.
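The following minimal Python sketch implements the grid construction over $\Theta$ described in these steps. The mapping $I_{\mathcal{F}}$, which turns the grid into an $\epsilon$-covering of $\mathcal{F}$, depends on the family at hand and is therefore left abstract; the function name and the parameter values in the usage line are illustrative assumptions.

```python
import itertools
import math
import numpy as np

def epsilon_covering_grid(k, L, m, eps):
    """Uniform product grid over Theta subset of [-L, L]^k whose cells have
    side length at most 2*eps/(m*sqrt(k)), so the cell centers form an
    (eps/m)-covering of Theta in the Euclidean norm."""
    cells_per_dim = math.ceil(m * math.sqrt(k) * L / eps)  # prototypes per coordinate
    side = 2.0 * L / cells_per_dim                         # <= 2*eps/(m*sqrt(k))
    centers_1d = [-L + side * (i + 0.5) for i in range(cells_per_dim)]
    # K = cells_per_dim**k prototypes in Theta; applying I_F to each row
    # would yield the induced eps-covering of F.
    return np.array(list(itertools.product(centers_1d, repeat=k)))

prototypes = epsilon_covering_grid(k=2, L=1.0, m=3.0, eps=0.5)
print(prototypes.shape)  # (K, k) with K = O(1/eps**k)
```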

References

  1. Csiszár, I.; Shields, P.C. Information Theory and Statistics: A Tutorial; Now Inc.: Houston, TX, USA, 2004.
  2. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley Interscience: New York, NY, USA, 2006.
  3. Gyorfi, L.; Pali, I.; van der Meulen, E. There is no universal source code for an infinite source alphabet. IEEE Trans. Inf. Theory 1994, 40, 267–271.
  4. Davisson, L.D. Universal noiseless coding. IEEE Trans. Inf. Theory 1973, 19, 783–785.
  5. Kieffer, J.C. A unified approach to weak universal source coding. IEEE Trans. Inf. Theory 1978, 24, 674–682.
  6. Rissanen, J. Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Theory 1984, 30, 629–636.
  7. Boucheron, S.; Garivier, A.; Gassiat, E. Coding on countably infinite alphabets. IEEE Trans. Inf. Theory 2009, 55, 358–373.
  8. Bontemps, D.; Boucheron, S.; Gassiat, E. About adaptive coding on countable alphabets. IEEE Trans. Inf. Theory 2014, 60, 808–821.
  9. Bontemps, D. Universal coding on infinite alphabets: exponentially decreasing envelopes. IEEE Trans. Inf. Theory 2011, 57, 1466–1478.
  10. Silva, J.F.; Piantanida, P. Almost lossless variable-length source coding on countably infinite alphabets. In Proceedings of the 2016 IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, 10–15 July 2016; pp. 1–5.
  11. Silva, J.F.; Piantanida, P. The redundancy gains of almost lossless universal source coding over envelope families. In Proceedings of the IEEE International Symposium on Information Theory, Aachen, Germany, 25–30 June 2017; pp. 1–5.
  12. Silva, J.F.; Piantanida, P. Universal weak variable-length source coding on countable infinite alphabets. arXiv 2017, arXiv:1708.08103.
  13. Berger, T.; Gibson, J.D. Lossy source coding. IEEE Trans. Inf. Theory 1998, 44, 2693–2723.
  14. Linder, T.; Lugosi, G.; Zeger, K. Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding. IEEE Trans. Inf. Theory 1994, 40, 1728–1740.
  15. Linder, T.; Lugosi, G.; Zeger, K. Fixed-rate universal lossy source coding and rates of convergence for memoryless sources. IEEE Trans. Inf. Theory 1995, 41, 665–676.
  16. Neuhoff, D.L.; Gray, R.M.; Davisson, L.D. Fixed rate universal block source coding with a fidelity criterion. IEEE Trans. Inf. Theory 1975, 21, 511–523.
  17. Ziv, J. Coding of sources with unknown statistics-Part II: Distortion relative to a fidelity criterion. IEEE Trans. Inf. Theory 1972, 18, 389–394.
  18. Raginsky, M. Joint fixed-rate universal lossy coding and identification of continuous-alphabet memoryless sources. IEEE Trans. Inf. Theory 2008, 54, 3059–3077.
  19. Chou, P.; Effros, M.; Gray, R.M. A vector quantization approach to universal noiseless coding and quantization. IEEE Trans. Inf. Theory 1996, 42, 1109–1138.
  20. Rissanen, J. Stochastic complexity and modeling. Ann. Stat. 1986, 14, 1080–1100.
  21. Barron, A.; Rissanen, J.; Yu, B. The minimum description length principle in coding and modeling. IEEE Trans. Inf. Theory 1998, 44, 2743–2760.
  22. Barron, A.; Györfi, L.; van der Meulen, E.C. Distribution estimation consistent in total variation and in two types of information divergence. IEEE Trans. Inf. Theory 1992, 38, 1437–1454.
  23. Tao, G. Adaptive Control Design and Analysis; Wiley-IEEE Press: Hoboken, NJ, USA, 2003.
  24. Berger, T. Rate Distortion Theory; Prentice Hall: Upper Saddle River, NJ, USA, 1971.
  25. Devroye, L.; Györfi, L. Nonparametric Density Estimation: The L1 View; Wiley Interscience: New York, NY, USA, 1985.
  26. Devroye, L.; Györfi, L. Distribution and density estimation. In Principles of Nonparametric Learning; Springer: New York, NY, USA, 2001.
  27. Shannon, C.E. Coding theorems for a discrete source with a fidelity criterion. IRE Int. Conv. Rec. 1959, 4, 325–350.
  28. Gallager, R.G. Information Theory and Reliable Communication; John Wiley & Sons: Hoboken, NJ, USA, 1968.
  29. Yatracos, Y.G. Rates of convergence of minimum distance estimators and Kolmogorov's entropy. Ann. Stat. 1985, 13, 768–774.
  30. Devroye, L.; Lugosi, G. Combinatorial Methods in Density Estimation; Springer: New York, NY, USA, 2001.
  31. Silva, J.F.; Derpich, M.S. Necessary and sufficient conditions for zero-rate density estimation. In Proceedings of the Information Theory Workshop (ITW), Paraty, Brazil, 16–20 October 2011.
  32. Halmos, P.R. Measure Theory; Van Nostrand: New York, NY, USA, 1950.
  33. Scheffé, H. A useful convergence theorem for probability distributions. Ann. Math. Stat. 1947, 18, 434–458.
  34. Gersho, A.; Gray, R. Vector Quantization and Signal Compression; Kluwer Academic: Norwell, MA, USA, 1992.
  35. Gray, R.; Neuhoff, D. Quantization. IEEE Trans. Inf. Theory 1998, 44, 2325–2384.
  36. Gray, R.M. Entropy and Information Theory; Springer: New York, NY, USA, 1990.
  37. Kolmogorov, A.N.; Tikhomirov, V.M. ϵ-entropy and ϵ-capacity of sets in function spaces. Transl. Am. Math. Soc. 1961, 17, 277–364.
  38. Yatracos, Y.G. A note on L1 consistent estimation. Can. J. Stat. 1988, 16, 283–292.
  39. Vapnik, V.; Chervonenkis, A.J. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 1971, 16, 264–280.
  40. Vapnik, V. Statistical Learning Theory; John Wiley: Hoboken, NJ, USA, 1998.
  41. Dudley, R.M. Central limit theorems for empirical measures. Ann. Probab. 1978, 6, 899–929.
  42. Devroye, L.; Lugosi, G. A universally acceptable smoothing factor for kernel density estimation. Ann. Stat. 1996, 24, 2499–2512.
  43. Devroye, L.; Lugosi, G. Nonasymptotic universal smoothing factors, kernel complexity and Yatracos classes. Ann. Stat. 1997, 25, 2626–2637.
  44. Hoeffding, W. Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 1963, 58, 13–30.
  45. Devroye, L.; Györfi, L.; Lugosi, G. A Probabilistic Theory of Pattern Recognition; Springer: New York, NY, USA, 1996.
  46. Breiman, L. Probability; Addison-Wesley: Boston, MA, USA, 1968.
  47. Varadhan, S. Probability Theory; American Mathematical Society: Providence, RI, USA, 2001.
Figure 1. Illustration of Raginsky's two-stage joint source coding and modeling scheme. Top figure illustrates the coding process and the bottom figure shows the respective decoding process.
