Fixed-Rate Universal Lossy Source Coding and Model Identification: Connection with Zero-Rate Density Estimation and the Skeleton Estimator

This work demonstrates a formal connection between density estimation with a data-rate constraint and the joint objective of fixed-rate universal lossy source coding and model identification introduced by Raginsky in 2008 (IEEE TIT, 2008, 54, 3059–3077). Using an equivalent learning formulation, we derive a necessary and sufficient condition over the class of densities for the achievability of the joint objective. The learning framework used here is the skeleton estimator, a rate-constrained learning scheme that offers achievable results for the joint coding and modeling problem by optimally adapting its learning parameters to the specific conditions of the problem. The results obtained with the skeleton estimator significantly extend the context where universal lossy source coding and model identification can be achieved, allowing for applications that move from the known case of parametric collection of densities with some smoothness and learnability conditions to the rich family of non-parametric L1-totally bounded densities. In addition, in the parametric case we are able to remove one of the assumptions that constrain the applicability of the original result obtaining similar performances in terms of the distortion redundancy and per-letter rate overhead.


Introduction
Universal source coding (USC) has a long history in information theory and statistics [1][2][3][4][5]. Davisson's seminal work [4] formalized the variable-length lossless coding problem and introduced important information quantities for performance analysis [1,2]. In this lossless setting, it is well-understood that the Shannon entropy provides the minimum achievable rate (in bits per sample) [2] to code a stationary and memoryless source when the probability (model) of the source is available. When the probability of the source is not known but belongs to a family of distributions F (the so called universal source coding problem), the focus of the problem is to characterize the penalty (or redundancy in bits per sample) that an encoder and decoder pair will experience due to the lack of knowledge about the samples' probability [1]. In the lossless case, a seminal result states that the least worst-case redundancy over F (or the minimax solution of the USC problem for F) is determined by the information radius of F [1].
Building on this connection between the least worst-case redundancy and the information radius of F, numerous important results have been developed for lossless USC [1,6,7,8,9]. In particular, it is known that the information radius grows sub-linearly (with the block length) for the family of finite-alphabet stationary and memoryless sources [1], which implies the existence of a universal source code that achieves the Shannon entropy for every distribution in F as the block length tends to infinity. However, universality is not possible for the family of countably infinite alphabet stationary and memoryless sources because the information radius of this family is unbounded [3,5,7]. More recent results on lossless USC over countably infinite alphabets have restricted the analysis to specific collections of distributions (with some tail-boundedness conditions) to achieve minimax universality [7][8][9], and have also looked at weak variations of the lossless source coding setting [10][11][12].
In the fixed-rate lossy source coding problem, assuming first that the probability µ of a memoryless source is known, the performance limit of the coding problem is given by the Shannon distortion-rate function $D_\mu(R)$ [2,13]. Consequently, the universal lossy source coding problem reduces to comparing the distortion of a coding scheme (satisfying a fixed-rate constraint) with the Shannon distortion-rate function, assuming that the designer only knows that µ ∈ F. The literature on this problem is rich [3,5,14,15,16,17,18], with a first result dating back to Ziv [17], who showed the existence of a weakly minimax fixed-rate universal lossy source code for the class of stationary sources under certain assumptions about the source, the alphabet, and the distortion measure. More refined results were presented in [5,16], one of which established necessary and sufficient conditions to achieve weakly minimax universality for the class of stationary and ergodic sources. To provide a more specific analysis of universal lossy source coding, Linder et al. [14] presented a lossy USC scheme with a distortion redundancy that goes to zero as $O(\sqrt{\log\log n/\log n})$ for independent and identically distributed (i.i.d.) bounded sources. Later, Linder et al. [15] improved on this result, showing a fixed-rate lossy construction with a distortion redundancy that vanishes with n as $O(n^{-1}\log n)$ and $O(\sqrt{n^{-1}\log n})$ for finite-alphabet i.i.d. sources and bounded infinite-alphabet i.i.d. sources, respectively. Similar convergence results were obtained using a nearest-neighbor vector quantization approach in [19].
It is also understood that universal variable length lossless-source coding is connected with the problem of distribution estimation [3,6,20] as there is a one-to-one correspondence between prefix-free codes and finite-entropy discrete distributions in the finite and countable alphabet case [1,2,21]. Building on this one-to-one correspondence in the lossless case, Györfi et al. ( [3], Theorem 1) showed that the redundancy (in bits per sample) of a given code upper bounds the expected divergence between the true distribution of the source µ and the estimated distribution derived from the code. Therefore, the existence of a universal (lossless) source code for F implies the existence of a universal (distribution-free in F ) estimator of the distribution in expected (direct) information divergence [22]. This means that achieving lossless USC not only provides a lossless representation of the data, but it offers a consistent (error-free) estimator of the distribution at the receiver.
The connection between coding and distribution estimation that is evident in the lossless case is not, however, present in the (fixed-rate) lossy source coding problem. As argued in [18], a fixed-rate lossy source code does not offer a direct map with a probability distribution (model) for the source. In light of this gap between lossy codes and distributions (models) and motivated by some problems in adaptive control, where it is relevant to both compress data in a lossy way and identify the distribution of the source at the receiver [18,23], Raginsky explored the joint objective of fixed-rate universal lossy source coding and model (i.e., distribution) identification in [18].
Inspired by Rissanen's achievability construction in [6,20], Raginsky [18] proposed a new setting for the problem of fixed-rate universal lossy compression of continuous memoryless sources based on a two-stage joint coding and model (or distribution) identification framework. The two-stage scheme has two objectives: fixed-rate universal lossy source coding and source distribution (model) identification. The first objective is to transmit the data (optimally) in the classical distortion-rate sense [24], while the second objective is to learn and transmit a description (quantized version) of the source distribution (model) [25,26]. Taking ideas from statistical learning, Raginsky [18] proposed splitting the data into training and testing samples. The training data is used in the first stage of the encoding process to construct a quantized estimate of the source distribution and encode it (the first-stage bits). Then, in the second stage of the encoding process, the first-stage bits are used to pick a fixed-rate lossy source code matched to the estimated distribution, which encodes the test data (the second-stage bits). In this joint coding and modeling setting, the existence of a zero-rate consistent estimator of the density (in expected total variation) is sufficient to show the existence of a weakly minimax universal fixed-rate source coding scheme [18] (Theorem 3.2), achieving the Shannon distortion-rate function [2,24,27,28] for any given rate. This result is obtained for a wide class of single-letter bounded distortion functions and for a family of source densities (i.e., a parametric collection) with some needed smoothness and learnability conditions [18] (Theorem 3.2).
It is important to highlight that the joint coding and modeling achievability results in [18] did not degrade the performance of the source coding objective. In fact, by restricting the analysis to the source coding objective alone, the joint coding and modeling framework in [18] showed the same state-of-the-art performance as conventional two-stage universal source coding schemes (or universal vector quantizers) [14,15,19] in terms of distortion redundancy and per-letter rate overhead ($O(\sqrt{\log(n)/n})$ and $O(\log(n)/n)$, respectively) as the block length n tends to infinity. Importantly, the first-stage bits of this joint coding and modeling scheme are used to achieve model identification at the receiver with arbitrary precision in total variation (with a rate of convergence of $O(\sqrt{\log(n)/n})$ as n goes to infinity), with no extra cost in bits per letter compared with conventional fixed-rate lossy source coding methods.

Contributions of This Work
This work formally studies the interplay between density estimation under a data-rate constraint and the joint fixed-rate universal lossy source coding and modeling problem with training data or memory introduced in [18]. The first main result (Theorem 1) establishes a connection between zero-rate density-estimation and a universal joint coding and modeling scheme that achieves optimal lossy source coding (in a distortion-rate sense) and lossless model identification. This result is obtained for the general family of bounded single-letter distortions [13]. Remarkably, this connection implies that the construction of a joint coding and modeling scheme reduces to the construction of a zero-rate density estimator. From this result, the second main result (Theorem 2) stipulates a necessary and sufficient condition for the existence of a weakly minimax universal joint coding and modeling scheme. For the achievability part of this result, we used the skeleton estimator as our learning framework [29]. Using this learning framework we extend the parametric context explored in [18] to the rich non-parametric scenario of L 1 -totally bounded densities [30].
Furthermore, revisiting the parametric case studied in [18], by using the skeleton estimator we are able to remove some of the assumptions that limit the applicability of the original result. We show that the skeleton estimator matches the best performance reported in [18] in terms of the distortion redundancy and (per-letter) rate overhead, in particular obtaining rates of convergence to zero of $O(\sqrt{\log(n)/n})$ and $O(\log(n)/n)$, respectively, as the block length tends to infinity. To obtain this, our result relaxes the finite Vapnik and Chervonenkis (VC) dimension assumption considered in [18]. On the other hand, when the finite VC dimension assumption is added to the analysis, the skeleton learning scheme offers a convergence rate of $O(1/\sqrt{n})$ for the distortion redundancy as the sample length goes to infinity. In addition, the skeleton framework is implementable in the parametric case, as its minimum-distance decision is carried out on a finite number of candidates and the oracle ε-skeleton (or the ε-covering in total variation of F) [30] (Chapter 7) can be replaced by a practical uniform covering of the compact index set $\Theta \subset R^k$ (Theorem 4). It is worth noting that a preliminary version of this work (in the context of density estimation under a data-rate constraint) was presented in [31]. The rest of the paper is organized as follows: Section 2 introduces the setting of joint coding and modeling with training data. Section 3 elaborates the connections with zero-rate density estimation. Section 4 presents the main joint coding and modeling result (Theorem 2) and introduces the skeleton estimator. Section 5 revisits a special case where the distributions are indexed by a finite-dimensional bounded space (the parametric context). A summary of the results is presented in Sections 6 and 7. Finally, the proofs are presented in Section 8.

Preliminaries
The fixed-rate coding and modeling problem introduced in [18] is presented in this section. This joint coding and modeling problem will be the main focus of this work. In addition, notations and definitions used in the rest of the paper will be presented.

Basic Definitions
Let $X \in B(R^d)$ be a separable and complete subset of $R^d$, where $B(R^d)$ is the Borel sigma field. Let P(X) be the collection of probability measures on (X, B(X)), with B(X) denoting the Borel sigma field restricted to X, and let AC(X) ⊂ P(X) denote the set of probability measures absolutely continuous with respect to the Lebesgue measure λ [32]. For any µ ∈ AC(X), $g_\mu(x) = \frac{d\mu}{d\lambda}(x)$ denotes its probability density function. The total variation distance [30] of v and µ in P(X) is given by $V(\mu, v) \equiv \sup_{A \in B(X)} |\mu(A) - v(A)|$ (to avoid any confusion, if S is a set then |S| denotes its cardinality).
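For µ and v belonging to AC(X), if we define the Scheffé set for the pair (µ, v) by $A_{\mu,v} \equiv \{x \in X : g_\mu(x) > g_v(x)\} \in B(X)$, (2) then $V(\mu, v) = \mu(A_{\mu,v}) - v(A_{\mu,v}) = \frac{1}{2}\int_X |g_\mu(x) - g_v(x)|\, d\lambda(x)$ [30,33].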
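As a numerical illustration, the following minimal sketch approximates the total variation distance between two densities on [0, 1] in two ways: as half the $L_1$ distance between the densities, and as the difference of the two measures on the Scheffé set. By the standard Scheffé identity both quantities coincide. The grid discretization, the function name `total_variation_grid`, and the two example densities are assumptions made only for this illustration.

```python
import numpy as np

def total_variation_grid(g_mu, g_v, grid):
    """Approximate V(mu, v) on a uniform grid of cell midpoints.

    Returns (half_l1, scheffe): half the L1 distance between the densities,
    and mu(A) - v(A) on the Scheffe set A = {x : g_mu(x) > g_v(x)}.
    """
    dx = grid[1] - grid[0]                 # width of each grid cell
    p, q = g_mu(grid), g_v(grid)
    half_l1 = 0.5 * np.sum(np.abs(p - q)) * dx
    scheffe_set = p > q                    # indicator of A on the grid
    scheffe = np.sum((p - q)[scheffe_set]) * dx
    return half_l1, scheffe

# Midpoints of 1000 equal cells of [0, 1] (illustrative resolution).
grid = (np.arange(1000) + 0.5) / 1000
uniform = lambda x: np.ones_like(x)        # density g(x) = 1
linear = lambda x: 2.0 * x                 # density g(x) = 2x
half_l1, scheffe = total_variation_grid(uniform, linear, grid)
```

For these two densities the exact distance is 1/4, and the Scheffé set is the interval [0, 1/2).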

Fixed-Rate Universal Lossy Source Coding with Memory or Training Data
Let $\{X_n : n \ge 1\}$ be an i.i.d. stochastic process (or stationary and memoryless source), where $X_i$ takes values in $X \subset R^d$ and has a distribution µ in $F = \{\mu_\theta : \theta \in \Theta\} \subset AC(X)$. Here Θ is, in general, an index set for F. The problem of lossy source coding of a finite block of the process $X^n = (X_1, ..., X_n)$ reduces to finding a mapping (or code) $C_n(\cdot)$ from $X^n$ to $S_n$, where $S_n$ is a finite set. Given a cardinality constraint on $S_n$, the design objective is to make $C_n(X^n)$ as close as possible to $X^n$ (on average), as measured by a distortion function. The standard coding problem assumes knowledge of µ for finding the optimal code (for any finite block n) [1,2,13], as well as for characterizing the fundamental performance limits of this task as n goes to infinity [2,24,28,34,35,36].
A more realistic scenario is the universal source coding (USC) problem [2], where the source distribution µ ∈ F is unknown and a coding scheme needs to be designed optimally for the family F . Here we focus on a specific learning variation of this task introduced by Raginsky in [18], where in addition to the data that needs to be compressed and recovered (with respect to a fidelity criterion), we have a finite number of i.i.d samples following the same distribution µ and that can be used to estimate µ in the encoding process (more details of this approach in Section 2.3). This additional data can be interpreted as memory, training data, or side information about µ available at the encoder because it is data that is not required to be compressed and recovered. The existence of this memory departs from the standard zero-memory setting considered in universal source coding [1]. However, this information can be seen as a realistic assumption in the context of a sequential block by block coding of an infinite sequence, where the data is partitioned into blocks of the same finite length and compressed sequentially block by block. Then in a given stage of this sequential process, the data from previous blocks are available at the encoder (lossless) for the process compressing the current block [18].
More specifically, following the fixed-rate block coding and modeling setting introduced by Raginsky in [18], we consider an n-block coding scheme with finite memory m, where there is a distinction between the data $Z^m = (Z_1, ..., Z_m)$ that is available (as side information) to estimate the source distribution (training data) and the data $X^n$ that needs to be encoded and recovered (source or test data), under the important assumption that both data sets are i.i.d. samples of the same unknown probability µ ∈ F. A systematic exposition of this coding setting and its connection with the classical setting of zero-memory block coding is presented in [18] (Section II). Formally, let us define an (m, n)-block code by the pair $C_{m,n} \equiv (f, \phi)$, with $f : X^m \times X^n \to S_n$ and $\phi : S_n \to \hat{X}^n$. Then, given a set of training samples $z^m \in X^m$ and a finite block of the source $x^n \in X^n$, $C_{m,n}$ is the composition of an encoding function $f(z^m, x^n)$ that maps $x^n$ to an element in a finite set $S_n$ conditioned on the training data (or memory) represented by $z^m$, and a decoding function $\phi(\cdot)$ that maps a symbol $s \in S_n$ into the reproduction points $\Gamma_{C_{m,n}} \equiv \{\phi(s) : s \in S_n\}$, which we call the codebook of $C_{m,n}$. In this context, $\hat{X}$ denotes the reproduction space. As shorthand, we denote by $\hat{x}^n = C_{m,n}(x^n) = \phi(f(z^m, x^n))$ the reconstruction of $x^n$ obtained by $C_{m,n}$ and its memory $z^m$ (for simplicity, the dependency of $\hat{x}^n$ or $C_{m,n}(x^n)$ on the memory $z^m$ will be implicit in the rest of the exposition).
The rate of $C_{m,n}$ in bits per letter is given by $R(C_{m,n}) \equiv \log_2|S_n|/n$. In general, it is not possible to recover $x^n$ from $\hat{x}^n$ given the cardinality constraint on $S_n$, and thus a single-letter distortion measure $\rho : X \times \hat{X} \to R^+$ is used to quantify the n-block discrepancy by [24] $\rho_n(x^n, \hat{x}^n) \equiv \frac{1}{n}\sum_{i=1}^{n} \rho(x_i, \hat{x}_i)$. Finally, considering $X^n \sim \mu^n$ and $Z^m \sim \mu^m$, the average distortion per letter of $C_{m,n}$ given $Z^m$ is $D_\mu(C_{m,n}|Z^m) \equiv E_{X^n \sim \mu^n}\left[\rho_n(X^n, C_{m,n}(X^n)) \mid Z^m\right]$, which is a function of $Z^m$, and hence the average distortion per letter of $C_{m,n}$ is $D_\mu(C_{m,n}) \equiv E_{Z^m \sim \mu^m}\left[D_\mu(C_{m,n}|Z^m)\right]$. In universal source coding, the performance of a code, $D_\mu(C_{m,n})$, is evaluated over a collection of distributions µ ∈ F and is compared (point-wise) with the best code that can be obtained assuming that µ is known. For this analysis, we need the following definitions:
Definition 1. For a finite block length n and distribution µ ∈ F, the n-order operational distortion-rate function of µ at rate R is $D^n_\mu(R) \equiv \inf_{C_{m,n}} \{D_\mu(C_{m,n}) : R(C_{m,n}) \le R\}$. (7) In this context, the operational distortion-rate function (DRF) [2,28] is given by $D_\mu(R) \equiv \lim_{n \to \infty} D^n_\mu(R)$. (8) The celebrated Shannon lossy source-coding theorem [27] provides a single-letter theoretical characterization for $D_\mu(R)$ in (8) (also known as the Shannon DRF). A nice exposition of this result can be found in [2,24,28].
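To make the rate and distortion quantities concrete, the sketch below evaluates the rate $\log_2|S_n|/n$ and the per-letter average of a single-letter distortion for a toy zero-memory block code. The letter-wise uniform quantizer on [0, 1], the squared-error distortion, and all function names are illustrative assumptions, not the codes analyzed in the paper.

```python
import numpy as np

def rate_bits_per_letter(index_set_size, n):
    """Rate log2|S_n| / n of a block code with |S_n| = index_set_size."""
    return np.log2(index_set_size) / n

def per_letter_distortion(x, x_hat, rho):
    """n-block distortion: average of the single-letter distortion rho."""
    return float(np.mean(rho(x, x_hat)))

def uniform_quantizer(x, levels):
    """Toy code: map each letter in [0, 1] to the midpoint of its cell."""
    idx = np.minimum((x * levels).astype(int), levels - 1)
    return (idx + 0.5) / levels

n, levels = 8, 4
x = np.array([0.05, 0.30, 0.55, 0.80, 0.10, 0.45, 0.70, 0.95])
x_hat = uniform_quantizer(x, levels)

squared_error = lambda a, b: (a - b) ** 2        # rho = d^p with p = 2
rate = rate_bits_per_letter(levels ** n, n)      # |S_n| = levels^n
dist = per_letter_distortion(x, x_hat, squared_error)
```

Here each letter is described with two bits, so the rate is 2 bits per letter, and no letter is reproduced with an error larger than half a cell width.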
It is worth noting that the operational distortion-rate function in (7) is equivalent to the classical zero-memory n-order operational distortion-rate function given by $\inf_{C_{0,n}} \{D_\mu(C_{0,n}) : R(C_{0,n}) \le R\}$ [18] (Lemma 2.1). Hence, allowing a nonzero memory (side information at the encoder) does not help in the minimization of the distortion when µ is known.
For the rest of the exposition, we concentrate on the simple case studied in [18] where n = m (i.e., the block length is equal to the memory of the code). To be precise about the meaning of universality in this context, we resort to some standard definitions:
Definition 2. A coding scheme $\{C_{n,n} : n \ge 1\}$ is weakly minimax universal for the class F at rate R if, ∀µ ∈ F, $\lim_{n \to \infty} D_\mu(C_{n,n}) = D_\mu(R)$ (9) and $\limsup_{n \to \infty} R(C_{n,n}) = \limsup_{n \to \infty} \log_2|S_n|/n \le R$. Alternatively, the scheme is said to be strongly minimax universal for the class F at rate R if $\lim_{n \to \infty} \sup_{\mu \in F} \left(D_\mu(C_{n,n}) - D_\mu(R)\right) = 0$ and $\limsup_{n \to \infty} R(C_{n,n}) \le R$.
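The distortion redundancy can be decomposed in two terms: $D_\mu(C_{n,n}) - D_\mu(R) = \left(D_\mu(C_{n,n}) - D^n_\mu(R)\right) + \left(D^n_\mu(R) - D_\mu(R)\right)$. (11) The first term, $D_\mu(C_{n,n}) - D^n_\mu(R)$, is the n-order distortion redundancy, which is the discrepancy that can be attributed exclusively to the goodness of the coding scheme. The second term in (11), i.e., $D^n_\mu(R) - D_\mu(R)$, has to do with how fast $D^n_\mu(R)$ converges to the Shannon DRF as the block length tends to infinity (see further details in [14] (Section III) and references therein). From this observation, we introduce the following definition:
Definition 3. A coding scheme $\{C_{n,n} : n \ge 1\}$ is strongly finite-block universal for the class F at rate R if $\lim_{n \to \infty} \sup_{\mu \in F} \left(D_\mu(C_{n,n}) - D^n_\mu(R)\right) = 0$ and $\limsup_{n \to \infty} R(C_{n,n}) \le R$.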
Note that if {C n,n : n ≥ 1} is strongly minimax universal then it is strongly finite-block universal, but the converse result is not true in general. The missing condition to make these two criteria equivalent is the uniform convergence of D n µ (R) to D µ (R) in the class F . More discussion about this point in Section 6.

Raginsky's Two-Stage Joint Universal Coding and Modeling
Motivated by the work of Rissanen [6], Raginsky [18] proposed a two-stage block code with finite memory (training data), with the objective of performing both fixed-rate lossy source coding and identification of the source distribution at the receiver. More precisely, given $Z^n \sim \mu^n_\theta$ and $X^n \sim \mu^n_\theta$ (the training and the source-data samples, respectively), an (n, n)-joint coding and modeling rule is given by $C_{n,n} \equiv \left\{ f_n : X^n \to \tilde{S}_n,\ \phi_n : \tilde{S}_n \to \Theta,\ \left(f_{n,\tilde{s}} : X^n \to S_n,\ \phi_{n,\tilde{s}} : S_n \to \hat{X}^n\right);\ \tilde{s} \in \tilde{S}_n \right\}$, (13) where $S_n$ and $\tilde{S}_n$ are finite sets that depend on n. $C_{n,n}$ processes $(Z^n, X^n)$ in two stages. In the first stage, the pair $(f_n, \phi_n)$ in (13) uses $Z^n$ to do density estimation and finite-rate encoding (quantization) by $f_n(Z^n)$, and $\phi_n(\cdot)$ decodes an estimated density in $\{\phi_n(\tilde{s}) : \tilde{s} \in \tilde{S}_n\} \subset \Theta$. At the end, the first stage provides a quantized estimate of $\mu_\theta \in F$ given by $\mu_{\hat{\theta}_n(Z^n)}$, with $\hat{\theta}_n(Z^n) \equiv \phi_n(f_n(Z^n))$. (14) Using the index $\tilde{s} = f_n(Z^n) \in \tilde{S}_n$, the second stage of $C_{n,n}$, represented by $\{(f_{n,\tilde{s}}, \phi_{n,\tilde{s}}) : \tilde{s} \in \tilde{S}_n\}$ in (13), encodes and decodes the source data $X^n$ by $C_{n,n}(X^n) \equiv \phi_{n,\tilde{s}}(f_{n,\tilde{s}}(X^n))$. (15) In summary, the outcome of the whole encoding process is the concatenation of the bits that represent $f_n(Z^n)$ (first-stage bits) and the bits that represent $f_{n, f_n(Z^n)}(X^n)$ (second-stage bits). The decoding process, on the other hand, reads the first-stage bits to recover $\hat{\theta}_n(Z^n)$ and then reads the second-stage bits to recover $C_{n,n}(X^n)$ (see Figure 1, in which this process is illustrated). The rate (in bits per letter) of $C_{n,n}$ is $R(C_{n,n}) = \left(\log_2|\tilde{S}_n| + \log_2|S_n|\right)/n$.
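The two-stage structure can be sketched with a toy example. Everything specific below is an assumption made for illustration and is not the construction of [18]: the family is Uniform[0, θ] with θ in [0.5, 1], the first stage snaps a crude estimate (the sample maximum) to a finite grid of candidate parameters, and the second stage uses a uniform quantizer matched to the decoded parameter. For brevity each stage is written as the composition of its encoder and decoder.

```python
import numpy as np

def first_stage(z, theta_grid):
    """First-stage index: nearest grid point to a crude estimate of theta."""
    theta_hat = z.max()                 # crude estimate for Uniform[0, theta]
    return int(np.argmin(np.abs(theta_grid - theta_hat)))

def second_stage(x, theta, levels):
    """Matched code: uniform quantizer on [0, theta], encoder then decoder."""
    idx = np.clip((x / theta * levels).astype(int), 0, levels - 1)
    return (idx + 0.5) * theta / levels

def two_stage_code(z, x, theta_grid, levels):
    """Return the decoded model, the reconstruction, and the rate."""
    s_tilde = first_stage(z, theta_grid)        # first-stage bits
    theta_q = theta_grid[s_tilde]               # model recovered at decoder
    x_hat = second_stage(x, theta_q, levels)    # second-stage bits, decoded
    n = len(x)
    # Rate: (log2 |first-stage index set| + log2 |second-stage index set|) / n
    rate = (np.log2(len(theta_grid)) + n * np.log2(levels)) / n
    return theta_q, x_hat, rate

rng = np.random.default_rng(0)
theta_grid = np.linspace(0.5, 1.0, 16)          # 16 candidate models
z = 0.8 * rng.random(64)                        # training data, theta = 0.8
x = 0.8 * rng.random(64)                        # source data, same theta
theta_q, x_hat, rate = two_stage_code(z, x, theta_grid, levels=4)
```

With 16 candidate models and n = 64, the first stage adds only 4/64 bits per letter to the 2 bits per letter of the second stage, which illustrates why the model description can have a vanishing per-letter cost.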

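[Figure 1: encoding process and decoding process, each with a first stage and a second stage.]

Based on this two-stage scheme, we can simultaneously achieve source coding and density estimation (modeling) at the decoder. This new joint coding and modeling objective motivates the following definition:
Definition 4. A joint coding and modeling scheme $\{C_{n,n} : n \ge 1\}$ is strongly minimax universal for the class F at rate R if $\lim_{n \to \infty} \sup_{\mu \in F} \left(D_\mu(C_{n,n}) - D^n_\mu(R)\right) = 0$, $\lim_{n \to \infty} \sup_{\mu \in F} E_{Z^n \sim \mu^n}\left(V(\mu_{\hat{\theta}_n(Z^n)}, \mu)\right) = 0$, and $\limsup_{n \to \infty} R(C_{n,n}) \le R$.
Consequently, if $\{C_{n,n} : n \ge 1\}$ is strongly minimax universal for F, it follows that, as n tends to infinity, density estimation is achieved at the decoder (in expected total variation) and, from the source coding perspective, $\{C_{n,n} : n \ge 1\}$ is strongly finite-block universal for F in the sense of Definition 3. For the rest of the paper, the strong minimax universality of Definition 4 will be the main coding and modeling objective.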

Connections with Zero-Rate Density Estimation
This section formalizes a connection between the objective of joint coding and modeling (declared in Definition 4) and a problem of zero-rate density estimation.

Density Estimation with a Rate Constraint
Let us first introduce the problem of rate constrained density estimation. Let F = {µ θ : θ ∈ Θ} ⊂ AC(X) be an indexed collection of densities as introduced in Section 2.2.
Definition 5. An $(n, 2^{nR})$ learning rule of length n and rate R for F is a pair of functions $(f, \phi)$, with $f : X^n \to S$ and $\phi : S \to \Theta$, where S is a finite set and $|S| \le 2^{nR}$. The composition of these two functions, $\pi = \phi \circ f : X^n \to \Theta$, defines the rate-constrained learning rule for F taking values in the codebook $\{\phi(s) : s \in S\} \subset \Theta$, where $R(\pi) = \log_2(|S|)/n$ denotes its description complexity in bits per training sample.
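Definition 6. A learning scheme $\Pi = \{(f_n, \phi_n) : n \ge 1\}$, where $(f_n, \phi_n)$ is a learning rule of length n with $\pi_n = \phi_n \circ f_n$, is uniformly consistent for F at rate R if $\limsup_{n \to \infty} R(\pi_n) \le R$ and $\lim_{n \to \infty} \sup_{\mu \in F} E_{Z^n \sim \mu^n}\left(V(\mu_{\pi_n(Z^n)}, \mu)\right) = 0$, (18) where $Z_1, Z_2, \ldots$ on the left-hand side (LHS) of (18) correspond to i.i.d. realizations driven by µ ∈ F. In this case, we say that Π is an R-rate uniformly consistent scheme (or estimator) for the class F.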

Proposition 1.
If for a given R > 0, {C n,n : n ≥ 1} is strongly minimax universal for the class F at the rate R (Definition 4), then its induced finite-description learning scheme obtained from the first stage in (13), i.e., Π = {( f n , φ n ) : n ≥ 1}, is a zero-rate uniformly consistent estimator for F (Definition 6).
The proof is presented in Section 8.1. Interestingly, the existence of a zero-rate uniformly consistent scheme for F is also sufficient to achieve the joint coding and modeling objective (Definition 4) if some mild conditions are adopted from the work in [18]. This is stated in the following result:
Theorem 1. Let us assume that (i) $\rho : X \times \hat{X} \to R^+$ can be expressed as $\rho(x, \hat{x}) = d(x, \hat{x})^p$, where $d(\cdot, \cdot)$ is a bounded metric on $(X \cup \hat{X}) \times (X \cup \hat{X})$ and p > 0, and (ii) for all µ ∈ F, for all n ≥ 1, and for all R > 0, there exists a (0, n)-block code, say $C^{*n}_\mu$, that achieves the n-order operational DRF $D^n_\mu(R)$ in (7).
Then the existence of a learning scheme Π = {( f n , φ n ) : n ≥ 1} that is zero-rate uniformly consistent for F implies that ∀R > 0 there exists a joint coding and modeling scheme {C n,n : n ≥ 1} that is strongly minimax universal for F at rate R (Definition 4).
The proof is presented in Section 8.2.

Remark 1.
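The construction proposed for $\{C_{n,n} : n \ge 1\}$ at any rate R > 0 (in Section 8.2) using the zero-rate density estimation scheme $\Pi = \{\pi_n = \phi_n \circ f_n : n \ge 1\}$ satisfies two bounds for all n ≥ 1: the distortion overhead bound in (19), which is controlled by $\sup_{\mu \in F} E\left(V(\mu_{\pi_n(Z^n)}, \mu)\right)$ up to a constant C > 0, and the rate overhead bound $R(C_{n,n}) - R \le R(\pi_n)$. (20) It is worth noting that these two inequalities summarize the result in Theorem 1 and, importantly, that these two bounds are independent of R.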

Remark 2.
An important consequence of the bounds in (19) and (20) is that constructing a learning scheme $\Pi = \{\pi_n : n \ge 1\}$ with specific rates of convergence for $\sup_{\mu \in F} E\left(V(\mu_{\pi_n(Z^n)}, \mu)\right)$ and $R(\pi_n)$ (as n goes to infinity) produces a joint coding and modeling scheme that achieves a uniform rate of convergence to zero (over F) of the overhead in distortion, by (19), and a uniform rate of convergence to zero of the overhead in rate, by (20). This observation is used in all the achievability results presented in Sections 4 and 5, where, consequently, the problem reduces to determining Π and expressions for $\sup_{\mu \in F} E\left(V(\mu_{\pi_n(Z^n)}, \mu)\right)$ and $R(\pi_n)$.

Joint Source Coding and Modeling Achievability Results
From the connection with zero-rate density estimation in Section 3, here we present a set of new results for the joint coding and modeling problem of Section 2.3. In these results, the general conditions (i) and (ii) stated in Theorem 1 are assumed.

Main Result: The Skeleton Density Estimator
Let us first introduce some notions from approximation theory [37].
Definition 7. Let F ⊂ AC(X) be a class of densities. We say that F is $L_1$-totally bounded if for every ε > 0 there is a finite set of elements $\{\mu_i : i = 1, ..., N\}$ in F such that $F \subset \bigcup_{i=1}^{N} B^V_\epsilon(\mu_i)$, (21) where $B^V_\epsilon(\mu) \equiv \{v \in AC(X) : V(\mu, v) < \epsilon\}$.
Definition 8. For F $L_1$-totally bounded, let $N_\epsilon$ denote the smallest positive integer that achieves the condition in (21). $N_\epsilon$ is called the ε-covering number of F and $K(\epsilon) \equiv \log_2(N_\epsilon)$ is called the Kolmogorov ε-entropy of F [30].

Definition 9.
An ε-covering $G_\epsilon$ of F such that $|G_\epsilon| = N_\epsilon$ is called an ε-skeleton of F [29].
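The covering notions above can be illustrated on a finite toy family. The sketch below builds an ε-covering greedily; a greedy covering is not guaranteed to be of minimum size, so it is not necessarily an ε-skeleton, and the logarithm of its size only upper bounds the ε-entropy of the toy family. The Bernoulli family, the value of ε, and the function names are assumptions made for illustration.

```python
import numpy as np

def tv(p, q):
    """Total variation between two pmfs: half of their L1 distance."""
    return 0.5 * float(np.abs(p - q).sum())

def greedy_cover(family, eps):
    """Keep an element as a new center only if no existing center is
    within total variation eps of it (open-ball convention V < eps)."""
    centers = []
    for p in family:
        if all(tv(p, c) >= eps for c in centers):
            centers.append(p)
    return centers

# Toy family: Bernoulli(t) for t on a fine grid, written as 2-point pmfs,
# for which the total variation between two members is |t - s|.
family = [np.array([1.0 - t, t]) for t in np.linspace(0.0, 1.0, 101)]
cover = greedy_cover(family, eps=0.1)
entropy_upper_bound = np.log2(len(cover))   # upper bound on K(eps) in bits
```

By construction every member of the family lies within ε of some center, and the centers are pairwise at least ε apart.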

Theorem 2.
There is a strongly minimax universal joint coding and modeling scheme for F at rate R for any rate R > 0 if, and only if, F is L 1 -totally bounded.
The proof is presented in Section 8.3. The achievability part of the proof of Theorem 2 relies on the skeleton estimator [29] (with its minimum-distance learning principle in (42)), which is a zero-rate uniformly consistent density estimator for F (Definition 6). Furthermore, Theorem 2 can be complemented by noting that the proposed construction $\{C_{n,n} : n \ge 1\}$ derived from the skeleton estimator satisfies, for all µ ∈ F, $\lim_{n \to \infty} D_\mu(C_{n,n}|Z^n) = D_\mu(R)$, $P_\mu$-almost surely, and $\lim_{n \to \infty} V(\mu_{\pi_n(Z^n)}, \mu) = 0$, $P_\mu$-almost surely, where $P_\mu$ is shorthand for the process distribution of $(Z_n)_{n \ge 1}$ characterized by µ ∈ F under the i.i.d. assumption. The argument is presented in Appendix A.
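Since the minimum-distance rule (42) is stated in Section 8.3 and not reproduced here, the following sketch should be read as an assumption about its form: it follows the minimum-distance estimate described in [30], which picks, among finitely many candidate densities, the one whose masses on the Scheffé sets of all candidate pairs are closest (in the worst case) to the empirical measure of the sample. The grid discretization, the three candidate densities, and the function name are illustrative.

```python
import numpy as np

def minimum_distance_index(candidates, sample, grid):
    """candidates: array (N, G) of density values at the midpoints of a
    uniform grid of [0, 1]; sample: i.i.d. observations in [0, 1)."""
    dx = grid[1] - grid[0]
    cells = np.minimum((sample / dx).astype(int), len(grid) - 1)
    n_cand = len(candidates)
    worst = np.zeros(n_cand)            # worst-case deviation per candidate
    for j in range(n_cand):
        for k in range(n_cand):
            if j == k:
                continue
            a = candidates[j] > candidates[k]          # Scheffe set on grid
            emp = np.mean(a[cells])                    # empirical measure of A
            mass = (candidates * a).sum(axis=1) * dx   # candidate masses of A
            worst = np.maximum(worst, np.abs(mass - emp))
    return int(np.argmin(worst))

grid = (np.arange(200) + 0.5) / 200
candidates = np.stack([np.ones_like(grid), 2 * grid, 2 * (1 - grid)])
rng = np.random.default_rng(0)
sample = np.sqrt(rng.random(2000))      # i.i.d. draws from the density 2x
chosen = minimum_distance_index(candidates, sample, grid)
```

Because the decision is taken over a finite list of candidates, its outcome can be described with $\log_2 N$ bits, which is what makes the estimator compatible with a rate constraint.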

Examples of $L_1$-Totally Bounded Classes
Knowing specific expressions for $K(\epsilon) = \log_2 N_\epsilon < \infty$, the skeleton estimator can be optimized by selecting its design parameter ε appropriately. In particular, the sequence $(\epsilon_n)_{n \ge 1}$ (see details in Section 8.3) is selected as the solution of the optimal balance between estimation and approximation errors (see (45) in Section 8.3), which is given by $\epsilon^*_n \equiv \inf\{\epsilon > 0 : \sqrt{\log(2N_\epsilon^2)} \le \sqrt{n}\,\epsilon\}$ [30] (Chapter 7.2). The details of this analysis are presented in Section 8.3 and [30] (Chapter 7). By doing so, an optimized zero-rate skeleton scheme $\Pi = \{(f^*_n, \phi^*_n) : n \ge 1\}$, with concrete rates of convergence for $\sup_{\mu \in F} E_{Z^n \sim \mu^n}\left(V(\mu_{\pi^*_n(Z^n)}, \mu)\right)$ and $R(\pi^*_n)$, can be obtained. From Remarks 1 and 2, these results imply specific performance results for the induced joint coding and modeling scheme. To illustrate, we present three interesting examples below.

Finite Mixture Classes
For this class, the results in [30] imply a finite-rate performance bound [30] (Chapter 7.4), with C a universal non-negative constant. The rate in bits per sample $R(\pi^*_n) = K(\epsilon^*_n)/n$ is $O(\log n/n)$.

Monotone Densities in [0, 1] d
Let F be the collection of densities with support on $[0, 1]^d$, monotonically decreasing per coordinate and bounded by a constant L > 0. This class is known to be $L_1$-totally bounded, and furthermore $K(\epsilon) \le C L^d/\epsilon^d$ [30] (Lemma 7.1), with the constant C depending only on d. From (45), $(\epsilon^*_n)$ being $O(L^{d/(d+2)}/n^{1/(d+2)})$ is optimal (see details in [26,30]) and determines the corresponding performance bound. In this case, the rate in bits per sample $R(\pi^*_n) = K(\epsilon^*_n)/n$ is $O(1/n^{2/(d+2)})$.
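The orders quoted for this class can be checked with a short calculation. The sketch below assumes that the balance between estimation and approximation errors amounts, up to constants, to equating the entropy bound with $n\epsilon^2$, and it reads the entropy bound as $K(\epsilon) \le C L^d/\epsilon^d$; constants are absorbed in the $\asymp$ notation.

```latex
\begin{align*}
\frac{C L^{d}}{\epsilon^{d}} \asymp n\,\epsilon^{2}
\;&\Longrightarrow\;
\epsilon^{d+2} \asymp \frac{L^{d}}{n}
\;\Longrightarrow\;
\epsilon_n^{*} \asymp \frac{L^{d/(d+2)}}{n^{1/(d+2)}},\\
R(\pi_n^{*}) = \frac{K(\epsilon_n^{*})}{n}
\;&\asymp\; (\epsilon_n^{*})^{2}
\;\asymp\; \frac{L^{2d/(d+2)}}{n^{2/(d+2)}} .
\end{align*}
```

For fixed L this recovers the orders $1/n^{1/(d+2)}$ for the precision and $1/n^{2/(d+2)}$ for the rate, so the rate vanishes for every dimension d, although more slowly as d grows.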

r-Moment Smooth Class in [0, 1]
Let F be the class of densities defined on the bounded support [0, 1], with r absolutely continuous derivatives (with r an integer greater than zero) and satisfying that: is $O(1/n^{1/(3+r)})$ and the rate in bits per sample $R(\pi^*_n) = K(\epsilon^*_n)/n$ is $O(1/n^{2/(3+r)})$.

Yatracos Classes with Finite VC Dimension
Looking at the distortion redundancy bound in (19), when F is $L_1$-totally bounded the fastest rate of convergence that can be achieved with the skeleton estimator proposed in Theorem 2 is $O(\sqrt{1/n})$ (see Section 8.3 and the estimation error bound in (45)). In this section, more specific density collections are studied to achieve this best rate $O(\sqrt{1/n})$ for density estimation and, from (19), for distortion redundancy. We follow the path proposed by Yatracos in [38], who explored families of distributions with a finite Vapnik and Chervonenkis (VC) dimension, the so-called VC classes [39,40]. Let us first introduce some definitions:
Definition 10 ([38]). Let $F = \{\mu_\theta : \theta \in \Theta\} \subset AC(X)$ be an indexed collection of densities. The Yatracos class for such a collection is given by $A_\Theta \equiv \{A_{\theta,\bar{\theta}} : \theta, \bar{\theta} \in \Theta, \theta \ne \bar{\theta}\}$, where $A_{\theta,\bar{\theta}} \equiv \{x \in X : g_{\mu_\theta}(x) > g_{\mu_{\bar{\theta}}}(x)\} \in B(X)$ is the Scheffé set of $\mu_\theta$ with respect to $\mu_{\bar{\theta}}$, as defined in (2).

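Theorem 3. Let us assume that
(i) F is $L_1$-totally bounded, (ii) the Yatracos class $A_\Theta$ has a finite VC dimension (Definition A1 in Appendix B), and (iii) the Kolmogorov entropy of F associated with the sequence $\epsilon_n = 1/\sqrt{n}$ grows strictly sub-linearly, i.e., $\log_2(N_{1/\sqrt{n}})$ is o(n). Then the skeleton scheme $\Pi = \{(f_n, \phi_n) : n \ge 1\}$ is a zero-rate uniformly consistent estimator for F for which $\sup_{\mu \in F} E\left(V(\mu_{\pi_n(Z^n)}, \mu)\right)$ is $O(1/\sqrt{n})$, where $\pi_n(Z^n) = \phi_n(f_n(Z^n))$ is the skeleton estimator in (42) with $\epsilon_n = 1/\sqrt{n}$. Furthermore, Π is also a zero-rate strongly consistent density estimator, where, ∀µ ∈ F, $V(\mu_{\pi_n(Z^n)}, \mu)$ is $O(\sqrt{\log n/n})$, $P_\mu$-almost surely.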
The proof is presented in Section 8.4. From Definition 7, $\log_2(N_\epsilon)$ is inversely related to ε. In fact, depending on how rich F is, $\log_2(N_\epsilon)$ can range from being $O(\log 1/\epsilon)$, through being polynomial in 1/ε, to being $O(e^{1/\epsilon})$ (see a number of examples in [30] (Chapter 7) and its references). The role of (iii) in the statement of Theorem 3 is then to bound how fast $N_\epsilon$ tends to infinity as ε goes to zero, to guarantee a zero rate in the skeleton learning scheme. It is simple to show that $N_\epsilon$ being $O(e^{(1/\epsilon)^q})$ with q ∈ [0, 2) is sufficient for $\log_2(N_{1/\sqrt{n}})$ to be o(n). This condition is satisfied by a rich collection of $L_1$-totally bounded classes in AC(X). Concrete examples are presented in [30] (Chapter 7).
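The sufficiency claim for condition (iii) can be verified directly. In the sketch below, A > 0 denotes the constant hidden in the O(·) bound on the covering number, a symbol introduced here only for the calculation.

```latex
\begin{align*}
N_\epsilon \le A\, e^{(1/\epsilon)^{q}}
&\;\Longrightarrow\;
\log_2 N_{1/\sqrt{n}} \le \log_2 A + (\log_2 e)\, n^{q/2},\\
\frac{\log_2 N_{1/\sqrt{n}}}{n}
&\le \frac{\log_2 A}{n} + (\log_2 e)\, n^{q/2-1}
\;\xrightarrow[n\to\infty]{}\; 0
\quad \text{for } q \in [0,2).
\end{align*}
```

The exponent $q/2 - 1$ is negative exactly when q < 2, which is why q = 2 is excluded from the admissible range.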

The Parametric Scenario
The results presented so far are of theoretical interest because they rely on the skeleton estimator, which is constructed from the skeleton covering of F (see Definition 9), and this covering is unknown in practice. Moving towards making the zero-rate skeleton learning scheme of practical interest, we revisit the important parametric scenario in which Θ, the index set of F, is a compact set contained in a finite-dimensional Euclidean space $R^k$. Interestingly, in this context we can consider a practical covering of F induced by the uniform partition of the parameter space Θ, as used in [18]. Unlike [18], where a minimum-distance estimate is first found and then quantized, here we first quantize the space Θ and then find the minimum-distance estimate among a finite collection of candidates (i.e., over a finite number of prototypes in Θ). Some assumptions will be needed.

Definition 11 ([18]). Let $F = \{\mu_\theta : \theta \in \Theta\}$ with $\Theta \subset R^k$. Let $I_F : \Theta \to F$ be the index function of F that maps θ to $\mu_\theta$. $I_F$ is said to be locally uniformly Lipschitz if there exist r > 0 and m > 0 such that, ∀θ ∈ Θ, ∀φ ∈ $B_r(\theta)$, $V(\mu_\theta, \mu_\phi) \le m\,\|\theta - \phi\|$, where $B_r(\theta) \subset \Theta$ denotes the ball of radius r (with respect to the Euclidean norm in $R^k$) centered at θ.
The following lemma shows that F is L 1 -totally bounded under some parametric assumptions.
) and the mapping $I_{\mathcal{F}} : \Theta \to \mathcal{F}$ is locally uniformly Lipschitz (Definition 11), then $\mathcal{F}$ is $L_1$-totally bounded. Furthermore, $N_\epsilon$ is $O(1/\epsilon^k)$ for this family.
The proof is presented in Section 8.5. It is important to note that the $\epsilon$-covering of $\mathcal{F}$ used in the proof of Lemma 1 to derive an upper bound for $N_\epsilon$ is practical (see Appendix C). This offers the possibility of implementing a practical skeleton estimator, which is the focus of the following result.

The Practical Skeleton Estimator
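Under the assumptions of Lemma 1, let $(\tilde{f}_{n,\epsilon}, \tilde{\phi}_{n,\epsilon})$ denote the learning rule of length $n$ associated with the minimum-distance principle in (42) with parameter $\epsilon$ (see details in Section 8.3), where instead of using the $\epsilon$-skeleton $G_\epsilon$ of $\mathcal{F}$ (in Definition 9), the implementable (see Appendix C) $\epsilon$-covering of $\Theta$ presented in the proof of Lemma 1 is used. This practical $\epsilon$-covering is denoted by $\tilde{G}_\epsilon$ (by definition, $N_\epsilon = |G_\epsilon| \le |\tilde{G}_\epsilon| = \tilde{N}_\epsilon \sim O(1/\epsilon^k)$, where the last part follows from Lemma 1). With this, let $\tilde{\Pi}((\epsilon_n)_{n \ge 1}) \equiv \{(\tilde{f}_{n,\epsilon_n}, \tilde{\phi}_{n,\epsilon_n}) : n \ge 1\}$ denote our practical learning scheme indexed by the precision numbers $(\epsilon_n)_{n \ge 1} \in (\mathbb{R}^+)^{\mathbb{N}}$. We are in a position to integrate Theorem 3 and Lemma 1 to state the following:

Theorem 4. Under the assumptions of Lemma 1, the practical skeleton estimator $\tilde{\Pi}((\epsilon_n)_{n \ge 1})$ with $\epsilon^*_n = 1/\sqrt{n}$ satisfies that $\sup_{\mu_\theta \in \mathcal{F}} \mathbb{E}_{Z^n \sim \mu^n_\theta} V(\mu_{\tilde{\pi}_{n,\epsilon^*_n}(Z^n)}, \mu_\theta)$ is $O(\sqrt{\log n/n})$, and $R(\tilde{\pi}_{n,\epsilon^*_n})$ is $O(\log n/n)$, where $\tilde{\pi}_{n,\epsilon}(Z^n) \equiv \tilde{\phi}_{n,\epsilon}(\tilde{f}_{n,\epsilon}(Z^n))$.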
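In addition, if the Yatracos collection $\mathcal{A}_\Theta = \{A_{\theta,\bar{\theta}} : \theta, \bar{\theta} \in \Theta, \theta \ne \bar{\theta}\}$ has a finite VC dimension equal to $J$, then $\sup_{\mu_\theta \in \mathcal{F}} \mathbb{E}_{Z^n \sim \mu^n_\theta} V(\mu_{\tilde{\pi}_{n,\epsilon^*_n}(Z^n)}, \mu_\theta)$ is $O(1/\sqrt{n})$.

The proof is presented in Section 8.6. When $\mathbb{X} \subset \mathbb{R}^d$, Raginsky [18] showed that the finite VC dimension assumption of Theorem 4 is satisfied by the class of mixture families presented in Section 4.2.1 and by a rich collection of exponential families indexed by a compact subset $\Theta$ of $\mathbb{R}^k$ (see details in [18] (Section V)).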

Summary of the Results
We summarize the results of the proposed zero-rate density estimation approach adopted for the problem of joint fixed-rate lossy source coding and modeling of continuous memoryless sources.
• Proposition 1 and Theorem 1 formalize the interplay between the two-stage joint fixed-rate coding and modeling objective and the problem of zero-rate uniformly consistent (in expected total variation) density estimation.
• Theorem 2 establishes a necessary and sufficient condition on a family of densities for the existence of a strongly minimax joint coding and modeling scheme achieving both source coding and model identification objectives (Definition 4). The result is obtained for the rich non-parametric collection of L 1 -totally bounded densities.

• For the modeling stage, we propose using the skeleton estimator, which first quantizes the data and then finds the minimum-distance decision on this finite set of density candidates (42). This is a practical solution in the sense that the inference (minimization) is carried out over a finite set.
• By introducing combinatorial regularity conditions on the family of distributions $\mathcal{F} = \{\mu_\theta : \theta \in \Theta\}$, the skeleton scheme achieves an $O(1/\sqrt{n})$ rate of convergence in the $n$-order distortion redundancy, and the same rate in the expected total variational distance for the modeling part (Theorem 3).
• Finally, for a relevant parametric setting, a practical skeleton-based joint coding and modeling scheme is proposed that achieves a rate of $O(1/\sqrt{n})$ for the $n$-order distortion redundancy (Theorem 4). This rate is slightly better than the $O(\sqrt{\log n/n})$ achieved in [18] under the same rate overhead of $O(\log n/n)$. Furthermore, Theorem 4 removes the finite-VC-dimension assumption over the Yatracos class $\mathcal{A}_\Theta$ considered in [18] (Theorem 3.2), while achieving the same performance rates in terms of $n$-order distortion redundancy, $O(\sqrt{\log n/n})$, uniform expected risk to learn the density, $O(\sqrt{\log n/n})$, and rate overhead, $O(\log n/n)$.
Concerning the last parametric result, we note that the result in [18] can be improved by the adoption of Dudley's entropy bound [41], which would yield the same asymptotic rate reported in this work for the n-order distortion redundancy.
A final remark is that under the bounded distortion metric assumption of Theorem 1 condition (i), Linder et al. [14] (Theorem 2) showed that $\forall \theta \in \Theta$, and for every $R > 0$ such that where $(r_n)$ is a sequence that converges to zero ($o(1)$) uniformly in $\Theta$. This result offers a rate of convergence of the $n$-order operational distortion-rate function to the Shannon DRF as the block length tends to infinity. In view of (11), we can adopt this result in Theorems 3 and 4 to say that the average distortion of the respective joint coding and modeling schemes at rate $R$, i.e., $D_\mu(C_{n,n})$, converges to the Shannon DRF $D_\mu(R)$ as $O(\sqrt{\log n/n})$ point-wise $\forall \mu \in \mathcal{F}$. Therefore, in the process of comparing $D_\mu(C_{n,n})$ with the Shannon DRF, we lose the $O(\sqrt{1/n})$ rate of convergence.

Conclusions
This work revisits the problem of fixed-rate universal lossy source coding and model identification with training data proposed in [18] from a learning perspective. Remarkably, we found that the problem is equivalent to the problem of density estimation of the source distribution with some concrete but non-conventional operational data-rate constraints in bits per sample. This learning problem can be seen as the task of estimating and encoding the distribution of samples at zero rate in bits per sample, while achieving a consistent estimation, in expected total variation, of the distribution after the decoding process. From our perspective, the rate-constrained density estimation problem is interesting in itself and can have relevant applications in other contexts, such as distributed learning scenarios and sensor network problems.
Importantly for the joint coding and modeling problem, the connection with density estimation provides a context for the use of the skeleton estimator proposed by Yatracos in [29]. We highlight two important implications from its use. First, we extend results about minimax universality from the parametric context explored in [30] to the rich non-parametric family of L 1 -totally bounded densities [26,30]. This result significantly expands the contexts where the joint model and coding objective can be achieved. We illustrated this with some examples in Section 4.2 and many more can be found in the literature of density estimation [26,30].
Second, in the parametric case studied in [18], we were able to remove some of the assumptions and obtain not only the same rate of convergence of the $n$-order distortion redundancy but also, under the finite VC dimension condition, a slightly better one. Therefore, the skeleton estimator, though essentially a non-parametric learning scheme, is shown to be instrumental in enriching the applicability of the joint coding and modeling framework.

Proposition 1
Proof. The fact that $\Pi$ is uniformly consistent for $\mathcal{F}$ follows directly from Definition 4. On the other hand, the rate of $\pi_n = \phi_n \circ f_n$ is $R(\pi_n) = \frac{1}{n} \log_2 |\tilde{S}_n|$. From the definition of $D^n_\mu(R)$ and the strict monotonicity of $D_\mu(R)$, it is simple to show that in order for $\lim_{n \to \infty} \sup_{\mu \in \mathcal{F}} \left( D_\mu(C_{n,n}) - D^n_\mu(R) \right) = 0$, it is required that $\limsup_{n \to \infty} \frac{1}{n} \log |S_n| > R - \epsilon$ for any $\epsilon > 0$. Then, from (16), and since $\log |\tilde{S}_n|/n = R(\pi_n)$, $\limsup_{n \to \infty} R(C_{n,n}) \le R$ implies that $\lim_{n \to \infty} R(\pi_n) = 0$.

Theorem 1
Proof. The proof builds upon the ideas elaborated in [18] (Theorem 3.2, p. 3065). Let us consider an arbitrary $R > 0$ and let $\Pi = \{(f_n, \phi_n) : n \ge 1\}$ be the zero-rate learning scheme of the assumption. Using $\Pi$, let us construct the joint coding and modeling rule of length $n$ by: $C_{n,n} = \left( f_n : \mathbb{X}^n \to \tilde{S}_n, \ \phi_n : \tilde{S}_n \to \Theta, \ \{ (f_{n,\tilde{s}} : \mathbb{X}^n \to S_n, \ \phi_{n,\tilde{s}} : S_n \to \hat{\mathbb{X}}^n) : \tilde{s} \in \tilde{S}_n \} \right)$.
Concerning the first stage of {C n,n : n ≥ 1}, it is induced directly from the coding-decoding rules of Π.
For the second stage, $\forall n \ge 1$, $\forall \tilde{s} \in \tilde{S}_n$, the pair $(f_{n,\tilde{s}}, \phi_{n,\tilde{s}})$ is picked such that $C^{*n}_{\mu_{\theta_{n,\tilde{s}}}} = \phi_{n,\tilde{s}} \circ f_{n,\tilde{s}}$, which is the optimal $n$-block code that achieves $D^n_{\mu_{\theta_{n,\tilde{s}}}}(R)$ (from the hypothesis in (ii)), with $\theta_{n,\tilde{s}} \equiv \phi_n(\tilde{s})$ short-hand for the reproduction codeword induced from the first-stage pair $(f_n, \phi_n)$, and $S_n$ satisfying the $R$-rate constraint, i.e., $|S_n| = 2^{nR}$. By construction and the fact that $\Pi$ has zero rate, $\lim_{n \to \infty} R(C_{n,n}) = R + \lim_{n \to \infty} \log_2 |\tilde{S}_n|/n = R$, so $\{C_{n,n} : n \ge 1\}$ satisfies the rate condition. On the other hand, based on the assumption that $\Pi$ is zero-rate uniformly consistent, it follows that $\lim_{n \to \infty} \sup_{\mu \in \mathcal{F}} \mathbb{E}_{Z^n \sim \mu^n} V(\mu_{\hat{\theta}_n(Z^n)}, \mu) = 0$ (30), where $\hat{\theta}_n(Z^n) = \phi_n(f_n(Z^n))$. Then $\{C_{n,n} : n \ge 1\}$ achieves the modeling objective. Concerning the coding objective, we use the following key result:

Lemma 2 ([18] (Lemma C.1)). Let $P$ and $Q$ be two probability measures in $(\mathbb{X}, \mathcal{B}(\mathbb{X}))$. Let $C_n = (f, \phi)$ be a zero-memory $n$-block coder with the nearest neighbor property (i.e., $C_n$ is nearest neighbor if, $\forall x^n_1 \in \mathbb{X}^n$, $\phi(f(x^n_1)) = \arg\min_{\hat{x}^n_1 \in \Gamma_{C_n}} \rho(x^n_1, \hat{x}^n_1)$, with $\Gamma_{C_n}$ the reproduction codebook of $C_n$). If we denote the performance of $C_n$ ($C_n = \phi \circ f$) with respect to $P$ by where $P^n$ denotes the product measure with marginal $P$ in $(\mathbb{X}^n, \mathcal{B}(\mathbb{X}^n))$, and $\rho$ satisfies condition (i) of Theorem 1 and is bounded by $d_{\max}$, then Furthermore, the inequality can be extended for the $n$-order operational distortions in (7), i.e., $\forall R > 0$.
Let us work with the following distortion redundancy, For the first equality we use (5). The inequality in (35) is from the definition in (31) and (33), and the equality in (36) is from the construction of $C^{*n}_{\mu_{\hat{\theta}_n(Z^n)}}$, which is $n$-operational optimal for the distribution $\mu_{\hat{\theta}_n(Z^n)}$ at rate $R$. Finally, (37) is from (32). Concluding, $D_\mu(C_{n,n} | Z^n) - D^n_\mu(R)$ is random (a measurable function of $Z^n$) and dominated by $V(\mu_{\hat{\theta}_n(Z^n)}, \mu)$. Hence, taking the expected value (with respect to $Z^n$) on both sides of this inequality (see (6)), we have the uniform convergence in (30) implying that and then the coding objective is achieved.

Theorem 2
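Proof. Let us first assume that $\mathcal{F}$ is $L_1$-totally bounded and prove the direct part of the statement. We adopt the skeleton estimate proposed by Yatracos [29] and extended by Devroye et al. [42,43] (a complete presentation can be found in [30] (Chapter 7)). For any arbitrary $\epsilon > 0$, let us consider the $\epsilon$-skeleton $G_\epsilon$ of $\mathcal{F}$ (Definition 9), with $g_{\theta_i}(x) \equiv \frac{d\mu_{\theta_i}}{d\lambda}(x)$ as short-hand for the $i$-th pdf in $G_\epsilon$, and we define $\Theta_\epsilon \equiv \{\theta_i : i = 1, \ldots, N_\epsilon\} \subset \Theta$ to represent the index set of $G_\epsilon$. Let us consider the Yatracos class of $G_\epsilon$ given by [30] $\mathcal{A}_\epsilon \equiv \{A_{i,j}, A_{j,i} : 1 \le i < j \le N_\epsilon\}$, where $A_{i,j} = \{x \in \mathbb{X} : g_{\theta_i}(x) > g_{\theta_j}(x)\} \in \mathcal{B}(\mathbb{X})$ is the Scheffé set of $\mu_{\theta_i}$ with respect to $\mu_{\theta_j}$ in (2) [30,33]. Hence, given i.i.d. realizations $X_1, \ldots, X_n$ with $X_i \sim \mu_\theta$ ($\mu_\theta \in \mathcal{F}$), let us propose the encoder-decoder pair $(f_{n,\epsilon}, \phi_{n,\epsilon})$ associated with $\mathcal{A}_\epsilon$ whose composition is $\hat{\theta}_\epsilon(X^n) = \phi_{n,\epsilon}(f_{n,\epsilon}(X^n)) = \arg\min_{\theta \in \Theta_\epsilon} \sup_{B \in \mathcal{A}_\epsilon} |\mu_\theta(B) - \hat{\mu}_n(B)|$ (42), where $\hat{\mu}_n(B) = \frac{1}{n} \sum_{j=1}^n 1_B(X_j)$ is the standard empirical distribution. This is the well-known skeleton estimate [29]. $\hat{\theta}_\epsilon(X^n_1)$ is the minimum-distance approximation of $\hat{\mu}_n$ with elements of $G_\epsilon$ [29,30], adopting the measure on the right-hand side of (42), which is reminiscent of the total variational distance in (1). In order to choose a sequence $(\epsilon_n)_{n \ge 1}$, we consider the following performance bound.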

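Lemma 3 ([30] (Theorem 6.3)). For any $\epsilon > 0$ and any $\mu \in \mathcal{P}(\mathbb{X})$, $V(\mu_{\hat{\theta}_\epsilon(X^n)}, \mu) \le 3 \min_{v \in G_\epsilon} V(v, \mu) + 4 \sup_{B \in \mathcal{A}_\epsilon} |\hat{\mu}_n(B) - \mu(B)|$. (43)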
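Equation (43) is valid for any $\epsilon > 0$ and, consequently, it provides a trade-off between an approximation error term and an estimation error term. The approximation error is $\min_{v \in G_\epsilon} V(v, \mu)$, which is bounded by $\epsilon$ for every $\mu \in \mathcal{F}$ by the definition of $G_\epsilon$. For the estimation error, on the other hand, Yatracos proposed the use of Hoeffding's inequality [44] to obtain that $\forall \mu \in \mathcal{P}(\mathbb{X})$ [30] (Theorem 7.1), $\mathbb{E} \sup_{B \in \mathcal{A}_\epsilon} |\hat{\mu}_n(B) - \mu(B)| \le \sqrt{\log(2N_\epsilon^2)/(2n)}$ (44). Using (44) in (43), it follows that $\sup_{\mu_\theta \in \mathcal{F}} \mathbb{E}\, V(\mu_{\hat{\theta}_\epsilon(X^n)}, \mu_\theta) \le 3\epsilon + \sqrt{8 \log(2N_\epsilon^2)/n}$. This last expression is distribution-free and remains valid if the approximation fidelity $\epsilon$ is a chosen function of $n$ [30]. Consequently, for any sequence $(\epsilon_n)_{n \ge 1}$, $\sup_{\mu_\theta \in \mathcal{F}} \mathbb{E}\, V(\mu_{\hat{\theta}_{\epsilon_n}(X^n)}, \mu_\theta) \le 3\epsilon_n + \sqrt{8 \log(2N_{\epsilon_n}^2)/n}$ (45) for all $n \ge 1$. Hence, we consider $\epsilon^*_n \equiv \inf\{\epsilon > 0 : \log(2N_\epsilon^2) \le \sqrt{n}\}$ proposed in [30] (Chapter 7.2), which is well-defined and converges to zero as $n$ tends to infinity.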
Consequently, from (45), $\lim_{n \to \infty} \sup_{\mu_\theta \in \mathcal{F}} \mathbb{E}\, V(\mu_{\hat{\theta}_{\epsilon^*_n}(X^n)}, \mu_\theta) = 0$. Then the learning scheme $\Pi((\epsilon^*_n)_{n \ge 1}) \equiv \{(f_{n,\epsilon^*_n}, \phi_{n,\epsilon^*_n}) : n \ge 1\}$ satisfies the learning requirement in Definition 6, where in particular $R(\phi_{n,\epsilon^*_n} \circ f_{n,\epsilon^*_n}) = \frac{1}{n} \log_2 N_{\epsilon^*_n}$, which tends to zero as $n$ tends to infinity by the choice of $\epsilon^*_n$. To conclude the argument of this part (i.e., presenting the construction of the second stage of a joint coding and modeling scheme), we adopt the result and the construction presented in the proof of Theorem 1 (see Remark 1 for details). This result implies that $\forall R > 0$ there is a strongly minimax universal joint coding and modeling scheme for $\mathcal{F}$ at rate $R$.
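The minimum-distance rule in (42) over a finite skeleton can be sketched as follows. This is an illustrative sketch and not the authors' implementation: it assumes that every candidate is a density on $[0, 1]$, so that the mass of each Scheffé set can be approximated with a midpoint rule, and the names `skeleton_estimate`, `candidates`, and `grid_size` are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence

Density = Callable[[float], float]


def skeleton_estimate(samples: Sequence[float],
                      candidates: List[Density],
                      grid_size: int = 2000) -> int:
    """Index i minimising sup_A |mu_i(A) - mu_hat_n(A)| over the Yatracos
    class {A_ij : i != j}, with A_ij = {x : g_i(x) > g_j(x)}.

    Assumption of this sketch: each candidate is a density on [0, 1], so
    mu_i(A) is approximated by a midpoint rule on a uniform grid.
    """
    n, k = len(samples), len(candidates)
    grid = [(t + 0.5) / grid_size for t in range(grid_size)]
    on_grid = [[g(x) for x in grid] for g in candidates]
    on_data = [[g(x) for x in samples] for g in candidates]
    pairs = [(i, j) for i in range(k) for j in range(k) if i != j]

    # Empirical mass of every Scheffe set A_ij.
    empirical = {
        (i, j): sum(1 for s in range(n) if on_data[i][s] > on_data[j][s]) / n
        for (i, j) in pairs
    }

    best_index, best_score = 0, float("inf")
    for c in range(k):
        score = 0.0
        for (i, j) in pairs:
            # Approximate mass that candidate c assigns to A_ij.
            mass = sum(on_grid[c][t] for t in range(grid_size)
                       if on_grid[i][t] > on_grid[j][t]) / grid_size
            score = max(score, abs(mass - empirical[(i, j)]))
        if score < best_score:
            best_index, best_score = c, score
    return best_index


# Toy skeleton: uniform, increasing, and decreasing densities on [0, 1].
candidates = [lambda x: 1.0, lambda x: 2.0 * x, lambda x: 2.0 * (1.0 - x)]
rng = random.Random(0)
data = [rng.random() ** 0.5 for _ in range(500)]  # i.i.d. draws from 2x
index = skeleton_estimate(data, candidates)
bits = math.ceil(math.log2(len(candidates)))  # first-stage message length
print(index, bits)
```

In this sketch the first-stage encoder only needs to transmit `index`, which costs $\lceil \log_2 N_\epsilon \rceil$ bits regardless of $n$; this is the quantity that the rate condition controls.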
For the other implication (the converse part of the statement), let us fix $R > 0$ and assume that we have a joint coding and modeling scheme that is strongly minimax universal (Definition 4) for $\mathcal{F}$ at rate $R$. Then from Proposition 1, we have a learning scheme $\Pi = \{(f_n, \phi_n) : n \ge 1\}$ such that $\lim_{n \to \infty} R(\pi_n = \phi_n \circ f_n) = 0$ and $\lim_{n \to \infty} \sup_{\mu \in \mathcal{F}} \mathbb{E}_{P^n_\mu} V(\mu_{\pi_n(X^n)}, \mu) = 0$. (46) For the learning rule of length $n$, we have its reproduction codebook, which we denote by $\Theta_n \equiv \{\theta^n_j : j = 1, \ldots, 2^{nR(\pi_n)}\} \subset \Theta$. Let us define the minimum-distance oracle solution in $\Theta_n$ by $\tilde{\theta}_n(\mu) = \arg\min_{\theta \in \Theta_n} V(\mu_\theta, \mu)$.

Theorem 3
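Proof. From Lemma 3, for any arbitrary sequence $(\epsilon_n)_{n \ge 1}$, $V(\mu_{\hat{\theta}_{\epsilon_n}(X^n)}, \mu_\theta) \le 3\epsilon_n + 4 \sup_{B \in \mathcal{A}_{\epsilon_n}} |\hat{\mu}_n(B) - \mu_\theta(B)|$ (48), with $\mathcal{A}_{\epsilon_n}$ the Yatracos class of the skeleton $G_{\epsilon_n}$. It is clear that $\forall \epsilon > 0$, $\mathcal{A}_\epsilon \subset \mathcal{A}_\Theta$. Then by monotonicity, $\sup_{B \in \mathcal{A}_\epsilon} |\hat{\mu}_n(B) - \mu(B)| \le \sup_{B \in \mathcal{A}_\Theta} |\hat{\mu}_n(B) - \mu(B)|$ for all $\epsilon > 0$ and for any distribution $\mu \in \mathcal{P}(\mathbb{X})$. Here is where we use the assumption that $\mathcal{A}_\Theta$ has finite VC dimension $J$, which implies from [30] (Theorem 3.1) that $\mathbb{E} \sup_{B \in \mathcal{A}_\Theta} |\hat{\mu}_n(B) - \mu(B)| \le c \sqrt{J/n}$ for some constant $c > 0$. Substituting this result in (48), the argument concludes by replacing $(\epsilon_n) = (1/\sqrt{n})$, a solution which achieves the intended rate of convergence for $\sup_{\mu_\theta \in \mathcal{F}} \mathbb{E}\, V(\mu_{\hat{\theta}_{1/\sqrt{n}}(X^n)}, \mu_\theta)$. Finally, the rate of the learning rule is $\frac{1}{n} \log_2 N_{1/\sqrt{n}}$, which tends to zero by the last hypothesis.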
For the almost-sure convergence part, if $\epsilon^*_n = 1/\sqrt{n}$, it is sufficient to show that the second term on the right-hand side (RHS) of (48) is $O(\sqrt{\log n/n})$ $P_\mu$-almost surely. From the fact that $\mathcal{A}_\Theta$ has finite VC dimension (Definition A1), and from the classical VC inequality [30] (Corollary 4.1 and Theorem 3.1) and [45] for some $K > 0$, hence $\sum_{n \ge 0} P\left( \frac{1}{a_n} \cdot \sup_{B \in \mathcal{A}_{\epsilon^*_n}} |\hat{\mu}_n(B) - \mu_\theta(B)| > M \right) < \infty$. Then from the Borel-Cantelli Lemma, $\limsup_{n \to \infty} \frac{1}{a_n} \cdot \sup_{B \in \mathcal{A}_{\epsilon^*_n}} |\hat{\mu}_n(B) - \mu_\theta(B)| \le M$ $P_\mu$-almost surely, which concludes the proof. As $(a_n)$ is $o(1)$, this result implies the almost-sure convergence to zero of $V(\mu_{\hat{\theta}_{\epsilon^*_n}(X^n)}, \mu_\theta)$ as $n$ goes to infinity.
For the final part, let $(m, r)$ be the uniform parameters that characterize the Lipschitz condition of $I_{\mathcal{F}}(\cdot)$ (Definition 11). Without loss of generality, let us assume the critical regime where $\epsilon/m < r$; hence from (51) $N_\epsilon$ is upper bounded by $K(\epsilon/m)$, which is the covering number of $\Theta$. As $\Theta \subset \prod_{i=1}^k [-L, L] \subset \mathbb{R}^k$, we will work with a uniform partition of $\prod_{i=1}^k [-L, L]$ to find a bound for $K(\epsilon/m)$. Let $\bar{\epsilon} = \epsilon/m$; then, inducing a product-type partition in which each coordinate has $L\sqrt{k}/\bar{\epsilon}$ uniform-length cells, we have the required $\bar{\epsilon}$-covering. The number of prototypes is $O((L\sqrt{k}/\bar{\epsilon})^k)$, which is $O(1/\epsilon^k)$ as a function of $\epsilon$ ($\epsilon = \bar{\epsilon} \cdot m$).
To clarify the constructive nature of the $\epsilon$-covering used to prove this result, an algorithm with the basic steps of the construction of this practical covering is sketched in Appendix C.
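The product-type partition used in this argument can be sketched in a few lines. The sketch below is not the algorithm of Appendix C: it assumes, for illustration only, that $\Theta$ is the whole cube $[-L, L]^k$, that the Lipschitz condition reads $V(\mu_\theta, \mu_\phi) \le m \|\theta - \phi\|$, and that $\epsilon/m < r$; all function names are hypothetical.

```python
import itertools
import math
from typing import Iterator, Sequence, Tuple


def cells_per_axis(L: float, k: int, eps: float, m: float) -> int:
    """Cells per coordinate so that every cell centre lies within eps/m
    (Euclidean norm) of each point of its cell."""
    return max(1, math.ceil(L * math.sqrt(k) * m / eps))


def grid_prototypes(L: float, k: int, eps: float,
                    m: float) -> Iterator[Tuple[float, ...]]:
    """Cell centres of the uniform partition of [-L, L]^k."""
    cells = cells_per_axis(L, k, eps, m)
    side = 2.0 * L / cells
    centres = [-L + (i + 0.5) * side for i in range(cells)]
    return itertools.product(centres, repeat=k)


def nearest_prototype(theta: Sequence[float], L: float, eps: float,
                      m: float) -> Tuple[float, ...]:
    """Quantise theta to the centre of the cell that contains it."""
    cells = cells_per_axis(L, len(theta), eps, m)
    side = 2.0 * L / cells
    out = []
    for t in theta:
        idx = min(cells - 1, max(0, int((t + L) // side)))
        out.append(-L + (idx + 0.5) * side)
    return tuple(out)


def covering_rate_bits(L: float, k: int, m: float, n: int) -> float:
    """Bits per sample needed to index the covering at eps = 1/sqrt(n)."""
    cells = cells_per_axis(L, k, 1.0 / math.sqrt(n), m)
    return k * math.log2(cells) / n


prototypes = list(grid_prototypes(1.0, 2, 0.1, 1.0))
theta = (0.3, -0.77)
print(len(prototypes), nearest_prototype(theta, 1.0, 0.1, 1.0))
```

With cells of side at most $2\bar{\epsilon}/\sqrt{k}$, each cell centre is within $\bar{\epsilon} = \epsilon/m$ of every point of its cell, so the number of prototypes is $\lceil L\sqrt{k}\,m/\epsilon \rceil^k$. At $\epsilon = 1/\sqrt{n}$ the index costs roughly $(k/2)\log_2 n$ bits, in line with the $O(\log n/n)$ rate overhead.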
The latter upper bound is asymptotically dominated by $\sqrt{\log n/n}$ from the fact that $\log |\tilde{G}_{1/\sqrt{n}}|$ is $O(k \log n)$ (Lemma 1), which proves the assertions made in (26).
Concerning part (ii), using the arguments presented in the proof of Theorem 3, we can obtain that $\forall \epsilon > 0$, $\sup_{\mu_\theta \in \mathcal{F}} \mathbb{E}\, V(\mu_{\hat{\theta}_\epsilon(X^n)}, \mu_\theta) \le 3\epsilon + 4 \cdot c \sqrt{J/n}$. (53) From this point, the proof follows from the arguments of Theorem 3 and the fact that $\log_2 |\tilde{G}_{1/\sqrt{n}}|$ is $O(k/2 \cdot \log_2 n)$.