Statistical estimation of the Kullback-Leibler divergence

Wide conditions are provided to guarantee asymptotic unbiasedness and $L^2$-consistency of the proposed estimates of the Kullback-Leibler divergence between probability measures in $\mathbb{R}^d$ having densities w.r.t. the Lebesgue measure. These estimates are constructed from two independent collections of i.i.d. observations and involve certain $k$-nearest neighbor statistics. In particular, the established results are valid for estimates of the Kullback-Leibler divergence between any two Gaussian measures in $\mathbb{R}^d$ with nondegenerate covariance matrices. As a byproduct we obtain new statements concerning the Kozachenko-Leonenko estimators of the Shannon differential entropy.

Usually one has to reconstruct the measures (describing a stochastic model under consideration) or their characteristics using some collections of observations. In the pioneering paper [19] an estimator of the Shannon differential entropy was proposed, based on nearest neighbor statistics. This estimate has been studied and applied in a series of papers. Moreover, estimators of the Rényi entropy, mutual information and the Kullback-Leibler divergence have appeared (see, e.g., [20], [21], [42]). However, the authors of [27] pointed out gaps in the known proofs concerning the limit behavior of such statistics. This issue attracted our attention and motivated our study of the asymptotic properties in question. Thus, in the recent work [7], new functionals were introduced to prove asymptotic unbiasedness and $L^2$-consistency of the Kozachenko-Leonenko estimators of the Shannon differential entropy. The present paper aims to extend that approach to the estimation of the Kullback-Leibler divergence. Instead of nearest neighbor statistics we employ $k$-nearest neighbor statistics (on order statistics see, e.g., [3]) and also use more general forms of the mentioned functionals.
Let $X$ and $Y$ be random vectors taking values in $\mathbb{R}^d$ and having distributions $P_X$ and $P_Y$, respectively (further we write $P = P_X$ and $Q = P_Y$). Consider i.i.d. random vectors $X_1, X_2, \ldots$ and i.i.d. random vectors $Y_1, Y_2, \ldots$ with $\mathrm{law}(X_1) = \mathrm{law}(X)$ and $\mathrm{law}(Y_1) = \mathrm{law}(Y)$. Assume that $\{X_i, Y_i,\ i \in \mathbb{N}\}$ are independent. We are interested in statistical estimation of $D(P_X\|P_Y)$ by means of the observations $\mathbb{X}_n := \{X_1, \ldots, X_n\}$ and $\mathbb{Y}_m := \{Y_1, \ldots, Y_m\}$, $n, m \in \mathbb{N}$. All random variables under consideration are defined on a complete probability space $(\Omega, \mathcal{F}, \mathsf{P})$.
For a finite set $E = \{z_1, \ldots, z_N\} \subset \mathbb{R}^d$, where $z_i \neq z_j$ ($i \neq j$), and a vector $v \in \mathbb{R}^d$, renumber the points of $E$ as $z_{(1)}(v), \ldots, z_{(N)}(v)$ in such a way that $\|v - z_{(1)}(v)\| \leq \ldots \leq \|v - z_{(N)}(v)\|$; here $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^d$. If there are points $z_{i_1}, \ldots, z_{i_s}$ having the same distance from $v$, we number them according to the increasing order of the indices among $i_1, \ldots, i_s$. In other words, for $k = 1, \ldots, N$, $z_{(k)}(v)$ is the $k$-NN (nearest neighbor) of $v$ in the set $E$. To indicate that $z_{(k)}(v)$ is constructed by means of $E$ we write $z_{(k)}(v, E)$. Fix $k \in \{1, \ldots, n-1\}$, $l \in \{1, \ldots, m\}$ and (for each $\omega \in \Omega$) put
$$\rho_k(i) := \|X_i - X_{(k)}(X_i, \mathbb{X}_n \setminus \{X_i\})\|, \qquad \nu_l(i) := \|X_i - Y_{(l)}(X_i, \mathbb{Y}_m)\|, \quad i = 1, \ldots, n.$$
We assume that $X$ and $Y$ have densities $p = \frac{dP_X}{d\mu}$ and $q = \frac{dP_Y}{d\mu}$, where $\mu$ is the Lebesgue measure in $\mathbb{R}^d$. Then with probability one all points of $\mathbb{X}_n$ are distinct, as are the points of $\mathbb{Y}_m$.
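To illustrate the numbering convention, here is a minimal sketch in Python (our own illustration; the function name and array representation are not from the paper):

```python
import numpy as np

def kth_nearest(v, E, k):
    """Return z_(k)(v, E): the k-th nearest neighbor of v in the finite set E.

    Ties in distance are broken by the original indexing of E,
    mirroring the convention adopted in the text.
    """
    E = np.asarray(E, dtype=float)
    dists = np.linalg.norm(E - v, axis=1)
    order = np.argsort(dists, kind="stable")  # stable sort keeps index order on ties
    return E[order[k - 1]]                    # k is 1-based, as in the text
```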
Introduce an estimate of $D(P_X\|P_Y)$, for $n \geq k+1$ and $m \geq l$, letting
$$D_{n,m}(k, l) := \frac{d}{n} \sum_{i=1}^n \log \frac{\nu_l(i)}{\rho_k(i)} + \log \frac{m}{n-1} + \psi(k) - \psi(l), \qquad (1.3)$$
where $\psi$ is the digamma function. Here $\zeta_{n,k}(i) := (n-1)\,\rho_k(i)^d$ and $\phi_{m,l}(i) := m\,\nu_l(i)^d$, so that $D_{n,m}(k,l) = \frac{1}{n}\sum_{i=1}^n \log\frac{\phi_{m,l}(i)}{\zeta_{n,k}(i)} + \psi(k) - \psi(l)$. For $k = l$ we come to formula (5) in [42].
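A minimal computational sketch of (1.3), under the reconstruction above (brute-force distances; the function name and implementation details are ours, not the paper's):

```python
import numpy as np
from scipy.special import digamma

def kl_estimate(X, Y, k=1, l=1):
    """Sketch of D_{n,m}(k,l) from (1.3):
    (d/n) * sum_i log(nu_l(i)/rho_k(i)) + log(m/(n-1)) + psi(k) - psi(l)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    n, d = X.shape
    m = Y.shape[0]
    dxx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dxx, np.inf)            # exclude X_i from its own neighbors
    rho = np.sort(dxx, axis=1)[:, k - 1]     # k-NN distance within X_n \ {X_i}
    dxy = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    nu = np.sort(dxy, axis=1)[:, l - 1]      # l-NN distance to the sample Y_m
    return (d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))
            + digamma(k) - digamma(l))
```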
Remark 2. All our results remain valid for the following generalization of the statistics $D_{n,m}(k,l)$:
$$D_{n,m}(K_n, L_n) := \frac{1}{n} \sum_{i=1}^n \Bigl( d \log \frac{\nu_{l_i}(i)}{\rho_{k_i}(i)} + \psi(k_i) - \psi(l_i) \Bigr) + \log \frac{m}{n-1}, \qquad (1.4)$$
where $K_n := \{k_i\}_{i=1}^n$, $L_n := \{l_i\}_{i=1}^n$ and, for some $r \in \mathbb{N}$ and all $i \in \mathbb{N}$, $k_i \leq r$, $l_i \leq r$. Note that (1.4) is well-defined for $n \geq \max_{i=1,\ldots,n} k_i + 1$ and $m \geq \max_{i=1,\ldots,n} l_i$. We consider only the estimates (1.3), since the study of $D_{n,m}(K_n, L_n)$ follows the same lines.
Developing the approach of [7] to the analysis of the asymptotic behavior of the Kozachenko-Leonenko estimates of the Shannon differential entropy (introduced in [35], Part III, Section 20), we encounter new complications due to dealing with $k$-nearest neighbor statistics for arbitrary $k \in \mathbb{N}$ (not only for $k = 1$). Accordingly, in the framework of the Kullback-Leibler divergence estimation, we propose a new way to bound the function $1 - F_{m,l,x}(u)$ playing the key role in the proofs (see formula (3.7)). Also, instead of the function $G(t) = t \log t$ (for $t > 1$) used in [7] to study the Shannon entropy estimates, we employ a regularly varying function $G_N(t) = t \log^{[N]}(t)$, where (for $t$ large enough) $\log^{[N]}(t)$ is the $N$-fold iteration of the logarithm and $N \in \mathbb{N}$ is chosen arbitrarily. Hence in the definition of the integral functional $K_{p,q}(\nu, N, t)$ by formula (2.4) below one can take a function $G_N(z)$ whose growth rate, for $z > 0$, is close to that of the function $z$ itself. Moreover, this permits a generalization of the results of [7]. Here we invoke the convexity of $G_N$ (see Lemma 6) to provide simpler conditions for asymptotic unbiasedness and $L^2$-consistency of the Shannon differential entropy estimates than those employed in [7].
Mention in passing that there exist investigations treating other important aspects of mutual information and entropy estimation. In [1] entropy estimators are applied to the detection of inhomogeneities in fiber materials. Mixed models and conditional entropy estimation are studied, e.g., in [8], [10]. The central limit theorem for the Kozachenko-Leonenko estimates is established in [12]. Limit theorems for point processes on manifolds are employed in [30] to analyze the behavior of the Shannon and Rényi entropy estimates. Convergence rates for (truncated) Shannon entropy estimates are obtained in [40] for the one-dimensional case; see also [37] for the multidimensional case. Ensemble estimation of density functionals is considered in [38]. A recursive rectilinear partitioning for the differential entropy is considered in [39]. Mutual information estimation by local Gaussian approximation is developed in [16]. Note that various deep results (including the central limit theorem) were obtained for Kullback-Leibler divergence estimates under certain conditions imposed on the derivatives of the unknown densities (see, e.g., the recent papers [2], [24], [33]). Our goal is to provide wide conditions for the asymptotic unbiasedness and $L^2$-consistency of the Kullback-Leibler divergence estimates (1.3), as $n, m \to \infty$, without such smoothness hypotheses. Also we do not assume that the densities have bounded supports.
The paper is organized as follows. In Section 2 we formulate the main results, Theorems 1 and 2. Their proofs are presented in Sections 3 and 4, respectively. Proofs of several lemmas are given in the Appendix (Section 5).

Main results
Some notation is necessary. For a probability density $f$ in $\mathbb{R}^d$, $x \in \mathbb{R}^d$, $r > 0$ and $R > 0$, as in [7], introduce the functions (or functionals depending on parameters)
$$M_f(x, R) := \sup_{r \in (0, R]} \frac{1}{\mu(B(x, r))} \int_{B(x, r)} f(y)\, \mu(dy), \qquad (2.1)$$
$$m_f(x, R) := \inf_{r \in (0, R]} \frac{1}{\mu(B(x, r))} \int_{B(x, r)} f(y)\, \mu(dy), \qquad (2.2)$$
where $B(x, r) := \{y \in \mathbb{R}^d : \|x - y\| \leq r\}$. Observe that replacing $\sup_{r \in (0, R]}$ by $\sup_{r \in (0, \infty)}$ in the definition of $M_f(x, R)$ leads to the celebrated Hardy-Littlewood maximal function $M_f(x)$, widely used in harmonic analysis. Some properties of the function $\int_{B(x, r)} f(y)\, dy$ are considered, e.g., in [14]; for basic properties of $M_f(\cdot, R)$ and $m_f(\cdot, R)$ we refer to Lemma 2.1 of [7]. For $N \in \mathbb{N}$, consider the continuous nondecreasing function $G_N : \mathbb{R}_+ \to \mathbb{R}_+$ given by formula (2.3); it coincides with $t \log^{[N]}(t)$ for all $t$ large enough and vanishes in a neighborhood of the origin. For probability densities $p, q$ in $\mathbb{R}^d$, some $N \in \mathbb{N}$ and positive constants $\nu, t, \varepsilon, R$, we define the following functionals with values in $[0, \infty]$: the integral functional $K_{p,q}(\nu, N, t)$ introduced by formula (2.4) (it involves $G_N$), together with
$$Q_{p,q}(\varepsilon, R) := \int_{\mathbb{R}^d} M_q^{\varepsilon}(x, R)\, p(x)\, \mu(dx), \qquad (2.5)$$
$$T_{p,q}(\varepsilon, R) := \int_{\mathbb{R}^d} m_q^{-\varepsilon}(x, R)\, p(x)\, \mu(dx). \qquad (2.6)$$
Clearly, for any $N \in \mathbb{N}$ and $\nu, t, u > 0$ such that $t < u$, the functionals $K_{p,q}(\nu, N, t)$ and $K_{p,q}(\nu, N, u)$ are finite or infinite simultaneously.
Remark 3. We stipulate that $1/0 := \infty$ (consequently $m_q^{-\varepsilon_2}(x, R) := \infty$ when $m_q(x, R) = 0$). For arbitrary versions of $p$ and $q$ we can write in (2.5), (2.6) the integrals over the support $S(p)$ instead of integrating over $\mathbb{R}^d$ (obviously, the results do not depend on the choice of versions).
Theorem 1. Let $P_X$ and $P_Y$ have densities $p$ and $q$, respectively. Suppose that $p$ and $q$ are such that, for some $\varepsilon_i > 0$, $R_i > 0$ and $N_j \in \mathbb{N}$, where $i = 1, 2, 3, 4$ and $j = 1, 2$, the functionals $K_{p,q}(1, N_1, t)$ and $K_{p,p}(1, N_2, t)$ (for some $t > 0$), $Q_{p,q}(\varepsilon_1, R_1)$, $T_{p,q}(\varepsilon_2, R_2)$, $Q_{p,p}(\varepsilon_3, R_3)$ and $T_{p,p}(\varepsilon_4, R_4)$ are finite. Then, for any fixed $k, l \in \mathbb{N}$,
$$\mathsf{E}\, D_{n,m}(k, l) \to D(P_X \| P_Y), \quad n, m \to \infty. \qquad (2.8)$$
Remark 4. Formulas (2.1), (2.2) and the Lebesgue differentiation theorem (see, e.g., Theorem 25.17 [43]) yield that $m_q(x, R_2) \leq q(x) \leq M_q(x, R_1)$ for $\mu$-almost all $x \in \mathbb{R}^d$. Evidently, $\log z \leq \frac{1}{\varepsilon} z^{\varepsilon}$ for any $z \geq 1$ and each $\varepsilon > 0$. Consequently, the finiteness of the above functionals ensures that $D(P_X\|P_Y)$ is well defined and finite (and also guarantees that $P_X \ll P_Y$).
Lemma 1. Let $p$ and $q$ be any probability densities in $\mathbb{R}^d$. Then the following statements are valid.
Let us also consider the following simple conditions.
(A; p, q, ν) For probability densities $p, q$ in $\mathbb{R}^d$ and some positive $\nu$,
$$\int_{\mathbb{R}^d} |\log q(x)|^{\nu}\, p(x)\, \mu(dx) < \infty. \qquad (2.9)$$
We formally set $\log 0 := -\infty$ and, as usual, $0 \cdot \infty := 0$.
(B_1; f) There exists a version of the density $f$ such that, for some $M(f) \in (0, \infty)$, $f(x) \leq M(f)$ for all $x \in \mathbb{R}^d$.
(C_1; f) There exists a version of the density $f$ such that, for some $m(f) \in (0, \infty)$, $f(x) \geq m(f)$ for all $x \in S(f)$.
Corollary 2. Let conditions (A; p, q, ν) and (A; p, p, ν) be satisfied with some $\nu > 1$. Then (2.8) is true, provided that (B_1; f) and (C_1; f) are valid for $f = p$ and $f = q$. Moreover, if the latter assumption concerning (B_1; f) and (C_1; f) holds, then (2.8) is true whenever $p$ and $q$ have bounded supports.
Next we formulate conditions guaranteeing the $L^2$-consistency of the estimates (1.3).
Theorem 2. Let $P_X$ and $P_Y$ have densities $p$ and $q$ such that, for some $\varepsilon_i > 0$, $R_i > 0$ ($i = 1, 2, 3, 4$), some $t > 0$ and $N_j \in \mathbb{N}$ ($j = 1, 2$), the functionals $K_{p,q}(2, N_1, t)$, $K_{p,p}(2, N_2, t)$, $Q_{p,q}(\varepsilon_1, R_1)$, $T_{p,q}(\varepsilon_2, R_2)$, $Q_{p,p}(\varepsilon_3, R_3)$ and $T_{p,p}(\varepsilon_4, R_4)$ are finite. Then, for any fixed $k, l \in \mathbb{N}$,
$$\mathsf{E}\bigl( D_{n,m}(k, l) - D(P_X \| P_Y) \bigr)^2 \to 0, \quad n, m \to \infty. \qquad (2.10)$$
Due to Lemma 1 one can recast Theorem 2 as follows.
Corollary 4. Let conditions (A; p, q, ν) and (A; p, p, ν) be satisfied with some $\nu > 2$. Assume that (B_1; f) and (C_1; f) are valid for $f = p$ and $f = q$. Then (2.10) is true. Moreover, if the latter assumption concerning (B_1; f) and (C_1; f) holds, then (2.10) is true whenever $p$ and $q$ have bounded supports.
Note that D. Evans considered the "positive density condition" in Definition 2.1 of [14], meaning that there exist constants $\beta > 1$ and $\delta > 0$ such that $\frac{r^d}{\beta} \leq \int_{B(x, r)} q(y)\, dy \leq \beta r^d$ for all $0 \leq r \leq \delta$ and $x \in \mathbb{R}^d$. Consequently $m_q(x, \delta) \geq \frac{1}{\beta V_d}$, where $V_d := \mu(B(0, 1))$ is the volume of the unit ball in $\mathbb{R}^d$. It was proved in [15] that if $f$ is smooth and its support is a compact convex body in $\mathbb{R}^d$, then the mentioned inequalities from Definition 2.1 of [14] hold. Therefore, if $p$ and $q$ are smooth and their supports are compact convex bodies in $\mathbb{R}^d$, then one can simplify the conditions of Corollaries 1 and 3. Now, instead of (C_1; f), we consider the following condition introduced in [7], which allows us to work with densities whose supports need not be bounded.
(C_2; f) For a fixed $R > 0$, there exist a constant $c > 0$ and a version of the density $f$ such that $m_f(x, R) \geq c\, f(x)$ for $\mu$-almost all $x \in \mathbb{R}^d$.
Remark 5. If, for some positive $\varepsilon$, $R$ and $c$, condition (C_2; q) is true and $\int_{\mathbb{R}^d} q^{-\varepsilon}(x)\, p(x)\, \mu(dx) < \infty$, then obviously $T_{p,q}(\varepsilon, R) < \infty$. Thus in Theorems 1 and 2 one can employ, for $f = p$ and $f = q$, condition (C_2; f) and suppose, for some $\varepsilon > 0$, the finiteness of $\int_{\mathbb{R}^d} q^{-\varepsilon}(x)\, p(x)\, \mu(dx)$ and $\int_{\mathbb{R}^d} p^{1-\varepsilon}(x)\, \mu(dx)$ instead of the corresponding assumptions $T_{p,q}(\varepsilon, R) < \infty$ and $T_{p,p}(\varepsilon, R) < \infty$.
To illustrate this observation we provide a result for a density with unbounded support.
Corollary 5. Let $X$, $Y$ be Gaussian random vectors in $\mathbb{R}^d$ with $\mathsf{E}X = \mu_X$, $\mathsf{E}Y = \mu_Y$ and nondegenerate covariance matrices $\Sigma_X$ and $\Sigma_Y$, respectively. Then relations (2.8) and (2.10) hold, where
$$D(P_X \| P_Y) = \frac{1}{2} \Bigl( \operatorname{tr}(\Sigma_Y^{-1} \Sigma_X) + (\mu_Y - \mu_X)^{\top} \Sigma_Y^{-1} (\mu_Y - \mu_X) - d + \log \frac{\det \Sigma_Y}{\det \Sigma_X} \Bigr).$$
The latter formula can be found, e.g., in [25], p. 147. The proof of Corollary 5 is discussed in the Appendix.
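As a sanity check (our own illustration, not part of the paper), the estimate (1.3) can be compared with this closed form on simulated Gaussian samples, reusing the kl_estimate sketch above:

```python
import numpy as np
from numpy.linalg import inv, slogdet

def kl_gauss(mu_x, cov_x, mu_y, cov_y):
    """Closed-form D(N(mu_x, cov_x) || N(mu_y, cov_y))."""
    d = len(mu_x)
    diff = mu_y - mu_x
    return 0.5 * (np.trace(inv(cov_y) @ cov_x) + diff @ inv(cov_y) @ diff
                  - d + slogdet(cov_y)[1] - slogdet(cov_x)[1])

rng = np.random.default_rng(0)
d, n, m = 2, 2000, 2000
mu_x, mu_y = np.zeros(d), np.ones(d)
cov_x, cov_y = np.eye(d), 2.0 * np.eye(d)
X = rng.multivariate_normal(mu_x, cov_x, size=n)
Y = rng.multivariate_normal(mu_y, cov_y, size=m)
print(kl_estimate(X, Y, k=3, l=3))          # k-NN estimate
print(kl_gauss(mu_x, cov_x, mu_y, cov_y))   # exact value, approx 0.69 here
```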
Similarly to condition (C_2; f), let us consider the following one.
(B_2; f) For a fixed $R > 0$, there exist a constant $C > 0$ and a version of the density $f$ such that $M_f(x, R) \leq C\, f(x)$ for $\mu$-almost all $x \in S(f)$.
Remark 6. If, for some positive $\varepsilon$, $R$ and $C$, condition (B_2; q) is true and $\int_{\mathbb{R}^d} q^{\varepsilon}(x)\, p(x)\, \mu(dx) < \infty$, then $Q_{p,q}(\varepsilon, R) < \infty$. Thus in Theorems 1 and 2 one can employ, for $f = p$ and $f = q$, condition (B_2; f) and suppose that $\int_{\mathbb{R}^d} q^{\varepsilon}(x)\, p(x)\, \mu(dx)$ and $\int_{\mathbb{R}^d} p^{1+\varepsilon}(x)\, \mu(dx)$ are finite (for some $\varepsilon > 0$) instead of the assumptions $Q_{p,q}(\varepsilon_1, R_1) < \infty$ and $Q_{p,p}(\varepsilon_3, R_3) < \infty$.
For a fixed $k \in \{1, \ldots, n-1\}$, consider the Kozachenko-Leonenko estimate of the Shannon differential entropy $H(X)$ of a vector $X$ with values in $\mathbb{R}^d$ having a density $p$ w.r.t. the Lebesgue measure. Namely, $H(X) := -\int_{\mathbb{R}^d} (\log p(x))\, p(x)\, \mu(dx)$ and, for i.i.d. observations $X_1, X_2, \ldots$ such that $\mathrm{law}(X_1) = \mathrm{law}(X)$, set, for all $n \geq k + 1$,
$$H_n(k) := \frac{1}{n} \sum_{i=1}^n \log\bigl( (n - 1)\, V_d\, \rho_k(i)^d \bigr) - \psi(k). \qquad (2.15)$$
Similarly to (1.4) one can employ the following generalization of the statistics $H_n(k)$:
$$H_n(K_n) := \frac{1}{n} \sum_{i=1}^n \Bigl( \log\bigl( (n - 1)\, V_d\, \rho_{k_i}(i)^d \bigr) - \psi(k_i) \Bigr),$$
where $K_n := \{k_i\}_{i=1}^n$ and, for some $r \in \mathbb{N}$ and all $i \in \mathbb{N}$, $k_i \leq r$.
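A matching sketch of (2.15) as reconstructed above (again our own illustration; $V_d$ is the volume of the unit ball in $\mathbb{R}^d$):

```python
import numpy as np
from scipy.special import digamma, gammaln

def entropy_estimate(X, k=1):
    """Sketch of the Kozachenko-Leonenko estimate H_n(k) from (2.15):
    (1/n) * sum_i log((n-1) * V_d * rho_k(i)^d) - psi(k)."""
    X = np.asarray(X, float)
    n, d = X.shape
    dxx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dxx, np.inf)
    rho = np.sort(dxx, axis=1)[:, k - 1]                    # k-NN distances
    log_Vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)   # log volume of unit ball
    return np.mean(np.log(n - 1) + log_Vd + d * np.log(rho)) - digamma(k)
```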
Corollary 6. Let $Q_{p,p}(\varepsilon, R) < \infty$ and $T_{p,p}(\varepsilon, R) < \infty$ for some positive $\varepsilon$ and $R$. Then the following statements hold for any fixed $k \in \mathbb{N}$: (i) if $K_{p,p}(1, N, t) < \infty$ for some $N \in \mathbb{N}$ and $t > 0$, then $\mathsf{E} H_n(k) \to H(X)$ as $n \to \infty$; (ii) if $K_{p,p}(2, N, t) < \infty$ for some $N \in \mathbb{N}$ and $t > 0$, then $\mathsf{E}(H_n(k) - H(X))^2 \to 0$ as $n \to \infty$.
The proof of the first statement of this corollary is contained in the proof of Theorem 1, Step 5. In a similar way one can infer the second statement of Corollary 6 by means of the proof of Theorem 2, Step 5.

Proof of Theorem 1
For $n, m \in \mathbb{N}$ such that $n > 1$ and fixed $k \in \mathbb{N}$, $l \in \mathbb{N}$, one has, by (1.3),
$$D_{n,m}(k, l) = \frac{1}{n} \sum_{i=1}^n \log \phi_{m,l}(i) - \frac{1}{n} \sum_{i=1}^n \log \zeta_{n,k}(i) + \psi(k) - \psi(l), \qquad (3.1)$$
where $\phi_{m,l}(i) := m\, \nu_l(i)^d$ and $\zeta_{n,k}(i) := (n-1)\, \rho_k(i)^d$. It is sufficient to prove the following two claims.
Statement 1. For each fixed $l$, all $m$ large enough and any $i \in \mathbb{N}$, $\mathsf{E}|\log \phi_{m,l}(i)| < \infty$ and
$$\mathsf{E} \log \phi_{m,l}(i) \to \psi(l) - \log V_d - \int_{\mathbb{R}^d} (\log q(x))\, p(x)\, \mu(dx), \quad m \to \infty. \qquad (3.2)$$
Statement 2. For each fixed $k$, all $n$ large enough and any $i \in \mathbb{N}$, $\mathsf{E}|\log \zeta_{n,k}(i)| < \infty$ and
$$\mathsf{E} \log \zeta_{n,k}(i) \to \psi(k) - \log V_d - \int_{\mathbb{R}^d} (\log p(x))\, p(x)\, \mu(dx), \quad n \to \infty. \qquad (3.3)$$
Then, in view of (3.1), (3.2) and (3.3), $\mathsf{E}\, D_{n,m}(k, l) \to \int_{\mathbb{R}^d} p(x) \log \frac{p(x)}{q(x)}\, \mu(dx) = D(P_X \| P_Y)$ as $n, m \to \infty$. We are going to discuss in detail only the proof of Statement 1, since Statement 2 is established in a similar way. It was explained in [7] that if $V$ is a nonnegative random variable (hence $\mathsf{E} V \leq \infty$) and $X$ is an arbitrary random vector with values in $\mathbb{R}^d$, then
$$\mathsf{E} V = \int_{\mathbb{R}^d} \int_{[0, \infty)} u\, dF(u, x)\, P_X(dx). \qquad (3.4)$$
Formula (3.4) means that both sides are simultaneously finite or infinite and then coincide. Let $F(u, \omega)$ be a regular conditional distribution function of $V$ given $X$, where $u \in [0, \infty)$ and $\omega \in \Omega$. Let $h : \mathbb{R} \to [0, \infty)$ be a measurable function. Then, for $P_X$-almost all $x$,
$$\mathsf{E}(h(V) \mid X = x) = \int_{[0, \infty)} h(u)\, dF(u, x). \qquad (3.5)$$
This means that both sides of (3.5) are simultaneously finite or infinite and coincide. By virtue of (3.4) and (3.5) one can prove that $\mathsf{E}|\log \phi_{m,l}(i)| < \infty$, for all $m$ large enough, fixed $l$ and all $i \in \mathbb{N}$, and that (3.2) holds. For this purpose we take $V = \phi_{m,l}(i)$, $X = X_i$ and $h(u) = |\log u|$, $u > 0$ (we use $h(u) = \log^2 u$ in the proof of Theorem 2). To reduce the volume of the paper we only consider below the evaluation of $\mathsf{E} \log \phi_{m,l}(i)$, as all steps of the proof are the same when treating $\mathsf{E}|\log \phi_{m,l}(i)|$.
We divide the proof of Statement 1 into four steps. Preliminary Steps 1-3 are devoted to the demonstration, for $x \in A \subset S(p)$ and $i \in \mathbb{N}$, of the relation
$$\mathsf{E}(\log \phi_{m,l}(i) \mid X_i = x) \to \psi(l) - \log V_d - \log q(x), \quad m \to \infty, \qquad (3.6)$$
where the set $A$ depends on the versions of $p$ and $q$ and $P_X(S(p) \setminus A) = 0$. Then Step 4 justifies the desired result (3.2). Step 5 contains the validation of Statement 2.
Step 1. Here we establish the convergence in distribution of auxiliary random variables. Fix any $i \in \mathbb{N}$ and $l \in \{1, \ldots, m\}$. To simplify notation we do not indicate the dependence of functions on $d$. For $x \in \mathbb{R}^d$ and $u > 0$, we study the asymptotic behavior (as $m \to \infty$) of the function
$$F^i_{m,l,x}(u) := \mathsf{P}(\phi_{m,l}(i) \leq u \mid X_i = x) = 1 - \sum_{s=0}^{l-1} \binom{m}{s} \Bigl( \int_{B(x, r_m(u))} q(y)\, \mu(dy) \Bigr)^s \Bigl( 1 - \int_{B(x, r_m(u))} q(y)\, \mu(dy) \Bigr)^{m-s}, \qquad (3.7)$$
where
$$r_m(u) := \Bigl( \frac{u}{m} \Bigr)^{1/d}, \quad u \geq 0. \qquad (3.8)$$
We have employed in (3.7) the independence of the random vectors $Y_1, \ldots, Y_m, X_i$ and the condition that $Y_1, \ldots, Y_m$ have the same law as $Y$. We also took into account that the event $\{\|x - Y_{(l)}(x, \mathbb{Y}_m)\| > r_m(u)\}$ is a union of pairwise disjoint events $A_s$, $s = 0, \ldots, l-1$. Here $A_s$ means that exactly $s$ observations among $\mathbb{Y}_m$ belong to the ball $B(x, r_m(u))$ and the other $m - s$ lie outside this ball (the probability that $Y$ belongs to the sphere $\{z \in \mathbb{R}^d : \|z - x\| = r\}$ equals 0 since $Y$ has a density w.r.t. the Lebesgue measure $\mu$). Formulas (3.7) and (3.8) show that $F^i_{m,l,x}(u)$ is the regular conditional distribution function of $\phi_{m,l}(i)$ given $X_i = x$. Moreover, (3.7) means that $\phi_{m,l}(i)$, $i \in \{1, \ldots, n\}$, are identically distributed, and we may omit the dependence on $i$. So one can replace $F^i_{m,l,x}(u)$ with $F_{m,l,x}(u)$.
According to the Lebesgue differentiation theorem (see, e.g., [43], p. 654), for $\mu$-almost all $x \in \mathbb{R}^d$,
$$\frac{1}{\mu(B(x, r))} \int_{B(x, r)} q(y)\, \mu(dy) \to q(x), \quad r \to 0+. \qquad (3.9)$$
Let $\Lambda(q)$ stand for the set of all Lebesgue points of a function $q$, i.e. points $x \in \mathbb{R}^d$ satisfying (3.9). Clearly, $\Lambda(q)$ depends on the chosen version of $q$ within its class of equivalent functions from $L^1(\mathbb{R}^d)$ and, for an arbitrary version of $q$, we have $\mu(\mathbb{R}^d \setminus \Lambda(q)) = 0$. Note that, for each $u > 0$, $r_m(u) \to 0$ as $m \to \infty$, and $\mu(B(x, r_m(u))) = V_d\, r_m(u)^d = V_d\, u / m$. Therefore, by virtue of (3.9), for any fixed $x \in \Lambda(q)$ and $u > 0$,
$$m \int_{B(x, r_m(u))} q(y)\, \mu(dy) = V_d\, q(x)\, u\, (1 + \alpha_m(x, u)), \qquad (3.10)$$
where $\alpha_m(x, u) \to 0$, $m \to \infty$. Hence, for $x \in \Lambda(q) \cap S(q)$ (thus $q(x) > 0$), due to (3.7),
$$1 - F_{m,l,x}(u) \to \sum_{s=0}^{l-1} e^{-V_d q(x) u}\, \frac{(V_d q(x) u)^s}{s!} = \mathsf{P}(\xi_{l,x} > u), \quad m \to \infty, \qquad (3.11)$$
i.e. $\xi_{m,l,x} \xrightarrow{law} \xi_{l,x}$, where $\xi_{m,l,x} := m \|x - Y_{(l)}(x, \mathbb{Y}_m)\|^d$ and $\xi_{l,x}$ has the $\Gamma(V_d\, q(x), l)$ distribution (with rate $V_d\, q(x)$ and shape $l$). We assume without loss of generality (w.l.g.) that, for all $x \in S(q)$, the random variables $\xi_{l,x}$ and $\{\xi_{m,l,x}\}_{m \geq l}$ are defined on the probability space $(\Omega, \mathcal{F}, \mathsf{P})$, since in view of the Lomnicki-Ulam theorem (see, e.g., [18], p. 93) one can consider independent copies of $Y_1, Y_2, \ldots$ and $\{\xi_{l,x}\}_{x \in S(q)}$ defined on a certain probability space. The convergence in law of random variables is preserved under continuous mappings. Hence, for any $x \in \Lambda(q) \cap S(q)$, we come to the relation
$$\log \xi_{m,l,x} \xrightarrow{law} \log \xi_{l,x}, \quad m \to \infty. \qquad (3.12)$$
We took into account that, for each $x \in \Lambda(q) \cap S(q)$, one has $\xi_{l,x} > 0$ a.s., and since $Y$ has a density we infer that $\mathsf{P}(\xi_{m,l,x} = 0) = 0$. More precisely, we can ignore zero values of nonnegative random variables (having zero values with probability zero) when taking their logarithms.
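A quick numerical illustration of this limit (our own check, taking $q$ to be the standard normal density in $\mathbb{R}^d$): the empirical mean of $\log \xi_{m,l,x}$ should approach $\mathsf{E} \log \xi_{l,x} = \psi(l) - \log(V_d\, q(x))$, cf. (3.14) below.

```python
import numpy as np
from scipy.special import digamma, gammaln

rng = np.random.default_rng(1)
d, l, m, reps = 2, 3, 5000, 2000
x = np.zeros(d)
q_x = (2 * np.pi) ** (-d / 2)                # standard normal density at x = 0
log_Vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
log_xi = np.empty(reps)
for r in range(reps):
    Y = rng.standard_normal((m, d))          # i.i.d. sample from q
    dist_l = np.sort(np.linalg.norm(Y - x, axis=1))[l - 1]
    log_xi[r] = np.log(m * dist_l ** d)      # log xi_{m,l,x}
print(log_xi.mean())                          # empirical mean
print(digamma(l) - log_Vd - np.log(q_x))      # limit psi(l) - log(V_d q(x))
```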
Step 2. By virtue of (3.5), for each $x \in \Lambda(q) \cap S(q)$ and all $m$ large enough, $\mathsf{E}(\log \phi_{m,l}(1) \mid X_1 = x) = \mathsf{E} \log \xi_{m,l,x}$. Thus, for $x \in \Lambda(q) \cap S(q)$, relation (3.6) holds if and only if
$$\mathsf{E} \log \xi_{m,l,x} \to \mathsf{E} \log \xi_{l,x}, \quad m \to \infty, \qquad (3.13)$$
is true. Recall that, for $\xi_{l,x} \sim \Gamma(V_d\, q(x), l)$,
$$\mathsf{E} \log \xi_{l,x} = \psi(l) - \log V_d - \log q(x). \qquad (3.14)$$
According to Theorem 3.5 of [4] we would establish (3.13) if relation (3.12) could be supplemented, for $\mu$-almost all $x \in \Lambda(q) \cap S(q)$, by the uniform integrability of the family $\{\log \xi_{m,l,x}\}_{m \geq m_0(x)}$. Note that, for each $N \in \mathbb{N}$, the function $G_N(t)$ introduced by (2.3) is increasing on $(0, \infty)$ and $G_N(t)/t \to \infty$ as $t \to \infty$. Therefore, by the de la Vallée Poussin theorem (see, e.g., Theorem 1.3.4 of [6]), to guarantee, for $\mu$-almost every $x \in \Lambda(q) \cap S(q)$, the uniform integrability of $\{\log \xi_{m,l,x}\}_{m \geq m_0(x)}$ it suffices to prove, for such $x$, some positive $C_0(x)$ and $m_0(x) \in \mathbb{N}$, that
$$\sup_{m \geq m_0(x)} \mathsf{E}\, G_{N_1}(|\log \xi_{m,l,x}|) \leq C_0(x) < \infty, \qquad (3.15)$$
where $G_{N_1}$ appears in the conditions of Theorem 1.
Step 3 is devoted to proving the validity of (3.15). It is convenient to divide this proof into its own parts (3a), (3b), etc. For any $N \in \mathbb{N}$, introduce the constant $\rho(N)$ (defined as a finite product, with the convention that the product over the empty set, arising for $N = 1$, equals 1). We will employ the following result; its proof is given in the Appendix.
Lemma 2. Let $F(u)$, $u \in \mathbb{R}$, be a distribution function such that $F(0) = 0$. Then, for each $N \in \mathbb{N}$, the expectation $\int_{(0, \infty)} G_N(|\log u|)\, dF(u)$ admits the decomposition stated in relation 1), into a part $I_1$ controlled by $F(u)$ for small $u$ and a part $I_2$ controlled by $1 - F(u)$ for large $u$. For convenience we write $I_1(m, x)$ and $I_2(m, x)$ without indicating their dependence on $N_1$, $l$ and $d$. Recall that $N_1$ is fixed.
Part (3a). We provide bounds for $I_1(m, x)$. Take $R_1 > 0$ appearing in the conditions of Theorem 1 and any $u \in \bigl(0, \tfrac{1}{\rho(N_1)}\bigr]$. Note also that we can consider only $m \geq l$ everywhere below, because the size of the sample $\mathbb{Y}_m$ should not be less than the number $l$ of neighbors (see, e.g., (3.7)). Thus, for $m$ such that $r_m(u) \leq R_1$, the definition (2.1) yields
$$\int_{B(x, r_m(u))} q(y)\, \mu(dy) \leq \mu(B(x, r_m(u)))\, M_q(x, R_1) = \frac{V_d\, u}{m}\, M_q(x, R_1). \qquad (3.16)$$
If $\varepsilon \in (0, 1]$ and $t \in [0, 1]$ then, for all $m \geq 1$, invoking the Bernoulli inequality, one has
$$1 - (1 - t)^m \leq (m t)^{\varepsilon}. \qquad (3.17)$$
By the assumptions of the Theorem, $Q_{p,q}(\varepsilon_1, R_1) < \infty$ for some $\varepsilon_1 > 0$, $R_1 > 0$. According to Lemma 1 we can assume that $\varepsilon_1 < 1$. Thus, due to (3.17) and since $F_{m,l,x}(u) \leq 1 - \bigl(1 - \int_{B(x, r_m(u))} q(y)\, \mu(dy)\bigr)^m$, relation (3.16) gives
$$F_{m,l,x}(u) \leq \bigl( V_d\, u\, M_q(x, R_1) \bigr)^{\varepsilon_1}. \qquad (3.18)$$
In view of (3.7), (3.16) and (3.18) one can claim now that, for all $x \in \Lambda(q) \cap S(q)$, $u \in \bigl(0, \tfrac{1}{\rho(N_1)}\bigr]$ and all $m \geq m_1$ (with $m_1$ chosen so that, in particular, $r_m(u) \leq R_1$ for such $u$), one has $I_1(m, x) \leq C_1(x)$, where $C_1(x)$ is expressed via $(M_q(x, R_1))^{\varepsilon_1}$.
Part (3b). Evidently, $I_2(m, x)$ splits into the summands $J_1(m, x)$, $J_2(m, x)$ and $J_3(m, x)$ corresponding to increasing ranges of $u$. By Markov's inequality, $\mathsf{P}(Z \geq z) \leq e^{-\lambda z}\, \mathsf{E} e^{\lambda Z}$ for any $\lambda > 0$ and $z > 0$. Applying this bound (with $Z$ related to the number of points of $\mathbb{Y}_m$ in $B(x, r_m(u))$), one obtains, for each $\lambda > 0$, an exponential estimate of $1 - F_{m,l,x}(u)$. To simplify the bounds we take $\lambda = 1$ and introduce the corresponding constants $S_1 := S_1(l)$ and $S_2$; here we have used the elementary inequality $1 - t \leq e^{-t}$. For $R_2 > 0$ appearing in the conditions of the Theorem and $u$ such that $r_m(u) \leq R_2$, one has
$$\int_{B(x, r_m(u))} q(y)\, \mu(dy) \geq \frac{V_d\, u}{m}\, m_q(x, R_2)$$
by the definition of $m_f$ (for $f = q$) in (2.2). Now we use the following Lemma 3.2 of [7].
It is easily seen that, for any $t > 0$ and each $\delta \in (0, e]$, one has $e^{-t} \leq t^{-\delta}$. Thus, for $u \in (\rho(N_1), \sqrt{m}]$ and $\varepsilon_2 > 0$, we deduce from the conditions of the Theorem (in view of Lemma 1 one can suppose that $\varepsilon_2 \in (0, e]$), taking into account that $m_q(x, R_2) > 0$ for $x \in D_q(R_2)$ and applying relation (3.25), that $J_1(m, x)$ admits a bound expressed via $m_q^{-\varepsilon_2}(x, R_2)$. Thus, for all $x \in \Lambda(q) \cap S(q) \cap D_q(R_2)$ and any $m \geq m_2$, $J_1(m, x) \leq C_2(x)$, where $C_2(x)$ involves $m_q^{-\varepsilon_2}(x, R_2)$.
Part (3c). Consider $J_2(m, x)$. In view of (3.26), for all $x \in \Lambda(q) \cap S(q) \cap D_q(R_2)$ and the corresponding range of $u$, an analogous exponential estimate applies. Then, for all $x \in \Lambda(q) \cap S(q) \cap D_q(R_2)$ and any $m \geq m_2$, $J_2(m, x) \leq C_3(x)$. To get bounds for $J_3(m, x)$ we employ several auxiliary results.
Lemma 4. For each $N \in \mathbb{N}$ and any $\nu > 0$, there exist positive constants, depending only on $N$ and $\nu$, for which the bound employed below holds. The proof is provided in the Appendix.
Lemma 5. For each $N \in \mathbb{N}$, the function $t \mapsto G_N(\log t)$, defined for $t$ large enough, is slowly varying at infinity.
Its proof is elementary and is thus omitted.
Part (3e). Now we are ready to get the bound for $J_3(m, x)$. Set $u = m w$. Then $J_3(m, x)$ takes the form of an integral in $w$ over $(1, \infty)$ with respect to $dw$.
The range of $w$ and Lemma 5 imply a logarithmic bound for the integrand. By virtue of (3.30) and (3.32), it can hence be seen that $J_3(m, x)$ is controlled by $R_{N_1}(x)$, where, for $N \in \mathbb{N}$,
$$R_N(x) := \int_{\mathbb{R}^d} G_N(|\log \|x - y\||)\, q(y)\, \mu(dy), \qquad A_p(G_N) := \{x \in S(p) : R_N(x) < \infty\}.$$
Then, by virtue of (3.31) and (3.34), for all $m \geq m_3$ and $x \in A$, where $A := \Lambda(q) \cap S(q) \cap D_q(R_2) \cap A_p(G_{N_1})$, we come to the inequality $J_3(m, x) \leq C_4(x)$. Moreover, for any $\varkappa > 0$, one can take $m_4 = m_4(\varkappa) \in \mathbb{N}$ such that the remainder terms do not exceed $\varkappa$. Then, by virtue of (3.36), for each $x \in A$ and $m \geq m_0 := \max\{m_1, m_2, m_3, m_4\}$,
$$\mathsf{E}\, G_{N_1}(|\log \xi_{m,l,x}|) \leq C_0(x) < \infty. \qquad (3.37)$$
Hence, for each $x \in A$, the uniform integrability of the family $\{\log \xi_{m,l,x}\}_{m \geq m_0}$ is established.
Step 4. Now we verify (3.2). We have already proved, for each $x \in A$ (thus, for $P_X$-almost every $x$ belonging to $S(p)$), that $Z_{m,l}(x) := \mathsf{E}(\log \phi_{m,l}(1) \mid X_1 = x) \to \psi(l) - \log V_d - \log q(x)$ as $m \to \infty$. By Lemma 6 the function $G_{N_1}$ is nondecreasing and convex. On account of the Jensen inequality,
$$G_{N_1}\bigl( |Z_{m,l}(x)| \bigr) \leq \mathsf{E}\bigl( G_{N_1}(|\log \phi_{m,l}(1)|) \mid X_1 = x \bigr). \qquad (3.38)$$
Relation (3.37) guarantees that, for all $m \geq m_0$, the right-hand side of (3.38) does not exceed $C_0(x)$, which is $P_X$-integrable by the conditions of the Theorem. We have thus established the uniform integrability of the family $\{Z_{m,l}\}_{m \geq m_0}$ w.r.t. the measure $P_X$. Thus, for $i \in \mathbb{N}$,
$$\mathsf{E} \log \phi_{m,l}(i) = \int_{\mathbb{R}^d} Z_{m,l}(x)\, P_X(dx) \to \int_{\mathbb{R}^d} \bigl( \psi(l) - \log V_d - \log q(x) \bigr)\, p(x)\, \mu(dx), \quad m \to \infty,$$
and we come to relation (3.2).
Step 5. Let us briefly discuss Statement 2. Similarly to $F_{m,l,x}(u)$, one can introduce, for $n, k \in \mathbb{N}$, $n \geq k + 1$, $x \in \mathbb{R}^d$ and $u > 0$, the function
$$F_{n,k,x}(u) := 1 - \sum_{s=0}^{k-1} \binom{n-1}{s} \Bigl( \int_{B(x, r_{n-1}(u))} p(y)\, \mu(dy) \Bigr)^s \Bigl( 1 - \int_{B(x, r_{n-1}(u))} p(y)\, \mu(dy) \Bigr)^{n-1-s}, \qquad (3.39)$$
where $r_n(u)$ was defined in (3.8), together with the analogue (3.40) of relation (3.10) for the density $p$. Formulas (3.39) and (3.40) show that $F_{n,k,x}(u)$ is the regular conditional distribution function of $\zeta_{n,k}(i)$ given $X_i = x$. Moreover, for any fixed $u > 0$ and $x \in \Lambda(p) \cap S(p)$ (thus $p(x) > 0$), the conditional law of $\zeta_{n,k}(i)$ converges to the $\Gamma(V_d\, p(x), k)$ distribution, and the role of $R_N(x)$ is now played by
$$\widetilde{R}_N(x) := \int_{\mathbb{R}^d} G_N(|\log \|x - y\||)\, p(y)\, \mu(dy).$$
Hence, similarly to Steps 1-4, we come to relation (3.3).
The proof of Theorem 1 is complete.

Proof of Theorem 2
First of all note that, in view of Lemma 1, the finiteness of $K_{p,q}(2, N_1)$ and $K_{p,p}(2, N_2)$ implies the finiteness of $K_{p,q}(1, N_1)$ and $K_{p,p}(1, N_2)$, respectively. Thus the conditions of Theorem 2 entail the validity of the statements of Theorem 1. Consequently, under the conditions of Theorem 2, for $n$ and $m$ large enough, one can claim that $D_{n,m}(k,l) \in L^1(\Omega)$ and $\mathsf{E}\, D_{n,m}(k,l) \to D(P_X\|P_Y)$ as $n, m \to \infty$.
We will show that $D_{n,m}(k,l) \in L^2(\Omega)$ for all $n$ and $m$ large enough. Then we can write
$$\mathsf{E}\bigl( D_{n,m}(k,l) - D(P_X\|P_Y) \bigr)^2 = \mathrm{var}\, D_{n,m}(k,l) + \bigl( \mathsf{E}\, D_{n,m}(k,l) - D(P_X\|P_Y) \bigr)^2.$$
Therefore, to prove (2.10) we will demonstrate that $\mathrm{var}\, D_{n,m}(k,l) \to 0$ as $n, m \to \infty$.
Due to (3.7) the random variables $\log \phi_{m,l}(1), \ldots, \log \phi_{m,l}(n)$ are identically distributed (and $\log \zeta_{n,k}(1), \ldots, \log \zeta_{n,k}(n)$ are identically distributed as well). Hence (3.1) yields a decomposition of $\mathrm{var}\, D_{n,m}(k,l)$ into variance and covariance terms. We do not strictly adhere to the notation used in the proof of Theorem 1; namely, the choice of the sets $A \subset \mathbb{R}^d$, $\widetilde{A} \subset \mathbb{R}^d$, of positive $U_j$, $C_j(x)$, $\widetilde{C}_j(x)$ and of integers $m_j$, $n_j$, where $j \in \mathbb{Z}_+$ and $x \in \mathbb{R}^d$, could be different. The proof of Theorem 2 is also divided into several steps. Steps 1-3 are devoted to the demonstration of the relation $\frac{1}{n}\, \mathrm{var}(\log \phi_{m,l}(1)) \to 0$ as $n, m \to \infty$, while Step 4 contains the proof of the relation $\frac{2}{n^2} \sum_{1 \leq i < j \leq n} \mathrm{cov}(\log \phi_{m,l}(i), \log \phi_{m,l}(j)) \to 0$ as $n, m \to \infty$. In Step 5 we establish the analogous statements for $\log \zeta_{n,k}(i)$; this step is rather involved. In Step 6 we come to the desired statement $\mathrm{var}\, D_{n,m}(k,l) \to 0$, $n, m \to \infty$.
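For the reader's convenience, the decomposition behind Steps 1-6 can be written out explicitly (our reconstruction from (3.1); the cross term between the $\phi$- and $\zeta$-sums is controlled via the Cauchy-Schwarz inequality):
$$\mathrm{var}\, D_{n,m}(k,l) = \mathrm{var}(S_{\phi}) + \mathrm{var}(S_{\zeta}) - 2\, \mathrm{cov}(S_{\phi}, S_{\zeta}), \qquad S_{\phi} := \frac{1}{n} \sum_{i=1}^n \log \phi_{m,l}(i), \quad S_{\zeta} := \frac{1}{n} \sum_{i=1}^n \log \zeta_{n,k}(i),$$
and, the summands being identically distributed,
$$\mathrm{var}(S_{\phi}) = \frac{1}{n}\, \mathrm{var}(\log \phi_{m,l}(1)) + \frac{2}{n^2} \sum_{1 \leq i < j \leq n} \mathrm{cov}(\log \phi_{m,l}(i), \log \phi_{m,l}(j)),$$
with the analogous formula for $\mathrm{var}(S_{\zeta})$, while $|\mathrm{cov}(S_{\phi}, S_{\zeta})| \leq (\mathrm{var}\, S_{\phi})^{1/2} (\mathrm{var}\, S_{\zeta})^{1/2}$.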
Step 1. We study $\mathsf{E} \log^2 \phi_{m,l}(1)$ as $m \to \infty$. Consider $A := \Lambda(q) \cap S(q) \cap D_q(R_2) \cap A_p(G_{N_1}) \cap A_{p,2}(G_{N_1})$, where the first four sets appeared in the proof of Theorem 1, and $A_{p,2}(G_N)$, for $N \in \mathbb{N}$ and a probability density $p$ on $\mathbb{R}^d$, is defined quite similarly to $A_p(G_N)$. Namely, for $x \in \mathbb{R}^d$,
$$R_{N,2}(x) := \int_{\mathbb{R}^d} G_N(\log^2 \|x - y\|)\, q(y)\, \mu(dy), \qquad A_{p,2}(G_N) := \{x \in S(p) : R_{N,2}(x) < \infty\}. \qquad (4.3)$$
In view of (3.7), for each $x \in A$, the reasoning of the proof of Theorem 1 (convergence in law is preserved under continuous mappings) yields
$$\log^2 \xi_{m,l,x} \xrightarrow{law} \log^2 \xi_{l,x}, \quad m \to \infty. \qquad (4.4)$$
Note that if $\eta \sim \Gamma(\alpha, \lambda)$, where $\alpha > 0$ is the rate and $\lambda > 0$ the shape parameter, then
$$\mathsf{E} \log \eta = \psi(\lambda) - \log \alpha, \qquad \mathsf{E} \log^2 \eta = (\psi(\lambda) - \log \alpha)^2 + \psi'(\lambda). \qquad (4.5)$$
Since $\xi_{l,x} \sim \Gamma(V_d\, q(x), l)$ for $x \in S(q)$, one has
$$\mathsf{E} \log^2 \xi_{l,x} = \log^2 q(x) + h_1 \log q(x) + h_2, \qquad (4.6)$$
where $h_1 := h_1(l, d)$ and $h_2 := h_2(l, d)$ depend only on the fixed $l$ and $d$.
We prove now that, for $x \in A$, one has
$$\mathsf{E}(\log^2 \phi_{m,l}(1) \mid X_1 = x) \to \log^2 q(x) + h_1 \log q(x) + h_2, \quad m \to \infty. \qquad (4.7)$$
By virtue of (4.5) and (4.6), relation (4.7) is equivalent to the following one: $\mathsf{E} \log^2 \xi_{m,l,x} \to \mathsf{E} \log^2 \xi_{l,x}$, $m \to \infty$. So, in view of (4.4), to prove (4.7) it is sufficient to show that, for each $x \in A$, the family $\{\log^2 \xi_{m,l,x}\}_{m \geq m_0(x)}$ is uniformly integrable for some $m_0(x) \in \mathbb{N}$. As in the proof of Theorem 1, we can verify that, for all $x \in A$ and some nonnegative $C_0(x)$,
$$\sup_{m \geq m_0(x)} \mathsf{E}\, G_{N_1}(\log^2 \xi_{m,l,x}) \leq C_0(x) < \infty. \qquad (4.8)$$
Step 2. Now our goal is to prove (4.8). For each $N \in \mathbb{N}$ we use the constants introduced in the proof of Theorem 1; as usual, a product over an empty set (if $N = 1$) is equal to 1.
To show (4.8) we employ the following result.
The proof of this lemma is omitted, being quite similar to that of Lemma 2. By Lemma 7 and since $G_{N_1}(\log^2 u) = 0$ for $u \in \bigl( \tfrac{1}{\rho(N_1)}, \rho(N_1) \bigr)$, one obtains a decomposition of $\mathsf{E}\, G_{N_1}(\log^2 \xi_{m,l,x})$ into the summands $I_1(m, x)$ and $I_2(m, x)$. To simplify notation we do not indicate the dependence of $I_i(m, x)$ ($i = 1, 2$) on $N_1$, $l$ and $d$.
We divide the further proof into several parts.
Part (2a). At first we consider $I_1(m, x)$. As in the proof of Theorem 1, for fixed $R_1 > 0$ and $\varepsilon_1 > 0$ appearing in the conditions of Theorem 2, the inequality $F_{m,l,x}(u) \leq (V_d\, u\, M_q(x, R_1))^{\varepsilon_1}$ holds for the relevant $u$ and $m \geq m_1 := \max\{\ldots\}$; hence we get, for $m \geq m_1$, the bound (4.9) on $I_1(m, x)$.
Part (2b). Consider $I_2(m, x)$. As in the proof of Theorem 1, taking into account the behavior of $1 - F_{m,l,x}(u)$ for large $u$, we represent $I_2(m, x)$ as the sum of $J_1(m, x)$, $J_2(m, x)$ and $J_3(m, x)$, where we do not indicate the dependence of $J_j(m, x)$ ($j = 1, 2, 3$) on $N_1$ and $l$.
For $R_2 > 0$ and $\varepsilon_2 > 0$ appearing in the conditions of Theorem 2 one can prove (see the proof of Theorem 1) the corresponding inequality for all $m \geq m_2 := \max\{\ldots, \lceil \rho^2(N_1) \rceil, l\}$. Here $S_1 := S_1(l)$ and $S_2$ are the same as in the proof of Theorem 1. For all $x \in A$ and $m \geq m_2$, we come to the relation $J_1(m, x) \leq \widetilde{C}_2(x)$.
Part (2c). Now we turn to $J_2(m, x)$. Take $\delta > 0$. Then, due to (4.10), for all $x \in A$ and any $m \geq m_2$, $J_2(m, x)$ admits an analogous bound.
Part (2d). Now we consider $J_3(m, x)$. Take $u = m w$. Then $J_3(m, x)$ takes the form of an integral in $w$, which is treated as in Part (3e) of the proof of Theorem 1.
Step 3. Now we can return to $\mathsf{E} \log^2 \phi_{m,l}(1)$. Set $\Delta_{m,l}(x) := \mathsf{E}(\log^2 \phi_{m,l}(1) \mid X_1 = x) = \mathsf{E} \log^2 \xi_{m,l,x}$. Consider $x \in A$ and take any $m \geq m_0$. The function $G_{N_1}$ is nondecreasing and convex according to Lemma 6. Due to the Jensen inequality,
$$G_{N_1}(\Delta_{m,l}(x)) \leq \mathsf{E}\, G_{N_1}(\log^2 \xi_{m,l,x}).$$
Relation (4.17) guarantees that, for each $x \in A$ and all $m \geq m_0$, the right-hand side is bounded by a $P_X$-integrable function. We have thus established the uniform integrability of the family $\{\Delta_{m,l}(\cdot)\}_{m \geq m_0}$ (w.r.t. the measure $P_X$). Therefore we conclude that
$$\mathsf{E} \log^2 \phi_{m,l}(1) = \int_{\mathbb{R}^d} \Delta_{m,l}(x)\, P_X(dx) \to \int_{\mathbb{R}^d} \mathsf{E} \log^2 \xi_{l,x}\; p(x)\, \mu(dx), \quad m \to \infty.$$
It is easily seen that the finiteness of the integrals appearing in the conditions of Theorem 2 ensures that the limit is finite; this is verified as in Remark 4, taking into account that $\log^2 z \leq \frac{4}{\varepsilon^2} (z^{\varepsilon} + z^{-\varepsilon})$ for all $z > 0$ and $\varepsilon > 0$. Consequently $\mathrm{var}(\log \phi_{m,l}(1))$ is bounded uniformly in all $m$ large enough, whence $\frac{1}{n}\, \mathrm{var}(\log \phi_{m,l}(1)) \to 0$ as $n, m \to \infty$.
Step 4. Now we consider $\mathrm{cov}(\log \phi_{m,l}(i), \log \phi_{m,l}(j))$ for $i \neq j$, where $i, j \in \{1, \ldots, n\}$. For $x, y \in \mathbb{R}^d$, introduce the conditional distribution function
$$\Phi^{i,j}_{m,l,x,y}(u, w) := \mathsf{P}(\phi_{m,l}(i) \leq u,\ \phi_{m,l}(j) \leq w \mid X_i = x, X_j = y), \quad u, w > 0. \qquad (4.19)$$
Here $r_m(a) = (a/m)^{1/d}$ for all $a \geq 0$, as previously. One can write $\Phi_{m,l,x,y}(u, w)$ instead of $\Phi^{i,j}_{m,l,x,y}(u, w)$, because the right-hand side of (4.19) does not depend on $i$ and $j$.
Consider $(x, y) \in A_1$. Obviously, for any $a > 0$, $r_m(a) \to 0$ as $m \to \infty$. For $(x, y) \in A_1$ we take $m_5 = m_5(u, w, \|x - y\|)$ such that $r_m(u) < \frac{\|x - y\|}{2}$ and $r_m(w) < \frac{\|x - y\|}{2}$ for all $m \geq m_5$; then $B(x, r_m(u)) \cap B(y, r_m(w)) = \emptyset$ whenever $m \geq m_5$. In view of (3.7), (4.19) and (4.20), one has for $\Phi_{m,l,x,y}(u, w)$, for all $m \geq m_6(u, w, \|x - y\|) := \max\{m_5, 2(l-1)\}$, the representation (4.21) involving the numbers of points of $\mathbb{Y}_m$ falling into the two disjoint balls. For any fixed $(x, y) \in A_1$ and $u, w > 0$, these counts are asymptotically independent (see (4.22)). Then, according to (4.21), (3.10) and (4.22), for all fixed $u, w > 0$ and $(x, y) \in A_1$,
$$\Phi_{m,l,x,y}(u, w) \to \Phi_{l,x,y}(u, w), \quad m \to \infty.$$
Thus $\Phi_{l,x,y}(\cdot, \cdot)$ is the distribution function of a vector $\eta_{l,x,y} := (\xi_{l,x}, \xi_{l,y})$, where $\xi_{l,x} \sim \Gamma(V_d\, q(x), l)$, $\xi_{l,y} \sim \Gamma(V_d\, q(y), l)$ and the components of $\eta_{l,x,y}$ are independent. Observe also that $\Phi_{m,l,x,y}(\cdot, \cdot)$ is the distribution function of the random vector $\eta_{m,l,x,y} := (\xi_{m,l,x}, \xi_{m,l,y})$. Consequently, we have shown that $\eta_{m,l,x,y} \xrightarrow{law} \eta_{l,x,y}$ as $m \to \infty$. Therefore, for any $(x, y) \in A_1$, $\log \xi_{m,l,x}\, \log \xi_{m,l,y} \xrightarrow{law} \log \xi_{l,x}\, \log \xi_{l,y}$; here we exclude a set of zero probability where the random variables under consideration could vanish. Note that, for all $i, j \in \mathbb{N}$, $i \neq j$,
$$\mathsf{E}(\log \phi_{m,l}(i)\, \log \phi_{m,l}(j) \mid X_i = x, X_j = y) = \int\int \log u\, \log w\, d\Phi_{m,l,x,y}(u, w). \qquad (4.23)$$
Obviously, in view of (3.14) and since $\xi_{l,x}$ and $\xi_{l,y}$ are independent, one has
$$\mathsf{E}(\log \xi_{l,x}\, \log \xi_{l,y}) = \mathsf{E} \log \xi_{l,x}\; \mathsf{E} \log \xi_{l,y} = (\psi(l) - \log V_d - \log q(x))\,(\psi(l) - \log V_d - \log q(y)).$$
Now we intend to verify that, for any $(x, y) \in A_1$, the conditional expectations (4.23) converge to $\mathsf{E}(\log \xi_{l,x} \log \xi_{l,y})$ (relation (4.24)). Equivalently, one can prove that, for each $(x, y) \in A_1$,
$$\mathsf{E}(\log \xi_{m,l,x}\, \log \xi_{m,l,y}) \to \mathsf{E}(\log \xi_{l,x}\, \log \xi_{l,y}), \quad m \to \infty.$$
Part (4a). We establish the uniform integrability of the family $\{\log \xi_{m,l,x}\, \log \xi_{m,l,y}\}_{m \geq m_0}$ for $(x, y) \in A_1$. The function $G_{N_1}(\cdot)$ is nondecreasing and convex. Thus, for any $(x, y) \in A_1$, following the proof of Step 2, one can find $m_0$ (the same as in the proof of Step 2) such that, for all $m \geq m_0$, the expectations $\mathsf{E}\, G_{N_1}(|\log \xi_{m,l,x}\, \log \xi_{m,l,y}|)$ are bounded by a quantity independent of $m$. Clearly, $U_1$, $U_2$, $\varkappa$, $A$, $B$ do not depend on $x$ or $y$ by virtue of (4.16). Hence, for any $(x, y) \in A_1$, the family $\{\log \xi_{m,l,x}\, \log \xi_{m,l,y}\}_{m \geq m_0}$ is uniformly integrable. Therefore we come to (4.24) for $(x, y) \in A_1$.
Here $\widetilde{\Phi}_{k,x,y}(\cdot, \cdot)$ is the distribution function of a vector $\widetilde{\eta}_{k,x,y} := (\widetilde{\xi}_{k,x}, \widetilde{\xi}_{k,y})$, where $\widetilde{\xi}_{k,x} \sim \Gamma(V_d\, p(x), k)$, $\widetilde{\xi}_{k,y} \sim \Gamma(V_d\, p(y), k)$ and the components of $\widetilde{\eta}_{k,x,y}$ are independent. Consequently, we have shown that $\widetilde{\eta}_{n,k,x,y} \xrightarrow{law} \widetilde{\eta}_{k,x,y}$ as $n \to \infty$. Therefore, for any $(x, y) \in A_1$, $\log \eta^y_{n,k,x}\, \log \eta^x_{n,k,y} \xrightarrow{law} \log \widetilde{\xi}_{k,x}\, \log \widetilde{\xi}_{k,y}$; here we exclude a set of zero probability where the random variables under consideration could vanish. In a similar way to (4.23), for $i, j \in \{1, \ldots, n\}$, $i \neq j$, we write
$$\mathsf{E}(\log \zeta_{n,k}(i)\, \log \zeta_{n,k}(j) \mid X_i = x, X_j = y) = \int\int \log u\, \log w\, d\widetilde{\Phi}_{n,k,x,y}(u, w).$$
Since $\widetilde{\xi}_{k,x}$ and $\widetilde{\xi}_{k,y}$ are independent, formula (3.14) yields
$$\mathsf{E}(\log \widetilde{\xi}_{k,x}\, \log \widetilde{\xi}_{k,y}) = (\psi(k) - \log V_d - \log p(x))\,(\psi(k) - \log V_d - \log p(y)).$$
For any fixed $M > 0$, consider $A_{1,M} := \{(x, y) \in A_1 : \|x - y\| > M\}$. Now our aim is to verify that, for each $(x, y) \in A_{1,M}$, relation (4.34) holds. Equivalently, we can prove, for each $(x, y) \in A_{1,M}$, that
$$\mathsf{E}(\log \eta^y_{n,k,x}\, \log \eta^x_{n,k,y}) \to \mathsf{E}(\log \widetilde{\xi}_{k,x}\, \log \widetilde{\xi}_{k,y}), \quad n \to \infty. \qquad (4.35)$$
The restriction to $(x, y) \in A_{1,M}$ is crucial for the further proof.
Part (5a). We will establish the uniform integrability of the family $\{\log \eta^y_{n,k,x}\, \log \eta^x_{n,k,y}\}_{n \geq n_0}$ for $(x, y) \in A_{1,M}$ and some $n_0 \in \mathbb{N}$ which does not depend on $x, y$ but can depend on $M$. Then, due to (4.32), relation (4.35) would be valid for such $(x, y)$ as well.
As we have seen, the function $G_{N_2}(\cdot)$ is nondecreasing and convex. Hence
$$\mathsf{E}\, G_{N_2}(|\log \eta^y_{n,k,x}\, \log \eta^x_{n,k,y}|) \leq \frac{1}{2}\, \mathsf{E}\, G_{N_2}(\log^2 \eta^y_{n,k,x}) + \frac{1}{2}\, \mathsf{E}\, G_{N_2}(\log^2 \eta^x_{n,k,y}).$$
Let us consider, for instance, $\mathsf{E}\, G_{N_2}(\log^2 \eta^y_{n,k,x})$. As at Step 2 we can write the corresponding decomposition; as usual, a sum over the empty set is equal to 0 (for $k = 1$).
The same reasoning as was used in the proof of Theorem 1 (Step 3, Part (3b)) leads to the required inequalities for all $n \geq \max\{n_3(R_4), 3\}$. Then, similarly to (4.15), the relation
$$\sup_{n \geq n_0(M)} \mathsf{E}\, G_{N_2}(\log^2 \eta^y_{n,k,x}) < \infty$$
is valid for all $(x, y) \in A_{1,M}$ and $n \geq n_0(M) := \max\{n_1, n_2, n_3, n_4(\varkappa), 3\}$. Here $U_1$, $U_2$, $\varkappa$, $A$, $B$ do not depend on $x$ or $y$. Thus, in view of (4.36), for any $(x, y) \in A_{1,M}$, the family $\{\log \eta^y_{n,k,x}\, \log \eta^x_{n,k,y}\}_{n \geq n_0}$ is uniformly integrable, and we come to (4.34) for such $(x, y)$. The validity of (4.34) is equivalent to the following one: for any fixed $M > 0$ and $(x, y) \in A_{1,M}$, the conditional covariances $T_{n,k}(x, y)$ tend to $0$ as $n \to \infty$. Due to (4.41) and (4.43) one can conclude that, for all $n \geq n_0$, the family $\{T_{n,k}(x, y)\, \mathbb{I}\{\|x - y\| > M\}\}_{n \geq n_0}$ is uniformly integrable w.r.t. $P_X \otimes P_X$. Consequently, in view of (4.34), for each $M > 0$,
$$\int\int_{\{x, y \in \mathbb{R}^d :\; \|x - y\| > M\}} T_{n,k}(x, y)\, p(x)\, p(y)\, dx\, dy \to 0, \quad n \to \infty,$$
since $X_1$ and $X_2$ are independent and have the density $p$ w.r.t. the Lebesgue measure $\mu$. Taking into account that, for an integrable function $h$, $\int_C h\, d\mathsf{P} \to 0$ as $\mathsf{P}(C) \to 0$, and that $\mathsf{E}|\log \zeta_{n,k}(1)\, \log \zeta_{n,k}(2)| < \infty$ (the proof is similar to establishing that $\mathsf{E}|\log \phi_{m,l}(1)| < \infty$), we conclude that, for any $\gamma > 0$, one can find $M_1 = M_1(\gamma) > 0$ such that, for all $M \in (0, M_1]$ and $n \geq n_0$,
$$\Bigl| \int\int_{\{x, y \in \mathbb{R}^d :\; \|x - y\| \leq M\}} T_{n,k}(x, y)\, p(x)\, p(y)\, dx\, dy \Bigr| < \frac{\gamma}{3}.$$

A Proofs of auxiliary results
Proofs of Lemmas 1, 2 and 3 are similar to the proofs of Lemmas 2.5, 3.1 and 3.2 in [7]. We provide them for the sake of completeness.
Proof of Lemma 2. We start with relation 1). Note that if a function $g$ is measurable and bounded on a finite interval $(a, b]$ and $\nu$ is a finite measure on the Borel subsets of $(a, b]$, then $\int_{(a, b]} g(x)\, \nu(dx)$ is finite. Thus, for each $a \in \bigl(0, \tfrac{1}{e^{[N]}}\bigr]$, the claim follows by applying the integration by parts formula on $\bigl(a, \tfrac{1}{e^{[N]}}\bigr]$ and letting $a \to 0+$.
