
# The Kullback–Leibler Information Function for Infinite Measures

by Victor Bakhtin 1,* and Edvard Sokal 2

1 Department of Mathematics, IT and Landscape Architecture, John Paul II Catholic University of Lublin, Konstantynów Str. 1H, 20-708 Lublin, Poland
2 Department of Mechanics and Mathematics, Belarusian State University, Nezavisimosti Ave. 4, 220030 Minsk, Belarus
* Author to whom correspondence should be addressed.
Entropy 2016, 18(12), 448; https://doi.org/10.3390/e18120448
Submission received: 21 July 2016 / Revised: 1 December 2016 / Accepted: 12 December 2016 / Published: 15 December 2016

## Abstract

In this paper, we introduce the Kullback–Leibler information function $ρ ( ν , μ )$ and prove the local large deviation principle for σ-finite measures μ and finitely additive probability measures ν. In particular, the entropy of a continuous probability distribution ν on the real axis is interpreted as the exponential rate of asymptotics for the Lebesgue measure of the set of those samples that generate empirical measures close to ν in a suitable fine topology.
MSC:
28D20; 60F10

## 1. Introduction

Let P be a continuous probability distribution on the real axis with density $φ ( x ) = d P ( x ) / d x$. Its entropy is defined as
$H ( P ) = − ∫ R φ ( x ) ln φ ( x ) d x .$
What is the substantive sense of $H ( P )$? More precisely, does there exist a mathematical object whose natural quantitative magnitude (e.g., volume) is a certain function of the entropy?
Traditionally, entropy is treated as a measure of disorder. However, this explanation does not answer the question stated above because it does not establish a relationship between entropy and any other quantitative characteristic of disorder that can be defined and measured regardless of the entropy.
To illustrate the problem, consider the entropy of a discrete distribution $P = ( p 1 , ⋯ , p r )$,
$H ( P ) = − ∑ i p i ln p i .$
Its substantive meaning is well known. Namely, let $X = { 1 , ⋯ , r }$ be a finite alphabet. Then, the set of those words $( x 1 , ⋯ , x n ) ∈ X n$ of length $n ≫ 1$ in which every letter $i ∈ X$ occurs with mean frequency close to $p i$ has cardinality of order $e n H ( P )$ (this follows from the Shannon–McMillan–Breiman theorem (see [1,2])). Thus, the entropy of a discrete distribution determines the exponential rate for the number of those words of length n in which letters occur with prescribed frequencies.
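This count can be watched numerically in a small sketch of ours (not part of the paper): for a binary alphabet, the number of words with a prescribed letter composition is a binomial coefficient, and its exponential growth rate approaches the entropy.

```python
import math

def H(ps):
    """Shannon entropy (natural logarithm) of a discrete distribution."""
    return -sum(p * math.log(p) for p in ps if p > 0)

# Number of binary words of length n in which the letter 1 occurs exactly
# round(n * p) times is C(n, k); its exponential growth rate tends to H(p, 1 - p).
p = 0.3
rates = {n: math.log(math.comb(n, round(n * p))) / n for n in (100, 1000, 10000)}
print(rates)   # the rates approach H([0.3, 0.7]) ~ 0.611
```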
Can we say anything of that sort about the entropy of a continuous distribution? It turns out that the answer is yes. Indeed, from Theorem 3 stated below, it follows that entropy (1) determines the exponential rate for the Lebesgue measure of the set of sequences $( x 1 , ⋯ , x n ) ∈ R n$ of length $n ≫ 1$ that generate empirical measures on $R$ close to P. The proximity of distributions should be understood here in the sense of a fine topology, which is defined in the same way as the weak topology, but with the use of integrable functions instead of bounded ones.
For example, if P is the exponential distribution with density $φ ( x ) = λ e − λ x$, $x ≥ 0$, then
$H ( P ) = − ∫ 0 + ∞ λ e − λ x ( ln λ − λ x ) d x = 1 − ln λ ,$
and so the set of sequences $( x 1 , ⋯ , x n ) ∈ R n$ of length $n ≫ 1$ that generate empirical measures close to P (in the fine topology) has Lebesgue measure of order $e n H ( P ) = ( e / λ ) n$.
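As a quick sanity check of this closed form (our illustration; the rate lam is an arbitrary choice), one can evaluate the defining integral numerically:

```python
import math

lam = 2.0
closed_form = 1.0 - math.log(lam)   # H(P) for the exponential density, per the text

# Numerical check of H(P) = -integral of phi * ln(phi) dx by a midpoint rule,
# truncating the integral at T (the tail beyond T is negligible).
T, N = 60.0, 400000
h = T / N
numeric = 0.0
for i in range(N):
    x = (i + 0.5) * h
    phi = lam * math.exp(-lam * x)
    numeric -= phi * math.log(phi) * h
print(closed_form, numeric)   # the two values agree to several decimals
```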
Another example: for the Gaussian distribution P with density
$φ ( x ) = 1 2 π σ 2 e − ( x − a ) 2 / 2 σ 2 ,$
we get
$H ( P ) = 1 2 ln ( 2 π σ 2 e ) ,$
and the set of sequences $( x 1 , ⋯ , x n ) ∈ R n$ of length $n ≫ 1$ that generate empirical measures close to P (in the fine topology) has Lebesgue measure of order $e n H ( P ) = ( 2 π σ 2 e ) n / 2$.
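This closed form can likewise be checked by a seeded Monte Carlo estimate of $− E [ ln φ ( X ) ]$ (our illustration; the parameters a and sigma are arbitrary choices):

```python
import math, random

a, sigma = 0.0, 1.5
closed_form = 0.5 * math.log(2 * math.pi * sigma ** 2 * math.e)

def ln_phi(x):
    """Log-density of N(a, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - a) ** 2 / (2 * sigma ** 2)

# Monte Carlo estimate of H(P) = -E[ln phi(X)] for X ~ N(a, sigma^2), seeded
# so that the run is reproducible.
random.seed(0)
n = 200000
estimate = -sum(ln_phi(random.gauss(a, sigma)) for _ in range(n)) / n
print(closed_form, estimate)   # the estimate is close to the closed form
```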
These examples are based on the presentation of entropy (1) in the form $H ( P ) = − ρ ( P , Q )$, where Q is the Lebesgue measure on the real axis and $ρ ( P , Q )$ is the Kullback–Leibler information function:
$ρ ( P , Q ) = ∫ R φ ( x ) ln φ ( x ) d Q ( x ) , φ ( x ) = d P ( x ) d Q ( x ) ,$
as well as on a certain generalization of the so-called local large deviation principle.
Let P and Q be two probability distributions on a space X. Roughly speaking, the local large deviation principle asserts that the measure $Q n$ of the set of sequences $( x 1 , ⋯ , x n ) ∈ X n$ that generate empirical measures close to P has exponential order $e − n ρ ( P , Q )$, provided $n → + ∞$.
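For discrete distributions, this exponential order can be observed exactly through the method of types: the $Q n$-probability that the empirical measure coincides with P is a multinomial expression whose rate tends to $ρ ( P , Q )$. A sketch of ours, with arbitrarily chosen P and Q:

```python
import math

def kullback(p, q):
    """rho(P, Q) = sum p_i * ln(p_i / q_i) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
rho = kullback(p, q)

def rate(n):
    # Q^n-probability that the empirical measure of (x_1, ..., x_n) equals P
    # exactly: a multinomial count times prod q_i^{k_i}.  Returns -(1/n) ln of it.
    ks = [round(n * pi) for pi in p]
    ln_prob = (math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in ks)
               + sum(k * math.log(qi) for k, qi in zip(ks, q)))
    return -ln_prob / n

print(rho, [rate(n) for n in (10, 100, 1000)])   # the rates approach rho
```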
As far as we know, this principle was first proven by Sanov for a pair of continuous probability distributions on the real axis in [3]. Later, it was extended to general metric spaces (see, for example, [4,5,6,7]), abstract measurable spaces (see [8,9,10]), and spaces of trajectories of various stochastic processes (see [11,12,13,14,15,16,17,18,19]).
It should be mentioned that different authors have referred to the function $ρ ( P , Q )$ by different names: the Kullback–Leibler information function [4], the relative entropy [6], the rate function [5,7,15], the Kullback–Leibler divergence, the action functional [16], and the Kullback–Leibler distance [20] (though, of course, it is nonsymmetric and hence not a metric at all). For brevity, in the sequel, we prefer the term “Kullback action” to any of those listed above.
For a long time, the Kullback action and the local large deviation principle were studied only in the case when both arguments P, Q were probability distributions. Only recently, in the papers [9,10], was the measure Q allowed to be an arbitrary finite positive measure, and the measure P allowed to be finitely additive and, moreover, real-valued. Unfortunately, this is still insufficient for the interpretation of entropy (1) because the Lebesgue measure on the real axis is infinite. Therefore, it is highly desirable to properly define the Kullback action and to obtain a generalization of the local large deviation principle for infinite measures Q. Our main result is the solution of this problem.
It turns out that at least two different generalizations are possible. The first approach is based on the fine topology in the space of probability distributions; it is presented in Theorem 3. In the second approach, the whole space X is replaced by a suitable subset Y of finite measure Q, and the distribution P is replaced by its conditional distribution $P Y$ on Y. Thereby, the problem reduces to the case of finite measures. This approach is implemented in Theorems 4 and 5.
In fact, it makes sense to consider finitely additive probability distributions P as well, since some sequences of empirical measures may converge to finitely additive distributions. In such a case, the Kullback action can take only the values $+ ∞$ or $− ∞$ (Theorem 6). The corresponding versions of the large deviation principle for finitely additive measures P are presented in Theorems 7 and 8.
First results on the large deviation principle for infinite measures were obtained in [21,22], where a countable set X and the “counting” measure Q (such that $Q ( x ) = 1$ for all $x ∈ X$) were considered. In such a case, the Kullback action $ρ ( P , Q )$ coincides (up to the sign) with entropy (2). It was revealed in [21,22] that, for the “counting” measure Q on the countable space X, the ordinary form of the large deviation principle, formulated in terms of the weak topology, fails and so one should use the fine topology instead.
The paper is organized as follows. In the next section, we recall the local large deviation principle for finite measures (Theorem 1). In Section 3, we define the Kullback action $ρ ( ν , μ )$ as the Legendre dual functional to the so-called spectral potential $λ ( φ , μ )$ and formulate two variants of the large deviation principle for the case of a σ-finite measure μ (Theorems 3–5). These theorems are proven in Sections 4–7. In Section 8, we formulate two variants of the large deviation principle for σ-finite measures μ and finitely additive probability distributions ν (Theorems 7 and 8). Theorem 6 states that, in fact, $ρ ( ν , μ )$ turns into $+ ∞$ or $− ∞$ if the measure ν has no density with respect to μ; it is proven in Section 9. The final Section 10 contains the proofs of Theorems 7 and 8.

## 2. The Kullback Action for Finite Measures

Let us consider an arbitrary set X supplied with a σ-field $A$ of its subsets. In what follows, by “measures” we mean only nonnegative measures on the measurable space $( X , A )$.
We will use the following notation:
$B ( X )$ — all bounded measurable functions $f : ( X , A ) → R$;
$M ( X )$ — all finite measures on $( X , A )$;
$M 1 ( X )$ — all probability measures (distributions) on $( X , A )$;
$M σ ( X )$ — all σ-finite measures on $( X , A )$.
Evidently,
$M 1 ( X ) ⊂ M ( X ) ⊂ M σ ( X ) .$
Suppose that $ν , μ ∈ M σ ( X )$ and the measure ν is absolutely continuous with respect to μ. Then, by the Radon–Nikodym theorem, ν can be presented in the form $ν = φ μ$, where φ is a nonnegative measurable function, which is called the density of ν with respect to μ and denoted as $φ = d ν / d μ$. This function is uniquely defined up to a set of zero measure μ.
The Kullback action $ρ ( ν , μ )$ is a function of a probability measure $ν ∈ M 1 ( X )$ and a finite measure $μ ∈ M ( X )$ defined in the following way: if ν is absolutely continuous with respect to μ, then
$ρ ( ν , μ ) = ∫ X φ ln φ d μ , φ = d ν d μ ,$
and $ρ ( ν , μ ) = + ∞$, otherwise. In (4), we set $φ ln φ = 0$ for $φ = 0$. Therefore, $ρ ( ν , μ )$ belongs to the interval $( − ∞ , + ∞ ]$.
With each finite sequence $x = ( x 1 , ⋯ , x n ) ∈ X n$, we associate an empirical measure $δ x , n ∈ M 1 ( X )$ that is supported on the set ${ x 1 , ⋯ , x n }$ and assigns to each $x i$ the measure $1 / n$. The expectation of any function $f : X → R$ with respect to this empirical measure looks like
$δ x , n [ f ] = f ( x 1 ) + ⋯ + f ( x n ) n .$
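In code, the empirical expectation is just a sample mean, and the empirical measure of a set is the mean of its characteristic function (a trivial sketch of ours):

```python
def empirical_expectation(xs, f):
    """delta_{x,n}[f] = (f(x_1) + ... + f(x_n)) / n."""
    return sum(f(x) for x in xs) / len(xs)

sample = [0.5, 1.5, 2.0, 4.0]
mean = empirical_expectation(sample, lambda x: x)
# The empirical measure of the set {x < 2} is the mean of its indicator:
freq = empirical_expectation(sample, lambda x: 1.0 if x < 2.0 else 0.0)
print(mean, freq)   # 2.0 and 0.5
```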
Let us fix any probability measure $μ ∈ M 1 ( X )$. If the points $x i ∈ X$ are treated as independent random variables with common distribution μ, then the empirical measure $δ x , n$ becomes a random variable itself, taking values in $M 1 ( X )$. We will be interested in the asymptotics of its distribution. It turns out that, to a first approximation, this asymptotics is exponential with the exponent $− n ρ ( ν , μ )$.
To describe the asymptotics of the empirical measures distribution, we need two topologies on the space $M 1 ( X )$. The first one is the weak topology generated by neighborhoods of the form
$O ( μ ) = { ν ∈ M 1 ( X ) : | ∫ X f i d ν − ∫ X f i d μ | < ε , i = 1 , ⋯ , k } ,$
where $f 1 , ⋯ , f k ∈ B ( X )$ and $ε > 0$. The second topology is generated by neighborhoods of the same form (5) but with functions $f 1 , ⋯ , f k ∈ L 1 ( X , μ )$ therein. In addition, it is supposed in this case that $O ( μ )$ contains only those measures ν for which all the integrals $∫ X f i d ν$ exist. This topology will be referred to as the fine topology. It is useful because it enables us to formulate the usual law of large numbers in the following form: for any probability distribution $μ ∈ M 1 ( X )$, the sequence of empirical measures $δ x , n$ converges to μ in probability in the fine topology. On the other hand, a shortcoming of the fine topology is that, with respect to it, the affine map $t ↦ ( 1 − t ) μ 0 + t μ 1$, where $t ∈ [ 0 , 1 ]$, may be discontinuous at the endpoints of the segment $[ 0 , 1 ]$.
It is easy to see that the fine topology on $M 1 ( X )$ contains the weak one, but the converse inclusion, in general, does not hold.
For any nonnegative measure μ on X, denote by $μ n$ its Cartesian power supported on $X n$. The next theorem describes asymptotics of the empirical measures distribution.
Theorem 1 (the local large deviation principle for finite measures).
For any measures $ν ∈ M 1 ( X )$, $μ ∈ M ( X )$, and number $ε > 0$, there exists a weak neighborhood $O ( ν ) ⊂ M 1 ( X )$ such that
$μ n { x ∈ X n ∣ δ x , n ∈ O ( ν ) } ≤ e − n ( ρ ( ν , μ ) − ε ) .$
On the other hand, for any measures $ν ∈ M 1 ( X )$, $μ ∈ M ( X )$, number $ε > 0$, and any fine neighborhood $O ( ν ) ⊂ M 1 ( X )$, the following estimate holds for all large enough n:
$μ n { x ∈ X n ∣ δ x , n ∈ O ( ν ) } ≥ e − n ( ρ ( ν , μ ) + ε ) .$
In the case of a metric space X supplied with a Borel σ-field, the neighborhood $O ( ν )$ in (6) can be chosen from the weak topology generated by bounded continuous functions.
Remark 1.
When $ρ ( ν , μ ) = + ∞$, the difference $ρ ( ν , μ ) − ε$ in (6) should be replaced by $1 / ε$.
Remark 2.
So long as each weak neighborhood in $M 1 ( X )$ belongs to the fine topology, estimates (6) and (7) complement each other: the coefficient $ρ ( ν , μ )$ cannot be increased in (6) and cannot be decreased in (7).
Remark 3.
Theorem 1 is also true for finitely additive probability distributions ν on the space X if we set $ρ ( ν , μ ) = + ∞$ in such a case (see [9]).
It is worth mentioning that, until recently, the vast majority of papers on the large deviation principle dealt with random variables in a Polish space (i.e., a complete separable metric space), and only a few of them treated random variables in a topological space (see, for example, [4]), or in a measurable space in which the σ-field is generated by open balls and does not necessarily contain Borel sets (see [7], Section 7). In addition, only countably additive probability distributions ν and μ were considered as arguments of the Kullback action. Theorem 1 for an arbitrary measurable space X, finitely additive measures ν and nonnormalized measures μ was first proven in [9], and its generalization to finitely additive measures μ was proven in [10].

## 3. The Kullback Action for σ-Finite Measures

The shortcoming of Theorem 1 is that it does not cover the case of an infinite measure μ. In particular, it gives no interpretation of entropy (1) of an absolutely continuous probability distribution on the real axis. Unfortunately, the direct extension of Theorem 1 to infinite measures μ fails, as the next example demonstrates.
Example
([22]). Let X be a countable set supplied with the discrete σ-field and μ be the counting measure on X (such that $μ ( x ) = 1$ for every $x ∈ X$). Consider a topology on the space of probability distributions $M 1 ( X )$ generated by the neighborhoods
$O ( ν ) = { ν ′ ∈ M 1 ( X ) : ∑ x ∈ X | ν ′ ( x ) − ν ( x ) | < ε } , ε > 0$
(in other words, the topology of $L 1 ( X , μ )$). Then, for any neighborhood (8) and any number $C > 0$, there exists a finite subset $X 0 ⊂ X$ such that, for all n large enough,
$μ n { x ∈ X 0 n ∣ δ x , n ∈ O ( ν ) } ≥ e n C .$
The topology on $M 1 ( X )$ under consideration contains the weak topology generated by functions from $B ( X )$. It follows that, for $C > − ρ ( ν , μ )$, estimate (9) contradicts (6), and hence the latter cannot take place.
It turns out that, to extend Theorem 1 to σ-finite measures μ, it is enough to replace the weak neighborhood in (6) with a fine one. This is the main result of the paper. Its exact formulation is given in Theorem 3 below.
We also propose one more way to extend Theorem 1, using only the weak topology. The idea is to replace the space X in estimates (6) and (7) by a large enough subset $Y ⊂ X$ of finite measure $μ ( Y )$, and to replace the probability measure $ν ∈ M 1 ( X )$ by its conditional distribution on Y. The corresponding results are stated in Theorems 4 and 5 below.
In order to describe asymptotics of the empirical measures distribution correctly in the case of σ-finite measure μ, the definition of the Kullback action should be modified. To this end, we have to introduce the notion of a spectral potential.
Denote by $B ¯ ( X )$ the set of all bounded above measurable functions on a measurable space $( X , A )$. The spectral potential is the nonlinear functional
$λ ( φ , μ ) = ln ∫ X e φ d μ , φ ∈ B ¯ ( X ) , μ ∈ M σ ( X ) .$
If the integral in this formula diverges, then we set $λ ( φ , μ ) = + ∞$. Thus, $λ ( φ , μ )$ can take values in the interval $( − ∞ , + ∞ ]$.
For brevity, let us introduce the notation
$ν [ f ] = ∫ X f d ν ,$
where $ν ∈ M 1 ( X )$ and $f ∈ B ¯ ( X )$. If the integral diverges, then we put $ν [ f ] = − ∞$.
Now, we define the Kullback action $ρ ( ν , μ )$ as a function of the pair of arguments $ν ∈ M 1 ( X )$ and $μ ∈ M σ ( X )$ as follows: if ν is absolutely continuous with respect to μ, then
$ρ ( ν , μ ) = sup ψ ∈ B ¯ ( X ) { ν [ ψ ] − λ ( ψ , μ ) } ,$
and $ρ ( ν , μ ) = + ∞$, otherwise.
The next theorem shows, in particular, that in the case of a finite measure μ this definition coincides with the previous one (4).
Theorem 2.
If a probability distribution $ν ∈ M 1 ( X )$ is absolutely continuous with respect to $μ ∈ M σ ( X )$ and $d ν / d μ = φ$, then
$ρ ( ν , μ ) = ∫ X φ ln φ d μ , if ∫ φ < 1 φ ln φ d μ > − ∞ ,$
$ρ ( ν , μ ) = − ∞ , if ∫ φ < 1 φ ln φ d μ = − ∞ .$
In particular, for the finite measure μ, the alternative (11) takes place.
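For a finite set X, both the supremum definition (10) and its coincidence with $∫ X φ ln φ d μ$ (Theorem 2) can be verified by direct computation: the supremum of $ν [ ψ ] − λ ( ψ , μ )$ is attained at $ψ = ln φ$. A sketch of ours; the weights of mu and phi are arbitrary choices:

```python
import math

# Finite illustration on X = {0, 1, 2}: mu is a finite positive measure,
# nu = phi * mu is a probability measure with density phi.
mu = [0.5, 2.0, 1.0]
phi = [0.8, 0.2, 0.2]                      # density d(nu)/d(mu)
nu = [f * m for f, m in zip(phi, mu)]      # weights sum to 1

def spectral_potential(psi, mu):
    """lambda(psi, mu) = ln of the integral of e^psi d(mu)."""
    return math.log(sum(math.exp(p) * m for p, m in zip(psi, mu)))

def pairing(nu, psi):
    """nu[psi] = integral of psi d(nu)."""
    return sum(n * p for n, p in zip(nu, psi))

rho = sum(n * math.log(f) for n, f in zip(nu, phi))   # integral of phi ln phi d(mu)

# The supremum in (10) is attained at psi = ln(phi):
psi_star = [math.log(f) for f in phi]
assert abs(pairing(nu, psi_star) - spectral_potential(psi_star, mu) - rho) < 1e-12
# Any other psi stays below rho:
for psi in ([0.0, 0.0, 0.0], [1.0, -2.0, 0.5], [-3.0, 1.0, 2.0]):
    assert pairing(nu, psi) - spectral_potential(psi, mu) <= rho + 1e-12
print(rho)
```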
The following theorem is our main result for the case of countably additive distributions.
Theorem 3 (the local large deviation principle for infinite measures).
For any measures $ν ∈ M 1 ( X )$, $μ ∈ M σ ( X )$, and number $ε > 0$, there exists a fine neighborhood $O ( ν ) ⊂ M 1 ( X )$ such that
$μ n { x ∈ X n ∣ δ x , n ∈ O ( ν ) } ≤ e − n ( ρ ( ν , μ ) − ε ) .$
On the other hand, for any measures $ν ∈ M 1 ( X )$, $μ ∈ M σ ( X )$, number $ε > 0$, and any fine neighborhood $O ( ν ) ⊂ M 1 ( X )$, the following estimate holds for all large enough n:
$μ n { x ∈ X n ∣ δ x , n ∈ O ( ν ) } ≥ e − n ( ρ ( ν , μ ) + ε ) .$
If $ρ ( ν , μ ) = + ∞$, then the difference $ρ ( ν , μ ) − ε$ in (13) should be replaced by $1 / ε$, and if $ρ ( ν , μ ) = − ∞$ then the sum $ρ ( ν , μ ) + ε$ in (14) should be replaced by $− 1 / ε$.
Let us also formulate the local large deviation principle in terms of weak neighborhoods.
For any probability measure $ν ∈ M 1 ( X )$ and any measurable subset $Y ⊂ X$ with $ν ( Y ) > 0$, define a conditional measure $ν Y ∈ M 1 ( X )$ according to the formula
$ν Y ( A ) = ν ( A ∩ Y ) ν ( Y ) , A ∈ A .$
It is easily seen that the measure ν can be approximated in the fine topology (and all the more in the weak one) by the conditional measures $ν Y$ with $μ ( Y ) < + ∞$. Therefore, it makes sense to replace fine neighborhoods of ν in Theorem 3 by weak neighborhoods of nearby conditional measures $ν Y$.
We will say that the Kullback action $ρ ( ν , μ )$ is well-defined if ν has a density $φ = d ν / d μ$, and, in addition, at least one of the two integrals
$∫ φ < 1 φ ln φ d μ , ∫ φ ≥ 1 φ ln φ d μ$
is finite. In all other cases (i.e., when both integrals (15) are infinite or the measure ν has no density with respect to μ), we will say that the Kullback action is ill-defined.
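A concrete example of a well-defined Kullback action equal to $− ∞$ (our illustration, not taken from the paper): on a countable set with the counting measure, a density decaying like $1 / ( k ln 2 k )$ is summable, the set ${ φ ≥ 1 }$ is empty, yet the integral of $φ ln φ$ over ${ φ < 1 }$ diverges to $− ∞$. The divergence can be watched numerically:

```python
import math

# X = {2, 3, 4, ...} with the counting measure mu, phi(k) ~ 1/(k * ln(k)^2).
# The normalizer C is approximated by a long truncation of the convergent series.
C = sum(1.0 / (k * math.log(k) ** 2) for k in range(2, 10 ** 6))

def partial(N):
    """Partial sums of the integral of phi * ln(phi) d(mu) up to N."""
    total = 0.0
    for k in range(2, N):
        p = 1.0 / (C * k * math.log(k) ** 2)   # p < 1, so every term is negative
        total += p * math.log(p)
    return total

pa, pb = partial(10 ** 3), partial(10 ** 5)
print(pa, pb)   # the partial sums keep sinking: the integral diverges to -infinity
```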
Theorem 4.
Suppose that, for some measures $ν ∈ M 1 ( X )$ and $μ ∈ M σ ( X )$, the Kullback action $ρ ( ν , μ )$ is well-defined. Then, for any number $ε > 0$, there exists a set $X ε ∈ A$ with $μ ( X ε ) < + ∞$ such that for any $Y ∈ A$ containing $X ε$ and having a finite measure $μ ( Y )$:
(a)
there exists a weak neighborhood $O ( ν Y ) ⊂ M 1 ( Y )$ satisfying the estimate
$μ n { x ∈ Y n ∣ δ x , n ∈ O ( ν Y ) } ≤ e − n ( ρ ( ν Y , μ ) − ε ) ;$
(b)
for any fine neighborhood $O ( ν Y ) ⊂ M 1 ( Y )$ and all large enough n,
$μ n { x ∈ Y n ∣ δ x , n ∈ O ( ν Y ) } ≥ e − n ( ρ ( ν Y , μ ) + ε ) .$
In addition, for any $ε > 0$ and any fine neighborhood $O ( ν ) ⊂ M 1 ( X )$, there exists a set $Y ∈ A$ with $μ ( Y ) < + ∞$ such that for all large enough n,
$μ n { x ∈ Y n ∣ δ x , n ∈ O ( ν ) } ≥ e − n ( ρ ( ν , μ ) + ε ) .$
Theorem 5.
Suppose that, for some measures $ν ∈ M 1 ( X )$ and $μ ∈ M σ ( X )$, the Kullback action $ρ ( ν , μ )$ is ill-defined. Then, there exists a set $X 0 ∈ A$ with $μ ( X 0 ) < + ∞$, such that, for any $Y ∈ A$ containing $X 0$ and having a finite measure $μ ( Y )$, and any $ε > 0$, there exists a weak neighborhood $O ( ν Y ) ⊂ M 1 ( Y )$ satisfying the estimate
$μ n { x ∈ Y n ∣ δ x , n ∈ O ( ν Y ) } ≤ e − n / ε .$
It is worth mentioning that, under the conditions of Theorem 5, the equality $ρ ( ν , μ ) = − ∞$ may take place. In such a case, estimates (19) and (14) pull in opposite directions. Nevertheless, there is no contradiction here because the sets involved in these estimates are different.

## 4. Proof of Theorem 2

Recall that, under conditions of Theorem 2, the measure $ν ∈ M 1 ( X )$ is absolutely continuous with respect to $μ ∈ M σ ( X )$ and has a density $φ = d ν / d μ$. First of all, we will prove that for any function $ψ ∈ B ¯ ( X )$,
If at least one of the expressions $ν [ ψ ]$ or $λ ( ψ , μ )$ takes the infinite value allowed to it, then the left-hand side of (20) turns into $− ∞$, and so the inequality is true. Thus, it is enough to consider the case of finite $ν [ ψ ]$ and $λ ( ψ , μ )$.
Suppose first that
$∫ φ < 1 φ ln φ d μ > − ∞ .$
For any $ε > 0$, define the set
$A ε = { x ∈ X : ε < φ ( x ) < 1 / ε , ψ ( x ) > − 1 / ε }$
and the conditional distribution $ν ε$ on it:
$ν ε ( B ) = ν ( B ∩ A ε ) ν ( A ε ) , B ∈ A .$
Evidently, $ν ε$ has the density
$φ ε = d ν ε d μ = χ ε φ ν ( A ε ) ,$
where $χ ε$ is the characteristic function of $A ε$.
From elementary properties of integrals, it follows that
$λ ( ψ , μ ) = ln ∫ X e ψ d μ ≥ ln ∫ A ε e ψ d μ = ln ∫ A ε e ψ − ln φ ε d ν ε$
$≥ ∫ A ε ( ψ − ln φ ε ) d ν ε = ∫ A ε ( ψ − ln φ + ln ν ( A ε ) ) d ν ν ( A ε )$
$= 1 ν ( A ε ) ∫ A ε ψ d ν − 1 ν ( A ε ) ∫ A ε φ ln φ d μ + ln ν ( A ε )$
(in the passage from (23) to (24), Jensen’s inequality is used). As $ε → 0$, the expression in (25) converges to
$ν [ ψ ] − ∫ X φ ln φ d μ .$
Therefore, (23)–(25) imply the first case of (20) in the limit.
Now, suppose that $ν [ ψ ]$ and $λ ( ψ , μ )$ are finite and
$∫ φ < 1 φ ln φ d μ = − ∞ .$
Consider the sets
$A ε = { x ∈ X : ε < φ ( x ) < 1 } , ε ≥ 0 .$
As before, define the conditional distributions $ν ε$ and densities $φ ε$ by means of (21) and (22). Then, calculations (23)–(25) still hold, but the expression in (25) converges now to the limit
$1 ν ( A 0 ) ∫ A 0 ψ d ν − 1 ν ( A 0 ) ∫ A 0 φ ln φ d μ + ln ν ( A 0 ) .$
In the situation under consideration, the first and the third summands in (27) are finite, while the second one turns into $+ ∞$. Therefore, from (23)–(25), it follows that $λ ( ψ , μ ) = + ∞$, which contradicts the assumption about finiteness of $λ ( ψ , μ )$. Thus, in the situation when both $ν [ ψ ]$ and $λ ( ψ , μ )$ are finite, equality (26) cannot take place. Thereby, inequality (20) is completely proven.
To finish the proof of Theorem 2, it is enough to verify the equality
By virtue of (20), the left-hand side of (28) does not exceed the right-hand one. If the right-hand side of (28) equals $− ∞$, then the equality is trivial. Consider the case when the right-hand side of (28) is greater than $− ∞$. By the σ-finiteness of μ, there exists a function $η ∈ B ¯ ( X )$ such that the integral $∫ X e η d μ$ is finite. Consider the family of functions $ψ t$, $t > 0$, defined by $ψ t = η − t$ on the set ${ φ = 0 }$, $ψ t = ln φ$ on ${ 0 < φ ≤ e t }$, and $ψ t = t$ on ${ φ > e t }$.
Obviously, $ψ t ∈ B ¯ ( X )$, and if t goes to $+ ∞$, then
$∫ X e ψ t d μ = ∫ φ = 0 e η − t d μ + ∫ 0 < φ ≤ e t φ d μ + ∫ φ > e t e t d μ ⟶ ∫ X φ d μ = 1 , ν [ ψ t ] = ∫ 0 < φ ≤ e t φ ln φ d μ + ∫ φ > e t t φ d μ ⟶ ∫ X φ ln φ d μ , ν [ ψ t ] − λ ( ψ t , μ ) = ν [ ψ t ] − ln ∫ X e ψ t d μ ⟶ ∫ X φ ln φ d μ .$
It follows that the supremum in the left-hand side of (28) coincides with the right-hand side. ☐

## 5. Proof of the First Part of Theorem 3

At first, suppose that there exists a measurable set A with $μ ( A ) = 0$ and $ν ( A ) > 0$. Then, by definition, $ρ ( ν , μ ) = + ∞$. Denote by $χ A$ the characteristic function of A. Define a fine neighborhood (in fact, a weak one) of the measure ν as follows:
$O ( ν ) = { ν ′ ∈ M 1 ( X ) : | ν ′ [ χ A ] − ν [ χ A ] | < ν ( A ) } .$
If a sequence $x = ( x 1 , ⋯ , x n ) ∈ X n$ satisfies the condition $δ x , n ∈ O ( ν )$, then $δ x , n [ χ A ] > 0$. This implies that at least one of the points $x i$ belongs to A. Therefore,
$μ n { x ∈ X n ∣ δ x , n ∈ O ( ν ) } ≤ μ n { x ∈ X n ∣ x i ∈ A for some i } = 0 ,$
which implies (13), where $ρ ( ν , μ ) − ε$ is replaced by $1 / ε$ by convention.
Now suppose that the second case of formula (10) holds true:
$ρ ( ν , μ ) = sup ψ ∈ B ¯ ( X ) { ν [ ψ ] − λ ( ψ , μ ) } .$
If $ρ ( ν , μ ) = − ∞$, then estimate (13) is trivial. Thus, let $ρ ( ν , μ ) > − ∞$. In this case, (29) implies that, for any $ε > 0$, there exists a function $ψ ∈ B ¯ ( X )$ that satisfies
$ρ ( ν , μ ) − ε / 2 < ν [ ψ ] − λ ( ψ , μ ) .$
From this inequality, it follows automatically that $ν [ ψ ] > − ∞$ and $λ ( ψ , μ ) < + ∞$.
Consider the probability distribution $μ ψ = e ψ − λ ( ψ , μ ) μ$. For any sequence $x = ( x 1 , ⋯ , x n ) ∈ X n$, we have
$d μ n ( x ) d μ ψ n ( x ) = ∏ i = 1 n d μ ( x i ) d μ ψ ( x i ) = ∏ i = 1 n e λ ( ψ , μ ) − ψ ( x i ) = e n ( λ ( ψ , μ ) − δ x , n [ ψ ] ) .$
Define a fine neighborhood of the measure $ν ∈ M 1 ( X )$ as follows:
$O ( ν ) = { ν ′ ∈ M 1 ( X ) : | ν ′ [ ψ ] − ν [ ψ ] | < ε / 2 } .$
Then, under the condition $δ x , n ∈ O ( ν )$, it follows from (30) and (31) that
$d μ n ( x ) d μ ψ n ( x ) = e n ( λ ( ψ , μ ) − δ x , n [ ψ ] ) < e n ( λ ( ψ , μ ) − ν [ ψ ] + ε / 2 ) < e n ( − ρ ( ν , μ ) + ε ) .$
Next, since the measure $μ ψ n$ is probabilistic,
$μ n { x ∈ X n ∣ δ x , n ∈ O ( ν ) } = ∫ δ x , n ∈ O ( ν ) d μ n ( x ) ≤ ∫ δ x , n ∈ O ( ν ) e n ( − ρ ( ν , μ ) + ε ) d μ ψ n ( x ) ≤ e n ( − ρ ( ν , μ ) + ε ) .$
Thus, inequality (13) is proven in all cases. ☐

## 6. Proof of the Second Part of Theorem 3

Now let us proceed to estimate (14). It is trivial if $ρ ( ν , μ ) = + ∞$. Thus, in the sequel, we may suppose that $ρ ( ν , μ ) ∈ [ − ∞ , + ∞ )$. Then, (10) implies that ν is absolutely continuous with respect to μ and has a density $φ = d ν / d μ$.
First, consider the case of finite $ρ ( ν , μ )$. Then, Theorem 2 implies
$ρ ( ν , μ ) = ∫ X φ ln φ d μ = ∫ X ln φ d ν = ν [ ln φ ] .$
Fix any $ε > 0$ and any fine neighborhood $O ( ν ) ⊂ M 1 ( X )$. Consider the sets
$Y n = { x ∈ X n ∣ δ x , n ∈ O ( ν ) , δ x , n [ ln φ ] < ν [ ln φ ] + ε / 2 }$
(in the latter inequality, it is supposed that each element of the sequence $x = ( x 1 , ⋯ , x n )$ satisfies the condition $φ ( x i ) > 0$). Note that, for $x ∈ Y n$,
$d μ n ( x ) d ν n ( x ) = ∏ i = 1 n d μ ( x i ) d ν ( x i ) = ∏ i = 1 n 1 φ ( x i ) = e − n δ x , n [ ln φ ] > e − n ( ν [ ln φ ] + ε / 2 ) .$
Hence,
$μ n ( Y n ) = ∫ Y n d μ n ( x ) ≥ ∫ Y n e − n ( ν [ ln φ ] + ε / 2 ) d ν n ( x ) = e − n ( ρ ( ν , μ ) + ε / 2 ) ν n ( Y n ) .$
By the law of large numbers, $ν n ( Y n ) → 1$. Thus, (32) implies (14).
Now, suppose that $ρ ( ν , μ ) = − ∞$. Then, by Theorem 2,
$∫ φ < 1 φ ln φ d μ = − ∞ .$
Divide the whole space X into two parts: $X = X − ⊔ X +$, where
$X − = { x ∈ X ∣ φ ( x ) < 1 } , X + = { x ∈ X ∣ φ ( x ) ≥ 1 } .$
Set $X k + = { x ∈ X + ∣ 1 ≤ φ ( x ) ≤ k }$. Evidently, $X + = ⋃ k X k +$ and
$μ ( X k + ) ≤ μ ( X + ) = ∫ X + d μ = ∫ X + d ν φ ≤ ∫ X + d ν = ν ( X + ) ≤ 1 .$
Then, construct a sequence of embedded sets $X 1 − ⊂ X 2 − ⊂ X 3 − ⊂ ⋯$ with $μ ( X k − ) < + ∞$, such that $X − = ⋃ k X k −$, and, at the same time,
$∫ Y k φ ln φ d μ → − ∞ , where Y k = X k − ∪ X k + .$
Such construction is possible due to (33) and (34). Evidently, $Y 1 ⊂ Y 2 ⊂ Y 3 ⊂ ⋯$, and each $Y k$ is of finite measure μ, and their union gives the whole X.
Denote by $ν k$ the conditional distribution of ν on $Y k$:
$ν k ( A ) = ν ( A ∩ Y k ) ν ( Y k ) , A ∈ A .$
It has the density
$φ k = d ν k d μ = χ k φ ν ( Y k ) ,$
where $χ k$ is the characteristic function of $Y k$. Evidently, the sequence $ν k$ converges to ν in the fine topology, and (35) implies that $ρ ( ν k , μ ) → − ∞$. In addition, the condition $μ ( Y k ) < + ∞$ implies that $ρ ( ν k , μ ) > − ∞$.
Fix an arbitrary $ε > 0$ and an arbitrary fine neighborhood $O ( ν )$. Choose k so large that $ν k ∈ O ( ν )$ and simultaneously $ρ ( ν k , μ ) < − 1 / ε$. In the case of finite Kullback action, estimate (14) is already proven. Apply it to the pair of measures $ν k$ and μ, the neighborhood $O ( ν )$ of the measure $ν k$, and the number $ε ′ = − 1 / ε − ρ ( ν k , μ )$:
$μ n { x ∈ X n ∣ δ x , n ∈ O ( ν ) } ≥ e − n ( ρ ( ν k , μ ) + ε ′ ) = e n / ε ,$
provided n is large enough. This is exactly estimate (14) for the case $ρ ( ν , μ ) = − ∞$. ☐

## 7. Proof of Theorems 4 and 5

Proof of Theorem 4.
Under the conditions of Theorem 4, the Kullback action is well-defined. Using the definition of well-definedness and Theorem 2, we can choose a subset $X ε ∈ A$ with $μ ( X ε ) < + ∞$, such that, for any set $Y ∈ A$ that contains $X ε$ and has a finite measure $μ ( Y )$, one of the following holds:
(a)
$| ρ ( ν Y , μ ) − ρ ( ν , μ ) | < ε$ in the case of finite $ρ ( ν , μ )$,
(b)
$ρ ( ν Y , μ ) > 1 / ε$ in the case $ρ ( ν , μ ) = + ∞$,
(c)
$ρ ( ν Y , μ ) < − 1 / ε$ in the case $ρ ( ν , μ ) = − ∞$.
Now, estimates (16) and (17) follow from the corresponding estimates of Theorem 1.
In addition, estimate (18) comes from estimate (7) of Theorem 1. To see this, it is enough to choose a set $Y ∈ A$ with $μ ( Y ) < + ∞$ such that, along with one of the conditions (a)–(c), it satisfies the condition $ν Y ∈ O ( ν )$. ☐
Proof of Theorem 5.
Suppose that the measure ν is absolutely continuous with respect to μ and has a density $φ = d ν / d μ$, but the Kullback action $ρ ( ν , μ )$ is ill-defined. Consider the set
$X 0 = { x ∈ X ∣ φ ( x ) ≥ 1 } .$
Obviously, $μ ( X 0 ) ≤ ν ( X 0 ) ≤ 1$. In addition, since the Kullback action is ill-defined,
$∫ X 0 φ ln φ d μ = + ∞ .$
For any measurable set $Y ⊃ X 0$ with $μ ( Y ) < + ∞$, the corresponding conditional distribution $ν Y$ has the density $χ Y φ / ν ( Y )$. Therefore,
$ρ ( ν Y , μ ) = ∫ Y φ ν ( Y ) ln φ ν ( Y ) d μ = + ∞ ,$
and hence estimate (19) follows from estimate (6) of Theorem 1.
Now consider the case when the measure ν is not absolutely continuous with respect to μ. Then, there exists $X 0 ∈ A$ with $μ ( X 0 ) = 0$ and $ν ( X 0 ) > 0$. Suppose that a set $Y ∈ A$ with $μ ( Y ) < + ∞$ contains $X 0$. Obviously, the conditional distribution $ν Y$ is not absolutely continuous with respect to μ and hence $ρ ( ν Y , μ ) = + ∞$. Thus, we can apply Theorem 1 to the measures $ν Y$, μ on the space Y and obtain (19). ☐

## 8. The Case of Finitely Additive Probability Distributions ν

We must consider finitely additive probability distributions ν because they may happen to be accumulation points for some sequences of empirical measures. Thus, to make the description of the empirical measures distribution complete, we should obtain estimates similar to (13) and (14) for finitely additive probability distributions ν as well.
In fact, this can be done, and the principal result is that Theorems 3 and 5 still hold true for finitely additive probability distributions ν, provided the Kullback action $ρ ( ν , μ )$ is defined by (10). In addition, in that case, $ρ ( ν , μ )$ may take only the values $+ ∞$ or $− ∞$, and both are possible.
The transition from countably additive distributions to only finitely additive ones is not trivial. First of all, we should adapt some previous definitions to the new setting.
Denote by $N 1 ( X )$ the set of all finitely additive probability measures on $( X , A )$. Each $ν ∈ N 1 ( X )$ is naturally identified with a positive normalized linear functional on the space of bounded measurable functions $B ( X )$ (i.e., a functional that takes nonnegative values on nonnegative functions and the unit value on the unit function). Using this identification, we denote the integral of $f ∈ B ( X )$ with respect to $ν ∈ N 1 ( X )$ as $ν [ f ]$. In addition, for bounded above functions $f ∈ B ¯ ( X )$, let us define $ν [ f ]$ as
$ν [ f ] = lim c → − ∞ ν [ f ∨ c ] , f ∨ c = max { f , c } .$
Thus, for $f ∈ B ¯ ( X )$, the value $ν [ f ]$ belongs to the interval $[ − ∞ , + ∞ )$. Similarly, for a measurable function f that is bounded from below, put
$ν [ f ] = lim c → + ∞ ν [ f ∧ c ] , f ∧ c = min { f , c } .$
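The truncation definition can be illustrated for a countably additive ν, where the limit recovers the ordinary integral (a sketch of ours; the geometric distribution and the function f are arbitrary choices):

```python
# nu is the geometric distribution on {1, 2, ...} with nu(k) = 2^{-k}, and
# f(k) = -k is unbounded below.  The truncations nu[f v c] = nu[max(f, c)]
# decrease to nu[f] = -2 (minus the mean of nu) as c -> -infinity.
weights = [0.5 ** k for k in range(1, 200)]   # nu(k) = 2^{-k}, tail truncated

def nu_truncated(c):
    """nu[f v c] for f(k) = -k and the cutoff level c."""
    return sum(w * max(-k, c) for k, w in enumerate(weights, start=1))

values = [nu_truncated(c) for c in (-1, -5, -20, -100)]
print(values)   # decreases toward nu[f] = -2
```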
Now, we define the Kullback action $ρ ( ν , μ )$ for the case when $ν ∈ N 1 ( X )$ and $μ ∈ M σ ( X )$:
$ρ ( ν , μ ) = sup ψ ∈ B ¯ ( X ) { ν [ ψ ] − λ ( ψ , μ ) } .$
Obviously, this definition just duplicates (10).
Theorem 6.
If $ν ∈ N 1 ( X )$ has no density with respect to $μ ∈ M σ ( X )$, then $ρ ( ν , μ )$ turns into $+ ∞$ or $− ∞$. In particular, if μ is finite or ν is countably additive, then $ρ ( ν , μ ) = + ∞$.
Let us introduce a fine topology on $N 1 ( X )$ by means of neighborhoods of the form
where $ε > 0$ and the functions $f 1 , ⋯ , f k ∈ B ¯ ( X )$ are such that all $ν [ f i ]$ are finite. Clearly, this definition is analogous to (5). Note that the bounded above functions in (37) may be replaced by bounded below or even nonnegative ones. This will not change the collection of neighborhoods (37).
Now, we reformulate Theorems 3 and 5 for the case of finitely additive distributions ν (note that Theorem 4 cannot be reformulated since $ρ ( ν , μ )$ is well-defined, and hence ν is countably additive in it).
Theorem 7.
For any measures $ν ∈ N 1 ( X )$, $μ ∈ M σ ( X )$, and number $ε > 0$, there exists a fine neighborhood $O ( ν ) ⊂ N 1 ( X )$ such that
$μ n { x ∈ X n ∣ δ x , n ∈ O ( ν ) } ≤ e − n ( ρ ( ν , μ ) − ε ) .$
On the other hand, for any measures $ν ∈ N 1 ( X )$, $μ ∈ M σ ( X )$, number $ε > 0$, and any fine neighborhood $O ( ν ) ⊂ N 1 ( X )$, the following estimate holds for all large enough n:
$\mu^n \bigl\{\, x \in X^n \mid \delta_{x,n} \in O(\nu) \,\bigr\} \ge e^{-n(\rho(\nu,\mu) + \varepsilon)}. \tag{39}$
If $ρ ( ν , μ ) = + ∞$, then the difference $ρ ( ν , μ ) − ε$ in (38) should be replaced by $1 / ε$, and if $ρ ( ν , μ ) = − ∞$, then the sum $ρ ( ν , μ ) + ε$ in (39) should be replaced by $− 1 / ε$.
A measure $ν ∈ N 1 ( X )$ will be called proper with respect to a measure $μ ∈ M σ ( X )$ if, for any $ε > 0$, there exists a set $A ∈ A$ such that $μ ( A ) < + ∞$ and $ν ( A ) > 1 − ε$. If, on the contrary, there exists an $ε > 0$ such that the inequality $ν ( A ) > 1 − ε$ implies $μ ( A ) = + ∞$, then the measure ν will be called improper with respect to μ. Obviously, in the case of finite μ, all measures $ν ∈ N 1 ( X )$ are proper, and, in the case of σ-finite μ, all countably additive measures $ν ∈ N 1 ( X )$ are proper.
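As a concrete sanity check of the properness definition (a toy example, not from the paper), take μ to be counting measure on $X = \{0, 1, 2, \dots\}$ — a σ-finite measure with $μ ( X ) = + ∞$ — and ν an assumed geometric distribution. Since ν is countably additive, it must be proper, and for each ε one can exhibit the required set A explicitly:

```python
# Toy illustration of properness: mu = counting measure on {0,1,2,...}
# (sigma-finite, mu(X) = +infinity), nu = geometric(1/2) (countably
# additive, hence proper).  For each eps we exhibit a set A with
# mu(A) < +infinity and nu(A) > 1 - eps.
p = 0.5

def smallest_proper_set(eps):
    """Return n such that A = {0, ..., n-1} satisfies nu(A) > 1 - eps."""
    total, n = 0.0, 0
    while total <= 1 - eps:
        total += (1 - p) * p**n   # nu({n})
        n += 1
    return n

for eps in (0.5, 0.1, 0.01):
    n = smallest_proper_set(eps)
    nu_A = sum((1 - p) * p**x for x in range(n))
    assert nu_A > 1 - eps   # nu(A) > 1 - eps, while mu(A) = n < +infinity
```

An improper ν, by contrast, would have to concentrate mass "at infinity" in a finitely additive way, which cannot be realized by any such pointwise weight sequence.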
Theorem 8.
Suppose that for some measures $ν ∈ N 1 ( X )$ and $μ ∈ M σ ( X )$, the Kullback action $ρ ( ν , μ )$ is ill-defined, and the measure ν is proper with respect to μ. Then, there exists a set $X 0 ∈ A$ with $μ ( X 0 ) < + ∞$, such that, for any $Y ∈ A$ containing $X 0$ and having a finite measure $μ ( Y )$, and any $ε > 0$, there exists a weak neighborhood $O ( ν Y ) ⊂ N 1 ( Y )$ satisfying the estimate
$\mu^n \bigl\{\, x \in Y^n \mid \delta_{x,n} \in O(\nu_Y) \,\bigr\} \le e^{-n/\varepsilon}. \tag{40}$

## 9. Proof of Theorem 6

Lemma 9.
Suppose that a measure $ν ∈ N 1 ( X )$ is proper with respect to $μ ∈ M σ ( X )$ and that, for any $ε > 0$, there exists $δ > 0$ such that $μ ( A ) < δ$ implies $ν ( A ) < ε$. Then, ν is countably additive and absolutely continuous with respect to μ.
Proof.
Construct a sequence of nested measurable sets $A 1 ⊂ A 2 ⊂ A 3 ⊂ ⋯$ such that all of them have finite measures $μ ( A n )$, satisfy the condition $ν ( A n ) > 1 − 1 / n$, and their union is the whole X.
The restriction of μ to each $A n$ is finite and continuous: if a sequence of nested measurable sets $A n ⊃ B 1 ⊃ B 2 ⊃ B 3 ⊃ ⋯$ has an empty intersection, then $μ ( B k ) → 0$. The assumption of Lemma 9 implies that the restriction of ν to $A n$ is continuous as well. It is known that the continuity of a finite measure is equivalent to its countable additivity. Then, the restriction of ν to each $A n$ is countably additive. Since ν is proper, we have $ν ( B ) = lim n → ∞ ν ( B ∩ A n )$ for any measurable B. It follows that ν is countably additive on the whole X (this may be proven in the same way as the countable additivity of a σ-finite measure). ☐
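The continuity property invoked in the proof can be observed numerically (a toy example with an assumed geometric measure ν, not part of the proof): for the nested sets $B_k = \{x \ge k\}$ with empty intersection, the measures $ν ( B k )$ decrease to 0, exactly as continuity requires.

```python
# Toy illustration of continuity of a finite measure: nu = geometric(1/2)
# on {0,1,2,...}.  The nested sets B_k = {x >= k} have empty intersection,
# and continuity (equivalently, countable additivity) forces nu(B_k) -> 0.
p = 0.5
N = 2000  # truncation of the series; the tail beyond N is negligible

def nu_tail(k):
    """nu(B_k) for B_k = {k, k+1, ...} (truncated series)."""
    return sum((1 - p) * p**x for x in range(k, N))

tails = [nu_tail(k) for k in (0, 1, 5, 20, 100)]
assert all(a > b for a, b in zip(tails, tails[1:]))   # strictly decreasing
assert tails[-1] < 1e-29                              # nu(B_k) -> 0
```

A merely finitely additive measure concentrating mass "at infinity" would violate this: the tails would stay bounded away from 0.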
Proof of Theorem 6.
It follows from (36) that either $ρ ( ν , μ ) = + ∞$ or
$\rho(\nu,\mu) = \sup_{\psi \in \bar{B}(X)} \bigl\{\, \nu[\psi] - \lambda(\psi,\mu) \,\bigr\}. \tag{41}$
In the first case, the assertion of Theorem 6 is valid. Therefore, it is enough to consider the case when the Kullback action is defined by formula (41).
By the assumption of Theorem 6, the measure $ν ∈ N 1 ( X )$ has no density with respect to μ. Then, Lemma 9 guarantees that at least one of the following two conditions holds:
(a)
there exists a positive ε, such that, for any $δ > 0$, one can choose a measurable set $A δ$ such that $μ ( A δ ) < δ$ and $ν ( A δ ) ≥ ε$;
(b)
the measure ν is improper with respect to μ.
Suppose that (a) is valid. If $ρ ( ν , μ ) > − ∞$, then (41) implies the existence of a function $ψ ∈ B ¯ ( X )$ such that $ν [ ψ ] − λ ( ψ , μ ) > − ∞$. Fix a number $t > 0$ and consider the family of functions $ψ δ = ψ + t χ δ$, where $χ δ$ is the characteristic function of the set $A δ$. When $δ → 0$, we have
$\nu[\psi_\delta] - \lambda(\psi_\delta, \mu) \ge \nu[\psi] + t\varepsilon - \lambda(\psi_\delta, \mu) \longrightarrow \nu[\psi] + t\varepsilon - \lambda(\psi, \mu). \tag{42}$
Since t is arbitrary, (41) and (42) imply that $ρ ( ν , μ ) = + ∞$.
Now assume that (b) is valid. Consider any function $ψ ∈ B ¯ ( X )$ such that $ν [ ψ ] > − ∞$. Define the sets $A n = { x ∈ X ∣ ψ ( x ) ≥ − n }$. The condition $ν [ ψ ] > − ∞$ implies $ν ( A n ) → 1$. Since the measure ν is improper, it follows that $μ ( A n ) = + ∞$ for all large enough n. Then,
$\lambda(\psi, \mu) = \ln \int_X e^{\psi} \, d\mu \ge \ln \int_{A_n} e^{-n} \, d\mu = -n + \ln \mu(A_n) = +\infty,$
and hence
$\rho(\nu,\mu) = \sup_{\psi \in \bar{B}(X)} \bigl\{\, \nu[\psi] - \lambda(\psi,\mu) \,\bigr\} = -\infty.$
Recall that if μ is finite, then ν is proper, and hence alternative (b) cannot take place. In addition, for finite μ one has $ρ ( μ , μ ) > − ∞$. Thus, (a) implies $ρ ( ν , μ ) = + ∞$. If ν is countably additive and has no density with respect to μ, then the first case of (36) takes place, according to which $ρ ( ν , μ ) = + ∞$ as well. ☐
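The divergence in case (b) can be illustrated numerically (a toy example, not part of the proof; μ and ψ are assumptions): take μ to be counting measure on $\{0, 1, 2, \dots\}$ and ψ a constant function. Then $A_n = \{ψ \ge -n\}$ is the whole space for large n, $μ ( A n ) = + ∞$, and the partial integrals of $e^{\psi}$ grow without bound, so $λ ( ψ , μ ) = + ∞$.

```python
import math

# Toy illustration of the divergence in case (b): mu = counting measure
# on {0,1,2,...} and psi = -3 (bounded above).  The set A_n = {psi >= -n}
# is the whole space for n >= 3 and has mu(A_n) = +infinity, so the
# partial sums of exp(psi) dmu diverge and lambda(psi, mu) = +infinity.
psi = lambda x: -3.0

def partial_log_integral(N):
    """ln of the integral of exp(psi) over {0,...,N-1} w.r.t. counting measure."""
    return math.log(sum(math.exp(psi(x)) for x in range(N)))

vals = [partial_log_integral(N) for N in (10, 100, 1000, 10000)]
# Each value equals -3 + ln(N): the sequence increases without bound.
assert all(a < b for a, b in zip(vals, vals[1:]))
assert abs(vals[-1] - (-3 + math.log(10000))) < 1e-9
```

Since every ψ with $ν [ ψ ] > − ∞$ behaves this way under an improper ν, each term $ν [ ψ ] − λ ( ψ , μ )$ equals $− ∞$, matching the conclusion of the proof.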

## 10. Proof of Theorems 7 and 8

The proof of the first part of Theorem 7 is exactly the same as that of the first part of Theorem 3, so we omit it. If $ν ∈ M 1 ( X )$, then the second part of Theorem 7 follows from the second part of Theorem 3. Thus, it remains to consider the case $ν ∈ N 1 ( X ) ∖ M 1 ( X )$.
Let $B$ be some σ-field of subsets of X. We will call it discrete if it is generated by a countable or finite partition of X.
Lemma 10.
For any measure $ν ∈ N 1 ( X )$ and any fine neighborhood $O ( ν )$ of it, there exists a discrete σ-subfield $B ⊂ A$ such that
(a)
the restriction of ν to $B$ is countably additive;
(b)
there exists a fine neighborhood $O ′ ( ν ) ⊂ O ( ν )$ generated by $B$-measurable functions;
(c)
if the measure ν is proper with respect to $μ ∈ M σ ( X )$, then the σ-field $B$ mentioned above can be chosen in such a way that each of its atoms has a finite measure μ.
Proof.
A base for the fine topology on $N 1 ( X )$ is formed by the neighborhoods
$O(\nu) = \bigl\{\, \delta \in N_1(X) \mid |\delta[f_i] - \nu[f_i]| < \varepsilon,\ i = 1, \dots, m \,\bigr\},$
where $f i$ are measurable nonnegative functions on $( X , A )$ with $ν [ f i ] < + ∞$. Let us prove the Lemma for a neighborhood of this sort.
Define the step-functions $g i = ε [ f i / ε ]$, where $[ · ]$ denotes the integer part of a number, and the neighborhood
$O'(\nu) = \bigl\{\, \delta \in N_1(X) \mid |\delta[g_i] - \nu[g_i]| < \varepsilon,\ i = 1, \dots, m \,\bigr\}.$
Evidently, $| g i − f i | ≤ ε$, and, for each $δ ∈ O ′ ( ν )$, we have
$|\delta[f_i] - \nu[f_i]| \le |\delta[f_i] - \delta[g_i]| + |\delta[g_i] - \nu[g_i]| + |\nu[g_i] - \nu[f_i]| < 3\varepsilon.$
It follows that $O ′ ( ν ) ⊂ O ( ν )$.
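The 3ε-argument above can be checked numerically. The following sketch uses toy data (a single assumed function $f(x) = \sqrt{x}$ on $X = \{0, \dots, 99\}$, a uniform ν, and a slightly perturbed δ, all hypothetical): f is discretized to $g = ε [ f / ε ]$, and the triangle inequality then bounds the difference of the f-integrals by 3ε.

```python
import math
import random

# Numerical check of the 3*eps argument (toy example, not from the paper):
# X = {0,...,99}, a single nonnegative function f, its step version
# g = eps * floor(f / eps), a uniform "nu" and a slightly perturbed "delta".
random.seed(0)
eps = 0.1
X = range(100)
f = [math.sqrt(x) for x in X]
g = [eps * math.floor(fx / eps) for fx in f]
# |f - g| <= eps pointwise (tiny floating-point slack)
assert all(-1e-12 <= fx - gx <= eps + 1e-12 for fx, gx in zip(f, g))

def integral(h, m):
    """Integral of h against the probability vector m."""
    return sum(hx * mx for hx, mx in zip(h, m))

nu = [1.0 / 100] * 100                        # uniform "nu"
w = [1.0 + 0.001 * random.random() for _ in X]
delta = [wx / sum(w) for wx in w]             # small perturbation of nu

# delta lies in the neighborhood O'(nu) defined through the step function g...
diff_g = abs(integral(g, delta) - integral(g, nu))
assert diff_g < eps
# ...hence the triangle inequality bounds the f-integrals by 3*eps:
assert abs(integral(f, delta) - integral(f, nu)) < 3 * eps
```

The same computation with several functions $f_1, \dots, f_m$ reproduces the inclusion $O'(\nu) \subset O(\nu)$ componentwise.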
To each integer vector $k = ( k 1 , ⋯ , k m ) ∈ Z m$, assign the set
$X_k = \bigl\{\, x \in X \mid g_i(x) = k_i \varepsilon,\ i = 1, \dots, m \,\bigr\}.$
These sets form a countable measurable partition of X and generate the desired discrete σ-subfield $B$. The functions $g i$ are $B$-measurable.
Note that, for any $C > 0$, we have
$\nu \{ x \in X \mid g_i(x) \ge C \} \le \frac{\nu[g_i]}{C}.$
Thus, when C goes to $+ ∞$,
$\sum_{k_i \le C,\ i \le m} \nu(X_k) \ge 1 - \sum_{i=1}^{m} \nu \{ x \in X \mid g_i(x) \ge C\varepsilon \} \ge 1 - \sum_{i=1}^{m} \frac{\nu[g_i]}{C\varepsilon} \longrightarrow 1.$
It follows that the restriction of ν to the σ-field $B$ is countably additive.
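The tail estimate used here is Markov's inequality. A quick numerical check (toy data: an assumed geometric ν and the hypothetical function $g(x) = x$) confirms both the inequality and the resulting convergence of the partial mass to 1:

```python
# Markov-inequality check (toy example, not from the paper): nu =
# geometric(1/2) on {0,1,2,...} truncated at N, g(x) = x nonnegative.
# The bound nu{g >= C} <= nu[g] / C drives the tail mass to 0.
p, N = 0.5, 2000
weight = [(1 - p) * p**x for x in range(N)]   # nu({x})
g = list(range(N))

nu_g = sum(gx * wx for gx, wx in zip(g, weight))   # nu[g], approximately 1
for C in (2, 10, 50):
    tail = sum(wx for gx, wx in zip(g, weight) if gx >= C)
    assert tail <= nu_g / C                        # Markov's inequality
    # the mass of {g < C} therefore exceeds 1 - nu[g]/C and tends to 1
    assert sum(wx for gx, wx in zip(g, weight) if gx < C) >= 1 - nu_g / C - 1e-12
```

Summed over the finitely many functions $g_1, \dots, g_m$, this is exactly the estimate that yields the countable additivity of the restriction of ν to $B$.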
Assume that the measure ν is proper with respect to μ. In this case, we can construct a countable partition of X into subsets $Y l ∈ A$ such that $μ ( Y l ) < + ∞$ and $ν ( Y 1 ⊔ ⋯ ⊔ Y l ) → 1$ as $l → + ∞$. The latter condition implies the equality $ν ( X k ) = ∑ l ν ( X k ∩ Y l )$. Therefore, the restriction of ν to the σ-field generated by the atoms $X k ∩ Y l$ is countably additive. This σ-field may be treated as $B$. By construction, its atoms have finite measure μ. ☐
Let us finish the proof of Theorem 7. It remains to obtain estimate (39) for $ν ∈ N 1 ( X ) ∖ M 1 ( X )$. In this situation, the measure ν has no density with respect to μ, and, according to Theorem 6, we have the alternative: either $ρ ( ν , μ ) = + ∞$ or $ρ ( ν , μ ) = − ∞$. In the first case, estimate (39) is trivial. Thus, it is enough to consider the second case $ρ ( ν , μ ) = − ∞$.
Suppose the measure ν is proper with respect to μ and $ρ ( ν , μ ) = − ∞$. We can apply Lemma 10 to ν and construct the corresponding discrete σ-subfield $B ⊂ A$ and fine neighborhood $O ′ ( ν ) ⊂ O ( ν )$. Denote by $ν ¯$ and $μ ¯$ the restrictions of ν and μ to $B$. By Lemma 10, they are countably additive. From definition (36), it follows that if $μ ¯ ( A ) = 0$ for some $A ∈ B$, then $ν ¯ ( A ) = 0$ as well (since otherwise $ρ ( ν , μ ) = + ∞$). Thus, the distribution $ν ¯$ on $B$ is absolutely continuous with respect to $μ ¯$.
Recall that by definition,
$\rho(\nu,\mu) = \sup_{\psi \in \bar{B}(X)} \bigl\{\, \nu[\psi] - \lambda(\psi,\mu) \,\bigr\},$
where $B ¯ ( X )$ is the set of all bounded above $A$-measurable functions. The same is true for all bounded above $B$-measurable functions, and hence $ρ ( ν ¯ , μ ¯ ) = − ∞$ as well. Since $ν ¯$ is absolutely continuous with respect to $μ ¯$, the second part of Theorem 7 for $ν ¯$ and $μ ¯$ is already proven. It implies the estimate
$\mu^n \bigl\{\, x \in X^n \mid \delta_{x,n} \in O'(\nu) \,\bigr\} \ge e^{n/\varepsilon}$
for all large enough n. Due to the inclusion $O ′ ( ν ) ⊂ O ( ν )$, we obtain (39).
Consider the case of improper ν. We can apply Lemma 10 and construct the corresponding discrete σ-subfield $B$ and a fine neighborhood $O ′ ( ν ) ⊂ O ( ν )$ generated by $B$-measurable functions. The σ-field $B$ is generated by a certain countable partition $X = X 1 ⊔ X 2 ⊔ X 3 ⊔ ⋯$. Renumber the sets $X i$ so that $ν ( X 1 ) ≥ ν ( X 2 ) ≥ ν ( X 3 ) ≥ ⋯$. Put $Y k = X 1 ⊔ X 2 ⊔ ⋯ ⊔ X k$ and denote by $ν k$ the conditional distribution of ν on $Y k$. Due to the countable additivity, $ν ( Y k ) → 1$ and $ν k ∈ O ′ ( ν )$ for all large enough k. In addition, the improperness of ν implies that $μ ( Y k ) = + ∞$ for all large enough k.
Fix k so large that $ν k ∈ O ′ ( ν )$ and, at the same time, $μ ( Y k ) = + ∞$. The latter implies $μ ( X i ) = + ∞$ for at least one $i ≤ k$. Without loss of generality, we may assume that $ν k ( X i ) > 0$ for all $i ≤ k$. Obviously, for any large enough n, there exists a sequence $y = ( y 1 , ⋯ , y n ) ∈ Y k n$ such that the empirical measure $δ y , n$ is so close to $ν k$ that $δ y , n ∈ O ′ ( ν )$ and each of the sets $X 1 , ⋯ , X k$ contains at least one of the points $y 1 , ⋯ , y n$. Define positive integers $i j$ in such a way that $y j ∈ X i j$ for $j = 1 , ⋯ , n$. Then, $μ ( X i j ) > 0$ for all $j = 1 , ⋯ , n$ (since otherwise $ρ ( ν , μ ) = + ∞$) and $μ ( X i j ) = + ∞$ for at least one j. Therefore,
$\mu^n \bigl\{\, x \in X^n \mid \delta_{x,n} \in O'(\nu) \,\bigr\} \ge \mu(X_{i_1}) \cdots \mu(X_{i_n}) = +\infty,$
and thereby estimate (39) is completely proven. ☐
Proof of Theorem 8.
If $ν ∈ M 1 ( X )$, then the assertion of Theorem 8 follows from Theorem 5.
Let $ν ∈ N 1 ( X ) ∖ M 1 ( X )$. Then, ν is not absolutely continuous with respect to μ.
Since ν is proper, by Lemma 9, there exists an $ε 0 > 0$ such that, for any positive integer n, there exists $A n ∈ A$ satisfying $μ ( A n ) < 2 − n$ and $ν ( A n ) ≥ ε 0$. Set $X 0 = ⋃ n A n$.
Suppose a set $Y ∈ A$ with $μ ( Y ) < + ∞$ contains $X 0$. Then, the conditional distribution $ν Y$ is not absolutely continuous with respect to μ. On the other hand, (36) and the conditions $μ ( Y ) < + ∞$ and $ν Y ( X ∖ Y ) = 0$ imply the inequality $ρ ( ν Y , μ ) > − ∞$. Hence, $ρ ( ν Y , μ ) = + ∞$ by Theorem 6. In this case, estimate (40) follows from estimate (6) of Theorem 1. ☐

## Author Contributions

The general idea of the research was suggested by Victor Bakhtin. Theorems 2, 3, 4, and 5 were obtained jointly by both authors. Theorems 6, 7, and 8, together with their proofs, are due to Victor Bakhtin. Both authors have read and approved the final manuscript.

## Conflicts of Interest

The authors declare no conflict of interest.

