This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Divergence is a discrepancy measure between two objects, such as functions, vectors, matrices, and so forth. In particular, divergences defined on probability distributions are widely employed in probabilistic forecasting. As a dissimilarity measure, a divergence should satisfy certain conditions. In this paper, we consider two: the first is the scale-invariance property, and the second is that the divergence can be approximated by the sample mean of a loss function. The first requirement is an important feature for dissimilarity measures: in general, a divergence depends on which system of measurements is used to measure the objects, and a scale-invariant divergence transforms in a consistent way when one system of measurements is replaced with another. The second requirement is formalized by expressing the divergence in terms of a so-called composite score. We study the relation between composite scores and scale-invariant divergences, and we propose a new class of divergences, called Hölder divergences, that satisfies the two conditions above. We present some theoretical properties of Hölder divergences and show that they unify existing divergences from the viewpoint of scale-invariance.

Nowadays, divergence measures are ubiquitous in the information sciences. A divergence is a discrepancy measure between two objects, such as functions, vectors, matrices, and so forth. In particular, divergences defined on the set of probability distributions are widely used for probabilistic forecasting, such as weather and climate prediction [

Dissimilarity measures for statistical inference should satisfy some conditions. In this paper, we focus on two. The first is the scale-invariance property, and the second is that the divergence should be representable using a so-called composite score [

The first requirement is scale-invariance. When a divergence is used to measure the dissimilarity between two objects, its value depends on the system of measurements used to measure those objects. A scale-invariant divergence has the favorable property of transforming in a consistent way when one system of measurements is replaced with another. For example, the measured distance between two objects depends on the unit of length, and measured values in different units are typically converted to one another by multiplying by an appropriate positive constant. The Kullback-Leibler (KL) divergence, one of the most popular divergences, has the scale-invariance property with respect to the measurement of training samples [
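As a small numerical illustration of this point (ours, not taken from the paper; the densities and units are arbitrary), the KL divergence between two densities is unchanged when the unit of measurement is rescaled:

```python
import numpy as np

def kl(f, g, dx):
    # KL divergence by Riemann-sum integration on a uniform grid
    return np.sum(f * np.log(f / g)) * dx

norm = lambda x, m, s: np.exp(-(x - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

# Two Gaussian densities of a length measured, say, in metres
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
f = norm(x, 0.0, 1.0)
g = norm(x, 1.0, 1.5)

# Change the unit of length: u = sigma * x (e.g. metres to centimetres);
# the densities transform accordingly
sigma = 100.0
u = sigma * x
du = u[1] - u[0]
f_u = norm(u, 0.0, sigma * 1.0)
g_u = norm(u, sigma * 1.0, sigma * 1.5)

# The KL divergence is invariant under the change of unit
assert abs(kl(f, g, dx) - kl(f_u, g_u, du)) < 1e-6
```

The value agrees with the closed form log(s2/s1) + (s1² + (m1 − m2)²)/(2 s2²) − 1/2 ≈ 0.3499 for these parameters.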

As the second requirement, dissimilarity measures should be expressible in the form of a composite score. This is a useful property when the divergence is employed for statistical inference of probability densities. When the divergence

In the present paper, we propose composite scores as an extension of scores, and study the relation between composite scores and scale-invariant divergences. We propose a new class of divergences, called Hölder divergences, defined through a class of composite scores. We show that Hölder divergences unify existing divergences from the viewpoint of scale-invariance. The Hölder divergence with the one-dimensional parameter

The remainder of the article is organized as follows: In Section 2, some basic notions such as divergence, scale-invariance and score are introduced. In Section 3, we propose the Hölder divergence. Some theoretical properties of the Hölder divergence are investigated in Section 4. In Section 5, we close the article with a discussion of the potential of the newly introduced divergences. Technical calculations and proofs are found in the Appendix.

Let us summarize the notations used throughout the paper. Let ℝ be the set of all real numbers, ℝ_{+} the set of all non-negative real numbers, and ℝ_{++} the set of all positive real numbers. For a real-valued function _{Ω}

In this section, we present definitions of some basic concepts.

Let us introduce scores and divergences. Below, positive-valued functions are defined on the compact set Ω. The

Let

The functional

Bregman divergence [

_{++},

The rigorous definition of

The remarkable property of the Bregman divergence is that the associated score

Though Bregman divergences form a popular class of divergences, the computation of the potential may be a hard task. The separable Bregman divergences are an important subclass of Bregman divergences. In many applications of statistical forecasting, separable Bregman divergences are used due to their computational tractability.

_{+}

Due to the convexity of

A typical example is the Itakura-Saito (IS) distance, D_{IS}(f, g) = ∫_Ω { f(x)/g(x) − log( f(x)/g(x) ) − 1 } dx, the separable Bregman divergence with the potential φ(z) = −log z.

f^{α} g^{1−α} is not linear in f. In the limit α → 0, the KL divergence, D_{KL}(f, g) = ∫_Ω f(x) log( f(x)/g(x) ) dx, is recovered. For the transformation f_{c}
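As an illustration of the separable construction (a sketch of ours; the grid and test functions are arbitrary), a separable Bregman divergence can be computed directly from its potential φ and derivative φ′, and the choices φ(z) = z log z and φ(z) = −log z yield the generalized KL divergence and the IS distance, respectively:

```python
import numpy as np

def separable_bregman(f, g, phi, dphi, dx):
    # D(f, g) = ∫ { φ(f) − φ(g) − φ'(g) (f − g) } dx
    return np.sum(phi(f) - phi(g) - dphi(g) * (f - g)) * dx

x = np.linspace(0.01, 5.0, 50000)
dx = x[1] - x[0]
f = np.exp(-x)                # unnormalized positive functions on a compact set
g = 0.5 * np.exp(-0.5 * x)

# φ(z) = z log z gives the generalized KL divergence
kl = separable_bregman(f, g, lambda z: z * np.log(z), lambda z: np.log(z) + 1.0, dx)

# φ(z) = -log z gives the Itakura-Saito distance
is_dist = separable_bregman(f, g, lambda z: -np.log(z), lambda z: -1.0 / z, dx)

# Convexity of the potential makes both divergences non-negative
assert kl > 0.0 and is_dist > 0.0
```

Expanding the integrand for each potential reproduces the familiar formulas f log(f/g) − f + g and f/g − log(f/g) − 1 term by term.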

Let us consider the scale-invariance of divergences. Suppose that

The support of _{a,σ}_{a,σ}

As a natural requirement, the divergence measure should not be essentially affected by systems of measurement. More concretely, the relative nearness between two densities should be preserved under the transformation

holds for any pair of densities. For example, the scale factor is σ^{−1} for the IS-distance and σ^{γ/(1+γ)} for the pseudo-spherical divergence.

The divergence

Generally, one cannot directly substitute the empirical probability density

^{2} and U and V are real-valued functions. The integrals

The composite score was introduced in [

Separable Bregman divergences are represented by using composite scores. Indeed, the separable Bregman divergence with the potential

Scale-invariant divergences defined from composite scores are useful for statistical inference. Suppose that _{1}_{,σ}_{1}_{,σ}
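As a concrete instance of this idea (our illustration, not a construction from the paper), take the log score S(x, g) = −log g(x), the score associated with the KL divergence: its sample mean approximates, up to a term not depending on g, the divergence from the data-generating density, so minimizing it over a model yields an estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.5, size=50000)

def empirical_log_score(sample, m, s):
    # Sample mean of S(x, g) = -log g(x) for g = N(m, s^2),
    # dropping the constant log sqrt(2*pi) term
    return np.mean((sample - m)**2 / (2.0 * s**2)) + np.log(s)

# Minimize the empirical score over a grid of Gaussian models
ms = np.linspace(0.0, 4.0, 81)
ss = np.linspace(0.5, 3.0, 51)
best_score, best_m, best_s = min(
    (empirical_log_score(sample, m, s), m, s) for m in ms for s in ss
)

# The empirical-score minimizer recovers the data-generating parameters
assert abs(best_m - 2.0) < 0.1 and abs(best_s - 1.5) < 0.1
```

The grid search is only a placeholder for a proper numerical optimizer; the point is that the objective is a plain sample mean of a loss function.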

Let us define an equivalence relation among scores. Two scores are said to be

In the following sections, we introduce the Hölder score, a class of composite scores with the scale-invariance property. Then, we investigate theoretical properties of the Hölder score.

Let us define a class of scale-invariant divergences expressed by composite scores. The divergence is called the Hölder divergence; the name comes from the fact that the Hölder inequality, or its reverse variant, is used to prove the non-negativity of the divergence. The Hölder divergence unifies existing divergences from the viewpoint of scale-invariance.
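The two inequalities underlying the non-negativity proof can be checked numerically. The sketch below (test functions and exponents are arbitrary) verifies the Hölder inequality ∫ f g dx ≤ (∫ f^p dx)^{1/p} (∫ g^q dx)^{1/q} for p > 1 and its reverse for 0 < p < 1, where q = p/(p − 1) is the conjugate exponent:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 10001)
dx = x[1] - x[0]
f = 1.0 + np.sin(3.0 * x)**2        # arbitrary positive functions on Ω = [0, 1]
g = np.exp(-x) + 0.2 * x

def holder_sides(f, g, p, dx):
    q = p / (p - 1.0)               # conjugate exponent: 1/p + 1/q = 1
    lhs = np.sum(f * g) * dx
    rhs = (np.sum(f**p) * dx)**(1.0 / p) * (np.sum(g**q) * dx)**(1.0 / q)
    return lhs, rhs

lhs1, rhs1 = holder_sides(f, g, 1.7, dx)   # p > 1: Hölder inequality
lhs2, rhs2 = holder_sides(f, g, 0.4, dx)   # 0 < p < 1: reverse Hölder (q < 0)

assert lhs1 <= rhs1    # ∫fg ≤ ||f||_p ||g||_q
assert lhs2 >= rhs2    # reverse direction for 0 < p < 1
```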

_{+} → ℝ

^{s}^{(1+ γ)}

The _{γ}_{γ}

_{γ}_{γ}

D_{0} and D_{−1} coincide with the KL-divergence and the IS distance, respectively. Hence, D_{γ}

For positive-valued functions

For

in which the first inequality comes from ^{1+}^{γ}_{γ}_{γ}

For γ ∈ (

in which the first inequality comes from ^{−}^{(1+}^{γ}^{)} and the second inequality is derived from the reverse Hölder inequality. The same argument in the case of _{γ}_{γ}

The Hölder divergences have the scale-invariance property. The following calculation is straightforward.

In addition, we have D_{0}(f_{a,σ}, g_{a,σ}) = D_{0}(f, g) and D_{−1}(f_{a,σ}, g_{a,σ}) = σ^{−1} D_{−1}(f, g); more generally, D_{γ}(f_{a,σ}, g_{a,σ}) is proportional to D_{γ}(f, g).
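The scale behavior of the two endpoints of the family, the KL divergence (invariant) and the IS distance (multiplied by σ^{−1}), can be checked numerically. The sketch below assumes the one-dimensional scaling convention f_{a,σ}(x) = σ f(σx + a) and uses arbitrary positive functions on Ω = [0, 1], with the generalized KL divergence playing the role of D_0 for unnormalized functions:

```python
import numpy as np

def kl_gen(f, g, dx):
    # generalized KL divergence for positive functions
    return np.sum(f * np.log(f / g) - f + g) * dx

def is_dist(f, g, dx):
    # Itakura-Saito distance
    r = f / g
    return np.sum(r - np.log(r) - 1.0) * dx

F = lambda t: 1.0 + np.sin(3.0 * t)**2   # positive functions on Ω = [0, 1]
G = lambda t: 0.5 + t

x = np.linspace(0.0, 1.0, 100001)
dx = x[1] - x[0]

# Scaled versions f_{a,sigma}(y) = sigma * f(sigma*y + a) on the mapped support
a, sigma = 1.0, 2.0
y = (x - a) / sigma
dy = y[1] - y[0]
Fs = sigma * F(sigma * y + a)
Gs = sigma * G(sigma * y + a)

# KL is invariant; the IS distance picks up the factor sigma^{-1}
assert abs(kl_gen(Fs, Gs, dy) - kl_gen(F(x), G(x), dx)) < 1e-6
assert abs(is_dist(Fs, Gs, dy) - is_dist(F(x), G(x), dx) / sigma) < 1e-6
```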

The class of Hölder divergences includes some popular divergences that are used in statistics and information theory. Some examples are shown below.

^{s}^{(1+}^{γ}^{)}, ^{s}^{(1+}^{γ}^{)}

_{γ,κ}

In this section, we present some theoretical properties of Hölder divergence.

Let us consider the conjugate relation among Hölder divergences. Firstly, we point out that the KL divergence and IS distance are related to each other by the equality,

D_{−1}(f, g) = D_{0}(1, f/g).
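In terms of the generalized KL divergence for positive functions, this relation reads D_IS(f, g) = D_KL(1, f/g), and it can be verified numerically (a sketch; the functions are arbitrary):

```python
import numpy as np

def kl_gen(f, g, dx):
    # generalized KL divergence: ∫ f log(f/g) - f + g dx
    return np.sum(f * np.log(f / g) - f + g) * dx

def is_dist(f, g, dx):
    # Itakura-Saito distance: ∫ f/g - log(f/g) - 1 dx
    r = f / g
    return np.sum(r - np.log(r) - 1.0) * dx

x = np.linspace(0.0, 1.0, 100001)
dx = x[1] - x[0]
f = 1.0 + x**2              # arbitrary positive functions on [0, 1]
g = np.exp(x)

ones = np.ones_like(x)
# IS distance equals the generalized KL divergence from 1 to the ratio f/g
assert abs(is_dist(f, g, dx) - kl_gen(ones, f / g, dx)) < 1e-10
```

Expanding kl_gen(1, f/g) gives ∫ { f/g − log(f/g) − 1 } dx, which is exactly the IS distance.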

Suppose ^{∗}^{∗}^{s}φ^{∗}^{∗}^{∗}^{s}^{(1+}^{γ}^{)} guarantees the inequality ^{∗}^{s∗}^{(1+}^{γ∗}^{)}. It is straightforward to confirm that the equality

or equivalently,

holds. Let

Then, D_{−1} is represented by the KL-divergence D_{0}; on the other hand, the corresponding representation of D_{0} in terms of D_{−1} is not properly defined.

Since the Hölder divergence _{γ}

_{γ}_{γ}_{γ}

_{γ} is equivalent to the score S that induces the Bregman divergence D

_{γ} is equivalent to the score S that induces the separable Bregman divergence D

For

Amari [

In Section 3, we showed that the Hölder divergence is defined from a composite score and has the scale-invariance property. Conversely, we now show that these properties characterize the class of Hölder divergences. Some technical assumptions are introduced below.

^{2}

We use the following lemmas to prove Theorem 4.

^{γ}^{1+}^{γ} for γ ≠

The proof of Lemma 1 is shown in Lemma C.1 of [

For the probability densities ^{d}

Separable Bregman divergences are derived from composite scores. Hence, we obtain the following result.

Different invariance properties provide different divergences. Indeed, invariance under any invertible and differentiable data transformation leads to the Csiszár

We proposed the Hölder divergence, defined from a composite score, and showed that the Hölder divergence has the scale-invariance property. In addition, we proved that a composite score satisfying the scale-invariance property leads to the Hölder divergence. The Hölder divergence is determined by a real number

We presented a method of constructing scale-invariant divergences from the (reverse) Hölder inequality. This is a new approach to introducing a class of divergences. We expect that the new class of divergences opens up new applications in the field of information sciences.

Let

holds for the Hölder score _{γ}

By differentiating both sides twice with respect to

The solution of the differential equation is given by c_{0} + c_{1}z^{α}, where c_{0}, c_{1},

Since _{γ}

holds, where ^{(1+}^{γ}^{)}^{s}

should hold. Let

holds for all ^{s}_{1} > 0 for _{1} < 0 for 0 _{1}^{1+}^{γ}

Suppose

Let us consider the transformation _{a,}_{1}(

where _{x∈}_{[0}_{,}_{1]}

Therefore, we obtain

Some algebra yields that the above equation is expressed as

for any function υ(·), where c_{0}, c_{1} and c_{2} are constants. Hence, we have

for any positive-valued function

for ^{γ}^{1+}^{γ}^{γ}_{1}_{,σ}

For any pair of positive functions

Remember that

In the above proof, the identity function ^{z}

^{γ}^{1+}^{γ} with γ_{1}_{,σ}^{s}_{+}

We prove that the above

Let us consider the sign of

holds. Let

for _{0} with

should hold. As a result, we have

We prove that
^{(1+}^{γ}^{)}^{s}_{0} > 0 such that _{0}) _{0}^{(1+}^{γ}^{)}^{s}

holds. This is possible by choosing, say, _{0} for some

in which

For ^{−}^{(1+}^{γ}^{)} _{γ}_{γ}^{1+}^{γ}/φ

In the same way, for ^{1+}^{γ} ≥_{γ}_{γ}^{−}^{(1+}^{γ}^{)}

Takafumi Kanamori was partially supported by JSPS KAKENHI Grant Number 24500340.

The authors declare no conflict of interest.