Scale-Invariant Divergences for Density Functions

A divergence is a discrepancy measure between two objects, such as functions, vectors, or matrices. In particular, divergences defined on probability distributions are widely employed in probabilistic forecasting. As a dissimilarity measure, a divergence should satisfy some conditions. In this paper, we consider two conditions: the first is the scale-invariance property, and the second is that the divergence is approximated by the sample mean of a loss function. The first requirement is an important feature for dissimilarity measures: the value of a divergence depends on the system of measurement used to measure the objects, and a scale-invariant divergence is transformed in a consistent way when the system of measurement is changed. The second requirement is formalized by requiring that the divergence be expressed using the so-called composite score. We study the relation between composite scores and scale-invariant divergences, and we propose a new class of divergences, called the Hölder divergence, that satisfies the two conditions above. We present some theoretical properties of the Hölder divergence and show that it unifies existing divergences from the viewpoint of scale-invariance.


Introduction
Nowadays, divergence measures are ubiquitous in the field of information sciences. A divergence is a discrepancy measure between two objects, such as functions, vectors, or matrices. In particular, divergences defined on the set of probability distributions are widely used for probabilistic forecasting, such as weather and climate prediction [1,2], computational finance [3], and so forth. In many statistical inferences, statistical models are prepared to estimate the probability distribution generating the observed samples. A divergence measure between the true probability and the statistical model is estimated based on the observed samples, and the probability distribution in the statistical model that minimizes the divergence measure is chosen as the estimator. A typical example is the maximum likelihood estimator based on the Kullback-Leibler divergence [4].
Dissimilarity measures for statistical inference should satisfy some conditions. In this paper, we focus on two conditions. The first one is the scale-invariance property, and the second one is that the divergence should be represented by using the so-called composite score [5], which is an extension of scores [6].
The first requirement is the scale-invariance. Suppose that the divergence is used to measure the dissimilarity between two objects; then the divergence will depend on the system of measurement used to measure the objects. A scale-invariant divergence has the favorable property that it is transformed in a consistent way when the system of measurement is changed. For example, the measured value between two objects depends on the unit of length. Typically, measured values in different units are transformed to each other by multiplying by an appropriate positive constant. The Kullback-Leibler divergence, which is one of the most popular divergences, has the scale-invariance property for the measurement of training samples [7].
As the second requirement, dissimilarity measures should be expressed in the form of composite scores. This is a useful property when the divergence is employed for the statistical inference of probability densities. When the divergence D(f, g) is calculated through the expectation with respect to the probability density f, the sample mean over the observations approximates the divergence. The score [2,5,6,8–10] is the class of dissimilarity measures that are calculated through the sample mean of the observed data. The characterization of the score is studied in [6,10], where the deep connection between scores and divergences was revealed.
In the present paper, we propose composite scores as an extension of scores, and study the relation between composite scores and scale-invariant divergences. We propose a new class of divergences, called the Hölder divergence, that is defined through a class of composite scores. We show that the Hölder divergence unifies existing divergences from the viewpoint of scale-invariance. The Hölder divergence with the one-dimensional parameter γ is defined from a function φ. The Hölder divergence with non-negative γ was proposed in [5]. Here, we extend the previous result to any real number γ.
The remainder of the article is organized as follows: In Section 2, some basic notions such as divergence, scale-invariance and score are introduced. In Section 3, we propose the Hölder divergence. Some theoretical properties of the Hölder divergence are investigated in Section 4. In Section 5, we close this article with a discussion of the possibilities of the newly introduced divergences. Technical calculations and proofs are found in the appendix.
Let us summarize the notations to be used throughout the paper. Let R be the set of all real numbers, R_+ be the set of all non-negative real numbers, and R_++ be the set of all positive real numbers. For a real-valued function f : Ω → R defined on a domain Ω in the Euclidean space, let ⟨f⟩ be the integral
$$\langle f \rangle = \int_{\Omega} f(x)\,dx.$$
In most arguments of the current paper, Ω is the closed interval [0, 1] in R. Extension of the theoretical results to any compact set in the multi-dimensional Euclidean space is straightforward.

Preliminaries
In this section, we show definitions of some basic concepts.

Divergences and Scores
Let us introduce scores and divergences. Below, positive-valued functions are defined on the compact set Ω. The score is defined as the real-valued functional S(f, g) in which f(x) and g(x) are positive-valued functions on Ω.
Let D(f, g) be
$$D(f, g) = S(f, g) - S(f, f).$$
The functional D(f, g) is called the divergence if D(f, g) ≥ 0 holds with equality if and only if f = g. Suppose that the score S(f, g) induces a divergence. Then, clearly, the score should satisfy S(f, g) ≥ S(f, f) with equality only when f = g. The divergence does not necessarily satisfy the definition of a distance, because neither the symmetry nor the triangle inequality holds in general.
Bregman divergence [11] and Csiszár ϕ-divergence [12,13] are important classes of divergences. Here we focus on the Bregman divergence, since it is frequently employed in various statistical inferences. See [6,11] for details.
Definition 1 (Bregman divergence; Bregman score). For a positive-valued function f : Ω → R_++, let G(f) be a strictly convex functional and G*_f(x) be the functional derivative of G at f, i.e., G*_f(x) is determined from the equality
$$\lim_{\varepsilon \to 0} \frac{G(f + \varepsilon h) - G(f)}{\varepsilon} = \langle G^{*}_{f}\, h \rangle$$
for any h such that f + εh is a positive-valued function for sufficiently small ε. Then, the Bregman divergence is defined as
$$D(f, g) = G(f) - G(g) - \langle G^{*}_{g}\,(f - g) \rangle.$$
The score associated with the Bregman divergence is called the Bregman score, which is defined as
$$S(f, g) = -G(g) - \langle G^{*}_{g}\,(f - g) \rangle.$$
The functional G is referred to as the potential of the Bregman divergence, and it satisfies the equality G(f) = −S(f, f).
The rigorous definition of G*_f requires the dual space of a Banach space. See [14] (Chapter 4) for sufficient conditions for the existence of G*_f. To avoid technical difficulties, we assume the existence of the functional derivative in the above definition.
The remarkable property of the Bregman divergence is that the associated score S(f, g) is a linear function of f. This is a nice property for statistical inference, since one can substitute the empirical distribution directly into f. In other words, the sample-based approximation of the Bregman score is obtained by the sample mean of a function depending on the model g. For this reason, Bregman divergences have a wide range of applications in statistics, machine learning, data mining, and so forth [15][16][17].
Though the Bregman divergence is a popular class of divergences, the computation of the potential may be a hard task. The separable Bregman divergence is an important subclass of Bregman divergences. In many applications of statistical forecasting, separable Bregman divergences are used due to their computational tractability.
Definition 2 (separable Bregman divergence). Let J : R_+ → R be a strictly convex differentiable function. The separable Bregman score is defined as
$$S(f, g) = -\langle J(g) + J'(g)(f - g) \rangle,$$
where J′ is the derivative of J. The separable Bregman divergence is
$$D(f, g) = \langle J(f) - J(g) - J'(g)(f - g) \rangle.$$
The potential of the separable Bregman divergence is G(f) = ⟨J(f)⟩.
Due to the convexity of J, the non-negativity of the separable Bregman divergence is guaranteed. Moreover, the strict convexity of J ensures that the equality D(f, g) = 0 holds only if f = g. Some examples of divergences are shown below.
Example 1 (Kullback-Leibler divergence). One of the most popular divergences in information sciences is the Kullback-Leibler (KL) divergence [4]. Let us define the KL score for positive-valued functions f and g as
$$S_{\mathrm{KL}}(f, g) = \langle -f \log g + g \rangle.$$
The associated divergence is called the KL divergence, which is defined as
$$D_{\mathrm{KL}}(f, g) = \left\langle f \log \frac{f}{g} - f + g \right\rangle.$$
This is represented as the separable Bregman divergence with the potential G(f) = ⟨f log f − f⟩.
Example 2 (Itakura-Saito distance). The Itakura-Saito (IS) distance was originally used to measure the dissimilarity between two power spectrum densities [18]. Though the IS distance does not satisfy the mathematical conditions of a distance, the term "distance" is conventionally used. For positive-valued functions f, g on Ω, the IS score is defined as
$$S_{\mathrm{IS}}(f, g) = \left\langle \frac{f}{g} + \log g \right\rangle,$$
and the IS distance is defined as
$$D_{\mathrm{IS}}(f, g) = \left\langle \frac{f}{g} - \log \frac{f}{g} - 1 \right\rangle.$$
The non-negativity of D_IS(f, g) is guaranteed by the inequality z − log z − 1 ≥ 0 for z > 0. The IS distance is the separable Bregman divergence with the potential G(f) = ⟨−log f + 1⟩. The IS distance is scale-invariant, i.e., D_IS(af, ag) = D_IS(f, g) holds for any positive real number a. This invariance ensures that the low energy components have the same relative importance as the high energy ones. This is especially important in short-term audio spectra [19,20].
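As a quick numerical illustration of Examples 1 and 2, the following sketch approximates the KL divergence ⟨f log(f/g) − f + g⟩ and the IS distance ⟨f/g − log(f/g) − 1⟩ on Ω = [0, 1] with a midpoint rule, and compares their values before and after the rescaling (f, g) → (af, ag). The test functions f and g, the helper `integrate` and the grid size are illustrative assumptions, not taken from the paper.

```python
import math

def integrate(h, lo=0.0, hi=1.0, n=2000):
    """Midpoint-rule approximation of the integral of h over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(h(lo + (i + 0.5) * step) for i in range(n))

def kl_divergence(f, g):
    """<f log(f/g) - f + g> on [0, 1]."""
    return integrate(lambda x: f(x) * math.log(f(x) / g(x)) - f(x) + g(x))

def is_distance(f, g):
    """<f/g - log(f/g) - 1> on [0, 1]."""
    return integrate(lambda x: f(x) / g(x) - math.log(f(x) / g(x)) - 1.0)

# Illustrative positive-valued functions on [0, 1] (hypothetical choices).
f = lambda x: 1.0 + x
g = lambda x: 2.0 - x

a = 3.0  # arbitrary positive rescaling of the function values
af = lambda x: a * f(x)
ag = lambda x: a * g(x)

# The IS distance is unchanged by the rescaling; the KL divergence is multiplied by a.
print("KL:", kl_divergence(f, g), "after scaling:", kl_divergence(af, ag))
print("IS:", is_distance(f, g), "after scaling:", is_distance(af, ag))
```

The same helper can be reused for any integrand on [0, 1]; a finer grid only matters for functions that vary rapidly.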
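Example 3 (density-power divergence). The density-power divergence is a one-parameter extension of the KL-divergence. The density-power score is defined as
$$S^{(\gamma)}_{\mathrm{power}}(f, g) = \frac{1}{1+\gamma} \langle g^{1+\gamma} \rangle - \frac{1}{\gamma} \langle f g^{\gamma} \rangle$$
for γ ∈ R \ {0, −1}, and the density-power divergence is defined as
$$D^{(\gamma)}_{\mathrm{power}}(f, g) = \frac{1}{1+\gamma} \langle g^{1+\gamma} \rangle - \frac{1}{\gamma} \langle f g^{\gamma} \rangle + \frac{1}{\gamma(1+\gamma)} \langle f^{1+\gamma} \rangle.$$
The density-power divergence is employed in robust parameter estimation [21,22]. The limit γ → 0 of the density-power divergence yields the KL-divergence, and the limit γ → −1 yields the IS-distance.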
Though the density-power divergence was originally defined for positive γ [21], the above definition works for any real number γ other than 0 and −1. The density-power divergence is expressed as the separable Bregman divergence with the potential
$$G(f) = \frac{1}{\gamma(1+\gamma)} \langle f^{1+\gamma} \rangle.$$
Example 4 (pseudo-spherical divergence; γ divergence). The pseudo-spherical divergence [6,23] is defined as
$$D^{(\gamma)}_{\mathrm{sphere}}(f, g) = \frac{1}{\gamma} \langle f^{1+\gamma} \rangle^{1/(1+\gamma)} - \frac{1}{\gamma} \frac{\langle f g^{\gamma} \rangle}{\langle g^{1+\gamma} \rangle^{\gamma/(1+\gamma)}},$$
which is derived from the pseudo-spherical score
$$S^{(\gamma)}_{\mathrm{sphere}}(f, g) = -\frac{1}{\gamma} \frac{\langle f g^{\gamma} \rangle}{\langle g^{1+\gamma} \rangle^{\gamma/(1+\gamma)}}.$$
The pseudo-spherical divergence does not satisfy the definition of the divergence in the present paper, since D^(γ)_sphere(f, g) = 0 holds for linearly dependent functions f and g. On the set of probability density functions, however, the equality D^(γ)_sphere(p, q) = 0 leads to p = q. Thus, the pseudo-spherical divergence is still useful in statistical inference, though it is not a divergence on the set of positive-valued functions. The γ divergence [24] is defined as −log(−S^(γ)_sphere(f, g)) + log(−S^(γ)_sphere(f, f)), and the first term of the γ divergence is used for robust parameter estimation. The pseudo-spherical divergence is represented as the non-separable Bregman divergence with the potential G(f) = (1/γ)⟨f^{1+γ}⟩^{1/(1+γ)}. This potential is strictly convex on the set of probability densities. The parameter γ can take both positive and negative real values.
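The two limits stated for the density-power divergence of Example 3 can be checked numerically. The sketch below assumes the normalization D(f, g) = ⟨g^{1+γ}⟩/(1+γ) − ⟨f g^γ⟩/γ + ⟨f^{1+γ}⟩/(γ(1+γ)), which is the separable Bregman divergence generated by J(z) = z^{1+γ}/(γ(1+γ)); this normalization, the test functions and the helper names are assumptions made for illustration.

```python
import math

def integrate(h, lo=0.0, hi=1.0, n=2000):
    """Midpoint-rule approximation of the integral of h over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(h(lo + (i + 0.5) * step) for i in range(n))

def density_power(f, g, gamma):
    """Assumed form: <g^(1+gamma)/(1+gamma) - f g^gamma/gamma + f^(1+gamma)/(gamma (1+gamma))>."""
    def integrand(x):
        fx, gx = f(x), g(x)
        return (gx ** (1.0 + gamma) / (1.0 + gamma)
                - fx * gx ** gamma / gamma
                + fx ** (1.0 + gamma) / (gamma * (1.0 + gamma)))
    return integrate(integrand)

def kl_divergence(f, g):
    return integrate(lambda x: f(x) * math.log(f(x) / g(x)) - f(x) + g(x))

def is_distance(f, g):
    return integrate(lambda x: f(x) / g(x) - math.log(f(x) / g(x)) - 1.0)

# Illustrative positive-valued functions on [0, 1] (hypothetical choices).
f = lambda x: 1.0 + x
g = lambda x: 2.0 - x

eps = 1e-4  # small offset from the excluded parameter values 0 and -1
near_kl = density_power(f, g, eps)         # gamma close to 0
near_is = density_power(f, g, -1.0 + eps)  # gamma close to -1

print(near_kl, kl_divergence(f, g))
print(near_is, is_distance(f, g))
```

The agreement is only up to a term of order `eps`, since the divergence is evaluated near, not at, the excluded parameter values.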
Example 5 (α-divergence). For positive-valued functions f, g, the α-divergence [25,26] is defined as
$$D_{\mathrm{alpha}}(f, g) = \frac{1}{\alpha(1-\alpha)} \langle \alpha f + (1-\alpha) g - f^{\alpha} g^{1-\alpha} \rangle$$
for α ∈ R \ {0, 1}. Generally, the α-divergence is not included in the class of Bregman divergences, since the term f^α g^{1−α} is not linear in f. The limits α → 1 and α → 0 yield the KL-divergences D_KL(f, g) and D_KL(g, f), respectively. The α-divergence has the following invariance property. Let c(x) be a one-to-one differentiable mapping from x ∈ R^d to c(x) ∈ R^d, and let c′(x) be its Jacobian matrix. For the transformation f(x) → f_c(x) = |det c′(x)| f(c(x)), the equality D_alpha(f_c, g_c) = D_alpha(f, g) holds. This invariance is a common property of Csiszár's ϕ-divergence [12,13].
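The limiting behaviour of the α-divergence can be illustrated numerically. The sketch below uses the standard form D_alpha(f, g) = ⟨αf + (1−α)g − f^α g^{1−α}⟩ / (α(1−α)) and evaluates it for α close to 1 and close to 0; the test functions, the offset `eps` and the helper names are illustrative assumptions.

```python
import math

def integrate(h, lo=0.0, hi=1.0, n=2000):
    """Midpoint-rule approximation of the integral of h over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(h(lo + (i + 0.5) * step) for i in range(n))

def alpha_divergence(f, g, alpha):
    """<alpha f + (1 - alpha) g - f^alpha g^(1 - alpha)> / (alpha (1 - alpha))."""
    total = integrate(lambda x: alpha * f(x) + (1.0 - alpha) * g(x)
                      - f(x) ** alpha * g(x) ** (1.0 - alpha))
    return total / (alpha * (1.0 - alpha))

def kl_divergence(f, g):
    return integrate(lambda x: f(x) * math.log(f(x) / g(x)) - f(x) + g(x))

# Illustrative positive-valued functions on [0, 1] (hypothetical choices).
f = lambda x: 1.0 + x
g = lambda x: 2.0 - x

eps = 1e-4
near_one = alpha_divergence(f, g, 1.0 - eps)   # approaches D_KL(f, g)
near_zero = alpha_divergence(f, g, eps)        # approaches D_KL(g, f)

print(near_one, kl_divergence(f, g))
print(near_zero, kl_divergence(g, f))
```

Swapping the arguments and replacing α by 1 − α leaves the value unchanged, which is why the two limits are the two orderings of the KL-divergence.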

Scale-Invariance of Divergences
Let us consider the scale-invariance of divergences. Suppose that f(x) is a density at the point x ∈ Ω = [0, 1]. Here, not only the probability density but also the mass density or spectrum density is considered. Hence, the density is not necessarily normalized, but it should define a finite measure. For density functions, the total mass does not change under the variable transformation of the coordinate x. In particular, for the scale-transformation x → y = x/σ with σ > 0, the density f(x) in the x-coordinate is transformed to σf(σy) in the y-coordinate. In addition, under the scale-transformation of the function value, the density f(x) is transformed to af(x) with some positive constant a. For the density function, we allow the combination of the above two transformations,
$$f_{a,\sigma}(x) = a\,\sigma f(\sigma x), \qquad a, \sigma > 0. \tag{1}$$
The support of f is also properly transformed to that of f_{a,σ}. The transformation Equation (1) is induced by changing the units of the system of measurement. On a multi-dimensional space, the density f_{a,σ}(x) with the positive constant a and the invertible matrix σ is defined as a|det σ|f(σx), in which det σ is the determinant of the matrix σ. In most arguments in the paper, the one-dimensional case is considered, since the extension to the multi-dimensional domain is straightforward.
As a natural requirement, the divergence measure should not be essentially affected by the system of measurement. More concretely, the relative nearness between two densities should be preserved under the transformation Equation (1). This requirement is formalized as the relative invariance under the scale transformation, i.e., there exists a function κ(a, σ) such that the equality
$$D(f_{a,\sigma}, g_{a,\sigma}) = \kappa(a, \sigma)\, D(f, g) \tag{2}$$
holds for any pair of densities f, g and any transformation f → f_{a,σ}. A divergence satisfying Equation (2) is referred to as a scale-invariant divergence. Some popular divergences satisfy the scale-invariance: κ(a, σ) = a for the KL-divergence and the α-divergence, κ(a, σ) = σ^{−1} for the IS-distance, κ(a, σ) = a^{1+γ}σ^γ for the density-power divergence, and κ(a, σ) = aσ^{γ/(1+γ)} for the pseudo-spherical divergence.
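The scale factors κ(a, σ) listed above can be verified numerically for the KL-divergence and the IS-distance. The sketch below assumes the one-dimensional transformation f_{a,σ}(x) = aσf(σx), whose support is [0, 1/σ] when f is supported on [0, 1]; the test functions, the values of a and σ, and the helper names are illustrative assumptions.

```python
import math

def integrate(h, lo=0.0, hi=1.0, n=2000):
    """Midpoint-rule approximation of the integral of h over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(h(lo + (i + 0.5) * step) for i in range(n))

def transform(f, a, sigma):
    """f_{a,sigma}(x) = a * sigma * f(sigma * x), supported on [0, 1/sigma]."""
    return lambda x: a * sigma * f(sigma * x)

def kl_divergence(f, g, hi=1.0):
    return integrate(lambda x: f(x) * math.log(f(x) / g(x)) - f(x) + g(x), 0.0, hi)

def is_distance(f, g, hi=1.0):
    return integrate(lambda x: f(x) / g(x) - math.log(f(x) / g(x)) - 1.0, 0.0, hi)

# Illustrative positive-valued functions on [0, 1] (hypothetical choices).
f = lambda x: 1.0 + x
g = lambda x: 2.0 - x

a, sigma = 3.0, 0.5
f_t, g_t = transform(f, a, sigma), transform(g, a, sigma)
hi_t = 1.0 / sigma  # right end of the transformed support

kl_ratio = kl_divergence(f_t, g_t, hi_t) / kl_divergence(f, g)  # compare with kappa = a
is_ratio = is_distance(f_t, g_t, hi_t) / is_distance(f, g)      # compare with kappa = 1/sigma

print(kl_ratio, a)
print(is_ratio, 1.0 / sigma)
```

Because the ratio does not depend on the particular pair (f, g), the ordering of candidate models by their divergence from f is unaffected by the change of units.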

Divergence for Statistical Inference
The divergence D(f, g) or the score S(f, g) is widely applied in statistical inference. The discrepancy between two probability densities is measured by the divergence or the score. Typically, the true probability density p and the model probability density q are substituted into the divergence D(p, q), and D(p, q) is minimized with respect to the model q in order to estimate the probability density p. This is the same as the minimization of the score S(p, q). Usually, one cannot directly access the true probability density. However, the true probability p can be replaced with the empirical probability density of the observed samples, when the samples are observed from p. Given the empirical probability density p̃, the empirical score S(p̃, q) is expected to approximate S(p, q). The estimator is obtained by minimizing S(p̃, q) with respect to the model density q.
Generally, one cannot directly substitute the empirical probability density p̃ into p in the score S(p, q), since p̃ is expressed as a sum of Dirac delta functions. Suppose, however, that the score depends on p only through the expectation of a random variable with respect to p. Then, one can substitute the empirical distribution p̃ into p. Let us introduce the composite score, into which one can substitute the empirical distribution.
Definition 3 (composite score). For positive-valued functions f and g on Ω, the score expressed as
$$S(f, g) = \psi(\langle f\, U(g) \rangle, \langle V(g) \rangle)$$
is called the composite score, where ψ is a real-valued function on R² and U and V are real-valued functions. The integrals ⟨fU(g)⟩ and ⟨V(g)⟩ denote ∫_Ω f(x)U(g(x))dx and ∫_Ω V(g(x))dx, respectively.
The composite score was introduced in [5]. The function ψ is arbitrary in the above definition. When we impose some constraints on the composite scores, the form of ψ will be restricted. Concrete expressions of ψ are presented in Section 3. For the purpose of statistical inference, it is sufficient to define scores on the set of probability densities. However, the scores defined for positive-valued functions are useful to investigate theoretical properties; see [10] for details.
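To make the substitution of the empirical distribution concrete, the following sketch evaluates an empirical composite score ψ(mean of U(q(x_i)), ⟨V(q)⟩) and minimizes it over a one-parameter model. Everything specific here is an assumption chosen for illustration: the model q_θ(x) = (θ + 1)x^θ on [0, 1], the sample size, the parameter grid, and the choice U(z) = −log z, V(z) = z, ψ(a, b) = a + b, which corresponds to the KL score.

```python
import math
import random

def empirical_composite_score(sample, q, U, V, psi, n_grid=2000):
    """psi(sample mean of U(q(x_i)), <V(q)>), with <.> computed by a midpoint rule on [0, 1]."""
    mean_term = sum(U(q(x)) for x in sample) / len(sample)
    step = 1.0 / n_grid
    integral_term = step * sum(V(q((i + 0.5) * step)) for i in range(n_grid))
    return psi(mean_term, integral_term)

def model(theta):
    """Hypothetical one-parameter model q_theta(x) = (theta + 1) * x**theta on [0, 1]."""
    return lambda x: (theta + 1.0) * x ** theta

random.seed(0)
theta_true = 2.0
# Inverse-CDF sampling: the CDF of q_theta is x**(theta + 1); 1 - random() lies in (0, 1].
sample = [(1.0 - random.random()) ** (1.0 / (theta_true + 1.0)) for _ in range(500)]

# KL score written as a composite score.
U = lambda z: -math.log(z)
V = lambda z: z
psi = lambda a, b: a + b

grid = [0.05 * k for k in range(101)]  # candidate values of theta in [0, 5]
theta_hat = min(grid, key=lambda t: empirical_composite_score(sample, model(t), U, V, psi))

# For this model, the minimizer of the empirical KL score is the maximum likelihood estimate.
theta_mle = -len(sample) / sum(math.log(x) for x in sample) - 1.0
print(theta_hat, theta_mle)
```

Only the first argument of ψ depends on the data, and it does so through a sample mean; the second argument is an ordinary integral of the model.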
Separable Bregman divergences are represented by using composite scores. Indeed, the separable Bregman divergence with the potential G(g) = ⟨J(g)⟩ is obtained by setting U(g) = −J′(g), V(g) = J′(g)g − J(g) and ψ(a, b) = a + b in the composite score. Hence, the KL-divergence, the Itakura-Saito distance and the density-power divergence are represented by using the composite score. Though the pseudo-spherical divergence over the set of probability densities is a non-separable Bregman divergence, it is expressed by the composite score as shown in Section 3.
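The representation of a separable Bregman score as a composite score can be checked numerically. The sketch below takes J(z) = z log z − z (so that J′(z) = log z) and compares −⟨J(g) + J′(g)(f − g)⟩ with ψ(⟨fU(g)⟩, ⟨V(g)⟩) for U = −J′, V(z) = J′(z)z − J(z) and ψ(a, b) = a + b. The choice of J, the test functions and the helper names are illustrative assumptions.

```python
import math

def integrate(h, lo=0.0, hi=1.0, n=2000):
    """Midpoint-rule approximation of the integral of h over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(h(lo + (i + 0.5) * step) for i in range(n))

# Convex generator and its derivative (this choice gives the KL score).
J = lambda z: z * math.log(z) - z
dJ = lambda z: math.log(z)

# Ingredients of the composite score.
U = lambda z: -dJ(z)
V = lambda z: dJ(z) * z - J(z)
psi = lambda a, b: a + b

# Illustrative positive-valued functions on [0, 1] (hypothetical choices).
f = lambda x: 1.0 + x
g = lambda x: 2.0 - x

separable_score = -integrate(lambda x: J(g(x)) + dJ(g(x)) * (f(x) - g(x)))
composite_score = psi(integrate(lambda x: f(x) * U(g(x))),
                      integrate(lambda x: V(g(x))))
kl_score = integrate(lambda x: -f(x) * math.log(g(x)) + g(x))

print(separable_score, composite_score, kl_score)
```

Replacing `J` and `dJ` by another strictly convex function and its derivative gives the corresponding check for other separable Bregman scores.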
Scale-invariant divergences defined from composite scores are useful for statistical inference. Suppose that D(f, g) = S(f, g) − S(f, f) is a scale-invariant divergence. Then, the statistical inference using the score S(f, g) does not essentially depend on the system of measurement of the observations. Let $\widehat{q}$ be the estimator based on the sample x, and let $\widehat{q_{1,\sigma}}$ be the estimator based on the transformed sample σx with the model q_{1,σ}, where q_{1,σ}(x) = σq(σx). If the estimator is obtained as the optimal solution of a score that induces a scale-invariant divergence, we obtain $\widehat{q_{1,\sigma}} = (\widehat{q})_{1,\sigma}$. Such an estimator is called an equivariant estimator [27]. The estimation result based on an equivariant estimator is transformed in a consistent way when the system of measurement is changed.
Let us define the equivalence class among scores. Two scores are said to be equivalent if one score is transformed to the other by a strictly increasing function, i.e., for any strictly increasing function ξ, the two scores S(f, g) and ξ(S(f, g)) are equivalent. Statistical inference is often conducted by minimizing the score. Hence, equivalent scores provide the same estimator. If an equivalence class includes a score that leads to a scale-invariant divergence, all scores in the class provide the equivariant estimator.
In the subsequent sections, we introduce the Hölder score, which is a class of composite scores with the scale-invariance property. Then, we investigate theoretical properties of the Hölder score.

Hölder Divergences
Let us define a class of scale-invariant divergences expressed by the composite score. The divergence is called the Hölder divergence. The name comes from the fact that the Hölder inequality or its reverse variant is used to prove the non-negativity of the divergence. The Hölder divergence unifies existing divergences from the viewpoint of scale-invariance.
• For γ = 0, the Hölder score is defined as S_0(f, g) = ⟨−f log g + g + cf⟩, where c is a real number.
• For γ = −1, the Hölder score is defined as S_{−1}(f, g) = ⟨f/g + log g + cf⟩, where c is a real number.
The Hölder divergence is defined as
$$D_{\gamma}(f, g) = S_{\gamma}(f, g) - S_{\gamma}(f, f).$$
The Hölder divergence with non-negative γ is defined in [5]. For γ < 0, γ ≠ −1, it is sufficient to define the function φ(z) for z > 0, since the computation of φ(0) is not required for such γ under the condition that the integral in the divergence is finite. The characterization of the Hölder score is shown in Section 4.3. The Hölder score (divergence) defined from the parameter γ and the function φ is denoted as S_γ (D_γ) with φ, or S^φ_γ (D^φ_γ). It is clear that the Hölder score is a composite score. We show that the Hölder divergence satisfies the conditions of a divergence.
Theorem 1. For positive-valued functions f, g, the Hölder divergence D_γ(f, g) satisfies the inequality D_γ(f, g) ≥ 0 with equality if and only if f = g.
Proof. The Hölder divergences D_0 and D_{−1} coincide with the KL-divergence and the IS distance, respectively. Hence, D_γ with γ = 0 or −1 is a divergence.
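For positive-valued functions f and g, the Hölder inequality
$$\langle f g \rangle \le \langle f^{\alpha} \rangle^{1/\alpha} \langle g^{\beta} \rangle^{1/\beta}$$
holds for 1/α + 1/β = 1 with α, β > 1, and the reverse Hölder inequality
$$\langle f g \rangle \ge \langle f^{\alpha} \rangle^{1/\alpha} \langle g^{\beta} \rangle^{1/\beta}$$
holds for 1/α + 1/β = 1 with α < 1, α ≠ 0. For γ ∉ [−1, 0], the Hölder inequality or its reverse variant leads to ⟨fg^γ⟩^{1+γ} ≤ ⟨f^{1+γ}⟩⟨g^{1+γ}⟩^γ. Hence, we have
$$S_{\gamma}(f, g) \ge -\frac{\langle f g^{\gamma} \rangle^{1+\gamma}}{\langle g^{1+\gamma} \rangle^{\gamma}} \ge -\langle f^{1+\gamma} \rangle = S_{\gamma}(f, f),$$
in which the first inequality comes from φ(z) ≥ −z^{1+γ} and the second inequality is derived from the (reverse) Hölder inequality. When the first and second inequalities become equalities, we have ⟨fg^γ⟩/⟨g^{1+γ}⟩ = 1 and the linear dependence of f and g. As a result, the equality S_γ(f, g) = S_γ(f, f) gives f = g. For γ ∈ (−1, 0), the reverse Hölder inequality for positive-valued functions f and g is expressed as ⟨fg^γ⟩^{1+γ} ≥ ⟨f^{1+γ}⟩⟨g^{1+γ}⟩^γ. Hence, we have
$$S_{\gamma}(f, g) \ge -\frac{\langle g^{1+\gamma} \rangle^{\gamma}}{\langle f g^{\gamma} \rangle^{1+\gamma}} \ge -\frac{1}{\langle f^{1+\gamma} \rangle} = S_{\gamma}(f, f),$$
in which the first inequality comes from φ(z) ≥ −z^{−(1+γ)} and the second inequality is derived from the reverse Hölder inequality. The same argument as in the case of γ ∉ [−1, 0] shows that the equality S_γ(f, g) = S_γ(f, f) leads to f = g.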
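The direction of the inequality between ⟨fg^γ⟩^{1+γ} and ⟨f^{1+γ}⟩⟨g^{1+γ}⟩^γ in the two parameter regimes can be observed numerically. The sketch below evaluates the gap ⟨f^{1+γ}⟩⟨g^{1+γ}⟩^γ − ⟨fg^γ⟩^{1+γ} for several values of γ; the test functions, the list of γ values and the helper names are illustrative assumptions.

```python
def integrate(h, lo=0.0, hi=1.0, n=2000):
    """Midpoint-rule approximation of the integral of h over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(h(lo + (i + 0.5) * step) for i in range(n))

def holder_gap(f, g, gamma):
    """<f^(1+gamma)> <g^(1+gamma)>^gamma - <f g^gamma>^(1+gamma)."""
    mixed = integrate(lambda x: f(x) * g(x) ** gamma) ** (1.0 + gamma)
    bound = (integrate(lambda x: f(x) ** (1.0 + gamma))
             * integrate(lambda x: g(x) ** (1.0 + gamma)) ** gamma)
    return bound - mixed

# Illustrative positive-valued functions on [0, 1] that are not proportional.
f = lambda x: 1.0 + x
g = lambda x: 2.0 - x

outside = {gamma: holder_gap(f, g, gamma) for gamma in (2.0, 0.5, -2.0, -3.0)}
inside = {gamma: holder_gap(f, g, gamma) for gamma in (-0.5, -0.3)}

# gamma outside [-1, 0]: the gap is non-negative; gamma in (-1, 0): it is non-positive.
print(outside)
print(inside)
```

For proportional functions the gap vanishes in both regimes, which corresponds to the equality case used in the proof.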
The Hölder divergences have the scale-invariance property. The following calculation is straightforward.
In addition, we have D_0(f_{a,σ}, g_{a,σ}) = aD_0(f, g) and D_{−1}(f_{a,σ}, g_{a,σ}) = σ^{−1}D_{−1}(f, g). There is no Hölder divergence such that the equality D_γ(f_{a,σ}, g_{a,σ}) = D_γ(f, g) holds for arbitrary a, σ > 0. Moreover, the theorem in Section 4.3 ensures that there is no scale-invariant divergence based on the composite score such that the scale function κ(a, σ) is constant.
The class of Hölder divergences includes some popular divergences that are used in statistics and information theory. Some examples are shown below.
For γ < 0, the reverse Minkowski inequality ensures that −⟨f^{1+γ}⟩^{1/(1+γ)} is convex in f. For γ > 0, κ > 1 or γ < 0, κ < 0, the corresponding Bregman divergence is given as For the parameter γ < 0 and 0 < κ < 1, the divergence is the negative of the above. The parameter κ = 1 + γ yields the density-power divergence, and the parameter κ = 1 yields the pseudo-spherical divergence. In this paper, this divergence is referred to as the Bregman-Hölder divergence; the divergence with positive γ is considered in [5]. The Bregman-Hölder divergence is characterized as the intersection of the Bregman divergences and the Hölder divergences. This fact is proved in Theorem 3.
Example 9 (α-divergence and Hölder divergence). The α-divergence with α ≠ 0, 1 in Example 5 is represented by using the Hölder divergence, though it is not a member of the class of composite scores. Indeed, using the density-power divergence in Example 3, we have for α ≠ 0, 1.
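One identity that is consistent with the statement of Example 9 can be checked numerically: D_alpha(f, g) = α^{−2} D_power^{(γ)}(f^α, g^α) with γ = (1 − α)/α. This identity and the normalization of the density-power divergence used below are assumptions made for illustration; the relation displayed in the paper may be written differently. The test functions and helper names are likewise hypothetical.

```python
def integrate(h, lo=0.0, hi=1.0, n=2000):
    """Midpoint-rule approximation of the integral of h over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(h(lo + (i + 0.5) * step) for i in range(n))

def alpha_divergence(f, g, alpha):
    """<alpha f + (1 - alpha) g - f^alpha g^(1 - alpha)> / (alpha (1 - alpha))."""
    total = integrate(lambda x: alpha * f(x) + (1.0 - alpha) * g(x)
                      - f(x) ** alpha * g(x) ** (1.0 - alpha))
    return total / (alpha * (1.0 - alpha))

def density_power(f, g, gamma):
    """Assumed form: <g^(1+gamma)/(1+gamma) - f g^gamma/gamma + f^(1+gamma)/(gamma (1+gamma))>."""
    return integrate(lambda x: g(x) ** (1.0 + gamma) / (1.0 + gamma)
                     - f(x) * g(x) ** gamma / gamma
                     + f(x) ** (1.0 + gamma) / (gamma * (1.0 + gamma)))

def alpha_via_density_power(f, g, alpha):
    """alpha^(-2) * D_power^{(gamma)}(f^alpha, g^alpha) with gamma = (1 - alpha) / alpha."""
    gamma = (1.0 - alpha) / alpha
    f_pow = lambda x: f(x) ** alpha
    g_pow = lambda x: g(x) ** alpha
    return density_power(f_pow, g_pow, gamma) / alpha ** 2

# Illustrative positive-valued functions on [0, 1] (hypothetical choices).
f = lambda x: 1.0 + x
g = lambda x: 2.0 - x

for alpha in (0.3, 2.0, -0.5):
    print(alpha, alpha_divergence(f, g, alpha), alpha_via_density_power(f, g, alpha))
```

The power transformation f → f^α is not of the form ψ(⟨fU(g)⟩, ⟨V(g)⟩) in f, which is in line with the remark that the α-divergence itself is not a composite score.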

Theoretical Properties of Hölder Divergences
In this section, we present some theoretical properties of Hölder divergence.

Conjugate Relation
Let us consider the conjugate relation among Hölder divergences. Firstly, we point out that the KL divergence and the IS distance are related to each other, i.e., for the Hölder divergence, the equality D_{−1}(f, g) = D_0(1, f/g) holds. This relation is extended to Hölder divergences.
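The relation D_{−1}(f, g) = D_0(1, f/g) between the IS distance and the KL divergence can be confirmed numerically. In the sketch below, the test functions and the helper names are illustrative assumptions.

```python
import math

def integrate(h, lo=0.0, hi=1.0, n=2000):
    """Midpoint-rule approximation of the integral of h over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(h(lo + (i + 0.5) * step) for i in range(n))

def kl_divergence(f, g):
    """D_0(f, g) = <f log(f/g) - f + g>."""
    return integrate(lambda x: f(x) * math.log(f(x) / g(x)) - f(x) + g(x))

def is_distance(f, g):
    """D_{-1}(f, g) = <f/g - log(f/g) - 1>."""
    return integrate(lambda x: f(x) / g(x) - math.log(f(x) / g(x)) - 1.0)

# Illustrative positive-valued functions on [0, 1] (hypothetical choices).
f = lambda x: 1.0 + x
g = lambda x: 2.0 - x

one = lambda x: 1.0
ratio = lambda x: f(x) / g(x)

lhs = is_distance(f, g)
rhs = kl_divergence(one, ratio)
print(lhs, rhs)
```

The two integrands agree pointwise, so the equality does not depend on the quadrature rule.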
Then, ι ∘ ι is the identity map. This implies that the Hölder divergences D^φ_γ and D^{φ*}_{γ*} are connected by the conjugate relation. In the current setup, though the IS-distance D_{−1} is represented by the KL-divergence D_0, the representation of D_0(f, g) by using D_{−1} is not properly defined.

Bregman Divergence and Hölder Divergence
Since the Hölder divergence D_γ(f, g) is not necessarily convex in f, the Hölder divergence is not always represented in the form of a Bregman divergence. Let us identify the equivalence class of the intersection of Bregman divergences and Hölder divergences.
• If S_γ is equivalent to the score S that induces the Bregman divergence D(f, g) = S(f, g) − S(f, f), then D(f, g) is the Bregman-Hölder divergence in Example 8.
• If S_γ is equivalent to the score S that induces the separable Bregman divergence D(f, g) = S(f, g) − S(f, f), then D(f, g) is the density-power divergence.
For γ > 0, the theorem was proved in [5]. We present the proof for γ < 0. The proof is found in Appendix A.
Amari [28] studied the intersection between the Bregman divergence and the Csiszár f-divergence under the power-representation of probability distributions. There are some attempts to define a divergence that connects the density-power divergence and the pseudo-spherical divergence [22]. The Bregman-Hölder divergence is different from the existing one.

Characterization of Hölder Scores
In Section 3, we showed that the Hölder divergence is defined from the composite score and has the scale-invariance property. Conversely, we show that these properties characterize the class of Hölder divergences. Some technical assumptions are introduced below.
Assumption 1. Let D(f, g) = S(f, g) − S(f, f) be the divergence for the positive-valued functions f, g on the compact support Ω.
(a) D(f, g) satisfies the scale-invariance property Equation (2), and S(f, g) is expressed as the composite score ψ(⟨fU(g)⟩, ⟨V(g)⟩).
(b) The functions U, V and ψ are differentiable. The two-dimensional gradient vector of ψ does not vanish at any point x ∈ R² that is expressed as x = (⟨fU(f)⟩, ⟨V(f)⟩) for a positive-valued function f.
Theorem 4. Suppose that the divergence D(f, g) = S(f, g) − S(f, f) satisfies Assumption 1. Then, the composite score S(f, g) is equivalent to the Hölder score.
We use the following lemmas to prove Theorem 4.
Lemma 1. Suppose that D(f, g) is the divergence defined from the composite score S(f, g) = ψ(⟨fU(g)⟩, ⟨V(g)⟩). We assume the condition (b) in Assumption 1. Then, V′(z) = czU′(z) holds with a non-zero constant c ∈ R.
• U (z) = − log z + c and V (z) = z and c ∈ R.
The proof of Lemma 1 is shown in Lemma C.1 of [5], and hence, we omit the proof. Lemma 2 for positive γ is also proved in [5] under slightly different conditions. The proof of Lemma 2 is shown in Appendix B. A more involved argument is required to specify the expression of the function ψ of the composite score. The detailed proofs are found in Appendix C.
For probability densities f and g defined on the non-compact support R^d, Kanamori and Fujisawa [5] specified the expression of the divergence D(f, g) having the affine invariance for the coordinate x. In that case, Hölder divergences with negative γ, such as the Itakura-Saito distance, are excluded, since they are not defined for functions on a non-compact support. In the present paper, we consider divergences for positive-valued functions on the compact support Ω.
Separable Bregman divergences are derived from composite scores. Hence, we obtain the following result.
Corollary 5. Suppose that the separable Bregman divergence is scale-invariant. Then, the divergence should be the density-power divergence.
Different invariance properties provide different divergences. Indeed, the invariance under any invertible and differentiable data transformation leads to the Csiszár ϕ-divergence [7,29], and a different type of scale-invariance leads to the pseudo-spherical divergence [24].

Conclusions
We proposed the Hölder divergence as defined from the composite score, and showed that the Hölder divergence has the scale-invariance property. In addition, we proved that a composite score satisfying the scale-invariance property leads to the Hölder divergence. The Hölder divergence is determined by a real number γ and a function φ. In the previous work [5], the Hölder divergence with a positive γ was proposed from the affine-invariance, and it was used for robust parameter estimation. In this paper, we extended the previous work to Hölder divergences with a negative parameter γ. As a result, the density-power divergence with a negative parameter and the Itakura-Saito distance were unified under the Hölder divergence. The Hölder divergence with a non-negative γ can be used to measure the discrepancy between two non-negative functions on a non-compact support. On the other hand, the Hölder divergence defined from any real number γ is available to measure the degree of nearness between two non-negative functions on a compact domain. Technically, the reverse Hölder inequality and the reverse Minkowski inequality were used to prove the non-negativity of the divergence. Functions with a compact support are also useful in statistical data analysis, though most of the frequently used densities, such as the normal distribution, are defined on a non-compact set. Indeed, power spectrum densities are defined on the compact set [−π, π], and the IS-distance is used to measure the discrepancy between two power spectrum densities.
We presented a method of constructing scale-invariant divergences from the (reverse) Hölder inequality. This is a new approach for introducing a class of divergences. We expect that the new class of divergences opens up new applications in the field of information sciences.

A. Proof of Theorem 3
should hold. Let f(x) on [0, 1] be the step function defined as f(x) = a > 0 for 0 ≤ x ≤ p and f(x) = b > 0 for p < x ≤ 1, where p ∈ (0, 1). Then, the equality holds for all p, a, b. This implies that ξ(z^s) is an affine function with respect to z > 0. Therefore, we obtain J(f) = c_0 + c_1 f^{1+γ}. Due to the convexity of J(f) in f, we find c_1 > 0 for γ < −1 and c_1 < 0 for −1 < γ < 0. As a result, only the separable Bregman score defined from J(z) = c_1 z^{1+γ} is equivalent to the Hölder score. This is nothing but the density-power score extended to the negative parameter γ.

B. Proof of Lemma 2
Let f and g be positive-valued functions defined on Ω = [0, 1]. Extension to a compact set in the multi-dimensional space is straightforward.
Therefore, we obtain Some algebra yields that the above equation is expressed as for any function v(x), where c 0 , c 1 and c 2 are constants.Hence, we have for any positive-valued function g(x).
As a result, the function U(z) should satisfy the differential equation for z > 0. Up to a constant factor, the solution is given as U(z) = z^γ + c or U(z) = log z + c for γ, c ∈ R. Due to Lemma 1, we have V(z) = z^{1+γ} for U(z) = z^γ + c with γ ≠ 0, −1, V(z) = log z for U(z) = 1/z + c, and V(z) = z for U(z) = −log z + c, up to a constant factor. As shown in [5], the relative invariance under the transformation f(x) → f_{1,σ}(x) = σf(σx) provides the same solution.

C. Proofs of Theorem 4
Proof for U(z) = −log z + c and V(z) = z. The composite score is given as
$$S(f, g) = \psi(\langle -f \log g + c f \rangle, \langle g \rangle).$$
For any pair of positive functions f, g, the inequality ⟨−f log g + cf + g⟩ ≥ ⟨−f log f + cf + f⟩ holds, and the equality holds if and only if f = g. Hence, for the function ψ, the equality ψ(x, y) = ψ(z, w) holds for x + y = z + w, and the inequality ψ(x, y) > ψ(z, w) holds for x + y > z + w. Therefore, ψ(x, y) is expressed as ξ(x + y) by using a strictly increasing function ξ. As a result, the score is given as S(f, g) = ξ(⟨−f log g + g + cf⟩), which is equivalent to the Hölder score with γ = 0 up to a monotone transformation.
Proof for U(z) = 1/z + c and V(z) = log z. The composite score is given as
$$S(f, g) = \psi(\langle f/g + c f \rangle, \langle \log g \rangle).$$
Recall that ⟨f/g + log g − 1 − log f⟩ is nothing but the Itakura-Saito distance. Hence, for any pair of positive functions f, g, the inequality ⟨f/g + cf + log g⟩ ≥ ⟨1 + cf + log f⟩ holds, and the equality holds if and only if f = g. Hence, for the function ψ, the equality ψ(x, y) = ψ(z, w) holds for x + y = z + w, and the inequality ψ(x, y) > ψ(z, w) holds for x + y > z + w. Therefore, ψ(x, y) is expressed as ξ(x + y) by using a strictly increasing function ξ. The score should be represented as S(f, g) = ξ(⟨f/g + cf + log g⟩), which is equivalent to the Hölder score with γ = −1.
In the above proof, the identity function ξ(z) = z leads to the Itakura-Saito distance, and ξ(z) = e^z with c = 0 also leads to another scale-invariant divergence.