From f-divergence to quantum quasi-entropies and their use

Csiszár's f-divergence of two probability distributions was extended to the quantum case by the author in 1985. In the quantum setting positive semidefinite matrices take the place of probability distributions, and the quantum generalization is called quasi-entropy. It is related to other important concepts such as covariance, quadratic costs, Fisher information, the Cramér-Rao inequality and the uncertainty relation. A conjecture about the scalar curvature of a Fisher information geometry is explained. These subjects are reviewed in detail in the matrix setting, and at the very end the von Neumann algebra approach is sketched briefly.

Let X be a finite space with probability measures p and q. Their relative entropy or divergence $D(p\|q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$ was introduced by Kullback and Leibler in 1951 [27]. More precisely, if p(x) = q(x) = 0, then log(p(x)/q(x)) = 0, and if p(x) ≠ 0 but q(x) = 0 for some x ∈ X, then log(p(x)/q(x)) = +∞.
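A possible generalization of the relative entropy is the f-divergence introduced by Csiszár: $D_f(p\|q) = \sum_{x \in X} q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)$ (1) with a real function f(x) defined for x > 0. For the convex function f(x) = x log x the relative entropy is obtained. Convexity of f together with f(1) = 0 ensures that D_f(p‖q) ≥ 0. It is interesting to remark that q f(p/q) can also be considered as a mean of p and q. In that case the mean of p and p should be p, so in the theory of means f(1) = 1 is a different natural requirement.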
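Example 1 Let f(x) = |x − 1|. Then $D_f(p\|q) = \sum_x |p(x) - q(x)|$ is the variational distance of p and q.
Example 2 Let f(x) = (1 − √x)²/2. Then $D_f(p\|q) = \frac12 \sum_x \bigl(\sqrt{p(x)} - \sqrt{q(x)}\bigr)^2$ is the squared Hellinger distance of p and q.
Example 3 The function $f_\alpha(t) = \frac{1 - t^\alpha}{\alpha(1-\alpha)}$ gives the relative α-entropy $S_\alpha(p\|q) = \frac{1}{\alpha(1-\alpha)}\Bigl(1 - \sum_x p(x)^\alpha q(x)^{1-\alpha}\Bigr)$. The limit α → 0 gives the relative entropy.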
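The examples above can be checked numerically. The following is a minimal sketch, assuming the usual definition D_f(p‖q) = Σ q(x) f(p(x)/q(x)) with q(x) > 0 everywhere; the helper name `f_divergence` and the sample distributions are illustrative and not part of the original text.

```python
import math

def f_divergence(p, q, f):
    """D_f(p||q) = sum_x q(x) * f(p(x)/q(x)); assumes q(x) > 0 for every x."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def variational(x):
    # f(x) = |x - 1| yields the variational distance
    return abs(x - 1.0)

def hellinger(x):
    # f(x) = (1 - sqrt(x))^2 / 2 yields the squared Hellinger distance
    return 0.5 * (1.0 - math.sqrt(x)) ** 2

def x_log_x(x):
    # f(x) = x log x yields the relative entropy; convention 0 log 0 = 0
    return x * math.log(x) if x > 0 else 0.0

# Illustrative distributions on a three-point space
p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

V = f_divergence(p, q, variational)
H2 = f_divergence(p, q, hellinger)
KL = f_divergence(p, q, x_log_x)
```

With natural logarithms one would expect V²/2 ≤ KL here, in line with the Pinsker-Csiszár inequality mentioned later in the text.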
Several other functions have appeared in the literature; we list a few of them: (1 + x) log 2 + x log x − (1 + x) log(x + 1) if β = 1. [32]
The following result of Csiszár is a characterization (or axiomatization) of the f-divergence.
Theorem 3 Assume that a number C(p, q) ∈ R is associated to probability distributions on the same set X for all finite sets X . If (b) if A is a partition of X , then C(p A , q A ) ≤ C(p, q) and the equality holds if and only if p A (A)q(x) = q A (A)p(x) whenever x ∈ A ∈ A, then there exists a convex function f : R + → R which is continuous at 0 and C(p, q) = D f (p||q) for every p and q.

Quantum quasi-entropy
In the mathematical formalism of quantum mechanics, instead of n-tuples of numbers one works with n × n complex matrices. They form an algebra and this allows an algebraic approach. In this approach, a probability density is replaced by a positive semidefinite matrix of trace 1, which is called a density matrix [39]. The eigenvalues of a density matrix give a probability density. However, this is not the only probability density provided by a density matrix. If we rewrite the matrix in a certain orthonormal basis, then the diagonal elements p_1, p_2, ..., p_n form a probability density.
Let M denote the algebra of n × n matrices with complex entries. For positive definite matrices ρ_1, ρ_2 ∈ M, for A ∈ M and a function f : R+ → R, the quasi-entropy is defined as $S_f^A(\rho_1\|\rho_2) := \langle A\rho_2^{1/2}, f(\Delta(\rho_1/\rho_2))(A\rho_2^{1/2})\rangle$, where ⟨B, C⟩ := Tr B*C is the so-called Hilbert-Schmidt inner product and ∆(ρ_1/ρ_2) : M → M is a linear mapping acting on matrices: $\Delta(\rho_1/\rho_2)A = \rho_1 A \rho_2^{-1}$. This concept was introduced in [33,34], see also Chapter 7 in [31], and it is the quantum generalization of the f-entropy of Csiszár used in classical information theory (and statistics) [7,30].
The monotonicity in Theorem 1 is a consequence of the Jensen inequality. A function f : R+ → R is called matrix concave if one of the following two equivalent conditions holds: $f(\lambda A + (1-\lambda)B) \ge \lambda f(A) + (1-\lambda) f(B)$ for every number 0 < λ < 1 and for positive definite square matrices A and B (of the same size). In the other condition the number λ is (heuristically) replaced by a matrix: $f(CAC^* + DBD^*) \ge C f(A) C^* + D f(B) D^*$ if CC* + DD* = I.
A function f : R + → R is called matrix monotone if for positive definite matrices A ≤ B the inequality f (A) ≤ f (B) holds. It is interesting that a matrix monotone function is matrix concave and a matrix concave function is matrix monotone if it is bounded from below [17].
Let α : M_0 → M be a mapping between two matrix algebras. The dual α* : M → M_0 with respect to the Hilbert-Schmidt inner product is positive if and only if α is positive. Moreover, α is unital if and only if α* is trace preserving. α : M_0 → M is called a Schwarz mapping if $\alpha(B^*B) \ge \alpha(B^*)\alpha(B)$ (8) for every B ∈ M_0.
Theorem 4 Assume that f : R+ → R is an operator monotone function with f(0) ≥ 0 and α : M_0 → M is a unital Schwarz mapping. Then $S_f^A(\alpha^*(\rho_1)\|\alpha^*(\rho_2)) \ge S_f^{\alpha(A)}(\rho_1\|\rho_2)$ (9) holds for A ∈ M_0 and for invertible density matrices ρ_1 and ρ_2 from the matrix algebra M.
Proof: The proof is based on inequalities for operator monotone and operator concave functions. First note that for a positive constant c. Due to the Schwarz inequality (8), we may assume that f (0) = 0.
It is remarkable that for a multiplicative α we do not need the condition f (0) ≥ 0. Moreover, V * ∆V = ∆ 0 and we do not need the matrix monotonicity of the function f . In this case the only condition is the matrix concavity, analogously to Theorem 1.
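If we apply the monotonicity (9) to the embedding α(X) = X ⊕ X of M into M ⊕ M and to the densities $\lambda\rho_1 \oplus (1-\lambda)\rho_1'$ and $\lambda\rho_2 \oplus (1-\lambda)\rho_2'$, then we obtain the joint concavity of the quasi-entropy: $\lambda S_f^A(\rho_1\|\rho_2) + (1-\lambda) S_f^A(\rho_1'\|\rho_2') \le S_f^A(\lambda\rho_1 + (1-\lambda)\rho_1' \,\|\, \lambda\rho_2 + (1-\lambda)\rho_2')$ holds. The case f(t) = t^α is Lieb's famous concavity theorem: Tr Aρ^α A*ρ^{1−α} is concave in ρ [29].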
The concept of quasi-entropy includes some important special cases. If ρ_2 and ρ_1 are different and A = I, then we have a kind of relative entropy. For f(x) = x log x we have Umegaki's relative entropy S(ρ_1‖ρ_2) = Tr ρ_1(log ρ_1 − log ρ_2). (If we want a matrix monotone function, then we can take f(x) = log x and then we get −S(ρ_2‖ρ_1).) Umegaki's relative entropy is the most important example, therefore the function f will be chosen to be matrix convex. This makes the probabilistic and non-commutative situations compatible, as one can see in the next argument.
Let ρ_1 and ρ_2 be density matrices in M. If in a certain basis they have diagonals p = (p_1, p_2, ..., p_n) and q = (q_1, q_2, ..., q_n), then the monotonicity theorem gives the inequality $D_f(p\|q) \le S_f(\rho_1\|\rho_2)$ (13) for a matrix convex function f. If ρ_1 and ρ_2 commute, then we can take the common eigenbasis and equality holds in (13). It is not trivial that otherwise the inequality is strict.
If ρ_1 and ρ_2 are different, then there is a choice for p and q such that they are different as well. Then $S_f(\rho_1\|\rho_2) \ge D_f(p\|q) > 0$. Conversely, if S_f(ρ_1‖ρ_2) = 0, then p = q for every basis and this implies ρ_1 = ρ_2. For the relative entropy, a deeper result is known. The Pinsker-Csiszár inequality says that $\|p - q\|_1^2 \le 2 D(p\|q)$. This extends to the quantum case as $\|\rho_1 - \rho_2\|_1^2 \le 2 S(\rho_1\|\rho_2)$, see [22], or [39, Chap. 3].
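The comparison between the quantum and the classical relative entropy can be illustrated on 2 × 2 real symmetric density matrices. The following is a minimal sketch, assuming Umegaki's formula S(ρ_1‖ρ_2) = Tr ρ_1(log ρ_1 − log ρ_2) and taking p and q to be the diagonals in the standard basis; the helper names and the sample matrices are hypothetical.

```python
import math

def sym2_fun(m, fun):
    """Apply fun to a real symmetric 2x2 matrix ((a, b), (b, d)) via its spectral decomposition."""
    (a, b), (_, d) = m
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    c, s = math.cos(theta), math.sin(theta)
    mean, rad = 0.5 * (a + d), math.hypot(0.5 * (a - d), b)
    # (c, s) is the eigenvector for mean + rad, (-s, c) the one for mean - rad
    l1, l2 = fun(mean + rad), fun(mean - rad)
    off = (l1 - l2) * c * s
    return ((l1 * c * c + l2 * s * s, off), (off, l1 * s * s + l2 * c * c))

def trace_prod(x, y):
    """Tr(XY) for 2x2 matrices."""
    return sum(x[i][j] * y[j][i] for i in range(2) for j in range(2))

def umegaki(rho1, rho2):
    """S(rho1||rho2) = Tr rho1 (log rho1 - log rho2) for positive definite inputs."""
    log1 = sym2_fun(rho1, math.log)
    log2 = sym2_fun(rho2, math.log)
    return trace_prod(rho1, log1) - trace_prod(rho1, log2)

def classical(p, q):
    """D(p||q) = sum p log(p/q) for strictly positive distributions."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q))

# Two non-commuting, positive definite density matrices (illustrative values)
rho1 = ((0.7, 0.2), (0.2, 0.3))
rho2 = ((0.4, -0.1), (-0.1, 0.6))

S = umegaki(rho1, rho2)
D = classical((rho1[0][0], rho1[1][1]), (rho2[0][0], rho2[1][1]))
```

For these non-commuting matrices one expects S to exceed D, while for a commuting pair written in its common eigenbasis the two numbers coincide.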
Problem 1 It would be interesting to extend Theorem 3 of Csiszár to the quantum case. If we require monotonicity and specify the condition for equality, then a function f is provided by Theorem 3, but for non-commuting densities the conclusion is not clear.

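Example 4 Let $f_\alpha(x) = \frac{1}{\alpha(1-\alpha)}(1 - x^\alpha)$. This function is matrix monotone decreasing for α ∈ (−1, 1). (For α = 0, the limit is taken and it is − log x.) Then the relative entropies of degree α are produced: $S_\alpha(\rho_1\|\rho_2) := \frac{1}{\alpha(1-\alpha)} \mathrm{Tr}\,(I - \rho_1^\alpha \rho_2^{-\alpha})\rho_2$. These quantities are essential in the quantum case.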
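If ρ_2 = ρ_1 = ρ and A, B ∈ M are arbitrary, then one arrives at the generalized covariance [38]: $\mathrm{qCov}^f_\rho(A, B) := \langle A\rho^{1/2}, f(\Delta(\rho/\rho))(B\rho^{1/2})\rangle - (\mathrm{Tr}\,\rho A^*)(\mathrm{Tr}\,\rho B)$.
The usual symmetrized covariance corresponds to the function f(t) = (t + 1)/2: $\mathrm{Cov}_\rho(A, B) := \frac12 \mathrm{Tr}\,\rho(A^*B + BA^*) - (\mathrm{Tr}\,\rho A^*)(\mathrm{Tr}\,\rho B)$. The interpretation of the covariances is not at all clear. In the next section they will be called quadratic cost functions. It turns out that there is a one-to-one correspondence between quadratic cost functions and Fisher informations.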

Fisher information

The Cramér-Rao inequality
The Cramér-Rao inequality belongs to the basics of estimation theory in mathematical statistics. Its quantum analog was discovered immediately after the foundation of mathematical quantum estimation theory in the 1960s, see the book [21] of Helstrom, or the book [24] of Holevo for a rigorous summary of the subject. Although both the classical Cramér-Rao inequality and its quantum analog are as trivial as the Schwarz inequality, the subject attracts a lot of attention because it is located on the highly exciting boundary of statistics, information and quantum theory.
As a starting point we give a very general form of the quantum Cramér-Rao inequality in the simple setting of finite dimensional quantum mechanics. For θ ∈ (−ε, ε) ⊂ R a statistical operator ρ(θ) is given and the aim is to estimate the value of the parameter θ close to 0. Formally ρ(θ) is an n × n positive semidefinite matrix of trace 1 which describes a mixed state of a quantum mechanical system, and we assume that ρ(θ) is smooth (in θ). Assume that an estimation is performed by the measurement of a self-adjoint matrix A playing the role of an observable. A is called a locally unbiased estimator if $\frac{\partial}{\partial\theta}\mathrm{Tr}\,\rho(\theta)A\,\big|_{\theta=0} = 1$ (19). This condition holds if A is an unbiased estimator for θ, that is, $\mathrm{Tr}\,\rho(\theta)A = \theta$ for θ ∈ (−ε, ε) (20). To require this equality for all values of the parameter is a serious restriction on the observable A and we prefer to use the weaker condition (19).
Let ϕ_0[K, L] be an inner product (or quadratic cost function) on the linear space of self-adjoint matrices. When ρ(θ) is smooth in θ, as was already assumed above, then $\frac{\partial}{\partial\theta}\mathrm{Tr}\,\rho(\theta)B\,\big|_{\theta=0} = \varphi_0[B, L]$ (21) with some L = L*. From (19) and (21), we have ϕ_0[A, L] = 1 and the Schwarz inequality yields $\varphi_0[A, A] \ge \frac{1}{\varphi_0[L, L]}$ (22). This is the celebrated inequality of Cramér-Rao type for the locally unbiased estimator.
The right-hand-side of (22) is independent of the estimator and provides a lower bound for the quadratic cost. The denominator ϕ 0 [L, L] appears to be in the role of Fisher information here. We call it quantum Fisher information with respect to the cost function ϕ 0 [ · , · ]. This quantity depends on the tangent of the curve ρ(θ). If the densities ρ(θ) and the estimator A commute, then We want to conclude from the above argument that whatever Fisher information and generalized variance are in the quantum mechanical setting, they are very strongly related. In an earlier work [36,37] we used a monotonicity condition to make a limitation on the class of Riemannian metrics on the state space of a quantum system. The monotone metrics are called Fisher information quantities in this paper.
Since the necessary and sufficient condition for equality in the Schwarz inequality is well known, we are able to analyze the case of equality in (22). The condition for equality is A = λL for some constant λ ∈ R. Therefore the necessary and sufficient condition for equality in (22) is $\dot\rho_0 := \frac{\partial}{\partial\theta}\rho(\theta)\big|_{\theta=0} = \lambda^{-1} J_0(A)$ (24), where J_0 is the positive operator determined by ϕ_0[X, Y] = Tr X J_0(Y). Therefore there exists a unique locally unbiased estimator $A = \lambda J_0^{-1}(\dot\rho_0)$, where the number λ is chosen in such a way that the condition (19) is satisfied.
Let ρ(θ) := ρ + θB, where ρ is a positive definite density and B is a self-adjoint traceless operator. A is locally unbiased when Tr AB = 1. In particular, A = B/Tr B² is a locally unbiased estimator and in the Cramér-Rao inequality (22) the equality holds when ϕ_0[X, Y] = Tr XY, that is, J_0 is the identity.
If Tr ρB = 0 holds in addition, then the estimator is unbiased.
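The example can be verified with a few lines of arithmetic. The sketch below assumes the curve ρ(θ) = ρ + θB and the cost function ϕ_0[X, Y] = Tr XY (so that the logarithmic derivative is L = B and the bound is 1/Tr B²); the matrices B and C and the helper names are illustrative choices, and ρ itself does not enter the computation.

```python
def trace_prod(x, y):
    """Tr(XY) for square matrices given as nested lists."""
    n = len(x)
    return sum(x[i][j] * y[j][i] for i in range(n) for j in range(n))

def lin(a, x, b, y):
    """Entrywise linear combination a*X + b*Y."""
    return [[a * xv + b * yv for xv, yv in zip(xr, yr)] for xr, yr in zip(x, y)]

# Tangent direction of the curve rho(theta) = rho + theta * B (self-adjoint, traceless)
B = [[0.2, 0.1], [0.1, -0.2]]
# A perturbation with Tr(CB) = 0, used to build a second locally unbiased estimator
C = [[0.1, -0.2], [-0.2, -0.1]]
zero = [[0.0, 0.0], [0.0, 0.0]]

fisher = trace_prod(B, B)      # phi_0[L, L] with L = B when J_0 is the identity
bound = 1.0 / fisher           # right-hand side of the Cramer-Rao inequality

A = lin(1.0 / fisher, B, 0.0, zero)   # A = B / Tr B^2
A2 = lin(1.0, A, 1.0, C)              # another estimator with Tr(A2 B) = 1

cost_A = trace_prod(A, A)      # quadratic cost of A, attains the bound
cost_A2 = trace_prod(A2, A2)   # quadratic cost of A2, strictly above the bound
```

The second estimator stays locally unbiased because Tr CB = 0, but its cost exceeds the bound by Tr C², which illustrates that equality singles out A = λL.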

Coarse-graining and monotonicity
In the simple setting in which the state is described by a density matrix, a coarse-graining is an affine mapping sending density matrices into density matrices. Such a mapping extends to all matrices and provides a positivity and trace preserving linear transformation. A common example of coarse-graining sends the density matrix ρ_12 of a composite system 1 + 2 into the (reduced) density matrix ρ_1 of component 1. There are several reasons to assume complete positivity of a coarse-graining and we do so.
Assume that ρ(θ) is a smooth curve of density matrices with tangent $A := \dot\rho$ at ρ_0. The quantum Fisher information F_ρ(A) is an information quantity associated with the pair (ρ, A); it appeared in the Cramér-Rao inequality above, and the classical Fisher information gives a bound for the variance of a locally unbiased estimator. Let now β be a coarse-graining. Then β(ρ(θ)) is another curve in the state space. Due to the linearity of β, the tangent at β(ρ_0) is β(A). As is usual in statistics, information cannot be gained by coarse-graining, therefore we expect that the Fisher information at the density matrix ρ_0 in the direction A must be larger than the Fisher information at β(ρ_0) in the direction β(A). This is the monotonicity property of the Fisher information under coarse-graining: $F_{\rho_0}(A) \ge F_{\beta(\rho_0)}(\beta(A))$ (25). Although we do not want to have a concrete formula for the quantum Fisher information, we require that this monotonicity condition must hold. Another requirement is that F_ρ(A) should be quadratic in A, in other words there exists a non-degenerate real bilinear form γ_ρ(A, B) on the self-adjoint matrices such that $F_\rho(A) = \gamma_\rho(A, A)$ (26). The requirements (25) and (26) are strong enough to obtain a reasonable but still wide class of possible quantum Fisher informations.
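We may assume that $\gamma_\rho(A, B) = \mathrm{Tr}\, A J_\rho^{-1}(B^*)$ (27) for an operator J_ρ acting on matrices. (This formula expresses the inner product γ_ρ by means of the Hilbert-Schmidt inner product and the positive linear operator J_ρ.) In terms of the operator J_ρ the monotonicity condition reads as $\beta^* J_{\beta(\rho)}^{-1} \beta \le J_\rho^{-1}$ (28) for every coarse-graining β. (β* stands for the adjoint of β with respect to the Hilbert-Schmidt product. Recall that β is completely positive and trace preserving if and only if β* is completely positive and unital.) On the other hand the latter condition is equivalent to $\beta J_\rho \beta^* \le J_{\beta(\rho)}$ (29). We proved the following theorem in [36]: if γ_ρ satisfies these requirements, then there exists an operator monotone function f : R+ → R with f(t) = t f(t^{-1}) such that $\gamma_\rho(A, A) = \mathrm{Tr}\, A J_\rho^{-1}(A)$ and $J_\rho = R_\rho^{1/2} f(L_\rho R_\rho^{-1}) R_\rho^{1/2}$, where the linear transformations L_ρ and R_ρ acting on matrices are the left and right multiplications, that is, L_ρ(X) = ρX and R_ρ(X) = Xρ.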
The above γ_ρ(A, A) is formally a quasi-entropy, $S^{A\rho^{-1}}_{1/f}(\rho, \rho)$; however, this form is not suitable to show the monotonicity. Assume that ρ = Diag(λ_1, λ_2, ..., λ_n). Then $\gamma_\rho(A, A) = \sum_{i,j} \frac{1}{\lambda_j f(\lambda_i/\lambda_j)} |A_{ij}|^2$. It is clear from this formula that the Fisher information is affine in the function 1/f. Therefore, Hansen's canonical representation of the reciprocal of a standard operator monotone function can be used [19].
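Theorem 6 If f : R+ → R+ is a standard operator monotone function, then $\frac{1}{f(t)} = \int_0^1 \frac{1+\lambda}{2}\left(\frac{1}{t+\lambda} + \frac{1}{1+t\lambda}\right) d\mu(\lambda)$, where µ is a probability measure on [0, 1].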
The theorem implies that the set {1/f : f is standard operator monotone} is convex and gives the extremal points $g_\lambda(t) := \frac{1+\lambda}{2}\left(\frac{1}{t+\lambda} + \frac{1}{1+t\lambda}\right)$ (0 ≤ λ ≤ 1). One can compute directly that $\frac{\partial}{\partial\lambda} g_\lambda(t) = -\frac{(1-\lambda^2)(t+1)(t-1)^2}{2(t+\lambda)^2(1+t\lambda)^2}$. Hence g_λ is decreasing in the parameter λ. For λ = 0 we have the largest function g_0(t) = (t + 1)/(2t) and for λ = 1 the smallest is g_1(t) = 2/(t + 1). (Note that this was also obtained in the setting of positive operator means [26], harmonic and arithmetic means.)
Via the operator J_ρ, each monotone Fisher information determines a quantity $\varphi_\rho[A, A] := \mathrm{Tr}\, A J_\rho(A)$ (32) which is a quadratic cost functional. According to (29) (or Theorem 4) this possesses the monotonicity property $\varphi_\rho[\beta^*(A), \beta^*(A)] \le \varphi_{\beta(\rho)}[A, A]$. Since (28) and (29) are equivalent we observe a one-to-one correspondence between monotone Fisher informations and monotone quadratic cost functions.
(2) ϕ_ρ[A, A] is continuous in ρ for every fixed A,
For the purpose of a quantum Cramér-Rao inequality the minimal quantity seems to be the best, since the inverse gives the largest lower bound. In fact, the matrix L has been used for a long time under the name of symmetric logarithmic derivative, see [24] and [21]. In this example the quadratic cost function is $\varphi_\rho[A, B] = \frac12 \mathrm{Tr}\,\rho(AB + BA)$ and we have $J_\rho(B) = \frac12(\rho B + B\rho)$ and $J_\rho^{-1}(A) = 2\int_0^\infty e^{-t\rho} A e^{-t\rho}\,dt$ for the operator J of the previous section.
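The ordering of the extremal functions can be checked on a grid. This is a minimal sketch, assuming the extremal functions have the form g_λ(t) = ((1 + λ)/2)(1/(t + λ) + 1/(1 + tλ)), which agrees with the endpoint functions g_0 and g_1 quoted in the text; the grid values and variable names are arbitrary.

```python
def g(lam, t):
    """Assumed extremal function g_lam(t) = (1+lam)/2 * (1/(t+lam) + 1/(1+t*lam))."""
    return 0.5 * (1.0 + lam) * (1.0 / (t + lam) + 1.0 / (1.0 + t * lam))

ts = [0.1, 0.5, 2.0, 7.0]               # sample arguments t > 0, none equal to 1
lams = [k / 10 for k in range(11)]      # parameter grid 0, 0.1, ..., 1

# Endpoints: g_0(t) = (t+1)/(2t) and g_1(t) = 2/(t+1)
endpoints_ok = all(
    abs(g(0.0, t) - (t + 1) / (2 * t)) < 1e-12
    and abs(g(1.0, t) - 2 / (t + 1)) < 1e-12
    for t in ts
)

# Monotonicity: g_lam(t) does not increase when lam increases
decreasing_ok = all(
    g(lams[k], t) >= g(lams[k + 1], t)
    for t in ts
    for k in range(len(lams) - 1)
)
```

At t = 1 every g_λ takes the value 1, which reflects the normalization f(1) = 1 of standard functions.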
If Tr ρ 0 T = 0 and Tr ρ 0 T 2 = 1, then and T is a locally unbiased estimator (of the parameter θ at θ = 0). Since we have equality in the Cramér-Rao inequality, see (24).
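The dependence of the Fisher information on the function f can be made concrete for a diagonal density. The sketch below assumes the form γ_ρ(A, A) = Σ |A_ij|² / (λ_j f(λ_i/λ_j)) for ρ = Diag(λ_1, ..., λ_n), and compares f(t) = (1 + t)/2 (symmetric logarithmic derivative), the Kubo-Mori function (t − 1)/log t and f(t) = 2t/(1 + t); the matrix A and the eigenvalues are illustrative.

```python
import math

def fisher(lams, A, f):
    """gamma_rho(A, A) = sum_ij |A_ij|^2 / (lam_j * f(lam_i / lam_j)) for rho = Diag(lams)."""
    n = len(lams)
    return sum(
        abs(A[i][j]) ** 2 / (lams[j] * f(lams[i] / lams[j]))
        for i in range(n)
        for j in range(n)
    )

def arithmetic(t):
    # largest standard function, gives the smallest Fisher information
    return (1.0 + t) / 2.0

def kubo_mori(t):
    # (t - 1)/log t, continuously extended by the value 1 at t = 1
    return 1.0 if abs(t - 1.0) < 1e-12 else (t - 1.0) / math.log(t)

def harmonic(t):
    # smallest standard function, gives the largest Fisher information
    return 2.0 * t / (1.0 + t)

lams = [0.5, 0.3, 0.2]                   # eigenvalues of rho
A = [[0.1, 0.2, 0.0],                    # traceless self-adjoint tangent vector
     [0.2, -0.3, 0.1],
     [0.0, 0.1, 0.2]]

F_min = fisher(lams, A, arithmetic)
F_km = fisher(lams, A, kubo_mori)
F_max = fisher(lams, A, harmonic)
```

Only the off-diagonal entries of A are sensitive to the choice of f; for a diagonal A all three values agree with the classical Fisher information Σ A_ii²/λ_i.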
When A = i[ρ, B] is orthogonal to the commutator of the foot-point ρ in the tangent space, we have Apart from a constant factor this expression is the skew information proposed by Wigner and Yanase some time ago ([41]). In the limiting cases β → 0 or 1 we have f_0(x) = (x − 1)/log x and the corresponding Fisher information is named after Kubo, Mori, Bogoliubov etc. The Kubo-Mori inner product plays a role in quantum statistical mechanics (see [12], for example). In this case $J_\rho^{-1}(B) = \int_0^\infty (\rho + t)^{-1} B (\rho + t)^{-1}\,dt$ and $J_\rho(B) = \int_0^1 \rho^t B \rho^{1-t}\,dt$. Therefore the corresponding quadratic cost functional is $\varphi_\rho[A, B] = \int_0^1 \mathrm{Tr}\, A\rho^t B\rho^{1-t}\,dt$.

This holds if
In the Cramér-Rao inequality (22) the equality holds when $J_0(K) = \int_0^1 D^t K D^{1-t}\,dt$. Note that (44) is again an exponential family, the differential equation for D(θ) = exp(H + θT) has the form (37) with
Problem 2 It would be interesting to find more exponential families. This means solution of the differential equation $\frac{\partial}{\partial\theta} D(\theta) = J_{D(\theta)}(T)$, D(0) = ρ_0. If the self-adjoint T and the positive ρ_0 commute, then the solution is D(θ) = exp(θT)ρ_0. A concrete example is $\frac{\partial}{\partial\theta} D(\theta) = D(\theta)^{1/2}\, T\, D(\theta)^{1/2}$.

Manifolds of density matrices
Let M := {ρ(θ) : θ ∈ G} be a smooth m-dimensional manifold of invertible density matrices. When a quadratic cost function ϕ 0 is fixed, the corresponding Fisher information is a Riemannian metric on the manifold. This gives a possibility for geometric interpretation of statistical statements [1,2].
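Fisher information appears not only as a Riemannian metric but as an information matrix as well. The quantum score operators (or logarithmic derivatives) are defined as $L_i(\theta) := J_{\rho(\theta)}^{-1}(\partial_{\theta_i}\rho(\theta))$ (1 ≤ i ≤ m) and $Q_{ij}(\theta) := \mathrm{Tr}\, L_i(\theta) J_{\rho(\theta)}(L_j(\theta))$ (1 ≤ i, j ≤ m) is the quantum Fisher information matrix.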
The next result is the monotonicity of Fisher information matrix.
Assume that F_j are positive operators acting on a Hilbert space H_1 on which the family M := {ρ(θ) : θ ∈ G} is given. When $\sum_{j=1}^n F_j = I$, these operators determine a measurement. For any ρ(θ) the formula β(ρ(θ)) := Diag(Tr ρ(θ)F_1, ..., Tr ρ(θ)F_n) gives a diagonal density matrix. Since this family is commutative, all quantum Fisher informations coincide with the classical (23) and the classical Fisher information stands on the left-hand side of (47). The right-hand side can be an arbitrary quantum quantity, but it is minimal if it is based on the symmetric logarithmic derivative, see Example 6. This particular case of the Theorem is in the paper [5].
Assume that a manifold M := {ρ(θ) : θ ∈ G} of density matrices is given together with a statistically relevant Riemannian metric γ. Given two points on the manifold, their geodesic distance is interpreted as the statistical distinguishability of the two density matrices in some statistical procedure.
Let ρ_0 ∈ M be a point on our statistical manifold. The geodesic ball B_ε(ρ_0) := {ρ ∈ M : d(ρ_0, ρ) < ε} contains all density matrices which can be distinguished by an effort smaller than ε from the fixed density ρ_0. The size of the inference region B_ε(ρ_0) measures the statistical uncertainty at the density ρ_0. Following Jeffreys' rule the size is the volume measure determined by the statistical (or information) metric. More precisely, it is better to consider the asymptotics of the volume of B_ε(ρ_0) as ε → 0. It is known in differential geometry that $V(B_\varepsilon(\rho_0)) = C_m \varepsilon^m - \frac{C_m}{6(m+2)}\,\mathrm{Scal}(\rho_0)\,\varepsilon^{m+2} + o(\varepsilon^{m+2})$, where m is the dimension of our manifold, C_m is a constant (equal to the volume of the unit ball in the Euclidean m-space) and Scal means the scalar curvature, see [13, 3.98 Theorem]. In this way, the scalar curvature of a statistically relevant Riemannian metric might be interpreted as the average statistical uncertainty of the density matrix (in the given statistical manifold). This interpretation becomes particularly interesting for the full state space endowed with the Kubo-Mori inner product as a statistically relevant Riemannian metric.
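The volume asymptotics can be sanity-checked on a space where the ball volume is known in closed form. The sketch below assumes the expansion V(B_ε) ≈ C_m ε^m − C_m Scal ε^{m+2}/(6(m + 2)) and tests it on the unit 2-sphere, where Scal = 2 and the geodesic disc of radius ε has area 2π(1 − cos ε); the sphere is only a stand-in for a statistical manifold, and the function names are hypothetical.

```python
import math

def unit_ball_volume(m):
    """Volume C_m of the unit ball in Euclidean m-space."""
    return math.pi ** (m / 2) / math.gamma(m / 2 + 1)

def small_ball_volume(eps, m, scal):
    """Two-term expansion C_m eps^m - C_m * scal * eps^(m+2) / (6 (m+2))."""
    c_m = unit_ball_volume(m)
    return c_m * eps ** m - c_m * scal * eps ** (m + 2) / (6 * (m + 2))

def sphere_cap_area(eps):
    """Exact area of a geodesic disc of radius eps on the unit 2-sphere."""
    return 2.0 * math.pi * (1.0 - math.cos(eps))

eps = 0.1
exact = sphere_cap_area(eps)
flat = unit_ball_volume(2) * eps ** 2               # Euclidean leading term only
curved = small_ball_volume(eps, m=2, scal=2.0)      # with the curvature correction
```

Positive scalar curvature shrinks the small ball relative to the Euclidean one, so a larger curvature corresponds to a smaller inference region.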
The Kubo-Mori (or Bogoliubov) inner product is given by $\gamma_\rho(A, B) = \mathrm{Tr}\,(\partial_A \rho)(\partial_B \log\rho)$ or (41) in the affine parametrization. On the basis of numerical evidence it was conjectured in [35] that the scalar curvature, which is a statistical uncertainty, is monotone in the following sense. For any coarse-graining α the scalar curvature at a density ρ is smaller than at α(ρ). The average statistical uncertainty is increasing under coarse-graining. Up to now this conjecture has not been proven mathematically. Another form of the conjecture is the statement that along a curve of Gibbs states $e^{-\beta H}/\mathrm{Tr}\, e^{-\beta H}$ the scalar curvature changes monotonically with the inverse temperature β ≥ 0, that is, the scalar curvature is a monotone decreasing function of β. (Some partial results are in [4].)
Let M be the manifold of all invertible n × n density matrices. If we use the affine parametrization, then the tangent space T_ρ consists of the traceless self-adjoint matrices and has an orthogonal decomposition $T_\rho = \{i[\rho, B] : B = B^*\} \oplus \{A = A^* : \mathrm{Tr}\, A = 0,\ A\rho = \rho A\}$ (50). We denote the two subspaces by T^q_ρ and T^c_ρ, respectively. If A_2 ∈ T^c_ρ, then $\gamma_\rho(A_2, A_2) = \mathrm{Tr}\,\rho^{-1}A_2^2$ and $\varphi_\rho[A_2, A_2] = \mathrm{Tr}\,\rho A_2^2$ independently of the function f. Moreover, if A_1 ∈ T^q_ρ, then $\gamma_\rho(A_1, A_2) = \varphi_\rho[A_1, A_2] = 0$. Therefore, the decomposition (50) is orthogonal with respect to any Fisher information and any quadratic cost functional. Moreover, the effect of the function f and the really quantum situation are provided by the components from T^q_ρ.

Skew information
Let f be a standard function and X = X* ∈ M_n. The quantity $I^f_\rho(X) := \frac{f(0)}{2}\,\gamma_\rho(i[\rho, X], i[\rho, X])$ was called skew information in [19] in this general setting. The skew information is nothing else but the Fisher information restricted to T^q_ρ, but it is parametrized by the commutator.
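This implies that the identity $f(0)\,\gamma_\rho(i[\rho, X], i[\rho, X]) = 2\,\mathrm{Cov}_\rho(X, X) - 2\,\mathrm{qCov}^{\tilde f}_\rho(X, X)$ holds if Tr ρX = 0 and $\tilde f(x) := \frac12\left((x+1) - (x-1)^2\,\frac{f(0)}{f(x)}\right)$ (52). The following result was obtained in [14].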
Theorem 9 If f : R+ → R is a standard function, then $\tilde f$ is standard as well.
The original proof is not easy; even matrix convexity of functions of two variables is used. Here we sketch a rather elementary proof based on the fact that $1/f \mapsto \tilde f$ is linear and on the canonical decomposition in Theorem 6.
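Lemma 1 Let 0 ≤ λ ≤ 1 and f_λ : R+ → R be a function such that $\frac{1}{f_\lambda(x)} = \frac{1+\lambda}{2}\left(\frac{1}{x+\lambda} + \frac{1}{1+x\lambda}\right)$. Then the function $\tilde f_\lambda$ : R+ → R defined in (52) is an operator monotone standard function.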
The proof of the lemma is elementary. From the lemma and Theorem 6, Theorem 9 follows straightforwardly [40].
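The transformation f ↦ f̃ can be explored numerically for two familiar choices of f. The sketch below assumes that (52) reads f̃(x) = ((x + 1) − (x − 1)² f(0)/f(x))/2; under this assumption the arithmetic mean function (1 + x)/2 is sent to the harmonic mean function 2x/(1 + x), and the Wigner-Yanase function ((1 + √x)/2)² is sent to √x. The helper `tilde` and the grid are illustrative.

```python
import math

def tilde(f, f0):
    """Return x -> ((x + 1) - (x - 1)^2 * f0 / f(x)) / 2, where f0 stands for f(0)."""
    return lambda x: 0.5 * ((x + 1.0) - (x - 1.0) ** 2 * f0 / f(x))

def arithmetic(x):
    # f(x) = (1 + x)/2 with f(0) = 1/2
    return (1.0 + x) / 2.0

def wigner_yanase(x):
    # f(x) = ((1 + sqrt(x))/2)^2 with f(0) = 1/4
    return ((1.0 + math.sqrt(x)) / 2.0) ** 2

t_arith = tilde(arithmetic, 0.5)
t_wy = tilde(wigner_yanase, 0.25)

grid = [0.05, 0.5, 1.0, 2.0, 9.0]

matches_harmonic = all(abs(t_arith(x) - 2 * x / (1 + x)) < 1e-12 for x in grid)
matches_sqrt = all(abs(t_wy(x) - math.sqrt(x)) < 1e-12 for x in grid)

# Symmetry x * g(1/x) = g(x) and normalization g(1) = 1 of a standard function
symmetric = all(
    abs(t(x) - x * t(1.0 / x)) < 1e-12 for t in (t_arith, t_wy) for x in grid
)
normalized = all(abs(t(1.0) - 1.0) < 1e-12 for t in (t_arith, t_wy))
```

These checks cover only the symmetry and normalization of f̃ on a grid; operator monotonicity is the substantial part of Theorem 9 and is not tested here.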
The skew information is the Hessian of a quasi-entropy:
Theorem 10 Assume that X = X* ∈ M_n and Tr ρX = 0. If f is a standard function such that f(0) ≠ 0, then for the standard function $F = \tilde f$.
The proof is based on the formula see [40].
We know that

Von Neumann algebras
Let M be a von Neumann algebra. Assume that it is in standard form, it acts on a Hilbert space H, P ⊂ H is the positive cone and J : H → H is the modular conjugation. Let ϕ and ω be normal states with representing vectors Φ and Ω in the positive cone. For the sake of simplicity, assume that ϕ and ω are faithful. This means that Φ and Ω are cyclic and separating vectors. The closure of the unbounded operator AΦ → A * Ω has a polar decomposition J∆(ω/ϕ) 1/2 and ∆(ω/ϕ) is called relative modular operator. AΦ is in the domain of ∆(ω/ϕ) 1/2 for every A ∈ M.
Theorem 11 Assume that f : R + → R is an operator monotone function with f (0) ≥ 0 and α : M 0 → M is a Schwarz mapping. Then holds for A ∈ M 0 and for normal states ω and ϕ of the von Neumann algebra M.
The relative entropies are jointly convex in this setting similarly to the finite dimensional case. Now we shall concentrate on the generalized variance.

Generalized covariance
To deal with generalized covariance, we assume that f : R+ → R is a standard operator monotone (increasing) function. The natural extension of the covariance (from probability theory) is where ∆(ω/ω) is actually the modular operator. Although ∆(ω/ω) is unbounded, the definition works. For the function f, the inequality $\frac{2x}{x+1} \le f(x) \le \frac{1+x}{2}$ holds. Therefore AΩ is in the domain of f(∆(ω/ω)).
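The two-sided bound on standard functions is easy to test for concrete examples. This is a minimal sketch, assuming the bound reads 2x/(x + 1) ≤ f(x) ≤ (1 + x)/2; the three sample functions (square root, Kubo-Mori, Wigner-Yanase) are well-known standard operator monotone functions, and the grid is arbitrary.

```python
import math

def harmonic(x):
    return 2.0 * x / (x + 1.0)

def arithmetic(x):
    return (1.0 + x) / 2.0

def kubo_mori(x):
    # (x - 1)/log x, continuously extended by the value 1 at x = 1
    return 1.0 if abs(x - 1.0) < 1e-12 else (x - 1.0) / math.log(x)

def wigner_yanase(x):
    return ((1.0 + math.sqrt(x)) / 2.0) ** 2

standard = {"sqrt": math.sqrt, "kubo_mori": kubo_mori, "wigner_yanase": wigner_yanase}
grid = [0.01, 0.2, 1.0, 3.0, 50.0]

# Small tolerance guards against rounding at x = 1, where all functions equal 1
bounds_ok = all(
    harmonic(x) - 1e-12 <= f(x) <= arithmetic(x) + 1e-12
    for f in standard.values()
    for x in grid
)
```

The upper bound grows only linearly in x, which is the property used in the text to conclude that AΩ lies in the domain of f(∆(ω/ω)).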
For a standard function f : R + → R + and for a normal unital Schwarz mapping β : N → M the inequality is a particular case of Theorem 11 and it is the monotonicity of the generalized covariance under coarse-graining. The common symmetrized covariance

The Cramér-Rao Inequality
Let {ω θ : θ ∈ G} be a smooth m-dimensional manifold in the set of normal states of the von Neumann algebra M and assume that a collection A = (A 1 , . . . , A m ) of selfadjoint operators is used to estimate the true value of θ. The subspace spanned by A 1 , A 2 , . . . , A m is denoted by V . Concerning the proof we refer to [38].

Uncertainty relation
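In the von Neumann algebra setting the skew information (as a sesquilinear form) can be defined as if ω(X) = ω(Y) = 0. (Then $I^f_\omega(X) = I^f_\omega(X, X)$.)
Lemma 2 Let K be a Hilbert space with inner product ⟨·, ·⟩ and let ⟨⟨·, ·⟩⟩ be a sesquilinear form on K such that $0 \le \langle\langle f, f\rangle\rangle \le \langle f, f\rangle$ for every vector f ∈ K. Then $\det\bigl([\langle\langle f_i, f_j\rangle\rangle]_{i,j=1}^m\bigr) \le \det\bigl([\langle f_i, f_j\rangle]_{i,j=1}^m\bigr)$ holds for every f_1, f_2, ..., f_m ∈ K.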
by assumption. This says that G − H is positive semidefinite, hence it is clear that det G ≥ det H.
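The determinant comparison can be seen in a small real example. The sketch below assumes the lemma compares the Gram matrix G of an inner product with the Gram matrix H of a dominated positive form; here K = R², the inner product is the standard one, and the dominated form is ⟨⟨u, v⟩⟩ = Σ w_k u_k v_k with weights 0 ≤ w_k ≤ 1. All names and values are illustrative.

```python
def gram(vectors, weights):
    """Gram matrix of the form <<u, v>> = sum_k weights[k] * u[k] * v[k]."""
    return [
        [sum(w * a * b for w, a, b in zip(weights, u, v)) for v in vectors]
        for u in vectors
    ]

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

vectors = [(1.0, 2.0), (3.0, -1.0)]

G = gram(vectors, (1.0, 1.0))   # standard inner product
H = gram(vectors, (0.5, 0.9))   # dominated form: 0 <= <<f, f>> <= <f, f>

det_G = det2(G)
det_H = det2(H)
```

Both matrices are positive semidefinite and G − H is positive semidefinite as well, which is the situation in which the determinant inequality applies.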
Theorem 13 Assume that f, g : R + → R are standard functions and ω is a faithful normal state on a von Neumann algebra M. Let A 1 , A 2 , . . . , A m ∈ M be self-adjoint operators such that ω(A 1 ) = ω(A 2 ) = . . . = ω(A m ) = 0. Then the determinant inequality holds.
Proof: Let E( · ) be the spectral measure of ∆(ω/ω). Then for m = 1 the inequality is where dµ(λ) = d⟨AΩ, E(λ)AΩ⟩. Since the inequality holds for standard functions [16], we have and this implies the integral inequality.
Consider the finite dimensional subspace N generated by the operators A_1, A_2, ..., A_m. On N we have the inner products
This theorem is interpreted as a quantum uncertainty principle [3,15,14,25]. In the earlier works the function g from the left-hand side was (x + 1)/2 and the proofs were more complicated. The general g appeared in [16].