Measurement Uncertainty for Finite Quantum Observables

Measurement uncertainty relations are lower bounds on the errors of any approximate joint measurement of two or more quantum observables. The aim of this paper is to provide methods to compute optimal bounds of this type. The basic method is semidefinite programming, which we apply to arbitrary finite collections of projective observables on a finite dimensional Hilbert space. The quantification of errors is based on an arbitrary cost function, which assigns a penalty to getting result $x$ rather than y, for any pair (x,y). This induces a notion of optimal transport cost for a pair of probability distributions, and we include an appendix with a short summary of optimal transport theory as needed in our context. There are then different ways to form an overall figure of merit from the comparison of distributions. We consider three, which are related to different physical testing scenarios. The most thorough test compares the transport distances between the marginals of a joint measurement and the reference observables for every input state. Less demanding is a test just on the states for which a"true value"is known in the sense that the reference observable yields a definite outcome. Finally, we can measure a deviation as a single expectation value by comparing the two observables on the two parts of a maximally entangled state. All three error quantities have the property that they vanish if and only if the tested observable is equal to the reference. The theory is illustrated with some characteristic examples.


Introduction
Measurement uncertainty relations are quantitative expressions of complementarity. As Bohr often emphasized, the predictions of quantum theory are always relative to some definite experimental arrangement, and these settings often exclude each other. In particular, one has to make a choice of measuring devices, and typically quantum observables cannot be measured simultaneously. This often used term is actually misleading, because time has nothing to do with it. For a better formulation recall that quantum experiments are always statistical, so the predictions refer to the frequency with which one will see certain outcomes when the whole experiment is repeated very often. So the issue is not simultaneous measurement of two observables, but joint measurement in the same shot. That is, a device R is a joint measurement of observable A with outcomes x ∈ X and observable B with outcomes y ∈ Y , if it produces outcomes of the form (x, y) in such a way that if we ignore outcome y, the statistics of the x outcomes is always (i.e., for every input state) the same as obtained with a measurement of A, and symmetrically for ignoring x and comparing with B. It is in this sense that non-commuting projection valued observables fail to be jointly measurable.
However, this is not the end of the story. One is often interested in approximate joint measurements. One such instance is Heisenberg's famous γ-ray microscope [11], in which a particle's position is measured by probing it with light of some wavelength λ, which from the outset sets a scale for the accuracy of this position measurement. Naturally, the particle's momentum is changed by the Compton scattering, so if we make a momentum measurement on the particles after the interaction, we will find a different distribution from what would have been obtained directly. Note that in this experiment we get from every particle a position value and momentum value. Moreover, errors can be quantified by comparing the respective distributions with some ideal reference: The accuracy of the microscope position measurement is judged by the degree of agreement between the distribution obtained and the one an ideal position measurement would give. Similarly, the disturbance of momentum is judged by comparing a directly measured distribution with the one after the interaction. The same is true for the uncontrollable disturbance of momentum. This refers to a scenario, where we do not just measure momentum after the interaction, but try to build a device that recovers the momentum in an optimal way, by making an arbitrary measurement on the particle after the interaction, utilizing everything that is known about the microscope, correcting all known systematic errors, and even using the outcome of the position measurement. The only requirement is that at the end of the experiment, for each individual shot, some value of momentum must come out. Even then it is impossible to always reproduce the pre-microscope distribution of momentum. The tradeoff between accuracy and disturbance is quantified by a measurement uncertainty relation. Since it simply quantifies the impossibility of a joint exact measurement, it simultaneously gives bounds on how an approximate momentum measurement irretrievably disturbs position. The basic setup is shown in Fig. 1. Note that in this description of errors we did not ever bring in a comparison with some hypothetical "true value". Indeed it was noted already by Kennard [13] that such comparisons are problematic in quantum mechanics. Even if one is willing to feign hypotheses about the true value of position, as some hidden variable theorists will, an operational criterion for agreement will always have to be based on statistical criteria, i.e., the comparison of distributions. Another fundamental feature of this view of errors is that it provides a figure of merit for the comparison of two devices, typically some ideal reference observable and and an approximate version of it. An "accuracy" ε in this sense is a promise that no matter which input state is chosen, the distributions will not deviate by more than ε. Such a promise does not involve a particular state. This is in contrast to preparation uncertainty relations, which quantify the impossibility to find a state for which the distributions of two given observables (e.g., position and momentum) are both sharp.
Measurement uncertainty relations in the sense described here were first introduced for position and momentum in [23], and were initially largely ignored. A bit earlier, an attempt by Ozawa [15] to quantify error-disturbance tradeoffs with state dependent and somewhat unfortunately chosen [7] quantities had failed, partly for reasons already pointed out in [1]. When experiments confirmed some predictions of the Ozawa approach (including the failure of the error-disturbance tradeoff), a debate ensued [4,16,6,2]. Its unresolved part is whether a meaningful role for Ozawa's definitions can be found. Technically, the computation of measurement uncertainty for position and momentum in [6] carries over immediately to more general phase spaces [24,3]. Apart from some special further computed instances [8,5], this remained the only case in which sharp measurement uncertainty relations could be obtained. This was in stark contrast with preparation uncertainty, for which an algorithm based on solving ground state problems [8] efficiently provides the optimal relations for generic sets of observables. The main aim of the current paper is to provide efficient algorithms also for sharp measurement uncertainty relations.
In order to do that we restrict the setting in some ways, but allow maximal generality in others. We will restrict to finite dimensional systems, and reference observables which are projection valued and non-degenerate. Thus, each of the ideal observables will basically be given by an orthonormal basis in the same d-dimensional Hilbert space. The labels of this basis are the outcomes x ∈ X of the measurement, where X is a set of d elements. We could choose all X = {1, . . . , d}, but it will help to keep track of things using a separate set for each observable. Moreover, this includes the choice X ⊂ R, the set of eigenvalues of some hermitian operator. We allow not just two observables but any finite number n ≥ 2 of them. This is makes some expressions easier to write down, since the sum of an expression involving observable A and analogous one for observable B becomes an indexed sum. We also allow much generality in the way errors are quantified. In earlier works, we relied on two elements to be chosen for each observable, namely a metric D on the outcome set, and an error exponent α, distinguishing, say absolute (α = 1), root-mean-square (α = 2), and maximal (α = ∞) deviations. Deviations were then averages of D(x, y) α . Here we generalize further to an arbitrary cost function c : X × X → R, which we take to be positive, and zero exactly on the diagonal (e.g., c(x, y) = D(x, y) α ), but not necessarily symmetric. Again this generality comes mostly as a simplification of notation. For a reference observable A with outcome set X and an approximate version A with the same outcome set, this defines an error ε(A |A). Our aim is to provide algorithms for computing the uncertainty diagram associated with such data, of which Fig. 2 gives an example. The given data for such a diagram are n projection valued observables A 1 , . . . , A n , with outcome sets X i , for each of which we are given also a cost function c i : X i × X i → R for quantifying errors. An approximate joint measurement is then an observable R with outcome set × i X i , and hence with POVM elements R(x 1 , . . . , x n ), where x i ∈ X i . By ignoring every output but one we get the n marginal observables and a corresponding tuple of errors. The set U L of such tuples, as R runs over all joint measurements, is the uncertainty region.
The surface bounding this set from below describes the uncertainty tradeoffs. For n = 2 we call it the tradeoff curve. Measurement uncertainty is the phenomenon that, for general reference observables A i , the uncertainty region is bounded away from the origin. In principle there are many ways to express this mathematically, from a complete characterization of the exact tradeoff curve, which is usually hard to get, to bounds which are simpler to state, but suboptimal. Linear bounds will play a special role in this paper. We will consider three ways to build a single error quantity out of the comparison of distributions, denoted by ε M (A |A), ε C (A |A), and ε E (A |A). These will be defined in Sect. 2. For every choice of observables and cost functions, each will give an uncertainty region, denoted by U M , U C , and U E , respectively. Since the errors are all based on the same cost function c, they are directly comparable (see Fig. 2). We show in Sect. 3 that the three regions are convex, and hence characterized completely by linear bounds. In Sect. 4 we show how to calculate the optimal linear lower bounds by semidefinite programs. Finally, an Appendix collects the basic information on the beautiful theory of optimal transport, which is needed in Sects. 2.1 and 4.1.

Deviation measures for observables
Here we define the measures we use to quantify how well an observable A approximates a desired observable A. In this section we do not use the marginal condition (1), so A is an arbitrary observable with the same outcome set X as A, i.e., we drop all indices i identifying the different observables. Our error quantities are operational in the sense that each is motivated by an experimental setup, which will in particular provide a natural way to measure them. All error definitions are based on the same cost function c : X × X → R, where c(x, y) is the "cost" of getting a result x ∈ X, when y ∈ X would have been correct. The only assumptions are that c(x, y) ≥ 0 with c(x, y) = 0 iff x = y.
As described above, we consider a quantum system with Hilbert space C d . As a reference observable A we allow any complete von Neumann measurement on this system, that is, any observable whose the set X of possible measurement outcomes has size |X| = d and whose POVM ele-ments A(y) ∈ B(C d ) (y ∈ X) are mutually orthogonal projectors of rank 1; we can then also write A(y) = |φ y φ y | with an orthonormal basis {φ y } of C d . For the approximating observable A the POVM elements A (x) (with x ∈ X) are arbitrary with A (x) ≥ 0 and x∈X A(x) = 1.
The comparison will be based on a comparison of output distributions, for which we use the following notations: Given a quantum state ρ on this system, i.e., a density operator with ρ ≥ 0 and tr ρ = 1, and an observable such as A, we will denote the outcome distribution by ρA, so (ρA)(y) := tr(ρA y ). This is a probability distribution on the outcome set X and can be determined physically as the empirical outcome distribution after many experiments.
For comparing just two probability distributions p : X → R + and q : X → R + , a canonical choice is the "minimum transport cost" where the infimum runs over the set of all couplings, or "transport plans" γ : X × X → R + of p to q, i.e., the set of all probability distributions γ satisfying the marginal conditions y γ(x, y) = p(x) and x γ(x, y) = q(y). The motivations for this notion, and the methods to compute it efficiently are described in the Appendix. Since X is finite, the infimum is over a compact set, so it is always attained. Moreover, since we assumed c ≥ 0 and c(x, y) = 0 ⇔ x = y, we also haveč(p, q) ≥ 0 with equality iff p = q. If one of the distributions, say q, is concentrated on a point y, only one coupling exists, namely γ(x, y) = p(x)δ y y . In this case we abbreviateč(p, q) =č(p, y), and geť i.e., the average cost of moving all the points x distributed according to p to y.
which we call the maximal measurement error. Note that, like the cost function c and the transport costsč, the measure ε M (A |A) need not be symmetric in its arguments, which is sensible as the reference and approximating observables have distinct roles. Similar definitions for the deviation of an approximating measurement from an ideal one have been made, for specific cost functions, in [4,6] and [8] before.
The definition (5) makes sense even if the reference observable A is not a von Neumann measurement. Instead, the only requirement is that A and A be general observables with the same (finite) outcome set X, not necessarily of size d. All our results below that involve only the maximal measurement error immediately generalize to this case as well.
One can see that it is expensive to determine the quantity ε M (A |A) experimentally according to the definition: one would have to measure and compare the outcome statistics ρA and ρA for all possible input states ρ, which form a continuous set. The following definition of observable deviation alleviates this burden.  Calibration is a process by which one tests a measuring device on inputs (or measured objects) for which the "true value" is known. Even in quantum mechanics we can set this up by demanding that the measurement of the reference observable on the input state gives a sharp value y. In a general scenario with continuous outcomes this can only be asked with a finite error δ, which goes to zero at the end [4], but in the present finite scenario we can just demand (ρA)(y) = 1. Since, for every outcome y of a von Neumann measurement, there is only one state with this property (namely ρ = |φ y φ y |) we can simplify even further, and define the calibration error by

Calibration error ε C (A |A).
Note that the calibration idea only makes sense when there are sufficiently many states for which the reference observable has deterministic outcomes, i.e., for projective observables A.
A closely related quantity has recently been proposed by Appleby [2]. It is formulated for real valued quantities with cost function c(x, y) = (x − y) 2 , and has the virtue that it can be expressed entirely in terms of first and second moments of the probability distributions involved. So for any ρ, let m and v be the mean and variance of ρA, and v the mean quadratic deviation of ρA from m.
Here we added the square to make Appleby's quantity comparable to our variance-like (rather than standard deviation-like) quantities, and chose the letter D, because Appleby calls this the D-error.
Since in the supremum we have also the states for which A has a sharp distribution (i.e. v = 0), we clearly have ε D (A |A) ≥ ε C (A |A). On the other hand, let Φ(x) = t(x − m) 2 and Ψ(y) = t/(1−t)(y−m) 2 with some parameter t ∈ (−∞, 1). Then one easily checks that Φ(x)−Ψ(y) ≤ (x−y) 2 , so (Φ, Ψ) is a pricing scheme in the sense defined in the Appendix. Thereforě Maximizing this expression over t gives exactly (7). Therefore In quantum information theory a standard way of providing a reference state for later comparison is by applying a channel or observable to one half of a maximally entangled system. Two observables would be compared by measuring them (or suitable modifications) on the two parts of a maximally entangled system. Let us denote the entangled vector by Ω = d −1/2 k |kk . Since later we will look at several distinct reference observables, the basis kets |k in this expression have no special relation to A or its eigenbasis φ y . We denote by X T the transpose of an operator X in the |k basis, and by A T the observable with POVM elements A(y) T = |φ y φ y |, where φ y is the complex conjugate of φ y in |k -basis. These transposes are needed due to the well-known relation (X ⊗ 1)Ω = (1 ⊗ X T )Ω. We now consider an experiment, in which A is measured on the first part and A T on the second part of the entangled system, so we get the outcome pair (x, y) with probability As A is a complete von Neumann measurement, this probability distribution is concentrated on the diagonal (x = y) iff A = A, i.e., there are no errors of A relative to A. Averaging with the error costs we get a quantity we call the entangled reference error Note that this quantity is measured as a single expectation value in the experiment with source Ω. Moreover, when we later want to measure different such deviations for the various marginals, the source and the tested joint measurement device can be kept fixed, and only the various reference observables A T i acting on the second part need to be adapted suitably.

Summary and comparison
The quantities ε M (A |A), ε C (A |A) and ε E (A |A) constitute three different ways to quantify the deviation of an observable A from a projective reference observable A. Nevertheless, they are all based on the same distance-like measure, the cost function c on the outcome set X. Therefore it makes sense to compare them quantitatively. Indeed, they are ordered as follows: Here the first inequality follows by restricting the supremum (5) to states which are sharp for A, and the second by noting the (6) is the maximum of a function of y, of which (10) is the average.
Moreover, as we argued before Eq. (10), ε E (A |A) = 0 if and only if A = A , which is hence equivalent also to ε M (A |A) = 0 and ε C (A |A) = 0.

Convexity of uncertainty diagrams
In this section we will consider tuples (A 1 , . . . , A n ) of projection valued non-degenerated observables, as described in the introduction. We will collect some basic properties of the uncertainty regions U L , where L ∈ {M, C, E}, that is, For two observables B 1 and B 2 with the same outcome set we can easily realize their mixture, or convex combination B = tB 1 + (1 − t)B 2 by flipping a coin with probability t for heads in each instance and then apply B 1 when heads is up and B 2 otherwise. In terms of POVM elements this reads . We show first that this mixing operation does not increase the error quantities from Sect. 2.
Proof. The basic fact used here is that the pointwise supremum of affine functions (i.e., those for which equality holds in the definition of a convex function) is convex. This is geometrically obvious, and easily verified from the definitions. Hence we only have to check that each of the error quantities is indeed represented as a supremum of functions, which are affine in the observable B.
For L = E we even get an affine function, because (10) is linear in A . For L = C equation (6) has the required form. For L = M the definition (5) is as a supremum, but the functionč is defined as an infimum. However, we can use the duality theory described in the Appendix (e.g. in (49)) to write it instead as a supremum over pricing schemes, of an expression which is just the expectation of Φ(x) plus a constant, and therefore an affine function. Finally, for Appleby's case (7), we get the same supremum, but over a subset of pricing schemes (the quadratic ones, see below (7)).
The convexity of the error quantities distinguishes measurement from preparation uncertainty. Indeed, the variances appearing in preparation uncertainty relations are typically concave functions, because they arise from minimizing the expectation of (x−m) 2 over m. Consequently, the preparation uncertainty regions may have gaps, and non-trivial behaviour on the side of large variances. The following proposition will show that measurement uncertainty regions are better behaved.
For every cost function c on a set X we can define a "radius" c * , the largest transportation cost from the uniform distribution (the "center" of the set of probability distributions) and a "diameter" c * , the largest transportation cost between any two distributions: Proposition 2. Let n observables A i and cost functions c i be given, and define c M Then, for L ∈ {M, C, E}, the uncertainty regions U L is a convex set and has the following (monotonicity) property: When x = (x 1 , . . . , x n ) ∈ U L and y = (y i , . . . , y n ) ∈ R n such that Proof. Let us first clarify how to make the worst possible measurement B, according to the various error criteria, for which we go back to the setting of Sect. 2, with just one observable A, and cost function c. In all cases, the worst measurement is one with constant and deterministic output, i.e., For L = C and L = M such a measurement will have ε L (B|A) = max y c(x * , y), and we can choose x * to make this equal to c * = c L . For L = E we get instead the average, which is maximized by c * .
We can now make a given joint measurement R worse by replacing it partly by a bad one, say for the first observable A 1 . That is, we set, for λ ∈ [0, 1], Then all marginals A i for i = 1 are unchanged, but . Now as λ changes from 0 to 1, the point in the uncertainty diagram will move continuously in the first coordinate direction from x to the point in which the first coordinate is replaced by its maximum value (see Fig. 6(left)). Obviously, the same holds for every other coordinate direction, which proves the monotonicity statement of the proposition. Figure 6: The blue shaded region corresponds to the monotonicity statement for ε L (R). (left) R is a mixture of R and B 1 . We can also get an observable V by mixing the second marginal of R with B 2 and thus reach every point in the blue shaded region. (right) ε L (R) is componentwise convex. So the mixture of the points ε L (R 1 ) and ε L (R 2 ) is always in the monotonicity region corresponding to ε L (R).
Let R 1 and R 2 be two observables, and let R = λR 1 + (1 − λ)R 2 be their mixture. For proving the convexity of U L we will have to show that every point on the line between ε L (R 1 ) and ε L (R 2 ) can be attained by a tuple of errors corresponding to some allowed observable (see Fig. 6 (right)). Now lemma 1 tells us that every component of ε L (R) is convex, which implies that ε L (R) ≤ λ ε L (R 1 ) + (1 − λ) ε L (R 2 ). But, by monotonicity, this also means that λ ε L (R 1 ) + (1 − λ) ε(R 2 ) is in U L again, which shows the convexity of U L .

Example: Phase space pairs
As is plainly visible from Fig. 2, the three error criteria considered here usually give different results. However, under suitable circumstances they all coincide. This is the case for conjugate pairs related by Fourier transform [24]. The techniques needed to show this are the same as for the standard position/momentum case [6,22], and in addition imply that the region for preparation uncertainty is also the same.
In the finite case there is not much to choose: We have to start from a finite abelian group, which we think of as position space, and its dual group, which is then the analogue of momentum space. The unitary connecting the two observables is the finite Fourier associated with the group. The cost function needs to be translation invariant, i.e., c(x, y) = c(x − y). Then, by an averaging argument, we find for all error measures that a covariant phase space observable minimizes measurement uncertainty (all three versions). The marginals of such an observable can be simulated by first doing the corresponding reference measurement, and then adding some random noise. This implies [8] that ε M (A |A) = ε C (A |A). But we know more about this noise: It is independent of the input state so that the average and the maximum of the noise (as a function of the input) coincide, i.e., ε C (A |A) = ε E (A |A). Finally, we know that the noise of the position marginal is distributed according to the position distribution of a certain quantum state which is, up to normalization and a unitary parity inversion, the POVM element of the covariant phase space observable at the origin. The same holds for the momentum noise. But then the two noise quantities are exactly related like the position and momentum distributions of a state, and the tradeoff curve for that problem is exactly preparation uncertainty, with variance criteria based on the same cost function. In this case all uncertainty regions, also the one for preparation uncertainty, coincide. The parameter of the above tradeoff curves is the order d = 2, 3, . . . , 10, · · · , ∞ of the underlying abelian group.
If we choose the discrete metric for c, the uncertainty region depends only on the number d of elements in the group we started from [24]. The largest ε for all quantities is the distance from a maximally mixed state to any pure state, which is ∆ = (1 − 1/d). The exact tradeoff curve is then an ellipse, touching the axes at the points (0, ∆) and (∆, 0). The resulting family of curves, parameterized by d, is shown in Fig. 7. In general, however, the tradeoff curve requires the solution of a non-trivial family of ground state problems, and cannot be given in closed form. For bit strings of length n, and the cost some convex function of Hamming distance there is an expression for large n [24].

Computing uncertainty regions via semidefinite programming
We show here how the uncertainty regions -and therefore optimal uncertainty relations -corresponding to each of the three error measures can actually be computed, for any given set of projective observables A 1 , . . . , A n and cost functions c 1 , . . . , c n . Our algorithms will come in the form of semidefinite programs (SDPs) [20,19], facilitating efficient numerical computation of the uncertainty regions via the many existing program packages to solve SDPs. Moreover, the accuracy of such numerical results can be rigorously certified via the duality theory of SDPs. To obtain the illustrations in this paper we used the CVX package [10,9] under MATLAB.
As all our uncertainty regions U L ⊂ R n (for L = M, C, E) are convex and closed (Sect. 3), they are completely characterized by their supporting hyperplanes (for a reference to convex geometry see [17]). Due to the monotonicity property stated in Prop. 2 some of these hyperplanes just cut off the set parallel along the planes x i = c L i . The only hyperplanes of interest are thus those with nonnegative normal vectors w = (w 1 , . . . , w n ) ∈ R n + (see Fig. 8). Each hyperplane is completely specified by its "offset" b L ( w) away from the origin, and this function determines In fact, due to homogeneity b L (t w) = t b L ( w) we can restrict everywhere to the subset of vectors w ∈ R n + that, for example, satisfy i w i = 1, suggesting an interpretation of the w i as weights of the different uncertainties ε i . Our algorithms will, besides evaluating b L ( w), also allow to compute an (approximate) minimizer ε, so that one can plot the boundary of the uncertainty region U L by sampling over w, which is how the figures in this paper were obtained. Figure 8: The lower bound of the uncertainty region U L can be described by its supporting hyperplanes (red line) with a normal vector w ∈ R n + .
Let us further note that knowledge of b L ( w) for some w ∈ R n + immediately yields a quantitative uncertainty relation: every error tuple ε ∈ U L attainable via a joint measurement is constrained by the affine inequality w · ε ≥ b L ( w), meaning that some weighted average of the attainable error quantities ε i cannot become too small. When b L ( w) > 0 is strictly positive, this excludes in particular the zero error point ε = 0. The obtained uncertainty relations are optimal in the sense that there exists ε ∈ U L which attains strict equality w · ε = b L ( w).
Having reduced the computation of an uncertainty region essentially to determining b L ( w) (possibly along with an optimizer ε), we now treat each case L = M, C, E in turn.

Computing the uncertainty region U M
On the face of it, the computation of the offset b M ( w) looks daunting: expanding the definitions we obtain where the infimum runs over all joint measurements R with outcome set X 1 × . . . × X n , inducing the marginal observables A i = A i (R) according to (1), and the supremum over all sets of n quantum states ρ 1 , . . . , ρ n , and where the transport costsč i (p, q) are given as a further infimum (3) over the couplings γ i of p = ρA i and q = ρA i .
The first simplification is to replace the infimum over each coupling γ i , via a dual representation of the transport costs, by a maximum over optimal pricing schemes (Φ α , Ψ α ), which are certain pairs of functions Φ α , Ψ α : X i → R, where α runs over some finite label set S i . The characterization and computation of the pairs (Φ α , Ψ α ), which depend only on the chosen cost function c i on X i , is described in the Appendix. The simplified expression for the optimal transport costs is theň We can then continue our computation of b M ( w): where λ max (B i,α ) denotes the maximum eigenvalue of a Hermitian operator B i,α . Note that λ max (B i,α ) = inf{µ i | B i,α ≤ µ i 1}, which one can also recognize as the dual formulation of the convex optimization sup ρ tr(ρB i,α ) over density matrices, so that We obtain thus a single constrained minimization: Making the constraints on the POVM elements R(x 1 , . . . , x n ) of the joint observable R explicit and expressing the maginal observables A i = A i (R) directly in terms of them by (1), we finally obtain the following SDP representation for the quantity b M ( w): b M ( w) = inf i w i µ i with real variables µ i and d × d-matrix variables R(x 1 , . . . , x n ) subject to . . . , x n ) ≥ 0 ∀x 1 , . . . , x n x1,...,xn R(x 1 , . . . , x n ) = 1. (25) The derivation above shows further that, when w i > 0, the µ i attaining the infimum equals µ i = sup ρči (ρA i , ρA i ) = ε M (A i |A i ), where A i is the marginal coming from a corresponding optimal joint measurement R(x i , . . . , x n ). Since numerical SDP solvers usually output an (approximate) optimal variable assignment, one obtains in this way directly a boundary point ε = (µ 1 , . . . , µ n ) of U M when all w i are strictly positive. If w i = 0 vanishes, a corresponding boundary point ε can be computed via . . . , x n ) − y Ψ α (y)A(y)) from an optimal assignment for the POVM elements R(x 1 , . . . , x n ).
For completeness we also display the corresponding dual program [20] (note that strong duality holds, and the optima of both the primal and the dual problem are attained): (26)

Computing the uncertainty region U C
To compute the offset function b C ( w) for the calibration uncertainty region U C we use the last form in (6) and recall that the projectors onto the sharp eigenstates of A i (see Sect.

2.2) are exactly the POVM elements
where again the infimum runs over all joint measurements R, inducing the marginals A i , and we have turned, for each i = 1, . . . , n, the maximum over y into a linear optimization over probabilities λ i,y ≥ 0 (y = 1, . . . , d) subject to the normalization constraint y λ i,y = 1. In the last step, we have made the A i explicit via (1).
The first main step towards a tractable form is von Neumann's minimax theorem [14,18]: As the sets of joint measurements R and of probabilities {λ i,y } are both convex and the optimization function is an affine function of R and, separately, also an affine function of the {λ i,y }, we can interchange the infimum and the supremum: The second main step is to use SDP duality [19] to turn the constrained infimum over R into a supremum, abbreviating the POVM elements as R(x 1 , . . . , x n ) = R ξ : which is very similar to a dual formulation often employed in optimal ambiguous state discrimination [12,25].
Putting everything together, we arrive at the following SDP representation for the offset quantity The dual SDP program reads (again, strong duality holds, and both optima are attained):  To compute a boundary point ε of U C lying on the supporting hyperplane with normal vector w, it is best to solve the dual SDP (33) and obtain ε = (m 1 , . . . , m n ) from an (approximate) optimal assignment of the m i . Again, this works when w i > 0, whereas otherwise one can compute ε i = max y x1,...,xn tr[R(x 1 , . . . , x n )A i (y)]c i (x i , y) from an optimal assingment of the R(x 1 , . . . , x n ). From many primal-dual numerical SDP solvers (such as CVX [10,9]), one can alternatively obtain optimal POVM elements R(x 1 , . . . , x n ) also from solving the primal SDP (32) as optimal dual variables corresponding to the constraints Y ≤ . . ., and compute ε from there.

Computing the uncertainty region U E
As one can see by comparing the last expressions in the defining equations (6) and (10), respectively, the evaluation of b E ( w) is quite similar to (27), except that the maximum over y is replaced by a uniform average over y. This simply corresponds to fixing λ i,y = 1/d for all i, y in Eq. (29), instead of taking the supremum. Therefore, the primal and dual SDPs for The computation of a corresponding boundary point ε ∈ U E is similar as above.
In this appendix we collect the basic theory of optimal transport adapted to the finite setting at hand. This eliminates all the topological and measure theoretic fine points that can be found, e.g., in Villani's book [21], which we also recommend for extended proofs of the statements in our summary. We slightly generalize the setting from the cost functions used in the main text of this paper: We allow the two variables on which the cost function depends to range over different sets. This might actually be useful for comparing observables, which then need not have the same outcome sets. Which outcomes are considered to be close or the same must be specified in terms of the cost function. We introduce this generalization here less for the sake of applications rather than for a simplification of the proofs, in particular for the book-keeping of paths in the proof of Lemma 4.
The basic setting is that of two finite sets X and Y , and a arbitrary function c : X × Y → R, called the cost function. The task is to optimize the transport of some distribution of stuff on X, described by a distribution function p : X → R + , to a final distribution q : Y → R + on Y when the transportation of one unit of stuff from the point x to the point y costs c(x, y). In the first such scenario ever considered, namely by Gaspar Monge, the "stuff" was earth, the distribution p a hill, and q a fortress. Villani [21] likes to phrase the scenario in terms of bread produced at bakeries x ∈ X to be delivered to cafés y ∈ Y . This makes plain that optimal transport is sometimes considered a branch of mathematical economics, and indeed Leonid Kantorovich, who created much of the theory, received a Nobel prize in economics. In our case the "stuff" will be probability.
A transport plan (or coupling) will be a probability distribution γ : X × Y → R + , which encodes how much stuff is moved from any x to any y. Since all of p is to be moved, y γ(x, y) = p(x), and since all stuff is to be delivered, x γ(x, y) = q(y). Now, for any transport plan γ we get a total cost of x,y γ(x, y)c(x, y), and we are interested in the optimum This is called the primal problem, to which there is also a dual problem. In economic language it concerns pricing schemes, that is, pairs of functions Φ : X → R and Ψ : Y → R satisfying the inequality and demands to maximizê In Villani's example [21], think of a consortium of bakeries and cafés, that used to organize the transport themselves according to some plan γ. Now they are thinking of hiring a contractor, which offers to do the job, charging Φ(x) for every unit picked up from bakery x, and giving Ψ(y) to café y on delivery (these numbers can be negative). Their offer is that this will reduce overall costs, since their pricing scheme satisfies (37). Indeed, the overall charge to the consortium will be Taking the sup on the left hand side of this inequality (the company will try to maximize their profits by adjusting the pricing scheme (Φ, Ψ)) and the inf on the right hand side (the transport plan γ was already optimized), we getĉ(p, q) ≤č(p, q). It can be shown via the general duality theory of linear programming [20] that the duality gap closes in this case, i.e., we actually always havê So the consortium will face the same transport costs in the end if the contractor chooses an optimal pricing scheme. (Note that both the infimum and the supremum in the definitions ofč andĉ, respectively, are attained as X and Y are finite sets.) What is especially interesting for us, however, is that the structure of the optimal solutions for both variational problems is very special, and both problems can be reduced to a combinatorial optimization over finitely many possibilities, which furthermore can be constructed independently of p and q. Indeed, pricing schemes and transport plans are both related to certain subsets of X × Y . We define S(γ) ⊆ X × Y as the support of γ, i.e., the set of pairs on which γ(x, y) > 0. For a pricing scheme (Φ, Ψ) we define the equality set E(Φ, Ψ) as the set of points (x, y) for which equality holds in (37). Then equality holds in (39) if and only if S(γ) ⊂ E(Φ, Ψ). Note that for γ to satisfy the marginal condition for given p and q, its support S(γ) cannot become too small (depending on p and q). On the other hand, E(Φ, Ψ) cannot be too large, because the resulting system of equations for Φ(x) and Ψ(y) would become overdetermined and inconsistent. The kind of set for which they meet is described in the following Definition. Definition 1. Let X, Y be finite sets and c : X × Y → R a function. Then a subset Γ ⊂ X × Y is called cyclically c-monotone ("ccm" for short), if for any sequence of distinct pairs (x 1 , y 1 ) ∈ Γ, . . . , (x n , y n ) ∈ Γ, and any permutation π of {1, . . . , n} the inequality holds. When Γ is not properly contained in another cyclically c-monotone set, it is called maximally cyclically c-monotone ("mccm" for short).
A basic example of a ccm set is the equality set E(Φ, Ψ) for any pricing scheme (Φ, Ψ). Indeed, for (x i , y i ) ∈ E(Φ, Ψ) and any permutation π we have The role of ccm sets in the variational problems (36) and (38) is summarized in the following proposition.
Proposition 3. Let X, Y, c, p, q be given as above. Then (1) A coupling γ minimizes (36) if and only if S(γ) is ccm.

Sketch of proof.
(1) Suppose (x i , y i ) ∈ S(γ) (i = 1, . . . , n), and let π be any permutation. Set δ = min i γ(x i , y i ). Then we can modify γ by subtracting δ from any γ(x i , y i ) and adding δ to γ(x i , y πi ). This operation keeps γ ≥ 0 and does not change the marginals. The target functional in the infimum (36) is changed by δ times the difference of the two sides of (41). For a minimizer γ this change must be ≥ 0, which gives inequality (41). For the converse we need a Lemma, whose proof will be sketched below.
By applying this to Γ = S(γ) we find that the duality gap closes for γ, i.e., equality holds in (39), and hence γ is a minimizer.
(2) Every subset Γ ⊂ X ×Y can be thought of as a bipartite graph with vertices X ∪Y and an edge joining x ∈ X and y ∈ Y iff (x, y) ∈ Γ (see Fig. 9). We call Γ connected, if any two vertices are linked by a sequence of edges. Consider now the equality set E(Φ, Ψ) of some pricing scheme. We modify (Φ, Ψ) by picking some connected component, and setting Φ (x) = Φ(x) + a and Ψ (y) = Ψ(y) + a for all x, y in that component. If |a| is sufficiently small, (Φ , Ψ ) will still satisfy all the inequalities (37), and E(Φ , Ψ ) = E(Φ, Ψ). The target functional in the optimization (38) depends linearly on a, so moving in the appropriate direction will increase, or at least not decrease it. We can continue until another one of the inequalities (37) becomes tight. At this point E(Φ , Ψ ) E(Φ, Ψ). This process can be continued until the equality set E(Φ, Ψ) is connected. Then (Φ, Ψ) is uniquely determined by E(Φ, Ψ) up to a common constant. It remains to show that connected equality sets E(Φ, Ψ) are mccm. Suppose that Γ ⊇ E(Φ, Ψ) is ccm. Then by Lemma 4 we can find a pricing scheme (Φ , Ψ ) with E(Φ , Ψ ) ⊇ E(Φ, Ψ). But using just the equalities in (37) coming from the connected E(Φ, Ψ), we already find that Φ = Φ + a and Ψ = Ψ + a, so we must have E(Φ , Ψ ) = E(Φ, Ψ).
(3) is trivial from the proof of (2) that mccm sets are connected.
Proof sketch of Lemma 4. Our proof will give some additional information on the set of all pricing schemes that satisfy E(Φ, Ψ) ⊃ Γ and Φ(x 0 ) = 0 for some reference point x 0 ∈ X to fix the otherwise arbitrary additive constant. Namely we will explicitly construct the largest element (Φ + , Ψ + ) of this set and the smallest (Φ − , Ψ − ), so that all other schemes (Φ, Ψ) satisfy for all x ∈ X and y ∈ Y . The idea is to optimize the sums of certain costs over paths in X ∪ Y .
We define a Γ-adapted path as a sequence of vertices z 1 , . . . , z n ∈ X ∪ Y such that the z i ∈ X ⇒ (z i , z i+1 ) ∈ Γ, and z i ∈ Y ⇒ z i+1 ∈ X. For such a path we define with the convention c(y, x) := −c(x, y) for x ∈ X, y ∈ Y . Then Γ is ccm if and only if c(z 1 , . . . , z n , z 1 ) ≤ 0 for every Γ-adapted closed path. This is immediate for cyclic permutations, and follows for more general ones by cycle decomposition. The assertion of Lemma 4 is trivial if Γ = ∅, so we can pick a point x 0 ∈ X for which some edge (x 0 , y) ∈ Γ exists. Then, for any z ∈ X ∪ Y , we define, for z = x 0 , where the suprema are over all Γ-adapted paths between the specified endpoints, we define χ + (x 0 ) := χ − (x 0 ) := 0, and empty suprema are defined as −∞. Then χ ± are the maximal and minimal pricing schemes, when written as two functions Φ ± (x) = χ ± (x) and Ψ ± (y) = χ ± (y) for x ∈ X and y ∈ Y .
Here the inequality follows because the adapted paths x 0 → x going via y as the last step are a subclass of all adapted paths and give a smaller supremum. The second statement follows, because for x = x 0 there has to be some last step from Y to x. The inequality (46) also shows that (Φ + , Ψ + ) is a pricing scheme. The same argument applied to the decomposition of paths (x 0 , . . . , x, y) with (x, y) ∈ Γ gives the inequality − χ + (x) + c(x, y) ≤ −χ + (y) for (x, y) ∈ Γ.
Let us summarize the consequences of Proposition 3 for the computation of minimal costs (36). Given any cost function c, the first step is to enumerate the corresponding mccm sets, say Γ α , α ∈ S, for some finite label set S, and to compute for each of these the pricing scheme (Φ α , Ψ α ) (up to an overall additive constant, see Proposition 3). This step depends only on the chosen cost function c. Then, for any distributions p, q we get This is very fast to compute, so the preparatory work of determining the (Φ α , Ψ α ) is well invested if many such expressions have to be computed. However, even more important for us that (49) simplifies the variational problem sufficiently so that we can combine it with the optimization over joint measurements (see Sect. 4.1). Of course, this leaves open the question of how to determine all mccm sets for a cost function. Some remarks about this will be collected in the next subsection.

A.2 How to find all mccm sets
We will begin with a basic algorithm for the general finite setting, in which X, Y , and the cost function c are arbitrary. Often the task can be greatly simplified if more structure is given. These simplifications will be described in the following sections.
The basic algorithm will be a growth process for ccm subsets Γ ⊆ X × Y , which stops as soon as Γ is connected (cf. the proof of Proposition 3(2)). After that, we can compute the unique pricing scheme (Φ, Ψ) with equality on Γ by solving the system of linear equations with (x, y) ∈ Γ from (37). This scheme may have additional equality pairs extending Γ to an mccm set. Hence, the same (Φ, Ψ) and mccm sets may arise from another route of the growth process. Nevertheless, we can stop the growth when Γ is connected, and eliminate doubles as a last step of the algorithm. The main part of the algorithm will thus aim at finding all connected ccm trees, where by definition a tree is a graph containing no cycles. We take each tree to be given by a list of edges (x 1 , y 1 ), . . . (x N , y N ), which we take to be written in lexicographic ordering, relative to some arbitrary numberings X = {1, . . . , |X|} and Y = {1, . . . , |Y |}. Hence the first element in the list will be (1, y), where y is the first element connected to 1 ∈ X.
At stage k of the algorithm we will have a list of all possible initial sequences (x 1 , y 1 ), . . . (x k , y k ) of lexicographically ordered ccm trees. For each such sequence the possible next elements will be determined, and all the resulting edge-lists of length k + 1 form the next stage of the algorithm. Now suppose we have some list (x 1 , y 1 ), . . . (x k , y k ). What can the next pair (x , y ) be? There are two possibilities: (a) x = x k is unchanged. Then lexicographic ordering dictates that y > y k . Suppose that y is already connected to some x < x k . Then adding the edge (x k , y ) would imply that y could be reached in two different ways from the starting node (x = 1). Since we are looking only for trees, we must therefore restrict to only those y > y k which are yet unconnected.
(b) x is incremented. Since in the end all vertices x must lie in one connected component, the next one has to be x = x k + 1. Since the graphs at any stage should be connected, y must be a previously connected Y -vertex.
With each new addition we also check the ccm property of the resulting graph. The best way to do this is to store with any graph the functions Φ, Ψ on the set of already connected nodes (starting from Φ(1) = 0), and update them with any growth step. We then only have to verify inequality (37) for every new node paired with every old one. Since the equality set of any pricing scheme is ccm, this is sufficient. The algorithm will stop as soon as all nodes are included, i.e., after |X| + |Y | − 1 steps.

A.3 The linearly ordered case
When we look at standard quantum observables, given by a Hermitian operator A, the outcomes are understood to be the eigenvalues of A, i.e., real numbers. Moreover, we typically look at cost functions which depend on the difference (x − y) of two eigenvalues, i.e., c(x, y) = h x − y .
For the Wasserstein distances one uses h(t) = |t| α with α ≥ 1. The following Lemma allows, in addition, arbitrary convex, not necessarily even functions h.
As a consequence, if Γ is a ccm set for the cost function c and (x 1 , y 1 ) ∈ Γ, then all (x, y) ∈ Γ satisfy either x ≤ x 1 and y ≤ y 1 or x ≥ x 1 and y ≥ y 1 . Loosely speaking, while in Γ, one can only move north-east or south-west, but never north-west or south-east.
This has immediate consequences for ccm sets: In each step in the lexicographically ordered list (see the algorithm in the previous subsection) one either has to increase x by one or increase y by one, going from (1, 1) to the maximum. This is a simple drive on the Manhattan grid, and is parameterized by the instructions on whether to go north or east in every step. Of the |X| + |Y | − 2 necessary steps, |X| − 1 have to go in the east direction, so altogether we will have at most r = |X| + |Y | − 2 |X| − 1 (52) mccm sets and pricing schemes. They are quickly enumerated without going through the full tree search described in the previous subsection.

A.4 The metric case
Another case in which a little bit more can be said is the following [21, Case 5.4, p.56]: Lemma 6. Let X = Y , and consider a cost function c(x, y) which is a metric on X. Then: (1) Optimal pricing schemes satisfy Φ = Ψ, and the Lipshitz condition |Φ(x) − Φ(y)| ≤ c(x, y).
One even more special case is that of the discrete metric, c(x, y) = 1 − δ xy . In this case it makes no sense to look at error exponents, because c(x, y) α = c(x, y). Moreover, the Lipshitz condition |Φ(x) − Φ(y)| ≤ c(x, y) is vacuous for x = y, and otherwise only asserts that Φ(x) − Φ(y) ≤ 1, which after adjustment of a constant just means that |Φ(x)| ≤ 1/2 for all x. Hence the transportation cost is just the 1 norm up to a factor, i.e.,