The Reconciliation of Multiple Conflicting Estimates: Entropy-Based and Axiomatic Approaches

When working with economic accounts, it may occur that multiple estimates of a single datum exist, with different degrees of uncertainty or data quality. This paper addresses the problem of defining a method that can reconcile conflicting estimates, given best-guess and uncertainty values. We proceed from first principles, via two different routes. First, under an entropy-based approach, the data reconciliation problem is addressed as a particular case of a wider data balancing problem, and an alternative setting is found in which the multiple estimates are replaced by a single one. Second, under an axiomatic approach, a set of properties is defined that characterizes the ideal data reconciliation method. Under both approaches, the conclusion is that the formula for the reconciliation of best guesses is a weighted arithmetic average, with the inverses of the uncertainties as weights, and that the formula for the reconciliation of uncertainties is a harmonic average.


Introduction
With improvements in information technology, the world has become more unified and interconnected. Information is now typically shared quickly and easily from all over the globe, such that barriers formed by linguistic and geographic boundaries essentially have been torn down. This has enabled people from disparate cultures and backgrounds to share ideas and information. One outcome of this regime change has been a boosting of the perceived benefits of statistical information. While some benefits of such statistical information have been known since at least Quetelet's (1835) tome on so-called "social physics" was published, today's massive socio-economic statistical repositories in Europe, North America, and East Asia are enabling a data revolution of sorts. Indeed, the fields of data mining and data analytics are fast becoming important fields of academic study. Mirroring the rise of data availability and the nature of some of the data itself, the term "big data" has been coined [1] to refer to the extremely voluminous and complex data sets that require specialized processing application software to deal with them.
The most prominent stewards of socio-economic data are government statistical agencies, which focus on producing and disseminating data products secured via surveys (for example, the American Community Survey), censuses (such as Japan's 2015 Population Census), and administrative procedures (like the information needed to obtain an academic promotion in Spain). As a result, data storage is now ubiquitously electronic, replicated offsite to guard against storage failure, and measured in petabytes. Electronic storage enables low-cost dissemination of data. It also facilitates the integration of records across disparate databases, for example into a system of national accounts, a setting in which multiple estimates of the same datum must be reconciled, taking into account the number of previously combined priors and a ranking of estimates by their relative quality.

Basic Concepts
Bayesian inference was first developed by Laplace [29] and later expanded by others, such as Jeffreys [30] and Jaynes [31,23]. According to the Bayesian paradigm, a probability is a degree of belief about the likelihood of an event, and should reflect all relevant available information about that event. According to Weise and Wöger [32], if an empirical quantity is subject to measurement errors, it must be described by a random variable, whose expectation is the best guess and whose standard deviation is the uncertainty estimate.
More formally, a prior datum θ_i is characterized by a probability distribution π(q_i), which expresses the degree of belief that the datum takes realization q_i. The best guess is µ_i = E[θ_i] and the uncertainty is σ_i = √Var[θ_i]. When multiple data are considered, e.g., θ_i and θ_j, it is necessary to introduce the correlation between them, ρ_ij = Cov[θ_i, θ_j]/(σ_i σ_j). Rodrigues [33] further provides a series of rules to determine the properties of a strictly positive prior datum, using the maximum-entropy principle [34].
The type of data we are interested in are connected to one another through accounting identities of the form θ_0 = ∑_i θ_i, where θ_0 is an aggregate datum and the θ_i are disaggregate data. If the set of data is arranged in a vector θ of length n_T, the set of n_K accounting identities can be defined through a concordance matrix G, where, for a given accounting identity i, G_ij = 1 if θ_j is a disaggregate datum, G_ij = −1 if θ_j is an aggregate datum, and G_ij = 0 otherwise.
If the prior configuration is unbalanced, then Gθ ≠ 0, where 0 is a vector of zeros. Rodrigues [25] derives an analytical solution and a series of approximations that, given a concordance matrix and a prior configuration, provide a posterior configuration t such that Gt = 0. The notational convention used here is that Greek letters refer to priors while Latin cognates refer to posteriors, i.e., m_i, s_i and r_ij are, respectively, the best guess and uncertainty of t_i and the correlation between t_i and t_j.

Problem Formulation
We are now in a position to formulate the data reconciliation problem. Given initial priors θ′ and θ″, and a system with n_T + 1 numerical data, {θ_1, . . . , θ_{n_T−1}, θ′, θ″}, and n_K + 1 accounting identities, where accounting identity n_K + 1 takes the form θ′ = θ″, our goal is to determine the final prior θ, in a new system with n_T numerical data, {θ_1, . . . , θ_{n_T−1}, θ}, and the first n_K accounting identities of the original system, such that the posteriors {t_1, . . . , t_{n_T−1}} are identical in both data balancing problems, and t′ = t″ = t. Conceptually, we are approaching data reconciliation as a form of preliminary data balancing, as illustrated in Figure 1. The conflicting estimates are initial priors of the same datum, and the reconciled value is a final prior. Note the following notational convention: while other variables (and their properties) are denoted with subscripts, initial priors/posteriors (and their properties) are denoted with one (′) or two (″) primes, and the final prior/posterior is denoted with neither subscripts nor primes.
Three situations emerge: Either the datum to be reconciled is only a disaggregate datum; it is only an aggregate datum; or it is both a disaggregate and an aggregate datum, in different accounting identities. We will deal with the three cases separately.
Figure 1. On the left-hand side, balancing in a single step, with multiple initial estimates (priors) of the same datum, θ′ and θ″, balanced to the same quantity (posterior), t′ = t″. On the right-hand side, balancing in two steps: first, the reconciliation procedure combines the multiple initial estimates (initial priors), θ′ and θ″, into a final prior, θ; afterwards, the full system is balanced, leading to posterior t. We impose that the result of both procedures is the same, t′ = t″ = t.
We now present simple systems to illustrate the three possible cases. As a benchmark, consider a tabular system (i.e., with data organized in rows and columns) with no multiple estimates, consisting of a 2 × 3 table A with row sums b and column sums c. Furthermore, consider that the common sum of b and of c is known, as d. If i is a vector of ones of appropriate length, all vectors are in column format by default, and a prime (′) adjoined to a matrix or vector denotes transposition, then the previous set of constraints means that Ai = b, A′i = c and i′b = i′c = d. The vectorized form of this system and the concordance table are presented in Table 1. In the baseline system there is a total of twelve variables (columns of the concordance matrix G) and seven constraints (rows thereof). The first six variables are disaggregate values (corresponding to the initial A matrix), the following five are mixed (row and column sums b and c), and the last one is an aggregate datum (d). The first two constraints (rows of G) are the row sums of A, the following three are its column sums, and the last two are the sums of b and c. To understand how G is constructed, let us consider the first constraint, which is the first row sum of A. Formally, this is A_11 + A_12 + A_13 − b_1 = 0; hence, in the first row of G, the entries corresponding to the columns of A_11, A_12 and A_13 have 1s, the entry corresponding to the column of b_1 has −1, and all other entries are zero.
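To make the construction concrete, the following sketch (in Python with NumPy, not part of the original paper) builds the 7 × 12 concordance matrix of Table 1 and verifies that a balanced prior vector satisfies Gθ = 0:

```python
import numpy as np

# Variable order: A11, A12, A13, A21, A22, A23, b1, b2, c1, c2, c3, d
G = np.zeros((7, 12))

# Row sums: A_i1 + A_i2 + A_i3 - b_i = 0
G[0, [0, 1, 2]] = 1; G[0, 6] = -1
G[1, [3, 4, 5]] = 1; G[1, 7] = -1

# Column sums: A_1j + A_2j - c_j = 0
for j in range(3):
    G[2 + j, [j, 3 + j]] = 1
    G[2 + j, 8 + j] = -1

# Totals: b1 + b2 - d = 0 and c1 + c2 + c3 - d = 0
G[5, [6, 7]] = 1;     G[5, 11] = -1
G[6, [8, 9, 10]] = 1; G[6, 11] = -1

# A balanced configuration maps to the zero vector
A = np.array([[1., 2., 3.], [4., 5., 6.]])
theta = np.concatenate([A.ravel(), A.sum(1), A.sum(0), [A.sum()]])
print(np.allclose(G @ theta, 0))  # True for a balanced vector
```

The three variants of Section 2.2 would each append one column (the extra estimate) and one row (the equality constraint) to this matrix.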
We are now in a position to formalize the three situations of multiple estimates of a single datum as variants of Table 1 in which an additional row and column have been added to G.
The case of a disaggregate datum occurs if the datum for which multiple estimates exist is an interior point, which for concreteness we consider to be element A_23; the resulting set of constraints is shown in Table 2. As an illustration of the case in which there are two estimates of an aggregate datum, consider it to be d; the set of constraints is shown in Table 3. Finally, as an example of an element that is both aggregate and disaggregate, consider b_1; the set of constraints is shown in Table 4.
It is perhaps instructive to describe how the reconciliation problems differ from the baseline system. The three variants of the baseline are constructed by adding a single variable, the conflicting estimate, which for convenience is always appended to the original system. It is also necessary to add an extra constraint, connecting the two conflicting estimates. Finally, the baseline system is also changed so that one of the original occurrences of the datum to be reconciled becomes the first conflicting estimate and the second occurrence becomes the other conflicting estimate.
Table 1. Prior vector and concordance matrix, with no multiple estimates.
Table 3. Prior vector and concordance matrix, with multiple estimates of d.
Table 4. Prior vector and concordance matrix, with multiple estimates of b_1.
Note that in this simple example there are only two constraints affecting each datum, but that naturally is not generally the case. The number of constraints per datum is arbitrary and can be either one or larger than two. An example of what this system might represent is employment count by region and sector, with an extra dimension being type of ownership (private or local, state, or federal government), as reported in the QCEW database.

From Balancing to Reconciliation
Rodrigues [25] shows that if the posterior configuration is balanced, then its first- and second-moment constraints are Gm = 0 (1) and GSG′ = 0 (2), where m and S are the posterior best-guess vector and covariance matrix, the latter defined as S = ŝRŝ, where s is the vector of posterior uncertainties, R is the matrix of posterior correlations, and a hat (ˆ) denotes the diagonal matrix formed from a vector. Likewise, µ and Σ are the prior best-guess vector and covariance matrix, with Σ = σ̂Pσ̂, where P is the matrix of prior correlations.
The analytical solution of the data-balancing problem is given by Equations (3) and (4). Notice that Equations (3) and (4) contain symbols adjoined with ˜ (which we refer to as Gaussian parameters) while Equations (1) and (2) do not. The connection between the Gaussian parameters and the corresponding observable quantities is described in Rodrigues [25]: when relative uncertainty, σ_j/µ_j or s_j/m_j, is low, the Gaussian parameter and the observable are identical; when relative uncertainty is high, the best-guess Gaussian parameter tends to −∞ and the uncertainty Gaussian parameter tends to ∞ in a coordinated fashion. There is no closed-form expression between observables and Gaussian parameters in the multivariate case.
If both the prior uncertainty of aggregate data and initial prior correlations are high, we obtain a simplified weighted least-squares (WLS) method in which the weights are prior uncertainties, and posterior correlations are set by considering that relative uncertainty is constant, s = m ⊙ σ ⊘ µ, where ⊙ and ⊘ are the Hadamard (or entrywise) product and division, and the update takes place in small steps. This WLS method is a generalization of the standard biproportional balancing method (RAS) for arbitrary structure and uncertainty data [25]. However, it is in a way too simple for the data reconciliation problem, because it keeps relative uncertainty constant. In the data reconciliation problem this assumption is untenable whenever the relative uncertainty of the initial priors differs.
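For context, the standard biproportional (RAS) method that the WLS procedure generalizes can be sketched as follows; the function name and iteration scheme here are illustrative and are not taken from [25]:

```python
import numpy as np

def ras(A, row_targets, col_targets, iters=200, tol=1e-12):
    """Classic biproportional (RAS) balancing: alternately rescale the
    rows and columns of a non-negative seed matrix A until its margins
    match the target row and column sums (which must share a total)."""
    X = A.astype(float).copy()
    for _ in range(iters):
        X *= (row_targets / X.sum(axis=1))[:, None]   # row scaling step
        X *= (col_targets / X.sum(axis=0))[None, :]   # column scaling step
        if np.allclose(X.sum(axis=1), row_targets, atol=tol):
            break
    return X

seed = np.array([[1., 2., 3.], [4., 5., 6.]])
X = ras(seed, row_targets=np.array([10., 20.]),
        col_targets=np.array([9., 10., 11.]))
print(X.sum(axis=1), X.sum(axis=0))  # margins match the targets
```

Note that RAS adjusts only best guesses; the point of the WLS generalization is to carry uncertainty information through the adjustment as well.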
Thus, we now look for a simplification of the general solution (Equations (3) and (4)) that is still feasible and that allows for both best-guess and uncertainty reconciliation. Let us consider that correlations change little from prior to posterior, so that only uncertainties are adjusted. We rewrite Equations (3) and (4) without the ˜, meaning that all variables are observables. If correlations are not adjusted, then R = P, and if variances change little, then s ≈ σ. For convenience, consider now that a datum corresponding to entry (i, j) of the tabular matrix is t_ij, while the sum of row or column i is t_i, and the Lagrange parameters of a row sum or column sum are adjoined with superscript R or C. If the adjustment from prior to posterior is small, the matrix expressions simplify to the scalar Equations (6) and (7), where the derivation of Equation (7) follows along identical lines to that of Equation (6). We now use these expressions to obtain a tentative solution of the data reconciliation problem, even though they were derived under rather strict assumptions.

A Tentative Solution
We now examine the implications of applying Equations (6) and (7) to different data reconciliation configurations as described in Section 2.2: multiple estimates of (a) an aggregate datum; (b) a disaggregate datum; and (c) a datum that is both aggregate and disaggregate. We shall see that the same expression applies to all these problems.
For clarity, the analysis is carried out using scalar expressions, and, for brevity, only to the case of two constraints per datum. The strategy of the proof is the same for all configurations: to derive constraints connecting prior and posterior in the original problem and in a modified problem in which there is only a single datum where originally there were the conflicting estimates.

Aggregate Datum
Consider that there are two initial priors of a datum, θ_0′ and θ_0″, and that the datum is involved in two accounting identities, the first summing over elements 1 to n′ and the second over n′ + 1 to n″, where each t_i, for i > 0, can be affected by other accounting identities. The Lagrange parameters associated with these three expressions in Equation (6) are denoted, respectively, by β_0′, β_0″ and β_0. We wish to determine a final prior θ_0 such that the balanced posteriors coincide. Writing Equation (6) for the original problem and for the modified problem, every datum i > 0 is subject to identical expressions in both. Because the posteriors of the aggregate datum are all identical, s_0′ = s_0″ = s_0, the uncertainty of the final prior can be expressed in terms of the uncertainties of the initial priors. A similar expression can be obtained from Equation (7) for the final prior best guess, leading to the solution. Thus, both the final prior absolute uncertainty, σ_0, and relative uncertainty, σ_0/µ_0, are obtained as the harmonic average of the initial prior absolute and relative uncertainties.
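In explicit form, and consistent with the result stated in the abstract, the aggregate-datum solution can be sketched as follows (a reconstruction under the notation above, since the display equations are not reproduced in this excerpt):

```latex
% Sketch: harmonic average of uncertainties and inverse-uncertainty
% weighted average of best guesses (cf. the abstract's statement).
\frac{2}{\sigma_0} = \frac{1}{\sigma_0'} + \frac{1}{\sigma_0''},
\qquad
\mu_0 = \frac{\mu_0'/\sigma_0' + \mu_0''/\sigma_0''}{1/\sigma_0' + 1/\sigma_0''}
```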

Disaggregate Datum
Consider now that there are two initial priors of an interior point, θ_1′ and θ_1″, which is affected by two accounting identities. The Lagrange parameters associated with these three expressions are, as before, β_0′, β_0″ and β_0. We wish to determine a final prior θ_1 such that the balanced posteriors coincide. Writing Equation (6) for the original problem and for the modified problem, the data for which there are no conflicting estimates (t_0′, t_0″ and t_i with i > 1) are subject to the same set of constraints in both. Because the posteriors of the disaggregate datum are all identical, s_1′ = s_1″ = s_1, it becomes clear that we encounter exactly the same solution as in the case of an aggregate datum.

Mixed Datum
Consider now that there are two initial priors, θ_1′ and θ_1″, of a datum that is both aggregate and disaggregate, in different accounting identities. As before, the Lagrange parameters are denoted β_0′, β_0″ and β_1. We wish to determine a final prior θ_1 such that the balanced posteriors coincide. Writing Equation (6) for the original problem and for the modified problem, for datum 0 and for every datum i > 1 the two problems are identical, as has become routine. Because s_1′ = s_1″ = s_1, it is clear that the solution is again identical.

Axiomatic Formulation
In Section 2 we obtained a data reconciliation algorithm from first principles, as an operation of data balancing under a particular structure. However, we can also reason about the data reconciliation algorithm in terms of its properties, i.e., we will not determine what it is, but what it ought to be.
If θ′ and θ″ are two initial priors, the data reconciliation algorithm is a function f(·) that generates a final prior θ = f(θ′, θ″), where each prior θ is characterized by a best guess, µ, an absolute uncertainty, σ, and a relative uncertainty, u = σ/µ, which can take values in the range 0 ≤ u ≤ 1. Let x_min = min{x′, x″} and x_max = max{x′, x″}, where x can be µ, σ or u.
We now propose a series of properties that define the data reconciliation method.
Property 1 (Lower and upper bounds). The parameters of the final prior lie within the range set by the parameters of the initial priors, µ min ≤ µ ≤ µ max , σ min ≤ σ ≤ σ max and u min ≤ u ≤ u max .

Property 4 (Identity). If the initial prior best guesses are identical, µ′ = µ″, then the final prior best guess is the same, µ = µ′ = µ″. If the initial prior uncertainties are identical, σ′ = σ″, then the final prior uncertainty is the same, σ = σ′ = σ″.
Property 5 (Monotonicity). The relative adjustment from initial to final prior increases with the relative magnitude of the initial uncertainty, through functions g(·) and h(·) with dg(x)/dx > 0 and dh(x)/dx > 0.
Property 6 (Absorption). If initial prior θ′ is known with minimal uncertainty, u′ = 0, and θ″ is not, u″ > 0, then the final prior is identical to the first initial prior, f(θ′, θ″) = θ′. If initial prior θ′ is known with maximal uncertainty, u′ = 1, and θ″ is not, u″ < 1, then the final prior is identical to the second initial prior, f(θ′, θ″) = θ″.
We believe that these six properties are uncontroversial and self-explanatory. However, it turns out that the problem as formulated here has no solution, i.e., no formula can satisfy all of the above properties. We later overcome this hurdle by generalizing the problem formulation, to include two additional concepts: A hierarchy of data quality and the number of combined priors.

The Canonical Data Reconciliation Method
The properties outlined in Section 3.1 constrain the range of data reconciliation algorithms but do not define a unique solution. However, Equations (8) and (9) suggest how a solution may be obtained. Let us consider that g(x) and h(x) take the simple yet flexible form g(x) = ax^b and h(x) = cx^d.
The condition of identity (Property 4), in the case of µ′ = µ″ and σ′ = σ″, leads to an indeterminacy. But if the limit is approached as µ′ = µ − δ and µ″ = µ + δ, with δ → 0, then Equations (8) and (9) imply that a = c = 1. Let us further consider the simplest possible case, b = d = 1, so that g(·) and h(·) are the identity lines. Applying g(x) = x and h(x) = x to Equations (8) and (9), rearranging terms, and recalling that u = σ/µ, we obtain the canonical data reconciliation method of Equations (10) and (11). Equation (10) can be expressed in two other ways. Thus, if the ratio of relative adjustment of best guesses and uncertainties is identical to the ratio of absolute uncertainties of the initial priors, the best-guess data reconciliation method is a weighted average, where the weights are proportional to the inverse of absolute uncertainty, and the absolute and relative uncertainty data reconciliation methods are harmonic averages.
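As a numerical illustration (a sketch, assuming only the structure just described: inverse-uncertainty weights for the best guess and a harmonic average for the uncertainty), the canonical method for two priors can be coded directly:

```python
def reconcile(mu1, s1, mu2, s2):
    """Canonical reconciliation of two priors (mu, sigma): the best
    guesses are combined by a weighted average with inverse-uncertainty
    weights, and the uncertainties by a harmonic average."""
    w1, w2 = 1.0 / s1, 1.0 / s2
    mu = (w1 * mu1 + w2 * mu2) / (w1 + w2)
    sigma = 2.0 / (w1 + w2)          # harmonic average of s1 and s2
    return mu, sigma

# Equal uncertainties: plain arithmetic average, uncertainty unchanged
print(reconcile(10.0, 2.0, 20.0, 2.0))  # -> (15.0, 2.0)
```

With unequal uncertainties the more precise prior dominates: reconcile(0.0, 1.0, 10.0, 4.0) pulls the best guess toward the first prior and returns an uncertainty below both could-be arithmetic averages.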
Does this data reconciliation method satisfy the properties of Section 3.1? It is trivial to check that Properties 1, 2, 4 and 5 are satisfied. But this is not the case for Properties 3 and 6. In the following subsections we present suitable extensions of the canonical data reconciliation method to address these problems.

The Number of Combined Priors
The canonical data reconciliation method is not associative: the properties of f(θ′, f(θ″, θ‴)) depend on the order in which the priors are combined. But upon some reflection, this result is in fact reasonable. The final prior is the combination of two initial priors with equal weights; if some of these initial priors are themselves a combination of other initial priors, this information has to be considered explicitly.
Let us introduce a new quantity, n, the number of combined priors, so that a prior θ is now defined by a best guess, µ, an absolute uncertainty, σ, and n. Consider the data reconciliation rule of Equation (14); as before, it can be expressed in two other ways. This data reconciliation rule satisfies the first five properties of Section 3.1.
In order to ensure that the absorption of maximal uncertainty is satisfied, we use the concept of data quality, introduced in Rodrigues [25]. The idea is that, besides an uncertainty estimate, which formalizes quantitatively a degree of confidence in the accuracy of the best guess of a datum, it is also possible to formalize qualitatively a degree of confidence in the accuracy of a datum relative to others.
For the purpose of data balancing, Rodrigues [25] suggests that a datum that is considered to be of higher quality should be kept fixed while lower quality data are adjusted. The natural corollary, in the problem of data reconciliation, is to consider that when one wishes to combine two initial priors of differing levels of data quality, the prior of lower quality should be disregarded.
If a datum has unitary relative uncertainty, then it is maximally uninformative, and it is reasonable to disregard it. After all, a maximally uninformative prior should only be used if no better alternative is available. We therefore suggest that, if σ′ = µ′ and σ″ < µ″, then θ = θ″ directly, without using Equations (10) and (11).

Summary
We now present the expressions for the combination of n initial priors, θ_i, with i = 1, . . . , n, into a single final prior θ. Addressing this problem requires the specification, for each prior θ_i, of its best guess, µ_i, its absolute uncertainty, σ_i, and the number of previously combined priors, n_i.
If all relative uncertainties, u_i = σ_i/µ_i, are in the range 0 < u_i < 1, then the final prior properties are defined by Equations (19) and (20); Equation (19) can also be expressed in equivalent forms. If some initial priors have zero relative uncertainty, u_i = 0, then all other initial priors should be disregarded. If some initial priors have unitary relative uncertainty, u_i = 1, then it is they which should be disregarded.
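A compact sketch of the n-prior combination follows. The precise weighting of each prior by n_i/σ_i is an assumption here, inferred from the conclusions' description of how both the count n_i and the inverse uncertainty raise a prior's importance; it reduces to the canonical two-prior method when all n_i = 1.

```python
def reconcile_n(priors):
    """priors: list of (mu_i, sigma_i, n_i) with 0 < sigma_i/mu_i < 1.
    Sketch of the combination rule: best guesses combine by a weighted
    average and uncertainties by a weighted harmonic average, with
    weights n_i / sigma_i (an assumed, but consistent, weighting)."""
    w = [n / s for _, s, n in priors]
    mu = sum(wi * m for wi, (m, _, _) in zip(w, priors)) / sum(w)
    sigma = sum(n for _, _, n in priors) / sum(w)
    return mu, sigma, sum(n for _, _, n in priors)

# Two simple priors (n_i = 1) with equal uncertainty: arithmetic average
print(reconcile_n([(10.0, 2.0, 1), (20.0, 2.0, 1)]))  # -> (15.0, 2.0, 2)
```

Raising n_1 shifts the result toward the first prior, mirroring the behaviour described for Figures 2 and 3.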
In Figure 2 we illustrate the behaviour of Equation (19), when n = 2 and n_1 = n_2 = µ_2 = 1. The plot shows different curves of the combined posterior best guess µ as a function of the prior best guess µ_1, where each curve corresponds to a different ratio of uncertainties, σ_1/σ_2. When both uncertainties are identical, the posterior best guess is the arithmetic average of the two prior best guesses. When the uncertainties differ, the best-guess prior with the largest uncertainty contributes the least to the combined best guess: in the limit case in which σ_1 ≫ σ_2, the prior µ_1 is ignored and the posterior is similar to µ_2; when σ_1 ≪ σ_2, the reverse occurs and µ ≈ µ_1. In turn, in Figure 3 we illustrate the behaviour of Equation (20). We still consider that n = 2 and n_1 = n_2, but now no explicit assumption about best guesses is necessary. Instead, the uncertainty of the second variable is fixed, σ_2 = 1, and the curve shows the value of the combined posterior as a function of the prior uncertainty σ_1. The figure describes an arc slightly above the diagonal line. When both prior uncertainties are identical (σ_1 = 1), the posterior equals the priors, as expected. As σ_1 becomes smaller than σ_2, the combined prior becomes closer to σ_1 than to σ_2, but always larger, σ > σ_1, except in the limit case σ_1 → 0, in which case σ → σ_1.

Conclusions and Discussion
Herein we investigated, via two distinct pathways, the problem of reconciling multiple conflicting estimates in the course of database development. We assume that the developer (data snooper) is equipped with a best guess and an uncertainty for each of those conflicting estimates.
First, we apply a maximum-entropy Bayesian inference method, under the limiting condition that the adjustment from prior to posterior uncertainties is small. Second, we obtain a canonical data reconciliation method through an axiomatic approach, as simple as possible while satisfying important qualitative properties. Each approach verifies the other.
The resulting formula for the best guess, Equation (19), is a weighted average showing that, as the count of conflicting priors underlying a particular prior rises, the value of that prior increases in importance in terms of obtaining a solution. We get a similar result with the inverse of the uncertainty, that is, the narrower the uncertainty of an estimate the more it contributes to the final solution. The resulting formula for the uncertainty, Equation (20), is a harmonic average where the same factors are present: As the count of conflicting priors underlying a particular prior rises, the value of that prior increases in importance; and the narrower the uncertainty of a prior, the more it contributes to the final solution.
Of course, limitations to our approach must be mentioned. The key limitation is certainly that, in some practical applications, the data snooper will lack information on the best guess, the uncertainty, or both. It may be that, instead, one only has upper and lower bounds for the datum of interest to inform its best guess and uncertainty. This is certainly the case in some instances when data are censored, e.g., the anti-suppression problem of Gerking et al. [13] and Isserman and Westervelt [12]. Future work using variable ranges with externally informed priors would be a natural extension of what is presented here. Indeed, some initial forays into this line of investigation are already underway; see, e.g., Makarkina and Lahr [35].
It should be mentioned that although the focus of attention here was on conflicting estimates arising from economic accounts there are other circumstances in which a formally identical problem arises, for example in expert elicitation [36].