Defense of the Least Squares Solution to Peelle ’ s Pertinent Puzzle

Generalized least squares (GLS) for model parameter estimation has a long and successful history dating to its development by Gauss in 1795. Alternatives can outperform GLS in some settings, and alternatives to GLS are sometimes sought when GLS exhibits curious behavior, such as in Peelle’s Pertinent Puzzle (PPP). PPP was described in 1987 in the context of estimating fundamental parameters that arise in nuclear interaction experiments. In PPP, GLS estimates fell outside the range of the data, eliciting concerns that GLS was somehow flawed. These concerns have led to suggested alternatives to GLS estimators. This paper defends GLS in the PPP context, investigates when PPP can occur, illustrates when PPP can be beneficial for parameter estimation, reviews optimality properties of GLS estimators, and gives an example in which PPP does occur.


Introduction
Generalized least squares (GLS) for parameter estimation has a long and successful history dating to its development by Gauss in 1795.In some settings, alternatives to GLS can be effective, and are sometimes sought when GLS exhibits curious behavior, such as in Peelle's Pertinent Puzzle (PPP).
PPP was introduced in 1987 in the context of estimating fundamental parameters that arise in nuclear interaction experiments [1].PPP is described below and when it occurs, the GLS estimate of the parameter is guaranteed to be outside the range of the data, which has elicited concerns that GLS is flawed and has led to suggested alternatives to GLS estimators.
A GLS estimate lying outside the range of the data causes heartache among nuclear scientists.Therefore, PPP continues to be of theoretical and practical interest.For example, a summary report of International Evaluation of Neutron Cross Section Standards [2] discusses PPP in terms of a standard least-squares procedure.Neutron cross sections are fundamental parameters that describe the probabilities of various neutron interactions.These cross sections are typically estimated from multiple experiments so some type of weighted average estimation scheme is used.The cross section estimates from any two experiments can have shared errors arising for example from using the same measured background, which can lead to covariance structures such as those described below.
We quote the original PPP problem proposed by [1] from a report of Chiba and Smith [3], "Suppose we are required to obtain the weighted average of two experimental results for the same physical quantity.The first result is 1.5, and the second result 1.0.The full covariance matrix of those data is believed to be the sum of three components.The first component is fully correlated with standard error 20% of each respective value.The second and third components are independent of the first and of each other, and correspond to 10% random uncertainties in each experimental result.
Although this PPP statement is vague, by converting it to something more interpretable, GLS can be applied and the resulting estimate is 0.88 (with an associated standard deviation of 0.22), which is outside the range of the measurements.Zhao and Perey [4] re-interpreted PPP by introducing a third datum c, through which the common error can be explicitly specified as follows, "Suppose we have two independent measurements.One is m 1 = 1.5 ± 10%.Another is m 2 = 1.0 ± 10%.To convert this quantity into another physical quantity, we need a conversion factor c, which after intermediate steps omitted here is 1.0 with uncertainty of 20%.Now the experimental results are, y 1 = cm 1 = 1.5 and y 2 = cm 2 = 1.0.We are required to obtain the weighted average of those experimental data.
In this interpretation, the common error (the "fully correlated" component) is understood to be multiplicative, and m 1 = 1.5 ± 10% is assumed to mean that the true standard deviation is 0.15 for m 1 (and 0.10 for m 2 ).Even after these interpretations, some vagueness remains.There is no convention regarding what confidence is associated with ±10%.Nor is there a convention for whether the standard deviation includes all error sources, or only includes random error effects, ignoring accuracy.In addition, we show below that it can matter whether the standard deviation is expressed as a fraction of the true quantity or of the measured quantity.
One of our contributions is to make explicit assumptions and examine their implications in order to convert vague statements to statements about which it is possible to find agreement among physical scientists and statisticians regarding suitable approaches.We also defend GLS in the PPP context, illustrate when PPP can be beneficial, briefly describe properties of GLS estimators, show that PPP cannot occur for certain measurement error models, and calculate a covariance matrix Σ for y 1 and y 2 for which PPP occurs that follows from a physical description of a realistic measurement scenario.

PPP
The vagueness of the original PPP statement is one reason there are so many interpretations of PPP [5].PPP can occur if there is a large positive covariance between y 1 and y 2 and the variances of y 1 and y 2 are very different.
Let Σ be the 2-by-2 symmetric covariance matrix for y 1 and y 2 with diagonal entries σ 2 1 , σ 2 2 , and off-diagonal entry σ 12 , which denote the variance of y 1 , the variance of y 2 , and the covariance of y 1 and y 2 , respectively.Zhao and Perry [4] approximated Σ for their definition of PPP (using y 1 = cm 1 and y 2 = cm 2 as described in the Introduction) as where σ m1 , σ m2 , and σ c are the standard deviations in the quantities m 1 , m 2 , and c, respectively.The first approximation arises because the relatively small term σ 2 c σ 2 m was omitted (see below).The second approximation follows by approximating µ 2 m by m 1 m 2 .Equation 1 assumes that m and c are independent.And, the relative standard deviations of 10%, 10%, and 20% are assumed to be the fractions relative to the measured values of 1.5, 1.0, and 1.0, respectively by definition, so that σ m1 = 0.15, σ m2 = 0.10, and σ c = 0.2.See Section 4 for further discussion.
Readers might find it informative that those with traditional statistical education among the authors were the most willing to accept GLS estimates despite the apparent flaw of lying outside the range of the data.Statisticians will often consider alternatives to GLS, but recognize that GLS estimation is difficult to beat, at least in terms of typical performance measures such as being close on average to the true parameter value over hypothetical repeats of the pair of experiments [6].Also, note that because = 0.5/0.21= 2.39, one might consider the original PPP to be an unusual data realization.However, we show using elementary algebra and error modeling in Theorem 2 that when PPP occurs, it occurs for all data realizations.
Although Peelle [1] originally constructed the covariance matrix Σ as in Equation 1 and other authors followed, this result is not exact if the common error is multiplicative because of the omitted term σ 2 c σ 2 m and because of the need to estimate µ m .If we include the σ 2 c σ 2 m term, then where σ m is σ m1 or σ m2 in Equation 1.The covariance matrix is therefore estimated as where again as in Equation 1, µ m is estimated by m 1 or m 2 .
As shown below, prior to substituting the approximation for µ 2 m , Equations 1 and 3 satisfy conditions under which PPP cannot occur (Theorem 2 in Section 3).However, because in all published investigations we are aware of, µ 2 m is approximated by m 1 m 2 in the off-diagonal and µ 2 m is approximated by m 2  1 in the upper left entry in the second term and by m 2 2 in the lower right entry in the second term, Σ as estimated does not have the condition referred to in Theorem 2 below.Therefore, due to estimation error in Σ, PPP does appear to occur, which means that the GLS estimate of µ is outside the range of the data for Σ as given by Equation 1or Equation 3.
The fact that µ m is approximated in the context of estimating variance and covariance in a multiplicative error model raises at least three issues: (1) the issue of how uncertainty in measurement is expressed; (2) an issue related to simulating observations from the assumed measurement error model as a way to consider likelihoods other than the Gaussian, and (3) accounting for uncertainty in Σ.For issue (1), we will make our measurement error model assumptions explicit throughout in order to eliminate needless ambiguities.Issues ( 2) and (3) are investigated in [6].
To focus on GLS behavior when PPP occurs, this paper assumes Σ is known exactly without error.However, for historical and presentation purposes, Equations 1 and 3 are presented, and both clearly involve approximations.In contrast, Theorem 3 illustrates a measurement error model for which the exact covariance can satisfy the PPP condition.

GLS
It is well known that the GLS method can be applied to y 1 and y 2 to obtain the best linear unbiased estimate (BLUE) μ of µ [7].Here, "best" means minimum variance and unbiased means that on average (across hypothetical or real realizations of the same experiment), the estimate μ will equal its true value µ.The GLS estimate for µ arising from the model with Σ = Cov(e 1 , e 2 ) = Cov(y 1 , y 2 ) given by where the scalar c = (G t Σ −1 G) −1 .And the variance σ 2 of the GLS estimate μ is given by where the matrix G is given by G t = (1, 1).Putting the covariance of Equation 1into Equations 5 and 6 gives μ = c 1 y 1 + c 2 y 2 = 0.88 and σ = 0.22 where c 1 = −0.24and c 2 = 1.24.Notice that c 1 + c 2 = 1 (so that μ is unbiased) but also that c 1 < 0 and that μ is smaller than each of the two measured values, y 1 = 1.5 and y 2 = 1.0.GLS is usually introduced in the context of estimating β and future y values in a linear regression relating the response y to predictors X via y = Xβ + ϵ ( [7]).Therefore, in Equation 4, the mean µ plays the role of the unknown β.The GLS solution in Equations 5 and 6 then follows from standard calculus or projection matrix results.For example, one can write μ = a and solve for a 1 to minimize var(μ) by setting the derivative of var(μ) with respect to a 1 to 0.
The GLS solution of Equations 5 and 6 with covariance Σ of Equation 3 is 0.89 ± 0.22.Because the last term in Equation 2 is smaller than the others, the Σ in Equation 3 that is slightly modified compared to the Σ in Equation 1 still leads to PPP.But in Section 4 we describe other modifications to Σ that do not lead to PPP.
GLS estimation has a long and successful history, but met with serious objection within the nuclear physics community in the context of combining estimates from multiple experiments upon observing a tendency to produce estimates that are outside the range of the data.More specifically, to date the tendency has been to produce estimates that are less than the minimum data value, so have been criticized as being "too small" [2].
GLS estimation is guaranteed to produce the BLUE even if the underlying data are not Gaussian.However, if the data is not Gaussian, then the minimum variance unbiased estimator (MVUE) is not necessarily linear in the data.Also, though unbiased estimation might sound politically correct, it is not necessarily superior to biased estimation [8].Therefore, PPP has motivated the nuclear physics community to consider estimators other than GLS.
If the data has a Gaussian distribution, then it is well known that the GLS estimate is the same as the maximum likelihood (ML) estimate.This is because the log of the Gaussian likelihood involves a sum of squares, so choosing an estimate (the GLS estimate) that minimizes a sum of squares corresponds to choosing an estimate (the ML estimate) that maximizes the likelihood.Ordinary LS (OLS), weighted LS (WLS), and GLS are all essentially the same technique, but OLS is used if Σ is proportional to a unit-diagonal matrix, WLS is used if Σ is proportional to a diagonal matrix, and GLS is used if Σ is an arbitrary positive definite covariance matrix.The Gauss-Markov theorem [7] proves that the OLS estimator is the BLUE, and very similar theorems prove the same result for WLS or GLS.
The ML estimate depends on the assumed distributions for the errors.For example, if we replace the Gaussian (Normal) distributions with logNormal distributions, the ML estimate will change.In the cases considered here, ML gives the same estimate as GLS, because the data distribution is Gaussian.Because the ML approach makes strong use of the assumed error distributions, the ML estimate is sensitive to the assumed error distribution.The ML method has desirable properties, including asymptotically minimum variance as the sample size increases.However, in our example, the sample size is tiny (two), so asymptotic results for ML estimates are not relevant.It still is possible that an ML estimator will be better for nonGaussian data than GLS [6].Typically, "better" is defined as the mean squared error (MSE) of the estimator, which is well known to satisfy MSE = variance + bias 2 .In some cases, biased estimators have lower MSE than unbiased estimators because the bias introduced is more than offset by a reduction in variance [8].

Closer Look into PPP
The original PPP does not clearly state whether the common error is additive or multiplicative.This ambiguity was examined in [5].In the "additive" scenario case, y 1 = m 1 + b and y 2 = m 2 + b, and the source of correlation may be a common background measurement b.And, the situation is somewhat different from PPP, because if σ b is 20% of y 1 = 1.5, then it is 30% of y 2 = 1.0.The covariance matrix for this case is and the GLS solution is 1.15 ± 0.31.With the covariance matrix of Equation 7, PPP does not occur, and the GLS solution of 1.15 is a weighted average of 1.0 ± 0.1 and 1.5 ± 0.15.Although as σ b changes, the standard deviation σ μ is scaled accordingly, the GLS solution does not change.It is reasonable that the GLS solution does not change as σ b changes, because whatever the background fluctuation is, y 1 and y 2 are impacted in the same manner by b.
If instead σ b is 20% of µ m , then its value is unknown because µ m is unknown and must be estimated.However, regardless of the value of σ b > 0, PPP cannot occur in the additive case as parameterized in Equation 7(Theorem 1 below).
In Sivia's [9] notation, Equation 5 can be written as with where ρ is the correlation coefficient.Sivia demonstrated that PPP does not occur if the condition is fulfilled (and does occur if the condition does not hold).It is not difficult to show that the additive case of Equation 7 satisfies this condition, which we state as Theorem 1.
Theorem 1.If Σ is given by then PPP does not occur.
is a large probability that both y 1 and y 2 lie either above the mean or below the mean, so indeed µ is likely to fall outside the range of (y 1 , y 2 ).Integration of the bivariate normal for Σ given by Equation 1indicates that with probability approximately 0.40, µ lies below the minimum of (y 1 , y 2 ) and with the same probability µ lies above the maximum of (y 1 , y 2 ).This is a total of approximately 0.80 probability that µ falls outside the range of (y 1 , y 2 ), which is an example of feature one.Having an estimate lie outside the range of the data is therefore defensible, provided (feature two) that one can guess with better than random chance performance whether µ lies below the minimum or whether µ lies above the maximum of (y 1 , y 2 ).To see that one can beat random guessing performance, suppose y 1 > y 2 as in our case (y 1 = 1.5 and y 2 = 1.0).Then, because σ 2 y 1 = 0.1125 is larger than σ 2 y 2 = 0.05 in Equation 1, µ is more likely to fall below y 2 because if instead µ > y 1 then the distance from y 1 to µ would be smaller than the distance from y 2 to µ, contradicting the fact that σ 2 y 1 > σ 2 y 2 .To confirm this line of reasoning, in 10,000 simulations (in the statistical computing language R) of (y 1 , y 2 ) pairs having µ = 0.88, 57% of the simulation runs for which y 1 > y 2 did in fact also have µ < y 2 .On the basis of 10,000 simulations, 57% is repeatable to within ±1% or less, so this is better than random chance (50%) guessing.This is not a formal proof but does suggest a direction to understand when a GLS estimate falling outside the range of the data is effective.Note that y 1 = 1.5 and y 2 = 1.0 in the PPP statement, and the GLS estimate is μ = 0.88 < y 2 .

Example Where PPP Occurs without Approximation
Thus far we have not demonstrated any error model which exactly (without approximation) produces a Σ that leads to PPP.A situation that leads to PPP without the approximations in Equation 1 is expressed in Theorem 3. Theorem 3. Suppose m 1 = µ + ϵ R1 and m 2 = µ + ϵ R2 where ϵ R1 is random error in m 1 with variance σ 2 R1 and similarly for ϵ R2 .Then if y 1 = m 1 + ϵ S and y 2 = m 2 + αϵ S , where ϵ S ∼ N (0, σ S ) and α is any positive scale factor other than 1, the covariance matrix Σ of (y 1 , y 2 ) can lie in the PPP region.
• If α < 0 then PPP cannot occur.However, our applications have α > 0. Jones et al. [10] showed that if ρ → ±1 and σ 1 ̸ = σ 2 , then µ can be estimated with surprisingly small variance.That fact plus the known BLUE property of GLS could convince us to just "live with" the PPP because it can make sense for the GLS estimate μ to lie outside the range of (y 1 , y 2 ) in the case α > 0, or σ 12 > 0.
One example in which the assumptions of Theorem 3 hold involves subtracting a background measurement from a region of interest (ROI) measurement to get a net result.Because the background measurement often involves a different number of channels than the ROI measurement, a scale factor k is introduced to estimate the net counts as net = peak − k × background.Figure 2 illustrates a hypothetical example where each of three peak ROIs have a corresponding background measurement in the plot of the square root of detected neutron counts versus neutron energy in arbitrary units (au).
Suppose each ROI and corresponding background are analyzed separately, and consider the first ROI in Figure 2. The count times could vary between the two experiments, so σ 2 R1 ̸ = σ 2 R2 .Both experiment 1 and experiment 2 measure the ROI counts, but in many situations, the background measurement is made only by experiment 1.In that case, experiments 1 and 2 must use the background measurement made by experiment 1.Also, if the ROI is found by analyzing the shape of the curve (the "spectrum") that describes detected count rates versus particle energy, then the ROI for experiments 1 and 2 could differ.We then have N where N is net counts, G is gross counts, B is background, a 1 is the scale factor for experiment 1, and a 2 is the scale factor for experiment 2. The scale factor a 1 for experiment 1 is the ratio of the number of ROI channels to the number of background channels, and similarly for the scale factor a 2 for experiment 2. The channel counts have variation from repeat to repeat so the detected counts will vary around the true counts with some error.As an aside, often the channel counts have approximately a Poisson distribution which for large count rates is well approximated by a Gaussian distribution.Regardless of which probability distribution best describes the channel counts, there are measurement errors in N 1 , N 2 , and one can divide N 1 = G 1 − a 1 × B by a 1 to convert this pair of equations to those assumed in Theorem 3.

Conclusions
Because Peelle's original statement is vague, there have been several interpretations and solutions.In our experience, there is considerable variation among experimentalists in the expression of measurement uncertainty, and a wide the range of analyses can result from vague uncertainty statements.
The three main contributions of this paper are: (1) illustrating examples when PPP cannot occur (Theorem 1); (2) providing insight when PPP is effective and appropriate (related to Theorem 2), and (3) deriving a realistic covariance matrix Σ for which PPP occurs according to physical descriptions of realistic measurement scenarios (Theorem 3).We also showed via numerical integration that an estimate lying outside the range of the data is sensible.This is because the unequal variances of x 1 and x 2 provide information regarding whether µ is more likely to be less than the minimum or greater than the maximum of x 1 and x 2 .
Of course GLS provides a good estimate μ in general because of its well-known BLUE property (and MVUE if the data is Gaussian) and in particular for the PPP problem if the covariance Σ is well known.
There will almost always be estimation error in Σ, and often the measurement errors are nonGaussian.Therefore, we consider the following two topics in [6]: (1) alternatives to GLS when there is estimation error in Σ, and to provide estimators other than GLS that use the estimated likelihood in the case of non-Gaussian error models that make ML estimation difficult.Regarding (1), it is known that weighted estimates do not always outperform equally-weighted estimates when there is estimation error in the weights.We have already noted that estimation of Σ in Equation 1 introduces apparent PPP when PPP does not actually occur.
Finally, [10] showed that when the diagonal entries in Σ are different and the correlation is large and positive, the GLS estimate can have lower variance than if the correlation is zero.This fact does not seem to be widely known, and experimental opportunities to exploit this fact are under investigation.

Figure 1 .
Figure 1.Contours of example bivarate normal density of y 1 and y 2 illustrating that µ = 0.88 is likely to fall outside the range of y 1 and y 2 because of the large probability of (y 1 , y 2 ) falling in region 1 or 3.

Figure 2 .
Figure 2. Example region of interest and corresponding background that can lead to the PPP condition.