Proofs of non-optimality of the standard least-squares method for track reconstructions

It is a standard criterium in statistics to define an optimal estimator the one with the minimum variance. Thus, the optimality is proved with inequality among variances of competing estimators. The inequalities, demonstrated here, disfavor the standard least squares estimators. Inequalities among estimators are connected to names of Cramer, Rao and Frechet. The standard demonstrations of these inequalities require very special analytical properties of the probability functions, globally indicated as regular models. These limiting conditions are too restrictive to handle realistic problems in track fitting. A previous extension to heteroscedastic models of the Cramer-Rao-Frechet inequalities was performed with Gaussian distributions. These demonstrations proved beyond any possible doubts the superiority of the heteroscedastic models compared to the standard least squares method. However, the Gaussian distributions are typical members of the required regular models. Instead, the realistic probability distributions, encountered in tracker detectors, are very different from Gaussian distributions. Therefore, to have well grounded set of inequalities, the limitations to regular models must be overtaken. The aim of this paper is to demonstrate the inequalities for least squares estimators for irregular models of probabilities, explicitly excluded by the Cramer-Rao-Frechet demonstrations. Estimators for straight and parabolic tracks will be considered. The final part deals with the form of the distributions of simplified heteroscedastic track models reconstructed with optimal estimators and the standard (non-optimal) estimators. A comparison among the distributions of these different estimators shows the large loss in resolution of the standard least-squares estimators.


Introduction
It is evident the crucial importance to study the inequalities among estimators of different fitting algorithms. In fact, a standard definition of optimality is connected to the estimator with the minimum variance among competing estimators. Cramer-Rao-Frechet (CRF) introduced methods to calculate those inequalities, and in few special cases they calculated the absolute minimums of variances. Reference [1] was dedicated to extend the CRF approach to inequalities for estimators of the least-squares methods for Gaussian probability density functions (PDFs). The Gaussian PDFs were used in a toy-model discussed in ref. [1,2]. That model was defined in ref. [2] to illustrate the essential improvements introduced in the least-squares method by the appropriate hit variances (heteroscedasticity), and to compare them with the standard least squares without those differences (homoscedasticity). However, it was clear to us, well before ref. [1,2], that realistic effective variances of the observations (hits) drastically improved the quality of track fitting, in models of silicon trackers. References [3,4] clearly showed the importance of calculations dedicated to the estimations of the hit properties relevant to track reconstruction. Among these fitting results, one was very interesting [4]: a linear growth of the maximums of the momentum distributions with the number N of detecting layers. Such growth is much faster than the corresponding growth, as √ N, of the standard least squares. Further details were added in ref. [2] to the origin of this linear growth and to show its generality in tracker physics. For this task, it was simulated the estimator of the direction, of a set of straight tracks, in a realistic model of silicon tracker. To better illustrate these new results, a section of ref. [2] was dedicated to a Gaussian toy-model very easily reproducible, by the interested reader, with few lines of MATLAB code [5]. The Gaussian toy-model gives results very similar to the more complex realistic simulations. Its linear growth has a very effective visual impact for the unusual slenderness of the parameter distributions and their fast growth with N. Instead, the standard least-squares does not move from its modest growth as √ N. Such evident differences activated, in few of our readers (and referees), reminiscences of mathematical statistics where the √ N is a frequent result. In particular, the widely reported form of the CRF-bound was used to raise doubts [6] on the realizability of the linear growths, in spite of a their reproduction of our toy-model. Evidently, a simulation is a numerical demonstration, and it rules out any theorem (if it could exist) with conflicting statements. However, the necessity to remove forever any possible criticism about our growth (linear or faster than linear) was the motivation of our ref. [1]. Given that the criticisms were essentially addressed to the Gaussian toy-model, the demonstrations were based on Gaussian PDFs. Reference [1] proved the perfect consistency of those results with the CRF inequalities and the non optimality of the standard least squares estimators for Gaussian PDFs. The analytical properties of the Gaussian PDFs are ideal for those developments, and, in general, the CRF method requires special type of PDFs defined as regular models [7,8,9]. Even if the inequalities turn out to be independent from their original PDFs, the extension to non-Gaussian PDFs, or irregular models, can only be guessed by the analytical forms of the variances. Given that a large part of our approaches are base on truncated Cauchy-like PDFs (typical irregular models), the absence of direct demonstrations could allow the survival of possible doubts. It is our aim to eliminate any misunderstanding with a direct demonstration of the variance inequalities even for irregular models. To illustrate the effects of these results, we introduce a toy-model whose PDFs have rectangular shapes. This PDF is used in ref. [7], and elsewhere, as prototype of irregular model. The parameters of the toy-model are identical to those of ref. [1] with the sole differences in the form of the hit PDFs. The rectangular toy-model shows strong similarity with the Gaussian one, as suggested by the formal identities of the new inequalities with those for the Gaussian PDFs.   Figure 1 shows Monte Carlo simulations of a simple tracker with N detecting layers of identical technology (silicon microstrips) crossed at orthogonal incidence by a set of parallel straight tracks of minimum ionizing particles (MIPs). The reported estimator is the track direction (the tangent of a small angle). The hit uncertainties are obtained by a modulation of the rectangular distributions ([rand-1/2] √ 12), of zero mean and unit variance, with two type of hit quality. One of very good quality when the MIP crosses the detector near to the borders (20% of strip width), and the other of bad quality (80% of the strip width) around the strip center. The standard deviation of the hits are σ 1 = 0.18 (bad hits) and σ 2 = 0.018 strip widths (the strip width is 63 µm and the tracker length is 445 mm). The hit properties are a drastic simplification of the results of refs. [3,4] for a type of silicon detector of large use. (Results of ref. [10] share some similarity with these assumptions). The simulated orthogonal straight tracks are fitted with two different least-squares methods: one that uses only the hit positions (standard least squares) and the second that uses also the hit standard deviations (weighted least squares). Further details of these simulations are reported in Appendix B of ref. [1]. As it is evident in fig. 1, the blue lines have a very slow growth as √ N. Instead, the red lines of the heteroscedastic least-squares model have an almost linear growth of the maximums of the empirical PDFs. Translating these results on mean variances, the standard fit shows a 1/N trend, instead, the weighted fit shows a trend of the mean variances as 1/N 2 . For precision, we have to specify better the origin of the linear growth, because the inequalities only state relations among variances of the estimators. This only implies that a set of simulated tracks with heteroscedastic hit errors, reconstructed with the appropriate estimators, have smaller mean variances than the mean variance of the corresponding estimators of the standard least-squares. The linear growth with N is mainly due to the usual simulation of a tracker that distributes, on the tracks, good or bad hits with a binomial probability distribution. In the weighted least squares, the track resolution is mainly dominated by the good hits that, in the binomial distribution, grow as N in average (as far as the multiple scattering can be neglected). Instead, the standard least-squares fit has a very weak sensitivity to the hit quality and to their binomial distributions. Its variances have explicit N dependencies (as σ 2 m /N for the β estimator) that modulate the maximums of the distributions to grow as √ N.

Inequalities for irregular (and regular) models
The developments of ref. [1] require very special properties for the PDFs: continuity, at least two times differentiability, inversion of derivation and integration, and others. All these these properties are globally indicated as "regular models". The Gaussian PDFs are perfect examples of regular models. Instead, the rectangular PDFs (for example) lack few of them and the CRF-methods can not be used in this case. Even the schematic (realistic) models of ref. [2,3,4] are irregular models for the construction of the weights as described in ref. [4]. Hence, a different method is required to prove the variance inequalities for systems excluded by the CRF-assumptions.

Definitions of the probability model
Very general properties will be postulated for of the PDFs f j (x). They cover regular and irregular models.
To simplify the notations of the integrations, the PDFs f j (x) are defined on all the real line even for f j (x) different from zero on a finite interval. Thus, the limits of the integrals are always −∞ and +∞, these limits are ever subtended. The properties of the f j (x) are: The case of a single estimator is irrelevant for track reconstruction and will be neglected. However, the variance inequality can be directly proved by a simple Cauchy-Schwarz inequality. Instead, we will concentrate first on straight tracks. They cross N detection planes (microstrips) producing the observations . Where β is the track impact point, and γy j is the angular shift in a plane distant y j from the reference plane. The reference plane is selected to have ∑ j y j = 0. The likelihood function for the N observations (hits) of the track is: Two different mean-squares of observations are defined; the standard one M s (homoscedastic) and the weighted one M w (heteroscedastic): The definition of the unbiased estimators is done with the derivatives in β and γ of M s and M w . The two required vectors are S 1,2,...,N (x, β , γ) and U 1,2,...,N (x, β , γ): The condition ∑ j y j = 0 is neglected in S 1,2,...,N (x, β , γ) to maintain a similarity with U 1,2,...,N (x, β , γ).
The mean values of S 1,2,...,N (x, β , γ) and of U 1,2,...,N (x, β , γ) with the likelihood L 1,2,...,N (x, β , γ) are always zero for eqs. 1, and identically for any of their linear combinations. The linear combinations to extract the unbiased estimators from the vector S 1,2,...,N (x, β , γ) and U 1,2,...,N (x, β , γ) are given by the following two 2× 2 matrices R and I (from now on, the indications 1, 2, . . . , N of the type of PDFs f j (x) will be subtended): The matrices R and I are formally identical to the Fisher information for Gaussian PDFs, homoscedastic for R and heteroscedastic for I. The integral form of equation 7 serves to save an analogy with ref. [1], but it is irrelevant here. The transposed matrices are indicated as R ′ or I ′ . For their symmetry, we often neglect to distinguish them from their transposed.
The Gaussian model of ref. [1] shows that equation 7 is even the integrals of the likelihood L(x, β , γ) with products U i × U j . This proof requires the the commutation of derivatives with integrations. Unfortunately, these commutations are impossible for irregular models. Instead, a direct calculation with the equations 1 gives the last matrix of equation 7. The study of the variances requires the definition of the unbiased estimators. The unbiased estimators T 1 (x), T 2 (x) are obtained from the vector U : The matrix I −1 and the two estimators T 1 (x), T 2 (x) are: For their special linear combinations of U components, the integrals of T 1 (x) and T 2 (x) with the likelihood L(x, β , γ) give β and γ: As previously anticipated, the direct calculation of the integrals of the mean values of the products U i U j (with equations 1) gives: the matrix of equation 11 coincides with I. Differently from ref. [1], no (impossible here) double differentiation of the logarithm of the likelihood is required. The variance-covariance matrix for T 1 (x), T 2 (x) becomes: A very similar procedure is done on the vector S(x, β , γ), their unbiased estimators are: As previously done for other expressions, even the cross-correlation matrix of the estimators is given by a direct integrations of the terms in square brackets: These results are identical to those obtained in ref. [1] for Gaussian PDFs, but now they are derived with the sole use of the equations 1, therefore valid for irregular and regular models.

The variance inequalities
Here, we abandon the method of refs. [1,9] and the use of positive semi-definite matrices, for the more direct Cauchy-Schwarz inequality. The matrix of equation 14 contains the integral: which implies the following Cauchy-Schwarz inequality (with the trivial substitution of the likelihood L = √ L √ L): The equality is excluded because T a (x) and T 1 (x) are never proportional for N > 2 (and true heteroscedastic systems). For N = 2, the minima of M w and M s coincide (we use this condition in fig. 1 to check the consistency of the code). Equation 12 states that the matrix element (I −1 ) 1,1 is just the variance of T 1 (x), therefore, the inequality becomes: The inequality for T b (X) and T 2 (x) is obtained with equation 15 to equation 17 and the matrix element (I −1 ) 2 2 , : Again, the equality is excluded because T a (x) and T 1 (x) are never proportional for N > 2.
Up to now, we supposed the absence of any effective variance in the standard least squares. But often, the difference of detector technologies is introduced with a global effective variance σ a l for each hit in a detector layer. It is easy to demonstrate that the precedent inequalities continue to be valid even in this case. In fact, the correct likelihood attributes to each hit its optimized variance, and this overcomes the global effective variance. This demonstration introduces small modifications with respect to the previous one, essentially, the R matrix contains now all these modifications. In any case, the cross-correlation matrix is always generated by the product R −1 R I −1 .

Extension of inequalities for momentum estimators to irregular models
More important than the estimators for the straight tracks are the estimators for fits of curved tracks in magnetic field. Therefore, we will calculate these inequalities for general irregular models. However, due to 1, they remain valid even for regular models.
We will limit to a simpler case of high-momentum charged particles in a tracker with homogeneous magnetic field orthogonal to the particle track. The segment of a circular path can be approximated with a parabola. As for straight tracks, the standard least-squares fit and the weighted one will be compared. The likelihood function with the addition of curvature η for the N observations (hits) is: The two mean squares of observations, M s (homoscedastic) and M w (heteroscedastic), are: (20) The definition of the unbiased estimators requires the introduction of the vectors S(x, β , γ, η) and U (x, β , γ, η): To save the similarity with the form of U (x, β , γ, η) the condition ∑ j y j = 0 is not implemented. The vector U (x, β , γ, η) is: Even here, the integrations with the likelihood are irrelevant. On these irregular models, the integrations of products of S · S ′ and U ·U ′ must be performed on their forms 21 and 22.
The vectors U and S give the unbiased estimators T 1 , T 2 , T 3 and T a , T b , T c : The direct integrations of the products U i ·U j with the likelihood give: and the variance-covariance matrix becomes: The cross-correlation matrix of the estimators T 1 (x), T 2 (x), T 3 (x) and T a (x), T b (x), T c (x) is given by a direct calculations of the integrals. The integrations in the square brackets give R: and the cross-correlation of T c (x) with T 3 (x), the unbiased estimators of the curvature, is: Equation 30 produces a Cauchy-Schwarz inequality: (remembering that the likelihood can be written as the equality is excluded because T c (x) and T 3 (x) are never proportional for N > 3. For N = 3, the minima of M w and M s coincide. For equation 29, the matrix element (I −1 ) 3,3 is the variance of T 3 (x). Therefore, the inequality for the curvature becomes: This inequality extends the validity of the results of our simulations of ref. [4] to any heteroscedastic system. The inequalities for β and γ can be obtained with a similar procedure. These demonstrations are directly extendible to a number of parameters greater than those considered here.

Effects of the inequalities on the resolution of the estimators
Up to now we discussed the formal proofs of variance inequalities, attributing to figure 1 the illustration of their effects. We will show the direct connections of the variance inequalities to the line-shapes of figure 1 and in the corresponding figures of refs. [2,1]. To simplify the notations, we will take β and γ equal to zero (as in the simulations). It must be recalled that the illustrated simulations deal with a large number of tracks with different successions of hit quality (σ j 1 , · · · , σ j N ). For each succession of σ j 1 , · · · , σ j N , extracted from the allowed set of {σ k }, we have a definite form of the estimators T 1 (x) and T 2 (x) (weighted least-squares). Instead, T a (x) and T b (x) (standard least-squares) are invariant. The studied estimator is the track direction γ, thus T 2 (x) and T b (x) are the selected estimators.

The line-shapes of the estimators T 2 (x) and T b (x)
For any set of observations x j 1 , · · · , x j N , the estimator T 2 (x)gives a definite value γ * . The set of all possible γ * has the probability distribution P j 1 ,··· , j N (γ * ) (with the method of ref. [4]): This integral, for the linearity of T 2 in the observations x, gives a weighted convolution of the functions f j 1 (x 1 ) f j 2 (x 2 ) · · · f j N (x N ). For the convolution theorems, the variance of P j 1 ,··· , j N (γ * ) is given by the (I −1 ) 2,2 of eq. 12 with the corresponding σ j : In the case of ref. [1,2], the convolution of Gaussian PDFs gives another Gaussian function with the variance Σ j 1 ,··· , j N , therefore P j 1 ,··· , j N (γ * ) becomes: The mean value of P j 1 ,··· , j N (γ * ) is zero, as it must be for the unbiasedness of the estimator, and its maximum is 1/ 2 π Σ j 1 ,··· , j N . A similar procedure of eqs. 33-35 can be extended to the standard least squares with the due differences In this case the variance of the convolution is defined: and, for Gaussian PDFs, the probability P b j 1 ,··· , j N (γ * b ) is the function: The maximum of P b j 1 ,··· , j N (γ * b ) is 1/ 2 π S j 1 ,··· , j N . For the inequality 18, it is: The equality is true if and only if all the σ j of the track are identical.
The law of large numbers allows to determine the convergence of the line-shape toward the function: for the weighted fits. Analogously for the standard fits: The probability of the sequence of hit quality is p j 1 ,··· , j N . The law of large numbers gives even an easy way for an analytical calculation of the maximums of the two distributions: where the index k indicates a track of a large number of simulated tracks, Σ k is the corresponding variance calculated as in the definition of (I −1 ) 2,2 and S k with the expression (R −1 ) 2,2 . The inequality 38 assures that Π(0) > B(0). For our 150000 tracks the convergence of the maximums of the empirical PDFs to eqs. 41 is excellent for Gaussian PDFs. For the rectangular PDFs of fig. 1, the results of eqs. 41 are good for the standard fit but slightly higher for weighted fits (3 ∼ 4%). The strong similarity of the line-shapes of the rectangular PDFs with the Gaussian PDFs is due to the convolutions among various functions f j (x) that rapidly tend to approximate Gaussian functions (Central Limit Theorem). The weighted convolutions of eq. 33 has higher weights for the PDFs of good hits. These contribute mostly to the maximum of the lineshape. Unfortunately, the good hits have a lower probability and the lower number of the convolved PDFs produce a more approximate Gaussian than in the case of standard fit. In fact, in the probability P b j 1 ,··· , j N (γ * b ) of the standard fit, the weights of the convolved functions are identical and the convergence to a gaussian is better (for N not too small). Figure 1 evidences the large differences of the direction PDFs of the weighted least squares fits compared to the PDFs of the standard fits. For N > 3 the PDFs of the standard fits have negligible differences from a gaussian. Instead, the PDFs of the weighted fits are surely non-Gaussian. The definition of resolution becomes very complicated in these cases. In our previous papers, we always used the maximums of the PDFs to compare different algorithms. Our convention was lively contested by our readers of ref. [2], they sustained that the usual definition was the standard deviation of all the data. It is evident the weakness of this position for non Gaussian distributions. In fact, the standard deviation (as the variance), for non Gaussian PDF, is principally controlled by the tails of the distributions. Instead, the maximums are not directly effected by the tails and they are true points of the PDFs. In addition to this, a decreasing resolution, when the resolving power of the algorithm/detector increases, easily creates contradictory statements.

Resolution of estimators
Let us justify our preference. The resolution indicates a property of an algorithm/instrument to discriminate among near values of the calculated/measured parameters. It is evident that the maximum discriminating power is obtained if the response of the algorithm/instrument is a Dirac δ -function. In this case, the parameters are confused if and only if they coincide. Therefore, the algorithm with the best resolution is that with the maximal similarity with a Dirac δ -function. Among a set of responses, it is evident that the response with the higher maximum is the most similar to a Dirac δ -function and thus that with the largest resolution. This is the reason of our selection. However, this is a conservative position. In fact, in recent years it is easy to find, in literature, more extremist conventions. It is a frequent practice (as in ref. [11] for example) to fit with a piece of Gaussian the central part (the core) of the distribution and to take the standard deviation of this Gaussian as the measure of the resolution. The Gaussian toy-model allows to calculate the variance of a Gaussian fitted to the core of the PDFs. A method to fit a Gaussian is to fit a parabola to the logarithm of the set of data. The second derivative of the parabola, at the maximum of the distribution, is the inverse of the variance of the Gaussian. This variance can be obtained by the function Π(γ * ) of equation 39 and equation 41: 1 The common factors are eliminated in the last fraction. Gaussian functions, with standard deviations Σ e f f , reproduce well the core of Π(γ * ) even if the Gaussian rapidly deviates from Π(γ * ). Instead, the Gaussian functions, with the height coinciding with Π(0), are outside the core of Π(γ * ), their full width at half maximum is always larger. Figure 2 illustrates few of these aspects for three different toy-models.  fig.1. Black line, the fraction of red PDF due to tracks with two or more good hits. Green line: the fraction of the red PDF due to one or zero good hits The right plot reports the heights 1/ √ 2 π Σ l where Σ l is a variance for each N={2, · · · , 13}. The variances Σ l are: green asterisks: variance of standard fits, red asterisks variances of weighted fits. Ciano asterisks: the variance Σe f f . Magenta asterisks maximums of red line of figure 1 (even for Gaussian and triangular toy models), black asterisks: the results of Π(0).
The left side of fig 2 reports the line-shapes for N=7 layers of the three toy models: the Gaussian of refs. [1,2], the rectangular of fig. 1 and a toy-model with triangular PDFs. These three type of PDFs are those considered by Gauss in his paper of 1823 on least squares method. The line-shapes are almost identical with small differences in their maximums. The red PDFs are formed by the sums of the black and the green distributions, respectively the fraction of the red PDF given by the tracks with two or more good hits, and the fraction of the PDF given by tracks with less than two good hits. It is evident the essential contribution of the black distribution to the height of the red PDF. The PDFs with the lowest maximums are those of the rectangular toy-model and the highest maximums are those of the Gaussian toy-model.
The right side of fig 2 reports the heights 1/ √ 2πΣ l of Gaussian PDFs with variance Σ l and the maximums of the empirical PDFs. The form of equation 42 is a weighted mean of the inverse of the variance of P j 1 ,··· , j N (γ * ), with a weight proportional to the maximum of its P j 1 ,··· , j N (γ * ). This form allows its extension to rectangular and triangular toy-models even if their derivatives can be problematic. For the standard fits all these heights are almost identical beyond N=3. The case N=2 is dealt in Sec.2.2, for N=3 the standard fit tends to have the estimators very similar to those fof N=2, due to y 2 = 0 that suppresses the central observations. For the heteroscedastic fits all the effective heights 1/ √ 2πΣ l are largely different. Hence, the gain in resolution, with respect to the homoscedastic fit, has values: 1.34, 2.4 and 4.1, for the model with seven detecting layers. In this case, our preferred value is 2.4 times the result of the standard fit. However, we prefer to correlate the obtained PDFs to physical parameters of the problem. In ref. [4], we used the magnetic field and the signal-to-noise ratio to measure the fitting improvements. At last, we have to remember that the non-Gaussian models allow a further increase in resolution with the maximum-likelihood search. References [3,4] show the amplitude of the increase in resolution due to the maximum-likelihood search. If the triangular and rectangular-toy models have similar improvements, the Gaussian-model would be the worst of all.

Conclusions
Inequalities among variances of the least-squares estimators are calculated for heteroscedastic systems. They demonstrate that the standard least squares method has not the minimal variances. Thus, it is non optimal according to the usual definition of optimality. These developments are explicitly constructed for irregular models that can not be handled with the Cramer-Rao-Frechet approach. However, the conditions imposed to the probability density functions are valid even for regular models. Thus, any mixtures of regular and irregular models are equally dealt in these studies. The linearity of the least-squares estimators (weighted or standard) are key features for the demonstrations. To test the amplitudes of the improvements due to these inequalities, the results of simulations with one irregular models are reported. The simulation has probability density functions of rectangular forms. This form is used as an example of probability model intractable with the Cramer-Rao-Frechet method. The effects of the proved inequalities is connected to the formation of the line-shape of the figures. The probability distributions for the direction estimators is defined in the two different fitting methods. The law of large numbers is used to reconstruct the maximums of the plotted distributions. For the Gaussian distributions the agreement if excellent. For the rectangular distributions slight differences are observed. It is calculated even the Gaussian function that fits the core of the model distributions.