The Cramer-Rao inequality to go beyond the $\mathbf{\sqrt{N}}$-limit of the standard least-squares method in track fitting

The Cramer-Rao-Frechet inequality is reviewed specializing it to track fitting. A diffused opinion attributes to this inequality the limitation of the resolution of the track fits with the number N of observations. It turns out that this opinion is incorrect, weighted least squares method is not subjected to that N-limitation. In a previous publication, simulations with a simple Gaussian model produced interesting results: a linear growth of the peaks of the distributions with the number $N$ of observations, much faster than the $\sqrt{N}$ of the standard least squares. These results could be considered a violation of a well known $1/N$-rule for the variance of an unbiased estimator, frequently reported as the Cramer-Rao-Frechet bound. To clarify this point beyond any doubt, it would be essential a direct proof of the consistency of those results with this inequality. Unfortunately, such proof is lacking or very difficult to find. Hence, the Cramer-Rao-Frechet developments are applied to prove the efficiency (optimality) of the simple Gaussian model and the consistency of its results. The inequality remains valid even for irregular models supporting the results of realistic models with similar growths.


Introduction
Very recently criticisms and doubts were raised against our paper (ref. [1]) on track fitting. The contested results were principally those of the following fig. 1. This figure illustrated (Monte Carlo) simulations of a very simple tracker with N detecting layers of identical technology (silicon microstrips) crossed by a set of parallel straight tracks of minimum ionizing particles (MIPs). The reported estimator was the track direction. x 10 4 Figure 1: The simple Gaussian model of ref. [1]. Distributions of the fits to the track directions for tracker models with N=2 to N=13 detecting layers. The first distributions are centered on zero, the others are shifted of N-2 steps of fixed amplitude. The blue distributions are the results of the standard least-squares. The red distributions are obtained with the weighted least-squares.
The hit uncertainties were approximated with Gaussian distributions with two type of hit quality (heteroscedastic model). One of very good quality when the MIP crosses the detector near to the borders (20% of strip width), and the other of bad quality (80% of the strip width) around the strip center. A slight random mechanical misalignment was supposed with mean value around a half strip (30 µm). Random misalignments much larger than this are always present in real trackers. This misalignment (rotations and translations) is simulated as a randomization of the hit position on the strips producing a binomial distribution of the hit quality. The collection of N hits (a track) is fitted with two different least squares method, one with the standard least squares and the other with the weighted least-squares with the weights given by the hit quality. The track direction is the illustrated estimator. With standard least-squares, we define the usual least-squares where all the used observations have identical variances (homoscedastic models). The standard least-squares, built for homoscedastic models, when applied to heteroscedastic models gives results almost insensitive to the presence of the good hits, and the maximums of the empirical probability density functions (PDFs) show a growth very near to √ N. The blue lines of fig.1. Instead, weighted least-squares, consistent for the heteroscedastic models, reports a surprising fast increase, almost linear, of the maximums of the simulated PDFs. Translating these results on mean variances, the standard fit shows a 1/N trend and the weighted fit shows a trend of the mean variances as 1/N 2 . This trend 1/N 2 was considered a violation of the Cramer-Rao-Frechet (CRF) bound as reported in many books and documentations. Our arguments about the different hypotheses used to prove the 1/N rule and those of our approach had no effect. To settle any type of contrast, a complete demonstration is required to prove our consistency with the CRF-inequality. It is evident that this long demonstration could not be inserted in ref. [1]. The main part of ref. [1] deals with physical models with Cauchy-like tails and (in theory) with infinite variance. Hence, any finite bound on the variance given by the CRF-inequality is automatically satisfied by them. However, the strong similarity of the physical models with the Gaussian model could extend them the suspects of possible inconsistencies. Thus, to give a strong support to our results we review the CRF inequality to show the inappropriate extension of the rule 1/N to our approach. For this, we will apply the CRF developments to heteroscedastic models and to the least squares method for track fitting (for straight tracks and curved tracks) with a special attention to our Gaussian model. The last two points are very difficult to find in literature. Surely other authors did similar developments and we apologize to not citing them. In any case, the enormous statistical samples of tracks (of the order of 10 12 tracks per year) renders very important careful studies of any detail of their fitting methods. During our analysis of the track fitting we had to read carefully many papers (in any case a small portion of those published) on this subject and we found inaccurate fitting strategies. Probably the reason of this inaccuracy is connected to the weak growth as √ N of the standard least-squares (the Kalmann filter is essentially identical) that is negligibly modified by any improvement. Another element of inaccuracy is a frequent claim of a "measure" the hit resolutions. In reality, those parameters are not a "measure", but an application of an equation valid only for homoscedastic models, inappropriate for the physics of the trackers (surely always heteroscedastic). Instead, our fitting methods of ref. [1,2,3] open very different possibilities. The heteroscedastic models allow drastic improvements of the quality of the track fitting. In spite of the possible doubts, they are perfectly consistent with the CRF inequality. Instead, the CRF inequality demonstrates the inefficiency of the standard least squares for heteroscedastic models. The lucky model of ref. [1] is a first order tool to insert heteroscedasticity in track fitting and able to produce consistent improvements.

The standard CRF-inequality
The developments of this section follow as far as possible ref. [4] and ref. [6]. As usual it is considered a N-times repetition of random observations {x 1 , x 2 . . . x N } described by an identical probability density function (PDF) f (x, θ ), where θ is a (scalar) parameter. These random variables x = {x 1 , x 2 . . . x N } are indicated as independent and identically distributed or i.i.d. random variables. We will use this notation in the following, without commenting the absolute physical inconsistency of this assumption.
The probability of the set x of i.i.d. random variables (likelihood function) is: As usual f (x, θ ) is positive, normalized and supposed differentiable with respect to the θ -parameter. Let us define the random variable U (x, θ ): The following developments require to differentiate the integrals of f (x, θ ) and invert the order of integration and differentiation. The functions that allow these operations are called regular models and they imply strict analytical conditions. A necessary condition of regular models is the independence of the range of f (x, θ ) from the θ -parameter. The normalization of the f (x, θ ) is: Differentiating eq. 3 with respect to θ and inverting the integral with the differentiation gives: The second line of eq. 4 is given by a property of the logarithmic derivative and it implies that the mean value of U (x, θ ) is zero. Instead, its mean square is different from zero: The first term of eq. 5 is called Fisher information of the sample, the last integral is the Fisher information contained in a single observation. For the i.i.d. random variables the Fisher information is N-times the Fisher information of a single observation.
This originates the 1/N factor in the CRF-bound. It will evident that for non i.i.d (heteroscedastic) random variables this factorization disappears. We underline the i.i.d. assumption in eq. 5 because often this condition is a standard assumption (as in ref. [4]), and it can only be recovered with an accurate re-reading of the intere demonstration and doing another demonstration without the i.i.d assumption.
If f (x, θ ) is twice differentiable with respect to θ , the last term of eq. 4 gives the following identity: The Fisher information i(θ ) for a single observation becomes: The last form of i(θ ) is particularly effective for Gaussian PDF.

The CRF inequality
At this point is is easy to obtain the CRF-inequality for these i.i.d. random variables. Defining an estimator T (X) = T (x 1 , x 2 . . . x n ) and its mean value: differentiating eq. 8 in θ we obtain: The form of the last integral is allowed by the positivity of L(x, θ ). The subtraction of τ(θ ) from T (x) does not modify the integral for eq. 4. The Cauchy-Schwarz inequality gives: or in its final form: The variance of T (x) can not be lower than the right side of eq. 11. This is the well-known CRF inequality that creates many doubts to our readers. When the CRF inequality becomes an identity, the estimator is defined efficient.

Efficient estimators
The Cauchy-Schwarz inequality becomes an equality if and only if T (x) and U (x, θ ) are linearly dependent: Hence, if an efficient estimator exists, it is a function of the model. The logarithmic functions contained in U (x, θ ) allow to find efficient estimators only in exponential models.

The standard mean with an i.i.d. Gaussian model
The PDF f (x, θ ), the mean value τ(θ ) and the Fisher information i(θ ) are: The last of eq. 13 gives the efficient estimator of θ for this model as defined by eq. 12. Its unbiased form T e (x) = T e (x) − θ , inserted in eq. 11 gives the identity σ 2 /N = σ 2 /N.

CRF inequality for heteroscedastic models
It is well-known by long time that different observations have always non identical PDF. The universe would be very simple if the observations were i.i.d., in this case the precision of a measure could be improved at any level by the the simple repetition and averaging the results. Gauss knew very well this problem and in his paper of 1823 extended his least squares method to handle observations of different PDF. For our misfortune the standard books of mathematical statistics do not cover this argument. But to complete our confutation of critics moved to ref. [1], we have to venture in this field. As in the previous section, we suppose a set of observations obtained by an array of different detectors as in a tracker. Even if all the detectors are built with identical technology, differences are always present. Each detector is anisotropic for its construction, optimized to the measure of positions. The signal spreads in a different recognizable way for each hit point. Moreover, each observation has a quantum mechanical interaction that renders unrepeatable the observations. However, a PDF for each observation can be defined (ref. [3]) even if very different from the form required for the CRF-inequality. However, the likelihood of a set of heteroscedastic observations becomes: The evident difference is an index attached to each PDF. Instead of the awful PDFs of ref. [3,2], we will use an easy model. The simple model of ref. [1] is a heteroscedastic Gaussian model (independent non identically distributed Gaussian random variables), hence, it has all the analytical properties required for the CRF-inequality. Again U 1,2,...,N (x, θ ) is defined as: Even the Fisher information acquires a set of indices, but now all the PDF of the observations are different as the Fisher information of each observation. The factorization of eq. 5 disappears with its N factor and its supposed generality. All the other equations are formally identical up to the first line of eq.10 ( a part an additional index to the PDFs ). With the new Fisher information the CRF-inequality for an unbiased estimator T 1,2,...,N (x) becomes: Supposing a large set of possible PDFs ( as those of ref. [3] ) the estimator T 1,2,...,N (x, θ ) must keep the track of the types and the order of the PDFs contained. It must be substituted with T j 1 , j 2 ,... j N (x, θ ), similarly for the likelihood function L j 1 , j 2 ,... j N (x, θ ) and the Fisher information. In this extended form the CRF-inequality becomes: The efficient estimators have the variance identical to the CRF-bound. Again the logarithmic functions of eq. 15 limit the definition of efficient estimators to exponential PDFs.

The weighted mean in heteroscedastic Gaussian models
Let us consider the standard mean T a 1,2,...N (x) = ∑ N i=1 x i /N as an estimator in heteroscedastic Gaussian models. The PDFs f j (x, θ ), the mean value τ 1,2,...N (θ ) and the Fisher information due to the j-PDF i j (θ ) are: With the variance of T a the CRF-inequality is: The equality is impossible for the assumption of the heteroscedasticity (the identity of the σ j is excluded). Let us extract from eq. 15 an efficient estimator defined (if it exists) as usual: This estimator is proportional to ∂ ln L 1,2,...,N (x, θ )/∂ θ and transforms the Cauchy-Schwarz inequality in an identity. It is easy to verify, without the CRF inequality, that the variance of the standard mean is always greater than the variance of the weighted mean.
The second equation is the Cauchy-Schwarz inequality for the two vectors {σ j } and {1/σ j } . The equality is impossible because σ 2 and 1/σ 2 can never be proportional excluding the trivial case of identical σ j . Heteroscedasticity excludes this case. Thus, in heteroscedastic models, the weighted average has always minor variance compared to the standard mean for regular and irregular models.
Hence, each combination of observations from our Gaussian model satisfies the CRF inequality. No 1/N factor is present in the equation for a different Fisher information of each single observation. The case of identity of all the observations is excluded a priori. Moreover, the mean of the standard variances of a set of random sequences { j 1 , j 2 . . . , j N }, not all identical, is higher than the mean of the corresponding weighted averages. The absence of 1/N factors in the variance of the weighted average allows fast decreases with N that is impossible to the variance of the standard mean.

The weighted mean in a homoscedastic model
To prove another aspect of the CRF-inequality, we use the weighted mean outside its definition model (heteroscedasticity) and we use it in a homoscedastic model. In this case the standard mean is the efficient estimator. The CRF-inequality states that the variance of the weighted mean is greater than the variance of the standard mean. It is easy to illustrate this fact (the reverse of the last equations).
The last inequality is the Cauchy-Schwarz inequality. The equality is excluded. Thus an efficient estimator in a model, outside its model has always a greater variance compared to the efficient estimator of the new model.

CRF-bound for heteroscedastic least squares of straight tracks
The application of the CRF-inequality to heteroscedastic least-squares fit for straight tracks adds other complications. Two parameters must be handled. The first parameter is the constant β : the impact point of the track in the reference plane. The second is γ y i : the angular shift of the track in the plane distant y i from the reference plane. Also here, the heteroscedasticity eliminates any easy rule with the increase of the number (N) of the observation planes for two different reasons. One is again the differences of the σ i ; the new one is due to the constraint of the fixed length of the tracker. Each insertion of a new detecting plane requires a repositioning of all the others. We will study our Gaussian model with the condition ∑ N j=1 y j = 0, this introduces some simplification on standard least squares equations. The likelihood function is similar to that of eq. 14, but now, the order of the PDFs must be conserved. The parameter y j orders the measuring planes: The function U 1,2,...,N (x, θ ) is the 2x1 matrix U 1,2,...,N (x, β , γ): The Fisher information I 1,2,...,N becomes the 2x2 matrix: Its inverse is: The CRF inequality assumes the form of a matrix inequality, we limit to relations among variances: Diag.
where Var(β , β ) and Var(γ, γ) are the variances of β and γ: and: The Gaussian model is efficient because its variances are equal to the diagonal of I −1 (here and in the following the indications 1, 2, . . . , N of the used PDFs will be subtended). The unbiased efficient estimators of β and γ for the model of eq. 25 are given by I −1 U (x, β , γ) (similar to those for a single parameter): The last equation allows the extraction of the formal expressions of the unbiased efficient estimators for this model (and are identical to those reported everywhere for example in ref. [5]). The mean values of products of logarithmic derivatives are obtained by the mean values of double derivative of a single logarithmic function (as in eq. 6), and due to eq. 28 gives: The second equation uses the first equation giving the Fisher information and the symmetry I −1 = I −1 ′ . The diagonal terms are the variances of the efficient estimators for β and γ. It is evident the privileged position of the Gaussian PDFs in the CRF-inequality.

The standard least-squares equations with heteroscedastic likelihood
We defined standard least squares equations the estimators (for β and γ) obtained with a homoscedastic model. It is easy to prove that the standard least squares estimators are efficient estimators for a Gaussian model with identical σ s. The extension of eqs. 13 to the case for two parameters shows immediately this property. The condition ∑ N i=1 y i = 0 simplifies the forms of the estimators. In the case of identical σ s the CRF-inequality states that any other estimator has a greater variance compared to the efficient estimators. It is evident that the efficiency of an estimator is strictly bound to the likelihood model of its definition (in a given model the efficient estimator of a parameter is unique) . Outside this model the estimators are no more efficient. We could derive the forms of the standard least squares estimators as the efficient estimators for the Gaussian model with identical σ j , but the usual derivation is faster. The estimators of β and γ of the standard least squares are: For an easy check of the forms of the estimators, we report the least squares equations that give their expressions (after inserting ∑ i y i = 0). The heteroscedastic likelihood L(x, β , γ) destroys their efficiency (optimality) and the new variances are different from those in the homoscedastic model (σ 2 /N and σ 2 / ∑ j y 2 j ): and, due to the differences from the variances of the efficient estimators, the CRF inequalities of eq. 32 and eq. 33 impose to them to be grater. We will extensively prove this property as a prototype of demonstration [7] that will be used even for the momentum variances.
First of all, it is required the joint variance-covariance matrix of T s β ,γ with the efficient estimators . This matrix is positive semi-definite (symmetrical with the variances in the diagonal). Let us define with C this matrix, with T the column vector T = { T s β , T γ s } and with < > the average on the likelihood of eq. 25. The matrix I −1 is the variance-covariance matrix of the efficient estimators (eq. 35) It is easy to show this general property of the covariance of the unbiased estimator T β s with the efficient estimators. By definition, the mean value of an unbiased estimator is always zero as any of its derivative: The derivative of the likelihood is the first component of U (x, β , γ) of eq. 27, and this correlation becomes equal to one. The derivative in γ of the mean value of T β s has no one and the sought correlation is zero. Similarly for the other estimator T γ s . Hence the matrix < T U ′ > is: The I −1 is a symmetric matrix. The positivity of C is conserved even with transformation A ′ C A with A any real matrix of four lines and two columns (few details about the positive definite matrices are reported in the appendix): The last of the eq. 41 is eq. 31 as a matrix relation the equality is eliminated being evidently impossible. For the definition of positive definite matrices it is straightforward to extract the relations among the variances. The eqs. 31 to 41 will be extended the estimators of the momentum reconstructions. Even if it was demonstrated with a heteroscedastic Gaussian model, eqs. 41 remain always greater also for heteroscedastic irregular models with equivalent set of {σ j }. The efficient estimators of heteroscedastic models continue to be efficient even for homoscedastic models, converging to them when all the σ j become identical.
All this set of demonstrations for heteroscedastic Gaussian model is addressed to each given chain of PDFs with a defined order of σ j . Their variances of eq. 32 and eq. 33 are strongly modified by the differences among the {σ j } without any evident N limitation. Instead, the variances of the standard least squares are weakly modified and they save explicit N limitations. For example the variances of the β parameter has averages ≈ σ 2 m /N. Figure 1 illustrates this limits for the γ estimator (the blue distributions). Due to the heavy modifications of the variances of the chains of PDFs, other PDFs become essential. For example those that select the chains among the possible set of chains. These PDFs are introduced by the properties of the measuring devices. As we said above, a binomial PDF suffices in the disordered tracker detector for the models of refs. [1,2], just a mean disorder of half strip (≈ 30 µm) is enough for this. The use of standard least squares method in heteroscedastic models has enormous simplifications, because it does not need the knowledge of the σ j for each observation. Unfortunately, it amounts to a large loss of resolution. Our models of ref. [1] illustrate in evident way the amplitude of this loss. In any case the lucky model of ref. [1] can be an economic way to introduce effective σ j . For our pleasure, in fig. 2, we invented a faster growth than that of the Gaussian model of ref. [1]. The quality of the detectors increases with the number N of detecting layers, each inserted layer is better than the previous one. The bleu straight line shows an evident deviation from a linear growth with an almost parabolic growth. The blue distributions are practically insensible to these changes. With cleaver selection of the hit quality many types of growth can be implemented.

Deviations from optimality
Equations 41 proved the non-optimality of the standard least-squares in the case of it use with a different (heteroscedastic) likelihood. The standard least squares gives the efficient estimators for a homoscedastic Gaussian model, the transportation outside its condition of efficiency (optimality) destroys its optimality and produces an increase of variances with a degradation of the resolution as illustrated by the simulations of ref. [1,2,3]. Even our heteroscedastic model suffers of a similar non-optimality loss outside its range of validity. We underlined that the orders and the values of the sets of {σ j } define the model, each deviation from the order and values implies an increase of the variances compared to the optimality condition. We used this property to prove the essential connection of the appropriate σ j with the hit j. If the ordered set of {σ j } is randomized, destroying the correlation with the set of the hits, the distributions of the estimators change drastically becoming worse than the standard least squares. Instead, the differences of the heteroscedastic efficient estimator minus a suboptimal estimator with few eliminated PDFs rapidly converge as the eliminated PDFs are reduced. This rapid convergence is not observed in the case of the standard least-squares.

Inequality for Momentum estimators
As stated above the CRF-inequality can be extended to case of the momentum reconstruction. It will be studied the case of a high-momentum charged-particle moving in a tracker with a homogeneous magnetic field and the path of the particle is orthogonal to the field direction. Here, the circular path can be approximated with a parabola. As for the straight tracks we will compare the estimators of standard least squares with those of a heteroscedastic model. The estimators of the standard least squares are efficient (optimum) in the homoscedastic Gaussian model.

The momentum estimators in homoscedastic Gaussian model
We use this approach to define our expressions for the estimators. These forms will be used in heteroscedastic models. Our equation will be expressed in compact way with matrices. The function L(x, β , γ, η) is the likelihood for this case, β and γ are just defined, η is the curvature of the track, proportional to 1/p with p the track momentum: The index j orders the PDFs f j as the tracker layers. The equivalent of eq. 27 is: The unbiased estimators for β , γ, η are: The vector R −1 V (x, β , γ, η) contains the efficient estimators of this model, the matrix R is its the Fisher information (with the condition ∑ j y j = 0 and j ={1,2,...,N}): For eq. 35 extended to this case, it is: The matrices R and R −1 are symmetric and positive semi-defined, and the second line of the last equation is the variance-covariance matrix of the efficient estimators. The variance-covariance matrix of a generic vector T of unbiased estimators of β , γ, η with the efficient estimators is: As defined, the symbol <> means an integral with the likelihood L, the cross correlations < V T ′ > follow the rules of eq. 39 giving now the 3x3 identity matrix I 3 reducing the off-diagonal elements of K to R −1 . With a transformation of the type A ′ K A with a 6x3 real matrix A the fundamental inequality is obtained: The equality is obtained only for the efficient estimators of this model. Therefore, any other set of estimators have a greater variance in this model. Any heteroscedastic model has greater variance than the efficient model. The reason is connected to the modification that the likelihood L introduces in the variances. These modifications always increase the variances. In this model, the parameter σ disappears in the expressions for the estimators. The form R of the Fisher information depends from N, this limits the growth with the numbers of detecting layers as illustrated in ref. [2].

The momentum estimators in heteroscedastic Gaussian model
In its essence this model can be obtained from the corresponding homoscedastic model with a set of formal differences.
The index j orders the PDFs as the detecting layers and selects a different σ j for this observation. The other differences are in the definitions of the vector T 1 , T 2 , T 3 that are different from the previous T a , T b , T c and the Fisher information I.
The Fisher information becomes: The efficient unbiased estimators for the likelihood L(x, β , γ, η) are contained in the vector I −1 U (x, β , γ, η). Even in this case the Fisher information has no obvious N-dependence. The variance-covariance matrix for any vector T of unbiased estimators can be transformed as previously described giving the inequality: Here, the symbol < > indicates an integral with the likelihood of this model L(x, β , γ, η). The equality to zero is obtained only when the vector T is I −1 U (x, β , γ, η). For any other unbiased estimator, that difference is greater than zero (in the sense of the positive definite matrices).

Conclusions
The Cramer-Rao-Frechet inequality is applied to a set of heteroscedastic Gaussian model to prove the absence of any limitation introduced by this fundamental inequality. Instead, the inequality shows that, in heteroscedastic models, the estimators of the standard least-squares have always greater variances compared to that given by the weighted least-squares. The variances of the standard least-squares conserve approximate 1/N form that imply limitations in the growth of the distributions with the number N of detecting layers. No similar limitations are present in the weighted least squares that are dominated by the differences of the weights 1/σ 2 j . For identical σ j the estimators converge to those of the standard least-squares. The inequality is calculated also for the momentum estimators. Again the appropriate heteroscedastic estimators have a lower variances than those of the standard least squares. Given the loss of resolution implied by greater variances it must be avoid the use of standard least squares with heteroscedastic data. The lucky model may be helpful in silicon detectors.

Appendix
Let us add few details about variance-covariance matrix or about the cross-covariance matrix. For their definitions these matrices are positive semi-defined (they are symmetric with the positive variances in the principal diagonal). If the matrix C is a positive semi-definite matrix it must be a symmetric mxm matric C ′ = C, and for any real vector z it must be z ′ Cz ≥ 0. From these definitions it follows that if A is a real matrix of mxn with n < m the matrix K given by A ′ C A is again a nxn positive semi-definite matrix. It is easy to show that the diagonal elements of K are given by z ′ Cz where each z is a column of A and the off-diagonal terms are symmetric. A positive semi-definite matrix implies an infinite number of inequalities (∞ m ), our interest is to find the minimal one with useful dimensions. The eqs. 41 48 52 are addressed to this. It is easy to show that the identity matrices I 2 and I 3 select the useful blocks of the matrices. AS an example we can elaborate the C matrix (4x4). In general one can use a 4x2 matrix of the form (I 2 , aI 2 ) with a any real constant and search the minimum in a: (I 2 aI 2 ) C I 2 aI 2 =< T T ′ > +(2 a + a 2 )I −1 ≥ 0 ∂ ∂ a < T T ′ > +(2 a + a 2 )I −1 ⇒ (2 + 2a) = 0 ⇒ B =< T T ′ > −I −1 ≥ 0 The solution a = −1 gives the last inequality. The inequality for each couple of variances is obtained with z ′ B z where z is (1, 0) for the first two and (0, 1) for the second two.