Regression with Gaussian Mixture Models Applied to Track Fitting

: This note describes the application of Gaussian mixture regression to track ﬁtting with a Gaussian mixture model of the position errors. The mixture model is assumed to have two components with identical component means. Under the premise that the association of each measurement to a speciﬁc mixture component is known, the Gaussian mixture regression is shown to have consistently better resolution than weighted linear regression with equivalent homoskedastic errors. The improvement that can be achieved is systematically investigated over a wide range of mixture distributions. The results conﬁrm that with constant homoskedastic variance the gain is larger for larger mixture weight of the narrow component and for smaller ratio of the width of the narrow component and the width of the wide component.


Introduction
In the data analysis of high-energy physics experiments, and particularly in track fitting, Gaussian models of measurements errors or Gaussian models of stochastic processes such as multiple Coulomb scattering and energy loss of electrons by bremsstrahlung may turn out to be too simplistic or downright invalid. In these circumstances Gaussian mixture models (GMMs) can be used in the data analysis to model outliers or tails in position measurements of tracks [1,2], to model the tails of the multiple Coulomb scattering distribution [3,4], or to approximate the distribution of the energy loss of electrons by bremsstrahlung [5]. In these applications it is not known to which component an observation or a physical process corresponds, and it is up to the track fit via the Deterministic Annealing Filter or the Gaussian-sum Filter to determine the association probabilities [6][7][8][9]. The standard method for concurrent maximum-likelihood estimation of the regression parameters and the association probabilities is the EM algorithm [10,11].
In a typical tracking detector such as a silicon strip tracker, the spatial resolution of the measured position is not necessarily uniform across the sensors and depends on the crossing angle of the track and on the type of the cluster created by the particle. By a careful calibration of the sensors in the tracking detector a GMM of the position measurements can be formulated, see [12][13][14]. In such a model the component from which an observation is drawn is known. This can significantly improve the precision of the estimated regression/track parameters with respect to weighted least-squares, as shown below. This introduction is followed by a recapitulation of weighted regression in linear and nonlinear Gaussian models in Section 2, both for the homoskedastic and for the heteroskedastic case. Section 3 introduces the concept of Gaussian mixture regression (GMR) and derives some elementary results on the covariance matrices of the estimated regression parameters. Section 4 presents the results of simulation studies that apply GMR to track fitting in a simplified tracking detector. The gain in precision that can be obtained by using a GMM rather than a simple Gaussian model strongly depends on the shape of the mixture distribution of the position errors. This shape can be described concisely by two characteristic quantities: the weight p of the wider component, and the ratio r of the smaller over the larger standard deviation. The study covers a wide range of p and r and shows the gain as a function of the number of measurements within a fixed tracker volume. It also investigates the effect of multiple Coulomb scattering on the gain in precision. Finally, Section 5 gives a summary of the results and discusses them in context of the related work in [12][13][14].

Linear Regression in Gaussian Models
This section briefly recapitulates the properties of linear Gaussian regression models and the associated estimation procedures.

Linear Gaussian Regression Models
The general linear Gaussian regression model (LGRM) has the following form: where y is the vector of observations, θ is the vector of unknown regression parameters, X is the known model matrix, c is a known vector of constants, and is the vector of observation errors, distributed according to the multivariate normal distribution N (0, V ) with mean 0 and covariance matrix V . In the following it is assumed that dim(y) ≥ dim(θ) and rank(X) = dim(θ).
If the covariance matrix V is diagonal, i.e., the observation errors are independent, two cases are distinguished in the literature. The errors in Equation (1) and the model are called homoskedastic, if V is a multiple of the identity matrix: Otherwise, the errors in Equation (1) and the model are called heteroskedastic. If V is not diagonal, the observation errors are correlated, in which case the the model is called homoskedastic if all diagonal elements are identical, and heteroskedastic otherwise.

Estimation, Fisher Information Matrix and Efficiency
In least-squares (LS) estimation, the regression parameters θ are estimated my minimizing the following quadratic form with respect to θ: Setting the gradient of S(θ) to zero gives the LS estimateθ of θ, and linear error propagation yields its covariance matrixĈ: It is easy to see thatθ is an unbiased estimate of the unknown parameter θ.
In the LGRM, the LS estimate is equal to the maximum-likelihood (ML) estimate. The likelihood function L(θ) and its logarithm (θ) are given by: The gradient ∇ (θ) is equal to: and its matrix of second derivatives, the Hesse matrix H(θ) is equal to: Setting the gradient to zero yields the ML estimate, which is identical to the LS estimate. Under mild regularity conditions, the covariance matrix of an unbiased estimator cannot be smaller than the inverse of the Fisher information matrix; this is the famous Cramér-Rao inequality [15]. The (expected) Fisher information matrix I(θ) of the parameters θ is defined as the negative expectation of the Hesse matrix of the log-likelihood function: Equation (4) shows that in the LGRM the covariance matrix of the LS/ML estimate is equal to the inverse of the Fisher information matrix. The LS/ML estimator is therefore efficient: no other unbiased estimator of θ can have a smaller covariance matrix.

Nonlinear Gaussian Regression Models
In most applications to track fitting, the track model that describes the dependence of the observed track positions on the track parameters is nonlinear. The general nonlinear regression Gaussian model has the following form: where f (θ) is a nonlinear smooth function. The estimation of θ usually proceeds via Taylor expansion of the model function to first order: where θ 0 is a suitable expansion point and the Jacobian X is evaluated at θ 0 . The estimate θ 1 is obtained according to Equation (4). The model function is then re-expanded at θ 1 , and the procedure is iterated until convergence.

Linear Regression in Gaussian Mixture Models
If the observation errors do not follow a normal distribution, the Gaussian linear model can be generalized to a GMM, which is more flexible, but still retains some of the computational advantages of the Gaussian model. Applications of Gaussian mixtures can be traced back to Karl Pearson more than a century ago [16]. In view of the applications in Section 4, the following discussion is restricted to Gaussian mixtures with two components with identical means. The probability density function (PDF) g(y) of such a mixture has the following form: where ϕ(y; µ, σ) denotes the normal PDF with mean µ and variance σ 2 . The Gaussian mixture (GM) with this PDF is denoted by GM(p, µ, σ 1 , σ 0 ). The component with index 0 is called the narrow component, the component with index 1 is called the wide component in the following.

Homoskedastic Mixture Models
We first consider the following linear GMM: where n = dim(y). The variance of i is equal to: and the joint covariance matrix of is equal to As the total variance is the same for all i, the GMM is called homoskedastic. By using just the total variances, it can be approximated by a homoskedastic LGRM, and the estimation of the regression parameters proceeds as in Equation (4), which can be simplified to: If the association of the observations to the components is known, it can be encoded in a binary vector z ∈ Z = {0, 1} n , where z i = j indicates the component with variance σ 2 j , for i = 1, . . . , n. In the case of independent draws from the mixture PDF in Equation (13), the z i are independent Bernoulli variables with P(1) = p, P(0) = 1 − p. The probability of z = (z 1 , . . . , z n ) is therefore equal to: Conditional on the known associations z, the model can be interpreted as a heteroskedastic LGRM with a diagonal covariance matrix V z : It follows from the binomial theorem that: The GMR is thus a mixture of heteroskedastic Gaussian linear regressions with mixture weights P(z), z ∈ Z. In each of the heteroskedastic LGRMs the regression parameters are estimated by: Conditional on z, the estimator is normally distributed with mean θ and covariance matrixĈ z , and therefore unbiased and efficient. The unconditional distribution ofθ is obtained by summing over all z with weights P(z). Its density therefore reads: where ϕ(θ; θ,Ĉ z ) is the multivariate normal density ofθ with mean θ and covariance matrixĈ z .θ is therefore unbiased, and its covariance matrix is obtained by summing over the 2 n possible values of z with weights P(z):Ĉ The unconditional estimator has minimum variance among all estimators that are conditionally unbiased for all z ∈ Z, as shown by the following theorem. Theorem 1. Letθ be an estimator of θ with E[θ|z] = θ for all z ∈ Z. Then its covariance matrixC is not smaller thanĈ in the Loewner partial order:Ĉ

≤C.
(24) Proof. As E[θ|z] = θ,C can be written in the following form: Asθ z is efficient conditional on z andθ|z is conditionally unbiased, it follows that Summing over all z with weights P(z) proves the theorem.
The effective improvement in precision of the heteroskedastic estimatorθ w.r.t.regarding to the homoskedastic estimatorθ is difficult to quantify by simple formulas, but easy to assess by simulation studies. The results of three such studies in the context of track fitting are presented in Sections 4.1-4.3, respectively.

Heteroskedastic Mixture Models
The GMM in Equation (14) can be further generalized to the heteroskedastic case, where the mixture parameters depend on i: The joint covariance matrix of is now diagonal with The approximating LGRM that uses just the total variances is now heteroskedastic with covariance matrix V . The corresponding estimatorθ with its covariance matrixČ is computed as in Equation (4).
As in Section 3.1, the unconditional covariance matrixĈ is the weighted sum of the conditional covariance matricesĈ z :Ĉ It can be proved as before thatĈ ≤Č.
In the application to track fitting, heteroskedastic mixture models can be useful in non-uniform tracking detectors, in which the mixture model obtained by the calibration depends on the layer or on the sensor or on the possible cluster types.
A nonlinear GMM can be approximated by a linear GMM in the same way as a nonlinear Gaussian model can be approximated by a LGRM, see Section 2.3.

Simulation Study with Straight Tracks
The setting of the first simulation study is a toy detector that consists of n equally spaced sensor planes in the interval from 0 cm to 115 cm. The track model is a straight line in the (x, y)-plane. No material effects are simulated. The track position is recorded in each sensor plane, with a position error generated from a Gaussian mixture, see Equation (13). The total variance in each layer is equal to σ 2 tot = (0.005 cm) 2 , corresponding to a standard deviation of σ tot = 0.005 cm. As the main objective of the simulation study is the comparison of the homoskedastic estimator of Equation (17) Table 1. The number n of planes varied from 3 to 20. For each pair (p, r) and each value of n the exact covariance matrix was computed according to Equation (23).  The track parameters were chosen as the intercept d and the slope k. All tracks were fitted with the homoskedastic and with the heteroskedastic error model. Figure 1 shows the estimated standard deviation (rms) of both fits, along with the exact values, for the intercept d and the slope k, with p = 0.8, r = 0.2 in the range 3 ≤ n ≤ 20. The standard deviations of the heteroskedastic fit are not only smaller, but also fall at a faster rate than the ones of the homoskedastic fit, i.e., faster than 1/ √ n. The agreement between the estimated and the exact standard deviations is excellent. Figure 2 shows the relative standard deviation (heteroskedastic rms over homoskedastic rms) for the slope k and all 20 combinations of p and r. The corresponding plot for the intercept d looks very similar. A clear trend can be observed: the ratio gets smaller with decreasing p and with decreasing r.
The largest improvement is therefore seen for p = 0.6, r = 0.1, where the probability of very precise measurements (r = 0.1) is the largest (1 − p = 0.4).
For large values of n the ratio curves approach a constant level, because asymptotically both rms values fall at the same rate, namely 1/ √ n, see Figure 3. In the experimentally relevant region 15 ≤ n ≤ 20 the ratio is already close to its lower limit in most, but not all, cases. In these cases, the superior performance of the heteroskedastic fit is fully exploited, and adding more measurements yields little further improvement. The exception is the case p = 0.9, where the limit is reached much later, at n = 200 in the extreme case.

Simulation Study with Circular Tracks
The setting of the second simulation study is a toy detector that consists of n equally spaced concentric cylinders with radii in the interval [R min , R max ], with R min = 10, R max = 100 cm. The track model is a helix through the origin. Only the parameters in the bending plane ((x, y)-plane) are estimated. The track parameters are: Φ, the azimuth of the track position at R = R min ; φ, the azimuth of the track direction at R = R min ; and κ, the curvature of the circle. 10 6 tracks were generated for each of the same 20 combinations of p and r = σ 0 /σ 1 as in Section 4.1. The track position in the bending plane was recorded in each cylinder, with a position error generated from the Gaussian mixtures in Table 1. Each sample of 10 6 tracks consists of 10 subsamples with circles of 10 different radii, namely ρ/ cm = 100, 200, . . . , 1000. The number n of planes varied from 3 to 20.
As the track model is nonlinear, estimation was performed according to the procedure in Section 2.3. All tracks were fitted with the homoskedastic and with the heteroskedastic error model, and for each pair (p, r) and each value of n the exact covariance matrix was computed according to Equation (23). Figure 4 shows the estimated standard deviation (rms) of both fits, along with the exact values, for Φ, φ and κ, with p = 0.8, r = 0.2 in the range 3 ≤ n ≤ 20. The standard deviations of the heteroskedastic fit are not only smaller, but also fall at a faster rate than the ones of the homoskedastic fit, i.e., faster than 1/ √ n. The agreement between the estimated and the exact standard deviations is again excellent. Figures 5-7 show the relative standard deviation for the three track parameters and all 20 combinations of p and r. Because of the large correlation between φ and κ, Figures 6 and 7 are virtually identical. The same trend as before can be observed: the ratio gets smaller with decreasing p and with decreasing r. The largest improvement is again seen for p = 0.6, r = 0.1, where the probability of very precise measurements (r = 0.1) is the largest (1 − p = 0.4). Compared to the results of the straight-line fits, the gain in precision is somewhat smaller.

Simulation Study with Circular Tracks and Multiple Scattering
The third study investigates the effect of multiple Coulomb scattering (MCS) on the results presented in the previous subsection. To this end, every layer of the toy detector is assumed to have a thickness of 1% of a radiation length. The simulation and the reconstruction of the tracks is modified accordingly, see for example [17]. 10 5 tracks were generated for each of four values of the transverse momentum, i.e., p T = 10, 20, 50, 100 GeV/c and for each of the 20 combinations of p and r = σ 0 /σ 1 as in Section 4.2. The dip angle of the tracks was set to λ = π/4. The track position in the bending plane was recorded in each cylinder, with a position error generated from the Gaussian mixtures in Table 1 plus a random deviation due to MCS. The number n of planes varied from 3 to 20.
It is to be expected that the benefit provided by the heteroskedastic regression is diminished by the introduction of MCS. This is confirmed by the results in Figures 8-11, which show the relative standard deviation of the curvature κ for the four values of p T and the 20 combinations of p and r. Although the gain in precision of the heteroskedastic estimator is rather poor for the lowest value of p T , see Figure 8, it steadily improves with increasing p T and approaches the optimal level for the highest value of p T , see

Summary and Conclusions
We have evaluated the gain in precision that can be achieved by GMR in track fitting using a mixture model of position errors rather than a simple Gaussian model with an average resolution. The results, both with straight and with circular tracks, show the expected behavior: the gain rises if the mixture weight (proportion) of the narrow component becomes larger, and it rises when the ratio of the width of the narrow component and the width of the wide component becomes smaller. Unsurprisingly, the gain is negatively affected by MCS, depending on the track momentum and on the amount of material.
It is also instructive to look at the results in the light of the findings reported in [12][13][14], in particular the claim that the heteroskedastic regression, which is in fact the GMR described here, achieves a linear growth of the resolution with the number n of detecting layers, instead of the usual √ n behavior [13]. A glance at Figure 3 shows that in fact the heteroskedastic standard deviations initially fall faster than the homoskedastic ones, i.e., faster than 1/ √ n. Eventually, however, the relative standard deviations level out at a constant ratio, showing that asymptotically both standard deviations obey the expected 1/ √ n mode of falling with n. The observation reported in [13] is clearly a small-sample effect, which-in contrast to most other cases-is very significant in the mixture with p = 0.8, r = 0.1, the one assumed in [13]. Fortunately, it is in the range of n that is relevant for realistic tracking detectors, i.e., somewhere between 5 and 20 for a silicon tracker. It seems therefore absolutely worthwhile to optimize the local reconstruction and error calibration of the position measurements as far as possible.