Estimation of Autoregressive Parameters from Noisy Observations Using Iterated Covariance Updates

Estimating the parameters of an autoregressive (AR) random process is a well-studied problem. In many applications, however, only noisy measurements of the AR process are available. The effect of the additive noise is that the system can be modeled as an AR model with colored noise, even when the measurement noise is white, where the noise correlation matrix depends on the AR parameters. Because of this correlation, it is expedient to compute using multiple stacked observations. Performing a weighted least-squares estimation of the AR parameters with inverse covariance weighting can provide significantly better parameter estimates, with the improvement increasing with the stack depth. The estimation algorithm is essentially a vector RLS adaptive filter with a time-varying covariance matrix. Different ways of estimating the unknown covariance are presented, as well as a method to estimate the variances of the AR and observation noise. The notation is extended to vector autoregressive (VAR) processes. Simulation results demonstrate performance improvements in coefficient error and in spectrum estimation.


Introduction
The problem of estimating the parameters {b_i, i = 1, . . . , p} of an autoregressive (AR) process, where the input η(n) is a white noise process, is important in many aspects of signal processing. It plays a role in a variety of applications, such as speech coding and analysis (see References [1][2][3][4][5][6][7]). Autoregressive modeling is instrumental in many spectrum estimation algorithms, and algorithms for noise-free measurements have been developed from that perspective [8][9][10][11][12][13][14]; see also Reference [5] for a survey and perspective. Vector autoregressive modeling is also important in econometric modeling [15]. Among many other applications, AR models have also been used in biomedical signal processing (see, e.g., Reference [16]); communication (see, e.g., Reference [17]); and financial modeling (see, e.g., Reference [15]). Because of its importance, the AR parameter estimation problem has been well-studied for the case of a realization of the noise-free process {ξ(m)}. When the autocorrelations r_k = E[ξ(m)ξ*(m − k)] are known (or estimated), the Yule-Walker (YW) equations describe the solution [18,19]. This problem may also be considered as the problem of finding optimal predictor coefficients, and solutions invoking adaptive filters are well known [20]. The parameters can also be estimated using a dual Kalman filter applied to a state-space model, in which one filter estimates the signal using the current parameter estimates, and a second filter estimates the parameters using the signal estimates ξ̂(m) as if they were the true values. These two filters operate in conjunction to jointly converge to the AR parameter estimate. To evaluate our method in comparison with the dual Kalman filter method, the algorithm here is compared using models also used in Reference [104].
Recently, Monte Carlo methods, such as particle filtering, have been used for AR and ARMA estimation [107][108][109][110]. Particle filtering methods are quite general and can track dynamical variables while jointly estimating static parameters. These methods offer the possibility of convergence to good estimates in situations when the noise is not Gaussian (Gaussian noise is assumed in this paper), but have the usual drawback of potentially high computational complexity if many particles are used. They represent an alternative to the methods of this paper; a thorough comparison, while valuable, lies beyond the scope of this paper.
This topic is relevant to its journal of publication (Entropy), since spectral analysis under a maximum entropy criterion has (famously) been shown to be equivalent to AR modeling [11]. As this work shows, however, when there are noisy observations, particular attention must be devoted to the information present in the autocorrelation matrix of the observations, beyond the first-order equations that result from the historical maximum entropy approach.
We present the method first for the scalar AR model. For generality, the noise is assumed to be circular complex Gaussian; modifications for real signals are straightforward. In addition to the coefficient estimation problem, we also address the problem of variance estimation. Following the development for scalar AR processes, the notation is developed for vector AR (VAR) processes. Several simulation results demonstrate that the ICWARE method provides significant improvements over classical YW-type methods.

Scalar Parameter Estimation in Noise
The AR process and the noisy observation equation, represented by the system in Figure 1, are

ξ(m) = Σ_{i=1}^{p} b_i ξ(m − i) + η(m),    (1)

y(m) = ξ(m) + ν(m),    (2)

where ξ(m) ∈ C and each b_i ∈ C, and where ν(m) is a white noise process. The process {ξ(m)} is said to be the noise-free AR process, and the process {y(m)} is said to be the noisy AR process. The input and observation noises are assumed to be complex circular Gaussian, so that

η(m) ∼ CN(0, σ²_η),    ν(m) ∼ CN(0, σ²_ν),
where the real and imaginary parts are uncorrelated, and draws at different times are independent. Furthermore, these noises are assumed to be uncorrelated with each other, so

E[η(m)ν*(n)] = 0 for all m, n.

Let b = [b_1 b_2 · · · b_p]^T and y(i) = [y(i) y(i − 1) · · · y(i − p + 1)]^T (this vector is re-defined below to have a different length). The conventional least-squares approach to estimating b (in the forward prediction error sense) would be to compute

b̂ = arg min_b Σ_m |y(m) − y(m − 1)^T b|²,

which results in the conventional estimate

b̂ = ( Σ_m y*(m − 1) y(m − 1)^T )^{−1} Σ_m y*(m − 1) y(m).    (3)

This is essentially the covariance method [19]. The estimate (3) can be computed using a recursive formulation, resulting in a recursive least-squares (RLS) adaptive filter. This approach essentially neglects the noise ν(m). It is known that the noise biases the estimates [22].
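As a concrete illustration, the following sketch (our own code; function names are ours) simulates a noisy complex AR process and computes the conventional noise-blind least-squares estimate. Comparing runs with and without measurement noise exhibits the bias noted above.

```python
import numpy as np

def simulate_noisy_ar(b, n, var_eta=1.0, var_nu=1.0, seed=0):
    """Generate a noise-free AR(p) realization xi and noisy observation y = xi + nu.
    Both noise processes are complex circular Gaussian with the given variances."""
    rng = np.random.default_rng(seed)
    def ccg(size, var):
        # circular complex Gaussian: independent real/imag parts, total variance var
        return np.sqrt(var / 2) * (rng.standard_normal(size) + 1j * rng.standard_normal(size))
    b = np.asarray(b, dtype=complex)
    p = len(b)
    eta, nu = ccg(n, var_eta), ccg(n, var_nu)
    xi = np.zeros(n, dtype=complex)
    for m in range(n):
        for i in range(1, min(p, m) + 1):
            xi[m] += b[i - 1] * xi[m - i]
        xi[m] += eta[m]
    return xi, xi + nu

def ls_ar_estimate(y, p):
    """Conventional (noise-blind) forward-prediction least squares, cf. (3)."""
    rows = np.array([y[m - p : m][::-1] for m in range(p, len(y))])  # [y(m-1),...,y(m-p)]
    bhat, *_ = np.linalg.lstsq(rows, y[p:], rcond=None)
    return bhat
```

For an AR(1) process with b = 0.5, the noise-free estimate converges near 0.5, while the estimate from noisy observations is biased toward zero, consistent with [22].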
Note: The error could also be computed in a backward prediction error sense, or in a combined forward-backward sense, as is done in some spectrum estimation algorithms such as the modified covariance method [14], ([5], Section 7.5). The ICWARE method we describe below can be extended to these generalizations as well, but for brevity only the forward prediction error is used.
Substituting the AR signal in terms of the observation, ξ(m) = y(m) − ν(m), into the AR model (1) and using the observation model (2), we obtain

y(m) = Σ_{i=1}^{p} b_i y(m − i) + η(m) + ν(m) − Σ_{i=1}^{p} b_i ν(m − i)    (4)

     = Σ_{i=1}^{p} b_i y(m − i) + w(m),    (5)

where

w(m) = η(m) + ν(m) − Σ_{i=1}^{p} b_i ν(m − i)

denotes the noise. The expression (4) is an ARMA(p, p) process and is the basis for some ARMA-based approaches to AR estimation in noise. The expression (5) has the appearance of an AR(p) process, except that the noise w(m) is not white, having correlation

r_w(k) = E[w(m)w*(m − k)] = σ²_η δ_k + σ²_ν Σ_{i=k}^{p} c_i c*_{i−k},   k ≥ 0,    (6)

where c_0 = 1, c_i = −b_i for i = 1, . . . , p, and r_w(−k) = r*_w(k). If the noise w(m) were uncorrelated, the solution of (3) would be least-squares optimal. However, since w(m) is correlated, the sample-by-sample approach of (3) is suboptimal since it does not take the correlation into account. As we now show, there is information in the covariance structure of the signal that can be used to improve the estimate. To take the correlation into account, stack the observations of (5) into vectors of length d (that is, the depth), as

[y(m) y(m − 1) · · · y(m − d + 1)]^T = Y(m)b + [w(m) w(m − 1) · · · w(m − d + 1)]^T,    (7)

where Y(m) is the d × p matrix whose jth row is [y(m − j − 1) y(m − j − 2) · · · y(m − j − p)], j = 0, 1, . . . , d − 1.
Write this as

y(m) = Y(m)b + w(m).    (8)

The correlation matrix of the vector noise is

R_w(b) = E[w(m)w(m)^H],   [R_w(b)]_{j,ℓ} = r_w(ℓ − j),   j, ℓ = 0, 1, . . . , d − 1.

In moving through a sequence of data, the data can be advanced by a skip s to form a sequence of vectors y(m), y(m + s), y(m + 2s), . . . , y(m + s(k − 1)) and matrices Y(m), Y(m + s), Y(m + 2s), . . . , Y(m + s(k − 1)), which we write for convenience as y_0, y_1, . . . , y_{k−1} and Y_0, Y_1, . . . , Y_{k−1}, respectively. The index m is chosen to make it possible to use only causal data, m ≥ d + p − 1.
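Since R_w(b) is Hermitian Toeplitz, it can be assembled directly from the lag correlations of w(m); the following sketch (our own helper names, assuming the model above) builds the d × d matrix from b and the two variances:

```python
import numpy as np

def rw_matrix(b, d, var_eta, var_nu):
    """d x d covariance of the stacked colored noise
    w(m) = eta(m) + nu(m) - sum_i b_i nu(m - i), cf. (6)."""
    b = np.asarray(b, dtype=complex)
    c = np.concatenate(([1.0 + 0j], -b))          # c_0 = 1, c_i = -b_i
    r = np.zeros(d, dtype=complex)
    for k in range(min(d, len(c))):
        # r_w(k) = var_eta * delta_k + var_nu * sum_i c_i conj(c_{i-k})
        r[k] = (var_eta if k == 0 else 0.0) + var_nu * np.sum(c[k:] * np.conj(c[:len(c) - k]))
    idx = np.subtract.outer(np.arange(d), np.arange(d))   # j - l
    # Hermitian Toeplitz fill: R[j, l] = r_w(l - j), with r_w(-k) = conj(r_w(k))
    return np.where(idx <= 0, r[np.abs(idx)], np.conj(r[np.abs(idx)]))
```

For example, with b = [0.5], σ²_η = σ²_ν = 1, the diagonal entries are 1 + 1·(1 + 0.25) = 2.25 and the first superdiagonal is −0.5, matching (6) term by term.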
Assuming independence (such as when s ≥ d) and assuming that the correlation function R_w(b) is known, the likelihood function can be written as a circular complex Gaussian as

L(b) = Π_{i=0}^{k−1} (1 / (π^d det R_w(b))) exp( −(y_i − Y_i b)^H R_w(b)^{−1} (y_i − Y_i b) ).    (9)

Finding the true maximum likelihood solution by directly maximizing (9) with respect to b is difficult, since b enters nonlinearly in R_w(b). Instead, an iterative approach is employed, in which an estimated value R_w(b)_i is used at the ith step. Assuming still that each R_w(b)_i is known, a maximum likelihood solution may be obtained from

b̂_k = arg min_b Σ_{i=0}^{k} λ^{k−i} (y_i − Y_i b)^H R_w(b)_i^{−1} (y_i − Y_i b).    (10)

Here, a forgetting factor λ^{k−i}, with λ < 1, has been introduced to allow for tracking a time-varying b. Significantly, the norm here is weighted by the inverse covariance R_w(b)_i^{−1}. If we take d = 1, we obtain the regular least-squares problem, and the additional correlation structure does not appear. Taking the gradient of the cost functional (10) results in the normal equation

( Σ_{i=0}^{k} λ^{k−i} Y_i^H R_w(b)_i^{−1} Y_i ) b̂_k = Σ_{i=0}^{k} λ^{k−i} Y_i^H R_w(b)_i^{−1} y_i.    (11)

Write this as Φ_k^{−1} b̂_k = φ_k. This can be recursively updated using RLS recursions (see, e.g., References ([111], Section 8.7), [112]), starting from an initial Φ_{−1} = (1/δ)I for a small scalar δ and propagating Φ_k. Computing (11) requires knowledge of R_w(b)_i, which is not available, since b is to be found. In combination with a recursively updated solution to (11), R_w(b)_i is estimated using previously computed values of b̂ and used as a time-varying covariance matrix. The result is the framework shown in Algorithm 1, similar to a vector RLS adaptive filter.
Algorithm 1: AR Parameter estimation with noisy data.
The "Update the covariance" step, detailed below, is where the structure of R_w(b) can be used to improve the parameter estimates.
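The overall loop can be sketched as follows. This is our own simplified batch variant of Algorithm 1 (λ = 1, no forgetting, re-solving the accumulated normal equations (11) at each block rather than using RLS recursions), using the identity covariance during a warm-up phase and R_w built from the current estimate b̂ afterwards; all function names are ours.

```python
import numpy as np

def _rw(b, d, var_eta, var_nu):
    # d x d covariance of the stacked noise, assembled from r_w(k) in (6)
    c = np.concatenate(([1.0 + 0j], -np.asarray(b, dtype=complex)))
    r = np.zeros(d, dtype=complex)
    for k in range(min(d, len(c))):
        r[k] = (var_eta if k == 0 else 0.0) + var_nu * np.sum(c[k:] * np.conj(c[:len(c) - k]))
    idx = np.subtract.outer(np.arange(d), np.arange(d))
    return np.where(idx <= 0, r[np.abs(idx)], np.conj(r[np.abs(idx)]))

def icware_sketch(y, p, d, s, var_eta, var_nu, n_warmup=30, ridge=1e-9):
    """Iterated covariance-weighted AR estimation, simplified batch form."""
    b_hat = np.zeros(p, dtype=complex)
    A = np.zeros((p, p), dtype=complex)   # accumulates Y^H R^{-1} Y
    v = np.zeros(p, dtype=complex)        # accumulates Y^H R^{-1} y
    m, block = d + p - 1, 0
    while m < len(y):
        yv = np.array([y[m - j] for j in range(d)])
        Y = np.array([[y[m - j - i] for i in range(1, p + 1)] for j in range(d)])
        # Option 1 (identity) during warm-up, then option 3/4 (R_w from b_hat)
        Rinv = np.eye(d) if block < n_warmup else np.linalg.inv(_rw(b_hat, d, var_eta, var_nu))
        A += Y.conj().T @ Rinv @ Y
        v += Y.conj().T @ Rinv @ yv
        b_hat = np.linalg.solve(A + ridge * np.eye(p), v)
        m, block = m + s, block + 1
    return b_hat
```

A recursive (RLS-style) update of Φ_k and φ_k would avoid re-solving from scratch at each block; the batch form above is shown only for clarity.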
We have explored several different approaches to computing and using R_w(b):

1. Ignore R_w(b): Neglect the correlation structure and simply assume that R_w(b)_k = I. This gives the equivalent of taking scalar measurements and is used as a sort of worst-case basis for comparison among the different algorithms.

2. Use the correct value of R_w(b): That is, assume that b, σ²_ν, and σ²_η are known and compute R_w(b)_k according to (6). This provides a limit on best-case performance against which other methods can be compared.

3. Use the estimate of b: Using the correct values of σ²_ν and σ²_η, compute the autocorrelation matrix using b̂_k in (6).

4. Use the estimate of b with fixed variances: Compute the autocorrelation matrix using b̂_k in (6), with fixed (assumed) values of σ²_ν and σ²_η.

5. Use the estimate of b with estimated variances: Compute the autocorrelation matrix using b̂_k in (6), with the variances estimated as described in Section 3.
In the early stages, while b̂_k is poorly converged, it is best to use option 1 (identity covariance matrix) until b̂ has settled near its final value, then to switch to option 4 using the estimated b̂_k until the moments in Σ(b) have converged sufficiently well that reliable estimates of the variances can be obtained and option 5 can be used. As described in Section 3, the information necessary to estimate the variances can be accumulated without yet having a decent estimate of b̂_k.
The covariance update and inversion do not necessarily need to be done at every step. In particular, as b̂_k settles towards its final value, there is little to be gained by updating R_w(b) at every step.

Estimating the Variances
As the results below indicate, fixed values of σ²_ν and σ²_η can be used in the estimate of R_w(b) instead of estimated values, so for the purpose of estimating b it is not strictly necessary to estimate these variances. For other purposes, however, an estimate of σ²_ν and σ²_η may be needed. A maximum likelihood estimation approach is thus described in this section.
For a given estimate of the coefficients b̂, write R_w(b) as

R_w(b) = σ²_η I_d + σ²_ν B(b)B(b)^H,

where B(b) is the d × (d + p) banded Toeplitz matrix whose jth row contains the coefficients c_0 = 1, c_i = −b_i, i = 1, . . . , p, beginning in column j. For a sequence of k vectors y_0, y_1, . . . , y_{k−1}, assumed to be nonoverlapping and encompassing N samples, the joint log likelihood function of the complex (circularly symmetric) Gaussian is

ln L(σ²_η, σ²_ν) = −kd ln π − k ln det R_w(b) − Σ_{i=0}^{k−1} (y_i − Y_i b)^H R_w(b)^{−1} (y_i − Y_i b).    (12)

Let

Σ(b) = (1/k) Σ_{i=0}^{k−1} (y_i − Y_i b)(y_i − Y_i b)^H    (13)

denote the sample covariance. It can be shown (see Appendix A) that

∂ ln L / ∂σ²_ν = −k tr( R_w(b)^{−1} B(b)B(b)^H ) + k tr( R_w(b)^{−1} B(b)B(b)^H R_w(b)^{−1} Σ(b) )    (14)

and that

∂ ln L / ∂σ²_η = −k tr( R_w(b)^{−1} ) + k tr( R_w(b)^{−1} R_w(b)^{−1} Σ(b) ).    (15)

The gradients (14) and (15) are equal to zero when

R_w(b) = Σ(b).    (16)

The variance estimates are chosen to satisfy the condition (16) using an estimate of b from previous estimates, that is,

R_w(b̂) = Σ(b̂).    (17)

We define the offset trace as

tr_ℓ(A) = Σ_j [A]_{j, j+ℓ},

where the usual trace is obtained when ℓ = 0, and, for ℓ > 0, the sum is taken on the ℓth superdiagonal.
Taking the offset trace of (17) gives, for ℓ = 0, 1, . . . , p,

σ²_η tr_ℓ(I_d) + σ²_ν tr_ℓ( B(b̂)B(b̂)^H ) = tr_ℓ( Σ(b̂) ).    (18)

The ML estimates of the variances are the solutions to the overdetermined linear system (18). A significant number of terms of Σ(b) must be accumulated in order to estimate the variances well. Initially, before reasonably accurate estimates of b are available, using inaccurate b̂s can result in a highly inaccurate Σ(b). As a result, it is desirable to accumulate moments of y_i and Y_i without using b̂. Expanding (13), the offset trace of the sample covariance is

tr_ℓ( Σ(b) ) = (1/k) Σ_{i=0}^{k−1} [ tr_ℓ( y_i y_i^H ) − tr_ℓ( y_i b^H Y_i^H ) − tr_ℓ( Y_i b y_i^H ) + tr_ℓ( Y_i b b^H Y_i^H ) ].

(It is noted that, except when ℓ = 0, the two cross terms are not conjugates of each other, so both of these terms must be retained.) Since b enters these terms at most quadratically, the corresponding data moments can be accumulated recursively, for k = 1, 2, . . . , p and ℓ = 0, 1, . . . , p, without using b̂, and combined with b̂ only when the variance estimates are formed. The least-squares solution to (18) can produce variance estimates for σ²_η which are negative, which is nonphysical. Solutions in which the variances are constrained to the range [0.1, 5] were also computed using CVX [113]. From these results, about 500 samples are needed before the variance estimate converges to a value somewhat close to the true value.
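The offset-trace fit can be sketched as follows (our own helper names; in practice a nonnegativity constraint would be imposed as described above, and the sample covariance would come from accumulated data moments rather than a known R_w):

```python
import numpy as np

def rw_toeplitz(b, d, var_eta, var_nu):
    # d x d covariance R_w(b) assembled from the lag correlations r_w(k)
    c = np.concatenate(([1.0 + 0j], -np.asarray(b, dtype=complex)))
    r = np.zeros(d, dtype=complex)
    for k in range(min(d, len(c))):
        r[k] = (var_eta if k == 0 else 0.0) + var_nu * np.sum(c[k:] * np.conj(c[:len(c) - k]))
    idx = np.subtract.outer(np.arange(d), np.arange(d))
    return np.where(idx <= 0, r[np.abs(idx)], np.conj(r[np.abs(idx)]))

def estimate_variances(Sigma_hat, b_hat):
    """Least-squares fit of (var_eta, var_nu) to the offset-trace equations
    tr_l(R_w) = tr_l(Sigma_hat), l = 0..p, cf. (18), given the estimate b_hat."""
    d = Sigma_hat.shape[0]
    c = np.concatenate(([1.0 + 0j], -np.asarray(b_hat, dtype=complex)))
    p = len(c) - 1
    A = np.zeros((p + 1, 2), dtype=complex)
    rhs = np.zeros(p + 1, dtype=complex)
    for l in range(p + 1):
        g = np.sum(c[l:] * np.conj(c[:len(c) - l]))     # nu-coefficient of r_w(l)
        A[l, 0] = d if l == 0 else 0.0                   # tr_l of the identity part
        A[l, 1] = (d - l) * g                            # tr_l of the B B^H part
        rhs[l] = np.trace(Sigma_hat, offset=l)           # sum along the l-th superdiagonal
    sol, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return sol[0].real, sol[1].real
```

When the sample covariance equals the true R_w(b) and b̂ is exact, the fit recovers the variances exactly, which serves as a useful consistency check.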

Vector Autoregressive Formulation
In this section, we extend the results of the previous section to vector autoregressive random processes in noise, where at each time a vector of length K is produced and the coefficients are K × K matrices. In the interest of generality, a constant offset v is also included. The noise-free process and the noisy observations can be written as

ξ(m) = v + Σ_{i=1}^{p} B_i ξ(m − i) + η(m),    y(m) = ξ(m) + ν(m).

As in the scalar case, the noisy observations can be written as

y(m) = v + Σ_{i=1}^{p} B_i y(m − i) + w(m),    w(m) = η(m) + ν(m) − Σ_{i=1}^{p} B_i ν(m − i).

Following the usual practice [15], this is vectorized to obtain a vector of unknown parameters as follows. Let z(m) = [1 y(m − 1)^T · · · y(m − p)^T]^T and b = vec([v B_1 · · · B_p]), so that

y(m) = (z(m)^T ⊗ I_K) b + w(m).
The noise structure for a single vector can be written as

E[w(m)w(m)^H] = Σ_η + Σ_ν + Σ_{i=1}^{p} B_i Σ_ν B_i^H.

We also find, for k = 1, . . . , p,

E[w(m)w(m − k)^H] = −B_k Σ_ν + Σ_{i=1}^{p−k} B_{k+i} Σ_ν B_i^H.

As in the scalar case, stack up multiple observations so that the correlation structure may be exploited: redefine y(m), Y(m), and w(m) as the vertical stacks of y(m − j), z(m − j)^T ⊗ I_K, and w(m − j), respectively, for j = 0, 1, . . . , d − 1.
Write this as y(m) = Y(m)b + w(m).
The correlation structure of the stacked noise vector is

R_w = E[w(m)w(m)^H],

a dK × dK block Toeplitz matrix whose (j, ℓ) block is E[w(m − j)w(m − ℓ)^H] as given above. Let y(m), y(m + s), . . . , y(m + sk) and Y(m), Y(m + s), . . . , Y(m + sk) be a sequence of vectors and matrices where the vector samples are skipped by s samples at each step, and for convenience denote these as y_1, y_2, . . . , y_k and Y_1, Y_2, . . . , Y_k. Following the method of Section 2, the estimate is

b̂ = ( Σ_{i=1}^{k} Y_i^H R_w^{−1} Y_i )^{−1} Σ_{i=1}^{k} Y_i^H R_w^{−1} y_i.

With the appropriate use of vectorized data, Algorithm 1 can be used for the VAR case as well.
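The vectorization step above rests on the identity Θz = (z^T ⊗ I_K) vec(Θ), which can be checked numerically; in the following sketch (our own names, real-valued data for simplicity) Theta collects [v B_1 · · · B_p]:

```python
import numpy as np

rng = np.random.default_rng(0)
K, p = 2, 2
v = rng.standard_normal(K)
B = [rng.standard_normal((K, K)) for _ in range(p)]      # VAR coefficient matrices
ylags = [rng.standard_normal(K) for _ in range(p)]       # y(m-1), ..., y(m-p)

# Direct evaluation of v + sum_i B_i y(m-i)
direct = v + sum(Bi @ yi for Bi, yi in zip(B, ylags))

# Vectorized regression form: (z^T kron I_K) vec([v B_1 ... B_p])
Theta = np.hstack([v.reshape(K, 1)] + B)                 # K x (1 + K p)
b_vec = Theta.reshape(-1, order="F")                     # vec() stacks columns
z = np.concatenate([[1.0]] + ylags)                      # [1, y(m-1)^T, ..., y(m-p)^T]^T
vectorized = np.kron(z, np.eye(K)) @ b_vec

assert np.allclose(direct, vectorized)
```

The same identity applies row by row to the stacked regression matrix Y(m), so the scalar-case machinery carries over once the data are arranged this way.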

Some Results
Several test cases were examined to determine the performance of the ICWARE approach; these are summarized in Table 1, where the pole locations are given in polar form ρe^{jθ}. The first example, designated Case 1, is a system with p = 3 having poles at ρe^{jθ_1}, ρe^{jθ_2}, ρe^{jθ_3}, with ρ = 0.95 and (θ_1, θ_2, θ_3) = (0.65, 0.7, 0.75), resulting in a fairly narrowband complex AR signal. The noise variances are σ²_η = 1 and σ²_ν = 1. This case is used to explore some of the variations in performance as parameters of the algorithm are varied. Figure 3 shows the squared coefficient error ||b − b̂||² as a function of iteration using various forms of R_w(b) and values of d (the stack depth) from 1 to 7. The skip is set to s = 3. The results are obtained by averaging the results of 50 independent runs. In these plots, "Iteration number" refers to the number of blocks of data used in an online scheme, each iteration of the algorithm corresponding to one data block. The results are compared with the coefficient error for YW with noisy measurements, YW with noise-free measurements, and scalar AR estimation. The autocorrelation values for the YW method are computed using 30,000 points. The Burg method was also used, but with this many points there is little difference between Burg and conventional YW. The scalar AR estimation (black dotted line) converges to the YW performance, as does ICWARE with d = 1, as expected.
In Figure 3, three different ways of determining R_w(b) are used. The solid lines (subplot (a)) use the true R_w(b). This is, of course, unavailable in practice, but these plots serve as a basis for comparison against realizable algorithms. The dashed lines (subplot (b)) use an identity matrix for R_w(b) for the first 30 iterations (to establish an estimate of b), following which R_w(b) is computed using the estimated b and the correct variances σ²_ν and σ²_η. The dotted lines (subplot (c)) are similar, except that fixed (but wrong) values σ²_ν = 5 and σ²_η = 5 are used in the computation of R_w(b). It is clear that as d increases, the performance significantly improves, especially for the first few values of d. Interestingly, for d = 7, the error performance is on the order of the error in the noise-free YW.

The effect of the skip is shown in Table 2 for s = 2, 3, and 4. There is little difference between s = 2 and s = 3, but when s = 4, that is, s > p, the performance declines.

Figure 6 shows the magnitude frequency response for the true spectrum, the YW spectrum, and the spectrum computed from b̂ obtained using d = 7 and d = 3 after convergence. (The spectrum is not an average, but only a single outcome.) The ICWARE estimated spectrum is very close to the true spectrum, while the YW spectrum has considerable bias, exhibiting strong peaks not present in the true spectrum.

As a point of comparison, total least squares (TLS) is used to estimate the AR parameters. In TLS, the equation (for some m and M) is perturbed in both the matrix on the left and the vector of observations on the right. TLS can be computed using the singular value decomposition (SVD), making it computationally complex. It also makes it more difficult to create algorithms which track changing parameters, and, as shown below, it gives inferior parameter estimates compared to the method presented here. Figure 7 shows the error for b computed using TLS.
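A generic SVD-based TLS baseline can be sketched as follows (our own code; Reference [114] describes the specific system construction used in the comparison). The classical TLS solution takes the right singular vector associated with the smallest singular value of the augmented matrix [Y | y]:

```python
import numpy as np

def tls_ar(y, p, M):
    """Total-least-squares AR fit: perturb both the regression matrix and the
    observation vector. M is the number of stacked rows in the linear system."""
    rows = np.array([y[m - p : m][::-1] for m in range(p, p + M)])  # [y(m-1),...,y(m-p)]
    rhs = y[p : p + M]
    Z = np.hstack([rows, rhs[:, None]])       # augmented matrix [Y | y]
    _, _, Vh = np.linalg.svd(Z)
    v = Vh.conj().T[:, -1]                    # right singular vector, smallest singular value
    return -v[:p] / v[p]                      # classical TLS solution
```

On data that satisfy the AR recursion exactly (e.g., y(m) = 0.5 y(m − 1) with no noise), the augmented matrix is rank deficient and the TLS fit recovers the coefficient exactly.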
The parameter M is the height of the linear system that is solved using the TLS method described in Reference [114]. The results presented to this point use 1000 blocks of data. A natural question is whether data can be iteratively re-used, so that less data is required while retaining the reduced error of the ICWARE algorithm. In Figure 8, k = 100 blocks of data are employed, and the processing is repeated 10 times on each block. Figure 9 shows the error resulting from each iteration; each line in the plot corresponds to one pass through the data. The top plot uses R_w(b) estimated using b̂ and the true variances; the bottom plot uses R_w(b) with b̂ and fixed variances. These figures show that iterating over the data provides essentially the same performance as using a longer data record.

The next example, designated Case 2, is also a third-order system with poles at the same angles as Case 1, but with ρ = 0.9. The following figures show the same sort of results shown for Case 1. In Figure 10, the various methods are compared. In this case, however, the minimum error does not reach as low as the noise-free YW case. Figure 11 shows the resulting spectrum, again showing that the ICWARE spectrum is closer to the truth than the YW spectrum. Case 3 involves a system having poles at ρ = 1 (that is, pure sinusoidal signals) and angles (θ_1, θ_2, θ_3) = (0.65, 0.655, 0.66). The estimated spectra are shown in Figure 12. The spectral peaks are clearly evident in the ICWARE estimate.
As another comparison, ICWARE spectral estimates are compared to the spectra from dual Kalman filters using the examples from Reference [104], shown as Cases 4, 5, and 6 in Table 1. Figure 13 shows the results. Spectra for twenty realizations are plotted, along with the true spectrum and the mean. The values d = p + 3 and s = 3 were selected, since the simulations for Case 1 (above) suggest these are reasonable values. Comparison of these results with Figures 3, 4, and 6 of Reference [104] shows fairly comparable performance. When the poles are not near the unit circle (Case 4), the ICWARE estimates tend to have smoother spectra (estimated poles with smaller magnitude). When the poles are nearer the unit circle (Cases 5 and 6), ICWARE does a comparable job of capturing the peaks, and does somewhat better at representing the high-frequency rolloff of the spectrum. In all of these simulations, an estimated R_w(b) was employed.

Summary and Conclusions
In this paper, we have shown that observation noise added to a pure AR(p) process results in noise which is correlated across lags. This makes it expedient to employ stacked observation vectors and to estimate the parameters in a vector sense, resulting in an algorithm that is essentially a vector RLS adaptive filter. Several different methods were described to obtain information about the correlation matrix R_w(b), and a method for estimating the variances of the AR and observation noise was described.
It was shown by simulation that the method can be applied repetitively to the same block of data, providing accurate results with data of moderate length.
The improvement of the technique compared with Yule-Walker is a function of the depth d to which the observations are stacked; values of d even greater than p continue to yield improvement. The improvement relative to YW is also system dependent. Based on simulations, a depth of d = p + 3, where p is the order of the system, seems a reasonable choice.
Comparisons were made against a dual Kalman filter approach, with comparable or slightly superior results.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Maximizing The Log Likelihood Function for Estimating Variances
In this appendix, we present details of the computation of the derivative of the log likelihood (12) with respect to σ²_ν, where B(b) is defined by R_w(b) = σ²_η I_d + σ²_ν B(b)B(b)^H. We note that

∂R_w(b)/∂σ²_ν = B(b)B(b)^H.

The derivative of the second term of (12) is

∂/∂σ²_ν [ −k ln det R_w(b) ] = −k tr( R_w(b)^{−1} B(b)B(b)^H ).    (A1)

The summation term of (12) can be written as

− Σ_{i=0}^{k−1} (y_i − Y_i b)^H R_w(b)^{−1} (y_i − Y_i b) = −k tr( R_w(b)^{−1} Σ(b) ),

so that its derivative is

∂/∂σ²_ν [ −k tr( R_w(b)^{−1} Σ(b) ) ] = k tr( R_w(b)^{−1} B(b)B(b)^H R_w(b)^{−1} Σ(b) ).    (A2)

Combining (A1) and (A2) gives (14).