An Improved Tikhonov-Regularized Variable Projection Algorithm for Separable Nonlinear Least Squares

In this work, we investigate the ill-conditioning of separable nonlinear least squares models solved by the variable projection method. Building on the truncated singular value decomposition (TSVD) method and the Tikhonov regularization (TR) method, we propose an improved Tikhonov regularization method that neither discards small singular values nor corrects every singular value. By fitting the Mackey–Glass time series with an exponential model, we compare the three regularization methods; the numerical simulation results indicate that the improved regularization method is more effective at reducing the mean square error of the solution and increasing the accuracy of the unknowns.


Introduction
Many problems in physics, chemistry, machine learning, computer vision, signal processing and mechanical engineering can only be described by a specialized type of nonlinear regression model, which is a linear combination of nonlinear functions. In particular, given the sequence of observed values {(t_i, y_i), i = 1, 2, ..., m}, a fitting model can be established as follows:

    η(θ_L, θ_N; t_i) := ∑_{j=1}^{n} a_j ϕ_j(θ_N; t_i)    (1)

where θ_L = (a_1, a_2, ..., a_n)^T is the linear parameter vector, θ_N = (b_1, b_2, ..., b_k)^T is the nonlinear parameter vector, and ϕ_j(θ_N; t_i) is a twice-differentiable nonlinear function that depends on θ_N and t_i. With the least squares method, the fitting task reduces to the following nonlinear minimization:

    min_{θ_L, θ_N} ∑_{i=1}^{m} (y_i − η(θ_L, θ_N; t_i))^2    (2)
which can be rewritten in matrix form as

    min_{θ_L, θ_N} ||y − Φ(θ_N) θ_L||_2^2    (3)

where y = (y_1, y_2, ..., y_m)^T, t = (t_1, t_2, ..., t_m)^T and φ_j(θ_N; t) = (ϕ_j(θ_N; t_1), ..., ϕ_j(θ_N; t_m))^T. Additionally, Φ(θ_N) = (φ_1(θ_N; t), ..., φ_n(θ_N; t)) is a matrix function, and ||·||_2 denotes the Euclidean norm. The minimization problem (3) is a nonlinear least squares problem that does not exploit the structure of the parameters. The Gauss-Newton (GN) algorithm [1], the Levenberg-Marquardt (LM) algorithm [2,3] and the iterative embedding points algorithm [4] are commonly used to solve such problems. In fact, there are two sets of mutually dependent parameters in this model: the linear parameters and the nonlinear parameters. Golub and Pereyra [5,6] refer to this type of data-fitting problem as a separable nonlinear least squares (SNLLS) problem. In consideration of the specialized separable structure of this type of model, they proposed the variable projection (VP) algorithm. The aim of this algorithm is to eliminate the linear parameters and obtain a simplified problem in the nonlinear parameters only. For any given nonlinear parameter θ_N, θ_L can be obtained by solving the following linear least squares problem:

    min_{θ_L} ||y − Φ(θ_N) θ_L||_2^2    (4)
The least squares solution of Equation (4) is

    θ̂_L = Φ(θ_N)^+ y    (5)

where Φ(θ_N)^+ is the pseudo-inverse of Φ(θ_N). By substituting θ̂_L into (3), we obtain

    min_{θ_N} ||y − P_{Φ(θ_N)} y||_2^2 = min_{θ_N} ||P^⊥_{Φ(θ_N)} y||_2^2    (6)

where P_{Φ(θ_N)} = Φ(θ_N) Φ(θ_N)^+ is the orthogonal projection onto the linear space spanned by the column vectors of Φ(θ_N), and P^⊥_{Φ(θ_N)} = I − P_{Φ(θ_N)} is the projector onto the orthogonal complement of the column space of Φ(θ_N).
Equation (6) is the revised residual function, in which the linear parameter is eliminated. The VP algorithm reduces the dimensionality of the parameter space and increases the probability of finding the global optimal solution. It is an effective method for solving the SNLLS problem.
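The elimination step above is easy to state concretely. The following sketch (using NumPy; the helper name `vp_residual` is ours, not the paper's) computes the linear solution θ̂_L = Φ⁺y and the reduced residual (I − P_Φ)y for a fixed matrix Φ; in the VP algorithm, Φ would be re-evaluated at each new θ_N.

```python
import numpy as np

def vp_residual(Phi, y):
    """For fixed Phi = Phi(theta_N): solve the linear step
    theta_L = Phi^+ y and return the reduced residual (I - P_Phi) y."""
    theta_L, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least squares solution
    return y - Phi @ theta_L, theta_L

# Toy example: m = 5 observations, n = 2 basis columns.
t = np.linspace(0.0, 1.0, 5)
Phi = np.column_stack([np.ones_like(t), np.exp(-t)])
y = 2.0 + 3.0 * np.exp(-t)        # lies exactly in the column space of Phi
r, theta_L = vp_residual(Phi, y)  # residual is (numerically) zero here
```

Because y lies in the column space of Φ in this toy case, the reduced residual vanishes and the linear coefficients are recovered exactly.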
The VP algorithm has been significantly improved and widely applied since it was proposed. For example, Kaufman [7] proposed an improved VP algorithm based on the trapezoidal decomposition of a matrix, and provided a simplified Jacobian matrix of the VP method. Ruhe et al. [8] analyzed the asymptotic convergence of the simplified VP algorithm. O'Leary and Rust [9] subsequently discovered that a VP algorithm with a complete Jacobian matrix requires fewer iterations than Kaufman's simplified algorithm. Alternatively, Ruano et al. [10] proposed a more simplified Jacobian matrix for the VP algorithm, and an improved VP algorithm that entailed applying QR decomposition to the sparse nonlinear function matrix. They found that their method effectively increased computational efficiency. Gan et al. [11] compared the separation algorithms for different Jacobian matrices and concluded that a VP algorithm with the complete Jacobian matrix is more stable than the simplified algorithm proposed by Kaufman. Gan and Li [12] proposed a VP algorithm that utilizes the classic Gram-Schmidt (GS) matrix decomposition method to treat cases in which the number of observations is significantly larger than the number of linear parameters; their algorithm was found to reduce the computational cost. Alternatively, Chen et al. [13] employed a modified GS method in their development of a more robust VP algorithm for SNLLS problems; they reported that the combination of a VP algorithm and the L-M method is more effective than a combination of the VP algorithm and the G-N method.
Although considerable efforts have been dedicated to developing methods to solve SNLLS problems, there are few studies on ill-conditioned problems within the iterative process. The least squares solution given in Equation (5) is valid only when Φ(θ_N)^T Φ(θ_N) is invertible. However, when the problem is ill-conditioned, the smallest eigenvalue of the matrix Φ(θ_N)^T Φ(θ_N) is near zero; therefore, the elements of (Φ(θ_N)^T Φ(θ_N))^{-1} become significantly large. Small changes in the observed values then strongly affect the least squares solution, causing it to become unstable. Thus, a regularization method needs to be introduced to convert the ill-posed problem into a relatively mild or benign problem [14][15][16] before proceeding. There are many regularization methods, including Tikhonov regularization (TR) [17,18], truncated singular value decomposition (TSVD) regularization [19,20], kernel-based regularization [21,22] and l_1-regularization [23]. Generally, a regularization method improves the condition of the ill-conditioned matrix by introducing regularization parameters. Common methods for determining regularization parameters include ridge-trace, generalized cross-validation and L-curve-based methods. The TSVD method is particularly popular because it is relatively well developed and can be applied to computationally complex problems. Zeng et al. [24] estimated the linear parameters in a separable least squares parameter optimization process by implementing regularization parameter detection techniques in the TR and TSVD methods. In their attempt to regularize ill-posed SNLLS problems using a VP algorithm, Chen et al. [25] developed a weighted generalized cross-validation method to determine the TR parameters, and verified the effectiveness of the algorithm through experimentation. Wang et al. [26] separated linear and nonlinear regularization parameters using a singular value decomposition (SVD)-based VP algorithm. They also estimated the linear parameters by using both the LS method and an L-curve-based spectral correction method. Their experiments confirmed that the algorithm can effectively solve ill-conditioned problems.
In this paper, an improved TR optimization method is proposed for the parameter estimation problem of SNLLS models. With the VP method as the basic framework, the algorithm was developed to take into account the specialized structure of the linear combination of nonlinear functions and separate the linear and nonlinear parameters. The linear parameters are estimated by using an improved TR method, whereas the nonlinear parameters are optimized by using the LM method. Regarding the iterative process, the simplified Jacobian matrix proposed by Kaufman is implemented in the VP algorithm. Numerical simulations were performed to compare the improved TR method to the original TR and TSVD regularization methods. The effectiveness of the VP algorithm with the improved TR method was verified, and the accuracies of different linear parameter estimation methods were evaluated.
We begin this paper by summarizing the VP method and the existing problems related to solving SNLLS problems. The methods of nonlinear and linear parameter estimation are then explained based on an improved VP algorithm derived from SVD, and the steps for solving SNLLS problems are detailed. Thereafter, we compare and evaluate the performance of the method proposed in this paper against the conventional regularized VP algorithms through numerical simulations.

Regularization Algorithm for Linear Parameter Estimation
Without loss of generality, the model is abbreviated as

    min_{θ_L} ||y − Φ θ_L||_2^2    (7)

Then, applying SVD, Φ can be rewritten as

    Φ = U S V^T    (8)

where U = (u_1, u_2, ..., u_m) and V = (v_1, v_2, ..., v_n) are composed of the eigenvectors of ΦΦ^T and Φ^T Φ, respectively; U ∈ R^{m×m}, V ∈ R^{n×n}, U^T U = I and V^T V = I. S = diag(σ_1, ..., σ_l, σ_{l+1}, ..., σ_n) is a diagonal matrix whose diagonal elements are the singular values of Φ, where σ_1 ≥ σ_2 ≥ ... ≥ σ_l > 0 and σ_{l+1} = σ_{l+2} = ... = σ_n = 0; l is the rank of Φ. The Moore-Penrose inverse of Φ is given as

    Φ^+ = V S^+ U^T    (9)

The least squares solution for the linear parameters, according to Equation (5), is then

    θ̂_L = ∑_{i=1}^{l} (u_i^T y / σ_i) v_i    (10)

Clearly, if some small singular values are significantly close to zero, even a small error will produce a significantly large discrepancy between the least squares solution and the true value, making the solution unstable. To avoid this problem, a filter factor f_i is introduced to suppress the error term in the ill-conditioned solution. Thus, the approximate solution can be obtained as

    θ̂_L = ∑_{i=1}^{l} f_i (u_i^T y / σ_i) v_i    (11)
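As a quick numerical check of the SVD-based pseudo-inverse, the sketch below (NumPy assumed) builds Φ⁺ = V S⁺ Uᵀ from a thin SVD and compares it against NumPy's built-in pseudo-inverse.

```python
import numpy as np

# A full-rank 3x2 matrix: its pseudo-inverse is (Phi^T Phi)^{-1} Phi^T.
Phi = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
U, s, Vt = np.linalg.svd(Phi, full_matrices=False)  # thin SVD: Phi = U S V^T
Phi_pinv = Vt.T @ np.diag(1.0 / s) @ U.T            # V S^+ U^T
assert np.allclose(Phi_pinv, np.linalg.pinv(Phi))   # matches numpy's pinv
```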

TSVD Method
The TSVD method removes the disturbing influence of small singular values on the LS solution, thereby allowing a stable solution to be achieved. This is done by setting a threshold and then setting any singular value smaller than this threshold to zero. Generally, an appropriate threshold can be determined through a cutoff ratio coefficient α: if σ_i < ασ_1, then σ_i is set to zero. The corresponding TSVD filter factor is

    f_i = 1 for i ≤ k,  f_i = 0 for i > k

At this point, the TSVD regularization solution is given as

    θ̂_L = ∑_{i=1}^{k} (u_i^T y / σ_i) v_i    (12)

where k ≤ l is the cutoff point for the singular values, which is typically implemented as the regularization parameter in the TSVD method.
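The truncation rule can be sketched in a few lines (NumPy assumed; the function name `tsvd_solve` is ours): components whose singular value falls below ασ_1 get filter factor 0, the rest pass through unchanged.

```python
import numpy as np

def tsvd_solve(Phi, y, alpha=1e-3):
    """TSVD-regularized LS solution: singular values below alpha*sigma_1
    are truncated (filter factor 0); the rest pass through (factor 1)."""
    U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
    keep = s >= alpha * s[0]          # cutoff rule: sigma_i < alpha*sigma_1 -> drop
    inv_s = np.zeros_like(s)
    inv_s[keep] = 1.0 / s[keep]       # invert only the retained singular values
    return Vt.T @ (inv_s * (U.T @ y))

# With a negligible cutoff ratio, nothing is truncated and the
# ordinary least squares solution is reproduced.
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
theta = tsvd_solve(Phi, y, alpha=1e-12)
```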

TR Method
The TR method is currently the preferred method for solving ill-conditioned problems. It uses a regularization parameter µ to constrain the solution and construct an augmented optimization problem as follows:

    min_{θ_L} { ||y − Φ θ_L||_2^2 + µ ||θ_L||_2^2 }    (13)

where µ is the regularization parameter. When θ_N is fixed, the solution of Equation (13) becomes

    θ̂_L = (Φ^T Φ + µI)^{-1} Φ^T y    (14)

Subsequently, applying SVD to the matrix yields

    θ̂_L = ∑_{i=1}^{l} (σ_i^2 / (σ_i^2 + µ)) (u_i^T y / σ_i) v_i    (15)

Thus, the TR filter factor can be expressed as f_µ = σ_i^2 / (σ_i^2 + µ).
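The SVD form of the Tikhonov solution can be verified directly against the normal-equations form (Φᵀ Φ + µI)⁻¹ Φᵀ y; both are computed in the sketch below (NumPy assumed; `tikhonov_solve` is our name).

```python
import numpy as np

def tikhonov_solve(Phi, y, mu):
    """TR solution via SVD: every spectral component is damped by the
    filter factor sigma_i^2 / (sigma_i^2 + mu)."""
    U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
    f = s**2 / (s**2 + mu)            # Tikhonov filter factors
    return Vt.T @ (f * (U.T @ y) / s)

# A nearly rank-deficient matrix; compare against the direct solve of
# the regularized normal equations (Phi^T Phi + mu*I) theta = Phi^T y.
Phi = np.array([[1.0, 1.0], [1.0, 1.0001], [0.5, 0.5]])
y = np.array([2.0, 2.0, 1.0])
mu = 1e-4
theta = tikhonov_solve(Phi, y, mu)
theta_direct = np.linalg.solve(Phi.T @ Phi + mu * np.eye(2), Phi.T @ y)
```

The agreement of the two routes is exactly the identity behind the filter-factor notation: the SVD diagonalizes the regularized normal equations.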

Improved TR Method
The TSVD and TR methods are essentially the same. However, they differ in the extent to which they reduce the influence of small singular values on the solution. Specifically, the TR method changes all singular values, which may cause the approximated solution to be over-smoothed. Alternatively, the TSVD method sets a small portion of the singular values to zero, which not only reduces the variance, but also affects the resolution of the solution, resulting in low accuracy. In contrast, the improved TR method only changes the small singular values by determining a suitable regularization matrix [27]. Given that the gap between the larger and smaller singular values of an ill-conditioned matrix is large, the singular values beyond an index k are determined to be small singular values with a significant influence when their standard components account for more than 95% of the standard deviation of the solution. Consequently, only they are regularized to reduce their influence on the standard deviation. This condition can be expressed as follows [28]:

    ∑_{i=k+1}^{l} σ_i^{-2} / ∑_{i=1}^{l} σ_i^{-2} ≥ 95%    (16)

Thus, the regularization parameter k can be obtained. The filter factor is then set as

    f_i = 1 for i ≤ k,  f_i = σ_i^2 / (σ_i^2 + µ) for i > k    (17)

The improved TR method is a combination of the TSVD and TR methods. Unlike the TSVD method, the improved TR method does not disregard small singular values; furthermore, it yields a more accurate solution. Additionally, unlike the conventional TR method, the improved TR method only modifies the small singular values, keeping the large singular values unchanged, to ensure high accuracy.
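The hybrid filter can be sketched as follows (NumPy assumed; `improved_tr_solve` is our name). For simplicity the cutoff index k is passed in directly rather than derived from the 95% influence criterion; only the singular values beyond k are Tikhonov-damped.

```python
import numpy as np

def improved_tr_solve(Phi, y, mu, k):
    """Improved-TR sketch: the k largest singular values are left
    unchanged (filter factor 1); only the remaining small singular
    values are damped by sigma_i^2 / (sigma_i^2 + mu)."""
    U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
    f = np.ones_like(s)
    f[k:] = s[k:]**2 / (s[k:]**2 + mu)   # damp only sigma_{k+1}, ..., sigma_l
    return Vt.T @ (f * (U.T @ y) / s)

# Sanity check: with k equal to the full rank, no component is damped,
# so the plain least squares solution is recovered regardless of mu.
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
theta = improved_tr_solve(Phi, y, mu=0.1, k=2)
```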
A method is still needed to determine the appropriate regularization parameter. The L-curve method was adopted in this work. This method entails selecting different µ values to obtain a set of points (||y − Φθ_L||_2, ||θ_L||_2); subsequently, an L-shaped curve is constructed with ||y − Φθ_L||_2 as the abscissa and ||θ_L||_2 as the ordinate. Finally, the regularization parameter µ is determined by locating the point of maximum curvature.
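A minimal L-curve sketch follows (NumPy assumed; `l_curve_corner` and the discrete-curvature corner rule are our illustrative choices, not the paper's exact procedure): for each candidate µ it records the point (log residual norm, log solution norm) and returns the µ at the point of maximum discrete curvature.

```python
import numpy as np

def l_curve_corner(Phi, y, mus):
    """Pick mu at the corner of the L-curve by maximizing the discrete
    curvature (circumscribed-circle formula) over interior points."""
    pts = []
    for mu in mus:
        theta = np.linalg.solve(Phi.T @ Phi + mu * np.eye(Phi.shape[1]),
                                Phi.T @ y)
        pts.append((np.log(np.linalg.norm(y - Phi @ theta) + 1e-15),
                    np.log(np.linalg.norm(theta) + 1e-15)))
    best_i, best_kappa = 1, -1.0
    for i in range(1, len(pts) - 1):
        (x1, y1), (x2, y2), (x3, y3) = pts[i - 1], pts[i], pts[i + 1]
        area2 = abs((x2 - x1) * (y3 - y1) - (y2 - y1) * (x3 - x1))
        den = (np.hypot(x2 - x1, y2 - y1) * np.hypot(x3 - x2, y3 - y2)
               * np.hypot(x3 - x1, y3 - y1))
        kappa = 2.0 * area2 / den if den > 0 else 0.0
        if kappa > best_kappa:
            best_i, best_kappa = i, kappa
    return mus[best_i]

# Mildly ill-conditioned toy problem with noisy observations.
rng = np.random.default_rng(0)
Phi = np.vander(np.linspace(0.0, 1.0, 20), 6, increasing=True)
y = Phi @ np.ones(6) + 1e-3 * rng.standard_normal(20)
mu_star = l_curve_corner(Phi, y, np.logspace(-10, 1, 23))
```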

LM Algorithm for Nonlinear Parameter Estimation
After the linear parameters have been determined, Equation (6) contains only the nonlinear parameters, which can be determined by using the LM algorithm. The iteration strategy is

    θ_N^{k+1} = θ_N^k + α_k d_k    (18)

where α_k is the step length, which ensures that the objective function decreases, and d_k is the search direction. α_k can be obtained by employing an inexact line search. Let m_k be the smallest non-negative integer m satisfying r(θ_N^k + ρ^m d_k) ≤ r(θ_N^k) + σ ρ^m g_k^T d_k in the k-th iteration, where g_k is the gradient of the objective function at θ_N^k. Then,

    α_k = ρ^{m_k}    (19)

where d_k can be determined by solving the following equation:

    (J(θ_N^k)^T J(θ_N^k) + γ_k I) d_k = −J(θ_N^k)^T r(θ_N^k)    (20)

where γ_k is the damping parameter that controls the magnitude and direction of d_k. When γ_k is sufficiently large, the direction of d_k approaches the negative gradient direction of the objective function. When γ_k tends toward zero, d_k tends toward the G-N direction.
In the LM algorithm implemented here, γ is adjusted by employing a strategy similar to adjusting the radius of a trust region [29]. A quadratic model is defined at the current iteration point as follows:

    q(d) = r(θ_N^k) + g_k^T d + (1/2) d^T J(θ_N^k)^T J(θ_N^k) d    (21)

The ratio of the actual reduction of the objective function to the reduction predicted by q(d) is denoted by η_k:

    η_k = (r(θ_N^k) − r(θ_N^k + d_k)) / (q(0) − q(d_k))    (22)

In each step of the LM algorithm, γ_k is first assigned an initial value, such as the corresponding value from the previous iteration, to calculate d_k. Then, γ_k is adjusted according to the value of η_k; d_k is subsequently recalculated with the adjusted γ_k, and a line search is performed. When η_k is close to one, the quadratic model predicts the reduction well, and γ should be made smaller. When η_k is close to zero, the modulus of d_k must be reduced, and γ should be made larger. When η_k is close to neither one nor zero, the current value of γ_k is deemed suitable. The critical values of η are typically 0.25 and 0.75; accordingly, γ_k is decreased when η_k > 0.75, kept unchanged when 0.25 ≤ η_k ≤ 0.75, and increased when η_k < 0.25.

In Equation (20), J(θ^k) is the Jacobian matrix of the residual function, which, according to the simplified Jacobian calculation proposed by Kaufman [7], is given as

    J_KAU = −P^⊥_Φ (DΦ) Φ^−

where DΦ is the Fréchet derivative of the matrix Φ, and Φ^− is a symmetric generalized inverse of Φ. In [30], P_Φ = ΦΦ^−, where Φ^− satisfies Φ Φ^− Φ = Φ and (Φ Φ^−)^T = Φ Φ^−. Thus, Φ^+ is not needed to calculate P_Φ. Applying SVD to decompose the matrix Φ yields

    Φ = (U_1, U_2) diag(Σ, 0) (V_1, V_2)^T

where Σ is the l-th order diagonal matrix whose diagonal elements are the nonzero singular values of Φ. Note that rank(U_1) = rank(V_1) = l, where l is the rank of Φ. Accordingly, we obtain

    Φ^− = V_1 Σ^{-1} U_1^T

Then P_Φ = U_1 U_1^T. Thus, the corresponding residual function is

    r(θ_N) = (I − U_1 U_1^T) y

and the j-th column of the Jacobian matrix is −(I − U_1 U_1^T)(∂Φ/∂b_j) V_1 Σ^{-1} U_1^T y.
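The damping update based on the critical values 0.25 and 0.75 can be written as a one-screen rule. This sketch uses illustrative shrink/grow factors (0.5 and 4.0), which are our assumptions; the paper only fixes the thresholds, not the factors.

```python
def update_gamma(gamma, eta):
    """Trust-region-style update of the LM damping parameter.
    Thresholds 0.25 / 0.75 follow the text; the factors are illustrative."""
    if eta > 0.75:          # quadratic model predicts the reduction well
        return 0.5 * gamma  # shrink: trust the Gauss-Newton direction more
    if eta < 0.25:          # poor agreement between model and function
        return 4.0 * gamma  # grow: lean toward the steepest-descent direction
    return gamma            # otherwise keep the damping unchanged
```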

Algorithm Solution Determination
Here, we describe an improved TR optimization method for the SNLLS problem. The method uses the VP algorithm to separate the variables, the improved TR method to update the linear parameters, and the LM method to search for the nonlinear parameters. We compared the performance of the improved TR regularization method with those of the conventional TR method and the TSVD regularization method to verify its effectiveness. A summary of the steps used for obtaining the solution is given as follows:

Step 1: Take an initial value θ_N^0 for the nonlinear parameters and set the maximum number of iterations M.

Step 2: Estimate the linear parameters θ_L via the TR, TSVD or improved TR method. Then the residual function r(θ_N) and the approximate Jacobian matrix J_KAU are obtained.
Step 3: The iterative step length α k and search direction d k are determined by solving Equations (19) and (20), respectively; thereafter, the nonlinear parameters are updated according to Equation (18).
Step 4: The linear and nonlinear parameters are cross-updated until k > M; then, the calculation is terminated.
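The four steps above can be sketched as a compact alternating loop. This is an illustrative skeleton only (names such as `vp_fit` are ours): it uses plain least squares for the linear step instead of the regularized variants, a finite-difference Jacobian instead of Kaufman's J_KAU, and a fixed damping γ with no line search.

```python
import numpy as np

def vp_fit(phi, theta_N0, t, y, max_iter=50, gamma=1e-3):
    """Skeleton of Steps 1-4: alternate a linear LS step for theta_L
    with a damped (LM-style) step on the nonlinear parameters theta_N."""
    theta_N = np.asarray(theta_N0, dtype=float)
    for _ in range(max_iter):
        Phi = phi(theta_N, t)
        theta_L, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # linear step
        r = y - Phi @ theta_L                               # reduced residual
        # Finite-difference Jacobian of the reduced residual w.r.t. theta_N.
        J = np.empty((len(t), len(theta_N)))
        h = 1e-6
        for j in range(len(theta_N)):
            tp = theta_N.copy()
            tp[j] += h
            Phip = phi(tp, t)
            tLp, *_ = np.linalg.lstsq(Phip, y, rcond=None)
            J[:, j] = ((y - Phip @ tLp) - r) / h
        # Damped normal equations: (J^T J + gamma*I) d = -J^T r.
        d = np.linalg.solve(J.T @ J + gamma * np.eye(len(theta_N)), -J.T @ r)
        theta_N = theta_N + d
    return theta_N, theta_L

# Fit y = a1 + a2 * exp(-b t) with unknown nonlinear parameter b.
t = np.linspace(0.0, 2.0, 40)
y = 1.0 + 2.0 * np.exp(-1.5 * t)
phi = lambda b, t: np.column_stack([np.ones_like(t), np.exp(-b[0] * t)])
bN, bL = vp_fit(phi, [1.0], t, y)
```

Starting from b = 1.0, the loop recovers the true nonlinear parameter b = 1.5 and the linear coefficients (1, 2) on this noise-free toy problem.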

Predicting the Mackey-Glass Time Series Using an RBF Neural Network
Numerical Mackey-Glass time-series simulations were performed for the validation experiment. In the experiment, the LM algorithm was separately implemented in the TSVD-based VP method (VP_TSVD), the TR-based VP method (VP_TR) and the improved-TR-based VP method (VP_TSVD-TR) to fit the time series. This experiment was also performed to reveal the advantages and disadvantages of the aforementioned VP algorithms. The experimental environment was MATLAB R2016a running on a 1.80 GHz PC with Windows 7.
The exponential model is given as

    y(x) = a_1 + ∑_{j=2}^{n} a_j e^{−λ_j ||x − z_j||^2}

where a = (a_1, a_2, ..., a_n) is the linear parameter set and α = (λ_2, λ_3, ..., λ_n; z_2^T, z_3^T, ..., z_n^T)^T is the nonlinear parameter set, with ϕ_1(α; x) ≡ 1 and ϕ_j(α; x) = e^{−λ_j ||x − z_j||^2} for j = 2, 3, ..., n. This model was used to fit the chaotic Mackey-Glass time series generated by the following delay differential equation:

    dy(t)/dt = a y(t − τ) / (1 + y(t − τ)^c) − b y(t)

where a = 0.2, b = 0.1, c = 10 and τ = 17. The initial value condition was set as y(0) = 1.2, and the time interval was set to 0.1. The Runge-Kutta method was used to solve the differential equation and generate 500 data points, as shown in Figure 1. Among these 500 data points, 300 were extracted from the generated data as follows:

    [y(t − 24), y(t − 18), y(t − 12), y(t − 6), y(t)],  t = 24, 25, ..., 323

Using these 300 data points, VP_TSVD+LM, VP_TR+LM and VP_TSVD-TR+LM were applied to estimate the parameters of the model; the remaining data were used for prediction.

When n = 2, the exponential model yielded 24 nonlinear parameters and 19 linear parameters. Given the same initial iterative values, the fits of the curves derived from the training and prediction data points output by the three algorithms are shown in Figure 2. The red circles correspond to data points extracted from the time series. The figure shows intuitively that the curve fitted by the VP_TSVD-TR+LM algorithm is in good agreement with the generated data.

Table 1 presents the results for the three iterative methods: the number of iterations, the number of function evaluations, the second norm of the residual vector and the root mean square error (RMSE-t) of the objective function with respect to the original data. Figure 3 shows the convergence of the corresponding nonlinear parameters for the three iterative algorithms.

From the results in Table 1, we can see that the VP_TSVD method required the most iterations and had the highest RMSE-t value. The VP_TR and VP_TSVD-TR methods performed significantly better in the iterative process, with fewer iterations, fewer function evaluations and smaller RMSEs. Although their numbers of iterations and function evaluations were similar, the model parameters calculated by the VP_TSVD-TR method were considerably more accurate.

Table 1. Number of iterations, number of function evaluations, second norm of the residual and RMSE-t for the three methods.
From Figure 3, it is evident that all three methods yielded converging results from the same initial values. However, the VP_TSVD+LM method only began to approach the optimized solution around the 40th iteration, whereas the VP_TSVD-TR+LM algorithm began approaching the optimized parameter values at the third iteration, indicating a higher rate of convergence.

Beginning at the 400th point in the Mackey-Glass time series, every 10 points were grouped to obtain several time-series predictions. The RMSE results for each group are shown in Figure 4. It can be seen from Figure 4 that the RMSE values for the VP_TSVD+LM method increased with the number of predictions, indicating decreasing prediction accuracy and a degrading prediction ability. Conversely, in the case of the VP_TSVD-TR+LM method, there was a minimal increase in the RMSE, demonstrating the superior stability of the method. These results also indicate that the proposed method features a higher prediction accuracy.

Height Anomaly Fitting
A height anomaly is the difference in elevation between the surface of the reference ellipsoid and the quasi-geoid. As shown in Figure 5, the height anomaly ξ can be expressed as ξ = H − H_normal, where H is the geodetic height and H_normal is the normal height. The geodetic height of a point can be obtained by the GPS positioning technique. The normal height can then be calculated once the height anomaly is determined in a given area.

A multi-surface function model, Model (34), was used; its expression has the characteristic that the linear and nonlinear parameters can be separated. The TSVD method, the TR method and the improved algorithm were applied to solve for the parameters of the model from the height anomaly values of known control points. The example data, including coordinates, geodetic heights, normal heights and height anomaly values of the known points, are shown in Table 2.
D01-D15 were used for fitting, and the remaining points were used to verify the reliability of the parameters and the prediction abilities of the three algorithms. The initial iteration values of the nonlinear parameters were set to 0.5. Model (34) and the algorithm proposed in this paper were applied to fit the height anomalies, and the results were compared with those calculated by VP_TSVD and VP_TR.
The parameter estimates were obtained by alternating calculations between the linear and nonlinear parameters. The fitting images corresponding to the three algorithms are shown in Figure 6. Table 3 shows the residual sum of squares, root mean squared error and coefficient of determination values during fitting and prediction. It can be concluded from Table 3 that all algorithms enabled the model to achieve high fitting accuracy. During fitting and prediction, the orders of magnitude of the root mean square errors for VP_TSVD, VP_TR and VP_TSVD-TR were 10^{-1}, 10^{-2} and 10^{-3}, respectively. The accuracy of the VP_TSVD algorithm was slightly lower, whereas the fitting accuracies of VP_TR and VP_TSVD-TR were higher. It can be seen from SSR_f and RMSE_f that the improved algorithm proposed in this paper performed better in height anomaly prediction.

The normal height of each point was obtained according to H_normal = H − ξ and the height anomaly values fitted by the three variable projection algorithms. The parameter estimates were verified against the known normal heights. Figure 7 shows the fitting effects, and Table 4 lists the normal height fitting residual and RMSE of each point.
It can be concluded from Figure 7 and Table 4 that the differences between the normal heights estimated by the three algorithms and the known example data are very small, and the three broken lines are almost coincident, which indicates that the model fits the normal heights of the known points well. This is consistent with the conclusions drawn from Figure 6 and Table 3.

Conclusions
Separable nonlinear models have been widely applied in many fields and studied by many scholars all over the world. The VP method is one of the most effective algorithms for solving related problems. However, the classical VP method may not be able to estimate the parameters of ill-conditioned problems. As regularization is the most common method for solving ill-conditioned problems, we proposed an improved regularized VP method for the SNLLS problem. This method entails applying certain rules to determine the filter factors necessary to correct small singular values, and using the L-curve method to determine regularization parameters. The results of numerical simulations revealed that the mean square error of the proposed regularized VP method was less than those of the TSVD and TR methods. Furthermore, we found that combining a regularization method with the LM algorithm-based nonlinear parameter estimation method results in a higher rate of convergence. This improved regularized parameter estimation method was demonstrated to have certain advantages in terms of its ability to mitigate the ill-conditioned problem and improve parameter estimation accuracy. It was also found to be more stable when applied to an SNLLS problem.