A Generalized Variable Projection Algorithm for Least Squares Problems in Atmospheric Remote Sensing

This paper presents a solution for efficiently and accurately solving separable least squares problems with multiple datasets. These problems involve determining linear parameters that are specific to each dataset while ensuring that the nonlinear parameters remain consistent across all datasets. A well-established approach for solving such problems is the variable projection algorithm introduced by Golub and LeVeque, which effectively reduces a separable problem to its nonlinear component. However, this algorithm assumes that the datasets have equal sizes and identical auxiliary model parameters. This article is motivated by a real-world remote sensing application where these assumptions do not apply. Consequently, we propose a generalized algorithm that extends the original theory to overcome these limitations. The new algorithm has been implemented and tested using both synthetic and real satellite data for atmospheric carbon dioxide retrievals. It has also been compared to conventional state-of-the-art solvers, and its advantages are thoroughly discussed. The experimental results demonstrate that the proposed algorithm significantly outperforms all other methods in terms of computation time, while maintaining comparable accuracy and stability. Hence, this novel method can have a positive impact on future applications in remote sensing and could be valuable for other scientific fitting problems with similar properties.


Introduction
One of the fundamental tasks in scientific computing is to find the unknown vector of parameters x ∈ R^(n+p), which, for given data (y_i, t_i), i = 1, ..., m, m ≥ n + p, and model function η(x, t), solves the least squares problem

min_x ‖y − η(x)‖₂², (1)

where the vectors η ∈ R^m and y ∈ R^m have the entries [η(x)]_i = η(x, t_i) and [y]_i = y_i, respectively. If the model function is nonlinear, the minimization problem is solved iteratively by using step-length or trust-region methods, such as the Levenberg-Marquardt algorithm [Moré, 1978].
In separable least squares problems, the model function η is a linear combination of nonlinear functions φ_j, j = 1, ..., n, i.e.,

η(α, β, t) = Σ_{j=1}^{n} β_j φ_j(α, t), (2)

and the vector of parameters x is split into a vector of linear parameters β ∈ R^n and a vector of nonlinear parameters α ∈ R^p.
In 1973, Golub and Pereyra [1973] proposed a powerful method for solving such problems, called variable projection (VP). Specifically, the problem is reduced to a nonlinear least squares problem involving α only, and a linear least squares problem involving β only. Thus, the dimension of the nonlinear problem to be solved is reduced from n + p to p. Later on, Golub and LeVeque [1979] extended the method to the case of multiple datasets y_k with k = 1, ..., s. In this type of problem, a vector of linear parameters β_k is associated with each dataset y_k, while the vector of nonlinear parameters α corresponds to all datasets.
One example of this type of problem can be found in the retrieval of the atmospheric composition from passive remote sensing measurements (cf. [Hochstaffl et al., 2022]). Here, an atmospheric radiative transfer model with molecular composition parameters (nonlinear) and reflectivity parameters (linear) is fitted to spectral radiance measurements to determine the underlying atmospheric state parameters.
For long-lived molecules, such as CO₂ or CH₄, which are homogeneously spread throughout the atmosphere, it is possible to fit several (nearby) observations with one concentration value while the surface reflectivity differs for each measurement, making this a separable problem with multiple right-hand sides.
When fitting, for example, 4 × 4 = 16 radiance spectra, each with three distinct linear reflectivity variables, simultaneously for two shared nonlinear molecular concentration variables, the total number of unknowns for conventional solvers adds up to 50 (= 16 · 3 + 2). With a separable approach, such as variable projection, however, this size is reduced by a factor of 25, down to the two nonlinear unknowns.
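The parameter-count arithmetic of this example can be spelled out in a few lines of Python (the numbers s, n, and p are the ones from the text):

```python
# Unknown-count comparison for fitting 16 spectra jointly (example from the text):
# each spectrum gets its own 3 linear reflectivity coefficients, while the
# 2 nonlinear concentration variables are shared across all spectra.
s, n, p = 16, 3, 2            # datasets, linear params per dataset, nonlinear params

conventional = s * n + p      # a standard solver fits all unknowns at once
variable_projection = p       # VP reduces the nonlinear search space to p

print(conventional, variable_projection, conventional // variable_projection)
# 50 2 25
```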
However, the variable projection algorithm established by Golub and LeVeque [1979] to solve such problems with multiple right-hand sides is based on the assumption that all datasets have the same lengths and corresponding nonlinear models.
In this article, the theory by Golub and LeVeque [1979] is, therefore, elaborated and further enhanced for scenarios such as differently sized datasets or varying nonlinear model setups (see Section 2). The modifications were not only necessary in order to apply their method to the example described above; the new algorithm could also be useful for other scientific fitting problems. The method was implemented in Python (see Section 3) and applied to the least squares problem arising in atmospheric trace gas retrieval (see Section 4). In Section 5, experimental results are discussed and conclusions are drawn.

Theoretical Background
Assuming a model that is a linear combination of nonlinear functions, such as (2), one can define the arising separable least squares problem as

min_{α,β} ‖y − Φ(α)β‖₂², (3)

where the matrix Φ(α) ∈ R^{m×n} has the entries [Φ(α)]_{ij} = φ_j(α, t_i).

The Variable Projection Method
The minimization problem in (3), written as

min_α min_β ‖y − Φ(α)β‖₂², (4)

is reduced to two least squares problems. Golub and Pereyra [1973] pointed out that, for any fixed α, one considers the merely linear least squares problem

min_β ‖y − Φ(α)β‖₂², (5)

which is solved by

β(α) = Φ†(α) y, (6)

where Φ† = (Φ⊤Φ)⁻¹Φ⊤ is the generalized inverse of the matrix Φ. Second, for the functional

F(α) = ‖y − Φ(α)Φ†(α) y‖₂² = ‖P⊥_Φ(α) y‖₂², (7)

which is obtained by substituting the expression of β(α) given by Equation (6) into the residual function, one considers the nonlinear least squares problem

min_α F(α), (8)

where P⊥_Φ(α) = I − Φ(α)Φ†(α) is the orthogonal projection operator onto the orthogonal complement of the column space of the matrix Φ(α). Thanks to this formulation, the method is called variable projection (VP).

https://doi.org/10.3390/math11132839 PREPRINT
Thus, the method of solution involves, in the first step, the computation of the minimizer α̂ = arg min_α F(α) and, in the second step, the computation of the vector of linear parameters as β̂ = Φ†(α̂) y. Golub and Pereyra [1973] showed that this solution method yields the correct result under the assumption that Φ(α) has a locally constant rank in a neighborhood of α̂. In order to minimize a nonlinear problem such as (8), the derivative of the objective function with respect to the unknown variables is needed. If the rank of the matrix Φ(α) were not constant across the points at which its derivative is calculated throughout the iteration, the pseudo-inverse Φ†(α) would not be a continuous function and, therefore, not differentiable.
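As an illustration of this two-step procedure, the following minimal sketch minimizes the projection functional over α with SciPy and then recovers β by a linear solve. The toy model η(t) = β₀ + β₁ exp(−αt) and all data are illustrative assumptions, not the paper's retrieval setup:

```python
# Toy variable projection, assuming the illustrative model
# eta(t) = beta_0 + beta_1 * exp(-alpha * t) with one nonlinear parameter.
import numpy as np
from scipy.optimize import least_squares

t = np.linspace(0.0, 3.0, 60)
alpha_true, beta_true = 1.5, np.array([0.5, 2.0])

def Phi(alpha):
    # columns of the linear model: phi_1 = 1, phi_2 = exp(-alpha * t)
    return np.column_stack([np.ones_like(t), np.exp(-alpha * t)])

y = Phi(alpha_true) @ beta_true             # noiseless synthetic data

def vp_residual(a):
    # residual of the projection functional: P_perp(alpha) y = y - Phi beta(alpha)
    A = Phi(a[0])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ beta

sol = least_squares(vp_residual, x0=[0.3])  # step 1: nonlinear problem in alpha only
alpha_hat = sol.x[0]
beta_hat, *_ = np.linalg.lstsq(Phi(alpha_hat), y, rcond=None)  # step 2: linear solve
print(alpha_hat, beta_hat)
```

Note that the nonlinear solver only ever sees the single parameter α; β is eliminated inside the residual.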
This separation technique has powerful advantages over conventional algorithms, all of which follow from the fact that the nonlinear functional F(α) depends on α only (see also the review by Golub and Pereyra [2003]). Since α is a vector of length p and β of length n, the VP method effectively turns the original nonlinear problem of n + p variables into one of only p. This reduction of the parameter space results in a smaller Jacobian and, therefore, requires less time for its computation. Moreover, O'Leary and Rust [2013] pointed out that the reduced parameter space may also lead to a reduced number of local minimizers, making it more likely to find the global minimum instead of a local one. To summarize, the VP solver is considerably more efficient and also converges better than conventional methods without separation [Ruhe and Wedin, 1980, Golub and Pereyra, 2003]. Moreover, the need for a smaller initial guess vector can generally lead to a better-conditioned and more stable problem.
For solving the nonlinear least squares problem (8), one needs to calculate the partial derivatives ∂P⊥_Φ/∂α_l. In this regard, Golub and Pereyra [1973] provide the computational formula

∂P⊥_Φ/∂α_l = −[ P⊥_Φ (∂Φ/∂α_l) Φ† + ( P⊥_Φ (∂Φ/∂α_l) Φ† )⊤ ], (9)

which is proved in the Appendix. Kaufman [1975] proposed a modification of the original method by making use of a QR decomposition of the matrix Φ:

Φ = Q (R1; 0) = (Q1 Q2) (R1; 0), (10)

where Q ∈ R^{m×m} is orthogonal, Q1 ∈ R^{m×n}, Q2 ∈ R^{m×(m−n)}, and R1 ∈ R^{n×n} is an upper triangular, non-singular matrix. In this case, the generalized inverse Φ† satisfies the relation

Φ† = R1⁻¹ Q1⊤. (11)

From the invariance of the 2-norm under orthogonal transformations,

‖y − Φβ‖₂² = ‖Q⊤y − Q⊤Φβ‖₂² = ‖Q1⊤y − R1β‖₂² + ‖Q2⊤y‖₂². (12)

Since the optimal β for any given α is

β(α) = R1⁻¹(α) Q1⊤(α) y, (13)

we find F(α) = ‖Q2⊤(α) y‖₂², showing that the nonlinear least squares problem reduces to

min_α ‖Q2⊤(α) y‖₂². (14)

Moreover, for derivative calculations, Kaufman [1975] proposed the simplified formula

∂P⊥_Φ/∂α_l ≈ −P⊥_Φ (∂Φ/∂α_l) Φ†, (15)

which is justified in the Appendix. This simplification was shown to save function and gradient evaluation costs and, therefore, reduce the computing time per iteration, which is why this simplified version of (9) became well established.
In conclusion, the minimization problems (8) and (14) are equivalent, but the size of the matrix Q2⊤ is smaller than that of the matrix P⊥_Φ. Consequently, an algorithm for solving Equation (14) should be more efficient.
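This equivalence can be checked numerically; the sketch below assumes a random full-rank Φ and compares the projected residual norm with the norm of Q2⊤y:

```python
# Numerical check of Kaufman's reduction, assuming a random full-rank Phi:
# the projected residual norm ||P_perp y|| equals ||Q2^T y||, where Q2 holds
# the last m - n columns of the full QR factor Q.
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 3
Phi = rng.standard_normal((m, n))
y = rng.standard_normal(m)

Q, R = np.linalg.qr(Phi, mode='complete')   # Q is m x m, R is m x n
Q2 = Q[:, n:]                               # orthonormal basis of the complement

P_perp = np.eye(m) - Phi @ np.linalg.pinv(Phi)
lhs = np.linalg.norm(P_perp @ y)
rhs = np.linalg.norm(Q2.T @ y)
print(np.isclose(lhs, rhs))                 # True
```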

Multiple Right-Hand Sides (MRHS)
As Golub and LeVeque [1979] pointed out, there can be optimization problems where multiple sets of data y_1, y_2, ..., y_s are to be fit to a model function such as (2). If the model parameters are to vary for each set y_k, this simply results in s distinct separable problems such as (3). There are, however, cases where only the linear variables are specific to each dataset, while the nonlinear variables have to hold for all available data simultaneously. By exploiting separability, the sizes of such minimization problems can be reduced from sn + p to just p unknown variables, an even greater reduction than with only one dataset. Separable problems with s right-hand sides (RHS) can be posed as the minimization of

min_{α,B} ‖Y − Φ(α)B‖_F², (16)

where the nonlinear parameter vector is α ∈ R^p and the matrix B ∈ R^{n×s} contains the linear parameter vectors β_k for each RHS y_k, k = 1, ..., s, as its columns, using the Frobenius norm ‖·‖_F. Here, the data matrix Y = (y_1 ... y_s) ∈ R^{m×s} is fit to a single model of the form Φ(α)B, where Φ(α) ∈ R^{m×n} is as before (cf. Equation (3)).
Alternatively, one can form the block-diagonal matrix

G(α) = diag(Φ(α), ..., Φ(α)) ∈ R^{ms×ns} (17)

and stack the datasets and linear parameters into the vectors

ỹ = (y_1⊤, ..., y_s⊤)⊤ ∈ R^{ms}, β̃ = (β_1⊤, ..., β_s⊤)⊤ ∈ R^{ns}, (18)

so that a conventional separable problem, min_{α,β̃} ‖ỹ − G(α)β̃‖₂², arises. This formulation allows for solving the separable problem with multiple RHS by means of the variable projection algorithm already discussed. However, even for a moderate number of datasets, the matrix G(α) becomes overly large. Moreover, the sparse structure and the fact that all diagonal blocks in (17) are the same can be better utilized in the earlier formulation (16).
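The size growth of the stacked formulation is easy to demonstrate; the sketch below assumes s identical datasets of length m and uses a constant stand-in for Φ(α):

```python
# Size comparison for the stacked formulation, assuming s identical datasets
# of length m with n linear parameters each: the block-diagonal G(alpha)
# grows to (s*m) x (s*n) even though every block repeats the same m x n Phi.
import numpy as np
from scipy.linalg import block_diag

m, n, s = 100, 3, 16
Phi = np.ones((m, n))                      # stand-in for Phi(alpha)
G = block_diag(*[Phi] * s)                 # diag(Phi, ..., Phi), s blocks

print(Phi.shape, G.shape)                  # (100, 3) (1600, 48)
```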

Golub-LeVeque Approach
Therefore, Golub and LeVeque [1979] suggested a different approach. Starting from (16), one can, in the same manner as in Equation (7), exploit the problem's separability and reduce it to a purely nonlinear minimization problem of the form

min_α ‖P⊥_Φ(α) Y‖_F², (19)

with the orthogonal projector P⊥_Φ(α) = I − Φ(α)Φ†(α) as before. Acknowledging that the Frobenius norm in (19) is equivalent to the 2-norm of a vector function z(α), set up as

z(α) = ( (P⊥_Φ(α) y_1)⊤, ..., (P⊥_Φ(α) y_s)⊤ )⊤ ∈ R^{ms}, (20)

this can be minimized with any of the established methods for nonlinear least squares problems, such as the Levenberg-Marquardt algorithm, which is used in [Golub and LeVeque, 1979]. The Jacobian matrix of z(α) can be calculated analogously by defining its lth column as

[∂z/∂α_l](α) = ( ((∂P⊥_Φ/∂α_l) y_1)⊤, ..., ((∂P⊥_Φ/∂α_l) y_s)⊤ )⊤, (21)

where ∂P⊥_Φ/∂α_l is exactly as in (9).
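The equivalence between the Frobenius-norm objective and the stacked vector z(α) can be verified directly; the sketch assumes random stand-ins for Φ(α) and the data matrix Y:

```python
# Sketch of the Golub-LeVeque residual z(alpha), assuming equal-length
# datasets: the columns of P_perp(alpha) Y are stacked into one long vector,
# so the Frobenius-norm problem becomes an ordinary 2-norm problem.
import numpy as np

rng = np.random.default_rng(2)
m, n, s = 50, 3, 4
Phi = rng.standard_normal((m, n))          # stand-in for Phi(alpha)
Y = rng.standard_normal((m, s))            # data matrix (y_1 ... y_s)

P_perp = np.eye(m) - Phi @ np.linalg.pinv(Phi)
Z = P_perp @ Y                             # residual matrix, one column per dataset
z = Z.flatten(order='F')                   # z(alpha): stacked columns, length m*s

print(np.isclose(np.linalg.norm(Z, 'fro'), np.linalg.norm(z)))  # True
```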

Kaufman Approach
Since the Frobenius norm is invariant under orthogonal transformations, the above method can also be written in terms of the QR decomposition of Φ(α) outlined in Section 2.1. This approach, based on Kaufman [1975], can, therefore, be established by replacing the orthogonal projector P⊥_Φ(α) ∈ R^{m×m} in all of the above equations of the Golub-LeVeque method with the smaller orthogonal matrix Q2⊤ ∈ R^{(m−n)×m} derived in (10). Both versions of the approach are included in the Python implementation outlined in Section 3 and are, therefore, subjects of the numerical experiments performed in Section 4.

Extensions to the Golub-LeVeque Approach
Working with real measurements, which can be subject to errors or missing data points, it is not always possible to use datasets that all have exactly the same number of data points. In this case, each dataset y_k has a specific size m_k, such that the matrix depiction used in (16) no longer holds.
Moreover, the nonlinear model functions stored in the matrix Φ(α) often depend on further auxiliary parameters, which may vary for each individual measurement (e.g., observation angle). Thus, there can be a different matrix Φ_k(α) for each dataset y_k. None of these aspects were explicitly mentioned by Golub and LeVeque [1979] or by Kaufman and Sylvester [1992], who later simplified the Golub-LeVeque method further, under the assumption of equal lengths of the y_k. It was only mentioned by Kaufman [2010] that the TIMP package by Mullen and van Stokkum [2007] allows for differently sized data vectors, but not for differently constructed Φs.
For the implementation introduced in this paper, both of these changes are taken into account. The special structure of vector (20) and matrix (21) can be exploited to rewrite z(α) as a stack of (differently sized) vectors,

z(α) = ( (P⊥_{Φ_1}(α) y_1)⊤, ..., (P⊥_{Φ_s}(α) y_s)⊤ )⊤ ∈ R^{m_1 + ... + m_s}, (22)

and, likewise, the lth column of its Jacobian,

[∂z/∂α_l](α) = ( ((∂P⊥_{Φ_1}/∂α_l) y_1)⊤, ..., ((∂P⊥_{Φ_s}/∂α_l) y_s)⊤ )⊤. (23)

This necessary modification naturally increases the computational expense compared to the original Golub-LeVeque method, as the matrix Φ and its derivative have to be calculated s times instead of once. This also reduces the gain in efficiency that one would have had from exchanging P⊥_Φ with Q2⊤, as the QR decomposition now needs to be calculated for every single matrix Φ_k. However, this modified VP method for multiple right-hand sides is still significantly more efficient than using a standard nonlinear optimizer for the same problem, as will be shown in Section 4.
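Building the generalized residual for differently sized datasets can be sketched as follows; the sizes and matrices Φ_k below are hypothetical stand-ins:

```python
# Sketch of the generalized residual for differently sized datasets, assuming
# each y_k comes with its own model matrix Phi_k (the shapes are hypothetical):
# z(alpha) is built by concatenating the per-dataset projected residuals.
import numpy as np

rng = np.random.default_rng(3)
sizes = [40, 55, 30]                       # m_1, m_2, m_3 need not be equal
n = 3
Phis = [rng.standard_normal((m_k, n)) for m_k in sizes]
ys = [rng.standard_normal(m_k) for m_k in sizes]

def z(Phis, ys):
    # stack P_perp_k y_k = y_k - Phi_k (Phi_k^dagger y_k) for every dataset
    parts = []
    for Phi_k, y_k in zip(Phis, ys):
        beta_k, *_ = np.linalg.lstsq(Phi_k, y_k, rcond=None)
        parts.append(y_k - Phi_k @ beta_k)
    return np.concatenate(parts)

print(z(Phis, ys).shape)                   # (125,)
```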

Implementation
The first FORTRAN implementation of a variable projection algorithm that allows for multiple right-hand sides was VARP2 [LeVeque, 1985], which was developed by Randy LeVeque.It is a modification of the subroutine VARPRO [Bolstad, 1977] for classical separable problems with only a single RHS.Both work with user-provided Jacobians.Another efficient implementation of a solver for multiple datasets is the TIMP package by Mullen and van Stokkum [2007], written in the statistical computing language R, where the Jacobian is calculated by finite difference approximations.
This work was partly motivated by the fact that many scientific computing problems are nowadays implemented in Python. However, the only modern implementation of a VP algorithm is a MATLAB code by O'Leary and Rust [2013], which does not allow for multiple right-hand sides, as discussed in Section 2.3. Here, an implementation is introduced in which the standard VP algorithm by O'Leary and Rust [2013] is enhanced for multiple datasets, as described in Section 2.
The MATLAB code by O'Leary and Rust [2013] was specifically designed to be brief and easily understandable, so it is well-suited for translation. O'Leary and Rust [2013] argued that many of the established implementations, such as VARPRO [Bolstad, 1977] and PORT library [Fox et al., 1978] subroutines such as NSF/NSG [Kaufman, 1975], are somewhat outdated today and lack readability; thus, they proposed an efficient present-day implementation written in an interpreted language, with the advantage that it can easily be enhanced or modified. One reason for their code's brevity is that they used built-in MATLAB functions, such as lsqnonlin.m to solve the nonlinear least squares problem (7), and svd.m to solve the remaining linear problems via singular value decomposition, instead of writing their own, making the algorithm modular. Another advantage of their code is the variety of statistical diagnostics it offers, some of which were also modified for multiple datasets and used for the assessment in Section 4.

Algorithm for a Single Right-Hand Side
A short outline of a standard VP algorithm (such as the one by O'Leary and Rust [2013]) can be formulated as follows: 1. The user supplies the measurement vector y, the number of linear parameters n, additional independent variables for the model, and an initial guess for the nonlinear variables α_0. In addition, a subroutine/function, which will be called ADA (note that a detailed description of how to set up and implement this subroutine for a variable projection algorithm, such as the one presented here, can be found in [O'Leary and Rust, 2013], Sections 2.3 and 2.4), needs to be provided, which, for the given input n and α, calculates the nonlinear model matrix Φ and its partial derivatives ∂Φ/∂α_l.
2. Calculate the pseudo-inverse Φ† to generate the variable projection functional from (7), which depends on a given α.
3. Use the partial derivatives ∂Φ/∂α_l to generate the Jacobian matrix (dP⊥_Φ/dα) y for a given α, as in (9). 4. Minimize the variable projection functional (8) by inserting the results from steps 2 and 3 into an already existing nonlinear least squares solver to obtain the final result α̂. 5. Compute the final vector of linear parameters as β̂ = Φ†(α̂) y.
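The steps above can be sketched in Python; the toy ADA below (returning Φ(α) = [1, exp(−αt)] and its derivative) and all data are illustrative assumptions, not the paper's retrieval model:

```python
# Skeleton of the outlined algorithm, assuming a toy user-supplied ADA for the
# illustrative model Phi(alpha) = [1, exp(-alpha * t)].
import numpy as np
from scipy.optimize import least_squares

t = np.linspace(0.0, 2.0, 40)
y = 1.0 + 3.0 * np.exp(-2.0 * t)           # synthetic data: alpha = 2, beta = (1, 3)

def ada(alpha):
    # step 1: user-supplied model matrix and its partial derivatives
    e = np.exp(-alpha * t)
    Phi = np.column_stack([np.ones_like(t), e])
    dPhi = np.column_stack([np.zeros_like(t), -t * e])   # d Phi / d alpha
    return Phi, dPhi

def residual(a):
    Phi, _ = ada(a[0])
    return y - Phi @ np.linalg.pinv(Phi) @ y             # step 2: P_perp y

def jacobian(a):
    Phi, dPhi = ada(a[0])
    pinv = np.linalg.pinv(Phi)
    P_perp = np.eye(len(t)) - Phi @ pinv
    A = P_perp @ dPhi @ pinv
    dP = -(A + A.T)                                      # step 3: full formula (9)
    return (dP @ y)[:, None]

sol = least_squares(residual, x0=[0.5], jac=jacobian)    # step 4: nonlinear solve
alpha_hat = sol.x[0]
Phi, _ = ada(alpha_hat)
beta_hat = np.linalg.pinv(Phi) @ y                       # step 5: linear parameters
print(alpha_hat, beta_hat)
```

The Jacobian implements the full Golub-Pereyra formula (9); dropping the transposed term in `dP` would give the Kaufman simplification (15).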
In the original implementation by O'Leary and Rust [2013], the linear solution, including the computation of Φ†(α), was calculated via a singular value decomposition, which is a very robust direct method for solving linear least squares problems. Moreover, instead of using the simplification by Kaufman [1975] of dropping the second term in the Jacobian (9), O'Leary and Rust [2013] argued that, on modern computers, the balance between the computing time for extra iterations and that for the second term has tipped. In 1975, when Kaufman proposed the simplification, it was more efficient to put up with more function evaluations in exchange for saved matrix computations; this may have changed today [O'Leary and Rust, 2013]. To verify this argument, the simplification by Kaufman [1975] is included as a subject for testing in Section 4.3.
To solve the nonlinear problem in step 4, the least_squares function [SciPy v1.8.0 Manual, 06.02.2022] from the scipy.optimize package [Virtanen et al., 2020] is used in the implementation. Its default solving method is the so-called trust region reflective algorithm [Branch et al., 1999], which has been shown to work efficiently within the VP algorithm for the example discussed in Section 4 [Bärligea, 2022].

Modifications for Multiple Right-Hand Sides
In the following, the modifications required for multiple right-hand sides y_1, ..., y_s, with possibly varying lengths m_1, ..., m_s, are discussed. For the naive approach, the only necessary change to the above procedure arises in the user-supplied function ADA from step 1: instead of one matrix Φ(α) and its derivative, it has to return the sparse matrix G(α) as in (17), with the possibly different Φ_k(α) on its diagonal, and the corresponding derivatives, which are also sparse matrices with the Jacobians ∂Φ_k/∂α on their diagonals. After setting up ỹ and β̃ as in Equation (18), the above steps can be performed unchanged to obtain the results α̂ and β̃̂ = (β̂_1, ..., β̂_s)⊤. This is different from the Golub-LeVeque approach described in Section 2.2. Here, the user-defined method ADA remains the same as in step 1 of Section 3.1; however, instead of calculating (8) in step 2, the new vector function z(α) has to be set up as in (22). The same holds for the new derivative matrix ∂z(α)/∂α from (23) in step 3. After this, in step 4, ‖z(α)‖₂² is minimized for α̂; consequently, in step 5, the final linear parameter matrix B̂ can be derived by calculating its kth column as

β̂_k = Φ†_k(α̂) y_k. (24)

Finally, for the Kaufman approach [Kaufman, 1975], it is important to note that the pseudo-inverse computed in steps 2 and 5 is now calculated via the QR decomposition (10), as in (11), while the remaining matrix Q2 of the decomposition can be used for setting up z(α) and ∂z(α)/∂α, just as before, by replacing P⊥_Φ in (22) and (23) with Q2⊤. The simplified Jacobian (15) is also adopted in the implementation of this approach.
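The final linear step for multiple right-hand sides can be sketched as below, assuming the nonlinear fit has already produced α̂ and per-dataset matrices Φ_k (random stand-ins here); each column of B is recovered independently via the economy QR factors, as in the Kaufman variant:

```python
# Recovering the linear parameter matrix B column by column, assuming
# stand-in matrices Phi_k(alpha_hat) and consistent right-hand sides:
# beta_k = Phi_k^dagger y_k = R1^-1 Q1^T y_k.
import numpy as np

rng = np.random.default_rng(4)
m, n, s = 30, 3, 5
Phis = [rng.standard_normal((m, n)) for _ in range(s)]   # stand-ins for Phi_k(alpha_hat)
B_true = rng.standard_normal((n, s))
ys = [Phis[k] @ B_true[:, k] for k in range(s)]          # consistent right-hand sides

B = np.empty((n, s))
for k in range(s):
    Q1, R1 = np.linalg.qr(Phis[k])                       # economy QR: Phi_k = Q1 R1
    B[:, k] = np.linalg.solve(R1, Q1.T @ ys[k])          # beta_k = R1^-1 Q1^T y_k

print(np.allclose(B, B_true))                            # True
```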

Numerical Experiments
The goal of this section is to outline an example of a separable least squares problem with multiple right-hand sides and use it to evaluate the performance of the suggested VP algorithms. The tests were also set up to establish a comparison to conventional nonlinear least squares (NLS) methods, which ignore separability. Those were implemented using the least_squares function from the scipy.optimize package [SciPy v1.8.0 Manual, 06.02.2022].
Without separation, the minimization problem with multiple right-hand sides can be posed as

min_{α, β_1, ..., β_s} Σ_{k=1}^{s} ‖y_k − η(α, β_k)‖₂², (25)

which is easily extendable to the case of varying sizes of y_1, ..., y_s and correspondingly different models η_1, ..., η_s.
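The unseparated formulation can be sketched as follows, assuming the same illustrative toy model as before, η_k(t) = β_{k0} + β_{k1} exp(−αt): all sn + p unknowns are packed into one vector and handed to a conventional NLS solver.

```python
# Conventional (unseparated) NLS for multiple right-hand sides, assuming the
# toy model eta_k = b0 + b1 * exp(-alpha * t): one shared alpha, distinct betas.
import numpy as np
from scipy.optimize import least_squares

t = np.linspace(0.0, 2.0, 25)
alpha_true = 1.2
betas_true = np.array([[0.5, 2.0], [1.0, 1.5], [0.2, 3.0]])   # one row per dataset
ys = [b0 + b1 * np.exp(-alpha_true * t) for b0, b1 in betas_true]

def residual(x):
    alpha, betas = x[0], x[1:].reshape(-1, 2)
    parts = [y_k - (b0 + b1 * np.exp(-alpha * t))
             for y_k, (b0, b1) in zip(ys, betas)]
    return np.concatenate(parts)                              # all residuals stacked

x0 = np.concatenate([[0.5], np.ones(6)])                      # 1 + 3*2 = 7 unknowns
sol = least_squares(residual, x0)
print(sol.x[0])
```

In contrast to the VP variants, the solver here searches a 7-dimensional space instead of a 1-dimensional one.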

Application: Trace Gas Retrieval
A real-world example of the described problem set arises in the area of remote sensing, more specifically, in the retrieval of atmospheric trace gas concentrations from spectral radiance measurements. The concentrations of carbon dioxide (CO₂) and methane (CH₄), both important greenhouse gases, can be inferred from spectra observed in the short-wave infrared (SWIR). Such measurements are often spaceborne in order to achieve global coverage of the atmospheric composition.
For this paper, observations from the OCO-2 (Orbiting Carbon Observatory-2) satellite by NASA [Crisp et al., 2004, Eldering et al., 2017, Jet Propulsion Laboratory, California Institute of Technology, 2023] were used, which was designed to monitor CO₂ by measuring its absorption bands in the SWIR. In this spectral region, a radiance measurement, dependent on the wavenumber ν, can be modeled by the radiative transfer model (based on the Beer-Lambert law for molecular absorption, neglecting scattering) [Gimeno García et al., 2011]

I(ν) = S(ν) ⊛ [ (Σ_{j=0}^{n−1} r_j ν^j) μ⊙ I_0(ν) exp(−Σ_{l=1}^{p} α_l τ_l(ν)) ]. (26)

The wavenumber ν is the inverse of the often-used wavelength λ; its common unit in the SWIR region is cm⁻¹, corresponding to 10⁴/µm.
The two sets of fitting parameters are r ∈ R n for linear parameters r j and α ∈ R p for nonlinear parameters α l .
The first term is a polynomial that approximates the wavenumber-dependent surface reflectivity of the Earth at the measurement location. The factor μ⊙ corresponds to cos(θ⊙) (with θ⊙ the solar zenith angle) and accounts for the geometry of the measurement, I_0(ν) is the incoming solar radiation (at the top of the atmosphere), and

τ_l(ν) = ∫ n_l(s) σ_l(ν; p(s), T(s)) ds (27)

is the total optical depth of the lth molecule, i.e., the path integral over its number density n_l and its pressure- and temperature-dependent cross-section σ_l. In trace gas retrieval, τ(ν) is the most important quantity, as it is directly related to a molecule's concentration in the atmosphere along a given path s. Unfortunately, SWIR observations do not provide enough information to retrieve the concentration profile of a molecule. Therefore, the calculation of (27) is only possible under prior assumptions on the atmospheric state (i.e., the temperature and pressure profiles T(s), p(s), and the molecular number density n(s)). Hence, a simple scaling factor α_l is fitted as α_l τ_l(ν) in the forward model (26) to retrieve the "real" optical depth at the time and place of measurement. Lastly, S(ν) is the spectral response function of the sensor, which has to be convolved with the monochromatic radiance in order to mimic a real measurement.
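A heavily simplified numerical sketch of such a forward model is given below; the quadratic reflectivity, the flat solar spectrum, the made-up optical depth profiles, and the box-shaped instrument response are all illustrative assumptions, not the BIRRA model itself:

```python
# Simplified Beer-Lambert-type forward model sketch, assuming a quadratic
# reflectivity polynomial, a flat top-of-atmosphere spectrum I0, hypothetical
# optical depths tau_l, and a normalized box response in place of S(nu).
import numpy as np

nu = np.linspace(6150.0, 6350.0, 500)                 # wavenumber grid [cm^-1]
mu_sun = np.cos(np.deg2rad(30.0))                     # cos(solar zenith angle)
I0 = np.ones_like(nu)                                 # flat solar spectrum (assumption)
tau = np.stack([np.exp(-((nu - 6250.0) / 20.0) ** 2), # hypothetical CO2-like band
                0.1 * np.ones_like(nu)])              # hypothetical H2O background

def forward(r, alpha):
    reflectivity = np.polyval(r[::-1], nu - nu[0])    # r0 + r1*x + r2*x^2
    mono = mu_sun * I0 * reflectivity * np.exp(-(alpha[:, None] * tau).sum(axis=0))
    S = np.ones(9) / 9.0                              # box spectral response
    return np.convolve(mono, S, mode='same')          # instrument convolution

I_model = forward(np.array([0.3, 1e-4, 0.0]), np.array([1.0, 1.0]))
print(I_model.shape)
```

Here, r enters only linearly and the α_l only inside the exponential, which is exactly the separable structure exploited by VP.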
For trace gas retrieval, one has to consider all p molecules that have non-negligible absorbance in the measured spectral region. In the case of OCO-2 observations, the only relevant molecule apart from CO₂ is H₂O (water vapor), meaning that there are two nonlinear fitting parameters. For the linear parameters, it is common to use approximately three reflectivity coefficients (depending on the size of the spectral interval). This means that, for each spectrum, the necessary fitting parameters are

r = (r_0, r_1, r_2)⊤ and α = (α_CO₂, α_H₂O)⊤. (29)

Note that even though it is physically necessary to use all of these variables in a fit, the only one of interest in this context is the molecular scaling factor α_l of the molecule l under scrutiny (here, α_CO₂), as this alone contains the relevant information about its atmospheric concentration.
This, together with (26), clearly fits the criteria for a separable problem, and a conventional VP algorithm from the PORT Mathematical Subroutine Library [Dennis et al., 1981, Fox et al., 1978] has already been tested by Gimeno García et al. [2011] and validated by Hochstaffl et al. [2018].
How is this an example of a problem with multiple right-hand sides? Many satellites have sensors that measure radiance simultaneously in several spectral windows; e.g., OCO-2 observes the weak (around 6250 cm⁻¹) and strong absorption bands (around 5000 cm⁻¹) of CO₂ (cf. Figure 1). Assuming consistent model input, both spectra should deliver the same values for α, but as surface reflectivity varies strongly between different wavelength regions, each window has a specific reflectivity polynomial and, therefore, its own r. Thus, for every observation, two spectral measurement windows (of different lengths) should be fitted simultaneously.
This concept of multiple right-hand sides can also be transferred into a spatial dimension: some molecules, such as carbon dioxide or methane, are very long-lived, so they are distributed relatively homogeneously in the atmosphere. This means that observations from nearby locations should all yield quite similar concentrations. Thus, they might as well be fitted for one α_CO₂ at once. Note that the assumption made about atmospheric carbon dioxide might not hold for all other absorbing molecules in the observed spectral region, such as H₂O, which has rather variable concentrations across the globe. However, as its variations are smaller than those of the surface reflectivity, and no physical insight is sought from the fit of the α_H₂O parameter, it can be seen as a mere auxiliary parameter for completing the model and, therefore, be treated as a "constant" nonlinear fitting variable for a group of neighboring spectra. Still, in this case, the reflectivity coefficients r_j, representative of the surface at the place of measurement, are distinct for every geolocation and, therefore, specific to each measured spectrum. Another possible linear model parameter, which is distinct for each spectrum, would be a constant baseline correction added to the model (26), as suggested by Gimeno García et al. [2011].
Figure 1: OCO-2 radiance spectra [Crisp et al., 2021], displaying radiance in units of erg/(s cm² sr cm⁻¹) from both the strong and weak bands, with 809 and 651 spectral pixels, respectively.
Finally, it is possible to combine both the spectral and spatial dimensions of multiple-RHS fitting in trace gas retrieval. The OCO-2 satellite, for instance, always stores eight observations ("soundings") in one so-called "frame", with the spatial coverage not exceeding 24 km². The concentration of carbon dioxide can be assumed to fluctuate only minimally within such an area on the globe. Figure 2 shows exemplary retrieval results of eight soundings from one OCO-2 frame. Most of the fluctuations in these results (all except for sounding number 5) could be merely due to noise in the measurements, as the mean value stays within the uncertainty for almost all of them. A fit over multiple observations, as proposed, could, therefore, help to constrain the fluctuation toward a more reliable retrieval product.
Figure 2: Retrieval results [Crisp et al., 2020] for one exemplary frame (cf. Figure 1), including eight soundings.
The following tests were conducted with a Python version of BIRRA (Beer infrared retrieval algorithm) [Gimeno García et al., 2011], which has been validated for the SWIR trace gas retrieval of CO by Hochstaffl et al. [2018]. In this code, the Jacobian matrix of the model function (26) is set up analytically for the least squares fit, reducing numerical instabilities. BIRRA is an extension of the radiative transfer model Py4CAtS (Python for Computational Atmospheric Spectroscopy) [Schreier et al., 2019], which is used to calculate the a priori total optical depths (27) needed in (26).
It must be noted that all retrievals conducted with the model described above are only supposed to evaluate the methodology and algorithms and in no way claim to represent full-fledged physical CO₂ retrieval products, such as those by Crisp et al. [2020]. Moreover, this technique of fitting multiple spectra measured within a certain spatial distance of each other is only reasonable as long as the assumption holds that, within an order of magnitude of the spatial resolution of the sensor, there are little to no physically caused fluctuations/gradients in the sought trace gas concentration(s). This is, of course, not the case for localized emitters, such as power plants or biomass-burning events.

Tests with Synthetic Data
The goal of this subsection is to show the conceptual and effective differences between algorithms solving multiple RHS and the classical case of solving one. This analysis was conducted on the basis of synthetic spectra, i.e., simulated radiance measurements generated with the radiative transfer model Py4CAtS [Schreier et al., 2019]. The benefit of using these in tests is that, in the retrieval (i.e., the fitting process), there is no model error and the exact solution is known. The only deviation from a "perfect" fit is, therefore, controlled by adding noise to the modeled measurements.
In order to be representative of the later tests with real measurements, the same numbers, sizes, and types (distinct in the spatial or spectral dimension) of datasets were generated as the ones used by OCO-2. Moreover, for consistency, all test retrievals were conducted using the same number of fitting parameters, with n = 3 linear ones per dataset and a total of p = 2 nonlinear ones (cf. vectors in (29)). This allowed for the test cases summarized in Table 1.
For the MRHS case, the Golub-LeVeque VP algorithm (introduced in Section 2.2) was used as a representative solver for multiple RHS problems. Tests with synthetic spectra indicated that all mentioned MRHS methods (see Table 2) yielded equal accuracy, confirming the theoretical result of Golub and Pereyra [1973] that the solutions found by a variable projection solver are equivalent to those of conventional nonlinear solving methods. First, the fitting precision of the VP MRHS solver was compared to that of single solvers using spectra with signal-to-noise ratios (SNRs) in the range of 20 to 500. One measure of the goodness of fitting results is the relative error compared to the true parameter values.
Figure 3 shows the distribution of these errors and the corresponding standard deviations for different SNR values. The signal-to-noise ratios achieved by satellites, including OCO-2, ranged between approximately 200 and 800 for the frames used [Crisp et al., 2021]. This broad variation of OCO-2's SNR comes from changes in the solar position and varying surface reflectivities across the orbit. As expected, both solvers achieved improved precision for increasing SNR values, since a fit becomes more accurate for less noisy data. While both single methods (NLS and VP) showed equal performance, the VP MRHS yielded standard deviations that were slightly worse. This trend is also reflected in the distributions of the relative errors, which are always more sharply distributed around zero for the single solvers than for the MRHS solver. This behavior is not surprising, since a lower-dimensional residual vector (coming from the shorter data vector) and fewer unknown parameters generally leave less freedom in the fit and, therefore, lead to more precise fitting results. To phrase it differently: one can expect that, as the size of the least squares problem increases, the number of possible local minima that the fit can reach will grow accordingly.
Considering the similarly shaped distributions in Figure 3 and the fact that VP MRHS seems to improve at the same rate as NLS and VP single, MRHS fits can be viewed as equally effective.In particular, at higher SNRs, both methodologies achieve deviations from the exact results in such low orders of magnitude that the precisions of their fitted α CO2 values are very comparable.
Another important measure for the accuracy of a fit is the standard deviation of the residuals r_i = y_i − ŷ_i (prediction errors, with y: vector of observations and ŷ: fitted model), also known as the sigma of regression, defined as

σ = ||y − ŷ||₂ / √(m − (n + p)),    (31)

where the numerator is the norm of the residual vector of the fit and the denominator contains the number of degrees of freedom (the number of data points minus the number of unknown variables).
Of course, for the case of s right-hand sides, the number of linear parameters n becomes sn, and the number of data points becomes ms (or m₁ + … + m_s), respectively.
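The degrees-of-freedom bookkeeping above can be made concrete in a short sketch. This is an illustrative implementation, not the paper's code; the function name and the synthetic residuals are assumptions for demonstration only.

```python
import numpy as np

def sigma_of_regression(residuals, n_params):
    """Sigma of regression: residual norm divided by the square root of the
    degrees of freedom (total data points minus unknown parameters)."""
    r = np.concatenate([np.ravel(ri) for ri in residuals])
    dof = r.size - n_params
    return np.linalg.norm(r) / np.sqrt(dof)

# Single dataset: m data points, n linear + p nonlinear parameters.
m, n, p = 100, 3, 2
rng = np.random.default_rng(0)
r_single = [rng.normal(scale=0.1, size=m)]
sigma_single = sigma_of_regression(r_single, n + p)

# s right-hand sides: s*n linear parameters, but still only p shared
# nonlinear parameters, and m_1 + ... + m_s data points in total.
s = 16
r_mrhs = [rng.normal(scale=0.1, size=m) for _ in range(s)]
sigma_mrhs = sigma_of_regression(r_mrhs, s * n + p)
```

Note how the MRHS case changes only the parameter count and the stacked residual size, exactly as described above.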
Figure 4 shows the mean sigma of regression of all fits over the SNR. Here, the behavior of single and MRHS is similar to that of the errors discussed above. The mean sigmas produced by the MRHS fits are slightly higher than those achieved by the single solvers. Again, this is intuitive: a single solver is able to produce distinct nonlinear parameters for each noisy spectrum and, therefore, has more degrees of freedom to mimic the noise. An MRHS solver, on the other hand, has only one set of nonlinear parameters for all the spectra, leading to an overall larger deviation between the "observed" and modeled data. This does not necessarily mean the latter is less accurate. On the contrary, since it is less prone to including the specific noise of the spectra in the fit, MRHS solvers could have a smoothing effect on otherwise fluctuating retrieval results (see Figure 2).
The effect of possible "overfitting" might, however, only be relevant for high noise levels (low SNR).
As the sigma of regression is proportional to the noisy radiance, which in turn is proportional to (1 + γ/SNR) (with γ: normally distributed random value), it decreases with the inverse of the SNR value for both the single and MRHS fits (see fitted hyperbolas in Figure 4). In this way, as the noise decreases, the sigmas of regression become more similar, such that for an SNR of 200 and higher (representative of OCO-2), the performance differences between the two methodologies (single and MRHS) disappear. Thus, for reasonably good data, there is no sacrifice in the precision or accuracy of the produced results when fitting multiple RHS simultaneously instead of fitting them one by one.
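The hyperbolic decrease of σ with SNR can be fitted as described: because σ(SNR) = a/SNR + b is linear in the transformed variable 1/SNR, an ordinary linear least squares fit suffices. The SNR grid and coefficients below are illustrative values, not the paper's data.

```python
import numpy as np

# Hypothetical SNR grid and a synthetic sigma curve obeying the assumed
# noise model sigma(SNR) = a/SNR + b (a, b chosen for illustration).
snr = np.array([20.0, 50.0, 100.0, 200.0, 500.0])
sigma = 3.0 / snr + 0.01

# Fit the hyperbola by linear least squares in u = 1/SNR.
A = np.column_stack([1.0 / snr, np.ones_like(snr)])
a, b = np.linalg.lstsq(A, sigma, rcond=None)[0]
```

This is how the dashed hyperbolas in Figure 4 can be obtained from the mean sigmas of regression at each SNR level.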
Now that the differences between MRHS solvers and classical single solvers have been established, we need to analyze which MRHS algorithm (see Section 2.2) performs best.

Tests with Real Measurements
The goal of this subsection is to assess the performance of the new enhanced VP algorithms for multiple RHS described in Section 2. The SciPy function used for the NLS reference approach allows the user to choose between three different nonlinear least squares algorithms (explored thoroughly for a single-RHS VP algorithm by Bärligea [2022]). In order to better judge the solvers' performances, two of them were used in the tests: the trust region reflective method ('TRF') [Branch et al., 1999] and the Levenberg-Marquardt method ('LM') [Moré, 1978]. While the former is the most efficient, the latter can be considered the most robust [Bärligea, 2022], which could be helpful for an increasing number of variables.
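Selecting between these two methods amounts to a single argument of SciPy's `least_squares`. The sketch below shows this on a toy separable model; the model, parameter values, and variable names are illustrative assumptions, not the paper's retrieval setup.

```python
import numpy as np
from scipy.optimize import least_squares

# Toy separable model eta(t) = beta * exp(-alpha * t); the NLS reference
# solvers fit the linear and nonlinear parameters jointly.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 2.0, 50)
y = 1.5 * np.exp(-0.7 * t) + rng.normal(scale=0.01, size=t.size)

def residuals(x):
    alpha, beta = x
    return beta * np.exp(-alpha * t) - y

x0 = np.array([1.0, 1.0])  # initial guesses for (alpha, beta)
fit_trf = least_squares(residuals, x0, method="trf")  # trust region reflective
fit_lm = least_squares(residuals, x0, method="lm")    # MINPACK Levenberg-Marquardt
```

Both calls return the fitted parameter vector in the `.x` attribute, which makes a side-by-side comparison of the two methods straightforward.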
In this subsection, an analysis of the test cases listed in Table 2 is conducted. For the assessment, a set of 18 OCO-2 frames was used, all measured on 25 May 2020 on orbit 31366a in the nadir (downward view) acquisition mode just above Australia, with a spacecraft altitude of approximately 711 km [Gunson and Eldering, 2019b]. Each frame included 8 observations in both spectral bands (cf. Figure 1) within an area of 24 km² measured along a ground track no wider than 80 km, labeled as cloud-free (no scattering), above land (better reflectivity), and good quality, according to criteria defined by Crisp et al. [2021]. With those, a few hundred test fits were performed with the VP and NLS methods (see Table 2), with varying numbers of RHS ranging from 2 to the maximum available number of 16 (only even numbers, due to the combination of the two spectral bands in one observation). Again, the fits used n = 3 linear parameters per spectrum and p = 2 nonlinear parameters.
For the evaluation of accuracy, the sigma of regression, the R-score, the confidence bounds of the results, and the fitted residuals were analyzed. The sigma of regression σ defined in Equation (31) turned out to be equal for all of the tested methods (see the residual analysis below).
A second statistical quantity is the so-called R-score, defined as

R = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²,

indicating the amount of variance about the mean ȳ of the measurements y that is accounted for by the fitted model ŷ; the sums run over all M entries, where M denotes the cumulative size m₁ + … + m_s of all the datasets. R must lie within [0, 1], and the best possible score a fit can achieve is 1. In the experiments, all of the discussed methods obtained R-scores of approximately 0.99. The only difference was observed for VP GL and VP KM, whose scores were higher on average by 0.02 %, which is negligible.
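The R-score as defined above can be computed in a few lines; this is a minimal sketch with an assumed function name, operating on the stacked measurement and model vectors.

```python
import numpy as np

def r_score(y, y_hat):
    """Coefficient of determination over the stacked datasets:
    R = 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)    # total variance about the mean
    return 1.0 - ss_res / ss_tot
```

A perfect fit yields R = 1, while a model no better than the mean of the measurements yields R = 0.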
In order to calculate the confidence bounds of the retrieval results, the covariance matrix

C = σ² (HᵀH)⁻¹

needs to be calculated, with H containing the partial derivatives of the model function with respect to the p nonlinear and sn linear parameters. For a VP method with multiple right-hand sides, H can be composed of two blocks: the first is the M × p Jacobian of the purely nonlinear function z(α) defined in (21) and (22) with respect to the nonlinear parameters α, and the second is the matrix G(α̂) ∈ R^{M×ns}, defined in (17), which is the Jacobian with respect to all the linear parameters β (cf. Equation (18)). For a confidence level of 95 %, one can then calculate the confidence bound(s) (CB) of the retrieved parameters x as

CB(xⱼ) = |q| √(Cⱼⱼ),

for which q represents the standard normal distribution quantile of ½(1 − 0.95), and the diagonal elements of C are the variances of the estimated parameters x.
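Given a Jacobian H and a residual vector, the covariance matrix and confidence bounds follow directly from the formulas above. The sketch below is a generic illustration (assumed function name and toy design matrix), not the paper's implementation.

```python
import numpy as np
from scipy.stats import norm

def confidence_bounds(H, residual, level=0.95):
    """Confidence bounds from C = sigma^2 (H^T H)^{-1}, with H the Jacobian
    of the model with respect to all (nonlinear and linear) parameters."""
    M, k = H.shape                                  # data size, parameter count
    sigma2 = residual @ residual / (M - k)          # estimated noise variance
    C = sigma2 * np.linalg.inv(H.T @ H)             # covariance of the estimates
    q = norm.ppf(0.5 * (1.0 - level))               # lower-tail normal quantile
    return np.abs(q) * np.sqrt(np.diag(C))          # one bound per parameter

# Toy example: straight-line model with two parameters.
rng = np.random.default_rng(2)
H = np.column_stack([np.linspace(0.0, 1.0, 20), np.ones(20)])
residual = rng.normal(scale=0.1, size=20)
cb = confidence_bounds(H, residual)
```

For `level=0.95`, |q| ≈ 1.96, recovering the familiar "two sigma" bounds.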
Figure 5 shows the distribution and mean values of the calculated confidence bounds of the α CO2 parameter for an increasing number of RHS. While the confidence bounds are already relatively small, they decrease further for an increasing number of datasets, similar to the behavior observed for increasing SNR values (see Figure 4). This indicates that more data lead to more accurate fitting results. However, a small difference can be observed in Figure 5 between the naive VP method and the others (VP GL, VP KM, NLS TRF, and NLS LM, which all produced the same results). The confidence bounds of the results from the naive method, though decreasing, are slightly worse than the rest. This is probably due to the different and more expensive computation of the Jacobian matrix of the naive problem (18). One can, therefore, argue that this is mainly a numerical issue and does not correspond to a lack of accuracy of the VP naive solver.

In light of the measures considered above, the tested MRHS solvers all achieved equally accurate fits. This was also confirmed when the residuals of the fits were analyzed, which turned out equal for all methods (cf. Table 2). The statistical diagnostics for the residuals of one exemplary VP GL fit (representative of all methods, including NLS) are shown in Figure 6. Ideally, the errors between the fitted model and the measurements should be normally distributed. Due to noise and outliers in the spectral data, this distribution may, however, deviate slightly from a normal one. Yet, the fact that the residuals have their highest density around zero indicates that all the algorithms produced reasonably good fits.
As for the robustness, all algorithms yielded convergence rates of 100 % for decent initial guesses. For a discussion of the impact of bad initial guesses, see O'Leary and Rust [2013], who showed that the VP method ultimately converges more reliably than conventional NLS algorithms. This is mostly because the former deals with a reduced nonlinear least squares problem needing only p instead of p + n initial guesses, making the solver considerably more stable.

In the next step, the fitting times of all mentioned methods were analyzed to compare their computational efficiency. Figure 7 shows the mean running times for a fit for an increasing number of datasets. Here, the VP KM method is not shown, since its performance is similar to VP GL. For fairly small numbers of datasets, the NLS algorithms were faster than the VP methods. This stems from the fact that these algorithms are part of the SciPy package [Virtanen et al., 2020], which is operationally optimized, whereas the proposed VP code was originally written as a proof of concept and is not yet optimized in the same manner. Still, this picture changed drastically when more RHS were used. Table 3 shows the exact values for 2, 4, and 6 datasets. For six RHS and more, the suggested VP GL algorithm not only becomes significantly faster than the rest, but it is also the only method whose fitting times increase linearly with the number of right-hand sides (see Figure 7), while all the other tested methods exhibit an almost quadratic evolution. This confirms that VP GL and VP KM are the most efficient methods when it comes to dealing with the rising complexity of multiple RHS problems. It also reveals the inferiority of the naive VP method compared to the 'good' VP methods in every test. Even though the naive approach separates the problem and should, therefore, be just as stable as the good approaches, the time it needs for solving also rises quadratically with the number of fitting windows, similar to the slower
(and inferior) NLS solvers. This must be due to the increasing size of the block diagonal matrix G(α) and the resulting extra costs for calculating G†(α) and the Jacobian ∂P⊥_G/∂α. Comparisons of the two "good" VP algorithms (VP GL and VP KM) showed that, in all of the above categories, such as robustness or accuracy, the Kaufman approach did equally as well as the Golub-LeVeque one. The only difference was found in the fitting times, for which the method by Kaufman [1975], as predicted, was consistently faster. However, the relative improvements in the running times remained below 1 % and are, therefore, almost negligible. This confirms the point made by O'Leary and Rust [2013] that Kaufman's simplification no longer necessarily poses a computational benefit on modern computers.
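The linear scaling of the "good" VP methods can be understood from the block-diagonal structure of G(α): the projected residual can be formed block by block, with one small linear solve per dataset, instead of assembling and pseudoinverting the full M × sn matrix. The sketch below demonstrates this equivalence on random data; it is an illustrative reconstruction of the idea, not the paper's implementation, and the function names are assumptions.

```python
import numpy as np
from scipy.linalg import block_diag

def blockwise_projection_residual(Phi_blocks, y_blocks):
    """Apply the orthogonal projector P_perp = I - G G^+ block by block.
    For block-diagonal G(alpha) this costs s small least squares solves,
    i.e., linear in the number of right-hand sides s."""
    res = []
    for Phi, y in zip(Phi_blocks, y_blocks):
        beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # linear params of this RHS
        res.append(y - Phi @ beta)                      # projected residual
    return np.concatenate(res)

def dense_projection_residual(Phi_blocks, y_blocks):
    """Equivalent result via the naively assembled block-diagonal G(alpha),
    whose cost grows much faster with the number of datasets."""
    G = block_diag(*Phi_blocks)
    y = np.concatenate(y_blocks)
    beta, *_ = np.linalg.lstsq(G, y, rcond=None)
    return y - G @ beta
```

Both routines return the same residual vector, but only the blockwise variant avoids the growing cost of G†(α) that slows down the naive approach.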

Discussion and Conclusions
Motivated by a real-world application in atmospheric remote sensing, a variable projection algorithm was extended to multiple right-hand sides of different sizes and nonlinear model setups.A modern MATLAB implementation by O'Leary and Rust [2013] was translated into Python and modified according to the theory presented in this paper.It incorporates the ideas of Golub and LeVeque [1979] and Kaufman [1975] for solving separable nonlinear least squares problems with multiple RHS.
Numerical tests using synthetic data demonstrate that simultaneous fittings over multiple measurements maintain accuracy and precision compared to single dataset solvers, with potential benefits in reducing "overfitting" (with noise and outliers affecting the retrieval results) and fluctuations in the results.
A comprehensive comparison with conventional nonlinear least squares solvers using real measurements from NASA's OCO-2 satellite [Jet Propulsion Laboratory, California Institute of Technology, 2023] indicated similar accuracy among all algorithms.The most significant finding was that the variable projection methods based on Golub and LeVeque [1979] and Kaufman [1975] significantly outperformed all other methods in computing time, particularly as the number of datasets increased.Thus, these algorithms are deemed more efficient than conventional solvers.Furthermore, our experiments indicate that a popular simplification proposed by Kaufman [1975] did not yield significant performance improvements.
The algorithm presented in this article proved to be highly effective and efficient. This indicates that the recommended modifications to the original algorithm by Golub and LeVeque [1979] preserve its computational advantages. Note that the benefits arising from a fast solver must always be considered in relation to the computational costs associated with the remaining parts of the overall task. In trace gas retrieval applications, the computation time required for the forward model, i.e., the radiative transfer Equation (26), can significantly exceed that of solving the inherent least squares problem. Our algorithm, thus, offers the greatest advantage in tasks where the overall performance heavily relies on the fitting process. Consequently, we endorse using this implementation not only for remote sensing, but also for other scientific problems implemented in Python with similar characteristics. The simultaneous fitting of more data can reduce fluctuations in the results, which is highly desirable in some applications.

Figure 1 :
Figure 1: Four exemplary soundings of frame 1728 from the OCO-2 level 1b (L1B) measurement product [Crisp et al., 2021], each displaying radiance spectra in units of [erg/(s cm² sr cm⁻¹)] from both the strong and weak bands, with 809 and 651 spectral pixels, respectively.

Figure 3 :
Figure 3: Distribution plots of the relative errors (difference between exact and fitted α CO2 results) on the right, and the corresponding standard deviations from the exact solution on the left, for both fitting setups (single and MRHS) with increasing signal-to-noise ratios (SNRs).

Figure 4 :
Figure 4: Mean sigma of regression for both single and MRHS fits for differently noisy spectra. The dashed lines correspond to fitted hyperbolas.

Figure 5 :
Figure 5: Vertical box plots (with horizontal offset for better distinction) of the confidence bounds of the α CO2 parameter achieved by each method for different numbers of datasets.

Figure 6 :
Figure 6: Statistical diagnostics for one exemplary VP fit.The plot on the left shows the distribution of the residuals via a coarse histogram and the continuous kernel density estimate (KDE).The plot on the right is a normal probability plot, displaying the deviation of the residuals from a normal distribution.

Figure 7 :
Figure 7: Comparison of the evolution of mean running times of NLS and VP methods for a fit over the number of used datasets.

Table 1 :
Test cases for the analysis of fitting 1 (denoted as single) or 16 (denoted as MRHS) synthetically generated datasets, simultaneously, with both VP and NLS algorithms.

Table 2 :
Test cases for the analysis of real radiance spectra with both VP and NLS algorithms for multiple RHS.

Table 3 :
Fitting times for VP methods compared to NLS methods for small numbers of datasets.The bold numbers mark the smallest mean values within a row.