4.1. Application: Trace Gas Retrieval
A real-world example for the described problem set can arise in the area of remote sensing, more specifically, in the retrieval of atmospheric trace gas concentrations from spectral radiance measurements. The concentration of carbon dioxide (CO2) or methane (CH4)—both important greenhouse gases—can be inferred from spectra observed in the short-wave infrared (SWIR). Such measurements are often spaceborne in order to achieve global coverage of the atmospheric composition.
For this paper, observations from the OCO-2 (Orbiting Carbon Observatory-2) satellite by NASA [19,20,21] were used, which was designed to monitor CO2 by measuring its absorption bands in the SWIR. In this spectral region, a radiance measurement can be modeled by the radiative transfer model (based on the Beer–Lambert law for molecular absorption, neglecting scattering) [22], dependent on the wavenumber ν. The wavenumber, which is the inverse of the more commonly used wavelength λ, has the common unit cm−1 for the SWIR region, corresponding to 10^4/λ[μm]. The two sets of fitting parameters are the linear parameters (the coefficients of the reflectivity polynomial) and the nonlinear parameters (the molecular scaling factors).
The first term is a polynomial that approximates the wavenumber-dependent surface reflectivity of the Earth at the measurement location. The geometry factor, a function of the solar zenith angle θ, accounts for the geometry of the measurement; I_sun(ν) is the incoming solar radiation (at the top of the atmosphere); and τ_l is the total optical depth of the l-th molecule, i.e., the path integral over its number density n_l and its pressure- and temperature-dependent cross-section σ_l. In trace gas retrieval, τ_l is the most important quantity, as it is directly related to a molecule's concentration in the atmosphere along a given path s. Unfortunately, SWIR observations do not provide enough information to retrieve the concentration profile of a molecule. Therefore, the calculation of (27) is only possible under prior assumptions about the atmospheric state (i.e., the temperature and pressure profiles and the molecular number density along the path). Hence, a simple scaling factor α_l is fitted per molecule, multiplying the a priori optical depth in the forward model (26), to retrieve the "real" optical depth at the time and place of measurement. Lastly, the spectral response function of the sensor has to be convolved with the monochromatic radiance in order to mimic a real measurement.
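To make the structure of this forward model concrete, the following sketch evaluates a Beer–Lambert style radiance on a wavenumber grid. All names are illustrative assumptions (this is not the BIRRA implementation), and the geometry factor, a priori optical depths, and discretized response kernel are taken as precomputed inputs:

```python
import numpy as np

def forward_model(nu, refl_coeffs, alphas, tau, i_sun, geom, srf_kernel):
    """Sketch of a Beer-Lambert style forward model (no scattering).

    nu          : wavenumber grid [cm^-1]
    refl_coeffs : linear parameters -- surface reflectivity polynomial coefficients
    alphas      : nonlinear parameters -- one scaling factor per molecule
    tau         : (n_molecules, len(nu)) a priori total optical depths
    i_sun       : incoming solar radiation at the top of the atmosphere
    geom        : geometry factor derived from the solar zenith angle
    srf_kernel  : discretized spectral response function of the sensor
    """
    # wavenumber-dependent surface reflectivity (polynomial around the grid center)
    reflectivity = np.polyval(refl_coeffs, nu - nu.mean())
    # scaled molecular absorption: sum_l alpha_l * tau_l(nu)
    transmission = np.exp(-geom * np.tensordot(alphas, tau, axes=1))
    monochromatic = reflectivity * i_sun * transmission
    # convolve with the instrument response to mimic a real measurement
    return np.convolve(monochromatic, srf_kernel, mode="same")
```

The model is linear in `refl_coeffs` and nonlinear in `alphas`, which is exactly the separable structure exploited later.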
For trace gas retrieval, one has to consider all p molecules that have non-negligible absorbance in the measured spectral region. In the case of OCO-2 observations, the only relevant molecule apart from CO2 is H2O (water vapor), meaning that there are two nonlinear fitting parameters. For the linear parameters, it is common to use approximately three reflectivity coefficients (depending on the size of the spectral interval). This means that for each spectrum, the necessary fitting parameters are the three reflectivity coefficients and the two molecular scaling factors. Note that even though it is physically necessary to use all of these variables in a fit, the only one of interest in this context is the molecular scaling factor of the molecule under scrutiny (here, α_CO2), as this alone contains the relevant information about its atmospheric concentration.
This, together with (26), clearly fits the criteria for a separable problem, and a conventional VP algorithm from the PORT Mathematical Subroutine Library [14,23] has already been tested by Gimeno García et al. [22] and validated by Hochstaffl et al. [24].
How is this an example of a problem with multiple right-hand sides? Many satellites have sensors that measure radiance simultaneously in several spectral windows; e.g., OCO-2 observes the weak (around 6250 cm−1) and strong (around 5000 cm−1) absorption bands of CO2 (cf. Figure 1). Assuming consistent model input, both spectra should deliver the same values for the scaling factors, but as surface reflectivity varies strongly between wavelength regions, each window requires its own reflectivity polynomial and, therefore, its own linear parameters. Thus, for every observation, two spectral measurement windows (of different lengths) should be fitted simultaneously.
This concept of multiple right-hand sides can also be transferred into the spatial dimension: Some molecules, such as carbon dioxide or methane, are very long-lived and thus distributed relatively homogeneously in the atmosphere. This means that observations from nearby locations should all yield quite similar concentrations, so they might as well be fitted with one common scaling factor α_CO2 at once. Note that this assumption about atmospheric carbon dioxide might not hold for all other absorbing molecules in the observed spectral region, such as H2O, whose concentration is rather variable across the globe. However, as its variation is smaller than that of the surface reflectivity, and no physical insight is sought from the fit of the H2O parameter, it can be seen as a mere auxiliary parameter for completing the model and, therefore, be treated as a "constant" nonlinear fitting variable for a group of neighboring spectra. Still, in this case, the reflectivity coefficients, which represent the surface at the place of measurement, are distinct for every geolocation and, therefore, specific to each measured spectrum. Another possible linear model parameter that is distinct for each spectrum would be a constant baseline correction added to the model (26), as suggested by Gimeno García et al. [22].
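This setup — one shared set of nonlinear scaling factors, but separate linear reflectivity coefficients per spectrum — can be sketched in a variable-projection flavor as follows. All names and the data layout are illustrative assumptions, not the BIRRA code; for fixed scaling factors each window is linear in its reflectivity coefficients, so those are eliminated by an inner linear least squares solve:

```python
import numpy as np
from scipy.optimize import least_squares

def vp_mrhs_residuals(alphas, windows):
    """Residuals for several spectra sharing the nonlinear parameters.

    windows: list of (y, tau, basis) tuples, one per spectrum/window:
      y     : measured radiance (length m_k)
      tau   : (n_molecules, m_k) a priori optical depths for this window
      basis : (m_k, n) design matrix of the reflectivity polynomial
    """
    res = []
    for y, tau, basis in windows:
        transmission = np.exp(-np.tensordot(alphas, tau, axes=1))
        design = basis * transmission[:, None]       # linear model for this window
        coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
        res.append(y - design @ coeffs)              # projected residual
    return np.concatenate(res)

# usage: one shared alpha vector, distinct reflectivity coefficients per window
# fit = least_squares(vp_mrhs_residuals, x0=np.ones(2), args=(windows,))
```

Only the shared nonlinear parameters are exposed to the outer solver; the per-window linear parameters fall out of the inner solves, which is what keeps the outer problem small.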
Finally, it is possible to combine both the spectral and spatial dimensions of multiple-RHS fitting in trace gas retrieval. The OCO-2 satellite, for instance, always stores eight observations ("soundings") in one so-called "frame", with the spatial coverage not exceeding 24 km². The concentration of carbon dioxide can be assumed to fluctuate only minimally within such an area on the globe.
Figure 2 shows exemplary retrieval results of eight soundings from one OCO-2 frame. Most of the fluctuations in these results (all except for sounding number 5) could be merely due to noise in the measurements, as the mean value stays within the uncertainty for almost all of them. A fit over multiple observations, as proposed, could, therefore, help to constrain these fluctuations and yield a more reliable retrieval product.
To summarize, the OCO-2 product [25] allows for multiple-RHS fits of a combined 8 (spatial) × 2 (spectral) = 16 datasets (cf. Section 4.3).
The following tests were conducted with a Python version of BIRRA (Beer infrared retrieval algorithm) [22], which has been validated for the SWIR trace gas retrieval of CO by Hochstaffl et al. [24]. In this code, the Jacobian matrix of the model function (26) is set up analytically for the least squares fit, reducing numerical instabilities. BIRRA is an extension of the radiative transfer model Py4CAtS (Python for Computational Atmospheric Spectroscopy) [28], which is used to calculate the a priori total optical depths (27) needed in (26).
It must be noted that all retrievals conducted with the model described above are only intended to evaluate the methodology and algorithms and in no way claim to represent full-fledged physical CO2 retrieval products, such as those by Crisp et al. [27]. Moreover, fitting multiple spectra measured within a certain spatial distance of each other is only reasonable as long as, within an order of magnitude of the sensor's spatial resolution, there are little to no physically caused fluctuations/gradients in the sought trace gas concentration(s). This is, of course, not the case for localized emitters, such as power plants or biomass-burning events.
4.2. Tests with Synthetic Data
The goal of this subsection is to show the conceptual and effective differences between algorithms solving multiple RHS and the classical case of solving one. This analysis was conducted on the basis of synthetic spectra, i.e., simulated radiance measurements generated with the radiative transfer model Py4CAtS [28]. The benefit of using these in tests is that, in the retrieval (i.e., the fitting process), there is no model error and the exact solution is known. The only deviation from a "perfect" fit is, therefore, controlled by the noise added to the modeled measurements.
In order to be representative of the later tests with real measurements, the same numbers, sizes, and types (distinct in the spatial or spectral dimension) of datasets were generated as the ones used by OCO-2. Moreover, for consistency, all test retrievals were conducted using the same number of fitting parameters, with three linear ones per dataset and a total of two nonlinear ones (cf. vectors in (29)). This allowed for the test cases summarized in Table 1.
For the MRHS case, the Golub–LeVeque VP algorithm (introduced in Section 2.2) was used as a representative solver for multiple-RHS problems. Tests with synthetic spectra indicated that all mentioned MRHS methods (see Table 2) yielded equal accuracy, confirming the theoretical proof by Golub and Pereyra [2] that the solutions found by a variable projection solver are equivalent to those of conventional nonlinear solving methods.
For the single cases, a classical VP (based on O'Leary and Rust [6]) and a conventional NLS single-RHS solver [15] (based on Branch et al. [17]) were tested.
First, the fitting precision of the VP MRHS solver was compared to that of the single solvers using spectra with signal-to-noise ratios (SNRs) in the range of 20 to 500. One measure of the goodness of the fitting results is the relative error of the fitted parameters with respect to the true parameter values. Figure 3 shows the distribution of these errors and the corresponding standard deviations for different SNR values. The signal-to-noise ratios achieved by satellites, including OCO-2, ranged between approximately 200 and 800 for the frames used [25]. This broad variation of OCO-2's SNR comes from changes in the solar position and varying surface reflectivities across the orbit. As expected, both solvers achieved improved precision for increasing SNR values, since a fit becomes more accurate for less noisy data. While both single methods (NLS and VP) performed equally, the VP MRHS solver yielded slightly worse standard deviations. This trend is also reflected in the distributions of the relative errors, which are always more sharply peaked around zero for the single solvers than for the MRHS solver. This behavior is not surprising, since a lower-dimensional residual vector (coming from the shorter data vector) and fewer unknown parameters generally leave less freedom in the fit and, therefore, lead to more precise fitting results. To phrase it differently: as the size of the least squares problem grows, the number of possible local minima that the fit can reach can be expected to grow accordingly.
Considering the similarly shaped distributions in Figure 3 and the fact that VP MRHS seems to improve at the same rate as NLS and VP single, MRHS fits can be viewed as equally effective. In particular, at higher SNRs, both methodologies deviate from the exact results by such low orders of magnitude that the precisions of their fitted values are very comparable.
Another important measure for the accuracy of a fit is the standard deviation of the residuals y − F (the prediction errors, with y: vector of observations, and F: fitted model), also known as the sigma of regression, defined as

σ_fit = ‖y − F‖ / √(m − (n + p)),

where the numerator is the norm of the residual vector of the fit and the denominator contains the number of degrees of freedom (the number of data points m minus the number of unknown variables). Of course, for the s right-hand-side case, the number of linear parameters n becomes s·n, and the number of data points becomes the cumulative size of all datasets (or s·m for equally sized ones), respectively.
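As a minimal sketch (hypothetical helper function, matching the degrees-of-freedom bookkeeping just described), the sigma of regression for a multiple-RHS fit could be computed as:

```python
import numpy as np

def sigma_of_regression(y, f, n_lin_per_spec, p_nonlin, n_spectra=1):
    """Standard deviation of the residuals ('sigma of regression').

    y, f : concatenated observations and fitted model over all datasets.
    With s right-hand sides, the linear parameters count s times
    (one reflectivity polynomial per spectrum), while the nonlinear
    parameters are shared.
    """
    dof = y.size - (n_spectra * n_lin_per_spec + p_nonlin)
    return np.linalg.norm(y - f) / np.sqrt(dof)
```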
Figure 4 shows the mean sigma of regression of all fits over the SNR. Here, the behavior of the single and MRHS fits is similar to that of the errors discussed above. The mean sigmas produced by the MRHS fits are slightly higher than the ones achieved by the single solvers. Again, this is intuitive: A single solver is able to produce distinct nonlinear parameters for each noisy spectrum and, therefore, has more freedom to mimic the noise. An MRHS solver, on the other hand, only has one set of nonlinear parameters for all the spectra, leading to an overall larger deviation between the "observed" and modeled data. This does not necessarily mean the latter is less accurate. On the contrary, since it is less prone to including the specific noise of the spectra in the fit, an MRHS solver could have a smoothing effect on otherwise fluctuating retrieval results (see Figure 2).
The effect of possible "overfitting" might, however, only be relevant for high noise levels (low SNR). As the sigma of regression is proportional to the noise on the radiance, which scales as 1 + ε/SNR (with ε: a normally distributed random value), it decreases with the inverse of the SNR value for both the single and MRHS fits (see the fitted hyperbolas in Figure 4). In this way, as the SNR increases, the sigmas of regression become more similar, such that for an SNR of 200 and higher (representative of OCO-2), the performance differences between the methodologies (single and MRHS) disappear. Thus, for reasonably good data, there is no sacrifice in the precision or accuracy of the produced results when fitting multiple RHS simultaneously instead of fitting one by one.
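Assuming the multiplicative noise model described above (an illustrative sketch, not the actual spectrum generator), the 1/SNR scaling of the residual scatter can be demonstrated directly:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(radiance, snr):
    """Perturb a synthetic radiance with multiplicative noise for a given SNR:
    y_noisy = y * (1 + eps / SNR), with eps ~ N(0, 1)."""
    eps = rng.standard_normal(np.shape(radiance))
    return radiance * (1.0 + eps / snr)

# the relative scatter of the noisy spectrum falls off as 1/SNR
y = add_noise(np.ones(100_000), snr=200)
```

For a unit radiance, the standard deviation of `y` is therefore approximately 1/200, which is why the fitted sigmas of regression in Figure 4 follow hyperbolas in the SNR.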
Now that the differences between MRHS solvers and classical single solvers are established, we need to analyze which MRHS algorithm (see Section 2.2) is the best.
4.3. Tests with Real Measurements
The goal of this subsection is to assess the performance of the new enhanced VP algorithms for multiple RHS described in Section 2.3 and Section 3.2 (VP naive, VP Golub–LeVeque, VP Kaufman) by comparing them to conventional NLS solvers.
The SciPy function used for the NLS reference approach allows the user to choose between three different nonlinear least squares algorithms (explored thoroughly for a single-RHS VP algorithm by Bärligea [18]). In order to better judge the solvers' performances, two of them were used in the tests: the trust region reflective method ('TRF') [17] and the Levenberg–Marquardt method ('LM') [1]. While the former is the most efficient, the latter can be considered the most robust [18], which could be helpful for an increasing number of variables.
In this subsection, an analysis of the test cases listed in Table 2 is conducted. For the assessment, a set of 18 OCO-2 frames was used (all measured on 25 May 2020 on orbit 31366a in the nadir (downward view) acquisition mode just above Australia, with a spacecraft altitude of approximately 711 km [29]); each included 8 observations in both spectral bands (cf. Figure 1), within an area of 24 km² measured along a ground track no wider than 80 km, labeled as cloud-free (no scattering), above land (better reflectivity), and of good quality, according to criteria defined by Crisp et al. [25]. With those, a few hundred test fits were performed with the VP and NLS methods (see Table 2), with varying numbers of RHS ranging from 2 to the maximum available number of 16 (only even numbers, due to the combination of the two spectral bands in one observation). Again, the fits used three linear parameters per spectrum and two nonlinear parameters.
For the evaluation of accuracy, the sigma of regression, the R-score, the confidence bounds of the results, and the fitted residuals were analyzed. The sigma of regression defined in Equation (31) turned out to be equal for all of the tested methods (see the residual analysis below).
A second statistical quantity is the so-called R-score, defined as

R = 1 − ‖y − F‖² / ‖y − ȳ‖²,

indicating the amount of variance (around the mean of the measurements ȳ) accounted for by the fitted model F, evaluated over the cumulative size M of all the datasets. R must lie within [0, 1], and the best possible score a fit can achieve is 1. In the experiments, all of the discussed methods obtained R-scores of approximately 0.99. The only difference could be observed for VP GL and VP KM, whose scores were higher by an average of 0.02% compared to all other methods, which is negligible.
In order to calculate the confidence bounds of the retrieval results, the covariance matrix C = σ_fit² (JᵀJ)⁻¹ needs to be calculated, with the Jacobian J containing the partial derivatives of the model function with respect to the p nonlinear and the linear parameters. For a VP method with multiple right-hand sides, it can be composed of two blocks: the first is the Jacobian of the purely nonlinear function defined in (21) and (22) with respect to the nonlinear parameters, and the second, defined in (17), is the Jacobian with respect to all the linear parameters (cf. Equation (18)). For a confidence level of 95%, one can then calculate the confidence bound(s) (CB) of the retrieved parameters by

CB = q · √(diag(C)),

for which q represents the standard normal distribution quantile of 0.975, and the diagonal elements of C are the variances of the estimated parameters.
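A sketch of this calculation, assuming the full Jacobian and the residual vector at the solution are available (hypothetical helper; the degrees-of-freedom estimate of σ_fit² follows Equation (31)):

```python
import numpy as np
from scipy.stats import norm

def confidence_bounds(jacobian, residuals, n_params, level=0.95):
    """Confidence bounds from the covariance matrix C = sigma^2 (J^T J)^-1.

    jacobian : (m, n_params) partial derivatives of the model at the solution
    residuals: (m,) residual vector of the fit
    """
    dof = residuals.size - n_params
    sigma2 = residuals @ residuals / dof            # sigma of regression, squared
    cov = sigma2 * np.linalg.inv(jacobian.T @ jacobian)
    q = norm.ppf(0.5 + level / 2.0)                 # 0.975 quantile for 95%
    return q * np.sqrt(np.diag(cov))
```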
Figure 5 shows the distribution and mean values of the calculated confidence bounds for the α_CO2 parameter for an increasing number of RHS. While the confidence bounds are already relatively small, they decrease for an increasing number of datasets, similar to before with increasing SNR values (see Figure 4). This indicates that more data yield more accurate fitting results. However, a small difference can be observed in Figure 5 between the naive VP method and the others (VP GL, VP KM, NLS TRF, and NLS LM, which all produced the same results). Apparently, the confidence bounds of the results from the naive method, though decreasing, are slightly worse than the rest. This is probably due to the different and more expensively calculated Jacobian matrix of the naive problem (18). One can, therefore, argue that this is mainly a numerical issue and does not correspond to a lack of accuracy of the VP naive solver. In light of the measures considered above, the tested MRHS solvers all achieved equally accurate fits.
This was also confirmed by the analysis of the fitted residuals, which turned out to be equal for all methods (cf. Table 2). The statistical diagnostics for the residuals of one exemplary VP GL fit (representative of all methods, including NLS) are shown in Figure 6. Ideally, the errors between the fitted model and the measurements should be normally distributed. Due to noise and outliers in the spectral data, this distribution may, however, deviate slightly from a normal one. Yet, the fact that the residuals have their highest density around zero indicates that all the algorithms produced reasonably good fits.
As for robustness, all algorithms yielded convergence rates of 100% for decent initial guesses. For a discussion of the impact of bad initial guesses, see O'Leary and Rust [6], who showed that the VP method ultimately converges more reliably than conventional NLS algorithms. This is mostly due to the fact that the former deals with a reduced nonlinear least squares problem, needing initial guesses only for the p nonlinear parameters instead of for all linear and nonlinear ones, which makes the solver a lot more stable.
In the next step, the fitting times of all mentioned methods were analyzed to compare their computational efficiency. Figure 7 shows the mean running times of a fit for an increasing number of datasets. Here, the VP KM method is not shown, since its performance is similar to that of VP GL. For fairly small numbers of datasets, the NLS algorithms were faster than the VP methods. This stems from the fact that these algorithms are part of the SciPy package [16], which is operationally optimized, whereas the proposed VP code was originally written as a proof of concept and is not yet optimized in the same manner. Still, this picture changed drastically when more RHS were used. Table 3 shows the exact values for 2, 4, and 6 datasets.
For six RHS and more, the suggested VP GL algorithm not only becomes significantly faster than the rest, but it is also the only method whose fitting times increase linearly with the number of right-hand sides (see Figure 7), while all the other tested methods exhibit almost quadratic growth. This confirms that VP GL and VP KM are the most efficient methods when it comes to dealing with the rising complexity of multiple-RHS problems. It also reveals the inferiority of the naive VP method compared to the 'good' VP methods in every test. Even though the naive approach separates the problem and should, therefore, be just as stable as the good approaches, its solving time also rises quadratically with the number of fitting windows, similar to the slower (and inferior) NLS solvers. This must be due to the increasing size of the block diagonal matrix of linear model terms and the resulting extra costs for calculating its pseudoinverse and the corresponding Jacobian.
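The cost difference can be illustrated with a toy comparison on synthetic data (not the BIRRA code): the naive approach factorizes one large block-diagonal least squares system, while the structured approach solves the small diagonal blocks independently and recovers the identical solution:

```python
import numpy as np
from scipy.linalg import block_diag, lstsq

rng = np.random.default_rng(1)

# s spectra, each with its own small design matrix: the naive approach builds
# one (s*m) x (s*n) block-diagonal system, whose factorization cost grows
# superlinearly in s, while the structured approach solves s small systems.
s, m, n = 8, 50, 3
blocks = [rng.normal(size=(m, n)) for _ in range(s)]
ys = [rng.normal(size=m) for _ in range(s)]

# naive: one large least squares problem over the block-diagonal matrix
big_A = block_diag(*blocks)
big_y = np.concatenate(ys)
x_naive, *_ = lstsq(big_A, big_y)

# structured: s independent m x n problems, concatenated afterwards
x_blocks = np.concatenate([lstsq(A, y)[0] for A, y in zip(blocks, ys)])

assert np.allclose(x_naive, x_blocks)
```

Both routes give the same linear coefficients, so exploiting the block structure is purely a question of efficiency, which is what separates VP GL/KM from the naive variant here.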
Comparisons of the two "good" VP algorithms (VP GL and VP KM) showed that, in all of the above categories, such as robustness and accuracy, the Kaufman approach performed equally as well as the Golub–LeVeque one. The only difference could be found in the fitting times, for which the method by Kaufman [8], as predicted, was consistently faster. However, the relative improvement in running time remained below 1% and is, therefore, almost negligible. This confirms the point made by O'Leary and Rust [6] that Kaufman's simplification no longer necessarily poses a computational benefit on modern computers.