1. Introduction
Without the ability to precisely define CO2 dynamics, understanding ecosystem–climate interactions is very difficult [1,2]. Gross primary production (GPP) is the primary driver of land CO2 sink, removing approximately one-quarter of annual anthropogenic CO2 emissions [3]. The signal of sun-induced chlorophyll fluorescence (SIF) is a promising predictor of vegetation function [4]. It directly reflects the photosynthetic activity of plants and can be used as a remote sensing index for estimating photosynthetic energy conversion and carbon absorption [1,5]. Using multiple available platforms, global spatio-temporal patterns of SIF can be easily monitored from space [6,7,8,9,10,11,12,13]. While SIF can be used as a proxy for intra- and inter-annual global and regional patterns in GPP [7,14,15,16], its relationship to GPP is biome- [5,13] and scale-dependent [1,13].
Despite the promising potential of SIF as a proxy for GPP, limitations in the current satellite SIF Level 2 data hinder our understanding of the GPP-SIF relationship at various scales. For clarity, ‘SIF Level 2 data’ refers to satellite-retrieved SIF data that retain the original satellite viewing geometry and have undergone basic quality control and calibration but lack global coverage due to the satellite’s orbital path and cloud cover. Ideally, these data would be available at high resolutions and provide global, contiguous coverage, which, unfortunately, is seldom the case. As a result, a variety of methodologies have been developed to generate contiguous Level 3 SIF data, sometimes at resolutions different from the native Level 2 data, thereby addressing some of the shortcomings mentioned above. Level 2 data represent a higher level of processed information compared to Level 1 data, which typically include raw or minimally processed sensor data. Level 2 SIF satellite data undergo additional processing steps, such as radiometric and geometric corrections, atmospheric corrections, and calibration. These processes aim to remove various artifacts and correct for known errors or biases in the raw data, resulting in more accurate and usable information. Typically, Level 3 processing refers to the aggregation and averaging of satellite data over larger spatial and temporal scales, which implies coarser resolution of the Level 3 datasets, usually gap-free. This aggregation helps to reduce noise and improve the overall accuracy and reliability of the data. Geostatistics and multiple machine learning techniques [
17] used so far for this purpose demonstrated the possibility of creating Level 3 datasets matching or even exceeding the spatio-temporal supports of the Level 2 datasets. An example of such a high-resolution dataset is GOSIF, derived from OCO-2 satellite observations, Moderate Resolution Imaging Spectroradiometer (MODIS), and meteorological reanalysis data, which significantly enhances our ability to monitor global photosynthesis and assess ecosystem health and functionality, as in a study by [
18].
Broadly, the existing approaches can be divided into two groups: (1) ML-based approaches and (2) geostatistical approaches. The former exploit the relationships between SIF and ancillary data, i.e., covariates informative of the SIF value (e.g., NDVI), to reconstruct the SIF value at unsampled locations. The latter model the autocovariance structure of the available data and then combine the measurements with the modeled covariance structure to reconstruct the data at unsampled locations. Both approaches have been successfully deployed to improve, downscale, or fill gaps in satellite SIF measurements. For example, refs. [19,20] recently used a convolutional neural network (CNN) and Extreme Gradient Boosting (XGBoost), together with high-resolution ancillary data, to downscale SIF retrievals from the TROPOspheric Monitoring Instrument (TROPOMI) on board the Sentinel-5P satellite to resolutions of 500 m and 0.05°, respectively. Refs. [
21,
22] harmonized GOME-2 and SCIAMACHY SIF datasets using ML with Moderate Resolution Imaging Spectroradiometer (MODIS) data to downscale SIF products. Ref. [
23] created a high-resolution OCO-2 Level 3 SIF dataset using ML constrained by physiological understandings, and ref. [
24] did the same using high-resolution ancillary data. Ref. [
22] recently downscaled the GOME-2 SIF Level 2 dataset using the Random Forest (RF) model. Ref. [
25] downscaled OCO-2 SIF data to a super-fine resolution of 0.0005° using convolutional neural networks. Examples of the application of geostatistics to improve Level 2 satellite data are numerous, and some of the previous efforts were focused on SIF Level 2 data. In our previous studies [
26,
27], we gap-filled GOME-2 SIF Level 2 data and provided a framework for upscaling and harmonizing the data to a higher resolution to create contiguous Level 3 datasets. The approach was based on modeling the covariance structure of Level 2 SIF data and employing spatial [
27] or spatio-temporal covariance models [
26] to estimate SIF at unsampled locations using the block kriging methodology. This methodology allows for the estimation of SIF values at locations not covered by genuine Level 2 data while also accounting for the change in support compared to the original Level 2 data. In a recent study, we further improved the modeling approach that can be used to process SIF data [
28]. Interestingly, despite the common goals and similar outputs, these two groups of methods have never been directly compared or evaluated together, nor has the potential of creating a synthetic approach been exploited.
In this study, we venture beyond the existing methods and explore the potential of a hybrid model that integrates kriging and machine learning-based approaches. Our goal is to assess whether this hybridization could yield superior accuracy in SIF estimates by capitalizing on the strengths of both methodologies. Before the study, we expected the hybrid technique to outperform both approaches used separately, because the two techniques rely on different types of inputs, namely the relationship of the primary variable with ancillary data and the spatio-temporal covariance structure of the primary variable, in this case, SIF data. We chose universal kriging, also known as kriging with external drift, as the geostatistical component of our hybridization framework. This choice was inspired by its successful use in a similar context, where it helped reconstruct the spatial distribution pattern of the CO2 mixing ratio [29]. For the machine learning component, instead of creating a model and dataset from scratch, we opted to use a publicly available, contiguous Level 3 OCO-2 SIF dataset recently generated using neural networks [
24] as a covariate in the hybrid approach; thus, the properties of the used dataset, including the SIF bands used, spatio-temporal resolution, etc., match the ones from that study. These resources, we believe, provide a solid foundation for our hybrid model due to their proven effectiveness in similar applications. We deployed the moving window ordinary kriging technique as a paradigmatic geostatistical approach, mimicking the approach from [
27]. This method has already been used to estimate GOME-2 SIF at native and upscaled resolution. To evaluate the accuracy of both the parent techniques and the newly created hybrid, we employed a validation method known as ‘leave-one-out cross-validation’. In simpler terms, this method involves using one data point from the dataset as a ‘test’ case and the rest of the dataset for ‘training’. This process is repeated for each data point in our dataset, which consists of the entire 2019 year of OCO-2 SIF retrieval aggregates, containing over 400,000 data points (see
Section 2.2).
2. Methods
Our research is underpinned by the assumption that both geostatistical and machine learning (ML) methodologies for quantifying solar-induced chlorophyll fluorescence (SIF) can achieve enhanced effectiveness when synergistically combined within a hybrid model. To test this hypothesis, we use a combination of aggregated SIF retrieval data and ML-based SIF estimates as inputs to the hybrid technique, which is built within the kriging with external drift framework. The ML-based estimates are derived from a continuous SIF (CSIF) dataset published by [
24], where the authors introduced a novel application of a multi-layer perceptron (MLP) type of ANN for generating a spatio-temporally continuous Level 3 SIF dataset based on ancillary data obtained from the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument. The resulting CSIF product has a spatial resolution of 0.05 degrees (equivalent to approximately 5.6 km × 5.6 km at the equator) and a temporal resolution of four days, which aligns with the MODIS climate model grid (CMG) resolution and the resolution of the ancillary variables used in the original ML technique. The SIF estimates generated by this method will be validated with the aggregated SIF measurements from the Orbiting Carbon Observatory-2 (OCO-2) satellite at the corresponding geolocations (grid cells) of the CMG, where both the target SIF measurements and the ancillary variable measurements were recorded.
2.1. Data
OCO-2 is a NASA satellite that was launched in July 2014. The instrument possesses a three-channel grating spectrometer, with a spectral resolving power of
[
6,
30], centered around the oxygen A band at 0.765 µm and carbon dioxide bands at 1.61 and 2.06 µm. The instrument conducts eight measurements across tracks, with swath widths of ~10 km. Its spatial resolution at the nadir is 1.3 km × 2.25 km. It has a 98.8 min orbit, with a 13:36 nodal crossing time and a 16 d ground-track repeat cycle [
31]. OCO-2 SIF retrievals were validated by comparison to airborne measurements using the Chlorophyll Fluorescence Imaging Spectrometer [
13,
32].
Figure 1 shows the flight trajectory of the OCO-2 satellite on the world map and the measured SIF values.
When utilizing data as inputs for a hybrid data analysis technique like kriging with external drift, the spatial resolution must be carefully considered. In this study, the main variable (SIF retrievals) and the support variable (ML-based estimates) must have the same spatial resolution to produce accurate results. The common resolution is determined by the coarser of the two inputs, which, in this case, is the resolution of the CMG. Therefore, the support of the main variable must be spatially aligned with the grid cells to maintain consistency. To ensure this consistency, the Level 2 OCO-2 SIF retrievals were brought to a uniform 0.05-degree resolution by aggregation. This involved selecting all SIF soundings within the bounds of a specific grid cell and retaining only the retrievals classified as clear-sky in the OCO-2 dataset. The remaining retrievals were then aggregated by calculating their mean, following the approach outlined by [33]. As a final step, grid cells containing fewer than five clear-sky retrievals were removed from the pseudo-retrieval SIF dataset.
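The aggregation step described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the function name `aggregate_to_grid`, its arguments, and the cell-indexing convention (cells anchored at 90° S, 180° W) are assumptions made for the example.

```python
import numpy as np

def aggregate_to_grid(lat, lon, sif, clear_sky, cell_deg=0.05, min_count=5):
    """Average clear-sky soundings per grid cell; drop cells with < min_count soundings.

    Hypothetical helper: the grid origin (-90, -180) and argument names are assumptions.
    Returns cell-centre latitudes, longitudes, mean SIF, and the sounding counts.
    """
    lat, lon, sif = (np.asarray(a, dtype=float) for a in (lat, lon, sif))
    keep = np.asarray(clear_sky, dtype=bool)
    lat, lon, sif = lat[keep], lon[keep], sif[keep]

    n_cols = int(round(360.0 / cell_deg))
    n_rows = int(round(180.0 / cell_deg))
    row = np.clip(np.floor((lat + 90.0) / cell_deg).astype(int), 0, n_rows - 1)
    col = np.clip(np.floor((lon + 180.0) / cell_deg).astype(int), 0, n_cols - 1)
    cell_id = row * n_cols + col

    # Group soundings by cell and compute per-cell sums and counts.
    ids, inverse, counts = np.unique(cell_id, return_inverse=True, return_counts=True)
    sums = np.bincount(inverse, weights=sif, minlength=ids.size)

    ok = counts >= min_count  # keep cells with at least min_count clear-sky soundings
    centre_lat = (ids[ok] // n_cols + 0.5) * cell_deg - 90.0
    centre_lon = (ids[ok] % n_cols + 0.5) * cell_deg - 180.0
    return centre_lat, centre_lon, sums[ok] / counts[ok], counts[ok]
```

The resulting per-cell means play the role of the pseudo-retrievals; the threshold `min_count=5` mirrors the removal of cells with fewer than five clear-sky retrievals.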
The measured values of solar-induced fluorescence (SIF) are subject to variations that are primarily influenced by several factors, each of which contributes differently based on the environment and context of observation. The challenges of upscaling SIF measurement to daily average values are discussed in detail in a recent study [
34]. The impacts of solar-view geometry and canopy structure were analyzed in [
35] and the impact of growth and environmental factors in [
36]. In addition, vegetation physiology and stress can cause variations ranging from 5% in controlled agricultural settings to more than 50% in natural ecosystems under environmental stress [
36,
37]. Additionally, the viewing geometry of the satellite and its orbital characteristics can introduce variability in SIF measurements by 5–20%, depending on the sensor and orbit specifics [
38].
2.2. Generating Continuous SIF Estimates by Using an ML Model
In a study by [
24], a neural network architecture composed of five input variables and one output variable was used to precisely estimate clear-sky solar-induced chlorophyll fluorescence (SIF) values at specific coordinates. The input variables, which were selected based on their informative value, were obtained from the Nadir Bidirectional Reflectance Distribution Function (BRDF)-Adjusted Reflectance (NBAR) product of the Moderate Resolution Imaging Spectroradiometer (MODIS) dataset (MCD43C4 V006). In particular, the first four MODIS bands, centered at wavelengths of 645 nm, 858 nm, 469 nm, and 555 nm, were used to extract the reflectance data, as previously suggested [
39]. These particular bands were chosen due to their known influence on the variation in SIF and their inclusion of significant vegetation-related information [
40]. It is important to highlight that the selection, rationale, and justification, as well as any potential constraints, regarding the training data utilized in the development of the machine learning (ML) model by [
24], were beyond the control of the authors of the current study. While it could be argued that the incorporation of additional variables, such as supplementary vegetation indices or meteorological data, might alter or enhance the performance of the ML model, the neural network-generated dataset is publicly available and serves as an input for our study. Therefore, an extensive elaboration on the decisions made during the development of that dataset is beyond the scope of this study. The output variables used for training the neural network were derived from the OCO-2 dataset and consisted of sounding-based SIF retrievals at 757 nm. The output data were subsequently processed by averaging and filtering, and their resolution was adjusted to match the CMG resolution of the input data.
To accurately estimate the values of solar-induced chlorophyll fluorescence (SIF) using a hybrid method, data were collected from the OCO-2 SIF dataset for the entirety of 2019. This period was selected to account for any intra-annual and seasonal variations in the SIF signal and any potential correlations between SIF and other variables. This study is confined to 2019, despite the availability of data beyond this timeframe, as its primary focus lies in method development rather than providing updated or expanded datasets. The resulting dataset served as a reference, against which the performance of the machine learning (ML), geostatistical, and hybrid approaches was compared using the leave-one-out cross-validation scheme (detailed in
Section 2.5). The fraction of the CSIF product used for the development of the hybrid method also includes data from the entire year of 2019. This specific year was chosen due to the availability of the most recent version of the CSIF product, with a full year’s data. This allows for the alignment of the collection of the ground truth data with the best-available ML-based SIF estimates in support of the hybrid method. An R² analysis was performed on the CSIF and OCO-2 data, specifically for the year 2019, to ensure the data’s validity. The OCO-2 data were initially aggregated for each grid cell, creating pseudo-retrievals, a term referred to in the subsequent sections, following the methodology introduced by [24]. This approach considers only clear-sky values and grid cells with more than 5 clear-sky observations. The resulting plot is illustrated in Figure 1a. The achieved R² score is 0.80, comparable to values reported in the original study.
2.3. Moving Window Ordinary Kriging
The ordinary kriging method used in the present study builds on the previous work of [
27,
41]. We perform the three-step mapping for each estimation location using observations collected on the same day. These steps are as follows: (1) subsampling of the observations, (2) characterization of the local spatial covariance structure, and (3) interpolation at the native spatial resolution. The kriging scheme that does not allow for the change of support is chosen because the supports for the pseudo-retrievals and ML-derived estimates have been previously harmonized. The goal of the subsampling strategy is to put more weight on the observations in the vicinity of a given estimation location, for both the characterization of the local spatial covariance structure and interpolation step, so that the mapped value and the associated uncertainty are representative of local values and variability. This is accomplished by selecting the minimum number of observations (N) that need to be present within the distance (‘moving window’; D) from the estimation location for the mapping to be performed. N is selected to be large enough to yield a representative sample. We do not constrain the upper bound of N, because all Ns encountered in this use case were computationally feasible to process. For the present study, N was set to 20 and D to 500 km. The preliminary step of the modeling is variography. For each estimation grid cell, a raw variogram is calculated based on the subsampled observations:
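$$\gamma(h) = \frac{1}{2}\left[y(x_i) - y(x_j)\right]^2 \qquad (1)$$

where $\gamma$ is the raw variogram value for a given pair of observations $y(x_i)$ and $y(x_j)$, and $h$ is the great circle distance between the locations ($x_i$ and $x_j$) of these observations. The exponential theoretical variogram model with a nugget effect is fitted to the raw variogram using non-linear least squares, mimicking [27]:

$$\gamma(h) = \sigma^2\left[1 - \exp\left(-\frac{h}{l}\right)\right] + \sigma^2_{nug} \qquad (2)$$

where $\sigma^2$ and $l$ are the variance and correlation length of the mapped quantity, and $\sigma^2_{nug}$ is the nugget variance, typically representative of retrieval errors.

The variogram parameters are used to define a corresponding local spatial covariance structure:

$$Q(h) = \sigma^2 \exp\left(-\frac{h}{l}\right) \qquad (3)$$

The matrix representing the measurement and retrieval error covariance structure (the nugget effect) is:

$$\mathbf{R} = \sigma^2_{nug}\,\mathbf{I} \qquad (4)$$

where $\mathbf{I}$ is the $N \times N$ identity matrix. After modeling covariance parameters for each estimation location, the linear system of equations is solved to obtain the $N$ weights $\boldsymbol{\lambda}$ assigned to the subsampled observations:

$$\begin{bmatrix} \mathbf{Q} + \mathbf{R} & \mathbf{1} \\ \mathbf{1}^T & 0 \end{bmatrix} \begin{bmatrix} \boldsymbol{\lambda} \\ -\nu \end{bmatrix} = \begin{bmatrix} \mathbf{q} \\ 1 \end{bmatrix} \qquad (5)$$

where $\mathbf{Q}$ is an $N \times N$ covariance matrix among the $N$ subsampled observations, as defined in Equation (3), $\mathbf{R}$ is an $N \times N$ diagonal retrieval error covariance matrix among the $N$ observations, as defined in Equation (4), $\mathbf{1}$ is an $N \times 1$ unity vector, $T$ denotes the vector transpose operation, and $\mathbf{q}$ is an $N \times 1$ vector of the spatial covariances between the estimation location and the $N$ observation locations. $\boldsymbol{\lambda}$ and the Lagrange multiplier $\nu$ are obtained by solving the system in Equation (5). They are subsequently used to define the estimate ($\hat{z}$) and estimation uncertainty variance ($\sigma^2_{\hat{z}}$):

$$\hat{z} = \boldsymbol{\lambda}^T \mathbf{y} \qquad (6)$$

$$\sigma^2_{\hat{z}} = \sigma^2 - \boldsymbol{\lambda}^T \mathbf{q} + \nu \qquad (7)$$

where $\mathbf{y}$ is the $N \times 1$ vector of subsampled observations, and $\sigma^2$ is the variance in the SIF, as shown in Equation (3).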
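To make the interpolation step concrete, the following minimal sketch solves an ordinary kriging system of the kind referred to in Equation (5) for a single estimation location. It assumes an exponential covariance of the form σ² exp(−h/l), a diagonal nugget matrix, and variogram parameters that have already been fitted; the function names, argument names, and the sign convention for the Lagrange multiplier are choices made for this illustration and are not taken from the original implementation. The moving-window subsampling and the variogram fit are not shown.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great circle distance in km; inputs in degrees, broadcastable."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = p2 - p1
    dlmb = np.radians(lon2) - np.radians(lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

def ordinary_kriging(obs_lat, obs_lon, y, lat0, lon0, sigma2, corr_len, sigma2_nug):
    """Ordinary kriging estimate and variance at (lat0, lon0).

    Assumed covariance model: sigma2 * exp(-h / corr_len) plus a diagonal nugget.
    """
    obs_lat, obs_lon, y = (np.asarray(a, dtype=float) for a in (obs_lat, obs_lon, y))
    n = y.size
    h = great_circle_km(obs_lat[:, None], obs_lon[:, None],
                        obs_lat[None, :], obs_lon[None, :])
    Q = sigma2 * np.exp(-h / corr_len)       # covariance among observations
    R = sigma2_nug * np.eye(n)               # retrieval error (nugget) covariance
    q = sigma2 * np.exp(-great_circle_km(obs_lat, obs_lon, lat0, lon0) / corr_len)

    # Augmented system: [[Q + R, 1], [1^T, 0]] [lambda, -nu]^T = [q, 1]^T
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = Q + R
    A[:n, n] = 1.0
    A[n, :n] = 1.0
    sol = np.linalg.solve(A, np.append(q, 1.0))
    lam, nu = sol[:n], -sol[n]

    z_hat = lam @ y                          # weighted sum of observations
    var_hat = sigma2 - lam @ q + nu          # estimation uncertainty variance
    return z_hat, var_hat, lam
```

Because the last row of the system forces the weights to sum to one, a spatially constant field is reproduced exactly, which is a convenient sanity check for any implementation.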
2.4. Hybrid Approach: Kriging with External Drift
Universal kriging, also known as kriging with external drift, is a technique used for data with a significant trend [
42]. The mathematical machinery of universal kriging is very similar to that of ordinary kriging (see Equations (5)–(7) in [
43]). In many cases encountered in environmental sciences, the trend is described as a function of spatial or spatio-temporal coordinates. However, the universal kriging framework is indifferent to the origins of the trend component, which allows for the use of machine learning (ML) SIF estimates as the source of the trend. The resulting approach combines kriging, which relies on spatial covariance structure analysis, with ML, which relies on the relationship between the primary variable (SIF) and ancillary data. It is generally accepted that the inclusion of secondary variables improves the accuracy of the kriging-based predictions [
43], and this method also allows for the inclusion of non-linearities in kriging through the use of ML SIF covariates, making the hybrid technique more flexible. Alternatively, instead of introducing ML predictions as covariates into kriging machinery, one could use a regression kriging approach, i.e., first, ML predictions are used as the model of the trend, followed by kriging the residuals using ordinary kriging. However, these two approaches are mathematically equivalent and produce identical results [
44].
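As an illustration of how an ML-derived covariate can enter the kriging machinery, the sketch below extends an ordinary kriging system with one additional unbiasedness constraint on the drift variable. This is one common way to write kriging with external drift and is given under stated assumptions: the covariance matrices are taken as precomputed, the drift is a single covariate (for example, the ML-based SIF estimate at each location), and the function and argument names are hypothetical.

```python
import numpy as np

def kriging_external_drift(Q, R, q, y, drift_obs, drift0, sigma2):
    """Kriging with external drift using one covariate plus a constant term.

    Q, R      : N x N covariance and nugget matrices among observations (precomputed)
    q         : N-vector of covariances between observations and the target location
    drift_obs : covariate values at the N observation locations (assumed: ML estimates)
    drift0    : covariate value at the target location
    """
    y = np.asarray(y, dtype=float)
    drift_obs = np.asarray(drift_obs, dtype=float)
    n = y.size
    F = np.column_stack([np.ones(n), drift_obs])   # N x 2 drift design matrix
    f0 = np.array([1.0, drift0])

    # Augmented system: [[Q + R, F], [F^T, 0]] [lambda, -nu]^T = [q, f0]^T
    A = np.zeros((n + 2, n + 2))
    A[:n, :n] = Q + R
    A[:n, n:] = F
    A[n:, :n] = F.T
    sol = np.linalg.solve(A, np.concatenate([q, f0]))
    lam, nu = sol[:n], -sol[n:]

    z_hat = lam @ y
    var_hat = sigma2 - lam @ q + f0 @ nu
    return z_hat, var_hat, lam
```

The two constraints force the weights to sum to one and to reproduce the covariate at the target location, so any field that is exactly linear in the covariate is recovered without error. The system is solvable only if the covariate values at the observation locations are not all identical.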
The ML-based SIF estimates are employed as ancillary data (1) in the universal kriging scheme (2), along with the Level 2 SIF data (4) and the covariance model parameters (6) derived through variography based on the Level 2 SIF data (5), as shown in Figure 2. These three inputs undergo universal kriging to generate hybrid SIF estimates, which exhibit superior performance compared to both ML-based estimates and purely geostatistical estimates without ancillary data.
By incorporating both ML-based and geostatistical techniques, the hybrid method leverages the strengths of each approach, leading to improved SIF estimates. The utilization of ancillary data, including ML-based estimates and Level 2 SIF data, contributes to the enhanced accuracy and reliability of the hybrid SIF estimates.
This hybrid approach utilizes machine learning (ML)-derived solar-induced fluorescence (SIF) estimates as an external drift in the kriging with external drift model, effectively exploiting the detailed spatial patterns and relationships captured by ML together with the spatial interpolation strength of kriging. The methodological integration aims to mitigate the limitations inherent in each approach when used independently, thereby offering a more robust and accurate estimation of SIF.
Combining machine learning models with geostatistical methods, this hybrid approach leverages the strengths of both methodologies: the capacity of machine learning to handle complex, non-linear relationships within large datasets, and the efficacy of geostatistical methods in incorporating spatial autocorrelation and addressing spatial data anomalies. Through this integration, our objective is to enhance prediction accuracy and reliability beyond what could be achieved using individual methods, particularly in spatial contexts where environmental variables exhibit strong spatial dependencies.
While SIF measurements and predictions were utilized to illustrate the current approach, it is essential to note that the applicability of this methodology is not limited to SIF alone. We aim to establish a potential general framework for the hybridization of ML and geostatistical approaches in other domains.
2.5. Method Evaluation: Leave-One-Out Cross-Validation
The performance of the mapping method was tested in terms of (1) accuracy (the difference between estimates and true values) and (2) bias (the mean of the difference between estimates and true values). A leave-one-out cross-validation on the entire dataset was used for the assessment. Estimates were given at satellite native supports, allowing for direct comparison of estimates and retrievals. Model-data mismatch was assessed for every extracted coordinate using all retrievals within that same day (except the one obtained at the actual estimation location). Mapping steps were repeated ab initio for every cross-validation location. The statistics were built upon a comparison of estimates and measurements (retrievals).
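The cross-validation loop can be sketched as follows. This is a generic illustration rather than the evaluation code of the study: the `predict` callback stands in for any of the three mapping methods, the same-day restriction follows the description above, and summarizing accuracy by the root mean square error is an assumption made for the example.

```python
import numpy as np

def leave_one_out_by_day(y, day, predict):
    """Leave-one-out cross-validation restricted to same-day observations.

    predict(train_idx, test_idx) must return an estimate for observation test_idx
    using only the observations in train_idx (hypothetical callback signature).
    """
    y = np.asarray(y, dtype=float)
    day = np.asarray(day)
    idx = np.arange(y.size)
    est = np.empty(y.size)
    for i in idx:
        train = idx[(day == day[i]) & (idx != i)]  # same day, target withheld
        est[i] = predict(train, i)
    err = est - y
    return {"bias": err.mean(), "rmse": np.sqrt((err ** 2).mean()), "estimates": est}
```

A call such as `leave_one_out_by_day(y, day, lambda tr, i: y[tr].mean())` evaluates a trivial same-day mean predictor and can serve as a baseline against which a kriging-based `predict` is compared.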
4. Discussion
The limitations of the presented approaches are evident in the discrepancies observed in error statistics and correlation coefficients among the three methods. A notable decline in correlation is observed in machine learning (ML)-generated solar-induced fluorescence (SIF) values beyond a certain threshold, along with a reduction in variability, a phenomenon known as a loss of variance, particularly evident in low SIF values. This phenomenon points to intrinsic constraints in the ML approach in handling extremes, well recognized beyond the boundaries of this study. Specifically, the machine learning model tends to compress the range of predicted SIF values, resulting in a vertical line pattern for low values and a cut-off effect for high values. This behavior implies a potential underestimation of variability in areas with minimal vegetation and an artificial cap on SIF predictions, which may fail to capture the highest levels of photosynthetic activity adequately. While the hybrid approach partially mitigates these issues by incorporating geostatistical methods, acknowledging and addressing these limitations in machine learning predictions are essential for enhancing the accuracy and reliability of SIF estimation across diverse vegetation types and conditions.
In light of the observed properties of the hybrid method and indications suggesting its relative outperformance compared to both parent methods, we posit that further enhancements in the absolute performance of both the hybrid and parent methods are conceivable. The inherent accuracy of machine learning (ML)-derived predictions fundamentally depends on two factors: the nature and quality of ancillary data inputs and the efficacy of the specific ML model architecture tailored to the current use case. While the selection of ancillary data is contingent upon the specific use case, potentially unrelated to solar-induced fluorescence (SIF) specifically, ML model architectures remain subject to ongoing evolution. We foresee that the adoption of convolutional network-based approaches, expanding the scope of ancillary data inputs to encompass a broader spatio-temporal neighborhood, holds promise for enhancing the accuracy of ML-based techniques and their resulting predictions [
19]. On the geostatistical front, sophisticated hyperdimensional approaches exploiting spatio-temporal correlation fields and data from a wider spatio-temporal neighborhood have demonstrated superior performance over traditional spatial kriging methods [
28]. Additionally, the proposed hybridization scheme employing kriging with external drift represents merely one feasible approach. Alternatively, geostatistical predictions, combined with ancillary data, could serve as inputs for an ML model, potentially yielding even greater accuracy. We advocate for the exploration of these possibilities in future research endeavors.
5. Conclusions
In this research, we successfully developed and assessed a novel hybrid methodology for estimating solar-induced chlorophyll fluorescence (SIF) at locations where direct samples are not available. This methodology integrates two foundational approaches—ordinary kriging within a moving window and machine learning—into a cohesive framework using kriging with external drift.
Our evaluation, conducted through leave-one-out cross-validation on the global OCO-2 SIF dataset from 2019, reveals that our hybrid method consistently matches or surpasses the parent techniques in terms of accuracy, variance explained, and bias reduction. A distinctive advantage of our hybrid approach is its adaptive weighting mechanism, which intelligently allocates more weight to proximal sampled locations and increasingly leans on machine learning covariates as the distance to the nearest sample extends beyond the decorrelation length. Additionally, we offer a quantification of SIF estimate uncertainty, which is inherent to kriging methodologies.
The chosen resolution in our study aligns with the spatial support of the employed datasets, showcasing the adaptability of our method to meet the resolution requirements of various applications. This adaptability, combined with the assumption of stationarity in the relationship between ancillary data and solar-induced fluorescence (SIF) values, suggests the potential extension of our methodology into a downscaling technique within the geostatistical domain. The methodological groundwork for this extension has already been laid out; as previously mentioned, machine learning (ML) has been successfully employed for downscaling SIF data [
25], and a geostatistical framework for downscaling SIF and other data is provided in the area-to-point subset of methods within the kriging family [
46]. Our results are promising and indicate that the developed hybrid approach is not confined to SIF estimation but is applicable to a broader range of satellite datasets that exhibit similar characteristics. Furthermore, the machine learning component of our framework is flexible, allowing for the future integration of more comprehensive ancillary data or alternative model architectures, with no fundamental alterations to the method.
Expanding upon this innovative hybrid methodology for solar-induced fluorescence (SIF) estimation, several practical applications emerge, underscoring its versatility across diverse fields. For instance, in precision agriculture, this approach can accurately map spatial variability in crop health at a resolution sufficient for informing targeted intervention strategies, ultimately boosting yields and reducing input costs. Similarly, in forest management, the methodology’s capability to integrate and analyze satellite datasets can aid in the early detection of areas under stress from drought or disease, facilitating timely conservation efforts and mitigating potential losses. Urban planners could utilize the refined SIF data to monitor the health of urban green spaces, contributing to strategies aimed at improving air quality and enhancing the urban environment for residents. Additionally, in climate research, the framework’s ability to provide accurate and high-resolution SIF estimation enables more precise modeling of carbon fluxes, offering insights into the impacts of climate change on terrestrial ecosystems and informing global carbon management policies. These examples underscore the broad applicability and potential impact of the developed hybrid methodology on environmental management and sustainability efforts worldwide.
The versatility and robustness of the proposed framework mark a significant advancement in the field of remote sensing and geospatial analysis. We anticipate that the methodology will not only enhance the precision of SIF estimation but also serve as a template for the development of similar hybrid models in related domains. Future research will focus on exploring the potential application of this framework to various types of satellite data. Specifically, in reference to solar-induced fluorescence (SIF), we target data obtained from the Fluorescence Explorer (FLEX) mission. The high resolution and specificity of FLEX’s fluorescence measurements are expected to provide detailed and accurate SIF estimations. This will enable the enhancement of predictive models and the investigation of subtle photosynthetic dynamics across diverse vegetation types. The proposed method facilitates a straightforward implementation of advanced and emerging machine learning techniques, thereby fostering the development of more robust and versatile models.