Estimation of Terrestrial Global Gross Primary Production (GPP) with Satellite Data-Driven Models and Eddy Covariance Flux Data

: We estimate global terrestrial gross primary production (GPP) based on models that use satellite data within a simplified light-use efficiency framework that does not rely upon other meteorological inputs. Satellite-based geometry-adjusted reflectances are from the MODerate-resolution Imaging Spectroradiometer (MODIS) and provide information about vegetation structure and chlorophyll content at both high temporal (daily to monthly) and spatial ( ∼ 1 km) resolution. We use satellite-derived solar-induced fluorescence (SIF) to identify regions of high productivity crops and also evaluate the use of downscaled SIF to estimate GPP. We calibrate a set of our satellite-based models with GPP estimates from a subset of distributed eddy covariance flux towers (FLUXNET 2015). The results of the trained models are evaluated using an independent subset of FLUXNET 2015 GPP data. We show that variations in light-use efficiency (LUE) with incident PAR are important and can be easily incorporated into the models. Unlike many LUE-based models, our satellite-based GPP estimates do not use an explicit parameterization of LUE that reduces its value from the potential maximum under limiting conditions such as temperature and water stress. Even without the parameterized downward regulation, our simplified models are shown to perform as well as or better than state-of-the-art satellite data-driven products that incorporate such parameterizations. A significant fraction of both spatial and temporal variability in GPP across plant functional types can be accounted for using our satellite-based models. Our results provide an annual GPP value of ∼ 140 Pg C year − 1 for 2007 that is within the range of a compilation of observation-based, model, and hybrid results, but is higher than some previous satellite observation-based estimates.


Introduction
Global terrestrial gross primary productivity (GPP), the amount of carbon dioxide (CO 2 ) that is assimilated by plants through photosynthesis, plays a critical role in the global carbon cycle as it can moderate the amount of CO 2 from anthropogenic sources that remains in the atmosphere. Despite the collection of ground-based data at hundreds of individual sites, as well as vast amounts of remotely-sensed data from satellites and the development of sophisticated process-based models, there still remains significant uncertainty in global GPP estimates including their global annual mean and inter-annual variability (IAV) [1]. Estimation of the variability in the land sink for global carbon budget assessments relies upon dynamic global vegetation models that are driven by observed environmental changes [2]. Peters et al. [3] note that the understanding of the land sink is currently limited in part by a dearth of spatially-explicit observations of vegetation changes and recommend systematic benchmarking of the models, including productivity, for improvement.
GPP is not measured directly either at the ground or from satellite. Using the eddy covariance technique, tower-based measurements of net ecosystem exchange (NEE) are partitioned into GPP and ecosystem respiration using assumptions regarding the latter (e.g., [1,4,5] and references therein). Despite the assumptions used in this approach, tower-based estimates are considered a standard for GPP estimates. Approaches to scale up these measurements and extrapolate globally involve first deriving relationships between tower-based GPP estimates and parameters from satellite-based instruments and meteorological analyses and then applying these parameterizations globally (e.g., [6][7][8][9][10]). Tramontana et al. [9] demonstrated that such approaches that use only remotely-sensed data perform almost as well as the best model that also incorporates meteorological inputs.
Other data-driven approaches to estimate GPP globally use satellite data in combination with models and various assumptions (see for comprehensive reviews [1,11]). Many of these are based on the light-use efficiency (LUE) concept [12,13], i.e., where PAR in is photosynthetically-active radiation incident at the top of canopy and FAPAR chl is the fraction of PAR in absorbed by chlorophyll. Several studies have used the LUE model with space-based estimates of FAPAR chl or a proxy such as from satellite-derived vegetation indices (VI), PAR in estimates from a data assimilation system or other measurements, and parameterizations of LUE that use meteorological data as inputs to estimate GPP globally or on a smaller scale (e.g., [14][15][16][17][18][19][20][21]). VIs are computed with a limited number of broad-band channels on satellite imagers. VIs that have been used or considered for estimating GPP include the dual wavelength Normalized Difference Vegetation Index (NDVI) [22] and the Near-Infrared Reflectance of terrestrial Vegetation (NIR V ) [23] that is defined as the product of NDVI and reflectance from the near-infrared channel used in the NDVI. The potential benefit of using VIs for GPP estimation is that they can be simply derived from satellite reflectances. In particular, an NDVI record is available for over three decades [24].
VIs have been used to compute GPP with an approximation to Equation (1), i.e., GPP = S × VI × PAR in (2) (e.g., [25]) or other similar forms (e.g., [26]) where VI is used as a proxy for FAPAR chl . LUE, here represented as S, may be treated a constant [27,28] or as a more complex function of VI itself [29,30] or other parameters where S may vary with plant functional type (PFT) (e.g., [31]). Many of these approaches use flux tower data to train or calibrate their models with a limited number of sites or PFTs and have not been tested globally. Some of the noted problems using VIs are (1) a decrease in sensitivity at high values, particularly in managed croplands (e.g., [25,32]) and (2) a positive offset that occurs when photosynthetic activity should be very low or zero (e.g., [20,26,33,34]) (i.e., these parameters have positive values when photosynthesis is not occurring). Yuan et al. [35] found that LUE models in general tended to overestimate GPP at low values and underestimate at high values and that most LUE models do not account for the fact that LUE is higher on cloudy days with more diffuse radiation (e.g., and references therein [36]). However, some LUE model formulations have included the radiation effect on LUE [37,38].
Gitelson et al. [39] and Peng et al. [40] found that the use of potential PAR (PAR pot , i.e., the PAR at top of canopy that would be present in a cloudless atmosphere) in a simple model, i.e., produced superior results as compared with PAR in in cropland vegetation. The improvement obtained by the use of PAR pot in place of PAR in with constant LUE implicitly accounts for the diffuse radiance effect on LUE by producing an effectively higher LUE at lower values of PAR in . Peng et al. [40] notes that this model does not account for variations in GPP produced by short time scale (minutes to hours) stresses. They also suggested testing of this model at other geographic locations and for other vegetation types.
Here, we expand on this approach by training and evaluating models based on Equations (2) and (3) with available global GPP estimates from flux tower observations. We then test the satellite-based trained model using independent global GPP from flux towers (i.e., not used in the training). This is a simple and computationally-efficient method to estimate global GPP with a limited number of remotely-sensed parameters with a focus of daily to monthly time scales. We specifically examine whether a modified function of NDVI or a linear combination of band reflectances can accurately represent FAPAR chl to mitigate the above-mentioned shortcomings of VIs.
We further explore whether satellite measurements of chlorophyll solar-induced fluorescence (SIF) may be used to improve estimation of global GPP. SIF originates from the photosynthetic machinery inside leaves and has been measured with several satellite instruments [41][42][43][44][45][46][47][48]. The spectrum of chlorophyll fluorescence is spread over a range of wavelengths with two broad peaks, centered near 685 and 740 nm, referred to as the red and far-red emissions, respectively. Radiative transfer and physiological modeling suggests that canopy-level far-red SIF is strongly related to APAR [49]. Most satellite SIF studies have been conducted using far-red SIF, and henceforth, in this work, SIF will refer to far-red SIF. Several studies have shown a highly linear relationship between far-red canopy-level SIF and GPP on monthly time scales [32,46,[50][51][52][53][54] that can be explained by modeling results [55]. Far-red SIF also closely follows seasonal [33] and daily to diurnal [54,56] variations in GPP. SIF measurements also alleviate the above-mentioned short-comings of VIs [32,33,57]. However, unlike VIs, SIF is a very small and difficult to measure signal. As a result, satellite SIF measurements are relatively noisy, prone to systematic errors [44,50,58] and currently limited in terms of both spatial and temporal resolution.
Here, we use attributes of both satellite-based reflectances and SIF in a computationally-efficient way to estimate global GPP accurately at high temporal and spatial resolution. In doing so, we seek to address the question: How much of the variance in GPP can be explained with satellite-derived quantities using a simple observation-based approach, particularly in comparison with more sophisticated approaches? To answer this question, we use GPP estimates derived at eddy covariance flux tower stations from the FLUXNET 2015 database to both calibrate and evaluate our satellite-based models. The results are examined at various temporal and spatial resolutions. We also compare with other data-driven GPP estimates and proxies as benchmarks. This comparison provides some insights into which parameters may be important for machine learning (ML) GPP estimates and ultimately may be used to improve the ML results.

Calculations of and Adjustments to Vegetation Indices
The NDVI and NIR V (Normalized Difference Vegetation or Infrared Indices) are computed from MODIS Band 1 and 2 reflectances (ρ) using: and: where C is set to 0.08 as in Badgley et al. [23]. As noted above, the NDVI offsets at low values (denoted here as N 0 ) are presumably the result of non-chlorophyll-containing substances that produce a positive NDVI when photosynthesis or GPP is at or near zero. The chlorophyll-containing vegetation fraction ( f ) occupying a given satellite pixel increases during the growing season along with the NDVI. Therefore, subtracting a constant at all values of NDVI to reduce the offset may produce a corresponding underestimate at high values. Instead, we compute an adjusted NDVI according to NDVI = NDVI − NDVI 0 where NDVI 0 decreases linearly with NDVI from N 0 to zero as NDVI increases from N 0 (at f = 0) to >N 1 (at f = 1), i.e., An expectation is that N 0 and perhaps N 1 vary with location. Satellite SIF measurements are near or at zero when GPP is near zero (e.g., [33]) and therefore offer the potential to map N 0 globally. We attempted to derive spatially-varying N 0 using coincident observations of NDVI and satellite-based SIF. SIF data are only available with good temporal resolution at fairly low spatial resolution (∼0.5 • ). We applied standard least-squares fitting for grid boxes with sufficient seasonal variability in f . In this approach, we set N 1 to an empirically-derived constant value of 0.75. In the end, we obtained improved results using a constant value of N 0 = 0.25 rather than site-specific N 0 derived with SIF data. Therefore, for the remainder of the paper, we use N 0 = 0.25 and N 1 = 0.75 to compute NDVI according to Equation (6). It is hoped that improved results may be obtained in the future with higher spatial resolution and higher quality SIF data.

Use of Linear Combinations of Bands as an FAPAR chl Proxy
In place of VIs or modified VIs in Equations (1) or (2), we alternatively use the form: where i is a particular reflectance band, N is the number of bands considered and ρ i are combined using weights A i and an offset A 0 . Gentine and Alemohammad [62] similarly used a neural network to estimate a SIF-based FAPAR chl proxy using MODIS reflectances. We examine the results below applied to GPP estimation for two different combinations of MODIS bands: (1) Bands 1 and 2 (red and NIR 1 ) that are used to compute NDVI and NIR V and (2) Bands 1-7 that are generally used for terrestrial remote sensing. Note that from MODIS, the first two bands are measured at ∼250 m, while the second group is measured at ∼500 m. Bands 1 and 2 or similar bands are also available on a number of other satellite sensors, such as the Advanced Very-High Resolution Radiometer (AVHRR) dating back to the early 1980s (albeit at lower spatial resolution than MODIS) and very high spatial resolution imagers such as on the Landsat and Sentinel 2 satellites. The additional MODIS bands may be useful for further removal of signals from non-chlorophyll-containing substances or atmospheric effects, as well as for noise reduction in the case of redundant information.

FLUXNET 2015
FLUXNET 2015 is a collection of eddy covariance flux data from multiple regional networks [63] that is an extension of the FLUXNET La Thuile dataset (2007). It includes improvements to the data quality control protocols and the data processing pipeline and the use of reanalysis data to fill gaps in micrometeorological variable records. Figure 1 shows maps of the flux tower sites used in this study that include only Tier 1 data open and free for scientific use. A listing of each site used in this study is provided in the Appendix B. There are two main ways to estimate GPP (in units of g C m −2 d −1 ) by partitioning of the NEE computed with the variable Ustar (u * ) threshold (VUT): (1) night-time method (GPP_NT_VUT_REF) [5]; and (2) day-time method (GPP_DT_VUT_REF) [4]. All results shown here use an average of night-and day-time partitioning methods. We repeated all analyses with each method separately and found that the conclusions did not differ significantly depending on which method was used. We use daily (DD, calculated from half-hourly data) and monthly (MM, average from daily data) files. We also made use of the short-wave (SW) incoming (IN) flux (SW_IN_F) and potential (top of atmosphere, SW_IN_POT), both in W m −2 , provided in the dataset.
We apply quality control similar to that in previous works [10]. We use only daily or monthly averages where 20% or less have come from gap-filled data. We also use only data where the NEE uncertainties provided in the FLUXNET 2015 dataset are <3 g C m −2 d −1 . Finally, we remove data points where the day-time and night-time partitioning results in GPP estimates that differ by more than 3 g C m −2 d −1 . Data processing methods are described at http://fluxnet.fluxdata.org/data/ fluxnet2015-dataset/data-processing/.
Total uncertainty in derived GPP includes gap-filling, partitioning uncertainty, random uncertainty and threshold friction velocity uncertainty [64]. Following Joiner et al. [33], we compute uncertainties in 8-day GPP flux tower-derived data (σ GPPtot ) according to: This formulation is based on a general fitting to data from the AmeriFlux network of flux measurements spanning the Americas. For the sample of 8-day data that meet all the criteria, this produces a mean uncertainty estimate of ∼0.86 g C m −2 d −1 . Assuming that random uncertainties can be reduced by averaging and that random uncertainty represents 50% of the total uncertainty consistent with Schaefer et al. [64], the uncertainties for monthly and daily data will be ∼0.68 and 1.8 g C m −2 d −1 , respectively.

Satellite SIF and SIF-Based Downscaled Products
We use Version 27 (v27) Level 3 (gridded) SIF derived from the Global Ozone Monitoring Experiment 2 (GOME-2) onboard the MetOp-A (GOME-2A) from February 2007-December 2016 [43,44]. MetOp-A is in a polar orbit with an Equator crossing time of 09:30. The native spatial resolution of the GOME-2A data from launch to the middle of July 2013 was 40 × 80 km. From that time until the present, the resolution has been 40 × 40 km. GOME-2A provides global coverage in approximately 1-3 days. Here, we use SIF derived from the far-red fitting window: 734-758 nm. SIF is a very small signal (typically < a few mW/m 2 /nm/sr). Spatial and temporal averaging is employed to increase SIF signal-to-noise ratios (SNRs).
The quality control filters for GOME-2 Level 3 SIF are described in Joiner et al. [44]. GOME-2 SIF data with effective cloud fractions <30% are gridded monthly at 0.5 • × 0.5 • resolution. This cloud screening is less rigorous as compared with that applied to MODIS data. GOME-2 retrievals of canopy-level SIF, which utilize the filling-in of solar Fraunhofer lines, are themselves not as greatly affected by clouds as are vegetation indices such as the NDVI [65]. However, top-of-atmosphere (TOA) SIF is affected by clouds in that it is sensitive to both the amount of solar radiation reaching the surface, as well as the amount of SIF that reaches the satellite. Therefore, TOA SIF should be reduced by approximately the square of the cloud transmittance as compared with the value in clear skies assuming that SIF yield remains constant with cloud conditions. Frankenberg et al. [66] showed that a substantial fraction of the canopy SIF can be detected by satellite for small to moderate cloud optical thicknesses.
Duveiller and Cescatti [67] implemented a method to downscale SIF with higher resolution MODIS data. Their approach used NDVI, as well as land surface temperature and evapotranspiration derived from MODIS to downregulate how much SIF is related to NDVI (or FAPAR). This product, referred to here as SIF*, is provided from 2007-2013 at a monthly timescale on a 0.05 • × 0.05 • grid. Version 1 (v1) SIF* uses GOME-2 SIF Version 25 as an input. It should be noted that since v1 was produced, there have been two updates to the GOME-2 SIF data culminating in v27. These later versions used the same first step retrievals, but addressed various biases as identified by Köhler et al. [58] by applying posterior adjustments [44]. To obtain a monthly full-day estimate of SIF* from the original observations near 09:30 local time, we first divide each monthly grid box value of SIF* by the monthly-averaged top-of-atmosphere PAR (PAR TOA ) at 09:30, then multiply by monthly-averaged daily PAR TOA [68].

State-of-the-Art Satellite-Derived GPP Estimates
As benchmarks for comparison, we use two state-of-the-art data-driven global GPP estimates. FLUXCOM-RS is based on the global upscaling of flux tower data using satellite remotely-sensed (RS) data (no meteorological data) with a machine learning approach [6,7,10]. The product is generated at 0.0833 • × 0.0833 • spatial and 8-day temporal resolution. It uses flux tower GPP with an ensemble of machine learning-based models and over 200 explanatory variables.
The Vegetation Photosynthesis Model (VPM) V20 [15,21,[69][70][71] is based on the LUE approach and utilizes many remotely-sensed and climate reanalysis datasets. Different from the MODIS MOD17 GPP product, it uses a linear transformation of the Enhanced Vegetation Index (EVI) as a proxy of FAPAR chl (i.e., FAPAR chl =1.25 [EVI-0.1]); the maximum LUE only differs for C3 and C4 plants, which is supported by a convergence of LUE when using this FAPAR chl [72]. VPM provides data at various resolutions. Here, we use monthly gridded data at 0.05 • × 0.05 • resolution.

Sampling the FLUXNET 2015 Datasets for Training and Evaluation
In this work, we examine how well global GPP estimated from flux towers can be explained using simple relationships with satellite-based observations. Here, we use flux tower-derived GPP estimates to both calibrate and evaluate a variety of models. We use half of the available observations for calibration (i.e., deriving slopes between various satellite-based products and GPP) and the other half (independent) for evaluation of the derived relationships. We split the data alphabetically, i.e., according to the flux site name (even-or odd-numbered flux tower sites). In this way, half the sites are used for training and the other half for independent evaluation. This sorting selects sites such that if there are two (or more) sites per country, one site will be used for training and the other for evaluation. It also eliminates the possibility of data from a single site being included in both training and evaluation samples. This leads to more independence between the training and evaluation datasets. We find that results are similar, although not identical, depending on which half is used for training and that all general conclusions hold no matter which half of the sites is used for training.
In order to test GPP models on a wide range of conditions and plant functional types, we use all flux tower sites except for 23 identified by Zhang et al. [21] as having incompatibility with MODIS pixels and a few more additional sites at coastlines where FLUXCOM-RS did not produce good values. Each site was checked visually with MODIS 5km data. Sites were retained if the land cover and vegetation condition within the flux tower footprint (assumed 250-m radius) was similar (≥80%) to that within the overlapping MODIS data. All analyses are conducted for years 2003-2014 when FLUXNET 2015 and high quality MODIS data from both the Terra and Aqua satellites are available.

Statistical Parameters Used for Evaluation of Models
We evaluate different models on the basis of various statistical parameters including the root-mean squared error (RMSE) of the fit to the flux tower-based GPP, the variance explained (r 2 ), and the bias.
As in Tramontana et al. [10], we use the model efficiency factor (MEF), a measure of how well our model reproduces independent data as compared with the mean value of observations, defined as: where y i and x i are the observed and fitted values, respectively, and y is the mean of y i . Values of MEF > 0 indicate that the model used is better than the mean, and a value of unity would indicate a perfect model. We also use the metric λ developed by Duveiller et al. [73], defined as: where This index is equal to the correlation coefficient r when r ≥ 0 and there is no additive or multiplicative bias and will be less than r in the presence of a bias. λ has the advantage of summarizing the agreement in a single metric irrespective of whether any disagreement is due to lack of correlation or to a bias.

Data Processing Flow
A basic flow diagram of the training and evaluation using the various datasets is shown in Figure 2. The details involved in each of the processing steps will be described in more detail below. The datasets, with spatial and temporal resolutions listed in Table 1, may be grouped into several distinct subsets: (1) reflectances used to compute VIs (MCD43); (2) radiation, TOA short-wave (SW TOA ); (3) FLUXNET 2015 eddy covariance GPP estimates, half used for training and the other half for evaluation; (4) SIF (used to delineate highly productive regions) or downscaled SIF (SIF* used as a GPP proxy); (5) independent satellite data-driven GPP estimates used as benchmarks (VPM and FLUXCOM-RS). Further details on other datasets examined are given in Appendix A.   MCD43D reflectances a daily 0.0083

MODIS
any of the above any of the above NDVI × SW TOA b,e any of the above any of the above FLUXNET 2015 GPP f daily, monthly site footprint a Used to compute FAPAR proxies such as NDVI, NIR V and linear combinations of channels; b used as a GPP proxy with optional LUE parameterization; c used to identify highly productive regions optionally; d used for benchmark evaluation; e computed using MCD43D reflectance dataset; f half of sites used for training, the other half used for independent evaluation.

Results and Discussion
Using the approach shown in Figure A1, we tested a variety of products and models using GPP derived from a consistent sample of eddy covariance sites to both train GPP satellite-driven models and evaluate them. A particular satellite data-driven model is given by GPP = bx, where a model parameter x, such as NDVI × SW TOA , is used to predict flux tower estimates of GPP and b is determined by fitting to a subset of flux tower data.
We investigated the impact of using different MODIS datasets in Appendix A. Generally, MCD43D provided superior results as compared to most other datasets. Our results in the remainder of this paper focus on the MCD43D reflectances that can be used at their highest resolution (∼1 km) or degraded to be comparable to the resolution of other datasets used for benchmarking.
We also tested the use of different radiation datasets in Appendix C. The use of top-of-atmosphere shortwave (SW TOA ), computed as being proportional to cosine of the solar zenith angle (SZA) integrated over a day, as compared with the satellite-derived all-sky SW produced the best results with Equation (2) where LUE was assumed constant. These results are consistent with those of Gitelson et al. [39] and Peng et al. [40], who used a limited number of flux tower sites. Those studies found that GPP did not follow PAR in , but rather that other factors affect GPP, and that the use of PAR pot reduces some of the variability resulting from the dependence of LUE on cloudiness. This would occur if LUE is not constant, but rather anti-correlated with PAR in (or correlated with cloudiness), such that various combinations of PAR in and LUE produce a relatively constant GPP over a range of cloud conditions and for a given fixed value of FAPAR chl . It is well known that LUE is higher in situations with more diffuse light (e.g., [36]) (i.e., cloudy conditions) as compared with clear-sky conditions. One exception is very low light conditions where LUE saturates.
For following comparisons between models and flux tower data, results are typically shown in terms of scatter diagrams or 2D histograms with accompanying statistics. For completeness, we also provide probability distribution functions in Appendix D. These show the degree to which the differences follow a normal distribution that is assumed in statistical significance tests. While the differences deviated somewhat from normal distributions, the number of samples was large enough in most cases that the significance tests (when p-values were very small) were assumed to be valid.

The Use of Downscaled SIF to Estimate GPP
Duveiller and Cescatti [67] showed that SIF* agreed better with flux tower GPP as compared with MOD17 GPP (see Appendix A for a description of MOD17). Here, we further evaluate SIF* by comparing it consistently with the results of our models. We first verified that all of our models also outperformed MOD17 GPP on eight-day to monthly time scales in all statistical measures. Note that because SIF depends upon incident radiation, we did not explicitly use PAR in SIF-based models (other than to normalize it appropriately as described above); here, we simply applied a single scaling factor to SIF* at monthly time-scales to estimate GPP globally. The assumption of a single scaling factor to relate SIF to GPP, at least encompassing many plant functional types, has been suggested [53,74]; however, the use of a single scaling factor was not generally supported by previous studies (e.g., [46]).
We examine monthly SIF* along with NDVI and NIR V results in more detail in Figure 3 by comparison with tower-derived GPP for years 2007-2013 when GOME-2 SIF v25 data were available. Note that the resolution for SIF* was 0.05 • , while the MCD43D reflectances had been resampled to 0.083 • ; note that the MCD43D results were degraded as compared with the ∼1 km 2 results in Table A5, which were much closer to the footprint of the eddy covariance technique [34]. While the positive SIF* bias was clearly seen at GPP values near zero, SIF* was shown to have a more dynamic range as compared with the NDVI -based GPP estimates; NDVI -based GPP estimates tended to saturate at around 10 g C m −2 d −1 ; while SIF* and NIR V -based estimates reached values close to 15 g C m −2 d −1 . SIF* appeared somewhat noisier than the NDVI -or NIR V -based GPP estimates. This was expected as the downscaled SIF was directly tied to the more coarsely-gridded GOME-2 SIF that was inherently noisy due to the very small SIF signal. Correlations were significantly different between all three models (i.e., null hypothesis rejected) with p-values < 0.0001.
The apparent noise in SIF* may also have resulted from the effects of vegetation structure on the escape of SIF through the canopy [75,76]. We attempted to account for the variations in canopy escape as suggested by Yang and van der Tol [76] using MODIS Band 2 (near infrared) reflectance for normalization. However, we did not find improvement in the global statistics. One explanation is that the approach may not work well when bare soil contributes significantly to the observed reflectance and/or because the wavelengths of the MODIS reflectances are not optimal for this purpose.
While the SIF* results were noisy as compared with reflectance-based approaches, we were encouraged by these initial SIF* results given that only a single scaling factor was applied. We expect improvements in subsequent versions of SIF* that will use more recent datasets for SIF, as well as the high spatial resolution variables employed for downscaling. In addition, we expect progress to be made in developing methods to account for the effects of vegetation structure on SIF escape from the canopy globally.  Figure 4 shows results at an eight-day temporal resolution and the highest spatial resolution used here (0.0083 • ) for two-and seven-band models where linear combinations of bands were used as FAPAR chl proxies according to Equation 7. Both band models provided improved dynamic range in estimated GPP as compared with the NDVI -based model, and the correlation improvements were statistically significant with p-values < 0.025. The seven-band and two-band models were not statistically distinguishable. All models underestimated GPP at the highest GPP values.

Use of SIF to Delineate Highly Productive Agricultural Areas
One vegetation type for which GPP was not well predicted by our global linear regression models is croplands with GPP > ∼15 (see Figures 3b and 4). An examination of these points shows that they are primarily from three sites in the U.S. Corn Belt (US-Ne1, US-Ne-2 and US-Ne3) that are frequently planted with maize, a C4 plant. Zhang et al. [21] used various datasets to compute grid box fractions of C4 plants (natural and agricultural) for the VPM and defined different maximum efficiencies for C3 and C4 plants within their LUE formulation. The datasets used to compute C4 fractions were static (i.e., no interannual variability) and at lower spatial resolution than the VPM datasets.
Here, we explore an alternative approach that uses satellite-based SIF to delineate highly productive regions. Guanter et al. [32] demonstrated that SIF was more highly correlated with GPP than EVI at flux tower sites located in the U.S. Corn Belt because EVI, like NDVI, displays saturation at high values. They also showed that data-driven models, such as the predecessors of FLUXCOM-RS, did not produce the very high values of GPP derived from flux towers in this region. Global maps of SIF show very high values in summer in these same regions, while NDVI and EVI do not [32]. Here, we test the use of SIF observations, albeit at low spatial resolution (0.5 • ), to identify these regions of high productivity not prominent in the NDVI that may be related to large fractions of highly productive crops that are not well captured by the reflectances alone.
Using collocated GPP, SIF and NDVI data, we found that the FLUXNET 2015 sites with high maximum GPP values that were not well fit using our single linear regression models displayed the following criteria of high NDVI and a large ratio of SIF to NDVI when normalized appropriately by PAR: specifically, NDVI t > 0.76 and SIF t /(NDVI t × cos (θ)) > 3.0, where t is the month with the highest climatological value of NDVI and θ is the average solar zenith angle for that month corresponding to the GOME-2 SIF observations. The climatological monthly means are computed over the years 2007-2016. Figure 5 shows a map of the areas detected with this approach. Adjusting the thresholds upwards resulted in somewhat less flagging over areas outside the U.S. Corn Belt (e.g., Eurasia) and vice versa. Additional flux tower sites in high productivity agricultural regions may help to further constrain the thresholds.
To improve GPP estimates with the simple linear regression approaches, we separated the flux tower data into two different samples, one meeting the above criteria for high GPP and SIF and the other not. We then performed separate linear fits to the GPP data for these two subsets. The designated high productivity sites included in the training set were US-Ne1, US-Ne3 and DK-Fou, and the only site in the validation dataset was US-Ne2. Table 2 shows the results of the validation from this dual fit approach, henceforth referred to as NDVI 2 or 7band 2 . The statistics shown here are for daily data. We note significant improvement in metrics with flux tower GPP with dual fit as compared with the single fit models (null hypothesis rejected). These results demonstrate a clear capability of the satellite data to estimate GPP down to a daily time scale. Note that for the band models, the coefficients for each band in the calculation of FAPAR chl proxies in Equation (7) were estimated using all flux tower training data. Only the slopes between GPP and the product of FAPAR and PAR proxies were computed using the two different samples.
Even though the results using SIF were positive, there is room for improvement. The use of a low-resolution static SIF dataset has obvious limitations. We do not account for the fact that only a fraction of our identified high productivity grid boxes was covered by highly productive crops. Improved SIF datasets are expected in the near future that will address some of these concerns as explained below.

Further Parameterization of LUE
Bauerle et al. [77] found that photoperiod explained more seasonal variation in leaf photosynthetic capacity (related to LUE) than temperature. Zhang et al. [78] also derived a seasonally-varying photosynthetic capacity from satellite SIF and other measurements and showed that using this seasonal variation within a canopy transport model improved estimates of GPP and LUE. Functions of NDVI have been used to account for these seasonal variations in LUE [29,30].
We tested various expressions of seasonal variation of LUE in an attempt to further improve our satellite-based GPP estimates. We found that a small portion of the remaining variance in LUE can best be modeled as a simple polynomial expression of NDVI. In this model, we have for each of the two separate fits within the dual fit approach, where FAPAR is an FAPAR proxy such as NDVI or multi-band proxies. Other LUE parameterizations tested included second order polynomials in SW TOA , SW TOA /max(SW TOA ), or NDVI/max(NDVI) for each location as proxies for photoperiod or products of these predictors. However, the straight NDVI parameterization provided the best results. Figure 6 shows results for our dual fit seven-band variable LUE (Equation (11)) models that we will refer to as FluxSat-7, the dual fit NDVI -based variable LUE model (FluxSat-N), as well as NIR V -based results at eight-day temporal resolution. At the 0.00833 • spatial resolution of MCD43D, both FluxSat-7 and FluxSat-N were more centered about the 1:1 line than the NIR V -based model, and correlation differences between FluxSat and the NIR V -based models were statistically significant with very low p-values. The FluxSat-7 correlation improvement over FluxSat-N was significant with a p-value of 0.05. Improvements were obtained with the NDVI-dependent LUE as compared with the results in Figure 4.

Comparison with Other Satellite Data-Driven GPP Estimates
Next, we compare FluxSat-7 and NIR V -based GPP with that from state-of-the-art data-driven FLUXCOM-RS and VPM. Figure 7 shows 2D histograms of FluxSat-7, NIR V -based and FLUXCOM-RS eight-day GPP estimates at 0.0833 • resolution versus independent FLUXNET 2015 data. At this spatial resolution, all three generally overestimated GPP at low values and underestimated at high values. FluxSat-7 had the best performance at the highest values of FLUXNET 2015 GPP. All correlations were statistically different (p-values < 0.0001). FluxSat-7 showed the highest overall correlations with respect to FLUXCOM 2015. As shown above, comparisons with FLUXCOM 2015 at the full spatial resolution of MCD43D (0.0083 • , Figure 6) provided superior results as compared with results at the spatial resolution of FLUXCOM-RS (0.083 • ).
We note that FLUXCOM-RS was not trained with FLUXNET 2015 data, but rather with a different processing of an overlapping subset of eddy covariance flux tower data from the La Thuile dataset [10]. This may explain the bias of FLUXCOM-RS with respect to FLUXNET 2015 that was not present with respect to the La Thuile training dataset [10]. The bias may be easily removed using a linear regression with respect to FLUXNET 2015. When this was done, we obtained statistics that were similar to the single fit models presented in Section 3.2 applied at the same spatial resolution; the highest GPP values from the US-Ne3 site were still not well matched with FLUXCOM-RS after this adjustment.   Figures 9 and 10 show 2D histograms similar to Figures 7 and 8, but this time of normalized GPP differences from the climatology (also known as anomalies or IAVs). To compute the normalized IAVs, the seasonal cycle of GPP was first removed, then the remaining differences were normalized by the range of observed climatological values from the FLUXNET 2015 data for each individual location (known as the min-max normalization). For example, a normalized value of 0.5 means that a positive difference is 50% of the range of observed GPP. If the min-max normalization was not performed, then analyses suggested that the models did not capture IAVs well. All models were able to reproduce GPP IAVs well. FluxSat-7 provided improved results as compared with the other models. This was the case particularly for the points with the largest IAVs (both positive and negative). All correlations were different with high statistical significance. Figure 11 examines the spatio-temporal variation in annual GPP produced by the different satellite data-driven models. Each point on the scatter diagrams represents a mean annual value for a particular site and year where only the independent sites (i.e., not used in training of either NIR V or FluxSat-7) are shown and all datasets are at 0.0833 • resolution. Only complete or nearly complete years (i.e., a few missing eight-day segments) are shown, and the same sampling was done for all datasets (i.e., if there was a missing data point for one dataset, it was eliminated for all datasets). The metrics were best for FluxSat-7, but the model differences were not statistically distinguishable in this case.

Comparisons across PFTs
Scatterplots of satellite data-driven GPP estimates (FluxSat-7, NIR V -based and FluxSat-N at the highest MCD43D resolution of 0.0083 • for eight-day data) versus flux-tower estimates are displayed for different PFTs in Figures 12 and 13. All three models showed good performance and relatively low biases across PFTs. On the whole, FluxSat-7 outperformed the other two models with higher correlations and slopes generally closer to unity. NIR V -based had the most dynamic range for deciduous broadleaf forest (DBF) + mixed forest (MF), followed by FluxSat-7; FluxSat-N showed the smallest range for this type. However, FluxSat-7 and FluxSat-N better reproduced the very high values of GPP found in some cropland sites as shown above. Figures 14 and 15 similarly compare IAVs across PFTs. Again, all three models effectively capture IAVs across PFTs. The models perform particularly well in capturing IAVs for savannas and grasslands. Upon further investigation of the EBF anomalies, we found two distinct clusters of points. The first cluster was for the Australian sites. The anomalies of these sites were well captured by the satellite data-driven models. The second cluster was for two European EBF sites (FR-Pue and IT-Cpz); anomalies from these sites in FLUXNET 2015 were not well captured by our satellite data-driven models.  Figure 16 shows globally mapped FluxSat-7 and FLUXCOM-RS GPP for two eight-day periods in 2007. As expected based on Guanter et al. [32], large differences are shown in the U.S. Corn Belt and other highly productive agricultural areas identified with the SIF data. GPP values near the peak of the northern hemisphere growing season (for the eight-day period starting at 193) are seen to be generally higher with FluxSat-7 as compared with FLUXCOM-RS; this may be related in part to the different training datasets used. Differences in the winter hemispheres are more similar. FluxSat-7 provided mostly higher values in the tropics as compared with FLUXCOM-RS.   To check our results in the tropics, we ran an experiment where half the sites were used for training, but the one tropical station with high GPP values from FluxSat-7, BR-Sa3, was withheld from the training set and was instead used for evaluation. Figure 18a shows that the high GPP values for this site produced by FluxSat-7 were supported by the flux tower values. This is an important finding because this region is frequently cloudy and satellite-derived GPP values have large uncertainties owing to few ground-based training sites. We similarly compare GPP values at four independent high latitude sites (above 60 • N) in Figure 18b where two sites were ENF and two were wetland. We see that both FluxSat-7 and VPM underestimated GPP at the higher values for ENF sites, while FluxSat-7 somewhat overestimated for wetland sites and lower GPP values at ENF sites. VPM had a more consistent underestimation

Comparison of Globally Mapped GPP
The annual GPP values from FluxSat-7, VPM and FLUXCOM-RS were 140.8, 125.0 (in agreement with that reported by Zhang et al. [21]) and 111.6 Pg C y −1 , respectively for 2007. If SIF was not used for delineation of high productivity areas, and a single fit to the flux tower data was used for all areas, then FluxSat-7 reduced to 139.4 Pg C y −1 . It has been noted that VPM has an underestimation for ENF and EBF that is possibly related to higher light use efficiency for diffuse radiation [21]. It is therefore not surprising to see higher values from FluxSat-7 as compared with VPM for those PFTs that occupy a significant fraction of the vegetated land surface.
FluxSat-7 is within the range of a compilation of observation-based, model, and hybrid results [1], but outside the high range of values from previously reported observation-based diagnostic models [79]. Discussing the evaluation of global annual GPP produced by Earth system models with an earlier version of FLUXCOM (that produced a value of 119 ± 6 Pg C y −1 ), Anav et al. [80] noted that higher estimates of 150-175 and 146 ± 19 Pg C y −1 were reported by Welp et al. [81] and Koffi et al. [82], respectively, though there were large uncertainties associated with those estimates. How our observation-based GPP values are interpreted within the context of the land sink and budget imbalance in global carbon budget estimates that rely upon dynamic global vegetation models [2] will be a topic for future investigation.

Conclusions
Here, we have presented a suite of satellite data-driven models based on the LUE concept to estimate GPP globally. The models were trained (or calibrated) using a subset of eddy covariance flux tower data and evaluated using a set of independent flux tower data. The models did not have any explicit dependence upon PFT; they were trained and evaluated with data across all PFTs with the exception of sites with high SIF relative to NDVI that were trained separately and generally coincided with high fractions of C4 maize crops. These areas represent only a few percent of the total land surface. Our best satellite data-driven models outperformed more complex state-of-the-art data-driven models that may or may not also incorporate ancillary information such as meteorological data.
We find that using SW TOA (similar to clear-sky SW radiation) with constant or slowly varying (on seasonal time scales) LUE effectively accounts for the dependence of LUE on incident (cloudy-sky) radiation; specifically, light use efficiency increases with cloudiness index or diffuse radiation fraction in such a way as to produce a relatively constant GPP over a wide range of cloud conditions, and our model effectively accounts for this. This may explain why our GPP estimates are higher than other satellite-based estimates, particularly in the cloudy tropics. Our high derived GPP values in the tropics agree well with the one available tropical Tier 1 FLUXNET 2015 site that has high values. Higher values of estimated GPP in the tropics and also in the extra-tropical Northern Hemisphere lead to higher annual estimated values of GPP (140.8 Pg C y −1 for 2007) with our best model. This is higher than other state-of-the-art satellite-based GPP estimates and on the high end of values from both prognostic and diagnostic models.
We applied a relatively simple adjustment to NDVI data to mitigate the NDVI offset. Models that use NDVI or the two NDVI bands have the advantage that they can be applied to satellite data that date back to the early 1980s from the Advanced Very-High Resolution Radiometer (AVHRR) such as from Tucker et al. [24] or to field-scale imagery from Landsat, Sentinel 2 or other commercial satellites. The linear combination of either two or seven bands provides more dynamic range than the NDVI that tends to saturate at high values.
We find that inclusion of an LUE parameterization in terms of a second order polynomial function of NDVI helps to further explain some variability in GPP. This formulation presumably accounts for variations in the maximum carboxylation rate tied to the photoperiod.
For available input VI-related satellite datasets, we find that better results are obtained when using BRDF-adjusted data as compared with non-adjusted data. We obtained improved results using NDVI computed from averaged reflectances or albedos rather than composited "best-value" NDVI data. We also confirm that in general, better statistical comparisons with the flux tower data were achieved when satellite data are used at high spatial resolution similar to the typical footprint of eddy covariance data (∼1 km 2 ) as compared with (5-8 km) 2 . In terms of temporal resolution, the best results were obtained at monthly temporal resolution, as expected. However, results at eight-day or daily resolution were only marginally degraded as compared with monthly averages.
The best results were obtained when our models took advantage of satellite-based SIF data to identify areas of high productivity. These models also outperformed the NIR V for estimation of global GPP. The results overall with downscaled SIF were encouraging given that only a single scaling factor was used and considering that the input datasets have since been updated. However, to be further improved, it is likely that variations in fluorescence escape from the canopy and fluorescence yield will have to be better accounted for.
The reflectance-based models were able to take advantage of high fidelity imager reflectance data with low noise and consequently high spatial resolution and frequent revisit. Our results suggest that machine learning or other similar approaches may benefit from the use of these data along with SIF. While our approach of using a static low resolution SIF dataset to detect high productivity areas is not optimal and can be improved, the number of grid boxes with high SIF relative to NDVI was only approximately 2-3% of all land grid boxes, excluding Antarctica. Therefore, small errors due to temporal variability or the fact that only a fraction of those grid boxes may contain high productivity crops are unlikely to have a significant impact on our global annual GPP estimates. The results obtained here may be further improved using higher spatial resolution SIF data ideally taken over several years, such as from the recently launched TROPOspsheric Monitoring Instrument (TROPOMI) with ground footprints ∼7 km × 3.5 km at nadir and complete daily global coverage [65]. Funding: This research was funded by NASA.

Acknowledgments:
We are grateful to EUMETSAT for providing the GOME-2 data and NASA for providing the MODIS, CERES and GMAOdatasets used in this study. Funding for this work was provided by the NASA Earth Science U.S. Participating Investigator program. We also thank Arlindo da Silva for helpful discussions. We thank all investigators who provided data as part of the FLUXNET 2015 Tier 1 dataset. This work used eddy covariance data acquired and shared by the FLUXNET community, including these networks: AmeriFlux, AfriFlux, AsiaFlux, CarboAfrica, CarboEuropeIP, CarboItaly, CarboMont, ChinaFlux, Fluxnet-Canada, GreenGrass, ICOS, KoFlux, LBA, NECC, OzFlux-TERN, TCOS-Siberia and USCCC. The ERA-Interim reanalysis data are provided by ECMWF and processed by LSCE. The FLUXNET eddy covariance data processing and harmonization were carried out by the European Fluxes Database Cluster, AmeriFlux Management Project the and Fluxdata project of FLUXNET, with the support of CDIACand ICOSEcosystem Thematic Center and the OzFlux, ChinaFlux and AsiaFlux offices.

Conflicts of Interest:
The authors declare no conflict of interest. The founding sponsors had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results.

Appendix A. Additional MODIS Reflectance and Higher Order Products Examined
In addition to MCD43D, we examined the following MODIS products that are summarized in Table A1: (1) daily Terra MODIS reflectances provided in the MOD09 climate modeling grid (CMG) (0.05 • latitude × 0.05 • longitude resolution) product [83][84][85][86] for MODIS Bands 1-7; these data have been adjusted to remove the effects of atmospheric gases and aerosol; data for each grid box were selected on the basis of the low solar zenith angle, minimum Band 3 (blue) reflectance and absence of cloud from Level-3 intermediate files; (2) daily Terra and Aqua MODIS MCD43C4 view angle-corrected nadir BRDF-adjusted reflectances (NBAR) on the CMG; MCD43C4 data were generated similar to MCD43D products described above [87]; (3) angle-adjusted reflectances from the multi-angle implementation of atmospheric correction (MAIAC) applied to Aqua and Terra MODIS (MCD19) and averaged daily on the CMG [88,89]; Collection 6 MCD19 data on the CMG over Australia are not yet available; (4) monthly NDVI composite (MOD13C2) on the CMG [90]; (5) The MCD15A2H Version 6 MODIS Level 4 fraction of photosynthetically-active radiation (FPAR), product; this is an eight-day composite dataset with 500-m resolution [91,92]; (6) the MOD17A2H GPP product is an eight-day composite at 500-m resolution based on the LUE concept [14,93]. We examined each of these datasets in terms of their use to predict GPP as derived from eddy covariance flux tower data, as will be explained in detail below. Figure A1 shows a flowchart similar to the one in the main manuscript, but containing all datasets examined in Table A1 in addition to the ones described in the main manuscript.  Figure A1. Flowchart of all datasets examined in training and evaluating satellite-based GPP models. The poorest results were obtained using either the MOD17A2H GPP product or the MCD15A2H FPAR (FAPAR) product that drives MOD17A2H. Noted problems in MOD17 GPP, including, but not limited to cloud contamination in MOD15, and proposed solutions have been documented [94][95][96][97]. We obtained significant improvements using the MOD13 composite NDVI product.
Further improvements were obtained when we processed reflectances using a mean compositing approach [98] as compared with picking a single observation based on a set of criteria, as is common in composite datasets. We found that results with MOD09 reflectances were highly dependent on the quality control filters that were selected. For example, excellent results were achieved when stringent quality control checks were applied including filters for clouds, cloud shadows and fires. However, these checks eliminated data completely from several flux tower sites. Results were degraded using only the basic cloud filtering that appears not to catch all cloud-contaminated pixels. Improvements were possible using MAIAC MCD19 data (where available) as an additional quality filter.
MAIAC daily results were much more sparse as compared with MCD43 products that are filled in with the 16-day window for data processing. Results with MAIAC and MCD43D angle adjusted reflectances were comparable at the same spatial resolution. The current MAIAC gridded data are at a lower resolution than MCD43D and do not yet have Australia included. We may expect MAIAC MCD19 to perform better at sites with a high probability of cloud and aerosol contamination, such as those in and around Amazonia. There are few of these sites available in FLUXNET 2015 Tier 1. After this initial evaluation, we decided to focus primarily on one reflectance product, MCD43D, as it provides straight-forward quality control and excellent, complete results in a high spatial resolution gridded format.

Appendix B. FLUXNET2015 Sites Used in This Study
Tables A2-A4 list the flux tower sites used in this study along with corresponding information.

Appendix C. Comparison of Results Using Different Radiation Input Datasets
Estimates of the global monthly SW flux at the surface under all-sky conditions are provided in the Clouds and the Earth's Radiant Energy System (CERES) Energy Balanced and Filled (EBAF) dataset at a 1 • × 1 • spatial resolution. The CERES temporal sampling is enhanced with the use of geostationary data. We used the Edition 2.8 CERES EBAF data.
We also used daily-averaged surface all-sky incoming shortwave flux (SWGDN), as well as clear-sky shortwave flux (SWGDNCLR) from the NASA Global Modeling and Assimilation Office (GMAO) Modern-Era Retrospective analysis for Research and Applications 2 (MERRA-2) data assimilation reanalysis system [99]. These are two-dimensional fields from the radiation diagnostics. In addition, we computed daily TOA SW flux (SW TOA ) as being proportional to the cosine of the solar zenith angle (SZA) integrated over a day.
To compare results with different radiation input datasets, we used monthly averaged GPP estimates provided in the FLUXNET 2015 dataset to train models of the form of Equation (2), where NDVI and NIR V computed from the MCD43D reflectances were used as simple proxies for FAPAR chl . Here, we assume that S (or LUE) is a constant. The NIR V is defined here as NIR V = (NDVI−0.08)×ρ 2 and is also evaluated without multiplication by radiation as in Badgley et al. [23].
We first compare results using two different radiation input datasets: (1) surface short-wave flux from CERES (SW CERES ) in all-sky conditions and (2) SW TOA . Note that corresponding PAR values are simply proportional to these quantities; the latter is an approximation of PAR pot , defined as clear-sky PAR at the top-of-canopy, to be used in Equation (3). The main difference between SW TOA and incident SW (or PAR) at top of canopy is the effect of the clouds. Table A5 shows that improved results are achieved using SW TOA as compared with all-sky SW CERES for both NDVI-and NIR V -based GPP estimates. These improvements are statistically significant. Similar improvements are obtained on daily results using MERRA-2 clear-sky SW (SWGDNCLR) as compared with all-sky estimates (SWGDN). We also compared the use of SWGDNCLR to SW TOA ; note that SWGDNCLR accounts for clear-sky atmospheric scattering and absorption (but not clouds), while SW TOA does not. Results using SW TOA or SWGDNCLR were similar. Finally, we note that the use of NDVI has a lower bias as compared with NDVI. Table A5. Statistical comparison between various monthly-averaged GPP estimates from MCD43D reflectances (0.00833 • resolution) at 64 unique independent flux tower sites from FLUXNET 2015 (models were trained using data from 64 other sites) with a total of 3702 individual collocated observations including the variance explained (r 2 ), bias (mean of flux tower GPP minus the satellite-derived GPP), root mean squared error (RMSE) of the model, model efficiency factor (MEF) (Equation (9)) and λ (Equation (10)

Appendix D. PDFs of Differences between Model-and Flux Tower-Derived GPP
For completeness, all figures in this Appendix relate to figures from the main manuscript, but show probability distribution functions of differences rather than scatter plots or 2D histograms.    Figure A2 (same data sample, eight-day at 0.00833 • resolution at 64 sites), but with data corresponding to Figure 6. Units of GPP are g C m −2 d −1 .   Figure A6. Similar to Figure A2, but with monthly model data at 0.05 • resolution corresponding to Figure 8. Units of GPP are g C m −2 d −1 .