Assimilation of Remotely-Sensed Leaf Area Index into a Dynamic Vegetation Model for Gross Primary Productivity Estimation

Quantitative estimation of the magnitude and variability of gross primary productivity (GPP) is required to study the carbon cycle of the terrestrial ecosystem. Using ecosystem models and remotely-sensed data is a practical method for accurately estimating GPP. This study presents a method for assimilating high-quality leaf area index (LAI) products retrieved from satellite data into a process-oriented Lund-Potsdam-Jena dynamic global vegetation model (LPJ-DGVM) to acquire accurate GPP. The assimilation methods, including the Ensemble Kalman Filter (EnKF) and a proper orthogonal decomposition (POD)-based ensemble four-dimensional (4D) variational assimilation method (PODEn4DVar), incorporate information provided by observations into the model to achieve a better agreement between the model-estimated and observed GPP. The LPJ-POD scheme performs better with a correlation coefficient of r = 0.923 and RMSD of 32.676 gC/m2/month compared with the LPJ-EnKF scheme (r = 0.887, RMSD = 38.531 gC/m2/month) and with no data assimilation (r = 0.840, RMSD = 45.410 gC/m2/month). Applying the PODEn4DVar method into LPJ-DGVM for simulating GPP in China shows that the annual amount of GPP in China varied between 5.92 PgC and 6.67 PgC during 2003–2012 with an annual mean of 6.35 PgC/yr. This study demonstrates that integrating remotely-sensed data with dynamic global vegetation models through data assimilation methods has potential in optimizing the simulation and that the LPJ-POD scheme shows better performance in improving GPP estimates, which can provide a favorable way for accurately estimating dynamics of ecosystems.


Introduction
Terrestrial gross primary productivity (GPP), the integral of organic carbon fixed through photosynthesis over all of the leaves, is the beginning of the biogeochemical cycle and a primary input for vegetation carbon [1,2].About half of GPP will be incorporated into new plant tissues (e.g., leaves, roots), and the remainder will be released back into the atmosphere [3,4].Therefore, quantitative estimates of the magnitude and variability of GPP can enhance further comprehension about the biogeochemical dynamics and the carbon cycle of terrestrial ecosystems.
Numerous ecosystem models, such as the Lund-Potsdam-Jena Dynamic Global Vegetation Model (LPJ-DGVM) [5], have been widely used as a practical method for quantifying temporal and spatial variation of GPP at regional, continental, and global levels.However, considering the spatial-temporal heterogeneity of ecosystem processes, it might be difficult to acquire accurate input parameters of models, thus possibly resulting in low-confidence output results [2,6].Moreover, the simulation errors caused by the uncertainties of models, both in process mechanisms and related parameters settings, can accumulate gradually and cause results deviating from the true values [7].Satellite remote sensing data could offer broad spatial-temporal distribution information on vegetation coverage, as well as the state variables of models (e.g., biomass, leaf area index) [8].Thus, it can be considered as a good option for integration with process-based models to reduce uncertainties in model estimates.Data assimilation (DA) is one technique widely used for combining observation data with process-based models and incorporates real-time data into models constantly to generate a better estimation [7], which can provide a favorable method for accurately analyzing dynamics of ecosystem, expending the use of observations in ecosystem studies.Through assimilation, observed variables are applied to adjust the model state variables; hence, several related variables will be optimized.
The ensemble Kalman Filter (EnKF) algorithm has gained wide attention owing to its simple conceptual formation, optimum performance, and easy implementation [9,10].Based on the Monte Carlo method, EnKF designs an ensemble of state variables, and the average of the ensemble can be used as the optimal estimation of state variables.Many studies have adopted EnKF for integrating remote sensing data with models in meteorological and hydrological fields [11][12][13][14].Wang et al. [15] assimilated observed soil moisture into the LPJ-DGVM by using the EnKF method to construct a data assimilation system for optimizing LPJ performance.In the field of land surface vegetation, Ju et al. [16] employed EnKF in optimizing model parameters that significantly improve the simulation of GPP, latent heat (LE), and sensible heat (SH), especially during dry periods.Quaife et al. [17] demonstrated that the assimilation of Moderate Resolution Imaging Spectroradiometer (MODIS) reflectance data into an ecosystem model performed better in improving the estimates of GPP and net ecosystem productivity (NEP).
As a sequential assimilation method, EnKF is naturally designed to deal only with sequential information; therefore, it lacks the temporal smoothness constraint included in the four-dimensional (4D) variational data assimilation (4DVar) method.Research on coupling 4DVar with EnKF has been conducted to fully utilize their respective advantages [18][19][20][21].Recently, a hybrid method known as the proper orthogonal decomposition (POD)-based ensemble 4D variational data assimilation method (PODEn4DVar), was proposed by Tian et al. [21].The PODEn4DVar method has the superiority that: (1) the dynamic constraint can use complete simulation equations; (2) it is capable of simultaneously assimilating observations at multiple times; (3) it can reflect the spatial structure of the 4D variables and characteristics of time evolution by using POD-transformed base vectors; and (4) this method does not require an integral adjoint model during the computational process, and it is easy to realize.The PODEn4DVar method can simplify the process of data assimilation and maintain most advantages of the traditional 4DVar method, which has shown considerable satisfactory performance within land data assimilation [22,23], Tan-Tracker joint data assimilation [24], and radar data assimilation [25].Considering that EnKF and PODEn4DVar each have unique features, each method was applied into LPJ-DGVM separately, and their respective performances were compared in this study.
In this study, we developed a GPP assimilation system for coupling the remotely-sensed leaf area index (LAI) with the LPJ-DGVM dynamic vegetation model, which can be used to optimize the GPP simulation at a given time.The specific objectives of this study were to: (1) apply two assimilation algorithms into the process-based model combined with remotely-sensed LAI data; (2) test and evaluate the ability of different assimilation algorithms within the LPJ-DA system by using six eddy-covariance flux tower sites, and select the optimal model for spatial GPP estimation; and (3) obtain the spatial distribution and calculate the total amount of terrestrial GPP in China during the period of 2003-2012.

Study Area
China has a vast territory covering cold, temperate, and tropical zones.Terrains are distributed in a ladder shape, descending from west to east.Mountainous regions and plateaus occupy a large area.With a remarkable monsoon climate, it manifests high-temperature and rainy weathers in summer, and cold and rainless weathers in winter.The contrast of the East Asian summer and winter monsoons, together with the high elevation of the Tibetan Plateau result in a unique terrestrial ecosystem.Over one half of the land area is covered by forests and grasslands.Forests cover about 18.9% of total land area and are located mainly in the south Yangtze River, large and small Xing'an Mountains, and Changbai Mountain to the northeast.Grasslands account for 31.8% and are mainly distributed in most regions of Northwest and North China in addition to the Southwest.

Flux Tower Measurement
We use the measurements from the FLUXNET2015 Dataset [26] for the verification of the estimations.The six flux tower sites used in this study include four vegetation types: evergreen broadleaf forest (EBF), evergreen needleleaf forest (ENF), mixed forest (MF), and grassland (GRA).Based on previous versions of the FLUXNET Dataset releases (FLUXNET Marconi Dataset 2007 and FLUXNET LaThuile Dataset 2007), the FLUXNET2015 Dataset includes several improvements in data quality control protocols and the data processing pipeline, including close interaction with tower teams to improve data quality, new methods for uncertainty quantification, and use of reanalysis data to fill long gaps of micrometeorological variable records.
Eddy covariance systems give the direct measurement of NEE, and the measured GPP is calculated by subtracting the respiration using a community standard method [27,28].However, uncertainties remain in the measured GPP values such as underestimation of the total ecosystem respiration (R eco ) at night, uncertainties of the gap-filled algorithm, and partitioning and random errors of flux measurements [29][30][31], which may increase the mismatch between measured GPP and simulated GPP.To verify the model results, the six eddy covariance towers including 17 site-years were selected to represent four vegetation types (Figure 1; Table 1).Detailed descriptions of sites information are available at [26].

Model-Driving Datasets
The input forcing data of LPJ-DGVM was obtained from continuous series of monthly climate data covering the period 1901-2012.This data, consisting of cloud cover, temperature, precipitation,

Model-Driving Datasets
The input forcing data of LPJ-DGVM was obtained from continuous series of monthly climate data covering the period 1901-2012.This data, consisting of cloud cover, temperature, precipitation, and wet day frequency on a 0.5 • × 0.5 • global grid [38], was provided by the Climatic Research Unit (CRU), University of East Anglia [39].Monthly data were linearly interpolated to obtain "quasi-daily" values and were used to simulate processes within LPJ-DGVM at daily time steps.An annual dataset of historical atmospheric CO 2 concentrations during 1901-2012 was obtained by combining ice-core measurements and atmospheric observations at the Mauna Loa Observatory in Hawaii [40,41].Soil texture data was obtained from the Food and Agriculture Organization of the United Nations (FAO) soil dataset [42].

Remote Sensing Observation
Estimation of LAI from remote sensing data is the feasible method for generating LAI products at regional and, in particular, global scales.The LAI products, such as the MODIS LAI product (MOD15) and the CYCLOPES LAI product, are almost spatially incomplete and temporally discontinuous owing to cloud contamination and are not accurate for certain vegetation types [43,44].The use of multi-temporal data could be a better strategy for addressing problem of information missing, which is a major feature of Global Land Surface Satellite (GLASS) products.The GLASS product, generated and released by the Center for Global Change Data Processing and Analysis of Beijing Normal University [45], has a temporal resolution of eight-day intervals and spans the period of 1981-2014.
In the GLASS LAI algorithm, general regression neural networks (GRNNs) are trained over the fused LAI from MODIS and CYCLOPES LAI products and pre-processed MODIS/AVHRR surface reflectance of BELMANIP sites.The LAI is retrieved from time-series reflectance data with the trained GRNNs.Direct validation and inter-comparative studies have demonstrated that GLASS LAI has significantly longer temporal ranges, higher spatio-temporal resolution, and improved accuracy compared with other products, such as MODIS LAI and CYCLOPES LAI [44,46].
The GLASS LAI product employed in this study spans the period 2003-2012 at 1 km and eight-day resolution in a sinusoidal projection system.For integrating into the data assimilation modeling framework, we processed the LAI for China and resampled it to 0.5 • × 0.5 • resolution for consistency with the climatic datasets applied within the LPJ model.

Data Assimilation Strategy
A data assimilation system is generally composed of a dynamic model, observation operator, assimilation algorithms, and experiment datasets, including forcing data, model parameters, and observations [7].In this study, we developed a LPJ-DA modeling framework by using LPJ-DGVM as the model operator to simulate energy and mass exchanges between the surface vegetation and land-atmosphere.To make observations consistent with the state physical and space variables, we linked assimilation algorithms and LPJ-DGVM.We used Equation (1) as an observation operator to convert the simulated foliage projective cover (FPC) state variable into the observed LAI variable when new observations became available.We applied two different assimilation algorithms, EnKF and PODEn4DVar, as discussed in Sections 3.2.1 and 3.2.2,respectively, to utilize the observation information of GLASS LAI [46], as discussed in Section 2.4.This measure was performed to update the FPC state variable so that the related variables (GPP in this study) can be obtained.A schematic diagram of the LPJ-DA modeling framework is shown in Figure 2.

Lund-Potsdam-Jena Dynamic Global Vegetation Model
The LPJ-DGVM [5] is a process-based model dealing with the exchange of carbon and water between terrestrial vegetation and land-atmosphere in a modular framework, and inherits many characteristics from the biome family of models [5,[47][48][49].The key features primarily involve feedback through canopy conductance between photosynthesis and water balance, and close coupling among these "fast" processes and other ecosystem processes, including resource competition and production, tissue turnover, growth, carbon cycle in soil and litter, and disturbance regime [5].For each grid cell, ten plant functional types (PFTs) were defined clearly in terms of FPC, including eight woody and two herbaceous types, with the potential occurrence determined by the bioclimatic limits as a standard for determining whether it can survive and/or regenerate in the simulation.Daily photosynthesis is a function of several abiotic factors including Absorbed Photosynthetically Active Radiation (APAR), day length, biochemical pathway (C3 or C4), atmospheric CO2 concentration, and temperature [5,50].For each PFT, the GPP is calculated by applying the coupled photosynthesis and water balance.The entire simulation is derived by monthly climate data, soil texture, and atmospheric CO2 concentration in every grid cell.
As an effective method for describing plant community structure, FPC is defined as the vertically-projected percentage cover of photosynthetic foliage of all strata [51], which has an essential relationship with canopy transpiration and radiation interception.Both LAI and FPC can be used to describe the leaf area, and based on the Lambert-Beer Law [52]:

Lund-Potsdam-Jena Dynamic Global Vegetation Model
The LPJ-DGVM [5] is a process-based model dealing with the exchange of carbon and water between terrestrial vegetation and land-atmosphere in a modular framework, and inherits many characteristics from the biome family of models [5,[47][48][49].The key features primarily involve feedback through canopy conductance between photosynthesis and water balance, and close coupling among these "fast" processes and other ecosystem processes, including resource competition and production, tissue turnover, growth, carbon cycle in soil and litter, and disturbance regime [5].For each grid cell, ten plant functional types (PFTs) were defined clearly in terms of FPC, including eight woody and two herbaceous types, with the potential occurrence determined by the bioclimatic limits as a standard for determining whether it can survive and/or regenerate in the simulation.Daily photosynthesis is a function of several abiotic factors including Absorbed Photosynthetically Active Radiation (APAR), day length, biochemical pathway (C 3 or C 4 ), atmospheric CO 2 concentration, and temperature [5,50].For each PFT, the GPP is calculated by applying the coupled photosynthesis and water balance.The entire simulation is derived by monthly climate data, soil texture, and atmospheric CO 2 concentration in every grid cell.
As an effective method for describing plant community structure, FPC is defined as the vertically-projected percentage cover of photosynthetic foliage of all strata [51], which has an essential relationship with canopy transpiration and radiation interception.Both LAI and FPC can be used to describe the leaf area, and based on the Lambert-Beer Law [52]: where ind represents average individual values.In a grid cell, the entire FPC is calculated by multiplying FPC ind by the crown area (CA) and population density (P): CA is obtained by applying the stem diameter (D) based on the inversion of Reinecke's rule [53]: where k allom1 = 100, k rp ≈ 1.6, and CA is prescribed below a constant CA max (15 m 2 ).The daily GPP for each PFT is calculated by combining the FPC from Equation ( 2), the daily phenological status (have leaves or not), and climate data.The entire calculation of GPP flux is performed for the middle day of each month, and the daily values are derived by interpolation of the mid-month values.In our scheme, the state variable is FPC, and the observed variable is LAI obtained from remote sensing data (GLASS LAI).We selected Equation (1) as the observation operator for integrating the simulation and observation to update the FPC values.

Ensemble Kalman Filter
The EnKF [10, [54][55][56], connects the ensemble forecast with the Kalman filter (KF), which makes an implicit assumption that the observations are related to the true state X t (e.g., FPC at time t) by: where Y is the m-dimensional observation vector, ε is the observation error with average ε and covariances R = (ε − ε)(ε − ε) T , T refers to transposed matrix, and H is the observation operator.
Assuming the availability of observation data at time t + 1, the state variable of each corresponding ensemble is updated as: where the superscripts f and a refer to the prior (forecast) and posterior (analysis/updated) estimates, respectively; i represents each individual ensemble number; P f t+1 represents the error covariance of the state variable.
Since the observation vector Y has a higher dimension (m), EnKF was found to be inefficient when we calculated the inverse matrix shown in Equation (5).To address this issue, a modified inverse calculation was proposed by Liang et al. [57] based on the Sherman-Morrison-Woodbury formula [58]: that is, HP f t+1 H T ≡ ZZ T .Since the total number of disturbing ensembles of state variable N is generally not very high, calculation of the inverse matrix of I + Z T R −1 t Z T is more efficient.As a result, this method can significantly reduce the amount of calculation and enhance the operation efficiency.

POD-Based Ensemble 4D Variational Assimilation Method
On the basis of POD and ensemble forecasting techniques, Tian et al. [21] proposed a hybrid method: the POD-based ensemble 4-D variational assimilation method (PODEn4DVar).This method uses linear combinations of the transformed model perturbations (MPs) and observation perturbations (OPs) to express the optimal MP and its OP, respectively.Moreover, this method can assimilate multiple-time observation data and update its background error covariance by integrating the advantages of variational methods and evolving forecast ensembles.By minimizing the incremental format of the cost function in the 4DVar algorithm, an optimal increment of initial condition can be achieved at the initial time by: where x = x − x b , y = y (x ) = y(x) − y(x b ), y obs = y obs − y(x b ), and y = H M t 0 →t k (x) .Here, the b represents the background field, index k stands for the observation time, P b is the background error covariance, and obs denotes observation.
Applying the POD transformation, the optimal solution can be expressed as: The background error covariance P b can be calculated from: where N is the size of the ensemble.By substituting Equations ( 9) and (10) into the Equation ( 8), we obtain another cost function with control variables β instead of x : marking , the incremental analysis x a is expressed by: therefore, the final analysis x a can be calculated as: To ameliorate the spurious long-range correlations between observation locations and model variables resulting from the ensemble-based method, a localization technique is applied to the calculations process [59].The final analysis is rewritten as: where d h and d v are the horizontal and vertical distances between the spatial positions of state and observed variables, respectively; and d h,0 and d v,0 are the horizontal and vertical covariance localization Schur radii, respectively.The filtering function C 0 is expressed as: where r is the radius of the filter.

Model Performance
Pearson's correlation coefficient (r), standard deviation (SD), root-mean-square difference (RMSD), and centered RMSD (CRMSD) are used to quantify model accuracy and agreement using six flux tower sites.In this study we consider the flux-derived GPP as the "ground truth", though we acknowledge that the flux tower measurements may contain instabilities in observation caused by the contamination of atmospheric and cloud, and instrument errors.
Here, GPP model and GPP EC represent the estimated and observed GPP values, respectively; GPP model and GPP EC are the mean values of estimated and observed GPP values, respectively; and n is the number of observations.The Taylor diagram [60] provides a concise and graphical statistical summary of r, CRMSD, and SD of the observed and estimated patterns on 2D plots.In this paper, we normalized SD and CRMSD (NSD and NCRMSD, respectively), to display and compare the performances of all flux towers in one diagram [61].
Here, r, NSD, and NCRMSD are not independent, but are related by the following formula: In the Taylor diagram, NSD is expressed as a radial distance from the origin and the correlation with tower-based data as an angle in the polar plot.The distance to the observation point is proportional to NCRMSD between the estimated and observed values.The simulated points having good agreement with observations lie nearest to the point identified as "observed" on the x-axis.This diagram is especially practical and visual in evaluating multiple aspects of various models.
In addition to the aforementioned indices, the relative RMSD (RMSD r ) is calculated as a proportion of the mean estimated value [62] as:

Accuracy Assessment of Simulated GPP
In this experiment, we assimilated ten years of GLASS LAI products into the LPJ-DGVM to update FPC beginning in 2003.The temporal trends of estimated and assimilated results of FPC at six sites are given in Figure 3, depicted as: FPC L (FPC simulated from LPJ-DGVM), FPC LE (FPC simulated from the assimilated LPJ-DGVM using the EnKF), and FPC LP (FPC simulated from the assimilated LPJ-DGVM using the PODEn4DVar).FPC G represented GLASS FPC (calculated from GLASS LAI products using Equation ( 1)).
Notably, the estimation of FPC with assimilation has great improvement in most sites.We can see that the simulated FPC with assimilation at Figure 3a obviously corrected the modelling tracks of the LPJ-DGVM, improving the original model which overestimates in peak value and slightly underestimates in winter.Figure 3b,c showed higher estimated FPC compared with FPC G , especially at the peak of the growing season, but the assimilated FPC followed GLASS FPC closer than the simulated FPC and can capture the variation of FPC exactly at these two sites.As shown in Figure 3d-f, the simulated results without assimilation clearly overestimated after the peak of the growing season, while results with assimilation significant decreased margin between estimated and remote-sensed values.It is noticeable that FPC LP performed much better than FPC L and FPC LE , and the estimated results are quite consistent with the FPC G , especially for the growing season of forests (Figure 3a-c).The seasonal variations of estimated and observed GPP at six sites in China during 2003-2010 are given in Figure 4a-f.GPPEC, GPPL, GPPLE, and GPPLP were the abbreviations of GPP from eddycovariance observations and Lund-Potsdam-Jena (LPJ), LPJ-Ensemble Kalman Filter (LPJ-EnKF), and LPJ-proper orthogonal decomposition (LPJ-POD) schemes.The results showed that schemes with assimilation can exactly retrieve GPPEC by integrating remote-sensed information.At the CN-Din and CN-Qia sites (Figure 4a,b), good agreement was shown for both GPPLE and GPPLP with GPPEC.Both tracked the seasonal dynamics of GPPEC well but underestimated the value when it captured the peak of the growing season in 2004.At the CN-Qia site, there was an apparently underestimation for GPPL, but was improved greatly with the PODEn4DVar method.At the CN-Cha site (Figure 4c), GPPLP was much closer to the observed values at the peak of the growing season.However, data assimilation did not show obvious advantages at the CN-Dan site (Figure 4d), with three schemes having higher GPP values than observations.This was consistent with results shown in Figure 3d, with all schemes displaying higher FPC than FPCG.The estimated GPP showed later initiation, but captured the cessation of the growing season for the CN-HaM site (Figure 4e).GPPL overestimated GPP to a large extent after August for the CN-Cng site (Figure 4f), indicating that assimilation can accurately retrieve GPP by adding remotely sensed LAI.Before and after the peak of the growing season at the CN-Cng site, all of the estimated GPP values were significantly overestimated.The GPPLE values were very close to GPPL at several site-years (e.g., CN-Dan 2004, CN-Cng 2008), which indicated that the EnKF The seasonal variations of estimated and observed GPP at six sites in China during 2003-2010 are given in Figure 4a-f.GPP EC , GPP L , GPP LE , and GPP LP were the abbreviations of GPP from eddy-covariance observations and Lund-Potsdam-Jena (LPJ), LPJ-Ensemble Kalman Filter (LPJ-EnKF), and LPJ-proper orthogonal decomposition (LPJ-POD) schemes.The results showed that schemes with assimilation can exactly retrieve GPP EC by integrating remote-sensed information.At the CN-Din and CN-Qia sites (Figure 4a,b), good agreement was shown for both GPP LE and GPP LP with GPP EC .Both tracked the seasonal dynamics of GPP EC well but underestimated the value when it captured the peak of the growing season in 2004.At the CN-Qia site, there was an apparently underestimation for GPP L , but was improved greatly with the PODEn4DVar method.At the CN-Cha site (Figure 4c), GPP LP was much closer to the observed values at the peak of the growing season.However, data assimilation did not show obvious advantages at the CN-Dan site (Figure 4d), with three schemes having higher GPP values than observations.This was consistent with results shown in Figure 3d, with all schemes displaying higher FPC than FPC G .The estimated GPP showed later initiation, but captured the cessation of the growing season for the CN-HaM site (Figure 4e).GPP L overestimated GPP to a large extent after August for the CN-Cng site (Figure 4f), indicating that assimilation can accurately retrieve GPP by adding remotely sensed LAI.Before and after the peak of the growing season at the CN-Cng site, all of the estimated GPP values were significantly overestimated.The GPP LE values were very close to GPP L at several site-years (e.g., CN-Dan 2004, CN-Cng 2008), which indicated that the EnKF method did not make significant improvements.However, the GPP LP for these site-years showed greater consistency with GPP EC and were significantly closer to the observed GPP.Points lying on the red solid line indicated estimated GPP having variations similar to those of the observations.The position of each point on the two plots quantified the agreement between simulated GPP and observations.Sites with large or small NSD showed inconsistency between simulated and observed GPP.Therefore, we considered that the points between two dashed lines (with 0.7≤ ≤1.3) had similar variability with the observations, and we further grouped them into three levels based on the correlation values: I (0.7 < r ≤ 0.8), II (0.8 < r ≤ 0.9), III (0.9 < r ≤ 1.0).All of these points showed better performance than other points (with NSD < 0.7 or NSD > 1.3 and r ≤ 0.7), and the higher correlation coefficient presented better consistency with the observations.As shown in Figure 5 the GPP values with data assimilation behaved better in terms of correlation, especially for III (0.9 < r ≤ 1.0) with correlations of 7/17 (seven points meet the requirement in 17 site-years) for both of two assimilation methods.For all site-year, the total proportion within the scope of 0.7 ≤ ≤ 1.3 and r > 0.7 increased by 87.5% from both GPPL (8/17) to GPPLE (15/17),  Sites with large or small NSD showed inconsistency between simulated and observed GPP.Therefore, we considered that the points between two dashed lines (with 0.7 ≤ NSD ≤ 1.3) had similar variability with the observations, and we further grouped them into three levels based on the correlation values: I (0.7 < r ≤ 0.8), II (0.8 < r ≤ 0.9), III (0.9 < r ≤ 1.0).All of these points showed better performance than other points (with NSD < 0.7 or NSD > 1.3 and r ≤ 0.7), and the higher correlation coefficient presented better consistency with the observations.As shown in Figure 5 the GPP values with data assimilation behaved better in terms of correlation, especially for III (0.9 < r ≤ 1.0) with correlations of 7/17 (seven points meet the requirement in 17 site-years) for both of two assimilation methods.For all site-year, the total proportion within the scope of 0.7 ≤ NSD ≤ 1.3 and r > 0.7 increased by 87.5% from both GPP L (8/17) to GPP LE (15/17), and GPP LP (15/17).GPP LP performed better with higher proportion (8/17) than GPP L (4/17) and GPP LE (4/17) for II (0.8 < r ≤ 0.9).Moreover, the correlation of certain site-years (e.g., CN-Din) was substantially improved by more than 0.2 when using the PODEn4DVar method, and exhibited stronger correlations (generally r > 0.85) with the site observations than the values from LPJ and LPJ-EnKF schemes (Table 2).For four PFTs, a similar situation is shown in Figure 5b.The results with data assimilation generally agreed well with observations, with higher correlation and NSD closer to 1. Point A (PFT: EBF), apparently, had been greatly improved by using assimilation method (improving r by 0.141 and 0.249 for GPP LE and GPP LP , respectively, and NSD getting closer to 1) (Figure 5b, Table 3).Point D (PFT: GRA, GPP L ) did not exist between the two dashed lines owing to its higher NSD, which increased the distance from the red solid line, but it had been improved by PODEn4DVar method with a higher correlation and lower RMSD.Overall, GPP LP had a higher proportion and higher correlation with 0.7 ≤ NSD ≤ 1.3; all of the PFTs were present between the dashed lines.This showed obvious improvement in all statistical indices over values without assimilation, especially for evergreen needleleaf forest and mixed forest, with r > 0.94 (Table 3).In general, the PODEn4DVar method performed slighter better with r = 0.923 and RMSD = 32.676gC/m 2 /month than LPJ-DGVM (r = 0.840, RMSD = 45.410gC/m 2 /month) and LPJ-EnKF (r = 0.887, RMSD = 38.531gC/m 2 /month) schemes (Figure 6).It can be concluded that PODEn4DVar had the best advantage for improving model estimations when integrated in dynamic vegetation models.(a-c) Scatterplot of observed versus estimated monthly gross primary productivity (GPP) at the six sites listed in Table 1.

China's Terrestrial GPP
The spatial distributions of terrestrial GPP in China over the period 2003-2012 were generated from our LPJ-POD data assimilation system, and their averages are given in Figure 7a (pixel size 0.5° × 0.5°).Similar distribution trends were found from 2003 to 2012, with a decreasing gradient from the southeast toward the northwest regions, as suggested in previous studies [63,64].The highest GPP estimate (>2000 gC/m 2 /yr) appeared in the Yunnan province, Southeastern Tibet, south of East and South China, mainly with a distribution of evergreen and mixed forests.Warm temperature, plentiful precipitation, and sufficient radiation in these regions provided an advantageous environment for vegetation growth.The lowest GPP (<700 gC/m 2 /yr) was concentrated in large portions of Tibet and Northwest China and in central regions of Inner Mongolia province, which is partially covered by ice, bare soil, and rock in a relatively cold climate with scarce precipitation [65].
The SD was applied to indicate inter-annual variability of the GPP during the period 2003-2012 (Figure 7b).Relatively high SD appeared in most of North China, the northeast area of Northwest China, north of East China, and south of Southwest China, indicating greater volatility in these areas.

China's Terrestrial GPP
The spatial distributions of terrestrial GPP in China over the period 2003-2012 were generated from our LPJ-POD data assimilation system, and their averages are given in Figure 7a (pixel size 0.5 • × 0.5 • ).Similar distribution trends were found from 2003 to 2012, with a decreasing gradient from the southeast toward the northwest regions, as suggested in previous studies [63,64].The highest GPP estimate (>2000 gC/m 2 /yr) appeared in the Yunnan province, Southeastern Tibet, south of East and South China, mainly with a distribution of evergreen and mixed forests.Warm temperature, plentiful precipitation, and sufficient radiation in these regions provided an advantageous environment for vegetation growth.The lowest GPP (<700 gC/m 2 /yr) was concentrated in large portions of Tibet and Northwest China and in central regions of Inner Mongolia province, which is partially covered by ice, bare soil, and rock in a relatively cold climate with scarce precipitation [65].
The SD was applied to indicate inter-annual variability of the GPP during the period 2003-2012 (Figure 7b).Relatively high SD appeared in most of North China, the northeast area of Northwest China, north of East China, and south of Southwest China, indicating greater volatility in these areas.
The LPJ-POD scheme determined that the average annual GPP in China from 2003 to 2012 was about 6.35 PgC/yr, with estimates ranging from 5.92 PgC/yr in 2003 (Figure 8a) to 6.67 PgC/yr in 2012 (Figure 8b).The GPP in 2003, 2006, and 2007 showed relatively low values, at 5.92 PgC/yr, 6.15 PgC/yr, and 6.18 PgC/yr, respectively.Drought conditions occurred frequently and became a major limitation to plant growth and development, especially the drought stress that occurred during the summer of the three years [66][67][68].In spring 2003, a rainfall shortage caused droughts on a large scale in the northeast.Sustained high temperature and diminished rainfall caused a severe drought in South China during the summer and autumn periods of 2003, and a wide range of regional drought severely affected the photosynthetic activities of plants, thus affecting ecosystem productivity [66].Compared with the drought years of China, 2012 was a year with less meteorological drought, although the northern regions of Central China and East China experienced moderate or severe droughts during May and June [69].
vegetation growth.The lowest GPP (<700 gC/m 2 /yr) was concentrated in large portions of Tibet and Northwest China and in central regions of Inner Mongolia province, which is partially covered by ice, bare soil, and rock in a relatively cold climate with scarce precipitation [65].
The SD was applied to indicate inter-annual variability of the GPP during the period 2003-2012 (Figure 7b).Relatively high SD appeared in most of North China, the northeast area of Northwest China, north of East China, and south of Southwest China, indicating greater volatility in these areas.

Discussion
Ecosystem models can provide a description of the complex vegetation dynamics of the ecosystem, but only by a simplified presentation of the actual situation.There are many approximations, assumptions, and uncertainties that will increase the uncertainties in the terrestrial GPP estimates.Data assimilation methods have been used to improve the performance of the processed model and reduce errors in model simulation [15][16][17][70][71][72].Most studies regarding assimilating observation data into the ecosystem models mainly focused on the evaluation of the EnKF approach for ecological forecasting.Though EnKF can give an optimum estimate of GPP at the next moment of model simulations, it does not provide an analytical solution of overall balance.Other assimilation algorithms, like 4DVar, may be adopted so that the entire model trajectory will be affected by the observations and an analytical solution of overall balance will be obtained [18].To make productivity estimates more precise, a combination of different assimilation approaches should be considered.The PODEn4DVar adopted by this paper complements the advantages and disadvantages of EnKF and 4DVar well, which can simultaneously assimilate multi-time and multisource observations, providing the overall balance analysis solution and the error estimates with flow-dependence [21].The results of statistical analysis indicate that the simulated GPP from the LPJ-POD scheme has been improved significantly compared with the LPJ and LPJ-EnKF schemes.This study is generally encouraging for the further application of the PODEn4DVar method integrating with dynamic vegetation models, to reduce uncertainty of terrestrial carbon flux estimates at different scales.
The performance of data assimilation schemes may also be affected by the quality of observation data.Currently, most studies have investigated the capability of assimilating MODIS LAI product

Discussion
Ecosystem models can provide a description of the complex vegetation dynamics of the ecosystem, but only by a simplified presentation of the actual situation.There are many approximations, assumptions, and uncertainties that will increase the uncertainties in the terrestrial GPP estimates.Data assimilation methods have been used to improve the performance of the processed model and reduce errors in model simulation [15][16][17][70][71][72].Most studies regarding assimilating observation data into the ecosystem models mainly focused on the evaluation of the EnKF approach for ecological forecasting.Though EnKF can give an optimum estimate of GPP at the next moment of model simulations, it does not provide an analytical solution of overall balance.Other assimilation algorithms, like 4DVar, may be adopted so that the entire model trajectory will be affected by the observations and an analytical solution of overall balance will be obtained [18].To make productivity estimates more precise, a combination of different assimilation approaches should be considered.The PODEn4DVar adopted by this paper complements the advantages and disadvantages of EnKF and 4DVar well, which can simultaneously assimilate multi-time and multi-source observations, providing the overall balance analysis solution and the error estimates with flow-dependence [21].The results of statistical analysis indicate that the simulated GPP from the LPJ-POD scheme has been improved significantly compared with the LPJ and LPJ-EnKF schemes.This study is generally encouraging for the further application of the PODEn4DVar method integrating with dynamic vegetation models, to reduce uncertainty of terrestrial carbon flux estimates at different scales.
The performance of data assimilation schemes may also be affected by the quality of observation data.Currently, most studies have investigated the capability of assimilating MODIS LAI product into the process-based models [72][73][74].However, the MODIS LAI product lacks adequate stable profiles, especially for the growing seasons [44].Zhao et al. [72] have adopted a Savizky-Golay filter to resolve the discontinuous performance of MODIS LAI arising from snow cover, cloud, and sensor errors during the crop growth stages, but the filter has brought new errors which will decrease the higher LAI and create abnormal lower values when involving the universal low LAI values.Xiao et al. [44] proposed a method for retrieving a long time series GLASS LAI product from multi-temporal data.The retrieval algorithm used annual observations to generate the LAI annual profile, which can remove abrupt spikes and dips effectively and produce temporally continuous LAI product.Therefore, we assimilated high-quality GLASS LAI sequentially into the LPJ model to renew the FPC that is related to photosynthesis.The better performance for GLASS LAI than the MODIS LAI were validated through the quantitative comparison at site of different biomes [44].
Though our LPJ-POD scheme successfully improved the modeling accuracy, and explicitly described the spatial distributions of GPP in China during the period 2003-2012, there were still some uncertainties.For six selected sites, overestimated and underestimated GPP remained at a few site-years.This discrepancy also raised the inaccuracy of the temporal and spatial features of the simulated GPP at the regional scale.The uncertainty was partly owing to the model structure.First, one limitation is that many critical parameters provided as default values in LPJ will induce uncertainty in simulating ecosystem carbon fluxes [75][76][77].For each grid cell, the LPJ-DGVM simulated potential vegetation distribution according to several bioclimatic limits [5], which would bring deviations if simulated vegetation compositions are different from actual composition of the PFTs in the cell.For example, the potential vegetation types simulated by the LPJ-DGVM of the CN-Dan (FPT: GRA) are C 3 perennial grass and boreal broadleaved summergreen tree.The simulated GPP of CN-Dan showed sizable higher values due to a large percentage of C 3 perennial grass was simulated as summergreen tree.Moreover, it is indicated that default leaf longevity may cause significant differences of terrestrial carbon flux estimation, and other parameters, such as turnover times of the leaf, sapwood, and fine root, also need to be revised to reduce uncertainties [75].Second, previous research has found that plant photosynthesis and runoff are particularly sensitive to air temperature, which will be affected by the interpolation of monthly temperature to "quasi-daily" values [78,79].Third, empirical or simplified algorithms were adopted in the LPJ model to simulate the ecosystem carbon cycle, which also induced uncertainties in model outputs [5].Considering the poor performances of the LPJ model in calculating the water cycle process, it is necessary to assimilate soil moisture into the model to constrain some model parameters such as sensible and latent heat fluxes [15] and, hence, decrease the mismatch between simulation results and observations.
Our GPP estimates also have other sources of uncertainty.Though GLASS LAI is widely used for modeling research, it cannot be the same as the ground-measured LAI.Moreover, the LPJ model is usually influenced by uncertainty of several crucial parameters, such as soil texture and atmospheric driving data, including temperature, precipitation, and cloud cover [7].These errors could propagate to the final GPP values and eventually cause deviation of the model estimates from the observed data.To optimize the model outputs, incorporating the remote sensing model into the LPJ-POD scheme by calibrating sensitive parameters of the LPJ model [71,80] will be also considered in further work.
In this study, GPP simulated by the optimal scheme (LPJ-POD) in China was comparable to previous studies.At the national scale, the GPP estimates exhibited strong discrepancies from various ecosystem models (Table 4; [63,64,[81][82][83][84]).A higher terrestrial annual productivity estimate of 12.26 PgC in China was obtained from a remote sensing model [83], which is almost the equivalent of twice the GPP estimates of this study and Li et al. [63].Liu et al. [82] constructed a daily driving Terrestrial Ecosystem Production process model (TEPC) and gave a relative higher vegetation production of 7.356 PgC/yr in China based on the Boreal Ecosystem Productivity Simulator (BEPS) and the bio-geochemical model (Biome-BGC).Based on the analysis of published terrestrial productivity in China, lower productivity focused mostly on Carnegie-Ames-Stanford Approach (CASA), BIOME-BGC and BEPS, and the results revealed that the averaged GPP in China was 5.656 Pg/yr according to the 16 studies on terrestrial productivity of China [85], which is 10% lower than GPP estimates in our study.Different model performance of GPP estimates in China owing to different definition of vegetation dynamic, parameter settings, and simulation algorithms.Our developed LPJ-POD system provides an effective way to constrain parameter values and makes model outputs closer to the observation data, which can be a comparative method when estimating China's annual GPP.

Conclusions
In this study, we developed a GPP assimilation system based on EnKF and PODEn4DVar algorithms.Eddy covariance flux datasets of six sites provided by FLUXNET2015 were adopted for validating the performances of GPP values estimated by two assimilation methods (GPP LE , GPP LP ) and no assimilation method (GPP L ).The optimal method was applied to assess the spatial distribution of GPP values in China during the period 2003-2012.Results showed that the EnKF and PODEn4DVar methods were reasonable for application to the LPJ model, showing better correlation and smaller RMSD.The LPJ-POD scheme performed better, with an r of 0.923 and RMSD of 32.676 gC/m 2 /month compared with the LPJ-EnKF scheme values of r = 0.887, 38.531 gC/m 2 /month and the LPJ scheme with r = 0.840, 45.410 gC/m 2 /month.According to the results of the optimal method, the LPJ-POD scheme, the annual GPP values in China from 2003 to 2012 varied between 5.92 PgC/yr (2003) and 6.67 PgC/yr (2012) with a mean of 6.35 PgC/yr.This study demonstrated that integrating remotely-sensed data with dynamic global vegetation models through data assimilation methods has potential for simulation optimization and that the LPJ-POD scheme showed better performance in improving GPP estimations.However, our developed LPJ-POD scheme still has uncertainties both in the magnitude and spatial distribution of the GPP at the national scale.To improve process-based models and obtain better simulated results, integrating with a remote sensing model, such as CASA, to calibrate ecophysiological parameters of the LPJ model is a further step for future study.

Figure 3 .
Figure 3. (a-f) Comparison of simulated eight-day FPC (FPC L , FPC LE , and FPC LP ) and GLASS FPC (FPC G , calculated from GLASS LAI).FPC L represents FPC simulated from LPJ-DGVM, FPC LE and FPC LP represent FPC obtained from the assimilated LPJ-DGVM using the EnKF and PODEn4DVar, respectively, and FPC G represents GLASS FPC.

Figure 5
Figure 5 quantifies the similarity between simulated and observed GPP in terms of their correlation, normalized centered RMSD, and relative amplitude of variations represented by normalized SDs by withholding 17 site-years (No. 1 to No. 17, Figure 5a) and four PFTs (A to D, Figure 5b) including EBF, ENF, MF, and GRA.Only the site-years or PFTs with p-values <0.05 were considered.An ideal model manifested as NSD of 1.0 and r of 1.0.The points between the two black dashed lines in the figure represented the site-year or PFTs with 0.7 ≤ ≤ 1.3.The excluded siteyears were CN-Din 2003 (No. 1, GPPL, p-value > 0.05) and CN-Dan 2005 (No. 11, NSD > 2.0).Points lying on the red solid line indicated estimated GPP having variations similar to those of the observations.The position of each point on the two plots quantified the agreement between simulated GPP and observations.Sites with large or small NSD showed inconsistency between simulated and observed GPP.Therefore, we considered that the points between two dashed lines (with 0.7≤ ≤1.3) had similar

Figure 5
Figure 5 quantifies the similarity between simulated and observed GPP in terms of their correlation, normalized centered RMSD, and relative amplitude of variations represented by normalized SDs by withholding 17 site-years (No. 1 to No. 17, Figure 5a) and four PFTs (A to D, Figure 5b) including EBF, ENF, MF, and GRA.Only the site-years or PFTs with p-values <0.05 were considered.An ideal model manifested as NSD of 1.0 and r of 1.0.The points between the two black dashed lines in the figure represented the site-year or PFTs with 0.7 ≤ NSD ≤ 1.3.The excluded site-years were CN-Din 2003 (No. 1, GPP L , p-value > 0.05) and CN-Dan 2005 (No. 11, NSD > 2.0).Points lying on the red solid line indicated estimated GPP having variations similar to those of the observations.The position of each point on the two plots quantified the agreement between simulated GPP and observations.Sites with large or small NSD showed inconsistency between simulated and observed GPP.Therefore, we considered that the points between two dashed lines (with 0.7 ≤ NSD ≤ 1.3) had similar variability with the observations, and we further grouped them into three levels based on the correlation values: I (0.7 < r ≤ 0.8), II (0.8 < r ≤ 0.9), III (0.9 < r ≤ 1.0).All of these points showed better performance than other points (with NSD < 0.7 or NSD > 1.3 and r ≤ 0.7), and the higher correlation coefficient presented better consistency with the observations.

Figure 6 .
Figure 6.(a-c) Scatterplot of observed versus estimated monthly gross primary productivity (GPP) at the six sites listed in Table1.

Figure 6 .
Figure 6.(a-c) Scatterplot of observed versus estimated monthly gross primary productivity (GPP) at the six sites listed in Table1.

Table 1 .
Description of flux tower sites used for gross primary productivity (GPP) validation in this study.

Table 1 .
Description of flux tower sites used for gross primary productivity (GPP)

validation in this study. No. Site Code Site Name Latitude ( • N) Longitude ( • E) IGBP a Years Used Reference
a IGBP: vegetation type at site.EBF: evergreen broadleaf forest; ENF: evergreen needleleaf forest; MF: mixed forest; GRA: grassland.

Table 2 .
Accuracy of model-estimated gross primary productivity (GPP) from three models by withholding site-year.Bold numbers represent the optimal results among GPP L , GPP LE and GPP LP .

Table 3 .
Accuracy of model-estimated gross primary productivity (GPP) from three models by withholding plant functional type (PFT).Bold numbers represent the optimal results among GPP L , GPP LE and GPP LP .

Table 4 .
Estimation of total gross primary productivity (GPP) values in China with different terrestrial models.