A 30 + Year AVHRR LAI and FAPAR Climate Data Record : Algorithm Description and Validation

Inland surface models, which are used to evaluate the role of vegetation in the context of global climate change and variability, LAI and FAPAR play a key role, specifically with respect to the carbon and water cycles. The AVHRR-based LAI/FAPAR dataset offers daily temporal resolution, an improvement over previous products. This climate data record is based on a carefully calibrated and corrected land surface reflectance dataset to provide a high-quality, consistent time-series suitable for climate studies. It spans from mid-1981 to the present. Further, this operational dataset is available in near real-time allowing use for monitoring purposes. The algorithm relies on artificial neural networks calibrated using the MODIS LAI/FAPAR dataset. Evaluation based on cross-comparison with MODIS products and in situ data show the dataset is consistent and reliable with overall uncertainties of 1.03 and 0.15 for LAI and FAPAR, respectively. However, a clear saturation effect is observed in the broadleaf forest biomes with high LAI (>4.5) and FAPAR (>0.8) values.


Introduction
Leaf Area Index (LAI) and Fraction of Absorbed Photosynthetically Active Radiation (FAPAR) are key vegetation biophysical variables.the Global Terrestrial Observing System [1,2] has defined these variables as follow: (i) "the LAI (m 2 /m 2 ) is geometrically defined as the total one-sided area of photosynthetic tissue per unit ground surface area"; and (ii) "the FAPAR is a primary variable controlling the photosynthetic activity of plants, and therefore constitutes an indicator of the presence and productivity of live vegetation, as well as of the intensity of the terrestrial carbon sink".These two biophysical variables are recognized by the Global Climate Observing System (GCOS) as Essential Climate Variables (ECVs, [3]).ECVs are defined as measurements of atmosphere, oceans, or land that are technically and economically feasible for systematic observation and have a high impact on the requirements of the United Nations Framework Convention on Climate Change (UNFCCC) and the Intergovernmental Panel on Climate Change (IPCC).The concept of ECVs includes a wide panel of terrestrial variables.Vegetation processes incorporated in models (i.e., photosynthesis, transpiration, carbon assimilation, and respiration) are strongly driven by the surface of the plant in contact with atmosphere.In land surface models, which are used to evaluate the role of vegetation in the context of

AVH09 Surface Reflectance
The AVH09C1 surface reflectance product (named hereafter AVH09) is the core input of the AVH15 CDR.This is the product distributed as the AVHRR Surface Reflectance (SR) CDR by NOAA's NCEI.It corresponds to the surface reflectance (SR) measured in two wavelengths (red, 580-680 nm, and near-infrared, 725-1000 nm) and resample at the CMG spatial resolution (0.05 ˝).The SR is normalized for BRDF effects (nadir view and 45 ˝solar zenith angle) using the VJB approach [14].A full description of the product is given in the AVHRR Surface Reflectance Climate Algorithm Theoretical Basis Document (C-ATBD, [15]).The CDR distribution of AVH09 contains associated quality information such as indications of cloud, cloud-shadow, and snow-free conditions.NDVI (Normalized Difference Vegetation Index) was derived directly from the nadir-adjusted surface reflectance.
While the AVH09 product is a measurement of SR normalized to a constant 45 ˝solar zenith angle, the FAPAR varies according to solar zenith angle.To be consistent with the FAPAR definition used in comparable FAPAR products, such as the MODIS product (which is an instantaneous FAPAR at the time of the satellite overpass), FAPAR is derived from nadir-adjusted surface reflectance with the solar zenith angle at the time of the acquisition.The same VJB BRDF model performs this adjustment.
The VJB model is based on two coefficients (V and R), describing the 3D structure of the surface, that have been demonstrated to be well correlated to NDVI [14].Based on MODIS CMG data, the linear regression coefficients were retrieved globally over the land surface.The retrieval of the nadir-adjusted and variable solar zenith (θ s ) angle SR (ρ pO S , 0, 0q) from the AVH09 product is computed as in Equation (1), where ρ N p45, 0, 0q is the AVH09 SR and V and R are the two coefficients of the VJB model.F1 and F2 are two kernels defined by [16] and [17], respectively, that both depend on illumination-view geometry.

Reference LAI/FAPAR
The MCD15A2 (denoted MCD15 hereafter) is used as the reference LAI/FAPAR dataset.It is a global LAI and FAPAR product.MCD15 is used as a LAI/FAPAR reference to calibrate the AVH15 algorithm.MCD15 is composited every eight days at 1-km resolution on a sinusoidal grid.It is produced on near real-time during the entire MODIS era (2000 to present).The main algorithm is based on lookup tables (LUT) simulated from a 3D radiative transfer model [18].Red and NIR (Near infrared) atmospherically-corrected MODIS reflectance [19] and the corresponding illumination-view geometry are used as inputs of the LUT.The output is the mean LAI and FAPAR values computed over the set of acceptable LUT elements for which simulated and MODIS surface reflectances agree within specified level of (model and measurement) uncertainties.When the main algorithm fails, a backup algorithm based on NDVI relationships, calibrated over the same radiative transfer model simulations is used [20].Retrievals from both algorithms were used.LAI and FAPAR are first produced daily.Then, the eight-day composite corresponds to the values of the product when the maximum FAPAR value within the eight day period is observed.Note that no LAI and FAPAR values are retrieved over bare or very sparsely vegetated area, permanent ice or snow area, permanent wetland, urban area, or water bodies.

Land Cover Map
Land cover classification, used to stratify the outputs, follows the 1981-1994 IGBP (International Geosphere-Biosphere Program) map [21].To avoid inconsistent land cover changes, the same classification is used for the entire dataset.Moreover, to reduce spatial discontinuity, the number of classes was reduced from 17 to 6 (see Table 1).A global map depicting the land cover categories used is shown in Figure 1.classes was reduced from 17 to 6 (see Table 1).A global map depicting the land cover categories used is shown in Figure 1.[21].Labels were simplified according to Table 1.

Calibration and Validation Sites
For calibration and validation purposes, a set of globally-distributed sites were selected.Two networks covering most land biomes were used: the BELMANIP2 network for calibration and the DIRECT network for validation.Figure 2 illustrates the global coverage and distribution of sites from both networks.
The BELMANIP2 (BEnchmark Land Multisite ANalysis and Intercomparison of Products 2, updated version of BELMANIP1, [22]) network was created using sites from existing experimental networks (FLUXNET, AERONET, VALERI, BigFoot, ...) completed with selected sites from the GLC2000 land cover map.The site selection was performed for each band of latitude (10° width) by keeping the same proportion of biome types within the selected sites as within the whole band of latitude.Attention was paid so that the sites were homogeneous over a 10 × 10 km 2 area, almost flat, and with a minimum proportion of urban area and permanent water bodies.The BELMANIP2 dataset includes 445 sites.[21].Labels were simplified according to Table 1.

Calibration and Validation Sites
For calibration and validation purposes, a set of globally-distributed sites were selected.Two networks covering most land biomes were used: the BELMANIP2 network for calibration and the DIRECT network for validation.Figure 2  Matchups with MCD15 and AVH09 were performed by selecting pixels whose center is the nearest to the BELMANIP2 and DIRECT site locations.We focused on coincident years between NOAA-16 and MODIS products: 2001-2007.Since MCD15 products are 1 km resolution, they were aggregated to 0.05° to be consistent with the spatial resolution of AVH09.Finally, for temporal consistency and to reduce the noise, AVH09 and MCD15 products were averaged monthly.The BELMANIP2 (BEnchmark Land Multisite ANalysis and Intercomparison of Products 2, updated version of BELMANIP1, [22]) network was created using sites from existing experimental networks (FLUXNET, AERONET, VALERI, BigFoot, ...) completed with selected sites from the GLC2000 land cover map.The site selection was performed for each band of latitude (10 ˝width) by keeping the same proportion of biome types within the selected sites as within the whole band of latitude.Attention was paid so that the sites were homogeneous over a 10 ˆ10 km 2 area, almost flat, and with a minimum proportion of urban area and permanent water bodies.The BELMANIP2 dataset includes 445 sites.
DIRECT is a collection of sites for which ground measurements have been collected [23] and processed according to the CEOS-LPV (Centre for Earth Observation Science-Land Product Validation) guidelines.This network is used for evaluation purposes since it contains sites different from those in BELMANIP2.There are currently 113 in situ data points available.Data were obtained through the On-Line Validation Exercise (OLIVE) a CEOS-LPV initiative for online validation of global land products [24].
Matchups with MCD15 and AVH09 were performed by selecting pixels whose center is the nearest to the BELMANIP2 and DIRECT site locations.We focused on coincident years between NOAA-16 and MODIS products: 2001-2007.Since MCD15 products are 1 km resolution, they were aggregated to 0.05 ˝to be consistent with the spatial resolution of AVH09.Finally, for temporal consistency and to reduce the noise, AVH09 and MCD15 products were averaged monthly.

Algorithm Definition and Calibration
The AVH15 algorithm is based on an artificial neural network (ANN) connecting LAI or FAPAR to AVH09 SR products for each of the five biomes (omitting the water class) as defined in Section 2. The ANN were trained using the MCD15 products as reference.The calibration procedure was done using data from BELMANIP2 as described in Section 2.4.We retained data for the period spanning from 2001 to 2007.

Input and Output Normalization
Normalization is achieved simply by scaling between the minimum and maximum values: the normalized values (Y) will vary between ´1 and +1, and are computed from the raw values (X), and the minimum (X min ) and maximum (X max ) values (Equation ( 2)).Minimum and maximum values of input and outputs retrieved with the calibration data are reported per class (i.e., for each ANN) in Table 2.

ANN Definition and Learning
ANN are connections of "neurons", associated by "synaptic" weights.Each neuron transforms the sum of the weighted signal from the previous neurons according to a given transfer function and a bias.The combination of sigmoid and linear functions is capable to fit any type of function [25].The ANN architecture finally retained followed the proposition of [6] to have as many intermediate neurons as inputs plus one.The ANN architecture is composed of (see Figure 3

ANN Definition and Learning
ANN are connections of "neurons", associated by ''synaptic'' weights.Each neuron transforms the sum of the weighted signal from the previous neurons according to a given transfer function and a bias.The combination of sigmoid and linear functions is capable to fit any type of function [25].
The ANN architecture finally retained followed the proposition of [6] to have as many intermediate neurons as inputs plus one.The ANN architecture is composed of (see Figure 3): -One input layer made of the four normalized inputs: AVH09 Red SR, AVH09 NIR SR, AVH13 NDVI and the cosine of the solar zenith angle, -One hidden layer with five neurons having hyperbolic tangent sigmoid transfer functions (Equation (3), where x is the neuron input and y the output), -One output layer via a linear transfer function, and For each of the 10 configurations (i.e., five classes × two output variables), 10 ANN were trained, resulting to 100 ANN in total.The selection of the 10 optimal ANN was based on the RMSD (root-mean-square deviation) between the outputs and the in situ data from DIRECT network sites.The learning process is composed of two elements: the training dataset that was described previously, and the learning rule that is now described.The Levenberg-Marquardt optimization algorithm [26] is used to adjust the synaptic weights and neuron bias to get the best agreement between the output simulated by the network and the corresponding value of MCD15 LAI/FAPAR.The initial values of the weights and biases were set to a random value between −1 and +1.For each of the 10 configurations (i.e., five classes ˆtwo output variables), 10 ANN were trained, resulting to 100 ANN in total.The selection of the 10 optimal ANN was based on the RMSD (root-mean-square deviation) between the outputs and the in situ data from DIRECT network sites.
The learning process is composed of two elements: the training dataset that was described previously, and the learning rule that is now described.The Levenberg-Marquardt optimization algorithm [26] is used to adjust the synaptic weights and neuron bias to get the best agreement between the output simulated by the network and the corresponding value of MCD15 LAI/FAPAR.The initial values of the weights and biases were set to a random value between ´1 and +1.

Domain Definition
The ANN are trained over a defined area and the output accuracy decreases considerably outside of the domain delimited by the learning dataset.Therefore, we defined an acceptable input domain for each class based on SR inputs used during ANN training.Since output data are not considered, no differentiation is necessary between LAI and FAPAR domain definitions.
The domain definition was defined based on calibration data from NOAA-16 AVHRR SR pixels overlapping spatially and temporally with BELMANIP2 sites during the 2001-2007 period.Figure 4 illustrates the density distribution of the learning dataset for each class and the associated domain is delimited by a polygon.Polygons were defined to include 97% of the density distribution pixels (0.01 resolution for red and NIR).With these selected pixels, a convex hull algorithm is used to define the exact envelope.Finally, the envelope was simplified using a recursive Douglas-Peucker polyline simplification algorithm [27].

Domain Definition
The ANN are trained over a defined area and the output accuracy decreases considerably outside of the domain delimited by the learning dataset.Therefore, we defined an acceptable input domain for each class based on SR inputs used during ANN training.Since output data are not considered, no differentiation is necessary between LAI and FAPAR domain definitions.
The domain definition was defined based on calibration data from NOAA-16 AVHRR SR pixels overlapping spatially and temporally with BELMANIP2 sites during the 2001-2007 period.Figure 4 illustrates the density distribution of the learning dataset for each class and the associated domain is delimited by a polygon.Polygons were defined to include 97% of the density distribution pixels (0.01 resolution for red and NIR).With these selected pixels, a convex hull algorithm is used to define the exact envelope.Finally, the envelope was simplified using a recursive Douglas-Peucker polyline simplification algorithm [27].Each processed pixel is compared to the polygon of the corresponding class.If it falls outside the polygon, an indicator flag is reported in the QA associated with the retrieval.

CDR Performance and Validation
CDR performance is evaluated by examining the theoretical performance of the ANN as presented in Figure 5. Three statistical metrics (Equations ( 4)-( 6), [28]) are calculated: bias, the rootmean-square deviation (RMSD), and the unbiased RMSD (ubRMSD).Each processed pixel is compared to the polygon of the corresponding class.If it falls outside the polygon, an indicator flag is reported in the QA associated with the retrieval.

CDR Performance and Validation
CDR performance is evaluated by examining the theoretical performance of the ANN as presented in Figure 5. Three statistical metrics (Equations ( 4)-( 6), [28]) are calculated: bias, the root-mean-square deviation (RMSD), and the unbiased RMSD (ubRMSD).
Bias "  4)-( 6); values in parenthesis correspond to metric values divided by the reference mean value.Refer to Table 1 for class biome definitions.4)-( 6); values in parenthesis correspond to metric values divided by the reference mean value.Refer to Table 1 for class biome definitions.
In Equations ( 4)-( 6), n is the number of valid samples used for the comparison and ε i is the estimate minus the reference.Relative values for the three metrics are computed by dividing the metric by the mean value of the reference observation.
The data shown in Figure 5 are based on DIRECT site locations, which were not used for the training but only for the evaluation of the ANN.The cumulative distribution functions demonstrate the good overall performance of the ANN to reproduce the range of value.However, a clear saturation effect is observed in the DBF and EBF classes with high LAI (>4.5) and FAPAR (>0.8) values.The good performance of non-broadleaf-forest classes (i.e., NLF, shrub, and CGNV) is consistent with the fact that the input data used for calibration (i.e., MCD15) have shown better performance on such land cover [6,29].
Another way to evaluate the reproducibility of the algorithm is to compare outputs from AVHRR sensors on board two different platforms.We selected NOAA-16 and NOAA-18, with an overlapping period from 2 July 2005 to 31 December 2006.The analysis is carried on BELMANIP2 and DIRECT sites.The resultant scatterplot is displayed in Figure 6.RMSD are 0.35 for LAI and 0.07 for FAPAR, which are comparatively better than the ANN performances computed in Figure 5l,x comparing AVHRR-derived to MODIS-derived LAI and FAPAR, respectively.Direct comparison of the satellite-derived products with in situ measurement is a key validation step.It has been defined as Stage 1 of the CEOS-LPV guideline.However, an important issue related to the validation of any coarse resolution retrieval is to link the pixel footprint to the spatial representativeness of the measurement [30].We relied on the work performed by Garrigues et al. [23] who contributed to the conception of the DIRECT network.They first gathered in situ measurements from many locations and scaled them up to a 3 km ˆ3 km area using medium-resolution (<100 m) data.To extend the measurement from a 3 km ˆ3 km area to a 0.05 ˝area, we applied a ratio calculated using the LAI/FAPAR MCD15 (1 km) retrieval aggregated over the measurement footprint (3 km ˆ3 km) and the one aggregated at 0.05 ˝.The outputs were finally compared to the AVH15 retrieval (Figure 7).Notice that these DIRECT sites are independent of the ANN learning process.LAI validation scatter plots were separated among the type of measurement: effective and true LAI, depending if the clumping factor is considered or not [6,31].By merging the two LAI measurements types, we obtained an uncertainty of 1.03.
The error budget is detailed in Table 3, which includes per-class Bias, ubRMSD and RMSD from the validation over DIRECT sites.The class "Grasslands and Croplands and Non-vegetated" (CGNV) is the most represented class of the in situ dataset, in terms of the largest N, and shows the best result for Effective LAI.The computed RMSD fit in the medium range of previously published.Camacho et al. [32] validated four global LAI/FAPAR products (GEOV1, CYCLOPE, MCD15, and GLOV2) and found RMSD ranges of (0.74-1.39) and (0.078-0.228) for effective LAI and FAPAR, respectively.

Conclusions
This paper presented the 30+ year AVH15 LAI/FAPAR CDR distributed by the NOAA's NCEI.The dataset is a global, daily, 0.05° (~5 km) spatial resolution, spanning from 1982 to 10 days from present.The algorithm relies on artificial neural networks which were calibrated per land cover type using the MODIS LAI/FAPAR product.Five land cover classes were included: evergreen broadleaf forest, deciduous broadleaf forest, needle leaf forest, shrubland, croplands and grasslands, and nonvegetated.

Conclusions
This paper presented the 30+ year AVH15 LAI/FAPAR CDR distributed by the NOAA's NCEI.The dataset is a global, daily, 0.05 ˝(~5 km) spatial resolution, spanning from 1982 to 10 days from present.The algorithm relies on artificial neural networks which were calibrated per land cover type using the MODIS LAI/FAPAR product.Five land cover classes were included: evergreen broadleaf forest, deciduous broadleaf forest, needle leaf forest, shrubland, croplands and grasslands, and non-vegetated.
Reproducibility of the algorithm was demonstrated to achieve overall uncertainties performance of 0.54 (27%) and 0.08 (16%), for LAI and FAPAR, respectively.However, per-biome scores are contrasted.Best performances were computed for cropland, grassland, and non-vegetated surfaces.The main limitation of the algorithm is the incapacity to reproduce variability of densely vegetated cover such as deciduous forest.A clear saturation of the algorithm was observed in these biomes for high LAI (>4.5) and FAPAR (>0.8) values.The outputs were also compared to in situ data.Overall uncertainties of 0.98 and 1.13 were found for LAI, depending on the accounting of the clumping index in the in situ processing.Overall uncertainties of 0.15 were found for FAPAR.The high values of these scores are also related to the complexity of producing a consistent measurement over a pixel footprint (i.e., 0.05 ˝ˆ0.05 ˝).
3. The algorithm has five steps: -Input normalization, -ANN execution (per class and variable), -Output normalization, -Classes fusion according to the IGBP land cover as defined in Section 2.3, and -Flagging pixels outside of the defined domain.
): -One input layer made of the four normalized inputs: AVH09 Red SR, AVH09 NIR SR, AVH13 NDVI and the cosine of the solar zenith angle, -One hidden layer with five neurons having hyperbolic tangent sigmoid transfer functions (Equation (3), where x is the neuron input and y the output), -One output layer via a linear transfer function, and -Normalized output.

Figure 3 .
Figure 3. Conceptual representation of the ANN, including normalization ("Norm") steps.Notice that the number of neurons correspond to the actual number of neurons.S and L stand for "sigmoid" and "linear" neurons, respectively.

Figure 3 .
Figure 3. Conceptual representation of the ANN, including normalization ("Norm") steps.Notice that the number of neurons correspond to the actual number of neurons.S and L stand for "sigmoid" and "linear" neurons, respectively.

Figure 4 .
Figure 4. (a-e) Domain definition for the five classes (red polygons) in the red/NIR surface reflectance space.Greyscale images represent the density function for each 0.01 surface reflectance (SR) bin (white = no value; black = high density).Refer to Table1for biome class definitions.The domain definition is calculated using AVH09C1 data acquired from 2001 to 2007.

Figure 4 .
Figure 4. (a-e) Domain definition for the five classes (red polygons) in the red/NIR surface reflectance space.Greyscale images represent the density function for each 0.01 surface reflectance (SR) bin (white = no value; black = high density).Refer to Table1for biome class definitions.The domain definition is calculated using AVH09C1 data acquired from 2001 to 2007.

Figure 5 .
Figure 5. (a-x) Theoretical performances of the LAI and FAPAR retrieval.Each row refers to a land cover class and the bottom row to all classes merged.On the first column, the cumulative distribution functions (CDF) of MCD15 training data and AVH15 LAI retrieval are shown.On the second column, scatter plots between LAI MCD15 (x-axis) and LAI AVH15 (y-axis) are displayed.The graphs are reproduced for FAPAR on the third and fourth column.Only data from DIRECT sites (not used for training) were plotted.Statistical metrics on the second and fourth subplot columns are defined in Equations (4)-(6); values in parenthesis correspond to metric values divided by the reference mean value.Refer to Table1for class biome definitions.

Figure 5 .
Figure 5. (a-x) Theoretical performances of the LAI and FAPAR retrieval.Each row refers to a land cover class and the bottom row to all classes merged.On the first column, the cumulative distribution functions (CDF) of MCD15 training data and AVH15 LAI retrieval are shown.On the second column, scatter plots between LAI MCD15 (x-axis) and LAI AVH15 (y-axis) are displayed.The graphs are reproduced for FAPAR on the third and fourth column.Only data from DIRECT sites (not used for training) were plotted.Statistical metrics on the second and fourth subplot columns are defined in Equations (4)-(6); values in parenthesis correspond to metric values divided by the reference mean value.Refer to Table1for class biome definitions.

Figure 7 .
Figure 7. (a-c) In situ validation over DIRECT sites.Ground measurement covers initially a footprint of 3 km × 3 km and were extrapolated to 0.05° using MCD15 products for direct comparison.Statistical metrics are defined in Equations (4)-(6); values in parenthesis correspond to metric values divided by the reference mean value.

Figure 7 .
Figure 7. (a-c) In situ validation over DIRECT sites.Ground measurement covers initially a footprint of 3 km × 3 km and were extrapolated to 0.05° using MCD15 products for direct comparison.Statistical metrics are defined in Equations (4)-(6); values in parenthesis correspond to metric values divided by the reference mean value.

Figure 7 .
Figure 7. (a-c) In situ validation over DIRECT sites.Ground measurement covers initially a footprint of 3 km ˆ3 km and were extrapolated to 0.05 ˝using MCD15 products for direct comparison.Statistical metrics are defined in Equations (4)-(6); values in parenthesis correspond to metric values divided by the reference mean value.

Table 1 .
Reclassification table of IGBP land cover classes.

Table 1 .
Reclassification table of IGBP land cover classes.

Table 2 .
Minimum and maximum values used for normalization.Refer to Table 1 for biome class definition.

Table 2 .
Minimum and maximum values used for normalization.Refer to Table1for biome class definition.

Table 3 .
Error budget based on in situ validation.N corresponds to the number of points used to compute the statistical metrics.

Table 3 .
Error budget based on in situ validation.N corresponds to the number of points used to compute the statistical metrics.

Table 3 .
Error budget based on in situ validation.N corresponds to the number of points used to compute the statistical metrics.