The Potential of High Resolution (5 m) RapidEye Optical Data to Estimate Above Ground Biomass at the National Level over Tanzania

In this paper, we review the potential of high resolution optical satellite data to reduce the significant investment in resources required for a national field survey for producing estimates of above ground biomass (AGB). We use 5 m resolution RapidEye optical data to support a country wide biomass inventory with the objective of bringing to the attention of the traditional forestry sector the advantages of integrating remote sensing data in the planning and execution of field data acquisition. We analysed the relationship between AGB estimates from a subset of the national survey field plot data collected by the Tanzania Forest Service, with a set of remote sensing biophysical parameters extracted from a sample of fine spatial (5 m) resolution RapidEye images using a regression estimator. We processed RapidEye data using image segmentation for 76 sample sites each of 20 km by 20 km (covering 2.3% of the land area of the country) to image objects of 1 ha. We extracted reflectance and texture information from those objects which overlapped with the field plot data and tested correlations between the two using four different models: Two models from inferential statistics and two models from machine learning. The best results were found using the random forests algorithm (R2 = 0.69). The most important explicative factor extracted from the remote sensing data was the shadow index, measuring the absorption of light in the visible bands. The model was then applied to all image objects on the RapidEye images to obtain AGB for each of the 76 sample sites, which were then interpolated to estimate the AGB stock at the national scale. Using the relative efficiency measure, we assessed the improvement that the introduction of remote sensing data brings to obtain an AGB estimate at the national level, with the same precision as the full survey. The improvement in the precision of the estimate (by reducing its variance) resulted in a relative efficiency of 3.2. 
This demonstrates that the introduction of remote sensing data at this fine resolution can substantially reduce the number of field plots required, in this case threefold.


Background to the Study-The REDD+ Initiative
Deforestation and forest degradation account for up to 12% of the global carbon emissions [1]. Most of these emissions (>80%) [2] originate from the tropics, where over 90% of humid forests are located. The United Nations Framework Convention on Climate Change (UNFCCC) aims to reduce such emissions via its REDD+ mechanism (Reducing Emissions from Deforestation and forest Degradation), by financially rewarding developing countries for reducing greenhouse gas (GHG) emissions from deforestation and degradation. Countries may claim financial incentives by submitting reports on carbon stock changes based on the area changes in forest cover (activity data) and density of forest carbon stock (emission factors) [2].
To estimate GHG emissions, countries should use the Intergovernmental Panel on Climate Change (IPCC) guidance and guidelines as adopted (i.e., [3]) or encouraged [4]. According to such guidelines, there are three approaches for assessing forest area changes (activity data) and three tiers for assessing emissions factors, each with increasing accuracy and precision.
Activity data must be assessed with satellite data at a minimum mapping unit (MMU) no larger than 1.0 ha. For area changes from forest to non-forest (i.e., deforestation), a forest is defined as having a minimum tree-crown cover of 10-30%, with trees that have reached, or could reach, a minimum height of 2-5 m at maturity in the same location. Changes within the same forest class are also accounted for when there is a reduction of the carbon stock (i.e., degradation). Challenges remain for consistently mapping activity data [5], notably forest degradation [6].
For emissions factors, three tiers are proposed: Tier 1 uses default above-ground biomass (AGB) values per ecological zone and continent. Tier 2 and Tier 3 are more elaborate, based on country-specific remote sensing or permanent sample-plot data. Although the Tier 1 approach has large uncertainties, for the time being, many developing countries have to use this representative carbon stock as they lack the financial and technical capacities to implement forest carbon stock inventories to derive Tier 2 and Tier 3 emission factors [7].
A key challenge, therefore, is to enable participating countries to obtain estimates of carbon stocks (AGB) in a cost effective way.
It is in this context that we place the current work. We wish to demonstrate that by combining optical RS data with a reduced number of field measurements, substantial savings can be made in making AGB estimates, both in time and resources. The work is undertaken not as a test case, where all data are available for a chosen test site, but as an 'operational' national study, where we work with the data available, both remote sensing and field measurements. In addressing a country wide survey, we must also deal with the different ecosystems and vegetation types that exist across Tanzania. In addition, the availability of satellite images and their utility varies across the country, restricted by available cloud free imagery and by vegetation conditions due to phenology. This paper attempts to address these problems.

Use of Remote Sensing Data to Map Above Ground Biomass
Mapping and monitoring of forest carbon stocks in the tropics has traditionally relied on field surveys that are often limited to particular ecosystems and locations [15]. Given the restrictions of the scope and of the costs of recurring surveys, many researchers have used available satellite and aerial remote sensing instead [16][17][18].
Earth Observation offers the unique opportunity to assess the state and changes of vegetation dynamics, providing data over large (e.g., continental scale) areas and long (e.g., a decade or more) periods, at spatial and temporal sampling frequencies that are potentially suitable for detecting key forest carbon stock distribution and changes [19,20].
Remote sensing-based datasets that are widely used are: (1) Light Detection and Ranging (LiDAR); (2) optical data at various spatial resolutions; and (3) Synthetic Aperture Radar (SAR). LiDAR data are able to map forest structure and height in three dimensions [21][22][23][24]. However, acquisition costs, processing, and difficulties in replicating in both space and time often render the use of LiDAR difficult. Optical data have been extensively used for AGB modelling with very high (less than 10 m) or coarse spatial resolution (circa 250 m) data. Generally, AGB estimation from optical remotely sensed data is carried out through regression models, based on relationships between AGB and reflectance [25].
Direct biomass estimations based on optical satellite data have been successfully carried out with very high resolution images (VHR) [26][27][28][29][30]. Despite good results, such approaches are currently applicable only to small sites, owing to the cost and limited availability of VHR data. At the regional level, coarse spatial resolution satellite imagery has been used in combination with field plot measurements and space-borne LiDAR data to derive wall-to-wall pan-tropical biomass maps at 1 km resolution for the year 2000 [31], and at 500 m resolution for 2007 [32]. Both studies use a similar approach, correlating tree height from Geoscience Laser Altimeter System (GLAS) data points to AGB values of (spatially overlapping) field inventory plots. These are then interpolated to deduce the AGB values for all non-overlapping GLAS data points.
Langner [33] reviewed these two pan-tropical biomass maps and a combination of both (combined dataset) in comparison with the IPCC Tier 1 values; the latter lead to an overestimation of AGB, as they often refer to intact forest sites [31]. The mean AGB values per pan-tropical ecological zone are consistent between the two datasets, with a coefficient of determination (R²) of 0.87. When restricting the regression to intact forest areas (IFA [34]), R² is even higher: 0.97. For non-IFA, R² is lower: 0.80. A comparison between these two pan-tropical maps generally shows higher AGB values from the 500 m map [32] than the 1 km map [31]. However, a detailed look at the data shows significant differences (both in spatial distribution and magnitude of AGB) between the maps, notably for tropical dry Africa. Nevertheless, Mitchard et al. [35] concluded that no single map was generally superior to the other despite the substantial differences.
The main limitations of optical satellite data are: (i) Cloud cover can significantly affect AGB estimates by limiting the number of satellite orbits acquired over a cloudy area; (ii) optical data become saturated with dense canopy and high biomass.
SAR (Synthetic Aperture Radar) data are considered reliable indicators of AGB [36][37][38][39]. Radar backscatter is highly sensitive to vegetation structural properties that are related to biomass [40], with C, L, and P bands giving higher backscatter from leaves, branches, and trunks, respectively [41][42][43]. The ability of SAR to penetrate cloud cover makes it particularly valuable in cloudy areas, such as the tropics; however, limitations occur in areas of high topography. L-band is the most widely used for forest AGB estimation. The physical properties of vegetation and woody components (comparable in size to the wavelengths used) influence the AGB-backscatter relationship. However, the L-band is only efficient up to around 100 t ha⁻¹, above which sensitivity to AGB is lost [44].
To overcome this limitation, data fusion of SAR and optical images has been proposed [45][46][47]. The combination of ALOS PALSAR 2 (Phased Array type L-band Synthetic Aperture Radar) and WorldView-2 data has been shown to be a promising approach to improve biomass estimation over a larger area in China [48]. Scientific consensus on the accuracy of SAR and optical data fusion is still not complete [49], and most of these tests are of limited area coverage, unlike this study, where a country wide estimate is reviewed.

The Current Study
We use field plot data collected by Tanzania's NAFORMA project (National Forestry Resources Monitoring and Assessment) in conjunction with RapidEye 5 m resolution optical satellite data. The choice of RapidEye data was seen as the opportunity to have high spatial resolution data over a large area. Previous studies linking optical data at this (national) scale rely on low (circa 250 m) to medium (circa 30 m) resolution data [50].
Nationwide field surveys for measuring biomass, such as that carried out by NAFORMA, are expensive and time consuming and contain errors, in some cases up to 30% [9]. We assess the possibility of combining a limited sample of the field survey data (circa 1000 field plots out of 32,000) to train and validate estimates of AGB using remotely sensed images through a regression estimator. We use readily available fine spatial (5 m) resolution RapidEye optical data systematically distributed across Tanzania in 76 sample sites, in conjunction with a corresponding subset of field plots extracted from the national survey carried out by NAFORMA. Different parametric and non-parametric approaches were applied to develop linear and nonlinear regressions between input variables and AGB, including the random forest (RF) and support vector machine (SVM) regression methods.
If, by combining these two data sets (through modelling and extrapolation), we can demonstrate the potential for providing estimates of AGB at the national level with less field data, significant financial savings could be achieved. This would allow more countries to rapidly provide data on AGB for REDD+ reporting [51]. The measure of the improvement of the combined approach compared to the field plots alone is provided by the relative efficiency [10,52], which gives the improvement in precision (i.e., reduction in variance). A relative efficiency of two would mean that a ground survey of X plots with remote sensing correction would give the same precision as a ground survey of 2X plots without remote sensing data.
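As a rough illustration of how the relative efficiency translates into field effort, the sketch below computes the ratio of the two variances and the equivalent number of plots. It assumes that the variance of the estimate scales as 1/n with the number of plots (simple random sampling); the function names and the example values are hypothetical and are not taken from the survey.

```python
import math

def relative_efficiency(var_field_only, var_with_rs):
    """Ratio of the variance without remote sensing to the variance with it."""
    return var_field_only / var_with_rs

def plots_needed(n_field_only, rel_eff):
    """Plots needed with remote sensing support to match the precision of
    n_field_only plots without it, assuming variance scales as 1/n."""
    return math.ceil(n_field_only / rel_eff)

# Illustrative numbers only: a relative efficiency of two halves the field effort.
re_example = relative_efficiency(var_field_only=8.0, var_with_rs=4.0)
n_example = plots_needed(2000, re_example)
```

Under this assumption, a relative efficiency close to three corresponds to roughly a threefold reduction in the number of plots for the same precision.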
The novelties of this study are in the combination of: (i) the use of 5 m optical data to support a country wide biomass inventory across all ecosystems; (ii) the use of object segmentation at this scale along with texture measures [26,27,53]; and (iii) the provision of traditional forestry sector agencies with guidelines for integrating remote sensing data in the planning of field data acquisition. If the remote sensing parameters and field data results on AGB can be shown to be correlated, one can obtain significant cost savings by reducing field data acquisition.
The work therefore addresses two related themes: (1) Specific results and quantification of increased precision from use of RapidEye optical data, and (2) a variety of practical, operational considerations and potential applications of RapidEye data.

Materials and Methods
The methodology was applied in the following steps: We determined the optimal minimum mapping unit (MMU) to employ when processing the remote sensing data and applied it to the RapidEye data after cloud and cloud shadow masking. Reflectance, texture, and indices were extracted from those image objects which corresponded spatially to the field data (geometric location), and ambiguous or erroneous field data were screened and removed. We then tested models so as to relate field data on AGB with the spectral and textural parameters extracted from satellite data. The best model was then applied to all the image objects in the full remote sensing data set, so as to obtain AGB estimates for each of the remote sensing sample sites. These results were then interpolated by direct expansion to the national level and compared to the results from the full field survey.
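The final step, direct expansion from the sample sites to the national level, can be sketched as follows. This is a minimal illustration that assumes equally weighted sample sites and uses the simple random sampling formula for the standard error, which may not match the estimator actually applied to the systematic sample; the function name and the site values are hypothetical, while the national area is the figure quoted in the study area description.

```python
import math
from statistics import mean, stdev

TANZANIA_AREA_KM2 = 945_090  # area quoted in the study area description

def direct_expansion(site_agb_t_per_ha, area_km2):
    """Return (total AGB in tonnes, standard error in tonnes).

    site_agb_t_per_ha: mean AGB density of each sample site (t/ha).
    Assumes equal site weights and a simple random sampling variance.
    """
    n = len(site_agb_t_per_ha)
    area_ha = area_km2 * 100.0  # 1 km2 = 100 ha
    total = area_ha * mean(site_agb_t_per_ha)
    std_err = area_ha * stdev(site_agb_t_per_ha) / math.sqrt(n)
    return total, std_err

# Hypothetical site means, for illustration only.
total_t, se_t = direct_expansion([12.0, 35.5, 48.0, 20.5], TANZANIA_AREA_KM2)
```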

Study Area
The study area covers the full land surface of mainland Tanzania, which is located on the east coast of Africa between parallels 1° S and 12° S and meridians 30° E and 40° E. Tanzania has an area of 945,090 km². Following the White classification scheme [54], the country is covered by five ecoregions (percentage area of the country is given in brackets): the Zambezian (55.7%), the Zanzibar-Inhambane (8.5%), the Somali Masai (26.5%), the Afromontane (4.8%), and the Victoria region, which refers to the area around Lake Victoria (4.5%). The predominant vegetation covers of the first four are, respectively, Miombo woodland, heavily disturbed coastal forests, arid shrub-lands, and dense forests (Figure 1) [55].
Almost all of the forests are naturally regenerated, with only 1% of the forest cover being considered as 'primary'. The national survey does not give AGB levels by ecological zone; however, estimates are given by vegetation type (Table 1), from which we see that the country average will be close to that of woodlands, which are predominant.

Figure 1. The ecological zone map of Tanzania [54] with the location of the field plots (not to scale) that coincide with the satellite data. An excerpt of the points for one cluster overlaid on a RapidEye image is shown to the right.

NAFORMA Field Data
The National Forestry Resources Monitoring and Assessment (NAFORMA) project was set up by the Ministry of Natural Resources of Tanzania (MNRT) and the United Nations Programme on Reducing Emissions from Deforestation and Forest Degradation (UN-REDD), with the technical support of FAO (the Food and Agriculture Organization) and Finland [56,57], for assessing and monitoring forest carbon pools compatible with REDD+ requirements. The national field data set collected by NAFORMA was designed to provide a national estimate of forest area, wood volume, and growing stock [58], along with a set of social indicators and a soil database. The survey design was stratified using potential biomass and accessibility to facilitate the field visits [59,60]. Some 32,000 plots were visited over a period of 3 years, collecting the biophysical and social variables. The final results were released in 2015 [55].
The sampling design produced clusters each of 10 plots, five plots running west-east and five north-south, with a distance of 250 m between plots (see inset in Figure 1). The size of the plots (15 m radius circle) was devised for efficiency in the field; however, this may not be optimum for linking with remote sensing data [10]. The field work started in May 2010 and was completed in June 2013. Data for over half the field plots were collected during the period of RapidEye satellite acquisition. The remaining images were acquired up to a maximum of 14 months after the field data (see Supplementary Materials). For each plot, data were collected on canopy cover, tree and shrub height, trunk diameter (DBH), species, dead wood, and soil. The tree measurements (DBH ≥ 1 cm) were conducted in concentric circular plots. Species, health, and diameter were measured for all trees, and height, stump diameter, and bole height were measured for tally trees (i.e., every 5th tree) [58,61]. We used the data from individual plots, rather than aggregating the data at the cluster scale, due to the observed heterogeneity between adjacent plots.
The biophysical data on basal area and tree height were then used by NAFORMA to calculate bole volume in a model and hence carbon biomass for each plot, from which national and regional level estimates were made. Allometric equations were used to calculate tree volume using the DBH and estimated height for each plot. Different models were used, three for commonly used plantation tree species and two generic models for the remaining natural species. To obtain AGB, the tree volume was multiplied by the wood specific density, which was given as 0.58 for forest species and 0.50 for woodlands [58]. While we use these data as our reference, we also need to be aware that the models used to calculate AGB from field data also suffer from bias [62].
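The conversion from volume to AGB described above amounts to a single multiplication, which the following minimal sketch illustrates. It assumes that the quoted densities are expressed in tonnes of dry matter per cubic metre, which the text does not state explicitly; the function and dictionary names are hypothetical, and the allometric volume models themselves are not reproduced.

```python
# Wood specific densities quoted in the text [58]; units of t per m3 are an assumption.
WOOD_DENSITY = {"forest": 0.58, "woodland": 0.50}

def agb_from_volume(volume_m3, vegetation_type):
    """Above ground biomass (t) from tree volume (m3) and vegetation type."""
    return volume_m3 * WOOD_DENSITY[vegetation_type]

# Illustrative value only: 100 m3 of woodland volume.
agb_example = agb_from_volume(100.0, "woodland")
```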
Through a collaborative agreement with the Tanzania Forest Service, our institution, the European Commission's Joint Research Centre (JRC), was given access to a subset of the NAFORMA field data. This subset consisted of those plots falling within 20 km by 20 km boxes centered on the 76 latitude-longitude confluence points (i.e., the intersections of integer degrees of latitude and longitude) in Tanzania, a total of around 1000 plots. These sites were chosen as they corresponded to the satellite image sample sites used for the Global Forest Resources Assessment Remote Sensing Survey 2010 carried out by the JRC and the United Nations Food and Agriculture Organization (FAO) [63]. Their geographical distribution means that the remote sensing data cover all the different types of ecosystems in the country.

RapidEye Satellite Data
To support the FAO FRA remote sensing survey, the European Union's GMES (Global Monitoring for Environment and Security) provided a RapidEye dataset [64] over the 76 confluence points in Tanzania. For providing continental estimates of forest change between 2000 and 2010, the FRA project, in conjunction with the JRC, used this systematic sample of satellite images across the globe, with samples based on the latitude-longitude confluence points (i.e., the intersections of integer degrees of latitude and longitude) [65]. These data were the only freely available 5 m resolution data set which covered Tanzania. Over 80% of the images were acquired in the second half of 2010, with the remaining images coming from the first quarter of 2011. On average, the image acquisition precedes the field data collection by 10 months (see Supplementary Materials Figure S1 for the image acquisition dates and their differences from the field data collection dates).
RapidEye data come from a constellation of five radiometrically inter-calibrated satellites [66], each with a swath width of approximately 500 km, acquiring data in 5 spectral bands (Table 2), from blue to near infra-red (including a 'red edge' band). The revisit period is nominally 1 day on request; however, the effective acquisition is limited by recording and receiving capacities, relatively high cloud cover over the target area, and priority demands for commercial acquisitions. The latter is especially true for Africa, where lower data availability can occur due to the demand for images over the European agricultural zone. The 'best' (cloud and artefact free) images from RapidEye satellites 3, 4, and 5 were selected and mosaicked using all bands to cover a 20 km by 20 km box around each confluence point (Figure 2).

Table 2. RapidEye spectral channels. Columns: Channel; Spectral Band Name; Spectral Range (nm).

Cloud and cloud shadow are masked out in the first stages of the pre-processing; image segments are then assigned AGB values using the model developed using random forest.

RapidEye Preprocessing-Radiometric Calibration
RapidEye level 3A data are provided as 5-band layer-stacked GeoTIFF files, with a nominal 5 m pixel size, stored as 16 bit data. The data were converted to at-sensor radiance (W m⁻² sr⁻¹ μm⁻¹), and the top of atmosphere (TOA) reflectance was then calculated using the local solar zenith angle and the sun irradiance supplied for each band in the metadata, together with published calibration coefficients [64]. To obtain at-sensor radiance, a scale factor is applied to the digital number (DN) of each pixel as follows:

$$L_\lambda = DN_\lambda \times \mathrm{ScaleFactor}(\lambda)$$

where ScaleFactor(λ) = 0.01. The TOA reflectance is calculated by:

$$\rho_\lambda = \frac{\pi \, L_\lambda \, d^2}{ESUN_\lambda \, \cos\theta_{SZ}}$$

where:
ρλ = TOA reflectance for band λ
Lλ = radiance for band λ
θSZ = local solar zenith angle
d = Earth-Sun distance in astronomical units, d = 1 − 0.01672 × cos(0.01745 × (0.9856 × (Julian day of image − 4)))
ESUNλ = mean exoatmospheric solar irradiance (W m⁻² μm⁻¹), which for channels 1 to 5 is 1997.8, 1863.5, 1560.4, 1395.0, and 1124.4, respectively.

The reflectance values, ranging between 0 and 1, are rescaled to 16 bit unsigned integers (0-10,000) with a linear multiplication factor of 10,000. The required formulas and parameters are taken from the literature [64]. We performed an 'evergreen forest normalization' [67,68], based on the theory of dark object subtraction [69], which reduces the variance in reflectance between images acquired at different dates and locations.
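The calibration chain described above (digital number to radiance, radiance to TOA reflectance, rescaling to 0-10,000) can be sketched in a few lines. This is a minimal per-pixel illustration using the constants quoted in the text; the function names are hypothetical, the TOA expression is the standard form consistent with the symbols defined above, and clamping the reflectance to the 0-1 range before rescaling is an assumption.

```python
import math

SCALE_FACTOR = 0.01  # DN to radiance (W m-2 sr-1 um-1), as quoted in the text
# Mean exoatmospheric solar irradiance per channel (W m-2 um-1), as quoted in the text
ESUN = {1: 1997.8, 2: 1863.5, 3: 1560.4, 4: 1395.0, 5: 1124.4}

def earth_sun_distance(julian_day):
    """Earth-Sun distance in astronomical units for a given day of year."""
    return 1 - 0.01672 * math.cos(0.01745 * (0.9856 * (julian_day - 4)))

def toa_reflectance(dn, band, julian_day, solar_zenith_deg):
    """TOA reflectance (0-1) of a single pixel from its digital number."""
    radiance = dn * SCALE_FACTOR
    d = earth_sun_distance(julian_day)
    cos_sz = math.cos(math.radians(solar_zenith_deg))
    return math.pi * radiance * d ** 2 / (ESUN[band] * cos_sz)

def to_scaled_integer(rho):
    """Rescale reflectance to the 0-10,000 integer range (clamping is an assumption)."""
    return int(round(min(max(rho, 0.0), 1.0) * 10000))

# Illustrative pixel: band 3 (red), day of year 200, solar zenith angle of 30 degrees.
rho_example = toa_reflectance(dn=5000, band=3, julian_day=200, solar_zenith_deg=30.0)
```

In practice the same arithmetic would be applied to whole image arrays, with the solar zenith angle and acquisition date read from each scene's metadata.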


RapidEye Preprocessing-Image Segmentation to Obtain the MMU
The UNFCCC, at its 7th Conference of the Parties (COP), proposed that countries should choose an MMU no greater than 1 ha [70]. For its submission to the UNFCCC, Tanzania used a 0.5 ha MMU [71].
Image segmentation was chosen to map landscape units at a cartographic MMU corresponding as closely as possible to that set by the national authority responsible for forests (the Tanzania Forest Service) and international guidelines. Almost all national forest definitions are based on tree height and tree cover with a minimum mapping unit. Remote sensing data are imaged at the pixel level, and so segmentation allows us to cluster pixels of a similar reflectance and texture into objects (vector polygons) that correspond to landscape units (e.g., 'a forest') [72]. The segmentation of the image into objects is achieved using the eCognition software package [73].
To select the segmentation parameters which determine the aggregation of pixels to objects (or polygons), we ran a series of tests changing the shape factor, the compactness, and the reflectance weights of the segmentation process executed in eCognition. We found that a high compactness value (0.9) provided results that were more in line with traditional photo interpretation results. Results with lower compactness gave us highly fragmented polygons.
The shape factor is the parameter (integer > 1) that controls the size of the objects [74]. The greater the shape factor, the larger the objects. However, the results of the segmentation (i.e., object size) are dependent not only on the shape factor, but also on the image itself, i.e., the image contrast, variance, and landscape fragmentation. Therefore, a different value of the shape factor parameter may be required to obtain objects of the same size on different images. In conjunction with the FAO, the JRC developed an iterative routine, applied to each image, that creates initial, small landscape units and then, in a stepwise process, increases these units until the mean object area of the image corresponds to the MMU (Figure 3). Objects smaller than the MMU were then dissolved and added to the adjacent object with the closest spectral signature [72]. As the mean object area must be calculated only on land cover features, objects corresponding to clouds and cloud shadows had to be detected, masked out, and omitted from the routine. An example of the increasing shape factor iteration is given in the Supplementary Materials Figure S3, with the final result in Figure S4.
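The control flow of the iterative routine described above can be sketched as a simple loop. This is only an outline under stated assumptions: `segment` is a hypothetical callable standing in for the segmentation software (it is not an eCognition API), it is assumed to return object areas in hectares with cloud and cloud shadow objects already excluded, and `toy_segment` is an invented stand-in used purely to make the example runnable. The subsequent dissolving of objects smaller than the MMU is not covered.

```python
from statistics import mean

def grow_to_mmu(segment, mmu_ha=1.0, start=2, step=1, max_iter=200):
    """Increase the size parameter until the mean object area reaches the MMU.

    segment: hypothetical callable mapping a size parameter to a list of
    object areas (ha), excluding cloud and cloud shadow objects.
    Returns the first parameter value that meets the MMU and the areas obtained.
    """
    param = start
    for _ in range(max_iter):
        areas = segment(param)
        if areas and mean(areas) >= mmu_ha:
            return param, areas
        param += step
    raise RuntimeError("MMU not reached within max_iter iterations")

def toy_segment(param, scene_ha=100.0):
    """Invented stand-in: splits a 100 ha scene into equal objects
    whose size grows with the parameter."""
    n_objects = max(1, int(1000 / param))
    return [scene_ha / n_objects] * n_objects

param, areas = grow_to_mmu(toy_segment)
```

Because the object size obtained for a given parameter value depends on image contrast and landscape fragmentation, the loop has to be run separately for each image.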

The texture of the image's reflectance can be an effective parameter for measuring AGB [26,27,75]. To have a consistent measure of texture, the scale of analysis (kernel) needs to be large enough to cover the variation in the target landcover with respect to the resolution of the satellite data. At fine spatial scales (e.g., 1 m), the texture of tree crowns shows high variance. This variation remains high until the unit of observation, i.e., the kernel, is larger than a single tree. We carried out a series of tests extracting texture measures from the RapidEye data using a set of incrementally increasing circular units in homogeneous landscapes, from 0.25 ha to 6.25 ha, to find an MMU that gave stable (i.e., invariant) results (Supplementary Materials Figure S2). Some texture measurements were highly sensitive to the size of the image object, but stabilized at 1 ha. This was then used as the MMU for our objects. One hectare is larger than the final national specification (0.5 ha), but is in line with international guidelines.

Cloud and Cloud Shadow Masking
The RapidEye data exhibited two artefacts that needed to be addressed: (i) cloud and cloud shadow displayed a locational shift of several pixels between image bands; (ii) the sun azimuth angle, which is used in classification software for detecting cloud shadow, was missing in the metadata.
The locational shift (i) between bands is a result of the push-broom design of the RapidEye sensor, which means that different spectral channels are acquired at slightly different times. This effect is combined with a parallax effect, which depends on the height of the cloud in the individual images [76]. We therefore used an incremental approach whereby an initial (high) reflectance threshold of 50% was applied to detect core cloud areas in each band [77], creating 'core cloud objects'. These objects were then merged together, and a lower (10%) threshold was applied only to those areas adjacent to the core areas (i.e., edge of cloud). This has the effect of expanding the area classified as cloud.
To address (ii), the missing sun azimuth angle, an interface was developed to estimate the angle visually. An operator used the position of clouds and their respective shadows to provide a value to replace the missing metadata.
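The two-threshold cloud detection described above can be illustrated with a small region-growing sketch on a single band. The 50% and 10% thresholds are those quoted in the text; everything else is an assumption made for illustration: the function name, the use of 8-connected neighbours, and the `max_grow` limit on how far (in pixels) the cloud edge may extend from a core area. The per-band detection and merging of core objects, and the shadow detection, are not reproduced.

```python
from collections import deque

def cloud_mask(refl, core_thr=0.50, edge_thr=0.10, max_grow=3):
    """Boolean cloud mask for a 2D grid of reflectances (0-1).

    Pixels at or above core_thr seed the mask; pixels at or above edge_thr are
    added only if they connect to a core area within max_grow pixels.
    """
    rows, cols = len(refl), len(refl[0])
    mask = [[refl[r][c] >= core_thr for c in range(cols)] for r in range(rows)]
    queue = deque((r, c, 0) for r in range(rows) for c in range(cols) if mask[r][c])
    while queue:
        r, c, dist = queue.popleft()
        if dist >= max_grow:
            continue  # do not grow further than max_grow pixels from a core area
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and not mask[nr][nc] and refl[nr][nc] >= edge_thr:
                    mask[nr][nc] = True
                    queue.append((nr, nc, dist + 1))
    return mask

# Toy scene: one bright cloud core with a dimmer edge, plus an isolated
# moderately bright pixel (top right) that should not be flagged.
scene = [
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.20],
    [0.05, 0.20, 0.60, 0.20, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05],
]
mask = cloud_mask(scene)
```

Restricting the lower threshold to the neighbourhood of core areas matters because many cloud-free surfaces also exceed 10% reflectance in some bands.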

Extraction of Remote Sensing Parameters for Developing Models for AGB
After cloud and cloud shadow masking, the segmentation iteration loop, as described in Section 2.5, was run to obtain image objects at the MMU (1 ha) for all images. A set of remote sensing parameters was then extracted from the 1 ha objects corresponding (spatially) to the field data plots. These parameters were then used as the basis for generating a model relating AGB to remotely sensed data (Figure 4). The parameters fall into five categories: simple reflectance means and standard deviations for the objects; derived vegetation indices designed to highlight vegetation characteristics; advanced texture measures based on Grey Level Co-occurrence Matrices (GLCM) [78]; and a categorical class giving the percentages of the polygon belonging either to bare soil, grasslands, or woody vegetation.
Spectral indices have been used for detecting vegetation, and more specifically forest parameters, in a number of studies [79,80]. The shortwave infrared bands (1.6 and 2.7 µm) are found to be highly correlated with forest parameters and canopy cover [52]. Unfortunately, these bands are absent from the RapidEye sensor. Indices, such as the shadow index, have also been used for forest applications [81,82]. Improvements in assessing AGB have been achieved using temporal series to monitor phenological changes [83]; however, in our study, we were limited to single-date imagery.
As satellite data of finer spatial resolution have become more readily available, the use of texture measures has become more common (e.g., [84]). The number of texture indices available is extensive (200 are available in the image processing software, eCognition). Each measure (column 1 of Table 2) can be assessed in 5 different directions for each individual band or for all bands combined. Since these indices are computationally expensive, we used pairwise correlation tests on the parameters to identify those indices that were highly inter-correlated (see correlation matrix in the Supplementary Materials Figure S5). By retaining only one index from each highly correlated group, we reduced the number of potential texture indices to 25.
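The pairwise correlation screening described above can be sketched as a greedy filter. This is only an illustration under stated assumptions: the 0.9 threshold, the feature names, and the toy values are hypothetical and are not taken from the study, which carried out this step on the eCognition outputs.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # undefined for constant features


def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| with every already kept feature is below threshold.

    `features` maps a (hypothetical) index name to its values over the image objects.
    The 0.9 threshold is an assumption, not a value reported in the paper.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data: 'glcm_b' is a linear function of 'glcm_a'; 'glcm_c' is only weakly related.
features = {
    "glcm_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "glcm_b": [2.1, 4.1, 6.1, 8.1, 10.1],
    "glcm_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
selected = drop_correlated(features)
```

The order in which features are visited decides which member of a correlated group survives, so in practice the cheaper or more interpretable index would be listed first.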
For the percentages of bare soil, woody vegetation, and photosynthetically active non-woody vegetation, we used the Shadow Index (SI), the Bare Soil Index (BSI) [85], and the Modified Chlorophyll Absorption Ratio Index (MCARI) [86]. To restrict our final models to 'woody' vegetation, we produced a classification of each of the RapidEye images based on a decision tree approach [77], which assigns each object to either forest, shrub, or non-forest. The data extraction processing chain (Figure 4) was implemented in eCognition (the rule set is available both as text and as an eCognition dcp file in the Supplementary Materials). The list of reflectance and texture measures extracted from the RapidEye objects is shown in Table 3.

Table 3 (excerpt). Soil Adjusted Vegetation Index (SAVI): (NIR − RED)/(NIR + RED + L) × (1 + L); Shadow Index [85] (Shadow_Index); Modified Transverse Vegetation Index [86] (MTVI); Modified Chlorophyll Absorption Reflectance Index [90] (MCARI).
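As a small illustration of how one of the listed indices is evaluated per object, the SAVI expression from Table 3 can be written as a function. The soil adjustment factor L = 0.5 and the reflectance values (scaled 0 to 1) are assumptions for the example; the paper does not state the value of L used.

```python
def savi(nir, red, L=0.5):
    """SAVI = (NIR - RED) / (NIR + RED + L) * (1 + L).

    L = 0.5 is a commonly used default and is an assumption here.
    """
    return (nir - red) / (nir + red + L) * (1 + L)


# Hypothetical mean object reflectances in the NIR and red bands.
value = savi(nir=0.40, red=0.10)
```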

Reviewing the Field Plot Data with Respect to the Rapideye Data
To assess the compatibility between field data and image data, we developed an interface displaying the image data, the plot location, and the plot information (see Supplementary Materials Figure S6). A number of problems became evident during this review. These related to spatial mismatches, temporal mismatches between image acquisition and field visits, and the impact of mixed land cover within the sample sites. Specifically:
- The field data (15 m circle plots, circa 700 m²) cover 27 RapidEye pixels. However, the plot geolocation taken in the field (Garmin C60) had limited accuracy (+/−7 m). There are also known problems in changing between the Arc60 datum of the topographic maps of East Africa used in the field survey and the satellite reference datum, UTM-WGS84 [92].
- There were a number of discrepancies in the geolocation of various RapidEye scenes, which, after the review, were addressed by shifting images to a reference database of Landsat scenes. Over 50 sample site images were shifted by up to 12 pixels, 30 in the X direction and 41 in the Y direction.
- Temporal differences: the field data collection took place over three years, whereas satellite data were only available for one of these years. On average, the difference between field data collection and satellite image acquisition was 10 months (see Supplementary Materials Figure S1), with most of the field data collected after the image acquisition.
- Data collected in the field gave no systematic estimation of the respective cover of trees, shrubs, or other land cover classes, despite this being foreseen in the original protocol. On a number of occasions, when given, the canopy cover did not correspond to that seen on the satellite image, perhaps due to problems of geolocation between the data sets or differences in time between the field visit and the image acquisition. Also, canopy density is known to be difficult to measure accurately from the ground [8].
- The land cover classification given to the field teams was not adapted to providing adequate field data for calibrating remote sensing data. The vast majority of plots were classified as 'woodland', without further elaboration.
- Even if the land cover does not change throughout the year, its condition does, especially in the tropics, predominantly due to seasonality. It may be in a lush green phase, drying out, exceptionally dry, burnt, or flooded. All these present different spectral signatures for the same land cover. We removed 33 plots that were burnt, 9 that were flooded, and 13 that had cloud or cloud shadow.
- Finally, we found that the field plot was not always representative of the 1 ha image object in the remote sensing data.

Models for Predicting Above Ground Biomass
We tested four predictive models to relate the remotely sensed parameters to AGB and evaluated their accuracies: two models from inferential statistics (a generalised linear model and a generalised exponential model) and two machine learning models (a random forest model and a support vector machine (SVM) model). The predictors of the models are image bands, their textures, and spectral indices, while the response variable is the AGB. A scatterplot of each input variable (predictors on the y axis) against biomass is shown in the Supplementary Materials Figure S7.
In the generalised linear model, the dependent variable is assumed to be a linear function of several independent variables, each of which has a weight (i.e., a regression coefficient). Generalised linear models therefore require a linear relation between the predictors and the response variable.
Allometric equations relating tree height and basal area to volume, and hence used for calculating AGB, are generally non-linear [11]. We therefore applied a logarithmic transformation to the dependent variable, which gives a generalised exponential regression.
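The log-transformed regression can be illustrated with a single predictor. This is a minimal sketch, assuming ordinary least squares on log(AGB) and synthetic data; the study used many predictors, and the function names here are hypothetical.

```python
import math


def fit_exponential(x, y):
    """Fit y = exp(a + b * x) by ordinary least squares on log(y); requires y > 0."""
    log_y = [math.log(v) for v in y]
    n = len(x)
    mx, my = sum(x) / n, sum(log_y) / n
    b = (sum((xi - mx) * (li - my) for xi, li in zip(x, log_y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b


def predict_exponential(a, b, x):
    """Back-transform the linear predictor to the original AGB scale."""
    return [math.exp(a + b * xi) for xi in x]


# Synthetic, noise-free data generated from AGB = 20 * exp(0.5 * x).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [20.0 * math.exp(0.5 * xi) for xi in x]
a, b = fit_exponential(x, y)
fitted = predict_exponential(a, b, x)
```

With noisy data, the plain back-transform `exp(a + b * x)` targets the median rather than the mean of AGB, which is one possible source of bias in such models.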
Random forest (RF) [93] is an ensemble method that builds several decision trees (weak learners) and outputs the mean prediction of the individual trees. In a decision tree, an input is entered at the top and, as it moves down the tree, the data are binned into smaller and smaller sets. The main principle behind ensemble methods is that a group of "weak learners" can come together to form a "strong learner"; for details see [93]. The RF strategy alleviates the often reported overfitting problem of single trees. The decorrelation of the decision trees is achieved through the random selection of the input explanatory variables [94] and bootstrap sampling of the training data. With bootstrap sampling, about 63% of the data are used for training each tree (in-bag data) and the remaining 37% (out-of-bag data) for validation. The choice of random forest was mainly motivated by the following: (i) RF runs efficiently on large databases (i.e., it is relatively fast to build and even faster to predict); (ii) RF is resistant to outliers and over-training; (iii) RF does not require cross-validation for model selection; (iv) RF provides further information about the most relevant variables entered in the model; and (v) RF is computationally parallelizable. Two parameters are important in the RF algorithm: (i) the number of decision trees used in the forest (ntree) and (ii) the number of randomly selected variables tried at each split (mtry). In our RF model, we used 500 decision trees (i.e., ntree = 500) and the default mtry value (i.e., mtry = 9). To choose ntree, we built RF models with different ntree values, recorded the error, and identified the number of decision trees needed to reach the minimum error estimate. RF provides the percentage of variance explained and the most relevant input variables in the model. For reproducibility, we used a fixed random seed, which forces the generator to give the same random numbers in the random forest function. Since random forest performance generally increases as the size of the training data increases, we trained the final model with the entire ground truth dataset.
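The 63%/37% in-bag and out-of-bag proportions follow from sampling n points with replacement: the chance that a given point is never drawn is (1 − 1/n)^n, which approaches 1/e ≈ 0.368. A short simulation illustrates this; the seeds and the number of repetitions are arbitrary choices for the example, and n = 512 is simply the size of the ground truth dataset.

```python
import random


def oob_fraction(n_samples, seed=0):
    """Fraction of points left out of one bootstrap sample of size n_samples."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(in_bag) / n_samples


n = 512
# Average over 200 simulated bootstrap samples (one per hypothetical tree).
mean_oob = sum(oob_fraction(n, seed=s) for s in range(200)) / 200
# Theoretical probability that a point is never drawn in n draws.
expected = (1 - 1 / n) ** n
```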
Random forest is also able to provide information about the importance of the predictors. Specifically, the mean decrease in accuracy ("%IncMSE") (Figure 5) has been used as a measure of variable importance. The %IncMSE quantifies by how much the removal of each variable reduces the accuracy of the model. The higher the %IncMSE, the more important the variable. We ran random forest 100 times to find a robust estimate of these predictors, with the average values and the quartile distributions of the 20 most relevant predictors shown in Figure 5. The Shadow Index was found to be the most important variable. R2SD2 (that is, the ratio of band 2 reflectance to its standard deviation) and MCARI also influence AGB variability. Other predictors are ranked lower in their relative variable importance. The main limitation of the random forest model is its inability to predict beyond the range of the response values in the training data.
Support vector machines (SVM) are supervised learning methods, originating from statistical learning theory [95], that can be used for regression tasks. SVM methods perform a linear regression in a high-dimensional feature space using kernels, which allows them to capture nonlinear tendencies in the original input data. SVM has some parameters that need to be selected (i.e., the kernel) and tuned (i.e., cost and gamma) to obtain a better performance. Among all the kernels available in the literature, we chose the radial basis function (RBF) kernel, due to its theoretical and computational convenience. To optimise the SVM, we tested several different cost and gamma values and retained the combination that minimised the mean squared error in a 10-fold cross validation. Specifically, the optimal values were found to be 10 and 0.5 for the cost and gamma parameters, respectively.
The ground truth dataset is the national field data set collected by NAFORMA, which consists of 512 points where we have AGB information (circa 500 other points were excluded due to very low or zero biomass, detected on the basis of the decision tree classifier, Section 2.7). The ground truth dataset was split 85% into a training dataset (435 points) and 15% into a validation dataset (77 points). Random forest already separates part of the data for an internal validation (i.e., 63% of the data is used for training and 37% for validation, as discussed in Section 2.9), but to compare results with the other models, it was trained and validated with the same datasets as the other models. This scheme ensures (i) that identical datasets are used for training and (ii) an independent validation with a dataset not included in the training phase.
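The 85%/15% partition can be sketched as a seeded shuffle of the plot indices, so that every model sees the same training and validation points. The seed value and the function name are hypothetical; the paper does not describe how the split was drawn.

```python
import random


def split_indices(n_points, train_fraction=0.85, seed=42):
    """Shuffle point indices reproducibly and split them into training and validation."""
    rng = random.Random(seed)          # fixed seed (assumed) so the split is repeatable
    indices = list(range(n_points))
    rng.shuffle(indices)
    n_train = round(n_points * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, val_idx = split_indices(512)
```

Reusing `train_idx` and `val_idx` for all four models keeps the comparison on identical data.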
Once we obtained the best model for AGB, we created the final model using all of the ground truth dataset (for further details see next section). The final model was then applied to all the image objects pre-classified as 'woody' in the full remote sensing data set, so as to obtain AGB estimates for each of the remote sensing sample sites.
Four statistical indicators were used to evaluate the performances of the models: (i) root mean square error (RMSE); (ii) mean absolute error (MAE); (iii) relative root mean square error (relRMSE); and (iv) the coefficient of determination (R²). Note that the results of the models refer only to the validation dataset; however, the performance of the models with the training dataset is shown in the Supplementary Materials Figure S8.
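The four indicators can be computed as in the sketch below. Two definitions are assumptions, since the text does not spell them out: relRMSE is taken as RMSE divided by the mean observed AGB (in percent), and R² is taken as 1 − SS_res/SS_tot. The observed and predicted values are invented for illustration.

```python
import math


def evaluate(observed, predicted):
    """Return RMSE, MAE, relRMSE (% of mean observed) and R2 for paired values."""
    n = len(observed)
    residuals = [p - o for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    mean_obs = sum(observed) / n
    rel_rmse = 100 * rmse / mean_obs                    # assumed definition
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1 - ss_res / ss_tot                            # assumed definition
    return {"RMSE": rmse, "MAE": mae, "relRMSE": rel_rmse, "R2": r2}


# Hypothetical validation values in t/ha.
observed = [20.0, 60.0, 100.0, 140.0]
predicted = [30.0, 60.0, 90.0, 120.0]
scores = evaluate(observed, predicted)
```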

Interpolating the Results to the National Level
To obtain the country level estimates, we extrapolated the sample results using the Direct Expansion Estimator, which has been applied in various studies [52]:

$$\hat{Y} = D\,\bar{y}$$

where $\hat{Y}$ is the estimated total AGB, $D$ is the total study area, and $\bar{y}$ is the average AGB ha⁻¹ of the sample areas.

The Relative Efficiency to Measure the Improvement in Precision Brought by the Remote Sensing Data
We calculated REff, the relative efficiency, i.e., the ratio of the variance of the AGB estimate from the field plots alone (VAR_FD) to the variance of the AGB estimate calculated from the RapidEye data combined with field data (VAR_REFD). This measures the improvement in precision (as opposed to accuracy) that the introduction of the remote sensing data brings [96]. In theory, a relative efficiency of X would mean that the improvement brought about by the introduction of the remotely sensed data could otherwise only be achieved by using X times as many field plots [10].
An estimate [52] can be made with:

$$\mathrm{REff} = \frac{1}{1 - R^{2}}$$

where R² is the square of the Pearson coefficient of correlation between the remote sensing data and the field data.
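As a worked check, and assuming the usual regression-estimator approximation REff = 1/(1 − R²), the R² of 0.69 reported for the random forest model gives a relative efficiency of about 3.2. The function name is hypothetical.

```python
def relative_efficiency(r_squared):
    """Approximate relative efficiency of a regression estimator: 1 / (1 - R^2)."""
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


reff = relative_efficiency(0.69)  # R2 of the random forest model on the validation data
```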

Model Results
Table 4 summarizes the main statistical indicators, with scatterplots comparing the observed and modelled AGB shown in Figure 6. All performance measures (i.e., RMSE, MAE, relRMSE, R², and observed vs. modelled scatterplots) were assessed using only the validation dataset, with random forest exhibiting the best correlation (R² = 0.69). The RMSE of the random forest model is ~30, with an MAE of ~20 and a relRMSE of 15%. The performance of SVM is less satisfactory: RMSE is ~40, MAE is ~30, and relative RMSE is ~40%. As expected, the four methods presented a gain in performance on the training dataset, with RF showing very high accuracies (i.e., RMSE ~15, R² = 0.93), while the exponential and linear regression methods still presented poor (biased) estimates of AGB (see Supplementary Materials Figure S8). However, all four methods produce biased results, overestimating low AGB and underestimating high AGB, as indicated by the deviation of the regression slopes from the 1:1 line.

Analysis at the National Level and by Ecoregion
For our sample sites, covering 2.3% of the country, we find an average of 22.1 tons per ha of above ground biomass (AGB). This is only for those landscape units that are classed as being 'woody biomass'. Using the areas provided by NAFORMA for forests and woodlands (54,534,500 ha), this results in a total of 602.4 MtC for the country, compared to the estimate from the field plots of 624.9 MtC, a difference of 4%. The standard deviation is high (25%); however, this is not unusual, as land cover is not generally distributed in a regular manner, especially in Tanzania, which exhibits land covers of highly different levels of biomass.
Using the location of the individual sample sites, we calculated the average AGB for the respective ecoregions (Table 5). Due to the limited number of samples in the other ecoregions, the only informative results are those for the Somalia-Masai and Zambezian regions.
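The national figure can be roughly reproduced by direct expansion (total = area × mean AGB per hectare). Note the assumption: converting biomass to carbon with a factor of 0.5 is not stated in this section and is introduced here only because it reconciles the reported numbers; the small gap to the reported 602.4 MtC is presumably due to rounding of the 22.1 t ha⁻¹ mean.

```python
def direct_expansion(total_area_ha, mean_agb_t_per_ha):
    """Direct expansion estimate of total AGB in tonnes."""
    return total_area_ha * mean_agb_t_per_ha


CARBON_FRACTION = 0.5  # assumption: carbon taken as half of dry biomass

agb_t = direct_expansion(54_534_500, 22.1)       # area and mean AGB from the text
carbon_mtc = agb_t * CARBON_FRACTION / 1e6       # about 602.6 MtC
# Relative difference between field-plot and RapidEye-based totals (reported as 4%).
relative_difference = (624.9 - 602.4) / 624.9
```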

Relative Efficiency
The relative efficiency (REff) is calculated using Equation (5), which gives a REff of 3.2. The introduction of the remote sensing data can therefore reduce the field data collection by a factor of about three while obtaining the same precision.
This result helps in answering one of the main questions of the study: "How much can RapidEye improve the precision of field-based inventory estimates of above-ground biomass in forests and woodlands of Tanzania?"

Discussion
A key point that this paper wishes to address is how to improve the link between remote sensing parameters and data collected in the field.

Practical and Operational Considerations to Improve AGB Estimates from RapidEye Data
In preparing data for testing models to relate remote sensing parameters to field data (AGB), we found that a major effort was required in 'cleaning' the field data. By 'cleaning' we mean removing those aspects of the field data that reduce their potential to calibrate remote sensing data for biomass estimation. The country wide RS data acquired under the GMES initiative were restricted to 2010-2011, whereas the field data were collected over a period of three years. Hence a temporal difference often occurs between a specific image and the corresponding field measurement. However, few plots would have undergone a change during this time lapse. To avoid cloud contamination, most images come from the dry season. At this time, the vegetation is senescent, with trees and shrubs losing their leaves. In this situation, the differences in reflectance between grasslands, shrub and tree savannah, and open woodlands are reduced.

Co-Location of the Data Sets
The site visits for field data plots were organized without using remote sensing imagery. Field teams, using the national topographic maps (in Arc60 datum), navigated to the pre-selected sites. Such field visits should be supported by cartographic and digital extracts of the field locations. Both high cost (high precision GPS with inbuilt satellite and/or map visualisation) and low cost (smart phone/pad and PDF maps) options exist to facilitate both navigation and on site location. When a local datum is used, so as to relate to the national mapping datum, dual GPS locations with WGS 1984 should be recorded.

Data Collected
The data collected for the national assessment were suited for the statistical assessment of biomass and for other parameters. However, we found that they were not specific enough to be easily correlated with the satellite data. Optical satellite data record reflectance values within the plot that can be related to the vegetation canopy cover, distribution, condition (dry/green/burnt/flooded), and structure. The field data collected need to correspond to these vegetation parameters, which should be recorded and mapped on site during the survey with the aid of an orthomap produced from the remote sensing data to be used in calibration.

Data Cleaning
The inclusion of non-woody, low biomass sites was found to introduce high variance in the spectral signatures for similar values of AGB. These sites have a high variation in land cover and land cover condition. They include barren surfaces, agricultural land, grasslands, and park savannah, each of which was found in different states: burnt, flooded, green flush, senescent, or bare soil. We therefore removed them from the data set, reducing the available points from circa 1000 to circa 500.

Spectral Signatures
The reflectance of forest canopy depends on a number of factors: structure, row orientation (for young artificial forests), optical properties of the background (soil and understory), and canopy geometry [97]. The Shadow Index was found to be the most relevant parameter related to AGB, followed by texture measures. The value of the Shadow Index increases as the forest density increases, hence it is appropriate for relating to woody biomass.
Mature, natural forest stands tend to have more heterogeneous surfaces than non-forest and young regrowth, creating more shadows. The shortwave infrared (SWIR) bands (1.6 µm and 2.7 µm) have been shown to be highly successful in mapping forests [97]. This is mainly because at these wavelengths there is very little diffuse light, hence shadows are more contrasted. In the absence of SWIR bands on the RapidEye sensor, we used the shadow index as defined by Rikimaru et al. (2002) [58], which uses the visible bands only. Similarly, texture measures, at finer spatial resolutions, differentiate between heterogeneous and homogeneous surfaces (Table 3).

Ancillary Data
The parameters entered into the model came exclusively from the single date RapidEye data. While these (RapidEye) are superior in terms of resolution, they are limited for temporal analysis. With the arrival of the Sentinel-2A and 2B sensors, 10 m resolution data are now available at a 5 day frequency. Simulation of this scenario with the SPOT4take5 program [98] has already shown that differences in forest canopy cover (and, as a result, AGB) can be determined [99].

Estimation by Ecozone
As expected, the Somalia-Masai region shows a significantly lower biomass than the Zambezian region (Table 5): the former is dominated by shrub formations, the latter by woodlands. The other zones are too small to consider using this methodology due to the small number of RapidEye images available in each zone. The Zanzibar-Inhambane region, home of the Coastal forests, should have a high average AGB; however, it is these forests that have been depleted most by the impact of exploitation by residents of Dar es Salaam and Morogoro [100].

Relative Efficiency
It is difficult to quantify the potential financial savings that the inclusion of remote sensing data would bring to the project without knowing the direct costs of the field survey. In the NAFORMA project, the total spent on salaries for the field survey component alone was circa $5 million [63]. However, this includes household surveys (n = 3500) and soil data collections (n = 4000). The costs of overheads, vehicles, and fuel would need to be added. In any case, a relative efficiency of over three indicates that, when combined with remote sensing data, the field component could be more than halved to obtain the same precision. The cost of RapidEye data at the time was around $1 per square kilometre, giving a commercial cost of just over $30,000 for the satellite data used in this study. Full country coverage by RapidEye would be around $1 million. Such a coverage would also provide a wall-to-wall map of biomass, which could allow for the estimation of biomass in the small ecozones described above.
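The quoted data cost follows directly from the sampling design: 76 sample sites of 20 km by 20 km at about $1 per square kilometre. The arithmetic is shown below; the constant names are only for illustration.

```python
N_SITES = 76              # number of RapidEye sample sites
SITE_SIDE_KM = 20         # each site is 20 km by 20 km
COST_PER_KM2_USD = 1.0    # approximate RapidEye price at the time

sampled_area_km2 = N_SITES * SITE_SIDE_KM ** 2           # 30,400 km2
sample_cost_usd = sampled_area_km2 * COST_PER_KM2_USD    # just over $30,000
```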

Model Results
The final model has to cover the full variation in vegetation types and conditions across Tanzania. Clearly, studies on small areas [45], single ecosystems [39], or a mono-species plantation [46] will provide better results; however, these are not comparable to a country wide assessment.
It should be noted that two field plots may have the same AGB, but with totally different characteristics in terms of vegetation type, cover, and condition. We can imagine a plot with one tree surrounded by grass having the same AGB as a plot fully covered by shrub. This is where the strength of random forest lies, applying a decision tree to associate different input parameters (in this case RS variables) to the same class.
Compared with the other methods available, random forest considerably reduces the dispersion of residuals (e.g., RMSE) and increases the model fit (R²). Despite this improvement in accuracy and precision, the AGB estimate presents a reduced range (40-150 t ha⁻¹ in the case of random forest), which is much narrower than that of the response values in the training data (10-200 t ha⁻¹). Specifically, the statistics by biomass range show that RF tends to overestimate biomass in the range of 0-40 t ha⁻¹ and underestimate it at higher values (i.e., above 150 t ha⁻¹).
Note that this behaviour is present for all four models and is minimized for RF. This behaviour has been observed in many other AGB products [101]. A solution to this would be to compute AGB error statistics for biomass ranges (i.e., 0-50 t ha⁻¹, 50-100 t ha⁻¹). It also has to be remembered that the biomass estimates themselves are interpolated from field measures using allometric equations, which in themselves may have bias for certain ranges of AGB [62].
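Error statistics by biomass range, as suggested above, could be computed along the following lines. The bin edges, function name, and data values are hypothetical and chosen only to show a positive bias at low AGB and a negative bias at high AGB.

```python
import math


def errors_by_range(observed, predicted, edges):
    """Bias and RMSE of predictions, grouped by observed AGB range [lo, hi)."""
    stats = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        residuals = [p - o for o, p in zip(observed, predicted) if lo <= o < hi]
        if not residuals:
            continue  # skip empty ranges
        bias = sum(residuals) / len(residuals)
        rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
        stats[(lo, hi)] = {"n": len(residuals), "bias": bias, "rmse": rmse}
    return stats


# Invented values in t/ha: overestimation at low AGB, underestimation at high AGB.
observed = [10.0, 30.0, 70.0, 90.0, 160.0, 180.0]
predicted = [40.0, 45.0, 75.0, 85.0, 140.0, 150.0]
by_range = errors_by_range(observed, predicted, edges=[0, 50, 100, 150, 200])
```

Reporting the per-range bias alongside the global RMSE makes the compression of the predicted range visible instead of averaging it away.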
The AGB overestimation at low biomass and underestimation at high biomass is still an open issue, which has been widely reported in the literature using either optical or radar data (e.g., [102]). Although the presence of a bias can be regarded as a main drawback of such maps, this caveat does not mean that the model result is without merit, as all models are affected by various limitations, and their identification allows a proper use of the outputs. For example, if the map accuracy is not sufficient for certain applications (e.g., local level management), other applications (e.g., stratification, regional analysis) can benefit from knowing the spatial distribution of biomass in relative terms. In any case, quantile analysis on the results of RF regression (Supplementary Materials Figure S9) shows that RF is able to discriminate four main levels of AGB with a relatively good accuracy (e.g., 70% and almost 80% for low and high values, respectively) and that the bias in the extremes is not so important.
Underestimation of high biomass values may be due to the following reasons: (i) the limited sensitivity of satellite sensors to variations in canopy height and tree diameter in dense forests. Specifically, optical sensors' radiometers (such as RapidEye) tend to saturate at high biomass in dense forest, where there is a weak reflectance-biomass relationship, e.g., [103]. For this reason, the use of optical data in combination with SAR and LiDAR data would improve results, as shown in [104,105]; (ii) we were limited to the use of single date imagery, with most of the images acquired in the dry season when many seasonal forests have lost their leaves and are spectrally similar to low biomass grasslands; and (iii) RF is trained on a dataset where high biomass values represent the tail of the frequency distribution, and therefore its performance decreases as the biomass increases.
With regards to low biomass values, there could be two explanations: (i) Confusion between understory vegetation and forest canopy where shrubs might be confused with the canopy thus overestimating AGB; and (ii) tree savanna, isolated trees may tend to have a larger canopy, which is relatively similar (as seen from satellite) to that of a denser forest.
Our work demonstrated that forest services face major hurdles when trying to integrate remote sensing data into a national inventory. These hurdles include obtaining cloud free data for all reference (field) measurements at dates as close to the data collection campaign as possible; ensuring a good geolocation of the two data sets; and having an adequate representation of the distribution of the biomass levels across the full country. We provide potential users with techniques to cope with these problems and show that optical data of finer spatial resolution, in this case 5 m resolution, can be used to obtain biomass estimates at a national level, albeit with some bias. The arrival of the freely available Sentinel-2 satellite data (although at 10 m resolution) provides an opportunity for more frequent wall-to-wall national coverage. Since 2016, wide area coverage has been available from the commercial 3 m spatial resolution PlanetScope set of satellites [107].
In our study, we were limited by the sampling scheme and the acquisition dates of the available RapidEye data. The scheme, a regular systematic grid, was devised for continental scale forest change estimates. A more appropriate approach would be to use a stratified random scheme with more (smaller) samples (e.g., [50]) so as to better capture the variance of the biomass across the different ecosystems. While the sampling approach is an efficient method for obtaining biomass estimates, it does not provide a wall-to-wall cartographic product, which would be a benefit for forest management planning and provide orthomaps to support the field survey and stratified estimation [108].

Conclusions
We found that remote sensing parameters extracted from 5 m resolution satellite data combined with modelling using the random forest algorithm were an effective method for improving the precision of AGB estimates.
The Shadow Index was found to be the most relevant parameter related to AGB, followed by texture measures. The use of both vegetation indices and texture data confirms our hypothesis that a combination of different information extracted from the RS data enhances the precision of AGB estimation, but can also increase bias. This is because the relationship between reflectance and AGB over dense forests is affected by the saturation phenomenon. In contrast, textures and non-woody vegetation cover information (i.e., shadow index) might be more sensitive to canopy architecture and structure. As expected, the high performance of random forest is more evident when the response variable is a result of non-linear and complex interactions between multiple predictors, as has been demonstrated in numerous previous studies. RF was shown to be quite robust when including a large number of heterogeneous input variables. Also, many predictors of AGB are highly inter-correlated, and RF handles both these and redundant variables well.
In terms of supporting REDD+ initiatives, this work demonstrates that nationwide AGB inventories can be carried out in a more rapid and cost effective way by combining 5 m resolution optical satellite data with field plots.
The codes and data used for this study have been made permanently and publicly available in the Supplementary Materials so that results are entirely reproducible.

Supplementary Materials:
The following are available online at http://www.mdpi.com/1999-4907/10/2/107/s1, Figure S1: Acquisition dates of the RapidEye images (top) and the differences in months between field plots and associated image (bottom). Figure S2: The impact on texture indices resulting from changing the mapping unit (top) by analysing increasing circles over the RapidEye data (bottom), where the red circle shows the different areas used. Figure S3: The application of successive shape factors to RapidEye data; from left to right SF = 20, 60, 100, 124; the resulting number of objects falls from 59198 to 8059 to 3190 to 2217 objects as the MMU of 1 ha is approached through the iteration. Figure S4: In the final step all polygons less than 1 ha are dissolved and we arrive at 1395 objects. Figure S5: Cross correlation results carried out so as to reduce the number of texture measures by removing 'duplicate' measures, i.e., those that are highly correlated (blue and red). Due to text sizes the names of the indices are reduced. Figure S6: The interface developed to review the plot data. The plots are seen on the middle panel overlaid on the RapidEye data. On the left panel, high resolution images from GoogleEarth. The plot characteristics as collected by NAFORMA are displayed on the right. Figure S7: Scatterplot of each input variable versus Above Ground Biomass. The regression line is in blue, while the 95% confidence interval is in grey. Figure S8: Scatterplot of Predicted vs. Modelled AGB in t ha⁻¹ for the four models using the training dataset. Both circle size and colour refer to the actual AGB. The blue line indicates the linear regression between actual and modelled data and the grey area is the 95% confidence level interval.

Figure 1. The ecological zone map of Tanzania [54] with the location of the field plots (not to scale) to coincide with the satellite data. An excerpt of the points for one cluster overlaid on a RapidEye image is shown to the right.

Figure 2. The RapidEye AGB map (a) for a sample centered on the 9° S 38° E confluence point and (b) the satellite image false colour composite (Shortwave Infrared (SWIR), Near-Infrared (NIR), Red). Cloud and cloud shadow are masked out in the first stages of the pre-processing, then image segments are assigned AGB values using the model developed using random forest.

Figure 5. Box-plot of the importance of variables for AGB in the final random forest model, showing the increase in mean squared error (MSE) when removing a variable (y-axis), and their importance on the x-axis. For the sake of simplicity, only the first 12 variables are shown. RXSDX denotes the ratio of the band X reflectance to its standard deviation.

Figure 6. Scatterplot of predicted vs. modelled AGB in t ha −1 for the four models using the independent validation dataset. Both circle size and colour refer to the actual AGB. Model performance indicators are also shown in Table 4. The blue line indicates the linear regression between the actual and modelled data and the grey area is the 95% confidence level interval.

Figure shows the result of arranging all scores or predicted values in sorted quantiles, from worst to best, and how the classification goes compared to the test set. Specifically we have used 4 AGB quantiles using the validation subset. The x axis represents the 4 quantiles of AGB from NAFORMA plots. The y axis shows how much of the random forest derived AGB falls in each quantile. File 1: Rule set used to segment the RapidEye images and extract the spectral and textural information for correlation with the field data on AGB: RapidEye2018RuleSet.dcp. File 2: A text version of the above: RapidEye2018RuleSetDocu.txt. File 3: The extracted remote sensing parameters with the field data on AGB by plot: DataBiomassRS.csv. File 4: The R statistical package code used to set up and test the four models: Code_AGB_RS.R.

Author Contributions: L.H.G., F.J.G.H. and H.E. conceived the paper in the framework of L.H.G.'s Ph.D. L.H.G. carried out the remote sensing processing. L.H.G. and G.C. produced the statistical analysis supported by F.J.G.H. V.A. reviewed the draft text, providing a number of new insights to the direction of the paper. Field survey was designed and carried out by H.E. and L.G.H.

Table 3. List of reflectance and texture measures extracted from the RapidEye objects.

Table 4. Statistical indicators of modelled data versus observed AGB.

Table 5. Estimates of average AGB (t ha −1) by White ecoregion. * Refers to the number of RapidEye sites in each ecoregion.