Woody Cover Fractions in African Savannas From Landsat and High-Resolution Imagery

The challenge of mapping savanna vegetation has limited our understanding of the factors that shape these ecosystems at large scales. We tested seven methods for mapping savanna woody cover (trees and shrubs; WC) across 12 protected areas (PAs) in eastern Africa using Landsat 8 imagery. Because we wanted a method viable for mapping across the often-limited Landsat historical archive, we limited ourselves to three images: one each from the wet, dry, and transition (halfway between wet and dry) seasons. Models were trained and tested using 1,330 WC reference points and the variance explained by cross-validation (VEcv) accuracy metric. Of the methods we tested, random forest (RF) regression significantly (p < 0.001) outperformed the others, with the best models in nine PAs scoring over 75% (range across all PAs: 34.5%-91.1%). RF models trained using data from all the PAs and tested in the individual PAs significantly (p < 0.001) outperformed their single-PA-derived counterparts (67.7 ± 23.3% versus 30.5 ± 27.4%). We also found that while the transition image appears to be critical to mapping WC and the wet season image should be avoided, no single season or seasonal combination significantly outperformed all the others, allowing some flexibility in image selection. Our findings show that with proper sampling of landscape heterogeneity, even with limited imagery, accurate maps of savanna WC are possible and could catalyze discoveries in this crucial biome.


Introduction
Savannas cover a fifth of Earth's land surface [1] and are home to over 500 million people (an estimate derived by intersecting The Nature Conservancy's savanna and shrubland ecoregions [2] with population estimates for 2015 [3]). Defined as mixed tree-grass communities [4], savannas support pastoralist cultures [5], the world's largest functionally complete set of terrestrial megafauna [6], and thriving tourist economies [7,8]. Savannas also play a disproportionate role in the interannual variability of the global carbon cycle [9,10]. In all of these, the woody cover (WC; i.e., trees and shrubs) of savannas plays an important role: pastoralism, tourism, and grazing wildlife all rely on sparse WC and are threatened by woody encroachment [5,11,12], while savanna carbon dynamics are disproportionately affected by woody vegetation [4,13]. Yet, our understanding of the factors controlling savanna WC is relatively limited in comparison to other biomes (e.g., grasslands and forests).
Part of the challenge of understanding savannas has likely come from the data used in prior studies. Several influential studies of savannas and/or multiple stable states [31][32][33][34][35][36] centered on analyses of Vegetation Continuous Fields (VCF) global tree cover data [37], data now known to have significant inaccuracies in savannas [38][39][40]. Consequently, the producers of the dataset cautioned against the use of VCF in areas with less than 20-30% tree cover [38], effectively ruling out savannas given their characteristic ~20% mean VCF tree cover [32]. Additionally, VCF, which was primarily developed to monitor forests, not savannas, defines trees as woody vegetation >5 m tall [37]. While a fair threshold for forests, the majority of savanna woody vegetation likely falls below it [41], thus under-representing the woody component of these systems. More importantly, disturbances predominantly act upon woody vegetation recruits (<~5-6 m) [19,25,[41][42][43][44], meaning VCF is unlikely to detect the direct impacts of disturbances, i.e., the major determinants of WC. Combined, these factors make the use of VCF in any savanna problematic, potentially undermining our ability to understand these ecosystems.
In lieu of VCF, researchers often make maps of their own. However, making large-scale maps of savanna WC is a challenge because, unlike in forests, savanna imagery contains a high proportion of pixels with a combination of woody vegetation, herbaceous vegetation, and bare soil, referred to as mixed pixels [45,46]. Unmixing pixels requires knowledge of the spectral characteristics of all the materials within the pixel. Meeting that requirement across large areas has been a long-standing challenge [47] because vegetation and soil spectral properties change across space and time, particularly in savannas [37,38,48,49].
To limit the spectral variability of cover types, researchers often produce small-scale, site-specific maps using a range of approaches, including simple linear regressions [50], regression trees [51], cluster analysis [52], and spectral unmixing [53]. Meanwhile, some of those who have attempted to produce maps across larger areas or across several sites abandoned such approaches, instead manually classifying a large number of small areas (~0.5 ha) using very high-resolution (VHR; <1 m) imagery [54,55] or, in an effort to lower VCF error, resampling the data to coarser resolutions, thereby abandoning fine-scale analysis [34]. Still others use commercially produced national landcover maps [56]. The different data sources and methodologies mean maps are rarely easily comparable across studies. That, combined with an overall shortage of these maps, limits large-scale studies of savanna dynamics, while also limiting our ability to compare the approaches used to generate them. Further, studies rarely map WC across time, despite the demonstrated insights from temporal data [36,57,58].
Here, we develop an accurate, replicable method for mapping total WC across 12 protected areas in eastern Africa, from Uganda to South Africa, using Landsat 8 imagery. While we recognize separate maps of shrub cover and tree cover would be ideal when studying savannas and could be possible using ancillary data (e.g., LiDAR-derived digital surface models), we sought a method that could be applied through time, back to the launch of Landsat 4 (1982), a span that extends beyond most ancillary data. Further, we chose not to use ancillary data that could be assumed to be constant across the temporal record (e.g., topography) to avoid circularity in future studies of landscape-scale controls on savanna distributions and WC. Last, given the marked decline in Landsat image availability going back through the 1990s and 1980s, especially in rural Africa [59], we limited ourselves to three images per site: one each from the wet and dry seasons, along with the transition between them. The result was the development of 30-m resolution maps of WC fractions across all the sites; site-specific rankings of seasonal imagery; a novel approach for reference data generation; and a clear designation of the best mapping approach.

Study Sites
We mapped WC in 12 sites across eastern and southern Africa (Figure 1; Table 1). All of our sites are protected areas (International Union for Conservation of Nature protected status II-IV). We selected protected areas (PAs) because, as part of a larger project studying the drivers of savanna processes, we sought to avoid modern anthropogenic impacts to the extent possible, though we recognize that humans have long played a role in savannas [60][61][62]. The PAs cover a broad range of mean annual precipitation (MAP) (~500-1250 mm) and size (~200-44,800 km²). Initial inspection using satellite imagery suggested the PAs also cover a broad range of WC, with wetter PAs appearing substantially woodier than drier PAs.

Figure 1. The 12 study sites. The background represents the mean annual precipitation from 1988-2017 [63]. See Table 1 for corresponding protected area (PA) names and attributes. Mpala Research Center (2) is ~11 km at its widest point (no scale in inset).

To train our approaches and assess the accuracy of the resulting maps, we generated WC reference data using VHR imagery available through Google Earth (Figure 2), a practice that is increasingly common in broad-scale studies [54][55][56][64][65][66]. We note that in our maps and reference data, we mapped crown cover (the vertically projected area occupied by a tree crown), not canopy cover (crown cover minus intra-canopy skylight). We assume the globally derived relationship between canopy cover and crown cover (canopy cover = 0.8 × crown cover) [37] holds in our study sites, allowing us to convert when necessary (e.g., VCF uses canopy cover instead of crown cover).
Previous studies mapped reference point landcover using a range of techniques, such as visual estimation [56,65], decision tree algorithms [64], and augmented visual interpretation using software such as CollectEarth [54,55,67]. In particular, CollectEarth uses VHR imagery from Google Earth and Bing Maps to compute helpful metrics (e.g., vegetation indices), then uses tens of user-classified sampling points within the reference image to estimate the spatial extent of each cover type.

Figure 2. (a,b) VHR reference imagery; (c) the mapped woody cover (green), grass (yellow), and soil (white) of the pixel. The percent woody cover was then extracted to generate the reference data. All reference data is available online: [68].
Similar to CollectEarth, we developed a fast, semi-automated approach using VHR imagery from Google Earth. However, our approach mapped, rather than sampled, the complete extent of each cover type (code available in the online dataset: [68]). We downloaded VHR true color imagery from Google Earth (GE) using the RgoogleMaps package [69] in R [70], retrieving 180 × 180 m areas centered on one 30 × 30 m Landsat pixel (the reference point). We then mapped our perceived extent of WC, herbaceous cover (simply "grass", hereafter) and soil cover of the 30 × 30 m reference point by manually thresholding image-derived metrics.
In our mapping, we took advantage of the fact that WC is predominantly darker than both soil and grass, thresholding pixel brightness (the sum of RGB values) to classify WC. When necessary, we adjusted thresholds to account for shadows that would have otherwise inflated WC values. In addition, when visually distinguishing WC from grass was a challenge, we assumed that objects casting shadows were WC. We also used any on-the-ground photographs available through GE, along with our own on-the-ground experience in African savannas, to further inform our mapping. If we could not confidently distinguish a reference point's WC, or if brightness thresholding failed to do the same (both of which were common when defoliated WC was present), a new reference point in a different location was automatically generated.
After WC, the remaining grass and soil were classified using pixel brightness in conjunction with the Green-Red Vegetation Index (GRVI) (Table 2) [71], an index that exploits the red and green spectral differences between green vegetation and soil. However, we found that the GRVI threshold could often be left static (ca. −0.1), with only brightness threshold adjustment needed. We mapped a minimum of 100 reference points in each PA, increasing the number when densities fell below one point per 200 km², altogether mapping 1,330 reference points. All the points were in cloud-free positions in the Landsat imagery. We then randomly subset the data, splitting it into PA-specific training (70% of points) and testing (30%) data.

Table 2. Equations for indices used in this study. Color names refer to the corresponding image band.

Green-Red Vegetation Index (GRVI) = (Green − Red) / (Green + Red)
Normalized Difference Vegetation Index (NDVI) = (NIR − Red) / (NIR + Red)

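The brightness and GRVI thresholding described above can be sketched in a few lines. The threshold values and the toy tile below are hypothetical; in practice, the brightness threshold was tuned manually per reference point:

```python
import numpy as np

def classify_reference_tile(rgb, wc_brightness_max=200.0, grvi_soil_max=-0.1):
    """Classify each pixel of an RGB tile as woody cover, grass, or soil.

    rgb: (H, W, 3) float array of red, green, blue values.
    Dark pixels (low summed brightness) are taken as woody cover; the
    remainder is split into grass and soil using a GRVI threshold.
    """
    red, green = rgb[..., 0], rgb[..., 1]
    brightness = rgb.sum(axis=-1)                  # sum of RGB values
    grvi = (green - red) / (green + red + 1e-9)    # Green-Red Vegetation Index
    woody = brightness < wc_brightness_max         # dark pixels -> woody cover
    grass = ~woody & (grvi > grvi_soil_max)        # greener residual -> grass
    soil = ~woody & ~grass                         # everything else -> soil
    return woody, grass, soil

# Toy 2x2 tile: one dark (woody) pixel, two green-ish (grass), one reddish (soil).
tile = np.array([[[40.0, 50.0, 40.0], [90.0, 120.0, 80.0]],
                 [[140.0, 100.0, 90.0], [130.0, 115.0, 105.0]]])
woody, grass, soil = classify_reference_tile(tile)
wc_fraction = woody.mean()   # fraction of the reference pixel in woody cover
```

The woody fraction of the classified tile then serves as the WC value of the reference point.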
Landsat Image Collection and Processing
We collected Landsat 8 Tier 1 Surface Reflectance imagery for 2016 using Google Earth Engine (Figure 3) [74]. Google Earth Engine (GEE) utilizes the computational capabilities of Google to enable researchers to access and process the Landsat archive. We used GEE to select, preprocess, and download imagery, as well as carry out a subset of our mapping approaches (described below). However, we limited our exposure to any potential change or loss of services from GEE, such as Google's 2018 decision to phase out their "Fusion Table" data type [75], by carrying out most of our mapping outside of GEE. We selected relatively cloud-free imagery (<30% cloud cover over PAs in wetter regions (SEL, SER, MAR, RUA, MUR, QUE); <10% in drier regions (CHO, MPA, KRU, LIM, NLU, SLU) (Table 1)) and required all PA images within the same Landsat path to be from the same date. We then cloud masked the imagery and selected images we estimated to correspond with the wet, dry, and transition seasons. For each PA, the wet season image (Wet) was defined as the image with the highest mean Normalized Difference Vegetation Index (NDVI; Table 2) [76]; the dry season image (Dry) as that with the lowest mean NDVI; and the transition image (Tran) as that with its mean NDVI closest to the midpoint between the Wet and Dry NDVIs, falling chronologically after the wet season and before the dry season. If no image met our requirements for the Tran image, we took the image closest to the NDVI midpoint, regardless of where it fell in the year.

Figure 3. Conceptual diagram of methodology. For each PA, we collected, processed, and downloaded Landsat imagery from the dry, wet, and transition seasons using Google Earth Engine (GEE) (a). We then extracted band and index values at reference points throughout each PA and used very high-resolution (VHR) imagery available through Google Earth to map the woody cover (WC) at each point (b). These data, which varied across single and combined season imagery, were used to create WC maps of the individual PAs using regression, spectral unmixing, and regression trees (c-e). In addition, the reference data from all the PAs (ALL) were pooled to generate another series of Random Forest (RF)- and regression-derived maps. We assessed the accuracy of the maps using VEcv, E1, R², and RMSE (f). SMA: spectral mixture analysis; MESMA: multiple endmember spectral mixture analysis; MCU: Monte Carlo unmixing.
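The seasonal image selection rules can be sketched as follows, assuming a list of candidate images with precomputed mean NDVI values (all dates and NDVI values below are hypothetical):

```python
from datetime import date

def pick_season_images(candidates):
    """Pick Wet, Dry, and Tran images from (date, mean NDVI) pairs.

    Wet = highest mean NDVI; Dry = lowest; Tran = mean NDVI closest to the
    Wet/Dry midpoint, preferring images dated between the Wet and Dry images.
    If no image falls between them, fall back to the closest-to-midpoint
    image regardless of date, per the selection rules described above.
    """
    wet = max(candidates, key=lambda c: c[1])
    dry = min(candidates, key=lambda c: c[1])
    midpoint = (wet[1] + dry[1]) / 2.0
    between = [c for c in candidates if wet[0] < c[0] < dry[0]]
    pool = between if between else candidates
    tran = min(pool, key=lambda c: abs(c[1] - midpoint))
    return wet, dry, tran

candidates = [
    (date(2016, 3, 20), 0.78),   # wet season peak
    (date(2016, 5, 15), 0.55),
    (date(2016, 7, 10), 0.41),
    (date(2016, 9, 5), 0.22),    # dry season trough
]
wet, dry, tran = pick_season_images(candidates)
```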
The purpose of using three images was three-fold. First, by using the images individually, we aimed to identify the time of the year that led to the best maps of WC. Past studies attempting to map WC and/or biomass have used the dry season [52,77], while others have used the transition [78,79]-sometimes at the same sites [52,78]-with no consensus on the most accurate approach. The dry season is often selected because many woody species remain green, enhancing the difference between their spectral signatures and that of brown, senesced grass, thereby aiding mapping [80][81][82]. However, drier tropical and subtropical sites generally have more trees that drop all or a fraction of their leaves during the peak of the dry season (drought deciduous or raingreen) [83,84]. Therefore, we expected Tran images to outperform Dry images in the drier of our sites. Finally, we expected the Wet images, captured when both woody and herbaceous plants are photosynthetically active and therefore most spectrally similar, to yield the least accurate maps. The second reason we used three images was that, when stacked together into a single image or used to create summative statistics, we expected to increase map accuracies by capturing the phenological differences of woody and herbaceous vegetation, as done in similar work [56,[85][86][87]. Third, we limited ourselves to three images to simulate the shortage we would likely encounter when mapping historical imagery.
Because many of the PAs spanned multiple Landsat paths and each additional path added significant time investment, paths that covered less than 10% of a PA were excluded. Adjacent PAs in the same path with imagery from the same dates were treated as a single PA (requirements only met by SER and MAR, hereafter together referred to as "SMR").
In all, 105 individual Landsat scenes were selected. Images in the same path were subsequently mosaicked, yielding 45 images. We then masked all the images within each PA using the same PA-specific mask. Each PA's mask was generated using the "pixel_qa" band provided with the Landsat data, and all but the pixels marked as "clear" were removed by the masks. The masks were combined across seasonal images, meaning areas masked in one image were masked in the others. This meant that sometimes large image fractions (approx. 15-20%) were removed, which is not ideal when creating a map. However, our objective was not to create contiguous images but to produce images with identical data to test which seasons and approaches best predicted WC.
Finally, we clipped the images to 20-km buffers around each PA boundary, with the exception of the smallest PA, MPA, which we clipped to a 50-km buffer. The buffers allowed us to capture areas with 100% crown cover-areas often only available outside PAs and required in a subset of our approaches. All images were then downloaded from GEE (code available in the online dataset: [68]).
For the 70% of reference points serving as training data, we extracted Landsat reflectance values to serve as predictors of WC. For each point, we supplemented reflectance values with indices known or likely to have relationships with vegetation in semi-arid environments [50]: the Soil Normalized Difference Index (SNDI) [50], Soil Adjusted Total Vegetation Index (SATVI) [73], the updated Modified Soil Adjusted Vegetation Index (MSAVI2) [72] and NDVI [76] (Table 2). We also computed the visual brightness (mean of red, green and blue bands; hereafter simply "brightness") of each reference point. We chose brightness because of the strong relationship it had with WC during reference point generation, as mentioned in Section 2.2.1 above. We then created an additional variable to contextualize the brightness within the landscape because, when we were creating the reference data, views of the larger landscape helped the user distinguish between grass and WC. In particular, a completely wooded pixel was often only distinguishable from a grass-dominated pixel when we viewed the larger landscape and saw all the brighter, grass-dominated pixels. To generate the contextualized variable, we computed the normalized difference between reference point brightness and PA mean reference point brightness. We refer to this as the brightness context (BC; Table S1).
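The brightness context reduces to a simple normalized difference; the brightness values below are hypothetical:

```python
def brightness_context(point_brightness, pa_mean_brightness):
    """Normalized difference between a point's brightness and the PA-wide
    mean reference-point brightness, placing each point in its landscape
    context (the BC variable described above)."""
    return (point_brightness - pa_mean_brightness) / (
        point_brightness + pa_mean_brightness)

pa_mean = 0.30                                       # hypothetical PA mean
bc_dark_woody = brightness_context(0.10, pa_mean)    # darker than the landscape
bc_bright_grass = brightness_context(0.50, pa_mean)  # brighter than the landscape
```

Negative BC values flag pixels darker than their landscape (likely woody), positive values pixels brighter than it (likely grass or soil).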
In addition to the variables generated from single images, we created multi-image composite variables to incorporate phenology, a practice not novel to vegetation mapping [37]. The specific image composites were: Dry, Tran, and Wet (DTW); Dry and Wet (DW); Tran and Dry (TD); and Wet and Tran (WT). The values for each composite's variables were calculated by taking the mean and normalized difference of each single-image variable across images (testing showed the normalized difference had stronger relationships with WC than the simple difference). This doubled the number of variables in the composites compared to the single images. For example, while the Dry and Wet images each generated a single NDVI variable, the DW composite had two NDVI variables: the (1) mean and (2) normalized difference of the Dry and Wet NDVIs. However, the DTW composite presented a unique challenge: calculating the difference of three values (e.g., Dry NDVI, Wet NDVI, and Tran NDVI). We therefore computed the DTW differences by subtracting the normalized differences of the WT composite from those of the DW composite (Table S1). This metric had a distinct phenological meaning. Assuming the date of the Tran image was between the Wet and Dry, a value of zero meant all changes in the variable occurred early (between the Wet and Tran images), likely representative of the early senescence of a grass-dominated pixel. As more of the total change occurred after the Tran image (i.e., the late season senescence of woody cover), the value approached 1 or −1, depending on the direction of the variable change moving from the wet to the dry season (e.g., for NDVI, which decreases across the seasons, the DTW difference metric approaches −1). For simplicity, we use "difference" to describe the difference metrics of all the composite images, even though all but DTW use the normalized difference.
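A minimal sketch of the DTW difference metric. The text does not state the band order inside each normalized difference, so the orientation below is an assumption, chosen so the metric is near zero when change happens early (wet-to-transition) and approaches −1 for a late, NDVI-decreasing change, as described:

```python
def nd(a, b):
    """Normalized difference, (a - b) / (a + b)."""
    return (a - b) / (a + b)

def dtw_difference(dry, tran, wet):
    """Three-image difference metric: the DW normalized difference minus the
    WT normalized difference (orientation assumed, see lead-in)."""
    return nd(dry, wet) - nd(tran, wet)

# Grass-like pixel: NDVI drops early (wet -> tran), so the metric is ~0.
dtw_grass = dtw_difference(dry=0.20, tran=0.25, wet=0.80)
# Woody pixel: NDVI drops late (tran -> dry), so the metric moves toward -1.
dtw_woody = dtw_difference(dry=0.20, tran=0.75, wet=0.80)
```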
Across the single images and composites, these steps generated 132 variables (Table S1). Hereafter, we refer to these data collectively as the "reflectance data."

Mapping Woody Cover
We mapped WC using seven techniques falling under three general approaches: linear regression, spectral unmixing, and regression trees (Figure 3). To reflect their nested structure, we refer to these as "sub-approaches" and "approaches," respectively. Because not all the approaches were available in a single software package, we worked across multiple programs. Further, some approaches excluded variables when those variables did not improve results and/or caused inputs to exceed the capacity of most modern computer systems (see Section 2.3.2). The programs and approaches are described below, with detailed procedures and code provided in the online dataset: [68].

Linear Regression
We used linear regressions [88] to identify any consistent relationships between the reflectance data and WC. All regressions were carried out in R [70]. We expected maps generated using linear regression to provide baseline accuracies against which to compare the other, more complex approaches.
We conducted regressions using three sub-approaches. In the first, all 132 reflectance data variables were regressed independently against WC (simple linear regression). In the second, we used the red, near infrared (NIR), and first short-wave infrared (SWIR1) bands together in multiple regression (for the composite images, we ran separate regressions using (1) the means and (2) the differences of the red, NIR, and SWIR1 bands). We refer to these as the "RNS" bands/regressions. We chose the RNS bands because they capture the spectral signature of vegetation, hence their widespread use in vegetation indices [50,72,73,76]. Third, we conducted forward stepwise regressions (STEP) to identify novel relationships that might warrant further investigation. To avoid overfitting the models and multicollinearity between variables, we only added variables that were highly significant within the new model (p-values below 0.05) and had low collinearity with the variables already selected (R² < 0.6).
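A simplified sketch of the STEP procedure on synthetic data. For brevity it approximates the p < 0.05 criterion with a |t| > 2 rule of thumb (roughly equivalent for large samples) and screens collinearity with pairwise R²; the study's actual implementation in R may differ in detail:

```python
import numpy as np

def forward_stepwise(X, y, t_crit=2.0, r2_max=0.6):
    """Forward stepwise OLS: a candidate enters only if its coefficient is
    significant (|t| > t_crit) and its pairwise R^2 with every variable
    already selected stays below r2_max (the collinearity screen)."""
    n, p = X.shape
    selected = []
    while True:
        best, best_t = None, t_crit
        for j in range(p):
            if j in selected:
                continue
            if any(np.corrcoef(X[:, j], X[:, k])[0, 1] ** 2 >= r2_max
                   for k in selected):
                continue  # too collinear with an already-selected variable
            A = np.column_stack([np.ones(n)] + [X[:, k] for k in selected + [j]])
            beta, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ beta
            sigma2 = resid @ resid / (n - A.shape[1])
            cov = sigma2 * np.linalg.inv(A.T @ A)
            t_stat = abs(beta[-1]) / np.sqrt(cov[-1, -1])
            if t_stat > best_t:       # keep the strongest qualifying candidate
                best, best_t = j, t_stat
        if best is None:
            return selected
        selected.append(best)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.3, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)                   # pure noise
y = 2.0 * x1 + rng.normal(scale=0.5, size=200)
chosen = forward_stepwise(np.column_stack([x1, x2, x3]), y)
```

Here the true predictor enters first and its collinear copy is screened out, mirroring the R² < 0.6 rule.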
We carried out the regressions in individual PAs to derive PA-specific relationships, then with all PAs combined ("ALL") to derive regional relationships. Altogether, the simple, RNS and STEP regressions yielded 1800 regression models. We applied the models to their respective PAs, then applied the ALL-derived, regional models to the individual PAs to quantify the local accuracy of the regional models.

Spectral Unmixing
Spectral unmixing, which some have shown outperforms linear regression in savanna WC mapping [89], is based on the premise that pixel reflectance is an area-weighted, linear combination of the landcovers within each pixel [90,91]. Spectral unmixing, then, uses the reflectance values of each individual cover type-values referred to as "endmembers"-to unmix, or back-calculate, the fractional coverage of each land cover type within a pixel. A unique feature of this approach, compared to other approaches used here, is that field-based measurements are not required-all values can come from the image being unmixed [92]. This means unmixing can be done on historical imagery where no ground truth or reference data is available, which made it particularly attractive for our goals. The challenge, however, is selecting endmembers that (1) have only the target land cover in the pixel and (2) best summarize a land cover's spectral variability.
Multiple unmixing sub-approaches have been developed, each with its own method of endmember selection. We selected three common sub-approaches and used each to unmix WC, grass, and soil fractions in each image. The first sub-approach, spectral mixture analysis (SMA) [45,90,91,93], relies entirely on the user to identify the best endmembers. We selected endmembers with the minimum, maximum, mean, and median brightness values (sum of all bands). Separate SMAs were run for each endmember in GEE using the "unmix" function [74].
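At its core, an unconstrained SMA reduces to a least-squares solve of the linear mixing equation; the two-band endmember spectra below are hypothetical:

```python
import numpy as np

def sma(pixel, endmembers):
    """Unconstrained linear spectral mixture analysis for one pixel.

    endmembers: (n_bands, n_classes) matrix of pure-class spectra.
    Solves pixel = endmembers @ fractions by least squares, mirroring the
    premise that pixel reflectance is an area-weighted mix of its covers.
    """
    fractions, _, _, _ = np.linalg.lstsq(endmembers, pixel, rcond=None)
    return fractions

# Toy two-band spectra: woody cover is dark, soil is bright (hypothetical).
E = np.array([[0.05, 0.30],    # band 1: woody, soil
              [0.20, 0.40]])   # band 2: woody, soil
mixed = 0.6 * E[:, 0] + 0.4 * E[:, 1]   # pixel that is 60% woody, 40% soil
f = sma(mixed, E)
```

Because the toy pixel is an exact linear mixture, the solve recovers the 0.6/0.4 fractions; real pixels carry noise and endmember mismatch, hence the more elaborate sub-approaches below.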
The second sub-approach, Multiple Endmember SMA (MESMA) [94], builds upon SMA by allowing multiple endmembers within a single class, instead of only one, affording the classifier more flexibility and greater mapping accuracies [95]. MESMA selects endmembers by running individual SMAs with every possible combination of endmembers, while also allowing the number of classes to vary [94]. It then retains, on a per-pixel basis, only those results from the best-fit model. In other words, a MESMA result is a mosaic of the best-fit SMAs. The MESMAs were run in ENVI (Harris Geospatial Solutions, Boulder, CO) using the ViperTools add-on [92].
Even though MESMA can take as input all potential endmembers, the evaluation of all the models (>200,000 in many cases) required significant processing time. To minimize this, Roberts et al. [92] recommend screening out spectrally similar endmembers by examining whether the endmembers form clusters, then selecting the endmembers that best represent each cluster. To do this, we used metrics built into ViperTools: Endmember Average RMSE (EAR) [96], Count-based Endmember Selection (COB) [97], and Minimum Average Spectral Angle (MASA) [98]. We used the metrics according to the methodology suggested by Roberts et al. [92] (p. 44), selecting endmembers with the lowest EAR within each COB-identified cluster, or the lowest MASA when COB failed to identify any clusters. We referred to this as the EMC (EAR-MASA-COB) selection method. The EMC method yielded 1-5 endmembers for each class.
In a separate method, after testing multiple selection criteria, we found selecting endmembers from each class using the minimum, maximum, and mean brightness values (sum of all bands) yielded the best results. We referred to this as the HML (high-middle-low) selection method.
Altogether, these endmember selection methods yielded ca. 20-40 models per MESMA, a significant reduction from the thousands generated by the full suite of endmembers. To ensure the reduction in models did not have a detrimental effect on the accuracy of the MESMA results, we ran preliminary tests using the full suite of endmembers versus our subsets and found no significant difference.
The third unmixing sub-approach, Monte Carlo Unmixing (MCU) [53,99], is similar to MESMA in that it allows multiple endmembers within a class. However, whereas MESMA selects results based on best-fit models, MCU iteratively draws random selections from the pool of endmembers, running an SMA for each draw. MCU then reports the mean and standard deviation of all the results as the final result. The stability of the final result therefore depends on the number of iterations: with too few, different MCU runs can yield very different results. Others achieved stability after 30 iterations [100]. Because we had several sites with multiple images per site, each of which might achieve stability at a different number of iterations, we set the number of iterations at 300, a number we assumed would achieve stability in all locations. We wrote and ran our own MCU function in GEE utilizing the "unmix" function. The code for the MCU function is available in the online dataset [68].
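A sketch of the MCU logic (the study's implementation ran in GEE via its "unmix" function; the endmember spectra here are hypothetical, and the unmixing step is an unconstrained least-squares SMA):

```python
import numpy as np

def monte_carlo_unmix(pixel, endmember_pools, iterations=300, seed=0):
    """Monte Carlo unmixing: draw one endmember per class at random, run an
    SMA, and report the mean and standard deviation of the fractions over
    all iterations (the study used 300 iterations)."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(iterations):
        # Build the endmember matrix from one random spectrum per class.
        E = np.column_stack([pool[rng.integers(len(pool))]
                             for pool in endmember_pools])
        f, _, _, _ = np.linalg.lstsq(E, pixel, rcond=None)
        results.append(f)
    results = np.array(results)
    return results.mean(axis=0), results.std(axis=0)

# Two hypothetical woody spectra and two soil spectra (two bands each).
woody_pool = np.array([[0.05, 0.20], [0.07, 0.22]])
soil_pool = np.array([[0.30, 0.40], [0.28, 0.38]])
pixel = 0.5 * woody_pool[0] + 0.5 * soil_pool[0]
mean_f, sd_f = monte_carlo_unmix(pixel, [woody_pool, soil_pool])
```

The standard deviation gives a per-pixel measure of how sensitive the fractions are to endmember choice.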
Unlike the regressions, which used the reflectance data described in Section 2.2.2 (i.e., point data) to train and test models, unmixing used entire Landsat images. This, in combination with the fact that MESMA was not automatable, led to marked increases in user input and computational demands. Consequently, we limited the number of images we unmixed, along with the number of bands/variables in each image. We created DW composites with the Dry and Wet image bands stacked into a single image, without any of the additional bands/variables described in Section 2.2.2. The same applied to the Dry, Wet, and Tran images: they contained only the original Landsat bands. However, unlike the regressions, the unmixing approaches utilized the full suite of bands available from Landsat 8: bands 1-7 (visible, near infrared, and short-wave infrared bands) and 10-11 (thermal bands). While earlier Landsat satellites do not have bands 1, 10, and 11, we found that including these bands substantially increased accuracies, while bands representing the vegetation indices did not. We did not test all the possible additional bands and combinations because of the amount of time this would have required.
Because linear mixture models can produce results outside the range of 0-1, all the unmixing methods allow the user to constrain the results to stay within the 0-1 range. Depending on the software, we found that constraining the results caused pixels to be removed (ViperTools) or forced to zero or one (GEE)-something we could do independently in post-processing. Therefore, we chose not to constrain the values. In addition, ViperTools had the option to constrain candidate models based on their RMSE and residuals, and GEE to constrain the unmixed values to sum to one. We set the RMSE constraint to 0.1 (effectively unconstrained) and left the other two unconstrained to avoid differences across unmixing methods.
As an alternative to constraining results in the individual software, we extracted mean mapped WC values from the WC (WC_WC) and grass (WC_grass) endmember pixel locations, which were meant to have 100% and 0% WC, respectively. We then normalized the unmixed values (WC_i) following Equation (1):

WC_i,normalized = (WC_i − WC_grass) / (WC_WC − WC_grass)    (1)

This was meant to preserve more of the relationship between the unmixed values and the reference data, a relationship that was lost using the software constraints described above if all unmixed values were negative or over 1, which was not uncommon. However, normalization still left some mapped values outside the 0-1 range, and it was at this point that we set negative values to zero and high values (>1) to one.
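A minimal sketch of this normalization, assuming Equation (1) is the linear rescaling implied by the endmember means (all values below are hypothetical):

```python
def normalize_wc(wc_i, wc_wc, wc_grass):
    """Rescale an unmixed value so pixels at woody-endmember locations map
    to 1 and grass-endmember locations map to 0, then clip to 0-1 as
    described in the text."""
    wc_norm = (wc_i - wc_grass) / (wc_wc - wc_grass)
    return min(max(wc_norm, 0.0), 1.0)

# Hypothetical unmixed means: woody endmembers averaged 1.2, grass -0.1.
norm_mid = normalize_wc(0.55, wc_wc=1.2, wc_grass=-0.1)
norm_low = normalize_wc(-0.3, wc_wc=1.2, wc_grass=-0.1)   # clipped to 0
norm_high = normalize_wc(1.4, wc_wc=1.2, wc_grass=-0.1)   # clipped to 1
```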
For each approach, in addition to unmixing WC, grass and soil, we also unmixed only WC and grass to test if accuracies improved when unmixing only these, the two most similar cover types. As in the regressions, we also ran a subset of approaches using only the RNS bands. However, unlike the regressions, we did not combine all the PAs and unmix them as a single PA, nor use such a model to map individual PAs.

Regression Trees
Regression trees, unlike the other methods used here, are capable of handling non-linear relationships between landcover types and their reflectance-a situation which is particularly common when mapping at regional to global scales [38]. Accordingly, during the development of VCF, which used a coupled, regression tree and linear regression approach, Hansen et al. [101] found their approach outperformed spectral mixture analysis. We expected our regression tree approach to do the same.
However, we did not attempt to replicate the VCF approach, which used stepwise regressions at the regression tree nodes to smooth the outputs, along with variables derived from MODIS bands not available in Landsat imagery. Instead, we used another regression tree approach, Random Forest (RF) [102]. RF has become a popular and effective procedure for mapping WC in savannas [103][104][105], along with regional-to-global scale land cover [106][107][108][109]. RF draws random samples from a dataset of predictor and response variables using bootstrap aggregation to initiate each regression tree. RF then divides the data recursively, creating branches with the smallest possible intra-subset variance at each split. However, instead of considering all predictor variables at each split, RF evaluates a random subset of them to determine the split. This process is repeated to generate a "forest" of different regression trees. Once all trees are grown, a predicted value is calculated as the mean prediction of all the regression trees.
We performed the RFs in R using the randomForest package [110] and the reflectance data described in Section 2.2.2. The RFs used the entire set of predictor variables available for each image. Each RF was set to grow 500 trees, and a set number of variables were randomly selected at each split: seven (out of 12 available variables) for the single-season images and seven (out of 24) for the composite images. We set the number of variables after an optimization process across all the PAs showed that including more variables did not significantly improve accuracy, a threshold we sought in order to avoid overfitting the models. Like the linear regressions, the RFs were trained and applied to their respective PAs (including ALL), with the ALL-derived models also applied to the individual PAs.
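The RF setup can be sketched as follows, using scikit-learn in place of the R randomForest package used in the study, with synthetic reflectance data (500 trees, seven variables tried per split, as in the single-season RFs):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
# Synthetic stand-in for the reflectance data: 12 predictors per image.
X = rng.uniform(size=(300, 12))
# Synthetic woody cover with a non-linear dependence on two "bands".
y = np.clip(0.8 * X[:, 0] ** 2 + 0.2 * X[:, 1], 0, 1)

# 500 trees; 7 of the 12 variables are considered at each split.
rf = RandomForestRegressor(n_estimators=500, max_features=7, random_state=0)
rf.fit(X[:250], y[:250])          # hypothetical 250-point training split
pred = rf.predict(X[250:])        # predictions for the held-out points
```

Because predictions are means of training targets, mapped values stay within the 0-1 WC range without any explicit constraint.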

Accuracy Assessment
We assessed the accuracy of the WC results from each mapping method using the variance explained by predictive models based on cross-validation (VEcv) (Table 3) [111]. VEcv and equivalent measures such as the G-statistic [112] and Nash-Sutcliffe efficiency [113] closely resemble the coefficient of determination (R²; Table 3). R² measures the proportion of the observed variance that is predictable from an independent variable (in this case the predicted values). However, in model validation, we are not interested in the ability of predicted values to predict observed values; we are interested in how well predicted values match the observed values. VEcv was developed for the latter case. It does this by directly comparing the observed values to the predicted values, i.e., against a 1:1 line, rather than comparing values to a fitted regression line, as R² does. In this way, VEcv is also similar to root mean square error (RMSE; Table 3). Beyond combining the utilities of R² and RMSE, VEcv values below zero correspond to instances where the mean of the observed values predicts the observed values better than the model being evaluated, something demarcated by neither R² nor RMSE. Thus, while negative VEcv values can appear meaningless, it is important to remember that a model producing a negative score could still have an R² of 1.0 if its errors can be perfectly predicted from the observed values; in such a case, RMSE would be required to show the model was flawed. Therefore, when a single metric is needed to rank performance across models, VEcv is superior, and we use it for that reason. However, we recognize that most studies have used R² and RMSE, and we include R² and RMSE values where we felt it would aid comparisons with other studies.

Table 3. The names and equations of accuracy measures used in this study.

Accuracy Measure: Equation*
Variance explained by cross-validation (VEcv): VEcv = [1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²] × 100
Legates and McCabe's efficiency (E1): E1 = [1 − Σ|yᵢ − ŷᵢ| / Σ|yᵢ − ȳ|] × 100
Coefficient of determination (R²): R² = r², where r is the Pearson correlation between the observed and predicted values
Root mean square error (RMSE): RMSE = √[Σ(yᵢ − ŷᵢ)² / n]
* yᵢ = observed value; ŷᵢ = predicted value; ȳ = mean of the observed values; n = number of observations.

In addition to VEcv, we used Legates and McCabe's efficiency (E1) (Table 3) [114] as a secondary measure of model accuracy. Whereas model errors are squared in the VEcv equation, E1 takes their absolute value, thereby quantifying the percentage of the sum of absolute differences explained by the model. Like VEcv, E1 reports accuracy as a percentage, with 100% corresponding to an exact match and values below 0% denoting situations where the mean of the observed values is a better predictor of the observed values than the model. However, because VEcv uses squared errors, it is the more sensitive measure. For this reason, we primarily used VEcv and only utilized E1 when we needed to differentiate models with similar VEcv scores, as suggested by Li [111].
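For concreteness, the accuracy measures defined above can be implemented in a few lines. This is a minimal sketch following those definitions (squared errors for VEcv, absolute errors for E1, both measured against the observed values rather than a fitted line), using a toy array rather than our reference data.

```python
import numpy as np

def vecv(obs, pred):
    """Variance explained by cross-validation (%): squared errors against a 1:1 line."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return (1 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)) * 100

def e1(obs, pred):
    """Legates and McCabe's efficiency (%): absolute rather than squared errors."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return (1 - np.sum(np.abs(obs - pred)) / np.sum(np.abs(obs - obs.mean()))) * 100

def rmse(obs, pred):
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

obs = np.array([10.0, 20.0, 30.0, 40.0])

# A model that simply predicts the mean of the observations scores
# VEcv = 0%; anything worse than the mean scores below zero.
baseline = np.full_like(obs, obs.mean())
```

A perfect prediction scores VEcv = E1 = 100% and RMSE = 0, while predicting the observed mean everywhere scores VEcv = E1 = 0%.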
We assessed the accuracy of the maps using the withheld testing data (30% of reference points). The same PA-respective testing points were used across all maps. We compared the accuracies across approaches and sub-approaches, then across seasons and variables, and finally identified the best approaches both overall and for each PA. We used ANOVAs and post-hoc Tukey honestly significant difference (HSD) tests to compare accuracies across approaches and sub-approaches. To compare seasons and variables, we used repeated measures ANOVAs and HSD tests. Throughout, we also evaluated the relationship between the Landsat-rescaled VCF tree cover dataset [115] and WC. We refer to both VCF [37] and the Landsat-rescaled VCF [115] as "VCF" and assume that for our purposes there are no meaningful differences between the two. This is not an accuracy assessment of VCF, which maps woody vegetation taller than 5 m, while our reference data capture all woody vegetation, regardless of height. However, we included VCF to demonstrate the amount of woody vegetation excluded by VCF in these systems and to assess whether VCF might be used to predict WC (i.e., whether an adjustment factor can convert VCF tree cover to WC).
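The ANOVA step of this comparison can be sketched as follows; the score arrays are synthetic stand-ins for per-map VEcv values, and the post-hoc Tukey HSD step (e.g., via statsmodels' pairwise_tukeyhsd) is omitted.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)

# Hypothetical VEcv scores (%) for maps produced by three approaches.
rf_scores = rng.normal(49.0, 30.8, 50)
reg_scores = rng.normal(10.0, 30.0, 50)
unmix_scores = rng.normal(-148.8, 123.0, 50)

# One-way ANOVA: do the approaches differ in mean accuracy?
# A post-hoc Tukey HSD test would then identify which pairs differ.
stat, p = f_oneway(rf_scores, reg_scores, unmix_scores)
```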

Post-Processing
Some post-processing of the accuracy assessment data was necessary. Overall, the models trained and applied to individual PAs and to all the PAs pooled together produced 4842 WC maps, with accuracies ranging from a high VEcv of 91.1% down to scores below −1000%. While negative VEcv scores are to be expected, some appeared so low they might bias our final results. Standard outlier analysis removed more points than we felt appropriate, so instead we ranked and plotted all 4842 map accuracies in search of an inflection point, i.e., the point where gains in accuracy from map to map become relatively consistent. We found this point around a VEcv of −500% (Figure S1). We removed maps below this threshold from further analysis, eliminating 157 maps (~3% of the total). The majority (147) of these were produced using spectral unmixing; the remainder came from regressions.
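The thresholding step can be sketched as follows, with a synthetic stand-in for the 4842 map accuracies; the real inflection point was identified visually from the ranked plot (Figure S1), not computed.

```python
import numpy as np

# Synthetic stand-in: most map scores fall in a plausible VEcv range,
# while a tail of failed maps scores far below the rest.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(10.0, 60.0, 4685),
                         rng.uniform(-5000.0, -500.0, 157)])

# Rank the accuracies (as in Figure S1) and drop every map below the
# visually identified inflection point of -500% VEcv.
ranked = np.sort(scores)[::-1]
kept = scores[scores >= -500.0]
removed = scores.size - kept.size
```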

Evaluation of Accuracy Measures
Because VEcv is not a commonly used measure, we first compared approach accuracies using VEcv, E1, RMSE, and R² (Figure 4a-d). While all the measures found that RF significantly outperformed the others, R² gave VCF scores significantly higher than the regression and unmixing approaches, while VEcv, E1, and RMSE did not (Figure 4d). As outlined in Section 2.4, this is likely because R² is not a measure of model accuracy while the other measures, including RMSE, are (Table 3) [111]. Because VEcv produced results similar to those of E1 and RMSE, and is the most sensitive measure of accuracy, we used it for the remaining analyses.

Best Approaches and Sub-Approaches
Across both the approaches and sub-approaches, RF significantly (p < 0.001) outperformed the others, rarely scoring below zero, with a VEcv mean and standard deviation of 49.0 ± 30.8% (Figure 4a,e).
Spectral unmixing underperformed the other approaches (VEcv mean and standard deviation of −148.8 ± 123.0%; Figure 4a). We suspect spectral unmixing was limited by the fundamental challenge of choosing endmembers. While it might be possible to accurately define endmembers across a relatively small area, as the study area expands, so does the spectral variability within each landcover type, making endmember definition more of a challenge. For this reason, successful applications of spectral unmixing often include a regional component in endmember selection [116]. We decided adding a regional component to our endmembers was infeasible: candidate endmembers were often scarce within whole PAs, let alone within subregions of a PA. Other successful applications of spectral unmixing have been aided by additional data, such as the increased number of bands and finer spectral resolution of hyperspectral imagery and/or airborne lidar, which help differentiate cover types [117]. Here we were limited to Landsat's relatively coarse spectral resolution (9 bands). Though including variables identical to those available to RF and the regressions might have improved accuracies, this was not possible due to computational limitations. Further, it would have required removing bands 1, 10, and 11, bands that improved results in our testing while the vegetation indices did not.
Figure 4. Approach and sub-approach accuracies. Across all accuracy measures, RF significantly outperformed the other approaches (a-d). However, we note that of the accuracy measures, R² was the only one to give Vegetation Continuous Fields (VCF) a median score significantly higher than the regressions (REG), demonstrating the discrepancies created when using R² as a measure of accuracy. Among the sub-approaches (e), RF remained the best performer, significantly outperforming the others. In all plots, letters signify approaches whose accuracies were not significantly different (p < 0.05). We tested for significance using ANOVAs and post-hoc Tukey honestly significant difference tests. In the boxplots, the bold centerline represents the median score, the box encompasses the 2nd and 3rd quartiles, and the top and bottom whiskers respectively represent the largest and smallest values within 1.5 times the interquartile range. Values outside that range are marked as outliers. In the sub-approach plot (e), to the left of the boxplots, points representing the accuracies of individual maps are scattered horizontally to limit overlap. The points illustrate both the number and distribution of map scores. To the right of the boxplots, smoothed histograms further illustrate the distribution of each sub-approach's scores. The gray line separates VCF from the other results as a reminder that this was not a true accuracy assessment of VCF (VCF does not incorporate all WC as our maps do).
Among the unmixing sub-approaches, MESMA significantly outperformed MCU and SMA. We attributed this to the fact that MESMA uses a more advanced approach to selecting the best endmembers for each pixel. Because of the overall poor performance of the spectral unmixing approach, we did not evaluate the effects of the different settings within the models (e.g., unmixing woody cover and grass versus unmixing woody cover, grass and soil).
Regressions significantly (p < 0.05) outperformed the spectral unmixing sub-approaches (Figure 4a). While no regression sub-approach significantly outperformed the others, stepwise regression, with its ability to add significant predictor variables, performed better on average (Figure 4e). The regressions did not significantly outperform VCF in an HSD test, likely due to the limited number of VCF data points (n = 11).

Evaluation of Seasonal Images
Unlike the approaches and sub-approaches, both within and across PAs, no season significantly outperformed all the others (Figure 5; Table 4). DTW did the best on average (VEcv mean and SD of 16.6 ± 23.8%) and significantly outperformed the Dry, Wet, and DW images, with the Wet image performing the worst (6.5 ± 24.9%). DTW did not significantly outperform TD, WT, or Tran. This confirmed our expectation that the image with the most information (DTW) would do the best, while the Wet season image, captured when WC and grass are most spectrally similar, would do the worst. Meanwhile, the Tran image was included in all the best performing images (DTW, TD, WT, and Tran). We also found that Tran was the best image when all PAs were combined (see "ALL" in Table 4). These findings suggest the Tran image is critical to mapping WC. Indeed, none of the images with the lowest scores contained Tran (Dry, Wet, and DW), and Tran significantly outperformed both Dry and Wet, making Tran the best option if a user were going to use only a single image in a new site.
Figure 5. Heat map of model types from each approach that performed best on average across the PAs and seasons. To produce this figure, we took up to three model types from each approach (RF and VCF only had two and one, respectively) with the highest average VEcv scores across all the data. We then separated model performances by season and PA. Because the unmixing approach did not use the DTW, TD, or WT composites, those spaces are not shown. Higher scores are darker, and cells with no green coloring represent accuracies below zero. The contrast between approaches is apparent, as is the lack of any clear improvement in accuracy across the images and the consistently high scores of MUR. For display purposes only, scores less than −100 were set to −100 and VCF results were repeated across seasons.
The best image for each PA varied and, like above, no image significantly outperformed all the others within a single PA (Table 4). However, unlike above, most of the best images in the PAs were something other than DTW (Table 4). Instead, Tran was most often selected as the best image and was included in the majority (9 of 12) of the PAs' best images (i.e., DTW includes Tran). This provided further evidence that the Tran image is critical to mapping savanna woody cover, likely capturing WC and grass at their most spectrally dissimilar point, when WC is still green and grasses have senesced.
We tested whether we could generalize the best image for a PA based on the PA's average woody cover, mean annual precipitation (MAP), or precipitation seasonality (WorldClim bioclimatic variables #12 and #15, respectively). For instance, we expected one image might do best in drier PAs and another in wetter PAs. However, MAP and precipitation seasonality had no relationship with accuracies. Meanwhile, WC had a positive relationship with accuracies, but the relationship existed across all the images (Figure S2). While this meant woodier PAs are likely to have the most accurate maps, it also meant that, like MAP and seasonality, the WC of a PA cannot be used to determine the best image. Although no image significantly outperformed the others across or within PAs, and none of the PA characteristics we tested could be used to predict a best image, this result was also encouraging: it suggests one could use any of the seasonal images or composites, whichever is available. The notable exception was the Wet image. The Wet image had the lowest average score across the PAs, never appeared as a PA's best image, and was never used alone in any of the best models (more on models in Section 3.6; Table 4).
Table 4. Best seasons, approaches, and models as measured by VEcv (%). The best locally derived model refers to those trained using reference data from the respective PA only. The best overall model is that with the absolute highest accuracy, regardless of the source of the training data. The best general model is the single model that did best on average across all the PAs. Model naming structure: Approach-[source of training data]-[image season]-[variables used in model]. We list the source of the training data only when it was ALL (the combination of all the PAs in this study); otherwise, the source is the PA itself. Similarly, only models that used individual variables (i.e., regressions; "REG") list those variables.
Standard deviations are given where accuracies are mean values; otherwise, values are the accuracy of the single map produced by the model. Because we did not apply the unmixing approach to all image combinations (only DW), we excluded unmixing results from the best season evaluation, resulting in higher-than-otherwise-expected mean accuracies. * Seasons, approaches, and models that significantly (p < 0.05) outperformed their counterparts. Significance testing was carried out using two different tests depending on the comparison: repeated measures ANOVA (best season) and simple ANOVA (all others). † RF-ALL-DW was the best general model.

Evaluation of Protected Area Accuracies
Among the individual PAs, MUR had significantly (p < 0.05) higher scores than all the other PAs (VEcv mean and SD of 41.7 ± 23.6%; Figure 5). RUA, KRU, and QUE produced the least accurate maps (average accuracies amongst the three were not significantly different). Of them, QUE had the lowest average score (−9.1 ± 33.7%). As with the images, we tested for relationships between PA accuracies and MAP, precipitation seasonality, and average woody cover. As before, we found a significant positive relationship with PA woody cover (Figure S3) but no relationship with MAP or seasonality. Again, this suggests woodier PAs are easier to map, while no similar relationship extends to MAP or seasonality.

Evaluation of Variables
Across PAs and seasons, we found that regression models using RNS, NDVI, and BC performed the best (average accuracies amongst the three were not significantly different), with VEcv means and standard deviations of 23.7 ± 20.5%, 22.9 ± 23.9%, and 21.5 ± 21.5%, respectively. These three significantly outperformed all the other variables except Band 4 (red), which NDVI and BC did not significantly outperform. The variables with the lowest scores were SNDI, SATVI, and Band 5 (−4.7 ± 12.1%, −0.6 ± 32.5%, and −0.4 ± 12.4%, respectively).
We also tested for the image-specific variable that best predicted WC across all PAs. We found this to be the mean NDVI of the TD composite (Figure S4; VEcv: 29.1 ± 25%). While it did not have an average score significantly higher than all the other variables, it was the only variable to significantly outperform some of the other variables (47 out of 143). Most of these (37 of 47) were composite image variables derived using the normalized difference, suggesting the mean was the better statistic to use for these images. We tested this and found the mean significantly (p < 0.001) outperformed the variables derived using the normalized difference, with respective scores of 13.0 ± 23.3% and −5.2 ± 22.7%. Therefore, while the variables using the normalized difference might add some explanatory power to WC models, it is likely minimal.
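To illustrate the two composite statistics being compared here, the sketch below computes per-pixel NDVI for hypothetical Tran and Dry images and then forms both the mean and the seasonal normalized difference for the TD composite. The reflectance tiles are invented values, and the band roles (Landsat 8 band 5 as NIR, band 4 as red) follow the standard NDVI definition.

```python
import numpy as np

def ndvi(nir, red):
    # Standard NDVI; for Landsat 8, band 5 is NIR and band 4 is red.
    return (nir - red) / (nir + red)

# Synthetic 2 x 2 surface-reflectance tiles standing in for the
# Tran and Dry season images.
tran_nir = np.array([[0.40, 0.30], [0.50, 0.20]])
tran_red = np.array([[0.10, 0.15], [0.10, 0.10]])
dry_nir = np.array([[0.30, 0.25], [0.35, 0.20]])
dry_red = np.array([[0.12, 0.14], [0.12, 0.11]])

ndvi_tran, ndvi_dry = ndvi(tran_nir, tran_red), ndvi(dry_nir, dry_red)

# Two candidate composite statistics for the TD image:
mean_ndvi_td = (ndvi_tran + ndvi_dry) / 2                      # per-pixel mean
nd_ndvi_td = (ndvi_tran - ndvi_dry) / (ndvi_tran + ndvi_dry)   # seasonal normalized difference
```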
Stepwise regression revealed no novel, consistently strong (VEcv > 50%) relationships between any combination of variables and WC across the PAs. However, NDVI and BC were the most commonly selected variables across the 84 total stepwise regressions: NDVI was selected in 29, and BC in 17. When we pooled all the PAs' data together as if they were a single PA, NDVI was not selected in any of the regressions. Instead, the stepwise regressions selected BC, alone or in combination with other variables, in five of the seven regressions (one regression per image). Accuracies ranged from 18.2% to 24.9%. These findings suggest that as the mapped area expands, relationships between NDVI and WC break down, while for BC, which simply quantifies how bright a pixel is relative to its surroundings, they remain relatively robust.

Best Models
The main objective of our work was to develop a single, accurate model for mapping WC. We first evaluated the best models trained and tested within the individual PAs ("Best Locally Derived Model" in Table 4). Across the 12 PAs (including ALL), most of the best models (11 of 12) were either regression- or RF-based, and several utilized NDVI. However, there was no clear pattern among the models. We then expanded the pool of models to include those trained using data from all the PAs combined ("Best Overall Model" in Table 4). Only one of the locally derived models was the best overall model for its PA: the SMR-derived MESMA that unmixed only WC and grass (TG) using the EMC endmember selection method and the Dry image (MESMA-EMC-TG-Dry). For all the other PAs, the best overall models were RF models trained using data from all the PAs combined. When we expanded the comparison to include all RF models (not just the best), those trained using data from all the PAs and tested in the individual PAs significantly (p < 0.001) outperformed those trained and tested in a single PA (VEcv: 67.7 ± 23.3% versus 30.5 ± 27.4%; R²: 0.76 ± 0.16 versus 0.42 ± 0.19; RMSE: 12.5 ± 2.6% WC versus 19.5 ± 5.2% WC). This implies researchers can use training data from many savanna sites, even those hundreds of kilometers away, to improve models for a single site. This might be particularly helpful for researchers with limited data for their site. It also suggests that PAs share WC-reflectance relationships that a single model could use to accurately map WC across space and time.
We also tested for the model that produced the highest average accuracy across PAs ("Best General Model" in Table 4). This model was RF-ALL-DW: an RF model using training data from all the parks combined (ALL) and the DW image. Compared to the best overall models for each PA, RF-ALL-DW did not cause a significant decrease in accuracy: average accuracy across the parks fell 4.9 percentage points, from a VEcv of 76.5 ± 15.4% to 71.6 ± 20.8% (p = 0.063). We note that when all the PAs were combined (i.e., the testing data from the different PAs were combined) and mapped as a single PA ("ALL", last row of Table 4), the RF-ALL-DW model did not produce the best results; instead, the best model was RF-ALL-DTW. The difference is attributable to the fact that we ranked each candidate model based on its average accuracy across all the PAs, not based on its ability to map all PAs as one (like R², the average VEcv of different datasets will not equal the VEcv when those data are pooled and evaluated as one).
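The parenthetical point, that averaged and pooled VEcv differ, can be shown with two toy "PAs" (invented numbers): each has identical prediction errors, but pooling changes the variance term in the denominator.

```python
import numpy as np

def vecv(obs, pred):
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return (1 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)) * 100

# Two hypothetical PAs with identical errors (2% WC at every point)
# but different spreads of observed woody cover.
obs_a, pred_a = np.array([10.0, 30.0]), np.array([12.0, 28.0])
obs_b, pred_b = np.array([10.0, 90.0]), np.array([12.0, 88.0])

avg = (vecv(obs_a, pred_a) + vecv(obs_b, pred_b)) / 2
pooled = vecv(np.concatenate([obs_a, obs_b]), np.concatenate([pred_a, pred_b]))
# avg (97.875) differs from pooled (~99.63): ranking models by average
# accuracy across PAs is not the same as ranking them by their accuracy
# on all PAs pooled as one.
```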
Last, we tested which model performed best on average across all the PAs when using only training data from the individual PAs. In other words, we wanted to know which model was most likely to perform well when mapping WC somewhere outside our list of sites using only training data from the new site. As above, this model proved to be an RF using the DW image (RF-DW).
Both within and across PAs, while the best models did significantly outperform many of the other models, they did not significantly outperform them all (Table 4). This, in combination with the similar finding for the images, suggests that when attempting to map WC through the Landsat archive, researchers can be flexible in their year-to-year image and model selections. For example, if a researcher wanted to produce a map for SEL, the best model to use would be RF-ALL-DTW (VEcv of 86.2%). But if the Wet image were not available, the next best model, RF-ALL-TD (VEcv of 85.4%), would still produce an accurate map whose accuracy is not significantly lower than that of RF-ALL-DTW. The same would apply if only the Tran image were available: RF-ALL-Tran would produce an accuracy of 84.5% (see the online dataset for a complete list of model accuracies). Besides providing year-to-year flexibility, these findings suggest areas with cloud cover in one image could be filled using WC values derived from the next best cloud-free image, something we did not attempt to do here.

Map Comparisons
We compared the accuracies of both the maps produced by the best overall models ("Best Overall Model" in Table 4; Figures S5-S13) and VCF (Figures 6 and 7). For VCF, in one respect this was a simple demonstration of the difference between tree cover and WC, but it was also a true measure of VCF error where VCF exceeded WC (WC includes tree cover, therefore VCF values that exceed WC are true overestimates).
Figure 6. Our best maps and VCF compared to the testing data. Using the testing data ("Reference") from each PA, we plotted the errors of our best maps ("Best Overall Model" in Table 4) and the differences of VCF tree cover (a and c, respectively). We then pooled the PAs and compared the reference data to our best maps ("Predicted") and VCF (b and d, respectively). Overall, our best maps have errors that are not significantly different from zero (a) and a strong relationship with the reference data (R² = 0.881, VEcv = 87.0%) (b). Meanwhile, differences between VCF and woody cover are significantly different from zero (c), and VCF has a weak relationship with woody cover (R² = 0.324, VEcv = 1.36%) (d). Both regression lines (red) are significant (p < 0.001). Individual PA maps, scatter plots, and accuracy metrics (VEcv and R²) are available in the Supplement, Figures S5-S13.
Overall, our maps had a significant (p < 0.001) and strong relationship with WC (combined VEcv = 87.0%, R² = 0.881; Figure 6b), while VCF had a significant (p < 0.001) but relatively weak relationship with WC (combined VEcv = 1.36%, R² = 0.324; Figure 6d). Our maps overpredicted WC in areas of sparse WC and underpredicted it in woodier areas (Figure 6b), while VCF was mostly below WC but exceeded WC (true errors) at 139 of 421 testing points, mostly in sparsely wooded areas (<~10% WC; Figure 6d). Wald tests showed these biases were significant (the regressions between WC and both our maps and VCF differed significantly from 1:1 lines). However, when we used one-sample t-tests to test for overall biases, i.e., whether errors were significantly different from zero, we found our maps combined did not have a significant bias (mean and SD of errors: 0.95 ± 10.9% WC; p = 0.074), while VCF did (mean and SD: −16.7 ± 25.1% WC; p = 2.5e-35). The size and significance of VCF's bias presented a potential adjustment factor. However, adding 16.7 percentage points (the mean error across PAs) to VCF was not enough to move the average accuracy across PAs above a VEcv of 0% (average accuracy increased from a VEcv of −57.1% to −12.8%). Further, when looking at PA-specific errors, our maps' errors generally centered around zero (Figure 6a), while VCF's did not (Figure 6c). Combined, we interpreted these results to mean VCF cannot be used to represent WC generally, nor can it be easily adjusted for that purpose at the scale of a single PA. Visual assessment of our maps and VCF underscored these findings. Our maps accurately represented WC, even along ecotones and gradients, while VCF, whether due to error or chance (tree cover and WC might have no relationship), did not represent any of these well (Figure 7).
Figure 7. Maps produced by the best overall models ("Best Overall Model" in Table 4) and VCF, with VHR satellite imagery for reference. All inset maps have diameters of 275 m.
Our maps for KRU and LIM (a), SMR (b) and CHO (c) capture gradients in WC (KRU/LIM inset), the complete absence of WC (SMR inset) and ecotones (CHO inset) better than VCF. CHO and LIM both show data missing at their eastern and western boundaries, respectively, from excluded Landsat paths that did not cover >10% of the PA. LIM, SMR, and CHO also all show some areas where cloud masks removed data.

Caveats and Concerns
Because our best maps use RF, they are susceptible to the same criticism as VCF when it comes to detecting multiple stable states: the regression tree approach might create artificial discontinuities in the WC data that could be misinterpreted as support for multiple stable states [39]. However, we argue that by taking the average prediction of hundreds of separate regression trees, RF should minimize the risk of any artificial discontinuities in the data. Further, the sheer inaccuracy of the other approaches would also be a barrier to such studies. To be cautious, however, we advise duplicate analyses: one using the RF-based maps and the other using the best regression-based maps. Any significant difference in the results should be investigated accordingly.
Of our study areas, SMR was an unusual challenge to map. While SMR did not have the lowest average scores across all the PAs, its best model had the lowest score and even VCF nearly outperformed our best general model ("Best General Model" in Table 4). However, VCF also had its largest proportion of true errors in SMR, with overpredictions at 76% of its testing points, making it appear the PA is a challenge to map in general. The fact that VCF overestimated the tree cover of SMR, when it appears to have mostly underestimated it in the other PAs, suggests to us that the unusually fertile plains of SMR might make WC and grass spectral differences less distinct, thereby confounding models. This was further supported by the fact that, in addition to the VCF overpredictions, the largest underpredictions (and largest errors overall) of our maps were also in SMR (Figure 7). However, like the other PAs, RF models trained using data from all the PAs did better in SMR than RF models only using SMR's training data, suggesting additional training data is likely to improve results.

Conclusions
The main objective of this study was to find a procedure we could use to accurately map WC in both current and historical imagery using a limited number of images per year. We found that RF clearly outperformed the other approaches, while no season or specific model significantly outperformed all the others. This suggests that having limited historical imagery available should not significantly affect map accuracies. However, special consideration should be given to the Tran image, which appeared in all the best image composites and significantly outperformed the Wet and Dry images. Further, NDVI, BC, and the RNS bands had the strongest relationships with WC, suggesting they should be included in future mapping efforts. Using training data from all the sites led to the models that performed best in all but one of the PAs, suggesting mapping efforts at one site can be aided by training data from other sites. Finally, while the best models varied by PA, the best general model across all the PAs did not significantly decrease accuracies.
Savanna ecosystems play a significant role in the global carbon cycle [9,10], are critically important wildlife habitat [6], and support a large fraction of the global human population [2,3]. As such, it is essential that we develop and refine tools for mapping and understanding the spatiotemporal patterns of vegetation change in these systems. The high accuracies of our general model across our sites suggest that with proper sampling of heterogeneity, a single model could accurately map WC across space and time, opening the door to critical discoveries in these crucial ecosystems.
Supplementary Materials: The following are available online at http://www.mdpi.com/2072-4292/12/5/813/s1. Figure S1: Ranked map accuracies. Figure S2: Relationship across seasons between best PA map accuracies and PA average woody cover, mean annual precipitation (MAP), and precipitation seasonality (coefficient of variation). Figure S3: Relationship between best PA map accuracies and PA average woody cover, mean annual precipitation, and precipitation seasonality. Figure S4: Accuracy of maps produced using each of 143 variables in regression. Figure S5-S13: Maps produced using the best method for each PA. Table S1: Variables input to random forest and regression models. Supplementary data and code related to this article can be found online [68].