Biophysical Characterization of Protected Areas Globally through Optimized Image Segmentation and Classification

Protected areas (PAs) need to be assessed systematically according to biodiversity values and threats in order to support decision-making processes. For this, PAs can be characterized according to their species, ecosystems and threats, but such information is often difficult to access and usually not comparable across regions. There are currently over 200,000 PAs in the world, and assessing these systematically according to their ecological values remains a huge challenge. However, linking remote sensing with ecological modelling can help to overcome some limitations of conservation studies, such as the sampling bias of biodiversity inventories. The aim of this paper is to introduce eHabitat+, a habitat modelling service supporting the European Commission’s Digital Observatory for Protected Areas, and specifically to discuss a component that systematically stratifies PAs into different habitat functional types based on remote sensing data. eHabitat+ uses an optimized procedure of automatic image segmentation based on several environmental variables to identify the main biophysical gradients in each PA. This allows a systematic production of key indicators on PAs that can be compared globally. Results from a few case studies are illustrated to show the benefits and limitations of this open-source tool.


Introduction
Protected areas (PAs) play a key role in conservation programs and in the sustainable use of natural resources [1].They can also be considered as reservoirs of ecosystem services for surrounding urban areas [2,3].Science-based conservation requires access to a wealth of information on species, ecosystems and threats at the local, national, regional and global scale in order to identify priorities [4].Evaluating and prioritizing PAs remains a complex process [5,6].Furthermore, assessing PAs for biodiversity conservation at the national, regional and international scale requires methods and tools that can assess their natural values in terms of species and habitats.Typical requirements for such analyses are species inventories, distribution and status data from the International Union for Conservation of Nature (IUCN) Red List of Threatened Species [7], as well as information on biophysical variables [8].By developing metrics and indicators based on these variables, PAs can be globally evaluated and compared in order to set conservation priorities [9,10].
Large-scale assessments, as opposed to case studies on individual parks, require objective, global or continent-wide datasets and methodologies [11,12].For this reason, remote sensing (RS) and ecological modelling have become important tools for conservation and biodiversity studies [13][14][15][16][17]. Habitat types can be systematically characterized and mapped by means of RS using available global climatic and biotic datasets [18][19][20].Well-mapped habitat types can in turn provide useful information for the conservation and management of species, ecosystems and ecosystem services, such as water and soils, and recreational and educational uses [21][22][23][24][25].
The European Union is supporting the Convention on Biological Diversity (CBD) as the key international instrument for achieving global biodiversity targets [26].To this end, due to the fact that PAs are the cornerstone of biodiversity conservation, the Digital Observatory for Protected Areas (DOPA) was developed by the Joint Research Centre of the European Commission to assess the state of and pressures on PAs on a global scale and, thus, support policy and decision making [27,28].
One of the main goals of the DOPA is to help in prioritizing PAs according to their natural values and the pressures to which they are exposed and, consequently, to assess, monitor and possibly forecast biodiversity within and around PAs at the global scale.In this regard, eHabitat [29] was initially developed as part of DOPA to identify similar areas outside PAs.However, the systematic ecological characterization and assessment of more than 200,000 PAs [30] is challenging because of the vast amount of information to be taken into consideration.
According to the IUCN Habitats Classification Scheme [31] and the European Habitats Directive [32], natural habitat types can be mapped at a global scale by means of remote sensing in a harmonized way, avoiding the bias associated with sampling efforts [33][34][35][36][37]. Species data at the global scale might be biased due to differential sampling efforts related to location or taxa, whereas remote sensing data about habitats can be obtained more systematically and globally [38][39][40].
Habitat-based assessments can provide important complementary insights to the more commonly-used species-based approaches to conservation [41][42][43][44].PAs are composed of heterogeneous landscapes containing different habitat types that need to be identified in order to assess structural and functional ecosystem properties [45].Stratifying the natural landscape into more homogeneous regions allows reducing the complexity of the landscape into something that is more manageable and understandable [46].Diverse methodologies in relation to ecological land classification can be found in the literature applied at different scales [47,48].Functional attributes, such as vegetation phenology based on time series of NDVI, complement and improve the classification based only on structural features by including ecosystem functioning properties [49][50][51][52].
The primary objective of this study was to overcome one of the main limitations of the previous version of eHabitat [53], which overgeneralized the diverse habitat characteristics of PAs in heterogeneous landscapes and, thus, also led to an overestimation of the probabilities of finding similar areas elsewhere.The previous statistical approach calculated an 'average habitat' over the whole surface of the analysed PA, with the variables characterizing this 'average habitat' represented by a range of values that was too broad, leading consequently to a high variance in the final results.The method proposed here extends the previous version of eHabitat, now called eHabitat+, aiming to characterize PAs according to their habitat functional type composition.This methodology is based on an iterative image segmentation process that automatically decomposes each PA into an optimal set of independent areas, representing a biophysical gradient of structural and functional habitat type properties.

Study Area
The study areas of the DOPA, as of April 2016, correspond to a subset of all designated terrestrial PAs with defined boundaries and a minimum size of 100 km 2 coming from the World Database on Protected Areas [54].While we recognize the essential contribution of smaller areas in conserving biodiversity, we have restricted the processing to a limited number of areas (12,479 terrestrial PAs), which together cover 95% of all of the protected land area on Earth [28].In this study, we used a few case studies to review the segmentation results against independent datasets using a set of heterogeneous PAs of global significance, as attested by their protection status (most of them being a World Heritage Site-WHS), covering a range of biomes and habitats on three continents and with relatively good documentation, including existing independent habitat/ecosystem maps at the national or local level.Table 1 and Figure 1 show the main characteristics and location of the case studies.

Biophysical Input Variables
A set of nine relevant biophysical variables, all globally mapped at 1-km 2 resolution, were carefully selected according to previous similar studies and global coverage availability [29,49,[55][56][57][58][59][60].The nine selected biophysical variables describe: (a) the topographical gradients: slope [61]; (b) the climatic context: mean annual bio-temperature (temperature excluding below zero values) and mean annual precipitation [62]; (c) ecosystem structure: percentage of woody and grassland vegetation cover [63]; and (d) carbon-water cycling (ecosystem functioning): mean annual evapotranspiration/ precipitation ratio (square root transformed in order to account for small differences in the most arid areas), mean annual Normalized Difference Vegetation Index (NDVI Max. and Min.from the period 2001 to 2010), representing maximum and minimum vegetation activity [64][65][66], and the Normalized Difference Water Index (NDWI) as an indicator of vegetation and soil water content [67].More detailed information about the input variables can be found in the Supplementary Online Material as Annex I.

Retrieval of Habitat Functional Types
Several studies have previously used image segmentation and classification in order to perform delimitation of habitats [68][69][70][71][72].In this study, we propose a two-step methodology based on image segmentation followed by a classification of the resulting segments through hierarchical clustering to delineate habitat functional types (HFTs).Image segmentation is the process of grouping similar pixels into unique segments [73][74][75][76][77].In contrast to image classification, pixels that are similar to each other and are assigned to the same segment need to be contiguous.The aim of this step was to discriminate a set of contiguous distinct areas inside each PA based on the selected biophysical variables in order to obtain habitat functional types (HFTs).A 'seed region growing' automatic segmentation algorithm implemented in GRASS GIS 7 was used for this purpose [78].As input for this analysis, we used the resulting first three axes of a principal components analysis (PCA) of the nine input variables.Since the processing time of the automatic segmentation procedure increases exponentially with the number of input variables, this step allowed us to reduce computational load without losing much of the information contained in the original variables.
The segmentation algorithm used requires two main parameters, i.e., the minimum segment size and similarity threshold.The minimum segment size was set as the square root of the area of the PA bounding box.This value was chosen in order to reduce the salt and pepper effect, to avoid obtaining segment sizes that were too small to represent any manageable area and also to select a proper scale for processing each PA according to its size, while at the same time reducing the overall processing time.In fact, each PA was segmented, including the bounding box area where it was contained, and the result then cropped to the park boundaries.When necessary, an automatic segmentation step was applied a posteriori for segments inside the PA that did not meet the minimum size criterion after the cropping.The similarity threshold, ranging from zero to one, determines which pixels will be merged based on the similarity of all input variables.A threshold value of zero would allow only identical pixels to be merged, while a value of one would allow all pixels to be merged.
Given that our aim was to obtain the optimal set of homogeneous areas different from each other and due to the inherent subjectivity of the similarity threshold parameter and its great influence on the segmentation results, there was a need to select its optimal value for each PA by comparing the impact of different similarity threshold values on the resulting segmentation.To this end, a set of nine sequential segmentation runs for each PA was automatically performed using an increasing sequence of similarity thresholds, ranging from 0.1 to 0.9.The segments resulting from each run were used as seeds for the next run, thus achieving a hierarchical segmentation where future segments only grow from the previous ones.This procedure runs in parallel across all of the processed PAs according to the number of processors available.
In order to assess the results of all segmentation runs and to select an optimal run for each PA, the autocorrelation and internal homogeneity of the resulting areas corresponding to each similarity threshold were measured based on all input variables and then compared among all segmentation runs.Moran's I was selected to measure autocorrelation, and variance was chosen to measure internal homogeneity.Moran's I is a feature belonging to each PA, while variance is a feature belonging to each object.First, a curve was produced with the mean autocorrelation based on all input variables and one, the mean internal variance of all of the segments based on all input variables on the Y axis, both scaled from zero to one, versus the corresponding similarity thresholds on the X axis (see Figure 9 in Section 3.6).The optimal similarity threshold was then automatically selected when the two curves first crossed, and in the case that curves crossed more than once, the lowest similarity threshold was selected.The objective was to achieve a trade-off between having homogeneous and distinct areas, i.e., low internal variance and autocorrelation values at the same time, as in previous studies [79][80][81].
The resulting segments are then automatically reclassified into a fewer number of HFTs at a similarity level of 25% to obtain classes that are substantially different from each other [82], irrespective of their adjacency.For this, we used hierarchical clustering and Euclidean distance based on the mean and variance of the nine original biophysical variables, scaled prior to analysis.The maximum number of classes has an upper limit based on the logarithm to base 10 of the PA size (measured by the number of pixels of 1 km 2 ), in order to avoid having substantially more classes than input variables while accounting for the PA size.The result of the logarithm was then set to the smallest integer not less than the corresponding value.Not setting an upper limit would lead to a less comparable number of HFTs across PAs, some of them being difficult to distinguish from each other, especially taking into account the relatively coarse resolution at which analysis was performed at the global scale [83] and the fact that the biggest PA is larger than 2 million km 2 .Figure 2 summarizes the workflow of the whole methodology.

eHabitat+ Source Code
The source code of eHabitat+ uses GRASS GIS 7 [84], Python 2.7 [85] and R [86] and is available online [87].eHabitat+ is contained in four scripts that are run consecutively and that make use of several Python and R libraries for multivariate and spatial analysis.A Docker virtual machine is also available on the same code repository that contains the required environment and libraries to run it.

Evaluation and Assessment
The aim of this methodology was to systematically define and compare the HFT composition and gradients present in PAs globally and in an automated way.Strict validation with other sources of information is not always possible since results vary depending on the environmental variables and stratification method used.Rather than focusing on the spatial accuracy of the HFTs as if they were categorical units, here we focus on the overall biophysical gradients identified in terms of the degree of habitat dissimilarities found in PAs, which can be used as a proxy for habitats and species diversity [88,89].Rather than comparing different maps of habitats or ecosystem types with the obtained HFTs, we think that finding correlations between species and HFT composition was a more meaningful way of testing our results, provided that: (1) species data available are spatially explicit; (2) were systematically sampled; and (3) the environmental variables used for the analysis are relevant for the selected species.For this study, we compared our results with available sources of information regarding species, habitat types or ecosystems in PAs.To this end, a few well-known PAs were selected as case studies, for which we obtained local species or habitat data based on field sampling or expert knowledge, most of which can be found in the Supplementary Online Material as Annex II.For Sierra Nevada National Park, spatially-explicit species data were available, and a Mantel test [90] was used to check for the correlation between species and habitat composition by means of multivariate analysis.A multiple-resolution-goodness-of-fit (MRGF) algorithm implemented by [91] was used to compare the spatial agreement between the HFT and the ecosystem maps.Fits for moving windows of increasing cell sizes were computed according to [92].The fits for each map were summarized into a weighted average with equal weights for all window sizes [93].

Results
The results of our biophysical habitat stratification were compared to available maps of vegetation, ecosystem or habitat types obtained ad hoc using different variables or methods with higher resolution at five different PAs: Sierra Nevada National Park (Spain), the Virunga National Park WHS (Democratic Republic of the Congo), the Kakadu National Park WHS (Australia), the Okavango Delta WHS (Botswana) and the Canaima National Park WHS (Venezuela).Overall, there was good agreement between the environmental gradients reflected in our habitat functional types and those of major vegetation and landscape types, but with varying degrees of congruence.

Sierra Nevada National Park
Our stratification for Sierra Nevada National Park, comprising 1724 km 2 , resulted in four major habitat types (Figure 3).Based on previous studies and maps (Figure 4) [94], the most relevant feature of HFT 1 is the presence of dolomites and limestone on low slope hills.This area is similar to a badland covered by pine plantations, which could explain the low NDVI obtained.Most of the area is covered by scattered trees and bushes and bare soil.HFT 2 fitted well with the distribution of natural forests (mainly Quercus ilex and Quercus pyrenaica), as well as dense pine plantations.This explains the high values for NDVI and woody cover.HFT 3 fitted perfectly with the alpine area usually covered by snow, with high soil wetness and rainfall and no woody vegetation.HFT 4 corresponded to a semiarid zone with very scattered trees, mainly covered by shrublands and Mediterranean grasslands.The comparison between the HFT and the ecosystem maps using the MRGF algorithm showed an accuracy value of 0.77.For this PA, spatial data on plant species presence and absence were available from the Sierra Nevada Global Change Observatory [95] comprising 253 species that were systematically sampled at 2487 locations.These data were used for correlating the species versus habitat dissimilarities among the four HFT identified in Sierra Nevada.The Jaccard and Euclidean indices were used for building the species and environmental distance matrices, respectively, and these matrices were then used as inputs to a Mantel test [90].Environmental data were scaled prior to analysis.The Mantel test showed a significant correlation between the plant species and the habitat composition (r = 0.967; P < 0.05).The list of species sampled present at each HFT in Sierra Nevada can be found as Annex III.

Virunga National Park WHS
Our stratification for Virunga National Park WHS, comprising 7805 km 2 , resulted in four major habitat types (Figure 5).When the four major habitat types of Virunga National Park are compared to the vegetation types mapped by [96], HFT 1 perfectly matches the equatorial rainforest zone in the north, presenting the highest NDVI and woody vegetation cover values.HFT 2 includes the mountain ecosystems of the Rwenzori massif in the north and those of the Virunga volcanoes in the south, characterized by high slope, precipitation and water content values and a mixed composition of woody and grassland cover.Mountain rainforest, bamboo, Hagenia and Erica forests and treeless alpine vegetation dominate these Afromontane ecosystems.HFT 4 has the highest grassland cover and aridity values, covering the majority of the park's grassland and woodland savannahs in the lowlands.In contrast, HFT 3 is characterized by a mixed composition of woody and grassland cover, covering mainly savannah areas and dry forests at low and medium elevation.

Kakadu National Park WHS
Our stratification for Kakadu National Park WHS, comprising 19,139 km 2 , resulted in five major habitat types (Figure 6).According to [97], the whole park has a tropical monsoon climate.HFT 4 corresponds to a hilly terrain with shallow, stony soils.The vegetation is a mix of sandstone heathlands and rainforests, hummock grasslands and woodlands.Forest cover is around 20%.HFT 1 and 2 both represent mainly flat, low-lying areas, drained by several large rivers.Vegetation communities include eucalypt forest and woodlands with tussock and hummock grass understory.Forest cover is around 50%.HFT 3 and 5 both encompass hilly to rugged ridges with undulating plains.Vegetation communities include eucalypt woodlands with patches of monsoon forests.Forest cover is around 10%.

Okavango Delta WHS
Our stratification for the Okavango Delta WHS, comprising 20,443 km 2 , resulted in five major habitat types (Figure 7).Data from [98] show five widespread vegetation types: permanent and seasonal swamps, occasional floodplain and Mopane and Acacia woodlands.HFTs 1 and 3 in the Delta's Panhandle area are characterized by high water content (NDWI), medium to high NDVI values and dominant grassland cover.These HFTs correspond to the Delta's permanent swamp area dominated by four grassy plant species.However, these HFTs differ in elevation (slope), temperature and precipitation, with cooler and wetter conditions in the higher HFT 1. HFT 5 has overall similar biophysical conditions, but with a predominance of woody vegetation.HFT 2 has lower woody vegetation cover and relatively high water content (NDWI), covering mainly Mopane woodlands and floodplain areas on clayey, shallow alluvial sediments [98].Finally, HFT 4 has the highest temperatures, lowest water content and medium woody vegetation cover, including some of the drier Acacia woodlands on deep alluvial sands, but also Mopane woodlands, floodplain and (seasonally) swampy areas.The description of the study variables can be found in Section 2.2.

Canaima National Park WHS
Our stratification for Canaima National Park WHS, comprising 28,828 km 2 , resulted in five major habitat types (Figure 8) whose boundaries correspond well to those of the three main landscape types of the park defined by [99].With the highest NDVI and woody vegetation cover values, HFT 1 includes most of the forest dominated areas.In contrast, HFT 3 has the highest grassland cover values and includes largely savannah-dominated areas.Showing the highest slope and elevation values, HFT 2 encompasses the large plateaus on the table mountains (Tepuis), dominated by so-called Saxícola vegetation.HFT 5 corresponds to a large, isolated savannah area within the forest matrix in the west, characterized by high grassland cover and warmer and wetter conditions than the other savannah areas.Finally, HFT 4 highlights a small 'island' of interspersed forest and Saxícola vegetation in the southeastern savannahs.

Optimization and Input Variables
An example of the autocorrelation and variance curve used for the optimization of the similarity threshold parameter is shown in Figure 9 for the Canaima National Park WHS.Three intersections of the curves can be seen, with the first one on the left automatically chosen by eHabitat+, corresponding to an optimal similarity threshold value of 0.7.The whole set of PAs processed to date can be visualised and explored online in the development version of DOPA [53].As can be seen in Table 2, the total variance explained by the first three PCA axes of the input variables used for the segmentation is very high (above 70%), and each study area shows a different relative contribution from each input variable, depending on which are the main biophysical gradients that are present.

Discussion
Our results show that the enhanced segmentation method allows the main ecological features and habitats of each PA to be identified automatically, supporting the systematic assessment of the diversity and gradients of habitat functional types in PAs as a proxy of biodiversity [100] and complementing methods based on species inventories [101,102].The proposed method could also be used to inform ecosystem mapping, PA zoning and management and the planning of wildlife surveys.
Regarding the combination of variables used for the global assessment, NDVI supplements the information on vegetation cover and structure by capturing the productivity and health of that vegetation, which helps to distinguish functional habitat types that might have similar proportions of woody of herbaceous vegetation, but a very different capacity to support biodiversity.The maximum and minimum values were used because this captures information on the seasonality of the habitat and its productivity.NDWI is known to be a good proxy for plant water stress and can therefore supplement the information on climate and vegetation coverage by distinguishing differing responses to available water.The maximum is used because this is likely to have a strong effect on the capacity of a vegetation community to support permanent assemblages of species.
When coupled with a time series of remote sensing input data, this methodology could also effectively facilitate the monitoring of environmental changes in PAs, as it allows for the automation of large-scale analyses of PAs using continent-wide consistent datasets and key indicators.Monitoring PAs to evaluate the success of conservation management practices and to anticipate future conservation problems is a key priority for habitat and species conservation.
This methodology can also provide valuable information for studies at the regional, national and local level, depending on the scale of interest [103].In these contexts, the variables and stratification parameters used could be different from the ones used here for the global scale.Especially for national or local case studies, higher spatial resolution or more relevant variables are needed and can be more easily found.As an example, freshwater ecosystems, where many important physical processes and gradients are below ground, are probably not so well characterized based on the input data used for the global analysis, which are more suitable for terrestrial ecosystems.
Connectivity analyses of PAs [104][105][106] could also be improved with the HFT classification component of eHabitat+, allowing such analyses to focus on connectivity among similar HFTs, which potentially host similar species assemblages.Systematic monitoring approaches are needed to go beyond park boundaries in order to identify and maintain habitat corridors and assess human pressures and threats outside PAs.
The HFT delineation method described here can additionally provide base maps for integrated landscape analyses where a strong link between vegetation structure and functioning is needed.In fact, some approaches similar to HFT delineation are very useful as base maps to run regional climate modelling [107,108].HFTs can be also useful to quantify ecosystem services using modelling environments, such as ARIES [109], due to its capacity to aggregate different sources of information in a single map containing both structural and functional features.
Regarding the previous eHabitat methodology [27], the identification of distinct HFTs prior to the computation of the similarity maps helps with reducing the natural variability of the study areas, thus leading to a more accurate assessment of where similar biophysical conditions are to be found outside the training area.Future work will also include testing whether the current network of PAs in a specific ecoregion is representative of the ecoregion's habitat diversity [48].In order to test this, we will compute HFTs for whole ecoregions and countries, in a manner similar to previous studies [60,110] and compare those results with the HFTs covered by all PAs in that ecoregion or country.Mapping HFTs in the adjacent areas of a PA can also help identifying additional habitat areas outside of the current park boundaries, which should be also considered in the management of the PA.

Conclusions
The method and tool proposed here, referred to as eHabitat+, systematically stratifies protected areas globally into different habitat functional types.A set of remote sensing data was used as biophysical variables to identify for each protected area the diversity and the main gradients of habitat functional types following an automatic and optimized image segmentation procedure.The resulting segments can be further used to assess the ecological heterogeneity of each area, as well as their connectivity with their environment.Still, further work will be required to optimize the selection of the input variables and better understand their links with the species composition to effectively become baseline information, which can be used for policy support.
eHabitat+ is a core element of the DOPA, which is designed to be used as a support tool for monitoring national, regional and global PA systems with reference to the CBD Aichi Biodiversity Targets [111][112][113][114].Such assessments require methods using consistent datasets as opposed to local case studies.This allows results to be used to assess the relative importance of a protected area in comparison to others in the same country or ecoregion.On the operational side, considering the very large number of protected areas, more than 200,000 as of June 2016, such an assessment needs to be automated as much as possible using objective global data.In all of these efforts, Earth observations are the primary and direct source of information supporting systematic assessments with minimum uncertainties.
On the technical side, eHabitat+ is a free and open source software (FOSS) allowing large amounts of remote sensing data to be statistically analysed and designed to be interoperable with other GIS.As such, it can be implemented and used via web interfaces, as well as under local desktop environments [115,116].In this regard, Docker containers, such as the one available for running eHabitat+, encourage reproducible science by packaging applications with the libraries on which they depend, making the practical implementation easier.The software developed in this study can be integrated as part of an OGC Web Processing Service, as shown for a previous version [29,56], so that users can select the most suitable input variables or parameters for their study area based on expert knowledge or previous studies.

Figure 1 .
Figure 1.Map of the protected areas selected for this study (European Petroleum Survey Group-EPSG:4326).

Figure 2 .
Figure 2. Workflow of the whole methodology for each protected area and the main piece of software used at each step.

Figure 3 .
Figure 3. Map of habitat functional types (HFT) identified in the Sierra Nevada National Park and normalized mean values of the biophysical variables used (European Petroleum Survey Group-EPSG:4326).The description of the study variables can be found in Section 2.2.

Figure 5 .Figure 6 .
Figure 5. Map of HFTs identified in the Virunga National Park and normalized mean values of the biophysical variables used (European Petroleum Survey Group-EPSG:4326).The description of the study variables can be found in Section 2.2.

Figure 7 .
Figure 7. Map of HFTs identified in the Okavango Delta World Heritage Site and normalized mean values of the biophysical variables used (European Petroleum Survey Group-EPSG:4326).The description of the study variables can be found in Section 2.2.

Figure 8 .
Figure 8. Map of HFTs identified in the Canaima National Park and normalized mean values of the biophysical variables used (European Petroleum Survey Group-EPSG:4326).The description of the study variables can be found in Section 2.2.

Figure 9 .
Figure 9. Example of autocorrelation and variance curves used for the optimization of the similarity threshold parameter for Canaima National Park.Variance refers to the mean internal variance of all of the segments based on all input variables, and Moran's I refers to the mean autocorrelation based on all input variables.See Section 2.3 for more details.

Table 1 .
Main characteristics of the protected areas (PA) selected for this study.WHS stands for World Heritage Site.

Table 2 .
Eigenvectors of each input variable corresponding to the first 3 PCA axes and the amount of variance explained by each axis at each study area.The description of the study variables can be found in Section 2.2.