Earth Observation and Biodiversity Big Data for Forest Habitat Types Classiﬁcation and Mapping

: In the light of the “Biological Diversity” concept, habitats are cardinal pieces for biodiversity quantitative estimation at a local and global scale. In Europe EUNIS (European Nature Information System) is a system tool for habitat identiﬁcation and assessment. Earth Observation (EO) data, which are acquired by satellite sensors, offer new opportunities for environmental sciences and they are revolutionizing the methodologies applied. These are providing unprecedented insights for habitat monitoring and for evaluating the Sustainable Development Goals (SDGs) indicators. This paper shows the results of a novel approach for a spatially explicit habitat mapping in Italy at a national scale, using a supervised machine learning model (SMLM), through the combination of vegetation plot database (as response variable), and both spectral and environmental predictors. The procedure integrates forest habitat data in Italy from the European Vegetation Archive (EVA), with Sentinel-2 imagery processing (vegetation indices time series, spectral indices, and single bands spectral signals) and environmental data variables (i.e., climatic and topographic), to parameterize a Random Forests (RF) classiﬁer. The obtained results classify 24 forest habitats according to the EUNIS III level: 12 broadleaved deciduous (T1), 4 broadleaved evergreen (T2) and eight needleleaved forest habitats (T3), and achieved an overall accuracy of 87% at the EUNIS II level classes (T1, T2, T3), and an overall accuracy of 76.14% at the EUNIS III level. The highest overall accuracy value was obtained for the broadleaved evergreen forest equal to 91%, followed by 76% and 68% for needleleaved and broadleaved deciduous habitat forests, respectively. The results of the proposed methodology open the way to increase the EUNIS habitat categories to be mapped together with their geographical extent, and to test different semi-supervised machine learning algorithms and ensemble modelling methods.


Introduction
Global-scale environmental issues, from climate change to biosphere integrity [1], are creating an intense social pressure and a growing need for information with appropriate reliability and suitable spatial scale (from the local to global analysis and vice versa) that must be provided by the scientific community [2][3][4][5].
Therefore, it is strictly urgent to ensure the integrity of the Bio, Hydro, and Geosphere by following the advance of the high technologies. The High-Tech frontier looks further

Materials and Methods
In ecology, HDMs link data on the distribution of habitat to abiotic or biotic conditions [57] that are also derived from remote sensing data. Typically, the starting point of HDM is location data on the occurrence of a habitat as response variable, and mapped environmental or satellite data as predictors [58]. Statistical methods, such as Random Forests, are used to estimate the presence of habitat as a function of the predictor variables.

Research Step Flowchart
Describing the classification procedure of forest habitats used here was not effortless, so the flowchart that is presented in the Figure 1 summarizes and visualizes the single stages processed in the text. Figure 1. Flowchart of the approach for forest habitat mapping using a supervised machine learning model. For more details about EUNIS codes (i.e., II and/or III level) see Table 3.

Study Area
The study area encompasses Italy (Figure 2), which covers about 300.000 km 2 , and it is characterized by a noteworthy geo-morphological and land use heterogeneity (there are two mountain ridges, the Alps in the northern, and the Apennines in the southern), as a result of complex tectonics processes starting from the Upper Cretaceous and centuriesold land management, respectively [59]. The area also shows a marked climatic variability, from the Mediterranean climate along the central and southern coasts, up to the alpine in the mountains [60].
The land use is characterized by three main classes: croplands (51.46%), natural (43.05%), and urban areas (5.49%) [44,61]. Within the natural areas, the woodlands and Figure 1. Flowchart of the approach for forest habitat mapping using a supervised machine learning model. For more details about EUNIS codes (i.e., II and/or III level) see Table 3.

Study Area
The study area encompasses Italy (Figure 2), which covers about 300.000 km 2 , and it is characterized by a noteworthy geo-morphological and land use heterogeneity (there are two mountain ridges, the Alps in the northern, and the Apennines in the southern), as a result of complex tectonics processes starting from the Upper Cretaceous and centuries-old land management, respectively [59]. The area also shows a marked climatic variability, from the Mediterranean climate along the central and southern coasts, up to the alpine in the mountains [60]. The land use is characterized by three main classes: croplands (51.46%), natural (43.05%), and urban areas (5.49%) [44,61]. Within the natural areas, the woodlands and forests cover 85830 km 2 , corresponding to the 28.5% of total land surface [62]. The forest habitat (shrublands and tree plantations excluded) was subdivided in nine forest types [63], as shown in Table 1. Over centuries, along the Italian peninsula these climatic, morphological, and land use features have defined peculiar geographic regions with similar environmental conditions, which support plant and animal species with comparable life strategies and adaptations (i.e., the biogeographical regions) [64]. Indeed, in 2011 from EU Member States consultation meetings, following Article 17 of the Habitats Directive (92/43/EEC) and Natura 2000 Biogeographical processes, the biogeographical regions were defined for the whole Europe [65]. Italy is included in three Biogeographical regions (see Figure 2): Mediterranean (characterized by hot dry summers and humid, cool winters); Alpine (where extreme temperatures and annual precipitation are strictly related to the physiography and in late fall and winter all precipitation above 1500 m a.s.l. is in the form of snow); and, Continental (with significant annual variation in temperature, hot summers, and cold winters, and precipitation tends to be moderate, being mostly concentrated in the warmer months) [60].
whole Europe [65]. Italy is included in three Biogeographical regions (see Figure 2): Mediterranean (characterized by hot dry summers and humid, cool winters); Alpine (where extreme temperatures and annual precipitation are strictly related to the physiography and in late fall and winter all precipitation above 1500 m a.s.l. is in the form of snow); and, Continental (with significant annual variation in temperature, hot summers, and cold winters, and precipitation tends to be moderate, being mostly concentrated in the warmer months) [60].  The primary source for producing the EUNIS forest habitat classification is a dataset extracted by the European Vegetation Archive [17,66]. The archive collects records of vegetation plots, which include: a full list of vascular plant species; the estimates of cover abundance for each species; georeferenced point location; and, additional information on vegetation structure and environmental plot features. Vegetation plots that are larger than 200 m 2 and that only reported species composition without cover abundance information have been excluded.
The final forest vegetation dataset (see Table 3) was obtained according to the following stepwise procedure: 1.
resampling data: to perform the habitat classification only on the natural woodland areas, a forest mask was generated to resample the vegetation plots. The mask was obtained by combining Copernicus Land Monitoring Service products, specifically the High-Resolution Layers Tree Cover Density (TCD, [67]), the Imperviousness [68], and the CORINE Land Cover [69]. All of the records outside the forest mask were excluded; 2.
clustering data: from the whole dataset only forest habitats were extracted and clustered according to the EUNIS hierarchical classification nomenclature, following the EUNIS-ESy definitions [22]; and, 3.
filtering data: all data assigned at more than one EUNIS codes (e.g., bias plots assignment of mixed forest that linked at two different EUNIS groups) and all data recorded with the same geographic coordinates (i.e., plots of re-surveying monitoring research activities) were also excluded. Finally, a visualization data test was performed to identify spatial mismatches errors and/or spatial bias occurrences.

Predictor Variables
A spatially explicit dataset of predictors was produced to generate classified habitat maps by the classification approach selected.
Dataset of predictors includes three main categories: temporal data: variables representing temporal statistics and phenological metrics estimated from the biophysical index time series, generated from satellite EO data.
All of the variables were collected or calculated from dataset distributed under openaccess policy and spatially resampled at 20 m resolution. See Table A1 for a complete list of variables, with name, description, units, and data sources.
Reference spatial grid has been set to 20 m spatial resolution for the coordinate reference systems EPSG 32632 and EPSG 32633. The environmental variables were selected based on the relevance as drivers of plant distribution [55,70]. Among the geographic variables, the linear distance in km from both shoreline and river network were computed for each grid cell [71]. A digital elevation model [72] at 20 m spatial resolution was used to calculate elevation, slope, and aspect using a 3 × 3 pixel neighbourhood. Northerness and easterness components were later calculated form aspect, using sine or cosine transformation, in order to obtain continuous variables. The total rainfall [73] and average temperature [74] at 1 km resolution were interpolated from cell centroid coordinates to reference grid resolution using regularized spline with tension. The average temperature was normalized for the altitude effect [75]. Daily average solar irradiance was estimated at 20 m spatial resolution using the equations for computation of solar energy related parameters [76] and while considering local topography. Daily solar irradiance was aggregated to calculate monthly averages, and later daytime cloud cover collected from the ESA-Climate Change Initiative cloud dataset, for the period 2004-2014, were used to calculate the monthly average climatology cloud cover percentage [77]. Cloud cover percentage has been resampled to 20 m spatial resolution using regularized spline with tension and then used to weight solar irradiance with a factor of 0.75, to account for sky-diffuse solar radiation on cloudy days [78]. Finally, monthly weighted solar irradiance was used to calculate daily average solar radiation. Datasets representing soil properties at 250 m, specifically the soil organic carbon stock, pH, and absolute depth to bedrock, were collected from the SoilGrids (see Table A1) repository and interpolated to reference grid resolution using a Bartlett filter with 750 m radius.
Spectral variables, specifically spectral reflectance bands and spectral indices, were calculated from EO data that were acquired by Multi-Spectral Instrument (MSI) sensor onboard Sentinel-2 satellites. MSI is a multispectral sensor with a high spatial resolution (10 m, 20 m, and 60 m), high revisit capability (five days with a two satellite constellations), an orbital swath width of 290 km, and a spectral band set from the visible to shortwave infrared. Sentinel-2 MSI L2A data used for the analysis, distributed by Theia in MUSCATE format, represent the bottom of the atmosphere (BOA) reflectance, orthorectified, terrain-flattened, and atmospherically corrected with MACCS-ATCOR Joint Algorithm (MAJA) [79,80].
All of the satellite acquisitions of the period 2016-2019 with cloud cover lower than 90% were collected for the entire study area, for a total of around 26000 images distributed in 61 granules and corresponding to around 34 Terabyte (TB) of data stored. For each satellite image, all 10 surface reflectance spectral bands were first spatial resampled at 20 m, to obtain a full consistent set of spectral bands, namely the B2 (blue), B3 (green), B4 (red), B5 (Red edge 1), B6 (Red edge 2), B7 (Red edge 3), B8 (NIR1), B8a (NIR2), B11 (SWIR1), and B12 (SWIR2). Later, pixels corresponding to clouds, cloud shadows, snow, and water in the product quality flag masks or corresponding to topographic shadows (identified by product quality flags representing the topographic mask and low-sun mask, with the latter considering the pixel aspect, slope, and sun zenith angle) were masked out from further analysis. Sentinel-2 MSI surface reflectance spectral bands for the period March-October have been used for the classification. Spectral information from satellite data sensed during late fall and winter period were not considered among the predictors, since they are often hampered by snow cover and topographic shadows, and they have already been demonstrated to have a scarce result on forest habitat detection [55]. However, temporal predictors included the winter period with seasonal statistics (i.e., average, maximum, and standard deviation), since they allowed for the distinction between evergreen and broadleaved forest patterns.
In addition to Sentinel-2 MSI spectral bands, four spectral indices were selected to account for green, red, and yellow photosynthetic leaf pigments, during flowering and senescence, which is the period when the plant is curtailing chlorophyll production revealing various accessory pigments [81] and, consequently, determining the starting of fall foliage (Table 2). Table 2. Spectral indices adopted, equation with Sentinel-2 Multispectral Instrument (MSI) bands, and bibliographic references.
Temporal variables were calculated from LAI time series, stacked in a large multidimensional datacube after applying an image co-registration step [87]. LAI time series were later temporally smoothed and daily interpolated using the procedure based on second order weighted polynomial fitting and the Whittaker smoothing, as described in [88]. LAI daily time series were used to calculate the annual (average, minimum, maximum, delta, standard deviation) and seasonal (winter and summer average, minimum, maximum) temporal statistics and to estimate phenological metrics using the method that is described in [89]. The phenological metrics have only been used as predictors for T1 classification (i.e., deciduous broadleaved), since plant phenology of evergreen plants has lower temporal fluctuation in terms of LAI values, which makes their estimation uncertain. Other studies attempted to use EO time series to describe plant phenology in vegetation mapping, without an in-depth exploitation through the phenological metrics estimates [90]. The median value of each temporal variable calculated for the years 2016-2019 has been selected as temporal predictor used for the training, validation and testing phases of the classification algorithm.
All of the procedures used to obtain the dataset of predictors were performed using GRASS GIS for the processing of environmental predictors, SNAP for the Sentinel-2 MSI data processing, and R software for the processing of temporal predictors.

Predictors Selection
The Pearson correlation coefficient and the variance inflation factor (VIF) among all variables were performed to test for multi-collinearity to reduce the high number of predictors and improve model accuracy removing redundancy. All of the predictors with a Pearson correlation coefficient higher than 0.7 and a VIF higher than 10 [91] were removed. R-package 'Boruta' [92] was used to test whether the remaining variables were correlated. Boruta is a wrapper algorithm around a Random Forest (RF) classification model, in which several runs of RF are performed to generate shadow features and test the significance of each variable, in order to confirm or reject them. Finally, RF [93] was used to investigate the importance of predictors and assess their contribution to the prediction of the response Remote Sens. 2021, 13, 1231 9 of 28 variable, through the calculation of the importance value and the Gini index. RF was executed with a default set of hyperparameters using the 'ranger' R-package [94], and selecting the variables on an expert-based decision. All of these steps were done for each RF model (see the following section). Finally, the variable selection procedure was amended, based on expert knowledge and scientific reference [29,50,95], in order to reconsider a few variables that have a high significance for the habitat detection (see Figure 4).

Supervised Machine Learning Model
For classifying all of the forest habitats, RF was used as supervised machine learning model [93], which has been successfully applied for classification purpose in the context of vegetation remote sensing (see Kattenborn et al. [52] for an exhaustive list of reference). RF is a decision tree algorithm and is currently the most popular ensemble method for classifying and predicting forest habitat types [96]. Because RF produced several independent trees by intensive resampling of different subset of predictors, it is natural to consider this adapted bootstrapping scheme for big data context [97]. In addition, in some studies of comparison between different classification algorithms, RF was found to be the most performing [30,52,96,98], or at least comparable with other. Hence, when considering that the proposed study was not a comparison of classification methods, but a demonstrative procedure, the final choice was to apply the RF classifier. To build and grow trees, RF involves several hyperparameters that control both the structure of each individual tree (e.g., nodesize) and the structure and the size of the forest (e.g., mtry, ntree). Despite that it is well known that RF works reasonably well with the default set of hyperparameters, their tuning can improve the performance of RF [99]. R package 'mlr' [100] was used to tune the best RF hyperparameters combination, evaluating the following value ranges: mtry from squared root of the number of variables predictors to the number of variables predictors minus 1; min.node.size from 1 to 3; and, ntree from 401 to 1001. The best set of hyperparameter was chosen selecting the higher Cohen's kappa coefficient calculated using five-fold cross-validation with 20 repetitions.
A two-stage hierarchical classification scheme using 'ranger' R-package [94] was built by training a total of 4 RF models, each one setup with a distinct set of predictors and hyperparameters. The first stage RF model classified EUNIS II level forest classes: deciduous broadleaved, evergreen broadleaved, and needleleaved (T1, T2, and T3, respectively). At the second stage, the other three RF models classified EUNIS III level for each EUNIS II class identified in the first stage.
Each RF model was trained using a stratified random sample of 70% of data, and the model performance was tested using the remaining 30% (internal evaluation). The accuracy assessment of the procedure depends on a confusion matrix (error matrix) and the following accuracy measures derived from that: the Overall accuracy, the User's and Producer's accuracy, and their Standard error. The formulas to obtain the accuracy metrics and the standard error were presented in Stehman & Foody, 2019 [101]. The computing of the Cohen's kappa was also made.
The trained RF models were applied to the selected sets of environmental, spectral, and temporal predictors in order to classify the forest habitat types for the entire Italian national territory. The Gini index was calculated to evaluate the importance of each predictor.
The model classification was spatialized for the entire Italian territory to produce maps of habitat forest types according to EUNIS II and III levels. An additional qualitative check of the obtained results was done by expert judgment and the validity/accuracy was confirmed.

Response Variable
From the vegetation database, 14385 plots were selected and classified in 24 forest habitats according to the EUNIS III level: 12 broadleaved deciduous (T1), four broadleaved evergreen (T2), and eight needleleaved forest habitats (T3) ( Table 3). The records are representative of natural forest cohorts existing on the Italian territory, according to Italian National Forestry Inventory [62]. The distribution of vegetation database is shown in Figure 3, according to the EUNIS II level classification.

Predictor Variables and Selection
A set of 163 variables grouped in four categories has been used as predictors: 13 environmental variables (three geographic, four geomorphologic, three climatic, and three soil properties); 124 spectral variables (80 spectral bands, 44 spectral indices); and, 26 temporal variables (10 temporal statistics, 16 phenological metrics) (see Table A1 for the complete list). After the variable selection procedure, a total of 68 variables have been retained as predictors for the RF classification (see Figure 4). Successively, some of the predictors excluded in the previous step (i.e., elevation, normalized annual temperature, minimum value of LAI, maximum value of LAI, and latitude) were reconsidered in the classification procedure of the four RF models (see Figure 4). All of the selected variables reported a higher importance value than the shadow features generated by the Boruta algorithm.

Predictor Variables and Selection
A set of 163 variables grouped in four categories has been used as predictors: 13 environmental variables (three geographic, four geomorphologic, three climatic, and three soil properties); 124 spectral variables (80 spectral bands, 44 spectral indices); and, 26 temporal variables (10 temporal statistics, 16 phenological metrics) (see Table A1 for the complete list). After the variable selection procedure, a total of 68 variables have been retained as predictors for the RF classification (see Figure 4). Successively, some of the predictors excluded in the previous step (i.e., elevation, normalized annual temperature, minimum value of LAI, maximum value of LAI, and latitude) were reconsidered in the classification procedure of the four RF models (see Figure 4). All of the selected variables reported a higher importance value than the shadow features generated by the Boruta algorithm.

Supervised Machine Learning Model
From the five-fold cross-validation tuning with 20 repetitions, the selected hyperparameters for the EUNIS II level classification were: mtry = 20; num.trees = 908; min.node.size = 2. For the EUNIS III level classification, a different hyperparameters set has been tuned for each EUNIS II level class. For the T1 classification, the mtry was set to 24, the num.trees was set to 989, and the min.node.size was set to 3; for the T2 classification, the mtry was set

Supervised Machine Learning Model
From the five-fold cross-validation tuning with 20 repetitions, the selected hyperparameters for the EUNIS II level classification were: mtry = 20; num.trees = 908; min.node.size = 2. For the EUNIS III level classification, a different hyperparameters set has been tuned for each EUNIS II level class. For the T1 classification, the mtry was set to 24, the num.trees was set to 989, and the min.node.size was set to 3; for the T2 classification, the mtry was set to 16, the num.trees was set to 527, and the min.node.size was set to 2; and, for the T3 classification, the mtry was set to 15, the num.trees was set to 981, and the min.node.size was set to 2.
The forest habitat types classification, using the RF modelling approach based on all input records (i.e., response variable) and including the variables selected (i.e., predictor variables), achieved an overall accuracy of 87% at the EUNIS II level classes (i.e., T1, T2, and T3). The highest misclassification was found between the broadleaved deciduous and the broadleaved evergreen habitat-type forests. The three EUNIS II level classes of forest (broadleaf and conifer forest) achieved very high producer's and user's accuracies (>85%; see Table 4). The second stage of habitat-type classification has been performed at EUNIS III hierarchical level, achieving an overall accuracy of 76.14%. The highest overall accuracy was the one of the broadleaved evergreen forest, equal to 91%, followed by 76% and 68% for needleleaved and broadleaved deciduous habitat forests, respectively (see Tables 5-7). Within the broadleaved evergreen habitat type classes, the mainland laurophyllous forest (EUNIS III T2) obtained the lowest classification accuracy (29.4%), and its highest misclassification was with the Mediterranean evergreen Quercus forest; instead, regarding the Mediterranean evergreen Quercus forest, a user's accuracy of 96.5% was achieved (see Table 5). The needleleaved forest (EUNIS II T3) classes were, on average, well classified; indeed, the user's accuracy was higher than 65% for each class. On the contrary, classes of broadleaved deciduous forests (EUNIS II T1) were classified worst: six classes had a user's accuracy below 50%, three classes had a user's accuracy between 50% and 60%, and only three classes had a user's accuracy higher than 60%.  T11  T15 T17 T18 T19  T1A  T1B  T1C  T1D  T1E  T1F  T1G  Sum T11  T15 T17 T18 T19  T1A  T1B  T1C  T1D  T1E  T1F  T1G  Sum  UA (SE)  T19  13  6  20  0  206  54  2  0  3  17  8  6   Overall, the environmental variables were, by far, the most important factor for explaining the classification of the forest habitat-types, regardless of the EUNIS level. The variables exhibiting higher Gini index values for the EUNIS II level classification were: latitude; distance from shoreline; elevation; annual standard deviation LAI; pH index; Sentinel-2 MSI spectral index RI value at yearly month 03; annual cumulated rainfall; annual delta LAI; Sentinel-2 MSI spectral index EVI value at yearly month 03; and, absolute depth to bedrock (see Figure 4). For the two broadleaved classification at EUNIS III level (T1 and T2), the most important variables were: latitude; distance from shoreline (Log10); elevation; pH index; and, annual cumulated rainfall. The variables with high Gini index values were latitude and distance from river network for T1 and T2 forest habitat-types, respectively (see Figure 4). Even for the T3 habitat-type EUNIS III level classification, the most important variables were: distance from shoreline; soil organic carbon stock; annual cumulated rainfall and normalized annual average air temperature; Sentinel-2 MSI LAI value at yearly month 05; absolute depth to bedrock; annual standard deviation LAI; and, distance from river network (see Figure 4).
The following figures show the results of the projected classified map of forest habitat for the entire of Italy ( Figure 5) and at local resolution ( Figure 6), within the forest mask at the EUNIS III level T1-3. Figure A1 also shows pictures of woodland landscapes according to: T1 Broadleaved Deciduous ( Figure A1 panel a); T2 Broadleaved Evergreen ( Figure A1 panel b); and, T3 needleleaved forest habitat type ( Figure A1 panel c). Finally, of the 240 million pixels potentially detectable (spatial extension of the Forest Mask layer), approximately 9.15% was not classified. Most of the area not classified were distributed on mountainous regions. cumulated rainfall and normalized annual average air temperature; Sentinel-2 MSI LAI value at yearly month 05; absolute depth to bedrock; annual standard deviation LAI; and, distance from river network (see Figure 4). The following figures show the results of the projected classified map of forest habitat for the entire of Italy ( Figure 5) and at local resolution ( Figure 6), within the forest mask at the EUNIS III level T1-3. Figure A1 also shows pictures of woodland landscapes according to: T1 Broadleaved Deciduous ( Figure A1 panel a); T2 Broadleaved Evergreen ( Figure  A1 panel b); and, T3 needleleaved forest habitat type ( Figure A1 panel c). Finally, of the 240 million pixels potentially detectable (spatial extension of the Forest Mask layer), approximately 9.15% was not classified. Most of the area not classified were distributed on mountainous regions.  [102]. For the code reference of the legend, see Table 3.   Table 3.

Discussion
This study proposed an approach to map forest habitats according to EUNIS classification (II and III levels), when considering the great habitat variability of the study area, in order to support the elaboration of conservation strategies and the environmental assessment. It combined nationwide biodiversity database, environmental information, and satellite EO data (Sentinel-2 MSI at 20 meters spatial resolution) to train a supervised machine learning model for forest habitat classification. Previous studies on forest habitat mapping in Italy used satellite data for classification at coarse spatial and thematic resolutions, often investigating small areas (regional or sub regional scale areas) [37,90,103] or focusing on single tree species [96].  Table 3.

Discussion
This study proposed an approach to map forest habitats according to EUNIS classification (II and III levels), when considering the great habitat variability of the study area, in order to support the elaboration of conservation strategies and the environmental assessment. It combined nationwide biodiversity database, environmental information, and satellite EO data (Sentinel-2 MSI at 20 m spatial resolution) to train a supervised machine learning model for forest habitat classification. Previous studies on forest habitat mapping in Italy used satellite data for classification at coarse spatial and thematic resolutions, often investigating small areas (regional or sub regional scale areas) [37,90,103] or focusing on single tree species [96].
Spectral and temporal features from Sentinel-2 MSI big datacubes were fully exploited in order to generate spectral and temporal predictor dataset (e.g., the estimation of phenological metrics). Such predictors dataset, combined with geomorphological, climate, and pedological variables, allowed for training a supervised machine learning model using a two-stages hierarchical classification scheme.
A potential feature of the current study concerns the updatability of the approach proposed, which allows for the habitat maps to be updated using EO temporal predictors, providing a dynamic and key contribution to ecological monitoring and assessment. The change detection metrics derived from the proposed procedure could be relevant for environmental issues, having a strong impact on the conservation status of habitats [104]. The approach can also improve predictive distribution models of endangered animal species, which are based on the use of high resolution EO data to identify fine-scale habitat features [105]. Moreover, it could be useful to evaluate the effects of global changes on resilience and species composition of habitats [106]. When considering that Sentinel-2 satellite mission systematically acquires data worldwide over land, the proposed approach can be reproduced and extended to all of the vegetated geographical areas in the earth, thanks to the existing high resolution layers (i.e., Copernicus product) [44] and the availability of plant species archive (i.e., EVA) [17]. Moreover, in the absence of high resolution land cover and thematic maps, they could be generated from the time series analysis of satellite derived vegetation indices (e.g., NDVI, EVI, and LAI). The evaluation of the importance of the predictor categories used for the classification models was one of the scopes of this research. Using the VIF and correlation analyses, a reduced set of predictors was selected to remove the high multi-collinearity for each classification model (see Table A2). The spectral predictors that result in a higher importance for discriminating forest habitats are those related to plant greening (Sentinel-2 MSI EVI value at yearly month 04, Sentinel-2 MSI B2 and B12 value at yearly month 03), flowering and browning (Sentinel-2 MSI NDYI value at yearly month 04, 05, and 10, Sentinel-2 MSI RI value at yearly months 03, 04, and 10). This result confirms that the spring and fall seasons show the highest heterogeneity of radiometric signals due to the variability on vegetation growing and senescing pattern along the altitudinal and latitudinal gradients in Italy. Among the spectral bands, B2 (blue) showed the greater Gini index values and resulted in being more frequently selected among the predictors than other spectral bands, while the redness indices (RI and CRI1) were the spectral indices with higher Gini index values and more frequently selected.
LAI biophysical information was selected as variable for the estimation of temporal predictors. Apart from its reduced saturation in forest sites as compared to spectral indices (e.g., NDVI) [107], LAI derived predictors played a key role in the forest habitats discrimination. In fact, its derived temporal statistics resulted in high importance values within the classification model (annual standard deviation LAI, annual delta LAI, and Sentinel-2 MSI LAI value at yearly month 05), being even higher than single spectral predictors.
Overall, this study demonstrates the suitability of the Sentinel-2 MSI derived information for the separability of broad cover forest habitat types (i.e., EUNIS II level, based on plant functional traits), as well as for the identification of detailed EUNIS classification types (i.e., III level), based on their ecological features (i.e., species composition) [55].
The Gini index confirms how the environmental predictors remarkably contribute to the final habitat classification (see also [58]), for both the EUNIS II and III levels (see Table A2) [29,108]. In terms of explanatory variable importance, the results show how the variability of climatic and geographic pattern of the Italian territory play a paramount role in affecting the plant community trait responses and, consequently, determining a wide diversity of forest habitat types along the peninsula, which includes some of most important forest ecosystems in southern Europe (i.e., the Mediterranean biogeographical region) [109]. The high elevation-relief ratio between elevation versus distance from the shoreline, and the presence of high mountain ranges (i.e., reaching a maximum of over 2500 m for the Apennine and 4500 in the Alps), are all considered to be determining factors that affect the bio-climate in Italy (i.e., from Mediterranean to Alpine through the Continental climate zones) [110]. According to the hypotheses that was proposed by Miller & Franklin [50] regarding the proportional increasing of importance of the topo-climatic variables with the increasing of study area and pixels grain size, a different importance weight between environmental and spectral predictors was observed. However, looking at the predictors importance at EUNIS II level of classification, our results show that the information related to plant phenology (leaf longevity, greening, flowering, browning, and leaf senescence) estimated from EO spectral and temporal predictors plays an important role in the separability of broadleaved deciduous, broadleaved evergreen, and needleleaved forests, according to [108].
The mismatching between the classified pixels and not classified pixels along the mountain ranges are due to the undetermined values of spectral variables, where cloud cover and shadows are affected by both topographic (i.e., slope and aspect) and climate conditions [111].
The results of classified tiles show an underestimation of needleleaved forests in central and southern Italy (see panel a of Figure 6), where most of them are afforested areas not available in the vegetation database (see Table 3). The natural distribution of native conifer forests in central and southern Italy is limited by environmental and climatic conditions and it often overlaps the broadleaved forests belts [112].
In the Apennine needleleaved forests, there are scarce data on the impact of forestation/afforestation on the indigenous stands (eg. forests of P.pinea, P.halepensis, P. nigra, and Picea abies in northern Apennine), with the exception of the well-known natural mixed coniferous-broadleaved forests [113,114], such as the woods with European silver fir (A. alba) in southern Italy, or Bosnian Pine (P. heldeeicrii) in the southernmost western part of the peninsula. Indeed, afforested stands are excluded from most of the surveys from botanists compiling biogeographic archives, thus determining a geographical bias [30,115] with an underestimation of the afforested coniferous stands in the Apennine. Conversely, in the Alps (see panel c of Figure 6), the high contribution of vegetation surveys on forest consortia dominated by native tree species of Firs, Larches, and Pines, the low impact of forestation/afforestation activities, give a more accurate classification. Moreover, the evident discrepancy of data that are available for the semi-automatic classification (i.e., response variable) can also be seen in Table 3, where, out of a total of 2281 plots of coniferous forest, only the 24% are within the Mediterranean biogeographical regions (i.e., most of all the Apennine mountain chain).
The classification of the 24 forests habitat-types revealed different results for each of the EUNIS II level classes. The overall accuracy for the detection of tree species distribution with spectral indices analysis, obtained within the T2 (91.5%) and T3 (76.2%), is in line with similar studies [55]. The low (67.6%) overall accuracy within T1 EUNIS III classes, is confirmed by previous studies, where the accuracy of the broadleaved deciduous species was always below of the broadleaved evergreen and coniferous forests [55].
The performance of habitat detection using EUNIS classification based on the stepwise approach of pre-defined hierarchical habitat schemes demonstrates that the mismatching between categories depends solely on ecological and biogeographic factors that reflects the real altitudinal gradient and coexistence in natural woodlands in southern Europe, thus revealing a good predicting capability of the method [116].
The misclassification in the confusion matrix of broadleaved deciduous (T1, Table 4) and evergreen (T2) forests (at the EUNIS II level) is mainly due to the similar ecological drivers and partial overlapping pattern at the edge of their own distribution, determining mixed forests along the elevation belt. The same driver on altitudinal contacts acts for the needleleaved forests that partially mismatch the broadleaved deciduous forests on the Alpine chain. Natural forests clearly reflected this condition. In the Apennines, the driver (ecological filtering on biogeographical pattern, see "Species Pool" theory) is different, but the results in detection capability are similar. Coniferous forests are relics [114], because of historical and biogeographic transformation, and the needleleaved trees are enveloped in broadleaved forests (F. sylvatica), again mismatching the net detection between the two categories.
The natural elevation pattern also explains the highest misclassification found between the broadleaved deciduous and broadleaved evergreen habitat-types at the III level of EUNIS classification (Tables 5 and 6). In needleleaved forest stands, Mediterranean and temperate coniferous trees show a net separability in forest classification as the result of a clear distinct chorological and ecological pattern, thus reflecting the quite different species composition (Table 7). Otherwise, Larix forest, the only deciduous conifer, was well classified (UA 73.5 % and PA 67.6 %). The misclassification with temperate mountain Picea forest is explained by ecological and altitudinal overlapping of the two habitats, while the coexistence with temperate and sub-mediterranean montane P. sylvestris-P. nigra forest is frequent on stone slopes. Another justification of the misclassification of RF results could be the "class imbalances", as proposed by other researchers [30,55]. The EUNIS detection can be affected by the class with more samples, which is preferred by the RF classifier and, consequently, highlights the importance of an equilibrate sample design [116,117]. However, when a research includes several targets and refers to a wide extent, the only database available is collected from different sources over a large span of time, and it is difficult to manage (i.e., resampling thematic plots versus geographic). Moreover, in practice, some habitat-types often bring about mixed stands (i.e., see "Coexistence Theory") with others, and it is difficult to detect at fine spatial resolution data.
The results suggest that further pathways, to overcome some weaknesses shown, will aim to: (i) increase the EUNIS habitat categories to be mapped (i.e., grasslands, shrubs, wetlands, and coastal area); (ii) amplify the geographical extent to a wider area (i.e., from national to continental scale), obtain homogeneous classification within single biome unit (i.e., Mediterranean Basin); (iii) test different semi-supervised machine learning algorithms (e.g., Convolutional Neural Network, [118]) to obtain a more suitable habitat classification; and, (iv) also consider C-band Synthetic Aperture Radar (SAR) satellite data as candidate predictors for habitat classification purposes (e.g., forest canopy structure [119]).

Conclusions
Data-driven decisions are increasingly emerging as a turning point for policy makers. Sound scientific evidence and the availability of reliable and replicable data are key elements supporting informed decisions. In accordance with the renewed European Union ambition settled in the European Green Deal and in the targets of the 2030 Biodiversity Strategy, to protect nature as a whole, and strictly protect forests, the procedure presented here can contribute to detect forest types and changes over time and space, thus enhancing their conservation and restoration.
The combination of EO data, observing and measuring ecosystem processes, big data analytics, making use of advanced computational analytic techniques. like RF machine learning algorithm, allowed for demonstrating the applicability of up-to-date technologies for habitats classification. To this end, an operational example of how EO processing chains can support forest habitats assessment and monitoring is attempted in this paper. Several studies on habitat mapping by remote sensing products at the regional or sub-regional scale are available, but this is a first attempt to map EUNIS forest habitat-types in Italy at the national scale with the integration of nationwide biodiversity databases.
The approach presented here will allow an information technology procedure to be sped up with annual or seasonal updating, depending on the extension of the study area and the monitoring objectives. The obtained procedures could be applied on several environmental data in order to cyclically and promptly repeat spatial analysis to detect changes in space and time in support of ecosystem conservation issues, especially to evaluate the impact of illegal actions (e.g., forest harvesting) or natural hazards (e.g., destructive storms or other natural disasters) on habitat distribution.

Conflicts of Interest:
The authors declare no conflict of interest.
Appendix A Table A1. List of variables used as predictors for the analysis of forest habitat types classification. River network and shoreline used to calculate distance from shoreline and distance from river network was MATTM National Geoportal (http://www.pcn.minambiente.it/mattm/catalogo-metadati/, accessed on 12 January 2020); the source for soil properties variables was SoilGrids (https://soilgrids.org, accessed on 16 January 2020); the source for all the others predictors was ISPRA Database (http://www.sinanet.isprambiente.it/it/sia-ispra, accessed on 28 January 2020).