Designing a Validation Protocol for Remote Sensing Based Operational Forest Masks Applications. Comparison of Products Across Europe

: The spatial and temporal dynamics of the forest cover can be captured using remote sensing data. Forest masks are a valuable tool to monitor forest characteristics, such as biomass, deforestation, health condition and disturbances. This study was carried out under the umbrella of the EC H2020 MySustainableForest (MSF) project. A key achievement has been the development of supervised classiﬁcation methods for delineating forest cover. The forest masks presented here are binary forest / non-forest classiﬁcation maps obtained using Sentinel-2 data for 16 study areas across Europe with di ﬀ erent forest types. Performance metrics can be selected to measure accuracy of forest mask. However, large-scale reference datasets are scarce and typically cannot be considered as ground truth. In this study, we implemented a stratiﬁed random sampling system and the generation of a reference dataset based on visual interpretation of satellite images. This dataset was used for validation of the forest masks, MSF and two other similar products: HRL by Copernicus and FNF by the DLR. MSF forest masks showed a good performance (OA MSF = 96.3%; DC MSF = 96.5), with high overall accuracy (88.7–99.5%) across all the areas, and omission and commission errors were low and balanced (OE MSF = 2.4%; CE MSF = 4.5%; relB MSF = 2%), while the other products showed on average lower accuracies (OA HRL = 89.2%; OA FNF = 76%). However, for all three products, the Mediterranean areas were challenging to model, where the complexity of forest structure led to relatively high omission errors (OE MSF = 9.5%; OE HRL = 59.5%; OE FNF = 71.4%). Comparing these results with the vision from external local stakeholders highlighted the need of establishing clear large-scale validation datasets and protocols for remote sensing-based forest products. Future research will be done to test the MSF mask in forest types not present in Europe and compare new outputs to available reference datasets.


Introduction
One of the main data challenges forest managers face is the lack of accurate and up-to-date data to support specific activities in the silvicultural cycle, such as reporting and planning for commercial forestry or recreational activities [1,2]. Forests are dynamic environments and information related to its characterization and health condition is key [3][4][5]. In addition, some natural dynamics such as pest and droughts are being exacerbated by recent changes in climate, causing more frequent damages and costs [3,[6][7][8][9]. Therefore, there is a need for reliable, precise and quality data which can cover large

Study Areas
The MSF forest mask layers used on this study were implemented on sixteen case demos in six countries and include the most representative forest types across Europe (Figure 1). These Areas of Interest (AOI) are managed by forest owners' associations and forestry research institutes which promote conservation and sustainable management values in over one million hectares in size. For simplicity, each AOI has an acronym related to the country (e.g., Spain is ESP) and the number of AOI present in each country.
Remote Sens. 2020, 12, x FOR PEER REVIEW 3 of 17 locations and European forest types; and (3) to assess the skill of MSF Forest Masks in comparison to other large-scale products available on those areas.

Study Areas
The MSF forest mask layers used on this study were implemented on sixteen case demos in six countries and include the most representative forest types across Europe (Figure 1). These Areas of Interest (AOI) are managed by forest owners' associations and forestry research institutes which promote conservation and sustainable management values in over one million hectares in size. For simplicity, each AOI has an acronym related to the country (e.g., Spain is ESP) and the number of AOI present in each country. Further details of each AOI can be found on the MySustainableForest demo cases website section [16], including the details of the main forest systems present:  Further details of each AOI can be found on the MySustainableForest demo cases website section [16], including the details of the main forest systems present: • Croatia: Continental lowland forests by heterogeneous stands and lowland Slavonian pedunculated oak forests (HRV-1 and HRV-2). • Czech Republic: Temperate Pannonian mixed forests: (i) conifers, namely spruces, pines and larches; and (ii) broadleaf, namely beeches, oaks and hornbeams (CZE-1). • France: Oceanic maritime pine forest (FRA-1) and temperate continental with pedunculated oak (FRA-2).

High Resolution Forest/Non-Forest MSF Classification Dataset
According to Food and Agriculture Organization (FAO), forest land is any surface of more than 0.5 ha with more than 10% of tree canopy and trees higher than 5 m, excluding land predominated by agricultural or urban uses [38]. To calculate the MSF forest mask used in this study, we considered as forest the area with more than 50% of tree stratum coverage (i.e., when more that 50% of the pixel appears covered by tree canopy) [33]. We are aware that we are excluding areas with lower than 50% tree coverage which per definition may cause some disagreement with ground truth data and other forest mask datasets.

High Resolution Forest/Non-Forest External Classification Datasets
There are many methodologies and providers of forest mask layers. However, the number of forest/non-forest products available at large scale as in this case, across Europe, is limited. Two freely available forest/non-forest classification datasets were selected to compare them against MSF forest mask. The products chosen are comparable to MSF forest mask as they were produced close in time, within 3 years of the Sentinel images used for the development of our layers, and have lower but still comparable spatial resolution (20-50 m): • Forests High-Resolution Layer (HRL) is provided by Copernicus Land Monitoring Service (CLMS) and coordinated by the European Environment Agency (EEA) [32]. The Tree Cover Density (TCD) product provides the level of tree coverage per pixel in a range of 0-100%. It is obtained through a semi-automatic classification of multitemporal satellite images (Sentinel-2 and Landsat-8) for the year 2015 (±1 year) [32], using a combination of supervised and unsupervised classification techniques. The resulting product has 20 m spatial resolution. To make it comparable with the MSF forest masks, the TCD product was pre-processed to obtain a binary forest/non-forest classification, considering as forest all those pixels with a cover density of 50% or higher. A future update of this dataset for 2018 reference year was announced in August 2002 by Copernicus, but it was not available prior submission of this paper.

•
TanDEM-X Forest/Non-Forest (FNF) is provided by Microwaves and Radar Institute of the German Aerospace Center (DLR) [33,34]. It is a global forest/non-forest map generated from interferometric synthetic aperture radar (InSAR) data from the TanDEM-X mission. The bistatic stripmap single polarization (HH) InSAR data were acquired between 2011 and 2016, being 2015 the reference year for the final product. FNF maps are produced with a final pixel size of 50 m.

Satellite Data Required to Build the Independent Validation Dataset
Two sources with different bands spatial resolutions were used to create the validation dataset: (1) High Resolution imagery: Sentinel-2 L2A provided by Copernicus (Table 1); and (2) Very High Resolution imagery: satellite data (GoogleEarth™ and Bing Aerial imagery), taking into consideration the spatial and temporal resolution of the forest/non-forest classification datasets.

Methods
Validation and intercomparison of the results across different areas may not be straightforward if not following a standard methodology. Considering the non-existence of homogenized forest/non-forest ground truth datasets, there are different strategies that may be implemented to perform a quantitative validation of forest masks:

•
Cross validation or dataset split [46]. Commonly used for products developed using any supervised classification approach. A training dataset is needed, either provided by the user or built by the producer. The use of cross-validation techniques is widely accepted when there are not independent training and testing sets. However, the statistical distribution of the training and test samples are not independent, leading to an optimistic bias in the resulting metrics [46]. Note that the final values of the metrics will be strongly determined by the quality of the input dataset.

•
Using data from national forest inventories (NFI). Many methodologies used to perform forest inventories as the selected method by each nation will depend of the purpose and the scale of the inventory, leading to significant efforts to perform data search, normalization and data engineering. In Europe, the European National Forest Inventory Network (ENFIN) [47] promotes NFIs and harmonizes forest information. However, there are still different methods and plot sizes used. Moreover, inventories contain errors from different sources, as they rely on sampling strategies and the measuring protocols are not always clear, thus adding uncertainty to the metrics generated through the validation process [4, 48,49]. Hence, validation metrics might differ significantly Remote Sens. 2020, 12, 3159 6 of 17 across areas where the product has identical quality, due to the different inventory methods used, which are in general heterogeneous.

•
Building independent datasets based in visual interpretation of images [50]. This method is costly, and the validation must be carried out by independent interpreters in order to build an unbiased dataset trying to represent the variability of forest types within a determined area. The statistical metrics obtained with this method may be close to the real quality of the product if sampling and interpretation processes are carefully designed.
Considering the difficulties of using cross-validation and national forest inventories for our objectives, a new reference forest/non-forest dataset was generated by an independent operator to validate MSF forest masks and compare it with the other external products (e.g., HRL and FNF). The validation strategy was divided into four phases: (1) sampling design; (2)  • Building independent datasets based in visual interpretation of images [50]. This method is costly, and the validation must be carried out by independent interpreters in order to build an unbiased dataset trying to represent the variability of forest types within a determined area. The statistical metrics obtained with this method may be close to the real quality of the product if sampling and interpretation processes are carefully designed.
Considering the difficulties of using cross-validation and national forest inventories for our objectives, a new reference forest/non-forest dataset was generated by an independent operator to validate MSF forest masks and compare it with the other external products (e.g., HRL and FNF). The validation strategy was divided into four phases: (1) sampling design; (2)

Sampling Design
Pixels were the spatial unit selected as validation samples. Following the recommendations made by Olofsson et al. [26], a probability sampling strategy was applied to select the samples of the reference dataset. The two conditions defining a probability sample are: (1) the inclusion probability must be known for each unit selected in the sample; and (2) the inclusion probability must be greater than zero for all units in the AOI [29]. A stratified random sampling strategy was designed considering the two strata forest and non-forest. This design allows increasing the sample size in classes that occupy a small proportion of area to reduce the standard errors of the class-specific accuracy estimates for these rare classes [26].
To obtain the number of points to be included in the reference dataset, the following equation was used (Equation (1)) [51]: where N represents the number of units in the AOI, S (Ô) is the standard error of the estimated overall accuracy, Wi is the mapped proportion of area of stratum i and Si is the standard deviation of stratum where Ui is the expected user's accuracy.
With a pessimistic user's accuracy estimate of 0.7 and a standard error of 0.01 for the overall accuracy, the total number of points for the dataset was 2100.
To choose the optimal number of points per AOI, the recommendation [26] is to use an allocation method that should meet two conditions. Firstly, the number of points per AOI should lay between equal and proportional allocation (i.e., proportional to the area of the AOI relative to the total area of all the AOIs). Secondly, the final allocation should push the proportional distribution to the equal, in order to have enough samples to robustly validate the poorer class, but without it becoming completely equal.
The specific equation (Equation (2)) used for sample allocation per AOI was:

Sampling Design
Pixels were the spatial unit selected as validation samples. Following the recommendations made by Olofsson et al. [26], a probability sampling strategy was applied to select the samples of the reference dataset. The two conditions defining a probability sample are: (1) the inclusion probability must be known for each unit selected in the sample; and (2) the inclusion probability must be greater than zero for all units in the AOI [29]. A stratified random sampling strategy was designed considering the two strata forest and non-forest. This design allows increasing the sample size in classes that occupy a small proportion of area to reduce the standard errors of the class-specific accuracy estimates for these rare classes [26].
To obtain the number of points to be included in the reference dataset, the following equation was used (Equation (1)) [51]: where N represents the number of units in the AOI, S (Ô) is the standard error of the estimated overall accuracy, W i is the mapped proportion of area of stratum i and S i is the standard deviation of stratum i.
With a pessimistic user's accuracy estimate of 0.7 and a standard error of 0.01 for the overall accuracy, the total number of points for the dataset was 2100.
To choose the optimal number of points per AOI, the recommendation [26] is to use an allocation method that should meet two conditions. Firstly, the number of points per AOI should lay between equal and proportional allocation (i.e., proportional to the area of the AOI relative to the total area of all the AOIs). Secondly, the final allocation should push the proportional distribution to the equal, in order to have enough samples to robustly validate the poorer class, but without it becoming completely equal.
The specific equation (Equation (2)) used for sample allocation per AOI was: where p is the distribution of n proportional to the area of each AOI and e is the equal distribution of n for each AOI.
Once the sample size per AOI was computed, samples need to be allocated to strata (i.e., forest/non-forest). It is important that the sample allocation allows precise estimates of accuracy and area [52]. Having a fixed number of samples per AOI to allocate in two strata, proportional allocation would increase the variances of the performance metrics for the rarer class, as it would be poorly represented. Using equal allocation would allow a more robust validation of the poorer class at the expense of increasing the variances of the bigger class. The optimal sample allocation should minimize the variance of the desired target metric. Considering the overall accuracy (OA) as the metric whose variance needs to be minimized, the variance of an estimated OA can be computed using the following equation (Equation (3)) [26]: Being n i the number of samples per stratum i in a specific AOI, the optimal sample allocation (i.e., optimal value of n i per stratum) was computed through an iterative process estimating the variance of a desired OA for all possible values of n i . The values of n i which yielded the minimum variance of OA were selected as optimal sample allocation (Table 2 and Figure 3).

Dataset Generation
Once sampling size and allocation were computed, the validation dataset was created. In each AOI, the random points were generated with a minimum distance of 100 m between them. Each point was subsequently assigned to forest or non-forest via photointerpretation by an independent trained consultant. Sentinel-2 and very high resolution images (see data, Section 2.4) were used as basis for the visual interpretation.
For the Sentinel-2 images, the RGB composition used was NIR/SWIR/red, as it allows clear vegetation discrimination due to the high reflectance of the vegetation in NIR wavelengths [53]. Forests were visualized in red tones for the coniferous species and as light reds or orange for deciduous species. Non-forest areas, such as crops, roads or buildings, were visualized in green, blue or yellow. High resolution images were visualized in true color (i.e., red/green/blue).

Performance Metrics
Confusion matrices (Table 3) were built for each area using the forest mask products and the reference dataset. A total confusion matrix was computed pooling the samples of all the areas. Different agreement and error metrics were computed from confusion matrices (Table 4) [35]. All metrics were multiplied by 100 to be expressed as percentage.

Dataset Generation
Once sampling size and allocation were computed, the validation dataset was created. In each AOI, the random points were generated with a minimum distance of 100 m between them. Each point was subsequently assigned to forest or non-forest via photointerpretation by an independent trained consultant. Sentinel-2 and very high resolution images (see data, Section 2.4) were used as basis for the visual interpretation.
For the Sentinel-2 images, the RGB composition used was NIR/SWIR/red, as it allows clear vegetation discrimination due to the high reflectance of the vegetation in NIR wavelengths [53]. Forests were visualized in red tones for the coniferous species and as light reds or orange for deciduous species. Non-forest areas, such as crops, roads or buildings, were visualized in green, blue or yellow. High resolution images were visualized in true color (i.e., red/green/blue).

Performance Metrics
Confusion matrices (Table 3) were built for each area using the forest mask products and the reference dataset. A total confusion matrix was computed pooling the samples of all the areas. Different agreement and error metrics were computed from confusion matrices (Table 4) [35]. All metrics were multiplied by 100 to be expressed as percentage.

Agreement metrics
Overall Accuracy (

Results
Performance metrics for each forest/non-forest mask layer (i.e., MSF, HRL and FNF) can be found in Table 5. For the pooled set of AOIs, agreement metrics were high, the OA ranging from 88.7% to 99.5% and the DC from 91.1% to 99.5%. Highest scores were always found for MSF, followed by HRL and finally FNF. Errors and bias were low and balanced in the MSF product (OE = 2.4%; CE = 4.5; relB = 2%), while HRL (OE = 14.4%; CE = 6.3; relB = −8.6 %) and FNF (OE = 27.3%; CE = 19.7; relB = −9.5%) showed higher and imbalanced errors, which led to higher relative bias. HRL and FNF products had higher omission than commission errors (negative relB). As a summary, the relative bias of MSF product was lower and the commission error was slightly higher than the omission error, thus confirming the good calibration and balance of MSF errors. Analyzing the agreement metrics by AOIs (Table 6, Figure 4), we observed that results follow the same trend in overall accuracy (OA) and Dice similarity Coefficient (DC). The Interquartile Range (IQR, i.e. the difference between the first and third quartiles) was narrower in MSF (IQR OA = 2.93 and IQR DC = 2.25), followed by HRL (IQR OA = 13.95 and IQR DC = 13.28) and FNF (IQR OA = 16.85 and IQR DC = 18.83). This confirmed a good performance of MSF forest mask across different European forest types. MSF and HRL present median values higher than 90% in the OA and DC, while, in FNF, these values are 76.05% and 78.55%. The high dispersion of agreement values in FNF also pointed out a high variability of the product across different forest types.    The IQROE (Table 6 and Figure 4) was higher in FNF (22.9%), followed by HRL (13.48%) and MSF (2.65%). The IQRCE showed a similar behavior, being higher in FNF (8.83%) followed by HRL (6.73%) and finally MSF (4.3%). It should be noted that forest omissions were much higher than commissions in the case of HRL and FNF products. On the other hand, in MSF, the commission was greater than the omission, even though they achieved a great balance. The medians were lower than 10% for the MSF forest mask (OE = 2.5% and CE = 2.7%) and HRL (OE = 9.25% and CE = 4.6%), and it was around The IQR OE (Table 6 and Figure 4) was higher in FNF (22.9%), followed by HRL (13.48%) and MSF (2.65%). The IQR CE showed a similar behavior, being higher in FNF (8.83%) followed by HRL (6.73%) and finally MSF (4.3%). It should be noted that forest omissions were much higher than commissions in the case of HRL and FNF products. On the other hand, in MSF, the commission was greater than the omission, even though they achieved a great balance. The medians were lower than 10% for the MSF forest mask (OE = 2.5% and CE = 2.7%) and HRL (OE = 9.25% and CE = 4.6%), and it was around 20% in the case of FNF (OE = 22.6% and CE = 19.2%). The relative bias was higher for FNF and HRL than for MSF, highlighting more imbalanced errors. This fact was reinforced by the relatively high dispersion of relB in FNF and HRL. While for MSF the range of relB is 16.3, with areas with relB = 0%, the range of this metric reaches 88.3 for HRL and 71.9 for FNF. The low IQRs of MSF confirmed its consistent performance in forests with different characteristics compared to the other products analyzed.
There were some relevant differences among the AOIs ( Figure 5). Independently of the forest mask used, the best metrics were obtained in Central and Northern Europe (CZE-1, FRA-2, LTU-1 and LTU-2), where continental forest have in general a continuous and dense canopy. All the forest mask products were less accurate for Portugal and Spain, especially for HRL and FNF, with OA and DC around 20% lower than MSF. Regarding the outliers of agreement metrics, most of them were located in the HRL product, mainly in Southern Portugal (PRT-4, OA; PRT-3 and PRT-4, DC). OE was up to 50% higher in areas with Mediterranean influence (PRT-3 and PRT-4). The greater omission problems were located in HRL and FNF datasets (PRT-3, OE = 71.4% and 59.5%, respectively) in areas dominated by tree-grass ecosystems, where also relative bias reached the highest values. Finally, there is an evidence that Eucalyptus plantations also presented some difficulties to be classified (CE and OE) by FNF. There were some relevant differences among the AOIs ( Figure 5). Independently of the forest mask used, the best metrics were obtained in Central and Northern Europe (CZE-1, FRA-2, LTU-1 and LTU-2), where continental forest have in general a continuous and dense canopy. All the forest mask products were less accurate for Portugal and Spain, especially for HRL and FNF, with OA and DC around 20% lower than MSF. Regarding the outliers of agreement metrics, most of them were located in the HRL product, mainly in Southern Portugal (PRT-4, OA; PRT-3 and PRT-4, DC). OE was up to 50% higher in areas with Mediterranean influence (PRT-3 and PRT-4). The greater omission problems were located in HRL and FNF datasets (PRT-3, OE = 71.4% and 59.5%, respectively) in areas dominated by tree-grass ecosystems, where also relative bias reached the highest values. Finally, there is an evidence that Eucalyptus plantations also presented some difficulties to be classified (CE and OE) by FNF.

Discussion
The validation of the MSF high resolution forest masks on average yielded high scores for the agreement metrics and low for the error metrics. Nevertheless, we found some clear performance differences between the AOIs mainly related to the dominant forest types in each area. The best results were obtained for areas dominated by conifer, broadleaf or mixed forest in Central Europe. These forests are generally characterized by homogenous tree masses and crops which are relatively

Discussion
The validation of the MSF high resolution forest masks on average yielded high scores for the agreement metrics and low for the error metrics. Nevertheless, we found some clear performance differences between the AOIs mainly related to the dominant forest types in each area. The best results were obtained for areas dominated by conifer, broadleaf or mixed forest in Central Europe. These forests are generally characterized by homogenous tree masses and crops which are relatively easy to discriminate using remote sensing data. The shrub stratum is common in these same areas under a dense tree canopy. On the other hand, the metrics were generally worse in areas closer to the Mediterranean climate, such as Portugal and Spain. The areas with the worst results (PRT-3 and PRT-4) are characterized by the high presence of tree-grass ecosystems, which are similar to savannas. Tree-grass ecosystems are characterized by the co-existence of a sparse tree stratum and a grassland matrix with large seasonal contrasts [54,55]. Hence, reflectance contribution to the pixel highly depends on the behavior of the seasonal grass stratum, being this contribution variable depending on the density of the sparse tree stratum [56,57]. The complexity of these ecosystems in Southern Portugal (PRT-3 and PRT-4) led to relatively high omission errors compared to the other AOIs in all the products analyzed. Nevertheless, MSF forest masks had a maximum OE of 9% in these areas, while the HRL product obtained a maximum of 78.2% and FNF a maximum of 71.4%. This highlights the challenge of accurately detecting sparse tree-grass ecosystem [58]. Determining an exact threshold to discriminate pure cover types in transitional environments may lead to misinterpretation of natural landscapes, while providing fuzzy values might be more interesting to perform further analysis [59]. In the case of MSF forest mask, using a different threshold for assigning a pixel to forest (<50%) might change significantly the binary forest mask obtained for Southern Europe. The good performance of the MSF forest mask for this type of forest in comparison to the other products result from the great effort payed on this project to adapt the algorithm to different forest types across Europe, with a strong focus in Mediterranean tree-grass areas. Transitional woody shrublands are also present in some parts of AOIs in Southern Europe, resulting in higher commission errors in Portugal and Spain. For the AOIs located in Galicia (northwest of Spain; ESP-2, ESP-3, ESP-4 and ESP-5), the commission error was relatively high due to the large areas dominated by degraded shrublands, which have a similar spectral response as the dominant Eucalyptus plantations of these region. This effect seemed to be attenuated in HRL, while MSF and FNF showed higher CE in these AOIs.
HRL showed more outliers than the other products when comparing the metrics across AOIs, especially for the Dice coefficient (DC = 35.8%), commission error (CE = 78.2%) and relative bias (relB = −78.2%). These three outliers were yielded by the validation of PRT-4. In this area, HRL did not classify as non-forest any real forest pixel, but this happens at the cost of classifying misclassifying 78% of the predicted forest pixels. In this area, MSF also showed its highest relB but with opposite sign. Therefore, all the pixels classified as forest were correctly predicted (CE = 0%), but many real forest pixels are classified as non-forest (OE = 16.3%), which is common when working on sparse tree-grass ecosystem. This highlights the importance of considering relative bias for a complete performance vision even where the agreement metrics show high skill. Otherwise, it might lead to an optimistic interpretation of the results while the bias indicates the possible lack of robustness of the products in some areas.
The HRL official validation report [60] yielded a R 2 = 0.84 for the Tree Cover Density (TCD) product, with significantly lower values for the Mediterranean region, especially for Portugal (R 2 = 0.66), which coincides with the challenges found in this study. The HRL document also reports low R 2 values for Iceland, the Arctic and Anatolian regions, which are not represented in MSF test areas or in this study. The best results in HRL report were found for Boreal, Pannonian and Continental bioregions (R 2 > 0.85), which also coincide with the outcomes of this study. However, the methodology and results from both studies are not comparable since in the present work a pre-processing of the TCD was carried out to obtain a binary product.
The official FNF validation [33] reported a mean agreement of 86% for Germany and Eastern Europe. This is similar to the accuracy found for FNF in the present study in similar areas (CZE-1, LTU-1 and LTU-2). Nevertheless, Martone et al. [33] validated the FNF product using as ground truth the HRL product. Hence, this value of agreement should be taken with caution as the uncertainty of this validation and the errors of HRL products are not quantified in the FNF validation report.
External validations were carried out by local stakeholders within MSF project. ESP-2, ESP-3, ESP-4 and ESP-5, located in northwestern Spain, reported accuracies ranging between 93% and 100%. However, these AOIs were validated with sets of 8, 14, 31 and 20 points, making the final accuracy not significant enough. LTU-1 reported an overall accuracy of 90.4% and a kappa coefficient (k) [61] of 0.80 comparing MSF forest mask with information from the Lithuanian National Forest Inventory. The OA given by external stakeholders in this case is 10% lower than the OA achieved in this study. For LTU-2, the difference was lower (external OA = 91.1%; external k = 0.82). FRA-1 and FRA-2 were validated by stakeholders using non-systematic visual field control, reporting accuracies of 90% for both areas, around 8% below the OA found in this work. Stakeholders from CZE-1 reported an accuracy of more than 95% through visual inspection using high resolution ortophotos and cadaster data. In this last case, the difference with our OA was under 2%. At the time of writing this article, local validation from Portuguese and Croatian stakeholders is still pending, being the former particularly interesting to compare as it was clearly the most problematic area. However, the results of these external validations are hard to interpret and compare given the variability of methods and data sources used to carry out the accuracy assessments. The differences between the independent validation here presented and the external validation carried out by local stakeholders may be explained by two main points. Firstly, local forestry institutions are not always used to deal with continuous remote sensing data (i.e., based on pixels). Hence, these institutions may not have clearly defined validation protocols for this type of products, leading to several problems such as under-sampling or not taking into account the confidence level or standard error of the metrics computed for a certain area of interest. Secondly, the main purpose of local institutions may differ. Within the MSF project consortium, some institutions were centered in public forest management at national scale (e.g., CFRI), while others were associations of private owners (e.g., FORESNA) or privately owned companies with specific objectives (e.g., RAIZ is part of the pulp industry, while MADERA+ is focused on wood technological properties). This fact, together with the different nationalities involved, led to different definition of forests depending on national forest inventories or other sources. These considerations led to significantly different external validation approaches, which highlighted the need of standard definitions and methods to validate remote sensing-based forestry products, as the protocol proposed in this study.
There are several limitations of the validation exercise presented in this study and are explained below. The characteristics of the AOIs, namely the size, locations and forest types, were defined on the basis of local stakeholders within the MSF project consortium. Hence, while some forest types and regions are well represented (e.g., Eucalyptus plantations in northern Spain or mixed and broadleaf forests), other forest types, such as Northern European boreal forests, are poorly represented within the AOIs. A more detailed analysis and validation could take into account a more proportional representation of all forest types present in Europe. It is also important to bear in mind that, while the MSF forest masks were specifically calibrated for the European continent, the other two datasets analyzed (i.e., Copernicus HRL and Tandem-X FNF) are global datasets. This might explain the apparent better quality of the MSF product in the selected AOIs, which is likely to decrease if the algorithm is calibrated for other environments such as tropical forests or arid regions.
We are also aware that some of the products characteristics such as the date of the imagery used and the spatial resolution may have an impact on the performance and comparison of the products (MSF, HRL and FNF). As previously explained, the three products have different spatial resolutions and were developed using satellite images from different years. The products were validated using a reference dataset built using images from 2017-2018. As there is more consistency between the satellite images used to build the MSF layers and those for the validation dataset, this might have had a penalty of the results of the other two products. However, it is unlikely that difference between the dates used on the reference dataset and product (maximum three years) would have had such a high impact. We assume that the studied forests did not experience significant changes in the last five years, as previously stated directly by local stakeholders. A multi-temporal dataset (i.e., with reference samples for different years) would be needed to identify how much is the impact of difference between products and reference datasets affecting the performance metrics. In the case of the different spatial resolutions used, even when this may have had an effect on the final results, it is not likely to be the main factor explaining such remarkable differences in performance metrics, as we confirmed by doing a visual analysis of the most problematic areas (PRT-3 and PRT-4).

Conclusions
In this study, a protocol was established to validate remote sensing derived forest/non-forest classification maps across Europe. Based on a stratified random sampling, a reference dataset was created using visual interpretation of satellite images in order to validate MSF forest mask. Performance metrics were compared with two similar products: HRL and FNF.
The validation protocol allowed comparing the results of the accuracy assessment for different areas across Europe and different forest mask products. Problems were found when trying to compare the results of this assessment with the validation results provided by local experts. This was caused by the high variability of techniques used by stakeholders and justified the need of clearly established operational protocols for carrying out accuracy assessments of remote sensing-based products.
Accuracies were generally high for all forest mask products (OA = 76-96.3%), with MSF showing a more precise calibration for different European forest types (MSF accuracy ranges from 89% to 99.5%, while FNF accuracy ranges from 56% to 95%), which confirms its suitability as an adequate data source for large scale forest mapping in this continent. The greatest problems to discriminate forest from other land covers were found in areas with Mediterranean influence characterized by the presence of tree-grass ecosystems. Large shrublands in degraded areas also hindered forest discrimination. Performance metrics were significantly better for MSF in these areas compared to the other datasets analyzed, while FNF performance seemed to yield worse results. Note that the reference dataset was built with satellite imagery dates similar to the MSF masks, which also have higher spatial resolution (10 m). The other products used imagery up to three years older and with slightly lower resolution (20-50 m). This difference of dates and spatial resolutions might have slightly impacted the comparison of products. The performance of MSF satellite forest mask in other continents was out of the scope in this study but should be carried out in the near future.
The design of the accuracy assessment and selection of performance metrics highlight the need of establishing clear validation protocols for remote sensing-based forestry products. Efforts should be made in creating unified reference datasets at continental or global scales in order to clearly define the advantages and limitations of incorporating remote sensing-derived forest data in operational forestry processes.