Evaluating Tree Species Mapping: Probability Sampling Validation of Pure and Mixed Species Classes Using Convolutional Neural Networks and Sentinel-2 Time Series

: The accurate large-scale classiﬁcation of tree species is crucial for the monitoring, protection


Introduction
Forests are important for a wide range of ecological, economic, and social reasons.They play a vital role in maintaining the balance of our planet's ecosystems and support biodiversity.They provide climate and water regulation and serve as natural shields, protecting against natural hazards such as soil erosion, floods, landslides, rock falls, and avalanches.Forests offer significant economic benefits, including the production of timber and non-timber forest products, as well as the growth of industries such as tourism and recreation.Overall, forests are essential for preserving the health of our planet, combating climate change and ensuring the well-being of human societies [1][2][3][4][5][6][7][8][9].
Earth observation (EO) data provide valuable advantages for understanding forests, including large-scale coverage, timeliness, multi-sensor data fusion, non-invasiveness, and cost-effectiveness.The data enable the creation of detailed forest maps with high spatial and temporal resolution, providing insights into forest structure, health, and dynamics.Such insights are crucial in the face of increasing environmental challenges.Climate change is altering habitat suitability for many tree species [10][11][12][13], making forests more vulnerable to various threats.For example, pests like Ips typographus and pathogens such as Diplodia pinea are causing increasing damage to forests in Austria and other regions [10,[14][15][16][17]. EO data facilitate the creation of detailed maps of species distribution and abundance [18], which are valuable for monitoring changes in forest composition, identifying areas at risk, and informing management strategies.Such information can aid in developing targeted approaches to mitigate the impacts of climate change and biological threats on forest ecosystems [19].Within the realm of EO data, the availability of the Sentinel-2 (S2) EO mission data has revolutionized the field of remote sensing for tree species classification.With its broad spectral coverage and high spatial and temporal resolution, the S2 data provide an excellent basis for research on tree species classification using remote sensing [18].Many studies have successfully combined S2 data with machine learning techniques to classify pure tree species stands on small to medium-sized areas at the pixel level, demonstrating the potential of this data source for advancing our understanding of forest ecosystems [20][21][22][23][24][25][26][27][28][29][30][31][32][33][34].
Expanding the study areas introduces new challenges, particularly when multiple Sentinel-2 granules (the tiling units of S2 products, see [35]), are needed, due to the difficulty of obtaining temporally matched scenes across the entire area.Large-scale studies are scarce.However, for instance, the authors of [33] used a dense phenology time series to acquire cohesive data across multiple granules, while in [36], monthly composites were generated to map tree species across Germany's entire forested area.
The earlier mentioned studies and many others have laid a strong foundation for further research on tree species classification.However, forest structures are often heterogeneous, consisting of multiple tree species mixed in small areas.Consequently, a single S2 pixel can encompass signals from multiple tree species, and the training and validation data consisting only of pure classes do not accurately represent the forest on the mapped area.While very few studies have addressed the issue of mixed pixels, the authors of [37] modelled a vector of the proportions of tree species on a pixel basis.Additionally, Waser et al. [38] found significant differences between validation using training data holdout sets in comparison with data from an independent source, especially in areas with a high degree of mixture.
In the evaluation of ecological maps produced by machine learning techniques, spatial autocorrelation, which is inherent to the data used, poses a central challenge.As highlighted in a review by [39], neglecting the underlying structure in input data during standard cross-validation can lead to the significant overestimation of a model's predictive power.The study suggested that a thoughtful blocking strategy can mitigate this issue; however, it may also lead to an overestimation of interpolation errors.The study conducted by [40] convincingly illustrated this validation data issue in a large-scale aboveground forest biomass case study.A recent study by [41] compared validation approaches, including validation based on independent probability sampling (IPS), standard (random) cross-validation (CV), and spatial cross-validation (SCV), when the training data were collected in clustered and non-clustered random pa erns.The results indicated that CV performed comparably to IPS in non-clustered cases, while SCV tended to overestimate the root mean squared error (RMSE).However, with clustered training data, as is common in tree species classification studies, CV tended to underestimate the RMSE.Few tree species classification studies have addressed this issue; however, ref. [30] implemented a spatial cross-validation scheme by dividing the area covered by the reference dataset into squares of 10 by 10 km.
Previous studies have primarily focused on traditional machine learning methods such as random forest, support vector machine, or extreme gradient boosting to classify tree species.The recent research studies by [23,42] have shown that deep learning and convolutional neural networks (CNNs) outperform traditional techniques.Bolyn et al. [37] successfully applied U-Nets, spatially aware CNN structures, originally developed by [43] for pixelwise image segmentation, to tree species classification.In addition to capturing spatial features, CNN structures can also be leveraged to analyze the temporal dimension of the Sentinel-2 data.A review by [44] highlighted a CNN structure specifically designed for time series classification, featuring one-dimensional convolutions and residual connections [45], as the top-performing deep learning approach across various time series classification tasks.Furthermore, Xi et al. [31] found that a CNN structure based on one-dimensional convolutions outperformed other approaches in their tree species classification study.
We employ neural networks with one-dimensional convolutions and residual connections, along with dense phenology index time series, statistical parameters, DTM, and vegetation height data to predict tree species on a national scale at the S2 pixel level.Our study's overarching objective is to bring large-scale tree species classification closer to the ground reality of forests, a perspective that has received limited a ention in previous studies.This is accomplished by enhancing the capture of the forests' complexity and variability in both training and validation.
The primary objectives of our study are as follows: (1) to expand the set of pure tree species classes by incorporating mixed classes-each consisting of two pairwise different species-and sparse classes with low canopy cover into training and validation; (2) to evaluate the generated tree species maps through a probability sample-based validation, with a specific emphasis on investigating spatial autocorrelation.
We introduce the following innovative ideas to tree species classification:  the use of a dense time series from multi-annual S2 data developed by [36], providing phenology index data at the S2 pixel level that are consistent across S2 granules;  the inclusion of mixed species classes in training and validation data, specific validation metrics for mixed and pure species classes, and training data synthesis;  a hybrid neural network architecture tailored specifically for the combination of time series and non-time series features, based on [44];  A spatial split and an independent probability sample-based validation including a comprehensive and innovative spatial autocorrelation analysis for both.

Materials and Methods
To commence this section, we offer a concise overview, including a workflow diagram (see Figure 1), along with the motivations driving the subsequent sections.In Section 2.1, we introduce the study area followed by a delineation of the forested area in Section 2.2.Sections 2.3-2.5 delve into the datasets covering the entire study area, which serve as input features for the classification models.
Given that the 10 by 10 m Sentinel-2 pixels often contain more than a single species, mapping tree species faces additional challenges.Following a brief definition of some necessary terms (Section 2.6), we explore the intricacies of classifying not only pure but also mixed and sparse classes and define the classes we aim to separate within the scope of this study (Section 2.7).Moving forward, we address the labeling of data for training and validation purposes.Section 2.8 presents the details of training data acquisition.However, these data present certain challenges, primarily because mixed-class data contain pixels with single species as well as mixed-species pixels.In Section 2.9, we propose the creation of synthetic data to tackle this issue by averaging multiple pixels.After the creation of the training data, Section 2.10 outlines how the national forest inventory (NFI) data were utilized to generate a validation dataset.However, a significant concern arises regarding spatial autocorrelation in training and validation data.Mishandling this issue can lead to overly optimistic estimates of classification accuracies.In Section 2.11, we discuss the strategies to analyze and address spatial autocorrelation, aiming for independence of validation samples and thereby improving the reliability and accuracy of our validation procedures.In Section 2.12, we detail the neural network architecture utilized in this study, while Section 2.13 provides an overview of the training process for the models.Subsequently, in Section 2.14, we explore the validation metrics applied and introduce accuracy measures specifically tailored for the validation of pure and mixed classes.Furthermore, we outline the different validation datasets and explain how the validation results are gathered over multiple training runs.Finally, in the last part of this section (Section 2.15), we present the training data and model configurations employed in our study.The software implementation for data analysis, preprocessing, and other methods was carried out using Python 3.11.5.

Study Area
The republic of Austria is a landlocked country in Central Europe situated between latitudes 46° and 49°N and longitudes 9° and 18°E with a total land area of 83,879 km 2 .More than 70% of the country's federal territory comprises mountains, including sections of the Central Eastern Alps.The altitude ranges between 114 and 3798 m above sea level.Austria predominantly lies within the cool/temperate climate zone, with an oceanic type climate prevailing in the western and northern regions, characterized by humid westerly winds.In the east, the climate is mostly Pannonian-continental, with low precipitation, hot summers, and cold winters [42,46,47].The study area encompassed the entire forested area of the federal territory of Austria, accounting for 47.9% of the total land area (according to the national forest area definition, which differs from the Forest and Agriculture Organization of the United Nations definition by including other wooded lands).The Austrian forest is diverse with over 60 tree species.The species mainly occurring are spruce (Picea abies) 41.8%, beech (Fagus sylvatica) 9.5%, larch (Larix decidua) 4.4%, white pine (Pinus sylvestris) 3.4%, fir (Abies alba) 2.2%, mountain pine (Pinus mugo) 2.2%, oak (Quercus) 1.7%, green alder (Alnus viridis) 0.9%, arolla pine (Pinus cembra) 0.7%, and black pine (Pinus nigra) 0.4%.According to the specific National Forest inventory (NFI) evaluations, the forest area is composed of approximately 32% pure stands (where a single species comprises more than 90%), 47% mixed stands, and 21% areas without trees (temporary unstocked forest areas and shrub areas).The tree coverage is composed of 71% dense (no gaps in the canopy) and 29% sparse canopies.

Forest Area Map
The Austrian Research Center for Forests, Natural Hazard, and Landscape (BFW) routinely publishes a vectorized forest area map derived through a semi-automated process that utilizes aerial photography data, which are sourced from a flight campaign with updates every three years.The derived forest area map was rasterized and aligned with the S2 granules.

S2 Phenology Features
Our study uses an interpolated dense S2 time series with seamless S2 granule transitions as phenology features.S2 is a passive multi-spectral imaging mission with 13 spectral bands and a revisit frequency of 5 days at the equator.The S2 products are made freely available via the Copernicus Data Space Ecosystem (h ps://dataspace.copernicus.eu/,accessed on 19 July 2024) by the European Space Agency.The time series analysis tool (TSA) developed by [36] produces a time series for spectral indices (Table 1) on a single pixel level.The TSA utilizes Level-1C multi annual data and discards areas affected by snow, clouds, and cloud shadows.Additionally, a threshold-based filter is used to exclude outliers.The filtered data points, spanning over multiple years, are combined to create a synthetic single-year phenology dataset.Next, a Savi ky-Golay trajectory [48] is fi ed to this dataset to obtain a smooth modeled main phenology course (MPC) on a single pixel level.

Developed within this study
In addition, the vitality and seasonality metrics were extracted from the MPC.Basic MPC statistics and a percentile transition analysis (PTA) as presented by [52] were used to derive the descriptive metrics such as greening day of the year (DOY), defoliation DOY, the vegetation period, maximum/minimum value and the corresponding DOY, and first reach, last pass, and a set of percentile values.
The phenology input features for this study's classification models included the MPC between DOY 100 and DOY 270 at a temporal resolution of 5 days for each of the indices listed in Table 1, as well as the phenological metrics listed in Appendix Table A1.

Digital Terrain Model Features
Point clouds from airborne laser scanning (ALS) were filtered to last returns only, before interpolating them to a DTM of cell size 1 m.The resulting DTM dataset was provided by the Austrian Federal Ministry of Agriculture, Forestry, Regions and Water Management (BML).These data were collected in a decentralized manner by individual federal states during flight campaigns spanning from 2003 to 2018.A slope map, along with southness (see Equation ( 1)) and eastness (see Equation ( 2)) of the slope aspect were derived from the DTM.These data were resampled to a 10 m resolution, aligned with the S2 granules, and used as additional input data for this study's classification models.

Normalized Digital Surface Model
A digital surface model (DSM) with a spatial resolution of one meter was generated from the digital aerial imagery via image matching [53] using the ApplicationsMaster 13.1 (now Trimble Inpho) software, in particular the Match-T DSM Commander module, by Trimble, Inc. (Westminster, CO, USA) [54].The generated 3D point clouds were subjected to outlier detection and subsequently used to create raster models by selecting the maximal value per cell at one meter resolution.. Subtracting the DTM from the DSM generated the normalized digital surface model (NDSM), which contains object heights measured from the ground level.
The mean, standard deviation, 5-percentile, 95-percentile, and range (maximumminimum) of the NDSM were computed on the 10 m S2 resolution.These resulting data were also used as additional input features for the tree species classification models.

Definitions-Feature Stacks and Feature Space
For each S2 pixel in the study area, a concatenation of input features resulted in a feature stack.These feature stacks (features) consisted of 210 phenology index time series features, 684 phenology metric features, 4 DTM features, and 5 NDSM features.Each of these feature stacks can be viewed as a vector in a 903-dimensional vector space, referred to as the feature space.The component of such a vector corresponding to a specific feature is referred to as the feature dimension.The statistical distribution of features in the feature space for a given dataset is referred to as distribution.For instance, the training data distribution pertains to how the feature stacks are distributed in the feature space, encompassing all the pixels within the training data.

Pure, Mixed, and Sparse Classes
Due to the considerable spatial diversity of forests and the 10 m resolution of S2 pixels, a single pixel may contain the spectral information from multiple tree species [37].To address this issue, we introduced the use of mixed classes, which are composites of two species (e.g., spruce-beech), to improve the classification scheme's ability to represent the complexity and spatial variety in the forest.
We included eight pure classes for the most common tree species, namely spruce, larch, black pine, white pine, mountain pine, beech, oak, and green alder.To account for the remaining deciduous species, the "other deciduous" class was introduced.Furthermore, the eleven mixed classes spruce-fir, spruce-larch, spruce-white pine, spruce-arolla pine, larch-arolla pine, spruce-beech, spruce-deciduous, larch-deciduous, white pineoak, white pine-deciduous, and black pine-deciduous and the four sparse classes, spruce sparse, larch sparse, white pine sparse, and deciduous sparse were added to the classification scheme.Finally, we included a low vegetation class for vegetation below 4 m in height, resulting in a total of 25 classes in the classification scheme (see Table 2).It is worth mentioning that, in several cases, the NFI-VD was used as a visual baseline on the orthophotos to identify nearby patches of the same species, introducing spatial dependency between the training data and the NFI-VD.
The labeled training dataset was obtained by intersecting the labeled training data areas with the S2 aligned features, resulting in a feature stack for each labeled pixel.Any pixels containing empty (no-data) values in their feature stack were discarded.

Synthetic Training Data
Mixed class training areas typically include both pure pixels, containing a single species (a constituent of the mixed class), and mixed pixels.These pure pixels negatively impact the pixel label quality, as they are labeled with the mixed class.To address this issue we applied training data synthesis [55] to all mixed and sparse classes.New synthetic pixels were created by randomly selecting two pixels from a training area and taking their mean in every feature dimension.This process was repeated until the original number of pixels of the area was reached.

NFI Validation Data
The Austrian NFI conducts a systematic grid-based field inventory campaign on a six-year cycle gathering data from approximately 22,000 observational plots on about 200 parameters [56].These observation plots span an area of around 300 m 2 (corresponding to a circle with a radius of 9.77 m) each and are organized into around 5500 clusters.These clusters are evenly spaced with a distance of 3.89 km in a grid that spans the entire federal territory of Austria.Each cluster comprises four plots arranged in a square with a side length of 200 m.The top-level vegetation on each plot is characterized by parameters for the predominant species, secondary species, and potential admixtures.
To create a labeled NFI validation dataset (NFI-VD) for this study, the plots situated entirely within the forest area were selected, totaling around 9200.These plots were labeled based on the predominant and secondary species parameters.Following the tree species classes defined in this study (Section 2.7), approximately 200 plots that did not fit the classification scheme were discarded.Subsequently, the radius of the remaining 9000 plots was extended to a full 10 m (approx.314°m 2 ).The resulting dataset was then overlayed with the 10 m resolution S2 tiles (see [35]; the center of a pixel needed to be within the circular observation plot), resulting in approximately 27,000 class-labeled pixels.
It should be noted that the NFI top-level vegetation description does not differentiate between black pine and white pine.Therefore, these two classes were merged into the "pine" class for NFI validation purposes.Furthermore, the class "low vegetation" was not present in the NFI-VD-class scheme.

Investigating and Addressing Autocorrelation in Training and Validation Data
Autocorrelation significantly impacts the thematic map validation due to the similarity of spatially close points, leading to overestimated predictor accuracies.To address this issue, in the following sections, we created clustered training data spatial splits and buffered the NFI validation data, removing training data within the buffered areas.

Clustered Spatial Splits
To mitigate the spatial autocorrelation between the training data and the training data holdout set, a random clustered spatial split (CSS) strategy ( [41]) on training data holdout sets was employed.The training data were initially organized into clusters, so that for each class, two distinct clusters were at least a variable split distance apart from each other.For a spatial split holdout set, clusters were randomly sampled until at least 5 percent of all training data pixels were added to the holdout set.If the holdout set contained all classes, it was retained, otherwise it was discarded, and the process was restarted.To gain an insight into the relationship between the split distance and holdout set accuracy, this iterative procedure was performed for 10 different random seeds and split distances ranging from 125 to 5000 m, in increments of 125 m.

Buffered NFI Validation Data
To address the autocorrelation between the training data and NFI-VD (see Section 2.8), a buffering strategy on NFI plots was employed.A direct implementation of this approach, given the dense grid of NFI plots, would have led to the removal of a substantial amount of training data.To mitigate this, we opted for a staged approach, as follows:


The NFI-VD plots were divided into 10 folds, denoted as , … , , with classes almost evenly distributed amongst them.This division was accomplished by first determining the number of plots for each class .For each fold , random observational plots (not already assigned to another fold) were selected.If the class of a selected plot was not yet represented by more than in , the plot was added to the fold.Additionally, the other plots from the same NFI cluster (each comprising four NFI plots) were included in the fold.


For each NFI-VD fold, training areas contained within any buffered NFI plot from the respective fold were eliminated.This process yielded 10 sets of training data, denoted as , … , , each corresponding to an NFI validation data fold.


For each set of training data , a model was trained. Subsequently, each model was validated using the corresponding NFI validation data fold . Finally, the validation results over all folds were evaluated.
To explore the relationship between the buffer distance and NFI-VD accuracy, split distances ranging from 250 to 20,000 m with increments of 250 m (from 250 to 5000 m) and increments of 2500 m (from 7500 to 20,000 m) were chosen.Based on these parameters, two series of experiments were conducted.The first involved discarding training data based on buffer distance, as described earlier.In the second series, the discard rate was maintained at a constant level up to a buffer distance of 15,000 m.This was achieved by discarding training areas within the current buffer distance individually for each split and class and then randomly discarding training areas until the discard percentage for the current split and class matched the discard rate at 15,000 m buffer.

Neural Network Architecture
In this sub-section, we draw upon the foundational definitions in neural network literature, particularly the insights provided by [57], to establish a common understanding of the key concepts.Our study adopted a hybrid neural network architecture that combined the ResNet architecture proposed by [44], with downscaled numbers of filter channels for the convolutional layers, and a multilayer perceptron (MLP).The neural network models were implemented within the scope of this study using the PyTorch library (version 2.0.1),particularly the torch.nnmodule for constructing and training deep learning architectures.
The time series classification ResNet architecture applied in this study consists of three consecutive blocks each comprising multiple convolutional, normalization, and activation layers.The convolutional layers scan the time series data within a moving window extracting temporal pa erns.The normalization layers standardize the output of the convolutional layers, improving training stability.Lastly, the activation layers introduce non-linearity to the network, enabling it to learn complex (non-linear) relationships within the data.With each block, a residual connection allows for the direct propagation of data from the input to the block, to the stage immediately preceding the final activation function within the block.
Figure 2 depicts the detailed flow of data in this study's MLP-ResNet hybrid structure.The phenology time series data for each index are fed into the first ResNet block.Within each block, one-dimensional convolutional layers are applied with parameters described as "conv(8 × 1, 6, 16)" where "8 × 1" represents the filter size (8 wide and 1 high), "6" represents the number of input channels, and "16" represents the number of output channels.Each convolutional layer is followed by a batch normalization layer with a batch size of 128 (which is the same for all batch normalization layers), and a ReLU activation function.After the third block, a pooling layer averages along the time-axis resulting in 32 output features of the ResNet.These 32 output features together with the non-time series input features are then fed into an MLP consisting of five liner layers (input and output dimensions given within the brackets), each followed by a batch normalization layer and a LeakyReLu [58] activation function, except for the very last one, which is trailed by a softmax layer, resulting in class probabilities.

Neural Network Training
The training data were divided into 95% training data (~540,000 pixels) and 5% holdout set validation data (~28,000 pixels).When training models to be validated with NFI-VD, the holdout set was a simple random sample, otherwise it was selected as a CSS, as described in Section 2.11.1.Before training and inference, the data were standardized by ( − )/ , where is the feature vector, is the vector of feature means, and is the vector of feature standard deviations.Both and were calculated using forest data of the complete study area.Following rigorous experimentation to optimize the training parameters, this study employed the following se ings: conducting neural network training over nine epochs, employing a batch size of 128, and configuring the learning rate to 0.00025 (see [57] for definitions).During training, the performance of the model was evaluated on a holdout set to detect any signs of overfi ing with respect to the training data.As a training criterion, the cross-entropy loss was used along with the Adam optimizer [59] and a learning rate scheduler to reduce the learning rate during loss-plateaus (see [57] for definitions).The models were trained on a NVIDIA GeForce RTX 3070 Ti (Nvidia Corp., Santa Clara, CA, USA)with training times ranging from 10 to 30 min.Inferring the models on the complete study area took about 20 h.

Tree Species Map Validation
A multitude of validation metrics were computed to assess the resulting tree species maps, as follows:


Overall accuracy (OA) for the NFI-VAL and NFI-weighted OA (NFI-w-OA) during training (the holdout set OA was weighted by NFI-Class-Distribution).


Overall misclassification score (OMS) for the NFI-VAL: A score calculated based on the severity, judged by the phenological similarity, of all misclassifications in the confusion matrix.A score of 1.00 is the best possible value and means that all predictions are correct.See Appendix B for details.


Prediction in close phenological proximity (PCPP) for the NFI-VAL: Predictions where at least one of the involved classes is correctly predicted.Examples include pixels predicted as spruce-fir but validated as spruce, pixels predicted as sprucelarch but validated as larch-arolla pine, pixels predicted as pine-oak but validated as oak, and pixels predicted as spruce-beech but validated as spruce-deciduous.


Deciduous and coniferous confusions (DCC) for the NFI-VAL: Confusions between pure coniferous and pure deciduous classes.Examples include pixels predicted as larch but validated as oak and pixels predicted as beech but validated as sprucelarch.


Post hoc pure class overall accuracy (POA), determined by eliminating all non-pureclass entries from the confusion matrix.


Post hoc mixed class overall accuracy (MOA), calculated by eliminating all nonmixed-class entries from the confusion matrix. Macro-averaged (each class was weighted equally) F1 score (MAF1). F1 scores, producer and user accuracies, and misclassification scores on a class level.


Confusion matrices for splits as well as aggregated confusion matrices over multiple splits.
The OA for the NFI validation, as a measure, closely aligns with the NFI-w-OA observed during training, given that the NFI validation data are a representative sample of the Austrian forest.In all validation methods, the sparse classes were merged with their respective base class-spruce-sparse with spruce, for instance.
The classification models were assessed using three different data sources: random training data holdout sets, CSS (see Section 2.11.1) training data holdout sets, and NFI-VD (see Section 2.11.2).

Random Holdout Set Validation
The randomly selected pixels from the entire training dataset formed a random training data holdout set for model validation.

Clustered Spatial Split Validation
To assess a model configuration with CSS training data holdout sets (CSS-VAL), 10 separate CSS were generated.Models were trained and validated in three distinct runs for each CSS fold.Subsequently, the means and standard deviations of validation metrics and an aggregated confusion matrix were computed from these 30 validations.See Section 2.11.1 for details on CSS.

NFI Data Validation
The merging of black and white pine classes (see Section 2.10) as well as the combination of sparse and base classes (see the beginning of this section) led to an NFI validation (NFI-VAL) scheme, comprising 19 classes.
To validate a model configuration with NFI-VAL, a model was trained on each of the 10 reduced training datasets, as detailed in Section 2.11.2, the results were aggregated, and means and standard deviations were calculated.
To address the spatial inaccuracies of NFI-VD as well as the alignment of the S2 data, nine positional variants were considered for each NFI plot.One variant was centered precisely at the given coordinates, while for four variants the position was shifted one S2 pixel to the north, east, south, and west, respectively.For the remaining four variants, the position was shifted by one Sentinel-2 pixel in two contiguous directions, resulting in diagonal shifts.Out of these nine positional variants, the ones with the most pixels within the forest area were selected, and then the one with the best validation match was considered for the final validation result.

Training Data and Model Configurations
In this section, the data and model architecture configurations employed in the following results section are delineated.These configurations were selected after an extensive series of experiments that explored various aspects, including varying features, incorporating synthetic data, adjusting model capacities, and modifying training and architecture parameters.From this series, we aim to spotlight the configurations that yielded the most noteworthy results.
The base model's configuration encompassed all features outlined in Sections 2.3-2.5.It integrated the synthetic training data for the mixed and sparse classes (see Section 2.9), and the model's architecture is as described in Section 2.12.The training data were standardized employing means and standard deviations computed from all forest data, and the training used the parameters detailed in Section 2. 13.
The subsequent models utilized the base model's configuration, with specific variations as explicitly described (see Table 3 for architecture details), as follows:


The no_syn model used the raw training data without any synthetic data for mixed or sparse classes.


The res_xx_yy_yy models were built with a with xx filters in the first block of the residual network and yy filters in the second and third blocks.Additionally, the sizes of the trailing MLP layers were adjusted.The final map was created by ensembling the 10 NFI fold models.It determined the predicted class for each pixel based on the majority vote of the models.In case of a tie, the tiebreaker relied on the sum of classes probabilities (see Section 2.12).

Clustered Spatial Split Distance Analysis
As described in Section 2.11.1, three models were trained for each of the ten spatial splits, covering split distances ranging from 125 to 5000 m with an increment of 125 m. Figure 3a illustrates the mean NFI-weighted overall accuracy (NFI-w-OA) and the mean MAF1 across the spatial split distances.The results reveal a clear decline in accuracies up to 3000 m, followed by a stabilization around 4000 m, particularly in the macro-averaged F1 score (MAF1).Figure 3b shows the standard deviations in the holdout set size relative to training data for the individual classes and splits, demonstrating a clear upward trend with increasing split distance.This trend indicates that the holdout set share per class and split becomes increasingly unequal as the split distance grows.

NFI Validation Buffer Distance Analysis
To analyze the effects of the buffer distance on the NFI-VAL results, two series of experiments for buffer distances ranging from 0 to 20,000 m with varying increments were conducted (see Section 2.14.3).
The results of the first series, where the training data discard was determined solely by the buffers are depicted in Figure 4. Figure 4a shows the accuracy results for overall accuracy (OA) and MAF1 per buffer distance, while Figure 4b illustrates the amount of training data discard due to the buffering of the NFI plots per buffer distance.After a small drop from 0 to 250 m, the accuracy results remain steady until approximately 5000 m buffer distance, with a training data loss of up to about 30%.From 7500 m onwards, accuracies start dropping significantly, while the training data discard rises to 95%.The results for the second series, where the amount of training data discarded was kept constant up to 15,000 m, show steady accuracies until the 10,000 m buffer distance and a slight drop from 10,000 to 15,000 m.The standard deviations for the accuracy measures are higher than in the first series.The data for 17,500 and 20,000 m were not affected by the method for this second series, and the accuracy values are in accordance with that (Figure 5).

Models
In this section, we present the validation results of the set of models described in Section 2.15.During the spatial split and NFI fold training, the configurations mostly achieved 99% NFI-weighted overall accuracy (NFI-w-OA) and 99% NFI-w-OA on a random holdout set for NFI fold training.The exceptions were the no_syn, res_8_16_16 models, which reached 98%, and the res_8_16_16s model with 96%.The results of the clustered spatial split validation (CSS-VAL) are presented in Table 4.The best values for each metric are highlighted in bold.
In the CSS-VAL assessment, the base model outperformed (measured by the NFI-w-OA) the other models by reaching an NFI-weighted overall accuracy (NFI-w-OA) of 73.8% (±5.4) and a macro-averaged F1 score (MAF1) of 55.0% (±3.5).The integration of the synthetic training data, as illustrated by comparing the base model with the no_syn model, leads to a significant enhancement in the classifier accuracy.Specifically, we observed an increase of 10% in NFI-w-OA and 5.2% in MAF1.Model capacity variations for the base model exhibited a minor downward trend at the lowest parameter counts.Interestingly, the parameter reduction had a more pronounced effect on NFI-w-OA compared to MAF1.
The accuracy measures of the NFI-VAL are presented in Table 5.In the NFI-VAL assessment, the base model outperformed the other models, achieving an overall accuracy (OA) of 55.3% (±1.8),MAF1 of 42.0% (±1.3), and an overall misclassification score (OMS) of 1.63 (±0.04) (see Section 2.14).The model demonstrated an accuracy of 79.4% (±1.4) when disregarding confusions in phenological proximity (PCPP), such as spruce-larch confused with spruce-beech (see Section 2.14) and the confusion percentage between deciduous and coniferous classes (DCC) was 1.5% (±0.3).Additionally, the post-hoc pure class overall accuracy (POA) reached 90.7% (±1.4), while the post-hoc mixed class accuracy (MOA) was 64.6% (±4.2).The confusion matrix and additional accuracy measures (Appendix Tables A2-A4) reveal additional noteworthy observations: Within the pure classes, larch and pine demonstrate lower F1 scores compared to other pure classes, and mixed classes exhibit lower scores than pure classes.
The integration of synthetic training data, as demonstrated by comparing the base model with the no_syn model, significantly enhanced the classifier's predictive power.Specifically, the OA increased by 7.6%, the MAF1 by 1.4%, the OMS was improved by 0.21 points, the POA by 0.9%, and the MOA by 6.2%.However, it is noteworthy that the PCPP and the DCC slightly worsened.The reduction in model capacity (res_8_16_16 and res_8_16_16s) showed a subtle downward trend in the accuracy metrics, predominantly reflected in the OMS.
The tree species map over the area of the Austrian federal territory is presented in Figure 6.An online-accessible and user-friendly version of the resulting tree species map from this publication can be found on the Austrian National Forest Inventory's Homepage [60]: h ps://www.waldinventur.at/?x=1486825&y=6059660&z=7.75968&r=0&l=1111#/map/1/mBaumartenkarte/Bundesland/erg9 (accessed on 25 January 2024).It provides a simplified and interactive representation of the map, allowing users to easily explore the data.

Discussion
In this study, the classification of tree species was conducted across a large area spanning 40,178 km 2 of forests.By utilizing a dense phenology time series extracted from freely available Sentinel-2 (S2) imagery, our methodology demonstrates its suitability for largescale applications.To align with the scale of this study, a comprehensive training dataset consisting of about 570,000 S2 pixels (~57 km 2 ), spread across the entire study area, was collected via orthophoto interpretation.The inclusion of mixed classes in this study's classification scheme offers a robust alternative to the approach suggested by [37] to align training and validation data more closely with real forest conditions.The utilization of synthetic data resulted in significant improvements in the model's predictive power.The challenge of spatial autocorrelation typically remained inadequately addressed in largescale tree species classification.Our study proposes methods to assess the influence of spatial dependencies specifically from the perspective of map validation.In addition to employing a clustered spatial split validation (CSS-VAL), we utilize a validation method based on a ground truth probability sample (NFI-VAL).Through the analysis of national forest inventory data-based validation (NFI-VAL) results across various buffer distances, we assert that this validation approach closely approximates the true independence of the validation data.Comparing different validation approaches emphasized the importance of accounting for spatial autocorrelation and accurately modeling the complexity of the forest.The post-hoc pure class accuracy of 91%, given the large study area and the meticulous validation method, compares well with other studies focusing exclusively on pure classes.

Mixed Species Classes, Training Data Labeling, and Training Data Synthesis
The significant disparity between the post-hoc pure class accuracy (POA) and the overall accuracy (OA) in the NFI-VAL underscores the importance of accounting for mixed species stands in tree species mapping validation, particularly in regions with high forest diversity.From a training perspective, including mixed species information into the training data is essential for modeling diverse forest areas.Understanding the interactions and overlaps between different species within the same spatial context can enhance the generalization capabilities and predictive performance of the models.However, a major challenge arises.While it is feasible to label training areas with proportions of large numbers of different species, the quality of labels at the single-pixel level deteriorates, because mixtures are typically not spatially homogeneous.We believe that mixed species classes, consisting of two different species, represent a natural progression in complexity from considering only pure species stands, which is common in large-scale tree species mapping.
The training data were labeled through an iterative process that took model evaluation and consistency metrics into account.Additionally, we identified areas of relatively low probability for the predicted class and sought to locate training areas within these regions to improve the model's weak spots.Labeling the mixed data classes presents inherent difficulties.One main challenge is the reduced separability of different classes in the feature space, as classes containing the same species are naturally closer together.Furthermore, when vectorizing training areas, achieving a consistent mixture in every S2 pixel of a mixed class's two constituent species can be challenging.To add to that, training areas that exhibit too much homogeneity reduce the variety in the training data and fail to represent real forest conditions, resulting in a trade-off between label quality and training data representativity.In our study, mixed class training areas typically contain both pure and mixed pixels (i.e., a spruce-beech labeled polygon contains pure spruce, pure beech pixels, and spruce-beech mixed pixels), and up to 10% species impurity was permi ed.This label inconsistency has a negative impact on class separability; however, synthetic training data can help to address the issue.
The introduction of synthetic data for mixed classes demonstrated significant increases in CSS and NFI validation.Intuitively, this improvement can be a ributed to the synthesis (averaging of pixels) counteracting the occurrence of pixels containing only a single (constituent) species (of the target class) in training areas.
Synthetic data in the context of machine learning are not a novel concept; see the recent survey by [61], to give one example.The synthetic training data approach was based on the simplified model that describes the spectral signature of multi-species pixels as a linear combination of the constituents.

Neural Network Architecture
In a review study by [44], a residual neural network (ResNet) architecture employing one-dimensional convolutions and specifically designed for time series classification, emerged as the best performing deep learning approach across various time series classification tasks.Furthermore, a study by [31] found that a neural network structure based on one-dimensional convolutions outperformed other approaches in their tree species classification study.Our study's architecture is based on the ResNet architecture presented in [44] and specifically tailored to address the intricacies of our classification task, accommodating both phenology time series features and non-time series features such as phenology metrics, DTM, and NDSM.We opted against using spatially aware architectures like U-Net ( [43]).Due to our training data's clustered nature and homogeneity, the data featured more homogeneous pixel neighborhoods compared to actual forest conditions.Therefore, we believe the addition of neighborhood relations to the models' input data would exacerbate the existing mismatch between the training and real-world data.

Autocorrelation Analysis
The autocorrelation analysis of the CSS-VAL depicted in Figure 3a reveals an anticipated trend: as the split distances increase and the autocorrelation decreases, the decline in accuracy levels off. Figure 3b highlights a discernible trend of increasing variation in holdout set ratios relative to training data for individual classes and splits.We believe this trend is particularly pronounced in classes with fewer training samples and more spatially concentrated distributions.As the split distance grows, individual clusters expand, making it increasingly challenging to maintain a consistent holdout set ratio across multiple splits and classes.The growing variation in holdout set ratio explains the larger standard deviation in accuracy with growing distances, exhibited in Figure 3a.These results, particularly those in Figure 3a, suggest quasi-spatial independence within the set of training data polygons at a split distance of 4000 m.
As the collection of training data was not independent of the NFI-VD, we utilized a buffered NFI-VAL to examine the spatial dependency.Figure 4 illustrates a slight drop at 250 m, a plateau up to 5000 m, and a decline in accuracies with increasing buffer distances from 7500 m onwards, especially notable when a substantial portion of training data was excluded due to the expanded buffer sizes.To further test the spatial autocorrelation between the training and NFI-VAL data, we initiated an experiment where the discard of training data was kept constant up to a buffer distance of 15,000 m (see Section 2.11.2), prompted by the significant drop in accuracies observed when the training data were discarded.The results in Figure 5 showed stabilized accuracies for up to 10,000 m buffer distances and slight drops at 12,500 and 15,000 m.Overall, these results provide evidence of the NFI-VD's quasi-spatial independence even at relatively small buffer distances, affirming the suitability of the NFI-VAL as a measure for the model's predictive power.
It is interesting to note the significant differences in autocorrelation distances between the CSS and NFI-VAL datasets.We a ribute this discrepancy to the differing spatial distributions: the CSS dataset reflects a sample of the training data's spatial pa ern, whereas the NFI-VAL dataset represents a systematic probability sample of the entire study area.The training data have clusters of labeled polygons with varying densities, reflecting the nature of the training data collection process.These clusters might capture specific localized pa erns and result in stronger spatial dependencies.In contrast, the evenly distributed NFI-VAL dataset does not exhibit these localized pa erns, leading to different autocorrelation characteristics.
Wadoux et al. [41] highlighted that due to spatial autocorrelation, the accuracy of predictors for thematic maps, trained on clustered training data, may be overestimated when validated with randomly selected data from the same training set.Consistent with this claim, our study demonstrates a substantial decline in accuracy when comparing random holdout set validation and clustered spatial split validation (CSS-VAL, see Section 4.4).Our approach for analyzing spatial autocorrelation, based on model validation, seamlessly incorporates a high number of input dimensions, unlike more traditional methods such as semi-variograms, and directly quantifies the impact of spatial autocorrelation on the model's accuracy.However, it is important to note that these approaches are contingent upon the characteristics of the data (both input and validation) and the model itself and the specific results may not generalize well to different datasets.Therefore, we advise against directly applying the resulting split and buffer distances to other studies.

Validation
Our study underscores a well-known principle: the importance of independent and representative ground reference data in validating tree species maps.This finding is consistent with results from various studies across different machine learning disciplines, including [39][40][41].While we did not encounter large-scale tree species classification studies addressing this specific issue, research such as [38] on the classification of dominant leaf types has identified notable differences in validation outcomes when comparing validation data from the training distribution with data from an independent distribution, especially in areas with a high degree of mixture.
Despite its utility, employing NFI-VD presents certain challenges and limitations.Spatial inaccuracies arising from differences between the geolocation of NFI plots and S2 images can significantly impact validation results.As reported by [62], the long-term performance for unrefined S2 products is close to 11 m or be er at 95% confidence.Since the activation of the global refinement in August 2021, the absolute geolocation error is be er than 7.1 m for S2A and 5.6 m for S2B at 95% confidence.Therefore, the data used in our study, spanning from 2017 to 2021, are mostly affected by an 11 m or be er geolocation error at 95% confidence.Given the small plot areas (two to four S2 pixels), even a shift by one pixel significantly impacts the validation results.To mitigate this issue, we implemented a validation with shift variants, which offers greater stability under these spatial inaccuracies (see Section 2.14.3).Additionally, the classes provided by the NFI-VD do not entirely align with the classification scheme.Specifically, the white and black pine classes could not be distinguished in the NFI-VD, and the low vegetation class could only be assigned in the confusion matrix when it was misclassified.Furthermore, plots labeled as mixed classes by the NFI-VD may contain single pixels containing only a single constituent species.These cases are not correctly represented in the NFI-VD, as full plots are being labeled instead of single pixels.For example, an NFI plot labeled as spruce-beech may intersect with three S2 pixels, of which two contain spruce and beech while one contains only spruce.Even if the model correctly predicts the pixel with only spruce as spruce, it is still considered a misclassification.This discrepancy prompted us to introduce an overall misclassification score (OMS) and a prediction in close phenological proximity (PCPP) metric to model the phenological differences and similarities between species, and pure and mixed classes.These metrics allow for a be er interpretation of model performance and guide development.

Results and Model Performance
The validation results for the best performing model (base) exhibit a tremendous gap between the random holdout set validation (99% NFI-weighted overall accuracy (NFI-w-OA)) and the CSS-VAL (74% NFI-w-OA).We believe this decrease of about 25% can be a ributed to the spatial autocorrelation inherent in the (training) data.To maintain sufficient label quality on the single S2 pixel level, the training areas were selected to be homogeneous in species distribution.However, this led to increased pixel similarity within training areas, especially after training data synthesis.This pixel similarity within training polygons explains the drop from 99% NFI-w-OA to about 87% NFI-w-OA at a 125 m split distance, where clusters mainly consist of single polygons.The further decline of 13% to a 4000 m split distance can be a ributed to both the spatial autocorrelation inherent to the data, independent of training area homogeneity, and the similarity of homogeneous training areas in spatial proximity.These findings highlight the absolute necessity of accounting for spatial autocorrelation in tree species map validation, particularly when working with clustered training data.
The comparison of CSS-VAL NFI-w-OA and NFI-VAL overall accuracy (OA) results for the base model shows a vast decline of about 20% in OA (given that the NFI-VAL is naturally NFI-weighted, these metrics compare well).We a ribute this decline to the feature distribution disparity between the training data and ground truth forests.While homogeneous polygons provide a feasible way of labeling training data, in diverse forest situations where individual pixels can differ greatly in close spatial proximity, they fail to capture real forest complexity.We believe this argument is further supported by the high post-hoc pure class overall accuracy (POA) of 91% and the higher F1 scores for pure classes (see Appendix A Table A4), showcasing that in simple forest compositions, the model is much more capable of generalizing from training to inference data than in situations with species mixtures.
The overall misclassification score (OMS) provides a valuable tool for comparing the quality of generated maps, taking the intricacies of mixed class confusions into account.The base model's prediction in the close phenological proximity (PCPP) metric result of 79% provides an important context to the OA of 55%.A total of 24% of the confusions occur in close phenological proximity to the target class when one of the species in the mixed classes matches, but the other does not.The deciduous coniferous confusions (DCC) of 1.5% remain low, recognizing the significant phenological differences between coniferous and deciduous species, especially during leafing-out and leaf fall.Table A4 in Appendix A reveals interesting details on the class level.In F1 scores, pure classes-apart from larch and pine-clearly and expectedly outperform mixed classes.The misclassification scores paint a different picture.Here, spruce, spruce-coniferous mixed (except for spruce-arolla pine, a class with very li le training data), mountain pine, and green alder are the best performing classes.This reflects the results in Appendix A Table A2, where most of the confusions for the spruce and spruce-coniferous classes happen amongst themselves.Too li le training data for individual classes can negatively impact their performance, as shown by classes such as spruce-arolla pine and pine-oak.However, when classes are less difficult to separate, such as mountain pine and green alder (due to their low height), given small amounts of training data, they can still perform well.
Figures 8-10 showcase diverse age structures and mixtures, predominantly involving two species in spatial proximity, as identifiable in the CIR orthophoto.The model demonstrates proficient handling of these scenarios.In Figure 11, a characteristic mixed forest is depicted, featuring a diverse array of tree species within a confined spatial scale.Furthermore, the presence of sparsely populated regions, where different ground vegetation types introduce noise to the signal from the actual forest canopy, poses significant challenges for the model.
The comparison of the base and no_syn model's results exhibits profound improvements in OA, OMS, and post-hoc mixed class overall accuracy (MOA).The POA was barely affected, as the pure class training data were not synthesized.The 2% decrease in PCPP is noteworthy: at the PCPP level, confusions between pure classes and mixed species classes containing the respective pure class are not considered.The idea behind synthetic training data was to reduce labeling errors in mixed class training areas when pure species pixels were present.The PCPP not improving with training data synthesis supports our theory on mixed class label quality.The slight decrease emphasizes that the homogeneous training data decrease the models' generalization capabilities.Overall, the considerable accuracy improvements resulting from training data synthesis highlight that mixed class label quality was improved and justify the resulting reduction in training data representativity.
To test for the potential overfi ing of our models, we conducted an extensive series of experiments using models with varying capacities and presented a few key results.Investigating the results of the complete series, the NFI-VAL and CSS-VAL showed no signs of overfi ing, but rather a slight downward trend in accuracy with decreasing model capacity, as is reflected by the published results.

Conclusions
This study underscores the imperative need to approach real forest complexity in modeling and validation while accounting for spatial autocorrelation in the validation process.Our findings reveal immense disparities between the random training data holdout set and clustered spatial split validation, highlighting the importance of considering spatial factors in reliable tree species map evaluation.Additionally, we observed a substantial decline in accuracy when validating with an independent probability sample and a major accuracy increase when only pure species classes were considered, emphasizing the need to account for real forest complexity both in validation and modelling.
In our study, we introduced several innovative methods.We incorporated mixed species classes into the classification scheme, allowing us to be er capture the diversity of forests on a Sentinel-2 pixel level.We implemented training data synthesis for mixed species classes, significantly improving accuracy results, and developed validation metrics tailored to the intricacies of mixed species class validation.Our in-depth analysis of spatial autocorrelation solidified the independent probability sample-based validation approach, investigating its susceptibility to spatial biases.
However, this study is not without its limitations.The geolocation accuracy of Sentinel-2 imagery and NFI plots poses challenges in the evaluation process.Furthermore, the

Misclassification Score
To account for the complexity of class matches in validation introduced by mixed classes, a set of misclassification levels and a misclassification score were developed.Firstly, the classes were partitioned into five meta-classes:


Pure coniferous consisting of spruce, larch, white pine, black pine, and mountain pine.


Pure deciduous consisting of beech, oak, green alder, and other deciduous.
Next, seven misclassification levels were established to model misclassification severity, considering matches and confusions between meta-classes and predicted species.Confusions between meta-classes, excluding low vegetation, were treated more severely than confusions within meta-classes.The misclassification levels are defined as follows:


Level 1: An exact match between the predicted and the validated class. Level 2: Confusion within mixed coniferous or between pure coniferous and mixed coniferous, where one species of the predicted class matches the validation.Examples include predicted spruce-fir but validated as spruce and predicted spruce-larch but validated as larch-arolla pine. Level 3: Confusion between pure coniferous or pure deciduous and mixed coniferous-deciduous, between mixed coniferous and mixed coniferous-deciduous or within mixed coniferous-deciduous, where one species of the predicted class matches the validation.Examples include predicted spruce but validated as spruce-beech, predicted pine-oak but validated as oak, predicted spruce-white pine but validated as spruce-deciduous and predicted spruce-beech but validated as spruce-deciduous. Level 4: Confusion within pure or mixed coniferous, between pure and mixed coniferous, within pure deciduous, within mixed coniferous-deciduous, where no species of the predicted class matches the validation.Examples include predicted spruce but validated as larch, predicted larch-arolla pine but validated as spruce, predicted larch-arolla pine but validated as spruce-white pine, predicted beech but validated as oak and predicted spruce-beech but validated as black pine-deciduous. Level 5: Confusion between pure or mixed coniferous and mixed coniferous-deciduous or between pure deciduous and mixed coniferous-deciduous, where no species of the predicted class matches the validation.Examples include predicted larch-arolla pine but validated as spruce-oak and predicted oak but validated as spruce-deciduous. Level 6: Confusion between pure or mixed coniferous and pure deciduous.Examples include predicted larch but validated as oak and predicted beech but validated as spruce-larch. Level 0: Confusion between any meta-class and the low vegetation class received distinct handling because the low vegetation class was not included in the NFI-VD data.Therefore, only user confusions could be calculated for these cases.
Misclassifications up to Level 3 coincide with the prediction in close phenological proximity (PCPP) metric defined in Section 2.14.In Level 4, the distinction between coniferous, deciduous and mixed is still correct.Levels 5 and 6 were considered severe errors.Specifically, Level 6 corresponds to the deciduous and coniferous confusions (DCC) metric defined in Section 2.14.To evaluate the performance of the classifiers, each cell in the confusion matrices was assigned a misclassification level based on the rules previously described.User and producer misclassification scores were then computed for each class as a weighted average over the corresponding row and column in the confusion matrix, respectively.The weights were determined by the misclassification level of each cell.Finally, an overall misclassification score OMS was computed as a weighted average over all gradings and classes, with the weights determined by the number of validation samples for each class.

Figure 3 .
Figure 3. Analysis of accuracy measures over spatial split distances.

Figure 5 .
Figure 5. National Forest inventory data validation (NFI-VAL) buffer distance analysis with training data discard kept constant up to 15,000 m.

Figure 6 .
Figure 6.Tree species map over the entire forest in the study area.

Figures 7 -
Figures 7-11 display a color-infrared (CIR) orthophoto, provided by the Austrian Federal Ministry of Agriculture, Forestry, Regions and Water Management (BML), on the left, with the tree species map overlaid on top of it on the right.

Figure 7 .
Figure 7. Tree species map on an intermediate zoom level.

Figure 8 .
Figure 8. Tree species map intermediate-close zoom level: predominantly mountain pine, white pine, spruce, and larch.

Figure 9 .
Figure 9. Tree species map intermediate-close zoom level: predominantly black pine, black pineother deciduous, and beech.

Figure 10 .
Figure 10.Tree species map intermediate-close zoom level: predominantly spruce and beech.

Figure 11 .
Figure 11.Tree species map intermediate-close zoom level: typical mixed forest.

Table 2 .
Training pixels and areas per class.
2.8.Training Data LabelingLabeled training areas were obtained by visually interpreting orthophotos, identifying single-class areas, and marking them with polygons.These areas met the following quality requirements:  spatial disjointedness from the NFI validation dataset (see Section 2.10);  at least 90% target class composition;  a minimum size of 3000 m 2 ;  for mixed classes, homogeneous mixing of both tree species;  for sparse classes, homogeneous mixing of canopy cover and ground-level area.