Tree Species Mapping on Sentinel-2 Satellite Imagery with Weakly Supervised Classification and Object-Wise Sampling

Information on forest composition, specifically tree species and their distribution, aids in timber stock calculation and can help to better understand the biodiversity of a particular region. Automatic satellite imagery analysis can significantly accelerate the process of tree species classification, which is traditionally carried out by ground-based observation. Although computer vision methods have proven their efficiency in remote sensing tasks, specific challenges arise in forestry applications. Forest inventory data often contain the tree species composition of each stand but do not describe the spatial distribution of species within it. Therefore, some pixels can be assigned a wrong label in the semantic segmentation task if we consider each stand to be homogeneously populated by its dominant species. Another challenge is the spatial distribution of individual stands within the study area: classes are usually imbalanced and distributed nonuniformly, which makes the choice of sampling strategy more critical. This study aims to enhance tree species classification based on a neural network approach by providing automatic markup adjustment and an improved sampling technique. For forest species markup adjustment, we propose a weakly supervised learning approach based on the knowledge of the dominant species content within each stand. We also propose substituting the commonly used CNN sampling approach with an object-wise one to reduce the effect of the spatial distribution of forest stands. We consider four species commonly found in Russian boreal forests: birch, aspen, pine, and spruce. We use imagery from the Sentinel-2 satellite, which has multiple bands (in the visible and infrared spectra) and a spatial resolution of up to 10 m. A data set of images for Leningrad Oblast of Russia is used to assess the methods. We demonstrate how to modify the training strategy to outperform a basic CNN approach, improving the F1-score from 0.68 to 0.76.
This approach is promising for future studies aiming to obtain more specific information about stand composition, even from incomplete data.


Introduction
Many ecological and forest management studies are based on knowledge about tree species within a region of interest. Such knowledge can be used for the precise analysis of natural conditions [1], the development of ecological models [2], and for conservation and restoration decision-making [3]. Accompanied by other characteristics, such as tree age, height, and crown width, tree species information can be leveraged for timber volume [4,5] and biomass estimation [6].
A commonly used approach for forest type data gathering is field-based measurement, which has the obvious drawbacks of acquisition cost and difficulty. Many studies are now focused on the automatization of land-cover surveys through the use of remote sensing-derived data. This approach is preferable when analysing vast territories; for instance, the creation of large-scale maps has been described in [7,8]. For such tasks, both low and high spatial resolution data can be used. A frequently leveraged data source with a resolution no finer than 30 m is Landsat satellite imagery [9,10]. Promising results have been shown in studies using both single images and time-series data [11][12][13]. Nevertheless, some tasks require more precise data with higher resolution, and high-resolution multispectral images enable a more thorough land-cover analysis. Many studies have performed forest surveys based on Sentinel images with a spatial resolution of up to 10 m [14]. For instance, one of the open source packages for Sentinel data analysis is the eo-learn project [15].
Recently, image classification algorithms have demonstrated high prediction accuracy in a variety of applied tasks. Algorithms based on machine learning methods are now commonly used for land-cover mapping, particularly for forest species prediction, using satellite imagery. Classical methods, such as Random Forest [16], Support Vector Machine [17], and Linear Regression, usually work with feature vectors, where each value corresponds to some spectral band or combination of bands (in the case of vegetation indices) [18,19]. Deep neural network approaches have proven to be more capable for many land-cover tasks [20][21][22]. In [23], a CNN was compared with XGBoost [24]. In [25], a CNN approach was examined for tree mapping, through the use of airplane-based RGB and LiDAR data. In [26], a neural-based hierarchical approach was implemented to improve forest species classification.
In contrast with typical image classification tasks (such as in the Imagenet data set), land-cover tasks involve spatial data. Vast study regions are usually supplied, with a reference map covering the entire area. Classes within this area may not be evenly distributed in many cases [27]. Moreover, classes of vegetation types in land-cover are often imbalanced within the study region. In many works, the analysed territory can be covered by a single satellite tile (e.g., the tile size for Sentinel-2 is 100 × 100 km). Therefore, researchers must choose both how to select the training and validation regions and how to organize the training procedure to deal with imbalanced classes and a spatial distribution that is usually far from uniform. The sampling approach is vital in the remote sensing domain, as simple image partition into tiles is ineffective for vast territories [28]. The training procedure depends on whether we use a pixel-wise [29] or object-wise approach [10,18]. In a pixel-wise approach, each pixel is ascribed a particular class label and the goal is to predict this label using a feature description of the pixel. In an object-wise approach, a set of pixels is considered as a single object. In some classical machine learning methods, a combination of the two approaches has also been considered [19]. An alternative to classical pixel- or object-wise approaches has been provided in [25] for a CNN tree classification task using airplane-based data. During the described patch-wise training procedure, the model strove to predict one label for a whole input image of size 64 × 64 pixels. However, for some semantic segmentation tasks with lower spatial resolution, the input image can include pixels with different labels and, therefore, the aforementioned approach is not always applicable. The same issue was faced in [21], where a patch-wise approach was implemented for a CNN in a land-cover classification task using RapidEye satellite imagery.
Some patches with mixed labels were excluded in order to solve the problem. In our study, we aim to provide a sampling approach for medium-resolution satellite imagery for forest species classification. In contrast to [25], we focus on a particular area within each patch, and unlike [21], we do not exclude patches with mixed labels from training.
Another important issue is markup limitations. Field-based measurements are commonly used as reference data. Vast territories are often split into small aggregated areas comprised of groups of trees called individual stands [30]. These stands are not necessarily homogeneous but, in some cases, the percentages of different tree species within the stand are available [26]. The location of the non-dominant trees is unknown. In such cases, machine learning algorithms are often trained to predict the dominant class even for regions with mixed forest species [31], or only areas with a single dominant tree species are selected [32]. This raises the issue of weak markup adjustment. Among weakly supervised tasks, this one belongs to inexact supervision, where only coarse-grained labels are given [33]. Weakly supervised images occur both in the general domain [34,35] and in specific tasks such as medical image segmentation [36]. These studies involve the development of new neural network architectures or frameworks to decrease the requirements for labor-intensive data labeling. In the remote sensing domain, weakly supervised learning has also been considered in different tasks, such as cropland segmentation using low spatial resolution satellite data [37], cloud detection through high resolution images [38], and detection of red-attacked trees with very high resolution aerial images [39]. However, in the field of forest species classification, the weak markup problem requires additional analysis owing to the specificity of the data (both satellite and field-based). In this study, we propose a CNN-based approach to extract more homogeneous areas from traditional forest inventory data, which include only the species content within stands and do not provide each species' location. We focus on the semantic segmentation problem using high resolution multispectral satellite data.
The approach is partly based on the Co-teaching paradigm presented in [40], where two neural networks are trained and small-loss instances are selected as clean data for an image classification task. In contrast, we split the data adjustment and training process into two separate stages and implement this pipeline for the semantic segmentation task.
In this study, we aim to explore a deep neural network approach for forest type classification in Russian boreal forests using Sentinel-2 images. We set the following objectives:
• to develop a novel approach for forest species classification using convolutional neural networks (CNNs) that combines pixel- and object-wise approaches during the training procedure, and to compare it with a typically used approach for semantic segmentation; and
• to provide a strategy for weak markup improvement and to examine forest type classification both as a problem of (a) dominant class estimation for non-homogeneous individual stands and (b) more precise homogeneous classification.

Study Site
The study was conducted in the Russian boreal forests of Leningrad Oblast. The coordinates of these regions are between 33.42° and 33.76° longitude and between 60.78° and 61.01° latitude (Figure 1). The vegetation cover is mixed and includes deciduous and conifer tree species. The main species are pine, spruce, aspen, and birch. The climate in the region is humid: the average daily high temperature in the vegetation period (from May to August) is above 15 °C, and the rain period usually lasts for 7 months (from April to November). From September to May, it is snowy (or rain mixed with snow). Throughout the year, the region is generally cloudy (with clearer periods during the summer, when the probability of a clear sky is about 20%) [41].

Reference Data
Reference data was previously reported in [26]. It was collected by field-based measurements carried out in July-August 2018. The methodology of data gathering corresponded to the official Russian inventory regulation [30]. Accordingly, the study area was split into individual stands with the following characteristics: polygonal coordinates, the percentage of each tree species, and the average age and height within the stand. The distribution of stand sizes is presented in Figure 2. The majority of polygons had their longest side length between 100 and 600 m. Although the percentage for each stand was defined, the spatial distribution within the stand was unknown. The number of individual stands with a particular dominant tree species (larger than 50% within the stand) is shown in Figure 3 and Table 1. The vast majority of individual stands consisted of mixed species; for instance, there were fewer than 100 stands of pure (not mixed) birch type. Examples of mixed individual stands are presented in Figure 4.
Figure 2. Size distribution of individual stands within the study area. Polygons with a side larger than 64 pixels or smaller than 8 pixels were eliminated.

Satellite Data
For optical multispectral imagery, we acquired Sentinel-2 data, which is freely available for download in L1C format from EarthExplorer USGS [42]. Tile IDs and acquisition dates are presented in Table 2. In this study, we considered only summer images. High cloud cover limits the available data for this northern region; therefore, only two summer images from different years, but from a comparable summer period, were used to create the training dataset. Images acquired on other summer dates did not provide a sufficient amount of clear, cloud-free areas. There were no significant forest cover changes between the survey time and the image acquisition time; therefore, both images are relevant for the study. Ten bands with the following wavelengths were used: Band 2: Blue, 458-523 nm; Band 3: Green, 543-578 nm; Band 4: Red, 650-680 nm; Band 5: Red-edge I (R-edge I), 698-713 nm; Band 6: Red-edge II (R-edge II), 733-748 nm; Band 7: Red-edge III (R-edge III), 773-793 nm; Band 8: Near infrared (NIR), 785-900 nm; Band 8A: Narrow near infrared (NNIR), 855-875 nm; Band 11: Shortwave infrared-1 (SWIR1), 1566-1651 nm; Band 12: Shortwave infrared-2 (SWIR2), 2100-2280 nm. Images were pre-processed with the Sen2Cor package [43] for atmospheric correction. Although the Sen2Cor package provides a cloud and shadow map, which can be used to eliminate irrelevant pixels, we selected cloudless images for the study. The obtained data were in L2A format, containing Bottom-Of-Atmosphere (BOA) reflectance values. For CNN-based tasks, image values are often brought to the interval from 0 to 1 [44,45]. Therefore, pixel values were mapped to the interval [0, 1] by division by 10,000 (the maximum physical surface reflectance value for Sentinel-2 at level L2A) and clipping to [0, 1]. We used the bands with a spatial resolution of 10 m per pixel (B02, B03, B04, B08) and those with 20 m per pixel (B05, B06, B07, B11, B12, B8A), resampled to 10 m by nearest-neighbor interpolation [46].
Each image covered the entire study area, and the images were considered separately, without any spatial averaging (as in [47]).
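The normalization and resampling steps described above can be sketched in NumPy as follows; this is a minimal illustration, and the function names and toy values are our own, not part of any specific package:

```python
import numpy as np

def normalize_reflectance(band, scale=10000.0):
    """Map L2A BOA reflectance digital numbers to [0, 1] by dividing by
    the maximum physical reflectance value and clipping the result."""
    return np.clip(band.astype(np.float32) / scale, 0.0, 1.0)

def upsample_nearest(band_20m, factor=2):
    """Nearest-neighbor upsampling of a 20 m band to the 10 m grid
    by repeating each pixel along both spatial axes."""
    return np.repeat(np.repeat(band_20m, factor, axis=0), factor, axis=1)

# Example: a toy 2 x 2 "20 m" band with raw digital numbers.
raw = np.array([[5000, 12000], [0, 2500]])
norm = normalize_reflectance(raw)   # 0.5, 1.0 (clipped), 0.0, 0.25
grid_10m = upsample_nearest(norm)   # shape (4, 4) on the 10 m grid
```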

Organizing Samples for Classification
Four tree species were considered: aspen, birch, spruce, and pine. We also considered the 'conifer' class as a combination of spruce and pine, and the 'deciduous' class as a combination of aspen and birch. As samples for the further analysis, we chose individual stands. There was no information on the spatial distribution of tree species within an individual stand; therefore, we defined the label for each stand as the dominant tree species within it, if the stand contained more than 50% of this forest type (the same approach was described in [31]). For the conifer and deciduous classes, we summed the percentages for spruce and pine, and for aspen and birch, respectively. The described sample definition assumed that the markup had some pre-defined uncertainty for non-homogeneous stands; however, it provided the information necessary for the dominant species classification task. Thus, for each sample in the data set, we know the label of the dominant forest type, the percentage of secondary types (if any), and an associated polygon in a multispectral satellite image.
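The sample-labeling rule can be illustrated with a small sketch; the dictionary-based composition record and the function name are our own illustrative assumptions, not taken from the study's data format:

```python
def dominant_label(composition, threshold=0.5,
                   conifer=("spruce", "pine"), deciduous=("aspen", "birch")):
    """Assign stand-level labels from a species-share dictionary.

    composition: dict mapping species name to its share within the stand
    (shares sum to 1). Returns the dominant species if its share exceeds
    the threshold (otherwise None, i.e., the stand is excluded from the
    four-species task), plus the grouped conifer/deciduous label, whose
    shares are obtained by summing the species within each group.
    """
    species, share = max(composition.items(), key=lambda kv: kv[1])
    species_label = species if share > threshold else None
    conifer_share = sum(composition.get(s, 0.0) for s in conifer)
    deciduous_share = sum(composition.get(s, 0.0) for s in deciduous)
    group_label = "conifer" if conifer_share > threshold else (
        "deciduous" if deciduous_share > threshold else None)
    return species_label, group_label

# A mixed stand: 60% spruce, 30% pine, 10% birch.
print(dominant_label({"spruce": 0.6, "pine": 0.3, "birch": 0.1}))
# ('spruce', 'conifer')
```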
For the experiment on training procedure adjustment, we selected 8 test regions of about 450 ha each (Figure 5). For the experiment on weak markup improvement, 30% of the samples were selected randomly for testing. Samples outside the test regions were split into training and validation sets randomly, in a ratio of 7:3, with the constraint that the same individual stand could not occur in both the validation and training sets. Each polygon could yield more than one sample, depending on the number of images covering it. Non-overlapping parts of the same satellite image could appear in both the training and test sets.

Forest Species Classification
Instead of typical multi-class classification, we used a hierarchical approach described in [26]. The task of four-species prediction was split into three tasks: (a) classification of conifer and deciduous; (b) classification of birch and aspen; and (c) classification of spruce and pine. The final results followed from the intersection between the predicted mask of birch and aspen and the predicted deciduous mask (with a similar approach followed in the conifer case). Such a hierarchical approach allows for solving each task independently and ensures greater control over the experiment at each step.
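The intersection step of the hierarchical scheme can be illustrated by a small sketch, assuming boolean per-pixel masks from the three binary tasks and an arbitrary integer coding of the four species (the coding and function name are our own):

```python
import numpy as np

def aggregate_hierarchy(deciduous_mask, conifer_mask, birch_mask, pine_mask):
    """Combine three binary tasks into one four-species map.

    Inputs are boolean per-pixel masks: deciduous vs. conifer (two
    complementary masks), birch vs. aspen (birch_mask, valid within the
    deciduous area), and pine vs. spruce (pine_mask, within conifer).
    Returns an integer map: 1 birch, 2 aspen, 3 pine, 4 spruce.
    """
    out = np.zeros(deciduous_mask.shape, dtype=np.uint8)
    out[deciduous_mask & birch_mask] = 1   # deciduous AND birch
    out[deciduous_mask & ~birch_mask] = 2  # deciduous AND aspen
    out[conifer_mask & pine_mask] = 3      # conifer AND pine
    out[conifer_mask & ~pine_mask] = 4     # conifer AND spruce
    return out
```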
For the forest type classification, we implemented a deep neural network approach, which has been widely used for image classification and segmentation tasks in the remote sensing domain when spatial characteristics are important [48][49][50]. The input of such a neural network is usually a combination of spectral bands. The output of the semantic segmentation model is a map in which each pixel is ascribed a particular class label. During the training procedure, the model is forced to correctly predict as many pixel labels as possible by observing random image patches of a pre-defined size. This is achieved through the implementation of a particular loss function, computed at each step of neural network training once all image patches from one batch have been processed. For our study, we implemented the categorical cross-entropy per-pixel loss function:

L = -(1/N) ∑_{i=1}^{N} ∑_{k=1}^{C} y_{ik} log(ŷ_{ik}),   (1)

where ŷ_{ik} is the predicted probability of the ith pixel belonging to the kth class, y_{ik} is the ground truth value for the ith pixel (1 if the pixel belongs to the kth class), N is the number of non-masked pixels, and C is the number of classes.
In this loss, all pixels in the scene are taken into account. Therefore, if the classes are highly unbalanced, the model rarely observes pixels labeled as the smaller class, which results in poor performance for the less represented class. A common solution is to use a larger penalty for errors on the smaller class samples, as in the weighted categorical cross-entropy:

L_w = -(1/N) ∑_{i=1}^{N} ∑_{k=1}^{C} w_k y_{ik} log(ŷ_{ik}),   (2)

where w_k is the weight of the kth class, and ŷ_{ik}, y_{ik}, N, and C are defined as in Equation (1). Another issue that should be taken into account is that samples of particular classes may not be evenly distributed across the study region. This means that random selection of image patches in a batch can lead to a situation where samples concentrated in one area are seldom observed.
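As an illustration, the weighted per-pixel categorical cross-entropy with masked pixels can be sketched in NumPy as follows; the flat (N, C) layout, the masking convention (all-zero one-hot rows are ignored), and the names are our own assumptions:

```python
import numpy as np

def weighted_pixel_cross_entropy(y_true, y_pred, class_weights=None, eps=1e-7):
    """Weighted categorical cross-entropy averaged over non-masked pixels.

    y_true: (N, C) one-hot ground truth; rows of all zeros are treated
    as masked pixels and excluded. y_pred: (N, C) class probabilities.
    class_weights: length-C vector w_k; defaults to all ones (Eq. (1)).
    """
    if class_weights is None:
        class_weights = np.ones(y_true.shape[1])
    valid = y_true.sum(axis=1) > 0                  # drop masked pixels
    logp = np.log(np.clip(y_pred[valid], eps, 1.0)) # numerical safety
    per_pixel = -(y_true[valid] * class_weights * logp).sum(axis=1)
    return float(per_pixel.mean())
```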
To tackle this problem, we modified the classical sampling approach for semantic segmentation with CNN, as described in the next section.

Object-Wise Sampling Approach
We replaced the commonly used batch creation approach: the sample content was taken into account instead of simply using random patch selection. The choice of patch size was governed by the relevant size of polygons. As we eliminated polygons with sides shorter than 80 m and longer than 640 m, the patch size was selected as 64 × 64 pixels. The number of patches per batch was set to 128. Although we considered two classes, the general approach is also applicable to more classes. For each class, we picked the same number of polygons and cut patches around these polygons to create the batch. As the polygon size could vary within the defined range, the patch crop could also differ for the same polygon. The only requirement was that the polygon's bounding box should lie within the patch boundary. The patch was also geometrically augmented, in order to provide more variability during the training procedure; we implemented random rotate, mirror, and flip operations. The general approach for batch creation is described in Algorithm 1. The next step was loss computation, described in Figure 6. For this purpose, we used the polygon mask. Each patch mask has dimensions Patch_Rows × Patch_Columns × Number_of_classes and contains non-zero values for pixels within the polygon's area and for the appropriate correct class. Although individual stands are often not homogeneous, all pixels within one stand were ascribed the same label. The loss was computed for this area only. Markup may be available for other pixels within the patch, but it was not considered, mainly because it could affect the balance of classes.
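A simplified sketch of the batch-creation procedure (Algorithm 1) might look as follows; the data layout (per-class lists of polygon bounding boxes), all names, and the assumption that every bounding box fits inside a patch are our own illustrative simplifications:

```python
import random
import numpy as np

def make_object_wise_batch(polygons_by_class, image, patch=64, per_class=64):
    """Sketch of object-wise batch creation.

    polygons_by_class: dict class -> list of polygon bounding boxes
    (row0, col0, row1, col1) in image coordinates, each smaller than the
    patch; image: (H, W, Bands) array larger than the patch. The same
    number of polygons is drawn per class, and a patch is cropped so that
    the polygon's bounding box lies fully inside it.
    """
    batch = []
    for cls, polys in polygons_by_class.items():
        for box in random.choices(polys, k=per_class):
            r0, c0, r1, c1 = box
            # random crop offset keeping the whole bounding box inside
            top = random.randint(max(0, r1 - patch),
                                 min(r0, image.shape[0] - patch))
            left = random.randint(max(0, c1 - patch),
                                  min(c0, image.shape[1] - patch))
            crop = image[top:top + patch, left:left + patch]
            # geometric augmentation: random rotation and vertical flip
            crop = np.rot90(crop, k=random.randint(0, 3))
            if random.random() < 0.5:
                crop = crop[::-1]
            batch.append((crop, cls))
    random.shuffle(batch)
    return batch
```

During loss computation, only the pixels inside each polygon's mask would then contribute, as described above.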
We compared this approach with the commonly used per-pixel semantic segmentation approach, for which the batch was formed randomly and an extra penalty for mistakes in the smaller class was added (Figure 7). In this approach, all pixels within the patch were considered in the calculation of the weighted categorical cross-entropy loss. The weights were set proportionally to the representation of each class.

Weak Markup
Another adjustment was aimed at addressing weak markup. It includes two stages, as shown in Figure 8. The first stage was as follows. The aforementioned reference data contained the percentage of each class within the individual stand. We incorporated this knowledge into the loss computation: the loss was calculated for each individual stand and multiplied by the dominant species percentage. For example, for a stand consisting of 60% conifers and 40% deciduous trees, the loss weight was 0.6. The higher the percentage, the stricter the penalty; for a homogeneous stand, all pixels have the maximum loss weight. Therefore, in Equation (2), the weights were defined as the dominant species percentage. When the learning curve started to change less rapidly and could achieve sufficient results on the validation set (after about 15 epochs), we changed the training data set: we eliminated all individual stands with a dominant species percentage below 90%. Thus, for a few epochs (about 2), the model observed purer data. Obviously, such a model performs poorly in terms of the initial dominant species problem statement. However, at the same time, it does not strive to label deciduous trees within a conifer individual stand as conifer (as in the 60% conifer and 40% deciduous case). We then used this model to predict conifer and deciduous species for both the training and validation regions. The result of the first stage of markup adjustment was the intersection between the predicted mask and the initial dominant species markup. It was assumed that the map acquired in this way contained fewer pixels of minor (i.e., non-dominant) classes. The second stage of the weak markup study was the implementation of the newly obtained markup in further training. We intersected the new conifer mask with the initial spruce and pine dominant masks, and did the same for the deciduous classes.
The goal of this intersection was to reduce the number of deciduous pixels within individual stands dominated by pine and spruce, and vice versa. For this study, we created the validation data set from homogeneous individual stands only.
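The two stages can be summarized in a minimal sketch; the flat per-pixel data layout and all names are our own illustrative assumptions, not the study's implementation:

```python
import numpy as np

def stand_weighted_loss(pixel_losses, stand_ids, dominant_share):
    """Stage one: weight each pixel's loss by the dominant-species
    percentage of its stand. A stand that is 60% conifer contributes
    with weight 0.6, so impure stands are penalized more softly, while
    homogeneous stands keep full weight.

    pixel_losses: (N,) per-pixel loss values; stand_ids: (N,) stand
    index per pixel; dominant_share: dict stand index -> percentage.
    """
    weights = np.array([dominant_share[s] for s in stand_ids])
    return float((weights * pixel_losses).mean())

def adjusted_markup(initial_mask, predicted_mask):
    """Stage-one output: retain only pixels where the prediction of the
    model trained on near-pure stands agrees with the initial dominant
    species markup (their intersection)."""
    return initial_mask & predicted_mask
```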

Experimental Setup
For all experiments, the U-Net architecture [51] with a ResNet [52] encoder was used, as it has been shown to perform successfully in popular image classification tasks, both in the general and remote sensing domains [50]. The model implementation followed [53], using Keras [54] with a Tensorflow [55] backend. For model training, a PC with GTX-1080Ti GPUs was used. The batch size was 128 patches, where each patch had a size of 64 × 64 pixels; the batch size was chosen according to GPU memory limitations. There were 100-200 steps per epoch, and the number of epochs varied from 10 to 30, depending on the size of classes. Reproducibility of results was ensured by fixing the random seed of the pseudo-random number generator for all training methods.

Validation Methods
To assess the classification quality, we considered the F1-score (Equation (5)). It is a commonly used score for semantic segmentation tasks [56], particularly in cases of unbalanced datasets, and is also often considered in the remote sensing domain [50]:

P = TP / (TP + FP),   (3)
R = TP / (TP + FN),   (4)
F1 = 2 · P · R / (P + R),   (5)
where P is precision, R is recall, TP is True Positive (the number of correctly classified pixels/stands of a given class), FP is False Positive (the number of pixels/stands classified as a given class while being of another class), and FN is False Negative (the number of pixels/stands of a given class missed by the model).
In the first case, we evaluated the number of correctly predicted individual stands. To this end, per-pixel predictions within stands were aggregated and the dominant class was defined for each stand. Based on the reference and predicted stand labels, the numbers of true positive, false positive, and false negative samples were estimated. In the second case, we evaluated the F1-score in a per-pixel manner.
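The per-stand evaluation can be sketched as follows; majority voting within each stand is our reading of the aggregation step, and the data layout and names are illustrative:

```python
import numpy as np
from collections import Counter

def stand_f1(pred_map, stand_pixels, stand_labels, target):
    """Per-stand F1: aggregate per-pixel predictions to one label per
    stand (majority vote) and score against reference stand labels.

    pred_map: (H, W) predicted class map; stand_pixels: dict stand id ->
    (rows, cols) index arrays; stand_labels: dict stand id -> reference
    label; target: the class being evaluated.
    """
    tp = fp = fn = 0
    for sid, (rows, cols) in stand_pixels.items():
        majority = Counter(pred_map[rows, cols].tolist()).most_common(1)[0][0]
        if majority == target and stand_labels[sid] == target:
            tp += 1
        elif majority == target:
            fp += 1
        elif stand_labels[sid] == target:
            fn += 1
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```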
A CNN model for each experiment was trained five times with different random seeds, and the results were averaged; the standard deviation was also computed.

Sampling Approach For Species Classification
We compared a typical sampling procedure for forest species semantic segmentation with our modified one. The results are presented in Table 3. For all classes, the object-wise sampling approach performed better: the average F1-score before aggregation improved from 0.8 to 0.85. The final aggregated results were obtained by multiplying the predicted conifer binary mask with the spruce and pine masks and multiplying the predicted deciduous mask with the aspen and birch masks. Aggregated results for the four forest classes are shown in Table 6. The object-wise sampling approach allowed us to improve the segmentation F1-score from 0.68 to 0.74. The largest difference between the two methods was for the birch and aspen classes, as these classes were the most difficult to distinguish due to class imbalance. The proposed approach leads to a more balanced choice of training samples.
The standard deviation was computed for the F1-scores averaged over different model training runs. It indicates that the achieved results are reliable enough for further forest analysis studies.

Markup Adjustment
We conducted experiments aimed at improving the conifer and deciduous markup. Some areas were eliminated by intersecting the model predictions with the initial dominant species map, the goal being to retain only homogeneous areas with conifer or deciduous trees. The per-pixel metric expects all pixels, even within an inhomogeneous individual stand, to be labeled as the dominant class type; therefore, at this stage of the task, the goal was not to improve the per-pixel score. The average per-pixel F1-score for conifer and deciduous classification became 0.76, in comparison with the previously achieved 0.82 (Table 4). However, we aimed to preserve the score per individual stand rather than the per-pixel one. The score of dominant classification per individual stand remained at approximately the same level (F1-score 0.85), meaning that the model was trained to ignore pixels of non-dominant classes within the stand. For the further assessment, homogeneous stands were considered. Table 4. Conifer and deciduous classification (average score) using source markup and updated markup.

The obtained map was then used for species classification. We compared the model trained on the source markup with that trained on the updated one. Their performance was assessed on homogeneous individual stands for the four species from the test set. The results are presented in Table 5. Although we eliminated pixels from the (non-homogeneous) training set, the model performed better than when using the larger training data of weaker quality. It allowed us to improve the average F1-score for the four species from 0.74 to 0.76 compared with using the initial markup (Table 6). The results confirmed the benefit of the proposed approach. An example of the final predictions using both the modified sampling approach and the adjusted markup is presented in Figure 9.

Sampling Approach for Species Classification
The analysis showed that the sampling procedure is essential for the forest species classification task. The same approach can be implemented for other problems where maps of vast territories are used and some classes are not uniformly distributed. The proposed object-wise sampling approach for CNNs leads to better results than the commonly used approach in which patches are chosen randomly within the entire satellite image.
It is worth mentioning why a classical patch-wise approach was not considered suitable for our problem. It implies that we can choose a patch size small enough to include only the pixels of one class. However, in our case, there are two obstacles to implementing this. The first is that individual stands are not rectangular and, therefore, the patch size must be rather small. The second is that individual stands are not homogeneous and we do not know the spatial distributions within stands; therefore, a random small patch within an individual stand may turn out to be, in fact, a set of pixels of a minor class. This makes the approach described in [21] inappropriate in the presented case.
Another alternative to classical pixel- and object-wise classification for remote sensing applications (e.g., airplane-based) has been discussed in [25]. It should be noted that, despite the apparent similarity of airplane- and satellite-derived remote sensing data, they have substantive differences. The main difference is spatial resolution: the relevant observation field can vary by a factor of 100 (e.g., 0.1 m for UAV and 10 m for satellite images). Thus, the approach has to be modified.

Markup Adjustment
Clear markup is essential for remote sensing tasks. In some cases, non-homogeneous areas are excluded from the training set [32]. Another approach is to use plots with different species and label them by the dominant species class [31]. It is reasonable to move further in the direction of automatic markup adjustment, in order to make the data cleaner without extra manual labeling. The next step of the study can be label adjustment for all classes, not only for conifer and deciduous. Adjustment of the weighted loss function can also be considered to improve homogeneous area detection.
Weakly supervised learning is now applied in various remote sensing tasks, which vary in their target objects and in remote sensing data properties such as spatial resolution and the number of spectral bands. In our study, we focus on 10 m spatial resolution and 10 multispectral bands. In cases of very high spatial resolution and only RGB bands, such as in [39], markup constraints differ significantly. Particular tasks also pose limitations and additional opportunities for a weakly supervised learning approach [38]. Therefore, remote sensing datasets can differ drastically from datasets such as MNIST or CIFAR considered in [40]. Another difference is that forest species classification is treated here as a semantic segmentation task rather than an image classification task, as in the noisy labels problem in [40].
Markup adjustment can also be studied with classical machine learning algorithms instead of neural-network-based ones, such as the methods described in [57][58][59].
The main error source in such land cover tasks is the diversity within each forest species. Spectral characteristics vary drastically with tree age and depend on environmental conditions. Therefore, markup adjustment and an optimal sampling choice are promising approaches to improve model performance. Another error source is mixed border pixels of neighboring individual stands: at 10 m spatial resolution, even for homogeneous forest stands, spectral characteristics on the border can be affected by other species outside the stand. A possible way to address this problem for homogeneous stands is to consider only inner pixels remote from the border.
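The suggested inner-pixel selection amounts to a binary erosion of the stand mask. A minimal NumPy-only sketch (4-neighborhood erosion; the function name and margin convention are our own illustrative choices):

```python
import numpy as np

def inner_pixels(stand_mask, margin=1):
    """Keep only pixels of a boolean stand mask at least `margin`
    pixels away from the stand border (simple 4-neighborhood binary
    erosion), so that mixed 10 m border pixels do not contaminate the
    spectral statistics of a homogeneous stand."""
    eroded = stand_mask.copy()
    for _ in range(margin):
        m = eroded
        shifted = m.copy()
        shifted[1:, :] &= m[:-1, :]   # require the neighbor above
        shifted[:-1, :] &= m[1:, :]   # require the neighbor below
        shifted[:, 1:] &= m[:, :-1]   # require the neighbor to the left
        shifted[:, :-1] &= m[:, 1:]   # require the neighbor to the right
        # array-border rows/columns have missing neighbors
        shifted[0, :] = False
        shifted[-1, :] = False
        shifted[:, 0] = False
        shifted[:, -1] = False
        eroded = shifted
    return eroded
```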
One of the potential limitations is the time and computational cost for markup adjustment model training. In our case, we used the same CNN architecture to perform this stage. We trained the model for markup adjustment and the final segmentation model sequentially. In future studies, an alternative approach can be developed and implemented to perform markup adjustment on the fly for remote sensing tasks.
In this study, we considered forest species classification. However, the proposed approach can be transferred in future studies to other tasks where samples are grouped and the label distribution within each group is known. The described approach is also applicable to other neural network architectures; therefore, experiments with new state-of-the-art architectures can be conducted using the same method. Both the sampling and markup adjustment approaches are transferable to new satellite data sources. We considered multispectral Sentinel-2 imagery with a spatial resolution of 10 m. However, the method can also be applied to high-resolution multispectral data, such as WorldView, or to RGB-only images such as base maps.
Vegetation indices are significant for environmental tasks, as they provide relevant surface characteristics; therefore, they are widely used as features for classical machine learning methods. In the case of deep neural networks, however, it is assumed that the network can learn non-linear relations from raw input data and use prior information to extract more general characteristics. In our study, we considered only multispectral satellite bands. Future studies might include vegetation indices or supplementary data, such as digital elevation or canopy height models, to achieve better results and reduce training time.
It is promising to study different augmentation techniques combined with improved markup and the object-wise sampling approach. For example, the object-based augmentation described in [60] can further be implemented to create more variable training samples with different homogeneous stands.
Precise forest species classification can also be implemented in ecological and environmental studies, as large forest patches have been proved to affect human health [61]. Detailed forest characteristics can be helpful for such analysis.

Conclusions
The sampling approach and ground truth markup quality are crucial in forestry tasks involving remote sensing data. In this study, we analyzed the potential of combining CNNs and Sentinel-2 images for the task of forest species classification using weak markup with non-homogeneous individual stands. During the first stage, a CNN was trained to find the homogeneous areas within each stand, providing a more accurate markup. During the second stage, the final model was trained to predict four forest species. This markup adjustment allowed us to increase the F1-score from 0.74 to 0.76 compared to the initial markup. The experiment confirms the possibility of refining weak labels and shows promising results for further classification enhancement. We also proposed a CNN-based object-wise sampling approach for spatial data in forest species classification. The proposed modification outperformed a commonly used per-pixel semantic segmentation model (the average F1-score was increased from 0.68 to 0.74). The described pipeline helps to address the issue of highly imbalanced and unevenly distributed classes. The provided training strategy can help solve forest species classification tasks more precisely, even when the reference data has significant limitations. Further study of other vast territories is promising, and the proposed sampling technique seems to be beneficial in such spatial studies.