Automated Delineation of Microstands in Hemiboreal Mixed Forests Using Stereo GeoEye-1 Data

A microstand is a small forest area with a homogeneous tree species, height, and density composition. High-spatial-resolution GeoEye-1 multispectral (MS) images and GeoEye-1-based canopy height models (CHMs) allow microstands to be delineated automatically. This paper studied the potential benefits of two microstand segmentation workflows: (1) our modification of JSEG and (2) the generic region merging (GRM) of the Orfeo Toolbox, both intended for microstand border refinement and automated stand volume estimation in hemiboreal forests. Our modification of JSEG uses a CHM as the primary data source for segmentation and refines the results using MS data, whereas the GRM workflow fuses the CHM and multispectral data through multiband segmentation. The accuracy was evaluated using several sets of metrics (unsupervised, supervised direct assessment, and system-level assessment). The metrics were also calculated for a regular segment grid to check the benefits of segmentation over simple image patches. The metrics showed very similar results for both workflows. The most successful combinations of the workflow parameters retrieved over 75% of the boundaries selected by a human interpreter. However, the impact of data fusion and parameter combinations on stand volume estimation accuracy was minimal, causing variations of the RMSE within approximately 7 m³/ha.


Introduction
Accurate information about forests' qualitative and quantitative conditions (forest inventory) is critical to effectively achieving forest management's economic, environmental, and social goals.
Remote sensing data provide spatially detailed information over vast areas and can be used as a complementary information source for an automated forest inventory [1]. Very-high-resolution (VHR) stereo satellite images, in particular, may be beneficial by offering up-to-date information about both the spectral and structural characteristics of the forest, since aerial imagery and LiDAR data for a given area are updated far less frequently. For example, aerial photography of the entire territory of Latvia is carried out every three years, while airborne laser scanning has been performed only once [2].
Various processing methods have been applied to support national forest inventories (NFIs) depending on the remote sensing data source [3]. Data processing workflows can be grouped by the employed datasets and by the spatial units in which forest inventory variables are estimated. For example, the spatial unit of an automated forest inventory based on remote sensing data can be (1) a pixel [4], (2) a single tree [5], or (3) a homogeneous image patch or forest microstand [6].
The most accurate forest inventory parameter estimations can be achieved by operating at the single-tree level [7]. However, successful individual tree identification and delineation depend greatly on the spatial resolution of the remote sensing data and the forest structure. A further challenge is the lack of a set of unambiguous metrics allowing selection of the best segmentation result [28]. Wulder et al. [29] stated that an objective accuracy assessment of polygons against ground truth cannot be achieved in forest stand delineation, because stand borders are not as clear as in urban or agricultural applications. The selection of forest stands by image analysts can also differ between experts [12]. Microstand delineation accuracy can be measured using two basic approaches: (1) system-level and (2) direct assessment [28]. System-level accuracy assessment means evaluating the segmentation results within a comprehensive workflow for estimating forest inventory variables: higher accuracy of those variables hints at a more efficient segmentation [23,26]. Direct accuracy assessment evaluates the quality of the segments using reference data such as stands delineated by experts (supervised assessment) [21] or segment homogeneity criteria (unsupervised assessment) [17].
The objective of this research was to investigate the application of two microstand segmentation workflows: (1) our modification of JSEG and (2) generic region merging (GRM) of the Orfeo Toolbox, both for the microstand border refinement (to provide additional input for traditional forest inventory process) and automated stand volume estimation in hemiboreal mixed forests.
The main contribution of this research is related to the performance analysis of the microstand segmentation algorithms in the case of highly heterogeneous hemiboreal forest stands.
Our contribution in this research also includes:
1. The development of a modified JSEG segmentation workflow [30] that provides for CHM and multispectral data fusion;
2. The application of the JSEG workflow and of the freeware generic region merging solution provided by the Orfeo Toolbox to four-band GeoEye-1 images and a CHM prepared from the same GeoEye-1 stereo scene; a CHM produced in this manner is time-compatible with the spectral bands;
3. An extensive accuracy assessment for hemiboreal forests in Latvia using (1) unsupervised, forest-specific metrics, (2) a supervised, direct accuracy assessment using 2700 microstands delineated by an independent image analyst, and (3) a system-level assessment by estimating stand volume. All metrics were also calculated for grid cells to evaluate the benefits of segmentation.
Hemiboreal forests are characterised by the dominance of coniferous trees, but with a significant presence of numerous deciduous tree species [31], resulting in high spectral and structural variability within a stand and between stands. This diversity poses challenges for both segmentation and machine learning algorithms.
The motivation for choosing JSEG [30] for this study was the ability of the algorithm to capture uniform, high-level textures, which are specifically important for the delineation of microstands. Texture created by tree crowns is one of the most important features observed by image analysts when selecting microstands by hand. The JSEG segmentation method has been proven efficient for extracting regions of similar texture without restrictions on the region size [30]. Wang et al. [32] compared the classical and improved JSEG methods with the eCognition segmentation algorithm and concluded that improved JSEG could retrieve boundaries better. However, JSEG has been considered in comparatively few studies, which could be due to the lack of open-source implementations and its high computational complexity.
The generic region merging of the Orfeo Toolbox (OTB) was chosen for comparison because it is a freely available and easy-to-use implementation.

More than half of the area is owned by the Joint Stock Company "Latvia's State Forests" (LVM), while the rest is occupied by forests of private owners and non-forested land cover types. The average stand age of the state-owned forests is 66 years, with a standard deviation of 34 years, according to the National Forest Inventory (NFI) database. State-owned forest stands are managed more intensively than privately owned ones, resulting in visibly different textures formed by tree crowns in the VHR images.

Study Site
Microstand polygon reference data were acquired during this research for six sample areas (covering both state and private lands within the study site) with a total area of 19.5 km².

Remote Sensing and Reference Data
We used GeoEye-1 stereo image pairs acquired at 9:27 on 7 August 2020. Each scene includes 4-band (blue: 450-510 nm, green: 510-580 nm, red: 655-690 nm, and near-infrared: 780-920 nm) multispectral data with a spatial resolution of 2 m/px and the panchromatic band (450-800 nm) with a 0.5 m/px resolution [33]. The canopy height model (CHM) was prepared on a 0.5 m/px grid employing the PCI Geomatica software, using the near-infrared stereo images from the same GeoEye-1 scene and the external digital terrain model (DTM) provided by the Latvian Geospatial Information Agency.
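The final step of such CHM preparation, subtracting the external DTM from the photogrammetric surface model, can be illustrated with a minimal numpy sketch. The actual processing was performed in PCI Geomatica; the function name, the clipping of implausible values, and the toy arrays below are illustrative assumptions, not the software's internals.

```python
import numpy as np

def canopy_height_model(dsm, dtm, max_height=50.0):
    """Compute a CHM as the difference between a stereo-derived DSM and
    an external DTM, clipping negative and implausibly high values.

    dsm, dtm : 2-D arrays of elevations (m) on the same 0.5 m/px grid.
    """
    chm = dsm - dtm
    chm[chm < 0] = 0.0                   # below-terrain matching noise
    chm[chm > max_height] = max_height   # spikes above plausible tree height
    return chm

# toy 2 x 2 example: tree tops at 10 m and 32.5 m over flat terrain
dsm = np.array([[110.0, 132.5], [101.0, 95.0]])
dtm = np.array([[100.0, 100.0], [100.0, 100.0]])
chm = canopy_height_model(dsm, dtm)
```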
Microstand segmentation was performed using only GeoEye-1 MS and CHM data. For validation purposes, we also employed LiDAR-based CHM with a spatial resolution of 0.5 m/px, which allowed us to evaluate the approximate number of tree crowns within a microstand.
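The approximate crown count from a LiDAR-based CHM can be sketched with a simple local-maximum filter; this is a minimal illustration under assumed parameters (the 3 m minimum height follows the forest threshold used later, the synthetic CHM is invented for the example), not the exact procedure used in the study.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def count_crowns(chm, px_size=0.5, filter_size_m=3.5, min_height=3.0):
    """Approximate the number of tree crowns: a pixel is counted as a
    crown apex if it equals the local maximum of its neighbourhood and
    exceeds a minimum tree height."""
    size_px = max(1, int(round(filter_size_m / px_size)))  # 3.5 m -> 7 px
    local_max = maximum_filter(chm, size=size_px)
    apices = (chm == local_max) & (chm >= min_height)
    return int(apices.sum())

# two synthetic crowns on flat ground
chm = np.zeros((16, 16))
chm[3, 3] = 18.0    # taller apex
chm[12, 12] = 9.0   # smaller apex
```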
The reference dataset included the NFI database for state forests within a study site and 2770 microstand polygons drawn by the image analyst for 6 sample areas.
The image analyst (a researcher in the field of forestry hired for this research) visually assessed the remote sensing datasets mentioned above, analysing the CHM first and using the multispectral information to split polygons found in the CHM if necessary. For every polygon, the image analyst assigned one of five land cover class codes (1, pure stand containing only one tree species; 2, mixed stand containing more than one tree species; 3, recently felled area; 4, non-forest; 5, young stand) and a confidence level for the microstand border position. If the microstand border was clearly visible in the remote sensing data, as in the case of even-aged pure stands, the expert recorded a confidence level of 1 (high confidence). If the selected borders were ambiguous or hard to distinguish, a confidence level of 0 (low confidence) was assigned. A summary of the spatial characteristics of the six sample areas is given in Table 1.

JSEG Segmentation
JSEG first quantises the input image into a class map (Figure 2, created by k-means) and then characterises the texture using a variable J [30], calculated in a pixel neighbourhood specified by the window size. Higher J values indicate higher heterogeneity of the texture within a neighbourhood. Then, a region-growing algorithm is initialised by seed regions. Seed regions are found using a thresholding-based workflow that selects areas with low J values, indicating potentially uniform textures. Region growing can be employed by combining J images of varying window sizes: for example, region seeds can be found using a J image of a larger window size, while pixel-by-pixel region growing is performed using a J image of a smaller window size.

We modified the JSEG segmentation workflow described in [30] to incorporate sequential processing of the CHM and MS datasets.
The workflow for microstand delineation at a spatial resolution of 0.5 m/px includes the following steps:

1. Calculate J images for the clustered (number of clusters k = 16) CHM at three scales with window sizes w = 33 px, w = 17 px, and w = 7 px, denoted JI33, JI17, and JI7. The number of clusters was set by trial and error, aiming to emphasise microstands distinguishable by visual assessment. To emphasise sharp boundaries, the morphological gradient of the CHM with a 5 × 5 square structuring element was merged with the J images using the elementwise maximum operation;
2. Perform the multiscale segmentation of the J images. To save calculation time, we excluded pixels with a CHM value lower than 3 m from further analysis, since we were interested only in tree-covered areas:
   Scale 1: Find a seed image [30] using JI33, setting the minimum allowed seed size to 512 px. Add homogeneous chunks of JI17 to the seed image, and perform region growing pixel by pixel using JI17. The output is denoted R_CHM1;
   Scale 2: Find the refined seed image for R_CHM1 using JI17 and a minimum allowed seed size of 128; perform region growing by adding homogeneous chunks using JI7; perform region growing pixel by pixel using JI7. The output is denoted R_CHM2;
3. Calculate J images for the clustered (k = 16) multispectral image (MS) at three scales with the same window sizes w = 33, 17, and 7, denoted JIMS33, JIMS17, and JIMS7;
4. Resegment each region in R_CHM2. Statistical measures of the JSEG method were calculated for each region from R_CHM2 individually, and segmentation was again performed at multiple scales:
   (a) For the first scale MS1, find new seeds for the region using JIMS33, and employ pixel-by-pixel region growing using JIMS17. The output is denoted R_CHM2,MS1,JI33,JI17, where the first index shows the CHM segmentation scale, the second reflects the MS scale, and the remaining ones indicate the J images employed;
   (b) For the second scale MS2, find new seeds for the segmented image from Step (a) using JIMS17, and perform pixel-by-pixel region growing using JIMS7. The output is denoted R_CHM2,MS2,JI17,JI7;
   (c) If three scales are employed, find new seeds (minimum allowed seed size of 64) for the output of Step (b) using JIMS7, and apply pixel-by-pixel region growing using JIMS7. The output is denoted R_CHM2,MS3,JI33,JI7,JI7;
5. Optionally, merge regions smaller than a specified threshold with the most similar neighbouring region, defining similarity as the Euclidean distance between the trimmed mean values of the regions under consideration.
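The core of the J criterion that drives the steps above can be sketched as a per-window computation: the ratio of between-class to within-class spatial scatter of the quantised labels, following the published definition [30]. The toy class maps below are illustrative assumptions; in the workflow, this value would be evaluated in a sliding window (w = 33, 17, or 7 px) over the clustered CHM or MS image to build each J image.

```python
import numpy as np

def j_value(class_window):
    """J criterion of JSEG for one window of a quantised class map.
    Low J: classes spatially well mixed (uniform texture).
    High J: classes spatially separated (texture/region boundary)."""
    h, w = class_window.shape
    coords = np.indices((h, w)).reshape(2, -1).T.astype(float)
    labels = class_window.ravel()
    s_t = ((coords - coords.mean(axis=0)) ** 2).sum()  # total scatter
    s_w = 0.0                                          # within-class scatter
    for c in np.unique(labels):
        pts = coords[labels == c]
        s_w += ((pts - pts.mean(axis=0)) ** 2).sum()
    return (s_t - s_w) / s_w if s_w > 0 else 0.0

checker = np.indices((4, 4)).sum(axis=0) % 2   # classes well mixed -> low J
halves = np.repeat([[0, 0, 1, 1]], 4, axis=0)  # classes separated -> high J
```

The elementwise maximum with the CHM morphological gradient mentioned in Step 1 would then simply be `np.maximum(j_image, gradient)` on the assembled J image.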

Generic Region Merging
The GRM algorithm of the Orfeo Toolbox (OTB) starts by considering each pixel a separate segment with a unique label. Adjacent segments are iteratively merged if they meet a homogeneity criterion. The OTB implementation offers several homogeneity criteria; we employed the Baatz and Schape criterion [34], which measures spectral and spatial homogeneity before and after merging two adjacent segments. Adjacent segments are merged if the criterion value is below a threshold T specified by the user. Weight coefficients can be applied to put higher significance on spectral homogeneity (w_s) or spatial compactness (w_c).
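A simplified sketch of such a merge cost is given below: the size-weighted growth of spectral heterogeneity combined with the growth of shape heterogeneity (compactness and smoothness), weighted by w_s and w_c. The exact formulation used by OTB follows [34]; the segment statistics dictionaries and their field names here are illustrative assumptions.

```python
import math

def baatz_schape_cost(seg1, seg2, merged, w_s=0.7, w_c=0.5):
    """Simplified Baatz-Schape merge cost for two adjacent segments.
    Each segment is a dict with keys: n (pixel count), std (per-band
    standard deviations), perim (perimeter), bbox_perim (bounding-box
    perimeter).  Merge is accepted if the cost is below a threshold T."""
    # spectral term: growth of the size-weighted standard deviation
    h_color = sum(
        merged["n"] * sm - (seg1["n"] * s1 + seg2["n"] * s2)
        for s1, s2, sm in zip(seg1["std"], seg2["std"], merged["std"])
    )

    def compact(s):  # perimeter relative to a square of the same area
        return s["perim"] / math.sqrt(s["n"])

    def smooth(s):   # perimeter relative to the bounding box
        return s["perim"] / s["bbox_perim"]

    def shape_growth(metric):
        return merged["n"] * metric(merged) - (
            seg1["n"] * metric(seg1) + seg2["n"] * metric(seg2))

    h_shape = w_c * shape_growth(compact) + (1 - w_c) * shape_growth(smooth)
    return w_s * h_color + (1 - w_s) * h_shape
```

With w_s = 1.0 the cost reduces to the purely spectral term, which is why the parameter tests below vary T, w_s, and w_c jointly.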
We applied GRM to the MS and CHM images separately, as well as to a fused dataset. The fusion of the 4 multispectral bands and the CHM was performed by simply adding the CHM as the fifth band of the image. The CHM was normalised by setting all height values larger than 50 m to 50 m and dividing the CHM by 50 to ensure values in the range [0, 1]. The multispectral bands were normalised by dividing each band by its maximum value to keep the values similarly distributed among the bands. Figure 3 provides a small sample of the input data for GRM: the GeoEye-1 false-colour image (NIR, green, and blue as the red, green, and blue layers), the GeoEye-1-based CHM, and the fused multiband image (NIR, green, and the CHM visualised as the red, green, and blue layers).
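The normalisation and stacking described above amount to a few numpy operations; the function name and the random toy rasters are illustrative assumptions.

```python
import numpy as np

def fuse_ms_chm(ms_bands, chm, chm_max=50.0):
    """Build the 5-band input for GRM: each MS band divided by its own
    maximum, the CHM clipped to chm_max metres and scaled to [0, 1]."""
    norm_ms = [band / band.max() for band in ms_bands]
    chm_norm = np.clip(chm, 0.0, chm_max) / chm_max
    return np.stack(norm_ms + [chm_norm], axis=0)

bands = [np.random.rand(8, 8) * 1000 for _ in range(4)]  # 4 MS bands
chm = np.random.rand(8, 8) * 60                          # heights up to 60 m
fused = fuse_ms_chm(bands, chm)                          # shape (5, 8, 8)
```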

Microstand Quality Assessment
Unsupervised image segmentation is by nature an ill-defined problem with many potentially correct solutions [30]. Objects in forests are not clearly visually separable, and the reference data also depend on the image analyst's subjective opinions. Therefore, the reference dataset likewise does not represent the sole, definitive solution [12]. Thus, finding the best segmentation algorithm or parameter combination according to numerical metrics is also challenging. Räsänen et al. [28] studied different direct supervised and system-level metrics for forest habitat mapping and concluded that different segmentation results were considered the best when different metrics were used.
We calculated three sets of metrics to capture different aspects of the segmentation results in hemiboreal forests: (1) unsupervised metrics characterising internal homogeneity and intersegment heterogeneity, (2) direct, supervised overlap metrics against the microstands delineated by the image analyst, and (3) the system-level RMSE of stand volume estimation using the different segmentations as the basic spatial unit; see Table 2.
All metrics were calculated for forest microstand segments only; where pairs of adjacent segments were analysed, segments belonging to other land cover types were excluded. Forest segments were selected by setting a height threshold on the CHM: a segment was considered a forest microstand only if its average GeoEye-1 CHM value was at least 3 m. According to local legislation, a stand with an average height of at least 5 m is considered a forest; this threshold was reduced to account for the spacing between the trees and the lower parts of the tree crowns, which are included in the calculation of the average value.
Table 2. Accuracy metrics employed in this study. The last column shows whether a higher or lower value indicates better segmentation: ↑, a higher metric value indicates higher quality; ↓, a lower metric value indicates higher quality.

The metric wVarNorm_CHM characterises the height variance of the pixels belonging to the same microstand [21]. Its upper bound is 1, reached when the variance of a segment equals the variance of the whole segmented image, and its lower bound is 0, reached when all pixel values within a segment are the same. Lower values indicate higher internal homogeneity of the segments. This metric is unsuitable if gaps between trees are clearly visible in the CHM: the height values of those gaps would affect the variance, although such gaps are not undesirable in a microstand if the stand density is low but forms a regular macrotexture. We therefore calculated wVarNorm_CHM for the GeoEye-1-based CHM, where individual tree crowns are not clearly visible.
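A minimal sketch of such an area-weighted, normalised within-segment variance is given below; the exact weighting used in [21] may differ slightly, and the toy CHM and label rasters are illustrative assumptions.

```python
import numpy as np

def w_var_norm(values, labels):
    """Area-weighted within-segment variance, normalised by the variance
    of the whole image: 0 for perfectly homogeneous segments, 1 when the
    segments are as heterogeneous as the image itself."""
    global_var = values.var()
    if global_var == 0:
        return 0.0
    weighted = sum(values[labels == lab].var() * (labels == lab).sum()
                   for lab in np.unique(labels))
    return float(weighted / (values.size * global_var))

chm = np.array([[2.0, 2.0], [20.0, 20.0]])
one_segment = np.zeros((2, 2), dtype=int)  # whole image as one segment -> 1
split = np.array([[0, 0], [1, 1]])         # segments follow the heights -> 0
```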

The intersegment heterogeneity metrics S_f were calculated as the average difference of the mean feature values between adjacent microstand pairs:

S_f = ( Σ_{i=1..n} Σ_{j=1..n} w_ij · | f_i − f_j | ) / ( Σ_{i=1..n} Σ_{j=1..n} w_ij ),

where n is the number of microstands; f_i and f_j are the average values of the feature f for microstands i and j; w_ij = 1 if microstands i and j have a common border and i ≠ j; otherwise, w_ij = 0.
We employed several features: S_CHM, S_MS, and S_LMH. S_CHM shows the average difference of the GeoEye-1-based mean CHM values of adjacent microstand pairs in metres, while S_MS shows the average Euclidean distance between the mean spectral vectors of adjacent microstands. S_LMH was calculated using the LiDAR-based CHM, where individual tree crowns can be separated by simple local-maximum filtering with a filter size of 3.5 m; it shows the difference in the average local-maximum height of adjacent microstands. For all these metrics, higher S_f values indicate better segmentation results, because the neighbouring microstands are less similar in the context of forest-related features.
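The S_f computation over adjacent pairs can be sketched as follows; the adjacency list and feature values are illustrative assumptions, and vector-valued features (S_MS) reduce to the Euclidean distance as described above.

```python
import numpy as np

def s_f(features, adjacency):
    """Average feature distance over adjacent microstand pairs.
    features  : array of shape (n,) or (n, d) with mean feature values
                (e.g., mean CHM height, or mean spectral vectors).
    adjacency : iterable of (i, j) index pairs sharing a border."""
    f = np.asarray(features, dtype=float)
    if f.ndim == 1:
        f = f[:, None]                      # scalar feature -> |f_i - f_j|
    dists = [np.linalg.norm(f[i] - f[j]) for i, j in adjacency]
    return float(np.mean(dists))

heights = [10.0, 14.0, 14.0]  # mean CHM per microstand (S_CHM case)
pairs = [(0, 1), (1, 2)]      # microstands 0-1 and 1-2 are adjacent
```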
The supervised, direct polygon overlap metrics included OS, US, D, and C_B. Oversegmentation (OS), undersegmentation (US), and the summary score (D) explained in [21,35] were used as area-based metrics to characterise the overlap between the microstand polygons delineated by the image analyst and by the segmentation workflow. Values close to zero indicate high accuracy, while values close to one indicate low accuracy.
We also calculated the boundary similarity metric C_B, similar to the outline proportions within buffer zones presented by Neubert and Herold [36]. Lucieer et al. [37] suggested characterising the quality and dissimilarity of borders by the average distance between a segment boundary pixel and the reference boundary (D(b)). However, a visual comparison showed cases where the border found by the workflow was actually more accurate than the border selected by the image analyst; in those cases, the average distance can give a false impression of errors. A more accurate interpretation is to consider these offsets as natural differences between border generalisation by a human interpreter and the pixel-by-pixel analysis of a computer method. Therefore, we used the metric C_B, which evaluates the percentage of the reference border located close to the segmentation borders (see Figure 4). It can be defined as follows:

C_B = | B_r AND B_m,dilated | / | B_r |,

where B_r is the rasterised reference boundary, one pixel thick, at the same spatial resolution as the remote sensing dataset; B_m is the one-pixel-thick boundary created by the segmentation workflow; and B_m,dilated is the buffer zone around B_m created by binary dilation with a square structuring element of size T × T.
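This boundary coincidence can be sketched with a binary dilation; the function name and the synthetic boundary rasters are illustrative assumptions (with T = 10 m at 0.5 m/px, the structuring element would be 20 × 20 px).

```python
import numpy as np
from scipy.ndimage import binary_dilation

def boundary_coincidence(b_ref, b_seg, size_px):
    """C_B: share of the reference boundary B_r falling inside a buffer
    zone around the segmentation boundary B_m (square structuring
    element of size_px x size_px)."""
    struct = np.ones((size_px, size_px), dtype=bool)
    b_seg_dilated = binary_dilation(b_seg, structure=struct)
    return float(np.logical_and(b_ref, b_seg_dilated).sum() / b_ref.sum())

b_ref = np.zeros((10, 10), dtype=bool)
b_ref[:, 5] = True   # reference border drawn by the analyst
b_seg = np.zeros((10, 10), dtype=bool)
b_seg[:, 7] = True   # border found by the workflow, offset by 2 px
```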
The threshold T sets the spatial distance within which we still consider B_r and the segmentation border B_m well matched. In our study, we set T = 10 m according to a visual assessment. Figure 4c shows B_r AND B_m,dilated: the fragments of the microstand border selected by the image analyst that are located close to the microstand border marked by the data processing workflow.

Finally, the system-level assessment was performed by applying a random forest regressor [38] for stand volume estimation at the microstand level, using the NFI data as the ground truth. If a microstand overlapped a forest compartment by more than 50%, the stand volume of the dominant tree species within that compartment was assigned to the microstand; otherwise, the microstand was not used for further processing. Each microstand was described by the following remote sensing data features: the mean values of the 4 GeoEye-1 spectral bands, the GeoEye-1 CHM mean value and standard deviation, the number of local maxima in the LGIA CHM (maximum filter size 7 m), and the average height of the local maxima in the LGIA CHM. As a result, each microstand was characterised by 8 features.
The described microstands with assigned reference stand volume values were split into training and test datasets using 70% for fitting the random forest and 30% for testing. Once random forest predictions for the test set were made, the RMSE was calculated as a quality metric. A lower RMSE indicates lower prediction errors. As microstands were employed as the basic spatial units, a lower RMSE for the segmentation case would indicate a better segmentation result for stand volume estimation purposes.
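The system-level evaluation described above follows a standard regression recipe; the sketch below assumes scikit-learn and uses synthetic feature vectors (the study used the 8 microstand features and NFI stand volumes).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def stand_volume_rmse(X, y, seed=0):
    """System-level metric: fit a random forest on 70% of the microstand
    feature vectors and return the stand volume RMSE on the other 30%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    rf = RandomForestRegressor(n_estimators=200, random_state=seed)
    rf.fit(X_tr, y_tr)
    pred = rf.predict(X_te)
    return float(np.sqrt(np.mean((pred - y_te) ** 2)))

# synthetic example: volume roughly proportional to mean canopy height
rng = np.random.default_rng(0)
X = rng.uniform(3, 30, size=(300, 1))       # mean CHM height, m
y = 12.0 * X[:, 0] + rng.normal(0, 5, 300)  # stand volume, m3/ha
```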

Adjusting Workflow Parameters
Parameters for the JSEG-based workflow were set based on a visual assessment to produce segments as similar as possible to the segments delineated by the image analyst.
Parameters for the GRM-based workflow were adjusted using computational parameter tests, in which all meaningful combinations of the most important GRM parameters T, w_s, and w_c were evaluated. The two best segmentation results for each dataset were produced using two different optimisation criteria:

1. D: the lowest D score, showing the best match with the segments delineated by the image analyst;
2. RMSE: the lowest RMSE when segments are employed as the basic spatial units for stand volume estimation.
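Such an exhaustive parameter test can be sketched as a grid search that keeps the best result under each criterion; the callback signatures and the parameter lists are illustrative assumptions.

```python
from itertools import product

def grid_search(segment_fn, evaluate_d, evaluate_rmse,
                thresholds, spectral_weights, compactness_weights):
    """Test all combinations of the GRM parameters (T, w_s, w_c),
    keeping the best segmentation under each optimisation criterion
    (lowest D, lowest RMSE).  Returns {criterion: (score, params)}."""
    best = {"D": (float("inf"), None), "RMSE": (float("inf"), None)}
    for T, w_s, w_c in product(thresholds, spectral_weights,
                               compactness_weights):
        segments = segment_fn(T, w_s, w_c)
        for name, score in (("D", evaluate_d(segments)),
                            ("RMSE", evaluate_rmse(segments))):
            if score < best[name][0]:
                best[name] = (score, (T, w_s, w_c))
    return best
```

In practice, `segment_fn` would call the OTB GRM application on a tile and the evaluators would compute the D score against the reference polygons and the stand volume RMSE, respectively.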
Due to the large amount of VHR data in our study site, we applied data parallelism to the segmentation workflows by splitting the dataset into 1 km × 1 km tiles according to the local map page division nomenclature and processing each tile in a separate process. Naturally, data splitting produces visible tile boundaries in the segmentation results, which could be avoided by adding postprocessing to the workflows; however, in our study, tile borders were not additionally processed.

Figure 5 shows an example of the first stage of JSEG segmentation using only the CHM as the input. The JSEG segmentation can efficiently capture smooth borders between microstands of different heights, and R_CHM2 appears to be a visually better match to the reference data. Figure 6 shows examples of the JSEG results when the multispectral data segmentation was added to the CHM segmentation. The green borders in the image confirm how the complementary nature of the height information in the CHM and the spectral information in the satellite images allowed significant aspects of microstands to be included: similar height classes were extracted from the CHM, while the coniferous and deciduous tree species and the tree crown closure as a macrotexture were extracted from the multispectral images.

Table 3 summarises the unsupervised metric values for the JSEG and GRM workflows, as well as for regular grid cells and the segmentation by the image analyst. The abbreviations CHM, MS, and fused (both the CHM and MS) indicate the dataset used for segmentation, while D and RMSE reflect the optimisation criteria used to adjust the workflows' parameters.
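The 1 km × 1 km tiling with one process per tile described above can be sketched as follows; the function names and the worker count are illustrative assumptions.

```python
from multiprocessing import Pool

def tile_extents(xmin, ymin, xmax, ymax, tile=1000):
    """Split a bounding box (in metres) into 1 km x 1 km tile extents,
    clipping the last row/column at the data border."""
    tiles = []
    y = ymin
    while y < ymax:
        x = xmin
        while x < xmax:
            tiles.append((x, y, min(x + tile, xmax), min(y + tile, ymax)))
            x += tile
        y += tile
    return tiles

def segment_in_parallel(extent, segment_tile, workers=4):
    """Apply a per-tile segmentation function in separate processes."""
    with Pool(workers) as pool:
        return pool.map(segment_tile, tile_extents(*extent))
```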

Unsupervised Metrics
The internal heterogeneity of the segments, measured by wVarNorm_CHM with respect to the CHM, was naturally higher for the regular grid and for the experiments performed using multispectral information only. When the CHM was included in the segmentation process, both segmentation algorithms achieved higher internal segment homogeneity than the image analyst.
Differences between the stand-specific segment characteristics S_f were higher for tests optimised by the D criterion but significantly lower for tests optimised using the RMSE of stand volume estimation. The GRM and JSEG algorithms showed the best performance on the fused dataset. GRM captured multispectral differences between stands better in all cases, while JSEG segmentation resulted in higher height differences between adjacent microstands.

Table 3. Unsupervised metrics. Arrows ↓ and ↑ indicate whether a lower or higher value is considered better. The best value for each metric is emphasised in bold.

Supervised and System-Level Metrics
The supervised direct and system-level metrics are summarised in Table 4, together with the average segment area avgA and the standard deviation of the segment area stdA. The lowest RMSE = 67.9 m³/ha was achieved using the image analyst's selection. However, in this case, the regression task was performed only for the six sample areas, containing a smaller number of segments; the other tests were performed on the whole study site with a much larger number of segments. For example, the GRM fused test included 10,422 segments for training the random forest regression and 4467 segments for testing, whereas only 1890 segments were available for training and 810 for testing in the case of the image analyst's selection. Table 4 shows that the stand volume estimation error was minimally affected by the segmentation results: the RMSE varied within a range of 6 m³/ha, with RMSE = 76.0 m³/ha for JSEG applied to the fused dataset and RMSE = 78.8 m³/ha for GRM applied to the fused dataset. However, this small difference does not demonstrate a practical gain of data fusion at the segmentation stage, since an even smaller error, RMSE = 74.8 m³/ha, was achieved by performing no segmentation at all and using regular grid patches as the basic spatial unit for stand volume estimation.
Comparing the overlap metrics between the results of the segmentation algorithms and the image analyst, one can observe that the summary score D was quite high for all outputs, indicating that the microstands found by the workflows were noticeably different in area from those found by the image analyst, though similar in size. Optimisation tests with respect to the RMSE showed clear trends of oversegmentation. The mutually similar OS and US scores for outputs with added multispectral information merely characterise different segmentation results without clear trends of over- or undersegmentation.
Meanwhile, the coincidence of the boundaries C_B was high, indicating that significant transitions were found even when the polygon shapes differed. However, the C_B metric was also affected by oversegmentation, which leads to a higher border matching score. Considering only the JSEG results and the GRM results with D as the optimisation criterion, 76-79% of the borders found by the image analyst were also found by the automatic segmentation methods.
We found no correlations between the border matching score and forest inventory parameters such as tree species, stand volume, and stand density. Still, Table 5 shows a mild correlation between C_B and the confidence level of the image analyst in the different sample areas: C_B was slightly lower for sample areas where the image analyst reported lower confidence levels, indicating complex forest structures. A visual assessment of those areas showed that Areas 1 and 6 were subject to intensive management, with frequent, clearly visible rectangular compartments divided by rides, roads, and ditches, while Areas 3 and 4 were very inhomogeneous. The higher confidence level for Area 3 might be explained by subjectivity factors: it was the first study site processed by the image analyst, and the subsequent work process and communication with remote sensing specialists might have changed this evaluation.

Applicability for Microstand Border Refinement
A visual assessment of the workflow segments confirmed that meaningful and relevant microstands were separated by both workflows. However, the segments were visually different from those selected by the image analyst. This was already expected because microstand segmentation is ill-defined with many correct solutions.
Considering oversegmentation OS, undersegmentation US, and the summary score D, the values in our study were higher than those presented by Sanchez-Lopez et al. [21], but these differences could also be caused by the different spatial resolutions of the datasets used in our study and in [21].
Since microstands in many cases cannot be unambiguously defined even by a human interpreter with fieldwork, C_B is an efficient metric for measuring the ability of a workflow to find significant borders even in the presence of oversegmentation. However, since the buffer zone is created around the algorithm border, C_B can show higher values merely due to oversegmentation. Therefore, the width of the buffer zone has to be taken into account when analysing C_B values: a larger buffer zone risks unreasonably inflating C_B, while a too-small buffer zone does not accommodate the possible delineation inaccuracies of the image analyst.
In general, the unsupervised metrics showed better values for the JSEG and GRM workflows than for the selection of the image analyst. This could be explained by border generalisation when microstands are selected by hand, while the data processing workflows without additional modules for generalisation draw borders at the pixel scale. Finer borders might result in higher internal segment homogeneity and higher average differences between adjacent segments. The study by [17] evaluated an internal homogeneity metric, normalised variance (the normalisation methodology was different than in our study), on a 1 m 2 grid and obtained higher values. However, this could be explained by both the different normalisation methodology and higher heterogeneity of LiDAR-based metrics.
Many authors acknowledge that segmentation accuracy metrics describe only certain aspects of the segments [28,35,39], and there are no metrics that can unambiguously determine the best segmentation result. Varo-Martínez et al. [17] specified that metric sets should be established for specific tasks; for example, precision silviculture requires very precise stand delineation with respect to intra-region variance.

Applicability to the Stand Volume Estimation Task
Our study showed that segments produced by different algorithms, datasets, and algorithm parameters have a small impact on stand volume estimation accuracy, suggesting that factors other than segment delineation limit further reduction of the RMSE. The same conclusion about different datasets, including LiDAR and hyperspectral data, was also reached by Dechesne et al. [26]. Moreover, even using regular grid patches as the basic spatial unit resulted in a lower error of the random forest regressor.
The RMSE values in our tests (74.8 m³/ha ≤ RMSE ≤ 80.6 m³/ha; nRMSE ≈ 13% using 600 m³/ha as the maximal stand volume) were similar to those obtained in other studies reviewed by Surovỳ and Kuželka [40].
The lack of significant fluctuations in the RMSE between different tests, together with the better results obtained using regular grid patches, raises a question about potential error sources. Classification accuracy likewise varied little in the study of Räsänen et al. [28], providing evidence of the robustness of object-based methodology: good classification accuracy can be obtained even if the segmentation is not the best possible.
However, in our study, the RMSE limits might be determined by the compatibility of the reference data with microstands or grid patches as the spatial units (see Figure 8) and by erroneous entries in the NFI database. The stand volume value for a forest compartment is obtained by summing the values of the microstands within that compartment. Since forest compartment boundaries are also affected by management factors and historical reasons, compartments are more heterogeneous than microstands. Therefore, in the case of mixed compartments, even stand volume values for dominant tree species are not valid reference data for the much more homogeneous microstands, introducing errors into both training and testing. A solution could be to extend the study area to a much larger region and use only pure NFI compartments (with one dominant tree species of a similar age) to capture correct ground truth while accounting for natural variations in tree species and age. Since the RMSE was used as just one metric to evaluate the segmentation quality, the workflow for stand volume estimation was also simplified, and further studies of the error sources in stand volume estimation would be required.

The higher accuracy obtained using regular grid patches could be caused by the reduced variance of the feature values when the spatial unit has no size variations. The optimisation procedure for GRM tended to achieve lower RMSEs when oversegmentation was applied with a higher emphasis on spectral homogeneity. Thus, the unsupervised metrics already allowed us to conclude that oversegmentation is not an undesirable phenomenon if the segments are to be used as basic units for forest inventory parameter estimation tasks. Still, the differences between tests varied within a range of no more than 7 m³/ha.
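The contrast between segments and regular grid patches as spatial units can be shown schematically: only the unit definition changes, while the per-unit feature aggregation feeding the regressor stays the same. The function names, the patch size, and the single mean feature below are hypothetical simplifications; the actual feature set and random forest regressor are out of scope here.

```python
def grid_patch_id(row, col, patch_size=20):
    """Map a pixel to its regular-grid patch id (patch_size in pixels)."""
    return row // patch_size, col // patch_size


def aggregate_features(pixel_values, unit_ids):
    """Per-unit mean of a pixel attribute (e.g. CHM height).

    unit_ids may come from a segmentation (microstands) or from
    grid_patch_id (regular patches); the aggregation is identical.
    """
    sums, counts = {}, {}
    for v, u in zip(pixel_values, unit_ids):
        sums[u] = sums.get(u, 0.0) + v
        counts[u] = counts.get(u, 0) + 1
    return {u: sums[u] / counts[u] for u in sums}
```

Because grid patches all have the same size, the distribution of such aggregated features is narrower, which is one possible explanation for the slightly lower regression error noted above.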

Conclusions
Microstands can serve as robust spatial processing units for an automated remote-sensing-based forest inventory and for increasing the accuracy of microstand borders used in the traditional forest inventory. GeoEye-1 stereo data offer a great opportunity to acquire both spectral and height information that is up to date and coincident in time.
Metrics characterising the overlap between the workflow segments and the segments produced by the image analyst showed that 76–79% of significant borders were also found by the segmentation methods; however, the segment geometry itself was different.
In future studies, we would recommend changing the accuracy assessment procedure by offering workflow-produced segments for review to several independent image analysts instead of requiring the image analysts to delineate microstands themselves. The workflow segments could then be analysed visually and assessed either as useful for daily forest inventory or as invalid, with the analysts specifying why a segment cannot be used for practical microstand border refinement. Since we found no correlations between forest inventory parameters and the confidence level assigned by the image analyst, the reasons for low confidence would provide valuable information for the further development of the workflows. This might also give better insight into the practical applicability of segmentation for border refinement, because the visual interpretation of the segmentation results confirmed meaningful microstand propositions.
Stand volume estimation tests showed no benefits of using segments instead of regular grid patches as the basic spatial unit. The best RMSE, 74.8 m³/ha, was achieved using regular grid patches. This could be explained by the incompatibility between the stand volume format in the NFI database and the format required for using microstands as the basic spatial unit.
The GRM workflow efficiently captures borders in multispectral data, whereas the JSEG workflow is more efficient at delineating tree crown macrotexture.
Both segmentation approaches would be useful for preparing ready-to-use microstand maps, but the accuracy assessment procedure should be switched to one that includes direct visual analysis of the workflow segments. For stand volume estimation using microstands as the basic spatial unit, the tests should be repeated with a much larger study site and only forest compartments with a homogeneous structure.