Article

Efficiency of Data Clustering for Stratification and Sampling in the Two-Phase ALS-Enhanced Forest Stock Inventory

Zakład Geomatyki, Instytut Badawczy Leśnictwa, ul. Braci Leśnej 3, Sękocin Stary, 05-090 Raszyn, Poland
*
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(23), 3871; https://doi.org/10.3390/rs17233871
Submission received: 12 October 2025 / Revised: 20 November 2025 / Accepted: 26 November 2025 / Published: 28 November 2025
(This article belongs to the Section Forest Remote Sensing)

Highlights

What are the main findings?
  • Ward’s clustering was advantageous over other data clustering methods.
  • Only a negligible reduction in RMSE was observed beyond about 200 sample plots.
  • Complex stands benefit more from increased sample size than homogeneous ones.
What are the implications of the main findings?
  • Data clustering methods can aid more optimal forest inventory stratification.
  • Structurally guided sampling can be effectively performed with the data clustering.

Abstract

Within the last few decades, ALS-enhanced two-phase forest inventory has emerged as a viable alternative to standard inventory designs. Because the method is relatively new and compound, significant potential for its optimisation remains. One key aspect concerns the design of the second-phase sampling. Apart from well-known designs such as random, systematic, or stratified sampling, which often involve some degree of uncertainty regarding their realisations, there are less common structurally guided sampling (SGS) designs, which can facilitate the unambiguous allocation of balanced and well-optimised samples. Unlike traditional stratification, the SGS design does not rely on fixed divisions, which may induce additional errors due to pre-defined and potentially non-representative strata. Instead of geographical (spatial) sample deployment, SGS uses the multidimensional space of covariates, e.g., ALS metrics, to optimise sample allocation. SGS can be powered by different engines. While some algorithms for SGS, such as the cube method or the local pivotal method, have been briefly tested in recent studies, no thorough attention has yet been paid to data clustering algorithms. Therefore, this study compares the performance of several popular data clustering algorithms for structurally guided sampling to train the model for growing stock volume estimation in a two-phase ALS-enhanced forest inventory design. The results showed that hierarchical clustering was competitive with other methods and outperformed them in terms of the stability of estimates, even at lower sampling intensity levels. The use of data clustering methods can ensure an unambiguous yet more optimal sample distribution, minimising the sampling variation or estimation error caused by the randomness of other sampling methods or the inflexibility of pre-defined strata.

1. Introduction

Accurate information about forest resources is becoming increasingly relevant as pressure on forest environments grows to sustainably provide multiple services related to, among others, climate change mitigation, the global carbon cycle, natural environment preservation, renewable wood production, sustainable forest management, biodiversity restoration, landscape management, social services, and tourism [1,2,3,4,5,6,7]. This issue is becoming more pressing as forest areas are generally shrinking and are increasingly isolated into smaller patches [8,9,10]. Another key factor justifying the need for accurate environmental information is the range of policies that use growing stock volume (GSV) or biomass data for carbon budget management [11,12,13,14,15]. In such circumstances, the margin for forest management error should be minimised. Therefore, the demand for precise information about forest resources has become apparent and has been emphasised by numerous studies.
As single-tree inventories are not yet practically operational at larger scales and satellite remote sensing (RS) solutions may yield some local bias when applied to small areas without in situ calibration data, traditional ground-based sampling remains the standard in most forest censuses at various scales [16,17,18]. Among the many sampling designs, different realisations of systematic sampling (SYS) are by far the most commonly applied in practical forest inventories [16,19,20,21,22,23,24]. The key benefits of this approach are reported as follows: (1) economical and efficient compared to random sampling, (2) capable of providing uniform coverage of the entire sampling population, (3) relatively simple in terms of sample deployment and operational field logistics, and (4) not prone to plot clumping [25,26,27]. Despite these advantages, systematic sampling may be challenging for the following reasons: (1) substantial bias and/or non-representative samples if the population exhibits strong patterns with spatial autocorrelation, (2) lack of a design-based variance estimator, causing analysts to often resort to simple random sampling (SRS) formulas for uncertainty approximation, which can inflate estimation intervals, (3) less true randomness, meaning not every possible sample configuration is equally likely, (4) inflexibility in the sampling interval, and (5) omission of rare specimens or clusters often individually scattered over the sampling area [22,28,29,30,31,32,33].
Alternative sampling methods are noticeably less common in forest inventory practice. Although considered unbiased and straightforward, simple random sampling (SRS) results in irregular plot distribution and related logistical and ergonomic issues—the aspects especially pertinent in difficult forest terrain. Moreover, this approach requires a substantial sample size to minimise random effects and ensure the stability of results [33,34]. Cluster sampling (CS) addresses some of the aforementioned issues of SRS and is therefore often combined with systematic sampling (SYS), particularly for large and dispersed populations, as in the case of many national forest inventories (NFI), e.g., Poland [35,36], Finland [37], Germany [38,39], USA [40], and Romania [41]. On the other hand, CS still has notable potential for optimisation in forest inventory. Key areas for optimisation include the following: reduction in the clustering effect, i.e., minimisation of the standard error and inherent uncertainty [42,43]; cluster definition and their universality [44,45]; numerical optimisation with spatial information [46]; and improvements in forest attributes estimation by integration with other techniques [47]. Furthermore, CS poses a similar risk of systematic errors as stratified sampling (STRS) when clusters or strata are poorly defined and non-representative of the population, which often occurs when no a priori knowledge is available about the target area [24,33,48,49].
This is where remote sensing (RS)-enhanced methods often prove useful. RS data are known to aid contemporary forest inventories in several ways: enabling continuous spatial coverage of the entire area of interest (AOI), providing initial knowledge about the sampling population [50,51], increasing the precision of stand-level attributes estimation [52], reducing sample size [50,53,54,55,56], and enabling objective monitoring of structurally complex and hardly accessible sites [57,58]. The two-phase sampling design with an RS regression estimator (often referred to as the area-based approach) [59] is of particular interest to forest inventory data analysts, researchers, and an increasing number of practitioners. The suitability of this method for forest inventory purposes, especially in connection with airborne laser scanning (ALS), has been demonstrated by many studies [60,61,62,63]. RS-based methods have been under development for over a dozen years and have consequently been adopted to aid practical forestry [64,65,66,67,68,69,70]. In short, the core aspect of this method relies on the relationship between RS metrics (first-phase plots) and ground (second-phase) plots, on which variables of interest are measured, usually as calibration/control data for some kind of model that serves as an estimator for the whole sampling area [71,72].
As previously mentioned, in practice, ground plots are usually sampled using a systematic rather than a random approach, where general sample size is often determined by formulas in which one term represents the minimal/base sample size (either a fixed number, area, or allowable error-dependent quantity), and the other term reflects some measure of variance in the population [25,42,70,73,74,75,76,77]. Nowadays, thanks to the application of RS data, it is practically feasible to assess stand diversity and variance within the target population in advance, which can help stratify the area of interest before ground measurements and approximate optimal sample size [54,78,79]. As determining the required number of plots based on the desired sampling error and/or population variability seems logical and well justified [80], some research challenges methods which incorporate sampling intensity (e.g., surface area) [81,82], instead seeking to establish sampling thresholds for specific forest regions [55,83,84,85]. Moreover, aside from the previously mentioned caveats regarding sample plot deployment strategies, the grid origin in practical systematic sampling is often a single arbitrary point (usually some AOI corner) [19,70], which when appointed may influence inventory results if not randomly selected or when the resulting sample is non-representative [42,86,87]. This highlights the need to address the common pitfalls of systematic sampling, for instance, by using RS auxiliary data.
Similarly to SYS, CS is often used, especially in NFI-sampling designs. While much research focuses on the spatial (geographical) optimisation of the cluster/plot distribution [44,46,88,89,90,91,92,93,94], less attention is given to sampling in the feature domain consisting of auxiliary variables (e.g., RS/ALS data) [54,95,96,97,98,99]. Perhaps, instead of seeking an optimal spatial pattern for sample plot distribution, which, if not random or representative, may yield biased results [100,101], investigating the multidimensional space of independent variables may offer an alternative perspective on how to view modern forest surveys. Bearing in mind the above caveats, the use of SYS or CS for second-phase sample distribution may appear as a hindrance to revealing the full potential of RS-enhanced forest inventories (EFI). Therefore, the aspects of second-phase sample plot distribution that consider sample size determination and its spread in both spatial and feature domains could still be optimised, especially as some promising results in this area have been reported by Junttila et al. (2013) [21] and more recently by Heikkinen et al. (2025) [102].
To reduce the variance and uncertainty of an estimator, as well as the excessive networking associated with SYS/CS, some researchers propose RS data-driven stratification for sample plot selection. These attempts are collectively referred to as structurally guided sampling (SGS) [103]. In model-based inventory design, this approach can improve the accuracy and precision of estimates [53,104,105,106,107]. One of the most important assumptions in this design is that adequate strata representing the feature space variability of the target population are objectively composed. Dividing the inventory area into homogeneous patches, based on accurate and up-to-date ALS metrics (which can also serve as predictors of forest attributes), can help capture the variability in the target area, while maintaining the sample size at an optimal level, as some portion of variance may already be explained by the strata division [54,98]. Moreover, such stratification can guide representative and more precise sample distribution, entailing all related benefits [104,108,109,110]. Some techniques for stratification and sampling in auxiliary data space have already been tested by Hawbaker et al. (2009) [111], who used a fixed number of strata (30) to delineate forest patches homogeneous in relation to ALS-height metrics, showing substantial improvements over SRS. Similarly, Gobakken et al. (2012) [53] used an arbitrary ALS-assisted stratification (8 strata) for sample plot designation in Scots pine/Norway spruce-dominated stands, which allowed for RMSE and sample size reduction. Melville et al. (2015) [109] employed balanced sampling (the cube method [112]) as an effective alternative to stratification. A year later, they combined relatively simple nearest centroid (NC) sampling with k-means stratification for both small area estimations and totals, reporting a 50% gain over SRS [109]. Another SGS technique was adopted by Grafström et al. (2014) [79], who introduced the local pivotal method (LPM), finding its performance competitive with SRS. Nevertheless, only a few studies have so far covered the application of data clustering algorithms for sampling and/or strata designation [98]. In most cases, only the popular k-means approach was studied [109], though not always explicitly concerning forest stock inventories [113]. More recent insights into other clustering techniques were obtained by Xu et al. (2024) [99], who used Ward’s hierarchical clustering to improve inter-strata heterogeneity in coniferous sites and reduce sample size by about 20%. These promising results and the niche character of the subject should encourage further and more comprehensive studies.
Data clustering methods can be distinguished based on the fundamentals behind the data discretization, i.e., partitioning, hierarchical, and density-based approaches, and/or the type of data they handle, i.e., numerical, categorical, ordinal, or mixed [114]. Nonetheless, the principal goal of all these methods is to split the full dataset into groups of observations characterised by similar variance, while maximising the variance between the groups [115]. These procedures are often performed in a multidimensional space consisting of independent or auxiliary variables (covariates), where the number of clusters is often the sole or the most significant hyperparameter to set [98,116]. There are no standardised guidelines regarding the use of the most optimal method [117]. The debate about the suitability of particular methods is lively among data scientists, with various conclusions, greatly depending on the research field, goal, and data specification [98,116]. The key aspects in this regard usually concern the following: appropriate agglomeration algorithm selection, coupled with the appropriate number of clusters and clustering variables [98,116,118]; the type and scale of heterogeneity in the population [45]; branch specification and thus specific data characteristics and complexity [119,120]; the goal of the data analysis [121]; the amount and type of inherent noise and the presence of outliers [122,123]; results consistency [124,125]; and computational efficiency [117,126]. The variety of subjects concerning data clustering algorithms motivated us to draw a summary table (Table 1) that presents a brief comparison of selected data clustering methods evaluated in this study.
As shown in the previous paragraphs, there is still some progress to be made in the context of practical optimisation of forest inventory campaigns. As concluded by Tompalski et al. (2019) [136], the details concerning the required sample size for model calibration still remain to be determined. The range of possible solutions presented, along with their potential and inherent uncertainty, creates the ground for an analysis of their aptitude with respect to the interrelated aspects of desired accuracy, precision, available resources, sampling scheme, sample size, and plot distribution. Clearly, the issues in question are complex, so addressing them all within the scope of a single study may prove challenging. Therefore, the aim of this study was to compare the performance of popular data clustering methods, fed with multiple ALS metrics, in a two-phase forest inventory design, driven by the sample size and the diversity of the population. The authors sought to address the following questions: (1) How can data clustering methods aid optimal stratification for GSV estimation in a two-phase ALS sampling design? (2) What is the influence of sample size in this context? (3) Do the results of particular clustering methods vary with the level of stand complexity? (4) What is the sensitivity or stability of particular methods? (5) How do the scores compare with other, more traditional sampling designs?

2. Methods

2.1. Input Data

The reference data come from 5870 sample plots located in 10 forest districts in Poland: Białowieża, Supraśl, Milicz, Katrynka, Herby, Gorlice, Pieńsk (all measured in 2015), Taczanów, Głogów Małopolski (measured in 2020), and Leżajsk (measured in 2021). Ground surveys were conducted in compliance with Polish standard forest management inventory procedures established in IUL 2012 [137]. Tree species, age, height, and breast height diameters were recorded for every tree exceeding 7 cm at DBH, within 500 m2 circular sample plots, systematically distributed in each forest district. Volumes of individual trees were estimated using the official allometric equations applied by Polish State Forests [138] and aggregated at the single plot level. In this study, the GSV was the sole target variable, as it is one of the most important forest attributes, closely related to biomass and carbon stock [139], and its accurate determination typically requires larger samples than other variables [80]. Plot coordinates were recorded using Topcon HiPer V (Topcon Positioning Systems Inc., Livermore, U.S.) survey-grade GNSS receivers, achieving approximately 1 m positioning accuracy, which should be sufficient for the two-phase ALS-aided forest inventory [140,141]. Table 2 contains the main characteristics of selected forest districts, while Figure 1 shows their locations.
Airborne laser scanning campaigns were conducted simultaneously with field inventories. Point cloud tiles covering all forest districts were provided in LAS 1.2 file format. Height normalisation was performed using Digital Terrain Models (DTM) with 0.5 m resolution. Spatial co-registration between ground sample plots and the corresponding 3D space of the point clouds was then carried out. Clipped point clouds were subsequently equalised to a density of 10 pulses per square metre using an original thinning algorithm [141] for better control over the analysed factors. Finally, a set of standard ALS metrics was computed for each individual plot. All data processing in this study was performed employing the lidR library within the R programming environment [142].
In order to account for non-linearity among some explanatory variables, log and power transformations were applied to the original ALS metrics. Categorical variables, such as species information, were converted to numerical values and expressed as the volumetric share of each species within a plot. These procedures resulted in a large number of potential predictors (over 360), so data dimensionality reduction steps were necessary. At first, metrics with a coefficient of variation lower than 10% were eliminated. Next, a random forest (RF) generic model was run on the entire dataset to evaluate the importance of each variable. Metrics contributing less than 10% towards the GSV prediction (according to the RF importance index embedded in the randomForest R package (v. 4.7-1.1) [143]) were excluded from further analysis. Finally, an autocorrelation matrix was constructed, and variables were assigned to groups of similar categories. In each group, one or two strong GSV predictors were retained. Table 3 presents the final grouping and explanatory variables used as input data for the main analysis.
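The first screening step (dropping metrics with a coefficient of variation below 10%) can be illustrated with a minimal Python sketch. The study itself was carried out in R; the metric names, the toy values, the use of the population standard deviation, and the handling of zero-mean metrics are assumptions made only for this illustration.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return the standard deviation divided by the absolute mean."""
    m = mean(values)
    if m == 0:
        # CV is undefined for a zero mean; returning infinity means
        # such a metric is never dropped by the filter below.
        return float("inf")
    return pstdev(values) / abs(m)


def drop_low_variation(metrics, threshold=0.10):
    """Keep only metrics whose CV is at least `threshold` (10% in the text)."""
    return {
        name: values
        for name, values in metrics.items()
        if coefficient_of_variation(values) >= threshold
    }


# Hypothetical toy metrics for five plots (names are illustrative only).
metrics = {
    "height_p95": [18.0, 22.5, 30.1, 12.4, 25.0],
    "canopy_cover": [0.90, 0.91, 0.89, 0.90, 0.92],
}
kept = drop_low_variation(metrics)
print(sorted(kept))
```

In this toy input the nearly constant cover metric is removed, while the height percentile, which varies strongly between plots, is retained.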

2.2. Forest Generation

An underlying assumption had to be fulfilled to make the results of this study comparable, viz. equal-sized populations for each sampling intensity and forest complexity level. Therefore, the core analysis of this study was based on mutated, semi-artificial (synthetic) forests, generated from real inventory data described in the previous sub-chapter. We called them semi-artificial, as they were only merged and replicated based on real ground forest inventory measurements. The procedures behind forest generation were as follows. First, the real data sample plots were ordered according to their reference GSV values. Next, every 10th plot was selected as an independent forest kernel. This step was necessary to reduce the tremendous number of possible forest instances. Subsequently, all explanatory variables were scaled, and the k-Nearest Neighbours (kNN) algorithm was employed to expand the kernel with the most similar plots. The final procedure in forest generation was the replication/mutation of the selected plots. To ensure a sufficiently large sampling population and to keep computational constraints at feasible levels, the forest ceiling was set at 50,000 plots. The concept of mutation of the original plots was based on the addition of small noise/variance to the vectors of variables describing the original plots (up to a few percent, depending on the variable). This approach is known in the literature as synthetic oversampling (bootstrap) with stochastic variance induction [147,148,149].
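The generation steps described above (kernel selection, kNN expansion, and replication with small noise) can be sketched as follows. This is a simplified illustration and not the authors' implementation: the uniform relative noise of up to 3%, the Euclidean distance, the starting position of the "every 10th plot" selection, and all function names are assumptions.

```python
import math
import random


def select_kernels(gsv, step=10):
    """Order plots by reference GSV and take every `step`-th one as a kernel.

    Starting from the first ordered plot is an assumption.
    """
    order = sorted(range(len(gsv)), key=gsv.__getitem__)
    return order[::step]


def scale(rows):
    """Standardise each column to zero mean and unit standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]


def nearest_neighbours(scaled, kernel_idx, k):
    """Indices of the k plots closest to the kernel in the scaled space.

    The kernel itself has distance 0 and therefore sorts to the front
    unless exact duplicates of it exist.
    """
    order = sorted(
        range(len(scaled)),
        key=lambda i: math.dist(scaled[i], scaled[kernel_idx]),
    )
    return order[:k]


def mutate(plot, rng, max_noise=0.03):
    """Perturb every variable by a relative noise of up to +/- max_noise."""
    return [v * (1.0 + rng.uniform(-max_noise, max_noise)) for v in plot]


def generate_forest(plots, kernel_idx, k, size, seed=0):
    """Expand a kernel with its k nearest plots and replicate them with noise."""
    rng = random.Random(seed)
    members = [plots[i] for i in nearest_neighbours(scale(plots), kernel_idx, k)]
    return [mutate(members[i % len(members)], rng) for i in range(size)]
```

With `size=50000` the function would reproduce the forest ceiling mentioned in the text; smaller values are convenient for experimentation.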
Each synthetic forest generated in this way was evaluated for its structural complexity. For this purpose, a Weighted Coefficient of Variation (CVW) was introduced as the proposed forest complexity indicator. This is essentially the average coefficient of variation from all covariates, weighted by the Pearson coefficients of determination between each predictor and the dependent variable, i.e., the GSV. The entire procedure of forest generation and evaluation was repeated 30 times for each scenario, to account for some randomness inherent in the kNN algorithm, while maintaining a statistically sufficient number of replications [150,151,152,153]. In total, 211,320 uniquely generated forests were produced. Table 4 presents the combination of all factors and levels used to create unique scenarios for forest generation.
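One possible reading of the CVW indicator is a weighted mean, CVW = Σ(r²_j · CV_j) / Σ(r²_j), where CV_j is the coefficient of variation of covariate j and r²_j is the squared Pearson correlation between covariate j and the GSV. The sketch below follows this reading; the normalisation by the sum of weights and the use of the population standard deviation are assumptions, as the text does not give the exact formula.

```python
from statistics import mean, pstdev


def pearson_r2(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable carries no linear information
    return sxy * sxy / (sxx * syy)


def weighted_cv(covariates, gsv):
    """Average CV of the covariates, weighted by their r2 with the GSV.

    Assumes non-zero covariate means and at least one non-zero weight.
    """
    cvs, weights = [], []
    for values in covariates.values():
        cvs.append(pstdev(values) / abs(mean(values)))
        weights.append(pearson_r2(values, gsv))
    return sum(c * w for c, w in zip(cvs, weights)) / sum(weights)
```

Under this definition, a covariate that is uncorrelated with the GSV receives zero weight and does not influence the indicator, regardless of how variable it is.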

2.3. Clustering/Stratification and Sampling

Several data clustering methods were evaluated in terms of their ability to define optimal stratification groups for drawing a sample for model calibration. Clustering and stratification share the same goal. Essentially, it involves transitioning from a set of variables into discrete, more general groups, by minimising within-group (co)variance and maximising between-group distances [115]. While the fundamentals behind most data clustering methods are richly described in the literature (Table 1), we need to explain the principles behind our original, yet relatively simple, IDS algorithm, which stands for Individual Dimensions Sampling. The main difference from other clustering methods is that in IDS, the distribution and variance are evaluated separately for each predictor. In IDS, predictors are ordered according to their importance tag, e.g., the value of GSV correlation. Next, each predictor receives one plot (starting from the strongest to the weakest). The loop is repeated until no more plots are available. The vector of each independent variable is then divided into groups using the natural breaks algorithm [154], with the number of groups equivalent to the number of sample plots assigned to the given variable. As plot selection in IDS was independent for each variable, no covariate scaling was performed before clustering operations, unlike for other clustering methods, where normalisation of covariates was performed using the base R scale function [155].
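The plot allocation loop of IDS can be illustrated with a short sketch. It covers only the round-robin distribution of plots among predictors; the subsequent natural breaks partitioning of each variable is omitted. The predictor names and importance values are hypothetical.

```python
def allocate_plots(importance, n_plots):
    """Distribute n_plots among predictors, one at a time, strongest first.

    `importance` maps predictor name to its importance tag
    (e.g., correlation with GSV). The loop cycles over the ranked
    predictors until no plots are left.
    """
    ranked = sorted(importance, key=importance.get, reverse=True)
    counts = dict.fromkeys(ranked, 0)
    for i in range(n_plots):
        counts[ranked[i % len(ranked)]] += 1
    return counts


# Hypothetical importance tags for three predictors.
importance = {"h_mean": 0.8, "cover": 0.5, "h_sd": 0.6}
print(allocate_plots(importance, 7))
```

When the number of plots is not a multiple of the number of predictors, the strongest predictors receive the surplus plots, so each count would then be the number of natural breaks groups for that variable.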
For most data clustering algorithms, the number of clusters, k, is often the sole and the most important hyperparameter to set. For these algorithms, to adequately capture the full range of variance within the target population and to decimate overly saturated groups (reducing redundant plots), we decided to directly link the number of clusters to the number of sample plots available in a given scenario (N). Here, it is important to distinguish the notation: N refers to the number of plots available in a given scenario, while n refers to the number of plots assigned to a given cluster. In most cases, k = N. This approach allowed us to avoid k tuning, which would otherwise have made the already large number of simulations even greater. In contrast, the architecture of HDBSCAN requires setting a minimum cluster size instead of the number of clusters. To mimic the behaviour of other clustering algorithms, i.e., to equalise the number of clusters k with the number of plots N, the minimum cluster size for HDBSCAN was set as the quotient of k-neighbours (from the forest generation stage) and the number of plots available in the given scenario.
Intra-group sampling was the next step after forest stratification by clustering algorithms. As in most scenarios, the number of k clusters equalled the number of plots N, two main sampling solutions remained feasible, i.e., random and cluster-centroid sampling. As random in-cluster sampling induces another source of variance, and to better control the experiment of comparing selected clustering methods, the authors focused on centroid-based sampling, which has unambiguous properties in selecting precisely specified plots. Cluster-centroid sampling aimed to identify the plot closest to the centre point of a cluster in its multidimensional space, expressed with scaled covariates. The distance was defined as the average difference between the values of particular covariates characterising a plot and the averaged values of the corresponding covariates from all plots within the cluster. In scenarios where N > k (HDBSCAN), cluster quantile interval sampling was used. For example, if there were two plots in a cluster, those selected were at the 25% and 75% quartiles of the distance distribution; if three plots were available in the cluster, the selected ones were, respectively, at 25%, 50%, and 75% quartiles, and so on. To contrast the performance of cluster-stratified sampling methods, results from simple random sampling were juxtaposed, as well as those from the use of generic models (GM), i.e., models trained on all real sample plots except those chosen to generate the target (artificial) forest.
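Cluster-centroid sampling, as described above, can be sketched in a few lines. The sketch assumes that the "average difference" is the mean absolute difference across scaled covariates and that ties are resolved in favour of the first plot; both are assumptions. The quantile interval variant used for HDBSCAN is not shown.

```python
from statistics import mean


def centroid_plot(cluster):
    """Return the index of the plot closest to the cluster centroid.

    `cluster` is a list of plots, each a list of scaled covariate values.
    The distance is the mean absolute difference between a plot and the
    per-covariate cluster means.
    """
    centroid = [mean(col) for col in zip(*cluster)]

    def distance(plot):
        return mean(abs(v - c) for v, c in zip(plot, centroid))

    return min(range(len(cluster)), key=lambda i: distance(cluster[i]))


# Toy cluster of four plots described by two scaled covariates.
cluster = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [2.0, 2.0]]
print(centroid_plot(cluster))
```

Because the selection is deterministic for a given cluster, repeating the procedure on the same stratification always returns the same plots, which is the property the authors rely on to isolate the effect of the clustering method.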

2.4. GSV Estimation and Validation

Random forest models were trained on the samples selected in each scenario. The RF estimator was implemented as it is relatively universal and does not require a priori knowledge about the data structure. It handles non-linearity among the input data autonomously, and different types of variables can be processed in its engine. The hyperparameters of the RF model as implemented in the R library randomForest [143] were as follows: number of trees: 701; number of variables: 7; node size: 5. To reduce another source of contingency, the same RF settings were used in each simulation. We did not expect significant influence from this factor as, in general, the RF hyperparameters are known to have marginal effects on the predictive performance of RF models, provided that the number of trees is set to the default [156], around 100 [157], at least 128 [158], or a few hundred [159]. The models calibrated in this way were subsequently used to estimate the GSV value of the synthetic forest. The criteria for the performance of each method were normalised estimation errors, i.e., nBIAS and nRMSE. Validation of both models and clustering methods was ensured in 30 repetitions. For better understanding of the methodology applied, a graphical concept presenting the workflow of analysis and data processing steps is provided in Figure 2.
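The two error measures can be illustrated with a minimal sketch. The text does not state the normalisation explicitly; the version below assumes that both measures are divided by the mean observed GSV and expressed in percent, and that the error is defined as predicted minus observed.

```python
import math


def nbias(predicted, observed):
    """Mean error (predicted minus observed) as a percentage of the observed mean."""
    n = len(observed)
    mean_error = sum(p - o for p, o in zip(predicted, observed)) / n
    return 100.0 * mean_error / (sum(observed) / n)


def nrmse(predicted, observed):
    """Root mean square error as a percentage of the observed mean."""
    n = len(observed)
    mse = sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n
    return 100.0 * math.sqrt(mse) / (sum(observed) / n)


# Toy GSV values (m3/ha) for three plots.
observed = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
print(round(nbias(predicted, observed), 2), round(nrmse(predicted, observed), 2))
```

With this sign convention a positive nBIAS indicates overestimation of the GSV.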

3. Results

Figure 3 and Figure 4 present an essential part of the results. Figure 3 traces nRMSE and nBIAS trends as a function of stand complexity expressed by the CVW indicator for each investigated sampling design. In Figure 3, to illustrate central tendencies, the results from 30 repetitions were aggregated as arithmetic means. Figure 4 assesses the stability of each method, expressed as the standard deviation of the error measures. Second-degree polynomial functions were fitted to the raw scatter plots to better highlight subtle differences between some methods and to clear out the jitter caused by the iterations.
Variance among the original sample plots, which mostly come from managed Polish forests, allowed synthetic stands of only small to medium complexity to be derived, i.e., CVW values between 0.15 and 0.4. Therefore, the displayed trends should only be considered valid within this range. A strong positive correlation was detected between RMSE and CVW values, indicating less precise estimates for more complex stands (R2 from 0.80 for GM to 0.92 for HDBSCAN). The obtained RMSE values ranged from 10% to 24%. This type of error depended more on stand complexity level than on the particular sampling scheme or sampling intensity (Figure 3, left column). The gain in performance due to increased sampling intensity was most evident in the transition from 100 to 200 samples, after which differences gradually became more subtle. Similarly, for random sampling in the two-phase ALS-enhanced forest inventory, there is usually no benefit in increasing sampling intensity above about 200–300 plots, as only negligible performance gains can be attained [55]. A congruous optimal sampling zone was identified by Grafström et al. (2014) [79], who used the local pivotal method (LPM) [160] to draw a spatially balanced sample in a similar study design. Structurally guided sampling and stratification methods analysed in this study also proved competitive with the systematic sampling design presented in Parkitna et al. (2021) [145], who used 900 evenly distributed sample plots to train random forest models, boosted regression trees, and ordinary least squares regression. Regarding systematic error (Figure 3, right column), even 100 samples were sufficient to keep BIAS values within ±1% for all compared methods except HDBSCAN and generic models, the former being visibly more susceptible to sampling intensity levels. The same figure also shows that complex stands benefit more than simple ones (higher RMSE reduction) from an increased number of sample plots. The largest observed change was for HDBSCAN (from 23% to 18%) when transitioning from 100 to 300 plots. This trend was also confirmed by Li [80], who reported that higher heterogeneity in the vertical structure of forest stands entails an increased minimum number of required sample plots.
Relative differences between particular methods were not substantial—up to 3 percentage points (pp), with similar trends along the CVW axis. Nevertheless, hierarchical clustering (H-CLUST) proved to be the most efficient and stable method (Figure 3 and Figure 4). Unsurprisingly, simple random sampling and k-means were the least-biased designs (Figure 3), but required more sample plots to achieve stability comparable to that of hierarchical clustering (Figure 4). The performance of the two remaining methods, that is, IDS and HDBSCAN, was usually intermediate between the most efficient methods (H-CLUST/k-means) and the contrast (GM) method, with HDBSCAN being the most susceptible to the number of sample plots. Generic models yielded the worst results among all methods. The discrepancies compared to other methods became more pronounced as the number of sample plots used to train the RF model in other sampling methods increased.
In certain methods, transitions from underestimation to overestimation (and vice versa) can be observed around a CVW value of 0.2. However, the magnitude of this effect is small, and it is caused either by random variation between iterations or, more likely, by the proximity of the scenario sample mean to the total sample mean in this region. Finally, a flattening effect can be observed at CVW values of around 0.3–0.4, indicating no further increase in RMSE with stand complexity. However, this behaviour should be interpreted with caution, as it is most likely caused by the smaller number of available data inputs (kernels) at higher forest complexity levels and the increased variance in those regions.

4. Discussion

Certain limitations of our study must be acknowledged; at the same time, they shed light on the potential for further research. Specifically, the presented array of 5870 real sample plots, along with its range of complexity, is characteristic of Central European forests; therefore, the results should be considered applicable only to this population. Consequently, there is a well-founded need to apply the proposed methodology to more diverse and complex stands in order to determine whether estimation error measures stabilise above a certain CVW level. This is particularly important, as shown in Figure 3 and by Tompalski et al. (2019) [136], because applying models outside their sampling region (generic models) may result in increased inaccuracies if calibration and estimation forests differ substantially. Nonetheless, the nearly 6000 real sample plots (with both ground measurements and ALS data) provide a solid basis for validation within the applicable regions.
Secondly, as model-based inference is a crucial component in the two-phase sampling design with regression estimator [59], this study compared the performance of selected data clustering methods for optimal model calibration sampling. The performance was evaluated using standard measures of predictive capability, namely, BIAS and RMSE. As shown in the previous section, there is a very strong correlation between RMSE and variance in the population. Therefore, with a large array of original ground sample plots, where every single tree was measured, the model estimation errors should closely approximate the true variance in the population. For this reason, we did not evaluate variance estimators as is performed in more traditional design-based inference.
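The two accuracy measures named above can be illustrated with a minimal sketch. It assumes the common relative definitions, in which both measures are expressed as a percentage of the observed mean and the error is taken as predicted minus observed; the function names and the toy GSV values are hypothetical and are not taken from the study.

```python
import math

def relative_bias(observed, predicted):
    """Mean error (predicted - observed) as a percentage of the observed mean."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_err = sum(p - o for o, p in zip(observed, predicted)) / n
    return 100.0 * mean_err / mean_obs

def relative_rmse(observed, predicted):
    """Root mean square error as a percentage of the observed mean."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mse = sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n
    return 100.0 * math.sqrt(mse) / mean_obs

# Hypothetical plot-level GSV values (m3/ha), for illustration only.
gsv_observed = [200.0, 300.0, 400.0, 300.0]
gsv_predicted = [210.0, 290.0, 420.0, 280.0]

bias_pct = relative_bias(gsv_observed, gsv_predicted)
rmse_pct = relative_rmse(gsv_observed, gsv_predicted)
print(f"BIAS = {bias_pct:.2f}%, RMSE = {rmse_pct:.2f}%")
```

In this toy case the over- and underestimates cancel, so the BIAS is zero while the RMSE is about 5.3%, which shows why the two measures are reported side by side.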
One should also bear in mind other aspects that could slightly influence the observed trends but were not covered within the scope of this study, mainly to avoid an excessively convoluted design. First, grouping procedures constitute the initial stage in stratified sampling design. Strata can be delineated in various ways, but their optimisation in terms of required accuracy and available resources for any given area should always be a goal. For the reasons mentioned in Section 2, in most scenarios, the authors directly linked the number of clusters (strata) to the number of available sample plots, i.e., k = N. Nevertheless, for data clustering algorithms, the number of clusters is often the most important parameter to set, having a significant impact on their performance [98,161,162]. For the same reasons, we only checked a few popular data clustering methods, whereas some research indicates attractive alternatives such as local pivotal methods (LPM) [162] or mixture models [163,164,165]. Moreover, only one unique and explicitly defined set of clustering variables was used in this study. On the other hand, Ref. [98] reported that the type of algorithm and the set of clustering variables can be of secondary importance. Therefore, these issues remain to be confirmed.
After strata definition is set, the subsequent procedure involves within-stratum plot selection. This includes both per-stratum sample size allocation and its deployment. Generally, in this regard, random sampling, regular sampling, and centroid sampling can be distinguished, with either proportional or equal sample distribution. Centre-based (centroid) within-stratum plot selection (used in this study) ensures unambiguous sample plot designation, which eliminates part of the error related to random effects, thus increasing the stability (Figure 4). However, not all possible combinations of clustering and in-cluster sampling variants were tested. In this regard, further detail could be explored. Namely, the cluster centre can be defined in various ways. In our study, it was the averaged point in multidimensional space built from scaled covariates with no weights. In reality, however, some predictors have a stronger effect on the target variable than others. Therefore, in future studies, different approaches to identifying cluster centres should also be considered. Perhaps this could help to better optimise sample deployment (favouring only the most important and non-autocorrelated predictors) and thus result in a sample reduced to an even smaller size.
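The combination described above (clustering in the space of scaled covariates, followed by selection of the unit closest to each unweighted cluster centre) can be sketched as follows. This is a simplified, pure-Python illustration and not the implementation used in the study: the naive Ward agglomeration is only practical for small inputs, and all function names and the two toy covariates are assumptions.

```python
import math

def scale(rows):
    """Standardise each covariate to zero mean and unit standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]

def centroid(points):
    """Unweighted mean point of a set of points."""
    return [sum(c) / len(c) for c in zip(*points)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ward_clusters(points, k):
    """Naive Ward agglomeration: repeatedly merge the two clusters whose
    union causes the smallest increase in within-cluster sum of squares."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            ca = centroid([points[i] for i in clusters[a]])
            for b in range(a + 1, len(clusters)):
                cb = centroid([points[i] for i in clusters[b]])
                na, nb = len(clusters[a]), len(clusters[b])
                cost = na * nb / (na + nb) * sq_dist(ca, cb)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def centre_plots(points, clusters):
    """For each cluster, pick the unit nearest to the unweighted centroid."""
    chosen = []
    for members in clusters:
        c = centroid([points[i] for i in members])
        chosen.append(min(members, key=lambda i: sq_dist(points[i], c)))
    return sorted(chosen)

# Hypothetical units described by two covariates (e.g., a height metric and
# a canopy cover metric); three plots are to be selected, so k = 3.
units = [[10, 0.30], [11, 0.32], [12, 0.34],
         [20, 0.60], [21, 0.62], [22, 0.64],
         [30, 0.90], [31, 0.92], [32, 0.94]]
scaled = scale(units)
clusters = ward_clusters(scaled, k=3)
selected = centre_plots(scaled, clusters)
print(selected)  # -> [1, 4, 7]
```

Because every step is deterministic, repeated runs on the same population return the same plots, which is the unambiguity referred to above. A weighted centre, as suggested for future work, would only require replacing `centroid` and `sq_dist` with weighted variants.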
Another point for discussion is that only one type of estimator, namely, random forest, was applied, with a fixed set of values for its hyperparameters in each scenario. Nevertheless, as mentioned in the methodology and demonstrated by other research, RF regression should not differ significantly from other types of estimators for this type of analysis and input data [145]. In fact, in some cases it can even outperform other modelling approaches, especially for GSV estimation [136]. Regarding the hyperparameters, numerous studies have reported their negligible influence on RF predictive power, provided that the number of trees is set to a sufficiently large quantity [156,157,158,159].
Certain caveats can be attributed to the different types of predictors used in this study. In addition to ALS metrics, dominant species and age information were used for both forest stratification and target-variable estimation. These attributes are recognised as relatively significant auxiliary predictors of biomass or GSV [166,167]. The absence of such auxiliary information could potentially yield slightly different results. Nonetheless, species and age do not appear to be very dynamic forest attributes in mostly managed Central European forests, and can now be accurately determined, for instance, using RS data classifications [168,169] or previous inventories. This constitutes another aspect related to data dimensionality reduction, as different types, quality, and quantity of available predictors and auxiliary variables can influence optimal inference and strata designation [170,171,172,173]. After dimensionality reduction in this study, the authors did not shuffle the fixed set of 11 input variables between simulations (although this was partially addressed by the use of the RF model, where some degree of randomness in variable selection is inherently embedded). In contrast, Ref. [98] showed that effective stratification can be achieved with as few as one to three clustering variables, and that the choice of clustering variables could have a secondary impact on clustering performance. Therefore, this issue still warrants further exploration. One should also expect even better performance from each investigated method if outlier handling had been performed, which was not the case in this study, in order to avoid arbitrary judgements. Surprisingly, HDBSCAN, which is claimed to handle noisy data [133,134,135], performed poorly in this regard (Figure 3 and Figure 4).
Nevertheless, in this study, relative differences between particular methods and general trends were more pertinent than their absolute effectiveness, which nonetheless yielded decent GSV estimates, as shown by the comparison to similar studies.
Some aspects related to spatial autocorrelation in forest environments were simplified in this study. The k-NN algorithm for forest replication was not fed with geographical data. This was due to the specific study design and the small effect of spatial autocorrelation, as expressed by Moran’s I index (Table 2), in the mostly managed forest districts used in this study, where even adjacent stands can differ drastically in structure. Moreover, as stated in the introduction, SGS is designed for sampling in the feature space rather than in a geographical sense. Furthermore, remote sensing data (as used in this study) are known to mitigate or explain the influence of the spatial structure of forests [30]. Additionally, SGS fosters an unambiguous sample distribution, meaning that this design is unaffected by a relocation of units within the population (the same units are selected regardless of their proximity). This appears to be a considerable advantage of structurally guided sampling, especially for populations with trends [162].
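For readers less familiar with the index, global Moran’s I can be sketched as below. The binary distance-band weights, the function name, and the toy coordinates and values are assumptions made for illustration; the weighting scheme behind Table 2 is not restated here and may differ.

```python
import math

def morans_i(values, coords, bandwidth):
    """Global Moran's I with binary weights: w_ij = 1 when units i and j
    lie within `bandwidth` of each other (i != j), otherwise 0."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = 0.0   # sum of w_ij * dev_i * dev_j
    s0 = 0.0      # sum of all weights
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if math.dist(coords[i], coords[j]) <= bandwidth:
                s0 += 1.0
                cross += dev[i] * dev[j]
    return (n / s0) * cross / sum(d * d for d in dev)

# Four hypothetical stands along a line, 1 distance unit apart.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
clustered = morans_i([1.0, 1.0, 5.0, 5.0], coords, bandwidth=1.0)
alternating = morans_i([1.0, 5.0, 1.0, 5.0], coords, bandwidth=1.0)
print(round(clustered, 3), round(alternating, 3))
```

Values near zero indicate little spatial autocorrelation, positive values indicate that similar stands tend to be neighbours, and negative values indicate that neighbouring stands tend to differ, as in the alternating toy case.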
Despite the limitations described above, the results obtained were generally consistent with the literature. Stratified sampling usually led to better precision and stability than simple random sampling (Figure 3 and Figure 4), provided that the designed strata adequately captured naturally occurring clusters in the population [42], or when the sampling intensity exceeded 50 plots, with a flattening effect observed at around 200–250 plots [79,174,175]. Further reduction in sampling intensity for model calibration should be possible, as Ref. [104] found that even 100 plots (or fewer) might suffice, provided the sample covers the population variability. Adaptive sampling appears to be an apt extension in this regard.
Another advantage of pre-stratification based on auxiliary or RS metrics is that data captured immediately before planned field measurements can provide valuable a priori knowledge about the sampling population, resulting in less biased and more precise estimates and/or less costly inventory campaigns [176]. Therefore, in the presented sampling approaches, adequate and objective strata delineation, as well as better optimised plot allocation (than in post-stratification) [24], could, for instance, be achieved using one of the analysed data clustering algorithms with explicit in-cluster sampling. Their implementation is both unambiguous and flexible, meaning that the procedure behind sample selection is clearly defined, but the exact realisation depends heavily on the given population variance, as indicated by RS or other metrics known in advance.
The choice of a specific sampling method depends primarily on the study goal, available resources, allowable errors, and the characteristics of the target population [33]. Each sampling method has its advantages and disadvantages. Consequently, scientists often combine their features to develop more efficient, universal, and reliable solutions [177]. This can lead to the creation of sophisticated and complex systems, which may, however, face difficulties in gaining acceptance among practitioners [178,179,180]. It is generally preferable to understand the process before implementation. Nevertheless, the high automation capabilities of contemporary digital systems can often relieve users from intricate pipeline details, provided that the “customers” can verify the reliability of the final outcome (estimates) in a manner they trust, such as through control surveys, reference methods, previous assessments, or expertise. This, in turn, can motivate the development and implementation of more comprehensive and auto-optimised methods.
Hierarchical clustering (H-CLUST), using Ward’s agglomeration algorithm, outperformed the other data clustering methods analysed in this study, as well as the two control methods: random sampling and generic models. K-means produced similar estimates (Figure 3), but was less stable, particularly at lower sampling intensity levels (Figure 4). In contrast, generic models were the most stable approach but returned the least accurate and most biased estimates. Similar findings were reported in Ref. [83], which stated that incorporating sample plots from the target populations improved predictions and reduced systematic error. Therefore, it is always advisable to secure a set of in situ sample plots, either for local model development, external model parameter calibration, or at least for control purposes and assessment of estimation errors or variance. Based on the results obtained, the H-CLUST method is preferable, especially when fewer second-phase sample plots are available.
Yet, in this study, the authors did not compare the analysed methods with systematic sampling, which is essentially the most common sampling method applied in practical forestry [16,19,20,21,22,23,24]. The reason for this omission was that it would require a completely different approach, making an already complex methodology even more convoluted. Comparison and evaluation of SYS performance would require a complete reference survey of the entire forest district, which is practically impossible for large-area inventories. On the other hand, contemporary machine learning algorithms, geostatistical interpolation methods, and more recently AI models are capable of generating a spatially continuous population (often with RS data) from which one can draw samples [181,182,183,184,185,186,187,188]. It should also be noted that algorithms for digitally generated forests have their own hyperparameters to set and can be sensitive to input data. Although such reconstructions tend to be merely smoother versions of the true population, some research has indicated that SGS methods are a good alternative [104] or, more recently, even superior to systematic sampling [102]. Our results showed that for model-based inference, SGS sampling can be competitive with SYS [145] in two-phase ALS-enhanced forest inventory, using fewer sample plots for model calibration. Nevertheless, the rapid development of numerical deep learning, neural networks, and generative AI methods, along with an appropriate research approach based on such synthetic data, could further help to explore the following aspects of forest inventory optimisation: (1) Which sampling method is most adequate for the forest inventory of a given area, for a desired performance level and with the available resources? (2) Is it always the same method in the context of forest type and its heterogeneity? (3) What is the extent of the differences in efficiency and reliability between SRS, SYS, STRS, and SGS?
These questions seem to highlight the right direction for further research in the field of forest inventory sampling optimisation. They will be addressed in another study.

5. Conclusions

The following conclusions can be drawn from the results of this study. RMSE values were highly correlated with stand complexity. Structurally more-complex stands benefit more from increased sample size than homogeneous stands, in terms of decreased GSV estimation error. Based on the results from synthetic forests, even 100 sample plots may be sufficient to achieve unbiased GSV estimates for the means (±1%), using structurally guided sampling in a two-phase inventory design with the ALS random forest estimator. However, real-world complexity would likely require larger samples. Nevertheless, a major strength of ALS-enhanced forest inventories lies in their capability for spatially continuous estimation of forest attributes for small areas and individual stands. If this aspect of forest inventory is of interest, the sampling intensity saturation appears to occur at about 200–300 plots. This observation is also generally consistent with the reports from similar studies discussed in the previous sections. Another conclusion is that the lack of in situ forest inventory data can lead to biased and less-accurate GSV estimates. Hierarchical data clustering with Ward’s agglomeration algorithm outperformed the other data clustering methods investigated in this study. The differences were not prominent, but became more pronounced when more sample plots were used for model calibration. Among the clustering methods compared, Ward’s criterion was also the most stable in GSV predictions. Further research in the field of structurally guided sampling for forest stock inventory optimisation should, among other things, focus on in-cluster sampling, cluster centre identification, the influence of clustering variables, and comparison with other SGS algorithms, such as the cube or local pivotal methods. Moreover, a comparison with the most common sampling design in forestry, i.e., systematic sampling, should be made for different areas and at different sampling intensity levels.

Author Contributions

Conceptualization, M.L.; methodology, M.L.; validation, M.L., K.S. and T.H.; formal analysis, M.L.; investigation, M.L.; resources, M.L. and K.S.; data curation, M.L. and K.S.; writing—original draft preparation, M.L.; writing—review and editing, M.L., K.S. and T.H.; visualisation, M.L.; supervision, M.L., K.S. and T.H.; project administration, K.S.; funding acquisition, K.S. All authors have read and agreed to the published version of the manuscript.

Funding

Data for this research was funded by (1) the National Centre for Research and Development in Poland under the BIOSTRATEG programme (grant agreement number BIOSTRATEG1/267755/4/NCBR/2015), project REMBIOFOR ‘Remote sensing-based assessment of woody biomass and carbon storage in forests’, and (2) the State Forests National Forest Holding, project no. EO.271.2.12.2019 ‘Extension of the forest management inventory method using the results of the REMBIOFOR project’ (int. no. 500463).

Data Availability Statement

Data are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Næsset, E. Geographical information systems in long-term forest management and planning with special reference to preservation of biological diversity: A review. For. Ecol. Manag. 1997, 93, 121–136.
  2. Hurteau, M.D. The role of forests in the carbon cycle and in climate change. In Climate Change, 3rd ed.; Letcher, T.M., Ed.; Elsevier: Amsterdam, The Netherlands, 2021; pp. 561–579.
  3. Hochmalová, M.; Purwestri, R.C.; Yongfeng, J.; Jarský, V.; Riedl, M.; Yuanyong, D.; Hájek, M. Demand for forest ecosystem services: A comparison study in selected areas in the Czech Republic and China. Eur. J. For. Res. 2022, 141, 867–886.
  4. Hua, F.; Bruijnzeel, L.A.; Meli, P.; Martin, P.A.; Zhang, J.; Nakagawa, S.; Miao, X.; Wang, W.; McEvoy, C.; Peña-Arancibia, J.L.; et al. The biodiversity and ecosystem service contributions and trade-offs of forest restoration approaches. Science 2022, 376, 839–844.
  5. Banaś, J.; Janeczko, E.; Zięba, S.; Utnik-Banaś, K.; Janeczko, K. Which forest type do visitors find most attractive? Integrating management activities with the recreational attractiveness of forests at a landscape level. Landsc. Urban Plan. 2025, 259, 105367.
  6. Daryaei, A.; Trailovic, Z.; Sohrabi, H.; Atzberger, C.; Hochbichler, E.; Immitzer, M. Optimal integration of forest inventory data and aerial image-based canopy height models for forest stand management. For. Ecosyst. 2025, 13, 100299.
  7. Mulverhill, C.; Coops, N.C.; White, J.C.; Tompalski, P.; Achim, A. Evaluating the potential for continuous update of enhanced forest inventory attributes using optical satellite data. Forestry 2025, 98, 253–265.
  8. Gibson, L.; Lynam, A.; Bradshaw, C.; He, F.; Bickford, D.; Woodruff, D.; Bumrungsri, S.; Laurance, W. Near-Complete Extinction of Native Small Mammal Fauna 25 Years After Forest Fragmentation. Science 2013, 341, 1508–1510.
  9. Haddad, N.; Brudvig, L.; Clobert, J.; Davies, K.; Gonzalez, A.; Holt, R.; Lovejoy, T.; Sexton, J.; Austin, M.; Collins, C.; et al. Habitat fragmentation and its lasting impact on Earth ecosystems. Sci. Adv. 2015, 1, e1500052.
  10. FAO; UNEP. The State of the World’s Forests 2020. Forests, Biodiversity and People; Food and Agriculture Organization of the United Nations & United Nations Environment Programme: Rome, Italy, 2020.
  11. UNFCCC. United Nations Framework Convention on Climate Change; Secretariat of the United Nations Framework Convention on Climate Change: Bonn, Germany, 1992; p. 24.
  12. Mohren, G.M.J.; Hasenauer, H.; Köhl, M.; Nabuurs, G.-J. Forest inventories for carbon change assessments. Curr. Opin. Environ. Sustain. 2012, 4, 686–695.
  13. IPCC. Climate Change and Land: An IPCC Special Report on Climate Change, Desertification, Land Degradation, Sustainable Land Management, Food Security, and Greenhouse Gas Fluxes in Terrestrial Ecosystems; Shukla, P.R., Skea, J., Calvo Buendia, E., Masson-Delmotte, V., Pörtner, H.-O., Roberts, D.C., Zhai, P., Slade, R., Connors, S., van Diemen, R., et al., Eds.; IPCC: Geneva, Switzerland, 2019.
  14. Sandker, M.; Carrillo, O.; Leng, C.; Lee, D.; d’Annunzio, R.; Fox, J. The Importance of High–Quality Data for REDD+ Monitoring and Reporting. Forests 2021, 12, 99.
  15. Shannon, E.S.; Coulston, J.W.; Domke, G.M.; Finley, A.O.; Green, E.J.; Stovall, A.E.L.; Woodall, C.W. Leveraging National Forest Inventory Data to Estimate Forest Carbon Density Status and Trends for Small Areas. Environ. Res. Lett. 2025, 20, 104001.
  16. Kleinn, C. New technologies and methodologies for national forest inventories. Forstwiss. Cent. 2002, 53, 10–15.
  17. Kangas, A.; Astrup, R.; Breidenbach, J.; Fridman, J.; Gobakken, T.; Korhonen, K.T.; Maltamo, M.; Nilsson, M.; Nord-Larsen, T.; Næsset, E.; et al. Remote sensing and forest inventories in Nordic countries–roadmap for the future. Scand. J. For. Res. 2018, 33, 397–412.
  18. Luigui, A.; Renaud, J.-P.; Vega, C. How reliable are remote sensing maps calibrated over large areas? A matter of scale? arXiv 2024, arXiv:2408.03953.
  19. Freese, F. Elementary Forest Sampling; U.S. Department of Agriculture, Forest Service, Handbook No. 232; Southern Forest Experiment Station: New Orleans, LA, USA, 1962.
  20. Chojnacky, D.C. Double Sampling for Stratification: A Forest Inventory Application in the Interior West; USDA Forest Service Research Paper RMRS-RP-7; U.S. Department of Agriculture, Forest Service, Rocky Mountain Research Station: Fort Collins, CO, USA, 1998.
  21. Junttila, V.; Finley, A.O.; Bradford, J.B.; Kauranne, T. Strategies for minimizing sample size for use in airborne LiDAR-Based forest inventory. For. Ecol. Manag. 2013, 292, 75–85.
  22. Mello, J.; Scolforo, H.; Raimundo, M.; Scolforo, J.; Oliveira, A.; Ferraz Filho, A. Estimating precision of systematic sampling in forest inventories. Ciênc. Agrotec. 2015, 39, 15–22.
  23. Magnussen, S.; McRoberts, R.E.; Breidenbach, J.; Nord-Larsen, T.; Ståhl, G.; Fehrmann, L.; Schnell, S. Comparison of estimators of variance for forest inventories with systematic sampling-results from artificial populations. For. Ecosyst. 2020, 7, 17.
  24. Finley, A.O.; Doser, J.W. Introduction to Forestry Data Analysis with R; Chapman & Hall/CRC: Boca Raton, FL, USA, 2025.
  25. West, P. Tree and Forest Measurement; Springer: Berlin/Heidelberg, Germany, 2004.
  26. Herries, D. Forest Inventory Sampling Designs for Plot/Sample Locations. Interpine Blog. 26 March 2014. Available online: https://interpine.nz/forest-inventory-sampling-designs-for-plotsample-locations/ (accessed on 3 August 2025).
  27. Räty, M.; Kuronen, M.; Myllymäki, M.; Kangas, A.; Mäkisara, K.; Heikkinen, J. Comparison of the local pivotal method and systematic sampling for national forest inventories. For. Ecosyst. 2020, 7, 54.
  28. Thompson, W.L.; White, G.C.; Gowan, C. Sampling Designs and Related Topics. In Monitoring Vertebrate Populations; Thompson, W.L., White, G.C., Gowan, C., Eds.; Academic Press: San Diego, CA, USA, 1998; pp. 43–73.
  29. Mostafa, S.; Ahmad, I. Recent Developments in Systematic Sampling: A Review. J. Stat. Theory Pract. 2017, 12, 290–310.
  30. Babcock, C.; Finley, A.O.; Gregoire, T.G.; Andersen, H.-E. Remote sensing to reduce the effects of spatial autocorrelation on design-based inference for forest inventory using systematic samples. arXiv 2018, arXiv:1810.08588.
  31. Griffith, D.A.; Plant, R.E. Statistical Analysis in the Presence of Spatial Autocorrelation: Selected Sampling Strategy Effects. Stats 2022, 5, 1334–1353.
  32. Thomas, L. Systematic Sampling: A Step-by-Step Guide with Examples. Scribbr. 2023. Available online: https://www.scribbr.com/methodology/systematic-sampling/ (accessed on 15 August 2025).
  33. Ahmed, S.K. How to Choose a Sampling Technique and Determine Sample Size for Research: A simplified guide for researchers. Oral Oncol. Rep. 2024, 12, 100662.
  34. Jayaraman, K. Statistical Manual for Forestry Research; Food and Agriculture Organization of the United Nations, Regional Office for Asia and the Pacific: Bangkok, Thailand, 1999.
  35. BDL-Bank Danych o Lasach. Instrukcja Wykonywania Wielkoobszarowej Inwentaryzacji Stanu Lasu; PGL Lasy Państwowe: Warsaw, Poland, 2014. Available online: https://www.bdl.lasy.gov.pl/portal/Media/Default/Publikacje/Instrukcja%20WISL_2015.pdf (accessed on 16 August 2025).
  36. SLU—Swedish University of Agricultural Sciences, Department of Forest Research Management. About NFI. 2025. Available online: https://www.slu.se/en/about-slu/organisation/departments/forest-resource-management/miljoanalys/nfi/about-nfi/inventory-design (accessed on 4 September 2025).
  37. Mehtätalo, L.; Räty, M.; Mehtätalo, J. A new growth curve and fit to the National Forest Inventory data of Finland. Ecol. Model. 2025, 501, 111006.
  38. Bindewald, A.; Miocic, S.; Wedler, A.; Bauhus, J. Forest inventory-based assessments of the invasion risk of Pseudotsuga menziesii (Mirb.) Franco and Quercus rubra L. in Germany. Eur. J. For. Res. 2021, 140, 883–899.
  39. Simons, N.K.; Felipe-Lucia, M.R.; Schall, P.; Ammer, C.; Bauhus, J.; Blüthgen, N.; Boch, S.; Buscot, F.; Fischer, M.; Goldmann, K.; et al. National Forest Inventories capture the multifunctionality of managed forests in Germany. For. Ecosyst. 2021, 8, 5.
  40. Bechtold, W.A.; Scott, C.T. The Forest Inventory and Analysis Plot Design. In The Enhanced Forest Inventory and Analysis Program—National Sampling Design and Estimation Procedures; Bechtold, W.A., Patterson, P.L., Eds.; U.S. Department of Agriculture, Forest Service, Southern Research Station: Asheville, NC, USA, 2005; pp. 27–42.
  41. Bontemps, J.-D.; Bouriaud, O. Take five: About the beat and the bar of annual and 5-year periodic national forest inventories. Ann. For. Sci. 2024, 81, 53.
  42. Cochran, W.G. Sampling Techniques, 3rd ed.; John Wiley & Sons: New York, NY, USA, 1977.
  43. Qian, J. Sampling. In International Encyclopedia of Education, 3rd ed.; Peterson, P., Baker, E., McGaw, B., Eds.; Elsevier: Amsterdam, The Netherlands, 2010; pp. 390–395.
  44. Räty, M.; Heikkinen, J.; Korhonen, K.T.; Peräsaari, J.; Ihalainen, A.; Pitkänen, J.; Kangas, A.S. Effect of cluster configuration and auxiliary variables on the efficiency of local pivotal method for national forest inventory. Scand. J. For. Res. 2019, 34, 607–616.
  45. Lister, A.J.; Leites, L.P. Cost implications of cluster plot design choices for precise estimation of forest attributes in landscapes and forests of varying heterogeneity. Can. J. For. Res. 2022, 52, 188–200.
  46. Tokola, T.; Shrestha, S.M. Comparison of cluster-sampling techniques for forest inventory in southern Nepal. For. Ecol. Manag. 1999, 116, 219–231.
  47. Grafström, A.; Zhao, X.; Nylander, M.; Petersson, H. A new sampling strategy for forest inventories applied to the temporary clusters of the Swedish national forest inventory. Can. J. For. Res. 2017, 47, 1161–1167.
  48. Coulston, J. Forest Inventory and Stratified Estimation: A Cautionary Note; Res. Note SRS-16; U.S. Department of Agriculture, Forest Service, Southeastern Forest Experiment Station: Asheville, NC, USA, 2008.
  49. OpenGenus IQ. Cluster Sampling. OpenGenus IQ. Available online: https://iq.opengenus.org/cluster-sampling/ (accessed on 22 August 2025).
  50. Lv, T.; Zhou, X.; Tao, Z.; Sun, X.; Wang, J.; Li, R.; Xie, F. Remote Sensing-Guided Spatial Sampling Strategy over Heterogeneous Surface Ground for Validation of Vegetation Indices Products with Medium and High Spatial Resolution. Remote Sens. 2021, 13, 2674.
  51. Yan, Z.; Ma, L.; Wang, X.; Kim, Y.; Zhang, L. High-Precision population estimates by remote sensing big data and advanced transformer deep learning model. Remote Sens. Appl. Soc. Environ. 2025, 39, 101638.
  52. Ene, L.T.; White, J.C.; Tompalski, P.; Maltamo, M.; Heiskanen, J.; Saarela, S.-R.; Packalen, P.; Kangas, A.; Tomppo, E. Simulation-Based assessment of sampling strategies for large-area biomass estimation using airborne laser scanning. Remote Sens. 2016, 8, 1.
  53. Gobakken, T.; Næsset, E.; Nelson, R.; Bollandsås, O.M.; Gregoire, T.G.; Ståhl, G.; Holm, S.; Ørka, H.O.; Astrup, R. Estimating biomass in Hedmark County, Norway using national forest inventory field plots and airborne laser scanning. Remote Sens. Environ. 2012, 123, 443–456.
  54. Papa, D.d.A.; Almeida, D.R.A.; Silva, C.A.; Figueiredo, E.O.; Stark, S.C.; Valbuena, R.; Rodriguez, L.C.E.; Oliveira, M.V.N. Evaluating tropical forest classification and field sampling stratification from lidar to reduce effort and enable landscape monitoring. For. Ecol. Manag. 2020, 457, 117634.
  55. Lisańczuk, M.; Mitelsztedt, K.; Parkitna, K.; Krok, G.; Stereńczak, K.; Wysocka-Fijorek, E.; Miścicki, S. Influence of sampling intensity on performance of two-phase forest inventory using airborne laser scanning. For. Ecosyst. 2020, 7, 65.
  56. Silva, V.S.d.; Silva, C.A.; Mohan, M.; Cardil, A.; Rex, F.E.; Loureiro, G.H.; Almeida, D.R.A.d.; Broadbent, E.N.; Gorgens, E.B.; Dalla Corte, A.P.; et al. Combined Impact of Sample Size and Modeling Approaches for Predicting Stem Volume in Eucalyptus spp. Forest Plantations Using Field and LiDAR Data. Remote Sens. 2020, 12, 1438.
  57. Corona, P.; Fattorini, L.; Pagliarella, M.C. Sampling strategies for estimating forest cover from remote sensing-based two-stage inventories. For. Ecosyst. 2015, 2, 18.
  58. Dupuis, C.; Lejeune, P.; Michez, A.; Fayolle, A. How Can Remote Sensing Help Monitor Tropical Moist Forest Degradation?—A Systematic Review. Remote Sens. 2020, 12, 1087.
  59. Köhl, M.; Magnussen, S.; Marchetti, M. Sampling Methods, Remote Sensing and GIS Multiresource Forest Inventory; Springer: Berlin/Heidelberg, Germany, 2006.
  60. Næsset, E. Predicting forest stand characteristics with airborne scanning laser using a practical two-stage procedure and field data. Remote Sens. Environ. 2002, 80, 88–99.
  61. Bergseng, E.; Ørka, H.O.; Næsset, E.; Gobakken, T.; Ståhl, G.; Gregoire, T.; Solberg, B. Assessing forest inventory information obtained from different inventory approaches and remote sensing data sources. Ann. For. Sci. 2015, 72, 33–45.
  62. Chirici, G.; McRoberts, R.E.; Fattorini, L.; Mura, M.; Marchetti, M. Comparing echo-based and canopy height model-based metrics for enhancing estimation of forest aboveground biomass in a model-assisted framework. Remote Sens. Environ. 2016, 174, 1–9.
  63. Dettmann, G.T.; Radtke, P.J.; Coulston, J.W.; Green, P.C.; Wilson, B.T.; Moisen, G.G. Review and Synthesis of Estimation Strategies to Meet Small Area Needs in Forest Inventory. Front. For. Glob. Change 2022, 5, 813569.
  64. Hyyppä, J.; Yu, X.; Hyyppä, H.; Vastaranta, M.; Holopainen, M.; Kukko, A.; Kaartinen, H.; Jaakkola, A.; Vaaja, M.; Koskinen, J.; et al. Advances in Forest Inventory Using Airborne Laser Scanning. Remote Sens. 2012, 4, 1190–1207.
  65. Maltamo, M.; Packalén, P. Species specific management inventory in Finland. In Forestry Applications of Airborne Laser Scanning–Concepts and Case Studies; Maltamo, M., Næsset, E., Vauhkonen, J., Eds.; Managing Forest Ecosystems; Springer: Dordrecht, The Netherlands, 2014; Volume 27, pp. 241–252.
  66. White, J.; Coops, N.; Wulder, M.; Vastaranta, M.; Hilker, T.; Tompalski, P. Remote Sensing Technologies for Enhancing Forest Inventories: A Review. Can. J. Remote Sens. 2016, 42, 619–641.
  67. White, J.; Tompalski, P.; Vastaranta, M.; Wulder, M.; Saarinen, N.; Stepper, C.; Coops, N. A Model Development and Application Guide for Generating an Enhanced Forest Inventory Using Airborne Laser Scanning Data and an Area-Based Approach, Canadian Forest Service, Canadian Wood Fibre Centre, Information, Report FI-X-018. 2017. Available online: https://publications.gc.ca/collections/collection_2018/rncan-nrcan/Fo148-1-18-eng.pdf (accessed on 15 September 2025).
  68. White, J.; Penner, M.; Woods, M. Assessing single photon LiDAR for operational implementation of an enhanced forest inventory in diverse mixedwood forests. For. Chron. 2021, 97, 78–96.
  69. Fassnacht, F.E.; White, J.C.; Wulder, M.A.; Næsset, E. Remote sensing in forestry: Current challenges, considerations and directions. Forestry 2023, 97, 11–25. [Google Scholar] [CrossRef]
  70. IUL. Instrukcja Urządzania Lasu; Państwowe Gospodarstwo Leśne Lasy Państwowe: Warszawa, Poland, 2024. Available online: https://www.lasy.gov.pl/pl/publikacje/copy_of_gospodarka-lesna/urzadzanie/iul/instrukcja-urzadzenia-lasu-2024/instrukcja-urzadzania-lasu-czesc-i.pdf/view (accessed on 8 September 2025).
  71. Næsset, E.; Bjerknes, K.-O. Estimating tree heights and number of stems in young forests using airborne laser scanner data. Remote Sens. Environ. 2001, 78, 328–340. [Google Scholar] [CrossRef]
  72. Næsset, E.; Gobakken, T.; Holmgren, J.; Hyyppä, H.; Hyyppä, J.; Maltamo, M.; Nilsson, M.; Olsson, H.; Persson, Å.; Derman, U. Laser scanning of forest resources: The Nordic experience. Scand. J. For. Res. 2004, 18, 482–499. [Google Scholar] [CrossRef]
  73. Stauffer, H.B. A Sample Size Table for Forest Sampling. For. Sci. 1982, 28, 777–784. [Google Scholar] [CrossRef]
  74. Musa, S.; Kassim, A.R.; Yusoff, S.M.; Ibrahim, S. Assessing the status of logged-over production forests: The development of a rapid appraisal technique. In Information and Analysis for Sustainable Forest Management: Linking National and International Efforts in South and Southeast Asia; EC-FAO Partnership Programme 2000–2002, Tropical Forestry Budget Line B7-6201/1B/98/0531, Project GCP/RAS/173/EC; FAO: Bangkok, Thailand, 2003; Available online: https://www.fao.org/4/ac838e/AC838E12.htm#7818 (accessed on 10 September 2025).
  75. Reams, G.; Smith, W.D.; Hansen, M.H.; Bechtold, W.A.; Roesch, F.A.; Moisen, G.G. The Forest Inventory and Analysis Sampling Frame. In The Enhanced Forest Inventory and Analysis Program—National Sampling Design and Estimation Procedures; Gen. Tech. Rep. SRS-80; USDA Forest Service: Asheville, NC, USA, 2005; pp. 11–26. [Google Scholar]
  76. Avery, T.E.; Burkhart, H.E. Forest Measurements, 6th ed.; Waveland Press: Long Grove, IL, USA, 2015. [Google Scholar]
  77. Sanjerehei Mousaei, M. Sample Size Calculations for Vegetation Studies. Maced. J. Ecol. Environ. 2021, 23, 85–97. [Google Scholar] [CrossRef]
  78. Wulder, M.; White, J.; Nelson, R.; Næsset, E.; Ørka, H.; Coops, N.; Hilker, T.; Bater, C.; Gobakken, T. LiDAR sampling for large-area forest characterization: A review. Remote Sens. Environ. 2012, 121, 196–209. [Google Scholar] [CrossRef]
  79. Grafström, A.; Saarela, S.; Ene, L.T. Efficient sampling strategies for forest inventories by spreading the sample in auxiliary space. Can. J. For. Res. 2014, 44, 1156–1164. [Google Scholar] [CrossRef]
  80. Li, C.; Yu, Z.; Dai, H.; Zhou, X.; Zhou, M. Effect of sample size on the estimation of forest inventory attributes using airborne LiDAR data in large-scale subtropical areas. Ann. For. Sci. 2023, 80, 40. [Google Scholar] [CrossRef]
  81. Kleinn, C.; Fehrmann, L. Basic forest statistics–Accuracy, precision and bias [Unpublished presentation]. In Proceedings of the Regional Course on REDD+, MRV and Monitoring, Sokoine University of Agriculture, Morogoro, Tanzania, 11–15 July 2011; UN-REDD Programme: Geneva, Switzerland, 2011. [Google Scholar]
  82. Strimbu, B.M. Comparing the efficiency of intensity-based forest inventories with sampling-error-based forest inventories. Forestry 2014, 87, 249–255. [Google Scholar] [CrossRef]
  83. Latifi, H.; Koch, B. Evaluation of most similar neighbour and random forest methods for imputing forest inventory variables using data from target and auxiliary stands. Int. J. Remote Sens. 2012, 33, 6668–6694. [Google Scholar] [CrossRef]
  84. Stereńczak, K.; Lisańczuk, M.; Parkitna, K.; Mitelsztedt, K.; Mroczek, P.; Miścicki, S. The influence of number and size of sample plots on modelling growing stock volume based on airborne laser scanning. Drewno 2018, 61, 5–22. [Google Scholar] [CrossRef]
  85. Garrido de Lera, A.; Gobakken, T.; Ørka, H.; Næsset, E.; Bollandsås, O. Estimating forest attributes in airborne laser scanning based inventory using calibrated predictions from external models. Silva Fenn. 2022, 56, 10695. [Google Scholar] [CrossRef]
  86. Bhattacherjee, A. 8.2: Probability sampling. In Social Science Research: Principles, Methods, and Practices; LibreTexts: Davis, CA, USA, 2012. [Google Scholar]
  87. Natural Resources Conservation Service (NRCS). Sampling Vegetation Attributes; Field Guidance; NRCS: Washington, DC, USA, 2022.
  88. Hahn, J.T.; MacLean, C.D.; Arner, S.L.; Bechtold, W.A. Procedures to handle inventory cluster plots that straddle two or more conditions. For. Sci. Monogr. 1995, 31, 12–25. [Google Scholar] [CrossRef]
  89. Yim, J.S.; Shin, M.-Y.; Son, Y.; Kleinn, C. Cluster plot optimization for a large area forest resource inventory in Korea. For. Sci. Technol. 2015, 11, 139–146. [Google Scholar] [CrossRef]
  90. Quon, C.; Lam, T.Y.; Lin, H.-T. Designing Cluster Plots for Sampling Local Plant Species Composition for Biodiversity Management. For. Syst. 2020, 29, e002. [Google Scholar] [CrossRef]
  91. Xu, Q.; Ståhl, G.; McRoberts, R.; Li, B.; Tokola, T.; Hou, Z. Generalizing systematic adaptive cluster sampling for forest ecosystem inventory. For. Ecol. Manag. 2021, 489, 119051. [Google Scholar] [CrossRef]
  92. Nazariani, N.; Fallah, A.; Ramezani, H.; Naghavi, H.; Jalilvand, H. Assessing the Optimum Cluster Sampling Plan for Estimating the Quantitative Characteristics of Zagros Forests (Case Study: Watershed Olad Ghobad Forests). Iran. J. For. 2022, 14, 37–48. [Google Scholar] [CrossRef]
  93. Ramezani, H.; Lister, A. Effects of cluster plot design parameters on landscape fragmentation estimates: A case study using data from the Swedish national forest inventory. Appl. Geogr. 2023, 159, 103118. [Google Scholar] [CrossRef]
  94. Luo, S.; Xu, L.; Yu, J.; Zhou, W.; Yang, Z.; Wang, S.; Guo, C.; Gao, Y.; Xiao, J.; Shu, Q. Sampling Estimation and Optimization of Typical Forest Biomass Based on Sequential Gaussian Conditional Simulation. Forests 2023, 14, 1792. [Google Scholar] [CrossRef]
  95. Kumar, J.; Mills, R.T.; Hoffman, F.M.; Hargrove, W.W. Parallel k-Means Clustering for Quantitative Ecoregion Delineation Using Large Data Sets. Procedia Comput. Sci. 2011, 4, 1602–1611. [Google Scholar] [CrossRef]
  96. Melville, G.; Stone, C. Optimising nearest neighbour information—A simple, efficient sampling strategy for forestry plot imputation using remotely sensed data. Aust. For. 2016, 79, 217–228. [Google Scholar] [CrossRef]
  97. Abdullahi Sahra, M.; Schardt, M.; Pretzsch, H. An unsupervised two-stage clustering approach for forest structure classification based on X-band InSAR data—A case study in complex temperate forest stands. Int. J. Appl. Earth Obs. Geoinf. 2017, 57, 36–48. [Google Scholar] [CrossRef]
  98. Georgakis, A.; Gatziolis, D.; Stamatellos, G. A Primer on Clustering of Forest Management Units for Reliable Design-Based Direct Estimates and Model-Based Small Area Estimation. Forests 2023, 14, 1994. [Google Scholar] [CrossRef]
  99. Xu, M.; Han, X.; Zhang, J.; Huang, K.; Peng, M.; Qiu, B.; Yang, K. Integrating Ward’s Clustering Stratification and Spatially Correlated Poisson Disk Sampling to Enhance the Accuracy of Forest Aboveground Carbon Stock Estimation. Forests 2024, 15, 2111. [Google Scholar] [CrossRef]
  100. Maniatis, D.; Mollicone, D. Options for sampling and stratification for national forest inventories to implement REDD+ under the UNFCCC. Carbon Balance Manag. 2010, 5, 9. [Google Scholar] [CrossRef] [PubMed]
  101. Hetzer, J.; Huth, A.; Wiegand, T.; Dobner, H.J.; Fischer, R. An analysis of forest biomass sampling strategies across scales. Biogeosciences 2020, 17, 1673–1683. [Google Scholar] [CrossRef]
  102. Heikkinen, J.; Henttonen, H.; Katila, M.; Tuominen, S. Stratified, Spatially Balanced Cluster Sampling for Cost-Efficient Environmental Surveys. Environmetrics 2025, 36, e70019. [Google Scholar] [CrossRef]
  103. Goodbody, T.R.H.; Coops, N.C.; Queinnec, M.; White, J.C.; Tompalski, P.; Hudak, A.T.; Auty, D.; Valbuena, R.; LeBoeuf, A.; Sinclair, I.; et al. sgsR: A structurally guided sampling toolbox for LiDAR-Based forest inventories. Forestry 2023, 96, 411–424. [Google Scholar] [CrossRef]
  104. Maltamo, M.; Bollandsås, O.; Næsset, E.; Gobakken, T.; Packalén, P. Different plot selection strategies for field training data in ALS-assisted forest inventory. For. Int. J. For. Res. 2011, 84, 23–31. [Google Scholar] [CrossRef]
  105. Lindgren, N.; Christensen, P.; Nilsson, B.; Åkerholm, M.; Allard, A.; Reese, H.; Olsson, H. Using Optical Satellite Data and Airborne Lidar Data for a Nationwide Sampling Survey. Remote Sens. 2015, 7, 4253–4267. [Google Scholar] [CrossRef]
  106. Pagliarella, M.C.; Sallustio, L.; Capobianco, G.; Conte, E.; Corona, P.; Fattorini, L.; Marchetti, M. From one- to two-phase sampling to reduce costs of remote sensing-based estimation of land-cover and land-use proportions and their changes. Remote Sens. Environ. 2016, 184, 410–417. [Google Scholar] [CrossRef]
  107. Luther, J.E.; Fournier, R.A.; van Lier, O.R.; Bujold, M. Extending ALS-Based Mapping of Forest Attributes with Medium Resolution Satellite and Environmental Data. Remote Sens. 2019, 11, 1092. [Google Scholar] [CrossRef]
  108. Georgakis, A. Stratification of Forest Stands as a Basis for Small Area Estimations, Proceedings of the 33rd Panhellenic Statistics Conference (2021), pp. 233–247. 2022. Available online: https://www.researchgate.net/publication/361391008_Stratification_of_Forest_Stands_as_a_Basis_for_Small_Area_Estimations (accessed on 14 September 2025).
  109. Melville, G.; Stone, C.; Turner, R. Application of LiDAR data to maximise the efficiency of inventory plots in softwood plantations. N. Z. J. For. Sci. 2015, 45, 9. [Google Scholar] [CrossRef]
  110. Queinnec, M.; Coops, N.C.; White, J.C.; McCartney, G.; Sinclair, I. Developing a forest inventory approach using airborne single photon lidar data: From ground plot selection to forest attribute prediction. Forestry 2022, 95, 347–362. [Google Scholar] [CrossRef]
  111. Hawbaker, T.; Keuler, N.; Lesak, A.; Gobakken, T.; Contrucci, K.; Radeloff, V. Improved estimates of forest vegetation structure and biomass with a LiDAR-Optimized sampling design. J. Geophys. Res. Biogeosci. 2009, 114, G00E03. [Google Scholar] [CrossRef]
  112. Deville, J.-C.; Tillé, Y. Efficient balanced sampling: The cube method. Biometrika 2004, 91, 893–912. [Google Scholar] [CrossRef]
  113. Haron, N. Stratified sampling using cluster analysis. AIP Conf. Proc. 2022, 2472, 050012. [Google Scholar] [CrossRef]
  114. Byrd, J. Data Clustering: Intro, Methods, Applications. Encord Blog. 8 November 2023. Available online: https://encord.com/blog/data-clustering-intro-methods-applications/ (accessed on 14 September 2025).
  115. Abbas, O.A. Comparisons Between Data Clustering Algorithms. Int. Arab J. Inf. Technol. 2008, 5, 320–325. [Google Scholar]
  116. Ezugwu, A.E.; Ikotun, A.M.; Oyelade, O.O.; Abualigah, L.; Agushaka, J.O.; Eke, C.I.; Akinyelu, A.A. A Comprehensive Survey of Clustering Algorithms: State-of-the-Art Machine Learning Applications, Taxonomy, Challenges, and Future Research Prospects. Eng. Appl. Artif. Intell. 2022, 110, 104743. [Google Scholar] [CrossRef]
  117. Rodriguez, M.; Comin, C.; Casanova, D.; Bruno, O.M.; Amancio, D.; Rodrigues, F.; da F. Costa, L. Clustering Algorithms: A Comparative Approach. arXiv 2016, arXiv:1612.08388. [Google Scholar] [CrossRef]
  118. Jayashree; Shivaprakash, T. Optimal Value for Number of Clusters in a Dataset for Clustering Algorithm. In Advances in Intelligent Systems and Computing; Springer: Cham, Switzerland, 2022; pp. 1–10. [Google Scholar] [CrossRef]
  119. Andreopoulos, B.; An, A.; Wang, X.; Schroeder, M. A roadmap of clustering algorithms: Finding a match for a biomedical application. Brief. Bioinform. 2009, 10, 297–314. [Google Scholar] [CrossRef]
  120. Han, J.; Pei, J.; Tong, H. Data Mining: Concepts and Techniques, 4th ed.; Morgan Kaufmann: Burlington, MA, USA, 2022; pp. 1–752. [Google Scholar] [CrossRef]
  121. Gagolewski, M. A framework for benchmarking clustering algorithms. SoftwareX 2022, 20, 101270. [Google Scholar] [CrossRef]
  122. Zhang, Z.; Feng, Q.; Huang, J.; Guo, Y.; Xu, J.; Wang, J. A local search algorithm for k-means with outliers. Neurocomputing 2021, 450, 230–241. [Google Scholar] [CrossRef]
  123. Nowak-Brzezińska, A.; Gaibei, I. How the Outliers Influence the Quality of Clustering? Entropy 2022, 24, 917. [Google Scholar] [CrossRef]
  124. Garge, N.R.; Page, G.P.; Sprague, A.P.; Gorman, B.S.; Allison, D.B. Reproducible clusters from microarray research: Whither? BMC Bioinform. 2005, 6 (Suppl. S2), S10. [Google Scholar] [CrossRef]
  125. Kumar, R.; Chambers, E., IV. Unreliability of clustering results in sensory studies and a strategy to address the issue. Front. Food Sci. Technol. 2024, 4, 1271193. [Google Scholar] [CrossRef]
  126. Lim, Z.-Y.; Ong, L.-Y.; Leow, M.-C. A Review on Clustering Techniques: Creating Better User Experience for Online Roadshow. Future Internet 2021, 13, 233. [Google Scholar] [CrossRef]
  127. MacQueen, J.B. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; Le Cam, L.M., Neyman, J., Eds.; University of California Press: Berkeley, CA, USA, 1967; Volume 1, pp. 281–297. [Google Scholar]
  128. Bock, H.H. Origins and extensions of the k-means algorithm in cluster analysis. Electron. J. Hist. Probab. Stat. 2008, 4. Available online: https://www.jehps.net/Decembre2008/Bock.pdf (accessed on 22 September 2025).
  129. Wani, A. Comprehensive analysis of clustering algorithms: Exploring limitations and innovative solutions. PeerJ Comput. Sci. 2024, 10, e2286. [Google Scholar] [CrossRef] [PubMed]
  130. Jain, A.K.; Murty, M.N.; Flynn, P.J. Data clustering: A review. ACM Comput. Surv. 1999, 31, 264–323. [Google Scholar] [CrossRef]
  131. Cabezas, R.; Izbicki, M.; Stern, R. Hierarchical clustering: Visualization, feature importance and model selection. arXiv 2021, arXiv:2112.01372. [Google Scholar] [CrossRef]
  132. Shetty, S.; Singh, A. Hierarchical clustering: A survey. Int. J. Comput. Appl. 2021, 178, 178–181. [Google Scholar] [CrossRef]
  133. Campello, R.J.G.B.; Moulavi, D.; Sander, J. Density-Based clustering based on hierarchical density estimates. In Advances in Knowledge Discovery and Data Mining; Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 160–172. [Google Scholar] [CrossRef]
  134. Gao, Y.; Zhang, L.; Wang, H. An overview of clustering methods with guidelines for practical applications. Inf. Sci. 2023, 630, 1–32. [Google Scholar] [CrossRef]
  135. HDBSCAN Development Team. HDBSCAN: How HDBSCAN Works. Available online: https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html (accessed on 27 September 2025).
  136. Tompalski, P.; White, J.C.; Coops, N.C.; Wulder, M.A. Demonstrating the transferability of forest inventory attribute models derived using airborne laser scanning data. Remote Sens. Environ. 2019, 227, 110–124. [Google Scholar] [CrossRef]
  137. IUL. Forest Management Manual; Święcicki, Z., Ed.; Ośrodek Rozwojowo-Wdrożeniowy Lasów Państwowych w Bedoniu: Andrespol, Poland, 2012. (In Polish)
  138. Bruchwald, A.; Dudek, A.; Michalak, K.; Rymer-Dudzińska, T.; Wróblewski, L.; Zasada, M. Wzory empiryczne do określania wysokości i pierśnicowej liczby kształtu grubizny drzewa (Empirical formulae for defining height and dbh shape figure of thick wood). Sylwan 2000, 10, 5–13. (In Polish) [Google Scholar]
  139. Gschwantner, T.; Alberdi, I.; Bauwens, S.; Bender, S.; Borota, D.; Bosela, M.; Bouriaud, O.; Breidenbach, J.; Donis, J.; Fischer, C.; et al. Growing stock monitoring by European National Forest Inventories: Historical origins, current methods and harmonisation. Forest Ecol. Manag. 2022, 505, 119868. [Google Scholar] [CrossRef]
  140. Gobakken, T.; Næsset, E. Assessing effects of positioning errors and sample plot size on biophysical stand properties derived from airborne laser scanner data. Can. J. For. Res. 2009, 39, 1036–1052. [Google Scholar] [CrossRef]
  141. Lisańczuk, M.; Mitelsztedt, K.; Stereńczak, K. The Influence of the Spatial Co-Registration Error on the Estimation of Growing Stock Volume Based on Airborne Laser Scanning Metrics. Remote Sens. 2024, 16, 4709. [Google Scholar] [CrossRef]
  142. Roussel, J.-R.; Auty, D.; Coops, N.C.; Tompalski, P.; Goodbody, T.R.H.; Meador, A.S.; Bourdon, J.-F.; de Boissieu, F.; Achim, A. lidR: An R package for analysis of Airborne Laser Scanning (ALS) data. Remote Sens. Environ. 2020, 251, 112061. [Google Scholar] [CrossRef]
  143. Liaw, A.; Wiener, M. randomForest: Breiman and Cutler’s Random Forests for Classification and Regression, R Package Version 4.7-1.2. Computer Software. Comprehensive R Archive Network (CRAN): Online, 2024. Available online: https://cran.r-project.org/package=randomForest (accessed on 28 September 2025).
  144. Næsset, E.; Gobakken, T. Estimating forest growth using canopy metrics derived from airborne laser scanner data. Remote Sens. Environ. 2005, 96, 453–465. [Google Scholar] [CrossRef]
  145. Parkitna, K.; Krok, G.; Miścicki, S.; Ukalski, K.; Lisańczuk, M.; Mitelsztedt, K.; Magnussen, S.; Markiewicz, A.; Stereńczak, K. Modelling growing stock volume of forest stands with various ALS area-based approaches. Forestry 2021, 94, 630–650. [Google Scholar] [CrossRef]
  146. SILP—Biuro Urządzania Lasu i Geodezji Leśnej. System Informatyczny Lasów Państwowych (SILP); 2015, 2020, 2021. Available online: https://www.zilp.lasy.gov.pl/ (accessed on 14 September 2025).
  147. Silverman, B.W. Density Estimation for Statistics and Data Analysis; Chapman & Hall: London, UK, 1986. [Google Scholar]
  148. Moeur, M.; Stage, A.R. Most similar neighbor: An improved sampling inference procedure for natural resource planning. For. Sci. 1995, 41, 337–359. [Google Scholar] [CrossRef]
  149. Loosmore, N.B.; Ford, E.D. Statistical inference using the G or K point pattern spatial statistics. Ecology 2006, 87, 1925–1931. [Google Scholar] [CrossRef]
  150. Fisher, R.A. Statistical Methods for Research Workers; Oliver & Boyd: Edinburgh, UK, 1925. [Google Scholar]
  151. Hogg, R.V.; Tanis, E.A.; Zimmerman, D.L. Probability and Statistical Inference, 9th ed.; Pearson: Boston, MA, USA, 2015; ISBN 978-0-321-92327-1. [Google Scholar]
  152. Mascha, E.J.; Vetter, T.R. Significance, Errors, Power, and Sample Size: The Blocking and Tackling of Statistics. Anesth. Analg. 2018, 126, 691–698. [Google Scholar] [CrossRef]
  153. Fraenkel, J.R.; Wallen, N.E. How to Design and Evaluate Research in Education, 7th ed.; McGraw-Hill: New York, NY, USA, 2009; ISBN 978-0-07-352596-9. [Google Scholar]
  154. Rabosky, D.L.; Grundler, M.; Anderson, C.; Title, P.; Shi, J.J.; Brown, J.W.; Huang, H.; Larson, J.G. BAMMtools: An R package for the analysis of evolutionary dynamics on phylogenetic trees. Methods Ecol. Evol. 2014, 5, 701–707. [Google Scholar] [CrossRef]
  155. R Core Team. R: A Language and Environment for Statistical Computing, version 5.5; R Foundation for Statistical Computing: Vienna, Austria, 2025; Available online: https://www.R-project.org/ (accessed on 29 September 2025).
  156. Probst, P.; Boulesteix, A.-L.; Wright, M. Hyperparameters and Tuning Strategies for Random Forest. arXiv 2018, arXiv:1804.03515. [Google Scholar] [CrossRef]
  157. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  158. Oshiro, T.; Perez, P.; Baranauskas, J. How Many Trees in a Random Forest? In Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2012; Volume 7376, pp. 154–168. [Google Scholar] [CrossRef]
  159. Probst, P.; Boulesteix, A.-L. To tune or not to tune the number of trees in random forest? J. Mach. Learn. Res. 2017, 18, 1–18. [Google Scholar] [CrossRef]
  160. Grafström, A.; Lundstrom, N.L.P.; Schelin, L. Spatially balanced sampling through the Pivotal method. Biometrics 2012, 68, 514–520. [Google Scholar] [CrossRef] [PubMed]
  161. Kanungo, T.; Mount, D.M.; Netanyahu, N.S.; Piatko, C.D.; Silverman, R.; Wu, A.Y. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 881–892. [Google Scholar] [CrossRef]
  162. Yin, H.; Aryani, A.; Petrie, S.; Nambissan, A.; Astudillo, A.; Cao, S. A Rapid Review of Clustering Algorithms. arXiv 2024, arXiv:2401.07389. [Google Scholar] [CrossRef]
  163. Magnussen, S.; Næsset, E.; Gobakken, T. An application niche for finite mixture models in forest resource surveys. Can. J. For. Res. 2019, 49, 1453–1462. [Google Scholar] [CrossRef]
  164. Khan, M.K.H.; Chakraborty, A.; Petris, G.; Wilson, B. Constrained Functional Regression of National Forest Inventory Data Over Time Using Remote Sensing Observations. J. Am. Stat. Assoc. 2020, 116, 1168–1180. [Google Scholar] [CrossRef]
  165. Zhang, Z.; Wang, J.; Li, Z.; Zhao, Y.; Wang, R.; Habib, A. Optimization Method of Airborne LiDAR Individual Tree Segmentation Based on Gaussian Mixture Model. Remote Sens. 2022, 14, 6167. [Google Scholar] [CrossRef]
  166. Szymkiewicz, B. Tablice Zasobności i Przyrostu Drzewostanów Sosnowych, Świerkowych, Jodłowych, Dębowych i Bukowych; Państwowe Wydawnictwo Rolnicze i Leśne: Warszawa, Poland, 1966. [Google Scholar]
  167. Liu, J.; Chen, Z.; Zhao, Z. Assessing the accuracy of forest above-ground biomass and carbon storage estimation by meta-analysis based close-range remote sensing. For. Res. 2025, 5, e017. [Google Scholar] [CrossRef]
  168. Mouret, F.; Morin, D.; Planells, M.; Vincent-Barbaroux, C. Tree Species Classification at the Pixel Level Using Deep Learning and Multispectral Time Series in an Imbalanced Context. Remote Sens. 2025, 17, 1190. [Google Scholar] [CrossRef]
  169. Guo, H.; Boonprong, S.; Wang, S.; Zhang, Z.; Liang, W.; Xu, M.; Yang, X.; Wang, K.; Li, J.; Gao, X.; et al. Dominant Tree Species Mapping Using Machine Learning Based on Multi-Temporal and Multi-Source Data. Remote Sens. 2024, 16, 4674. [Google Scholar] [CrossRef]
  170. Li, Y.; Li, C.; Li, M.; Liu, Z. Influence of Variable Selection and Forest Type on Forest Aboveground Biomass Estimation Using Machine Learning Algorithms. Forests 2019, 10, 1073. [Google Scholar] [CrossRef]
  171. Kotze, J.D.F.; Beukes, H.B.; Seifert, T. Essential environmental variables to include in a stratified sampling design for a national-level invasive alien tree survey. iForest 2019, 12, 418–426. [Google Scholar] [CrossRef]
  172. Jiang, X.; Li, G.; Lu, D.; Chen, E.; Wei, X. Stratification-Based Forest Aboveground Biomass Estimation in a Subtropical Region Using Airborne Lidar Data. Remote Sens. 2020, 12, 1101. [Google Scholar] [CrossRef]
  173. Wu, Z.; Liu, X.; Cheng, S.; Yang, C.; Wang, Z.; Liu, Y.; Dong, L.; Li, F.; Hao, Y. Evaluating the effectiveness of forest type stratification for aboveground biomass inference. Int. J. Appl. Earth Obs. Geoinf. 2025, 143, 104829. [Google Scholar] [CrossRef]
  174. Köhl, M.; Lister, A.; Scott, C.T.; Baldauf, T.; Plugge, D. Implications of sampling design and sample size for national carbon accounting systems. Carbon Balance Manag. 2011, 6, 10. [Google Scholar] [CrossRef]
  175. Jin, J.; Yang, J. Effects of sampling approaches on quantifying urban forest structure. Landsc. Urban Plan. 2020, 195, 103722. [Google Scholar] [CrossRef]
  176. Häbel, H.; Kuronen, M.; Henttonen, H.M.; Kangas, A.; Myllymäki, M. The effect of spatial structure of forests on the precision and costs of plot-level forest resource estimation. For. Ecosyst. 2019, 6, 8. [Google Scholar] [CrossRef]
  177. Patummasut, M.; Borkowski, J. Adaptive Cluster Sampling with Spatially Clustered Secondary Units. J. Appl. Sci. 2014, 14, 2516–2522. [Google Scholar] [CrossRef]
  178. Cabin, R.J.; Clewell, A.; Ingram, M.; McDonald, T.; Temperton, V. Bridging Restoration Science and Practice: Results and Analysis of a Survey from the 2009 Society for Ecological Restoration International Meeting. Restor. Ecol. 2010, 18, 494–503. [Google Scholar] [CrossRef]
  179. Poudyal, B.H.; Maraseni, T.; Cockfield, G. Scientific Forest Management Practice in Nepal: Critical Reflections from Stakeholders’ Perspectives. Forests 2020, 11, 27. [Google Scholar] [CrossRef]
  180. Kapoor, T.; Falconer, M.; Hutchen, J.; Westwood, A.R.; Young, N.; Nguyen, V.M. Implementing and evaluating knowledge exchange: Insights from practitioners at the Canadian Forest Service. Environ. Sci. Policy 2023, 148, 103549. [Google Scholar] [CrossRef]
  181. Fassnacht, F.E.; Latifi, H.; Hartig, F. Using synthetic data to evaluate the benefits of large field plots for forest biomass estimation with LiDAR. Remote Sens. Environ. 2018, 213, 115–128. [Google Scholar] [CrossRef]
  182. Williams, B.; Ritsos, P.D.; Headleand, C. Virtual Forestry Generation: Evaluating Models for Tree Placement in Games. Computers 2020, 9, 20. [Google Scholar] [CrossRef]
  183. Ferreira, J.F.; Nunes, R.; Peixoto, P. Procedural Generation of Synthetic Forest Environments to Train Machine Learning Algorithms. In Proceedings of the 2022 IEEE International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 10317–10323. [Google Scholar] [CrossRef]
  184. Duong, T.H.K.; Vega, C.; Renaud, J.-P.; Chauvet, G.; Bouriaud, O. A large-Scale Artificial Forest Tree Population for Sampling and Estimation Methods Simulations [Data Set]. Zenodo. 2023. Available online: https://zenodo.org/records/10252806 (accessed on 30 September 2025).
  185. Jevšenak, J.; Arnič, D.; Krajnc, L.; Skudnik, M. Machine Learning Forest Simulator (MLFS): R package for data-driven assessment of the future state of forests. Ecol. Inform. 2023, 75, 102115. [Google Scholar] [CrossRef]
  186. Cattaneo, N.; Astrup, R.; Antón-Fernández, C. PixSim: Enhancing high-resolution large-scale forest simulations. Softw. Impacts 2024, 21, 100695. [Google Scholar] [CrossRef]
  187. Yu, Z.; Qi, J.; Liu, S.; Zhao, X.; Huang, H. Evaluating forest aboveground biomass estimation model using simulated ALS point cloud from an individual-based forest model and 3D radiative transfer model across continents. J. Environ. Manag. 2024, 372, 123287. [Google Scholar] [CrossRef]
  188. AI Sweden. Synthetic Data and AI Are Taking Forestry into the Future. Available online: https://www.ai.se/en/news/synthetic-data-and-ai-are-taking-forestry-future (accessed on 30 September 2025).
Figure 1. Location of the forest districts from which inventory data were acquired.
Figure 2. Methodology pipeline.
Figure 3. Performance (RMSE and BIAS) of analysed sampling approaches.
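For orientation, the two accuracy measures named in the Figure 3 caption can be computed as in the minimal Python sketch below. It assumes the common definitions: bias as the mean of predicted minus observed values, and RMSE as the square root of the mean squared difference, both optionally expressed as a percentage of the observed mean. The study's exact formulas are not given in this section, and the function name and sample values are hypothetical.

```python
import math


def rmse_and_bias(observed, predicted):
    """Return (rmse, bias, rmse_pct, bias_pct) for paired plot-level values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and of equal length")
    errors = [p - o for o, p in zip(observed, predicted)]
    n = len(errors)
    bias = sum(errors) / n                             # mean signed error
    rmse = math.sqrt(sum(e * e for e in errors) / n)   # root mean squared error
    mean_obs = sum(observed) / n
    # Relative versions, as a percentage of the observed mean
    return rmse, bias, 100 * rmse / mean_obs, 100 * bias / mean_obs


# Hypothetical growing stock volumes in m3/ha (not data from the study)
obs = [100.0, 200.0, 300.0]
pred = [110.0, 190.0, 310.0]
rmse, bias, rmse_pct, bias_pct = rmse_and_bias(obs, pred)
```

A positive bias under this sign convention indicates overestimation; the opposite convention (observed minus predicted) is also in use, so the sign should be checked against the definition adopted in a given study.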
Figure 4. Stability of analysed methods. pp—percentage points.
Table 1. Brief characteristics of selected data clustering methods.
| Algorithm | Common Usage | Pros | Cons |
| --- | --- | --- | --- |
| K-means (1) | Partitioning data into k roughly spherical, equally sized clusters. Market segmentation, image compression, etc. | Simple, intuitive, fast for large datasets; easy to implement and scale; deterministic once initialization is fixed; cluster centroids are interpretable | k must be chosen ahead of time; assumes spherical clusters; sensitive to initialization and outliers; hard clustering only |
| Hierarchical clustering (2) (agglomerative/divisive) | Exploratory data analysis for hierarchical structures; used in biology, taxonomy, document clustering, etc. | No need to pre-specify the number of clusters; provides nested clusters via a dendrogram; flexible with linkage criteria (single, complete, average, Ward’s) | Computationally expensive for large datasets; sensitive to the choice of linkage and distance metric; merge/split decisions are irreversible; less useful for non-hierarchical clusters |
| HDBSCAN (3) (Hierarchical Density-Based Spatial Clustering of Applications with Noise) | Clustering with noise, uneven densities, irregular shapes; used for spatial, astrophysical, and geospatial data, etc. | Finds clusters of varying density and shape; handles noise well; no need for an epsilon parameter (unlike DBSCAN); produces a hierarchical structure with flat clustering | More computationally intensive; some points are treated as noise (not clustered); parameter tuning needed; less effective in very high dimensions |
| IDS (Individual Dimension Sampling) | The authors’ original algorithm, tested in this study | | |
(1) [127,128,129]; (2) [130,131,132]; (3) [133,134,135].
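To illustrate how a clustering method from Table 1 might drive sample allocation in covariate space, the following minimal Python sketch runs k-means on a handful of plots and picks, from each cluster, the plot closest to the cluster centroid. This is a sketch under stated assumptions, not the study's procedure: the farthest-first initialisation, the "nearest to centroid" selection rule, the function names, and the two-dimensional sample covariates are all hypothetical. Ward's hierarchical clustering, which the study found advantageous, is not shown here for brevity.

```python
import math


def kmeans(points, k, iters=50):
    """Lloyd's k-means with a deterministic farthest-first initialisation (assumed)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing seed
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
    return labels, centroids


def select_plots(points, k):
    """Pick one plot index per non-empty cluster: the member nearest the centroid."""
    labels, centroids = kmeans(points, k)
    chosen = []
    for j in range(k):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if members:
            chosen.append(
                min(members, key=lambda i: math.dist(points[i], centroids[j]))
            )
    return sorted(chosen)


# Hypothetical standardised covariates (e.g., mean height, height sd) for 9 plots
plots = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
    (5.0, 5.0), (5.1, 5.2), (5.2, 5.1),
    (10.0, 0.0), (10.1, 0.2), (10.2, 0.1),
]
selected = select_plots(plots, k=3)
```

Because clusters are disjoint, each selected plot is unique, and the sample size equals the number of non-empty clusters. With real ALS metrics, covariates would normally be standardised first so that no single metric dominates the Euclidean distance.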
Table 2. Brief characteristics of selected forest districts.
| District | Average Age | GSV [m³/ha] | Moran’s I * | Major Tree Species |
| --- | --- | --- | --- | --- |
| Białowieża | 110 | 401 | 0.012 | Norway spruce, Oak, Black alder, Scotch pine, Hornbeam |
| Głogów | 57 | 309 | 0.066 | Scotch pine, Beech, Silver fir, Oak |
| Gorlice | 64 | 373 | 0.035 | Beech, Silver fir, Scotch pine |
| Herby | 60 | 325 | 0.031 | Scotch pine, Oak |
| Katrynka | 60 | 395 | 0.059 | Scotch pine, Norway spruce |
| Taczanów | 77 | 304 | 0.042 | Scotch pine, Oak |
| Leżajsk | 62 | 354 | 0.012 | Scotch pine, Beech, Oak, Silver fir |
| Milicz | 60 | 379 | 0.013 | Scotch pine, Oak, Beech |
| Pieńsk | 54 | 303 | 0.005 | Scotch pine, Norway spruce, Birch |
| Supraśl | 58 | 400 | 0.002 | Scotch pine, Norway spruce, Oak |
* Spatial autocorrelation index for the target variable: 1 = full correlation, 0 = no correlation.
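As a reminder of how the index in Table 2 is obtained, the minimal Python sketch below computes global Moran's I from plot coordinates and a target variable. The spatial weighting scheme used in the study is not stated in this section, so the binary distance-band weights (1 for plots within `max_dist` of each other, 0 otherwise) are an assumption, as are the function name and the sample data.

```python
import math


def morans_i(coords, values, max_dist):
    """Global Moran's I with binary distance-band weights (assumed scheme)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = 0.0   # sum of w_ij * dev_i * dev_j
    w_sum = 0.0   # sum of all weights w_ij
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= max_dist:
                cross += dev[i] * dev[j]
                w_sum += 1.0
    denom = sum(d * d for d in dev)
    if w_sum == 0 or denom == 0:
        raise ValueError("no neighbouring pairs or zero variance")
    return (n / w_sum) * (cross / denom)


# Hypothetical plots along a transect, 1 unit apart
coords = [(float(x), 0.0) for x in range(6)]
trend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]          # smooth gradient
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # neighbours always differ

i_trend = morans_i(coords, trend, max_dist=1.0)
i_alt = morans_i(coords, alternating, max_dist=1.0)
```

The smooth gradient yields a positive index and the alternating pattern a negative one, which shows that the index can also fall below zero when neighbouring values are dissimilar.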
Table 3. Grouping and explanatory variables.
| Group | Variables | Source/Definition |
|---|---|---|
| ALS central tendency | mean height | lidR [142] |
| ALS dispersion | height sd | lidR [142] |
| ALS quantiles | height 1st quartile, 95th height percentile | lidR [142] |
| ALS cumulative histograms | square of the ratio of the number of points above the 2nd threshold to all points, square of the ratio of the number of points below the 7th height threshold to all points | [53,144,145] |
| Species related | coniferous species share, dominant species share, Shannon diversity index | ground inventory [15] |
| Age related | average age, natural logarithm of average age | ground inventory, GIS layers, previous inventories [146] |
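Several of the variables in Table 3 have simple closed forms. The sketch below computes the height summaries and the Shannon diversity index with NumPy; the function names are hypothetical, the inputs are synthetic, and the exact conventions used by lidR (for example, which returns are included or how quantiles are interpolated) are not reproduced here. The cumulative-histogram ratios are omitted because their thresholds are not defined in this section.

```python
import numpy as np


def height_metrics(z):
    """Summary statistics of point heights for one plot (hypothetical helper)."""
    z = np.asarray(z, dtype=float)
    return {
        "mean": z.mean(),
        "sd": z.std(ddof=1),          # sample standard deviation
        "q25": np.percentile(z, 25),  # 1st quartile
        "p95": np.percentile(z, 95),  # 95th percentile
    }


def shannon_index(shares):
    """Shannon diversity H = -sum(p_i * ln p_i) over non-zero species shares."""
    p = np.asarray(shares, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log(p)).sum())


# Synthetic inputs for illustration only.
metrics = height_metrics(np.arange(101))
h_even = shannon_index([1, 1, 1, 1])   # four equally abundant species
h_single = shannon_index([1.0])        # a pure single-species stand
print(metrics, h_even, h_single)
```

The index is zero for a single-species stand and grows with both the number of species and the evenness of their shares.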
Table 4. Forest generation contingency table.
| Factors | Levels |
|---|---|
| Kernel plot ID | 1, 10, 20, 30, …, 580 |
| k-neighbours | 100, 250, 500, 1000 |
| Sampling intensity | 100, 200, 300 |
| Repetitions | 1, 2, 3, …, 30 |
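If Table 4 is read as a full factorial design (an assumption; the section does not say so explicitly), the combinations for a single kernel plot can be enumerated as in the sketch below. The kernel plot IDs are left out because their full list is elided in the table, and the dictionary keys are hypothetical names.

```python
from itertools import product

# Levels copied from Table 4; kernel plot IDs are omitted on purpose.
k_neighbours = [100, 250, 500, 1000]
sampling_intensity = [100, 200, 300]
repetitions = range(1, 31)

# One dictionary per run, for a single kernel plot.
runs = [
    {"k_neighbours": k, "sampling_intensity": n, "repetition": r}
    for k, n, r in product(k_neighbours, sampling_intensity, repetitions)
]
print(len(runs), "runs per kernel plot")
```

Under this reading, each kernel plot contributes 4 × 3 × 30 runs, and the total scales linearly with the number of kernel plots.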
Lisańczuk, M.; Hycza, T.; Stereńczak, K. Efficiency of Data Clustering for Stratification and Sampling in the Two-Phase ALS-Enhanced Forest Stock Inventory. Remote Sens. 2025, 17, 3871. https://doi.org/10.3390/rs17233871
