Comparison of Data Grouping Strategies on Prediction Accuracy of Tree-Stem Taper for Six Common Species in the Southeastern US

Abstract: Clustering data into similar characteristic groups is a commonly used strategy in model development. However, the impact of data grouping strategies on modeling stem taper has not been well quantified. The objective of this study was to compare the prediction accuracy of different data grouping strategies. Specifically, a population-level model was compared to the models fitted with grouped data based on taxonomic rank, tree form and size. A total of 3678 trees were used in the analyses, which included six common species in upland hardwood forests of the southeastern U.S. Results showed that overall predictions are more accurate when building stem taper models at the species, species group or division level rather than at the population level. The prediction accuracy was not considerably improved between species-specific functions and models fitted with species-related groups for the four hardwood species examined. Grouping data by taxonomic rank provided more reliable predictions than height-to-diameter ratio (H-D ratio) or diameter at breast height (DBH). The form/size-related grouping methods (i.e., data grouped by H-D ratio or DBH) generally did not improve the prediction precision compared to a population-level model. In this study, the effect of sample size in model fitting showed a minimal impact on prediction accuracy. The methodology presented in this study provides a modeling strategy for mixed-species data, which will be of practical importance when data grouping is needed for developing stem taper models.


Introduction
Tree-stem taper, defined as the change in tree diameter with increasing tree height from ground level to total tree height, is a quantitative description of stem profile [1]. For a given population, stem taper functions are typically built by species, also known as species-specific models, e.g., [2][3][4]. Since every species in a plant community may respond differently to environmental and management changes and conditions, developing stem taper models at the species level has been generally assumed to better capture variable tree forms compared to a single population or community-level model (i.e., a single taper model for the entire population) [5,6]. However, building species-specific models usually requires relatively large samples due to complex model forms and large numbers of parameters [7]. When the target population includes a variety of species (e.g., mixed-hardwood forests), especially if many of them are recorded infrequently or are sparse in the population, fitting stem taper models by species can be difficult under time and cost constraints. Rather than grouping data at the species level, an alternative approach is to re-aggregate individuals into a smaller number of groups based on similar tree characteristics (e.g., taxonomic rank, tree form, size). This approach is cost-efficient when quantifying stem profile with limited data for diverse species, e.g., [8].
In model evaluation, prediction accuracy is an important criterion and is commonly assessed using an independent validation dataset. It was found that parametric stem taper models produced reliable predictions for loblolly pine when the size distribution of the predicted populations deviated from the observations used in model development (i.e., high robustness) [9]. Although the data grouping approach has been implemented in forest and natural resources practice, to our knowledge, the accuracy of stem taper models fit by different data grouping approaches and calibration sample sizes has not been extensively investigated. Stem taper modeling has primarily focused on single stemmed, excurrent crown form trees (e.g., coniferous species), e.g., [10][11][12][13]. Predicting stem taper for decurrent trees (e.g., deciduous hardwoods) is generally more challenging than excurrent trees due to a more complicated geometric shape of the main stem [1,8]. Although stem taper equations for upland hardwoods in the southeastern US were built in the past, e.g., [4], the predictability of models under various data grouping strategies has not been extensively examined.
Therefore, the objectives of this study were (1) to compare the prediction accuracy among different data grouping strategies, and (2) to examine the effect of sample size on the prediction accuracy of stem taper with different data grouping strategies. To achieve the first objective, stem taper models fit at the population level (i.e., one taper model for the entire dataset) were compared with those grouped based on taxonomic rank, tree form and size. Specifically, trees were grouped by species (species-specific), species group, division (phylum) group (i.e., softwoods vs. hardwoods), height-DBH ratios (H-D ratios) or DBH, respectively. For the second objective, trees were split randomly between a fitting and validation set at 10/90, 20/80, 30/70, 40/60, 50/50, 60/40, 70/30, 80/20 and 90/10 splits. For example, with a 20/80 split, 20% of the trees were randomly selected as fitting data, and the remaining 80% were used for validation. Six common species in the upland hardwood forests in the southeastern US were selected, including shortleaf pine (Pinus echinata), Virginia pine (Pinus virginiana), yellow poplar (Liriodendron tulipifera), hickory spp. (Carya spp.), white oak (Quercus alba) and southern red oak (Quercus falcata). These species are economically and ecologically important in the region [14]. The results of this study will provide insights on selecting appropriate data grouping strategies when developing tree-taper models with inadequate per-species data.

Data
The stem taper data used in this study were collected from the LegacyTree database (http://www.legacytreedata.org, last accessed on 18 October 2021). The LegacyTree database is a large compilation of North American trees sampled in the past century [15]. Felled trees with measured diameter outside bark (d, cm), diameter at breast height (DBH, cm) and total tree height (Ht, m) were used in analysis. The average taper trends and distributions of height to DBH ratios (H-D ratios) are shown in Figure 1. The sample trees were collected from 13 states in the southeastern US, including Alabama, Arkansas, Florida, Georgia, Kentucky, Louisiana, Mississippi, North Carolina, Oklahoma, South Carolina, Tennessee, Texas and Virginia. After trees with DBH < 7.6 cm (3 in) and total tree height < 4.6 m (15 ft) were excluded [16], a total number of 3678 trees were obtained. To reduce the sources of uncertainty in data collection, only a single dataset collected by Clark et al. [4] was used in this work. A summary of tree characteristics for each species is given in Table 1.

Table 1. Summary statistics of tree characteristics for the six species evaluated. N_tree is the total number of sample trees, and N_obs/tree is the number of observations (taper points) per tree. DBH is diameter at breast height in cm, and Ht is total tree height in m. For N_obs/tree, DBH and Ht, the average is given followed by the standard deviation in parentheses.

Taper Model
Models proposed by Kozak [12] and Max and Burkhart [11] were applied to predict stem tapers. Due to their flexibility, both models have been widely used to describe the tree profiles for a variety of species in different regions [10].

Variable-Exponent Model
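The nine-parameter variable-exponent model proposed by Kozak [12] can be written as:

$$d = a_0 \, DBH^{a_1} \, Ht^{a_2} \, K^{Z} \qquad (1)$$

where d is the diameter outside bark (cm) at height h (m) above ground, K and Z are the base and the variable exponent defined in Kozak [12] as functions of relative height h/Ht, DBH and Ht, and $a_0$-$a_8$ are model coefficients. In some cases, this model form can produce a negative value of Z when K = 0 (h = Ht), leading to an undefined value of d. Thus, when h = Ht, the restriction d = 0 was imposed on Equation (1).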
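To make the structure of Equation (1) concrete, the following minimal Python sketch evaluates a Kozak-type variable-exponent function. The expressions used for K and Z follow the commonly cited form of the Kozak (2004) model and are assumptions here, since they are not spelled out in the text above. The breast height of 1.3 m, the function name `kozak_diameter` and the coefficient values are likewise illustrative and are not the fitted values of this study.

```python
import math

def kozak_diameter(h, dbh, ht, a, breast_height=1.3):
    """Diameter d (cm) at height h (m) for a tree with DBH (cm) and total height Ht (m).

    a holds the nine coefficients a0..a8. The forms of K and Z below are the
    commonly cited Kozak (2004) expressions (assumed, not taken from this text);
    breast_height = 1.3 m is also an assumption.
    """
    if h >= ht:
        # Restriction d = 0 at the tree tip, as imposed on Equation (1).
        return 0.0
    z = h / ht                     # relative height
    p = breast_height / ht         # relative breast height
    K = (1.0 - z ** (1.0 / 3.0)) / (1.0 - p ** (1.0 / 3.0))
    Z = (a[3] * z ** 4
         + a[4] / math.exp(dbh / ht)
         + a[5] * K ** 0.1
         + a[6] / dbh
         + a[7] * ht ** (1.0 - z ** (1.0 / 3.0))
         + a[8] * K)
    return a[0] * dbh ** a[1] * ht ** a[2] * K ** Z

# Hypothetical coefficients chosen only to exercise the function.
example_coefficients = [1.0, 1.0, 0.0, 0.4, -0.5, 0.5, 2.0, 0.05, -0.3]
```

With this form, K equals 1 at breast height, so the predicted diameter there reduces to $a_0 \, DBH^{a_1} \, Ht^{a_2}$ regardless of the exponent coefficients.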

Segmented Polynomial Regression Model
Max and Burkhart [11] proposed using the squared ratio $d^2/DBH^2$ as the dependent variable, but Yang and Burkhart [9] found that the model with the first-order ratio d/DBH provided more accurate predictions. To be comparable with the Kozak [12] model, the model with the first-order ratio was used in this study, which is:

$$\frac{d}{DBH} = a_0 (x - 1) + a_1 (x^2 - 1) + a_2 (b_1 - x)^2 I_1 + a_3 (b_2 - x)^2 I_2 \qquad (2)$$

where x is h/Ht, $a_0$-$a_3$ are model coefficients, and $I_i = 1$ if $x \le b_i$ and 0 otherwise (i = 1, 2). In the Max and Burkhart [11] model, the coefficients $b_1$ and $b_2$ join three segments of the tree stem into a single model. They were estimated as 0.69 and 0.11, respectively, using all tree-stem taper data in this study. These fixed estimates of $b_1$ and $b_2$ were applied to all cases listed in Section 2.3.
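As an illustration of how the join points enter a segmented polynomial, the sketch below evaluates a Max and Burkhart-type function with the first-order ratio d/DBH as the response. The indicator-based form is the standard segmented polynomial of Max and Burkhart and is assumed to match the model used here. The function names and coefficient values are hypothetical, while the default join points 0.69 and 0.11 are the estimates reported above.

```python
def max_burkhart_ratio(x, a, b1=0.69, b2=0.11):
    """Relative diameter d/DBH at relative height x = h/Ht.

    a holds the four coefficients a0..a3; b1 and b2 are the upper and lower
    join points. Each quadratic segment term is switched on only at or below
    its join point, which keeps the curve continuous at b1 and b2.
    """
    i1 = 1.0 if x <= b1 else 0.0
    i2 = 1.0 if x <= b2 else 0.0
    return (a[0] * (x - 1.0)
            + a[1] * (x * x - 1.0)
            + a[2] * (b1 - x) ** 2 * i1
            + a[3] * (b2 - x) ** 2 * i2)

def max_burkhart_diameter(h, dbh, ht, a, b1=0.69, b2=0.11):
    """Diameter d (cm) at height h (m), obtained by scaling the ratio by DBH."""
    return dbh * max_burkhart_ratio(h / ht, a, b1, b2)

# Hypothetical coefficients used only to exercise the function.
example_coefficients = [-3.0, 1.5, -1.5, 30.0]
```

Because every term vanishes at x = 1, the predicted diameter at the tree tip is zero by construction, so no extra restriction is needed for this model form.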

Model Fitting and Evaluation
The step-by-step procedure of tree selection for model fitting and validation is given as:

1.
A random sample of 100 trees was selected from the original dataset for a given species.

2.
(a) Population-level case (fitting a single stem taper model for all data): For a given species, the 100 sample trees selected in step 1 were randomly split into fitting and validation datasets based on step 3. Then, the randomly split trees of all species were merged into a single fitting dataset and a single validation dataset. In this case, the fitting and validation datasets included all species, and each species contributed an equal number of trees.
(b) Data grouping cases (fitting stem taper models with grouped data): Trees drawn in step 1 were grouped based on taxonomic rank, tree form and size, as detailed in Sections 2.3.1 and 2.3.2.

3.
Fitting and validation data were created with 10/90, 20/80, 30/70, 40/60, 50/50, 60/40, 70/30, 80/20 and 90/10 splits. Trees used in fitting and validation were randomly selected. Model parameters were estimated with the Levenberg-Marquardt (LM) non-linear least squares algorithm that is implemented in the nlsLM function in R [17]. The LM algorithm is a compromise between the gradient-descent and Gauss-Newton approaches, which leads to more stable parameter estimates [18]. The initial values for parameter estimation were obtained from Yang and Burkhart [9]. The model evaluation statistics are given in Section 2.3.3.
To provide comparable results, the total number of sample trees summed over all groups was 600 (i.e., 100 trees/species × 6 species), which was consistent across all cases (i.e., the population-level and data grouping cases). When a tree was selected, all stem taper measurements within the tree were included, so that the correlation structure of repeated measurements was retained (i.e., cluster sampling).
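The selection procedure for the population-level case can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the tree identifiers, the record layout with a `tree_id` key and the helper names are hypothetical, and model fitting itself (done with nlsLM in R) is left out.

```python
import random

SPECIES = ["shortleaf pine", "Virginia pine", "yellow poplar",
           "hickory spp.", "white oak", "southern red oak"]

def sample_and_split(trees_by_species, n_per_species, fit_fraction, rng):
    """Return (fit_ids, val_ids) pooled over all species (population-level case)."""
    fit_ids, val_ids = [], []
    for species in sorted(trees_by_species):
        # Step 1: draw n_per_species trees without replacement, in random order.
        sampled = rng.sample(sorted(trees_by_species[species]), n_per_species)
        # Step 3: the first share of the randomly ordered sample is used for fitting.
        n_fit = round(fit_fraction * n_per_species)
        fit_ids.extend(sampled[:n_fit])
        val_ids.extend(sampled[n_fit:])
    return fit_ids, val_ids

def select_measurements(records, tree_ids):
    """Keep every taper point of the selected trees (whole-tree, cluster sampling)."""
    wanted = set(tree_ids)
    return [r for r in records if r["tree_id"] in wanted]

# Toy inventory with 150 hypothetical tree identifiers per species.
rng = random.Random(1)
inventory = {s: [f"{s}-{i}" for i in range(150)] for s in SPECIES}
fit_ids, val_ids = sample_and_split(inventory, 100, 0.2, rng)
```

Splitting on tree identifiers rather than on individual taper points keeps all measurements of a tree on the same side of the split, which is what retains the within-tree correlation structure.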

Grouping Data Based on Taxonomic Rank
Three ranks in the taxonomic hierarchy (species, species group and division) were used in data grouping. The methods were defined as:

1.
Trees grouped by species (fitting stem taper at species level): All trees selected in step 1 were used in model fitting and evaluation where fitting and validation datasets included only a single species. For a given species, 100 sample trees selected in step 1 were randomly split into fitting and validation datasets based on step 3.

2.
Trees grouped by species group: Six species were divided into three species groups: pine (shortleaf pine and Virginia pine), oak (white oak and southern red oak) and other hardwoods (yellow poplar and hickory spp.). Species in the pine or oak groups belong to the same genus. Although yellow poplar and hickory spp. are in different genera, this classifying strategy is commonly implemented in practice when species data are not available [4]. For a given group, the fitting/validation datasets were composed of an equal proportion of sample trees from each species. For example, under the 90/10 split, each species in a group contributed 90 trees for fitting and 10 trees for validation.

3.
Trees grouped by division group (gymnosperm vs. angiosperm): Six species were divided into softwood and hardwood groups. The softwood group included shortleaf pine and Virginia pine, whereas the hardwood group contained white oak, southern red oak, yellow poplar and hickory spp. For a given group, the fitting/validation datasets were composed of an equal proportion of sample trees from each species. Each species in a group had 100 trees randomly selected for fitting and validation.
Notably, data splitting was species-independent. When species were mixed, the fitting/validation data included the same number of trees from each species. For example, when species A and B are mixed, each species provides 100 trees. Given the 90/10 split, 90 trees were selected for fitting from the 100 trees of each species and the validation data included the remaining 10 trees (i.e., 10 = 100 − 90) from species A and B, respectively.
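The species-independent splitting described above can be illustrated with a short Python sketch. The species-to-group mappings reproduce the groups listed in this section, whereas the function name, the tree identifiers and the data layout are hypothetical.

```python
import random

SPECIES_GROUP = {
    "shortleaf pine": "pine", "Virginia pine": "pine",
    "white oak": "oak", "southern red oak": "oak",
    "yellow poplar": "other hardwoods", "hickory spp.": "other hardwoods",
}
# Division level: the two pines are softwoods, the other four are hardwoods.
DIVISION = {species: ("softwood" if group == "pine" else "hardwood")
            for species, group in SPECIES_GROUP.items()}

def grouped_split(sample_by_species, group_of, fit_fraction, rng):
    """Split each species separately, then pool the pieces by group.

    Returns {group: {"fit": [...], "val": [...]}} so that every species
    contributes the same number of trees to its group's datasets.
    """
    out = {}
    for species, trees in sorted(sample_by_species.items()):
        shuffled = list(trees)
        rng.shuffle(shuffled)
        n_fit = round(fit_fraction * len(shuffled))
        bucket = out.setdefault(group_of[species], {"fit": [], "val": []})
        bucket["fit"].extend(shuffled[:n_fit])
        bucket["val"].extend(shuffled[n_fit:])
    return out

# 100 hypothetical sample trees per species, split 90/10.
rng = random.Random(7)
sample = {s: [f"{s}-{i}" for i in range(100)] for s in SPECIES_GROUP}
by_group = grouped_split(sample, SPECIES_GROUP, 0.9, rng)
by_division = grouped_split(sample, DIVISION, 0.9, rng)
```

Under the 90/10 split, the oak group in this sketch receives 180 fitting trees and 20 validation trees, and the hardwood division receives 360 and 40.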

Grouping Data Based on Tree Form and Size
In this scenario, the sample trees of the six species selected in step 1 were merged into a dataset (a total of 600 trees, 600 = 100 × 6), and then regrouped into k number of equal-sized groups by H-D ratios or DBH. A taper function was applied to each of k groups, where k is equal to 6, 3 and 2 (i.e., 6, 3 and 2 groups). Specifically,

1.
Six H-D ratio or DBH groups: Trees were divided into six groups based on H-D ratios or DBH. Each group included 100 trees (i.e., 100 trees/group = 600 trees/6 groups).

2.
Three H-D ratio or DBH groups: Trees were divided into the smallest, middle and largest thirds based on H-D ratios or DBH to generate three H-D ratio or DBH groups. Each group included 200 trees (i.e., 200 trees/group = 600 trees/3 groups).

3.
Two H-D ratio or DBH groups: Trees were divided into the smallest and largest 50% based on H-D ratios or DBH to generate two H-D ratio or DBH groups. Each group included 300 trees (i.e., 300 trees/group = 600 trees/2 groups).
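The form/size grouping above amounts to ranking the 600 pooled trees by one variable and cutting the ranking into k consecutive blocks. The following Python sketch shows one way to do this. The tree records are randomly generated placeholders, the H-D ratio is computed simply as Ht divided by DBH without unit conversion (an assumption), and the function name is hypothetical.

```python
import random

def equal_size_groups(trees, key, k):
    """Sort trees by key and cut them into k consecutive groups of equal size."""
    ordered = sorted(trees, key=key)
    if len(ordered) % k != 0:
        raise ValueError("number of trees must be divisible by k")
    size = len(ordered) // k
    return [ordered[i * size:(i + 1) * size] for i in range(k)]

def hd_ratio(tree):
    """Height-to-diameter ratio (Ht / DBH, no unit conversion assumed)."""
    return tree["ht"] / tree["dbh"]

def dbh(tree):
    return tree["dbh"]

# 600 placeholder trees with DBH (cm) and total height (m).
rng = random.Random(3)
trees = [{"dbh": rng.uniform(7.6, 60.0), "ht": rng.uniform(4.6, 35.0)}
         for _ in range(600)]
groups_by_dbh = {k: equal_size_groups(trees, dbh, k) for k in (6, 3, 2)}
groups_by_hd = {k: equal_size_groups(trees, hd_ratio, k) for k in (6, 3, 2)}
```

A separate taper function would then be fitted to each of the k groups, after splitting the trees of each group into fitting and validation sets.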

Statistics for Model Evaluation
To evaluate the accuracy of stem diameter prediction, the percent mean bias (MB) and percent root mean square error (RMSE) for a given repetition were calculated from the residuals of the stem taper points, $e_d = d - \hat{d}$ (cm), where $d$ and $\hat{d}$ are the observed and predicted diameters in cm, respectively, and N is the total number of observations in a sample. For a given group, the estimates and 95% confidence intervals of MB and RMSE were computed as the median and the 2.5% and 97.5% quantiles of 500 repetitions. Then, the overall estimates and 95% confidence intervals of MB and RMSE were calculated by averaging over all groups for a given case.
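The evaluation statistics can be sketched as follows. The residual is taken as observed minus predicted diameter, and the percentages are expressed relative to the mean observed diameter. That normalization is an assumption made only for this illustration, since several definitions of percent MB and RMSE are in use. The function names and the quantile interpolation rule are likewise illustrative.

```python
import math

def percent_mb_rmse(observed, predicted):
    """Percent mean bias and percent RMSE for one repetition.

    Assumption: both statistics are scaled by the mean observed diameter.
    """
    n = len(observed)
    residuals = [d - d_hat for d, d_hat in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    mb = 100.0 * (sum(residuals) / n) / mean_obs
    rmse = 100.0 * math.sqrt(sum(e * e for e in residuals) / n) / mean_obs
    return mb, rmse

def quantile(values, q):
    """Quantile by linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def summarize(values):
    """Median with the 2.5% and 97.5% quantiles across repetitions."""
    return quantile(values, 0.5), quantile(values, 0.025), quantile(values, 0.975)
```

In use, `percent_mb_rmse` would be called once per repetition on the validation (or fitting) data of a group, and `summarize` would then be applied to the 500 resulting MB values and to the 500 RMSE values.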

Grouping Data Based on Taxonomic Rank
Overall, species-specific models provided more accurate predictions of stem taper than those fit at the population level (see percent MB and RMSE in Figures 2 and 3). When data were grouped by taxonomic rank, the three methods yielded similar mean bias regardless of fitting/validation data or model form. Generally, the overall prediction of stem taper was more precise when the data were divided by the lower rank of the taxonomic hierarchy (i.e., species level) than by the higher ranks, but the improvements were minimal. As shown in Figures 2 and 3, the models fitted by species provided smaller RMSE than the models fitted by the other species-related groups (i.e., data grouped by species group or division). However, the differences were only about 2% for fitting, and less than 2% in validation, for both the Kozak [12] and Max and Burkhart [11] models.
We further examined model validation by species. As Table 2 shows, all six species showed improvements in prediction accuracy when changing from a population-level model to the models fitted by species-related groups, except for Virginia pine with the Max and Burkhart [11] model. With the Kozak [12] model, the largest reduction in MB and RMSE between species-specific and population-level models was found for shortleaf pine (approximately 15% for MB and 10% for RMSE), followed by oaks and hickory. Similar results were found using the Max and Burkhart [11] model, with larger RMSE improvements being realized for the oak and hickory species. The differences in accuracy between the population-level and species-grouping models for yellow poplar were relatively small compared to other species using the Kozak [12] model. Notably, for Virginia pine, the model fitted with species group or division group yielded lower precision than the species-level model, which may be because it was grouped with shortleaf pine. Furthermore, when building models at the species level, excurrent trees (shortleaf pine, Virginia pine and yellow poplar) showed a lower RMSE than decurrent trees (white oak, southern red oak and hickory) (Table 2), which implied that excurrent trees had a lower variation in stem profile among individuals. Notably, grouping data by three different taxonomic ranks for the four hardwood species did not show noticeable differences in prediction accuracy (see MB and RMSE in Table 2). Precision did not differ greatly between species-specific models and higher-level groupings for either model form. In other words, building stem taper models at the species level (a lower rank in the taxonomic hierarchy) did not greatly improve the prediction accuracy. For example, a similar range of RMSE was produced (15.7-17.7% and 16.3-17.3%, respectively, in Table 2) when fitting white oak alone or white oak in the oak group with the Kozak [12] model.

Grouping Data Based on Tree Form and Size
When H-D ratio or DBH was used in data grouping, the overall absolute mean bias was similar to that of the species-related grouping models. Based on the 95% confidence intervals shown in Figure 4, the Kozak [12] model yielded less biased predictions than the Max and Burkhart [11] model. For RMSE, the average differences increased to 4-5% (Figures 2 and 3). In some cases, using the form/size-grouping methods produced less precise predictions than the population-level model (see Figure 2d). Unlike using taxonomic rank in data grouping, the results showed that increasing the number of H-D ratio or DBH groups in model fitting did not appreciably improve the prediction accuracy. As Figures 2 and 3 illustrate, MB and RMSE were similar among the different numbers of H-D ratio/DBH groups. Although H-D ratio has been shown to be related to crown/tree form, and tree taper usually varies by tree DBH, e.g., [8,9,19,20], we found that the uncertainty of prediction was not considerably reduced when grouping data based on H-D ratio or DBH.

Effect of Sample Size on Prediction Accuracy
Generally, the effect of sample size used for model fitting was small except for the form/size-grouping methods with the Kozak [12] model. Taper models were robust across all fitting/validation ratios evaluated. Larger sample sizes minimally affected MB for both the fitting and validation data regardless of grouping strategy (Figures 2 and 3). Larger fitting sample sizes resulted in slightly larger RMSE values for the fitting data; however, the validation RMSE values noticeably improved with larger fitting sample sizes, especially when changing from 10% to 20% of the total data with the Kozak [12] model (Figures 2 and 3). The largest improvement in RMSE occurred with the six H-D ratio grouping strategy, with an approximate 3% improvement from the smallest fitting size to the largest.

Discussion
In forestry, grouping data by species to build species-specific taper models has long been assumed to be the most accurate and precise strategy among data clustering methods. Grouping data by other criteria (e.g., higher taxonomic rank) was viewed as a compromise when sufficient species-level observations were lacking. As a result, most past efforts were confined to developing statistical methods for species-level models with a limited sample size. However, our results showed that grouping data by species did not greatly improve the prediction accuracy of stem taper compared to clustering data by species group or division. Grouping data by a higher rank of the taxonomic hierarchy may still provide a certain level of accuracy in prediction. Notably, in this study, Virginia pine was grouped with shortleaf pine because they are the only two coniferous (pine) species. However, both species could have variable size and stem shape, which resulted in poor prediction accuracy when both species were grouped (see Figure 1 and Table 1). We found that species-specific models could be less precise than those fit to higher levels of grouping for a given species, as an individual species may contain considerable variation in stem taper depending on growing conditions. Clustering data into a small number of similar, simplified groups has been examined and implemented in forestry and ecology. However, many of the past studies were primarily focused on grouping data from ecological perspectives (e.g., aggregating data into functional groups) in species-abundant forest ecosystems (e.g., tropical rain forests), e.g., [19,21,22]. The results showed that using only H-D ratio or DBH as a grouping criterion was not adequate to classify data so that the individuals within groups have more similar taper than those between groups (i.e., so that the variation within groups is smaller than that between groups).
Using multiple criteria (e.g., a combination of species and tree size) in data grouping may improve the overall prediction accuracy, but adding additional criteria usually requires a larger sample size in model development. Thus, in this study, the data were classified by only a single criterion at a time, so that the results can be better implemented in forestry practice.
When handling mixed-species data, the primary goal usually lies in finding a proper modeling strategy for minimizing the uncertainty for all species, not just for a single species. Grouping data by species group or division was found to not cause a large reduction in precision and accuracy in prediction. In other words, the influence of grouping by upper levels of taxonomic rank was minimal and dependent on the population of interest. Various statistical methods have been widely studied for modeling stem taper in the forestry literature [10]; however, to our knowledge, the impact of grouping strategies on predicting stem taper has not been extensively examined. The findings of this work can be used to provide insights in building stem taper models for multi-species datasets. In addition to the six species examined, the methodology can be applied to other types of forests when data clustering is needed.
Other than aggregating data, an alternative approach for dealing with multi-species data is to construct a population-level, mixed-effect model, and localize the equation with the upper stem diameter of the target trees, e.g., [23]. However, measuring upper stem diameters requires additional time and effort in the field, which may not be a feasible option in many cases. Lam et al. [24] proposed adding the taxonomic hierarchy of genus and species as random effects in developing species-specific, height-diameter relationship models for tropical forests in Malaysia. However, the trajectories of stem profile are usually more complicated than H-D relationships. In addition to taxonomic rank, it is worth investigating adding measuring procedure, location or environmental/climatic factors as a random effect in a mixed model. Comparing the accuracy of stem taper predicted by the grouped data and the mixed-effect models is suggested for future studies. Strategies for selecting proper initial values and random effect parameters need to be further investigated. The Kozak [12] and Max and Burkhart [11] taper models used in this work are not necessarily optimal for each species but are used due to their flexibility. Choosing a proper base model and initial values in parameter estimation is critical in model development and should be considered on a case-by-case basis when developing local taper models.
Lastly, in regression analysis, fitting (training) data commonly contain more observations than validation data. In this work, we examined using validation datasets that were considerably larger than the fitting data. This is of interest because in practice, fitting datasets are much smaller than the populations of interest. Models that successfully validate when fit with relatively small samples provide additional evidence of robustness and confidence in their ability to successfully function in practice. These results indicate that the parametric models evaluated are robust against small sample sizes, which can be applied when sufficient numbers of destructively sampled data are not available due to logistical or ecological limitations.

Conclusions
In summary, the overall prediction is more accurate when building stem taper models at the species (group) or division level than at the population level. The prediction accuracy was not considerably improved between species-specific functions and models with species-related groups for the four hardwood species examined. Grouping data by taxonomic rank provided better prediction accuracy than by height-to-diameter ratio (H-D ratio) or diameter at breast height (DBH). The form/size-related grouping methods (i.e., data grouped by H-D ratio or DBH) generally did not improve the prediction precision compared to a population-level model. In this study, the effect of sample size in model fitting showed a minimal impact on prediction accuracy. However, the goal was not to elucidate what a sufficient sample size or proper model form is for a particular species. This will be situation-specific and depend on the target species, tree sizes available for sampling, the taper model form used and the desired model precision. The methodology presented in this study provides a modeling strategy for a mixed-species population, which will be of practical importance when data grouping is needed for developing stem taper models.
Author Contributions: Conceptualization, S.Y. and P.C.G.; methodology and analysis, S.Y. and P.C.G.; writing-original draft preparation, S.Y.; writing-review and editing, S.Y. and P.C.G. All authors have read and agreed to the published version of the manuscript.