1. Introduction
Site quality refers to the production potential of a specific forest or vegetation type on a given site [1]. Forest quality and yield are closely related to site quality. Accurately and scientifically assessing forest site quality to fully realize the multifunctional value of forests has become a primary focus in forestry research [2,3,4]. Site quality evaluation for even-aged pure forests is relatively well established. Because their species composition is uniform and their planting times are known, the site quality of even-aged pure forests is typically evaluated using the site index, which is based on the relationship between dominant height and age [5,6]. In contrast, the growth processes of mixed forests are more complex than those of even-aged pure forests, and their diverse species and age structures create significant challenges for site quality assessment. Establishing a precise and scientific evaluation system for mixed forest site quality could substantially benefit mixed forest management and sustainable development.
For the evaluation of site quality in mixed forests, some scholars have employed the biomass potential productivity method [7]. Others have used the growth index method to assess forest quality [8]. Additionally, some researchers have applied the site index method to mixed forests. Lou et al. [9] used age–height data from natural secondary mixed forests of Juglans mandshurica (Juglans spp.) in Changbai Mountain, Jilin Province, to establish a polymorphic site index model. This model, based on univariate site index guiding curves for different site types, accurately reflects height growth differences across site types for Juglans mandshurica. Ercanli et al. [10] developed a dynamic site index model for mixed forests of Scots pine (Pinus sylvestris L.) and Oriental beech (Fagus orientalis Lipsky) in northwestern Turkey based on stem analysis data from 397 dominant trees. Their study found that models based on the Bertalanffy–Richards and Hossfeld functions performed best in predicting dominant tree height. The site index method and its dynamic variant are widely used for assessing site productivity in mixed forests because of their simplicity and practical applicability. However, traditional site index models rely heavily on age-specific parameters, which limits their flexibility in forests with diverse species compositions and complex structures. Dynamic site index models improve on this by incorporating temporal changes in stand development, but they often require extensive data and remain less adaptable to mixed-species conditions. More fundamentally, the site index depends on a reference age, which is a significant limitation under the uneven-aged and complex growth conditions of mixed forests, and calculating it requires the age of sample trees, which is both time-consuming to obtain and prone to inaccuracies in uneven-aged mixed forests. In contrast, the site form method, which is based on the relationship between tree diameter at breast height (DBH) and height, removes the dependence on age data and is better suited to evaluating site quality in complex forest ecosystems [11,12].
In the study of site form, Vanclay and Henry [11] were the first to propose “site form” as a method for evaluating productivity in uneven-aged forests, applying this approach to uneven-aged Araucaria-dominated coniferous forests (Araucaria cunninghamii Aiton ex D. Don) in Queensland, Australia. Herrera-Fernández et al. [13] demonstrated that the site form method effectively assesses site quality in broadleaved forests within Neotropical secondary rainforests, showing that site form indicators provide valuable information on forest productivity and growth potential. This information is crucial for forest management and conservation across diverse environmental conditions. For tropical humid forests, site form has also been shown to be a key indicator of productivity. Do et al. [14] used site form to indirectly estimate potential productivity in natural secondary tropical humid forests, confirming its effectiveness in this environment. The significance of this method lies in its ability to provide reliable data on site quality, aiding in the development of precise forest management and conservation strategies. Castano-Santamaría et al. [15] applied site form models to natural beech forests in northwestern Spain, using fitted dynamic equations to estimate site quality. Their findings revealed a significant correlation between site form and site index, demonstrating that site form effectively reflects site quality. Gao et al. [12] developed a site quality classification model for Chinese fir (Cunninghamia lanceolata (Lamb.) Hook.) plantations using both site index and site form models, concluding that the site form model provided more accurate assessments than the site index model. In summary, current research suggests that tree height is more closely associated with DBH than with age, and that DBH is simpler, more convenient, and more accurate to measure than age. Consequently, site form is widely used in evaluating site quality for mixed or natural forests.
The construction of site quality models aims to improve the classification of forest site quality. Because the growth of mixed forest stands is a complex process influenced by factors such as time, site conditions, and climate, effective site classification generally requires investigating the influencing factors and assessing site conditions based on their differences. There are two main approaches to site quality classification. The first combines ecological, botanical, and geoscience characteristics and uses expert judgment to classify forest sites. For instance, Barnes [16] explored ecological methods for ecosystem classification, with a particular focus on multi-factor site classification that incorporates factors such as soil and climate. Tesch [17] emphasized that climate, soil, and hydrology are critical site factors affecting tree growth, and he advocated an interdisciplinary approach, integrating geography and soil science, for forest site classification and evaluation. The second approach develops site quality classification models based on statistical methods. For example, Zhang et al. [18] applied quantitative theory to classify Eucalyptus urophylla S.T. Blake plantations in Hainan Island and Leizhou Peninsula into 12 site types. Lu et al. [19] examined the relationship between dominant tree growth characteristics and site factors using quantitative theory, employing hierarchical clustering and canonical correlation analysis to classify and evaluate the growth potential of hybrid Eucalyptus (Eucalyptus spp.) plantations. Additionally, Zhang et al. [20] used dominant factor analysis to identify key factors and developed a site index table to assess site quality. However, these traditional site quality classification methods typically rely on multivariate statistical models. Because tree growth and site factors often exhibit complex nonlinear relationships, traditional linear or simple nonlinear models rest on oversimplified assumptions, which limits their effectiveness. With advances in computing, machine learning algorithms now offer a powerful alternative capable of handling high-dimensional data with complex nonlinear interactions, thus providing robust support for forest site quality classification and assessment [21]. Recent studies have highlighted the advantages of machine learning algorithms, such as Random Forest and Support Vector Machines, in estimating forest parameters and conducting large-scale forest quality assessments [22]. Machine learning techniques have also been successfully applied to predict forest yields, further demonstrating their potential and versatility in forest quality evaluation [23]. In addition, machine learning has shown promising results in site quality classification for plantations. For instance, Piri-Sahragard et al. [24] employed random forest to examine the relationship between site factors and the distribution of five plant species, including Haloxylon persicum (Persian saxaul), demonstrating that machine learning models outperform traditional mathematical modeling in evaluating site quality. Chen et al. [25] used decision tree algorithms to construct a site quality classification model for Chinese fir plantations in Jinping County, Guizhou Province, based on site factors selected through expert judgment, verifying growth patterns for Chinese fir. Most existing studies use a single machine learning algorithm for model construction without comparing multiple algorithms, and applications of machine learning for site quality classification in mixed forests remain rare.
To accurately assess forest quality on a specific site, it is essential to clarify the criteria for forest stand type classification and understand the relationship between site conditions and stand growth. Therefore, it is necessary to classify mixed forests based on these criteria. In this study, mixed forests in southwestern Zhejiang were selected as the research focus. Using the Two-way Indicator Species Analysis (TWINSPAN) method, the mixed forests were classified into different stand types. An Algebraic Difference Approach (ADA) was then applied to develop a site form model. On this basis, site and climatic factors were comprehensively considered to construct site quality classification models for mixed forests, utilizing four machine learning algorithms: Random Forest (RF), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and XGBoost. Through comparative analysis, the most suitable classification model for site quality was selected for each stand type. This research provides a more scientific and effective method for evaluating site quality in mixed forests, offering theoretical and technical support for forest land-use planning in Lishui City’s mixed forests.
2. Materials and Methods
2.1. Study Area
The study area is located in Lishui City, in southwestern Zhejiang Province (Figure 1). As of 2020, Lishui’s forested area covers 1.42 million ha, with a high forest coverage rate of 82.27% and a total standing volume of 96 million m³. The region experiences a subtropical monsoon climate, with an annual average temperature ranging from 18.2 °C to 19.6 °C, a frost-free period of 246–274 days, and between 54 and 186 precipitation days per year. The annual rainfall ranges from 1309.9 to 1970.5 mm, with annual sunlight hours of 1102.3 to 1759.6 h and a total annual radiation of 102.1 to 110.0 kWh/m².
2.2. Data
The research data were derived from the 2020 Second National Forest Inventory of Lishui City, Zhejiang Province, focusing on mixed forest sub-compartments dominated by coniferous mixed forest, broadleaved mixed forest, and mixed coniferous–broadleaved forests, totaling 95,730 sub-compartments. Key survey variables for each sub-compartment included elevation, aspect, slope position, slope gradient, soil type, humus layer thickness, average diameter at breast height (DBH), dominant species, age, tree height, and species composition. For constructing the site form model, the growth factors of average DBH and average tree height were used. In developing the classification model, site factors such as elevation, aspect, slope position, and slope gradient were selected based on prior research. The growth and site factors for the sample sub-compartments are summarized in Table 1.
This study also requires climate data as independent variables to construct the site classification model. The climate data were sourced from WorldClim (http://www.worldclim.org, accessed on 10 July 2024) [26]. Sixteen climate factors that significantly influence tree growth, including temperature and precipitation, were selected for the analysis. Details of these climate factors are provided in Table 2.
2.3. TWINSPAN Classification Methodology
In forest ecology, commonly used clustering methods include hierarchical clustering, k-means clustering, and the TWINSPAN method. Hierarchical clustering offers intuitive and visually interpretable tree-like structures but becomes computationally intensive when applied to large datasets. K-means clustering is computationally efficient but requires predefining the number of clusters and assumes that clusters have a spherical distribution, which may not align with the complex ecological gradients found in mixed forests. The Two-way Indicator Species Analysis (TWINSPAN) method is a modified form of indicator species analysis and is primarily used as a classification algorithm for simultaneously categorizing sample plots and species [27]. TWINSPAN employs axes derived from Correspondence Analysis (CA) or Detrended Correspondence Analysis (DCA) to progressively refine hierarchical divisions within the dataset, creating an ordered two-way table that categorizes objects (sample plots) and variables (species) [28].
The TWINSPAN classification introduces the concept of pseudo-species, which assumes that the same species can have different indicative meanings at different abundance levels. Each abundance level of a species is therefore treated as a distinct “species” during analysis. Based on the tree species composition characteristics of mixed forests in Lishui City, Zhejiang Province, and considering the proportion of each species in the stand composition, tree species abundance was divided into five levels using the cut levels 0, 2, 4, 6, and 8.
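The pseudo-species idea can be illustrated with a short sketch. It is not the twinspanR implementation. It assumes that abundance is expressed on a 0–10 scale (tenths of stand composition) and that a species present in a plot yields one pseudo-species for every cut level its abundance reaches; the function name, the species labels, and the plot values are hypothetical.

```python
from typing import Dict, List

# Pseudo-species cut levels reported in the text.
CUT_LEVELS = [0, 2, 4, 6, 8]


def pseudo_species(composition: Dict[str, float],
                   cut_levels: List[float] = CUT_LEVELS) -> List[str]:
    """Expand one plot's species abundances into pseudo-species labels.

    Assumption: abundance is on a 0-10 scale, and a species that is
    present yields pseudo-species j whenever its abundance is at least
    the j-th cut level. Absent species (abundance 0) yield nothing.
    """
    labels = []
    for species, abundance in composition.items():
        if abundance <= 0:
            continue  # absent species produce no pseudo-species
        for level, cut in enumerate(cut_levels, start=1):
            if abundance >= cut:
                labels.append(f"{species}_{level}")
    return labels


# Hypothetical plot: species_A makes up 7/10 of the stand, species_B 3/10.
plot = {"species_A": 7, "species_B": 3, "species_C": 0}
labels = pseudo_species(plot)
print(labels)
```

Under these assumptions the dominant species contributes four pseudo-species and the minor species two, so plots that share a species at very different abundances are no longer treated as identical.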
The specific steps of TWINSPAN used in this study are as follows:
First, the sub-compartment data were classified into three major categories—coniferous mixed forest, broadleaved mixed forest, and mixed coniferous–broadleaved forests—based on dominant species. After removing outliers and missing values using a three-standard deviation criterion, 52,876, 9572, and 33,282 records were retained, respectively.
The original data matrix was constructed from the species composition of the mixed forests, recorded as percentages (Equation (1)). The structure of the matrix used in this study is described in Appendix A. It consists of a 1 × m row vector of tree species names and an n × m matrix holding the stand volume percentage of each tree species in each sub-compartment, where m is the number of tree species and n is the number of sub-compartments. For the coniferous mixed forest, broadleaved mixed forest, and mixed coniferous–broadleaved forests, listed in that order, n = 52,876, 9572, and 33,282, and m = 39, 36, and 41.
The twinspanR package in R (v 4.3.3) was used to perform TWINSPAN analysis. For the coniferous mixed forest, the maximum level of division was set to 3, with pseudo-species cut levels at 0, 2, 4, 6, and 8. For the broadleaved mixed forest and mixed coniferous–broadleaved forests, the maximum level of division was set to 4, using the same pseudo-species cut levels (0, 2, 4, 6, and 8). These parameters were applied to generate the classification results.
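The three-standard-deviation screening mentioned in the first step above can be sketched as follows. The text does not state which variables were screened or whether the rule was applied once or iteratively, so the single-pass rule, the function name, and the sample values below are assumptions made only for illustration.

```python
from statistics import mean, stdev
from typing import List, Optional


def three_sigma_filter(values: List[Optional[float]]) -> List[float]:
    """Drop missing values, then drop values more than 3 SD from the mean.

    Assumption: a single pass using the sample standard deviation.
    """
    present = [v for v in values if v is not None]
    if len(present) < 2:
        return present  # not enough data to estimate a spread
    mu, sd = mean(present), stdev(present)
    if sd == 0:
        return present  # all values identical, nothing to remove
    return [v for v in present if abs(v - mu) <= 3 * sd]


# Hypothetical mean heights (m): one implausible record and one missing value.
heights = [10.0] * 10 + [12.0] * 10 + [100.0, None]
kept = three_sigma_filter(heights)
print(len(kept), max(kept))
```

Note that with very small samples a single extreme value inflates the standard deviation enough to escape the rule, which is less of a concern for datasets of the size used here.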
2.4. Mixed Forested Site Form Models
2.4.1. Determination of Base Diameter at Breast Height for Mixed Forests
Site form refers to the relationship between dominant height and dominant diameter at breast height (DBH) within a stand, and it requires determining a reference DBH that represents stable height growth while sensitively reflecting site quality differences [29]. Based on the literature and available quantitative methods for calculating reference DBH, this study uses the most frequently occurring DBH value in the research data as the reference DBH [11]. Additionally, since the Second National Forest Inventory does not include data on the average dominant height of trees, and mixed forests are minimally affected by human activities like logging, average tree height is used as a proxy for dominant height.
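Choosing the most frequent DBH value as the reference DBH amounts to taking the mode of the DBH distribution. A minimal sketch follows, assuming that DBH values are first grouped into 1 cm classes; the class width, the function name, and the sample values are assumptions and are not taken from the study.

```python
from collections import Counter
from typing import Iterable


def reference_dbh(dbh_values: Iterable[float], class_width: float = 1.0) -> float:
    """Return the most frequent DBH class (the mode) as the reference DBH.

    Assumption: values are rounded to the nearest multiple of class_width
    before counting. Ties resolve to the class encountered first.
    """
    classes = [round(v / class_width) * class_width for v in dbh_values]
    most_common_class, _count = Counter(classes).most_common(1)[0]
    return most_common_class


# Hypothetical mean DBH values (cm) for a handful of sub-compartments.
dbh = [12.3, 14.1, 13.8, 14.4, 15.9, 14.0, 12.6]
ref = reference_dbh(dbh)
print(ref)
```

Because the mode depends on the class width, a multimodal or widely spread DBH distribution can give different reference values under different groupings, which is worth checking before fixing the reference DBH.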
2.4.2. Guiding Curve Selection
Within the cluster of mean height growth curves for dominant trees in a stand, there is a curve that represents the average growth trajectory of dominant tree height under neutral site conditions as stand age or mean DBH changes. This is known as the guiding curve [30,31]. The choice of guiding curve has a direct impact on the accuracy of site quality assessments. In this study, six commonly used theoretical growth equations (Korf, Schumacher, Richards, Logistic, Gompertz, and Weibull) were selected as candidate guiding curves. The expressions for each equation are shown in Table 3.
2.4.3. Construction of Site Form Models
The algebraic difference approach (ADA) is widely used to model stand growth processes and is one of the most common methods for establishing site index curves [32,33,34,35]. Its principle involves selecting a theoretical equation as the base model and identifying one parameter as the elimination variable. By eliminating this parameter, a difference equation containing two sets of dependent and independent variables is derived. In this study, the optimal guiding curves (the Richards, Logistic, Gompertz, and Weibull models) were chosen as the base equations for the site form model, with their respective expressions shown in Equations (2)–(5).
where d₂ denotes the baseline (reference) diameter at breast height (DBH); d₁ denotes the dominant DBH; H₁ denotes the dominant height at DBH d₁; parameter a represents the maximum potential growth of trees; parameter c represents the growth rate of trees; and parameter b is the parameter that is eliminated to obtain the difference equation.
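To make the elimination step concrete, the sketch below works through the ADA for one base curve. It assumes the common Richards form H = a(1 − exp(−b·d))^c; the paper's exact parameterization is given in its Table 3 and may differ. Solving the base curve for b at the observed pair (d₁, H₁) and substituting into the curve at d₂ gives H₂ = a(1 − (1 − (H₁/a)^(1/c))^(d₂/d₁))^c. The function names and parameter values are hypothetical.

```python
import math


def richards_height(d: float, a: float, b: float, c: float) -> float:
    """Assumed Richards base curve: H = a * (1 - exp(-b * d)) ** c."""
    return a * (1.0 - math.exp(-b * d)) ** c


def site_form_richards(h1: float, d1: float, d2: float, a: float, c: float) -> float:
    """ADA difference form of the assumed Richards curve with b eliminated.

    Returns the height expected at DBH d2 for a stand observed at (d1, h1).
    With d2 set to the reference DBH, the result is the site form.
    """
    if not 0.0 < h1 < a:
        raise ValueError("h1 must lie strictly between 0 and the asymptote a")
    # From the base curve: exp(-b * d1) = 1 - (h1 / a) ** (1 / c)
    base = 1.0 - (h1 / a) ** (1.0 / c)
    # exp(-b * d2) = exp(-b * d1) ** (d2 / d1), so b never appears explicitly
    return a * (1.0 - base ** (d2 / d1)) ** c


# Hypothetical parameters, used only to check internal consistency.
a, b, c = 25.0, 0.06, 1.3
h1 = richards_height(14.0, a, b, c)
h_ref = site_form_richards(h1, d1=14.0, d2=20.0, a=a, c=c)
print(round(h1, 2), round(h_ref, 2))
```

The difference form reproduces the base curve exactly when the observation lies on it, and it returns H₁ unchanged when d₂ = d₁, which is the path-invariance property expected of an ADA equation.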
2.5. Machine Learning Algorithms
This study employs two traditional classification algorithms—K-Nearest Neighbors (KNN) and Support Vector Machine (SVM)—along with two ensemble learning algorithms—Random Forest (RF) and XGBoost—to construct site quality classification models. The dataset is divided into a 7:3 ratio for training and testing to ensure robust evaluation and validation of the models.
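A 7:3 split can be sketched in a few lines. The text does not say whether the split was random or stratified by site quality class, so the simple seeded random split below, along with the function name and seed, is an assumption.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_70_30(records: Sequence[T], seed: int = 42) -> Tuple[List[T], List[T]]:
    """Shuffle the records reproducibly and split them 7:3.

    Assumption: a plain random split, not stratified by class.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


train_set, holdout_set = split_70_30(list(range(100)))
print(len(train_set), len(holdout_set))
```

When site quality classes are imbalanced, a stratified split that preserves class proportions in both parts is usually the safer choice.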
2.5.1. Random Forest Algorithm
Random Forest (RF) is a classic, straightforward ensemble learning algorithm [36,37] known for its high classification accuracy and ability to handle high-dimensional data [38]. Random Forest uses decision trees as base estimators, training multiple decision trees and combining their outputs to produce a final result [39]. Additionally, RF randomly selects features and samples during training, which helps prevent overfitting and reduces the impact of irrelevant or inconsistent data on the model’s performance [40].
2.5.2. Support Vector Machine Algorithm
Support Vector Machine (SVM) is a supervised learning algorithm that seeks an optimal hyperplane in feature space to separate data points [41]. Before the advent of deep learning, SVM was regarded as one of the most successful and effective machine learning algorithms, especially for high-dimensional data with small sample sizes. Its core idea is to maximize the margin between two classes of data points, thereby defining a clear decision boundary. Furthermore, with the use of kernel functions [42], SVM can map data into higher-dimensional spaces to address nonlinear problems present in the original features. The advantages of SVM include its ability to handle high-dimensional data, robust performance against noise and outliers, and strong generalization capabilities [43,44,45].
2.5.3. K-Nearest Neighbors Algorithm
The K-Nearest Neighbors (KNN) algorithm is one of the traditional machine learning classification algorithms. It predicts the class of a new data point by measuring its distance from each point in the training set [46]. The core idea of KNN is based on the principle of “similarity”, meaning that similar points are close to each other in feature space. The algorithm identifies the K closest neighbors to the new data point and assigns its class based on the majority class among these neighbors. KNN is theoretically well established, widely used in fields such as data analysis, easy to understand, and relatively simple to implement in practice.
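The majority-vote rule described above can be written directly. This is a minimal sketch rather than the implementation used in the study: it assumes Euclidean distance on unscaled features, and the feature values (elevation in m, slope in degrees) and class labels are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]


def knn_predict(train: List[Sample], query: Sequence[float], k: int = 3) -> str:
    """Classify `query` by majority vote among its k nearest training points.

    Assumption: Euclidean distance on the raw feature values.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _features, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training samples: (elevation, slope) -> site quality class.
train = [
    ((200.0, 10.0), "good"), ((220.0, 12.0), "good"), ((250.0, 15.0), "good"),
    ((900.0, 30.0), "poor"), ((950.0, 35.0), "poor"), ((1000.0, 28.0), "poor"),
]
pred = knn_predict(train, (240.0, 14.0), k=3)
print(pred)
```

Because the distance is dominated by the feature with the largest numeric range (elevation in this example), features are normally standardized before KNN is applied.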
2.5.4. XGBoost Algorithm
XGBoost is an efficient machine learning algorithm [47] and an optimized version of Gradient Boosting Decision Trees (GBDTs). It iteratively adds new trees to improve the model while incorporating a regularization term in the loss function to prevent overfitting. XGBoost supports parallel processing, enabling it to leverage multi-core processors to accelerate training, and it can automatically handle missing values, making it highly adaptable [48]. The algorithm performs exceptionally well on large-scale datasets and is widely used for various machine learning tasks, including classification, regression, and ranking [49].
2.6. Model Evaluation and Testing
2.6.1. Indicators for the Evaluation of the Site Form Model
Model simulation and validation are critical steps in the modeling process, focusing on both the significance of the model and its statistical fit. In this study, commonly used regression evaluation metrics, including the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE), were selected as primary indicators for assessing model performance. Specifically, R² and RMSE were calculated using the fitting sample data, with R² values closer to 1 and lower RMSE values indicating better model fit. MAE, on the other hand, was evaluated using independent test samples, with smaller MAE values reflecting better predictive accuracy. To further assess the stability of the guiding curve equations, the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) were used as secondary evaluation metrics. AIC and BIC balance model complexity with goodness of fit, where lower values indicate a model that effectively explains the data while avoiding overfitting. Together, these metrics ensure that the selected model achieves both high predictive accuracy and robustness. The formulas for these metrics are as follows:
where R² is the coefficient of determination; RMSE is the root mean square error; MAE is the mean absolute error; n is the number of samples; ȳ, yᵢ, and ŷᵢ are the mean of the measured mean heights, the measured mean height, and the model-predicted value of each sample point, respectively; k is the number of estimated parameters in the model; and L is the maximum likelihood of the model.
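The five metrics can be computed as in the sketch below, which uses their standard textbook definitions and may differ in detail from the paper's Equations (6)–(10). It assumes that RMSE divides by n and that the likelihood in AIC and BIC is Gaussian with the error variance estimated as SSE/n. The function name and the sample values are hypothetical.

```python
import math
from typing import Dict, Sequence


def fit_metrics(observed: Sequence[float], predicted: Sequence[float],
                k: int) -> Dict[str, float]:
    """R2, RMSE, MAE, AIC and BIC under the assumptions stated above."""
    n = len(observed)
    y_bar = sum(observed) / n
    sse = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    sst = sum((y - y_bar) ** 2 for y in observed)
    if sse <= 0 or sst <= 0:
        raise ValueError("need non-constant observations and a non-perfect fit")
    # Gaussian log-likelihood evaluated at the ML variance estimate sse / n
    log_l = -0.5 * n * (math.log(2 * math.pi * sse / n) + 1)
    return {
        "r2": 1 - sse / sst,
        "rmse": math.sqrt(sse / n),
        "mae": sum(abs(y - p) for y, p in zip(observed, predicted)) / n,
        "aic": 2 * k - 2 * log_l,
        "bic": k * math.log(n) - 2 * log_l,
    }


# Hypothetical measured and predicted mean heights (m), model with k = 2.
metrics = fit_metrics([10.0, 12.0, 14.0, 16.0], [11.0, 12.0, 13.0, 16.0], k=2)
print(metrics)
```

AIC and BIC differ only in the penalty term (2k versus k·ln n), so BIC penalizes additional parameters more heavily once n exceeds about 7.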
2.6.2. Machine Learning Classification Model Evaluation Metrics
In this study, commonly used evaluation metrics for classifiers, including accuracy, precision, recall, and F1-score, were applied. Accuracy represents the ratio of correctly classified samples to the total number of samples. Precision measures the ratio of correctly predicted positive samples to the total predicted positive samples. Recall measures the ratio of correctly predicted positive samples to the total actual positive samples. F1-score is the harmonic mean of recall and precision, with a value ranging from 0 to 1; values closer to 1 indicate better model performance, while values closer to 0 indicate poorer performance. The expressions for these metrics are shown in Equations (11)–(14).
where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. The first letter indicates whether the predicted class matches the actual class (T, match; F, mismatch), and the second letter indicates the class predicted by the classifier (P, positive; N, negative).
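As an illustration, the metrics can be computed from the four confusion-matrix counts using their standard definitions. The counts below are hypothetical, and for a multi-class problem such as site quality grading the sketch assumes the counts are taken per class in a one-vs-rest fashion.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for one site quality class treated as "positive".
scores = classification_metrics(tp=40, tn=45, fp=10, fn=5)
print(scores)
```

With these counts, precision (0.80) is lower than recall (about 0.89), and the F1-score falls between the two.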
4. Discussion
This study utilized the TWINSPAN classification method based on forest stand species composition. Sun et al. [50] applied TWINSPAN to analyze the composition of Sassafras communities in Zhejiang Province, classifying 96 sample plots into eight types, each with unique species compositions and environmental characteristics. Similarly, Rahman et al. [51] used TWINSPAN to quantify and classify vegetation into five community types in the Sultan Khel Valley of the Hindu Kush Mountains. Novák et al. [52] applied TWINSPAN hierarchical clustering to a dataset of 15,817 vegetation plots, identifying clusters with similar species composition and ecological–geographical characteristics. In this study, TWINSPAN effectively distinguished 15 types of mixed forests, which were then classified by site quality. Differentiating site quality among various stand types enables a more precise assessment of growth performance under different site conditions, providing insights into each stand type’s adaptability and productivity potential across varying environmental conditions. This approach is crucial for forest planning and resource allocation.
For constructing the site form model, site form represents the relationship between dominant height and DBH, requiring a reference DBH that reflects stable height growth and sensitively captures differences in site quality. The choice of reference DBH significantly impacts the site form model. This study adopted a method that uses the most frequently occurring diameter at breast height (DBH) value within this study’s dataset as the reference DBH. The commonly occurring DBH value accurately represents the growth status of the majority of trees, enhancing the model’s representativeness and applicability. Nevertheless, this method has certain limitations. Using the most frequently observed DBH may overlook the variability within DBH distributions, particularly in cases where the DBH distribution is highly dispersed or multimodal. In such instances, the mode may not accurately reflect the overall growth characteristics of the stand. Additionally, in stands with abnormal DBH distributions or those influenced by external factors, relying on the most frequently observed DBH could underestimate the stand’s productive potential. Apart from the method used in this study, other approaches for selecting reference DBH have been proposed in previous research. These include using half the average DBH typically achieved in the growth history of dominant trees, identifying the DBH at the inflection point of the height–DBH curve (where the second derivative equals zero, indicating a change in growth trend), or deriving the DBH corresponding to a reference age from a DBH–age model. In mixed forests, determining stand age is challenging, while DBH data are readily available, making the application of site form in mixed forest site quality assessment feasible. Identifying the most suitable reference DBH to represent the growth patterns of mixed forest stands remains an important direction for future research. 
Additionally, obtaining dominant height is essential for site form model calculations. In this study, mean height was used as a substitute for dominant height. However, the diversity of species, natural thinning, and human activities in mixed forests can affect mean height, potentially underestimating site productivity. The use of mean height instead of dominant height is a limitation, and further data collection on dominant height would enhance model accuracy.
In forestry, machine learning algorithms have increasingly become powerful tools for handling high-dimensional, complex nonlinear data [53,54]. Our results indicate that the XGBoost model is an effective method for site quality classification. Using site and climate factors as features, we constructed site quality classification models with two traditional classification algorithms and two ensemble learning algorithms, comparing their performance. All four models achieved overall accuracies above 0.77, with XGBoost consistently performing well across most groups. As a tree-based ensemble learning method, XGBoost improves the accuracy of weak classifiers through weighted boosting and excels on large datasets [55]. While XGBoost demonstrates strong predictive performance, its ecological applications require careful consideration of model complexity and practical requirements. XGBoost has been successfully applied in various forestry tasks, demonstrating its robustness and versatility. For instance, it has been used in tree species identification with hyperspectral data, achieving significantly higher accuracy compared to traditional methods [56]. In forest fire prediction, XGBoost integrated climatic and environmental data to accurately identify high-risk areas, outperforming conventional statistical models [57]. Similarly, in carbon storage estimation, XGBoost effectively modeled the complex relationships between site quality, species composition, and climatic conditions, providing precise predictions for sustainable forest management [58]. These examples illustrate its potential for addressing diverse and complex challenges in forestry.
Despite its powerful predictive capabilities, XGBoost presents some drawbacks in practical applications, particularly its sensitivity to hyperparameters and potential overfitting issues. The model requires careful tuning of hyperparameters, such as learning rate, tree depth, and the proportion of sample subsets. Improper selection of these parameters can lead to model instability or overfitting. In certain situations, especially with small datasets or complex features, XGBoost may overfit, meaning the model performs well on training data but poorly on unseen data, limiting its generalization ability in real-world applications. Nevertheless, machine learning methods like XGBoost can offer valuable tools for site quality assessment, particularly in areas lacking vegetation cover. By integrating site and climatic factors, these models can predict the potential site quality of non-forested areas with high accuracy. Such predictions provide a scientific foundation for tree species selection and afforestation planning, making them especially significant for future forest development and ecological restoration. This approach enables a more targeted and effective restoration process, aligning with sustainable forestry practices and improving the ecological resilience of mixed forests. While the model proposed here is specific to Lishui City, Zhejiang Province, it has broad potential for application in other regions and non-forested areas. In the future, classification models can be developed for different regions with varying environmental conditions and forest types, further enhancing the versatility and applicability of the method. This adaptability makes the approach valuable for large-scale ecological restoration and sustainable land management practices.