Predicting Individual Tree Mortality of Larix gmelinii var. Principis-rupprechtii in Temperate Forests Using Machine Learning Methods

Abstract: Accurate prediction of individual tree mortality is essential for informed decision making in forestry. In this study, we propose machine learning models to forecast individual tree mortality in temperate Larix gmelinii var. principis-rupprechtii forests in northern China. Eight machine learning techniques (random forest, logistic regression, artificial neural network, generalized additive model, support vector machine, gradient boosting machine, k-nearest neighbors, and naive Bayes) were employed to construct an ensemble learning model based on a comprehensive dataset from this ecosystem. The random forest model was the most accurate of those tested, with 92.9% accuracy and 92.8% sensitivity. We identified the key variables affecting tree mortality: the basal area of trees larger than the target tree (BAL), diameter at breast height (DBH, measured at 130 cm), basal area (BA), elevation, slope, NH4-N, soil moisture, crown density, and available soil phosphorus are important variables in the Larix gmelinii var. principis-rupprechtii individual tree mortality model. The variable importance calculation showed that BAL is the most important variable, with an importance value of 1.0 in the random forest model. By analyzing the complex relationships among individual tree, stand, environmental, and soil factors, our model aids decision making for the conservation of temperate Larix gmelinii var. principis-rupprechtii forests.


Introduction
Forests, which cover approximately 31% of the world's terrestrial ecosystems [1] and constitute about 80% of the global vegetation mass, are among the most important ecosystems on Earth. Forests serve multiple vital functions, such as timber production, hydrological regulation, soil conservation, climate change mitigation, and air quality regulation [2,3].
Accurate assessment and monitoring of forest dynamics are of paramount importance. Currently, dynamic monitoring of forests mainly covers forest stand dynamics, forest climate, and forest fire prevention, among which forest stand dynamics is a key link. Estimates of forest stock volume, biomass, and carbon storage are largely based on forest dynamics, such as tree growth and tree mortality, and on human influences, such as thinning [4]. Integrating tree mortality into the study of stand dynamics is vital, as mortality is a fundamental process within forest dynamics [5]. Additionally, tree mortality, productivity, and biodiversity play crucial roles in shaping forest ecosystem dynamics and, consequently, influence forest carbon sequestration [6,7]. Tree mortality is a crucial ecological process in forest development, as dead and decaying trees play vital roles in maintaining a healthy forest ecosystem [8]. Tree mortality encompasses the entire process from the initial decline in vitality to the eventual death of a tree, influenced by both its intrinsic ecological characteristics and external conditions. Forest mortality drives changes in species composition and stand density [9,10] and plays a significant role in the coexistence of different communities [11]. Elevated tree mortality can significantly affect ecosystem structure and function, and with them the services that forests provide to people [12]. Even minor changes in mortality rates can have profound effects on tree lifespan, biodiversity, and the cycling of carbon and nutrients. In fact, tree mortality rates are key drivers of forest community change, leading to notable alterations in composition and structure [13].
Moreover, an increase in mortality rates reduces the residence time of carbon in both forests and soil [14,15] and may affect the carbon storage potential of forests [16]. Consequently, mortality research can enhance the understanding of mortality causes [13], contribute to a deeper comprehension of the succession and diversity dynamics of future forest communities [17], facilitate precise evaluation and estimation of forest carbon storage [18], support sustainable forest resource management, and enable accurate monitoring of forest carbon sinks [19].
Predicting tree mortality is a binary classification task in which the outcome is coded as 0 or 1. Therefore, most individual tree mortality models have been developed using logistic regression [20]. Some researchers have used generalized mixed-effect models [21,22]. Other modeling methods, such as classification and regression trees [23], non-parametric Bayesian estimation [24], compound Poisson models [25], semi-parametric regression [26], multilevel logistic regression [27], and Cox proportional hazard models [28], have also been attempted in individual tree mortality research.
Vanclay (1994) [29] classified tree mortality into two categories: natural and non-natural mortality. Natural mortality occurs during the developmental stages of trees, arising from variations in maturity among tree species and differences in individual genetic factors. This leads to varying competitive abilities for nutrients, water, and sunlight among different tree species and between larger and smaller trees. Consequently, trees in a weaker competitive position gradually die off. Non-natural mortality refers to tree mortality caused by improper afforestation techniques or external disturbances such as fires, droughts, flash floods, windstorms, and snow disasters [30]. In our study, we focus only on natural mortality.
In recent tree mortality modeling research, the relationships between soil characteristics, topography, and tree mortality have often been neglected [31]. Soil characteristics (e.g., moisture content, pH, texture, nutrients, and their availability) also affect plant growth and death. Studies have shown that tree mortality rates in China's forest-grassland ecotone are significantly influenced by soil properties, topography, and tree size [32,33]. Furthermore, some research has found a strong correlation between soil moisture content and tree mortality [23]. Existing tree mortality modeling has mainly focused on predictor variables related to tree size, such as diameter at breast height or tree height [8,34]; growth-related variables, such as DBH increment, annual ring width, or basal area increment [24]; crown-related variables, such as leaf area index and crown shedding [35,36]; ratios of crown-related variables to growth-related variables [37]; competition variables, divided into distance-related and distance-independent competition [38,39]; climate variables [40]; and site quality [35].
Tree mortality modeling studies of Larix gmelinii var. principis-rupprechtii have not yet explored the impact of soil nutrients on tree mortality. Soil, as a key habitat factor for tree regeneration and survival, possesses numerous physical and chemical properties. Various soil factors are interconnected, and they exhibit significant scale effects, even showing noticeable spatial variation at small scales [41]. We consider the main soil factors affecting tree mortality to include total soil moisture, pH value, soil carbon (organic C), nitrate nitrogen (NO3-N), ammonium nitrogen (NH4-N), available potassium (available K), and available phosphorus (available P). Carbon, nitrogen, potassium, and phosphorus are closely related to plant growth, thereby affecting plant regeneration and survival [42,43].
The prediction of tree mortality is a complex task because many factors can influence a tree's health and survival. Traditional statistical models often struggle with this complexity, as they are limited in their ability to handle non-linear relationships and interactions between variables. Machine learning models, on the other hand, excel in these situations. They can learn from the data, identifying complex patterns and relationships that can improve prediction accuracy.
In recent years, machine learning has emerged as a powerful tool in various fields, including forestry. Machine learning algorithms can learn from data and improve their performance with experience, which makes them particularly useful for tasks where explicit programming is difficult [44]. In forestry, machine learning can be used to predict tree mortality, growth, and other key forest dynamics. These predictions can be based on a variety of factors, including climate, soil nutrients [45], and other individual or stand-level variables. Machine learning models, such as logistic regression [46], support vector machines [47], random forests [48], gradient boosting [49], and naive Bayes [50], have been successfully applied in this field. These models can handle complex interactions and non-linear relationships between variables, making them more flexible and often more accurate than traditional statistical models.
To our knowledge, no tree mortality modeling study has compared different machine learning models. In this study, we applied several machine learning models, including logistic regression, support vector machines, random forests, gradient boosting, and naive Bayes, to predict tree mortality based on a variety of environmental factors. Our main aim is to develop a model to predict tree mortality, which is essentially a binary classification problem: the model assigns each tree to one of two classes, alive (0) or dead (1). Given the binary nature of this problem, machine learning classification techniques are well suited to the task. The specific objectives of this study are to (i) establish a prediction model of individual tree mortality using machine learning methods; (ii) compare eight machine learning models and identify the most suitable prediction model for individual tree mortality in larch forests; and (iii) analyze the effects of different factors, determine which ones strongly influence individual tree mortality, and provide a scientific foundation for the sustainable development of larch forests.

Study Area
Data were collected from 49 permanent sample plots (PSPs) located in natural stands of Prince Rupprecht larch in the state-owned Boqiang forest in northern Shanxi, northern China. Western and northern Shanxi are the principal regions where this species is found in China. Each PSP is square (20 m × 20 m), covering an area of 0.04 hectares, and was established in 2015, nested within a total of eight different blocks. The 49 PSPs in northern Shanxi were allocated across four blocks. The sampling design provided representative information concerning various stand structures, tree heights, ages, site productivity, and density. Because soil nutrients were regarded as an important variable in this study, our analysis was based on 20 sample plots with a total of 1301 trees (Figure 1), which were allocated across two blocks with detailed soil nutrient data. Within each sample plot, five 1 m² subplots were evenly set along the diagonal, and one soil sample was taken from each. The soil samples were collected for analyses of important physical and chemical indicators.

Data Collection
All 1222 standing, living trees with a diameter at breast height (DBH) equal to or exceeding 5 cm underwent comprehensive measurements, encompassing total tree height (H), height to live crown base (HCB), and four crown radii. The DBHs of the 79 dead trees were also measured. The distribution of DBH by mortality status is available in the supplementary materials (Figure S1). The positions of the four crown radii for each tree were established using two azimuths. Crown width was subsequently computed as the half sum of the four measured crown radii. In accordance with the methodology outlined in reference [51], the four trees with the largest DBH were identified as dominant trees in each plot. To ascertain the age of the selected trees, growth rings were counted on increment cores extracted from the stems at a point 0.1 m above the ground, following the procedure detailed in reference [52]. Dead trees were assigned a code value of 1, while live trees were assigned a code value of 0.
For each PSP, the dominant diameter, dominant tree height (DH), and the age of the dominant tree (DA) were obtained from the averages of these attributes [53]. Within each PSP, five 1 m² subplots were evenly set along the diagonal, and one soil sample was taken from each. The soil samples were analyzed for the following characteristics: soil moisture, soil thickness, pH value, nitrate nitrogen (NO3-N), ammonium nitrogen (NH4-N), available potassium (available K), available phosphorus (available P), and total carbon content (TC). Other data were also measured for each PSP, including canopy density (CD), elevation, slope degree, and slope aspect. Three subplots (1 m × 1 m) were set up within each PSP, and grass species, numbers, mean height, and coverage rate were measured and recorded to characterize the biodiversity of the plot. Summary statistics of the individual tree characteristics and relevant stand characteristics are presented in Table 1.
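The tree-level predictors derived from these measurements can be illustrated with a minimal Python sketch. It assumes the standard forestry definition of BAL (the summed basal area of all trees in the plot with a larger DBH than the target tree, expressed per hectare); the exact computation used in this study is not stated, and the function names and DBH values are hypothetical.

```python
import math

PLOT_AREA_HA = 0.04  # 20 m x 20 m plot, as described in the text


def basal_area_m2(dbh_cm):
    """Cross-sectional stem area (m^2) for a diameter at breast height in cm."""
    return math.pi * (dbh_cm / 200.0) ** 2  # radius in metres = dbh_cm / 200


def bal_per_hectare(dbh_list_cm, plot_area_ha=PLOT_AREA_HA):
    """For each tree, sum the basal area of all strictly larger trees (m^2/ha).

    Assumption: the usual distance-independent BAL definition.
    """
    areas = [basal_area_m2(d) for d in dbh_list_cm]
    result = []
    for d in dbh_list_cm:
        larger = sum(a for other, a in zip(dbh_list_cm, areas) if other > d)
        result.append(larger / plot_area_ha)
    return result


def crown_width(radii_m):
    """Crown width as the half sum of the four measured crown radii (m)."""
    return sum(radii_m) / 2.0


plot_dbh = [8.0, 12.5, 20.0, 31.0]  # hypothetical DBH values (cm)
bal = bal_per_hectare(plot_dbh)
```

In this sketch the largest tree in a plot has a BAL of zero, and BAL increases as a tree's rank within the plot decreases, which is why BAL is commonly read as a measure of one-sided competition.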

Mortality Data Pre-Processing
In our research, the forest stand dataset is imbalanced: data for the dead tree class (class 1) are scarce because mortality is naturally rare. To address this issue, we employed an oversampling technique, the synthetic minority oversampling technique (SMOTE) [54]. Random oversampling directly duplicates minority class samples, so the training set contains many duplicates, which can easily lead to overfitting. The basic idea of SMOTE is instead, for each minority class sample, to randomly select one of its nearest minority class neighbors and then randomly select a point on the line segment connecting the two as a newly synthesized minority class sample. SMOTE enhances the ability of our machine learning models to capture the distinct features of the less frequent class, ultimately improving their predictive accuracy. Through oversampling, we intend to counteract the bias towards the majority class, resulting in more reliable and generalizable outcomes for a study conducted in a natural setting. We used the "smotefamily" package in R 4.3.1 [55] for data pre-processing. The dataset was partitioned into two subsets: 70% was allocated for training the models, and the remaining 30% was reserved for testing.
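The interpolation step at the core of SMOTE can be sketched in a few lines of Python. This is a simplified illustration of the idea described above, not the "smotefamily" implementation; the function name, the parameters k and seed, and the example data are hypothetical.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for i in range(n_new):
        base = minority[i % len(minority)]  # cycle through minority samples
        others = [p for p in minority if p is not base]
        # k nearest minority-class neighbours of the base sample
        neighbours = sorted(others, key=lambda p: euclidean(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment base -> mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Hypothetical dead-tree records: [DBH in cm, height in m]
dead_trees = [[6.1, 4.0], [7.3, 5.2], [5.5, 3.8], [8.0, 6.1]]
new_points = smote_sketch(dead_trees, n_new=10, k=2)
```

Because each synthetic point lies on a segment between two observed minority samples, the new records stay within the range of the observed data rather than repeating existing rows.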

Model Selection
We employed eight machine learning models to analyze and predict tree mortality: random forest (RF), logistic regression (LR), artificial neural network (ANN), generalized additive model (GAM), support vector machine (SVM), gradient boosting machine (GBM), k-nearest neighbors (KNN), and naive Bayes (NB). Each model was selected based on its ability to handle the complexity of the data and its relevance to the problem at hand. Together, the eight algorithms offer a well-rounded portfolio of benefits. They span a wide spectrum of approaches, from linear models like logistic regression to non-linear ensemble methods like random forest [56] and gradient boosting machine, allowing diverse relationships within the data to be found. Most are computationally efficient on large datasets, although some, like k-nearest neighbors [50,57], may require more computational resources. The list strikes a balance between algorithms that are easily interpretable, such as logistic regression [46,58] and naive Bayes [50], and those that prioritize predictive power at the expense of clarity, such as artificial neural networks [59][60][61]. Several of these algorithms, particularly the ensemble methods random forest and gradient boosting machine, are robust to outliers and irrelevant features, making them well suited to complex, real-world datasets. They are also relatively easy to use and tune, thanks to their implementation in various software packages. Employing a range of algorithms facilitates robust benchmarking and validation, helping to discern whether good performance is due to an algorithm's fit to the problem or is merely an artifact of overfitting. Additionally, these algorithms are commonly employed in both academic and industrial settings for binary classification, providing a level of familiarity and trust. Lastly, several algorithms in the list offer built-in feature importance evaluation, which is crucial for understanding the impact of environmental factors on tree mortality.

Random Forest
Random forest is an ensemble learning approach that generates a multitude of decision trees during training and outputs the class that is the mode of the classes predicted by the individual trees. This approach addresses the tendency of decision trees to overfit their training dataset [62].
The basic principle of random forest is to generate a set of independent decision trees that are trained on different subsets of the original dataset. Each tree within the random forest provides a classification and is said to "vote" for a specific class. The collective decision of the random forest is the classification with the highest number of votes across all trees in the ensemble. Parameters governing the random forest model, such as the number of trees (n_estimators) and the maximum depth of the trees (max_depth), are commonly optimized using cross-validation. Another important tunable parameter is the number of features considered when searching for the optimal split (max_features).
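The two ingredients described above, training subsets drawn from the original data and majority voting, can be sketched as follows. This is an illustrative Python fragment with hypothetical function names; it omits the tree-fitting step and is not the "randomForest" implementation.

```python
import random
from collections import Counter


def bootstrap_sample(rows, seed=0):
    """Draw len(rows) rows with replacement, as used to train one tree."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in rows]


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from five trees for one target tree (1 = dead, 0 = alive)
votes = [0, 1, 1, 0, 1]
forest_prediction = majority_vote(votes)

# Each tree would be trained on its own bootstrap sample of the training rows
rows = list(range(10))
sample_for_tree_1 = bootstrap_sample(rows, seed=1)
```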

Logistic Regression
Logistic regression is a statistical model used to predict the likelihood of an event by fitting data to a logistic function. It is a generalized linear model applied in the context of binomial regression [63]. Given a set of predictor variables, the model allows us to estimate the probability of the binary response variable, which in our case is tree mortality.
The coefficients are determined by maximum likelihood estimation (MLE), a statistical technique for estimating the parameters of a model from observed data. The resulting estimates are the parameter values that maximize the likelihood function for the observed data.
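As a minimal sketch of these two ideas, the Python fragment below evaluates the logistic function for a linear predictor and the log-likelihood that MLE maximizes. The coefficients and data are hypothetical and chosen only for illustration; no fitting routine is included.

```python
import math


def mortality_probability(coefficients, intercept, features):
    """Logistic function applied to the linear predictor."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(coefficients, intercept, xs, ys):
    """Bernoulli log-likelihood of the observed 0/1 outcomes."""
    total = 0.0
    for features, y in zip(xs, ys):
        p = mortality_probability(coefficients, intercept, features)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Hypothetical single-predictor data: outcome 1 becomes likelier as x grows
xs = [[1.0], [2.0], [3.0], [4.0]]
ys = [0, 0, 1, 1]

ll_flat = log_likelihood([0.0], 0.0, xs, ys)     # uninformative model
ll_better = log_likelihood([2.0], -5.0, xs, ys)  # slope matches the trend
```

MLE searches for the coefficients with the highest log-likelihood; in this toy case the second parameter set fits the outcomes better than the flat model.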

Support Vector Machines
Support vector machines (SVM) are a collection of supervised learning techniques used for both classification and regression. SVM is notably effective on intricate datasets of small or intermediate size [64]. The fundamental idea of SVM is to construct a hyperplane that serves as the decision boundary, with the specific aim of maximizing the margin separating positive and negative instances. In two dimensions, this hyperplane is a line that partitions the plane into two regions, with each class on opposite sides.
The parameters of the SVM are estimated using quadratic programming. The objective of the quadratic programming problem is to minimize the norm of the weight vector subject to constraints that ensure the samples are correctly classified. The kernel function maps the input data into a higher-dimensional space, facilitating the identification of a hyperplane that effectively separates the data. Popular choices for the kernel function include linear, polynomial, and radial basis function kernels.
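To make the role of the kernel concrete, the following Python sketch evaluates a linear and a radial basis function kernel and uses them in the standard kernelized decision function. The support vectors, dual coefficients, bias, and gamma are hypothetical values, not fitted parameters from this study.

```python
import math


def linear_kernel(x, y):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * squared distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def decision_value(support_vectors, dual_coefs, bias, x, kernel):
    """Signed score; its sign gives the predicted class.

    dual_coefs holds alpha_i * y_i for each support vector (assumed given).
    """
    score = sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs))
    return score + bias


# Hypothetical model with one support vector per class
svs = [[0.0, 0.0], [2.0, 2.0]]
coefs = [-1.0, 1.0]

score_near_positive = decision_value(svs, coefs, 0.0, [2.0, 2.0], rbf_kernel)
score_near_negative = decision_value(svs, coefs, 0.0, [0.0, 0.0], rbf_kernel)
```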

Generalized Additive Models
Generalized additive models (GAM) are a category of statistical models that permit non-linear relationships between predictors and the response variable. Extending the generalized linear model (GLM), GAM replaces the linear predictor with a sum of smooth functions of the predictors [65]. GAM allows flexible modeling of complex ecological relationships and can handle non-linear and non-monotonic relationships between predictors and responses, making it a suitable choice for our study of tree mortality.

K-Nearest Neighbors
K-nearest neighbors (KNN) is an instance-based learning algorithm applicable to both classification and regression. The essence of KNN lies in identifying a predetermined number of training samples closest to a new data point and then predicting its label from these nearest neighbors [66].
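The neighbor search and voting described for KNN can be sketched directly. The Python fragment below is an illustrative implementation with hypothetical data and function names; it is not the "knn" function from the R "class" package.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the majority label among the k nearest training samples."""
    def squared_distance(i):
        return sum((a - b) ** 2 for a, b in zip(train_x[i], query))

    order = sorted(range(len(train_x)), key=squared_distance)
    nearest_labels = [train_y[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: [DBH in cm, height in m], label 1 = dead
train_x = [[5, 3], [6, 4], [7, 5], [20, 15], [22, 16], [25, 18]]
train_y = [1, 1, 1, 0, 0, 0]

small_tree = knn_predict(train_x, train_y, [6.5, 4.5], k=3)
large_tree = knn_predict(train_x, train_y, [21, 15], k=3)
```

An odd k avoids tied votes in a two-class problem. In practice the predictors would usually be standardized first, since distances are otherwise dominated by the variables with the largest numeric range.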

Naive Bayes
Naive Bayes is a classification method grounded in Bayes' theorem that operates under the assumption of predictor independence. Put succinctly, a naive Bayes classifier assumes that the occurrence of a specific feature within a class is unrelated to the occurrence of any other feature. This assumption is termed class-conditional independence [67].

Gradient Boosting Machine
The gradient boosting machine (GBM) is a powerful ensemble technique that combines the predictive capabilities of multiple weak learners, typically decision trees, to create a stronger predictive model. By repeatedly refining predictions and addressing the errors of previous models, GBM progressively improves accuracy [68]. This approach is adept at capturing intricate data relationships and handling diverse features.
The GBM's decision function aggregates the predictions of the individual decision trees. In classification, it sums weighted class probabilities to generate the final prediction. In regression, it combines the individual tree predictions to yield the final regression prediction.

Artificial Neural Networks
Artificial neural networks (ANNs) are a class of computational models inspired by the neural networks of the human brain. These networks consist of interconnected processing units, or "neurons", that work together to process and learn from data. ANNs are known for their ability to solve complex problems, especially those involving pattern recognition, classification, regression, and unstructured data such as images and text [69].
In our study, all models were fitted in R: RF using the "randomForest" package, LR through the "glm" function, SVM via the "svm" function of the "e1071" package, GAM through the "gam" function of the "mgcv" package, KNN using the "knn" function of the "class" package, NB using the "naiveBayes" function, GBM using the "gbm" function of the "gbm" package, and ANN via the "nnet" package. These models used both individual-level and stand-level factors as predictor variables and single-tree mortality as the response variable. To ensure a robust evaluation of model performance, we implemented 10-fold cross-validation using the "trainControl()" function in R, specifying the "cv" method. This cross-validation approach mitigates the risk of overfitting and provides a more accurate estimate of each model's generalization capability.
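The partitioning behind k-fold cross-validation can be sketched as follows. This Python fragment only illustrates how indices are split into folds; it is not the "trainControl()" implementation, and the function name and seed are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Return k (train_indices, test_indices) pairs covering all samples."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    splits = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for n, fold in enumerate(folds) if n != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


splits = k_fold_indices(25, k=10)
```

Each sample is held out exactly once, so every model is scored on data it did not see during the corresponding fit, and the k scores are then averaged.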

Model Validation
In the evaluation phase of the study, predictions were made using the optimized models on the reserved 30% test dataset. This subset, which was not used during training, allowed for an unbiased assessment of the models' predictive performance. The hyperparameters were tuned so that the models were well fitted to the underlying patterns in the training data. The evaluation was further conducted using statistical metrics derived from the confusion matrix, which summarize the models' true-positive, false-positive, true-negative, and false-negative counts. Combining evaluation on the test dataset with analysis of the confusion matrix provides a rigorous measure of the models' generalization capability, reflecting their potential effectiveness in predicting tree mortality in unseen data.

Feature Importance
Understanding the importance of different features in a model can provide valuable insight into the relationships between the predictors and the response variable. Feature importance quantifies the influence of each feature on the model's predictions. Generally, features with high importance play a pivotal role in the predictions, while features with lower importance have a relatively minor impact on the predictive outcomes. Importance values were calculated for each model using the varImp() function in R.

Model Evaluation
In this study, we used several metrics derived from the confusion matrix to evaluate the performance of our models. Each metric is briefly introduced below together with its formula, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
(1) Accuracy: the proportion of correctly predicted samples among all samples. It gauges the overall correctness of the model's classifications:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
(2) Sensitivity: also referred to as recall or the true-positive rate, sensitivity is the proportion of correctly predicted positive samples among all actual positive samples. It reflects the model's capacity to correctly identify instances of the positive class:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{2}$$
(3) Specificity: the proportion of correctly predicted negative samples among all actual negative samples. It reflects the model's capacity to identify negative class samples:
$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{3}$$
(4) Cohen's Kappa: a statistic that quantifies the agreement between predicted and actual results while accounting for the agreement expected by chance:
$$\kappa = \frac{p_0 - p_e}{1 - p_e} \tag{4}$$
Here, $p_0$ represents the observed agreement proportion, and $p_e$ is the expected agreement proportion.
(5) Precision: the ratio of correctly predicted positive samples to all samples predicted as positive. It assesses the accuracy of the model's positive class predictions:
$$\mathrm{Precision} = \frac{TP}{FP + TP} \tag{5}$$
(6) F1 score: the harmonic mean of precision and recall, offering a balanced assessment of the model's accuracy and coverage:
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}} \tag{6}$$
(7) Area under the ROC curve (AUC-ROC): the ROC curve plots the true-positive rate against the false-positive rate and illustrates the trade-off between sensitivity and specificity. The AUC-ROC indicates how effectively a model discriminates between two groups, here dead and live trees. A higher AUC value corresponds to a superior ability of the model to distinguish between trees that died and those that survived.
These metrics were calculated for each model using the 'pROC' and 'caret' packages in R. The models were then compared based on these metrics to determine the best performing model.
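As a worked illustration of the confusion-matrix metrics, the following Python sketch computes them from predicted and actual labels. It uses the standard definitions; the function names and example labels are hypothetical, and the sketch is not the 'caret' implementation.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, precision, F1, and Cohen's Kappa."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Expected chance agreement from the row and column totals
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - p_e) / (1 - p_e)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical labels for ten trees (1 = dead, 0 = alive)
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(*confusion_counts(actual, predicted))
```

For these labels the counts are TP = 3, FP = 1, TN = 5, and FN = 1, giving an accuracy of 0.8 but a Kappa of about 0.58, which shows how Kappa discounts the agreement that would be expected by chance.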

Model Fitting Accuracy
Using the SMOTE method, the dead tree class was expanded to 1185 records. The modeling work was carried out on the resulting dataset of 1222 living trees and 1185 dead trees. The distribution of DBH by mortality status after oversampling is available in the supplementary materials (Figure S2). We evaluated the eight machine learning models to assess how well each fitted the training dataset. The detailed results for each model are shown in Table 2. The RF model exhibits exceptional performance, with near-perfect precision (99.76%) and very high accuracy (97.93%), sensitivity (96.23%), and F1 score (97.96%), underscoring its predictive capability and reliability in classifying tree mortality. Its high Kappa value (0.9585) further indicates substantial agreement beyond chance, making it a robust choice for complex ecological predictions.

Model Prediction Accuracy Evaluation on Test Dataset
The performance of the eight machine learning models was further validated on the test dataset. The evaluation metrics for each model (accuracy, sensitivity, specificity, Kappa, precision, and F1 score) are shown in Table 3. The random forest (RF) model performs best, with the highest accuracy (0.9291) and a Kappa statistic of 0.8580. It also achieves high scores in both sensitivity (0.9277) and specificity (0.9303). The naive Bayes (NB) model performs comparably to random forest, with accuracy and Kappa statistics reaching the same levels (0.9291 and 0.8580, respectively). The other models, namely logistic regression (LR), artificial neural network (ANN), generalized additive model (GAM), support vector machine (SVM), gradient boosting machine (GBM), and k-nearest neighbors (KNN), also perform well, albeit with slight variations in certain metrics.


Variable Importance
In this study, we employed eight distinct machine learning models (ANN, GAM, LR, RF, GBM, KNN, NB, and SVM) to predict individual tree mortality. The variable importance results are shown in Figure 3. The random forest model prioritized BAL, DBH, BA, elevation, slope, NH4-N, moisture, CD, and available P. The consistent emphasis on BAL and DBH across most models, coupled with the varied importance of factors such as elevation, slope, and the soil nutrients NH4-N and available P, demonstrates the interplay of physical and environmental variables in tree ecology. Across the eight models, BAL, DBH, and BA were of high importance in most models, whereas crown density, elevation, slope, available P, and NH4-N were highly important only in certain models.
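The text does not specify how variable importance was calculated. One model-agnostic option that can be applied to all eight model types is permutation importance, rescaled so that the top variable equals 1.0. The Python sketch below illustrates that idea under those assumptions; the function names, the toy rule, and the data are hypothetical.

```python
import random


def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Drop in accuracy when one feature column is shuffled, per feature."""
    rng = random.Random(seed)

    def accuracy(data):
        hits = sum(predict(r) == y for r, y in zip(data, labels))
        return hits / len(labels)

    baseline = accuracy(rows)
    drops = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the outcome
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        drops.append(max(0.0, baseline - accuracy(shuffled)))
    return drops


def scale_to_max(values):
    """Rescale importances so the most important variable equals 1.0."""
    top = max(values)
    return [v / top if top > 0 else 0.0 for v in values]


def predict(row):
    # Hypothetical fitted model: only column 0 (a BAL-like variable) matters.
    return 1 if row[0] > 24 else 0


# Toy data: column 0 drives the outcome, column 1 is irrelevant.
rows = [[float(v), float(i % 3)] for i, v in enumerate(range(5, 45, 2))]
labels = [predict(r) for r in rows]
drops = permutation_importance(predict, rows, labels, n_features=2)
scaled = scale_to_max(drops)
print(scaled)
```

Under this scaling, a value of 1.0 only identifies the highest-ranked variable within a given model, so scaled values are comparable within a model but not directly across models.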

Discussion
The performance metrics derived from the training and test datasets reveal clear differences in the predictive capabilities of the eight machine learning models employed in our study on tree mortality. The RF model performed best, with the highest precision and the highest accuracy, underscoring its robustness across metrics. It also had a high Kappa score, indicating strong agreement beyond chance in its predictions and making it the most reliable of the models tested.
In contrast, the LR and NB models showed foundational performance with reasonable metrics, indicating that they may struggle with complex data relationships compared to more sophisticated models. The GBM model performed strongly, particularly in accuracy, and had the highest F1 score, highlighting its capability in handling variable interactions and non-linear dynamics. The SVM model also performed well, with high accuracy and precision, suggesting that it is effective in minimizing false positives. The K-NN model, while not achieving the highest scores, provided solid performance across all metrics, and its AUC-ROC curve suggests good classification ability.
Overall, the analysis identifies the RF and GBM models as the most promising in terms of accuracy, reliability, and overall performance. These models balance precision and sensitivity well. However, model selection should still consider specific project requirements, including computational costs and the implications of different types of prediction errors. Models such as NB and LR, while offering solid foundational capabilities, display limitations in their predictive performance, likely because their simpler structure and assumptions do not capture the intricate relationships within the data.
The pivotal role of BAL as the most significant variable in predicting individual tree mortality underscores the intricate dynamics of forest stand structure and competition within ecosystems. This finding is consistent with ecological theories and empirical evidence suggesting that the spatial distribution and size hierarchy within a forest significantly affect individual tree growth, survival, and overall forest productivity [70].
The prominence of BAL within our analysis underscores the principle of competitive exclusion: trees in more densely populated stands that are surrounded by trees with greater basal areas are at an increased risk of stunted growth and mortality. The struggle for vital resources such as sunlight, water, and minerals intensifies when the basal area of neighboring trees surpasses that of the focal tree, resulting in increased stress and a potential rise in mortality. Essentially, trees with a larger basal area are better positioned to monopolize these resources, overshadowing their smaller counterparts and outcompeting them for water and soil nutrients.
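To make the BAL variable concrete, the following Python sketch computes it for each tree in a plot as the summed basal area of all trees with a larger DBH. Expressing the result per hectare, the plot area, and the DBH values are assumptions made for illustration; the text defines BAL only as the basal area larger than the target tree.

```python
import math


def basal_area_m2(dbh_cm):
    """Cross-sectional stem area at breast height, in square metres."""
    return math.pi * (dbh_cm / 200.0) ** 2  # radius in metres = DBH(cm) / 200


def bal_per_tree(dbh_list_cm, plot_area_ha):
    """BAL for each tree: basal area of strictly larger trees, per hectare.

    Assumption: the per-hectare expression and the 'strictly larger' rule
    are common conventions and are not confirmed by the source.
    """
    areas = [basal_area_m2(d) for d in dbh_list_cm]
    return [
        sum(a for a, other in zip(areas, dbh_list_cm) if other > d) / plot_area_ha
        for d in dbh_list_cm
    ]


# Hypothetical plot of three trees on 0.1 ha.
dbh = [30.0, 20.0, 10.0]
bal = bal_per_tree(dbh, plot_area_ha=0.1)
for d, b in zip(dbh, bal):
    print(f"DBH {d:.0f} cm -> BAL {b:.3f} m2/ha")
```

The largest tree in a plot has a BAL of zero, and BAL rises as a tree's rank in the size hierarchy falls, which is why the variable serves as a measure of one-sided competition.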
DBH is indicative of tree size, age, growth rate, and resilience [8] and is widely included as a variable in tree mortality research [71-73]. It emerged as a pivotal variable across several models, with importance values of 1.0000 in the GAM and around 0.7 in the RF, SVM, KNN, and NB models. The prominence of DBH aligns with the understanding that trees with larger diameters are typically more resilient to environmental stressors [74]. However, the models also point to intricate interactions, implying that specific conditions may challenge even trees with substantial DBH.
The mortality caused by competition for light, water, temperature, and nutrients is referred to as intrinsic mortality. Intrinsic mortality is influenced over the long term by the genetic and physiological characteristics of tree species, site conditions, and climatic factors [75]. Site conditions form the foundation of forest productivity and are closely tied to tree mortality. Site variables are primarily topography-related factors, encompassing elevation, aspect, position on slope, gradient, and microtopography. These factors predominantly influence the hydrothermal and soil conditions directly associated with tree growth [76]. In this study, we used slope and elevation as site factors. Elevation, which influences temperature, humidity, light, and soil characteristics, was accentuated in various models, particularly in the ANN model. This finding agrees with ecological theories that particular altitudes may predispose certain tree species to mortality, underscoring the complex equilibrium between environmental parameters and tree vitality. In mountainous regions with large variations in elevation, distinct vertical vegetation zones form because of the undulating topography [77]. Slope, a determinant of soil erosion, moisture retention, and light exposure, was emphasized in models such as ANN, KNN, and RF. While its ecological relevance in shaping tree growth and survival is recognized, slope was not uniformly significant across all models. This discrepancy invites further exploration of slope's multifaceted role in forest ecology. Li Chunming et al. [78] also attempted to incorporate aspect and elevation in their study on stand mortality in Mongolian oak forests. However, they found that these independent variables did not qualify for inclusion in their model, which differs from our result. We attribute this difference to the relatively low elevation (600-750 m) of their study area compared with the high elevation (2079-2438 m) of ours.
CD, a measure of forest canopy cover, was highlighted in models such as RF, SVM, KNN, and NB. Within a given species, superior tree health is commonly linked to higher crown density values, reduced foliage transparency values, and diminished crown dieback values [79]. The models' focus on CD reflects its critical influence on sunlight penetration, photosynthesis efficiency, and overall tree growth, emphasizing the intricate relationship between canopy architecture and arboreal survival.
Soil plays a pivotal role in tree growth by providing essential nutrients, moisture, and structural support. Among the spectrum of soil nutrients, NH4-N and available P assume a critical role in tree physiological processes. The soil's NH4-N content significantly influences plant health and growth by modifying nitrogen absorption efficiency, altering soil pH, and affecting the microbial ecosystem of the root environment. Too much NH4-N can cause nitrogen toxicity, negatively affecting plant growth, while too little may hinder plant development and reduce productivity [80].
Phosphorus, a fundamental constituent of ATP, nucleic acids, and phospholipids, exerts a profound influence on tree development and growth when present in the form of available P [81]. In forest ecosystems, the concentration of available P in the soil can become a constraining factor, especially in regions with weathered or phosphorus-depleted soils [82]. The association between available P and tree vitality is multifaceted and often interacts with other soil attributes and environmental variables. Understanding this relationship is important for forest management and conservation, as it underscores the balance between soil fertility and tree well-being. Available P was emphasized in models such as RF and ANN. As phosphorus is an essential nutrient for plant growth, its importance in these models suggests that phosphorus scarcity may constrain tree development. Although not uniformly significant, its ecological relevance merits further investigation.
Taken together, these patterns of variable importance offer insights into the mechanisms governing tree mortality, revealing the interactions among tree attributes, soil nutrients, and topographical variation. The disparities in variable importance across models reflect the distinct attributes and sensitivities of each modeling approach and can guide model selection for specific ecological questions and management goals. This assessment improves our understanding of individual tree characteristics and highlights the significance of careful model selection and feature engineering in ecological research.
This study integrates machine learning insights with ecological theories and offers a multifaceted perspective on tree mortality factors. The prominence of variables such as BAL, BA, DBH, elevation, and CD across different models underscores their importance, while also highlighting the need for a nuanced understanding of other variables such as slope, available P, and NH4-N. Future research should consider these complex interactions and the specific context of tree species, location, and environmental conditions.
Our study has some limitations. First, our dataset may be biased, as it comes from specific populations and regions. Second, the models might be influenced by the limited data on dead trees and by the data pre-processing methods. Future research can further improve model performance by using more diverse datasets and exploring different feature engineering techniques.

Conclusions
In this study, eight diverse machine learning methods were used to build predictive models of individual tree mortality. Performance varied across methods, and random forest demonstrated the best prediction performance. Tree- and stand-level factors, as well as site and soil factors, were all significant in predicting tree mortality, underscoring the need to include these multifaceted elements in the model.
The variables with a significant impact on individual tree mortality were identified through feature importance analysis across models: BAL, DBH, BA, elevation, slope, NH4-N, soil moisture, crown density, and the soil's available phosphorus are important variables in the Larix gmelinii var. principis-rupprechtii individual mortality model. This emphasizes the role of the tree growth environment, physiological traits, and soil phosphorus content. Although the results are promising, challenges including data limitations and ecosystem complexity should be considered when applying the model. This study illustrates the potential of machine learning for predicting tree mortality, offering insights for model enhancement and aiding ecosystem-management decisions.

Figure 1. Study area with sample plot locations (dots represent sample plot positions).


Figure 2. AUC-ROC curves across models. (a) RF model ROC curve; (b) LR model ROC curve; (c) ANN model ROC curve; (d) GAM model ROC curve; (e) SVM model ROC curve; (f) GBM model ROC curve; (g) KNN model ROC curve; (h) NB model ROC curve. (The grey dotted line is a diagonal line representing the predictive performance of a random guessing model.)

Table 1. Summary statistics of measurements for individual-tree, stand-level, and soil variables.
Note: DBH: diameter at breast height; BA: basal area; BAL: basal area larger than the target trees; Thickness: soil thickness; CD: crown density; Elevation: the elevation at which the trees are located; DA: average age of dominant trees; Slope: slope degree; Moisture: water content of soil; Density: the ratio of the mass of a certain volume of soil after drying to the volume before drying; PH: the pH value of soil; TC: total carbon content of soil; NO3-N: nitrate nitrogen; NH4-N: ammonium nitrogen; Available K: available potassium content of soil; Available P: available phosphorus content of soil; Age: average age of the average trees in a certain plot.

Table 2. Fitting statistics of the eight models on the fitting dataset.

Table 3. Prediction statistics of the eight models on the test dataset.
