Quantifying Plant Species α-Diversity Using Normalized Difference Vegetation Index and Climate Data in Alpine Grasslands

Abstract: Quantifying plant species α-diversity of grasslands at multiple spatial and temporal scales is important for investigating the responses of biodiversity to global change and for protecting biodiversity under global change. Potential plant species α-diversity (i.e., SR_p, Shannon_p, Simpson_p and Pielou_p: potential species richness, Shannon index, Simpson index and Pielou index, respectively) was quantified from climate data (i.e., annual temperature, precipitation and radiation), and actual plant species α-diversity (i.e., SR_a, Shannon_a, Simpson_a and Pielou_a: actual species richness, Shannon index, Simpson index and Pielou index, respectively) was quantified from the normalized difference vegetation index and climate data. Six methods (i.e., random forest, generalized boosted regression, artificial neural network, multiple linear regression, support vector machine and recursive regression trees) were used in this study. Overall, the constructed random forest models performed the best among the six algorithms. The plant species α-diversity simulated by the constructed random forest models explained no less than 96% of the variation in the observed plant species α-diversity. The RMSE and relative biases between the α-diversity simulated by the constructed random forest models and the observed α-diversity were ≤1.58 and within ±4.49%, respectively. Accordingly, plant species α-diversity can be quantified from the normalized difference vegetation index and climate data using random forest models. The random forest models of plant α-diversity built in this study had sufficient predictive accuracy, at least for the alpine grassland ecosystems of Tibet. The random forest models proposed in this study can help researchers save time by forgoing plant community field surveys, and can make it easier to study plant α-diversity over longer temporal scales and larger spatial scales under global change.


Introduction
Plant species α-diversity, as a key component of biodiversity and an important characteristic of the plant community, is commonly represented by species richness (i.e., the number of species) and the Shannon, Simpson and Pielou indices [1-4]. Quantifying plant species α-diversity of grasslands at multiple spatial and temporal scales, as one key aspect of plant diversity-related studies, is important for assessing the responses of biodiversity to global change and for protecting biodiversity under global change [5-7]. Although numerous studies are related to plant diversity [8-12], some issues are still not resolved. Firstly, on the one hand, as the number of responsibilities in the work of scientists increases, there are fewer experienced researchers with sufficient knowledge in diagnostics who can supplement the tests performed in small-area systems. On the other hand, plant diversity data are at present mainly collected from field plant community surveys at relatively small spatial scales, such as the single-point, vertical-transect and horizontal-transect scales, whereas plant diversity data are lacking at relatively large spatial scales, such as the scale of the whole of Tibet [1,13-16]. High-precision models of plant α-diversity are the basis for plant diversity studies at relatively large spatial scales. Secondly, compared with vegetation productivity models (e.g., the gross/net primary production models of the Moderate Resolution Imaging Spectroradiometer) [17-19], plant species diversity models are relatively rare. Thirdly, with the rapid development of science and technology, including computer science and 3S technology, data mining has gradually come into view and now plays key roles in all walks of life [20-27], which makes it possible to model the large volume of primary plant α-diversity data obtained from field plant community surveys. However, there are a variety of data mining techniques, such as random forest, generalized boosted regression, artificial neural network, multiple linear regression, support vector machine and recursive regression trees [20,27]. It is still unclear which data mining technique is the best for simulating plant α-diversity. Accordingly, further studies are needed to find the optimal model of plant α-diversity, which can then be used in studies of plant α-diversity at various spatial and temporal scales (e.g., the spatial patterns of plant α-diversity and the associated driving factors in alpine grassland ecosystems of the Qinghai-Tibet Plateau).
Remote Sens. 2022, 14, 5007
Several earlier studies have examined the plant α-diversity of alpine grassland ecosystems in Tibet [8,9,11,12,28,29]. These earlier studies provide important guidance for the conservation of biodiversity in the alpine grassland ecosystems of Tibet and even the world. However, some issues are still not resolved. Firstly, when analyzing the relationships between climate variables and plant species α-diversity, earlier studies mainly discussed the effects of temperature and precipitation, but did not discuss the relationship between radiation and plant species α-diversity [10,11]. However, some earlier studies have shown that radiation change had stronger influences on the nutrition quality of the plant community than temperature change and precipitation change in alpine grassland ecosystems of Tibet [20,27]. Meanwhile, plant species α-diversity can be closely correlated with the nutrition quality and nutrition production of the plant community in alpine grassland ecosystems of Tibet [8,30]. Moreover, plant growth is indeed influenced by radiation, although photoinhibition may occur under relatively high radiation in alpine regions of the Qinghai-Tibet Plateau [31-33]. These earlier findings imply that radiation can affect plant species α-diversity. Secondly, several earlier studies have shown the random forest model to be the optimal model for predicting some key plant parameters (e.g., plant nutrition quality and nutrition production) in alpine grassland ecosystems of Tibet [20,27]. However, it is still unclear whether the random forest model simulates plant α-diversity better than other data mining models in the alpine grassland ecosystems of Tibet. Accordingly, further studies are needed.
Here, plant species α-diversity was quantified from the normalized difference vegetation index, temperature, precipitation and radiation using six data mining methods (i.e., random forest, generalized boosted regression, artificial neural network, multiple linear regression, support vector machine and recursive regression trees) in alpine grassland ecosystems of Tibet. The main objective was to compare the performance of the six approaches in simulating plant α-diversity.

Data
There were 532 and 398 sampling quadrats, each 0.50 m × 0.50 m in size, under the fencing and free-grazing conditions, respectively. The geographic positions of these sampling sites are shown in Figure 1. The 532 sampling quadrats under fencing conditions were surveyed in 2011-2020, and the 398 sampling quadrats under grazing conditions were surveyed in 2010, 2012 and 2017-2020. The surveyed plant community data included the number of species, species coverage and height. Based on these plant community data, plant α-diversity was calculated for each sampling quadrat [9-12,34,35].
Considering the nonlinear relationships among different indices of plant α-diversity, four indices of plant α-diversity (i.e., SR: species richness, and the Shannon, Simpson and Pielou indices) were adopted in this study. The SR was the number of plant species in each quadrat. The Shannon, Simpson and Pielou indices were calculated with Equations (1)-(3), respectively, where P_i is the relative importance value of each plant species within a quadrat.
The other data used in this research included the maximum normalized difference vegetation index during the period from May to September (NDVI_max), annual temperature (AT), annual precipitation (AP) and annual radiation (ARad) in 2000-2020. The normalized difference vegetation index data were taken from the MODIS product MOD13A3. The AT, AP and ARad data were based on interpolated monthly air temperature, precipitation and radiation, respectively. The accuracy of the interpolated monthly climate data was validated in earlier studies [36].
Based on previous studies [20,27], the plant α-diversity data under fencing and free-grazing conditions were treated as the potential and actual plant α-diversity, respectively. The potential plant α-diversity was assumed to be influenced only by climatic change, while the actual plant α-diversity was assumed to be influenced simultaneously by anthropogenic activities and climatic change [20,27]. The potential SR, Shannon, Simpson and Pielou data were labeled SR_p, Shannon_p, Simpson_p and Pielou_p, respectively. In contrast, the actual SR, Shannon, Simpson and Pielou data were labeled SR_a, Shannon_a, Simpson_a and Pielou_a, respectively. The SR_p, Shannon_p, Simpson_p and Pielou_p data were calculated from the AT, AP and ARad using six methods (i.e., random forest, generalized boosted regression, artificial neural network, multiple linear regression, support vector machines and recursive regression trees) (Table 1). By contrast, the SR_a, Shannon_a, Simpson_a and Pielou_a data were calculated from the AT, AP, ARad and NDVI_max using the same six methods (Table 1).
The reasons why these six approaches were adopted in the current study are as follows. Firstly, random forest, generalized boosted regression, artificial neural network, support vector machines and recursive regression trees are common big data mining tools, while multiple linear regression is a common statistical regression tool. Secondly, earlier studies examined and compared the performance of random forest, multiple linear regression, support vector machines and recursive regression trees in assessing the plant nutrition quality and production of alpine grasslands in Tibet [20,27]. However, the performance of these six approaches in quantifying the plant α-diversity of alpine grasslands in Tibet remains unclear. Thirdly, screening for the optimal plant α-diversity model is necessary for predicting changes in plant α-diversity and for protecting biodiversity under global change. The spatial resolution of AT, AP, ARad, NDVI_max, SR_p, Shannon_p, Simpson_p, Pielou_p, SR_a, Shannon_a, Simpson_a and Pielou_a was 1 km × 1 km in every case.
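The index calculations described in this section can be illustrated with a short sketch. Equations (1)-(3) are not reproduced in this copy of the text, so the code below assumes the commonly used forms of the indices (Shannon as the negative sum of P_i ln P_i, Simpson as one minus the sum of squared P_i, and Pielou as Shannon divided by ln SR); these forms, the function name alpha_diversity and the example quadrat are assumptions for illustration, not the authors' code.

```python
import math


def alpha_diversity(importance_values):
    """Return SR, Shannon, Simpson and Pielou for one quadrat.

    importance_values: one non-negative importance value per species.
    """
    # Keep species that are present and rescale so that the P_i sum to 1.
    present = [v for v in importance_values if v > 0]
    if not present:
        raise ValueError("quadrat contains no species")
    total = sum(present)
    p = [v / total for v in present]

    richness = len(p)                               # SR: number of species
    shannon = -sum(pi * math.log(pi) for pi in p)   # assumed: -sum(P_i ln P_i)
    simpson = 1.0 - sum(pi * pi for pi in p)        # assumed: 1 - sum(P_i^2)
    # Assumed: Shannon / ln(SR); undefined for one species, reported as 0.0.
    pielou = shannon / math.log(richness) if richness > 1 else 0.0
    return {"SR": richness, "Shannon": shannon,
            "Simpson": simpson, "Pielou": pielou}


# Hypothetical quadrat with four equally important species.
indices = alpha_diversity([0.25, 0.25, 0.25, 0.25])
```

For this hypothetical quadrat the sketch gives SR = 4, Shannon = ln 4 (about 1.386), Simpson = 0.75 and Pielou = 1, which matches the expectation that a perfectly even community has maximal evenness.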

Statistical Analysis
Following earlier studies [20,37-39], 30 datasets of observed plant α-diversity, annual temperature, annual precipitation, annual radiation and/or growing-season maximum normalized difference vegetation index were randomly selected from the 532 and 398 samples under fencing and free-grazing conditions, respectively. These 30 datasets were used to validate the predictive accuracy (i.e., linear slope, R²: coefficient of determination, RMSE: root-mean-square error, and relative bias) of all the models used in this study [20]. Random forest, generalized boosted regression, artificial neural network, multiple linear regression, support vector machines and recursive regression trees were performed with the randomForest, gbm, rminer, stats, e1071 and rpart packages of the R 4.1.2 software, respectively [40-42]. R 4.1.2 was also used to perform all the other statistical analyses, including the linear regression between potential and actual plant α-diversity.
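The validation step described above can be sketched as follows. The study itself used R; this Python sketch only illustrates the four accuracy measures under stated assumptions: the slope is taken from an ordinary least-squares regression of simulated on observed values, and relative bias is taken as the difference between the simulated and observed means expressed as a percentage of the observed mean. The text does not spell out either definition, and the function names holdout_indices and validation_metrics are hypothetical.

```python
import math
import random


def holdout_indices(n_samples, n_validation=30, seed=0):
    """Randomly pick n_validation distinct sample indices for validation."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_samples), n_validation))


def validation_metrics(observed, simulated):
    """Return slope, R2, RMSE and relative bias (%) for paired values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_s = sum(simulated) / n
    sxx = sum((o - mean_o) ** 2 for o in observed)
    syy = sum((s - mean_s) ** 2 for s in simulated)
    sxy = sum((o - mean_o) * (s - mean_s)
              for o, s in zip(observed, simulated))
    if sxx == 0 or syy == 0:
        raise ValueError("observed and simulated values must both vary")

    slope = sxy / sxx                      # assumed: simulated ~ observed
    r2 = sxy ** 2 / (sxx * syy)            # squared Pearson correlation
    rmse = math.sqrt(sum((s - o) ** 2
                         for o, s in zip(observed, simulated)) / n)
    relative_bias = 100.0 * (mean_s - mean_o) / mean_o  # assumed definition
    return {"slope": slope, "R2": r2, "RMSE": rmse,
            "relative_bias": relative_bias}


# Hypothetical values: every simulation overestimates the observation by 0.1.
metrics = validation_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 3.1, 4.1])
validation_ids = holdout_indices(532)
```

A constant offset, as in the hypothetical example, leaves the slope and R² at 1 but still shows up in the RMSE and the relative bias, which is why the four measures are reported together.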

Model Construction
The model parameters for the random forest, generalized boosted regression, artificial neural network, multiple linear regression, support vector machine and recursive regression tree models of plant α-diversity are shown in Table 1. The values and numbers of model parameters varied among the six approaches (Table 1). Random forest, multiple linear regression and recursive regression trees directly provided R² values, whereas generalized boosted regression, artificial neural network and support vector machines did not (Table 1). Generally, the R² values of the constructed random forest models were the greatest, and the R² values of the constructed multiple linear regression models were the lowest (Table 1). Random forest, generalized boosted regression and support vector machine provided the numbers of trees or vector machines, whereas the other three approaches did not (Table 1). Different indices of plant α-diversity used different numbers of trees or vector machines (Table 1). Moreover, some approaches directly provided internal error evaluation parameters (Table 1): random forest models provided mean square errors, generalized boosted regression models provided mean train errors and mean cross-validation errors, and support vector machine models provided mean residuals (Table 1).

Model Accuracies
The simulated plant α-diversity was significantly and linearly correlated with the observed plant α-diversity (Figures 2-5). The relative bias and RMSE values between the simulated and observed plant α-diversity are shown in Table 2. The linear slopes, R² values, relative bias and RMSE values between the simulated and observed plant α-diversity varied among the six approaches (Figures 2-5, Table 2). The linear slopes and R² values between simulated and observed plant α-diversity were no less than 0.88 (Table 2). The relative bias and RMSE values between simulated and observed plant α-diversity ranged from −6.53% to 9.14% and from 0.05 to 2.37, respectively (Figures 2-5).

Discussion
As in earlier studies, not all models directly provided R² values [20,38]. In fact, only three of the six methods (i.e., random forest, multiple linear regression and recursive regression trees) directly provided R² values (Table 1). Among these three methods, the AT, AP, ARad and/or NDVI_max explained the most variation in plant α-diversity in the constructed random forest models and the least variation in the constructed multiple linear regression models. This result was similar to that of an earlier paper, which showed that the AT, AP, ARad and/or NDVI_max explained more of the variation in plant nutritional quality and production in the constructed random forest models than in the constructed recursive regression tree and multiple linear regression models in alpine grassland ecosystems of Tibet [20]. Accordingly, the R² values of the constructed models can provide a preliminary evaluation of model quality, at least for plant nutrition production and α-diversity in the alpine grassland ecosystems of Tibet.
Different models/methods generally have different computational logic and parameters, and only three of the six methods (i.e., random forest, generalized boosted regression and support vector machine) provided the numbers of trees or vector machines (Table 1). Compared with the constructed generalized boosted regression and support vector machine models, the constructed random forest models had fewer trees in most cases. A lower number of trees implies lower model complexity and higher computational speed. Accordingly, the constructed random forest models in this study had the highest computational speed and lowest model complexity, whereas the constructed generalized boosted regression models had the lowest computational speed and highest model complexity, at least for plant α-diversity in alpine grassland ecosystems of Tibet. However, this result appears to contradict an earlier paper, which found that the number of trees in the constructed random forest models was greater than the number of vector machines in the constructed support vector machine models [20]. Accordingly, evaluating models solely by the number of trees or vector machines may not be universally applicable.
The accuracy and robustness of the constructed random forest models of plant α-diversity were greater than those of the other five approaches, which was supported by the facts mentioned above and by the following facts. Firstly, in the scatter plots of simulated against observed plant α-diversity, there were many points where multiple observed values corresponded to only one simulated value, especially for the constructed artificial neural network and recursive regression tree models (Figures 2-5). Secondly, Simpson and Pielou values should generally lie between zero and one. However, not all of the simulated potential Pielou values (e.g., 1.01) were within this range for the constructed multiple linear regression and artificial neural network models (Figures 2-5). Thirdly, in some cases the absolute value of the relative bias between the simulated and observed plant α-diversity was greater than 4.80% for four of the six methods (i.e., artificial neural network, multiple linear regression, support vector machines and recursive regression trees) (Table 2). The largest absolute relative bias between the simulated and observed plant α-diversity was about 4.49% for random forest, compared with about 4.61% for generalized boosted regression (Table 2). Fourthly, the linear slopes between the plant α-diversity simulated by the constructed random forest models and the observed plant α-diversity were the closest to 1 among the six methods in more than half of the cases, whereas those for the constructed generalized boosted regression models were the closest to 1 in only a quarter of the cases (Figures 2-5). Fifthly, in most cases the constructed random forest models had the lowest RMSE values and the largest R² values between simulated and observed plant α-diversity among the six methods (Table 2, Figures 2-5). Moreover, AT, AP, ARad and/or NDVI_max explained about 71-73% and 61-73% of the variation in potential and actual plant α-diversity, respectively, in the constructed random forest models (Table 1). Between the plant α-diversity simulated by random forest and the observed plant α-diversity, the linear slopes were 0.97-1.00 and 0.91-1.00, the R² values were 0.97-1.00 and 0.96-0.99, the relative biases were −1.81% to 0.70% and −4.49% to 4.39%, and the RMSE values were 0.05-1.10 and 0.09-1.58 for potential and actual α-diversity, respectively (Table 2, Figures 2-5). Accordingly, potential plant species α-diversity can be quantified from the AT, AP and ARad using the constructed random forest models, and actual plant species α-diversity can be quantified from the AT, AP, ARad and NDVI_max using the constructed random forest models, at least for the alpine grassland ecosystems of Tibet. However, because the data come only from Tibet, which is a hotspot of unique vegetation, it is difficult to generalize the presented results to other areas. Thus, future studies should focus on areas outside Tibet, and their findings can be compared with the results of this study.
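The range argument above (Simpson and Pielou values should lie between zero and one) can be turned into a simple check on model output. The sketch below is a minimal illustration; the function name out_of_range and the example values are hypothetical, apart from 1.01, which is the out-of-range Pielou value mentioned in the text.

```python
def out_of_range(values, lower=0.0, upper=1.0):
    """Return (index, value) pairs that fall outside [lower, upper]."""
    return [(i, v) for i, v in enumerate(values)
            if not lower <= v <= upper]


# Hypothetical simulated Pielou values; 1.01 exceeds the upper bound.
simulated_pielou = [0.82, 0.95, 1.01, 0.77]
flagged = out_of_range(simulated_pielou)
```

Here flagged contains the single pair (2, 1.01). A check of this kind says nothing about accuracy inside the valid range, so it complements rather than replaces the slope, R², RMSE and relative bias comparisons.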

Conclusions
To the best of our knowledge, this is the first study to quantify the potential (i.e., affected only by climate change) and actual (i.e., affected simultaneously by climate change and human activities) plant α-diversity (i.e., species richness, Shannon, Simpson and Pielou) in the alpine grassland ecosystems of the Tibetan Plateau using six models (i.e., random forest, generalized boosted regression, artificial neural network, multiple linear regression, support vector machines and recursive regression trees), against the background of the rapid development of big data mining technologies worldwide. The models were driven by climate data (i.e., AT: annual temperature; AP: annual precipitation; ARad: annual radiation) and the growing-season maximum normalized difference vegetation index (NDVI_max). The predictive accuracies of the six approaches were compared by analyzing the linear slopes, R², relative bias and RMSE values between simulated and observed plant α-diversity. The constructed random forest models of plant α-diversity performed better than the other five methods. Accordingly, the tool proposed in this study can help in developing solutions to urgent environmental problems. For example, the constructed random forest models of plant α-diversity can be used to quantify the spatial and temporal patterns of potential and actual plant α-diversity and to predict their changes under future global change, at least for the alpine grassland ecosystems of Tibet.

Figure 2. Correlations between the simulated and observed species richness of the plant community under fencing (a,c,e,g,i,k) and grazing (b,d,f,h,j,l) scenes, for random forest (a,b), generalized boosted regression (c,d), artificial neural network (e,f), multiple linear regression (g,h), support vector machines (i,j) and recursive regression trees (k,l). The solid lines indicate the linear fitted lines between the simulated and observed species richness. SR_p: potential species richness; SR_a: actual species richness. All the regressions were significant at p < 0.001.

Figure 3. Correlations between the simulated and observed Shannon of the plant community under fencing (a,c,e,g,i,k) and grazing (b,d,f,h,j,l) scenes, for random forest (a,b), generalized boosted regression (c,d), artificial neural network (e,f), multiple linear regression (g,h), support vector machines (i,j) and recursive regression trees (k,l). The solid lines indicate the linear fitted lines between the simulated and observed Shannon. Shannon_p: potential Shannon; Shannon_a: actual Shannon. All the regressions were significant at p < 0.001.

Figure 4. Correlations between the simulated and observed Simpson of the plant community under fencing (a,c,e,g,i,k) and grazing (b,d,f,h,j,l) scenes, for random forest (a,b), generalized boosted regression (c,d), artificial neural network (e,f), multiple linear regression (g,h), support vector machines (i,j) and recursive regression trees (k,l). The solid lines indicate the linear fitted lines between the simulated and observed Simpson. Simpson_p: potential Simpson; Simpson_a: actual Simpson. All the regressions were significant at p < 0.001.

Figure 5. Correlations between the simulated and observed Pielou of the plant community under fencing (a,c,e,g,i,k) and grazing (b,d,f,h,j,l) scenes, for random forest (a,b), generalized boosted regression (c,d), artificial neural network (e,f), multiple linear regression (g,h), support vector machines (i,j) and recursive regression trees (k,l). The solid lines indicate the linear fitted lines between the simulated and observed Pielou. Pielou_p: potential Pielou; Pielou_a: actual Pielou. All the regressions were significant at p < 0.001.

Author Contributions: Conceptualization, G.F. and Y.T.; methodology, G.F.; software, Y.T.; validation, G.F.; formal analysis, G.F. and Y.T.; investigation, G.F.; resources, Y.T.; data curation, G.F.; writing-original draft preparation, G.F. and Y.T.; writing-review and editing, G.F. and Y.T.; visualization, Y.T.; supervision, Y.T.; project administration, Y.T.; funding acquisition, G.F. and Y.T. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the Youth Innovation Promotion Association of Chinese Academy of Sciences [2020054], National Natural Science Foundation of China [31600432], Bingwei Outstanding Young Talents Program of Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences [2018RC202], Science and Technology Project of Tibet Autonomous Region [XZ202101ZD0003N, XZ202101ZD0007G, XZ202201ZY0003N], STS Project of Chinese Academy of Sciences [KFJ-STS-QYZD-2021-22-003], Construction of Fixed Observation and Experimental Station of First and Try Support System for Agricultural Green Development in Zhongba County and Central Government Guides Local Science and Technology Development Program [XZ202202YD0009C].

Table 1. The parameters for random forest, generalized boosted regression, artificial neural network, multiple linear regression, support vector machines and recursive regression trees of potential and actual species richness, Shannon, Simpson and Pielou, respectively.

Table 2. The relative bias (%) and RMSE values between observed and simulated potential and actual α-diversity of the plant community (n = 30).