Modelling the Vegetation Response to Climate Changes in the Yarlung River Basin Using Random Forest. Water.

: Vegetation coverage variation may inﬂuence watershed water balance and water resource availability. Yarlung Zangbo River, the longest river on the Tibetan Plateau, has high spatial heterogeneity in vegetation coverage and is the main freshwater resource of local residents and downstream countries. In this study, we proposed a model based on random forest (RF) to predict the Normalized Di ﬀ erence Vegetation Index (NDVI) of the Yarlung Zangbo River Basin and explore its relationship with climatic factors. High-resolution datasets of NDVI and monthly meteorological observation data from 2000 to 2015 were used to calibrate and validate the proposed model. The proposed model was then compared with artiﬁcial neural network and support vector machine models, and principal component analysis and partial correlation analysis were also used for predictor selection of artiﬁcial neural network and support vector machine models for comparative study. The results show that RF had the highest model e ﬃ ciency among the compared models. The Nash–Sutcli ﬀ e coe ﬃ cients of the proposed model in the calibration period and veriﬁcation period were all higher than 0.8 for the ﬁve subzones; this indicated that the proposed model can successfully simulate the relationship between the NDVI and climatic factors. By using built-in variable importance evaluation, RF chose appropriate predictor combinations without principle component analysis or partial correlation analysis. Our research is valuable because it can be integrated into water resource management and elucidates ecological processes in Yarlung Zangbo River Basin.


Introduction
Vegetation is produced as a result of the interactions among factors such as soil, atmosphere, and moisture [1]. Vegetation is affected by climate because of biophysical responses such as plant respiration, photosynthesis, and evapotranspiration [2]. Recent research found that vegetation plays a key role in future terrestrial hydrologic response, and understanding water stress is of the utmost importance for properly predicting future dryness and water resources [3]. Changes in global climate and associated effects on vegetation condition have received an increasing amount of attention [4]. Among such research, the Normalized Difference Vegetation Index (NDVI) is frequently used to monitor changes in vegetation conditions, because of its close relationship with photosynthetically active radiation, which is absorbed by photosynthesizing tissues [5,6]. With the improvement of remote sensors, the NDVI has been widely applied in continental and regional

Study Area
The Yarlung Zangbo River is the largest river on the Tibetan Plateau and one of the most important international rivers [43]. It originates from the Jemayangzong Glacier in southern Tibet and has a total length of 2229 km and its drainage area is 2.42 × 105 km 2 [44]. This river is one of the highest rivers in the world, with an average elevation of above 4600 m, and tilts from the west to the east, with an average slope of 2.6° [21]. The Yarlung Zangbo River has six major tributaries (the Dogxung Zangbo, Nyangqu, Lhasa, Nyang, Yigong Zangbo, and Purlung Zangbo Rivers). The locations of the YZRB and its tributaries are shown in Figure 1.
Because of the unique topographic characteristics and high altitude of the plateau, the vegetation and ecological environment of the YZRB are relatively fragile and complex [23] and show obvious changes from upstream to downstream [45]. According to the China Vegetation Atlas (Figure 2), the upstream region is located in an arid zone that is dominated by alpine grassland and meadows [46]. With decreasing elevation, the midstream transitions into a continental climate and is mainly covered by alpine grassland and meadows, and the cultivated vegetation slightly increases. The lower reaches of YZRB have a subtropical climate and are mainly covered by coniferous and broadleaf forests.   Because of the unique topographic characteristics and high altitude of the plateau, the vegetation and ecological environment of the YZRB are relatively fragile and complex [23] and show obvious changes from upstream to downstream [45]. According to the China Vegetation Atlas (Figure 2), the upstream region is located in an arid zone that is dominated by alpine grassland and meadows [46]. With decreasing elevation, the midstream transitions into a continental climate and is mainly covered by alpine grassland and meadows, and the cultivated vegetation slightly increases. The lower reaches of YZRB have a subtropical climate and are mainly covered by coniferous and broadleaf forests.
Water 2020, 12, x FOR PEER REVIEW 3 of 12

Study Area
The Yarlung Zangbo River is the largest river on the Tibetan Plateau and one of the most important international rivers [43]. It originates from the Jemayangzong Glacier in southern Tibet and has a total length of 2229 km and its drainage area is 2.42 × 105 km 2 [44]. This river is one of the highest rivers in the world, with an average elevation of above 4600 m, and tilts from the west to the east, with an average slope of 2.6° [21]. The Yarlung Zangbo River has six major tributaries (the Dogxung Zangbo, Nyangqu, Lhasa, Nyang, Yigong Zangbo, and Purlung Zangbo Rivers). The locations of the YZRB and its tributaries are shown in Figure 1.
Because of the unique topographic characteristics and high altitude of the plateau, the vegetation and ecological environment of the YZRB are relatively fragile and complex [23] and show obvious changes from upstream to downstream [45]. According to the China Vegetation Atlas (Figure 2), the upstream region is located in an arid zone that is dominated by alpine grassland and meadows [46]. With decreasing elevation, the midstream transitions into a continental climate and is mainly covered by alpine grassland and meadows, and the cultivated vegetation slightly increases. The lower reaches of YZRB have a subtropical climate and are mainly covered by coniferous and broadleaf forests.    Yarlung Zangbo river has more than 130 tributaries, which is larger than 100 km 2 , and its major tributaries include Nianchu River, Lhasa River, Nyang River, and Parlung Tsangpo. By considering the hydrological and vegetation similarity, which is shown in Figures 1 and 2, the YLZB is divided into 5 subzones in the research. The area and vegetation conditions of the five subzones are shown in Tables 1 and 2.

Data Description
A quality-controlled NDVI remote sensing product (MOD13A3) is selected in this study, obtained from the observation of MODIS (Moderate Resolution Image Spectroradiometer) data provided by NASA, spanning 16 years (February 2000 to December 2015). MOD13A3 is the third level product, based on the secondary product, corrected the edge distortion (Bowtie effect) produced by the sensor imaging process. The spatial resolution of the product is 1.1 km × 1.1 km, while the time resolution is monthly. The data were processed into the Geostationary Earth Orbit Tag Image File Format (GEO TIFF) by MODIS Reprojection Tool (MRT) software and processed by ArcGIS projection splicing. The monthly mean air temperature and precipitation data covering 2000-2015 from 30 meteorological stations located in the YZRB were collected from the China Meteorological Data Network. The locations of the meteorological stations are shown in Figure 1.

Random Forest
RF was first proposed by Breiman (2001) [47] as an ensemble learning method that can be used in both classification and regression tasks. The model is considered capable of dealing with small sample sizes and high-dimensional correlation relationships [47]. RF is also advantageous because of its robustness; it does not easily lead to overfitting or provide biased estimates when predictors that do not add information are used [48].
RF includes a group of classification and regression decision trees (CARTs) that are unpruned and generated by bootstrap sampling and random variable selection. The RF algorithm can be divided into the following steps. First, the training dataset is randomly extracted from the original dataset by bootstrap resampling. Second, the CARTs are established for each training set. Compared with the traditional CART method, RF selects random feature combinations to split each node, and each CART grows to the maximum extent without any pruning. Finally, the RF output is obtained by voting in classification mode or averaging in the regression mode of all of the CART predictions.
RF provides a built-in cross-validation process that occurs in parallel with the training procedure for the out-of-bag (OOB) samples, which are not chosen by the bootstrap process. RF can evaluate variable importance by randomly permuting these variables and observing the difference in model performance using OOB samples. At the end of procedure, RF obtains variable importance by averaging these differences, which is then normalized by the standard deviation.

Model Implementation and Validation
The RF is utilized to simulate the nonlinear relationship between the NDVI and climate factors in the 5 subzones. The monthly area-averaged NDVI datasets for each subzone were obtained by calculating the average values of each pixel. The monthly area-averaged precipitation and temperature of 5 subzones were obtained from actual data. For model development, the monthly average NDVI datasets and monthly area precipitation and temperature datasets of each subzone are divided into two datasets. The datasets from 2000 to 2009 were used for model calibration, and those from 2010 to 2015 were used for model validation.
Two machine learning models, ANN and SVM, which were previously used for NDVI prediction [49,50], were selected to compare with RF performance. A three-layer back propagation (BP) ANN model was used in this research. The BP method has been the most widely used algorithm to design multiple layer neural networks, and has also been successfully used for NDVI prediction [16]. SVM was the first classification machine learning algorithm and was proposed by Vapnik, and then gradually derived to the regression algorithm [51]. SVM has been widely used for hydrological prediction, and most recently for NDVI prediction. In this research, SVM with linear kernel function was used, which has been widely used in former studies.
One of the most important steps in the development of machine learning prediction models is the choice of appropriate predictors. Due to the spatial heterogeneity of the vegetation in the YLZB, the relationships between the NDVI and climate factors are different in the 5 subzones. Therefore, it is important to choose appreciate climate factors as predictors in NDVI prediction. In previous studies, PCA and PAR were used for predictor selections of ANN and SVM models. For comparative study, here, PAR and PCA were both used for predictor selection of the ANN and SVM models [52]. For PCA, the climate factors are standardized by subtracting the mean from the original values and then dividing the results by the standard deviation of the original variables. The PCA method is then applied to the standardized climate factors to extract principal components (PCs) that are orthogonal. The obtained PCs preserve more than 90% of the variances that are selected as predictors. Then, the PCs are used in the ANN and SVM modeling, and these results are marked as, ANN-PCA and SVM-PCA. For PAR, climate factors with a partial correlation coefficient greater than 0.3 were selected as predictors, and the corresponding results are marked as ANN-PAR and SVM-PAR.

Model Evaluation Index
The mean absolute percentage error (MAE), Nash-Sutcliffe coefficient (NASH), root mean square errors (RMSE), and correlation coefficient (R) statistical indicators were used to assess the predictive performance of the ANN, SVM, and RF models. MAE, NASH, RMSE, and R were defined as: where Y i,obs is the measured NDVI value of the station, Y obs is the mean of the observed NDVI value, Y i,sim is the vector of the simulated NDVI value, and Y sim is the mean of the simulated NDVI value. In general, a higher NASH value indicates better model efficiency; in contrast, smaller RMSE, MAE, and R values indicate higher accuracy.

Spatial and Temporal Characteristics of the NDVI in the YZRB
The inter-annual variations of the NDVI, precipitation, and temperature on the subzone scale from 2000 to 2015 are shown in Table 3. The NDVI and temperature values showed a statistically insignificant increase, whereas the average precipitation of the Yarlung Zangbo River Basin significantly decreased from 528 mm in 2000 to 396 mm in 2015, with a total increase of 0.8 • C over the 16 years. This finding is consistent with the results of previous studies [26]. In the five subzones, NDVI gradually increased from upstream to downstream. The average annual growth of NDVI in the five subzones was 0.1 × 10 −3 , 0.1 × 10 −3 , 0.4 × 10 −3 , 0.7 × 10 −3 , and 0.2 × 10 −3 . The precipitation and temperature show similar trends. The average annual growth of precipitation was −3.9, −3.7, −9.86, −13.86, and −12.8; the average annual growth of temperature was 0.02, 0.04, 0.07, 0.04, and 0.01.

Predictors Selection
In order to determine the optimal predictors for NDVI prediction models, PCA and PAR were used to analyze the relationships between NDVI and precipitation/temperature at different lead times. The results are shown in Tables 4 and 5, where Pn represents the average precipitation with a lead time n month, and Tn represents the average precipitation with a lead time n month. With reference to similar studies and the meteorological cycles [32][33][34][35], the maximum lead times were set to 6 months.
As shown in Tables 4 and 5, the correlations between the NDVI and precipitation/temperature gradually decayed with the increase of lead time. The PCA results show that the precipitation and temperature whose lead time was shorter than 2 months had major impacts on the NDVI in these subzones. However, the PAR results varied in these subzones. In Sub1 and Sub5, the precipitation in the present month and temperature whose lead time was shorter than 2 months had major impacts on the NDVI. In Sub2, the precipitation whose lead time was shorter than 1 month and temperature whose lead time was shorter than 2 months had major impacts on the NDVI. In Sub3, the precipitation whose lead time shorter than 1 months and temperature whose lead time shorter than 3 months had major impacts on NDVI. In Sub4, the precipitation whose lead time was shorter than 2 months and temperature whose lead time was shorter than 3 months had major impacts on the NDVI. In general, the relationships between the NDVI and temperature were slightly closer than those between NDVI and precipitation in the five subzones.
RF evaluates the relative contribution of each predictor using a built-in variable importance evaluation process. The importance of the precipitation/temperature at different lead times in these subzones are calculated and indicated in Figure 3. As illustrated in Figure 3, although the importance of precipitation and temperature gradually decreased, the increase in lead time and the decreases were not as significant as in the PCA and PAR results. This finding may indicate that RF can use all predictors without overfitting. Thus, the precipitation and temperature whose lead time was shorter than 6 months were used for RF modeling of the five subzones.
Water 2020, 12, x FOR PEER REVIEW 7 of 12 As shown in Tables 4 and 5, the correlations between the NDVI and precipitation/temperature gradually decayed with the increase of lead time. The PCA results show that the precipitation and temperature whose lead time was shorter than 2 months had major impacts on the NDVI in these subzones. However, the PAR results varied in these subzones. In Sub1 and Sub5, the precipitation in the present month and temperature whose lead time was shorter than 2 months had major impacts on the NDVI. In Sub2, the precipitation whose lead time was shorter than 1 month and temperature whose lead time was shorter than 2 months had major impacts on the NDVI. In Sub3, the precipitation whose lead time shorter than 1 months and temperature whose lead time shorter than 3 months had major impacts on NDVI. In Sub4, the precipitation whose lead time was shorter than 2 months and temperature whose lead time was shorter than 3 months had major impacts on the NDVI. In general, the relationships between the NDVI and temperature were slightly closer than those between NDVI and precipitation in the five subzones.
RF evaluates the relative contribution of each predictor using a built-in variable importance evaluation process. The importance of the precipitation/temperature at different lead times in these subzones are calculated and indicated in Figure 3. As illustrated in Figure 3, although the importance of precipitation and temperature gradually decreased, the increase in lead time and the decreases were not as significant as in the PCA and PAR results. This finding may indicate that RF can use all predictors without overfitting. Thus, the precipitation and temperature whose lead time was shorter than 6 months were used for RF modeling of the five subzones.    T 0 : Temperature of the month, P 0 : rainfall of the month, T 1 : temperature of the previous month, P 1 : rainfall of the previous month, T 2 : temperature of the first 2 months, P 2 : rainfall of the last 2 months, T 3 : temperature of the first 3 months, P 3 : rainfall in the first 3 months. T 0 : Temperature of the month, P 0 : rainfall of the month, T 1 : temperature of the previous month, P 1 : rainfall of the previous month, T 2 : temperature of the first 2 months, P 2 : rainfall of the last 2 months, T 3 : temperature of the first 3 months, P 3 : rainfall in the first 3 months.

Comparative Study
The calibration and validation results of the RF and comparative models are summarized in Table 6. The results show that RF was superior to the comparative models in the calibration and validation periods. The NASH RF values for the five subzones were 0.96, 0.97, 0.96, 0.94, and 0.92 in the calibration period, and 0.91, 0.95, 0.96, 0.89, and 0.83 in the validation period. All of the measured criteria were superior to those of the compared models (ANN and SVM).
The results of the two-parameter selection were also compared between the ANN and SVM models. PCA was superior to PAR for both the ANN and SVM models. For the ANN models, the average RMSE and MAE were similar in both the calibration and validation periods. However, the average NASH and R of the results using PAR were superior to those of the PCA by 0.03 and 0.04 in the calibration period, and 0.03 and 0.05 in the validation period, respectively. For the SVM models, the average NASH and R increased by 0.03 and 0.02 in the calibration period, and 0.03 and 0.02 in the validation period, respectively. The average RMSE and MAE decreased by 0.002 and 0.004 in the calibration period, and 0.004 and 0.006 in validation period, respectively. Therefore, PCA was advantageous over PAR, with increases of NASH and R, and decreases of RMSE and MAE.

Conclusions
As a key component of ecohydrological processes, vegetation conditions influence the efficiency of plant water use and potentially affect water resources. Therefore, investing the changes of vegetation conditions and exploring the vegetation responses to climate changes will provide essential information for regional water resource management [53,54]. Combining with climate models, NDVI prediction models can assess the effects of future drought events [10]. As a covariate with other environmental variables, NDVI prediction models will also provide essential information for irrigation management [15] and soil-loss-prone area identification [12,13], etc. By exploring the vegetation condition changes of the YZRB and their relationship with climatic factors, we proposed an NDVI prediction model based on RF with area-averaged precipitation and temperature as predictors. The monthly rainfall and temperature observations from 30 meteorological stations in the YZRB and the MODIS NDVI datasets from 2000 to 2015 were selected to calibrate and validate the proposed model. The RF results were also compared with those of ANN and SVM models. The primary conclusions are as follows: 1.
RF successfully simulated the relationship between NDVI and climatic factors. The NASH coefficients of the proposed model during the calibration period in the five subzones were all higher than 0.9, and those during the verification period were all higher than 0.8. Among the five tested models, RF showed the highest model efficiency in both the calibration and validation periods among all compared models.

2.
RF showed advantages for predictor selection. The built-in variable importance evaluation allowed RF to select predictors without additional selection methods, such as PAR and PCA. Moreover, the numbers of predictors were greatest for RF among the compared models. RF showed robustness for modeling, because it could take full advantage of all predictor and avoid overfitting. 3.
PCA and PAR were used to analyze the factors that affect the NDVI in YZRB subzones. The results show that the rainfall and temperature of the first 3 months had significant impacts on NDVI, and temperature had a greater influence than rainfall in most of the subzones.
Because of sparse meteorological networks, this research was conducted on a subzone scale. In the future, we will try to explore the relationships between NDVI and climatic factors at a higher resolution with gridded meteorological observations, which will be more applicable for integrated water resource management. The adoption of more vegetation indices, such as leaf area index (LAI), is another important direction.
Author Contributions: K.C. used RF, SVM, ANN model for simulation, calibration, and validation; B.P., L.C. and D.P. collected and processed data; K.C., B.P. and wrote the paper; Z.Z., G.Z. and S.S. supervised the research. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.