Hyperspectral Monitoring Driven by Machine Learning Methods for Grassland Above-Ground Biomass

Abstract: Above-ground biomass (AGB) is a key indicator for studying grassland productivity and evaluating carbon sequestration capacity; it is also a key area of interest in hyperspectral ecological remote sensing. In this study, we use data from a typical alpine meadow in the Qinghai–Tibet Plateau during the main growing season (July–September), compare the results of various feature selection algorithms to extract an optimal subset of spectral variables, and use machine learning and data mining techniques to build an AGB prediction model and achieve an optimal inversion of above-ground grassland biomass. The results show that the Lasso and RFE_SVM band-filtering machine learning models can effectively select the globally optimal features and improve the predictive performance of the model. The analysis also compares the support vector machine (SVM), least squares regression boosting (LSB), and Gaussian process regression (GPR) AGB inversion models; our findings show that the results of the three models are similar, with the GPR machine learning model achieving the best outcomes. In addition, through the analysis of different data combinations, we find that the accuracy of AGB inversion can be significantly improved by combining the spectral characteristics with the growing season. Finally, by constructing an interpretable machine learning model to analyze the specific role of each feature, we find that the same band plays different roles in different records; the related results can provide a scientific basis for grassland resource monitoring and estimation research.


Introduction
The Qinghai-Tibet Plateau's alpine grasslands play a critical role in climate change and the global carbon cycle [1][2][3][4][5]. Grasslands use energy and materials from nature to maintain their own stable system structure and function [6,7], preventing soil erosion [8,9], purifying the air [10], and performing other functions [11], as well as serving as an important green barrier for the earth and a material source for livestock farming [2,10,12]. Grassland biomass contributes significantly to the carbon pool of terrestrial ecosystems, representing a carbon source in the form of organic matter [3,11,13]. The biomass of grassland areas objectively reflects their potential to sequester carbon and support animal life [14][15][16]. Thus, accurate grassland biomass assessment helps to reveal the role of grassland ecosystems in global climate change, as well as to understand the efficient use of grassland resources [17,18]. Ten of the sample plots were enclosed, while the other ten were allowed to be freely grazed.

Collection of Data
The spectral data of the alpine grassland were gathered using a dual-beam simultaneous spectral measurement system from Analytical Spectral Devices Inc. (ASD), owing to the unpredictable and rapidly changing climate of the Tibetan Plateau. The system comprises two ASD FieldSpec 4 field spectrometers and a dual simultaneous measurement software system [49]. This setup allows reference whiteboard and target data to be acquired simultaneously under equal illumination conditions, considerably increasing measurement efficiency and allowing for optimal measurements in the field. The spectral range of the instrument is 350-2500 nm, with a wavelength precision of 0.5 nm and a spectral resolution of 3 nm at 700 nm and 8 nm at 1400/2100 nm [50].
The field survey was conducted from July to September in 2019 and 2020. In the test site, a total of 20 sample plots were constructed, as described above, each with 10 quadrats of 0.5 m × 0.5 m size and a cover requirement of above 85%. A total of 400 sample quadrats were gathered in early July, 400 in mid-late August, and 400 in mid-September, for a total of 1200 quadrats collected across the two years; each sample measurement was repeated six times, for a total of 7200 spectral data points. For each quadrat, the cover, dominant species, longitude, latitude, and grazing status were recorded, and the average observed plant height was used as the average grass height for the entire quadrat. The herbaceous plants were mowed flush with the ground after spectral data collection, and the fresh biomass was weighed immediately using an electronic scale.

Methods and Flow of Data Processing

Pre-Processing of Data
After exporting the spectral data with ViewSpec Pro, MATLAB was used to conduct Savitzky-Golay (SG) smoothing and denoising [51]. The SG filter produced a generally smooth spectral curve while retaining the absorption features at all wavelengths. After spectral curves with major discrepancies and errors were deleted, a total of 7200 spectra remained. The six spectra of each sample were averaged into a composite spectrum, yielding a total of 1200 valid hyperspectral data points.
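Although the study performed this step in MATLAB, the smoothing-and-averaging step can be sketched in Python. The window length, polynomial order, and spectra below are illustrative assumptions, not the study's settings or data.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
wavelengths = np.arange(350, 2501)           # 350-2500 nm at a 1 nm step
# six noisy repeat spectra for one quadrat (synthetic)
raw = (0.3 + 0.05 * np.sin(wavelengths / 100)
       + 0.01 * rng.standard_normal((6, wavelengths.size)))

# SG smoothing along the wavelength axis; window/order are illustrative
smoothed = savgol_filter(raw, window_length=11, polyorder=2, axis=1)
composite = smoothed.mean(axis=0)            # average the six repeats

print(composite.shape)
```

The composite spectrum has one reflectance value per wavelength, matching the 1200-sample structure described above.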

Feature Selection
Because full-band hyperspectral data contain a large amount of redundant information and their dimensionality is usually high, the direct modeling of these data will likely cause problems such as poor computational efficiency and overfitting; thus, feature selection is required to lower the spectral dimensions of the data. Feature selection is the process of picking the most important attributes from a large number of options to create a more effective model [52,53]. The original spectral feature characteristics are unchanged; rather, an optimal subset of them is selected. Feature selection and machine learning algorithms are tightly linked, and feature selection strategies can be classed as filter, wrapper, or embedded based on a combination of the subset assessment criteria and subsequent learning algorithms [40,54].

• Filter feature selection

The filter model, as shown in Figure 2, uses feature selection as one of its preprocessing steps, with the evaluation of the selected feature subset based primarily on the characteristics of the data set itself [55]. As a result, the filter model is independent of the learning algorithm and directly evaluates the features, ranks them according to their importance, and then removes the features with low scores. We select the following three filtering feature selection algorithms.
The Chi-square test (CHI), a filtering method, is used to capture the relationship between each feature and the label [56]. By calculating the Chi-square statistic between each non-negative feature and the label, the features are ranked from highest to lowest according to the Chi-square statistic. The Chi-square statistic, proposed by the British statistician Pearson in 1900, measures the difference between the distribution of the observed data and a chosen expected or assumed distribution. It is obtained by dividing the square of the difference between the actual observed frequency (f_0) and the theoretical frequency (f_e) by the theoretical frequency and then summing over all categories:

χ² = Σ (f_0 − f_e)² / f_e

Linear regression (LR) determines which independent variables are correlated with dependent variables [57]. Linear regression is a common model in machine learning tasks; it is a broad concept that usually refers to using a set of independent variables to predict the dependent variable:

y_i = β_0 + β_1 x_1 + β_2 x_2 + … + β_n x_n + ε

where β_n is the coefficient of the regression model, y_i is the observation vector of the dependent variable, x_n is the observation value of the independent variables, and ε is a random error.

In addition, the linear regression model can also answer another question: among all the normalized independent variables, which variables are the most important and which are not. If features are essentially independent of each other, even the simplest linear regression models can achieve excellent results on low-noise data, or on data whose sample size significantly exceeds the number of features. The least squares method can be used to estimate β, and the features can then be ranked by the magnitude of their coefficients.
β̂ = (X^T X)^(−1) X^T Y

where β̂ is an estimate of the coefficients, β_n is the coefficient of the regression model, y_i is the observation vector of the dependent variable, X is the independent variable matrix, X^T is the transpose of X, and Y is the dependent variable matrix. The maximum information coefficient (MIC) is used to mine linear and nonlinear relations between continuous variables through discretization optimization with unequal intervals [58]; non-functional dependence between features can also be extensively mined. The calculation method of the maximum information coefficient is as follows.
Mutual information and meshing methods are used for the calculation. Mutual information can be regarded as the amount of information contained in one random variable about another random variable, or the reduction in the uncertainty of one random variable given knowledge of another. In this experiment, the mutual information I(F; R) of above-ground biomass (F) and spectral variables (R) is defined as

I(F; R) = Σ p(F, R) log_2 [ p(F, R) / (p(F) p(R)) ]

where p(F, R) is the joint probability density of F and R, and p(F) and p(R) are the marginal probability distribution densities of F and R, respectively. Consider F together with each R as a data set; the value range of F is divided into a intervals and the value range of R into b intervals, so that on the scatter plot of F − R all the points fall into an a × b grid. Different interval divisions of the data set yield different data distributions, and the maximum mutual information over all interval divisions is taken. The maximum information coefficient (MIC) is obtained after normalization, and its mathematical expression is

MIC(F; R) = max_{a×b < B(n)} [ I(F; R) / log_2 min(a, b) ], B(n) = n^0.6 (5)

where n is the data size. The maximum information coefficient is a standard for measuring the correlation between two variables (including linear and nonlinear correlation). According to the formula, its value lies between 0 and 1; the larger the value, the stronger the correlation.
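Filter-style scoring of this kind can be approximated with scikit-learn on synthetic data: `f_regression` gives a linear (LR-style) score, while `mutual_info_regression` captures nonlinear dependence (related in spirit to MIC, which scikit-learn does not implement; `chi2` additionally requires a non-negative feature matrix and a discrete label, so it is omitted here). The data and band indices below are invented for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
X = rng.random((120, 30))                    # 120 samples, 30 "bands"
# band 5 acts linearly on the target, band 12 nonlinearly
y = 3.0 * X[:, 5] + np.sin(6.0 * X[:, 12]) + 0.1 * rng.standard_normal(120)

f_scores, _ = f_regression(X, y)             # linear filter score per band
mi_scores = mutual_info_regression(X, y, random_state=0)

print(int(np.argmax(f_scores)))              # strongest linear band
print(int(np.argmax(mi_scores)))             # strongest dependence overall
```

The linear score ranks the linearly related band highest, while the mutual-information score also credits the nonlinear band.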

• Wrapped feature selection

Wrapped feature attribute selection is often based on a specific learning algorithm, such as regression; whether a feature attribute subset meets the data sample selection requirements is determined by its performance in that learning algorithm [59]. The worth of the feature subsets is determined directly by the regression, and the feature subsets that perform well in the training and learning process are selected first in this model. Figure 2 depicts the technique selection process.
The basic principle behind recursive feature elimination (RFE) is to iteratively create a model (for example, a support vector machine (SVM) or linear regression (LR) model), select the best features, and then repeat the process for the rest of the features until all the features have been explored [60]. In this process, features are arranged in the order in which they are deleted. Therefore, this is a greedy method to determine the best collection. Below, we use the classical SVM-based RFE algorithm (RFE_SVM) to discuss this algorithm.
Firstly, k features are input into the SVM regressor as the initial feature subset, the importance of each feature is calculated, and the prediction accuracy of the initial feature subset is obtained by cross-validation. Then, the feature with the lowest importance is removed from the current feature subset to obtain a new feature subset, which is input into the SVM regressor again to calculate the importance of each feature in the new subset, and the prediction accuracy of the new subset is again obtained by cross-validation. This recursion is repeated until the feature subset is empty. In this way, k feature subsets with different numbers of features are obtained, and the feature subset with the highest prediction accuracy is selected as the optimal feature combination [61]. The RFE algorithm based on the linear regression model (RFE_LR) involves a similar calculation process.
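A minimal sketch of the RFE_SVM procedure uses scikit-learn's `RFECV` with a linear-kernel SVR (which exposes `coef_`, letting RFE rank bands by |weight|); the cross-validation settings and synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((100, 20))                    # 20 candidate bands
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + 0.05 * rng.standard_normal(100)

# RFECV drops the weakest band each round and keeps the subset with the
# best 5-fold cross-validation score
selector = RFECV(SVR(kernel="linear"), step=1, cv=5).fit(X, y)

chosen = np.flatnonzero(selector.support_)
print(chosen.tolist())
```

The two informative bands survive the recursive elimination.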

• Embedded feature selection

Embedded feature selection mixes feature attribute selection with a learning algorithm that searches for the best set of feature attributes while constructing a data model [62]. Under the condition of a given regression, this method type blends the feature attribute space with the data model space and embeds the selection of feature attributes into the regression, while the corresponding learning algorithm still evaluates the feature attribute set. The embedded algorithm procedure is challenging; however, it combines the advantages of filtered and wrapped feature attribute selection, which helps to achieve both efficiency and accuracy.
Lasso regression is a variable selection technique proposed by Robert Tibshirani in 1996 [63]. Lasso is a contraction (shrinkage) estimation method. Its basic idea is to minimize the sum of squared residuals under the constraint that the sum of the absolute values of the regression coefficients is less than a constant, so that some regression coefficients shrink to exactly 0 and a more interpretable model is obtained. This shows clear advantages when processing a large number of variables or a sparse variable matrix. Lasso achieves compression estimation by constructing a penalty function that makes the coefficients of some variables equal to zero, which both simplifies the model and avoids overfitting [64]. The objective function of Lasso's least squares form is as follows.
Ridge modifies the loss function by adding the L2 norm of the coefficient vector. Ridge regression is a biased estimation regression method for collinear data analysis [65]. The role of Ridge is to improve the performance of the model while keeping all variables; that is, it builds a model using all variables and assigns each an importance. Because Ridge keeps all variables intact, Lasso does a better job of assigning importance to variables [66]. The coefficients of the Lasso (Ridge) regression equation were normalized to 0-1 to obtain the feature score.
lasso: min_{β_0, β} { (1/2N) Σ_{i=1}^{N} (y_i − β_0 − Σ_{j=1}^{P} x_ij β_j)² + λ Σ_{j=1}^{P} |β_j| }

ridge: min_{β_0, β} { (1/2N) Σ_{i=1}^{N} (y_i − β_0 − Σ_{j=1}^{P} x_ij β_j)² + λ Σ_{j=1}^{P} β_j² }

where β_0, β are the coefficients of the regression model, y_i is the observation vector of the dependent variable, x_ij is the observation matrix of the independent variables, N is the number of observed values, P is the number of features, and λ ≥ 0 is an adjustable parameter whose size is related to the sparsity of β. Lasso filters variables by adjusting the size of λ.
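Embedded selection with Lasso can be sketched with scikit-learn, where λ corresponds to the `alpha` parameter; its value and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.random((150, 25))                    # 25 candidate bands
y = 4.0 * X[:, 2] + 2.0 * X[:, 10] + 0.1 * rng.standard_normal(150)

# the L1 penalty (alpha plays the role of lambda) zeroes weak coefficients
model = Lasso(alpha=0.05).fit(X, y)
selected = np.flatnonzero(model.coef_)       # surviving bands

print(selected.tolist())
```

Raising `alpha` shrinks more coefficients to exactly zero, trading sparsity against fit, which is the filtering mechanism described above.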
The random forest (RF) algorithm was first proposed by Breiman and is a machine learning algorithm based on classification and regression trees (CART) [67]. It can analyze the importance of thousands of input features. The main idea is to integrate the results of multiple decision trees into an overall analysis, and the specific implementation process is as follows: (1) Construct the training sample sets: subsets of samples are randomly drawn with replacement from the original sample set to form a training sample set (Bootstrap method). N training sample sets are obtained after repeating this N times.
(2) Establish N CART decision trees: based on the samples in each training sample set, m features are first randomly selected from all M input features (node random splitting method), and nodes are then split according to the variance impurity index, calculated as

i(N) = Σ_{i≠j} P(w_i) P(w_j)

where P(w_i) is the frequency of class w_i samples among the total samples at node N, and i(N) is the variance impurity. A threshold is set for the decrease in variance impurity: if the decrease in variance impurity after a branch is less than this threshold, branching is stopped. At this point, the construction of the N decision trees is completed.
(3) Overall planning of the decision tree results. All the constructed decision trees are formed into a random forest, the random forest regressor is used to predict, and the final result is determined by combining the trees' outputs (voting for classification, averaging for regression).
Random forest can rank the importance of input features. During Bootstrap sampling, about one-third of the original data is not drawn; this is called out-of-bag (OOB) data. The OOB error generated by the OOB data can be used to calculate the importance of each input feature, from which feature selection can be carried out [68]. The feature importance assessment model is expressed as

FE = (1/N) Σ_{t=1}^{N} (E_{M_A,n}^t − E_{M_A,O}^t)

where FE is the feature importance, M is the total number of features, N is the total number of decision trees, E_{M_A,O}^t is the OOB error value of the t-th decision tree before adding noise to feature M_A, and E_{M_A,n}^t is the OOB error value of the t-th decision tree after adding noise to feature M_A. If the OOB error increases significantly and the accuracy loss is large after adding noise to feature M_A, the input feature is of high importance.
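A minimal sketch of random-forest importance with scikit-learn on synthetic data: `feature_importances_` gives the impurity-based ranking, while `permutation_importance` is the closer analogue of the OOB noise-addition test described above (the "noise" is applied by permuting a feature's values).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.random((200, 15))
y = 5.0 * X[:, 4] + 0.2 * rng.standard_normal(200)   # band 4 dominates

rf = RandomForestRegressor(n_estimators=100, oob_score=True,
                           random_state=0).fit(X, y)
# permuting a feature's values plays the role of "adding noise" to it
perm = permutation_importance(rf, X, y, n_repeats=5, random_state=0)

print(int(np.argmax(rf.feature_importances_)),
      int(np.argmax(perm.importances_mean)))
```

Both rankings single out the dominant band; permuting it sharply degrades the prediction, mirroring the OOB-error criterion.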

Methods of Model Construction
To evaluate and analyze the effectiveness of several feature selection algorithms for simulating natural alpine meadows, AGB prediction models were built using support vector machine (SVM), Gaussian process regression (GPR), least squares regression boosting (LSB), and artificial neural network (ANN) approaches.
SVM is an exceptionally strong and adaptable supervised learning technique for classification and regression [69]. SVM's purpose is to construct a classification hyperplane (a line or curve in 2D space; a surface or hypersurface in higher-dimensional space) that not only classifies each sample correctly but also separates the classes [70]. Support vector machines are essentially margin-maximization estimators, since the resulting classification hyperplane not only correctly classifies each sample but also makes the closest sample of each class as far away as possible from the classification boundary (i.e., the hyperplane).
Gaussian process regression (GPR) is a nonparametric regression model that places a Gaussian process (GP) prior on the regression function [71,72]. The GPR model assumptions include both noise (regression residuals) and a Gaussian process prior, and the model is solved using Bayesian inference [73]. In addition, GPR can provide posterior information about the predicted results, and this posterior has an analytic form when the likelihood is normally distributed. As a result, GPR is a generalizable and resolvable probabilistic model [74].
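A minimal GPR sketch with scikit-learn shows the posterior mean and standard deviation mentioned above; the RBF-plus-noise kernel and the one-dimensional synthetic data are illustrative assumptions, not the study's configuration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, (60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(60)

# RBF kernel with an explicit noise term; hyperparameters are fitted
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
gpr = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X, y)

mean, std = gpr.predict([[5.0]], return_std=True)    # posterior mean and std
print(float(mean[0]), float(std[0]))
```

The per-prediction standard deviation is the analytic posterior information that distinguishes GPR from the other models compared here.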
Predictive models built from weighted combinations of multiple individual regression trees are known as regression tree ensembles. Boosting is an ensemble learning approach that can transform a weak learner into a strong one. Least squares regression boosting (LSB) is a boosting method that uses least squares as the loss criterion [75]. The ensemble trees constructed by LSB select only one feature in each round to approximate the remaining residuals and fit the regression set well by minimizing the mean squared error.
Artificial neural networks (ANNs) are information processing systems based on emulating the structure and operations of neural networks in the brain, simulating neuronal processes with mathematical models [76,77]. The features of ANN include self-learning, self-organizing, self-adaptation, and strong nonlinear function approximation capabilities, as well as fault tolerance.
The machine learning algorithms were tested using a five-fold cross-validation method. In this approach, the data set was separated into five parts, with four serving as training data and one serving as test data for the test.
The coefficient of determination (R²) and root mean square error (RMSE) were used for accuracy evaluation. Higher R² values indicate more accurate modeling. The RMSE measures the model's predictive error; a smaller RMSE indicates a more accurate model.
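The five-fold cross-validation with R² and RMSE can be sketched as follows; the SVR model and synthetic data are placeholders for the AGB models, not the study's pipeline.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_validate
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((150, 10))
y = 3.0 * X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(150)

cv = KFold(n_splits=5, shuffle=True, random_state=0)   # five folds: 4 train, 1 test
scores = cross_validate(SVR(), X, y, cv=cv,
                        scoring=("r2", "neg_root_mean_squared_error"))

r2 = scores["test_r2"].mean()
rmse = -scores["test_neg_root_mean_squared_error"].mean()
print(round(float(r2), 3), round(float(rmse), 3))
```

Averaging the fold scores gives the R² and RMSE figures used to compare models.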

Interpretability Based on SHAP Values
The Shapley value was first proposed by Lloyd Shapley, a professor at the University of California, Los Angeles [79]. Its initial research objective was to help solve the problems of social contribution and the distribution of economic benefits in the process of economic cooperation or a game. For example, if N people cooperate, then because their abilities differ, each member's contribution will be different, and the distribution of benefits will differ accordingly. The optimal distribution is that each individual's benefit equals their individual contribution. The Shapley method quantifies the contribution rate and profit distribution: the Shapley value is the payoff an individual member earns on a project, equal to their own level of contribution. The Shapley Additive Explanations (SHAP) package is a model interpretation package developed in Python that can be used to interpret the output of any machine learning model [80]. The model interpretation method based on the Shapley value is a post hoc, model-independent interpretation method. As shown in Figure 3, model prediction and SHAP interpretation are two parallel processes, with SHAP interpreting the results of the model prediction. Its main principle is as follows: the SHAP value of a variable is the average marginal contribution of that variable over all possible feature orderings. SHAP constructs an additive explanatory model inspired by cooperative game theory, in which all variables are regarded as "contributors". The main advantage of the SHAP value is that it can reflect the influence of the independent variables in each sample, and it can also show whether that influence is positive or negative [81].
A significant advantage of some machine learning models is that they can quantify the importance of each independent variable to the model results. Variable importance has a variety of traditional calculation methods, but these methods can only tell which variables are important, not how a variable affects the forecasting result. The SHAP method provides another way to calculate variable importance: take the average of the absolute SHAP values of each variable as its importance and plot a standard bar chart, which simply and clearly shows the ranking of variable importance. Further, to understand how a single variable affects the output of the model, we can compare the SHAP value of the variable with the variable values of all the samples in the dataset. The SHAP package also provides extremely powerful data visualization capabilities to show the interpretation of a model or prediction.
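The underlying Shapley computation, the average marginal contribution over all feature orderings, can be illustrated exactly for a toy additive model with three hypothetical features (the model, weights, and feature names are invented for illustration; the SHAP package approximates this efficiently for real models).

```python
from itertools import permutations
from math import factorial

# toy additive "model": prediction is the sum of weights of present features
WEIGHTS = {"band_red": 2.0, "band_nir": 5.0, "season": 1.0}

def model(present):
    return sum(WEIGHTS[f] for f in present)

names = list(WEIGHTS)

def shapley(name):
    # average marginal contribution of `name` over all feature orderings
    total = 0.0
    for order in permutations(names):
        i = order.index(name)
        present_before = set(order[:i])
        total += model(present_before | {name}) - model(present_before)
    return total / factorial(len(names))

values = {n: shapley(n) for n in names}
print(values)   # for an additive model, each value equals its weight
```

The contributions sum exactly to the full-model prediction, which is the additivity property SHAP exploits.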
The complete data analysis process is shown in Figure 4.


Statistical Analysis of Ecological Data
The AGB of all 1200 samples was used to create a histogram (Figure 5); the data were found to be almost normally distributed, indicating that there were sufficient data points used in the experiment. The AGB of one sample was anomalously large (220 g), which made the right tail of the histogram longer; however, the difference was not significant. These findings are consistent with the expected behavior of natural grassland; that is, the biomass is concentrated in a range, and the proportion of large values is very small.
The mean AGB value was 80.5779 g, with a coefficient of variation of 0.358 based on a descriptive statistical analysis of the ground survey data (Table 1), indicating a large variation. The mean grass height was 18.578 cm, with a coefficient of variation of 0.372, also indicating significant fluctuation. To ensure a balanced data population, the same numbers of samples were grazed and not grazed, with the same number of samples taken for each growing season.
Box graphs were constructed showing samples from different categories (Figure 6). Average AGB and grass height both showed an increasing trend over time; however, the AGB increase between July and August was greater than that recorded from August to September. The maximum grass height and AGB values were both recorded in August. The AGB and grass height values in the not grazed area were higher than those in the grazed area. The analysis of variance (ANOVA) studies the contribution of variation from different sources to total variation to determine the influence of controllable factors on research results. A two-way ANOVA performed on AGB showed significance values of 0.000 (i.e., less than 0.01) for both the growing season and grazing situation variables, while the value for the interaction between the two was 0.012 (i.e., greater than 0.01). It can therefore be inferred that both the growing season and the grazing situation had a substantial impact on AGB, while their interaction was significant only at the 0.05 level.
Similarly, a two-factor ANOVA was conducted on grass height; the results indicate that growing season, grazing condition, and the interaction between the two all had a significant effect on grass height.
Scatter plots of AGB against grass height were drawn (Figure 7). The overall distribution of points is clumped, with some isolated points, suggesting no obvious strong linear relationship between AGB and grass height. Quantitative correlation analysis, however, showed a significant but moderate linear relationship between AGB and grass height (significance of 0.000), with a correlation coefficient of 0.424.
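The correlation check described above can be sketched as follows; the height/AGB pairs are synthetic and merely constructed to show a moderate linear link of the kind reported:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical paired samples: grass height (cm) and AGB (g) with a weak linear link.
height = rng.uniform(5, 35, size=200)
agb = 2.0 * height + rng.normal(0, 30, size=200)

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(height, agb)[0, 1]
```

A coefficient near 0.4, as in the paper, means grass height alone explains only a small fraction of AGB variance, which motivates the spectral models that follow.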

Spectral Data Analysis
The spectra were all plotted on a single graph (Figure 8). The grassland's spectral structure is typical of vegetation, with visible absorption valleys and reflection peaks. Because the range ~2300 to 2500 nm is near the edge of the spectrometer band, and the reflectance ranges of ~1340 to 1550 nm and ~1780 to 2050 nm are affected by the water absorption of the leaves, the noise in these bands is high; this affects the modeling, and thus these parts of the data were removed and the remaining bands were used for analysis.
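The band-removal step can be sketched as a wavelength mask; the 1 nm grid and the random reflectance values below are placeholders, not the instrument's actual sampling:

```python
import numpy as np

# Hypothetical wavelength grid (nm); the spectrometer's true sampling may differ.
wavelengths = np.arange(350, 2501)
reflectance = np.random.default_rng(1).random(wavelengths.size)

# Drop the water-absorption windows and the noisy sensor edge named in the text.
noisy = ((wavelengths >= 1340) & (wavelengths <= 1550) |
         (wavelengths >= 1780) & (wavelengths <= 2050) |
         (wavelengths >= 2300))
wavelengths_kept = wavelengths[~noisy]
reflectance_kept = reflectance[~noisy]
```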

Feature Selection
Each spectral parameter is given a score by the feature selection algorithm: the greater the score, the more important the parameter is. The top-ranking feature parameters are then chosen as the final parameters. The feature selection analysis was carried out on both the pure spectral and multi-source data.
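The score-then-rank procedure can be sketched as follows; the raw scores are hypothetical, and min-max rescaling to [0, 1] is assumed as the normalization that makes methods comparable:

```python
import numpy as np

def minmax(scores):
    """Rescale raw importance scores to [0, 1] before comparing methods."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo)

raw = np.array([0.3, 1.7, 0.9, 2.4, 0.3])   # hypothetical raw scores from one method
norm = minmax(raw)
top2 = np.argsort(norm)[::-1][:2]            # indices of the two highest-scoring features
```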

1. Scoring of features based on pure spectral data
The spectral data were processed using several feature selection methods to determine the score of each band (minimum 0 and maximum 1), the results of which are displayed in Figure 9. The details of the individual curves are as follows. (1) The overall Cor curve is reasonably smooth; neighboring band scores are close to each other, without discarding redundancy between variables, allowing band intervals with high correlation to be identified. (2) The linear scoring can quantify the linear relationship between spectra and AGB; however, it also singles out the correlation characteristics of all relevant features, i.e., numerous high scores emerge around each individual high score. For example, high scores near variable 200 are densely distributed, and the wide bar chart contains several adjacent bands. (3) The MIC and CHI curves are generally similar, capturing both linear and nonlinear relationships between the independent and response variables. The MIC curve, however, locally shows a major excursion compared to Cor. (4) The RFE_LR curve is based on linear regression model optimization, which is more prone to bipolar evaluation overall; however, the distribution of high and low scores is relatively uniform, and the feature selection performance is poor. (5) The RFE_SVM scores show considerable unpredictability and volatility, with visible valleys and a broad distribution of peaks. (6) The Lasso approach can successfully select a limited number of critical feature variables while allowing the scores of the majority of other features to converge to zero. It is particularly useful when the number of features needs to be reduced; however, it is not very effective when trying to interpret data. (7) The Ridge approach scores each linked variable equally, as shown in Figure 9, and the scores of close bands are extremely similar.
(8) After assigning the highest ratings to a few characteristics, the RF approach scores the other factors considerably lower. (9) Overall, when all of the features are averaged, most of the variables have values between 0.1 and 0.5.
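The behavior noted in item (6) — Lasso driving most coefficients exactly to zero — can be illustrated with a minimal coordinate-descent sketch. The data are synthetic (not the paper's spectra), and the solver is a bare-bones stand-in, not the implementation the authors used:

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator used in Lasso coordinate descent."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimal coordinate descent for (1/2n)||y - Xw||^2 + lam * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ w + X[:, j] * w[j]      # residual excluding feature j
            rho = X[:, j] @ r_j / n
            w[j] = soft(rho, lam) / (X[:, j] @ X[:, j] / n)
    return w

rng = np.random.default_rng(2)
n, p = 100, 20
X = rng.normal(size=(n, p))
# Only features 0 and 5 truly matter; Lasso should zero out most of the rest.
y = 3.0 * X[:, 0] - 2.0 * X[:, 5] + rng.normal(0, 0.5, size=n)
w = lasso_cd(X, y, lam=0.3)
```

The resulting weight vector is sparse: the two informative coefficients survive (slightly shrunk toward zero by the penalty), while the uninformative ones collapse to exactly zero.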

2. Selection of features based on pure spectral data
Variable selection is performed for spectral data according to the score, with the results displayed in Figure 10. The bands selected by Cor are clustered in the leftmost region of the plot because the algorithm scores this section of the spectrum the highest and ranks it first; as a result, this method, whose selections exhibit a high degree of multicollinearity, fails to select other crucial parts of the spectrum. The linear approach's chosen bands are notably dispersed. The MIC approach focuses on various nearby bands to choose features; thus, the results of this selection method are clustered. The RFE_LR approach's bands cover a wide range of frequencies, some of which are close to those of the linear approach. The RFE_SVM and Lasso selections produce comparable results, with dense band selection in certain regions and sparse selection in others. The Ridge findings are similar to those of Lasso, but they span fewer regions and have more continuous bands. The RF results are similar to those of the Lasso selection method but are less dispersed.
Figure 9. Results of feature scoring based on spectral data. The abscissa is the feature number and the ordinate is the score. MEAN shows the average value of all algorithms.

Multi-Source Data-Based Variable Score Analysis
The feature selection analysis was re-run after adding grass height, season, and grazing condition (i.e., grazing or non-grazing) to the pure spectral data to compose multisource data. Given the differences in data sources, the different methods showed substantial variability in their evaluation of the three newly added factors (Table 2). Both linear and RFE_LR approaches identify these features as worthless, whereas RFE_SVM identifies them as the most important. Based on the mean values of all methods, the season is the most important factor, which is consistent with the ecological experience of grass. This also demonstrates that the spectrum provides a more accurate representation of grass height and grazing. A temporal term such as season, however, cannot be directly represented by the spectrum without a priori understanding.

(1) Multi-source data-based feature scoring
In terms of variable selection, the CHI, MIC, and RF algorithms show the most significant differences in results when compared to the pure spectral data analysis model (Figure 11). All three algorithms treat the three newly added elements as important variables and effectively ignore the effect of the spectrum; the RF algorithm is prone to overfitting, its search converges too quickly, and the results are not ideal. The RFE_LR, RFE_SVM, and linear models are the most stable; in these approaches, the bands' fractional values change only slightly. The Lasso and Ridge approaches are stable apart from some changes in image form. The Lasso approach produces sparse models, which are beneficial for selecting feature subsets; however, Ridge performs more consistently than Lasso, since the valuable features tend to correspond to non-zero coefficients. As a result, the Ridge approach was deemed suitable for data comprehension.
Figure 11. Feature scoring based on multi-source data. The abscissa is the feature number, and the ordinate is the score.
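The observation that Ridge assigns nearly equal weights to strongly correlated variables can be illustrated with the closed-form ridge solution on two near-duplicate predictors. This is a synthetic sketch, not the paper's data or penalty setting:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
# Two almost identical (highly collinear) predictors, like adjacent spectral bands.
X = np.column_stack([x, x + rng.normal(0, 0.01, size=n)])
y = 4.0 * x + rng.normal(0, 0.1, size=n)

lam = 10.0
# Closed-form ridge solution: w = (X'X + lam*I)^{-1} X'y.
w = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
```

The L2 penalty spreads the signal almost evenly across the two collinear columns (each coefficient near 2), whereas Lasso would typically keep one column and zero the other.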
(2) Feature screening based on multi-source data
The results of variable screening utilizing the nine methods on multi-source data are depicted in Figure 12. The image reveals that CHI, MIC, and RF choose only the three newly added variables, indicating that these three variables really do matter and that these algorithms converge relatively quickly without taking spectral effects into account, compared to the other algorithms. Because the linear algorithm evaluates the relationship between each independent variable and the dependent variable in isolation, without considering interaction relationships, the results of the linear and RFE_LR approaches do not differ substantially; RFE_LR is an optimized upgrade of the linear approach with similar results. The RFE_SVM selection is essentially the original variable set plus the three significant variables. In contrast, the Lasso method introduces the three factors while drastically reducing the number of other spectral variables. The Ridge approach, like Lasso, was also able to decrease the number of variables.
In conclusion, among the nine feature selection algorithms, the Lasso algorithm performs well. Firstly, it does not ignore the spectral information as a result of data changes, unlike CHI, MIC, and RF (which select only three variables). Secondly, compared with the linear and RFE_LR approaches, in which a large number of highly correlated (adjacent) variables were selected, Lasso was able to suppress feature multicollinearity well (as shown in Figure 11, Lasso's score for most of the variables was always biased towards 0). Finally, the characteristics it screens, which are few in number but widely distributed, avoid the risk of over-selection. The variable screening results of RFE_SVM and Ridge are similar to those of Lasso, but both tend to screen adjacent spectra. Therefore, we improved RFE_SVM by performing a CHI calculation on its screened features, removing the redundant features with high linear correlation, and finally simplifying the feature set. The results are shown in Figure 13. The number of features is significantly reduced in the two different data sets, and the main variation area lies between variables 1200 and 1500. Combined with the RFE_SVM feature selection results discussed above, we know that the feature scores in this area are mostly high and fluctuate greatly.
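The redundancy-removal step described above uses a CHI calculation in the paper; a simple pairwise-correlation filter is a hedged stand-in for the same idea (drop a feature if it is nearly collinear with one already kept), not the authors' exact procedure:

```python
import numpy as np

def drop_redundant(X, threshold=0.95):
    """Greedy filter: keep a feature only if its |correlation| with every
    already-kept feature stays below the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    return kept

rng = np.random.default_rng(4)
base = rng.normal(size=(300, 3))
# Columns 3-5 are near-copies of columns 0-2, mimicking adjacent-band redundancy.
X = np.hstack([base, base + rng.normal(0, 0.05, size=base.shape)])
kept = drop_redundant(X, threshold=0.9)
```

Only the three independent columns survive; each near-duplicate is rejected because its correlation with an already-kept column exceeds the threshold.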

AGB Inversion Model Based on Machine Learning
Pure spectral data and multi-source spectral data after feature selection were used as independent variables in the AGB prediction models. SVM, GPR, and integrated tree regression (LSB) models were built with the observed AGB as the response variable. Table 3 illustrates the characteristics of each model based on purely spectral data, demonstrating that only the RFE_SVM and Lasso approaches were potentially useful, as the R² values of all other approaches were too small. The six effective models are quite similar, with the Lasso approach combined with GPR achieving the best inversion, with an R² value of 0.18 and an RMSE of 26.132. In terms of the multi-source data, all the models except linear and RFE_LR achieved better results than those of the pure spectral data, with a significant improvement in R² values. Although the CHI, MIC, Ridge, RF, and Mean methods showed improved results, the main feature variables they selected were grass height, season, and grazing condition, and thus they did not fully exploit the available spectral information. In general, only the RFE_SVM and Lasso approaches achieved good results across different data types. In terms of the three regression algorithms, GPR predicted the best results in most cases; however, the results were generally very similar to those of LSB and slightly better than those of SVM. This indicates that no machine learning regression algorithm has a clear advantage, and the differences in results between the algorithms are limited.
The improved RFE_SVM-CHI algorithm was compared with RFE_SVM. For the spectral data, the performance of RFE_SVM-CHI decreased relative to RFE_SVM in the two regression models other than LSB, indicating that eliminating some variables inevitably discards potentially useful spectral information while reducing multicollinearity. For the multi-source data, RFE_SVM-CHI performed better than RFE_SVM in the two regression models other than SVM; the effect in GPR improved significantly, approaching that of the best model (Lasso-GPR).
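The R² and RMSE figures used to compare the models follow the standard definitions; a minimal sketch with made-up predictions (not the paper's results):

```python
import numpy as np

def r2_rmse(y_true, y_pred):
    """Coefficient of determination and root-mean-square error."""
    resid = y_true - y_pred
    rmse = np.sqrt(np.mean(resid ** 2))
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - np.sum(resid ** 2) / ss_tot
    return r2, rmse

# Hypothetical observed vs. predicted AGB values (g).
y_true = np.array([60.0, 75.0, 90.0, 105.0])
y_pred = np.array([65.0, 70.0, 95.0, 100.0])
r2, rmse = r2_rmse(y_true, y_pred)
```

RMSE shares the units of AGB (grams), which is why an RMSE of about 26 g should be read against the mean AGB of roughly 80 g.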

AGB Inversion Model Based on Deep Learning Neural Networks
To further investigate the role of grass height, growing season, and grazing condition, an ANN model was constructed in the AGB inversion model using different combinations of variables for comparative analysis. A basic neural network model was built using the MATLAB neural network toolbox; a Sigmoid activation function was used in the hidden layer, and Identity was used as the activation function of the output layer, as shown in Figure 14. Figure 15 shows the detailed results for the data after performing Lasso feature selection. The training set, test set, and validation set for the neural network account for 70%, 15%, and 15% of the total data, respectively.
In the case of the pure spectral data (Figure 15a), the test set R-value of the ANN model is 0.465, i.e., the linear correlation between the predicted and actual AGB is 0.465; this is slightly lower than the training set R-value but higher than that of the validation set, indicating that the model achieved a good tuning effect. Considering both the spectra-seasonal data (Figure 15b) and the spectra-seasonal-grass height data (Figure 15c), adding a seasonal factor to the spectra produced a significant improvement, with a test set R-value of 0.7. Adding the grass height factor did not significantly improve the test R-value; however, the overall R-value improved from 0.75 to 0.79, which enhanced the robustness of the model. When all the factors were added (Figure 15d), the test set R-value remained around 0.7.
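The ANN setup described above (sigmoid hidden layer, identity output, 70/15/15 split) can be sketched structurally; the weights below are random, untrained placeholders, and the layer width is an assumption, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(5)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(X, W1, b1, W2, b2):
    """One hidden layer with a sigmoid activation and an identity (linear)
    output, matching the architecture described for the ANN in the text."""
    h = sigmoid(X @ W1 + b1)
    return h @ W2 + b2

n, p, hidden = 200, 10, 8
X = rng.normal(size=(n, p))
W1 = rng.normal(scale=0.5, size=(p, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.5, size=(hidden, 1)); b2 = np.zeros(1)
y_hat = mlp_forward(X, W1, b1, W2, b2)

# 70/15/15 split of the sample indices, as used for the neural network.
idx = rng.permutation(n)
train, val, test = np.split(idx, [int(0.7 * n), int(0.85 * n)])
```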

Interpretability of AGB Inversion Model Based on SHAP
Based on the results of the Lasso-GPR model in Table 3, SHAP was established for interpretability analysis.
(1) The effect of features on AGB in aggregate data
(2) The influence of features on AGB in single data
Figure 17 uses SHAP to visualize the influence of each feature in record 40 on AGB (restricted by the static graph display, only features with a large impact can be seen). A red feature indicates that AGB increases, and a blue feature indicates that AGB decreases; the size of the increase or decrease is the SHAP value, i.e., the width of each feature on the graph. The f(x) value of 43.06 at the red-blue boundary represents the AGB value of 43.06 g in this record, and the base value above the blue segment on the right represents the mean AGB of all data. From left to right, it can be clearly seen that (i) the value of band 1124 is about 0.487, resulting in an increase of 1.43 g in AGB; (ii) the season value of 7 (July) explains an AGB reduction of 21 g; (iii) the graze score of 1 (grazing) indicates that AGB decreases by 7 g compared with no grazing. Under the combined action of all the characteristics, the AGB of this record was much lower than the average, with the seasonal factor, the grazing factor, and band 763 having the greatest influence.
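The force-plot reading above relies on SHAP's additivity property: the base value plus the per-feature SHAP values equals the model prediction. A toy numeric check, in which the base value approximates the reported mean AGB and the contributions not individually quoted in the text are lumped into one hypothetical remainder term:

```python
import numpy as np

# SHAP additivity: prediction = base value + sum of per-feature SHAP values.
base_value = 80.6                 # hypothetical mean AGB of all records (g)
shap_values = np.array([
    1.43,    # band 1124 contribution quoted in the text
    -21.0,   # season (July) contribution quoted in the text
    -7.0,    # grazing contribution quoted in the text
    -10.97,  # hypothetical remainder lumping all other band contributions
])
prediction = base_value + shap_values.sum()
```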

For comparison, record 800 was selected for the same analysis. At this point, the f(x) value of 89.13 at the red-blue boundary indicates that the AGB value in this record is 89.13 g, and the base value is located in the red segment on the left. It can be seen from the figure that (i) when the graze score is 0 (no grazing), AGB increases by 2 g; (ii) bands 763, 1752, and 2208 explain an increase of 11 g in AGB; (iii) the season value of 9 (September) increases AGB; (iv) bands 359 and 2204 together reduce AGB by 9 g. This record ended up with a slightly larger AGB than the average.
Comparing the two records, it can be found that the same band plays different roles in different records; for example, band 763 contributes to reducing AGB in record 40, while the opposite holds in record 800.

Discussion
The study's findings reveal that, for high-dimensional spectral datasets, each of the three feature selection algorithms exhibits its own set of properties (Table 4) [82][83][84]. Our findings show that the RFE_SVM-CHI and Lasso feature selection methods can significantly reduce the number of redundant variables while retaining the useful feature variables; thus, the resulting variables can be confidently selected as important factors in the inversion of machine learning models. Not only does the optimal selection of variables have a stronger correlation with AGB, but it can also reduce the complexity of the model training samples while enhancing prediction accuracy and capability. The feature selection algorithms, on the other hand, have several flaws. Although the R² and RMSE values of the models built using the Lasso and RFE_SVM approaches are fairly close, the variables screened by these two techniques are very different. To identify better variables, the similarities and differences between the variable subsets must be evaluated further. In addition, we found that the results of a feature selection algorithm are not necessarily optimal; for example, using the CHI algorithm to further process the results obtained by RFE_SVM improved the effect for some models. That is, different feature selection algorithms can be combined to achieve the best result.
In this study, we constructed models based on diverse data and feature selection methods and tried to extract knowledge about the relationship between spectral and ecological indicators from the hyperspectral AGB inversion of grassland. After creating multi-source data by incorporating the grass height, growing season, and grazing condition factors, the model differs significantly from the pure spectral model. First, the applicability of the feature selection methods differed greatly across algorithms. The MIC and RF feature selection algorithms converged too soon, picking only three variables and neglecting the importance of the spectral variables. The Lasso and RFE_SVM approaches, on the other hand, added the three variables while retaining some spectral properties. This demonstrates that the spectrum can, to some extent, reflect the information contained in the three variables; however, not all feature selection techniques can exploit this. Second, a comparison of the machine learning inversion model outcomes demonstrates that adding at least one variable, i.e., the growing season, can produce excellent results, and adding all three additional variables can make the model more resilient. This outcome allowed us to deduce the variables' importance and function in this study. Finally, the spectra are utilized as a link between ecological experiments and remote sensing models to better comprehend the underlying mechanisms; it is much easier to conceptually grasp the model's inherent meaning and ecological value in this way than with the typical inverted logic of merely creating mathematical models to enhance accuracy [85,86].
In the AGB inversion studied in this paper, the neural network-based deep learning regression model was roughly as effective as the optimal traditional machine learning regression model. Machine learning techniques are commonly applied in vegetation spectral surveys, and while they can efficiently handle multicollinearity among the independent variables, the full validity of the models requires proper feature engineering. Deep learning models, by contrast, combine feature extraction and performance evaluation to automatically perform optimal training. In general, of the two modeling approaches, deep learning models outperform machine learning models in terms of their statistical results and offer greater elasticity and generalization in their predictive power [87,88]. However, after proper feature engineering, the results of machine learning and deep learning can be roughly the same, demonstrating the potential of the feature engineering approach. In addition, deep learning is a "black box", and the "inside" of deep networks remains poorly understood [89]. Hyperparameter and network design choices are also conceptually challenging due to the lack of a theoretical foundation [90].
Interpretability has always been a key and difficult problem in machine learning research. In the field of ecological remote sensing, we not only look forward to improved results but also hope that the model itself can provide new value for a better understanding of nature. Some of the feature selection approaches adopted in this paper can determine the importance of different features, but they do not reveal the specific role those features play. By constructing an interpretable model based on SHAP, we found that SHAP values also evaluate features, in a way that differs from the sorting and scoring of feature selection, and describe the distribution of the features' influence across the samples. It is even possible to analyze a single sample and visualize the distinct effect of each feature, reflecting how effects differ between samples, which is valuable for physiological analysis. However, the fundamental interpretability of spectra comes from the different responses of light at different wavelengths to different vegetation traits, which we have not studied here.
Data, algorithm, and model form a trinity in data mining; in essence, however, both the algorithm and the model serve the data, which is explored thoroughly by continuously altering both the algorithm and the model [91,92]. The data augment the supervised experience with prior information, the model reduces the size of the hypothesis space with prior knowledge, and the algorithm adjusts the search for the best hypothesis in a given hypothesis space with prior knowledge [93]. A major challenge in quantitative remote sensing inversion is that the applied parameters do not fully reflect the major factors affecting the remote sensing signal but merely produce weak indications of the underlying phenomena [30,94]. Although data mining inversion techniques are more accurate, they also contain more parameters or hyperparameters and typically require large-scale, sophisticated training. The optimal algorithm should strike a balance between simulation accuracy, number of training parameters, and training duration. Using typical samples, this study achieved excellent accuracy. In future work, we propose to tune the relevant machine learning parameters in terms of the AGB–spectral response mechanism, better interpret the intrinsic connection between AGB and spectra, and, on this basis, establish a connection with existing multispectral remote sensing systems; we anticipate that the combination of these elements will provide a scientific basis for larger-scale AGB studies of alpine grasslands on the Tibetan Plateau.

Conclusions
The results show that the Lasso and RFE_SVM feature selection algorithms can quickly and accurately select feature spectra, effectively improving the representativeness and completeness of the spectral feature data, with the resulting model built on the screened feature variables showing good predictive performance and high robustness. These algorithms have significant potential in feature variable screening and qualitative analysis.
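The two selection strategies can be sketched with scikit-learn on synthetic data standing in for a hyperspectral reflectance matrix; the plot count, band count, and informative-band indices below are invented for illustration, not the study's data.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

# Synthetic stand-in for a reflectance matrix: 120 plots x 50 bands,
# with AGB driven by three informative bands (5, 20, 35).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 50))
y = 2.0 * X[:, 5] - 1.5 * X[:, 20] + 0.8 * X[:, 35] + rng.normal(scale=0.1, size=120)

# Lasso: the L1 penalty shrinks uninformative band coefficients to zero.
lasso = Lasso(alpha=0.05).fit(X, y)
lasso_bands = np.flatnonzero(lasso.coef_)

# RFE_SVM: recursively eliminate the band with the smallest |SVR coefficient|
# (a linear kernel is required so that coef_ is available for ranking).
rfe = RFE(SVR(kernel="linear"), n_features_to_select=3).fit(X, y)
rfe_bands = np.flatnonzero(rfe.support_)

print(sorted(lasso_bands), sorted(rfe_bands))
```

Both methods search for a compact subset of bands, but by different mechanisms: Lasso selects globally in one penalized fit, while RFE_SVM ranks and prunes iteratively, which is why the paper compares them.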
Machine learning with appropriate feature engineering reduces the training time of the AGB prediction model and greatly enhances its accuracy and predictive power, making it comparable to deep learning regression methods. When examining the impact of several data sources on the model, adding only the season to the neural network prediction model yields higher accuracy; adding the other two components does not significantly improve the model's accuracy, but it does improve its robustness.
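Combining spectral features with the growing season amounts to appending encoded season information to the band features before regression. A minimal sketch, assuming one-hot-encoded months (July–September) appended to a matrix of selected bands and fitted with a GPR, as in the paper's best model; all dimensions and values are synthetic:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical setup: 90 plots, 10 selected feature bands, each plot
# sampled in one month of the main growing season (7 = July ... 9 = September).
rng = np.random.default_rng(1)
bands = rng.normal(size=(90, 10))
month = rng.integers(7, 10, size=90)

# One-hot encode the month and append it to the spectra, so the
# regressor can learn a season-dependent AGB response.
season = np.eye(3)[month - 7]          # columns: July, August, September
X = np.hstack([bands, season])         # 10 spectral + 3 season features
agb = bands[:, 0] + 0.5 * (month - 7) + rng.normal(scale=0.1, size=90)

gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gpr.fit(X, agb)
r2 = gpr.score(X, agb)                 # training-set R^2
print(X.shape, round(r2, 3))
```

Because the synthetic AGB here depends on the month, the season columns carry signal the bands alone lack, which is the mechanism by which adding the growing season improves inversion accuracy.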
The SHAP-based interpretable model gives the importance ranking of features and shows their overall role. At the same time, the influence of the same feature may differ across records, reflected on the one hand in the direction of action (making the dependent variable increase or decrease) and on the other hand in the magnitude of the effect.

Data Availability Statement: The digital elevation model (DEM) data used in this paper are publicly available at https://www.gscloud.cn/sources/index?pid=302 (accessed on 1 December 2021).