2.3.1. Feature Extraction of Ground-Measured Data
(1) Spectral Derivative
A spectral transformation is the deep processing of grass species spectral data using mathematical transformations. After transformation, spectral features are enhanced, which amplifies the spectral differences between grass species and improves their separability. Common spectral transformation methods include spectral derivative transformations and continuum removal transformations. The first-order derivative (FD) spectrum is the first derivative of the grass species spectral data (i.e., the slope of the grass species spectral curve), which reflects the rate of change of the raw data and enhances the differences between grass species. Equation (3) shows the calculation formula of the FD spectrum [29]:

$$\mathrm{FD}(\lambda_i) = \frac{R(\lambda_{i+1}) - R(\lambda_{i-1})}{\Delta\lambda} \tag{3}$$

where $\mathrm{FD}(\lambda_i)$ is the value of the FD spectrum at point $i$, $\lambda_i$ is the midpoint of band $i-1$ and band $i+1$, $R(\lambda_{i+1})$ is the reflectance at band $i+1$, $R(\lambda_{i-1})$ is the reflectance at band $i-1$, and $\Delta\lambda$ is the wavelength distance between band $i-1$ and band $i+1$.
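The first-derivative transform above can be sketched in NumPy; the function name `first_derivative` is a hypothetical helper, not from the study:

```python
import numpy as np

def first_derivative(wavelengths, reflectance):
    """First-order derivative (FD) spectrum: the reflectance difference
    between bands i+1 and i-1 divided by their wavelength distance."""
    wavelengths = np.asarray(wavelengths, dtype=float)
    reflectance = np.asarray(reflectance, dtype=float)
    fd = (reflectance[2:] - reflectance[:-2]) / (wavelengths[2:] - wavelengths[:-2])
    # FD is defined at the midpoints of bands i-1 and i+1
    midpoints = (wavelengths[2:] + wavelengths[:-2]) / 2
    return midpoints, fd

# Example: a linear spectrum has a constant first derivative
wl = np.arange(400, 410)          # 1 nm sampling
refl = 0.02 * wl                  # slope of 0.02 per nm
mid, fd = first_derivative(wl, refl)
print(np.allclose(fd, 0.02))      # True
```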
(2) Continuum Removal
Continuum removal, also known as envelope removal, normalizes spectral data so that the spectral absorption features of different grass species can be compared under one criterion. The continuum removal process enhances and expands the absorption features of the grass species spectral data, thereby highlighting the differences in spectral features between grass species [18]:

$$\mathrm{CR}(\lambda_i) = \frac{R(\lambda_i)}{R_c(\lambda_i)} \tag{4}$$

$$R_c(\lambda_i) = R(\lambda_s) + k(\lambda_i - \lambda_s) \tag{5}$$

$$k = \frac{R(\lambda_e) - R(\lambda_s)}{\lambda_e - \lambda_s} \tag{6}$$

where $\mathrm{CR}(\lambda_i)$ is the continuum-removed value at band $\lambda_i$, $R(\lambda_i)$ is the reflectance at band $\lambda_i$, $R_c(\lambda_i)$ is the hull (continuum) value at band $\lambda_i$, $k$ is the slope of the hull from the selected absorption start point to the end point, $R(\lambda_s)$ is the reflectance at the absorption start point, $R(\lambda_e)$ is the reflectance at the absorption end point, and $\lambda_s$ and $\lambda_e$ are the wavelengths corresponding to the absorption start and end points, respectively.
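For a single absorption feature with known start and end wavelengths, continuum removal reduces to dividing by the straight hull line joining the two endpoints. A minimal sketch (the function name and the synthetic spectrum are illustrative, not from the study):

```python
import numpy as np

def continuum_removed(wavelengths, reflectance, start, end):
    """Continuum removal over one absorption feature: divide reflectance
    by the straight line (local hull) joining the start and end points."""
    wl = np.asarray(wavelengths, dtype=float)
    r = np.asarray(reflectance, dtype=float)
    mask = (wl >= start) & (wl <= end)
    r_s, r_e = r[wl == start][0], r[wl == end][0]
    k = (r_e - r_s) / (end - start)          # slope of the hull line
    hull = r_s + k * (wl[mask] - start)      # continuum value at each band
    return wl[mask], r[mask] / hull          # CR = R / hull

wl = np.array([550., 600., 650., 700.])
r = np.array([0.30, 0.10, 0.20, 0.40])      # absorption dip between the endpoints
_, cr = continuum_removed(wl, r, 550., 700.)
print(cr)   # endpoints are ~1.0; interior values < 1 inside the dip
```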
(3) Spectral Characteristic Parameters
The spectral dimension information of the measured spectral data is extremely rich, which makes rational and effective use of the data possible; at the same time, however, it produces a large amount of redundant data, which increases the computational burden, reduces efficiency in application and is not conducive to the identification of grass species. Therefore, 6 three-edge parameters, 4 peak-valley parameters, 3 three-edge area parameters and 7 spectral characteristic parameters were selected for calculation and analysis in this study (Table 1).
The amplitude of each of the three edges is the maximum FD value within the corresponding range, the position is the wavelength at which that maximum occurs, and the area is the sum of the FD values within the range. The blue edge corresponds to 490–530 nm, the yellow edge corresponds to 560–640 nm, and the red edge corresponds to 680–760 nm.
The chlorophyll contained in plant leaves strongly absorbs most of the energy of red and blue light. The “green peak and red valley” in the visible range is the key difference between the raw spectrum of vegetation and that of other ground objects. The green peak amplitude is the maximum reflectance in the range of 510–560 nm, the green peak position is the wavelength corresponding to the green peak amplitude, the red valley amplitude is the minimum reflectance in the range of 640–680 nm, and the red valley position is the wavelength corresponding to the red valley amplitude.
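The edge and peak-valley parameters above can be sketched as follows; the helper names and the synthetic spectrum are illustrative, and the FD here is approximated with `np.gradient`:

```python
import numpy as np

def edge_params(wl, fd, lo, hi):
    """Amplitude, position, and area of one 'edge' from the FD spectrum."""
    m = (wl >= lo) & (wl <= hi)
    amp = fd[m].max()                 # edge amplitude: max FD in the range
    pos = wl[m][np.argmax(fd[m])]     # edge position: wavelength of that max
    area = fd[m].sum()                # edge area: sum of FD values
    return amp, pos, area

def peak_valley_params(wl, refl):
    """Green-peak and red-valley parameters from the raw spectrum."""
    g = (wl >= 510) & (wl <= 560)
    r = (wl >= 640) & (wl <= 680)
    return {
        "green_peak_amp": refl[g].max(),
        "green_peak_pos": wl[g][np.argmax(refl[g])],
        "red_valley_amp": refl[r].min(),
        "red_valley_pos": wl[r][np.argmin(refl[r])],
    }

wl = np.arange(400, 801, 10, dtype=float)
refl = 0.2 + 0.1 * np.sin((wl - 400) / 60.0)     # synthetic spectrum
amp, pos, area = edge_params(wl, np.gradient(refl, wl), 490, 530)  # blue edge
p = peak_valley_params(wl, refl)
print(490 <= pos <= 530, 640 <= p["red_valley_pos"] <= 680)   # True True
```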
(4) Vegetation Index
The vegetation index method calculates the spectral reflectance of a specific band and amplifies the difference in spectral characteristics between grass species in the form of a vegetation index. The normalized difference vegetation index (NDVI), ratio vegetation index (RVI), difference vegetation index (DVI), modified soil adjusted vegetation index (MSAVI), transformed vegetation index (TVI) and renormalized difference vegetation index (RDVI) were selected for research. As shown in
Table 2, the vegetation indices of the 6 grass species were calculated according to the corresponding formulas. The red and near-infrared (NIR) bands in the table are obtained by averaging the spectral reflectance within a certain range, where the red band corresponds to 620–670 nm and the NIR band corresponds to 841–876 nm.
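The band averaging and a subset of the indices can be sketched as follows; the formulas used are the standard definitions of NDVI, RVI, DVI and MSAVI (Table 2 itself is not reproduced here), and the function name and idealized spectrum are illustrative:

```python
import numpy as np

def vegetation_indices(wl, refl):
    """NDVI, RVI, DVI and MSAVI from band-averaged red/NIR reflectance."""
    red = refl[(wl >= 620) & (wl <= 670)].mean()   # band average, 620-670 nm
    nir = refl[(wl >= 841) & (wl <= 876)].mean()   # band average, 841-876 nm
    ndvi = (nir - red) / (nir + red)
    rvi = nir / red
    dvi = nir - red
    msavi = (2 * nir + 1 - np.sqrt((2 * nir + 1) ** 2 - 8 * (nir - red))) / 2
    return {"NDVI": ndvi, "RVI": rvi, "DVI": dvi, "MSAVI": msavi}

wl = np.arange(400, 1001, 1, dtype=float)
refl = np.where(wl < 700, 0.05, 0.45)   # idealized vegetation spectrum
vi = vegetation_indices(wl, refl)
print(round(vi["NDVI"], 2))   # 0.8
```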
2.3.2. Establishment of Grassland Species Recognition Models
Grassland species recognition in the study area was conducted based on ground-measured spectral data. First, the feature extraction method based on principal component analysis (PCA) was used to reduce the dimensions of the grass species’ spectral data, FD and CR data and to build datasets. Then, the algorithms of RF, SVM, BP neural network and CNN were used to establish the recognition models that are applicable to the measured spectral data in order to identify and analyze the grass species and to study high-precision recognition methods.
(1) Feature Extraction Based on PCA
PCA is a multivariate statistical analysis method that uses the internal structural relationships of high-dimensional raw data to transform multiple indicators into a few mutually independent composite indicators in a low-dimensional space while retaining most of the information of the original indicators. In applying PCA, multiple independent variables in the raw spectrum are linearly combined based on the principle of maximum variance, and the original high-dimensional variables are replaced by the converted low-dimensional variables. Usually, the first few principal components that represent T% of all feature information (T is generally higher than 80%) are selected to reduce the dimensionality of the data. PCA not only replaces the original indicators in the high-dimensional space with a few uncorrelated principal component factors, reducing the crossover and redundancy of the raw information, but also eliminates the need to manually determine weights: the internal structural relationships between the indicators are obtained from the raw data, and the weights follow from the correlation and variability between the indicators, making the calculation results more scientific and reliable [40].
Hyperspectral data have many raw wave bands and high dimensionality. The FD and CR datasets can be calculated from the raw spectrum, which further enriches the spectral information of the data and highlights the differences in spectral features among grass species. The PCA method can extract the most effective features from the high-dimensional spectral information, which is conducive to the efficient recognition of grassland plant species.
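The PCA reduction described above can be sketched with scikit-learn, where a float `n_components` keeps the smallest number of components reaching the chosen cumulative explained variance; the 256 × 2151 matrix of random values stands in for the real spectral dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the spectra matrix: 256 samples x 2151 bands
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2151))

# Keep the first k components whose cumulative explained variance
# reaches the chosen threshold (the study uses 90%)
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_.sum() >= 0.90, X_reduced.shape[0])  # True 256
```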
(2) Establishment of SVM Models
The SVM algorithm is based on Vapnik–Chervonenkis dimension theory and the principle of structural risk minimization in statistics. It improves on traditional machine learning models by avoiding the curse of dimensionality when computing with high-dimensional data during learning and training, so it demonstrates excellent generalization ability and robustness for pattern recognition and can be applied to small-sample learning [41].
The core of SVM is to maximize the classification margin: the data are transferred to a high-dimensional space via a nonlinear mapping defined by an inner-product kernel function, and the optimal hyperplane for dividing that space is established. The calculation formula is shown in (7) [41]:

$$\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t. } y_i(w \cdot x_i + b) \ge 1, \; i = 1, \dots, n \tag{7}$$

The dot product in the optimal hyperplane is represented by the inner product $K(x_i, x_j)$; that is, the data are mapped to a new space, and then the optimization function is calculated, as shown in (9) [41]:

$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t. } \sum_{i=1}^{n} \alpha_i y_i = 0, \; \alpha_i \ge 0 \tag{9}$$

where $\alpha_i$ is the Lagrange multiplier, which corresponds to the samples one by one. The optimal solution of this quadratic function is calculated, and the formula of the optimal classification function is obtained, as shown in (12) [41]:

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b\right) \tag{12}$$

where $b$ denotes the classification threshold and $K(x_i, x)$ denotes the kernel function. Three main types of kernel function are widely used in SVM algorithms: radial basis functions, polynomial kernel functions and neural network kernel functions.
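A kernel SVM of the kind described above can be sketched with scikit-learn's `SVC`; the synthetic 256-sample, 6-class dataset and the hyperparameter values are illustrative stand-ins, not the study's data or settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical stand-in for the PCA-reduced spectral dataset:
# 256 samples, 6 grass-species classes
X, y = make_classification(n_samples=256, n_features=20, n_informative=15,
                           n_classes=6, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# RBF (radial basis function) kernel, one of the three kernel types above
clf = SVC(kernel="rbf", C=10, gamma="scale").fit(X_tr, y_tr)
print(0.0 <= clf.score(X_va, y_va) <= 1.0)   # True
```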
(3) Establishment of BP Neural Network Models
The BP neural network is a multilayer feedforward neural network trained with the backpropagation algorithm. It has strong self-learning and adaptive capabilities. It takes the network error as the objective function and minimizes it with a gradient-based search, which enhances the classification ability of the network. Damage to some neurons of a BP neural network does not have a fatal impact on the global learning and training of the network; it can still operate normally, giving it a certain degree of fault tolerance. It generally consists of three layers: an input layer, a hidden layer and an output layer. The hidden layer lies between the input and output layers and has no direct connection to the outside, but the number of its nodes strongly influences the network's accuracy. Therefore, choosing a reasonable number of neurons in the output and hidden layers is beneficial for training the network and improving recognition accuracy [42].
The computation of the BP neural network can be divided into two stages: the forward propagation of the raw data and the backpropagation of the error. In forward propagation, the input data enter the network from the input layer and pass to the hidden layer; after layer-by-layer processing in the hidden layer, the results are delivered to the output layer. In this process, the neurons in each layer only influence the neurons in the next layer. When the output of the output layer does not reach the desired state, the network enters the backpropagation training process: the error of each layer is propagated back along the original path, and the weights of each neuron are corrected so that the error is minimized.
The calculation formula is shown in (13) [42]. $X = (x_1, x_2, \dots, x_n)$ represents the input of the neural network, and the input of each neuron in the hidden layer can be expressed as follows:

$$S_j = \sum_{i=1}^{n} w_{ij} x_i - \theta_j \tag{13}$$

where $w_{ij}$ denotes the weight between the $i$-th input layer neuron and the $j$-th hidden layer neuron, $\theta_j$ denotes the threshold of the $j$-th hidden layer neuron, and both $w_{ij}$ and $\theta_j$ are adjusted in the direction of error reduction during the training process.
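The forward pass of such a three-layer network can be sketched in NumPy; sigmoid activations and the layer sizes are assumptions for illustration, and the threshold is subtracted from the weighted sum as in the formula above:

```python
import numpy as np

def forward(x, w_ih, theta_h, w_ho, theta_o):
    """One forward pass of a 3-layer BP network (sigmoid activations).
    Each hidden-neuron input is sum_i(w_ij * x_i) - theta_j."""
    sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
    hidden = sigmoid(x @ w_ih - theta_h)     # input layer -> hidden layer
    return sigmoid(hidden @ w_ho - theta_o)  # hidden layer -> output layer

rng = np.random.default_rng(0)
x = rng.normal(size=5)                       # 5 input features
out = forward(x, rng.normal(size=(5, 8)), rng.normal(size=8),
              rng.normal(size=(8, 3)), rng.normal(size=3))
print(out.shape)   # (3,) -- one sigmoid output per output-layer neuron
```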
(4) Establishment of CNN Models
CNN is a type of neural network for deep supervised learning in the field of machine learning [43,44]. It imitates the signal processing of the visual neural system to learn deep features. On the one hand, via parameter sharing and sparse connections, it solves the problem of the large number of parameters in traditional neural networks and facilitates network optimization; on the other hand, it simplifies the model structure and reduces the risk of overfitting [44].
CNN generally includes an input layer, convolutional layers, pooling layers, fully connected layers and an objective function. Training proceeds in two stages: forward propagation and backpropagation. In forward propagation, modules such as convolution and pooling layers are stacked to gradually learn features of the input data and pass them to the final objective function. In backpropagation, the objective function measures the difference between the real and predicted values and is used to update the weights and biases.
The input layer transmits the raw data into the neural network. The convolution layer is the core of the entire convolutional neural network: the network realizes dimensionality reduction and feature extraction of the data via the convolution operation. The calculation formula of the convolution layer is shown in (14) [44]:

$$x_j^l = f\left(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\right) \tag{14}$$

where $x_j^l$ is the feature mapping of the current layer, $f$ is the nonlinear activation function, $x_i^{l-1}$ is the feature mapping of the upper layer, $M_j$ is the set of feature mappings, $k_{ij}^l$ is the weight corresponding to the filter at $(i, j)$ in the $l$-th layer, and $b_j^l$ is the bias.
The nonlinear activation function $f$ processes the linear output of the upper layer and enhances the characterization ability of the model. The commonly used activation functions in the activation layer are ReLU, tanh and sigmoid, which are calculated by Equations (15)–(17) [44], respectively:

$$f(x) = \max(0, x) \tag{15}$$

$$f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \tag{16}$$

$$f(x) = \frac{1}{1 + e^{-x}} \tag{17}$$
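The three standard activation functions named above (ReLU, tanh, sigmoid) can be written directly in NumPy:

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)              # ReLU: max(0, x)
tanh = lambda x: np.tanh(x)                      # hyperbolic tangent
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))     # logistic sigmoid

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))       # [0. 0. 2.]
print(sigmoid(0.0))  # 0.5
```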
The pooling layer, also known as the downsampling or subsampling layer, effectively reduces the amount of computation, improves network efficiency, enhances generalization ability and prevents overfitting. Commonly used pooling operations include maximum pooling and average pooling, calculated by (18) and (19) [45], respectively:

$$p_{\max} = \max_{i \in R_k} a_i \tag{18}$$

$$p_{\mathrm{avg}} = \frac{1}{|R_k|} \sum_{i \in R_k} a_i \tag{19}$$

where $p_{\max}$ denotes max pooling, i.e., the maximum value of the pooled domain; $p_{\mathrm{avg}}$ denotes average pooling, i.e., the average value of the pooled domain; $a_i$ is the activation value in the pooled domain; and $R_k$ is the corresponding domain on the feature map.
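Both pooling operations above can be sketched for the 1D case with non-overlapping windows; the helper name is illustrative:

```python
import numpy as np

def pool1d(a, size, mode="max"):
    """Non-overlapping 1D pooling over windows of `size` elements."""
    a = np.asarray(a, dtype=float)
    windows = a[: len(a) // size * size].reshape(-1, size)  # pooled domains
    return windows.max(axis=1) if mode == "max" else windows.mean(axis=1)

a = [1.0, 3.0, 2.0, 8.0, 5.0, 5.0]
print(pool1d(a, 2, "max"))    # [3. 8. 5.]
print(pool1d(a, 2, "mean"))   # [2. 5. 5.]
```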
The fully connected layer combines the features transmitted from the lower layers, such as the convolution and pooling layers, to obtain high-level features and complete the classification task.
The objective function, also called the cost function or loss function, guides the training of the entire convolutional neural network. It generally includes a cost term and regularization. The network's backpropagation uses the errors between the real and predicted values of the samples to learn and adjust the network parameters.
The 1D CNN model set up in this study contains 6 convolutional layers and 3 pooling layers; finally, a softmax classifier completes the recognition. Because a 1D CNN can only extract the spectral features of a single layer, detailed information is lost; thus, the 1D CNN model was improved. By upsampling and merging pooling layers of different dimensions, multiscale features are fused, and an attention mechanism (AM) is introduced to redistribute the weights of the different layers to obtain more spectral feature information, thus improving the recognition performance of the 1D CNN. The improved network structure is shown in Figure 4.
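A plain 6-convolution/3-pooling 1D CNN of the kind described above can be sketched in PyTorch (an assumed framework). This is not the authors' architecture: the multiscale fusion, attention mechanism and all layer sizes here are illustrative, and the 200-band input length is a placeholder:

```python
import torch
import torch.nn as nn

class Simple1DCNN(nn.Module):
    """Minimal sketch: 6 conv layers, 3 max-pool layers, softmax over 6 classes."""
    def __init__(self, n_bands=200, n_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 16, 3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.classifier = nn.Linear(64 * (n_bands // 8), n_classes)

    def forward(self, x):                        # x: (batch, 1, n_bands)
        z = self.features(x).flatten(1)
        return torch.softmax(self.classifier(z), dim=1)

out = Simple1DCNN()(torch.randn(4, 1, 200))
print(out.shape)   # torch.Size([4, 6]); each row of probabilities sums to 1
```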
(5) Dataset Construction
The input data applied to the models are derived from field-measured data: the spectral characteristic parameters + vegetation indices, the spectral FD data and the CR data are calculated from the grass species spectra. PCA was carried out on the three sets of data, and the first k components were selected so that the cumulative contribution rate reached more than 90%, yielding the spectral characteristic parameters + vegetation index dataset, the FD dataset and the CR dataset. Each dataset contains 256 samples. The three datasets are input into the four recognition models, and each model automatically splits and shuffles the input dataset, using 70% of the data of each grass species as training samples and 30% as validation samples, that is, a total of 179 labeled samples as the training set and 77 samples (with labels withheld) as the validation set. The validation set is used to validate and compare the performance of the individual models in identifying unknown spectral data.
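The per-species 70/30 shuffle-and-split described above can be reproduced with a stratified split; the class sizes below are hypothetical (only the 256-sample total and the 179/77 split follow from the text):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 256 samples across 6 grass species (hypothetical class sizes)
X = np.arange(256, dtype=float).reshape(256, 1)
y = np.repeat(np.arange(6), [43, 43, 43, 43, 42, 42])

# Shuffle and split 70/30 within each species, as described above
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, shuffle=True, random_state=0)
print(len(X_tr), len(X_va))   # 179 77
```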
(6) Accuracy Assessment
The overall accuracy (OA), producer’s accuracy (PA) and average accuracy (AA) are used to evaluate the recognition accuracy of each model [
9].
The OA is calculated by Equation (20); it represents the ratio of the number of samples $N_c$ correctly classified by the model to the total number of samples $N$:

$$\mathrm{OA} = \frac{N_c}{N} \tag{20}$$

The OA characterizes the recognition performance of the entire classifier and reflects the overall classification accuracy of the model and the overall error of the classification results. However, it is greatly affected by the sample distribution: when the distribution is unbalanced and one category has too many samples, the OA of the classification results is significantly affected.
The PA is calculated by Equation (21). It is the ratio of the number of samples $N_{ci}$ correctly classified in category $i$ to the total number of samples $N_i$ in that category, that is, the probability that the predicted category of any random validation sample is consistent with its real category, reflecting the accuracy corresponding to each category:

$$\mathrm{PA}_i = \frac{N_{ci}}{N_i} \tag{21}$$
The AA is shown in Equation (22). It is the average of the classification accuracies of the individual categories; it represents the average percentage of correctly classified samples, giving each category the same weight:

$$\mathrm{AA} = \frac{1}{m} \sum_{i=1}^{m} \mathrm{PA}_i \tag{22}$$

where $m$ is the number of categories.
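The three metrics defined above can be computed directly from predicted and true labels; the function name and the toy labels are illustrative:

```python
import numpy as np

def oa_pa_aa(y_true, y_pred, n_classes):
    """Overall accuracy, per-class producer's accuracy, average accuracy."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    oa = np.mean(y_true == y_pred)                       # correct / total
    pa = np.array([np.mean(y_pred[y_true == c] == c)     # per-class accuracy
                   for c in range(n_classes)])
    aa = pa.mean()                                       # mean of the PAs
    return oa, pa, aa

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
oa, pa, aa = oa_pa_aa(y_true, y_pred, 3)
print(round(oa, 3), round(aa, 3))   # 0.833 0.889
```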