1. Introduction
Natural grasslands play a critical role in maintaining the ecological balance of global terrestrial ecosystems [1], accounting for more than 30% of the total terrestrial ecosystem area. However, under global climate change and human activities, more than half of the world's grasslands are severely threatened by desertification [2,3]. The Inner Mongolia Autonomous Region has the largest proportion of grasslands in China, with a total natural grassland area of 8.6 × 10^11 m^2, approximately 90% of which is severely degraded [4,5]. Desert grasslands are representative of the degradation process from grassland to desert, which not only alters the original grassland communities and reduces biodiversity, but also severely impairs normal grassland ecosystem functions such as climate regulation, soil conservation, and biodiversity maintenance [6,7,8]. The degradation of desert grasslands can be accurately evaluated and managed by studying the classification of their plant taxa.
Currently, most traditional grassland surveys are conducted manually in the field. Although this method is accurate, it is time-consuming and cannot be extended to cover large areas [9]. To achieve long-term, rapid monitoring of grassland features over large areas, researchers have turned to satellite remote sensing. Although satellite remote sensing has become an essential tool for grassland monitoring because of its large spatial scale and its ability to capture the spatial and temporal dynamics of grasslands, the spatial resolution of satellite imagery is relatively low. It can only accurately identify vegetation at large spatial scales, and the spatial and spectral features of the small- and medium-sized vegetation typical of desert grasslands are submerged in mixed pixels. Moreover, because satellites are constrained by their orbits around the Earth, the revisit interval between repeated observations is too long [10,11,12]. Therefore, more sophisticated remote sensing equipment needs to be deployed to achieve a finer classification of desert grassland vegetation.
In recent years, with the continuous development of unmanned aerial vehicle (UAV) technology, UAVs have become well known to the general public for their simple operation, low cost, and access to areas that are difficult for humans to reach [13]. Advances in optical technology have led to portable hyperspectral imagers that offer higher spatial and spectral resolutions and richer continuous spectral bands than satellite remote sensing, providing higher recognition accuracy for delineating fine features. In contrast to traditional RGB color images, hyperspectral images can reveal hidden features within invisible bands, which are crucial for the classification and monitoring of desert grassland plants. The capability of hyperspectral imaging to distinguish and capture the spectral properties of matter in minute detail makes it a powerful tool in fields such as ecology, agriculture, and environmental science. Using UAVs as platforms to carry portable hyperspectral imagers, the two technologies complement each other to form low-altitude UAV remote-sensing platforms [14,15,16,17]. Such platforms are now widely used in vegetation cover calculation [18], precision agricultural management [19], leaf area monitoring [20], and vegetation condition monitoring [21,22], among other applications.
In hyperspectral remote sensing image processing, the vegetation index method is commonly used to compute numerical indicators from the reflectance or radiance of features in remotely sensed images. It is used to assess the growth status of vegetation and vegetation cover and to monitor vegetation changes on the land surface [23,24,25,26]. These vegetation indices are dimensionless values [27]. By computing vegetation indices on hyperspectral images, the most appropriate separability threshold for each feature can be determined from the results, thereby classifying the image features. The most widely used vegetation indices include the Normalized Difference Vegetation Index (NDVI) [28], Ratio Vegetation Index (RVI) [29], Difference Vegetation Index (DVI) [30], and Soil-Adjusted Vegetation Index (SAVI) [31], among others. Researchers have improved these commonly used vegetation indices and explored several practical applications. Ref. [32] studied the leaf area index of winter wheat in arid areas and used first- and second-order differential preprocessing to construct two- and three-dimensional vegetation indices from arbitrary waveband combinations; the results showed that the correlation between the combined-band vegetation indices and the leaf area index was significantly improved. Ref. [33] constructed a microplaque index threshold (MPI-T) to address the difficulty of distinguishing desert grassland rat holes with the NDVI and SAVI, achieving positive recognition results. However, the vegetation index method has limitations and cannot fully exploit the rich waveband information in hyperspectral image data.
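As a concrete illustration of the index family discussed above, the NDVI and SAVI can be computed per pixel from red and near-infrared reflectance. The sketch below is a minimal NumPy implementation of the textbook formulas; the reflectance values are made up for illustration and are not data from this study:

```python
import numpy as np

def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - R) / (NIR + R)."""
    return (nir - red) / (nir + red + 1e-10)  # epsilon avoids division by zero

def savi(nir, red, L=0.5):
    """Soil-Adjusted Vegetation Index with soil-brightness correction factor L."""
    return (1 + L) * (nir - red) / (nir + red + L)

# Toy reflectance values: one vegetated pixel, one bare-soil pixel.
nir = np.array([0.45, 0.25])
red = np.array([0.05, 0.20])
print(ndvi(nir, red))  # vegetated pixel near 0.8, soil pixel near 0.11
print(savi(nir, red))
```

A separability threshold on such an index map (e.g., NDVI above some cutoff marks vegetation) is what turns the dimensionless values into a feature classification.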
With the emergence of big data and advancements in computer technology, machine learning and deep learning techniques have developed rapidly, and researchers have widely applied them to grassland monitoring and classification. In a study on desert grasslands, ref. [34] achieved an overall classification accuracy of 91.06% using the random forest algorithm to classify grassland vegetation. However, for hyperspectral images, machine learning methods require the manual extraction and analysis of image features, which is time-consuming and labor-intensive. In deep learning, convolutional neural networks (CNNs) are among the most widely used and representative algorithms. CNNs consist of convolutional layers for feature extraction and sampling layers for feature processing, an "end-to-end" learning approach that distinguishes them from machine learning algorithms [35]. Ref. [36] used a multilayer feature fusion 2D convolutional neural network (MFF-2DCNN) to identify micropatches on the surface of desert grasslands, achieving a high classification accuracy for rat holes and bare soil. However, a 2D-CNN cannot capture spectral information effectively in hyperspectral image extraction tasks, as it destroys the 3D structure of the image data. To address this issue, some researchers have applied three-dimensional convolution (3D-CNN) to hyperspectral images. Ref. [37] classified the vegetation and bare soil in desert grasslands by constructing and continuously optimizing a 3D-CNN model. The network models developed in these studies have shown promising results in classifying desert grassland features. However, they did not consider memory consumption, which poses considerable challenges for future deployment on mobile devices and for the rapid monitoring of desert grassland degradation. Moreover, there is currently no sufficiently detailed method for selecting hyperspectral band data for desert grasslands. To address data redundancy, Principal Component Analysis (PCA) is often used to reduce the dimensionality of the image data; however, PCA reorganizes the original image features [38,39,40], and bands whose spectral curves fluctuate substantially owing to undesirable noise are sometimes simply discarded [41]. As a result, optimal bands cannot be selected to simplify the hyperspectral data, which complicates subsequent processing. Additionally, 3D convolutional operations are computationally demanding and involve numerous training parameters, exacerbating these problems. Therefore, there is an urgent need for methods that enable data dimensionality reduction and the construction of lightweight network models to achieve efficient and accurate grassland monitoring.
To solve these problems, this study used a UAV hyperspectral remote sensing system to collect hyperspectral data on vegetation in a desert grassland in the Inner Mongolia Autonomous Region. A convolutional neural network model was proposed based on feature enhancement, which was applied to vegetation plant taxa classification. The most accurate vegetation species classification model was obtained through data, model, and parameter optimization. This study aimed to provide a new method for achieving the efficient and high-precision dynamic monitoring of desert grassland species by constructing a streamlined 2D-CNN classification model. The main contributions of this study were as follows:
- (1)
Building on an improved depth-separable convolution that strengthens the nonlinear fitting ability of the model, this study proposed a streamlined 2D-CNN (SL-CNN) model for desert grassland plant taxa classification. This model effectively explored lightweight convolution in desert grassland species classification research and could achieve efficient, high-precision monitoring of grassland species.
- (2)
The model used improved convolutional block attention (CBAM-F) to effectively focus on important channel features and key spatial information and improved the model’s feature refinement capability by adaptively learning feature map channels and spatial relationships. It was combined with residual block convolution (RBC-F) to fuse the feature data and improve the model classification performance.
- (3)
Using the variance and Frobenius norm2 (F-norm2) feature band selection methods, we could efficiently reduce the dimensionality of the data, enhance the computational efficiency of the model, retain the information important for classification tasks, and effectively alleviate data redundancy in hyperspectral images.
4. Results and Discussion
This experiment used the TensorFlow-GPU deep learning framework and the Python programming language under Windows 10, with an NVIDIA RTX 3060 GPU (6 GB of video memory), an AMD R7-5800H CPU, and 16 GB of RAM. The model with the best performance on the validation set during training was saved. The overall classification accuracy (OA), average accuracy (AA), single-feature accuracy, test loss, and training time were used as evaluation metrics for model classification. The initial network parameters were set as follows: the sliding window size was 7 × 7; the loss function was the cross-entropy loss; the optimizer was Adam; the initial learning rate was 0.001; the number of epochs was 50; and the batch size was 64. The hyperspectral images were downscaled to 51 bands using PCA.
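For reference, reducing a hyperspectral cube to a fixed number of bands with PCA along the spectral axis can be sketched as below. This is a generic eigendecomposition-based implementation operating on a random dummy cube, not the study's exact pipeline:

```python
import numpy as np

def pca_reduce(cube, n_components=51):
    """Project a (H, W, B) hyperspectral cube onto its first
    n_components principal components along the spectral axis."""
    h, w, b = cube.shape
    X = cube.reshape(-1, b).astype(np.float64)
    X -= X.mean(axis=0)                      # center each band
    cov = X.T @ X / (X.shape[0] - 1)         # B x B spectral covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return (X @ eigvecs[:, order]).reshape(h, w, n_components)

cube = np.random.rand(20, 20, 120)           # dummy 120-band cube
reduced = pca_reduce(cube, 51)
print(reduced.shape)                          # (20, 20, 51)
```

Note that, as discussed in Section 4.1, each output "band" is a linear recombination of all original bands, which is precisely why PCA cannot preserve the physical meaning of individual wavelengths.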
4.1. Waveband Processing
Hyperspectral image data contain hundreds of consecutive spectral bands that provide rich spectral and spatial information [46]. However, a higher number of bands leads to increased inter-band correlation, data redundancy, and computational cost, which can result in the Hughes phenomenon [47]. Therefore, reducing the dimensionality of the data is necessary, but the choice of dimensionality reduction method affects the experimental results. In this experiment, a band selection algorithm combining the within-band variance with the Frobenius norm2 [48] (F-norm2) was compared with principal component analysis (PCA), a standard dimensionality reduction algorithm for hyperspectral images, to select the processing method yielding the most accurate classification with the initial network model.
Variance [49] is typically used to describe the degree of deviation among data points of a random variable. The F-norm2 is used to describe the distances between unrelated n-dimensional variables. In this experiment, the variance of each spectral band was used to describe the degree of dispersion of the information content among the spectral bands: the more significant the difference in variance values, the more dispersed the information. The F-norm2 value describes the amount of information in each spectral band: the larger the F-norm2 value, the richer the information content. The variance is calculated as shown in Equation (7):

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2 \tag{7}$$

where $\sigma^2$ denotes the band variance, $N$ the number of pixels in a single band, $x_i$ the pixel value, and $\bar{x}$ the mean of the pixel values in a single band.

The F-norm2 is calculated as shown in Equation (8):

$$\left\| A \right\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{k=1}^{b} a_{ijk}^2 \tag{8}$$

where $A$ is the image tensor, $m$ the number of rows (samples), $n$ the number of columns (lines), $b$ the number of bands, and $a_{ijk}$ the element of $A$ at row $i$, column $j$, and band $k$.
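The per-band quantities behind Equations (7) and (8) take only a few lines of NumPy. The sketch below also ranks bands by a combined score obtained by normalizing and summing the two measures; that combination rule and the random cube are illustrative assumptions, since the study selects bands by inspecting the two curves (Figure 10) rather than by a fixed formula:

```python
import numpy as np

def band_variance(cube):
    """Per-band variance (Equation (7)) of an (H, W, B) cube."""
    return cube.reshape(-1, cube.shape[-1]).var(axis=0)

def band_fnorm2(cube):
    """Per-band squared Frobenius norm (Equation (8)): sum of squared pixel values."""
    return (cube.astype(np.float64) ** 2).sum(axis=(0, 1))

cube = np.random.rand(10, 10, 30)            # dummy 30-band cube
var = band_variance(cube)
fn2 = band_fnorm2(cube)
# Illustrative ranking: normalize both scores and sum them.
score = var / var.max() + fn2 / fn2.max()
best = np.argsort(score)[::-1][:5]           # indices of the 5 top-scoring bands
print(best)
```

Unlike PCA, this keeps original bands intact, so the selected wavelengths retain their physical interpretation.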
Figure 10a shows the normalized F-norm2 values, and Figure 10b shows the results of the within-band variance operation.
Figure 10a shows that, before 677 nm (band 96) and after 751 nm (band 166), the value decreases even as the number of bands increases further. In the intermediate range, the number of bands is relatively small, but the value increases sharply, indicating that the information content rises sharply there. In Figure 10b, there is a decline from 689 nm (band 126) to 713 nm (band 136), a turnaround compared with the bands before and after, indicating that the information in this range is relatively stable and concentrated. In summary, band division was conducted by taking bands 126–136 as the center and extending to the left and right in increments. The experimental results are listed in Table 2.
Table 2 shows that all four categories of bands achieved good performance, and the training time increased with the number of bands. The first and fourth categories had greater classification accuracy, but the overall accuracy difference was insignificant. Considering time costs and the redundancy of band information, the first category should be selected as the input band set for the subsequent model. Under the same training conditions, the PCA dimensionality reduction method was used to select the first 11 principal components (cumulative contribution rate of 99.10%), and the full-waveband image was also used for training; comparison with the first-category results showed that the first category performed best. In addition, an overall analysis of the results in Table 2 shows that the model using the full-band image had the longest training time and the lowest accuracy, with poor overall performance, which verifies the necessity of band selection for hyperspectral images. Therefore, bands 126–136 were selected as the model input bands.
4.2. Parameter Optimization
4.2.1. Window Size Selection
The larger the window size, the more texture information the image patch contains, but also the greater the information redundancy. To investigate the optimal window size for this model, five window sizes (5, 7, 9, 11, and 13) were compared in the experiment. The results are presented in Figure 11.
Figure 11 shows that, as the window size increased, both the model's OA and training time increased, but at different rates. When the window size was 11, the growth in both values was the smallest: the OA reached 99.143% and the training time was 428 s. Therefore, for practicality, a window size of 11 was selected as the model input.
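The sliding-window input described here can be illustrated with a patch extraction routine that cuts a window × window × bands cube around every labeled pixel. This is a generic sketch on dummy data; the zero-padding strategy for edge pixels is an assumption, not necessarily the study's exact implementation:

```python
import numpy as np

def extract_patches(cube, labels, window=11):
    """Cut a (window x window x B) patch centered on every labeled pixel.
    The cube is zero-padded so edge pixels also receive full-size patches."""
    pad = window // 2
    padded = np.pad(cube, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    patches, targets = [], []
    for r, c in zip(*np.nonzero(labels)):    # labels > 0 marks annotated pixels
        patches.append(padded[r:r + window, c:c + window, :])
        targets.append(labels[r, c])
    return np.stack(patches), np.array(targets)

cube = np.random.rand(16, 16, 11)            # dummy 11-band cube
labels = np.zeros((16, 16), dtype=int)
labels[3, 4] = 1                             # two annotated pixels, classes 1 and 2
labels[10, 12] = 2
X, y = extract_patches(cube, labels, window=11)
print(X.shape, y)                             # (2, 11, 11, 11) [1 2]
```

The trade-off analyzed above follows directly from this construction: a larger window quadratically inflates each training sample while adding mostly peripheral context.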
4.2.2. Learning Rate Selection
The learning rate is an essential factor affecting the speed of model construction. If it is set too large, the loss can explode; if too small, the loss decreases slowly. Three learning rates decreasing in gradient steps (0.01, 0.001, and 0.0001) were used to determine the most appropriate value. To prevent the learning rate from decreasing too rapidly along this gradient, an additional learning rate (0.0004) was added as a control. The experimental results are presented in Figure 12.
Figure 12 shows that the model training time generally tended to increase and then decrease, and the overall classification accuracy reached its maximum at a learning rate of 0.001. Therefore, the learning rate for the model input was set to 0.001.
4.2.3. Batch Size Optimization
The batch size setting significantly affects the optimization of the constructed model and the memory usage of the computer. If the batch size is set too small, the gradient will be unstable, and the model will struggle to converge. If it is set too large, the same amount of data will be processed faster, but more epochs will be required to achieve the same accuracy, and the model can easily fall into a local optimum. Four batch sizes (32, 64, 128, and 256) were compared, and the classification results are shown in Figure 13.
Figure 13 shows that, as the batch size increased, the overall classification accuracy and training time of the model gradually decreased. The model performed well at batch sizes of 32 and 64. Compared with a batch size of 32, the overall classification accuracy at a batch size of 64 decreased by 0.065%, but the training efficiency increased by nearly 51.2%, which better meets practical demands. Therefore, a batch size of 64 was selected for the model input.
4.2.4. Optimization of the Number of Base Blocks
After setting these parameters, we compared four numbers of base blocks (2, 3, 4, and 5) to investigate their effects on this experiment. The classification performance results are listed in Table 3.
As shown in Table 3, model accuracy increased with the number of base blocks, while the memory footprint and total parameters of the generated model increased at double or higher rates. When the number of base blocks was four or five, the overall classification accuracy was high, with the latter exceeding the former by 0.126%. However, the former required only 46.40% of the training time and 27.58% of the generated-model memory of the latter. Therefore, we selected four base blocks for the model structure.
4.3. Comparison of Ablation Experiments
Ablation experiments were conducted to investigate the effectiveness of each module in the SL-CNN model. Five evaluation metrics were selected: overall accuracy (OA), average accuracy (AA), kappa, test loss, and mean F1 score. The experimental results are listed in Table 4. As shown in the table, the SL-CNN model outperformed the single-module variants in all respects, particularly in OA and kappa, with improvements of 0.216% and 0.349, respectively, over the single RBC-F module, and of 0.359% and 0.581, respectively, over the single CBAM-F module. For the AA, test loss, and F1 score, the RBC-F and CBAM-F modules performed similarly when used alone, but both were inferior to the SL-CNN model combining the two. Based on these results, adding both modules improved the classification performance of the model.
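For clarity, the OA, AA, and kappa metrics used throughout this section can all be derived from a confusion matrix. The sketch below uses the standard definitions; the matrix values are made up for illustration:

```python
import numpy as np

def classification_metrics(cm):
    """Overall accuracy, average (per-class) accuracy, and Cohen's kappa
    from a confusion matrix whose rows are true classes, columns predictions."""
    cm = cm.astype(np.float64)
    n = cm.sum()
    oa = np.trace(cm) / n                                  # fraction correct
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))             # mean per-class recall
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

cm = np.array([[50, 2, 0],
               [3, 45, 2],
               [0, 1, 47]])
oa, aa, kappa = classification_metrics(cm)
print(f"OA={oa:.4f} AA={aa:.4f} kappa={kappa:.4f}")
```

Kappa discounts agreement expected by chance, which is why it can separate models whose raw OA values look nearly identical.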
4.4. Experimental Results
The SL-CNN model was obtained by continuously comparing and optimizing the initial network model structure and four operational parameters, further improving the performance and accuracy of the network model. After applying the band selection and model optimization techniques to the hyperspectral images, the SL-CNN model achieved an increase of 0.46% in overall accuracy (OA) over the initial model with a window size of 7, and a decrease of 61 s in training time relative to the initial model with a window size of 11. Therefore, the band selection method and parameter optimization used in this study were confirmed to improve the classification performance on desert grassland hyperspectral images and to accelerate model construction.
To verify the validity of the SL-CNN model, four widely used hyperspectral classification algorithms were selected for a comparative study, namely ResNet34, GoogLeNet, DenseNet121, and MLP. In addition, to verify the advantages of the improved depth-separable convolution, the SL-CNN model was re-convolved with conventional convolution to generate a 2D-CNN model. All the classification algorithms were executed in the same programming environment with the same data preprocessing to ensure experimental reliability. The single-feature recognition accuracies are shown in the confusion matrix (Figure 14), and Table 5 presents the overall results.
As shown in Figure 14, the SL-CNN model constructed in this study had the best overall performance, with recognition accuracies of 99.56%, 99.31%, 98.40%, and 96.49% for Features 1, 2, 3, and 4, respectively, indicating a high capability for grassland feature extraction. As shown in Table 5, regarding overall classification performance, the SL-CNN model achieved kappa, OA, and AA values of 98.735, 99.216%, and 98.442%, respectively. Its training time (367 s) and generated-model memory (16.3 MB) were the lowest among all the models, and the total parameters run during model construction occupied 4.73 MB of memory. These results show that the SL-CNN had a high generalization ability and can be applied to desert grassland feature classification tasks.
4.5. Discussion
As shown in Figure 14 and Table 5, except for the Multilayer Perceptron (MLP) and GoogLeNet models, which recognized Feature 4 poorly, all the models achieved high recognition accuracies (above 90%) for the remaining features. ResNet34 was closest to the SL-CNN model in single-feature recognition accuracy, but the other evaluation indices differed significantly: its kappa, OA, and AA values were lower than those of the SL-CNN model by 0.662, 0.409%, and 1.633%, respectively. Moreover, the SL-CNN model's training time, generated-model memory, and total parameters were only 18.21%, 6.60%, and 5.81% of those of ResNet34, respectively. GoogLeNet classified Artemisia frigida and Bare Soil accurately, but its accuracy for Feature 3 was 3.58% lower than that of the SL-CNN model. Its generated model occupied 93.4 MB and its total parameters 28.76 MB of memory, which the SL-CNN model reduced by 82.55% and 83.56%, respectively. DenseNet121 had single-feature classification accuracies approximately similar to those of ResNet34 and occupied approximately the same memory as GoogLeNet; however, its training time was 72.487% that of ResNet34. The MLP had the lowest classification accuracy among all the models, with an AA value of 72.487%. The detailed analysis showed that the MLP, as a fully connected network with a simple structure and limited feature extraction ability, yielded the lowest classification accuracy for the fine features of desert grasslands, although its training time was shorter.
In contrast, GoogLeNet used multiple parallel convolutional branches to capture grassland features at different scales and levels, which enriched the model structure and network depth and improved its expressive ability. However, the model complexity was not high and the features were not fully extracted, so the classification accuracy was limited. ResNet34 and DenseNet121 used a residual structure and a dense connection structure, respectively, which increased the complexity and depth of the network, addressed gradient vanishing and information loss to the greatest extent, and improved fine-grained classification performance. However, they also introduced more operational parameters, increasing the model construction time and memory requirements. The SL-CNN model differed from these four conventional models, especially ResNet34 and DenseNet121. On the basis of the improved depth-separable convolution, it constructed the CBAM-F feature refinement module by transforming the Shared MLP module in CBAM attention into 2D-CNN convolution. Additionally, SL-CNN made full use of the residual structure to construct the residual block convolution feature enhancement module. These three elements worked in synergy to produce a lightweight design with a distinctive feature extraction capability, allowing the SL-CNN model to significantly reduce the model parameters while maintaining high-precision image classification, effectively improving its memory efficiency and training speed. In summary, merely increasing network depth, adding parallel structures, or using residual structures is not fully applicable to fine-grained desert grassland feature classification, and the model structure should be optimized and adjusted appropriately.
The 2D-CNN and SL-CNN differed only in the convolution method, so their classification accuracies were similar. However, the SL-CNN model's training time, generated-model memory, and total parameters for the model-building run were only 65.88%, 29.21%, and 26.41% of those of the 2D-CNN, respectively. This indicates that the improved depth-separable convolution is a necessary choice of convolutional approach.
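The parameter savings of depth-separable over standard convolution can be checked with simple arithmetic. The sketch below compares a hypothetical 3 × 3 layer mapping 64 to 128 channels; the channel counts are illustrative, not the SL-CNN's actual layer sizes:

```python
def standard_conv_params(k, c_in, c_out, bias=True):
    """Parameter count of a k x k standard 2D convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def separable_conv_params(k, c_in, c_out, bias=True):
    """Depthwise (one k x k filter per input channel) plus pointwise (1 x 1)
    convolution, the factorization behind depth-separable convolution."""
    depthwise = k * k * c_in + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise

# Example layer: 3 x 3 kernel, 64 -> 128 channels, no bias terms.
std = standard_conv_params(3, 64, 128, bias=False)   # 9 * 64 * 128 = 73728
sep = separable_conv_params(3, 64, 128, bias=False)  # 576 + 8192 = 8768
print(std, sep, f"{sep / std:.1%}")
```

For this example the separable layer needs roughly 12% of the standard layer's parameters, the same order of reduction reflected in the memory figures above.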
In addition, we explored the differences between this model and other desert grassland feature classification models. The latest deep learning network models for hyperspectral grassland feature recognition, DIS-O [39], LGFEN [41], and GDIF-3D-CNN [50], were selected for a comparative study. To ensure experimental reliability, the structures and parameters of the selected models were kept the same as in the original studies. The experimental results are listed in Table 6. As shown in the table, all the models achieved accurate results on the classification task. Although the SL-CNN model was not the most time efficient, it showed the highest accuracy in classification and consistency testing, indicating that it had an appropriate model complexity, captured features more effectively, and possessed stronger generalization ability and robustness. The DIS-O model had the lowest classification accuracy, mainly because of its relatively simple structure, leading to an insufficient ability to extract grassland features. The DIS-O model was originally designed for a small number of classes, and increasing the number of classified categories leads to underfitting because its capacity is insufficient. By replacing 2D convolution with 3D convolution, GDIF-3D-CNN improved on the DIS-O model, indicating that 3D convolution helps to extract higher-level features; however, without further structural design, it still suffers from insufficient feature extraction capability. In contrast, the classification accuracy of the LGFEN model was only slightly lower than that of the SL-CNN model, which indicates that adding the CBAM attention mechanism helped to improve the recognition and classification of desert grassland features and further enhanced the robustness of the model built on a separately designed feature extraction module.
4.6. Data Visualization
A random set of sample data was selected for visualization and analysis to verify the optimized SL-CNN classification model and its practical classification performance. In addition, three grassland feature classification models, GDIF-3D-CNN, DIS-O, and LGFEN, were used to visualize the same samples for comparison. To highlight the real ground conditions behind the model classification, the portion of the RGB color image captured by the DJI Phantom 3 Pro UAV containing the experimental markers (mats and small flags) is displayed in Figure 15f. The visualization and local zoom results of the SL-CNN and comparison models are displayed in Figure 15b–e. Comparing the visualization results with the ground survey data revealed that the DIS-O model had the worst overall classification performance, the GDIF-3D-CNN model produced more pixel classification errors, and the LGFEN model misclassified more Stipa breviflora as Artemisia frigida. The predicted classification results of the SL-CNN model were the most consistent with the actual spatial distribution of the features and effectively retained their spatial characteristics. This shows that the model has a high generalization ability and can meet the classification needs of desert grassland vegetation taxa.
While the primary focus of this study is on the desert grasslands of Inner Mongolia, its findings can offer fresh perspectives for ecological and environmental studies on a global scale. The research could also provide valuable theoretical references for similar studies conducted in other regions, contributing to the understanding of desert grassland ecosystem functions.
5. Conclusions
The classification of desert grassland taxa is essential for studying the process of grassland desertification. In this study, we built a UAV hyperspectral remote sensing system to collect remote sensing images of desert grassland vegetation efficiently and precisely under natural light, compensating for the shortcomings of traditional grassland survey methods. We developed a lightweight 2D-CNN model, SL-CNN, for classifying desert grassland taxa, using an improved depth-separable convolution to ensure species classification accuracy and achieve convenient, rapid species monitoring. To prevent information redundancy in the hyperspectral data, we used a combination of variance and F-norm2 operations for feature band selection. We constructed the CBAM-F feature refinement module by improving the channel attention in the CBAM attention module and combined it with the RBC-F residual block feature enhancement module to improve the feature extraction capability and classification performance of the network model.
In this study, four important model parameters were optimized, the effects of different parameter values on classification performance were analyzed, and ablation experiments were conducted to verify the effectiveness of the building blocks. To demonstrate its advantages, the model was compared with the latest and most commonly used hyperspectral image classification models. The results showed that the OA, AA, and kappa values of this model, at 99.216%, 98.442%, and 98.735%, respectively, outperformed those of the other models, with the additional advantages of fewer parameters, relatively fast construction, and lower memory occupation. This study provides a new research method for monitoring the degradation of desert grassland features using UAV remote sensing technology.
However, desert grassland features are usually small and sparse, and the phenomena of "same object, different spectrum" and "same spectrum, different object" often occur in remote sensing images, which poses great difficulties for data annotation. Therefore, future research should address the effective classification and inversion of features using a small number of samples. In addition, the SL-CNN model needs further optimization to reduce its construction time and memory footprint for subsequent deployment on mobile terminals, which offers additional potential for practical applications.