Remote Sensing Monitoring of Grassland Locust Density Based on Machine Learning

The main aim of this study was to utilize remote sensing data to establish regression models through machine learning to predict locust density in the upcoming year. First, a dataset for monitoring grassland locust density was constructed based on meteorological data and multi-source remote sensing data in the study area. Subsequently, an SVR (support vector regression) model, BP neural network regression model, random forest regression model, BP neural network regression model with the PCA (principal component analysis), and deep belief network regression model were built on the dataset. The experimental results show that the random forest regression model had the best prediction performance among the five models. Specifically, the model achieved a coefficient of determination (R2) of 0.9685 and a root mean square error (RMSE) of 1.0144 on the test set, which were the optimal values achieved among all the models tested. Finally, the locust density in the study area for 2023 was predicted and, by comparing the predicted results with actual measured data, it was found that the prediction accuracy was high. This is of great significance for local grassland ecological management, disaster warning, scientific decision-making support, scientific research progress, and sustainable agricultural development.


Introduction
Grasslands are a crucial ecosystem in China, serving not only as a significant geographic barrier but also as the country's primary natural ecological defense line. Grasslands play a vital role in maintaining ecological balance and biodiversity. However, grassland locust plagues not only severely hinder the growth of grassland vegetation and the development of local pastoralism, but also bring substantial economic losses to local herders, affecting the healthy development of the region's pastoralism and grassland ecology [1]. Locusts primarily feed on Poaceae plants, such as wheat, rice, maize, and various pastures; leguminous and Cyperaceae plants; and some vegetables [2]. In the Inner Mongolia grasslands, the main locust species causing significant environmental damage include the Asian migratory locust, the short-winged locust, Acrida cinerea, and the large-winged locust. These locust populations have a significant negative impact on ecological balance. Locusts require exposed ground surfaces for egg-laying. In extensive pastoral and agropastoral areas, improper grassland management and excessive grazing pressure lead to overgrazing and grassland degradation, creating favorable conditions for the large-scale breeding of grassland locusts. Furthermore, grassland pest infestations exacerbate the degradation and desertification of grasslands. Combined with drought, reduced rainfall, land exposure, and a reduction in natural predators, these factors collectively contribute to a vicious cycle [3].
With the advancement of remote sensing technology, the methods used to monitor grassland locusts have shifted from traditional, time-consuming, and less accurate ground [...] and 60 mountain springs [7]. The climatic conditions in the study area are suitable for the survival and reproduction of grassland locusts, so the area has long been a disaster-prone zone for grassland locust infestations. The region mainly relies on agriculture and animal husbandry, so the disasters caused by grassland locusts have a significant impact on its economic development. Figure 1 demonstrates the specific location of the study area, which is located in Xiwuzhumuqin Banner, XilinGol League, Inner Mongolia Autonomous Region, China.

Grassland Locust Density Data
The sample grasshopper data used in this study were obtained from the Xiwuzhumuqin Banner Grassland Workstation. As a local forestry and grassland management unit, this institution is responsible for compiling annual data on grasshopper density. Grasshopper disaster survey data are collected from mid-May to early June each year, as the early control of grasshoppers in this region is crucial to ensure the normal operation of animal husbandry and protect it from the damage caused by grasshopper infestations.
The first step in the grasshopper density survey was to select the survey area, which was set at one square kilometer. Seventy percent of the areas with dense grasshopper activity and thirty percent of the areas with sparse grasshopper activity were selected to form a control group. The second step was to select survey sites, randomly and evenly choosing a certain number of sites within the survey area. The third step was sampling, using a one square meter enclosed container to cover the sampling points, ensuring no gaps between the container and the ground that would allow the grasshoppers to escape. Insecticide was then sprayed into the container and, after the grasshoppers died, they were collected and counted. The fourth step was statistics, where the total number of grasshoppers counted across the sampling points was divided by the number of sampling points to represent the average grasshopper density in the survey area. The central latitude and longitude of the survey area were assigned to the grasshopper density value.
The survey areas cover Xiwuzhumuqin Banner, following the standard pest survey procedures of the forestry and grassland department. This study collected grasshopper disaster data from 2021 and 2022, for a total of 160 sampling points, with 80 for each year. Figure 2 shows the locations of the grasshopper survey sites in 2021 and 2022.
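
The statistics step of the survey can be illustrated with a minimal sketch. It assumes that each sampling point covers one square meter, so the mean count per point is a density in individuals per square meter; the function name and the counts are hypothetical and are not taken from the survey data.

```python
def average_density(counts_per_point):
    """Mean grasshopper count per 1 m^2 sampling point in one survey area."""
    if not counts_per_point:
        raise ValueError("at least one sampling point is required")
    return sum(counts_per_point) / len(counts_per_point)


# Hypothetical counts from five 1 m^2 sampling points in one survey area.
counts = [12, 7, 0, 15, 6]
density = average_density(counts)
print(density)  # 8.0
```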

Meteorological Data
The meteorological data used in this study were sourced from the Xiwuzhumuqin Banner Meteorological Station. These data have a temporal resolution of one ten-day period and cover a range of metrics, including average temperature, precipitation, surface temperature, and soil moisture. The data span from January 2020 to December 2022. The meteorological data from the station were primarily used to cross-check the remote sensing meteorological data, aiming to enhance the accuracy and reliability of the overall dataset.

Multi-Source Remote Sensing Data
The daily 1 km all-weather land surface temperature dataset of China's mainland and surrounding areas has a temporal resolution of four times per day and a spatial resolution of 1 km. Data from 2020 to 2022 were selected, covering the spatial scope of China. The method used to prepare the dataset was the enhanced satellite thermal infrared remote sensing-reanalysis data integration method. The main input data of the method were Terra/Aqua MODIS LST products and GLDAS data, and the auxiliary data included the vegetation index and surface albedo provided by satellite remote sensing. The method fully utilized the high-frequency components, low-frequency components, and spatial correlation of land surface temperature provided by satellite thermal infrared remote sensing and reanalysis data and, finally, it reconstructed a high-quality all-weather land surface temperature dataset [8]. The dataset can be downloaded from the following website: https://data.tpdc.ac.cn/en/data/05d6e569-6d4b-43c0-96aa-5584484259f0/ (accessed on 18 February 2024).
The daily all-weather surface soil moisture dataset of China has a 1 km resolution (2003-2022) and was generated by downscaling the SSM (surface soil moisture), based on AMSR-E (Advanced Microwave Scanning Radiometer for EOS) and AMSR-2 (Advanced Microwave Scanning Radiometer 2) data, from a 36 km resolution to a 1 km resolution, significantly surpassing the well-known combined SMAP/Sentinel (active-passive microwave) SSM product at a 1 km resolution. It has a temporal resolution of 1 day and a spatial resolution of 1 km [9]. Data from 2020 to 2022 were downloaded. The dataset can be downloaded from the following website: https://data.tpdc.ac.cn/en/data/e1f24e35-6235-40b2-b3d7-677dfb249e39/ (accessed on 18 February 2024).
The Monthly Precipitation Dataset of China with a Resolution of 1 km under Multiple Scenarios and Modes for 2021-2100 collects monthly precipitation data in China under multiple scenarios and modes. The spatial resolution of this dataset is 0.0083333° (approximately 1 km), and the data selected covered the period from January 2020 to December 2022. The data are in NetCDF format. This dataset was generated by downscaling the global climate model dataset with a resolution of >100 km released by the IPCC Coupled Model Intercomparison Project Phase 6 (CMIP6) and the global high-resolution climate dataset published by WorldClim using the delta spatial downscaling scheme in China. The geospatial scope of the dataset covered the main land areas of China [10]. The dataset can be downloaded from the following website: https://data.tpdc.ac.cn/zh-hans/data/a9cd4a09-51a9-433b-9540-0376c6134cf6 (accessed on 18 February 2024).
The MYD13Q1 dataset is a part of MODIS (Moderate-Resolution Imaging Spectroradiometer) and is a global vegetation index (NDVI) product. This dataset provides important information about the status of surface vegetation. Global MYD13Q1 data are provided every 16 days with a spatial resolution of 250 m. The data were atmospherically corrected, removing interference caused by clouds, heavy aerosols, and cloud shadows. The dataset has reached validation stage 3, indicating that its quality and reliability have been rigorously evaluated by the scientific community. The MOD13Q1 product has the same specifications. The MOD13Q1 and MYD13Q1 datasets come from NASA's Terra and Aqua satellites, respectively, which have slightly different orbits and observation times; each provides vegetation index updates every 16 days, and combined they form a dataset with a temporal resolution of 8 days. By reasonably combining these two datasets, the temporal resolution could be increased, allowing for the more frequent monitoring of vegetation changes. We also selected data from 2020 to 2022.
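
How two 16-day series interleave into an 8-day series can be sketched as follows. This is an illustration only: it assumes that the Terra (MOD13Q1) composites start on day 1 of the year and the Aqua (MYD13Q1) composites on day 9, and it uses placeholder NDVI values; the function names are hypothetical.

```python
from datetime import date, timedelta


def composite_dates(year, first_doy, step=16):
    """Start dates of the 16-day composites in one year, beginning at day-of-year first_doy."""
    start = date(year, 1, 1)
    days_in_year = (date(year + 1, 1, 1) - start).days
    return [start + timedelta(days=doy - 1) for doy in range(first_doy, days_in_year + 1, step)]


def merge_series(terra, aqua):
    """Interleave two {date: ndvi} series into one list of (date, ndvi) sorted by date."""
    merged = {**terra, **aqua}
    return sorted(merged.items())


# Assumed phasing: Terra composites start on day 1, Aqua composites on day 9.
terra_dates = composite_dates(2021, first_doy=1)
aqua_dates = composite_dates(2021, first_doy=9)
# Placeholder NDVI values; real values would be read from the downloaded products.
terra = {d: 0.30 for d in terra_dates}
aqua = {d: 0.32 for d in aqua_dates}
series = merge_series(terra, aqua)
gaps = {(b[0] - a[0]).days for a, b in zip(series, series[1:])}
print(len(series), gaps)
```

Under these assumptions, consecutive entries of the merged series are 8 days apart within a year.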
The aforementioned downloaded remote sensing data covered the period from January 2020 to December 2022, representing a significant amount of data. Therefore, Figure 3 only shows random 1-day remote sensing data maps of the study area for several types of remote sensing data, including soil moisture data, precipitation data, land surface temperature data, and NDVI data.

Correlation Analysis between Meteorological Factors and Locust Density
This study took the locust density as the target dependent variable. A Pearson correlation analysis was conducted on the original environmental variables, including soil moisture, daytime land surface temperature, the Normalized Difference Vegetation Index (NDVI), cumulative precipitation, and night-time land surface temperature. In addition, the "random forest-Gini importance" (RF GI) method was employed to rank the importance of environmental factors. By combining these two methods, the input variables were selected, and the characteristic variables of the input dataset were determined.
Taking the correlation analysis between locust density in 2021 and each environmental factor from 1 January 2020 to 30 December 2021 as an example, Tables 2-5 show the correlation between locust density and the 10-day averages of each factor, along with the significance of each correlation: Table 2 covers (daytime/night-time) land surface temperature, Table 3 covers precipitation, Table 4 covers soil moisture, and Table 5 covers NDVI.
Table 5. The correlation between locust density and average NDVI in 10-day periods (columns: Correlation Parameter, Correlation, Significance).
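
The correlation screening described above can be sketched as follows, under stated assumptions. The data are synthetic (80 sites and 72 ten-day windows, with window 40 made informative on purpose), and `pearson_r` and `rank_windows` are hypothetical helper names rather than the procedure actually used in this study.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


def rank_windows(features, density):
    """Return (window index, r) pairs sorted by decreasing |r|.

    features: array of shape (n_sites, n_windows), one column per 10-day window.
    density:  array of shape (n_sites,), locust density at each site.
    """
    scores = [(j, pearson_r(features[:, j], density)) for j in range(features.shape[1])]
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


# Synthetic data standing in for 80 survey sites and 72 ten-day windows (two years).
rng = np.random.default_rng(0)
features = rng.normal(size=(80, 72))
density = 2.0 * features[:, 40] + rng.normal(scale=0.5, size=80)
ranking = rank_windows(features, density)
print(ranking[:3])
```

If a significance value is also needed, `scipy.stats.pearsonr` returns the coefficient together with a p-value.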
Taking the correlation analysis between locust density in 2021 and various meteorological factors from 1 January 2020 to 30 December 2021 as an example, Table 6 displays the random forest importance scores of environmental factor variables relative to locust density. Through the establishment of a random forest model, we analyzed the importance of each environmental factor at each 10-day time point relative to the locust density data. Only the higher-scoring entries are presented in Table 6; entries with low scores are omitted.
Specifically, the random forest constructs multiple decision trees and makes predictions by averaging or voting on these trees. During the construction of each tree, the algorithm evaluates candidate features at the splitting nodes and calculates the information gain of each feature. Information gain measures the reduction in uncertainty (impurity) of the sample response (in this case, locust density) after splitting using that feature; for a continuous response such as locust density, this impurity is typically quantified by the variance of the response within a node rather than by class entropy. Features with a higher information gain are considered more important for model prediction.
Therefore, in random forest regression, by calculating the average information gain of each feature across all trees, we could obtain the importance score of that feature in the model. These scores help us understand which environmental factor variables are the most critical for predicting locust density. Thus, in Table 6, "gain" is the average information gain of an environmental factor variable when it is used to split samples during tree construction, representing the contribution of that variable to improving the accuracy of predicting locust density in the random forest model. Similarly, the correlation analysis between the locust density data and various meteorological factors in 2022 revealed the same correlation characteristics as in 2021.
Finally, several important characteristic habitat factors were identified for the inversion of locust density: the average daytime land surface temperature in the last 10 days of February of the current year; the average daytime land surface temperature in the first 10 days of April of the current year; the average daytime land surface temperature in the middle 10 days of May of the current year; the average night-time land surface temperature in the middle 10 days of August of the previous year; the average night-time land surface temperature in the middle 10 days of January of the current year; the average precipitation in the first 10 days of December of the previous year; the average precipitation in the first 10 days of April of the current year; the average precipitation in the middle 10 days of June of the current year; the average soil moisture in the first 10 days of July of the previous year; the average soil moisture in the middle 10 days of October of the previous year; the average soil moisture in the middle 10 days of April of the current year; the average NDVI in the middle 10 days of August of the previous year; and the average NDVI in the middle 10 days of May of the current year.
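
To make the notion of "gain" at a single split concrete, the sketch below computes the impurity decrease of one candidate split, using variance as the impurity measure for a continuous response. This is an assumption made for illustration, not necessarily the exact criterion behind Table 6; the function name and the toy values are hypothetical.

```python
import numpy as np


def variance_reduction(x, y, threshold):
    """Impurity decrease from splitting samples at x <= threshold, with variance as impurity."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    left, right = y[x <= threshold], y[x > threshold]
    if left.size == 0 or right.size == 0:
        return 0.0
    weighted_child = (left.size * left.var() + right.size * right.var()) / y.size
    return float(y.var() - weighted_child)


# Toy example: density jumps when a (hypothetical) normalized temperature exceeds 0.5.
temperature = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.9])
density = np.array([2.0, 2.0, 2.0, 10.0, 10.0, 10.0])
good_split = variance_reduction(temperature, density, threshold=0.5)
poor_split = variance_reduction(temperature, density, threshold=0.15)
print(good_split, poor_split)
```

A feature importance score of this kind is then obtained by accumulating such decreases over all splits that use the feature and averaging over the trees.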

Deviation Normalization
Calculating the deviation (Min-Max) normalization helps to eliminate the impacts of different units or scales when dealing with data. Since the distribution of environmental data is not normal or contains outliers, deviation normalization may be more suitable. This scales all values to a range of 0 to 1, but it is sensitive to outliers, since the observed minimum and maximum set the scale. For models such as neural networks, the deviation normalization operation helps to accelerate and improve training. The normalization is calculated as follows:

$$Z = \frac{X_i - X_{\min}}{X_{\max} - X_{\min}} \qquad (1)$$

In the formula, $Z$ denotes the calculated value of the Min-Max normalization of the current environmental variable, $X_i$ denotes the current environmental variable at the time of operation, $X_{\min}$ denotes the minimum value of the current environmental variable at the time of operation, and $X_{\max}$ denotes the maximum value of the current environmental variable at the time of operation [11].
When analyzing the relationship between locust density and meteorological conditions, various meteorological variables, such as temperature and humidity, are involved. These variables have different units and ranges of values. Deviation normalization can convert all these variables to a unified scale (from 0 to 1), which helps to avoid certain variables dominating the model due to their larger numerical ranges [12].
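
A minimal sketch of column-wise Min-Max normalization is given below. The handling of constant columns (mapped to 0) is an assumption added to avoid division by zero, and the sample values and units are hypothetical.

```python
import numpy as np


def min_max_normalize(values):
    """Scale each column of a (n_samples, n_features) array to the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    col_min = values.min(axis=0)
    col_range = values.max(axis=0) - col_min
    # Constant columns carry no information; map them to 0 instead of dividing by zero.
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (values - col_min) / safe_range


# Hypothetical rows: [land surface temperature (K), soil moisture (m^3/m^3)].
raw = np.array([[270.0, 0.10], [280.0, 0.30], [290.0, 0.20]])
scaled = min_max_normalize(raw)
print(scaled)
```

After scaling, both columns span 0 to 1 regardless of their original units, so neither dominates through its numerical range alone.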

Construction of the Dataset
The method used to construct the dataset for this study was as follows: after preprocessing the original data, missing values were filled in, and the temporal and spatial resolutions were unified. Based on the conclusions of the previous section, the characteristic meteorological factor data used for inverting the locust density were obtained; the same feature data corresponding to different sample points were normalized via deviation normalization, and a dataset was formed, as shown in Table 7. In Table 7, "Serial Number" refers to the serial numbers of the 160 locust density survey sites in 2021 and 2022. "N0_01" indicates the average Normalized Difference Vegetation Index (NDVI) in the middle 10 days of August of the previous year. "T_01" represents the average soil moisture in the middle 10 days of April of the current year. "D1_01" stands for the average daytime land surface temperature in the middle 10 days of May of the current year. "D2_01" means the average night-time land surface temperature in the middle 10 days of August of the previous year. "J_01" signifies the average precipitation in the middle 10 days of June of the current year. "N0_02" indicates the average NDVI in the middle 10 days of May of the current year. "HCMD" represents the average density of locusts at the sample site.

Locust Density Inversion Model
This study divided the dataset into training and test sets, and it constructed models based on BP neural network regression combined with principal component analysis (PCA), random forest regression, BP neural network regression only, deep belief network regression, and support vector regression (SVR). Subsequently, the models underwent training and parameter optimization [13].
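
The split into training and test sets and the two evaluation metrics reported in this study (R² and RMSE) can be sketched as follows. The 80/20 split ratio, the random seed, and the function names are assumptions chosen for illustration; the ratio actually used is not specified here.

```python
import numpy as np


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into training and test index arrays."""
    order = np.random.default_rng(seed).permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


train_idx, test_idx = train_test_split_indices(160)
print(len(train_idx), len(test_idx))  # 128 32
print(rmse([1, 2, 3], [1, 2, 5]), r2_score([1, 2, 3], [1, 2, 5]))
```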
BP Neural Network Regression Based on Principal Component Analysis: Principal component analysis (PCA) has become a common method for handling high-dimensional data and simplifying datasets. The core purpose of this technique is to transform complex multi-dimensional data into a lower-dimensional subspace, while minimizing the overall loss of information in order to more effectively represent the original dataset. In meteorological data analyses, the PCA is particularly important, as factors such as rainfall, air humidity, and soil moisture often have close inter-relationships. By applying the PCA to transform these interrelated data, their dimensions can be reduced, thereby improving the efficiency of model training [14].
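
The dimensionality reduction step can be sketched with a singular value decomposition, as below. The data are synthetic (three strongly correlated columns standing in for related meteorological factors), and `pca_transform` is a hypothetical helper; the number of components retained in this study is not assumed.

```python
import numpy as np


def pca_transform(X, n_components):
    """Project X (n_samples, n_features) onto its first n_components principal axes."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ vt[:n_components].T, explained[:n_components]


rng = np.random.default_rng(1)
base = rng.normal(size=(160, 1))
# Three strongly correlated synthetic columns built from one shared signal plus noise.
X = np.hstack([base + 0.05 * rng.normal(size=(160, 1)) for _ in range(3)])
scores, ratio = pca_transform(X, n_components=1)
print(scores.shape, ratio)
```

Because the three columns share one underlying signal, a single component captures almost all of the variance, which is the situation in which PCA shortens the input without losing much information.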
The BP (backpropagation) neural network, inspired by the human brain's response mechanism, is a type of multi-layer, fully connected network primarily used for data fitting and classification [15]. It consists of three key components: the input layer, hidden layers, and the output layer. Neurons, as the fundamental units of the network, facilitate signal transmission between these layers. With the help of internal activation functions in the neurons, the BP neural network can approximate a variety of complex non-linear functions. The workflow of the BP neural network is as follows: signals propagate forward from the input layer, passing through multiple hidden layers, where the signal undergoes complex processing before reaching the output layer [16]. The data at the output layer are compared with the target data, generating an error value. If the current weights and thresholds do not produce the desired output, the error information will propagate back along the same path; that is, it backpropagates to each corresponding neuron, adjusting the weights and thresholds. This process repeats until the network output error falls within an acceptable range, completing the training process [17]. A model of the BP neural network is illustrated in Figure 4.
The calculation formula for the nodes in the hidden layer in the diagram is given as formula (2), in which $n$ represents the number of nodes, $\varphi$ is the activation function, $\omega_i$ denotes the parameter weights for the i-th layer, and $b_i$ is the bias for the i-th layer. Combining the PCA and BP neural network for a regression analysis helps to reduce the risk of overfitting and enhances generalization to unseen data by eliminating noise and irrelevant variables from the data. However, it also comes with disadvantages [18]. The dimensionality reduction process may discard some components that are crucial for prediction, and it reduces the interpretability of the model. Combining these two techniques also implies the need to adjust and optimize more parameters, potentially complicating the model training and optimization processes.
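
The forward and backward passes of a BP network can be sketched with a single hidden layer, as below. This is a minimal illustration and not the network configured in this study: the tanh activation, the hidden layer size, the learning rate, the number of epochs, and the synthetic data are all assumptions, and no PCA step is included.

```python
import numpy as np


def train_bp(X, y, hidden=8, lr=0.05, epochs=3000, seed=0):
    """Train a one-hidden-layer regression network with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=(hidden, 1))
    b2 = np.zeros(1)
    y = y.reshape(-1, 1)
    for _ in range(epochs):
        # Forward pass: input -> hidden layer (tanh) -> linear output.
        h = np.tanh(X @ W1 + b1)
        out = h @ W2 + b2
        # Gradient of the mean squared error with respect to the output.
        grad_out = 2.0 * (out - y) / len(X)
        # Backward pass: propagate the error to each layer's weights and biases.
        grad_W2 = h.T @ grad_out
        grad_b2 = grad_out.sum(axis=0)
        grad_h = (grad_out @ W2.T) * (1.0 - h**2)
        grad_W1 = X.T @ grad_h
        grad_b1 = grad_h.sum(axis=0)
        W1 -= lr * grad_W1
        b1 -= lr * grad_b1
        W2 -= lr * grad_W2
        b2 -= lr * grad_b2
    return lambda X_new: (np.tanh(X_new @ W1 + b1) @ W2 + b2).ravel()


# Synthetic non-linear target on two normalized inputs.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(160, 2))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2
predict = train_bp(X, y)
mse = float(np.mean((predict(X) - y) ** 2))
print(mse, float(np.var(y)))
```

Comparing the training error with the variance of the target gives a quick check that the network has learned more than the mean of the response.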
Random Forest Regression: The random forest regression algorithm employs an ensemble method consisting of numerous independently constructed decision trees. The core process of this algorithm includes the following: firstly, the generation of multiple different training samples and attribute subsets by repeatedly sampling the original dataset with replacement; secondly, the construction of a decision tree for each sample and attribute subset; and, finally, the derivation of the final prediction value by voting or taking the weighted average of the predictions from these decision trees [19]. Compared with other machine learning techniques, a significant advantage of a random forest is its ensemble learning characteristic. A random forest can usually avoid the overfitting problem that might occur in a single decision tree, thereby improving generalizability to new data, as well as possessing good noise resistance. Moreover, a random forest maintains an efficient training speed, even when handling large datasets; it can process high-dimensional data without the need for feature selection; and it can provide assessments of the impact of each feature on prediction results, offering some basis for model interpretation [20]. In this study, the model used cross-validation. A schematic diagram of the random forest regression model is shown in Figure 5.
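
A random forest regression with cross-validation can be sketched with scikit-learn, as below. This is an illustration under assumptions: the data are synthetic (160 samples and 13 features, mirroring the dataset dimensions only), and the number of trees, the 5-fold scheme, and the random seeds are arbitrary choices rather than the settings used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for 160 sites with 13 normalized habitat factors.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(160, 13))
y = 10.0 * X[:, 0] + 5.0 * X[:, 3] ** 2 + rng.normal(scale=0.5, size=160)

# Assumed settings: 200 trees and 5-fold cross-validation scored by R^2.
model = RandomForestRegressor(n_estimators=200, random_state=0)
cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2")

# Fit on all samples to inspect the impurity-based feature importances.
model.fit(X, y)
top_feature = int(np.argmax(model.feature_importances_))
print(cv_r2.mean(), top_feature)
```

The `feature_importances_` attribute provides the per-feature assessment mentioned above; in this synthetic setup, the first column dominates because the target was constructed that way.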
BP Neural Network Regression: In this model, BP neural network regression is used independently, without the implementation of the principal component analysis. BP neural networks are capable of capturing and modeling complex non-linear relationships, which is extremely valuable for complex datasets that are difficult to handle with linear models. BP neural networks can effectively predict unseen data, demonstrating good generalization capabilities.
Deep Belief Network Regression: Deep belief networks (DBNs) are a type of deep learning model composed of multiple layers of generative models, typically stacked Restricted Boltzmann Machines (RBMs). Each RBM layer learns representations of the data at a different level of abstraction. DBNs first use unsupervised learning to pre-train the network layer by layer and are then fine-tuned through supervised learning. DBNs are capable of automatically learning complex, high-level feature representations of data, which is particularly important in fields such as image and speech recognition [21]. DBNs generally demonstrate good generalization performance across a variety of tasks. Figure 6 shows a structural diagram of the deep belief network (DBN) model, which includes three stacked RBM layers and one BP layer. The RBM layers provide the preliminary pre-training of the network, and the BP layer performs the supervised fine-tuning, thereby achieving comprehensive training of the model [22].
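The stacked structure described above can be illustrated with a forward pass through three RBM layers followed by a single linear output unit. This is only a sketch: the layer sizes, the random weights (standing in for parameters that pre-training and fine-tuning would learn), the input values, and all function names are assumptions, and no training is performed.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_probabilities(v, W, b):
    """P(h_j = 1 | v) for one RBM: sigmoid(b_j + sum_i v_i * W[i][j])."""
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]

def random_layer(n_visible, n_hidden, rng):
    """Small random weights stand in for parameters learned in pre-training."""
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_hidden)] for _ in range(n_visible)]
    b = [0.0] * n_hidden
    return W, b

def dbn_forward(x, layers, w_out, b_out):
    """Pass the input up through the stacked RBMs, then apply a linear output unit."""
    activation = x
    for W, b in layers:
        activation = hidden_probabilities(activation, W, b)
    return b_out + sum(w * a for w, a in zip(w_out, activation))

rng = random.Random(0)
sizes = [4, 8, 6, 3]  # 4 inputs, then three RBM hidden layers (hypothetical sizes)
layers = [random_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
w_out, b_out = [1.0, 1.0, 1.0], 0.0
features = [0.2, 0.5, 0.1, 0.9]  # normalized habitat factors (made-up values)
prediction = dbn_forward(features, layers, w_out, b_out)
```

The hidden-unit probabilities of one RBM serve as the visible input of the next, which is what allows each layer to represent the data at a higher level of abstraction.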
In this study, the RBM receives the data vector transmitted from the bottommost layer through the visible layer. The input vector is transformed by an activation function and passed to the hidden layer and, through training, the internal energy function is minimized [23]. Given visible units v_i, hidden units h_j, and their connection weights W_i,j (a matrix of size n_v × n_h), as well as the offset a_i for v_i and the bias weight b_j for h_j, the energy function E(v, h) is defined by Equation (3). From the energy function E(v, h), the probability distribution P(v, h) of the visible and hidden layers can be expressed as in Equations (4) and (5), where Z denotes the normalization factor. The probability distribution P_θ(v) of the observed data v, which corresponds to the marginal distribution of P_θ(v, h), is referred to as the likelihood function, as shown in Equation (6). Equation (7) represents the vector obtained by removing component h_k from h, and it is substituted into Equations (8) and (9).
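For reference, a binary RBM with the notation above is usually written in the following standard form. This is a sketch of the textbook definitions, assuming θ = {W, a, b}; it is not a transcription of the numbered equations of this paper, and the auxiliary symbols α_k, β, and σ are introduced here for illustration only.

```latex
\begin{aligned}
E(v,h) &= -\sum_{i=1}^{n_v} a_i v_i - \sum_{j=1}^{n_h} b_j h_j
          - \sum_{i=1}^{n_v}\sum_{j=1}^{n_h} v_i W_{i,j} h_j, \\
P_\theta(v,h) &= \frac{1}{Z}\, e^{-E(v,h)}, \qquad Z = \sum_{v,h} e^{-E(v,h)}, \\
P_\theta(v) &= \sum_{h} P_\theta(v,h) = \frac{1}{Z}\sum_{h} e^{-E(v,h)}, \\
h_{-k} &= (h_1,\dots,h_{k-1},h_{k+1},\dots,h_{n_h})^{\mathsf T}, \\
\alpha_k(v) &= b_k + \sum_{i} v_i W_{i,k}, \qquad
\beta(v,h_{-k}) = \sum_{i} a_i v_i + \sum_{j\neq k} b_j h_j
                  + \sum_{i}\sum_{j\neq k} v_i W_{i,j} h_j, \\
E(v,h) &= -\beta(v,h_{-k}) - h_k\,\alpha_k(v), \\
P(h_k = 1 \mid v) &= \sigma\big(\alpha_k(v)\big), \qquad
P(v_k = 1 \mid h) = \sigma\Big(a_k + \sum_{j} W_{k,j} h_j\Big), \qquad
\sigma(x) = \frac{1}{1+e^{-x}}.
\end{aligned}
```

Separating the terms that contain h_k from those that do not is what reduces each conditional probability to a sigmoid of a weighted sum.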
Sensors 2024, 24, 3121
The energy function simplifies to Equation (10), and the solution for the likelihood function is derived as shown in Equations (11) and (12). The activation probability formula for Restricted Boltzmann Machines (RBMs) is the sigmoid function. This function maps the entire range (−∞, +∞) to values between 0 and 1, allowing the activation probabilities of the respective nodes to be computed. When the activation status of all neural units in the visible layer (or hidden layer) is known, the activation probabilities of the hidden layer (or visible layer) neurons can be inferred. This involves calculating P(h_k = 1|v) and P(v_k = 1|h). The unknown RBM parameters W, a, and b can be determined through unsupervised learning [24].
SVR Model: Support vector regression (SVR) is a regression method based on support vector machines (SVMs). In traditional SVMs, the goal is to find a decision boundary that maximizes the margin between different classes of data points. In SVR, this concept is applied to regression problems, i.e., predicting a continuous value rather than a class [25]. SVR allows an "epsilon margin" to be set within the model, which defines the acceptable error between predicted and actual values. This approach helps to control the model's generalization ability and the risk of overfitting. SVR is robust against outliers and noise [26]. The model primarily relies on support vectors (i.e., data points near the boundary) rather than all data, making it less sensitive to outliers [27]. SVR can effectively handle data in high-dimensional feature spaces, working well even when the number of features exceeds the number of samples [28].
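The role of the epsilon margin can be illustrated with the epsilon-insensitive loss that SVR is commonly formulated around: residuals smaller than epsilon cost nothing, and larger ones are penalized only by the amount that exceeds epsilon. The sketch below assumes made-up densities and an illustrative epsilon of 1.0; the function name is hypothetical.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Mean epsilon-insensitive loss: errors inside the margin cost nothing."""
    losses = [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

actual = [50.0, 55.0, 60.0, 70.0]     # made-up locust densities
predicted = [50.5, 54.0, 63.0, 69.8]  # made-up model outputs
loss = epsilon_insensitive_loss(actual, predicted, epsilon=1.0)
```

Only the third point, whose residual of 3.0 exceeds the margin, contributes to the loss. Points like this one, on or outside the margin, are the ones that become support vectors.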

Evaluation Criteria
In this study, BP neural network regression combined with principal component analysis, random forest regression, BP neural network regression only, deep belief network regression, and SVR models based on principal component analysis were applied to build models and compare their performance on grassland locust monitoring data. The habitat characterization dataset was input into these five models to predict grassland locust density in 2021. Subsequently, the predicted values were compared with the actual values in the test set and analyzed using scatter plots. The horizontal coordinates of each scatterplot represent the actual locust density in the test set, and the vertical coordinates represent the values predicted by the model. The diagonal line in the plot is a 1:1 line, indicating exact agreement between the predicted and actual values. The closer the sample points are to the 1:1 line, the smaller the difference between predicted and actual values and, thus, the better the prediction performance of the model. A point lies above the 1:1 line if the predicted value is higher than the actual value and below it if the predicted value is lower. Two indicators are used to validate the effectiveness of a model: the coefficient of determination (R²) and the root mean square error (RMSE). The coefficient of determination (R²) is a key indicator of how well the regression model fits the sample data. The closer the value of R² is to 1, the better the model fits the data, which means that the model can better explain the variation in the data. The RMSE measures the magnitude of the deviation between predicted and actual values; the lower it is, the smaller the prediction error.
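Both indicators can be computed directly from paired actual and predicted values. The sketch below uses the usual definitions, R² = 1 − SS_res/SS_tot and RMSE = sqrt(mean squared error), on made-up numbers; the function names are illustrative.

```python
import math

def rmse(actual, predicted):
    """Root mean square error of the predictions."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [40.0, 50.0, 60.0, 70.0]     # made-up test-set densities
predicted = [42.0, 49.0, 61.0, 68.0]  # made-up model outputs
model_rmse = rmse(actual, predicted)
model_r2 = r_squared(actual, predicted)
```

Here the squared residuals sum to 10 and the total sum of squares is 500, so R² is 0.98 and the RMSE is the square root of 2.5, about 1.58 density units.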

Discussion of Results
Figure 7 shows the results of the BP neural network regression combined with principal component analysis. The coefficient of determination is 0.8718, a relatively high R² value, which implies that the model explains most of the variance in the data. However, its overall performance is poorer than expected, and it is the weakest of the five models compared in Table 8. The root mean square error (RMSE) of 2.0476 indicates that the model's predictions deviate from the actual observations by about 2.0476 units on average. The scatterplot shows that most of the data points are distributed close to the ideal line, indicating that the predicted values are close to the actual values. Data points with actual values between 45 and 60 appear to be predicted more accurately, because these points are more compactly distributed around the ideal line. For actual values exceeding 60, the model appears to slightly overestimate the actual results, as most data points lie above the ideal line. The model uses principal component analysis (PCA) to reduce the dimensionality of the high-dimensional habitat factor data and passes the resulting components as inputs to a backpropagation (BP) neural network in order to predict locust density. PCA removes noise and redundancy from the data and extracts the most important features, and BP neural networks are an effective non-linear regression method commonly used in complex pattern recognition and prediction problems. Overall, the model shows a good predictive ability, especially for moderate locust densities. However, for high-density areas, the model may need further tuning to improve prediction accuracy. A possible reason is that some information, such as meteorological factors with a low correlation, may be discarded during the PCA.
Figure 8 shows the validation results of the SVR (support vector regression) model. The root mean square error (RMSE) is 1.8487, which indicates the average deviation between the model's predictions and the actual values. A lower RMSE signifies smaller prediction errors and, thus, this model's RMSE indicates a relatively good predictive accuracy. The coefficient of determination (R²) is 0.8955, suggesting that the model accounts for a significant portion of the data's variability. This is an improvement over the PCA-BP neural network regression model, meaning that the predictions of this model are more accurate than those of the previous one. However, the data points in certain areas of the graph show a degree of dispersion, indicating a decrease in predictive accuracy in these regions.
Figure 9 presents the results of using a BP neural network only for regression. The figure consists of two parts: the loss curve of the training process and the comparison between actual and predicted values. The validation loss (orange line) starts high and then decreases rapidly, indicating that the model learns quickly during the initial phase. After several training epochs, it stabilizes, suggesting that the model achieves a low error rate without showing signs of significant overfitting or underfitting, as the validation loss does not start increasing but instead remains consistent with the training loss. The coefficient of determination (R²) is 0.9158, indicating that the variability predicted by the model is highly correlated with the variability of the actual data, and the model can explain 91.58% of the data's variability. The root mean square error (RMSE) is 2.0178, signifying that the average deviation between the model's predictions and the actual observed values is 2.0178 units, which is a relatively small error and thus indicates high predictive accuracy. Most data points are tightly clustered around the dashed line, which represents a good prediction scenario. The distribution of data points suggests that the predictions are generally very close to the actual values, especially within the middle range. However, for some lower actual values, the model's predictions appear to be slightly worse.
Figure 11 shows the comparison of the predicted values of the random forest regression model and the actual values. The root mean square error (RMSE) of this model's validation results is 1.0144, which is low compared with that of the other four models, indicating that the average deviation between the model's predictions and actual values is only 1.0144 units. In regression models, this is a good indicator of high predictive accuracy. The coefficient of determination (R²) is 0.9685, a value higher than that of the other four predictive models, suggesting a very high correlation between the model's predictions and actual data. The model can explain 96.85% of the variability in the actual data. This demonstrates that the random forest regression method, using environmental variables, can successfully predict grassland locust density. Although the overall performance is excellent, there are slight deviations between several predicted points and the ideal prediction line. These residual errors may be due to outliers or to the model's inability to fully capture all relevant factors. The outliers include large errors in the meteorological factors and significant errors in the locust density data.
Table 8 provides a comparative analysis of the accuracy of the five models. The table shows that the random forest regression model performs the best, followed by the deep belief network regression model. The BP neural network regression and SVR models show moderate performance, while the PCA-BP neural network regression model has relatively lower performance. Random forest and deep belief networks are more effective in handling and learning the complex non-linear relationships present in habitat factor data. Based on the comprehensive analysis above, after comparing the predictive effectiveness of these five methods on the test set, it is concluded that random forest regression is more effective in extracting features of environmental variables at grassland locust sample points, which makes it more accurate in predicting the distribution of grassland locust density.
As shown in Figure 12, a distribution map of locust density was derived from the inversion of the trained random forest model, reflecting the distribution of locust density in the region in 2023. As can be seen in the figure, locust density in the southwest and northeast of the region is relatively high, while locust density in the middle and southeast is relatively low, which is consistent with actual survey results in previous years.
Figure 13 shows the specific errors of the random forest model. Since the root mean square error (RMSE) of the random forest model was 1.01, all points for which the predicted value differed from the actual value by more than 1 were considered points with large errors. A total of 229 points were verified in this figure, of which 42 points had relatively large errors, so the proportion of points with smaller errors was 82%. Moreover, a data analysis found that the error rate of the points with a locust density of 70 was relatively high, reaching 50%, and the error rate of the points with a locust density of 60 also reached 45%. The error rate of the points with a locust density of 55 was 31%, and the error rate of the points with a locust density of 40 was 46%. Errors in the inversion results for other locust densities accounted for a smaller proportion. It can be seen in the figure that these error points are approximately evenly distributed.
As shown in Figure 14, the random forest inversion model reproduces the curve of actual values well, but there are a few points with large errors, which may be caused by errors in the input data. It seems that the random forest model can more accurately analyze the importance of environmental variables to locust density, thus achieving higher inversion accuracy [29]. The experiment in this paper is currently limited by a small number of sampling points. In the future, we expect to incorporate data such as slope, soil type, above-ground biomass, and altitude. Among them, slope contributes significantly to egg-stage precipitation. Vegetation coverage can also be included, as it also determines the occurrence of grassland locusts. All these environmental factors constitute the habitat preferences of locusts, among which surface temperature during the egg stage, NDVI, soil moisture, and nymph-stage precipitation are significant factors affecting the density of grassland locusts [30].
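The error analysis described for Figure 13 can be sketched as below: a point is flagged when the absolute difference between predicted and actual density exceeds a threshold of 1, and error rates are then grouped by actual density. The data are made up for illustration and the function name is hypothetical; only the threshold rule comes from the text.

```python
from collections import defaultdict

def large_error_summary(actual, predicted, threshold=1.0):
    """Count large errors, the share of small errors, and error rates per density."""
    flagged = [abs(a - p) > threshold for a, p in zip(actual, predicted)]
    per_density = defaultdict(lambda: [0, 0])  # density -> [flagged count, total count]
    for a, is_large in zip(actual, flagged):
        per_density[a][1] += 1
        per_density[a][0] += int(is_large)
    rates = {d: n_bad / n for d, (n_bad, n) in per_density.items()}
    n_large = sum(flagged)
    return n_large, 1.0 - n_large / len(flagged), rates

actual = [40, 40, 55, 60, 60, 70, 70, 70]                        # made-up densities
predicted = [40.3, 42.1, 55.4, 60.2, 58.5, 70.9, 72.0, 68.7]     # made-up predictions
n_large, share_small, rates = large_error_summary(actual, predicted)
```

With the counts reported above, 1 − 42/229 ≈ 0.817, which rounds to the stated 82% of points with smaller errors.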

Conclusions
This study focuses on the core needs of early warning and monitoring of grassland locust plagues. The high-locust-infestation areas in Xiwuzhumuqin Banner were selected

Figure 1. Map of the Study Area.

Figure 2. Map of Locust Survey Points in 2021 and 2022.

Figure 3. Remote sensing data: (A) soil moisture data; (B) precipitation data; (C) land surface temperature data; and (D) NDVI data (Normalized Difference Vegetation Index data).

model performance because it maintains the distribution of the data. The formula for calculating deviation normalization is shown in formula (1): Z

Figure 4. Schematic Diagram of the BP Neural Network Model.

Figure 5. Schematic Diagram of the Random Forest Regression Model.

Figure 6. Schematic Diagram of the Deep Belief Network Regression Model.

Figure 7. Comparison of Predicted and Actual Values in the PCA-BP Neural Network Regression Model.

Figure 8. Comparison of Predicted and Actual Values in the SVR Regression Model.

Figure 9. Results of the Backpropagation Neural Network: (a) variation of the backpropagation loss function; (b) comparison between predicted values and actual values in the Backpropagation Neural Network.
The results of the deep belief network (DBN) regression model are illustrated in Figure 10. The root mean square error (RMSE) is 1.4986, indicating that, on average, the deviation between the model's predicted values and actual values is about 1.4986 units. This relatively low RMSE value suggests that the model has high accuracy in predicting locust density. The coefficient of determination (R²) is 0.9314, meaning that the model's predicted values explain 93.14% of the variance in the actual values, indicating strong predictive performance. Each blue dot pairs an actual observed value with its predicted value, and the dots are generally distributed along the red dashed line, demonstrating a good match between the model's predictions and the actual situation. This indicates that the DBN model is effective in capturing the relationship between input features and locust density. While the model generally performs well, some data points still deviate significantly from the ideal prediction line, and most predicted values are somewhat lower than the actual values, suggesting that the model may not predict perfectly in certain scenarios. Compared with the BP neural network, the deep belief network exhibits superior predictive performance. The deep belief network is a generative model composed of multiple layers of Restricted Boltzmann Machines (RBMs) and possesses powerful feature extraction capabilities. By learning the representation of data layer by layer, the DBN can capture complex structures and patterns within the data. This gives the DBN an advantage in handling highly non-linear and high-dimensional data, potentially resulting in a higher prediction accuracy in inversion regression tasks.

Figure 10. Comparison of Predicted and Actual Values in the Deep Belief Network Regression Model.

Figure 11. Comparison of Predicted and Actual Values in the Random Forest Regression Model.

Figure 12. The inversion map of locust density in Xiwuzhumuqin Banner in 2023.

Figure 13. Distribution map of locust density inversion errors.

Figure 14 shows a comparison between the actual and predicted values of the inversion results of locust density in Xiwuzhumuqin Banner in 2023. A total of 229 sample points were verified, and the actual values were sourced from the grassland monitoring station in Xiwuzhumuqin Banner.

Figure 14. Fitting curve of predicted values and actual values.

Table 1 below shows the acquisition time of the data used in this study.

Table 2. The correlation between locust density and 10-day average (daytime/night-time) land surface temperature.

Table 3. The correlation between locust density and average precipitation in 10-day periods.

Table 4. The correlation between locust density and average soil moisture in 10-day periods.

Table 6. The random forest-gain importance score of environmental factor variables relative to locust density.

Table 7. Dataset for the locust density inversion model.

Table 8. Comparison of Model Accuracies.
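The R² and RMSE values reported for each model in the discussion of results can be collected and ranked as in the sketch below. The numbers are those stated in the text; the dictionary layout and variable names are illustrative, and the body of Table 8 itself is not reproduced here.

```python
# (R², RMSE) on the test set, as reported in the discussion of results.
reported = {
    "PCA-BP neural network": (0.8718, 2.0476),
    "SVR": (0.8955, 1.8487),
    "BP neural network": (0.9158, 2.0178),
    "Deep belief network": (0.9314, 1.4986),
    "Random forest": (0.9685, 1.0144),
}

# Lower RMSE is better; higher R² is better.
by_rmse = sorted(reported, key=lambda name: reported[name][1])
by_r2 = sorted(reported, key=lambda name: reported[name][0], reverse=True)

for name in by_rmse:
    r2, rmse = reported[name]
    print(f"{name:<24} R2={r2:.4f}  RMSE={rmse:.4f}")
```

Both orderings place random forest first, the deep belief network second, and the PCA-BP model last. The SVR and BP models swap places depending on which metric is used, which is consistent with both being described as moderate.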
